Learning Preference and Structured Data: Theory and Applications

UNIVERSITÀ DEGLI STUDI DI FIRENZE
Facoltà di Ingegneria – Dipartimento di Sistemi e Informatica
Dottorato di Ricerca in
Ingegneria Informatica e dell’Automazione
XVII Ciclo
Learning Preference
and Structured Data:
Theory and Applications
Sauro Menchetti
Dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in Computer Science and Control Engineering
Ph.D. Candidate
Sauro Menchetti
Ph.D. Coordinator
Prof. Edoardo Mosca
Advisor
Prof. Paolo Frasconi
Academic Year 2004–2005
To my parents and to Francesca
Florence, December 31st, 2005
Abstract
This dissertation deals with the theory of learning preferences and structured data, and with its applications to natural language processing and computational molecular biology.
From a theoretical point of view, a new and unpublished interpretation of the voted perceptron algorithm in the dual space is provided, including an on–line update rule and an upper bound for the dual variables. Accordingly, a novel formulation of regularization theory for this algorithm is devised.
A further new theoretical analysis, based on a partial order model of preference and ranking problems, explains why a setwise loss function that directly tackles the problem exhibits better performance than a pairwise loss function based on a utility function. In the context of preference learning, we report applications to two large scale problems involving learning a preference function that selects the best alternative in a set of competitors: reranking parse trees generated by a statistical parser and predicting first pass attachment under a strong incrementality hypothesis. We compare convolution kernels and recursive neural networks, two effective approaches to the investigated problems, finding that the choice of the loss function plays an essential role.
A novel, general and computationally efficient family of kernels on discrete data structures, called weighted decomposition kernels, is developed within the general class of decomposition kernels. We report experimental evidence that the proposed family of kernels is highly competitive with respect to more complex and computationally demanding state–of–the–art methods on a set of practical problems in bioinformatics involving protein sequence and molecule graph classification.
Finally, we tackle the prediction of zinc binding sites and zinc binding proteins, a problem that is still little explored in the machine learning community. We propose an ad–hoc remedy to the problem of autocorrelation between residues that are close in sequence. This approach leads to a significant improvement in prediction performance by modelling the linkage between examples in such a way that sequentially close pairs of candidate residues are classified as being jointly involved in the coordination of a zinc ion. We develop a kernel for this particular type of data that can handle variable length gaps between candidate coordinating residues.
Keywords: Structured Data, Preference Learning, Kernel Machines.
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
List of Algorithms
Acknowledgements
Notation

Introduction
  Machine Learning and Artificial Intelligence
  Taxonomy of Machine Learning Algorithms
  Structured Data
  Kernel Machines
  Overview of the Thesis
  Sources of the Thesis
  Other Projects

I  Learning Structured Data

1  Statistical Learning Theory
  1.1  Statistical Learning Theory
       1.1.1  The Supervised Learning Problem
       1.1.2  Loss Function
       1.1.3  Risk Functionals
       1.1.4  Empirical Risk Minimization
       1.1.5  Regularization Theory
       1.1.6  Sample and Approximation Error
  1.2  Mathematical Foundations of Kernels
       1.2.1  Euclidean and Hilbert Spaces
       1.2.2  Mercer's Theorem
       1.2.3  Reproducing Kernel Hilbert Spaces
       1.2.4  Representer Theorem
  1.3  Support Vector Machines
       1.3.1  SVMs and the Regularization Theory
       1.3.2  Primal and Dual Formulations
       1.3.3  SVMs for Regression
       1.3.4  Support Vector Clustering
       1.3.5  Complexity
       1.3.6  Multiclass Classification

2  Voted Perceptron Algorithm
  2.1  Voted Perceptron
       2.1.1  Training Algorithm
       2.1.2  Prediction Function
  2.2  Dual VP
  2.3  Regularization
  2.4  Complexity
  2.5  Loss Function

3  Processing Structured Data
  3.1  Basic Kernels
  3.2  Constructing Kernels
  3.3  Normalizing the Kernel
  3.4  Kernels for Discrete Objects
  3.5  Kernels for Strings
       3.5.1  Spectrum Kernel
       3.5.2  Mismatch String Kernel
       3.5.3  String Subsequence Kernel
       3.5.4  Weighted String Kernel
       3.5.5  Dynamic Time–Alignment Kernel
       3.5.6  Dynamic Alignment Kernel
       3.5.7  Marginalized Kernel
       3.5.8  Marginalized Count Kernel
       3.5.9  The Fisher Kernel
  3.6  Kernels for Trees
       3.6.1  Parse Tree Kernel
       3.6.2  String Tree Kernel
       3.6.3  Label Mutation Elastic Structure Tree Kernel
  3.7  Kernels for Graphs
       3.7.1  Subgraph Kernel
       3.7.2  Frequent Subgraphs Kernel
       3.7.3  Cyclic Pattern Kernel
       3.7.4  Marginalized Graph Kernel
       3.7.5  Extended Marginalized Graph Kernel
       3.7.6  A Family of Kernels for Small Molecules
       3.7.7  Synchronized Random Walks Kernel
       3.7.8  Walk Based Kernels
       3.7.9  Tree–structured Pattern Kernel
       3.7.10 Basic Terms Kernel
  3.8  Recursive Neural Networks

II  Preference Learning

4  Preference Learning in Natural Language Processing
  4.1  Introduction
  4.2  An Introduction to the Parsing Problem
  4.3  Ranking and Preference Problems
       4.3.1  The Utility Function Approach
       4.3.2  Recursive Neural Networks Preference Model
       4.3.3  Kernel Ranking and Preference Model
       4.3.4  Cancelling Out Effect
  4.4  Preference Model for SVMs and VP
       4.4.1  SVMs and Preference Model
       4.4.2  VP and Preference Model
              4.4.2.1  Dual VP and Preference Model
  4.5  Applications to Natural Language
       4.5.1  The First Pass Attachment
              4.5.1.1  Tree Reduction and Specialization
       4.5.2  The Reranking Task
  4.6  Experimental Results
       4.6.1  First Pass Attachment
       4.6.2  Reranking Task
       4.6.3  The Role of Representation
       4.6.4  Comparing Different Preference Loss Functions
  4.7  Conclusions

5  On the Consistency of Preference Learning
  5.1  Introduction
  5.2  The Bayes Function
  5.3  A New Model of Preference and Ranking
       5.3.1  The Partial Order Model
       5.3.2  The 0–1 Loss Function
       5.3.3  Three Approaches for the Partial Order Model
  5.4  A Comparison of the Three Approaches
       5.4.1  The Direct Model
       5.4.2  The Utility Function Model
       5.4.3  The Pairwise Model
  5.5  Dependence on Size of Set of Alternatives
       5.5.1  Ranking Two Alternatives
       5.5.2  Ranking k Alternatives
  5.6  Conclusions

III  Structured Kernels for Molecular Biology

6  Weighted Decomposition Kernel
  6.1  Introduction
  6.2  Decomposition Kernels
       6.2.1  Equivalence of Tensor Product and Direct Sum
       6.2.2  All–Substructures Kernels
  6.3  Weighted Decomposition Kernels
       6.3.1  Data Types
       6.3.2  Graph Probability Distribution Kernels
       6.3.3  General Form of WDKs
       6.3.4  A WDK for Biological Sequences
       6.3.5  A WDK for Molecules
  6.4  Algorithms and Complexity
       6.4.1  Indexing and Sorting Selectors
       6.4.2  Computing Histograms
              6.4.2.1  Sequences Histograms Computation
              6.4.2.2  Trees Histograms Computation
              6.4.2.3  DAGs Histograms Computation
       6.4.3  Reducing HIK Complexity
       6.4.4  Optimizing Histogram Inner Product
  6.5  Experimental Results
       6.5.1  Protein Subcellular Localization
       6.5.2  Protein Family Classification
       6.5.3  HIV Dataset
       6.5.4  Predictive Toxicology Challenge
  6.6  Conclusions

7  Prediction of Zinc Binding Sites
  7.1  Introduction
  7.2  Dataset Description and Statistics
       7.2.1  Data preparation
       7.2.2  A Taxonomy of Zinc Sites and Sequences
       7.2.3  Bonding State Autocorrelation
              7.2.3.1  Patterns of Binding Sites
  7.3  Methods
       7.3.1  Standard Window Based Local Predictor
       7.3.2  Semipattern Based Predictor
       7.3.3  Gating Network
  7.4  Experimental Results
  7.5  Discussion and Conclusions

Conclusions
  Theoretical Contributions
  Experimental Contributions

A  Ranking and Preference Error Probability
  A.1  A Detailed Solution of the Integral
  A.2  Another Method to Solve the Integral

Index
List of Figures

3.1  The spectrum kernel counts the common subsequences of a fixed length between two strings.
3.2  Proper subtrees of a given tree.
3.3  Relation between the set of parse tree proper subtrees and the set of subtrees generated as contiguous substrings.
3.4  An example of mutation: labels D and B are replaced by A and C respectively.
3.5  An example of embedding subtree: the relative positions of nodes of the subtree are preserved.
3.6  A chemical compound and its topological representation.
3.7  First two iterations of the Morgan indexing procedure.
3.8  An example of totter: a path with labels C–C–C might either indicate a succession of three C–labelled vertices or just a succession of two C–labelled vertices visited by a tottering random walk.
4.1  Relation between ranking, preference, classification and regression problems.
4.2  The two main syntactic interpretations of sentence (4.41) can be obtained by attaching the same CP to one of the two alternative anchors. In general, several CPs and several attachments for each CP are possible.
4.3  Ambiguities introduced by the dynamic grammar. Left: possible anchor points. Right: possible connection paths.
4.4  An example of a constituent, a triple consisting of an internal node, its label and the indexes of the first and the last word it spans.
4.5  VP and RNN learning curves in the first pass attachment prediction task using modularization in 10 POS–tag categories. The reported values are the percentage of forests where the best element is not ranked first.
4.6  VP, RNN and VP on RNN State in the first pass attachment prediction task: 5 independent subsets of 100 sentences. The reported values are the percentage of forests where the best element is not ranked first.
4.7  PCA on RNN state vectors of a large forest (top) and on state vectors of all forests of the dataset (bottom). The crosses show the best element, while the points are the alternatives.
5.1  Ranking and preference error as a function of the difference ∆ between the two expected values of U on x1 and x2.
5.2  Upper bound of preference and ranking error as a function of the number of alternatives.
6.1  Comparison between the matching kernel (left) and WDK (right).
6.2  The simplest version of WDK is obtained by choosing D = 1 and a relation R depending on two integers r ≥ 0 (the selector radius) and l ≥ r (the context radius).
6.3  Example of selector (red vertex) and context (blue vertices and red vertex) for a graph.
6.4  Two versions of WDK for molecules: neighboring context (top) and neighboring and complement contexts (bottom).
6.5  Preprocessing for constructing a lexicographically sorted index that associates context histograms to selectors. Note that the selectors still have to be sorted.
6.6  Context histograms computation in the case of sequences.
6.7  Relationship between the node p and its histograms at various levels.
6.8  An example of a vertex u which has among its descendants two vertices that have a common neighbor q.
6.9  The histogram intersection kernel between multiple histograms can be evaluated efficiently by computing the minimum between sorted histogram bins over different parts with the same selector.
6.10 The main compartments of a cell.
6.11 Sequence frequency distribution representation based on amino acid composition.
6.12 The remote homology task consists in finding homologies between proteins that are in the same superfamily but not necessarily in the same family.
6.13 Remote Protein Homologies: family by family comparison of the WDK and the spectrum kernel. The coordinates of each point are the RFP at 100% coverage (a), at 50% coverage (b) and the ROC50 scores (c) for one SCOP family, obtained using the WDK and spectrum kernel. Note that the better performance is under the diagonal in (a) and (b), while it is over the diagonal in (c).
6.14 Remote Protein Homologies: comparison of the WDK and spectrum kernel. The graphs plot the total number of families for which a given method is within an RFP at 100% coverage threshold (a), at 50% coverage threshold (b) and exceeds an ROC50 score threshold (c).
6.15 ROC area evolution with the introduction of the Morgan index for different values of the pq parameter useful for preventing totters in the PTC dataset for FR.
7.1  Left: probabilities of zinc binding for a given residue, prior and conditioned on the presence of another zinc binding residue within a certain separation. Right: correlation between the targets of pairs of residues within a given distance.
7.2  Residue level recall–precision curves for the best [CH] local and gated predictors. Top: cysteines and histidines together. Middle: cysteines only. Bottom: histidines only.
7.3  Protein level recall–precision curves for the best [CH] gated predictor. Top: all proteins together. Bottom: proteins divided by zinc site type.
7.4  Comparison, at protein level, of recall–precision curves between the best [CH] and [CHDE] gated predictors for Zn3 binding sites.
List of Tables

2.1  An example of VP algorithm execution focussed on example x5. After k mistakes, β5k is known but not ck and so only α5k−1 can be evaluated.
4.1  VP and RNN learning curves in the first pass attachment prediction task using modularization in 10 POS–tag categories. The reported values are the percentage of forests where the best element is not ranked first.
4.2  VP and RNN in the first pass attachment prediction task: 5 independent subsets of 100 sentences. The reported values are the percentage of forests where the best element is not ranked first.
4.3  VP and RNN in the reranking task. We report the standard PARSEVAL measures described in Section 4.5.2.
4.4  Comparison between different loss functions and evaluation measures: (P) indicates the pairwise loss function while (S) the setwise loss function.
6.1  Multiclass classification confusion matrix for 4 classes when considering class 3 as positive.
6.2  Leave one out performance on the SubLoc data set described in Hua and Sun (2001). The spectrum kernel is based on 3–mers and C = 10. For the WDK, context width is 15 residues (context radius l = 7), k–mer size is 3 (selector radius r = 1) and C = 10.
6.3  Test set performance on the SwissProt data set defined by Nair and Rost (2003). The spectrum kernel is based on 3–mers and C = 5. For the WDK, context width is 15 residues (context radius l = 7), k–mer size is 3 (selector radius r = 1) and C = 5.
6.4  Rate of false positives at 50% and 100% coverage levels and ROC50 scores for all 33 SCOP families for the spectrum and WDK kernels. The spectrum kernel is based on 3–mers and C = 1. For the WDK, context width is 15 residues (context radius l = 7), k–mer size is 3 (selector radius r = 1) and C = 1.
6.5  HIV dataset statistics: m is the dataset size, NA and NB are the average number of atoms and bonds in each compound, TA and TB are the average number of types of atoms and bonds, max/min NA and max/min NB are the maximum/minimum number of atoms and bonds over all the compounds. The total number of vertices and edges is 1,951,154 and 2,036,712 respectively.
6.6  HIV dataset: CA vs. CM task. Effect of varying the context radius l and the absence (D = 1) or presence (D = 2) of graph complement.
6.7  HIV dataset. FSG: best results (optimized support and β = n−/n+) reported by Deshpande et al. (2003) using topological features; CPK: results reported by Horváth et al. (2004) using β = n−/n+; CPK∗: same, using an optimized β∗; WDK: β = n−/n+.
6.8  PTC dataset statistics: m is the dataset size, NA and NB are the average number of atoms and bonds in each compound, TA and TB are the average number of types of atoms and bonds, max/min NA and max/min NB are the maximum/minimum number of atoms and bonds over all the compounds.
6.9  PTC: effect of varying the context radius l and the absence (D = 1) or presence (D = 2) of graph complement. The best performance is highlighted in boldface.
6.10 Area under the ROC curve (AUC) varying the number of frequent subgraphs NFSG used by a feature selection procedure. The best performance is highlighted in boldface.
7.1  Top: distribution of site types (according to the number of coordinating residues in the same chain) in the 305 zinc–proteins data set. The third column is the number of chains having at least one site of the type specified in the row. Bottom: number of chains containing multiple site types. The second row gives the number of chains that contain at least one site for each of the types belonging to the set specified in the first row.
7.2  Statistics over the 305 zinc proteins (464 binding sites) divided by amino acid and site type. Na is the amino acid occurrence number in the corresponding site type. fa is the observed percentage of each amino acid in a given site type. fs is the observed percentage of each site type for a given amino acid. "All" is the total number of times a given amino acid binds zinc in general.
7.3  Binding site patterns ordered by frequency of occurrence in the 464 zinc sites. Square brackets denote alternatives, x(·) denotes a sequence of residues of arbitrary length, x(n−m) denotes a sequence of between n and m residues, x(> n) denotes a sequence of more than n residues. N is the number of occurrences within the dataset. The Type column highlights some common binding site patterns: S refers to x(0–7), L refers to x(> 7).
7.4  Most common zinc binding sites amongst all 464 sites. x(n) denotes a sequence of n residues. N is the number of occurrences within the dataset.
7.5  Chain and site coverage for the [CHDE] x(0–7) [CHDE] semipattern. N is the absolute number of chains and sites, while f is the percentage over the total number of chains and sites of that type.
7.6  Ratio between negative and positive training examples for residues and semipatterns. A semipattern is positive if both candidate residues bind a zinc ion, even if they do not actually bind the same ion.
7.7  AURPC and AUC for the local predictor and the gating network focused on cysteines and histidines.
7.8  Ratios between negative and positive examples for the local predictor. A 0.8 threshold on conservation was used for D and E residues.
7.9  Ratios between negative and positive examples for the semipattern predictor. A 0.8 threshold on conservation was used for D and E residues within x(1–3) gaps, while D and E residues are not used for x(0) and x(4–7) gaps (see — in the table).
7.10 Site and chain coverage for the [CH] x(0–7) [CH] (left) and [CHDE] x(0–7) [CHDE] (right) semipatterns. N is the total number of covered chains or sites, while f is the fraction of chains or sites covered. A 0.8 threshold on conservation profile was used for D and E residues within x(1–3) gaps, while D and E residues are not used for x(0) and x(4–7) gaps.
7.11 AURPC and AUC of the gated predictor for aspartic acid (D) and glutamic acid (E) with their baselines.
List of Algorithms

1.1  SVM–Training–Algorithm(Dm)
1.2  SVM–Prediction(x, α, K)
2.1  VP–Training–Algorithm(Dm)
2.2  VP–Kernel–Training–Algorithm(Dm, K)
2.3  VP–Prediction(x, W)
2.4  VP–Kernel–Prediction(x, J, K)
2.5  Dual–VP–Training–Algorithm(Dm, γ)
2.6  Dual–VP–Prediction(x, α, K)
3.1  FSG(Dm, σ)
4.1  Dual–VP–Preference–Training–Algorithm(Dm, γ)
4.2  Dual–VP–Preference–Prediction(x, α, K)
6.1  Tree–Make–Histogram–Vector(T, p, l)
6.2  Tree–Make–Histogram(T, p, l)
6.3  Graph–Make–Histogram(G, l)
6.4  Redundant(Adju, E)
7.1  Remove–Poly–Histidine–Tags(s)
Acknowledgements
Writing this thesis was not at all easy, the way was long and difficult but
finally I arrived at the last page: I owed it to the spent effort but mostly
to all who helped me to arrive to this point and contributed to make these
years better.
First of all, heartfelt thanks to my parents Marino and Ilva who have
provided me an ideal environment to grow up in, for their huge trust and
support during and before this work: I really take pride in them.
I really thank my girlfriend Francesca for sharing with me this period,
for her sweet affection, for keep close to me during all my studies and for
encouraging me in embarking on this adventure.
I would like to thank Paolo for his friendship, advice, encouragement and
support during these years. Thanks to all the Machine Learning and Neural
Network Group members, particularly to Fabrizio, Alessio, Alessandro, Andrea and Giovanni for their fruitful collaboration and discussions. A special
thank to Massimiliano for giving me the possibility of spending six months
at Computer Science Department of University College London: a beautiful
experience I will never forget.
Finally, thanks to my friends and all people who have helped me to achieve
such an important goal: particularly Massimo for reviving my fishing passion, Daniela and Michele for their true friendship, Gianni B. for the long
nights spent in front of the pc, Gianni C. for always joking, Mario for his
masterpieces in wrought iron, Mirella for her dinners and Franco and Daniela
for their regard and trust.
A sweet thought goes to my grandparents and my cousin who cannot
unfortunately share with me this gorgeous moment.
Florence, December 31st , 2005
Sauro
xxv
Notation
IN, IR            natural, real numbers
x ∈ X             input and input space
y ∈ Y             output and output space
D                 training data
m                 training set size
ρ                 joint probability distribution over X × Y
F                 feature space
N                 dimension of feature space
n                 dimension of input space
nSV               number of support vectors
R                 radius of the ball containing data
c                 cardinality of Y in multi–class classification
⟨x, z⟩            inner product between x and z
φ : X → F         mapping to feature space
K(x, z)           kernel ⟨φ(x), φ(z)⟩
H                 hypothesis space or Hilbert space
f(x)              function before thresholding or hypothesis from H
V(f(x), y)        loss function
w                 weight vector
b                 bias, constant offset or threshold
‖·‖_p             p–norm, ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}
e                 base of natural logarithm
ln                natural logarithm
log_b             logarithm to the base b
x′, X′            transpose of vector, matrix
α                 dual variables or Lagrange multipliers
ξ                 slack variables
γ                 functional or geometric margin
C                 regularization parameter
η                 learning rate
T                 number of epochs
E                 number of mistakes
Im                m × m identity matrix
K                 kernel matrix
sign(x)           sign function: equals 1 if x ≥ 0 else −1
θ(x)              heaviside function: equals 1 if x ≥ 0 else 0
I(proposition)    indicator function: 1 if proposition is true, else 0
|·|               absolute value, cardinality of a set or length of a string
⌈·⌉               ceil function
⌊·⌋               floor function
|x|_+             equals x if x ≥ 0 else 0
|x|_ε             ε–insensitive function: |x|_ε = max(0, |x| − ε)
Pr{·}             probability of an event
E_X{g(X)}         expected value of g(X) wrt the distribution on X
Var_X{g(X)}       variance of g(X) wrt the distribution on X
ε                 error probability
δ                 confidence
ℓ²                square convergent real sequences
L2(X)             square integrable functions on a compact set X
A                 alphabet of characters
Introduction
This thesis deals with the theory and applications of kernel methods for
structured data and preference learning in the framework of machine learning.
Since many real world data are structured and consequently have no natural representation in a vector space, statistical learning in structured and
relational domains is rapidly becoming one of the central areas of machine
learning. In this dissertation, statistical learning theory is briefly reviewed together with an outline of the mathematical foundations of kernels. Two kernel methods, support vector machines and the voted perceptron, are presented, including a new interpretation of regularization theory for the voted perceptron algorithm. A survey of kernels for structured data is presented, followed by the design of new kernels for practical problems in bioinformatics which achieve state–of–the–art performance. A novel framework for evaluating different
models for preference and ranking problems is proposed.
The main application domains tackled in this thesis are natural language processing and computational molecular biology. The prediction of first pass attachment under a strong incrementality hypothesis and the reranking of parse trees generated by a statistical parser are two large scale preference learning problems involving learning a preference function that selects the best alternative in a set of competitors. Practical problems in bioinformatics involving protein sequence and molecule graph classification, such as subcellular localization, prediction of the toxicity and biological activity of chemical compounds and prediction of zinc binding sites, are suitable fields for the design of new kernels for structured data.
Machine Learning and Artificial Intelligence
Artificial Intelligence (AI) (Russell and Norvig, 2003) is the branch of computer science concerned with making computers behave like humans. The term was coined in 1956 by John McCarthy to denote the discipline that studies methods suitable for conceiving, devising and testing hardware or software systems whose performance could be compared to that of a human being. Some fields of AI research are combinatorial search, computer vision, expert systems, genetic programming,
knowledge representation, natural language processing, robotics, artificial life
and so on.
Machine Learning (Mitchell, 1997) is a branch of artificial intelligence concerned with the development of techniques which allow computers to “learn”. The idea of constructing machines capable of learning from experience has been very appealing since the advent of the first electronic computers: indeed, there are many tasks that cannot be solved by classical programming techniques since no algorithm is known to solve them. The purpose of machine learning is to automatically learn a theory from examples by an inductive process, adapting mathematical models to the available examples. Machine learning has a wide spectrum of applications including
search engines, medical diagnosis, detecting credit card fraud, stock market
analysis, classifying biological sequences, speech and handwriting recognition, game playing, robot locomotion and so on.
Taxonomy of Machine Learning Algorithms
Machine learning algorithms can be organized into a taxonomy, based on the
desired outcome of the algorithm. There are three main different scenarios:
supervised learning, unsupervised learning and semi–supervised learning. In
supervised learning, a set of input/output pairs is given and the task is to
learn an input/output function that predicts the outputs on new inputs. A
problem with binary outputs is referred to as a binary classification problem,
one with a finite number of categories as multiclass classification, while for real–valued outputs the problem is known as regression. In unsupervised learning, only a set of inputs with no outputs is given and the task
is to gain some understanding of the process that generated the data. For
example, in clustering problems the task is to segment unlabelled data into
clusters that reflect meaningful structure of the data domain. Finally, in
semi–supervised learning a set of both labelled and unlabelled data is given
and the task is to construct a better classifier by exploiting unlabelled data than by using only labelled data.
Moreover, there are many applications where unlabelled data is abundant but labelling is expensive, and it is then often realistic to model this scenario
in the framework of active learning. In the standard supervised learning
model, the learner has no control over which labelled examples it observes.
Active learning (also known as cooperative learning) is a framework that
allows the algorithm to choose for which examples to ask the label. Since active learning allows for intelligent choices of which examples to label, the sample complexity, that is the number of labelled examples required to learn a concept via active learning, is often significantly lower.
A different scenario is reinforcement learning where the agent has a range
of actions at its disposal which it can take to attempt to move towards states
where it can expect high rewards. Every action has some impact in the environment and the environment provides feedback that guides the learning
algorithm. The learning methodology can play a role in reinforcement learning if we treat the optimal action as the output of a function of the current
state: there are, however, significant complications since the quality of the output can only be assessed indirectly, as the consequences of an action become
clear.
A further scenario is learning to learn where the algorithm learns its own
inductive bias based on previous experience. Learning to learn is a process
of discovery about learning. It involves a set of principles and skills which,
if understood and used, help learners to learn more effectively. At its heart
is a belief that learning is learnable.
Another way to distinguish learning algorithms concerns the way in which
the examples are presented. In batch learning all the data are given to the
algorithm at the beginning of the learning, while in on–line learning the
algorithm receives one example at a time and gives its estimate of the output before seeing the correct value, updating the current hypothesis in response
to each new example.
A further important aspect of each machine learning algorithm is generalization, the ability to correctly classify unseen data that are not present in the training examples. Precisely, it is not sufficient for the algorithm to be consistent with the training data, as the data could be noisy, for example; the algorithm must also correctly classify new examples. So, we shall aim to optimize generalization and not the fit on the training data: the case when a function becomes too complex in order to be consistent is called overfitting. One of the first, now historical, criteria for improving generalization is Ockham’s razor: a principle which suggests preferring simpler functions, or rather that more complex functions must pay for themselves by giving significant improvements in the classification rate on the training
data. In other words, there is a tradeoff between complexity and accuracy
on training data and various principles have been proposed for choosing the
optimal compromise.
In conclusion, the key point of machine learning is to construct machines
capable of learning from experience that exhibit good generalization properties.
Structured Data
By structured data we mean data that is formed by combining simpler components into more complex items, frequently involving a recursive use of
simpler objects of the same type. Typically it will be easy to compare the
simpler components either with base kernels or using an inductive argument
over the structure of the objects. Examples of structured data include vectors, strings and sequences but also more complex objects such as trees,
images and graphs. The design of kernels for structured data enables the
same algorithms and analysis of structural pattern recognition to be applied
to new application domains. The extension of kernels to structured data
provides the way for analysing very different sets of data–types, opening up
possibilities such as discovering clusters in a set of trees, learning classifications of graphs and so on. It is therefore by designing specific kernels for
structured data that kernel methods can demonstrate their full flexibility.
Probably the most important data type after vectors is that of symbol
strings of varying lengths. This type of data is commonplace in bioinformatics applications, where it can be used to represent proteins as sequences of amino acids, genomic DNA as sequences of nucleotides, promoters and other
structures. Partly for this reason, a great deal of research has been devoted
to it in the last few years. Many other application domains consider data
in the form of sequences so that many of the techniques have a history of
development within computer science, as for example the study of string
algorithms. Kernels have been developed to compute the inner product between images of strings in high–dimensional feature spaces using dynamic programming techniques. The set of kernels on strings, and more generally on structured objects, makes it possible for kernel methods to operate in a
domain that traditionally has belonged to syntactical pattern recognition, in
this way providing a bridge between that field and statistical pattern analysis.
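To make the idea concrete, the following short Python sketch (a toy illustration, not code from this thesis) computes a simple k–spectrum kernel between two strings by counting shared k–mers; this is the simplest member of the string kernel family surveyed in Chapter 3, and the choice k = 3 below is arbitrary.

    from collections import Counter

    def spectrum_kernel(s, t, k=3):
        # Inner product of the k-mer count vectors of s and t, computed
        # directly from the strings without ever building the explicit
        # (very high dimensional) feature vectors.
        cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
        # Only k-mers present in both strings contribute to the product.
        return sum(cs[kmer] * ct[kmer] for kmer in cs.keys() & ct.keys())

    print(spectrum_kernel("MKTAYIAKQR", "MKTAYLAKQR"))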
Kernel Machines
The success of machine learning methods depends on their ability to solve
pattern recognition and regression problems. The kernel methodology provides a powerful and unified framework for many disciplines, from neural networks to statistical pattern recognition, to machine learning and data mining. Any kernel method solution comprises two parts: a module that performs the mapping into the feature space and a learning algorithm designed to discover linear patterns in that space. There are two main reasons why
this approach should work. First of all, detecting linear relations has been
the focus of much research in statistics and machine learning for decades and
the resulting algorithms are both well understood and efficient. Secondly,
we will see that a kernel function represents a computational shortcut which
makes it possible to represent linear patterns efficiently in high–dimensional
spaces, to ensure adequate representational power.
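As a minimal numerical illustration of this shortcut (a hypothetical toy example, not taken from the thesis), the Python snippet below checks that the homogeneous polynomial kernel of degree two on IR² coincides with the inner product of an explicit three–dimensional feature map.

    import math

    def phi(x):
        # Explicit feature map of the degree-2 homogeneous polynomial kernel on R^2.
        x1, x2 = x
        return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

    def poly2_kernel(x, z):
        # K(x, z) = <x, z>^2, evaluated directly in input space.
        return (x[0] * z[0] + x[1] * z[1]) ** 2

    x, z = (1.0, 2.0), (3.0, -1.0)
    explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
    print(explicit, poly2_kernel(x, z))  # both equal 1.0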
Overview of the Thesis
This thesis consists of three parts: I) Learning Structured Data, II) Preference Learning and III) Kernels on Structured Data for Computational Molecular Biology.
The first part provides the conceptual foundations of the field by giving
an extended introduction to the statistical learning theory. It contains the
description of two kernel–based algorithms, support vector machines and
voted perceptron. A number of kernel functions are reported, from basic
kernels to advanced recursive kernels, kernels derived from generative models
such as HMMs and string matching kernels based on dynamic programming,
as well as special kernels designed to handle text documents.
The second part deals with theoretical and applicative aspects of the preference and ranking problems. Ranking is the problem of learning a ranking
function from a training set of ranked data. The number of ranks need not
be specified though typically the training data comes with a relative ordering
specified by assignment of an ordered sequence of labels. An interesting and
novel framework for evaluating different models for preference and ranking
problems is proposed.
Finally, in the third part we propose several kernels for structured data
represented as sequences, trees or graphs, applied to some topical and challenging problems in bioinformatics such as subcellular localization, remote homology detection, prediction of the toxicity and biological activity of chemical compounds and the prediction of zinc binding sites and proteins. The proposed kernels are highly competitive with respect to more complex and computationally demanding state–of–the–art methods.
In detail, we report a brief description of the content of each chapter.
Chapter 1 This chapter provides a brief review of the statistical learning
theory and an outline of the mathematical foundations of kernels. Precisely, the concepts of loss function, risk functional, Bayes function,
regularization theory, Mercer kernel and reproducing kernel Hilbert
space are revised, and some theorems such as Mercer’s theorem and the Representer theorem are reported. The chapter ends with the description of
Support Vector Machines, the most widespread kernel–based machine
learning algorithm nowadays.
Chapter 2 This chapter provides a novel extensive view of the Voted Perceptron (VP) algorithm, an efficient on–line algorithm whose performance is similar to that of maximal–margin classifiers. We derive a new on–line update rule for the VP dual variables which permits a new interpretation of regularization theory for VP, explaining how fast the value of the dual variables grows as the number of epochs increases, and giving an upper bound for their value. It also provides a novel way to devise a VP loss function.
Chapter 3 This chapter consists of a survey of Mercer kernels for structured data described in the literature, starting from basic kernels on vectors up to kernels on strings, trees and graphs. The chapter ends by describing recursive neural networks, an alternative approach to kernel machines for processing structured data: a generalization of neural networks capable of processing structured data such as DOAGs where a discrete or real label is associated with each vertex.
Chapter 4 In this chapter, we deal with two large scale structured preference learning problems: the prediction of first pass attachment under a strong incrementality hypothesis and the reranking of parse trees generated by a statistical parser. Both problems involve learning a preference function that selects the best alternative in a set of competitors.
We show how to perform preference learning in this highly structured
domain using both convolution kernels and recursive neural networks.
Chapter 5 This chapter proposes a theoretical analysis of preference and
ranking problems: a new framework based on a partial order model
of preference and ranking explains why a function that works on the whole set of alternatives exhibits better performance than a pairwise loss function based on a utility function. In addition, we show how the ranking and preference generalization error depends on the size of the set of alternatives.
Chapter 6 This chapter introduces the weighted decomposition kernel, a new family of efficient kernels on discrete data structures within the general class of decomposition kernels, whose performance is highly competitive with respect to more complex state–of–the–art methods. It is
computed by dividing objects into substructures indexed by a selector:
two substructures are then matched if their selectors satisfy an equality predicate, while the importance of the match is determined by a
probability kernel on local distributions fitted on the substructures.
Chapter 7 This chapter tackles the prediction of zinc binding sites and proteins, considering the problem of the autocorrelation between residues that are close in the protein sequence. We propose an ad–
hoc remedy in which sequentially close pairs of candidate residues are
classified as being jointly involved in the coordination of a zinc ion and
we develop a kernel for this particular type of data that can handle
variable length gaps between candidate coordinating residues.
Appendix A This appendix reports the mathematical details of the solution of the integral in Chapter 5 computing the probability of a ranking
error between two alternatives.
Sources of the Thesis
Chapter 2 is partially new. In detail, the material presented in Sections 2.2,
2.3, 2.5 is new and unpublished.
Chapter 4 is based on Costa et al. (2002), on Menchetti et al. (2003)
and on Menchetti et al. (2005c). Part of the material is novel and has been
expanded.
Chapter 5 is novel and still unpublished. It is based on Menchetti (2006)
which was submitted for publication.
Chapter 6 is based on Menchetti et al. (2005b) and on Menchetti et al.
(2005a). Part of the material has been expanded. The results of Section
6.2.1 are new.
Chapter 7 is based on Menchetti et al. (2006). Part of the material has
been expanded.
8
INTRODUCTION
Other Projects
A software package implementing the Weighted Decomposition Kernel (WDK) described in Menchetti et al. (2005a) was developed by Fabrizio Costa. It
is freely available at http://www.dsi.unifi.it/neural/src/WDK/ under the
terms of the GNU General Public License.
A predictor of zinc binding proteins based on Menchetti et al. (2006)
is under construction and will be available by the time of the RECOMB 2006 International Conference. It will be accessible via a link on the web page
http://www.dsi.unifi.it/neural/.
Part I
Learning Structured Data
Chapter 1
Statistical Learning Theory
In this chapter, statistical learning theory is briefly reviewed, followed by
an outline of the mathematical foundations of kernels. Precisely, the concepts
of loss function, risk functional, Bayes function, regularization theory, Mercer
kernel and reproducing kernel Hilbert space are revised, and some theorems such as Mercer’s theorem and the Representer theorem are reported. The chapter
ends with the description of Support Vector Machines, the most widespread
kernel–based machine learning algorithm nowadays.
1.1 Statistical Learning Theory
The problem of generalization of a learning algorithm is the core problem
of learning from examples. Given a set of examples and a learning task,
too complex hypotheses will perfectly memorize the training data without being able to make predictions on unseen examples, while too simple ones lack enough power to learn the given task. Statistical learning theory, started by Vapnik and Chervonenkis in the sixties, asserts bounds on the error of a prediction function f

err(f) = Pr{(x, y) : f(x) ≠ y}    (1.1)

in terms of several quantities. The most commonly used is the number m of training examples, and results have frequently been presented as bounds on the number
of examples required to obtain a particular level of error: this is also known
as the sample complexity of the learning problem. Since the error (1.1) cannot be explicitly computed, we refer to an approximation of it computed on the training set to find a good prediction function f

errm(f) = (1/m) |{i, 1 ≤ i ≤ m : f(xi) ≠ yi}|    (1.2)
A detailed and complete overview of statistical learning theory can be found in many books and papers. Some good starting points are the two books of Vapnik (1995, 1998) and the work of Hastie et al. (2001). Some relevant results are derived in the papers of Poggio and Girosi (1989), Evgeniou et al. (2000), Cucker and Smale (2001) and Poggio and Smale (2003).
In the following, we report a brief review of this model.
1.1.1 The Supervised Learning Problem
The key assumption on which the model is based is that the collection of m pairs Dm = {(xi, yi)}_{i=1}^m used for training and testing is a set of independently and identically distributed (i.i.d.) examples drawn from an unknown but fixed distribution ρ on input/output pairs (x, y) ∈ X × Y, where X and Y are respectively the input and the output space. This is a strong assumption: time series, for example, do not satisfy it since the observations are dependent. Nevertheless, there are many practical cases in which this hypothesis holds or is a good approximation of the situation. Using the decomposition ρ(x, y) = ρ(x)ρ(y|x), the sampling can be interpreted as a two–step process where first an input x is sampled according to ρ(x) and then a corresponding output y is sampled with probability ρ(y|x). While the first step can be totally random (e.g. ρ(x) can be the uniform distribution), the second step usually models the sampling of a function f corrupted by noise. So the relation between input and output spaces is probabilistic rather than functional: for a given input x, there is a distribution ρ(y|x) over possible outputs.
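A minimal Python sketch of this two–step sampling process (a hypothetical generator with ρ(x) uniform on [−1, 1] and ρ(y|x) a noisy sign function, chosen purely for illustration):

    import random

    def sample_pair(noise=0.1):
        # Step 1: draw x according to rho(x), here uniform on [-1, 1].
        x = random.uniform(-1.0, 1.0)
        # Step 2: draw y according to rho(y|x), here sign(x) flipped
        # with probability `noise`, so the relation is not functional.
        y = 1 if x >= 0 else -1
        if random.random() < noise:
            y = -y
        return x, y

    Dm = [sample_pair() for _ in range(100)]  # i.i.d. training sample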
The goal is to learn a function f ∈ H, f : X → Y which models the
probabilistic relation between X and Y in a way such that f (x) ≈ y. The
hypothesis space H represents the set of admissible functions in which the
learning algorithm looks for a “good” function. We will see that H is a
subset of a larger space T called target space which contains a broader class of
functions from X to Y, for example all the continuous functions from X to Y.
Putting some constraints on the elements of T leads to the hypothesis space
H. Common learning tasks are regression where Y = IR and classification
in which Y is a set of c classes (when c = 2 the problem is called binary
classification and it is convenient to take Y = {+1, −1}).
1.1.2 Loss Function
To measure how good is a function f : X 7→ Y on a given collection of data,
we have to introduce the concept of loss function.
Definition 1.1 (Loss Function) Let f : X → Y be any function from X to Y. A loss function V : Y × Y → [0, ∞) is a non–negative function that
measures the error V (f (x), y) between the predicted output f (x) on x and
the actual output y. Note that V (y, y) = 0.
Common loss functions can be grouped depending on the problem. Usually
for regression problems the loss is a function of the difference between the
target and the predicted value V (f (x), y) = V (y − f (x)). A typical example
is the quadratic loss
V(f(x), y) = (y − f(x))²    (1.3)
Other interesting losses are the absolute loss V (f (x), y) = |y − f (x)| and
the SVM regression loss
V(f(x), y) = |y − f(x)|_ε    (1.4)

where |·|_ε is the ε–insensitive function defined as |x|_ε = max(0, |x| − ε).
the case of classification, the ideal loss is the misclassification loss (or 0–1
loss)
V(f(x), y) = θ(−y f(x))    (1.5)
where θ(·) is the heaviside function. Often in classification problems a real–
valued function f is first learnt, then the classification function is sign(f ) (it
will be clear from the context if f (x) ∈ IR or f (x) ∈ Y). In the case of
real losses, we introduce the SVM misclassification loss (or hinge loss or soft
margin loss)
V(f(x), y) = |1 − y f(x)|_+    (1.6)

where |x|_+ = x if x ≥ 0 and 0 otherwise, and the SVM hard margin classification loss

V(f(x), y) = θ(1 − y f(x))    (1.7)
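For concreteness, the losses introduced above can be written down in a few lines of Python (a sketch for scalar predictions; the function names are mine, not the thesis notation):

    def quadratic_loss(fx, y):
        return (y - fx) ** 2                        # (1.3)

    def absolute_loss(fx, y):
        return abs(y - fx)

    def eps_insensitive_loss(fx, y, eps=0.1):
        return max(0.0, abs(y - fx) - eps)          # (1.4), SVM regression loss

    def zero_one_loss(fx, y):
        return 1.0 if y * fx <= 0 else 0.0          # (1.5), theta(-y f(x))

    def hinge_loss(fx, y):
        return max(0.0, 1.0 - y * fx)               # (1.6), soft margin loss

    def hard_margin_loss(fx, y):
        return 1.0 if 1.0 - y * fx >= 0 else 0.0    # (1.7), theta(1 - y f(x))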
1.1.3 Risk Functionals
The aim of statistical learning theory is to define a risk functional which measures the average error of a hypothesis and to look for a hypothesis among the allowed ones with the lowest risk. If V(f(x), y) is a
loss function measuring the error between the prediction f (x) and the actual
output y, then the average error is called expected risk (or true error or
expected error or expected loss).
Definition 1.2 (Expected Risk) Given a function f ∈ T and a loss function V (f (x), y), the expected risk errρ (f ) of f with respect to distribution ρ
is the expected loss
errρ(f) = ∫_{X×Y} V(f(x), y) ρ(x, y) dx dy    (1.8)
So we are looking for the minimizer fρ of the expected risk (1.8) in some
target space T :
fρ = arg min_{f∈T} errρ(f)    (1.9)
This minimizer is called the Bayes function (see Section 5.2 for more details) and its expected risk, called the Bayes risk, is
a lower bound on the error that depends only on the intrinsic difficulty of
the problem. The distribution ρ on X × Y is unknown and the expected risk
cannot be explicitly computed. We can approximate the expected risk by
the empirical error (or sample error or empirical risk) on the data collection
Dm .
16
CHAPTER 1 STATISTICAL LEARNING THEORY
Definition 1.3 (Empirical Error) Given a function f ∈ T and a loss
function V (f (x), y), the empirical error errDm (f ) of f with respect to the
data Dm is the average loss
m
. 1 X
errDm (f ) =
V (f (xi ), yi )
m i=1
(1.10)
The empirical error is a random variable depending on the random selection
of data collection Dm . Since ρ is unknown, we can learn f by minimizing the
empirical error (1.10). The essential question is whether the expected risk of
the minimizer of the empirical error is close to the one of fρ .
1.1.4
Empirical Risk Minimization
Given an hypothesis space H and a training set Dm , the empirical risk minimization (ERM) is the method that find the function
.
fDm = arg min errDm (f )
f ∈H
(1.11)
A nice property called consistency we would like to be valid, is that the
expected risk of fDm tends to the expected risk of fρ when the training data
m tends to infinity independently from the distribution ρ:
∀ρ ∀ε > 0 lim Pr{|errρ (fDm ) − errρ (fρ )| > ε} = 0
m→∞
A theorem guarantees the consistency over all f :
Theorem 1.1 (Vapnik and Chervonenkis, 1971) ERM is consistent if
and only if
∀ε > 0 lim Pr sup (errρ (fDm ) − errρ (fρ )) > ε = 0
m→∞
f ∈T
In general ERM is ill–posed because of uniqueness and stability and generalization cannot be expected. So the next step is to define what is a well–posed
problem.
Definition 1.4 (Well–posed Problem (Hadamard, 1902)) A problem
is well–posed if (1) a solution exists, (2) the solution is unique and (3) the
solution depends continuously on the data. A problem is ill–posed if it is not
well–posed.
17
1.1 Statistical Learning Theory
1.1.5
Regularization Theory
The regularization theory is a framework in which ill–posed problems can be
solved by adding appropriate constraints on the solution. A general approach
is to choose the hypothesis space H to be a convex set into a Hilbert space
(see Section 1.2.1):
.
H = {f : Ω(f ) ≤ R2 }
(1.12)
where Ω(f ) is a convex function. For example, Ω(f ) = kf k2 where kf k is the
norm of f in the Hilbert space. So the well–posedness of the ERM problem
can be recovered by adding constraints on the target space T to obtain the
hypothesis space H. There are two main approaches:
Ivanov regularization The direct approach to find the solution of ERM
consists in putting the constraint that f must be bounded:
m
min
f ∈H
subject to
1 X
V (f (xi ), yi )
m i=1
2
kf k ≤ R
(1.13)
2
Tikhonov regularization The indirect approach adds a term estimating
the complexity of the solution f to the empirical risk:
m
1 X
min
V (f (xi ), yi ) + µ kf k2
f ∈H m
i=1
(1.14)
The parameter µ > 0 controls the tradeoff between the empirical error
and the complexity of the function f .
1.1.6
Sample and Approximation Error
The generalization error or expected risk of the minimizer fDm of the empirical error can be decomposed as the sum between the sample error (or
estimation error) and the approximation error :
errρ (fDm ) = (errρ (fDm ) − errρ (fH )) + (errρ (fH ) − errρ (fρ )) +errρ (fρ ) (1.15)
{z
} |
{z
}
|
approximation error
sample error
18
CHAPTER 1 STATISTICAL LEARNING THEORY
where
fDm = arg min errDm (f ),
fH = arg min errρ (f )
f ∈H
f ∈H
The sample error errρ (fDm ) − errρ (fH ) depends on the hypothesis space H
and on the sample Dm , while the approximation error errρ (fH ) − errρ (fρ )
depends on the hypothesis space H and on the distribution ρ. The study
of sample error is based on probabilistic bounds on the deviation between
expected and empirical error which guarantee that
Pr{|errρ (f ) − errDm (f )| ≥ ε} ≤ δ(ε, m, N )
(1.16)
for every function in H (it is an unform bound), where N > 0 is a complexity
measure of the “capacity” of the hypothesis space like the covering number or
the VC–dimension. The confidence δ increases with ε and N and decreases
with m. If we can assure that
errρ (f ) ≤ errDm (f ) + ε ∀f ∈ H
with some probability 1 − δ, then we have
errρ (fDm ) − errρ (fH ) ≤ 2ε
with probability at least 1−δ. Regarding the approximation error errρ (fH )−
errρ (fρ ), it depends on the distribution probability ρ and, especially, on the
hypothesis space, usually trough the covering numbers whose is a decreasing
function. Finally, note that there is a tradeoff between the sample error and
the approximation error: by a large H the approximation error is reduced
and the sample is increased but, on the other way, a small H increases the
approximation error and reduces the sample error.
1.2
Mathematical Foundations of Kernels
Before describing the two kernel machines employed in the next chapters,
we introduce some useful concepts which represent the mathematical foundations of kernel machines. We will characterize valid kernels and feature
spaces, interpreting a kernel as the inner product in some feature space.
19
1.2 Mathematical Foundations of Kernels
Two equivalent approaches are presented: the first uses the Mercer’s theorem to interpret the feature space as a Hilbert space of real sequences; the
other uses Reproducing Kernel Hilbert Spaces to interpret the feature space
as a Hilbert space of functions. The section ends showing the general form
of the solution of a Tikhonov regularized learning problem which minimizes
a cost functional composed by the error on training data and the complexity
of the learnt function.
1.2.1
Euclidean and Hilbert Spaces
First of all, we define Euclidean (or inner product or pre–Hilbert) and Hilbert
spaces which represent an extension of Euclidean spaces.
Definition 1.5 (Euclidean Space) An Euclidean space E is a vector space
with a bilinear map h·, ·i : E × E 7→ IR such that ∀f, g, h ∈ E,a ∈ IR
1. hf, gi = hg, f i
2. hf + g, hi = hf, hi + hg, hi and haf, gi = a hf, gi
3. hf, f i ≥ 0 and hf, f i = 0 ⇔ f = 0E
An Euclidean space is also a normed space with the norm induced by the
. p
inner product:kf k = hf, f i. Before giving the definition of Hilbert space,
we need the following definitions.
Definition 1.6 (Completeness) An Euclidean space E is complete with
respect to the norm induced by the inner product if all the Cauchy sequence
converge to an element in E:
∀ (f1 , f2 , . . .) : fn ∈ E and lim sup kfn − fm k = 0 ⇒ lim fn = f ∈ E
n→∞ m>n
n→∞
Definition 1.7 (Denseness) A set A is dense in a set B if A intersects
every nonempty open set in B.
Definition 1.8 (Separableness) An inner product space is separable if it
contains a countable dense subset.
20
CHAPTER 1 STATISTICAL LEARNING THEORY
Finally, a Hilbert space is defined as an Euclidean space with some more
properties.
Definition 1.9 (Hilbert Space) A Hilbert space is an Euclidean space that
is also (1) complete and (2) separable.
Note that Hilbert spaces are generally infinite–dimensional. An example of
a Hilbert space is the set of square convergent real sequences
(
)
∞
X
.
`2 = (x1 , x2 , . . .) :
x2i < ∞
(1.17)
i=1
with the inner product
∞
. X
hx, zi =
xi zi
(1.18)
i=1
Another example of a Hilbert space is the set of square integrable functions
on a compact set X ⊆ IRd , d ∈ IN
Z
.
2
2
L (X ) = f : X 7→ IR :
f (x)dx < ∞
(1.19)
X
with the following inner product
.
hf, gi =
Z
f (x)g(x)dx
(1.20)
X
For our purposes it is desirable that the hypothesis space is dense in L2 (X ).
1.2.2
Mercer’s Theorem
In this subsection, we first characterize what is a valid kernel, then define
an integral operator on valid kernels and shows what is the feature space of
a valid kernel. At the end, we prove that the hypothesis space is a Hilbert
space.
Definition 1.10 (Mercer Kernel) A function K : X × X 7→ IR is a Mercer kernel if
1. K is continuous
21
1.2 Mathematical Foundations of Kernels
2. K is symmetric, i.e. for all x, y ∈ X , K(x, z) = K(z, x)
3. K is positive definite, i.e. for all finite sets {x1 , . . . , xm } ⊂ X the
m × m matrix with entries K(xi , xj ) is positive definite
∀ m ∈ IN, ∀ c1 , . . . , cm ∈ IR
m X
m
X
ci cj K(xi , xj ) ≥ 0
(1.21)
i=1 j=1
Equivalently, a symmetric matrix is positive definite if all its eigenvalues are nonnegative
Definition 1.11 (Gram Matrix) Given a Mercer kernel K and a set of
objects {x1 , . . . , xm }, the m × m matrix K such that Kij = K(xi , xj ) is
called the Gram matrix of K with respect to {x1 , . . . , xm }.
Theorem 1.2 (Integral Operator on Mercer Kernel) The linear operator LK : L2 (X ) 7→ L2 (X ) on a Mercer kernel K defined by
Z
.
(LK f )(x) =
K(x, z)f (z)dz
(1.22)
X
is
1. well–defined: LK is continuous for all f
2. bounded: kLK f k ≤ akf k, a ∈ IR
R R
3. positive definite: X X K(x, z)f (x)f (z)dxdz ≥ 0
The proof is based on the Spectral Theorem for compact linear operators on
a Hilbert space (see Cucker and Smale (2001) for more details).
Theorem 1.3 (Mercer’s Theorem, 1909) Given a Mercer kernel K on
X × X , let {λk , ϕk }∞
k=1 be a system of the eigenvalue/eigenfunctions of LK
with λk ≥ λk+1 ≥ 0. Then for all x, z ∈ X
K(x, z) =
∞
X
λk ϕk (x)ϕk (z)
k=1
where the convergence is uniform on X × X and absolute.
22
(1.23)
CHAPTER 1 STATISTICAL LEARNING THEORY
The following theorem from Cucker and Smale (2001) shows what is the
feature map of a Mercer kernel.
Theorem 1.4 (Feature Space of a Mercer Kernel) The feature map
φ : X 7→ `2 defined as
. p
(1.24)
φ(x) = { λk ϕk (x)}∞
k=1
is well–defined, continuous and satisfies
K(x, z) =
∞
X
λk ϕk (x)ϕk (z) = hφ(x), φ(z)i
(1.25)
k=1
Remark An important consequence is that a Mercer kernel can be interpreted as an inner product in the Hilbert space `2 of real sequences. In
addition, the Hilbert space `2 of real sequences constitutes the feature space
of our Mercer kernel. Note that if we are given a feature map φ(x) which
we known to be in `2 for all x ∈ X , we can immediately build a valid kernel
by setting K(x, z) = hφ(x), φ(z)i.
Now we define the set of square integrable functions on a compact set X
associated with a Mercer kernel and show that it is a Hilbert space.
Definition 1.12 Given a Mercer kernel K and its linear operator LK defined
in Equation (1.22), define
(
)
∞
X
a
.
k
ak ϕk with √
HK = f ∈ L2 (X ) : f =
∈ `2
(1.26)
λ
k
k=1
where λk ,ϕk are the eigenvalues and the eigenfunctions of LK and define an
inner product h·, ·iHK : HK × HK 7→ IR as
∞
hf, giHK
. X ak b k
=
λk
k=1
(1.27)
The elements of HK are continuous functions and for all f ∈ HK the series
f=
∞
X
ak ϕ k
k=1
converges uniformly and absolutely.
23
(1.28)
1.2 Mathematical Foundations of Kernels
Theorem 1.5 (HK is a Hilbert space) The hypothesis space HK induced
by a Mercer kernel K associated with its linear operator LK defined in Equation (1.22) with an inner product h·, ·iHK defined in Equation (1.27) is a
Hilbert space.
Consequently given a Mercer kernel K, HK is the Hilbert space generated
by the eigenfunctions of the integral operator LK .
Remark The set HK of square integrable functions L2 (X ) on a compact set
X is the hypothesis space of a kernel machine, that is a learning algorithm
that deals with the data only through Mercer kernels. The general form of
the solution of a supervised learning algorithm with kernel machines is
f (x) =
m
X
ai K(xi , x)
(1.29)
i=1
for some coefficient ai ∈ IR. By the definition of HK , we mathematically
characterized the hypothesis space of kernel machines.
1.2.3
Reproducing Kernel Hilbert Spaces
Until now, we characterized valid kernels and feature spaces, interpreting
a kernel as an inner product in some feature space. The above approach
exploits the Mercer’s theorem to describe the feature space as a Hilbert space
of real sequences. An alternative approach uses the Reproducing Kernel
Hilbert Spaces (RKHS) to interpreter the feature space as a Hilbert space of
functions: the two approaches are equivalent. We start giving the definition
of the reproducing kernel Hilbert space.
Definition 1.13 (Reproducing Kernel Hilbert Space) A Reproducing
Kernel Hilbert Space HK is a Hilbert space of functions on a compact set
X . This functions have the properties that for each x ∈ X the evaluation
functionals zx are linear and bounded
• zx (f + g) = zx (f ) + zx (g) = f (x) + g(x)
• |zx (f )| = |f (x)| ≤ Ux kf kHK
Some more properties of RKHS are
24
CHAPTER 1 STATISTICAL LEARNING THEORY
p
K(x, x) K(z, z)
p
2. ∀ x ∈ X , |f (x)| ≤ kf kHK K(x, x)
1. ∀ x, z ∈ X , |K(x, z)| ≤
p
3. HK is made of continuous functions
4. if f =
P
k
ak ϕk , the series converges absolutely and uniformly in X
Theorem 1.6 (Reproducing Property of a Hilbert Space) Given a
Mercer kernel K, define a function Kx : X 7→ IR as
Kx (z) = K(x, z)
(1.30)
Then
• Kx ∈ HK , ∀x ∈ X
• for each RKHS there exists an unique Mercer kernel K called reproducing kernel
• conversely, for each Mercer kernel K there exists an unique RKHS that
has K as its reproducing kernel
The reproducing properties means that
zx (f ) = hKx , f iHK = f (x)
(1.31)
The next step is to understand if HK defined in Equation (1.26) and HK
are the same space. We start from a theorem reported in Cucker and Smale
(2001) which suggests how to build a RKHS.
Theorem 1.7 (Uniqueness) There exists an unique Hilbert space HK of
functions on X satisfying the following conditions:
1. ∀x ∈ X , Kx ∈ HK
2. the span H0 of the set {Kx : x ∈ X } is dense in HK
3. ∀f ∈ HK , f (x) = hKx , f iHK
Moreover, HK consists of continuous functions.
25
1.2 Mathematical Foundations of Kernels
The sketch of the proof starts by defining an inner product in H0 . If
f=
s
X
αi Kxi
and
g=
i=1
r
X
βj Kzj
j=1
then
s
hf, giHK
r
. XX
=
αi βj K(xi , z j )
(1.32)
i=1 j=1
Let HK be the completion of H0 with the associated norm. To check the
reproducing property, note that from Equation (1.32) follows that
Kxi , Kzj
HK
= K(xi , z j )
and so
hf, Kx iHK =
* s
X
+
αi Kxi , Kx
i=1
=
=
s
X
i=1
s
X
HK
αi hKxi , Kx iHK
αi K(xi , x) = f (x)
i=1
While before the data x was mapped into series of real numbers, now are
mapped into functions which sits on x that generate the Hilbert space:
φ
x 7−→ φ(x) = Kx
So each point is represented in the feature space by a function that measures
its similarity with the other points.
Then we show that the reproducing property holds also starting from
Mercer’s theorem (Cucker and Smale, 2001):
Proposition 1.1 (Reproducing Property from Mercer’s Theorem)
∀f ∈ HK and ∀x ∈ X ⇒ f (x) = hf, Kx iHK
26
CHAPTER 1 STATISTICAL LEARNING THEORY
Proof of Proposition 1.1 Since f ∈ HK , then
f (x) =
∞
X
ak ϕk (x)
k=1
It follows that
hf, Kx iHK =
=
∞
X
k=1
∞
X
k=1
ak hϕk , Kx iHK =
ak
(LK ϕk )(x) =
λk
Z
∞
X
ak
k=1
∞
X
k=1
λk
ϕk (z)K(x, z)dz
ak
λk ϕk (x)
λk
= f (x)
Finally, after introducing all necessary elements, we cite a theorem from
Cucker and Smale (2001) which states that HK and HK are the same space.
Theorem 1.8 (HK and HK are equal) The Hilbert spaces HK and HK
are the same space of functions on X with the same inner product:
HK ≡ HK
and
h·, ·iHK ≡ h·, ·iHK
Since HK and HK are the same space, the notation can be simplified. In the
following, we use HK to denote the hypothesis space and h·, ·iK to denote
the inner product in HK with respect to the Mercer kernel K.
1.2.4
Representer Theorem
Exploiting the results derived in the previous sections, now we show the
general form of the solution of a Tikhonov regularized learning problem which
minimizes a cost functional composed by the error on training data and the
complexity of the learnt function.
Theorem 1.9 (Kimeldorf and Wahba, 1971) Given a collection of independently and identically distributed data Dm = {(xi , yi ) ∈ X × Y}m
i=1
where X and Y are respectively the input and the output spaces, given a convex loss function V (f (x), y), given a Mercer kernel K, given an hypothesis
27
1.2 Mathematical Foundations of Kernels
space HK in according to Definition 1.26, given the norm k · kK in the RKHS
HK induced by the inner product (1.27), then the general form of the solution
of the Tikhonov regularization
m
1 X
V (f (xi ), yi ) + µ kf k2K
min
f ∈HK m
i=1
(1.33)
is
f (x) =
m
X
(1.34)
ai K(xi , x)
i=1
where µ > 0 is a parameter that controls the tradeoff between the empirical
error and the complexity of the function f .
Remark So the problem to find an f ∈ HK that minimizes the regularized
risk functional (1.33) is turned into the problem of finding the best coefficients
ai , i = 1, . . . , m. Note that the loss function V (f (x), y) must be convex: for
example, the SVM loss functions (1.4) and (1.6) are convex losses (even
though they are not differentiable everywhere).
Proof of Theorem 1.9 The proof is given in the simplified case that the loss
function V (f (x), y) is differentiable with respect to f . In general, the result
of the theorem holds for each convex loss: for more details see Kimeldorf and
Wahba (1971). Define
m
. 1 X
V (f (xi ), yi ) + µ kf k2K
H(f ) =
m i=1
Since f ∈ HK , then
f=
∞
X
bk ϕk
kf k2K =
and
k=1
∞
X
b2k
λk
k=1
Setting the first derivative of H(f ) with respect to bk to zero, then
m
0 =
∂H(f )
1 X ∂V (f (xi ), yi )
bk
=
ϕk (xi ) + 2µ
∂bk
m i=1
∂bk
λk
⇒ b k = λk
m
X
i=1
−
1 ∂V (f (xi ), yi )
ϕk (xi )
2µm
∂bk
28
(1.35)
CHAPTER 1 STATISTICAL LEARNING THEORY
So
b k = λk
m
X
ai ϕk (xi )
(1.36)
i=1
where
ai = −
1 ∂V (f (xi ), yi )
2µm
∂bk
(1.37)
Plugging Equation (1.36) back into Equation (1.35)
f (x) =
=
∞
X
k=1
m
X
bk ϕk (x) =
∞
X
k=1
ai
∞
X
i=1
λk
m
X
ai ϕk (xi )ϕk (x)
i=1
λk ϕk (xi )ϕk (x) =
m
X
ai K(xi , x)
i=1
k=1
where in the last step we used the Mercer’s theorem.
Equation (1.37) shows that the loss function V (f (x), y) affects the computation of coefficient ai . In the case of regression problem, if we use the quadratic
loss (1.3), we obtain a linear system. Precisely,
P
yi − m
1 ∂V (f (xi ), yi )
yi − f (xi )
j=1 aj K(xj , xi )
ai = −
=
=
2µm
∂bk
µm
µm
m
X
⇒ yi = µmai +
aj K(xj , xi )
j=1
that in matrix form becomes
(µmI m + K) a = y
(1.38)
In general, different loss functions originate different solutions not in the form
of a linear system and give different algorithms for finding the coefficient ai ;
furthermore the quadratic loss is not necessarily the best loss function.
So, by means of the representer theorem, the problem of finding f ∈ HK
is turned into finding coefficients ai , i = 1, . . . , m. Note that when µ → 0,
f (x) → 0; but setting µ not too small guarantees an unique solution of
problem (1.33) because the matrix µmI m + K has full rank, and a stable
solution, that is the linear system (1.38) is well conditioned.
29
1.3 Support Vector Machines
1.3
Support Vector Machines
Support Vector Machines (SVMs), originally developed by Vapnik and co–
workers (Boser et al., 1992), are the most widespread kernel–based machine
learning algorithm nowadays. Immediately after their introduction, an increasing number of researcher have worked on both the algorithmic and theoretical analysis of these systems, creating in just a few years what is effectively a new research direction, merging concepts from distant disciplines
as statistics, functional analysis, optimization and machine learning. The
soft margin classifier was introduced a few years later by Cortes and Vapnik (1995), then the algorithm was extended to the regression case (Vapnik,
1995) and to clustering problems (Tax and Duin, 1999; Schölkopf et al., 2001;
Ben-Hur et al., 2002). Two books written by Vapnik (1995, 1998) provide a
very extensive theoretical background of the field. Other references could be
found in Cristianini and Shawe-Taylor (2000); Shawe-Taylor and Cristianini
(2004); Schölkopf and Smola (2002); Hastie et al. (2001).
SVMs represent a very specific class of algorithms, characterized by the
use of kernels, the absence of local minima, the sparseness of the solution and
the capacity control obtained by acting on the margin or on other dimension
independent quantities as the number of support vectors. They provide a
new approach to the problem of pattern recognition based on the statistical
learning theory. Moreover producing sparse dual representation of the hypothesis results in an efficient algorithm which at the same time enforces the
learning biases suggested by the generalization theory.
SVMs differ radically from comparable approaches such as neural networks: they find always a global optimum because of they are formulated as
a Quadratic Programming (QP) optimization problem with box constraints.
Their simple geometric interpretation provides fertile ground for explaining
how they work in a very easy manner.
SVMs are largely characterized by the choice of kernel which maps the
inputs into a feature space in which they are separated by a linear hypothesis. Often the feature space is a very high dimensional one but the so–called
curse of dimensionality problem is cleverly solved turning to the statistical
learning theory. It says essentially that the difficulty of an estimation prob30
CHAPTER 1 STATISTICAL LEARNING THEORY
lem increases drastically with the dimension n of the space since one needs
exponentially many patterns in the dimension n to sample the space properly
(Littlestone, 1988). But the statistical learning theory tells us that learning
into a very high dimensional feature space can be simpler if one uses a low
complexity, i.e. simple class of decision rule. All the variability and richness
that one needs to have a powerful function class is then introduced by the
mapping φ through the kernel function. In short: not the dimensionality but
the complexity of the function matters. In addition, for certain feature spaces
and corresponding mapping φ, there is a highly effective trick for computing
scalar products in a high dimensional feature space using kernel functions.
So SVMs represent a complete framework where several concepts are combined together to form a powerful theory: dimension independent generalization bounds, Mercer kernels and RKHS, regularization theory and QP
optimization represent therefore the foundations of SVMs.
1.3.1
SVMs and the Regularization Theory
There are two possible ways of deriving SVMs: the classical one starts from
the geometrical interpretation of a binary labelled problem. If the data are
linearly separable, we can find a hyperplane that separates the two classes
with maximum margin, that is SVMs look for a linear separator which has the
maximum distance from both positive and negative examples (hard–margin).
The linearly separable case is then generalized to non linearly separable case
by adding slacks variables which permit to violate the margins paying some
penalties in the regularized functional cost (soft–margin). Since the data
only appear through inner products, a Mercer kernel can be plugged into the
algorithm permitting to separate the data by a non linear decision rule.
The second way of introducing SVMs starts from regularization theory
and derives SVMs as the solution of the Tikhonov regularized problem (1.14)
using the representer theorem. In more detail, given a set of independently
and identically distributed binary labelled examples Dm = {(xi , yi ) ∈ X ×
Y}m
i=1 where Y = {+1, −1}, SVMs for binary classification problems solve
31
1.3 Support Vector Machines
the following Tikhonov regularized problem:
m
1 X
V (f (xi ), yi ) + µ kf k2K
min
f ∈HK m
i=1
(1.39)
where K is a Merce kernel, HK is the hypothesis space defined by Equation
(1.26) and V (f (x), y) is the hinge loss function
V (f (x), y) = |1 − yf (x)|+
(1.40)
Since the hinge loss is not differentiable in a point, we cannot follows the
approach employed in the proof of representer theorem. But the hinge loss
is a convex loss function and so the representer theorem is still applicable.
Using the definition of the hinge loss (1.40), we can introduce non negative
slacks variables ξi ≥ 0 to simplify the problem. Let
ξi = |1 − yi f (xi )|+
⇒
ξi ≥ 1 − yi f (xi )
So the problem (1.39) simplifies to
m
1 X
min
ξi + µ kf k2K
f ∈HK m
i=1
(1.41)
subject to to follow constraints
yi f (xi ) ≥ 1 − ξi
(1.42)
ξi ≥ 0, i = 1, . . . , m
(1.43)
Now the representer theorem can be applied to regularized functional (1.41),
giving a general form for the function f ∈ HK
f (x) =
m
X
aj K(xj , x)
(1.44)
j=1
Plugging Equation (1.44) into functional (1.41) follows that
m
min
a∈IRm
1 X
ξi + µa0 Ka
m i=1
subject to yi
m
X
aj K(xj , xi ) ≥ 1 − ξi
j=1
ξi ≥ 0, i = 1, . . . , m
32
(1.45)
CHAPTER 1 STATISTICAL LEARNING THEORY
If we put µ = 1/2Cm and aj = yj αj into the problem (1.45), we obtain
m
X
1 0
α Qα + C
ξi
2
i=1
min
α∈IRm
subject to yi
m
X
yj αj K(xj , xi ) ≥ 1 − ξi
(1.46)
j=1
ξi ≥ 0, i = 1, . . . , m
where Q is an m by m matrix with elements Qij = yi yj K(xi , xj ). Problem
(1.46) is exactly the SVM primal formulation for binary classification problems for non linearly separable case if we omit the offset b in the form of the
prediction function f . There are two equivalent solutions for reintroducing
the regularized offset b into the prediction functions, leading to the following
Tikhonov regularized problem:
m
1 X
V (f (xi ) + b, yi ) + µ kf k2K
min
f ∈HK m
i=1
(1.47)
Using Equation (1.24) and Definition 1.26, f (x) can be seen as the inner
product between a weight vector w and the representation φ(x) of x in the
feature space induced by the kernel
f (x) = hw, φ(x)i =
∞
X
i=1
ak
wk φk (x) where wk = √
λk
(1.48)
So we can introduce an extended version of weight vector w and input x by
adding a constant component in the feature space to each example x
)
w = (w, b)
⇒ f (x) = hw, xi = hw, φ(x)i + b
(1.49)
φ(x) = (φ(x), 1)
Equivalently, we could add a positive quantity to the computation of kernel
K(x, z) = K(x, z) + c20
1.3.2
(1.50)
Primal and Dual Formulations
The main result of previous section is that the SVM primal formulation for
binary classification problems in non linearly separable case can be written
33
1.3 Support Vector Machines
as
m
X
1
hw, wi + C
ξi
2
i=1
min
w,b,ξ
subject to yi (hw, φ(xi )i + b) ≥ 1 − ξi
ξi ≥ 0, i = 1, . . . , m
(1.51)
where C > 0 is the regularization parameter which controls the tradeoff
between fitting the data (empirical error) and the model complexity: a large
value of C favors the empirical error, while a small value leads to a more
regularized prediction function. A very simple heuristic to choose an initial
value for C is
!−2
Pm p
K(x
,
x
)
−
2K(x
,
0
)
+
K(0
,
0
)
i
i
i
X
X
X
i=1
(1.52)
C=
m
where 0X is the null element of input space X . The prediction function f (x)
is
f (x) =
m
X
yi αi K(xi , x) + b = hw, φ(x)i + b
(1.53)
i=1
and the predicted class ŷ of an example x is given by
ŷ = sign(f (x))
(1.54)
Applying the Kuhn–Tucker theory for convex optimization problem and introducing the Lagrangian dual problem, problems (1.51) can be reformulated
in its dual form as
1
e0 α − α0 Qα
α∈IR
2
subject to 0 ≤ αi ≤ C, i = 1, . . . , m
y0α = 0
minm
(1.55)
where α ∈ IRm is the vector of dual variables and e is the vector of all ones
(see Cristianini and Shawe-Taylor (2000) for more details). Then, of course,
the training algorithm depends on the data only through inner product in
HK . Typically the number of non zero αi is much smaller than the number
m of training examples and so Equation (1.53) can be computed efficiently
34
CHAPTER 1 STATISTICAL LEARNING THEORY
summing only on examples xi for which αi > 0. This examples are called
Support Vectors (SVs) and represent the critical elements of training set:
they summarized all the information contained in the data set. If αi = C we
have bounded SVs, while if 0 < αi < C we have unbounded SVs. The Karush–
Kuhn–Tucker (KKT) complementary conditions permit to find the value of
ξi for SVs: if αi = C, then ξi has an arbitrary value, while if 0 < αi < C,
then ξi = 0: so unbounded SVs stay on the margin, while bounded SVs stay
anywhere. The KKT conditions also permits to compute the offset b
b = yi − hw, φ(xi )i
(1.56)
For a better computational stability, we can average on unbounded SVs
b=
X
1
yi − hw, φ(xi )i
|{i : 0 < αi < C}| i:0<α <C
(1.57)
i
A proposition from Cristianini and Shawe-Taylor (2000) shows the relation
between the optimum solution of problem (1.55) and the geometric margin, which measures the Euclidean distance of the points from the decision
boundary in the input space.
Proposition 1.2 Let α∗ and b∗ be the solution of problem (1.55). Then the
decision rule (1.54) computed using α∗ and b∗ in Equation (1.53) is equivalent
to the maximal margin hyperplane in the feature space implicity defined by
the Mercer kernel K and that hyperplane has geometric margin
γ=
sX
αi∗
(1.58)
i:α∗i 6=0
A not so tight upper bound on the expected risk can be obtained by a leave–
one–out (LOO) procedure. It consists on leaving out an example, learning
on the remaining m − 1 data and then classifying the left out example.
Since when a non support vector is omitted, it is correctly classified by the
remaining subset of the training data, the only examples which could perhaps
be misclassified are the SVs. So an upper bound of expected risk using the
LOO estimate is nSV /m, where nSV is number of SVs.
35
1.3 Support Vector Machines
1.3.3
SVMs for Regression
Many extensions of basic SVMs for binary classification are available. In
(Vapnik, 1995) SVMs for regression problems where Y = IR are introduced
for the first time. Also in this case, SVMs are the solution of the Tikhonov
regularized problem (1.39) but with a different loss function
V (f (x), y) = |y − f (x)|
(1.59)
where |·| is the insensitive function. The primal formulation for regression
problems is
m
min ∗
w,b,ξ,ξ
X
1
hw, wi + C
(ξi + ξi∗ )
2
i=1
subject to (hw, φ(xi )i + b) − yi ≤ + ξi
yi − (hw, φ(xi )i + b) ≤ + ξi∗
ξi , ξi∗ ≥ 0, i = 1, . . . , m
(1.60)
where the constraints in problem (1.60) provide that the prediction will be
close to the regression value
− − ξi∗ ≤ (hw, φ(xi )i + b) − yi ≤ + ξi
(1.61)
Applying the Kuhn–Tucker theory leads to its dual formulation
1
y 0 (α∗ − α) − e0 (α∗ + α) − (α∗ − α)0 K(α∗ − α)
α,α ∈IR
2
∗
(1.62)
subject to 0 ≤ αi , αi ≤ C, i = 1, . . . , m
e0 (α∗ − α) = 0
min
m
∗
The prediction function for regression is
f (x) =
m
X
(αi∗ − αi )K(xi , x) + b
(1.63)
i=1
Note that for regression the number of variables is twice the training set size,
resulting in a slower training optimization. If we define
ξ¯i = ξi + yi ,
ξ¯i∗ = ξi∗ − yi ,
36
= −1
CHAPTER 1 STATISTICAL LEARNING THEORY
then the regression problem is converted into a corresponding classification
problem with a double number of constraints:
m
min ∗
w,b,ξ̄,ξ̄
m
X
X
1
hw, wi + C
ξ¯i + C
ξ¯i∗
2
i=1
i=1
(−1) (hw, φ(xi )i + b) ≥ 1 − ξ¯i
(+1) (hw, φ(xi )i + b) ≥ 1 −
ξ¯i − yi ≥ 0, i = 1, . . . , m
(1.64)
ξ¯i∗
ξ¯i∗ + yi ≥ 0, i = 1, . . . , m
1.3.4
Support Vector Clustering
An interesting extension of basic SVMs for binary classification is Support
Vector Clustering (SVC) described in Tax and Duin (1999); Schölkopf et al.
(2001); Ben-Hur et al. (2002) (also called One–Class SVMs or Support Vector
Domain Description (SVDD)). In this algorithm, SVMs are employed to
characterize a set of unlabelled data in terms of SVs allowing to compute
the contour which encloses the data points. The outliers can be handled by
relaxing the enclosing constraint and allowing some points to stay out the
contour of the data set. The primal problem for SVC can be formulated in
following way
min
R,o,ξ
R2 + C
m
X
ξi
i=1
subject to kφ(xi ) − ok ≤ R2 + ξi
(1.65)
ξi ≥ 0, i = 1, . . . , m
where R is the radius of the sphere, o is the center of the sphere, ξi are
slacks variables allowing for soft constraint and C is a parameter balancing
the radius versus the number of outliers. Applying the Kuhn–Tucker theory
leads to its dual formulation
Pm
Pm
maxm
i=1 αi K(xi , xi ) −
i,j=1 αi αj K(xi , xj )
α∈IR
(1.66)
subject to 0 ≤ αi ≤ C, i = 1, . . . , m
The distance of a point to the center of the sphere
R2 (x) = kφ(x) − ok2
37
(1.67)
1.3 Support Vector Machines
can be written as
2
R (x) = K(x, x) − 2
m
X
αi K(xi , x) +
i=1
m
X
αi αj K(xi , xj )
(1.68)
i,j=1
Using Equation (1.68) we can classify each point as belonging or not to the
enclosing sphere. A Gaussian kernel was successfully employed in Ben-Hur
et al. (2002), while a polynomial kernel did not work as well as the Gaussian
one.
1.3.5
Complexity
Problem (1.55) is a QP optimization problem with linear constraints. Several
methods from convex optimization exist for solving QP problems. If we use a
non linear kernel, the number of features in HK is typically much larger than
the number of examples m. Thus, it is convenient to solve the dual problem
(1.55). However in some cases the dimensionality of the feature space HK
is less than m and so it could be more efficient to find a solution for the
primal problem (1.51). Note that Q is a m by m dense matrix and so the
memory requirement becomes a problem if the dataset set size is large. Thus
the solution of the dual problem (1.55) needs of a specialized method which
decomposes the problem into a series of smaller tasks: this decomposition
splits the problem (1.55) in an inactive and an active part called working
set on which the cost functional is optimized (see Algorithm 1.1 for more
details).
A lot of work has been done for finding efficient methods to solve the QP
optimization problem (1.55). Implementation problems and techniques are
discusses by Kaufmann (1999), Joachims (1999), Platt (1998, 1999a), Osuna
et al. (1997); Osuna and Girosi (1998), Keerthi et al. (2000); Shevade et al.
(2000), Steinwart (2003) and Bakır et al. (2005).
The key point is that the number of SVs has a strong impact on the
efficiency of SVMs during both the learning and prediction phases. Training
a SVMs leads to a QP convex optimization problem with bound constraints
and one linear equality constraint. Despite the fact that this type of problem
is well understood, for large learning tasks with many training examples, the
off–the–shelf optimization techniques for general quadratic programs quickly
38
CHAPTER 1 STATISTICAL LEARNING THEORY
become intractable in their memory and time requirements. For prediction,
the complexity depends again on the number of SVs. A recent result from
Steinwart (2003) shows that the number nSV of SVs increases linearly with
the number m of training examples:
nSV
→ 2 min errρ (f )
(1.69)
f ∈HK
m
where minf ∈HK errρ (f ) is the smallest classification error achievable by the
SVMs with kernel K. When using a universal kernel such as the Gaussian
kernel,
min errρ (f ) = errρ (fρ )
f ∈HK
(1.70)
that is the classification error corresponds to the Bayes risk, i.e. the smallest
error achievable by the Bayes function. When the problem is “easy”, then
errρ (fρ ) → 0
(1.71)
and so Equation (1.69) suggests that the number nSV of SVs increases less
than linearly with the number m of examples, resulting in a lower computational requirement. But if the problem is noisy or “very difficult”, then the
Bayes risk could become large leading to a worse performance.
Actually the computational requirements of modern SVMs training algorithms (Joachims, 1999; Chang and Lin, 2002) are very largely determined by
the amount of memory required to store the working set B of the kernel matrix. A general decomposition algorithm for SVMs is illustrated in Algorithm
1.1. For this class of problems the Kuhn–Tucker conditions are necessary and
sufficient conditions for optimality and so can be used a termination criterion
of step 1 in Algorithm 1.1. To check the optimality conditions, we need to
compute the predictions for all the training set elements and so most of time
in each iteration is due to the kernel computation of the q rows of the Hessian
matrix Q. This step has a time complexity of O(qmκ), where κ is the cost to
compute the kernel between two elements (O(qmκ) and not O(q 2 κ) because
of we have to compute the predictions for all the elements of the training
set). To update the predictions of the elements of training set, we can define
the predictions at iteration t as
m
X
f t (xi ) =
yj αjt K(xj , xi )
(1.72)
j=1
39
1.3 Support Vector Machines
Algorithm 1.1 SVM–Training–Algorithm(Dm )
Input: A binary labelled training set Dm = {(xi , yi )}m
i=1
Output: Dual variables αi , i = 1, . . . , m
1: while the optimality conditions are violated do
2:
choose q dual variables αi for the working set B
3:
remaining m − q variables are fixed to their current values
4:
decompose the problem
5:
solve the QP subproblem optimizing on the working set
6: end while
7: return αi , i = 1, . . . , m
and the predictions at the iteration t + 1 are computed by
X
f t+1 (xi ) = f t (xi ) +
yj (αjt+1 − αjt )K(xj , xi )
(1.73)
j∈B
Using the q rows of Q, updating f t+1 (xi ) is done in time O(qm) for all
examples in training set. Also the selection of the next working set, which
includes computing the gradient of cost functional, can be done in O(qm).
Then solving the QP subproblem requires O(q 2 ). The memory requirements
are due to storing the q rows of Q where O(qm) values need to be stored. In
addition, O(q 2 ) is needed to store the sub–matrix of working set and O(m)
to store f t (xi ). So the total cost of each iteration is O(q(m + q + mκ)).
The number of SVs is strictly related to number of iterations and to the
working set size q: the larger is the number of SVs, the bigger will be the
number of iterations and the working set size q to find the optimum. Ideally
q = nSV for having a fast convergence to the optimum but a good starting
point for the dual variables α can considerably speed up the convergence of
algorithm. So, if nSV m, a lower bound for complexity is O(m), while in
the case that all training examples are SVs, the complexity grows to O(m3 ).
On average, as a consequence of Equation (1.69) which relates nSV to m,
the complexity is O(m2 ). Also empirical studies reported in Joachims (1999)
which fit curves on training times, show an empirical scaling of m` with
` ∈ [1.7, 2.1] for some benchmark data sets.
In conclusion, SVMs learning is a polynomial algorithm on the training
set size, but it can be computationally very demanding when the data set size
40
CHAPTER 1 STATISTICAL LEARNING THEORY
and the number of SVs are large. The prediction of a new instance is linear in
the number of SVs as shown in Algorithm 1.2. Sometimes in practical cases,
Algorithm 1.2 SVM–Prediction(x, α, K)
Input: A new example x, dual variables α and a kernel K
Output: The predicted label ŷ of x
P
1: f (x) ←
i:αi 6=0 αi yi K(xi , x)
2: ŷ ← sign (f (x))
3: return ŷ
a linear time prediction in the size of data (for example, nonzero components
in the case of vectors or the length in the case of sequences) is possible
for kernel machines regardless of the number of SVs. Since the prediction
function f (x) for kernel machines is
X
(1.74)
αi yi K(xi , x),
f (x) = hw, φ(x)i =
i:αi 6=0
if the feature vectors are sparse and if the number of SVs is small with respect
to the total number of examples, we can compute directly the feature vectors
of the objects and the value of each component of w as
!
X
w=
αi yi φs (xi )
(1.75)
i:αi 6=0
s∈F
where F is the feature space. Then we can store the non zero values of w in
a look–up table. Moving across x, we can look–up the current feature φs (x)
and increment the prediction by the corresponding value ws φs (x).
1.3.6
Multiclass Classification
In a multiclass classification task the output domain is constituted by a set
Y = {1, 2, . . . , c} of more than two elements. The direct extension of a binary
learning machine to its multiclass version is not always possible or easy to
design. A common solution is to reduce the multiclass problem into several
binary sub–problems whose solutions, arranged in some way, provide the
final solution to the multiclass problem. The simplest but effective way of
41
1.3 Support Vector Machines
handling a multiclass problem is called one–vs–all and consists in defining as
many binary problems as the number of classes, where each binary problem
discerns one class from the others (Rifkin and Klautau, 2004). The classifier
which is more confident on own prediction assigns its class to the instance:
y(x) = arg max fi (x)
1≤i≤c
(1.76)
More complex schemes are possible, for examples ECOC introduced by Dietterich and Bakiri (1995) and recently extended in Allwein et al. (2000).
In the case of SVMs exist several works that address the problem of directly extending the algorithm to the multiclass case (Vapnik, 1998; Weston
and Watkins, 1998; Bredensteiner and Bennet, 1999; Guermeur et al., 2000;
Crammer and Singer, 2000). Resulting methods have the advantage of simultaneously learning all decision functions but they are computationally more
intensive.
42
Chapter 2
Voted Perceptron Algorithm
Kernels can be used in conjunction with several learning algorithms. SVMs
are employed to find a maximal–margin solution with a good generalization
performance since they are theoretically well–founded and there are guarantees on their convergence to an unique global optimum. Unfortunately SVMs
are a computational demanding algorithm and when the number of training
examples is very large, they could become prohibitive. This chapter provides
a novel interpretation of Voted Perceptron (VP) algorithm, an efficient on–
line algorithm whose performance is similar to that one of maximal–margin
classifiers. It include a new view of regularization theory for VP, explaining how fast grows the value of dual variables when the number of epoch
increases and give an upper bound for their value. It also shown an on–line
update rule for the VP dual variables and provides a novelty way to devise
a VP loss function.
2.1
Voted Perceptron
Voted Perceptron (Freund and Schapire, 1999) is an on–line mistake–driven
algorithm for binary classification based on Rosenblatt’s Perceptron algorithm (Rosenblatt, 1958). It is simpler to implement and more efficient in
terms of computational time with respect to SVMs, although there are no
convergence guarantees. On the other hand, as experimentally shown in Fre43
2.1 Voted Perceptron
und and Schapire (1999), the performance obtained with VP tends to the
performance of maximal–margin classifiers. The basic vector–based formulation can be easy kernelized, so it can be efficiently employed in very high
dimensional space and can be extended to generic data structures in a very
simple way. Unlike perceptron algorithm, VP stores and uses all prediction
vectors which are generated after every mistake to classify a new example.
Each such vector is weighted by the number of training examples it will correctly classify until the next mistake will occur: so good prediction vectors
that correctly classify a lot of examples, will have a larger coefficient in the
weighted sum.
2.1.1
Training Algorithm
The training procedure uses a binary labelled training set Dm = {(xi , yi )}m
i=1
drawn by a joint probability distribution on X × Y, where yi ∈ {+1, −1}.
Let T be the number of epochs, i.e. the number of times the algorithm runs
through the training data and let ck be the number of training examples
correctly classified by wk , where the index k counts misclassified examples
during learning. Let E be the total number of mistakes over all T epochs. In
this framework, wk is a sum on k incorrectly classified examples and it will
make (k + 1)–th mistake. Let J be the list of misclassified training examples
over all T epochs. Further we make the simplificative assumptions that
w0 = 0 and that w0 incorrectly classifies the first example x1 . Moreover
c0 = 0 and ck with k ≥ 1 is initialized to 1 to include all perceptrons in
the weighted sum. Finally, let φ : X 7→ F be a mapping from the input
space X into a Hilbert space F which represents the feature space and let
K : X × X 7→ IR be a Mercer kernel.
In the basic vector–based formulation, the Rosenblatt’s perceptron performs binary classification using a linear real–valued function f : X ⊆ IRn 7→
IR defined as
f (x) = hw, xi + b
n
X
=
wi xi + b
(2.1)
i=1
where (w, b) are the parameters that must be learnt from the data. The
44
CHAPTER 2 VOTED PERCEPTRON ALGORITHM
decision rule is given by sign(f (x)). The offset b can be easy omitted by
defining an extended version of weight vector and input as in the case of
SVMs (see Equation (1.49)).
The VP training procedure is the same of Rosenblatt’s perceptron. At
each step k + 1, VP classifies a training example xi using the current vector
of weights wk :
fk (xi ) = hwk , φ(xi )i
k
X
=
yqp K(xqp , xi )
(2.2)
p=1
where q1 , . . . , qk are the indexes of points incorrectly classified so far. If the
predicted label
ŷi = sign (hwk , φ(xi )i)
(2.3)
is different from the true label yi , then wk misclassifies xi and xi becomes
the (k + 1)–th mistake, xqk+1 = xi . So wk is modified depending on the
following update rule:
wk+1 = wk + yqk+1 φ(xqk+1 ),
(2.4)
xqk+1 is added to the list J of incorrectly classified examples and training
goes to next step. Otherwise, if xi is correctly classified by wk , ck , the
number of training examples correctly classified by wk , is increased. The list
J is similar to the set of SVs in SVMs: it contains the training examples
which will be used to classify a new instance. The only difference is that a
training example could appear more times if T > 1, due to several errors on
different epochs during learning. If we include the offset b, the update rule
for b is
bk+1 = bk + yqk+1
⇒
bk+1 =
k+1
X
y qp
(2.5)
p=1
A possible interpretation of Equation (2.4) is that the goal of VP update
rule is to modify the weight vector wk in a such way that the next time it
45
2.1 Voted Perceptron
will classify xi in the following epoch, there will be no error or a smaller error
on xi . More precisely, an error will occur if
yi hwk , φ(xi )i ≤ 0
(2.6)
After the update (2.4), wk is replaced by wk+1 and the next time xi will be
classified, the error will be:
yi hwk+1 , φ(xi )i = yi hwk , φ(xi )i + yi 2 K(xi , xi ) > yi hwk , φ(xi )i
(2.7)
and so there will be a smaller error or no error depending on the K(xi , xi )
value. Note that, as shown in Equation (2.8), it is not useful putting a
constant learning rate η in the update rule:
wk+1 = wk + η yqk+1 φ(xqk+1 )
k+1
X
= η
yqp φ(xqp )
(2.8)
p=1
It is only a scale factor on the absolute value of margin f (x) which does not
modify the predicted label.
VP outputs E weighted perceptrons W = {wk }E
k=1 or, equivalently, a list
E
J = {xqk }E
k=1 of misclassified examples, in addition to their weights {ck }k=1 .
If we employ a kernel function, we have to return the list of misclassified
examples, while, if we work directly in the input space, choosing between
the two alternatives depends on the dimensionality n of φ(x) and on the
number of errors E (see Section 2.1.2 for more details). The vector wk and
the elements xqp of J are related by
wk =
k
X
yqp φ(xqp )
(2.9)
p=1
for appropriate indexes qp corresponding to incorrectly classified examples.
A sketch of VP training algorithm in both cases is detailed in Algorithms 2.1
and 2.2.
2.1.2
Prediction Function
After training, the goal is to predict the label for a new instance x. The
VP prediction function is different from Rosenblatt’s perceptron prediction
46
CHAPTER 2 VOTED PERCEPTRON ALGORITHM
Algorithm 2.1 VP–Training–Algorithm(Dm )
Input: A binary labelled training set Dm = {(xi , yi )}m
i=1
E
Output: W = {(wk , ck )}k=1
1: k ← 0, w 0 ← 0, c0 ← 0
2: for t = 1 to T do
3:
for i = 1 to m do
4:
ŷi ← sign (hwk , φ(xi )i)
5:
if ŷi = yi then
6:
ck ← ck + 1
7:
else
8:
wk+1 ← wk + yi φ(xi )
9:
ck+1 ← 1
10:
k ←k+1
11:
end if
12:
end for
13: end for
14: return W
function and it gives the name to the algorithm. It tries to simulate a maximum margin classifier by a weighed sum of all training perceptrons. So, the
predicted class of a new instance x is
ŷ = sign (f (x))
(2.10)
where the margin f (x) is
f (x) =
E
X
ck sign (fk (x))
k=1
=
E
X
k=1
ck sign
k
X
p=1
47
!
yqp K(xqp , x)
(2.11)
2.1 Voted Perceptron
Algorithm 2.2 VP–Kernel–Training–Algorithm(Dm , K)
Input: A binary labelled training set Dm = {(xi , yi )}m
i=1 and a kernel K
E
Output: J = {(xqk , ck )}k=1
1: k ← 0, c0 ← 0, J ← ∅
2: for t = 1 to T do
3:
for i = 1 to m do
Pk
4:
ŷi ← sign
y
K(x
,
x
)
qp
i
p=1 qp
5:
if ŷi = yi then
6:
ck ← ck + 1
7:
else
8:
J ← J ∪ {xi }
9:
ck+1 ← 1
10:
k ←k+1
11:
end if
12:
end for
13: end for
14: return J
To obtain a smoother weighted summation version of Equation (2.11), the
sign function is replaced by the identity function:
f (x) =
E
X
ck fk (x)
k=1
=
E
X
k=1
ck
k
X
(2.12)
yqp K(xqp , x)
p=1
Note that the prediction function (2.2) used during training is different from
(2.11) and (2.12): the latter is a weighted summation of all training prediction vectors, while during training we only use the current weight vector.
Equation (2.12) provides better result then (2.11) and avoids ties if we have
to compare a set of instances based by their margins: so in the following we
refer to Equation (2.12).
In the linearly separable case, the VP training algorithm will eventually
converge to a consistent hypothesis, i.e. a weight vector that correctly classifies all examples. As this prediction vector makes no further mistakes, the
48
CHAPTER 2 VOTED PERCEPTRON ALGORITHM
last perceptron wE with its weight cE will eventually dominate the weighted
vote in Equation (2.12). So, for linearly separable data, when T → ∞, the
VP converges to Rosenblatt’s perceptron which predicts using the final predictor vector. In the non linearly separable case, it is not clear what happens:
probably there will be a set of prediction vectors associated with the most
misclassified examples that will dominate the weighted sum.
Computing efficiently the prediction in the case we use the standard inner
product in IRn as a kernel, depends on the dimensionality n of x and on the
number of errors E. Exactly, f (x) can be computed in the following two
ways
f (x) =
E
X
ck hwk , xi
(2.13)
k=1
=
E
X
k=1
ck
k
X
yqp xqp , x
(2.14)
p=1
Equation (2.13) takes O(En) time, while Equation (2.14) O(n + E 2 ). If we
indicate by nSV the number of distinct examples in J , we have to compute
and store nSV inner products in IRn and then to compute E(E + 1)/2 sums
of these inner products. If n is of the same order of E, the two Equations
(2.13) and (2.14) are computationally equivalent; otherwise, depending on
the relative order of n and E, one method outperforms the other one. Algorithms 2.3 and 2.4 summarize the VP prediction procedure directly in the
input space and employing a kernel function.
Algorithm 2.3 VP–Prediction(x, W)
Input: A new example x and W = {(wk , ck )}E
k=1
Output: The predicted label ŷ of x
PE
1: f (x) ←
k=1 ck hw k , φ(x)i
2: ŷ ← sign (f (x))
3: return ŷ
49
2.2 Dual VP
Algorithm 2.4 VP–Kernel–Prediction(x, J , K)
Input: A new example x, J = {(xqk , ck )}E
k=1 and a kernel K
Output: The predicted label ŷ of x
PE
Pk
1: f (x) ←
k=1 ck
p=1 yqp K(xqp , x)
2: ŷ ← sign (f (x))
3: return ŷ
2.2
Dual VP
In this section, we derive a new formulation of VP algorithm in the dual
space. VP belongs to the more general family of kernel machines that has
the property of dealing with the data only through inner products. So its
prediction function can be written as
f (x) =
m
X
αi yi K(xi , x) + b
(2.15)
i=1
The VP update rule (2.4) identifies a subset of misclassified training examples
with their coefficients ck , but does not suggest anything on what happens to
dual variables αi . It would be interesting to derive an update rule for dual
variables αi like
αik = αik−1 + ∆αik
which will on–line modify the value of dual variables whenever an error will
occur. To do this, we have to find a relation between Equations (2.12) and
(2.15).
First of all, the perceptron training prediction function fk (x) in Equation
(2.2) can be expressed by a summation of kernels between x and training examples xi , weighted by the number of times βik that xi is incorrectly classified
during learning:
fk (x) =
=
k
X
p=1
m
X
yqp K(xqp , xi )
(2.16)
βik yi K(xi , x)
i=1
50
CHAPTER 2 VOTED PERCEPTRON ALGORITHM
Precisely, βik is number of mistakes made on xi after k errors:
βik
=
τk
X
θ(−yi fkt (xi ))
(2.17)
t=1
where 0 ≤ τk ≤ T is the current epoch number after k mistakes, k1 , . . . , kτk
are the indexes of training prediction functions which misclassified xi and
θ : IR 7→ {1, 0} is the heaviside function which counts errors on training
examples. We use βi instead of βiE to indicate the total number of mistakes
on xi after learning. Indeed, the weights βi represent Rosenblatt’s perceptron
dual variables and the examples for which βi > 0 will constitute the set of
support vectors, i.e. the subset of training data useful for prediction. In
addition, βik must satisfy the constraint:
m
X
βik = k, k = 1, . . . , E
⇒
kβk1 = E
(2.18)
i=1
The perceptron dual variables βik update rule for example xi in the event of
an error is
βik+1 = βik + 1
(2.19)
At this point, it is easy to express the VP prediction function as a summation
on training examples substituting Equation (2.16) into Equation (2.12):
f (x) =
=
E
X
k=1
E
X
ck hwk , φ(x)i
ck fk (x)
k=1
=
=
E
X
ck
m
X
βik yi K(xi , x)
i=1
k=1
m
E
XX
ck βik yi K(xi , x)
i=1 k=1
Comparing Equations (2.20) and (2.15)
m X
E
X
ck βik yi K(xi , x)
=
i=1 k=1
m
X
i=1
51
αi yi K(xi , x)
2.2 Dual VP
follows an equation for αi :
αi =
E
X
ck βik ,
i = 1, . . . , m
(2.20)
k=1
Finally, the update rule for VP dual variables is:
αik = αik−1 + ck βik ,
i = 1, . . . , m
(2.21)
Some remarks about Equation (2.21):
• the final value of ck will be known only after the (k + 1)–th error. Consequently, after (k + 1)–th mistakes, the value of αk can be computed
but not αik+1 because αik+1 needs ck+1 ;
• furthermore any mistake modifies all the values αik for which βik > 0.
For example, after the (k + 1)–th mistake, the value of ck becomes
known and the term ck βik has to be added to αik−1 for which βik > 0;
• if the (k + 1)–th mistake happens on an example with βik = 0, then the
corresponding value of αik will not be updated but βik+1 = 1; only after
the (k + 2)–th error the value of ck+1 will be known and αik+1 will be
able to be updated by the term ck+1 βik+1 .
The VP dual training algorithm is detailed in Algorithm 2.5, while prediction function in Algorithm 2.6. We generalized the training procedure
introducing the concept of margin for VP, i.e. the minimum distance between the separating hyperplane to the nearest point by which the training
data have to be separated. In Table 2.1 is described an example of execution
trace of VP training algorithm focussed on mistakes happened on example
x5 . There are seven mistakes on x5 on a total of E = 100 errors which occur
at different epochs. x5 contributes to wk each time wk−1 misclassifies x5
and so x5 appears more times in the summation to create wk . Note that
the final value of cE is known after presenting all the m examples in the last
epoch and the last row updates α5E after presenting all the data on the last
epoch T .
A theorem described in Freund and Schapire (1999) which extends a
result from Block and Novikoff (1962), gives an upper bound on the number
52
CHAPTER 2 VOTED PERCEPTRON ALGORITHM
Algorithm 2.5 Dual–VP–Training–Algorithm(Dm , γ)
Input: A binary labelled training set Dm = {(xi , yi )}m
i=1 and margin γ
Output: Dual variables αi , i = 1, . . . , m
Require: γ ≥ 0
1: k ← 0, w 0 ← 0, c0 ← 0
0
2: SV ← {∅} {List of Indexes of Support Vectors}
3: for i = 1 to m do {Initialize dual variables}
4:
βi0 ← 0; αi0 ← 0
5: end for
6: for t = 1 to T do
7:
for i = 1 to m do
8:
if yi fk (xi ) ≤ γ then {fk made (k + 1)–th error on xi }
9:
for j ∈ SVk (j = 1 to m : βjk > 0) do {update αjk }
10:
αjk ← αjk−1 + βjk ck {Only ck in known and not ck+1 }
11:
end for
12:
k ← k + 1 {k becomes k + 1}
13:
βik ← βik−1 + 1
14:
ck ← 1
15:
if {i} ∈
/ SVk−1 then
16:
SVk ← SVk−1 ∪ {i}
17:
end if
18:
else {No error on xi }
19:
ck ← ck + 1
20:
end if
21:
end for
22: end for
E
23: for i ∈ SV (i = 1 to m : βiE > 0) do
24:
αiE ← αiE−1 + βiE cE
25: end for
26: return αi , i = 1, . . . , m
53
Table 2.1. An example of VP algorithm execution focussed on example x_5. After k mistakes, β_5^k is known but not c_k, and so only α_5^{k-1} can be evaluated.

k   | t | Mistake            | w^k                                                               | β_5^k | c_{k-1} β_5^{k-1} | α_5^{k-1} = α_5^{k-2} + c_{k-1} β_5^{k-1}
0   | 1 | (initialization)   | w^0 = 0                                                           | 0     | 0                 | 0
1   | 1 | w^0 mistakes x_1   | w^1 = y_1 x_1                                                     | 0     | 0                 | 0
... | 1 | ...                | ...                                                               | ...   | ...               | ...
4   | 1 | w^3 mistakes x_5   | w^4 = w^3 + y_5 x_5 = y_1 x_1 + ... + y_5 x_5                     | 1     | 0                 | 0
5   | 1 | w^4 mistakes x_7   | w^5 = w^3 + y_5 x_5 + y_7 x_7 = y_1 x_1 + ... + y_5 x_5 + y_7 x_7 | 1     | c_4               | c_4
6   | 1 | w^5 mistakes x_9   | w^6 = y_1 x_1 + ... + y_5 x_5 + y_7 x_7 + y_9 x_9                 | 1     | c_5               | c_4 + c_5
... | . | ...                | ...                                                               | ...   | ...               | ...
30  | 2 | w^29 mistakes x_3  | w^30 = ... + y_5 x_5 + ... + y_3 x_3                              | 1     | c_29              | c_4 + ... + c_29
31  | 2 | w^30 mistakes x_5  | w^31 = ... + y_5 x_5 + ... + y_5 x_5                              | 2     | c_30              | c_4 + ... + c_29 + c_30
32  | 2 | w^31 mistakes x_9  | w^32 = ... + y_5 x_5 + ... + y_5 x_5 + y_9 x_9                    | 2     | 2c_31             | c_4 + ... + c_30 + 2c_31
... | . | ...                | ...                                                               | ...   | ...               | ...
50  | 3 | w^49 mistakes x_3  | w^50 = ... + y_5 x_5 + ... + y_5 x_5 + ... + y_3 x_3              | 2     | 2c_49             | c_4 + ... + c_30 + 2(c_31 + ... + c_49)
51  | 3 | w^50 mistakes x_5  | w^51 = ... + y_5 x_5 + ... + y_5 x_5 + ... + y_3 x_3 + y_5 x_5    | 3     | 2c_50             | c_4 + ... + c_30 + 2(c_31 + ... + c_49 + c_50)
52  | 3 | w^51 mistakes x_9  | w^52 = ... + y_5 x_5 + ... + y_5 x_5 + ... + y_5 x_5 + y_9 x_9    | 3     | 3c_51             | c_4 + ... + c_30 + 2(c_31 + ... + c_50) + 3c_51
... | . | ...                | ...                                                               | ...   | ...               | ...
100 | T | w^99 mistakes x_9  | w^100 = w^99 + y_9 x_9                                            | 7     | 7c_99             | α_5^99 = α_5^98 + 7c_99
END | T | no mistake         | w^100 = w^100                                                     | 7     | 7c_100            | α_5^100 = α_5^98 + 7c_99 + 7c_100
Algorithm 2.6 Dual–VP–Prediction(x, α, K)
Input: A new example x, dual variables α and a kernel K
Output: The predicted label ŷ of x
 1: f(x) ← Σ_{i : α_i ≠ 0} α_i y_i K(x_i, x)
 2: ŷ ← sign(f(x))
 3: return ŷ
of mistakes E made by the on–line perceptron algorithm: the importance of this result is that the bound does not depend on the dimensionality of the examples, so the algorithm might perform well in high dimensional spaces. This theorem can also be applied to the VP algorithm, since the VP training procedure is the same as that of Rosenblatt's perceptron.
Theorem 2.1 (Freund and Schapire, 1999) Let D_m = {(x_i, y_i)}_{i=1}^m be a binary labelled set of examples such that

    R = max_{1 ≤ i ≤ m} ||x_i||

Let w be any vector with ||w|| = 1 and let γ > 0. Define the deviation of each example (similar to the slack variables in SVMs) as

    ξ_i = |γ − y_i ⟨w, x_i⟩|_+

and define

    D = ( Σ_{i=1}^m ξ_i² )^{1/2}

Informally, the quantity ξ_i measures how much a point fails to have a margin of γ from the hyperplane. Then the number of mistakes E of Rosenblatt's perceptron algorithm on D_m is bounded by

    E ≤ ( (R + D) / γ )²    (2.22)

For the linearly separable case,

    y_i ⟨w, x_i⟩ ≥ γ  ⇒  ξ_i = 0,   i = 1, ..., m

and so

    E = ||β||_1 ≤ ( R / γ )²    (2.23)
2.3 Regularization
This section introduces a novel approach to study the regularization and generalization properties of VP. Equation (2.20) shows that the VP dual variables α_i are a weighted summation of coefficients c_k, 1 ≤ c_k ≤ m for k = 1, ..., E, which must satisfy

    Σ_{k=1}^E c_k = mT

The weights β_i^k count the number of perceptrons which have incorrectly classified x_i after k errors: in each epoch there is at most one perceptron which misclassifies x_i, so 0 ≤ β_i^k ≤ T. In addition, β_i^k must satisfy the constraint (2.18). Furthermore, the number of mistakes E is bounded by

    T ≤ E ≤ mT

where E ≥ T because at least one mistake happens in each epoch (otherwise the whole training set is correctly classified and no further epochs are necessary), and E ≤ mT because at most all the examples can be misclassified in each epoch. So
• α_i = 0 if x_i is correctly classified in all the epochs (and also β_i = 0);
• α_i is large if x_i is misclassified by perceptrons with a large c_k; the greater the number of perceptrons which incorrectly classify it, the bigger α_i is;
• if x_i is the only example misclassified during each epoch (all E errors happen on x_i), then E = T, c_k = m for k = 1, ..., E and β_i^k = k for k = 1, ..., E. The values of the dual variables are

    α_i = Σ_{k=1}^E c_k β_i^k = m Σ_{k=1}^E k = m E(E + 1)/2 = m T(T + 1)/2    (2.24)
    α_j = 0,   j = 1, ..., m,  j ≠ i

  where Σ_{k=1}^E β_i^k = Σ_{k=1}^E k = E(E + 1)/2;
• in the case that all the examples are misclassified at each epoch, E = mT, c_k = 1 for k = 1, ..., E and

    β_i^{tj} = t − 1 + θ(j − i),   t = 1, ..., T,  j = 1, ..., m

  where the index tj indicates the value of the variable β_i at the t–th epoch after presenting j examples. The values of α_i, i = 1, ..., m, are

    α_i = Σ_{k=1}^E c_k β_i^k = Σ_{k=1}^{mT} β_i^k = Σ_{t=1}^T Σ_{j=1}^m β_i^{tj}
        = Σ_{t=1}^T Σ_{j=1}^m (t − 1 + θ(j − i))
        = m Σ_{t=1}^T (t − 1) + T Σ_{j=1}^m θ(j − i)
        = m T(T − 1)/2 + T(m − i + 1)    (2.25)

For example, the value of α_1 is

    α_1 = m T(T − 1)/2 + Tm = m T(T + 1)/2
Also a formula for the 1–norm of α is obtained:

    ||α||_1 = Σ_{i=1}^m α_i = Σ_{i=1}^m Σ_{k=1}^E c_k β_i^k = Σ_{k=1}^E c_k Σ_{i=1}^m β_i^k = Σ_{k=1}^E c_k k

In the case that only x_i is misclassified in all the epochs, E = T and c_k = m:

    ||α||_1 = Σ_{k=1}^E m k = m E(E + 1)/2 = m T(T + 1)/2    (2.26)

while, if all the examples are misclassified in all epochs, E = mT and c_k = 1:

    ||α||_1 = Σ_{k=1}^E k = E(E + 1)/2 = mT(mT + 1)/2    (2.27)

The same results for ||α||_1 are obtained by summing up Equations (2.24) and (2.25) over all examples.
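As a quick sanity check of Equations (2.25)–(2.27), the following snippet (illustrative only, with arbitrary values of m and T) compares the closed forms against direct summation in the worst case in which every example is misclassified at every epoch.

def theta(n):
    # Heaviside step with theta(0) = 1, as used in Equation (2.25)
    return 1 if n >= 0 else 0

m, T = 7, 5                     # arbitrary training set size and number of epochs
E = m * T                       # all examples misclassified at every epoch

# alpha_i by direct summation of beta_i^{tj} = t - 1 + theta(j - i)
alpha = [sum(t - 1 + theta(j - i) for t in range(1, T + 1)
                                  for j in range(1, m + 1))
         for i in range(1, m + 1)]

# closed form of Equation (2.25)
closed = [m * T * (T - 1) // 2 + T * (m - i + 1) for i in range(1, m + 1)]
assert alpha == closed

# 1-norm: Equation (2.27) versus direct summation
assert sum(alpha) == E * (E + 1) // 2 == m * T * (m * T + 1) // 2
print(alpha[0], m * T * (T + 1) // 2)    # alpha_1 equals m T (T + 1) / 2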
Another way to express Equation (2.12) is to consider each perceptron during the learning process, and not only the perceptrons which made a mistake. Let w^{tj} be the weight vector after presenting j examples during the t–th epoch:

    w^{tj} = Σ_{τ=1}^{t−1} Σ_{i=1}^m μ_i^τ y_i φ(x_i) + Σ_{i=1}^{j} μ_i^t y_i φ(x_i)
           = Σ_{i=1}^m Σ_{τ=1}^{t−1} μ_i^τ y_i φ(x_i) + Σ_{i=1}^m θ(j − i) μ_i^t y_i φ(x_i)    (2.28)
           = Σ_{i=1}^m ( Σ_{τ=1}^{t−1} μ_i^τ + μ_i^t θ(j − i) ) y_i φ(x_i)

where μ_i^t is a binary function defined as

    μ_i^t = 1 if x_i is misclassified at epoch t, 0 otherwise    (2.29)

with the constraint

    Σ_{t=1}^T Σ_{i=1}^m μ_i^t = E
Now, if the weight vector correctly classifies an example j, then w^{tj} = w^{t,j−1}, otherwise w^{tj} is updated using the same VP update rule. The coefficients c_k disappear from the weighted sum because the same w^{tj} is summed several times if it correctly classifies several consecutive examples; w^{1,0} is initialized to 0. The margin f(x) becomes

    f(x) = Σ_{t=1}^T Σ_{j=1}^m w^{tj} · φ(x)
         = Σ_{t=1}^T Σ_{j=1}^m Σ_{i=1}^m ( Σ_{τ=1}^{t−1} μ_i^τ + μ_i^t θ(j − i) ) y_i ⟨φ(x_i), φ(x)⟩    (2.30)
         = Σ_{i=1}^m Σ_{t=1}^T Σ_{j=1}^m ( Σ_{τ=1}^{t−1} μ_i^τ + μ_i^t θ(j − i) ) y_i K(x_i, x)

and thus

    α_i = Σ_{t=1}^T Σ_{j=1}^m ( Σ_{τ=1}^{t−1} μ_i^τ + μ_i^t θ(j − i) )    (2.31)
If x_i is misclassified at all epochs, μ_i^t = 1 for t = 1, ..., T, then the value of α_i is

    α_i = Σ_{t=1}^T Σ_{j=1}^m (t − 1 + θ(j − i)) = m T(T − 1)/2 + T(m − i + 1)    (2.32)

which is the same result reported in Equation (2.25).
In conclusion, the value of the VP dual variables has been derived in two different cases; the maximum value is obtained when all the examples are misclassified in all the epochs:

    0 ≤ α_i ≤ α_MAX = m T(T + 1)/2    (2.33)

By analogy with the dual variables of SVMs, which are bounded by 0 ≤ α_i ≤ C, we can introduce a regularization parameter for the VP algorithm:

    C = m T(T + 1)/2    (2.34)

As the dimension m of the training set is fixed, the VP regularization parameter C grows at most quadratically with the number of epochs T. A large number of epochs favors fitting the data, and so large values of T must be avoided in order to get better generalization properties.
2.4 Complexity
The VP is an on–line algorithm whose performance tends towards that of maximal–margin classifiers (Freund and Schapire, 1999); it is useful when the training set is very large or when a fast learning procedure is needed. As stated in Theorem 2.1, in the linearly separable case the VP algorithm converges to a vector which correctly classifies all the examples, and the number of mistakes, which stops growing after some epochs, is upper bounded by a function of the margin and of the sphere containing the data. The number of support vectors is bounded by the number of mistakes E. If the examples are not linearly separable, a similar result is derived by adding to the radius R of the ball containing the examples a quantity representing how much the data violate the margin. In this case, some examples will remain misclassified even if the training algorithm runs for a large number of epochs, and so the number of misclassified examples will continue to grow. In most cases, the misclassified examples are the same from a certain epoch on, and consequently the number of support vectors remains constant. Experimentally, we observed that the number of support vectors is a small fraction of the training examples.
Finally, the worst–case complexity of the VP training algorithm (which occurs when all the training examples are misclassified at each epoch) is O(Tm²κ), where T is the number of epochs, m is the training set size and κ is the cost of computing the kernel. If the number of support vectors is small and roughly constant with respect to the training set size, the complexity comes down to O(Tmκ) and becomes linear in the number of examples. In practice, VP converges reasonably fast and is mostly linear in the number of examples. The computation of the prediction function as in Equation (2.12) requires only E kernel evaluations, and not E(E + 1)/2 as it might seem. Precisely, if n_SV is the number of distinct examples in J, we have to compute only n_SV kernel values. In fact, computing the margin is linear in the number n_SV of support vectors, as shown in Equation (2.15), where the summation concerns only the SVs.
2.5 Loss Function
In this section we try to derive a loss function for the VP algorithm, presenting a novel approach to the problem. Following the approach described in Bottou (1998), it is possible to derive a loss function for the perceptron algorithm, whose training phase is the same as that of VP. A common method to minimize the sample error (1.2)

    err̂_m(f) = (1/m) Σ_{i=1}^m V(f(x_i), y_i)

is the gradient descent update rule. If f(x) = ⟨w, x⟩ + b, the batch gradient descent update rule is

    w_{t+1} = w_t − η_t ∇_w err̂_m(f_t) = w_t − (η_t/m) Σ_{i=1}^m ∇_w V(f_t(x_i), y_i)    (2.35)

while the on–line gradient descent rule is

    w_{t+1} = w_t − η_t ∇_w V(f_t(x_t), y_t)    (2.36)
where η_t is a variable learning rate. A necessary condition for applying gradient descent is that the loss function must be differentiable. However, if the loss function V(f(x), y) is not differentiable in a finite set of points, we can still apply gradient descent using the following trick: if a point at which the loss is not differentiable is sampled, another point is drawn from the same distribution and the previous one is discarded. This procedure is equivalent to considering the gradient of the loss function to be zero at the examples where the loss is not differentiable, and then sampling the next example.
Since the prediction function is ŷ = sign(f(x)), a possible perceptron loss function is

    V(f(x), y) = ½ (sign(f(x)) − y) f(x) = |−y f(x)|_+    (2.37)
If we distinguish a correctly classified example from a misclassified one, for which y f(x) ≤ 0, the loss function (2.37) becomes

    V(f(x), y) = −y f(x)   if y f(x) ≤ 0
    V(f(x), y) = 0         if the example is correctly classified    (2.38)

This loss function is not differentiable only when f(x) = 0, so we can ignore the non–differentiability using the trick described above. Equation (2.37) is zero when the example is correctly classified, otherwise it is positive and proportional to |f(x)|, as we can see from Equation (2.38). Its first derivative is

    ∂V(f(x), y)/∂w = ½ (sign(f(x)) − y) x = −y x   if y f(x) ≤ 0,   0   if correct    (2.39)
If we consider on–line gradient descent as the method to minimize the sample error for the perceptron, the update rule using the loss function in Equation (2.37), in the case of an error, becomes

    w_{t+1} = w_t + η_t y_t x_t    (2.40)

which is the same as Equation (2.4) if the learning rate η_t is constant; if x_t is correctly classified, the weight vector is not modified. The sample error (1.2) using the loss function (2.37) is zero if all examples are correctly classified or when the weight vector w is null. If the training set is linearly separable, the perceptron algorithm will find a linear separator after some epochs. Otherwise, the weights w will quickly tend towards zero.
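As an illustration of the update rule (2.40), here is a small NumPy sketch of on–line gradient descent with the perceptron loss (2.37); the constant learning rate and the toy random data are arbitrary choices made for the example.

import numpy as np

def perceptron_loss(w, x, y):
    # V(f(x), y) = |-y f(x)|_+ with f(x) = <w, x> (Equation 2.37, no bias term)
    return max(0.0, -y * np.dot(w, x))

def online_perceptron(X, y, epochs=10, eta=1.0):
    """On-line gradient descent with the perceptron loss, update rule (2.40)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:     # non-zero loss: gradient is -yi * xi
                w += eta * yi * xi
    return w

# toy linearly separable data (separating hyperplane through the origin)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w = online_perceptron(X, y)
print(w, sum(perceptron_loss(w, xi, yi) for xi, yi in zip(X, y)))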
Finally, comparing the perceptron loss function (2.37) to the SVM classification loss function (1.6), we note that they differ only in the margin, which adds the term 1 in the SVM loss function.
Chapter 3
Processing Structured Data
Kernel machines rely on kernels for accessing the data. But much real–world data is structured and consequently has no natural representation in a real vector space. Statistical learning in structured domains is becoming one of the central areas of machine learning. Learning algorithms on discrete structures are often derived from vector based methods, where structures are converted into a vector space, but the mapping from structures to vector spaces can also be realized by a kernel function. In this chapter, we present a survey of Mercer kernels for structured data which better capture important relationships between the subparts that compose an object, starting from basic kernels on vectors up to kernels for strings, trees and graphs. The chapter ends by describing recursive neural networks, an alternative approach to kernel machines for processing structured data.
3.1 Basic Kernels
The simplest kernel between two vectors x, z ∈ IR^n is the inner product

    K(x, z) = ⟨x, z⟩    (3.1)

More complex kernels can be constructed as compositions of simpler ones. The polynomial kernel is defined as

    K_pol(x, z) = p(K(x, z))    (3.2)
where p(·) is any polynomial with positive coefficients. A frequently used polynomial kernel in IR^n is

    K_d(x, z) = (⟨x, z⟩ + b)^d = Σ_{s=0}^d (d choose s) b^{d−s} ⟨x, z⟩^s    (3.3)

where the feature space has dimension

    ( n + d choose d ) = (n + d)! / (d! n!) = (n + 1)(n + 2) ··· (n + d) / d!

The feature space of ⟨x, z⟩^s is indexed by all the monomials i of degree s

    φ_i(x) = x_1^{i_1} x_2^{i_2} ··· x_n^{i_n}   subject to   Σ_{j=1}^n i_j = s    (3.4)

while for the feature space of K_d(x, z) it must hold that

    φ_i(x) = x_1^{i_1} x_2^{i_2} ··· x_n^{i_n}   subject to   Σ_{j=1}^n i_j ≤ d    (3.5)

An example of feature map in the case n = 2, d = 2, with b = 0 and b = 1 respectively, is

    x = (x_1, x_2) ∈ IR² ↦ φ(x) = (x_1², x_2², √2 x_1 x_2) ∈ IR³
    x = (x_1, x_2) ∈ IR² ↦ φ(x) = (x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1) ∈ IR⁶
Another commonly used class of kernels is the Radial Basis Function (RBF) class

    K_RBF(x, z) = g(d(x, z))    (3.6)

where d is a metric on X and g is a function on IR_0^+. The Gaussian kernel is a special case of RBF kernel where

    K_σ(x, z) = exp( − (K(x, x) − 2K(x, z) + K(z, z)) / (2σ²) )    (3.7)
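A minimal NumPy sketch of the polynomial kernel (3.3) and the Gaussian kernel (3.7), written here in terms of the Euclidean inner product; the function names and default parameter values are illustrative.

import numpy as np

def polynomial_kernel(x, z, d=2, b=1.0):
    # K_d(x, z) = (<x, z> + b)^d, Equation (3.3)
    return (np.dot(x, z) + b) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # K_sigma(x, z) = exp(-(K(x,x) - 2K(x,z) + K(z,z)) / (2 sigma^2)), Equation (3.7),
    # with K the linear kernel, i.e. exp(-||x - z||^2 / (2 sigma^2))
    sq_dist = np.dot(x, x) - 2 * np.dot(x, z) + np.dot(z, z)
    return np.exp(-sq_dist / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))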
3.2 Constructing Kernels
The class of Mercer kernels has interesting closure properties with respect to some operations. In detail, it is closed under addition, product, multiplication by a positive constant and pointwise limits, so it forms a closed convex cone (Berg et al., 1984). In the case of addition, the features of the sum of two kernels are the union of the features of the two kernels

    K_add(x, z) = K_1(x, z) + K_2(x, z)
                = ⟨φ_1(x), φ_1(z)⟩ + ⟨φ_2(x), φ_2(z)⟩
                = ⟨(φ_1(x), φ_2(x)), (φ_1(z), φ_2(z))⟩    (3.8)
while in the case of multiplication the features of the product of two kernels are the Cartesian product of the sets of features of the two kernels

    K_prod(x, z) = K_1(x, z) K_2(x, z)
                 = ⟨φ_1(x), φ_1(z)⟩ ⟨φ_2(x), φ_2(z)⟩
                 = Σ_{i=1}^n φ_{1i}(x) φ_{1i}(z) Σ_{j=1}^m φ_{2j}(x) φ_{2j}(z)
                 = Σ_{i=1}^n Σ_{j=1}^m (φ_{1i}(x) φ_{2j}(x)) (φ_{1i}(z) φ_{2j}(z))
                 = Σ_{k=1}^{nm} φ_{12k}(x) φ_{12k}(z) = ⟨φ_{12}(x), φ_{12}(z)⟩    (3.9)

where φ_12(x) = φ_1(x) × φ_2(x). The above properties are also valid in the case that the two kernels are defined on different domains (Haussler, 1999).
If K_1 and K_2 are two kernels defined respectively on X_1 × X_1 and X_2 × X_2, then the direct sum and the tensor product

    K_1 ⊕ K_2((x_1, x_2), (z_1, z_2)) = K_1(x_1, z_1) + K_2(x_2, z_2)    (3.10)
    K_1 ⊗ K_2((x_1, x_2), (z_1, z_2)) = K_1(x_1, z_1) K_2(x_2, z_2)    (3.11)

are two kernels on (X_1 × X_2) × (X_1 × X_2), where x_1, z_1 ∈ X_1 and x_2, z_2 ∈ X_2. Also the zero extension of K is a kernel: if S ⊆ X and K is a kernel on S × S, then K can be extended to X × X by defining K(x, z) = 0 if either x or z is not in S.
An extension of these concepts is at the basis of the so–called convolution kernels, where the kernel on an object is built up from kernels defined on its parts, through a relation which captures the semantics of the composite object. The decomposition represents a flexible approach for inducing a similarity measure over complex objects based on the similarity between their parts. Given x, z ∈ X, let ~x = (x_1, ..., x_D), ~z = (z_1, ..., z_D) ∈ X_1 × ··· × X_D be tuples of parts of these objects. Let R be a relation on X_1 × ··· × X_D × X between x and its parts ~x: then we can define a decomposition of x as R^{-1}(x) = {~x : R(~x, x)}. The R–convolution kernel K_1 ⋆ ··· ⋆ K_D on X × X is defined as the zero extension to X of the following kernel K_S on S × S, with S = {x : R^{-1}(x) ≠ ∅} (Haussler, 1999)

    K_S(x, z) = Σ_{~x ∈ R^{-1}(x)} Σ_{~z ∈ R^{-1}(z)} Π_{d=1}^D K_d(x_d, z_d)    (3.12)

where K_d : X_d × X_d ↦ IR, d = 1, ..., D, is a kernel on X_d × X_d which measures the similarity of parts. Because of their generality, convolution kernels require a significant amount of work to adapt them to a specific problem, which makes choosing R in real–world applications a non trivial task.
A kernel on sets X, Z ⊆ X can be derived from Equation (3.12) by letting R be the set membership function. Suppose D = 1, so that ~x = x_1 = x, and define x ∈ R^{-1}(X) ⇔ x ∈ X. Then the set kernel can be defined as

    K_set(X, Z) = Σ_{x∈X} Σ_{z∈Z} K(x, z)    (3.13)

where K is a kernel on X (Gärtner et al., 2002). If K(x, z) is the exact matching kernel

    δ(x, z) = 1 if x = z, 0 otherwise    (3.14)

then we have the intersection set kernel

    K_∩(X, Z) = |X ∩ Z|    (3.15)

A more general version of the intersection set kernel is

    K_∩(X, Z) = μ(X ∩ Z)    (3.16)

where μ is a measure or a probability density function on X. As a consequence, the minimum between two nonnegative reals x, z ∈ IR^+ is a kernel

    K_min(x, z) = min(x, z)    (3.17)
In fact, a nonnegative real number x can be represented by the interval [0, x], that is, by the set {y : 0 ≤ y ≤ x}, and the minimum of two reals corresponds to the intersection of the corresponding sets.
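A small Python sketch of the set kernel (3.13) and of its exact–matching special case (3.15); the base kernel passed as an argument is arbitrary.

def set_kernel(X, Z, k):
    # K_set(X, Z) = sum over all pairs of the base kernel k, Equation (3.13)
    return sum(k(x, z) for x in X for z in Z)

def exact_match(x, z):
    # delta kernel, Equation (3.14)
    return 1.0 if x == z else 0.0

A = {"a", "b", "c"}
B = {"b", "c", "d"}
# with the exact matching kernel, the set kernel reduces to |A ∩ B|, Equation (3.15)
print(set_kernel(A, B, exact_match), len(A & B))
# the minimum of two nonnegative reals is the intersection kernel of [0, x] and [0, z]
print(min(3.0, 5.0))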
A more complex type of R–convolution kernel is the Analysis of Variance (ANOVA) kernel (Saunders et al., 1998; Vapnik, 1998). Let X = S^n for some set S and let K_i : S × S ↦ IR, i = 1, ..., n, be a set of kernels, which will typically all be the same. The ANOVA kernel of order D, D = 1, ..., n, is defined as

    K_ANOVA(x, z) = Σ_{1 ≤ i_1 < ··· < i_D ≤ n} Π_{d=1}^D K_{i_d}(x_{i_d}, z_{i_d})    (3.18)

By varying D between the two extremes 1 and n, we get a range of kernels from the tensor product to the direct sum. If D = n, the sum consists only of the term for which i_1 = 1, ..., i_D = n, and K_ANOVA = K_1 ⊗ ··· ⊗ K_n; if D = 1, each product collapses into a single factor while i_1 ranges from 1 to n, giving K_ANOVA = K_1 ⊕ ··· ⊕ K_n. Further, by choosing an appropriate definition of X and R, we get a range of kernels from (ordinary) sums to products. A recursive formula allows these kernels to be computed efficiently even when the number of terms is exponential (e.g. when D = n/2), as sketched below.
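A minimal sketch of one such recursion, assuming that all base kernels are the same function k; the recurrence over the first j coordinates either skips coordinate j or uses it as the d–th selected factor.

def anova_kernel(x, z, D, k):
    """ANOVA kernel of order D (Equation 3.18) by dynamic programming.
    KA[d][j] is the order-d ANOVA kernel restricted to the first j coordinates."""
    n = len(x)
    KA = [[0.0] * (n + 1) for _ in range(D + 1)]
    KA[0] = [1.0] * (n + 1)                 # empty product: order 0 is 1
    for d in range(1, D + 1):
        for j in range(1, n + 1):
            # either coordinate j is not selected, or it contributes k(x_j, z_j)
            KA[d][j] = KA[d][j - 1] + k(x[j - 1], z[j - 1]) * KA[d - 1][j - 1]
    return KA[D][n]

def k_lin(a, b):
    return a * b

x, z = [1.0, 2.0, 3.0], [0.5, 1.0, -1.0]
# order 1 gives the direct sum, order n the tensor product of the base kernels
print(anova_kernel(x, z, 1, k_lin), sum(a * b for a, b in zip(x, z)))
print(anova_kernel(x, z, 3, k_lin), (1.0 * 0.5) * (2.0 * 1.0) * (3.0 * -1.0))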
Finally, note that the composition of two kernels K_1 and K_2

    K_1 ∘ K_2(x, z) = K_1(φ_2(x), φ_2(z))
                    = ⟨φ_1(φ_2(x)), φ_1(φ_2(z))⟩    (3.19)
                    = ⟨φ_1 ∘ φ_2(x), φ_1 ∘ φ_2(z)⟩

amounts to finding a non linear separation in the feature space F_2 of K_2 or, equivalently, to mapping the data from the feature space F_2 into an even higher dimensional feature space F_{1∘2} through the feature map φ_1 : F_2 ↦ F_{1∘2} corresponding to K_1.
3.3 Normalizing the Kernel
In order to remove the bias introduced by the size of the data (for example, the dimensionality of the space in the case of vectors, or the number of vertices in the case of sequences, trees or graphs), it can be useful to define a normalized version of a kernel K(x, z)

    K_norm(x, z) = K(x, z) / sqrt( K(x, x) K(z, z) ) = ⟨ φ(x)/||φ(x)||, φ(z)/||φ(z)|| ⟩    (3.20)

which corresponds to the feature map

    x ↦ φ(x) / ||φ(x)||    (3.21)

So K_norm(x, z) ∈ [0, 1] for all x, z ∈ X and K_norm(x, x) = 1 for all x ∈ X: the Gram matrix then has 1 on the diagonal and all the other entries lie in the interval [0, 1].
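A one–function Python sketch of Equation (3.20), usable with any of the kernels above.

import math

def normalize(k):
    # returns the normalized kernel of Equation (3.20)
    def k_norm(x, z):
        return k(x, z) / math.sqrt(k(x, x) * k(z, z))
    return k_norm

dot = lambda x, z: sum(a * b for a, b in zip(x, z))
k = normalize(dot)
print(k((3.0, 4.0), (3.0, 4.0)), k((1.0, 0.0), (1.0, 1.0)))   # 1.0 and cos(45 degrees)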
3.4 Kernels for Discrete Objects
Most real world data are structured and cannot be easily converted into a vector based representation, so kernels for various kinds of structured data have recently been proposed. Graphs are a widely used structure for representing real world data with complex relationships between parts, and thus investigating kernels on objects modelled by graphs is an important and interesting challenge. Specific classes of graphs such as sequences and trees also often represent a natural way to model semi–structured data such as biological sequences, natural language texts, HTML and XML documents and so on.
In the case of discrete data structures, there is the problem of converting them into feature vectors. A common way is to use a bag of something representation, where each component φ_i(x) of the feature vector φ(x) counts the number of occurrences of some item in the data structure:

    K(x, z) = ⟨φ(x), φ(z)⟩ = Σ_{i=1}^∞ φ_i(x) φ_i(z)    (3.22)

Often the bag of something representation decomposes an object into substructures (i.e. subsequences, subtrees or subgraphs) and the feature vector is made of the counts of the substructures. As the dimensionality of the feature vectors is typically very high, we have to adopt efficient procedures for avoiding the explicit computation of the feature vectors. In the following, we present an overview of kernels for real world data such as sequences, trees and graphs.
3.5 Kernels for Strings
Sequences represent the first significant improvement over static data types such as records or fixed–size numerical arrays. Two important new features arise when modelling data using sequences with respect to static data:
• the number of elements in a sequence is not fixed;
• a serial order relation is defined among the elements of the sequence.
So a sequence permits representing data objects of variable size, and its serial order provides additional information which is not encoded within the elements of the sequence. Serial order makes sequences naturally suited for modelling data in temporal domains, where the elements of a sequence can be associated with temporal events. Sequential–pattern recognition tasks such as speech recognition also exploit variable length sequences for modelling the objects of the speech domain. Sequences of symbols have been widely employed for describing biological sequences like proteins, DNA and RNA. Besides, text documents can be represented as strings of characters. We now introduce some definitions about strings.
Definition 3.1 Let A be a finite alphabet. A string is a finite sequence of characters from A, including the empty sequence. Let A^k be the set of all finite strings of length k and let

    A* = ∪_{k=0}^∞ A^k    (3.23)

be the input space of all finite length strings. Let A^0 = {ε} contain only the empty string ε. We denote by |s| the length of a string s and by st the concatenation of strings s and t. The string s[i : j] is the substring s_i ... s_j of s. A string u is a subsequence of s if there exist indexes i = (i_1, ..., i_{|u|}) with 1 ≤ i_1 < ··· < i_{|u|} ≤ |s| such that u_j = s_{i_j} for j = 1, ..., |u|, or u = s[i] for short. The length ℓ(i) of the subsequence u in s is i_{|u|} − i_1 + 1. Note that the definition of subsequence does not require u to be contiguous in s: if u is not contiguous in s then ℓ(i) > |u|, otherwise ℓ(i) = |u|. For contiguous subsequences, we use the notation u ⊑ s.
3.5.1 Spectrum Kernel
An efficient sequence–similarity kernel which does not depend on any generative model, and can therefore be used with discriminative methods, was introduced by Leslie et al. (2002a). It counts the common subsequences of a fixed length between two strings. First, let us define the k–spectrum of a string.

Definition 3.2 (k–spectrum) Given a number k ≥ 1, the k–spectrum of a string is the set of all the contiguous subsequences of length k ("k–mers") that it contains.

The feature space of the spectrum kernel is a weighted representation of the k–spectrum. It is indexed by all the subsequences s of length k over the alphabet A, and each element counts the number of times a k–mer occurs in the sequence. The feature map φ : A* ↦ IR^{|A|^k} is defined as

    φ_k(x) = (φ_s(x))_{s∈A^k}    (3.24)

where

    φ_s(x) = number of times s ∈ A^k occurs in x    (3.25)

Then the k–spectrum kernel between x and z is the inner product in the feature space (see Figure 3.1)

    K_k(x, z) = ⟨φ_k(x), φ_k(z)⟩    (3.26)

Note that the maximum number of non zero features is length(x) − k + 1 out of a total of |A|^k features: the feature vectors are therefore sparse.
Figure 3.1. The spectrum kernel counts the common subsequences of a fixed length between two strings (for example ...LSIGDVAKKLKEMWNNT... and ...VSAKKEMDKDTAKKQWI...).
Computing the kernel between two strings x and z of the same length ℓ takes O(kℓ) time using a suffix tree to store all the k–mers of x and z (Ukkonen, 1992, 1995). Linear time prediction O(ℓ) is possible because of the sparseness of the feature vectors and the small number of SVs with respect to the total number of examples. In this case, we can compute the value of each component of w and store the non zero values in a look–up table. Moving a k–length sliding window across x, we can look up the current k–mer and increment the prediction by the corresponding value (see Section 1.3.5 for more details).
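A small Python sketch of the k–spectrum kernel, counting k–mers with a dictionary rather than a suffix tree; for fixed k this naive version runs in time linear in the lengths of the two strings. The example strings are those of Figure 3.1.

from collections import Counter

def k_spectrum(x, k):
    # feature map (3.24)-(3.25): counts of every contiguous k-mer of x
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, z, k):
    # K_k(x, z) = <phi_k(x), phi_k(z)>, Equation (3.26)
    fx, fz = k_spectrum(x, k), k_spectrum(z, k)
    return sum(c * fz[s] for s, c in fx.items())

print(spectrum_kernel("LSIGDVAKKLKEMWNNT", "VSAKKEMDKDTAKKQWI", 3))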
3.5.2 Mismatch String Kernel
A variant of the spectrum kernel allowing approximate matches between subsequences was introduced by Leslie et al. (2002b). It permits substitutions in the matching k–mers, relaxing the tight constraint that two k–mers match only if they are identical.

Definition 3.3 ((k, t)–neighborhood of a k–mer) Given a k–mer s, the (k, t)–neighborhood N_(k,t)(s) generated by s is the set of all k–length sequences r from A^k that differ from s by at most t mismatches.

Note that the number of k–mers within t mismatches of any given k–mer is

    |N_(k,t)(s)| = Σ_{i=0}^t (k choose i) (|A| − 1)^i = O(k^t |A|^t)    (3.27)

The feature map of a k–mer s is defined as

    φ_(k,t)(s) = (φ_r(s))_{r∈A^k}    (3.28)

where

    φ_r(s) = 1 if r ∈ N_(k,t)(s), 0 otherwise    (3.29)

The extension of the feature map to a sequence x is obtained by summing the feature vectors of all the k–mers in x

    φ_(k,t)(x) = Σ_{k–mers s in x} φ_(k,t)(s)    (3.30)
The (k, t)–mismatch kernel between x and z is the inner product in the feature space

    K_(k,t)(x, z) = ⟨φ_(k,t)(x), φ_(k,t)(z)⟩    (3.31)

The r–th coordinate of φ_(k,t)(x), with r ∈ A^k, counts all instances of the k–mer r occurring with up to t mismatches in x. Note that the spectrum kernel is the particular case of the mismatch string kernel where t = 0.
A (k, t)–mismatch tree of depth k, a data structure similar to a suffix tree, can be used to compute the entire kernel matrix efficiently. Assuming a data set of m sequences, each of length ℓ (so mℓ is the total length of the data set), the worst case kernel computation takes O(m²ℓ k^t |A|^t) time, since the effective number of k–mers that need to be traversed in the recursive procedure for creating the mismatch tree grows as O(mℓ k^t |A|^t). As in the case of the spectrum kernel, the prediction function can be computed in linear time O(ℓ) by storing the non zero values of the weight vector w in a look–up table and looking up all the k–mers of the new string. In the various tasks in which the mismatch string kernel has been used, small values of t give better performance, resulting in a fast kernel computation.
3.5.3 String Subsequence Kernel
The String Subsequence Kernel (SSK) was introduced by Lodhi et al. (2001) for comparing two text documents by means of the non contiguous substrings they contain, where the degree of contiguity weights the contribution to the kernel by an exponential decaying factor 0 < λ ≤ 1. Compared with the mismatch kernel, it also allows for insertions and deletions, in addition to substitutions, in the matching k–mers. We first derive the SSK starting from its feature space and then show how it can be efficiently evaluated by a dynamic programming technique.
Given a number k ≥ 1, the feature map φ : A* ↦ IR^{|A|^k} for a string x is the vector of occurrences of subsequences s ∈ A^k in the string x, weighted according to their lengths

    φ(x) = (φ_s(x))_{s∈A^k} = ( Σ_{i : s = x[i]} λ^{ℓ(i)} )_{s∈A^k}    (3.32)
So the kernel is the inner product in the feature space

    K_k(x, z) = ⟨φ(x), φ(z)⟩ = Σ_{s∈A^k} φ_s(x) φ_s(z)
              = Σ_{s∈A^k} Σ_{i : s = x[i]} λ^{ℓ(i)} Σ_{j : s = z[j]} λ^{ℓ(j)}
              = Σ_{s∈A^k} Σ_{i : s = x[i]} Σ_{j : s = z[j]} λ^{ℓ(i)+ℓ(j)}    (3.33)

and represents a sum over all common subsequences, weighted according to their frequency and length. For long strings the feature vectors have many non zero entries, so the explicit computation of the feature vectors takes O(|A|^k) time and space.
We now show how to compute the SSK efficiently in a recursive way using a dynamic programming technique: the recursion is based on the observation that extending the strings by one character multiplies the contribution of each occurrence by a factor of λ. First of all, we introduce an auxiliary kernel function which counts the length from the start of the particular subsequence occurrence to the end of the strings x and z, instead of only ℓ(i) and ℓ(j) (note that ℓ(i) = i_{|s|} − i_1 + 1 and ℓ(j) = j_{|s|} − j_1 + 1, and that only the gaps within the subsequence have to be penalized):

    K'_i(x, z) = Σ_{s∈A^i} Σ_{i : s = x[i]} Σ_{j : s = z[j]} λ^{|x| − i_1 + 1 + |z| − j_1 + 1},   i = 1, ..., k − 1    (3.34)

The SSK can be evaluated by the following recursive procedure. First, we define the base cases, in which the kernel is zero if the length of the shorter string is less than the k–mer length, and K' is 1 for 0–mers:

    K'_0(x, z) = 1   for all x, z ∈ A*
    K'_i(x, z) = 0   if min(|x|, |z|) < i
    K_i(x, z) = 0    if min(|x|, |z|) < i

Then the recursive step is defined for the auxiliary kernel K'_i(x, z), i = 1, ..., k − 1, and for K_k(x, z), when a character a ∈ A is concatenated:

    K'_i(xa, z) = λ K'_i(x, z) + Σ_{j : z_j = a} K'_{i−1}(x, z[1 : j − 1]) λ^{|z| − j + 2}
    K_k(xa, z) = K_k(x, z) + Σ_{j : z_j = a} K'_{k−1}(x, z[1 : j − 1]) λ²    (3.35)
We can extend the evaluation of the kernel K_k(x, z) to a range of different k–mer lengths and then combine these values using non negative weights c_k ≥ 0:

    K(x, z) = Σ_k c_k K_k(x, z)    (3.36)

Computing the SSK K_k(x, z) between two strings x and z as specified in Equation (3.35) takes O(k|x||z|²) time, where the squared factor can be associated with the shorter string. By introducing a further auxiliary kernel function, the computational complexity can be reduced to O(k|x||z|) (see Lodhi et al. (2001) for more details).
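A minimal memoized Python sketch of the naive O(k|x||z|²) recursion (3.35); prefixes are addressed by their lengths, and the function names are illustrative.

from functools import lru_cache

def ssk(x, z, k, lam):
    """String Subsequence Kernel K_k(x, z) via the recursion in (3.35),
    with decay factor 0 < lam <= 1 penalizing gaps."""

    @lru_cache(maxsize=None)
    def k_prime(i, n, m):
        # K'_i on the prefixes x[:n], z[:m]
        if i == 0:
            return 1.0
        if min(n, m) < i:
            return 0.0
        a = x[n - 1]                         # last character of x[:n]
        total = lam * k_prime(i, n - 1, m)
        for j in range(1, m + 1):
            if z[j - 1] == a:
                total += k_prime(i - 1, n - 1, j - 1) * lam ** (m - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k_full(n, m):
        # K_k on the prefixes x[:n], z[:m]
        if min(n, m) < k:
            return 0.0
        a = x[n - 1]
        total = k_full(n - 1, m)
        for j in range(1, m + 1):
            if z[j - 1] == a:
                total += k_prime(k - 1, n - 1, j - 1) * lam ** 2
        return total

    return k_full(len(x), len(z))

# with lam = 1 and contiguous matches this reduces to the spectrum kernel
print(ssk("cat", "cart", 2, 0.5))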
Finally, note that the spectrum kernel K_k(x, z) is a particular case of SSK where λ = 1 and where the indexes i and j in Equation (3.33) identify only contiguous subsequences of length k. Also the mismatch kernel K_(k,t)(x, z) can be obtained, provided that λ = 1 and that the indexes i and j locate the subsequences of length k, k − 1, ..., k − t with ℓ(i) = ℓ(j) = k.
3.5.4 Weighted String Kernel
Another kernel for strings is described in Viswanathan and Smola (2003). The basic assumption is that the feature vectors are sparse with respect to the large dimension of the feature space, so an efficient way to compute the kernel is to sort the non zero elements of the feature vectors and then compute the inner product of the sorted sparse vectors. If the sorting procedure is done in a clever way, the complexity of the kernel is linear in the total number of non zero components. Efficient sorting can be realized by compressing the set of all substrings into a suffix tree and by keeping a look–up table associating weights to substrings (Viswanathan and Smola, 2003). The general form of the string kernel is

    K(x, z) = Σ_{s⊑x} Σ_{u⊑z} I(s = u) w_s = Σ_{s∈A*} num_s(x) num_s(z) w_s    (3.37)

where w_s is a weight associated with each substring s and num_s(x) counts the number of occurrences of s in x. Experimental results described in Viswanathan and Smola (2003) using the weight

    w_s = 1 if |s| ≤ 3,   w_s = λ^{|s|} if |s| > 3

are comparable to those in Leslie et al. (2002a).
Computing the kernel between x and z takes O(|x| + |z|) time, while computing predictions requires O(|x|) time once the set of SVs is established.
3.5.5 Dynamic Time–Alignment Kernel
Shimodaira et al. (2001a,b) introduced the Dynamic Time–Alignment Kernel (DTAK) for sequential–pattern recognition problems such as speech recognition. It is a direct extension of vector based kernels to the case of variable length sequences, which incorporates the operation of dynamic time alignment into the kernel function.
We are given two sequences of frame vectors X = (x_1, ..., x_{L_X}) and Z = (z_1, ..., z_{L_Z}), where x_i, z_i ∈ IR^n, |X| = L_X and |Z| = L_Z. If |X| = |Z| = L, then the kernel between X and Z is

    K_L(X, Z) = Σ_{k=1}^L κ(x_k, z_k)    (3.38)

where κ is a kernel for vectors in IR^n. If |X| ≠ |Z|, then we can introduce two time–warping functions ψ(k) and ζ(k) of the normalized time frame k for the patterns X and Z respectively, which align the lengths of the patterns, and define the kernel as

    K_TW(X, Z) = (1/M) Σ_{k=1}^M κ(x_{ψ(k)}, z_{ζ(k)})    (3.39)

where M is a normalized length that can be either |X|, |Z| or a positive integer. With a linear time–warping function, ψ(k) and ζ(k) take the form

    ψ(k) = ⌈|X| k / M⌉,   ζ(k) = ⌈|Z| k / M⌉    (3.40)

where ⌈x⌉ is the ceiling function, which gives the smallest integer greater than or equal to x. However, non–linear time warping, also called dynamic time warping (DTW), has shown better performance than the linear one. DTW uses a distance/distortion measure for finding the optimal path that
maximizes the accumulated similarity:

    K_DTAK(X, Z) = max_{ψ,ζ} (1/P_{ψζ}) Σ_{k=1}^M ω(k) κ(x_{ψ(k)}, z_{ζ(k)})    (3.41)

    subject to   1 ≤ ψ(k) ≤ ψ(k + 1) ≤ |X|,   ψ(k + 1) − ψ(k) ≤ Δ
                 1 ≤ ζ(k) ≤ ζ(k + 1) ≤ |Z|,   ζ(k + 1) − ζ(k) ≤ Δ

where ω(k) is a nonnegative path weighting coefficient, P_{ψζ} is a path normalizing factor and Δ is a constant which ensures local continuity. A method for computing P_{ψζ} is

    P_{ψζ} = Σ_{k=1}^M ω(k)    (3.42)

where the ω(k) are chosen so that P_{ψζ} is independent of the warping functions. A dynamic programming strategy allows the optimization problem (3.41) to be solved efficiently by the following recursive formula:

    K_DTAK(X, Z) = Q(|X|, |Z|) / (|X| + |Z|)    (3.43)

where

    Q(i, j) = max{ Q(i − 1, j) + κ(x_i, z_j),  Q(i − 1, j − 1) + 2κ(x_i, z_j),  Q(i, j − 1) + κ(x_i, z_j) }    (3.44)

The recursion computes Q over all pairs (i, j), so evaluating K_DTAK(X, Z) takes O(|X||Z|) kernel evaluations.
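A minimal sketch of the dynamic programming recursion (3.43)–(3.44); the zero boundary conditions Q(0, ·) = Q(·, 0) = 0 are an assumption made here, since the text does not spell them out.

import numpy as np

def dtak(X, Z, kappa):
    """Dynamic Time-Alignment Kernel via the recursion (3.44),
    normalized as in (3.43). X and Z are sequences of frame vectors."""
    n, m = len(X), len(Z)
    Q = np.zeros((n + 1, m + 1))          # assumed boundary: Q(0, .) = Q(., 0) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = kappa(X[i - 1], Z[j - 1])
            Q[i, j] = max(Q[i - 1, j] + k,
                          Q[i - 1, j - 1] + 2 * k,
                          Q[i, j - 1] + k)
    return Q[n, m] / (n + m)

rbf = lambda x, z: float(np.exp(-np.sum((x - z) ** 2)))
X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
Z = [np.array([0.0]), np.array([2.0])]
print(dtak(X, Z, rbf))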
3.5.6 Dynamic Alignment Kernel
A kernel for sequences of different lengths that exploits the scores produced by dynamic alignment algorithms was introduced by Watkins (2000). The idea is to express the alignment scores produced by Pair Hidden Markov Models (PHMM) as inner products in some feature space.
We first describe Conditionally Symmetric Independence Kernels (CSIK), a class of joint probability distributions (JPD) that can be expressed as an inner product in some feature space. Precisely, a JPD can be used as a scoring function between two objects, assigning higher probability to more related objects. We will show that a conditionally symmetrically independent joint probability distribution is a kernel; it is not clear whether the converse holds, that is, whether all kernels arising from JPDs are conditionally symmetrically independent.
Definition 3.4 (Conditionally Symmetrically Independent) A JPD
is conditionally symmetrically independent (CSI) if it is a mixture of a finite
or countable number of symmetric independent distributions.
In order to show that a CSI joint probability distribution is a Mercer kernel, it must be written as an inner product. Let X, Z and H be three discrete random variables and let p be a CSI JPD such that

    p(x, z) = p(z, x) = Pr{X = x and Z = z}    (3.45)
    p(x, z|h) = Pr{X = x, Z = z | H = h} = p(x|h) p(z|h)    (3.46)

Then

    p(x, z) = Σ_{h∈H} p(x|h) p(z|h) p(h) = Σ_{h∈H} ( p(x|h) √p(h) ) ( p(z|h) √p(h) )    (3.47)

where H is the set of admissible values for H. So

    p(x, z) = ⟨φ(x), φ(z)⟩    (3.48)

where the feature map is

    φ : x ↦ ( p(x|h) √p(h) )_{h∈H}    (3.49)

This kernel requires p(x|h), which means that the generative process of x from h needs to be known.
A joint probability distribution over finite length sequences, not necessarily of the same length, can be defined by a pair hidden Markov model, a Hidden Markov Model (HMM) (Rabiner, 1989) that generates two sequences simultaneously, used in bioinformatics to construct probabilistic models of the relatedness of pairs of sequences (Durbin et al., 1998).
Definition 3.5 (Pair HMM) Given two sequences of symbols A and B, a
PHMM is defined as follows:
• a finite set S of states, partitioned into the following subsets:
  S^{AB} — states that emit one symbol for A and one symbol for B
  S^A — states that emit only one symbol for A
  S^B — states that emit only one symbol for B
  S^− — a starting state START and an ending state END, which emit no symbol and in which the process starts and ends
• an |S| by |S| state transition probability matrix P, where P(n|c) is the probability that the next state is n given that the current state is c
• an alphabet A
• for states emitting symbols,
  – a probability distribution over A × A for each s ∈ S^{AB}
  – a probability distribution over A for each s ∈ S^A or s ∈ S^B

The PHMM starts in START, then typically repeats cycles through states in S^{AB}, which emit matching or nearly matching symbols for both sequences; occasionally it reaches S^A or S^B, generating insertions of several symbols, before going back to S^{AB}. Eventually it reaches the END state, where the process finishes. So a PHMM permits computing joint probabilities of pairs of sequences.
Let S̄^{AB} = S^{AB} ∪ S^− be the union of S^{AB} with START and END, and let A(s, t) be the random variable denoting a possibly empty subsequence of states in S^A that the process passes through, given that the process starts in a state s ∈ S̄^{AB} and given that state t is the next state in S̄^{AB} reached; let B(s, t) be a random variable defined in the same manner. A sufficient condition for a PHMM to be CSI can now be given.
Theorem 3.1 (PHMM is CSI) Let M be a PHMM such that
• the JPD over sequences induced by M is unchanged if S^A and S^B are swapped;
• for every s ∈ S^{AB}, the symbol–emission JPD over A × A is CSI;
• M has the independent insertion property, that is, for all s, t ∈ S^{AB}, A(s, t) and B(s, t) are independent.
Then the JPD induced by M over pairs of sequences of symbols is CSI.
The proof is given in Watkins (2000). So a PHMM can be represented as
an inner product in a feature space that has a dimension for each possible
sequence of atomic doubly emitting states h; the number of such h for which
the feature mapping is not zero is in general exponential in the length of the
sequence.
3.5.7 Marginalized Kernel
The marginalized kernels arose from the CSI kernels (Watkins, 2000) described in Section 3.5.6, defined by the equation

    K_CSI(x, x') = Σ_h p(x|h) p(x'|h) p(h)    (3.50)

where h ∈ H is a hidden variable taking values in a finite set H. If p(h|x) is known instead of p(x|h), we can define a marginalized kernel as

    K_MK(x, x') = Σ_h Σ_{h'} p(h|x) p(h'|x') K_z(z, z')    (3.51)

where z = (x, h), z' = (x', h'), x, x' ∈ X, h, h' ∈ H and K_z(z, z') is a joint kernel depending on both the visible and the hidden variables (Tsuda et al., 2002). In general the posterior distribution p(h|x) is unknown and has to be estimated from the data, for example by HMMs.
3.5.8 Marginalized Count Kernel
The Marginalized Count Kernel (MCK) is a marginalized kernel for biological sequences (Tsuda et al., 2002; Kin et al., 2002). Let x = (x_1, ..., x_s), x_i ∈ A, be a sequence of s symbols, let h = (h_1, ..., h_s), h_i ∈ H, be a sequence of s hidden variables and define the combined sequence as

    z = (z_1, ..., z_s) = ((x_1, h_1), ..., (x_s, h_s))

where each z_i can assume |A| · |H| different values. The joint kernel K_z(z, z') is defined as

    K_z(z, z') = Σ_{k=1}^{|A|} Σ_{ℓ=1}^{|H|} c_{kℓ}(z) c_{kℓ}(z')    (3.52)

where

    c_{kℓ}(z) = (1/s) Σ_{i=1}^s I(x_i = k, h_i = ℓ)    (3.53)

Now we can define the MCK as

    K_MCK(x, x') = Σ_h Σ_{h'} p(h|x) p(h'|x') K_z(z, z')    (3.54)

where K_z(z, z') is defined in Equation (3.52). Another equivalent formulation of the MCK is

    K_MCK(x, x') = Σ_{k=1}^{|A|} Σ_{ℓ=1}^{|H|} γ_{kℓ}(x) γ_{kℓ}(x')    (3.55)

where

    γ_{kℓ}(x) = Σ_h p(h|x) (1/s) Σ_{i=1}^s I(x_i = k, h_i = ℓ)    (3.56)
              = (1/s) Σ_{i=1}^s Σ_{h_i=1}^{|H|} p(h_i|x) I(x_i = k, h_i = ℓ)    (3.57)

If we use an HMM to represent p(x), the posterior probability p(h_i = ℓ|x) = γ_i(ℓ) can be computed as described in Rabiner (1989).
The MCK can be extended to deal with relations between adjacent symbols, for example it can include combinations of two adjacent symbols, yielding a second–order marginalized count kernel.
3.5.9 The Fisher Kernel
The Fisher kernel was introduced in Jaakkola and Haussler (1999a,b) for a DNA sequence classification problem and applied to the problem of detecting remote protein homologies in Jaakkola et al. (2000), using an HMM as a generative model. The key idea is to derive the kernel from a generative model, using the gradient of the log–likelihood with respect to the parameters of the generative model as the features in a discriminative classifier. This procedure defines a metric relation directly from the generative model, capturing the differences in the generative process of a pair of objects.
Let P(x|θ) be a parametric class of generative probability models, where θ = (θ_1, ..., θ_n) ∈ Θ are the parameters of the model. The key ingredient of the Fisher kernel is the Fisher score U_x ∈ IR^n, that is, the gradient of the log–likelihood with respect to the parameters θ of the generative model P(x|θ) of x

    U_x = ∇_θ log P(x|θ)    (3.58)

It describes how a parameter θ_i contributes to the process of generating an example x, and forms a sufficient statistic of x, up to a normalization constant depending on θ. An important role is played by the Fisher information matrix

    F_I = E_{P(x|θ)} { U_x U_x' }    (3.59)

that is, the expected value of the outer product U_x U_x' over P(x|θ) (U_x' is the transpose of U_x). The Fisher kernel is defined as

    K_FI(x, z) = U_x' F_I^{-1} U_z = φ(x)' F_I φ(z)    (3.60)

where the feature map is defined as φ(x) = F_I^{-1} U_x. It can be computed if the generative model P(x|θ) has a twice differentiable likelihood and the Fisher information matrix is positive definite at the chosen θ. Often, in practical cases, the Fisher information matrix is ignored

    K_F(x, z) = ⟨U_x, U_z⟩    (3.61)

A scaled or translated version of the kernel can also be useful

    K̃(x, z) = c_1 K_FI(x, z) + c_0    (3.62)
where c_1, c_0 ≥ 0. If the examples are not linearly separable in the Fisher feature space, it can be advantageous to combine the Fisher kernel with a polynomial one

    K̃(x, z) = (K_FI(x, z) + b)^d    (3.63)

where b and d are respectively the offset and the degree of the polynomial kernel.
Note that when we derive a kernel from a generative model, the value of the kernel between two objects depends also on the other objects used for constructing the generative model (for example, the whole training set, or an expanded version of the training set used to construct the generative model). Finally, kernels from a generative model boil down to counts of sufficient statistics: in this sense, they are similar to the weighted decomposition kernel described in Section 6.3.
3.6 Kernels for Trees
Trees represent a generalization of sequences in which more complex relations than the simple serial order among elements may exist. Tree data structures have been employed to model objects from several domains. In natural language processing, parse trees are modelled as ordered labelled trees. In pattern recognition, an image can be represented by a tree whose vertices are associated with image components, retaining information concerning the structure of the image. In automated reasoning, many problems are solved by searching, and the search space is often represented as a tree whose vertices are associated with search states and whose edges represent inference steps. Also semi–structured data such as HTML and XML documents can be modelled by labelled ordered trees. We start with some definitions about trees.
Definition 3.6 A tree is a connected graph without cycles. A leaf is a node with no children. An ordered tree is one in which the child nodes of every node are ordered according to some relation. A labelled tree has the property that each node v carries a label label(v) ∈ A. Let V(x) be the set of nodes of x. A subtree t' is a subset of nodes of the tree t, with the corresponding edges, which forms a tree: in notation t' ⊑ t. A proper subtree is a subtree in which each node is either a leaf or all of its children belong to the subtree.
3.6.1 Parse Tree Kernel
A kernel for labelled ordered trees was introduced by Collins and Duffy (2001, 2002) for a Natural Language Processing (NLP) task in which the goal is to rerank the candidate parse trees of a sentence generated by a probabilistic context free grammar (PCFG). The parse tree kernel evaluates tree similarity by counting the number of common proper subtrees between two trees, weighting larger subtree fragments by an exponential decaying factor. Counting proper subtrees (and not generic subtrees) in a parse tree aims at not splitting the production rules which constitute the tree.
The feature space of the parse tree kernel is indexed by all the proper subtrees (Bod, 2001) present in the training data, with the only constraint that a production rule cannot be divided into further subparts. The value of a component φ_t(x) of a feature vector counts the number of times a proper subtree t occurs in a tree x

    φ_t(x) = number of times the proper subtree t occurs in x    (3.64)

This representation can be seen as a bag of proper subtrees representation: the object is mapped into a feature vector in which each component counts the number of occurrences of a given structure. Figure 3.2 shows an example of the proper subtrees of a given tree.
Figure 3.2. Proper subtrees of a given tree (the tree shown has root A with children B and C, where B has children D and E).
The kernel between two trees x and z is the inner product in the feature space F

    K(x, z) = ⟨φ(x), φ(z)⟩ = Σ_{t=1}^{|F|} φ_t(x) φ_t(z)    (3.65)

Note that |F| is very large, because the number of subtrees is exponential in the size of the tree. So we need an efficient method to compute the kernel without explicitly enumerating all the proper subtrees. We start by defining the following indicator function

    I_t(x, v) = 1 if the proper subtree t is rooted at node v of tree x, 0 otherwise    (3.66)

Let V(x) and V(z) be respectively the sets of nodes of x and z. Using the indicator function,

    φ_t(x) = Σ_{i∈V(x)} I_t(x, i)   and   φ_t(z) = Σ_{j∈V(z)} I_t(z, j)    (3.67)
So the parse tree kernel can be expressed as

    K(x, z) = Σ_{t=1}^{|F|} φ_t(x) φ_t(z) = Σ_{t=1}^{|F|} Σ_{i∈V(x)} Σ_{j∈V(z)} I_t(x, i) I_t(z, j)
            = Σ_{i∈V(x)} Σ_{j∈V(z)} C(i, j)

where

    C(i, j) = Σ_{t=1}^{|F|} I_t(x, i) I_t(z, j)    (3.68)

counts the number of common proper subtrees rooted at both i and j. Note that C(i, j) can be computed by the following recursive procedure:

1. if the production rules at i and j are different (given two nodes i and j, we say that i and j have the same production rule if they have the same label, the same number of children, and the corresponding children in the ordered lists have the same labels), then

    C(i, j) = 0
2. if the production rules at i and j are the same and i and j are pre–terminals (pre–terminals are nodes directly above the leaves), then

    C(i, j) = 1

3. if the production rules at i and j are the same but i and j are not pre–terminals, then

    C(i, j) = Π_{ℓ=1}^{|ch(i)|} (1 + C(ch(i, ℓ), ch(j, ℓ)))    (3.69)

where ch(i) is the list of children of node i, ch(i, ℓ) is the ℓ–th child of node i and |ch(i)| is the number of children of i (note that nodes i and j have the same number of children in this case).
To verify the soundness of the recursive formula, note that the base cases are immediately correct. To prove the recursive step (3.69), first consider a production rule A → B rooted at both i and j with only one child B (A is the label of both i and j), and suppose that the child B has c common proper subtrees rooted at itself in both trees x and z. The number of common proper subtrees rooted at both i and j is

    C(i, j) = 1 + c    (3.70)

We can consider only the production rule A → B (which contributes 1), and then we can attach A → B to each one of the c subtrees rooted at the child B, producing c further subtrees (which contributes c). A more complex situation arises when the production rule has p children, A → B_1, ..., B_p. In this case, we have to multiply the numbers of common proper subtrees produced by each child: if c_1, ..., c_p are the numbers of common subtrees rooted at each child in both trees x and z, then the number of common proper subtrees rooted at both i and j is

    C(i, j) = Π_{k=1}^p (1 + c_k)    (3.71)

where 1 + c_k is the number of subtrees rooted at both i and j generated by the k–th child (see Equation (3.70)). Equation (3.71) counts all the possible ways to combine the fragments produced by each child.
Another way to verify that Equation (3.69) is correct is to follow the ideas described in Goodman (1996). A common subtree at both i and j can be formed by taking the production rule at i and j, together with a choice, at each child, of simply taking the non–terminal symbol at that child or any of the c_k common subtrees rooted at that child. In other words, if there are c_1 non–trivial subtrees headed by ch(i, 1), there is also the trivial case in which the subtree is simply the non–terminal symbol at that child, so there are c_1 + 1 different possibilities for the first child. Similarly, for the other children there are c_k + 1 possibilities. We can create a subtree by choosing any possible combination of subtrees rooted at the children. Thus, there are Π_{k=1}^p (1 + c_k) possible subtrees headed by both i and j.
A problem of this kernel is its dependency on the number of tree nodes: the larger the number of nodes, the bigger the kernel value. This can be overcome by normalizing the kernel value. Another drawback of the parse tree kernel is that it is very peaked, i.e. the values of the kernel between a tree and itself are very large, while the values between two different trees are typically much smaller. A first method to reduce this problem consists in restricting the height of the proper subtrees which are used, since there are more tree fragments of larger size than fragments of smaller size:

    K_h(x, z) = Σ_{i∈V(x)} Σ_{j∈V(z)} C(i, j, h)    (3.72)

where C(i, j, h) is the number of fragments common to x and z, rooted at both i and j, of depth less than or equal to h. An alternative solution is to exponentially downweight the importance of subtrees with their size by introducing a parameter 0 < λ ≤ 1, which corresponds to the following modified version of the kernel

    K_λ(x, z) = Σ_{t=1}^{|F|} λ^{size_t} φ_t(x) φ_t(z)    (3.73)

where size_t is the number of production rules in the t–th proper subtree. The recursive definition of C(i, j) becomes
1. if the production rules at i and j are different, then

    C(i, j) = 0

2. if the production rules at i and j are the same and i and j are pre–terminals, then

    C(i, j) = λ

3. if the production rules at i and j are the same but i and j are not pre–terminals, then

    C(i, j) = λ Π_{ℓ=1}^{|ch(i)|} (1 + C(ch(i, ℓ), ch(j, ℓ)))    (3.74)
The parse tree kernel between x and z can be evaluated in O(|V(x)||V(z)|) time, by first computing and then summing all the values C(i, j). A tighter bound is that it runs linearly in the number of node pairs (i, j) ∈ V(x) × V(z) such that the production rules at i and j are the same. In practical cases, the number of node pairs with identical productions is typically linear, so the running time is close to linear in the size of the trees. A small sketch of this recursion is given below.
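The following is a minimal Python sketch of the recursion with the downweighting factor λ (Equation 3.74); the representation of a tree as a pair (label, list of children) is an assumption made for the example, and a node is treated as a pre–terminal when all of its children are leaves.

def nodes(t):
    # a tree is a pair (label, list_of_children); yields every node
    yield t
    for c in t[1]:
        yield from nodes(c)

def production(v):
    # a node's production rule: its label plus the ordered labels of its children
    return (v[0], tuple(c[0] for c in v[1]))

def is_preterminal(v):
    return bool(v[1]) and all(not c[1] for c in v[1])

def C(i, j, lam):
    # cases 1-3 of the recursion (Equation 3.74)
    if not i[1] or not j[1] or production(i) != production(j):
        return 0.0
    if is_preterminal(i):
        return lam
    r = lam
    for ci, cj in zip(i[1], j[1]):
        r *= 1.0 + C(ci, cj, lam)
    return r

def parse_tree_kernel(x, z, lam=1.0):
    # K(x, z) = sum over node pairs of C(i, j), as in Equation (3.65)
    return sum(C(i, j, lam) for i in nodes(x) for j in nodes(z))

# the tree of Figure 3.2: root A with children B (with children D, E) and C
t = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])
print(parse_tree_kernel(t, t, lam=1.0))

With λ = 1 the value counts the common proper subtrees of the two arguments (here the tree is compared with itself).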
3.6.2 String Tree Kernel
In Viswanathan and Smola (2003), in addition to the string kernel described in Section 3.5.4, a kernel for trees is introduced. Its general form is

    K(x, z) = Σ_{t⊑x} Σ_{u⊑z} I(t = u) w_t    (3.75)

where x, z are two trees and t, u are the corresponding subtrees. Unless a specific order for the children is given, we assume that a lexicographic order is associated with the labels, if they exist.
The basic idea is to map each tree into a string and then compute the tree kernel as the string kernel of Section 3.5.4 between the corresponding strings. First, we introduce two additional symbols [ and ] which satisfy [ < ] and [, ] < a for every a ∈ A. Then we give a recursive definition of the tag of each node:

• tag(v) = [ ] if v is an unlabelled leaf
• tag(v) = [ label(v) ] if v is a labelled leaf
• tag(v) = [ tag(v_1) tag(v_2) ··· tag(v_c) ] if v is an unlabelled node with children v_1, v_2, ..., v_c, where the tags of the children are sorted lexicographically
• tag(v) = [ label(v) tag(v_1) tag(v_2) ··· tag(v_c) ] if v is labelled

The tag of the root node, tag(root), is a unique identifier of the tree and can be constructed in (λ + 2)(ℓ log₂ ℓ) time and O(ℓ) space, where ℓ is the number of nodes and λ is the maximum length of a label. Any s ⊑ tag(root) corresponds to a subtree, and substrings s starting with [ and ending with the balancing ] correspond to subtrees whose tag is s. A small sketch of the tag construction is given below.
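A short Python sketch of the tag construction above, reusing the pair (label, list of children) representation assumed earlier; labels are assumed to be strings.

def tag(v):
    # children tags are sorted lexicographically, as in the definition above
    label, children = v
    return "[" + (label or "") + "".join(sorted(tag(c) for c in children)) + "]"

# the tree of Figure 3.2: tag(root) = "[A[B[D][E]][C]]"
t = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])
print(tag(t))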
Note that the set of subtrees generated as contiguous substrings of the tag
of the root node tag(root) does not contain all the proper subtrees of the parse
tree kernel described in Section 3.6.1. For example, the tag of the root node
of the tree in Figure 3.2 is [A[B[D][E]][C]] and the proper subtree constituted
by nodes labelled by A, B and C cannot be generated as contiguous substring
of tag(root). Moreover, only a subset of all contiguous substrings of tag(root)
represent a subtree in the sense of Definition 3.6: for example, the contiguous
substring [B[D][E]][C] represents two distinct connected components. The
relation between the set of parse tree proper subtrees and the set of subtrees
generated as contiguous substrings of tag(root) is illustrated in Figure 3.3.
[Venn diagram with regions labelled: all subtrees; Collins proper subtrees; Smola string subtrees.]
Figure 3.3. Relation between the set of parse tree proper subtrees and the set of subtrees generated as contiguous substrings.
3.6.3 Label Mutation Elastic Structure Tree Kernel
A kernel for semi–structured data modelled by labelled ordered trees is introduced in Kashima and Koyanagi (2002) as a generalization of the parse tree kernel described in Collins and Duffy (2001), for the problems of node marking and of the classification of HTML documents. This extended kernel allows some mutations of node labels and elastic subtree structure, while maintaining the same complexity as the parse tree kernel.
Three extensions are introduced with respect to Collins and Duffy (2001):
1. it is not necessary that two nodes have the same number of children with the same labels; instead, we are interested in one–to–one correspondences in which the left–to–right order of the children is preserved. Let S_{v1,v2}(i, j) be the sum of the products of the numbers of times each subtree appears at v_1 and v_2, when we consider only the nodes up to the i–th child of v_1 and the nodes up to the j–th child of v_2. So

    C(v_1, v_2) = S_{v1,v2}(|ch(v_1)|, |ch(v_2)|)    (3.76)

Since all correspondences preserve the left–to–right ordering of the children, S_{v1,v2}(i, j) can be computed recursively as

    S_{v1,v2}(i, j) = S_{v1,v2}(i − 1, j) + S_{v1,v2}(i, j − 1) − S_{v1,v2}(i − 1, j − 1)
                    + S_{v1,v2}(i − 1, j − 1) C(ch(v_1, i), ch(v_2, j))    (3.77)
2. mutations between node labels are allowed, penalizing the score of the subtree using a mutation score function P_mut : A × A ↦ [0, 1], where a low value of P_mut(a_2|a_1) indicates a low acceptance of the mutation from a_1 to a_2. So the score of a subtree in which a mutation from a_1 to a_2 appears is penalized by a factor of P_mut(a_2|a_2) P_mut(a_2|a_1). The i–th component of the feature vector of a tree t is defined as the sum of the penalized scores of the i–th subtree over all positions in t where a structural matching occurs. The computation of C(v_1, v_2) becomes

    C(v_1, v_2) = Sim(label(v_1), label(v_2)) S_{v1,v2}(|ch(v_1)|, |ch(v_2)|)    (3.78)
where

    Sim(ℓ_1, ℓ_2) = Σ_{a∈A} P_mut(ℓ_1|a) P_mut(ℓ_2|a)    (3.79)

is a similarity measure between the labels ℓ_1 and ℓ_2 of two nodes, taking into account all the possible mutations. Figure 3.4 shows an example of mutation.
Figure 3.4. An example of mutation: labels D and B are replaced by A and C respectively.
3. elastic matching between subtrees is allowed, that is, a subtree appears in a tree if the subtree is embedded in the tree while the relative positions of the nodes of the subtree are preserved. For example, if a node is a descendant of another node, or if it is to the left of another node in the subtree, then the same relations must hold in the embedding (see Figure 3.5). Computing elastic structure matching must take into account matches between subtrees rooted at all descendants of each node, and not only at its children. So we need to define new variables as

    C_elas(v_1, v_2) = Σ_{i∈D_{v1}} Σ_{j∈D_{v2}} C(i, j)    (3.80)

where D_{v1}, D_{v2} are the sets of descendants of v_1 and v_2, including v_1 and v_2 themselves. The recursive definition of S_{v1,v2}(i, j) as a function of C_elas(v_1, v_2)
Figure 3.5. An example of embedding subtree: the relative positions of the nodes of the subtree are preserved.
becomes

    S_{v1,v2}(i, j) = S_{v1,v2}(i − 1, j) + S_{v1,v2}(i, j − 1) − S_{v1,v2}(i − 1, j − 1)
                    + S_{v1,v2}(i − 1, j − 1) C_elas(ch(v_1, i), ch(v_2, j))    (3.81)

The efficient computation of C_elas(v_1, v_2) is carried out recursively as

    C_elas(v_1, v_2) = Σ_{i=1}^{|ch(v_1)|} C_elas(ch(v_1, i), v_2) + Σ_{j=1}^{|ch(v_2)|} C_elas(v_1, ch(v_2, j))
                     − Σ_{i=1}^{|ch(v_1)|} Σ_{j=1}^{|ch(v_2)|} C_elas(ch(v_1, i), ch(v_2, j)) + C(v_1, v_2)    (3.82)
The label mutation elastic structure tree kernel considers a wider class of subtrees than the one defined in Definition 3.6: in fact, any subset of nodes
which maintains the relative positions, allowing for label mutations, is an admissible subtree.
The time complexity of computing the label mutation elastic structure tree kernel remains the same as that of the parse tree kernel.
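As an illustration, the following is a minimal sketch of the non–elastic part of the recursion, i.e. Equations (3.77)–(3.79), over a toy tree representation. The Node class, the mutation table pmut and the helper names are assumptions introduced for this example, not the original implementation.

```python
from itertools import product

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def sim(l1, l2, pmut, alphabet):
    """Label similarity of Eq. (3.79); pmut[(b, a)] stands for P_mut(b|a)."""
    return sum(pmut.get((l1, a), 0.0) * pmut.get((l2, a), 0.0) for a in alphabet)

def C(v1, v2, pmut, alphabet):
    """Weighted count of common (mutated) subtrees rooted at v1 and v2, Eq. (3.78)."""
    n1, n2 = len(v1.children), len(v2.children)
    # S[i][j] follows the recursion of Eq. (3.77), with S(0, j) = S(i, 0) = 1
    S = [[1.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in product(range(1, n1 + 1), range(1, n2 + 1)):
        S[i][j] = (S[i - 1][j] + S[i][j - 1] - S[i - 1][j - 1]
                   + S[i - 1][j - 1] * C(v1.children[i - 1], v2.children[j - 1],
                                          pmut, alphabet))
    return sim(v1.label, v2.label, pmut, alphabet) * S[n1][n2]

def tree_kernel(t1, t2, pmut, alphabet):
    """Sum C(v1, v2) over all pairs of nodes of the two trees."""
    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(C(a, b, pmut, alphabet) for a in nodes(t1) for b in nodes(t2))

# Tiny usage: identical one-node trees with a self-preserving mutation table.
alphabet = ["A", "B"]
pmut = {("A", "A"): 1.0, ("B", "B"): 1.0}
print(tree_kernel(Node("A"), Node("A"), pmut, alphabet))  # 1.0
```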
3.7 Kernels for Graphs
Graphs are data structures that directly and naturally represent objects from several domains. For example, a natural way to represent a chemical compound, able to directly take into account its structural nature, is the topological graph representation: an undirected labelled graph whose vertices correspond to atoms and whose edges correspond to bonds between atoms (see Figure 3.6).
Figure 3.6. A chemical compound and its topological representation (S: single bond, D: double bond).
The automatic classification of chemical compounds is of crucial importance in the rationalization of drug discovery processes, where the effectiveness or toxicity of drugs must be predicted from their chemical structures.
Unfortunately, due to the powerful expressiveness of graphs, defining appropriate kernel functions for graphs has proven difficult. Gärtner et al. (2003) showed that computing a strictly positive definite graph kernel is at least as hard as solving the graph isomorphism problem, and that computing an inner product in a feature space indexed by all possible subgraphs is NP–hard.
Let us introduce some definitions about graphs.
Definition 3.7 A labelled undirected graph is a quadruple G = (V, E, A, label), where V is a finite set of vertices, E ⊆ V × V is a set of edges (a binary relation on V), A is a finite ordered set of labels and label : V ∪ E ↦ A is a function assigning a label to each element of V ∪ E. An edge (v, w) ∈ E is undirected if both (v, w) and (w, v) are in E; otherwise, if (w, v) ∉ E, then (v, w) is directed. A graph is directed if all its edges are directed. A cycle is a sequence of contiguous vertices where the first vertex coincides with the last one. A graph is acyclic if it has no cycles. The acronym DAG refers to a directed acyclic graph. A directed ordered acyclic graph (DOAG) is a DAG in which a total order relation ≺ is defined on the edges leaving each vertex. A vertex s is a supersource for a graph G if for each vertex v ∈ V \ {s} there exists a path from s to v. A DAG has at most one supersource. Let ch(v) be the ordered tuple of vertices whose elements are the children of v, ch(v, i) the i–th child of v and |ch(v)| the number of children of v.
3.7.1 Subgraph Kernel
The general idea of graph kernels is to measure common subgraphs of two graphs. Let 𝒢 be a set of graphs and let φ : 𝒢 ↦ 2^𝒢 be a function mapping each graph G to a set of subgraphs of G. Using the intersection set kernel given in Equation (3.15), we can define a subgraph kernel as

K_SG(G, G') ≐ K_∩(φ(G), φ(G')) = |φ(G) ∩ φ(G')|    (3.83)
The frequent subgraphs kernel (Deshpande et al., 2003) and the cyclic pattern kernel (Horváth et al., 2004) are two examples of subgraph kernels.
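A minimal sketch of Equation (3.83), assuming each subgraph in φ(G) has already been mapped to a canonical, hashable encoding; the string encodings below are purely illustrative.

```python
def subgraph_kernel(phi_g1, phi_g2):
    """Intersection set kernel of Eq. (3.83) between two sets of
    canonically encoded subgraphs."""
    return len(set(phi_g1) & set(phi_g2))

# Example with hypothetical canonical string encodings of subgraphs.
print(subgraph_kernel({"C-C", "C-O", "C-C-O"}, {"C-C", "C-N", "C-C-O"}))  # 2
```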
3.7.2 Frequent Subgraphs Kernel
The Frequent Subgraphs Kernel (FSGK) was introduced as an alternative to standard quantitative structure–activity relationship (QSAR) methods for classifying chemical compounds (Deshpande et al., 2003, 2002; Kuramochi and Karypis, 2004). The key idea is to decouple the substructure
discovery process from the classification model construction step and use a
frequent subgraph algorithm (FSG) that does not rely on heuristic search
methods to find all chemical substructures which occur a sufficiently large
number of times.
The FSG algorithm takes as input a set of undirected labelled graphs D_m and a minimum support σ ∈ [0, 1], and finds all connected subgraphs which occur in at least σ|D_m| of the graphs: note that the minimum support and the fact that the subgraphs are connected make the problem computationally tractable. A sketch of FSG is shown in Algorithm 3.1.
Algorithm 3.1 FSG(D_m, σ)
Input: A set of graphs D_m and a minimum support σ
Output: The set of all frequent subgraphs F^1, F^2, . . . , F^{k-2}
Require: σ ∈ [0, 1]
1: F^1 ← the set of all frequent subgraphs in D_m with 1 edge
2: F^2 ← the set of all frequent subgraphs in D_m with 2 edges
3: k ← 3
4: while F^{k-1} ≠ ∅ do
5:   C^k ← fsg–gen(F^{k-1})
6:   for all candidates G^k ∈ C^k do
7:     G^k.count ← 0
8:     for all graphs G ∈ D_m do
9:       if candidate G^k is included in graph G then
10:        G^k.count ← G^k.count + 1
11:      end if
12:    end for
13:  end for
14:  F^k ← {G^k ∈ C^k : G^k.count ≥ σ|D_m|}
15:  k ← k + 1
16: end while
17: return F^1, F^2, . . . , F^{k-2}
The procedure fsg–gen generates candidate subgraphs from the set of frequent subgraphs F^{k-1} by adding one edge at a time: each pair of frequent subgraphs which share a common core of k − 2 edges is joined to form a candidate subgraph with k edges. To compute the frequency of a candidate subgraph, FSG uses graph identifier (GID) lists: for each frequent subgraph, a list of the graph identifiers that support it is stored. When we need to compute the frequency of a candidate subgraph G^k, we first compute the intersection
of the GID lists of its frequent subgraphs with k − 1 edges: if the size of the intersection is below the support, G^k is pruned; otherwise we compute the frequency of G^k using subgraph isomorphism, limiting the search to the set of graphs in the intersection of the GID lists. A canonical labelling algorithm based on various vertex invariants is used to establish the identity and a total order of frequent and candidate subgraphs.
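A small sketch of the GID–list pruning step just described; the data layout and names below are assumptions made for this illustration, not the original FSG implementation.

```python
def candidate_support(candidate_subs, gid_lists, sigma, n_graphs, subgraph_iso):
    """Prune a candidate via GID-list intersection before the expensive
    subgraph-isomorphism test (sketch of the FSG counting step).

    candidate_subs: the (k-1)-edge frequent subgraphs contained in the candidate
    gid_lists:      dict mapping each frequent subgraph to the set of graph ids
                    that support it
    subgraph_iso:   callable(graph_id) -> bool, True if the candidate occurs in it
    """
    threshold = sigma * n_graphs
    gids = set.intersection(*(gid_lists[s] for s in candidate_subs))
    if len(gids) < threshold:          # cannot possibly be frequent
        return None
    support = {g for g in gids if subgraph_iso(g)}
    return support if len(support) >= threshold else None
```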
An approximate version of the FSG algorithm can also be applied to geometric graphs (graphs describing the 3D structure of a compound, an important feature, in which each vertex indicates the position of the corresponding atom in 3D space), and a feature selection technique can be used to reduce the number of generated subgraphs. To find meaningful features for all classes, it can be useful to run the FSG algorithm on each class of the data set separately and then combine the subgraphs from each class.
3.7.3 Cyclic Pattern Kernel
The Cyclic Pattern Kernel (CPK) was introduced by Horváth et al. (2004) for the problem of predictive graph mining. To eliminate the restriction to frequent subgraphs of FSGK, CPK uses a natural set of cyclic and tree patterns to represent a graph.
A simple cycle of a graph G is a sequence

C = {(v_0, v_1), (v_1, v_2), . . . , (v_{k-1}, v_k)}    (3.84)

of edges, where the vertices v_0, . . . , v_{k-1} ∈ V are all distinct and v_0 = v_k. We note that the number of simple cycles is exponential in the number |V| of vertices in the worst case. Let S(G) be the set of simple cycles of a graph G. The canonical representation of a cycle C is the lexicographically smallest string π(C) ∈ A^* among the strings obtained by concatenating the labels along the vertices and edges of the cyclic permutations of C and of its reverse. More precisely, denoting by ρ(s) the set of cyclic permutations of a sequence s and of its reverse, we define π(C) as

π(C) = min{σ(w) : w ∈ ρ(v_0 v_1 . . . v_{k-1})}    (3.85)

where, for w = w_0 w_1 . . . w_{k-1},

σ(w) = label(w_0)\, label((w_0, w_1))\, label(w_1) · · · label(w_{k-1})\, label((w_{k-1}, w_0))
Clearly, π is unique up to isomorphism. The set of cyclic patterns C(G) of a graph G is defined as

C(G) = {π(C) : C ∈ S(G)}    (3.86)

If we remove from a graph G the edges of all simple cycles, we obtain a forest B(G) consisting of the set of bridges of the graph (a bridge is an edge not belonging to any simple cycle). We can associate to each tree T ∈ B(G) a canonical representation π(T) ∈ A^* that is unique up to isomorphism and define the set of tree patterns T(G) of G as

T(G) = {π(T) : T ∈ B(G)}    (3.87)

The cyclic pattern kernel is defined as the intersection set kernel between cyclic patterns and tree patterns

K_CP(G, G') ≐ |C(G) ∩ C(G')| + |T(G) ∩ T(G')|    (3.88)
The problem of computing cyclic pattern kernels is NP–hard and therefore intractable in general (Horváth et al., 2004). This result derives from the observation that the problem of enumerating N ≤ |C(G)| elements of C(G) is NP–hard in |V| and N (note that computing T(G) takes polynomial time in |V|). However, a result from Reed and Tarjan (1975) states that N ≤ |S(G)| elements of the set S(G) of simple cycles of a graph G can be listed in polynomial time: exploiting this fact, we can consider only the graphs whose number of simple cycles is bounded by a constant. Computing cyclic pattern kernels then becomes a tractable problem, at the cost of disregarding only the small subset of the dataset which would require too much computation time.
3.7.4 Marginalized Graph Kernel
The Marginalized Graph Kernel (MGK) is an application of the marginalized kernels described in Section 3.5.7 to the graph domain (Kashima et al., 2003). The feature vectors are defined as the counts of label paths produced by random walks on graphs, and they can have infinite dimensionality due to graph cycles. The efficient evaluation of the kernel is performed by finding the stationary state of a discrete–time linear system.
Let h = (h_1, . . . , h_s) be a sequence of s natural numbers h_i ∈ {1, . . . , |V|} associated with a graph G, describing a random walk. The posterior probability of h is

p(h|G) = p_init(h_1) \prod_{i=2}^{s} p_tran(h_i | h_{i-1})\, p_end(h_s)    (3.89)

where p_init(h_1) is the initial probability distribution from which h_1 is sampled, p_tran(h_i | h_{i-1}) is the transition probability and p_end(h_s) is the probability that the walk ends at h_s. Gaps can also be simulated by setting the transition probabilities appropriately. The joint kernel K_z(z, z') depending on both the visible and hidden variables z = (G, h) is defined as
K_z(z, z') ≐ \begin{cases} 0 & s ≠ s' \\ K(v_{h_1}, v'_{h'_1}) \prod_{i=2}^{s} K(e_{h_{i-1} h_i}, e'_{h'_{i-1} h'_i})\, K(v_{h_i}, v'_{h'_i}) & s = s' \end{cases}    (3.90)
where K is a nonnegative kernel on vertex and edge labels (for example, the
exact matching kernel for discrete labels or a Gaussian kernel for real labels).
So the MGK is defined as

K_MGK(G, G') ≐ \sum_{s=1}^{∞} \sum_{h} \sum_{h'} p(h|G)\, p(h'|G')\, K_z(z, z')    (3.91)

where \sum_{h} = \sum_{h_1=1}^{|V|} · · · \sum_{h_s=1}^{|V|} and s is the length of the random walk. The straightforward enumeration is obviously impossible, since s spans from 1 to infinity. Kashima et al. (2003) proved that computing K_MGK(G, G') is equivalent to finding the stationary state of a discrete–time linear system, which can be efficiently done by solving simultaneous linear equations.
3.7.5 Extended Marginalized Graph Kernel
An extension of the marginalized graph kernel that speeds up the kernel computation is described in Mahé et al. (2004). Two modifications are introduced:
• the Morgan index;
• the prevention of totters.
The Morgan indexing procedure (Morgan, 1965) modifies the labels of the vertices to increase their specificity: the number of common label paths between graphs decreases, while the relevance of the features used is increased. At the beginning, each vertex is labelled by the integer 1; at each iteration, the label of a vertex is the sum of its label and its direct neighbors' labels. If M_t is the vector of vertex labels, then

M_{t+1} = (A + I_{|V|}) M_t    (3.92)

where M_0 = 1, I_{|V|} is the |V| × |V| identity matrix and A is the |V| × |V| graph adjacency matrix. Figure 3.7 shows the first two iterations of the Morgan indexing procedure. The choice of the number of iterations of the indexing procedure is critical, since performance is not very stable as this parameter varies.
Figure 3.7. First two iterations of the Morgan indexing procedure (original compound, after 1 iteration, after 2 iterations).
The second extension consists in preventing totters, which add noise to the representation of the graph. Totters are paths of the form h = (v_1, . . . , v_s) with v_i = v_{i+2} for some i. For example, a path with labels C–C–C might either indicate a succession of three C–labelled vertices or just a succession of two C–labelled vertices visited by a tottering random walk (see Figure 3.8). Introducing a second–order Markov model of the random walk, instead of the first–order one of Equation (3.89), solves the problem of totters.
Figure 3.8. An example of totter: a path with labels C–C–C might either indicate a succession of three C–labelled vertices or just a succession of two C–labelled vertices visited by a tottering random walk.
3.7.6 A Family of Kernels for Small Molecules
Several kernels for small molecules were introduced in Swamidass et al. (2005) for the prediction of mutagenicity, toxicity and anti–cancer activity of compounds. These kernels are based on string (1D), graph (2D) and atom–coordinate (3D) representations.
The first class of kernels described is the 1D kernels based on SMILES strings. SMILES are strings over a small alphabet which represent molecules in a unique way, requiring an ordering of the atoms (Weininger et al., 1989). String kernels such as the spectrum kernel (Leslie et al., 2002a) or the mismatch kernel (Leslie et al., 2002b) can be applied to the SMILES representation of molecules.
A second class of kernels, based on the topological graph (2D representation), exploits labelled paths (sequences of atoms and bonds of maximal length d) obtained by a depth–first search from each vertex. A hash value is computed for each path and used to initialize a random number generator that produces b integers. The b integers are reduced modulo l and the corresponding bits are set to one in the fingerprint, a bit vector of size l. Let P(d) be the set of all possible atom–bond paths with a maximum of d bonds. Given a depth d, the feature map φ for a molecule x is

φ_d(x) = (φ_path(x))_{path ∈ P(d)}    (3.93)
where φ_path(x) can be defined in two ways:
• φ_path(x) = φ^{bin}_{path}(x) is equal to 1 if at least one depth–first search of depth d starting from the atoms of x produces the path "path";
• φ_path(x) = φ^{mul}_{path}(x) is the number of occurrences of the path "path" in all the depth–first searches of depth d starting from all the atoms of x.
An alternative feature map based on fixed–size vectors of size l is given by

φ_{d,l}(x) = (φ_{γ_l(path)}(x))_{path ∈ P(d)}    (3.94)

where γ_l : P(d) ↦ {1, . . . , l}^b is a function mapping paths to a set of indices and φ_{γ_l(path)}(x) captures the hash function mapping, the random generation and the congruence operation on fingerprints. A possible kernel is the inner product between feature vectors

K_d(x, z) = ⟨φ^{bin}_d(x), φ^{bin}_d(z)⟩ = \sum_{path ∈ P(d)} φ^{bin}_{path}(x)\, φ^{bin}_{path}(z)    (3.95)
A more interesting kernel is the Tanimoto kernel, defined as

K^T_d(x, z) = \frac{K_d(x, z)}{K_d(x, x) + K_d(z, z) − K_d(x, z)} = \frac{|φ^{bin}_d(x) ∩ φ^{bin}_d(z)|}{|φ^{bin}_d(x) ∪ φ^{bin}_d(z)|}    (3.96)

which represents the ratio between the number of elements in the intersection and in the union of the sets of features. An effective variant is the so–called MinMax kernel

K^M_d(x, z) = \frac{\sum_{path ∈ P(d)} \min\{φ^{mul}_{path}(x), φ^{mul}_{path}(z)\}}{\sum_{path ∈ P(d)} \max\{φ^{mul}_{path}(x), φ^{mul}_{path}(z)\}}    (3.97)
Equation (3.97) is identical to Equation (3.96) when applied to binary feature vectors. In addition, the MinMax kernel can be seen as a Tanimoto kernel on binary vectors obtained by transforming the vectors of counts as described in Section 6.3.2. Using a suffix tree data structure allows us to compute each of the proposed kernels in time O(d(|V_1||E_1| + |V_2||E_2|)), where d is the depth of the search and V_1, V_2, E_1, E_2 are respectively the sets of vertices and edges of the two molecules considered.
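A small sketch of Equations (3.96) and (3.97) on explicit path–count dictionaries; the feature extraction itself is omitted and the dictionaries below are illustrative only.

```python
def tanimoto_kernel(x_paths, z_paths):
    """Eq. (3.96) on the binary feature sets of the two molecules."""
    x, z = set(x_paths), set(z_paths)
    return len(x & z) / len(x | z)

def minmax_kernel(x_counts, z_counts):
    """Eq. (3.97) on dictionaries mapping each path to its count."""
    paths = set(x_counts) | set(z_counts)
    num = sum(min(x_counts.get(p, 0), z_counts.get(p, 0)) for p in paths)
    den = sum(max(x_counts.get(p, 0), z_counts.get(p, 0)) for p in paths)
    return num / den

x = {"C-C": 2, "C=O": 1}
z = {"C-C": 1, "C-N": 1}
print(tanimoto_kernel(x, z), minmax_kernel(x, z))  # 0.333... 0.25
```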
Finally, a family of 3D kernels based on atomic distances proposes to represent a molecule as a set of histograms. For each pair of atom labels, the histogram stores the distances between all pairs of atoms with those labels in a given molecule. Different histograms can be given different weights, and the similarity between molecules is measured by the similarity between the histograms.
3.7.7 Synchronized Random Walks Kernel
Kashima and Inokuchi (2002) described a kernel that computes inner products between pairs of graphs by means of a random walk on the vertex product graph of the two graphs. Precisely, the kernel is defined as the probability with which two label sequences generated by two "synchronized" random walks on the graphs are identical. The kernel between two graphs G_1 = (V_1, E_1), G_2 = (V_2, E_2) is defined as

K(G_1, G_2) = \frac{1}{|V_1||V_2|} \sum_{v_1 ∈ V_1} \sum_{v_2 ∈ V_2} k(v_1, v_2)    (3.98)
The kernel k(v_1, v_2) between pairs of vertices could simply be the indicator function I(v_1, v_2). However, I(v_1, v_2) does not incorporate any information on the local structure around v_1 and v_2, so the kernel k(v_1, v_2) is generalized to take higher scores when not only the labels of v_1 and v_2 are identical, but also when the labels of the edges and vertices adjacent to v_1 and v_2, and of further edges and vertices, are identical. Mathematically, given a decaying constant λ ∈ [0, 1], we can define k(v_1, v_2) as
k(v_1, v_2) = (1 − λ)\, k_0(v_1, v_2)
  + λ(1 − λ) \sum_{e_1 ∈ A(v_1), e_2 ∈ A(v_2)} k_1(v_1, v_2, e_1, e_2)
  + λ^2(1 − λ) \sum_{e_1 ∈ A(v_1), e_2 ∈ A(v_2)} \sum_{e'_1 ∈ A(δ(v_1,e_1)), e'_2 ∈ A(δ(v_2,e_2))} k_2(v_1, v_2, e_1, e_2, e'_1, e'_2)
  + λ^3(1 − λ) \sum_{e_1 ∈ A(v_1), e_2 ∈ A(v_2)} \sum_{e'_1 ∈ A(δ(v_1,e_1)), e'_2 ∈ A(δ(v_2,e_2))} \sum_{e''_1 ∈ A(δ(δ(v_1,e_1),e'_1)), e''_2 ∈ A(δ(δ(v_2,e_2),e'_2))} k_3(v_1, v_2, e_1, e_2, e'_1, e'_2, e''_1, e''_2)
  + · · ·    (3.99)
where

k_0(v_1, v_2) = I(v_1, v_2)

k_1(v_1, v_2, e_1, e_2) = k_0(v_1, v_2)\, \frac{I(e_1, e_2)\, I(δ(v_1,e_1), δ(v_2,e_2))}{|A(v_1)|\,|A(v_2)|}

k_2(v_1, v_2, e_1, e_2, e'_1, e'_2) = k_1(v_1, v_2, e_1, e_2)\, \frac{I(e'_1, e'_2)\, I(δ(δ(v_1,e_1),e'_1), δ(δ(v_2,e_2),e'_2))}{|A(v'_1)|\,|A(v'_2)|}

k_3(v_1, v_2, e_1, e_2, e'_1, e'_2, e''_1, e''_2) = k_2(v_1, v_2, e_1, e_2, e'_1, e'_2)\, \frac{I(e''_1, e''_2)\, I(δ(δ(δ(v_1,e_1),e'_1),e''_1), δ(δ(δ(v_2,e_2),e'_2),e''_2))}{|A(v''_1)|\,|A(v''_2)|}

where A(v) is the set of edges adjacent to v, δ(v, e) is the vertex on the other side of e adjacent to v, and v'_i = δ(v_i, e_i), v''_i = δ(v'_i, e'_i). Equation (3.99) can be recursively written as
k(v_1, v_2) = I(v_1, v_2) \left\{ (1 − λ) + λ \sum_{e_1 ∈ A(v_1), e_2 ∈ A(v_2)} \frac{I(e_1, e_2)}{|A(v_1)|\,|A(v_2)|}\, k(δ(v_1,e_1), δ(v_2,e_2)) \right\}
or, in matrix form,

k = (1 − λ) k_0 + λ K̃ k = (1 − λ)(k_0 + λ K̃ k_0 + λ^2 K̃^2 k_0 + · · · ) = (1 − λ)(I − λ K̃)^{-1} k_0    (3.100)
where k and k_0 are two |V_1||V_2|–dimensional vectors whose i_{v_1 v_2}–th elements are respectively k(v_1, v_2) and I(v_1, v_2), and K̃ is a (|V_1||V_2|) × (|V_1||V_2|) matrix defined as

K̃_{i_{v_1 v_2}, i_{v'_1 v'_2}} = \sum_{e_1 ∈ A(v_1) : δ(v_1,e_1) = v'_1} \;\; \sum_{e_2 ∈ A(v_2) : δ(v_2,e_2) = v'_2} \frac{I(e_1, e_2)}{|A(v_1)|\,|A(v_2)|}
Solving Equation (3.100) requires computing the solution of a system of linear equations with a (|V_1||V_2|) × (|V_1||V_2|) matrix. However, the matrix K̃ is sparse, since the number of non–zero elements is at most

|V_1||V_2| \max_{v_1 ∈ V_1} |A(v_1)| \max_{v_2 ∈ V_2} |A(v_2)|
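A compact sketch of the matrix formulation (3.100), assuming the product–graph matrix K̃ and the vector k_0 have already been built as numpy arrays:

```python
import numpy as np

def synchronized_walk_kernel(K_tilde, k0, lam, V1, V2):
    """Solve k = (1 - lam) * (I - lam * K_tilde)^{-1} k0 (Eq. 3.100) and
    average over all vertex pairs as in Eq. (3.98)."""
    n = K_tilde.shape[0]
    k = (1.0 - lam) * np.linalg.solve(np.eye(n) - lam * K_tilde, k0)
    return k.sum() / (V1 * V2)
```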
3.7.8 Walk Based Kernels
Gärtner et al. (2003) defined two polynomial–time computable graph kernels, based on label pairs and on contiguous label sequences, as the inner product in the corresponding feature spaces.
The first kernel is based on label pairs and it is useful when only the distance between all pairs of vertices with given labels has an impact on the classification of the graph. Given two labels ℓ_i, ℓ_j ∈ A, the corresponding component in the feature space of a graph G is defined as

φ_{ℓ_i, ℓ_j}(G) = \sum_{n=0}^{∞} λ_n\, |{w ∈ W_n(G) : label(v_1) = ℓ_i ∧ label(v_{n+1}) = ℓ_j}|    (3.101)

where W_n(G) is the set of all possible walks with n edges and n + 1 vertices, w = (v_1, e_1, . . . , e_n, v_{n+1}) and λ_n ∈ IR, λ_n ≥ 0. The dimensionality of the label pair feature space is equal to |A|^2: in domains in which only a few labels occur, this might be a feature space of too low dimension. An efficient way to compute the kernel is described in Gärtner et al. (2003).
To overcome the reduced dimensionality of the feature space described in Equation (3.101), we could count how many walks match a given label sequence. Let S_n be the set of all label sequences of walks with n edges and let label_i(w) be the i–th label of the walk w. The sequence feature space is indexed by each possible label sequence. In particular, for any given length n and label sequence s = s_1, s_2, . . . , s_{2n+1} ∈ S_n, the corresponding component for a graph G is given by

φ_s(G) = \sqrt{λ_n}\, |{w ∈ W_n(G) : ∀ i\; s_i = label_i(w)}|    (3.102)

Gärtner et al. (2003) show how to compute the kernel based on Equation (3.102) efficiently.
3.7.9 Tree–structured Pattern Kernel
Ramon and Gärtner (2003) introduced an efficiently computable graph kernel counting the number of common subtree patterns in two graphs. Given a graph G = (V, E), a subtree pattern is recursively defined as:
• if v ∈ V, then v is a subtree pattern of G rooted at v;
• if t_1, t_2, . . . , t_n are subtree patterns of G rooted at distinct vertices v_1, v_2, . . . , v_n respectively, and if (v, v_1), (v, v_2), . . . , (v, v_n) ∈ E, then v(t_1, t_2, . . . , t_n) is a subtree pattern of G rooted at v.
The kernel between two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2) is defined as

K_{tree,h}(G_1, G_2) = \sum_{v_1 ∈ V_1} \sum_{v_2 ∈ V_2} k_h(v_1, v_2)    (3.103)

where k_h(v_1, v_2) is the weighted count of the pairs of subtrees with the same signature and height less than or equal to h. More precisely, k_h(v_1, v_2) is recursively defined as
• if h = 1 and label(v_1) = label(v_2), then k_h(v_1, v_2) = 1;
• if h = 1 and label(v_1) ≠ label(v_2), then k_h(v_1, v_2) = 0;
• if h > 1,

k_h(v_1, v_2) = λ_{v_1} λ_{v_2} \sum_{R ∈ M_{v_1,v_2}} \prod_{(v'_1, v'_2) ∈ R} k_{h-1}(v'_1, v'_2)

where 0 ≤ λ_{v_1}, λ_{v_2} < 1 cause higher trees to have a smaller weight in the overall sum and M_{v_1,v_2} is the set of all matchings from δ^+(v_1) to δ^+(v_2), i.e.

M_{v_1,v_2} = { R ⊆ δ^+(v_1) × δ^+(v_2) | (∀ (a, b), (c, d) ∈ R : a = c ⇔ b = d) ∧ (∀ (a, b) ∈ R : label(a) = label(b)) }

where δ^+(v) = {u : (v, u) ∈ E}.
To include all the subtree patterns, we let h grow to infinity:

K_tree(G_1, G_2) = \lim_{h→∞} K_{tree,h}(G_1, G_2)    (3.104)

3.7.10 Basic Terms Kernel
Gärtner et al. (2003); Gärtner et al. (2004) proposed a general framework for defining kernels for structured data by identifying the structure of the object. Since the extent to which the semantics of the domain are reflected in the definition of the kernel is a crucial aspect for the success of kernel–based learning algorithms, strongly typed syntaxes are often used. Syntax–driven kernels are an attempt to define good kernels based on the semantics of the domain as described by the syntax of the representation.
Gärtner et al. (2003); Gärtner et al. (2004) give an account of a knowledge representation formalism which is a typed higher–order logic and then define a syntax kernel on the terms of this logic. A syntax kernel is often used in typed systems to formally describe the semantics of the data. For a syntax–driven kernel definition, one needs a knowledge representation formalism which is able to accurately and naturally model the underlying semantics of the structured data: the formalism used here is based on the principle of a typed syntax.
Individuals that are the subject of learning are represented as basic terms. A rigorous recursive definition of basic terms is reported in Gärtner et al. (2003); Gärtner et al. (2004). Informally, there are three kinds of basic terms: those that represent individuals such as natural numbers, reals, lists, trees, and so on; those that represent sets, multisets and so on; and those that represent tuples.
After introducing the knowledge representation formalism, which provides a type system that can be used to express the structure of the data, a kernel on basic terms can be defined. Its definition follows the recursive definition of basic terms (Gärtner et al., 2003; Gärtner et al., 2004).

Intuitively, given two individuals x and z represented as basic terms, if x and z are not of the same type, the kernel equals zero. Otherwise, the type of the term is identified: in the case of a tuple of basic terms, the kernel between x and z is the product of kernels on its basic terms; in the case of a set/multiset of basic terms, the kernel is the summation, over the elements of the set/multiset, of the product of two kernels, one between the elements of the set/multiset and the other on the values associated with the elements; finally, in the case that individuals are natural numbers, reals, lists, trees and so on, the syntax used for constructing the data is used to define the kernel on the data constructors and on their arguments.
3.8 Recursive Neural Networks
Connectionist models can be examined in the general framework of adaptive processing of data structures (Frasconi et al., 1998). In this setting, the
supervised learning problem is reduced to the problem of learning transductions from an input structured space to an output structured space, where
transductions are assumed to admit a recursive hidden state–space representation. Connectionist models for structured data processing are based on the
same recursive state updating scheme that characterizes the state space representation used in system theory to describe nonlinear dynamical systems.
While in sequential dynamics the state at a given time point t is a function
of the state at the previous time point t − 1, in graphical dynamics the state
at a given vertex v is a function of the states at the children of that vertex. In particular, Recursive Neural Networks (RNNs) are a generalization
of Neural Networks (NNs) capable of processing structured data as DOAGs
where a discrete or real label is associated with each vertex (Frasconi et al.,
1998). The key idea is to replicate a NN for each node of the DOAG and
to consider as input to the network both the atomic information represented
by the label and the structured information derived by the output of all the
networks instantiated for each child node. The process of replicating a NN
is called network unfolding and, as a result of this procedure, we obtain a
large network having shared weights and whose topology matches that of the
input graph.
Now RNNs are introduced in the general framework of structural transductions (Frasconi et al., 1998). Let us define some notation. A class of
skeletons # is a set of unlabelled graphs which satisfies some specified topological conditions (the skeleton skel(y) of a data structure y is obtained by
ignoring all the labels, but retaining the topology of the graph). We denote
by Y # the space of data structures with labels in Y and topology in #. A
deterministic transduction for structured domains is a relation
τ ⊆ X^# × Y^#    (3.105)
where X and Y are input and output label spaces and # is a skeleton class
contained in the class of DOAGs. We consider only IO–isomorph transductions where
skel(τ(x)) = skel(x),  ∀ x ∈ X^#    (3.106)
The concept of causality from dynamical system theory can be generalized to IO–isomorph transductions for structured domains by defining τ : X^# ↦ Y^# to be causal if the output at node v depends only on the input substructure induced by v and its descendants. An IO–isomorph transduction τ : X^# ↦ Y^# admits a recursive representation if, for each node v ∈ skel(x) in the skeleton of the input structure x, there exists a state variable ϕ(v) ∈ Ω and two functions

t : Ω^d × X × V ↦ Ω    (3.107)
f : Ω × X × V ↦ Y    (3.108)

such that

ϕ(v) = t(ϕ(ch(v)), x(v), v)    (3.109)
y(v) = f(ϕ(v), x(v), v)    (3.110)

where d is the maximum outdegree of # and ϕ(ch(v)) is the d–tuple of state variables attached to the children of vertex v. Whenever a child is missing, the corresponding entry of the tuple is replaced by a predetermined state ϕ_0 ∈ Ω called the frontier state. Note that a recursive representation only exists if the transduction is causal. The state ϕ(v) represents a summary of the information contained in the input subgraph induced by v and its descendants.
The function t is referred to as the state transition function and f as the output function. Both functions t and f in Equations (3.109) and (3.110) depend on v: if these functions are independent of the node v, the causal IO–isomorph transduction is said to be stationary. In some cases, the output space is not structured: for example, in a classification problem, only a categorical variable is associated with the whole input structure. We may then think of the skeleton of the output structure as consisting of a single vertex without edges, and give a specialized definition for transductions whose output space is not structured. A supersource transduction

τ : X^# ↦ Y    (3.111)
is defined by the following recursive procedure

ϕ(v) = t(ϕ(ch(v)), x(v), v)    (3.112)
y = f(ϕ(s))    (3.113)

where s is the supersource of the input graph and the state ϕ(s) of s summarizes all the information contained in the input data structure x.
In a connectionist model of a causal IO–isomorph supersource transduction, it is assumed that the state ϕ(v) ∈ IR^n. The state transition function t and the output function f are realized by static neural networks such as multilayered perceptrons, parametrized by a set of connection weights that have to be tuned to adapt the behaviour of the model to the training data. If stationarity is assumed, the same set of weights is replicated for each vertex of the input structure. At each node v, the NN outputs a vector encoding of the whole subgraph induced by the vertices reachable from v. Data processing takes place in a recursive fashion, traversing the DOAG in post–order and using a transition function t such that

ϕ(v) = t(ϕ(ch(v)), label(v))    (3.114)
where ϕ(v) ∈ IR^n denotes the state vector associated with node v and ϕ(ch(v)) ∈ IR^{d·n} is the vector obtained by concatenating the components of the state vectors of the d children of v. Note that we have omitted the input space X. The state transition function

t : IR^{d·n} × A ↦ IR^n    (3.115)

maps the states at v's children and the label at v into the state vector at v. A frontier state ϕ_0 = 0 is used as the base step of the recursion. A feedforward neural network can model the transition function t according to the scheme
a_j(v) = ω_{j0} + \sum_{h=1}^{|A|} ω_{jh}\, z_h(label(v)) + \sum_{k=1}^{d} \sum_{ℓ=1}^{n} w_{jkℓ}\, ϕ_ℓ(ch(v, k))
ϕ_j(v) = \tanh(a_j(v)),    j = 1, . . . , n    (3.116)
where ϕ_j(v) denotes the j–th component of the state vector at vertex v, z_h(label(v)) is a one–hot encoding of the label at node v, and ω_{jh}, w_{jkℓ} are adjustable weights. Proceeding in this fashion, the state vector

Φ_ω(x) = ϕ(s)    (3.117)
at the supersource s (the root node in the case of trees) encodes the whole data structure and can be used for subsequent processing. The prediction f(x) ∈ IR is computed by the output network as

f(x) = ⟨o, Φ_ω(x)⟩    (3.118)

where o are the weights of the output network. In the case of regression, a candidate error function to minimize is

err_{D_m}(f) = \sum_{i=1}^{m} (f(x_i) − y_i)^2    (3.119)

Minimizing the error (3.119) leads to finding a value for the parameters of the RNN and to discovering a vector–state representation of the input structures. Minimization is achieved by a variant of the gradient descent backpropagation algorithm (Goller and Kuechler, 1996).
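A minimal sketch of the recursive forward pass of Equations (3.116)–(3.118) for trees, assuming a fixed maximum outdegree d and one–hot label encodings; the parameter names mirror the symbols above, but the implementation details are my own and purely illustrative.

```python
import numpy as np

class RecursiveNet:
    """Recursive forward pass of Eqs. (3.116)-(3.118) on labelled ordered trees."""
    def __init__(self, n_labels, n_state, max_outdegree, rng=np.random.default_rng(0)):
        self.n, self.d = n_state, max_outdegree
        self.omega = rng.normal(scale=0.1, size=(n_state, n_labels + 1))  # label weights + bias
        self.w = rng.normal(scale=0.1, size=(n_state, max_outdegree * n_state))
        self.o = rng.normal(scale=0.1, size=n_state)                      # output weights

    def state(self, node):
        """phi(v) = tanh(omega [1; z(label)] + w concat(children states))."""
        child_states = [self.state(c) for c in node["children"]]
        child_states += [np.zeros(self.n)] * (self.d - len(child_states))  # frontier states
        z = np.zeros(self.omega.shape[1]); z[0] = 1.0; z[1 + node["label"]] = 1.0
        a = self.omega @ z + self.w @ np.concatenate(child_states[:self.d])
        return np.tanh(a)

    def predict(self, root):
        """f(x) = <o, Phi_omega(x)>, the encoding of the supersource."""
        return float(self.o @ self.state(root))

# A toy tree with integer labels.
tree = {"label": 0, "children": [{"label": 1, "children": []},
                                 {"label": 2, "children": []}]}
print(RecursiveNet(n_labels=3, n_state=4, max_outdegree=2).predict(tree))
```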
Part II
Preference Learning
Chapter 4
Preference Learning in Natural Language Processing
In this chapter, we deal with the first learning task involving data structures, showing that convolution kernels and recursive neural networks are both suitable approaches for supervised learning when the input is a discrete structure such as a labelled tree or graph. We compare these techniques on two large scale preference learning problems that occur in computational linguistics: prediction of first pass attachment under the strong incrementality hypothesis (Sturt et al., 2003; Costa et al., 2003a) and reranking of parse trees generated by a statistical parser (Collins and Duffy, 2001, 2002). Both problems involve learning a preference function that selects the best alternative in a set of competitors. We show how to perform preference learning in this highly structured domain and we highlight some interesting connections between these two approaches. We report on several experimental comparisons showing that in this class of problems generalization performance is determined by several factors, including the similarity measure induced by the kernel or by the adaptive internal representation of the RNN and, importantly, the loss function associated with the preference model. This chapter is based on Costa et al. (2002), Menchetti et al. (2003) and Menchetti et al. (2005c).
4.1 Introduction
Supervised learning algorithms on discrete structures such as strings, trees or graphs are very often derived from vector–based methods, using a function composition approach. In fact, if we exclude symbolic learning algorithms
such as inductive logic programming, any method that learns classification,
regression or preference functions from variable–size discrete structures must
first convert structures into a vector space and subsequently apply “traditional” learning tools to the resulting vectors.
The mapping from discrete structures to vector spaces can be realized
in alternative ways. One possible approach is to employ a kernel function:
for example, Haussler (1999) introduces convolution kernels on discrete structures, Jaakkola and Haussler (1999a) describes the Fisher kernel, Collins and
Duffy (2001) proposes a kernel for parse trees, Viswanathan and Smola (2003)
uses a kernel for string and tree matching and Lodhi et al. (2001) employs
string kernels for text classification. Similarly, Recursive Neural Networks
(RNNs) (Frasconi et al., 1998) can solve the supervised learning problem
when the input portion of the data is a labelled Directed Ordered Acyclic
Graph (DOAG). The two methods have different potential advantages and
disadvantages.
The use of kernels allows us to apply large margin classification or regression methods, such as support vector machines (Vapnik, 1998) or the voted
perceptron algorithm (Freund and Schapire, 1999). These methods have a
solid theoretical foundation and may attain good generalization even with
relatively small data sets by searching the solution of the learning problem
in an appropriately small hypothesis space. When using kernels, the feature
space representation of the input data structure is accessed implicitly and
the associated feature space may have very high or infinite dimensionality.
Separating data in very high dimensional spaces does not necessarily lead to overfitting, since the generalization of large margin methods such as SVMs scales with the ratio between the radius of the sphere containing the data points and the distance between the separating hyperplane and the closest training examples (Vapnik, 1998). Kernel methods for discrete structures are linear in an infinite–dimensional representation space, while RNNs are highly nonlinear but capable of
developing an adaptive representation space. Many kernel–based algorithms,
unlike neural networks, typically minimize a convex functional, thus avoiding
the difficult problem of dealing with local minima. However, this problem
is only partially avoided. In fact, the kernel function usually needs to be
tuned/adapted to the problem at hand, a problem which cannot in general
be cast as a convex optimization problem. A typical example is tuning the
variance of a Gaussian kernel. Learning the kernel function is still an open
problem, particularly in the case of discrete structures. When using kernels
such as the spectrum kernel (Leslie et al., 2002a), the string kernel (Lodhi
et al., 2001) or the parse tree kernel (Collins and Duffy, 2001), the mapping
from discrete structures to the feature space is fixed before learning by the
choice of the kernel function and remains unchanged during all the learning
procedure. The Fisher kernel (Jaakkola and Haussler, 1999a), if applied for
example to strings, introduce some degree of adaptation with respect to the
distribution of training instances. A non–optimally chosen kernel may lead
to very sparse representations and outweigh the benefits of the subsequent
large margin methods.
On the other hand, RNNs operate by composing two adaptive functions.
First, a discrete structure is recursively mapped into a low–dimensional vector
by an adaptive function Φ. Second, the output is computed by a feedforward
neural network that takes as argument the vector representation computed
in the first step. Thus the role played by Φ in RNNs is similar to that of
the kernel function but it is carried out by means of an adaptive function,
leading to vector representations that are focused on the particular learning
task.
4.2 An Introduction to the Parsing Problem
Most Natural Language Processing (NLP) tasks deal with discrete structures such as strings, trees and graphs. A classical task in NLP is the parsing problem, which consists in learning a mapping from an input sentence to a parse tree that describes the syntactic relations between the words in the sentence. A parse tree is an ordered composite structure in which each node is labelled by a non–terminal symbol and the leaves are part–of–speech tags or sentence words. Each parse tree has a root node which spans the whole sentence, and the number of children of a node is related to the production rules. A production rule describes how non–terminal symbols are expanded into terminal or non–terminal symbols. The set of production rules forms the grammar, and a parse tree is a composition of production rules obeying some constraints. For each sentence, there are many parse trees which satisfy the grammar rules, as a consequence of the ambiguity of the language. A possible solution to this problem is to use Probabilistic Context Free Grammars (PCFGs), which assign a probability to each context–free rule in the grammar. The probability of a parse tree is the product of the probabilities of the rules constituting the tree, so the best tree is the one with the largest score.
Classical learning algorithms cannot be directly applied to these discrete structures, so we have to convert them into feature vectors. Collins and Duffy (2001, 2002) propose a convolution kernel for parse trees generated by a natural language parser that can be applied to any generic ordered tree with labels in a finite set, for example the non–terminal symbols in natural language processing. Given a set of production rules generated by a treebank of parse trees, the feature space is defined as the set of all the tree fragments that can be built from the set of production rules, with the only constraint that a production rule cannot be divided into further subparts (Bod, 2001). The number of elements of the feature space grows exponentially with the number of rules in the grammar: these elements range from fragments constituted by a single rule to complex trees constituted by many rules, with several occurrences of the same rule. An overview of this kernel is reported in Section 3.6.1. Since SVMs are computationally demanding (in our experiments on the two large scale preference learning problems there are about 10^8 training examples), we use the VP algorithm (Freund and Schapire, 1999), a more efficient online algorithm for binary classification whose performance tends to that of maximal–margin classifiers and whose convergence is reasonably fast (see Chapter 2 for a complete overview of VP).
Recently, RNNs, a generalization of neural networks capable of processing structured data such as DOAGs, have also been successfully used to process linguistic data such as syntactic trees in order to model psycholinguistic preferences in the incremental parsing process (Costa et al., 2003a; Sturt et al.,
2003). The key idea is to replicate a NN for each node of the DOAG and
consider as input to the network both the atomic information represented
by the label and the structured information derived from the outputs of all the
networks instantiated for each child node. An overview of RNNs is reported
in Section 3.8.
4.3 Ranking and Preference Problems
Work on learning theory has mostly concentrated on classification and regression. However, there are many applications in which it is desirable to
order rather than to classify instances: these problems arise frequently in social sciences, in information retrieval, in econometric models and in classical
statistics where human preferences play a major role.
In a general supervised learning task, the goal is to learn a function
f : X 7→ Y which best models the probabilistic relation between the input
space X and the target space Y, minimizing a risk functional that depends on a
predefined loss function. The properties of the target set Y define different
learning problems:
• if Y is a finite unordered set, we have a classification problem;
• if Y is a metric space, e.g., the set of real numbers, we have a regression
problem.
Ordinal regression, partial ranking and preference model tasks do not fit in either of the two previous classes, but share properties of both classification and regression problems:
• Y is a finite ordered set;
• Y is not a metric space, that is, the distance between two elements is not defined.
So, as in classification problems, we have to assign a label to a new instance, but, similarly to regression problems, the label set admits an order relation (see Figure 4.1). More precisely, in a ranking problem we have to
Figure 4.1. Relation between ranking, preference, classification and regression problems.
sort a set of competing alternatives by their importance, while in a preference
problem we are only interested in the best element: note that the preference
problem is a particular case of ranking.
In a ranking problem, let D_m = {(x_{i1}, y_{i1}), . . . , (x_{ik_i}, y_{ik_i})}_{i=1}^m be a data set, where (x_{i1}, . . . , x_{ik_i}) is the i–th sequence of competing instances, x_{ij} ∈ X and y_{ij} ∈ IN is the rank of x_{ij}, with y_{ij} ∈ {1, . . . , k_i}: in this setting, x_{ij} precedes x_{ik} (written x_{ij} ≺ x_{ik}) if y_{ij} < y_{ik}. In a preference problem, let D_m = {(x_{i1}, . . . , x_{ik_i}), y_i}_{i=1}^m be a data set, where y_i ∈ IN, with y_i ∈ {1, . . . , k_i}, is the index of the best element. In the case of ranking, we learn a function f_RANK : X^* ↦ P^* that maps sequences of instances into corresponding ranks. Here X^* is the set of all sequences of instances in X and P^* = ∪_k P_k, where P_k is the set of permutations of the first k integers. By writing f_RANK(x_{i1}, . . . , x_{ik_i}) = (π_{i1}, . . . , π_{ik_i}) we mean that π_{ij} is the rank assigned to the element x_{ij}. A suitable loss function for ranking should penalize predicted permutations that are too different from the correct one. In the particular case of preference learning, we are only interested in ranking high the best element of the sequence, indexed by π_i = arg min_{j=1,...,k_i} {π_{ij}}. Hence a natural family of
preference 0–1 loss functions is

L_r(f_RANK(x_{i1}, . . . , x_{ik_i}), y_i) = \begin{cases} 1 & \text{if } π_{i y_i} > r \\ 0 & \text{if } π_{i y_i} ≤ r \end{cases}    (4.1)

where r = 1, . . . , k_i is the number of top positions within which the correct element must be ranked before counting an error. In particular, L_1 measures the number of sequences whose best element is not ranked first. As detailed below, both kernel methods and RNNs rely on the definition of a suitable utility function to realize f_RANK.
There are many works in the literature that propose solutions for these problems. Herbrich et al. (2000) investigates the problem of ordinal regression and uses a large margin algorithm, based on a mapping from objects to scalar utility values, for classifying pairs of objects. Herbrich et al. (1998) deals with the task of learning a preference relation from a given set of ranked documents. The problem is reformulated as a classification problem on pairs of documents, where each document is mapped to a scalar utility value. Crammer and Singer (2002b) discuss the problem of ranking instances. They describe an efficient online algorithm, similar to the perceptron algorithm, that projects the instances into sub–intervals of the reals: each interval is associated with a distinct rank. Cohen et al. (1999) also consider the problem of ranking instances. They describe a two–stage approach: first, a binary preference function indicating whether an instance is better than another is learned; then new instances are ordered so as to maximize the agreement with the learned preference function. Crammer and Singer (2002a) and Elisseeff and Weston (2002) address the problem of multi–labelled documents; both maintain a set of prototypes associated with topics. Elisseeff and Weston (2002) reduce the multi–label problem to multiple binary problems by comparing all pairs of labels. Crammer and Singer (2002a) suggest an online algorithm, similar to the perceptron algorithm, that updates the prototypes only if the predicted ranking is not perfect. Joachims (2002b) describes a method to rerank the results of a search engine, adapting them to a particular group of users: it uses an SVM classifier on pairs of examples.
4.3.1 The Utility Function Approach
The utility function approach is a method for ranking a set of competing alternatives based on scores assigned to the objects by a function which measures the importance of the alternatives. In detail, the importance of an instance is estimated by introducing a utility function U : X ↦ IR which maps objects into real numbers: given x, z ∈ X such that x ≺ z, then U(x) > U(z). To rank a sequence of objects, we can use the values assigned by the utility function to the instances. In this way, ranking and preference problems are reduced to learning a utility function U. Since x ≺ z ⇔ U(x) > U(z), the rank of x_{ij} is

π_{ij} = \sum_{r=1}^{k_i} θ(U(x_{ir}) − U(x_{ij}))    (4.2)
where θ is the Heaviside function. In the following, we focus on learning preferences: in this case we only have to select the best element

π_i = \arg\max_{j=1,...,k_i} U(x_{ij})    (4.3)
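A small sketch of Equations (4.2) and (4.3) for an arbitrary utility function; the toy utility below is used purely for illustration.

```python
def ranks(utilities):
    """Eq. (4.2): rank of each alternative, 1 = best (theta(0) counts the element itself)."""
    return [sum(1 for ur in utilities if ur >= uj) for uj in utilities]

def best_index(utilities):
    """Eq. (4.3): index of the preferred alternative."""
    return max(range(len(utilities)), key=lambda j: utilities[j])

U = lambda x: -abs(x - 3.0)            # toy utility
xs = [1.0, 2.5, 4.0]
print(ranks([U(x) for x in xs]))       # [3, 1, 2]
print(best_index([U(x) for x in xs]))  # 1
```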
4.3.2 Recursive Neural Networks Preference Model
For the connectionist approach, we consider only the preference task. The
utility function U (x) is implemented by a neural network having a single
linear output: in more detail, the neural network architecture is formed by
an encoder that maps a tree into a real vector representation (the state vector
computed by the recursive network on the root node of the tree itself) and
by a feed–forward output network that performs the final mapping into a
real number (Costa et al., 2003a).
Consider the multinomial variable Y_i representing the index of the correct element in the i–th input sequence of competing instances. The conditional probability that Y_i = j, i.e. that x_{ij} is ranked first by the utility function realized as a neural network with parameters ω, is estimated using the softmax function as follows:

P(Y_i = j | x_{i1}, . . . , x_{ik_i}, ω) = \frac{e^{U(x_{ij})}}{\sum_{ℓ=1}^{k_i} e^{U(x_{iℓ})}}    (4.4)
Under this model, the likelihood function is

L(D_m, ω) = \prod_{i=1}^{m} P(Y_i = y_i | x_{i1}, . . . , x_{ik_i}, ω)    (4.5)

Learning proceeds by minimizing the negative log–likelihood

err^{PREF}_{D_m}(U) = −\sum_{i=1}^{m} \log \frac{e^{U(x_{i y_i})}}{\sum_{ℓ=1}^{k_i} e^{U(x_{iℓ})}} = \sum_{i=1}^{m} \log \sum_{ℓ=1}^{k_i} e^{U(x_{iℓ}) − U(x_{i y_i})}    (4.6)

with respect to the model parameters ω (for a complete derivation see Costa et al. (2003a)). A backpropagation gradient descent algorithm for this purpose can be easily defined by injecting as error signals the partial derivatives of err^{PREF}_{D_m}(U) with respect to U(x_{iℓ}) (Goller and Kuechler, 1996).
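A sketch of the preference loss of Equations (4.4)–(4.6), written for a generic utility function that returns one score per alternative; gradients and the recursive encoder are omitted, and numpy is used only for convenience.

```python
import numpy as np

def preference_nll(utilities, best_index):
    """Negative log-likelihood of Eq. (4.6) for one sequence of alternatives:
    -log softmax(U)[best_index], computed in a numerically stable way."""
    u = np.asarray(utilities, dtype=float)
    u = u - u.max()                      # stabilize the exponentials
    return float(np.log(np.exp(u).sum()) - u[best_index])

# Sequence of three competing parses with the first one preferred.
print(preference_nll([2.0, 0.5, -1.0], best_index=0))
```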
4.3.3 Kernel Ranking and Preference Model
The approach proposed in Cohen et al. (1999) and also followed in Collins and Duffy (2001, 2002) starts from a simple linear model where the utility function U is parametrized by a vector w = (w_1, . . . , w_n) such that

U(x) = ⟨w, x⟩ = \sum_{i=1}^{n} w_i x_i    (4.7)
where

x ≺ z ⇒ U(x) > U(z) ⇔ ⟨w, x⟩ > ⟨w, z⟩ ⇔ ⟨w, x − z⟩ > 0    (4.8)

So we see that the relation x ≺ z is expressed in terms of the difference x − z between input vectors. The difference x − z can be interpreted as a new instance whose attribute values are differences between the attribute values of a pair of objects. Thus, in a general ranking problem, we can use only pairwise comparisons between instances and look for a w which satisfies the following constraints

⟨w, x_{ij} − x_{ik}⟩ > 0,    i = 1, . . . , m,    j, k = 1, . . . , k_i : y_{ij} < y_{ik}    (4.9)
As a special case, in a preference model we have

⟨w, x_{i y_i} − x_{ij}⟩ > 0,    i = 1, . . . , m,    j = 1, . . . , k_i, j ≠ y_i    (4.10)

The number of constraints in Equation (4.9) grows quadratically in the size of the sequences of alternatives,

m' = \sum_{i=1}^{m} \frac{k_i(k_i − 1)}{2}    (4.11)

while it is linear in Equation (4.10),

m' = \sum_{i=1}^{m} (k_i − 1) = \sum_{i=1}^{m} k_i − m    (4.12)
This is equivalent to transforming the original data set D_m into a new one D'_{m'} whose elements are pair differences:

D'_{m'} = {(x_{ij} − x_{ik}, y_{ijk})},    i = 1, . . . , m,    j, k = 1, . . . , k_i : y_{ij} < y_{ik}    (4.13)

for the ranking problem and

D'_{m'} = {(x_{i y_i} − x_{ij}, y_{ij})},    i = 1, . . . , m,    j = 1, . . . , k_i, j ≠ y_i    (4.14)

for the preference problem. Note that y_{ijk} = 1 and y_{ij} = 1 for all i, j, k, since the first element of each pair is ranked before the second one. In this way, ranking and preference problems are reduced to the binary classification of pairwise differences between instances. In other words, we learn a function f_PAIR : X × X ↦ {+1, −1} with an associated 0–1 loss function

L_0(f_PAIR(x_{ij}, x_{iℓ}), z_{ijℓ}) = \frac{1 − f_PAIR(x_{ij}, x_{iℓ})\, z_{ijℓ}}{2}    (4.15)
associated with the training data in the case of a preference model using the
0–1 pair loss (4.15) is
errPAIR
Dm (U ) =
ki
m
X
X
L0 (fPAIR (xiyi , xij ), ziyi j )
i=1 j=1j6=yi
122
(4.16)
as opposed to the correct empirical error under the preference model

err^{PREF}_{D_m}(U) = \sum_{i=1}^{m} L_1(f_RANK(x_{i1}, . . . , x_{ik_i}), y_i)    (4.17)

So the preference and ranking problems are approximated by a pairwise model using a 0–1 loss function that works on pairs and not on the whole sequence of alternatives; moreover, the best element is used as many times as the number of elements k_i of a sequence of alternatives, while the other elements are used only once.
A non–linear utility function can be easily realized by introducing a kernel function. During the training procedure, we have to compute the value

f(x_{ij} − x_{ik}) = ⟨w, φ(x_{ij}) − φ(x_{ik})⟩ = \sum_{t=1}^{n_{SV}} a_{s_t p_t q_t} ⟨φ(x_{s_t p_t}) − φ(x_{s_t q_t}), φ(x_{ij}) − φ(x_{ik})⟩ = \sum_{t=1}^{n_{SV}} a_{s_t p_t q_t} K_∆(x_{s_t p_t} − x_{s_t q_t}, x_{ij} − x_{ik})    (4.18)

where

K_∆ = K(x_{s_t p_t}, x_{ij}) − K(x_{s_t p_t}, x_{ik}) − K(x_{s_t q_t}, x_{ij}) + K(x_{s_t q_t}, x_{ik})    (4.19)
for appropriate coefficients a_{s_t p_t q_t} and indexes s_t, p_t and q_t which select the SVs. Since the examples are replaced by differences between pairs of examples, the SVs are also differences between pairs of instances. Consequently, each computation of the kernel is substituted by the evaluation of four kernels. Computing the prediction f(x), however, involves only the evaluation of two kernels:

f(x) = ⟨w, φ(x)⟩ = \sum_{t=1}^{n_{SV}} a_{s_t p_t q_t} ⟨φ(x_{s_t p_t}) − φ(x_{s_t q_t}), φ(x)⟩ = \sum_{t=1}^{n_{SV}} a_{s_t p_t q_t} (K(x_{s_t p_t}, x) − K(x_{s_t q_t}, x))
4.3.4 Cancelling Out Effect
In the utility function approach to ranking and preference problems applied
to kernel machines, each single instance is replaced by the difference between
a pair of instances: so each kernel computation is replaced by four kernel
evaluations as explained by Equation (4.19)
K∆ (x1 − x2 , z 1 − z 2 ) = K(x1 , z 1 ) − K(x1 , z 2 ) − K(x2 , z 1 ) + K(x2 , z 2 )
Suppose each object is composed of a set of parts, let x = (x^1, . . . , x^D) and z = (z^1, . . . , z^D) be the parts of x and z, and assume that the kernel between single objects is a linear combination of kernels on parts,

K(x, z) = \sum_{d=1}^{D} c_d k_d(x^d, z^d)    (4.20)

where c_d, d = 1, . . . , D are the coefficients of the linear combination and the k_d are kernels on parts. If each pair of instances shares a common set of parts, that is, if

x_1 = (x_C, x^1_N),  x_2 = (x_C, x^2_N)
z_1 = (z_C, z^1_N),  z_2 = (z_C, z^2_N)

where x_C is the part common to x_1 and x_2, z_C is the part common to z_1 and z_2, and x^1_N, x^2_N, z^1_N, z^2_N are the non–common parts, we have the cancelling out problem. The computation of the four kernels becomes

K(x_1, z_1) = K_C(x_C, z_C) + K_N(x^1_N, z^1_N)
K(x_1, z_2) = K_C(x_C, z_C) + K_N(x^1_N, z^2_N)
K(x_2, z_1) = K_C(x_C, z_C) + K_N(x^2_N, z^1_N)
K(x_2, z_2) = K_C(x_C, z_C) + K_N(x^2_N, z^2_N)

where K_C is the value of the kernel computed on the common set of parts and K_N is the kernel on the non–common parts. So the final value of the kernel is

K_∆(x_1 − x_2, z_1 − z_2) = K_N(x^1_N, z^1_N) − K_N(x^1_N, z^2_N) − K_N(x^2_N, z^1_N) + K_N(x^2_N, z^2_N)
The value K_C(x_C, z_C) of the kernel between the two common sets of parts is cancelled out from the kernel result: this could be an undesired effect if, for example, x_C and z_C represent a context or useful information about the set of alternatives. A possible solution to the cancelling out problem is to introduce some non–linearities (for example, a polynomial or Gaussian kernel, or a normalization) in the computation of the kernel K between single instances or in the computation of the kernel K_∆ between pairs of objects.
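A tiny numerical illustration of the cancelling–out effect under the additive part decomposition of Equation (4.20), using plain dot products as part kernels; the part vectors are made up for this example.

```python
def k_parts(x_parts, z_parts):
    """Additive kernel of Eq. (4.20) with unit coefficients and dot-product part kernels."""
    return sum(sum(a * b for a, b in zip(xp, zp)) for xp, zp in zip(x_parts, z_parts))

def k_delta(x1, x2, z1, z2):
    return k_parts(x1, z1) - k_parts(x1, z2) - k_parts(x2, z1) + k_parts(x2, z2)

common_x, common_z = (1.0, 2.0), (3.0, 1.0)
x1 = [common_x, (1.0, 0.0)]; x2 = [common_x, (0.0, 1.0)]   # same first (common) part
z1 = [common_z, (2.0, 0.0)]; z2 = [common_z, (0.0, 2.0)]
print(k_delta(x1, x2, z1, z2))                      # 4.0: the common part contributes nothing
print(k_delta([x1[1]], [x2[1]], [z1[1]], [z2[1]]))  # 4.0: non-common parts alone give the same value
```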
4.4 Preference Model for SVMs and VP
The utility function approach illustrated in Section 4.3.1 can be extended to the non–linear case by introducing a kernel function which projects instances into a high dimensional feature space, as described in Section 4.3.3. In the next two sections, we give a detailed description of how the utility function approach can be combined with SVMs and VP in the case of a preference problem.
4.4.1 SVMs and Preference Model
As already shown in Section 4.3.3, learning preference relations reduces to a standard classification problem if pairs of objects are considered. Given a preference dataset D = {(x_{i1}, . . . , x_{ik_i}), y_i}_{i=1}^m, where y_i is the index of the best competing instance, to simplify the notation we suppose that y_i = 1, i = 1, . . . , m, i.e. that the preferred instance is the first one. Then we build all the pair differences with the best instance and use them as new examples. The primal SVM problem (1.51) can be modified in the following way:

min_{w,ξ} \frac{1}{2}⟨w, w⟩ + C \sum_{i=1}^{m} \sum_{j=2}^{k_i} ξ_{ij}
subject to y_{ij} ⟨w, φ(x_{i1}) − φ(x_{ij})⟩ ≥ 1 − ξ_{ij},
           ξ_{ij} ≥ 0,    i = 1, . . . , m,    j = 2, . . . , k_i    (4.21)

where y_{ij} = 1, since we take the differences between the best alternative and the other ones. Note that the offset b is zero because it is a constant term which is deleted when the difference between two examples is computed. Moreover, the problem is also symmetrical: we can consider x_{i1} − x_{ij} as a positive example or, equivalently, x_{ij} − x_{i1} as a negative one.
The dual formulation (1.55) remains the same, but now α ∈ IR^{m'}, where the value m' is given by Equation (4.12) and Q is an m' × m' matrix

Q_{iq,jp} = y_{iq} y_{jp} ⟨φ(x_{i1}) − φ(x_{iq}), φ(x_{j1}) − φ(x_{jp})⟩ = y_{iq} y_{jp} (K(x_{i1}, x_{j1}) − K(x_{i1}, x_{jp}) − K(x_{iq}, x_{j1}) + K(x_{iq}, x_{jp})),    i, j = 1, . . . , m,    q, p = 2, . . . , k_i    (4.22)
The solution of problem (4.21) as a function of the dual variables α_{ij} is

w = \sum_{i=1}^{m} \sum_{j=2}^{k_i} y_{ij} α_{ij} (φ(x_{i1}) − φ(x_{ij}))    (4.23)

and the margin of an example x is

f(x) = ⟨w, φ(x)⟩ = \sum_{i=1}^{m} \sum_{j=2}^{k_i} y_{ij} α_{ij} ⟨φ(x_{i1}) − φ(x_{ij}), φ(x)⟩ = \sum_{i=1}^{m} \sum_{j=2}^{k_i} y_{ij} α_{ij} (K(x_{i1}, x) − K(x_{ij}, x))    (4.24)
Note that now the support vectors are pairs of training examples. This approach can be extended to the complete ranking problem by simply changing the set of constraints

min_{w,ξ} (1/2)⟨w, w⟩ + C Σ_{i=1}^{m} Σ_{j,k=1 : x_{ij} ≺ x_{ik}}^{k_i} ξ_{ijk}
subject to y_{ijk} ⟨w, φ(x_{ij}) − φ(x_{ik})⟩ ≥ 1 − ξ_{ijk}          (4.25)
           ξ_{ijk} ≥ 0,   i = 1, . . . , m,   j, k = 1, . . . , k_i : x_{ij} ≺ x_{ik}

where y_{ijk} = 1 since x_{ij} ≺ x_{ik}. The main drawback of SVMs is that the optimization algorithm is computationally demanding and they cannot be used if the number of training examples is large.
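As a rough illustration of how the pairwise-difference Gram matrix of Equation (4.22) could be assembled from a base kernel, consider the following sketch (an assumption-laden example, not the thesis implementation): `forests` is a list of forests with the preferred alternative stored first, the labels y_{ij} = 1 are left implicit, and the base kernel is an arbitrary callable.

import numpy as np

def preference_gram(forests, kernel):
    """Build the Gram matrix of Eq. (4.22) over the pair differences (x_i1 - x_ij),
    assuming the preferred alternative is the first element of each forest."""
    pairs = [(i, j) for i, forest in enumerate(forests)
                    for j in range(1, len(forest))]   # 0-based j = 1..k_i-1, i.e. j = 2..k_i
    n = len(pairs)
    Q = np.zeros((n, n))
    for a, (i, q) in enumerate(pairs):
        for b, (j, p) in enumerate(pairs):
            Q[a, b] = (kernel(forests[i][0], forests[j][0])
                       - kernel(forests[i][0], forests[j][p])
                       - kernel(forests[i][q], forests[j][0])
                       + kernel(forests[i][q], forests[j][p]))
    return pairs, Q

# Toy usage with a linear kernel on feature vectors.
lin = lambda x, z: float(np.dot(x, z))
forests = [[np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])],
           [np.array([0.0, 2.0]), np.array([1.0, 1.0])]]
pairs, Q = preference_gram(forests, lin)
print(pairs)   # [(0, 1), (0, 2), (1, 1)]
print(Q)       # m' x m' matrix, m' = total number of non-best alternatives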
4.4.2 VP and Preference Model

The preference model can be integrated into the VP algorithm to obtain an efficient on–line learning algorithm. The method exploits the same trick as SVMs, where each instance is replaced by the difference between a pair of examples. During the learning phase, the standard VP algorithm needs to compute

f_k(x_i) = ⟨w_k, φ(x_i)⟩          (4.26)

to classify the current example. If x_i is replaced by the difference between a pair of competing examples x_{i1} − x_{ij}, where x_{i1} is the preferred instance, we obtain

f_k(x_{i1} − x_{ij}) = ⟨w_k, φ(x_{i1}) − φ(x_{ij})⟩
                     = Σ_{ℓ=1}^{k} y_{q_ℓ p_ℓ} ⟨φ(x_{q_ℓ 1}) − φ(x_{q_ℓ p_ℓ}), φ(x_{i1}) − φ(x_{ij})⟩          (4.27)
                     = Σ_{ℓ=1}^{k} y_{q_ℓ p_ℓ} (K(x_{q_ℓ 1}, x_{i1}) − K(x_{q_ℓ 1}, x_{ij}) − K(x_{q_ℓ p_ℓ}, x_{i1}) + K(x_{q_ℓ p_ℓ}, x_{ij}))
where y_{q_ℓ p_ℓ} = 1 is the target of x_{q_ℓ 1} − x_{q_ℓ p_ℓ} and the indexes q_ℓ, p_ℓ select the misclassified pairs of examples. Now the list J of misclassified examples contains pairs of examples rather than single instances, and

w_k = Σ_{ℓ=1}^{k} y_{q_ℓ p_ℓ} (φ(x_{q_ℓ 1}) − φ(x_{q_ℓ p_ℓ}))          (4.28)
for appropriate indexes q_ℓ and p_ℓ. The margin of an instance x is

f(x) = Σ_{k=1}^{E} c_k ⟨w_k, φ(x)⟩
     = Σ_{k=1}^{E} c_k Σ_{ℓ=1}^{k} y_{q_ℓ p_ℓ} ⟨φ(x_{q_ℓ 1}) − φ(x_{q_ℓ p_ℓ}), φ(x)⟩          (4.29)
     = Σ_{k=1}^{E} c_k Σ_{ℓ=1}^{k} y_{q_ℓ p_ℓ} (K(x_{q_ℓ 1}, x) − K(x_{q_ℓ p_ℓ}, x))
4.4.2.1 Dual VP and Preference Model

The dual formulation of the VP algorithm for the preference model involves the computation of the prediction function and of the corresponding dual variables.
There are two different ways to write the prediction function on an example x: we can sum over pairs of examples as in Equation (4.30) or over single examples as in Equation (4.31):

f_k(x) = ⟨w_k, φ(x)⟩
       = Σ_{i=1}^{m} Σ_{j=2}^{k_i} β_{ij}^k (K(x_{i1}, x) − K(x_{ij}, x))          (4.30)
       = Σ_{i=1}^{m} Σ_{j=1}^{k_i} γ_{ij}^k K(x_{ij}, x)          (4.31)

where the relation between γ_{ij}^k and β_{ij}^k is

γ_{ij}^k = Σ_{j=2}^{k_i} β_{ij}^k   if j = 1
γ_{ij}^k = −β_{ij}^k                if j = 2, . . . , k_i          (4.32)

and

β_{ij}^k = Σ_{t=1}^{τ_k} θ(−y_{ij} f_{k_t}(x_{i1} − x_{ij}))          (4.33)

where 0 ≤ τ_k ≤ T is the current epoch number after k mistakes and k_1, . . . , k_{τ_k} are the indexes of the training prediction functions which misclassified x_{i1} − x_{ij}. The weight vector w_k after k errors can be written as

w_k = Σ_{i=1}^{m} Σ_{j=2}^{k_i} β_{ij}^k (φ(x_{i1}) − φ(x_{ij}))          (4.34)
The prediction function using Equation (4.30) becomes

f(x) = Σ_{k=1}^{E} c_k f_k(x)
     = Σ_{k=1}^{E} c_k Σ_{i=1}^{m} Σ_{j=2}^{k_i} β_{ij}^k (K(x_{i1}, x) − K(x_{ij}, x))          (4.35)
     = Σ_{i=1}^{m} Σ_{j=2}^{k_i} Σ_{k=1}^{E} c_k β_{ij}^k (K(x_{i1}, x) − K(x_{ij}, x))          (4.36)
and the corresponding dual variables are

α_{ij} = Σ_{k=1}^{E} c_k β_{ij}^k,   i = 1, . . . , m,   j = 2, . . . , k_i          (4.37)

Note that it is useless to compute α_{i1}, i = 1, . . . , m: this coefficient would correspond to the pair composed of the difference between the best element and itself. Alternatively, the prediction function can be reformulated using Equation (4.31) as

f(x) = Σ_{k=1}^{E} c_k f_k(x)
     = Σ_{k=1}^{E} c_k Σ_{i=1}^{m} Σ_{j=1}^{k_i} γ_{ij}^k K(x_{ij}, x)          (4.38)
     = Σ_{i=1}^{m} Σ_{j=1}^{k_i} Σ_{k=1}^{E} c_k γ_{ij}^k K(x_{ij}, x)          (4.39)

and the dual variables become

α_{ij} = Σ_{k=1}^{E} c_k γ_{ij}^k,   i = 1, . . . , m,   j = 1, . . . , k_i          (4.40)
The dual VP training algorithm combined with the preference model is detailed in Algorithm 4.1, while the prediction function is described in Algorithm 4.2.
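As a rough sketch (an assumption, not the thesis code), the prediction step of Algorithm 4.2 and its use for selecting the preferred element of a new forest might look as follows; `forests` holds the training forests with the preferred element first, `alphas` maps pairs (i, j) to their non-zero dual variables, and `kernel` is any base kernel.

import numpy as np

def vp_preference_predict(x, alphas, forests, kernel):
    """Algorithm 4.2: f(x) = sum_{i,j: alpha_ij != 0} alpha_ij (K(x_i1, x) - K(x_ij, x))."""
    f = 0.0
    for (i, j), a in alphas.items():
        if a != 0.0:
            f += a * (kernel(forests[i][0], x) - kernel(forests[i][j], x))
    return np.sign(f), f

def rank_forest(candidates, alphas, forests, kernel):
    """Select the preferred alternative of a new forest: the one with the largest margin."""
    scores = [vp_preference_predict(c, alphas, forests, kernel)[1] for c in candidates]
    return int(np.argmax(scores))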
4.5 Applications to Natural Language
We introduce two tasks in natural language processing that can be modelled
as learning preferences over structured data. Both problems can be formulated as the task of selecting the best element in a set of (partial) parse
trees.
4.5.1 The First Pass Attachment
Resolution of syntactic ambiguities is a fundamental problem in natural language processing and learning is believed to play a crucial disambiguation
Algorithm 4.1 Dual–VP–Preference–Training–Algorithm(D_m, γ)
Input: A binary labelled training set D_m = {(x_i, y_i)}_{i=1}^m and margin γ
Output: Dual variables α_{ij}, i = 1, . . . , m, j = 2, . . . , k_i
Require: γ ≥ 0
1: k ← 0, w_0 ← 0, c_0 ← 0
2: SV_0 ← {∅} {list of index pairs of support vectors}
3: for i = 1 to m, j = 2 to k_i do {initialize dual variables}
4:   β_{ij}^0 ← 0; α_{ij}^0 ← 0
5: end for
6: for t = 1 to T do
7:   for i = 1 to m do
8:     γ_{i1}^k ← f_k(x_{i1})
9:     for j = 2 to k_i do
10:      if γ_{i1}^k − f_k(x_{ij}) ≤ γ then {f_k made the (k + 1)–th error on x_{ij}}
11:        for (q, p) ∈ SV_k (q = 1 to m, p = 2 to k_i : β_{qp}^k > 0) do {update α_{qp}^k}
12:          α_{qp}^k ← α_{qp}^{k−1} + β_{qp}^k c_k {only c_k is known and not c_{k+1}}
13:        end for
14:        k ← k + 1 {k becomes k + 1}
15:        β_{ij}^k ← β_{ij}^{k−1} + 1
16:        c_k ← 1
17:        if (i, j) ∉ SV_{k−1} then
18:          SV_k ← SV_{k−1} ∪ {(i, j)}
19:        end if
20:      else {no error on x_{ij}}
21:        c_k ← c_k + 1
22:      end if
23:    end for
24:  end for
25: end for
26: for (i, j) ∈ SV_E (i = 1 to m, j = 2 to k_i : β_{ij}^E > 0) do
27:   α_{ij}^E ← α_{ij}^{E−1} + β_{ij}^E c_E
28: end for
29: return α_{ij}, i = 1, . . . , m, j = 2, . . . , k_i
Algorithm 4.2 Dual–VP–Preference–Prediction(x, α, K)
Input: A new example x, dual variables α and a kernel K
Output: The predicted label ŷ of x
1: f(x) ← Σ_{i,j : α_{ij} ≠ 0} α_{ij} (K(x_{i1}, x) − K(x_{ij}, x))
2: ŷ ← sign(f(x))
3: return ŷ
role in the human language processing system. For example, consider the sentence

    The servant of the actress who was on the balcony died.          (4.41)

where ambiguity resolution amounts to determining which noun the pronoun who refers to. Cuetos and Mitchell (1988) found that native English speakers prefer lower attachment (the actress was on the balcony), while native Spanish speakers prefer higher attachment (the servant was on the balcony) when confronted with the Spanish equivalent of the sentence. The subsequent tuning hypothesis (Mitchell et al., 1995) states that parsing choices are affected by the exposure to the different statistical regularities of languages. Under a second widely accepted and experimentally validated assumption in psycholinguistics, human processing is incremental, i.e. sentences are parsed left–to–right, maintaining at every time a connected syntactic structure that is incrementally augmented when new words arrive. An interesting problem in psycholinguistics is therefore to determine the structural preferences exhibited while interpreting a sentence in a sequential fashion.
This approach can be formalized by introducing a dynamic grammar
(Lombardo et al., 1998) where states are incremental trees Tk (that span
the first k words in a sentence) and state transitions are obtained by attaching a substructure called connection path (CP) to the previous tree to
obtain a new incremental tree Tk+1 . A corpus based set of CPs can be readily
obtained from a treebank (Lombardo and Sturt, 2002). Note that at each
stage of the elaboration, the interpretation (i.e. the syntactic parse tree) is a
fully connected structure as opposed to traditional parsing procedures that
determine smaller sub–structures to be joined at later stages. The lexical
items are represented by their grammatical category (like verb, noun, etc.), called part of speech (POS) tag, rather than by the actual words. The node on the right frontier of T_k where the attachment occurs is called an anchor¹, while the POS–tag of the word in the CP is called a foot. In this framework, ambiguity resolution reduces to the first–pass attachment problem, illustrated in Figure 4.2. In general, the dynamic grammar introduces some ambiguities
Figure 4.2. The two main syntactic interpretations of sentence (4.41) can be obtained by attaching the same CP to one of the two alternative anchors. In general, several CPs and several attachments for each CP are possible.
that generate several licensed parse trees (only one of which is correct), and several CPs may be attached to a tree T_k. For the attachment to be admissible, it suffices that a matching anchor is found in the right frontier of T_k and that the POS–tag of the new word matches the foot of the CP. The resulting forest of admissible incremental trees that include the next word may contain hundreds of alternative trees when using realistic wide coverage corpora. So, given a syntactic interpretation for a fragment of a sentence, called left context, and a new lexical item that we want to attach to the left context, we have to choose the right anchor and CP among a set of admissible candidates (see Figure 4.3). Disambiguation can be formulated as the problem of learning
¹ More precisely, the anchor is the node in common between the connection path and the left context once the connection path has been joined.
to predict the correct member of the forest (Sturt et al., 2003) and consists in choosing at each stage the correct incremental tree: it can be naturally cast as a preference or ranking task.
Figure 4.3. Ambiguities introduced by the dynamic grammar. Left: possible anchor points. Right: possible connection paths.
4.5.1.1 Tree Reduction and Specialization

We found that the disambiguation accuracy previously reported in Sturt et al. (2003) can be significantly improved by means of two linguistically motivated heuristics (Costa et al., 2005). The first, called tree reduction, consists in removing from the syntactic parse those nodes that are considered irrelevant for discriminating between alternative incremental trees. Intuitively these are deep nodes, where depth is measured with respect to the part of the left context where the attachment process takes place. More precisely, though the result is the same, the discarded nodes are selected on the basis of linguistically motivated characteristics such as the notion of c–command. This reduction of the complexity of the trees has proved beneficial in increasing the prediction accuracy.
The second heuristic consists of specializing the first pass attachment prediction with respect to the class of the item being attached. The idea is to train and employ a specialized predictor for each word class (nouns, verbs, etc.). The heuristic is applicable since the different classes
naturally partition the data set into non–overlapping sets (for example, what
is learnt on attaching nouns is not relevant for attachment decisions on verbs
or punctuation).
4.5.2 The Reranking Task

The second task, originally formulated in Collins and Duffy (2002), consists of reranking alternative parse trees generated for the same sentence by a statistical parser. In this case each forest consists of full candidate trees for the entire sentence. In addition, each parse tree has a score that measures its probability given the sentence and an underlying stochastic grammar. Note that the forest does not necessarily contain the correct parse tree for the sentence (the gold tree). Alternative parses are ranked according to the standard PARSEVAL measures: labelled recall (LR), labelled precision (LP) and crossing brackets (CB). A constituent is a triple consisting of an internal node, its label (a nonterminal symbol) and the indexes of the first and the last word it spans (see the example in Figure 4.4). A constituent is correct if it
Figure 4.4. An example of a constituent, a triple consisting of an internal node, its label and the indexes of the first and the last word it spans.
spans the same set of words and has the same label as a constituent in the gold tree. LP is the number of correctly predicted constituents divided by the
number of constituents in the predicted parse tree

LP = (# correct constituents) / (# constituents in the parse tree)          (4.42)

LR is the number of correctly predicted constituents divided by the number of constituents in the gold tree

LR = (# correct constituents) / (# constituents in the gold tree)          (4.43)

CB is the number of constituents which violate constituent boundaries with respect to a constituent in the gold tree; sentences with no crossing brackets (0 CBs) is the percentage of sentences which have zero crossing brackets, and sentences with two or less crossing brackets (2 CBs) is the percentage of sentences which have at most two crossing brackets.
The best tree in a forest is defined as the one having the maximum harmonic mean of LP and LR (F1 metric)

F1 = (2 · LP · LR) / (LP + LR)          (4.44)
The reranking task consists of predicting this best tree given the forest, and it is naturally formulated as a preference problem.
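For concreteness, the following sketch (assumed, not from the thesis) computes the PARSEVAL measures of Equations (4.42)–(4.44) from sets of constituents, each represented as a (label, first word index, last word index) triple, and selects the best tree of a forest accordingly.

def parseval(predicted, gold):
    """LP, LR and F1 from two collections of constituents."""
    correct = len(set(predicted) & set(gold))
    lp = correct / len(predicted) if predicted else 0.0   # Eq. (4.42)
    lr = correct / len(gold) if gold else 0.0             # Eq. (4.43)
    f1 = 2 * lp * lr / (lp + lr) if lp + lr > 0 else 0.0  # Eq. (4.44)
    return lp, lr, f1

def best_tree(forest_constituents, gold):
    """The best tree of a forest is the one with maximum F1 against the gold tree."""
    return max(range(len(forest_constituents)),
               key=lambda i: parseval(forest_constituents[i], gold)[2])

# Toy usage inspired by the example of Figure 4.4.
gold = [("S", 1, 4), ("NP", 1, 1), ("VP", 2, 4), ("NP", 3, 4)]
pred = [("S", 1, 4), ("NP", 1, 1), ("VP", 2, 4), ("NP", 2, 4)]
print(parseval(pred, gold))   # (0.75, 0.75, 0.75)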
4.6 Experimental Results

In our experiments, we use the Wall Street Journal (WSJ) section of the Penn TreeBank (Marcus et al., 1993). It is a large realistic corpus of natural language that contains about 40,000 parsed sentences for a total of 1 million words and has been widely used in the computational linguistics community. We followed the standard split of the data set, using sections 2–21 for training, section 23 for testing and section 24 for validation.
4.6.1 First Pass Attachment

In the first pass attachment disambiguation task, all parse trees have been preprocessed with tree reduction and specialization (see Section 4.5.1.1). In particular, the 45 different POS–tags of the WSJ have been grouped into 10 classes
(Costa et al., 2003a) and then 10 specialized data sets have been obtained by splitting the data according to the syntactic category of the foot node: Nouns, Verbs, Prepositions, Articles, Punctuations, Adjectives, Adverbs, Conjunctions, Possessives, Others.
In order to estimate learning curves, we created data sets with 100, 500, 2,000, 10,000 and 40,000 sentences, randomly extracted from the training set. In the full data set of 40,000 sentences the average sentence length is 24 and for each word there are 120 alternative incremental trees on average (ranging from a minimum of 2 to a maximum of 2,000 alternatives), yielding a total of about 10^8 trees. Due to the high computational cost of the validation procedure, the model parameters were optimized using a subset of section 24 of the WSJ as a validation set. Specifically, we used 500 validation sentences to estimate the parse tree kernel parameter λ of Equation (3.73), and we found an optimal value λ = 0.5. The size of the RNN state vector was fixed to 25 units, which is the minimum size ensuring enough expressive power to learn the training set. Weights were initialized in the range [−0.01, +0.01] and updated after the presentation of each forest; the learning rate η ranges from 10^−2 to 10^−4 and the momentum was 0.1. The maximum node outdegree was set to 15. Fixing the maximum outdegree is an architectural constraint which, in our implementation, has the consequence of pruning syntactic trees with very long productions. Note that since each child position is associated with its own set of weights, pruning long productions avoids poor estimates of the weights associated with very infrequent rules. Using 15 children, only 0.3% of productions are pruned. Because of the large size of the training set, we observed that one epoch of VP training is sufficient to reach a steady state validation set accuracy. For the RNN, we used early stopping based on 1,000 validation sentences from section 24 of the WSJ. We found that on the order of 10^5 sentence presentations are typically needed to complete the training procedure.
The results of the comparison are shown in Table 4.1 and in Figure 4.5, where the error measure is based on the L1 loss of Equation (4.1), which counts the number of forests where the best element is not ranked first. For most POS–tag classes and most training set sizes, the RNN outperforms the VP. If the specialized classifiers are combined with their
weights to obtain an overall measure not broken down by POS–tag class, we see that the RNN exhibits about 1% better prediction accuracy on average with respect to the VP, and no evidence is found for the superiority of the kernel VP when trained on a small data set.
POS       Noun  Verb  Prep  Art   Punc  Adj   Adv   Conj  Poss  Oth   Total
Size %    33.0  13.4  12.6  12.5  11.7  7.5   4.3   2.3   2.0   0.7   100

VP after 1 Epoch
100       12.4  12.6  47.6  26.5  50.9  24.2  65.8  46.3  7.5   64.9  27.3
500       8.4   8.9   42.3  17.4  39.0  18.2  58.8  38.7  6.5   54.7  21.3
2,000     7.1   6.8   38.1  14.2  33.3  14.7  55.7  31.9  5.4   34.6  18.3
10,000    5.5   5.1   34.5  11.1  25.7  12.1  51.0  28.1  4.4   25.8  15.3
40,000    4.4   3.9   31.6  9.6   22.5  11.0  46.4  25.0  2.9   21.8  13.4

RNN
100       11.4  17.9  43.6  30.4  36.6  23.0  65.2  39.2  40.0  89.0  26.6
500       8.2   9.1   38.2  14.8  31.8  16.4  54.9  40.4  10.2  48.3  19.4
2,000     8.5   6.0   37.7  12.3  25.7  16.8  48.1  31.5  5.9   34.5  17.3
10,000    5.9   5.1   35.8  10.6  21.0  13.0  43.9  23.2  2.5   22.3  14.5
40,000    4.3   3.2   32.5  9.0   19.2  10.5  40.6  21.3  2.9   31.4  12.6

Table 4.1. VP and RNN learning curves in the first pass attachment prediction task using modularization in 10 POS–tag categories. The reported values are the percentage of forests where the best element is not ranked first.
In order to better assess the behavior on small data sets, we carried out a more robust experiment training the RNN and the VP on 5 independent subsets of 100 sentences each, randomly chosen from WSJ sections 2–21. In this case, no class partition was performed; we used the tree reduction heuristic but not the specialization heuristic. Model parameters were kept identical to the previous experiment. Results are reported in Table 4.2 and confirm the hypothesis that the RNN outperforms the kernel VP even when little training data is available.
In addition, the kernel VP has the drawback of high computational costs: training over 5 subsets of 100 sentences takes about a week on a 2 GHz CPU, and moreover the VP does not scale linearly with the number of examples as the
Figure 4.5. VP and RNN learning curves in the first pass attachment prediction task using modularization in 10 POS–tag categories. The reported values are the percentage of forests where the best element is not ranked first.
Subset    1     2     3     4     5     Average
VP        26.4  26.3  26.5  26.8  27.4  26.7±0.4
RNN       26.7  24.6  26.3  27.0  25.6  26.0±1.0

Table 4.2. VP and RNN in the first pass attachment prediction task: 5 independent subsets of 100 sentences. The reported values are the percentage of forests where the best element is not ranked first.
RNN does. For small data sets, the CPU time of the kernel VP is comparable to that of the RNN, while for larger data sets the elaboration time is much higher: in the first experiments, the VP took over 2 months to complete an epoch, whereas the RNN learns in 1–2 epochs (about 3 days on a 2 GHz CPU). This high computational cost forced us to train the VP for only one epoch in the full experiment with all the 40,000 sentences. An advantage of the kernel VP is its smoothness with respect to training iterations, i.e. validating the performance on a working set yields a smooth, single–maximum function. In contrast, the RNN is much more sensitive, making it hard to identify a good generalization point.
4.6.2 Reranking Task

The data set used in this experiment is the same as the one described in Collins and Duffy (2002). There are 30 alternatives on average for each sentence, so the task is computationally less intensive. We do not use the two heuristics, because they are not applicable in this case. The task is very difficult, because the statistical parser employed is very good and all the trees it outputs have a high similarity score with the gold tree. To obtain a better performance of the kernel VP and the RNN with respect to the statistical parser, we incorporate the probability from the parser into our models. In the case of the VP, the new tree kernel is composed of two parts

K_{λ,β}(x, z) = K_λ(x, z) + β log p(x) log p(z)          (4.45)

where K_λ(x, z) is the kernel defined in Equation (3.73), p(x) and p(z) are the probabilities assigned by the parser to x and z, and β ≥ 0 controls the relative
contribution of the two terms. The parameters λ and β are set to 0.3 and
0.2 respectively through tuning on the validation set.
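A minimal sketch of the combined kernel of Equation (4.45) is given below; it is an illustration under stated assumptions, not the thesis implementation: `tree_kernel` stands for any parse tree kernel and `log_p` for a callable returning the parser log-probability of a tree.

import math

def combined_kernel(tree_kernel, log_p, beta=0.2):
    """Kernel of Eq. (4.45): K_{lambda,beta}(x, z) = K_lambda(x, z) + beta * log p(x) * log p(z)."""
    def k(x, z):
        return tree_kernel(x, z) + beta * log_p(x) * log_p(z)
    return k

# Toy usage: a trivial "tree kernel" and parser log-probabilities stored in a dict.
toy_tree_kernel = lambda x, z: 1.0 if x == z else 0.0
log_probs = {"t1": math.log(0.6), "t2": math.log(0.3)}
k = combined_kernel(toy_tree_kernel, log_probs.get, beta=0.2)
print(k("t1", "t2"))   # 0.0 + 0.2 * log(0.6) * log(0.3)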
For the RNN, a rescaling of the parser probability, a log p(x) + b for appropriate values of a and b, is used as an additional input to the output network. The RNN parameters are: state vector size fixed to 25 units, learning rate η = 0.0001, momentum 0.5, weights initialized at random in [−0.01, +0.01] and maximum node outdegree set to 15. The number of iterations needed to reach an error minimum on the validation set is about 1820.
The standard PARSEVAL measures outlined in Section 4.5.2 are used to assess the performance of the predictors. Results for sentences of length at most 40 words and for all the sentences are reported in Table 4.3. In this task, the performance of the two methods is essentially the same.
≤ 40 Words (2245 sentences)
Model   LR    LP    CBs   0 CBs  2 CBs
VP      89.1  89.4  0.85  69.3   88.2
RNN     89.2  89.5  0.84  67.9   88.4

≤ 100 Words (2416 sentences)
Model   LR    LP    CBs   0 CBs  2 CBs
VP      88.6  88.9  0.99  66.5   86.3
RNN     88.6  88.9  0.98  64.8   86.3

Table 4.3. VP and RNN in the reranking task. We report the standard PARSEVAL measures described in Section 4.5.2.
4.6.3 The Role of Representation

To compare the vector representation of trees φ(x) induced by the parse tree kernel with the representation Φ(x) adaptively computed by the RNN (see Equation (3.117)), we trained a VP using a linear kernel on the set of vectors Φ(x_i) obtained from the trained RNN. In this way, the tree kernel representation is replaced by the RNN adaptive representation, and the two methods can be compared with
Subset            1     2     3     4     5     Average
VP                26.4  26.3  26.5  26.8  27.4  26.7±0.4
RNN               26.7  24.6  26.3  27.0  25.6  26.0±1.0
VP on RNN State   27.8  25.6  27.2  28.4  26.4  27.1±1.1

Figure 4.6. VP, RNN and VP on RNN State in the first pass attachment prediction task: 5 independent subsets of 100 sentences. The reported values are the percentage of forests where the best element is not ranked first.
the same representation. Results are reported in Figure 4.6 and show once again the superiority of the RNN with respect to the VP. Since the two algorithms are compared with an equal representation, the different performance is due to the preference loss function, because the VP algorithm and the RNN output network have almost the same behavior. We argue that the problem lies in the pairwise loss function, which does not take all the alternatives into consideration together.
To better understand the adaptive representation generated by the RNN, we applied Principal Components Analysis (PCA) to the state vector representations of trees for the incremental task. PCA is a technique that
can be used to simplify a dataset. More formally, it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. PCA can be used for reducing the dimensionality of a dataset while retaining those characteristics that contribute most to its variance, by eliminating the later principal components. These characteristics may be the "most important" ones, but this is not necessarily the case, depending on the application. PCA is the optimal linear transformation for keeping the subspace that has largest variance. Unlike other linear transforms, PCA does not have a fixed set of basis vectors: its basis vectors depend on the data set. Projecting the RNN state vectors onto a two dimensional space, we observe that the elements of a set of alternatives tend to lie on a manifold of IR² and that the best element is the rightmost element of the set (if the RNN has classified it correctly). In Figure 4.7, we report the PCA of a large set of alternatives with its best element plotted as a cross, and the PCA of the state vectors of all forests of the dataset. The experiments reported in the next section further investigate the role played by the loss function.
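The two-dimensional projection used above can be obtained with a few lines of code; the following sketch is only an illustration (it is not the thesis code, and the random state vectors stand in for the RNN representations).

import numpy as np

def pca_2d(states):
    """Project state vectors (one row per tree) onto their first two principal components."""
    X = np.asarray(states, dtype=float)
    Xc = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # coordinates on the first two components

# Toy usage: 5 state vectors of size 25.
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 25))
print(pca_2d(states).shape)   # (5, 2)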
4.6.4 Comparing Different Preference Loss Functions

To investigate the role of the loss function, we compared the setwise loss L_r of Equation (4.1) and the pairwise loss L_0 of Equation (4.15) on the incremental task, using as evaluation measures the errors of Equation (4.16) and Equation (4.17). Results are reported in Table 4.4 for the 5 subsets of 100 sentences. The first two columns show whether training and testing refer to a pairwise loss function (P) or a setwise loss function (S), and the others report the results of the two algorithms. We see that for the RNN the pairwise loss function shows worse performance than the setwise loss function on both evaluation measures. No setwise loss function for the VP has been proposed so far. When using the pairwise loss function, the VP performance measured on forests is better than that of the RNN. We argue that a setwise loss function for kernel methods could improve the performance.
Figure 4.7. PCA on RNN state vectors of a large forest (top) and on the state vectors of all forests of the dataset (bottom). The crosses show the best element, while the points are the alternatives.
Train  Test  RNN          VP
P      P     1.79±0.12    2.32±0.09
P      S     29.1±1.8     26.7±0.4
S      P     1.50±0.20    NA
S      S     26.0±1.0     NA

Table 4.4. Comparison between different loss functions and evaluation measures: (P) indicates the pairwise loss function while (S) indicates the setwise loss function.
4.7 Conclusions

The experimental analysis presented above shows that both RNNs and the kernel VP are effective in solving the investigated problems. In particular, the adaptive representation developed by the RNN allows a simple linear utility function to solve the preference problem.
The experiments indicate that the choice of a pairwise vs. global loss function plays an important role. In particular, it appears that the pairwise loss is not well suited to train an RNN, and it remains to be investigated whether this is also the case for kernel methods. Interestingly, previous work with kernels (Herbrich et al., 2000; Joachims, 2002a; Collins and Duffy, 2001) focuses exclusively on pairwise loss functions. The development of global loss functions for preference tasks may lead to more effective solutions.
Chapter 5
On the Consistency of Preference Learning

In this chapter, we give a novel theoretical analysis which explains why a setwise loss function exhibits better performance than a pairwise loss function based on a utility function. We introduce a model of preference and ranking problems based on the concept of a partial order relation and we provide three different approaches for realizing this model. To understand which approach has the smallest generalization error, we evaluate the Bayes risk of realizing the preference and ranking model by each of the three approaches. We show that the direct approach exhibits better performance than the utility function approach and than a model based on a function that works directly on pairs. Finally, we show how the ranking and preference generalization error depends on the size of the set of alternatives. This chapter is based on Menchetti (2006).
5.1 Introduction

Work on learning theory has mostly concentrated on classification and regression. However, there are many applications in which it is desirable to choose the best element in a set of alternatives (preference problem) or to order a collection of objects (ranking problem). Many works in the literature try to propose a solution for these problems: Section 4.3 reports an extensive reference to the literature.
Menchetti et al. (2003, 2005c) report an experimental analysis comparing recursive neural networks and the voted perceptron for solving preference problems. Results indicate that both RNNs and the kernel VP are effective in solving the proposed problems. The experiments also indicate that the choice of a pairwise or global loss function plays an important role. In particular, it appears that the pairwise loss is not well suited to train an RNN and it remains to be investigated whether this is also the case for kernel methods. Interestingly, previous work with kernels (Herbrich et al., 2000; Joachims, 2002a; Collins and Duffy, 2001) focuses exclusively on pairwise loss functions. The development of global loss functions for preference tasks may lead to more effective solutions. It is therefore interesting to theoretically investigate why a pairwise loss function behaves worse than a global loss function based on all the elements of the set of alternatives.
The remainder of the chapter is organized as follows. In Section 5.2 we introduce some useful results on the Bayes function for regression and for binary and multiclass classification problems. In Section 5.3 we derive a new model of preference and ranking problems based on the idea that a binary partial order relation can model the constraints of preference and ranking problems, and we describe three possible approaches for the partial order model based on a 0–1 loss function. In Section 5.4 we compare the three approaches described in Section 5.3, showing which is the best method. Finally, in Section 5.5 we describe how the ranking and preference errors depend on the size of the set of alternatives.
5.2 The Bayes Function

As described in Section 1.1.3, the Bayes function is the minimizer of the expected risk (1.8) and depends on the distribution ρ (Devroye et al., 1996; Cucker and Smale, 2001; Duda et al., 2001). If ρ were known, then it would be possible to compute the Bayes function f_ρ directly from its definition

f_ρ = arg min_{f ∈ T} err_ρ(f)
In the following, the Bayes function is derived for regression and for binary and multiclass classification problems. We start from the Bayes function for regression (Cucker and Smale, 2001).

Theorem 5.1 (Bayes Function for Regression) In the regression problem where f : X → IR, for a quadratic loss function V(f(x), y) = (y − f(x))², the Bayes function (also called regression function) is:

f_ρ(x) = ∫_Y y ρ(y|x) dy = E_{Y|x}{Y|x}          (5.1)

So the Bayes function f_ρ(x) is the expected value of the random variable Y given X = x.
Proof of Theorem 5.1 In the case of regression, it is useful to write the expected risk (1.8) as follows:

err_ρ(f) = ∫_X ∫_Y (y − f(x))² ρ(y|x) dy ρ(x) dx          (5.2)

Setting its first derivative with respect to f(x) to zero, it follows that

∂err_ρ(f)/∂f(x) = − ∫_X ∫_Y 2 (y − f(x)) ρ(y|x) dy ρ(x) dx = 0
∫_X ∫_Y y ρ(y|x) dy ρ(x) dx = ∫_X ∫_Y f(x) ρ(y|x) dy ρ(x) dx
∫_X ∫_Y y ρ(y|x) dy ρ(x) dx = ∫_X (∫_Y ρ(y|x) dy) f(x) ρ(x) dx

Since ∫_Y ρ(y|x) dy = 1, then

∫_X ∫_Y y ρ(y|x) dy ρ(x) dx = ∫_X f(x) ρ(x) dx

Comparing the two members of the previous equation, the regression function is derived:

f_ρ(x) = ∫_Y y ρ(y|x) dy
It is interesting to compute the error associated with the regression function, which represents a lower bound on the error that depends only on the intrinsic difficulty of the problem.

Proposition 5.1 (Regression Bayes Function Error) In the regression problem where f : X → IR, for a quadratic loss function V(f(x), y) = (y − f(x))², the error of the Bayes function f_ρ(x) is:

err_ρ(f_ρ(x)) = E_X Var_{Y|x}{Y|x}          (5.3)

Proof of Proposition 5.1 Substituting Equation (5.1) into Equation (5.2), it follows that

err_ρ(f_ρ(x)) = ∫_X ∫_Y (y − E_{Y|x}{Y|x})² ρ(y|x) dy ρ(x) dx
              = ∫_X Var_{Y|x}{Y|x} ρ(x) dx
              = E_X Var_{Y|x}{Y|x}

Remark If Var_{Y|x}{Y|x} = 0 ∀ x ∈ X, that is, if there exists only one possible y for each x (the relation between X and Y is deterministic and not probabilistic), then err_ρ(f_ρ(x)) = 0, provided that the target space T contains the function which assigns to each x its output y.
After the regression task, the Bayes function is derived for binary and multiclass classification problems where the cost of each error is weighted by the value of the loss function (Lee et al., 2004; Tewari and Bartlett, 2005).

Theorem 5.2 (Bayes Function for Multiclass Classification) In the multiclass classification problem where f : X → {1, . . . , c}, if we use a loss function V(f(x), y), the Bayes function is:

f_ρ(x) = arg min_{f(x) ∈ {1,...,c}} E_{Y|x}{V(f(x), y)}          (5.4)

So the Bayes function f_ρ(x) predicts the class label ŷ = f(x) which minimizes the expected value of V(ŷ, y) over ρ(y|x).
Proof of Theorem 5.2 The expected risk err_ρ(f) for a multiclass classification problem is:

err_ρ(f) = ∫_X Σ_{y=1, y≠f(x)}^{c} V(f(x), y) ρ(y|x) ρ(x) dx

So the task is to minimize, pointwise on the input space X, the value of the inner sum:

f_ρ(x) = arg min_{f ∈ T} Σ_{y=1, y≠f(x)}^{c} V(f(x), y) ρ(y|x)          (5.5)
       = arg min_{f ∈ T} Σ_{y=1}^{c} V(f(x), y) ρ(y|x)          (5.6)

since V(f(x), f(x)) ρ(f(x)|x) = 0. If we define the expected value of the loss function V(f(x), y) over ρ(y|x) as

E_{Y|x}{V(f(x), y)} = Σ_{y=1}^{c} V(f(x), y) ρ(y|x)          (5.7)

then

f_ρ(x) = arg min_{f ∈ T} E_{Y|x}{V(f(x), y)}          (5.8)

The Bayes function chooses the class label f(x) which minimizes the expected value of V(f(x), y) over ρ(y|x).

Remark For a binary classification problem where f : X → {+1, −1}, Equation (5.4) reduces to

f_ρ(x) = arg min_{f(x) ∈ {+1,−1}} {V(f(x), +1) ρ(+1, x) + V(f(x), −1) ρ(−1, x)}

or, equivalently:

f_ρ(x) = +1 if V(−1, +1) ρ(+1, x) ≥ V(+1, −1) ρ(−1, x), and −1 otherwise          (5.9)

If the classification loss function treats all the errors in the same way, as in the case of the 0–1 loss function, the previous results can be simplified.
Proposition 5.2 (Bayes Function for Multiclass Classification) In the multiclass classification problem where f : X → {1, . . . , c}, if we use the loss function V(f(x), y) = I(y ≠ f(x)), where I is the indicator function, the Bayes function is:

f_ρ(x) = arg max_{y=1,...,c} ρ(y|x)          (5.10)

So the Bayes function f_ρ(x) assigns to each x its maximal probability output.

Proof of Proposition 5.2 First of all, we compute the expected risk err_ρ(f) for the multiclass classification problem:

err_ρ(f) = ∫_X ∫_Y I(y ≠ f(x)) ρ(y|x) dy ρ(x) dx
         = ∫_X Σ_{y=1, y≠f(x)}^{c} ρ(y|x) ρ(x) dx          (5.11)
         = ∫_X (1 − ρ(f(x)|x)) ρ(x) dx

where we used the normalization property Σ_{y=1}^{c} ρ(y|x) = 1. So we are looking for a function that minimizes the previous error (5.11):

f_ρ(x) = arg min_{f ∈ T} (1 − ρ(f(x)|x)) = arg max_{y=1,...,c} ρ(y|x)

Remark For a binary classification problem where Y = {+1, −1}, Equation (5.10) reduces to

f_ρ(x) = +1 if ρ(+1|x) ≥ ρ(−1|x), and −1 otherwise          (5.12)

Now we compute the error of the multiclass classification Bayes function in the case of a 0–1 loss function.
Proposition 5.3 (Multiclass Classification Bayes Function Error) In the multiclass classification problem where f : X → {1, . . . , c}, if we use the loss function V(f(x), y) = I(y ≠ f(x)), the error of the Bayes function f_ρ(x) is:

err_ρ(f_ρ(x)) = E_X [1 − max_{y=1,...,c} ρ(y|x)]
              = ∫_X [1 − max_{y=1,...,c} ρ(y|x)] ρ(x) dx          (5.13)

where 1 − max_{y=1,...,c} ρ(y|x) is the probability that x is not classified in the most probable class.

Proof of Proposition 5.3 Substituting Equation (5.10) into Equation (5.11) yields Equation (5.13).
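As a small numerical illustration of Equations (5.10) and (5.13) (an assumed toy distribution, not from the thesis), the Bayes classifier and its risk can be computed directly when the joint distribution over a finite input space is known.

import numpy as np

# Toy distribution over 3 inputs and 2 classes: rho(x) and rho(y | x).
rho_x = np.array([0.5, 0.3, 0.2])
rho_y_given_x = np.array([[0.9, 0.1],
                          [0.6, 0.4],
                          [0.2, 0.8]])

# Bayes function of Eq. (5.10): pick the most probable class for each x.
f_bayes = rho_y_given_x.argmax(axis=1)

# Bayes risk of Eq. (5.13): E_X [1 - max_y rho(y | x)].
bayes_risk = np.sum(rho_x * (1.0 - rho_y_given_x.max(axis=1)))
print(f_bayes, bayes_risk)   # [0 0 1]  0.21 = 0.5*0.1 + 0.3*0.4 + 0.2*0.2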
5.3 A New Model of Preference and Ranking

In this section, we introduce a new model for preference and ranking problems. We have to model a framework in which we are given a set of i.i.d. pairs D_m = {(X_i, R_i)}_{i=1}^m, where X_i = {x_{i1}, . . . , x_{ik_i}} ⊆ X, x_{ij} ∈ X and R_i is a relation between the elements of each subset. For example, R_i can be the ranking of {x_{i1}, . . . , x_{ik_i}} or a preference relation which chooses the best element of {x_{i1}, . . . , x_{ik_i}}. Before introducing the model for preference and ranking problems, we recall the definitions of binary relation, partial order relation and total order relation.

Definition 5.1 A binary relation R is a subset of the Cartesian product of two sets A and B: R ⊆ A × B; aRb denotes an ordered pair (a, b).

Definition 5.2 A partial order ⪯ on a set A is a binary relation ⪯ ⊆ A × A that satisfies the following three properties:
1. Reflexivity: a ⪯ a ∀ a ∈ A
2. Antisymmetry: if a ⪯ b and b ⪯ a, then a = b, ∀ a, b ∈ A
3. Transitivity: if a ⪯ b and b ⪯ c, then a ⪯ c, ∀ a, b, c ∈ A

Definition 5.3 A total order ≤ on a set A is a partial order that satisfies the following additional property:
4. Comparability: ∀ a, b ∈ A, either a ≤ b or b ≤ a
5.3.1 The Partial Order Model

The partial order model of preference and ranking is based on the idea that a binary partial order relation can model the constraints of a preference and ranking problem. Let R_X be the set of all the partial order relations on X:

R_X = {R : R ⊆ X × X, R is a partial order on X} ⊆ 2^{X × X}          (5.14)

We can model D_m = {(X_i = {x_{i1}, . . . , x_{ik_i}}, R_i)}_{i=1}^m, where x_{ij} ∈ X, as a set of i.i.d. pairs (X_i, R_i) drawn from a fixed but unknown distribution ρ on 2^X × R_X, where 2^X is the set of all the subsets of X and R_X is the set of all the partial order relations on X. The goal is to learn a function f ∈ H

f : 2^X → R_X          (5.15)

such that

f(X) ≈ R          (5.16)

which models the probabilistic relation between 2^X and R_X. If we decompose ρ as

ρ(2^X, R_X) = ρ(R_X | 2^X) ρ(2^X)          (5.17)

each pair of the dataset D_m can be obtained by a two–step process:
1. first we get a subset X of X according to ρ(2^X);
2. then we get a partial order R on X from ρ(R_X | 2^X).

This two–step process expressed by Equation (5.17) is able to model the noise that can corrupt the function from the input to the target space: for example, different users, or the same user, can sort the same set X of instances in different ways. Note that for a given X, the relation on X is consistent in the sense that transitivity between the elements of X holds. Given this model, it is interesting
• defining a loss function to measure how good a function is on a given collection of data;
• finding the Bayes function f_ρ : 2^X → R_X assuming that ρ is known;
• computing its expected risk err_ρ(f_ρ).
5.3.2 The 0–1 Loss Function

The simplest loss is the 0–1 loss function, defined as

V(f(X), R) = I(f(X) ≠ R) = 1 if R ≠ f(X), 0 if R = f(X)          (5.18)

It counts an error when f(X) ≠ R, without evaluating whether f(X) and R are similar or very different. It behaves as the misclassification loss for classification defined in Equation (1.5).
5.3.3 Three Approaches for the Partial Order Model

Given the framework described in Section 5.3.1, we can compute the Bayes function of this model and its expected risk. Then we can learn f using several models and compute the expected risk of these different models. At this point, the expected risk of the Bayes function can be compared with the error of the other models to find the model with the smaller Bayes error. We investigate three models for the function f : 2^X → R_X (a small sketch of the corresponding reconstructions follows this list).
1. We can directly model the probabilistic relation between 2^X and R_X using a function

D : 2^X → R_X          (5.19)

This is the most expressive way of modelling f. For all practical purposes, this approach is not simple to realize because the target space R_X is a complex output space: so we look for a function with a simpler target space. We will call this approach the direct model.
2. We can map each object into a real number which measures its importance by a function

U : X → IR          (5.20)

and then sort the alternatives by this score, using Equations (4.2) and (4.3) for ranking and preference respectively (Herbrich et al., 1998; Herbrich et al., 2000; Crammer and Singer, 2002b). This is a very simple approach which assumes that there exists a function mapping each object into a real number by which the objects can be sorted, an assumption which is not always valid. It is employed in the utility function approach described in Section 4.3.1. We will call this approach the utility function model.
3. Finally, we can use a function which works on pairs of objects, assigning a score or a label (Cohen et al., 1999)

P : X × X → {+1, −1}   or   P : X × X → [0, 1]          (5.21)

In this way, we can sort pairs of objects based on their scores or labels, but we have to guarantee the transitivity property and to resolve possible inconsistencies. For example, a score greater than 0.5 or a label +1 means that the first object has to be ranked before the other one. We will call this approach the pairwise model.

We use f_D, f_U and f_P to indicate the ranking and preference function f modelled by D, U and P respectively.
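The following sketch (illustrative only; the utility values and the pairwise scorer are made up) shows how a relation on a set of alternatives can be reconstructed from a utility function as in Equation (5.26) below, or from a pairwise scoring function as in Equation (5.32).

from itertools import product

def f_utility(X, U):
    """Relation induced by a utility function: pairs (x, z) with U(x) >= U(z)."""
    return {(x, z) for x, z in product(X, X) if U(x) >= U(z)}

def f_pairwise(X, P):
    """Relation induced by a pairwise scorer: pairs (x, z) with P(x, z) >= 0.5.
    Transitivity is not guaranteed and must be enforced or repaired separately."""
    return {(x, z) for x, z in product(X, X) if P(x, z) >= 0.5}

# Toy usage on three alternatives.
X = ["a", "b", "c"]
U = {"a": 0.2, "b": 0.9, "c": 0.5}.get
print(f_utility(X, U))   # total order b > c > a (plus the reflexive pairs)
print(f_pairwise(X, lambda x, z: 0.5 if x == z else (0.8 if (x, z) == ("b", "a") else 0.1)))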
5.4 A Comparison of the Three Approaches

After defining in Section 5.3.3 three different models for the problems of ranking and preference, we show a new approach for comparing these models. We first compute the Bayes function of the preference and ranking problem under our framework with its corresponding risk (this corresponds to modelling f using D) and then compare this value with the expected risk of the other two models. We start by computing the Bayes function of the problem defined in Section 5.3.1, which corresponds to directly modelling the probabilistic relation between 2^X and R_X.
5.4.1 The Direct Model

In the direct model, we model f by a function D that has its same behaviour, so we can compute the output as

f(X) = f_D(X) = D(X)          (5.22)

The following theorem computes the Bayes function of the preference and ranking problem under the model described in Section 5.3.1: it is based on the observation that if |X| is finite, then the cardinality |R_X| of the set of all the partial order relations on X is also finite. This observation permits deriving the Bayes function using the results on multiclass classification.

Theorem 5.3 (Preference and Ranking Bayes Function) In a preference and ranking problem in which f : 2^X → R_X, if we use a 0–1 loss function V(f(X), R) = I(f(X) ≠ R), the Bayes function f_ρ(X) is:

f_ρ(X) = arg max_{R ∈ R_X} ρ(R|X)          (5.23)

Proof of Theorem 5.3 If we assume that X = {x_1, . . . , x_k} ⊆ X is a finite set, then |R_X|, the number of partial order relations on X, is also finite. The expected risk of f becomes:

err_ρ(f) = ∫_{2^X} ∫_{R_X} V(f(X), R) ρ(R|X) dR ρ(X) dX
         = ∫_{2^X} Σ_{R ∈ R_X} I(f(X) ≠ R) ρ(R|X) ρ(X) dX
         = ∫_{2^X} [1 − ρ(f(X)|X)] ρ(X) dX

So we have cast the preference and ranking problem as a multiclass classification problem where the categories are the elements of R_X. Direct application of Equation (5.10) leads to Equation (5.23), which proves the theorem.

As in the case of multiclass classification, the Bayes function (5.23) assigns to each set of alternatives its maximal probability relation. Computing the expected risk of the Bayes function is a direct consequence of Proposition 5.3.
Theorem 5.4 (Bayes Risk for Preference and Ranking) In a preference and ranking problem in which f : 2^X → R_X, if we use a 0–1 loss function V(f(X), R) = I(f(X) ≠ R), the error of the Bayes function f_ρ(X) is:

err_ρ(f_ρ(X)) = E_{2^X}[1 − max_{R ∈ R_X} ρ(R|X)]
              = ∫_{2^X} [1 − max_{R ∈ R_X} ρ(R|X)] ρ(X) dX          (5.24)

Proof of Theorem 5.4 Applying Proposition 5.3 for multiclass classification to Equation (5.23) leads to Equation (5.24).

If we model f by D : 2^X → R_X, the expected risk of the Bayes function f_ρ corresponds to the expected risk of f_D:

err_ρ(f_D(X)) = E_{2^X}[1 − max_{R ∈ R_X} ρ(R|X)]
              = ∫_{2^X} [1 − max_{R ∈ R_X} ρ(R|X)] ρ(X) dX          (5.25)

The next step involves the computation of the expected risk when we model the ranking and preference function by a utility function U and by a function P that works on pairs of objects.
5.4.2 The Utility Function Model

We now model the ranking and preference function f by a utility function U : X → IR which assigns to each object a score proportional to its importance. The prediction f(X) = f_U(X) can be reconstructed from the utility function U in the following way:

f_U(X) = {(x, z) ∈ X × X, x, z ∈ X : U(x) ≥ U(z)}          (5.26)

Note that if U(x) ≠ U(z) ∀ x, z ∈ X × X, then U induces a total order on X and the cardinality of f_U(X) is equal to the number of simple combinations

C_{n,k} = D_{n,k} / P_k = n! / (k!(n − k)!)

where k = 2 is the size of the subsets (in our case we pick pairs) and n is the cardinality of X:

|f_U(X)| = C_{|X|,2} = |X|(|X| − 1)/2          (5.27)

But if ∃ x, z ∈ X × X : U(x) = U(z), we can model ties by two elements of the relation, as x ⪯ z and z ⪯ x ⇒ x = z. Then the maximum value of |f_U(X)| is the number of simple arrangements

D_{n,k} = P_n / P_{n−k} = n! / (n − k)!

where k = 2 is the size of the subsets and n is the cardinality of X. So |f_U(X)| ranges from C_{|X|,2} to D_{|X|,2} depending on the number of ties:

|X|(|X| − 1)/2 ≤ |f_U(X)| ≤ |X|(|X| − 1)          (5.28)

The upper bound represents the situation in which all the alternatives get the same score. The following theorem compares the expected risk of the Bayes function f_ρ modelled by the direct model D : 2^X → R_X and by the utility function model U : X → IR.
Theorem 5.5 (Direct Model vs Utility Function Model) The expected risk of the ranking and preference function f : 2^X → R_X modelled by the direct approach D : 2^X → R_X is less than or equal to the expected risk of modelling f by a utility function U : X → IR such that

f_U(X) = {(x, z) ∈ X × X, x, z ∈ X : U(x) ≥ U(z)}          (5.29)

In mathematical terms:

err_ρ(f_D(X)) ≤ err_ρ(f_U(X))          (5.30)
Proof of Theorem 5.5 We start by computing the expected risk of the utility function model, and then compare this value to the expected risk of the direct model.

err_ρ(f_U(X)) = ∫_{2^X} ∫_{R_X} V(f_U(X), R) ρ(R|X) dR ρ(X) dX
              = ∫_{2^X} Σ_{R ∈ R_X} I(f_U(X) ≠ R) ρ(R|X) ρ(X) dX          (5.31)
              = ∫_{2^X} [1 − ρ(f_U(X)|X)] ρ(X) dX

Comparing Equations (5.24) and (5.31), we obtain Equation (5.30). If U is expressive enough such that

f_U(X) = arg max_{R ∈ R_X} ρ(R|X)

then

err_ρ(f_D(X)) = err_ρ(f_U(X))

Theorem 5.5 shows that modelling the ranking and preference function f by a utility function U : X → IR leads to a Bayes risk greater than or equal to that of the direct model. Only when the utility function U leads to an f_U which behaves as the Bayes function are the two errors the same. As a consequence, the utility function model by itself could induce a greater generalization error.
5.4.3 The Pairwise Model

The pairwise model is more expressive than the utility function one. Precisely, scoring pairs rather than single objects can lead to a richer relation on the set of alternatives, and the utility function approach can be obtained as a particular case of the pairwise model. It can be proved that there exist relations modelled by the pairwise approach which cannot be represented by the utility function one. The ranking and preference prediction function f(X) = f_P(X) can be reconstructed from the pairwise function P in the following way:

f_P(X) = {(x, z) ∈ X × X, x, z ∈ X : P(x, z) ≥ 0.5}          (5.32)

if P : X × X → [0, 1] is a probability score on pairs, and as

f_P(X) = {(x, z) ∈ X × X, x, z ∈ X : P(x, z) = +1}          (5.33)

if P : X × X → {+1, −1} is a binary classification function on pairs. The following theorem compares the expected risk of the Bayes function f_ρ modelled by the direct model D : 2^X → R_X and by the pairwise function model P : X × X → [0, 1] or P : X × X → {+1, −1}.
Theorem 5.6 (Direct Model vs Pairwise Model) The expected risk of the ranking and preference function f : 2^X → R_X modelled by the direct approach D : 2^X → R_X is less than or equal to the expected risk of modelling f by a pairwise function P : X × X → [0, 1] or P : X × X → {+1, −1} as described in Equations (5.32) and (5.33):

err_ρ(f_D(X)) ≤ err_ρ(f_P(X))          (5.34)

Proof of Theorem 5.6 The proof is the same as for the utility function approach: we compute the expected risk of the pairwise model and then compare this value to the expected risk of the direct model.

err_ρ(f_P(X)) = ∫_{2^X} ∫_{R_X} V(f_P(X), R) ρ(R|X) dR ρ(X) dX
              = ∫_{2^X} Σ_{R ∈ R_X} I(f_P(X) ≠ R) ρ(R|X) ρ(X) dX
              = ∫_{2^X} [1 − ρ(f_P(X)|X)] ρ(X) dX          (5.35)

Comparing Equations (5.24) and (5.35), we obtain Equation (5.34). Note that if P is expressive enough such that

f_P(X) = arg max_{R ∈ R_X} ρ(R|X)

then

err_ρ(f_D(X)) = err_ρ(f_P(X))

As in the case of the utility function model, Theorem 5.6 shows that modelling the ranking and preference function f by a pairwise function P leads to a Bayes risk greater than or equal to that of the direct model. Only when the pairwise function P leads to an f_P which behaves as the Bayes function are the two errors the same. As a consequence, the pairwise function model by itself could induce a greater generalization error.
Finally, to conclude, we can show the relation between the expected risks of the direct, the utility function and the pairwise function models:

err_ρ(f_D(X)) ≤ err_ρ(f_P(X)) ≤ err_ρ(f_U(X))          (5.36)
Modelling the ranking and preference function f by indirect approaches such as the utility or pairwise function can lead to a greater generalization error than the direct one, due to the inherent characteristics of a model which is unable to represent all the possible relations on the set of alternatives: the more expressive the model, the smaller the prediction error.
5.5 Dependence on Size of Set of Alternatives

In this section, we describe a novel approach to how the ranking and preference errors depend on the size of the set of alternatives. The larger the size of the set of alternatives, the higher the probability of a ranking or preference error. But if the scores of the objects are well "separated", the probability of error can become arbitrarily small.
In the utility function approach, we learn a function U : X → IR that measures the importance of an object using a training set D_m. Then, to rank a set of alternatives, we sort the elements by their score; in the case of a preference problem, we select only the best element. Since U depends on D_m, and since D_m is a set of i.i.d. pairs sampled from a probability distribution ρ on X × IN, it follows that U, and therefore also U(x), are random variables depending on D_m.
5.5.1 Ranking Two Alternatives

Let X = {x_1, x_2} be a set of alternatives that contains only two elements: in this case, ranking the two elements and choosing the best element are equivalent. We assume that x_2 is ranked before x_1, that is y_1 = 2 and y_2 = 1. Let U_1 and U_2 be the random variables whose realizations u_1 and u_2 represent the scores associated to x_1 and x_2 by the utility function U, and let p_{U_1}, P_{U_1}, p_{U_2} and P_{U_2} be the probability density functions and the cumulative distribution functions of U_1 and U_2 respectively.
Since x_2 ≺ x_1, we expect that U(x_1) = u_1 < u_2 = U(x_2). If u_1 ≥ u_2, then we have a ranking error:

Pr{Error} = Pr{u_2 ≤ u_1} = 1 − Pr{u_2 > u_1}          (5.37)

Using the definition of the cumulative distribution function,

Pr{u_2 ≤ c} = P_{U_2}(c) for every c          (5.38)

we obtain

Pr{u_2 ≤ u_1} = ∫_{U_1} P_{U_2}(u_1) p_{U_1}(u_1) du_1 = ∫_{U_1} P_{U_2}(u_1) P'_{U_1}(u_1) du_1          (5.39)

where by definition p_{U_1}(u_1) = P'_{U_1}(u_1) = dP_{U_1}(u_1)/du_1, that is, the probability density is the derivative of the cumulative distribution. If U_1 and U_2 have the same probability distribution but different expected values E_{U_1}{u_1} and E_{U_2}{u_2}, we obtain

P_{U_2}(u) = P_{U_1}(u − ∆)          (5.40)

where ∆ = E_{U_2}{u_2} − E_{U_1}{u_1}. Equation (5.39) becomes

Pr{u_2 ≤ u_1} = ∫_{U_1} P_{U_1}(u_1 − ∆) P'_{U_1}(u_1) du_1          (5.41)

In the case ∆ = 0 ⇒ E_{U_1}{u} = E_{U_2}{u}, i.e. the two probability distributions of U on x_1 and x_2 are the same, it follows that

Pr{u_2 ≤ u_1} = ∫_{U_1} P_{U_1}(u_1) P'_{U_1}(u_1) du_1 = [P_{U_1}²(u_1)/2]_{−∞}^{+∞} = 1/2          (5.42)

where, by definition, P_{U_1}(+∞) = 1 and P_{U_1}(−∞) = 0. This means that if, on average, the utility function U maps both x_1 and x_2 into the same value, then we have the highest probability of making an error.
If we suppose that the cumulative distributions P_{U_1}(u) and P_{U_2}(u) are sigmoidal, for example

P_{U_1}(u) = 1/(1 + e^{−u}) ⇒ p_{U_1}(u) = P_{U_1}(u)(1 − P_{U_1}(u))          (5.43)

then Equation (5.41) becomes

Pr{u_2 ≤ u_1} = ∫_{U_1} P_{U_1}(u_1 − ∆) P'_{U_1}(u_1) du_1
              = ∫_{−∞}^{+∞} [1/(1 + e^{−(u_1−∆)})] [e^{−u_1}/(1 + e^{−u_1})²] du_1          (5.44)
              = ∫_{−∞}^{+∞} e^{−u_1} / [(1 + e^{−u_1} e^{∆})(1 + e^{−u_1})²] du_1

After solving the integral (see Appendix A for a detailed solution), we obtain

Pr{Error} = [e^{∆}(∆ − 1) + 1] / (e^{∆} − 1)² ≈ ∆/e^{∆}          (5.45)

Also in this case, if ∆ = 0, then Pr{Error} = 1/2. The plot of Pr{Error}
Also in this case, if ∆ = 0, then Pr{Error} = 1/2. The plot of Pr{Error}
is shown in Figure 5.1. We see that the probability of error tends quickly to
Error between Two Alternatives
0.5
Pr{Error}
Probability
0.4
0.3
0.2
0.1
0
0
2
4
6
8
10
Delta
Figure 5.1. Ranking and preference error in function of the difference ∆
between the two expected values of U on x1 and x2 .
zero as ∆ increments.
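The closed form (5.45) can be checked numerically; the sketch below (an illustration, not part of the thesis) compares it with a Monte Carlo estimate obtained by sampling u_1 and u_2 from logistic distributions with means 0 and ∆, consistent with the cumulative distribution assumed in Equation (5.43).

import numpy as np

def pair_error_closed_form(delta):
    """Closed-form pair error of Eq. (5.45) for logistic score distributions."""
    return (np.exp(delta) * (delta - 1.0) + 1.0) / (np.exp(delta) - 1.0) ** 2

def pair_error_monte_carlo(delta, n=1_000_000, seed=0):
    """Estimate Pr{u2 <= u1} when u1, u2 are logistic with locations 0 and delta."""
    rng = np.random.default_rng(seed)
    u1 = rng.logistic(loc=0.0, size=n)
    u2 = rng.logistic(loc=delta, size=n)
    return np.mean(u2 <= u1)

for delta in (0.5, 1.0, 2.0, 4.0):
    print(delta, pair_error_closed_form(delta), pair_error_monte_carlo(delta))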
5.5.2 Ranking k Alternatives

Now we generalize the above results to the case in which X = {x_1, x_2, . . . , x_k} is a set of k alternatives. We can define two different types of errors: the ranking error, that is, the probability of incorrectly ranking the set of alternatives, and the preference error, that is, the probability of not ranking first the best element of the set of alternatives. The ranking error is the probability of not observing the joint event u_1 < u_2 < · · · < u_k, where we suppose that y_i = k − i + 1, i = 1, . . . , k:

Pr{RankingError} = 1 − Pr{u_1 < u_2 < · · · < u_k}          (5.46)

If we assume that the single events are independent, then we can express the ranking error in terms of pairs of elements:

Pr{RankingError} = 1 − Pr{u_1 < u_2} Pr{u_2 < u_3} · · · Pr{u_{k−1} < u_k}
                 = 1 − Π_{i=1}^{k−1} Pr{u_i < u_{i+1}}          (5.47)
                 = 1 − Π_{i=1}^{k−1} (1 − Pr{u_{i+1} ≤ u_i})

In a preference problem, the error is the probability of not observing the joint event u_1 < u_k, u_2 < u_k, . . . , u_{k−1} < u_k, where we suppose that x_k is the best element:

Pr{PreferenceError} = 1 − Pr{u_1 < u_k, u_2 < u_k, . . . , u_{k−1} < u_k}          (5.48)

Also for the preference task, if we suppose that the single events are independent, then we can express the preference error in terms of pairs of elements involving the best element:

Pr{PreferenceError} = 1 − Pr{u_1 < u_k} Pr{u_2 < u_k} · · · Pr{u_{k−1} < u_k}
                    = 1 − Π_{i=1}^{k−1} Pr{u_i < u_k}          (5.49)
                    = 1 − Π_{i=1}^{k−1} (1 − Pr{u_k ≤ u_i})

So the ranking and preference errors have been reduced to a product of errors on pairs of elements, and we can use Equation (5.45), which expresses the probability of error on a pair of objects. Since

0 ≤ Pr{u_i ≤ u_j} ≤ 1/2   ⇒   1/2 ≤ 1 − Pr{u_i ≤ u_j} ≤ 1

we can derive a lower and an upper bound on the probability of error for ranking and preference:

1 − Π_{i=1}^{k−1} 1 ≤ Pr{Error} ≤ 1 − Π_{i=1}^{k−1} 1/2   ⇒   0 ≤ Pr{Error} ≤ 1 − 1/2^{k−1}          (5.50)
where Pr{Error} is either Pr{RankingError} or Pr{PreferenceError}. The curve of the upper bound of Pr{Error} is plotted in Figure 5.2, where we can see that the probability of error grows exponentially fast towards 1 as a function of the cardinality of the set of alternatives.

Figure 5.2. Upper bound of the preference and ranking error as a function of the number of alternatives.

If we define

∆_{i,j} = E_{U_i}{u_i} − E_{U_j}{u_j},   i, j = 1, . . . , k : i > j          (5.51)
then we can express the probabilities of ranking and preference error as a function of the distances between the values of the utility function U on pairs:

Pr{RankingError} = 1 − Π_{i=1}^{k−1} (1 − Pr{u_{i+1} ≤ u_i})
                 = 1 − Π_{i=1}^{k−1} e^{∆_{i+1,i}} (e^{∆_{i+1,i}} − ∆_{i+1,i} − 1) / (e^{∆_{i+1,i}} − 1)²

for the ranking error, and

Pr{PreferenceError} = 1 − Π_{i=1}^{k−1} (1 − Pr{u_k ≤ u_i})
                    = 1 − Π_{i=1}^{k−1} e^{∆_{k,i}} (e^{∆_{k,i}} − ∆_{k,i} − 1) / (e^{∆_{k,i}} − 1)²
for the preference error. Note that the ranking error depends on ∆i+1,i ,
i = 1, . . . , k − 1 while the preference error depends on ∆k,i , i = 1, . . . , k − 1,
where xk is the best element in both problems. If the utility function is
able to map more similar elements into closer values, we see that the ranking
problem is inherently more difficult then the preference one since ∆i+1,i <
∆k,i , i = 1, . . . , k − 1. But if the scores of the objects computed by U are well
“separated”, the probability of error can become arbitrarily small despite
the size of the set of alternatives. Finally, note that similar results can be
obtained using any other probability distribution for the scores assigned by
the utility function to the elements in the set of alternatives.
5.6
Conclusions
We derived three approaches for a new partial order model of preference
and ranking based on a 0–1 loss function exploiting the idea that a binary
partial order relation can model the constraints of preference and ranking
problems. We showed that modelling the ranking and preference function by
indirect approaches as the utility or pairwise function could lead to a greater
generalization error than the direct one due to the inherent characteristics of
the model which is unable to represent all the possible relations on the set
of alternatives.
Finally, we described a novel approach about how the ranking and preference errors depend on the size of set of alternatives. The larger is the size
of the set of alternatives, the bigger is the probability of an error. But if the
scores of the objects computed by the utility function were well separated,
the probability of error could become arbitrarily small.
165
Part III
Kernels on Structured Data for
Computational Molecular
Biology
Chapter 6
Weighted Decomposition
Kernel
We introduce a family of kernels on discrete data structures within the general
class of decomposition kernels. A Weighted Decomposition Kernel (WDK) is
computed by dividing objects into substructures indexed by a selector. Two
substructures are then matched if their selectors satisfy an equality predicate,
while the importance of the match is determined by a probability kernel on local distributions fitted on the substructures. Under reasonable assumptions,
a WDK can be computed efficiently and can avoid combinatorial explosion
of the feature space. We report experimental evidence that the proposed
kernel is highly competitive with respect to more complex state–of–the–art
methods on a set of problems in bioinformatics, involving protein sequence
and molecule graph classification. This chapter is based on Menchetti et al.
(2005b) and on Menchetti et al. (2005c).
6.1
Introduction
Statistical learning in structured and relational domains is rapidly becoming one of the central areas of machine learning, boosted by the increasing
awareness that the traditional propositional setting lacks expressiveness for
modelling many domains of interest. In this chapter we focus on super169
6.1 Introduction
vised learning of discrete data structures driven by several practical problems in bioinformatics that involve classification of sequences (e.g. protein
sub–cellular localization) and graphs (e.g. prediction of toxicity or biological
activity of chemical compounds).
Starting from the seminal work of Haussler (1999), several researchers
have defined convolution and other decomposition kernels on various types
of discrete data structures such as sequences (Lodhi et al., 2001; Leslie et al.,
2002a; Cortes et al., 2004), trees (Collins and Duffy, 2001), and annotated
graphs (Gärtner, 2003) (see Chapter 3 for an overview of these kernels).
Thanks to its generality, decomposition is an attractive and flexible approach
for constructing similarity on structured objects based on the similarity of
smaller parts. Still, defining a good kernel for practical purposes may be
challenging when prior knowledge about relevant features is not sufficient.
At one extreme, it may be desirable to take all possible subparts into
account. However, in so doing, the dimension of the feature space associated
with the kernel can become too large due to the combinatorial growth of the
number of distinct subparts with their size. For example, the number of distinct substrings grows exponentially with their lengths. Arguably, unless an
extensive use of prior knowledge guides the selection of relevant parts — e.g.
as done by Cumby and Roth (2003) using description logics — most dimensions in the feature space will be poorly correlated with the target function
and the explosion of features may adversely affect generalization in spite of
using large margin classifiers (Ben-David et al., 2002). As observed by many
researchers, the problem also manifests itself in the form of a Gram matrix
having large diagonal values, in close analogy to what would be obtained
using a too narrow Gaussian kernel. Common sense remedies include down–
weighting the contribution of larger fragments (Collins and Duffy, 2001) or
limiting their size a priori, although in so doing, we could miss some relevant features. A remedy based on kernel transformations is described by
Schölkopf et al. (2002a). An alternative promising direction that can avoid
dimensionality explosion is the generation of relevant features via mining frequent substructures. Methods of this family have been successfully applied
to the classification of chemical compounds (Kramer et al., 2001; Deshpande
et al., 2003). Other researchers have found that kernels based on paths can
170
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
also be very effective in chemical domains. Graph kernels based on counting label paths produced by random walks have been proposed by Kashima
et al. (2003) and later extended by Mahé et al. (2004) to include contextual
information. Horváth et al. (2004) have proposed counting the number of
common cyclic and tree patterns in a graph.
At the opposite extreme, one might flatten discrete structures into propositional representations, reducing the number of features at the (possibly
severe) cost of losing valuable structural information. One example of this
extreme is the use of amino acid composition for protein sequence classification (Hua and Sun, 2001): in this case, a sequence is simply represented by
the frequencies of occurrence of symbols, without taking into account their
relative positions.
In this chapter, we show how between the two above extreme approaches
(taking all subparts and flattening) it is possible to explore a useful class of
kernels that perform well in practice for both protein sequence and molecule
graph classification. A weighted decomposition kernel focuses on relatively
small parts of a structure, called selectors, that are matched according to
an equality predicate. The importance of the match is then weighted by a
factor that depends on the similarity of the context in which the matched
selectors occur. In order to introduce a “soft” similarity notion on contexts,
we extract attribute frequencies in each context and then apply a kernel
on distributions. Suitable options include Histogram Intersection Kernels
(HIKs) (Odone et al., 2005) and probability product kernels (Jebara et al.,
2004). In order to deal with complex structures, it is needed a way to easily
specify prior knowledge on the type of relations or patterns that we want to
discover. We propose to express this knowledge in terms of how to decompose the structured data in relevant parts and how to measure the similarity
between these parts.
The remainder of the chapter is organized as follows. In Section 6.2 we
review Haussler’s decomposition kernels giving a slightly more flexible definition. In Section 6.3 we introduce a general class of weighted decomposition
kernels and in Section 6.4 we discuss efficient algorithmic implementations.
Finally, in Section 6.5 we validate the method on several problems in bioinformatics, involving classification of protein sequences and classification of
171
6.2 Decomposition Kernels
molecules represented as graphs.
6.2
Decomposition Kernels
In this section, we review Haussler’s decomposition kernels giving a slightly
more flexible definition. We start from some of the definitions and results in
Shawe-Taylor and Cristianini (2004) and in Haussler (1999).
Definition 6.1 (R–decomposition structure) A R–decomposition struc~ R, ~ki where
ture on a set X is a triple R = hX,
~ = (X1 , . . . , XD ) is a D–tuple of non–empty subsets of X, D ∈ IN
• X
• R is a finite parthood relation on X1 × · · · × XD × X
• ~k = (k1 , . . . , kD ) is a D–tuple of Mercer kernels kd : Xd × Xd 7→ IR
R(~x, x) is true if and only if ~x is a tuple of parts for x — i.e. ~x is a decom~ : R(~x, x)} denote the
position of x. For any x ∈ X, let R−1 (x) = {~x ∈ X
multiset of all possible decompositions of x. A decomposition kernel is then
defined as the multiset kernel between the decompositions:
.
KR (x, z) =
X
X
~
x∈R−1 (x)
~
z ∈R−1 (z)
κ(~x, ~z)
(6.1)
where we adopt the convention that summations over the elements of a multiset take into account their multiplicity (Haussler (1999) gives a slightly
different definition). To compute κ(~x, ~z), kernels on parts are combined by
means of operators that need to be closed with respect to kernel positive definiteness. Haussler (1999) proved that combinations based on tensor product
(R–convolution kernels) and direct sum are positive definite:
KR,⊗ (x, z) =
~
x∈R−1 (x)
KR,⊕ (x, z) =
D
X Y
X
~
z ∈R−1 (z)
~
x∈R−1 (x)
~
z ∈R−1 (z)
172
(6.2)
kd (xd , zd )
(6.3)
d=1
D
X X
X
kd (xd , zd )
d=1
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
It is immediate to extend the previous result to decomposition kernels based
on other closed operators. Thus, denoting by opi a valid closed operator, we
can write the general form
κ(~x, ~z) = k1 (x1 , z1 ) op1 k2 (x2 , z2 ) op2 · · · opD−1 kD (xD , zD )
where it is not even necessary that the same operator opi should be used for
combining all kd kernels on parts. Note that, as introduced in Definition 6.1,
the tuple ~k = (k1 , . . . , kD ) is a D–tuple of kernel functions kd : Xd ×Xd 7→ IR:
when we allow these kernel functions to belong to the decomposition kernel
family KR , we recursively define a decomposition kernel composing kernels
on different R–decomposition structures.
6.2.1
Equivalence of Tensor Product and Direct Sum
We proved that the two forms (tensor product and direct sum in Equations
(6.2) and (6.3)) are interchangeable up to modification of the underlying
R–decomposition structure. The following theorem formalizes this result.
Theorem 6.1 (Equivalence of ⊗ and ⊕) There exists a R–decomposition
~ R, ~ki on a space X if and only if for all D0 there exists a
structure R = hX,
~ 0 , R0 , k~0 i such that
structure R0 = hX
KR,⊗ = KR0 ,⊕
(6.4)
Proof of Theorem 6.1 Since Theorem 6.1 states a necessary and sufficient
condition, we have to prove the theorem for both implications
~ R, ~ki ⇒ ∀ D0 ∈ IN ∃ R0 = hX
~ 0 , R0 , k~0 i : KR,⊗ = KR0 ,⊕
(⇒) ∃ R = hX,
~ 0 , R0 , k~0 i ⇒ ∃ R = hX,
~ R, ~ki : KR0 ,⊕ = KR,⊗
(⇐) ∀ D0 ∈ IN ∃ R0 = hX
Direct Implication (⇒) Let us start from direct implication. Without loss
of generality, we can consider only the case D = 1 in Equation (6.2) Lemma
6.1. In this way, the composite structure space X is decomposed in only
one part X1 , R is defined on set X1 × X, k1 : X1 × X1 7→ IR and the set
173
6.2 Decomposition Kernels
R−1 (x) ⊆ X1 contains all possible ways to choose one part x1 of x. It follows
that
X
X
KR,⊗ (x, z) =
(6.5)
k1 (x1 , z1 )
x1 ∈R−1 (x) y1 ∈R−1 (z)
~ 0 , R0 , k~0 i in the
Then we can define a new decomposition structure R0 = hX
following way. We create a partition of X1 in D0 parts X11 , . . . , X1D0 ⊆ X1 .
−1
This partition of X1 induces a partition of R−1 in D0 subsets R1−1 , . . . , RD
0
−1
0
corresponding to X11 , . . . , X1D0 such that Rd0 (x) ⊆ X1d0 , d = 1, . . . , D0
where
0
R−1 (x) =
Ri−1 (x) ∩
D
[
Rd−1
0 (x)
d0 =1
Rj−1 (x)
(6.6)
= ∅, ∀ i, j = 1, . . . , D0 , i 6= j
Now we define the D0 parts as
Xd0 0 = X1d0 × {1, . . . , D0 },
d0 = 1, . . . , D0
(6.7)
where the set {1, . . . , D0 } represents the index d0 of the subset X1d0 to which
0
x1 belongs to after partitioning X1 . Then we introduce a kernel k 0 : X1d
0 ×
0
0
0
0
X1d
0 7→ IR, d = 1, . . . , D on the D parts as
(
k1 (x1 , z1 ) if i = j
k 0 ((x1 , i), (z1 , j)) =
(6.8)
0 otherwise
that is k 0 = k1 if x1 and z1 belong to the same subset of the partition, 0
otherwise. So we can write the kernel in Equation (6.5) as
0
KR,⊗ (x, z) =
D
X
0
X
D
X
X
k 0 ((x1 , i), (z1 , j))
(6.9)
i=1 x1 ∈R−1 (x) j=1 y1 ∈R−1 (z)
i
j
0
=
D
X
X
X
k 0 ((x1 , d0 ), (z1 , d0 ))
(6.10)
−1
d0 =1 x1 ∈R−1
0 (x) z1 ∈R 0 (z)
d
d
since the kernel k 0 is zero if the parts do not belong to same subset of the
0
partition. Finally, we define the new relation R0 on X10 , . . . , XD
0 × X using
0
the partition of R in D subsets as
R0 ((x1 , 1), . . . , (xD0 , D0 ), x) iff ∀ d0 = 1, . . . , D0 Rd0 (xd0 , x)
174
(6.11)
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
The idea is to include in the relation R0 the summation on the D0 parts: so
(x1 , 1), . . . , (xD0 , D0 ) are in relation with x if xd0 , d0 = 1, . . . , D0 is in relation
with x by Rd0 , or rather if xd0 belongs the the d0 –th set of the partition. In
conclusion,
0
KR0 ,⊕ (x, z) =
X
D
X
X
k 0 ((xd0 , d0 ), (zd0 , d0 ))
(6.12)
kd0 0 (x0d0 , zd0 0 )
(6.13)
~
x0 ∈R0 −1 (x) ~
z 0 ∈R0 −1 (z) d0 =1
0
=
X
X
D
X
~
x0 ∈R0 −1 (x)
~
z 0 ∈R0 −1 (z)
d0 =1
where ~x0 = (x1 , 1), . . . , (xD0 , D0 ), ~z0 = (z1 , 1), . . . , (zD0 , D0 ), kd0 0 (x0d0 , zd0 0 ) =
k 0 ((xd0 , d0 ), (zd0 , d0 )) and x0d0 = (xd0 , d0 ). So we have converted a tensor product
based kernel into a kernel based on direct sum modifying the R–decomposition
structure.
Inverse Implication (⇐) We start from a kernel KR0 ,⊕ based on a decom~ 0 , R0 , ~k 0 i on D0 parts, where X
~ 0 = (X10 , . . . , X 0 0 )
position structure R0 = hX
D
0
0
0
0
~
and k = (k1 , . . . , kD0 ). First of all, we define D sets as
Xd0 = Xd0 0 × {1, . . . , D0 }, d0 = 1, . . . , D0
(6.14)
and then we define their union X1 as
X1 =
D0
[
Xd0
(6.15)
d0 =1
where the set of indexes {1, . . . , D0 } identifies the set Xd0 0 to which x1 ∈ X1
belongs to. Let k1 : X1 × X1 7→ IR be a kernel on the part X1 defined as
(
k1 ((x01 , i), (z10 , j)) =
ki0 (x01 , z10 )
0
175
if i = j
if i =
6 j
(6.16)
6.2 Decomposition Kernels
that is k1 is zero if x01 and z10 do not belong to same part and ki0 (x01 , z10 ) if x01
and z10 belong to same part Xi0 . So Equation (6.3) becomes
0
KR,⊕ (x, z) =
X
X
D
X
~
x0 ∈R0 −1 (x)
~
z 0 ∈R0 −1 (z)
d0 =1
X
X
D X
D
X
0 −1
0 −1
(z) i=1 j=1
0
=
~
x0 ∈R
(x) ~
z 0 ∈R
D0
=
X
X
k1 ((x0d0 , d0 ), (zd0 0 , d0 ))
(6.17)
0
k1 ((x0i , i), (zj0 , j))
(6.18)
k1 ((x0i , i), (zj0 , j))
(6.19)
0
X
D
X
~
x0 ∈R0 −1 (x) i=1 ~
z 0 ∈R0 −1 (z) j=1
P
PD0
The meaning of the double summation
~
x0 ∈R0 −1 (x)
i=1 is to choose one
0
possible decomposition of x in D parts using relation R0 and then to pick
the i–th element of decomposition. So we can define a new relation R on set
X1 × X using relation R0 as
d0
z}|{
R((x01 , d0 ), x) iff R0 ( ·, . . . , x01 , . . . , ·, x)
(6.20)
that is (x01 , d0 ) is in relation with x by R if x01 is the d0 –part out of D0 parts
in which R0 decomposes x. So R−1 (x01 , d0 ), d0 = 1, . . . , D0 is the union of the
sets which elements are parts in which R0 decomposes x in all possible ways.
Then Equation (6.19) becomes
KR,⊗ (x, z) =
X
X
k1 (x1 , z1 )
(6.21)
~
x1 ∈R−1 (x) ~
z1 ∈R−1 (z)
where ~x1 = x1 = (x01 , i) and ~z1 = z1 = (z10 , j), which is a particular case of
Equation (6.2) when D = 1. This concludes the proof.
The following lemma, which completes the Theorem 6.1, states that the
decomposition of X in one or more parts (D ≥ 1) is equivalent to a decomposition in only one part (D = 1) up to modification of the underlying
R–decomposition structure.
176
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
Lemma 6.1 (One Part Equivalence for Tensor Product) There exists
~ R, ~ki on a space X if and only if there
a R–decomposition structure R = hX,
~ 0 , R0 , k~0 i with D0 = 1 such that
exists a structure R0 = hX
KR,⊗ = KR0 ,⊗
(6.22)
Remark Note that the number of parts D corresponding to the decomposition R can assume any value greater than or equal to 1.
Proof of Lemma 6.1 Since Lemma 6.1 states a necessary and sufficient
condition, we have to prove the theorem for both implications
~ R, ~ki ⇒ ∃ R0 = hX
~ 0 , R0 , k~0 i with D0 = 1 : KR,⊗ = KR0 ,⊗
(⇒) ∃ R = hX,
~ 0 , R0 , k~0 i with D0 = 1 ⇒ ∃ R = hX,
~ R, ~ki : KR0 ,⊗ = KR,⊗
(⇐) ∃ R0 = hX
Direct Implication (⇒) If D = 1, the lemma is immediately verified. If
~ 0 , R0 , k~0 i in the
D > 1, we can define a new decomposition structure R0 = hX
following way
• the only part D0 = 1 in which X is decomposed as X10 = X1 × · · · × XD
• a new relation R0 on X10 × X defined as
R0 (x01 , x) iff R(x1 , . . . , xD , x)
(6.23)
where x01 = (x1 , . . . , xD )
• a new kernel k10 : X10 × X10 7→ IR defined as
k10 (x01 , z10 )
=
D
Y
kd (xd , zd )
(6.24)
d=1
where x01 = (x1 , . . . , xD ) and z10 = (z1 , . . . , zD )
So we obtain a new R–convolution kernels on only one part
X
X
KR0 ,⊗ (x, z) =
k10 (x01 , z10 )
(6.25)
x01 ∈R0 −1 (x) z10 ∈R0 −1 (z)
Inverse Implication (⇐) If D = 1, the lemma is immediately verified. If
~ R, ~ki in the
D > 1, we can define a new decomposition structure R = hX,
following way
177
6.2 Decomposition Kernels
• the D parts in which X is decomposed as X1 = X2 = · · · = XD = X10
• a new relation R on X1 × X2 × · · · × XD × X defined as
R(x1 , . . . , xD , x) iff ∀ d = 1 . . . , D R0 (xd , x) ∧ x1 = x2 = . . . = xD (6.26)
• the D kernels kd : Xd × Xd 7→ IR defined as
k1 (x1 , z1 ) = k10 (x1 , z1 )
kd (xd , zd ) = 1, d = 2, . . . , D
(6.27)
or
kd (xd , zd ) =
p
D
k10 (xd , zd ),
d = 1, . . . , D
(6.28)
So we obtain a new R–convolution kernels on D parts
KR,⊗ (x, z) =
D
X Y
X
~
x∈R−1 (x)
~
z ∈R−1 (z)
kd (xd , zd )
(6.29)
d=1
This ends the proof.
6.2.2
All–Substructures Kernels
Since decomposition kernels form a rather vast class, the relation R needs
to be carefully tuned to different applications in order to characterize a suitable kernel. One commonly used family consists of all–substructures kernels,
which count the number of common substructures in two decomposable objects. In this case, the R–decomposition structure R = hX, R, δi has D = 1,
the relation R is defined as
R(x1 , x) iff x1 is a substructure of x
and δ is the exact matching kernel
(
δ(x1 , z1 ) =
1 if x1 = z1
0 otherwise
178
(6.30)
(6.31)
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
The all–substructures kernel becomes
X
KR,δ (x, z) =
x1
X
∈R−1 (x)
z1
δ(x1 , z1 )
(6.32)
∈R−1 (z)
The resulting convolution kernel can also be written as
KR,δ (x, z) = R−1 (x) ∩ R−1 (z)
(6.33)
where the intersection is between two multisets. So an integral part of many
kernels for structured data is the decomposition of an object into a set of its
parts and the intersection of two set of parts. Note that in general, computing
the equality predicate between x1 and z1 may not be computationally efficient
as it might require solving a subgraph isomorphism problem (Gärtner et al.,
2003). Known kernels that can be reduced to the above form include the
spectrum kernel on strings (Leslie et al., 2002a), the basic version (with no
down–weighting) of co–rooted subtree kernel on trees (Collins and Duffy,
2001) and kernels counting common walks on graphs (Gärtner, 2003).
6.3
Weighted Decomposition Kernels
In this section, after introducing the concepts of data types and graph probability distribution kernels, we describe the general form of a weighted decomposition kernel and the we specialize this formulation to the case of biological
sequences and of molecules.
6.3.1
Data Types
We focus on instances from a wide class of annotated graphs. This includes
sequences and trees as special cases. No particular restriction needs to be
assumed about graph topologies. In particular, we allow the presence of cycles and we can use directed or undirected edges and ordered, unordered or
positional adjacency lists. For simplicity, we assume that labels associated
with vertices and edges are tuples of atomic attributes. Attributes are organized into classes and can be instantiated for each vertex or edge. So, for
example, in a chemical domain we may introduce the class AtomType for
179
6.3 Weighted Decomposition Kernels
vertex attributes and write AtomType(3) = C to indicate that vertex 3 in
a graph molecule is a carbon atom. In the following, we will denote by ξ
a generic vertex attribute class and by ξ(v) its value at vertex v. Similarly,
we denote by ξ(u, v) the value of an edge attribute of class ξ at edge (u, v).
Finally, if x is a graph, we denote by ξ(x) the value multiset associated with
attribute ξ. In the case of vertex attributes, ξ(x) = {ξ(v) : v ∈ V (x)}
where V (x) is the vertex set of x. Similarly, in the case of edge attributes,
ξ(x) = {ξ(u, v) : (u, v) ∈ E(x)} where E(x) is the edge set of x.
6.3.2
Graph Probability Distribution Kernels
In a probability product kernel, a simple generative model is fitted to each
example and the kernel between two examples is evaluated by integrating the
product of the two corresponding distributions (Jebara et al., 2004). These
probability kernels can be simply extended to value multisets associated with
vertex or edge attributes in a (sub)graph. For a given graph x and a given
attribute ξ, we fit a probability model p(λ) to ξ(x). Then we can introduce
a probability kernel between two distributions p and p0 as
Z
0
p(λ)d p0 (λ)d dλ
(6.34)
kprob (p, p ) =
Λ
where Λ is the set of admissible values for attribute ξ and d is a positive
constant (a possible choice is d = 1/2 yielding the so–called Bhattacharyya
kernel). If p and p0 are fitted to the value multisets ξ(x) and ξ(x0 ) respectively,
then the above kernel can be interpreted as a measure of similarity between
the flattened representations of x and x0 .
In this work, we use a discrete version of these kernels, based on multinomial frequencies. When attributes are categorical, multinomial probabilities
can be estimated from frequencies. Multiple attributes ξi with i = 1, . . . , na
can be taken into account by fitting separate probability models pi to the
value multisets of each attribute, obtaining the discrete version of the above
probability product kernel (6.34). Given a graph x and an attribute ξi , let
pi (j) be the observed frequency of value j in ξi (x). A first type of kernel is
defined as
mi
X
ki (x, x0 ) =
pi (j)d p0i (j)d
(6.35)
j=1
180
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
where mi is the number of distinct values for ξi . Setting d = 1/2 we obtain a
discrete version of the Bhattacharyya kernel. Note that the discrete version
of the Bhattacharyya kernel with d = 1/2 is a normalized kernel. Precisely
ki (x, x) =
mi
X
p
pi (j) pi (j) =
pi (j) = 1
mi
X
p
(6.36)
j=1
j=1
As an interesting alternative in the categorical case, we may use histogram
intersection kernels (Barla et al., 2002, 2003; Odone et al., 2005) defined as:
0
ki (x, x ) =
mi
X
min{pi (j), p0i (j)}
(6.37)
j=1
To demonstrate the histogram intersection kernel is a Mercer kernel, we suppose that pi (j) and p0i (j) are the number of occurrences of value j of attribute
ξi in x and x0 respectively, so pi (j) and p0i (j) are two histograms on attribute
values. Since an histogram can be represented in the feature space by a
binary vector φi of mi sections as
pi (1)
φi
pi (2)
pi (mi )
z }| {
z }| {
z }| {
= (1, 1, . . . , 1, 0, 0, . . . , 1, 1, . . . , 1, 0, 0, . . . , . . . , 1, 1, . . . , 1, 0, 0, . . .)
where the number of ones in each section is equal to the values pi (j) of
corresponding bins, the histogram intersection kernel is equivalent to the
inner product ki (x, x0 ) = hφi , φ0i i between two feature vectors φi and φ0i .
Note that
mi
mi
X
X
ki (x, x) =
min{pi (j), pi (j)} =
pi (j)
j=1
i=1
and so the normalized version is
ki (x, x0 )
qP
mi
0
p
(j)
j=1 i
j=1 pi (j)
kinorm (x, x0 ) = qP
mi
The contributions of multiple attributes can be summed or multiplied,
yielding kernels of the form:
na
Y
κ(x, x ) =
(1 + ki (x, x0 ))
0
(6.38)
i=1
κ(x, x0 ) =
na
X
i=1
181
ki (x, x0 )
(6.39)
6.3 Weighted Decomposition Kernels
where ki (x, x0 ) is one of Equation (6.35) or Equation (6.37). In the case of
continuous attributes (no experimentation reported in this work) one could
fit appropriate continuous distributions and apply kernels defined in Jebara
et al. (2004).
6.3.3
General Form of WDKs
A weighted decomposition kernel is characterized by the following decomposition structure:
~ R, (δ, κ1 , . . . , κD )i
R = hX,
(6.40)
~ = (S, Z1 , . . . , ZD ), R(s, z1 , . . . , zD , x) is true iff s ∈ S is a subgraph
where X
of x called the selector and ~z = (z1 , . . . , zD ) ∈ Z1 × · · · × ZD is a tuple of
subgraphs of x called the contexts of occurrence of s in x (precise definitions of
s and ~z are domain–dependent as shown in Sections 6.3.4 and 6.3.5). In order
to ensure an efficient computation of the kernel, some restrictions have to be
placed on the sizes of the above entities. First we assume that |R−1 (x)| =
O(|V (x)| + |E(x)|), i.e. the number of ways a graph can be decomposed
grows at most linearly with its size. Second, we assume that selectors have
constant size with respect to x, i.e. R(s, ~z, x) ⇒ |V (s)| + |E(s)| = O(1). The
definition is completed by the kernels on parts: δ is an exact matching kernel
on S × S and κd is one of graph probability distribution kernels on Zd × Zd
defined in Section 6.3.2. This setting results in the following general form of
the kernel:
0
K(x, x ) =
X
X
(s,~
z )∈R−1 (x)
(s0 ,~
z 0 )∈R−1 (x0 )
0
δ(s, s )
D
X
κd (zd , zd0 )
(6.41)
d=1
where the direct sum between kernels over parts κd can be replaced by the
tensor product. Compared to kernels that simply count the number of substructures, the above function weights different matches between selectors
according to contextual information as illustrated in Figure 6.1. Intuitively,
WDK transforms a graph in a bag of substructures whose description is enriched by the information extracted from the neighboring subgraph of each
substructure. The kernel can be afterwards normalized. In the following
subsections we specialize this general form to practical cases of interest.
182
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
Z‘
S‘
Z
S‘
S
S
Figure 6.1. Comparison between matching kernel (left) and WDK (right).
6.3.4
A WDK for Biological Sequences
Biological sequences are finite length strings on a finite alphabet A (for example A consists of the 20 amino acid letters in the case of proteins) and therefore X = A∗ . Given a string x ∈ A∗ , two integers e ≥ 0 and e ≤ t ≤ |x| − e,
let x(t, e) denote the substring of x spanning string positions from t − e to
t + e included. The simplest version of WDK is obtained by choosing D = 1
and a relation R depending on two integers r ≥ 0 (the selector radius) and
l ≥ r (the context radius) defined as
R = {(s, z, x) : x ∈ A∗ , s = x(t, r), z = x(t, l), l ≤ t ≤ |x| − l}
Figure 6.2 shows an example of a selector (3–mer) and a context (amino acid
composition of a larger window around the 3–mer) applied to a biological
sequence.
The kernel is then defined as
|x|−l |x0 |−l
K(x, x0 ) =
XX
t=l
δ(x(t, r), x0 (τ, r))κ(x(t, l), x0 (τ, l))
(6.42)
τ =l
Intuitively, when applied to protein sequences, this kernel computes the number of common (2r + 1)–mers weighting matching pairs by the similarity
between the amino acid composition of their environments — measured, for
example, by one of the probability distribution kernels as defined in Equation
(6.38) or Equation (6.39). Of course if κ(·, ·) ≡ 1, then this WDK reduces
to the spectrum kernel. So the similarity between sequences is measured
by counting common k–mers and weighting each match using contextual information: a small k–mer size avoid the problem of sparseness and a large
183
6.3 Weighted Decomposition Kernels
t
r
l
k-mer
. . . L A S S I G I V A K K L G E M W N W T. . .
context
AKK
3-mer
ADGKLQSTVW
. . . L S I G I V A K K L G E M W N W T. . .
context
AKK
ADGKLQSTVW
. . L V S A K K G A T D K T A K K Q W I T. . .
AKK
ADGKLQSTVW
Figure 6.2. The simplest version of WDK is obtained by choosing D = 1
and a relation R depending on two integers r ≥ 0 (the selector radius) and
l ≥ r (the context radius).
context accounts for more structural information. Note that although the
above equation seems to imply a complexity of O(|x||x0 |), more efficient implementations are possible (see Section 6.4).
6.3.5
A WDK for Molecules
A molecule is naturally represented by an undirected graph x where vertices
are atoms and edges are bonds between atoms. Vertices are annotated with
attributes such as atom type, atom charge, membership to specific functional
184
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
groups (i.e. whether the atom is part of a carbonyl, metil, alcohol or other
group in the molecule) and edges are annotated with attributes such as bond
type. Given a vertex v and an integer l ≥ 0, we denote by x(v, l) the subgraph
of x induced by the set of vertices which are reachable from v by a path of
length at most l and by the set of all edges that have at least one end in the
vertex set of x(v, l) (see Figure 6.3). Also let x(v) be the node v of x.
l
v
Figure 6.3. Example of selector (red vertex) and context (blue vertices and
red vertex) for a graph.
The first WDK we propose is obtained by choosing D = 1 and a relation
R that depends on an integer l ≥ 0 (the context radius) defined as
R = {(s, z, x) : x ∈ X, s = x(v), z = x(v, l), v ∈ V (x)}
(6.43)
The kernel is defined as
K(x, x0 ) =
X
v∈V (x)
X
v 0 ∈V
δ(x(v), x0 (v 0 )) · κ(x(v, l), x0 (v 0 , l))
(6.44)
(x0 )
Note that selectors consist of single vertices, allowing us to compute δ in
constant time. The matching selectors are weighted by the similarity between
the contexts of their environments — measured, for example, by one of the
probability distribution kernels as defined in Equation (6.38) or Equation
(6.39). As discussed in Section 6.4 about complexity, other options that still
preserve efficiency may be available.
In the second version of WDK, we set D = 2 and use two types of contexts,
z1 (v, l) = x(v, l) as in the first version and its graph complement denoted by
185
6.4 Algorithms and Complexity
z2 (v, l). Probability kernels over contexts can be combined under direct sum
obtaining
κ((~z, v), (~z0 , v 0 )) = κ1 (z1 (v, l), z10 (v 0 , l)) + κ2 (z2 (v, l), z20 (v 0 , l))
(6.45)
or alternatively under tensor product. Equation (6.41) including graph complement finally becomes
X X
K(x, x0 ) =
(6.46)
δ(x(v), x0 (v 0 )) · κ((~z, v), (~z0 , v 0 ))
v∈V (x) v 0 ∈V (x0 )
O
The two versions of WDK for molecule graphs are shown in Figure 6.4.
D=1
N
O
S
D=2
S
N
Figure 6.4. Two versions of WDK for molecules: neighboring context (top)
and neighboring and complement contexts (bottom).
6.4
Algorithms and Complexity
The computational efficiency of a decomposition kernel depends largely on
the cost of constructing and matching substructures. In particular, exact
matching of substructures might lead to intractability when dealing with
general graphs (Gärtner et al., 2003). This problem is avoided in WDK.
Selectors require exact matching but consist of small substructures that can
be reasonably constructed and matched in O(1). Examples of acceptable
selectors include: short substrings for sequences, tuples formed by vertices
186
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
and the ordered list of their children for trees (e.g. production rules in the
case of parse trees) and single vertices for non–ordered graphs. Contexts may
be large but in this case efficiency is achieved by constructing and matching attribute frequencies (histograms) and not subgraphs leading to a linear
complexity in the size of the structure for constructing and matching all
the histograms. Efficient procedures can be devised for calculating the kernel under the assumption that graphs are labelled by categorical attributes.
The code implementing a computationally efficient WDK is freely available
at http://www.dsi.unifi.it/neural/src/WDK/ under the terms of the GNU
General Public License.
6.4.1
Indexing and Sorting Selectors
Before kernel calculation, each selector is mapped into an index and each
instance is pre–processed to construct a lexicographically sorted list that
associates context histograms to selectors as illustrated in Figure 6.5. The
LSI
SIG
IGD
. . . L S I G D VA K K L G E M A K K TA K K . . .
GDV
...
AKK
KKT
Figure 6.5. Preprocessing for constructing a lexicographically sorted index
that associates context histograms to selectors. Note that the selectors have
still to be sorted.
cost of this step is O(NSel log NSel ) + Tc for each instance, where NSel =
|R−1 (x)| and Tc is the time for calculating all the context histograms. The
two outer summations over selectors in the general formulation of WDK in
Equation (6.41) below reported
0
K(x, x ) =
X
X
(s,~
z )∈R−1 (x)
(s0 ,~
z 0 )∈R−1 (x0 )
187
0
δ(s, s )
D
X
d=1
κd (zd , zd0 )
(6.47)
6.4 Algorithms and Complexity
are then computed by scanning the two ordered indices of x and x0 associated with selectors as in the case of the inner sparse product between two
sparse vectors. This strategy leads to a complexity reduction of the kernel
2
)
computation between two instances of the same size ranging from O(NSel
up to O(NSel ), depending on indexing sparseness: the best case is when each
bucket contains a single context histogram, the worst when a single bucket
contains all of them.
6.4.2
Computing Histograms
We now briefly discuss algorithmic ideas for computing label histograms efficiently when contexts are subgraphs formed by all vertices at bounded distance from the selector (this is the case of the kernels for biological sequences
and for molecules proposed in Sections 6.3.4 and 6.3.5).
6.4.2.1
Sequences Histograms Computation
Bounded distance contexts defined over sequences have the desirable property
that their frontiers can be constructed in O(1) by directly accessing the
appropriate items of the sequence. An algorithm can exploit this property
adjusting the histogram of an element with respect to a neighboring element
by simply removing and adding single labels. So in the case of sequences,
context histograms can be updated in O(1) moving along the sequence from
left to right as shown in Figure 6.6. Therefore, the time for constructing all
AEGKLMNV AEGKLMNV
. . . L S I G D VA K K L G E M E N N T G G . . .
Figure 6.6. Context histograms computation in the case of sequences.
context histograms is Tc = O(|s| + l), where l is the size of each context (as
188
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
defined in Section 6.3.4) and |s| is the length of the sequence s (we make
the assumption that the size of the set of admissible values for attributes is
independent from the data size).
6.4.2.2
Trees Histograms Computation
In the case of trees, let T = (V, E, r) be a binary rooted tree, where V is the
set of nodes, E the set of edges and r ∈ V is the root of the tree. Without
loss of generality, we restrict to binary trees but the following results can be
easily extended to non binary trees.
An algorithm can efficiently compute the histogram Hp of the subtree
dominated by a node p ∈ V in a recursive fashion, partially using the histograms of its children. Given a context radius l, it is possible to use a vector
of histograms Hp [d], d = 1, . . . , l associated with each node p ∈ V to store
the information on statistics at increasing distances d. Algorithm 6.1 shows
how to compute Hp and Hp [d], d = 1, . . . , l for a node p in the tree T within
a context radius l, where LEFT(p) and RIGHT(p) are respectively the left
child and the right child of node p, Hv [i] + Hp [j] is the sum over all the
components of the histograms and Hp [d] ← label(v) is the insertion of the
label of node v into the histogram.
Algorithm 6.1 Tree–Make–Histogram–Vector(T, p, l)
Input: A binary rooted tree T , a node p ∈ V and the context radius l
Output: Hp and Hp [d ], d = 1, . . . , l
1: if p 6= NULL then
2:
Tree–Make–Histogram–Vector(T, LEFT(p), l)
3:
Tree–Make–Histogram–Vector(T, RIGHT(p), l)
4:
Hp [1] ← label(p)
5:
for i = 2 to l do
6:
Hp [i] ← HLEFT(p) [i − 1] + HRIGHT(p) [i − 1]
7:
end for
8:
Hp ← Hp [1] + Hp [2] + . . . + Hp [l]
9:
return Hp , Hp [1], Hp [2], . . . , Hp [l]
10: end if
189
6.4 Algorithms and Complexity
Exploiting the output of the Algorithm 6.1, the label histograms for each
b p be the hisnode of a tree structure can be computed efficiently. Let H
togram of the context of a node p within a context radius l. Precisely, the
b p can be constructed by
algorithm can use the property that the histogram H
the histograms of all its children excluding the histograms Hi [l] at distance
l and retaining the information on label(p) of p. Algorithm 6.2 shows how
b p for a node p in the tree T within a context radius l, where
to compute H
PARENT(p) and SIBLING(p) are respectively the parent and the sibling
of node p and Hp [d] + label(v) is the insertion of label of node v into the
histogram.
Algorithm 6.2 Tree–Make–Histogram(T, p, l)
Input: A binary rooted tree T , a node p ∈ V and the context radius l
b p of context of node p within a context radius l
Output: Histogram H
1: if p 6= NULL then
2:
Tree–Make–Histogram(T, LEFT(p), l)
3:
Tree–Make–Histogram(T, RIGHT(p), l)
b p ← label(p) + H
b LEFT(p) − HLEFT(p) [l] + H
b RIGHT(p) − HRIGHT(p) [l]
4:
H
5:
c←p
6:
for i = 1 to l do
bp ← H
b p + HSIBLING(c) [l − i] + label(PARENT(c))
H
7:
8:
c ← PARENT(c)
9:
end for
bp
10:
return H
11: end if
b p for node p is obtained
Step 4 of Algorithm 6.2 shows how the histogram H
by merging its children histograms and removing the contribution from the
b p exists from
deepest level l. In addition, a contribution to the histogram H
each node on the path from p to the root node r as illustrated by steps 5–8
in Algorithm 6.2 and in Figure 6.7. Such an algorithm can construct all
context histograms with a complexity O(|V |) with a linear increase in space
complexity (we suppose the size of histograms is constant with respect to the
size of the tree).
190
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
d
p
d−3
d−2
d−1
Figure 6.7. Relationship between the node p and its histograms at various
levels.
6.4.2.3
DAGs Histograms Computation
Context histograms for DAGs can efficiently be computed following the same
procedure as in the tree case described in Section 6.4.2.2, once DAGs have
been topologically sorted. The algorithm gets more complicated for the presence of vertices with more than one incident edge. The key idea is that if a
vertex u has among its descendants two vertices that have a common neighbor q, care has to be taken to count the contribution of q only once (see
Figure 6.8). In addition to a vector of histograms, we need to associate a
hash table for each vertex in order to efficiently access those descendant vertices with multiple incident edges and subtract their multiple contribution.
The computational complexity is bounded by O(|V | + |E|). For the general
case of undirected cyclic graphs directly, computing the histogram visiting in
breadth–first the neighborhood of each vertex achieves a complexity bounded
by O(|V |2 + |V ||E|).
Algorithm 6.3 describes in detail an efficient procedure for computing
context histograms for all the vertices of a DAG. Given a directed acyclic
graph G = (V, E), after sorting all vertices in reverse topological order, the
construction of the histogram Hu of the context graph for each vertex u ∈ V
proceeds summing the components of the histograms of adjacent vertices
in the adjacency list Adju of vertex u. We store histograms at increasing
191
6.4 Algorithms and Complexity
Algorithm 6.3 Graph–Make–Histogram(G, l)
Input: A DAG G and the context radius l
Output: Context histograms Hu , u ∈ V
1: Topological–Sort(G)
2: for all u ∈ V in reverse topological order do
3:
Hu [0] ← label(u)
4:
. Mark vertices with multiple incident edges
5:
if InEdges(u) > 1 then
6:
Eu ← 0
7:
end if
8:
. Compose vertex histogram from the histograms of adjacent vertices
9:
for all v ∈ Adju do
10:
for i = 1 to l do
11:
Hu [i] ← Hu [i] + Hv [i − 1]
12:
end for
13:
. Add link reference and update link reference distance
14:
Eu ← Eu + Ev + 1
15:
end for
16:
. Remove histogram redundancy
17:
for all v ∈ REDUNDANT(Adju ) do
18:
dv = LinkReferenceDistancev
19:
for i = 0 to l do
20:
Hu [i + dv ] ← Hu [i + dv ] − Hv [i]
21:
end for
22:
end for
23:
Hu ← Hu [0] + Hu [1] + · · · + Hu [l]
24: end for
25: return Hu , u ∈ V
192
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
u
v
q:4
q:3
v
q:2
q:1
q:2
q:1
q
Figure 6.8. An example of a vertex u which has among its descendants two
vertices that have a common neighbor q.
distances from the vertex u in a vector of histograms so that Hu [i], the
histogram at distance i for vertex u, is built from Hv [i − 1] with v ∈ Adju
(steps 9–11 in Algorithm 6.3). Note that elements at distance l from vertices
in the adjacency of vertex u, have distance l + 1 from u. Given the number
of incident edges InEdges(u), information about vertices u with more than a
single incident edge is stored in a hash data structure E, where elements in E
are indexed by the vertex identifier and have Link Reference Distance (LRD)
information associated. Eu denote the hash table for edge annotation of
vertex u and the expression E +1 is used to represent the addition of 1 to link
reference distance information associated with all elements in E. Information
of the distance of u from the current vertex v is added in order to identify
the histogram of vertices whose contribution to the histogram computation
would, incorrectly, be accounted for more than once. The numerical part of
the labels of incident edges into node q in Figure 6.8 is the distance from q to
the current vertex v. Once again, the set of vertices with multiple incident
edges Eu is composed from Ev with v ∈ Adju (step 17 in Algorithm 6.3).
The elements in the histogram of these vertices are then removed from the
overall histogram (steps 17–20 in Algorithm 6.3).
193
6.4 Algorithms and Complexity
A procedure described in Algorithm 6.4 computes the subset of adjacent
vertices to u that occur multiple times whose distance is not minimal. It uses
two auxiliary structures C and M for storing respectively common descendants and minimal distance. The complexity of the Algorithm 6.3 is linear
Algorithm 6.4 Redundant(Adju , E)
Input: The adjacency list Adju and the hash data structure E
Output: A redundant subset L of Adju
1: . Count occurrences of common descendants and store minimal distance
2: for all v ∈ Adju do
3:
for all q ∈ Ev do
4:
C[q] ← C[q] + 1
5:
M [q] ← min (M [q], LinkReferenceDistanceq )
6:
end for
7: end for
8: . List all descendants that occur multiple times whose distance is not
minimal
9: L ← ∅
10: for all v ∈ Adju do
11:
for all q ∈ Ev do
12:
if C[q] > 1 and LinkReferenceDistanceq > M [q] then
13:
L←L∪q
14:
end if
15:
end for
16: end for
17: return L
with respect to |E| if the context radius l is independent from E (e.g. l is
constant).
6.4.3
Reducing HIK Complexity
Histogram intersection kernel (6.37) between multiple histograms can be evaluated efficiently computing the minimum between sorted histogram bins over
different parts with the same selector. So, instead of computing HIK between
194
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
histograms, we can work with sorted data structures and compute the minimum between histogram bins over different parts with the same selector.
The basic idea is to swap the summation on the NHist histograms associated with selectors with the summation on the NBin components of the
histograms:
N
Hist
X
kHIK (Hj , Hj0 )
=
N
Hist N
Bin
X
X
j=1
min{Hj (i), Hj0 (i)}
(6.48)
min{Hj (i), Hj0 (i)}
(6.49)
j=1 i=1
=
N
Bin N
Hist
X
X
i=1 j=1
So the vector containing the i–th components of the NHist histograms can be
sorted and the computation of minimum between two histograms reduces to
the minimum between two sorted vectors of histogram components.
For example, in the case of graphs with only one histogram for each node,
the WDK reduces to
X X
K(G1 , G2 ) =
δ(v1 , v2 )kHIK (Hv1 , Hv2 )
(6.50)
v1 ∈V1 v2 ∈V2
=
X X
v1 ∈V1 v2 ∈V2
=
N
Bin
X
X X
N
Bin
X
{Hv1 (i), Hv2 (i)}
(6.51)
δ(v1 , v2 ){Hv1 (i), Hv2 (i)}
(6.52)
δ(v1 , v2 )
i=1
i=1 v1 ∈V1 v2 ∈V2
Considering only the parts with the same selector, we have to compute
X X
(6.53)
{Hv1 (i), Hv2 (i)}
v1 ∈V1 v2 ∈V2
for each component i = 1, . . . , NBin and then summing up over components.
The vectors Hv1 (i), v1 ∈ V1 and Hv2 (i), v2 ∈ V2 containing all the i–th components of various histograms can be sorted by value and then computing HIK
between histograms reduces to computing componentwise minimum between
two sorted vectors. Figure 6.9 shows the optimized procedure for reducing
HIK complexity between multiple histograms. In a pre–processing phase (for
example, while reading the data), the bins of different histograms are sorted
195
6.4 Algorithms and Complexity
G1
Histogram=j
Selector=k
G2
Selector=k
Bin=i
Bin=i
Part1
1
Part1
2
Part2
4
Part2
3
Part3
5
Part3
6
H1_v1(i)=1
H2_v1(i)=2
H1_v2(i)=4
H2_v2(i)=3
Merging
H1_v3(i)=5
H2_v3(i)=6
H1_v1(i)=1
3 H1_v1(i)
H2_v1(i)=2
2 H2_v1(i)
H2_v2(i)=3
Counting
2 H2_v2(i)
H1_v2(i)=4
1 H1_v2(i)
H1_v3(i)=5
1 H1_v3(i)
H2_v3(i)=6
0 H2_v3(i)
Summing
3 H1_v1(i) + 2 H2_v1(i) + 2 H2_v2(i) + 1 H1_v2(i) + 1 H1_v3(i) + 0 H2_v3(i) =
3 x 1 + 2 x 2 + 2 x 3 +1 x 4 + 1 x 5 + 0 x 6 = 22
Figure 6.9. Histogram intersection kernel between multiple histograms can
be evaluated efficiently computing the minimum between sorted histogram
bins over different parts with the same selector.
by value. Calculating minimum between all the elements of two sorted vectors of size N1 and N2 respectively takes O(N1 + N2 ) time: we can merge the
two sorted vectors and then compute how many elements ci of an histogram
follow the elements of the other one in the merged list (paying attention to
ties between components). Finally, the coefficients ci are multiplied with the
corresponding bin components and the results are summing up.
2
The complexity reduction is from O(NBin NPart
) to O(NBin NPart ), where
NPart is the number of parts with matching selector (3 in the example in
Figure 6.9).
6.4.4
Optimizing Histogram Inner Product
If we use the inner product between vectors instead of the histogram intersection as a kernel, we can merge the histograms of different parts with the
same selector into only one histogram: so each selector has associated one
histogram which summarizes the information of all the parts with that selector. Computing the kernel between two objects reduces to scan the two
196
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
corresponding ordered lists of selectors and make the inner product between
associated histograms.
6.5
Experimental Results
To demonstrate the effectiveness and versatility of our approach, we report
experimental results on four classification tasks, two on proteins and two on
chemical compounds. In the experimentation, classification was performed
using SVMs.
6.5.1
Protein Subcellular Localization
The protein subcellular localization task consists of predicting the cell compartment in which the mature protein will reside. An accurate localization
prediction is considered a useful step towards understanding protein function,
since proteins belonging to the same compartment could cooperate towards
a common function. Figure 6.10 shows a cell with its main compartments.
Figure 6.10. The main compartments of a cell.
We report comparative results on two data sets previously studied in
197
6.5 Experimental Results
the literature. The first data set was prepared by Hua and Sun (2001) and
consists of 2,427 eukaryotic sequences1 . The second data set was prepared by
Nair and Rost (2003) and consists of 1,461 (train) and 512 (test) SwissProt
proteins2 . In both cases proteins are grouped in four classes: cytoplasmic,
extra–cellular, mitochondrial and nuclear. We cast the multiclass problem
into four one–vs–all binary classification problems, considering each time one
class as positive and the others as negative.
We compare WDK against SubLoc (Hua and Sun, 2001) and LOCNet
(Nair and Rost, 2003). SubLoc is a SVM predictor based on an amino acid
composition in which each sequence is represented by a 20 dimensional vector whose components are the frequencies of residues (see Figure 6.11). So
. . . L S I G D V A K K L G E M W N W T. . .
ACDFGIKSTZ
. . . S A K K G AT L D K T G W L Q W I . . .
ACDFGIKSTZ
Figure 6.11. Sequence frequency distribution representation based on amino
acid composition.
arbitrary long sequences are mapped into fixed size vectors, losing structural
and sequential information. The kernel between two sequences x and z is
given by
K(x, z) =
20
X
fi (x)fi (z)
(6.54)
i=1
where fi (x) is the frequency of residue i in sequence x.
1
2
The dataset is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
The dataset is available at http://cubic.bioc.columbia.edu/results/2003/localization/.
198
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
LOCNet is a more sophisticated connectionist approach based on neural networks that employs amino acid sequence, profile, predicted secondary
structure, predicted solvent accessibility and surface composition as additional inputs. In addition, we compared results with our implementation
of the spectrum kernel (Leslie et al., 2002a) which evaluates the similarity
between sequences counting common k–mers fragments (see Section 3.5.1).
Increasing k–mers size permits to deal with more structured information but,
at the same time, the quality of similarity measure decreases. Precisely, the
space size of possible k–mers grows exponentially with k, so, if we suppose
that the k–mers are equally probable, the probability of a match decreases
exponentially. The similarity measure of a sequence against itself is very
high, while the similarity between two sequences is always close to zero: this
leads to a bad generalization.
Performance was measured for each class in terms of precision, recall, geometric average, Matthews correlation coefficient (between targets and predictions) and overall 4–class accuracy. For each class i = 1, . . . , c, we can
estimate T Pi , T Ni , F Pi and F Ni considering the examples belonging to class
i as positives and the others as negatives. Table 6.1 explains how to compute
these quantities considering the class 3 as positive and using a 4–classes confusion matrix. In this way, precision, recall, accuracy for class i are defined
Target
Predictions
1
2
3
4
1
TN TN
FN
TN
2
TN TN
FN
TN
3
FP FP
TP
FP
4
TN TN
FN
TN
Table 6.1. Multiclass classification confusion matrix for 4 classes when
considering class 3 as positive.
199
6.5 Experimental Results
as
T Pi
T P i + F Pi
T Pi
=
T Pi + F Ni
T P i + T Ni
=
m
Prei =
(6.55)
Reci
(6.56)
Acci
(6.57)
and the total accuracy for all c classes is
Acc =
c
X
T Pi
i=1
m
(6.58)
where m is the total number of examples in the dataset. In addition, we
define the geometric average
√
Prei · Reci
gAvi =
(6.59)
100
and the Matthews correlation coefficient between targets and predictions
T Pi T Ni − F Pi F Ni
MCCi = p
(T Pi + F Ni )(T Pi + F Pi )(T Ni + F Pi )(T Ni + F Ni )
(6.60)
In Nair and Rost (2003), Acc is called Q4 , Prei is pL and Reci is oL, while in
Hua and Sun (2001) Reci is accuracy(i) and Acc is Total Accuracy. Note that
Acc is equal to micro averaged precision µPre and equal to micro averaged
recall µRec:
Pc
Pc
Pc
T Pi
T Pi
T Pi
i=1
i=1
Pc
µRec = Pc
= Pc
= i=1
= Q4
m
i=1 F Ni
i=1 T Pi +
i=1 ci
Pc
Pc
Pc
T Pi
T Pi
T Pi
i=1
i=1
Pc
µPre = Pc
= Pc
= i=1
= µRec
m
i=1 T Pi +
i=1 F Pi
i=1 pi
where ci is the number of examples belonging to class i and pi is the number
of examples predicted in class i. Furthermore, for definition, Acc ≤ Acci
P
since ci=1 T Pi ≤ T Pi + T Ni , i = 1, . . . , c.
Kernel parameters have been optimized using cross–validation, obtaining
context radius l = 7 for the WDK, 3–mers for both the spectrum kernel
and WDK selector, and regularization parameter C = 10 for both kernels.
200
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
The WDK is able to exploit larger amino acid subsequences in the form of
context, while using large selectors (or large k–mers for the spectrum kernel)
leads to worse generalization due to sparseness.
In Table 6.2, we report leave one out classification results obtained with
our implementation of SubLoc, the spectrum kernel and the WDK. As specified in Hua and Sun (2001), the SubLoc predictor was trained using an RBF
kernel with γ = 16 and C = 500. In Table 6.3 we compare the performance
of spectrum kernel, WDK and LOCNet (Nair and Rost, 2003) on the test
set. The spectrum kernel consistently outperforms SubLoc (showing that
features other than the overall amino acid composition are useful for this
prediction task) and is also highly competitive against LOCNet (in Table 6.3
the spectrum kernel outperforms LOCNet and this is a new result). In all
cases, WDK results show that further improvement over the spectrum kernel
is possible by exploiting context information around 3–mers. We conjecture
that WDK is capturing some short sorting signals in the protein sequence.
This a remarkable results considering that WDK does not use explicit knowledge of secondary structure or solvent accessibility.
Method
SubLoc
Spectrum3
WDK
Pre
72.6
80.4
82.6
Method
SubLoc
Spectrum3
WDK
Pre
70.8
75.8
89.7
Acc
Cytoplasmic
Rec gAv MCC
76.6 .74
.64
83.3 .81
.74
87.9 .85
.79
Mitochondrial
Rec gAv MCC
57.3 .63
.58
61.4 .68
.63
62.3 .74
.71
SubLoc
79.4
Pre
81.2
90.6
96.9
Pre
85.2
88.3
88.7
Spectrum3
84.9
Extra–Cellular
Rec gAv MCC
79.7 .80
.77
85.5 .88
.86
87.7 .92
.91
Nuclear
Rec gAv MCC
87.4 .86
.74
92.6 .90
.82
95.5 .92
.85
WDK
87.9
Table 6.2. Leave one out performance on the SubLoc data set described in
Hua and Sun (2001). The spectrum kernel is based on 3–mers and C = 10.
For the WDK, contexts width is 15 residues (context radius l = 7), k–mers
size is 3 (selector radius r = 1) and C = 10.
201
6.5 Experimental Results
Method
LOCNet
Spectrum3
WDK
Pre
54.0
69.7
71.4
Method
LOCNet
Spectrum3
WDK
Pre
45.0
65.3
78.9
Acc
Cytoplasmic
Rec gAv MCC
56.0 .54
68.8 .69
.57
72.9 .72
.60
Mitochondrial
Rec gAv MCC
53.0 .49
53.3 .59
.54
50.0 .62
.59
LOCNet
64.2
Pre
76.0
78.7
85.7
Pre
71.0
76.5
77.8
Spectrum3
74.1
Extra–Cellular
Rec gAv MCC
86.0 .81
80.6 .79
.72
87.1 .86
.81
Nuclear
Rec gAv MCC
73.0 .72
80.8 .78
.66
85.3 .81
.70
WDK
78.0
Table 6.3. Test set performance on the SwissProt data set defined by Nair
and Rost (2003). The spectrum kernel is based on 3–mers and C = 5. For
the WDK, contexts width is 15 residues (context radius l = 7), k–mers size
is 3 (selector radius r = 1) and C = 5.
A new work on localization is in progress: we compare our WDK against
more known methods on a wider range of datasets. In addition, we develop a
series of new versions of WDK (for example, we allow the inexact matching
between selectors and/or we restrict WDK to more informative subsequences
of protein as N– or C–terminus of the chain).
6.5.2
Protein Family Classification
Remote protein homology detection is the task to find homologies between
proteins that are in the same superfamily but not necessarily in the same
family. The superfamily classification is useful to annotate new unknown
proteins with structural and functional features from similar known proteins.
We tested WDK on the sample of the Structural Classification of Proteins
(SCOP) dataset used in the experimental setup by Jaakkola et al. (2000)
and Leslie et al. (2002a). We followed Jaakkola et al. (2000) simulating the
remote homology task by holding out all members of a target family from
a given superfamily. The holding out family sequences were positive test
202
CHAPTER 6 WEIGHTED DECOMPOSITION KERNEL
examples, while remaining families in the superfamily were positive training
examples; negative training and test examples were chosen from outside the
target family fold. In this way, 33 binary classification problems were created,
one for each target family; there was about 65,000 sequences on average in
each family. The simulation of remote homology task is illustrated in Figure
6.12, where it is shown the splitting in positive train, positive test, negative
train and negative test sequences. Note that for negative sequences, train
and test are swapped yielding to two different classification problems.
[Figure 6.12 shows the SCOP hierarchy for an example (class: all alpha; fold and superfamily: 4 helical cytokines; families: interferons/interleukin-10, short-chain cytokines, long-chain cytokines; representative structures 1ilk, 1bgc, 1hmc) together with the splitting into positive train, positive test, negative train 0/1 and negative test 0/1 sequences.]
Figure 6.12. The remote homology task consists in finding homologies between proteins that are in the same superfamily but not necessarily in the same family.
Classification performance was evaluated by measuring ROC50, RFP100% and RFP50%. ROC50 is the area under the ROC curve (AUC) (Fawcett, 2003) spanning the first 50 false positives (Gribskov and Robinson, 1996). The ROC curve is a two–dimensional graph in which the rate of true positives (RTP) is plotted on the Y axis and the rate of false positives (RFP) is plotted on the X axis: it highlights the relative trade–off between benefits (true positives) and costs (false positives). RTP (also called hit rate and recall) and RFP (also called false alarm rate) are defined as

$$RTP = \frac{TP}{TP + FN} = \frac{TP}{P} = Rec \qquad (6.61)$$

$$RFP = \frac{FP}{FP + TN} = \frac{FP}{N} \qquad (6.62)$$

where P and N are respectively the number of positive and negative examples in the data set. A reason for computing the ROC50 measure instead of the full ROC is that, if the number of positives is small with respect to the number of negatives, since most positive sequences are good discriminators for their families, the number of true negatives will be large and the area under the ROC curve will usually be very close to 1 for all classifiers. A score of 1 indicates perfect separation of positives from negatives, whereas a score of 0 indicates that none of the top 50 sequences selected by the algorithm were positives. RFP100% is the rate of false positives at recall 1, that is, when all the positive sequences are classified as positive (FN = 0). RFP50% is the rate of false positives at recall 0.5, that is, when half of the positive sequences are classified as positive (TP = P/2, FN = P/2).
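As an illustration, the sketch below computes these ranking–based measures from a list of classifier scores and binary labels. It is a straightforward reading of the definitions above (and assumes at least 50 negative examples for ROC50); it is not the evaluation code actually used in the experiments.

```python
import math

def _ranked_labels(scores, labels):
    """Labels sorted by decreasing classifier score (1 = positive, 0 = negative)."""
    return [y for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0])]

def roc50(scores, labels, max_fp=50):
    """Area under the ROC curve restricted to the first `max_fp` false positives,
    normalised so that 1 means every positive is ranked above the top 50 negatives."""
    pos_total = sum(labels)
    tp, fp, area = 0, 0, 0.0
    for y in _ranked_labels(scores, labels):
        if y == 1:
            tp += 1
        else:
            fp += 1
            area += tp          # positives retrieved before this false positive
            if fp == max_fp:
                break
    return area / (max_fp * pos_total)

def rfp_at_recall(scores, labels, recall):
    """Rate of false positives at the threshold where the given recall is reached:
    recall=1.0 gives RFP100%, recall=0.5 gives RFP50%."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    needed = math.ceil(recall * pos_total)
    tp, fp = 0, 0
    for y in _ranked_labels(scores, labels):
        if y == 1:
            tp += 1
        else:
            fp += 1
        if tp >= needed:
            return fp / neg_total
    return 1.0
```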
The parameters for the WDK were chosen to be: selector radius r = 1, context radius l = 7 residues and regularization parameter C = 1. The spectrum kernel is based on 3–mers and C = 1. Sequences with more than one unknown residue were discarded. Detailed results for all 33 SCOP families, obtained with our implementation of the spectrum kernel and the WDK, are reported in Table 6.4 and in Figures 6.13 and 6.14. We note that the WDK performs favorably against the spectrum kernel on relatively hard families to recognize, i.e. families with low ROC50 or high RFP, but also on the easy ones, while it is comparable on families lying in an intermediate region. The relative error reduction obtained by the WDK when measuring ROC50, RFP100% and RFP50%, averaged over all 33 families, is 3.2%, 3.4% and 0.8% respectively.
6.5.3 HIV Dataset
The HIV dataset contains 42,687 compounds evaluated for evidence of anti–HIV activity by the DTP AIDS Antiviral Screen of the National Cancer Institute (Kramer et al., 2001), 422 of which are confirmed active (CA), 1,081 are moderately active (CM) and 41,184 are confirmed inactive (CI). The screen utilizes a soluble formazan assay to measure protection of human CEM cells from HIV–1 infection (Weislow et al., 1989b). A compound is inactive (CI) if a test showed less than 50% protection of human CEM cells. All other compounds were retested. Compounds showing less than 50% protection (in the second test) are also classified inactive (CI). The other compounds are classified active (CA) if they provided 50% protection in both tests, and moderately active (CM) otherwise. The dataset is available at http://dtp.nci.nih.gov/docs/aids/aids_data.html.
[Figure 6.13: three scatter plots comparing, family by family, the WDK values of RFP100% (a), RFP50% (b) and ROC50 (c) against the corresponding spectrum kernel values.]
Figure 6.13. Remote protein homologies: family by family comparison of the WDK and the spectrum kernel. The coordinates of each point are the RFP at 100% coverage (a), at 50% coverage (b) and the ROC50 scores (c) for one SCOP family, obtained using the WDK and the spectrum kernel. Note that better performance lies below the diagonal in (a) and (b), and above it in (c).
[Figure 6.14: three cumulative plots showing, for the WDK and the spectrum kernel, the number of families (out of 33) within a given RFP100% threshold (a), within a given RFP50% threshold (b), and exceeding a given ROC50 threshold (c).]
Figure 6.14. Remote protein homologies: comparison of the WDK and the spectrum kernel. The graphs plot the total number of families for which a given method is within an RFP at 100% coverage threshold (a), within an RFP at 50% coverage threshold (b), or exceeds an ROC50 score threshold (c).
                       Spectrum                          WDK
 #  Family       RFP100%  RFP50%  ROC50       RFP100%  RFP50%  ROC50
 1  1.1.1.2       0.614    0.289   0.000       0.669    0.174   0.042
 2  1.25.1.1      0.233    0.073   0.122       0.179    0.054   0.144
 3  1.25.1.2      0.104    0.030   0.263       0.076    0.022   0.371
 4  1.25.1.3      0.064    0.032   0.418       0.076    0.036   0.410
 5  1.34.1.4      0.000    0.000   1.000       0.000    0.000   1.000
 6  1.34.1.5      0.000    0.000   0.998       0.000    0.000   1.000
 7  2.1.1.1       0.239    0.005   0.687       0.120    0.002   0.773
 8  2.1.1.2       0.063    0.000   0.929       0.039    0.000   0.954
 9  2.1.1.3       0.101    0.002   0.768       0.079    0.002   0.778
10  2.1.1.4       0.633    0.016   0.432       0.667    0.019   0.426
11  2.1.1.5       0.500    0.128   0.147       0.648    0.122   0.138
12  2.5.1.1       0.713    0.136   0.144       0.714    0.142   0.148
13  2.5.1.3       0.793    0.118   0.183       0.780    0.110   0.191
14  2.8.1.2       0.435    0.061   0.197       0.444    0.049   0.215
15  2.8.1.4       0.371    0.105   0.163       0.387    0.101   0.138
16  2.19.1.1      0.617    0.172   0.012       0.644    0.121   0.052
17  2.31.1.1      0.610    0.000   0.816       0.589    0.000   0.850
18  2.31.1.2      0.067    0.001   0.817       0.026    0.000   0.868
19  2.34.1.1      0.201    0.067   0.315       0.131    0.078   0.311
20  2.41.1.1      0.973    0.098   0.000       0.947    0.084   0.000
21  3.1.1.1       0.040    0.005   0.682       0.031    0.004   0.719
22  3.1.1.3       0.102    0.031   0.401       0.112    0.028   0.395
23  3.1.1.5       0.185    0.013   0.445       0.159    0.016   0.389
24  3.19.1.1      0.407    0.028   0.329       0.302    0.016   0.479
25  3.19.1.3      0.807    0.144   0.059       0.686    0.127   0.055
26  3.19.1.4      0.215    0.093   0.056       0.136    0.079   0.062
27  3.19.1.5      0.400    0.055   0.250       0.295    0.065   0.263
28  3.25.1.1      0.683    0.062   0.208       0.635    0.074   0.199
29  3.25.1.3      0.545    0.030   0.274       0.463    0.021   0.347
30  3.33.1.1      0.522    0.000   0.571       0.497    0.000   0.571
31  3.33.1.5      0.859    0.436   0.090       0.871    0.405   0.082
32  3.50.1.7      0.123    0.029   0.417       0.094    0.028   0.464
33  3.73.1.2      0.127    0.079   0.003       0.129    0.100   0.000
Table 6.4. Rate of false positives at 50% and 100% coverage levels and ROC50 scores for all 33 SCOP families, for the spectrum and WDK kernels. The spectrum kernel is based on 3–mers and C = 1. For the WDK, the context width is 15 residues (context radius l = 7), the k–mer size is 3 (selector radius r = 1) and C = 1.
Detailed statistics are reported in Table 6.5.
m        NA   NB   TA   TB   max NA   min NA   max NB   min NB
42,687   46   48   82   4    438      2        276      1
Table 6.5. HIV dataset statistics: m is the dataset size, NA and NB are
the average number of atoms and bonds in each compound, TA and TB
are the average number of types of atoms and bonds, max / min NA and
max / min NB are the maximum/minimum number of atoms and bonds over
all the compounds. The total number of vertices and edges is 1,951,154 and
2,036,712 respectively.
WDK was tested on three binary classification problems, following the
experimental setup of Deshpande et al. (2003): CA vs. CM (1,503 molecules),
CA vs. CI (41,606 molecules) and CA+CM vs. CI (42,687 molecules). The
percentage of positive elements in each binary problem is 28.1%, 1.0% and
3.5% respectively. Classification performance was measured by the mean and standard deviation of the ROC area in a five–fold cross validation setup in which the original class distribution was preserved in each fold.
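A minimal sketch of this evaluation protocol is shown below, assuming a precomputed kernel matrix and using scikit–learn (an assumption made here for illustration; the thesis experiments were not necessarily run with this library).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def stratified_cv_roc_area(K, y, C=100.0, n_folds=5):
    """Mean and standard deviation of the ROC area over a stratified cross validation.
    K is a precomputed kernel matrix, y a numpy array of labels in {0, 1}."""
    n_pos, n_neg = int(np.sum(y == 1)), int(np.sum(y == 0))
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    areas = []
    for train, test in folds.split(np.zeros((len(y), 1)), y):
        # misclassification cost for positives increased to the ratio n-/n+
        clf = SVC(C=C, kernel="precomputed", class_weight={1: n_neg / n_pos})
        clf.fit(K[np.ix_(train, train)], y[train])
        scores = clf.decision_function(K[np.ix_(test, train)])
        areas.append(roc_auc_score(y[test], scores))
    return float(np.mean(areas)), float(np.std(areas))
```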
We used the atom type as vertex attribute and, as edge attributes, both the bond type and triplets encoding the bond type and the two bonded atom types. The regularization parameter C = 100 was optimized on a four–fold cross validation set. The misclassification cost β for positive examples was increased to match the ratio n−/n+ between negative and positive examples.
In Table 6.6 we report results for the CA vs. CM task at increasing values of the context radius. In addition to the standard WDK with a single context (Equation (6.44)), we tested the WDK using subgraph complements (Equation (6.46)). We note that performance improves with larger context radii and is consistently better when using the subgraph complement. In the subsequent experiments, reported in Table 6.7, we used the graph complement and context radius l = 4. For comparison, Table 6.7 shows the best results reported on this data set by Deshpande et al. (2003) and by Horváth et al. (2004) (also measured by five–fold cross validation, but on a different split). Note, however, that we compare our results on a fair basis, i.e. against the FSG of Deshpande et al. (2003) without additional geometrical features and against the CPK of Horváth et al. (2004) without the composition with the Gaussian kernel.
D      l=1         l=2         l=3         l=4         l=5
1      80.1±0.8    81.8±0.9    81.6±0.8    82.0±1.5    81.6±1.5
2      82.2±1.2    83.5±0.7    83.8±1.7    84.2±1.2    83.8±0.8
Table 6.6. HIV dataset: CA vs. CM task. Effect of varying the context
radius l and the absence D = 1 or presence D = 2 of graph complement.
         CA vs. CM    CA+CM vs. CI    CA vs. CI
FSG      79.2         79.4            90.8
CPK      82.7±1.3     80.1±1.7        92.8±1.0
CPK∗     82.9±1.2     80.1±1.7        93.4±1.1
WDK      84.2±1.2     81.7±1.8        94.0±1.5
Table 6.7. HIV dataset. FSG: best results (optimized support and
β = n− /n+ ) reported by Deshpande et al. (2003) using topological features; CPK: results reported by Horváth et al. (2004) using β = n− /n+ ;
CPK∗ same, using an optimized β ∗ ; WDK: β = n− /n+ .
6.5.4 Predictive Toxicology Challenge
The Predictive Toxicology Challenge (PTC) is a classification problem over the carcinogenicity properties of chemical compounds on mice and rats, organized as part of the PKDD/ECML 2001 conference (Toivonen et al., 2003). To test the WDK classification performance on this task, we used the U.S. National Institute of Environmental Health Sciences dataset (Helma et al., 2001), which lists the bioassays of 417 chemical compounds on four types of rodents: male mice (MM), female mice (FM), male rats (MR) and female rats (FR), giving rise to four distinct and independent classification problems. Each compound is classified as clear evidence (CE), positive (P), some evidence (SE), negative (N), no evidence (NE), equivocal (E), equivocal evidence (EE) or inadequate study (IS). The final goal is to estimate the carcinogenicity (cancer inducing potential) of different compounds on humans. The dataset is available at http://www.predictive–toxicology.org/ptc. Detailed statistics on the dataset are reported in Table 6.8.
m     NA   NB   TA   TB   max NA   min NA   max NB   min NB
417   25   26   40   4    106      2        85       1
Table 6.8. PTC dataset statistics: m is the dataset size, NA and NB are
the average number of atoms and bonds in each compound, TA and TB
are the average number of types of atoms and bonds, max / min NA and
max / min NB are the maximum/minimum number of atoms and bonds
over all the compounds.
We followed the experimental design of Deshpande et al. (2003), whose class grouping is also sketched in code after this list:
• E, EE, IS classes are ignored;
• CE, P, SE are grouped in the positive class;
• N, NE are grouped in the negative class.
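The grouping can be expressed, for instance, as the following small mapping (a hypothetical helper, shown only to make the binary target definition explicit).

```python
POSITIVE = {"CE", "P", "SE"}   # clear evidence, positive, some evidence
NEGATIVE = {"N", "NE"}         # negative, no evidence

def ptc_binary_label(annotation):
    """Map a PTC carcinogenicity annotation to a binary target; E, EE and IS are dropped."""
    if annotation in POSITIVE:
        return 1
    if annotation in NEGATIVE:
        return 0
    return None  # equivocal, equivocal evidence, inadequate study: ignored
```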
In addition to the atom type attribute, we enriched the vertex information with a discrete attribute for the atom charge (taking values in {−1, 0, 1}) and with functional group membership, that is, whether the atom is part of one among 28 different group types such as carbonyl, ester, anhydride, ketone, alcohol, etc. Edge attributes comprise both the bond type and triplets encoding the bond type and the two bonded atom types. The regularization parameter C was optimized on a four–fold cross validation for each of the four classification problems. Both in the optimization and in the training phase, the misclassification cost for positive examples was increased to match the negative to positive example ratio.
Classification performance was evaluated by measuring the mean and standard deviation of the area under the ROC curve on a five–fold cross validation preserving the original class distribution in each fold.
We performed two experiments to identify the effect of different parameters on classification performance. In the first experiment we let the context radius l vary and contrast the single context WDK against the WDK with an additional complementary context. Results reported in Table 6.9 show that the presence of the graph complement (D = 2) increases performance for MR and FR, while a larger context radius is useful for FM and FR.
D=1    l=1        l=2        l=3
MM     70.5±4.3   70.0±5.5   69.9±6.3
FM     67.4±6.9   68.1±9.7   69.1±5.8
MR     63.8±6.4   67.8±7.2   68.4±6.3
FR     61.5±8.1   61.3±7.4   60.4±5.7

D=2    l=1        l=2        l=3
MM     68.1±6.2   68.1±6.2   68.1±5.8
FM     65.4±7.6   65.1±8.8   66.9±8.1
MR     69.7±7.2   69.1±7.3   67.7±6.3
FR     62.2±4.8   62.2±5.6   64.9±5.1
Table 6.9. PTC: effect of varying the context radius l and the absence
D = 1 or presence D = 2 of graph complement. The best performance is
highlighted in boldface.
In the second experiment, we compared four WDKs obtained by combining the tensor product, direct sum, Bhattacharyya and histogram intersection kernels. Results indicate that the direct sum version generally outperforms the tensor product kernel. Our conjecture is that simpler problems benefit from the smaller feature space generated by the direct sum version, while more complex problems are best solved in the larger feature space induced by the tensor product kernel.
We finally compared our results to the FSG results (Deshpande et al., 2003) reported in Table 6.10 and to the EMGK results (Mahé et al., 2004) reported in Figure 6.15 (we did not implement FSG and EMGK, but only report published results).
The best results obtained by the WDK and FSG are comparable, while the performance of EMGK for FR is worse than that of the WDK and FSG. The WDK for molecules defined by Equations (6.44) and (6.46) has the advantage of not requiring a computationally expensive graph pre–processing phase, unlike FSG. In addition, it exhibits a stable behavior with respect to the model parameters, as opposed to EMGK, indicating that the proposed method can compete against approaches based on more sophisticated exact subgraph matching algorithms.
        AUC   NFSG      AUC   NFSG    AUC   NFSG    AUC   NFSG    AUC   NFSG
MM      65.3  24510     65.4  143     66.4  85      66.5  598     66.7  811
FM      66.8  7875      69.5  160     69.6  436     68.0  718     67.5  927
MR      62.6  7504      68.0  171     65.2  455     64.2  730     64.5  948
FR      65.2  25790     66.3  156     66.0  379     64.5  580     64.1  775
Table 6.10. Area under the ROC curve (AUC) varying the number of
frequent subgraphs NFSG used by a feature selection procedure. The best
performance is highlighted in boldface.
[Figure 6.15: plot of the ROC area versus the Morgan process iteration (0–20) for pq = 0.1, 0.4 and 0.7.]
Figure 6.15. ROC area evolution with the introduction of the Morgan index, for different values of the pq parameter (useful for preventing totters), on the PTC dataset for FR.
6.6 Conclusions
We introduced the weighted decomposition kernels, a computationally efficient and general family of kernels on decomposable objects. We report experimental evidence showing that the proposed kernel performs remarkably well with respect to more complex and computationally demanding methods on a number of different bioinformatics problems, ranging from protein sequence to molecule graph classification. Future research directions include the extension of the proposed approach to non–trivial selectors (using, for example, frequent subgraph mining algorithms) and to probability distributions over subgraph occurrences.
Chapter 7
Prediction of Zinc Binding Sites
We describe and empirically evaluate machine learning methods for improving the prediction of zinc binding sites by modelling the linkage between residues close in the protein sequence. We start from the observation that a data set consisting of single residues as examples is affected by autocorrelation, and we propose an ad–hoc remedy in which sequentially close pairs of candidate residues are classified as being jointly involved in the coordination of a zinc ion. We develop a kernel for this particular type of data that can handle variable length gaps between candidate coordinating residues. Our empirical evaluation on a data set of non redundant protein chains shows that explicitly modelling the correlation between residues close in sequence allows us to gain a significant improvement in the prediction performance.
This chapter is based on Menchetti et al. (2006).
7.1 Introduction
Automatic discovery of structural and functional sites from protein sequences
can help towards understanding of protein folding and completing functional
annotations of genomes. Machine learning approaches have been applied to
several prediction tasks of this kind including the prediction of phosphorylation sites (Blom et al., 1999), signal peptides (Nielsen et al., 1999, 1997),
bonding state of cysteines (Martelli et al., 2002; Fiser and Simon, 2000) and disulfide bridges (Fariselli and Casadio, 2001; Vullo and Frasconi, 2004).
Here we are interested in the prediction of metal binding sites from sequence
information alone, a problem that has received relatively little attention so
far. Proteins that must bind metal ions for their function (metalloproteins)
constitute a significant share of the proteome of any organism. A metal ion
(or metal–containing cofactor) may be needed because it is involved in the
catalytic mechanism and/or because it stabilizes/determines the protein tertiary or quaternary structure. The genomic scale study of metalloproteins
could significantly benefit from machine learning methods applied to prediction of metal binding sites. In fact, the problem of whether a protein needs
a metal ion for its function is a major challenge, even from the experimental
point of view. Expression and purification of a protein may not solve this
problem as a metalloprotein can be prepared in the demetallated form and
a non–metalloprotein can be prepared as associated to a spurious metal ion.
In this chapter, we focus on an important class of structural and functional
sites that involves the binding of zinc ions. Zinc is essential for life and
is the second most abundant transition metal ion in living organisms after
iron. In contrast to other transition metal ions, such as copper and iron,
zinc(II) does not undergo redox reactions thanks to its filled d shell. In
Nature, it has essentially two possible roles: catalytic or structural, but can
also participate in signalling events in quite specific cellular processes. A
major role of zinc in humans is in the stabilization of the structure of a huge
number of transcription factors, with a profound impact on the regulation
of gene expression. Zinc ions can be coordinated by a subset of amino acids
(see Table 7.2) and binding sites are locally constrained by the side chain
geometry. For this reason, several sites can be identified with high precision
just mining regular expression patterns along the protein sequence.
The method presented in Andreini et al. (2004) mines patterns from metalloproteins having known structure to search gene banks for new metalloproteins. Regular expression patterns are often very specific but may give a
low coverage (many false negatives). In addition, the amino acid conservation near the site is a potentially useful source of information that is difficult
to take into account by using simple pattern matching approaches. Results
in Passerini and Frasconi (2004) corroborate these observations, showing that an SVM predictor based on multiple alignments significantly outperforms a
predictor based on PROSITE patterns in discriminating between cysteines
bound to prosthetic groups and cysteines involved in disulfide bridges. The
method used in Passerini and Frasconi (2004) is conceptually very similar
to the traditional 1D prediction approach originally developed for secondary
structure prediction (Rost and Sander, 1993), where each example consists of
a window of multiple alignment profiles centered around the target residue.
Although effective, the above approaches are less than perfect and their
predictive performance can be further improved. In this work, we identify a
specific problem in their formulation and propose an ad–hoc solution. Most
supervised learning algorithms (including SVM) build upon the assumption
that examples are sampled independently. Unfortunately, this assumption
can be badly violated when formulating prediction of metal binding sites
as a traditional 1D prediction problem. The autocorrelation of the metal bonding state is strong in this domain because of the linkage between residues that coordinate the same ion. The linkage relation is not observed on future data, but we show in Section 7.2.3 that a strong autocorrelation is also induced by simply modelling the close–in–sequence relation. This is not
surprising since most binding sites contain at least two coordinating residues
with short sequence separation.
Autocorrelation problems have been recently identified in the context of
relational learning (Jensen and Neville, 2002) and collective classification solutions have been proposed based on probabilistic learners (Taskar et al.,
2002; Jensen et al., 2004). Similar solutions do not exist yet for extending
in the same direction other statistical learning algorithms such as SVM. Our
solution is based on reformulating the learning problem by considering examples formed by pairs of sequentially close residues. We test our method on a
representative non redundant set of zinc proteins in order to assess the generalization power of the method on new chains. Our results show a significant
improvement over the traditional 1D prediction approach.
The remainder of the chapter is organized as follows. In Section 7.2 we
report a description of the dataset statistics. In Section 7.3 we describe our
approach to the prediction problem of zinc sites. Finally, in Section 7.4, we
test our method on a representative non redundant set of zinc proteins.
7.2 Dataset Description and Statistics
7.2.1 Data Preparation
We generated a data set of high quality annotated sequences extracted from
the Protein Data Bank (PDB). 305 unique zinc binding proteins were selected
among all the structures deposited in the PDB as of June 2005 and containing
at least one zinc ion in the coordinate file. Metal bindings were detected using
a threshold of 3Å and excluding carbon atoms and atoms in the backbone,
yielding a total of 464 zinc sites. In order to provide negative examples of non
zinc binding proteins, an additional set was generated by running UniqueProt
(Mika and Rost, 2003) with zero HSSP distance on PDB entries that are not
metalloproteins. We obtained in this way a second data set of 2,369 chains.
Zinc binding proteins whose structure was solved in the apo (i.e. without
metal) form, and thus did not contain a zinc ion in the coordinate file, were
removed from the ensemble of non–metalloproteins.
Some proteins in the PDB contain a poly–Histidine tag (typically containing 6 His residues) at either the N– or C–terminus of the chain, as a result of protein engineering aimed at making protein purification easier. So, before generating the training examples, poly–Histidine tags have to be deleted to avoid the creation of artificial negative examples. If we examine all the histidine subsequences of length greater than or equal to 3 in the 305 zinc binding proteins, we see that only 3 sequences have a histidine subsequence that binds zinc in the middle of the sequence: these histidine subsequences, which bind zinc, are not removed. Moreover, we find 12 zinc proteins and 59 negative proteins that have a histidine subsequence that is not exactly at the beginning or at the end of the sequence. As a common feature, only a few residues precede or follow such a histidine subsequence: as a consequence, these histidine subsequences are also removed from the sequence.
The following algorithm was employed to remove these artificial extensions. First, we searched for all histidine subsequences of length ≥ 4. Second, we extended the histidine subsequences towards the N or the C terminus, respectively, if they were found within the first 16 residues or within the last 12 residues of the chain. Finally, we removed from the data set the patterns (see Section 7.2.3.1) having residues falling in these terminal regions.
Algorithm 7.1 Remove–Poly–Histidine–Tags(s)
Input: A protein sequence s
Output: The set P of patterns of s that do not fall in the poly–Histidine tags
1:  H4 ← Find–All–H–Subsequences–of–Length–≥4(s)
2:  for all h ∈ H4 do {extend H subsequences}
3:      if h ∈ the first 16 residues then
4:          extend h to the beginning of the sequence
5:      else if h ∈ the last 12 residues then
6:          extend h to the end of the sequence
7:      end if
8:  end for
9:  P ← Find–All–Patterns(s)
10: for all p ∈ P do
11:     for all h ∈ H4 do
12:         if residues(p) ∈ h then
13:             P ← P \ {p} {remove patterns in poly–Histidine tags}
14:         end if
15:         if p is close to h then
16:             cancel the part of window(p) which overlaps h
17:         end if
18:     end for
19: end for
20: return P
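A minimal Python sketch of this procedure is given below; it assumes that a pattern is represented by the tuple of sequence positions of its candidate residues, and it omits the window–trimming step of line 16. The helper names are illustrative, not taken from the thesis code.

```python
import re

def find_his_runs(sequence, min_len=4):
    """Start/end indices (0-based, end exclusive) of histidine runs of length >= min_len."""
    return [(m.start(), m.end()) for m in re.finditer("H{%d,}" % min_len, sequence)]

def remove_poly_histidine_tags(sequence, patterns, head=16, tail=12):
    """Drop candidate patterns whose residues fall inside N-/C-terminal poly-His tags.
    Each pattern is a tuple of residue positions (e.g. the two candidate residues)."""
    n = len(sequence)
    runs = []
    for start, end in find_his_runs(sequence):
        if start < head:       # run near the N-terminus: extend to the beginning
            start = 0
        elif end > n - tail:   # run near the C-terminus: extend to the end
            end = n
        runs.append((start, end))
    return [p for p in patterns
            if not any(s <= pos < e for pos in p for s, e in runs)]
```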
7.2.2 A Taxonomy of Zinc Sites and Sequences
Zinc binding sites of zinc metalloenzymes are traditionally divided into two
main groups (Vallee and Auld, 1992):
• catalytic (if the ions bind a molecule directly involved in a reaction);
• structural (stabilizing the folding of the protein but not involved in any
reaction).
In addition, zinc may influence quaternary structure; in these cases we have
a third site type (interface site), which also lacks a catalytic role. Site types
can be heuristically correlated to the number of coordinating residues in the
same chain. The distribution of site types obtained in this way is reported
in Table 7.1.
Number of Coordinating Residues    Site Number   Chain Number
1 (Zn1)                                 37            20
2 (Interface – Zn2)                     65            53
3 (Catalytic – Zn3)                    123           106
4 (Structural – Zn4)                   239           175
Total Site Number                      464

Site types  {1,2}  {1,3}  {1,4}  {2,3}  {2,4}  {3,4}  {1,2,3}  {1,2,4}  {1,3,4}  {2,3,4}  {1,2,3,4}
# Chains      14      9      3     21      4      8        7        1        0        2          0
Table 7.1. Top: Distribution of site types (according to the number of
coordinating residues in the same chain) in the 305 zinc–proteins data set.
The third column is the number of chains having at least one site of the
type specified in the row. Bottom: Number of chains containing multiple
site types. The second row gives the number of chains that contain at least
one site for each of the types belonging to the set specified in the first row.
Table 7.2 reports statistics on which residues are actually involved in
binding zinc, both in general and separately for each site type. As expected,
cysteines, histidines, aspartic acid and glutamic acid are the only residues
which bind zinc with a reasonable frequency. It is interesting to note that
such residues show different behaviors with respect to the site type. While
cysteines are mainly involved in structural sites and histidines participate in
both Zn4 and Zn3 sites with similar frequency, aspartic and glutamic acids
are much more common in catalytic sites than in any other site type.
7.2.3 Bonding State Autocorrelation
Jensen and Neville (2002) define relational autocorrelation as a measure of linkage between examples in a data set due to the presence of binary relations that link examples to other objects in the domain (e.g. in a domain where movies are the examples, linkage might be due to the fact that two movies were made by the same studio).
                     Zn4                  Zn3                  Zn2                  Zn1            All
Amino acid   Na    fa    fs       Na    fa    fs       Na    fa    fs       Na    fa    fs       Na
C           663  69.3  91.8       45  12.2   6.2       10   7.7   1.4        4  10.8   0.6      722
H           220  23.0  45.7      194  52.6  40.3       59  45.4  12.3        8  21.6   1.7      481
D            48   5.0  27.6       83  22.5  47.7       30  23.1  17.2       13  35.1   7.5      174
E            18   1.9  17.5       46  12.5  44.7       28  21.5  27.2       11  29.7  10.7      103
N             5   0.5  83.3        0   0.0   0.0        1   0.8  16.7        0   0.0   0.0        6
Q             2   0.2  33.3        1   0.3  16.7        2   1.5  33.3        1   2.7  16.7        6
Total       956   100    —       369   100    —       130   100    —        37   100    —      1492
Table 7.2. Statistics over the 305 zinc proteins (464 binding sites) divided
by amino acid and site type. Na is the amino acid occurrence number in
corresponding site type. fa is the observed percentage of each amino acid
in a given site type. fs is the observed percentage of each site type for a
given amino acid. “All” is the total number of times a given amino acid
binds zinc in general.
Here we expect the bonding state of candidate residues to be affected by autocorrelation because of the presence of
at least two relations causing linkage: coordinates(r,z), linking a residue
r to a zinc ion z, and member(r,c), linking a residue r to a protein chain
c. Unfortunately the first kind of linkage cannot be directly exploited by
a classifier as the relation coordinates is hidden on new data. However,
we may hope to capture some information about this relation by looking at
the sequence separation between two candidate residues. In particular, there
should be some positive correlation between the bonding state of pairs of
residues within the same chain, and it should also depend on the sequence
separation between them.
We investigated such correlations on our dataset, both by probability
measures and correlation coefficient. Figure 7.1 (left) compares the prior
probability of zinc binding for a residue, to the same probability conditioned
on the presence of another zinc binding residue within a certain separation,
for different values of the separation threshold. Figure 7.1 (right) reports
the correlation coefficient between the bonding state of pairs of residues,
again varying the separation threshold between them. Both curves show a very similar behavior, with the highest peak at a distance of less than three residues and a smaller one at a distance of around twenty residues. A non–null base correlation is also visible regardless of the distance, and can be attributed to the fact of being in the same chain.
[Figure 7.1: two plots versus the distance threshold on residue pairs (log scale, 1–1000): left, Pr(X = bonded) and Pr(X = bonded | Y = bonded); right, the correlation coefficient.]
Figure 7.1. Left: probabilities of zinc binding for a given residue, both prior and conditioned on the presence of another zinc binding residue within a certain separation. Right: correlation between the targets of pairs of residues within a given distance.
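The curves in Figure 7.1 can be estimated with a few lines of code. The sketch below assumes each chain is given as a list of (position, bonded) pairs for its candidate residues; it is only meant to make the two quantities precise, not to reproduce the thesis scripts.

```python
from itertools import combinations

def conditional_binding_probability(chains, threshold):
    """Estimate Pr(X = bonded | Y = bonded within `threshold` residues).
    Each chain is a list of (position, bonded) pairs for its candidate residues."""
    hits, total = 0, 0
    for chain in chains:
        for (p1, b1), (p2, b2) in combinations(chain, 2):
            if abs(p1 - p2) <= threshold:
                if b2:                  # condition on the partner being bonded
                    total += 1
                    hits += int(b1)
                if b1:
                    total += 1
                    hits += int(b2)
    return hits / total if total else 0.0

def pair_correlation(chains, threshold):
    """Pearson correlation between the bonding states of residue pairs within the threshold."""
    xs, ys = [], []
    for chain in chains:
        for (p1, b1), (p2, b2) in combinations(chain, 2):
            if abs(p1 - p2) <= threshold:
                # include both orderings so the estimate is symmetric
                xs += [float(b1), float(b2)]
                ys += [float(b2), float(b1)]
    if not xs:
        return 0.0
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0
```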
7.2.3.1 Patterns of Binding Sites
Zinc binding sites tend to have quite regular patterns in terms of the distances in sequence between residues coordinating the same zinc ion. Tables 7.3 and 7.4 report some of the most common binding site patterns together with their occurrences within the dataset. Many of these sites, especially structural ones, contain pairs of coordinating residues with a sequence distance within 7 residues.
Table 7.5 shows the fraction of sites containing the semipattern [CHDE] x(0–7) [CHDE] at least once, as well as the fraction of zinc proteins containing such a semipattern. This observation suggested that we try to directly predict the presence of such semipatterns within a given sequence and use that prediction as an indicator of the presence of a zinc binding site.
Binding Site Patterns                                         N     Type
[CHDE] x(·) [CHDE] x(·) [CHDE] x(·) [CHDE]                   232
[CH] x(·) [CH] x(·) [CH] x(·) [CH]                           196
[CHDE] x(0–7) [CHDE] x(·) [CHDE] x(0–7) [CHDE]               161
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE] x(0–7) [CHDE]             141    SLS
[CHDE] x(·) [CHDE] x(·) [CHDE]                               122
[C] x(·) [C] x(·) [C] x(·) [C]                                85
[CHDE] x(·) [CHDE]                                            62
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE]                            55    SL
[CH] x(·) [CH] x(·) [CH]                                      37
[CHDE] x(> 7) [CHDE] x(0–7) [CHDE]                            24    LS
[CH] x(·) [CH]                                                21
[CHDE] x(0–7) [CHDE] x(> 7) [CHDE] x(> 7) [CHDE]              17    SLL
[CHDE] x(> 7) [CHDE] x(0–7) [CHDE] x(0–7) [CHDE]              16    LSS
[DE] x(·) [DE]                                                15
[DE] x(·) [DE] x(·) [DE]                                      10
[CHDE] x(> 7) [CHDE] x(> 7) [CHDE] x(0–7) [CHDE]              10    LLS
[CHDE] x(0–7) [CHDE] x(0–7) [CHDE] x(> 7) [CHDE]               8    SSL
[DE] x(·) [DE] x(·) [DE] x(·) [DE]                             1
Table 7.3. Binding site patterns ordered by frequency of occurrence in the
464 zinc sites. Square brackets denote alternatives, x(·) denotes a sequence
of residues of an arbitrary length, x(n − m) denotes a sequence between n
and m residues, x(> n) denotes a sequence of more than n residues. N
is the number of occurrences within the dataset. Type column highlights
some common binding site patterns: S refers to x(0–7), L refers to x(> 7).
7.3 Methods
In this section, we describe the methods applied to the problem of predicting zinc binding sites. We first introduce a standard local predictor based on a window of residues centered around the site of interest. Then we develop a semipattern predictor for pairs of residues in nearby positions within the sequence, in order to model the correlations between zinc binding residues. Finally, we describe a gating network for combining the predictions of the local predictor with those of the semipattern predictor.
Binding Sites                 N     Binding Sites                 N
C x(2) C x(17) C x(2) C       7     E x(3) H                      2
C x(2) C x(16) C x(2) C       7     D x(3) D                      2
C x(2) C x(12) H x(3) H       5     C x(8) C x(2) C x(2) C        2
C x(2) C x(4) H x(4) C        4     C x(4) C x(12) H x(3) H       2
C x(2) C x(18) C x(2) C       4     C x(3) C x(26) C x(2) C       2
C x(2) C x(17) H x(2) C       4     C x(2) C x(6) C x(6) C        2
C x(2) C x(12) H x(4) H       4     C x(2) C x(34) H x(2) C       2
H x(37) H                     3     C x(2) C x(24) C x(2) C       2
H x(3) H x(5) H               3     C x(2) C x(21) C x(2) C       2
H x(3) H                      3     C x(2) C x(20) C x(2) C       2
D x(1) H                      3     C x(2) C x(19) C x(2) C       2
C x(2) C x(8) C x(2) C        3     C x(2) C x(18) H x(2) C       2
C x(2) C x(22) C x(2) C       3     C x(2) C x(15) C x(2) C       2
C x(1) H x(17) C x(2) C       3     C x(2) C x(14) C x(2) C       2
H x(3) H x(19) E              2     C x(2) C x(13) C x(2) H       2
H x(3) C x(4) C x(4) C        2     C x(2) C x(12) H x(4) C       2
H x(28) H                     2     C x(1) H x(16) C x(2) C       2
H x(26) C x(0) C              2     C x(1) D x(53) H x(2) C       2
Table 7.4. Most common zinc binding sites amongst all 464 sites. x(n)
denotes a sequence of n residues. N is the number of occurrences within
the dataset.
             Chain Coverage       Site Coverage
Site Type    N      f             N      f
All          261    85.5          338    72.8
Zn4          168    96.0          227    94.9
Zn3           85    80.1           86    69.9
Zn2           35    66.0           25    38.4
Zn1           13    65.0            0     0.0
Table 7.5. Chain and site coverage for the [CHDE] x(0–7) [CHDE] semipattern. N is the absolute number of chains and sites, while f is the percentage over the total number of chains and sites of that type.
7.3.1 Standard Window Based Local Predictor
Many applications of machine learning to 1D prediction tasks use a simple
vector representation obtained by forming a window of flanking residues centered around the site of interest. Following the seminal work of Rost and
Sander (1993), evolutionary information is incorporated in these representations by computing multiple alignment profiles. In this approach, each
example is represented as a vector of size d = (2k + 1)p, where k is the size
of the window and p the size of the position specific descriptor.
We enriched multiple alignment profiles by two indicators of profile quality, namely the entropy and the relative weight of gapless real matches to
pseudocounts. An additional flag was included to mark positions ranging
out of the sequence limits, resulting in an all–zero profile. We obtained in
this way a position specific descriptor of size p = 23. We use as a baseline
this type of representation in conjunction with an SVM classifier (see Section
1.3) trained to predict the zinc bonding state of individual residues (cysteine,
histidine, aspartic acid and glutamic acid). We employed the inner product
between the example vectors as a baseline linear kernel, then we combined
this kernel with more complex kernels as described in the experimental section.
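A sketch of how such a window representation can be assembled is shown below. The exact layout of the 23–dimensional descriptor (in particular, placing the out–of–sequence flag as the last component) is an assumption made only for illustration.

```python
def window_vector(profile, center, k):
    """Concatenate the position-specific descriptors of the 2k+1 positions around `center`.
    `profile` is a list of length-p descriptors (here p would be 23: profile columns,
    entropy, relative weight and an out-of-sequence flag)."""
    p = len(profile[0])
    vector = []
    for i in range(center - k, center + k + 1):
        if 0 <= i < len(profile):
            vector.extend(profile[i])
        else:
            # positions ranging out of the sequence: all-zero descriptor with the flag set
            vector.extend([0.0] * (p - 1) + [1.0])
    return vector
```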
7.3.2 Semipattern Based Predictor
A standard window based local predictor such as the one described in the previous section, does not explicitly model the correlations analyzed in Section
7.2.3, missing a strong potential source of information. We thus developed
an ad–hoc semipattern predictor for pairs of residues in nearby positions
within the sequence. A candidate semipattern is a pair of residues (cysteine,
histidine, aspartic acid or glutamic acid) separated by a gap of δ residues,
with δ ranging from zero to seven. The task is to predict whether the semipattern is part of a zinc binding site. Each example is represented by a
window of local descriptors (based on multiple alignment profiles) centered
around the semipattern, including the gap between the candidate residues.
A semipattern containing a gap of length δ is thus encoded into a vector of size d = (2k + 2 + δ)p, where k is the window size and p is the size of each window element, as described in Section 7.3.1.
A single predictor must be able to compare pairs of semipatterns having gaps of different lengths in order to address the task. We thus developed an ad–hoc semipattern kernel in the following way. Given two vectors x and z of size $d_x$ and $d_z$, representing two semipatterns with gap lengths $\delta_x$ and $\delta_z$ respectively, we define the semipattern kernel as

$$K_{semipattern}(x, z) = \langle x[1:w],\, z[1:w]\rangle + \langle x[d_x - w : d_x],\, z[d_z - w : d_z]\rangle + K_{gap}\big(x[w+1 : w+\delta_x p],\, z[w+1 : w+\delta_z p]\big) \qquad (7.1)$$

where $v[i:j]$ is the subvector of $v$ that extends from position $i$ to position $j$ and $w = (k+1)p$. The first two contributions compute the inner products between the left and right windows around the semipatterns, including the two candidate residues: the sizes of the left and right windows do not vary, regardless of the different gap lengths. The last term $K_{gap}$ is the kernel between the gaps separating the candidate residues, which may have different lengths: it distinguishes the case of equal gap lengths from that of different gap lengths. $K_{gap}$ is computed as

$$K_{gap}(u, v) = \begin{cases} K_{\mu gap}(u, v) + \langle u, v\rangle & \text{if } |u| = |v| \\ K_{\mu gap}(u, v) & \text{otherwise} \end{cases} \qquad (7.2)$$

where

$$K_{\mu gap}(u, v) = \left\langle \sum_{i=1}^{|u|} u[(i-1)p+1 : ip],\; \sum_{i=1}^{|v|} v[(i-1)p+1 : ip] \right\rangle \qquad (7.3)$$

$K_{\mu gap}$ computes the inner product between the mean local descriptors within each gap and, if the two gaps have the same length, the $K_{gap}$ kernel adds the full inner product $\langle u, v\rangle$ between the descriptors to $K_{\mu gap}$. In other words, the aggregated local descriptors within the gap are always compared, and the full inner product is added only when the two gaps have the same length.
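The following sketch mirrors Equations (7.1)–(7.3). It assumes each semipattern is given as fixed–length left and right windows plus a variable–length gap, all as lists of p–dimensional local descriptors (a representation chosen here for illustration); following Equation (7.3), the gap descriptors are aggregated by componentwise summation.

```python
def _flatten(descriptors):
    return [value for descriptor in descriptors for value in descriptor]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _summed_gap(descriptors):
    """Componentwise aggregation of the local descriptors in a gap (Equation (7.3))."""
    if not descriptors:
        return []
    p = len(descriptors[0])
    return [sum(d[i] for d in descriptors) for i in range(p)]

def gap_kernel(u, v):
    """Equation (7.2): always compare the aggregated gap descriptors, and add the
    full inner product only when the two gaps have the same length."""
    k_mu = _dot(_summed_gap(u), _summed_gap(v))
    if len(u) == len(v):
        return k_mu + _dot(_flatten(u), _flatten(v))
    return k_mu

def semipattern_kernel(x, z):
    """Equation (7.1). Each semipattern is a dict with keys 'left', 'gap', 'right':
    lists of p-dimensional local descriptors, where 'left' and 'right' have fixed
    length k+1 (window plus candidate residue) and 'gap' has variable length."""
    return (_dot(_flatten(x["left"]), _flatten(z["left"]))
            + _dot(_flatten(x["right"]), _flatten(z["right"]))
            + gap_kernel(x["gap"], z["gap"]))
```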
7.3.3 Gating Network
The coverage of the [CHDE] x(0–7) [CHDE] semipattern (see Table 7.5) makes it a good indicator of zinc binding, but a number of binding sites remain uncovered. Moreover, the semipattern can match a subsequence which, while not being part of a binding site as a whole, still binds zinc with just one of the two candidate residues. As only full binding semipatterns are considered positives, the predictor is trained to treat such a case as a negative instance. This implies that one of the two residues would by construction receive an incorrect label. However, when a residue lacks evidence of being involved in a positive semipattern, we can still rely on the local predictor to decide whether it really does not bind zinc. For a given residue, we actually have a single output from the local predictor and a number of (possibly zero) predictions from the semipattern based predictor, one for each subsequence matching the semipattern and containing the residue as one of the two binding candidates. Outputs from different SVM predictors usually have different distributions and a straightforward comparison can be misleading. An effective solution is that of turning SVM outputs into conditional probabilities of the positive class given the output, by fitting a sigmoid (Platt, 1999b):

$$P(Y = 1|x) = \frac{1}{1 + \exp(-Af(x) - B)} \qquad (7.4)$$

where $f(x)$ is the SVM output for example $x$ and the sigmoid slope ($A$) and offset ($B$) are parameters to be learned from data. The probability $P(Y_b = 1|x)$ that a single residue binds zinc can now be computed by the following gating network:

$$P(Y_b = 1|x) = P(Y_s = 1|x) + \big(1 - P(Y_s = 1|x)\big)\,P(Y_l = 1|x) \qquad (7.5)$$

where $P(Y_l = 1|x)$ is the probability of zinc binding from the local predictor, while $P(Y_s = 1|x)$ is the probability of $x$ being involved in a positive semipattern, computed as the maximum over the probabilities of each semipattern $x$ is actually involved in. Thus the probability $P(Y_b = 1|x)$ that a single residue binds zinc equals the probability $P(Y_s = 1|x)$ of $x$ being involved in a positive semipattern, plus the probability $1 - P(Y_s = 1|x)$ of $x$ not being involved in a positive semipattern multiplied by the probability $P(Y_l = 1|x)$ of zinc binding from the local predictor.
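A direct transcription of Equations (7.4) and (7.5) is sketched below; fitting the sigmoid parameters A and B is not shown, and a residue with no matching semipattern is assumed to receive a semipattern probability of zero.

```python
import math

def platt_probability(svm_output, a, b):
    """Platt sigmoid (Equation (7.4)) mapping an SVM margin to P(Y = 1 | x).
    The slope a and offset b are fitted on held-out data."""
    return 1.0 / (1.0 + math.exp(-a * svm_output - b))

def gating_probability(p_local, semipattern_probabilities):
    """Gating network of Equation (7.5). `p_local` is the local predictor probability;
    `semipattern_probabilities` are the probabilities of all semipatterns containing
    the residue (possibly an empty list)."""
    p_semi = max(semipattern_probabilities, default=0.0)
    return p_semi + (1.0 - p_semi) * p_local
```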
7.4 Experimental Results
We ran a series of experiments aimed at comparing the predictive power of the sole local predictor to that of the full gating network. While aspartic and glutamic acids are much less common than cysteines and histidines as zinc binding residues (see Table 7.2), they are far more abundant in protein chains. This implies a huge disproportion between positive and negative examples, driving the unbalancing in the dataset to 1 : 59 for the local predictor and 1 : 145 for the semipattern one. We thus initially focused on cysteines and histidines, bringing the unbalancing down to 1 : 16 and 1 : 11 at the residue and semipattern level respectively (see Table 7.6).
Residues   Local Predictor   Semipattern Predictor
CHDE       58.7              144.7
CH         15.9              11.3
Table 7.6. Ratio between negative and positive training examples for
residues and semipatterns. A semipattern is positive if both candidate
residues bound a zinc ion, even if they were not actually binding the same
ion.
Moreover, we labelled a [CH] x(0–7) [CH] semipattern as positive if both candidate residues bound
a zinc ion, even if they were not actually binding the same ion. Preliminary experiments showed this to be a better choice than considering such a case as a negative example, allowing us to recover a few positive examples, especially for semipattern matches with longer gaps.
Multiple alignment profiles were computed using PSI–BLAST (Altschul et al., 1997) on the NCBI non–redundant protein database. In order to reduce noise in the training data, we discarded examples whose profile had a relative weight less than 0.015, indicating that too few sequences had aligned at that position. This also allowed us to discard the poly–Histidine tags which are attached at either the N– or C–terminus of some chains in the PDB, as a result of protein engineering aimed at making protein purification easier. We employed a Gaussian kernel on top of both the linear kernel of the local predictor and the semipattern kernel. Model selection was conducted with a stratified 4–fold cross validation procedure and was used to tune the Gaussian width, the C regularization parameter, the window size and the parameters of the sigmoids of the gating network. Due to the strong unbalancing of the dataset, accuracy is not a reliable measure of performance. We therefore used the area under the recall–precision curve (AURPC) for both model selection and final evaluation, as it is especially suitable for extremely unbalanced datasets. We also computed the area under the ROC curve (AUC) to further assess the significance of the results.
The best models for the local predictor and the gating network were tested with an additional stratified 5–fold cross validation procedure and obtained an AURPC equal to 0.554 and 0.611 respectively. Figure 7.2 reports the full recall–precision curves, showing that the gating network consistently outperforms the local predictor. While cysteines are predicted far better than histidines, both predictions are improved by the use of the gating network. AUC values were 0.889 ± 0.006 and 0.911 ± 0.006 for the local predictor and the gating network respectively (the confidence intervals, available only for the AUC, were obtained by computing the standard error of the Wilcoxon–Mann–Whitney statistic), confirming that the gating network attains a significant improvement over the local predictor. Table 7.7 summarizes the results.
                   AURPC   AUC
Local Predictor    0.554   0.889
Gating Network     0.611   0.911
Table 7.7. AURPC and AUC for the local predictor and the gating network
focused on cysteines and histidines.
Protein level predictions were obtained by taking the maximum prediction over the residues contained in each chain. Figure 7.3
(top) reports the recall–precision curve obtained at a protein level for the
best gated predictor, while Figure 7.3 (bottom) shows the results separately
for proteins containing different binding site types. As expected, Zn4 sites
were the easiest to predict, being the ones showing the strongest regularities
and most commonly containing the [CH] x(0–7) [CH] semipattern.
[Figure 7.2: three recall–precision plots comparing the local predictor and the gated predictor at the residue level.]
Figure 7.2. Residue level recall–precision curves for the best [CH] local and gated predictors. Top: cysteines and histidines together. Middle: cysteines only. Bottom: histidines only.
[Figure 7.3: protein level recall–precision curves; top panel for all proteins, bottom panel separately for Zn4, Zn3, Zn2 and Zn1 sites.]
Figure 7.3. Protein level recall–precision curves for the best [CH] gated predictor. Top: all proteins together. Bottom: proteins divided by zinc site type.
Finally, we investigated the viability of training a predictor for all four amino acids involved in zinc binding, trying to overcome the disproportion issue. On the rationale that binding residues should be evolutionarily well conserved because of their role in the protein function, we put a threshold on the residue conservation in the multiple alignment profile in order to consider it a candidate target. By requiring that

$$\Pr\{D\} + \Pr\{E\} \geq 0.8, \qquad (7.6)$$

we reduced the unbalancing in the dataset for the local predictor to 1 : 24. At the level of semipatterns, we realized that such a threshold produced a reasonable unbalancing only for gaps between one and three, and thus decided to ignore semipatterns containing aspartic or glutamic acid with gaps of different lengths, yielding a 1 : 18 unbalancing.
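As an illustration, a candidate filter implementing this conservation threshold could be written as follows, assuming the multiple alignment profile column is available as a mapping from amino acid letters to frequencies (a hypothetical representation chosen for the sketch).

```python
def is_candidate(residue, profile_column, threshold=0.8):
    """Keep D/E residues as candidate targets only if well conserved in the profile.
    `profile_column` maps amino acid letters to their frequency at this position."""
    if residue in ("C", "H"):
        return True
    if residue in ("D", "E"):
        return profile_column.get("D", 0.0) + profile_column.get("E", 0.0) >= threshold
    return False
```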
Table 7.8 shows the ratio between negative and positive training examples for the local predictor, where the CHDE0.8 column reports the ratio obtained when requiring a 0.8 threshold on the conservation profile for D and E. Table 7.9 shows the unbalancing for the semipattern predictors, varying the gap length from 0 to 7. The column “Th” reports the value of the threshold on residue conservation (the dash “—” indicates that D and E amino acids are not used), while the column CHDETh shows the negative to positive ratio for different gap lengths when a threshold on D and E is applied. Finally, Table 7.10 reports the site and chain coverage for the [CH] x(0–7) [CH] and [CHDE] x(0–7) [CHDE] semipatterns when a 0.8 threshold on the conservation profile is used for D and E residues within x(1–3) gaps. As already explained, a semipattern is positive if both candidate residues bound a zinc ion, even if they were not actually binding the same ion; both predictors use a threshold of 0.015 on the relative profile weight to filter out semipatterns from poorly aligned sequences.
While global performances were almost unchanged, aspartic acid and glutamic acid alone obtained AURPC values of 0.203 and 0.130 respectively. Due to the still high unbalancing, the baseline values for a random predictor are as low as 0.007 for aspartic acid and 0.014 for glutamic acid. AUC
values of 0.780 ± 0.03 and 0.700 ± 0.04, respectively (with respect to the 0.5
baseline) confirm that results are significantly better than random. Table
7.11 summarizes the results.
However, results on these two residues are still preliminary and further
work has to be done in order to provide a prediction quality comparable to
that obtained for cysteines and histidines. It is interesting to note that at
the level of protein classification, the only difference which can be noted by
using [CHDE] instead of [CH] is a slight improvement in the performances for
the Zn3 binding sites, as shown in Figure 7.4. This is perhaps not surprising
given that half of [DE] residues binding zinc are contained in Zn3 sites, as
reported in Table 7.2.
Local Predictor            CH     CHDE   CHDE0.8
Negative–Positive Ratio    15.9   58.7   23.8
Table 7.8. Ratios between negative and positive examples for the local
predictor. A 0.8 threshold on conservation was used for D and E residues.
Semipattern Predictor    CH     CHDE     CHDETh   Th
x0                       88.6   634.4     88.6    —
x1                       11.6   119.1     26.6    0.8
x2                        3.2    47.2      9.0    0.8
x3                       14.8   147.1     33.8    0.8
x4                       10.3   147.6     10.3    —
x5                       18.6   243.1     18.6    —
x6                       38.1   435.0     38.1    —
x7                       13.8   217.8     13.8    —
x(0–7)                   11.3   144.7     17.9    —
Table 7.9. Ratios between negative and positive examples for the semipattern predictor. A 0.8 threshold on conservation was used for D and E
residues within x(1–3) gaps, while D and E residues are not used for x(0)
and x(4–7) gaps (see — in the table).
                  [CH] x(0–7) [CH]                       [CHDE] x(0–7) [CHDE]
            Chain Coverage    Site Coverage        Chain Coverage    Site Coverage
Site Type   N      f          N      f             N      f          N      f
All         212    69.5       263    56.6          237    77.7       292    62.9
Zn4         151    86.2       203    84.9          160    91.4       211    88.2
Zn3          56    52.8        51    41.4           75    70.7        71    57.7
Zn2          22    41.5         9    13.8           27    50.9        10    15.3
Zn1          10    50.0         0     0.0           12    60.0         0     0.0
Table 7.10. Site and chain coverage for the [CH] x(0–7) [CH] (left) and
[CHDE] x(0–7) [CHDE] (right) semipatterns. N is the total number of
covered chains or sites, while f is the fraction of chains or sites covered. A
0.8 threshold on conservation profile was used for D and E residues within
x(1–3) gaps, while D and E residues are not used for x(0) and x(4–7) gaps.
Residues   AURPC (Baseline)   AUC (Baseline)
D          0.203 (0.007)      0.780 (0.500)
E          0.130 (0.014)      0.700 (0.500)
Table 7.11. AURPC and AUC of the gated predictor for aspartic acid (D)
and glutamic acid (E) with their baselines.
[Figure 7.4: protein level recall–precision curves for Zn3 binding sites, comparing the [CH] and [CHDE] gated predictors.]
Figure 7.4. Comparison, at the protein level, of the recall–precision curves of the best [CH] and [CHDE] gated predictors for Zn3 binding sites.
7.5 Discussion and Conclusions
We have highlighted the autocorrelation problem in the prediction of metal binding sites from sequence information and presented an improved approach based on a simple linkage modelling strategy. Our results, focused on the prediction of zinc binding proteins, appear to be very promising, especially if
we consider that they have been obtained on a non redundant set of chains.
Sites mainly coordinated by cysteines and histidines are easier to predict
thanks to the availability of a larger number of examples. Linkage modelling
allows us to gain a significant improvement in the prediction of the bonding
state of these residues. Sites coordinated by aspartic acid and glutamic acid
are more difficult to predict because of data sparsity but our results are
significantly better than chance.
The method has also been evaluated on the task of predicting whether a given protein is a zinc protein. Good results were obtained in the case
of chains where zinc plays a structural role (Zn4). In the case of chains
with catalytic sites (Zn3), the inclusion of D and E targets does allow us
to obtain slightly improved predictions. In future work, we plan to test the
effectiveness of this method at the level of entire genomes.
Conclusions
The main subject developed in this thesis is the design of new kernels for structured data, but preference and ranking tasks were also investigated in their theoretical and practical aspects. The original contributions are enumerated below, distinguishing the theoretical from the experimental ones.
Theoretical Contributions
From a theoretical viewpoint, we devised several new and interesting results.
First of all, we investigated the theoretical aspects of the VP algorithm, providing a new picture of regularization theory for this algorithm. In particular, we derived the dual formulation of VP and a novel on–line update rule for the VP dual variables, explaining how fast the value of the dual variables grows as the number of epochs increases and giving an upper bound for their value.
The problems of preference and ranking were also examined in depth from a theoretical standpoint (Menchetti, 2006). In particular, we proved that modelling the problem by a direct approach, which works on the whole set of competing alternatives, exhibits better performance than the other approaches proposed in the literature, which exploit only partial information. The new model is based on a partial order relation that captures the constraints within the set of competitors. Finally, we described a novel analysis of how the ranking and preference generalization error depends on the size of the set of alternatives.
With respect to kernels for structured data, we introduced the weighted decomposition kernels (Menchetti et al., 2005b,a), a computationally efficient and general family of kernels on decomposable objects represented as sequences, trees or graphs, which is competitive with respect to state–of–the–art methods.
Experimental Contributions
Many experimental results were reported, ranging over domains from natural language processing to computational molecular biology.
The prediction of first pass attachment under the strong incrementality hypothesis and the reranking of parse trees generated by a statistical parser are two large scale preference learning problems involving learning a preference function that selects the best alternative in a set of competitors. The experimental analysis presented in Menchetti et al. (2003) and in Menchetti et al. (2005c) showed that the generalization performance for the two preference tasks is determined by several factors, including the similarity measure induced by the kernel or by the adaptive internal representation of the RNN and, importantly, the loss function associated with the preference model. The experiments indicated that the choice between a pairwise and a global loss function plays an important role, highlighting that the development of global loss functions for preference tasks may lead to more effective solutions. Interestingly, previous work with kernels focuses exclusively on pairwise loss functions.
Regarding molecular biology, several topical and challenging problems in bioinformatics, such as subcellular localization, remote homology detection and the prediction of toxicity and biological activity of chemical compounds, were tackled (Menchetti et al., 2005b,a). We reported experimental evidence that the weighted decomposition kernels are highly competitive with respect to more complex and computationally demanding state–of–the–art methods on the above problems, ranging from protein sequence to molecule graph classification.
Finally, we faced the problem of predicting zinc binding sites and proteins, a relatively new task which is little known in the bioinformatics literature; Menchetti et al. (2006) represents one of the first papers in this direction. We proposed an ad–hoc remedy in which sequentially close pairs of candidate residues are classified as being jointly involved in the coordination of a zinc ion, and we developed a kernel for this particular type of data that can handle variable length gaps between candidate coordinating residues. Our results, focused on the prediction of zinc binding proteins, appeared to be very promising, especially if we consider that they have been obtained on a non redundant set of chains. Our empirical evaluation showed that explicitly modelling the linkage between residues close in the protein sequence allowed us to gain a significant improvement in the prediction of the bonding state of these residues. In future work, we plan to test the effectiveness of this method at the level of entire genomes.
Appendix A
Ranking and Preference Error Probability
A.1  A Detailed Solution of the Integral

We have to solve the following integral
\[
\Pr\{u_2 \le u_1\}
  = \int_{U_1} P_{U_1}(u_1 - \Delta)\, P'_{U_1}(u_1)\, du_1
  = \int_{-\infty}^{+\infty} \frac{1}{1 + e^{-(u_1 - \Delta)}}\,
      \frac{e^{-u_1}}{(1 + e^{-u_1})^2}\, du_1
  = \int_{-\infty}^{+\infty} \frac{e^{-u_1}}{(1 + e^{-u_1} e^{\Delta})(1 + e^{-u_1})^2}\, du_1
\tag{A.1}
\]
We solve the integral by substitution:
\[
x = e^{-u_1} \;\Rightarrow\; u_1 = -\ln x, \qquad
\frac{du_1}{dx} = -\frac{1}{x} \;\Rightarrow\; du_1 = -\frac{dx}{x}
\]
where the correspondence between the values of $u_1$ and $x$ is
\[
u_1: -\infty \to +\infty \qquad\Longleftrightarrow\qquad x: +\infty \to 0^+
\]
After the substitution
\[
\Pr\{u_2 \le u_1\}
  = \int_{+\infty}^{0^+} \frac{x}{(e^{\Delta} x + 1)(x + 1)^2}\left(-\frac{dx}{x}\right)
  = \int_{0^+}^{+\infty} \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
\tag{A.2}
\]
\[
  = \int_{0^+}^{+\infty} \left[ \frac{A}{e^{\Delta} x + 1} + \frac{B}{x + 1} + \frac{C}{(x + 1)^2} \right] dx
\]
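Before carrying out the partial fraction integration, the substitution itself can be checked numerically. The following short sketch (an illustrative aid added here, not part of the original derivation; it assumes NumPy and SciPy are available) integrates the original integrand in $u_1$ of (A.1) and the substituted integrand in $x$ of (A.2) for an arbitrary nonzero $\Delta$ and compares the two values.

# Numerical sanity check of the substitution x = exp(-u1):
# both integrals should return the same value of Pr{u2 <= u1}.
import numpy as np
from scipy.integrate import quad
from scipy.special import expit  # numerically stable logistic sigmoid

Delta = 0.7  # arbitrary nonzero shift chosen only for this check

def integrand_u(u):
    # Integrand of (A.1): shifted logistic CDF times the logistic density,
    # written with expit to avoid overflow for large |u|.
    return expit(u - Delta) * expit(u) * expit(-u)

def integrand_x(x):
    # Integrand of (A.2) after the substitution x = exp(-u1).
    return 1.0 / ((np.exp(Delta) * x + 1.0) * (x + 1.0) ** 2)

value_u, _ = quad(integrand_u, -np.inf, np.inf)
value_x, _ = quad(integrand_x, 0.0, np.inf)
print(value_u, value_x)  # the two values should agree to numerical precision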
Now we have to solve three integrals
\[
\int_{0^+}^{+\infty} \frac{A}{e^{\Delta} x + 1}\, dx
  = \frac{A}{e^{\Delta}} \ln|e^{\Delta} x + 1| \Big|_{0^+}^{+\infty}
  = \lim_{x \to +\infty} \frac{A}{e^{\Delta}} \ln|e^{\Delta} x + 1|
\tag{A.3}
\]
\[
\int_{0^+}^{+\infty} \frac{B}{x + 1}\, dx
  = B \ln|x + 1| \Big|_{0^+}^{+\infty}
  = \lim_{x \to +\infty} B \ln|x + 1|
\tag{A.4}
\]
\[
\int_{0^+}^{+\infty} \frac{C}{(x + 1)^2}\, dx
  = -\frac{C}{x + 1} \Big|_{0^+}^{+\infty}
  = C
\tag{A.5}
\]
The result of the whole integral is
\[
\Pr\{u_2 \le u_1\}
  = \int_{0^+}^{+\infty} \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
  = \frac{A}{e^{\Delta}} \lim_{x \to +\infty} \ln|e^{\Delta} x + 1|
    + B \lim_{x \to +\infty} \ln|x + 1| + C
\tag{A.6}
\]
In general, we have that
\[
\int \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
  = \frac{A}{e^{\Delta}} \ln|e^{\Delta} x + 1| + B \ln|x + 1| - \frac{C}{x + 1} + \mathrm{const}
\]
To compute $A$, $B$ and $C$ we have to solve a linear system:
\[
\frac{1}{(e^{\Delta} x + 1)(x + 1)^2}
  = \frac{A}{e^{\Delta} x + 1} + \frac{B}{x + 1} + \frac{C}{(x + 1)^2}
  = \frac{(A + B e^{\Delta})\, x^2 + (2A + B + B e^{\Delta} + C e^{\Delta})\, x + A + B + C}{(e^{\Delta} x + 1)(x + 1)^2}
\tag{A.7}
\]
The linear system is
\[
\begin{cases}
A + B e^{\Delta} = 0 \\
2A + B + B e^{\Delta} + C e^{\Delta} = 0 \\
A + B + C = 1
\end{cases}
\]
and the solution is
\[
A = \frac{e^{2\Delta}}{(e^{\Delta} - 1)^2}, \qquad
B = \frac{-e^{\Delta}}{(e^{\Delta} - 1)^2}, \qquad
C = \frac{1}{1 - e^{\Delta}}, \qquad \text{with } \Delta \neq 0
\tag{A.8}
\]
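The coefficients in (A.8) can also be verified symbolically. The following sketch (an illustrative aid, not part of the original text; it assumes SymPy is available) solves the linear system (A.7) and checks both the closed forms and the resulting partial fraction decomposition.

# Symbolic check of the partial fraction coefficients (A.8) with SymPy.
import sympy as sp

x = sp.symbols('x', positive=True)
D = sp.symbols('Delta', positive=True)        # Delta != 0 is assumed
A, B, C = sp.symbols('A B C')
eD = sp.exp(D)

# Linear system (A.7): match the coefficients of x^2, x^1 and x^0.
sol = sp.solve([A + B * eD,
                2 * A + B + B * eD + C * eD,
                A + B + C - 1], [A, B, C])
print(sp.simplify(sol[A] - eD**2 / (eD - 1)**2))   # expect 0
print(sp.simplify(sol[B] + eD / (eD - 1)**2))      # expect 0
print(sp.simplify(sol[C] - 1 / (1 - eD)))          # expect 0

# The decomposition must reproduce the original integrand.
lhs = 1 / ((eD * x + 1) * (x + 1)**2)
rhs = sol[A] / (eD * x + 1) + sol[B] / (x + 1) + sol[C] / (x + 1)**2
print(sp.simplify(lhs - rhs))                      # expect 0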
If $U$ has the same distribution for $x_1$ and $x_2$, it follows that $\Delta = 0$ and
\[
\Pr\{u_2 \le u_1\}
  = \int_{0^+}^{+\infty} \frac{dx}{(x + 1)^3}
  = \int_{0^+}^{+\infty} (x + 1)^{-3}\, dx
  = \frac{(x + 1)^{-2}}{-2} \Big|_{0^+}^{+\infty}
  = -\frac{1}{2 (x + 1)^2} \Big|_{0^+}^{+\infty}
  = \frac{1}{2}
\]
If we replace $A$, $B$ and $C$ with their values:
\begin{align*}
\Pr\{u_2 \le u_1\}
  &= \lim_{x \to +\infty} \left[ \frac{e^{2\Delta}}{(e^{\Delta} - 1)^2\, e^{\Delta}} \ln|e^{\Delta} x + 1|
     + \frac{-e^{\Delta}}{(e^{\Delta} - 1)^2} \ln|x + 1| \right] + C \\
  &= \lim_{x \to +\infty} \left[ \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln|e^{\Delta} x + 1|
     - \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln|x + 1| \right] + C \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \lim_{x \to +\infty} \ln \frac{|e^{\Delta} x + 1|}{|x + 1|} + C \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln e^{\Delta} + C \\
  &= \frac{\Delta e^{\Delta}}{(e^{\Delta} - 1)^2} + \frac{1}{1 - e^{\Delta}} \\
  &= \frac{\Delta e^{\Delta}}{(e^{\Delta} - 1)^2} - \frac{1}{e^{\Delta} - 1}
\end{align*}
Finally, the probability of a preference or ranking error is
\[
\Pr\{\mathrm{Error}\} = \frac{e^{\Delta} (\Delta - 1) + 1}{(e^{\Delta} - 1)^2}
\]
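A simple numerical check of this closed form (an illustrative sketch assuming NumPy and SciPy, not part of the original appendix) is to compare it with direct numerical integration of (A.2) for a few values of $\Delta$, and to verify that it approaches $1/2$ as $\Delta \to 0$, in agreement with the case of identical distributions.

# Numerical check of Pr{Error} = (exp(D)*(D-1) + 1) / (exp(D) - 1)**2.
import numpy as np
from scipy.integrate import quad

def pr_error_closed(delta):
    return (np.exp(delta) * (delta - 1) + 1) / (np.exp(delta) - 1) ** 2

def pr_error_numeric(delta):
    integrand = lambda x: 1.0 / ((np.exp(delta) * x + 1) * (x + 1) ** 2)
    value, _ = quad(integrand, 0, np.inf)
    return value

for delta in (0.1, 0.5, 1.0, 2.0):
    print(delta, pr_error_closed(delta), pr_error_numeric(delta))

# Delta -> 0 recovers the value 1/2 obtained for identical distributions.
print(pr_error_closed(1e-6))   # approximately 0.5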
A.2  Another Method to Solve the Integral
Now we describe another method to solve Equation (A.2), which consists in
replacing the upper and lower bounds of the integral with two variables and
then letting the new variables tend to the original bounds of the integral.
\[
\Pr\{u_2 \le u_1\}
  = \int_{0^+}^{+\infty} \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
  = \lim_{x_1 \to 0^+} \lim_{x_2 \to +\infty} \int_{x_1}^{x_2} \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
\]
Then we have to solve the following integral
\begin{align*}
\int_{x_1}^{x_2} \frac{dx}{(e^{\Delta} x + 1)(x + 1)^2}
  &= \left[ \frac{A}{e^{\Delta}} \ln|e^{\Delta} x + 1| + B \ln|x + 1| - \frac{C}{x + 1} \right]_{x_1}^{x_2} \\
  &= \frac{A}{e^{\Delta}} \ln|e^{\Delta} x_2 + 1| + B \ln|x_2 + 1| - \frac{C}{x_2 + 1} \\
  &\quad - \left( \frac{A}{e^{\Delta}} \ln|e^{\Delta} x_1 + 1| + B \ln|x_1 + 1| - \frac{C}{x_1 + 1} \right) \\
  &= \frac{A}{e^{\Delta}} \ln \frac{|e^{\Delta} x_2 + 1|}{|e^{\Delta} x_1 + 1|}
     + B \ln \frac{|x_2 + 1|}{|x_1 + 1|}
     + C \left( \frac{1}{x_1 + 1} - \frac{1}{x_2 + 1} \right) \\
  &= \frac{A}{e^{\Delta}} \ln \frac{|e^{\Delta} x_2 + 1|}{|e^{\Delta} x_1 + 1|}
     + B \ln \frac{|x_2 + 1|}{|x_1 + 1|}
     + C\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \\
  &= \frac{e^{2\Delta}}{e^{\Delta} (e^{\Delta} - 1)^2} \ln \frac{|e^{\Delta} x_2 + 1|}{|e^{\Delta} x_1 + 1|}
     + \frac{-e^{\Delta}}{(e^{\Delta} - 1)^2} \ln \frac{|x_2 + 1|}{|x_1 + 1|}
     + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln \frac{|e^{\Delta} x_2 + 1|}{|e^{\Delta} x_1 + 1|}
     - \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln \frac{|x_2 + 1|}{|x_1 + 1|}
     + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \left[ \ln \frac{|e^{\Delta} x_2 + 1|}{|e^{\Delta} x_1 + 1|}
     - \ln \frac{|x_2 + 1|}{|x_1 + 1|} \right]
     + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln \frac{|e^{\Delta} x_2 + 1|\, |x_1 + 1|}{|e^{\Delta} x_1 + 1|\, |x_2 + 1|}
     + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln \frac{(e^{\Delta} x_2 + 1)(x_1 + 1)}{(e^{\Delta} x_1 + 1)(x_2 + 1)}
     + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)}
\end{align*}
In the last step we removed the absolute value because all values are positive.
If we choose a symmetric interval in $u_1$, that is $-\bar{u}_1 \le u_1 \le \bar{u}_1$, since $x = e^{-u_1}$,
we have $x_1 = e^{-\bar{u}_1} = 1/e^{\bar{u}_1}$ and $x_2 = e^{\bar{u}_1}$. So $x_2 = 1/x_1$ and we can replace
$x_2$ in the integral by $1/x_1$ and then solve the limit for $x_1 \to 0^+$. Otherwise
we can solve a double limit as follows
\begin{align*}
\lim_{x_1 \to 0^+} \lim_{x_2 \to +\infty}
  & \left[ \frac{e^{\Delta}}{(e^{\Delta} - 1)^2}
      \ln \frac{(e^{\Delta} x_2 + 1)(x_1 + 1)}{(e^{\Delta} x_1 + 1)(x_2 + 1)}
    + \frac{1}{1 - e^{\Delta}}\, \frac{x_2 - x_1}{(x_1 + 1)(x_2 + 1)} \right] \\
  &= \lim_{x_2 \to +\infty} \left[ \frac{e^{\Delta}}{(e^{\Delta} - 1)^2}
      \ln \frac{e^{\Delta} x_2 + 1}{x_2 + 1}
    + \frac{1}{1 - e^{\Delta}}\, \frac{x_2}{x_2 + 1} \right] \\
  &= \frac{e^{\Delta}}{(e^{\Delta} - 1)^2} \ln e^{\Delta} + \frac{1}{1 - e^{\Delta}} \\
  &= \frac{\Delta e^{\Delta}}{(e^{\Delta} - 1)^2} - \frac{1}{e^{\Delta} - 1} \\
  &= \frac{\Delta e^{\Delta} - e^{\Delta} + 1}{(e^{\Delta} - 1)^2} \\
  &= \frac{e^{\Delta} (\Delta - 1) + 1}{(e^{\Delta} - 1)^2}
\end{align*}
Finally, the probability of a preference or ranking error is
\[
\Pr\{\mathrm{Error}\} = \frac{e^{\Delta} (\Delta - 1) + 1}{(e^{\Delta} - 1)^2}
\]
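The double limit can also be checked numerically (again an illustrative sketch, not part of the original appendix): evaluating the antiderivative of Section A.1 at a very small $x_1$ and a very large $x_2$ should approximate the closed form.

# Numerical version of the double limit: evaluate the antiderivative at a very
# small x1 and a very large x2 and compare with the closed form for Pr{Error}.
import math

Delta = 0.7  # arbitrary nonzero shift chosen only for this check
eD = math.exp(Delta)
A = eD**2 / (eD - 1.0)**2          # coefficients from (A.8)
B = -eD / (eD - 1.0)**2
C = 1.0 / (1.0 - eD)

def antiderivative(x):
    # Antiderivative obtained in Section A.1.
    return A / eD * math.log(eD * x + 1.0) + B * math.log(x + 1.0) - C / (x + 1.0)

x1, x2 = 1e-12, 1e12               # stand-ins for the limits x1 -> 0+, x2 -> +oo
approximation = antiderivative(x2) - antiderivative(x1)
closed_form = (eD * (Delta - 1.0) + 1.0) / (eD - 1.0)**2
print(approximation, closed_form)  # the two values should be very close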
Bibliography
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing Multiclass
to Binary: A Unifying Approach for Margin Classifiers. In Proceedings
of the Seventeenth International Conference on Machine Learning, (ICML
2000), pages 9–16, San Francisco, CA, USA. Morgan Kaufmann Publishers
Inc.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W., and Lipman, D. J. (1997). Gapped BLAST and PSI–BLAST: a new
generation of protein database search programs. Nucleic Acids Research,
25(17):3389–3402.
Altun, Y., Tsochantaridis, I., and Hofmann, T. (2003). Hidden Markov Support Vector Machines. In Fawcett, T. and Mishra, N., editors, Machine
Learning, Proceedings of the Twentieth International Conference (ICML
2003), August 21-24, 2003, Washington, DC, USA, pages 3–10, Washington, D.C., USA. ICML 2003, AAAI Press.
Andreini, C., Bertini, I., and Rosato, A. (2004). A Hint to Search for Metalloproteins in Gene Banks. Bioinformatics, 20(9):1373–1380.
Andrews, S., Tsochantaridis, I., and Hofmann, T. (2003). Support Vector
Machines for Multiple–Instance Learning. In Becker, S., Thrun, S., and
Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 561–568. MIT Press, Cambridge, MA, USA.
Bakır, G. H., Bottou, L., and Weston, J. (2005). Breaking SVM Complexity
with Cross–Training. In Saul, L. K., Weiss, Y., and Bottou, L., editors,
Advances in Neural Information Processing Systems 17, pages 81–88. MIT
Press, Cambridge, MA, USA.
Barla, A., Franceschi, E., Odone, F., and Verri, A. (2002). Image Kernels. In
SVM 2002: Proceedings of the First International Workshop on Pattern
Recognition with Support Vector Machines, satellite event of ICPR 2002,
pages 83–96. Springer–Verlag.
Barla, A., Odone, F., and Verri, A. (2003). Histogram Intersection Kernel
for Image Classification. In Proceedings of the International Conference on
Image Processing (ICIP 2003), Barcelona, volume 3, pages III: 513–516.
Ben-David, S., Eiron, N., and Simon, H.-U. (2002). Limitations of Learning
Via Embeddings in Euclidean Half Spaces. Journal of Machine Learning
Research, 3:441–461.
Ben-Hur, A., Horn, D., Siegelmann, H. T., and Vapnik, V. (2002). Support
Vector Clustering. Journal of Machine Learning Research, 2:125–137.
Berg, C., Christensen, J., and Ressel, P. (1984). Harmonic Analysis on
Semigroups: Theory of Positive Definite and Related Functions. Springer–
Verlag.
Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and Structure–
Based Prediction of Eukaryotic Protein Phosphorylation Sites. Journal of
Molecular Biology, 294(5):1351–1362.
Bod, R. (2001). What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy? In Proceedings of ACL.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A Training Algorithm
for Optimal Margin Classifiers. In Haussler, D., editor, Proceedings of the
Fifth Annual Workshop on Computational Learning Theory (COLT 1992),
pages 144–152, New York, NY, USA. ACM Press.
Bottou, L. (1998). Online Algorithms and Stochastic Approximations. In
Saad, D., editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK.
Bredensteiner, E. and Bennett, K. (1999). Multicategory Classification by
Support Vector Machines. Computational Optimizations and Applications,
12:53–79.
Chang, C.-C. and Lin, C.-J. (2002). Training ν–Support Vector Regression:
Theory and Algorithms. Neural Computation, 14(8):1959–1977.
Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to Order
Things. Journal of Artificial Intelligence Research, 10:243–270.
Collins, M. (2000). Discriminative Reranking for Natural Language Parsing.
In Proceedings of ICML 2000.
Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural Language.
In Dietterich, T., Becker, S., and Ghahramani, Z., editors, Advances in
Neural Information Processing Systems 14, pages 625–632, Cambridge,
MA, USA. NIPS 14, MIT Press.
Collins, M. and Duffy, N. (2002). New Ranking Algorithms for Parsing and
Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 263–270. ACL–02.
Cortes, C., Haffner, P., and Mohri, M. (2004). Rational Kernels: Theory and
Algorithms. Journal of Machine Learning Research, 5:1035–1062.
Cortes, C. and Vapnik, V. N. (1995). Support–Vector Networks. Machine
Learning, 20(3):273–297.
Costa, F., Frasconi, P., Lombardo, V., and Soda, G. (2003a). Towards Incremental Parsing of Natural Language using Recursive Neural Networks.
Applied Intelligence, 19(1/2):9–25.
Costa, F., Frasconi, P., Lombardo, V., Sturt, P., and Soda, G. (2003b). Ambiguity Resolution Analysis in Incremental Parsing of Natural Language.
IEEE Transactions on Neural Networks. Submitted for publication.
Costa, F., Frasconi, P., Lombardo, V., Sturt, P., and Soda, G. (2005). Ambiguity Resolution Analysis in Incremental Parsing of Natural Language.
IEEE Transactions on Neural Networks, 16(4):959–971.
Costa, F., Frasconi, P., Menchetti, S., and Pontil, M. (2002). Comparing Convolutional Kernels and Recursive Neural Networks on a Wide–
Coverage Computational Analysis of Natural Language. In Becker, S.,
Thrun, S., and Obermayer, K., editors, NIPS 2002 Workshop on Unreal
Data: Principles of Modeling Nonvectorial Data (Invited), Advances in
Neural Information Processing Systems 15 (NIPS 2002), December 9–14,
2002, Vancouver, British Columbia, Canada. MIT Press.
Costa, F., Lombardo, V., Frasconi, P., and Soda, G. (2001). Wide Coverage
Incremental Parsing by Learning Attachment Preferences. Conference of
the Italian Association for Artificial Intelligence.
Crammer, K. and Singer, Y. (2000). On the Learnability and Design of Output Codes for Multiclass Problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), pages
35–46, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Crammer, K. and Singer, Y. (2001). On the Algorithmic Implementation of
Multiclass Kernel-based Vector Machines. Journal of Machine Learning
Research, 2(Dec):265–292.
Crammer, K. and Singer, Y. (2002a). A New Family of Online Algorithms for
Category Ranking. In Proceedings of the 25th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
pages 151–158, New York, NY, USA. SIGIR 2002, ACM Press.
Crammer, K. and Singer, Y. (2002b). Pranking with Ranking. In Dietterich,
T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 641–647, Cambridge, MA, USA.
NIPS 14, MIT Press.
Crammer, K. and Singer, Y. (2003). A Family of Additive Online Algorithms
for Category Ranking. Journal of Machine Learning Research, 3:1025–
1058.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support
Vector Machines (and other kernel-based learning methods). Cambridge
University Press.
Cucker, F. and Smale, S. (2001). On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1–49.
Cuetos, F. and Mitchell, D. (1988). Cross–linguistic Differences in Parsing:
Restrictions on the Use of the Late Closure Strategy in Spanish. Cognition,
30(1):73–105.
Cumby, C. M. and Roth, D. (2003). On Kernel Methods for Relational
Learning. In Fawcett, T. and Mishra, N., editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August
21-24, 2003, Washington, DC, USA, pages 107–114. AAAI Press.
Dekel, O., Manning, C., and Singer, Y. (2004). Log–Linear Models for Label
Ranking. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in
Neural Information Processing Systems 16. MIT Press, Cambridge, MA,
USA.
Deshpande, M., Kuramochi, M., and Karypis, G. (2002). Automated Approaches for Classifying Structures. In Zaki, M. J., Wang, J. T.-L., and
Toivonen, H., editors, Proceedings of the 2nd ACM SIGKDD Workshop on
Data Mining in Bioinformatics (BIOKDD 2002), July 23rd , 2002, Edmonton, Alberta, Canada, pages 11–18.
Deshpande, M., Kuramochi, M., and Karypis, G. (2003). Frequent Sub–
Structure–Based Approaches for Classifying Chemical Compounds. In
Proceedings of the 3rd IEEE International Conference on Data Mining
(ICDM 2003), 19–22 December 2003, Melbourne, Florida, USA, pages 35–
42. IEEE Computer Society.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of
Pattern Recognition. Springer–Verlag.
Dietterich, T. G. and Bakiri, G. (1995). Solving Multiclass Learning Problems via Error–Correcting Output Codes. Journal of Artificial Intelligence
Research, 2:263–286.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification.
John Wiley and Sons, New York.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Elisseeff, A. and Weston, J. (2002). A Kernel Method for Multi–Labelled
Classification. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 681–
687, Cambridge, MA, USA. NIPS 14, MIT Press.
Evgeniou, T., Pontil, M., and Poggio, T. (2000). Regularization Networks
and Support Vector Machines. Advances in Computational Mathematics,
13(1):1–50.
Fariselli, P. and Casadio, R. (2001). Prediction of Disulfide Connectivity in
Proteins. Bioinformatics, 17(10):957–964.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for
Researchers. Technical Report HPL–2003–4, HP Labs.
Fiser, A. and Simon, I. (2000). Predicting the Oxidation State of Cysteines
by Multiple Sequence Alignment. Bioinformatics, 16(3):251–256.
Frasconi, P., Gori, M., and Sperduti, A. (1998). A General Framework
for Adaptive Processing of Data Structure. IEEE Transaction on Neural
Networks, 9(5):768–786.
Freund, Y. and Schapire, R. E. (1996). Experiments with a New Boosting
Algorithm. In Saitta, L., editor, Proceedings of the Thirteenth International
Conference on Machine Learning, pages 148–156, Bari, Italy. ICML 1996,
Morgan Kaufmann.
Freund, Y. and Schapire, R. E. (1999). Large Margin Classification using the
Perceptron Algorithm. Machine Learning, 37(3):277–296.
Gärtner, T. (2003). A Survey of Kernels for Structured Data. SIGKDD
Explorations Newsletter, 5(1):49–58.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. (2002). Multi–
Instance Kernels. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pages 179–186, San Francisco,
CA, USA. Morgan Kaufmann Publishers Inc.
Gärtner, T., Flach, P. A., and Wrobel, S. (2003). On Graph Kernels: Hardness Results and Efficient Alternatives. In Schölkopf, B. and Warmuth,
M. K., editors, Computational Learning Theory and Kernel Machines, 16th
Annual Conference on Computational Learning Theory and 7th Kernel
Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24–27,
2003, Proceedings, volume 2777 of Lecture Notes in Computer Science,
pages 129–143. Springer.
Gärtner, T., Lloyd, J. W., and Flach, P. A. (2003). Kernels for Structured
Data. In Matwin, S. and Sammut, C., editors, Proceedings of the 12th International Conference on Inductive Logic Programming (ILP 2002), volume 2583 of Lecture Notes in Artificial Intelligence LNAI, pages 66–83.
Springer–Verlag.
Gärtner, T., Lloyd, J. W., and Flach, P. A. (2004). Kernels and Distances
for Structured Data. Machine Learning, 57(3):205–232.
Goller, C. and Kuechler, A. (1996). Learning Task–Dependent Distributed
Structure–Representations by Back–Propagation through Structure. In
IEEE International Conference on Neural networks, pages 347–352.
Goodman, J. (1996). Efficient Algorithms for Parsing the DOP Model. In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing, pages 143–152.
Graepel, T. and Herbrich, R. (2003). Invariant Pattern Recognition by
Semidefinite Programming Machines. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems
15, pages 561–568. MIT Press, Cambridge, MA, USA.
Gribskov, M. and Robinson, N. L. (1996). The Use of Receiver Operating
Characteristic (ROC) Analysis to Evaluate Sequence Matching. Computers
and Chemistry, 20(1):25–33.
Guermeur, Y., Elisseeff, A., and Paugam-Moisy, H. (2000). A New Multi–
Class SVM Based on a Uniform Convergence Result. In Amari, S., Giles,
C., Gori, M., and Piuri, V., editors, Proceedings of the IEEE–INNS–ENNS
International Joint Conference on Neural Networks (IJCNN 2000), volume IV, pages 183–188, Los Alamitos. IEEE Computer Society.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The Elements of
Statistical Learning. Springer–Verlag.
Haussler, D. (1999). Convolution Kernels on Discrete Structures. Technical
Report UCSC–CLR–99–10, University of California at Santa Cruz.
Helma, C., King, R. D., Kramer, S., and Srinivasan, A. (2001). The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 17(1):107–108.
Herbrich, R., Graepel, T., Bollmann-Sdorra, P., and Obermayer, K. (1998).
Learning Preference Relations for Information Retrieval. In Proceedings Workshop Text Categorization and Machine Learning, International
Conference on Machine Learning, pages 80–84, Madison Wisconsin.
ICML/AAAI–98 Workshop on Learning for Text Categorization, The
AAAI Press.
Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large Margin Rank
Boundaries for Ordinal Regression. In Smola, A., Bartlett, P., Schölkopf,
B., and Schuurmans, D., editors, Advances in Large Margin Classifiers.
MIT Press.
Horváth, T., Gärtner, T., and Wrobel, S. (2004). Cyclic Pattern Kernels
for Predictive Graph Mining. In Kim, W., Kohavi, R., Gehrke, J., and
DuMouchel, W., editors, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004),
Seattle, Washington, USA, August 22–25, 2004, pages 158–167, New York,
NY, USA. ACM Press.
Hua, S. and Sun, Z. (2001). Support Vector Machine for Protein Subcellular
Localization Prediction. Bioinformatics, 17(8):721–728.
Jaakkola, T. S., Diekhans, M., and Haussler, D. (2000). A Discriminative
Framework for Detecting Remote Protein Homologies. Journal of Computational Biology, 7(1–2):95–114.
Jaakkola, T. S. and Haussler, D. (1999a). Exploiting Generative Models in
Discriminative Classifiers. In Advances in Neural Information Processing
Systems (NIPS 14), volume 10, pages 487–493, Cambridge, MA, USA.
MIT Press.
Jaakkola, T. S. and Haussler, D. (1999b). Probabilistic Kernel Regression
Models. In Proceedings of the 1999 Conference on AI and Statistics. Morgan Kaufmann.
Jebara, T., Kondor, R., and Howard, A. (2004). Probability Product Kernels.
The Journal of Machine Learning Research, 5:819–844.
Jensen, D. and Neville, J. (2002). Linkage and Autocorrelation Cause Feature
Selection Bias in Relational Learning. In Proceedings of the Nineteenth
International Conference on Machine Learning (ICML 2002), pages 259–
266, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Jensen, D., Neville, J., and Gallagher, B. (2004). Why Collective Inference Improves Relational Classification. In Proceedings of the Tenth ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD 2004), pages 593–598, New York, NY, USA. ACM Press.
Joachims, T. (1998). Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer.
Joachims, T. (1999). Making Large–Scale Support Vector Machine Learning
Practical. In Schölkopf, B., Burges, C., and Smola, A. J., editors, Advances
in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge,
MA, USA. MIT Press.
Joachims, T. (2002a). Evaluating Retrieval Performance using Clickthrough
Data. In Proceedings of the SIGIR Workshop on Mathematical/Formal
Methods in Information Retrieval.
Joachims, T. (2002b). Optimizing Search Engines Using Clickthrough Data.
In Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD 2002), pages 133–142, New
York, NY, USA. ACM, ACM Press.
Kamiya, H. and Takemura, A. (1997). On Rankings Generated by Pairwise
Linear Discriminant Analysis of m Populations. Journal of Multivariate
Analysis, 61(1/2):1–28.
Kashima, H. and Inokuchi, A. (2002). Kernels for Graph Classification. In
Proceedings of 1st ICDM International Workshop on Active Mining (AM
2002), Maebashi, Japan, 2002, pages 31–36.
Kashima, H. and Koyanagi, T. (2002). Kernels for Semi–Structured Data. In
Sammut, C. and Hoffmann, A. G., editors, Proceedings of the Nineteenth
International Conference (ICML 2002), University of New South Wales,
Sydney, Australia, July 8–12, 2002, pages 291–298, San Francisco, CA,
USA. Morgan Kaufmann Publishers Inc.
Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized Kernels
Between Labeled Graphs. In Fawcett, T. and Mishra, N., editors, Machine
Learning, Proceedings of the Twentieth International Conference (ICML
2003), August 21–24, 2003, Washington, DC, USA, pages 321–328. AAAI
Press.
Kaufmann, L. (1999). Solving the Quadratic Programming Problem arising
in Support Vector Classification. In Schölkopf, B., Burges, C., and Smola,
A. J., editors, Advances in Kernel Methods — Support Vector Learning,
pages 147–168, Cambridge, MA, USA. MIT Press.
Keerthi, S., Shevade, S., Bhattacharyya, C., and Murthy, K. (2000). A Fast
Iterative Nearest Point Algorithm for Support Vector Machine Classifier
Design. IEEE Transactions on Neural Networks, 11(1):124–136.
Kimeldorf, G. S. and Wahba, G. (1971). Some Results on Tchebycheffian
Spline Functions. Journal of Mathematical Analysis and Applications,
33:82–95.
Kin, T., Tsuda, K., and Asai, K. (2002). Marginalized Kernels for RNA
Sequence Data Analysis. In Lathrop, R. H., Nakai, K., Miyano, S., Takagi, T., and Kanehisa, M., editors, Genome Informatics, pages 112–122.
Universal Academic Press.
Kramer, S., Raedt, L. D., and Helma, C. (2001). Molecular Feature Mining
in HIV Data. In Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 136–143, San
Francisco, California, USA. ACM Press.
Kuramochi, M. and Karypis, G. (2004). An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE Transactions on Knowledge and Data
Engineering, 16(9):1038–1051.
Kwok, J. T. and Tsang, I. W. (2003). Learning with Idealized Kernels. In
Fawcett, T. and Mishra, N., editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 400–407, Washington,
D.C., USA. ICML 2003, AAAI Press 2003.
Lebanon, G. and Lafferty, J. (2003). Conditional Models on the Ranking
Poset. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances
in Neural Information Processing Systems 15, pages 415–422. MIT Press,
Cambridge, MA, USA.
Lebanon, G. and Lafferty, J. D. (2002). Cranking: Combining Rankings
Using Conditional Probability Models on Permutations. In Sammut, C.
and Hoffmann, A. G., editors, Proceedings of the Nineteenth International
Conference on Machine Learning, pages 363–370, University of New South
Wales, Sydney, Australia. ICML 2002, Morgan Kaufmann.
Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data
and Satellite Radiance Data. Journal of the American Statistical Association, 99:67–81.
Leslie, C., Eskin, E., and Noble, W. S. (2002a). The Spectrum Kernel: a
String Kernel for SVM Protein Classification. In Pacific Symposium on
Biocomputing, pages 566–575.
Leslie, C. S., Eskin, E., Weston, J., and Noble, W. S. (2002b). Mismatch
String Kernels for SVM Protein Classification. In Becker, S., Thrun, S.,
and Obermayer, K., editors, Advances in Neural Information Processing
Systems 15 [Neural Information Processing Systems (NIPS 2002), December 9-14, 2002, Vancouver, British Columbia, Canada], pages 1417–1424.
MIT Press.
Littlestone, N. (1988). Learning Quickly when Irrelevant Attributes Abound:
a New Linear–Threshold Algorithm. Machine Learning, 2(4):285–318.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., and Watkins, C. (2001). Text
Classification using String Kernels. In Leen, T. K., Dietterich, T. G., and
Tresp, V., editors, Advances in Neural Information Processing Systems 13,
pages 563–569. NIPS 13, MIT Press.
Lombardo, V., Lesmo, L., Ferraris, L., and Seidenari, C. (1998). Incremental
Processing and Lexicalized Grammars. In Proceedings of the XXI Annual
Meeting of the Cognitive Science Society.
Lombardo, V. and Sturt, P. (2002). Incrementality and Lexicalism: a Treebank Study. In Stevenson, S. and Merlo, P., editors, Lexical Representations in Sentence Processing, Computational Psycholingusitics Series,
pages 137–155. John Benjamins: Natural Language Processing Series.
Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. (2004). Extensions of Marginalized Graph Kernels. In Brodley, C. E., editor, Machine
Learning, Proceedings of the Twenty-first International Conference (ICML
2004), Banff, Alberta, Canada, July 4–8, 2004, New York, NY, USA. ACM
Press.
Mallows, C. L. (1957). Non–Null Ranking Models. Biometrika, 44(1/2):114–
130.
Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a Large
Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
Martelli, P. L., Fariselli, P., Malaguti, L., and Casadio, R. (2002). Prediction of the Disulfide–Bonding State of Cysteines in Proteins at 88% Accuracy. Protein Science, 11(11):2735–2739.
Menchetti, S. (2001). Estensione del Classificatore Naive Bayes per la Categorizzazione del Testo con Dati Parzialmente Etichettati. Master’s thesis,
Dipartimento di Sistemi e Informatica, Facoltà di Ingegneria, Via di Santa
Marta, 3, 50139 Florence – Italy. In Italian.
Menchetti, S. (2006). On the Consistency of Preference Learning. Technical
Report RT 1/2006, Dipartimento di Sistemi e Informatica (DSI), Università di Firenze, Italy, Via di Santa Marta, 3 – 50139 Firenze. Submitted
for Publication.
Menchetti, S., Costa, F., and Frasconi, P. (2005a). Weighted Decomposition Kernels. In Raedt, L. D. and Wrobel, S., editors, Machine Learning, Proceedings of the 22th International Conference (ICML 2005), Bonn,
Germany, August 7–11, 2005, volume 119, pages 585–592, New York, NY,
USA. ICML 2005, ACM Press.
Menchetti, S., Costa, F., and Frasconi, P. (2005b). Weighted Decomposition Kernels for Protein Subcellular Localization. In Proceedings of BITS
Annual Meeting 2005 (BITS 2005), Milan, Italy.
Menchetti, S., Costa, F., Frasconi, P., and Pontil, M. (2003). Comparing
Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data. In Proceedings of IAPR – TC3 International
Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR
2003).
Menchetti, S., Costa, F., Frasconi, P., and Pontil, M. (2005c). Wide Coverage Natural Language Processing using Kernel Methods and Neural Networks for Structured Data. Pattern Recognition Letters, Special Issue on
Artificial Neural Networks in Pattern Recognition, 26(12):1896–1906. PATREC3670.
Menchetti, S., Passerini, A., Frasconi, P., Andreini, C., and Rosato, A.
(2006). Improving Prediction of Zinc Binding Sites by Modelling the Linkage between Residues Close in Sequence. In Proceedings of the Tenth Annual International Conference on Research in Computational Molecular
Biology (RECOMB 2006), Venice, Italy, April 2–5, 2006.
Mika, S. and Rost, B. (2003). UniqueProt: Creating Sequence–Unique Protein Data Sets. Nucleic Acids Research, 31(13):3789–3791.
Mitchell, D., Cuetos, F., Corley, M., and Brysbaert, M. (1995). Exposure-based Models of Human Parsing: Evidence for the Use of Coarse-grained
(nonlexical) Statistical Records. Journal of Psycholinguistics Research,
24(6):469–488.
Mitchell, T. (1997). Machine Learning. McGraw Hill, New York.
Morgan, H. (1965). The Generation of a Unique Machine Description for
Chemical Structures — A Technique Developed at Chemical Abstracts
Service. Journal of Chemical Documentation, 5:107–113.
Nair, R. and Rost, B. (2003). Better Prediction of Sub-Cellular Localization by Combining Evolutionary and Structural Information. Proteins:
Structure, Function, and Genetics, 53(4):917–930.
Nielsen, H., Brunak, S., and von Heijne, G. (1999). Machine Learning Approaches for the Prediction of Signal Peptides and Other Protein Sorting
Signals. Protein Engineering, 12(1):3–9.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997). Identification of Prokaryotic and Eukaryotic Signal Peptides and Prediction of
their Cleavage Sites. Protein Engineering, 10(1):1–6.
Noble, W. S. (2004). Support Vector Machine Applications in Computational
Biology. In Schoelkopf, B., Tsuda, K., and Vert, J.-P., editors, Kernel
Methods in Computational Biology, pages 71–92. MIT Press.
Odone, F., Barla, A., and Verri, A. (2005). Building Kernels from Binary
Strings for Image Matching. IEEE Transactions on Image Processing,
14(2):169–180.
Osuna, E., Freund, R., and Girosi, F. (1997). An Improved Training Algorithm for Support Vector Machines. In Principe, J., Gile, L., Morgan,
N., and Wilson, E., editors, Neural Networks for Signal Processing VII —
Proceedings of the 1997 IEEE Workshop, pages 276–285, New York, USA.
IEEE Press.
Osuna, E. and Girosi, F. (1998). Reducing the Run–Time Complexity of Support Vector Machines. In Proceedings of the 14th International Conference
on Pattern Recognition (ICPR 1998), Brisbane, Australia.
Paass, G., Leopold, E., Larson, M., Kindermann, J., and Eickeler, S. (2002).
SVM Classification Using Sequences of Phonemes and Syllables. In Elomaa, T., Mannila, H., and Toivonen, H., editors, Proceedings of Principles of Data Mining and Knowledge Discovery, 6th European Conference
(PKDD 2002), Helsinki, Finland, August 19–23, 2002, volume 2431 of Lecture Notes in Computer Science, pages 373–384. Springer–Verlag, London,
UK.
Passerini, A. and Frasconi, P. (2004). Learning to Discriminate Between
Ligand–Bound and Disulfide–Bound Cysteines. Protein Engineering Design and Selection, 17(4):367–373.
Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for
Training Support Vector Machines. Technical Report MSR–TR–98–14,
Microsoft Research.
Platt, J. (1999a). Fast Training of Support Vector Machines using Sequential
Minimal Optimization. In Schölkopf, B., Burges, C., and Smola, A. J.,
editors, Advances in Kernel Methods — Support Vector Learning, pages
185–208, Cambridge, MA, USA. MIT Press.
Platt, J. (1999b). Probabilistic Outputs for Support Vector Machines and
Comparisons to Regularized Likelihood Methods. In Smola, A., Bartlett,
P., Schölkopf, B., and Schuurmans, D., editors, Advances in Large Margin
Classifiers, pages 61–74. MIT Press.
Poggio, T. and Girosi, F. (1989). A Theory of Networks for Approximation
and Learning. Technical Report AIM–1140, Massachusetts Institute of
Technology, Artificial Intelligence Laboratory and Center for Biological
Information Processing, Whitaker College, Cambridge, MA, USA.
Poggio, T. and Smale, S. (2003). The Mathematics of Learning: Dealing with
Data. Notices of the American Mathematical Society (AMS), 50(5):537–
544.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected
Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–
286.
Ramon, J. and Gärtner, T. (2003). Expressivity versus Efficiency of Graph
Kernels. In De Raedt, L. and Washio, T., editors, Proceedings of the First
International Workshop on Mining Graphs, Trees and Sequences (MGTS
2003), held with ECML/PKDD 2003, pages 65–74. ECML/PKDD 2003
Workshop Proceedings.
Read, R. C. and Tarjan, R. E. (1975). Bounds on Backtrack Algorithms for
Listing Cycles, Paths, and Spanning Trees. Networks, 5(3):237–252.
Rifkin, R. and Klautau, A. (2004). In Defense of One–Vs–All Classification.
Journal of Machine Learning Research, 5:101–141.
Rosenblatt, F. (1958). The Perceptron: a Probabilistic Model for Information
Storage and Organisation in the Brain. Psychological Review, 65:386–408.
Rost, B. and Sander, C. (1993). Improved Prediction of Protein Secondary
Structure by Use of Sequence Profiles and Neural Networks. Proceedings
of the National Academy of Sciences, USA, 90(16):7558–7562.
Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach.
Prentice Hall Series in Artificial Intelligence, Second edition.
Sang, E. F. T. K. (2002). Introduction to the CoNLL–2002 Shared Task:
Language–Independent Named Entity Recognition. In Proceedings of
CoNLL–2002, pages 155–158. Taipei, Taiwan.
Saunders, C., Gammerman, A., and Vovk, V. (1998). Ridge Regression
Learning Algorithm in Dual Variables. In Proceedings of the Fifteenth
International Conference on Machine Learning (ICML 1998), pages 515–
521, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Saunders, C., Shawe-Taylor, J., and Vinokourov, A. (2002). String Kernels,
Fisher Kernels and Finite State Automata. In Becker, S., Thrun, S., and
Obermayer, K., editors, Advances in Neural Information Processing Systems 15 (NIPS 2002), December 9–14, 2002, Vancouver, British Columbia,
Canada, pages 633–640. MIT Press.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson,
R. C. (2001). Estimating the Support of a High–Dimensional Distribution.
Neural Computation, 13(7):1443–1471.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press,
Cambridge, MA.
Schölkopf, B., Smola, A. J., and Müller, K.-R. (1999). Kernel Principal
Component Analysis. In Schölkopf, B., Burges, C., and Smola, A. J.,
editors, Advances in Kernel Methods — Support Vector Learning, pages
327–352, Cambridge, MA, USA. MIT Press.
Schölkopf, B., Weston, J., Eskin, E., Leslie, C. S., and Noble, W. S. (2002a).
A Kernel Approach for Learning from Almost Orthogonal Patterns. In
Proceedings of the 13th European Conference on Machine Learning (ECML
2002), pages 511–528.
Schölkopf, B., Weston, J., Eskin, E., Leslie, C. S., and Noble, W. S. (2002b).
A kernel approach for learning from almost orthogonal patterns. In Elomaa, T., Mannila, H., and Toivonen, H., editors, Principles of Data Mining and Knowledge Discovery, 6th European Conference (PKDD 2002),
Helsinki, Finland, August 19-23, 2002, Proceedings, volume 2431 of Lecture Notes in Computer Science, pages 494–511. Springer.
Schultz, M. and Joachims, T. (2004). Learning a Distance Metric from Relative Comparisons. In Thrun, S., Saul, L., and Schölkopf, B., editors,
Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, USA.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern
Analysis. Cambridge University Press.
Shevade, S., Keerthi, S., Bhattacharyya, C., and Murthy, K. (2000). Improvements to the SMO Algorithm for SVM Regression. IEEE Transactions on
Neural Networks, 11(5):1188–1193.
Shimodaira, H., Noma, K.-i., Nakai, M., and Sagayama, S. (2001a). Dynamic Time–Alignment Kernel in Support Vector Machine. In Dietterich,
T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14: Natural and Synthetic (NIPS 2001), December 3–8, 2001, Vancouver, British Columbia, Canada, volume 2, pages
921–928. MIT Press.
Shimodaira, H., Noma, K.-i., Nakai, M., and Sagayama, S. (2001b). Support Vector Machine with Dynamic Time–Alignment Kernel for Speech
Recognition. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech 2001), volume III, pages 1841–1844.
Srinivasan, A., King, R. D., Muggleton, S. H., and Sternberg, M. J. E. (1997).
The Predictive Toxicology Evaluation Challenge. In Proceedings of the
Fifteenth International Joint Conference on Artificial Intelligence (IJCAI
1997), pages 1–6. Morgan–Kaufmann.
Steinwart, I. (2003). Sparseness of Support Vector Machines. Journal of Machine Learning Research, 4:1071–1105.
Stroustrup, B. (1997). The C++ Programming Language. Addison Wesley,
Third edition.
Sturt, P., Costa, F., Lombardo, V., and Frasconi, P. (2003). Learning First–
Pass Structural Attachment Preferences with Dynamic Grammars and Recursive Neural Networks. Cognition, 88(2):133–169.
Swamidass, S. J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., and Baldi,
P. (2005). Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity and Anti–Cancer Activity. Bioinformatics, 21(Supplement
1):i359–i368.
Taskar, B., Abbeel, P., and Koller, D. (2002). Discriminative probabilistic
models for relational data. In Proceedings of the Eighteenth Conference on
Uncertainty in Artificial Intelligence (UAI 2002), University of Alberta,
Edmonton, Canada. Morgan Kaufmann.
Taskar, B., Guestrin, C., and Koller, D. (2004). Max-Margin Markov Networks. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in
Neural Information Processing Systems 16. MIT Press, Cambridge, MA,
USA.
Tax, D. M. J. and Duin, R. P. W. (1999). Support Vector Domain Description. Pattern Recognition Letters, 20(11–13):1191–1199.
Tewari, A. and Bartlett, P. L. (2005). On the Consistency of Multiclass Classification Methods. In Auer, P. and Meir, R., editors, Proceedings of 18th
Annual Conference on Learning Theory (COLT 2005), Bertinoro, Italy,
June 27–30, 2005, volume 3559 of Lecture Notes in Computer Science,
pages 143–157. Springer.
Toivonen, H., Srinivasan, A., King, R. D., Kramer, S., and Helma, C. (2003).
Statistical Evaluation of the Predictive Toxicology Challenge 2000–2001.
Bioinformatics, 19(10):1183–1193.
Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support Vector Machine Learning for Interdependent and Structured Output
Spaces. In Machine Learning, Proceedings of the Twenty-first International
Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, pages
823–830. ACM.
Tsuda, K., Kin, T., and Asai, K. (2002). Marginalized Kernels for Biological
Sequences. Bioinformatics, 18(1):268–275.
Ukkonen, E. (1992). Constructing Suffix Trees On–Line in Linear Time. In
Proceedings of the IFIP 12th World Computer Congress on Algorithms,
Software, Architecture — Information Processing ’92, Volume 1, pages
484–492. North–Holland.
Ukkonen, E. (1995). On–Line Construction of Suffix Trees. Algorithmica,
14(3):249–260.
Vallee, B. L. and Auld, D. S. (1992). Functional zinc–binding motifs in
enzymes and DNA–binding proteins. Faraday Discussions, 93:47–65.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New
York.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley and Sons, New
York.
Vert, J.-P. and Kanehisa, M. (2003). Graph–driven Features Extraction from
Microarray Data using Diffusion Kernels and Kernel CCA. In Becker,
S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 1425–1432. MIT Press, Cambridge, MA,
USA.
Viswanathan, S. and Smola, A. J. (2003). Fast Kernels for String and Tree
Matching. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances
in Neural Information Processing Systems 15, pages 569–576. MIT Press,
Cambridge, MA, USA.
Vullo, A. and Frasconi, P. (2004). Disulfide Connectivity Prediction Using
Recursive Neural Networks and Evolutionary Information. Bioinformatics,
20(5):653–659.
Watkins, C. (2000). Dynamic Alignment Kernels. In Smola, A., Bartlett,
P., Schölkopf, B., and Schuurmans, D., editors, Advances in Large Margin
Classifiers, pages 39–50. MIT Press, Cambridge, MA, USA.
Weininger, D., Weininger, A., and Weininger, J. L. (1989). SMILES. 2. Algorithm for Generation of Unique SMILES Notation. Journal of Chemical
Information and Computer Sciences, 29(2):97–101.
Weislow, O., Kiser, R., Fine, D., Bader, J., Shoemaker, R., and Boyd, M.
(1989a). New Soluble Formazan Assay for HIV–1 Cytopathic Effects: Application to High Flux Screening of Synthetic and Natural Products for
AIDS Antiviral Activity. Journal of the National Cancer Institute, 81:577–
586.
Weislow, O., Kiser, R., Fine, D. L., Bader, J. P., Shoemaker, R. H., and Boyd,
M. R. (1989b). New Soluble Formazan Assay for HIV–1 Cytopathic Effects:
Application to High Flux Screening of Synthetic and Natural Products for
AIDS Antiviral Activity. Journal of National Cancer Institute.
Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2003).
Kernel Dependency Estimation. In Becker, S., Thrun, S., and Obermayer,
K., editors, Advances in Neural Information Processing Systems 15, pages
873–880. MIT Press, Cambridge, MA, USA.
Weston, J. and Watkins, C. (1998). Multi–Class Support Vector Machines.
Technical Report CSD–TR–98–04, Department of Computer Science,
Royal Holloway, University of London, Egham, TW20 0EX, UK.
Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. (2003). Distance Metric
Learning with Application to Clustering with Side-Information. In Becker,
S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information
Processing Systems 15, pages 505–512. MIT Press, Cambridge, MA, USA.
Zhang, T. (2002). On the Dual Formulation of Regularized Linear Systems
with Convex Risk. Machine Learning, 46(1–3):91–129.
Zhou, D., Weston, J., Gretton, A., Bousquet, O., and Schölkopf, B. (2004).
Ranking on Data Manifolds. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press,
Cambridge, MA, USA.
Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., and Müller, K.-R. (2000). Engineering Support Vector Machine Kernels That Recognize
Translation Initiation Sites. Bioinformatics, 16(9):799–807.
Index
accuracy, 199
active learning, 3
anchor, 132
approximation error, 18
artificial intelligence, 2
AUC, 203
AURPC, 229
autocorrelation, 220
bag representation, 68
batch learning, 3
Bayes function, 16, 146, 149
  classification, 148
  regression, 147
Bayes risk, 16
binary classification, 2
binary relation, 151
bounded SVs, 35
cancelling out effect, 124
collective classification, 217
connection path, 131
constituent, 134
context, 182
crossing brackets, 134
DAG, 93
direct model, 153, 155
direct sum, 65, 172
DOAG, 93
EMGK, 211
empirical error, 16
empirical risk minimization, 17
Euclidean space, 20
expected risk, 16
F1 metric, 135
first pass attachment, 129
foot, 132
FSG, 211
  algorithm, 94
gating network, 226
generalization, 4
geometric average, 199
Gram matrix, 22
HIK, 181
Hilbert space, 20
HIV, 204
HMM, 77
hypothesis space, 14
ill–posed problem, 17
incremental tree, 131
IO–isomorph transductions, 107
Ivanov regularization, 18
kernel, 21
  1D, 99
  2D, 99
  3D, 99
  all–substructures, 178
  ANOVA, 67
  basic terms, 104
  composition, 67
  convolution, 65
  CSI, 76
  cyclic pattern, 95
  decomposition, 172
  dynamic alignment, 76
  dynamic time–alignment, 75
  EMG, 97, 211
  Fisher, 80
  frequent subgraph, 93
  gap, 226
  Gaussian, 64
  histogram intersection, 181
  inner product, 63
  intersection set, 66
  marginalized, 79
  marginalized graph, 96
  Mercer, 21
  minimum, 66
  mismatch string, 71
  multiset, 172
  normalized, 68
  parse tree, 83
  polynomial, 63
  preference model, 121
  probability distribution, 180
  ranking model, 121
  RBF, 64
  semi–structured data, 89
  semipattern, 226
  set, 66
  spectrum, 70
  string subsequence, 72
  string tree, 87
  subgraph, 93
  synchronized random walk, 101
  tree–structured pattern, 103
  walk based, 103
  weighted decomposition, 182
kernel machines, 5
labelled precision, 134
labelled recall, 134
learning to learn, 3
linkage, 220
local predictor, 225
LOCNet, 198
loss function, 15
  misclassification, 15
  pairwise, 122
  preference, 119, 153
  quadratic, 15
  ranking, 153
  SVM hard margin, 16
  SVM misclassification, 16
  SVM regression, 15
  VP, 60
machine learning, 2
Matthews coefficient, 199
Mercer kernel, 21
Mercer's theorem, 22
Morgan index, 97
multiclass classification, 2, 41
  Bayes function, 148, 149
natural language processing, 115
network unfolding, 106
neural networks, 106
NLP, 115, 129
NN, 106
on–line learning, 3
output function, 107
overfitting, 4
pairwise model, 154, 158
PARSEVAL measures, 140
part of speech, 132
partial order model, 152
partial order relation, 151
PCA, 141
PDB, 218
PHMM, 77
poly–Histidine tag, 218
POS, 132
precision, 199
predictive toxicology, 209
preference, 117, 151, 152
  loss function, 119, 153
preference model, 151
protein data bank, 218
PSI–BLAST, 228
PTC, 209
R–decomposition structure, 172
ranking, 117, 118, 151, 152
  loss function, 153
ranking model, 151
rate of false positive, 203
rate of true positives, 203
recall, 199
recursive neural networks, 106
regression, 2
  Bayes function, 147
regularization theory, 18
reinforcement learning, 3
relational learning, 217
remote homology, 202
representer theorem, 27
reranking task, 134
RFP, 203
risk functional, 16
RKHS, 24
RNN, 106
  preference model, 120
ROC, 203, 208
RTP, 203
sample error, 18
SCOP, 202
selector, 182
semi–supervised learning, 3
semipattern predictor, 225
sigmoid, 227
state transition function, 107
statistical learning theory, 13
structured data, 4
subcellular localization, 197
SubLoc, 198
supersource transduction, 107
supervised learning, 2, 14
support vector machines, 30
SVM, 30
  classification, 31
  clustering, 37
  complexity, 38
  dual, 34
  loss functions, 15
  preference model, 125
  primal, 33
  regression, 36
    preference model, 126
  regularization, 56
  training algorithm, 44
target space, 15
tensor product, 65, 172
Tikhonov regularization, 18
total order relation, 152
totters, 97
tree reduction, 133
tree specialization, 133
unbounded SVs, 35
unsupervised learning, 3
utility function, 120
utility function model, 154, 156
voted perceptron, 43
VP, 43
  complexity, 59
  dual, 50
  dual preference model, 127
  dual variables, 52
  loss function, 60
  prediction function, 46
WDK, 179, 182
  algorithm, 186
  complexity, 186
  general form, 182
  molecule, 184
  sequence, 183
well–posed problem, 17
working set, 38
zinc, 216
zinc site, 218
  catalytic, 219
  interface, 220
  pattern, 222
  structural, 219
zinc site pattern, 222