
Data Dependent Risk Bounds and Algorithms
for Hierarchical Mixture of Experts Classifiers
Research Thesis
Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Electrical Engineering
Arik Azran
Submitted to the Senate of the Technion - Israel Institute of Technology
TISHREI, 5764
HAIFA
JUNE, 2004
The Research Thesis Was Done Under the Supervision of Professor Ron Meir
in the Faculty of Electrical Engineering.
The Generous Financial Help of the Technion is Gratefully Acknowledged.
Contents

1 A short introduction to pattern classification
  1.1 Example of a pattern classification problem
  1.2 Formulating the binary pattern classification problem
  1.3 A brief survey of classifiers
    1.3.1 Nearest Neighbor Classifiers
    1.3.2 Kernel classifiers
    1.3.3 Neural networks
    1.3.4 Tree classifiers
  1.4 Summary

2 A short introduction to statistical learning theory
  2.1 Supervised learning and generalization
  2.2 Vapnik-Chervonenkis theory
  2.3 Concentration inequalities
    2.3.1 The basics
    2.3.2 Sums of independent random variables
    2.3.3 General mappings of independent random variables
  2.4 Summary

3 Data dependent risk bounds
  3.1 The φ-risk
  3.2 The Rademacher complexity
  3.3 Risk bounds
  3.4 Improved risk bound
    3.4.1 Improved bound for R_N(φ ∘ F)
    3.4.2 Deriving an improved risk bound
  3.5 Summary

4 Risk bounds for Mixture-of-Experts classifiers
  4.1 Mixture of Experts Classifiers
  4.2 Model selection
  4.3 Establishing risk bounds for MoE classifiers
  4.4 Summary

5 Risk bounds for Hierarchical Mixture of Experts
  5.1 Some preliminaries & definitions
  5.2 Upper bounds for HMoE R̂_N(F)
  5.3 Summary

6 Fully data dependent bounds
  6.1 Preliminaries
  6.2 Fully data dependent risk bound for MoE classifiers
    6.2.1 Some definitions, notations & preliminary results
    6.2.2 Establishing a fully data dependent bound
  6.3 Summary

7 Numerical experiments
  7.1 Numerical experiments protocol
    7.1.1 Synthetic data
    7.1.2 Real world data
  7.2 Algorithms
    7.2.1 Cross-Entropy
    7.2.2 Greedy search of local minima
  7.3 Synthetic data set results
  7.4 Real world data set results
  7.5 Summary

A About the Lipschitz property of functions
List of Figures

1.1  A classification problem example.
1.2  Supervised learning block diagram.
1.3  One dimensional demonstration of class complexity.
1.4  A training sequence, after pre-processing.
1.5  Model selection - underfitting.
1.6  Model selection - overfitting.
1.7  Model selection - fitting.
1.8  Some widely used activation functions.
1.9  A single neuron block diagram.
1.10 Two layer feedforward network with M neurons.
1.11 Three layer feedforward network.
1.12 Binary tree classifier.
2.1  The estimation-approximation trade off.
2.2  The VC dimension for linear classifiers in the two dimensional space.
3.1  Some widely used functions as φ(yf(x)).
3.2  Risk bounds comparison.
4.1  MoE classifier with M experts.
4.2  A combination of simple classifiers in the feature space.
4.3  A block diagram of the classifier in Figure 4.2.
5.1  Balanced two-levelled HMoE classifier with M experts.
7.1  R cross-validation data partition.
7.2  Two-stage cross-validation data partition.
7.3  A visual demonstration of the synthetic data set.
7.4  Synthetic data set results.
Abstract
According to the Oxford dictionary, a pattern is defined as a way in which something happens, and the verb recognize is described as knowing, (being able to) identify again (a person or a thing) that one has seen, heard, etc. before. Thus, by recognizing a pattern we identify something that is similar to something that we saw in the past. Pattern Recognition is about guessing the unknown nature of an observation as one out of a set of possibilities. In this work an observation is a collection of numerical measurements, denoted by x, where x ∈ R^k. The unknown nature of the observation is referred to as the label, denoted in this work by y, where y ∈ {0, 1}. In pattern recognition a mapping f : R^k ↦ {0, 1} is defined. This mapping is referred to as the classifier.
There are many ways in which classifiers can be defined. One possibility is the Hierarchical Mixture of Experts classifier, which is the focus of our discussion. The Hierarchical Mixture of Experts classifier is given by a recursive soft partition of the feature space R^k in a data-driven fashion. Such a procedure enables local classification, where several experts are used, each of which is assigned the task of classification over some subspace of the feature space. In this work, we provide data-dependent error bounds for this class of models, which lead to effective procedures for performing model selection. Tight bounds are particularly important here, because the model is highly parameterized. The theoretical results are complemented with some numerical experiments based on a randomized algorithm, which mitigates the effects of local minima that plague other approaches such as the expectation-maximization algorithm.
This work consists of three parts.
Introductory chapters Chapters 1-3 establish the framework of our discussion and introduce the mathematical and probabilistic tools that are used throughout this work. Though most of the results are taken from the literature, they are combined in a way which leads to the tightest achievable bounds. Furthermore, some known results are combined with the recently proposed 'Entropy method' for concentration inequalities to establish a new risk bound that often outperforms the best known bounds.

Analytical chapters In Chapters 4-6 the Mixture of Experts and Hierarchical Mixture of Experts classifiers are introduced and discussed. These classifiers are studied extensively and several analytical results regarding them are established.

Numerical chapter Chapter 7 provides a numerical demonstration of the way in which the analytical results can be used in practical problems. It also provides a comparison between the performance of Mixture of Experts classifiers and some kernel classifiers using real data sets.
Notation and Abbreviations

R                   the set of all real numbers
R_+                 the set of all nonnegative real numbers
E_Z                 expectation with respect to the distribution of the random variable Z
k                   dimension of the feature space
X                   feature space, R^k unless stated otherwise
Y                   label space, {±1}
x                   feature vector
y                   label
P                   (unknown) distribution, defined over X × Y
N                   size of the training sequence
D_N                 training sequence {(X_n, Y_n)}_{n=1}^N, drawn i.i.d. according to P
M                   number of experts
a_m(·)              m-th gating function, mapping R^k ↦ R_+
h_m(·)              m-th expert, mapping R^k ↦ R
f(·)                classifier, mapping R^k ↦ R
w_m, v_m            parameters of the m-th gating function and expert
W_m, V_m            feasible sets of w_m, v_m
W_max^m, V_max^m    diameters of W_m, V_m
σ_n                 balanced binary random variable, n = 1, 2, ..., N
σ                   vector of N i.i.d. random variables [σ_1, σ_2, ..., σ_N]
φ(·)                loss function, mapping R ↦ R_+
φ̄                   supremum of φ over R
I[·]                indicator function, mapping R ↦ {0, 1}
F, H, A             classes of all mappings f, h, a
M_A, M_H            suprema over all functions in A, H
L_a, L_h, L_φ       Lipschitz constants of a, h, φ
R̂_N(F)              empirical Rademacher complexity
R_N(F)              Rademacher complexity
P_e(f)              risk of the classifier f
P̂_e(f, D_N)         empirical risk of the classifier f
E_φ(f)              φ-risk of the classifier f
Ê_φ(f, D_N)         empirical φ-risk of the classifier f
φ ∘ F               function composition
MoE                 Mixture of Experts
HMoE                Hierarchical MoE
Chapter 1
A short introduction to pattern classification
Imagine a toll road in which a fee is charged according to the vehicle size. A camera located
at the entrance takes pictures of the vehicles. The pictures are then transferred to a machine
(a processing unit) which classifies the vehicle in each picture as Small or Large. Making
the decision regarding the type of vehicle is an example of what is referred to as pattern
classification.
This chapter provides a brief introduction to the problem of pattern classification, focusing
on the binary case. To introduce the nature of the problem, we extend the example above,
followed by a formal definition of the problem within a statistical framework. Section 1.3
presents a short survey of classifiers widely used in real world applications.
A reader familiar with these topics may skim this chapter, paying attention only to the notations and definitions.
1.1 Example of a pattern classification problem
Consider the toll road example above (see Figure 1.1), and assume that the fee is £1 for small
and 2£ for large vehicles. The machine extracts the vehicle length and height from the picture
and uses these measurements to make a decision regarding the vehicle type. Denote by x1
and x2 the length and height measurements, respectively. The machine’s task is to select a
mapping f from the two dimensional space X = {(x1 , x2 ) : 2.5 ≤ x1 ≤ 18, 1 ≤ x2 ≤ 4} into
Y = {Small,Large}.
Figure 1.1: A classification problem example.
A camera on a toll road takes pictures of the vehicles entering the road, each of which is classified
as Small or Large. First, each vehicle is segmented (isolated from the environment). Then, the
characteristics of the vehicle, such as length and height, are extracted and described as a feature
vector x = [length, height]. Some pre-processing of x, such as mean subtraction and variance normalization, might take place, followed by the classification of x with one of the labels {Small, Large}.
Suppose the machine is required to independently learn the rule which is used to classify
the vehicles. Such a scheme is addressed in the context of supervised learning (or learning
with a teacher ), where the machine is provided with a collection of labelled examples, based
on which the most ‘appropriate’ classifier is selected. This classifier is then used to classify
new, unlabelled examples. A block diagram, describing the scheme of supervised machine
learning, is provided in Figure 1.2.
Figure 1.4 describes a collection of 100 samples (after preprocessing), some of which are
Figure 1.2: Supervised learning block diagram.
In supervised learning the machine receives a collection of labelled examples. The learning process
is carried out under the supervision of a ‘teacher’ that motivates the machine to improve its performance by introducing some measurement of the error between the desired and the actual outputs.
Generally speaking, the improvement is performed by minimizing an error measure.
labelled 'Small' and the rest 'Large'. The machine selects a mapping f : X ↦ Y based on this set. Figures 1.5 and 1.6 describe two possible classifiers: the former is too simple for the training sequence, a phenomenon referred to as underfitting, and the latter is too complex, resulting in overfitting (see [11] for further discussion). Figure 1.7 describes a classifier which seems to be 'just right' for the training sequence. Generally, we would like to limit our machine to select a classifier from a given class F, hoping that neither underfitting nor overfitting occurs. To do so, a class of suitable complexity should be selected. One practical way of controlling the size of F uses a regularization parameter, as demonstrated in the following example.

Example (Parameterized regularization).
Let $v = [v_1, v_2, \ldots, v_M] \in \mathbb{R}^M$. Fix some positive integer $K$ and define a nonnegative, monotonically nondecreasing sequence $\{V_1, V_2, \ldots, V_K\}$, where $V_k \in \mathbb{R}_+$ for all $k = 1, 2, \ldots, K$. Consider the one dimensional function $h(v, x) = \sum_{m=1}^{M} \tanh(v_m x)$ and define the collection of classes
$$\mathcal{H}_k = \left\{ h(v, x) = \sum_{m=1}^{M} \tanh(v_m x) \,:\, |v_m| \le V_k,\ m = 1, 2, \ldots, M \right\}$$
for all $k$. It is easy to see that
$$\mathcal{H}_k \subseteq \mathcal{H}_{k+1} \quad \text{for all } k = 1, 2, \ldots, K - 1\,.$$
Thus, the size of the class Hk is controlled by setting the parameter Vk . Figure 1.3 demonstrates this idea using 3 functions, drawn from three different classes for which V1 = 1, V2 = 5
and V3 = 30.
Figure 1.3: One dimensional demonstration of class complexity.
We set M = 10 and chose some arbitrary v such that |v_m| < 1 for all m = 1, 2, ..., 10. The functions h(v, x), h(5v, x) and h(30v, x) are shown. It is clear that as the radius of the feasible set is increased, the functions in the associated class become more 'complex'.
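The demonstration of Figure 1.3 is straightforward to reproduce. The following Python sketch (illustrative only, not part of the original thesis; the function name, random seed and evaluation grid are our own choices) evaluates h(v, x) for the three scalings used above:

```python
import numpy as np

def h(v, x):
    """The one dimensional function h(v, x) = sum_{m=1}^{M} tanh(v_m * x)."""
    return np.tanh(np.outer(np.atleast_1d(x), v)).sum(axis=1)

M = 10
rng = np.random.default_rng(0)
v = rng.uniform(-1.0, 1.0, size=M)        # arbitrary v with |v_m| < 1
x = np.linspace(-1.0, 1.0, 201)

# Members of the classes with V_1 = 1, V_2 = 5 and V_3 = 30, respectively.
curves = {V: h(V * v, x) for V in (1, 5, 30)}
# As V grows, the transitions of h(Vv, x) sharpen, i.e. the class is 'richer'.
```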
The problem of setting the class F is known as the problem of model selection, and is a
main focus of our discussion. To put our work in context we point out some of the main
problems of machine learning:
Feature extraction What are the vehicle characteristics that provide the most relevant
information? Will other attributes, such as vehicle weight, speed or color make the
classification easier?
Feature selection Following feature extraction, it may seem that some of the features are practically useless as far as classification is concerned. The problem of finding the useful features is of major significance in applications where thousands of features are available.
Figure 1.4: A training sequence, after pre-processing.
Figure 1.5: Model selection - underfitting.
Out of the collection of all linear classifiers, the described classifier seems to be the most ‘suitable’
one.
Figure 1.6: Model selection - overfitting.
If we consider a very large class of classifiers, the selected classifier might fit the labelled samples
too well. Though almost every sample in the training sequence is classified correctly, we expect
such a classifier to have poor generalization performance, measured by its classification of future
examples.
Figure 1.7: Model selection - fitting.
The described classifier seems most likely to capture the ‘true’ nature of the underlying source.
Prior knowledge How can prior knowledge be exploited? For example, if it is known that
small vehicles are more likely to appear during the mornings than during midday, how
can this knowledge be used to improve classification?
Noise All measurements are subject to inaccuracies. How can assumptions concerning such
inaccuracies be combined with the procedure of obtaining a classifier?
Segmentation How can the object be ‘isolated’ from the environment? For example, how
are the pixels of the picture which represent the vehicle selected?
Performance How is the classifier’s performance measured? In the sequel it will be shown
that using the 0 − 1 loss in real world applications is usually inferior to other possible
loss functions.
1.2 Formulating the binary pattern classification problem
In the previous section a simple example was used to demonstrate the problem of binary
pattern classification. We now provide an exact formulation of the problem within a statistical framework [10, 36, 12]. Let (X, Y ) be a random pair in X × Y drawn according
to some probability measure P (X, Y ). In binary pattern classification we wish to obtain a
classifier f : X ↦ Y that classifies an observation X ∈ X using one of two labels, for example Y ∈ Y = {±1}. We say that a classification error occurs if f(X) ≠ Y. We evaluate the performance of the classifier f by the risk (a.k.a. probability of error), defined as
$$P_e(f) = \mathbb{E}_{X,Y}\left\{ I\left[ Y f(X) \neq 1 \right] \right\}, \tag{1.1}$$
where
$$I[Y f(X) \neq 1] = \begin{cases} 1 & \text{if } Y f(X) \neq 1 \\ 0 & \text{otherwise.} \end{cases} \tag{1.2}$$
It is desirable that the selected classifier minimizes the risk with respect to a given source.
Theorem 1.1 [10] provides us with the absolute best achievable performance and the classifier
associated with it.
Theorem 1.1 (Bayes classifier) Let $(X, Y)$ be a random pair defined over the product space $\mathcal X \times \mathcal Y$ with some p.d.f. $P(X, Y)$, and denote the a posteriori probability of $Y$ given the observation $x$ by $\eta_y(x) = P\{Y = y \mid X = x\}$, $y \in \{\pm 1\}$. Consider the class $\mathcal F$ of all possible mappings $f : \mathcal X \mapsto \{\pm 1\}$. Then, every $f \in \mathcal F$ satisfies
$$P_e(f_B) \le P_e(f)\,,$$
where $f_B(x) = \operatorname*{argmax}_{y \in \{\pm 1\}} \eta_y(x)$ is the Bayes classifier. The risk incurred by this classifier is referred to as the Bayes risk, given by
$$P_e(f_B) = \mathbb{E}\left\{ \min_{y \in \{\pm 1\}} \eta_y(x) \right\}.$$
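To make the theorem concrete, the following sketch (our own toy construction; the joint distribution table is invented purely for illustration) computes the Bayes classifier and the Bayes risk for a source over a finite feature space, where η_y(x) can be evaluated exactly:

```python
import numpy as np

# A toy source over a finite feature space: P(X = x, Y = y) given as a table.
xs = np.array([0, 1, 2])
p_joint = {  # p_joint[y][i] = P(X = xs[i], Y = y)
    +1: np.array([0.30, 0.10, 0.05]),
    -1: np.array([0.05, 0.20, 0.30]),
}
p_x = p_joint[+1] + p_joint[-1]          # marginal P(X = x)
eta_plus = p_joint[+1] / p_x             # eta_{+1}(x) = P(Y = +1 | X = x)

# Bayes classifier: f_B(x) = argmax_y eta_y(x)
f_B = np.where(eta_plus >= 0.5, 1, -1)

# Bayes risk: E{ min_y eta_y(x) }
bayes_risk = np.sum(p_x * np.minimum(eta_plus, 1.0 - eta_plus))
print(f_B, bayes_risk)    # -> [ 1 -1 -1] and a Bayes risk of 0.2
```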
However, notice that to be able to determine fB (x), we need to have the conditional
distribution at our disposal. In real life, this condition rarely holds, and usually we have
partial or no information about the distribution at all. In the context of supervised learning it
is assumed that a set of samples, drawn i.i.d. from the source, is available. We refer to this
set as the training sequence, denoted by $D_N$,
$$D_N = \{(X_n, Y_n)\}_{n=1}^{N} \in \left( \mathbb{R}^k \times \mathcal{Y} \right)^N \sim P(X, Y)^N. \tag{1.3}$$
Based on this collection of examples we wish to obtain a classifier f from a class F. Even
though the Bayes classifier is usually not achievable using DN alone, it is very useful to have
theoretical absolute bounds for the performance of any possible classifier.
1.3 A brief survey of classifiers
To put this work in context, we give a short overview of some well known classification
methods, widely used in real world applications. Some comprehensive discussions can be
found in [10, 11, 13, 5]. Recall that we have a training set of N samples, D_N, based on
which we wish to choose a classifier fˆN ∈ F to classify any future sample x ∈ Rk with a
label y ∈ Y such that Pe (fˆN ) will be as small as possible.
1.3.1 Nearest Neighbor Classifiers
Given a metric d(x, x′) on R^k and a positive integer p, the p nearest neighbor classifier is a mapping R^k ↦ Y, dependent on p elements of D_N. More formally, fix any x ∈ R^k and
assume, w.l.o.g., that
d(x, xi ) ≤ d(x, xj ) , 1 ≤ i < j ≤ N .
(1.4)
The p nearest neighbors of x is the set {(x1 , y1 ), . . . , (xp , yp )}. The p nearest neighbors
classifier labels x according to a majority vote policy,
fˆN (x) = majority(y1 , . . . , yp ) .
(1.5)
Notice that there is a need for some procedure in case of a tie. One immediate refinement
of this classifier is the assignment of different weights to different neighbors of x, typically
decreasing with respect to the distance between x and the sample. For the binary case, this
can be described as
$$\hat f_N(x) = \mathrm{sgn}\left( \sum_{n \in I_1} w_n - \sum_{n \in I_{-1}} w_n \right), \tag{1.6}$$
where the nonnegative numbers w_1, ..., w_p are the weights, I_1 = {n : 1 ≤ n ≤ p, y_n = 1}
and I−1 is defined similarly. A surprising result, proved by Bailey and Jain [2], states that
for all distributions for which P {ηy (x) = 1/2} < 1, the weighted p nearest neighbor classifier
achieves its minimal asymptotic risk when the weights are uniform. So, for large samples
the p nearest neighbors classifier should be preferred over the weighted p nearest neighbors
classifier. However, this does not mean that the latter should be ignored when a sample
of finite length is given. In fact, Royall proved that if p is allowed to vary with N then
nonuniform weights are advantageous [31].
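A minimal sketch of the (weighted) p nearest neighbor rule of (1.5)-(1.6) is given below; the function names and the Euclidean choice of metric are our own assumptions, and ties are broken arbitrarily in favor of +1:

```python
import numpy as np

def knn_classify(x, X_train, y_train, p, weights=None):
    """Weighted p nearest neighbor rule of (1.6); uniform weights recover (1.5).

    X_train: (N, k) feature matrix, y_train: labels in {+1, -1},
    weights: length-p nonnegative vector, or None for uniform weights.
    """
    d = np.linalg.norm(X_train - x, axis=1)      # Euclidean metric d(x, x_n)
    nearest = np.argsort(d)[:p]                   # indices of the p nearest samples
    w = np.ones(p) if weights is None else np.asarray(weights)
    score = np.sum(w * y_train[nearest])          # sum over I_1 minus sum over I_-1
    return 1 if score >= 0 else -1                # ties broken in favor of +1

# Usage with a toy training sequence:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(knn_classify(np.array([0.5, 0.2]), X, y, p=5))
```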
1.3.2 Kernel classifiers
Similar to the nearest neighbor classifier, the kernel classifier labels any point x ∈ Rk according to a majority vote among the labels of the training sequence. However, while the former
considers only the p nearest points to x, the latter performs a weighted majority vote among
all the points of DN . The kernel classifier combines the labels of the training sequence with
data dependent weights, determined by some kernel function,
$$\hat f_N(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} I[y_i = 1]\, K(x, x_i) - \sum_{i=1}^{N} I[y_i = -1]\, K(x, x_i) \right). \tag{1.7}$$
A compact form of (1.7) is $\hat f_N(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i K(x, x_i) \right)$. The kernel function $K : \mathbb{R}^k \times \mathbb{R}^k \mapsto \mathbb{R}_+$ is usually monotonically decreasing with respect to the distance between $x$ and $x_i$. The following example describes a very popular kernel function.
Example (Weighted Gaussian kernel function).
The Gaussian kernel is given by
$$K(x, x_i) = e^{-\|x - x_i\|^2}. \tag{1.8}$$
An immediate refinement of this classifier assigns a different weight to each dimension of the feature space,
$$K(x, x_i) = e^{-(x - x_i)^\top A (x - x_i)}. \tag{1.9}$$
Such a weighted distance measure is useful for two reasons:
1. It enables rescaling the features when each one of them has different units (such as
length in cm. and speed in m.p.h.).
2. If some of the features seem to be more relevant than others when classification is
concerned, they can be weighted according to some measure of their relevance.
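The following sketch (an illustration under our own naming conventions; the matrix A plays the role of the per-dimension weights in (1.9)) implements the compact form of (1.7) with the weighted Gaussian kernel:

```python
import numpy as np

def weighted_gaussian_kernel(x, X_train, A):
    """K(x, x_i) = exp(-(x - x_i)^T A (x - x_i)), one value per training point."""
    diff = X_train - x                           # (N, k)
    return np.exp(-np.einsum('ij,jk,ik->i', diff, A, diff))

def kernel_classify(x, X_train, y_train, A):
    """Compact form of (1.7): sgn( sum_i y_i K(x, x_i) )."""
    scores = y_train * weighted_gaussian_kernel(x, X_train, A)
    return 1 if scores.sum() >= 0 else -1

# Usage: rescale the second feature to count three times as much.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 1] > 0, 1, -1)
A = np.diag([1.0, 3.0])                          # per-dimension weights of (1.9)
print(kernel_classify(np.array([0.1, -0.4]), X, y, A))
```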
1.3.3 Neural networks
A neural network is an assembly of some well defined computational elements, called ‘artificial
neurons’, operating collectively. These ‘neurons’ may be interconnected in various ways to
form a network. For example, the output of some of them may be used as inputs to others.
The output of the neuron typically depends on a weighted sum of the inputs, exercising a
threshold behavior by exhibiting different outputs depending on whether the sum exceeds
some threshold.
An artificial neuron is a mapping R^{k+1} ↦ Y ⊆ [−1, 1], where Y can be discrete, taking only the values {±1}. The mapping is given by some function ϕ(·), referred to as
the activation function. The most popular activation function is a sigmoid, which has the
property
$$\varphi(t) \to \begin{cases} 1 & t \to \infty \\ -1 & t \to -\infty, \end{cases} \tag{1.10}$$
where $t$ is usually defined as the difference between the inner product of some parameter $\omega \in \mathbb{R}^k$ and $x$, and a threshold $\omega_0$, i.e. $t = \omega^\top x - \omega_0$.
Some widely used activation functions, described graphically in Figure 1.8, are given here mathematically:
1. Threshold activation function:
$$\varphi(t) = \mathrm{sgn}(t)\,.$$
2. Piecewise-linear activation function:
$$\varphi(t) = \begin{cases} 1 & t \ge 1, \\ t & -1 < t < 1, \\ -1 & t \le -1. \end{cases}$$
3. Logistic activation function:
$$\varphi(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}\,.$$
Figure 1.8: Some widely used activation functions.
Figure 1.9: A single neuron block diagram.
The elements of the feature vector x are linearly combined and compared with a threshold. The
difference between the weighted sum of the features and the threshold is then transferred through
a nonlinear mapping.
Notice that each of the sigmoids above satisfies $\varphi(Tt) \to \mathrm{sgn}(t)$ pointwise as $T \to \infty$ (except maybe for $t = 0$), so that they are a natural generalization of the threshold activation function. There are numerous possible architectures for building a neural network. We will give a general description of two basic networks, relevant to our discussion. The first is the two layer feedforward network.
Example (Two layer feedforward network).
A two layer feedforward network is a collection of $M$ 'neurons' (see Figure 1.10), each of which receives the same input $x$; their outputs are linearly combined to produce the network outcome,
$$f(x) = \varphi\left( \sum_{m=1}^{M} \upsilon_m \varphi_m\left( \omega_m^\top x - \omega_{m,0} \right) \right). \tag{1.11}$$
For all $m = 1, 2, \ldots, M$, $\upsilon_m \in \mathbb{R}$ and $[\omega_m^\top, \omega_{m,0}] \in \mathbb{R}^{k+1}$ are the parameters of the $m$'th 'neuron'.
Figure 1.10: Two layer feedforward network with M neurons.
The second example is a three layer feedforward network.
Example (Three layer feedforward network).
An example of a three layer feedforward network is given in Figure 1.11, where M1 neurons
(the ‘input layer’) receive the feature vector as their input. Their outputs are then used as
the inputs for M2 neurons, called the ‘hidden layer’. The outputs of the hidden layer are
used, in turn, as the inputs to a neuron (the ‘output layer’), whose output is the network
output. The mathematical description of this network is given by
$$f(x) = \varphi\left( \sum_{m=1}^{M_1} \upsilon_m \varphi_m\left( \sum_{j=1}^{M_2} \upsilon_{mj} \varphi_{mj}\left( \omega_j^\top x - \omega_{j,0} \right) \right) \right), \tag{1.12}$$
where $\upsilon_m, \upsilon_{mj} \in \mathbb{R}$ and $[\omega_j^\top, \omega_{j,0}] \in \mathbb{R}^{k+1}$ for all $m = 1, 2, \ldots, M_1$ and $j = 1, 2, \ldots, M_2$.
Figure 1.11: Three layer feedforward network.
Remark. Similar to the way in which the two layer feedforward network is generalized to a three layer network, it can also be generalized to a multilayer feedforward network with an arbitrary number of hidden layers.
The parameters of the neural network are set to minimize some loss function. For example, denote the desired output for the sample $x_n$ by $d_n$ and define the sample error $e_n = d_n - f(x_n)$. The network parameters are then set by minimizing the cost function $\varepsilon = \frac{1}{N}\sum_{n=1}^{N} e_n^2$, possibly subject to some regularization conditions.
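As an illustration, the following sketch (not from the thesis; tanh is used as a concrete activation, and the parameter shapes are our own conventions) computes the forward pass of the two layer network (1.11) and the cost ε that a training algorithm would minimize:

```python
import numpy as np

def two_layer_net(x, upsilon, omega, omega0, phi=np.tanh):
    """Forward pass of (1.11): phi( sum_m upsilon_m * phi(omega_m^T x - omega_m0) )."""
    hidden = phi(omega @ x - omega0)      # omega: (M, k), omega0: (M,)
    return phi(upsilon @ hidden)          # upsilon: (M,)

def cost(params, X, d):
    """Mean squared sample error: (1/N) sum_n (d_n - f(x_n))^2."""
    upsilon, omega, omega0 = params
    e = np.array([d_n - two_layer_net(x_n, upsilon, omega, omega0)
                  for x_n, d_n in zip(X, d)])
    return np.mean(e ** 2)

# A randomly initialized network with M = 4 neurons on k = 2 features:
rng = np.random.default_rng(3)
M, k = 4, 2
params = (rng.normal(size=M), rng.normal(size=(M, k)), rng.normal(size=M))
X = rng.normal(size=(50, k))
d = np.where(X[:, 0] > 0, 1.0, -1.0)
print(cost(params, X, d))    # the quantity a training algorithm would minimize
```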
1.3.4 Tree classifiers
Classification trees usually provide a hard recursive partition of R^k into regions (see Figure
1.12). The parameters of the tree are used in a straightforward manner to determine the
boundaries of the different regions.
Figure 1.12: Binary tree classifier.
Notice that the tree classifier and general multilayer feedforward neural networks share a closely related structure. In this work we address a classifier that attempts to combine the
best of these two classifiers.
1.4 Summary
In this chapter the binary classification problem was introduced and illustrated using a
simple example. A formal description of the problem within a statistical framework was also
introduced, followed by a short discussion concerning the solution for the case where the
conditional distribution of Y given X is available. For this case, the best classifier (Bayes
classifier) and the associated risk (Bayes risk) were provided. For the more realistic scenario,
where the conditional distribution is not at our disposal but a training sequence is available,
several practical classifiers were proposed.
Chapter 2
A short introduction to statistical learning theory
In real world applications the source distribution is unknown. In the context of our discussion,
a classifier fˆN is selected based on some sequence drawn i.i.d. according to the unknown
distribution. Thus, it is not realistic to expect the selected classifier to achieve the same
performance as the Bayes classifier, fB . The mismatch between the two classifiers, measured
by Pe (fˆN ) − Pe (fB ), is the focus of our discussion.
This chapter serves as an introduction to the mathematical tools that form the basis of
our discussion. It deals with the concept of generalization and serves as an introductory
chapter for what is known as concentration inequalities, used to measure the concentration
of a random variable around its mean. Such inequalities are found to be very useful in
the context of learning problems where one wishes to minimize Pe (f ) when only a training
sequence is available. The interested reader is referred to [10, 1, 24] for a comprehensive
discussion.
2.1 Supervised learning and generalization
Let us begin with a precise formulation of the problem. Let $\mathcal F$ be a class of mappings $f : \mathbb{R}^k \mapsto \mathbb{R}$, and let $\hat f_N \in \mathcal F$ be a function selected based on the training sequence $D_N$. We define the empirical risk
$$\hat P_e(f, D_N) = \frac{1}{N} \sum_{n=1}^{N} I\left[ y_n\, \mathrm{sgn}(f(x_n)) = -1 \right], \tag{2.1}$$
and the empirical risk minimizer over the class,
$$\hat f_N^* = \operatorname*{argmin}_{f \in \mathcal F} \hat P_e(f, D_N)\,. \tag{2.2}$$
We emphasize that $\hat f_N$ can be an arbitrary function of the data, not necessarily identical to $\hat f_N^*$. In fact, we do not even assume that $P_e(\hat f_N) = P_e(\hat f_N^*)$ or that $\hat P_e(\hat f_N, D_N) = \hat P_e(\hat f_N^*, D_N)$. Given the class $\mathcal F$, our goal is to obtain the risk minimizer over the class,
$$f_{\mathcal F} = \operatorname*{argmin}_{f \in \mathcal F} P_e(f)\,. \tag{2.3}$$
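The definitions (2.1)-(2.2) are simple to realize numerically. The sketch below (our own toy setup, assuming a finite class of one dimensional threshold classifiers and a noiseless source) computes the empirical risk of each member of F and selects the empirical risk minimizer:

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical risk (2.1): fraction of samples with y_n * sgn(f(x_n)) = -1."""
    return np.mean(y * np.sign(f(X)) == -1)

# A small finite class F of one dimensional threshold classifiers f_t(x) = x - t.
thresholds = np.linspace(-1, 1, 21)
F = [lambda X, t=t: X - t for t in thresholds]

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=200)
y = np.where(X > 0.3, 1, -1)                    # noiseless source, threshold 0.3

risks = [empirical_risk(f, X, y) for f in F]
f_star = F[int(np.argmin(risks))]               # empirical risk minimizer (2.2)
print(thresholds[int(np.argmin(risks))])        # close to the true threshold 0.3
```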
Since the source distribution is not available, one may believe that the best thing to do is
set fˆN = fˆN∗ . This approach is motivated by the philosophy that underlies the laws of large
numbers. Generally speaking, these laws state that by choosing N to be large enough, the
empirical risk and the true risk can be arbitrarily close with probability that tends to 1.
In the sequel, several procedures that lead to tight bounds are discussed, but for now, to
demonstrate a formal way of phrasing this, we are satisfied with the well known Chebyshev’s
inequality. According to Chebyshev's inequality, for any random variable $Z$ and every $\varepsilon > 0$,
$$P\{|Z - \mathbb{E}Z| \ge \varepsilon\} \le \frac{\mathrm{Var}\{Z\}}{\varepsilon^2}\,. \tag{2.4}$$
Setting $Z = \sum_{n=1}^{N} Z_n$, where the $Z_n$ are drawn independently, yields
$$P\{|Z - \mathbb{E}Z| \ge \varepsilon\} \le \frac{\sum_{n=1}^{N} \mathrm{Var}\{Z_n\}}{\varepsilon^2}\,. \tag{2.5}$$
By defining $Z_n = I[Y_n\, \mathrm{sgn}(f(X_n)) = -1]$ and $p = P\{Z_n = 1\}$ for all $n = 1, 2, \ldots, N$, and normalizing properly, we have
$$P\left\{ \left| \hat P_e(f, D_N) - P_e(f) \right| \ge \varepsilon \right\} \le \frac{p(1-p)}{N \varepsilon^2}\,. \tag{2.6}$$
This result states that for every $\varepsilon > 0$, there is a positive integer $N$ for which $\left| \hat P_e(f, D_N) - P_e(f) \right| < \varepsilon$ with probability that is arbitrarily close to 1. Thus, we may expect that with overwhelming probability the empirical risk minimizer $\hat f_N^*$ will minimize the risk as well. It is not surprising to see that as $N$ tends to infinity, this probability tends to 1. However, while this result can be theoretically useful, in practical problems we only have a finite length training sequence. The following example illustrates one of the problems that might arise in such a setup when we select $\hat f_N^*$.
Example (Uniform bounds).
Denote by $\mathcal F$ the family of all possible mappings $\mathbb{R}^k \mapsto \{\pm 1\}$ and define
$$\mathcal F_N = \left\{ f : \hat P_e(f, D_N) = \hat P_e(\hat f_N^*, D_N) \right\}.$$
Notice that any classifier for which $f(x_n) = y_n$, $n = 1, 2, \ldots, N$, is a member of $\mathcal F_N$. One of these classifiers is given by
$$f_0(x) = \begin{cases} y_n & x = x_n \\ 1 & \text{otherwise,} \end{cases}$$
implying $\hat P_e(f_0, D_N) = 0$. Yet, it is very easy to find a distribution of $(X, Y)$ such that
$$P\left\{ P_e(f_0) - \inf_{f \in \mathcal F} P_e(f) > \varepsilon \right\} = 1$$
for any $\varepsilon \in (0, 1/2)$.
So, we see that it is not sufficient to minimize $\hat P_e(f, D_N)$; we also need to consider the class from which the classifier is drawn. The problem of choosing the class $\mathcal F$ is known as the model selection problem. One possibility of addressing this problem is by setting uniform bounds, such as upper bounds for $P\left\{ \sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right| \ge \varepsilon \right\}$. The following lemma exhibits the importance of upper bounds for $\sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|$.
Lemma 2.1 (Vapnik and Chervonenkis, 1974)
$$\left| \hat P_e(\hat f_N^*, D_N) - P_e(\hat f_N^*) \right| \le \sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|,$$
and
$$P_e(\hat f_N^*) - P_e(f_{\mathcal F}) \le 2 \sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|.$$
Proof
The first inequality is trivial. As for the second inequality, we have
$$\begin{aligned}
P_e(\hat f_N^*) - P_e(f_{\mathcal F}) &= P_e(\hat f_N^*) - \hat P_e(\hat f_N^*, D_N) + \hat P_e(\hat f_N^*, D_N) - \hat P_e(f_{\mathcal F}, D_N) + \hat P_e(f_{\mathcal F}, D_N) - P_e(f_{\mathcal F}) \\
&\overset{(a)}{\le} P_e(\hat f_N^*) - \hat P_e(\hat f_N^*, D_N) + \hat P_e(f_{\mathcal F}, D_N) - P_e(f_{\mathcal F}) \\
&\le \left| \hat P_e(\hat f_N^*, D_N) - P_e(\hat f_N^*) \right| + \left| \hat P_e(f_{\mathcal F}, D_N) - P_e(f_{\mathcal F}) \right| \\
&\le 2 \sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|,
\end{aligned}$$
where (a) is due to the fact that $\hat f_N^*$ is the empirical risk minimizer. □
So, we see that upper bounds for $\sup_{f \in \mathcal F} |\hat P_e(f, D_N) - P_e(f)|$ provide us with upper bounds for two important quantities:
1. An upper bound for $|\hat P_e(\hat f_N^*, D_N) - P_e(\hat f_N^*)|$. This is of great importance when we choose our classifier by minimizing the empirical risk.
2. An upper bound for $P_e(\hat f_N^*) - P_e(f_{\mathcal F})$, the sub-optimality of $\hat f_N^*$ within $\mathcal F$.
To broaden our view of the model selection problem, recall that $P_e(f_B)$ is the lowest risk one could hope for, while $P_e(f_{\mathcal F})$ is the best achievable risk using classifiers from $\mathcal F$. The performance of $\hat f_N^*$ compared to $f_B$ can be expressed in the following way:
$$P_e(\hat f_N^*) - P_e(f_B) = \underbrace{P_e(\hat f_N^*) - P_e(f_{\mathcal F})}_{(1)} + \underbrace{P_e(f_{\mathcal F}) - P_e(f_B)}_{(2)}\,. \tag{2.7}$$
We give different interpretations to each of the terms above:
(1) The estimation error measures the mismatch between fˆN∗ and fF . For every given N ,
this error is always nonnegative and expected to grow as the class complexity grows.
(2) The approximation error measures the mismatch between fF and fB due to a suboptimal model selection. It is always nonnegative and equals zero if fB ∈ F .
2.2. VAPNIK-CHERVONENKIS THEORY
25
There is an inherent trade-off between these two measures. By enriching F we decrease the approximation error, down to zero once fB ∈ F. However, by doing so we also increase the estimation error, thus exhibiting the trade-off. So, on top of selecting a classifier from within F we need to solve an optimization problem to determine F itself. This idea, graphically illustrated in Figure 2.1, motivates the search for bounds which converge uniformly over the class.
Definition 2.2 (Uniform risk bounds) A risk bound is said to be uniform with respect
to a class of classifiers if it is of the following form:
For every δ ∈ (0, 1), with probability at least 1 − δ over training sequences of length N ,
every f ∈ F satisfies
Pe (f ) ≤ Ω(f, DN ) + Ψ(F, f, DN , δ),
where Ω(f, DN ) is some empirical assessment of the real probability of error and Ψ(F, f, DN , δ)
measures the class complexity.
As an example, Ω(f, DN ) can be the proportion of misclassified elements in DN , given
by P̂e (f, DN ). The term Ψ(F, f, DN , δ) is more context dependent, and can be addressed
in more than one way. In the next section we introduce the VC theory, which addresses the problem of measuring the class complexity and introduces a possible definition for
Ψ(F, f, DN , δ).
2.2 Vapnik-Chervonenkis theory
Vapnik and Chervonenkis [34, 35, 36] introduced a purely combinatorial measure of the class
complexity. First, observe that there exists a unique mapping between any binary classifier
in F and the subset of Rk which it maps to label ‘1’. Thus, instead of thinking of classifiers
in a class we may think of subsets in the feature space. This interpretation of a classifier
motivates the following definition.
Figure 2.1: The estimation-approximation trade off.
A solution for the model selection problem is given by the class which minimizes the sum of the risk assessment term and the complexity term; poorer classes underfit, while richer classes overfit.
Definition 2.3 (Shatter Coefficient) Given a set of points $S_N = \{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^k$ for every $n = 1, 2, \ldots, N$, and a class of mappings $\mathcal F : \mathbb{R}^k \mapsto \{\pm 1\}$, define $C_f = \{x \in \mathbb{R}^k : f(x) = 1\}$ and $\mathcal C_{\mathcal F} = \{C_f : f \in \mathcal F\}$. Let $\Delta_{\mathcal F}(S_N)$ denote the number of distinct sets of the form $\{S_N \cap C : C \in \mathcal C_{\mathcal F}\}$. The $N$'th shatter coefficient of $\mathcal F$ is defined as
$$S(\mathcal F, N) = \max_{S_N} \Delta_{\mathcal F}(S_N)\,.$$
Thus, $\Delta_{\mathcal F}(S_N)$ is simply the number of possible ways in which the set $S_N$ can be classified using classifiers from $\mathcal F$, and $S(\mathcal F, N)$ is the maximal number of different subsets of a set of $N$ points which can be obtained by classifying them using classifiers from $\mathcal F$. The VC dimension of a class $\mathcal F$, denoted by $V_{\mathcal F}$, is defined as the largest $N$ such that $S(\mathcal F, N) = 2^N$. If $S(\mathcal F, N) = 2^N$ for all $N$, we say that $V_{\mathcal F} = \infty$. We give a simple example that illustrates the meaning of the VC dimension.
Example (VC dimension).
For the class H of hyperplanes in Rk , VH = k + 1. We demonstrate it for k = 2 in Figure
2.2.
Figure 2.2: The VC dimension for linear classifiers in the two dimensional space.
For N = 3, all 8 different divisions are possible. For N = 4, some divisions cannot be achieved using linear classifiers.

Vapnik and Chervonenkis proved the following distribution free result.

Theorem 2.4 (Vapnik and Chervonenkis, 1971) Let $\mathcal F$ denote a class of mappings $\mathbb{R}^k \mapsto \{\pm 1\}$. Then, for any positive integer $N \ge \frac{V_{\mathcal F}}{2}$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ over training sequences of length $N$, every $f \in \mathcal F$ satisfies
$$P_e(f) \le \hat P_e(f, D_N) + 3\sqrt{\frac{V_{\mathcal F} \ln\frac{2eN}{V_{\mathcal F}} + \ln\frac{4}{\delta}}{N}}\,, \tag{2.8}$$
and
$$P_e(\hat f_N^*) \le P_e(f_{\mathcal F}) + 6\sqrt{\frac{V_{\mathcal F} \ln\frac{2eN}{V_{\mathcal F}} + \ln\frac{4}{\delta}}{N}}\,. \tag{2.9}$$
Remark. The following result, taken from [1] (Theorem 4.2), is an improved and simplified version of (2.8). Under the conditions of Theorem 2.4, there is an absolute constant $c$ such that
$$P_e(f) \le \hat P_e(f, D_N) + c\sqrt{\frac{V_{\mathcal F} + \ln\frac{1}{\delta}}{N}}\,.$$
Notice that the convergence rate of this version is slightly (logarithmically) better than the convergence rate of (2.8). In Chapter 4 we use this bound to compare previous results, based on $V_{\mathcal F}$, and new results, established in this work.
This result provides us with upper bounds for both the deviation of P̂e (f, DN ) from Pe (f )
and the sub-optimality of fˆN∗ within F. An immediate and significant implication of (2.9) is
that for any distribution and any class F with finite VF , learning a classifier with risk that
is arbitrarily close to the best achievable risk is possible, provided the training sequence is
large enough.
The VC theory also provides an answer to the question of how tight this bound is. In
the following theorem we consider all distributions P of the random pair (X, Y ) such that
Pe (fF ) is a constant.
Theorem 2.5 (Devroye, Györfi and Lugosi, [10]) Let $\mathcal F$ denote a class with VC dimension $V_{\mathcal F} > 2$ and define
$$\mathcal P = \left\{ P(X, Y) : P_e(f_{\mathcal F}) = P_{\mathcal F} \right\},$$
where $P_{\mathcal F} \in (0, 0.5)$ is fixed. Then, for any $N \ge \frac{V_{\mathcal F} - 1}{2 P_{\mathcal F}} \max\left(9, (1 - 2 P_{\mathcal F})^{-2}\right)$, every mapping $f \in \mathcal F$ satisfies
$$\sup_{P \in \mathcal P} \mathbb{E}\left\{ P_e(f) - P_{\mathcal F} \right\} \ge \sqrt{\frac{P_{\mathcal F}(V_{\mathcal F} - 1)}{24 N}}\; e^{-8}.$$
A significant implication of Theorems 2.4 and 2.5 is that for classes with finite $V_{\mathcal F}$, we have upper and lower bounds for the sub-optimality of $\hat f_N^*$ within $\mathcal F$. Furthermore, both bounds behave like $\sqrt{V_{\mathcal F}/N}$.
In spite of its attractiveness, the VC dimension has several disadvantages:
Excessive pessimism. Even though the existence of lower and upper bounds of the same magnitude is established, the proof of the lower bound is based on very 'bizarre' distributions. It is enough to impose some smoothness assumptions on the underlying p.d.f. to obtain much tighter bounds.
Problem geometry. The VC dimension is a purely combinatorial and distribution free measure of the class complexity, computed without any consideration of the source distribution. It is reasonable to expect that a measure which considers the source distribution will provide tighter bounds. In the sequel we will consider a data dependent complexity term which presumes no knowledge of the distribution, but uses the training sequence to measure the class complexity.
Finiteness and tightness. The VC dimension can be very large, and even infinite. Such behavior indicates that the bound might be very loose. Thus, it is of great importance to have complexity measures that exhibit a more restrained behavior.
2.3 Concentration inequalities
According to the law of large numbers, under rather mild conditions, the empirical average
of random variables drawn i.i.d. according to any distribution with finite mean is close to
its mean with probability that tends to one as the sample size increases. This behavior
was mathematically described in (2.6), and other inequalities of a similar nature are used
intensively throughout this work. We provide a brief overview of this subject, focusing on
some results that are relevant to our discussion.
2.3.1 The basics
The first result we mention is known as Markov’s inequality.
Lemma 2.6 (Markov's inequality) Let $Z$ be a nonnegative random variable. Then, for any $\varepsilon > 0$,
$$P\{Z \ge \varepsilon\} \le \frac{\mathbb{E}Z}{\varepsilon}\,.$$
An immediate extension of Lemma 2.6 for random variables which are not necessarily nonnegative is given by the following corollary.
Corollary 2.7 Let $Z$ be a random variable and let $\varphi(t)$ be any monotonically increasing and nonnegative function. Then, for any $\varepsilon > 0$,
$$P\{Z - \mathbb{E}Z \ge \varepsilon\} = P\{\varphi(Z - \mathbb{E}Z) \ge \varphi(\varepsilon)\} \le \frac{\mathbb{E}\varphi(Z - \mathbb{E}Z)}{\varphi(\varepsilon)}\,. \tag{2.10}$$
Notice that by setting $\varphi(t) = t^2$ we obtain Chebyshev's inequality. Also, by setting $\varphi(t) = e^{st}$, $s > 0$, and searching for the best positive $s$ we have Chernoff's bound,
$$P\{Z - \mathbb{E}Z \ge \varepsilon\} \le \inf_{s > 0}\left\{ e^{-s\varepsilon}\, \mathbb{E}e^{s(Z - \mathbb{E}Z)} \right\}. \tag{2.11}$$

2.3.2 Sums of independent random variables
The most important random variables in the context of this work are variants of $S_N = N^{-1} \sum_{n=1}^{N} Z_n$, where $\{Z_1, Z_2, \ldots, Z_N\}$ is a set of random variables drawn i.i.d. according to some distribution. From Chebyshev's inequality and the i.i.d. property of the random variables, we immediately have
$$P\{|\mathbb{E}S_N - S_N| \ge \varepsilon\} \le \frac{\mathrm{Var}(Z_1)}{N \varepsilon^2}\,. \tag{2.12}$$
However, while the central limit theorem leads to a tail probability that decays as $O(e^{-N\varepsilon^2})$, this inequality results in a decay rate of order $O(N^{-1})$. Thus, it is desirable to replace (2.12) with a tighter bound that exhibits an exponential rate with respect to $N$. For bounded random variables such an inequality was proved by Hoeffding [16]. The following lemma is one of the many variants of Hoeffding's inequality.
Lemma 2.8 (Hoeffding's inequality) Let $Z_1, \ldots, Z_N$ be a set of independent bounded random variables such that $|Z_n| \le b$ with probability one and $\mathbb{E}Z_n = 0$ for all $n = 1, 2, \ldots, N$. Set $S_N = N^{-1} \sum_{n=1}^{N} Z_n$. Then, for any $0 < \varepsilon < b$,
$$P\{S_N \ge \varepsilon\} \le \left[ \left( \frac{1}{1 + \varepsilon/b} \right)^{\frac{1}{2} + \frac{\varepsilon}{2b}} \left( \frac{1}{1 - \varepsilon/b} \right)^{\frac{1}{2} - \frac{\varepsilon}{2b}} \right]^N \tag{2.13}$$
$$\le e^{-\frac{N\varepsilon^2}{2b^2}}\,. \tag{2.14}$$
Using a slightly different version of Hoeffding's inequality, it is easy to prove the following result.
Corollary 2.9 For every $f \in \mathcal F$ and any $\varepsilon > 0$,
$$P\left\{ \left| \hat P_e(f, D_N) - P_e(f) \right| \ge \varepsilon \right\} \le 2 e^{-2N\varepsilon^2}\,. \tag{2.15}$$
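A quick numerical comparison (our own illustrative choice of N and ε, with the worst case p(1 − p) = 1/4 substituted into (2.6)) shows how much tighter the Hoeffding bound (2.15) is than the Chebyshev-based bound:

```python
import numpy as np

# Compare the Chebyshev-based bound (2.6), in its worst case p(1-p) = 1/4,
# with the Hoeffding bound (2.15) on P{|empirical risk - risk| >= eps}.
N, eps = 1000, 0.05

chebyshev = 0.25 / (N * eps ** 2)           # p(1-p)/(N eps^2) <= 1/(4 N eps^2)
hoeffding = 2 * np.exp(-2 * N * eps ** 2)

print(chebyshev)    # 0.1
print(hoeffding)    # about 0.013, exponentially smaller for large N
```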
Thus, Hoeffding's inequality provides us with a tail probability that decays like $O(e^{-N\varepsilon^2})$, similar to the central limit theorem. However, the fact that the variance of the random variables is not considered in this bound indicates a major weakness. The following lemma [10] provides a concentration inequality for the sum of independent random variables with bounded variance.
Lemma 2.10 (Bennett (1962) and Bernstein (1946)) Let $Z_1, \ldots, Z_N$ be a set of independent bounded random variables such that $|Z_n| \le b$ with probability one and $\mathbb{E}Z_n = 0$ for all $n = 1, 2, \ldots, N$. Define $\sigma^2 = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}(Z_n)$ and set $S_N = N^{-1} \sum_{n=1}^{N} Z_n$. Then, for any $\varepsilon > 0$,
$$P\{S_N > \varepsilon\} \le \exp\left\{ -\frac{N\varepsilon}{2b}\left( \left(1 + \frac{\sigma^2}{2b\varepsilon}\right) \ln\left(1 + \frac{2b\varepsilon}{\sigma^2}\right) - 1 \right) \right\} \tag{2.16}$$
(Bennett, 1962), and
$$P\{S_N > \varepsilon\} \le \exp\left\{ -\frac{N\varepsilon^2}{2\sigma^2 + 2b\varepsilon/3} \right\} \tag{2.17}$$
(Bernstein, 1946).
Comparing Hoeffding's and Bernstein's inequalities is insightful. Notice that both exhibit tail probabilities with an exponential decay rate with respect to the sample size $N$. However, when $\sigma^2 \ll b\varepsilon$ we see that Bernstein's (and Bennett's) inequality leads to a tail probability of $O\left(e^{-\frac{3N\varepsilon}{2b}}\right)$, outperforming Hoeffding's inequality, which leads to a tail probability of $O\left(e^{-\frac{N\varepsilon^2}{2b^2}}\right)$.
2.3.3 General mappings of independent random variables
Recall that what is actually needed is an upper bound on $\sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|$. So,
even though Hoeffding’s and Bernstein’s inequalities exhibit the desirable exponential rate
for the tail probability, they are not directly applicable to this task. In this section we
quote two concentration inequalities that will be used in the sequel to establish uniform
upper bounds on the risk. The first one is McDiarmid’s inequality which addresses bounded
difference functions [27].
Definition 2.11 (Bounded difference functions) Let $\{z_1, \ldots, z_N\}$ and $\{z_1', \ldots, z_N'\}$ be two sets of $N$ elements, each of which belongs to some space $\mathcal Z$. Any function $f : \mathcal Z^N \mapsto \mathbb{R}$ that satisfies
$$\sup_{z_1, \ldots, z_N, z_n'} \left| f(z_1, \ldots, z_{n-1}, z_n, z_{n+1}, \ldots, z_N) - f(z_1, \ldots, z_{n-1}, z_n', z_{n+1}, \ldots, z_N) \right| \le c_n$$
for all $n = 1, 2, \ldots, N$, is called a bounded difference function.
Two simple examples of bounded difference functions are $f(z_1, \ldots, z_N) = \frac{1}{N} \sum_{n=1}^{N} g(z_n)$ and $f(z_1, \ldots, z_N) = \sup_{g \in \mathcal G} \frac{1}{N} \sum_{n=1}^{N} g(z_n)$, where for every $g \in \mathcal G$, $|g(\cdot)|$ is uniformly bounded. Such functions exhibit a 'concentration' behavior, thus motivating the search for tight bounds. Theorem 2.12, proved by McDiarmid [27], provides such a bound.
Theorem 2.12 (McDiarmid's Inequality) Let $Z_1, \ldots, Z_N$ be independent random variables, each of which takes values in a set $\mathcal Z$, and assume the function $f : \mathcal Z^N \mapsto \mathbb{R}$ is a bounded difference function. Then, for every $\varepsilon \ge 0$,
$$P\{f(Z_1, \ldots, Z_N) - \mathbb{E}f(Z_1, \ldots, Z_N) \ge \varepsilon\} \le e^{-\frac{2\varepsilon^2}{\sum_{n=1}^{N} c_n^2}},$$
and
$$P\{\mathbb{E}f(Z_1, \ldots, Z_N) - f(Z_1, \ldots, Z_N) \ge \varepsilon\} \le e^{-\frac{2\varepsilon^2}{\sum_{n=1}^{N} c_n^2}}\,.$$
Notice that if cn is O(N −1 ), McDiarmid’s inequality results in an exponential bound for the
tail probability, similarly to Hoeffding’s inequality. The most significant difference between
these two inequalities is that while the latter addresses only the sum of random variables,
the former considers any mapping of the random variables which satisfies the condition of
bounded difference functions. This is clearly a much larger class of functions.
The last concentration inequality we present in this chapter is based on the entropy method
[7]. For a certain type of functions, referred to as the self bounding functions, it may lead
to tighter bounds for the tail probability than McDiarmid’s inequality. We begin with the
definition of self bounding functions [7].
Definition 2.13 (Self bounding functions) Let $\{z_1, \ldots, z_N\}$ be a set of $N$ elements, each of which belongs to a space $\mathcal Z$. Assume that $f$ is a nonnegative function defined over $\mathcal Z^N$ and that there exists a function $f_n$, defined over $\mathcal Z^{N-1}$, such that the following two conditions hold for all $n = 1, 2, \ldots, N$:
1. $0 \le f(z_1, \ldots, z_{n-1}, z_n, z_{n+1}, \ldots, z_N) - f_n(z_1, \ldots, z_{n-1}, z_{n+1}, \ldots, z_N) \le 1$,
2. $\sum_{n=1}^{N} \left[ f(z_1, \ldots, z_{n-1}, z_n, z_{n+1}, \ldots, z_N) - f_n(z_1, \ldots, z_{n-1}, z_{n+1}, \ldots, z_N) \right] \le f(z_1, \ldots, z_N)$.
Then, the function $f$ is called a self bounding function.
For example, let $\mathcal Z = \mathbb{R}^k \times \{\pm 1\}$ and let $\mathcal G$ be some class of classifiers over $\mathcal Z$. The worst empirical risk over the class, $\sup_{g \in \mathcal G} \hat P_e(g, \{z_1, \ldots, z_N\}) = \sup_{g \in \mathcal G} \frac{1}{N} \sum_{n=1}^{N} I[y_n g(x_n) < 0]$, is a self bounding function. The following theorem, concerning self bounding functions, is taken from [7].
Theorem 2.14 Let $Z_1, \ldots, Z_N$ be independent random variables taking values in some set $\mathcal Z$ and assume the function $f : \mathcal Z^N \mapsto \mathbb{R}$ is a self bounding function. Then, for any $\varepsilon \ge 0$,
$$P\{f(Z_1, \ldots, Z_N) - \mathbb{E}f(Z_1, \ldots, Z_N) \ge \varepsilon\} \le e^{-\frac{\varepsilon^2}{2\mathbb{E}f + 2\varepsilon/3}},$$
and
$$P\{\mathbb{E}f(Z_1, \ldots, Z_N) - f(Z_1, \ldots, Z_N) \ge \varepsilon\} \le e^{-\frac{\varepsilon^2}{2\mathbb{E}f}}\,.$$
Recall that the result of Lemma 2.10 improves the result of Lemma 2.8 by using the variance
of the random variable. Similarly, Theorem 2.14 potentially improves on Theorem 2.12 using
the mean of the random variable. When the random variable is nonnegative and bounded,
its variance can be bounded using its mean. In the next chapter, Theorem 2.14 will be shown
to provide a better tail probability rate than Theorem 2.12 when Ef is O(N −γ ), where γ is
some positive scalar.
2.4 Summary
We discussed the problem of generalization in the context of supervised learning and demonstrated the importance of model selection. We introduced the concept of class complexity and showed the usefulness of the random variable $\sup_{f \in \mathcal F} \left| \hat P_e(f, D_N) - P_e(f) \right|$ as such a measure. This measure is addressed in the VC theory, which provides us with distribution free upper bounds for the risk in terms of $V_{\mathcal F}$. Since such bounds are not satisfactory in real world problems, the concept of concentration inequalities was introduced, especially Theorems 2.12 and 2.14. These inequalities are used extensively in the sequel to replace random variables with their means and vice versa, leading to tight uniform risk bounds.
Chapter 3
Data dependent risk bounds
In many real life classification problems the source distribution is not available. Moreover,
there is no knowledge of a class containing classifiers that are known to perform well. Typically, the only indication of the underlying distribution is given by a training sequence of
finite length. This practical scenario motivates the search for analytical and algorithmic
tools to help us gain a better understanding of the problem.
In this chapter, bounds that hold for every classifier f ∈ F and are independent of the source distribution are introduced. Even though most of the results are adapted from the literature, we either prove or provide an outline of the proofs for most of them, for several reasons: (1) some authors improved parts of the results of others, and we combine the best relevant results available; (2) to obtain the optimal bounds, we pay special attention to the constants; and (3) to make this document self contained.
This chapter is organized as follows. In Section 3.1 we consider losses other than the 0 − 1 loss, motivated by both theoretical and practical considerations. In Section 3.2 we introduce a data dependent measure of the class complexity, which is used to replace the VC dimension in the risk bound. We conclude by introducing several risk bounds in Sections 3.3 and 3.4.
3.1 The φ-risk
Consider a soft classifier f : R^k ↦ R, where the classifier maps the feature space into the real line R rather than {±1}, and the 0 − 1 loss incurred by it, I[yf(x) ≤ 0]. While we
attempt to minimize the expected value of the 0 − 1 loss, it turns out to be inopportune
to directly minimize functions based on this loss. First, the computational task is often
intractable due to its non-smoothness. Second, minimizing the empirical 0 − 1 loss may lead
to severe overfitting. Many recent approaches are based on minimizing a smooth convex
function φ(yf (x)) which upper bounds the 0 − 1 loss (e.g. [40, 25, 3]). We assume that the
loss function φ(t) satisfies the following assumptions.
1. Upper bound. I[t ≤ 0] ≤ φ(t) for all t ∈ R. This condition enables us to upper
bound the risk using the loss function φ.
2. Finiteness. Denote φ̄ = supt∈R φ(t); we assume that φ̄ < ∞.
3. Smoothness. φ(t) is Lipschitz with constant Lφ . Such a behavior of φ is important
from both analytical (as will be demonstrated in the sequel) and practical considerations (e.g. when gradient based algorithms are used).
4. Monotonicity & tightness. φ(t) → 0 as t → ∞, φ(0) = 1 and φ(t) → φ̄ as t → −∞. We might also require that $\frac{d}{dt}\varphi(t) < 0$. Notice that t stands for yf(x), so that such monotonic behavior implies that the φ-risk is consistent with the confidence level of the classification result, known as the margin [3]. The values at t → ±∞ and t = 0 prevent any unnecessary loss of tightness. The property φ(0) = 1 can be easily achieved given a loss function that satisfies all other assumptions, simply by normalizing it by φ(0). Since φ(0) is a positive scalar, it is easy to show that all of the desired properties are then satisfied.
Some widely used loss functions are described in Figure 3.1.
Define the φ-risk,
$$E_\phi(f) = \mathbb{E}\left\{ \phi(Y f(X)) \right\},$$
and the empirical φ-risk,
$$\hat E_\phi(f, D_N) = \frac{1}{N} \sum_{n=1}^{N} \phi(y_n f(x_n))\,.$$
Using the φ-risk instead of the risk itself is motivated by several reasons.
1. Minimizing the φ-risk often leads asymptotically to the Bayes decision rule [40]. The basic idea behind this statement is that for every $x \in \mathbb{R}^k$, the function that minimizes the φ-risk, $f_\phi^*$, is positive if $P\{Y = 1 \mid X = x\} > 1/2$ and negative otherwise. Thus, by taking $\mathrm{sgn}(f_\phi^*(x))$ we achieve the Bayes risk.
2. Rather tight upper bounds on the risk may be derived for finite sample sizes (e.g. [40, 25, 3]).
3. Minimizing the empirical φ-risk instead of the empirical risk is computationally much simpler.
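As a concrete instance (our own choice, intended to mimic the shifted, scaled and normalized tanh loss of Figure 3.1), the loss φ(t) = 1 − tanh(t) satisfies assumptions 1-4 with φ̄ = 2 and L_φ = 1; the sketch below also evaluates the empirical φ-risk:

```python
import numpy as np

def phi(t):
    """A loss satisfying assumptions 1-4: phi(t) = 1 - tanh(t).

    phi(0) = 1, phi -> 0 as t -> +inf, phi -> phi_bar = 2 as t -> -inf,
    Lipschitz with L_phi = 1, and I[t <= 0] <= phi(t) for all t.
    """
    return 1.0 - np.tanh(t)

def empirical_phi_risk(f, X, y):
    """Empirical phi-risk: (1/N) sum_n phi(y_n f(x_n))."""
    return np.mean(phi(y * f(X)))

# A confidently correct, a marginal, and a confidently wrong prediction:
print(phi(np.array([3.0, 0.0, -3.0])))   # approximately [0.005, 1.0, 1.995]

# Empirical phi-risk of a trivial linear score f(x) = x on a toy sample:
X = np.array([-0.5, 0.2, 1.0]); y = np.array([-1, 1, 1])
print(empirical_phi_risk(lambda X: X, X, y))
```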
In the remainder of this chapter the φ-risk is used to derive risk bounds that will be the
starting point of our research. But, before going any further, we introduce a data dependent
quantity that will be used as the class complexity.
3.2 The Rademacher complexity
Recently, several notions of class complexity were presented, such as the maximum discrepancy and the Gaussian and Rademacher complexities. All three measures are closely related, and a detailed discussion regarding them can be found in [4]. We focus here
on the Rademacher complexity.
Definition 3.1 (Rademacher Complexity) Let $\mathcal F$ be a class of functions mapping $\mathbb{R}^k \mapsto \mathbb{R}$. The empirical Rademacher complexity is defined as
$$\hat R_N(\mathcal F) = \mathbb{E}_\sigma\left\{ \sup_{f \in \mathcal F} \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(x_n) \right\}, \tag{3.1}$$
Figure 3.1: Some widely used functions as φ(yf(x)).
The threshold loss function produces 1 if yf(x) ≤ 0 and 0 otherwise. The piecewise linear loss function is given by a continuous chain of linear functions. The tanh loss function is based on a shifted, scaled and normalized version of the tanh function.
where σ = [σ1 , σ2 , ..., σN ] is a vector of binary random variables, each of which is drawn i.i.d.
according to P {σn = 1} = P {σn = −1} = 1/2. The Rademacher complexity is defined as
the average of the empirical Rademacher complexity over all possible training sequences D_N,

R_N(F) = E_{D_N} { R̂_N(F) } .
Observe that the Rademacher complexity measures the extent to which some function from
F can be correlated with binary white noise, when mapping observations according to some
probability measure over Rk .
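For a finite class F, the expectation over σ in (3.1) can be approximated by Monte Carlo sampling. A minimal sketch, assuming the values f(x_1), ..., f(x_N) have been precomputed for each f ∈ F (the matrix `f_values` is hypothetical):

```python
import numpy as np

def empirical_rademacher(f_values, num_draws=1000, seed=0):
    """Monte Carlo estimate of (3.1) for a finite class.

    f_values: array of shape (|F|, N); row j holds f_j(x_1), ..., f_j(x_N).
    """
    rng = np.random.default_rng(seed)
    num_f, N = f_values.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)  # P{sigma_n = +1} = P{sigma_n = -1} = 1/2
        total += np.max(f_values @ sigma) / N    # sup over f of (1/N) sum_n sigma_n f(x_n)
    return total / num_draws
```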
In the next section the Rademacher complexity is used as the class complexity to replace
VF in risk bounds. Some important properties of the empirical Rademacher complexity, for
classes of real functions, F, F1 , . . . , Fd where d is some integer, are listed below.
1. R̂N (F) ≥ 0.
2. If F1 ⊆ F2 then R̂N (F1 ) ≤ R̂N (F2 ).
3. For every λ ∈ R, R̂N (λF) = |λ|R̂N (F).
4. R̂_N( Σ_{i=1}^d F_i ) ≤ Σ_{i=1}^d R̂_N(F_i) .
The proofs for these properties are trivial. Notice that each of the properties holds for RN (F)
as well.
3.3 Risk bounds
We begin with the following trivial inequality for the risk,

P_e(f) ≤ E_φ(f) ≤ Ê_φ(f, D_N) + sup_{f∈F} { E_φ(f) − Ê_φ(f, D_N) } .   (3.2)
Notice that sup_{f∈F} { E_φ(f) − Ê_φ(f, D_N) } is, in general, a random variable that depends on D_N. This term by itself does not leave much room for manipulation. However, by replacing it with its mean, we obtain a more flexible complexity term. This can be done using McDiarmid's inequality, as described in the following Lemma.
Lemma 3.2 Define Ẑ(F, D_N) = sup_{f∈F} { E_φ(f) − Ê_φ(f, D_N) } and Z(F) = E { Ẑ(F, D_N) }. Then, for every integer N and δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N,

Ẑ(F, D_N) ≤ Z(F) + φ̄ √( ln(1/δ) / (2N) ) .
Proof
To be able to use McDiarmid's inequality, we first need to prove that Ẑ(F, D_N) is a bounded difference function. Define

D_N^i = { (x_1, y_1), ..., (x_{i−1}, y_{i−1}), (x_i', y_i'), (x_{i+1}, y_{i+1}), ..., (x_N, y_N) } .   (3.3)
That is, D_N^i is obtained by replacing (x_i, y_i) in D_N with (x_i', y_i'), drawn independently according to the same distribution. It is easy to see that
Ẑ(F, D_N)
= sup_{f∈F} { E_φ(f) − (1/N) Σ_{n≠i} φ(y_n f(x_n)) − (1/N) φ(y_i' f(x_i')) + (1/N) φ(y_i' f(x_i')) − (1/N) φ(y_i f(x_i)) }
≤ sup_{f∈F} { E_φ(f) − (1/N) Σ_{n≠i} φ(y_n f(x_n)) − (1/N) φ(y_i' f(x_i')) } + (1/N) sup_{f∈F} { φ(y_i' f(x_i')) − φ(y_i f(x_i)) }
≤ Ẑ(F, D_N^i) + φ̄/N .
Combined with the analogous inequality Ẑ(F, D_N^i) ≤ Ẑ(F, D_N) + φ̄/N, we have

|Ẑ(F, D_N) − Ẑ(F, D_N^i)| ≤ φ̄/N ,   i = 1, 2, ..., N .
Thus, according to McDiarmid's inequality,

P { Ẑ(F, D_N) − Z(F) ≥ ε } ≤ e^{−2Nε²/φ̄²} .

Setting δ = e^{−2Nε²/φ̄²} completes the proof of Lemma 3.2. □
Combining (3.2) with Lemma 3.2 yields the following uniform risk bound.
Lemma 3.3 Let F be a class of mappings. Then, for every integer N and δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + E_{D_N} { sup_{f∈F} ( E_φ(f) − Ê_φ(f, D_N) ) } + φ̄ √( ln(1/δ) / (2N) ) .
Next, we bound the complexity term E_{D_N} { sup_{f∈F} ( E_φ(f) − Ê_φ(f, D_N) ) } by a quantity that is more easily handled. This is done using the following Lemma.
Lemma 3.4 Let F be a class of mappings. Then, for every integer N,

E_{D_N} { sup_{f∈F} ( E_φ(f) − Ê_φ(f, D_N) ) } ≤ 2 R_N(φ ∘ F) .
Combining Lemma 3.4 and Lemma 3.3 results in a distribution dependent risk bound for
every classifier f ∈ F, described in Theorem 3.5.
Theorem 3.5 Let F be a class of mappings. Then, for every integer N and δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + 2 R_N(φ ∘ F) + φ̄ √( ln(1/δ) / (2N) ) .   (3.4)
Proof (of Lemma 3.4)
Define a second training sequence similar to (1.3),

D_N' = { (x_1', y_1'), ..., (x_N', y_N') } ,   (3.5)

drawn from the same distribution as D_N. Since E_φ(f) = N^{−1} E_{D_N'} Σ_{n=1}^N φ(y_n' f(x_n')), we have
E_{D_N} { sup_{f∈F} [ E_φ(f) − Ê_φ(f, D_N) ] }
= E_{D_N} { sup_{f∈F} E_{D_N'} [ (1/N) Σ_{n=1}^N φ(y_n' f(x_n')) − Ê_φ(f, D_N) ] }
= E_{D_N} { sup_{f∈F} E_{D_N'} [ (1/N) ( Σ_{n=1}^N φ(y_n' f(x_n')) − Σ_{n=1}^N φ(y_n f(x_n)) ) ] } .
Replacing the supremum of the average with the average over the supremum yields
E_{D_N} { sup_{f∈F} [ E_φ(f) − Ê_φ(f, D_N) ] }
≤ E_{D_N, D_N'} { sup_{f∈F} (1/N) Σ_{n=1}^N ( φ(y_n' f(x_n')) − φ(y_n f(x_n)) ) }
(a)= E_{D_N, D_N', σ} { sup_{f∈F} (1/N) Σ_{n=1}^N σ_n ( φ(y_n' f(x_n')) − φ(y_n f(x_n)) ) }
≤ E_{D_N', σ} { sup_{f∈F} (1/N) Σ_{n=1}^N σ_n φ(y_n' f(x_n')) } + E_{D_N, σ} { sup_{f∈F} (1/N) Σ_{n=1}^N σ_n φ(y_n f(x_n)) }
= 2 R_N(φ ∘ F) ,
where σ is introduced in Definition 3.1, and (a) is due to the fact that D_N and D_N' are drawn independently and identically. □
Notice that computing the Rademacher complexity involves an average over the source.
Thus, Theorem 3.5 exhibits an advantage over bounds based on the VC dimension because
it considers the source distribution as well as the class F. Unfortunately, for the same reason
this result is not practical since the distribution is assumed to be unknown. To overcome
this difficulty, we use McDiarmid’s inequality again, to replace RN (φ ◦ F ) with R̂N (φ ◦ F),
which can, in principle, be estimated based on the data.
Lemma 3.6 Let F be a class of mappings. Then, for every integer N and every δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N,

R_N(φ ∘ F) ≤ R̂_N(φ ∘ F) + φ̄ √( ln(1/δ) / (2N) ) .   (3.6)
Proof
We prove that R̂_N(φ ∘ F) is a bounded difference function. Define D_N^i as in (3.3) and set R̂_N^i(φ ∘ F) to be the empirical Rademacher complexity computed over D_N^i. Then, we have
R̂_N(φ ∘ F)
= E_σ { sup_{f∈F} (1/N) Σ_{n=1}^N σ_n φ(y_n f(x_n)) }
= E_σ { sup_{f∈F} (1/N) [ Σ_{n≠i} σ_n φ(y_n f(x_n)) + σ_i φ(y_i' f(x_i')) + σ_i ( φ(y_i f(x_i)) − φ(y_i' f(x_i')) ) ] }
≤ E_σ { sup_{f∈F} (1/N) [ Σ_{n≠i} σ_n φ(y_n f(x_n)) + σ_i φ(y_i' f(x_i')) ] } + E_{σ_i} { sup_{f∈F} (1/N) σ_i ( φ(y_i f(x_i)) − φ(y_i' f(x_i')) ) }
≤ R̂_N^i(φ ∘ F) + φ̄/N .
A symmetric argument implies that R̂_N^i(φ ∘ F) ≤ R̂_N(φ ∘ F) + φ̄/N. Thus, according to McDiarmid's inequality,

P { R_N(φ ∘ F) − R̂_N(φ ∘ F) ≥ ε } ≤ e^{−2Nε²/φ̄²} .
Setting δ = e^{−2Nε²/φ̄²} completes the proof of Lemma 3.6. □
A uniform, data dependent and distribution free risk bound is now achievable.
Theorem 3.7 Let F be a class of mappings. Then, for every integer N and δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + 2 R̂_N(φ ∘ F) + 3φ̄ √( ln(2/δ) / (2N) ) .
This result is clearly tighter than bounds based on the VC dimension, yet it remains uniform over the class F and distribution free. To develop this result further, the following basic property of the empirical Rademacher complexity, taken from [23, 29], is used in the sequel.
Lemma 3.8 Let F be some class of mappings and assume φ(t) is Lipschitz with constant
Lφ . Then,
R̂N (φ ◦ F) ≤ Lφ R̂N (F).
This result enables us to deal directly with the class F rather than φ ∘ F, so that the bound in Theorem 3.7 can be replaced by

P_e(f) ≤ Ê_φ(f, D_N) + 2 L_φ R̂_N(F) + 3φ̄ √( ln(2/δ) / (2N) ) .   (3.7)
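Given estimates of Ê_φ(f, D_N) and R̂_N(F), the right-hand side of (3.7) is a one-line computation. A sketch with placeholder inputs:

```python
import numpy as np

def risk_bound_3_7(emp_phi_risk, emp_rademacher, L_phi, phi_bar, N, delta):
    """Right-hand side of (3.7)."""
    return (emp_phi_risk
            + 2.0 * L_phi * emp_rademacher
            + 3.0 * phi_bar * np.sqrt(np.log(2.0 / delta) / (2.0 * N)))

# Placeholder values; L_phi = phi_bar = 2 and delta = 1e-4 as in Chapter 7.
print(risk_bound_3_7(emp_phi_risk=0.15, emp_rademacher=0.05,
                     L_phi=2.0, phi_bar=2.0, N=1000, delta=1e-4))
```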
3.4 Improved risk bound
Recall that in Lemma 3.6 we used McDiarmid’s inequality to upper bound the Rademacher
complexity with the empirical Rademacher complexity. In this section Lemma 3.9 is used
for this purpose, resulting in a risk bound that under some luckiness assumptions is tighter
than the bound in (3.7). Moreover, in the context of this work, as will be shown in the
sequel, this will always lead to tighter bounds.
3.4.1 Improved bound for R_N(φ ∘ F)
We begin with our main result in this chapter, improving the bound given in Lemma 3.6.
Lemma 3.9 Let F be a class of mappings. Then, for every integer N and δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N,

R_N(φ ∘ F) ≤ R̂_N(φ ∘ F) + √( φ̄ ln(1/δ) R̂_N(φ ∘ F) / (2N) ) + 3φ̄ ln(1/δ) / (4N) .   (3.8)

Remark. It can be shown [4] that for classes with finite V_F, R̂_N(F) ≤ √( V_F / N ). Thus, assuming that φ is Lipschitz, the result of Lemma 3.9 has a decaying rate of O(N^{−3/4}), better than Lemma 3.6, which exhibits a rate of O(N^{−1/2}). Furthermore, for the classifiers studied in this work, it will be shown that the Rademacher complexity always satisfies R̂_N(F) ≤ O(N^{−1/2}), leading (3.8) to a decaying rate of O(N^{−3/4}).
Notice that R̂_N(φ ∘ F) is a random variable, depending on D_N. The so-called 'luckiness assumption' states that regardless of the source distribution, we might have a training sequence such that R̂_N(φ ∘ F) < O(N^{−γ}) for some γ ∈ (0, 1]. In such a case (3.8) outperforms (3.6). To understand the values considered for γ, notice that

R̂_N(φ ∘ F) ≤ φ̄ ,

and that

R̂_N(φ ∘ F) ≥ ((N−1)/N) R̂_{N−1}(φ ∘ F) ≥ ... ≥ (1/N) R̂_1(φ ∘ F) = O(N^{−1}) ,

which can be proved by appropriate normalization of (3.9) in the sequel.
Proof (of Lemma 3.9)
We define the random variables Ẑ_N(φ ∘ F) = 2φ̄^{−1} E_σ { sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) } and Ẑ_N^{(i)}(φ ∘ F) = 2φ̄^{−1} E_{σ∖σ_i} { sup_{f∈F} Σ_{n≠i} σ_n φ(y_n f(x_n)) }. The proof is based on three steps: (1) Ẑ_N(φ ∘ F) is shown to be a self bounding function, (2) a bound for E Ẑ_N(φ ∘ F) in terms of Ẑ_N(φ ∘ F) is established and (3) this bound is normalized appropriately to obtain Lemma 3.9.
Step 1
First, we prove that Ẑ_N(φ ∘ F) satisfies the first property of self bounding functions, given in Definition 2.13. We begin with the upper bound,

Ẑ_N(φ ∘ F) − Ẑ_N^{(i)}(φ ∘ F) ≤ Ẑ_N^{(i)}(φ ∘ F) + 2φ̄^{−1} E_{σ_i} sup_{f∈F} σ_i φ(y_i f(x_i)) − Ẑ_N^{(i)}(φ ∘ F)
= φ̄^{−1} [ sup_{f∈F} φ(y_i f(x_i)) − inf_{f∈F} φ(y_i f(x_i)) ]
≤ 1 .
To prove the lower bound, we use the following argument,

E_σ { sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) }
= (1/2) E_{σ∖σ_i} { sup_{f∈F} [ Σ_{n≠i} σ_n φ(y_n f(x_n)) + φ(y_i f(x_i)) ] } + (1/2) E_{σ∖σ_i} { sup_{f∈F} [ Σ_{n≠i} σ_n φ(y_n f(x_n)) − φ(y_i f(x_i)) ] }
= (1/2) E_{σ∖σ_i} { sup_{f,f̃∈F} [ Σ_{n≠i} σ_n ( φ(y_n f(x_n)) + φ(y_n f̃(x_n)) ) + φ(y_i f(x_i)) − φ(y_i f̃(x_i)) ] } .
Setting f = f̃, which is not necessarily the best option, yields

E_σ { sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) } ≥ E_{σ∖σ_i} { sup_{f∈F} Σ_{n≠i} σ_n φ(y_n f(x_n)) } .
Multiplying both sides by 2φ̄^{−1} results in

Ẑ_N(φ ∘ F) ≥ Ẑ_N^{(i)}(φ ∘ F) .   (3.9)

This completes the proof that the first property of self bounding functions is satisfied.
Next, we prove that Ẑ_N(φ ∘ F) satisfies the second condition of self bounding functions. Let us denote by f* the function for which sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) is achieved. Similarly, for every i = 1, 2, ..., N, denote by f_i* the function for which sup_{f∈F} Σ_{n≠i} σ_n φ(y_n f(x_n)) is achieved. Then, for every given σ, we have the following argument.
Σ_{i=1}^N ( sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) − sup_{f∈F} Σ_{n≠i} σ_n φ(y_n f(x_n)) )
= Σ_{i=1}^N ( Σ_{n=1}^N σ_n φ(y_n f*(x_n)) − Σ_{n≠i} σ_n φ(y_n f_i*(x_n)) )
≤ Σ_{i=1}^N ( Σ_{n=1}^N σ_n φ(y_n f*(x_n)) − Σ_{n≠i} σ_n φ(y_n f*(x_n)) )
= Σ_{i=1}^N σ_i φ(y_i f*(x_i))
= sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) .
By averaging over σ and normalizing appropriately, we have

Σ_{i=1}^N ( Ẑ_N(φ ∘ F) − Ẑ_N^{(i)}(φ ∘ F) ) ≤ Ẑ_N(φ ∘ F) .
Step 2
According to Theorem 2.14, we have the following Lemma.
Lemma 3.10 Let F be some class of mappings and define the random variable

Ẑ_N(φ ∘ F) = (2/φ̄) E_σ { sup_{f∈F} Σ_{n=1}^N σ_n φ(y_n f(x_n)) } .

Then, for every integer N and any δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N,

Z_N(φ ∘ F) ≤ Ẑ_N(φ ∘ F) + √( 2 Ẑ_N(φ ∘ F) ln(1/δ) ) + (3/2) ln(1/δ) ,   (3.10)

where Z_N(φ ∘ F) = E_{D_N} { Ẑ_N(φ ∘ F) }.
To prove Lemma 3.10 we use the second inequality of Theorem 2.14, which provides us with the following bound for the tail probability,

P { Z_N(φ ∘ F) − Ẑ_N(φ ∘ F) ≥ ε } ≤ e^{−ε²/(2 Z_N(φ∘F))} .

Setting δ = e^{−ε²/(2 Z_N(φ∘F))}, which implies ε = √( 2 Z_N(φ ∘ F) ln(1/δ) ), we have, with probability at least 1−δ,

Z_N(φ ∘ F) ≤ Ẑ_N(φ ∘ F) + √( 2 Z_N(φ ∘ F) ln(1/δ) ) .
By solving this simple quadratic inequality, while keeping in mind that Z_N(φ ∘ F) is nonnegative, we get

√( Z_N(φ ∘ F) ) ≤ √( ln(1/δ) / 2 ) + √( ln(1/δ)/2 + Ẑ_N(φ ∘ F) ) .   (3.11)
Squaring both sides of (3.11) yields

Z_N(φ ∘ F) ≤ Ẑ_N(φ ∘ F) + √( ( ln(1/δ) )² + 2 Ẑ_N(φ ∘ F) ln(1/δ) ) + ln(1/δ)/2 .   (3.12)

Using the fact that √(a+b) ≤ √a + √b for every a, b ≥ 0, (3.12) implies

Z_N(φ ∘ F) ≤ Ẑ_N(φ ∘ F) + √( 2 Ẑ_N(φ ∘ F) ln(1/δ) ) + (3/2) ln(1/δ) ,   (3.13)

which completes the proof of Lemma 3.10.
Step 3
Multiplying both sides of (3.13) by φ̄/(2N) completes the proof of Lemma 3.9. □

3.4.2 Deriving an improved risk bound
Finally, combining Theorem 3.5, Lemma 3.8 and Lemma 3.9 leads to the following result.
Theorem 3.11 Let F be a class of mappings. Then, for every integer N and every δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + 2 L_φ R̂_N(F) + √( 2 L_φ φ̄ ln(2/δ) R̂_N(F) / N ) + φ̄ √( ln(2/δ) / (2N) ) + 3φ̄ ln(2/δ) / (2N) .
To appreciate the improvement of Theorem 3.11 over (3.7), consider the following typical initialization of the parameters: L_φ = 2, φ̄ = 2, δ = 10^{−4}, the same parameters used in the simulations of Chapter 7. For these values, Figure 3.2 describes the difference between the bounds. Notice that the improvement is of the same order as other terms in the bound, such as the empirical φ-risk.

[Figure 3.2: Risk bounds comparison. The difference between the bounds in Theorem 3.11 and (3.7) for L_φ = 2, φ̄ = 2, δ = 10^{−4} and N = 10² to 10³.]
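The curve in Figure 3.2 can be reproduced by subtracting the terms of Theorem 3.11 from those of (3.7); the common terms Ê_φ(f, D_N) + 2L_φ R̂_N(F) cancel. A sketch, assuming for illustration that R̂_N(F) = N^{−1/2} (the constant in front is a placeholder):

```python
import numpy as np

L_phi, phi_bar, delta = 2.0, 2.0, 1e-4            # values used in Chapter 7
ln_term = np.log(2.0 / delta)

for N in [100, 250, 500, 1000]:
    R_hat = 1.0 / np.sqrt(N)                      # luckiness: R_N(F) = O(N^{-1/2})
    extra_3_7 = 3.0 * phi_bar * np.sqrt(ln_term / (2.0 * N))
    extra_3_11 = (np.sqrt(2.0 * L_phi * phi_bar * ln_term * R_hat / N)
                  + phi_bar * np.sqrt(ln_term / (2.0 * N))
                  + 3.0 * phi_bar * ln_term / (2.0 * N))
    print(N, extra_3_7 - extra_3_11)              # improvement of Theorem 3.11
```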
3.5 Summary
The main result of this chapter is given by Theorem 3.11, serving us as the starting point
for the remainder of this work. It provides us with a uniform risk bound that is independent
of the source distribution, a bound that can be used for model selection given a training
sequence. A key property of Theorem 3.11 is that bounding the empirical Rademacher
complexity of the class is sufficient to establish a uniform risk bound. As mentioned earlier,
the training sequence over which the Rademacher complexity is computed might be ‘simple’
in a way that leads to tight upper bounds (the so called ‘luckiness assumption’).
Chapter 4
Risk bounds for Mixture-of-Experts
classifiers
Imagine a medical clinic where several doctors, each of whom is an expert in a different
field of medicine, receive patients. Consider a general practitioner, directing the incoming
patients to the relevant doctor. Clearly, to decide which doctor is the most appropriate to
treat the patient, some information from the patient must first be gathered, indicating the
type of problem from which he is suffering. Then, based on this knowledge, the patient can
be directed to the doctor who is most likely to know how to treat him. Sometimes, when the
indicators are not unequivocal, the general practitioner might decide to direct the patient
to several doctors so that the opinions of all of them can be considered. To understand the
relevance of this example to our discussion, consider the alternative where only one doctor
is available, with the task of providing medical treatment to every patient, irrespective of
his problem.
The Mixture-of-Experts (MoE) [21, 17] classifier considers the equivalent of this example
in the context of pattern classification. It is based on an adaptive soft partition of the
feature space into regions, to each of which a local classifier is assigned. So, whenever a
new sample is to be classified, the MoE classifier combines the decisions of several ‘expert’
classifiers according to their relative superiority in the region of the feature space from
which the new sample was drawn. Such a procedure can be thought of, on the one hand, as
extending standard approaches based on mixtures, and, on the other hand, providing a soft
probabilistic extension of decision trees. The MoE architecture has been successfully applied
to regression, classification, control and time series analysis [19, 22, 18, 37, 38, 39].
This chapter is organized as follows. In Section 4.1 the MoE classifier is introduced and motivated, followed by a formal description of our assumptions regarding the various components of the classifier in Section 4.2. Section 4.3 concludes this chapter, providing risk bounds for the discussed classifier.
4.1 Mixture of Experts Classifiers
Consider the MoE architecture described in Figure 4.1, and given mathematically by

f(x) = Σ_{m=1}^M a_m(w_m, x) h_m(v_m, x) .   (4.1)
We interpret the functions h_m as experts, each of which 'operates' in the regions of space where the gating functions a_m are nonzero. Such a classifier can be intuitively interpreted as implementing the principle of 'divide and conquer': instead of solving one complicated problem over the entire space, we can do better by dividing it into several regions, defined through the gating functions a_m, and using 'simple' experts h_m in each region.
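A minimal sketch of the forward pass (4.1), using sigmoid half-space gates and tanh experts purely for illustration (the thesis allows any gates and experts satisfying the assumptions formalized in Section 4.2):

```python
import numpy as np

def moe_classifier(x, W, V):
    """Soft MoE output f(x) = sum_m a_m(w_m, x) h_m(v_m, x), eq. (4.1).

    x: feature vector of shape (k,); W, V: parameter matrices of shape (M, k).
    """
    gates = 1.0 / (1.0 + np.exp(-W @ x))   # half-space gates a_m(w_m^T x)
    experts = np.tanh(V @ x)               # glim experts h_m(v_m^T x)
    return float(gates @ experts)

rng = np.random.default_rng(0)
M, k = 4, 10
x = rng.normal(size=k)
f_x = moe_classifier(x, rng.normal(size=(M, k)), rng.normal(size=(M, k)))
label = np.sign(f_x)                       # hard decision
```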
To demonstrate the way in which the MoE classifier operates, consider Figure 4.2. This example demonstrates how a rather complicated classifier f can be constructed from very simple component classifiers h_m and gating functions a_m. Figure 4.3 presents the classifier architecture needed to construct the classifier described in Figure 4.2. Note that assuming the gating functions a_m to be independent of x leads to a standard mixture. Such a model, while using several classifiers, combines them with fixed weights over the entire space. To realize the weakness of standard mixtures, consider any M soft linear classifiers combined with fixed weights. The overall classifier is then linear, clearly inferior to a combination of the same soft linear classifiers with data dependent weights.
[Figure 4.1: MoE classifier with M experts.]
[Figure 4.2: A combination of simple classifiers in the feature space. Rather complex decision boundaries can be achieved using simple classifiers on different subsets of the feature space.]
[Figure 4.3: A block diagram of the classifier in Figure 4.2. A MoE classifier with two linear and one radial classifiers, assigned to subsets by two radial and one linear gating functions.]
In the process of constructing a MoE classifier, one needs to consider the following fundamental problems:
Partition of Rk . What kind of experts are the most appropriate for our problem? What is
the best way to divide the feature space? In this work we consider two general types
of division that are carried out by the gating functions. Generally speaking, in one
type of partition we consider each expert to be relatively efficient over some half of the
feature space defined by a hyperplane. In the other, each expert is considered to be
locally superior to the others, over some ball in Rk .
Model selection. What values should be considered for M ? How to determine the feasible
set of the parameters wm , vm ? Notice that by enlarging M or the feasible parameter
set we expect the class complexity to grow, increasing the danger of overfitting. On
the other hand, setting them to be too small might lead to underfitting.
Algorithm. How do we select the best M, w_m, v_m for our problem? One disturbing problem is the non-convex nature of the loss function, which makes gradient methods inefficient and increases the computational complexity dramatically.
We discuss the algorithmic problems and provide two solutions in the spirit of genetic algorithms in Chapter 7. To face the model selection problem, we establish risk bounds for the MoE classifier. Previous results attempting to establish such bounds were based on covering number approaches [28] and the VC dimension [20]. Unfortunately, such approaches are too weak to be useful in any practical setting. For example, Jiang proved that for mixtures of binary experts, the VC dimension can be upper bounded by O(M⁴k²) [20]. Whereas according to (2.8) this implies a risk bound of order O(M²k/√N), in this work bounds of order O(M√k/√N) are established. To realize the significance of this improvement, consider a case where 4 experts (M = 4) are used to classify samples with 10 features (k = 10). Using a bound based on the VC dimension, one needs to gather a training sequence that is 160(!) times longer than the one needed when using the bound established in this work. In fact, a careful examination of the constants in the bounds reveals that the improvement is even more significant.
4.2 Model selection
It is clear that unless some restrictions are imposed on the gating functions and experts,
overfitting is imminent. We formalize our assumptions regarding the experts and gating
functions below. These assumptions are weakened in Chapter 6.
Definition 4.1 (Experts) For every m = 1, 2, ..., M, let V_max^m be some nonnegative scalar and v_m a vector with k elements. Then, the m-th expert is given by a mapping h_m(v_m, x) where v_m ∈ V_m = { v ∈ R^k : ||v|| ≤ V_max^m }. We define the collection of all functions h_m(v_m, x) such that v_m ∈ V_m as H_m. To simplify the notation we define V_max = max_m V_max^m and set

H = ∪_{m=1}^M H_m = ∪_{m=1}^M { h_m(v_m, x) | v_m ∈ V_m } .
Remark. For ease of notation we use the following conventions. (1) For any parameter θ drawn from some space Θ, sup_θ is used to indicate sup_{θ∈Θ}. (2) The symbol ||·|| is used to indicate ||·||₂. We comment that our results can be generalized to other definitions of this norm.
In Definition 4.1 the feasible parameter set is used to define the class Hm . Such a definition
serves the purpose of regularization, as it enables us to control the class size via the definition
of the feasible parameter set. Assumption 4.2 addresses the properties of the mapping hm .
Assumption 4.2 The following assumptions are made for every m = 1, 2, . . . , M .
1. To allow different types of experts, assume h_m(v_m, x) = h_m(τ_m(v_m, x)) where τ_m(v_m, x) is some mapping such as v_m^T x or ||v_m − x||. h_m(τ_m(v_m, x)) is assumed to be Lipschitz with constant L_{h_m}, i.e. |h_m(τ_m(v_{m1}, x)) − h_m(τ_m(v_{m2}, x))| ≤ L_{h_m} |τ_m(v_{m1}, x) − τ_m(v_{m2}, x)|. To simplify the notation, we sometimes replace L_{h_m} with L_h = max_m L_{h_m}.
2. |hm (vm , x)| is bounded by some positive constant MHm < ∞. Setting MH = maxm MHm
implies supm,vm |hm (vm , x)| ≤ MH .
3. The experts are either symmetric (for regression) or antisymmetric (for classification) with respect to the parameters, i.e. there exists ν ∈ {±1} for which h_m(v_m, x) = ν h_m(−v_m, x).
We emphasize that, even though x is referred to as a sample of the feature space, our results
hold for experts hm (vm , x) = hm (vm , Φm (x)) where Φm (x) is some nonlinear mapping. The
use of such experts results in a very powerful classifier that addresses different subspaces
of the feature space using the most appropriate kernels [33]. Indeed, this approach can be
interpreted as local classification, i.e. each expert faces the problem of classification locally.
The gating functions am reflect the relative weights of each of the experts at a given point
x. In the sequel two types of gating functions, described in the following definition, are
considered.
Definition 4.3 (Gating functions) For every m = 1, 2, ..., M, let W_max^m be some nonnegative scalar and w_m a vector with k elements. Then, the m-th gating function is given by a mapping a_m(w_m, x) where w_m ∈ W_m = { w ∈ R^k : ||w|| ≤ W_max^m }. We define the collection of all functions a_m(w_m, x) such that w_m ∈ W_m as A_m. To simplify the notation we define W_max = sup_m W_max^m and set

A = ∪_{m=1}^M A_m = ∪_{m=1}^M { a_m(w_m, x) | w_m ∈ W_m } .

a_m(w_m, x) is said to be a half-space gate if a_m(w_m, x) = a_m(w_m^T x), and a local gate if a_m(w_m, x) = a_m( (1/2) ||w_m − x||² ).
Similar to Assumption 4.2, the following assumptions serve the purpose of regularization.
Assumption 4.4 The following assumptions are made for every m = 1, 2, . . . , M .
1. am (wm , x) is Lipschitz with constant Lam , analogous to Assumption 4.2. We define the
global Lipschitz constant La = maxm Lam .
2. am (wm , x) is nonnegative and bounded by some positive constant MAm < ∞. Setting
MA = maxm MAm implies supm,wm |am (wm , x)| ≤ MA .
4.3 Establishing risk bounds for MoE classifiers
After introducing the MoE classifier and setting the framework of our discussion, we begin our
analysis. The problem of bounding the empirical Rademacher complexity R̂N (F), defined in
(3.1), for the class of MoE classifiers is addressed, beginning with Lemma 4.5. Unless stated
otherwise, wherever we write ‘Rademacher complexity’ we refer to the empirical Rademacher
complexity.
Lemma 4.5 Let F_m = { a_m(w_m, x) h_m(v_m, x) : a_m(w_m, x) ∈ A_m, h_m(v_m, x) ∈ H_m }. Then,

R̂_N(F) = Σ_{m=1}^M R̂_N(F_m) .
Proof
By definition, since the set of parameters (w_i, v_i) is independent of (w_j, v_j) for every 1 ≤ i, j ≤ M, i ≠ j,

R̂_N(F) = E_σ { sup_{w,v} (1/N) Σ_{n=1}^N σ_n Σ_{m=1}^M a_m(w_m, x_n) h_m(v_m, x_n) }
= Σ_{m=1}^M E_σ { sup_{w_m, v_m} (1/N) Σ_{n=1}^N σ_n a_m(w_m, x_n) h_m(v_m, x_n) } . □
Thus, it suffices to bound R̂_N(F_m), m = 1, 2, ..., M, in order to establish bounds for R̂_N(F). This is achieved using the following Lemma.
Lemma 4.6 Let G_1, G_2 be two classes defined over some sets X_1, X_2 respectively, and define the class G_3 as

G_3 = { g : g(x_1, x_2) = g_1(x_1) g_2(x_2), g_1 ∈ G_1, g_2 ∈ G_2 } .

Assume that at least one of the classes G_1, G_2 is closed under negation. Then,

Z(G_3) ≤ M_2 Z(G_1) + M_1 Z(G_2) ,

where Z(G_i) = E_σ { sup_{g∈G_i} Σ_{n=1}^N σ_n g(x_n) } for i = 1, 2, 3 and M_i = sup_{g_i∈G_i, x_i∈X_i} |g_i(x_i)| for i = 1, 2.
The following Lemma is used to prove Lemma 4.6.
Lemma 4.7 Let the definitions and notations of Lemma 4.6 hold. Then, for any function C(g_1, g_2, x), there exists ν ∈ {±1} such that

E_σ { sup_{g_1,g_2} ( C(g_1, g_2, x) + σ g_1(x) g_2(x) ) } ≤ E_σ { sup_{g_1,g_2} ( C(g_1, νg_2, x) + M_2 σ g_1(x) + M_1 σ g_2(x) ) } .
Proof (of Lemma 4.7)

E_σ { sup_{g_1,g_2} ( C(g_1, g_2, x) + σ g_1(x) g_2(x) ) }
= (1/2) sup_{g_1,g_2} ( C(g_1, g_2, x) + g_1(x) g_2(x) ) + (1/2) sup_{g_1,g_2} ( C(g_1, g_2, x) − g_1(x) g_2(x) )
= (1/2) sup_{g_1,g_2,g̃_1,g̃_2} ( C(g_1, g_2, x) + g_1(x) g_2(x) + C(g̃_1, g̃_2, x) − g̃_1(x) g̃_2(x) )
(a)= (1/2) sup_{g_1,g_2,g̃_1,g̃_2} ( C(g_1, g_2, x) + C(g̃_1, g̃_2, x) + |g_1(x) g_2(x) − g̃_1(x) g̃_2(x)| )
(b)≤ (1/2) sup_{g_1,g_2,g̃_1,g̃_2} ( C(g_1, g_2, x) + C(g̃_1, g̃_2, x) + M_1 |g_2(x) − g̃_2(x)| + M_2 |g_1(x) − g̃_1(x)| ) ,   (4.2)

where (a) is due to the symmetry of the expression over which the supremum is taken and (b) is immediate, using the following inequality,

|g_1(x) g_2(x) − g̃_1(x) g̃_2(x)| = |g_1(x)(g_2(x) − g̃_2(x)) + g̃_2(x)(g_1(x) − g̃_1(x))| ≤ M_1 |g_2(x) − g̃_2(x)| + M_2 |g_1(x) − g̃_1(x)| .
Next, we denote by g_1*, g_2*, g̃_1*, g̃_2* the functions for which the supremum in (4.2) is achieved, and address all cases of the signs of the terms inside the absolute values in (4.2).
Case 1: g_2*(x) > g̃_2*(x), g_1*(x) > g̃_1*(x).

sup_{g_1,g_2,g̃_1,g̃_2} { C(g_1, g_2, x) + C(g̃_1, g̃_2, x) + M_1 (g_2(x) − g̃_2(x)) + M_2 (g_1(x) − g̃_1(x)) }
= sup_{g_1,g_2} { C(g_1, g_2, x) + M_1 g_2(x) + M_2 g_1(x) } + sup_{g̃_1,g̃_2} { C(g̃_1, g̃_2, x) − M_1 g̃_2(x) − M_2 g̃_1(x) }
= 2 E_σ sup_{g_1,g_2} { C(g_1, g_2, x) + M_1 σ g_2(x) + M_2 σ g_1(x) } .
Case 2: g_2*(x) > g̃_2*(x), g_1*(x) < g̃_1*(x).

sup_{g_1,g_2,g̃_1,g̃_2} { C(g_1, g_2, x) + C(g̃_1, g̃_2, x) + M_1 (g_2(x) − g̃_2(x)) + M_2 (g̃_1(x) − g_1(x)) }
(a)= sup_{g_1,g_2,g̃_1,g̃_2} { C(g_1, −g_2, x) + C(g̃_1, −g̃_2, x) + M_1 (g̃_2(x) − g_2(x)) + M_2 (g̃_1(x) − g_1(x)) }
= 2 E_σ sup_{g_1,g_2} { C(g_1, −g_2, x) + M_1 σ g_2(x) + M_2 σ g_1(x) } ,
where (a) is due to the assumption that G2 is closed under negation. Notice that the cases
g2∗ (x) < g̃2∗ (x), g1∗ (x) < g̃1∗ (x) and g2∗ (x) < g̃2∗ (x), g1∗ (x) > g̃1∗ (x) are analogous to cases 1 and
2 respectively, thus Lemma 4.7 is proved.
¤
Proof (of Lemma 4.6)
Suitably setting C(g_1, g_2, x) in Lemma 4.7, we have

E_{σ_1^N} { sup_{g_1,g_2} Σ_{n=1}^N σ_n g_1(x_n) g_2(x_n) } ≤ E_{σ_1^N} { sup_{g_1,g_2} ( Σ_{n=2}^N ν_1 σ_n g_1(x_n) g_2(x_n) + M_2 σ_1 g_1(x_1) + M_1 ν_1 σ_1 g_2(x_1) ) } ,   (4.3)

where ν_1 ∈ {±1}. Since σ_n ∈ {±1}, ν_1 σ_n ∈ {±1} for all n = 2, ..., N as well. Thus, (4.3) implies
E_{σ_1^N} { sup_{g_1,g_2} Σ_{n=1}^N σ_n g_1(x_n) g_2(x_n) } ≤ E_{σ_1^N} { sup_{g_1,g_2} ( Σ_{n=2}^N σ_n g_1(x_n) g_2(x_n) + M_2 σ_1 g_1(x_1) + M_1 ν_1 σ_1 g_2(x_1) ) } .
Carrying out this procedure sequentially N times, with a suitable redefinition of C(g_1, g_2, x) each time, it is easy to see that

E_{σ_1^N} { sup_{g_1,g_2} Σ_{n=1}^N σ_n g_1(x_n) g_2(x_n) }
≤ E_{σ_1^N} { sup_{g_1,g_2} ( M_2 Σ_{n=1}^N σ_n g_1(x_n) + M_1 Σ_{n=1}^N Γ(n) σ_n g_2(x_n) ) }
= M_2 E_{σ_1^N} { sup_{g_1} Σ_{n=1}^N σ_n g_1(x_n) } + M_1 E_{σ_1^N} { sup_{g_2} Σ_{n=1}^N Γ(n) σ_n g_2(x_n) } ,

where Γ(n) = Π_{i=n}^N ν_i. Recall that ν_i ∈ {±1} for all i = 1, 2, ..., N, thus Γ(n) ∈ {±1} for all n = 1, 2, ..., N. So, by redefining σ_n = Π_{i=n}^{N−1} ν_i σ_n for all n in the second term of the last equality, we complete the proof of Lemma 4.6. □

Notice that Lemma 4.6 implies the following corollary.
Corollary 4.8 For every m = 1, 2, . . . , M define Fm as in Lemma 4.5. Then,
R̂N (Fm ) ≤ MHm R̂N (Am ) + MAm R̂N (Hm ) .
Remark. We emphasize that Corollary 4.8 is tight. To see that, set the gating functions to
be independent of x. In such a case R̂N (Am ) = 0 and an equality is obtained.
Remark. Using the fact that A_m H_m = (1/4) ( (A_m + H_m)² − (A_m − H_m)² ) and the Lipschitz property of the quadratic function over a bounded domain, it can be shown that

R̂_N(F_m) ≤ (M_{H_m} + M_{A_m}) ( R̂_N(A_m) + R̂_N(H_m) ) .

Even though it is much easier to prove this result than to prove Corollary 4.8, the bound it provides is looser.
Thus, the problem of bounding R̂N (Fm ) boils down to bounding R̂N (Am ) and R̂N (Hm ).
To do so, we introduce the following Lemma, which is a variant of Lemma 3.8 in a form that
is more suitable for our current discussion.
Lemma 4.9 Let Θ ⊆ R^k denote the feasible set of some parameter θ and let τ(θ, x) : R^k → R be some parameterized function. Define the class of mappings F : R^k → R as F = { f(τ(θ, x)) : θ ∈ Θ } and assume that every function f ∈ F is Lipschitz with constant L_F. Then,

E_σ { sup_{θ∈Θ} (1/N) Σ_{n=1}^N σ_n f(τ(θ, x_n)) } ≤ L_F E_σ { sup_{θ∈Θ} (1/N) Σ_{n=1}^N σ_n τ(θ, x_n) } .
To minimize the technical burden, the experts are assumed to be generalized linear models (glim, see [26]), i.e. τ(v_m, x) = v_m^T x in Assumption 4.2. An extension to radial basis functions (rbf), i.e. τ(v_m, x) = ||v_m − x||², is immediate using our analysis of local gating functions. Extensions to many other types can be achieved using similar techniques.
The Lipschitz property of the class H_m along with Lemma 4.9 implies

R̂_N(H_m) ≤ (L_{h_m}/N) E_σ { sup_{v_m} v_m^T Σ_{n=1}^N σ_n x_n } .
Obviously, the supremum for each value of m is obtained when v_m is chosen to be in the direction of Σ_{n=1}^N σ_n x_n with ||v_m|| = V_max^m. This choice of v_m leads to
R̂_N(H_m) ≤ (L_{h_m} V_max^m / N) E_σ { || Σ_{n=1}^N σ_n x_n || }
= (L_{h_m} V_max^m / N) E_σ { √( Σ_{j=1}^k ( Σ_{n=1}^N σ_n x_{nj} )² ) }
(a)≤ (L_{h_m} V_max^m / N) √( E_σ { Σ_{j=1}^k ( Σ_{n=1}^N σ_n x_{nj} )² } )
= (L_{h_m} V_max^m / N) √( E_σ { Σ_{j=1}^k Σ_{n=1}^N Σ_{p=1}^N σ_n σ_p x_{nj} x_{pj} } )
= L_{h_m} V_max^m x̄ / √N ,

where (a) is due to Jensen's inequality and x̄ = √( N^{−1} Σ_{j=1}^k Σ_{n=1}^N x_{nj}² ).
For half-space gating functions, a similar argument yields

R̂_N(A_m) ≤ L_{a_m} W_max^m x̄ / √N .   (4.4)

For the case of local gating functions we define a_m(w_m, x) = a_m( (1/2) ||w_m − x||² ). Using the Lipschitz property of the class A_m yields
R̂_N(A_m) = E_σ { sup_{w_m} (1/N) Σ_{n=1}^N σ_n a_m( ||w_m − x_n||² / 2 ) }
≤ (L_{a_m}/(2N)) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ||w_m − x_n||² }
= (L_{a_m}/(2N)) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ( ||w_m||² − 2⟨w_m, x_n⟩ + ||x_n||² ) }
(a)= (L_{a_m}/(2N)) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ( ||w_m||² − 2⟨w_m, x_n⟩ ) }
≤ (L_{a_m}/(2N)) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ||w_m||² } + (L_{a_m}/N) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ⟨w_m, x_n⟩ } ,   (4.5)
4.3. ESTABLISHING RISK BOUNDS FOR MOE CLASSIFIERS
where (a) holds because Eσ
nP
N
n=1
σn kxn k
2
61
o
= 0. We compute each term of (4.5) separately.
First, observe that an argument similar to the one used for half-space gating functions can be used to bound the second term,

(L_{a_m}/N) E_σ { sup_{w_m} Σ_{n=1}^N σ_n ⟨w_m, x_n⟩ } ≤ L_{a_m} W_max^m x̄ / √N .
To bound the first term of (4.5), notice that

sup_{w_m} ||w_m||² Σ_{n=1}^N σ_n = I[ Σ_{n=1}^N σ_n > 0 ] (W_max^m)² Σ_{n=1}^N σ_n .
The average of this expression over all possible σ is bounded as follows,

E_σ { I[ Σ_{n=1}^N σ_n > 0 ] Σ_{n=1}^N σ_n }
= E_σ { √( ( I[ Σ_{n=1}^N σ_n > 0 ] Σ_{n=1}^N σ_n )² ) }
(a)≤ √( E_σ { I[ Σ_{n=1}^N σ_n > 0 ] ( Σ_{n=1}^N σ_n )² } )
= √( (1/2) E_σ { Σ_{n=1}^N Σ_{p=1}^N σ_n σ_p } )
= √( N/2 ) ,
where (a) is due to Jensen's inequality. Combining all of the above, the Rademacher complexity of local gating functions is upper bounded as

R̂_N(A_m) ≤ (L_{a_m}/√N) ( (W_max^m)² / √8 + W_max^m x̄ ) .
We summarize our results in the following Theorem.
Theorem 4.10 Let F be the class of mixture of experts classifiers with M glim experts. Assume that gates 1, 2, ..., M_1 are local and M_1+1, ..., M are half-space, where 0 ≤ M_1 ≤ M. Then, the Rademacher complexity of F satisfies

R̂_N(F) ≤ (1/√N) [ Σ_{m=1}^{M_1} c_{1,m} (W_max^m)² + Σ_{m=1}^M c_{2,m} W_max^m + Σ_{m=1}^M c_{3,m} V_max^m ] ,   (4.6)

where c_{1,m} = M_{H_m} L_{a_m} / √8, c_{2,m} = M_{H_m} L_{a_m} x̄ and c_{3,m} = M_{A_m} L_{h_m} x̄ for all m = 1, 2, ..., M.
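The bound (4.6) is straightforward to evaluate from the data. The sketch below uses the global constants L_a, L_h, M_H, M_A instead of the per-expert ones (a simplification), and computes x̄ exactly as in the derivation above:

```python
import numpy as np

def rademacher_bound_4_6(X, W_max, V_max, M1, L_a, L_h, M_H, M_A):
    """Right-hand side of (4.6) with global constants.

    X: data matrix of shape (N, k); W_max, V_max: arrays with the M radii;
    gates 0, ..., M1-1 are local, the remaining ones are half-space.
    """
    N = X.shape[0]
    x_bar = np.sqrt(np.sum(X ** 2) / N)   # x_bar = sqrt(N^{-1} sum_j sum_n x_nj^2)
    c1 = M_H * L_a / np.sqrt(8.0)
    c2 = M_H * L_a * x_bar
    c3 = M_A * L_h * x_bar
    return (c1 * np.sum(W_max[:M1] ** 2)
            + c2 * np.sum(W_max)
            + c3 * np.sum(V_max)) / np.sqrt(N)
```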
Combining Theorems 4.10 and 3.11 results in the following Corollary.
Corollary 4.11 Let the definitions and notations of Theorem 4.10 hold. Then, for every integer N and every δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + A/√N + √( A φ̄ ln(2/δ) / N^{3/2} ) + φ̄ √( ln(2/δ) / (2N) ) + (3φ̄/(2N)) ln(2/δ) ,

where

A = 2 L_φ [ Σ_{m=1}^{M_1} c_{1,m} (W_max^m)² + Σ_{m=1}^M c_{2,m} W_max^m + Σ_{m=1}^M c_{3,m} V_max^m ] .
4.4 Summary
We introduced the MoE classifier where instead of constructing a single complicated classifier
over the entire feature space, several simpler classifiers, referred to as experts, are linearly
combined with data dependent weights, referred to as gates. Such a classifier can be used to
define a complex classification surface while minimizing the class complexity. Two general
ways for the division of the feature space were considered and risk bounds for MoE classifiers
implementing such a division were established. A possible deficiency of our bound is the
necessity of predefining the radius for the feasible set of the parameters. This predefined
radius is used in the bound to compute the class complexity. In Chapter 6 we generalize our
result to circumvent such a predefinition.
Chapter 5
Risk bounds for Hierarchical Mixture
of Experts
The MoE classifier is a linear combination of M experts. An intuitive interpretation of this
architecture corresponds to the division of the feature space into subspaces, in each of which
the experts are linearly combined using the gating functions am . The Hierarchical MoE
(HMoE) [21] takes this procedure one step further by recursively dividing the subspaces
using a ‘local’ MoE classifier as the expert in each domain. It is motivated by the idea of
local search - if the MoE is expected to simplify the problem of constructing a classifier over
the entire feature space, then using such a mixture for each of the subspaces might prove to
be the right thing to do.
In this chapter the bound obtained for MoE classifiers is expanded to the case of HMoE.
We demonstrate the procedure for the case of a two-levelled balanced hierarchy with M
experts (see Figure 5.1). It is easy to repeat the same procedure for any number of levels,
whether the HMoE is balanced or not, using the same idea.
5.1 Some preliminaries & definitions
We begin with the mathematical description of a balanced two-level HMoE classifier, described in Figure 5.1. Let f(x) be the output of the HMoE classifier and let f_m(θ_m, x) denote the output of the m-th expert for all m = 1, 2, ..., M. The parameter θ_m is comprised of all the parameters of the m-th expert, as will be detailed shortly. This is described by

f(x) = Σ_{m=1}^M f_m(θ_m, x) ,

where f_m(θ_m, x) is given by

f_m(θ_m, x) = a_m(w_m, x) Σ_{j=1}^M a_{mj}(w_{mj}, x) h_{mj}(v_{mj}, x) .   (5.1)

a_{mj}(w_{mj}, x) and h_{mj}(v_{mj}, x) are referred to as the mj-th gating function and expert, respectively. We interpret the former as the gate of the latter.
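A sketch of the two-level forward pass (5.1), reusing the illustrative sigmoid gates and tanh experts of the MoE example (the parameter shapes are assumptions):

```python
import numpy as np

def hmoe_classifier(x, W_top, W, V):
    """Balanced two-level HMoE:
    f(x) = sum_m a_m(w_m, x) sum_j a_mj(w_mj, x) h_mj(v_mj, x), eq. (5.1).

    x: shape (k,); W_top: shape (M, k); W, V: shape (M, M, k).
    """
    top_gates = 1.0 / (1.0 + np.exp(-W_top @ x))               # a_m
    inner = np.array([
        (1.0 / (1.0 + np.exp(-W[m] @ x))) @ np.tanh(V[m] @ x)  # sum_j a_mj h_mj
        for m in range(W.shape[0])
    ])
    return float(top_gates @ inner)
```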
Similar to Section 4.2, we formally define the experts and gating functions of the HMoE
classifier, followed by a formal description of our assumptions.
Definition 5.1 (HMoE experts) For every m, j = 1, 2, ..., M, let V_max^{mj} be some nonnegative scalar and v_{mj} a vector with k elements. Then, the mj-th expert is a mapping h_{mj}(v_{mj}, x) where v_{mj} ∈ V_{mj} = { v ∈ R^k : ||v|| ≤ V_max^{mj} }. We define the collection of all functions h_{mj}(v_{mj}, x) such that v_{mj} ∈ V_{mj} as H_{mj}. To simplify the notation we set

H_m = ∪_{j=1}^M H_{mj} = ∪_{j=1}^M { h_{mj}(v_{mj}, x) | v_{mj} ∈ V_{mj} } ,

and

H = ∪_{m=1}^M H_m .
Assumption 5.2 The following assumptions are made for every m, j = 1, 2, . . . , M .
1. hmj (vmj , x) is Lipschitz with constant Lhmj .
[Figure 5.1: Balanced two-level HMoE classifier with M experts.]
2. |hmj (vmj , x)| is bounded by some positive constant MHmj .
3. hmj (vmj , x) is either symmetric or antisymmetric with respect to the parameter.
As for the gating functions we have the following, analogous to Definition 4.3 and Assumption 4.4.
Definition 5.3 (HMoE gating functions) For every m, j = 1, 2, ..., M, let W_max^{mj} be a nonnegative scalar and w_{mj} a vector with k elements. Then, the mj-th gating function is a mapping a_{mj}(w_{mj}, x) where w_{mj} ∈ W_{mj} = { w ∈ R^k : ||w|| ≤ W_max^{mj} }. We define the collection of all functions a_{mj}(w_{mj}, x) such that w_{mj} ∈ W_{mj} as A_{mj}. To simplify the notation in the sequel, we set

A_m = ∪_{j=1}^M A_{mj} = ∪_{j=1}^M { a_{mj}(w_{mj}, x) | w_{mj} ∈ W_{mj} } ,

Ā_m = { a_m(w_m, x) | w_m ∈ W_m } ,

and

A = ∪_{m=1}^M ( A_m ∪ Ā_m ) .
Assumption 5.4 The following assumptions are made for every m, j = 1, 2, . . . , M .
1. amj (wmj , x), am (wm , x) are Lipschitz with constant Lamj , Lam respectively.
2. amj (wmj , x), am (wm , x) are nonnegative and bounded by some positive constants MAmj ,
MĀm respectively.
To facilitate the notation, we sometimes use L_h = max_{m,j} L_{h_{mj}}, M_H = max_{m,j} M_{H_{mj}}, L_a = max( max_m L_{a_m}, max_{m,j} L_{a_{mj}} ) and M_A = max( max_m M_{Ā_m}, max_{m,j} M_{A_{mj}} ). However, though these definitions lead to simplified bounds, it is best not to use them if one attempts to obtain the tightest possible bounds.
Before we continue, we give a formal definition of the feasible HMoE parameter set.
Definition 5.5 (Feasible HMoE parameter set) For every m = 1, 2, . . . , M set
θ_m = [w_m, w_{m1}, w_{m2}, ..., w_{mM}, v_{m1}, v_{m2}, ..., v_{mM}] ,
the parameter of the m-th (MoE)-expert at the first level of the HMoE classifier. Define
θ = [θ1 , . . . , θM ], the parameter of the HMoE. For every m = 1, 2, . . . , M set
Θm = {θm : wm ∈ Wm , wmj ∈ Wmj , vmj ∈ Vmj , j = 1, 2, . . . , M } ,
the feasible set of θm . The feasible set of θ is denoted by Θ,
Θ = {θ : θm ∈ Θm , m = 1, 2, . . . , M } .
5.2 Upper bounds for HMoE R̂_N(F)
Recall that we are seeking to bound the Rademacher complexity for the class of HMoE. We
begin with the following Lemma, a variant of Lemma 4.5, adapted for HMoE.
Lemma 5.6 Consider all parameterized functions a_m(w_m, x), a_{mj}(w_{mj}, x), h_{mj}(v_{mj}, x), defined in Definitions 5.1 and 5.3 and satisfying Assumptions 5.2 and 5.4. Define the class of mappings

F = { f(θ, x) = Σ_{m=1}^M f_m(θ_m, x) | θ ∈ Θ } ,   (5.2)

where f_m(θ_m, x) is defined in equation (5.1) and, for every m = 1, 2, ..., M, define

F_m = { f_m(θ_m, x) = a_m(w_m, x) Σ_{j=1}^M a_{mj}(w_{mj}, x) h_{mj}(v_{mj}, x) | θ_m ∈ Θ_m } .

Then,

R̂_N(F) = Σ_{m=1}^M R̂_N(F_m) .   (5.3)
Proof
The proof is immediate, using the independence of the parameters, as follows.

R̂_N(F) = E_σ { sup_θ (1/N) Σ_{n=1}^N σ_n Σ_{m=1}^M f_m(θ_m, x_n) }
= E_σ { sup_{θ_1,...,θ_M} (1/N) Σ_{m=1}^M Σ_{n=1}^N σ_n f_m(θ_m, x_n) }
= E_σ { Σ_{m=1}^M sup_{θ_m} (1/N) Σ_{n=1}^N σ_n f_m(θ_m, x_n) }
= Σ_{m=1}^M E_σ { sup_{θ_m} (1/N) Σ_{n=1}^N σ_n f_m(θ_m, x_n) } ,

which, by definition, completes the proof of Lemma 5.6. □
Thus, our problem boils down to bounding the summands in (5.3). Notice that for every m = 1, ..., M we have sup_{θ_m} | Σ_{j=1}^M a_{mj}(w_{mj}, x_n) h_{mj}(v_{mj}, x_n) | ≤ M M_H M_A. Using Corollary 4.8 recursively twice leads to the following result.
R̂_N(F_m) ≤ M M_H M_A R̂_N(Ā_m) + M_A E_σ { sup_{θ_m} (1/N) Σ_{n=1}^N σ_n Σ_{j=1}^M a_{mj}(w_{mj}, x_n) h_{mj}(v_{mj}, x_n) }
= M M_H M_A R̂_N(Ā_m) + M_A Σ_{j=1}^M E_σ { sup_{θ_{mj}} (1/N) Σ_{n=1}^N σ_n a_{mj}(w_{mj}, x_n) h_{mj}(v_{mj}, x_n) }
≤ M M_H M_A R̂_N(Ā_m) + M_A Σ_{j=1}^M ( M_H R̂_N(A_{mj}) + M_A R̂_N(H_{mj}) )
≤ M M_A ( M_H R̂_N(Ā_m) + M_H R̂_N(A_m) + M_A R̂_N(H_m) )
≤ M M_A [ 2 M_H R̂_N(A) + M_A R̂_N(H) ] ,   m = 1, 2, ..., M .   (5.4)
Notice that each of the last three inequalities can be used to bound the Rademacher complexity of HMoE classifiers. When one wishes to use it within a risk bound, the ease of
notation should be considered as well as the tightness of the bound. Although in Theorem
5.7 the simplest bound is described, in Chapter 7 the tighter one is used.
Combining (5.4) and (5.3) with Theorem 3.11 implies the following Theorem.
Theorem 5.7 Let F be the class of balanced two-level HMoE classifiers with glim experts and gating functions. Then, for every integer N and every δ ∈ (0, 1), with probability at least 1−δ over training sequences of length N, every f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + (c_1 M² / √N) (W_max + V_max) + √( (c_2 M² / N^{3/2}) (W_max + V_max) ) + φ̄ √( ln(2/δ) / (2N) ) + c_3 / N ,

where c_1 = 2 L_φ max(M_H L_a, M_A L_h) x̄, c_2 = 2 L_φ max(M_H L_a, M_A L_h) x̄ φ̄ ln(2/δ) and c_3 = 1.5 φ̄ ln(2/δ).
5.3 Summary
The HMoE classifier was introduced and interpreted as a recursive implementation of the MoE classifier. While, both conceptually and practically, any classifier implemented by an HMoE classifier can also be implemented using a MoE classifier, the HMoE has the advantage of a recursive implementation. The main result of this Chapter is given by Theorem 5.7, where the risk of every two-level balanced HMoE classifier is upper bounded. If one is willing to pay in terms of bound simplicity, the bound can be tightened further.
Chapter 6
Fully data dependent bounds
The risk bound given in Theorem 3.11, with the Rademacher complexity used as a measure
of the class complexity, served as the starting point of our investigation. While this Theorem
holds for any class, the focus of our discussion was the MoE and HMoE classifiers. The major
problem was to upper bound the Rademacher complexity of these classes. Upon solving this
problem, the Rademacher complexity bounds were used in a plug-in procedure to establish
the risk bounds given in Corollary 4.11 and Theorem 5.7.
One deficiency of these bounds is the dependence on some preset parameters, such as W_max^m and V_max^m, m = 1, 2, ..., M. These are used to define the feasible set of the classifier parameters, consequently regularizing the class complexity. However, this presetting is problematic as it is difficult to know in advance how to set these parameters. In this Chapter a risk bound that requires no prior setting of parameters, except for the initialization of M, is established. Although only the MoE classifier is considered, the same technique can easily be harnessed to derive similar results for the case of HMoE as well.
To obtain a fully data dependent bound, we need to replace the preset parameters by some
term that is exclusively data and algorithm dependent. This means that it only depends on
the data DN and the classifier selected by the algorithm, or the parameters by which it is
defined. No presetting is allowed except for M , which is referred to as the hyper parameter.
6.1 Preliminaries
The technique used in [29, 9] is adapted to derive a fully data dependent bound for MoE
classifiers. The basic idea is given by the following steps.
1. Define a grid of possible values for W_max^m and V_max^m, m = 1, 2, ..., M, for each of which Corollary 4.11 holds.
2. Assign each of these grid points with a positive weight, such that the sum of all the
weights is 1.
3. Use a variant of the union bound to establish a risk bound that holds for every possible
choice of the classifier parameter.
We begin by introducing the so-called 'Multiple Testing Lemma' and demonstrate the way it is used to eliminate the dependence of a bound on a preset parameter (Subsection 6.1). The same technique is then used to generalize Corollary 4.11 into a fully data dependent bound (Subsection 6.2.2). The so-called 'Multiple Testing Lemma' provides a bound for the probability that a collection of events occurs simultaneously, given the probability that each of them occurs individually. Before providing this Lemma, let us introduce the notion of a test.
Definition 6.1 Define a (mapping) test Γ : (S, δ) 7→ {TRUE, FALSE} where S is some
sample and δ is a confidence level. Denote the logical value of the test as Γ(S, δ).
The following Lemma is a variant of the union bound, taken from [14].
Lemma 6.2 (Multiple Testing Lemma) Assume we are given a set of tests {Γ_q}_{q=1}^∞ with an associated discrete probability measure {p_q}_{q=1}^∞. If for every q = 1, 2, ... and any δ ∈ (0, 1), P{Γ_q(D_N, δ)} ≥ 1 − δ, then

P { Γ_1(D_N, δp_1) ∧ ... ∧ Γ_q(D_N, δp_q) ∧ ... } ≥ 1 − δ .
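Operationally, the Lemma only requires choosing summable weights p_q and running the q-th test at the deflated confidence level δ p_q. A small sketch using the weights p_q = 1/(q(q+1)) adopted later in the proof of Corollary 6.3:

```python
def p_weight(q):
    """Weights p_q = 1/(q(q+1)); they sum to 1 over q = 1, 2, ..."""
    return 1.0 / (q * (q + 1))

delta = 1e-4
# Confidence levels at which the first few tests must individually hold
# so that all tests hold simultaneously with probability >= 1 - delta.
levels = [delta * p_weight(q) for q in range(1, 6)]
```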
A simple demonstration of the technique
Consider the class of soft linear classifiers defined over R, F = { f(λ, x) = sin(λx) : |λ| ≤ Λ }, for some given finite positive scalar Λ. According to (4.4), the Rademacher complexity of F can be upper bounded by

R̂_N(F) ≤ Λ x̄ / √N ,

where x̄ = √( N^{−1} Σ_{n=1}^N x_n² ). Then, according to (3.7), with probability at least 1−δ every f ∈ F satisfies
P_e(f) ≤ Ê_φ(f, D_N) + C(Λ) + φ̄ √( ln(2/δ) / (2N) ) ,   (6.1)

where C(Λ) = 2 L_φ Λ x̄ / N^{1/2} + √( 2 L_φ φ̄ Λ x̄ ln(2/δ) / N^{3/2} ) + 3φ̄ ln(2/δ) / (2N). The presetting of Λ, resulting in the definition of the feasible set for λ, is problematic since we do not necessarily know the best initialization for it. So, we would like to consider every λ ∈ R. Unfortunately, by doing so we implicitly set Λ to infinity, rendering the risk bound (6.1) useless. Lemma 6.2 can be used to overcome this difficulty, leading to the following result.
Corollary 6.3 Set some positive constant g_0 and define the class of soft classifiers F = { f(λ, x) = sin(λx) : λ ∈ R, |λ| > g_0 }. For every function f define q(λ) = ⌊ 2 + log₂( |λ|/g_0 ) ⌋. Then, with probability at least 1−δ over training sequences of length N, every function f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + C( g_0 2^{q(λ)−1} ) + φ̄ √( [ ln(2/δ) + 2 ln( 2^{1/2} log₂( 4|λ|/g_0 ) ) ] / (2N) ) ,

where C(t) = 2 L_φ t x̄ / N^{1/2} + √( 2 L_φ φ̄ t x̄ [ ln(2/δ) + 2 ln( 2^{1/2} log₂( 4|λ|/g_0 ) ) ] / N^{3/2} ) + (3φ̄/(2N)) [ ln(2/δ) + 2 ln( 2^{1/2} log₂( 4|λ|/g_0 ) ) ].
.
Remark. The condition |λ| > g0 is stated here for convenience. In the result for the MoE
classifier it will be removed.
Proof
For all q = 1, 2, . . . set Λq = 2q−1 g0 for some positive constant g0 and define the class
74
CHAPTER 6. FULLY DATA DEPENDENT BOUNDS
Fq = {f (λ, x) = sin(λx) : |λ| ≤ Λq }. Also, define the test Γq (DN , δ) to be TRUE if
s
ln 2δ
2N
Pe (f ) ≤ Êφ (f, DN ) + C(Λq ) + φ̄
for all f ∈ Fq and FALSE otherwise. Then, according to (6.1), P {Γq (DN , δ)} ≥ 1 − δ. Next,
P
let {pq }∞
be a set of positive weights such that ∞
q=1 pq = 1, where for concreteness we set
q=1
o∞
n
1
. According to Lemma 6.2,
pq = q(q+1)
q=1
P {Γq (DN , δpq ) for all q = 1, 2, . . .} ≥ 1 − δ .
Thus, P {Γq (DN , δpq )} ≥ 1 − δ for every q = 1, 2, . . ., which means that with probability at
least 1 − δ, every f ∈ Fq satisfies
s
Pe (f ) ≤ Êφ (f, DN ) + C(Λq ) + φ̄
ln δp2q
2N
,
for every q = 1, 2, . . ..
Now, assume that we wish to have a risk bound for any classifier from the class F̃ = { f(λ, x) = sin(λx) : |λ| = Ω } for some positive scalar Ω > g_0. Setting t = ⌊ 2 + log₂( Ω/g_0 ) ⌋, we have

Λ_t = g_0 2^{t−1} = g_0 2^{1+⌊log₂(Ω/g_0)⌋} > g_0 2^{log₂(Ω/g_0)} = Ω ,

and

Λ_t ≤ g_0 2^{1+log₂(Ω/g_0)} = 2Ω ,
which means that out of the set of classes {F_q}_{q=1}^∞, F_t is the smallest class for which F̃ ⊂ F_q. Combining all of the above with the inequality

ln(1/p_t) = ln t(t+1) ≤ ln 2t² = 2 ln( 2^{1/2} t ) = 2 ln( 2^{1/2} ⌊ 2 + log₂(Ω/g_0) ⌋ ) ≤ 2 ln( 2^{1/2} log₂( 4Ω/g_0 ) )

completes the proof of Corollary 6.3. □
6.2 Fully data dependent risk bound for MoE classifiers
We now turn to the derivation of fully data dependent risk bounds for MoE classifiers,
using a generalization of the technique demonstrated in the previous section for a simple
case. However, before providing the result, we begin with some definitions and preliminary
results.
6.2.1 Some definitions, notations & preliminary results
Consider the MoE classifier given in Figure 4.1, where the classifier's parameter vector is given by

θ = [w_1, w_2, ..., w_M, v_1, v_2, ..., v_M] .   (6.2)
To provide a formal definition of Θ, the feasible set of θ, define the following collection of matrices,

A_i(p, q) = 1 if k(i−1)+1 ≤ p, q ≤ ki and p = q, and A_i(p, q) = 0 otherwise,

for i = 1, 2, ..., 2M. According to Definitions 4.1 and 4.3, Θ is given by

Θ(b) = { θ : κ_1(θ) ≤ b_1, κ_2(θ) ≤ b_2, ..., κ_{2M}(θ) ≤ b_{2M} ; θ ∈ R^{2kM} } ,   (6.3)

where

κ_i(θ) = √( θ^T A_i θ ) ,   i = 1, 2, ..., 2M,   (6.4)
and b is a vector consisting of the following elements,

b_i = W_max^i if i = 1, 2, ..., M, and b_i = V_max^{i−M} if i = M+1, M+2, ..., 2M.
Thus, Θ is a subset of R^{2kM}, comprised of 2M balls which have no common elements (except for the origin), defined by 2M constraints.
Denote by Z_+^t the t-dimensional grid of positive integers and let μ : Z_+^{2M} → Z_+^1 be some mapping such that

q = μ(q_1, ..., q_{2M}) ≤ Π_{i=1}^{2M} q_i ,   (6.5)

where (q_1, ..., q_{2M}) ∈ Z_+^{2M} is some set of indices. To be concrete, it is easy to see that (6.5) is satisfied by μ(q_1, ..., q_{2M}) = 1 + Π_{i=1}^{2M} (q_i − 1). Set
g = [g_1, g_2, ..., g_{2M}] ∈ R_+^{2M}   (6.6)

to be an initial gain, and define the vector

β_q = [β_{1,q_1}, β_{2,q_2}, ..., β_{2M,q_{2M}}] ,   (6.7)

where, for all i = 1, 2, ..., 2M,

β_{i,q_i} = g_i s^{q_i}   (6.8)

for some s > 1. Interpreting β_q as a special case of b, the feasible parameter set is now given by

Θ(β_q) = { θ : κ_1(θ) ≤ β_{1,q_1}, κ_2(θ) ≤ β_{2,q_2}, ..., κ_{2M}(θ) ≤ β_{2M,q_{2M}} ; θ ∈ R^{2kM} } .
Next, let {p_q}_{q=1}^∞ be the set of weights given in the previous section. For every parameter θ ∈ Θ and each constraint index i = 1, 2, ..., 2M let q_i(θ) be the smallest index for which κ_i(θ) ≤ β_{i,q_i(θ)}, that is, q_i(θ) = ⌈ log_s( κ_i(θ)/g_i ) ⌉ if κ_i(θ) > g_i and q_i(θ) = 1 otherwise. Notice that since q_i(θ) was chosen to be the smallest index for which this condition holds, then, assuming κ_i(θ) > g_i,

κ_i(θ) ≥ β_{i,q_i(θ)−1} = (1/s) β_{i,q_i(θ)} .
Setting κ̃_i(θ) = s max( κ_i(θ), g_i ), we have

κ̃_i(θ) ≥ β_{i,q_i(θ)} = g_i s^{q_i(θ)} ,   (6.9)

which implies

log_s( κ̃_i(θ)/g_i ) ≥ q_i(θ) .

Combining this with (6.5) results in

q(θ) = μ( q_1(θ), q_2(θ), ..., q_{2M}(θ) ) ≤ Π_{i=1}^{2M} q_i(θ) ≤ Π_{i=1}^{2M} log_s( κ̃_i(θ)/g_i ) ,
6.2. FULLY DATA DEPENDENT RISK BOUND FOR MOE CLASSIFIERS
which, in turn, implies
µ
ln
6.2.2
¶
1
pq(θ)
77
³ 1
´
≤ 2 ln 2 2 q(θ)
à 2M
µ
¶!
Y
1
κ̃i (θ)
≤ 2 ln 2 2
logs
gi
i=1
µ
µ
¶¶
2M
X
1
κ̃i (θ)
=2
ln 2 4M logs
.
g
i
i=1
Establishing a fully data dependent bound
We are now ready to derive a fully data dependent bound for MoE classifiers. This is achieved by replacing the vector b in (6.3) with another vector, explicitly defined by the parameters of the selected classifier. Consider the risk bound given in Corollary 4.11 (the notations here are changed for convenience),

P_e(f) ≤ Ê_φ(f, D_N) + C(β, δ) + φ̄ √( ln(2/δ) / (2N) ) ,
where

C(β, δ) = (1/√N) R(β) + √( φ̄ ln(2/δ) R(β) / N^{3/2} ) + 3φ̄ ln(2/δ) / (2N) ,   (6.10)

β = ( W_max^1, W_max^2, ..., W_max^M, V_max^1, V_max^2, ..., V_max^M ) ,   (6.11)

R(β) = 2 c L_φ ( Σ_{m=1}^{M_1} β(m)² + Σ_{m=1}^M β(m) + Σ_{m=1}^M β(m+M) ) ,   (6.12)

c = max{ M_H L_a / √8, M_H L_a x̄, M_A L_h x̄ }   (6.13)

(it should be noted that this bound can be tightened by using the constants given in Corollary 4.11 instead of the constant c) and β(m) is the m-th element of β. For every q = 1, 2, ... set
F_q = { f : ||w_m|| ≤ β_{m,q_m}, ||v_m|| ≤ β_{m+M,q_{m+M}}, m = 1, 2, ..., M } ,

that is, β_q is interpreted as the radius of the feasible parameter sets by which F_q is defined.
Define the test Γ_q(D_N, δ) to be TRUE if

P_e(f) ≤ Ê_φ(f, D_N) + C(β_q, δ) + φ̄ √( ln(2/δ) / (2N) ) ,
and FALSE otherwise. According to Corollary 4.11, P{Γ_q(D_N, δ)} ≥ 1−δ for all q = 1, 2, .... Thus, by Lemma 6.2,

P { Γ_q(D_N, δp_q) for all q = 1, 2, ... } ≥ 1 − δ ,

which means that with probability at least 1 − δ, every f ∈ F_q satisfies

P_e(f) ≤ Ê_φ(f, D_N) + C(β_q, δp_q) + φ̄ √( ln( 2/(δp_q) ) / (2N) ) ,

for every q = 1, 2, .... Finally, recall (from (6.9)) that for every i = 1, 2, ..., 2M, β_{i,q_i(θ)} ≤ κ̃_i(θ), which, combined with the monotonicity of C(β_q, δp_q) with respect to each element of β_q, leads to the following result.
Theorem 6.4 Let F be the class of MoE classifiers with M glim experts and assume that gates 1, 2, ..., M_1 are local and M_1+1, ..., M are half-space, where 0 ≤ M_1 ≤ M. Denote by θ ∈ R^{2kM} the parameter of the selected classifier and let definitions (6.4), (6.6)–(6.8) hold. For every i = 1, 2, ..., 2M set κ̃_i(θ) = s max( κ_i(θ), g_i ) and define κ̃(θ) = [κ̃_1(θ), κ̃_2(θ), ..., κ̃_{2M}(θ)]. Then, with probability at least 1−δ over training sequences of length N, every function f ∈ F satisfies

P_e(f) ≤ Ê_φ(f, D_N) + R(κ̃(θ))/√N + √( φ̄ R(κ̃(θ)) C(κ̃(θ), δ) / N^{3/2} ) + φ̄ √( C(κ̃(θ), δ) / (2N) ) + (3φ̄/(2N)) C(κ̃(θ), δ) ,   (6.14)

where

C(κ̃(θ), δ) = ln(2/δ) + 2 Σ_{i=1}^{2M} ln( 2^{1/(4M)} log_s( κ̃_i(θ)/g_i ) )

and R(κ̃(θ)) is defined in (6.12).
Interpreting θ as the parameter of the selected classifier, a few observations may shed some
light on this result.
1. Notice that the numbers g_i, i = 1, 2, ..., 2M, are essentially scaling factors for each of the constraints. If, for some classifier f, we have κ_i(θ) ≤ g_i for all i, then the bound is independent of θ. To see that, observe that θ influences the bound through κ̃_i(θ). If κ_i(θ) ≤ g_i for all i then κ̃_i(θ) = s g_i, independent of θ.
2. If, on the other hand, we have κi (θ) ≥ gi for all i, then κ̃i (θ) = sκi (θ), rendering the
bound explicitly dependent on θ.
3. The only constraint on the parameter s is to be larger than one. Thus, we obtain the
tightest bound by using s that minimizes the right hand side of (6.14). If we set s to
be small, then the logarithmic term increases, but the complexity term decreases as a
small s enables us to refine the grid. On the other hand, by setting s to be large, we
reduce the logarithmic term, but the complexity term increases.
4. At first glance, it seems like (4.6) is tighter than (6.14), but this is not always the case. On the one hand, (6.14) exhibits the weakness of losing tightness due to the logarithmic term, but on the other it enables us to use the parameter of the selected classifier rather than the boundaries of the feasible parameter set, potentially tightening the bound in cases where the selected classifier is characterized by small values of θ_i.
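A sketch of the data dependent penalty C(κ̃(θ), δ) in Theorem 6.4, given the selected parameter θ. The grouping of θ into 2M blocks of k coordinates follows (6.4), while g and s are user-chosen as discussed above:

```python
import numpy as np

def penalty_C(theta, g, s, delta, k):
    """ln(2/delta) + 2 sum_i ln(2^{1/(4M)} log_s(kappa_tilde_i / g_i)),
    where kappa_i(theta) is the norm of the i-th block of k coordinates."""
    blocks = np.asarray(theta).reshape(-1, k)      # 2M blocks of size k
    kappa = np.linalg.norm(blocks, axis=1)         # kappa_i(theta), eq. (6.4)
    kappa_tilde = s * np.maximum(kappa, g)         # kappa_tilde_i(theta)
    M = len(kappa) // 2
    logs = np.log(kappa_tilde / g) / np.log(s)     # log_s(kappa_tilde_i / g_i) >= 1
    return np.log(2.0 / delta) + 2.0 * np.sum(np.log(2.0 ** (1.0 / (4.0 * M)) * logs))
```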
6.3 Summary
This chapter concludes the theoretical part of our work, providing fully data dependent risk bounds for the class of MoE classifiers. The bound given in Theorem 6.4 holds for every MoE classifier, depending only on the training sequence D_N and the parameter of the selected classifier. The most significant improvement of this bound with respect to previous results is that it can be used to select the classifier itself rather than the class from which the classifier is drawn. This can be achieved by minimizing the bound over the parameter θ.
Chapter 7
Numerical experiments
The main focus of this work has been on the theoretical aspects of model selection for MoE
& HMoE classifiers. Indeed, in Chapter 4 through Chapter 6 several risk bounds for these
classifiers were established. To complete our work, some numerical experiments to test these
bounds were carried out. This task proved to be non-trivial due to the difficulty of setting
the number of experts and the non-convex nature of the loss function, which will be discussed
in the sequel. It should be noted that previous methods for learning the parameters of the
MoE classifier were based on gradient methods for maximizing the likelihood or minimizing
some loss function. Such approaches are prone to problems of local optima, which render
standard gradient descent approaches of limited use. This problem also occurs for the EM
algorithm discussed in [21]. Notice that even if φ(yf (x)) is convex with respect to yf (x), this
does not imply that it is convex with respect to the parameters of f (x). The deterministic
annealing EM algorithm proposed in [30] attempts to address the local maxima problem, using a modified posterior distribution parameterized by a temperature-like parameter. A modification of the EM algorithm, the split-and-merge EM algorithm proposed in [15], deals with certain types of local maxima involving an unbalanced usage of the experts over the feature space.
In this chapter two types of numerical experiments are discussed. In one type synthetic
data sets are considered, and in the other type real data sets, taken from [6], are used. Due to
the different nature of these data sets two protocols, one for each type of data set, were used
for the numerical experiments, as described in Section 7.1. Section 7.2 provides a description
of the algorithmic problems and the solutions used in the experiments. The results of the
experiments are discussed in Sections 7.3 and 7.4.
7.1 Numerical experiments protocol
As mentioned above, two types of numerical experiments were carried out, one using a
synthetic data set and the other based on real world data sets, taken from [6]. Since the real
world data sets are of a fixed size, typically a few hundred samples, there was a need to set
a procedure to evaluate the classifier’s performance. This problem does not occur when the
synthetic data is used because a test sequence of arbitrary length can be generated. Thus,
we define two protocols for the numerical experiments, one for each type of data set.
7.1.1 Synthetic data
When one wishes to select a classifier from some class of classifiers, a fundamental problem
that needs to be solved is the problem of model selection. That is, what is the class of
mappings from which the classifier is to be drawn. In the context of our discussion we need
to decide, given the training sequence, what is the most suitable number of experts, M , for
the underlying p.d.f. If M is set to be too small, then underfitting occurs. On the other
hand, setting this parameter to be too large may result in overfitting. In the experiments
with synthetic data we consider several possible values for M , each of which is used to define
a different class according to
FM = { f : f (θ, x) = Σ_{m=1}^{M} am (wm , wm,0 , x) tanh(vm^T x + vm,0 ) },
where
θ = [w1 , w1,0 , w2 , w2,0 , . . . , wM , wM,0 , v1 , v1,0 , v2 , v2,0 , . . . , vM , vM,0 ].
The gating functions am (wm , wm,0 , x), m = 1, 2, . . . , M , are defined by

am (wm , wm,0 , x) = (1/2)(1 + tanh(wm^T x + wm,0 ))

for half-space gates and

am (wm , wm,0 , x) = exp(−‖wm − x‖²/wm,0 )

for local gates. For every M = 1, 2, . . . , 5 a classifier is selected based on the training
sequence. The true risk of these classifiers is then compared to the prediction of the risk bounds. It is desirable that the class over which the risk bound is minimized be the same class from which the classifier with the lowest risk is drawn.
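For concreteness, the following minimal Python sketch (ours, assuming NumPy; the function and variable names are not from the thesis) evaluates a classifier from FM at a point x, with either gate type, exactly as defined above; note that the gates are not normalized over m.

    import numpy as np

    def moe_predict(theta_w, theta_v, x, gate="half-space"):
        # f(theta, x) = sum_m a_m(w_m, w_m0, x) * tanh(v_m^T x + v_m0)
        # theta_w: list of (w_m, w_m0) gate parameters; theta_v: list of (v_m, v_m0).
        f = 0.0
        for (w, w0), (v, v0) in zip(theta_w, theta_v):
            if gate == "half-space":
                a = 0.5 * (1.0 + np.tanh(w @ x + w0))
            else:  # local gate; w0 acts as a (positive) width parameter
                a = np.exp(-np.linalg.norm(w - x) ** 2 / w0)
            f += a * np.tanh(v @ x + v0)
        return f

    # Example: M = 3 experts over a k = 2 dimensional feature space.
    rng = np.random.default_rng(0)
    M, k = 3, 2
    theta_w = [(rng.normal(size=k), rng.normal()) for _ in range(M)]
    theta_v = [(rng.normal(size=k), rng.normal()) for _ in range(M)]
    label = np.sign(moe_predict(theta_w, theta_v, np.array([0.5, -1.0])))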
The experiments with synthetic data are carried out according to the following stages.
Initialization
1. A p.d.f. P (X, Y ) over [−2, 2]2 ×{±1} is defined such that the Bayes classifier associated
with it is given by a MoE classifier with M = 3 experts and the associated Bayes risk
is 18.33%.
2. Set φ(yf (x)) = (1 − tanh(γ1 yf (x) − γ2 ))/(1 + tanh(γ2 )), where f (x) is the classifier output at x and γ1 , γ2 ∈ R+ . This function was chosen for two reasons. (i) It fulfills all the conditions given in Section 3.1. (ii) It can be made arbitrarily close to the indicator function by setting γ1 , γ2 appropriately.
3. Set the training sequence length N = 300 and the test sequence length NT = 106 .
Experimental protocol
1. Draw a training sequence DN ∼ P (X, Y )N .
2. Draw a test sequence Dtest ∼ P (X, Y )NT .
3. For every M = 1, 2, . . . , 5, carry out the following steps.
• Find f*_M = argmin_{f ∈FM} Êφ (f, DN ).
• Calculate the risk bound.
• Estimate the true risk Pe (f*_M ) by P̂e (f*_M , Dtest ).
4. Repeat steps 1-3 400 times (to characterize the statistical behavior of the experimental
results).
7.1.2 Real world data
The significant difference between the synthetic data and the real world data sets is that
while we can draw a test sequence that is arbitrarily long in the former case, there is usually
no test sequence in the latter case. So, there is no way to compute the true risk and evaluate
the behavior of the bound. A common practice in these situations is given by the R-fold
cross-validation (R-CV; see Figure 7.1) protocol, where the training sequence is divided R times; in each division a different subset of the training sequence is used as a validation sequence for a classifier that is selected based on the rest of the points in the training sequence. The risk is then estimated by the average of the empirical risks computed over the R validation sequences. However, among other weaknesses, the R-CV method uses the same points for learning the classifier and for evaluating its performance in different divisions of the training sequence, potentially increasing the danger of overfitting. Thus, a refined version of the R-CV, which we refer to as two-stage cross-validation (TS-CV), is used. This method uses the R-CV method to select the classifier, but the performance of the selected classifier is evaluated using different data points than those used to select it, thus potentially decreasing the danger of overfitting.
Before describing the protocol of TS-CV some definitions and notations are given. For every n = 1, 2, . . . , N , let DN (n) be the n-th point in the training sequence. Set two positive integers N1 , N2 and let R = ⌊N/N1 ⌋ and S = ⌊N1 /N2 ⌋. To simplify the discussion we assume that both N/N1 and N1 /N2 are integers; if this is not the case the protocol is not changed, but the indices need to be dealt with more carefully. Next, for every r = 1, 2, . . . , R define the sets of indices I^val_{1,r} = {(r − 1)N1 + 1, (r − 1)N1 + 2, . . . , rN1 } and I^train_{1,r} = {1, 2, . . . , N } \ I^val_{1,r}, and the associated sets of points D^val_{1,r} = {DN (n)}_{n∈I^val_{1,r}} and D^train_{1,r} = {DN (n)}_{n∈I^train_{1,r}}.
Figure 7.1: R-fold cross-validation data partition. For each fold r = 1, 2, . . . , R, the original training sequence of N points is split into a validation sequence of N1 points and a training sequence of N − N1 points.
Now, if one wishes to carry out R-CV, one needs to use each of the sets {D^train_{1,r}}_{r=1}^{R} to choose a classifier, estimate the performance of each classifier using the associated validation set from {D^val_{1,r}}_{r=1}^{R}, and average over the results.
The TS-CV takes the R-CV one step further (see Figure 7.2) by using another cross-validation procedure for each element of {D^train_{1,r}}_{r=1}^{R}. For every r = 1, 2, . . . , R and every s = 1, 2, . . . , S define the sets of indices I^val_{2,r,s} = {(r − 1)N1 + (s − 1)N2 + 1, (r − 1)N1 + (s − 1)N2 + 2, . . . , (r − 1)N1 + sN2 }, I^train_{2,r,s} = I^train_{1,r} \ I^val_{2,r,s}, and the associated sets of points D^val_{2,r,s} = {DN (n)}_{n∈I^val_{2,r,s}} and D^train_{2,r,s} = {DN (n)}_{n∈I^train_{2,r,s}}.
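The index bookkeeping above can be made concrete with the following short sketch (ours; it realizes the partition of Figure 7.2 positionally, carving the second-stage validation blocks out of the ordered first-stage training indices, and assumes N1 divides N and N2 divides N1 ).

    def tscv_indices(N, N1, N2):
        # First stage: R = N/N1 folds; second stage: S = N1/N2 blocks per fold.
        R, S = N // N1, N1 // N2
        all_idx = list(range(1, N + 1))          # 1-based, matching the text
        sets = {}
        for r in range(1, R + 1):
            I1_val = set(range((r - 1) * N1 + 1, r * N1 + 1))
            I1_train = [n for n in all_idx if n not in I1_val]   # ordered
            for s in range(1, S + 1):
                I2_val = set(I1_train[(s - 1) * N2 : s * N2])
                I2_train = set(I1_train) - I2_val
                sets[(r, s)] = (I1_val, set(I1_train), I2_val, I2_train)
        return sets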
Given the definition of φ(yf (x)) from Subsection 7.1.1, the protocol for numerical experiments with the real data sets is now described.
Experimental protocol
1. For every r = 1, 2, . . . , R carry out the following steps.
• For every s = 1, 2, . . . , S and M = 1, 2, . . . , 5 find f*_{M,r,s} = argmin_{f ∈FM} Êφ (f, D^train_{2,r,s} ).
• Calculate φM,r,s = Êφ (f*_{M,r,s} , D^val_{2,r,s} ) and φ̄M,r = (1/S) Σ_{s=1}^{S} φM,r,s .
Figure 7.2: Two-stage cross-validation data partition. Within each first-stage fold r = 1, 2, . . . , R, an S-fold CV further splits the data: for each s = 1, 2, . . . , S, a second-stage validation sequence of N2 points is held out and the remaining N − N1 − N2 points form the second-stage training sequence.
• Set Mr = argmin_M φ̄M,r .
• Find f*_r = argmin_{f ∈FMr} Êφ (f, D^train_{1,r} ).
• Calculate er = P̂φ (f*_r , D^val_{1,r} ).
2. The mean over the set {er }_{r=1}^{R} is interpreted as the expected risk of the classifier that will be selected by minimizing Êφ (f, DN ), the empirical φ-risk computed over the entire training sequence.
7.2 Algorithms
We consider algorithms which attempt to minimize the empirical φ-risk with respect to the
parameters of the classifier. Using the definition of φ from Subsection 7.1.1, the empirical
φ-risk is given by
Êφ (f (θ), DN ) = (1/N ) Σ_{n=1}^{N} (1 − tanh(γ1 yn f (θ, xn ) − γ2 ))/(1 + tanh(γ2 )).
It is easy to see that not only is Êφ (f (θ), DN ) not convex with respect to θ, it is not even unimodal. This renders simple gradient methods useless, as it is highly unlikely that one would select an initial point for the gradient search from which the global minimum is reached. The two algorithms used in the experiments are described in Subsections 7.2.1 and 7.2.2.
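A direct transcription of the empirical φ-risk (a minimal sketch assuming a predictor f mapping x to a real output, such as moe_predict above; the names are ours):

    import numpy as np

    def phi_loss(margin, gamma1=2.0, gamma2=0.0):
        # phi(y f(x)) = (1 - tanh(gamma1 * y f(x) - gamma2)) / (1 + tanh(gamma2))
        return (1.0 - np.tanh(gamma1 * margin - gamma2)) / (1.0 + np.tanh(gamma2))

    def empirical_phi_risk(f, X, y, gamma1=2.0, gamma2=0.0):
        # Average phi-loss over the training sequence D_N = {(x_n, y_n)}.
        margins = np.array([yn * f(xn) for xn, yn in zip(X, y)])
        return float(np.mean(phi_loss(margins, gamma1, gamma2)))

Since f (θ, x) enters the loss through compositions of tanh terms, plotting this risk along a line in Θ typically shows several separated valleys, which is the multimodality referred to above.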
7.2.1 Cross-Entropy
One possible solution to this problem is given by the Cross-Entropy (CE) algorithm ([32];
see [8] for a recent review). This algorithm, similarly to genetic algorithms, is based on the
idea of randomly drawing samples from the parameter space and improving the way these
samples are drawn from generation to generation. We observe that the CE algorithm is
applicable to finite dimensional problems.
To give an exact description of the algorithm used in our simulation we first introduce the
following notation. Denote by Θ the feasible set of values for θ, and let ψΘ (θ; ξ) be a p.d.f. over Θ parameterized by ξ. For example, if ψΘ (θ; ξ) is set to be a Gaussian p.d.f. then ξ = [σ, µ], where σ is the standard deviation and µ is the mean of the random variable θ when it is drawn according to ψΘ (θ; ξ). To find a point that is
likely to be in the neighborhood of the global minimum, we carry out Algorithm 1 (see box).
Upon convergence, we use gradient methods with θ̂sB (see box for definition) as the initial
point to gain further accuracy in estimating the global minimum point. We denote by θ̂B
the solution of the gradient minimization procedure and declare it as the final solution.
The Cross-Entropy Algorithm.
Input: ψΘ (·) and φ(·).
Output: θ̂_s^B, a point in the neighborhood of the global minimum of Êφ (f (θ), DN ).
Algorithm:
1. Pick some ξ̂_0 (a good selection will turn ψΘ (θ; ξ̂_0 ) into a uniform distribution over Θ). Set the iteration counter s = 1, two positive integers d, T and three parameters 0 < ρ1 , ρ2 , ρ3 < 1.
2. Generate an ensemble {θ1 , θ2 , . . . , θL }, where L = 2kM T (k is the dimension of the feature space and M is the number of experts, so the dimension of Θ is 2kM ), drawn i.i.d. according to ψΘ (θ; ξ̂_{s−1} ).
3. Calculate Êφ (f, DN ) for each member of the ensemble. The Elite Sample (ES) comprises the ⌊ρ1 L⌋ parameters that received the lowest empirical φ-risk. Denote the parameters associated with the worst and the best Êφ (f, DN ) in the ES by θ̂_s^W and θ̂_s^B respectively.
4. If for some s ≥ d
max_{s−d≤i,j≤s} (θ̂_i^W − θ̂_j^W) ≤ ρ2 ,
stop (declare θ̂_s^B as the solution). Otherwise, solve the maximum likelihood estimation problem, based on the ES, to estimate the parameters of ψΘ (notice that this is not a MLE for the original empirical risk minimization problem). Denoting the solution by ξ̂_ML , compute ξ̂_{s+1} = (1 − ρ3 )ξ̂_s + ρ3 ξ̂_ML . Set s = s + 1 and return to 2.
Algorithm 1: The Cross-Entropy algorithm for estimating the location of the global minimum of the empirical φ-risk.
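To make the generation-by-generation structure concrete, here is a compact sketch of the CE loop for a Gaussian ψΘ (our own minimal rendering of Algorithm 1, with a fixed iteration budget instead of the stopping test on θ̂_s^W; the experiments themselves used a Beta distribution, see Section 7.3).

    import numpy as np

    def cross_entropy_minimize(risk, dim, L=500, rho1=0.03, rho3=0.7, n_iter=50, seed=0):
        # risk: maps a parameter vector in R^dim to its empirical phi-risk.
        rng = np.random.default_rng(seed)
        mu, sigma = np.zeros(dim), np.ones(dim)      # xi_hat_0: a broad initial p.d.f.
        n_elite = max(1, int(rho1 * L))
        for _ in range(n_iter):
            thetas = rng.normal(mu, sigma, size=(L, dim))
            risks = np.array([risk(t) for t in thetas])
            elite = thetas[np.argsort(risks)[:n_elite]]          # the Elite Sample
            mu = (1 - rho3) * mu + rho3 * elite.mean(axis=0)     # smoothed ML update
            sigma = (1 - rho3) * sigma + rho3 * elite.std(axis=0)
        return elite[0]                                          # best elite member

As in the text, the returned point would then seed a gradient search for further refinement.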
7.2.2 Greedy search of local minima
The Greedy Search of Local Minima (GSLM) algorithm, described in Algorithm 2 (see box), is in the spirit of genetic algorithms, combining stochastic search with gradient search in an attempt to mitigate the problem of non-convexity. Generally speaking, the idea at the basis of this algorithm is given by the following five steps.
1. Randomly draw a set of N1 points from Θ.
2. Select the N2 points (N2 ≪ N1 ; for simplicity assume N1 /N2 is an integer) associated with the lowest empirical φ-risk.
3. Use gradient search to converge from each point to a nearby local minimum.
4. Around each local minimum randomly draw N1 /N2 − 1 points, so that we have a set of N1 points again.
5. If a stopping criterion is fulfilled, stop. Otherwise, go to 2.
The GSLM algorithm has two significant advantages over the CE algorithm. First, while the latter searches for a single point at each iteration, the former performs a tree-like search, exploring several directions, each given by a member of the ES. Second, the GSLM algorithm combines gradient methods with stochastic search to accelerate convergence.
7.3 Synthetic data set results
We set γ1 = 2, γ2 = 0 in the definition of φ(·). For every M = 1, 2, . . . , 5, denote f*_M = argmin_{f ∈FM} Êφ (f, DN ). The reported risk Pe (f*_M ) is given by P̂e (f*_M , Dtest ), the empirical risk computed over a test sequence of 10^6 elements (Dtest ) drawn from the same source as the training sequence. Figure 7.4 describes the results from 400 different training sequences (the bars describe the standard deviation). The graph labelled the 'complexity term' in Figure 7.4 is the sum of all terms on the right hand side of Corollary 4.11 with δ = 10^−3 , excluding Êφ (f*_M , DN ).
The Greedy Search of Local Minima Algorithm.
Input: Loss function φ(·) and a uniform p.d.f. ψΘ (θ, β, ξ) = c I[‖θ − β‖ ≤ ξ], where c is a normalization factor.
Output: θ̂_s^B, a point in the neighborhood of the global minimum of Êφ (f (θ), DN ).
Algorithm:
1. Pick some β and ξ1 (0 and 2 in our simulation), set the iteration counter s = 1, three positive integers d, T, L where typically L ≪ T , and three parameters 0 < ρ1 , ρ2 , ρ3 < 1.
2. Generate an ensemble {θ1 , θ2 , . . . , θL }, where L = 2kM T , drawn i.i.d. according to ψΘ (θ, β, ξs ). Denote this ensemble as the 'sample'.
3. For each point in the sample carry out a gradient search to find the nearest local minimum. Denote this set of local minima by {θ̂1 , θ̂2 , . . . , θ̂L }, where L is the number of elements in the sample, and calculate the empirical φ-risk for each of them.
4. The Elite Sample (ES) comprises the ⌊ρ1 L⌋ parameters associated with the lowest local minima. Denote by ES(n), n = 1, 2, . . . , ⌊ρ1 L⌋, the n-th element of the ES, and denote the parameters associated with the worst and the best empirical φ-risk in the ES by θ̂_s^W and θ̂_s^B respectively.
5. If for some s ≥ d
max_{s−d≤i,j≤s} (θ̂_i^W − θ̂_j^W) ≤ ρ2 ,
stop (declare θ̂_s^B as the solution).
6. Set s = s + 1, ξs = ρ3 ξs−1 . For every n = 1, 2, . . . , ⌊ρ1 L⌋, set β = ES(n) and generate L − 1 samples according to ψΘ (θ, β, ξs ). Denote the ensemble of the ⌊ρ1 L⌋L points, comprised of the ES and the newly drawn points, as the sample. Go to 3.
Algorithm 2: The Greedy Search of Local Minima algorithm for estimating the location of the global minimum of the empirical φ-risk.
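A compact sketch of the GSLM loop (our own rendering of Algorithm 2; a finite-difference gradient step stands in for the gradient search, a cube for the ball ‖θ − β‖ ≤ ξ, and the total sample size is kept near L for brevity):

    import numpy as np

    def gradient_search(risk, theta, lr=0.05, steps=200, eps=1e-5):
        # Crude finite-difference descent toward a nearby local minimum.
        theta = theta.copy()
        for _ in range(steps):
            grad = np.array([(risk(theta + eps * e) - risk(theta - eps * e)) / (2 * eps)
                             for e in np.eye(len(theta))])
            theta -= lr * grad
        return theta

    def gslm_minimize(risk, dim, L=100, rho1=0.05, rho3=0.9, xi1=2.0, n_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        xi = xi1
        sample = rng.uniform(-xi1, xi1, size=(L, dim))                   # step 2, beta = 0
        for _ in range(n_iter):
            minima = np.array([gradient_search(risk, t) for t in sample])   # step 3
            risks = np.array([risk(t) for t in minima])
            n_elite = max(1, int(rho1 * len(sample)))
            elite = minima[np.argsort(risks)[:n_elite]]                     # step 4
            xi *= rho3                                                      # step 6
            new = [beta + rng.uniform(-xi, xi, size=(max(1, L // n_elite - 1), dim))
                   for beta in elite]
            sample = np.vstack([elite] + new)
        return elite[0]                                                     # best local minimum found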
As for the CE parameters, we set ψΘ (·) to be the Beta distribution, ξ̂_0 = [1, 1] (which corresponds to the uniform distribution), ρ1 = 0.03, ρ2 = 0.001, ρ3 = 0.7 and T = 200. A few observations regarding the results, summarized in Figure 7.4, are in order.
1. As one might expect, Êφ (f*_M , DN ) is monotonically decreasing with respect to M .
2. As expected, the (data dependent) complexity term is monotonically, and approximately linearly, increasing with respect to M .
3. Pe (f*_M ) is closest to the Bayes error (18.33%) when M = 3, which is the Bayes solution. We witness the phenomenon of underfitting for M = 1, 2 and overfitting for M = 4, 5, as predicted by the bound.
Figure 7.3: A visual demonstration of the synthetic data set. The blue line describes the Bayes classifier and the red line describes the selected classifier for M = 3.
Figure 7.4: Synthetic data set results (four panels: Êφ (f, DN ), the data dependent bound, Pe (f ) and the complexity term, each as a function of M = 1, . . . , 5). A comparison between the data dependent bound of Corollary 4.11 and the true error, computed over 400 Monte Carlo iterations for different training sequences. The solid line describes the mean and the bars indicate the standard deviation over all training sequences. The two figures on the left demonstrate the applicability of the data dependent bound to the problem of model selection when one wishes to set the optimal number of experts. It can be observed that the optimal predicted value for M in this case is 3, which agrees with the number of experts used to generate the data. Observe the underfitting that occurs for M = 1, 2 and the overfitting for M = 4, 5.
7.4 Real world data set results
We applied Algorithm 2 to two real-world data sets, bupa (liver disorders, N = 345, k = 7) and pima (Indian diabetes diagnoses, N = 768, k = 9), taken from [6]. The GSLM parameters were initialized as follows: γ1 = 2, γ2 = 0, ρ1 = 0.03, ρ2 = 0.001, ρ3 = 0.9 and T = 200. The initial radius of the parameter feasible set, ξ1 , was set to 2. The results are compared to those obtained with Radial Basis Function Support Vector Machines (RBF-SVM) and Linear Support Vector Machines (Linear-SVM). The hyper-parameters of all classifiers were selected, and the performance estimated, according to the cross-validation protocol described in Subsection 7.1.2. The results are described in Table 7.1.
Data set    MoE (2 experts)    Linear-SVM       RBF-SVM
bupa        0.289 ± 0.050      0.320 ± 0.084    0.317 ± 0.048
pima        0.241 ± 0.056      0.244 ± 0.050    0.255 ± 0.067

Table 7.1: Real world data set results. The results were computed using 7-fold cross-validation for bupa and 10-fold cross-validation for pima. For each fold, the hyper-parameters of the classifiers were selected using cross-validation in the training sequence.
7.5 Summary
Two types of numerical experiments were carried out. In one experiment the CE algorithm
was used to estimate the parameters of a MoE classifier. The results of this experiment
indicate that the CE algorithm performs well and that, at least for the p.d.f. tested in the
experiment, the risk bound successfully predicts which class is the most appropriate. The
second experiment considered real world data sets. The results of this experiment indicate
that the MoE classifier performs at least as well as SVM classifiers. However, there are
two major problems when estimating the parameters of a MoE classifier. First, due to the convex nature of the SVM optimization problem, SVM training is faster than the algorithms used in our experiment. This problem can be partially overcome by a more careful implementation of the algorithms.
The second problem is more disturbing. Notice that the dimension of the problem (the
number of parameters) is given by 2kM . This leads to a rapid increase in the problem
dimension as the number of features increases. Practically, only problems for which k < 15
could be solved within a reasonable time. Since many real problems contain hundreds or even
thousands of features, some feature selection algorithm needs to be used prior to learning
the parameters of the MoE classifier.
Appendix A
About the Lipschitz property of functions
The Lipschitz property plays a key role in this work, as it is used as the prime measure of the smoothness of the various functions. Thus, it is appropriate to study sufficient conditions for a function to be Lipschitz. In the sequel we show that one such condition is the existence of a bounded first derivative.
Consider a function f (x) mapping R^k → R for some positive integer k ≥ 1. By the Taylor expansion, for every a ∈ R^k and b ∈ R^k there exists some ξ on the line segment between a and b for which

f (b) = f (a) + ∇x f (ξ)(b − a),

or, equivalently, f (b) − f (a) = ∇x f (ξ)(b − a). This, combined with the Cauchy-Schwarz inequality, implies

|f (b) − f (a)| = |∇x f (ξ)(b − a)| ≤ ‖∇x f (ξ)‖ ‖b − a‖.

Setting Lf = max_{x∈R^k} ‖∇x f (x)‖ implies that for every a, b in R^k

|f (b) − f (a)| ≤ Lf ‖b − a‖,
which means that f (x) is Lipschitz with constant Lf . Notice that for k = 1 this result yields

|f (b) − f (a)| ≤ Lf |b − a|,

where Lf = max_{x∈R} |df (x)/dx|.
Bibliography
[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] T. Bailey and A.K. Jain. A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Syst., Man, Cybern., SMC-8, 1978.
[3] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003.
[4] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[5] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[6] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html.
[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy
method. The Annals of Probability, 31:1583–1614, 2003.
[8] P.T. de Boer, D.P. Kroese, S. Mannor, and R.Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 2004. To appear.
[9] I. Desyatnikov and R. Meir. Data-dependent bounds for multi-category classification
based on convex losses. In Proc. of the sixteenth Annual Conference on Computational
Learning Theory, volume 2777 of LNAI. Springer, 2003.
[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer Verlag, New York, 1996.
[11] R.O. Duda, P.E. Hart, and D. Stork. Pattern Classification. Wiley, New York, second
edition, 2001.
[12] T. Hastie, R. Tibshirani, and J. Friedman.
The Elements of Statistical Learning.
Springer Verlag, Berlin, 2001.
[13] S. Haykin. Neural Networks : A Comprehensive Foundation. Prentice-Hall, second
edition, 1999.
[14] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston,
2002.
[15] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12:2109–2128, 2000.
[16] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer.
Statis. Assoc., 58:13–30, 1963.
[17] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.
[18] F. Peng, R.A. Jacobs, and M.A. Tanner. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. J. Amer. Stat. Assoc., 91:953–960, 1996.
[19] R.A. Jacobs, F. Peng, and M.A. Tanner. A Bayesian approach to model selection in hierarchical mixture-of-experts classifiers. IEEE Trans. Neural Networks, 10:231–241, 1997.
[20] W. Jiang. The VC dimension for mixtures of binary classifiers. Neural Computation, 12:1293–1301, 2000.
[21] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[22] M.I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8:1409–1431, 1996.
[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Press, New York, 1991.
[24] G. Lugosi. Concentration-of-measure inequalities. Lecture notes given at the summer school at ANU, 2004. http://www.econ.upf.es/∼lugosi/surveys.html.
[25] S. Mannor, R. Meir, and T. Zhang. Greedy algorithms for classification - consistency,
convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741,
2003.
[26] P. McCullagh and J.A. Nelder. Generalized Linear Models. CRC Press, second edition, 1989.
[27] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics,
pages 148–188. Cambridge University Press, 1989.
[28] R. Meir, R. El-Yaniv, and S. Ben-David. Localized boosting. In N. Cesa-Bianchi and S. Goldman, editors, Proc. Thirteenth Annual Conference on Computational Learning Theory, pages 190–199. Morgan Kaufmann, 2000.
[29] R. Meir and T. Zhang. Generalization bounds for Bayesian mixture algorithms. Journal
of Machine Learning Research, 4:839–860, 2003.
[30] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2), 1998.
[31] R. Royall. A Class of Nonparametric Estimators of a Smooth Regression Function. PhD
thesis, Stanford University, Stanford, CA, 1966.
[32] R.Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127–190, September 1999.
[33] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA,
2002.
[34] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and its Applications, 16:264–280,
1971.
[35] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York,
1995.
[36] V. N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[37] S.R. Waterhouse and A.J. Robinson. Nonlinear prediction of acoustic vectors using
hierarchical mixture of experts. In Advances in Neural Information Processing Systems,
pages 835–842. MIT Press, Cambridge, MA, 1995.
[38] S.R. Waterhouse and A.J. Robinson. Constructive algorithms for hierarchical mixtures of experts. In Advances in Neural Information Processing Systems, pages 584–590. MIT Press, Cambridge, MA, 1996.
[39] A. Zeevi, R. Meir, and V. Maiorov. Error bounds for functional approximation and
estimation using mixtures of experts. IEEE Trans. Information Theory, 44(3):1010–
1025, 1998.
[40] T. Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. The Annals of Statistics, 32(1), 2004.