Data Dependent Risk Bounds and Algorithms for Hierarchical Mixture of Experts Classifiers

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electrical Engineering

Arik Azran

Submitted to the Senate of the Technion - Israel Institute of Technology

Tishrei 5764, Haifa, June 2004

The Research Thesis Was Done Under the Supervision of Professor Ron Meir in the Faculty of Electrical Engineering. The Generous Financial Help of the Technion is Gratefully Acknowledged.

Contents

1 A short introduction to pattern classification
  1.1 Example of pattern classification problem
  1.2 Formulating the binary pattern classification problem
  1.3 A brief classifiers survey
    1.3.1 Nearest Neighbor Classifiers
    1.3.2 Kernel classifiers
    1.3.3 Neural networks
    1.3.4 Tree classifiers
  1.4 Summary

2 A short introduction to statistical learning theory
  2.1 Supervised learning and generalization
  2.2 Vapnik-Chervonenkis theory
  2.3 Concentration inequalities
    2.3.1 The basics
    2.3.2 Sums of independent random variables
    2.3.3 General mappings of independent random variables
  2.4 Summary

3 Data dependent risk bounds
  3.1 The φ-risk
  3.2 The Rademacher complexity
  3.3 Risk bounds
  3.4 Improved risk bound
    3.4.1 Improved bound for RN(φ ◦ F)
    3.4.2 Deriving an improved risk bound
  3.5 Summary

4 Risk bounds for Mixture-of-Experts classifiers
  4.1 Mixture of Experts Classifiers
  4.2 Model selection
  4.3 Establishing risk bounds for MoE classifiers
  4.4 Summary

5 Risk bounds for Hierarchical Mixture of Experts
  5.1 Some preliminaries & definitions
  5.2 Upper bounds for HMoE R̂N(F)
  5.3 Summary

6 Fully data dependent bounds
  6.1 Preliminaries
  6.2 Fully data dependent risk bound for MoE classifiers
    6.2.1 Some definitions, notations & preliminary results
    6.2.2 Establishing a fully data dependent bound
  6.3 Summary

7 Numerical experiments
  7.1 Numerical experiments protocol
    7.1.1 Synthetic data
    7.1.2 Real world data
  7.2 Algorithms
    7.2.1 Cross-Entropy
    7.2.2 Greedy search of local minima
  7.3 Synthetic data set results
  7.4 Real world data set results
  7.5 Summary

A About the Lipschitz property of functions

List of Figures

1.1 A classification problem example.
1.2 Supervised learning block diagram.
1.3 One dimensional demonstration of class complexity.
1.4 A training sequence, after pre-processing.
1.5 Model selection - underfitting.
1.6 Model selection - overfitting.
1.7 Model selection - fitting.
1.8 Some widely used activation functions.
1.9 A single neuron block diagram.
1.10 Two layer feedforward network with M neurons.
1.11 Three layer feedforward network.
1.12 Binary tree classifier.
2.1 The estimation-approximation trade off.
2.2 The VC dimension for linear classifiers in the two dimensional space.
3.1 Some widely used functions as φ(yf(x)).
3.2 Risk bounds comparison.
4.1 MoE classifier with M experts.
4.2 A combination of simple classifiers in the feature space.
4.3 A block diagram of the classifier in Figure 4.2.
5.1 Balanced two-levelled HMoE classifier with M experts.
7.1 R cross-validation data partition.
7.2 Two-stage cross-validation data partition.
7.3 A visual demonstration of the synthetic data set.
7.4 Synthetic data set results.
Abstract

According to the Oxford dictionary, a pattern is "a way in which something happens", and to recognize is "to know, (to be able to) identify again (a person or a thing) that one has seen, heard, etc. before". Thus, by recognizing a pattern we identify something that is similar to something we have seen in the past. Pattern recognition is about guessing the unknown nature of an observation as one out of a set of possibilities. In this work an observation is a collection of numerical measurements, denoted by x, where x ∈ R^k. The unknown nature of the observation is referred to as the label, denoted in this work by y, where y ∈ {0, 1}. In pattern recognition a mapping f : R^k → {0, 1}, referred to as the classifier, is defined. There are many ways in which classifiers can be defined. One possibility is the Hierarchical Mixture of Experts classifier, which is the focus of our discussion.

The Hierarchical Mixture of Experts classifier is given by a recursive soft partition of the feature space R^k in a data-driven fashion. Such a procedure enables local classification, where several experts are used, each of which is assigned the task of classification over some subspace of the feature space. In this work we provide data-dependent error bounds for this class of models, which lead to effective procedures for performing model selection. Tight bounds are particularly important here, because the model is highly parameterized. The theoretical results are complemented with numerical experiments based on a randomized algorithm, which mitigates the effects of the local minima that plague other approaches such as the expectation-maximization algorithm.

This work consists of three parts.

Introductory chapters. Chapters 1-3 establish the framework of our discussion and introduce the mathematical and probabilistic tools that are used throughout this work. Though most of the results are taken from the literature, they are combined in a way which leads to the tightest achievable bounds. Furthermore, some known results are combined with the recently proposed 'entropy method' for concentration inequalities to establish a new risk bound that often outperforms the best known bounds.

Analytical chapters. In Chapters 4-6 the Mixture of Experts and Hierarchical Mixture of Experts classifiers are introduced and discussed. These classifiers are studied extensively and several analytical results regarding them are established.

Numerical chapter. Chapter 7 provides a numerical demonstration of the way in which the analytical results can be used in practical problems. It also compares the performance of Mixture of Experts classifiers with some kernel classifiers on several real world data sets.
Notation and Abbreviations

R : the set of all real numbers
R+ : the set of all nonnegative real numbers
E Z : expectation with respect to the distribution of the random variable Z
k : dimension of the feature space
X : feature space, R^k unless stated otherwise
Y : label space, {±1}
x : feature vector
y : label
P : (unknown) distribution, defined over X × Y
N : size of the training sequence
DN : training sequence {(Xn, Yn)}, n = 1, ..., N, drawn i.i.d. according to P
M : number of experts
am(·) : m-th gating function, mapping R^k → R+
hm(·) : m-th expert, mapping R^k → R
f(·) : classifier, mapping R^k → R
wm, vm : parameters of the m-th gating function and expert
Wm, Vm : feasible sets of wm, vm
Wmax^m, Vmax^m : diameters of Wm, Vm
σn : balanced binary random variable, n = 1, 2, ..., N
σ : vector of N i.i.d. random variables [σ1, σ2, ..., σN]
φ(·) : loss function, mapping R → R+
φ̄ : supremum of φ over R
I[·] : indicator function, mapping R → {0, 1}
F, H, A : classes of all mappings f, h, a
MA, MH : supremum over all functions in A, H
La, Lh, Lφ : Lipschitz constants of a, h, φ
R̂N(F) : empirical Rademacher complexity
RN(F) : Rademacher complexity
Pe(f) : risk of the classifier f
P̂e(f, DN) : empirical risk of the classifier f
Eφ(f) : φ-risk of the classifier f
Êφ(f, DN) : empirical φ-risk of the classifier f
φ ◦ F : function composition
MoE : Mixture of Experts
HMoE : Hierarchical MoE

Chapter 1

A short introduction to pattern classification

Imagine a toll road in which a fee is charged according to the vehicle size. A camera located at the entrance takes pictures of the vehicles. The pictures are then transferred to a machine (a processing unit) which classifies the vehicle in each picture as Small or Large. Making the decision regarding the type of vehicle is an example of what is referred to as pattern classification.

This chapter provides a brief introduction to the problem of pattern classification, focusing on the binary case. To introduce the nature of the problem, we extend the example above, followed by a formal definition of the problem within a statistical framework. Section 1.3 presents a short survey of classifiers widely used in real world applications. A reader familiar with these issues may restrict attention to the notation and definitions, without reading the entire chapter.

1.1 Example of pattern classification problem

Consider the toll road example above (see Figure 1.1), and assume that the fee is £1 for small and £2 for large vehicles. The machine extracts the vehicle length and height from the picture and uses these measurements to make a decision regarding the vehicle type. Denote by x1 and x2 the length and height measurements, respectively. The machine's task is to select a mapping f from the two dimensional space X = {(x1, x2) : 2.5 ≤ x1 ≤ 18, 1 ≤ x2 ≤ 4} into Y = {Small, Large}.

Figure 1.1: A classification problem example. A camera on a toll road takes pictures of the vehicles entering the road, each of which is classified as Small or Large. First, each vehicle is segmented (isolated from the environment). Then, the characteristics of the vehicle, such as length and height, are extracted and described as a feature vector x = [length, height]. Some pre-processing of x, such as mean subtraction and variance normalization, might take place, followed by the classification of x with one of the labels {Small, Large}.
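As a concrete illustration of the pre-processing step mentioned in the caption of Figure 1.1, the following minimal sketch standardizes a batch of raw feature vectors; the measurements and the function name are hypothetical, chosen only for illustration.

```python
import numpy as np

def standardize(features):
    """Mean-subtract and variance-normalize each feature dimension.

    features: array of shape (num_samples, num_features), e.g. rows of
    [length, height] measurements extracted from the camera pictures.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std

# Hypothetical [length (m), height (m)] measurements for four vehicles.
raw = np.array([[3.5, 1.4], [4.2, 1.5], [12.0, 3.2], [16.5, 3.9]])
print(standardize(raw))  # each column now has zero mean and unit variance
```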
Suppose the machine is required to independently learn the rule which is used to classify the vehicles. Such a scheme is addressed in the context of supervised learning (or learning with a teacher), where the machine is provided with a collection of labelled examples, based on which the most 'appropriate' classifier is selected. This classifier is then used to classify new, unlabelled examples. A block diagram describing the scheme of supervised machine learning is provided in Figure 1.2.

Figure 1.2: Supervised learning block diagram. In supervised learning the machine receives a collection of labelled examples. The learning process is carried out under the supervision of a 'teacher' that motivates the machine to improve its performance by introducing some measurement of the error between the desired and the actual outputs. Generally speaking, the improvement is performed by minimizing an error measure.

Figure 1.4 describes a collection of 100 samples (after pre-processing), some of which are labelled 'Small' and the rest 'Large'. The machine selects a mapping f : X → Y based on this set. Figures 1.5 and 1.6 describe two possible classifiers: the former is too simple for the training sequence, a phenomenon referred to as underfitting, and the latter is too complex, resulting in overfitting (see [11] for further discussion). Figure 1.7 describes a classifier which seems to be 'just right' for the training sequence.

Generally, we would like to limit our machine to select a classifier from a given class F, hoping that neither underfitting nor overfitting occurs. To do so, a class of suitable complexity should be selected. One practical way of controlling the size of F uses a regularization parameter, as demonstrated in the following example and in the numerical sketch after Figure 1.3.

Example (Parameterized regularization). Let v = [v1, v2, ..., vM] ∈ R^M. Set some positive integer K and define a nonnegative and monotonically nondecreasing sequence {V1, V2, ..., VK}, where Vk ∈ R+ for all k = 1, 2, ..., K. Consider the one dimensional function

\[
h(v, x) = \sum_{m=1}^{M} \tanh(v_m x)
\]

and define the collection of classes

\[
\mathcal{H}_k = \left\{ h(v, x) = \sum_{m=1}^{M} \tanh(v_m x) \,:\, |v_m| \leq V_k, \; m = 1, 2, \dots, M \right\}
\]

for all k. It is easy to see that Hk ⊆ Hk+1 for all k = 1, 2, ..., K − 1. Thus, the size of the class Hk is controlled by setting the parameter Vk. Figure 1.3 demonstrates this idea using three functions, drawn from three different classes for which V1 = 1, V2 = 5 and V3 = 30.

Figure 1.3: One dimensional demonstration of class complexity. We set M = 10 and chose some arbitrary v such that |vm| < 1 for all m = 1, 2, ..., 10. The functions h(v, x), h(5v, x) and h(30v, x) are plotted. It is clear that as the radius of the feasible set is increased, the functions in the associated class are more 'complex'.
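To make the example concrete, here is a minimal sketch, assuming nothing beyond the definitions above, that evaluates members of Hk for increasing radii Vk and reproduces the qualitative behavior shown in Figure 1.3.

```python
import numpy as np

def h(v, x):
    """The one dimensional function h(v, x) = sum_m tanh(v_m * x)."""
    # v has shape (M,); x is an array of evaluation points.
    return np.tanh(np.outer(v, x)).sum(axis=0)

rng = np.random.default_rng(0)
M = 10
v = rng.uniform(-1.0, 1.0, size=M)  # arbitrary v with |v_m| < 1
x = np.linspace(-1.0, 1.0, 5)

# Scaling v by V_k keeps h(V_k * v, .) inside the class H_k with radius V_k.
for V in (1.0, 5.0, 30.0):
    print(f"V = {V:4.0f}:", np.round(h(V * v, x), 3))
# Larger V yields sharper, more 'complex' functions, as in Figure 1.3.
```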
The problem of setting the class F is known as the problem of model selection, and it is a main focus of our discussion. To put our work in context, we point out some of the main problems of machine learning:

Feature extraction. What are the vehicle characteristics that provide the most relevant information? Will other attributes, such as vehicle weight, speed or color, make the classification easier?

Feature selection. Following feature extraction, it may seem that some of the features are practically useless as far as classification is concerned. The problem of finding the useful features is of major significance in applications where thousands of features are available.

Prior knowledge. How can prior knowledge be exploited? For example, if it is known that small vehicles are more likely to appear during the mornings than during midday, how can this knowledge be used to improve classification?

Noise. All measurements are subject to inaccuracies. How can assumptions concerning such inaccuracies be combined with the procedure of obtaining a classifier?

Segmentation. How can the object be 'isolated' from the environment? For example, how are the pixels of the picture which represent the vehicle selected?

Performance. How is the classifier's performance measured? In the sequel it will be shown that using the 0-1 loss in real world applications is usually inferior to other possible loss functions.

Figure 1.4: A training sequence, after pre-processing.

Figure 1.5: Model selection - underfitting. Out of the collection of all linear classifiers, the described classifier seems to be the most 'suitable' one.

Figure 1.6: Model selection - overfitting (a radial SVM). If we consider a very large class of classifiers, the selected classifier might fit the labelled samples too well. Though almost every sample in the training sequence is classified correctly, we expect such a classifier to have poor generalization performance, measured by its classification of future examples.

Figure 1.7: Model selection - fitting. The described classifier seems most likely to capture the 'true' nature of the underlying source.

1.2 Formulating the binary pattern classification problem

In the previous section a simple example was used to demonstrate the problem of binary pattern classification. We now provide an exact formulation of the problem within a statistical framework [10, 36, 12].

Let (X, Y) be a random pair in X × Y drawn according to some probability measure P(X, Y). In binary pattern classification we wish to obtain a classifier f : X → Y to classify an observation X ∈ X using one of two labels, for example Y ∈ Y = {±1}. We say that a classification error occurs if f(X) ≠ Y. We evaluate the performance of the classifier f by the risk (a.k.a. probability of error), defined as

\[
P_e(f) = \mathbb{E}_{X,Y}\left\{ I\left[ Y f(X) \neq 1 \right] \right\}, \tag{1.1}
\]

where

\[
I\left[ Y f(X) \neq 1 \right] =
\begin{cases}
1 & \text{if } Y f(X) \neq 1 \\
0 & \text{otherwise}
\end{cases}. \tag{1.2}
\]
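Since the risk (1.1) is the expectation of a 0-1 indicator, for any fixed classifier it can be approximated by a sample average over fresh draws from the source. The following hedged sketch illustrates this; the source distribution and the classifier in it are made-up choices for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_source(n):
    """A made-up source: equiprobable labels Y in {-1, +1} and X ~ N(Y, 1)."""
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(size=n)
    return x, y

f = np.sign  # a fixed classifier f(x) = sgn(x)

x, y = sample_source(100_000)
risk_estimate = np.mean(y * f(x) != 1)   # sample average of I[Y f(X) != 1]
print(f"estimated P_e(f) ~= {risk_estimate:.4f}")
# For this source sgn(x) happens to be the Bayes classifier, so the estimate
# approaches P{N(0,1) > 1} ~= 0.1587.
```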
It is desirable that the selected classifier minimizes the risk with respect to a given source. Theorem 1.1 [10] provides us with the absolute best achievable performance and the classifier associated with it.

Theorem 1.1 (Bayes classifier). Let (X, Y) be a random pair defined over the product space X × Y with some p.d.f. P(X, Y), and denote the a posteriori probability of Y given the observation x by ηy(x) = P{Y = y | X = x}, y ∈ {±1}. Consider the class F of all possible mappings f : X → {±1}. Then, every f ∈ F satisfies

\[
P_e(f_B) \leq P_e(f),
\]

where fB(x) = argmax over y ∈ {±1} of ηy(x) is the Bayes classifier. The risk incurred by this classifier is referred to as the Bayes risk, given by

\[
P_e(f_B) = \mathbb{E}\left\{ \min_{y \in \{\pm 1\}} \eta_y(x) \right\}.
\]

However, notice that to be able to determine fB(x) we need to have the conditional distribution at our disposal. In real life this condition rarely holds, and usually we have partial or no information about the distribution. In the context of supervised learning it is assumed that a set of samples, drawn i.i.d. from the source, is available. We refer to this set as the training sequence, denoted by DN,

\[
D_N = \{(X_n, Y_n)\}_{n=1}^{N} \in \left( \mathbb{R}^k \times \mathcal{Y} \right)^N \sim P(X, Y)^N. \tag{1.3}
\]

Based on this collection of examples we wish to obtain a classifier f from a class F. Even though the Bayes classifier is usually not achievable using DN alone, it is very useful to have theoretical absolute bounds for the performance of any possible classifier.

1.3 A brief classifiers survey

To put this work in context, we give a short overview of some well known classification methods, widely used in real world applications. Some comprehensive discussions can be found in [10, 11, 13, 5]. Recall that we have a training set of N samples, DN, based on which we wish to choose a classifier f̂N ∈ F to classify any future sample x ∈ R^k with a label y ∈ Y, such that Pe(f̂N) is as small as possible.

1.3.1 Nearest Neighbor Classifiers

Given a metric d(x, x′) on R^k and a positive integer p, the p nearest neighbor classifier is a mapping R^k → Y, dependent on p elements of DN. More formally, fix any x ∈ R^k and assume, w.l.o.g., that

\[
d(x, x_i) \leq d(x, x_j), \qquad 1 \leq i < j \leq N. \tag{1.4}
\]

The p nearest neighbors of x are then the set {(x1, y1), ..., (xp, yp)}. The p nearest neighbor classifier labels x according to a majority vote policy,

\[
\hat f_N(x) = \mathrm{majority}(y_1, \dots, y_p). \tag{1.5}
\]

Notice that some procedure is needed in case of a tie. One immediate refinement of this classifier is the assignment of different weights to different neighbors of x, typically decreasing with respect to the distance between x and the sample. For the binary case, this can be described as

\[
\hat f_N(x) = \mathrm{sgn}\left( \sum_{n \in I_1} w_n - \sum_{n \in I_{-1}} w_n \right), \tag{1.6}
\]

where the nonnegative numbers w1, ..., wp are the weights, I1 = {n : 1 ≤ n ≤ p, yn = 1} and I−1 is defined similarly. A surprising result, proved by Bailey and Jain [2], states that for all distributions for which P{ηy(x) = 1/2} < 1, the weighted p nearest neighbor classifier achieves its minimal asymptotic risk when the weights are uniform. So, for large samples the p nearest neighbor classifier should be preferred over the weighted p nearest neighbor classifier. However, this does not mean that the latter should be ignored when a sample of finite length is given. In fact, Royall proved that if p is allowed to vary with N then nonuniform weights are advantageous [31].
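The following is a minimal sketch of the weighted p nearest neighbor rule (1.6); the Euclidean metric and the inverse-distance weights are illustrative choices, not prescribed by the text.

```python
import numpy as np

def weighted_knn(x, X_train, y_train, p=3):
    """Weighted p nearest neighbor classifier, as in (1.6).

    X_train: (N, k) training features; y_train: (N,) labels in {-1, +1}.
    """
    dists = np.linalg.norm(X_train - x, axis=1)  # the metric d(x, x_n)
    nearest = np.argsort(dists)[:p]              # indices of the p nearest neighbors
    w = 1.0 / (dists[nearest] + 1e-12)           # illustrative weights, decreasing in distance
    score = np.sum(w * y_train[nearest])         # sum over I_1 minus sum over I_{-1}
    return 1 if score >= 0 else -1               # ties broken in favor of +1

# Toy usage: two labelled points per class in R^2.
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8]])
y = np.array([-1, -1, 1, 1])
print(weighted_knn(np.array([0.3, 0.1]), X, y))   # -> -1
print(weighted_knn(np.array([2.9, 3.1]), X, y))   # -> +1
```

With uniform weights the rule reduces to the plain majority vote (1.5).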
1.3.2 Kernel classifiers

Similar to the nearest neighbor classifier, the kernel classifier labels any point x ∈ R^k according to a majority vote among the labels of the training sequence. However, while the former considers only the p nearest points to x, the latter performs a weighted majority vote among all the points of DN. The kernel classifier combines the labels of the training sequence with data dependent weights, determined by some kernel function,

\[
\hat f_N(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} I[y_i = 1]\, K(x, x_i) - \sum_{i=1}^{N} I[y_i = -1]\, K(x, x_i) \right). \tag{1.7}
\]

A compact form of (1.7) is

\[
\hat f_N(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i K(x, x_i) \right).
\]

The kernel function K : R^k × R^k → R+ is usually monotonically decreasing with respect to the distance between x and xi. The following example describes a very popular kernel function.

Example (Weighted Gaussian kernel function). The Gaussian kernel is given by

\[
K(x, x_i) = e^{-\|x - x_i\|^2}. \tag{1.8}
\]

An immediate refinement of this classifier assigns a different weight to each dimension of the feature space,

\[
K(x, x_i) = e^{-(x - x_i)^\top A (x - x_i)}. \tag{1.9}
\]

Such a weighted distance measure is useful for two reasons:

1. It enables rescaling the features when each of them has different units (such as length in cm and speed in m.p.h.).

2. If some of the features seem to be more relevant than others as far as classification is concerned, they can be weighted according to some measure of their relevance.
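A minimal sketch of the kernel rule in its compact form, using the Gaussian kernel (1.8); the bandwidth parameter gamma is an illustrative addition of mine, not part of the text's definition.

```python
import numpy as np

def gaussian_kernel_classify(x, X_train, y_train, gamma=1.0):
    """Kernel classifier f(x) = sgn(sum_i y_i K(x, x_i)) with a Gaussian kernel.

    gamma scales the squared distance; gamma = 1 recovers (1.8).
    """
    sq_dists = np.sum((X_train - x) ** 2, axis=1)  # ||x - x_i||^2 for all i
    k = np.exp(-gamma * sq_dists)                  # K(x, x_i)
    return 1 if np.sum(y_train * k) >= 0 else -1

X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8]])
y = np.array([-1, -1, 1, 1])
print(gaussian_kernel_classify(np.array([0.2, 0.1]), X, y))  # -> -1
print(gaussian_kernel_classify(np.array([3.1, 2.9]), X, y))  # -> +1
```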
1.3.3 Neural networks

A neural network is an assembly of well defined computational elements, called 'artificial neurons', operating collectively. These 'neurons' may be interconnected in various ways to form a network. For example, the output of some of them may be used as inputs to others. The output of a neuron typically depends on a weighted sum of its inputs, exercising a threshold behavior by exhibiting different outputs depending on whether the sum exceeds some threshold.

An artificial neuron is a mapping R^{k+1} → Y ⊆ [−1, 1], where Y can be discrete, assuming only the elements {±1}. The mapping is given by some function ϕ(·), referred to as the activation function. The most popular activation function is a sigmoid, which has the property

\[
\varphi(t) \to
\begin{cases}
1 & t \to \infty \\
-1 & t \to -\infty
\end{cases}, \tag{1.10}
\]

where t is usually defined as the difference between the inner product of some parameter ω ∈ R^k with x, and a threshold ω0, i.e. t = ω⊤x − ω0. Some widely used activation functions, described graphically in Figure 1.8, are given here mathematically:

1. Threshold activation function: ϕ(t) = sgn(t).

2. Piecewise-linear activation function:

\[
\varphi(t) =
\begin{cases}
1 & t \geq 1 \\
t & -1 < t < 1 \\
-1 & t \leq -1
\end{cases}.
\]

3. Logistic activation function:

\[
\varphi(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}.
\]

Figure 1.8: Some widely used activation functions.

Figure 1.9: A single neuron block diagram. The elements of the feature vector x are linearly combined and compared with a threshold. The difference between the weighted sum of the features and the threshold is then transferred through a nonlinear mapping.

Notice that for each of the sigmoids above, ϕ(t/T) converges pointwise to sgn(t) as T → 0 (except maybe at t = 0), so that they are natural generalizations of the threshold activation function.

There are countless possible architectures for building a neural network. We give a general description of two basic networks relevant to our discussion. The first is the two layer feedforward network.

Example (Two layer feedforward network). A two layer feedforward network is a collection of M 'neurons' (see Figure 1.10), each of which receives the same input x; their outputs are linearly combined to produce the network outcome,

\[
f(x) = \varphi\left( \sum_{m=1}^{M} \upsilon_m \varphi_m\left( \omega_m^\top x - \omega_{m,0} \right) \right). \tag{1.11}
\]

For all m = 1, 2, ..., M, υm ∈ R and [ωm⊤, ωm,0] ∈ R^{k+1} are the parameters of the m'th 'neuron'.

Figure 1.10: Two layer feedforward network with M neurons.

The second example is a three layer feedforward network.

Example (Three layer feedforward network). An example of a three layer feedforward network is given in Figure 1.11, where M1 neurons (the 'input layer') receive the feature vector as their input. Their outputs are then used as the inputs for M2 neurons, called the 'hidden layer'. The outputs of the hidden layer are used, in turn, as the inputs to a neuron (the 'output layer'), whose output is the network output. The mathematical description of this network is given by

\[
f(x) = \varphi\left( \sum_{m=1}^{M_1} \upsilon_m \varphi_m\left( \sum_{j=1}^{M_2} \upsilon_{mj} \varphi_{mj}\left( \omega_j^\top x - \omega_{j,0} \right) \right) \right), \tag{1.12}
\]

where υm, υmj ∈ R and [ωj⊤, ωj,0] ∈ R^{k+1} for all m = 1, 2, ..., M1 and j = 1, 2, ..., M2.

Figure 1.11: Three layer feedforward network.

Remark. Similar to the way in which the two layer feedforward network is generalized to a three layer network, it can also be generalized to a multilayer feedforward network with an arbitrary number of hidden layers.

The parameters of the neural network are set to minimize some loss function. For example, denote the desired output for the sample xn by dn and define the sample error en = dn − f(xn). The network parameters are then set by minimizing the cost function

\[
\epsilon = \frac{1}{N} \sum_{n=1}^{N} e_n^2,
\]

possibly subject to some regularization conditions.
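As an illustration of (1.11), here is a hedged sketch of the forward pass of a two layer feedforward network; the tanh activations and the randomly drawn parameters are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
k, M = 2, 4                       # input dimension and number of neurons

# Randomly drawn parameters of the M 'neurons' and the output combination.
omega = rng.normal(size=(M, k))   # weight vectors omega_m
omega0 = rng.normal(size=M)       # thresholds omega_{m,0}
upsilon = rng.normal(size=M)      # combination weights upsilon_m

def two_layer_net(x):
    """Forward pass of (1.11) with tanh activations throughout."""
    hidden = np.tanh(omega @ x - omega0)   # phi_m(omega_m^T x - omega_{m,0})
    return np.tanh(upsilon @ hidden)       # outer activation phi

x = np.array([0.5, -1.0])
print(two_layer_net(x))  # a soft output in (-1, 1); its sign gives the label
```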
1.3.4 Tree classifiers

Classification trees usually provide a hard recursive partition of R^k into regions (see Figure 1.12). The parameters of the tree are used in a straightforward manner to determine the boundaries of the different regions.

Figure 1.12: Binary tree classifier.

Notice that the tree classifier and a general multilayer feedforward neural network share a closely related structure. In this work we address a classifier that attempts to combine the best of these two classifiers.

1.4 Summary

In this chapter the binary classification problem was introduced and illustrated using a simple example. A formal description of the problem within a statistical framework was also introduced, followed by a short discussion concerning the solution for the case where the conditional distribution of Y given X is available. For this case, the best classifier (the Bayes classifier) and the associated risk (the Bayes risk) were provided. For the more realistic scenario, where the conditional distribution is not at our disposal but a training sequence is available, several practical classifiers were proposed.

Chapter 2

A short introduction to statistical learning theory

In real world applications the source distribution is unknown. In the context of our discussion, a classifier f̂N is selected based on some sequence drawn i.i.d. according to the unknown distribution. Thus, it is not realistic to expect the selected classifier to achieve the same performance as the Bayes classifier, fB. The mismatch between the two classifiers, measured by Pe(f̂N) − Pe(fB), is the focus of our discussion. This chapter serves as an introduction to the mathematical tools that form the basis of our discussion. It deals with the concept of generalization and serves as an introductory chapter to what is known as concentration inequalities, used to measure the concentration of a random variable around its mean. Such inequalities turn out to be very useful in the context of learning problems, where one wishes to minimize Pe(f) when only a training sequence is available. The interested reader is referred to [10, 1, 24] for a comprehensive discussion.

2.1 Supervised learning and generalization

Let us begin with a precise formulation of the problem. Let F be a class of mappings f : R^k → R, and let f̂N ∈ F be a function selected based on the training sequence DN. We define the empirical risk

\[
\hat P_e(f, D_N) = \frac{1}{N} \sum_{n=1}^{N} I\left[ y_n \,\mathrm{sgn}\left( f(x_n) \right) = -1 \right], \tag{2.1}
\]

and the empirical risk minimizer over the class

\[
\hat f_N^* = \operatorname*{argmin}_{f \in \mathcal{F}} \hat P_e(f, D_N). \tag{2.2}
\]

We emphasize that f̂N can be an arbitrary function of the data, not necessarily identical to f̂N*. In fact, we do not even assume that Pe(f̂N) = Pe(f̂N*) or that P̂e(f̂N, DN) = P̂e(f̂N*, DN). Given the class F, our goal is to obtain the risk minimizer over the class,

\[
f_{\mathcal{F}} = \operatorname*{argmin}_{f \in \mathcal{F}} P_e(f). \tag{2.3}
\]

Since the source distribution is not available, one may believe that the best thing to do is to set f̂N = f̂N*. This approach is motivated by the philosophy that underlies the laws of large numbers. Generally speaking, these laws state that by choosing N large enough, the empirical risk and the true risk can be made arbitrarily close with probability that tends to 1. In the sequel, several procedures that lead to tight bounds are discussed, but for now, to demonstrate a formal way of phrasing this, we content ourselves with the well known Chebyshev inequality. According to Chebyshev's inequality, for any random variable Z and every ε > 0,

\[
P\{|Z - \mathbb{E}Z| \geq \epsilon\} \leq \frac{\mathrm{Var}\{Z\}}{\epsilon^2}. \tag{2.4}
\]

Setting Z to be the sum of the Zn, where the Zn are drawn independently, yields

\[
P\{|Z - \mathbb{E}Z| \geq \epsilon\} \leq \frac{\sum_{n=1}^{N} \mathrm{Var}\{Z_n\}}{\epsilon^2}. \tag{2.5}
\]

By defining Zn = I[Yn sgn(f(Xn)) = −1], p = P{Zn = 1} for all n = 1, 2, ..., N, and normalizing properly, we have

\[
P\left\{ \left| \hat P_e(f, D_N) - P_e(f) \right| \geq \epsilon \right\} \leq \frac{p(1 - p)}{N \epsilon^2}. \tag{2.6}
\]

This result states that for every ε > 0 there is a positive integer N for which |P̂e(f, DN) − Pe(f)| < ε with probability arbitrarily close to 1. Thus, we may expect that with overwhelming probability the empirical risk minimizer f̂N* will minimize the risk as well. It is not surprising to see that as N tends to infinity, this probability tends to 1. However, while this result can be theoretically useful, in practical problems we only have a finite length training sequence. The following example illustrates one of the problems that might arise in such a setup when we select f̂N*.

Example (Uniform bounds). Denote by F the family of all possible mappings R^k → {±1} and define

\[
\mathcal{F}_N = \left\{ f : \hat P_e(f, D_N) = \hat P_e(\hat f_N^*, D_N) \right\}.
\]

Notice that any classifier for which f(xn) = yn, n = 1, 2, ..., N, is a member of FN. One of these classifiers is given by

\[
f_0(x) =
\begin{cases}
y_n & x = x_n \\
1 & \text{otherwise}
\end{cases},
\]

implying P̂e(f0, DN) = 0. Yet, it is very easy to find a distribution of (X, Y) such that

\[
P\left\{ P_e(f_0) - \inf_{f \in \mathcal{F}} P_e(f) > \epsilon \right\} = 1
\]

for any ε ∈ (0, 1/2). So, we see that it is not sufficient to minimize P̂e(f, DN); we also need to consider the class from which the classifier is drawn. The problem of choosing the class F is known as the model selection problem.
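Before turning to how this problem is addressed, the following hedged simulation, with an arbitrary noisy source of my own choosing, illustrates the example above: the memorizing classifier f0 attains zero empirical risk while its true risk stays near chance level.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """An illustrative source whose labels are independent of the features."""
    x = rng.normal(size=n)
    y = rng.choice([-1, 1], size=n)   # Bayes risk is 1/2 for this source
    return x, y

x_train, y_train = sample(50)

def f0(x):
    """The memorizing classifier: y_n on training points, +1 elsewhere."""
    matches = np.isclose(x, x_train)
    return y_train[np.argmax(matches)] if matches.any() else 1

train_err = np.mean([f0(xi) != yi for xi, yi in zip(x_train, y_train)])
x_test, y_test = sample(10_000)
test_err = np.mean([f0(xi) != yi for xi, yi in zip(x_test, y_test)])
print(f"empirical risk: {train_err:.2f}, estimated true risk: {test_err:.2f}")
# Typical output: empirical risk 0.00, true risk ~0.50 -- severe overfitting.
```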
One possibility of addressing this problem is by setting uniform bounds, such as upper bounds for P{ sup over f ∈ F of |P̂e(f, DN) − Pe(f)| ≥ ε }. The following lemma exhibits the importance of upper bounds for this supremum.

Lemma 2.1 (Vapnik and Chervonenkis, 1974).

\[
\left| \hat P_e(\hat f_N^*, D_N) - P_e(\hat f_N^*) \right| \leq \sup_{f \in \mathcal{F}} \left| \hat P_e(f, D_N) - P_e(f) \right|,
\]

and

\[
P_e(\hat f_N^*) - P_e(f_{\mathcal{F}}) \leq 2 \sup_{f \in \mathcal{F}} \left| \hat P_e(f, D_N) - P_e(f) \right|.
\]

Proof. The first inequality is trivial. As for the second inequality, we have

\[
\begin{aligned}
P_e(\hat f_N^*) - P_e(f_{\mathcal{F}})
&= P_e(\hat f_N^*) - \hat P_e(\hat f_N^*, D_N) + \hat P_e(\hat f_N^*, D_N) - \hat P_e(f_{\mathcal{F}}, D_N) + \hat P_e(f_{\mathcal{F}}, D_N) - P_e(f_{\mathcal{F}}) \\
&\overset{(a)}{\leq} P_e(\hat f_N^*) - \hat P_e(\hat f_N^*, D_N) + \hat P_e(f_{\mathcal{F}}, D_N) - P_e(f_{\mathcal{F}}) \\
&\leq \left| \hat P_e(\hat f_N^*, D_N) - P_e(\hat f_N^*) \right| + \left| \hat P_e(f_{\mathcal{F}}, D_N) - P_e(f_{\mathcal{F}}) \right| \\
&\leq 2 \sup_{f \in \mathcal{F}} \left| \hat P_e(f, D_N) - P_e(f) \right|,
\end{aligned}
\]

where (a) is due to the fact that f̂N* is the empirical risk minimizer. □

So, we see that upper bounds for the supremum of |P̂e(f, DN) − Pe(f)| over F provide us with upper bounds for two important quantities:

1. An upper bound for |P̂e(f̂N*, DN) − Pe(f̂N*)|. This is of great importance when we choose our classifier by minimizing the empirical risk.

2. An upper bound for Pe(f̂N*) − Pe(fF), the sub-optimality of f̂N* within F.

To broaden our view of the model selection problem, recall that Pe(fB) is the lowest risk one could hope for, while Pe(fF) is the best achievable risk using classifiers from F. The performance of f̂N* compared to fB can be expressed in the following way:

\[
P_e(\hat f_N^*) - P_e(f_B) = \underbrace{P_e(\hat f_N^*) - P_e(f_{\mathcal{F}})}_{(1)} + \underbrace{P_e(f_{\mathcal{F}}) - P_e(f_B)}_{(2)}. \tag{2.7}
\]

We give different interpretations to each of the terms above:

(1) The estimation error measures the mismatch between f̂N* and fF. For every given N, this error is always nonnegative and is expected to grow as the class complexity grows.

(2) The approximation error measures the mismatch between fF and fB due to suboptimal model selection. It is always nonnegative, and it equals zero if fB ∈ F.

There is an inherent trade-off between these two measures. By enriching F we decrease the approximation error, up to the point where fB ∈ F and it equals zero. However, by doing so we also increase the estimation error, thus exhibiting the trade-off. So, on top of selecting a classifier from within F, we need to solve an optimization problem to determine F itself. This idea, graphically illustrated in Figure 2.1, motivates the search for bounds which converge uniformly over the class.

Figure 2.1: The estimation-approximation trade off. A solution for the model selection problem is given by the class which minimizes the sum of these two quantities.

Definition 2.2 (Uniform risk bounds). A risk bound is said to be uniform with respect to a class of classifiers if it is of the following form: for every δ ∈ (0, 1), with probability at least 1 − δ over training sequences of length N, every f ∈ F satisfies

\[
P_e(f) \leq \Omega(f, D_N) + \Psi(\mathcal{F}, f, D_N, \delta),
\]

where Ω(f, DN) is some empirical assessment of the real probability of error and Ψ(F, f, DN, δ) measures the class complexity.

As an example, Ω(f, DN) can be the proportion of misclassified elements in DN, given by P̂e(f, DN). The term Ψ(F, f, DN, δ) is more context dependent, and can be addressed by more than one approach. In the next section we introduce the VC theory, which addresses the problem of measuring the class complexity and introduces a possible definition for Ψ(F, f, DN, δ).
2.2 Vapnik-Chervonenkis theory

Vapnik and Chervonenkis [34, 35, 36] introduced a purely combinatorial measure of the class complexity. First, observe that there is a one-to-one correspondence between any binary classifier in F and the subset of R^k which it maps to the label '1'. Thus, instead of thinking of classifiers in a class we may think of subsets of the feature space. This interpretation of a classifier motivates the following definition.

Definition 2.3 (Shatter Coefficient). Given a set of points SN = {x1, ..., xN}, xn ∈ R^k for every n = 1, 2, ..., N, and a class of mappings F : R^k → {±1}, define Cf = {x ∈ R^k : f(x) = 1} and CF = {Cf : f ∈ F}. Let ΔF(SN) denote the number of distinct sets of the form {SN ∩ C : C ∈ CF}. The N'th shatter coefficient of F is defined as

\[
S(\mathcal{F}, N) = \max_{S_N} \Delta_{\mathcal{F}}(S_N).
\]

Thus, ΔF(SN) is simply the number of possible ways in which the set SN can be classified using classifiers from F, and S(F, N) is the maximal number of different subsets of a set of N points which can be obtained by classifying them with classifiers from F. The VC dimension of a class F, denoted by VF, is defined as the largest N such that S(F, N) = 2^N. If S(F, N) = 2^N for all N, we say that VF = ∞. We give a simple example that illustrates the meaning of the VC dimension.

Example (VC dimension). For the class H of hyperplanes in R^k, VH = k + 1. We demonstrate this for k = 2 in Figure 2.2.

Figure 2.2: The VC dimension for linear classifiers in the two dimensional space. For N = 3, all 8 different divisions are possible. For N = 4, some divisions can not be achieved using linear classifiers.

Vapnik and Chervonenkis proved the following distribution free result.

Theorem 2.4 (Vapnik and Chervonenkis, 1971). Let F denote a class of mappings R^k → {±1}. Then, for any positive integer N ≥ VF/2 and δ ∈ (0, 1), with probability at least 1 − δ over training sequences of length N, every f ∈ F satisfies

\[
P_e(f) \leq \hat P_e(f, D_N) + 3 \sqrt{ \frac{ V_{\mathcal{F}} \ln\frac{2eN}{V_{\mathcal{F}}} + \ln\frac{4}{\delta} }{ N } }, \tag{2.8}
\]

and

\[
P_e(\hat f_N^*) \leq P_e(f_{\mathcal{F}}) + 6 \sqrt{ \frac{ V_{\mathcal{F}} \ln\frac{2eN}{V_{\mathcal{F}}} + \ln\frac{4}{\delta} }{ N } }. \tag{2.9}
\]

Remark. The following result, taken from [1] (Theorem 4.2), is an improved and simplified version of (2.8). Under the conditions of Theorem 2.4, there is an absolute constant c such that

\[
P_e(f) \leq \hat P_e(f, D_N) + c \sqrt{ \frac{ V_{\mathcal{F}} + \ln\frac{1}{\delta} }{ N } }.
\]

Notice that the convergence rate of this version is slightly (logarithmically) better than the convergence rate of (2.8). In Chapter 4 we use this bound to compare previous results, based on VF, with the new results established in this work.

Theorem 2.4 provides us with upper bounds both for the deviation of P̂e(f, DN) from Pe(f) and for the sub-optimality of f̂N* within F. An immediate and significant implication of (2.9) is that for any distribution and any class F with finite VF, learning a classifier with risk arbitrarily close to the best achievable risk is possible, provided the training sequence is large enough.
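Before turning to the tightness of these bounds, here is a small numerical illustration of Definition 2.3 and Figure 2.2. The sketch estimates ΔF(SN) for linear classifiers in R² by sampling random hyperplanes; it is a crude lower-bound estimate of my own construction, yet sufficient to reproduce the shattering behavior (all 8 labelings of 3 generic points, fewer than 16 labelings of 4 points).

```python
import numpy as np

rng = np.random.default_rng(4)

def count_linear_labelings(points, trials=200_000):
    """Lower-bound estimate of Delta_F(S_N) for linear classifiers in R^2.

    Counts distinct sign patterns sgn(w.x - b) over randomly drawn (w, b).
    """
    patterns = set()
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        labels = np.sign(points @ w - b)
        labels[labels == 0] = 1
        patterns.add(tuple(labels.astype(int)))
    return len(patterns)

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
print(count_linear_labelings(three))  # expected: 8 = 2^3, the set is shattered
print(count_linear_labelings(four))   # expected: 14 < 16, XOR labelings missing
```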
The VC theory also provides an answer to the question of how tight this bound is. In the following theorem we consider all distributions P of the random pair (X, Y) such that Pe(fF) is a constant.

Theorem 2.5 (Devroye, Györfi and Lugosi, [10]). Let F denote a class with VC dimension VF > 2 and define

\[
\mathcal{P} = \left\{ P(X, Y) : P_e(f_{\mathcal{F}}) = P_{\mathcal{F}} \right\},
\]

where PF ∈ (0, 0.5) is fixed. Then, for any

\[
N \geq \frac{V_{\mathcal{F}} - 1}{2 P_{\mathcal{F}}} \max\left( 9, (1 - 2P_{\mathcal{F}})^{-2} \right),
\]

every mapping f ∈ F satisfies

\[
\sup_{P \in \mathcal{P}} \mathbb{E}\left\{ P_e(f) - P_{\mathcal{F}} \right\} \geq \sqrt{ \frac{ P_{\mathcal{F}} \left( V_{\mathcal{F}} - 1 \right) }{ 24 N } } \; e^{-8}.
\]

A significant implication of Theorems 2.4 and 2.5 is that for classes with finite VF we have upper and lower bounds for the sub-optimality of f̂N* within F. Furthermore, both bounds behave like the square root of VF/N.

In spite of its attractiveness, the VC dimension has several disadvantages:

Excessive pessimism. Even though the existence of lower and upper bounds of the same magnitude is established, the proof of the lower bound is based on very 'bizarre' distributions. It is enough to impose some smoothness assumptions on the underlying p.d.f. to obtain much tighter bounds.

Problem geometry. The VC dimension is a purely combinatorial and distribution free measure of the class complexity, computed without any consideration of the source distribution. It is reasonable to expect that a measure which considers the source distribution will provide tighter bounds. In the sequel we consider a data dependent complexity term which presumes no knowledge of the distribution, but uses the training sequence to measure the class complexity.

Finiteness and tightness. The VC dimension can be very large, and even infinite. Such behavior indicates that the bound might be very loose. Thus, it is of great importance to have complexity measures that exhibit a more restrained behavior.

2.3 Concentration inequalities

According to the law of large numbers, under rather mild conditions, the empirical average of random variables drawn i.i.d. according to any distribution with finite mean is close to that mean with probability tending to one as the sample size increases. This behavior was mathematically described in (2.6), and other inequalities of a similar nature are used intensively throughout this work. We provide a brief overview of this subject, focusing on some results that are relevant to our discussion.

2.3.1 The basics

The first result we mention is known as Markov's inequality.

Lemma 2.6 (Markov's inequality). Let Z be a nonnegative random variable. Then, for any ε > 0,

\[
P\{Z \geq \epsilon\} \leq \frac{\mathbb{E}Z}{\epsilon}.
\]

An immediate extension of Lemma 2.6 to random variables which are not necessarily nonnegative is given by the following corollary.

Corollary 2.7. Let Z be a random variable and let ϕ(t) be any monotonically increasing and nonnegative function. Then, for any ε > 0,

\[
P\{Z - \mathbb{E}Z \geq \epsilon\} = P\{\varphi(Z - \mathbb{E}Z) \geq \varphi(\epsilon)\} \leq \frac{\mathbb{E}\varphi(Z - \mathbb{E}Z)}{\varphi(\epsilon)}. \tag{2.10}
\]

Notice that by setting ϕ(t) = t², applied to |Z − EZ|, we obtain Chebyshev's inequality. Also, by setting ϕ(t) = e^{st}, s > 0, and searching for the best positive s, we obtain Chernoff's bound,

\[
P\{Z - \mathbb{E}Z \geq \epsilon\} \leq \inf_{s > 0} \left\{ e^{-s\epsilon} \, \mathbb{E}\, e^{s(Z - \mathbb{E}Z)} \right\}. \tag{2.11}
\]
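As a quick sanity check on these bounds, the following hedged sketch compares the true tail probability of a sum of Bernoulli variables with the Markov and Chebyshev bounds; the distribution and the threshold are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(5)

# Z = sum of N fair Bernoulli variables; compare P{Z >= eps} with the bounds.
N, eps = 100, 65.0
samples = rng.binomial(1, 0.5, size=(200_000, N)).sum(axis=1)

tail = np.mean(samples >= eps)                 # Monte Carlo estimate of P{Z >= eps}
markov = (N * 0.5) / eps                       # Lemma 2.6: EZ / eps
chebyshev = (N * 0.25) / (eps - N * 0.5) ** 2  # (2.4) applied to |Z - EZ|

print(f"true tail       ~ {tail:.5f}")         # ~ 0.002
print(f"Markov bound      {markov:.3f}")       # ~ 0.77, nearly vacuous
print(f"Chebyshev bound   {chebyshev:.3f}")    # ~ 0.11, still loose
# The exponential (Chernoff-type) bounds of the next subsection capture
# the true decay much better.
```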
2.3.2 Sums of independent random variables

The most important random variables in the context of this work are variants of the empirical average

\[
S_N = \frac{1}{N} \sum_{n=1}^{N} Z_n,
\]

where {Z1, Z2, ..., ZN} is a set of random variables drawn i.i.d. according to some distribution. From Chebyshev's inequality and the i.i.d. property of the random variables, we immediately have

\[
P\{ |\mathbb{E}S_N - S_N| \geq \epsilon \} \leq \frac{\mathrm{Var}(Z_1)}{N \epsilon^2}. \tag{2.12}
\]

However, while the central limit theorem suggests a tail probability that decays as O(e^{−Nε²}), this inequality results in a decay rate of order O(N^{−1}). Thus, it is desirable to replace (2.12) with a tighter bound that exhibits an exponential rate with respect to N. For bounded random variables such an inequality was proved by Hoeffding [16]. The following lemma is one of the many variants of Hoeffding's inequality.

Lemma 2.8 (Hoeffding's inequality). Let Z1, ..., ZN be a set of independent bounded random variables such that |Zn| ≤ b with probability one and E Zn = 0 for all n = 1, 2, ..., N. Set SN as above. Then, for any 0 < ε < b,

\[
P\{S_N \geq \epsilon\} \leq \left[ \left( \frac{1}{1 + \epsilon/b} \right)^{\frac{1}{2} + \frac{\epsilon}{2b}} \left( \frac{1}{1 - \epsilon/b} \right)^{\frac{1}{2} - \frac{\epsilon}{2b}} \right]^N \tag{2.13}
\]

\[
\leq e^{-\frac{N \epsilon^2}{2 b^2}}. \tag{2.14}
\]

Using a slightly different version of Hoeffding's inequality, it is easy to prove the following result.

Corollary 2.9. For every f ∈ F and any ε > 0,

\[
P\left\{ \left| \hat P_e(f, D_N) - P_e(f) \right| \geq \epsilon \right\} \leq 2 e^{-2N\epsilon^2}. \tag{2.15}
\]

Thus, Hoeffding's inequality provides us with a tail probability that decays like O(e^{−Nε²}), similarly to the central limit theorem. However, the fact that the variance of the random variables is not considered in this bound indicates a major weakness. The following lemma [10] provides a concentration inequality for the sum of independent random variables with bounded variance.

Lemma 2.10 (Bennett (1962) and Bernstein (1946)). Let Z1, ..., ZN be a set of independent bounded random variables such that |Zn| ≤ b with probability one and E Zn = 0 for all n = 1, 2, ..., N. Define σ² = (1/N) times the sum of Var(Zn) and set SN as above. Then, for any ε > 0,

\[
P\{S_N > \epsilon\} \leq \exp\left\{ -\frac{N\epsilon}{2b} \left( \left( 1 + \frac{\sigma^2}{2b\epsilon} \right) \ln\left( 1 + \frac{2b\epsilon}{\sigma^2} \right) - 1 \right) \right\} \quad \text{(Bennett, 1962)}, \tag{2.16}
\]

and

\[
P\{S_N > \epsilon\} \leq \exp\left\{ -\frac{N\epsilon^2}{2\sigma^2 + 2b\epsilon/3} \right\} \quad \text{(Bernstein, 1946)}. \tag{2.17}
\]

Comparing Hoeffding's and Bernstein's inequalities is insightful. Notice that both exhibit tail bounds with an exponential decay rate with respect to the sample size N. However, when σ² is much smaller than bε, Bernstein's (and Bennett's) inequality leads to a tail probability of O(e^{−3Nε/(2b)}), outperforming Hoeffding's inequality, which leads to a tail probability of O(e^{−Nε²/(2b²)}).
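The following short sketch evaluates the two exponential bounds for a low-variance example, illustrating the comparison just made; the numbers are arbitrary placeholders.

```python
import numpy as np

# Bounded, zero-mean variables with |Z_n| <= b and small variance sigma^2.
N, b, sigma2, eps = 1000, 1.0, 0.001, 0.05

hoeffding = np.exp(-N * eps**2 / (2 * b**2))                      # (2.14)
bernstein = np.exp(-N * eps**2 / (2 * sigma2 + 2 * b * eps / 3))  # (2.17)

print(f"Hoeffding bound: {hoeffding:.3e}")   # ~ 2.9e-01: barely informative
print(f"Bernstein bound: {bernstein:.3e}")   # ~ 2e-31: exploits the small variance
```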
2.3.3 General mappings of independent random variables

Recall that what is actually needed is an upper bound on the supremum of |P̂e(f, DN) − Pe(f)| over f ∈ F. So, even though Hoeffding's and Bernstein's inequalities exhibit the desirable exponential rate for the tail probability, they are not directly applicable to this task. In this section we quote two concentration inequalities that will be used in the sequel to establish uniform upper bounds on the risk. The first one is McDiarmid's inequality, which addresses bounded difference functions [27].

Definition 2.11 (Bounded difference functions). Let {z1, ..., zN} and {z′1, ..., z′N} be two sets of N elements, each of which belongs to some space Z. Any function f : Z^N → R that satisfies

\[
\sup_{z_1, \dots, z_N, z'_n} \left| f(z_1, \dots, z_{n-1}, z_n, z_{n+1}, \dots, z_N) - f(z_1, \dots, z_{n-1}, z'_n, z_{n+1}, \dots, z_N) \right| \leq c_n
\]

for all n = 1, 2, ..., N is called a bounded difference function.

Two simple examples of bounded difference functions are the average of g(zn) over n = 1, ..., N, and the supremum over g ∈ G of such averages, where for every g ∈ G, |g(·)| is uniformly bounded. Such functions exhibit a 'concentration' behavior, thus motivating the search for tight bounds. Theorem 2.12, proved by McDiarmid [27], provides such a bound.

Theorem 2.12 (McDiarmid's Inequality). Let Z1, ..., ZN be independent random variables, each of which takes values in a set Z, and assume the function f : Z^N → R is a bounded difference function. Then, for every ε ≥ 0,

\[
P\left\{ f(Z_1, \dots, Z_N) - \mathbb{E}f(Z_1, \dots, Z_N) \geq \epsilon \right\} \leq e^{-\frac{2\epsilon^2}{\sum_{n=1}^{N} c_n^2}},
\]

and

\[
P\left\{ \mathbb{E}f(Z_1, \dots, Z_N) - f(Z_1, \dots, Z_N) \geq \epsilon \right\} \leq e^{-\frac{2\epsilon^2}{\sum_{n=1}^{N} c_n^2}}.
\]

Notice that if cn is O(N^{−1}), McDiarmid's inequality results in an exponential bound for the tail probability, similar to Hoeffding's inequality. The most significant difference between these two inequalities is that while the latter addresses only sums of random variables, the former considers any mapping of the random variables which satisfies the bounded difference condition. This is clearly a much larger class of functions.

The last concentration inequality we present in this chapter is based on the entropy method [7]. For a certain type of functions, referred to as self bounding functions, it may lead to tighter bounds for the tail probability than McDiarmid's inequality. We begin with the definition of self bounding functions [7].

Definition 2.13 (Self bounding functions). Let {z1, ..., zN} be a set of N elements, each of which belongs to a space Z. Assume that f is a nonnegative function defined over Z^N and that there exists a function fn, defined over Z^{N−1}, such that the following two conditions hold for all n = 1, 2, ..., N:

\[
\text{1.}\quad 0 \leq f(z_1, \dots, z_{n-1}, z_n, z_{n+1}, \dots, z_N) - f_n(z_1, \dots, z_{n-1}, z_{n+1}, \dots, z_N) \leq 1,
\]

\[
\text{2.}\quad \sum_{n=1}^{N} \left[ f(z_1, \dots, z_{n-1}, z_n, z_{n+1}, \dots, z_N) - f_n(z_1, \dots, z_{n-1}, z_{n+1}, \dots, z_N) \right] \leq f(z_1, \dots, z_N).
\]

Then the function f is called a self bounding function.

For example, let Z = R^k × {±1} and let G be some class of classifiers. The worst empirical risk, given by the supremum over g ∈ G of the average of I[yn g(xn) < 0] over the sample zn = (xn, yn), is a self bounding function. The following theorem, concerning self bounding functions, is taken from [7].

Theorem 2.14. Let Z1, ..., ZN be independent random variables taking values in some set Z and assume the function f : Z^N → R is a self bounding function. Then, for any ε ≥ 0,

\[
P\left\{ f(Z_1, \dots, Z_N) - \mathbb{E}f(Z_1, \dots, Z_N) \geq \epsilon \right\} \leq e^{-\frac{\epsilon^2}{2\mathbb{E}f + 2\epsilon/3}},
\]

and

\[
P\left\{ \mathbb{E}f(Z_1, \dots, Z_N) - f(Z_1, \dots, Z_N) \geq \epsilon \right\} \leq e^{-\frac{\epsilon^2}{2\mathbb{E}f}}.
\]

Recall that Lemma 2.10 improves the result of Lemma 2.8 by using the variance of the random variables. Similarly, Theorem 2.14 potentially improves on Theorem 2.12 by using the mean of the random variable: when the random variable is nonnegative and bounded, its variance can be bounded using its mean. In the next chapter, Theorem 2.14 will be shown to provide a better tail probability rate than Theorem 2.12 when Ef is O(N^{−γ}), where γ is some positive scalar.

2.4 Summary

We discussed the problem of generalization in the context of supervised learning and demonstrated the importance of model selection. We introduced the concept of class complexity and showed the usefulness of the random variable given by the supremum of |P̂e(f, DN) − Pe(f)| over F as such a measure. This measure is addressed in the VC theory, which provides us with distribution free upper bounds for the risk in terms of VF. Since such bounds are not satisfactory in real world problems, the concept of concentration inequalities was introduced, especially Theorems 2.12 and 2.14. These inequalities are used extensively in the sequel to replace random variables with their means and vice versa, leading to tight uniform risk bounds.
Chapter 3

Data dependent risk bounds

In many real life classification problems the source distribution is not available. Moreover, there is no knowledge of a class containing classifiers that are known to perform well. Typically, the only indication of the underlying distribution is given by a training sequence of finite length. This practical scenario motivates the search for analytical and algorithmic tools that help us gain a better understanding of the problem.

In this chapter, bounds that hold for every classifier f ∈ F and are independent of the source distribution are introduced. Even though most of the results are adapted from the literature, we either prove them or provide an outline of their proofs, for several reasons: (1) some authors improved parts of the results of others, and we combine the best relevant results available; (2) to obtain the optimal bounds, we pay special attention to the constants; and (3) to make this document self contained.

This chapter is organized as follows. In Section 3.1 we consider losses other than the 0-1 loss, motivated by both theoretical and practical considerations. In Section 3.2 we introduce a data dependent measure of the class complexity, which is used to replace the VC dimension in the risk bound. We conclude by introducing several risk bounds in Sections 3.3 and 3.4.

3.1 The φ-risk

Consider a soft classifier f : R^k → R, where the classifier maps the feature space into the real line R rather than into {±1}, and the 0-1 loss incurred by it, I[yf(x) ≤ 0]. While we attempt to minimize the expected value of the 0-1 loss, it turns out to be inopportune to directly minimize functions based on this loss. First, the computational task is often intractable due to its non-smoothness. Second, minimizing the empirical 0-1 loss may lead to severe overfitting. Many recent approaches are therefore based on minimizing a smooth convex function φ(yf(x)) which upper bounds the 0-1 loss (e.g. [40, 25, 3]). We assume that the loss function φ(t) satisfies the following assumptions.

1. Upper bound. I[t ≤ 0] ≤ φ(t) for all t ∈ R. This condition enables us to upper bound the risk using the loss function φ.

2. Finiteness. Denote by φ̄ the supremum of φ(t) over t ∈ R; we assume that φ̄ < ∞.

3. Smoothness. φ(t) is Lipschitz with constant Lφ. Such behavior of φ is important for both analytical reasons (as will be demonstrated in the sequel) and practical ones (e.g. when gradient based algorithms are used).

4. Monotonicity & tightness. φ(t) → 0 as t → ∞, φ(0) = 1 and φ(t) → φ̄ as t → −∞. We might also require that dφ(t)/dt < 0. Notice that t stands for yf(x), so such monotonic behavior implies that the φ-risk is consistent with the confidence level of the classification result, known as the margin [3]. The values at t → ±∞ and t = 0 prevent any unnecessary loss of tightness. The property φ(0) = 1 can easily be achieved given a loss function that satisfies all the other assumptions, simply by normalizing it by φ(0); since φ(0) is a positive scalar, it is easy to show that all of the desired properties are then still satisfied.

Some widely used loss functions are described in Figure 3.1.

Define the φ-risk,

\[
E_\phi(f) = \mathbb{E}\left\{ \phi\left( Y f(X) \right) \right\},
\]

and the empirical φ-risk,

\[
\hat E_\phi(f, D_N) = \frac{1}{N} \sum_{n=1}^{N} \phi\left( y_n f(x_n) \right).
\]
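As an illustration of these assumptions, here is a hedged sketch of two surrogate losses satisfying properties 1-4 (roughly matching the piecewise-linear and tanh-based curves of Figure 3.1), together with the empirical φ-risk; the exact parameterizations are my own choices.

```python
import numpy as np

def phi_piecewise(t):
    """Piecewise-linear surrogate: 2 for t <= -1, 1 - t on (-1, 1), 0 for t >= 1."""
    return np.clip(1.0 - t, 0.0, 2.0)

def phi_tanh(t):
    """Shifted and scaled tanh surrogate: phi(0) = 1, limits 2 and 0."""
    return 1.0 - np.tanh(t)

def empirical_phi_risk(phi, f, x, y):
    """The empirical phi-risk (1/N) sum_n phi(y_n f(x_n))."""
    return np.mean(phi(y * f(x)))

t = np.array([-2.0, 0.0, 2.0])
print(phi_piecewise(t), phi_tanh(t))  # both equal 1 at t = 0 and dominate I[t <= 0]
```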
Using the φ-risk instead of the risk itself is motivated by several reasons.

1. Minimizing the φ-risk often leads asymptotically to the Bayes decision rule [40]. The basic idea behind this statement is that for every x ∈ R^k, the function that minimizes the φ-risk, fφ*, is positive if P{Y = 1 | X = x} > 1/2 and negative otherwise. Thus, by taking sgn(fφ*(x)) we achieve the Bayes risk.

2. Rather tight upper bounds on the risk may be derived for finite sample sizes (e.g. [40, 25, 3]).

3. Minimizing the empirical φ-risk instead of the empirical risk is computationally much simpler.

In the remainder of this chapter the φ-risk is used to derive risk bounds that will be the starting point of our research. But before going any further, we introduce a data dependent quantity that will be used as the class complexity.

3.2 The Rademacher complexity

Recently, several notions of class complexity have been presented, such as the maximum discrepancy and the Gaussian and Rademacher complexities. All three measures are closely related, and a detailed discussion regarding them can be found in [4]. We focus here on the Rademacher complexity.

Definition 3.1 (Rademacher Complexity). Let F be a class of functions mapping R^k → R. The empirical Rademacher complexity is defined as

\[
\hat R_N(\mathcal{F}) = \mathbb{E}_\sigma \left\{ \sup_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(x_n) \right\}, \tag{3.1}
\]

where σ = [σ1, σ2, ..., σN] is a vector of binary random variables, each of which is drawn i.i.d. according to P{σn = 1} = P{σn = −1} = 1/2. The Rademacher complexity is defined as the average of the empirical Rademacher complexity over all possible training sequences DN,

\[
R_N(\mathcal{F}) = \mathbb{E}_{D_N} \left\{ \hat R_N(\mathcal{F}) \right\}.
\]

Figure 3.1: Some widely used functions as φ(yf(x)). The threshold loss function produces 1 if yf(x) ≤ 0 and 0 otherwise. The piecewise-linear loss function is given by a continuous chain of linear functions. The tanh loss function is based on a shifted, scaled and normalized version of the tanh function.

Observe that the Rademacher complexity measures the extent to which some function from F can be correlated with binary white noise, when mapping observations drawn according to some probability measure over R^k. In the next section the Rademacher complexity is used as the class complexity, replacing VF in risk bounds. Some important properties of the empirical Rademacher complexity, for classes of real functions F, F1, ..., Fd, where d is some integer, are listed below.

1. R̂N(F) ≥ 0.

2. If F1 ⊆ F2 then R̂N(F1) ≤ R̂N(F2).

3. For every λ ∈ R, R̂N(λF) = |λ| R̂N(F).

4. The empirical Rademacher complexity of the sum class F1 + ... + Fd is at most the sum of the individual complexities R̂N(F1) + ... + R̂N(Fd).

The proofs of these properties are trivial. Notice that each of the properties holds for RN(F) as well.
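The expectation over σ in (3.1) can be approximated by sampling. The following hedged sketch estimates R̂N(F) for a toy finite class of threshold classifiers on a fixed sample; both the class and the data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

x = rng.normal(size=30)                      # fixed sample x_1, ..., x_N
thresholds = np.linspace(-2, 2, 41)
# A finite class F of threshold classifiers f_c(x) = sgn(x - c),
# stored as a matrix with one row per function: F[j, n] = f_{c_j}(x_n).
F = np.sign(x[None, :] - thresholds[:, None])

def empirical_rademacher(F, num_sigma=20_000):
    """Monte Carlo estimate of (3.1): E_sigma sup_f (1/N) sum_n sigma_n f(x_n)."""
    N = F.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(num_sigma, N))
    correlations = sigma @ F.T / N           # (1/N) sum_n sigma_n f(x_n), all f at once
    return correlations.max(axis=1).mean()   # sup over the class, then average

print(f"estimated R_hat_N(F) ~= {empirical_rademacher(F):.3f}")
```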
3.3 Risk bounds

We begin with the following trivial inequality for the risk,
$$P_e(f) \le E_\phi(f) \le \hat{E}_\phi(f, D_N) + \sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \hat{E}_\phi(f, D_N)\right\}.\qquad(3.2)$$
Notice that $\sup_{f\in\mathcal{F}}\{E_\phi(f) - \hat{E}_\phi(f, D_N)\}$ is, in general, a random variable that depends on $D_N$. This term by itself does not leave much room for manipulation. However, by replacing it with its mean, we obtain a more flexible complexity term. This can be done using McDiarmid's inequality, as described in the following Lemma.

Lemma 3.2 Define $\hat{Z}(\mathcal{F}, D_N) = \sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \hat{E}_\phi(f, D_N)\right\}$ and $Z(\mathcal{F}) = \mathbb{E}\left\{\hat{Z}(\mathcal{F}, D_N)\right\}$. Then, for every integer $N$ and $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$,
$$\hat{Z}(\mathcal{F}, D_N) \le Z(\mathcal{F}) + \bar{\phi}\sqrt{\frac{\ln\frac{1}{\delta}}{2N}}.$$

Proof To be able to use McDiarmid's inequality, we first need to prove that $\hat{Z}(\mathcal{F}, D_N)$ is a bounded difference function. Define
$$D_N^i = \left\{(x_1,y_1),\ldots,(x_{i-1},y_{i-1}),(x_i',y_i'),(x_{i+1},y_{i+1}),\ldots,(x_N,y_N)\right\}.\qquad(3.3)$$
That is, $D_N^i$ is obtained by replacing $(x_i,y_i)$ in $D_N$ with $(x_i',y_i')$, drawn independently according to the same distribution. It is easy to see that
$$\hat{Z}(\mathcal{F}, D_N) = \sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \frac{1}{N}\sum_{n\ne i}\phi(y_nf(x_n)) - \frac{1}{N}\phi(y_i'f(x_i')) + \frac{1}{N}\phi(y_i'f(x_i')) - \frac{1}{N}\phi(y_if(x_i))\right\}$$
$$\le \sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \frac{1}{N}\sum_{n\ne i}\phi(y_nf(x_n)) - \frac{1}{N}\phi(y_i'f(x_i'))\right\} + \frac{1}{N}\sup_{f\in\mathcal{F}}\left\{\phi(y_i'f(x_i')) - \phi(y_if(x_i))\right\}$$
$$\le \hat{Z}(\mathcal{F}, D_N^i) + \frac{\bar\phi}{N}.$$
Combined with the analogous inequality $\hat{Z}(\mathcal{F}, D_N^i) \le \hat{Z}(\mathcal{F}, D_N) + \frac{\bar\phi}{N}$, we have
$$\left|\hat{Z}(\mathcal{F}, D_N) - \hat{Z}(\mathcal{F}, D_N^i)\right| \le \frac{\bar\phi}{N},\qquad i=1,2,\ldots,N.$$
Thus, according to McDiarmid's inequality,
$$P\left\{\hat{Z}(\mathcal{F}, D_N) - Z(\mathcal{F}) \ge \epsilon\right\} \le e^{-2N\epsilon^2/\bar\phi^2}.$$
Setting $\delta = e^{-2N\epsilon^2/\bar\phi^2}$ completes the proof of Lemma 3.2. ∎

Combining (3.2) with Lemma 3.2 yields the following uniform risk bound.

Lemma 3.3 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$ and $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal{F}$ satisfies
$$P_e(f) \le \hat{E}_\phi(f, D_N) + \mathbb{E}_{D_N}\sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \hat{E}_\phi(f, D_N)\right\} + \bar\phi\sqrt{\frac{\ln\frac{1}{\delta}}{2N}}.$$

Next, we bound the complexity term $\mathbb{E}_{D_N}\sup_{f\in\mathcal{F}}\{E_\phi(f) - \hat{E}_\phi(f, D_N)\}$ by a quantity that is more easily handled. This is done using the following Lemma.

Lemma 3.4 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$,
$$\mathbb{E}_{D_N}\sup_{f\in\mathcal{F}}\left\{E_\phi(f) - \hat{E}_\phi(f, D_N)\right\} \le 2R_N(\phi\circ\mathcal{F}).$$

Combining Lemma 3.4 and Lemma 3.3 results in a distribution dependent risk bound for every classifier $f\in\mathcal{F}$, described in Theorem 3.5.

Theorem 3.5 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$ and $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal{F}$ satisfies
$$P_e(f) \le \hat{E}_\phi(f, D_N) + 2R_N(\phi\circ\mathcal{F}) + \bar\phi\sqrt{\frac{\ln\frac{1}{\delta}}{2N}}.\qquad(3.4)$$

Proof (of Lemma 3.4) Define a second training sequence, similar to (1.3),
$$D_N' = \left\{(x_1',y_1'),\ldots,(x_N',y_N')\right\},\qquad(3.5)$$
drawn from the same distribution as $D_N$. Since $E_\phi(f) = N^{-1}\mathbb{E}_{D_N'}\sum_{n=1}^N\phi\left(y_n'f(x_n')\right)$, we have
$$\mathbb{E}_{D_N}\left\{\sup_{f\in\mathcal{F}}\left[E_\phi(f) - \hat{E}_\phi(f, D_N)\right]\right\} = \mathbb{E}_{D_N}\left\{\sup_{f\in\mathcal{F}}\mathbb{E}_{D_N'}\left[\frac1N\sum_{n=1}^N\phi(y_n'f(x_n')) - \hat{E}_\phi(f, D_N)\right]\right\}.$$
Replacing the supremum of the average with the average of the supremum yields
$$\mathbb{E}_{D_N}\left\{\sup_{f\in\mathcal{F}}\left[E_\phi(f) - \hat{E}_\phi(f, D_N)\right]\right\} \le \mathbb{E}_{D_N,D_N'}\left\{\sup_{f\in\mathcal{F}}\frac1N\sum_{n=1}^N\left(\phi(y_n'f(x_n')) - \phi(y_nf(x_n))\right)\right\}$$
$$\stackrel{(a)}{=} \mathbb{E}_{D_N,D_N',\sigma}\left\{\sup_{f\in\mathcal{F}}\frac1N\sum_{n=1}^N\sigma_n\left(\phi(y_n'f(x_n')) - \phi(y_nf(x_n))\right)\right\}$$
$$\le \mathbb{E}_{D_N',\sigma}\left\{\sup_{f\in\mathcal{F}}\frac1N\sum_{n=1}^N\sigma_n\phi(y_n'f(x_n'))\right\} + \mathbb{E}_{D_N,\sigma}\left\{\sup_{f\in\mathcal{F}}\frac1N\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\} = 2R_N(\phi\circ\mathcal{F}),$$
where σ is introduced in Definition 3.1, and (a) is due to the fact that $D_N$ and $D_N'$ are drawn independently and identically. ∎

Notice that computing the Rademacher complexity involves an average over the source. Thus, Theorem 3.5 exhibits an advantage over bounds based on the VC dimension, because it takes into account the source distribution as well as the class $\mathcal{F}$. Unfortunately, for the same reason this result is not practical, since the distribution is assumed to be unknown.
To overcome this difficulty, we use McDiarmid's inequality again to replace $R_N(\phi\circ\mathcal{F})$ with $\hat R_N(\phi\circ\mathcal{F})$, which can, in principle, be estimated from the data.

Lemma 3.6 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$ and every $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$,
$$R_N(\phi\circ\mathcal{F}) \le \hat R_N(\phi\circ\mathcal{F}) + \bar\phi\sqrt{\frac{\ln\frac{1}{\delta}}{2N}}.\qquad(3.6)$$

Proof We prove that $\hat R_N(\phi\circ\mathcal{F})$ is a bounded difference function. Define $D_N^i$ as in (3.3) and let $\hat R_N^i(\phi\circ\mathcal{F})$ be the empirical Rademacher complexity computed over $D_N^i$. Then, we have
$$\hat R_N(\phi\circ\mathcal{F}) = \mathbb{E}_\sigma\left\{\sup_{f\in\mathcal{F}}\frac1N\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\}$$
$$= \mathbb{E}_\sigma\left\{\sup_{f\in\mathcal{F}}\frac1N\left[\sum_{n\ne i}\sigma_n\phi(y_nf(x_n)) + \sigma_i\phi(y_i'f(x_i')) + \sigma_i\left(\phi(y_if(x_i)) - \phi(y_i'f(x_i'))\right)\right]\right\}$$
$$\le \mathbb{E}_\sigma\left\{\sup_{f\in\mathcal{F}}\frac1N\left[\sum_{n\ne i}\sigma_n\phi(y_nf(x_n)) + \sigma_i\phi(y_i'f(x_i'))\right]\right\} + \mathbb{E}_{\sigma_i}\left\{\sup_{f\in\mathcal{F}}\frac{\sigma_i}{N}\left(\phi(y_if(x_i)) - \phi(y_i'f(x_i'))\right)\right\}$$
$$\le \hat R_N^i(\phi\circ\mathcal{F}) + \frac{\bar\phi}{N}.$$
A symmetric argument implies that $\hat R_N^i(\phi\circ\mathcal{F}) \le \hat R_N(\phi\circ\mathcal{F}) + \frac{\bar\phi}{N}$. Thus, according to McDiarmid's inequality,
$$P\left\{R_N(\mathcal{F}) - \hat R_N(\mathcal{F}) \ge \epsilon\right\} \le e^{-2N\epsilon^2/\bar\phi^2}.$$
Setting $\delta = e^{-2N\epsilon^2/\bar\phi^2}$ completes the proof of Lemma 3.6. ∎

A uniform, data dependent and distribution free risk bound is now achievable.

Theorem 3.7 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$ and $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal{F}$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + 2\hat R_N(\phi\circ\mathcal{F}) + 3\bar\phi\sqrt{\frac{\ln\frac{2}{\delta}}{2N}}.$$

This result is clearly tighter than bounds based on the VC dimension, yet uniform over the class $\mathcal{F}$ and distribution free. To exploit this result further, the following basic property of the empirical Rademacher complexity, taken from [23, 29], is used in the sequel.

Lemma 3.8 Let $\mathcal{F}$ be some class of mappings and assume $\phi(t)$ is Lipschitz with constant $L_\phi$. Then, $\hat R_N(\phi\circ\mathcal{F}) \le L_\phi\hat R_N(\mathcal{F})$.

This result enables us to deal directly with the class $\mathcal{F}$ rather than $\phi\circ\mathcal{F}$, so that the bound in Theorem 3.7 can be replaced by
$$P_e(f) \le \hat E_\phi(f,D_N) + 2L_\phi\hat R_N(\mathcal{F}) + 3\bar\phi\sqrt{\frac{\ln\frac{2}{\delta}}{2N}}.\qquad(3.7)$$

3.4 Improved risk bound

Recall that in Lemma 3.6 we used McDiarmid's inequality to upper bound the Rademacher complexity by the empirical Rademacher complexity. In this section Lemma 3.9 is used for this purpose, resulting in a risk bound that, under some luckiness assumptions, is tighter than the bound in (3.7). Moreover, in the context of this work, as will be shown in the sequel, this always leads to tighter bounds.

3.4.1 Improved bound for $R_N(\phi\circ\mathcal{F})$

We begin with our main result in this chapter, improving the bound given in Lemma 3.6.

Lemma 3.9 Let $\mathcal{F}$ be a class of mappings. Then, for every integer $N$ and $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$,
$$R_N(\phi\circ\mathcal{F}) \le \hat R_N(\phi\circ\mathcal{F}) + \sqrt{\frac{\bar\phi\ln\frac{1}{\delta}\,\hat R_N(\phi\circ\mathcal{F})}{2N}} + \frac{3\bar\phi\ln\frac{1}{\delta}}{4N}.\qquad(3.8)$$

Remark. It can be shown [4] that for classes with finite $V_\mathcal{F}$, $\hat R_N(\mathcal{F}) \le \sqrt{V_\mathcal{F}/N}$. Thus, assuming that $\phi$ is Lipschitz, the result of Lemma 3.9 exhibits a decay rate of $O(N^{-3/4})$, better than Lemma 3.6, which exhibits a rate of $O(N^{-1/2})$. Furthermore, for the classifiers studied in this work it will be shown that the Rademacher complexity always satisfies $\hat R_N(\mathcal{F}) \le O(N^{-1/2})$, which also gives (3.8) a decay rate of $O(N^{-3/4})$.

Notice that $\hat R_N(\phi\circ\mathcal{F})$ is a random variable, depending on $D_N$. The so-called 'luckiness assumption' states that, regardless of the source distribution, we might have a training sequence such that $\hat R_N(\phi\circ\mathcal{F}) < O(N^{-\gamma})$ for some $\gamma\in(0,1]$. In such a case (3.8) outperforms (3.6).
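The gap between (3.6) and (3.8) under a luckiness scenario is easy to check numerically. The sketch below (illustrative Python; the constants and the model $\hat R_N(\phi\circ\mathcal F) = cN^{-\gamma}$ are assumptions, not values fixed by the text) compares the deviation terms added to $\hat R_N(\phi\circ\mathcal F)$ by the two Lemmas.

```python
import numpy as np

phi_bar, delta, c, gamma = 2.0, 1e-4, 1.0, 0.5
N = np.logspace(2, 5, 50)
R_hat = c * N**(-gamma)                    # 'lucky' empirical Rademacher complexity

dev_36 = phi_bar * np.sqrt(np.log(1/delta) / (2*N))                 # Lemma 3.6
dev_38 = (np.sqrt(phi_bar * np.log(1/delta) * R_hat / (2*N))
          + 3 * phi_bar * np.log(1/delta) / (4*N))                  # Lemma 3.9

print(np.all(dev_38[N > 1e3] < dev_36[N > 1e3]))   # (3.8) wins: O(N^-3/4) vs O(N^-1/2)
```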
To understand the range of values considered for γ, notice that $\hat R_N(\phi\circ\mathcal F) \le \bar\phi$ and that
$$\hat R_N(\phi\circ\mathcal F) \ge \frac{N-1}{N}\hat R_{N-1}(\phi\circ\mathcal F) \ge \ldots \ge \frac1N\hat R_1(\phi\circ\mathcal F) = O(N^{-1}),$$
which can be proved by appropriate normalization of (3.9) in the sequel.

Proof (of Lemma 3.9) We define the random variables
$$\hat Z_N(\phi\circ\mathcal F) = \frac{2}{\bar\phi}\,\mathbb E_\sigma\left\{\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\},\qquad \hat Z_N^{(i)}(\phi\circ\mathcal F) = \frac{2}{\bar\phi}\,\mathbb E_{\sigma\setminus\sigma_i}\left\{\sup_{f\in\mathcal F}\sum_{n\ne i}\sigma_n\phi(y_nf(x_n))\right\}.$$
The proof is based on three steps: (1) $\hat Z_N(\phi\circ\mathcal F)$ is shown to be a self bounding function; (2) a bound for $\mathbb E\hat Z_N(\phi\circ\mathcal F)$ in terms of $\hat Z_N(\phi\circ\mathcal F)$ is established; and (3) this bound is normalized appropriately to obtain Lemma 3.9.

Step 1 First, we prove that $\hat Z_N(\phi\circ\mathcal F)$ satisfies the first property of self bounding functions, given in Definition 2.13. We begin with the upper bound,
$$\hat Z_N(\phi\circ\mathcal F) - \hat Z_N^{(i)}(\phi\circ\mathcal F) \le \hat Z_N^{(i)}(\phi\circ\mathcal F) + \frac{2}{\bar\phi}\,\mathbb E_{\sigma_i}\sup_{f\in\mathcal F}\sigma_i\phi(y_if(x_i)) - \hat Z_N^{(i)}(\phi\circ\mathcal F)$$
$$= \frac{1}{\bar\phi}\left[\sup_{f\in\mathcal F}\phi(y_if(x_i)) - \inf_{f\in\mathcal F}\phi(y_if(x_i))\right] \le 1.$$
To prove the lower bound, we use the following argument:
$$\mathbb E_\sigma\left\{\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\}$$
$$= \frac12\mathbb E_{\sigma\setminus\sigma_i}\left\{\sup_{f\in\mathcal F}\left(\sum_{n\ne i}\sigma_n\phi(y_nf(x_n)) + \phi(y_if(x_i))\right)\right\} + \frac12\mathbb E_{\sigma\setminus\sigma_i}\left\{\sup_{f\in\mathcal F}\left(\sum_{n\ne i}\sigma_n\phi(y_nf(x_n)) - \phi(y_if(x_i))\right)\right\}$$
$$= \frac12\mathbb E_{\sigma\setminus\sigma_i}\left\{\sup_{f,\tilde f\in\mathcal F}\left(\sum_{n\ne i}\sigma_n\left(\phi(y_nf(x_n)) + \phi(y_n\tilde f(x_n))\right) + \phi(y_if(x_i)) - \phi(y_i\tilde f(x_i))\right)\right\}.$$
Setting $f = \tilde f$, which is not necessarily the best option, yields
$$\mathbb E_\sigma\left\{\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\} \ge \mathbb E_{\sigma\setminus\sigma_i}\left\{\sup_{f\in\mathcal F}\sum_{n\ne i}\sigma_n\phi(y_nf(x_n))\right\}.$$
Multiplying both sides by $2\bar\phi^{-1}$ results in
$$\hat Z_N(\phi\circ\mathcal F) \ge \hat Z_N^{(i)}(\phi\circ\mathcal F).\qquad(3.9)$$
This completes the proof that the first property of self bounding functions is satisfied.

Next, we prove that $\hat Z_N(\phi\circ\mathcal F)$ satisfies the second condition of self bounding functions. Denote by $f^*$ the function for which $\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))$ is achieved. Similarly, for every $i=1,2,\ldots,N$, denote by $f_i^*$ the function for which $\sup_{f\in\mathcal F}\sum_{n\ne i}\sigma_n\phi(y_nf(x_n))$ is achieved. Then, for every given σ, we have the following argument:
$$\sum_{i=1}^N\left(\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n)) - \sup_{f\in\mathcal F}\sum_{n\ne i}\sigma_n\phi(y_nf(x_n))\right) = \sum_{i=1}^N\left(\sum_{n=1}^N\sigma_n\phi(y_nf^*(x_n)) - \sum_{n\ne i}\sigma_n\phi(y_nf_i^*(x_n))\right)$$
$$\le \sum_{i=1}^N\left(\sum_{n=1}^N\sigma_n\phi(y_nf^*(x_n)) - \sum_{n\ne i}\sigma_n\phi(y_nf^*(x_n))\right) = \sum_{i=1}^N\sigma_i\phi(y_if^*(x_i)) = \sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n)).$$
By averaging over σ and normalizing appropriately, we have
$$\sum_{i=1}^N\left(\hat Z_N(\phi\circ\mathcal F) - \hat Z_N^{(i)}(\phi\circ\mathcal F)\right) \le \hat Z_N(\phi\circ\mathcal F).$$

Step 2 According to Theorem 2.14, we have the following Lemma.

Lemma 3.10 Let $\mathcal F$ be some class of mappings and define the random variable
$$\hat Z_N(\phi\circ\mathcal F) = \frac{2}{\bar\phi}\,\mathbb E_\sigma\left\{\sup_{f\in\mathcal F}\sum_{n=1}^N\sigma_n\phi(y_nf(x_n))\right\}.$$
Then, for every integer $N$ and any $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$,
$$Z_N(\phi\circ\mathcal F) \le \hat Z_N(\phi\circ\mathcal F) + \sqrt{2\hat Z_N(\phi\circ\mathcal F)\ln\frac1\delta} + \frac32\ln\frac1\delta,\qquad(3.10)$$
where $Z_N(\phi\circ\mathcal F) = \mathbb E_{D_N}\left\{\hat Z_N(\phi\circ\mathcal F)\right\}$.

To prove Lemma 3.10 we use the second inequality of Theorem 2.14, which provides the following bound for the tail probability:
$$P\left\{Z_N(\phi\circ\mathcal F) - \hat Z_N(\phi\circ\mathcal F) \ge \epsilon\right\} \le e^{-\epsilon^2/\left(2Z_N(\phi\circ\mathcal F)\right)}.$$
Set $\delta = e^{-\epsilon^2/(2Z_N(\phi\circ\mathcal F))}$, which implies $\epsilon = \sqrt{2Z_N(\phi\circ\mathcal F)\ln(1/\delta)}$. Then, with probability at least $1-\delta$,
$$Z_N(\phi\circ\mathcal F) \le \hat Z_N(\phi\circ\mathcal F) + \sqrt{2Z_N(\phi\circ\mathcal F)\ln\frac1\delta}.$$
By solving this simple quadratic inequality, keeping in mind that $Z_N(\phi\circ\mathcal F)$ is nonnegative, we get
$$\sqrt{Z_N(\phi\circ\mathcal F)} \le \sqrt{\frac{\ln\frac1\delta}{2}} + \sqrt{\frac{\ln\frac1\delta}{2} + \hat Z_N(\phi\circ\mathcal F)}.\qquad(3.11)$$
Squaring both sides of (3.11) yields
$$Z_N(\phi\circ\mathcal F) \le \hat Z_N(\phi\circ\mathcal F) + \sqrt{\left(\ln\frac1\delta\right)^2 + 2\hat Z_N(\phi\circ\mathcal F)\ln\frac1\delta} + \frac{\ln\frac1\delta}{2}.\qquad(3.12)$$
Using the fact that $\sqrt{a+b} \le \sqrt a + \sqrt b$ for every $a,b\ge0$, (3.12) implies
$$Z_N(\phi\circ\mathcal F) \le \hat Z_N(\phi\circ\mathcal F) + \sqrt{2\hat Z_N(\phi\circ\mathcal F)\ln\frac1\delta} + \frac32\ln\frac1\delta,\qquad(3.13)$$
which completes the proof of Lemma 3.10.

Step 3 Multiplying both sides of (3.13) by $\frac{\bar\phi}{2N}$ completes the proof of Lemma 3.9. ∎

3.4.2 Deriving an improved risk bound

Finally, combining Theorem 3.5, Lemma 3.8 and Lemma 3.9 leads to the following result.

Theorem 3.11 Let $\mathcal F$ be a class of mappings. Then, for every integer $N$ and every $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + 2L_\phi\hat R_N(\mathcal F) + \sqrt{\frac{2L_\phi\bar\phi\ln\frac2\delta\,\hat R_N(\mathcal F)}{N}} + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}} + \frac{3\bar\phi\ln\frac2\delta}{2N}.$$

To appreciate the improvement of Theorem 3.11 over (3.7), consider the following typical initialization of the parameters: $L_\phi = 2$, $\bar\phi = 2$, $\delta = 10^{-4}$, which are the same values used in the simulations of Chapter 7. For these values, Figure 3.2 describes the difference between the bounds. Notice that the improvement is of the same order as other terms in the bound, such as the empirical φ-risk.

[Figure 3.2: Risk bounds comparison. The difference between the bounds in Theorem 3.11 and (3.7) for $L_\phi=2$, $\bar\phi=2$, $\delta=10^{-4}$ and $N = 10^2$ to $10^3$.]
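The comparison behind Figure 3.2 can be reproduced with a few lines of code. In the sketch below (illustrative; the model $\hat R_N(\mathcal F) = 1/\sqrt N$ is an assumption in the spirit of the $O(N^{-1/2})$ behavior established in Chapter 4), the remaining constants follow the text.

```python
import numpy as np

L_phi, phi_bar, delta = 2.0, 2.0, 1e-4
N = np.arange(100, 1001)
ln2d = np.log(2.0 / delta)
R_hat = 1.0 / np.sqrt(N)                  # assumed empirical Rademacher complexity

bound_37  = 2*L_phi*R_hat + 3*phi_bar*np.sqrt(ln2d/(2*N))               # eq. (3.7)
bound_311 = (2*L_phi*R_hat + np.sqrt(2*L_phi*phi_bar*ln2d*R_hat/N)
             + phi_bar*np.sqrt(ln2d/(2*N)) + 3*phi_bar*ln2d/(2*N))      # Thm 3.11

improvement = bound_37 - bound_311        # positive, of the order of the phi-risk
print(improvement[0], improvement[-1])
```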
3.5 Summary

The main result of this chapter is given by Theorem 3.11, which serves as the starting point for the remainder of this work. It provides a uniform risk bound that is independent of the source distribution, a bound that can be used for model selection given a training sequence. A key property of Theorem 3.11 is that bounding the empirical Rademacher complexity of the class is sufficient to establish a uniform risk bound. As mentioned earlier, the training sequence over which the Rademacher complexity is computed might be 'simple' in a way that leads to tight upper bounds (the so-called 'luckiness assumption').

Chapter 4

Risk bounds for Mixture-of-Experts classifiers

Imagine a medical clinic where several doctors, each of whom is an expert in a different field of medicine, receive patients. Consider a general practitioner directing the incoming patients to the relevant doctor. Clearly, to decide which doctor is the most appropriate to treat a patient, some information must first be gathered from the patient, indicating the type of problem from which he is suffering. Then, based on this knowledge, the patient can be directed to the doctor who is most likely to know how to treat him. Sometimes, when the indications are not unequivocal, the general practitioner might decide to direct the patient to several doctors, so that the opinions of all of them can be considered. To understand the relevance of this example to our discussion, consider the alternative where only one doctor is available, with the task of providing medical treatment to every patient, irrespective of his problem.

The Mixture-of-Experts (MoE) [21, 17] classifier considers the equivalent of this example in the context of pattern classification. It is based on an adaptive soft partition of the feature space into regions, to each of which a local classifier is assigned. So, whenever a new sample is to be classified, the MoE classifier combines the decisions of several 'expert' classifiers according to their relative superiority in the region of the feature space from which the new sample was drawn. Such a procedure can be thought of, on the one hand, as extending standard approaches based on mixtures and, on the other hand, as providing a soft probabilistic extension of decision trees. The MoE architecture has been successfully applied to regression, classification, control and time series analysis [19, 22, 18, 37, 38, 39].

This chapter is organized as follows. In Section 4.1 the MoE classifier is introduced and motivated, followed by a formal description of our assumptions regarding the various components of the classifier in Section 4.2. Section 4.3 concludes this chapter, providing a risk bound for the discussed classifier.

4.1 Mixture of Experts Classifiers

Consider the MoE architecture described in Figure 4.1, given mathematically by
$$f(x) = \sum_{m=1}^{M}a_m(w_m,x)h_m(v_m,x).\qquad(4.1)$$
We interpret the functions $h_m$ as experts, each of which 'operates' in the regions of space where the gating function $a_m$ is nonzero. Such a classifier can be intuitively interpreted as implementing the principle of 'divide and conquer': instead of solving one complicated problem over the entire space, we can do better by dividing it into several regions, defined through the gating functions $a_m$, and using 'simple' experts $h_m$ in each region.

[Figure 4.1: MoE classifier with M experts.]

To demonstrate the way in which the MoE classifier operates, consider Figure 4.2. This example demonstrates how a rather complicated classifier $f$ can be constructed from very simple component classifiers $h_m$ and gating functions $a_m$. Figure 4.3 presents the classifier architecture needed to construct the classifier described in Figure 4.2.

[Figure 4.2: A combination of simple classifiers in the feature space. Rather complex decision boundaries can be achieved using simple classifiers on different subsets of the feature space.]

[Figure 4.3: A block diagram of the classifier in Figure 4.2. A MoE classifier with two linear and one radial classifiers, assigned to subsets by two radial and one linear gating functions.]

Note that assuming the gating functions $a_m$ to be independent of $x$ leads to a standard mixture. Such a model, while using several classifiers, combines them with fixed weights over the entire space. To realize the weakness of standard mixtures, consider any $M$ soft linear classifiers combined with fixed weights: the overall classifier is linear, clearly inferior to a combination of the same soft linear classifiers with data dependent weights. A minimal sketch of the architecture in (4.1) is given below.
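The following sketch (illustrative Python, not part of the original text) implements (4.1); the particular gate and expert shapes are the half-space/local gates and glim experts used throughout this chapter, with arbitrary parameter values.

```python
import numpy as np

def half_space_gate(w, x):
    """Gate that is large on one side of the hyperplane w.x = 0."""
    return 0.5 * (1.0 + np.tanh(x @ w))

def local_gate(w, x):
    """Gate that is large near the centre w (an rbf-type gate)."""
    return np.exp(-0.5 * np.sum((x - w) ** 2, axis=-1))

def glim_expert(v, x):
    """Generalized linear soft classifier, tau(v, x) = v.x."""
    return np.tanh(x @ v)

def moe(x, gates, experts, W, V):
    """Equation (4.1): f(x) = sum_m a_m(w_m, x) * h_m(v_m, x)."""
    return sum(a(w, x) * h(v, x)
               for a, h, w, v in zip(gates, experts, W, V))

rng = np.random.default_rng(0)
k, M = 10, 3
W, V = rng.normal(size=(M, k)), rng.normal(size=(M, k))
x = rng.normal(size=(5, k))
f = moe(x, [half_space_gate, local_gate, half_space_gate],
        [glim_expert] * M, W, V)
print(np.sign(f))     # hard labels in {-1, +1}
```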
In the process of constructing a MoE classifier, one needs to consider the following fundamental problems.

Partition of $\mathbb R^k$. What kind of experts are the most appropriate for our problem? What is the best way to divide the feature space? In this work we consider two general types of division, carried out by the gating functions. Generally speaking, in one type of partition each expert is considered relatively efficient over some half of the feature space, defined by a hyperplane. In the other, each expert is considered locally superior to the others over some ball in $\mathbb R^k$.

Model selection. What values should be considered for $M$? How should the feasible set of the parameters $w_m, v_m$ be determined? Notice that by enlarging $M$ or the feasible parameter set we expect the class complexity to grow, increasing the danger of overfitting. On the other hand, setting them too small might lead to underfitting.

Algorithm. How do we select the best $M, w_m, v_m$ for our problem? One disturbing problem is the non-convex nature of the loss function, which renders gradient methods inefficient and increases the computational complexity dramatically. We discuss the algorithmic problems and provide two solutions in the spirit of genetic algorithms in Chapter 7.

To face the model selection problem, we establish risk bounds for the MoE classifier. Previous attempts to establish such bounds were based on covering number approaches [28] and the VC dimension [20]. Unfortunately, such approaches are too weak to be useful in any practical setting. For example, Jiang proved that for mixtures of binary experts the VC dimension can be upper bounded by $O(M^4k^2)$ [20]. According to (2.8) this implies a risk bound of order $O\left(\frac{M^2k}{\sqrt N}\right)$, whereas in this work bounds of order $O\left(\frac{M\sqrt k}{\sqrt N}\right)$ are established. To realize the significance of this improvement, consider a case where 4 experts ($M=4$) are used to classify samples with 10 features ($k=10$). Using a bound based on the VC dimension, one needs to gather a training sequence that is 160(!) times longer than the one needed when using the bound established in this work. In fact, a careful examination of the constants in the bounds reveals that the improvement is even more significant.

4.2 Model selection

It is clear that unless some restrictions are imposed on the gating functions and experts, overfitting is imminent. We formalize our assumptions regarding the experts and gating functions below. These assumptions are weakened in Chapter 6.

Definition 4.1 (Experts) For every $m = 1,2,\ldots,M$, let $V_{max}^m$ be some nonnegative scalar and $v_m$ a vector with $k$ elements. Then, the $m$-th expert is given by a mapping $h_m(v_m, x)$ where $v_m \in \mathcal V_m = \{v \in \mathbb R^k : \|v\| \le V_{max}^m\}$. We define the collection of all functions $h_m(v_m,x)$ such that $v_m \in \mathcal V_m$ as $\mathcal H_m$. To simplify the notation we define $V_{max} = \max_m V_{max}^m$ and set
$$\mathcal H = \bigcup_{m=1}^{M}\mathcal H_m = \bigcup_{m=1}^{M}\left\{h_m(v_m,x)\,|\,v_m\in\mathcal V_m\right\}.$$

Remark. For ease of notation we use the following conventions. (1) For any parameter $\theta$ drawn from some space $\Theta$, $\sup_\theta$ is used to indicate $\sup_{\theta\in\Theta}$. (2) The symbol $\|\cdot\|$ is used to indicate $\|\cdot\|_2$; we note that our results can be generalized to other definitions of this norm.

In Definition 4.1 the feasible parameter set is used to define the class $\mathcal H_m$. Such a definition serves the purpose of regularization, as it enables us to control the class size via the definition of the feasible parameter set. Assumption 4.2 addresses the properties of the mapping $h_m$.

Assumption 4.2 The following assumptions are made for every $m=1,2,\ldots,M$.

1. To allow different types of experts, assume $h_m(v_m,x) = h_m(\tau_m(v_m,x))$, where $\tau_m(v_m,x)$ is some mapping such as $v_m^\top x$ or $\|v_m - x\|$. $h_m(\tau_m(v_m,x))$ is assumed to be Lipschitz with constant $L_{h_m}$, i.e.
$$|h_m(\tau_m(v_{m_1},x)) - h_m(\tau_m(v_{m_2},x))| \le L_{h_m}|\tau_m(v_{m_1},x) - \tau_m(v_{m_2},x)|.$$
To simplify the notation, we sometimes replace $L_{h_m}$ by $L_h = \max_m L_{h_m}$.

2. $|h_m(v_m,x)|$ is bounded by some positive constant $M_{\mathcal H_m} < \infty$. Setting $M_\mathcal H = \max_m M_{\mathcal H_m}$ implies $\sup_{m,v_m}|h_m(v_m,x)| \le M_\mathcal H$.

3. The experts are either symmetric (for regression) or antisymmetric (for classification) with respect to the parameters, i.e. there exists $\nu\in\{\pm1\}$ for which $h_m(v_m,x) = \nu h_m(-v_m,x)$.

We emphasize that, even though $x$ is referred to as a sample of the feature space, our results hold for experts $h_m(v_m,x) = h_m(v_m,\Phi_m(x))$, where $\Phi_m(x)$ is some nonlinear mapping. The use of such experts results in a very powerful classifier that addresses different subspaces of the feature space using the most appropriate kernels [33]. Indeed, this approach can be interpreted as local classification, i.e. each expert faces the problem of classification locally.

The gating functions $a_m$ reflect the relative weights of each of the experts at a given point $x$. In the sequel two types of gating functions, described in the following definition, are considered.

Definition 4.3 (Gating functions) For every $m=1,2,\ldots,M$, let $W_{max}^m$ be some nonnegative scalar and $w_m$ a vector with $k$ elements. Then, the $m$-th gating function is given by a mapping $a_m(w_m,x)$ where $w_m\in\mathcal W_m = \{w\in\mathbb R^k : \|w\| \le W_{max}^m\}$. We define the collection of all functions $a_m(w_m,x)$ such that $w_m\in\mathcal W_m$ as $\mathcal A_m$. To simplify the notation we define $W_{max} = \sup_m W_{max}^m$ and set
$$\mathcal A = \bigcup_{m=1}^{M}\mathcal A_m = \bigcup_{m=1}^{M}\left\{a_m(w_m,x)\,|\,w_m\in\mathcal W_m\right\}.$$
$a_m(w_m,x)$ is said to be a half-space gate if $a_m(w_m,x) = a_m(w_m^\top x)$ and a local gate if $a_m(w_m,x) = a_m\left(\frac12\|w_m - x\|^2\right)$.

Similarly to Assumption 4.2, the following assumptions serve the purpose of regularization.

Assumption 4.4 The following assumptions are made for every $m=1,2,\ldots,M$.

1. $a_m(w_m,x)$ is Lipschitz with constant $L_{a_m}$, analogously to Assumption 4.2. We define the global Lipschitz constant $L_a = \max_m L_{a_m}$.

2. $a_m(w_m,x)$ is nonnegative and bounded by some positive constant $M_{\mathcal A_m} < \infty$. Setting $M_\mathcal A = \max_m M_{\mathcal A_m}$ implies $\sup_{m,w_m}|a_m(w_m,x)| \le M_\mathcal A$.

4.3 Establishing risk bounds for MoE classifiers

Having introduced the MoE classifier and set the framework of our discussion, we begin our analysis. The problem of bounding the empirical Rademacher complexity $\hat R_N(\mathcal F)$, defined in (3.1), for the class of MoE classifiers is addressed, beginning with Lemma 4.5. Unless stated otherwise, wherever we write 'Rademacher complexity' we refer to the empirical Rademacher complexity.

Lemma 4.5 Let $\mathcal F_m = \{a_m(w_m,x)h_m(v_m,x) : a_m(w_m,x)\in\mathcal A_m,\ h_m(v_m,x)\in\mathcal H_m\}$. Then,
$$\hat R_N(\mathcal F) = \sum_{m=1}^{M}\hat R_N(\mathcal F_m).$$

Proof By definition, since the set of parameters $(w_i,v_i)$ is independent of $(w_j,v_j)$ for every $1\le i,j\le M$, $i\ne j$,
$$\hat R_N(\mathcal F) = \mathbb E_\sigma\left\{\sup_{w,v}\frac1N\sum_{n=1}^N\sigma_n\sum_{m=1}^M a_m(w_m,x_n)h_m(v_m,x_n)\right\} = \sum_{m=1}^M\mathbb E_\sigma\left\{\sup_{w_m,v_m}\frac1N\sum_{n=1}^N\sigma_n a_m(w_m,x_n)h_m(v_m,x_n)\right\}.\ ∎$$

Thus, it suffices to bound $\hat R_N(\mathcal F_m)$, $m=1,2,\ldots,M$, in order to establish bounds for $\hat R_N(\mathcal F)$. This is achieved using the following Lemma.

Lemma 4.6 Let $\mathcal G_1, \mathcal G_2$ be two classes defined over some sets $\mathcal X_1, \mathcal X_2$ respectively, and define the class $\mathcal G_3$ as
$$\mathcal G_3 = \left\{g : g(x_1,x_2) = g_1(x_1)g_2(x_2),\ g_1\in\mathcal G_1,\ g_2\in\mathcal G_2\right\}.$$
Assume that at least one of the classes $\mathcal G_1,\mathcal G_2$ is closed under negation.
Then,
$$Z(\mathcal G_3) \le M_2Z(\mathcal G_1) + M_1Z(\mathcal G_2),$$
where $Z(\mathcal G_i) = \mathbb E_\sigma\left\{\sup_{g\in\mathcal G_i}\sum_{n=1}^N\sigma_ng(x_n)\right\}$ for $i=1,2,3$ and $M_i = \sup_{g_i\in\mathcal G_i,x_i\in\mathcal X_i}|g_i(x_i)|$ for $i=1,2$.

The following Lemma is used to prove Lemma 4.6.

Lemma 4.7 Let the definitions and notations of Lemma 4.6 hold. Then, for any function $C(g_1,g_2,x)$ there exists $\nu\in\{\pm1\}$ such that
$$\mathbb E_\sigma\left\{\sup_{g_1,g_2}\left(C(g_1,g_2,x) + \sigma g_1(x)g_2(x)\right)\right\} \le \mathbb E_\sigma\left\{\sup_{g_1,g_2}\left(C(g_1,\nu g_2,x) + M_2\sigma g_1(x) + M_1\sigma g_2(x)\right)\right\}.$$

Proof (of Lemma 4.7)
$$\mathbb E_\sigma\left\{\sup_{g_1,g_2}\left(C(g_1,g_2,x) + \sigma g_1(x)g_2(x)\right)\right\}$$
$$= \frac12\sup_{g_1,g_2}\left(C(g_1,g_2,x) + g_1(x)g_2(x)\right) + \frac12\sup_{g_1,g_2}\left(C(g_1,g_2,x) - g_1(x)g_2(x)\right)$$
$$= \frac12\sup_{g_1,g_2,\tilde g_1,\tilde g_2}\left(C(g_1,g_2,x) + g_1(x)g_2(x) + C(\tilde g_1,\tilde g_2,x) - \tilde g_1(x)\tilde g_2(x)\right)$$
$$\stackrel{(a)}{=} \frac12\sup_{g_1,g_2,\tilde g_1,\tilde g_2}\left(C(g_1,g_2,x) + C(\tilde g_1,\tilde g_2,x) + |g_1(x)g_2(x) - \tilde g_1(x)\tilde g_2(x)|\right)$$
$$\stackrel{(b)}{\le} \frac12\sup_{g_1,g_2,\tilde g_1,\tilde g_2}\left(C(g_1,g_2,x) + C(\tilde g_1,\tilde g_2,x) + M_1|g_2(x) - \tilde g_2(x)| + M_2|g_1(x) - \tilde g_1(x)|\right),\qquad(4.2)$$
where (a) is due to the symmetry of the expression over which the supremum is taken, and (b) is immediate, using the inequality
$$|g_1(x)g_2(x) - \tilde g_1(x)\tilde g_2(x)| = |g_1(x)(g_2(x)-\tilde g_2(x)) + \tilde g_2(x)(g_1(x)-\tilde g_1(x))| \le M_1|g_2(x)-\tilde g_2(x)| + M_2|g_1(x)-\tilde g_1(x)|.$$
Next, we denote by $g_1^*,g_2^*,\tilde g_1^*,\tilde g_2^*$ the functions for which the supremum in (4.2) is achieved, and address all cases of the signs of the terms inside the absolute values in (4.2).

Case 1: $g_2^*(x)>\tilde g_2^*(x)$, $g_1^*(x)>\tilde g_1^*(x)$.
$$\sup_{g_1,g_2,\tilde g_1,\tilde g_2}\left\{C(g_1,g_2,x) + C(\tilde g_1,\tilde g_2,x) + M_1(g_2(x)-\tilde g_2(x)) + M_2(g_1(x)-\tilde g_1(x))\right\}$$
$$= \sup_{g_1,g_2}\left\{C(g_1,g_2,x) + M_1g_2(x) + M_2g_1(x)\right\} + \sup_{\tilde g_1,\tilde g_2}\left\{C(\tilde g_1,\tilde g_2,x) - M_1\tilde g_2(x) - M_2\tilde g_1(x)\right\}$$
$$= 2\,\mathbb E_\sigma\sup_{g_1,g_2}\left\{C(g_1,g_2,x) + M_1\sigma g_2(x) + M_2\sigma g_1(x)\right\}.$$

Case 2: $g_2^*(x)>\tilde g_2^*(x)$, $g_1^*(x)<\tilde g_1^*(x)$.
$$\sup\left\{C(g_1,g_2,x) + C(\tilde g_1,\tilde g_2,x) + M_1(g_2(x)-\tilde g_2(x)) + M_2(\tilde g_1(x)-g_1(x))\right\}$$
$$\stackrel{(a)}{=} \sup\left\{C(g_1,-g_2,x) + C(\tilde g_1,-\tilde g_2,x) + M_1(\tilde g_2(x)-g_2(x)) + M_2(\tilde g_1(x)-g_1(x))\right\}$$
$$= 2\,\mathbb E_\sigma\sup_{g_1,g_2}\left\{C(g_1,-g_2,x) + M_1\sigma g_2(x) + M_2\sigma g_1(x)\right\},$$
where (a) is due to the assumption that $\mathcal G_2$ is closed under negation. Notice that the cases $g_2^*(x)<\tilde g_2^*(x),\ g_1^*(x)<\tilde g_1^*(x)$ and $g_2^*(x)<\tilde g_2^*(x),\ g_1^*(x)>\tilde g_1^*(x)$ are analogous to cases 1 and 2, respectively; thus Lemma 4.7 is proved. ∎

Proof (of Lemma 4.6) Suitably setting $C(g_1,g_2,x)$ in Lemma 4.7, we have
$$\mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\sum_{n=1}^N\sigma_ng_1(x_n)g_2(x_n)\right\} \le \mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\left(\sum_{n=2}^N\nu_1\sigma_ng_1(x_n)g_2(x_n) + M_2\sigma_1g_1(x_1) + M_1\nu_1\sigma_1g_2(x_1)\right)\right\},\qquad(4.3)$$
where $\nu_1\in\{\pm1\}$. Since $\sigma_n\in\{\pm1\}$, then $\nu_1\sigma_n\in\{\pm1\}$ for all $n=2,\ldots,N$ as well. Thus, (4.3) implies
$$\mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\sum_{n=1}^N\sigma_ng_1(x_n)g_2(x_n)\right\} \le \mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\left(\sum_{n=2}^N\sigma_ng_1(x_n)g_2(x_n) + M_2\sigma_1g_1(x_1) + M_1\nu_1\sigma_1g_2(x_1)\right)\right\}.$$
Carrying out this procedure sequentially $N$ times, with a suitable redefinition of $C(g_1,g_2,x)$ each time, it is easy to see that
$$\mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\sum_{n=1}^N\sigma_ng_1(x_n)g_2(x_n)\right\} \le \mathbb E_{\sigma_1^N}\left\{\sup_{g_1,g_2}\left(M_2\sum_{n=1}^N\sigma_ng_1(x_n) + M_1\sum_{n=1}^N\Gamma(n)\sigma_ng_2(x_n)\right)\right\}$$
$$= M_2\,\mathbb E_{\sigma_1^N}\left\{\sup_{g_1}\sum_{n=1}^N\sigma_ng_1(x_n)\right\} + M_1\,\mathbb E_{\sigma_1^N}\left\{\sup_{g_2}\sum_{n=1}^N\Gamma(n)\sigma_ng_2(x_n)\right\},$$
where $\Gamma(n) = \prod_{i=n}^N\nu_i$. Recall that $\nu_i\in\{\pm1\}$ for all $i=1,2,\ldots,N$; thus $\Gamma(n)\in\{\pm1\}$ for all $n=1,2,\ldots,N$. So, by redefining $\sigma_n\mapsto\Gamma(n)\sigma_n$ for all $n$ in the second term of the last equality, we complete the proof of Lemma 4.6. ∎

Notice that Lemma 4.6 implies the following corollary.
Corollary 4.8 For every $m=1,2,\ldots,M$ define $\mathcal F_m$ as in Lemma 4.5. Then,
$$\hat R_N(\mathcal F_m) \le M_{\mathcal H_m}\hat R_N(\mathcal A_m) + M_{\mathcal A_m}\hat R_N(\mathcal H_m).$$

Remark. We emphasize that Corollary 4.8 is tight. To see this, set the gating functions to be independent of $x$. In such a case $\hat R_N(\mathcal A_m) = 0$ and an equality is obtained.

Remark. Using the fact that $\mathcal A_m\mathcal H_m = \frac14\left((\mathcal A_m+\mathcal H_m)^2 - (\mathcal A_m-\mathcal H_m)^2\right)$ and the Lipschitz property of the quadratic function over a bounded domain, it can be shown that
$$\hat R_N(\mathcal F_m) \le \left(M_{\mathcal H_m} + M_{\mathcal A_m}\right)\left(\hat R_N(\mathcal A_m) + \hat R_N(\mathcal H_m)\right).$$
Even though this result is much easier to prove than Corollary 4.8, the bound it provides is looser.

Thus, the problem of bounding $\hat R_N(\mathcal F_m)$ boils down to bounding $\hat R_N(\mathcal A_m)$ and $\hat R_N(\mathcal H_m)$. To do so, we introduce the following Lemma, a variant of Lemma 3.8 in a form more suitable for our current discussion.

Lemma 4.9 Let $\Theta\subseteq\mathbb R^k$ denote the feasible set of some parameter $\theta$ and let $\tau(\theta,x) : \mathbb R^k\mapsto\mathbb R$ be some parameterized function. Define the class of mappings $\mathcal F : \mathbb R^k\mapsto\mathbb R$ as $\mathcal F = \{f(\tau(\theta,x)) : \theta\in\Theta\}$ and assume that every function $f\in\mathcal F$ is Lipschitz with constant $L_\mathcal F$. Then,
$$\mathbb E_\sigma\left\{\sup_{\theta\in\Theta}\frac1N\sum_{n=1}^N\sigma_nf(\tau(\theta,x_n))\right\} \le L_\mathcal F\,\mathbb E_\sigma\left\{\sup_{\theta\in\Theta}\frac1N\sum_{n=1}^N\sigma_n\tau(\theta,x_n)\right\}.$$

To minimize the technical burden, the experts are assumed to be generalized linear models (glim, see [26]), i.e. $\tau(v_m,x) = v_m^\top x$ in Assumption 4.2. An extension to radial basis functions (rbf), i.e. $\tau(v_m,x) = \|v_m-x\|^2$, is immediate using our analysis of local gating functions, and extensions to many other types can be achieved using similar techniques. The Lipschitz property of the class $\mathcal H_m$ along with Lemma 4.9 implies
$$\hat R_N(\mathcal H_m) \le \frac{L_{h_m}}{N}\,\mathbb E_\sigma\left\{\sup_{v_m}v_m^\top\sum_{n=1}^N\sigma_nx_n\right\}.$$
Obviously, the supremum for each value of $m$ is obtained when $v_m$ is chosen to be in the direction of $\sum_{n=1}^N\sigma_nx_n$ with $\|v_m\| = V_{max}^m$. This choice of $v_m$ leads to
$$\hat R_N(\mathcal H_m) \le \frac{L_{h_m}V_{max}^m}{N}\,\mathbb E_\sigma\left\{\left\|\sum_{n=1}^N\sigma_nx_n\right\|\right\} = \frac{L_{h_m}V_{max}^m}{N}\,\mathbb E_\sigma\sqrt{\sum_{j=1}^k\left(\sum_{n=1}^N\sigma_nx_{nj}\right)^2}$$
$$\stackrel{(a)}{\le} \frac{L_{h_m}V_{max}^m}{N}\sqrt{\sum_{j=1}^k\mathbb E_\sigma\left(\sum_{n=1}^N\sigma_nx_{nj}\right)^2} = \frac{L_{h_m}V_{max}^m}{N}\sqrt{\sum_{j=1}^k\sum_{n=1}^N\sum_{p=1}^N\mathbb E_\sigma\{\sigma_n\sigma_p\}x_{nj}x_{pj}} = \frac{L_{h_m}V_{max}^m\,\bar x}{\sqrt N},$$
where (a) is due to Jensen's inequality and $\bar x = \sqrt{N^{-1}\sum_{j=1}^k\sum_{n=1}^Nx_{nj}^2}$.

For half-space gating functions, a similar argument yields
$$\hat R_N(\mathcal A_m) \le \frac{L_{a_m}W_{max}^m\,\bar x}{\sqrt N}.\qquad(4.4)$$

For the case of local gating functions we define $a_m(w_m,x) = a_m\left(\frac12\|w_m-x\|^2\right)$. Using the Lipschitz property of the class $\mathcal A_m$ yields
$$\hat R_N(\mathcal A_m) = \mathbb E_\sigma\left\{\sup_{w_m}\frac1N\sum_{n=1}^N\sigma_na_m\left(\|w_m-x_n\|^2/2\right)\right\} \le \frac{L_{a_m}}{2N}\,\mathbb E_\sigma\left\{\sup_{w_m}\sum_{n=1}^N\sigma_n\|w_m-x_n\|^2\right\}$$
$$= \frac{L_{a_m}}{2N}\,\mathbb E_\sigma\left\{\sup_{w_m}\sum_{n=1}^N\sigma_n\left(\|w_m\|^2 - 2\langle w_m,x_n\rangle + \|x_n\|^2\right)\right\} \stackrel{(a)}{=} \frac{L_{a_m}}{2N}\,\mathbb E_\sigma\left\{\sup_{w_m}\sum_{n=1}^N\sigma_n\left(\|w_m\|^2 - 2\langle w_m,x_n\rangle\right)\right\}$$
$$\le \frac{L_{a_m}}{2N}\,\mathbb E_\sigma\left\{\sup_{w_m}\|w_m\|^2\sum_{n=1}^N\sigma_n\right\} + \frac{L_{a_m}}{N}\,\mathbb E_\sigma\left\{\sup_{w_m}\sum_{n=1}^N\sigma_n\langle w_m,x_n\rangle\right\},\qquad(4.5)$$
where (a) holds because $\mathbb E_\sigma\left\{\sum_{n=1}^N\sigma_n\|x_n\|^2\right\} = 0$. We compute each term of (4.5) separately. First, observe that an argument similar to the one used for half-space gating functions bounds the second term,
$$\frac{L_{a_m}}{N}\,\mathbb E_\sigma\left\{\sup_{w_m}\sum_{n=1}^N\sigma_n\langle w_m,x_n\rangle\right\} \le \frac{L_{a_m}W_{max}^m\,\bar x}{\sqrt N}.$$
To bound the first term of (4.5), notice that
$$\sup_{w_m}\|w_m\|^2\sum_{n=1}^N\sigma_n = I\left[\sum_{n=1}^N\sigma_n>0\right](W_{max}^m)^2\sum_{n=1}^N\sigma_n.$$
The average of this expression over all possible σ is bounded as follows.
$$\mathbb E_\sigma\left\{I\left[\sum_{n=1}^N\sigma_n>0\right]\sum_{n=1}^N\sigma_n\right\} = \mathbb E_\sigma\sqrt{I\left[\sum_{n=1}^N\sigma_n>0\right]\left(\sum_{n=1}^N\sigma_n\right)^2}$$
$$\stackrel{(a)}{\le} \sqrt{\mathbb E_\sigma\left\{I\left[\sum_{n=1}^N\sigma_n>0\right]\left(\sum_{n=1}^N\sigma_n\right)^2\right\}} = \sqrt{\frac12\sum_{n=1}^N\sum_{p=1}^N\mathbb E_\sigma\{\sigma_n\sigma_p\}} = \sqrt{\frac N2},$$
where (a) is due to Jensen's inequality. Combining all of the above, the Rademacher complexity of a local gating function is upper bounded as
$$\hat R_N(\mathcal A_m) \le \frac{L_{a_m}}{\sqrt N}\left(\frac{(W_{max}^m)^2}{\sqrt8} + W_{max}^m\,\bar x\right).$$
We summarize our results in the following Theorem.

Theorem 4.10 Let $\mathcal F$ be the class of mixture-of-experts classifiers with $M$ glim experts. Assume that gates $1,2,\ldots,M_1$ are local and $M_1+1,\ldots,M$ are half-space, where $0\le M_1\le M$. Then, the Rademacher complexity of $\mathcal F$ satisfies
$$\hat R_N(\mathcal F) \le \frac{1}{\sqrt N}\left[\sum_{m=1}^{M_1}c_{1,m}(W_{max}^m)^2 + \sum_{m=1}^{M}c_{2,m}W_{max}^m + \sum_{m=1}^{M}c_{3,m}V_{max}^m\right],\qquad(4.6)$$
where $c_{1,m} = M_{\mathcal H_m}L_{a_m}/\sqrt8$, $c_{2,m} = M_{\mathcal H_m}L_{a_m}\bar x$ and $c_{3,m} = M_{\mathcal A_m}L_{h_m}\bar x$ for all $m=1,2,\ldots,M$.

Combining Theorems 4.10 and 3.11 results in the following Corollary.

Corollary 4.11 Let the definitions and notations of Theorem 4.10 hold. Then, for every integer $N$ and every $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + \frac{A}{\sqrt N} + \sqrt{\frac{A\bar\phi\ln\frac2\delta}{N^{3/2}}} + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}} + \frac{3\bar\phi}{2N}\ln\frac2\delta,$$
where
$$A = 2L_\phi\left[\sum_{m=1}^{M_1}c_{1,m}(W_{max}^m)^2 + \sum_{m=1}^{M}c_{2,m}W_{max}^m + \sum_{m=1}^{M}c_{3,m}V_{max}^m\right].$$
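The bound of Corollary 4.11 is easy to evaluate numerically. The sketch below (illustrative Python; every constant passed in is an assumption, not a value fixed by the text) computes (4.6) and then the risk bound, using the form of Theorem 3.11 with the complexity bound plugged in.

```python
import numpy as np

def moe_rademacher_bound(N, W_max, V_max, M1, L_a, L_h, M_A, M_H, x_bar):
    """Right hand side of (4.6), with equal per-expert constants for simplicity."""
    M = len(W_max)
    c1, c2, c3 = M_H * L_a / np.sqrt(8), M_H * L_a * x_bar, M_A * L_h * x_bar
    return (sum(c1 * W_max[m]**2 for m in range(M1))
            + sum(c2 * W_max[m] for m in range(M))
            + sum(c3 * V_max[m] for m in range(M))) / np.sqrt(N)

def risk_bound(emp_phi_risk, R_hat, N, L_phi=2.0, phi_bar=2.0, delta=1e-4):
    """Theorem 3.11 with R_hat = hat{R}_N(F); equals Corollary 4.11 when
    R_hat is the value of (4.6)."""
    ln2d = np.log(2.0 / delta)
    return (emp_phi_risk + 2*L_phi*R_hat
            + np.sqrt(2*L_phi*phi_bar*ln2d*R_hat/N)
            + phi_bar*np.sqrt(ln2d/(2*N)) + 3*phi_bar*ln2d/(2*N))

N, M, M1 = 300, 4, 2
R_hat = moe_rademacher_bound(N, W_max=[1.0]*M, V_max=[1.0]*M, M1=M1,
                             L_a=1.0, L_h=1.0, M_A=1.0, M_H=1.0,
                             x_bar=np.sqrt(10))
print(risk_bound(emp_phi_risk=0.3, R_hat=R_hat, N=N))
```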
4.4 Summary

We introduced the MoE classifier, where instead of constructing a single complicated classifier over the entire feature space, several simpler classifiers, referred to as experts, are linearly combined with data dependent weights, referred to as gates. Such a classifier can be used to define a complex classification surface while keeping the class complexity small. Two general ways of dividing the feature space were considered, and risk bounds for MoE classifiers implementing such divisions were established. A possible deficiency of our bound is the necessity of predefining the radii of the feasible parameter sets; these predefined radii enter the bound through the class complexity. In Chapter 6 we generalize our result to circumvent such a predefinition.

Chapter 5

Risk bounds for Hierarchical Mixture of Experts

The MoE classifier is a linear combination of M experts. An intuitive interpretation of this architecture corresponds to the division of the feature space into subspaces, in each of which the experts are linearly combined using the gating functions $a_m$. The Hierarchical MoE (HMoE) [21] takes this procedure one step further by recursively dividing the subspaces, using a 'local' MoE classifier as the expert in each domain. It is motivated by the idea of local search: if the MoE is expected to simplify the problem of constructing a classifier over the entire feature space, then using such a mixture for each of the subspaces might prove to be the right thing to do. In this chapter the bound obtained for MoE classifiers is extended to the case of HMoE. We demonstrate the procedure for a two-levelled balanced hierarchy with M experts (see Figure 5.1). It is easy to repeat the same procedure for any number of levels, whether the HMoE is balanced or not, using the same idea.

5.1 Some preliminaries & definitions

We begin with the mathematical description of a balanced two-level HMoE classifier, described in Figure 5.1. Let $f(x)$ be the output of the HMoE classifier and let $f_m(\theta_m,x)$ denote the output of the $m$-th expert for all $m=1,2,\ldots,M$. The parameter $\theta_m$ comprises all the parameters of the $m$-th expert, as will be detailed shortly. The classifier is described by
$$f(x) = \sum_{m=1}^{M}f_m(\theta_m,x),$$
where $f_m(\theta_m,x)$ is given by
$$f_m(\theta_m,x) = a_m(w_m,x)\sum_{j=1}^{M}a_{mj}(w_{mj},x)h_{mj}(v_{mj},x).\qquad(5.1)$$
$a_{mj}(w_{mj},x)$ and $h_{mj}(v_{mj},x)$ are referred to as the $mj$-th gating function and expert, respectively; we interpret the former as the gate of the latter.

[Figure 5.1: Balanced two-levelled HMoE classifier with M experts.]

Similarly to Section 4.2, we formally define the experts and gating functions of the HMoE classifier, followed by a formal description of our assumptions.

Definition 5.1 (HMoE experts) For every $m,j = 1,2,\ldots,M$, let $V_{max}^{mj}$ be some nonnegative scalar and $v_{mj}$ a vector with $k$ elements. Then, the $mj$-th expert is a mapping $h_{mj}(v_{mj},x)$ where $v_{mj}\in\mathcal V_{mj} = \{v\in\mathbb R^k : \|v\|\le V_{max}^{mj}\}$. We define the collection of all functions $h_{mj}(v_{mj},x)$ such that $v_{mj}\in\mathcal V_{mj}$ as $\mathcal H_{mj}$. To simplify the notation we set
$$\mathcal H_m = \bigcup_{j=1}^M\mathcal H_{mj} = \bigcup_{j=1}^M\left\{h_{mj}(v_{mj},x),\ v_{mj}\in\mathcal V_{mj}\right\},\qquad \mathcal H = \bigcup_{m=1}^M\mathcal H_m.$$

Assumption 5.2 The following assumptions are made for every $m,j=1,2,\ldots,M$.

1. $h_{mj}(v_{mj},x)$ is Lipschitz with constant $L_{h_{mj}}$.
2. $|h_{mj}(v_{mj},x)|$ is bounded by some positive constant $M_{\mathcal H_{mj}}$.
3. $h_{mj}(v_{mj},x)$ is either symmetric or antisymmetric with respect to the parameter.

As for the gating functions, we have the following, analogous to Definition 4.3 and Assumption 4.4.

Definition 5.3 (HMoE gating functions) For every $m,j=1,2,\ldots,M$, let $W_{max}^{mj}$ be a nonnegative scalar and $w_{mj}$ a vector with $k$ elements. Then, the $mj$-th gating function is a mapping $a_{mj}(w_{mj},x)$ where $w_{mj}\in\mathcal W_{mj} = \{w\in\mathbb R^k : \|w\|\le W_{max}^{mj}\}$. We define the collection of all functions $a_{mj}(w_{mj},x)$ such that $w_{mj}\in\mathcal W_{mj}$ as $\mathcal A_{mj}$. To simplify the notation in the sequel, we set
$$\mathcal A_m = \bigcup_{j=1}^M\mathcal A_{mj} = \bigcup_{j=1}^M\left\{a_{mj}(w_{mj},x),\ w_{mj}\in\mathcal W_{mj}\right\},\qquad \bar{\mathcal A}_m = \left\{a_m(w_m,x),\ w_m\in\mathcal W_m\right\},\qquad \mathcal A = \bigcup_{m=1}^M\mathcal A_m\cup\bar{\mathcal A}_m.$$

Assumption 5.4 The following assumptions are made for every $m,j=1,2,\ldots,M$.

1. $a_{mj}(w_{mj},x)$, $a_m(w_m,x)$ are Lipschitz with constants $L_{a_{mj}}$, $L_{a_m}$ respectively.
2. $a_{mj}(w_{mj},x)$, $a_m(w_m,x)$ are nonnegative and bounded by some positive constants $M_{\mathcal A_{mj}}$, $M_{\bar{\mathcal A}_m}$ respectively.

To facilitate the notation, we sometimes use $L_h = \max_{m,j}L_{h_{mj}}$, $M_\mathcal H = \max_{m,j}M_{\mathcal H_{mj}}$, $L_a = \max\left(\max_mL_{a_m},\ \max_{m,j}L_{a_{mj}}\right)$ and $M_\mathcal A = \max\left(\max_mM_{\bar{\mathcal A}_m},\ \max_{m,j}M_{\mathcal A_{mj}}\right)$. However, though leading to simplified bounds, it is best not to use these definitions if one attempts to obtain the tightest possible bounds.

Before we continue, we give a formal definition of the feasible HMoE parameter set.

Definition 5.5 (Feasible HMoE parameter set) For every $m=1,2,\ldots,M$ set $\theta_m = [w_m,w_{m1},w_{m2},\ldots,w_{mM},v_{m1},v_{m2},\ldots,v_{mM}]$, the parameter of the $m$-th (MoE) expert at the first level of the HMoE classifier, and define $\theta = [\theta_1,\ldots,\theta_M]$, the parameter of the HMoE. For every $m=1,2,\ldots,M$ set
$$\Theta_m = \left\{\theta_m : w_m\in\mathcal W_m,\ w_{mj}\in\mathcal W_{mj},\ v_{mj}\in\mathcal V_{mj},\ j=1,2,\ldots,M\right\},$$
the feasible set of $\theta_m$. The feasible set of $\theta$ is denoted by $\Theta = \{\theta : \theta_m\in\Theta_m,\ m=1,2,\ldots,M\}$.

5.2 Upper bounds for the HMoE $\hat R_N(\mathcal F)$

Recall that we are seeking to bound the Rademacher complexity for the class of HMoE classifiers; a minimal sketch of the two-level architecture (5.1) is given below.
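The sketch (illustrative Python; the half-space gate and glim expert shapes and all parameter values are assumptions) implements the balanced two-level hierarchy of (5.1).

```python
import numpy as np

def hmoe(x, gate, expert, W_top, W, V):
    """Balanced two-level HMoE, equation (5.1):
    f(x) = sum_m a_m(w_m, x) * sum_j a_mj(w_mj, x) * h_mj(v_mj, x)."""
    M = len(W_top)
    return sum(gate(W_top[m], x) *
               sum(gate(W[m][j], x) * expert(V[m][j], x) for j in range(M))
               for m in range(M))

gate   = lambda w, x: 0.5 * (1.0 + np.tanh(x @ w))   # half-space gate
expert = lambda v, x: np.tanh(x @ v)                 # glim expert

rng = np.random.default_rng(0)
k, M = 10, 3
W_top = rng.normal(size=(M, k))        # first-level gates  w_m
W = rng.normal(size=(M, M, k))         # second-level gates w_mj
V = rng.normal(size=(M, M, k))         # experts            v_mj
x = rng.normal(size=(5, k))
print(np.sign(hmoe(x, gate, expert, W_top, W, V)))
```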
We begin with the following Lemma, a variant of Lemma 4.5 adapted for the HMoE.

Lemma 5.6 Consider all parameterized functions $a_m(w_m,x)$, $a_{mj}(w_{mj},x)$, $h_{mj}(v_{mj},x)$ defined in Definitions 5.1 and 5.3 and satisfying Assumptions 5.2 and 5.4. Define the class of mappings
$$\mathcal F = \left\{f(\theta,x) = \sum_{m=1}^Mf_m(\theta_m,x)\ \middle|\ \theta\in\Theta\right\},\qquad(5.2)$$
where $f_m(\theta_m,x)$ is defined in equation (5.1), and for every $m=1,2,\ldots,M$ define
$$\mathcal F_m = \left\{f_m(\theta_m,x) = a_m(w_m,x)\sum_{j=1}^Ma_{mj}(w_{mj},x)h_{mj}(v_{mj},x)\ \middle|\ \theta_m\in\Theta_m\right\}.$$
Then,
$$\hat R_N(\mathcal F) = \sum_{m=1}^M\hat R_N(\mathcal F_m).\qquad(5.3)$$

Proof The proof is immediate, using the independence of the parameters, as follows:
$$\hat R_N(\mathcal F) = \mathbb E_\sigma\left\{\sup_\theta\frac1N\sum_{n=1}^N\sigma_n\sum_{m=1}^Mf_m(\theta_m,x_n)\right\} = \mathbb E_\sigma\left\{\sup_{\theta_1,\ldots,\theta_M}\frac1N\sum_{m=1}^M\sum_{n=1}^N\sigma_nf_m(\theta_m,x_n)\right\} = \sum_{m=1}^M\mathbb E_\sigma\left\{\sup_{\theta_m}\frac1N\sum_{n=1}^N\sigma_nf_m(\theta_m,x_n)\right\},$$
which, by definition, completes the proof of Lemma 5.6. ∎

Thus, our problem boils down to bounding the summands in (5.3). Notice that for every $m=1,\ldots,M$ we have $\sup_{\theta_m}\left|\sum_{j=1}^Ma_{mj}(w_{mj},x_n)h_{mj}(v_{mj},x_n)\right| \le MM_\mathcal HM_\mathcal A$. Using Corollary 4.8 recursively twice leads to the following result:
$$\hat R_N(\mathcal F_m) \le MM_\mathcal HM_\mathcal A\hat R_N(\bar{\mathcal A}_m) + M_\mathcal A\,\mathbb E_\sigma\left\{\sup_{\theta_m}\frac1N\sum_{n=1}^N\sigma_n\sum_{j=1}^Ma_{mj}(w_{mj},x_n)h_{mj}(v_{mj},x_n)\right\}$$
$$= MM_\mathcal HM_\mathcal A\hat R_N(\bar{\mathcal A}_m) + M_\mathcal A\sum_{j=1}^M\mathbb E_\sigma\left\{\sup_{\theta_{mj}}\frac1N\sum_{n=1}^N\sigma_na_{mj}(w_{mj},x_n)h_{mj}(v_{mj},x_n)\right\}$$
$$\le MM_\mathcal HM_\mathcal A\hat R_N(\bar{\mathcal A}_m) + M_\mathcal A\sum_{j=1}^M\left(M_\mathcal H\hat R_N(\mathcal A_{mj}) + M_\mathcal A\hat R_N(\mathcal H_{mj})\right)$$
$$\le MM_\mathcal A\left(M_\mathcal H\hat R_N(\bar{\mathcal A}_m) + M_\mathcal H\hat R_N(\mathcal A_m) + M_\mathcal A\hat R_N(\mathcal H_m)\right)$$
$$\le MM_\mathcal A\left[2M_\mathcal H\hat R_N(\mathcal A) + M_\mathcal A\hat R_N(\mathcal H)\right],\qquad m=1,2,\ldots,M.\qquad(5.4)$$
Notice that each of the last three inequalities can be used to bound the Rademacher complexity of HMoE classifiers. When one wishes to use it within a risk bound, the ease of notation should be weighed against the tightness of the bound. Although Theorem 5.7 states the simplest bound, in Chapter 7 the tighter one is used.

Combining (5.4) and (5.3) with Theorem 3.11 implies the following Theorem.

Theorem 5.7 Let $\mathcal F$ be the class of balanced two-level HMoE classifiers with glim experts and gating functions. Then, for every integer $N$ and every $\delta\in(0,1)$, with probability at least $1-\delta$ over training sequences of length $N$, every $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + \frac{c_1M^2}{\sqrt N}(W_{max}+V_{max}) + \sqrt{\frac{c_2M^2}{N^{3/2}}(W_{max}+V_{max})} + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}} + \frac{c_3}{N},$$
where $c_1 = 2L_\phi\max(M_\mathcal HL_a, M_\mathcal AL_h)\bar x$, $c_2 = 2L_\phi\max(M_\mathcal HL_a, M_\mathcal AL_h)\bar x\bar\phi\ln\frac2\delta$ and $c_3 = 1.5\bar\phi\ln\frac2\delta$.

5.3 Summary

The HMoE classifier was introduced and interpreted as a recursive implementation of the MoE classifier. While, both conceptually and practically, any classifier implemented by an HMoE can also be implemented by a MoE classifier, the HMoE has the advantage of a recursive implementation. The main result of this chapter is given by Theorem 5.7, where the risk of every two-level balanced HMoE classifier is upper bounded. If one is willing to pay in terms of the bound's simplicity, it can be tightened further.

Chapter 6

Fully data dependent bounds

The risk bound given in Theorem 3.11, with the Rademacher complexity used as a measure of the class complexity, served as the starting point of our investigation. While this Theorem holds for any class, the focus of our discussion was the MoE and HMoE classifiers. The major problem was to upper bound the Rademacher complexity of these classes.
Upon solving this problem, the Rademacher complexity bounds were used in a plug-in procedure to establish the risk bounds given in Corollary 4.11 and Theorem 5.7.

One deficiency of these bounds is their dependence on preset parameters, such as $W_{max}^m$ and $V_{max}^m$, $m=1,2,\ldots,M$. These are used to define the feasible set of the classifier parameters, consequently regularizing the class complexity. However, this presetting is problematic, as it is difficult to know in advance how to set these parameters. In this chapter a risk bound that requires no prior setting of parameters, except for the initialization of $M$, is established. Although only the MoE classifier is considered, the same technique can easily be harnessed to derive similar results for the HMoE as well.

To obtain a fully data dependent bound, we need to replace the preset parameters by some term that is exclusively data and algorithm dependent. This means that it depends only on the data $D_N$ and the classifier selected by the algorithm, or the parameters by which it is defined. No presetting is allowed except for $M$, which is referred to as the hyper parameter.

6.1 Preliminaries

The technique used in [29, 9] is adapted to derive a fully data dependent bound for MoE classifiers. The basic idea is given by the following steps.

1. Define a grid of possible values for $W_{max}^m$ and $V_{max}^m$, $m=1,2,\ldots,M$, for each of which Corollary 4.11 holds.
2. Assign each of these grid points a positive weight, such that the sum of all the weights is 1.
3. Use a variant of the union bound to establish a risk bound that holds for every possible choice of the classifier parameters.

We begin by introducing the so-called 'multiple testing Lemma' and demonstrate the way it is used to eliminate the dependence of a bound on a preset parameter (Subsection 6.1). The same technique is then used to generalize Corollary 4.11 into a fully data dependent bound (Subsection 6.2.2). The multiple testing Lemma bounds the probability that a collection of events occurs simultaneously, given the probability that each of them occurs individually. Before providing this Lemma, let us introduce the notion of a test.

Definition 6.1 Define a (mapping) test $\Gamma : (S,\delta)\mapsto\{\mathrm{TRUE},\mathrm{FALSE}\}$, where $S$ is some sample and δ is a confidence level. Denote the logical value of the test by $\Gamma(S,\delta)$.

The following Lemma is a variant of the union bound, taken from [14].

Lemma 6.2 (Multiple Testing Lemma) Assume we are given a set of tests $\{\Gamma_q\}_{q=1}^\infty$ with an associated discrete probability measure $\{p_q\}_{q=1}^\infty$. If for every $q=1,2,\ldots$ and any $\delta\in(0,1)$, $P\{\Gamma_q(D_N,\delta)\}\ge1-\delta$, then
$$P\left\{\Gamma_1(D_N,\delta p_1)\wedge\ldots\wedge\Gamma_q(D_N,\delta p_q)\wedge\ldots\right\} \ge 1-\delta.$$

A simple demonstration of the technique

Consider the class of soft classifiers defined over $\mathbb R$, $\mathcal F = \{f(\lambda,x) = \sin(\lambda x) : |\lambda|\le\Lambda\}$, for some given finite positive scalar Λ. According to (4.4), the Rademacher complexity of $\mathcal F$ can be upper bounded by
$$\hat R_N(\mathcal F) \le \frac{\Lambda\bar x}{\sqrt N},$$
where $\bar x = \sqrt{N^{-1}\sum_{n=1}^Nx_n^2}$. Then, according to (3.7), with probability at least $1-\delta$ every $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\Lambda) + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}},\qquad(6.1)$$
where
$$C(\Lambda) = \frac{2L_\phi\Lambda\bar x}{N^{1/2}} + \sqrt{\frac{2L_\phi\bar\phi\Lambda\bar x\ln\frac2\delta}{N^{3/2}}} + \frac{3\bar\phi\ln\frac2\delta}{2N}.$$
The presetting of Λ, which determines the feasible set for λ, is problematic, since we do not necessarily know the best initialization for it. So, we would like to consider every $\lambda\in\mathbb R$.
Unfortunately, by doing so we implicitly set Λ to infinity, rendering the risk bound (6.1) useless. Lemma 6.2 can be used to overcome this difficulty, leading to the following result.

Corollary 6.3 Set some positive constant $g_0$ and define the class of soft classifiers $\mathcal F = \{f(\lambda,x) = \sin(\lambda x),\ \lambda\in\mathbb R,\ |\lambda|>g_0\}$. For every function $f$ define $q(\lambda) = 2+\left\lfloor\log_2\frac{|\lambda|}{g_0}\right\rfloor$. Then, with probability at least $1-\delta$ over training sequences of length $N$, every function $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + C\left(g_02^{q(\lambda)-1}\right) + \bar\phi\sqrt{\frac{\ln\frac2\delta + 2\ln\left(\sqrt2\log_2\frac{4|\lambda|}{g_0}\right)}{2N}},$$
where
$$C(t) = \frac{2L_\phi t\bar x}{N^{1/2}} + \sqrt{\frac{2L_\phi\bar\phi t\bar x\left(\ln\frac2\delta + 2\ln\left(\sqrt2\log_2\frac{4|\lambda|}{g_0}\right)\right)}{N^{3/2}}} + \frac{3\bar\phi\left(\ln\frac2\delta + 2\ln\left(\sqrt2\log_2\frac{4|\lambda|}{g_0}\right)\right)}{2N}.$$

Remark. The condition $|\lambda|>g_0$ is stated here for convenience. In the result for the MoE classifier it will be removed.

Proof For all $q=1,2,\ldots$ set $\Lambda_q = 2^{q-1}g_0$ for some positive constant $g_0$ and define the class $\mathcal F_q = \{f(\lambda,x) = \sin(\lambda x) : |\lambda|\le\Lambda_q\}$. Also, define the test $\Gamma_q(D_N,\delta)$ to be TRUE if
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\Lambda_q) + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}}$$
for all $f\in\mathcal F_q$, and FALSE otherwise. Then, according to (6.1), $P\{\Gamma_q(D_N,\delta)\}\ge1-\delta$. Next, let $\{p_q\}_{q=1}^\infty$ be a set of positive weights such that $\sum_{q=1}^\infty p_q = 1$, where for concreteness we set $p_q = \frac{1}{q(q+1)}$. According to Lemma 6.2,
$$P\left\{\Gamma_q(D_N,\delta p_q)\ \text{for all}\ q=1,2,\ldots\right\} \ge 1-\delta,$$
which means that with probability at least $1-\delta$, every $f\in\mathcal F_q$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\Lambda_q) + \bar\phi\sqrt{\frac{\ln\frac{2}{\delta p_q}}{2N}}$$
for every $q=1,2,\ldots$.

Now, assume that we wish to have a risk bound for any classifier from the class $\tilde{\mathcal F} = \{f(\lambda,x) = \sin(\lambda x) : |\lambda| = \Omega\}$ for some positive scalar $\Omega>g_0$. Setting $t = 2+\lfloor\log_2(\Omega/g_0)\rfloor$, we have
$$\Lambda_t = g_02^{t-1} = g_02^{1+\lfloor\log_2(\Omega/g_0)\rfloor} > g_02^{\log_2(\Omega/g_0)} = \Omega,$$
and
$$\Lambda_t \le g_02^{1+\log_2(\Omega/g_0)} = 2\Omega,$$
which means that out of the set of classes $\{\mathcal F_q\}_{q=1}^\infty$, $\mathcal F_t$ is the smallest class for which $\tilde{\mathcal F}\subset\mathcal F_q$. Combining all of the above with the inequality
$$\ln\frac{1}{p_t} = \ln t(t+1) \le \ln 2t^2 = 2\ln\left(\sqrt2\,t\right) \le 2\ln\left(\sqrt2\left(2+\left\lfloor\log_2\frac{\Omega}{g_0}\right\rfloor\right)\right) \le 2\ln\left(\sqrt2\log_2\frac{4\Omega}{g_0}\right)$$
completes the proof of Corollary 6.3. ∎
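The stratification used in this proof is mechanical and easy to implement. The sketch below (illustrative Python; the value of $g_0$ is an assumption) computes, for a given λ, the index $t$ of the smallest grid class $\mathcal F_t$ containing it, and the resulting $\ln(1/p_t)$ penalty for the weights $p_t = 1/(t(t+1))$.

```python
import numpy as np

def grid_index(lam, g0=0.1):
    """Smallest t with |lam| <= Lambda_t = g0 * 2**(t-1), cf. q(lambda)."""
    return max(1, 2 + int(np.floor(np.log2(abs(lam) / g0))))

def penalty(lam, g0=0.1):
    """ln(1/p_t) for the weights p_t = 1/(t(t+1)) used in the proof."""
    t = grid_index(lam, g0)
    return np.log(t * (t + 1))

for lam in [0.5, 4.0, 1e3, 1e6]:
    t = grid_index(lam)
    print(lam, t, 0.1 * 2**(t - 1) >= lam, penalty(lam))
# The penalty grows only doubly-logarithmically in |lambda|: considering
# every real lambda costs O(ln ln |lambda|) in the bound.
```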
6.2 Fully data dependent risk bound for MoE classifiers

We now turn to the derivation of fully data dependent risk bounds for MoE classifiers, using a generalization of the technique demonstrated above for a simple case. Before providing the result, we begin with some definitions and preliminary results.

6.2.1 Some definitions, notations & preliminary results

Consider the MoE classifier given in Figure 4.1, whose parameter vector is given by
$$\theta = [w_1,w_2,\ldots,w_M,v_1,v_2,\ldots,v_M].\qquad(6.2)$$
To provide a formal definition of Θ, the feasible set of θ, define the following collection of matrices:
$$A_i(p,q) = \begin{cases}1 & \text{if}\ k(i-1)+1\le p,q\le ki\ \text{and}\ p=q,\\ 0 & \text{otherwise},\end{cases}$$
for $i=1,2,\ldots,2M$. According to Definitions 4.1 and 4.3, Θ is given by
$$\Theta(b) = \left\{\theta : \kappa_1(\theta)\le b_1,\ \kappa_2(\theta)\le b_2,\ \ldots,\ \kappa_{2M}(\theta)\le b_{2M};\ \theta\in\mathbb R^{2kM}\right\},\qquad(6.3)$$
where
$$\kappa_i(\theta) = \sqrt{\theta^\top A_i\theta},\qquad i=1,2,\ldots,2M,\qquad(6.4)$$
and $b$ is a vector consisting of the elements
$$b_i = \begin{cases}W_{max}^i & \text{if}\ i=1,2,\ldots,M,\\ V_{max}^{i-M} & \text{if}\ i=M+1,M+2,\ldots,2M.\end{cases}$$
Thus, Θ is a subset of $\mathbb R^{2kM}$, comprised of $2M$ balls which have no common elements (except for the origin), defined by $2M$ constraints.

Denote by $\mathbb Z_+^t$ the $t$-dimensional grid of positive integers and let $\mu : \mathbb Z_+^{2M}\mapsto\mathbb Z_+^1$ be some mapping such that
$$q = \mu(q_1,\ldots,q_{2M}) \le \prod_{i=1}^{2M}q_i,\qquad(6.5)$$
where $(q_1,\ldots,q_{2M})\in\mathbb Z_+^{2M}$ is some set of indices. To be concrete, it is easy to see that (6.5) is satisfied by $\mu(q_1,\ldots,q_{2M}) = 1+\prod_{i=1}^{2M}(q_i-1)$. Set
$$g = [g_1,g_2,\ldots,g_{2M}]\in\mathbb R_+^{2M}\qquad(6.6)$$
to be an initial gain, and define the vector
$$\beta_q = [\beta_{1,q_1},\beta_{2,q_2},\ldots,\beta_{2M,q_{2M}}],\qquad(6.7)$$
where, for all $i=1,2,\ldots,2M$,
$$\beta_{i,q_i} = g_is^{q_i}\qquad(6.8)$$
for some $s>1$. Interpreting $\beta_q$ as a special case of $b$, the feasible parameter set is now given by
$$\Theta(\beta_q) = \left\{\theta : \kappa_1(\theta)\le\beta_{1,q_1},\ \ldots,\ \kappa_{2M}(\theta)\le\beta_{2M,q_{2M}};\ \theta\in\mathbb R^{2kM}\right\}.$$
Next, let $\{p_q\}_{q=1}^\infty$ be the set of weights given in the previous section. For every parameter $\theta\in\Theta$ and each constraint index $i=1,2,\ldots,2M$, let $q_i(\theta)$ be the smallest index for which $\kappa_i(\theta)\le\beta_{i,q_i(\theta)}$; that is, $q_i(\theta) = \left\lceil\log_s\frac{\kappa_i(\theta)}{g_i}\right\rceil$ if $\kappa_i(\theta)>g_i$ and $q_i(\theta)=1$ otherwise. Notice that since $q_i(\theta)$ was chosen to be the smallest index for which this condition holds then, assuming $\kappa_i(\theta)>g_i$,
$$\kappa_i(\theta) \ge \beta_{i,q_i(\theta)-1} = \frac1s\beta_{i,q_i(\theta)}.$$
Setting $\tilde\kappa_i(\theta) = s\max(\kappa_i(\theta),g_i)$, we have
$$\tilde\kappa_i(\theta) \ge \beta_{i,q_i(\theta)} = g_is^{q_i(\theta)},\qquad(6.9)$$
which implies $\log_s\left(\tilde\kappa_i(\theta)/g_i\right)\ge q_i(\theta)$. Combining this with (6.5) results in
$$q(\theta) = \mu(q_1(\theta),\ldots,q_{2M}(\theta)) \le \prod_{i=1}^{2M}q_i(\theta) \le \prod_{i=1}^{2M}\log_s\frac{\tilde\kappa_i(\theta)}{g_i},$$
which, in turn, implies
$$\ln\frac{1}{p_{q(\theta)}} \le 2\ln\left(2^{\frac12}q(\theta)\right) \le 2\ln\left(2^{\frac12}\prod_{i=1}^{2M}\log_s\frac{\tilde\kappa_i(\theta)}{g_i}\right) = 2\sum_{i=1}^{2M}\ln\left(2^{\frac{1}{4M}}\log_s\frac{\tilde\kappa_i(\theta)}{g_i}\right).$$

6.2.2 Establishing a fully data dependent bound

We are now ready to derive a fully data dependent bound for MoE classifiers. This is achieved by replacing the vector $b$ in (6.3) with another vector, explicitly defined by the parameters of the selected classifier. Consider the risk bound given in Corollary 4.11 (the notation here is changed for convenience),
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\beta,\delta) + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}},$$
where
$$C(\beta,\delta) = \frac{1}{\sqrt N}R(\beta) + \sqrt{\frac{\bar\phi\ln\frac2\delta}{N^{3/2}}R(\beta)} + \frac{3\bar\phi\ln\frac2\delta}{2N},\qquad(6.10)$$
$$\beta = \left(W_{max}^1,W_{max}^2,\ldots,W_{max}^M,V_{max}^1,V_{max}^2,\ldots,V_{max}^M\right),\qquad(6.11)$$
$$R(\beta) = 2cL_\phi\left(\sum_{m=1}^{M_1}\beta(m)^2 + \sum_{m=1}^M\beta(m) + \sum_{m=1}^M\beta(m+M)\right),\qquad(6.12)$$
$$c = \max\left\{\frac{M_\mathcal HL_a}{\sqrt8},\ M_\mathcal HL_a\bar x,\ M_\mathcal AL_h\bar x\right\}\qquad(6.13)$$
(it should be noted that this bound can be tightened by using the constants given in Corollary 4.11 instead of the constant $c$), and $\beta(m)$ is the $m$-th element of β. For every $q=1,2,\ldots$ set
$$\mathcal F_q = \left\{f : \|w_m\|\le\beta_{m,q_m},\ \|v_m\|\le\beta_{m+M,q_{m+M}},\ m=1,2,\ldots,M\right\},$$
that is, $\beta_q$ is interpreted as the vector of radii of the feasible parameter sets by which $\mathcal F_q$ is defined. Define the test $\Gamma_q(D_N,\delta)$ to be TRUE if every $f\in\mathcal F_q$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\beta_q,\delta) + \bar\phi\sqrt{\frac{\ln\frac2\delta}{2N}},$$
and FALSE otherwise. According to Corollary 4.11, $P\{\Gamma_q(D_N,\delta)\}\ge1-\delta$ for all $q=1,2,\ldots$. Thus, by Lemma 6.2,
$$P\left\{\Gamma_q(D_N,\delta p_q)\ \text{for all}\ q=1,2,\ldots\right\} \ge 1-\delta,$$
which means that with probability at least $1-\delta$, every $f\in\mathcal F_q$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + C(\beta_q,\delta p_q) + \bar\phi\sqrt{\frac{\ln\frac{2}{\delta p_q}}{2N}}$$
for every $q=1,2,\ldots$. Finally, recall from (6.9) that for every $i=1,2,\ldots,2M$, $\beta_{i,q_i(\theta)}\le\tilde\kappa_i(\theta)$, which, combined with the monotonicity of $C(\beta_q,\delta p_q)$ with respect to each element of $\beta_q$, leads to the following result.
Theorem 6.4 Let $\mathcal F$ be the class of MoE classifiers with $M$ glim experts, and assume that gates $1,2,\ldots,M_1$ are local and $M_1+1,\ldots,M$ are half-space, where $0\le M_1\le M$. Denote by $\theta\in\mathbb R^{2kM}$ the parameter of the selected classifier and let definitions (6.4), (6.6)–(6.8) hold. For every $i=1,2,\ldots,2M$ set $\tilde\kappa_i(\theta) = s\max(\kappa_i(\theta),g_i)$ and define $\tilde\kappa(\theta) = [\tilde\kappa_1(\theta),\tilde\kappa_2(\theta),\ldots,\tilde\kappa_{2M}(\theta)]$. Then, with probability at least $1-\delta$ over training sequences of length $N$, every function $f\in\mathcal F$ satisfies
$$P_e(f) \le \hat E_\phi(f,D_N) + \frac{R(\tilde\kappa(\theta))}{\sqrt N} + \sqrt{\frac{\bar\phi R(\tilde\kappa(\theta))\,C(\tilde\kappa(\theta),\delta)}{N^{3/2}}} + \bar\phi\sqrt{\frac{C(\tilde\kappa(\theta),\delta)}{2N}} + \frac{3\bar\phi}{2N}C(\tilde\kappa(\theta),\delta),\qquad(6.14)$$
where
$$C(\tilde\kappa(\theta),\delta) = \ln\frac2\delta + 2\sum_{i=1}^{2M}\ln\left(2^{\frac{1}{4M}}\log_s\frac{\tilde\kappa_i(\theta)}{g_i}\right)$$
and $R(\tilde\kappa(\theta))$ is defined in (6.12).

Interpreting θ as the parameter of the selected classifier, a few observations may shed some light on this result.

1. Notice that the numbers $g_i$, $i=1,2,\ldots,2M$, are essentially scaling factors for each of the constraints. If, for some classifier $f$, we have $\kappa_i(\theta)\le g_i$ for all $i$, then the bound is independent of θ. To see this, observe that θ influences the bound only through $\tilde\kappa_i(\theta)$; if $\kappa_i(\theta)\le g_i$ for all $i$ then $\tilde\kappa_i(\theta) = sg_i$, independent of θ.

2. If, on the other hand, $\kappa_i(\theta)\ge g_i$ for all $i$, then $\tilde\kappa_i(\theta) = s\kappa_i(\theta)$, rendering the bound explicitly dependent on θ.

3. The only constraint on the parameter $s$ is to be larger than one. Thus, we obtain the tightest bound by using the $s$ that minimizes the right hand side of (6.14). If we set $s$ small, the logarithmic term increases, but the complexity term decreases, since a small $s$ enables us to refine the grid. On the other hand, by setting $s$ large, we reduce the logarithmic term, but the complexity term increases.

4. At first glance it seems that (4.6) is tighter than (6.14), but this is not always the case. On the one hand, (6.14) exhibits the weakness of losing tightness due to the logarithmic term; on the other hand, it enables us to use the parameters of the selected classifier rather than the boundaries of the feasible parameter set, potentially tightening the bound in cases where the selected classifier is characterized by small values of $\theta_i$.
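The fully data dependent penalty of Theorem 6.4 can be computed from the selected parameter vector alone. A minimal sketch follows (illustrative Python; the gains $g_i$, the grid ratio $s$ and all sizes are assumptions):

```python
import numpy as np

def kappa(theta, M, k):
    """kappa_i(theta): the norm of the i-th k-dimensional block of theta (6.4)."""
    return np.linalg.norm(theta.reshape(2 * M, k), axis=1)

def penalty_C(theta, M, k, delta, s=2.0, g=None):
    """C(kappa_tilde(theta), delta) of Theorem 6.4."""
    g = np.ones(2 * M) if g is None else g
    kt = s * np.maximum(kappa(theta, M, k), g)    # kappa_tilde(theta)
    logs = np.log(kt / g) / np.log(s)             # log_s(kappa_tilde_i / g_i)
    return np.log(2 / delta) + 2 * np.sum(np.log(2**(1 / (4 * M)) * logs))

M, k = 4, 10
theta = np.random.default_rng(0).normal(size=2 * M * k)
print(penalty_C(theta, M, k, delta=1e-4))
```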
6.3 Summary

This chapter concludes the theoretical part of our work, providing fully data dependent risk bounds for the class of MoE classifiers. The bound given in Theorem 6.4 holds for every MoE classifier, depending only on the training sequence $D_N$ and the parameters of the selected classifier. The most significant improvement of this bound with respect to previous results is that it can be used to select the classifier itself, rather than the class from which the classifier is drawn. This can be achieved by minimizing the bound over the parameter θ.

Chapter 7

Numerical experiments

The main focus of this work has been on the theoretical aspects of model selection for MoE & HMoE classifiers. Indeed, in Chapters 4 through 6 several risk bounds for these classifiers were established. To complete our work, some numerical experiments testing these bounds were carried out. This task proved to be non-trivial, due to the difficulty of setting the number of experts and the non-convex nature of the loss function, which will be discussed in the sequel. It should be noted that previous methods for learning the parameters of the MoE classifier were based on gradient methods for maximizing the likelihood or minimizing some loss function. Such approaches are prone to problems of local optima, which render standard gradient descent methods of limited use. This problem also occurs for the EM algorithm discussed in [21]. Notice that even if $\phi(yf(x))$ is convex with respect to $yf(x)$, this does not imply that it is convex with respect to the parameters of $f(x)$. The deterministic annealing EM algorithm proposed in [30] attempts to address the local maxima problem, using a modified posterior distribution parameterized by a temperature-like parameter. A modification of the EM algorithm, the split-and-merge EM algorithm proposed in [15], deals with certain types of local maxima involving an unbalanced usage of the experts over the feature space.

In this chapter two types of numerical experiments are discussed. In one type synthetic data sets are considered, and in the other real data sets, taken from [6], are used. Due to the different nature of these data sets, two protocols, one for each type of data set, were used for the numerical experiments, as described in Section 7.1. Section 7.2 provides a description of the algorithmic problems and the solutions used in the experiments. The results of the experiments are discussed in Sections 7.3 and 7.4.

7.1 Numerical experiments protocol

As mentioned above, two types of numerical experiments were carried out, one using a synthetic data set and the other based on real world data sets taken from [6]. Since the real world data sets are of a fixed size, typically a few hundred samples, a procedure to evaluate the classifier's performance had to be defined. This problem does not occur when synthetic data is used, because a test sequence of arbitrary length can be generated. Thus, we define two protocols for the numerical experiments, one for each type of data set.

7.1.1 Synthetic data

When one wishes to select a classifier from some class of classifiers, a fundamental problem that needs to be solved is the problem of model selection; that is, what is the class of mappings from which the classifier is to be drawn. In the context of our discussion we need to decide, given the training sequence, what is the most suitable number of experts, $M$, for the underlying p.d.f. If $M$ is set too small, underfitting occurs. On the other hand, setting this parameter too large may result in overfitting. In the experiments with synthetic data we consider several possible values for $M$, each of which is used to define a different class according to
$$\mathcal F_M = \left\{f : f(\theta,x) = \sum_{m=1}^Ma_m(w_m,w_{m,0},x)\tanh\left(v_m^\top x + v_{m,0}\right)\right\},$$
where $\theta = [w_1,w_{1,0},w_2,w_{2,0},\ldots,w_M,w_{M,0},v_1,v_{1,0},v_2,v_{2,0},\ldots,v_M,v_{M,0}]$. The gating functions $a_m(w_m,w_{m,0},x)$, $m=1,2,\ldots,M$, are defined by
$$a_m(w_m,w_{m,0},x) = \frac12\left(1+\tanh\left(w_m^\top x + w_{m,0}\right)\right)$$
for half-space gates and
$$a_m(w_m,w_{m,0},x) = \exp\left(-\frac{\|w_m-x\|^2}{w_{m,0}}\right)$$
for local gates. For every $M=1,2,\ldots,5$ a classifier is selected based on the training sequence. The true risk of these classifiers is then compared to the prediction of the risk bounds. It is desirable that the class over which the risk bound is minimized be the same class from which the classifier with the lowest risk is drawn. The experiments with synthetic data are carried out according to the following stages.

Initialization

1. A p.d.f. $P(X,Y)$ over $[-2,2]^2\times\{\pm1\}$ is defined, such that the Bayes classifier associated with it is given by a MoE classifier with $M=3$ experts and the associated Bayes risk is 18.33%.

2. Set
$$\phi(yf(x)) = \frac{1-\tanh\left(\gamma_1yf(x)-\gamma_2\right)}{1+\tanh(\gamma_2)},$$
where $f(x)$ is the classifier output at $x$ and $\gamma_1,\gamma_2\in\mathbb R_+$.
The experiments with synthetic data are carried out according to the following stages.

Initialization

1. A p.d.f. P(X, Y) over [−2, 2]² × {±1} is defined such that the Bayes classifier associated with it is a MoE classifier with M = 3 experts; the associated Bayes risk is 18.33%.

2. Set

$$\phi(yf(x)) = \frac{1 - \tanh(\gamma_1 y f(x) - \gamma_2)}{1 + \tanh(\gamma_2)},$$

where f(x) is the classifier output at x and γ₁, γ₂ ∈ ℝ₊. This function was chosen for two reasons: (i) it fulfills all the conditions given in Section 3.1; (ii) it can be brought as close as one wishes to the indicator function by setting γ₁, γ₂ appropriately.

3. Set the training sequence length N = 300 and the test sequence length N_T = 10⁶.

Experimental protocol

1. Draw a training sequence D_N ∼ P(X, Y)^N.

2. Draw a test sequence D_test ∼ P(X, Y)^{N_T}.

3. For every M = 1, 2, . . . , 5, carry out the following steps.
   • Find f*_M = argmin_{f∈F_M} Êφ(f, D_N).
   • Calculate the risk bound.
   • Estimate the true risk Pe(f*_M) by P̂e(f*_M, D_test).

4. Repeat steps 1-3 400 times, to characterize the statistical behavior of the experimental results.

7.1.2 Real world data

The significant difference between the synthetic data and the real world data sets is that while in the former case a test sequence of arbitrary length can be drawn, in the latter there is usually no test sequence at all, so the true risk cannot be computed and the behavior of the bound cannot be evaluated directly. A common practice in these situations is the R-fold cross-validation (R-CV, see Figure 7.1) protocol, in which the training sequence is divided R times; in each division a different subset of the training sequence serves as a validation sequence for a classifier selected using the remaining points. The risk is then estimated by the average of the empirical risks computed over the R validation sequences. However, among other weaknesses, the R-CV method uses the same points for learning the classifier and for evaluating its performance in different divisions of the training sequence, potentially increasing the danger of overfitting. Thus, a refined version of R-CV, which we refer to as two-stage cross-validation (TS-CV), is used. This method uses R-CV to select the classifier, but evaluates the performance of the selected classifier on data points different from those used to select it, thus potentially decreasing the danger of overfitting.

Before describing the TS-CV protocol, some definitions and notation are given. For every n = 1, 2, . . . , N, let D_N(n) be the n-th point in the training sequence. Set two positive integers N₁, N₂ and let R = ⌊N/N₁⌋ and S = ⌊N₁/N₂⌋. To simplify the discussion we assume that both N/N₁ and N₁/N₂ are integers; if this is not the case the protocol is unchanged, but the indices need to be handled more carefully. Next, for every r = 1, 2, . . . , R define the sets of indices

I^val_{1,r} = {(r − 1)N₁ + 1, (r − 1)N₁ + 2, . . . , rN₁},   I^train_{1,r} = {1, 2, . . . , N} \ I^val_{1,r},

and the associated sets of points D^val_{1,r} = {D_N(n)}_{n∈I^val_{1,r}}, D^train_{1,r} = {D_N(n)}_{n∈I^train_{1,r}}.

[Figure 7.1: R-fold cross-validation data partition. The original figure shows the training sequence of N points split, for each fold r = 1, . . . , R, into a validation block of N₁ points and a training block of N − N₁ points.]

Now, if one wishes to carry out R-CV, one uses each of the sets {D^train_{1,r}}_{r=1}^R to choose a classifier, estimates the performance of each classifier using the associated validation set from {D^val_{1,r}}_{r=1}^R, and averages the results.

The TS-CV protocol takes R-CV one step further (see Figure 7.2) by applying another cross-validation procedure to each element of {D^train_{1,r}}_{r=1}^R. For every r = 1, 2, . . . , R and every s = 1, 2, . . . , S define the sets of indices

I^val_{2,r,s} = {(r − 1)N₁ + (s − 1)N₂ + 1, (r − 1)N₁ + (s − 1)N₂ + 2, . . . , (r − 1)N₁ + sN₂},   I^train_{2,r,s} = I^train_{1,r} \ I^val_{2,r,s},

and the associated sets of points D^val_{2,r,s} = {D_N(n)}_{n∈I^val_{2,r,s}}, D^train_{2,r,s} = {D_N(n)}_{n∈I^train_{2,r,s}}.
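The index bookkeeping above is easy to get wrong, so a direct, 0-based transcription of the two-stage partition may be useful; the function name and return layout are hypothetical, and the second-stage validation blocks follow the index formula exactly as stated.

    def ts_cv_indices(N, N1, N2):
        """Index partition for two-stage cross-validation (TS-CV), 0-based.

        Assumes N1 divides N and N2 divides N1, so R = N // N1 and S = N1 // N2.
        Returns one entry per first-stage fold r: (I_train_1r, I_val_1r, second),
        where `second` lists the (I_train_2rs, I_val_2rs) pairs for s = 1..S.
        """
        R, S = N // N1, N1 // N2
        folds = []
        for r in range(R):
            val1 = set(range(r * N1, (r + 1) * N1))          # I^val_{1,r}
            train1 = set(range(N)) - val1                    # I^train_{1,r}
            second = []
            for s in range(S):
                lo = r * N1 + s * N2                         # (r-1)N1 + (s-1)N2, 1-based
                val2 = set(range(lo, lo + N2))               # I^val_{2,r,s}
                second.append((sorted(train1 - val2), sorted(val2)))
            folds.append((sorted(train1), sorted(val1), second))
        return folds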
[Figure 7.2: Two-stage cross-validation data partition. The original figure shows, for each first-stage fold r, an S-fold second-stage split of D^train_{1,r} into training blocks of N − N₁ − N₂ points and validation blocks of N₂ points.]

Given the definition of φ(yf(x)) from Subsection 7.1.1, the protocol for the numerical experiments with the real data sets is now described.

Experimental protocol

1. For every r = 1, 2, . . . , R carry out the following steps.
   • For every s = 1, 2, . . . , S and M = 1, 2, . . . , 5, find f*_{M,r,s} = argmin_{f∈F_M} Êφ(f, D^train_{2,r,s}).
   • Calculate φ_{M,r,s} = Êφ(f*_{M,r,s}, D^val_{2,r,s}) and φ̄_{M,r} = (1/S) Σ_{s=1}^{S} φ_{M,r,s}.
   • Set M_r = argmin_M φ̄_{M,r}.
   • Find f*_r = argmin_{f∈F_{M_r}} Êφ(f, D^train_{1,r}).
   • Calculate e_r = P̂φ(f*_r, D^val_{1,r}).

2. The mean over the set {e_r}_{r=1}^{R} is interpreted as the expected risk of the classifier that would be selected by minimizing Êφ(f, D_N), the empirical φ-risk computed over the entire training sequence.

7.2 Algorithms

We consider algorithms which attempt to minimize the empirical φ-risk with respect to the parameters of the classifier. Using the definition of φ from Subsection 7.1.1, the empirical φ-risk is given by

$$\hat{E}_\phi(f(\theta, \cdot), D_N) = \frac{1}{N} \sum_{n=1}^{N} \frac{1 - \tanh(\gamma_1 y_n f(\theta, x_n) - \gamma_2)}{1 + \tanh(\gamma_2)}.$$

It is easy to see that Êφ(f(θ, ·), D_N) is not only non-convex but not even unimodal with respect to θ. This renders simple gradient methods useless, as it is highly unlikely that an initial point from which the gradient search reaches the global minimum will be chosen. The two algorithms used in the experiments are described in Subsections 7.2.1 and 7.2.2.
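Both algorithms treat this objective as a black box. For concreteness, a minimal sketch of the objective itself follows; the names are hypothetical, moe_output is the classifier sketch of Subsection 7.1.1, and the default γ values are those used in Section 7.3.

    import numpy as np

    def phi_loss(margin, gamma1=2.0, gamma2=0.0):
        """Surrogate loss phi(m) = (1 - tanh(gamma1*m - gamma2)) / (1 + tanh(gamma2))."""
        return (1.0 - np.tanh(gamma1 * margin - gamma2)) / (1.0 + np.tanh(gamma2))

    def empirical_phi_risk(theta, X, y, gamma1=2.0, gamma2=0.0):
        """E^_phi(f(theta, .), D_N): average surrogate loss over the training sequence."""
        margins = np.array([yn * moe_output(theta, xn) for xn, yn in zip(X, y)])
        return float(np.mean(phi_loss(margins, gamma1, gamma2)))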
7.2.1 Cross-Entropy

One possible solution to this problem is given by the Cross-Entropy (CE) algorithm ([32]; see [8] for a recent review). This algorithm, like genetic algorithms, is based on the idea of randomly drawing samples from the parameter space and improving the way these samples are drawn from generation to generation. We note that the CE algorithm is applicable to finite dimensional problems.

To describe the algorithm used in our simulations exactly, we first introduce the following notation. Denote by Θ the feasible set of values for θ, and let ψΘ(θ; ξ) be a p.d.f. over Θ parameterized by ξ. For example, if ψΘ(θ; ξ) is a Gaussian p.d.f. then ξ = [σ, µ], where σ is the standard deviation and µ the mean of the random variable θ drawn according to ψΘ(θ; ξ). To find a point that is likely to lie in the neighborhood of the global minimum, we carry out Algorithm 1 (see box). Upon convergence, we use gradient methods, with θ̂ₛᴮ (see the box for its definition) as the initial point, to gain further accuracy in estimating the global minimum. We denote by θ̂ᴮ the solution of this gradient minimization procedure and declare it the final solution.

The Cross-Entropy Algorithm.
Input: ψΘ(·) and φ(·).
Output: θ̂ₛᴮ, a point in the neighborhood of the global minimum of Êφ(f(θ, ·), D_N).
Algorithm:

1. Pick some ξ̂₀ (a good selection turns ψΘ(θ; ξ̂₀) into a uniform distribution over Θ). Set the iteration counter s = 1, two positive integers d, T, and three parameters 0 < ρ₁, ρ₂, ρ₃ < 1.

2. Generate an ensemble {θ₁, θ₂, . . . , θ_L}, where L = 2kMT (k is the dimension of the feature space and M the number of experts, so the dimension of Θ is 2kM), drawn i.i.d. according to ψΘ(θ; ξ̂_{s−1}).

3. Calculate Êφ(f, D_N) for each member of the ensemble. The Elite Sample (ES) comprises the ⌊ρ₁L⌋ parameters that attain the lowest empirical φ-risk. Denote the parameters associated with the worst and the best Êφ(f, D_N) in the ES by θ̂ₛᵂ and θ̂ₛᴮ, respectively.

4. If for some s ≥ d

max_{s−d≤i,j≤s} (θ̂ᵢᵂ − θ̂ⱼᵂ) ≤ ρ₂,

stop and declare θ̂ₛᴮ the solution. Otherwise, solve the maximum likelihood estimation problem, based on the ES, to estimate the parameters of ψΘ (notice that this is not an MLE for the original empirical risk minimization problem). Denoting the solution by ξ̂_ML, compute ξ̂_{s+1} = (1 − ρ₃)ξ̂ₛ + ρ₃ξ̂_ML. Set s = s + 1 and return to step 2.

Algorithm 1: The Cross-Entropy algorithm for estimating the location of the global minimum of the empirical φ-risk.
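A compact sketch may clarify the CE iteration of Algorithm 1. It substitutes a Gaussian sampling family for the Beta family used in Section 7.3 and a fixed iteration budget for the stopping rule of step 4; all names are hypothetical, and `risk` stands for the empirical φ-risk of the previous section.

    import numpy as np

    def cross_entropy_search(risk, dim, L=None, n_iter=200, rho1=0.03, rho3=0.7, rng=None):
        """Cross-Entropy search for a minimizer of the empirical phi-risk.

        Gaussian family psi_Theta(theta; xi) with xi = (mu, sigma); the elite
        sample drives the maximum likelihood update, smoothed as in step 4:
        xi_{s+1} = (1 - rho3) * xi_s + rho3 * xi_ML.
        """
        rng = np.random.default_rng() if rng is None else rng
        L = 10 * dim if L is None else L           # ensemble size (L = 2kMT in the box)
        n_elite = max(1, int(rho1 * L))
        mu, sigma = np.zeros(dim), 2.0 * np.ones(dim)
        best_theta, best_val = mu.copy(), float("inf")
        for _ in range(n_iter):
            ensemble = rng.normal(mu, sigma, size=(L, dim))      # step 2: draw i.i.d.
            values = np.array([risk(theta) for theta in ensemble])
            order = np.argsort(values)
            elite = ensemble[order[:n_elite]]                    # step 3: elite sample
            if values[order[0]] < best_val:
                best_val, best_theta = values[order[0]], ensemble[order[0]]
            mu = (1.0 - rho3) * mu + rho3 * elite.mean(axis=0)   # step 4: smoothed MLE
            sigma = (1.0 - rho3) * sigma + rho3 * elite.std(axis=0)
        return best_theta   # refined afterwards by a gradient search, yielding theta^B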
7.2.2 Greedy search of local minima

The Greedy Search of Local Minima (GSLM) algorithm, described in Algorithm 2 (see box), is in the spirit of genetic algorithms, and combines stochastic search with gradient search in an attempt to mitigate the problem of non-convexity. Generally speaking, the idea at the basis of this algorithm is given by the following five steps (a compact sketch follows the box below).

1. Randomly draw a set of N₁ points from Θ.

2. Select the N₂ points (N₂ ≪ N₁; for simplicity assume N₁/N₂ is an integer) associated with the lowest empirical φ-risk.

3. Use gradient search to converge from each point to a nearby local minimum.

4. Around each local minimum randomly draw N₁/N₂ − 1 points, so that there is again a set of N₁ points.

5. If a stopping criterion is fulfilled, stop; otherwise, go to 2.

The GSLM algorithm has two significant advantages over the CE algorithm. First, while the latter searches around a single point at each iteration, the former performs a tree-like search, exploring several directions, each given by a member of the ES. Second, GSLM combines gradient methods with stochastic search, accelerating the convergence.

The Greedy Search of Local Minima Algorithm.
Input: a loss function φ(·) and a uniform p.d.f. ψΘ(θ, β, ξ) = cI[‖θ − β‖ ≤ ξ], where c is a normalization factor.
Output: θ̂ₛᴮ, a point in the neighborhood of the global minimum of Êφ(f(θ, ·), D_N).
Algorithm:

1. Pick some β and ξ₁ (0 and 2 in our simulations), set the iteration counter s = 1, three positive integers d, T, L, where typically L ≪ T, and three parameters 0 < ρ₁, ρ₂, ρ₃ < 1.

2. Generate an ensemble {θ₁, θ₂, . . . , θ_L}, where L = 2kMT, drawn i.i.d. according to ψΘ(θ, β, ξₛ). Denote this ensemble the 'sample'.

3. For each point in the sample carry out a gradient search to find the nearest local minimum. Denote this set of local minima by {θ̂₁, θ̂₂, . . . , θ̂_L}, where L is the number of elements in the sample, and calculate the empirical φ-risk of each.

4. The Elite Sample (ES) comprises the ⌊ρ₁L⌋ parameters associated with the lowest local minima. Denote by ES(n), n = 1, 2, . . . , ⌊ρ₁L⌋, the n-th element of the ES, and denote the parameters associated with the worst and the best empirical φ-risk in the ES by θ̂ₛᵂ and θ̂ₛᴮ, respectively.

5. If for some s ≥ d

max_{s−d≤i,j≤s} (θ̂ᵢᵂ − θ̂ⱼᵂ) ≤ ρ₂,

stop and declare θ̂ₛᴮ the solution.

6. Set s = s + 1 and ξₛ = ρ₃ξₛ₋₁. For every n = 1, 2, . . . , ⌊ρ₁L⌋, set β = ES(n) and generate L − 1 samples according to ψΘ(θ, β, ξₛ). Denote the ensemble of the ⌊ρ₁L⌋L points, comprising the ES and the newly drawn points, the sample. Go to 3.

Algorithm 2: The Greedy Search of Local Minima algorithm for estimating the location of the global minimum of the empirical φ-risk.
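The following is a minimal sketch of the GSLM loop, assuming SciPy's BFGS routine for the inner gradient search and a fixed round budget in place of the stopping rule of step 5; names and defaults are hypothetical, with `risk` again the empirical φ-risk.

    import numpy as np
    from scipy.optimize import minimize

    def gslm_search(risk, dim, n_points=100, rho1=0.03, rho3=0.9, xi1=2.0,
                    n_rounds=20, rng=None):
        """Greedy Search of Local Minima for the empirical phi-risk.

        Every sampled point is pushed to a nearby local minimum by a gradient
        search; the elite minima seed the next, tighter, uniform-ball sampling
        round, mirroring steps 2-6 of Algorithm 2.
        """
        rng = np.random.default_rng() if rng is None else rng
        n_elite = max(1, int(rho1 * n_points))
        xi = xi1
        sample = rng.uniform(-xi, xi, size=(n_points, dim))      # ball around beta = 0
        best = None
        for _ in range(n_rounds):
            minima = np.array([minimize(risk, p, method="BFGS").x for p in sample])
            values = np.array([risk(m) for m in minima])
            elite = minima[np.argsort(values)[:n_elite]]         # step 4: elite sample
            best = elite[0]                                      # current theta^B
            xi *= rho3                                           # step 6: shrink radius
            per_center = max(1, n_points // n_elite - 1)
            clouds = [c + rng.uniform(-xi, xi, size=(per_center, dim)) for c in elite]
            sample = np.vstack([elite] + clouds)                 # restored sample
        return best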
7.3 Synthetic data set results

We set γ₁ = 2, γ₂ = 0 in the definition of φ(·). For every M = 1, 2, . . . , 5, denote f*_M = argmin_{f∈F_M} Êφ(f, D_N). The reported risk Pe(f*_M) is given by P̂e(f*_M, D_test), the empirical risk computed over a test sequence of 10⁶ elements (D_test) drawn from the same source as the training sequence. Figure 7.4 describes the results from 400 different training sequences (the bars describe the standard deviation). The graph labelled the 'complexity term' in Figure 7.4 is the sum of all the terms on the right hand side of Corollary 4.11 with δ = 10⁻³, excluding Êφ(f*_M, D_N). As for the CE parameters, we set ψΘ(·) to be the Beta distribution with ξ̂₀ = [1, 1] (which corresponds to the uniform distribution), ρ₁ = 0.03, ρ₂ = 0.001, ρ₃ = 0.7 and T = 200.

A few observations regarding the results, summarized in Figure 7.4, are in order.

1. As one might expect, Êφ(f*_M, D_N) is monotonically decreasing in M.

2. As expected, the (data dependent) complexity term is monotonically, and approximately linearly, increasing in M.

3. Pe(f*_M) is closest to the Bayes error (18.33%) for M = 3, which is the number of experts of the Bayes solution. We witness underfitting for M = 1, 2 and overfitting for M = 4, 5, exactly as predicted by the bound.

[Figure 7.3: A visual demonstration of the synthetic data set over [−2, 2]². The blue line describes the Bayes classifier and the red line the selected classifier for M = 3.]

[Figure 7.4: Synthetic data set results; four panels (Êφ(f, D_N), the data dependent bound, Pe(f) and the complexity term) plotted against M = 1, . . . , 5. A comparison between the data dependent bound of Corollary 4.11 and the true error, computed over 400 Monte Carlo iterations with different training sequences. The solid line describes the mean and the bars indicate the standard deviation over all training sequences. The two panels on the left demonstrate the applicability of the data dependent bound to the problem of model selection when one wishes to set the optimal number of experts: the optimal predicted value of M is 3, which agrees with the number of experts used to generate the data. Observe the underfitting that occurs for M = 1, 2 and the overfitting for M = 4, 5.]

7.4 Real world data set results

We applied Algorithm 2 to two real world data sets, bupa (liver disorders, N = 345, k = 7) and pima (Pima Indians diabetes diagnoses, N = 768, k = 9), both taken from [6]. The GSLM parameters were initialized as follows: γ₁ = 2, γ₂ = 0, ρ₁ = 0.03, ρ₂ = 0.001, ρ₃ = 0.9 and T = 200; ξ₁, the initial radius of the parameter feasible set, was set to 2. The results are compared with those obtained with Radial Basis Function Support Vector Machines (RBF-SVM) and Linear Support Vector Machines (Linear-SVM). The hyper-parameters of all classifiers were selected, and the performance estimated, according to the cross-validation protocol described in Subsection 7.1.2. The results are summarized in Table 7.1.

Data set   MoE (2 experts)   Linear-SVM      RBF-SVM
bupa       0.289 ± 0.050     0.320 ± 0.084   0.317 ± 0.048
pima       0.241 ± 0.056     0.244 ± 0.050   0.255 ± 0.067

Table 7.1: Real world data set results, computed using 7-fold cross-validation for bupa and 10-fold cross-validation for pima. For each fold, the hyper-parameters of the classifiers were selected using cross-validation within the training sequence.

7.5 Summary

Two types of numerical experiments were carried out. In the first, the CE algorithm was used to estimate the parameters of a MoE classifier on synthetic data. The results indicate that the CE algorithm performs well and that, at least for the p.d.f. tested in the experiment, the risk bound successfully predicts which class is the most appropriate. The second experiment considered real world data sets; its results indicate that the MoE classifier performs at least as well as SVM classifiers. However, there are two major problems in estimating the parameters of a MoE classifier. First, owing to the convexity of the SVM optimization problem, SVM training is faster than the algorithms used in our experiments; this problem can be partially overcome by a more careful implementation of the algorithms. The second problem is more disturbing. The dimension of the problem (the number of parameters) is 2kM, and therefore grows rapidly with the number of features. In practice, only problems with k < 15 could be solved within a reasonable time. Since many real problems contain hundreds or even thousands of features, some feature selection algorithm must be applied before learning the parameters of the MoE classifier.

Appendix A

About the Lipschitz property of functions

The Lipschitz property plays a key role in this work, as it is the primary measure of smoothness for the various functions involved. It is therefore appropriate to study sufficient conditions for a function to be Lipschitz; in the sequel we show that the existence of a bounded first derivative is such a condition.

Consider a function f : ℝᵏ → ℝ for some integer k ≥ 1. By the mean value theorem (the first order Taylor expansion), for every a, b ∈ ℝᵏ there exists some ξ on the line segment between a and b for which

f(b) = f(a) + ∇ₓf(ξ)(b − a),

or, equivalently, f(b) − f(a) = ∇ₓf(ξ)(b − a). Combined with the Cauchy-Schwarz inequality, this implies

|f(b) − f(a)| = |∇ₓf(ξ)(b − a)| ≤ ‖∇ₓf(ξ)‖ ‖b − a‖.

Setting L_f = sup_{x∈ℝᵏ} ‖∇ₓf(x)‖ (assumed finite) implies that for every a, b ∈ ℝᵏ

|f(b) − f(a)| ≤ L_f ‖b − a‖,

which means that f is Lipschitz with constant L_f. Notice that for k = 1 this result reads

|f(b) − f(a)| ≤ L_f |b − a|,   where L_f = sup_{x∈ℝ} |f′(x)|.
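As a concrete instance of this sufficient condition, consider the function tanh, which appears in both the gates and the experts of the classes F_M used throughout Chapter 7:

$$\sup_{x \in \mathbb{R}} \left| \frac{d}{dx}\tanh(x) \right| = \sup_{x \in \mathbb{R}} \left( 1 - \tanh^2(x) \right) = 1, \qquad\text{hence}\qquad |\tanh(b) - \tanh(a)| \leq |b - a| \quad \text{for all } a, b \in \mathbb{R},$$

so tanh is Lipschitz with constant L = 1.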
Bibliography

[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] T. Bailey and A.K. Jain. A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Syst., Man, Cybern., SMC-8, 1978.

[3] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003.

[4] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

[5] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[6] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. The Annals of Probability, 31:1583-1614, 2003.

[8] P.T. de Boer, D.P. Kroese, S. Mannor, and R.Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 2004. To appear.

[9] I. Desyatnikov and R. Meir. Data-dependent bounds for multi-category classification based on convex losses. In Proc. of the Sixteenth Annual Conference on Computational Learning Theory, volume 2777 of LNAI. Springer, 2003.

[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, New York, 1996.

[11] R.O. Duda, P.E. Hart, and D. Stork. Pattern Classification. Wiley, New York, second edition, 2001.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, Berlin, 2001.

[13] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, second edition, 1999.

[14] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Boston, 2002.

[15] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12:2109-2128, 2000.

[16] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13-30, 1963.

[17] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.

[18] R.A. Jacobs, F. Peng, and M.A. Tanner. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. J. Amer. Stat. Assoc., 91:953-960, 1996.

[19] R.A. Jacobs, F. Peng, and M.A. Tanner. A Bayesian approach to model selection in hierarchical mixtures-of-experts classifiers. IEEE Trans. Neural Networks, 10:231-241, 1997.

[20] W. Jiang. The VC dimension for mixtures of binary classifiers. Neural Computation, 12:1293-1301, 2000.

[21] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.

[22] M.I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8:1409-1431, 1996.

[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Press, New York, 1991.

[24] G. Lugosi. Concentration-of-measure inequalities. Lecture notes, summer school at ANU, 2004. http://www.econ.upf.es/~lugosi/surveys.html.

[25] S. Mannor, R. Meir, and T. Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713-741, 2003.

[26] P. McCullagh and J.A. Nelder. Generalized Linear Models. CRC Press, second edition, 1989.

[27] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, 1989.

[28] R. Meir, R. El-Yaniv, and S. Ben-David. Localized boosting. In N. Cesa-Bianchi and S. Goldman, editors, Proc. Thirteenth Annual Conference on Computational Learning Theory, pages 190-199. Morgan Kaufmann, 2000.

[29] R. Meir and T. Zhang. Generalization bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.

[30] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2), 1998.

[31] R. Royall. A Class of Nonparametric Estimators of a Smooth Regression Function. PhD thesis, Stanford University, Stanford, CA, 1966.

[32] R.Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127-190, September 1999.

[33] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[34] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

[35] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

[36] V.N. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.

[37] S.R. Waterhouse and A.J. Robinson. Nonlinear prediction of acoustic vectors using hierarchical mixtures of experts. In Advances in Neural Information Processing Systems, pages 835-842. MIT Press, Cambridge, MA, 1995.

[38] S.R. Waterhouse and A.J. Robinson. Constructive algorithms for hierarchical mixtures of experts. In Advances in Neural Information Processing Systems, pages 584-590. MIT Press, Cambridge, MA, 1996.

[39] A. Zeevi, R. Meir, and V. Maiorov. Error bounds for functional approximation and estimation using mixtures of experts. IEEE Trans. Information Theory, 44(3):1010-1025, 1998.

[40] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1), 2004.