Theoretical Foundations of Selective Prediction

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Yair Wiener

Submitted to the Senate of the Technion — Israel Institute of Technology
Av 5773, Haifa, July 2013

The research thesis was done under the supervision of Prof. Ran El-Yaniv in the Department of Computer Science.

Acknowledgements: First and foremost, I want to thank my advisor, Prof. Ran El-Yaniv. It has been a true pleasure being his Ph.D. student. Ran taught me how to do research, how to address complicated problems, how to look for answers, but most importantly, how to ask the right questions. Ran did not spare any effort in supporting me during my long journey. His endless patience, contagious enthusiasm, and continuous encouragement make him the ideal advisor. Even during tough times, when I almost lost hope in our direction, Ran never lost faith. For that I am indebted to him. It has been an honor and a privilege to learn from him. I thank my friends at RADVISION for believing in me and giving me the time to focus on research. I also thank Intel and the European Union's Seventh Framework Programme for their financial support. Last, but not least, I deeply thank my family, my wife Hila for her support, love, and care, and my daughter Ori for giving me a new perspective on learning. This thesis is dedicated to them. The generous financial support of the Technion is gratefully acknowledged.

The work described in the thesis is based on the following publications:

1. Ran El-Yaniv and Yair Wiener. "On the foundations of noise-free selective classification." The Journal of Machine Learning Research 11 (2010): 1605-1641. [31] (Impact factor: 3.42, 5-year impact factor: 4.284, rank in Machine Learning: 1)
2. Ran El-Yaniv and Yair Wiener. "Agnostic selective classification." Advances in Neural Information Processing Systems. 2011. [32]
3. Ran El-Yaniv and Yair Wiener. "Active learning via perfect selective classification." The Journal of Machine Learning Research 13 (2012): 255-279. [33] (Impact factor: 3.42, 5-year impact factor: 4.284, rank in Machine Learning: 1)
4. Yair Wiener and Ran El-Yaniv. "Pointwise Tracking the Optimal Regression Function." Advances in Neural Information Processing Systems 25. 2012. [89]

Contents

Abstract
Abbreviations and Notations
1 Introduction
  1.1 To abstain, or not to abstain, that is the question
  1.2 Notations and definitions
  1.3 Contextual background
    1.3.1 Selective classification
    1.3.2 Selective regression
    1.3.3 Active learning
  1.4 Main contributions
    1.4.1 Low error selective strategy (LESS)
    1.4.2 Lazy implementation
    1.4.3 The disbelief principle
    1.4.4 Characterizing set complexity
    1.4.5 Coverage rates
    1.4.6 New speedup results for active learning
    1.4.7 Efficient implementation and empirical results
2 Characterizing Set Complexity
  2.1 Motivation
  2.2 The characterizing set complexity
  2.3 Linear classifiers in R
  2.4 Intervals in R
  2.5 Axis-aligned rectangles in R^d
  2.6 Linear classifiers in R^d
3 Realizable Selective Classification
  3.1 Consistent Selective Strategy (CSS)
  3.2 Risk and coverage bounds
    3.2.1 Finite hypothesis classes
    3.2.2 Infinite hypothesis spaces
  3.3 Distribution-dependent coverage bound
  3.4 Implementation
4 From Selective Classification to Active Learning
  4.1 Definitions
  4.2 Background
  4.3 From coverage bound to label complexity bound
  4.4 A new technique for upper bounding the label complexity
    4.4.1 Linear separators in R
    4.4.2 Linear separators in R^d under mixture of Gaussians
  4.5 Lower bound on label complexity
  4.6 Relation to existing label complexity measures
    4.6.1 Teaching dimension
    4.6.2 Disagreement coefficient
  4.7 Agnostic active learning label complexity bounds
    4.7.1 Label complexity bound for A2
    4.7.2 Label complexity bound for RobustCALδ
5 Agnostic Selective Classification
  5.1 Definitions
  5.2 Low Error Selective Strategy (LESS)
  5.3 Risk and coverage bounds
  5.4 Disbelief principle
  5.5 Implementation
  5.6 Empirical results
6 Selective Regression
  6.1 Definitions
  6.2 ǫ-Low Error Selective Strategy (ǫ-LESS)
  6.3 Risk and coverage bounds
  6.4 Rejection via constrained ERM
  6.5 Selective linear regression
  6.6 Empirical results
7 Future Directions
  7.1 Beyond pointwise competitiveness
  7.2 Beyond linear classifiers
  7.3 Beyond mixtures of Gaussians
  7.4 On computational complexity
  7.5 Other applications for selective prediction
A Proofs
B Technical Lemmas
C Extended Minimax Model
D Realizable Selective Classification: RC trade-off analysis
E Related Work

List of Figures

1.1 Excess risk, estimation error, and approximation error for different hypothesis classes with increasing complexity.
1.2 Excess risk, estimation error, and approximation error for a single hypothesis class with different coverage values.
1.3 An example of CSS with the class of linear classifiers in R^2. The white region is the region rejected by the strategy.
1.4 Lazy implementation. (a) The test point (depicted by the empty circle) is rejected as both constrained ERMs have zero empirical error; (b) the test point is accepted as one constrained ERM has an empirical error higher than zero.
1.5 Confidence level map using (a) the disbelief index; (b) the distance from the decision boundary.
1.6 The version space compression set is depicted in green circles, and the maximal agreement region is depicted in gray.
1.7 Maximal agreement region for different samples.
1.8 RC curve of LESS (red line) and rejection based on distance from the decision boundary (dashed green line).
2.1 An example of a maximal agreement set for linear classifiers in R^2.
2.2 The region accepted by CSS for the case of linear classifiers in R^2.
2.3 Version space compression set.
2.4 Maximal agreement region.
2.5 Maximal agreement set for linear classifiers in R: (a) a sample with both positive and negative examples; (b) a sample with negative examples only; (c) a sample with positive examples only.
2.6 Maximal agreement set for intervals in R: (a) a non-overlapping interval and two rays; (b) a non-overlapping interval and a ray; (c) an interval only.
2.7 Maximal agreement set for intervals in R when Sm includes negative points only.
2.8 Agreement region of VS_{F,R_i}.
3.1 A worst-case distribution for linear classifiers: points are drawn uniformly at random on the two arcs and labeled by a linear classifier that passes between these arcs. The probability volume of the maximal agreement set is zero.
5.1 The set of low true error (a) resides within a ball around f∗ (b).
5.2 The set of low empirical error (a) resides within the set of low true error (b).
5.3 Constrained ERM.
5.4 Linear classifier. Confidence height map using (a) disbelief index; (b) distance from decision boundary.
5.5 SVM with polynomial kernel. Confidence height map using (a) disbelief index; (b) distance from decision boundary.
5.6 RC curve of our technique (depicted in red) compared to rejection based on distance from decision boundary (depicted in dashed green line). The RC curve in the right figure zooms into the lower coverage regions of the left curve.
5.7 RC curves for SVM with linear kernel. Our method in solid red, and rejection based on distance from decision boundary in dashed green. Horizontal axis (c) represents coverage.
5.8 SVM with linear kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.
5.9 RC curves for SVM with RBF kernel. Our method in solid red and rejection based on distance from decision boundary in dashed green.
5.10 SVM with RBF kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.
6.1 Absolute difference between the selective regressor (f, g) and the optimal regressor f∗. Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
6.2 Test error of selective regressor (f, g) trained on the dataset 'years' with sample sizes of 30, 50, 100, 150, and 200 samples. Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
6.3 Test error of selective regressor (f, g). Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
D.1 The RC plane and RC trade-off.

Abstract

In selective prediction, a predictor is allowed to abstain on part of the domain. The objective is to reduce prediction error by compromising coverage. This research is concerned with the study of the theoretical foundations of selective prediction and its applications for selective classification, selective regression, and active learning.

We present a new family of selective classification strategies called LESS (low error selective strategies). The labels predicted by LESS for the accepted domain are guaranteed to be identical to the labels predicted by the best hypothesis in the class, chosen in hindsight. Therefore, the estimation error of the predictor chosen by LESS is zero. Extending the idea to regression, we also present a strategy called ǫ-LESS whose predictions are ǫ-close to the values predicted by the best (in hindsight) regressor in the class.

We study the coverage rates of LESS (and ǫ-LESS) for classification and regression. Relying on a novel complexity measure termed characterizing set complexity, we derive both data-dependent and distribution-dependent guarantees on the coverage of LESS for both realizable and agnostic classification settings.
These results are interesting because they allow for training selective predictors with substantial coverage whose estimation error is essentially zero.

Moreover, we prove an equivalence between selective (realizable) classification and stream-based active learning, with respect to learning rates. One of the main consequences of this equivalence is an entirely novel technique to bound the label complexity in active learning for numerous interesting hypothesis classes and distributions. In particular, using classical results from probabilistic geometry, we prove exponential label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high-dimensional mixture of Gaussians.

While direct implementations of the LESS (and ǫ-LESS) strategies appear to be intractable, we show how to reduce LESS to a procedure involving a few calculations of constrained empirical risk minimization (ERM). Using this reduction, we develop a new principle for rejection, termed the disbelief principle, and show an efficient implementation of ǫ-LESS for the case of linear least squares regression (LLSR).

Abbreviations and Notations

SVM - Support Vector Machine
VC - Vapnik-Chervonenkis
ERM - Empirical Risk Minimizer
LESS - Low error selective strategy
CSS - Consistent selective strategy
RC - Risk-Coverage
RBF - Radial Basis Function
KWIK - Knows What It Knows
LLSR - Linear least squares regression
f - Hypothesis
F - Hypothesis class
g - Selection function
R(f) - True risk of hypothesis f
f∗ - True risk minimizer
f̂ - Empirical risk minimizer
R̂(f) - Empirical risk of hypothesis f
θ - Disagreement coefficient
Hn - Order-n characterizing set
γ(F, n) - Characterizing set complexity
θǫ - ǫ-disagreement coefficient
ℓ - Loss function
δ - Confidence parameter
n̂ - Version space compression set size
R - Reals

Chapter 1
Introduction

"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so."
Mark Twain, 1835-1910

In selective prediction, a predictor (either classifier or regressor) is allowed to abstain from prediction on part of the domain. The objective is to improve the accuracy of predictions by compromising coverage, a trade-off we refer to as the risk-coverage (RC) trade-off. An important goal is to construct algorithms that can optimally or near optimally control it.

Selective prediction is strongly tied to the idea of self-aware learning, whereby, given evidence (labeled and unlabeled examples), our goal is to quantify the limits of our knowledge. By this we refer to the ability of a learner to objectively quantify its own self-confidence for each prediction it makes. This problem is at the heart of many machine learning problems, including selective classification [31, 32], selective regression [89], active learning (selective sampling [5, 24]), and reinforcement learning (KWIK [67], exploration vs. exploitation [6]).

This research is concerned with the study of the theoretical foundations of self-aware learning and its manifestation in selective classification, selective regression, and active learning.
1.1 To abstain, or not to abstain, that is the question

Effective selective prediction is compelling in applications where one is not concerned with, or can afford, partial coverage of the domain, and/or in cases where extremely low risk is a must but is not achievable in standard prediction frameworks. The promise is that by allowing rejection we will be able to reduce the risk to acceptable levels.

Before we start our discussion on the value of rejection, it will be revealing to take a closer look at the structure of risk in classical classification. One of the goals in statistical machine learning is to learn an optimal hypothesis f from a finite training set, assumed to be sampled i.i.d. from some unknown underlying distribution P. The classifier that minimizes the probability of misclassification over P is called the Bayes classifier. A classical result in statistical learning theory is that the excess risk of a classifier f (the difference between the risk of f and the risk of the Bayes classifier) can be expressed as a sum of estimation error and approximation error [30],

    R(f) − R(f^Bayes)  =  { R(f) − R(f∗) }  +  { R(f∗) − R(f^Bayes) },

where the left-hand side is the excess risk, the first term on the right is the estimation error, and the second term is the approximation error. Here R(f) is the error of hypothesis f and f∗ is the best hypothesis in a given hypothesis class F, namely f∗ = argmin_{f∈F} R(f).

The approximation error component depends only on the hypothesis class F from which we choose the classifier. The richer the hypothesis class, the smaller the approximation error. In the extreme case, where the target hypothesis belongs to F, the approximation error vanishes altogether. The estimation error is the error of our classifier compared to the error of the best hypothesis in F (denoted by f∗). Generally, the richer the hypothesis class is, the larger the estimation error will be. Consequently, there is a trade-off between estimation error and approximation error. Figure 1.1 depicts this trade-off.

Figure 1.1: Excess risk, estimation error, and approximation error for different hypothesis classes with increasing complexity.

Figure 1.2: Excess risk, estimation error, and approximation error for a single hypothesis class with different coverage values.

When the reject option is introduced, we should depict the error in a three-dimensional space (complexity, coverage, and error) rather than in the two-dimensional space (complexity and error) shown in Figure 1.1. The reject option makes it possible for us to overcome the estimation error vs. approximation error trade-off and reduce the overall risk in three ways:

• Reduce the estimation error by compromising coverage: choose a subset of the domain over which we can get arbitrarily close to the best hypothesis in the class given a fixed sample.

• Reduce the approximation error by compromising coverage: decompose the problem into "easy" and "difficult" parts, and choose a subset of the domain that can be better approximated by the hypothesis class F. For example, in the case of learning XOR with linear classifiers, the approximation error is 0.5 for all hypotheses. However, if we reject half of the domain, we can easily reduce the approximation error (over the accepted domain) to zero (see the sketch following this list).

• Reduce the Bayes error by compromising coverage: choose a subset of the domain with minimal noise.
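To make the XOR illustration in the second bullet concrete, the following minimal sketch (not part of the thesis; the data generator, the noise level, and the accepted half-plane are arbitrary illustrative choices) fits a linear classifier by least squares on the full XOR-like domain and then only on one accepted half, showing that the best achievable error over the accepted region drops to nearly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR-like data: four clusters, label = sign(x1 * x2).
n = 2000
centers = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
idx = rng.integers(0, 4, size=n)
X = centers[idx] + 0.2 * rng.standard_normal((n, 2))
y = np.sign(X[:, 0] * X[:, 1])

def fit_linear(X, y):
    """Least-squares fit of a linear classifier sign(w . [x, 1])."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def error(w, X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.mean(np.sign(A @ w) != y)

# Best linear fit over the full domain: error is close to 0.5 (approximation error).
w_full = fit_linear(X, y)
print("error on full domain:", error(w_full, X, y))

# Reject half of the domain (keep x1 > 0): the accepted part is linearly separable,
# so the best hypothesis in the class now has (near) zero error on the accepted region.
accept = X[:, 0] > 0
w_acc = fit_linear(X[accept], y[accept])
print("error on accepted region:", error(w_acc, X[accept], y[accept]))
```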
In this thesis we develop principles and methods that allow for (almost) complete elimination of the estimation error. We propose a new optimization objective termed pointwise competitiveness. Given a hypothesis class and a finite training sample, our goal is to match the performance of the best-in-hindsight hypothesis in the class over the accepted domain while rejecting as few examples as possible. In other words, our goal is to achieve zero estimation error on the accepted domain.

Figure 1.2 depicts the excess risk, estimation error, and approximation error of our proposed method for a given hypothesis class. By compromising coverage, we are able to reduce the estimation error to zero. When we have done so, we have achieved a pointwise competitive selective classifier. The approximation error is depicted as a straight dashed line because its dependency on the coverage is not guaranteed. Some empirical results provide strong evidence that a byproduct of our method is a decrease in the approximation error as well. In the special case where the target hypothesis belongs to the hypothesis class F (the realizable case), the approximation error is zero anyway and we can achieve perfect learning (learning a hypothesis with zero generalization error on the accepted domain).

1.2 Notations and definitions

Let X be some feature space, for example, d-dimensional vectors in R^d, and let Y be some output space. In standard classification, the goal is to learn a classifier f : X → Y, using a finite training sample of m labeled examples, Sm = {(x_i, y_i)}_{i=1}^{m}, assumed to be sampled i.i.d. from some unknown underlying distribution P(X, Y) over X × Y. We assume that the classifier is to be selected from a hypothesis class F. Let ℓ : Y × Y → [0, 1] be a bounded loss function.

Definition 1.1 (true risk) The true risk of hypothesis f ∈ F with respect to distribution P(X, Y) is

    R_P(f) ≜ E_{(X,Y)∼P} {ℓ(f(X), Y)}.

Definition 1.2 (empirical risk) The empirical risk of hypothesis f ∈ F with respect to sample Sm = {(x_i, y_i)}_{i=1}^{m} is

    R̂_{Sm}(f) ≜ (1/m) Σ_{i=1}^{m} ℓ(f(x_i), y_i).

When the distribution P or the sample Sm is obvious from the context, we omit it from the notation and use R(f) and R̂(f), respectively. Given hypothesis class F, we define the true risk minimizer f∗ ∈ F as

    f∗ ≜ argmin_{f∈F} R(f),

and the empirical risk minimizer (ERM) as

    f̂ ≜ argmin_{f∈F} R̂(f).

The associated excess loss class [12] is defined as

    H = H(F, ℓ) ≜ { ℓ(f(x), y) − ℓ(f∗(x), y) : f ∈ F }.

In selective prediction (either selective classification or selective regression) the learner should output a selective predictor, defined to be a pair (f, g), with f ∈ F being a standard predictor, and g : X → [0, 1] a selection function. When applying the selective predictor to a sample x ∈ X, g(x) is the probability that the selective predictor will not abstain from prediction, and f(x) is the predicted value in that case:

    (f, g)(x) ≜ { f(x),   with probability g(x);
                  reject, with probability 1 − g(x).        (1.1)

Thus, in its most general form, the selective predictor is randomized. Whenever the selection function is a zero-one rule, g : X → {0, 1}, we say that the selective predictor is deterministic (in the remainder of this work, with the exception of Appendix D, we use only deterministic selective predictors). Note that "standard learning" (i.e., no rejection is allowed) is the special case of selective prediction where g(x) selects all points (i.e., g(x) ≡ 1).
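The abstraction in Eq. (1.1) can be expressed directly in code. The following sketch is purely illustrative and is not an implementation from this thesis; the toy predictor f and selection function g below are arbitrary placeholders.

```python
import numpy as np

class SelectivePredictor:
    """A selective predictor (f, g) as in Eq. (1.1): predict f(x) with
    probability g(x), otherwise abstain ("reject")."""

    def __init__(self, f, g, seed=0):
        self.f = f          # standard predictor, f: X -> Y
        self.g = g          # selection function, g: X -> [0, 1]
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        if self.rng.random() < self.g(x):
            return self.f(x)
        return "reject"

# A deterministic special case: g is a zero-one rule; taking g(x) == 1 everywhere
# recovers standard (full-coverage) prediction.
f = lambda x: int(np.sign(x[0]))                 # toy linear classifier
g = lambda x: 1.0 if abs(x[0]) > 0.5 else 0.0    # abstain near the decision boundary
predictor = SelectivePredictor(f, g)
print(predictor(np.array([1.2, 0.0])), predictor(np.array([0.1, 0.0])))
```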
The two main characteristics of a selective predictor are its coverage and its risk (or "true error").

Definition 1.3 (coverage) The coverage of a selective predictor (f, g) is the mean value of the selection function g(X) taken over the underlying distribution P,

    Φ(f, g) ≜ E_{(X,Y)∼P} {g(X)}.

Definition 1.4 (selective true risk) For a bounded loss function ℓ : Y × Y → [0, 1], we define the risk of a selective predictor (f, g) as the average loss on the accepted samples,

    R(f, g) ≜ E_{(X,Y)∼P} {ℓ(f(X), Y) · g(X)} / Φ(f, g).

This risk definition clearly reduces to the standard definition of risk (Definition 1.1) if g(x) ≡ 1. Note that (at the outset) both the coverage and the risk are unknown quantities because they are defined in terms of the unknown underlying distribution P.

Definition 1.5 (pointwise competitiveness) Let f∗ ∈ F be the true risk minimizer with respect to the unknown distribution P(X, Y). A selective classifier (f, g) is pointwise competitive if for any x ∈ X for which g(x) > 0, f(x) = f∗(x).

In selective regression, where the output space Y ⊆ R is continuous, we can generalize the definition as follows.

Definition 1.6 (ǫ-pointwise competitiveness) Let f∗ ∈ F be the true risk minimizer with respect to the unknown distribution P(X, Y). A selective regressor (f, g) is ǫ-pointwise competitive if for any x ∈ X for which g(x) > 0, |f(x) − f∗(x)| ≤ ǫ.

Pointwise competitiveness is a considerably stronger requirement than low risk, which refers only to average performance. Note that in the realizable case the target hypothesis belongs to the hypothesis class F, and therefore R(f∗) = 0. In this case pointwise competitiveness reduces to the requirement that the selective predictor never errs on its region of activity. We call this case perfect classification.

Definition 1.7 (perfect classification) A selective classifier (f, g) is a perfect classifier with respect to distribution P(X, Y) if for any x ∈ X for which g(x) > 0,

    E_{Y∼P(Y|x)} {ℓ(f(x), Y)} = 0.

We conclude with a standard definition of the version space.

Definition 1.8 (version space [71]) Given a hypothesis class F and a training sample Sm, the version space VS_{F,Sm} is the set of all hypotheses in F that classify Sm correctly.

1.3 Contextual background

The term self-aware learning was first coined in the context of the KWIK (knows what it knows) framework for reinforcement learning [67]. We use this term in a broader context to include all learning problems where the learner needs to quantify its own self-confidence for its predictions. Throughout the years, selective classification, selective regression, and active learning have been studied by different research communities. Although these problems seem different at first glance, they are tightly related, and results and techniques developed for one can be used for the others. In this section we review some known results in selective prediction and active learning, used as contextual background for our work. This is by no means an exhaustive survey of the main results in those areas.

1.3.1 Selective classification

Selective classification, perhaps better known as 'classification with a reject option', was first studied in the fifties, in the context of character recognition [22].
Among the earliest works are papers by Chow [22, 23], focusing on Bayesian solutions for the case where the underlying distributions are fully known. Over the years, selective classification has continued to draw attention and numerous papers have been published. Effective selective classification is attractive for obvious reasons in applications where one is not concerned with, or can afford partial coverage of the domain, and/or in cases where extremely low risk is a must but is not achievable in standard classification frameworks. Classification problems in medical diagnosis and in bioinformatics often fall into this category [70, 45]. Of the many research publications on selective classification, the vast majority have been concerned with implementing a reject option within specific learning schemes, by endowing a learning scheme (e.g., neural networks, SVMs, HMMs) with a reject mechanism. Most of the reject mechanisms were based on heuristic “ambiguity” or (lack of) “confidence” principle: “when confused or when in doubt, refuse to classify.” For example, in the case of support vector machines, the natural (and widely used) indicator for confidence is distance from the decision boundary. While there is plenty of empirical evidence for the potential effectiveness of selective classification in reducing the risk, the work done so far has not facilitated rigorous understanding of rejection and quantification of its benefit. In particular, there are no formal discussions on the necessary and sufficient conditions for pointwise competitiveness and the achievable coverage rates. The very few theoretical works that considered selective classification (see Appendix E) do provide some risk or coverage bounds for specific schemes (e.g., ensemble methods) or learning principles (e.g., ERMs). However, none guarantee (even asymptotic) convergence of the risk of the selective classifier to the risk of the best hypothesis in the class, not to mention pointwise convergence of the selective classifier itself to the best hypothesis in the class. Hence, elimination of the estimation error is not guaranteed by any known result. In the KWIK framework a similar concept to pointwise competitiveness was indeed studied, and coverage rates where analyzed [67]. However, it was limited to the realizable case and concerned an adversarial setting where both the target hypothesis and the training data are chosen by an adversary. While all positive results for the KWIK adversarial setting apply to our probabilistic setting (where 11 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 training examples are sampled i.i.d.), this adversarial setting preclude non trivial coverage for all the interesting hypothesis classes discussed in this work. This deficiency comes as no surprise because the KWIK adversarial setting is much more challenging than the probabilistic model we assume here. Finally, conformal prediction is a very general model for “hedged” prediction that has been developed by Vovk et al. [86]. Though conformal prediction does not directly deal with partial coverage, technically it is possible to construct a selective predictor from any conformal predictor. However conformal prediction guarantees do not apply to our setting. In Appendix E we discuss the differences between our approach and conformal prediction. 1.3.2 Selective regression The reject option has been mentioned only rarely and anecdotally in the context of regression. 
In [60] a boosting algorithm for regression was proposed and a few reject mechanisms were considered, applied both on the aggregate decision and/or on the underlying weak regressors. A straightforward threshold-based reject mechanism (rejecting low response values) was applied in [7] on top of support vector regression. This mechanism was found to lower false positive rates. As in selective classification, the current literature lacks any formal discussions of RC trade-off optimality. In particular, no discussions on the sufficient conditions for ǫ-pointwise competitiveness and the achievable coverage rates are available. A similar concept to the ǫ-pointwise competitiveness was introduced in the KWIK online regression framework [67, 78, 66]. However, as in the case of classification, the analysis concerned an adversarial setting and assumed near-realizability (the best hypothesis is assumed to be pointwise close to the Bayes classifier). Furthermore, in cases where the approximation error is not precisely zero, the estimation error of the selective regressor in KWIK cannot be arbitrarily reduced as can be accomplished here. 1.3.3 Active learning Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, possibly leading to much faster learning. In recent years it has been gaining considerable recognition as a vital technique for efficiently implementing inductive learning in many industrial applications where there is an abundance of unlabeled data, and/or in cases where labeling 12 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 costs are high. In online selective sampling [5, 24], also referred to as stream-based active learning, the learner is given an error objective ǫ and then sequentially receives unlabeled examples. At each step, after observing an unlabeled example x, the learner decides whether or not to request the label of x. The learner should terminate the learning process and output a binary classifier whose true error is guaranteed to be at most ǫ with high probability. The penalty incurred by the learner is the number of label requests made and this number is called the label complexity. A label complexity bound of O(d log(d/ǫ)) for actively learning an ǫ-good classifier from a concept class with VC-dimension d yields exponential speedup in terms of 1/ǫ. This is in contrast to standard (passive) supervised learning where the sample complexity is typically O(d/ǫ). The theoretical study of (stream-based, realizable) active learning is paved with very interesting ideas. Initially, any known significant advantage of active learning over passive learning was limited to a few cases. Perhaps the most favorable result was an exponential label complexity speedup for learning homogeneous (crossing through the origin) linear classifiers where the (linearly separable) data is uniformly distributed over the unit sphere. This result was obtained by various authors using various analysis techniques, for a number of strategies that can all be viewed in hindsight as approximations or variations of the “CAL algorithm” of Cohn et al. [24]. Among these studies, the earlier theoretical results [76, 36, 37, 34, 42] considered Bayesian settings and studied the speedup obtained by the Query by Committeee (QBC) algorithm. The more recent results provided PAC style analysis [29, 46, 48]. 
Lack of positive results for other non-toy problems, as well as various additional negative results, led some researchers to believe that active learning is not necessarily advantageous in general. Among the striking negative results is Dasgupta’s negative example for actively learning general (non-homogeneous) linear classifiers (even in two dimensions) under the uniform distribution over the sphere [27]. A number of recent innovative papers proposed alternative models for active learning. Balcan et al. [10] introduced a subtle modification of the traditional label complexity definition, which opened up avenues for new positive results. According to their new definition of “non-verifiable” label complexity, the active learner is not required to know when to stop the learning process with a guaranteed ǫgood classifier. Their main result, under this definition, is that active learning is 13 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 asymptotically better than passive learning in the sense that only o(1/ǫ) labels are required for actively learning an ǫ-good classifier from a concept class that has a finite VC-dimension. Another of their accomplishments is an exponential label complexity speedup for (non-verifiable) active learning of non-homogeneous linear classifiers under the uniform distribution over the unit sphere. Using Hanneke’s characterization of active learning in terms of the “disagreement coefficient” [46], Friedman [38] recently extended the Balcan et al. results and proved that a target-dependent exponential speedup can be asymptotically achieved for a wide range of “smooth” learning problems (in particular, the hypothesis class, the instance space and the distribution should all be expressible by smooth functions). He proved that under such smoothness conditions, for any target hypothesis f ∗ , Hanneke’s disagreement coefficient is bounded above in terms of a constant c(f ∗ ) that depends on the unknown target hypothesis f ∗ (and is independent of δ and ǫ). The resulting label complexity is O (c(f ∗ ) d polylog(d/ǫ)) [49]. This is a very general result but the target-dependent constant in this bound is only guaranteed to be finite. Despite impressive progress in the case of target-dependent bounds for active learning, the current state of affairs in the target-independent bounds for active learning arena leaves much to be desired. To date, the most advanced result in this model, which was essentially established by Seung et al. and Freund et al. more than twenty years ago [76, 36, 37], is still a target-independent exponential speedup bound for homogenous linear classifiers under the uniform distribution over the sphere [50]. 1.4 Main contributions 1.4.1 Low error selective strategy (LESS) We first show that pointwise competitiveness is achievable by a family of learning strategies termed low error selective strategies (LESS). Given a training sample Sm and a hypothesis class F, LESS outputs a pointwise competitive selective predictor (f, g) with high probability. The idea behind the strategy is simple: using standard concentration inequalities, one can show that the training error of the true risk minimizer, f ∗ , cannot be “too far” from the training error of the empirical risk minimizer, fˆ. Therefore, with high probability f ∗ belongs to the class of low empirical error hypotheses. Now all we need to do is abstain from prediction whenever the 14 Technion - Computer Science Department - Ph.D. 
Thesis PHD-2013-12 - 2013 vote among all hypotheses in this class is not unanimous. LESS is implemented in a slightly different manner for realizable selective classification, agnostic selective classification, and selective regression. For the realizable setting, the low empirical error class includes all hypotheses with exactly zero empirical error (all hypotheses in the version space). Figure 1.3 depicts this strategy for the case of linear classifiers in R2 . In this example the training set includes three positive samples (depicted by red crosses) and four negative samples (depicted by blue circles). The four dashed lines represent four different hypotheses that are consistent with the training sample. Therefore, all these hypotheses belong to the version space. The darker region represents the agreement region with respect to the version space, that is, the agreed upon domain region for all hypotheses in the version space. This region will be the one accepted by LESS in this example. This strategy is also termed consistent selective strategy (CSS) and is described in greater detail in Section 3.1. Figure 1.3: An example of CSS with the class of linear classifiers in R2 . The white region is the region rejected by the strategy. For the agnostic case, the low empirical error class includes all hypotheses with empirical error less than a threshold that depends only on the VC dimension of F, the size of the training set m, and the confidence parameter δ.2 This strategy is described in Section 5.2. For selective regression, as for agnostic classification, the low empirical error class includes all hypotheses with empirical error less than a threshold that depends on the VC dimension of F, m, and δ. However, in this case, the requirement is not to reach a unanimous vote for prediction, but rather 2 f ∗ belongs to this class with probability of at least 1 − δ. 15 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 an ǫ-unanimous vote. The strategy will abstain from prediction if there is at least one hypothesis in the class that predicts a value that is more than ǫ away from the prediction of the empirical risk minimizer (ERM). This strategy is also termed ǫ-LESS and is described in Section 6.2. 1.4.2 Lazy implementation At the outset, efficient implementation of LESS seems to be out of reach as we are required to track the supremum of the empirical error over a possibly infinite hypothesis subset, which in general might be intractable. To overcome this computational difficulty, we propose a reduction of this problem to a problem of calculating (two) constrained ERMs. For any given test point x, we calculate the ERM over the training sample Sm with a constraint on the label of x (one positive label constraint and one negative). We show that thresholding the difference in empirical error between these two constrained ERMs is equivalent to tracking the supremum over the entire (infinite) hypothesis subset. For the realizable setting, CSS rejects a test point x if and only if both constrained ERMs have exactly zero empirical error. Figure 1.4 depicts the constrained ERMs (in dashed lines) for two different cases. As before, the red crosses and blue circles represent the positive and negative samples in the training set. The test point Figure 1.4: Lazy implementation. (a) The test point (depicted by the empty circle) is rejected as both constrained ERMs have zero empirical error; (b) the test point is accepted as one constrained ERM has an empirical error higher than zero. 
is represented as an empty black circle. In (a) there are two linear classifiers (depicted in dashed lines) that separate the training sample and classify the new test 16 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 point as positive or negative. In this case the test point will be rejected by CSS. In (b) there is no linear classifier that can both separate the training set and classify the test point as negative. Therefore the risk of the ERM with a negative constraint on the test point is larger than zero and the test point will be accepted. A detailed discussion and proof of this reduction can be found in Section 3.4. For the agnostic case, LESS rejects a test point x if and only if the difference between the empirical error of both constrained ERMs is less than a threshold that depends only on the VC dimension of F, the size of the training set m, and the confidence parameter δ. A detailed discussion and proof of this reduction can be found in Section 5.4. For selective regression, the above reduction is a bit more involved but follows the same idea. We are now required to calculate two constrained ERMs and one unconstrained ERM for each test point x. As mentioned previously, our goal in selective regression is to find an ǫ-pointwise competitive selective hypothesis. The constraint is that the regressor should have the value fˆ(x) ± ǫ at the point x, where fˆ is the unconstrained ERM. The rejection rule is obtained by comparing the ratio between the constrained and unconstrained ERMs with a threshold that depends on the VC dimension of F, the size of the training set m, and the confidence parameter δ. This rejection rule is equivalent to the decisions of ǫ-LESS, provided that the hypothesis class F is convex. The intuition behind this reduction is beyond the scope of the introduction and is discussed in detail in Section 6.4. 1.4.3 The disbelief principle In our lazy implementation of LESS for selective classification, we calculate two constrained ERMs. We term the absolute difference between the errors of these constrained ERM predictors over test point x the disbelief index of x. The higher the disbelief index is, the more surprised we would be at being wrong on test point x, and our self-confidence in our predictions is higher. Therefore the disbelief index can be used as an indicator for classification confidence instead of the more traditional methods that are based on distance from the decision boundary. Figure 1.5 depicts an example of the confidence levels induced by both the disbelief index and the standard distance from the decision boundary. In both cases, the training set Sm was randomly sampled from a mixture of two identical normal distributions (centered at different locations) and the hypothesis class is the class of linear classifiers. Further discussion on the disbelief principle can be found in 17 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Section 5.4. (a) (b) Figure 1.5: Confidence level map using (a) the disbelief index; (b) the distance from the decision boundary. 1.4.4 Characterizing set complexity Clearly, the LESS strategy is extremely aggressive (or defensive): it rejects any point for which the vote is not unanimous (or ǫ-unanimous in the case of regression) among (infinitely) many hypotheses. 
While this worst-case approach is precisely the attribute that allows for pointwise competitiveness (or ǫ-pointwise competitiveness), it may raise the concern that pointwise competitiveness can only be achieved in a trivial sense, where LESS rejects the entire domain. Are we throwing out the baby with the bath water? The dependency of the coverage on the training sample size is called the coverage rate.

We present and analyze a new hypothesis class complexity measure, termed characterizing set complexity. This measure plays a central role in the analysis of coverage rates in selective classification (Chapters 3 and 5), as well as in the analysis of label complexity in online active learning (Chapter 4).

The coverage rate of a selective classifier depends on the complexity of the hypothesis class from which the selection function g(x) is chosen. In the realizable case, the selection function g(x) chosen by LESS (CSS) is exactly the maximal agreement set of the version space: the set of all points for which all hypotheses in the version space vote unanimously. But from which hypothesis class is this selection function chosen? To answer this question, we first use a data-dependent parameter to parameterize the selection hypothesis class. The parameter we use is called the version space compression set size, defined to be the size of the smallest subset of the training set that induces the same version space as the entire training set. Figure 1.6 illustrates the version space compression set and the maximal agreement region. The maximal agreement region with respect to the version space is the region agreed upon by all hypotheses in the version space.

Figure 1.6: The version space compression set is depicted in green circles, and the maximal agreement region is depicted in gray.

Figure 1.7 illustrates a few random samples, all with a version space compression set size of 4. Each sample induces a different version space and hence a different maximal agreement region. We can now consider the class of all maximal agreement regions for hypothesis class F and any training sample of size n. The order-n characterizing set complexity γ(F, n) is defined as the VC dimension of this special class.

Figure 1.7: Maximal agreement region for different samples.

Relying on classical results in combinatorial geometry and statistical learning, we derive an upper bound on the order-n characterizing set complexity of a few hypothesis classes. Our results are summarized in Table 1.1.

Table 1.1: Order-n characterizing set complexity of different hypothesis classes.

  Hypothesis class                           γ(F, n)                  Reference
  Linear separators in R                     2                        Section 2.3
  Intervals in R                             O(max(n, 4))             Section 2.4
  Linear separators in R^d                   O(d^3 n^{d/2} log n)     Lemma 2.7
  Balanced axis-aligned rectangles in R^d    O(dn log n)              Theorem 2.2

Although this geometric complexity measure is defined for the realizable case, it is instrumental in the analysis of coverage rates in all cases, including the agnostic setting, as well as in the analysis of label complexity in active learning.
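The version space compression set is easiest to see in one dimension. The following sketch is an illustration only (not code from the thesis), using one-directional threshold classifiers on the line as a simplified stand-in for the linear-classifiers-in-R case of Table 1.1: the compression set consists of at most the largest negative example and the smallest positive example, and the induced maximal agreement region is everything outside the interval between them.

```python
import numpy as np

def compression_set_threshold(xs, ys):
    """Version space compression set for 1-D threshold classifiers
    h_t(x) = +1 iff x >= t, on a realizable sample (ys in {-1, +1}).

    The version space is {t : max(negatives) < t <= min(positives)}, so the
    smallest subsample inducing the same version space consists of the largest
    negative example and the smallest positive example (when both exist)."""
    neg, pos = xs[ys < 0], xs[ys > 0]
    cset = []
    if len(neg):
        cset.append((neg.max(), -1))
    if len(pos):
        cset.append((pos.min(), +1))
    return cset   # len(cset) is the version space compression set size (at most 2)

def agreement_boundaries(cset):
    """All hypotheses in the induced version space agree outside the open
    interval between the two boundary examples: unanimously -1 up to `lo`
    and unanimously +1 from `hi` on; the interval (lo, hi) is rejected."""
    neg = [x for x, y in cset if y < 0]
    pos = [x for x, y in cset if y > 0]
    lo = neg[0] if neg else -np.inf
    hi = pos[0] if pos else np.inf
    return lo, hi

xs = np.array([-2.0, -1.3, -0.4, 0.7, 1.1, 2.5])
ys = np.array([-1, -1, -1, +1, +1, +1])
cset = compression_set_threshold(xs, ys)
lo, hi = agreement_boundaries(cset)
print(f"compression set: {cset}; agree on (-inf, {lo}] and [{hi}, inf)")
```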
1.4.5 Coverage rates

As mentioned before, LESS is a very aggressive strategy. It is therefore quite surprising that it does not reject the vast majority of the domain in most cases. Indeed, our study of the coverage rates of LESS (ǫ-LESS) for both realizable and agnostic selective classification (regression) shows that the resulting coverage is substantial.

Specifically, one of the most important characteristics of a coverage bound is its dependency on the sample size m (the coverage rate). We say that a coverage bound that converges to one at rate O(polylog(m)/m) is fast. Can we hope for fast coverage rates when applying the LESS strategy?

We start our analysis with the simple case of realizable selective classification, where we assume that the target hypothesis belongs to the hypothesis class F. We show in Theorem 3.1 that CSS is optimal in the sense that no other strategy that achieves perfect classification can have larger coverage. For finite hypothesis spaces, CSS achieves, with probability of at least 1 − δ (over the choice of the training set), a coverage of

    Φ(f, g) ≥ 1 − O( (1/m) ( |F| + ln(1/δ) ) ).

This distribution-free coverage guarantee has a fast coverage rate and is proven to be nearly tight for CSS. Therefore, it is the best possible bound for any selective learner.

Unfortunately, for infinite hypothesis spaces the situation is not as favorable: it is impossible to provide coverage guarantees for perfect classification that hold for every data distribution. Specifically, for linear classifiers, we contrive a bad distribution for which any selective learner ensuring zero risk will be forced to reject the entire domain, thus failing to guarantee more than zero coverage. Fortunately, however, this observation does not preclude non-trivial perfect classification in less adverse situations. Using our new complexity measure, it is possible to obtain both data-dependent and distribution-dependent coverage bounds. For infinite hypothesis spaces, LESS (CSS) achieves, with probability of at least 1 − δ, a coverage of

    Φ(f, g) ≥ 1 − O( (1/m) ( γ(F, n̂) ln(m/γ(F, n̂)) + ln(m/δ) ) ),        (1.2)

where n̂ is the version space compression set size, and γ(F, n) is the order-n characterizing set complexity of F. Plugging in our results for γ(F, n) (Table 1.1), we immediately derive data-dependent coverage bounds for different hypothesis classes. The bounds are data-dependent because they depend on the version space compression set size n̂ (an empirical measure).

In order to study the rate of the bounds, we first need to investigate the dependency of n̂ on the sample size. Utilizing a classical result in geometric probability theory on the average number of maximal random vectors, we show in Lemma 3.14 that if the underlying distribution is any (unknown) finite mixture of arbitrary multi-dimensional Gaussians in R^d, and the hypothesis class is the class of linear classifiers, then the compression set size of the version space obtained using m labeled examples satisfies, with probability of at least 1 − δ,

    n̂ = O( (log m)^d / δ ).

This bound immediately yields a coverage guarantee for perfect classification of linear classifiers with a fast coverage rate, as stated in Corollary 3.15. This is a powerful result, providing a strong indication of the effectiveness of perfect learning with guaranteed coverage in a variety of applications.

To analyze the agnostic setting we introduce an additional known complexity measure, called the disagreement coefficient (see Definition 4.10). This measure was first introduced by Hanneke in the study of label complexity in active learning [46]. Let F be a hypothesis class with VC-dimension d and disagreement coefficient θ, and let H be the associated excess loss class. If H is a (β, B)-Bernstein class w.r.t. P(X, Y) (see Definition 5.4), then LESS achieves, with probability of at least 1 − δ, a coverage of

    Φ(f, g) ≥ 1 − O( Bθ ( (d/m) ln(m/d) + (1/m) ln(1/δ) )^{β/2} ).

Bernstein classes arise in many natural situations; see discussions in [62, 11, 13]. For example, if the conditional probability P(Y|X) is bounded away from 1/2 (as is the case when the data is generated by any unknown deterministic hypothesis with limited noise), or if it satisfies Tsybakov's noise conditions, then the excess loss class is a Bernstein class [11, 80].

Using a reduction from selective sampling to agnostic selective classification, we show that the disagreement coefficient can be bounded using the characterizing set complexity. Specifically, if γ(F, n̂) = O(polylog(m)) for some hypothesis class and marginal distribution P(X), then LESS achieves, with probability of at least 1 − δ, a coverage of

    Φ(f, g) ≥ 1 − B · O( ( (polylog(m)/m) · log(1/δ) )^{β/2} )

for the same hypothesis class and marginal distribution (the conditional distribution P(Y|X) can be different, though). Since β ≤ 1, we get that the coverage rate is at most O(polylog(m)/√m), which is, not surprisingly, slower than in the realizable case.
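For a finite hypothesis class, the LESS rule analyzed above can be stated in a few lines. The sketch below is illustrative only: the `slack` parameter is a stand-in for the threshold that the thesis derives from the VC dimension, the sample size m, and the confidence parameter δ, and the 1-D threshold class is an arbitrary toy choice.

```python
import numpy as np

def less_predict(x, hypotheses, X_train, y_train, slack):
    """Sketch of the LESS rule for a finite hypothesis class: keep every
    hypothesis whose empirical error is within `slack` of the empirical risk
    minimizer, and predict on x only if all of them vote unanimously."""
    emp_err = np.array([np.mean([h(z) != lbl for z, lbl in zip(X_train, y_train)])
                        for h in hypotheses])
    low_error = [h for h, e in zip(hypotheses, emp_err) if e <= emp_err.min() + slack]
    votes = {h(x) for h in low_error}
    return votes.pop() if len(votes) == 1 else "reject"

# Toy run with 1-D threshold classifiers and a noisy training sample.
rng = np.random.default_rng(0)
hypotheses = [lambda z, t=t: 1 if z >= t else -1 for t in np.linspace(0, 1, 101)]
X_train = rng.random(200)
y_train = np.where(X_train >= 0.5, 1, -1)
flip = rng.random(200) < 0.1                       # 10% label noise
y_train = np.where(flip, -y_train, y_train)
for x in (0.1, 0.48, 0.9):
    print(x, "->", less_predict(x, hypotheses, X_train, y_train, slack=0.05))
```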
In many applications the output domain Y is continuous or numeric, and our goal is to find the best possible real-valued prediction function given a finite training sample Sm. Unfortunately, the disagreement coefficient is not defined for the case of real-valued functions, and the 0/1-loss function cannot be used. The above results for agnostic selective classification thus cannot be directly applied to regression. In Chapter 6 we analyze this case and derive sufficient conditions for ǫ-pointwise competitiveness.

Our first contribution is a natural extension of Hanneke's disagreement coefficient to the case of real-valued functions. Hanneke's disagreement coefficient is based on the definition of a disagreement set. A disagreement set of hypothesis class G is the set of all points in the domain on which at least two hypotheses in G do not agree. We extend this definition to real-valued functions by defining the ǫ-disagreement set as the set of all points in the domain for which at least two hypotheses in G predict values with a difference of more than ǫ. Our extended measure, termed the ǫ-disagreement coefficient (see Definition 6.5), is based on this definition of the ǫ-disagreement set.

Let F be a convex hypothesis class, assume ℓ : Y × Y → [0, ∞) is the squared loss function, and let ǫ > 0 be given. Then ǫ-LESS achieves, with probability of at least 1 − δ, a coverage of

    Φ(f, g) ≥ 1 − θǫ ( √(σ_{δ/4}^2 − 1) · R(f∗) + σ_{δ/4} · R̂(f̂) ),

where θǫ is the ǫ-disagreement coefficient of F and σ_δ is any multiplicative risk bound for real-valued functions. Specifically, we assume that there is a risk bound σ_δ ≜ σ(m, δ, F) such that for any tolerance δ, with probability of at least 1 − δ, any hypothesis f ∈ F satisfies R(f) ≤ R̂(f) · σ(m, δ, F); similarly, we assume the reverse bound, R̂(f) ≤ R(f) · σ(m, δ, F), holds under the same conditions.

The coverage rate bounds derived in this thesis are summarized in Table 1.2.

Table 1.2: Coverage rate bounds.

  Setting        Coverage rate                                                 Reference
  Realizable     1 − O( (1/m) ( |F| + ln(1/δ) ) )                              Theorem 3.2
  Realizable     1 − O( (1/m) ( γ(F, n̂) ln(m/γ(F, n̂)) + ln(m/δ) ) )            Theorem 3.10
  Agnostic       1 − O( Bθ ( (d/m) ln(m/d) + (1/m) ln(1/δ) )^{β/2} )           Theorem 5.5
  Agnostic (*)   1 − B · O( ( (polylog(m)/m) · log(1/δ) )^{β/2} )              Theorem 5.9
  Regression     1 − θǫ ( √(σ_{δ/4}^2 − 1) · R(f∗) + σ_{δ/4} · R̂(f̂) )          Theorem 6.7

  (*) This coverage rate bound holds for hypothesis class F and distribution P(X, Y) if γ(F, n̂) = O(polylog(m)) for F and the marginal distribution P(X).
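To illustrate how an ǫ-LESS-style test can be carried out through constrained ERMs in the linear least squares setting (the exact and efficient implementation is developed in Chapter 6), here is a schematic sketch. It is not the thesis's implementation: the acceptance `threshold` is a placeholder for the quantity that depends on the VC dimension, m, and δ, and the synthetic data is arbitrary.

```python
import numpy as np

def erm_lls(A, y):
    """Unconstrained ERM for linear least squares: minimize ||A w - y||^2."""
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def constrained_erm_lls(A, y, a, t):
    """ERM subject to the equality constraint a . w = t, via the KKT system."""
    d = A.shape[1]
    K = np.zeros((d + 1, d + 1))
    K[:d, :d] = 2.0 * A.T @ A
    K[:d, d] = a
    K[d, :d] = a
    rhs = np.concatenate([2.0 * A.T @ y, [t]])
    return np.linalg.solve(K, rhs)[:d]

def emp_risk(A, y, w):
    return np.mean((A @ w - y) ** 2)

def eps_less_accepts(A, y, x, eps, threshold):
    """Sketch of an eps-LESS test for linear least squares: accept x only if
    forcing the regressor to predict eps away from the ERM prediction at x
    drives the constrained empirical risk above `threshold` times the ERM risk."""
    a = np.append(x, 1.0)                       # add bias coordinate
    w_hat = erm_lls(A, y)
    base = emp_risk(A, y, w_hat)
    pred = a @ w_hat
    for t in (pred + eps, pred - eps):
        w_c = constrained_erm_lls(A, y, a, t)
        if emp_risk(A, y, w_c) <= threshold * base:
            return False   # some eps-far regressor fits almost as well: reject
    return True            # every eps-far regressor is clearly worse: accept

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + 0.1 * rng.standard_normal(200)
A = np.hstack([X, np.ones((200, 1))])
print(eps_less_accepts(A, y, rng.standard_normal(3), eps=0.2, threshold=1.1))
```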
1.4.6 New speedup results for active learning

In stream-based active learning (also known as selective sampling) the learner is presented with a stream of unlabeled examples and should decide whether or not to request a label for each training example after it is presented. While at first glance active learning may seem completely unrelated to selective prediction, in Chapter 4 we present a reduction of active learning to perfect selective classification and a converse reduction of selective classification to active learning. These reductions strongly tie together these two classical learning models and prove equivalence with respect to learning rates (label complexity and coverage rates). The first reduction (active reduced to selective) is substantially more involved, but is nonetheless quite rewarding: it provides us with the luxury of analyzing dynamic active learning problems within the static selective prediction framework, and facilitates a novel and powerful technique for bounding the label complexity of active learning algorithms. In Theorem 4.5 we show how a label complexity bound for the CAL active learning algorithm can be directly derived from any coverage bound for the CSS selective classification strategy.

The intuition behind the relationship between pointwise competitive selective prediction and active learning is easy to sketch informally: whenever the learner is sure about the prediction of the current unlabeled example x, there is no need to ask for the label of x. Conversely, whenever the learner cannot determine the label with sufficient certainty, perhaps there is something to learn from it and it is worthwhile to ask (and pay) for its label. Despite the conceptual simplicity of this intuitive relationship, the crucial component of this reduction is to prove that it preserves "fast rates." Specifically, if the rejection rate of CSS (one minus the coverage rate) is O(polylog(m/δ)/m), then (F, P) is actively learnable by CAL with exponential label complexity speedup.

Utilizing our coverage bound results for perfect selective classification, we prove that general (non-homogeneous) linear classifiers are actively learnable at an exponential (in 1/ǫ) label complexity rate when the data distribution is an arbitrary unknown finite mixture of high-dimensional Gaussians. This target-independent result substantially extends the state of the art in the formal understanding of active learning. It is the first exponential label complexity speedup proof for a non-toy setting.

While we obtain exponential label complexity speedup in 1/ǫ, our speedup results incur an exponential slowdown in d^2, where d is the (Euclidean) problem dimension. Nevertheless, in Theorem 4.13 we prove a lower bound of Ω((log m)^{(d−1)/2}) (1 + o(1)) on the label complexity, when considering the class of unrestricted linear classifiers under a Gaussian distribution. Thus, an exponential slowdown in the dimension d is not an artifact of our proof, but an unavoidable, newly revealed limitation of active learning, as it concerns any speedup result for the CAL algorithm. The argument relies on the observation that CAL has to request a label for any point on the convex hull of a sample Sm. The bound is then derived using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution.
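For reference, the CAL strategy discussed above can be sketched schematically for a finite hypothesis class, where the version space can be maintained explicitly. This toy code is illustrative and is not taken from the thesis.

```python
import numpy as np

def cal(stream, hypotheses, oracle):
    """Sketch of the CAL strategy for a finite hypothesis class (realizable case):
    maintain the version space and request a label only for points on which the
    surviving hypotheses disagree. `hypotheses` maps X -> {-1, +1};
    `oracle(x)` returns the true label and is charged one label request."""
    version_space = list(hypotheses)
    labels_requested = 0
    for x in stream:
        predictions = {h(x) for h in version_space}
        if len(predictions) == 1:
            continue                       # unanimous vote: the label is inferred for free
        y = oracle(x)                      # disagreement: pay for a label
        labels_requested += 1
        version_space = [h for h in version_space if h(x) == y]
    return version_space, labels_requested

# Toy run: threshold classifiers on [0, 1] and a stream of uniform points.
rng = np.random.default_rng(0)
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in np.linspace(0.0, 1.0, 101)]
target = hypotheses[37]
vs, cost = cal(rng.random(500), hypotheses, target)
print("labels requested:", cost, "surviving hypotheses:", len(vs))
```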
Thus, an exponential slowdown in the dimension $d$ is not an artifact of our proof, but a newly revealed, unavoidable limitation of active learning, as it concerns any speedup result for the CAL algorithm. The argument relies on the observation that CAL has to request a label for any point on the convex hull of a sample $S_m$. The bound is then derived using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution.

Finally, in Section 4.6 we relate our proposed technique to other complexity measures for active learning. We show how recent results in active learning can be used to derive coverage bounds in selective classification. We start by proving a relation to the teaching dimension [43]. In Corollary 4.17 we show, by relying on a known bound for the teaching dimension, that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke's disagreement coefficient and show in Theorem 4.18 that the coverage of perfect selective classification can be bounded from below using the disagreement coefficient. Conversely, we show, in Corollary 4.23, that the disagreement coefficient can be bounded from above using any coverage bound for perfect selective classification. Consequently, the results here imply that the disagreement coefficient can be sufficiently bounded to ensure fast active learning for the case of linear classifiers under a mixture of Gaussians. Using the equivalence between active learning and perfect selective classification, we were able to prove, in Corollary 4.26, that a bounded disagreement coefficient is a necessary condition for CAL to achieve exponential label complexity speedup.

The importance of our new technique and the resulting (distribution-dependent) bounds for the analysis of CAL have already been recognized by others and were recently highlighted by Hanneke in his survey of recent advances in the theory of active learning [50]:

"El-Yaniv and Wiener (2010) identify an entirely novel way to bound ... the label complexity of CAL, in terms of a new complexity measure that has several noteworthy features. It incorporates certain aspects of many different known complexity measures, including the notion of a region of disagreement (as in the disagreement coefficient analysis), the notion of a minimal specifying set (as in the teaching dimension analysis ...), and the notion of the VC dimension."

"... at present this is the only technique known to establish the bounds on the label complexity of CAL ... for both k-dimensional linear separators under mixtures of multivariate normal distributions and axis-aligned rectangles under product distributions."

As a matter of fact, the implications of our upper bound on the disagreement coefficient go beyond the above-mentioned bounds for CAL. Hanneke's disagreement coefficient is one of the most important and widely used complexity measures in the study of label complexity in agnostic active learning. For example, the disagreement coefficient is a key factor in the label complexity analyses of the famous Agnostic Active ($A^2$) [8] and RobustCAL$_\delta$ [50] learning algorithms. $A^2$ was the first general-purpose agnostic active learning algorithm with proven improvement in error guarantees compared to passive learning.
It has been proven that this algorithm, originally introduced by Balcan et al. [8], achieves exponential label complexity speedup (in the low accuracy regime) compared to passive learning only for a few simple cases, including threshold functions and homogeneous linear separators under a uniform distribution over the sphere [9]. In this work we extend these results and prove that exponential label complexity speedup (in the low accuracy regime) is achievable also for linear classifiers under a fixed (unknown) mixture of Gaussians. This new exponential speedup is limited to the low accuracy regime, but we prove in Theorem 4.31 that, under some conditions on the source distribution, RobustCAL$_\delta$ achieves exponential speedup for linear classifiers under a fixed mixture of Gaussians for any accuracy regime. To the best of our knowledge, our technique is the first (and currently only) one capable of accomplishing these strong results for $A^2$ and RobustCAL$_\delta$. Similarly, our technique can be used to derive new exponential speedup bounds for a long list of other learning methods, thus leveraging the use and capability of the disagreement coefficient. This list includes Beygelzimer et al.'s algorithm for importance-weighted active learning (IWAL) [17], AdaProAL for proactive learning [90], and SRRA-based algorithms for pool-based active learning [1], to mention a few.

1.4.7 Efficient implementation and empirical results

Attempting to implement even a "lazy" version of LESS is challenging because the calculation of the disbelief index requires the identification of ERM hypotheses. Unfortunately, for nontrivial hypothesis classes this is, in general, a computationally hard problem. Furthermore, the disbelief index is a noisy statistic that depends on the sample $S_m$. We present an implementation that "approximates" the proposed strategy and addresses the above problems. We provide some empirical results demonstrating the advantage of our proposed technique over rejection based on distance from the decision boundary for six datasets and two kernels (SVM with linear and RBF kernels). Figure 1.8 depicts the RC curve⁵ of LESS (solid red line) compared to the RC curve of rejection based on distance from the decision boundary (dashed green line) for the Haberman dataset. This dataset contains survival data of patients who underwent surgery for breast cancer. The goal is to predict whether or not a patient will survive at least 5 years after surgery (classified as "surviving" or "not surviving"). Even at 80% coverage, we have already lowered the test error (probability of misclassification) by about 3% compared to rejection based on distance from the decision boundary. This advantage monotonically increases with the rejection rate.

For the case of linear least squares regression (LLSR), we are able to show how $\epsilon$-LESS can be implemented exactly and efficiently using standard matrix operations. Furthermore, we derive a pointwise bound on the difference between the prediction of the ERM and the prediction of the best regressor in the class ($|f^*(x) - \hat{f}(x)|$). We conclude with some empirical results demonstrating the advantage of our proposed technique over a simple and natural 1-nearest neighbor (NN) technique for selection.

⁵ See Appendix D for further discussion of the RC curve.
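The RC curves discussed above are produced by ordering test points according to a confidence score and computing the test error over every coverage prefix. The following sketch illustrates this generic procedure only; it is not the exact experimental protocol used in Chapter 5, the scores here are synthetic, and the function name is ours. For the baseline the score would be the distance from the decision boundary, while for LESS it would be derived from the disbelief index.

```python
import numpy as np

def rc_curve(scores, errors):
    """Risk-coverage (RC) curve: for every coverage level, the average test
    error over the most confident points.  'scores' are confidence values
    (higher means more confident); 'errors' are 0/1 misclassification flags."""
    order = np.argsort(-scores)                 # most confident first
    err_sorted = errors[order].astype(float)
    counts = np.arange(1, len(errors) + 1)
    coverage = counts / len(errors)
    risk = np.cumsum(err_sorted) / counts
    return coverage, risk

# Synthetic illustration: higher confidence correlates with lower error probability.
rng = np.random.default_rng(0)
scores = rng.normal(size=500)
errors = (rng.random(500) < 1.0 / (1.0 + np.exp(3.0 * scores))).astype(int)
coverage, risk = rc_curve(scores, errors)
print(f"risk at full coverage: {risk[-1]:.3f};  risk at 80% coverage: {risk[int(0.8 * 500) - 1]:.3f}")
```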
Figure 1.8: RC curve of LESS (solid red line) and rejection based on distance from the decision boundary (dashed green line) on the Haberman dataset; x-axis: coverage c, y-axis: test error.

Chapter 2
Characterizing Set Complexity

"Oh the places you'll go! There is fun to be done! There are points to be scored. There are games to be won. And the magical things you can do with that ball will make you the winning-est winner of all."
Dr. Seuss, Oh, the Places You'll Go!

In this chapter we present and analyze a new hypothesis class complexity measure, termed characterizing set complexity. This measure plays a central role in the analysis of coverage rates in selective classification (Chapters 3 and 5), as well as in the analysis of label complexity in online active learning (Chapter 4).

2.1 Motivation

The characterizing set complexity was originally derived for the study of coverage rates in realizable selective classification [31]. Let $F$ be some hypothesis class and $f^* \in F$ our target hypothesis. Let $S_m = \{(x_i, y_i)\}_{i=1}^m$ be a finite training sample of $m$ labeled examples, assumed to be sampled i.i.d. from some unknown underlying distribution $P(X,Y)$ over $\mathcal{X}\times\mathcal{Y}$. A classic result from statistical learning theory is that the error rate of the ERM depends on the complexity of the hypothesis class from which the classification function $f(x)$ is chosen. Specifically, if $F$ has a bounded complexity (VC dimension), then the error of the ERM approaches the error of the best hypothesis in the class when $m$ approaches infinity. Analogously, the coverage rate of a selective classifier depends on the complexity of the selection hypothesis class from which the selection function $g(x)$ is chosen.

To better explain the motivation for our new complexity measure, we first review a simple and intuitive learning strategy for realizable selective classification. In Chapter 3 we present a simple learning strategy for the realizable case, which achieves perfect classification with maximum coverage. Namely, given hypothesis class $F$ and training set $S_m$, the strategy outputs a selective classifier $(f,g)$ that never errs on the accepted domain. This strategy, termed Consistent Selective Strategy (CSS), rejects any test point that is not classified unanimously by all hypotheses in the version space. The region accepted by the strategy is called the maximal agreement set. Figure 2.1 depicts an example of a maximal agreement set for the case of linear classifiers in $\mathbb{R}^2$. The red crosses and blue circles represent positive and negative examples, respectively; the dashed lines represent hypotheses in the version space, and the grayed areas represent the maximal agreement set.

Figure 2.1: An example of a maximal agreement set for linear classifiers in $\mathbb{R}^2$.

What is the function (hypothesis) class from which $g(x)$ was chosen in this example? Is it the class of all unions of two open polygons with three edges each? And what happens when we have a larger training set? In Figure 2.2 we present an example of a maximal agreement set for a training set with 7 examples (a) and a training set with 16 examples (b). We can see that in the latter case the selection function $g(x)$ is much more complex (a union of two open polygons with 5 and 6 edges). It is evident that the complexity of the maximal agreement set depends on the training sample $S_m$.
Figure 2.2: The region accepted by CSS for the case of linear classifiers in $\mathbb{R}^2$.

A useful way to parameterize the hypothesis class from which the selection function is taken is to utilize a data-dependent parameter. The parameter we use is the version space compression set size: the size of the smallest subset of the training set that induces the same version space as the entire training set. The version space compression sets of the examples in Figure 2.2 are depicted in Figure 2.3 with green circles.

Figure 2.3: Version space compression set.

Clearly, by definition, the maximal agreement set depends only on the version space compression set and the hypothesis class $F$. Let $H_n$ be the set of all possible maximal agreement sets generated by any version space compression set of size $n$ (with all possible labelings). For example, if $n$ equals four, then the maximal agreement set can be either a union of two open polygons with up to four edges each (see Figure 2.4 (a), (b)), or a closed polygon with four edges (see Figure 2.4 (c)). The class of all maximal agreement sets is the union of all $H_n$.

Figure 2.4: Maximal agreement region.

While the complexity of this class is potentially unlimited (infinite VC-dimension), we can bound the complexity of each subclass $H_n$. Given a training sample $S_m$, $g(x)$ accepts exactly the points that belong to the maximal agreement set with respect to $S_m$. Therefore, $g(x)$ is chosen from the class of all maximal agreement sets, and specifically from $H_{\hat{n}}$, where $\hat{n}$ is the version space compression set size. As we will see in the next chapter, it is sufficient to bound the complexity of $H_{\hat{n}}$ in order to prove a bound on the probabilistic volume of the maximal agreement region, and hence on the coverage of CSS.

The above discussion motivates our novel complexity measure, $\gamma(F,n)$, called the characterizing set complexity, which is defined as the VC dimension of the class of all maximal agreement regions for hypothesis class $F$ and any training sample of size $n$. For example, for $n = 4$ we are limited to the class depicted in Figure 2.4. Surprisingly, while the characterizing set complexity was originally developed for the realizable case, it turns out to play a central role also in the general agnostic case. This result is obtained through a reduction to active learning and an interesting connection to another complexity measure termed the disagreement coefficient. The relation to the disagreement coefficient is studied in Section 4.6.2.

2.2 The characterizing set complexity

We start by defining a region in $\mathcal{X}$, which is termed the "maximal agreement set." Any hypothesis that is consistent with the sample $S_m$ is guaranteed to be consistent with the target hypothesis $f^*$ on this entire region.

Definition 2.1 (agreement set) Let $G \subseteq F$. A subset $\mathcal{X}' \subseteq \mathcal{X}$ is an agreement set with respect to $G$ if all hypotheses in $G$ agree on every instance in $\mathcal{X}'$, namely,
$$\forall g_1, g_2 \in G,\ x \in \mathcal{X}',\quad g_1(x) = g_2(x).$$

Definition 2.2 (maximal agreement set) Let $G \subseteq F$. The maximal agreement set with respect to $G$ is the union of all agreement sets with respect to $G$.

Since a maximal agreement set is a region in $\mathcal{X}$, rather than a hypothesis, we formally define the dual hypothesis that matches every maximal agreement set.
Definition 2.3 (characterizing hypothesis) Let $G \subseteq F$ and let $A_G$ be the maximal agreement set with respect to $G$. The characterizing hypothesis of $G$, $f_G(x)$, is a binary hypothesis over $\mathcal{X}$ obtaining positive values over $A_G$ and zero otherwise.

We are now ready to formally define $H_n$, a class we term the order-$n$ characterizing set.

Definition 2.4 (order-$n$ characterizing set) For each $n$, let $\Sigma_n$ be the set of all possible labeled samples of size $n$ (all $n$-subsets, each with all possible labelings). The order-$n$ characterizing set of $F$, denoted $H_n$, is the set of all characterizing hypotheses $f_G(x)$, where $G \subseteq F$ is a version space induced by some member of $\Sigma_n$.

Definition 2.5 (characterizing set complexity) Let $H_n$ be the order-$n$ characterizing set of $F$. The order-$n$ characterizing set complexity of $F$, denoted $\gamma(F,n)$, is the VC-dimension of $H_n$.

The characterizing set complexity $\gamma(F,n)$ depends only on the hypothesis class $F$ and the parameter $n$, and is independent of the source distribution $P$. Can we explicitly evaluate or bound this novel complexity measure for some interesting hypothesis classes? Let us start with two toy examples that illustrate different properties of the characterizing set complexity.

2.3 Linear classifiers in $\mathbb{R}$

In the following example we explicitly calculate the characterizing set complexity of linear classifiers in $\mathbb{R}$. Let $F$ be the class of thresholds. Depending on the sample $S_m$, the maximal agreement set is either a single ray or two non-overlapping rays. In Figure 2.5 we illustrate three different cases. Negative examples are marked with blue circles, positive examples with red crosses, and the maximal agreement set is marked in gray.

Figure 2.5: Maximal agreement set for linear classifiers in $\mathbb{R}$: (a) a sample with both positive and negative examples; (b) a sample with negative examples only; (c) a sample with positive examples only.

The VC-dimension of the class of two non-overlapping rays is exactly the VC-dimension of the class of intervals in $\mathbb{R}$ (the complement class), which is 2. Therefore, for any $n$ the characterizing set complexity of linear classifiers in $\mathbb{R}$ is $\gamma(F,n) = 2$.

2.4 Intervals in $\mathbb{R}$

Let $F$ be the class of intervals in $\mathbb{R}$. If the sample $S_m$ includes at least one positive point, then the maximal agreement set is either an interval in $\mathbb{R}$, a non-overlapping interval and a ray, or a non-overlapping interval and two rays (see Figure 2.6). The function defined by an interval and two non-overlapping rays in $\mathbb{R}$ is exactly the complement of the union of two intervals in $\mathbb{R}$. Therefore, the VC-dimension of the maximal agreement set in this case is bounded by the VC-dimension of the class of unions of two intervals, which is 4. However, if the entire sample $S_m$ is negative, the maximal agreement set is exactly the set of all points in $S_m$ (see Figure 2.7 for an example). The VC-dimension of the class of $n$ points is exactly $n$. Therefore,
$$\gamma(F,n) \le \begin{cases} 4, & n \le 4; \\ n, & \text{otherwise.}\end{cases}$$
This is a simple example where the characterizing set complexity highly depends on the parameter $n$.

Figure 2.6: Maximal agreement set for intervals in $\mathbb{R}$: (a) a non-overlapping interval and two rays; (b) a non-overlapping interval and a ray; (c) an interval only.

Figure 2.7: Maximal agreement set for intervals in $\mathbb{R}$ when $S_m$ includes negative points only.
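The toy examples above can be made concrete in a few lines of code. The following sketch (our own illustration; the function name is ours) tests membership in the maximal agreement set for the class of thresholds of Section 2.3, i.e., it implements the unanimous-vote selection rule described in Section 2.1 for this class:

```python
def css_threshold(sample, x):
    """Unanimous-vote selection for the class of thresholds f_t(x) = +1 iff x >= t.

    'sample' is a list of (x_i, y_i) pairs with y_i in {-1, +1}, assumed
    realizable by some threshold.  Returns (label, True) if x lies in the
    maximal agreement set of the version space, or (None, False) if it is
    rejected because consistent thresholds disagree on x."""
    lo = max((xi for xi, yi in sample if yi == -1), default=float("-inf"))  # right-most negative
    hi = min((xi for xi, yi in sample if yi == +1), default=float("inf"))   # left-most positive
    if x >= hi:              # every consistent threshold t satisfies t <= hi, so all predict +1
        return +1, True
    if x <= lo:              # every consistent threshold t satisfies t > lo, so all predict -1
        return -1, True
    return None, False       # x falls inside the disagreement interval (lo, hi)

# The maximal agreement set of this sample is (-inf, 1] U [4, inf), i.e., two rays:
S = [(-2.0, -1), (1.0, -1), (4.0, +1), (6.0, +1)]
print([css_threshold(S, x) for x in (0.5, 2.5, 5.0)])   # [(-1, True), (None, False), (1, True)]
```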
2.5 Axis-aligned rectangles in $\mathbb{R}^d$

In this section we consider the class of axis-aligned rectangles in $\mathbb{R}^d$. Relying on the following classical result from statistical learning theory, we infer an explicit upper bound on the characterizing set complexity for axis-aligned rectangles.

Lemma 2.1 [18, Lemma 3.2.3] Let $F$ be a binary hypothesis class of finite VC dimension $d \ge 1$. For all $k \ge 1$, define the $k$-fold intersection,
$$F_k^{\cap} \triangleq \left\{ \cap_{i=1}^{k} f_i : f_i \in F,\ 1 \le i \le k \right\},$$
and the $k$-fold union,
$$F_k^{\cup} \triangleq \left\{ \cup_{i=1}^{k} f_i : f_i \in F,\ 1 \le i \le k \right\}.$$
Then, for all $k \ge 1$,
$$VC(F_k^{\cap}),\ VC(F_k^{\cup}) \le 2dk\log(3k).$$

Theorem 2.2 (order-$n$ characterizing set complexity) Let $F$ be the class of axis-aligned rectangles in $\mathbb{R}^d$. Then,
$$\gamma(F,n) \le 42\,dn\log_2(3n).$$

Proof  Let $S_n = S_k^- \cup S_{n-k}^+$ be a sample of size $n$ composed of $k$ negative examples, $\{x_1, x_2, \ldots, x_k\}$, and $n-k$ positive ones. Let $F$ be the class of axis-aligned rectangles. We define, for all $1 \le i \le k$,
$$R_i \triangleq S_{n-k}^+ \cup \{(x_i, -1)\}.$$
Notice that $VS_{F,R_i}$ includes all axis-aligned rectangles that classify all samples in $S^+$ as positive, and $x_i$ as negative. Therefore, the agreement region of $VS_{F,R_i}$ is composed of two components, as depicted in Figure 2.8. The first component is the smallest rectangle that bounds the positive samples, and the second is an unbounded convex polytope defined by up to $d$ hyperplanes intersecting at $x_i$.

Figure 2.8: Agreement region of $VS_{F,R_i}$.

Let $AGR_i$ be the agreement region of $VS_{F,R_i}$ and $AGR$ the agreement region of $VS_{F,S_n}$. Clearly, $R_i \subseteq S_n$, so $VS_{F,S_n} \subseteq VS_{F,R_i}$, and $AGR_i \subseteq AGR$, and it follows that
$$\bigcup_{i=1}^{k} AGR_i \subseteq AGR.$$
Assume, by contradiction, that $x \in AGR$ but $x \notin \bigcup_{i=1}^{k} AGR_i$. Then, for any $1 \le i \le k$, there exist two hypotheses $f_1^{(i)}, f_2^{(i)} \in VS_{F,R_i}$ such that $f_1^{(i)}(x) \ne f_2^{(i)}(x)$. Assume, without loss of generality, that $f_1^{(i)}(x) = 1$. We define
$$f_1 \triangleq \bigwedge_{i=1}^{k} f_1^{(i)} \quad\text{and}\quad f_2 \triangleq \bigwedge_{i=1}^{k} f_2^{(i)},$$
meaning that $f_1$ classifies a sample as positive if and only if all hypotheses $f_1^{(i)}$ classify it as positive. Noting that the intersection of axis-aligned rectangles is itself an axis-aligned rectangle, we know that $f_1, f_2 \in F$. Moreover, for any $x_i$ we have $f_1^{(i)}(x_i) = f_2^{(i)}(x_i) = -1$, so also $f_1(x_i) = f_2(x_i) = -1$, and $f_1, f_2 \in VS_{F,S_n}$. But $f_1(x) \ne f_2(x)$. Contradiction. Therefore,
$$AGR = \bigcup_{i=1}^{k} AGR_i.$$
It is well known that the VC dimension of the class of hyper-rectangles in $\mathbb{R}^d$ is $2d$. The VC dimension of $AGR_i$ is bounded by the VC dimension of the union of two hyper-rectangles in $\mathbb{R}^d$. Furthermore, the VC dimension of $AGR$ is bounded by the VC dimension of the union of all $AGR_i$. Applying Lemma 2.1 twice, we get
$$VC\dim\{AGR\} \le 42\,dk\log_2(3k) \le 42\,dn\log_2(3n).$$
If $k = 0$, then the entire sample is positive and the region of agreement is a hyper-rectangle; therefore, $VC\dim\{AGR\} = 2d$. If $k = n$, then the entire sample is negative and the region of agreement consists of the sample points themselves; hence, $VC\dim\{AGR\} = n$. Overall, we get that in all cases,
$$VC\dim\{AGR\} \le 42\,dn\log_2(3n).$$
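The decomposition of the agreement region used in this proof also suggests a simple membership test for this class: a test point is unanimously positive iff it lies in the bounding box of the positive examples, and unanimously negative iff the bounding box of the positives together with the test point contains some negative example; otherwise consistent rectangles disagree and the point is rejected. A minimal sketch (our own illustration, assuming a realizable sample with at least one positive example; names are ours):

```python
import numpy as np

def css_rectangle(X_pos, X_neg, x):
    """Unanimous-vote selection for axis-aligned rectangles in R^d (realizable case).

    X_pos, X_neg: arrays of shape (k, d) holding positive / negative examples.
    Returns (+1, True) or (-1, True) if x is in the maximal agreement set,
    and (None, False) if x is rejected."""
    lo, hi = X_pos.min(axis=0), X_pos.max(axis=0)    # bounding box B+ of the positives
    if np.all(lo <= x) and np.all(x <= hi):
        return +1, True                              # every consistent rectangle contains B+
    # Smallest rectangle containing the positives and x:
    lo2, hi2 = np.minimum(lo, x), np.maximum(hi, x)
    if np.all((X_neg >= lo2) & (X_neg <= hi2), axis=1).any():
        return -1, True    # no consistent rectangle can classify x as positive
    return None, False     # some consistent rectangles say +1 and others -1

X_pos = np.array([[1.0, 1.0], [2.0, 3.0]])
X_neg = np.array([[0.0, 0.0], [4.0, 4.0]])
print(css_rectangle(X_pos, X_neg, np.array([1.5, 2.0])))  # (+1, True)
print(css_rectangle(X_pos, X_neg, np.array([5.0, 5.0])))  # (-1, True)
print(css_rectangle(X_pos, X_neg, np.array([3.5, 3.5])))  # (None, False): rejected
```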
2.6 Linear classifiers in $\mathbb{R}^d$

In this section we consider the class of linear classifiers in $\mathbb{R}^d$. Relying on a classical result from combinatorial geometry, we infer an explicit upper bound on the characterizing set complexity for linear classifiers. Fix any positive integer $d$, and let $F \triangleq \{f_{\bar{w},\phi}(\bar{x})\}$ be the class of all linear binary classifiers in $\mathbb{R}^d$, where $\bar{w}$ are $d$-dimensional real vectors, $\phi$ are scalars, and
$$f_{\bar{w},\phi}(\bar{x}) = \begin{cases} +1, & \bar{w}^T\bar{x} - \phi \ge 0; \\ -1, & \bar{w}^T\bar{x} - \phi < 0.\end{cases}$$
Given a binary labeled training sample $S_m$, define $R^+ \triangleq R^+(S_m) \subseteq \mathbb{R}^d$ to be the subset of the maximal agreement set with respect to the version space $VS_{F,S_m}$ consisting of all points with positive labels. $R^+$ is called the "maximal positive agreement set." The "maximal negative agreement set," $R^- \triangleq R^-(S_m)$, is defined similarly.

Before continuing, we define a new symmetric hypothesis class $\tilde{F}$ that allows for a simpler analysis. Let $\tilde{F} \triangleq \{\tilde{f}_{\bar{w},\phi}(\bar{x})\}$ be the function class
$$\tilde{f}_{\bar{w},\phi}(\bar{x}) = \begin{cases} +1, & \text{if } \bar{w}^T\bar{x} - \phi > 0; \\ 0, & \text{if } \bar{w}^T\bar{x} - \phi = 0; \\ -1, & \text{if } \bar{w}^T\bar{x} - \phi < 0,\end{cases}$$
where we interpret 0 as a classification that agrees with both $+1$ and $-1$. Given a sample $S_m$, we define $\tilde{R}^+ \subseteq \mathbb{R}^d$ to be the region in $\mathbb{R}^d$ for which any hypothesis in the version space¹ $VS_{\tilde{F},S_m}$ classifies either $+1$ or $0$ (i.e., this is the maximal positive agreement set). We define $\tilde{R}^-$ analogously with respect to negative or zero classifications. While $F$ and $\tilde{F}$ are not identical, the maximal agreement sets they induce are identical. This is stated in the following technical lemma, whose proof appears in Appendix B.

¹ Any hypothesis in $\tilde{F}$ that classifies every sample in $S_m$ correctly or as 0 belongs to the version space.

Lemma 2.3 (maximal agreement set equivalence) For any linearly separable sample $S_m$, $R^+ = \tilde{R}^+$ and $R^- = \tilde{R}^-$.

The next technical lemma, whose proof also appears in the appendix, provides useful information on the geometry of the maximal agreement set for the class of linear classifiers.

Lemma 2.4 (maximal agreement set geometry I) Let $S_m$ be a linearly separable labeled sample that is a spanning set of $\mathbb{R}^d$. Then the regions $R^+$ and $R^-$ are each an intersection of a finite number of half-spaces, with at least $d$ samples on the boundary of each half-space.

Our goal is to bound the characterizing set complexity of $F$. As we show below, this complexity measure is directly related to the number of facets of the convex hull of $n$ points in $\mathbb{R}^d$. The following classical combinatorial geometry theorem by Klee [73, page 98] is thus particularly useful. The statement of Klee's theorem provided here is readily obtained from the original by using the Stirling approximation of the binomial coefficient.

Theorem 2.5 (Klee, 1966) The number of facets of a $d$-polytope with $n$ vertices is at most
$$2\cdot\left(\frac{en}{\lfloor d/2\rfloor}\right)^{\lfloor d/2\rfloor}.$$

An immediate conclusion is that the above term upper bounds the number of facets of the convex hull of $n$ points in $\mathbb{R}^d$ (which is, of course, a $d$-polytope).

Lemma 2.6 (maximal agreement set geometry II) Let $S_n$ be a linearly separable sample consisting of $n \ge d+1$ labeled points. Then the regions $R^+(S_n)$ and $R^-(S_n)$ are each an intersection of at most
$$2(d+1)\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}$$
half-spaces in $\mathbb{R}^d$.

Proof  For the sake of clarity, we limit the analysis to a sample $S_n$ in general position; that is, we assume that no more than $d$ points lie on a $(d-1)$-dimensional plane. A sample $S_n$ in arbitrary position can be handled by applying an appropriate infinitesimal displacement to the points. By Lemma 2.3, we can limit our discussion to the hypothesis class $\tilde{F}$ (rather than $F$). Since $S_n$ includes more than $d$ samples in general position, it is a spanning set of $\mathbb{R}^d$.
According to Lemma 2.4, $R^+$ is an intersection of a finite number of half-spaces, with at least $d$ samples on the boundary of each half-space (and exactly $d$ in general position). Let $S^+ \subseteq S_n$ be the subset of all positive samples in $S_n$, and $S^- \subseteq S_n$ the negative ones. Let $\tilde{f}_{\bar{w},\phi}$ be one of the half-spaces defining $R^+$. Then,
$$\forall \bar{x} \in S_n \quad \begin{cases} \bar{w}^T\bar{x} - \phi \ge 0, & \text{if } \bar{x} \in S^+; \\ \bar{w}^T\bar{x} - \phi \le 0, & \text{if } \bar{x} \in S^-.\end{cases}$$
Also, exactly $d$ samples $\bar{x}$ satisfy $\bar{w}^T\bar{x} - \phi = 0$. We now embed the samples in $\mathbb{R}^{d+1}$ using the following transformation, $\bar{x} \to \bar{x}'$:
$$\bar{x}' \triangleq \begin{cases} (0, \bar{x}), & \text{if } \bar{x} \in S^+; \\ (1, -\bar{x}), & \text{if } \bar{x} \in S^-.\end{cases}$$
For each half-space $(\bar{w}, \phi)$ in $\mathbb{R}^d$ we define a unique half-space $(\bar{w}', \phi')$ in $\mathbb{R}^{d+1}$,
$$\bar{w}' \triangleq (2\phi, \bar{w}), \qquad \phi' \triangleq \phi.$$
We observe that
$$\bar{w}'^T\bar{x}' - \phi' = \begin{cases} \bar{w}^T\bar{x} - \phi \ge 0, & \text{if } \bar{x} \in S^+; \\ 2\phi - \bar{w}^T\bar{x} - \phi = -(\bar{w}^T\bar{x} - \phi) \ge 0, & \text{if } \bar{x} \in S^-,\end{cases}$$
and for exactly $d$ samples we have
$$\bar{w}'^T\bar{x}' - \phi' = \begin{cases} \bar{w}^T\bar{x} - \phi = 0, & \text{if } \bar{x} \in S^+; \\ 2\phi - \bar{w}^T\bar{x} - \phi = -(\bar{w}^T\bar{x} - \phi) = 0, & \text{if } \bar{x} \in S^-.\end{cases}$$
Let $\bar{v}$ be any vector orthogonal to the (embedded) $d$ samples on the boundary of the half-space. Defining $\bar{w}'' \triangleq \bar{w}' + \alpha\bar{v}$, $\phi'' \triangleq \phi'$, with an appropriate choice of $\alpha$ we have
$$\forall \bar{x}' \in S_n \quad \bar{w}''^T\bar{x}' - \phi'' = \bar{w}'^T\bar{x}' - \phi' + \alpha\bar{v}^T\bar{x}' \ge 0,$$
and for exactly $d+1$ samples (including the original $d$ samples), $\bar{w}''^T\bar{x}' - \phi'' = 0$. We observe that the hyperplane defined by $(\bar{w}'', \phi'')$ supports a facet of the convex hull of the embedded samples in $\mathbb{R}^{d+1}$. Up to $d+1$ different half-spaces in $\mathbb{R}^d$ can be transformed into a single half-space in $\mathbb{R}^{d+1}$ (the number of combinations of choosing $d$ samples out of the $d+1$ samples on the boundary). Using Theorem 2.5, we bound the number $F(d)$ of facets of the convex hull of the points in $\mathbb{R}^{d+1}$ as follows:
$$F(d) \le 2\cdot\left(\frac{en}{\lfloor\frac{d+1}{2}\rfloor}\right)^{\lfloor\frac{d+1}{2}\rfloor} \le 2\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}.$$
Since up to $d+1$ half-spaces in $\mathbb{R}^d$ can be mapped onto a single facet of the convex hull in $\mathbb{R}^{d+1}$, we can bound the number of half-spaces in $\mathbb{R}^d$ by
$$(d+1)\cdot F(d) \le 2(d+1)\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}.$$

Lemma 2.7 (characterizing set complexity) Fix $d \ge 2$ and $n > d$. Let $F$ be the class of all linear binary classifiers in $\mathbb{R}^d$. Then, the order-$n$ characterizing set complexity of $F$ satisfies
$$\gamma(F,n) \le 83\cdot(d+1)^3\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\log n.$$

Proof  Let $G = F_k^{\cap}$ be the class of $k$-fold intersections of half-spaces in $\mathbb{R}^d$. Since the VC dimension of the class of all half-spaces in $\mathbb{R}^d$ is $d+1$, we obtain, using Lemma 2.1, that the VC dimension of $G$ satisfies
$$VC(G) \le 2k\log(3k)(d+1).$$
Let $H_n$ be the order-$n$ characterizing set of $F$. From Lemma 2.6 we know that any hypothesis $f \in H_n$ is a union of two regions, where each region is an intersection of no more than
$$k = 2(d+1)\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}$$
half-spaces in $\mathbb{R}^d$. Therefore, $H_n \subset G_2^{\cup}$. Using Lemma 2.1, we get
$$VC(H_n) \le VC(G_2^{\cup}) \le 4\log(6)\cdot VC(G) \le 8k\log(6)\log(3k)(d+1) \le 16(d+1)^2\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\log(6)\cdot\log\left(6(d+1)\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\right).$$
For $n > d \ge 2$ we get
$$\log\left(6(d+1)\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\right) \le \log(6n) + \frac{d+1}{2}\cdot\log\frac{2en}{d} \le 3\log n + \frac{d+1}{2}\cdot\log n^2 \le (d+4)\cdot\log n \le 2(d+1)\cdot\log n.$$
Therefore,
$$VC(H_n) \le 83\cdot(d+1)^3\cdot\left(\frac{2en}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\log n.$$

Chapter 3
Realizable Selective Classification

"A man doesn't know what he knows until he knows what he doesn't know."
Laurence J. Peter, Educator and Writer, 1919-1990
In this chapter we present and analyze the first application of self-aware learning, termed realizable selective classification. In the setting considered in this chapter, the target concept belongs to the hypothesis class $F$. Thus, in this case the best hypothesis in the class never errs, and any pointwise competitive selective classifier achieves perfect classification on the accepted domain. We show that perfect selective classification with guaranteed coverage is achievable (from a learning-theoretic perspective) for finite hypothesis spaces by a learning strategy termed consistent selective strategy (CSS). Moreover, CSS is shown to be optimal in its coverage rate, which is fully characterized by providing lower and upper bounds that match in their asymptotic behavior in the sample size $m$. We show that in the general case, for infinite hypothesis classes, perfect selective classification with guaranteed (non-zero) coverage is not achievable even when $F$ has a finite VC-dimension. We then derive a meaningful coverage guarantee using posterior information on the source distribution (a data-dependent bound). We then focus on the case of linear classifiers and show that if the unknown underlying distribution is a finite mixture of Gaussians, CSS will ensure perfect learning with guaranteed coverage. This powerful result indicates that consistent selective classification might be relevant in various applications of interest.

3.1 Consistent Selective Strategy (CSS)

Before presenting our consistent selective strategy (CSS) we recall that the version space, $VS_{F,S_m}$, is the set of all hypotheses in $F$ that classify $S_m$ correctly. Furthermore, the maximal agreement set of a hypothesis class $G \subseteq F$ is the set of all points for which all hypotheses in $G$ agree (see Definition 2.2).

Definition 3.1 (consistent selective strategy (CSS)) Given $S_m$, a consistent selective strategy (CSS) is a selective classification strategy that takes $f$ to be any hypothesis in $VS_{F,S_m}$ (i.e., a consistent learner), and takes a (deterministic) selection function $g$ that equals one for all points in the maximal agreement set with respect to $VS_{F,S_m}$, and zero otherwise.

In the present setting the (unknown) labeling hypothesis $f^*$ is in $VS_{F,S_m}$. Thus, CSS simply rejects all points that might incur an error with respect to $f^*$. An immediate consequence is that any CSS selective hypothesis $(f,g)$ always satisfies $R(f,g) = 0$. The main concern, however, is whether its coverage $\Phi(f,g)$ can be bounded from below, and whether any other strategy that achieves perfect learning with certainty can achieve better coverage. The following theorem proves that CSS has the largest possible coverage among all such strategies.

Theorem 3.1 (CSS coverage optimality) Given $S_m$, let $(f,g)$ be a selective classifier chosen by any strategy that ensures zero risk with certainty for any unknown distribution $P$ and any target concept $f^* \in F$. Let $(f_c, g_c)$ be a selective classifier selected by CSS using $S_m$. Then,
$$\Phi(f,g) \le \Phi(f_c, g_c).$$

Proof  For the sake of simplicity we limit the discussion to deterministic strategies. The extension to stochastic strategies is straightforward. Given a hypothetical sample $\tilde{S}_m$ of size $m$, let $(\tilde{f}_c, \tilde{g}_c)$ be the selective classifier chosen by CSS and let $(\tilde{f}, \tilde{g})$ be the selective classifier chosen by any competing strategy. Assume that there exists $x_0 \in \mathcal{X}$ ($x_0 \notin \tilde{S}_m$) such that $\tilde{g}(x_0) = 1$ and $\tilde{g}_c(x_0) = 0$.
According to the CSS construction of $\tilde{g}_c$, since $\tilde{g}_c(x_0) = 0$, there are at least two hypotheses $f_1, f_2 \in VS_{F,\tilde{S}_m}$ such that $f_1(x_0) \ne f_2(x_0)$. Assume, without loss of generality, that $f_1(x_0) = \tilde{f}(x_0)$. We will now construct a new "imaginary" classification problem and show that, under the above assumption, the competing strategy fails to guarantee zero risk with certainty. Let the imaginary target concept $f'^*$ be $f_2$ and the imaginary underlying distribution $P'$ be
$$P'(x) = \begin{cases} (1-\epsilon)/m, & \text{if } x \in \tilde{S}_m; \\ \epsilon, & \text{if } x = x_0; \\ 0, & \text{otherwise.}\end{cases}$$
Imagine a random sample $S'_m$ drawn i.i.d. from $P'$. There is a positive (perhaps small) probability that $S'_m$ will equal $\tilde{S}_m$, in which case $(f', g') = (\tilde{f}, \tilde{g})$. Since $g'(x_0) = \tilde{g}(x_0) = 1$ and $f'^*(x_0) \ne f'(x_0)$, with positive probability $R(f', g') = \epsilon > 0$. This contradicts the assumption that the competing strategy achieves perfect classification (zero risk) with certainty. It follows that for any sample $\tilde{S}_m$ and for any $x \in \mathcal{X}$, if $\tilde{g}(x) = 1$, then $\tilde{g}_c(x) = 1$. Consequently, for any unknown distribution $P$,
$$\Phi(\tilde{f}, \tilde{g}) \le \Phi(\tilde{f}_c, \tilde{g}_c).$$

3.2 Risk and coverage bounds

While it is clear that CSS achieves perfect classification, it is not at all clear whether this performance can be achieved with meaningful coverage. In this section we derive various coverage bounds, starting with the simple case of finite hypothesis classes and proceeding to the more complex case of infinite hypothesis classes. A formal definition of the meaning of upper and lower coverage bounds is given in Appendix C.

3.2.1 Finite hypothesis classes

The next result establishes the existence of perfect classification with guaranteed coverage in the finite case.

Theorem 3.2 (guaranteed coverage) Assume a finite $F$ and let $(f,g)$ be a selective classifier selected by CSS. Then, $R(f,g) = 0$ and for any $0 \le \delta \le 1$, with probability of at least $1-\delta$,
$$\Phi(f,g) \ge 1 - \frac{1}{m}\left((\ln 2)\min\{|F|, |\mathcal{X}|\} + \ln\frac{1}{\delta}\right). \tag{3.1}$$

Proof  For any $\epsilon$, let $G_1, G_2, \ldots, G_k$ be all the hypothesis subsets of $F$ with corresponding maximal agreement sets $\lambda_1, \lambda_2, \ldots, \lambda_k$, such that each $\lambda_i$ has volume of at most $1-\epsilon$ with respect to $P$. For any $1 \le i \le k$, the probability that a single point will be randomly drawn from $\lambda_i$ is thus at most $1-\epsilon$. The probability that all training points will be drawn from $\lambda_i$ is therefore at most $(1-\epsilon)^m$. If a training point $x$ is in $\mathcal{X}\setminus\lambda_i$, then there are at least two hypotheses $f_1, f_2 \in G_i$ that do not agree on $x$; at least one of them errs on $x$, and therefore $G_i \not\subseteq VS_{F,S_m}$. Hence,
$$\Pr\left(G_i \subseteq VS_{F,S_m}\right) \le (1-\epsilon)^m.$$
We note that $k \le 2^{\min\{|F|,|\mathcal{X}|\}}$, and by the union bound,
$$\Pr\left(\exists G_i :\ G_i \subseteq VS_{F,S_m}\right) \le k\cdot(1-\epsilon)^m \le 2^{\min\{|F|,|\mathcal{X}|\}}\cdot(1-\epsilon)^m.$$
Therefore, with probability of at least $1 - 2^{\min\{|F|,|\mathcal{X}|\}}\cdot(1-\epsilon)^m$, the version space $VS_{F,S_m}$ differs from every subset $G_i$, and hence it has a maximal agreement set with volume of at least $1-\epsilon$. Using the inequality $1-\epsilon \le \exp(-\epsilon)$, we have
$$2^{\min\{|F|,|\mathcal{X}|\}}\cdot(1-\epsilon)^m \le 2^{\min\{|F|,|\mathcal{X}|\}}\cdot\exp(-m\epsilon).$$
Equating the right-hand side to $\delta$ and solving for $\epsilon$ completes the proof.

A leading term in the coverage guarantee (3.1) is $|F|$. In corresponding results in standard consistent learning [53] the corresponding term is $\log|F|$. This may raise a concern regarding the tightness of (3.1). However, as shown in Corollary 3.5, this bound is tight (up to multiplicative constants).
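To make the object analyzed in Theorem 3.2 concrete, the following sketch (our own illustration; all names are ours) implements CSS for an explicitly enumerated finite hypothesis class by computing the version space and accepting a test point only when the vote is unanimous. The class of singletons used in the example reappears in the lower-bound construction below.

```python
def css_finite(hypotheses, sample, x):
    """CSS for a finite hypothesis class given as a list of callables.

    'sample' is a list of (x_i, y_i) pairs assumed consistent with at least one
    hypothesis.  Returns (label, True) if all version-space hypotheses agree on
    x, and (None, False) otherwise (x is rejected)."""
    version_space = [h for h in hypotheses
                     if all(h(xi) == yi for xi, yi in sample)]
    predictions = {h(x) for h in version_space}
    if len(predictions) == 1:
        return predictions.pop(), True
    return None, False

# The class of singletons over {0, ..., 9}: h_j(x) = +1 iff x == j.
hypotheses = [(lambda j: (lambda x: +1 if x == j else -1))(j) for j in range(10)]
S = [(3, -1), (7, -1)]                 # two negative examples
print(css_finite(hypotheses, S, 3))    # (-1, True): all consistent singletons agree
print(css_finite(hypotheses, S, 5))    # (None, False): h_5 disagrees with the rest
```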
To prove the Corollary we will require the following two definitions.

Definition 3.2 (binomial tail distribution) Let $Z_1, Z_2, \ldots, Z_m$ be $m$ independent Bernoulli random variables, each with success probability $p$. Then for any $0 \le k \le m$ we define
$$\mathrm{Bin}(m,k,p) \triangleq \Pr\left(\sum_{i=1}^{m} Z_i \le k\right).$$

Definition 3.3 (binomial tail inversion [64]) For any $0 \le \delta \le 1$ we define
$$\mathrm{Bin}(m,k,\delta) \triangleq \max_p\left\{p : \mathrm{Bin}(m,k,p) \ge \delta\right\}.$$

Theorem 3.3 (non-achievable coverage, implicit bound) Let $0 \le \delta \le \frac{1}{2}$, $m$, and $n > 1$ be given. There exists a distribution $P$, that depends on $m$ and $n$, and a finite hypothesis class $F$ of size $n$, such that for any selective classifier $(f,g)$ chosen from $F$ by CSS (so $R(f,g) = 0$) using a training sample $S_m$ drawn i.i.d. according to $P$, with probability of at least $\delta$,
$$\Phi(f,g) \le 1 - \frac{1}{2}\cdot\mathrm{Bin}\left(m, \frac{|F|}{2}, 2\delta\right).$$

Proof  Let $\mathcal{X} \triangleq \{e_1, e_2, \ldots, e_{n+1}\}$ be the standard (vector) basis of $\mathbb{R}^{n+1}$, $\mathcal{X}' \triangleq \mathcal{X}\setminus\{e_{n+1}\}$, and let $P$ be the source distribution over $\mathcal{X}$ satisfying
$$P(e_i) \triangleq \begin{cases} \mathrm{Bin}\left(m, \frac{n}{2}, 2\delta\right)/n, & \text{if } i \le n; \\ 1 - \mathrm{Bin}\left(m, \frac{n}{2}, 2\delta\right), & \text{otherwise,}\end{cases}$$
where $\mathrm{Bin}(m,k,\delta)$ is the binomial tail inversion (Definition 3.3). Since
$$\mathrm{Bin}\left(m, \frac{n}{2}, 2\delta\right) \triangleq \max_p\left\{p : \mathrm{Bin}\left(m, \frac{n}{2}, p\right) \ge 2\delta\right\},$$
and $S_m$ is drawn i.i.d. according to $P$, we get that with probability of at least $2\delta$,
$$\left|\left\{x \in S_m : x \in \mathcal{X}'\right\}\right| \le \frac{n}{2}.$$
Let $F$ be the class of singletons, such that
$$h_i(e_j) \triangleq \begin{cases} 1, & \text{if } i = j; \\ -1, & \text{otherwise.}\end{cases}$$
Taking $f^* \triangleq h_{i^*}$, for some $1 \le i^* \le n$, we have
$$\Pr\left(e_{i^*} \notin S_m,\ \left|\left\{x \in S_m : x \in \mathcal{X}'\right\}\right| \le \frac{n}{2}\right) = \Pr\left(e_{i^*} \notin S_m \,\Big|\, \left|\left\{x \in S_m : x \in \mathcal{X}'\right\}\right| \le \frac{n}{2}\right)\cdot\Pr\left(\left|\left\{x \in S_m : x \in \mathcal{X}'\right\}\right| \le \frac{n}{2}\right) \ge \left(1 - \frac{1}{n}\right)^{n/2}\cdot 2\delta \ge \delta.$$
If $e_{i^*} \notin S_m$, then all samples in $S_m$ are negative, so each sample in $\mathcal{X}'$ can reduce the version space $VS_{F,S_m}$ by at most one hypothesis. Hence, with probability of at least $\delta$,
$$|VS_{F,S_m}| \ge |F| - \frac{n}{2} = \frac{n}{2}.$$
Since the coverage $\Phi(f,g)$ is the volume of the maximal agreement set with respect to the version space $VS_{F,S_m}$, it follows that
$$\Phi(f,g) = 1 - |VS_{F,S_m}|\cdot\frac{\mathrm{Bin}\left(m,\frac{n}{2},2\delta\right)}{n} \le 1 - \frac{1}{2}\cdot\mathrm{Bin}\left(m,\frac{|F|}{2},2\delta\right).$$

Remark 3.4 The result of Theorem 3.3 is based on the use of the class of singletons. Augmenting this class with the empty set and choosing a uniform distribution over $\mathcal{X}$ results in a tighter bound. However, the bound would be significantly less general, as it would hold only for a single hypothesis in $F$ and not for any hypothesis in $F$.

Corollary 3.5 (non-achievable coverage, explicit bound) Let $0 \le \delta \le \frac{1}{4}$, $m$, and $n > 1$ be given. There exist a distribution $P$, that depends on $m$ and $n$, and a finite hypothesis class $F$ of size $n$, such that for any selective classifier $(f,g)$ chosen from $F$ by CSS (so $R(f,g) = 0$) using a training sample $S_m$ drawn i.i.d. according to $P$, with probability of at least $\delta$,
$$\Phi(f,g) \le \max\left\{0,\ 1 - \frac{1}{8m}\left(|F| - \frac{16}{3}\ln\frac{1}{1-2\delta}\right)\right\}.$$

Proof  Applying Lemma B.2 we get
$$\mathrm{Bin}\left(m, \frac{|F|}{2}, 2\delta\right) \ge \min\left\{1,\ \frac{|F|}{4m} - \frac{4}{3m}\ln\frac{1}{1-2\delta}\right\}.$$
Applying Theorem 3.3 completes the proof.

3.2.2 Infinite hypothesis spaces

In this section we consider an infinite hypothesis space $F$. We show that in the general case, perfect selective classification with guaranteed (non-zero) coverage is not achievable even when $F$ has a finite VC-dimension. We then derive a meaningful coverage guarantee using posterior information on the source distribution (a data-dependent bound). We start this section with a negative result that precludes non-trivial perfect learning when $F$ is the set of linear classifiers.
The result is obtained by constructing a particularly bad distribution.

Theorem 3.6 (non-achievable coverage) Let $m$ and $d > 2$ be given. There exist a distribution $P$, an infinite hypothesis class $F$ with a finite VC-dimension $d$, and a target hypothesis in $F$, such that $\Phi(f,g) = 0$ for any selective classifier $(f,g)$ chosen from $F$ by CSS using a training sample $S_m$ drawn i.i.d. according to $P$.

Proof  Let $F$ be the class of all linear classifiers in $\mathbb{R}^2$ and let $P$ be a uniform distribution over the arcs
$$(x-2)^2 + y^2 = 2,\ x < 1, \qquad\text{and}\qquad (x+2)^2 + y^2 = 2,\ x > -1.$$
Figure 3.1 depicts this construction. The training set $S_m$ consists of points on these arcs, labeled by any linear classifier that passes between the arcs.

Figure 3.1: A worst-case distribution for linear classifiers: points are drawn uniformly at random on the two arcs and labeled by a linear classifier that passes between these arcs. The probability volume of the maximal agreement set is zero.

The maximal agreement set, $A$, with respect to the version space $VS_{F,S_m}$ is partitioned into two subsets, $A^+$ and $A^-$, according to the labels obtained by hypotheses in the version space. Clearly, $A^+$ is confined by a polygon whose vertices lie on the right-hand side arc. Since $P$ is concentrated on the arc, the probability volume of $A^+$ is exactly zero for any finite $m$. The same analysis holds for $A^-$, and therefore the coverage is forced to be zero. The VC-dimension of the class of all linear classifiers in $\mathbb{R}^2$ is 3. Embedding the distribution $P$ in a higher dimensional space $\mathbb{R}^d$ and using the class of all linear classifiers in $\mathbb{R}^d$ completes the proof.

A direct corollary of Theorem 3.6 is that, in the general case, perfect selective classification with distribution-free guaranteed coverage is not achievable for infinite hypothesis spaces. However, this is certainly not the end of the story for perfect learning. In the remainder of this chapter we derive meaningful coverage guarantees using posterior or prior information on the source distribution (data- and distribution-dependent bounds).

In order to guarantee meaningful coverage we first need to study the complexity of the selection function $g(x)$ chosen by CSS. The complexity of the classification function $f(x)$ is determined only by the hypothesis class $F$ and is independent of the sample size itself. However, the complexity of $g(x)$ (the maximal agreement set) chosen by CSS generally depends on the sample size. Therefore, increasing the training sample size does not necessarily guarantee non-trivial coverage. Our main task is to find the complexity class of the family of maximal agreement sets from which $g(x)$ is chosen. Let us define the family of all maximal agreement sets as $H = \bigcup H_n$, where $H_1 \subset H_2 \subset H_3 \subset \cdots$. We can now exploit the fact that CSS chooses a maximal agreement set that belongs to a specific subclass $H_n$, with complexity measured in terms of the VC dimension of $H_n$. We term this approach Structural Coverage Maximization (SCM), following the analogous and familiar Structural Risk Minimization (SRM) approach [81]. A useful way to parameterize $H$ is to use the size of the version space compression set.

Definition 3.4 (version space compression set) Let $S_m$ be a labeled sample of $m$ points and let $VS_{F,S_m}$ be the induced version space. The version space compression set, $S_{\hat{n}} \subseteq S_m$, is a smallest subset of $S_m$ satisfying $VS_{F,S_m} = VS_{F,S_{\hat{n}}}$.
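For intuition, the following sketch (our own illustration; the function name is ours) computes the version space compression set for the toy class of thresholds on the real line. For this class the version space is fully determined by the right-most negative and the left-most positive example, so $\hat{n} \le 2$; for richer classes, finding a smallest such subset is, of course, harder.

```python
def compression_set_thresholds(sample):
    """Version space compression set (Definition 3.4) for threshold classifiers
    f_t(x) = +1 iff x >= t on the real line.  The version space consists of all
    thresholds between the right-most negative and the left-most positive
    example, so these (at most two) points already induce the same version
    space as the full sample."""
    neg = [(x, y) for x, y in sample if y == -1]
    pos = [(x, y) for x, y in sample if y == +1]
    witnesses = []
    if neg:
        witnesses.append(max(neg))    # right-most negative example
    if pos:
        witnesses.append(min(pos))    # left-most positive example
    return witnesses

S = [(-2.0, -1), (1.0, -1), (4.0, +1), (6.0, +1)]
print(compression_set_thresholds(S))  # [(1.0, -1), (4.0, +1)], hence n_hat = 2
```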
Note that for any given $F$ and $S_m$, the size of the version space compression set, denoted $\hat{n} = \hat{n}(F, S_m)$, is unique.

Remark 3.7 Our "version space compression set" is precisely Hanneke's "minimum specifying set" [47] for $f$ on $U$ with respect to $V$, where $f = h^*$, $U = S_m$, and $V = H[S_m]$ (see Definition 4.6).

Lemma 3.8 The characterizing hypothesis $f_{VS_{F,S_m}}(x)$ belongs to the order-$\hat{n}$ characterizing set of $F$, where $\hat{n} = \hat{n}(F,S_m)$ is the size of the version space compression set.

Proof  According to Definition 3.4, there exists a subset $S_{\hat{n}} \subseteq S_m$ of size $\hat{n}$ such that $VS_{F,S_m} = VS_{F,S_{\hat{n}}}$. The rest of the proof follows immediately from Definition 2.4.

Before stating the main result of this chapter, we state a classical result that will be used later.

Theorem 3.9 ([83]; [3, p. 53]) Let $F$ be a hypothesis space with VC-dimension $d$. For any probability distribution $P$ on $\mathcal{X}\times\{\pm 1\}$, with probability of at least $1-\delta$ over the choice of $S_m$ from $P^m$, any hypothesis $f \in F$ consistent with $S_m$ satisfies
$$R(f) \le \epsilon(d, m, \delta) = \frac{2}{m}\left(d\ln\frac{2em}{d} + \ln\frac{2}{\delta}\right), \tag{3.2}$$
where $R(f) \triangleq \mathbf{E}\left[I(f(x) \ne f^*(x))\right]$ is the risk of $f$.

We note that inequality (3.2) actually holds only for $d \le m$. For any $d > m$ it is clear that no meaningful upper bound on the risk can be achieved. It is easy to fix the inequality for the general case by replacing $\ln\frac{2em}{d}$ with $\ln^+\frac{2em}{d}$, where $\ln^+(x) \triangleq \max(\ln(x), 1)$.

Theorem 3.10 (data-dependent coverage guarantee) For any $m$, let $a_1, \ldots, a_m \in \mathbb{R}$ be given, such that $a_i \ge 0$ and $\sum_{i=1}^{m} a_i \le 1$. Let $(f,g)$ be a selective classifier chosen by CSS. Then, $R(f,g) = 0$, and for any $0 \le \delta \le 1$, with probability of at least $1-\delta$,
$$\Phi(f,g) \ge 1 - \frac{2}{m}\left(\gamma(F,\hat{n})\ln^+\frac{2em}{\gamma(F,\hat{n})} + \ln\frac{2}{a_{\hat{n}}\delta}\right),$$
where $\hat{n}$ is the size of the version space compression set, and $\gamma(F,\hat{n})$ is the order-$\hat{n}$ characterizing set complexity of $F$.

Proof  Given our sample $S_m = \{(x_i, f^*(x_i))\}_{i=1}^{m}$ (labeled by the unknown target function $f^*$), we define the "synthetic" sample $S'_m = \{(x_i, 1)\}_{i=1}^{m}$. $S'_m$ can be assumed to have been sampled i.i.d. from the marginal distribution of $X$ with positive labels ($P'$). Theorem 3.9 can now be applied to the synthetic problem with the training sample $S'_m$, the distribution $P'$, and the hypothesis space taken to be $H_i$, the order-$i$ characterizing set of $F$. It follows that for all $h \in VS_{H_i,S'_m}$, with probability of at least $1 - a_i\delta$ over choices of $S'_m$ from $(P')^m$,
$$\Pr_{P'}(h(x) \ne 1) \le \frac{2}{m}\left(d_i\ln\frac{2em}{d_i} + \ln\frac{2}{a_i\delta}\right), \tag{3.3}$$
where $d_i$ is the VC-dimension of $H_i$. Applying the union bound then yields that, with probability of at least $1-\delta$, inequality (3.3) holds simultaneously for all $1 \le i \le m$. All hypotheses in the version space $VS_{F,S_m}$ agree on all samples in $S_m$. Hence, the characterizing hypothesis $f_{VS_{F,S_m}}(x) = 1$ for any point $x \in S_m$. Let $\hat{n}$ be the size of the version space compression set. According to Lemma 3.8, $f_{VS_{F,S_m}} \in H_{\hat{n}}$. Noting that $f_{VS_{F,S_m}}(x) = 1$ for any $x \in S'_m$, we learn that $f_{VS_{F,S_m}} \in VS_{H_{\hat{n}},S'_m}$. Therefore, with probability of at least $1-\delta$ over choices of $S_m$,
$$\Pr_P\left(f_{VS_{F,S_m}}(x) \ne 1\right) \le \frac{2}{m}\left(d_{\hat{n}}\ln\frac{2em}{d_{\hat{n}}} + \ln\frac{2}{a_{\hat{n}}\delta}\right).$$
Since $\Phi(f,g) = \Pr_P\left(f_{VS_{F,S_m}}(x) = 1\right)$, and $d_{\hat{n}}$ is the order-$\hat{n}$ characterizing set complexity of $F$, the proof is complete.

The data-dependent bound in Theorem 3.10 is stated in terms of the characterizing set complexity of the hypothesis class $F$ (see Definition 2.5).
Relying on our results from Chapter 2, we derive an explicit data-dependent coverage bound for the class of binary linear classifiers.

Corollary 3.11 (data-dependent coverage guarantee) Let $F$ be the class of linear binary classifiers in $\mathbb{R}^d$ and assume that the conditions of Theorem 3.10 hold. Then, $R(f,g) = 0$, and for any $0 \le \delta \le 1$, with probability of at least $1-\delta$,
$$\Phi(f,g) \ge 1 - \frac{2}{m}\left(83(d+1)^3\Lambda_{\hat{n},d}\ln^+\frac{2em}{\Lambda_{\hat{n},d}} + \ln\frac{2}{a_{\hat{n}}\delta}\right),$$
where $\hat{n}$ is the size of the empirical version space compression set, and
$$\Lambda_{\hat{n},d} = \left(\frac{2e\hat{n}}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\log\hat{n}.$$

Proof  Define
$$\Psi(\gamma(F,n)) \triangleq 1 - \frac{2}{m}\left(\gamma(F,n)\ln^+\frac{2em}{\gamma(F,n)} + \ln\frac{2}{a_n\delta}\right).$$
We note that $\Psi(\gamma(F,n))$ is a continuous function. For any $\gamma(F,n) < 2m$,
$$\frac{\partial\Psi(\gamma(F,n))}{\partial\gamma(F,n)} = -\frac{2}{m}\ln\frac{2em}{\gamma(F,n)} + \frac{2}{m} < 0,$$
and for any $\gamma(F,n) > 2m$,
$$\frac{\partial\Psi(\gamma(F,n))}{\partial\gamma(F,n)} = -\frac{2}{m} < 0.$$
Thus, $\Psi(\gamma(F,n))$ is monotonically decreasing. Noting that $\ln^+(x)$ is monotonically increasing, applying Theorem 3.10 together with Lemma 2.7 completes the proof.

3.3 Distribution-dependent coverage bound

As long as the empirical version space compression set size $\hat{n}$ is sufficiently small compared to $m$, Corollary 3.11 provides a meaningful coverage guarantee. Since $\hat{n}$ might depend on $m$, it is hard to analyze the effective rate of the bound. To further explore this guarantee, we now bound $\hat{n}$ in terms of $m$ for a specific family of source distributions and derive a distribution-dependent coverage guarantee.

Theorem 3.12 ([15]) If $m$ points in $d$ dimensions have their components chosen independently from any set of continuous distributions (possibly different for each component), then the expected number of convex hull vertices $v$ is
$$\mathbf{E}[v] = O\left((\log m)^{d-1}\right).$$

Definition 3.5 (sliced multivariate Gaussian distribution) A sliced multivariate Gaussian distribution, $\mathcal{N}(\Sigma, \mu, w, \phi)$, is a multivariate Gaussian distribution restricted by a half-space in $\mathbb{R}^d$. Thus, if $\Sigma$ is a non-singular covariance matrix, the pdf of the sliced Gaussian is
$$\frac{1}{C}\,e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}\cdot I(w^Tx - \phi \ge 0),$$
where $\mu = (\mu_1, \ldots, \mu_d)^T$, $I$ is the indicator function, and $C$ is an appropriate normalization factor.

Lemma 3.13 Let $P$ be a sliced multivariate Gaussian distribution. If $m$ points are chosen independently from $P$, then the expected number of convex hull vertices is $O\left((\log m)^{d-1}\right)$.

Proof  Let $X \sim \mathcal{N}(\Sigma, \mu, w, \phi)$ and $Y \sim \mathcal{N}(\Sigma, \mu)$. There exist a random vector $Z$, whose components are independent standard normal random variables, a vector $\mu$, and a matrix $A$ such that $Y = AZ + \mu$. Since
$$w^Ty - \phi = w^T(Az + \mu) - \phi = w^TAz + w^T\mu - \phi,$$
we get that $X = AZ_0 + \mu$, where $Z_0 \sim \mathcal{N}(I, 0, w^TA, \phi - w^T\mu)$. Due to the spherical symmetry of $Z$, we can choose the half-space $(w^TA, \phi - w^T\mu)$ to be axis-aligned by rotating the axes. We note that the $d$ components of $Z$ are chosen independently and that the axis-aligned half-space enforces a restriction on only one of the axes. Therefore, the components of $Z_0$ are chosen independently as well. Applying Theorem 3.12, we get that if $m$ points are chosen independently from $Z_0$, then the expected number of convex hull vertices is $O\left((\log m)^{d-1}\right)$. The proof is complete by noting that the number of convex hull vertices is preserved under affine transformations.

Lemma 3.14 (version space compression set size) Let $F$ be the class of all linear binary classifiers in $\mathbb{R}^d$. Assume that the underlying distribution $P$ is a mixture of a fixed number of Gaussians.
Then, for any $0 \le \delta \le 1$, with probability of at least $1-\delta$, the empirical version space compression set size is
$$\hat{n} = O\left(\frac{(\log m)^{d-1}}{\delta}\right).$$

Proof  Let $S_n$ be a version space compression set. Consider $\bar{x}_0 \in S_n$. Since $S_n$ is a compression set, there is a half-space $(\bar{w}, \phi)$ such that $f_{\bar{w},\phi} \in VS_{F,S_n\setminus\{\bar{x}_0\}}$ and $f_{\bar{w},\phi} \notin VS_{F,S_n}$. W.l.o.g. assume that $\bar{x}_0 \in S_n$ is positive; thus $\bar{w}^T\bar{x}_0 - \phi < 0$, and for any other positive point $\bar{x} \in S_n$, $\bar{w}^T\bar{x} - \phi \ge 0$. For an appropriate $\phi' < \phi$, there exists a half-space $(\bar{w}, \phi')$ such that $\bar{w}^T\bar{x}_0 - \phi' = 0$, and for any other positive point $\bar{x} \in S_n$, $\bar{w}^T\bar{x} - \phi' > 0$. Therefore, $\bar{x}_0$ is a convex hull vertex. It follows that we can bound the number of positive samples in $S_n$ by the number of vertices of the convex hull of all the positive points. Defining $v$ as the number of convex hull vertices and using Markov's inequality, we get that for any $\epsilon > 0$,
$$\Pr(v \ge \epsilon) \le \frac{\mathbf{E}[v]}{\epsilon}.$$
Since $f^*$ is a linear classifier, the underlying distribution of the positive points is a mixture of sliced multivariate Gaussians. Using Lemmas 3.13 and B.5, we get that with probability of at least $1-\delta$,
$$v \le \frac{\mathbf{E}[v]}{\delta} = O\left(\frac{(\log m)^{d-1}}{\delta}\right).$$
Repeating the same arguments for the negative points completes the proof.

Corollary 3.15 (distribution-dependent coverage guarantee) Let $F$ be the class of all linear binary classifiers in $\mathbb{R}^d$, and let $P$ be a mixture of a fixed number of Gaussians. Then, $R(f,g) = 0$, and for any $0 \le \delta \le 1$, with probability of at least $1-\delta$,
$$\Phi(f,g) \ge 1 - O\left(\frac{(\log m)^{d^2}}{m}\cdot\frac{1}{\delta^{(d+3)/2}}\right).$$

Proof
$$\Lambda_{\hat{n},d} = \left(\frac{2e\hat{n}}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\log\hat{n} \le \left(\frac{2e}{d}\right)^{\lfloor\frac{d+1}{2}\rfloor}\cdot\hat{n}^{\frac{d+3}{2}}.$$
Applying Lemma 3.14,
$$\Lambda_{\hat{n},d} = O\left(\frac{(\log m)^{d^2}}{\delta^{(d+3)/2}}\right).$$
The proof is complete by noting that $\Lambda_{\hat{n},d} \ge 1$ and using Corollary 3.11 with $a_i = 2^{-i}$.

3.4 Implementation

In the previous sections we analyzed the performance of CSS and proved that (in the realizable case) it can achieve sharp coverage rates under reasonable assumptions on the source distribution, while guaranteeing zero error on the accepted samples. However, it remains unclear whether an efficient implementation of CSS is within reach. In this section we propose an algorithm for CSS and show that it can be efficiently implemented for linear classifiers.

The following method, which we term lazy CSS, is very similar to the implicit selective sampling algorithm of Cohn et al. [24]. Instead of explicitly constructing the CSS selection function $g$ during training (which is indeed a daunting task), we develop a "lazy learning" approach that can potentially facilitate an efficient CSS implementation at test time. In particular, we propose to evaluate $g(x)$ at any given test point $x$ during the classification process. For the training set $S_m$ and a test point $x$ we define the following two sets:
$$S_{m,x}^- \triangleq S_m \cup \{(x,-1)\}; \qquad S_{m,x}^+ \triangleq S_m \cup \{(x,+1)\},$$
that is, $S_{m,x}^+$ is the (labeled) training set $S_m$ augmented by the test point $x$ labeled positively, and $S_{m,x}^-$ is $S_m$ augmented by $x$ labeled negatively. The selection value $g(x)$ is determined as follows: $g(x) = 0$ (i.e., $x$ is rejected) iff there exist hypotheses $f^+, f^- \in F$ that are consistent with $S_{m,x}^+$ and $S_{m,x}^-$, respectively. The following lemma states that the selection function $g(x)$ constructed by lazy CSS is a precise implementation of CSS.

Lemma 3.16 Let $F$ be any hypothesis class, $S_m$ a labeled training set, and $x$ a test point.
Then $x$ belongs to the maximal agreement set of $VS_{F,S_m}$ iff there is no hypothesis $f \in F$ that is consistent with $S_{m,x}^+$, or no hypothesis $f \in F$ that is consistent with $S_{m,x}^-$.

Proof  If there exist hypotheses $f^+, f^- \in F$ that are consistent with $S_{m,x}^+$ and $S_{m,x}^-$, then there exist two hypotheses in $F$ that correctly classify $S_m$ (and therefore belong to $VS_{F,S_m}$) but disagree on $x$. Hence, $x$ does not belong to the maximal agreement set of $VS_{F,S_m}$. Conversely, if $x$ does not belong to the maximal agreement set of $VS_{F,S_m}$, then there are two hypotheses, $f_1$ and $f_2$, which correctly classify $S_m$ but disagree on $x$. Assume, without loss of generality, that $f_1$ classifies $x$ positively. Then $f_1$ is consistent with $S_{m,x}^+$ and $f_2$ is consistent with $S_{m,x}^-$. Thus there exist hypotheses $f^+, f^- \in F$ that are consistent with $S_{m,x}^+$ and $S_{m,x}^-$, respectively.

For the case of linear classifiers it follows that computing the lazy CSS selection function for any test point reduces to two applications of a linear separability test. Yogananda et al. [91] recently presented a fast linear separability test with a worst-case time complexity of $O(mr^3)$ and space complexity of $O(md)$, where $m$ is the number of points, $d$ is the dimension, and $r \le \min(m, d+1)$.

Remark 3.17 For the realizable case we can modify any rejection mechanism by restricting rejection only to the region chosen for rejection by CSS. Since CSS accepts only samples that are guaranteed to have zero test error, the overall performance of the modified rejection mechanism is guaranteed to be at least as good as that of the original mechanism. Using this technique we were able to improve the performance (RC curve) of the most commonly used rejection mechanism for linear classifiers, which rejects samples according to a simple symmetric distance from the decision boundary (a "margin").
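The following minimal sketch implements lazy CSS for linear classifiers. Instead of the specialized separability test of Yogananda et al. [91] mentioned above, it uses a generic linear-programming feasibility test via SciPy (any LP solver would do); the function names and the toy data are ours, and, for simplicity, only strict separability is tested.

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Return True iff the labeled sample (X, y), y_i in {-1, +1}, is strictly
    linearly separable (with a bias term): find (w, b) with y_i(w.x_i + b) >= 1."""
    m, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # -y_i (x_i.w + b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def lazy_css_linear(X, y, x):
    """Lazy CSS: reject x iff both augmented samples S U {(x,+1)} and
    S U {(x,-1)} are realizable; otherwise accept with the forced label.
    (In the realizable case at least one of the two is always realizable.)"""
    pos_ok = separable(np.vstack([X, x]), np.append(y, +1))
    neg_ok = separable(np.vstack([X, x]), np.append(y, -1))
    if pos_ok and neg_ok:
        return None, False            # both labels are consistent: reject x
    return (+1, True) if pos_ok else (-1, True)

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, +1, +1])
print(lazy_css_linear(X, y, np.array([4.0, 0.5])))   # accepted as (+1, True)
print(lazy_css_linear(X, y, np.array([1.5, 0.5])))   # rejected: (None, False)
```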
Chapter 4
From Selective Classification to Active Learning

"The uncreative mind can spot wrong answers, but it takes a very creative mind to spot wrong questions."
Antony Jay, Writer, 1930 - present

Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, potentially leading to significantly faster learning. In recent years it has been gaining considerable recognition as a vital technique for efficiently implementing inductive learning in many industrial applications where an abundance of unlabeled data exists, and/or in cases where labeling costs are high. In stream-based active learning, which is also referred to as online selective sampling [5, 24], the learner is given an error objective $\epsilon$ and then sequentially receives a stream of unlabeled examples. At each step, after observing an unlabeled example $x$, the learner decides whether or not to request the label of $x$. The learner should terminate the learning process and output a binary classifier whose true error is guaranteed to be at most $\epsilon$ with high probability. The penalty incurred by the learner is the number of label requests made, and this number is called the label complexity.

In this chapter we present an equivalence between active learning and perfect selective classification with respect to "fast rates." Then, by applying our results from Chapter 3 on realizable selective classification, we show that general (non-homogeneous) linear classifiers are actively learnable at exponential (in $1/\epsilon$) label complexity rate when the data distribution is an arbitrary unknown finite mixture of high-dimensional Gaussians. While we obtain exponential label complexity speedup in $1/\epsilon$, we incur exponential slowdown in $d^2$, where $d$ is the problem dimension. By proving a lower bound on the label complexity we show that an exponential slowdown in $d$ is unavoidable in such settings. Finally, we relate our proposed technique to other complexity measures for active learning, including the teaching dimension [43] and Hanneke's disagreement coefficient [48]. Specifically, we are able to upper bound the disagreement coefficient using the characterizing set complexity. This relation opens the possibility of utilizing (our) established characterizing set complexity results in many active learning problems where the disagreement coefficient is used to characterize sample complexity speedups. It is also very useful in the analysis of agnostic selective classification, which is the subject of the following chapter.

4.1 Definitions

We consider the following standard active learning model in a realizable (noise-free) setting. In this model the learner sequentially observes unlabeled instances, $x_1, x_2, \ldots$, sampled i.i.d. from $P(X)$. After receiving each $x_i$, the learning algorithm decides whether or not to request its label $f^*(x_i)$, where $f^* \in F$ is an unknown target hypothesis. Before the start of the game the algorithm is provided with some desired error rate $\epsilon$ and confidence level $\delta$.

Definition 4.1 (label complexity) We say that the learning algorithm actively learned the problem instance $(F, P)$ if at some round it can terminate the learning process, after observing $m$ instances and requesting $k$ labels, and output an hypothesis $f \in F$ whose error satisfies $R(f) \le \epsilon$ with probability of at least $1-\delta$. The quality of the algorithm is quantified by the number $k$ of requested labels, which is called the label complexity. A positive result for a learning problem $(F, P)$ is a learning algorithm that can actively learn this problem for any given $\epsilon$ and $\delta$, and for every $f^*$, with label complexity bounded above by $L(\epsilon, \delta, f^*)$.

Definition 4.2 (actively learnable with exponential rate) If there is a label complexity bound that is $O(\mathrm{polylog}(1/\epsilon))$, we say that the problem is actively learnable at exponential rate.

We conclude with the definition of the disagreement set.

Definition 4.3 (disagreement set [46, 31]) Let $F$ be a hypothesis class and $G \subseteq F$. The disagreement set w.r.t. $G$ is defined as
$$DIS(G) \triangleq \left\{x \in \mathcal{X} : \exists f_1, f_2 \in G \text{ s.t. } f_1(x) \ne f_2(x)\right\}.$$
The agreement set w.r.t. $G$ is $AGR(G) \triangleq \mathcal{X}\setminus DIS(G)$.

4.2 Background

A classical result in standard (passive) supervised learning (in the realizable case) is that any consistent learning algorithm has sample complexity of
$$O\left(\frac{1}{\epsilon}\left(d\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right),$$
where $d$ is the VC-dimension of $F$ (see, e.g., [3]). It appears that the best sample complexity one can hope for in this setting is $O(d/\epsilon)$. The main promise in active learning has been to gain an exponential speedup in the sample complexity.
Specifically, a label complexity bound of d O d log ǫ for actively learning ǫ-good classifier from a concept class with VC-dimension d, provides an exponential speedup in terms of 1/ǫ. The main strategy for active learning in the realizable setting is to request labels only for instances belonging to the disagreement set with respect to the version space, and output any (consistent) hypothesis belonging to the version space. This strategy is often called the CAL algorithm after the names of its inventors: Cohn, Atlas, and Ladner [24]. 60 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 4.3 From coverage bound to label complexity bound In this section we present a reduction from stream-based active learning to perfect selective classification. Particularly, we show that if there exists for F a perfect selective classifier with a fast rejection rate of O(polylog(m)/m), then the CAL algorithm will actively learn F with exponential label complexity rate of O(polylog(1/ǫ)) and visa versa. Lemma 4.1 Let Sm = {(x1 , y1 ), . . . , (xm , ym )} be a sequence of m labeled samples drawn i.i.d. from an unknown distribution P (X), and let Si = {(x1 , y1 ), . . . , (xi , yi )} be the i-prefix of Sm . Then, with probability of at least 1 − δ over random choices of Sm , the following bound holds simultaneously for all i = 1, . . . , m − 1, δ ⌊log2 (i)⌋ Pr {xi+1 ∈ DIS(V SF ,Si )|Si } ≤ 1 − BΦ F, ,2 , log2 (m) where BΦ (F, δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class F, confidence δ, and sample size m. Proof For j = 1, . . . , m, abbreviate DISj , DIS(V SF ,Sj ) and AGRj , AGR(V SF ,Sj ). By definition, DISj = X \ AGRj . By the definitions of a coverage bound and agreement/disagreement sets, with probability of at least 1 − δ over random choices of Sj BΦ (F, δ, j) ≤ Pr{x ∈ AGRj |Sj } = Pr{x 6∈ DISj |Sj } = 1−Pr{x ∈ DISj |Sj }. Applying the union bound we conclude that the following inequality holds simultaneously, with high probability, for t = 0, . . . , ⌊log2 (m)⌋ − 1, δ t ,2 . (4.1) Pr{x2t +1 ∈ DIS2t |S2t } ≤ 1 − BΦ H, log2 (m) For all j ≤ i, Sj ⊆ Si , so DISi ⊆ DISj . Therefore, since the samples in Sm are all drawn i.i.d., for any j ≤ i, Pr {xi+1 ∈ DISi |Si } ≤ Pr {xi+1 ∈ DISj |Sj } = Pr {xj+1 ∈ DISj |Sj } . The proof is established by setting j = 2⌊log2 (i)⌋ ≤ i, and applying inequality (4.1). 61 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Lemma 4.2 (Bernstein’s inequality [57]) Let X1 , . . . , Xn be independent zeromean random variables. Suppose that |Xi | ≤ M almost surely, for all i. Then, for all positive t, ) ( n 2 X t /2 . Xi > t ≤ exp − P n o Pr 2 + M t/3 E X i=1 j Lemma 4.3 Let Zi , i = 1, . . . , m, be independent Bernoulli random variables with success probabilities pi . Then, for any 0 < δ < 1, with probability of at least 1 − δ, r m X 1X 2 1 (Zi − E{Zi }) ≤ 2 ln pi + ln . δ 3 δ i=1 Proof Define Wi , Zi − E{Zi } = Zi − pi . Clearly, E{Wi } = 0, |Wi | ≤ 1, E{Wi2 } = pi (1 − pi ). Applying Bernstein’s inequality (Lemma 4.2) on the Wi , ) ( n 2 X t /2 i ≤ exp − P h Wi > t Pr 2 E Wj + t/3 i=1 t2 /2 P = exp − pi (1 − pi ) + t/3 t2 /2 ≤ exp − P . pi + t/3 Equating the right-hand side to δ and solving for t, we have P 1 t2 /2 = ln pi + t/3 δ ⇐⇒ t2 − t · 2 1 1X pi = 0, ln − 2 ln 3 δ δ and the positive solution of this quadratic equation is r r 1X 2 1 1X 1 1 1 21 pi < ln + 2 ln pi . ln + 2 ln t = ln + 3 δ 9 δ δ 3 δ δ 62 Technion - Computer Science Department - Ph.D. 
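As a quick numerical sanity check (not part of the thesis), the deviation bound of Lemma 4.3 can be compared against simulation. The choice p_i = 1/i below mimics a fast rejection rate; all constants are arbitrary illustrations.

```python
# Monte Carlo check of Lemma 4.3: for independent Bernoulli Z_i with success
# probabilities p_i, with probability >= 1 - delta,
#     sum(Z_i - p_i) <= sqrt(2 * ln(1/delta) * sum(p_i)) + (2/3) * ln(1/delta).
import numpy as np

rng = np.random.default_rng(1)
m, delta, trials = 2_000, 0.05, 20_000
p = 1.0 / np.arange(1, m + 1)
bound = np.sqrt(2 * np.log(1 / delta) * p.sum()) + (2 / 3) * np.log(1 / delta)

deviations = np.array([
    (rng.random(m) < p).sum() - p.sum()        # sum Z_i - sum E[Z_i]
    for _ in range(trials)
])
print("bound:", round(bound, 2))
print("empirical 1-delta quantile:", round(np.quantile(deviations, 1 - delta), 2))
print("violation rate:", (deviations > bound).mean())   # should be well below delta
```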
Thesis PHD-2013-12 - 2013 Lemma 4.4 Let Z1 , Z2 , . . . , Zm be a high order Markov sequence of dependent binary random variables defined in the same probability space. Let X1 , X2 , . . . , Xm be a sequence of independent random variables such that, Pr {Zi = 1|Zi−1 , . . . , Z1 , Xi−1 , . . . , X1 } = Pr {Zi = 1|Xi−1 , . . . , X1 } . Define P1 , Pr {Z1 = 1}, and for i = 2, . . . , m, Pi , Pr {Zi = 1|Xi−1 , . . . , X1 } . Let b1 , b2 . . . bm be given constants independent of X1 , X2 , . . . , Xm 1 . Assume that Pi ≤ bi simultaneously for all i with probability of at least 1 − δ/2, δ ∈ (0, 1). Then, with probability of at least 1 − δ, r m m X X 2X 2 2 bi + 2 ln Zi ≤ bi + ln . δ 3 δ i=1 i=1 We proceed with a direct proof of Lemma 4.4. An alternative proof of this lemma, using super-martingales, appears in Appendix 7.5. Proof For i = 1, . . . , m, let Wi be binary random variables satisfying bi + I(Pi ≤ bi ) · (Pi − bi ) , Pi bi − Pi Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } , max ,0 , 1 − Pi Pr{Wi = 1|Wi−1 , . . . , W1 , Xi−1 , . . . , X1 } = Pr{Wi = 1|Xi−1 , . . . , X1 }. Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } , 1 Precisely we require that each of the bi were fixed before the Xi are chosen 63 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 We notice that Pr{Wi = 1|Xi−1 , . . . , X1 } = Pr{Wi = 1, Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1, Zi = 0|Xi−1 , . . . , X1 } = Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } · Pr{Zi = 1|Xi−1 , . . . , X1 } + Pr{Wi = 1|Zi = 0, Xi−1 , . . . , X1 } · Pr{Zi = 0|Xi−1 , . . . , X1 } ( i −Pi Pi + b1−P (1 − Pi ) = bi , Pi ≤ bi ; i = bi else. Pi · Pi + 0 = bi , Hence, the distribution of each Wi is independent of Xi−1 , . . . , X1 , and the Wi are independent Bernoulli random variables with success probabilities bi . By construction, if Pi ≤ bi , then Z Pr{Wi = 1|Zi = 1, Xi−1 , . . . , X1 } = 1. Pr{Wi = 1|Zi = 1} = X By assumption, Pi ≤ bi for all i simultaneously, with probability of at least 1−δ/2. Therefore, Zi ≤ Wi simultaneously, with probability of at least 1 − δ/2. We now apply Lemma 4.3 on the Wi . The proof is then completed using the union bound. Theorem 4.5 Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P . Then with probability of at least 1 − δ over choices of Sm , the number of label requests k by the CAL algorithm is bounded by r 2 2 2 k ≤ Ψ(F, δ, m) + 2 ln Ψ(F, δ, m) + ln , δ 3 δ where Ψ(F, δ, m) , m X i=1 1 − BΦ F, δ , 2⌊log2 (i)⌋ 2 log2 (m) and BΦ (F, δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class F, confidence δ and sample size m. 64 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof According to CAL, the label of sample xi will be requested iff xi ∈ DIS(V SF ,Si−1 ). For i = 1, . . . , m, let Zi be binary random variables such that Zi , 1 iff CAL requests a label for sample xi . Applying Lemma 4.1 we get that for all i = 2, . . . , m, with probability of at least 1 − δ/2, Pr{Zi = 1|Si−1 } = Pr xi ∈ DIS(V SF ,Si−1 )|Si−1 δ ⌊log2 (i−1)⌋ ≤ 1 − BΦ F, ,2 . 2 log2 (m) For i = 1, BΦ (F, δ, 1) = 0, and the above inequality trivially holds. An application of Lemma 4.4 on the variables Zi completes the proof. Theorem 4.5 states an upper bound on the label complexity expressed in terms of m, the size of the sample provided to CAL. This upper bound is very convenient for directly analyzing the active learning speedup relative to supervised learning. 
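For instance, given any coverage bound BΦ, the quantity Ψ(F, δ, m) and the resulting bound on the number of label requests can be evaluated numerically. The sketch below (not thesis code) uses, as a stand-in, the thresholds-in-R coverage bound derived later in Section 4.4.1, clipped to [0, 1]; the sample sizes and δ are illustrative, and any other coverage bound can be plugged in the same way.

```python
# Numerical evaluation of the label-request bound of Theorem 4.5 for a given
# coverage bound B_Phi (here: the thresholds-in-R bound of Section 4.4.1).
import numpy as np

def b_phi(delta, m):
    # stand-in coverage bound, clipped to [0, 1]
    return np.clip(1.0 - (2.0 / m) * (2.0 * np.log(np.e * m) + np.log(4.0 / delta)), 0.0, 1.0)

def label_request_bound(m, delta):
    log2m = np.log2(m)
    blocks = 2.0 ** np.floor(np.log2(np.arange(1, m + 1)))      # 2^{floor(log2 i)}
    # Psi(F, delta, m) = sum_i [1 - B_Phi(F, delta / (2 log2 m), 2^{floor(log2 i)})]
    psi = np.sum(1.0 - b_phi(delta / (2.0 * log2m), blocks))
    # k <= Psi + sqrt(2 Psi ln(2/delta)) + (2/3) ln(2/delta)
    return psi + np.sqrt(2.0 * psi * np.log(2.0 / delta)) + (2.0 / 3.0) * np.log(2.0 / delta)

for m in [10**3, 10**4, 10**5, 10**6]:
    print(m, round(label_request_bound(m, delta=0.05), 1))      # grows polylogarithmically in m
```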
A standard label complexity upper bound, which depends on 1/ǫ, can be extracted using the following simple observation. Lemma 4.6 ([48, 3]) Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an unknown distribution P . Let F be a hypothesis class whose finite VC dimension is d, and let ǫ and δ be given. If 2 12 4 + ln d ln , m≥ ǫ ǫ δ then, with probability of at least 1−δ, CAL will output a classifier whose true error is at most ǫ. Proof Hanneke [48] observed that since CAL requests a label whenever there is a disagreement in the version space, it is guaranteed that after processing m examples, CAL will output a classifier that is consistent with all the m examples introduced to it. Therefore, CAL is a consistent learner. A classical result [3, Thm. 4.8] is that any consistent learner will achieve, with probability of at least 1 − δ, a true error not exceeding ǫ after observing at most 4 2 12 + ln d ln ǫ ǫ δ labeled examples. 65 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 In the next two theorems we prove that a fast coverage rate in CSS implies exponential label complexity in CAL (Theorem 4.7) and visa versa ( Theorem 4.8). Theorem 4.7 Let F be a hypothesis class whose finite VC dimension is d. If the rejection rate of CSS (see Definition 3.1) is ! polylog m δ O , m then (F, P ) is actively learnable with exponential label complexity speedup. Proof Plugging this rejection rate into Ψ (defined in Theorem 4.5) we have, m X δ ⌊log2 (i)⌋ Ψ(F, δ, m) , 1 − BΦ (F, ,2 ) log2 (m) i=1 m polylog i log(m) X δ . O = i i=1 Applying Lemma B.4 we get Ψ(F, δ, m) = O polylog By Theorem 4.5, k = O polylog cludes the proof. m δ m log(m) δ . , and an application of Lemma 4.6 con- Theorem 4.8 Assume (F, P ) has a passive learning sample complexity of Ω (1/ǫ) and it is actively learnable by CAL with exponential label complexity speedup. Then the rejection rate of CSS is O (polylog(m)/m) . Proof Let Γ(Sm ) be the total number of label requests made by CAL after processing the training sample Sm . Let δ1 be given. By assumption, (F, P ) is actively learnable by CAL with exponential label complexity speedup. Therefore, there exist constants, c1 (δ1 ) and c2 (δ1 ), such that for any m, with probability of at least 1 − δ1 over random choices of Sm , Γ(Sm ) ≤ c1 logc2 m. 66 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Let A , {Sm | Γ(Sm ) ≤ max (c1 logc2 (m), c1 logc2 (m − 1) + 1)} , and B , {Sm−1 | Γ(Sm−1 ) ≤ c1 logc2 (m − 1)} . Define Zi = Zi (Sm ) to be a binary indicator variable that equals 1 if CAL requests a label at step i, and 0 otherwise. Consider a sample sm = (x1 , x2 , . . . , xm ). For i > 1, assume that Zi = 1 and for some j < i, Zj = 0. Let s′m be a sample identical to sm with the exception that examples xi and xj are interchanged. Clearly, Zj (s′m ) = 1 and Zi (s′m ) = 0, Γ(s′m ) ≤ Γ(sm ), and P (sm ) = P (s′m ). Therefore, if sm ∈ A, then also s′m ∈ A, and it follows that for any j < i, Z Z Zi (sm )dP (sm ). Zj (sm )dP (sm ) ≥ A A Therefore, Z A Z X m Z m 1 X 1 Zi (sm )dP (sm ) Zi (sm )dP (sm ) = m m A A i=1 i=1 Z 1 dP (sm ) · max (c1 logc2 (m), c1 logc2 (m − 1) + 1) ≤ m Sm ∈A c1 logc2 (m) c1 logc2 (m − 1) + 1 ≤ max , . (4.2) m m Zm (sm )dP (sm ) ≤ Decompose the sample sm to a sample of size m − 1 (denoted by sm−1 ) containing the first m − 1 examples plus the final example xm , i.e., sm = (sm−1 , xm ). By definition, Γ(sm ) ≤ Γ(sm−1 ) + 1, and it follows that if sm−1 ∈ B, then sm ∈ A regardless of xm . 
We thus have, Z Z Z Z ∆V SF ,sm−1 dP (sm−1 ). Zm (sm , xm )dP (xm ) dP (sm−1 ) = Zm (sm )dP (sm ) ≥ A B B xm 67 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 By assumption, Pr(B) , Pr{Sm ∈ B} ≥ 1 − δ1 , so R Z ∆V SF ,sm−1 dP (sm−1 ) Zm (sm )dP (sm ) ≥ B · (1 − δ1 ). Pr(B) A (4.3) Combining (4.2) and (4.3) we get c1 logc2 (m) c1 logc2 (m − 1) + 1 E ∆V SF ,Sm−1 |B ≤ max , , f (m).(4.4) m(1 − δ1 ) m(1 − δ1 ) Using Markov inequality we get f (m) f (m) = Pr ∆V SF ,Sm−1 > Pr ∆V SF ,Sm−1 > δ2 δ2 f (m) + Pr ∆V SF ,Sm−1 > δ2 f (m) ≤ Pr ∆V SF ,Sm−1 > δ2 E ∆V SF ,Sm−1 |B ≤ + δ1 f (m)/δ2 ≤ δ1 + δ2 , | B | B̄ | B · Pr(B) · Pr(B̄) + Pr{B̄} where B̄ is the complement set of B. Choosing δ1 = δ2 = δ/2 completes the proof. 4.4 A new technique for upper bounding the label complexity In this section we present a novel technique for deriving target-independent label complexity bounds for active learning. The technique combines the reduction of Theorem 4.5 and the general data-dependent coverage bound for selective classification of Theorem 3.10. For some learning problems it is a straightforward technical exercise, involving VC-dimension calculations, to arrive with exponential label complexity bounds. We show a few applications of this technique resulting in both reproductions of known label complexity exponential rates as well as a new one. We recall that γ(F, n̂) is the characterizing set complexity of hypothesis class F with respect to the version space compression set size n̂. Given an hypothesis 68 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 class F, our recipe to deriving active learning label complexity bounds for F is: (i) calculate both n̂ and γ (F, n̂); (ii) apply Theorem 3.10, obtaining a bound BΦ for the coverage; (iii) plug BΦ in Theorem 4.5 to get a label complexity bound expressed as a summation; (iv) Apply Lemma B.4 to obtain a label complexity bound in a closed form. 4.4.1 Linear separators in R In the following example we derive a label complexity bound for the concept class of thresholds (linear separators in R). Although this is a toy example (for which an exponential rate is well known) it does exemplify the technique, and in many other cases the application of the technique is not much harder. Let F be the class of thresholds. We first show that the corresponding version space compression set size n̂ ≤ 2. Assume w.l.o.g. that f ∗ (x) , I(x > w) for some w ∈ (0, 1). Let x− , max{xi ∈ Sm |yi = −1} and x+ , min(xi ∈ Sm |yi = +1). At least one ′ = {(x , −1), (x , +1)}. Then V S ′ , of x− or x+ exist. Let Sm − + F ,Sm = V SF ,Sm ′ and n̂ = |Sm | ≤ 2. Now, γ (F, 2) = 2, as shown in Section 2.3. Plugging these numbers in Theorem 3.10, and using the assignment a1 = a2 = 1/2, 2 ln (m/δ) 4 BΦ (F, δ, m) = 1 − 2 ln (em) + ln =1−O . m δ m Next we plug BΦ in Theorem 4.5 obtaining a raw label complexity m X δ , 2⌊log 2 (i)⌋ Ψ(F, δ, m) = 1 − BΦ F, 2 log2 (m) i=1 m X ln (log2 (m) · i/δ) O = . i i=1 Finally, by applying Lemma B.4, with a = 1 and b = log2 m/δ, we conclude that m Ψ(F, δ, m) = O ln2 . δ Thus, F is actively learnable with exponential speedup, and this result applies to any distribution. In Table 4.1 we summarize the n̂ and γ (F, n̂) values we calculated for four other hypothesis classes. The last two cases are fully analyzed in Sections 4.4.2 and 4.6.1, respectively. For the other classes, where γ and n̂ are con- 69 Technion - Computer Science Department - Ph.D. 
stants, it is clear (Theorem 3.10) that exponential rates are obtained. We emphasize that the bounds for these two classes are target-dependent as they require that Sm include at least one sample from each class.

Hypothesis class                            Distribution                                                n̂                     γ(F, n̂)
Linear separators in R                      any                                                         2                     2
Intervals in R                              any (target-dependent*)                                     4                     4
Linear separators in R^2                    any distribution on the unit circle (target-dependent*)     4                     4
Linear separators in R^d                    mixture of Gaussians                                        O((log m)^{d-1}/δ)    O(n̂^{d/2+1})
Balanced axis-aligned rectangles in R^d     product distribution                                        O(log(dm/δ))          O(d·n̂·log n̂)

Table 4.1: Calculated n̂ and γ for various hypothesis classes achieving exponential rates. (*) With at least one sample in each class.

4.4.2 Linear separators in Rd under mixture of Gaussians

In this section we state and prove our main example, an exponential label complexity bound for linear classifiers in Rd.

Theorem 4.9 Let F be the class of all linear binary classifiers in Rd, and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd. Then, with probability of at least 1 − δ over choices of Sm, the number of label requests k by CAL is bounded by

k = O( (log m)^{d²+1} / δ^{(d+3)/2} ).

Therefore, by Lemma 4.6 we have k = O( poly(1/δ) · polylog(1/ǫ) ).

Proof According to Corollary 3.15, the following distribution-dependent coverage bound holds in our setting with probability of at least 1 − δ,

Φ(f, g) ≥ 1 − O( (log m)^{d²} / m · 1/δ^{(d+3)/2} ).     (4.5)

Plugging this bound in Theorem 4.5 we obtain,

Ψ(F, δ, m) = Σ_{i=1}^{m} [ 1 − BΦ( F, δ/(2 log₂(m)), 2^{⌊log₂(i)⌋} ) ]
           = Σ_{i=1}^{m} O( (2 log₂(m)/δ)^{(d+3)/2} · (log i)^{d²} / i )
           = O( (2 log₂(m)/δ)^{(d+3)/2} · Σ_{i=1}^{m} (log i)^{d²} / i ).

Finally, an application of Lemma B.4 with a = d² and b = 1 completes the proof.

4.5 Lower bound on label complexity

In the previous section we derived an upper bound on the label complexity of CAL for various classifiers and distributions. In the case of linear classifiers in Rd we have shown an exponential speedup in terms of 1/ǫ but also an exponential slowdown in terms of the dimension d. In passive learning there is a linear dependency on the dimension, while in our case (active learning using CAL) the dependency is exponential. Is this an artifact of our bounding technique or a fundamental phenomenon? To answer this question we derive an asymptotic lower bound on the label complexity. We show that the exponential dependency on d is unavoidable (at least asymptotically) for any bounding technique when considering linear classifiers, even under a single (isotropic) Gaussian distribution. The argument follows from the observation that CAL has to request a label for any point on the convex hull of the sample Sm. The bound is obtained using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution.

Definition 4.4 (Gaussian polytope) Let X1, ..., Xm be i.i.d. random points in Rd with common standard normal distribution (with zero mean and covariance matrix ½·Id). A Gaussian polytope Pm is the convex hull of these random points.

Denote by fk(Pm) the number of k-faces in the Gaussian polytope Pm. Note that f0(Pm) is the number of vertices in Pm. The following two theorems asymptotically bound the average and variance of fk(Pm).

Theorem 4.10 ([58], Theorem 1.1) Let X1, ..., Xm be i.i.d.
random points in Rd with common standard normal distribution. Then, Efk (Pm ) = c(k,d) (log m) d−1 2 · (1 + o(1)) as m → ∞, where c(k,d) is a constant depending only on k and d. Theorem 4.11 ([59], Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution. Then there exists a positive constant cd , depending only on the dimension, such that Var (fk (Pm )) ≤ cd (log m) d−1 2 , for all k ∈ {0, . . . , d − 1}. We can now use Chebyshev’s inequality to lower bound the number of vertices in Pm (f0 (Pm )) with high probability. Theorem 4.12 Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution and δ > 0 be given. Then with probability of at least 1 − δ, d−1 d−1 c̃d 2 4 − √ (log m) · (1 + o(1)), f0 (Pm ) ≥ cd (log m) δ as m → ∞, where cd and c̃d are constants depending only on d. Proof Using Chebyshev’s inequality (in the second inequality), as well as Theorem 4.11, we obtain, Pr (f0 (Pm ) > Ef0 (Pm ) − t) = 1 − Pr (f0 (Pm ) ≤ Ef0 (Pm ) − t) ≥ 1 − Pr (|f0 (Pm ) − Ef0 (Pm )| ≥ t) d−1 Var (f0 (Pm )) cd ≥ 1− ≥ 1 − 2 (log m) 2 . 2 t t Equating the RHS to 1 − δ and solving for t we get s d−1 (log m) 2 t = cd . δ 72 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Applying Theorem 4.10 completes the proof. Theorem 4.13 (lower bound) Let F be the class of linear binary classifiers in Rd , and let the underlying distribution be standard normal distribution in Rd . Then there exists a target hypothesis such that, with probability of at least 1 − δ over choices of Sm , the number of label requests k by CAL is bounded by k≥ d−1 cd (log m) 2 · (1 + o(1)), 2 as m → ∞, where cd is a constant depending only on d. Proof Let us look at the Gaussian polytope Pm induced by the random sample Sm . As long as all labels requested by CAL have the same value (the case of minuscule minority class) we note that every vertex of Pm falls in the region of disagreement with respect to any subset of Sm that do not include that specific vertex. Therefore, CAL will request label at least for each vertex of Pm . For sufficiently large m, in particular, 4 2c̃d d−1 √ , log m ≥ cd δ we conclude the proof by applying Theorem 4.12. 4.6 Relation to existing label complexity measures A number of complexity measures to quantify the speedup in active learning have been proposed. In this section we show interesting relations between our techniques and two well known measures, namely the teaching dimension [43] and the disagreement coefficient [48]. Considering first the teaching dimension, we prove in Lemma 4.15 that the version space compression set size is bounded above, with high probability, by the extended teaching dimension growth function (introduced by Hanneke [47]). Consequently, it follows that perfect selective classification with meaningful coverage can be achieved for the case of axis-aligned rectangles under a product distribution. We then focus on Hanneke’s disagreement coefficient and show in Theorem 4.18 that the coverage of CSS can be bounded below using the disagreement coefficient. Conversely, in Corollary 4.23 we show that the disagreement coefficient can be bounded above using any coverage bound for CSS. Consequently, the results here 73 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 imply that the disagreement coefficient, θ(r0 ) grows slowly with 1/r0 for the case of linear classifiers under a mixture of Gaussians. 
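Before turning to these existing measures, the geometric fact behind the lower bound of Section 4.5 is easy to observe numerically: the number of vertices of the convex hull of a standard normal sample (each of which CAL must label when all observed labels agree) grows only polylogarithmically in m but rapidly with d. The following sketch (not from the thesis; sample sizes, the dimension, and the number of repetitions are arbitrary) counts hull vertices with SciPy.

```python
# Counting vertices of a Gaussian polytope: the quantity f_0(P_m) whose growth
# rate (Theorem 4.10) drives the label complexity lower bound of Theorem 4.13.
import numpy as np
from scipy.spatial import ConvexHull

def hull_vertex_count(m, d, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((m, d))       # i.i.d. standard normal sample in R^d
    return len(ConvexHull(points).vertices)

d = 3
for m in [1_000, 10_000, 100_000]:
    counts = [hull_vertex_count(m, d, seed=s) for s in range(3)]
    # compare against the (log m)^{(d-1)/2} growth rate, up to constants
    print(m, np.mean(counts), (np.log(m)) ** ((d - 1) / 2))
```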
4.6.1 Teaching dimension The teaching dimension is a label complexity measure proposed by Goldman and Kearns [43]. The dimension of the hypothesis class F is the minimum number of examples required to present to any consistent learner in order to uniquely identify any hypothesis in the class. We now define the following variation of the extended teaching dimension [54] due to Hanneke. Throughout we use the notation f1 (S) = f2 (S) to denote the fact that the two hypotheses agree on the classification of all instances in S. Definition 4.5 (Extended Teaching Dimension, [54, 47]) Let V ⊆ F, m ≥ 0, U ∈ X m . For any f ∈ F, XT D(f, V, U ) , inf t | ∃R ⊆ U : | f ′ ∈ V : f ′ (R) = f (R) | ≤ 1 ∧ |R| ≤ t . Definition 4.6 ([47]) For V ⊆ F, V [Sm ] denotes any subset of V such that ∀f ∈ V, | f ′ ∈ V [Sm ] : f ′ (Sm ) = f (Sm ) | = 1. Claim 4.14 Let Sm be a sample of size m, F an hypothesis class, and n̂ = n(F, Sm ), the version space compression set size. Then, XT D(f ∗ , F[Sm ], Sm ) = n̂. Proof Let Sn̂ ⊆ Sm be a version space compression set. Assume, by contradiction, that there exist two hypotheses f1 , f2 ∈ F[Sm ], each of which agrees on the given classifications of all examples in Sn̂ . Therefore, f1 , f2 ∈ V SF ,Sn̂ , and by the definition of version space compression set, we know that f1 , f2 ∈ V SF ,Sm . Hence, | {f ∈ F[Sm ] : f (Sm ) = f ∗ (Sm )} | ≥ 2, which contradicts definition 4.6. Therefore, | {f ∈ F[Sm ] : f (Sn̂ ) = f ∗ (Sn̂ )} | ≤ 1, 74 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 and XT D(f ∗ , F[Sm ], Sm ) ≤ |Sn̂ | = n̂. Let R ⊂ Sm be any subset of size |R| < n̂. Consequently, V SF ,Sm ⊂ V SF ,R , and there exists an hypothesis, f ′ ∈ V SF ,R , that agrees with all labeled examples in R, but disagrees with at least one example in Sm . Thus, f ′ (Sm ) 6= f ∗ (Sm ), and according to definition 4.6, there exist hypotheses f1 , f2 ∈ F[Sm ] such that f1 (Sm ) = f ′ (Sm ) 6= f ∗ (Sm ) = f2 (Sm ). But f1 (R) = f2 (R) = f ∗ (R), so | {f ∈ V [Sm ] : f (R) = f ∗ (R)} | ≥ 2. It follows that XT D(h∗ , F[Sm ], Sm ) ≥ n̂. Definition 4.7 (XTD Growth Function, [47]) For m ≥ 0, V ⊆ F, δ ∈ [0, 1], XT D(V, P, m, δ) = inf {t|∀f ∈ F, P r {XT D(f, V [Sm ], Sm ) > t} ≤ δ} . Lemma 4.15 Let F be an hypothesis class, P an unknown distribution, and δ > 0. Then, with probability of at least 1 − δ, n̂ ≤ XT D(F, P, m, δ). Proof According to Definition 4.7, with probability of at least 1 − δ, XT D(f ∗ , F[Sm ], Sm ) ≤ XT D(F, P, m, δ). Applying Claim 4.14 completes the proof. Lemma 4.16 (Balanced Axis-Aligned Rectangles, [47], Lemma 4) If P is a product distribution on Rd with continuous CDF, and F is the set of axis-aligned rectangles such that ∀f ∈ F, P rX∼P {f (X) = +1} ≥ λ, then, XT D(F, P, m, δ) ≤ O 75 dm d2 log λ δ . Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Corollary 4.17 (Balanced Axis-Aligned Rectangles) Under the same conditions of Lemma 4.16, the class of balanced axis-aligned rectangles in Rd can be perfectly selectively learned with fast coverage rate. Proof Applying Lemmas 4.15 and 4.16 we get that with probability of at least 1 − δ, 2 d dm n̂ ≤ O log . λ δ Any balanced axis-aligned rectangle belongs to the class of all axis-aligned rectangles. Therefore, the coverage of CSS for the class of balanced axis-aligned rectangles is bounded bellow by the coverage of the class of axis-aligned rectangles. Applying Theorem 2.2, and assuming m ≥ d, we obtain, 2 3 2 d d dm dm dm d log log log2 ≤O . 
γ (H, n̂) ≤ O d log λ δ λ δ λ λδ Applying Theorem 3.10 completes the proof. 4.6.2 Disagreement coefficient In this section we show interesting relations between the disagreement coefficient and coverage bounds in perfect selective classification. We begin by defining, for an hypothesis f ∈ F, the set of all hypotheses that are r-close to f . Definition 4.8 ([49, p.337]) For any hypothesis f ∈ F, distribution P over X , and r > 0, define the set B(f, r) of all hypotheses that reside in a ball of radius r around f , ′ ′ B(f, r) , f ∈ F : Pr f (X) 6= f (X) ≤ r . X∼P Let 2 2em 2 + ln d ln . η(d, m, δ) , m d δ Definition 4.9 For any G ⊆ F, and distribution P , we denote by ∆G the volume of the disagreement set (see Definition 4.3) of G, ∆G , Pr {DIS(G)} . 76 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Definition 4.10 (disagreement coefficient [48]) Let r0 ≥ 0. The disagreement coefficient of the hypothesis class F with respect to the target distribution P is θ(r0 ) , θf ∗ (r0 ) = sup r>r0 ∆B(f ∗ , r) . r The following theorem formulates an intimate relation between active learning (disagreement coefficient) and selective classification. Theorem 4.18 Let F be an hypothesis class with VC-dimension d, P an unknown distribution, r0 ≥ 0, and θ(r0 ), the corresponding disagreement coefficient. Let (f, g) be a selective classifier chosen by CSS. Then, R(f, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ, Φ(f, g) ≥ 1 − θ(r0 ) · max {η(d, m, δ), r0 } . Proof Clearly, R(h, g) = 0, and it remains to prove the coverage bound. By Theorem 3.9, with probability of at least 1 − δ, ∀f ∈ V SF ,Sm R(f ) ≤ η(d, m, δ) ≤ max {η(d, m, δ), r0 } . Therefore, V SF ,Sm ⊆ B (f ∗ , max {η(d, m, δ), r0 }) ∆V SF ,Sm ≤ ∆B (f ∗ , max {η(d, m, δ), r0 }) . (4.6) By Definition 4.10, for any r ′ > r0 , ∆B(f ∗ , r ′ ) ≤ θ(r0 )r ′ . (4.7) Thus, the proof is complete by recalling that Φ(f, g) = 1 − ∆V SF ,Sm . Theorem 4.18 tells us that whenever our learning problem (specified by the pair (F, P )) has a disagreement coefficient that grows slowly with respect to 1/r0 , it can be (perfectly) selectively learned with a “fast” coverage bound. Consequently, 77 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 through Theorem 4.7 we also know that in each case where there exists a disagreement coefficient that grows slowly with respect to 1/r0 , active learning with a fast rate can also be deduced directly through a reduction from perfect selective classification. It follows that as far as fast rates in active learning are concerned, whatever can be accomplished by bounding the disagreement coefficient, can be accomplished also using perfect selective classification. This result is summarized in the following corollary. Corollary 4.19 Let F be an hypothesis class with VC-dimension d, P an unknown distribution, and θ(r0 ), the corresponding disagreement coefficient. If θ(r0 ) = O(polylog(1/r0 )), there exists a coverage bound such that an application of Theorem 4.5 ensures that (F, P ) is actively learnable with exponential label complexity speedup. Proof The proof is established by straightforward applications of Theorems 4.18 with r0 = 1/m and 4.7. The following result, due to Hanneke [51], implies a coverage upper bound for CSS. Lemma 4.20 ([51], Proof of Lemma 47) Let F be an hypothesis class, P an unknown distribution, and r ∈ (0, 1). Then, EP ∆Dm ≥ (1 − r)m ∆B (f ∗ , r) , where Dm , V SF ,Sm ∩ B (f ∗ , r) . 
(4.8) Theorem 4.21 (coverage upper bound) Let F be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1). Then, for any r ∈ (0, 1), 1 > α > δ, BΦ (F, δ, m) ≤ 1 − (1 − r)m − α ∆B (f ∗ , r) , 1−α where BΦ (F, δ, m) is any coverage bound. Proof Recalling the definition of Dm (4.8), clearly Dm ⊆ V SF ,Sm and Dm ⊆ B(f ∗ , r). These inclusions imply (respectively), by the definition of disagreement 78 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 set, ∆Dm ≤ ∆V SF ,Sm , and ∆Dm ≤ ∆B(f ∗ , r). (4.9) Using Markov’s inequality (in inequality (4.10) of the following derivation) and applying (4.9) (in equality (4.11)), we thus have, (1 − r)m − α ∗ Pr ∆V SF ,Sm ≤ ∆B (f , r) 1−α (1 − r)m − α ∗ ≤ P r ∆Dm ≤ ∆B (f , r) 1−α 1 − (1 − r)m ∗ ∗ = P r ∆B (f , r) − ∆Dm ≥ ∆B (f , r) 1−α 1 − (1 − r)m ∗ ∗ ≤ P r |∆B (f , r) − ∆Dm | ≥ ∆B (f , r) 1−α E {|∆B (f ∗ , r) − ∆Dm |} ≤ (1 − α) · (4.10) (1 − (1 − r)m ) ∆B (f ∗ , r) ∆B (f ∗ , r) − E∆Dm (4.11) = (1 − α) · (1 − (1 − r)m ) ∆B (f ∗ , r) Applying Lemma 4.20 we therefore obtain, ≤ (1 − α) · ∆B (f ∗ , r) − (1 − r)m ∆B(f ∗ , r) = 1 − α < 1 − δ. (1 − (1 − r)m ) ∆B (f ∗ , r) Observing that for any coverage bound, P r {∆V SF ,Sm ≤ 1 − BΦ (F, δ, m)} ≥ 1 − δ, completes the proof. Corollary 4.22 Let F be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1/8). Then for any m ≥ 2, 1 ∗ 1 , BΦ (F, δ, m) ≤ 1 − ∆B f , 7 m where BΦ (F, δ, m) is any coverage bound. 79 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof The proof is established by a straightforward application of Theorem 4.21 with α = 1/8 and r = 1/m. With Corollary 4.22 we can bound the disagreement coefficient for settings whose coverage bound is known. Corollary 4.23 Let F be an hypothesis class, P an unknown distribution, and BΦ (F, δ, m) a coverage bound. Then the disagreement coefficient is bounded by, ( ) 1 − BΦ (F, 1/9, ⌊1/r⌋) θ(r0 ) ≤ max sup 7 · ,2 r r∈(r0 ,1/2) Proof Applying Corollary 4.22 we get that for any r ∈ (0, 1/2), ∆B(f ∗ , 1/⌊1/r⌋) 1 − BΦ (F, 1/9, ⌊1/r⌋) ∆B(f ∗ , r) ≤ ≤7· . r r r Therefore, ∆B(f ∗ , r) θ(r0 ) = sup ≤ max r r>r0 ( ) 1 − BΦ (F, 1/9, ⌊1/r⌋) sup 7 · ,2 r r∈(r0 ,1/2) Specifically we can bound the disagreement coefficient for the case of linear classifiers in Rd when the underlying distribution is an arbitrary unknown finite mixture of high dimensional Gaussians. Corollary 4.24 Let F be the class of all linear binary classifiers in Rd , and let the underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then 1 θ(r0 ) ≤ O polylog . r0 Proof Applying Corollary 4.23 together with inequality 4.5 we get that ( ) 1 − BΦ (F, 1/9, ⌊1/r⌋) θ(r0 ) ≤ max sup 7 · ,2 r r∈(r0 ,1/2) ! ) ( 2! 2 d+3 (log ⌊1/r⌋)d 1 d 7 ·O ·9 2 . ,2 ≤ O log ≤ max sup ⌊1/r⌋ r0 r∈(r0 ,1/2) r 80 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 The above Corollary is true for any hypothesis class and source distribution for which the characterizing set complexity grows slowly with respect to the sample size m. Theorem 4.25 Let F be an hypothesis class, Sm a training sample, and n̂ the corresponding version space compression set size. If γ(F, n̂) = O(polylog(m)) then 1 . θ(r0 ) = O polylog r0 Proof According to Theorem 3.10 for ai = 1/m we get 1 m m BΦ (F, δ, m) = 1 − O polylog(m) · log + polylog m polylog(m) δ m 1 . = 1 − O polylog m δ Applying Corollary 4.23 we get ( ) polylog (⌊9/r⌋) ,2 ⌊1/r⌋ 1 = sup O (polylog (⌊1/r⌋)) = O polylog . 
r 0 r∈(r0 ,1/2) 1 ·O θ(r0 ) ≤ max sup r∈(r0 ,1/2) r We conclude with the following Corollary specifying necessary condition for active learning with exponential label complexity speedup. Corollary 4.26 (necessary condition) Let θ(r0 ) be the disagreement coefficient for the learning problem (F, P ). If (F, P ) has a passive learning sample complexity of Ω(1/ǫ) and it is actively learnable by CAL with exponential label complexity speedup, then 1 θ(r0 ) ≤ O polylog . r0 81 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof Assume that the learning problem (F, P ) is actively learnable with exponential label complexity speedup. Then according to Theorem 4.8, the rejection rate of CSS is O(polylog(m)/m). Therefore there is coverage bound for CSS satisfying polylog(m) BΦ (F, δ, m) = 1 − O . m Applying Corollary 4.23 we get that ( ) 1 − BΦ (F, 1/9, ⌊1/r⌋) θ(r0 ) ≤ max sup 7 · ,2 r r∈(r0 ,1/2) ( ) polylog(⌊1/r⌋) 1 ≤ max sup , 2 = O polylog . ⌊1/r⌋ · r r0 r∈(r0 ,1/2) 4.7 Agnostic active learning label complexity bounds So far we were able to upper bound the disagreement coefficient by the characterizing set complexity. We note that the disagreement coefficient depends only on the hypothesis class F and the marginal distribution P (X). It does not depend on the conditional distribution P (Y |X) at all. Therefore, although the upper bound was derived using realizable selective classification results, it is applicable for the agnostic case as well! In this section we derive label complexity bounds for the well known agnostic active learning algorithms A2 [8] and RobustCALδ [50], using the characterizing set complexity. It should be emphasized that no explicit label complexity bounds were known for these well studied algorithms except for a number of simple settings (see discussion below). 4.7.1 Label complexity bound for A2 A2 (Agnostic Active) was the first general-purpose agnostic active learning algorithm with proven improvement in error guarantees compared to passive learning. It has been proven that this algorithm, originally introduced by Balcan et. al. [8], achieves exponential label complexity speedup (for the low accuracy regime) compared to passive learning for few simple cases including: threshold functions and homogenous linear separators under uniform distribution over the sphere [9]. In 82 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 this section we substantially extend these results and prove that exponential label complexity speedup (for the low accuracy regime) can be accomplished also for linear classifiers under a fixed mixture of Gaussians. Let η be the error of the best hypothesis in F, namely η , R(f ∗ ). Theorem 4.27 ([46, Theorem 2]) Let F be an hypothesis class with VC dimension d. If θ(r0 ) is the disagreement coefficient for F, then with probability of at least 1 − δ, given the inputs F, ǫ, and δ, A2 outputs fˆ ∈ F with R(fˆ) ≤ η + ǫ, and the number of label requests made by A2 is at most 2 η 1 1 1 O θ(η + ǫ) 2 + 1 d log + log log . ǫ ǫ δ ǫ Theorem 4.28 Let F be the class of linear classifiers in Rd and P a mixture of a fixed number of Gaussians. For any c > 0, ǫ < 21 , and ǫ ≥ ηc , algorithm A2 makes 1 1 O polylog d + 1 + log ǫ δ label requests on examples drawn i.i.d. from P , with probability 1 − δ. Proof Using Corollary 4.24 we get 1 1 ≤ O polylog . θ(η + ǫ) ≤ O polylog η+ǫ ǫ We note that the VC dimension of the class of linear classifiers in Rd is d + 1 and that ǫ ≥ ηc by assumption. 
Applying Theorem 4.28 we get that the number of label requests made by A2 is bounded by 2 η 1 1 1 1 +1 (d + 1) log + log log O polylog ǫ ǫ2 ǫ δ ǫ 1 1 1 1 ≤ O polylog c2 + 1 (d + 1) log + log log ǫ ǫ δ ǫ 1 1 = O polylog d + 1 + log . ǫ δ 83 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 4.7.2 Label complexity bound for RobustCALδ Motivated by the CAL algorithm, Hanneke introduced a new agnostic active learning algorithm called RobustCALδ [51]. RobustCALδ exhibits asymptotic label complexity speedup for some favorable distributions that comply with the following condition. Condition 4.29 ([50]) For some a ∈ [1, ∞) and α ∈ [0, 1], for every f ∈ F, Pr(f (X) 6= f ∗ (X)) ≤ a(R(f ) − R(f ∗ ))α . Theorem 4.30 ([50, Theorem 5.4]) For any δ ∈ (0, 1), RobustCALδ achieves a label complexity Λ such that, for any distribution P , for a and α as in Condition 4.29, ∀ǫ ∈ (0, 1), 2−2α 1 log(a/ǫ) 1 α Λ ≤ a θ(aǫ ) d log(θ(aǫ )) + log log . ǫ δ ǫ 2 α Theorem 4.31 Let F be the class of linear classifiers in Rd and P a mixture of a fixed number of Gaussians satisfying Condition 4.29 with α = 1. Then for any ǫ < 1, RobustCALδ makes log(a/ǫ) 1 d + 1 + log O polylog ǫ δ label requests on examples drawn i.i.d. from P , with probability 1 − δ. Proof Using Corollary 4.24 and noting that a ≥ 1, α = 1 we get 1 1 α θ(aǫ ) ≤ O polylog ≤ O polylog . aǫ ǫ We note that the VC dimension of the class of linear classifiers in Rd is d + 1. Applying Theorem 4.30 completes the proof. Theorem 4.31 proves an exponential label complexity speedup compared to passive learning, for which there is a lower bound on label complexity of Ω(1/ǫ) [50]. 84 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Remark 4.32 Condition 4.29 in Theorem 4.31 can be satisfied with α = 1 if the Bayes optimal classifier is linear and the source distribution satisfies Massart noise [68], Pr (|P (Y = 1|X = x) − 1/2| < 1/(2a)) = 0. For example, if the data was generated by some unknown linear hypothesis with label noise (probability to flip any label) of up to (a − 1)/2a, then P satisfies the requirements of Theorem 4.31. 85 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Chapter 5 Agnostic Selective Classification “ The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. ” Stephen Hawking, 1942 - present In Chapter3, which considered a sterile, noiseless (realizable) setting, we have seen that perfect classification with guaranteed coverage is achievable for finite hypothesis spaces for any source distribution. Furthermore, data-dependent and distribution-dependent coverage bounds are available for infinite hypothesis spaces. It will come as no surprise that in the agnostic case, where noise is present, perfect classification is impossible. In general, in the worst case no hypothesis can achieve zero error over any nonempty subset of the domain. Therefore our goal is to find a pointwise competitive selective classifier (see Definition 1.5). In this chapter we show that pointwise competitiveness is achievable, under reasonable conditions, by a learning strategy termed low error selective strategy (LESS), which naturally extends CSS to noisy environments. We derive coverage bounds for LESS and show that the characterizing set complexity can be effectively used in this case as well. 86 Technion - Computer Science Department - Ph.D. 
5.1 Definitions

Definition 5.1 (low error set) For any hypothesis class F, target hypothesis f ∈ F, distribution P, sample Sm, and real r > 0, define

V(f, r) ≜ { f′ ∈ F : R(f′) ≤ R(f) + r }   and   V̂(f, r) ≜ { f′ ∈ F : R̂(f′) ≤ R̂(f) + r },

where R(f′) is the true risk of hypothesis f′ with respect to the source distribution P, and R̂(f′) is the empirical risk with respect to the sample Sm.

Definition 5.2 (ball in F [48]) For any f ∈ F we define a ball in F of radius r around f. Specifically, with respect to class F, marginal distribution P over X, f ∈ F, and real r > 0, define

B(f, r) ≜ { f′ ∈ F : Pr_{X∼P}{ f′(X) ≠ f(X) } ≤ r }.

5.2 Low Error Selective Strategy (LESS)

We now present a strategy that will be shown later to achieve non-trivial pointwise competitive selective classification under certain conditions. We call it a “strategy” rather than an “algorithm” because it does not include implementation details.

We begin with some motivation. Using standard concentration inequalities one can show that the training error of the true risk minimizer, f∗, cannot be “too far” from the training error of the empirical risk minimizer, fˆ. Therefore, we can guarantee, with high probability, that the subset of all hypotheses with “sufficiently low” empirical error includes the true risk minimizer f∗. Selecting only the part of the domain on which all hypotheses in that subset agree is then sufficient to guarantee pointwise competitiveness. Algorithm 1 formulates this idea. In the next section we analyze this strategy and show that it achieves pointwise competitiveness with non-trivial (bounded) coverage.

Algorithm 1 Low Error Selective Strategy (LESS)
Require: Sm, m, δ, d
Ensure: a pointwise competitive selective classifier (f, g) with probability 1 − δ
  Set fˆ = ERM(F, Sm), i.e., fˆ is any empirical risk minimizer from F
  Set G = V̂( fˆ, 4·√( (d·ln(2me/d) + ln(8/δ)) / (2m) ) )
  Construct g such that g(x) = 1 ⟺ x ∈ {X \ DIS(G)}
  f = fˆ

5.3 Risk and coverage bounds

To facilitate the discussion we need a few definitions. Consider an instance of a binary learning problem with hypothesis class F, an underlying distribution P over X × Y, and a loss function ℓ over Y × Y. Let f∗ = argmin_{f∈F} E{ℓ(f(X), Y)} be the true risk minimizer.

Definition 5.3 (excess loss class) Let F be an hypothesis class. The associated excess loss class [12] is defined as

H ≜ { ℓ(f(x), y) − ℓ(f∗(x), y) : f ∈ F }.

Definition 5.4 (Bernstein class [12]) Class H is said to be a (β, B)-Bernstein class with respect to P (where 0 < β ≤ 1 and B ≥ 1), if every h ∈ H satisfies

Eh² ≤ B(Eh)^β.

Bernstein classes arise in many natural situations; see the discussions in [62, 11]. For example, if the probability P(X, Y) satisfies Tsybakov’s noise conditions, then the excess loss class is a Bernstein class [11, 80].

In the following sequence of lemmas and theorems we assume a binary hypothesis class F with VC-dimension d, an underlying distribution P over X × {±1}, and that ℓ is the 0/1 loss function. Also, H denotes the associated excess loss class. Our results can be extended to loss functions other than 0/1 by techniques similar to those used in [16].

In Figure 5.1 we schematically depict the hypothesis class F (the gray area), the target hypothesis (filled black circle outside F), and the best hypothesis in the class f∗.
The distance of two points in the diagram relates to the distance between two hypothesis under the marginal distribution P (X). Our first observation is that if the excess loss class is (β, B)-Bernstein class, then the set of low true error (depicted in Figure 5.1 (a)) resides within a larger ball centered around f ∗ (see Figure 5.1 (b)). Figure 5.1: The set of low true error (a) resides within a ball around f ∗ (b). Lemma 5.1 If H is a (β, B)-Bernstein class with respect to P , then for any r > 0 V(f ∗ , r) ⊆ B f ∗ , Br β . Proof If f ∈ V(f ∗ , r), then, by definition, E {I(f (X) 6= Y )} ≤ E {I(f ∗ (X) 6= Y )} + r. Using linearity of expectation we have, E {I(f (X) 6= Y ) − I(f ∗ (X) 6= Y )} ≤ r. 89 (5.1) Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Since H is a (β, B)-Bernstein class and ℓ is the 0/1 loss function, E {I(f (X) 6= f ∗ (X))} = E {|I(f (X) 6= Y ) − I(f ∗ (X) 6= Y )|} o n = E (ℓ(f (X), Y ) − ℓ(f ∗ (X), Y ))2 = Eh2 ≤ B(Eh)β = B (E {I(f (X) 6= Y ) − I(f ∗ (X) 6= Y )})β . By (5.1), for any r > 0, E {I(f (X) 6= f ∗ (X))} ≤ Br β . Therefore, by definition, f ∈ B f ∗ , Br β . So far we have seen that the set of low true error resides within a ball around f ∗ . Now we would like to prove that, with high probability the set of low empirical error (depicted in Figure 5.2 (a)) resides within the set of low true error (see Figure 5.2 (b)). We emphasize that the distance between hypotheses in Figure 5.2 (a) is based on the empirical error, while the distance in Figure 5.2 (b) is based on the true error. Figure 5.2: The set of low empirical error (a) resides within the set of low true error (b). 90 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Throughout this section we denote s σ(m, δ, d) , 2 + ln 2δ d ln 2me d . 2 m The following is a classical result is statistical learning theory. Theorem 5.2 ([20]) For any 0 < δ < 1, with probability of at least 1 − δ over the choice of Sm from P m , any hypothesis f ∈ F satisfies R(f ) ≤ R̂(f ) + σ(m, δ, d). Similarly, R̂(f ) ≤ R(f ) + σ(m, δ, d) under the same conditions. Lemma 5.3 For any r > 0, and 0 < δ < 1, with probability of at least 1 − δ, δ ∗ ˆ V̂(f , r) ⊆ V f , 2σ m, , d + r . 2 Proof If f ∈ V̂(fˆ, r), then, by definition, R̂(f ) ≤ R̂(fˆ) + r. Since fˆ minimizes the empirical error, we know that R̂(fˆ) ≤ R̂(f ∗ ). Using Theorem 5.2 twice, and applying the union bound, we see that with probability of at least 1 − δ, R(f ) ≤ R̂(f ) + σ(m, δ/2, d) Therefore, ∧ δ R(f ) ≤ R(f ) + 2σ m, , d + r, 2 ∗ and R̂(f ∗ ) ≤ R(f ∗ ) + σ(m, δ/2, d). δ f ∈ V f , 2σ m, , d + r . 2 ∗ We have shown that, with high probability, the set of low empirical error is a subset of a ball around f ∗ . Therefore, the probability that at least two hypotheses in the set of low empirical error will disagree with each other is bounded by the 91 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 probability that at least two hypotheses in the ball around f ∗ will disagree with each other. Luckily, the latter is bounded by a complexity measure termed the disagreement coefficient (see Definition 4.10). We recall that for any G ⊆ F, and distribution P we have defined (see Definition 4.9) ∆G , Pr {DIS(G)}, and that θ(r0 ) is the disagreement coefficient with parameter r0 . Theorem 5.4 Let F be an hypothesis class and assume that H is a (β, B)-Bernstein class w.r.t. P . 
Then, for any r > 0 and 0 < δ < 1, with probability of at least 1 − δ, β δ ∆V̂(fˆ, r) ≤ B · 2σ m, , d + r · θ(r0 ), 2 where θ(r0 ) is the disagreement coefficient of F with respect to P and r0 = (σ(m, δ/2, d)) β . Proof Applying Lemmas 5.3 and 5.1 we get that with probability of at least 1 − δ, β ! δ . V̂(fˆ, r) ⊆ B f ∗ , B 2σ m, , d + r 2 Therefore, β ! δ ∆V̂(fˆ, r) ≤ ∆B f ∗ , B 2σ m, , d + r . 2 By the definition of the disagreement coefficient, for any r ′ > r0 , ∆B(f ∗ , r ′ ) ≤ θ(r0 )r ′ . Noting that β β δ δ r = B 2σ m, , d + r > σ m, , d = r0 2 2 ′ completes the proof. Theorem 5.5 Let F be an hypothesis class and assume that H is a (β, B)-Bernstein class w.r.t. P . Let (f, g) be the selective classifier chosen by LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive selective classifier and β δ Φ(f, g) ≥ 1 − B · 4σ m, , d · θ(r0 ), 4 92 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 where θ(r0 ) is the disagreement coefficient of F with respect to P , and r0 = (σ(m, δ/4, d)) β . Proof Applying Theorem 5.2 we get that with probability of at least 1 − δ/4, δ ∗ ∗ R̂(f ) ≤ R(f ) + σ m, , d . 4 Since f ∗ minimizes the true error, we get that R(f ∗ ) ≤ R(fˆ). Applying again Theorem 5.2, we learn that with probability of at least 1 − δ/4, δ ˆ ˆ R(f ) ≤ R̂(f ) + σ m, , d . 4 Using the union bound, it follows that with probability of at least 1 − δ/2, δ ∗ ˆ R̂(f ) ≤ R̂(f ) + 2σ m, , d . 4 Hence, with probability of at least 1 − δ/2, δ ∗ ˆ = G. f ∈ V̂ f , 2σ m, , d 4 We note that the selection function g(x) equals one only for x ∈ X \ DIS (G) . Therefore, for any x ∈ X , for which g(x) = 1, all the hypotheses in G agree, and in particular f ∗ and fˆ agree. Thus (f, g) is pointwise competitive. Applications of Theorem 5.4 and the union bound entail that with probability of at least 1 − δ, β δ ˆ · θ(r0 ), Φ(f , g) = E{g(X)} = 1 − ∆G ≥ 1 − B · 4σ m, , d 4 where θ(r0 ) is the disagreement coefficient of F with respect to P and r0 = (σ(m, δ/4, d)) β . Lemma 5.6 If r1 < r2 , then θ(r1 ) ≥ θ(r2 ). 93 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof Implied directly from the definition of a supermum. Corollary 5.7 Assume that F has disagreement coefficient θ(0) and that H is a (β, B)-Bernstein class w.r.t. P . Let (f, g) be the selective classifier chosen by LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive selective classifier, and β δ · θ(0). Φ(f, g) ≥ 1 − B · 4σ m, , d 4 Proof Application of Theorem 5.5 together with Lemma 5.6 completes the proof. The disagreement coefficient θ(0) has been shown to be finite for different hypothesis classes and source distributions including: thresholds in R under any distribution (θ(0) = 2) [48], linear separators through the origin in Rd under uniform √ distribution on the sphere (θ(0) ≤ d) [48], and linear separators in Rd under smooth data distribution bounded away from zero (θ(0) ≤ c(f ∗ )d, where c(f ∗ ) is an unknown constant that depends on the target hypothesis) [38]. For these cases an application of Corollary 5.7 is sufficient to guarantee a pointwise competitive solution with bounded coverage that converge to one. Unfortunately for many hypothesis classes and distributions the disagreement coefficient θ(0) is infinite [48]. Luckily, if the disagreement coefficient θ(r0 ) grows slowly with respect to 1/r0 (as shown by Wang [87], under sufficient smoothness conditions), Theorem 5.5 is enough to guarantee a pointwise competitive solution. 
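To get a feel for the guarantee of Corollary 5.7, the following snippet (not thesis code) evaluates the coverage lower bound 1 − B·(4σ(m, δ/4, d))^β·θ(0) for illustrative parameter values loosely modeled on the thresholds case (θ(0) = 2) under a Massart-type noise assumption (β = 1); the value of B and all other constants are arbitrary choices, not quantities derived in the thesis.

```python
# Numerical illustration of the LESS coverage guarantee of Corollary 5.7, with
#     sigma(m, delta, d) = sqrt(2 * (d * ln(2*m*e/d) + ln(2/delta)) / m).
import numpy as np

def sigma(m, delta, d):
    return np.sqrt(2.0 * (d * np.log(2.0 * m * np.e / d) + np.log(2.0 / delta)) / m)

def coverage_lower_bound(m, delta, d=1, B=2.0, beta=1.0, theta0=2.0):
    # d, B, beta, theta0 are illustrative (thresholds-like setting)
    return 1.0 - B * (4.0 * sigma(m, delta / 4.0, d)) ** beta * theta0

for m in [10**3, 10**4, 10**5, 10**6]:
    # negative values mean the guarantee is still vacuous; it approaches 1 as m grows
    print(m, round(coverage_lower_bound(m, delta=0.05), 3))
```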
Theorem 5.8 (fast coverage rates) Assume that F has disagreement coefficient 1 θ(r0 ) = O polylog r0 w.r.t. distribution P , and that H is a (β, B)-Bernstein class w.r.t. the same distribution. Let (f, g) be the selective classifier chosen by LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive selective classifier and ! 1 β/2 polylog(m) . · log Φ(f, g) ≥ 1 − B · O m δ 94 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof Application of Theorem 5.5 together with the assumption completes the proof. Theorem 5.9 Assume that F has characterizing set complexity γ(F, n̂) = O (polylog(m)) w.r.t. distribution P , and that H is a (β, B)-Bernstein class w.r.t. the same distribution. Let (f, g) be the selective classifier chosen by LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive selective classifier, and ! polylog(m) 1 β/2 Φ(f, g) ≥ 1 − B · O · log . m δ Proof We note that the disagreement coefficient depends only on the hypothesis class F and the marginal distribution P (X). Therefore, we can use Theorem 4.25 to bound the disagreement coefficient. A direct application of Theorems 5.8 completes the proof. 5.4 Disbelief principle Theorem 5.5 tells us that LESS not only outputs a pointwise competitive selective classifier, but the resulting classifier also has guaranteed coverage (under some conditions). As emphasized in [31], in practical applications it is desirable to allow for some control over the trade-off between risk and coverage; in other words, we would like to be able to develop the entire risk-coverage curve for the classifier at hand and select ourselves the cutoff point along this curve in accordance with other practical considerations we may have. How can this be achieved? The following lemma facilitates a construction of a risk-coverage trade-off curve. The result is an alternative characterization of the selection function g, of the pointwise optimal selective classifier chosen by LESS. This result allows for calculating the value of g(x), for any individual test point x ∈ X , without actually constructing g for the entire domain X . This is an extension of the “lazy” implementation method, discussed in Section 3.4, to the agnostic case. 95 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Lemma 5.10 Let (f, g) be a selective classifier chosen by LESS after observing the training sample Sm . Let fˆ be the empirical risk minimizer over Sm . Let x be any point in X and n o fex , argmin R̂(f ) | f (x) = −sign fˆ(x) , f ∈F an empirical risk minimizer forced to label x the opposite from fˆ(x). Then δ e ˆ g(x) = 0 ⇐⇒ R̂(fx ) − R̂(f ) ≤ 2σ m, , d . 4 Proof According to the definition of V̂ (see Definition 5.1), δ δ ˆ e e ˆ ⇐⇒ f ∈ V̂ f , 2σ m, , d R̂(fx ) − R̂(f ) ≤ 2σ m, , d 4 4 Thus, fˆ, fex ∈ V̂. However, by construction, fˆ(x) = −fe(x), so x ∈ DIS(V̂) and g(x) = 0. Lemma 5.10 tells us that in order to decide if point x should be rejected we need to measure the empirical error R̂(fex ) of a special empirical risk minimizer, fex , which is constrained to label x the opposite from ĥ(x). If this error is sufficiently close to R̂(ĥ), our classifier cannot be too sure about the label of x and we must reject it. Figure 5.3 illustrates this principle for a 2-dimensional example. The hypothesis class is the class of linear classifiers in R2 and the source distribution is two normal distributions. Negative samples are represented by blue circles and positive samples by red squares. 
As usual, fˆ denotes the empirical risk minimizer. Let us assume that we want to classify point x1 . This point is classified positive by fˆ. Therefore, we force this point to be negative and calculate the restricted ERM (depicted by doted line marked fex1 ). The difference between the empirical risk of fˆ and fex1 is not large enough, so point x1 will be rejected. However, if we want to classify point x2 , the difference between the empirical risk of fˆ and fex2 is quite large and the point will be classified as positive. The above result strongly motivates the following definition of a “disbelief index” for each individual point. 96 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Figure 5.3: Constrained ERM. Definition 5.5 (disbelief index) For any x ∈ X , define its disbelief index w.r.t. Sm and F, D(x) , D(x, Sm ) , R̂(fex ) − R̂(fˆ). Observe that D(x) is large whenever our model is sensitive to the label of x in the sense that when we are forced to bend our best model to fit the opposite label of x, our model substantially deteriorates, giving rise to a large disbelief index. This large D(x) can be interpreted as our disbelief in the possibility that x can be labeled so differently. In this case we should definitely predict the label of x using our unforced model. Conversely, if D(x) is small, our model is indifferent to the label of x and in this sense, is not committed to its label. In this case we should abstain from prediction at x. This “disbelief principle” facilitates an exploration of the risk-coverage tradeoff curve for our classifier. Given a pool of test points we can rank these test points according to their disbelief index, and points with low index should be rejected first. Thus, this ranking provides the means for constructing a risk-coverage tradeoff curve. A similar technique of using an ERM oracle that can enforce an arbitrary number of example-based constraints was used in [28, 17] in the context of active learning. As in our disbelief index, the difference between the empirical risk (or importance weighted empirical risk [17]) of two ERM oracles (with different constraints) is used to estimate prediction confidence. 97 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 5.5 Implementation In this section we temporarily switch from theory to practice, aiming at implementing rejection methods inspired by the disbelief principle and see how well they work on real world problems. Attempting to implement a learning algorithm driven by the disbelief index we face a major bottleneck because the calculation of the index requires the identification of ERM hypotheses. To handle this computationally difficult problem, we heuristically “approximate” the ERM of linear classifiers using support vector machines (SVMs) with soft margins [25]. SVM with soft margin chooses a hyperplane that splits the training sample as cleanly as possible, while still maximizing the distance to the cleanly split examples. The penalty parameter of the error term, C, controls the tradeoff between misclassification and margin size. The higher the value of C, the higher the misclassification penalty. In our implementation we use a high C value (105 in our experiments) to penalize more on training errors than on small margin. In this way the solution to the optimization problem tend to get closer to the ERM. Another problem we face is that the disbelief index is a noisy statistic that highly depends on the sample Sm . 
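The following is a simplified sketch (not the thesis implementation) of the disbelief-index computation just described: a soft-margin linear SVM with a large C approximates the ERM, and the constrained ERM f̃x is approximated by appending the test point, labeled opposite to the unconstrained prediction, with a sample weight that dwarfs the training set. The bootstrap robustification and the tie breaking discussed next are omitted, and the synthetic data and all constants are illustrative.

```python
# Sketch of the disbelief index D(x) = R_hat(f~_x) - R_hat(f_hat) using a
# high-C soft-margin linear SVM as an ERM surrogate (Section 5.5).
import numpy as np
from sklearn.svm import SVC

def disbelief_index(X_train, y_train, x, C=1e5):
    erm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    flipped = -erm.predict(x.reshape(1, -1))[0]            # force the opposite label at x

    # constrained "ERM": x is appended with weight 10x the whole training set combined
    X_aug = np.vstack([X_train, x])
    y_aug = np.append(y_train, flipped)
    w_aug = np.append(np.ones(len(y_train)), 10.0 * len(y_train))
    forced = SVC(kernel="linear", C=C).fit(X_aug, y_aug, sample_weight=w_aug)

    # both empirical risks are measured on the original training sample S_m only
    r_hat = np.mean(erm.predict(X_train) != y_train)
    r_forced = np.mean(forced.predict(X_train) != y_train)
    return r_forced - r_hat

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
print(disbelief_index(X, y, np.array([3.0, 3.0])))   # far from the boundary: large D(x)
print(disbelief_index(X, y, np.array([0.0, 0.0])))   # near the boundary: small D(x) -> reject
```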
To overcome this noise we robustify this statis1 , S 2 , . . . S 11 ) using tic as follow. First we generate eleven different samples (Sm m m bootstrap sampling. For each sample we calculate the disbelief index for all test points and for each point take the median of these measurements as the final index. We note that for any finite training sample the disbelief index is a discrete variable. It is often the case that several test points share the same disbelief index. In those cases we can use any confidence measure as a tie breaker. In our experiments we use distance from decision boundary to break ties. In order to estimate R̂(fex ) we have to restrict the SVM optimizer to only consider hypotheses that classify the point x in a specific way. To accomplish this we use a weighted SVM for unbalanced data [21]. We add the point x as another training point with weight 10 times larger than the weight of all training points combined. Thus, the penalty for misclassification of x is very large and the optimizer finds a solution that doesn’t violate the constraint. 5.6 Empirical results LESS leads to rejection regions that are fundamentally different than those obtained by the traditional distance-based techniques for rejection. To illustrate this 98 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 point we start with two synthetic 2-dimensional datasets. We compare those regions to the ones based on distance from decision boundary. This latter approach is very common in practical applications of selective classification. We then extend our results to standard medical diagnosis problems from the UCI repository. Focusing on both SVM with linear kernel and SVM with RBF kernel we analyze the RC (Risk-Coverage) curves achievable for those datasets. These experiments were constructed over the LIBSVM package [21]. The first 2D source distribution we analyze is a mixture of two identical normal distributions (centered at different locations). In Figure 5.4 we depict the rejection regions for a training sample of 150 points sampled from this distribution. The (a) (b) Figure 5.4: Linear classifier. Confidence height map using (a) disbelief index; (b) distance from decision boundary. height map reflects the “confidence regions” of each technique according to its own confidence measure. The second 2D source distribution we analyze is a bit more interesting. X is distributed uniformly over [0, 3π] × [−2, 2] and the labels are sampled according to the following conditional distribution ( 0.95, x2 ≥ sin(x1 ); P (Y = 1|X = (x1 , x2 )) , 0.05, else. 99 In Figure 5.5 we depict the rejection regions for a training sample of 50 points sampled from this distribution (averaged over 100 iterations). The hypothesis class used for training was SVM with polynomial kernel (of degree 5). Here again, the height map reflects the “confidence regions” of each technique according to its own confidence measure. The thick red line depicts the decision boundary of the Bayes classifier. The qualitative difference in confidence regions that is clearly evident in Figure 5.4 also results in quantifiably different RC performance curves as depicted in Figure 5.6. The RC curve generated by LESS is depicted in red and 0.1 0.16 0.14 0.09 test error 0.12 test error Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Figure 5.5: SVM with polynomial kernel. Confidence height map using (a) disbelief index; (b) distance from decision boundary. 
The qualitative difference in confidence regions that is clearly evident in Figure 5.4 also results in quantifiably different RC performance curves, as depicted in Figure 5.6. The RC curve generated by LESS is depicted in red, and the RC curve generated by distance from the decision boundary is depicted as a dashed green line. The right graph zooms into the low-coverage region of the entire RC curve (depicted in the left graph). The dashed horizontal line is the test error of $f^*$ over the entire domain, and the dotted line is the Bayes error.

Figure 5.6: RC curve of our technique (depicted in red) compared to rejection based on distance from decision boundary (depicted as a dashed green line). The RC curve in the right figure zooms into the lower coverage region of the left curve.

While for high coverage values the two techniques are statistically indistinguishable, for any coverage below 60% we get a significant advantage for LESS. It is clear that in this case not only is the estimation error reduced, but the test error also drops significantly below the optimal test error of $f^*$ for low coverage values.

We also tested our algorithm on standard medical diagnosis problems from the UCI repository, including all the datasets used in [44]. We transformed nominal features to numerical ones in a standard way using binary indicator attributes, and normalized each attribute independently so that its dynamic range is $[0, 1]$. No other preprocessing was employed. In each iteration we chose, uniformly at random, non-overlapping training (100 samples) and test (200 samples) sets for each dataset.¹ The SVM was trained on the entire training set, and test samples were sorted according to confidence (either the distance from the decision boundary or the disbelief index).

¹ Due to the size of the Hepatitis dataset, its test set was limited to 29 samples.

Figure 5.7 depicts the RC curves of our technique (solid red line) and of rejection based on distance from the decision boundary (dashed green line) for a linear kernel on all six datasets. All results are averaged over 500 iterations (error bars show the standard error). With the exception of the Hepatitis dataset, on which both methods were statistically indistinguishable, the proposed method exhibits a significant advantage over the traditional approach on all other datasets. We would like to highlight the performance of the proposed method on the Pima dataset: while the traditional approach cannot achieve an error below 8% at any rejection rate, with our approach the test error decreases monotonically to zero as the rejection rate grows. Furthermore, a clear advantage for our method over a large range of rejection rates is evident on the Haberman dataset.²

² The Haberman dataset contains survival data of patients who had undergone surgery for breast cancer. With an estimated 207,090 new cases of breast cancer in the United States during 2010 [77], an improvement of 1% affects the lives of more than 2000 women.

For the sake of fairness, we note that the running time of our algorithm (as presented here) is substantially longer than that of the traditional technique. The performance of our algorithm can be substantially improved when many unlabeled samples are available: in this case the rejection function can be evaluated on the unlabeled samples to generate a new "labeled" sample, and a new rejection classifier can then be trained on this sample.
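The RC curves above are obtained directly from such a confidence ranking. The following minimal sketch shows one way to compute an empirical RC curve from per-point confidence scores and 0/1 losses; it is an illustration under our own naming (`rc_curve`), not the code used to produce the figures.

```python
import numpy as np

def rc_curve(confidence, errors):
    """Empirical risk-coverage curve (sketch).

    confidence: per-test-point confidence scores (e.g., disbelief index or
                distance from the decision boundary)
    errors:     0/1 loss of the classifier on each test point
    Points with the lowest confidence are rejected first, so the risk at
    coverage c is the mean error over the most confident ceil(c * n) points.
    """
    confidence = np.asarray(confidence, dtype=float)
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(-confidence)                # most confident first
    sorted_errors = errors[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(sorted_errors) / np.arange(1, n + 1)
    return coverage, risk
```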
Figure 5.8 depicts the maximum coverage at which a distance-based rejection technique achieves the same error rate as our method at a given coverage. For example, suppose that our method achieves an error rate of 10% at a coverage of 60%, and that the distance-based rejection technique achieves the same error with a maximum coverage of 40%; then the point (0.6, 0.4) will be on the red line. Thus, if the red line is below the diagonal, our technique has an advantage over distance-based rejection, and vice versa. As an example, consider the Haberman dataset, and observe that regardless of the rejection rate, the distance-based technique cannot achieve the same error as our technique at any coverage lower than 80%.

Figures 5.9 and 5.10 depict the results obtained with an RBF kernel. In this case a statistically significant advantage for our technique was observed on all datasets.

Figure 5.7: RC curves for SVM with linear kernel. Our method in solid red, and rejection based on distance from decision boundary in dashed green. The horizontal axis (c) represents coverage.

Figure 5.8: SVM with linear kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.

Figure 5.9: RC curves for SVM with RBF kernel. Our method in solid red and rejection based on distance from decision boundary in dashed green.

Figure 5.10: SVM with RBF kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.

Chapter 6

Selective Regression

"Ignoring isn't the same as ignorance, you have to work at it."
Margaret Atwood, The Handmaid's Tale

Consider a standard least squares regression problem. Given $m$ input-output training pairs, $(x_1, y_1), \ldots, (x_m, y_m)$, we are required to learn a predictor, $\hat{f} \in \mathcal{F}$, capable of generating accurate output predictions, $\hat{f}(x) \in \mathbb{R}$, for any input $x$. Assuming that input-output pairs are i.i.d. realizations of some unknown stochastic source, $P(X, Y)$, we would like to choose $\hat{f}$ so as to minimize the standard least squares risk functional,
\[
R(\hat{f}) = \int \left( y - \hat{f}(x) \right)^2 dP(x, y).
\]
Let $f^* = \operatorname{argmin}_{f \in \mathcal{F}} R(f)$ be the optimal predictor in hindsight (based on full knowledge of $P$). A classical result in statistical learning is that, under certain structural conditions on $\mathcal{F}$ and possibly on $P$, one can learn a regressor that approaches the average optimal performance, $R(f^*)$, as the sample size $m$ approaches infinity [81].

As in selective classification, in selective regression we allow the possibility of abstaining from prediction on part of the domain. In agnostic selective classification our goal was to find a pointwise competitive classifier, in other words, to classify exactly as the best hypothesis in the class on the accepted domain. In contrast, in selective regression we are only required to find an $\epsilon$-pointwise competitive regressor, namely, a regressor whose predictions are $\epsilon$-close to the predictions of the best regressor in the class at every point of the accepted domain.
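Stated formally (this is a paraphrase of the property established in Lemma 6.5 and Theorem 6.7 below, using the selective-predictor notation $(f, g)$ from earlier chapters; the precise definition may differ in minor details): a selective regressor $(f, g)$ is $\epsilon$-pointwise competitive if
\[
|f(x) - f^*(x)| < \epsilon \qquad \text{for every } x \in \mathcal{X} \text{ with } g(x) = 1 .
\]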
In this chapter we show that, under some conditions, $\epsilon$-pointwise competitiveness is achievable with coverage that increases monotonically with $m$, by a learning strategy termed the $\epsilon$-low error selective strategy ($\epsilon$-LESS). $\epsilon$-LESS is, of course, closely related to the LESS classification strategy, now adapted to predicting real values. As in the cases of LESS (and CSS), the $\epsilon$-LESS strategy might appear to be computationally out of reach, because accept/reject decisions require the computation of a supremum over a very large, and possibly infinite, hypothesis subset. However, here again we show how to compute the strategy for each point of interest using only two constrained ERM calculations. This useful reduction opens possibilities for efficient implementations of competitive selective regressors whenever the hypothesis class of interest allows efficient (constrained) ERM (see Definition 6.6). For the case of linear least squares regression we utilize known techniques for both ERM and constrained ERM and derive an exact implementation achieving pointwise competitive selective regression. The resulting algorithm is efficient and can easily be implemented using standard matrix operations, including (pseudo) inversion. Finally, we present numerical examples over a suite of real-world regression datasets, demonstrating the effectiveness of our methods and indicating that substantial performance improvements can be gained by using selective regression.

6.1 Definitions

A finite training sample of $m$ labeled examples, $S_m \triangleq \{(x_i, y_i)\}_{i=1}^m \subseteq (\mathcal{X} \times \mathcal{Y})^m$, is observed, where $\mathcal{X}$ is some feature space and $\mathcal{Y} \subseteq \mathbb{R}$. Using $S_m$ we are required to select a regressor $\hat{f} \in \mathcal{F}$, where $\mathcal{F}$ is a fixed hypothesis class containing potential regressors of the form $f : \mathcal{X} \to \mathcal{Y}$. It is desired that predictions $\hat{f}(x)$, for unseen instances $x$, be as accurate as possible. We assume that pairs $(x, y)$, including training instances, are sampled i.i.d. from some unknown stochastic source, $P(x, y)$, defined over $\mathcal{X} \times \mathcal{Y}$. Given a loss function, $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, we quantify the prediction quality of any $f$ through its true error or risk,
\[
R(f) \triangleq \mathbf{E}_{(X,Y)} \left\{ \ell(f(X), Y) \right\} = \int \ell(f(x), y) \, dP(x, y).
\]
While $R(f)$ is an unknown quantity, we do observe the empirical risk of $f$, defined as
\[
\hat{R}(f) \triangleq \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i).
\]
As in previous chapters, let $\hat{f} \triangleq \operatorname{argmin}_{f \in \mathcal{F}} \hat{R}(f)$ be the empirical risk minimizer (ERM), and $f^* \triangleq \operatorname{argmin}_{f \in \mathcal{F}} R(f)$ the true risk minimizer.

Definition 6.1 (low error set) For any hypothesis class $\mathcal{F}$, target hypothesis $f \in \mathcal{F}$, distribution $P$, sample $S_m$, and real $r > 0$, define
\[
\mathcal{V}(f, r) \triangleq \left\{ f' \in \mathcal{F} : R(f') \leq R(f) + r \right\}
\quad \text{and} \quad
\hat{\mathcal{V}}(f, r) \triangleq \left\{ f' \in \mathcal{F} : \hat{R}(f') \leq \hat{R}(f) + r \right\},
\]
where $R(f')$ is the true risk of hypothesis $f'$ with respect to the source distribution $P$, and $\hat{R}(f')$ is its empirical risk with respect to the sample $S_m$.

For the sake of brevity, throughout this chapter we often write $f$ instead of $f(x)$, where $f$ is any regressor. We also define a (standard) distance metric over the hypothesis class $\mathcal{F}$. For any probability measure $\mu$ on $\mathcal{X}$, let $L_2(\mu)$ be the Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}$, with the inner product defined as $\langle f, g \rangle \triangleq \mathbf{E}_{\mu(x)} f(x) g(x)$. The distance function induced by the inner product is
\[
\rho(f, g) \triangleq \| f - g \| = \sqrt{\langle f - g, f - g \rangle} = \sqrt{\mathbf{E}_{\mu(x)} \left( f(x) - g(x) \right)^2}.
\]
For any $f \in \mathcal{F}$ we define a ball in $\mathcal{F}$ of radius $r$ around $f$.

Definition 6.2 (ball in $\mathcal{F}$) Let $f \in \mathcal{F}$ be a hypothesis and let $r > 0$ be given.
Then a ball of radius r around hypothesis f is defined as B(f, r) , f ′ ∈ F : ρ(f, f ′ ) ≤ r . Finally we define a multiplicative risk bound for regression. 109 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Definition 6.3 (Multiplicative Risk Bounds) Let σδ , σ (m, δ, F) be defined such that for any 0 < δ < 1, with probability of at least 1 − δ over the choice of Sm from P m , any hypothesis f ∈ F satisfies R(f ) ≤ R̂(f ) · σ (m, δ, F) . Similarly, the reverse bound , R̂(f ) ≤ R(f ) · σ (m, F, δ), holds under the same conditions. Remark 6.1 The purpose of Definition 6.3 is to facilitate the use of any (known) risk bound as a plug-in component in subsequent derivations. We define σ as a multiplicative bound, which is common in the treatment of unbounded loss functions such as the squared loss (see discussion by Vapnik in [82], page 993). Instances of such bounds can be extracted, e.g., from [61] (Theorem 1), and from bounds discussed in [82]. The entire set of results that follow can also be developed while relying on additive bounds, which are common when using bounded loss functions. 6.2 ǫ-Low Error Selective Strategy (ǫ-LESS) The ǫ-LESS strategy follows the same general idea of the LESS classification strategy discussed in previous chapters. Using standard concentration inequalities for real valued functions (see Definition 6.3) one can show that the training error of the true risk minimizer, f ∗ , cannot be “too far” from the training error of the empirical risk minimizer, fˆ. Therefore, we can guarantee, with high probability, that the class of all hypothesis with “sufficiently low” empirical error includes the true risk minimizer f ∗ . Selecting only subset of the domain, for which all hypotheses predict values up to ǫ from the prediction of the empirical risk minimizer fˆ, is then sufficient to guarantee ǫ-pointwise competitivness. Algorithm 2 formulates this idea. In the next section we analyze this strategy and show that it achieves ǫ-pointwise competitiveness with non trivial (bounded) coverage. 110 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Algorithm 2 ǫ-Low Error Selective Strategy (ǫ-LESS) Require: Sm , m, δ, F, ǫ Ensure: A selective regressor (fˆ, g) achieving ǫ-pointwise competitiveness Set fˆ = ERM (F, Sm ) /* fˆ is any empirical risk minimizer from F */ Set G = V̂ fˆ, σ(m, δ/4, F)2 − 1 · R̂(fˆ) Construct g such that g(x) = 1 ⇐⇒ ∀f ′ ∈ G |f ′ (x) − fˆ(x)| < ǫ 6.3 Risk and coverage bounds Lemma 6.2 that follows is based on the proof of Lemma A.12 in [65]. Lemma 6.2 ([65]) For any f ∈ F. Let ℓ : Y × Y → [0, ∞) be the squared loss function and F be a convex hypothesis class. Then, E(x,y) (f ∗ (x) − y)(f (x) − f ∗ (x)) ≥ 0. Proof By convexity, for any α ∈ [0, 1], fα , α · f + (1 − α) · f ∗ ∈ F. Since f ∗ is the best predictor in F, it follows that R(f ∗ ) ≤ R(fα ) = Eℓ(fα , y) = E(fα − y)2 = E(α · f + (1 − α) · f ∗ − y)2 = E(f ∗ − y + α · (f − f ∗ ))2 = E(f ∗ − y)2 + α2 E(f − f ∗ )2 + 2αE(f ∗ − y)(f − f ∗ ) = R(f ∗ ) + α2 E(f − f ∗ )2 + 2αE(f ∗ − y)(f − f ∗ ) Thus, for any α ∈ [0, 1], α E(f ∗ − y)(f − f ∗ ) ≥ − E(f − f ∗ )2 , 2 and the lemma is obtained by taking the limit α → 0. 111 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Lemma 6.3 Under the same conditions of Lemma 6.2, for any r > 0, √ V(f ∗ , r) ⊆ B f ∗ , r . Proof If f ∈ V(f ∗ , r), then by definition, R(f ) ≤ R(f ∗ ) + r. 
(6.1) R(f ) − R(f ∗ ) = E {ℓ(f, y) − ℓ(f ∗ , y)} = E (f − y)2 − (f ∗ − y)2 n o = E (f − f ∗ )2 − 2(y − f ∗ )(f − f ∗ ) = ρ2 (f, f ∗ ) + 2E(f ∗ − y)(f − f ∗ ). Applying Lemma 6.2 and Equation 6.1 we get, ρ(f, f ∗ ) ≤ p R(f ) − R(f ∗ ) ≤ √ r. Lemma 6.4 For any r > 0, and 0 < δ < 1, with probability of at least 1 − δ, 2 V̂(fˆ, r) ⊆ V f ∗ , (σδ/2 − 1) · R(f ∗ ) + r · σδ/2 . Proof If f ∈ V̂(fˆ, r), then, by definition, R̂(f ) ≤ R̂(fˆ) + r. Since fˆ minimizes the empirical error, we have R̂(fˆ) ≤ R̂(f ∗ ). Using the multiplicative risk bound twice (Definition 6.3), and applying the union bound, we know that with probability of at least 1 − δ, R(f ) ≤ R̂(f ) · σδ/2 ∧ 112 R̂(f ∗ ) ≤ R(f ∗ ) · σδ/2 . Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Therefore, R(f ) ≤ R̂(f ) · σδ/2 ≤ (R̂(fˆ) + r) · σδ/2 ≤ (R̂(f ∗ ) + r) · σδ/2 2 ≤ (R(f ∗ ) · σδ/2 + r) · σδ/2 = R(f ∗ ) + (σδ/2 − 1)R(f ∗ ) + r · σδ/2 . and 2 f ∈ V f ∗ , σδ/2 − 1 · R(f ∗ ) + r · σδ/2 . Lemma 6.5 Let F be a convex hypothesis space, ℓ : Y × Y → [0, ∞), a convex loss function, and fˆ be an ERM. Then, with probability of at least 1 − δ/2, for any x ∈ X, |f ∗ (x) − fˆ(x)| ≤ sup |f (x) − fˆ(x)|. ” “ 2 −1)·R̂(fˆ) f ∈V̂ fˆ,(σδ/4 Proof Applying the multiplicative risk bound, we get that with probability of at least 1 − δ/4, R̂(f ∗ ) ≤ R(f ∗ ) · σδ/4 . Since f ∗ minimizes the true error, R(f ∗ ) ≤ R(fˆ). Applying the multiplicative risk bound on fˆ, we know also that with probability of at least 1 − δ/4, R(fˆ) ≤ R̂(fˆ) · σδ/4 . Combining the three inequalities by using the union bound we get that with probability of at least 1 − δ/2, 2 2 R̂(f ∗ ) ≤ R̂(fˆ) · σδ/4 = R̂(fˆ) + σδ/4 − 1 · R̂(fˆ). Hence, with probability of at least 1 − δ/2 we get 2 f ∗ ∈ V̂ fˆ, (σδ/4 − 1) · R̂(fˆ) . 113 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Let G ⊆ F. We generalize the concept of disagreement set (Definition 4.3) to real-valued functions. Definition 6.4 (ǫ-disagreement set) The ǫ-disagreement set w.r.t. G is defined as DISǫ (G) , {x ∈ X : ∃f1 , f2 ∈ G s.t. |f1 (x) − f2 (x)| ≥ ǫ} . For any G ⊆ F, distribution P , and ǫ > 0, we define ∆ǫ G , P rP {DISǫ (G)} . In the following definition we extend Hanneke’s disagreement coefficient [48] to the case of real-valued functions.1 Definition 6.5 (ǫ-disagreement coefficient) Let r0 ≥ 0. The ǫ-disagreement coefficient of F with respect to P and r0 is, θǫ (r0 ) , sup r>r0 ∆ǫ B(h∗ , r) . r Throughout this chapter we set r0 = 0. Lemma 6.6 Let F be a convex hypothesis class, and assume ℓ : Y × Y → [0, ∞) is the squared loss function. Let ǫ > 0 be given. Assume that F has ǫ-disagreement coefficient θǫ . Then, for any r > 0 and 0 < δ < 1, with probability of at least 1 − δ, r 2 ˆ ∆ǫ V̂(f , r) ≤ θǫ σ − 1 · R(f ∗ ) + r · σδ/2 . δ/2 Proof Applying Lemmas 6.4 and 6.3 we know that with probability of at least 1 − δ, ! r 2 − 1 · R(f ∗ ) + r · σ σδ/2 V̂(fˆ, r) ⊆ B f ∗ , δ/2 . Therefore, ! r 2 − 1 · R(f ∗ ) + r · σ σδ/2 ∆ǫ V̂(fˆ, r) ≤ ∆ǫ B f ∗ , δ/2 . 1 Our attemps to utilize a different known extension of the disagreement coefficient [16] were not successful. Specifically, the coefficient proposed there is unbounded for the squared loss function when Y is unbounded. 114 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 By the definition of the ǫ-disagreement coefficient, for any r ′ > 0, ∆ǫ B(f ∗ , r ′ ) ≤ θǫ r ′ , which completes the proof. 
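As a concrete reading of Definitions 6.4 and 6.5, the $\epsilon$-disagreement mass $\Delta_\epsilon \mathcal{G}$ can be estimated by Monte Carlo over an unlabeled sample. The sketch below is our own illustration, not part of the analysis: the function name, the finite list of regressors standing in for $\mathcal{G}$ (which in the theory is the full, possibly infinite, set $\hat{\mathcal{V}}$), and the `.predict` interface are assumptions.

```python
import numpy as np

def disagreement_mass(models, X_unlabeled, eps):
    """Monte Carlo estimate of Delta_eps(G) = Pr{ exists f1, f2 in G : |f1(x) - f2(x)| >= eps }.

    models:      a finite list of fitted regressors (each exposing .predict),
                 used as a surrogate for the set G
    X_unlabeled: an i.i.d. sample of unlabeled points from the marginal distribution
    """
    preds = np.stack([m.predict(X_unlabeled) for m in models])  # shape: (|G|, n)
    spread = preds.max(axis=0) - preds.min(axis=0)              # largest pairwise gap per point
    return float(np.mean(spread >= eps))
```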
The following theorem is the main result of this section, showing that ǫ-LESS achieves ǫ-pointwise competitiveness with a meaningful coverage that converges to 1. Although R(f ∗ ) in the bound (6.2) is an unknown quantity, it is still a constant, and as σ approaches 1, the coverage lower bound approaches 1 as well. When using a typical additive risk bound, R(h∗ ) disappears from the RHS. Theorem 6.7 Assume the conditions of Lemma 6.6 hold. Let (f, g) be the selective regressor chosen by ǫ-LESS. Then, with probability of at least 1 − δ, (f, g) is ǫpointwise competitive and r σ 2 − 1 · R(f ∗ ) + σδ/4 · R̂(fˆ) . (6.2) Φ(f, g) ≥ 1 − θǫ δ/4 Proof According to ǫ-LESS, if g(x) = 1 then sup “ ” 2 −1 ·R̂(fˆ)) f ∈V̂(fˆ, σδ/4 |f (x) − fˆ(x)| < ǫ. Applying Lemma 6.5 we get that, with probability of at least 1 − δ/2, ∀x ∈ {x ∈ X : g(x) = 1} |f (x) − f ∗ (x)| < ǫ. 2 − 1) · R̂(fˆ) = G wet get Since fˆ ∈ V̂ fˆ, (σδ/4 ( Φ(f, g) = E{g(X)} = E I sup |f (x) − fˆ(x)| < ǫ ( f ∈G = 1 − E I sup |f (x) − fˆ(x)| ≥ ǫ ( ≥ 1−E I f ∈G !) sup |f1 (x) − f2 (x)| ≥ ǫ f1 ,f2 ∈G 115 !) !) = 1 − ∆ǫ G. Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Applying Lemma 6.6 and the union bound we conclude that with probability of at least 1 − δ, r Φ(f, g) = E{g(X)} ≥ 1 − θǫ σ 2 − 1 · R(f ∗ ) + σδ/4 · R̂(fˆ) . δ/4 6.4 Rejection via constrained ERM In our proposed strategy we are required to track the supremum of a possibly infinite hypothesis subset, which in general might be intractable. The following Lemma 6.8 reduces the problem of calculating the supremum to a problem of calculating a constrained ERM for two hypotheses. Definition 6.6 (constrained ERM) Let x ∈ X and ǫ ∈ R be given. Define, n o fˆǫ,x , argmin R̂(f ) | f (x) = fˆ(x) + ǫ , f ∈F where fˆ(x) is, as usual, the value of the unconstrained ERM regressor at point x. Lemma 6.8 Let F be a convex hypothesis space, and ℓ : Y × Y → [0, ∞), a convex loss function. Let ǫ > 0 be given, and let (f, g) be a selective regressor chosen by ǫ-LESS after observing the training sample Sm . Let fˆ be an ERM. Then, g(x) = 0 Proof Let ⇔ 2 R̂(fˆǫ,x ) ≤ R̂(fˆ) · σδ/4 ∨ 2 R̂(fˆ−ǫ,x ) ≤ R̂(fˆ) · σδ/4 . 2 G , V̂ fˆ, (σδ/4 − 1) · R̂(fˆ) , and assume there exists f ∈ G such that |f (x) − fˆ(x)| ≥ ǫ. Assume w.l.o.g. (the other case is symmetric) that f (x) − fˆ(x) = a ≥ ǫ. 116 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Since F is convex, ǫ ˆ ǫ f′ , 1 − · f + · f ∈ F. a a We thus have, ǫ ˆ ǫ ǫ ǫ ˆ f ′ (x) = 1 − · f (x)+ ·f (x) = 1 − · f (x)+ · fˆ(x) + a = fˆ(x)+ǫ. a a a a Therefore, by the definition of fˆǫ,x , and using the convexity of ℓ, together with Jensen’s inequality, m 1 X R̂(fˆǫ,x ) ≤ R̂(f ′ ) = ℓ(f ′ (xi ), yi ) m i=1 = m ǫ ǫ ˆ 1 X · f (xi ) + · f (xi ), yi ℓ 1− m a a i=1 m m ǫ 1 X ǫ 1 X ˆ · ℓ f (xi ), yi + · ℓ (f (xi ), yi ) a m a m i=1 i=1 ǫ ǫ ǫ ǫ 2 = 1− · R̂(fˆ) + · R̂(f ) ≤ 1 − · R̂(fˆ) + · R̂(fˆ) · σδ/4 a a a a ǫ 2 2 ˆ ˆ ˆ = R̂(f ) + · σδ/4 − 1 · R̂(f ) ≤ R̂(f ) · σδ/4 . a ≤ 1− As for the other direction, if 2 R̂(fˆǫ,x ) ≤ R̂(fˆ) · σδ/4 . Then fˆǫ,x ∈ G and ˆ fǫ,x(x) − fˆ(x) = ǫ. So far we have discussed the case where ǫ is given, and our objective is to find an ǫ-pointwise competitive regressor. Lemma 6.8 provides the means to compute such a competitive regressor assuming that a method to compute a constrained ERM is available (as is the case for squared loss linear regressors; see next section). 
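Lemma 6.8 translates directly into an implementable accept/reject test, given an oracle for constrained ERM (Definition 6.6). The following sketch is our own illustration of that test; the oracle interface and helper names are assumptions, and for squared-loss linear regression the oracle has the closed form given in the next section.

```python
def epsilon_less_reject(erm_risk, constrained_erm_risk, x, f_hat_x, eps, sigma):
    """Accept/reject test of Lemma 6.8 (sketch).

    erm_risk:             empirical risk of the unconstrained ERM, R_hat(f_hat)
    constrained_erm_risk: callable (x, value) -> empirical risk of the ERM forced
                          to predict `value` at x (Definition 6.6); assumed given
    f_hat_x:              prediction of the unconstrained ERM at x
    eps:                  target pointwise accuracy epsilon
    sigma:                multiplicative risk-bound factor sigma(m, delta/4, F)
    Returns True if the point is rejected, i.e., g(x) = 0.
    """
    threshold = erm_risk * sigma ** 2
    risk_up = constrained_erm_risk(x, f_hat_x + eps)    # force f(x) = f_hat(x) + eps
    risk_down = constrained_erm_risk(x, f_hat_x - eps)  # force f(x) = f_hat(x) - eps
    return (risk_up <= threshold) or (risk_down <= threshold)
```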
However, as was discussed in [31], in many applications our objective might require to explore the entire risk-coverage trade-off, in other words, to get a pointwise bound on |f ∗ (x) − f (x)|, i.e., individually for any test point x. The following theorem states such a pointwise bound. 117 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Theorem 6.9 Let F be a convex hypothesis class, ℓ : Y × Y → [0, ∞), a convex loss function, and let fˆ be an ERM. Then, with probability of at least 1 − δ/2 over the choice of Sm from P m , for any x ∈ X , o n 2 . |f ∗ (x) − fˆ(x)| ≤ sup |ǫ| : R̂(fˆǫ,x ) ≤ R̂(fˆ) · σδ/4 ǫ∈R Proof Define f˜ , argmax “ ” 2 −1)·R̂(fˆ) f ∈V̂ fˆ,(σδ/4 |f (x) − fˆ(x)|. Assume w.l.o.g (the other case is symmetric) that f˜(x) = fˆ(x) + a. Following Definition 6.6 we get 2 R̂(fˆa,x ) ≤ R̂(f˜) ≤ R̂(fˆ) · σδ/4 . Define o n 2 . ǫ′ = sup |ǫ| : R̂(fˆǫ,x ) ≤ R̂(fˆ) · σδ/4 ǫ∈R We thus have, sup ” “ 2 −1)·R̂(fˆ) f ∈V̂ fˆ,(σδ/4 |f (x) − fˆ(x)| = a ≤ ǫ′ . An application of Lemma 6.5 completes the proof. We conclude this section with a general result on the monotonicity of the empirical risk attained by constrained ERM regressors. Lemma 6.10 (Monotonicity) Let F be a convex hypothesis space, ℓ : Y × Y → [0, ∞), a convex loss function, and 0 ≤ ǫ1 < ǫ2 , be given. Then, ǫ1 ˆ R̂(fǫ2 ,x0 ) − R̂(fˆ) . R̂(fǫ1 ,x0 ) − R̂(fˆ) ≤ ǫ2 The result also holds for the case 0 ≥ ǫ1 > ǫ2 . 118 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Proof Define ǫ2 − ǫ1 f (x) , · fˆ(x) + ǫ2 ǫ2 − ǫ1 · fˆ(x) + = ǫ2 ′ ǫ1 ˆ · fǫ2 (x) ǫ2 ǫ1 ˆ · f (x) + ǫ2 = fˆ(x) + ǫ1 . ǫ2 Since F is convex we get that f ′ (x) ∈ F. Therefore, by the definition of fˆǫ1 , and using the convexity of ℓ together with Jensen’s inequality, we obtain, m 1 X R̂(fˆǫ1 ) ≤ R̂(f ′ ) = ℓ m i=1 ≤ ǫ2 − ǫ1 ǫ2 ǫ2 − ǫ1 ǫ1 · R̂(fˆ) + · R̂(fˆǫ2 ) ǫ2 ǫ2 ǫ1 · fˆ(xi ) + · fˆǫ2 (xi ), yi ǫ2 6.5 Selective linear regression We now restrict attention to linear least squares regression (LLSR), and, relying on Theorem 6.9 and Lemma 6.10, as well as on known closed-form expressions for LLSR, we derive efficient implementation of ǫ-LESS and a new pointwise bound. Let X be an m × d training sample matrix whose ith row, xi ∈ Rd , is a feature vector. Let y ∈ Rm be a column vector of training labels. Lemma 6.11 (ordinary least-squares estimate [41]) The ordinary least square (OLS) solution of the following optimization problem, min kXβ − yk2 , β is given by β̂ , (X T X)+ X T y, where the sign + represents the pseudoinverse. Lemma 6.12 (constrained least-squares estimate [41], page 166) Let x0 be a row vector and c a label. The constrained least-squares (CLS) solution of the following 119 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 optimization problem minimize kXβ − yk2 s.t x0 β = c, is given by β̂C (c) , β̂ + (X T X)+ xT0 (x0 (X T X)+ xT0 )+ c − x0 β̂ , where β̂ is the OLS solution. Theorem 6.13 Let F be the class of linear regressors, and let fˆ be an ERM. Then, with probability of at least 1 − δ over choices on Sm , for any test point x0 we have, kX β̂ − yk q 2 |f ∗ (x0 ) − fˆ(x0 )| ≤ σδ/4 − 1, kXKk where K = (X T X)+ xT0 (x0 (X T X)+ xT0 )+ . Proof According to Lemma 6.10, for squared loss, R̂(fˆǫ,x0 ) is strictly monotonically increasing for ǫ > 0, and decreasing for ǫ < 0. Therefore, the equation, 2 R̂(fˆǫ,x0 ) = R̂(fˆ) · σδ/4 , where ǫ is the unknown, has precisely two solutions for any σ > 1. Denoting these solutions by ǫ1 , ǫ2 we get, o n 2 = max (|ǫ1 |, |ǫ2 |) . 
That is,
\[
\sup_{\epsilon \in \mathbb{R}} \left\{ |\epsilon| : \hat{R}(\hat{f}_{\epsilon,x_0}) \leq \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 \right\} = \max\left( |\epsilon_1|, |\epsilon_2| \right).
\]
Applying Lemmas 6.11 and 6.12 and setting $c = x_0 \hat{\beta} + \epsilon$, we obtain
\[
\frac{1}{m} \left\| X \hat{\beta}_C\!\left( x_0 \hat{\beta} + \epsilon \right) - y \right\|^2 = \hat{R}(\hat{f}_{\epsilon,x_0}) = \hat{R}(\hat{f}) \cdot \sigma_{\delta/4}^2 = \frac{1}{m} \left\| X \hat{\beta} - y \right\|^2 \cdot \sigma_{\delta/4}^2 .
\]
Hence,
\[
\left\| X \hat{\beta} + X K \epsilon - y \right\|^2 = \left\| X \hat{\beta} - y \right\|^2 \cdot \sigma_{\delta/4}^2 ,
\]
so
\[
2 \left( X \hat{\beta} - y \right)^T X K \epsilon + \| X K \|^2 \epsilon^2 = \left\| X \hat{\beta} - y \right\|^2 \cdot \left( \sigma_{\delta/4}^2 - 1 \right).
\]
We note that by applying Lemma 6.11 to $(X \hat{\beta} - y)^T X$ we get
\[
\left( X \hat{\beta} - y \right)^T X = \left( X^T X (X^T X)^+ X^T y - X^T y \right)^T = \left( X^T y - X^T y \right)^T = 0 .
\]
Therefore,
\[
\epsilon^2 = \frac{\left\| X \hat{\beta} - y \right\|^2}{\| X K \|^2} \cdot \left( \sigma_{\delta/4}^2 - 1 \right).
\]
An application of Theorem 6.9 completes the proof.

6.6 Empirical results

Focusing on linear least squares regression, for which we have an efficient, closed-form implementation, we empirically evaluated the proposed method. Given a labeled dataset we randomly extracted two disjoint subsets: a training set $S_m$ and a test set $S_n$. The selective regressor $(f, g)$ is computed as follows: the regressor $f$ is an ERM over $S_m$, and for any coverage value $\Phi$, the function $g$ selects a subset of $S_n$ of size $n \cdot \Phi$ containing the test points with the lowest values of the bound in Theorem 6.13.² We compare our method to the following simple and natural 1-nearest-neighbor (NN) technique for selection. Given the training set $S_m$ and the test set $S_n$, let $NN(x)$ denote the nearest neighbor of $x$ in $S_m$, with corresponding distance $\rho(x) \triangleq \sqrt{\| NN(x) - x \|^2}$ to $x$. These $\rho(x)$ distances, computed for all $x \in S_n$, were used as an alternative method for rejecting test points, in decreasing order of their $\rho(x)$ values.

We tested the algorithm on 10 of the 14 LIBSVM [21] regression datasets, each with training sample size $m = 30$ and test set size $n = 200$. From this repository we took all sets that are not too small and have reasonable feature dimensionality.³

² We use the theorem here only for ranking test points, so any constant greater than 1 can be used instead of $\sigma_{\delta/4}^2$.
³ Two datasets having fewer than 200 samples, and two having over 150,000 features, were excluded.

Figure 6.1 shows the average absolute difference between the selective regressor $(f, g)$ and the optimal regressor $f^*$ (taken as the ERM over the entire dataset) as a function of coverage, where the average is taken over the accepted instances. Our method appears as a solid red line, and the baseline NN method as a dashed black line. Each curve point is an average over 200 independent trials (error bars represent the standard error of the mean). It is evident that for 9 out of the 10 datasets the average distance increases monotonically with coverage, and in those cases the proposed method significantly outperforms the NN baseline. For the 'year' dataset we see that the average distance is not monotone and grows at low coverage values. Further analysis suggests that the source of this 'artifact' is the high dimensionality of the dataset ($d = 90$) relative to the training sample size ($m = 30$). Clearly, in this case both the ERM and the constrained ERM can completely overfit the training samples, in which case the ranking we obtain can be noisy. To validate this hypothesis, we repeated the same experiment with $m$ growing from 30 to 200. Figure 6.2 depicts the resulting RC curves for the five cases $m = 30, 50, 100, 150,$ and $200$. Clearly, this overfitting artifact completely disappears whenever $m > d$. In those cases our method slightly outperforms the NN baseline method.
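For reference, the per-point bound of Theorem 6.13 that drives this ranking can be computed with a few lines of linear algebra. The sketch below is our own NumPy illustration, not the experiment code; the function name and the particular value of $\sigma$ are assumptions (as noted above, any constant larger than 1 yields the same ranking).

```python
import numpy as np

def pointwise_bounds(X, y, X_test, sigma=1.05):
    """Per-point bound of Theorem 6.13 for ordinary least squares (sketch).

    X: m x d training matrix, y: m training labels, X_test: n x d test points,
    sigma: multiplicative risk-bound factor (> 1).
    """
    pinv_XtX = np.linalg.pinv(X.T @ X)
    beta = pinv_XtX @ X.T @ y                   # OLS estimate (Lemma 6.11)
    resid_norm = np.linalg.norm(X @ beta - y)   # ||X beta - y||
    bounds = []
    for x0 in X_test:
        x0 = x0.reshape(1, -1)
        # K = (X^T X)^+ x0^T (x0 (X^T X)^+ x0^T)^+   (Theorem 6.13)
        K = pinv_XtX @ x0.T @ np.linalg.pinv(x0 @ pinv_XtX @ x0.T)
        bounds.append(resid_norm / np.linalg.norm(X @ K) * np.sqrt(sigma ** 2 - 1))
    return np.array(bounds)

# Accept the test points with the smallest bounds first (lowest uncertainty):
# order = np.argsort(pointwise_bounds(X_train, y_train, X_test))
```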
The graphs in Figure 6.3 show the RC curves obtained by our method and by the NN baseline method; that is, each curve is the test error of the selective regressor as a function of coverage. In 7 of the 10 datasets our method outperforms the baseline in terms of test error. In two datasets our method is statistically indistinguishable from the NN baseline. In one dataset ('year') our method fails at low coverage values (see the discussion above on overfitting and dimensionality).

Figure 6.1: Absolute difference between the selective regressor $(f, g)$ and the optimal regressor $f^*$. Our proposed method in solid red and the baseline method in dashed black. All curves on a logarithmic y-scale. (One panel per dataset: abalone, bodyfat, cadata, cpusmall, eunite, housing, mg, mpg, space, and year; horizontal axis: coverage $c$; vertical axis: $|f^* - f|$.)

Figure 6.2: Test error of the selective regressor $(f, g)$ trained on the 'year' dataset with sample sizes of 30, 50, 100, 150, and 200. Our proposed method in solid red and the baseline method in dashed black. All curves on a logarithmic y-scale. (Horizontal axis: coverage $c$; vertical axis: $R(f, g)$.)

Figure 6.3: Test error of the selective regressor $(f, g)$. Our proposed method in solid red and the baseline method in dashed black. All curves on a logarithmic y-scale. (One panel per dataset: abalone, bodyfat, cadata, cpusmall, eunite, housing, mg, mpg, space, and year; horizontal axis: coverage $c$; vertical axis: $R(f, g)$.)

Chapter 7

Future Directions

"A: Would you tell me, please, which way I ought to go from here?
C: That depends a good deal on where you want to get to.
A: I don't much care where.
C: Then it doesn't much matter which way you go.
A: ...So long as I get somewhere.
C: Oh, you're sure to do that, if only you walk long enough."
Lewis Carroll, Alice in Wonderland

In this chapter we present a number of open questions that motivate directions for future research. Our focus is on the larger fundamental questions rather than on technical or incremental improvements of the present results. Nevertheless, for those who are interested, the list of possible technical improvements includes tightening the existing bounds, extending the results to other loss functions, and extending the selective regression results to additive concentration inequalities.
7.1 Beyond pointwise competitiveness

In this work we analyzed necessary and sufficient conditions for pointwise competitiveness. Our goal was to match the predictions of the best hypothesis in the class over the accepted domain. As a result, we were able to control the estimation error,
\[
\Delta_1 \triangleq R(f, g) - R(f^*, g),
\]
and "instantly" attenuate it to zero by compromising coverage. However, one can consider other optimization objectives. For example, for a given selective classifier $(f, g)$, we might want to bound the difference between its risk and the risk of the best hypothesis in the class, restricted to the accepted domain. Namely, we might want to bound and reduce the excess loss,
\[
\Delta_2 \triangleq R(f, g) - \inf_{f' \in \mathcal{F}} R(f', g).
\]
This is considerably more challenging than reducing $\Delta_1$ to zero. Can we bound $\Delta_2$? Can we reduce $\Delta_2$ to zero and achieve a stronger pointwise competitiveness, one that ensures a match between the selective predictions and those of the predictor defined by $\operatorname{argmin}_{f' \in \mathcal{F}} R(f', g)$ over the accepted domain?

Bounding $\Delta_2$ guarantees that the selective classifier $(f, g)$ is the best classifier that can be obtained using a prediction function from $\mathcal{F}$ and the selection function $g$. We can also compare the risk of the selective predictor with the risk of any selective hypothesis using a prediction function from $\mathcal{F}$ and any selection function with the same coverage as $g$. This amounts to bounding the excess loss,
\[
\Delta_3 \triangleq R(f, g) - \inf_{f' \in \mathcal{F},\; g' : \mathcal{X} \to \{0,1\}} \left\{ R(f', g') \mid \mathbf{E}(g') = \mathbf{E}(g) \right\}.
\]
Finally, another interesting and natural objective would be to bound the difference between the risk of $(f, g)$ and the risk of the Bayes selective classifier, a selective classifier based on Chow's decision rule with the same coverage as $g$. Namely, the goal is to bound
\[
\Delta_4 \triangleq R(f, g) - \inf_{f',\; g' : \mathcal{X} \to \{0,1\}} \left\{ R(f', g') \mid \mathbf{E}(g') = \mathbf{E}(g) \right\}.
\]
It follows straightforwardly from the above definitions that $\Delta_1 \leq \Delta_2 \leq \Delta_3 \leq \Delta_4$. Therefore, the lower bounds derived in this work for $\Delta_1$ are relevant for all the rest as well. However, upper bounds are yet to be found.

7.2 Beyond linear classifiers

The characterizing set complexity has been shown to be an important measure for analyzing coverage rates in selective classification, as well as label complexity in active learning. Indeed, at present, for a number of interesting settings (hypothesis classes and distributions), this is the only known technique able to prove an exponential label complexity speedup. In this work we focused on linear classifiers under a fixed mixture of Gaussians and on axis-aligned rectangles under product distributions. Is it possible to extend our results beyond linear classifiers and axis-aligned rectangles? One possible direction is to analyze the behavior of the characterizing set complexity under hypothesis class operations such as $k$-fold intersections and $k$-fold unions. A similar technique has been applied successfully in the contexts of VC theory [18, Lemma 3.2.3] and the disagreement coefficient [49]. Such a result would be very attractive in our context. For example, given results for $k$-fold unions, we could extend our results from axis-aligned rectangles to decision trees (of limited depth). A result for $k$-fold intersections would allow us to extend our results from linear classifiers to polygons.
7.3 Beyond mixtures of Gaussians

The proof of Corollary 3.15 implies that whenever the version space compression set size $\hat{n}$ grows slowly with respect to the number of training examples $m$, a fast coverage rate for linear classifiers is achievable. In particular, if $\hat{n} = \mathrm{polylog}(m)$ with high probability, then fast rates are guaranteed. In this work we have proven that $\hat{n} = \mathrm{polylog}(m)$ for any fixed mixture of Gaussians. Let $S^+$ be the set of positive training examples. In the proof of Lemma 3.14 we show that if the number of vertices of the convex hull of $S^+$ is $O(\mathrm{polylog}(m))$, then $\hat{n} = \mathrm{polylog}(m)$. Therefore, our analysis for the mixture-of-Gaussians case can easily be extended to other distributions for which the number of convex hull vertices is small. Specifically, any distribution that can be generated from a mixture of Gaussians by a transformation that preserves convex hull vertices (e.g., a projective quasi-affine transformation [52]) will also guarantee fast coverage rates and an exponential label complexity speedup. Can we prove such results for other (mixtures of) well-known multivariate parametric families, such as Dirichlet or Cauchy distributions? Is it possible to characterize the properties of transformations that preserve the number of convex hull vertices?

7.4 On computational complexity

Implementations of the LESS strategy require efficient calculations of ERM and restricted ERM. For linear classifiers this problem reduces to a variant of the MAX FLS and C MAX FLS problems with $\geq$ and $>$ relations [2]. Not only is finding an optimal solution for these problems NP-hard, but they also cannot be approximated to within arbitrary factors. MAX FLS with strict and nonstrict inequalities is APX-complete (within 2); therefore, it can be approximated within a factor of 2, but not within every constant factor [2]. C MAX FLS is MAX IND SET-hard, so it cannot be efficiently approximated at all. Ben-David et al. extended these results to other hypothesis classes, including axis-aligned hyper-rectangles, and showed that approximating the ERM for this class is NP-hard [14]. Do these results imply that we have reached the end of the road for efficient implementation (or approximation) of LESS? As pointed out by Ben-David et al., for each of the concept classes they considered there exists a subset of its legal inputs for which the maximum agreement problem can be solved in polynomial time. Do the above disappointing complexity results hold, with high probability, for datasets generated by mixtures of Gaussians? What about other source distributions? What about discrete variables? In selective regression we have been able to derive an efficient implementation of $\epsilon$-LESS for LLSR. Can we extend this result to different kernels? Can efficient implementations of LESS (in the agnostic selective classification setting) be developed for other interesting hypothesis classes?

7.5 Other applications for selective prediction

The most obvious and most studied use of selective prediction is 'classification with a reject option', the goal of which is to reduce the generalization risk of classifiers. However, selective prediction can be used for other purposes as well. Let us briefly describe three such applications: learning non-stationary time series, big data analysis, and reverse engineering.

Learning a non-stationary time series is a challenging task since the underlying distribution is constantly changing.
Therefore, training based on distant history is not relevant and the number of training examples is inherently limited. In this 129 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 setting, selective prediction can serve as a powerful tool. Looking at short time windows, we can obtain pointwise predictions close to the best hypothesis in the class and achieve good prediction accuracy. With the explosion of data generated every day, it is not surprising that training on massive datasets has become a major problem in machine learning. In this setting, labeled training examples are in abundance; therefore it seems that selective prediction brings no real value. However, can we subsample the dataset and train multiple selective predictors on its different subsets? Since pointwise competitiveness is guaranteed, fusion of multiple selective predictors is trivial. But how does the coverage of the aggregated classifier behave? How do we decide on the subset size? What is the optimal subsampling method? Machine learning is becoming an important part of modern corporate competitive strategy. “Knowledge is power” and companies perceive their datasets as valuable company assets. In many cases the “secret sauce” is not the type of learning algorithm or the hypothesis class but the training data itself. Selective prediction introduces a threat for those companies. Using the “public” predictions of the model, competitors can train a selective predictor and achieve the same performance (on the accepted domain). Knowing the hypothesis class and training on the model predictions falls into the realizable selective classification setting, where fast coverage rates have been proven for interesting distributions. 130 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Appendix A Proofs Proof of Lemma 2.3 Let S + ⊆ Sm be the set of all positive samples in Sm , and S − ⊆ Sm be the set of all negative samples. Let x̄0 ∈ R+ . There exists a hypothesis fw̄,φ (x̄) such that ∀ x̄ ∈ S + , ∀ x̄ ∈ S − , w̄T x̄ − φ ≥ 0; w̄T x̄ − φ < 0, and w̄T x̄0 − φ ≥ 0. Let’s assume that x̄0 6∈ R̃+ . Then, there exists a hypothesis f˜w̄′ ,φ′ (x̄) such that ∀ x̄ ∈ S + , ∀ x̄ ∈ S − , w̄′T x̄ − φ′ ≥ 0; w̄′T x̄ − φ′ ≤ 0, and w̄′T x̄0 − φ′ < 0. Defining w̄0 , w̄ + αw̄′ , where φ0 , φ + αφ′ , T w̄ x̄0 − φ , α > ′T w̄ x̄0 − φ′ 131 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 we deduce that there exists a hypothesis fw̄0 ,φ0 (x̄) such that ∀ x̄ ∈ S + , w̄0T x̄ − φ0 ≥ 0; ∀ x̄ ∈ S − , w̄0T x̄ − φ0 < 0, and w̄0T x̄0 − φ0 = w̄T x̄0 − φ + α w̄′T x̄0 − φ′ = w̄T x̄0 − φ − α w̄′T x̄0 − φ′ < w̄T x̄0 − φ − w̄T x̄0 − φ = 0. Therefore, x̄0 6∈ R+ . Contradiction. Hence, x̄0 ∈ R̃+ and R+ ⊆ R̃+ . The proof that R− ⊆ R̃− follows the same argument. To prove that R̃+ ⊆ R+ , we look at V SF̃ ,Sm : ∀f˜w̄,φ ∈ V SF̃ ,Sm , x̄ ∈ R̃+ w̄T x̄ − φ ≥ 0. We observe that if fw̄,φ ∈ V SF ,Sm , then f˜w̄,φ ∈ V SF̃ ,Sm . Therefore, ∀fw̄,φ ∈ V SF ,Sm , x̄ ∈ R̃+ w̄T x̄ − φ ≥ 0. Hence, R̃+ ⊆ R+ . It remains to prove that R̃− ⊆ R− . Assuming x̄0 6∈ R− implies that there exists a hypothesis fw̄,φ (x̄) such that ∀ x̄ ∈ S + , ∀ x̄ ∈ S − , w̄T x̄ − φ ≥ 0; w̄T x̄ − φ < 0, and w̄T x̄0 − φ ≥ 0. Defining1 w̄0 , w̄, 1 T φ0 , φ − max w̄ x̄ − φ , x̄∈S − If S − is an empty set we can arbitrarily define φ0 , φ − 1. 132 Technion - Computer Science Department - Ph.D. 
Thesis PHD-2013-12 - 2013 we conclude that there exists a hypothesis f˜w̄0 ,φ0 (x̄) such that ∀ x̄ ∈ S + ∀ x̄ ∈ S − w̄0T x̄ − φ0 ≥ 0; w̄0T x̄ and T − φ0 ≤ max w̄ x̄ − φ + max w̄ x̄ − φ = 0, T x̄∈S − x̄∈S − w̄0T x̄0 − φ0 > 0. Therefore, x̄0 6∈ R̃− , so R̃− ⊆ R− . Proof of Lemma 2.4 According to Lemma 2.3, R+ = R̃+ and R− = R̃− . Therefore, we can restrict our discussion to the hypothesis class F̃. Due to the symmetry of the hypothesis class F̃ we will concentrate only on the positive region R+ . Set G , V SF̃,Sm . By definition, R̃+ = \ ′ fw̄,φ , ′ ∈G fw̄,φ ′ ′ obtains the denotes the region in X for which the linear classifier fw̄,φ where fw̄,φ value one or zero. Let fw̄,φ ∈ G be a half-space with k < d points on its boundary. We will prove that there exist two half-spaces in G (fw̄1 ,φ1 , fw̄2 ,φ2 ) such that each has at least k + 1 samples on its boundary and fw̄,φ Therefore, \ fw̄1,φ1 \ R̃+ = fw̄2 ,φ2 = fw̄1 ,φ1 \ \ fw̄2 ,φ2 . ′ fw̄,φ . ′ ∈G\{fw̄,φ } fw̄,φ Repeating this process recursively with every half-space in G, with less than d points on its boundary, completes the proof. Before proceeding with the rigorous analysis let’s review the main idea behind the proof. If a half-space in Rd has less than d points on its boundary, it has at least one degree of freedom. Rotating the half-space clockwise or counterclockwise around a specific axis (defined by the points on the boundary) by sufficiently small angles will maintain correct classification over Sm . We will rotate the half- 133 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 space clockwise and counterclockwise until “touching” the first point in Sm on each direction. This operation will maintain correct classification but will result in having one additional point on the boundary. Then we only have to show that the intersection of the three half-spaces (original and two rotated ones) is the same as the intersection of the two rotated ones. Let fw̄,φ ∈ G be a half-space with k < d points on its boundary. Without loss 0 , {x̄ , x̄ , ..., x̄ }. For the sake of of generality assume that these points are Sm 1 2 k simplicity we will first translate the space such that x̄1 will lie on the origin. Since x̄1 is on the boundary of the half-space we get w̄T x̄1 − φ = 0 =⇒ φ = w̄T x̄1 . Therefore, 0 ∀x̄ ∈ Sm 0 = w̄T x̄ − φ = w̄T x̄ − w̄T x̄1 = w̄T (x̄ − x̄1 ). Hence, the weight vector w̄ is orthogonal to all the translated samples (x̄1 − x̄1 ), . . . , (x̄k − x̄1 ). We now have k < d vectors in Rd (including the weight vector) so we can always find at least one vector v̄ which is orthogonal to all the rest. We now rotate the translated samples around the origin so as to align the vector w̄ with the first axis, and align the vector v̄ with the second axis. From now on all translated and rotated coordinates and vectors will be marked with prime. Define the following rotation matrix in Rd , cos θ sin θ 0 0 . . . − sin θ cos θ 0 0 . . . . 0 0 1 0 . . . Rθ , 0 0 0 1 .. .. .. .. . . . . We can now define two new half-spaces in the translated and rotated space, fRα w̄′ ,0 and fR−β w̄′ ,0 , where α = max {α′ ′ 0<α ≤π | ∀x̄′ ∈ Sm (w̄′T x̄′ ) · (Rα′ w̄′ )T x̄′ ≥ 0}, 134 (A.1) Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 and β = max {β ′ ′ 0<β ≤π | (w̄′T x̄′ ) · (R−β ′ w̄′ )T x̄′ ≥ 0}. ∀x̄′ ∈ Sm (A.2) According to Claim B.6, both fRα w̄′ ,0 and fR−β w̄′ ,0 correctly classify Sm and have at least k + 1 samples on their boundaries. Now we examine the intersection of fRα w̄′ ,0 and fR−β w̄′ ,0 . 
According to Claim B.7, if (Rα w̄′ )T x̄′ ≥ 0 and (R−β w̄′ )T x̄′ ≥ 0, then w̄′T x̄′ ≥ 0. The intersection of fRα w̄′ ,0 , fR−β w̄′ ,0 and fw̄′ ,0 thus equals the intersection of fRα w̄′ ,0 and fR−β w̄′ ,0 , as required. P Proof of Lemma 4.4 using super martingales Define Wk , ki=1 (Zi − bi ). We assume that with probability of at least 1 − δ/2, Pr{Zi |Z1 , . . . , Zi−1 } ≤ bi , simultaneously for all i. Since Zi is a binary random variable it is easy to see that (w.h.p.), EZi {Wi |Z1 , . . . , Zi−1 } = Pr{Zi |Z1 , . . . , Zi−1 } − bi + Wi−1 ≤ Wi−1 , and the sequence W1m , W1 , . . . , Wm is a super-martingale with high probability. We apply the following theorem by McDiarmid that refers to martingales (but can be shown to apply to super-martingales, by following its original proof). Theorem A.1 ([69], Theorem 3.12) Let Y1 , . . . , Yn be a martingale difference seP ak . Then, for any quence with −ak ≤ Yk ≤ 1 − ak for each k; let A = n1 ǫ > 0, o nX Anǫ2 . Yk ≥ Anǫ ≤ exp (−[(1 + ǫ) ln(1 + ǫ) − ǫ]An) ≤ exp − Pr 2(1 + ǫ/3) In our case, Yk = Wk − Wk−1 = Zk − bk ≤ 1 − bk and we apply the (revised) P theorem with ak , bk and An , bk , B. We thus obtain, for any 0 < ǫ < 1, Pr nX o Bǫ2 Zk ≥ B + Bǫ ≤ exp − 2(1 + ǫ/3) 135 . Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Equating the right-hand side to δ/2, we obtain ǫ = ! 2 4 22 /2B ln + 8B ln 9 δ δ ! r r 1 22 2 /B + ln + 2B ln 9 δ δ ! r 2 /B. + 2B ln δ 2 2 ln ± 3 δ ≤ 1 2 ln 3 δ = 2 2 ln 3 δ r Applying the union bound completes the proof. 136 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Appendix B Technical Lemmas Lemma B.1 (Bernstein’s inequality [57]) Let X1 , . . . , Xn be independent zeromean random variables. Suppose that |Xi | ≤ M almost surely, for all i. Then for all positive t, ! n 2 X t /2 . Xi > t ≤ exp − P h i Pr E Xj2 + M t/3 i=1 Lemma B.2 (binomial tail inversion lower bound) For k > 0 and δ ≤ 12 , 4 1 k − ln Bin(m, k, δ) ≥ min 1, 2m 3m 1 − δ . Proof Let Z1 , . . . Zm be independent Bernoulli random variables each with a success probability 0 ≤ p ≤ 1. Setting Wi , Zi − p, ! ! m m X X Zi > k Zi ≤ k = 1 − Pr Bin(m, k, p) = Pr Z1 ,...,Zm ∼B(p)m = 1 − Pr m X i=1 i=1 ! i=1 Wi > k − mp . Clearly, E [Wi ] = 0, |Wi | ≤ 1, and E Wi2 = p·(1−p)2 +(1−p)·p2 = p·(1−p). 137 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Using Lemma B.1 (Bernstein’s inequality) we thus obtain, (k − mp)2 /2 Bin(m, k, p) ≥ 1 − exp − mp(1 − p) + (k − mp)/3 ! . Since (1 − p) ≤ 1, (k − mp)2 /2 mp(1 − p) + (k − mp)/3 ≥ ≥ (k − mp)2 (k − mp)2 = 2 4 2mp + 23 · (k − mp) 3 mp + 3 · k k2 − 2mpk . + 23 · k 4 3 mp Therefore, − Bin(m, k, p) ≥ 1 − e k2 −2mpk 4 mp+ 2 ·k 3 3 . Equating the right-hand side to δ and solving for p, we have p≤ k − 32 t k , · 2m k + 32 · t 1 . Choosing where t , ln (1−δ) ( 1 1 8 k p = min 1, · k − ln = min 1, 2m 3 1−δ 2m using the fact that k ≥ 1 > √ 2 2 3 1 ln 1/2 ≥ p≤ √ 2 2 3 t, 1−4· 2 3t k !) , and applying Lemma B.3, we get k − 23 t k · . 2m k + 23 · t Therefore, Bin(m, k, p) ≥ δ. Since Bin(m, k, δ) = maxp {p : Bin(m, k, p) ≥ δ}, we conclude that k 4 1 Bin(m, k, δ) ≥ p = min 1, − ln . 2m 3m 1 − δ 138 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 √ Lemma B.3 For u > 2v > 0, u−v v ≥1−4· . u+v u Proof u−v 2v u−v u = 1− = 1 − 2v 2 ≥ 1 − 2v 2 . 2 u+v u+v u −v u − v2 √ Since u > 2v, we have u2 . u2 − v 2 > 2 Applying to the previous inequality completes the proof. 
Lemma B.4 For any m ≥ 3, a ≥ 1, b ≥ 1 we get m a X ln (bi) i i=1 Proof Setting f (x) , lna (bx) , x < 4 a+1 ln (b(m + 1)). a we have lna−1 (bx) df = (a − ln bx) · . dx x2 Therefore, f is monotonically increasing when x < ea /b, monotonically decreasing function when x ≥ ea /b and its attains its maximum at x = ea /b. Consequently, for i < ea /b − 1, or i ≥ ea /b + 1, f (i) ≤ Z i+1 f (x)dx. x=i−1 For ea /b − 1 ≤ i < ea /b + 1, f (i) ≤ f (ea /b) = b Therefore, if m < ea − 1 we have, m X i=1 a f (i) = ln (b) + m X i=2 f (i) < 2 · Z a a e ≤ aa . m+1 x=1 139 f (x)dx ≤ (B.1) 2 lna+1 (b(m + 1)). a+1 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Otherwise, m ≥ ea /b, in which case we overcome the change of slope by adding twice the (upper bound on the) maximal value (B.1), m X f (i) < i=1 ≤ 2 2 2 lna+1 (b(m + 1)) + 2aa = lna+1 (b(m + 1)) + aa+1 a+1 a+1 a 2 2 4 lna+1 (b(m + 1)) + lna+1 bm ≤ lna+1 (b(m + 1)). a+1 a a Lemma B.5 Let S1 and S2 be two sets in Rd . Then, H(S1 ∪ S2 ) ≤ H(S1 ) + H(S2 ), where H(S) is the number of convex hull vertices of S. Proof Assume x ∈ S1 ∪ S2 is a convex hull vertex of S1 ∪ S2 . Then, there is a half-space (w, φ) such that, w · x − φ = 0, and any other y ∈ S1 ∪ S2 satisfies w · y − φ > 0. Assume w.l.o.g. that x ∈ S1 . Then it is clear that any y ∈ S1 satisfies w · y − φ > 0. Therefore, x is a convex hull vertex of S1 . Claim B.6 Both fRα w̄′ ,0 and fR−β w̄′ ,0 correctly classify Sm and have at least k+1 samples on their boundaries. Proof We note that after translation all half-spaces pass through the origin, so φ′ = 0. Recall the definitions of α and β as maximums (Equations (A.1) and (A.2), respectively). We show that the maximums over α′ and β ′ are well defined. Let x̄′ = (x′1 , x′2 , · · · x′d )T . Since w̄′ = (1, 0, 0, · · · )T we get that Rα w̄′ = (cos α, − sin α, 0, . . .)T and 2 (w̄′T x̄′ ) · (Rα′ w̄′ )T x̄′ = x′1 cos α − x′ 1 · x′ 2 · sin α. Since Sm is a spanning set of Rd , at least one sample has a vector with component x′1 6= 0. As all components are finite and x′1 2 > 0, we can always find a sufficiently small α′ such that x′1 2 cos α′ − x′ 1 · x′ 2 · sin α′ > 0. Hence, the maximum exists. Furthermore, for α′ = π we have x′1 2 cos π −x′ 1 ·x′ 2 ·sin π = −x′1 2 < 0. Noticing that x′1 2 cos α − x′ 1 · x′ 2 · sin α is continuous in α, and applying the intermediate value theorem, we know that 0 < α < π and x′1 2 cos α − x′ 1 · x′ 2 · sin α = 0. Therefore, there exists a sample in Sm that is not on the boundary of fw̄′ ,0 (since 140 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 0 are orthogonal x′1 6= 0) but on the boundary of fRα w̄′ ,0 . Recall that all points in Sm ′ T T to w̄ = (1, 0, 0, · · · ) and v̄ = (0, 1, 0, · · · ) . Therefore, 0 ∀x̄′ ∈ Sm (Rα w̄′ )T x̄′ = x′1 · cos α − x′ 2 · sin α = w̄′T x̄ · cos α − v̄ ′T x̄· sin α = 0, and they reside on the boundary of fRα w̄′ ,0 . Overall, fRα w̄′ ,0 correctly classifies Sm and has at least k + 1 samples on its boundary. The same argument applies for β by symmetry. Claim B.7 Using the notation introduced in the proof of Lemma 2.4, if (Rα w̄′ )T x̄′ ≥ 0 and (R−β w̄′ )T x̄′ ≥ 0, then w̄′T x̄′ ≥ 0. Proof If (Rα w̄′ )T x̄′ ≥ 0 and (R−β w̄′ )T x̄′ ≥ 0, then ( x′1 cos α − x′2 · sin α ≥ 0, x′1 cos β + x′2 · sin β ≥ 0. Multiplying the first inequality by sin β > 0, the second inequality by sin α > 0, and adding the two we have sin(α + β) · x′1 ≥ 0. According to Claim B.8 below, sin(α+β) ≥ 0. If sin(α+β) = 0, then (α+β) = π and cos(α + β) = −1. 
Using the trigonometric identities cos(α − β) = cos α cos β + sin α sin β; sin(α − β) = sin α cos β − cos α sin β, we get that cos β = cos(β + α − α) = cos(α + β) · cos α + sin(α + β) · sin α = − cos α, and sin β = sin(β + α − α) = sin(α + β) · cos α − cos(α + β) · sin α = sin α. Therefore, for any x′1 cos α − x′2 · sin α > 0, it holds that x′1 cos β + x′2 · sin β < 0 141 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 and R̃+ is degenerated. Contradiction to the fact that Sm is a spanning set of the Rd . Therefore, sin(α + β) > 0, x′1 ≥ 0 and w̄′T x̄′ ≥ 0. Claim B.8 Using the notation introduced in the proof of Lemma 2.4, sin(α + β) ≥ 0. Proof By definition we get that for all samples in Sm , ( x′1 2 cos α − x′1 · x′2 · sin α ≥ 0, x′1 2 cos β + x′1 · x′2 · sin β ≥ 0. Multiplying the first inequality by sin β > 0 (0 < β < π), the second inequality by sin α > 0, adding the two, and using the trigonometric identity sin(α + β) = sin α cos β + cos α sin β, we have 2 sin(α + β) · x′1 ≥ 0. Since there is a sample in Sm with a vector component x′1 6= 0, we conclude that sin(α + β) ≥ 0. 142 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Appendix C Extended Minimax Model We define a learning algorithm ALG to be a (random) mapping from a sample Sm to selective predictor (f, g). We evaluate learners with respect to their coverage and risk and derive both positive and negative results on achievable risk and coverage. Our model is a slight extension of the standard minimax model for standard statistical learning as described, e.g., by Antos and Lugosi [4]. Thus, we consider the following game between the learner and an adversary. The parameters of the game are a domain X and an hypothesis class F. 1. A tolerance level δ and a training sample size m are given. 2. The learner chooses a learning algorithm ALG . 3. With full knowledge of the learner’s choice, the adversary chooses a distribution P (X, Y ) over X × Y. 4. A training sample Sm is drawn i.i.d. according to P . 5. ALG is applied on Sm and outputs a selective predictor (f, g). The result of the game is evaluated in terms of the risk and coverage obtained by the chosen selective predictor and clearly, these are random quantities that trade-off each other. A positive result in this model is a pair of bounds, BR = BR (F, δ, m) and BΦ = BΦ (F, δ, m), for risk and coverage, respectively, that for any δ and m, hold with high probability, of at least 1 − δ for any distribution P ; namely, Pr {R(f, g) ≤ BR ∧ Φ(f, g) ≥ BΦ } ≥ 1 − δ. 143 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 The probability is taken w.r.t. the random choice of training samples Sm , as well as w.r.t. all other random choices introduced, such as a random choice of (f, g) by ALG (if applicable), and the randomized selection function. A negative result is a probabilistic statement on the impossibility of any positive result. Thus, in its most general form a negative result is a pair of bounds BR and BΦ that, for any δ, satisfy Pr {R(f, g) ≥ BR ∨ Φ(f, g) ≤ BΦ } ≥ δ, for some probability P . Here again, probability is taken w.r.t. the random choice of the training samples Sm , as well as w.r.t. all other random choices. For a selective predictor (f, g) with coverage Φ(f, g) we can specify a RiskCoverage (RC) trade-off as a bound on the risk R(f, g), expressed in terms of Φ(f, g). 
Thus, a positive result on the RC trade-off is a probabilistic statement of the following form Pr {R(f, g) ≤ B(Φ(f, g), δ, m)} ≥ 1 − δ. Similarly, a negative result on the RC trade-off is a statement of the form, Pr {R(f, g) ≥ B(Φ(f, g), δ, m)} ≥ δ. Clearly, all results (positive and negative) are qualified by the model parameters, namely the domain X and the hypothesis space F, and the quality/generality of a result should be assessed w.r.t. generality of these parameters. 144 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Appendix D Realizable Selective Classification: RC trade-off analysis The performance of any selective classifier (f, g) can be characterized by its risk and coverage. This can be depicted as a point in the risk-coverage (RC) plane. In Figure D.1 we schematically depict elements of the risk-coverage (RC) trade-off. The x-axis is coverage and the y-axis is risk on the accepted domain (error in the case of the 0/1 loss). The entire region depicted, called the RC plane, consisting of all (r, c) points in the rectangle of interest, where r is a risk (error) coordinate and c is a coverage coordinate. Assume a fixed problem setting (including an unknown underlying distribution P , m training examples drawn i.i.d. from P , an hypothesis class F and a tolerance parameter δ). To fully characterize the RC trade-off we need to determine for each point (r, c) on the RC plane if it is (efficiently) “achievable.” We say that (r, c) is (efficiently) achievable if there is an (efficient) learning algorithm that will output a selective classifier (f, g) such that with probability of at least 1 − δ, its coverage is at least c and its risk is at most r. Notice that point r ∗ (the coordinate (r ∗ , 1)) where the coverage is 1 represents “standard learning.” At this point we require full coverage with certainty and the achievable risk represents the lowest possible risk in our fixed setting (which should be achievable with probability of at least 1 − δ). Point r ∗ represent one extreme of the RC trade-off. The other extreme of the RC trade-off is point c∗ , where we require zero risk with certainty. The coverage at c∗ is the optimal (highest possi- 145 Technion - Computer Science Department - Ph.D. Thesis PHD-2013-12 - 2013 Figure D.1: The RC plane and RC trade-off ble) in our setting when zero error is required. We call point c∗ perfect learning because achievable perfect learning means that we can generate a classifier that never errs with certainty for the problem at hand. Note that at the outset, it is not at all clear if non-trivial perfect learning (with guaranteed positive coverage) can be accomplished. The full RC trade-off is some (unknown) curve connecting points c∗ and r ∗ . This curve passes somewhere in the zone labeled with a question mark and represents optimal selective classification. Points above this curve (e.g., at zone A) are achievable. Points below this curve (e.g., at zone B) are not achievable. In this appendix we study the RC curve in the realizable setting and provide as tight as possible boundaries for it. To this end we characterize upper and lower envelopes of the RC curve as schematically depicted in Figure D.1. The upper envelop is a boundary of an “achievable zone” (zone A) and therefore we consider any upper envelop as a “positive result.” The lower envelop is a boundary of a “non- 146 Technion - Computer Science Department - Ph.D. 
Note that upper and lower envelopes, as depicted in the figure, represent two different things, which are formally defined in Appendix C as probabilistic statements on possibility and impossibility.

In Chapter 3 we showed that by compromising the coverage we can achieve zero risk. This is in contrast to the classical setting, where we compromise risk to achieve full coverage. Is it possible to learn a selective classifier with full control over this trade-off? What are the performance limitations of such trade-off control? In this appendix we present some answers to these questions, thus deriving lower and upper envelopes for the risk-coverage (RC) trade-off. These results rely heavily on the previous results on perfect learning and on classical results on standard learning without rejection. The envelopes are obtained by interpolating bounds on these two extreme types of learning.

We begin by deriving an upper envelope; that is, we introduce a strategy that can control the RC trade-off. Our upper RC envelope is facilitated by the following strategy, which is a generalization of the consistent selective strategy (CSS) of Definition 3.1.

Definition D.1 (controllable selective strategy) Given a mixing parameter 0 ≤ α ≤ 1, the controllable selective strategy chooses a selective classifier (f, g) such that f is in the version space VSF,Sm (as in CSS), and g is defined as follows: g(x) = 1 for any x in the maximal agreement set A with respect to VSF,Sm, and g(x) = α for any x ∈ X \ A.

Clearly, CSS is the special case of the controllable selective strategy obtained with α = 0, and standard consistent learning (in the classical setting) is the special case obtained with α = 1. We now state a well-known (and elementary) upper bound for classical realizable learning.

Theorem D.1 ([53]) Let F be any finite hypothesis class. Let f ∈ VSF,Sm be a classifier chosen by any consistent learner. Then, for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,

R(f) ≤ (1/m) (ln |F| + ln(1/δ)),

where R(f) is the standard risk (true error) of the classifier f.

The following result provides a distribution-independent upper bound on the risk of the controllable selective strategy as a function of its coverage.

Theorem D.2 (upper envelope) Let F be any finite hypothesis class. Let (f, g) be a selective classifier chosen by a controllable selective learner after observing a training sample Sm. Then, for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,

R(f, g) ≤ [(1 − Φ0/Φ(f, g)) / (1 − Φ0)] · (1/m) (ln |F| + ln(2/δ)),

where

Φ0 ≜ 1 − (1/m) ((ln 2)|F| + ln(2/δ)).

Proof For any controllable selective learner with a mixing parameter α we have

Φ(f, g) = E[g(X)] = E[I(g(X) = 1)] + α · E[I(g(X) ≠ 1)].

By Theorem 3.2, with probability of at least 1 − δ/2,

E[I(g(X) = 1)] ≥ 1 − (1/m) ((ln 2)|F| + ln(2/δ)) ≜ Φ0.

Therefore, since Φ(f, g) ≤ 1,

α = (Φ(f, g) − E[I(g(X) = 1)]) / (1 − E[I(g(X) = 1)]) ≤ (Φ(f, g) − Φ0) / (1 − Φ0).

Using the law of total expectation we get

E[ℓ(f(X), Y) · g(X)] = E[ℓ(f(X), Y) · g(X) | g(X) = 1] · Pr(g(X) = 1) + E[ℓ(f(X), Y) · g(X) | g(X) = α] · Pr(g(X) = α)
= α · E[ℓ(f(X), Y) | g(X) = α] · Pr(g(X) = α)
= α · E[ℓ(f(X), Y)],

where the first conditional term vanishes because f, being in the version space, does not err on the maximal agreement set, on which g(X) = 1. According to Definition 1.4,

R(f, g) = E[ℓ(f(X), Y) · g(X)] / Φ(f, g) = α · E[ℓ(f(X), Y)] / Φ(f, g) = α · R(f) / Φ(f, g).

Applying Theorem D.1 together with the union bound completes the proof.
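For a small finite hypothesis class, the controllable selective strategy of Definition D.1 can be realized by direct enumeration of the version space and of its maximal agreement set. The following Python sketch is only an illustration of the definition: the threshold hypothesis class, the toy data, and the brute-force enumeration are hypothetical choices, not the implementation used elsewhere in this thesis for infinite classes.

import numpy as np

def controllable_selective_classifier(hypotheses, X_train, y_train, alpha):
    """Definition D.1 for a finite class: f is any hypothesis consistent with
    the sample; g(x) = 1 on the maximal agreement set of the version space,
    and g(x) = alpha elsewhere. alpha = 0 recovers CSS, alpha = 1 recovers
    standard consistent learning."""
    # Version space: hypotheses with zero empirical error (realizable setting).
    version_space = [h for h in hypotheses
                     if all(h(x) == y for x, y in zip(X_train, y_train))]
    assert version_space, "empty version space: the sample is not realizable by the class"
    f = version_space[0]                               # any consistent hypothesis

    def g(x):
        preds = {h(x) for h in version_space}
        return 1.0 if len(preds) == 1 else alpha       # all agree -> accept; otherwise weight alpha
    return f, g

# Hypothetical usage: 1-D threshold classifiers on a small grid, realizable labels.
if __name__ == "__main__":
    thresholds = np.linspace(-1, 1, 41)
    hypotheses = [lambda x, t=t: int(x > t) for t in thresholds]
    target = hypotheses[26]                            # the "true" hypothesis (threshold 0.3)
    rng = np.random.default_rng(1)
    X_train = rng.uniform(-1, 1, size=20)
    y_train = np.array([target(x) for x in X_train])
    f, g = controllable_selective_classifier(hypotheses, X_train, y_train, alpha=0.25)
    print([g(x) for x in (-0.9, 0.29, 0.31, 0.9)])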
We now present a negative result, which identifies a region of non-achievable coverage-risk trade-offs on the RC plane. The statement is a probabilistic lower bound on the risk of any selective classifier, expressed as a function of its coverage. It negates any high-probability upper bound on the risk of the classifier (where the probability is over the choice of Sm and the target hypothesis).

Theorem D.3 (non-achievable risk-coverage trade-off) Let F be any hypothesis class with VC-dimension d, and let 0 ≤ δ ≤ 1/4 and m be given. There exists a distribution P (that depends on F) such that for any selective classifier (f, g), chosen using a training sample Sm drawn i.i.d. according to P, with probability of at least δ,

R(f, g) ≥ 1/2 − 1/(2Φ(f, g)) + (1/(4Φ(f, g))) · min{1, d/(4m) − (4/(3m)) ln(1/(1 − 2δ))}.

Proof If d is the VC-dimension of the hypothesis class F, there exists a set of data points X′ = {e1, e2, . . . , ed} shattered by F. Let X ≜ X′ ∪ {ed+1}. The bad distribution is constructed as follows. Define Bin(m, d/2, 2δ), the binomial tail inversion,

Bin(m, d/2, 2δ) ≜ max_p {p : Bin(m, d/2, p) ≥ 2δ},

where Bin(m, k, p) is the binomial tail, i.e., the probability of observing at most k successes in m independent Bernoulli(p) trials. Define P to be the source distribution over X satisfying

P(ei) ≜ Bin(m, d/2, 2δ)/d, if i ≤ d;    P(ed+1) ≜ 1 − Bin(m, d/2, 2δ).

Assuming that the training sample is selected i.i.d. from P, it follows that with probability of at least 2δ,

|{x ∈ Sm : x ∈ X′}| ≤ d/2.

F shatters X′, thus inducing all dichotomies over X′. Every sample from X′ can reduce the version space by half, so with probability of at least 2δ the version space VSF,Sm includes all dichotomies over at least d/2 instances. Therefore, over these instances (referred to as x1, x2, . . . , xd/2), with probability of 1/2 the error is at least 1/2. (Footnote: According to the game-theoretic setting, the adversary can choose a distribution over F. In this case the expectation in the risk is averaged over random instances and random labels; therefore, the error over the instances x1, x2, . . . , xd/2 is exactly 1/2, and the term 2δ can be replaced with δ.) We have

Φ(f, g) = Σ_{i=1}^{d+1} P(ei) · g(ei) = P(e1) · Σ_{i=1}^{d} g(ei) + P(ed+1) · g(ed+1)
≤ P(e1) · Σ_{i=1}^{d/2} g(xi) + (d/2) · P(e1) + P(ed+1)
= P(e1) · Σ_{i=1}^{d/2} g(xi) + 1 − (d/2) · P(e1).

Therefore,

P(e1) · Σ_{i=1}^{d/2} g(xi) ≥ Φ(f, g) + (d/2) · P(e1) − 1.

By Definition 1.4,

Φ(f, g) · R(f, g) = E[ℓ(f(X), f∗(X)) · g(X)] = Σ_{i=1}^{d+1} P(ei) · g(ei) · I(f(ei) ≠ f∗(ei))
≥ Σ_{i=1}^{d/2} P(xi) · g(xi) · (1/2)
≥ (Φ(f, g) − 1)/2 + (d/4) · P(e1)
= (Φ(f, g) − 1)/2 + (1/4) · Bin(m, d/2, 2δ).

Applying Lemma B.2 completes the proof.

Corollary D.4 Let 0 ≤ δ ≤ 1/4, m, and n > 1 be given. There exist a distribution P, which depends on m and n, and a finite hypothesis class F of size n and VC-dimension d, such that for any selective classifier (f, g), chosen using a training sample Sm drawn i.i.d. according to P, with probability of at least δ, if

Φ(f, g) ≥ max{3/4, 1 − (1/(16m)) (d − (16/3) ln(1/(1 − 2δ)))},

then

R(f, g) ≥ (1/(16m)) (d − (16/3) ln(1/(1 − 2δ))).

Proof Assuming Φ(f, g) ≥ max{3/4, 1 − (1/(16m)) (d − (16/3) ln(1/(1 − 2δ)))}, we apply Theorem D.3 to complete the proof.
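To get a feel for the shape of the upper envelope of Theorem D.2, the following short Python snippet evaluates the bound over a grid of coverage values. The parameter values (|F|, m, δ) are illustrative, hypothetical choices; the snippet is a plotting aid only and is not part of the analysis.

import numpy as np

def upper_envelope(coverage, m, card_F, delta):
    """Risk bound of Theorem D.2:
    ((1 - Phi0/Phi) / (1 - Phi0)) * (1/m) * (ln|F| + ln(2/delta)),
    where Phi0 = 1 - (1/m) * ((ln 2)|F| + ln(2/delta)).
    Meaningful for coverage >= Phi0 (the strategy's guaranteed coverage)."""
    phi0 = 1.0 - (np.log(2) * card_F + np.log(2.0 / delta)) / m
    eps = (np.log(card_F) + np.log(2.0 / delta)) / m
    return (1.0 - phi0 / coverage) / (1.0 - phi0) * eps

if __name__ == "__main__":
    m, card_F, delta = 10_000, 100, 0.05          # illustrative parameters
    phi0 = 1.0 - (np.log(2) * card_F + np.log(2.0 / delta)) / m
    for phi in np.linspace(max(phi0, 0.0) + 1e-6, 1.0, 5):
        print(f"coverage {phi:.4f} -> risk bound {upper_envelope(phi, m, card_F, delta):.5f}")

At coverage Φ0 the bound is zero (the CSS regime of Chapter 3), and at full coverage it recovers the classical realizable bound of Theorem D.1 up to the union-bound factor.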
Appendix E
Related Work

The idea of selective classification (classification with a reject option) dates back to Chow's seminal papers [22, 23]. These papers analyzed both the Bayes-optimal reject decision and the reject-rate vs. error trade-off, under the 0/1 loss function and assuming that the underlying distribution is completely known. The Bayes-optimal rejection policy is based, as in standard classification, on maximum a posteriori probabilities: instances should be rejected whenever none of the a posteriori probabilities is sufficiently dominant. This type of rejection can be termed ambiguity-based rejection. More specifically, let ωi denote class i and P(ωi | x) the a posteriori probabilities. Then, according to Chow's decision rule, for any t ∈ [0, 1], sample x should be rejected if

max_k P(ωk | x) < t.

Chow's rule is optimal in the sense that no other rule can result in a lower error rate, given the specific reject rate controlled by t. One of Chow's main results (for the case of complete probabilistic knowledge) is that the optimal RC trade-off is monotonically increasing. While the optimal decision can be identified in the case of complete probabilistic knowledge, it was argued [40] that when the a posteriori probabilities are estimated with errors, Chow's rule [23] does not provide the optimal error-reject trade-off. Papers [79] and [74] discussed Bayes-optimal decisions in the case of arbitrary cost matrices; in these papers the optimal reject rule was chosen using the ROC curve evaluated on a subset of the training data. As in most papers on the subject emerging from the engineering community [40, 39, 72, 19, 63], no probabilistic or other guarantees are provided for the misclassification error.

A very common rejection model in the literature is the cost model, whereby a specific cost d is associated with rejection [79] and the objective is to minimize the generalized rejective risk function

ℓc(f, g) ≜ d · E[1 − g(X)] + E[I(f(X) ≠ Y) · g(X)].   (E.1)

Given our definitions of risk and coverage, the function (E.1) can easily be expressed as a function over the RC plane,

ℓc(R, Φ) = d (1 − Φ(f, g)) + R(f, g) Φ(f, g).   (E.2)

For any fixed d, Equation (E.2) defines level sets (elevation contour lines) over the RC plane. This popular cost model has been refined to differentiate between the costs of false positives and false negatives, as well as between the costs of rejecting positive and negative samples [56, 72, 79, 74]. Such extensions are appealing because they allow additional control and more flexibility in modeling the problem at hand. Nevertheless, these cost models are often criticized for their impracticality in applications where it is impossible or hard to precisely quantify the cost of rejection. It is interesting to note that for an ideal Bayesian setting, where the underlying distribution is completely known, Chow showed [23] that the cost d upper bounds the probability of misclassification; in this case one can control the classification error by specifying a matching rejection cost.
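To make the two ingredients above concrete, the following Python sketch applies Chow-style rejection (reject when the largest estimated posterior falls below a threshold t) and evaluates the empirical analogue of the cost-model objective (E.1). The posterior estimates, the threshold values, and the rejection cost are hypothetical inputs; the sketch is illustrative only and makes no claim about optimality when the posteriors are merely estimated.

import numpy as np

def chow_reject(posteriors, t):
    """Chow's rule: accept an instance iff its largest (estimated) class
    posterior is at least t; returns a boolean acceptance mask."""
    return posteriors.max(axis=1) >= t

def rejective_risk(posteriors, y_true, t, d):
    """Empirical version of the cost-model objective (E.1):
    d * P(reject) + P(error and accept)."""
    accept = chow_reject(posteriors, t)
    y_pred = posteriors.argmax(axis=1)
    reject_rate = 1.0 - accept.mean()
    error_and_accept = np.mean((y_pred != y_true) & accept)
    return d * reject_rate + error_and_accept

# Hypothetical usage with made-up posterior estimates for a 3-class problem.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=500)   # fake posterior estimates
    labels = np.array([rng.choice(3, p=p) for p in probs])   # labels drawn from those posteriors
    for t in (0.5, 0.7, 0.9):
        print(t, round(rejective_risk(probs, labels, t, d=0.2), 4))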
In [72], two additional optimization models are introduced. The first is the bounded-improvement model, in which, given a constraint on the misclassification cost, the classifier should reject as few samples as possible. The second is the bounded-abstention model, in which, given a constraint on the coverage, the classifier should attain the lowest possible misclassification cost. It is argued in [72] that these models are more suitable than the above cost model in many applications, for instance, when a classifier with limited classification throughput (e.g., a human expert) must handle the rejected instances, and in medical and quality-assurance applications, where the goal is to reduce the misclassification cost to a user-defined value.

Very few studies have focused on error bounds for classifiers with a reject option. Hellman [55] proposed and analyzed two rejection rules for the nearest neighbor algorithm. Extending Cover and Hart's classic result for the 1-nearest-neighbor algorithm [26], Hellman showed that the test error (over non-rejected points) of a nearest neighbor algorithm with a reject option can be bounded asymptotically (as the sample size approaches infinity) by some factor of the Bayes error (with reject). To the best of our knowledge, this excess risk bound is the first that was introduced in the context of classification with a reject option. Herbei and Wegkamp [56] developed excess risk bounds for classification with a reject option where the loss function is the 0/1 loss, extended so that the cost of each rejected point is 0 ≤ d ≤ 1/2. This result generalizes the excess risk bounds of Tsybakov [80] for standard binary classification without reject (which is equivalent to the case d = 1/2). The bound applies to any empirical error minimization technique. This result is further extended in [13, 88] in various ways, including the use of the hinge loss function for efficient optimization. The main results of Herbei and Wegkamp (both for plug-in rules and for empirical risk minimization) degenerate, in the realizable case, to a meaningless bound, where classification with a reject option is not guaranteed to be any better than classification without reject. Furthermore, Herbei and Wegkamp's results are limited to the cost model.

Freund et al. [35] studied a simple ensemble method for binary classification. Given a hypothesis class F, the method outputs a weighted average of all the hypotheses in F, where the weight of each hypothesis depends exponentially on its individual training error. Their algorithm abstains from prediction whenever the weighted average of all individual predictions is inconclusive (i.e., sufficiently close to zero). Two regret bounds were derived for this algorithm. The first bounds (w.h.p.) the risk of the selective classifier by

R(f, g) ≤ 2R(f∗) + O(1/m^(1/2−θ)),

where 0 < θ < 1/2 is a hyperparameter. The authors also proved that for a sufficiently large training sample size,

m = Ω( (√(ln(|F|) · ln(1/δ)))^(1/θ) ),

the probability that the algorithm will abstain from prediction is bounded by

Φ(f, g) ≥ 1 − 5R(f∗) − O( √(ln |F|) / m^(1/2−θ) ).

To the best of our knowledge, these bounds are the first to provide some guarantee for both the error of the classifier and the coverage; in this respect, these results are related to the bounded-improvement and bounded-abstention models. As was rightfully stated by the authors, the final aggregated hypothesis can significantly outperform the best base hypothesis in F in some favorable situations. Unfortunately, the regret bound provided does not exploit these situations, as it is bounded by twice the generalization error of the best hypothesis.
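For concreteness, here is a minimal sketch of the kind of ensemble rule described above, for binary labels in {−1, +1}: each hypothesis is weighted proportionally to exp(−c · its training error), and the ensemble abstains when the weighted vote is too close to zero. The scaling constant, the abstention threshold, and the toy hypothesis class are hypothetical choices; the sketch only illustrates the mechanism and does not reproduce the exact algorithm or the guarantees of [35].

import numpy as np

def averaged_classifier_with_abstention(hypotheses, X_train, y_train, c=10.0, tau=0.2):
    """Weighted-average ensemble in the spirit of Freund et al. [35]:
    weight_h proportional to exp(-c * empirical error of h);
    abstain when the absolute weighted vote is at most tau."""
    errors = np.array([np.mean([h(x) != y for x, y in zip(X_train, y_train)])
                       for h in hypotheses])
    weights = np.exp(-c * errors)
    weights /= weights.sum()

    def predict(x):
        vote = sum(w * h(x) for w, h in zip(weights, hypotheses))  # h(x) in {-1, +1}
        if abs(vote) <= tau:
            return 0            # abstain: the weighted vote is inconclusive
        return 1 if vote > 0 else -1
    return predict

# Hypothetical usage with 1-D threshold hypotheses and noisy labels.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hypotheses = [lambda x, t=t: 1 if x > t else -1 for t in np.linspace(-1, 1, 21)]
    X_train = rng.uniform(-1, 1, size=200)
    y_train = np.where(X_train + 0.1 * rng.standard_normal(200) > 0.2, 1, -1)
    predict = averaged_classifier_with_abstention(hypotheses, X_train, y_train)
    print([predict(x) for x in (-0.8, 0.18, 0.22, 0.8)])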
In a series of papers (see, e.g., [75, 85, 84]) culminating in a book [86], Vovk et al. presented transductive confidence machines and conformal prediction. This innovative approach is mainly concerned with an online probabilistic setting. Rather than predicting a single label for each sample point, a "conformal predictor" can assign multiple labels. Given a confidence level ε, it is shown how an error rate of ε can be asymptotically guaranteed. Interpreting the uncertain (multi-labeled) predictions as abstentions, one can construct online predictors with a reject option that have asymptotic performance guarantees. A few important differences between conformal prediction and our results can be pointed out. While both approaches provide "hedged" predictions, they use different notions of hedging. Whereas our goal is to guarantee that, with high probability over the training sample, our classifier agrees with the best classifier in the class on all points in the accepted domain, the goal in conformal prediction is to provide guarantees for the average error rate, where the average is taken over all possible samples and test points. (Footnote: As noted by Vovk et al.: "It is impossible to achieve the conditional probability of error equal to ε given the observed examples, but it is the unconditional probability of error that equals ε. Therefore, it implicitly involves averaging over different data sequences..." [86, page 295].) In this sense conformal prediction cannot achieve pointwise competitiveness. Their setting also utilizes a different notion of error than the one we use. While we provide performance guarantees for the error rate only on the covered (accepted) examples, they provide a guarantee for all examples (including those that receive multiple predictions or none at all). By increasing the multi-labeled (uncertain) prediction rate, their error rate can be decreased to an arbitrarily small value. This is not the case with our error notion on the covered examples, which is, of course, bounded below by the Bayes error on the covered region. This definition of error is also the one adopted by most other studies of classification with a reject option. The vast majority of our work is focused on the study of coverage rates. In conformal prediction an efficiency notion, similar to the coverage rate, is mentioned, but no finite sample results are derived. Moreover, all the negative results provided in this work apply to conformal prediction; hence, distribution-independent efficiency results are out of reach.

Bibliography

[1] N. Ailon, R. Begleiter, and E. Ezra. Active learning using smooth relative regret approximations with applications. In 25th Annual Conference on Learning Theory (COLT), 2012.
[2] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1):181–210, 1995.
[3] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[4] A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30(1):31–56, 1998.
[5] L. Atlas, D. Cohn, R. Ladner, M.A. El-Sharkawi, and R.J. Marks II.
Training Connectionist Networks with Queries and Selective Sampling. Morgan Kaufmann Publishers Inc., 1990.
[6] P. Auer. Using confidence bounds for exploitation-exploration tradeoffs. The Journal of Machine Learning Research, 3:397–422, 2003.
[7] Ö. Ayşegül, G. Mehmet, A. Ethem, and H. Türkan. Machine learning integration for predicting the effect of single amino acid substitutions on protein stability. BMC Structural Biology, 9(1):66, 2009.
[8] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.
[9] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
[10] M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In 21st Annual Conference on Learning Theory (COLT), pages 45–56, 2008.
[11] P.L. Bartlett and S. Mendelson. Discussion of "2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii. Annals of Statistics, 34:2657–2663, 2006.
[12] P.L. Bartlett, S. Mendelson, and P. Philips. Local complexities for empirical risk minimization. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 2004.
[13] P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, 2008.
[14] S. Ben-David, N. Eiron, and P.M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
[15] J.L. Bentley, H.T. Kung, M. Schkolnick, and C.D. Thompson. On the average number of maxima in a set of vectors and applications. JACM: Journal of the ACM, 25, 1978.
[16] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56. ACM, 2009.
[17] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. Advances in Neural Information Processing Systems 23, 2010.
[18] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. JACM: Journal of the ACM, 36, 1989.
[19] A. Bounsiar, E. Grall, and P. Beauseroy. A kernel based rejection method for supervised classification. International Journal of Computational Intelligence, 3:312–321, 2006.
[20] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer, 2003.
[21] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[22] C.K. Chow. An optimum character recognition system using decision functions. IEEE Trans. Computer, 6(4):247–254, 1957.
[23] C.K. Chow. On optimum recognition error and reject trade-off. IEEE Trans. on Information Theory, 16:41–46, 1970.
[24] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[25] C. Cortes and V. Vapnik. Support-vector networks.
Machine Learning, 20(3):273–297, 1995.
[26] T.M. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[27] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, pages 235–242, 2005.
[28] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.
[29] S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
[30] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. New York: Springer, 1996.
[31] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010.
[32] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In Neural Information Processing Systems (NIPS), 2011.
[33] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13:255–279, 2012.
[34] S. Fine, R. Gilad-Bachrach, and E. Shamir. Query by committee, linear separation and random walks. Theoretical Computer Science, 284(1):25–51, 2002.
[35] Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004.
[36] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and Query by Committee. In Advances in Neural Information Processing Systems (NIPS) 5, pages 483–490, 1993.
[37] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
[38] E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[39] G. Fumera and F. Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines: First International Workshop, pages 811–919, 2002.
[40] G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for improving classification reliability. Lecture Notes in Computer Science, 1876, 2001.
[41] J.E. Gentle. Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998.
[42] R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, the Hebrew University of Jerusalem, 2007.
[43] S. Goldman and M. Kearns. On the complexity of teaching. JCSS: Journal of Computer and System Sciences, 50, 1995.
[44] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In NIPS, pages 537–544. MIT Press, 2008.
[45] B. Hanczar and E.R. Dougherty. Classification with reject option in gene expression data. Bioinformatics, 24(17):1889–1895, 2008.
[46] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353–360, 2007.
[47] S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), volume 4539 of Lecture Notes in Artificial Intelligence, pages 66–81, 2007.
[48] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
[49] S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 37(1):333–361, 2011.
[50] S. Hanneke.
A statistical theory of active learning. Unpublished, 2013.
[51] S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. The Journal of Machine Learning Research, 13:1469–1587, 2012.
[52] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, volume 2. Cambridge University Press, 2000.
[53] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177–221, 1988.
[54] T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 1995.
[55] M.E. Hellman. The nearest neighbor classification rule with a reject option. IEEE Trans. on Systems Science and Cybernetics, 6:179–185, 1970.
[56] R. Herbei and M.H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, 2006.
[57] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
[58] D. Hug, G.O. Munsonious, and M. Reitzner. Asymptotic mean values of Gaussian polytopes. Beiträge Algebra Geom., 45:531–548, 2004.
[59] D. Hug and M. Reitzner. Gaussian polytopes: variances and limit theorems. Advances in Applied Probability, 37(2):297–320, 2005.
[60] B. Kégl. Robust regression by boosting the median. Learning Theory and Kernel Machines, pages 258–272, 2003.
[61] R.M. Kil and I. Koo. Generalization bounds for the regression of real-valued functions. In Proceedings of the 9th International Conference on Neural Information Processing, volume 4, pages 1766–1770, 2002.
[62] V. Koltchinskii. 2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34:2593–2656, 2006.
[63] T.C.W. Landgrebe, D.M.J. Tax, P. Paclík, and R.P.W. Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 27(8):908–917, 2006.
[64] J. Langford. Tutorial on practical prediction theory for classification. JMLR, 6:273–306, 2005.
[65] W.S. Lee. Agnostic Learning and Single Hidden Layer Neural Networks. PhD thesis, Australian National University, 1996.
[66] L. Li and M.L. Littman. Reducing reinforcement learning to KWIK online regression. Annals of Mathematics and Artificial Intelligence, pages 217–237, 2010.
[67] L. Li, M.L. Littman, and T.J. Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning, pages 568–575. ACM, 2008.
[68] P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
[69] C. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195–248. Springer-Verlag, 1998.
[70] P.S. Meltzer, J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, and C. Peterson. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), June 2001.
[71] T. Mitchell. Version spaces: a candidate elimination approach to rule learning.
In IJCAI'77: Proceedings of the 5th International Joint Conference on Artificial Intelligence, pages 305–310, 1977.
[72] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pages 665–672, 2005.
[73] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, 1990.
[74] C.M. Santos-Pereira and A.M. Pires. On optimal reject rules and ROC curves. Pattern Recognition Letters, 26(7):943–952, 2005.
[75] C. Saunders, A. Gammerman, and V. Vovk. Transduction with confidence and credibility. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), pages 722–726, 1999.
[76] H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), pages 287–294, 1992.
[77] American Cancer Society. Cancer facts & figures. 2010.
[78] A.L. Strehl and M.L. Littman. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 1417–1424, 2007.
[79] F. Tortorella. An optimal reject rule for binary classifiers. Lecture Notes in Computer Science, 1876:611–620, 2001.
[80] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.
[81] V. Vapnik. Statistical Learning Theory. Wiley Interscience, New York, 1998.
[82] V. Vapnik. An overview of statistical learning theory. Neural Networks, IEEE Transactions on, 10(5):988–999, 1999.
[83] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
[84] V. Vovk. On-line confidence machines are well-calibrated. In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS), pages 187–196, 2002.
[85] V. Vovk, A. Gammerman, and C. Saunders. Machine-learning applications of algorithmic randomness. In Proc. 16th International Conf. on Machine Learning (ICML), pages 444–453, 1999.
[86] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
[87] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. JMLR, pages 2269–2292, 2011.
[88] M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155–168, 2007.
[89] Y. Wiener and R. El-Yaniv. Pointwise tracking the optimal regression function. In Advances in Neural Information Processing Systems 25, pages 2051–2059, 2012.
[90] L. Yang and J.G. Carbonell. Adaptive Proactive Learning with Cost-Reliability Tradeoff. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2009.
[91] A.P. Yogananda, M. Narasimha Murty, and L. Gopal. A fast linear separability test by projection of positive points on subspaces. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), pages 713–720. ACM, 2007.

Theoretical Foundations of Selective Prediction

Yair Wiener
Research thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Yair Wiener

Submitted to the Senate of the Technion - Israel Institute of Technology
Av 5773, Haifa, July 2013

Theoretical Foundations of Selective Prediction

Acknowledgements

The research was carried out under the supervision of Associate Professor Ran El-Yaniv in the Department of Computer Science. I thank the Technion for its generous financial support of my studies.

Abstract

One of the most basic problems in machine learning is prediction: based on examples from the past, we are required to draw conclusions about the future. In particular, in classification problems we must predict to which class each test example belongs. For example, given a chest X-ray we must decide whether or not pneumonia is present, based on a collection of past X-rays for which the diagnosis is known. Naturally, in order to predict we must make basic assumptions about the process that generates the examples (the X-rays) and the labels (the diagnoses). The assumption underlying our work is that the examples and labels are sampled independently and identically from their joint distribution, which is unknown. When the predicted value is continuous, the problem is called a regression problem.

In standard learning, the predictor is required to predict a value for every test example. In selective prediction, by contrast, a predictor (e.g., a classifier or a regressor) may abstain from prediction on some of the test examples. The goal is to improve prediction accuracy in exchange for giving up coverage of the domain. Selective predictors are especially attractive in problems where not every test example must be labeled (for example, when a human expert is available to handle the hard cases), or in cases that require an especially small classification error, one that cannot be achieved by a standard classifier. This research focuses on the theoretical foundations of selective prediction and on its applications to selective classification, selective regression, and active learning.

We define a selective predictor as a pair of functions: a prediction function and a selection function. Given a test example x, the value of the selection function determines whether the example is rejected or not. If the example is accepted (i.e., not rejected), the predicted value is determined by the prediction function. The two main characteristics of any selective predictor are its risk and its coverage. The risk of a selective predictor is defined as the expected error over the selected examples, and the coverage is defined as the probability of labeling an example.

Given a class of predictors (this class may be infinite), we say that a selective predictor is pointwise competitive if, for every test example it labels, its prediction is identical to the prediction of the best predictor in the class. In other words, the selective predictor achieves zero estimation error on the examples it labels. In regression problems we relax this requirement and demand only that the predicted value be within ±ε of the value predicted by the best predictor in the class. A selective predictor that meets this relaxed requirement is called ε-pointwise competitive.

Research on selective classification, better known as "classification with a reject option," is not new; it began as early as the late 1950s with the work of Chow [22, 23]. Chow focused on the Bayesian case, in which complete knowledge of the source distributions is given. He showed that in this case there exists an optimal rejection rule that achieves minimal generalization error for any given rejection rate. Over the years, a considerable body of empirical evidence has accumulated attesting to the potential of selective classification to reduce the generalization error, and various rejection techniques have been developed for specific classifiers. However, to the best of our knowledge, there has so far been no formal discussion of the necessary and sufficient conditions for obtaining a pointwise competitive selective predictor. The few theoretical works in the area presented several risk and coverage bounds, but none of them guarantees convergence (even asymptotically) to the error of the best classifier in the class.

The main results of this research are as follows:
Low error selective strategy (LESS) – We present a family of selective learning strategies called LESS. These strategies achieve pointwise competitiveness in classification problems both with and without noise. That is, given a set of labeled examples and a class of predictors, the selective predictor chosen by LESS labels every accepted test example exactly as the best predictor in the class does. For regression problems there is a slightly different version of the same strategy, called ε-LESS, which achieves ε-pointwise competitiveness.

Lazy implementation – At first glance, an efficient implementation of the proposed selective strategy seems impossible, since we are required to compute the empirical error (the error on the training examples) for an (infinite) set of predictors. However, we present a reduction of the problem to the computation of a small number of predictors with minimal empirical error. Using this reduction, we show how the strategy can be implemented efficiently for the case of linear regression with squared loss.

The disbelief principle – Inspired by the lazy implementation, we developed a new technique for assessing a predictor's confidence in its own predictions. Suppose we are dealing with a binary classification problem and the predictor predicts a positive value for some test example. We now ask the learner to learn a new predictor based on all the historical data, but under the constraint that the new predictor predict a negative value for that test example. All that remains is to examine the difference in performance between the two predictors on our historical data. This difference is proportional to the classifier's confidence: the smaller the difference, the less confident the classifier is in its prediction, and vice versa.

Characterizing set complexity – So far we have seen that the proposed strategy achieves pointwise competitiveness and that it can be implemented (or approximated) relatively efficiently. It remains to show that the strategy does not reject all test examples, that is, to bound its coverage from below. In order to develop coverage bounds we defined a new complexity measure called the characterizing set complexity. This measure combines aspects of several other complexity measures, including the notion of the disagreement region (used by the disagreement coefficient), the notion of the minimal specifying set (used by the teaching dimension), and the notion of the VC dimension. Using classical results from statistical geometry, we were able to bound the new complexity measure for several interesting settings, including linear classifiers under any finite mixture of Gaussians.

Coverage rates – The proposed strategy rejects test examples most conservatively: given even the slightest doubt about the correctness of a label, it will refuse to predict. Nevertheless, and surprisingly, in many problems the strategy labels most of the examples. Moreover, we show that the rejection probability of the strategy decreases at a fast rate as a function of the number of training examples. It turns out that when the class of predictors we learn from is infinite, the coverage cannot be bounded from below independently of the source distribution or of the training examples. In this research we develop coverage bounds that depend on the training examples themselves (data-dependent bounds) and coverage bounds that depend on the source distribution (distribution-dependent bounds). Based on the new complexity measure, we show that LESS achieves substantial coverage, converging to one, for linear classifiers under any finite mixture of Gaussians. We also develop additional coverage bounds for selective classification under noise and for selective regression.

New speedup results for active learning – In active learning, the learner receives training examples sequentially and must decide, for each example, whether to request its label. The active learner is measured by the number of queries (label requests) it asks until a given target error is achieved.
The number of label queries as a function of the target error is called the label complexity. One of the greatest hopes of active learning is to achieve a speedup in the learning rate relative to standard learning (in which a label is requested for every example). Only a handful of problems (classifiers and distributions) are known in the literature for which an exponential speedup in the learning rate is provably guaranteed. In particular, the best-known example is linear classifiers through the origin under the uniform distribution on the sphere. In recent years, however, it has been shown that even in simple problems, such as arbitrary linear classifiers under that same distribution on the sphere, no speedup over standard learning can be achieved. This disappointing result discouraged many and raised the question of whether exponential speedup is possible in interesting problems.

In this work we present an equivalence between active learning and noise-free selective classification. This equivalence allows us to derive label complexity bounds from coverage bounds. Using this technique, we were able to prove exponential speedup for arbitrary linear classifiers under any finite mixture of Gaussians. This new bounding technique is currently the only known method that proves an exponential speedup in the learning rate for several interesting problems. Moreover, the tight connection between active learning and selective classification allows us to bound one of the most important complexity measures in active learning, namely the disagreement coefficient. This bound is based on the new complexity measure we introduced, and all the results we proved for it apply to the disagreement coefficient as well.

Empirical results – We conclude the work with a collection of empirical results demonstrating the advantage of selective prediction in general, and of our selective prediction strategy in particular.