Theoretical Foundations of Selective Prediction

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Yair Wiener

Submitted to the Senate of the Technion — Israel Institute of Technology
Av 5773          Haifa          July 2013
The research thesis was done under the supervision of Prof. Ran El-Yaniv in the
Department of Computer Science.
Acknowledgements:
First and foremost, I want to thank my advisor, Prof. Ran El-Yaniv. It has been
a true pleasure being his Ph.D. student. Ran taught me how to do research, how to
address complicated problems, how to look for answers, but most importantly, how
to ask the right questions. Ran did not spare any effort in supporting me during my
long journey. His endless patience, contagious enthusiasm, and continuous encouragement make him the ideal advisor. Even during tough times, when I almost lost
hope in our direction, Ran never lost faith. For that I am indebted to him. It has been
an honor and a privilege to learn from him.
I thank my friends at RADVISION for believing in me and giving me the time
to focus on research. I also thank Intel and the European Union's Seventh Framework Programme for their financial support.
Last, but not least, I deeply thank my family, my wife Hila for her support, love,
and care, and my daughter Ori for giving me a new perspective on learning. This
thesis is dedicated to them.
The generous financial support of the Technion is gratefully acknowledged.
The work described in the thesis is based on the following publications:
1. Ran El-Yaniv and Yair Wiener. "On the foundations of noise-free selective classification." The Journal of Machine Learning Research 99 (2010): 1605-1641. [31] (Impact factor: 3.42, 5-year impact factor: 4.284, rank in Machine Learning: 1)
2. Ran El-Yaniv and Yair Wiener. "Agnostic selective classification." Advances in Neural Information Processing Systems. 2011. [32]
3. Ran El-Yaniv and Yair Wiener. "Active learning via perfect selective classification." The Journal of Machine Learning Research 13 (2012): 255-279. [33] (Impact factor: 3.42, 5-year impact factor: 4.284, rank in Machine Learning: 1)
4. Yair Wiener and Ran El-Yaniv. "Pointwise Tracking the Optimal Regression Function." Advances in Neural Information Processing Systems 25. 2012. [89]
Contents

Abstract
Abbreviations and Notations

1 Introduction
  1.1 To abstain, or not to abstain, that is the question
  1.2 Notations and definitions
  1.3 Contextual background
    1.3.1 Selective classification
    1.3.2 Selective regression
    1.3.3 Active learning
  1.4 Main contributions
    1.4.1 Low error selective strategy (LESS)
    1.4.2 Lazy implementation
    1.4.3 The disbelief principle
    1.4.4 Characterizing set complexity
    1.4.5 Coverage rates
    1.4.6 New speedup results for active learning
    1.4.7 Efficient implementation and empirical results

2 Characterizing Set Complexity
  2.1 Motivation
  2.2 The characterizing set complexity
  2.3 Linear classifiers in R
  2.4 Intervals in R
  2.5 Axis-aligned rectangles in Rd
  2.6 Linear classifiers in Rd

3 Realizable Selective Classification
  3.1 Consistent Selective Strategy (CSS)
  3.2 Risk and coverage bounds
    3.2.1 Finite hypothesis classes
    3.2.2 Infinite hypothesis spaces
  3.3 Distribution-dependent coverage bound
  3.4 Implementation

4 From Selective Classification to Active Learning
  4.1 Definitions
  4.2 Background
  4.3 From coverage bound to label complexity bound
  4.4 A new technique for upper bounding the label complexity
    4.4.1 Linear separators in R
    4.4.2 Linear separators in Rd under mixture of Gaussians
  4.5 Lower bound on label complexity
  4.6 Relation to existing label complexity measures
    4.6.1 Teaching dimension
    4.6.2 Disagreement coefficient
  4.7 Agnostic active learning label complexity bounds
    4.7.1 Label complexity bound for A2
    4.7.2 Label complexity bound for RobustCALδ

5 Agnostic Selective Classification
  5.1 Definitions
  5.2 Low Error Selective Strategy (LESS)
  5.3 Risk and coverage bounds
  5.4 Disbelief principle
  5.5 Implementation
  5.6 Empirical results

6 Selective Regression
  6.1 Definitions
  6.2 ǫ-Low Error Selective Strategy (ǫ-LESS)
  6.3 Risk and coverage bounds
  6.4 Rejection via constrained ERM
  6.5 Selective linear regression
  6.6 Empirical results

7 Future Directions
  7.1 Beyond pointwise competitiveness
  7.2 Beyond linear classifiers
  7.3 Beyond mixtures of Gaussians
  7.4 On computational complexity
  7.5 Other applications for selective prediction

A Proofs
B Technical Lemmas
C Extended Minimax Model
D Realizable Selective Classification: RC trade-off analysis
E Related Work
List of Figures

1.1 Excess risk, estimation error, and approximation error for different hypothesis classes with increasing complexity.
1.2 Excess risk, estimation error, and approximation error for a single hypothesis class with different coverage values.
1.3 An example of CSS with the class of linear classifiers in R2. The white region is the region rejected by the strategy.
1.4 Lazy implementation. (a) The test point (depicted by the empty circle) is rejected as both constrained ERMs have zero empirical error; (b) the test point is accepted as one constrained ERM has an empirical error higher than zero.
1.5 Confidence level map using (a) the disbelief index; (b) the distance from the decision boundary.
1.6 The version space compression set is depicted in green circles, and the maximal agreement region is depicted in gray.
1.7 Maximal agreement region for different samples.
1.8 RC curve of LESS (red line) and rejection based on distance from the decision boundary (dashed green line).
2.1 An example of a maximal agreement set for linear classifiers in R2.
2.2 The region accepted by CSS for the case of linear classifiers in R2.
2.3 Version space compression set.
2.4 Maximal agreement region.
2.5 Maximal agreement set for linear classifier in R. (a) sample with both positive and negative examples; (b) sample with negative examples only; (c) sample with positive examples only.
2.6 Maximal agreement set for intervals in R. (a) non-overlapping interval and two rays; (b) non-overlapping interval and a ray; (c) interval only.
2.7 Maximal agreement set for intervals in R when Sm includes negative points only.
2.8 Agreement region of VS_{F,R_i}.
3.1 A worst-case distribution for linear classifiers: points are drawn uniformly at random on the two arcs and labeled by a linear classifier that passes between these arcs. The probability volume of the maximal agreement set is zero.
5.1 The set of low true error (a) resides within a ball around f∗ (b).
5.2 The set of low empirical error (a) resides within the set of low true error (b).
5.3 Constrained ERM.
5.4 Linear classifier. Confidence height map using (a) disbelief index; (b) distance from decision boundary.
5.5 SVM with polynomial kernel. Confidence height map using (a) disbelief index; (b) distance from decision boundary.
5.6 RC curve of our technique (depicted in red) compared to rejection based on distance from decision boundary (depicted in dashed green line). The RC curve in the right figure zooms into the lower coverage regions of the left curve.
5.7 RC curves for SVM with linear kernel. Our method in solid red, and rejection based on distance from decision boundary in dashed green. Horizontal axis (c) represents coverage.
5.8 SVM with linear kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.
5.9 RC curves for SVM with RBF kernel. Our method in solid red and rejection based on distance from decision boundary in dashed green.
5.10 SVM with RBF kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage.
6.1 Absolute difference between the selective regressor (f, g) and the optimal regressor f∗. Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
6.2 Test error of selective regressor (f, g) trained on the dataset 'years' with sample size of 30, 50, 100, 150, and 200 samples. Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
6.3 Test error of selective regressor (f, g). Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
D.1 The RC plane and RC trade-off.
Abstract
In selective prediction, a predictor is allowed to abstain on part of the domain. The
objective is to reduce prediction error by compromising coverage. This research is
concerned with the study of the theoretical foundations of selective prediction and
its applications for selective classification, selective regression, and active learning.
We present a new family of selective classification strategies called LESS (low
error selective strategies). The labels predicted by LESS for the accepted domain
are guaranteed to be identical to the labels predicted by the best hypothesis in
the class, chosen in hindsight. Therefore, the estimation error of the predictor
chosen by LESS is zero. Extending the idea to regression we also present a strategy
called ǫ-LESS whose predictions are ǫ-close to the values predicted by the best (in
hindsight) regressor in the class.
We study the coverage rates of LESS (and ǫ-LESS) for classification and regression. Relying on a novel complexity measure termed characterizing set complexity,
we derive both data-dependent and distribution-dependent guarantees on the coverage of LESS for both realizable and agnostic classification settings. These results
are interesting because they allow for training selective predictors with substantial
coverage whose estimation error is essentially zero.
Moreover, we prove an equivalence between selective (realizable) classification and stream-based active learning, with respect to learning rates. One of the
main consequences of this equivalence is an entirely novel technique to bound
the label complexity in active learning for numerous interesting hypothesis classes
and distributions. In particular, using classical results from probabilistic geometry, we prove exponential label complexity speedup for actively learning general
(non-homogeneous) linear classifiers when the data distribution is an arbitrary high-dimensional mixture of Gaussians.
While direct implementations of the LESS (and ǫ-LESS) strategies appear to
be intractable, we show how to reduce LESS to a procedure involving a few calculations of constrained empirical risk minimization (ERM). Using this reduction, we
develop a new principle for rejection, termed the disbelief principle, and show an
efficient implementation for ǫ-LESS for the case of linear least squares regression
(LLSR).
Abbreviations and Notations

SVM      — Support Vector Machine
VC       — Vapnik-Chervonenkis
ERM      — Empirical Risk Minimizer
LESS     — Low error selective strategy
CSS      — Consistent selective strategy
RC       — Risk-Coverage
RBF      — Radial Basis Function
KWIK     — Knows What It Knows
LLSR     — Linear least squares regression
f        — Hypothesis
F        — Hypothesis class
g        — Selection function
R(f)     — True risk of hypothesis f
f∗       — True risk minimizer
f̂        — Empirical risk minimizer
R̂(f)     — Empirical risk of hypothesis f
θ        — Disagreement coefficient
Hn       — Order-n characterizing set
γ(F, n)  — Characterizing set complexity
θǫ       — ǫ-disagreement coefficient
ℓ        — Loss function
δ        — Confidence parameter
n̂        — Version space compression set size
R        — Reals
Chapter 1
Introduction
“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
Mark Twain, 1835-1910
In selective prediction, a predictor (either classifier or regressor) is allowed to
abstain from prediction on part of the domain. The objective is to improve the
accuracy of predictions by compromising coverage, a trade-off we refer to as the
risk-coverage (RC) trade-off. An important goal is to construct algorithms that can
optimally or near optimally control it.
Selective prediction is strongly tied to the idea of self-aware learning whereby
given evidence (labeled and unlabeled examples), our goal is to quantify the limits
of our knowledge. By this we refer to the ability of a learner to objectively quantify
its own self-confidence for each prediction it makes. This problem is at the heart of
many machine learning problems, including selective classification [31, 32], selective regression [89], active learning (selective sampling [5, 24]), and reinforcement
learning (KWIK [67], exploration vs. exploitation [6]).
This research is concerned with the study of the theoretical foundations of self-aware learning and its manifestation in selective classification, selective regression,
and active learning.
1.1 To abstain, or not to abstain, that is the question
Effective selective prediction is compelling in applications where one is not concerned with, or can afford, partial coverage of the domain, and/or in cases where
extremely low risk is a must but is not achievable in standard prediction frameworks. The promise is that by allowing rejection we will be able to reduce the risk
to acceptable levels. Before we start our discussion on the value of rejection, it will
be revealing to take a closer look at the structure of risk in classical classification.
One of the goals in statistical machine learning is to learn an optimal hypothesis f from a finite training set, assumed to be sampled i.i.d. from some unknown
underlying distribution P . The classifier that minimizes the probability of misclassification over P is called the Bayes classifier. A classical result in statistical
learning theory is that the excess risk of a classifier f (the difference between the
risk of f and the risk of the Bayes classifier) can be expressed as a sum of estimation error and approximation error [30],
\[ \underbrace{R(f) - R(f^{\mathrm{Bayes}})}_{\text{excess risk}} \;=\; \underbrace{R(f) - R(f^*)}_{\text{estimation error}} \;+\; \underbrace{R(f^*) - R(f^{\mathrm{Bayes}})}_{\text{approximation error}}, \]
where R(f ) is the error of hypothesis f and f ∗ is the best hypothesis in a given
hypothesis class F, namely $f^* = \arg\min_{f \in \mathcal{F}} R(f)$. The approximation error component depends only on the hypothesis class F from which we choose the classifier.
The richer the hypothesis class, the smaller the approximation error. In the extreme
case, where the target hypothesis belongs to F, the approximation error vanishes
altogether. The estimation error is the error of our classifier compared to the error
of the best hypothesis in F (denoted by f ∗ ). Generally, the richer the hypothesis
class is, the larger the estimation error will be. Consequently, there is a tradeoff
between estimation error and approximation error. Figure 1.1 depicts this tradeoff.
When the reject option is introduced, we should depict the error in a three-dimensional space (complexity, coverage, and error) rather than in the two-dimensional
space (complexity and error) shown in Figure 1.1. The reject option makes it possible for us to overcome the estimation error vs. approximation error tradeoff and
reduce the overall risk in three ways:
• Reduce the estimation error by compromising coverage — choose a subset of the domain over which we can get arbitrarily close to the best hypothesis in the class given a fixed sample.
• Reduce the approximation error by compromising coverage — decompose the problem into “easy” and “difficult” parts. Choose a subset of the domain that can be better approximated by the hypothesis class F. For example, in the case of learning XOR with linear classifiers, the approximation error is 0.5 for all hypotheses. However, if we reject half of the domain, we can easily reduce the approximation error (over the accepted domain) to zero.
• Reduce the Bayes error by compromising coverage — choose a subset of the domain with minimal noise.

Figure 1.1: Excess risk, estimation error, and approximation error for different hypothesis classes with increasing complexity.

Figure 1.2: Excess risk, estimation error, and approximation error for a single hypothesis class with different coverage values.
In this thesis we develop principles and methods that allow for (almost) complete elimination of the estimation error. We propose a new optimization objective
termed pointwise competitiveness. Given a hypothesis class and a finite training
sample, our goal is to match the performance of the best in hindsight hypothesis in
the class over the accepted domain while rejecting as few examples as possible. In
other words, our goal is to achieve zero estimation error on the accepted domain.
Figure 1.2 depicts the excess risk, estimation error, and approximation error of our
proposed method for a given hypothesis class. By compromising coverage, we
are able to reduce the estimation error to zero. When we have done so, we have
achieved a pointwise competitive selective classifier. The approximation error is
depicted as a straight dashed line because its dependency on the coverage is not
guaranteed. Some empirical results provide strong evidence that a byproduct of
our method is a decrease in the approximation error as well.
In the special case where the target hypothesis belongs to the hypothesis class
F (the realizable case), the approximation error is zero anyway and we can achieve
perfect learning (learning a hypothesis with zero generalization error on the accepted
domain).
1.2 Notations and definitions
Let X be some feature space, for example, d-dimensional vectors in Rd , and Y be
some output space. In standard classification, the goal is to learn a classifier f :
X → Y, using a finite training sample of m labeled examples, $S_m = \{(x_i, y_i)\}_{i=1}^{m}$,
assumed to be sampled i.i.d. from some unknown underlying distribution P (X, Y )
over X × Y. We assume that the classifier is to be selected from a hypothesis class
F. Let ℓ : Y × Y → [0, 1] be a bounded loss function.
Definition 1.1 (true risk) The true risk of hypothesis f ∈ F with respect to distribution P (X, Y ) is
\[ R_P(f) \triangleq \mathbb{E}_{(X,Y)\sim P}\left\{\ell(f(X), Y)\right\}. \]
Definition 1.2 (empirical risk) The empirical risk of hypothesis f ∈ F with respect to sample $S_m = \{(x_i, y_i)\}_{i=1}^{m}$ is
\[ \hat{R}_{S_m}(f) \triangleq \frac{1}{m}\sum_{i=1}^{m} \ell\left(f(x_i), y_i\right). \]
When the distribution P or the sample Sm are obvious from the context, we omit
them from the notation and use R(f ) and R̂(f ) respectively. Given hypothesis
class F, we define the true risk minimizer f ∗ ∈ F as
\[ f^* \triangleq \arg\min_{f \in \mathcal{F}} R(f), \]
and the empirical risk minimizer (ERM) as
\[ \hat{f} \triangleq \arg\min_{f \in \mathcal{F}} \hat{R}(f). \]
The associated excess loss class [12] is defined as
\[ \mathcal{H} = \mathcal{H}(\mathcal{F}, \ell) \triangleq \left\{\ell(f(x), y) - \ell(f^*(x), y) : f \in \mathcal{F}\right\}. \]
In selective prediction (either selective classification or selective regression)
the learner should output a selective predictor defined to be a pair (f, g), with
f ∈ F being a standard predictor, and g : X → [0, 1], a selection function. When
applying the selective predictor to a sample x ∈ X , g(x) is the probability the
selective predictor will not abstain from prediction, and f (x) is the predicted value
in that case.
\[ (f, g)(x) \triangleq \begin{cases} f(x), & \text{w.p. } g(x); \\ \text{reject}, & \text{w.p. } 1 - g(x). \end{cases} \tag{1.1} \]
Thus, in its most general form, the selective predictor is randomized. Whenever
the selection function is a zero-one rule, g : X → {0, 1}, we say that the selective
predictor is deterministic (in the remainder of this work, with the exception of Appendix D, we use only deterministic selective predictors). Note that “standard learning” (i.e., no rejection is allowed) is the special case of selective prediction where g(x) selects all points (i.e.,
g(x) ≡ 1). The two main characteristics of a selective predictor are its coverage
and its risk (or “true error”).
Definition 1.3 (coverage) The coverage of a selective predictor (f, g) is the mean value of the selection function g(X) taken over the underlying distribution P,
\[ \Phi(f, g) \triangleq \mathbb{E}_{(X,Y)\sim P}\left\{g(X)\right\}. \]
Definition 1.4 (selective true risk) For a bounded loss function ℓ : Y × Y → [0, 1], we define the risk of a selective predictor (f, g) as the average loss on the accepted samples,
\[ R(f, g) \triangleq \frac{\mathbb{E}_{(X,Y)\sim P}\left\{\ell(f(X), Y) \cdot g(X)\right\}}{\Phi(f, g)}. \]
This risk definition clearly reduces to the standard definition of risk (Definition 1.1)
if g(x) ≡ 1. Note that (at the outset) both the coverage and risk are unknown
quantities because they are defined in terms of the unknown underlying distribution
P.
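Although the true coverage and selective risk above are defined with respect to the unknown distribution P, their empirical counterparts on a labeled sample are immediate to compute. The following Python sketch (not part of the thesis; a minimal illustration assuming a deterministic selection function g returning 0 or 1 and user-supplied callables f and loss) mirrors Definitions 1.3 and 1.4 on a finite sample.

```python
import numpy as np

def empirical_coverage(g, X):
    # Empirical analogue of Definition 1.3: the fraction of accepted points.
    return float(np.mean([g(x) for x in X]))

def empirical_selective_risk(f, g, X, Y, loss):
    # Empirical analogue of Definition 1.4: average loss over accepted points only.
    accept = np.array([bool(g(x)) for x in X])
    if not accept.any():
        raise ValueError("selective risk is undefined when no point is accepted")
    losses = np.array([loss(f(x), y) for x, y in zip(X, Y)])
    return float(losses[accept].mean())
```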
Definition 1.5 (pointwise competitiveness) Let f ∗ ∈ F be the true risk minimizer with respect to unknown distribution P (X, Y ). A selective classifier (f, g)
is pointwise competitive if for any x ∈ X , for which g(x) > 0,
f (x) = f ∗ (x).
In selective regression, where the output space Y ⊆ R is continuous, we can
generalize the definition as follows.
Definition 1.6 (ǫ-pointwise competitiveness) Let f ∗ ∈ F be the true risk minimizer with respect to unknown distribution P (X, Y ). A selective regressor (f, g)
is ǫ-pointwise competitive if for any x ∈ X , for which g(x) > 0,
|f (x) − f ∗ (x)| ≤ ǫ.
Pointwise competitiveness is a considerably stronger requirement than low risk, which
only refers to average performance. Note that in the realizable case, the target hypothesis belongs to the hypothesis class F, and therefore R(f ∗ ) = 0. In this case
the pointwise competitiveness reduces to the requirement that the selective predictor will never err on its region of activity. We call this case perfect classification.
Definition 1.7 (perfect classification) A selective classifier (f, g) is a perfect classifier with respect to distribution P (X, Y ) if for any x ∈ X , for which g(x) > 0,
\[ \mathbb{E}_{Y \sim P(Y|x)}\left\{\ell(f(x), Y)\right\} = 0. \]
We conclude with a standard definition of version space.
Definition 1.8 (version space [71]) Given a hypothesis class F and a training
sample Sm, the version space $VS_{\mathcal{F},S_m}$ is the set of all hypotheses in F that classify
Sm correctly.
1.3 Contextual background
The term self-aware learning was first coined in the context of the KWIK (knows
what it knows) framework for reinforcement learning [67]. We use this term in a
broader context to include all learning problems where the learner needs to quantify
its own self-confidence for its predictions.
Throughout the years, selective classification, selective regression, and active
learning have been studied by different research communities. Although these
problems seem different at first glance, they are tightly related, and results and
techniques developed for one can be used for the others. In this section we review
some known results in selective prediction and active learning, used as a contextual
background for our work. This is by no mean an exhaustive survey of main results
in those areas.
1.3.1 Selective classification
Selective classification, perhaps better known as ‘classification with a reject option’, was first studied in the fifties, in the context of character recognition [22].
Among the earliest works are papers by Chow [22, 23], focusing on Bayesian solutions for the case where the underlying distributions are fully known. Over the
years, selective classification has continued to draw attention and numerous papers have been published. Effective selective classification is attractive for obvious
reasons in applications where one is not concerned with, or can afford, partial coverage of the domain, and/or in cases where extremely low risk is a must but is not
achievable in standard classification frameworks. Classification problems in medical diagnosis and in bioinformatics often fall into this category [70, 45]. Of the
many research publications on selective classification, the vast majority have been
concerned with implementing a reject option within specific learning schemes, by
endowing a learning scheme (e.g., neural networks, SVMs, HMMs) with a reject
mechanism. Most of the reject mechanisms were based on a heuristic “ambiguity” or
(lack of) “confidence” principle: “when confused or when in doubt, refuse to classify.” For example, in the case of support vector machines, the natural (and widely
used) indicator for confidence is distance from the decision boundary. While there
is plenty of empirical evidence for the potential effectiveness of selective classification in reducing the risk, the work done so far has not facilitated rigorous understanding of rejection and quantification of its benefit. In particular, there are no
formal discussions on the necessary and sufficient conditions for pointwise competitiveness and the achievable coverage rates. The very few theoretical works that
considered selective classification (see Appendix E) do provide some risk or coverage bounds for specific schemes (e.g., ensemble methods) or learning principles
(e.g., ERMs). However, none guarantee (even asymptotic) convergence of the risk
of the selective classifier to the risk of the best hypothesis in the class, not to mention pointwise convergence of the selective classifier itself to the best hypothesis
in the class. Hence, elimination of the estimation error is not guaranteed by any
known result.
In the KWIK framework a similar concept to pointwise competitiveness was
indeed studied, and coverage rates were analyzed [67]. However, it was limited
to the realizable case and concerned an adversarial setting where both the target
hypothesis and the training data are chosen by an adversary. While all positive
results for the KWIK adversarial setting apply to our probabilistic setting (where
training examples are sampled i.i.d.), this adversarial setting precludes non-trivial
coverage for all the interesting hypothesis classes discussed in this work. This
deficiency comes as no surprise because the KWIK adversarial setting is much
more challenging than the probabilistic model we assume here.
Finally, conformal prediction is a very general model for “hedged” prediction
that has been developed by Vovk et al. [86]. Though conformal prediction does not
directly deal with partial coverage, technically it is possible to construct a selective
predictor from any conformal predictor. However conformal prediction guarantees
do not apply to our setting. In Appendix E we discuss the differences between our
approach and conformal prediction.
1.3.2 Selective regression
The reject option has been mentioned only rarely and anecdotally in the context
of regression. In [60] a boosting algorithm for regression was proposed and a
few reject mechanisms were considered, applied to the aggregate decision and/or to the underlying weak regressors. A straightforward threshold-based reject
mechanism (rejecting low response values) was applied in [7] on top of support
vector regression. This mechanism was found to lower false positive rates. As in
selective classification, the current literature lacks any formal discussions of RC
trade-off optimality. In particular, no discussions on the sufficient conditions for
ǫ-pointwise competitiveness and the achievable coverage rates are available.
A similar concept to the ǫ-pointwise competitiveness was introduced in the
KWIK online regression framework [67, 78, 66]. However, as in the case of classification, the analysis concerned an adversarial setting and assumed near-realizability
(the best hypothesis is assumed to be pointwise close to the Bayes classifier). Furthermore, in cases where the approximation error is not precisely zero, the estimation error of the selective regressor in KWIK cannot be arbitrarily reduced as can
be accomplished here.
1.3.3 Active learning
Active learning is an intriguing learning model that provides the learning algorithm
with some control over the learning process, possibly leading to much faster learning. In recent years it has been gaining considerable recognition as a vital technique for efficiently implementing inductive learning in many industrial applications where there is an abundance of unlabeled data, and/or in cases where labeling
costs are high.
In online selective sampling [5, 24], also referred to as stream-based active
learning, the learner is given an error objective ǫ and then sequentially receives unlabeled examples. At each step, after observing an unlabeled example x, the learner
decides whether or not to request the label of x. The learner should terminate the
learning process and output a binary classifier whose true error is guaranteed to be
at most ǫ with high probability. The penalty incurred by the learner is the number of label requests made and this number is called the label complexity. A label
complexity bound of O(d log(d/ǫ)) for actively learning an ǫ-good classifier from
a concept class with VC-dimension d yields exponential speedup in terms of 1/ǫ.
This is in contrast to standard (passive) supervised learning where the sample complexity is typically O(d/ǫ).
The theoretical study of (stream-based, realizable) active learning is paved with
very interesting ideas. Initially, any known significant advantage of active learning
over passive learning was limited to a few cases. Perhaps the most favorable result
was an exponential label complexity speedup for learning homogeneous (crossing
through the origin) linear classifiers where the (linearly separable) data is uniformly
distributed over the unit sphere. This result was obtained by various authors using
various analysis techniques, for a number of strategies that can all be viewed in
hindsight as approximations or variations of the “CAL algorithm” of Cohn et al.
[24]. Among these studies, the earlier theoretical results [76, 36, 37, 34, 42] considered Bayesian settings and studied the speedup obtained by the Query by Committee (QBC) algorithm. The more recent results provided PAC-style analysis
[29, 46, 48].
Lack of positive results for other non-toy problems, as well as various additional negative results, led some researchers to believe that active learning is not
necessarily advantageous in general. Among the striking negative results is Dasgupta’s negative example for actively learning general (non-homogeneous) linear
classifiers (even in two dimensions) under the uniform distribution over the sphere
[27].
A number of recent innovative papers proposed alternative models for active
learning. Balcan et al. [10] introduced a subtle modification of the traditional label
complexity definition, which opened up avenues for new positive results. According to their new definition of “non-verifiable” label complexity, the active learner
is not required to know when to stop the learning process with a guaranteed ǫ-good classifier. Their main result, under this definition, is that active learning is
asymptotically better than passive learning in the sense that only o(1/ǫ) labels are
required for actively learning an ǫ-good classifier from a concept class that has
a finite VC-dimension. Another of their accomplishments is an exponential label
complexity speedup for (non-verifiable) active learning of non-homogeneous linear
classifiers under the uniform distribution over the unit sphere.
Using Hanneke’s characterization of active learning in terms of the “disagreement coefficient” [46], Friedman [38] recently extended the Balcan et al. results
and proved that a target-dependent exponential speedup can be asymptotically
achieved for a wide range of “smooth” learning problems (in particular, the hypothesis class, the instance space and the distribution should all be expressible by
smooth functions). He proved that under such smoothness conditions, for any target hypothesis f ∗ , Hanneke’s disagreement coefficient is bounded above in terms
of a constant c(f ∗ ) that depends on the unknown target hypothesis f ∗ (and is independent of δ and ǫ). The resulting label complexity is O (c(f ∗ ) d polylog(d/ǫ))
[49]. This is a very general result but the target-dependent constant in this bound
is only guaranteed to be finite.
Despite impressive progress in the case of target-dependent bounds for active
learning, the current state of affairs in the arena of target-independent bounds for active learning leaves much to be desired. To date, the most advanced result in
this model, which was essentially established by Seung et al. and Freund et al.
more than twenty years ago [76, 36, 37], is still a target-independent exponential
speedup bound for homogeneous linear classifiers under the uniform distribution
over the sphere [50].
1.4 Main contributions
1.4.1 Low error selective strategy (LESS)
We first show that pointwise competitiveness is achievable by a family of learning
strategies termed low error selective strategies (LESS). Given a training sample Sm
and a hypothesis class F, LESS outputs a pointwise competitive selective predictor
(f, g) with high probability. The idea behind the strategy is simple: using standard concentration inequalities, one can show that the training error of the true risk
minimizer, f ∗ , cannot be “too far” from the training error of the empirical risk minimizer, fˆ. Therefore, with high probability f ∗ belongs to the class of low empirical
error hypotheses. Now all we need to do is abstain from prediction whenever the
vote among all hypotheses in this class is not unanimous. LESS is implemented in
a slightly different manner for realizable selective classification, agnostic selective
classification, and selective regression.
For the realizable setting, the low empirical error class includes all hypotheses
with exactly zero empirical error (all hypotheses in the version space). Figure 1.3
depicts this strategy for the case of linear classifiers in R2 . In this example the training set includes three positive samples (depicted by red crosses) and four negative
samples (depicted by blue circles). The four dashed lines represent four different
hypotheses that are consistent with the training sample. Therefore, all these hypotheses belong to the version space. The darker region represents the agreement
region with respect to the version space, that is, the agreed upon domain region for
all hypotheses in the version space. This region will be the one accepted by LESS
in this example. This strategy is also termed consistent selective strategy (CSS) and
is described in greater detail in Section 3.1.
Figure 1.3: An example of CSS with the class of linear classifiers in R2 . The
white region is the region rejected by the strategy.
For the agnostic case, the low empirical error class includes all hypotheses with empirical error less than a threshold that depends only on the VC dimension of F, the size of the training set m, and the confidence parameter δ (f ∗ belongs to this class with probability of at least 1 − δ). This strategy is
described in Section 5.2. For selective regression, as for agnostic classification, the
low empirical error class includes all hypotheses with empirical error less than a
threshold that depends on the VC dimension of F, m, and δ. However, in this
case, the requirement is not to reach a unanimous vote for prediction, but rather
an ǫ-unanimous vote. The strategy will abstain from prediction if there is at least
one hypothesis in the class that predicts a value that is more than ǫ away from
the prediction of the empirical risk minimizer (ERM). This strategy is also termed
ǫ-LESS and is described in Section 6.2.
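As a concrete, if idealized, illustration of the three variants above, the sketch below instantiates LESS over a finite pool of candidate hypotheses. This is not the thesis's construction: the pool, the threshold (which in the thesis depends on the VC dimension, m, and δ), and the callables are placeholders. Setting threshold = 0 recovers CSS for the realizable case, a positive threshold gives the agnostic rule, and supplying f_hat and eps gives the ǫ-unanimity test of ǫ-LESS.

```python
import numpy as np

def low_error_set(hypotheses, S, loss, threshold):
    # Hypotheses whose empirical risk on S = (X, y) is within `threshold`
    # of the empirical risk minimizer (threshold = 0 gives the version space
    # in the realizable case).
    X, y = S
    risks = [np.mean([loss(h(x), yi) for x, yi in zip(X, y)]) for h in hypotheses]
    best = min(risks)
    return [h for h, r in zip(hypotheses, risks) if r <= best + threshold]

def less_accepts(x, low_error, f_hat=None, eps=None):
    # Classification: accept iff the vote on x is unanimous.
    # Regression (f_hat, eps given): accept iff every low-error hypothesis
    # predicts within eps of the ERM's prediction f_hat(x).
    preds = [h(x) for h in low_error]
    if eps is None:
        return len(set(preds)) == 1
    return all(abs(p - f_hat(x)) <= eps for p in preds)
```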
1.4.2 Lazy implementation
At the outset, efficient implementation of LESS seems to be out of reach as we
are required to track the supremum of the empirical error over a possibly infinite
hypothesis subset, which in general might be intractable. To overcome this computational difficulty, we propose a reduction of this problem to a problem of calculating (two) constrained ERMs. For any given test point x, we calculate the ERM
over the training sample Sm with a constraint on the label of x (one positive label
constraint and one negative). We show that thresholding the difference in empirical
error between these two constrained ERMs is equivalent to tracking the supremum
over the entire (infinite) hypothesis subset.
For the realizable setting, CSS rejects a test point x if and only if both constrained ERMs have exactly zero empirical error. Figure 1.4 depicts the constrained ERMs (in dashed lines) for two different cases. As before, the red crosses and blue circles represent the positive and negative samples in the training set. The test point is represented as an empty black circle. In (a) there are two linear classifiers (depicted in dashed lines) that separate the training sample and classify the new test point as positive or negative. In this case the test point will be rejected by CSS. In (b) there is no linear classifier that can both separate the training set and classify the test point as negative. Therefore the risk of the ERM with a negative constraint on the test point is larger than zero and the test point will be accepted. A detailed discussion and proof of this reduction can be found in Section 3.4.

Figure 1.4: Lazy implementation. (a) The test point (depicted by the empty circle) is rejected as both constrained ERMs have zero empirical error; (b) the test point is accepted as one constrained ERM has an empirical error higher than zero.
For the agnostic case, LESS rejects a test point x if and only if the difference
between the empirical error of both constrained ERMs is less than a threshold that
depends only on the VC dimension of F, the size of the training set m, and the
confidence parameter δ. A detailed discussion and proof of this reduction can be
found in Section 5.4.
For selective regression, the above reduction is a bit more involved but follows
the same idea. We are now required to calculate two constrained ERMs and one
unconstrained ERM for each test point x. As mentioned previously, our goal in
selective regression is to find an ǫ-pointwise competitive selective hypothesis. The
constraint is that the regressor should have the value fˆ(x) ± ǫ at the point x, where
fˆ is the unconstrained ERM. The rejection rule is obtained by comparing the ratio
between the constrained and unconstrained ERMs with a threshold that depends on
the VC dimension of F, the size of the training set m, and the confidence parameter
δ. This rejection rule is equivalent to the decisions of ǫ-LESS, provided that the
hypothesis class F is convex. The intuition behind this reduction is beyond the
scope of the introduction and is discussed in detail in Section 6.4.
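To make the regression reduction more tangible, here is a small numpy sketch for ordinary linear least squares. It is only an illustration of the idea described above: the hard value constraint at the test point is solved in closed form via the KKT conditions, and the acceptance threshold is a user-supplied stand-in for the bound derived in Section 6.4.

```python
import numpy as np

def constrained_lls(X, y, x0, t):
    # Minimize ||Xw - y||^2 subject to x0 @ w = t (KKT closed form;
    # assumes X.T @ X is invertible).
    G = np.linalg.inv(X.T @ X)
    w_ls = G @ (X.T @ y)                       # unconstrained least-squares solution
    return w_ls - ((x0 @ w_ls - t) / (x0 @ G @ x0)) * (G @ x0)

def eps_less_accepts(X, y, x0, eps, threshold):
    # Sketch of the epsilon-LESS rejection rule for linear regression:
    # force the value f_hat(x0) +/- eps at x0 and check how much the
    # empirical risk grows; accept only if both constrained fits are
    # noticeably worse than the unconstrained ERM.
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    r_hat = np.mean((X @ w_hat - y) ** 2)
    pred = x0 @ w_hat
    accept = True
    for t in (pred - eps, pred + eps):
        w_c = constrained_lls(X, y, x0, t)
        r_c = np.mean((X @ w_c - y) ** 2)
        if r_c <= threshold * r_hat:           # a low-error regressor deviates by eps
            accept = False
    return accept
```

Calling eps_less_accepts for each test point yields the selection function g; the classification analogue, built from the same two constrained ERMs, is the disbelief index discussed next.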
1.4.3 The disbelief principle
In our lazy implementation of LESS for selective classification, we calculate two
constrained ERMs. We term the absolute difference between the errors of these
constrained ERM predictors over test point x the disbelief index of x. The higher
the disbelief index is, the more surprised we would be to find that we erred on test point x, and hence the higher our confidence in the prediction. Therefore the disbelief index can be used as an indicator of classification confidence instead of the more
traditional methods that are based on distance from the decision boundary. Figure 1.5 depicts an example of the confidence levels induced by both the disbelief
index and the standard distance from the decision boundary. In both cases, the
training set Sm was randomly sampled from a mixture of two identical normal
distributions (centered at different locations) and the hypothesis class is the class
of linear classifiers. Further discussion on the disbelief principle can be found in
Section 5.4.
Figure 1.5: Confidence level map using (a) the disbelief index; (b) the distance
from the decision boundary.
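The computation behind the disbelief index can be sketched generically as follows. This is an illustration only: the hard label constraint is approximated by a very large sample weight, and `fit` stands for any learning routine of the form fit(X, y, sample_weight) returning a predictor; neither the weighting trick nor the function names come from the thesis.

```python
import numpy as np

def disbelief_index(x, X, y, fit, big_weight=1e4):
    # Train two "constrained" ERMs, one forced to label x as +1 and one as -1
    # (the constraint is approximated by a very large sample weight), and
    # return the absolute difference of their empirical errors on (X, y).
    # x is a 1-D feature vector; y contains labels in {-1, +1}.
    errors = []
    for forced_label in (+1, -1):
        X_aug = np.vstack([X, x.reshape(1, -1)])
        y_aug = np.append(y, forced_label)
        w_aug = np.append(np.ones(len(y)), big_weight)
        h = fit(X_aug, y_aug, w_aug)
        errors.append(np.mean(h(X) != y))
    return abs(errors[0] - errors[1])
```

A large disbelief index means that one of the two forced labelings is clearly inconsistent with the data, so the prediction on x can be trusted; LESS accepts x when the index exceeds the threshold discussed in Section 5.4.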
1.4.4 Characterizing set complexity
Clearly, the LESS strategy is extremely aggressive (or defensive): it rejects any
point for which the vote is not unanimous (or ǫ-unanimous in the case of regression)
among (infinitely) many hypotheses. While this worst-case approach is precisely
the attribute that allows for pointwise competitiveness (or ǫ-pointwise competitiveness), it may raise the concern that pointwise competitiveness can only be achieved
in a trivial sense where LESS rejects the entire domain. Are we throwing out the
baby with the bath water?
The dependency of the coverage on the training sample size is called the coverage rate. We present and analyze a new hypothesis class complexity measure,
termed characterizing set complexity. This measure plays a central role in the analysis of coverage rates in selective classification (Chapters 3 and 5), as well as in
the analysis of label complexity in online active learning (Chapter 4).
The coverage rate of a selective classifier depends on the complexity of the hypothesis class from which the selection function g(x) is chosen. In the realizable
case, the selection function g(x) chosen by LESS (CSS) is exactly the maximal
agreement set of the version space: the set of all points for which all hypotheses
in the version space vote unanimously. But from which hypothesis class is this
selection function chosen? To answer this question, we first use a data-dependent
parameter to parameterize the selection hypothesis class. The parameter we use is
called the version space compression set size, defined to be the size of the smallest
subset of the training set that induces the same version space as the entire training
set. Figure 1.6 illustrates the version space compression set and the maximal agreement region. The maximal agreement region with respect to the version space is
the region agreed upon by all hypotheses in the version space.
Figure 1.6: The version space compression set is depicted in green circles, and the
maximal agreement region is depicted in gray.
Figure 1.7 illustrates a few random samples, all with version space compression
set size of 4. Each sample induces a different version space and hence, a different
maximal agreement region. We can now consider the class of all maximal agreement regions for hypothesis class F and any training sample of size n. The order-n
characterizing set complexity γ(F, n) is defined as the VC dimension of this special class. Relying on classical results in combinatorial geometry and statistical learning, we derive an upper bound on the order-n characterizing set complexity of a few hypothesis classes. Our results are summarized in Table 1.1. Although this geometric complexity measure is defined for the realizable case, it is instrumental in the analysis of coverage rates in all cases including the agnostic setting, as well as for the analysis of label complexity in active learning.

Figure 1.7: Maximal agreement region for different samples.

  Hypothesis class                         | γ(F, n)               | Reference
  Linear separators in R                   | 2                     | Section 2.3
  Intervals in R                           | O(max(n, 4))          | Section 2.4
  Linear separators in Rd                  | O(d^3 n^(d/2) log n)  | Lemma 2.7
  Balanced axis-aligned rectangles in Rd   | O(dn log n)           | Theorem 2.2

Table 1.1: Order-n characterizing set complexity of different hypothesis classes.
1.4.5 Coverage rates
As mentioned before, LESS is a very aggressive strategy. It is therefore quite
surprising that it does not reject the vast majority of the domain in most cases.
Indeed, our study of the coverage rates of LESS (ǫ-LESS) for both realizable and
agnostic selective classification (regression) shows that the resulting coverage is
substantial. Specifically, one of the most important characteristics of a coverage
bound is its dependency on the sample size m (the coverage rate). We say that a
coverage bound that converges to one at rate O(polylog(m)/m) is fast. Can we
hope for fast coverage rates when applying the LESS strategy?
We start our analysis with the simple case of realizable selective classification, where we assume that the target hypothesis belongs to the hypothesis class
F. We show in Theorem 3.1 that CSS is optimal in the sense that no other strategy
that achieves perfect classification can have larger coverage. For finite hypothesis
spaces, CSS achieves, with probability of at least 1 − δ (over the choice of the
training set), a coverage of
\[ \Phi(f, g) \ge 1 - O\!\left(\frac{1}{m}\left(|\mathcal{F}| + \ln\frac{1}{\delta}\right)\right). \]
This distribution-free coverage guarantee has a fast coverage rate and is proven
to be nearly tight for CSS. Therefore, it is the best possible bound for any selective learner. Unfortunately, for infinite hypothesis spaces, the situation is not as
favorable. It is impossible to provide coverage guarantees for perfect classification that hold for every data distribution. Specifically, for linear classifiers, we contrive a bad distribution for which any selective learner ensuring zero risk will be
forced to reject the entire domain, thus failing to guarantee more than zero coverage.
Fortunately, however, this observation does not preclude non-trivial perfect classification in less adverse situations. Using our new complexity measure, it is possible
to obtain both data-dependent and distribution-dependent coverage bounds.
For infinite hypothesis spaces, LESS (CSS) achieves, with probability of at
least 1 − δ, a coverage of
\[ \Phi(f, g) \ge 1 - O\!\left(\frac{1}{m}\left(\gamma(\mathcal{F}, \hat{n}) \ln\frac{m}{\gamma(\mathcal{F}, \hat{n})} + \ln\frac{m}{\delta}\right)\right), \tag{1.2} \]
where n̂ is the version space compression set size, and γ(F, n) is the order-n
characterizing set complexity of F. Plugging in our results for γ(F, n) (Table 1.1),
we immediately derive data-dependent coverage bounds for different hypothesis
classes. The bounds are data-dependent because they depend on the version space
compression set size n̂ (an empirical measure). In order to study the rate of the
bounds, we first need to investigate the dependency of n̂ on the sample size.
Utilizing a classical result in geometric probability theory on the average number of maximal random vectors, we show in Lemma 3.14 that if the underlying distribution is any (unknown) finite mixture of arbitrary multi-dimensional Gaussians
in Rd , and the hypothesis class is the class of linear classifiers, then the compression set size of the version space obtained using m labeled examples satisfies, with
probability of at least 1 − δ,
\[ \hat{n} = O\!\left(\frac{(\log m)^{d}}{\delta}\right). \]
This bound immediately yields a coverage guarantee for perfect classification of
linear classifiers with fast coverage rate, as stated in Corollary 3.15. This is a
powerful result, providing strong indication of the effectiveness of perfect learning
with guaranteed coverage in a variety of applications.
To analyze the agnostic setting we introduce an additional known complexity
measure, called the disagreement coefficient (see Definition 4.10). This measure
was first introduced by Hanneke in the study of label complexity in active learning
[46]. Let F be a hypothesis class with VC-dimension d and disagreement coefficient θ, and H the associated excess loss class. If H is a (β, B)-Bernstein class
w.r.t. P (X, Y ) (see Definition 5.4), then LESS achieves, with probability of at
least 1 − δ, a coverage of
\[ \Phi(f, g) \ge 1 - O\!\left(\left(B\theta\left(\frac{d}{m}\ln\frac{m}{d} + \frac{1}{m}\ln\frac{1}{\delta}\right)\right)^{\beta/2}\right). \]
Bernstein classes arise in many natural situations; see discussions in [62, 11, 13].
For example, if the conditional probability P (Y |X) is bounded away from 1/2, or if it satisfies Tsybakov's noise conditions, then the excess loss class is a Bernstein class [11, 80] (in particular, if the data was generated from any unknown deterministic hypothesis with limited noise, then P (Y |X) is bounded away from 1/2).
Using a reduction from selective sampling to agnostic selective classification,
we show that the disagreement coefficient can be bounded using the characterizing
set complexity. Specifically, if γ(F, n̂) = O(polylog(m)) for some hypothesis
class and marginal distribution P (X), then LESS achieves, with probability of at
least 1 − δ, a coverage of
\[ \Phi(f, g) \ge 1 - B \cdot O\!\left(\left(\frac{\mathrm{polylog}(m)}{m} \cdot \log\frac{1}{\delta}\right)^{\beta/2}\right) \]
for the same hypothesis class and marginal distribution (the conditional distribution
P (Y |X) can be different though). Since β ≤ 1, we get that the coverage rate is at most $O(\mathrm{polylog}(m)/\sqrt{m})$, which is — not surprisingly — slower than the realizable case.
In many applications the output domain Y is continuous or numeric and our
goal is to find the best possible real-valued prediction function given a finite training sample Sm . Unfortunately, the disagreement coefficient is not defined for the
case of real-valued functions and the 0/1-loss function cannot be used. The above
results for agnostic selective classification thus cannot be directly applied to regression. In Chapter 6 we analyze this case and derive sufficient conditions for
ǫ-pointwise competitiveness.
Our first contribution is a natural extension of Hanneke's disagreement coefficient to the case of real-valued functions. Hanneke's disagreement coefficient is
based on the definition of a disagreement set. A disagreement set of hypothesis
class G is the set of all points in the domain on which at least two hypotheses in
G do not agree. We extend this definition to real-valued functions by defining the
ǫ-disagreement set as the set of all points in the domain for which at least two
hypotheses in G predict values with a difference of more than ǫ. Our extended measure, termed the ǫ-disagreement coefficient (see Definition 6.5), is based on this
definition of ǫ-disagreement set.
Let F be a convex hypothesis class, assume ℓ : Y × Y → [0, ∞) is the squared
loss function, and let ǫ > 0 be given. Then ǫ-LESS achieves, with probability of at
least 1 − δ, a coverage of
\[ \Phi(f, g) \ge 1 - \theta_{\epsilon}\sqrt{\left(\sigma_{\delta/4}^{2} - 1\right) \cdot R(f^*) + \sigma_{\delta/4} \cdot \hat{R}(\hat{f})}\,, \]
where θǫ is the ǫ-disagreement coefficient of F and σδ is any multiplicative risk
bound for real-valued functions. Specifically, we assume that there is a risk bound
$\sigma_{\delta} \triangleq \sigma(m, \delta, \mathcal{F})$ such that for any tolerance δ, with probability of at least 1 − δ,
any hypothesis f ∈ F satisfies R(f ) ≤ R̂(f ) · σ(m, δ, F). Similarly, we assume
the reverse bound, R̂(f ) ≤ R(f ) · σ(m, δ, F), holds under the same conditions.
  Setting       | Coverage rate                                                | Reference
  Realizable    | 1 − O((1/m)(|F| + ln(1/δ)))                                  | Theorem 3.2
  Realizable    | 1 − O((1/m)(γ(F, n̂) ln(m/γ(F, n̂)) + ln(m/δ)))                | Theorem 3.10
  Agnostic      | 1 − O((Bθ((d/m) ln(m/d) + (1/m) ln(1/δ)))^(β/2))             | Theorem 5.5
  Agnostic (*)  | 1 − O((B · (polylog(m)/m) · log(1/δ))^(β/2))                 | Theorem 5.9
  Regression    | 1 − θǫ · sqrt((σ_{δ/4}^2 − 1) · R(f∗) + σ_{δ/4} · R̂(f̂))      | Theorem 6.7

Table 1.2: Coverage rate bounds. (*) The coverage rate bound holds for hypothesis class F and distribution P (X, Y ) if γ(F, n̂) = O(polylog(m)) for F and marginal distribution P (X).
1.4.6 New speedup results for active learning
In stream-based active learning (also known as selective sampling) the learner is
presented with a stream of unlabeled examples and should decide whether or not to
request a label for each training example after it is presented. While at first glance
active learning may seem completely unrelated to selective prediction, in Chapter 4 we present a reduction of active learning to perfect selective classification
and a converse reduction — of selective classification to active learning. These reductions strongly tie together these two classical learning models and prove equivalence with respect to learning rates (label complexity and coverage rates). The
first reduction (active reduced to selective) is substantially more involved, but is
nonetheless quite rewarding: it provides us with the luxury of analyzing dynamic
active learning problems within the static selective prediction framework, and facilitates a novel and powerful technique for bounding the label complexity of active
learning algorithms. In Theorem 4.5 we show how the label complexity bound
for the CAL active learning algorithm can be directly derived from any coverage
bound for the CSS selective classification strategy.
The intuition behind the relationship between pointwise competitive selective
prediction and active learning can be easily hand waved: whenever the learner is
sure about the prediction of the current unlabeled example x, there is no need to ask
for the label of x. Conversely, whenever the learner cannot determine the label with
sufficient certainty, perhaps there is something to learn from it and it is worthwhile
to ask (and pay) for its label. Despite the conceptual simplicity of this intuitive
relationship, the crucial component of this reduction is to prove that it preserves
“fast rates.” Specifically, if the rejection rate of CSS (one minus the coverage rate) is O(polylog(m/δ)/m), then (F, P ) is actively learnable by CAL with exponential
label complexity speedup.
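The reduction can be pictured with the following minimal sketch (our own illustration, not the thesis algorithm); disagrees_on stands for an assumed oracle that decides whether the version space induced by the labels gathered so far still contains hypotheses that disagree on x, which for linear classifiers can be realized by two linear separability tests (Section 3.4).

```python
# Sketch of the selective-classification view of stream-based active learning:
# query a label exactly when the current labeled set does not determine the
# label of x (i.e., exactly when CSS would reject x).
def cal_stream(unlabeled_stream, label_oracle, disagrees_on):
    labeled = []                      # labels actually requested (label complexity)
    inferred = []                     # points whose label is already implied
    for x in unlabeled_stream:
        if disagrees_on(labeled, x):  # x lies in the disagreement region
            y = label_oracle(x)       # pay for a label
            labeled.append((x, y))
        else:
            inferred.append(x)        # CSS would accept x: no label needed
    return labeled, inferred
```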
Utilizing our coverage bound results for perfect selective classification, we
prove that general (non-homogeneous) linear classifiers are actively learnable at
exponential (in 1/ǫ) label complexity rate when the data distribution is an arbitrary
unknown finite mixture of high-dimensional Gaussians. This target-independent result substantially extends the state-of-the-art in the formal understanding of active
learning. It is the first exponential label complexity speedup proof for a non-toy
setting.
While we obtain exponential label complexity speedup in 1/ǫ, our speedup result incurs an exponential slowdown in d², where d is the (Euclidean) problem dimension.
Nevertheless, in Theorem 4.13 we prove a lower bound of
Ω( (log m)^((d−1)/2) · (1 + o(1)) )
on the label complexity, when considering the class of unrestricted linear classifiers
under a Gaussian distribution. Thus, an exponential slowdown in the dimension d
is not an artifact of our proof, but an unavoidable newly revealed limitation of
active learning, as it concerns any speedup result for the CAL algorithm. The
argument relies on the observation that CAL has to request a label for any point on
the convex hull of a sample Sm . The bound is then derived using known results
from probabilistic geometry, which bound the first two moments of the number of
vertices of a random polytope under the Gaussian distribution.
Finally, in Section 4.6 we relate our proposed technique to other complexity
measures for active learning. We show how recent results in active learning can be
used to derive coverage bounds in selective classification. We start by proving a
relation to the teaching dimension [43]. In Corollary 4.17 we show, by relying on a
known bound for the teaching dimension, that perfect selective classification with
meaningful coverage can be achieved for the case of axis-aligned rectangles under
a product distribution. We then focus on Hanneke’s disagreement coefficient and
show in Theorem 4.18 that the coverage of perfect selective classification can be
bounded from below using the disagreement coefficient. Conversely, we show, in
Corollary 4.23, that the disagreement coefficient can be bounded from above using
any coverage bound for perfect selective classification. Consequently, the results
here imply that the disagreement coefficient can be sufficiently bounded to ensure
fast active learning for the case of linear classifiers under a mixture of Gaussians.
Using the equivalence between active learning and perfect selective classification,
we were able to prove, in Corollary 4.26, that a bounded disagreement coefficient is
a necessary condition for CAL to achieve exponential label complexity speedup.
The importance of our new technique and the resulting (distribution-dependent) bounds for the analysis of CAL has already been recognized by others and was recently highlighted by Hanneke in his survey of recent advances in the theory of
active learning [50]:
“El-Yaniv and Wiener (2010) identify an entirely novel way to bound
... the label complexity of CAL, in terms of a new complexity measure
that has several noteworthy features. It incorporates certain aspects
of many different known complexity measures, including the notion of
a region of disagreement (as in the disagreement coefficient analysis),
the notion of a minimal specifying set (as in the teaching dimension
analysis ...), and the notion of the VC dimension.”
“... at present this is the only technique known to establish the bounds
on the label complexity of CAL ... for both k-dimensional linear separators under mixtures of multivariate normal distributions and axis-aligned rectangles under product distributions.”
As a matter of fact, the implications of our upper bound on the disagreement
coefficient go beyond the above mentioned bounds for CAL. Hanneke’s disagreement coefficient is one of the most important and widely used complexity measures
in the study of label complexity in agnostic active learning. For example, the disagreement coefficient is a key factor in the label complexity analyses of the famous
Agnostic Active (A²) [8] and RobustCALδ [50] learning algorithms.
A² was the first general-purpose agnostic active learning algorithm with proven improvement in error guarantees compared to passive learning. It has been proven that this algorithm, originally introduced by Balcan et al. [8], achieves exponential label complexity speedup (for the low accuracy regime) compared to passive learning only for a few simple cases, including threshold functions and homogeneous linear separators under the uniform distribution over the sphere [9]. In this work we
extend the results and prove that exponential label complexity speedup (for the low
accuracy regime) is achievable also for linear classifiers under a fixed (unknown)
mixture of Gaussians.
This new exponential speedup is limited to the low accuracy regime, but we
prove in Theorem 4.31 that, under some conditions on the source distribution,
RobustCALδ achieves exponential speedup for linear classifiers under a fixed mixture of Gaussians for any accuracy regime. To the best of our knowledge, our technique is the first (and currently only) one capable of accomplishing these strong
results for A² and RobustCALδ.
Similarly, our technique can be used to derive new exponential speedup bounds
for a long list of other learning methods thus leveraging the use and capability
of the disagreement coefficient. This list includes Beygelzimer et al.’s algorithm
for importance-weighted active learning (IWAL) [17], AdaProAL for Proactive
Learning [90], and SRRA based algorithms for pool-based active learning [1], to
mention a few.
1.4.7 Efficient implementation and empirical results
Attempting to implement even a “lazy” version of LESS is challenging because the
calculation of the disbelief index requires the identification of ERM hypotheses.
Unfortunately, for nontrivial hypothesis classes this is in general a computationally
hard problem. Furthermore, the disbelief index is a noisy statistic that depends on
the sample Sm . We present an implementation that “approximates” the proposed
strategy and addresses the above problems. We provide some empirical results
demonstrating the advantage of our proposed technique over rejection based on
distance from the decision boundary for six datasets and two kernels (SVM with
linear and RBF kernels). Figure 1.8 depicts the RC curve⁵ of LESS (in solid red)
compared to the RC curve of rejection based on distance from the decision boundary (green dashed line) for the Haberman dataset. This dataset contains survival
data of patients who underwent surgery for breast cancer. The goal is to predict
whether a patient will survive at least 5 years after surgery or not (classify to “surviving” and “not surviving”). Even with 80% coverage, we have already lowered
the test error (probability of misclassification) by about 3% compared to rejection
based on distance from the decision boundary. This advantage monotonically increases with the rejection rate.
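For readers who wish to reproduce such curves, the following sketch (our own illustration, not the thesis code) computes an empirical RC (risk-coverage) curve from test labels, predictions, and any per-example confidence score supplied by a rejection mechanism.

```python
import numpy as np

def rc_curve(y_true, y_pred, confidence):
    """Empirical risk-coverage curve: for each coverage level c, the test error
    over the c most-confident test points (higher confidence = accepted first)."""
    order = np.argsort(-np.asarray(confidence))        # most confident first
    errors = (np.asarray(y_true)[order] != np.asarray(y_pred)[order]).astype(float)
    n = len(errors)
    coverage = np.arange(1, n + 1) / n                  # fraction accepted
    risk = np.cumsum(errors) / np.arange(1, n + 1)      # error among accepted
    return coverage, risk

# Toy usage (confidence could be, e.g., distance from the decision boundary):
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])
conf   = np.array([0.9, 0.8, 0.1, 0.7, 0.6])
cov, r = rc_curve(y_true, y_pred, conf)   # risk stays 0 until the mistake is accepted last
```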
For the case of linear least squares regression (LLSR), we are able to show
how ǫ-LESS can be implemented exactly and efficiently using standard matrix operations. Furthermore, we derive a pointwise bound on the difference between
the prediction of the ERM and the prediction of the best regressor in the class
(|f ∗ (x) − fˆ(x)|). We conclude with some empirical results demonstrating the advantage of our proposed technique over a simple and natural 1-nearest neighbor
(NN) technique for selection.
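The computational core of such an exact implementation can be sketched as follows (a simplified illustration, not the thesis derivation): a constrained least-squares fit measures how much the empirical risk must grow when the prediction at a test point is forced to move by ±ǫ, and the point is accepted only if both perturbations are costly. The threshold risk_slack below is a hypothetical stand-in for the precise risk-bound condition derived in Chapter 6.

```python
import numpy as np

def lstsq_erm(X, y):
    """Unconstrained least-squares ERM: w_hat = argmin ||Xw - y||^2."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def constrained_lstsq(X, y, x0, target):
    """Least squares subject to the equality constraint x0 . w = target
    (closed form via a single Lagrange multiplier; assumes x0 is not in the
    null space of X^T X)."""
    A_inv = np.linalg.pinv(X.T @ X)
    w_hat = A_inv @ (X.T @ y)
    direction = A_inv @ x0
    return w_hat + (target - x0 @ w_hat) / (x0 @ direction) * direction

def accept_point(X, y, x0, eps, risk_slack):
    """Illustrative epsilon-LESS-style test: accept x0 only if *both* constrained
    fits (prediction shifted by +eps and by -eps at x0) raise the empirical
    squared risk by more than risk_slack."""
    w_hat = lstsq_erm(X, y)
    base_risk = np.mean((X @ w_hat - y) ** 2)
    for sign in (+1.0, -1.0):
        w_c = constrained_lstsq(X, y, x0, x0 @ w_hat + sign * eps)
        if np.mean((X @ w_c - y) ** 2) - base_risk <= risk_slack:
            return False    # a competitive regressor disagrees at x0: reject
    return True             # every eps-deviating regressor pays too much: accept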
⁵ See Appendix D for further discussion on the RC curve.
[Figure 1.8 plot: Haberman dataset; x-axis: coverage c, y-axis: test error.]
Figure 1.8: RC curve of LESS (red line) and rejection based on distance from the
decision boundary (dashed green line).
Chapter 2
Characterizing Set Complexity
“
Oh the places you’ll go! There is fun to be done!
There are points to be scored. There are games to be won.
And the magical things you can do with that ball
will make you the winning-est winner of all.
”
Dr. Seuss, Oh, the Places You’ll Go!
In this chapter we present and analyze a new hypothesis class complexity measure,
termed characterizing set complexity. This measure plays a central role in the
analysis of coverage rates in selective classification (Chapters 3 and 5), as well as
in the analysis of label complexity in online active learning (Chapter 4).
2.1 Motivation
The characterizing set complexity was originally derived for the study of coverage
rates in realizable selective classification [31]. Let F be some hypothesis class and
f ∗ ∈ F our target hypothesis. Let Sm = {(xi , yi )}, i = 1, . . . , m, be a finite training sample of
m labeled examples, assumed to be sampled i.i.d. from some unknown underlying
distribution P (X, Y ) over X × Y. A classic result from statistical learning theory
is that the error rate of the ERM depends on the complexity of the hypothesis class
from which the classification function f (x) is chosen. Specifically, if F has a
bounded complexity (VC dimension), then the error of the ERM approaches the
error of the best hypothesis in the class when m approaches infinity. Analogously,
the coverage rate of a selective classifier depends on the complexity of the selection
hypothesis class from which the selection function g(x) is chosen.
To better explain the motivation for our new complexity measure, we first review a simple and intuitive learning strategy for realizable selective classification.
In Chapter 3 we present a simple learning strategy for the realizable case, which
achieves perfect classification with maximum coverage. Namely, given hypothesis class F and training set Sm , the strategy outputs a selective classifier (f, g) that
never errs on the accepted domain. This strategy, termed Consistent Selective Strategy (CSS), rejects any test point that is not classified unanimously by all hypotheses in the version space. The region accepted by the strategy is called the maximal
agreement set. Figure 2.1 depicts an example of a maximal agreement set for the
case of linear classifiers in R2 . The red crosses and blue circles represent positive
and negative examples, respectively; the dashed lines represent hypotheses in the
version space, and the grayed areas represent the maximal agreement set.
Figure 2.1: An example of a maximal agreement set for linear classifiers in R2 .
What is the function (hypothesis) class from which g(x) was chosen in this example? Is
it the class of all unions of two open polygons with three edges each? And what
happens when we have a larger training set? In Figure 2.2 we present an example
of a maximal agreement set for a training set with 7 examples (a) and a training
set with 16 examples (b). We can see that in the latter case the selection function
g(x) is much more complex (union of two open polygons with 5 and 6 edges). It
is evident that the complexity of the maximal agreement set depends on the training sample Sm .
Figure 2.2: The region accepted by CSS for the case of linear classifiers in R2 .
A useful way to parameterize the hypothesis class from which the selection function is taken is to utilize a data-dependent parameter. The parameter
we use is the version space compression set size. The version space compression
set size is the size of the smallest subset of the training set that induces the same
version space as the entire training set. The version space compression sets of the examples in Figure 2.2 are depicted in Figure 2.3 with green circles.
Figure 2.3: Version space compression set.
Clearly, by definition, the maximal agreement set depends only on the version
space compression set and the hypothesis class F. Let Hn be the set of all possible
maximal agreement sets generated by any version space compression set of size
n (with all possible labelings). For example, if n equals four, then the maximal
agreement set can be either a union of two open polygons with up to four edges
each (see Figure 2.4 (a), (b)), or a closed polygon with four edges (see Figure 2.4
(c)). The class of all maximal agreement sets is the union of all Hn .
Figure 2.4: Maximal agreement region.
While the complexity of this class is potentially unlimited (infinite VC-dimension), we can bound
the complexity of each subclass Hn . Given a training sample Sm , g(x) selects all
samples that belong to the maximal agreement set with respect to Sm . Therefore,
g(x) is chosen from the class of all maximal agreement sets, and specifically from
Hn̂ , where n̂ is the version space compression set size. As we will see in the next
chapter, it is sufficient to bound the complexity of Hn̂ in order to prove a bound on
the probabilistic volume of the maximal agreement region, and hence the coverage
of CSS.
The above discussion motivates our novel complexity measure, γ(F, n), called
the characterizing set complexity, which is defined as the VC dimension of the
class of all maximal agreement regions for hypothesis class F and any training
sample with size n. For example, for n = 4 we are limited to the class depicted in
Figure 2.4. Surprisingly, while the characterizing set complexity was originally developed for the realizable case, it turns out to play a central role also in the general
agnostic case. This result is obtained through a reduction to active learning and
an interesting connection to another complexity measure termed the disagreement
coefficient. The relation to the disagreement coefficient is studied in Section 4.6.2.
2.2 The characterizing set complexity
We start by defining a region in X , which is termed the “maximal agreement set.”
Any hypothesis that is consistent with the sample Sm is guaranteed to be consistent
with the target hypothesis f ∗ on this entire region.
Definition 2.1 (agreement set) Let G ⊆ F. A subset X ′ ⊆ X is an agreement set
with respect to G if all hypotheses in G agree on every instance in X ′ , namely,
∀ g1 , g2 ∈ G, ∀ x ∈ X ′ : g1 (x) = g2 (x).
Definition 2.2 (maximal agreement set) Let G ⊆ F. The maximal agreement set
with respect to G is the union of all agreement sets with respect to G.
Since a maximal agreement set is a region in X , rather than a hypothesis, we
formally define the dual hypothesis that matches every maximal agreement set.
Definition 2.3 (characterizing hypothesis) Let G ⊆ F and let AG be the maximal
agreement set with respect to G. The characterizing hypothesis of G, fG (x) is a
binary hypothesis over X obtaining positive values over AG and zero otherwise.
We are now ready to formally define Hn , a class we term order-n characterizing
set.
Definition 2.4 (order-n characterizing set) For each n, let Σn be the set of all
possible labeled samples of size n (all n-subsets, each with all possible labelings).
The order-n characterizing set of F, denoted Hn , is the set of all characterizing
hypotheses fG (x), where G ⊆ F is a version space induced by some member of
Σn .
Definition 2.5 (characterizing set complexity) Let Hn be the order-n characterizing set of F. The order-n characterizing set complexity of F, denoted γ (F, n),
is the VC-dimension of Hn .
The characterizing set complexity γ(F, n) depends only on the hypothesis
class F and the parameter n and is independent of the source distribution P . Can
we explicitly evaluate or bound this novel complexity measure for some interesting
hypothesis classes? Let us start with two toy examples that illustrate different
properties of the characterizing set complexity.
2.3 Linear classifiers in R
In the following example we explicitly calculate the characterizing set complexity
of linear classifiers in R. Let F be the class of thresholds. Depending on the sample
Sm , the maximal agreement set is either a single ray or two non-overlapping rays.
In Figure 2.5 we illustrate three different cases. Negative examples are marked
in blue circles, positive example in red crosses, and the maximal agreement set is
marked in gray. The VC-dimension of the class of two non-overlapping rays is
Figure 2.5: Maximal agreement set for linear classifier in R. (a) sample with both
positive and negative examples (b) sample with negative only examples (c)
sample with positive only examples
exactly the VC-dimension of the class of intervals in R (the complement class), which
is 2. Therefore, for any n the characterizing set complexity of linear classifiers in
R is γ (F, n) = 2.
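A tiny sketch of this example (our own illustration, assuming the convention that a threshold classifier labels a point positive iff it lies to the right of the threshold):

```python
def threshold_agreement_set(sample):
    """sample: list of (x, label) with label in {+1, -1}, realizable by some
    threshold 'positive iff x >= theta'. Returns a predicate that is True for
    points in the maximal agreement set (one or two rays)."""
    negatives = [x for x, y in sample if y == -1]
    positives = [x for x, y in sample if y == +1]
    max_neg = max(negatives) if negatives else float("-inf")
    min_pos = min(positives) if positives else float("+inf")

    def in_agreement(x):
        # every consistent threshold lies in (max_neg, min_pos], so all of them
        # agree on x exactly when x is outside that open interval
        return x <= max_neg or x >= min_pos

    return in_agreement

g = threshold_agreement_set([(-2.0, -1), (-0.5, -1), (1.0, +1), (3.0, +1)])
print(g(-1.0), g(0.2), g(2.0))   # True (agree -1), False (rejected), True (agree +1)
```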
2.4 Intervals in R
Let F be the class of intervals in R. If the sample Sm includes at least one positive
point, then the maximal agreement set is either an interval in R, a non-overlapping
interval and a ray, or a non-overlapping interval and two rays (see Figure 2.6). The
function defined by an interval and two non-overlapping rays in R is exactly the complement of two intervals in R.
Figure 2.6: Maximal agreement set for intervals in R. (a) non-overlapping interval and two rays, (b) non-overlapping interval and a ray, (c) interval only.
Therefore, the VC-dimension of the maximal
agreement set in this case is bounded by the VC-dimension of the class of the
union of two intervals, which is 4. However, if the entire sample Sm is negative,
the maximal agreement set is exactly the set of all points in Sm (see Figure 2.7 for
an example). The VC-dimension of the class of m points is exactly m. Therefore,
γ(F, n) ≤ 4 if n ≤ 4, and γ(F, n) ≤ n otherwise.
This is a simple example where the characterizing set complexity highly depends
on the parameter n.
Figure 2.7: Maximal agreement set for intervals in R when Sm includes negative points only.
2.5 Axis-aligned rectangles in Rd
In this section we consider the class of axis-aligned rectangles in Rd . Relying on
the following classical result from statistical learning theory, we infer an explicit
upper bound on the characterizing set complexity for axis-aligned rectangles.
Lemma 2.1 [18, Lemma 3.2.3] Let F be a binary hypothesis class of finite VC
dimension d ≥ 1. For all k ≥ 1, define the k-fold intersection,
Fk∩ ≜ {f1 ∩ f2 ∩ · · · ∩ fk : fi ∈ F, 1 ≤ i ≤ k},
and the k-fold union,
Fk∪ ≜ {f1 ∪ f2 ∪ · · · ∪ fk : fi ∈ F, 1 ≤ i ≤ k}.
Then, for all k ≥ 1,
V C(Fk∩ ), V C(Fk∪ ) ≤ 2dk log(3k).
Theorem 2.2 (order-n characterizing set complexity) Let F be the class of axis-aligned rectangles in Rd . Then,
γ(F, n) ≤ 42 d n log₂(3n).
Proof Let Sn = Sk⁻ ∪ Sn−k⁺ be a sample of size n composed of k negative examples, {x1 , x2 , . . . , xk }, and n − k positive ones. Let F be the class of axis-aligned rectangles. We define, for every 1 ≤ i ≤ k,
Ri ≜ Sn−k⁺ ∪ {(xi , −1)}.
Notice that V SF ,Ri includes all axis-aligned rectangles that classify all samples in
S + as positive, and xi as negative. Therefore, the agreement region of V SF ,Ri is
composed of two components as depicted in Figure 2.8.
Figure 2.8: Agreement region of V SF ,Ri .
The first component is the smallest rectangle that bounds the positive samples, and the second is an unbounded convex polytope defined by up to d hyperplanes intersecting at xi . Let
AGRi be the agreement region of V SF ,Ri and AGR the agreement region of
V SF ,Sn . Clearly, Ri ⊆ Sn , so V SF ,Sn ⊆ V SF ,Ri , and AGRi ⊆ AGR, and
it follows that
∪_{i=1}^{k} AGRi ⊆ AGR.
Assume, by contradiction, that x ∈ AGR but x ∉ ∪_{i=1}^{k} AGRi . Therefore, for any 1 ≤ i ≤ k, there exist two hypotheses f1(i) , f2(i) ∈ V SF ,Ri such that f1(i) (x) ≠ f2(i) (x). Assume, without loss of generality, that f1(i) (x) = 1. We define
f1 ≜ ∧_{i=1}^{k} f1(i)   and   f2 ≜ ∧_{i=1}^{k} f2(i) ,
meaning that f1 classifies a sample as positive if and only if all hypotheses f1(i)
classify it as positive. Noting that the intersection of axis-aligned rectangles is itself
an axis-aligned rectangle, we know that f1 , f2 ∈ F. Moreover, for any xi we have,
f1(i) (xi ) = f2(i) (xi ) = −1, so also f1 (xi ) = f2 (xi ) = −1, and f1 , f2 ∈ V SF ,Sn . But f1 (x) ≠ f2 (x). Contradiction. Therefore,
AGR = ∪_{i=1}^{k} AGRi .
It is well known that the VC dimension of a hyper-rectangle in Rd is 2d. The VC dimension of AGRi is bounded by the VC dimension of the union of two hyper-rectangles in Rd . Furthermore, the VC dimension of AGR is bounded by the VC dimension of the union of all AGRi . Applying Lemma 2.1 twice we get,
V Cdim {AGR} ≤ 42 d k log₂(3k) ≤ 42 d n log₂(3n).
If k = 0, then the entire sample is positive and the region of agreement is a hyper-rectangle. Therefore, V Cdim {AGR} = 2d. If k = n, then the entire sample is negative and the region of agreement is the points of the samples themselves. Hence, V Cdim {AGR} = n. Overall we get that in all cases,
V Cdim {AGR} ≤ 42 d n log₂(3n).
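The agreement region described in this proof also yields a simple membership test; the sketch below is our own illustration and assumes the sample contains at least one positive example, as in the case analyzed above.

```python
import numpy as np

def rectangle_agreement_label(sample_X, sample_y, x):
    """CSS-style test for axis-aligned rectangles (positive inside the rectangle).
    Returns +1 / -1 if all consistent rectangles agree on x, or None (reject).
    Assumes a realizable labeling with at least one positive example."""
    X = np.asarray(sample_X, dtype=float)
    y = np.asarray(sample_y)
    pos, neg = X[y == 1], X[y == -1]
    lo, hi = pos.min(axis=0), pos.max(axis=0)        # bounding box of positives
    if np.all(lo <= x) and np.all(x <= hi):
        return +1                                     # inside every consistent rectangle
    # smallest rectangle containing all positives and x
    lo2, hi2 = np.minimum(lo, x), np.maximum(hi, x)
    covers_negative = len(neg) > 0 and np.any(
        np.all((lo2 <= neg) & (neg <= hi2), axis=1))
    if covers_negative:
        return -1      # any rectangle classifying x positive would also cover a negative
    return None        # some consistent rectangles contain x, others do not: reject

print(rectangle_agreement_label([[0, 0], [2, 2], [1, 3]], [1, 1, -1], [1, 1]))  # +1
```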
2.6 Linear classifiers in Rd
In this section we consider the class of linear classifiers in Rd . Relying on a classical result from combinatorial geometry, we infer an explicit upper bound on the
characterizing set complexity for linear classifiers.
Fix any positive integer d, and let F ≜ {fw̄,φ (x̄)} be the class of all linear binary classifiers in Rd , where w̄ are d-dimensional real vectors, φ are scalars, and
fw̄,φ (x̄) = +1 if w̄T x̄ − φ ≥ 0, and fw̄,φ (x̄) = −1 if w̄T x̄ − φ < 0.
Given a binary labeled training sample Sm , define R+ ≜ R+ (Sm ) ⊆ Rd to be
the subset of the maximal agreement set with respect to the version space V SF ,Sm ,
consisting of all points with positive labels. R+ is called the ‘maximal positive
agreement set.’ The ‘maximal negative agreement set’, R− ≜ R− (Sm ), is defined
similarly. Before continuing, we define a new symmetric hypothesis class F̃ that
allows for a simpler analysis. Let F̃ ≜ {f̃w̄,φ (x̄)} be the function class
f̃w̄,φ (x̄) = +1 if w̄T x̄ − φ > 0;  0 if w̄T x̄ − φ = 0;  −1 if w̄T x̄ − φ < 0,
where we interpret 0 as a classification that agrees with both +1 and −1. Given a
sample Sm , we define R̃+ ⊆ Rd to be the region in Rd for which any hypothesis
in the version space¹ V SF̃ ,Sm classifies either +1 or 0 (i.e., this is the maximal
positive agreement set). We define R̃− analogously with respect to negative or
zero classifications. While F and F̃ are not identical, the maximal agreement sets
they induce are identical. This is stated in the following technical lemma whose
proof appears in Appendix B.
Lemma 2.3 (maximal agreement set equivalence) For any linearly separable sample Sm , R+ = R̃+ and R− = R̃− .
The next technical lemma, whose proof also appears in the appendix, provides
useful information on the geometry of the maximal agreement set for the class of
linear classifiers.
¹ Any hypothesis in F̃ that classifies every sample in Sm correctly or as 0 belongs to the version space.
Lemma 2.4 (maximal agreement set geometry I) Let Sm be a linearly separable labeled sample that is a spanning set of Rd . Then the regions R+ and R− are
each an intersection of a finite number of half-spaces, with at least d samples on
the boundary of each half-space.
Our goal is to bound the characterizing set complexity of F. As we show below,
this complexity measure is directly related to the number of facets of the convex
hull of n points in Rd . The following classical combinatorial geometry theorem by
Klee [73, page 98] is thus particularly useful. The statement of Klee’s theorem provided here is readily obtained from the original by using the Stirling approximation
of the binomial coefficient.
Theorem 2.5 (Klee, 1966) The number of facets of a d-polytope with n vertices is
at most
2 · (en/⌊d/2⌋)^⌊d/2⌋ .
An immediate conclusion is that the above term upper bounds the number of facets
of the convex hull of n points in Rd (which is of course a d-polytope).
Lemma 2.6 (maximal agreement set geometry II) Let Sn be a linearly separable sample consisting of n ≥ d + 1 labeled points. Then the regions R+ (Sn ) and
R− (Sn ) are each an intersection of at most
2(d + 1) · (2en/d)^⌊(d+1)/2⌋
half-spaces in Rd .
Proof For the sake of clarity, we limit the analysis to a sample Sn in general
position; that is, we assume that no more than d points lie on a (d − 1)-dimensional
plane. A sample Sn in arbitrary position can be handled by applying an appropriate infinitesimal displacement to the points.
By Lemma 2.3, we can limit our discussion to the hypothesis class F̃ (rather
than F). Since Sn includes more than d samples in general position it is a spanning
set of Rd . According to Lemma 2.4, R+ is an intersection of a finite number of
half-spaces, with at least d samples on the boundary of each half-space (and exactly
d in the general position). Let S + ⊆ Sn be the subset of all positive samples in Sn ,
and S − ⊆ Sn , the negative ones. Let f˜w̄,φ be one of the half-spaces defining R+ .
Then,
for all x̄ ∈ Sn : w̄T x̄ − φ ≥ 0 if x̄ ∈ S + , and w̄T x̄ − φ ≤ 0 if x̄ ∈ S − .
Also, exactly d samples x̄ satisfy w̄T x̄ − φ = 0.
We now embed the samples in Rd+1 using the following transformation x̄ → x̄′ :
x̄′ ≜ (0, x̄) if x̄ ∈ S + ;  (1, −x̄) if x̄ ∈ S − .
For each half-space (w̄, φ) in Rd we define a unique half-space, (w̄′ , φ′ ), in Rd+1 ,
w̄′ ≜ (2φ, w̄),  φ′ ≜ φ.
We observe that
w̄′T x̄′ − φ′ = w̄T x̄ − φ ≥ 0, if x̄ ∈ S + ;
w̄′T x̄′ − φ′ = 2φ − w̄T x̄ − φ = −(w̄T x̄ − φ) ≥ 0, if x̄ ∈ S − ,
and for exactly d samples we have
w̄′T x̄′ − φ′ = w̄T x̄ − φ = 0, if x̄ ∈ S + ;
w̄′T x̄′ − φ′ = 2φ − w̄T x̄ − φ = −(w̄T x̄ − φ) = 0, if x̄ ∈ S − .
Let v̄ be any vector orthogonal to the d samples on the boundary of the half-space. Defining
w̄′′ ≜ w̄′ + αv̄,  φ′′ ≜ φ′ ,
with an appropriate choice of α we have, for every x̄′ ∈ Sn ,
w̄′′T x̄′ − φ′′ = w̄′T x̄′ − φ′ + α v̄T x̄′ ≥ 0,
and for exactly d + 1 samples (including the original d samples),
w̄′′T x̄′ − φ′′ = 0.
We observe that f˜w̄′′ ,φ′′ is a facet of the convex hull of the samples in Rd+1 . Up
to d + 1 different half-spaces in Rd can be transformed into a single half-space in
Rd+1 (the number of combinations of choosing d samples out of d + 1 samples on
the boundary). Using Theorem 2.5, we bound the number F (d) of facets of the
convex hull of the points in Rd+1 as follows:
F (d) ≤ 2 · (en/⌊(d+1)/2⌋)^⌊(d+1)/2⌋ ≤ 2 · (2en/d)^⌊(d+1)/2⌋ .
Since up to d + 1 half-spaces in Rd can be mapped onto a single facet of the convex
hull in Rd+1 , we can bound the number of half-spaces in Rd by
(d + 1) · F (d) ≤ 2(d + 1) · (2en/d)^⌊(d+1)/2⌋ .
Lemma 2.7 (characterizing set complexity) Fix d ≥ 2 and n > d. Let F be the
class of all linear binary classifiers in Rd . Then, the order-n characterizing set
complexity of F satisfies
γ(F, n) ≤ 83 · (d + 1)³ · (2en/d)^⌊(d+1)/2⌋ · log n.
Proof Let G = Fk∩ be the class of k-fold intersections of half-spaces in Rd . Since
the VC dimension of the class of all half-spaces in Rd is d + 1, we obtain, using
Lemma 2.1, that the VC dimension of G satisfies
V C(G) ≤ 2k(d + 1) log(3k).
Let Hn be the order-n characterizing set of F. From Lemma 2.6 we know that any
hypothesis f ∈ Hn is a union of two regions, where each region is an intersection
of no more than
k = 2(d + 1) · (2en/d)^⌊(d+1)/2⌋
half-spaces in Rd . Therefore, Hn ⊂ G2∪ . Using Lemma 2.1, we get
V C(Hn ) ≤ V C(G2∪ ) ≤ 4 log(6) · V C(G) ≤ 8k(d + 1) log(6) log(3k)
≤ 16(d + 1)² · (2en/d)^⌊(d+1)/2⌋ · log(6) · log( 6(d + 1) · (2en/d)^⌊(d+1)/2⌋ ).
For n > d ≥ 2 we get
log( 6(d + 1) · (2en/d)^⌊(d+1)/2⌋ ) ≤ log(6n) + ((d + 1)/2) · log(2en/d)
≤ 3 · log n + ((d + 1)/2) · log(n²) ≤ (d + 4) · log n ≤ 2 · (d + 1) · log n.
Therefore,
V C(Hn ) ≤ 83 · (d + 1)³ · (2en/d)^⌊(d+1)/2⌋ · log n.
Chapter 3
Realizable Selective Classification
“
A man doesn’t know what he knows until he knows what he doesn’t
know.
”
Laurence J. Peter, Educator and Writer, 1919-1990
In this chapter we present and analyze the first application of self-aware learning
termed realizable selective classification. In the setting considered in this chapter,
the target concept belongs to the hypothesis class F. Thus, in this case the best hypothesis in the class never errs, and any pointwise competitive selective classifier
achieves perfect classification on the accepted domain. We show that perfect selective classification with guaranteed coverage is achievable (from a learning-theoretic
perspective) for finite hypothesis spaces by a learning strategy termed consistent selective strategy (CSS). Moreover, CSS is shown to be optimal in its coverage rate,
which is fully characterized by providing lower and upper bounds that match in
their asymptotic behavior in the sample size m. We show that in the general case,
for infinite hypothesis classes, perfect selective classification with guaranteed (nonzero) coverage is not achievable even when F has a finite VC-dimension. We then
derive a meaningful coverage guarantee using posterior information on the source
distribution (data-dependent bound). We then focus on the case of linear classifiers
and show that if the unknown underlying distribution is a finite mixture of Gaussians, CSS will ensure perfect learning with guaranteed coverage. This powerful
result indicates that consistent selective classification might be relevant in various
applications of interest.
3.1 Consistent Selective Strategy (CSS)
Before presenting our consistent selective strategy (CSS) we recall that the version
space, V SF ,Sm , is the set of all hypotheses in F that classify Sm correctly. Furthermore, the maximal agreement set of hypothesis class G ⊆ F is the set of all
points for which all hypotheses in G agree (see Definition 2.2).
Definition 3.1 (consistent selective strategy (CSS)) Given Sm , a consistent selective strategy (CSS) is a selective classification strategy that takes f to be any hypothesis in V SF ,Sm (i.e., a consistent learner), and takes a (deterministic) selection function g that equals one for all points in the maximal agreement set with
respect to V SF ,Sm , and zero otherwise.
In the present setting the (unknown) labeling hypothesis f ∗ is in V SF ,Sm . Thus,
CSS simply rejects all points that might incur an error with respect to f ∗ . An
immediate consequence is that any CSS selective hypothesis (f, g) always satisfies
R(f, g) = 0. The main concern, however, is whether its coverage Φ(f, g) can be
bounded from below and whether any other strategy that achieves perfect learning
with certainty can achieve better coverage. The following theorem proves that CSS
has the largest possible coverage among all strategies.
Theorem 3.1 (CSS coverage optimality) Given Sm , let (f, g) be a selective classifier chosen by any strategy that ensures zero risk with certainty for any unknown
distribution P and any target concept f ∗ ∈ F. Let (fc , gc ) be a selective classifier
selected by CSS using Sm . Then,
Φ(f, g) ≤ Φ(fc , gc ).
Proof For the sake of simplicity we limit the discussion to deterministic strategies.
The extension to stochastic strategies is straightforward. Given a hypothetical sample S̃m of size m, let (f̃c , g̃c ) be the selective classifier chosen by CSS and let (f̃ , g̃) be the selective classifier chosen by any competing strategy. Assume that there exists x0 ∈ X (x0 ∉ S̃m ) such that g̃(x0 ) = 1 and g̃c (x0 ) = 0. According
to the CSS construction of g̃c , since g̃c (x0 ) = 0, there are at least two hypotheses
f1 , f2 ∈ V SF ,S̃m such that f1 (x0 ) ≠ f2 (x0 ). Assume, without loss of generality, that f1 (x0 ) = f̃ (x0 ). We will now construct a new “imaginary” classification
problem and show that, under the above assumption, the competing strategy fails
to guarantee zero risk with certainty. Let the imaginary target concept f ′∗ be f2
and the imaginary underlying distribution P ′ be
P ′ (x) = (1 − ǫ)/m if x ∈ S̃m ;  ǫ if x = x0 ;  0 otherwise.
Imagine a random sample S′m drawn i.i.d. from P ′ . There is a positive (perhaps small) probability that S′m will equal S̃m , in which case (f ′ , g ′ ) = (f̃ , g̃). Since g ′ (x0 ) = g̃(x0 ) = 1 and f ′∗ (x0 ) ≠ f ′ (x0 ), with positive probability R(f ′ , g ′ ) =
ǫ > 0. This contradicts the assumption that the competing strategy achieves perfect
classification (zero risk) with certainty. It follows that for any sample S̃m and
for any x ∈ X , if g̃(x) = 1, then g̃c (x) = 1. Consequently, for any unknown
distribution P , Φ(f̃ , g̃) ≤ Φ(f̃c , g̃c ).
3.2 Risk and coverage bounds
While it is clear that CSS achieves perfect classification, it is not clear at all if this
performance can be achieved with a meaningful coverage. In this section we will
derive different coverage bounds, starting with the simple case of finite hypothesis classes and proceeding to the more complex case of infinite hypothesis classes. A formal definition
of the meaning of upper and lower coverage bounds is given in Appendix C.
3.2.1 Finite hypothesis classes
The next result establishes the existence of perfect classification with guaranteed
coverage in the finite case.
Theorem 3.2 (guaranteed coverage) Assume a finite F and let (f, g) be a selective classifier selected by CSS. Then, R(f, g) = 0 and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,
Φ(f, g) ≥ 1 − (1/m) ( (ln 2) min {|F|, |X |} + ln(1/δ) ).    (3.1)
Proof For any ǫ, let G1 , G2 , . . . , Gk , be all the hypothesis subsets of F with corresponding maximal agreement sets, λ1 , λ2 , . . . , λk , such that each λi has volume
of at most 1 − ǫ with respect to P . For any 1 ≤ i ≤ k, the probability that a single
point will be randomly drawn from λi is thus at most 1 − ǫ. The probability that
all training points will be drawn from λi is therefore at most (1 − ǫ)m . If a training
point x is in X \ λi , then there are at least two hypotheses f1 , f2 ∈ Gi that do not
agree on x. Hence,
PrP (Gi ⊆ V SF ,Sm ) ≤ (1 − ǫ)^m .
We note that
k ≤ 2^min{|F|,|X |} ,
and by the union bound,
PrP (∃Gi : Gi ⊆ V SF ,Sm ) ≤ k · (1 − ǫ)^m ≤ 2^min{|F|,|X |} · (1 − ǫ)^m .
Therefore, with probability of at least 1 − 2^min{|F|,|X |} · (1 − ǫ)^m , the version space V SF ,Sm differs from any subset Gi , and hence it has a maximal agreement set with volume of at least 1 − ǫ. Using the inequality 1 − ǫ ≤ exp(−ǫ), we have
2^min{|F|,|X |} · (1 − ǫ)^m ≤ 2^min{|F|,|X |} · exp(−mǫ).
Equating the right-hand side to δ and solving for ǫ completes the proof.
A leading term in the coverage guarantee (3.1) is |F|. In analogous results for standard consistent learning [53] the corresponding term is log |F|. This may raise a concern about the tightness of (3.1). However, as shown in Corollary 3.5, this bound is tight (up to multiplicative constants). To prove the corollary we will
require the following two definitions.
Definition 3.2 (binomial tail distribution) Let Z1 , Z2 , . . . Zm be m independent
Bernoulli random variables each with a success probability p. Then for any 0 ≤
k ≤ m we define
Bin(m, k, p) ≜ Pr( ∑_{i=1}^{m} Zi ≤ k ).
Definition 3.3 (binomial tail inversion [64]) For any 0 ≤ δ ≤ 1 we define
Bin(m, k, δ) ≜ max_p {p : Bin(m, k, p) ≥ δ}.
Theorem 3.3 (non-achievable coverage, implicit bound) Let 0 ≤ δ ≤ 1/2, m,
and n > 1 be given. There exists a distribution P , that depends on m and n,
and a finite hypothesis class F of size n, such that for any selective classifier (f, g),
chosen from F by CSS (so R(f, g) = 0) using a training sample Sm drawn i.i.d.
according to P , with probability of at least δ,
Φ(f, g) ≤ 1 − (1/2) · Bin(m, |F|/2, 2δ).
Proof Let X ≜ {e1 , e2 , . . . , en+1 } be the standard (vector) basis of Rn+1 , X ′ ≜ X \ {en+1 }, and P be the source distribution over X satisfying
P (ei ) ≜ Bin(m, n/2, 2δ)/n if i ≤ n, and P (en+1 ) ≜ 1 − Bin(m, n/2, 2δ),
where Bin(m, k, δ) is the binomial tail inversion (Definition 3.3). Since
Bin(m, n/2, 2δ) ≜ max_p {p : Bin(m, n/2, p) ≥ 2δ},
and Sm is drawn i.i.d. according to P , we get that with probability of at least 2δ,
|{x ∈ Sm : x ∈ X ′ }| ≤ n/2.
Let F be the class of singletons such that
fi (ej ) ≜ 1 if i = j, and −1 otherwise.
Taking f ∗ ≜ fi∗ , for some 1 ≤ i∗ ≤ n, we have,
Pr( ei∗ ∉ Sm , |{x ∈ Sm : x ∈ X ′ }| ≤ n/2 )
= Pr( ei∗ ∉ Sm | |{x ∈ Sm : x ∈ X ′ }| ≤ n/2 ) · Pr( |{x ∈ Sm : x ∈ X ′ }| ≤ n/2 )
≥ (1 − 1/n)^(n/2) · 2δ ≥ δ.
If ei∗ ∉ Sm then all samples in Sm are negative, so each sample in X ′ can reduce
the version space V SF ,Sm by at most one hypothesis. Hence, with probability of
at least δ,
|V SF ,Sm | ≥ |F| − n/2 = n/2.
Since the coverage Φ(f, g) is the volume of the maximal agreement set with respect to the version space V SF ,Sm , it follows that
Φ(f, g) = 1 − |V SF ,Sm | · Bin(m, n/2, 2δ)/n ≤ 1 − (1/2) · Bin(m, |F|/2, 2δ).
Remark 3.4 The result of Theorem 3.3 is based on the use of the class of singletons. Augmenting this class by the empty set and choosing a uniform distribution
over X results in a tighter bound. However, the bound will be significantly less
general as it will hold only for a single hypothesis in F and not for any hypothesis
in F.
Corollary 3.5 (non-achievable coverage, explicit bound) Let 0 ≤ δ ≤ 1/4, m,
and n > 1 be given. There exist a distribution P , that depends on m and n,
and a finite hypothesis class F of size n, such that for any selective classifier (f, g),
chosen from F by CSS (so R(f, g) = 0) using a training sample Sm drawn i.i.d.
according to P , with probability of at least δ,
Φ(f, g) ≤ max{ 0, 1 − (1/(8m)) ( |F| − (16/3) ln(1/(1 − 2δ)) ) }.
Proof Applying Lemma B.2 we get
Bin(m, |F|/2, 2δ) ≥ min{ 1, |F|/(4m) − (4/(3m)) ln(1/(1 − 2δ)) }.
4m 3m 1 − 2δ
Applying Theorem 3.3 completes the proof.
3.2.2 Infinite hypothesis spaces
In this section we consider an infinite hypothesis space F. We show that in the
general case, perfect selective classification with guaranteed (non-zero) coverage
is not achievable even when F has a finite VC-dimension. We then derive a meaningful coverage guarantee using posterior information on the source distribution
(data-dependent bound).
We start this section with a negative result that precludes non-trivial perfect
learning when F is the set of linear classifiers. The result is obtained by constructing a particularly bad distribution.
Theorem 3.6 (non-achievable coverage) Let m and d > 2 be given. There exist
a distribution P , an infinite hypothesis class F with a finite VC-dimension d, and
a target hypothesis in F, such that Φ(f, g) = 0 for any selective classifier (f, g),
chosen from F by CSS using a training sample Sm drawn i.i.d. according to P .
Proof Let F be the class of all linear classifiers in R2 and let P be a uniform
distribution over the arcs,
(x − 2)² + y² = 2, x < 1,  and  (x + 2)² + y² = 2, x > −1.
Figure 3.1 depicts this construction. The training set Sm consists of points on these
arcs, labeled by any linear classifier that passes between the arcs.
Figure 3.1: A worst-case distribution for linear classifiers: points are drawn
uniformly at random on the two arcs and labeled by a linear classifier that passes
between these arcs. The probability volume of the maximal agreement set is zero.
The maximal agreement set, A, with respect to the version space V SF ,Sm is partitioned into two
subsets A+ and A− according to the labels obtained by hypotheses in the version
space. Clearly, A+ is confined by a polygon whose vertices lie on the right-hand
side arc. Since P is concentrated on the arc, the probability volume of A+ is exactly
zero for any finite m. The same analysis holds for A− , and therefore the coverage
is forced to be zero. The VC-dimension of the class of all linear classifiers in R2
is 3. Embedding the distribution P in a higher dimensional space Rd and using the
class of all linear classifiers in Rd completes the proof.
A direct corollary of Theorem 3.6 is that, in the general case, perfect selective classification with distribution-free guaranteed coverage is not achievable for
infinite hypothesis spaces. However, this is certainly not the end of the story for
perfect learning. In the remainder of this chapter we derive meaningful coverage
guarantees using posterior or prior information on the source distribution (data- and distribution-dependent bounds).
In order to guarantee meaningful coverage we first need to study the complexity
of the selection function g(x) chosen by CSS. The complexity of the classification
function f (x) is determined only by the hypothesis class F and it is independent
of the sample size itself. However, the complexity of g(x) (the maximal agreement
set) chosen by CSS generally depends on the sample size. Therefore, increasing
the training sample size does not necessarily guarantee non-trivial coverage. Our
main task is to find the complexity class of the family of maximal agreement sets
from which g(x) is chosen. Let us define the family of all maximal agreement
S
sets as H = Hn such that H1 ⊂ H2 ⊂ H3 ⊂ . . .. We can now exploit the
fact that CSS chooses a maximal agreement set that belongs to a specific subclass
Hn with a complexity measured in terms of the VC dimension of Hn . We term
this approach Structural Coverage Maximization (SCM) following the analogous
and familiar Structural Risk Minimization (SRM) approach [81]. A useful way to
parameterize H is to use the size of the version space compression set.
Definition 3.4 (version space compression set) Let Sm be a labeled sample of m
points and let V SF ,Sm be the induced version space. The version space compression set, Sn̂ ⊆ Sm is a smallest subset of Sm satisfying V SF ,Sm = V SF ,Sn̂ . Note
that for any given F and Sm , the size of the version space compression set, denoted
n̂ = n̂(F, Sm ), is unique.
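As a toy illustration of this definition (our own, not from the thesis), for threshold classifiers on the real line (positive iff x ≥ θ) the version space is pinned down by the rightmost negative and the leftmost positive examples, so these one or two points already form a version space compression set:

```python
def threshold_compression_set(sample):
    """sample: list of (x, label), labels in {+1, -1}, realizable by a threshold
    'positive iff x >= theta'. Returns a subset inducing the same version space."""
    negatives = [(x, y) for x, y in sample if y == -1]
    positives = [(x, y) for x, y in sample if y == +1]
    compression = []
    if negatives:
        compression.append(max(negatives))   # rightmost negative example
    if positives:
        compression.append(min(positives))   # leftmost positive example
    return compression                        # size n_hat <= 2

print(threshold_compression_set([(-3, -1), (-1, -1), (2, 1), (5, 1)]))
# [(-1, -1), (2, 1)] -- these two points already determine the whole version space
```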
Remark 3.7 Our “version space compression set” is precisely Hanneke's “minimum specifying set” [47] for f on U with respect to V , where f = h∗ , U = Sm , and V = H[Sm ] (see Definition 4.6).
Lemma 3.8 The characterizing hypothesis fV SF ,Sm (x) belongs to the order-n̂
characterizing set of F, where n̂ = n̂(F, Sm ) is the size of the version space
compression set.
Proof According to Definition 3.4, there exists a subset Sn̂ ⊂ Sm of size n̂ such
that V SF ,Sm = V SF ,Sn̂ . The rest of the proof follows immediately from Definition 2.4.
Before stating the main result of this chapter, we state a classical result that
will be used later.
Theorem 3.9 ([83]; [3, p.53]) Let F be a hypothesis space with VC-dimension d.
For any probability distribution P on X × {±1}, with probability of at least 1 − δ
over the choice of Sm from P m , any hypothesis f ∈ F consistent with Sm satisfies
R(f ) ≤ ǫ(d, m, δ) = (2/m) ( d ln(2em/d) + ln(2/δ) ),    (3.2)
where R(f ) ≜ E[I(f (x) ≠ f ∗ (x))] is the risk of f .
We note that inequality (3.2) actually holds only for d ≤ m. For any d > m it
is clear that no meaningful upper bound on the risk can be achieved. It is easy to
fix the inequality for the general case by replacing ln(2em/d) with ln⁺ (2em/d), where ln⁺ (x) ≜ max(ln(x), 1).
Theorem 3.10 (data-dependent coverage guarantee) For any m, let a1 , . . . , am ∈ R be given, such that ai ≥ 0 and ∑_{i=1}^{m} ai ≤ 1. Let (f, g) be a selective CSS classifier. Then, R(f, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,
Φ(f, g) ≥ 1 − (2/m) ( γ(F, n̂) ln⁺ (2em/γ(F, n̂)) + ln(2/(an̂ δ)) ),
where n̂ is the size of the version space compression set, and γ (F, n̂) is the order-n̂
characterizing set complexity of F.
Proof Given our sample Sm = {(xi , f ∗ (xi ))}, i = 1, . . . , m (labeled by the unknown target function f ∗ ), we define the “synthetic” sample S′m = {(xi , 1)}, i = 1, . . . , m. S′m can be assumed to have been sampled i.i.d. from the marginal distribution of X with positive labels (P ′ ).
Theorem 3.9 can now be applied on the synthetic problem with the training sample S′m , the distribution P ′ , and the hypothesis space taken to be Hi , the order-i characterizing set of F. It follows that for all h ∈ V SHi ,S′m , with probability of at least 1 − ai δ over choices of S′m from (P ′ )m ,
PrP′ (h(x) ≠ 1) ≤ (2/m) ( di ln(2em/di ) + ln(2/(ai δ)) ),    (3.3)
where di is the VC-dimension of Hi . Then, applying the union bound yields, with probability of at least 1 − δ, that inequality (3.3) holds simultaneously for all 1 ≤ i ≤ m.
All hypotheses in the version space V SF ,Sm agree on all samples in Sm . Hence, the characterizing hypothesis fV SF ,Sm (x) = 1 for any point x ∈ Sm . Let n̂ be the size of the version space compression set. According to Lemma 3.8, fV SF ,Sm ∈ Hn̂ . Noting that fV SF ,Sm (x) = 1 for any x ∈ S′m , we learn that fV SF ,Sm ∈ V SHn̂ ,S′m . Therefore, with probability of at least 1 − δ over choices of Sm ,
PrP (fV SF ,Sm (x) ≠ 1) ≤ (2/m) ( dn̂ ln(2em/dn̂ ) + ln(2/(an̂ δ)) ).
Since Φ(f, g) = PrP (fV SF ,Sm (x) = 1), and dn̂ is the order-n̂ characterizing set complexity of F, the proof is complete.
The data-dependent bound in Theorem 3.10 is stated in terms of the characterizing set complexity of the hypothesis class F (see Definition 2.5). Relying on our
results from Chapter 2 we derive an explicit data-dependent coverage bound for
the class of binary linear classifiers.
Corollary 3.11 (data-dependent coverage guarantee) Let F be the class of linear binary classifiers in Rd and assume that the conditions of Theorem 3.10 hold.
Then, R(f, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,
Φ(f, g) ≥ 1 − (2/m) ( 83(d + 1)³ Λn̂,d ln⁺ (2em/Λn̂,d ) + ln(2/(an̂ δ)) ),
where n̂ is the size of the empirical version space compression set, and
Λn̂,d = (2en̂/d)^⌊(d+1)/2⌋ · log n̂.
Proof Define
Ψ(γ(F, n)) ≜ 1 − (2/m) ( γ(F, n) ln⁺ (2em/γ(F, n)) + ln(2/(an δ)) ).
We note that Ψ(γ(F, n)) is a continuous function. For any γ(F, n) < 2m,
∂Ψ(γ(F, n))/∂γ(F, n) = −(2/m) ln(2em/γ(F, n)) + 2/m < 0,
and for any γ(F, n) > 2m,
∂Ψ(γ(F, n))/∂γ(F, n) = −2/m < 0.
Thus, Ψ(γ(F, n)) is monotonically decreasing. Noting that ln⁺ (x) is monotonically increasing, by applying Theorem 3.10 together with Lemma 2.7 the proof is
complete.
3.3 Distribution-dependent coverage bound
As long as the empirical version space compression set size n̂ is sufficiently small
compared to m, Corollary 3.11 provides a meaningful coverage guarantee. Since
n̂ might depend on m, it is hard to analyze the effective rate of the bound. To
further explore this guarantee, we now bound n̂ in terms of m for a specific family
of source distributions and derive a distribution-dependent coverage guarantee.
Theorem 3.12 ([15]) If m points in d dimensions have their components chosen
independently from any set of continuous distributions (possibly different for each
component), then the expected number of convex hull vertices v is
E[v] = O( (log m)^(d−1) ).
Definition 3.5 (sliced multivariate Gaussian distribution) A sliced multivariate
Gaussian distribution, N (Σ, µ, w, φ), is a multivariate Gaussian distribution restricted by a half space in Rd . Thus, if Σ is a non-singular covariance matrix, the
pdf of the sliced Gaussian is
(1/C) · exp( −(1/2)(x − µ)T Σ−1 (x − µ) ) · I(wT x − φ ≥ 0),
where µ = (µ1 , . . . , µd )T , I is the indicator function and C is an appropriate
normalization factor.
Lemma 3.13 Let P be a sliced multivariate Gaussian distribution. If m points are
chosen independently from P , then the expected number of convex hull vertices is
O( (log m)^(d−1) ).
Proof Let X ∼ N (Σ, µ, w, φ) and Y ∼ N (Σ, µ). There is a random vector Z,
whose components are independent standard normal random variables, a vector µ,
and a matrix A such that Y = AZ + µ. Since
wT y − φ = wT (Az + µ) − φ = wT Az + wT µ − φ,
we get that X = AZ0 + µ, where Z0 ∼ N (I, 0, wT A, φ − wT µ). Due to the
spherical symmetry of Z, we can choose the half-space (wT A, φ − wT µ) to be
axis-aligned by rotating the axes. We note that the d components of Z are chosen
independently and that the axis-aligned half-space enforces restriction only on one
of the axes. Therefore, the components of Z0 are chosen independently as well.
Applying Theorem 3.12, we get that if m points are chosen independently from
Z0 , then the expected number of convex hull vertices is O( (log m)^(d−1) ). The
proof is complete by noting that the number of convex hull vertices is preserved
under affine transformations.
Lemma 3.14 (version space compression set size) Let F be the class of all linear binary classifiers in Rd . Assume that the underlying distribution P is a mixture
of a fixed number of Gaussians. Then, for any 0 ≤ δ ≤ 1, with probability of at
least 1 − δ, the empirical version space compression set size is
n̂ = O( (log m)^(d−1) / δ ).
Proof Let Sn be a version space compression set. Consider x̄0 ∈ Sn . Since Sn is
a compression set there is a half-space, (w̄, φ), such that fw̄,φ ∈ V SF ,Sn \{x̄0 } and
fw̄,φ ∉ V SF ,Sn . W.l.o.g. assume that x̄0 ∈ Sn is positive; thus w̄T x̄0 − φ < 0,
and for any other positive point x̄ ∈ Sn , w̄T x̄ − φ ≥ 0. For an appropriate φ′ < φ,
there exists a half-space (w̄, φ′ ) such that w̄T x̄0 −φ′ = 0, and for any other positive
point x̄ ∈ Sn , w̄T x̄ − φ′ > 0. Therefore, x̄0 is a convex hull vertex. It follows that
we can bound the number of positive samples in Sn by the number of vertices of
the convex hull of all the positive points. Defining v as the number of convex hull
vertices and using Markov’s inequality, we get that for any ǫ > 0,
Pr(v ≥ ǫ) ≤ E[v]/ǫ.
Since f ∗ is a linear classifier, the underlying distribution of the positive points is a
mixture of sliced multivariate Gaussians. Using Lemmas 3.13 and B.5, we get that
with probability of at least 1 − δ,
v ≤ E[v]/δ = O( (log m)^(d−1) / δ ).
Repeating the same arguments for the negative points completes the proof.
Corollary 3.15 (distribution-dependent coverage guarantee) Let F be the class
of all linear binary classifiers in Rd , and let P be a mixture of a fixed number of
Gaussians. Then, R(f, g) = 0, and for any 0 ≤ δ ≤ 1, with probability of at least
1 − δ,
Φ(f, g) ≥ 1 − O( ((log m)^(d²) / m) · (1/δ^((d+3)/2)) ).
Proof
Λn̂,d = (2en̂/d)^⌊(d+1)/2⌋ · log n̂ ≤ (2e/d)^⌊(d+1)/2⌋ · n̂^((d+3)/2) .
Applying Lemma 3.14,
Λn̂,d = O( (log m)^(d²) / δ^((d+3)/2) ).
The proof is complete by noting that Λn̂,d ≥ 1 and using Corollary 3.11 with
ai = 2^(−i) .
3.4 Implementation
In previous sections we analyzed the performance of CSS and proved that (in the
realizable case) it can achieve sharp coverage rates under reasonable assumptions
on the source distribution while guaranteeing zero error on the accepted samples.
However, it remains unclear whether an efficient implementation of CSS is within reach.
In this section we propose an algorithm for CSS and show that it can be efficiently
implemented for linear classifiers.
The following method, which we term lazy CSS, is very similar to the implicit
selective sampling algorithm of Cohn et al. [24]. Instead of explicitly constructing
the CSS selection function g during training (which is indeed a daunting task), we
develop a “lazy learning” approach that can potentially facilitate an efficient CSS
implementation during test time. In particular, we propose to evaluate g(x) at any
given test point x during the classification process. For the training set Sm and a
test point x we define the following two sets:
S⁻m,x ≜ Sm ∪ {(x, −1)};   S⁺m,x ≜ Sm ∪ {(x, +1)},
that is, S⁺m,x is the (labeled) training set Sm augmented by the test point x labeled positively, and S⁻m,x is Sm augmented by x labeled negatively. The selection value g(x) is determined as follows: g(x) = 0 (i.e., x is rejected) iff there exist hypotheses f ⁺ , f ⁻ ∈ F that are consistent with S⁺m,x and S⁻m,x , respectively.
The following lemma states that the selection function g(x) constructed by lazy
CSS is a precise implementation of CSS.
Lemma 3.16 Let F be any hypothesis class, Sm a labeled training set, and x a test point. Then x belongs to the maximal agreement set of V SF ,Sm iff there do not exist hypotheses f ⁺ , f ⁻ ∈ F that are consistent with S⁺m,x and S⁻m,x , respectively.
Proof If there exist hypotheses f ⁺ , f ⁻ ∈ F that are consistent with S⁺m,x and S⁻m,x , then there exist two hypotheses in F that correctly classify Sm (therefore they belong to V SF ,Sm ) but disagree on x. Hence, x does not belong to the maximal agreement set of V SF ,Sm . Conversely, if x does not belong to the maximal agreement set of V SF ,Sm , then there are two hypotheses, f1 and f2 , which correctly classify Sm but disagree on x. Let's assume, without loss of generality, that f1 classifies x positively. Then, f1 is consistent with S⁺m,x and f2 is consistent with S⁻m,x . Thus there exist hypotheses f ⁺ , f ⁻ ∈ F that are consistent with S⁺m,x and S⁻m,x .
For the case of linear classifiers it follows that computing the lazy CSS selection function for any test point is reduced to two applications of a linear separability
test. Yogananda et al. [91] recently presented a fast linear separability test with a
worst case time complexity of O(mr³) and space complexity of O(md), where m
is the number of points, d is the dimension and r ≤ min(m, d + 1).
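The following sketch (our own illustration, assuming scipy is available) implements the lazy test with a generic LP feasibility check in place of the specialized algorithm: for strictly separable data, separability is equivalent to feasibility of y_i(w · x_i − b) ≥ 1, and a test point is rejected exactly when both augmented sets remain separable.

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility of y_i (w . x_i - b) >= 1 for all i (strict separation,
    rescaled to margin 1). Variables are (w, b); the objective is constant."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    m, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, -np.ones((m, 1))])   # -y_i (x_i, -1).(w, b) <= -1
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0                                   # 0 = feasible solution found

def lazy_css_select(X, y, x):
    """g(x) for linear classifiers: accept (return True) iff only one labeling of x
    keeps the augmented training set linearly separable."""
    pos_ok = linearly_separable(np.vstack([X, x]), np.append(y, +1))
    neg_ok = linearly_separable(np.vstack([X, x]), np.append(y, -1))
    return not (pos_ok and neg_ok)
```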
Remark 3.17 For the realizable case we can modify any rejection mechanism by
restricting rejection only to the region chosen for rejection by CSS. Since CSS
accepts only samples that are guaranteed to have zero test error, the overall performance of the modified rejection mechanism is guaranteed to be at least as good
as the original mechanism. Using this technique we were able to improve the performance (RC curve) of the most commonly used rejection mechanism for linear
classifiers, which rejects samples according to a simple symmetric distance from
the decision boundary (a “margin”).
Chapter 4
From Selective Classification to
Active Learning
“
The uncreative mind can spot wrong answers, but it takes a very
creative mind to spot wrong questions.
”
Antony Jay, Writer, 1930 - present
Active learning is an intriguing learning model that provides the learning algorithm with some control over the learning process, potentially leading to significantly faster learning. In recent years it has been gaining considerable recognition
as a vital technique for efficiently implementing inductive learning in many industrial applications where an abundance of unlabeled data exists, and/or in cases where
labeling costs are high.
In stream-based active learning, which is also referred to as online selective
sampling [5, 24], the learner is given an error objective ǫ and then sequentially
receives a stream of unlabeled examples. At each step, after observing an unlabeled
example x, the learner decides whether or not to request the label of x. The learner
should terminate the learning process and output a binary classifier whose true
error is guaranteed to be at most ǫ with high probability. The penalty incurred by
the learner is the number of label requests made and this number is called the label
complexity.
In this chapter we present an equivalence between active learning and perfect
selective classification with respect to “fast rates.” Then, by applying our results
from Chapter 3, for realizable selective classification, we show that general (nonhomogeneous) linear classifiers are actively learnable at exponential (in 1/ǫ) label
complexity rate when the data distribution is an arbitrary unknown finite mixture
of high dimensional Gaussians. While we obtain exponential label complexity
speedup in 1/ǫ, we incur an exponential slowdown in $d^2$, where $d$ is the problem dimension. By proving a lower bound on label complexity we show that an exponential slowdown in $d$ is unavoidable in such settings. Finally we relate our proposed
technique to other complexity measures for active learning, including teaching dimension [43] and Hanneke’s disagreement coefficient [48]. Specifically we are
able to upper bound the disagreement coefficient using characterizing set complexity. This relation opens the possibilities to utilize (our) established characterizing
set complexity results in many active learning problems where the disagreement
coefficient is used to characterize sample complexity speedups. It is also very useful in the analysis of agnostic selective classification, which is the subject of the
following chapter.
4.1 Definitions
We consider the following standard active learning model in a realizable (noise
free) setting. In this model the learner sequentially observes unlabeled instances,
x1 , x2 , . . ., that are sampled i.i.d. from P (X). After receiving each xi , the learning
algorithm decides whether or not to request its label f ∗ (xi ), where f ∗ ∈ F is an
unknown target hypothesis. Before the start of the game the algorithm is provided
with some desired error rate ǫ, and confidence level δ.
Definition 4.1 (label complexity) We say that the learning algorithm actively learned
the problem instance (F, P ) if at some round it can terminate the learning process,
after observing m instances and requesting k labels, and output an hypothesis
f ∈ F whose error R(f ) ≤ ǫ, with probability of at least 1 − δ. The quality of
the algorithm is quantified by the number k of requested labels, which is called the
label complexity.
A positive result for a learning problem (F, P ) is a learning algorithm that can
actively learn this problem for any given ǫ and δ, and for every f ∗ , with label
complexity bounded above by L(ǫ, δ, f ∗ ).
Definition 4.2 (actively learnable with exponential rate) If there is a label complexity bound that is O(polylog(1/ǫ)) we say that the problem is actively learnable
at exponential rate.
We conclude with the definition of the disagreement set.
Definition 4.3 (disagreement set [46, 31]) Let F be an hypothesis class and G ⊆
F. The disagreement set w.r.t. G is defined as
\[ DIS(G) \triangleq \{ x \in \mathcal{X} : \exists f_1, f_2 \in G \ \text{s.t.} \ f_1(x) \ne f_2(x) \}. \]
The agreement set w.r.t. $G$ is $AGR(G) \triangleq \mathcal{X} \setminus DIS(G)$.
4.2 Background
A classical result in standard (passive) supervised learning (in the realizable case)
is that any consistent learning algorithm has sample complexity of
\[ O\left( \frac{1}{\epsilon}\left( d\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right) \right), \]
where d is the VC-dimension of F (see, e.g., [3]). It appears that the best sample complexity one can hope for in this setting is O(d/ǫ). The main promise in
active learning has been to gain an exponential speedup in the sample complexity.
Specifically, a label complexity bound of
\[ O\left( d\log\frac{d}{\epsilon} \right) \]
for actively learning an $\epsilon$-good classifier from a concept class with VC-dimension $d$, provides an exponential speedup in terms of $1/\epsilon$.
The main strategy for active learning in the realizable setting is to request labels
only for instances belonging to the disagreement set with respect to the version
space, and output any (consistent) hypothesis belonging to the version space. This
strategy is often called the CAL algorithm after the names of its inventors: Cohn,
Atlas, and Ladner [24].
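For illustration only, here is a minimal Python sketch (ours) of the stream-based CAL strategy. The oracles consistent and query_label are assumed to be supplied by the caller; consistent(S, x, y) should return True iff some hypothesis in F is consistent with the labeled set S together with (x, y), which is the same test used by lazy CSS in Section 3.4.

    def cal(stream, consistent, query_label, budget):
        """Request labels only for points in the current disagreement region."""
        S = []  # labeled examples collected so far
        for x in stream:
            if budget == 0:
                break
            # x lies in DIS(VS) iff the version space can still predict either label for x.
            if consistent(S, x, +1) and consistent(S, x, -1):
                S.append((x, query_label(x)))  # label request
                budget -= 1
            # otherwise all consistent hypotheses agree on x; no label is needed
        return S  # output any hypothesis in F consistent with S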
4.3 From coverage bound to label complexity bound
In this section we present a reduction from stream-based active learning to perfect selective classification. Particularly, we show that if there exists for F a perfect selective classifier with a fast rejection rate of O(polylog(m)/m), then the
CAL algorithm will actively learn F with exponential label complexity rate of
O(polylog(1/ǫ)) and vice versa.
Lemma 4.1 Let Sm = {(x1 , y1 ), . . . , (xm , ym )} be a sequence of m labeled samples drawn i.i.d. from an unknown distribution P (X), and let Si = {(x1 , y1 ), . . . , (xi , yi )}
be the i-prefix of Sm . Then, with probability of at least 1 − δ over random choices
of Sm , the following bound holds simultaneously for all i = 1, . . . , m − 1,
\[ \Pr\left\{ x_{i+1} \in DIS(VS_{\mathcal{F},S_i}) \mid S_i \right\} \le 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{\log_2(m)}, 2^{\lfloor\log_2(i)\rfloor} \right), \]
where BΦ (F, δ, m) is a coverage bound for perfect selective classification with
respect to hypothesis class F, confidence δ, and sample size m.
Proof For $j = 1, \ldots, m$, abbreviate $DIS_j \triangleq DIS(VS_{\mathcal{F},S_j})$ and $AGR_j \triangleq AGR(VS_{\mathcal{F},S_j})$. By definition, $DIS_j = \mathcal{X} \setminus AGR_j$. By the definitions of a coverage bound and agreement/disagreement sets, with probability of at least $1 - \delta$ over random choices of $S_j$,
\[ B_\Phi(\mathcal{F}, \delta, j) \le \Pr\{x \in AGR_j \mid S_j\} = \Pr\{x \notin DIS_j \mid S_j\} = 1 - \Pr\{x \in DIS_j \mid S_j\}. \]
Applying the union bound we conclude that the following inequality holds simultaneously, with high probability, for $t = 0, \ldots, \lfloor\log_2(m)\rfloor - 1$,
\[ \Pr\{ x_{2^t+1} \in DIS_{2^t} \mid S_{2^t} \} \le 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{\log_2(m)}, 2^t \right). \tag{4.1} \]
For all j ≤ i, Sj ⊆ Si , so DISi ⊆ DISj . Therefore, since the samples in Sm are
all drawn i.i.d., for any j ≤ i,
Pr {xi+1 ∈ DISi |Si } ≤ Pr {xi+1 ∈ DISj |Sj } = Pr {xj+1 ∈ DISj |Sj } .
The proof is established by setting j = 2⌊log2 (i)⌋ ≤ i, and applying inequality (4.1).
Lemma 4.2 (Bernstein's inequality [57]) Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ almost surely, for all $i$. Then, for all positive $t$,
\[ \Pr\left\{ \sum_{i=1}^{n} X_i > t \right\} \le \exp\left( -\frac{t^2/2}{\sum_{j}\mathbb{E}\{X_j^2\} + Mt/3} \right). \]
Lemma 4.3 Let Zi , i = 1, . . . , m, be independent Bernoulli random variables
with success probabilities pi . Then, for any 0 < δ < 1, with probability of at least
1 − δ,
\[ \sum_{i=1}^{m}\left( Z_i - \mathbb{E}\{Z_i\} \right) \le \sqrt{2\ln\frac{1}{\delta}\sum_i p_i} + \frac{2}{3}\ln\frac{1}{\delta}. \]
Proof Define $W_i \triangleq Z_i - \mathbb{E}\{Z_i\} = Z_i - p_i$. Clearly,
\[ \mathbb{E}\{W_i\} = 0, \qquad |W_i| \le 1, \qquad \mathbb{E}\{W_i^2\} = p_i(1 - p_i). \]
Applying Bernstein's inequality (Lemma 4.2) on the $W_i$,
\[ \Pr\left\{ \sum_{i=1}^{m} W_i > t \right\} \le \exp\left( -\frac{t^2/2}{\sum_j\mathbb{E}\{W_j^2\} + t/3} \right) = \exp\left( -\frac{t^2/2}{\sum_i p_i(1-p_i) + t/3} \right) \le \exp\left( -\frac{t^2/2}{\sum_i p_i + t/3} \right). \]
Equating the right-hand side to $\delta$ and solving for $t$, we have
\[ \frac{t^2/2}{\sum_i p_i + t/3} = \ln\frac{1}{\delta} \iff t^2 - t\cdot\frac{2}{3}\ln\frac{1}{\delta} - 2\ln\frac{1}{\delta}\sum_i p_i = 0, \]
and the positive solution of this quadratic equation is
\[ t = \frac{1}{3}\ln\frac{1}{\delta} + \sqrt{\frac{1}{9}\ln^2\frac{1}{\delta} + 2\ln\frac{1}{\delta}\sum_i p_i} < \frac{2}{3}\ln\frac{1}{\delta} + \sqrt{2\ln\frac{1}{\delta}\sum_i p_i}. \]
Lemma 4.4 Let Z1 , Z2 , . . . , Zm be a high order Markov sequence of dependent binary random variables defined in the same probability space. Let X1 , X2 , . . . , Xm
be a sequence of independent random variables such that,
Pr {Zi = 1|Zi−1 , . . . , Z1 , Xi−1 , . . . , X1 } = Pr {Zi = 1|Xi−1 , . . . , X1 } .
Define $P_1 \triangleq \Pr\{Z_1 = 1\}$, and for $i = 2, \ldots, m$,
\[ P_i \triangleq \Pr\{Z_i = 1 \mid X_{i-1}, \ldots, X_1\}. \]
Let $b_1, b_2, \ldots, b_m$ be given constants independent of $X_1, X_2, \ldots, X_m$ (precisely, we require that each of the $b_i$ be fixed before the $X_i$ are chosen). Assume that $P_i \le b_i$ simultaneously for all $i$ with probability of at least $1 - \delta/2$, $\delta \in (0,1)$. Then, with probability of at least $1 - \delta$,
\[ \sum_{i=1}^{m} Z_i \le \sum_{i=1}^{m} b_i + \sqrt{2\ln\frac{2}{\delta}\sum_{i=1}^{m} b_i} + \frac{2}{3}\ln\frac{2}{\delta}. \]
We proceed with a direct proof of Lemma 4.4. An alternative proof of this lemma,
using super-martingales, appears in Appendix 7.5.
Proof For $i = 1, \ldots, m$, let $W_i$ be binary random variables satisfying
\[ \Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\} \triangleq \frac{b_i + I(P_i \le b_i)\cdot(P_i - b_i)}{P_i}, \]
\[ \Pr\{W_i = 1 \mid Z_i = 0, X_{i-1}, \ldots, X_1\} \triangleq \max\left\{ \frac{b_i - P_i}{1 - P_i}, 0 \right\}, \]
\[ \Pr\{W_i = 1 \mid W_{i-1}, \ldots, W_1, X_{i-1}, \ldots, X_1\} = \Pr\{W_i = 1 \mid X_{i-1}, \ldots, X_1\}. \]
We notice that
\begin{align*}
\Pr\{W_i = 1 \mid X_{i-1}, \ldots, X_1\} &= \Pr\{W_i = 1, Z_i = 1 \mid X_{i-1}, \ldots, X_1\} + \Pr\{W_i = 1, Z_i = 0 \mid X_{i-1}, \ldots, X_1\} \\
&= \Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\}\cdot\Pr\{Z_i = 1 \mid X_{i-1}, \ldots, X_1\} \\
&\quad + \Pr\{W_i = 1 \mid Z_i = 0, X_{i-1}, \ldots, X_1\}\cdot\Pr\{Z_i = 0 \mid X_{i-1}, \ldots, X_1\} \\
&= \begin{cases} P_i + \frac{b_i - P_i}{1 - P_i}(1 - P_i) = b_i, & P_i \le b_i; \\ \frac{b_i}{P_i}\cdot P_i + 0 = b_i, & \text{else.} \end{cases}
\end{align*}
Hence, the distribution of each $W_i$ is independent of $X_{i-1}, \ldots, X_1$, and the $W_i$ are independent Bernoulli random variables with success probabilities $b_i$. By construction, if $P_i \le b_i$, then
\[ \Pr\{W_i = 1 \mid Z_i = 1\} = \int_{\mathcal{X}} \Pr\{W_i = 1 \mid Z_i = 1, X_{i-1}, \ldots, X_1\} = 1. \]
By assumption, $P_i \le b_i$ for all $i$ simultaneously, with probability of at least $1 - \delta/2$. Therefore, $Z_i \le W_i$ simultaneously, with probability of at least $1 - \delta/2$. We now apply Lemma 4.3 on the $W_i$. The proof is then completed using the union bound.
Theorem 4.5 Let Sm be a sequence of m unlabeled samples drawn i.i.d. from an
unknown distribution P . Then with probability of at least 1 − δ over choices of Sm ,
the number of label requests k by the CAL algorithm is bounded by
\[ k \le \Psi(\mathcal{F}, \delta, m) + \sqrt{2\ln\frac{2}{\delta}\,\Psi(\mathcal{F}, \delta, m)} + \frac{2}{3}\ln\frac{2}{\delta}, \]
where
\[ \Psi(\mathcal{F}, \delta, m) \triangleq \sum_{i=1}^{m}\left( 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{2\log_2(m)}, 2^{\lfloor\log_2(i)\rfloor} \right) \right) \]
and BΦ (F, δ, m) is a coverage bound for perfect selective classification with respect to hypothesis class F, confidence δ and sample size m.
Proof According to CAL, the label of sample xi will be requested iff xi ∈
DIS(V SF ,Si−1 ). For i = 1, . . . , m, let Zi be binary random variables such that
$Z_i \triangleq 1$ iff CAL requests a label for sample $x_i$. Applying Lemma 4.1 we get that for all $i = 2, \ldots, m$, with probability of at least $1 - \delta/2$,
\[ \Pr\{Z_i = 1 \mid S_{i-1}\} = \Pr\left\{ x_i \in DIS(VS_{\mathcal{F},S_{i-1}}) \mid S_{i-1} \right\} \le 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{2\log_2(m)}, 2^{\lfloor\log_2(i-1)\rfloor} \right). \]
For i = 1, BΦ (F, δ, 1) = 0, and the above inequality trivially holds. An application of Lemma 4.4 on the variables Zi completes the proof.
Theorem 4.5 states an upper bound on the label complexity expressed in terms
of m, the size of the sample provided to CAL. This upper bound is very convenient
for directly analyzing the active learning speedup relative to supervised learning.
A standard label complexity upper bound, which depends on 1/ǫ, can be extracted
using the following simple observation.
Lemma 4.6 ([48, 3]) Let Sm be a sequence of m unlabeled samples drawn i.i.d.
from an unknown distribution P . Let F be a hypothesis class whose finite VC
dimension is d, and let ǫ and δ be given. If
\[ m \ge \frac{4}{\epsilon}\left( d\ln\frac{12}{\epsilon} + \ln\frac{2}{\delta} \right), \]
then, with probability of at least 1−δ, CAL will output a classifier whose true error
is at most ǫ.
Proof Hanneke [48] observed that since CAL requests a label whenever there
is a disagreement in the version space, it is guaranteed that after processing m
examples, CAL will output a classifier that is consistent with all the m examples
introduced to it. Therefore, CAL is a consistent learner. A classical result [3, Thm.
4.8] is that any consistent learner will achieve, with probability of at least 1 − δ, a
true error not exceeding ǫ after observing at most
\[ \frac{4}{\epsilon}\left( d\ln\frac{12}{\epsilon} + \ln\frac{2}{\delta} \right) \]
labeled examples.
In the next two theorems we prove that a fast coverage rate in CSS implies an exponential label complexity speedup for CAL (Theorem 4.7) and vice versa (Theorem 4.8).
Theorem 4.7 Let F be a hypothesis class whose finite VC dimension is d. If the
rejection rate of CSS (see Definition 3.1) is
\[ O\left( \frac{\mathrm{polylog}(m/\delta)}{m} \right), \]
then (F, P ) is actively learnable with exponential label complexity speedup.
Proof Plugging this rejection rate into $\Psi$ (defined in Theorem 4.5) we have,
\[ \Psi(\mathcal{F}, \delta, m) = \sum_{i=1}^{m}\left( 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{\log_2(m)}, 2^{\lfloor\log_2(i)\rfloor} \right) \right) = \sum_{i=1}^{m} O\left( \frac{\mathrm{polylog}\left( i\log(m)/\delta \right)}{i} \right). \]
Applying Lemma B.4 we get
\[ \Psi(\mathcal{F}, \delta, m) = O\left( \mathrm{polylog}\left( \frac{m\log(m)}{\delta} \right) \right). \]
By Theorem 4.5, $k = O\left( \mathrm{polylog}\left( m/\delta \right) \right)$, and an application of Lemma 4.6 concludes the proof.
Theorem 4.8 Assume (F, P ) has a passive learning sample complexity of Ω (1/ǫ)
and it is actively learnable by CAL with exponential label complexity speedup.
Then the rejection rate of CSS is O (polylog(m)/m) .
Proof Let Γ(Sm ) be the total number of label requests made by CAL after processing the training sample Sm . Let δ1 be given. By assumption, (F, P ) is actively
learnable by CAL with exponential label complexity speedup. Therefore, there exist constants, c1 (δ1 ) and c2 (δ1 ), such that for any m, with probability of at least
1 − δ1 over random choices of Sm ,
\[ \Gamma(S_m) \le c_1\log^{c_2} m. \]
Let
\[ A \triangleq \left\{ S_m \mid \Gamma(S_m) \le \max\left( c_1\log^{c_2}(m),\, c_1\log^{c_2}(m-1) + 1 \right) \right\}, \]
and
\[ B \triangleq \left\{ S_{m-1} \mid \Gamma(S_{m-1}) \le c_1\log^{c_2}(m-1) \right\}. \]
Define $Z_i = Z_i(S_m)$ to be a binary indicator variable that equals 1 if CAL requests a label at step $i$, and 0 otherwise. Consider a sample $s_m = (x_1, x_2, \ldots, x_m)$. For $i > 1$, assume that $Z_i = 1$ and for some $j < i$, $Z_j = 0$. Let $s_m'$ be a sample identical to $s_m$ with the exception that examples $x_i$ and $x_j$ are interchanged. Clearly, $Z_j(s_m') = 1$ and $Z_i(s_m') = 0$, $\Gamma(s_m') \le \Gamma(s_m)$, and $P(s_m) = P(s_m')$. Therefore, if $s_m \in A$, then also $s_m' \in A$, and it follows that for any $j < i$,
\[ \int_A Z_j(s_m)\,dP(s_m) \ge \int_A Z_i(s_m)\,dP(s_m). \]
Therefore,
\begin{align*}
\int_A Z_m(s_m)\,dP(s_m) &\le \int_A \frac{1}{m}\sum_{i=1}^{m} Z_i(s_m)\,dP(s_m) = \frac{1}{m}\sum_{i=1}^{m}\int_A Z_i(s_m)\,dP(s_m) \\
&\le \frac{1}{m}\int_{S_m\in A} dP(s_m)\cdot\max\left( c_1\log^{c_2}(m),\, c_1\log^{c_2}(m-1)+1 \right) \\
&\le \max\left( \frac{c_1\log^{c_2}(m)}{m}, \frac{c_1\log^{c_2}(m-1)+1}{m} \right). \quad (4.2)
\end{align*}
Decompose the sample $s_m$ into a sample of size $m-1$ (denoted by $s_{m-1}$) containing the first $m-1$ examples plus the final example $x_m$, i.e., $s_m = (s_{m-1}, x_m)$. By definition,
\[ \Gamma(s_m) \le \Gamma(s_{m-1}) + 1, \]
and it follows that if $s_{m-1} \in B$, then $s_m \in A$ regardless of $x_m$. We thus have,
\[ \int_A Z_m(s_m)\,dP(s_m) \ge \int_B\left( \int_{x_m} Z_m(s_{m-1}, x_m)\,dP(x_m) \right) dP(s_{m-1}) = \int_B \Delta VS_{\mathcal{F},s_{m-1}}\,dP(s_{m-1}). \]
By assumption, $\Pr(B) \triangleq \Pr\{S_{m-1} \in B\} \ge 1 - \delta_1$, so
\[ \int_A Z_m(s_m)\,dP(s_m) \ge \frac{\int_B \Delta VS_{\mathcal{F},s_{m-1}}\,dP(s_{m-1})}{\Pr(B)}\cdot(1 - \delta_1). \tag{4.3} \]
Combining (4.2) and (4.3) we get
\[ \mathbb{E}\left\{ \Delta VS_{\mathcal{F},S_{m-1}} \mid B \right\} \le \max\left( \frac{c_1\log^{c_2}(m)}{m(1-\delta_1)}, \frac{c_1\log^{c_2}(m-1)+1}{m(1-\delta_1)} \right) \triangleq f(m). \tag{4.4} \]
Using Markov's inequality we get
\begin{align*}
\Pr\left\{ \Delta VS_{\mathcal{F},S_{m-1}} > \frac{f(m)}{\delta_2} \right\}
&= \Pr\left\{ \Delta VS_{\mathcal{F},S_{m-1}} > \frac{f(m)}{\delta_2} \;\middle|\; B \right\}\cdot\Pr(B) + \Pr\left\{ \Delta VS_{\mathcal{F},S_{m-1}} > \frac{f(m)}{\delta_2} \;\middle|\; \bar{B} \right\}\cdot\Pr(\bar{B}) \\
&\le \Pr\left\{ \Delta VS_{\mathcal{F},S_{m-1}} > \frac{f(m)}{\delta_2} \;\middle|\; B \right\} + \Pr\{\bar{B}\} \\
&\le \frac{\mathbb{E}\left\{ \Delta VS_{\mathcal{F},S_{m-1}} \mid B \right\}}{f(m)/\delta_2} + \delta_1 \le \delta_1 + \delta_2,
\end{align*}
where $\bar{B}$ is the complement set of $B$. Choosing $\delta_1 = \delta_2 = \delta/2$ completes the proof.
4.4 A new technique for upper bounding the label complexity
In this section we present a novel technique for deriving target-independent label
complexity bounds for active learning. The technique combines the reduction of
Theorem 4.5 and the general data-dependent coverage bound for selective classification of Theorem 3.10. For some learning problems it is a straightforward technical exercise, involving VC-dimension calculations, to arrive with exponential label
complexity bounds. We show a few applications of this technique resulting in both
reproductions of known label complexity exponential rates as well as a new one.
We recall that γ(F, n̂) is the characterizing set complexity of hypothesis class
F with respect to the version space compression set size n̂. Given an hypothesis
class F, our recipe for deriving active learning label complexity bounds for F is: (i) calculate both n̂ and γ(F, n̂); (ii) apply Theorem 3.10, obtaining a bound BΦ for the coverage; (iii) plug BΦ into Theorem 4.5 to get a label complexity bound expressed as a summation; (iv) apply Lemma B.4 to obtain a label complexity bound in closed form.
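As a purely numerical illustration of steps (iii) and (iv) (ours, not part of the thesis), the following Python sketch evaluates Ψ(F, δ, m) and the label complexity bound of Theorem 4.5 for a user-supplied coverage bound B_Φ; the example coverage bound at the end is a hypothetical stand-in with a 1 − O(log(n/δ)/n) rate.

    import math

    def psi(coverage_bound, delta, m):
        """Psi(F, delta, m) = sum_i (1 - B_Phi(delta / (2 log2 m), 2^floor(log2 i)))."""
        total = 0.0
        for i in range(1, m + 1):
            n = 2 ** int(math.log2(i))
            total += 1.0 - coverage_bound(delta / (2 * math.log2(m)), n)
        return total

    def label_complexity_bound(coverage_bound, delta, m):
        """k <= Psi + sqrt(2 ln(2/delta) Psi) + (2/3) ln(2/delta)  (Theorem 4.5)."""
        p = psi(coverage_bound, delta, m)
        return p + math.sqrt(2 * math.log(2 / delta) * p) + (2.0 / 3.0) * math.log(2 / delta)

    # Hypothetical coverage bound with a fast 1 - O(log(n/delta)/n) rate.
    b_phi = lambda delta, n: max(0.0, 1.0 - 2.0 * math.log(max(n, 2) / delta) / n)
    print(label_complexity_bound(b_phi, delta=0.05, m=10_000))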
4.4.1 Linear separators in R
In the following example we derive a label complexity bound for the concept class
of thresholds (linear separators in R). Although this is a toy example (for which
an exponential rate is well known) it does exemplify the technique, and in many
other cases the application of the technique is not much harder. Let F be the class
of thresholds. We first show that the corresponding version space compression set
size $\hat{n} \le 2$. Assume w.l.o.g. that $f^*(x) \triangleq I(x > w)$ for some $w \in (0, 1)$. Let $x_- \triangleq \max\{x_i \in S_m \mid y_i = -1\}$ and $x_+ \triangleq \min\{x_i \in S_m \mid y_i = +1\}$. At least one of $x_-$ or $x_+$ exists. Let $S_m' = \{(x_-, -1), (x_+, +1)\}$. Then $VS_{\mathcal{F},S_m'} = VS_{\mathcal{F},S_m}$, and $\hat{n} = |S_m'| \le 2$. Now, $\gamma(\mathcal{F}, 2) = 2$, as shown in Section 2.3. Plugging these numbers in Theorem 3.10, and using the assignment $a_1 = a_2 = 1/2$,
\[ B_\Phi(\mathcal{F}, \delta, m) = 1 - \frac{2}{m}\left( 2\ln(em) + \ln\frac{4}{\delta} \right) = 1 - O\left( \frac{\ln(m/\delta)}{m} \right). \]
Next we plug $B_\Phi$ in Theorem 4.5 obtaining a raw label complexity
\[ \Psi(\mathcal{F}, \delta, m) = \sum_{i=1}^{m}\left( 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{2\log_2(m)}, 2^{\lfloor\log_2(i)\rfloor} \right) \right) = \sum_{i=1}^{m} O\left( \frac{\ln\left( \log_2(m)\cdot i/\delta \right)}{i} \right). \]
Finally, by applying Lemma B.4, with $a = 1$ and $b = \log_2 m/\delta$, we conclude that
\[ \Psi(\mathcal{F}, \delta, m) = O\left( \ln^2\frac{m}{\delta} \right). \]
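To make the first step of the recipe concrete, the following tiny Python sketch (ours) extracts the version space compression set of a labeled one-dimensional sample for the thresholds class: it keeps only the largest negative and smallest positive examples, so n̂ ≤ 2.

    def threshold_compression_set(sample):
        """sample: iterable of (x, y) pairs with y in {-1, +1}."""
        negatives = [x for x, y in sample if y == -1]
        positives = [x for x, y in sample if y == +1]
        comp = []
        if negatives:
            comp.append((max(negatives), -1))   # rightmost negative example
        if positives:
            comp.append((min(positives), +1))   # leftmost positive example
        return comp  # n_hat = len(comp) <= 2

    print(threshold_compression_set([(0.1, -1), (0.4, -1), (0.7, +1), (0.9, +1)]))
    # -> [(0.4, -1), (0.7, +1)]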
Thus, $\mathcal{F}$ is actively learnable with exponential speedup, and this result applies to any distribution. In Table 4.1 we summarize the $\hat{n}$ and $\gamma(\mathcal{F}, \hat{n})$ values we calculated for four other hypothesis classes. The last two cases are fully analyzed in Sections 4.4.2 and 4.6.1, respectively. For the other classes, where $\gamma$ and $\hat{n}$ are constants, it is clear (Theorem 3.10) that exponential rates are obtained. We emphasize that the bounds for these two classes are target-dependent as they require that $S_m$ includes at least one sample from each class.

Table 4.1: Calculated $\hat{n}$ and $\gamma$ for various hypothesis classes achieving exponential rates.

Hypothesis class                        | Distribution                                            | $\hat{n}$                             | $\gamma(\mathcal{F}, \hat{n})$
Linear separators in R                  | any                                                     | 2                                     | 2
Intervals in R                          | any (target-dependent*)                                 | 4                                     | 4
Linear separators in R^2                | any distribution on the unit circle (target-dependent*) | 4                                     | 4
Linear separators in R^d                | mixture of Gaussians                                    | $O\left((\log m)^{d-1}/\delta\right)$ | $O\left(\hat{n}^{d/2+1}\right)$
Balanced axis-aligned rectangles in R^d | product distribution                                    | $O\left(\log(dm/\delta)\right)$       | $O\left(d\hat{n}\log\hat{n}\right)$

(* With at least one sample in each class.)
4.4.2 Linear separators in Rd under mixture of Gaussians
In this section we state and prove our main example, an exponential label complexity bound for linear classifiers in Rd .
Theorem 4.9 Let F be the class of all linear binary classifiers in Rd , and let the
underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then,
with probability of at least 1 − δ over choices of Sm , the number of label requests
k by CAL is bounded by
\[ k = O\left( \frac{(\log m)^{d^2+1}}{\delta^{(d+3)/2}} \right). \]
Therefore, by Lemma 4.6 we have, k = O (poly(1/δ) · polylog(1/ǫ)) .
Proof According to Corollary 3.15, the following distribution-dependent coverage
bound holds in our setting with probability of at least 1 − δ
\[ \Phi(f, g) \ge 1 - O\left( \frac{(\log m)^{d^2}}{m}\cdot\frac{1}{\delta^{(d+3)/2}} \right). \tag{4.5} \]
Plugging this bound in Theorem 4.5 we obtain,
\begin{align*}
\Psi(\mathcal{F}, \delta, m) &= \sum_{i=1}^{m}\left( 1 - B_\Phi\left( \mathcal{F}, \frac{\delta}{2\log_2(m)}, 2^{\lfloor\log_2(i)\rfloor} \right) \right) = \sum_{i=1}^{m} O\left( \frac{(\log i)^{d^2}}{i}\cdot\left( \frac{\log_2(m)}{\delta} \right)^{\frac{d+3}{2}} \right) \\
&= O\left( \left( \frac{\log_2(m)}{\delta} \right)^{\frac{d+3}{2}} \right)\cdot\sum_{i=1}^{m}\frac{(\log(i))^{d^2}}{i}.
\end{align*}
Finally, an application of Lemma B.4 with $a = d^2$ and $b = 1$ completes the proof.
4.5 Lower bound on label complexity
In the previous section we have derived an upper bound on the label complexity
of CAL for various classifiers and distributions. In the case of linear classifiers in
Rd we have shown an exponential speedup in terms of 1/ǫ but also an exponential slowdown in terms of the dimension d. In passive learning there is a linear
dependency in the dimension while in our case (active learning using CAL) there
is an exponential one. Is it an artifact of our bounding technique or a fundamental
phenomenon?
To answer this question we derive an asymptotic lower bound on the label complexity. We show that the exponential dependency in $d$ is unavoidable (at least asymptotically) for every bounding technique when considering linear classifiers, even under a single isotropic Gaussian distribution. The argument follows from the observation that CAL has to request a label for any point on the convex hull of a sample $S_m$. The bound is obtained using known results from probabilistic geometry, which bound the first two moments of the number of vertices of a random polytope under the Gaussian distribution.
Definition 4.4 (Gaussian polytope) Let X1 , ..., Xm be i.i.d. random points in Rd
with common standard normal distribution (with zero mean and covariance matrix $\frac{1}{2}I_d$). A Gaussian polytope $P_m$ is the convex hull of these random points.
Denote by fk (Pm ) the number of k-faces in the Gaussian polytope Pm . Note that
f0 (Pm ) is the number of vertices in Pm . The following two Theorems asymptotically bound the average and variance of fk (Pm ).
Theorem 4.10 ([58], Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd
with common standard normal distribution. Then,
\[ \mathbb{E} f_k(P_m) = c_{(k,d)}\,(\log m)^{\frac{d-1}{2}}\cdot(1 + o(1)) \]
as m → ∞, where c(k,d) is a constant depending only on k and d.
Theorem 4.11 ([59], Theorem 1.1) Let X1 , ..., Xm be i.i.d. random points in Rd
with common standard normal distribution. Then there exists a positive constant
cd , depending only on the dimension, such that
\[ \mathrm{Var}\left( f_k(P_m) \right) \le c_d\,(\log m)^{\frac{d-1}{2}}, \]
for all k ∈ {0, . . . , d − 1}.
We can now use Chebyshev’s inequality to lower bound the number of vertices in
Pm (f0 (Pm )) with high probability.
Theorem 4.12 Let X1 , ..., Xm be i.i.d. random points in Rd with common standard normal distribution and δ > 0 be given. Then with probability of at least
1 − δ,
\[ f_0(P_m) \ge \left( c_d\,(\log m)^{\frac{d-1}{2}} - \frac{\tilde{c}_d}{\sqrt{\delta}}\,(\log m)^{\frac{d-1}{4}} \right)\cdot(1 + o(1)), \]
as m → ∞, where cd and c̃d are constants depending only on d.
Proof Using Chebyshev's inequality (in the second inequality), as well as Theorem 4.11, we obtain,
\begin{align*}
\Pr\left( f_0(P_m) > \mathbb{E}f_0(P_m) - t \right) &= 1 - \Pr\left( f_0(P_m) \le \mathbb{E}f_0(P_m) - t \right) \\
&\ge 1 - \Pr\left( |f_0(P_m) - \mathbb{E}f_0(P_m)| \ge t \right) \\
&\ge 1 - \frac{\mathrm{Var}\left( f_0(P_m) \right)}{t^2} \ge 1 - \frac{c_d}{t^2}(\log m)^{\frac{d-1}{2}}.
\end{align*}
Equating the RHS to $1 - \delta$ and solving for $t$ we get
\[ t = \sqrt{\frac{c_d\,(\log m)^{\frac{d-1}{2}}}{\delta}}. \]
Applying Theorem 4.10 completes the proof.
Theorem 4.13 (lower bound) Let F be the class of linear binary classifiers in Rd ,
and let the underlying distribution be standard normal distribution in Rd . Then
there exists a target hypothesis such that, with probability of at least 1 − δ over
choices of Sm , the number of label requests k by CAL is bounded by
\[ k \ge \frac{c_d}{2}\,(\log m)^{\frac{d-1}{2}}\cdot(1 + o(1)), \]
as m → ∞, where cd is a constant depending only on d.
Proof Let us look at the Gaussian polytope $P_m$ induced by the random sample $S_m$. As long as all labels requested by CAL have the same value (the case of a minuscule minority class) we note that every vertex of $P_m$ falls in the region of disagreement with respect to any subset of $S_m$ that does not include that specific vertex. Therefore, CAL will request a label at least for each vertex of $P_m$. For sufficiently large $m$, in particular,
\[ \log m \ge \left( \frac{2\tilde{c}_d}{c_d\sqrt{\delta}} \right)^{\frac{4}{d-1}}, \]
we conclude the proof by applying Theorem 4.12.
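The lower bound can be probed empirically. The following Python sketch (ours) counts the convex hull vertices f_0(P_m) of m standard Gaussian points in R^d using scipy; rescaling the covariance to (1/2)I_d does not change the vertex count, so the standard normal generator suffices for illustration.

    import numpy as np
    from scipy.spatial import ConvexHull

    def gaussian_polytope_vertices(m, d, seed=0):
        """Number of vertices f_0(P_m) of the convex hull of m Gaussian points in R^d."""
        rng = np.random.default_rng(seed)
        points = rng.standard_normal((m, d))
        return len(ConvexHull(points).vertices)

    for m in (100, 1_000, 10_000):
        print(m, gaussian_polytope_vertices(m, d=3))  # grows roughly like (log m)^{(d-1)/2}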
4.6 Relation to existing label complexity measures
A number of complexity measures to quantify the speedup in active learning have
been proposed. In this section we show interesting relations between our techniques and two well known measures, namely the teaching dimension [43] and the
disagreement coefficient [48].
Considering first the teaching dimension, we prove in Lemma 4.15 that the version space compression set size is bounded above, with high probability, by the
extended teaching dimension growth function (introduced by Hanneke [47]). Consequently, it follows that perfect selective classification with meaningful coverage
can be achieved for the case of axis-aligned rectangles under a product distribution.
We then focus on Hanneke’s disagreement coefficient and show in Theorem 4.18
that the coverage of CSS can be bounded below using the disagreement coefficient.
Conversely, in Corollary 4.23 we show that the disagreement coefficient can be
bounded above using any coverage bound for CSS. Consequently, the results here
imply that the disagreement coefficient, θ(r0 ) grows slowly with 1/r0 for the case
of linear classifiers under a mixture of Gaussians.
4.6.1 Teaching dimension
The teaching dimension is a label complexity measure proposed by Goldman and
Kearns [43]. The dimension of the hypothesis class F is the minimum number of
examples required to present to any consistent learner in order to uniquely identify
any hypothesis in the class.
We now define the following variation of the extended teaching dimension [54]
due to Hanneke. Throughout we use the notation f1 (S) = f2 (S) to denote the fact
that the two hypotheses agree on the classification of all instances in S.
Definition 4.5 (Extended Teaching Dimension, [54, 47]) Let V ⊆ F, m ≥ 0,
U ∈ X m . For any f ∈ F,
\[ XTD(f, V, U) \triangleq \inf\left\{ t \;\middle|\; \exists R \subseteq U : \left| \{ f' \in V : f'(R) = f(R) \} \right| \le 1 \ \wedge\ |R| \le t \right\}. \]
Definition 4.6 ([47]) For V ⊆ F, V [Sm ] denotes any subset of V such that
\[ \forall f \in V, \quad \left| \{ f' \in V[S_m] : f'(S_m) = f(S_m) \} \right| = 1. \]
Claim 4.14 Let Sm be a sample of size m, F an hypothesis class, and n̂ =
n(F, Sm ), the version space compression set size. Then,
XT D(f ∗ , F[Sm ], Sm ) = n̂.
Proof Let Sn̂ ⊆ Sm be a version space compression set. Assume, by contradiction, that there exist two hypotheses f1 , f2 ∈ F[Sm ], each of which agrees on the
given classifications of all examples in Sn̂ . Therefore, f1 , f2 ∈ V SF ,Sn̂ , and by
the definition of version space compression set, we know that f1 , f2 ∈ V SF ,Sm .
Hence,
| {f ∈ F[Sm ] : f (Sm ) = f ∗ (Sm )} | ≥ 2,
which contradicts definition 4.6. Therefore,
| {f ∈ F[Sm ] : f (Sn̂ ) = f ∗ (Sn̂ )} | ≤ 1,
and
XT D(f ∗ , F[Sm ], Sm ) ≤ |Sn̂ | = n̂.
Let R ⊂ Sm be any subset of size |R| < n̂. Consequently, V SF ,Sm ⊂ V SF ,R ,
and there exists an hypothesis, f ′ ∈ V SF ,R , that agrees with all labeled examples
in R, but disagrees with at least one example in Sm . Thus,
f ′ (Sm ) 6= f ∗ (Sm ),
and according to definition 4.6, there exist hypotheses f1 , f2 ∈ F[Sm ] such that
f1 (Sm ) = f ′ (Sm ) 6= f ∗ (Sm ) = f2 (Sm ). But f1 (R) = f2 (R) = f ∗ (R), so
| {f ∈ V [Sm ] : f (R) = f ∗ (R)} | ≥ 2.
It follows that $XTD(f^*, \mathcal{F}[S_m], S_m) \ge \hat{n}$.
Definition 4.7 (XTD Growth Function, [47]) For m ≥ 0, V ⊆ F, δ ∈ [0, 1],
XT D(V, P, m, δ) = inf {t|∀f ∈ F, P r {XT D(f, V [Sm ], Sm ) > t} ≤ δ} .
Lemma 4.15 Let F be an hypothesis class, P an unknown distribution, and δ > 0.
Then, with probability of at least 1 − δ,
n̂ ≤ XT D(F, P, m, δ).
Proof According to Definition 4.7, with probability of at least 1 − δ,
XT D(f ∗ , F[Sm ], Sm ) ≤ XT D(F, P, m, δ).
Applying Claim 4.14 completes the proof.
Lemma 4.16 (Balanced Axis-Aligned Rectangles, [47], Lemma 4) If P is a product distribution on Rd with continuous CDF, and F is the set of axis-aligned rectangles such that ∀f ∈ F, P rX∼P {f (X) = +1} ≥ λ, then,
\[ XTD(\mathcal{F}, P, m, \delta) \le O\left( \frac{d^2}{\lambda}\log\frac{dm}{\delta} \right). \]
Corollary 4.17 (Balanced Axis-Aligned Rectangles) Under the same conditions
of Lemma 4.16, the class of balanced axis-aligned rectangles in Rd can be perfectly
selectively learned with fast coverage rate.
Proof Applying Lemmas 4.15 and 4.16 we get that with probability of at least
1 − δ,
\[ \hat{n} \le O\left( \frac{d^2}{\lambda}\log\frac{dm}{\delta} \right). \]
Any balanced axis-aligned rectangle belongs to the class of all axis-aligned rectangles. Therefore, the coverage of CSS for the class of balanced axis-aligned rectangles is bounded below by the coverage for the class of all axis-aligned rectangles. Applying Theorem 2.2, and assuming $m \ge d$, we obtain,
\[ \gamma(\mathcal{F}, \hat{n}) \le O\left( d\cdot\frac{d^2}{\lambda}\log\frac{dm}{\delta}\cdot\log\left( \frac{d^2}{\lambda}\log\frac{dm}{\delta} \right) \right) \le O\left( \frac{d^3}{\lambda}\log^2\frac{dm}{\lambda\delta} \right). \]
Applying Theorem 3.10 completes the proof.
4.6.2 Disagreement coefficient
In this section we show interesting relations between the disagreement coefficient
and coverage bounds in perfect selective classification. We begin by defining, for
an hypothesis f ∈ F, the set of all hypotheses that are r-close to f .
Definition 4.8 ([49, p.337]) For any hypothesis f ∈ F, distribution P over X ,
and r > 0, define the set B(f, r) of all hypotheses that reside in a ball of radius r
around f ,
\[ B(f, r) \triangleq \left\{ f' \in \mathcal{F} : \Pr_{X\sim P}\left\{ f'(X) \ne f(X) \right\} \le r \right\}. \]
Let
\[ \eta(d, m, \delta) \triangleq \frac{2}{m}\left( d\ln\frac{2em}{d} + \ln\frac{2}{\delta} \right). \]
Definition 4.9 For any G ⊆ F, and distribution P , we denote by ∆G the volume
of the disagreement set (see Definition 4.3) of G,
\[ \Delta G \triangleq \Pr\{ DIS(G) \}. \]
Definition 4.10 (disagreement coefficient [48]) Let r0 ≥ 0. The disagreement
coefficient of the hypothesis class F with respect to the target distribution P is
\[ \theta(r_0) \triangleq \theta_{f^*}(r_0) = \sup_{r > r_0}\frac{\Delta B(f^*, r)}{r}. \]
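As a sanity check on the definition, the following Python sketch (ours) evaluates θ(r_0) numerically for the thresholds class under the uniform distribution on [0, 1], where the disagreement region of B(f*, r) is the interval of width at most 2r around the target threshold and the known value is θ(0) = 2.

    import numpy as np

    def disagreement_mass(w_star, r):
        """P-mass of DIS(B(f_w*, r)) for thresholds under U[0, 1]."""
        lo, hi = max(0.0, w_star - r), min(1.0, w_star + r)
        return hi - lo

    def theta(w_star, r0, grid=10_000):
        radii = np.linspace(r0, 1.0, grid)[1:]   # radii strictly above r0
        return max(disagreement_mass(w_star, r) / r for r in radii)

    print(theta(w_star=0.5, r0=1e-6))  # close to 2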
The following theorem formulates an intimate relation between active learning (disagreement coefficient) and selective classification.
Theorem 4.18 Let F be an hypothesis class with VC-dimension d, P an unknown
distribution, r0 ≥ 0, and θ(r0 ), the corresponding disagreement coefficient. Let
(f, g) be a selective classifier chosen by CSS. Then, R(f, g) = 0, and for any
0 ≤ δ ≤ 1, with probability of at least 1 − δ,
Φ(f, g) ≥ 1 − θ(r0 ) · max {η(d, m, δ), r0 } .
Proof Clearly, $R(f, g) = 0$, and it remains to prove the coverage bound. By Theorem 3.9, with probability of at least $1 - \delta$,
\[ \forall f \in VS_{\mathcal{F},S_m} \quad R(f) \le \eta(d, m, \delta) \le \max\{ \eta(d, m, \delta), r_0 \}. \]
Therefore,
\[ VS_{\mathcal{F},S_m} \subseteq B\left( f^*, \max\{ \eta(d, m, \delta), r_0 \} \right) \quad\text{and}\quad \Delta VS_{\mathcal{F},S_m} \le \Delta B\left( f^*, \max\{ \eta(d, m, \delta), r_0 \} \right). \tag{4.6} \]
By Definition 4.10, for any r ′ > r0 ,
∆B(f ∗ , r ′ ) ≤ θ(r0 )r ′ .
(4.7)
Thus, the proof is complete by recalling that
Φ(f, g) = 1 − ∆V SF ,Sm .
Theorem 4.18 tells us that whenever our learning problem (specified by the pair
(F, P )) has a disagreement coefficient that grows slowly with respect to 1/r0 , it
can be (perfectly) selectively learned with a “fast” coverage bound. Consequently,
through Theorem 4.7 we also know that in each case where there exists a disagreement coefficient that grows slowly with respect to 1/r0 , active learning with a
fast rate can also be deduced directly through a reduction from perfect selective
classification. It follows that as far as fast rates in active learning are concerned,
whatever can be accomplished by bounding the disagreement coefficient, can be
accomplished also using perfect selective classification. This result is summarized
in the following corollary.
Corollary 4.19 Let F be an hypothesis class with VC-dimension d, P an unknown
distribution, and θ(r0 ), the corresponding disagreement coefficient. If θ(r0 ) =
O(polylog(1/r0 )), there exists a coverage bound such that an application of Theorem 4.5 ensures that (F, P ) is actively learnable with exponential label complexity
speedup.
Proof The proof is established by a straightforward application of Theorem 4.18 with $r_0 = 1/m$, followed by Theorem 4.7.
The following result, due to Hanneke [51], implies a coverage upper bound for
CSS.
Lemma 4.20 ([51], Proof of Lemma 47) Let F be an hypothesis class, P an unknown distribution, and r ∈ (0, 1). Then,
\[ \mathbb{E}_P\,\Delta D_m \ge (1-r)^m\,\Delta B(f^*, r), \]
where
\[ D_m \triangleq VS_{\mathcal{F},S_m} \cap B(f^*, r). \tag{4.8} \]
Theorem 4.21 (coverage upper bound) Let F be an hypothesis class, P an unknown distribution, and δ ∈ (0, 1). Then, for any r ∈ (0, 1), 1 > α > δ,
\[ B_\Phi(\mathcal{F}, \delta, m) \le 1 - \frac{(1-r)^m - \alpha}{1-\alpha}\,\Delta B(f^*, r), \]
where BΦ (F, δ, m) is any coverage bound.
Proof Recalling the definition of Dm (4.8), clearly Dm ⊆ V SF ,Sm and Dm ⊆
B(f ∗ , r). These inclusions imply (respectively), by the definition of disagreement
set,
∆Dm ≤ ∆V SF ,Sm , and ∆Dm ≤ ∆B(f ∗ , r).
(4.9)
Using Markov’s inequality (in inequality (4.10) of the following derivation) and
applying (4.9) (in equality (4.11)), we thus have,
\begin{align*}
\Pr\left\{ \Delta VS_{\mathcal{F},S_m} \le \frac{(1-r)^m - \alpha}{1-\alpha}\,\Delta B(f^*, r) \right\}
&\le \Pr\left\{ \Delta D_m \le \frac{(1-r)^m - \alpha}{1-\alpha}\,\Delta B(f^*, r) \right\} \\
&= \Pr\left\{ \Delta B(f^*, r) - \Delta D_m \ge \frac{1 - (1-r)^m}{1-\alpha}\,\Delta B(f^*, r) \right\} \\
&\le \Pr\left\{ |\Delta B(f^*, r) - \Delta D_m| \ge \frac{1 - (1-r)^m}{1-\alpha}\,\Delta B(f^*, r) \right\} \\
&\le (1-\alpha)\cdot\frac{\mathbb{E}\{ |\Delta B(f^*, r) - \Delta D_m| \}}{\left( 1 - (1-r)^m \right)\Delta B(f^*, r)} \quad (4.10) \\
&= (1-\alpha)\cdot\frac{\Delta B(f^*, r) - \mathbb{E}\Delta D_m}{\left( 1 - (1-r)^m \right)\Delta B(f^*, r)}. \quad (4.11)
\end{align*}
Applying Lemma 4.20 we therefore obtain,
\[ \le (1-\alpha)\cdot\frac{\Delta B(f^*, r) - (1-r)^m\,\Delta B(f^*, r)}{\left( 1 - (1-r)^m \right)\Delta B(f^*, r)} = 1 - \alpha < 1 - \delta. \]
Observing that for any coverage bound,
P r {∆V SF ,Sm ≤ 1 − BΦ (F, δ, m)} ≥ 1 − δ,
completes the proof.
Corollary 4.22 Let F be an hypothesis class, P an unknown distribution, and
δ ∈ (0, 1/8). Then for any m ≥ 2,
\[ B_\Phi(\mathcal{F}, \delta, m) \le 1 - \frac{1}{7}\,\Delta B\left( f^*, \frac{1}{m} \right), \]
where BΦ (F, δ, m) is any coverage bound.
Proof The proof is established by a straightforward application of Theorem 4.21
with α = 1/8 and r = 1/m.
With Corollary 4.22 we can bound the disagreement coefficient for settings
whose coverage bound is known.
Corollary 4.23 Let F be an hypothesis class, P an unknown distribution, and
BΦ (F, δ, m) a coverage bound. Then the disagreement coefficient is bounded by,
\[ \theta(r_0) \le \max\left\{ \sup_{r\in(r_0,\,1/2)} 7\cdot\frac{1 - B_\Phi(\mathcal{F}, 1/9, \lfloor 1/r \rfloor)}{r}, \ 2 \right\}. \]
Proof Applying Corollary 4.22 we get that for any $r \in (0, 1/2)$,
\[ \frac{\Delta B(f^*, r)}{r} \le \frac{\Delta B(f^*, 1/\lfloor 1/r \rfloor)}{r} \le 7\cdot\frac{1 - B_\Phi(\mathcal{F}, 1/9, \lfloor 1/r \rfloor)}{r}. \]
Therefore,
\[ \theta(r_0) = \sup_{r > r_0}\frac{\Delta B(f^*, r)}{r} \le \max\left\{ \sup_{r\in(r_0,\,1/2)} 7\cdot\frac{1 - B_\Phi(\mathcal{F}, 1/9, \lfloor 1/r \rfloor)}{r}, \ 2 \right\}. \]
Specifically we can bound the disagreement coefficient for the case of linear classifiers in Rd when the underlying distribution is an arbitrary unknown finite mixture
of high dimensional Gaussians.
Corollary 4.24 Let F be the class of all linear binary classifiers in Rd , and let the
underlying distribution be any mixture of a fixed number of Gaussians in Rd . Then
\[ \theta(r_0) \le O\left( \mathrm{polylog}\frac{1}{r_0} \right). \]
Proof Applying Corollary 4.23 together with inequality (4.5) we get that
\begin{align*}
\theta(r_0) &\le \max\left\{ \sup_{r\in(r_0,\,1/2)} 7\cdot\frac{1 - B_\Phi(\mathcal{F}, 1/9, \lfloor 1/r \rfloor)}{r}, \ 2 \right\} \\
&\le \max\left\{ \sup_{r\in(r_0,\,1/2)} \frac{7}{r}\cdot O\left( \frac{(\log\lfloor 1/r \rfloor)^{d^2}}{\lfloor 1/r \rfloor}\cdot 9^{\frac{d+3}{2}} \right), \ 2 \right\} \le O\left( \log^{d^2}\frac{1}{r_0} \right).
\end{align*}
A similar result holds for any hypothesis class and source distribution for which the characterizing set complexity grows slowly with respect to the sample size $m$, as the following theorem shows.
Theorem 4.25 Let F be an hypothesis class, Sm a training sample, and n̂ the
corresponding version space compression set size. If
γ(F, n̂) = O(polylog(m))
then
\[ \theta(r_0) = O\left( \mathrm{polylog}\frac{1}{r_0} \right). \]
Proof According to Theorem 3.10 for $a_i = 1/m$ we get
\[ B_\Phi(\mathcal{F}, \delta, m) = 1 - O\left( \frac{1}{m}\left( \mathrm{polylog}(m)\cdot\log\frac{m}{\mathrm{polylog}(m)} + \mathrm{polylog}\frac{m}{\delta} \right) \right) = 1 - O\left( \frac{1}{m}\,\mathrm{polylog}\frac{m}{\delta} \right). \]
Applying Corollary 4.23 we get
\[ \theta(r_0) \le \max\left\{ \sup_{r\in(r_0,\,1/2)} \frac{1}{r}\cdot O\left( \frac{\mathrm{polylog}(\lfloor 9/r \rfloor)}{\lfloor 1/r \rfloor} \right), \ 2 \right\} = \sup_{r\in(r_0,\,1/2)} O\left( \mathrm{polylog}(\lfloor 1/r \rfloor) \right) = O\left( \mathrm{polylog}\frac{1}{r_0} \right). \]
We conclude with the following corollary specifying a necessary condition for active learning with exponential label complexity speedup.
Corollary 4.26 (necessary condition) Let $\theta(r_0)$ be the disagreement coefficient for the learning problem $(\mathcal{F}, P)$. If $(\mathcal{F}, P)$ has a passive learning sample complexity of $\Omega(1/\epsilon)$ and it is actively learnable by CAL with exponential label complexity speedup, then
\[ \theta(r_0) \le O\left( \mathrm{polylog}\frac{1}{r_0} \right). \]
Proof Assume that the learning problem $(\mathcal{F}, P)$ is actively learnable with exponential label complexity speedup. Then according to Theorem 4.8, the rejection rate of CSS is $O(\mathrm{polylog}(m)/m)$. Therefore there is a coverage bound for CSS satisfying
\[ B_\Phi(\mathcal{F}, \delta, m) = 1 - O\left( \frac{\mathrm{polylog}(m)}{m} \right). \]
Applying Corollary 4.23 we get that
\[ \theta(r_0) \le \max\left\{ \sup_{r\in(r_0,\,1/2)} 7\cdot\frac{1 - B_\Phi(\mathcal{F}, 1/9, \lfloor 1/r \rfloor)}{r}, \ 2 \right\} \le \max\left\{ \sup_{r\in(r_0,\,1/2)} \frac{\mathrm{polylog}(\lfloor 1/r \rfloor)}{\lfloor 1/r \rfloor\cdot r}, \ 2 \right\} = O\left( \mathrm{polylog}\frac{1}{r_0} \right). \]
4.7 Agnostic active learning label complexity bounds
So far we were able to upper bound the disagreement coefficient by the characterizing set complexity. We note that the disagreement coefficient depends only on the
hypothesis class F and the marginal distribution P (X). It does not depend on the
conditional distribution P (Y |X) at all. Therefore, although the upper bound was
derived using realizable selective classification results, it is applicable for the agnostic case as well! In this section we derive label complexity bounds for the well
known agnostic active learning algorithms A2 [8] and RobustCALδ [50], using the
characterizing set complexity. It should be emphasized that no explicit label complexity bounds were known for these well studied algorithms except for a number
of simple settings (see discussion below).
4.7.1 Label complexity bound for A2
A2 (Agnostic Active) was the first general-purpose agnostic active learning algorithm with proven improvement in error guarantees compared to passive learning.
It has been proven that this algorithm, originally introduced by Balcan et al. [8], achieves exponential label complexity speedup (for the low accuracy regime) compared to passive learning for a few simple cases including threshold functions and homogeneous linear separators under the uniform distribution over the sphere [9]. In
this section we substantially extend these results and prove that exponential label
complexity speedup (for the low accuracy regime) can be accomplished also for
linear classifiers under a fixed mixture of Gaussians.
Let η be the error of the best hypothesis in F, namely η , R(f ∗ ).
Theorem 4.27 ([46, Theorem 2]) Let F be an hypothesis class with VC dimension d. If θ(r0 ) is the disagreement coefficient for F, then with probability of at
least 1 − δ, given the inputs F, ǫ, and δ, A2 outputs fˆ ∈ F with
R(fˆ) ≤ η + ǫ,
and the number of label requests made by A2 is at most
\[ O\left( \theta(\eta+\epsilon)\left( \frac{\eta^2}{\epsilon^2} + 1 \right)\left( d\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right)\log\frac{1}{\epsilon} \right). \]
Theorem 4.28 Let $\mathcal{F}$ be the class of linear classifiers in $\mathbb{R}^d$ and $P$ a mixture of a fixed number of Gaussians. For any $c > 0$, $\epsilon < \frac{1}{2}$, and $\epsilon \ge \frac{\eta}{c}$, algorithm $A^2$ makes
\[ O\left( \mathrm{polylog}\left( \frac{1}{\epsilon} \right)\left( d + 1 + \log\frac{1}{\delta} \right) \right) \]
label requests on examples drawn i.i.d. from $P$, with probability $1 - \delta$.
Proof Using Corollary 4.24 we get
\[ \theta(\eta + \epsilon) \le O\left( \mathrm{polylog}\frac{1}{\eta+\epsilon} \right) \le O\left( \mathrm{polylog}\frac{1}{\epsilon} \right). \]
We note that the VC dimension of the class of linear classifiers in $\mathbb{R}^d$ is $d+1$ and that $\epsilon \ge \eta/c$ by assumption. Applying Theorem 4.27 we get that the number of label requests made by $A^2$ is bounded by
\begin{align*}
& O\left( \mathrm{polylog}\left( \frac{1}{\epsilon} \right)\left( \frac{\eta^2}{\epsilon^2} + 1 \right)\left( (d+1)\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right)\log\frac{1}{\epsilon} \right) \\
&\le O\left( \mathrm{polylog}\left( \frac{1}{\epsilon} \right)\left( c^2 + 1 \right)\left( (d+1)\log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right)\log\frac{1}{\epsilon} \right) = O\left( \mathrm{polylog}\left( \frac{1}{\epsilon} \right)\left( d + 1 + \log\frac{1}{\delta} \right) \right).
\end{align*}
4.7.2 Label complexity bound for RobustCALδ
Motivated by the CAL algorithm, Hanneke introduced a new agnostic active learning algorithm called RobustCALδ [51]. RobustCALδ exhibits asymptotic label
complexity speedup for some favorable distributions that comply with the following condition.
Condition 4.29 ([50]) For some a ∈ [1, ∞) and α ∈ [0, 1], for every f ∈ F,
\[ \Pr\left( f(X) \ne f^*(X) \right) \le a\left( R(f) - R(f^*) \right)^\alpha. \]
Theorem 4.30 ([50, Theorem 5.4]) For any δ ∈ (0, 1), RobustCALδ achieves a label complexity Λ such that, for any distribution P , for a and α as in Condition 4.29,
∀ǫ ∈ (0, 1),
\[ \Lambda \le a^2\,\theta(a\epsilon^\alpha)\left( \frac{1}{\epsilon} \right)^{2-2\alpha}\left( d\log\left( \theta(a\epsilon^\alpha) \right) + \log\frac{\log(a/\epsilon)}{\delta} \right)\log\frac{1}{\epsilon}. \]
Theorem 4.31 Let F be the class of linear classifiers in Rd and P a mixture of
a fixed number of Gaussians satisfying Condition 4.29 with α = 1. Then for any
ǫ < 1, RobustCALδ makes
\[ O\left( \mathrm{polylog}\left( \frac{1}{\epsilon} \right)\left( d + 1 + \log\frac{\log(a/\epsilon)}{\delta} \right) \right) \]
label requests on examples drawn i.i.d. from P , with probability 1 − δ.
Proof Using Corollary 4.24 and noting that a ≥ 1, α = 1 we get
\[ \theta(a\epsilon^\alpha) \le O\left( \mathrm{polylog}\frac{1}{a\epsilon} \right) \le O\left( \mathrm{polylog}\frac{1}{\epsilon} \right). \]
We note that the VC dimension of the class of linear classifiers in Rd is d + 1.
Applying Theorem 4.30 completes the proof.
Theorem 4.31 proves an exponential label complexity speedup compared to passive
learning, for which there is a lower bound on label complexity of Ω(1/ǫ) [50].
Remark 4.32 Condition 4.29 in Theorem 4.31 can be satisfied with $\alpha = 1$ if the Bayes optimal classifier is linear and the source distribution satisfies the Massart noise condition [68],
\[ \Pr\left( \left| P(Y = 1 \mid X = x) - 1/2 \right| < 1/(2a) \right) = 0. \]
For example, if the data was generated by some unknown linear hypothesis with label noise (the probability of flipping any label) of up to $(a-1)/(2a)$, then $P$ satisfies the requirements of Theorem 4.31.
Chapter 5
Agnostic Selective Classification
"The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge."
Stephen Hawking, 1942 - present
In Chapter 3, which considered a sterile, noiseless (realizable) setting, we have
seen that perfect classification with guaranteed coverage is achievable for finite
hypothesis spaces for any source distribution. Furthermore, data-dependent and
distribution-dependent coverage bounds are available for infinite hypothesis spaces.
It will come as no surprise that in the agnostic case, where noise is present, perfect
classification is impossible. In general, in the worst case no hypothesis can achieve
zero error over any nonempty subset of the domain. Therefore our goal is to find
a pointwise competitive selective classifier (see Definition 1.5). In this chapter we
show that pointwise competitiveness is achievable, under reasonable conditions,
by a learning strategy termed low error selective strategy (LESS), which naturally
extends CSS to noisy environments. We derive coverage bounds for LESS and
show that the characterizing set complexity can be effectively used in this case as
well.
5.1 Definitions
Definition 5.1 (low error set) For any hypothesis class F, target hypothesis f ∈
F, distribution P , sample Sm , and real r > 0, define
\[ \mathcal{V}(f, r) \triangleq \left\{ f' \in \mathcal{F} : R(f') \le R(f) + r \right\} \]
and
\[ \hat{\mathcal{V}}(f, r) \triangleq \left\{ f' \in \mathcal{F} : \hat{R}(f') \le \hat{R}(f) + r \right\}, \]
where R(f ′ ) is the true risk of hypothesis f ′ with respect to source distribution P ,
and R̂(f ′ ) is the empirical risk with respect to the sample Sm .
Definition 5.2 (ball in F [48]) For any f ∈ F we define a ball in F of radius r
around f . Specifically, with respect to class F, marginal distribution P over X ,
f ∈ F, and real r > 0, define
\[ B(f, r) \triangleq \left\{ f' \in \mathcal{F} : \Pr_{X\sim P}\left\{ f'(X) \ne f(X) \right\} \le r \right\}. \]
5.2 Low Error Selective Strategy (LESS)
We now present a strategy that will be shown later to achieve non-trivial pointwise
competitive selective classification under certain conditions. We call it a “strategy”
rather than an “algorithm” because it does not include implementation details.
We begin with some motivation. Using standard concentration inequalities one
can show that the training error of the true risk minimizer, f ∗ , cannot be “too far”
from the training error of the empirical risk minimizer, fˆ. Therefore, we can guarantee, with high probability, that the subset of all hypotheses with “sufficiently low”
empirical error includes the true risk minimizer f ∗ . Selecting only a part of the domain, for which all hypotheses in that subset agree, is then sufficient to guarantee
pointwise competitiveness. Algorithm 1 formulates this idea. In the next section
we analyze this strategy and show that it achieves pointwise competitiveness with
non trivial (bounded) coverage.
Algorithm 1 Low Error Selective Strategy (LESS)
Require: Sm , m, δ, d
Ensure: a pointwise competitive selective classifier (f, g) with probability 1 − δ
Set fˆ = ERM (F, Sm ), i.e., fˆ is any empirical risk minimizer from F
Set $G = \hat{\mathcal{V}}\left( \hat{f},\ 4\sqrt{\frac{2}{m}\left( d\ln\frac{2me}{d} + \ln\frac{8}{\delta} \right)} \right)$
Construct g such that g(x) = 1 ⇐⇒ x ∈ {X \ DIS (G)}
f = fˆ
5.3 Risk and coverage bounds
To facilitate the discussion we need a few definitions. Consider an instance of
a binary learning problem with hypothesis class F, an underlying distribution P
over X × Y, and a loss function ℓ(Y, Y). Let f ∗ = arg minf ∈F {Eℓ(f (X), Y )}
be the true risk minimizer.
Definition 5.3 (excess loss class) Let F be an hypothesis class. The associated
excess loss class [12] is defined as
H , {ℓ(f (x), y) − ℓ(f ∗ (x), y) : f ∈ F} .
Definition 5.4 (Bernstein class [12]) Class H is said to be a (β, B)-Bernstein
class with respect to P (where 0 < β ≤ 1 and B ≥ 1), if every h ∈ H satisfies
Eh2 ≤ B(Eh)β .
Bernstein classes arise in many natural situations; see discussions in [62, 11]. For
example, if the probability P (X, Y ) satisfies Tsybakov’s noise conditions then the
excess loss function is a Bernstein [11, 80] class. In the following sequence of
lemmas and theorems we assume a binary hypothesis class F with VC-dimension
d, an underlying distribution P over X × {±1}, and that ℓ is the 0/1 loss function.
Also, H denotes the associated excess loss class. Our results can be extended to
loss functions other than 0/1 by similar techniques to those used in [16].
In Figure 5.1 we schematically depict the hypothesis class F (the gray area),
the target hypothesis (filled black circle outside F), and the best hypothesis in the
class f ∗ . The distance of two points in the diagram relates to the distance between
two hypothesis under the marginal distribution P (X). Our first observation is that
if the excess loss class is (β, B)-Bernstein class, then the set of low true error
(depicted in Figure 5.1 (a)) resides within a larger ball centered around f ∗ (see
Figure 5.1 (b)).
Figure 5.1: The set of low true error (a) resides within a ball around f ∗ (b).
Lemma 5.1 If H is a (β, B)-Bernstein class with respect to P , then for any r > 0
\[ \mathcal{V}(f^*, r) \subseteq B\left( f^*, Br^\beta \right). \]
Proof If $f \in \mathcal{V}(f^*, r)$, then, by definition,
\[ \mathbb{E}\{ I(f(X) \ne Y) \} \le \mathbb{E}\{ I(f^*(X) \ne Y) \} + r. \]
Using linearity of expectation we have,
\[ \mathbb{E}\{ I(f(X) \ne Y) - I(f^*(X) \ne Y) \} \le r. \tag{5.1} \]
Since $\mathcal{H}$ is a $(\beta, B)$-Bernstein class and $\ell$ is the 0/1 loss function,
\begin{align*}
\mathbb{E}\{ I(f(X) \ne f^*(X)) \} &= \mathbb{E}\{ |I(f(X) \ne Y) - I(f^*(X) \ne Y)| \} \\
&= \mathbb{E}\left\{ \left( \ell(f(X), Y) - \ell(f^*(X), Y) \right)^2 \right\} \\
&= \mathbb{E}h^2 \le B(\mathbb{E}h)^\beta \\
&= B\left( \mathbb{E}\{ I(f(X) \ne Y) - I(f^*(X) \ne Y) \} \right)^\beta.
\end{align*}
By (5.1), for any $r > 0$,
\[ \mathbb{E}\{ I(f(X) \ne f^*(X)) \} \le Br^\beta. \]
Therefore, by definition, $f \in B\left( f^*, Br^\beta \right)$.
So far we have seen that the set of low true error resides within a ball around f ∗ .
Now we would like to prove that, with high probability the set of low empirical
error (depicted in Figure 5.2 (a)) resides within the set of low true error (see Figure 5.2 (b)). We emphasize that the distance between hypotheses in Figure 5.2 (a)
is based on the empirical error, while the distance in Figure 5.2 (b) is based on the
true error.
Figure 5.2: The set of low empirical error (a) resides within the set of low true
error (b).
Throughout this section we denote
\[ \sigma(m, \delta, d) \triangleq 2\sqrt{\frac{2}{m}\left( d\ln\frac{2me}{d} + \ln\frac{2}{\delta} \right)}. \]
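For later reference, here is a direct numerical evaluation (ours) of σ(m, δ, d) as defined above; it is the quantity that LESS doubles (with confidence δ/4) to form its low-empirical-error set G.

    import math

    def sigma(m, delta, d):
        """sigma(m, delta, d) = 2 * sqrt( (2/m) * (d ln(2em/d) + ln(2/delta)) )."""
        return 2.0 * math.sqrt(2.0 * (d * math.log(2.0 * math.e * m / d) + math.log(2.0 / delta)) / m)

    print(sigma(m=10_000, delta=0.05, d=10))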
The following is a classical result in statistical learning theory.
Theorem 5.2 ([20]) For any 0 < δ < 1, with probability of at least 1 − δ over the
choice of Sm from P m , any hypothesis f ∈ F satisfies
R(f ) ≤ R̂(f ) + σ(m, δ, d).
Similarly, R̂(f ) ≤ R(f ) + σ(m, δ, d) under the same conditions.
Lemma 5.3 For any r > 0, and 0 < δ < 1, with probability of at least 1 − δ,
\[ \hat{\mathcal{V}}(\hat{f}, r) \subseteq \mathcal{V}\left( f^*, 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right). \]
Proof If $f \in \hat{\mathcal{V}}(\hat{f}, r)$, then, by definition,
\[ \hat{R}(f) \le \hat{R}(\hat{f}) + r. \]
Since $\hat{f}$ minimizes the empirical error, we know that $\hat{R}(\hat{f}) \le \hat{R}(f^*)$. Using Theorem 5.2 twice, and applying the union bound, we see that with probability of at least $1 - \delta$,
\[ R(f) \le \hat{R}(f) + \sigma(m, \delta/2, d) \quad\wedge\quad \hat{R}(f^*) \le R(f^*) + \sigma(m, \delta/2, d). \]
Therefore,
\[ R(f) \le R(f^*) + 2\sigma\left( m, \frac{\delta}{2}, d \right) + r, \]
and
\[ f \in \mathcal{V}\left( f^*, 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right). \]
We have shown that, with high probability, the set of low empirical error is a
subset of a ball around f ∗ . Therefore, the probability that at least two hypotheses
in the set of low empirical error will disagree with each other is bounded by the
probability that at least two hypotheses in the ball around f ∗ will disagree with
each other. Luckily, the latter is bounded by a complexity measure termed the
disagreement coefficient (see Definition 4.10).
We recall that for any G ⊆ F, and distribution P we have defined (see Definition 4.9) ∆G , Pr {DIS(G)}, and that θ(r0 ) is the disagreement coefficient with
parameter r0 .
Theorem 5.4 Let F be an hypothesis class and assume that H is a (β, B)-Bernstein
class w.r.t. P . Then, for any r > 0 and 0 < δ < 1, with probability of at least
1 − δ,
\[ \Delta\hat{\mathcal{V}}(\hat{f}, r) \le B\cdot\left( 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right)^\beta\cdot\theta(r_0), \]
where $\theta(r_0)$ is the disagreement coefficient of $\mathcal{F}$ with respect to $P$ and $r_0 = \left( \sigma(m, \delta/2, d) \right)^\beta$.
Proof Applying Lemmas 5.3 and 5.1 we get that with probability of at least $1 - \delta$,
\[ \hat{\mathcal{V}}(\hat{f}, r) \subseteq B\left( f^*, B\left( 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right)^\beta \right). \]
Therefore,
\[ \Delta\hat{\mathcal{V}}(\hat{f}, r) \le \Delta B\left( f^*, B\left( 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right)^\beta \right). \]
By the definition of the disagreement coefficient, for any $r' > r_0$, $\Delta B(f^*, r') \le \theta(r_0)r'$. Noting that
\[ r' = B\left( 2\sigma\left( m, \frac{\delta}{2}, d \right) + r \right)^\beta > \left( \sigma\left( m, \frac{\delta}{2}, d \right) \right)^\beta = r_0 \]
completes the proof.
Theorem 5.5 Let F be an hypothesis class and assume that H is a (β, B)-Bernstein
class w.r.t. P . Let (f, g) be the selective classifier chosen by LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive selective classifier and
\[ \Phi(f, g) \ge 1 - B\cdot\left( 4\sigma\left( m, \frac{\delta}{4}, d \right) \right)^\beta\cdot\theta(r_0), \]
where $\theta(r_0)$ is the disagreement coefficient of $\mathcal{F}$ with respect to $P$, and $r_0 = \left( \sigma(m, \delta/4, d) \right)^\beta$.
Proof Applying Theorem 5.2 we get that with probability of at least 1 − δ/4,
\[ \hat{R}(f^*) \le R(f^*) + \sigma\left( m, \frac{\delta}{4}, d \right). \]
Since $f^*$ minimizes the true error, we get that
\[ R(f^*) \le R(\hat{f}). \]
Applying again Theorem 5.2, we learn that with probability of at least $1 - \delta/4$,
\[ R(\hat{f}) \le \hat{R}(\hat{f}) + \sigma\left( m, \frac{\delta}{4}, d \right). \]
Using the union bound, it follows that with probability of at least $1 - \delta/2$,
\[ \hat{R}(f^*) \le \hat{R}(\hat{f}) + 2\sigma\left( m, \frac{\delta}{4}, d \right). \]
Hence, with probability of at least $1 - \delta/2$,
\[ f^* \in \hat{\mathcal{V}}\left( \hat{f}, 2\sigma\left( m, \frac{\delta}{4}, d \right) \right) = G. \]
We note that the selection function $g(x)$ equals one only for $x \in \mathcal{X} \setminus DIS(G)$. Therefore, for any $x \in \mathcal{X}$ for which $g(x) = 1$, all the hypotheses in $G$ agree, and in particular $f^*$ and $\hat{f}$ agree. Thus $(f, g)$ is pointwise competitive.
Applications of Theorem 5.4 and the union bound entail that with probability of at least $1 - \delta$,
\[ \Phi(\hat{f}, g) = \mathbb{E}\{g(X)\} = 1 - \Delta G \ge 1 - B\cdot\left( 4\sigma\left( m, \frac{\delta}{4}, d \right) \right)^\beta\cdot\theta(r_0), \]
where $\theta(r_0)$ is the disagreement coefficient of $\mathcal{F}$ with respect to $P$ and $r_0 = \left( \sigma(m, \delta/4, d) \right)^\beta$.
Lemma 5.6 If r1 < r2 , then θ(r1 ) ≥ θ(r2 ).
Proof Implied directly from the definition of a supremum.
Corollary 5.7 Assume that F has disagreement coefficient θ(0) and that H is a
(β, B)-Bernstein class w.r.t. P . Let (f, g) be the selective classifier chosen by
LESS. Then, with probability of at least 1 − δ, (f, g) is a pointwise competitive
selective classifier, and
\[ \Phi(f, g) \ge 1 - B\cdot\left( 4\sigma\left( m, \frac{\delta}{4}, d \right) \right)^\beta\cdot\theta(0). \]
Proof Application of Theorem 5.5 together with Lemma 5.6 completes the proof.
The disagreement coefficient $\theta(0)$ has been shown to be finite for different hypothesis classes and source distributions including: thresholds in $\mathbb{R}$ under any distribution ($\theta(0) = 2$) [48], linear separators through the origin in $\mathbb{R}^d$ under the uniform distribution on the sphere ($\theta(0) \le \sqrt{d}$) [48], and linear separators in $\mathbb{R}^d$ under a smooth data distribution bounded away from zero ($\theta(0) \le c(f^*)d$, where $c(f^*)$ is an unknown constant that depends on the target hypothesis) [38]. For these cases an application of Corollary 5.7 is sufficient to guarantee a pointwise competitive solution with bounded coverage that converges to one.
Unfortunately for many hypothesis classes and distributions the disagreement
coefficient θ(0) is infinite [48]. Luckily, if the disagreement coefficient θ(r0 ) grows
slowly with respect to 1/r0 (as shown by Wang [87], under sufficient smoothness
conditions), Theorem 5.5 is enough to guarantee a pointwise competitive solution.
Theorem 5.8 (fast coverage rates) Assume that F has disagreement coefficient
\[ \theta(r_0) = O\left( \mathrm{polylog}\frac{1}{r_0} \right) \]
w.r.t. distribution $P$, and that $\mathcal{H}$ is a $(\beta, B)$-Bernstein class w.r.t. the same distribution. Let $(f, g)$ be the selective classifier chosen by LESS. Then, with probability of at least $1 - \delta$, $(f, g)$ is a pointwise competitive selective classifier and
\[ \Phi(f, g) \ge 1 - B\cdot O\left( \left( \frac{\mathrm{polylog}(m)}{m}\cdot\log\frac{1}{\delta} \right)^{\beta/2} \right). \]
Proof Application of Theorem 5.5 together with the assumption completes the
proof.
Theorem 5.9 Assume that F has characterizing set complexity
γ(F, n̂) = O (polylog(m))
w.r.t. distribution P , and that H is a (β, B)-Bernstein class w.r.t. the same distribution. Let (f, g) be the selective classifier chosen by LESS. Then, with probability
of at least 1 − δ, (f, g) is a pointwise competitive selective classifier, and
\[ \Phi(f, g) \ge 1 - B\cdot O\left( \left( \frac{\mathrm{polylog}(m)}{m}\cdot\log\frac{1}{\delta} \right)^{\beta/2} \right). \]
Proof We note that the disagreement coefficient depends only on the hypothesis
class F and the marginal distribution P (X). Therefore, we can use Theorem 4.25
to bound the disagreement coefficient. A direct application of Theorem 5.8 completes the proof.
5.4 Disbelief principle
Theorem 5.5 tells us that LESS not only outputs a pointwise competitive selective
classifier, but the resulting classifier also has guaranteed coverage (under some
conditions). As emphasized in [31], in practical applications it is desirable to allow
for some control over the trade-off between risk and coverage; in other words, we
would like to be able to develop the entire risk-coverage curve for the classifier at
hand and select ourselves the cutoff point along this curve in accordance with other
practical considerations we may have. How can this be achieved?
The following lemma facilitates a construction of a risk-coverage trade-off
curve. The result is an alternative characterization of the selection function g, of
the pointwise optimal selective classifier chosen by LESS. This result allows for
calculating the value of g(x), for any individual test point x ∈ X , without actually constructing g for the entire domain X . This is an extension of the “lazy”
implementation method, discussed in Section 3.4, to the agnostic case.
Lemma 5.10 Let (f, g) be a selective classifier chosen by LESS after observing
the training sample Sm . Let fˆ be the empirical risk minimizer over Sm . Let x be
any point in X and
\[ \tilde{f}_x \triangleq \operatorname*{argmin}_{f\in\mathcal{F}}\left\{ \hat{R}(f) \;\middle|\; f(x) = -\mathrm{sign}\,\hat{f}(x) \right\}, \]
an empirical risk minimizer forced to label $x$ the opposite from $\hat{f}(x)$. Then
\[ g(x) = 0 \iff \hat{R}(\tilde{f}_x) - \hat{R}(\hat{f}) \le 2\sigma\left( m, \frac{\delta}{4}, d \right). \]
Proof According to the definition of $\hat{\mathcal{V}}$ (see Definition 5.1),
\[ \hat{R}(\tilde{f}_x) - \hat{R}(\hat{f}) \le 2\sigma\left( m, \frac{\delta}{4}, d \right) \iff \tilde{f}_x \in \hat{\mathcal{V}}\left( \hat{f}, 2\sigma\left( m, \frac{\delta}{4}, d \right) \right). \]
Thus, $\hat{f}, \tilde{f}_x \in \hat{\mathcal{V}}$. However, by construction, $\hat{f}(x) = -\tilde{f}_x(x)$, so $x \in DIS(\hat{\mathcal{V}})$ and $g(x) = 0$.
Lemma 5.10 tells us that in order to decide if point x should be rejected we
need to measure the empirical error R̂(fex ) of a special empirical risk minimizer, fex ,
which is constrained to label $x$ the opposite from $\hat{f}(x)$. If this error is sufficiently close to $\hat{R}(\hat{f})$, our classifier cannot be too sure about the label of $x$ and we must
reject it. Figure 5.3 illustrates this principle for a 2-dimensional example. The
hypothesis class is the class of linear classifiers in R2 and the source distribution
is two normal distributions. Negative samples are represented by blue circles and
positive samples by red squares. As usual, fˆ denotes the empirical risk minimizer.
Let us assume that we want to classify point x1 . This point is classified positive
by fˆ. Therefore, we force this point to be negative and calculate the restricted
ERM (depicted by the dotted line marked f̃_{x1}). The difference between the empirical
risk of fˆ and f̃_{x1} is not large enough, so point x1 will be rejected. However, if we
want to classify point x2, the difference between the empirical risk of fˆ and f̃_{x2} is
quite large and the point will be classified as positive. The above result strongly
motivates the following definition of a “disbelief index” for each individual point.
Figure 5.3: Constrained ERM.
Definition 5.5 (disbelief index) For any x ∈ X , define its disbelief index w.r.t. Sm
and F,
D(x) ≜ D(x, Sm) ≜ R̂(f̃_x) − R̂(fˆ).
Observe that D(x) is large whenever our model is sensitive to the label of x in
the sense that when we are forced to bend our best model to fit the opposite label
of x, our model substantially deteriorates, giving rise to a large disbelief index.
This large D(x) can be interpreted as our disbelief in the possibility that x can be
labeled so differently. In this case we should definitely predict the label of x using
our unforced model. Conversely, if D(x) is small, our model is indifferent to the
label of x and in this sense, is not committed to its label. In this case we should
abstain from prediction at x.
This “disbelief principle” facilitates an exploration of the risk-coverage tradeoff curve for our classifier. Given a pool of test points we can rank these test points
according to their disbelief index, and points with low index should be rejected
first. Thus, this ranking provides the means for constructing a risk-coverage tradeoff curve.
A similar technique of using an ERM oracle that can enforce an arbitrary number of example-based constraints was used in [28, 17] in the context of active learning. As in our disbelief index, the difference between the empirical risk (or importance weighted empirical risk [17]) of two ERM oracles (with different constraints)
is used to estimate prediction confidence.
5.5 Implementation
In this section we temporarily switch from theory to practice, aiming to implement rejection methods inspired by the disbelief principle and to see how well
they work on real-world problems. Attempting to implement a learning algorithm
driven by the disbelief index, we face a major bottleneck because the calculation
of the index requires the identification of ERM hypotheses. To handle this computationally difficult problem, we heuristically “approximate” the ERM of linear
classifiers using support vector machines (SVMs) with soft margins [25]. SVM
with soft margin chooses a hyperplane that splits the training sample as cleanly as
possible, while still maximizing the distance to the cleanly split examples. The
penalty parameter of the error term, C, controls the trade-off between misclassification and margin size. The higher the value of C, the higher the misclassification
penalty. In our implementation we use a high C value (10⁵ in our experiments) to
penalize training errors more heavily than small margins. In this way the solution to
the optimization problem tends to be closer to the ERM.
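The following is a minimal sketch (not part of the thesis) of this ERM approximation using scikit-learn’s LIBSVM-based SVC; the helper names approximate_erm and empirical_risk are ours, and labels are assumed to be in {−1, +1}.

```python
import numpy as np
from sklearn.svm import SVC

def approximate_erm(X, y, C=1e5):
    """Approximate the ERM over linear classifiers with a high-penalty soft-margin SVM.

    With a very large C, training errors dominate the objective, so the returned
    hyperplane is close to an empirical risk minimizer.
    """
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    return clf

def empirical_risk(clf, X, y):
    """0/1 empirical risk of a fitted classifier on the sample (X, y)."""
    return np.mean(clf.predict(X) != y)
```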
Another problem we face is that the disbelief index is a noisy statistic that
depends strongly on the sample Sm. To overcome this noise we robustify the statistic as follows. First we generate eleven different samples (S_m^1, S_m^2, . . . , S_m^{11}) using
bootstrap sampling. For each sample we calculate the disbelief index for all test
points, and for each point we take the median of these measurements as the final index.
We note that for any finite training sample the disbelief index is a discrete variable. It is often the case that several test points share the same disbelief index. In
those cases we can use any confidence measure as a tie breaker. In our experiments
we use distance from the decision boundary to break ties.
In order to estimate R̂(f̃_x) we have to restrict the SVM optimizer to consider only hypotheses that classify the point x in a specific way. To accomplish this
we use a weighted SVM for unbalanced data [21]. We add the point x as another
training point with weight 10 times larger than the weight of all training points combined. Thus, the penalty for misclassification of x is very large and the optimizer
finds a solution that does not violate the constraint.
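Below is a hedged sketch of the full disbelief-index computation, combining the weighted constraint and the bootstrap-median robustification described above. It assumes labels in {−1, +1}; the function names are ours, and the weight-based constraint is only an approximation of the constrained ERM.

```python
import numpy as np
from sklearn.svm import SVC

def disbelief_index(X, y, x, C=1e5):
    """D(x): empirical-risk gap between the constrained ERM and the unconstrained ERM."""
    erm = SVC(kernel="linear", C=C).fit(X, y)
    risk_erm = np.mean(erm.predict(X) != y)        # R_hat(f_hat)
    forced = -erm.predict(x.reshape(1, -1))[0]     # force the opposite label at x
    X_aug = np.vstack([X, x])
    y_aug = np.append(y, forced)
    # weight on x is 10 times the total weight of all training points combined
    w_aug = np.append(np.ones(len(y)), 10.0 * len(y))
    constrained = SVC(kernel="linear", C=C).fit(X_aug, y_aug, sample_weight=w_aug)
    risk_constrained = np.mean(constrained.predict(X) != y)  # evaluated on Sm only
    return risk_constrained - risk_erm

def robust_disbelief_index(X, y, x, n_boot=11, seed=0):
    """Median of the disbelief index over bootstrap resamples of the training set."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        # a resample may occasionally contain a single class; a real
        # implementation should re-draw in that case
        scores.append(disbelief_index(X[idx], y[idx], x))
    return np.median(scores)
```

Test points can then be ranked by this robustified index, rejecting those with the lowest values first, with distance from the decision boundary used to break ties.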
5.6 Empirical results
LESS leads to rejection regions that are fundamentally different from those obtained by the traditional distance-based techniques for rejection. To illustrate this
point we start with two synthetic 2-dimensional datasets. We compare those regions to the ones based on distance from the decision boundary. This latter approach
is very common in practical applications of selective classification. We then extend
our results to standard medical diagnosis problems from the UCI repository. Focusing on both SVM with linear kernel and SVM with RBF kernel, we analyze the
RC (Risk-Coverage) curves achievable for those datasets. These experiments were
conducted using the LIBSVM package [21].
The first 2D source distribution we analyze is a mixture of two identical normal
distributions (centered at different locations). In Figure 5.4 we depict the rejection
regions for a training sample of 150 points sampled from this distribution. The
height map reflects the “confidence regions” of each technique according to its
own confidence measure.

Figure 5.4: Linear classifier. Confidence height map using (a) disbelief index; (b) distance from decision boundary.
The second 2D source distribution we analyze is a bit more interesting. X is
distributed uniformly over [0, 3π] × [−2, 2] and the labels are sampled according
to the following conditional distribution:
P(Y = 1 | X = (x1, x2)) ≜ 0.95 if x2 ≥ sin(x1), and 0.05 otherwise.
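For concreteness, the following short sketch (our own helper, not thesis code) samples labeled points from this source:

```python
import numpy as np

def sample_sine_source(n, seed=0):
    """X ~ Uniform([0, 3*pi] x [-2, 2]); P(Y=1|x) = 0.95 above the sine curve, 0.05 below."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 3.0 * np.pi, size=n)
    x2 = rng.uniform(-2.0, 2.0, size=n)
    p_pos = np.where(x2 >= np.sin(x1), 0.95, 0.05)
    y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
    return np.column_stack([x1, x2]), y
```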
In Figure 5.5 we depict the rejection regions for a training sample of 50 points
sampled from this distribution (averaged over 100 iterations). The hypothesis class
used for training was SVM with polynomial kernel (of degree 5). Here again, the
height map reflects the “confidence regions” of each technique according to its
own confidence measure. The thick red line depicts the decision boundary of the
Bayes classifier. The qualitative difference in confidence regions that is clearly
evident in Figure 5.4 also results in quantifiably different RC performance curves
as depicted in Figure 5.6.

Figure 5.5: SVM with polynomial kernel. Confidence height map using (a) disbelief index; (b) distance from decision boundary.

Figure 5.6: RC curve of our technique (depicted in red) compared to rejection based on distance from decision boundary (depicted in dashed green line). The RC curve in the right figure zooms into the lower coverage regions of the left curve.

The RC curve generated by LESS is depicted in red and
the RC curve generated by distance from decision boundary is depicted as a dashed
green line. The right graph is a zoomed-in section of the entire RC curve (depicted
in the left graph). The dashed horizontal line is the test error of f∗ on the entire
domain and the dotted line is the Bayes error. While for high coverage values the
two techniques are statistically indistinguishable, for any coverage less than 60%
we get a significant advantage for LESS. It is clear that in this case not only is the
estimation error reduced, but the test error also drops significantly below the
optimal test error of f∗ for low coverage values.
We also tested our algorithm on standard medical diagnosis problems from
the UCI repository, including all datasets used in [44]. We transformed nominal
features to numerical ones in a standard way using binary indicator attributes. We
also normalized each attribute independently so that its dynamic range is [0, 1]. No
other preprocessing was employed.
In each iteration we choose, uniformly at random, non-overlapping training
(100 samples) and test (200 samples) sets for each dataset.¹ The SVM was trained
on the entire training set, and test samples were sorted according to confidence (either using distance from decision boundary or disbelief index). Figure 5.7 depicts
the RC curves of our technique (red solid line) and rejection based on distance
from decision boundary (green dashed line) for linear kernel on all 6 datasets. All
results are averaged over 500 iterations (error bars show standard error).
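The RC curves reported below can be computed from any confidence ranking with a few lines of code; the following sketch (the helper name rc_curve is ours) averages the test error over the most confident points at every coverage level.

```python
import numpy as np

def rc_curve(confidence, errors):
    """Risk-coverage curve: test error over the k most confident points, for k = 1..n."""
    order = np.argsort(-confidence)               # most confident first
    sorted_errors = errors[order].astype(float)   # 0/1 errors of the sorted points
    k = np.arange(1, len(errors) + 1)
    coverage = k / len(errors)
    risk = np.cumsum(sorted_errors) / k
    return coverage, risk
```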
With the exception of the Hepatitis dataset, on which both methods were statistically indistinguishable, on all other datasets the proposed method exhibits a significant advantage over the traditional approach. We would like to highlight the
performance of the proposed method on the Pima dataset. While the traditional
approach cannot achieve an error of less than 8% for any rejection rate, in our approach
the test error decreases monotonically to zero with the rejection rate. Furthermore, a
clear advantage for our method over a large range of rejection rates is evident in
the Haberman dataset.²
For the sake of fairness, we note that the running time of our algorithm (as presented here) is substantially longer than that of the traditional technique. The performance
of our algorithm can be substantially improved when many unlabeled samples are
available. In this case the rejection function can be evaluated on the unlabeled samples to generate a new “labeled” sample. Then a new rejection classifier can be
trained on this sample.

1. Due to the size of the Hepatitis dataset the test set was limited to 29 samples.
2. The Haberman dataset contains survival data of patients who had undergone surgery for breast cancer. With an estimated 207,090 new cases of breast cancer in the United States during 2010 [77], an improvement of 1% affects the lives of more than 2,000 women.
Figure 5.8 depicts the maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a specific coverage. For
example, let us assume that our method can achieve an error rate of 10% with coverage of 60%, and the distance-based rejection technique achieves the same error with
maximum coverage of 40%. Then the point (0.6, 0.4) will be on the red line. Thus,
if the red line is below the diagonal, then our technique has an advantage over
distance-based rejection, and vice versa. As an example, consider the Haberman
dataset, and observe that regardless of the rejection rate, the distance-based technique
cannot achieve the same error as our technique with coverage lower than 80%.
Figures 5.9 and 5.10 depict the results obtained with RBF kernel. In this case a
statistically significant advantage for our technique was observed for all datasets.
Figure 5.7: RC curves for SVM with linear kernel. Our method in solid red, and
rejection based on distance from decision boundary in dashed green. Horizontal
axis (c) represents coverage.
Figure 5.8: SVM with linear kernel. The maximum coverage for a distance-based
rejection technique that allows the same error rate as our method with a specific
coverage.
Figure 5.9: RC curves for SVM with RBF kernel. Our method in solid red and
rejection based on distance from decision boundary in dashed green.
Figure 5.10: SVM with RBF kernel. The maximum coverage for a distance-based
rejection technique that allows the same error rate as our method with a specific
coverage.
Chapter 6
Selective Regression
“Ignoring isn’t the same as ignorance, you have to work at it.”
Margaret Atwood, The Handmaid’s Tale
Consider a standard least squares regression problem. Given m input-output training pairs, (x1 , y1 ), . . . , (xm , ym ), we are required to learn a predictor, fˆ ∈ F,
capable of generating accurate output predictions, fˆ(x) ∈ R, for any input x. Assuming that input-output pairs are i.i.d. realizations of some unknown stochastic
source, P (X, Y ), we would like to choose fˆ so as to minimize the standard least
squares risk functional,
R(fˆ) = ∫ (y − fˆ(x))² dP(x, y).
Let f ∗ = argminf ∈F R(f ) be the optimal predictor in hindsight (based on full
knowledge of P ). A classical result in statistical learning is that under certain structural conditions on F and possibly on P , one can learn a regressor that approaches
the average optimal performance, R(f ∗ ), when the sample size, m, approaches
infinity [81].
As in selective classification, in selective regression we allow the possibility
of abstaining from prediction on part of the domain. In agnostic selective classification our goal was to find a pointwise competitive classifier; in other words to
classify exactly as the best hypothesis in the class on the accepted domain. In contrast, in selective regression we are only required to find an ǫ-pointwise competitive
regressor, namely a regressor that predicts values that are ǫ close to the predictions
of the best regressor in the class for any point in the accepted domain. In this chapter we show that, under some conditions, ǫ-pointwise competitiveness is achievable
with monotonically increasing coverage (with m), by a learning strategy termed ǫ-low error selective strategy (ǫ-LESS). ǫ-LESS is, of course, closely related to the
LESS classification strategy, now adapted for predicting real numbers.
As in the cases of LESS (and CSS), the ǫ-LESS strategy might appear to be
out of computational reach because accept/reject decisions require the computation
of a supremum over a very large, and possibly infinite hypothesis subset. However,
here again we show how to compute the strategy for each point of interest using
only two constrained ERM calculations. This useful reduction opens possibilities
for efficient implementations of competitive selective regressors whenever the hypothesis class of interest allows for efficient (constrained) ERM (see Definition 6.6).
For the case of linear least squares regression we utilize known techniques for both
ERM and constrained ERM and derive an exact implementation achieving pointwise
competitive selective regression. The resulting algorithm is efficient and can be
easily implemented using standard matrix operations including (pseudo) inversion.
Finally we present numerical examples over a suite of real-world regression
datasets demonstrating the effectiveness of our methods, and indicating that substantial performance improvements can be gained by using selective regression.
6.1 Definitions
A finite training sample of m labeled examples, Sm ≜ {(xi, yi)}_{i=1}^m ⊆ (X × Y)^m,
is observed, where X is some feature space and Y ⊆ R. Using Sm we are required
to select a regressor fˆ ∈ F, where F is a fixed hypothesis class containing potential
regressors of the form f : X → Y. It is desired that predictions fˆ(x), for unseen
instances x, will be as accurate as possible. We assume that pairs (x, y), including training instances, are sampled i.i.d. from some unknown stochastic source,
P (x, y), defined over X × Y. Given a loss function, ℓ : Y × Y → [0, ∞), we
quantify the prediction quality of any f through its true error or risk,
R(f) ≜ E_{(X,Y)} {ℓ(f(X), Y)} = ∫ ℓ(f(x), y) dP(x, y).
While R(f) is an unknown quantity, we do observe the empirical risk of f, defined as
R̂(f) ≜ (1/m) ∑_{i=1}^m ℓ(f(xi), yi).
As in previous chapters, let fˆ , arg minf ∈F R̂(f ) be the empirical risk minimizer
(ERM), and f ∗ , arg minf ∈F R(f ), the true risk minimizer.
Definition 6.1 (low error set) For any hypothesis class F, target hypothesis f ∈
F, distribution P , sample Sm , and real r > 0, define
V(f, r) ≜ {f′ ∈ F : R(f′) ≤ R(f) + r}    and    V̂(f, r) ≜ {f′ ∈ F : R̂(f′) ≤ R̂(f) + r},
where R(f ′ ) is the true risk of hypothesis f ′ with respect to source distribution P ,
and R̂(f ′ ) is the empirical risk with respect to the sample Sm .
For the sake of brevity, throughout this chapter we often write f instead of f (x),
where f is any regressor.
We also define a (standard) distance metric over the hypothesis class F. For
any probability measure µ on X , let L2 (µ) be the Hilbert space of functions from
X to R, with the inner product defined as
⟨f, g⟩ ≜ E_{µ(x)} f(x)g(x).
The distance function induced by the inner product is
ρ(f, g) ≜ ‖f − g‖ = √⟨f − g, f − g⟩ = √( E_{µ(x)} (f(x) − g(x))² ).
For any f ∈ F we define a ball in F of radius r around f ,
Definition 6.2 (ball in F) Let f ∈ F be a hypothesis and r > 0 be given. Then a
ball of radius r around hypothesis f is defined as
B(f, r) ≜ {f′ ∈ F : ρ(f, f′) ≤ r}.
Finally we define a multiplicative risk bound for regression.
Definition 6.3 (Multiplicative Risk Bounds) Let σδ , σ (m, δ, F) be defined such
that for any 0 < δ < 1, with probability of at least 1 − δ over the choice of Sm
from P m , any hypothesis f ∈ F satisfies
R(f ) ≤ R̂(f ) · σ (m, δ, F) .
Similarly, the reverse bound, R̂(f) ≤ R(f) · σ(m, δ, F), holds under the same
conditions.
Remark 6.1 The purpose of Definition 6.3 is to facilitate the use of any (known)
risk bound as a plug-in component in subsequent derivations. We define σ as a multiplicative bound, which is common in the treatment of unbounded loss functions
such as the squared loss (see discussion by Vapnik in [82], page 993). Instances
of such bounds can be extracted, e.g., from [61] (Theorem 1), and from bounds
discussed in [82]. The entire set of results that follow can also be developed while
relying on additive bounds, which are common when using bounded loss functions.
6.2 ǫ-Low Error Selective Strategy (ǫ-LESS)
The ǫ-LESS strategy follows the same general idea of the LESS classification strategy discussed in previous chapters. Using standard concentration inequalities for
real valued functions (see Definition 6.3) one can show that the training error of
the true risk minimizer, f ∗ , cannot be “too far” from the training error of the empirical risk minimizer, fˆ. Therefore, we can guarantee, with high probability, that
the class of all hypotheses with “sufficiently low” empirical error includes the true
risk minimizer f∗. Selecting only the subset of the domain for which all such hypotheses
predict values up to ǫ from the prediction of the empirical risk minimizer fˆ is then
sufficient to guarantee ǫ-pointwise competitiveness.
Algorithm 2 formulates this idea. In the next section we analyze this strategy
and show that it achieves ǫ-pointwise competitiveness with non-trivial (bounded)
coverage.
Algorithm 2 ǫ-Low Error Selective Strategy (ǫ-LESS)
Require: Sm , m, δ, F, ǫ
Ensure: A selective regressor (fˆ, g) achieving ǫ-pointwise competitiveness
Set fˆ = ERM (F, Sm ) /* fˆ is any empirical risk minimizer from F */
Set G = V̂( fˆ, (σ(m, δ/4, F)² − 1) · R̂(fˆ) )
Construct g such that g(x) = 1 ⇐⇒ ∀f ′ ∈ G |f ′ (x) − fˆ(x)| < ǫ
6.3 Risk and coverage bounds
Lemma 6.2 that follows is based on the proof of Lemma A.12 in [65].
Lemma 6.2 ([65]) Let ℓ : Y × Y → [0, ∞) be the squared loss function and F be a convex hypothesis class. Then, for any f ∈ F,
E_{(x,y)} (f∗(x) − y)(f(x) − f∗(x)) ≥ 0.
Proof By convexity, for any α ∈ [0, 1],
fα , α · f + (1 − α) · f ∗ ∈ F.
Since f ∗ is the best predictor in F, it follows that
R(f ∗ ) ≤ R(fα ) = Eℓ(fα , y) = E(fα − y)2 = E(α · f + (1 − α) · f ∗ − y)2
= E(f ∗ − y + α · (f − f ∗ ))2
= E(f ∗ − y)2 + α2 E(f − f ∗ )2 + 2αE(f ∗ − y)(f − f ∗ )
= R(f ∗ ) + α2 E(f − f ∗ )2 + 2αE(f ∗ − y)(f − f ∗ )
Thus, for any α ∈ [0, 1],
E(f∗ − y)(f − f∗) ≥ −(α/2) · E(f − f∗)²,
and the lemma is obtained by taking the limit α → 0.
Lemma 6.3 Under the same conditions of Lemma 6.2, for any r > 0,
V(f∗, r) ⊆ B(f∗, √r).
Proof If f ∈ V(f ∗ , r), then by definition,
R(f) ≤ R(f∗) + r.        (6.1)
Moreover,
R(f) − R(f∗) = E{ℓ(f, y) − ℓ(f∗, y)} = E{(f − y)² − (f∗ − y)²}
            = E{(f − f∗)² − 2(y − f∗)(f − f∗)}
            = ρ²(f, f∗) + 2E(f∗ − y)(f − f∗).
Applying Lemma 6.2 and Equation (6.1) we get,
ρ(f, f∗) ≤ √(R(f) − R(f∗)) ≤ √r.
Lemma 6.4 For any r > 0, and 0 < δ < 1, with probability of at least 1 − δ,
V̂(fˆ, r) ⊆ V( f∗, (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2} ).
Proof If f ∈ V̂(fˆ, r), then, by definition,
R̂(f ) ≤ R̂(fˆ) + r.
Since fˆ minimizes the empirical error, we have
R̂(fˆ) ≤ R̂(f ∗ ).
Using the multiplicative risk bound twice (Definition 6.3), and applying the union
bound, we know that with probability of at least 1 − δ,
R(f) ≤ R̂(f) · σ_{δ/2}    ∧    R̂(f∗) ≤ R(f∗) · σ_{δ/2}.
Therefore,
R(f) ≤ R̂(f) · σ_{δ/2} ≤ (R̂(fˆ) + r) · σ_{δ/2} ≤ (R̂(f∗) + r) · σ_{δ/2}
     ≤ (R(f∗) · σ_{δ/2} + r) · σ_{δ/2} = R(f∗) + (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2},
and
f ∈ V( f∗, (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2} ).
Lemma 6.5 Let F be a convex hypothesis space, ℓ : Y × Y → [0, ∞), a convex
loss function, and fˆ be an ERM. Then, with probability of at least 1 − δ/2, for any
x ∈ X,
|f∗(x) − fˆ(x)| ≤ sup_{f ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) )} |f(x) − fˆ(x)|.
Proof Applying the multiplicative risk bound, we get that with probability of at
least 1 − δ/4,
R̂(f ∗ ) ≤ R(f ∗ ) · σδ/4 .
Since f ∗ minimizes the true error,
R(f ∗ ) ≤ R(fˆ).
Applying the multiplicative risk bound on fˆ, we know also that with probability of
at least 1 − δ/4,
R(fˆ) ≤ R̂(fˆ) · σδ/4 .
Combining the three inequalities by using the union bound we get that with probability of at least 1 − δ/2,
R̂(f∗) ≤ R̂(fˆ) · σ²_{δ/4} = R̂(fˆ) + (σ²_{δ/4} − 1) · R̂(fˆ).
Hence, with probability of at least 1 − δ/2 we get
f∗ ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) ).
Let G ⊆ F. We generalize the concept of disagreement set (Definition 4.3) to
real-valued functions.
Definition 6.4 (ǫ-disagreement set) The ǫ-disagreement set w.r.t. G is defined as
DISǫ(G) ≜ {x ∈ X : ∃ f1, f2 ∈ G  s.t.  |f1(x) − f2(x)| ≥ ǫ}.
For any G ⊆ F, distribution P, and ǫ > 0, we define
∆ǫ G ≜ Pr_P {DISǫ(G)}.
In the following definition we extend Hanneke’s disagreement coefficient [48] to
the case of real-valued functions.1
Definition 6.5 (ǫ-disagreement coefficient) Let r0 ≥ 0. The ǫ-disagreement coefficient of F with respect to P and r0 is,
θǫ(r0) ≜ sup_{r > r0}  ∆ǫ B(f∗, r) / r.
Throughout this chapter we set r0 = 0.
Lemma 6.6 Let F be a convex hypothesis class, and assume ℓ : Y × Y → [0, ∞)
is the squared loss function. Let ǫ > 0 be given. Assume that F has ǫ-disagreement
coefficient θǫ . Then, for any r > 0 and 0 < δ < 1, with probability of at least 1 − δ,
∆ǫ V̂(fˆ, r) ≤ θǫ · √( (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2} ).
Proof Applying Lemmas 6.4 and 6.3 we know that with probability of at least 1 − δ,
V̂(fˆ, r) ⊆ B( f∗, √( (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2} ) ).
Therefore,
∆ǫ V̂(fˆ, r) ≤ ∆ǫ B( f∗, √( (σ²_{δ/2} − 1) · R(f∗) + r · σ_{δ/2} ) ).
1. Our attempts to utilize a different known extension of the disagreement coefficient [16] were not successful. Specifically, the coefficient proposed there is unbounded for the squared loss function when Y is unbounded.
By the definition of the ǫ-disagreement coefficient, for any r′ > 0, ∆ǫ B(f∗, r′) ≤ θǫ · r′, which completes the proof.
The following theorem is the main result of this section, showing that ǫ-LESS
achieves ǫ-pointwise competitiveness with a meaningful coverage that converges to
1. Although R(f ∗ ) in the bound (6.2) is an unknown quantity, it is still a constant,
and as σ approaches 1, the coverage lower bound approaches 1 as well. When
using a typical additive risk bound, R(f∗) disappears from the RHS.
Theorem 6.7 Assume the conditions of Lemma 6.6 hold. Let (f, g) be the selective
regressor chosen by ǫ-LESS. Then, with probability of at least 1 − δ, (f, g) is ǫ-pointwise competitive and
Φ(f, g) ≥ 1 − θǫ · √( (σ²_{δ/4} − 1) · ( R(f∗) + σ_{δ/4} · R̂(fˆ) ) ).        (6.2)
Proof According to ǫ-LESS, if g(x) = 1 then
sup_{f ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) )} |f(x) − fˆ(x)| < ǫ.
Applying Lemma 6.5 we get that, with probability of at least 1 − δ/2,
∀x ∈ {x ∈ X : g(x) = 1} |f (x) − f ∗ (x)| < ǫ.
Since fˆ ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) ) = G, we get
Φ(f, g) = E{g(X)} = E{ I( sup_{f ∈ G} |f(x) − fˆ(x)| < ǫ ) }
        = 1 − E{ I( sup_{f ∈ G} |f(x) − fˆ(x)| ≥ ǫ ) }
        ≥ 1 − E{ I( sup_{f1, f2 ∈ G} |f1(x) − f2(x)| ≥ ǫ ) } = 1 − ∆ǫ G.
Applying Lemma 6.6 and the union bound we conclude that with probability of at least 1 − δ,
Φ(f, g) = E{g(X)} ≥ 1 − θǫ · √( (σ²_{δ/4} − 1) · ( R(f∗) + σ_{δ/4} · R̂(fˆ) ) ).
6.4 Rejection via constrained ERM
In our proposed strategy we are required to track the supremum of a possibly
infinite hypothesis subset, which in general might be intractable. The following
Lemma 6.8 reduces the problem of calculating the supremum to a problem of calculating a constrained ERM for two hypotheses.
Definition 6.6 (constrained ERM) Let x ∈ X and ǫ ∈ R be given. Define,
fˆǫ,x ≜ argmin_{f ∈ F} { R̂(f) | f(x) = fˆ(x) + ǫ },
where fˆ(x) is, as usual, the value of the unconstrained ERM regressor at point x.
Lemma 6.8 Let F be a convex hypothesis space, and ℓ : Y × Y → [0, ∞), a
convex loss function. Let ǫ > 0 be given, and let (f, g) be a selective regressor
chosen by ǫ-LESS after observing the training sample Sm . Let fˆ be an ERM. Then,
g(x) = 0   ⇔   R̂(fˆǫ,x) ≤ R̂(fˆ) · σ²_{δ/4}   ∨   R̂(fˆ₋ǫ,x) ≤ R̂(fˆ) · σ²_{δ/4}.

Proof Let
G ≜ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) ),
and assume there exists f ∈ G such that
|f (x) − fˆ(x)| ≥ ǫ.
Assume w.l.o.g. (the other case is symmetric) that
f (x) − fˆ(x) = a ≥ ǫ.
Since F is convex,
f′ ≜ (1 − ǫ/a) · fˆ + (ǫ/a) · f ∈ F.
We thus have,
f′(x) = (1 − ǫ/a) · fˆ(x) + (ǫ/a) · f(x) = (1 − ǫ/a) · fˆ(x) + (ǫ/a) · (fˆ(x) + a) = fˆ(x) + ǫ.
Therefore, by the definition of fˆǫ,x, and using the convexity of ℓ, together with Jensen’s inequality,
R̂(fˆǫ,x) ≤ R̂(f′) = (1/m) ∑_{i=1}^m ℓ(f′(xi), yi)
         = (1/m) ∑_{i=1}^m ℓ( (1 − ǫ/a) · fˆ(xi) + (ǫ/a) · f(xi), yi )
         ≤ (1 − ǫ/a) · (1/m) ∑_{i=1}^m ℓ(fˆ(xi), yi) + (ǫ/a) · (1/m) ∑_{i=1}^m ℓ(f(xi), yi)
         = (1 − ǫ/a) · R̂(fˆ) + (ǫ/a) · R̂(f)
         ≤ (1 − ǫ/a) · R̂(fˆ) + (ǫ/a) · R̂(fˆ) · σ²_{δ/4}
         = R̂(fˆ) + (ǫ/a) · (σ²_{δ/4} − 1) · R̂(fˆ) ≤ R̂(fˆ) · σ²_{δ/4}.
As for the other direction, if
R̂(fˆǫ,x) ≤ R̂(fˆ) · σ²_{δ/4},
then fˆǫ,x ∈ G and
fˆǫ,x(x) − fˆ(x) = ǫ.
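The reduction can be summarized in a few lines of code. The sketch below is schematic: constrained_erm_risk is a hypothetical oracle returning the empirical risk of the ERM constrained to predict a given value at x (Definition 6.6), and sigma stands for σ(m, δ/4, F); neither is provided by the thesis as code.

```python
def epsilon_less_reject(x, eps, erm_risk, erm_predict, constrained_erm_risk, sigma):
    """Lemma 6.8: g(x) = 0 iff one of the two constrained ERMs is almost as good as the ERM."""
    threshold = erm_risk * sigma ** 2
    risk_up = constrained_erm_risk(x, erm_predict(x) + eps)    # R_hat of f_hat_{+eps, x}
    risk_down = constrained_erm_risk(x, erm_predict(x) - eps)  # R_hat of f_hat_{-eps, x}
    return risk_up <= threshold or risk_down <= threshold
```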
So far we have discussed the case where ǫ is given, and our objective is to find an
ǫ-pointwise competitive regressor. Lemma 6.8 provides the means to compute such
a competitive regressor assuming that a method to compute a constrained ERM is
available (as is the case for squared loss linear regressors; see next section). However, as was discussed in [31], in many applications our objective might require to
explore the entire risk-coverage trade-off, in other words, to get a pointwise bound
on |f ∗ (x) − f (x)|, i.e., individually for any test point x. The following theorem
states such a pointwise bound.
Theorem 6.9 Let F be a convex hypothesis class, ℓ : Y × Y → [0, ∞), a convex
loss function, and let fˆ be an ERM. Then, with probability of at least 1 − δ/2 over
the choice of Sm from P^m, for any x ∈ X,
|f∗(x) − fˆ(x)| ≤ sup_{ǫ ∈ R} { |ǫ| : R̂(fˆǫ,x) ≤ R̂(fˆ) · σ²_{δ/4} }.
Proof Define
f̃ ≜ argmax_{f ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) )} |f(x) − fˆ(x)|.
Assume w.l.o.g. (the other case is symmetric) that
f̃(x) = fˆ(x) + a.
Following Definition 6.6 we get
R̂(fˆa,x) ≤ R̂(f̃) ≤ R̂(fˆ) · σ²_{δ/4}.
Define
ǫ′ = sup_{ǫ ∈ R} { |ǫ| : R̂(fˆǫ,x) ≤ R̂(fˆ) · σ²_{δ/4} }.
We thus have,
sup_{f ∈ V̂( fˆ, (σ²_{δ/4} − 1) · R̂(fˆ) )} |f(x) − fˆ(x)| = a ≤ ǫ′.
An application of Lemma 6.5 completes the proof.
We conclude this section with a general result on the monotonicity of the empirical
risk attained by constrained ERM regressors.
Lemma 6.10 (Monotonicity) Let F be a convex hypothesis space, ℓ : Y × Y →
[0, ∞), a convex loss function, and 0 ≤ ǫ1 < ǫ2, be given. Then,
R̂(fˆǫ1,x0) − R̂(fˆ) ≤ (ǫ1/ǫ2) · ( R̂(fˆǫ2,x0) − R̂(fˆ) ).
The result also holds for the case 0 ≥ ǫ1 > ǫ2.
Proof Define
f′(x) ≜ ((ǫ2 − ǫ1)/ǫ2) · fˆ(x) + (ǫ1/ǫ2) · fˆǫ2,x0(x)
      = ((ǫ2 − ǫ1)/ǫ2) · fˆ(x) + (ǫ1/ǫ2) · (fˆ(x) + ǫ2) = fˆ(x) + ǫ1.
Since F is convex we get that f′ ∈ F. Therefore, by the definition of fˆǫ1,x0, and
using the convexity of ℓ together with Jensen’s inequality, we obtain,
R̂(fˆǫ1,x0) ≤ R̂(f′) = (1/m) ∑_{i=1}^m ℓ( ((ǫ2 − ǫ1)/ǫ2) · fˆ(xi) + (ǫ1/ǫ2) · fˆǫ2,x0(xi), yi )
           ≤ ((ǫ2 − ǫ1)/ǫ2) · R̂(fˆ) + (ǫ1/ǫ2) · R̂(fˆǫ2,x0).
6.5 Selective linear regression
We now restrict attention to linear least squares regression (LLSR), and, relying on
Theorem 6.9 and Lemma 6.10, as well as on known closed-form expressions for
LLSR, we derive an efficient implementation of ǫ-LESS and a new pointwise bound.
Let X be an m × d training sample matrix whose ith row, xi ∈ Rd , is a feature
vector. Let y ∈ Rm be a column vector of training labels.
Lemma 6.11 (ordinary least-squares estimate [41]) The ordinary least square (OLS)
solution of the following optimization problem,
min_β ‖Xβ − y‖²,
is given by
β̂ ≜ (XᵀX)⁺ Xᵀy,
where the superscript + denotes the pseudoinverse.
Lemma 6.12 (constrained least-squares estimate [41], page 166) Let x0 be a row
vector and c a label. The constrained least-squares (CLS) solution of the following
optimization problem,
minimize ‖Xβ − y‖²   s.t.   x0 β = c,
is given by
β̂_C(c) ≜ β̂ + (XᵀX)⁺ x0ᵀ ( x0 (XᵀX)⁺ x0ᵀ )⁺ ( c − x0 β̂ ),
where β̂ is the OLS solution.
Theorem 6.13 Let F be the class of linear regressors, and let fˆ be an ERM. Then,
with probability of at least 1 − δ over the choice of Sm, for any test point x0 we have,
|f∗(x0) − fˆ(x0)| ≤ ( ‖Xβ̂ − y‖ / ‖XK‖ ) · √(σ²_{δ/4} − 1),
where
K = (XᵀX)⁺ x0ᵀ ( x0 (XᵀX)⁺ x0ᵀ )⁺.
Proof According to Lemma 6.10, for squared loss, R̂(fˆǫ,x0) is strictly monotonically increasing for ǫ > 0, and decreasing for ǫ < 0. Therefore, the equation,
R̂(fˆǫ,x0) = R̂(fˆ) · σ²_{δ/4},
where ǫ is the unknown, has precisely two solutions for any σ > 1. Denoting these
solutions by ǫ1, ǫ2 we get,
sup_{ǫ ∈ R} { |ǫ| : R̂(fˆǫ,x0) ≤ R̂(fˆ) · σ²_{δ/4} } = max(|ǫ1|, |ǫ2|).
Applying Lemmas 6.11 and 6.12 and setting c = x0β̂ + ǫ, we obtain,
(1/m) ‖X β̂_C(x0β̂ + ǫ) − y‖² = R̂(fˆǫ,x0) = R̂(fˆ) · σ²_{δ/4} = (1/m) ‖Xβ̂ − y‖² · σ²_{δ/4}.
Hence,
‖Xβ̂ + XKǫ − y‖² = ‖Xβ̂ − y‖² · σ²_{δ/4},
so,
2(Xβ̂ − y)ᵀ XKǫ + ‖XK‖² ǫ² = ‖Xβ̂ − y‖² · (σ²_{δ/4} − 1).
We note that by applying Lemma 6.11 on (Xβ̂ − y)ᵀX, we get,
(Xβ̂ − y)ᵀX = ( XᵀX(XᵀX)⁺Xᵀy − Xᵀy )ᵀ = ( Xᵀy − Xᵀy )ᵀ = 0.
Therefore,
ǫ² = ( ‖Xβ̂ − y‖² / ‖XK‖² ) · (σ²_{δ/4} − 1).
Application of Theorem 6.9 completes the proof.
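A sketch of this closed-form computation in NumPy follows (our own code, using only the pseudoinverse expressions of Lemmas 6.11 and 6.12; sigma is assumed to be larger than 1):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares via the pseudoinverse (Lemma 6.11)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def pointwise_bound(X, y, x0, sigma):
    """Upper bound on |f*(x0) - f_hat(x0)| from Theorem 6.13."""
    beta_hat = ols(X, y)
    G = np.linalg.pinv(X.T @ X)
    x0 = x0.reshape(-1, 1)                       # column vector
    K = G @ x0 @ np.linalg.pinv(x0.T @ G @ x0)   # (X^T X)^+ x0^T (x0 (X^T X)^+ x0^T)^+
    residual = np.linalg.norm(X @ beta_hat - y)
    return (residual / np.linalg.norm(X @ K)) * np.sqrt(sigma ** 2 - 1.0)
```

When the bound is used only to rank test points, the exact value of sigma is immaterial, so long as it exceeds 1.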
6.6 Empirical results
Focusing on linear least squares regression, for which we have an efficient, closed-form implementation, we empirically evaluated the proposed method. Given a
labeled dataset we randomly extracted two disjoint subsets: a training set Sm, and
a test set Sn. The selective regressor (f, g) is computed as follows. The regressor f
is an ERM over Sm, and for any coverage value Φ, the function g selects a subset
of Sn of size n · Φ, including all test points with the lowest values of the bound in
Theorem 6.13.²
We compare our method to the following simple and natural 1-nearest
neighbor (NN) technique for selection. Given the training set Sm and the test
set Sn, let NN(x) denote the nearest neighbor of x in Sm, with corresponding
distance ρ(x) ≜ ‖NN(x) − x‖ to x. These ρ(x) distances, corresponding to
all x ∈ Sn, were used as an alternative method to reject test points in decreasing order
of their ρ(x) values.
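A minimal sketch of this baseline (the helper name is ours):

```python
import numpy as np

def nn_distances(X_train, X_test):
    """rho(x): Euclidean distance from each test point to its nearest training point."""
    diffs = X_test[:, None, :] - X_train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
```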
We tested the algorithm on 10 of the 14 LIBSVM [21] regression datasets, each
with training sample size m = 30, and test set size n = 200. From this repository
we took all sets that are not too small and have reasonable feature dimensionality.³
Figure 6.1 shows the average absolute difference between the selective regressor (f, g) and the optimal regressor f ∗ (taken as the ERM over the entire dataset)
as a function of coverage, where the average is taken over the accepted instances.
Our method appears in solid red line, and the baseline NN method, in dashed black
2. We use the theorem here only for ranking test points, so any constant > 1 can be used instead of σ²_{δ/4}.
3. Two datasets having less than 200 samples, and two that have over 150,000 features, were excluded.
line. Each curve point is an average over 200 independent trials (error bars represent standard error of the mean). It is evident that for 9 out of the 10 datasets
the average distance monotonically increases with coverage. Furthermore, in those
cases the proposed method significantly outperforms the NN baseline. For the
’year’ dataset we see that the average distance is not monotone, and it is growing
at low coverage values. Further analysis suggests that the source of this ’artifact’
is the high dimensionality of the dataset (d = 90) compared to the training sample
size (m = 30). Clearly, in this case both the ERM and the constrained-ERM can
completely overfit the training samples, in which case the ranking we obtain
can be noisy. To validate this hypothesis, we repeated the same experiment with
m growing from 30 to 200. Figure 6.2 depicts the resulting RC-curves for the five
cases m = 30, 50, 100, 150, and 200. Clearly this overfitting artifact completely
disappears whenever m > d. In those cases our method slightly outperforms the
NN baseline method.
The graphs in Figure 6.3 show the RC-curves obtained by our method and the
NN baseline method. That is, each curve is the test error of the selective regressor
as a function of coverage. In 7 of the 10 datasets our method outperforms the baseline in terms
of test error. In two datasets our method is statistically indistinguishable from the
NN baseline method. In one dataset (’year’) our method fails for low coverage
values (see the above discussion on overfitting and dimensionality).
Figure 6.1: Absolute difference between the selective regressor (f, g) and the optimal regressor f∗, shown per dataset (eunite, abalone, bodyfat, housing, cadata, cpusmall, mg, mpg, space, and year). Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
Figure 6.2: Test error of the selective regressor (f, g) trained on the ’year’ dataset with sample sizes of 30, 50, 100, 150, and 200. Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
Figure 6.3: Test error of the selective regressor (f, g), shown per dataset (eunite, abalone, bodyfat, housing, cadata, cpusmall, mg, mpg, space, and year). Our proposed method in solid red line and the baseline method in dashed black line. All curves in a logarithmic y-scale.
Chapter 7
Future Directions
“A: Would you tell me, please, which way I ought to go from here?
C: That depends a good deal on where you want to get to.
A: I don’t much care where.
C: Then it doesn’t much matter which way you go.
A: ...So long as I get somewhere.
C: Oh, you’re sure to do that, if only you walk long enough.”
Lewis Carroll, Alice in Wonderland
In this chapter we present a number of open questions that motivate directions
for future research. Our focus is on larger fundamental questions rather than on
technical/incremental improvements of the present results. Nevertheless, for those
who are interested, the list of possible technical improvements includes tightening
the existing bounds, extending the results to other loss functions, and extending the
selective regression results to additive concentration inequalities.
7.1 Beyond pointwise competitiveness
In this work we analyzed the necessary and sufficient conditions for pointwise
competitiveness. Our goal was to match the predictions of the best hypothesis in the
class for the accepted domain. As a result, we were able to control the estimation
error ∆1 ≜ R(f, g) − R(f∗, g) and “instantly” attenuate it to zero by compromising coverage. However, one can consider other optimization objectives. For
example, for a given selective classifier (f, g), we might want to bound the difference between its risk and the risk of the best hypothesis in the class, limited to the
accepted domain. Namely, we might want to bound and reduce the excess loss,
∆2 ≜ R(f, g) − inf_{f′ ∈ F} R(f′, g).
This is considerably more challenging than reducing ∆1 to zero. Can we bound
∆2 ? Can we reduce ∆2 to zero and achieve a stronger pointwise competitiveness,
one that will ensure a match between the selective predictions and those of the
predictor defined by
arg min_{f′ ∈ F} R(f′, g)
over the accepted domain?
Bounding ∆2 guarantees that the selective classifier (f, g) is the best classifier
that can be obtained using a prediction function from F and the selection function
g. We can also compare the risk of the selective predictor with the risk of any
selective hypothesis using a prediction function from F and any selection function
with the same coverage as g. This amounts to bounding the excess loss,
∆3 ≜ R(f, g) − inf_{f′ ∈ F, g′ : X → {0,1}} { R(f′, g′) | E(g′) = E(g) }.
Finally, another interesting and natural objective would be to bound the difference between the risk of (f, g) and the risk of the Bayes selective classifier, a
selective classifier based on Chow’s decision rule with the same coverage as g.
Namely, the goal is to bound
∆4 ≜ R(f, g) − inf_{f′, g′ : X → {0,1}} { R(f′, g′) | E(g′) = E(g) }.
It follows straightforwardly from the above definitions that ∆1 ≤ ∆2 ≤ ∆3 ≤
∆4 . Therefore, the lower bounds derived in this work for ∆1 are relevant also for
all the rest. However, upper bounds are yet to be found.
7.2 Beyond linear classifiers
The characterizing set complexity has been shown to be an important measure for
analyzing coverage rates in selective classification, as well as label complexity in
active learning. Indeed, at present, for a number of interesting settings (hypothesis
classes and distributions), this is the only known technique that is able to prove
exponential label complexity speedup. In this work we focused on linear classifiers under a fixed mixture of Gaussians and axis aligned rectangles under product
distributions.
Is it possible to extend our results beyond linear classifiers and axis aligned
rectangles? One possible direction is to analyze the behavior of the characterizing
set complexity under hypothesis class operations such as k-fold intersections and
k-fold unions. A similar technique has been applied successfully in the contexts of
VC theory [18, Lemma 3.2.3] and the disagreement coefficient [49]. Such a result
is very attractive in our context. For example, given results for k-fold unions, we
could extend our results from axis aligned rectangles to decision trees (with limited
depth). A result for k-fold intersection would allow us to extend our results from
linear classifiers to polygons.
7.3 Beyond mixtures of Gaussians
The proof of Corollary 3.15 implies that whenever the version space compression
set size n̂ grows slowly with respect to the number of training examples m, then fast
coverage rate for linear classifiers is achievable. In particular, if n̂ = polylog(m)
with high probability, then fast rates are guaranteed. In this work we have proven
that n̂ = polylog(m) for any fixed mixture of Gaussians. Let S + be the set of
positive training examples. In the proof of Lemma 3.14 we show that if the number
of vertices of the convex hull of S + is O(polylog(m)), then n̂ = polylog(m).
Therefore, our analysis for the mixture of Gaussians case can be easily extended
to other distributions for which the number of vertices of the convex hull is small.
Specifically, any distribution that can be generated from a mixture of Gaussians by
a transformation that preserves convex hull vertices (e.g., projective quasi-affine
transformation [52]) will also guarantee fast coverage rates and exponential label
complexity speedup. Can we prove the results for other (mixtures of) well-known
multivariate parametric families such as Dirichlet or Cauchy distributions? Is it
possible to characterize the properties of transformations that preserve the number
of convex hull vertices?
7.4 On computational complexity
Implementations of the LESS strategy require efficient calculations of ERM and restricted ERM. For linear classifiers this problem is reduced to a variant of the MAX
FLS and C MAX FLS problems with ≥ and > relations [2]. Not only is finding
an optimal solution for these problems NP-hard, but neither can they be approximated to within arbitrary factors. MAX FLS with strict and nonstrict inequalities
is APX-complete (within 2). Therefore, it can be approximated within a factor of
2 but not within every constant factor [2]. C MAX FLS is MAX IND SET-hard,
so it cannot be efficiently approximated at all. Ben-David et al. extended these
results to other hypothesis classes, including axis-aligned hyper-rectangles. They
showed that approximating ERM for this class is NP-hard [14]. Do these results
imply that we have reached the end of the road for efficient implementation (or
approximation) of LESS? As pointed out by Ben-David et al., for each one of the
concept classes they considered, there exists a subset of its legal inputs for which
the maximum agreement problem can be solved in polynomial time. Do the above
disappointing complexity results hold, with high probability, for datasets generated
by mixtures of Gaussians? What about other source distributions? What about discrete variables?
In selective regression we have been able to derive an efficient implementation
of ǫ-LESS for LLSR. Can we extend the result for different kernels? Can efficient
implementation of LESS (for the agnostic selective classification setting) be developed for other interesting hypothesis classes?
7.5 Other applications for selective prediction
The most obvious and studied use of selective prediction is ‘classification with a
reject option’, the goal of which is to reduce the generalization risk of classifiers.
However, selective prediction can be used for other purposes as well. Let us briefly
describe three such applications: learning non-stationary time series, big data analysis, and reverse engineering.
Learning a non-stationary time series is a challenging task since the underlying
distribution is constantly changing. Therefore, training based on distant history
is not relevant and the number of training examples is inherently limited. In this
setting, selective prediction can serve as a powerful tool. Looking at short time
windows, we can obtain pointwise predictions close to the best hypothesis in the
class and achieve good prediction accuracy.
With the explosion of data generated every day, it is not surprising that training
on massive datasets has become a major problem in machine learning. In this setting, labeled training examples are in abundance; therefore it seems that selective
prediction brings no real value. However, can we subsample the dataset and train
multiple selective predictors on its different subsets? Since pointwise competitiveness is guaranteed, fusion of multiple selective predictors is trivial. But how does
the coverage of the aggregated classifier behave? How do we decide on the subset
size? What is the optimal subsampling method?
Machine learning is becoming an important part of modern corporate competitive strategy. “Knowledge is power” and companies perceive their datasets as valuable company assets. In many cases the “secret sauce” is not the type of learning
algorithm or the hypothesis class but the training data itself. Selective prediction introduces a threat for those companies. Using the “public” predictions of the model,
competitors can train a selective predictor and achieve the same performance (on
the accepted domain). Knowing the hypothesis class and training on the model predictions falls into the realizable selective classification setting, where fast coverage
rates have been proven for interesting distributions.
Appendix A
Proofs
Proof of Lemma 2.3 Let S + ⊆ Sm be the set of all positive samples in Sm ,
and S − ⊆ Sm be the set of all negative samples. Let x̄0 ∈ R+ . There exists a
hypothesis fw̄,φ (x̄) such that
∀ x̄ ∈ S⁺,  w̄ᵀx̄ − φ ≥ 0;      ∀ x̄ ∈ S⁻,  w̄ᵀx̄ − φ < 0,
and
w̄ᵀx̄0 − φ ≥ 0.
Let’s assume that x̄0 ∉ R̃⁺. Then, there exists a hypothesis f̃_{w̄′,φ′}(x̄) such that
∀ x̄ ∈ S⁺,  w̄′ᵀx̄ − φ′ ≥ 0;      ∀ x̄ ∈ S⁻,  w̄′ᵀx̄ − φ′ ≤ 0,
and
w̄′ᵀx̄0 − φ′ < 0.
Defining
w̄0 ≜ w̄ + αw̄′,    φ0 ≜ φ + αφ′,
where
α > | w̄ᵀx̄0 − φ | / | w̄′ᵀx̄0 − φ′ |,
we deduce that there exists a hypothesis fw̄0 ,φ0 (x̄) such that
∀ x̄ ∈ S⁺,  w̄0ᵀx̄ − φ0 ≥ 0;      ∀ x̄ ∈ S⁻,  w̄0ᵀx̄ − φ0 < 0,
and
w̄0ᵀx̄0 − φ0 = (w̄ᵀx̄0 − φ) + α(w̄′ᵀx̄0 − φ′) = (w̄ᵀx̄0 − φ) − α|w̄′ᵀx̄0 − φ′| < (w̄ᵀx̄0 − φ) − (w̄ᵀx̄0 − φ) = 0.
Therefore, x̄0 ∉ R⁺. Contradiction. Hence, x̄0 ∈ R̃⁺ and R⁺ ⊆ R̃⁺. The proof
that R⁻ ⊆ R̃⁻ follows the same argument.
To prove that R̃⁺ ⊆ R⁺, we look at V S_{F̃,Sm}:
∀ f̃_{w̄,φ} ∈ V S_{F̃,Sm},  ∀ x̄ ∈ R̃⁺:   w̄ᵀx̄ − φ ≥ 0.
We observe that if f_{w̄,φ} ∈ V S_{F,Sm}, then f̃_{w̄,φ} ∈ V S_{F̃,Sm}. Therefore,
∀ f_{w̄,φ} ∈ V S_{F,Sm},  ∀ x̄ ∈ R̃⁺:   w̄ᵀx̄ − φ ≥ 0.
Hence, R̃⁺ ⊆ R⁺.
It remains to prove that R̃⁻ ⊆ R⁻. Assuming x̄0 ∉ R⁻ implies that there
exists a hypothesis f_{w̄,φ}(x̄) such that
∀ x̄ ∈ S⁺,  w̄ᵀx̄ − φ ≥ 0;      ∀ x̄ ∈ S⁻,  w̄ᵀx̄ − φ < 0,
and
w̄ᵀx̄0 − φ ≥ 0.
Defining
w̄0 ≜ w̄,    φ0 ≜ φ − | max_{x̄ ∈ S⁻} (w̄ᵀx̄ − φ) |
(if S⁻ is an empty set we can arbitrarily define φ0 ≜ φ − 1),
we conclude that there exists a hypothesis f̃_{w̄0,φ0}(x̄) such that
∀ x̄ ∈ S⁺,  w̄0ᵀx̄ − φ0 ≥ 0;
∀ x̄ ∈ S⁻,  w̄0ᵀx̄ − φ0 ≤ max_{x̄ ∈ S⁻}(w̄ᵀx̄ − φ) + | max_{x̄ ∈ S⁻}(w̄ᵀx̄ − φ) | = 0,
and
w̄0ᵀx̄0 − φ0 > 0.
Therefore, x̄0 ∉ R̃⁻, so R̃⁻ ⊆ R⁻.
Proof of Lemma 2.4 According to Lemma 2.3, R+ = R̃+ and R− = R̃− . Therefore, we can restrict our discussion to the hypothesis class F̃. Due to the symmetry
of the hypothesis class F̃ we will concentrate only on the positive region R+ . Set
G ≜ V S_{F̃,Sm}. By definition,
R̃⁺ = ⋂_{f_{w̄,φ} ∈ G} f′_{w̄,φ},
where f′_{w̄,φ} denotes the region in X for which the linear classifier f_{w̄,φ} obtains the
value one or zero. Let fw̄,φ ∈ G be a half-space with k < d points on its boundary.
We will prove that there exist two half-spaces in G (fw̄1 ,φ1 , fw̄2 ,φ2 ) such that each
has at least k + 1 samples on its boundary and
f_{w̄,φ} ∩ f_{w̄1,φ1} ∩ f_{w̄2,φ2} = f_{w̄1,φ1} ∩ f_{w̄2,φ2}.
Therefore,
R̃⁺ = ⋂_{f′_{w̄,φ} ∈ G \ {f_{w̄,φ}}} f′_{w̄,φ}.
fw̄,φ
Repeating this process recursively with every half-space in G, with less than d
points on its boundary, completes the proof.
Before proceeding with the rigorous analysis let’s review the main idea behind
the proof. If a half-space in Rd has less than d points on its boundary, it has at
least one degree of freedom. Rotating the half-space clockwise or counterclockwise around a specific axis (defined by the points on the boundary) by sufficiently
small angles will maintain correct classification over Sm . We will rotate the half-
space clockwise and counterclockwise until “touching” the first point in Sm in
each direction. This operation will maintain correct classification but will result in
having one additional point on the boundary. Then we only have to show that the
intersection of the three half-spaces (original and two rotated ones) is the same as
the intersection of the two rotated ones.
Let f_{w̄,φ} ∈ G be a half-space with k < d points on its boundary. Without loss
of generality assume that these points are S⁰_m ≜ {x̄1, x̄2, . . . , x̄k}. For the sake of
simplicity we will first translate the space such that x̄1 will lie on the origin. Since
x̄1 is on the boundary of the half-space we get
w̄ᵀx̄1 − φ = 0   ⟹   φ = w̄ᵀx̄1.
Therefore,
∀ x̄ ∈ S⁰_m:   0 = w̄ᵀx̄ − φ = w̄ᵀx̄ − w̄ᵀx̄1 = w̄ᵀ(x̄ − x̄1).
Hence, the weight vector w̄ is orthogonal to all the translated samples (x̄1 −
x̄1 ), . . . , (x̄k − x̄1 ). We now have k < d vectors in Rd (including the weight
vector) so we can always find at least one vector v̄ which is orthogonal to all the
rest. We now rotate the translated samples around the origin so as to align the
vector w̄ with the first axis, and align the vector v̄ with the second axis. From now
on all translated and rotated coordinates and vectors will be marked with prime.
Define the following rotation matrix in R^d,
Rθ ≜ [  cos θ   sin θ   0   0   · · ·
       −sin θ   cos θ   0   0   · · ·
        0       0       1   0   · · ·
        0       0       0   1   · · ·
        ⋮       ⋮       ⋮   ⋮   ⋱  ].
We can now define two new half-spaces in the translated and rotated space,
f_{Rα w̄′, 0} and f_{R−β w̄′, 0}, where
α = max_{0 < α′ ≤ π} { α′ | ∀ x̄′ ∈ Sm,  (w̄′ᵀx̄′) · (R_{α′} w̄′)ᵀ x̄′ ≥ 0 },        (A.1)
and
β = max_{0 < β′ ≤ π} { β′ | ∀ x̄′ ∈ Sm,  (w̄′ᵀx̄′) · (R_{−β′} w̄′)ᵀ x̄′ ≥ 0 }.        (A.2)
According to Claim B.6, both fRα w̄′ ,0 and fR−β w̄′ ,0 correctly classify Sm and have
at least k + 1 samples on their boundaries.
Now we examine the intersection of fRα w̄′ ,0 and fR−β w̄′ ,0 . According to Claim B.7,
if (Rα w̄′ )T x̄′ ≥ 0 and (R−β w̄′ )T x̄′ ≥ 0, then w̄′T x̄′ ≥ 0. The intersection of
fRα w̄′ ,0 , fR−β w̄′ ,0 and fw̄′ ,0 thus equals the intersection of fRα w̄′ ,0 and fR−β w̄′ ,0 ,
as required.
Proof of Lemma 4.4 using super martingales Define Wk ≜ ∑_{i=1}^k (Zi − bi). We
assume that with probability of at least 1 − δ/2, Pr{Zi | Z1, . . . , Zi−1} ≤ bi, simultaneously for all i. Since Zi is a binary random variable it is easy to see that (w.h.p.),
E_{Zi} {Wi | Z1, . . . , Zi−1} = Pr{Zi | Z1, . . . , Zi−1} − bi + Wi−1 ≤ Wi−1,
and the sequence W_1^m ≜ W1, . . . , Wm is a super-martingale with high probability.
We apply the following theorem by McDiarmid that refers to martingales (but can
be shown to apply to super-martingales, by following its original proof).
Theorem A.1 ([69], Theorem 3.12) Let Y1, . . . , Yn be a martingale difference sequence with −ak ≤ Yk ≤ 1 − ak for each k; let A = (1/n) ∑ ak. Then, for any
ǫ > 0,
Pr{ ∑ Yk ≥ Anǫ } ≤ exp( −[(1 + ǫ) ln(1 + ǫ) − ǫ] An ) ≤ exp( − Anǫ² / (2(1 + ǫ/3)) ).
In our case, Yk = Wk − Wk−1 = Zk − bk ≤ 1 − bk and we apply the (revised)
theorem with ak ≜ bk and An ≜ ∑ bk ≜ B. We thus obtain, for any 0 < ǫ < 1,
Pr{ ∑ Zk ≥ B + Bǫ } ≤ exp( − Bǫ² / (2(1 + ǫ/3)) ).
Equating the right-hand side to δ/2, we obtain
ǫ = ( (2/3) ln(2/δ) ± √( (4/9) ln²(2/δ) + 8B ln(2/δ) ) ) / (2B)
  ≤ ( (1/3) ln(2/δ) + √( (1/9) ln²(2/δ) + 2B ln(2/δ) ) ) / B
  ≤ ( (2/3) ln(2/δ) + √( 2B ln(2/δ) ) ) / B.
Applying the union bound completes the proof.
Appendix B
Technical Lemmas
Lemma B.1 (Bernstein’s inequality [57]) Let X1, . . . , Xn be independent zero-mean random variables. Suppose that |Xi| ≤ M almost surely, for all i. Then for
all positive t,
Pr{ ∑_{i=1}^n Xi > t } ≤ exp( − (t²/2) / ( ∑ E[Xj²] + M t/3 ) ).
Lemma B.2 (binomial tail inversion lower bound) For k > 0 and δ ≤ 1/2,
Bin(m, k, δ) ≥ min( 1,  k/(2m) − (4/(3m)) · ln(1/(1 − δ)) ).
Proof Let Z1, . . . , Zm be independent Bernoulli random variables, each with a success probability 0 ≤ p ≤ 1. Setting Wi ≜ Zi − p,
Bin(m, k, p) = Pr_{Z1,...,Zm ∼ B(p)^m} ( ∑_{i=1}^m Zi ≤ k ) = 1 − Pr( ∑_{i=1}^m Zi > k )
             = 1 − Pr( ∑_{i=1}^m Wi > k − mp ).
Clearly, E[Wi] = 0, |Wi| ≤ 1, and E[Wi²] = p · (1 − p)² + (1 − p) · p² = p · (1 − p).
Using Lemma B.1 (Bernstein’s inequality) we thus obtain,
Bin(m, k, p) ≥ 1 − exp( − ((k − mp)²/2) / ( mp(1 − p) + (k − mp)/3 ) ).
Since (1 − p) ≤ 1,
((k − mp)²/2) / ( mp(1 − p) + (k − mp)/3 ) ≥ (k − mp)² / ( 2mp + (2/3)(k − mp) )
                                          = (k − mp)² / ( (4/3)mp + (2/3)k )
                                          ≥ (k² − 2mpk) / ( (4/3)mp + (2/3)k ).
Therefore,
Bin(m, k, p) ≥ 1 − exp( − (k² − 2mpk) / ( (4/3)mp + (2/3)k ) ).
Equating the right-hand side to δ and solving for p, we have
p ≤ (k/(2m)) · ( (k − (2/3)t) / (k + (2/3)t) ),
where t ≜ ln(1/(1 − δ)). Choosing
p = min{ 1, (1/(2m)) · ( k − (8/3) ln(1/(1 − δ)) ) } = min{ 1, (k/(2m)) · ( 1 − 4 · ((2/3)t) / k ) },
using the fact that k ≥ 1 > (2√2/3) · ln 2 ≥ (2√2/3) · t, and applying Lemma B.3, we get
p ≤ (k/(2m)) · ( (k − (2/3)t) / (k + (2/3)t) ).
Therefore, Bin(m, k, p) ≥ δ. Since Bin(m, k, δ) = max_p { p : Bin(m, k, p) ≥ δ }, we conclude that
Bin(m, k, δ) ≥ p = min( 1,  k/(2m) − (4/(3m)) · ln(1/(1 − δ)) ).
Lemma B.3 For u > √2 · v > 0,
(u − v)/(u + v) ≥ 1 − 4 · (v/u).
Proof
(u − v)/(u + v) = 1 − 2v/(u + v) = 1 − 2v(u − v)/(u² − v²) ≥ 1 − 2vu/(u² − v²).
Since u > √2 · v, we have
u² − v² > u²/2.
Applying this to the previous inequality completes the proof.
Lemma B.4 For any m ≥ 3, a ≥ 1, b ≥ 1 we get
∑_{i=1}^m ln^a(bi)/i < (4/a) · ln^{a+1}( b(m + 1) ).
Proof Setting f(x) ≜ ln^a(bx)/x, we have
df/dx = (a − ln bx) · ln^{a−1}(bx)/x².
Therefore, f is monotonically increasing when x < e^a/b, monotonically decreasing when x ≥ e^a/b, and it attains its maximum at x = e^a/b. Consequently, for i < e^a/b − 1, or i ≥ e^a/b + 1,
f(i) ≤ ∫_{x=i−1}^{i+1} f(x) dx.
For e^a/b − 1 ≤ i < e^a/b + 1,
f(i) ≤ f(e^a/b) = b · (a/e)^a ≤ a^a.        (B.1)
Therefore, if m < e^a − 1 we have,
∑_{i=1}^m f(i) = ln^a(b) + ∑_{i=2}^m f(i) < 2 · ∫_{x=1}^{m+1} f(x) dx ≤ (2/(a + 1)) · ln^{a+1}( b(m + 1) ).
Otherwise, m ≥ e^a/b, in which case we overcome the change of slope by adding
twice the (upper bound on the) maximal value (B.1),
∑_{i=1}^m f(i) < (2/(a + 1)) · ln^{a+1}( b(m + 1) ) + 2a^a = (2/(a + 1)) · ln^{a+1}( b(m + 1) ) + (2/a) · a^{a+1}
              ≤ (2/(a + 1)) · ln^{a+1}( b(m + 1) ) + (2/a) · ln^{a+1}(bm) ≤ (4/a) · ln^{a+1}( b(m + 1) ).
Lemma B.5 Let S1 and S2 be two sets in Rd . Then,
H(S1 ∪ S2 ) ≤ H(S1 ) + H(S2 ),
where H(S) is the number of convex hull vertices of S.
Proof Assume x ∈ S_1 ∪ S_2 is a convex hull vertex of S_1 ∪ S_2. Then there is a half-space (w, φ) such that w · x − φ = 0 and any other y ∈ S_1 ∪ S_2 satisfies w · y − φ > 0. Assume w.l.o.g. that x ∈ S_1. Then clearly any other y ∈ S_1 satisfies w · y − φ > 0, so x is a convex hull vertex of S_1. Hence every convex hull vertex of S_1 ∪ S_2 is a convex hull vertex of S_1 or of S_2, and the bound follows.
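Lemma B.5 is easy to check empirically in the plane. The following sketch (assuming SciPy; the point sets and their sizes are illustrative only) counts convex hull vertices with scipy.spatial.ConvexHull and verifies the subadditivity.

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
S1 = rng.standard_normal((50, 2))
S2 = rng.standard_normal((50, 2)) + np.array([3.0, 0.0])

H = lambda S: len(ConvexHull(S).vertices)   # H(S): number of convex hull vertices
print(H(np.vstack([S1, S2])), "<=", H(S1) + H(S2))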
Claim B.6 Both f_{R_α w̄′,0} and f_{R_{−β} w̄′,0} correctly classify S_m and have at least k+1 samples on their boundaries.

Proof We note that after translation all half-spaces pass through the origin, so φ′ = 0. Recall the definitions of α and β as maximums (Equations (A.1) and (A.2), respectively). We show that the maximums over α′ and β′ are well defined. Let x̄′ = (x′_1, x′_2, ..., x′_d)^T. Since w̄′ = (1, 0, 0, ...)^T we get that R_α w̄′ = (cos α, −sin α, 0, ...)^T and
\[ (\bar{w}'^{T}\bar{x}')\cdot(R_{\alpha'}\bar{w}')^{T}\bar{x}' = x_1'^2\cos\alpha' - x_1'\,x_2'\sin\alpha'. \]
Since S_m is a spanning set of R^d, at least one sample has a vector with component x′_1 ≠ 0. As all components are finite and x′_1² > 0, we can always find a sufficiently small α′ such that x′_1² cos α′ − x′_1 x′_2 sin α′ > 0. Hence, the maximum exists. Furthermore, for α′ = π we have x′_1² cos π − x′_1 x′_2 sin π = −x′_1² < 0. Noticing that x′_1² cos α − x′_1 x′_2 sin α is continuous in α, and applying the intermediate value theorem, we know that 0 < α < π and x′_1² cos α − x′_1 x′_2 sin α = 0. Therefore, there exists a sample in S_m that is not on the boundary of f_{w̄′,0} (since x′_1 ≠ 0) but is on the boundary of f_{R_α w̄′,0}. Recall that all points in S_m^0 are orthogonal to w̄′ = (1, 0, 0, ...)^T and v̄′ = (0, 1, 0, ...)^T. Therefore, for all x̄′ ∈ S_m^0,
\[ (R_{\alpha}\bar{w}')^{T}\bar{x}' = x_1'\cos\alpha - x_2'\sin\alpha = \bar{w}'^{T}\bar{x}'\cos\alpha - \bar{v}'^{T}\bar{x}'\sin\alpha = 0, \]
and they reside on the boundary of f_{R_α w̄′,0}. Overall, f_{R_α w̄′,0} correctly classifies S_m and has at least k+1 samples on its boundary. The same argument applies for β by symmetry.
Claim B.7 Using the notation introduced in the proof of Lemma 2.4, if (R_α w̄′)^T x̄′ ≥ 0 and (R_{−β} w̄′)^T x̄′ ≥ 0, then w̄′^T x̄′ ≥ 0.

Proof If (R_α w̄′)^T x̄′ ≥ 0 and (R_{−β} w̄′)^T x̄′ ≥ 0, then
\[ \begin{cases} x_1'\cos\alpha - x_2'\sin\alpha \ge 0, \\ x_1'\cos\beta + x_2'\sin\beta \ge 0. \end{cases} \]
Multiplying the first inequality by sin β > 0, the second inequality by sin α > 0, and adding the two, we have
\[ \sin(\alpha+\beta)\,x_1' \ge 0. \]
According to Claim B.8 below, sin(α + β) ≥ 0. If sin(α + β) = 0, then α + β = π and cos(α + β) = −1. Using the trigonometric identities
\[ \cos(\alpha-\beta) = \cos\alpha\cos\beta + \sin\alpha\sin\beta, \qquad \sin(\alpha-\beta) = \sin\alpha\cos\beta - \cos\alpha\sin\beta, \]
we get that
\[ \cos\beta = \cos\big((\beta+\alpha)-\alpha\big) = \cos(\alpha+\beta)\cos\alpha + \sin(\alpha+\beta)\sin\alpha = -\cos\alpha \]
and
\[ \sin\beta = \sin\big((\beta+\alpha)-\alpha\big) = \sin(\alpha+\beta)\cos\alpha - \cos(\alpha+\beta)\sin\alpha = \sin\alpha. \]
Therefore, for any x̄′ with x′_1 cos α − x′_2 sin α > 0, it holds that x′_1 cos β + x′_2 sin β < 0, and R̃⁺ is degenerate, contradicting the fact that S_m is a spanning set of R^d. Therefore, sin(α + β) > 0, x′_1 ≥ 0, and w̄′^T x̄′ ≥ 0.
Claim B.8 Using the notation introduced in the proof of Lemma 2.4, sin(α + β) ≥ 0.

Proof By definition we get that for all samples in S_m,
\[ \begin{cases} x_1'^2\cos\alpha - x_1'x_2'\sin\alpha \ge 0, \\ x_1'^2\cos\beta + x_1'x_2'\sin\beta \ge 0. \end{cases} \]
Multiplying the first inequality by sin β > 0 (0 < β < π), the second inequality by sin α > 0, adding the two, and using the trigonometric identity sin(α + β) = sin α cos β + cos α sin β, we have
\[ \sin(\alpha+\beta)\,x_1'^2 \ge 0. \]
Since there is a sample in S_m with a vector component x′_1 ≠ 0, we conclude that sin(α + β) ≥ 0.
Appendix C
Extended Minimax Model
We define a learning algorithm ALG to be a (random) mapping from a sample Sm
to a selective predictor (f, g). We evaluate learners with respect to their coverage
and risk and derive both positive and negative results on achievable risk and coverage. Our model is a slight extension of the standard minimax model for standard
statistical learning as described, e.g., by Antos and Lugosi [4]. Thus, we consider
the following game between the learner and an adversary. The parameters of the
game are a domain X and an hypothesis class F.
1. A tolerance level δ and a training sample size m are given.
2. The learner chooses a learning algorithm ALG.
3. With full knowledge of the learner's choice, the adversary chooses a distribution P(X, Y) over X × Y.
4. A training sample Sm is drawn i.i.d. according to P.
5. ALG is applied on Sm and outputs a selective predictor (f, g).
The result of the game is evaluated in terms of the risk and coverage obtained by the chosen selective predictor; clearly, these are random quantities that trade off against each other. A positive result in this model is a pair of bounds, B_R = B_R(F, δ, m) and B_Φ = B_Φ(F, δ, m), for risk and coverage, respectively, that for any δ and m hold, with probability of at least 1 − δ, for any distribution P; namely,
\[ \Pr\{R(f, g) \le B_R \wedge \Phi(f, g) \ge B_\Phi\} \ge 1 - \delta. \]
The probability is taken w.r.t. the random choice of training samples Sm , as well
as w.r.t. all other random choices introduced, such as a random choice of (f, g) by
ALG (if applicable), and the randomized selection function.
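The two quantities this game is scored on, the coverage Φ(f, g) = E[g(X)] and the selective risk R(f, g) of Definition 1.4, are straightforward to estimate on a finite sample. The following minimal Python sketch (assuming NumPy; the function names and the toy data are illustrative only and not part of the model) shows how such estimates can be formed.

import numpy as np

def empirical_coverage(g_vals):
    # Phi(f, g) = E[g(X)], estimated by the average selection value.
    return float(np.mean(g_vals))

def empirical_selective_risk(losses, g_vals):
    # R(f, g) = E[loss(f(X), Y) g(X)] / E[g(X)]  (Definition 1.4).
    cov = np.mean(g_vals)
    return float(np.mean(losses * g_vals) / cov) if cov > 0 else 0.0

# Toy usage: 0/1 losses and a selection function that accepts ~80% of the points.
rng = np.random.default_rng(0)
losses = (rng.random(1000) < 0.1).astype(float)
g_vals = (rng.random(1000) < 0.8).astype(float)
print(empirical_coverage(g_vals), empirical_selective_risk(losses, g_vals))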
A negative result is a probabilistic statement on the impossibility of any positive result. Thus, in its most general form a negative result is a pair of bounds BR
and BΦ that, for any δ, satisfy
Pr {R(f, g) ≥ BR ∨ Φ(f, g) ≤ BΦ } ≥ δ,
for some distribution P. Here again, the probability is taken w.r.t. the random choice
of the training samples Sm , as well as w.r.t. all other random choices.
For a selective predictor (f, g) with coverage Φ(f, g) we can specify a Risk-Coverage (RC) trade-off as a bound on the risk R(f, g), expressed in terms of Φ(f, g). Thus, a positive result on the RC trade-off is a probabilistic statement of the following form:
Pr {R(f, g) ≤ B(Φ(f, g), δ, m)} ≥ 1 − δ.
Similarly, a negative result on the RC trade-off is a statement of the form,
Pr {R(f, g) ≥ B(Φ(f, g), δ, m)} ≥ δ.
Clearly, all results (positive and negative) are qualified by the model parameters,
namely the domain X and the hypothesis space F, and the quality/generality of a
result should be assessed w.r.t. the generality of these parameters.
Appendix D
Realizable Selective Classification: RC trade-off analysis
The performance of any selective classifier (f, g) can be characterized by its risk and coverage, and can be depicted as a point in the risk-coverage (RC) plane. In Figure D.1 we schematically depict elements of the risk-coverage (RC) trade-off. The x-axis is coverage and the y-axis is risk on the accepted domain (error in the case of the 0/1 loss). The entire region depicted, called the RC plane, consists of all (r, c) points in the rectangle of interest, where r is a risk (error) coordinate and c is a coverage coordinate. Assume a fixed problem setting (including an unknown underlying distribution P, m training examples drawn i.i.d. from P, an hypothesis class F, and a tolerance parameter δ). To fully characterize the RC trade-off we need to determine, for each point (r, c) on the RC plane, whether it is (efficiently) "achievable." We say that (r, c) is (efficiently) achievable if there is an (efficient) learning algorithm that will output a selective classifier (f, g) such that, with probability of at least 1 − δ, its coverage is at least c and its risk is at most r.
Notice that point r∗ (the coordinate (r∗, 1)), where the coverage is 1, represents "standard learning." At this point we require full coverage with certainty, and the achievable risk represents the lowest possible risk in our fixed setting (which should be achievable with probability of at least 1 − δ). Point r∗ represents one extreme of the RC trade-off. The other extreme of the RC trade-off is point c∗, where we require zero risk with certainty. The coverage at c∗ is the optimal (highest possible) in our setting when zero error is required. We call point c∗ perfect learning because achievable perfect learning means that we can generate a classifier that, with certainty, never errs on the problem at hand. Note that at the outset, it is not at all clear whether non-trivial perfect learning (with guaranteed positive coverage) can be accomplished.

Figure D.1: The RC plane and RC trade-off
The full RC trade-off is some (unknown) curve connecting points c∗ and r ∗ .
This curve passes somewhere in the zone labeled with a question mark and represents optimal selective classification. Points above this curve (e.g., at zone A) are
achievable. Points below this curve (e.g., at zone B) are not achievable.
In this appendix we study the RC curve in the realizable setting and provide as tight as possible boundaries for it. To this end we characterize upper and lower envelopes of the RC curve, as schematically depicted in Figure D.1. The upper envelope is a boundary of an "achievable zone" (zone A), and therefore we consider any upper envelope a "positive result." The lower envelope is a boundary of a "non-achievable zone" (zone B), and is therefore considered a "negative result." Note that upper and lower envelopes, as depicted in the figure, represent two different things, which are formally defined in Appendix C as probabilistic statements on possibility and impossibility.

In Chapter 3 we have shown that by compromising the coverage we can achieve zero risk. This is in contrast to the classical setting, where we compromise risk to achieve full coverage. Is it possible to learn a selective classifier with full control over this trade-off? What are the performance limitations of this trade-off control? In this appendix we present some answers to these questions, thus deriving lower and upper envelopes for the risk-coverage (RC) trade-off. These results heavily rely on the previous results on perfect learning and on classical results on standard learning without rejection. The envelopes are obtained by interpolating bounds on these two extreme types of learning. We begin by deriving an upper envelope; that is, we introduce a strategy that can control the RC trade-off.

Our upper RC envelope is facilitated by the following strategy, which is a generalization of the consistent selective classification strategy (CSS) of Definition 3.1.
Definition D.1 (controllable selective strategy) Given a mixing parameter 0 ≤
α ≤ 1, the controllable selective strategy chooses a selective classifier (f, g) such
that f is in the version space V SF ,Sm (as in CSS), and g is defined as follows:
g(x) = 1 for any x in the maximal agreement set, A, with respect to V SF ,Sm , and
g(x) = α for any x ∈ X \ A.
Clearly, CSS is a special case of the controllable selective strategy, obtained with α = 0, and standard consistent learning (in the classical setting) is the special case obtained with α = 1; a small simulation sketch of the strategy is given after Theorem D.1 below. We now state a well-known (and elementary) upper bound for classical realizable learning.
Theorem D.1 ([53]) Let F be any finite hypothesis class, and let f ∈ V S_{F,S_m} be a classifier chosen by any consistent learner. Then, for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,
\[ R(f) \le \frac{1}{m}\left(\ln|\mathcal{F}| + \ln\frac{1}{\delta}\right), \]
where R(f) is the standard risk (true error) of the classifier f.
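The following is the small simulation sketch referred to above (assuming NumPy; the class of one-dimensional threshold classifiers, the grid size, and all names are illustrative choices rather than constructions from the thesis). It builds the version space from a realizable sample and implements g of Definition D.1: g(x) = 1 on the maximal agreement set and g(x) = α elsewhere; α = 0 recovers CSS, so the risk on accepted points is zero in this realizable toy setting.

import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class: thresholds f_t(x) = 1{x >= t} on a grid of 101 values.
thresholds = np.linspace(0.0, 1.0, 101)
predict = lambda t, x: (x >= t).astype(int)

# Realizable training sample labeled by a target hypothesis taken from the class.
t_star = thresholds[42]
X_train = rng.random(30)
y_train = predict(t_star, X_train)

# Version space: hypotheses consistent with the training sample.
VS = np.array([t for t in thresholds if np.all(predict(t, X_train) == y_train)])

def controllable_selective(x, alpha):
    # g = 1 on the maximal agreement set of the version space, g = alpha elsewhere
    # (Definition D.1); f is any member of the version space.
    votes = predict(VS[:, None], x)               # shape (|VS|, len(x))
    agree = votes.min(axis=0) == votes.max(axis=0)
    f_x = votes[0]
    g_x = np.where(agree, 1.0, alpha)
    return f_x, g_x

X_test = rng.random(1000)
f_x, g_x = controllable_selective(X_test, alpha=0.0)   # alpha = 0 recovers CSS
coverage = g_x.mean()
risk = np.mean((f_x != predict(t_star, X_test)) * g_x) / max(coverage, 1e-12)
print("coverage:", coverage, "selective risk:", risk)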
The following result provides a distribution-independent upper bound on the risk
of the controllable selective strategy as a function of its coverage.
Theorem D.2 (upper envelope) Let F be any finite hypothesis class and let (f, g) be a selective classifier chosen by a controllable selective learner after observing a training sample S_m. Then, for any 0 ≤ δ ≤ 1, with probability of at least 1 − δ,
\[ R(f, g) \le \frac{1 - \Phi_0/\Phi(f, g)}{1 - \Phi_0}\cdot\frac{1}{m}\left(\ln|\mathcal{F}| + \ln\frac{2}{\delta}\right), \]
where
\[ \Phi_0 \triangleq 1 - \frac{1}{m}\left((\ln 2)|\mathcal{F}| + \ln\frac{2}{\delta}\right). \]
Proof For any controllable selective learner with a mixing parameter α we have
\[ \Phi(f, g) = \mathbb{E}[g(X)] = \mathbb{E}[I(g(X) = 1)] + \alpha\,\mathbb{E}[I(g(X) \ne 1)]. \]
By Theorem 3.2, with probability of at least 1 − δ/2,
\[ \mathbb{E}[I(g(X) = 1)] \ge 1 - \frac{1}{m}\left((\ln 2)|\mathcal{F}| + \ln\frac{2}{\delta}\right) \triangleq \Phi_0. \]
Therefore, since Φ(f, g) ≤ 1,
\[ \alpha = \frac{\Phi(f, g) - \mathbb{E}[I(g(X) = 1)]}{1 - \mathbb{E}[I(g(X) = 1)]} \le \frac{\Phi(f, g) - \Phi_0}{1 - \Phi_0}. \]
Using the law of total expectation we get
\[ \mathbb{E}[\ell(f(X), Y)\,g(X)] = \underbrace{\mathbb{E}[\ell(f(X), Y)\,g(X) \mid g(X) = 1]\cdot\Pr(g(X) = 1)}_{=\,0} + \mathbb{E}[\ell(f(X), Y)\,g(X) \mid g(X) = \alpha]\cdot\Pr(g(X) = \alpha) = \alpha\,\mathbb{E}[\ell(f(X), Y) \mid g(X) = \alpha]\cdot\Pr(g(X) = \alpha) = \alpha\,\mathbb{E}[\ell(f(X), Y)]. \]
According to Definition 1.4,
\[ R(f, g) = \frac{\mathbb{E}[\ell(f(X), Y)\,g(X)]}{\Phi(f, g)} = \frac{\alpha\,\mathbb{E}[\ell(f(X), Y)]}{\Phi(f, g)} = \frac{\alpha\,R(f)}{\Phi(f, g)}. \]
Applying Theorem D.1 together with the union bound completes the proof.
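As a quick numeric illustration of the bound of Theorem D.2 (assuming NumPy; m, |F| and δ below are arbitrary illustrative values), the following evaluates Φ0 and the resulting risk bound for a few coverage levels.

import numpy as np

m, size_F, delta = 2000, 100, 0.05
phi0 = 1 - (np.log(2) * size_F + np.log(2 / delta)) / m
for phi in (0.97, 0.99, 0.999):
    bound = (1 - phi0 / phi) / (1 - phi0) * (np.log(size_F) + np.log(2 / delta)) / m
    print(f"coverage {phi}: risk bound {max(bound, 0.0):.5f}  (Phi_0 = {phi0:.4f})")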
We now present a negative result which identifies a region of non-achievable coverage-risk trade-off on the RC plane. The statement is a probabilistic lower bound on the risk of any selective classifier, expressed as a function of the coverage. It negates any high-probability upper bound on the risk of the classifier (where the probability is over the choice of Sm and the target hypothesis).
Theorem D.3 (non-achievable risk-coverage trade-off) Let F be any hypothesis class with VC-dimension d, and let 0 ≤ δ ≤ 1/4 and m be given. There exists a distribution P (that depends on F) such that for any selective classifier (f, g), chosen using a training sample S_m drawn i.i.d. according to P, with probability of at least δ,
\[ R(f, g) \ge \frac{1}{2} - \frac{1}{2\Phi(f, g)} + \frac{1}{4\Phi(f, g)}\cdot\min\left\{1,\; \frac{d}{4m} - \frac{4}{3m}\ln\frac{1}{1 - 2\delta}\right\}. \]
Proof If d is the VC-dimension of the hypothesis class F, there exists a set of data points X′ = {e_1, e_2, ..., e_d} shattered by F. Let X ≜ X′ ∪ {e_{d+1}}. The bad distribution is constructed as follows. Define Bin(m, k, δ), the binomial tail inversion,
\[ \mathrm{Bin}\left(m, \tfrac{d}{2}, 2\delta\right) \triangleq \max_{p}\left\{p : \mathrm{Bin}\left(m, \tfrac{d}{2}, p\right) \ge 2\delta\right\}, \]
where Bin(m, k, p) is the binomial tail. Define P to be the source distribution over X satisfying
\[ P(e_i) \triangleq \begin{cases} \mathrm{Bin}\left(m, \tfrac{d}{2}, 2\delta\right)/d, & \text{if } i \le d; \\ 1 - \mathrm{Bin}\left(m, \tfrac{d}{2}, 2\delta\right), & \text{otherwise.} \end{cases} \]
Assuming that the training sample is selected i.i.d. from P, it follows that with probability of at least 2δ,
\[ \left|\{x \in S_m : x \in X'\}\right| \le \frac{d}{2}. \]
F shatters X′, thus inducing all dichotomies over X′. Every sample from X′ can reduce the version space by half, so with probability of at least 2δ, the version space V S_{F,S_m} includes all dichotomies over at least d/2 instances. Therefore, over these instances (referred to as x_1, x_2, ..., x_{d/2}), with probability of 1/2 the error is at least 1/2.¹
\[ \Phi(f, g) = \sum_{i=1}^{d+1} P(e_i)\,g(e_i) = P(e_1)\sum_{i=1}^{d} g(e_i) + P(e_{d+1})\,g(e_{d+1}) \le P(e_1)\sum_{i=1}^{d/2} g(x_i) + \frac{d}{2}P(e_1) + P(e_{d+1}) = P(e_1)\sum_{i=1}^{d/2} g(x_i) + 1 - \frac{d}{2}P(e_1). \]
Therefore,
\[ P(e_1)\sum_{i=1}^{d/2} g(x_i) \ge \Phi(f, g) + \frac{d}{2}P(e_1) - 1. \]
By Definition 1.4,
\[ \Phi(f, g)\,R(f, g) = \mathbb{E}\big[\ell(f(X), f^*(X))\,g(X)\big] = \sum_{i=1}^{d+1} P(e_i)\,g(e_i)\,I(f(e_i) \ne f^*(e_i)) \ge \frac{1}{2}\sum_{i=1}^{d/2} P(x_i)\,g(x_i) \ge \frac{\Phi(f, g) - 1}{2} + \frac{d}{4}P(e_1) = \frac{\Phi(f, g) - 1}{2} + \frac{1}{4}\mathrm{Bin}\left(m, \tfrac{d}{2}, 2\delta\right). \]
Applying Lemma B.2 completes the proof.
¹According to the game-theoretic setting, the adversary can choose a distribution over F. In this case the expectation in the risk is averaged over random instances and random labels. Therefore, the error over the instances x_1, x_2, ..., x_{d/2} is exactly 1/2 and we can replace the term 2δ with δ.

Corollary D.4 Let 0 ≤ δ ≤ 1/4, m, and n > 1 be given. There exist a distribution P, that depends on m and n, and a finite hypothesis class F of size n and VC-dimension d, such that for any selective classifier (f, g), chosen using a training sample S_m drawn i.i.d. according to P, with probability of at least δ, if
\[ \Phi(f, g) \ge \max\left\{\frac{3}{4},\; 1 - \frac{1}{16m}\left(d - \frac{16}{3}\ln\frac{1}{1 - 2\delta}\right)\right\} \]
then
\[ R(f, g) \ge \frac{1}{16m}\left(d - \frac{16}{3}\ln\frac{1}{1 - 2\delta}\right). \]
Proof Assuming
\[ \Phi(f, g) \ge \max\left\{\frac{3}{4},\; 1 - \frac{1}{16m}\left(d - \frac{16}{3}\ln\frac{1}{1 - 2\delta}\right)\right\}, \]
we apply Theorem D.3 to complete the proof.
Appendix E
Related Work
The idea of selective classification (classification with a reject option) dates back to Chow's seminal papers [22, 23]. These papers analyzed both the Bayes-optimal reject decision and the reject-rate vs. error trade-off, under the 0/1 loss function and assuming that the underlying distribution is completely known. The Bayes-optimal rejection policy is based, as in standard classification, on maximum a posteriori probabilities: an instance should be rejected whenever none of the a posteriori probabilities is sufficiently dominant. This type of rejection can be termed ambiguity-based rejection. More specifically, let ω_i denote class i and P(ω_i | x) the a posteriori probability of class i given x. Then, according to Chow's decision rule, for any t ∈ [0, 1], a sample x should be rejected if
\[ \max_{k} P(\omega_k \mid x) < t. \]
Chow's rule is optimal in the sense that no other rule can result in a lower error rate, given a specific reject rate controlled by t. One of Chow's main results (for the case of complete probabilistic knowledge) is that the optimal RC trade-off is monotonically increasing.
While the optimal decision can be identified in the case of complete probabilistic knowledge, it was argued [40] that when the a posteriori probabilities are
estimated with errors, Chow’s rule [23] does not provide the optimal error-reject
trade-off. Papers [79] and [74] discussed Bayesian-optimal decisions in the case
of arbitrary cost matrices. In these papers the optimal reject rule was chosen using the ROC curve evaluated on a subset of the training data. As in most papers
on the subject emerging from the engineering community [40, 39, 72, 19, 63], no
probabilistic or other guarantees are provided for the misclassification error.
A very common rejection model in the literature is the cost model, whereby a
specific cost d is associated with rejection [79] and the objective is to minimize the
generalized rejective risk function,
\[ \ell_c(f, g) \triangleq d\cdot\mathbb{E}[1 - g(X)] + \mathbb{E}[I(f(X) \ne Y)\,g(X)]. \tag{E.1} \]
Given our definitions of risk and coverage, the function (E.1) can be easily expressed as a function over the RC plane,
\[ \ell_c(R, \Phi) = d\,(1 - \Phi(f, g)) + R(f, g)\,\Phi(f, g). \tag{E.2} \]
For any fixed d, Equation (E.2) defines level sets (or elevation contour lines) over
the RC plane. This popular cost model was refined to allow one to differentiate between the costs of false positives and false negatives, as well as different costs for rejection of positive and negative samples [56, 72, 79, 74]. Such extensions or
refinements are appealing because they allow for additional control and more flexibility in modeling the problem at hand. Nevertheless, these cost models are often
criticized for their impracticality in applications where it is impossible or hard to
precisely quantify the cost of rejection. It is interesting to note that for an ideal
Bayesian setting, where the underlying distribution is completely known, Chow
showed [23] that the cost d upper bounds the probability of misclassification. In
this case one can control the classification error by specifying a matching rejection
cost.
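The following sketch (assuming NumPy; d and the toy losses/selections are illustrative) estimates the rejective risk of Equation (E.1) directly and through its RC-plane form (E.2); the two estimates coincide by construction.

import numpy as np

def rejective_cost(losses, g_vals, d):
    # Direct form, Equation (E.1), estimated on a sample.
    return d * np.mean(1 - g_vals) + np.mean(losses * g_vals)

def rejective_cost_rc(R, Phi, d):
    # Equivalent RC-plane form, Equation (E.2).
    return d * (1 - Phi) + R * Phi

rng = np.random.default_rng(0)
losses = (rng.random(10_000) < 0.08).astype(float)   # 0/1 losses of f
g_vals = (rng.random(10_000) < 0.9).astype(float)    # accept ~90% of the points
Phi = g_vals.mean()
R = np.mean(losses * g_vals) / Phi
d = 0.2
print(rejective_cost(losses, g_vals, d), rejective_cost_rc(R, Phi, d))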
In [72], two additional optimization models are introduced. The first is the
bounded-improvement model, in which, given a constraint on the misclassification cost, the classifier should reject as few samples as possible. The second is
the bounded-abstention model, in which, given a constraint on the coverage, the
classifier should have the lowest misclassification cost. It is argued in [72] that
these models are more suitable than the above cost model in many applications,
for instance, when a classifier with limited classification throughput (e.g., a human
expert) should handle the rejected instances, and in medical and quality assurance
applications, where the goal is to reduce the misclassification cost to a user-defined
value.
Very few studies have focused on error bounds for classifiers with a reject option. Hellman [55] proposed and analyzed two rejection rules for the nearest neighbor algorithm. Extending Cover and Hart’s classic result for the 1-nearest neighbor
algorithm [26], Hellman showed that the test error (over non-rejected points) of
a nearest neighbor algorithm with a reject option can be bounded asymptotically
(as the sample size approaches infinity) by some factor of the Bayes error (with
reject). To the best of our knowledge, this excess risk bound is the first that has
been introduced in the context of classification with a reject option.
Herbei and Wegkamp [56] developed excess risk bounds for the classification
with a reject option setting where the loss function is the 0-1 loss, extended such
that the cost of each reject point is 0 ≤ d ≤ 1/2. This result generalizes the excess risk bounds of Tsybakov [80] for standard binary classification without reject
(which is equivalent to the case d = 1/2). The bound applies to any empirical
error minimization technique. This result is further extended in [13, 88] in various
ways, including the use of the hinge loss function for efficient optimization. The
main results of Herbei and Wegkamp (both for plug-in rules and empirical risk
minimization) degenerate, in the realizable case, to a meaningless bound, where
classification with a reject option is not guaranteed to be any better than classification without reject. Furthermore, Herbei and Wegkamp's results are limited only to
the cost model.
Freund et al. [35] studied a simple ensemble method for binary classification. Given a hypothesis class F, the method outputs a weighted average of all the hypotheses in F such that the weight of each hypothesis depends exponentially on its individual training error. Their algorithm abstains from prediction whenever the weighted average of all individual predictions is inconclusive (i.e., sufficiently close to zero). Two regret bounds for this algorithm were derived. The first bounds (w.h.p.) the risk of the selective classifier by
\[ R(f, g) \le 2R(f^*) + O\!\left(\frac{1}{m^{1/2-\theta}}\right), \]
where 0 < θ < 1/2 is a hyperparameter. The authors also proved that for a sufficiently large training sample size,
\[ m = \Omega\!\left(\left(\ln(|\mathcal{F}|)\sqrt{\ln\frac{1}{\delta}}\right)^{1/\theta}\right), \]
the probability that the algorithm will abstain from prediction is bounded by
\[ \Phi(f, g) \ge 1 - 5R(f^*) - O\!\left(\sqrt{\frac{\ln|\mathcal{F}|}{m^{1/2-\theta}}}\right). \]
To the best of our knowledge, these bounds are the first to provide some guarantee for both the error of the classifier and the coverage. Therefore, these results
are related to the bounded-improvement and bounded-abstention models. As was
rightfully stated by the authors, the final aggregated hypothesis can significantly
outperform the best base-hypothesis in F in some favorable situations. Unfortunately, the regret bound provided does not exploit these situations, as it is bounded
by twice the generalization error of the best hypothesis.
In a series of papers (see, e.g., [75, 85, 84]) culminating in a book [86], Vovk et
al. presented transductive confidence machines and conformal predictions. This innovative approach is mainly concerned with an online probabilistic setting. Rather
than predicting a single label for each sample point, a “conformal predictor” can
assign multiple labels. Given a confidence level ǫ, it is shown how the error rate ǫ
can be asymptotically guaranteed. Interpreting the uncertain (multi-labeled) predictions as abstentions, one can construct online predictors with a reject option that
have asymptotic performance guarantees. A few important differences between
conformal predictions and our results can be pointed out. While both approaches
provide “hedged” predictions, they use different notions of hedging. Whereas our
goal is to guarantee that with high probability over the training sample our classifier agrees with the best classifier in the class over all points in the accepted domain,
the goal in conformal predictions is to provide guarantees for the average error rate
where the average is taken over all possible samples and test points.1 In this sense
conformal prediction cannot achieve pointwise competitiveness. The authors’ setting also utilizes a different notion of error than the one we use. While we provide
performance guarantees for the error rate only on the covered (accepted) examples,
they provided a guarantee for all examples (including those that have multiple predictions or none at all). By increasing the multi-labeled prediction rate (uncertain
prediction), their error rate can be decreased to any arbitrarily small value. This
is not the case with our error notion on the covered examples, which is, of course,
bounded below by the Bayes error on the covered region. This definition of error
¹As noted by Vovk et al.: "It is impossible to achieve the conditional probability of error equal to ǫ given the observed examples, but it is the unconditional probability of error that equals ǫ. Therefore, it implicitly involves averaging over different data sequences..." [86, page 295].
is also adopted by most other studies of classification with a reject option. The vast majority of our work is focused on the study of coverage rates. In conformal prediction an efficiency notion, similar to coverage rate, is mentioned, but no finite-sample results are derived. Moreover, all the negative results provided in this work apply to conformal prediction; hence, distribution-independent efficiency results are out of reach.
Bibliography
[1] N. Ailon, R. Begleiter, and E. Ezra. Active learning using smooth relative regret approximations with applications. In 25th Annual Conference on Learning Theory (COLT), 2012.
[2] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1):181–210, 1995.
[3] M. Anthony and P.L. Bartlett. Neural Network Learning; Theoretical
Foundations. Cambridge University Press, 1999.
[4] A. Antos and G. Lugosi. Strong minimax lower bounds for learning.
Machine Learning, 30(1):31–56, 1998.
[5] L. Atlas, D. Cohn, R. Ladner, M. A El-Sharkawi, and R.J. Marks II.
Training Connectionist Networks with Queries and Selective Sampling.
Morgan Kaufmann Publishers Inc., 1990.
[6] P. Auer. Using confidence bounds for exploitation-exploration tradeoffs. The Journal of Machine Learning Research, 3:397–422, 2003.
[7] Ö. Ayşegül, G. Mehmet, A. Ethem, and H. Türkan. Machine learning
integration for predicting the effect of single amino acid substitutions
on protein stability. BMC Structural Biology, 9(1):66, 2009.
[8] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine
learning, pages 65–72. ACM, 2006.
[9] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
[10] M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity
of active learning. In 21st Annual Conference on Learning Theory
(COLT), pages 45–56, 2008.
[11] P.L. Bartlett and S. Mendelson. Discussion of "2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii. Annals of Statistics, 34:2657–2663, 2006.
[12] P.L. Bartlett, S. Mendelson, and P. Philips. Local complexities for empirical risk minimization. In COLT: Proceedings of the Workshop on
Computational Learning Theory. Morgan Kaufmann Publishers, 2004.
[13] P.L. Bartlett and M.H. Wegkamp. Classification with a reject option
using a hinge loss. Journal of Machine Learning Research, 9:1823–
1840, 2008.
[14] S. Ben-David, N. Eiron, and P.M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
[15] J.L. Bentley, H.T. Kung, M. Schkolnick, and C.D. Thompson. On the
average number of maxima in a set of vectors and applications. JACM:
Journal of the ACM, 25, 1978.
[16] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted
active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56. ACM, 2009.
[17] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active
learning without constraints. Advances in Neural Information Processing Systems 23, 2010.
[18] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. JACM: Journal of the
ACM, 36, 1989.
[19] A. Bounsiar, E. Grall, and P. Beauseroy. A kernel based rejection
method for supervised classification. International Journal of Computational Intelligence, 3:312–321, 2006.
[20] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical
learning theory. In Advanced Lectures on Machine Learning, volume
3176 of Lecture Notes in Computer Science, pages 169–207. Springer,
2003.
[21] C.C. Chang and C.J. Lin.
LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at
”http://www.csie.ntu.edu.tw/ cjlin/libsvm”.
[22] C.K. Chow. An optimum character recognition system using decision
function. IEEE Trans. Computer, 6(4):247–254, 1957.
[23] C.K. Chow. On optimum recognition error and reject trade-off. IEEE Trans. on Information Theory, 16:41–46, 1970.
[24] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active
learning. Machine Learning, 15(2):201–221, 1994.
[25] C. Cortes and V. Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[26] T.M. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[27] S. Dasgupta. Coarse sample complexity bounds for active learning.
In Advances in Neural Information Processing Systems 18, pages 235–
242, 2005.
[28] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active
learning algorithm. In NIPS, 2007.
[29] S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of
perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
[30] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern
recognition, volume 31. New York: Springer, 1996.
[31] R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective
classification. Journal of Machine Learning Research, 11:1605–1641,
2010.
[32] R. El-Yaniv and Y. Wiener. Agnostic selective classification. In Neural
Information Processing Systems (NIPS), 2011.
[33] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13:255–279, 2012.
[34] S. Fine, R. Gilad-Bachrach, and E. Shamir. Query by committee,
linear separation and random walks. Theoretical Computer Science,
284(1):25–51, 2002.
[35] Y. Freund, Y. Mansour, and R.E. Schapire. Generalization bounds for
averaged classifiers. Annals of Statistics, 32(4):1698–1722, 2004.
[36] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and Query by Committee. In Advances in Neural Information
Processing Systems (NIPS) 5, pages 483–490, 1993.
[37] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling
using the query by committee algorithm. Machine Learning, 28:133–
168, 1997.
[38] E. Friedman. Active learning for smooth problems. In Proceedings of
the 22nd Annual Conference on Learning Theory, 2009.
[39] G. Fumera and F. Roli. Support vector machines with embedded reject
option. In Pattern Recognition with Support Vector Machines: First
International Workshop, pages 811–919, 2002.
[40] G. Fumera, F. Roli, and G. Giacinto. Multiple reject thresholds for
improving classification reliability. Lecture Notes in Computer Science,
1876, 2001.
[41] J.E. Gentle. Numerical Linear Algebra for Applications in Statistics.
Springer Verlag, 1998.
[42] R. Gilad-Bachrach. To PAC and Beyond. PhD thesis, the Hebrew University of Jerusalem, 2007.
[43] S. Goldman and M. Kearns. On the complexity of teaching. JCSS:
Journal of Computer and System Sciences, 50, 1995.
[44] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support
vector machines with a reject option. In NIPS, pages 537–544. MIT
Press, 2008.
[45] B. Hanczar and E.R. Dougherty. Classification with reject option in
gene expression data. Bioinformatics, 24(17):1889–1895, 2008.
[46] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353–360, 2007.
[47] S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory
(COLT), volume 4539 of Lecture Notes in Artificial Intelligence, pages
66–81, 2007.
[48] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis,
Carnegie Mellon University, 2009.
[49] S. Hanneke. Rates of convergence in active learning. Annals of Statistics, 37(1):333–361, 2011.
[50] S. Hanneke. A statistical theory of active learning. Unpublished, 2013.
[51] S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. The Journal of Machine Learning Research, 13:1469–1587, 2012.
[52] R. Hartley and A. Zisserman. Multiple View Geometry in Computer
Vision, volume 2. Cambridge Univ Press, 2000.
[53] D. Haussler. Quantifying inductive bias: AI learning algorithms
and Valiant’s learning framework. Artificial intelligence, 36:177–221,
1988.
[54] T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers, 1995.
[55] M.E. Hellman. The nearest neighbor classification rule with a reject
option. IEEE Trans. on Systems Sc. and Cyb., 6:179–185, 1970.
[56] R. Herbei and M.H. Wegkamp. Classification with reject option. The
Canadian Journal of Statistics, 34(4):709–721, 2006.
[57] W. Hoeffding. Probability inequalities for sums of bounded random
variables. Journal of the American Statistical Association, 58(301):13–
30, March 1963.
[58] D. Hug, G. O. Munsonious, and M. Reitzner. Asymptotic mean values
of Gaussian polytopes. Beiträge Algebra Geom., 45:531–548, 2004.
[59] D. Hug and M. Reitzner. Gaussian polytopes: variances and limit theorems. Advances in applied probability, 37(2):297–320, 2005.
[60] B. Kégl. Robust regression by boosting the median. Learning Theory
and Kernel Machines, pages 258–272, 2003.
[61] R.M. Kil and I. Koo. Generalization bounds for the regression of real-valued functions. In Proceedings of the 9th International Conference
on Neural Information Processing, volume 4, pages 1766–1770, 2002.
[62] V. Koltchinskii. 2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34:2593–2656, 2006.
[63] T.C.W. Landgrebe, D.M.J. Tax, P. Paclı́k, and R.P.W. Duin. The interaction between classification and reject performance for distance-based
reject-option classifiers. Pattern Recognition Letters, 27(8):908–917,
2006.
[64] J. Langford. Tutorial on practical prediction theory for classification.
JMLR, 6:273–306, 2005.
[65] W.S. Lee. Agnostic Learning and Single Hidden Layer Neural Networks. PhD thesis, Australian National University, 1996.
[66] L. Li and M. L. Littman. Reducing reinforcement learning to kwik
online regression. Annals of Mathematics and Artificial Intelligence,
pages 217–237, 2010.
[67] L. Li, M.L. Littman, and T.J. Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International
Conference on Machine learning, pages 568–575. ACM, 2008.
[68] P. Massart and É. Nédélec. Risk bounds for statistical learning. The
Annals of Statistics, 34(5):2326–2366, 2006.
[69] C. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16, pages 195–248. Springer-Verlag, 1998.
[70] P.S. Meltzer, J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, and C. Peterson. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), June 2001.
[71] T. Mitchell. Version spaces: a candidate elimination approach to rule
learning. In IJCAI’77: Proceedings of the 5th International Joint Conference on Artificial intelligence, pages 305–310, 1977.
[72] T. Pietraszek. Optimizing abstaining classifiers using ROC analysis.
In Proceedings of the Twenty-Second International Conference on Machine Learning(ICML), pages 665–672, 2005.
[73] F.P. Preparata and M.I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, 1990.
[74] C.M. Santos-Pereira and A.M. Pires. On optimal reject rules and ROC
curves. Pattern Recognition Letters, 26(7):943–952, 2005.
[75] C. Saunders, A. Gammerman, and V. Vovk. Transduction with confidence and credibility. In Proceedings of the 16th International Joint
Conference on Artificial Intelligence (IJCAI), pages 722–726, 1999.
[76] H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In
Proceedings of the Fifth Annual Workshop on Computational Learning
theory (COLT), pages 287–294, 1992.
[77] American Cancer Society. Cancer facts & figures. 2010.
[78] A.L. Strehl and M.L. Littman. Online linear regression
and its application to model-based reinforcement learning. In Advances
in Neural Information Processing Systems, pages 1417–1424, 2007.
[79] F. Tortorella. An optimal reject rule for binary classifiers. Lecture Notes
in Computer Science, 1876:611–620, 2001.
[80] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.
[81] V. Vapnik. Statistical Learning Theory. Wiley Interscience, New York,
1998.
[82] V. Vapnik. An overview of statistical learning theory. Neural Networks,
IEEE Transactions on, 10(5):988–999, 1999.
[83] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability
and its Applications, 16:264–280, 1971.
[84] V. Vovk. On-line confidence machines are well-calibrated. In FOCS:
IEEE Symposium on Foundations of Computer Science (FOCS), pages
187–196, 2002.
[85] V. Vovk, A. Gammerman, and C. Saunders. Machine-learning applications of algorithmic randomness. In Proc. 16th International Conf. on
Machine Learning (ICML), pages 444–453, 1999.
[86] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a
Random World. Springer, New York, 2005.
[87] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. JMLR, pages 2269–2292, 2011.
[88] M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic
Journal of Statistics, 1:155–168, 2007.
[89] Y. Wiener and R. El-Yaniv. Pointwise tracking the optimal regression
function. In Advances in Neural Information Processing Systems 25,
pages 2051–2059, 2012.
[90] L. Yang and J.G. Carbonell. Adaptive Proactive Learning with Cost-Reliability Tradeoff. Carnegie Mellon University, School of Computer
Science, Machine Learning Department, 2009.
[91] A. P. Yogananda, M. Narasimha Murty, and L. Gopal. A fast linear
separability test by projection of positive points on subspaces. In Proceedings of the Twenty-Fourth International Conference on Machine
Learning (ICML 2007), pages 713–720. ACM, 2007.
Abstract

One of the basic problems in machine learning is prediction: based on examples from the past, we are required to draw conclusions about the future. In particular, in classification problems we are required to predict to which class each test example belongs. For example, given a chest X-ray we must decide whether pneumonia is present or not, based on a collection of past images for which the diagnosis is known. Naturally, in order to predict we must make assumptions about the process generating the examples (the chest images) and the labels (the diagnoses). The assumption underlying our work is that the examples and labels are drawn independently and identically from their joint distribution, which is unknown. When the predicted value is continuous, the problem is called a regression problem.

In standard learning the predictor is required to predict a value for every test example. In contrast, in selective prediction a predictor (for example, a classifier or a regressor) is allowed to abstain from prediction on part of the test examples. The goal is to improve prediction accuracy in exchange for giving up coverage of the domain. Selective predictors are particularly attractive in problems where there is no need to label all test examples (for example, when a human expert is available to handle the hard cases), or in cases where an especially small classification error is required, one that cannot be achieved by a standard classifier. This research focuses on the theoretical foundations of selective prediction and on applications to selective classification, selective regression, and active learning.

We define a selective predictor as a pair of functions: a prediction function and a selection function. Given a test example x, the value of the selection function determines whether the example is rejected or not. If the example is accepted (that is, not rejected), the predicted value is determined by the prediction function. The two main characteristics of any selective predictor are its error and its coverage. The error of a selective predictor is defined as the expected error over the selected examples, and the coverage is defined as the probability of labeling an example.

Given a class of predictors (this class may be infinite), we say that a selective predictor is pointwise competitive if, for every test example it labels, its prediction is identical to the prediction of the best predictor in the class. In other words, the selective predictor achieves zero estimation error on the examples it labels. In regression problems we weaken this requirement and only require that the predicted value be within ±ε of the value predicted by the best predictor in the class. A selective predictor that achieves this relaxed requirement is called pointwise ε-competitive.

Research on selective classification, better known as "classification with a reject option," is not new; it began in the late 1950s with the work of Chow [22, 23]. Chow focused on the Bayesian case, in which full knowledge of the source distributions is given. He showed that in this case there exists an optimal rejection rule that achieves minimal generalization error for any given rejection rate. Over the years, a large body of empirical evidence has accumulated attesting to the potential of selective classification for reducing generalization error, and various rejection techniques have been developed for specific classifiers. However, to the best of our knowledge, no formal discussion has so far been conducted of the necessary and sufficient conditions for finding a pointwise competitive selective predictor. The few theoretical works in the field presented several error and coverage bounds, but none of them guarantees convergence (even asymptotically) to the error of the best classifier in the class.

The main results of this research are as follows:

A selective learning strategy (LESS) — We present a family of selective learning strategies called LESS. These strategies achieve pointwise competitiveness in both noise-free and noisy classification problems. That is, given a set of labeled examples and a class of predictors, the selective predictor chosen by LESS labels all accepted test examples identically to the best predictor in the class. For regression problems there is a slightly different version of the same strategy, called ε-LESS, which achieves pointwise ε-competitiveness.

"Lazy" implementation — At first glance it seems that an efficient implementation of the proposed selective strategy is impossible, since we are required to compute the empirical error (the error on the training examples) for an (infinite) set of predictors. However, we present a reduction of the problem to the computation of a small number of predictors with minimal empirical error. Using this reduction we show how the strategy can be implemented efficiently for the case of linear regression with squared loss.

The disbelief principle — Inspired by the "lazy" implementation, we developed a new technique for estimating a predictor's confidence in its own predictions. Suppose we are dealing with a binary classification problem and the predictor predicts a positive value for some test example. We now ask the learner to learn a new predictor based on all the historical data, but under the constraint that the new predictor predicts a negative value for that test example. All that remains is to examine the difference in performance between the two predictors on our historical data. This difference is proportional to the classifier's confidence: the smaller the difference, the less confident the classifier is in its prediction, and vice versa.

The characterizing set complexity — So far we have seen that the proposed strategy achieves pointwise competitiveness and that it can be implemented (or approximated) relatively efficiently. It remains to show that this strategy does not reject all test examples, that is, to lower bound the coverage of the strategy. In order to develop coverage bounds we defined a new complexity measure called the characterizing set complexity. This measure combines aspects of several other complexity measures, including the notion of the disagreement region (used by the disagreement coefficient), the notion of the minimal specifying set (used by the teaching dimension), and the notion of the VC dimension. Using classical results from statistical geometry, we were able to bound the new complexity measure for several interesting distributions, including linear classifiers under any finite mixture of Gaussians.

Coverage rates — The proposed strategy rejects test examples in a most conservative manner: given even the slightest doubt regarding the correctness of the label, it will refuse to label. Nevertheless, and surprisingly, in many problems the strategy labels most of the examples. Moreover, we show that the rejection probability of the strategy decreases at a fast rate as a function of the number of training examples. It turns out that when the predictor class we learn from is infinite, the coverage cannot be lower bounded independently of the source distribution or of the training examples. In this research we develop coverage bounds that depend on the training examples themselves (data-dependent) and coverage bounds that depend on the source distribution (distribution-dependent). Based on the new complexity measure, we show that LESS achieves significant coverage, converging to one, for linear classifiers under any finite mixture of Gaussians. We also develop additional coverage bounds for selective classification under noise and for selective regression.

Fast rates for active learning — In active learning the learner receives the training examples sequentially and must decide, for each example, whether it wants to obtain its label. The active learner is measured by the number of queries (label requests) it asks until a given target error is achieved. The number of label queries as a function of the target error is called the label complexity. One of the greatest hopes in active learning is to achieve a speedup in the learning rate compared to standard learning (in which a label is requested for every example). Only a small number of problems (classifiers and distributions) are known in the literature for which an exponential speedup in the learning rate is provably guaranteed. In particular, the best-known example is that of linear classifiers passing through the origin under a uniform distribution over the sphere. In recent years, however, it was shown that even in simple problems, such as arbitrary linear classifiers under the same distribution over the sphere, no speedup over standard learning can be achieved. This disappointing result discouraged many and raised the question of whether an exponential speedup is possible at all in interesting problems.

In this work we present an equivalence between active learning and noise-free selective classification. This equivalence allows us to derive label complexity bounds by using coverage bounds. Using this technique, we were able to prove an exponential speedup for arbitrary linear classifiers under any finite mixture of Gaussians. The new bounding technique is currently the only known method that allows proving an exponential speedup in the learning rate for several interesting problems. Moreover, the tight connection between active learning and selective classification allows us to bound one of the most important complexity measures in active learning, namely the disagreement coefficient. This bound is based on the new complexity measure we introduced, and all the results we proved for it hold for the disagreement coefficient as well.

Empirical results — We conclude the work with a collection of empirical results demonstrating the advantage of selective prediction in general, and of our strategy for selective prediction in particular.