Upper and Lower Error Bounds for Active Learning
Rui M. Castro and Robert D. Nowak
Abstract— This paper analyzes the potential advantages and theoretical challenges of “active learning” algorithms. Active
learning involves sequential, adaptive sampling procedures that
use information gleaned from previous samples in order to focus
the sampling and accelerate the learning process relative to
“passive learning” algorithms, which are based on non-adaptive
(usually random) samples. There are a number of empirical
and theoretical results suggesting that in certain situations
active learning can be significantly more effective than passive
learning. However, the fact that active learning algorithms
are feedback systems makes their theoretical analysis very
challenging. It is known that active learning can provably
improve on passive learning if the error or noise rate of the
sampling process is bounded. However, if the noise rate is
unbounded, perhaps the situation most common in practice,
then no previously existing theory demonstrates whether or
not active learning offers an advantage. To study this issue,
we investigate the basic problem of learning a threshold
function from noisy observations. We present an algorithm that
provably improves on passive learning, even when the noise is
unbounded. Moreover, we derive a minimax lower bound for
this learning problem, demonstrating that our proposed active
learning algorithm converges at the near-optimal rate.
I. INTRODUCTION
In various learning tasks it is possible to use information
gleaned from previous samples in order to focus the sampling
process, in what is generally referred to as “active learning”.
These methods attempt to accelerate the learning task relative
to “passive learning” algorithms, based on non-adaptive
sampling. A prototypical example is document classification: suppose we are given a text document and want to decide whether its contents pertain to finance, to sports, or to anything else. We are going to learn how to perform this task from examples; that is, we have access to a number of documents that have been labeled by an expert (usually a human), so we know the topics of those documents. In general, labeling examples is expensive and time consuming, so we would like to use as few labeled examples as possible. In many applications we might have access to many unlabeled examples. This
is the case for our prototypical scenario, where one has a
virtually infinite supply of documents. Therefore, ways to
automatically decide whether or not obtaining the label for an
unlabeled example is worthwhile are crucial for the efficient
design of good classifiers.
The interest in active learning in the machine learning community has increased greatly in the last few years, in part due to the dramatic growth of data sets and the high cost of labeling all the examples in such data sets.

This work was partially supported by National Science Foundation grants CCR-0310889, CNS-0519824, and ECS-0529381. Rui Castro is with the Department of Electrical Engineering of the University of Wisconsin - Madison, and with Rice University ([email protected]). Robert Nowak is with the Department of Electrical Engineering, University of Wisconsin - Madison ([email protected]).

There
are several empirical and theoretical results suggesting that
in certain situations active learning can be significantly more
effective than passive learning [2], [7], [8], [12], [17]. Many
of these results pertain to the “noiseless” setting, in which the
labels are a deterministic function of the features. In certain noiseless scenarios it has been shown that the number of labeled examples needed to achieve a desired classification error rate is much smaller than the number needed using passive learning (in fact, for those scenarios, if passive learning requires n labeled examples then active learning requires only O(log n) labeled examples) [9]–[12]. Although this
setting is interesting from a theoretical perspective, it is very
restrictive, and seldom relevant for practical applications.
Some of these results have been extended to the “bounded
noise rate” setting. In this setting labels are no longer a
deterministic function of the features, but for a given feature
the probability of observing a particular label is significantly
higher than the probability of observing any other label. In
the case of binary classification this means that if (X, Y ) is
a feature-label pair, where Y ∈ {0, 1}, then | Pr(Y = 1|X =
x) − 1/2| > µ for every x in the feature space, with µ > 0.
Under this assumption it can be shown that results similar
to the ones for the noiseless scenario can be achieved [1],
[5], [7], [16]. These results are intimately related to adaptive
sampling techniques in regression problems [3], [5], [13],
[14], where similar performance gains have been reported.
In this paper we address scenarios where the noise rate is
unbounded, perhaps the situation most common in practice.
Previously existing theoretical studies have been unable to
answer whether or not active learning is advantageous in this
case. To study this issue, we investigate the basic problem of
learning a threshold function from noisy observations. Since the noise rate is unbounded, the quality of the observations degrades as the samples are taken closer to the threshold location. In other words, the noise level tends to 1/2, the flip of a fair coin, as the sampling locations tend to the threshold location. We present an algorithm that
provably improves on passive learning, even when the noise
is unbounded. Moreover, we derive a minimax lower bound
for this learning problem, demonstrating that our proposed
active learning algorithm converges at a near-optimal rate.
To the best of our knowledge these are the first results of
this kind.
The paper is organized as follows: in Section II we formalize the problem setting. Section III presents the
fundamental limits of active learning, in terms of minimax
lower bounds. In Section IV we present various algorithms,
starting with an algorithm for the bounded noise rate scenario
due to [3], which will be the building block for algorithms
tackling unbounded noise rate problems. Section V presents
some simulation results illustrating our methods and Section VI closes with some final remarks and open questions.
We defer the proofs of the main results to the appendix.
II. PROBLEM FORMULATION
Throughout this paper we focus on a relatively simple
one-dimensional problem. Although this might seem a rather
“toyish” scenario, it displays many of the features that make
active learning appealing and theoretical analysis of algorithms challenging. Consider estimating a threshold function
from noisy samples. This problem boils down to finding the threshold location. Adaptive sampling aims to find this location with a minimal number of strategically placed samples.
Let (X, Y) ∈ [0, 1] × {0, 1} be a pair of random variables with unknown joint distribution PXY. Our goal is to construct a “good”
classification rule, that is, given X we want to predict Y
as accurately as possible, where our classification rule is a
measurable function f : [0, 1] → {0, 1}. The performance of
the classifier is evaluated in terms of the expected loss, in
particular a natural choice to consider is the 0/1-loss. With this choice the risk is simply the probability of classification error,

$$R(f) = \mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] = \Pr(f(X) \neq Y). \qquad (1)$$
Since we are considering only two classes there is a
one-to-one correspondence between classifiers and sets: Any
reasonable classifier is of the form f(x) = 1_G(x), where G is a measurable subset of [0, 1]. We use the term classifier interchangeably for both f and G. Define the optimal risk as

$$R^* = \inf_{G \text{ measurable}} R(G).$$

This minimum risk is attained by the Bayes classifier

$$G^* = \{x \in [0, 1] : \eta(x) \ge 1/2\},$$

where

$$\eta(x) = \mathbb{E}[Y \mid X = x] = \Pr(Y = 1 \mid X = x)$$

is called the conditional probability (we use this term only when it is clear from the context). In general R(G∗) > 0, therefore even the optimal classifier misclassifies sometimes.
The quantity of interest for the performance evaluation of a
classifier G is therefore the excess risk
$$R(G) - R(G^*) = \int_{G \,\Delta\, G^*} |2\eta(x) - 1| \, dP_X(x),$$

where ∆ denotes the symmetric difference between two sets¹, and PX is the marginal distribution of X.

¹ A∆B ≡ (A ∩ Bᶜ) ∪ (Aᶜ ∩ B), where Aᶜ and Bᶜ are the complements of A and B, respectively.

If the distribution PXY is known, then clearly we could just construct the Bayes classifier and we would be done. This is not the case for interesting problems, where we do not have direct access to PXY. In most cases we have to
construct a classifier based on a finite number of samples
of PXY , and this is where most differences between passive
and active learning methods arise. For the purposes of this
paper we consider particular classes of distributions PXY .
We assume that PX is uniform on [0, 1], but the results in
this paper can easily be generalized to the case where the
marginal density of X is bounded above and below. We
assume furthermore that the Bayes classifier has the form
G∗ = [θ∗ , 1], where θ∗ ∈ [0, 1]. This means that there exists
a threshold θ∗ ∈ [0, 1] such that η(x) is less than 1/2 for all
x < θ∗ , and greater or equal to 1/2 for all x ≥ θ∗ .
We assume that we have a large (infinite) pool of examples we can select from. We can choose any feature point
Xi ∈ [0, 1] and observe its label Yi . The data collection
operation has a temporal aspect to it, namely we collect the labeled examples one at a time, starting with (X1, Y1) and proceeding until (Xn, Yn) is observed. Formally we have:
A1 - The observations Yi, i ∈ {1, . . . , n}, are distributed as

$$Y_i = \begin{cases} 1, & \text{with probability } \eta(X_i), \\ 0, & \text{with probability } 1 - \eta(X_i). \end{cases}$$

The random variables Y1, . . . , Yn are conditionally independent given X1, . . . , Xn.
A2.1 - Passive Sampling: The sample locations Xi are independent of {Yj}j≠i.

A2.2 - Active Sampling: The sampling location Xi depends only on {(Xj, Yj)}j<i. In other words,

$$X_i \mid X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n,\,Y_1,\ldots,Y_{i-1},Y_{i+1},\ldots,Y_n \;\stackrel{\text{a.s.}}{=}\; X_i \mid X_1,\ldots,X_{i-1},\,Y_1,\ldots,Y_{i-1}.$$

The conditional distribution on the right-hand side (r.h.s.) of the above expression is called the sampling strategy and is denoted by Sn. It completely defines our sampling schedule.
After collecting the n examples, that is, after collecting {(Xi, Yi)}, i = 1, . . . , n, we construct a classifier Ĝn that is desirably close to G∗. We use the subscript n to denote dependence on the data set, instead of writing it explicitly.
Under the passive sampling scenario (A2.1) the sample
locations do not depend in any way on our observations,
therefore the collection of sample points {Xi }ni=1 can be
chosen before any observations are collected. On the other
hand, the active sampling scenario (A2.2) allows for the
ith sample location to be chosen using all the information
collected up to that point (the previous i − 1 samples).
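In code, the distinction between the two sampling models is simply whether a query location may depend on previously observed labels. The following schematic sketch is ours (illustrative only; `label_at` plays the role of the noisy oracle of A1 and `pick_next` stands in for a sampling strategy Sn):

```python
import numpy as np

def passive_sample(label_at, n, rng):
    """A2.1: all sample locations can be fixed before any label is seen."""
    xs = rng.random(n)                       # i.i.d. uniform design on [0, 1]
    return [(float(x), label_at(float(x))) for x in xs]

def active_sample(label_at, n, pick_next):
    """A2.2: the i-th location may depend on the previous i - 1 pairs."""
    data = []
    for _ in range(n):
        x = pick_next(data)                  # strategy looks at the past
        data.append((x, label_at(x)))
    return data
```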
To be able to present results on rates of convergence of
the excess risk we need to impose further conditions on η(·),
namely on the behavior of the conditional probability around
the 1/2-crossing point θ∗. Let κ ≥ 1 and µ > 0; then

$$|\eta(x) - 1/2| \ge \mu\,|x - \theta^*|^{\kappa-1}, \quad \text{if } |x - \theta^*| \le \epsilon_0, \qquad (2)$$

$$|\eta(x) - 1/2| \ge \mu\,\epsilon_0^{\kappa-1}, \quad \text{if } |x - \theta^*| > \epsilon_0, \qquad (3)$$

for some ε₀ > 0. Examples of such conditional probability
functions are depicted in Figure 1. The conditions above
are very similar to the so-called margin condition (or noise condition) introduced by Tsybakov [18], although our condition is slightly more restrictive. Nevertheless, the key aspect
of these conditions is the behavior of η(·) near the Bayes
decision boundary. Let P(κ, µ) be the class of distributions
PXY such that the marginal PX is uniform in [0, 1] and η(·)
satisfies (2) and (3). If κ = 1 then the η(·) function “jumps”
across 1/2, that is η(·) is bounded away from the value 1/2
(see Figure 1(a)). If κ > 1 then η(·) crosses the value 1/2 at
θ∗ . Arguably the most interesting case corresponds to κ = 2
(Figure 1(b)). In this case the conditional probability behaves
linearly around 1/2. This means that the noise affecting
observations that are made close to the decision boundary
is roughly proportional to the distance to the boundary. This
might be due to the fact that the feature X is not powerful
enough to clearly distinguish the two classes near θ∗ . Finally
if κ > 2 then η(·) is very “flat” around θ∗ .
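To make the margin condition concrete, here is a minimal sketch (ours; the parameter values are arbitrary) of a conditional probability satisfying (2) with κ = 2, together with the label model of assumption A1:

```python
import numpy as np

def eta(x, theta=0.3, mu=0.5, kappa=2.0):
    """Pr(Y = 1 | X = x) with |eta(x) - 1/2| = mu*|x - theta|**(kappa - 1)
    near theta; kappa = 2 gives the linear behavior discussed above."""
    e = 0.5 + mu * np.sign(x - theta) * np.abs(x - theta) ** (kappa - 1.0)
    return float(np.clip(e, 0.0, 1.0))

def sample_label(x, rng):
    """Draw a noisy label Y at location x (assumption A1)."""
    return int(rng.random() < eta(x))
```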
Fig. 1. Examples of two conditional distributions η(x) = Pr(Y = 1|X = x). (a) In this case η(·) satisfies the margin condition with κ = 1; (b) here the margin condition is satisfied with κ = 2.

III. MINIMAX LOWER BOUNDS

In this section we present lower bounds on the performance of active and passive sampling methods, under the conditions described above. The proof techniques described here are relatively standard, and follow the approach in [19]. The key idea of the proof is to reduce the problem of estimation over the class P(κ, µ) to the problem of deciding among a finite number of such distributions; in other words, we reduce the estimation problem to a hypothesis testing problem. In this case it suffices to consider only two different distributions PXY.

Theorem 1 (Active Sampling Minimax Lower Bound): Let κ > 1. Under assumptions (A1) and (A2.2) we have

$$\inf_{\hat{G}_n,\, S_n} \; \sup_{P_{XY} \in \mathcal{P}(\kappa,\mu)} \mathbb{E}\big[R(\hat{G}_n) - R(G^*)\big] \ge c\, n^{-\frac{\kappa}{2\kappa-2}},$$

for n large enough, where c ≡ c(κ, µ) > 0 and the infimum is taken over the set of all possible classification rules Ĝn and sampling strategies Sn.

Contrast this result with the one attained for passive sampling. Under the passive sampling scenario it is clear that the sample locations X1, . . . , Xn must be scattered over the interval [0, 1] in a somewhat uniform manner. These can be deterministically placed, for example over a uniform grid, or simply taken uniformly distributed over [0, 1]. In [18] it was shown that under assumptions (A1), (A2.1), and κ ≥ 1, we have

$$\inf_{\hat{G}_n} \; \sup_{P_{XY} \in \mathcal{P}(\kappa,\mu)} \mathbb{E}\big[R(\hat{G}_n) - R(G^*)\big] \ge c\, n^{-\frac{\kappa}{2\kappa-1}}, \qquad (4)$$

where the samples Xi are independent and identically distributed (i.i.d.) uniformly over [0, 1]. Furthermore this bound is tight, in the sense that it is possible to devise classification strategies attaining the same asymptotic behavior.

We notice that under the passive sampling scenario the excess risk decays at a strictly slower rate than for the active sampling scenario. The difference is dramatic when κ → 1 (Figure 1(a)): in that case it can actually be shown that an exponential rate of error decay is attainable when actively sampling [3]. When κ → ∞ the excess risk decay rates are the same, regardless of the sampling paradigm. This indicates that if no assumptions are made on the conditional distribution Pr(Y = 1|X) then no advantage can be gained from the extra complexity of active sampling. As remarked before, the most relevant case is κ = 2. In this case both passive and active sampling methods display polynomial rates of error decay, but the rate for active sampling is n⁻¹, which is significantly faster than n^{−2/3}, the best possible rate for passive sampling. In the rest of the paper we present algorithms showing that the rates of Theorem 1 are nearly achievable.

We finally point out that the result of Theorem 1 applies also to the general setting presented in [18], and that using similar ideas minimax bounds can be derived for higher-dimensional classifier classes, characterized in terms of metric entropy.

IV. ACTIVE SAMPLING ALGORITHMS

In this section we present various active sampling algorithms that allow us to nearly achieve the lower bounds of Theorem 1. We start by presenting an algorithm proposed by Burnashev and Zigangirov [3], inspired by an idea of Horstein [15]. This algorithm is designed to work in the bounded noise rate case, that is, when κ = 1. This corresponds to a scenario where the conditional probability η(x) = Pr(Y = 1|X = x) is bounded away from 1/2: |η(x) − 1/2| ≥ µ for all x ∈ [0, 1]. The results and lessons learned from this algorithm yield some intuition for general margin parameters; namely, a “back-of-the-envelope” analysis indicates that a similar algorithm can be used to nearly achieve the rates of Theorem 1.
A. Bounded noise rate: κ = 1
Under this scenario we assume that |η(x) − 1/2| ≥ µ
for all x ∈ [0, 1]. Notice that in this case the class P(1, µ)
is a quasi-parametric class: elements of the class are essentially characterized by the location of the point where
η(·) “crosses” 1/2, denoted by θ∗ . If the observations are
noiseless (that is η(x) ∈ {0, 1} or equivalently µ = 1/2)
then it is clear that one can estimate θ∗ very efficiently using
binary bisection: start by taking a sample at X1 = 1/2.
Depending on the outcome we know if θ∗ is to the left of
X1 (if Y1 = 1) or to the right (if Y1 = 0). Proceeding
accordingly we can construct an estimate of θ∗, denoted by θ̂n, and a corresponding classifier Ĝn ≡ [θ̂n, 1] such that

$$\mathbb{E}[R(\hat{G}_n) - R(G^*)] = \mathbb{E}[|\hat{\theta}_n - \theta^*|] \le 2^{-(n+1)}.$$
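A minimal sketch of this noiseless bisection (ours, for illustration):

```python
def binary_bisection(label_at, n):
    """Noiseless case: label_at(x) returns 1 iff x >= theta*.
    Each query halves the interval known to contain theta*."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2.0
        if label_at(x) == 1:    # theta* is to the left of (or at) x
            hi = x
        else:                   # theta* is to the right of x
            lo = x
    return (lo + hi) / 2.0      # midpoint: error at most 2**-(n+1)

theta_hat = binary_bisection(lambda x: int(x >= 0.25), n=20)
```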
If there is noise (µ < 1/2) things get a bit more complicated, in part because our decisions about the sampling depend on all the observations made in the past, which are noisy and therefore unreliable. Nevertheless there is a probabilistic bisection method, proposed in [15], that is suitable for this purpose. The key idea stems from Bayesian estimation. Suppose that we have a prior probability density function P0(x) on the unknown parameter θ∗, namely that θ∗ is uniformly distributed over the interval [0, 1] (that is, P0(x) = 1 for all x ∈ [0, 1]). To make the exposition clear assume a particular situation, namely that θ∗ = 1/4. Like before, we start by taking a measurement at X1 = 1/2. With probability η(X1) ≥ 1/2 + µ we observe a one, and with probability 1 − η(X1) ≤ 1/2 − µ we observe a zero. Therefore it is more likely to observe a one than a zero. Suppose a one was observed. Given these facts we can compute a “posterior” density simply by applying an approximate Bayes rule (we assume that a one is observed with probability 1/2 + µ). In this case we would get

$$P_1(x \mid X_1, Y_1) = \begin{cases} 1 + 2\mu, & \text{if } x \le 1/2, \\ 1 - 2\mu, & \text{if } x > 1/2. \end{cases}$$

Fig. 2. Illustration of the probabilistic bisection strategy. The shaded areas correspond to 1/2 of the probability mass of the posterior densities.
The next step is to choose the sample location X2. We choose X2 so that it bisects the posterior distribution, that is, we take X2 such that Pr_{θ∼P1(·)}(θ > X2 | X1, Y1) = Pr_{θ∼P1(·)}(θ < X2 | X1, Y1). In other words, X2 is just the median of the posterior distribution. If our model is correct, the probability of the event {θ < X2} is identical to the probability of the event {θ > X2}, and therefore sampling Y2 at X2 is most informative. We continue iterating this procedure until we have collected n samples. The estimate θ̂n is defined as the median of the final posterior distribution. Figure 2 illustrates the procedure. Note that if µ = 1/2 then probabilistic bisection is simply the binary bisection described above.
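A compact sketch of this probabilistic bisection procedure, using a fine discretization of the posterior density (the grid representation and parameter names are our own):

```python
import numpy as np

def probabilistic_bisection(sample_label, n, mu, grid_size=10_000):
    """Horstein-style bisection: maintain a density over theta*,
    query at its median, and reweight cells by the assumed
    likelihoods 1/2 + mu (agreeing) and 1/2 - mu (disagreeing)."""
    post = np.full(grid_size, 1.0 / grid_size)      # uniform prior P0
    cells = (np.arange(grid_size) + 0.5) / grid_size
    for _ in range(n):
        x = cells[np.searchsorted(np.cumsum(post), 0.5)]  # posterior median
        y = sample_label(x)                         # noisy label at x
        left = cells <= x                           # region {theta* <= x}
        # Y = 1 is evidence for theta* <= x, Y = 0 for theta* > x.
        post[left] *= 0.5 + mu if y == 1 else 0.5 - mu
        post[~left] *= 0.5 - mu if y == 1 else 0.5 + mu
        post /= post.sum()                          # renormalize
    return cells[np.searchsorted(np.cumsum(post), 0.5)]
```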
The above algorithm seems to work extremely well in practice, but it is hard to analyze and there are few theoretical guarantees for it, especially pertaining to rates of error decay. In [3] a similar algorithm was proposed. Although its operation is slightly more complicated, it is easier to analyze. That algorithm uses essentially the same ideas, but enforces a parametric structure for the posterior by imposing that the sample locations Xi lie on a grid, in particular Xi ∈ {0, ∆, 2∆, . . . , 1} where m = ∆⁻¹ ∈ ℕ. Furthermore, in the application of the Bayes rule we use α instead of µ, where 0 < α < µ < 1/2. A description of the algorithm can be found in [3] or in [4]. We call this method the BZ algorithm. We have the following remarkable result [3]:

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \le \frac{1-\Delta}{\Delta}\left(\frac{1+2\mu}{2+4\alpha} + \frac{1-2\mu}{2-4\alpha}\right)^{n}. \qquad (5)$$
From the estimate θ̂n we construct a classifier Ĝn ≡ [θ̂n, 1]. To get a bound on the expected excess risk one proceeds as follows:

$$\begin{aligned}
\mathbb{E}[R(\hat{G}_n)-R(G^*)] &= \mathbb{E}\left[\int_{\hat{G}_n \Delta G^*} |2\eta(x)-1|\,dx\right] \le \mathbb{E}\big[|\hat{G}_n \Delta G^*|\big] = \mathbb{E}\big[|\hat{\theta}_n-\theta^*|\big] \\
&= \int_0^1 \Pr(|\hat{\theta}_n-\theta^*|>t)\,dt = \int_0^{\Delta} \Pr(|\hat{\theta}_n-\theta^*|>t)\,dt + \int_{\Delta}^1 \Pr(|\hat{\theta}_n-\theta^*|>t)\,dt \\
&\le \Delta + (1-\Delta)\Pr(|\hat{\theta}_n-\theta^*|>\Delta) \\
&\le \Delta + \frac{(1-\Delta)^2}{\Delta}\left(\frac{1+2\mu}{2+4\alpha}+\frac{1-2\mu}{2-4\alpha}\right)^{n}.
\end{aligned}$$

Taking

$$\Delta = \left(\frac{1+2\mu}{2+4\alpha}+\frac{1-2\mu}{2-4\alpha}\right)^{n/2} \quad\text{and}\quad \alpha = \frac{1-\sqrt{1-4\mu^2}}{4\mu}$$

(to minimize the exponent) yields

$$\mathbb{E}[R(\hat{G}_n)-R(G^*)] \le 2\left(\frac{1}{2}+\frac{1}{2}\sqrt{1-4\mu^2}\right)^{n/2}.$$
Notice that the excess risk decays exponentially in the
number of samples. This is much faster than what is
attainable using only passive sampling, where the decay
rate is 1/n. Like in the noiseless scenario we obtain an
exponential rate of error decay. The difference is that now
the exponent depends on the noise margin µ, larger noise
margins corresponding to faster error decay rates.
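As a quick numeric sanity check of these formulas (the value µ = 0.3 is an arbitrary choice of ours):

```python
import math

def bz_base(mu, alpha):
    """Base of the exponential bound (5)."""
    return (1 + 2*mu) / (2 + 4*alpha) + (1 - 2*mu) / (2 - 4*alpha)

mu = 0.3
alpha = (1 - math.sqrt(1 - 4 * mu**2)) / (4 * mu)  # exponent-minimizing alpha
print(alpha, bz_base(mu, alpha))  # ~0.1667 and 0.9 = 1/2 + sqrt(1-4*mu^2)/2
# The excess-risk bound is then 2 * 0.9**(n/2): about 5.3e-5 for n = 200.
```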
B. Unbounded noise rate: κ > 1

In this section we consider scenarios where the noise rate is not bounded, that is, as one makes observations closer to the transition point θ∗ the observation noise becomes larger. This clearly hinders extremely fast excess risk decay rates, since as one focuses more and more on the Bayes decision boundary the quality of the observations degrades.
To gain some intuition we consider first the case where |η(x) − 1/2| = µ|x − θ∗|^{κ−1}, corresponding to a margin parameter κ. Proceed now as in the previous section, and collect samples over a grid, namely Xi ∈ {0, ∆, 2∆, . . . , 1} where m = ∆⁻¹ ∈ ℕ. If this grid does not line up with the transition point θ∗ then |η(x) − 1/2| ≥ µ(∆/2)^{κ−1} for all x ∈ {0, ∆, . . . , 1}. Of course this is in general an unrealistic assumption, but let us consider it for now. We can now proceed by using the method described in the previous section, replacing µ by µ(∆/2)^{κ−1} and using (5). Begin by noticing that, due to the special form of η(·), the behavior of the expected excess risk is related to the behavior of |θ̂n − θ∗| in an interesting way, namely

$$\mathbb{E}[R(\hat{G}_n)-R(G^*)] = \mathbb{E}\left[\int_{\hat{G}_n \Delta G^*} |2\eta(x)-1|\,dx\right] = \mathbb{E}\left[\int_{\hat{G}_n \Delta G^*} 2\mu|x-\theta^*|^{\kappa-1}\,dx\right] \le \mathbb{E}\big[|\hat{\theta}_n-\theta^*|^{\kappa}\big].$$
We now proceed in a similar fashion as before:

$$\begin{aligned}
\mathbb{E}[R(\hat{G}_n)-R(G^*)] &\le \mathbb{E}\big[|\hat{\theta}_n-\theta^*|^{\kappa}\big] = \int_0^1 \Pr(|\hat{\theta}_n-\theta^*|^{\kappa}>t)\,dt = \int_0^1 \Pr(|\hat{\theta}_n-\theta^*|>t^{1/\kappa})\,dt \\
&= \int_0^{\Delta^{\kappa}} \Pr(|\hat{\theta}_n-\theta^*|>t^{1/\kappa})\,dt + \int_{\Delta^{\kappa}}^1 \Pr(|\hat{\theta}_n-\theta^*|>t^{1/\kappa})\,dt \\
&\le \Delta^{\kappa} + (1-\Delta^{\kappa})\Pr(|\hat{\theta}_n-\theta^*|>\Delta) \\
&\le \Delta^{\kappa} + \frac{1}{\Delta}\left(\frac{1}{2}+\frac{1}{2}\sqrt{1-4\mu^2(\Delta/2)^{2\kappa-2}}\right)^{n} \\
&\le \Delta^{\kappa} + \frac{1}{\Delta}\left(1-\mu^2(\Delta/2)^{2\kappa-2}\right)^{n} \\
&\le \Delta^{\kappa} + \frac{1}{\Delta}\exp\left(-n\,\mu^2(\Delta/2)^{2\kappa-2}\right),
\end{aligned}$$

where the last two steps follow from the facts that √x ≤ (x + 1)/2 for all x ≥ 0, and that (1 + s(x))^x ≤ exp(x s(x)) for all x > 0 and s(x) > −1. Finally, let
$$\Delta = 2\left(\frac{\kappa+1}{\mu^2(2\kappa-2)}\,\frac{\log n}{n}\right)^{\frac{1}{2\kappa-2}}.$$

We conclude that

$$\mathbb{E}[R(\hat{G}_n)-R(G^*)] \propto \left(\frac{\log n}{n}\right)^{\frac{\kappa}{2\kappa-2}}. \qquad (6)$$
This is the lower bound rate displayed in Theorem 1, apart from
logarithmic factors. The result indicates that, in principle, a
methodology similar to the Burnashev-Zigangirov algorithm
might allow us to achieve the lower bound rates.
It is important to emphasize that the above result holds
under the assumption that the sampling grid is not aligned
with the unknown transition point θ∗ . If that is not the
case then we will have |η(x) − 1/2| < µ(∆/2)^{κ−1} for some sampling point, and the analysis above does not hold.
One way to avoid this problem is to consider different sampling grids. A methodology that can be shown to work consists in dividing the available measurements into three equal sets and using three differently offset sampling grids. If we have a budget of n samples we allocate n/3 samples and run the BZ algorithm on one sampling grid. Then we use another n/3 samples and run the BZ algorithm on a second sampling grid, essentially a slightly shifted version of the first, and proceed in an analogous fashion with the remaining n/3 samples. The rationale is that at most one of these sampling grids is going to be closely aligned with the unknown transition point θ∗, and therefore the other two are not going to display any problems. We then check for agreement among the three estimators and make a decision that provably attains the rate in (6). Although this methodology is satisfying from a theoretical point of view, it is somewhat wasteful (essentially only one third of the samples are effectively used), and it does not generalize well to more complicated and realistic scenarios than the one considered in this paper. Therefore it is not a very relevant approach in practice. In the remainder of this document we present an alternative methodology that solves some of these problems and provides a practical algorithm capable of achieving the desired fast rates.
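A schematic sketch of this budget-splitting scheme (ours; `run_bz` stands in for the BZ algorithm of Section IV-A, and the median rule below is one simple way to implement the agreement check, not necessarily the one analyzed):

```python
def three_grid_estimate(run_bz, n, delta):
    """Split a budget of n labels across three offset grids of width delta.

    run_bz(offset, budget) is assumed to run the BZ algorithm on the
    grid {offset, offset + delta, ...} and return a threshold estimate.
    At most one grid can be closely aligned with theta*, so at least
    two of the three estimates should agree to within about delta."""
    offsets = [0.0, delta / 3.0, 2.0 * delta / 3.0]
    estimates = [run_bz(off, n // 3) for off in offsets]
    # The median is sandwiched between the two mutually agreeing estimates.
    return sorted(estimates)[1]
```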
It is worth pointing out that, even when the assumption
that the sampling grid does not line up with the transition
θ∗ does not hold, the BZ method still works extremely well
in practice. The difficulties arise solely in the performance analysis of the algorithm.
C. Unbounded noise rate: a sample-efficient algorithm

The algorithm presented here is similar to the original BZ algorithm, although some modifications are made allowing the analysis of more general noise models than bounded noise. Like in the original methodology we concentrate our sampling on a grid, in particular Xi ∈ {0, ∆, 2∆, . . . , 1} where m = ∆⁻¹ ∈ ℕ. The algorithm works by propagating a posterior-like density (we will refer to this density as the posterior hereafter). After j observations the posterior is described as Pj : [0, 1] → ℝ,

$$P_j(x) = \Delta^{-1} \sum_{i=1}^{m} b_i(j)\, \mathbf{1}_{I_i}(x),$$

where I₁ = [0, ∆] and Iᵢ = (∆(i − 1), ∆i] for i ∈ {2, . . . , m}. We initialize this posterior by taking bᵢ(0) = ∆ for all i. Note that the posterior is completely characterized by b(j) = (b₁(j), . . . , bₘ(j)), and that Σᵢ₌₁ᵐ bᵢ(j) = 1.
We select the sample location Xj+1 using a randomized approach based on Pj: choose a number kj+1 ∈ {1, . . . , m} according to the distribution b(j); in other words, Pr(kj+1 = i) = bi(j). Now let W ∈ {−2, −1, 0, 1} be an independent random variable, taking each of the values −2, −1, 0, and 1 with probability 1/4. The sample location is given by Xj+1 = min(max(∆(kj+1 − W), 0), 1). We collect a label Yj+1 and update the posterior accordingly (implicitly assuming a noise margin 0 < α < 1/2): if i ≤ kj+1 − W,

$$b_i(j+1) = \begin{cases} \dfrac{(1/2-\alpha)\,b_i(j)}{1/2+\alpha-2\alpha\sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 0, \\[2ex] \dfrac{(1/2+\alpha)\,b_i(j)}{1/2-\alpha+2\alpha\sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 1, \end{cases}$$

and if i > kj+1 − W,

$$b_i(j+1) = \begin{cases} \dfrac{(1/2+\alpha)\,b_i(j)}{1/2+\alpha-2\alpha\sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 0, \\[2ex] \dfrac{(1/2-\alpha)\,b_i(j)}{1/2-\alpha+2\alpha\sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 1. \end{cases}$$

We call α the update parameter. Finally, at time j the estimate of the true transition point θ∗ is given by θ̂j, defined as the median of Pj(·).
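The following is a compact implementation sketch of this procedure (ours; the demo parameters at the bottom are arbitrary choices):

```python
import numpy as np

def modified_bz(sample_label, n, m, alpha, rng):
    """Modified BZ algorithm: discretized posterior over m bins of width
    Delta = 1/m, randomized bin selection, offset W, and the update rules
    above. sample_label(x) returns a noisy label in {0, 1}."""
    delta = 1.0 / m
    b = np.full(m, delta)                    # b_i(0) = Delta, sums to one
    for _ in range(n):
        k = int(rng.choice(m, p=b)) + 1      # k_{j+1} in {1,...,m}, Pr = b_i
        w = int(rng.choice([-2, -1, 0, 1]))  # uniform offset W
        split = min(max(k - w, 0), m)        # bins 1..split vs split+1..m
        x = min(max(delta * (k - w), 0.0), 1.0)
        if sample_label(x) == 1:             # evidence that theta* <= x
            b[:split] *= 0.5 + alpha
            b[split:] *= 0.5 - alpha
        else:                                # evidence that theta* > x
            b[:split] *= 0.5 - alpha
            b[split:] *= 0.5 + alpha
        b /= b.sum()      # normalization reproduces the stated denominators
    i = int(np.searchsorted(np.cumsum(b), 0.5))  # bin holding the median
    return delta * (i + 0.5)                 # midpoint of that bin

rng = np.random.default_rng(0)
eta = lambda x, th=0.37: min(max(0.5 + 0.5 * (x - th), 0.0), 1.0)  # kappa = 2
theta_hat = modified_bz(lambda x: int(rng.random() < eta(x)), 2000, 100, 0.05, rng)
```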
A couple of remarks are important at this point: (i) this
method differs from the original BZ algorithm in that the
latter chooses the next sample location to be one of the two
grid points that are closest to the median of the posterior.
Choosing the sample point according to this procedure, or
“sampling” the posterior (as is the case with our algorithm)
is essentially the same provided the posterior is somewhat
concentrated around a point, which happens with high probability after a number of observations have been collected.
(ii) Instead of choosing W to take values in {−2, −1, 0, 1}
one could take W to be Bernoulli with parameter 1/2. This
methodology works provided we are working under the bounded noise rate assumption, or under general noise models
if the transition point θ∗ is not closely aligned with the
sampling grid. If the last condition does not hold then
the proof technique breaks down, and it is not possible to
guarantee rates of error decay. This same problem arises
in the original BZ method. (iii) When using four sample
points around the chosen bin kj+1 (or in general points in
the vicinity of the chosen bin) we can guarantee that most
of the times we are taking a sample at a point that has a
reasonable noise margin (that is, a point such that |η(x)−1/2|
is reasonably large).
The analysis of the algorithm above can be done in a similar fashion to the analysis of the original BZ algorithm, although further complications arise due to the inability to control the noise margin at one of the sampling points (see the proof of Lemma 1 for details). We use the algorithm to construct the classifier Ĝn = [θ̂n, 1] and have the following result.

Theorem 2: Let PXY ∈ P(κ, µ). Furthermore assume that PXY does not satisfy the margin condition for any margin parameter κ′ < κ. Then there exist a bin size ∆ ≡ ∆(n, κ, µ) and an update parameter α ≡ α(n, κ, µ) such that

$$\mathbb{E}[R(\hat{G}_n)-R(G^*)] \le C\left(\frac{\log n}{n}\right)^{\frac{\kappa}{2\kappa-2}}, \qquad (7)$$

where C ≡ C(κ, µ) > 0.
Fig. 3. Simulation results for the case κ = 2. The empirical expected excess risk for the active sampling algorithm corresponds to the bold line, and the corresponding curve for passive sampling corresponds to the regular line.
The proof of the theorem follows from an upper bound on Pr(|θ̂n − θ∗| > ∆), similar to (5), and the reasoning of Section IV-B. The key result enabling the proof is the following lemma.

Lemma 1: Let PXY ∈ P(κ, µ) and κ ≥ 2. Then taking α = 0.09µ(3∆/4)^{κ−1} yields

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \le \frac{1-2\Delta}{2\Delta}\left(1 - \frac{\mu^2(3\Delta/4)^{2\kappa-2}}{50}\right)^{n}. \qquad (8)$$
Although this lemma is presented only for κ ≥ 2 it is
possible to obtain similar results for any κ > 1, identical to
the above bounds apart from the constants. In the interest of
clarity we consider only the case κ ≥ 2.
Some remarks are important at this time: In the statement
of Theorem 2 one notices that ∆ and α are functions of
κ and µ. This indicates that the proposed algorithm is not
adaptive, that is, it cannot handle a scenario where κ and µ
are unknown.
It turns out that the choice of ∆ as a function of n, κ,
and µ is not critical for the practical performance of the
algorithm. Essentially finer sampling grids provide the same
or better performance than the one predicted by the theoretical analysis. However, the choice of update parameter α is
critical, since the wrong scaling of the parameter with n can lead to slower rates of error decay.
V. SIMULATION RESULTS
In this section we present a simple simulation of the
proposed algorithm. This does not constitute a thorough
empirical study, but simply an illustration of the practical
performance characteristics. We conducted various other
simulation tests, all producing results that agree with the theoretical analysis above.
In this simulation we considered the case κ = 2, and the
distribution PXY is characterized by η(x) = (x − θ∗ )/2 +
1/2. We performed 1000 runs, in each one selecting θ∗
uniformly at random in the interval [0, 1]. In Figure 3 we
plot the empirical expected excess risk versus the number
of samples n. The expected excess risk is estimated by
averaging the excess risk over the 1000 trials. We display the
results in a log-log plot. We observe that the active method
outperforms the passive method for any number of samples.
Furthermore the rates of error decay coincide with the rates
predicted by the theory, that is, the excess risk decays at a rate
n−1 for active sampling (in the log-log plot this corresponds
to a slope of −1), and at a rate n−2/3 for passive sampling
(corresponding to a slope of −2/3 in the plot).
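For reference, a minimal harness in the spirit of this experiment, reusing the `modified_bz` sketch from Section IV-C (all parameter values are our own choices, not necessarily those used for Figure 3):

```python
import numpy as np

def avg_excess_risk(n, trials=1000, m=100, alpha=0.05):
    """Estimate E[R(G_n) - R(G*)] for eta(x) = (x - theta*)/2 + 1/2.
    For this eta, |2*eta(x) - 1| = |x - theta*|, so the excess risk of a
    threshold estimate t is the integral of |x - theta*| between t and
    theta*, i.e. (t - theta*)**2 / 2."""
    rng = np.random.default_rng(0)
    risks = []
    for _ in range(trials):
        theta = rng.random()                       # theta* uniform on [0, 1]
        label = lambda x: int(rng.random() < 0.5 + 0.5 * (x - theta))
        t = modified_bz(label, n, m, alpha, rng)
        risks.append((t - theta) ** 2 / 2.0)
    return float(np.mean(risks))
```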
VI. FINAL REMARKS AND OPEN QUESTIONS
We present a formal analysis of active learning in an
unbounded noise setting. We show that the extra flexibility
of active sampling, when compared to passive sampling,
allows for faster rates of error decay, but the errors depend
on the behavior of the noise margin around the Bayes
decision boundary. We present results characterizing the
fundamental limits of active learning for the one-dimensional
setting of this paper, and provide algorithms capable of
nearly achieving those limits. The algorithms are practical
and easily implemented. The algorithms presented are not
adaptive to the noise conditions, in the sense that they require
knowledge of margin parameters κ and µ in order to achieve
the optimal rates, at least in theory. However, empirical
evidence suggests that the proposed algorithms yield the
optimal performance (in terms of error rates) solely with
mild knowledge of κ, but the proof techniques used here are
not powerful enough to demonstrate that. The ideas presented
here can be extended to larger and more complicated classes
of classifiers, characterized for example in terms of metric
entropy. The development and analysis of such extensions is
currently being investigated.
APPENDIX
A. Proof of Theorem 1
The proof strategy follows the basic idea behind standard minimax analysis methods, and consists in reducing the problem of classification in the class P(κ, µ) to a hypothesis testing problem. In this case it suffices to consider two hypotheses:

$$\eta_0(x) = \begin{cases} \frac{1}{2} - \mu\,(t-x)^{\kappa-1}, & x \le t, \\ 1, & x > t, \end{cases}
\qquad\qquad
\eta_1(x) = \begin{cases} \frac{1}{2} + \mu\, x^{\kappa-1}, & x \le t, \\ 1, & x > t. \end{cases}$$
These are depicted in Figure 4. Note that G∗₀ = [t, 1] and G∗₁ = [0, 1] (provided t is small enough). Let Ĝn be any classifier (i.e., a subset of [0, 1]) and define ψn = arg min_{i∈{0,1}} |Ĝn ∆ G∗ᵢ|, where | · | denotes the volume of a set. We have the following result.
Lemma 2: For j ∈ {0, 1},

$$\psi_n \neq j \;\Longrightarrow\; R_j(\hat{G}_n) - R_j(G^*_j) \ge \frac{2\mu}{\kappa 2^{\kappa}}\, t^{\kappa} \;\triangleq\; s \;\propto\; \mu t^{\kappa},$$

where Rj denotes the risk (1) under hypothesis j.

Fig. 4. The two conditional distributions used for the proof of Theorem 1.
Proof: Assume first that j = 0; the case j = 1 is analogous. Define G̃n = Ĝn ∩ [0, t] and G̃ᶜn = Ĝᶜn ∩ [0, t]. Since ψn = 1 we have

$$|\hat{G}_n \Delta G^*_1| \le |\hat{G}_n \Delta G^*_0| \;\Leftrightarrow\; |\tilde{G}_n \Delta G^*_1| \le |\tilde{G}_n \Delta G^*_0| \;\Leftrightarrow\; |\tilde{G}^c_n| \le |\tilde{G}_n|.$$

Therefore

$$R_0(\hat{G}_n) - R_0(G^*_0) \ge \int_{\tilde{G}_n} |2\eta_0(x)-1|\,dx \ge \int_{t/2}^{t} |2\eta_0(x)-1|\,dx = \frac{2\mu}{\kappa 2^{\kappa}}\, t^{\kappa}.$$
A consequence of Lemma 2 is that

$$\inf_{\hat{G}_n}\;\sup_{P_{XY}\in\mathcal{P}(\kappa,\mu)} \Pr\big(R(\hat{G}_n)-R(G^*) \ge s\big) \ge \max_{j} P_j(\psi_n \neq j) \ge \inf_{\phi_n}\,\max_{j}\, P_j(\phi_n \neq j) \;\triangleq\; p_e, \qquad (9)$$

where the infimum on the r.h.s. is taken with respect to all functions of the data onto {0, 1}. All that remains to be done now is to construct a lower bound for pe. To this end we use a result from [19]. Let P0,n denote the joint probability measure of the random variables {Xi, Yi}, i = 1, . . . , n, under hypothesis 0, and define P1,n analogously under hypothesis 1. Theorem 2.2 of [19] says that if K(P1,n‖P0,n) ≤ α < ∞ then

$$p_e \ge \max\left(\frac{1}{4}\exp(-\alpha),\;\frac{1-\sqrt{\alpha/2}}{2}\right),$$

where K(·‖·) denotes the Kullback divergence. Now
$$\begin{aligned}
K(P_{1,n}\|P_{0,n}) &= \mathbb{E}_1\left[\log\frac{P_{1,n}}{P_{0,n}}\right] = \mathbb{E}_1\left[\mathbb{E}_1\left[\log\frac{P_{1,n}}{P_{0,n}}\,\Big|\,X_1,\ldots,X_n\right]\right] \\
&= n\,\mathbb{E}_1\left[\mathbb{E}_1\left[\log\frac{P^{(1)}_{Y_i|X_i}}{P^{(0)}_{Y_i|X_i}}\,\Big|\,X_1,\ldots,X_n\right]\right] \\
&\le n\,\mathbb{E}_1\left[\log\frac{P^{(1)}_{Y_i|X_i}}{P^{(0)}_{Y_i|X_i}}\,\Big|\,X_i=0,\ \forall i\right] \\
&\le n\left((2\mu t^{\kappa-1})^2 + o\big((2\mu t^{\kappa-1})^2\big)\right),
\end{aligned}$$

as t → 0. In the above, E1 denotes the expectation taken with respect to the measure P1,n. Therefore K(P1,n‖P0,n) ≤ 4µ²nt^{2κ−2}. Taking t ∝ (µ²n)^{−1/(2κ−2)} and using (9) we conclude that

$$\inf_{\hat{G}_n}\;\sup_{P_{XY}\in\mathcal{P}(\kappa,\mu)} \Pr\left(R(\hat{G}_n)-R(G^*) \ge \mu^{-\frac{1}{\kappa}}\,n^{-\frac{\kappa}{2\kappa-2}}\right) \ge c > 0,$$

where c > 0 is a constant. The statement of the theorem now follows from the application of Markov's inequality to the above expression:

$$\mathbb{E}\big[R(\hat{G}_n)-R(G^*)\big] \ge \mu^{-\frac{1}{\kappa}}\,n^{-\frac{\kappa}{2\kappa-2}}\;\Pr\left(R(\hat{G}_n)-R(G^*) \ge \mu^{-\frac{1}{\kappa}}\,n^{-\frac{\kappa}{2\kappa-2}}\right).$$

Remark: Notice that, when bounding the Kullback divergence, we considered all the feature examples to be taken at Xi = 0, the most beneficial place to take a sample. If instead we assume the Xi are i.i.d. uniform over [0, 1], the Kullback divergence is approximately nt^{2κ−2} · t = nt^{2κ−1}, since roughly only a fraction t of the samples is informative (any sample taken in (t, 1] is non-informative). Proceeding as before we obtain the passive sampling minimax bound (4), proved in [18] for a more general setting.
B. Proof of Lemma 1

The proof of Lemma 1 follows closely the approach taken in [3], but some complications arise from the fact that the noise margin for one of the grid sample points (namely the point closest to θ∗) cannot be lower bounded. For a fixed distribution PXY ∈ P(κ, µ) and a given parameter ∆, two different scenarios can happen. For all the results below we assume that ∆ is small enough so that condition (3) does not play a critical role. The proof also assumes that κ ≥ 2; it is possible to generalize the argument to any κ > 1 (by changing the definition of case 1 and case 2 below), but we do not consider this for the sake of clarity.

Case 1: The sampling grid points are not aligned with the transition point θ∗. We consider that this happens if the point θ∗ is in the middle half of one of the bins. Formally this means that for some i ∈ ℕ we have θ∗ − i∆ > ∆/4 and (i + 1)∆ − θ∗ > ∆/4. This implies that for all sampling points the noise margin can be lower bounded, in particular |η(x) − 1/2| ≥ µ(∆/4)^{κ−1} for all x = ∆i, i ∈ {0, . . . , m}. Therefore this case is essentially like the bounded noise margin scenario, and the analysis can be carried out using techniques in [3].

Define k(θ∗) to be the index of the bin Iᵢ containing θ∗, that is, θ∗ ∈ I_{k(θ∗)}. Define

$$M(j) = \frac{1 - b_{k(\theta^*)}(j)}{b_{k(\theta^*)}(j)}$$

and

$$N(j+1) = \frac{M(j+1)}{M(j)} = \frac{b_{k(\theta^*)}(j)\big(1-b_{k(\theta^*)}(j+1)\big)}{b_{k(\theta^*)}(j+1)\big(1-b_{k(\theta^*)}(j)\big)}.$$

The reasoning behind these definitions is made clear later. For now, notice that M(j) is a decreasing function of b_{k(θ∗)}(j).

After n observations our estimate of θ∗ is the median of the posterior density Pn. Taking that into account we conclude that

$$\Pr(|\hat{\theta}_n-\theta^*|>\Delta) \le \Pr\big(b_{k(\theta^*)}(n) < 1/2\big) = \Pr(M(n) > 1) \le \mathbb{E}[M(n)],$$

where the last step follows from Markov's inequality. The definition of M(j) above is meant to get more leverage out of Markov's inequality, in a spirit similar to Chernoff bounding techniques. Using the definition of N(j) and some conditioning we get

$$\begin{aligned}
\mathbb{E}[M(n)] &= \mathbb{E}[M(n-1)\,N(n)] \\
&= \mathbb{E}\big[\mathbb{E}[M(n-1)\,N(n)\mid b(n-1)]\big] \\
&= \mathbb{E}\big[M(n-1)\,\mathbb{E}[N(n)\mid b(n-1)]\big] \\
&\;\;\vdots \\
&= M(0)\,\mathbb{E}\big[\mathbb{E}[N(1)\mid b(0)]\cdots\mathbb{E}[N(n)\mid b(n-1)]\big] \\
&\le M(0)\left\{\max_{j\in\{0,\ldots,n-1\}}\;\sup_{b(j)}\;\mathbb{E}[N(j+1)\mid b(j)]\right\}^{n}. \qquad (10)
\end{aligned}$$

The rest of the proof consists of upper bounding E[N(j + 1)|b(j)], showing that it is always strictly less than 1. Before proceeding we make some remarks about the above technique. Note that M(j) measures how much mass is outside the bin containing θ∗ (if M(j) = 0 all the mass in our posterior is in the bin containing θ∗, the least-error scenario). The ratio N(j) is a measure of the improvement, in terms of concentrating the posterior around the bin containing θ∗, obtained by sampling at Xj and observing Yj; it is strictly less than one when an improvement is made. The bound (10) above is therefore only useful if, no matter what happened in the past, a measurement made with the proposed algorithm always leads to a performance improvement on average.
It is possible to show that

$$\mathbb{E}[N(j+1)\mid b(j)] \le \frac{1}{2}\left(\frac{1+2\mu(\Delta/4)^{\kappa-1}}{1+2\alpha} + \frac{1-2\mu(\Delta/4)^{\kappa-1}}{1-2\alpha}\right).$$

Since this derivation is cumbersome, and very similar to the one for case 2, we present only the derivation for that case. One can check that using the update parameter α prescribed in the lemma statement, together with (10), yields the desired bound (8).

Fig. 5. Illustration of case 2 in the proof of Theorem 2. Only one sample point lies inside the gray area (corresponding to a noise margin less than µ(∆/4)^{κ−1}).
Case 2: The grid sampling points are aligned with the transition point θ∗. Formally, there is a grid point i∆, i ∈ ℕ, such that |θ∗ − i∆| ≤ ∆/4 (see Figure 5). Therefore, for that particular sampling point we have |η(i∆) − 1/2| ≤ µ(∆/4)^{κ−1}. For all the other grid points x ≠ i∆ we have |η(x) − 1/2| > µ(3∆/4)^{κ−1}. Unlike in case 1 it is not sufficient to keep track of the bin containing θ∗. This is in part due to the proof strategy, but also due to the fact that if θ∗ coincides exactly with a sampling point the two bins of the posterior that share that point might both get significant probability mass after several measurements. There is the need to keep track of both bins, which complicates matters a bit. To simplify the notation below let p = µ(3∆/4)^{κ−1} and q = η(i∆) − 1/2, so that |q| ≤ µ(∆/4)^{κ−1}.

Define k(θ∗) = arg minᵢ∈ℕ |θ∗ − i∆|, so that θ∗ is either in bin I_{k(θ∗)} or bin I_{k(θ∗)+1}. In this presentation we ignore “edge effects”, that is, we assume that θ∗ is not close to 0 or 1; these cases can be handled in a similar fashion. Define

$$M(j) = \frac{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)}{b_{k(\theta^*)}(j) + b_{k(\theta^*)+1}(j)}$$

and

$$N(j+1) = \frac{M(j+1)}{M(j)} = \frac{b_{k(\theta^*)}(j)+b_{k(\theta^*)+1}(j)}{1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)} \cdot \frac{1-b_{k(\theta^*)}(j+1)-b_{k(\theta^*)+1}(j+1)}{b_{k(\theta^*)}(j+1)+b_{k(\theta^*)+1}(j+1)}.$$

These are very similar to the definitions in the proof of case 1, except that now we are keeping track of two bins.
We proceed exactly as in case 1 and obtain

$$\Pr(|\hat{\theta}_n-\theta^*|>\Delta) \le \Pr\big(b_{k(\theta^*)}(n)+b_{k(\theta^*)+1}(n) < 1/2\big) \le M(0)\left\{\max_{j\in\{0,\ldots,n-1\}}\;\sup_{b(j)}\;\mathbb{E}[N(j+1)\mid b(j)]\right\}^{n}.$$
To bound E[N(j + 1)|b(j)] we are going to condition on W (recall the algorithm described in Section IV-C). For the purposes of illustration we present here only the case W = 0; the procedure for the remaining scenarios is analogous. If W = 0 then the sample location at time j + 1 is Xj+1 = ∆kj+1. Next we evaluate N(j + 1) in three different cases: (i) kj+1 < k(θ∗); (ii) kj+1 > k(θ∗); (iii) kj+1 = k(θ∗). For case (i) we have

$$N(j+1) \le 1 - 2\alpha\left(\frac{1+2p}{1+2\alpha}-\frac{1-2p}{1-2\alpha}\right)\frac{\sum_{i=1}^{k_{j+1}} b_i(j)}{1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)}.$$

For case (ii) we have

$$N(j+1) \le 1 - 2\alpha\left(\frac{1+2p}{1+2\alpha}-\frac{1-2p}{1-2\alpha}\right)\frac{\sum_{i=k_{j+1}+1}^{m} b_i(j)}{1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)}.$$

Finally, for case (iii) we have

$$\begin{aligned}
N(j+1) \;=\; \frac{b_{k(\theta^*)}(j)+b_{k(\theta^*)+1}(j)}{1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)} \Bigg[\; &(1/2+q)\,\frac{(1/2-\alpha)\big(1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)\big)+2\alpha\sum_{i=1}^{k(\theta^*)-1}b_i(j)}{(1/2+\alpha)\,b_{k(\theta^*)}(j)+(1/2-\alpha)\,b_{k(\theta^*)+1}(j)} \\
+\; &(1/2-q)\,\frac{(1/2+\alpha)\big(1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)\big)-2\alpha\sum_{i=1}^{k(\theta^*)-1}b_i(j)}{(1/2-\alpha)\,b_{k(\theta^*)}(j)+(1/2+\alpha)\,b_{k(\theta^*)+1}(j)} \Bigg]. \qquad (11)
\end{aligned}$$
To compute E[N(j + 1)|b(j), W = 0] we first need to evaluate E[Σᵢ₌₁^{kj+1} bi(j) | kj+1 < k(θ∗)]. It is not hard to show that

$$\mathbb{E}\left[\sum_{i=1}^{k_{j+1}} b_i(j)\,\Bigg|\,k_{j+1}<k(\theta^*)\right] = \frac{\sum_{i=1}^{k(\theta^*)} b_i(j)}{2} - \frac{\sum_{i=1}^{k(\theta^*)} b_i(j)^2}{2\sum_{i=1}^{k(\theta^*)} b_i(j)}.$$

Similarly we have

$$\mathbb{E}\left[\sum_{i=k_{j+1}+1}^{m} b_i(j)\,\Bigg|\,k_{j+1}>k(\theta^*)\right] = \frac{\sum_{i=k(\theta^*)+1}^{m} b_i(j)}{2} - \frac{\sum_{i=k(\theta^*)+1}^{m} b_i(j)^2}{2\sum_{i=k(\theta^*)+1}^{m} b_i(j)}.$$

These results can be used to evaluate E[N(j + 1)|b(j), W = 0]; we do not display the resulting expression since it is cumbersome.
Proceeding in a similar fashion for all the remaining possibilities of W yields the bound for E[N(j + 1)|b(j)] given by (12):

$$\begin{aligned}
\mathbb{E}[N(j+1)\mid b(j)] \le\;& 1 - \frac{1}{4}\big(b_{k(\theta^*)-1}(j)+b_{k(\theta^*)}(j)+b_{k(\theta^*)+1}(j)+b_{k(\theta^*)+2}(j)\big)\\
&- \frac{\alpha}{2\big(1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)\big)}\left(\frac{1+2p}{1+2\alpha}-\frac{1-2p}{1-2\alpha}\right)\times\\
&\quad\Bigg[\,b_{k(\theta^*)-1}(j)\sum_{i=1}^{k(\theta^*)-2}b_i(j) - b_{k(\theta^*)-2}(j)\,b_{k(\theta^*)-1}(j) + b_{k(\theta^*)-1}(j)\,b_{k(\theta^*)+1}(j) + \left(\sum_{i=1}^{k(\theta^*)-1}b_i(j)\right)\left(\sum_{i=1}^{k(\theta^*)}b_i(j)\right)\\
&\qquad + b_{k(\theta^*)+2}(j)\sum_{i=k(\theta^*)+3}^{m}b_i(j) - b_{k(\theta^*)+2}(j)\,b_{k(\theta^*)+3}(j) + b_{k(\theta^*)}(j)\,b_{k(\theta^*)+2}(j) + \left(\sum_{i=k(\theta^*)+2}^{m}b_i(j)\right)\left(\sum_{i=k(\theta^*)+1}^{m}b_i(j)\right)\Bigg]\\
&+ \frac{1}{4}\big(b_{k(\theta^*)-1}(j)+b_{k(\theta^*)}(j)+b_{k(\theta^*)+1}(j)+b_{k(\theta^*)+2}(j)\big)\,\frac{b_{k(\theta^*)}(j)+b_{k(\theta^*)+1}(j)}{1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)}\times\\
&\quad\Bigg[(1/2+q)\,\frac{(1/2-\alpha)\big(1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)\big)+2\alpha\sum_{i=1}^{k(\theta^*)-1}b_i(j)}{(1/2+\alpha)\,b_{k(\theta^*)}(j)+(1/2-\alpha)\,b_{k(\theta^*)+1}(j)}\\
&\qquad+(1/2-q)\,\frac{(1/2+\alpha)\big(1-b_{k(\theta^*)}(j)-b_{k(\theta^*)+1}(j)\big)-2\alpha\sum_{i=1}^{k(\theta^*)-1}b_i(j)}{(1/2-\alpha)\,b_{k(\theta^*)}(j)+(1/2+\alpha)\,b_{k(\theta^*)+1}(j)}\Bigg]. \qquad (12)
\end{aligned}$$

Although this expression looks terribly complicated, it is possible to explicitly find the posterior b(j) maximizing it. This posterior only has mass in the vicinity of the true transition point θ∗. The intuitive reason is that collecting samples at k(θ∗)∆ (the sampling point with a “bad” noise margin) yields the worst possible scenario. This can be shown formally using Lagrange multipliers; we do not include that derivation here, but it is available in a technical report [6]. For illustration purposes suppose that q > 0. Then E[N(j + 1)|b(j)] is largest when the posterior is of the form b_{k(θ∗)−1} = δ − ξ/2, b_{k(θ∗)+2} = 1 − δ − ξ/2, and b_{k(θ∗)+1} = ξ, with all the other entries of b(j) equal to zero. The largest value is attained by taking ξ → 0 and choosing δ to maximize the bound:

$$\mathbb{E}[N(j+1)\mid b(j)] \le \max_{\delta}\Bigg\{1 - \frac{1}{4} - \frac{\alpha}{2}\left(\frac{1+2p}{1+2\alpha}-\frac{1-2p}{1-2\alpha}\right)\big(\delta^{2}+(1-\delta)^{2}\big) + \frac{1}{4}\left(\frac{1/2-\alpha+2\alpha\delta}{1/2-\alpha}\,(1/2+q) + \frac{1/2+\alpha-2\alpha\delta}{1/2+\alpha}\,(1/2-q)\right)\Bigg\}.$$

This bound can be computed explicitly, but the expressions are cumbersome and do not provide much insight. Instead we provide a simpler characterization of the bound, namely that for κ ≥ 2 and α = 0.09p we obtain

$$\mathbb{E}[N(j+1)\mid b(j)] \le 1 - \frac{\mu^2(3\Delta/4)^{2\kappa-2}}{50} = 1 - \frac{p^2}{50},$$

which, combined with (10) and M(0) = (1 − 2∆)/(2∆), yields the bound (8).
R EFERENCES
[1] N. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Available at http://hunch.net/~jl/projects/agnostic active/active.pdf.
[2] G. Blanchard and D. Geman. Hierarchical testing designs for pattern
recognition. to appear in Annals of Statistics, 2005.
[3] M. V. Burnashev and K. Sh. Zigangirov. An interval estimation
problem for controlled observations. Problems in Information Transmission, 10:223–231, 1974. (Translated from Problemy Peredachi
Informatsii, 10(3):51–61, July-September, 1974. Original article submitted June 25, 1973).
[4] R. Castro and R. Nowak. Foundations and Applications of Sensor Management, chapter Active Learning and Sampling. Springer-Verlag, 2006. Available at http://homepages.cae.wisc.edu/~rcastro/active sensing chapter.pdf.
[5] R. Castro, R. Willett, and R. Nowak. Faster rates in regression via active learning. In Proceedings of Neural Information Processing Systems (NIPS), 2005. Longer technical report available at http://homepages.cae.wisc.edu/~rcastro/ECE-05-3.pdf.
[6] R. M. Castro and R. D. Nowak. Error bounds for active learning. Technical report, University of Wisconsin - Madison, ECE Dept., 2006. Available at http://homepages.cae.wisc.edu/~rcastro/bounds active.pdf.
[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic
linear-threshold classifiers via selective sampling. In The Sixteenth
Annual Conference on Learning Theory. LNAI 2777, Springer, 2003.
[8] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with
statistical models. Journal of Artificial Intelligence Research, pages
129–145, 1996.
[9] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances
in Neural Information Processing (NIPS), 2004.
[10] S. Dasgupta. Coarse sample complexity bounds for active learning.
In Advances in Neural Information Processing (NIPS), 2005.
[11] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Eighteenth Annual Conference on Learning Theory (COLT), 2005.
[12] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, August 1997.
[13] G. Golubev and B. Levit. Sequential recovery of analytic periodic
edges in the binary image models. Mathematical Methods of Statistics,
12:95–115, 2003.
[14] P. Hall and I. Molchanov. Sequential methods for design-adaptive
estimation of discontinuities in regression curves and surfaces. The
Annals of Statistics, 31(3):921–941, 2003.
[15] M. Horstein. Sequential decoding using noiseless feedback. IEEE
Trans. Info. Theory, 9(3):136–143, 1963.
[16] M. Kääriäinen. Active learning in the non-realizable case. Personal communication, 2006.
[17] D. J. C. Mackay. Information-based objective functions for active data
selection. Neural Computation, 4:698–714, 1991.
[18] A. Tsybakov. Optimal aggregation of classifiers in statistical learning.
The Annals of Statistics, 32(1):135–166, 2004.
[19] A. B. Tsybakov. Introduction à l'estimation non-paramétrique. Mathématiques et Applications, 41. Springer, 2004.