Upper and Lower Error Bounds for Active Learning

Rui M. Castro and Robert D. Nowak

Rui Castro is with the Department of Electrical Engineering of the University of Wisconsin - Madison, and Rice University. [email protected] Robert Nowak is with the Department of Electrical Engineering, University of Wisconsin - Madison. [email protected] This work was partially supported by the National Science Foundation grants CCR-0310889, CNS-0519824, and ECS-0529381.

Abstract— This paper analyzes the potential advantages and theoretical challenges of “active learning” algorithms. Active learning involves sequential, adaptive sampling procedures that use information gleaned from previous samples in order to focus the sampling and accelerate the learning process relative to “passive learning” algorithms, which are based on non-adaptive (usually random) samples. There are a number of empirical and theoretical results suggesting that in certain situations active learning can be significantly more effective than passive learning. However, the fact that active learning algorithms are feedback systems makes their theoretical analysis very challenging. It is known that active learning can provably improve on passive learning if the error or noise rate of the sampling process is bounded. However, if the noise rate is unbounded, perhaps the situation most common in practice, then no previously existing theory demonstrates whether or not active learning offers an advantage. To study this issue, we investigate the basic problem of learning a threshold function from noisy observations. We present an algorithm that provably improves on passive learning, even when the noise is unbounded. Moreover, we derive a minimax lower bound for this learning problem, demonstrating that our proposed active learning algorithm converges at a near-optimal rate.

I. INTRODUCTION

In various learning tasks it is possible to use information gleaned from previous samples in order to focus the sampling process, in what is generally referred to as “active learning”. These methods attempt to accelerate the learning task relative to “passive learning” algorithms, which are based on non-adaptive sampling. A prototypical example is document classification: suppose we are given a text document and want to decide whether its contents pertain to finance, to sports, or to something else. We are going to learn how to do this task from examples; that is, we have access to a number of documents that have been labeled by an expert (usually a human), so we know the topics of those documents. In general, labeling examples is expensive and time consuming, so we would like to use as few labeled examples as possible. In many applications, however, we have access to a large number of unlabeled examples. This is the case in our prototypical scenario, where one has a virtually infinite supply of documents. Therefore, ways to automatically decide whether or not obtaining the label of an unlabeled example is worthwhile are crucial for the efficient design of good classifiers.

Interest in active learning within the machine learning community has increased greatly in the last few years, in part due to the dramatic growth of data sets and the high cost of labeling all the examples in such data sets. There are several empirical and theoretical results suggesting that in certain situations active learning can be significantly more effective than passive learning [2], [7], [8], [12], [17].
Many of these results pertain to the “noiseless” setting, in which the labels are a deterministic function of the features. In certain noiseless scenarios it has been shown that the number of labeled examples needed to achieve a desired classification error rate is much smaller than what would be needed using passive learning (in fact, for those scenarios, if passive learning requires $n$ labeled examples then active learning requires only $O(\log n)$ labeled examples) [9]–[12]. Although this setting is interesting from a theoretical perspective, it is very restrictive, and seldom relevant for practical applications.

Some of these results have been extended to the “bounded noise rate” setting. In this setting labels are no longer a deterministic function of the features, but for a given feature the probability of observing a particular label is significantly higher than the probability of observing any other label. In the case of binary classification this means that if $(X, Y)$ is a feature-label pair, where $Y \in \{0, 1\}$, then $|\Pr(Y = 1|X = x) - 1/2| > \mu$ for every $x$ in the feature space, with $\mu > 0$. Under this assumption it can be shown that results similar to the ones for the noiseless scenario can be achieved [1], [5], [7], [16]. These results are intimately related to adaptive sampling techniques in regression problems [3], [5], [13], [14], where similar performance gains have been reported.

In this paper we address scenarios where the noise rate is unbounded, perhaps the situation most common in practice. Previously existing theoretical studies have been unable to answer whether or not active learning is advantageous in this case. To study this issue, we investigate the basic problem of learning a threshold function from noisy observations. Since the noise rate is unbounded, the quality of the observations degrades when the samples are taken in the vicinity of the threshold location. In other words, the noise level tends to 1/2, the flip of a fair coin, as the sampling locations tend to the threshold location. We present an algorithm that provably improves on passive learning, even when the noise is unbounded. Moreover, we derive a minimax lower bound for this learning problem, demonstrating that our proposed active learning algorithm converges at a near-optimal rate. To the best of our knowledge these are the first results of this kind.

The paper is organized as follows: in Section II we formalize the problem and setup. Section III presents the fundamental limits of active learning, in terms of minimax lower bounds. In Section IV we present various algorithms, starting with an algorithm for the bounded noise rate scenario due to [3], which will be the building block for algorithms tackling unbounded noise rate problems. Section V presents some simulation results illustrating our methods, and Section VI closes with some final remarks and open questions. We defer the proofs of the main results to the appendix.

II. PROBLEM FORMULATION

Throughout this paper we focus on a relatively simple one-dimensional problem. Although this might seem a rather “toyish” scenario, it displays many of the features that make active learning appealing and the theoretical analysis of algorithms challenging. Consider estimating a threshold function from noisy samples. This problem boils down to finding the threshold location, and adaptive sampling aims to find this location with a minimal number of strategically placed samples.

Let $(X, Y) \in [0,1] \times \{0,1\}$ be a pair of random variables with unknown distribution $P_{XY}$.
Our goal is to construct a “good” classification rule: given $X$ we want to predict $Y$ as accurately as possible, where our classification rule is a measurable function $f : [0,1] \to \{0,1\}$. The performance of the classifier is evaluated in terms of the expected loss; a natural choice is the 0/1-loss, in which case the risk is simply the probability of classification error,

$$R(f) = E[\mathbf{1}\{f(X) \neq Y\}] = \Pr(f(X) \neq Y). \quad (1)$$

Since we are considering only two classes there is a one-to-one correspondence between classifiers and sets: any reasonable classifier is of the form $f(x) = \mathbf{1}_G(x)$, where $G$ is a measurable subset of $[0,1]$. We use the term classifier interchangeably for both $f$ and $G$. Define the optimal risk as

$$R^* = \inf_{G \text{ measurable}} R(G).$$

This minimum risk is attained by the Bayes classifier

$$G^* = \{x \in [0,1] : \eta(x) \geq 1/2\},$$

where

$$\eta(x) = E[Y|X = x] = \Pr(Y = 1|X = x)$$

is called the conditional probability (we use this term only when it is clear from the context). In general $R(G^*) > 0$, therefore even the optimal classifier misclassifies sometimes. The quantity of interest for the performance evaluation of a classifier $G$ is therefore the excess risk

$$R(G) - R(G^*) = \int_{G \Delta G^*} |2\eta(x) - 1| \, dP_X(x),$$

where $\Delta$ denotes the symmetric difference between two sets ($A \Delta B \equiv (A \cap B^c) \cup (A^c \cap B)$, where $A^c$ and $B^c$ are the complements of $A$ and $B$ respectively), and $P_X$ is the marginal distribution of $X$.

If the distribution $P_{XY}$ were known, then clearly we could just construct the Bayes classifier and we would be done. This is not the case for interesting problems, where we do not have direct access to $P_{XY}$. In most cases we have to construct a classifier based on a finite number of samples of $P_{XY}$, and this is where most differences between passive and active learning methods arise.

For the purposes of this paper we consider particular classes of distributions $P_{XY}$. We assume that $P_X$ is uniform on $[0,1]$, but the results in this paper can easily be generalized to the case where the marginal density of $X$ is bounded above and below. We assume furthermore that the Bayes classifier has the form $G^* = [\theta^*, 1]$, where $\theta^* \in [0,1]$. This means that there exists a threshold $\theta^* \in [0,1]$ such that $\eta(x)$ is less than 1/2 for all $x < \theta^*$, and greater than or equal to 1/2 for all $x \geq \theta^*$.

We assume that we have a large (infinite) pool of examples to select from. We can choose any feature point $X_i \in [0,1]$ and observe its label $Y_i$. The data collection operation has a temporal aspect to it, namely we collect the labeled examples one at a time, starting with $(X_1, Y_1)$ and proceeding until $(X_n, Y_n)$ is observed. Formally we have:

A1 - The observations $Y_i$, $i \in \{1, \ldots, n\}$, are distributed as

$$Y_i = \begin{cases} 1, & \text{with probability } \eta(X_i), \\ 0, & \text{with probability } 1 - \eta(X_i). \end{cases}$$

The random variables $\{Y_i\}_{i=1}^n$ are conditionally independent given $\{X_i\}_{i=1}^n$.

A2.1 - Passive Sampling: The sample locations $X_i$ are independent of $\{Y_j\}_{j \neq i}$.

A2.2 - Active Sampling: The sampling location $X_i$ depends only on $\{X_j, Y_j\}_{j < i}$. In other words,

$$X_i \,|\, X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n, Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n \;\overset{\text{a.s.}}{=}\; X_i \,|\, X_1, \ldots, X_{i-1}, Y_1, \ldots, Y_{i-1}.$$

The conditional distribution on the right-hand side (r.h.s.) of the above expression is called the sampling strategy and is denoted by $S_n$. It completely defines our sampling schedule. After collecting the $n$ examples, that is, after collecting $\{X_i, Y_i\}_{i=1}^n$, we construct a classifier $\hat{G}_n$ that is desirably close to $G^*$.
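To make the sampling protocols concrete, the sketch below simulates the observation model (A1) and the two sampling scenarios. It is only an illustration of the assumptions: the particular choice of $\eta$ and all function names are ours, not part of the formal setup.

```python
import numpy as np

def eta(x, theta_star=0.3):
    # An illustrative conditional probability crossing 1/2 at theta_star,
    # so that the Bayes classifier is G* = [theta_star, 1].
    return float(np.clip(0.5 + (x - theta_star), 0.0, 1.0))

def observe(x, rng):
    # Assumption A1: Y = 1 with probability eta(x), labels conditionally
    # independent given the sample locations.
    return int(rng.random() < eta(x))

def passive_sample(n, rng):
    # Assumption A2.1: locations chosen without looking at any label,
    # here i.i.d. uniform over [0, 1].
    xs = rng.random(n)
    ys = np.array([observe(x, rng) for x in xs])
    return xs, ys

def active_sample(n, pick_next, rng):
    # Assumption A2.2: the i-th location may depend on the previous i-1
    # labeled examples; pick_next plays the role of the strategy S_n.
    xs, ys = [], []
    for _ in range(n):
        x = pick_next(xs, ys)
        xs.append(x)
        ys.append(observe(x, rng))
    return np.array(xs), np.array(ys)
```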
We use the subscript $n$ to denote dependence on the data set, instead of writing it explicitly. Under the passive sampling scenario (A2.1) the sample locations do not depend in any way on our observations, therefore the collection of sample points $\{X_i\}_{i=1}^n$ can be chosen before any observations are collected. On the other hand, the active sampling scenario (A2.2) allows the $i$th sample location to be chosen using all the information collected up to that point (the previous $i-1$ samples).

To be able to present results on rates of convergence of the excess risk we need to impose further conditions on $\eta(\cdot)$, namely on the behavior of the conditional probability around the 1/2 crossing point $\theta^*$. Let $\kappa \geq 1$ and $\mu > 0$; then

$$|\eta(x) - 1/2| \geq \mu |x - \theta^*|^{\kappa-1}, \quad \text{if } |x - \theta^*| \leq \epsilon_0, \quad (2)$$

$$|\eta(x) - 1/2| \geq \mu \epsilon_0^{\kappa-1}, \quad \text{if } |x - \theta^*| > \epsilon_0, \quad (3)$$

for some $\epsilon_0 > 0$. Examples of such conditional probability functions are depicted in Figure 1. The conditions above are very similar to the so-called margin condition (or noise condition) introduced by Tsybakov [18], although our condition is slightly more restrictive. Nevertheless the key aspect of these conditions is the behavior of $\eta(\cdot)$ near the Bayes decision boundary. Let $P(\kappa, \mu)$ be the class of distributions $P_{XY}$ such that the marginal $P_X$ is uniform on $[0,1]$ and $\eta(\cdot)$ satisfies (2) and (3).

If $\kappa = 1$ then the $\eta(\cdot)$ function “jumps” across 1/2, that is, $\eta(\cdot)$ is bounded away from the value 1/2 (see Figure 1(a)). If $\kappa > 1$ then $\eta(\cdot)$ crosses the value 1/2 at $\theta^*$. Arguably the most interesting case corresponds to $\kappa = 2$ (Figure 1(b)). In this case the conditional probability behaves linearly around 1/2. This means that the noise affecting observations made close to the decision boundary is roughly proportional to the distance to the boundary. This might be due to the fact that the feature $X$ is not powerful enough to clearly distinguish the two classes near $\theta^*$. Finally, if $\kappa > 2$ then $\eta(\cdot)$ is very “flat” around $\theta^*$.

Fig. 1. Examples of two conditional distributions $\eta(x) = \Pr(Y = 1|X = x)$. (a) In this case $\eta(\cdot)$ satisfies the margin condition with $\kappa = 1$; (b) here the margin condition is satisfied for $\kappa = 2$.

III. MINIMAX LOWER BOUNDS

In this section we present lower bounds on the performance of active and passive sampling methods, under the conditions described above. The proof techniques are relatively standard, and follow the approach in [19]. The key idea of the proof is to reduce the problem of estimating level sets of $P(\kappa, \mu)$ to the problem of deciding among a finite number of such distributions; in other words, we reduce the estimation problem to a hypothesis testing problem. In this case it suffices to consider only two different distributions $P_{XY}$.

Theorem 1 (Active Sampling Minimax Lower Bound): Let $\kappa > 1$. Under the assumptions (A1) and (A2.2) we have

$$\inf_{\hat{G}_n, S_n} \;\sup_{P_{XY} \in P(\kappa,\mu)} E\bigl[R(\hat{G}_n) - R(G^*)\bigr] \geq c\, n^{-\frac{\kappa}{2\kappa-2}},$$

for $n$ large enough, where $c \equiv c(\kappa, \mu) > 0$ and the infimum is taken over the set of all possible classification rules $\hat{G}_n$ and sampling strategies $S_n$.

Contrast this result with the one attained for passive sampling. Under the passive sampling scenario it is clear that the sample locations $\{X_i\}_{i=1}^n$ must be scattered over the interval $[0,1]$ in a somewhat uniform manner. They can be deterministically placed, for example over a uniform grid, or simply taken uniformly distributed over $[0,1]$. In [18] it was shown that under assumptions (A1), (A2.1), and $\kappa \geq 1$, we have

$$\inf_{\hat{G}_n} \;\sup_{P_{XY} \in P(\kappa,\mu)} E\bigl[R(\hat{G}_n) - R(G^*)\bigr] \geq c\, n^{-\frac{\kappa}{2\kappa-1}}, \quad (4)$$

where the samples $\{X_i\}_{i=1}^n$ are independent and identically distributed (i.i.d.) uniformly over $[0,1]$. Furthermore this bound is tight, in the sense that it is possible to devise classification strategies attaining the same asymptotic behavior.

We notice that under the passive sampling scenario the excess risk decays at a strictly slower rate than under the active sampling scenario. The difference is dramatic when $\kappa \to 1$ (Figure 1(a)). In that case it can actually be shown that an exponential rate of error decay is attainable when actively sampling [3]. When $\kappa \to \infty$ the excess risk decay rates are the same, regardless of the sampling paradigm. This indicates that if no assumptions are made on the conditional distribution $\Pr(Y = 1|X)$ then no advantage can be gained from the extra complexity of active sampling. As remarked before, the most relevant case is $\kappa = 2$. In this case both passive and active sampling methods display polynomial rates of error decay, but the rate for active sampling is $n^{-1}$, which is significantly faster than $n^{-2/3}$, the best possible rate for passive sampling. In the rest of the paper we present algorithms showing that the rates of Theorem 1 are nearly achievable. We finally point out that the result of Theorem 1 applies also to the general setting presented in [18], and that using similar ideas minimax bounds can be derived for higher dimensional classifier classes, characterized in terms of metric entropy.
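The gap between the two rates is easy to tabulate. The following snippet (purely illustrative) prints the exponents of Theorem 1 and of (4) for a few values of $\kappa$:

```python
# Excess-risk exponents: active n^{-kappa/(2 kappa - 2)} (Theorem 1)
# versus passive n^{-kappa/(2 kappa - 1)} (equation (4)).
for kappa in [1.1, 1.5, 2.0, 3.0, 10.0]:
    active = kappa / (2 * kappa - 2)
    passive = kappa / (2 * kappa - 1)
    print(f"kappa = {kappa:4.1f}:  active n^-{active:.3f},  passive n^-{passive:.3f}")
# For kappa = 2 this gives n^-1 versus n^-2/3; as kappa grows both exponents
# approach 1/2, and as kappa -> 1 the active exponent blows up, reflecting the
# exponential rates attainable in the bounded noise case.
```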
IV. ACTIVE SAMPLING ALGORITHMS

In this section we present various active sampling algorithms that allow us to nearly achieve the lower bounds of Theorem 1. We start by presenting an algorithm proposed by Burnashev and Zigangirov [3], inspired by an idea of Horstein [15]. This algorithm is designed to work in the bounded noise rate case, that is, when $\kappa = 1$. This corresponds to a scenario where the conditional probability $\eta(x) = \Pr(Y = 1|X = x)$ is bounded away from 1/2, i.e., $|\eta(x) - 1/2| \geq \mu$ for all $x \in [0,1]$. The results and lessons learned from this algorithm yield some intuition for general margin parameters; namely, a “back-of-the-envelope” analysis indicates that a similar algorithm can be used to nearly achieve the rates of Theorem 1.

A. Bounded noise rate: $\kappa = 1$

Under this scenario we assume that $|\eta(x) - 1/2| \geq \mu$ for all $x \in [0,1]$. Notice that in this case the class $P(1, \mu)$ is a quasi-parametric class: elements of the class are essentially characterized by the location of the point where $\eta(\cdot)$ “crosses” 1/2, denoted by $\theta^*$. If the observations are noiseless (that is, $\eta(x) \in \{0,1\}$, or equivalently $\mu = 1/2$) then it is clear that one can estimate $\theta^*$ very efficiently using binary bisection: start by taking a sample at $X_1 = 1/2$. Depending on the outcome we know if $\theta^*$ is to the left of $X_1$ (if $Y_1 = 1$) or to the right (if $Y_1 = 0$). Proceeding accordingly we can construct an estimate $\hat{\theta}_n$ of $\theta^*$ and a corresponding classifier $\hat{G}_n \equiv [\hat{\theta}_n, 1]$ such that

$$E[R(\hat{G}_n) - R(G^*)] = E[|\hat{\theta}_n - \theta^*|] \leq 2^{-(n+1)}.$$
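In code, noiseless bisection takes only a few lines (a minimal sketch; `observe` stands for any routine returning the label at a query point):

```python
def binary_bisect(n, observe):
    # Noiseless case: eta(x) is 0 or 1, so each label reveals exactly on which
    # side of theta* the query lies, and the uncertainty interval halves.
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2
        if observe(x) == 1:   # Y = 1 means x >= theta*, i.e. x lies in G* = [theta*, 1]
            hi = x
        else:                 # Y = 0 means x < theta*
            lo = x
    return (lo + hi) / 2      # error is at most 2^-(n+1)
```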
If there is noise ($\mu < 1/2$) things get a bit more complicated, in part because our decisions about the sampling depend on all the observations made in the past, which are noisy and therefore unreliable. Nevertheless there is a probabilistic bisection method, proposed in [15], that is suitable for this purpose. The key idea stems from Bayesian estimation. Suppose that we have a prior probability density function $P_0(x)$ on the unknown parameter $\theta^*$, namely that $\theta^*$ is uniformly distributed over the interval $[0,1]$ (that is, $P_0(x) = 1$ for all $x \in [0,1]$). To make the exposition clear assume a particular situation, namely that $\theta^* = 1/4$. Like before, we start by taking a measurement at $X_1 = 1/2$. With probability $\eta(X_1) \geq 1/2 + \mu$ we observe a one, and with probability $1 - \eta(X_1) \leq 1/2 - \mu$ we observe a zero. Therefore it is more likely to observe a one than a zero. Suppose a one was observed. Given these facts we can compute a “posterior” density simply by applying an approximate Bayes rule (we assume that a one is observed with probability $1/2 + \mu$). In this case we would get

$$P_1(x|X_1, Y_1) = \begin{cases} 1 + 2\mu, & \text{if } x \leq 1/2, \\ 1 - 2\mu, & \text{if } x > 1/2. \end{cases}$$

The next step is to choose the sample location $X_2$. We choose $X_2$ so that it bisects the posterior distribution, that is, we take $X_2$ such that $\Pr_{\theta \sim P_1(\cdot)}(\theta > X_2 | X_1, Y_1) = \Pr_{\theta \sim P_1(\cdot)}(\theta < X_2 | X_1, Y_1)$. In other words, $X_2$ is just the median of the posterior distribution. If our model is correct, the probability of the event $\{\theta < X_2\}$ is identical to the probability of the event $\{\theta > X_2\}$, and therefore sampling $Y_2$ at $X_2$ is most informative. We continue iterating this procedure until we have collected $n$ samples. The estimate $\hat{\theta}_n$ is defined as the median of the final posterior distribution. Figure 2 illustrates the procedure. Note that if $\mu = 1/2$ then probabilistic bisection is simply the binary bisection described above.

Fig. 2. Illustration of the probabilistic bisection strategy. The shaded areas correspond to 1/2 of the probability mass of the posterior densities.

The above algorithm seems to work extremely well in practice, but it is hard to analyze and there are few theoretical guarantees for it, especially pertaining to rates of error decay. In [3] a similar algorithm was proposed. Although its operation is slightly more complicated, it is easier to analyze. That algorithm uses essentially the same ideas, but enforces a parametric structure for the posterior by imposing that the sample locations $X_i$ lie on a grid, in particular $X_i \in \{0, \Delta, 2\Delta, \ldots, 1\}$ where $m = \Delta^{-1} \in \mathbb{N}$. Furthermore, in the application of the Bayes rule we use $\alpha$ instead of $\mu$, where $0 < \alpha < \mu < 1/2$. A description of the algorithm can be found in [3] or in [4]. We call this method the BZ algorithm. We have the following remarkable result [3]:

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \leq \frac{1-\Delta}{\Delta}\left(\frac{1+2\mu}{2+4\alpha} + \frac{1-2\mu}{2-4\alpha}\right)^n. \quad (5)$$

From the estimate $\hat{\theta}_n$ we construct a classifier $\hat{G}_n \equiv [\hat{\theta}_n, 1]$. To get a bound on the expected excess risk one proceeds by integration:

$$\begin{aligned}
E[R(\hat{G}_n) - R(G^*)] &= E\left[\int_{\hat{G}_n \Delta G^*} |2\eta(x) - 1|\,dx\right] \\
&\leq E[|\hat{G}_n \Delta G^*|] = E[|\hat{\theta}_n - \theta^*|] \\
&= \int_0^1 \Pr(|\hat{\theta}_n - \theta^*| > t)\,dt \\
&= \int_0^\Delta \Pr(|\hat{\theta}_n - \theta^*| > t)\,dt + \int_\Delta^1 \Pr(|\hat{\theta}_n - \theta^*| > t)\,dt \\
&\leq \Delta + (1-\Delta)\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \\
&\leq \Delta + \frac{(1-\Delta)^2}{\Delta}\left(\frac{1+2\mu}{2+4\alpha} + \frac{1-2\mu}{2-4\alpha}\right)^n.
\end{aligned}$$

Taking $\Delta = \left(\frac{1+2\mu}{2+4\alpha} + \frac{1-2\mu}{2-4\alpha}\right)^{n/2}$ and $\alpha = \frac{1-\sqrt{1-4\mu^2}}{4\mu}$ (to minimize the exponent) yields

$$E[R(\hat{G}_n) - R(G^*)] \leq 2\left(\frac{1}{2} + \frac{1}{2}\sqrt{1-4\mu^2}\right)^{n/2}.$$

Notice that the excess risk decays exponentially in the number of samples. This is much faster than what is attainable using only passive sampling, where the decay rate is $1/n$. As in the noiseless scenario we obtain an exponential rate of error decay; the difference is that now the exponent depends on the noise margin $\mu$, larger noise margins corresponding to faster error decay rates.
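A minimal sketch of the probabilistic bisection strategy discussed above is given below. The discretization of the posterior onto a fine grid, and the specific grid size, are implementation conveniences of ours; the method of [15] propagates a continuous density, and the BZ algorithm instead restricts the sample locations themselves to a grid.

```python
import numpy as np

def probabilistic_bisection(n, observe, alpha, grid_size=1024):
    # alpha in (0, 1/2) is the noise margin assumed in the approximate Bayes
    # update (the role played by alpha < mu in the BZ algorithm).
    grid = (np.arange(grid_size) + 0.5) / grid_size
    post = np.full(grid_size, 1.0 / grid_size)          # uniform prior on theta*
    for _ in range(n):
        x = grid[np.searchsorted(np.cumsum(post), 0.5)]  # posterior median
        y = observe(x)
        left = grid <= x                                 # event {theta* <= x}
        if y == 1:   # a one is evidence that x >= theta*
            post = np.where(left, (0.5 + alpha) * post, (0.5 - alpha) * post)
        else:        # a zero is evidence that x < theta*
            post = np.where(left, (0.5 - alpha) * post, (0.5 + alpha) * post)
        post /= post.sum()                               # renormalize
    return grid[np.searchsorted(np.cumsum(post), 0.5)]   # median estimate of theta*
```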
B. Unbounded noise rate: $\kappa > 1$

In this section we consider scenarios where the noise rate is not bounded; that is, as one makes observations closer to the transition point $\theta^*$ the observation noise becomes larger. This clearly precludes extremely fast excess risk decay rates, since as one focuses more and more on the Bayes decision boundary the quality of the observations degrades. To gain some intuition we consider first the case where $|\eta(x) - 1/2| = \mu|x - \theta^*|^{\kappa-1}$, corresponding to a margin parameter $\kappa$. Proceed now as in the previous section, and collect samples over a grid, namely $X_i \in \{0, \Delta, 2\Delta, \ldots, 1\}$ where $m = \Delta^{-1} \in \mathbb{N}$. If this grid does not line up with the transition point $\theta^*$ then $|\eta(x) - 1/2| \geq \mu(\Delta/2)^{\kappa-1}$ for all $x \in \{0, \Delta, \ldots, 1\}$. Of course this is in general an unrealistic assumption, but let us consider it for now. We can now proceed by using the method described in the previous section, replacing $\mu$ by $\mu(\Delta/2)^{\kappa-1}$ and using (5). Begin by noticing that, due to the special form of $\eta(\cdot)$, the behavior of the expected excess risk is related to the behavior of $|\hat{\theta}_n - \theta^*|$ in an interesting way, namely

$$\begin{aligned}
E[R(\hat{G}_n) - R(G^*)] &= E\left[\int_{\hat{G}_n \Delta G^*} |2\eta(x) - 1|\,dx\right] \\
&= E\left[\int_{\hat{G}_n \Delta G^*} 2\mu|x - \theta^*|^{\kappa-1}\,dx\right] \\
&\leq E[|\hat{\theta}_n - \theta^*|^\kappa].
\end{aligned}$$

We now proceed in a similar fashion as before:

$$\begin{aligned}
E[R(\hat{G}_n) - R(G^*)] &\leq E[|\hat{\theta}_n - \theta^*|^\kappa] = \int_0^1 \Pr(|\hat{\theta}_n - \theta^*|^\kappa > t)\,dt \\
&= \int_0^1 \Pr(|\hat{\theta}_n - \theta^*| > t^{1/\kappa})\,dt \\
&= \int_0^{\Delta^\kappa} \Pr(|\hat{\theta}_n - \theta^*| > t^{1/\kappa})\,dt + \int_{\Delta^\kappa}^1 \Pr(|\hat{\theta}_n - \theta^*| > t^{1/\kappa})\,dt \\
&\leq \Delta^\kappa + (1 - \Delta^\kappa)\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \\
&\leq \Delta^\kappa + \frac{1}{\Delta}\left(\frac{1}{2} + \frac{1}{2}\sqrt{1 - 4\mu^2(\Delta/2)^{2\kappa-2}}\right)^n \\
&\leq \Delta^\kappa + \frac{1}{\Delta}\left(1 - \mu^2(\Delta/2)^{2\kappa-2}\right)^n \\
&\leq \Delta^\kappa + \frac{1}{\Delta}\exp\left(-n\mu^2(\Delta/2)^{2\kappa-2}\right),
\end{aligned}$$

where the last two steps follow from the facts that $\sqrt{x} \leq (x+1)/2$ for all $x \geq 0$, and that $(1 + s(x))^x \leq \exp(x s(x))$ for all $x > 0$ and $s(x) > -1$. Finally, let

$$\Delta = 2\left(\frac{\kappa+1}{\mu^2(2\kappa-2)} \cdot \frac{\log n}{n}\right)^{\frac{1}{2\kappa-2}}.$$

We conclude that

$$E[R(\hat{G}_n) - R(G^*)] \propto \left(\frac{\log n}{n}\right)^{\frac{\kappa}{2\kappa-2}}. \quad (6)$$

This is the lower bound rate displayed in Theorem 1, apart from logarithmic factors. The result indicates that, in principle, a methodology similar to the Burnashev-Zigangirov algorithm might allow us to achieve the lower bound rates.

It is important to emphasize that the above result holds under the assumption that the sampling grid is not aligned with the unknown transition point $\theta^*$. If that is not the case then we will have $|\eta(x) - 1/2| < \mu(\Delta/2)^{\kappa-1}$ for some sampling point, and the analysis above does not hold. One way to avoid this problem is to consider different sampling grids. A methodology that can be shown to work consists in dividing the available measurements into three equal sets and using three different offset sampling grids. If we have a budget of $n$ samples we allocate $n/3$ samples and run the BZ algorithm on one sampling grid. Then we use another $n/3$ samples and run the BZ algorithm on a second sampling grid, essentially a slightly shifted version of the first, and proceed in an analogous fashion with the remaining $n/3$ samples. The rationale is that at most one of these sampling grids is going to be closely aligned with the unknown transition point $\theta^*$, and therefore the other two are not going to display any problems. We then check for agreement among the three estimators and make a decision that provably attains the rate in (6). Although this methodology is satisfying from a theoretical point of view, it is somewhat wasteful (essentially only one third of the samples are effectively used), it does not generalize well to more complicated and realistic scenarios than the one considered in this paper, and it is therefore not a very relevant approach in practice.
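Before moving on, note that the balancing act behind the choice of $\Delta$ above is easy to check numerically. The sketch below evaluates the derived upper bound $\Delta^\kappa + \Delta^{-1}\exp(-n\mu^2(\Delta/2)^{2\kappa-2})$ at the prescribed $\Delta$, under the idealized no-alignment assumption; the values of $n$, $\kappa$, and $\mu$ are arbitrary illustrations.

```python
import math

def delta_n(n, kappa, mu):
    # Bin width prescribed by the derivation above.
    return 2 * ((kappa + 1) / (mu**2 * (2 * kappa - 2))
                * math.log(n) / n) ** (1 / (2 * kappa - 2))

def excess_risk_bound(n, kappa, mu):
    # Delta^kappa + (1/Delta) * exp(-n mu^2 (Delta/2)^(2 kappa - 2)).
    d = delta_n(n, kappa, mu)
    return d**kappa + math.exp(-n * mu**2 * (d / 2) ** (2 * kappa - 2)) / d

for n in [10**3, 10**4, 10**5, 10**6]:
    # For kappa = 2 the bound decays essentially like log(n)/n.
    print(f"n = {n:>8d}: bound = {excess_risk_bound(n, kappa=2, mu=0.5):.2e}")
```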
In the remainder of this paper we present an alternative methodology that solves some of these problems and provides a practical algorithm capable of achieving the desired fast rates. It is worth pointing out that, even when the assumption that the sampling grid does not line up with the transition point $\theta^*$ fails, the BZ method still works extremely well in practice; the difficulties arise solely in the performance analysis of the algorithm.

C. Unbounded noise rate: a sample-efficient algorithm

The algorithm presented here is similar to the original BZ algorithm, although some modifications are made allowing the analysis of more general noise models than the bounded rate noise. As in the original methodology we concentrate our sampling on a grid, in particular $X_i \in \{0, \Delta, 2\Delta, \ldots, 1\}$ where $m = \Delta^{-1} \in \mathbb{N}$. The algorithm works by propagating a posterior-like density (we will refer to it simply as the posterior hereafter). After $j$ observations the posterior is described as $P_j : [0,1] \to \mathbb{R}$,

$$P_j(x) = \Delta^{-1} \sum_{i=1}^m b_i(j)\, \mathbf{1}_{I_i}(x),$$

where $I_1 = [0, \Delta]$ and $I_i = (\Delta(i-1), \Delta i]$ for $i \in \{2, \ldots, m\}$. We initialize this posterior by taking $b_i(0) = \Delta$ for all $i$. Note that the posterior is completely characterized by $b(j) = \{b_1(j), \ldots, b_m(j)\}$, and that $\sum_{i=1}^m b_i(j) = 1$.

We select the sample location $X_{j+1}$ using a randomized approach based on $P_j$: choose a number $k_{j+1} \in \{1, \ldots, m\}$ according to the distribution $b(j)$; in other words, $\Pr(k_{j+1} = i) = b_i(j)$. Now let $W \in \{-2, -1, 0, 1\}$ be an independent random variable, taking each of the values $-2, -1, 0, 1$ with probability 1/4. The sample location is given by $X_{j+1} = \min(\max(\Delta(k_{j+1} - W), 0), 1)$. We collect a label $Y_{j+1}$ and update the posterior accordingly (implicitly assuming a noise margin $0 < \alpha < 1/2$): namely, if $i \leq k_{j+1} - W$,

$$b_i(j+1) = \begin{cases} \dfrac{(1/2 - \alpha)\, b_i(j)}{1/2 + \alpha - 2\alpha \sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 0, \\[2ex] \dfrac{(1/2 + \alpha)\, b_i(j)}{1/2 - \alpha + 2\alpha \sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 1, \end{cases}$$

and if $i > k_{j+1} - W$,

$$b_i(j+1) = \begin{cases} \dfrac{(1/2 + \alpha)\, b_i(j)}{1/2 + \alpha - 2\alpha \sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 0, \\[2ex] \dfrac{(1/2 - \alpha)\, b_i(j)}{1/2 - \alpha + 2\alpha \sum_{l=1}^{k_{j+1}-W} b_l(j)}, & \text{if } Y_{j+1} = 1. \end{cases}$$

We call $\alpha$ the update parameter. Finally, at time $j$, the estimate of the true transition point $\theta^*$ is given by $\hat{\theta}_j$, defined as the median of $P_j(\cdot)$.

A few remarks are important at this point. (i) This method differs from the original BZ algorithm in that the latter chooses the next sample location to be one of the two grid points closest to the median of the posterior. Choosing the sample point according to that procedure, or “sampling” the posterior (as is the case with our algorithm), is essentially the same provided the posterior is somewhat concentrated around a point, which happens with high probability after a number of observations have been collected. (ii) Instead of choosing $W$ to take values in $\{-2, -1, 0, 1\}$ one could take $W$ to be Bernoulli with parameter 1/2. This methodology works provided we are operating under the bounded noise rate assumption, or under general noise models if the transition point $\theta^*$ is not closely aligned with the sampling grid. If the latter condition does not hold then the proof technique breaks down, and it is not possible to guarantee rates of error decay; this same problem arises in the original BZ method. (iii) By using four sample points around the chosen bin $k_{j+1}$ (or, in general, points in the vicinity of the chosen bin) we can guarantee that most of the time we are taking a sample at a point that has a reasonable noise margin (that is, a point such that $|\eta(x) - 1/2|$ is reasonably large). A sketch of the complete procedure is given below.
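The following sketch transcribes the procedure just described; the handling of the clipped boundary grid points and the bin-midpoint read-out of the posterior median are our implementation choices.

```python
import numpy as np

def modified_bz(n, observe, m, alpha, rng):
    # Sample-efficient variant of the BZ algorithm: m = 1/Delta grid bins,
    # alpha in (0, 1/2) is the update parameter.
    delta = 1.0 / m
    b = np.full(m, delta)                  # b_i(0) = Delta: uniform posterior
    idx = np.arange(1, m + 1)
    for _ in range(n):
        k = rng.choice(m, p=b) + 1         # Pr(k_{j+1} = i) = b_i(j)
        w = rng.choice([-2, -1, 0, 1])     # W uniform on {-2, -1, 0, 1}
        edge = k - w
        x = min(max(delta * edge, 0.0), 1.0)
        y = observe(x)
        s = b[: min(max(edge, 0), m)].sum()   # S = sum_{l <= k-W} b_l(j)
        if y == 0:                         # evidence that theta* > x
            z = 0.5 + alpha - 2 * alpha * s
            b = np.where(idx <= edge, (0.5 - alpha) * b, (0.5 + alpha) * b) / z
        else:                              # evidence that theta* <= x
            z = 0.5 - alpha + 2 * alpha * s
            b = np.where(idx <= edge, (0.5 + alpha) * b, (0.5 - alpha) * b) / z
        b /= b.sum()                       # guard against floating-point drift
    k_med = int(np.searchsorted(np.cumsum(b), 0.5))   # median bin of P_n
    return (k_med + 0.5) * delta           # midpoint of that bin as theta-hat
```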
The analysis of the algorithm above can be done in a similar fashion to the analysis of the original BZ algorithm, although further complications arise due to the inability to control the noise margin at one of the sampling points (see the proof of Lemma 1 for details). We use the algorithm to construct the classifier $\hat{G}_n = [\hat{\theta}_n, 1]$ and have the following result.

Theorem 2: Let $P_{XY} \in P(\kappa, \mu)$. Furthermore assume that $P_{XY}$ does not satisfy the margin condition for any margin parameter $\kappa'$ such that $\kappa' < \kappa$. Then there exist a bin size $\Delta \equiv \Delta(n, \kappa, \mu)$ and an update parameter $\alpha \equiv \alpha(n, \kappa, \mu)$ such that

$$E[R(\hat{G}_n) - R(G^*)] \leq C\left(\frac{\log n}{n}\right)^{\frac{\kappa}{2\kappa-2}}, \quad (7)$$

where $C \equiv C(\kappa, \mu) > 0$.

The proof of the theorem follows from an upper bound on $\Pr(|\hat{\theta}_n - \theta^*| > \Delta)$, similar to (5), and the reasoning of Section IV-B. The key result enabling the proof is the following lemma.

Lemma 1: Let $P_{XY} \in P(\kappa, \mu)$ and $\kappa \geq 2$. Then taking $\alpha = 0.09\mu(3\Delta/4)^{\kappa-1}$ yields

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \leq \frac{1-2\Delta}{2\Delta}\left(1 - \frac{\mu^2(3\Delta/4)^{2\kappa-2}}{50}\right)^n. \quad (8)$$

Although this lemma is presented only for $\kappa \geq 2$, it is possible to obtain similar results for any $\kappa > 1$, identical to the above bound apart from the constants. In the interest of clarity we consider only the case $\kappa \geq 2$.

Some remarks are important at this time. In the statement of Theorem 2 one notices that $\Delta$ and $\alpha$ are functions of $\kappa$ and $\mu$. This indicates that the proposed algorithm is not adaptive; that is, it cannot handle a scenario where $\kappa$ and $\mu$ are unknown. It turns out that the choice of $\Delta$ as a function of $n$, $\kappa$, and $\mu$ is not critical for the practical performance of the algorithm: essentially, finer sampling grids provide the same or better performance than the one predicted by the theoretical analysis. However, the choice of the update parameter $\alpha$ is critical, since the wrong scaling of this parameter with $n$ can lead to slower rates of error decay.

V. SIMULATION RESULTS

In this section we present a simple simulation of the proposed algorithm. This does not constitute a thorough empirical study, but simply an illustration of the practical performance characteristics. We conducted various other simulation tests, all achieving results agreeing with the theoretical analysis above. In this simulation we considered the case $\kappa = 2$, and the distribution $P_{XY}$ is characterized by $\eta(x) = (x - \theta^*)/2 + 1/2$. We performed 1000 runs, in each one selecting $\theta^*$ uniformly at random in the interval $[0,1]$. In Figure 3 we plot the empirical expected excess risk versus the number of samples $n$. The expected excess risk is estimated by averaging the excess risk over the 1000 trials. We display the results in a log-log plot.

Fig. 3. Simulation results for the case $\kappa = 2$. The empirical expected excess risk for the active sampling algorithm corresponds to the bold line, and the corresponding curve for passive sampling corresponds to the regular line.

We observe that the active method outperforms the passive method for any number of samples. Furthermore, the rates of error decay coincide with the rates predicted by the theory: the excess risk decays at a rate $n^{-1}$ for active sampling (in the log-log plot this corresponds to a slope of $-1$), and at a rate $n^{-2/3}$ for passive sampling (corresponding to a slope of $-2/3$ in the plot).
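A driver in the spirit of this experiment, reusing the `modified_bz` sketch above, is shown below. The number of trials, the grid size, and the seed are arbitrary choices; we also use a larger constant in front of $\mu(3\Delta/4)^{\kappa-1}$ than the conservative 0.09 of Lemma 1, keeping only the scaling of $\alpha$ in $\Delta$, which is what matters for the rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10
mu, kappa = 0.5, 2                       # eta(x) = (x - theta*)/2 + 1/2
alpha = 0.5 * mu * (3.0 / (4 * m)) ** (kappa - 1)   # scaling from Lemma 1
errors = []
for _ in range(200):
    theta = rng.random()                 # theta* uniform on [0, 1]
    observe = lambda x: int(rng.random() < (x - theta) / 2 + 0.5)
    est = modified_bz(n, observe, m, alpha, rng)
    errors.append(abs(est - theta) ** 2 / 2)   # excess risk = |err|^2 / 2 here
print("empirical expected excess risk:", np.mean(errors))
```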
VI. FINAL REMARKS AND OPEN QUESTIONS

We presented a formal analysis of active learning in an unbounded noise setting. We showed that the extra flexibility of active sampling, when compared to passive sampling, allows for faster rates of error decay, but that the gains depend on the behavior of the noise margin around the Bayes decision boundary. We presented results characterizing the fundamental limits of active learning for the one-dimensional setting of this paper, and provided algorithms capable of nearly achieving those limits. The algorithms are practical and easily implemented.

The algorithms presented are not adaptive to the noise conditions, in the sense that they require knowledge of the margin parameters $\kappa$ and $\mu$ in order to achieve the optimal rates, at least in theory. However, empirical evidence suggests that the proposed algorithms yield the optimal performance (in terms of error rates) with only mild knowledge of $\kappa$, but the proof techniques used here are not powerful enough to demonstrate that. The ideas presented here can be extended to larger and more complicated classes of classifiers, characterized for example in terms of metric entropy. The development and analysis of such extensions is currently being investigated.

APPENDIX

A. Proof of Theorem 1

The proof strategy follows the basic idea behind standard minimax analysis methods, and consists in reducing the problem of classification in the class $P(\kappa, \mu)$ to a hypothesis testing problem. In this case it suffices to consider two hypotheses:

$$\eta_0(x) = \begin{cases} \frac{1}{2} - \mu(t - x)^{\kappa-1}, & x \leq t, \\ 1, & x > t, \end{cases} \qquad \eta_1(x) = \begin{cases} \frac{1}{2} + \mu x^{\kappa-1}, & x \leq t, \\ 1, & x > t. \end{cases}$$

These are depicted in Figure 4. Note that $G_0^* = [t, 1]$ and $G_1^* = [0, 1]$ (provided $t$ is small enough). Let $\hat{G}_n$ be any classifier (i.e., a subset of $[0,1]$) and define $\psi_n = \arg\min_{i \in \{0,1\}} |\hat{G}_n \Delta G_i^*|$, where $|\cdot|$ denotes the volume of a set. We have the following result.

Lemma 2: For $j \in \{0, 1\}$,

$$\psi_n \neq j \;\Rightarrow\; R_j(\hat{G}_n) - R_j(G_j^*) \geq \frac{2\mu}{\kappa 2^\kappa}\, t^\kappa \triangleq s \propto \mu t^\kappa,$$

where $R_j$ denotes the risk (1) under hypothesis $j$.

Fig. 4. The two conditional distributions used in the proof of Theorem 1.

Proof: Assume first that $j = 0$; the case $j = 1$ is analogous. Define $\tilde{G}_n = \hat{G}_n \cap [0, t]$ and $\tilde{G}_n^c = \hat{G}_n^c \cap [0, t]$. Since $\psi_n = 1$ we have

$$|\hat{G}_n \Delta G_1^*| \leq |\hat{G}_n \Delta G_0^*| \;\Leftrightarrow\; |\tilde{G}_n \Delta G_1^*| \leq |\tilde{G}_n \Delta G_0^*| \;\Leftrightarrow\; |\tilde{G}_n^c| \leq |\tilde{G}_n|,$$

so $\tilde{G}_n$ covers at least half of $[0, t]$. Therefore, since $|2\eta_0(x) - 1| = 2\mu(t-x)^{\kappa-1}$ is decreasing on $[0, t]$,

$$R_0(\hat{G}_n) - R_0(G_0^*) \geq \int_{\tilde{G}_n} |2\eta_0(x) - 1|\,dx \geq \int_{t/2}^t |2\eta_0(x) - 1|\,dx = \int_{t/2}^t 2\mu(t-x)^{\kappa-1}\,dx = \frac{2\mu}{\kappa 2^\kappa}\, t^\kappa.$$

A consequence of Lemma 2 is that

$$\inf_{\hat{G}_n} \;\sup_{P_{XY} \in P(\kappa,\mu)} \Pr\bigl(R(\hat{G}_n) - R(G^*) \geq s\bigr) \geq \max_j P_j(\psi_n \neq j) \geq \inf_{\phi_n} \max_j P_j(\phi_n \neq j) \triangleq p_e, \quad (9)$$

where the infimum on the r.h.s. is taken with respect to all functions $\phi_n$ of the data onto $\{0,1\}$. All that remains to be done is to construct a lower bound for $p_e$. To this end we use a result from [19]. Let $P_{0,n} \equiv P^{(0)}_{X_1,\ldots,X_n,Y_1,\ldots,Y_n}$ be the probability measure of the random variables $\{X_i, Y_i\}_{i=1}^n$ under hypothesis 0, and define $P_{1,n} \equiv P^{(1)}_{X_1,\ldots,X_n,Y_1,\ldots,Y_n}$ analogously. Theorem 2.2 of [19] says that if $K(P_{1,n} \| P_{0,n}) \leq \alpha < \infty$ then

$$p_e \geq \max\left(\frac{1}{4}\exp(-\alpha),\; \frac{1 - \sqrt{\alpha/2}}{2}\right),$$

where $K(\cdot\|\cdot)$ denotes the Kullback divergence. Now

$$\begin{aligned}
K(P_{1,n} \| P_{0,n}) &= E_1\left[\log \frac{P_{1,n}}{P_{0,n}}\right] = E_1\left[E_1\left[\log \frac{P_{1,n}}{P_{0,n}} \,\Big|\, X_1, \ldots, X_n\right]\right] \\
&= n\, E_1\left[E_1\left[\log \frac{P^{(1)}_{Y_i|X_i}}{P^{(0)}_{Y_i|X_i}} \,\Big|\, X_1, \ldots, X_n\right]\right] \\
&\leq n\, E_1\left[\log \frac{P^{(1)}_{Y_i|X_i}}{P^{(0)}_{Y_i|X_i}} \,\Big|\, X_i = 0,\ \forall i\right] \\
&\leq n\left((2\mu t^{\kappa-1})^2 + o\bigl((2\mu t^{\kappa-1})^2\bigr)\right)
\end{aligned}$$

as $t \to 0$, where $E_1$ denotes the expectation taken with respect to the measure $P_{1,n}$.
Therefore $K(P_{1,n} \| P_{0,n}) \leq 4\mu^2 n t^{2\kappa-2}$. Taking $t \propto (\mu^2 n)^{-\frac{1}{2\kappa-2}}$ and using (9) we conclude that

$$\inf_{\hat{G}_n} \;\sup_{P_{XY} \in P(\kappa,\mu)} \Pr\left(R(\hat{G}_n) - R(G^*) \geq \mu^{-\frac{1}{\kappa}} n^{-\frac{\kappa}{2\kappa-2}}\right) \geq c > 0,$$

where $c > 0$ is a constant. The statement of the theorem now follows from the application of Markov's inequality to the above expression:

$$E\bigl[R(\hat{G}_n) - R(G^*)\bigr] \geq \mu^{-\frac{1}{\kappa}} n^{-\frac{\kappa}{2\kappa-2}} \Pr\left(R(\hat{G}_n) - R(G^*) \geq \mu^{-\frac{1}{\kappa}} n^{-\frac{\kappa}{2\kappa-2}}\right).$$

Remark: Notice that, when bounding the Kullback divergence, we considered all the feature examples to be taken at $X_i = 0$, the most beneficial place to take a sample. If instead we assume the $X_i$ are i.i.d. uniform over $[0,1]$, the Kullback divergence is approximately $n t^{2\kappa-2} \cdot t = n t^{2\kappa-1}$, since roughly only a fraction $t$ of the samples are informative (any sample taken in $(t, 1]$ is non-informative). Proceeding as before we obtain the passive sampling minimax bound (4), proved in [18] for a more general setting.

B. Proof of Lemma 1

The proof of Lemma 1 follows closely the approach taken in [3], but some complications arise from the fact that the noise margin at one of the grid sample points (namely the point closest to $\theta^*$) cannot be lower bounded. For a fixed distribution $P_{XY} \in P(\kappa, \mu)$ and a given parameter $\Delta$, two different scenarios can happen. For all the results below we assume that $\Delta$ is small enough that condition (3) does not play a critical role. The proof also assumes that $\kappa \geq 2$. It is possible to generalize the argument to any $\kappa > 1$ (by changing the definitions of case 1 and case 2 below), but we do not consider this for the sake of clarity.

Case 1: The sampling grid points are not aligned with the transition point $\theta^*$. We consider that this happens if the point $\theta^*$ is in the middle half of one of the bins; formally, for some $i \in \mathbb{N}$, $\theta^* - i\Delta > \Delta/4$ and $(i+1)\Delta - \theta^* > \Delta/4$. This implies that the noise margin can be lower bounded at all sampling points, in particular $|\eta(x) - 1/2| \geq \mu(\Delta/4)^{\kappa-1}$ for all $x = \Delta i$, $i \in \{0, \ldots, m\}$. Therefore this case is essentially like the bounded noise margin scenario, and the analysis can be carried out using techniques in [3]. Define $k(\theta^*)$ to be the index of the bin $I_i$ containing $\theta^*$, that is, $\theta^* \in I_{k(\theta^*)}$. Define

$$M(j) = \frac{1 - b_{k(\theta^*)}(j)}{b_{k(\theta^*)}(j)},$$

and

$$N(j+1) = \frac{M(j+1)}{M(j)} = \frac{b_{k(\theta^*)}(j)\,\bigl(1 - b_{k(\theta^*)}(j+1)\bigr)}{b_{k(\theta^*)}(j+1)\,\bigl(1 - b_{k(\theta^*)}(j)\bigr)}.$$

The reasoning behind these definitions is made clear later. For now, notice that $M(j)$ is a decreasing function of $b_{k(\theta^*)}(j)$. After $n$ observations our estimate of $\theta^*$ is the median of the posterior density $P_n$. Taking that into account we conclude that

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \leq \Pr\bigl(b_{k(\theta^*)}(n) < 1/2\bigr) = \Pr(M(n) > 1) \leq E[M(n)],$$

where the last step follows from Markov's inequality. The definition of $M(j)$ above is meant to get more leverage out of Markov's inequality, in a spirit similar to Chernoff bounding techniques. Using the definition of $N(j)$ and some conditioning we get

$$\begin{aligned}
E[M(n)] &= E[M(n-1)\,N(n)] = E\bigl[E[M(n-1)\,N(n)\,|\,b(n-1)]\bigr] \\
&= E\bigl[M(n-1)\,E[N(n)\,|\,b(n-1)]\bigr] \\
&\;\;\vdots \\
&= M(0)\, E\bigl[E[N(1)|b(0)] \cdots E[N(n)|b(n-1)]\bigr] \\
&\leq M(0) \left\{\max_{j \in \{0,\ldots,n-1\}} \;\sup_{b(j)}\, E[N(j+1)\,|\,b(j)]\right\}^n. \quad (10)
\end{aligned}$$

The rest of the proof consists in showing that $E[N(j+1)|b(j)]$ is always strictly less than 1. Before proceeding we make some remarks about the above technique. Note that $M(j)$ measures how much of the posterior mass lies outside the bin containing $\theta^*$ (if $M(j) = 0$ all the mass in our posterior is in the bin containing $\theta^*$, the least-error scenario).
The ratio $N(j)$ is a measure of the improvement (in terms of concentrating the posterior around the bin containing $\theta^*$) obtained by sampling at $X_j$ and observing $Y_j$; it is strictly less than one when an improvement is made. The bound (10) above is therefore only useful if, no matter what happened in the past, a measurement made with the proposed algorithm always leads to a performance improvement on average. It is possible to show that

$$E[N(j+1)\,|\,b(j)] \leq \frac{1}{2}\left(\frac{1 + 2\mu(\Delta/4)^{\kappa-1}}{1 + 2\alpha} + \frac{1 - 2\mu(\Delta/4)^{\kappa-1}}{1 - 2\alpha}\right).$$

Since this derivation is cumbersome, and very similar to the one of case 2, we present only the derivation for that case. One can check that using the update parameter $\alpha$ prescribed in the lemma statement, together with (10), yields the desired bound (8).

Case 2: The grid sampling points are aligned with the transition point $\theta^*$. Formally, there is a grid point $i\Delta$, $i \in \mathbb{N}$, such that $|\theta^* - i\Delta| \leq \Delta/4$ (see Figure 5). Therefore, for that particular sampling point we have $|\eta(i\Delta) - 1/2| \leq \mu(\Delta/4)^{\kappa-1}$, while for all the other grid points $x \neq i\Delta$ we have $|\eta(x) - 1/2| > \mu(3\Delta/4)^{\kappa-1}$.

Fig. 5. Illustration of case 2 in the proof of Theorem 2. Only one sample point lies inside the gray area (corresponding to a noise margin less than $\mu(\Delta/4)^{\kappa-1}$).

Unlike in case 1 it is not sufficient to keep track of the bin containing $\theta^*$. This is in part due to the proof strategy, but also due to the fact that if $\theta^*$ coincides exactly with a sampling point, the two bins of the posterior that share that point might both carry significant probability mass after several measurements. We therefore need to keep track of both bins, which complicates matters a bit. To simplify the notation let $p = \mu(3\Delta/4)^{\kappa-1}$ and $q = \eta(i\Delta) - 1/2$, so that $|q| \leq \mu(\Delta/4)^{\kappa-1}$. Define $k(\theta^*) = \arg\min_{i \in \mathbb{N}} |\theta^* - i\Delta|$, so that $\theta^*$ is either in bin $I_{k(\theta^*)}$ or in bin $I_{k(\theta^*)+1}$. In this presentation we ignore “edge effects,” that is, we assume that $\theta^*$ is not close to 0 or 1; these cases can be handled in a similar fashion. Define

$$M(j) = \frac{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)}{b_{k(\theta^*)}(j) + b_{k(\theta^*)+1}(j)},$$

and

$$N(j+1) = \frac{M(j+1)}{M(j)} = \frac{b_{k(\theta^*)}(j) + b_{k(\theta^*)+1}(j)}{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)} \cdot \frac{1 - b_{k(\theta^*)}(j+1) - b_{k(\theta^*)+1}(j+1)}{b_{k(\theta^*)}(j+1) + b_{k(\theta^*)+1}(j+1)}.$$

These are very similar to the definitions in the proof of case 1, except that now we are keeping track of two bins. We proceed exactly as in case 1 and obtain

$$\Pr(|\hat{\theta}_n - \theta^*| > \Delta) \leq \Pr\bigl(b_{k(\theta^*)}(n) + b_{k(\theta^*)+1}(n) < 1/2\bigr) \leq M(0) \left\{\max_{j \in \{0,\ldots,n-1\}} \;\sup_{b(j)}\, E[N(j+1)\,|\,b(j)]\right\}^n.$$

To bound $E[N(j+1)|b(j)]$ we condition on $W$ (recall the algorithm described in Section IV-C). For the purposes of illustration we present here only the case $W = 0$; the procedure for the remaining scenarios is analogous. If $W = 0$ then the sample location at time $j+1$ is $X_{j+1} = \Delta k_{j+1}$. Next we evaluate $N(j+1)$ in three different cases: (i) $k_{j+1} < k(\theta^*)$; (ii) $k_{j+1} > k(\theta^*)$; (iii) $k_{j+1} = k(\theta^*)$. For case (i) we have

$$N(j+1) \leq 1 - 2\alpha\left(\frac{1+2p}{1+2\alpha} - \frac{1-2p}{1-2\alpha}\right)\frac{\sum_{i=1}^{k_{j+1}} b_i(j)}{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)}.$$

For case (ii) we have

$$N(j+1) \leq 1 - 2\alpha\left(\frac{1+2p}{1+2\alpha} - \frac{1-2p}{1-2\alpha}\right)\frac{\sum_{i=k_{j+1}+1}^{m} b_i(j)}{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)}.$$

Finally, for case (iii) we have

$$N(j+1) = \frac{b_{k(\theta^*)}(j) + b_{k(\theta^*)+1}(j)}{1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)} \left[\frac{(1/2+\alpha)\bigl(1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)\bigr) - 2\alpha\sum_{i=1}^{k(\theta^*)-1} b_i(j)}{(1/2+\alpha)\,b_{k(\theta^*)}(j) + (1/2-\alpha)\,b_{k(\theta^*)+1}(j)}\,(1/2+q) + \frac{(1/2-\alpha)\bigl(1 - b_{k(\theta^*)}(j) - b_{k(\theta^*)+1}(j)\bigr) + 2\alpha\sum_{i=1}^{k(\theta^*)-1} b_i(j)}{(1/2-\alpha)\,b_{k(\theta^*)}(j) + (1/2+\alpha)\,b_{k(\theta^*)+1}(j)}\,(1/2-q)\right]. \quad (11)$$

To compute $E[N(j+1)\,|\,b(j), W = 0]$ we first need to evaluate $E[\sum_{i=1}^{k_{j+1}} b_i(j)\,|\,k_{j+1} < k(\theta^*)]$. It is not hard to show that

$$E\left[\sum_{i=1}^{k_{j+1}} b_i(j) \,\Big|\, k_{j+1} < k(\theta^*)\right] = \frac{\Bigl(\sum_{i=1}^{k(\theta^*)} b_i(j)\Bigr)^2 - \sum_{i=1}^{k(\theta^*)} b_i(j)^2}{2\sum_{i=1}^{k(\theta^*)} b_i(j)}.$$

Similarly we have

$$E\left[\sum_{i=k_{j+1}+1}^{m} b_i(j) \,\Big|\, k_{j+1} > k(\theta^*)\right] = \frac{\Bigl(\sum_{i=k(\theta^*)+1}^{m} b_i(j)\Bigr)^2 - \sum_{i=k(\theta^*)+1}^{m} b_i(j)^2}{2\sum_{i=k(\theta^*)+1}^{m} b_i(j)}.$$

These results can be used to evaluate $E[N(j+1)\,|\,b(j), W = 0]$; we do not display the resulting expression since it is cumbersome.
Proceeding in a similar fashion for all the remaining possibilities of $W$ yields a bound for $E[N(j+1)|b(j)]$ that combines terms of the form above for each value of $W$; the full expression is lengthy and we do not reproduce it here (see the derivations in [6]). Although this expression looks terribly complicated, it is possible to explicitly find the posterior $b(j)$ maximizing it. This posterior only has mass in the vicinity of the true transition point $\theta^*$; the intuitive reason is that collecting samples at $k(\theta^*)\Delta$ (the sampling point with a “bad” noise margin) yields the worst possible scenario. This can be shown formally using Lagrange multipliers. We do not include that derivation here, but it is available in a technical report [6]. For illustration purposes suppose that $q > 0$. Then $E[N(j+1)|b(j)]$ is largest when the posterior is of the form $b_{k(\theta^*)-1} = \delta - \xi/2$, $b_{k(\theta^*)+2} = 1 - \delta - \xi/2$, and $b_{k(\theta^*)+1} = \xi$, with all the other entries of $b(j)$ equal to zero. The largest value is attained taking $\xi \to 0$ and choosing $\delta$ to maximize the bound. The resulting bound can be computed explicitly, but the expressions are cumbersome and do not provide much insight. Instead we provide a simpler characterization of the bounds: for $\kappa \geq 2$ and $\alpha = 0.09\,p$ we obtain

$$E[N(j+1)\,|\,b(j)] \leq 1 - \frac{p^2}{50},$$

which, together with (10) and $M(0) = \frac{1-2\Delta}{2\Delta}$, establishes (8).

REFERENCES

[1] N. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Available at http://hunch.net/~jl/projects/agnostic active/active.pdf.
[2] G. Blanchard and D. Geman. Hierarchical testing designs for pattern recognition. To appear in The Annals of Statistics, 2005.
[3] M. V. Burnashev and K. Sh. Zigangirov. An interval estimation problem for controlled observations. Problems in Information Transmission, 10:223–231, 1974. (Translated from Problemy Peredachi Informatsii, 10(3):51–61, July–September 1974. Original article submitted June 25, 1973.)
[4] R. Castro and R. Nowak. Foundations and Applications of Sensor Management, chapter Active Learning and Sampling. Springer-Verlag, 2006. Available at http://homepages.cae.wisc.edu/~rcastro/active sensing chapter.pdf.
[5] R. Castro, R. Willett, and R. Nowak. Faster rates in regression via active learning.
In Proceedings of Neural Information Processing Systems (NIPS), 2005. A longer technical report is available at http://homepages.cae.wisc.edu/~rcastro/ECE-05-3.pdf.
[6] R. M. Castro and R. D. Nowak. Error bounds for active learning. Technical report, University of Wisconsin - Madison, ECE Dept., 2006. (Available at http://homepages.cae.wisc.edu/~rcastro/bounds active.pdf.)
[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In The Sixteenth Annual Conference on Learning Theory, LNAI 2777. Springer, 2003.
[8] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, pages 129–145, 1996.
[9] S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems (NIPS), 2004.
[10] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems (NIPS), 2005.
[11] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Eighteenth Annual Conference on Learning Theory (COLT), 2005.
[12] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, August 1997.
[13] G. Golubev and B. Levit. Sequential recovery of analytic periodic edges in the binary image models. Mathematical Methods of Statistics, 12:95–115, 2003.
[14] P. Hall and I. Molchanov. Sequential methods for design-adaptive estimation of discontinuities in regression curves and surfaces. The Annals of Statistics, 31(3):921–941, 2003.
[15] M. Horstein. Sequential decoding using noiseless feedback. IEEE Transactions on Information Theory, 9(3):136–143, 1963.
[16] M. Kääriäinen. Active learning in the non-realizable case. Personal communication, 2006.
[17] D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4:698–714, 1991.
[18] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[19] A. B. Tsybakov. Introduction à l'estimation non-paramétrique. Mathématiques et Applications, 41. Springer, 2004.