Active Learning of Binary Classifiers
Presenters: Nina Balcan and
Steve Hanneke
Outline
• What is Active Learning?
• Active Learning of Linear Separators
• General Theories of Active Learning
• Open Problems
Supervised Passive Learning
[Diagram: the Data Source provides unlabeled examples; the Expert / Oracle labels them; the Learning Algorithm receives the labeled examples and outputs a classifier.]
Incorporating Unlabeled Data in the Learning
process
• In many settings, unlabeled data is cheap & easy to
obtain, while labeled data is much more expensive.
• Web page, document classification
• OCR, Image classification
Semi-Supervised Passive Learning
[Diagram: the Data Source provides unlabeled examples; some are sent to the Expert / Oracle for labels; the Learning Algorithm receives both the unlabeled examples and the labeled examples and outputs a classifier.]
Semi-Supervised Passive Learning
• Several methods have been developed to try to use
unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims ’98]
• Co-training [Blum & Mitchell ’98], [BBY ’04]
• Graph-based methods [Blum & Chawla ’01], [ZGL ’03]
[Figure: the SVM max-margin separator using the labeled data only vs. the Transductive SVM separator that also uses the unlabeled data.]
Workshops [ICML ’03, ICML ’05]
Books: Semi-Supervised Learning, MIT Press 2006, O. Chapelle, B. Schölkopf and A. Zien (eds.)
Theoretical models: [Balcan-Blum ’05]
Active Learning
[Diagram: the Data Source provides unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends a request for the label of an example to the Expert / Oracle and receives a label for that example; finally, the algorithm outputs a classifier.]
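To make the protocol concrete, here is a minimal Python sketch of this query loop; the LabelingOracle class, the pool-based setup, and the choose_query / fit callables are illustrative assumptions, not part of the original tutorial.

    # Illustrative sketch of the pool-based active learning protocol.
    class LabelingOracle:
        """Stands in for the expert: returns the label of a requested example."""
        def __init__(self, target_concept):
            self.target_concept = target_concept   # the target c*, unknown to the learner
        def label(self, x):
            return self.target_concept(x)

    def active_learning_loop(unlabeled_pool, oracle, choose_query, fit, budget):
        """Repeatedly pick an informative example, request its label, and refit."""
        labeled = []
        pool = list(unlabeled_pool)
        for _ in range(budget):
            x = choose_query(pool, labeled)        # learner decides what to ask about
            pool.remove(x)
            labeled.append((x, oracle.label(x)))   # one label request per round
        return fit(labeled)                        # output the final classifier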
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier
for most learning problems.
• Doesn’t make too many label requests.
Choose the label requests carefully, to get
informative labels.
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active
learning than for passive.
• This is predicted by theory and has been
observed in practice.
Active Learning in Practice
• Active SVM (Tong & Koller, ICML 2000) seems to be
quite useful in practice.
At any time during the algorithm, we have a “current guess” of
the separator: the max-margin separator of all labeled
points so far.
E.g., strategy 1: request the label of the example closest to
the current separator (see the sketch below).
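A minimal sketch of strategy 1 for a linear separator w (the max-margin separator of the labels so far), assuming points are represented as numpy arrays; the function name is illustrative, not Tong & Koller's code.

    import numpy as np

    def closest_to_separator(unlabeled, w):
        """Strategy 1: among the unlabeled points, pick the one closest to the
        current separator w (smallest |w . x| / ||w||), i.e. the most uncertain one."""
        X = np.asarray(unlabeled)
        distances = np.abs(X @ w) / np.linalg.norm(w)
        return int(np.argmin(distances))   # index of the example to query next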
When Does it Work? And Why?
• The algorithms currently used in practice are not
well understood theoretically.
• We don’t know if/when they output a good
classifier, nor can we say how many labels they
will need.
• So we seek algorithms that we can understand
and state formal guarantees for.
The rest of this talk surveys recent theoretical results.
Standard Supervised Learning Setting
• S = {(x, ℓ)} - set of labeled examples
- drawn i.i.d. from some distribution D over X and labeled by some target concept c* ∈ C
• Want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
- err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]
Sample Complexity, Finite Hyp. Space, Realizable case
Sample Complexity: Uniform Convergence Bounds
• Infinite Hypothesis Case
E.g., if C is the class of linear separators in R^d, then we
need roughly O(d/ε) examples to achieve generalization
error ε.
Non-realizable case - replace ε with ε².
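For reference, these statements correspond to the standard passive-learning sample complexity bounds, written out below in LaTeX; constants and log factors are simplified, this is a summary of well-known results rather than a formula taken from the slides, d is the VC dimension of C and δ the failure probability.

    % realizable case, finite hypothesis class C
    m(\epsilon,\delta) = O\!\Big(\tfrac{1}{\epsilon}\big(\ln|C| + \ln\tfrac{1}{\delta}\big)\Big)
    % realizable case, VC dimension d (e.g., linear separators in R^d)
    m(\epsilon,\delta) = O\!\Big(\tfrac{1}{\epsilon}\big(d\,\ln\tfrac{1}{\epsilon} + \ln\tfrac{1}{\delta}\big)\Big)
    % non-realizable (agnostic) case: replace \epsilon by \epsilon^2
    m(\epsilon,\delta) = O\!\Big(\tfrac{1}{\epsilon^{2}}\big(d + \ln\tfrac{1}{\delta}\big)\Big)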
Active Learning
• We get to see unlabeled data first, and there is a
charge for every label.
• The learner has the ability to choose specific examples
to be labeled:
- The learner works harder, in order to use fewer
labeled examples.
• How many labels can we save by querying adaptively?
Can adaptive querying help?
[CAL92, Dasgupta04]
• Consider threshold functions on the real line:
h_w(x) = 1(x ≥ w),  C = {h_w : w ∈ R}
[Figure: points to the left of the threshold w are labeled -, points to the right are labeled +.]
• Sample with 1/ε unlabeled examples.
• Binary search - need just O(log 1/ε) labels.
Active setting: O(log 1/ε) labels to find an ε-accurate threshold.
Supervised learning needs O(1/ε) labels.
Exponential improvement in sample complexity
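A minimal sketch of this binary-search learner, assuming realizable labels given by some threshold h_w; the function names are illustrative.

    def learn_threshold(unlabeled_points, oracle_label):
        """Binary search for the target threshold among O(1/eps) sampled points,
        using only O(log 1/eps) label requests."""
        xs = sorted(unlabeled_points)
        lo, hi = 0, len(xs)                    # search for the leftmost + point
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle_label(xs[mid]) == 1:     # + side: threshold is at or left of xs[mid]
                hi = mid
            else:                              # - side: threshold is right of xs[mid]
                lo = mid + 1
        if lo == len(xs):                      # every sampled point is negative
            return xs[-1] + 1.0
        return xs[lo]                          # an eps-accurate threshold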
Active Learning might not help [Dasgupta04]
In general, number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample complexity substantially.
• C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved - no matter how benign the input distribution.
[Figure: example linear separators h_0, h_1, h_2, h_3 in R^2.]
In this case: learning to accuracy ε requires 1/ε labels…
Examples where Active Learning helps
In general, number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample
complexity substantially no matter what the input
distribution is.
• C - homogeneous linear separators in R^d, D - uniform
distribution over the unit sphere:
• need only d log(1/ε) labels to find a hypothesis with
error rate < ε.
• Dasgupta, Kalai, Monteleoni, COLT 2005
• Freund et al., ’97.
• Balcan-Broder-Zhang, COLT 07
Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.
• “Region of uncertainty” = part of data space about which
there is still some uncertainty (i.e. disagreement within version
space)
• Example: data lies on circle in R2 and hypotheses are
homogeneous linear separators.
[Figure: the current version space and the corresponding region of uncertainty in data space.]
Region of uncertainty [CAL92]
[Figure: the current version space and the region of uncertainty.]
Algorithm:
Pick a few points at random from the current
region of uncertainty and query their labels.
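A minimal sketch of this strategy for a finite pool of candidate hypotheses; maintaining the version space explicitly is an illustrative simplification (real disagreement-based algorithms avoid enumerating it).

    import random

    def region_of_uncertainty_queries(version_space, unlabeled_pool, oracle_label, num_queries):
        """Query a few random points on which the current version space still disagrees,
        then discard every hypothesis inconsistent with the returned labels."""
        for _ in range(num_queries):
            uncertain = [x for x in unlabeled_pool
                         if len({h(x) for h in version_space}) > 1]   # region of uncertainty
            if not uncertain:                  # all remaining hypotheses agree everywhere
                break
            x = random.choice(uncertain)
            y = oracle_label(x)                # one label request
            version_space = [h for h in version_space if h(x) == y]
        return version_space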
Region of uncertainty [CAL92]
[Figure: the new version space after incorporating the queried labels, and the new region of uncertainty in data space.]
Region of uncertainty [CAL92], Guarantees
Algorithm: Pick a few points at random from the current region
of uncertainty and query their labels.
[Balcan, Beygelzimer, Langford, ICML’06]
Analyze a version of this alg. which is robust to noise.
• C - linear separators on the line: low noise, exponential
improvement.
• C - homogeneous linear separators in R^d, D - uniform
distribution over the unit sphere:
• low noise: need only d² log(1/ε) labels to find a
hypothesis with error rate < ε.
• realizable case: d^{3/2} log(1/ε) labels.
• supervised: d/ε labels.
Margin Based Active-Learning Algorithm
[Balcan-Broder-Zhang, COLT 07]
Use O(d) examples to find w_1 of error ≤ 1/8.
iterate k = 2, …, log(1/ε)
• rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
• label them;
• find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate
[Figure: the target w*, the hypotheses w_k and w_{k+1}, and the band of width γ_k.]
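A minimal Python sketch of this margin-based procedure, assuming data drawn uniformly from the unit sphere in R^d and a realizable target; the warm start, the band width gamma_k, the per-round sample size, and the fit_in_ball helper (a constrained consistent-separator solver) are illustrative stand-ins for the quantities worked out in BBZ '07.

    import numpy as np

    def margin_based_active_learner(dim, sample_from_D, oracle_label, fit_in_ball,
                                    eps, samples_per_round=200):
        """Query only points that fall inside a shrinking band around the current separator."""
        # warm start: a passive round giving w_1 of small constant error (placeholder)
        X0 = np.array([sample_from_D() for _ in range(10 * dim)])
        y0 = np.array([oracle_label(x) for x in X0])
        w = fit_in_ball(X0, y0, center=None, radius=None)     # unconstrained consistent fit

        num_rounds = int(np.ceil(np.log2(1.0 / eps)))
        for k in range(2, num_rounds + 1):
            gamma_k = 2.0 ** (-k) / np.sqrt(dim)              # band width shrinks each round
            band = []
            while len(band) < samples_per_round:              # rejection sample inside the band
                x = sample_from_D()
                if abs(np.dot(w, x)) <= gamma_k:
                    band.append(x)
            Xk = np.array(band)
            yk = np.array([oracle_label(x) for x in Xk])      # the only label requests this round
            # find a separator within distance 2^-k of w consistent with the band labels
            w = fit_in_ball(Xk, yk, center=w, radius=2.0 ** (-k))
        return w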
BBZ’07, Proof Idea
iterate k = 2, …, log(1/ε)
Rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
ask for labels and find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate
Assume w_k has error ≤ α. We are done if ∃ γ_k s.t. w_{k+1} has error ≤ α/2 and we only need O(d log(1/ε)) labels in round k.
[Figure: w*, w_k, w_{k+1}, and the band of width γ_k.]
BBZ’07, Proof Idea
Key Point
Under the uniform distribution assumption, for γ_k on the order of 2^{-k}/√d, the error of w_{k+1} on the region outside the band {x : |w_k · x| ≤ γ_k} is ≤ α/4.
[Figure: w*, w_k, w_{k+1}, and the band of width γ_k.]
Key Point
So, it's enough to ensure that the error of w_{k+1} inside the band is ≤ α/4.
We can do so by using only O(d log(1/ε)) labels in round k.
General Theories of
Active Learning
General Concept Spaces
• In the general learning problem, there
is a concept space C, and we want to
find an ε-optimal classifier h ∈ C with
high probability 1 − δ.
How Many Labels Do We Need?
• In passive learning, we know of an algorithm
(empirical risk minimization) that needs only
roughly d/ε labels (for realizable learning), and
roughly d/ε² labels if there is noise.
• We also know this is close to the best we can expect
from any passive algorithm.
How Many Labels Do We Need?
• As before, we want to explore the
analogous idea for Active Learning (but
now for a general concept space C).
• How many label requests are necessary
and sufficient for Active Learning?
• What are the relevant complexity
measures? (i.e., the Active Learning
analogue of VC dimension)
What ARE the Interesting
Quantities?
• Generally speaking, we want examples
whose labels are highly controversial
among the set of remaining concepts.
• The likelihood of drawing such an
informative example is an important
quantity to consider.
• But there are many ways to define
“informative” in general.
What Do You Mean By
“Informative”?
• Want examples that reduce the version space.
• But how do we measure progress?
• A problem-specific measure P on C?
• The Diameter?
• Measure of the region of disagreement?
• Cover size? (see e.g., Hanneke, COLT 2007)
All of these seem to have interesting theories associated with
them. As an example, let’s take a look at Diameter in detail.
Diameter (Dasgupta, NIPS 2005)
• Define distance d(g, h) = Pr(g(X) ≠ h(X)).
• One way to guarantee our classifier is within ε of the target classifier is to (safely) reduce the diameter to size ε.
Imagine each pair of concepts separated by distance > ε has an edge between them. We have to rule out at least one of the two concepts for each edge. Each unlabeled example X partitions the concepts into two sets, and guarantees that some fraction of the edges will have at least one concept contradicted, no matter which label it has.
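A small illustrative sketch of the edge-elimination idea for a finite set of candidate concepts; estimating d(g, h) by the empirical disagreement rate on an unlabeled pool is an assumption of the sketch, not part of Dasgupta's analysis.

    import itertools

    def empirical_distance(g, h, pool):
        """Estimate d(g, h) = Pr(g(X) != h(X)) by the disagreement rate on the pool."""
        return sum(g(x) != h(x) for x in pool) / len(pool)

    def best_splitting_example(concepts, pool, eps):
        """Find the unlabeled point guaranteed to eliminate the largest fraction of
        edges (pairs of concepts more than eps apart), whichever label it receives."""
        edges = [(g, h) for g, h in itertools.combinations(concepts, 2)
                 if empirical_distance(g, h, pool) > eps]
        best_x, best_frac = None, 0.0
        for x in pool:
            # if g and h disagree on x, one of them is contradicted no matter the label
            killed = sum(1 for g, h in edges if g(x) != h(x))
            frac = killed / max(1, len(edges))
            if frac > best_frac:
                best_x, best_frac = x, frac
        return best_x, best_frac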
Diameter
•Theorem: (Dasgupta, NIPS 2005)
If, for any finite subset V ⊆ C,
Pr_X(X eliminates a ρ fraction of the edges) is large enough,
then (assuming no noise) we can reduce the
diameter to ε using a number of label requests at
most
Furthermore, there is an algorithm that does this,
which with high probability requires a number of
unlabeled examples at most
Open Problems in Active Learning
• Efficient (correct) learning algorithms
for linear separators provably achieving
significant improvements on many
distributions.
• What about binary feature spaces?
• Tight general-purpose sample
complexity bounds, for both the
realizable and the agnostic case.
• An optimal active learning algorithm?