Probably Approximately Correct Learning (PAC)
Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11):1134-1142, 1984.
Recall: Bayesian learning
• Create a model based on some
parameters
• Assume some prior distribution on
those parameters
• Learning problem
– Adjust the model parameters so as to
maximize the likelihood of the model given
the data
– Use Bayes' rule for that.
PAC Learning
• Given distribution D of observables X
• Given a family of functions (concepts) F
• For each x ∈ X and f ∈ F:
f(x) provides the label for x
• Given a family of hypotheses H, seek a
hypothesis h such that
Error(h) = Pr_{x~D}[f(x) ≠ h(x)]
is minimal (a sampling sketch follows below)
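To make the error concrete, here is a minimal Python sketch that estimates Error(h) by sampling from D (the names estimate_error, sample_D, f, and h are illustrative, not from the slides):

import random

def estimate_error(h, f, sample_D, n_samples=10000):
    # Monte Carlo estimate of Error(h) = Pr_{x ~ D}[f(x) != h(x)]
    mismatches = 0
    for _ in range(n_samples):
        x = sample_D()
        if f(x) != h(x):
            mismatches += 1
    return mismatches / n_samples

# Toy example: D uniform on [0,1], target f = threshold at 0.5, hypothesis at 0.6
sample_D = lambda: random.random()
f = lambda x: x >= 0.5
h = lambda x: x >= 0.6
print(estimate_error(h, f, sample_D))   # close to D([0.5, 0.6)) = 0.1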
PAC New Concepts
• Large family of distributions D
• Large family of concepts F
• Family of hypotheses H
• Main questions:
– Is there a hypothesis h ∈ H that can be learned?
– How fast can it be learned?
– What is the error that can be expected?
Estimation vs. approximation
• Note:
– The distribution D is fixed
– There is no noise in the system (currently)
– F is a space of binary functions (concepts)
• This is thus an approximation problem,
as the function is given exactly
for each x ∈ X
• Estimation problem: f is not given
exactly but estimated from noisy data
Example (PAC)
• Concept: Average body-size person
• Inputs: for each person:
– height
– weight
• Sample: labeled examples of persons
– label + : average body-size
– label - : not average body-size
• Two dimensional inputs
[Figure: observable space X with concept f]
Example (PAC)
• Assumption: target concept is a
rectangle.
• Goal:
– Find a rectangle h that “approximates” the
target
– Hypothesis family H of rectangles
• Formally:
– With high probability
– output a rectangle such that
– its error is low.
Example (Modeling)
• Assume:
– Fixed distribution over persons.
• Goal:
– Low error with respect to THIS
distribution!!!
• What does the distribution look like?
– Highly complex.
– Each parameter is not uniform.
– Highly correlated.
Model Based approach
• First try to model the distribution.
• Given a model of the distribution:
– find an optimal decision rule.
• Bayesian Learning
PAC approach
• Assume that the distribution is fixed.
• Samples are drawn i.i.d.:
– independent
– Identically distributed
• Concentrate on the decision rule rather
than distribution.
PAC Learning
• Task: learn a rectangle from examples.
• Input: points (x, f(x)), with classification + or – given by a rectangle R
• Goal:
– Using the fewest examples
– compute h
– h is a good approximation for f
PAC Learning: Accuracy
• Testing the accuracy of a hypothesis:
– using the distribution D of examples.
• Error region = h Δ f (symmetric difference)
• Pr[Error] = D(Error) = D(h Δ f)
• We would like Pr[Error] to be
controllable.
• Given a parameter ε:
– Find h such that Pr[Error] < ε.
PAC Learning: Hypothesis
• Which Rectangle should we choose?
– Similar to parametric modeling?
Setting up the Analysis:
• Choose smallest rectangle.
• Need to show:
– For any distribution D and target rectangle f
– input parameters: ε and δ
– Select m(ε,δ) examples
– Let h be the smallest consistent rectangle.
– Then with probability 1-δ (over the m samples):
D(f Δ h) < ε
More general case (no rectangle)
• A distribution: D (unknown)
• Target function: ct from C
– ct : X → {0,1}
• Hypothesis: h from H
– h : X → {0,1}
• Error probability:
– error(h) = Pr_D[h(x) ≠ ct(x)]
• Oracle: EX(ct,D) (see the sketch below)
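A minimal sketch of the example oracle EX(ct, D), assuming D is given as a sampling routine (make_EX and sample_D are illustrative names):

import random

def make_EX(ct, sample_D):
    # The example oracle EX(ct, D): each call draws x ~ D and returns (x, ct(x)).
    def EX():
        x = sample_D()
        return x, ct(x)
    return EX

# Toy example: X = {0,1}^3 under the uniform distribution, ct(x) = x1 OR x3
sample_D = lambda: tuple(random.randint(0, 1) for _ in range(3))
ct = lambda x: int(x[0] or x[2])
EX = make_EX(ct, sample_D)
print([EX() for _ in range(3)])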
PAC Learning: Definition
• C and H are concept classes over X.
• C is PAC learnable by H if
• There exists an algorithm A such that:
– for any distribution D over X and ct in C,
– for every input ε and δ,
– while having access to EX(ct,D),
– A outputs a hypothesis h in H
– such that with probability 1-δ, error(h) < ε
• Complexity: sample size and running time.
Finite Concept class
• Assume C = H and finite.
• h is ε-bad if error(h) > ε.
• Algorithm (sketched below):
– Sample a set S of m(ε,δ) examples.
– Find an h in H which is consistent with S.
• The algorithm fails if h is ε-bad.
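A sketch of this generic consistent-learner, assuming H is given as an explicit list of functions and EX is the example oracle from the earlier slide (consistent_learner and sample_size are illustrative names; the bound used is derived on the next slides):

import math

def sample_size(eps, delta, H_size):
    # m >= (1/eps)(ln(1/delta) + ln|H|); see the complexity analysis that follows
    return math.ceil((math.log(1 / delta) + math.log(H_size)) / eps)

def consistent_learner(H, EX, eps, delta):
    # Draw m(eps, delta) examples and return any h in H consistent with all of them.
    m = sample_size(eps, delta, len(H))
    S = [EX() for _ in range(m)]
    for h in H:
        if all(h(x) == b for x, b in S):
            return h
    return None  # cannot happen in the feasible case (the target concept is in H)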
PAC learning: formalization (1)
• X is the set of all possible examples
• D is the distribution from which the
examples are drawn
• H is the set of all possible hypotheses,
with c ∈ H
• m is the number of training examples
• error(h) =
Pr(h(x) ≠ c(x) | x is drawn from X with D)
• h is approximately correct if error(h) ≤ ε
PAC learning: formalization (2)
[Figure: hypothesis space H with the ε-ball around c; Hbad is the set of hypotheses outside the ball]
To show: after m examples, with high
probability, all consistent hypotheses are
approximately correct.
All consistent hypotheses lie in an ε-ball
around c.
Complexity analysis (1)
• The probability that a hypothesis hbad ∈ Hbad
is consistent with the first m examples:
error(hbad) > ε by definition.
The probability that it agrees with one
example is thus at most (1 - ε), and with m
examples at most (1 - ε)^m
Complexity analysis (2)
• For Hbad to contain a consistent
hypothesis, at least one hypothesis in it
must be consistent with all m examples.
Pr(Hbad has a consistent hypothesis)
≤ |Hbad|(1 - ε)^m ≤ |H|(1 - ε)^m
Complexity analysis (3)
• To reduce the probability of error below δ:
|H|(1 - ε)^m ≤ δ
• This holds when at least
m ≥ (1/ε)(ln(1/δ) + ln|H|)
examples are seen (a numeric illustration follows below).
• This is the sample complexity.
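A small numeric illustration of the bound m ≥ (1/ε)(ln(1/δ) + ln|H|) (pac_sample_bound is an illustrative name):

import math

def pac_sample_bound(eps, delta, H_size):
    # m >= (1/eps)(ln(1/delta) + ln|H|) examples suffice for a consistent learner
    return math.ceil((math.log(1 / delta) + math.log(H_size)) / eps)

# Linear growth in 1/eps, logarithmic growth in 1/delta and in |H|
for eps in (0.1, 0.01):
    for delta in (0.1, 0.01):
        print(eps, delta, pac_sample_bound(eps, delta, H_size=2**20))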
Complexity analysis (4)
• “at least m examples are needed to guarantee that a
consistent hypothesis h has error at most
ε with probability 1-δ”
• For arbitrary boolean concepts, |H| = 2^(2^n), so the sample
complexity grows exponentially with the number of attributes n
• Conclusion: learning any boolean function is
no better in the worst case than table lookup!
PAC learning -- observations
• “Hypothesis h is consistent with m
examples and has an error of at most ε
with probability 1-δ”
• This is a worst-case analysis. Note that
the result is independent of the
distribution D!
• Growth rate analysis:
– as ε → 0, m grows proportionally to 1/ε
– as δ → 0, m grows logarithmically in 1/δ
– as |H| → ∞, m grows logarithmically in |H|
PAC: comments
• We only assumed that examples are
i.i.d.
• We have two independent parameters:
– Accuracy ε
– Confidence δ
• No assumption about the likelihood of a
hypothesis.
• Hypothesis is tested on the same
distribution as the sample.
PAC: non-feasible case
• What happens if ct is not in H?
• We need to redefine the goal.
• Let h* in H minimize the error:
b = error(h*)
• Goal: find h in H such that
error(h) ≤ error(h*) + ε = b + ε
Analysis*
• For each h in H:
– let obs-error(h) be the average error on
the sample S.
• Bound the probability of a large deviation:
Pr{|obs-error(h) - error(h)| ≥ ε/2}
Chernoff bound: Pr < exp(-(ε/2)²m)
• Union bound over the entire H:
Pr < |H| exp(-(ε/2)²m)
• Sample size: m > (4/ε²) ln(|H|/δ)
Correctness
• Assume that for all h in H:
– |obs-error(h) - error(h)| < ε/2
• In particular:
– obs-error(h*) < error(h*) + ε/2
– error(h) - ε/2 < obs-error(h)
• For the output h (the observed-error minimizer):
– obs-error(h) ≤ obs-error(h*)
• Conclusion: error(h) < error(h*) + ε
Sample size issue
• Due to the use of the Chernoff bound on
Pr{|obs-error(h) - error(h)| ≥ ε/2}:
Chernoff bound: Pr < exp(-(ε/2)²m)
and over the entire H:
Pr < |H| exp(-(ε/2)²m)
• It follows that the sample size is
m > (4/ε²) ln(|H|/δ),
not (1/ε) ln(|H|/δ) as before (compared numerically below).
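A small numeric comparison of the two bounds (function names are illustrative):

import math

def m_feasible(eps, delta, H_size):
    # realizable (consistent-hypothesis) case: m > (1/eps) ln(|H|/delta)
    return math.ceil(math.log(H_size / delta) / eps)

def m_non_feasible(eps, delta, H_size):
    # non-feasible case via the Chernoff bound: m > (4/eps^2) ln(|H|/delta)
    return math.ceil(4 * math.log(H_size / delta) / eps ** 2)

print(m_feasible(0.05, 0.01, 2 ** 20), m_non_feasible(0.05, 0.01, 2 ** 20))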
Example: Learning OR of literals
• Inputs: x1, …, xn
• Literals: xi and ¬xi
• OR functions, e.g. x1 ∨ x4 ∨ x7
• For each variable, the target disjunction
may contain xi, ¬xi, or neither; thus the
number of disjunctions is 3^n
ELIM: Algorithm for learning OR
• Keep a list of all literals
• For every example whose classification is 0:
– Erase all the literals that evaluate to 1 on it.
• Example: c(00110) = 0 results in deleting ¬x1, ¬x2, x3, x4, ¬x5
• Correctness:
– Our hypothesis h: the OR of the remaining literals.
– The remaining literals include the target OR's literals.
– Every time h predicts zero, we are correct.
• Sample size: m > (1/ε) ln(3^n/δ)
• An implementation sketch of ELIM follows below.
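A sketch of ELIM under the representation assumed here (0/1 tuples, literals encoded as (index, sign) pairs; names are illustrative):

def elim(samples, n):
    # ELIM: learn an OR of literals. samples = list of (x, b), x a 0/1 tuple of length n.
    # A literal is (i, True) for x_i and (i, False) for not-x_i; it is satisfied
    # on x exactly when bool(x[i]) == v.
    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, b in samples:
        if b == 0:
            # delete every literal that evaluates to 1 on a negative example
            literals = {(i, v) for (i, v) in literals if bool(x[i]) != v}
    def h(x):
        return int(any(bool(x[i]) == v for (i, v) in literals))
    return h, literals

# Toy run: target c = x1 OR x3 on 3 variables (1-indexed in the slides, 0-indexed here)
samples = [((0, 1, 0), 0), ((0, 0, 1), 1), ((1, 0, 0), 1), ((0, 0, 0), 0)]
h, lits = elim(samples, 3)
print(sorted(lits), h((0, 1, 1)))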
Learning parity
• Functions: parities such as x1 ⊕ x7 ⊕ x9
• Number of functions: 2^n
• Algorithm (sketched below):
– Sample a set of examples
– Solve the resulting linear equations over GF(2)
• Sample size: m > (1/ε) ln(2^n/δ)
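A sketch of the parity learner via Gaussian elimination over GF(2), assuming examples are 0/1 tuples (learn_parity is an illustrative name):

def learn_parity(samples, n):
    # Learn a parity (XOR of a subset of variables) by Gaussian elimination over GF(2).
    # samples: list of (x, b) with x a 0/1 tuple of length n and b in {0,1}.
    # Returns a 0/1 coefficient vector a with sum(a_i * x_i) mod 2 == b on the sample,
    # or None if no parity is consistent with the sample.
    rows = [list(x) + [b] for x, b in samples]      # augmented matrix over GF(2)
    pivots = []
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] == 1:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        pivots.append((r, col))
        r += 1
    if any(all(v == 0 for v in row[:n]) and row[n] == 1 for row in rows):
        return None                                  # inconsistent sample
    a = [0] * n
    for r_i, col in pivots:                          # free variables are set to 0
        a[col] = rows[r_i][n]
    return a

# Toy run: target parity x1 XOR x3 on 4 variables (0-indexed: x0 XOR x2)
samples = [((1, 0, 1, 0), 0), ((1, 1, 0, 0), 1), ((0, 0, 1, 1), 1), ((0, 1, 0, 1), 0)]
print(learn_parity(samples, 4))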
Infinite Concept class
• X = [0,1] and H = {cq | q in [0,1]}
• cq(x) = 0 iff x < q
• Assume C = H.
[Figure: sampled points on [0,1] with the uncertainty interval [min,max] around q]
• Which cq should we choose in [min,max]?
Proof I
• Define
max = min{xi | c(xi)=1}, min = max{xi | c(xi)=0} (over the sample)
• Show that:
– Pr[ D([min,max]) > ε ] < δ
• Proof attempt: by contradiction.
– Suppose the probability that x is in [min,max] is at least ε.
– Then the probability that we do not sample from [min,max]
is (1 - ε)^m
– This needs m > (1/ε) ln(1/δ)
• There is something wrong: the interval [min,max] is itself defined by the sample.
Proof II (correct):
• Let max’ be such that D([q,max’]) = ε/2
• Let min’ be such that D([min’,q]) = ε/2
• Goal: show that with high probability
– some positive example X+ falls in [q,max’] and
– some negative example X- falls in [min’,q]
• In such a case any threshold in [X-,X+] is
good (its error region is contained in [min’,max’]).
• Compute the sample size (see the sketch below)!
Proof II (correct):
• Pr(no sample among x1,x2,…,xm falls in [q,max’]) =
(1 - ε/2)^m < exp(-mε/2)
• Similarly for the other side
• We require 2 exp(-mε/2) < δ
• Thus, m > (2/ε) ln(2/δ)
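A sketch of the resulting learner for thresholds on [0,1], using the sample size m > (2/ε) ln(2/δ) and returning a threshold between the largest negative and smallest positive example (learn_threshold is an illustrative name):

import math, random

def learn_threshold(eps, delta, sample_D, c):
    # Learn c_q(x) = 1 iff x >= q on X = [0,1]: draw m > (2/eps) ln(2/delta) examples
    # and return any threshold between the largest negative and the smallest positive.
    m = math.ceil((2 / eps) * math.log(2 / delta))
    S = [(x, c(x)) for x in (sample_D() for _ in range(m))]
    lo = max((x for x, b in S if b == 0), default=0.0)   # largest negative example
    hi = min((x for x, b in S if b == 1), default=1.0)   # smallest positive example
    return (lo + hi) / 2

q_true = 0.37
c = lambda x: int(x >= q_true)
print(abs(learn_threshold(0.05, 0.05, random.random, c) - q_true))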
Comments
• The hypothesis space was very simple
H={cq | q in [0,1]}
• There was no noise in the data or labels
• So learning was trivial in some sense
(analytic solution)
Non-Feasible case: Label Noise
• Suppose the sampled labels are noisy.
• Algorithm:
– Find the function h with the lowest observed error!
Analysis
• Define: points zi forming an ε/4-net (w.r.t. D)
• For the optimal h* and our h there are
– zj : |error(h[zj]) - error(h*)| < ε/4
– zk : |error(h[zk]) - error(h)| < ε/4
• Show that with high probability, for every zi:
– |obs-error(h[zi]) - error(h[zi])| < ε/4
Exercise (Submission Mar 29, 04)
1. Assume there is Gaussian (0,σ) noise
on xi. Apply the same analysis to
compute the required sample size for
PAC learning.
Note: Class labels are determined by
the non-noisy observations.
General ε-net approach
• Given a class H, define a class G such that:
– for every h in H
– there exists a g in G such that
– D(g Δ h) < ε/4
• Algorithm: Find the best h in H.
• Computing the confidence and sample
size.
Occam Razor
W. Occam (1320) “Entities should not be
multiplied unnecessarily”
A. Einstein “Simplify a problem as much
as possible, but no simpler”
Information theoretic ground?
Occam Razor
Finding the shortest consistent
hypothesis.
• Definition: (α,β)-Occam algorithm
– α > 0 and β < 1
– Input: a sample S of size m
– Output: a hypothesis h
– For every (x,b) in S: h(x) = b (consistency)
– size(h) < size(ct)^α · m^β
• Efficiency (runs in polynomial time).
Occam algorithm and
compression
[Figure: A holds the labeled sample S = {(xi,bi)}; B holds only x1, …, xm; A must transmit the labels to B]
Compression
• Option 1:
– A sends B the values b1 , … , bm
– m bits of information
• Option 2:
– A sends B the hypothesis h
– Occam: for large enough m, size(h) < m
• Option 3 (MDL):
– A sends B a hypothesis h and “corrections”
– complexity: size(h) + size(errors)
Occam Razor Theorem
• A: an (α,β)-Occam algorithm for C using H
• D: a distribution over the inputs X
• ct in C: the target function, n = size(ct)
• Sample size:
m ≥ 2( (1/ε) ln(1/δ) + (2n^α/ε)^{1/(1-β)} )
• With probability 1-δ, A(S) = h has error(h) < ε
Occam Razor Theorem
• Use the bound for finite hypothesis
class.
• Effective hypothesis class size: 2^size(h)
• size(h) < n^α m^β
• Sample size:
m ≥ (1/ε) ln( 2^{n^α m^β} / δ ) = (n^α m^β ln 2)/ε + (1/ε) ln(1/δ)
(the bound is implicit in m; see the sketch below)
• The VC dimension will later replace 2^size(h)
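Because the bound is implicit in m, one can solve it numerically; a sketch by fixed-point iteration (occam_sample_size is an illustrative name, and the constants follow the simplified bound above):

import math

def occam_sample_size(eps, delta, n, alpha, beta, iters=200):
    # Smallest m (approximately) satisfying
    #   m >= (n**alpha * m**beta * ln 2) / eps + (1/eps) * ln(1/delta)
    # found by fixed-point iteration; the bound is implicit in m because
    # size(h) <= n^alpha * m^beta itself depends on m.
    m = 1.0
    for _ in range(iters):
        m_next = (n ** alpha * m ** beta * math.log(2)) / eps + math.log(1 / delta) / eps
        if abs(m_next - m) < 1e-9:
            break
        m = m_next
    return math.ceil(m)

print(occam_sample_size(eps=0.1, delta=0.05, n=20, alpha=1, beta=0.5))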
Exercise (Submission Mar 29, 04)
2. For an (α,β)-Occam algorithm, given
noisy data with noise ~ N(0, σ²), find
the limitations on m.
Hint: use an ε-net and the Chernoff bound.
Learning OR with few
attributes
• Target function: OR of k literals
• Goal: learn in time
– polynomial in k and log n
– with ε and δ constant
• ELIM makes “slow” progress:
– it may disqualify as little as one literal per round
– and may be left with O(n) literals
Set Cover - Definition
• Input: sets S1, …, St with each Si ⊆ U
• Output: Si1, …, Sik with ∪j Sij = U
• Question: are there k sets that cover U?
• The decision problem is NP-complete
Set Cover Greedy algorithm
• j = 0; Uj = U; C = ∅
• While Uj ≠ ∅:
– Let Si be arg max_i |Si ∩ Uj|
– Add Si to C
– Let Uj+1 = Uj - Si
– j = j + 1
• (An implementation sketch follows below.)
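A sketch of the greedy algorithm (greedy_set_cover is an illustrative name; sets are Python sets indexed by position):

def greedy_set_cover(U, sets):
    # Greedy set cover: repeatedly pick the set covering the most uncovered elements.
    # If a cover of size k exists, at most k*ln|U| sets are chosen (see the analysis).
    uncovered = set(U)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover U")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

U = set(range(10))
sets = [set(range(0, 5)), set(range(4, 8)), set(range(7, 10)), {1, 9}]
print(greedy_set_cover(U, sets))   # e.g. [0, 1, 2]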
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C’ of size k.
• C’ is a cover of every Uj.
• Some S in C’ covers at least |Uj|/k elements of Uj.
• Analysis of Uj: |Uj+1| ≤ |Uj| - |Uj|/k
• Solving the recursion: |Uj| ≤ |U|(1 - 1/k)^j ≤ |U| e^{-j/k}
• Number of sets: j ≤ k ln|U|
• Exercise: solve the recursion.
Learning decision lists
• Lists of arbitrary size can represent any boolean
function. Lists whose tests contain at most
k < n literals define the k-DL boolean language.
For n attributes, the language is k-DL(n).
• The language of tests Conj(n,k) allows at most
3^|Conj(n,k)| distinct assignments of outcomes (Yes, No, absent)
• |k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|! (any order of the tests)
• |Conj(n,k)| = Σ_{i=0}^{k} C(2n, i) = O(n^k)
Building an Occam algorithm
• Given a sample S of size m:
– Run ELIM on S
– Let LIT be the set of surviving literals
– There exist k literals in LIT that classify
all of S correctly
• Negative examples:
– any subset of LIT classifies them correctly
Building an Occam algorithm
• Positive examples:
– Search for a small subset of LIT
– which classifies S+ correctly.
– For a literal z build Tz = {x in S+ | z is satisfied on x}
– There are k sets that cover S+
– Greedily find at most k ln m sets that cover S+ (sketch below)
• Output h = the OR of the chosen (at most k ln m) literals
• size(h) ≤ k ln m · log(2n)
• Sample size: m = O(k log n · log(k log n))
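A sketch combining the two steps of this Occam algorithm, ELIM on the negatives followed by greedy cover of the positives (occam_learn_or is an illustrative name; it assumes the sample is consistent with some k-literal OR):

def occam_learn_or(samples, n):
    # 1) ELIM on the negative examples keeps only literals consistent with them;
    # 2) greedy set cover picks few surviving literals covering every positive example.
    # samples: list of (x, b), x a 0/1 tuple of length n; a literal (i, v) means
    # x_i if v is True and not-x_i otherwise.
    def satisfied(lit, x):
        i, v = lit
        return bool(x[i]) == v

    literals = {(i, v) for i in range(n) for v in (True, False)}
    for x, b in samples:
        if b == 0:
            literals = {lit for lit in literals if not satisfied(lit, x)}

    positives = [x for x, b in samples if b == 1]
    # T_z = set of positive examples satisfied by literal z
    T = {lit: {j for j, x in enumerate(positives) if satisfied(lit, x)} for lit in literals}

    uncovered, chosen = set(range(len(positives))), []
    while uncovered:
        best = max(T, key=lambda lit: len(T[lit] & uncovered))
        chosen.append(best)
        uncovered -= T[best]

    h = lambda x: int(any(satisfied(lit, x) for lit in chosen))
    return h, chosen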
Criticism of PAC model
• The worst-case emphasis makes it unusable
– Useful for analysis of computational complexity
– Methods to estimate the cardinality of the space of
concepts (VC dimension). Unfortunately not
sufficiently practical
• The notions of target concepts and noise-free
training are too restrictive
– True; switching to concept approximation is only a weak remedy.
– Some extensions to label noise and fewer to
variable noise
Summary
• PAC model
– Confidence and accuracy
– Sample size
• Finite (and infinite) concept class
• Occam Razor
References
• L. G. Valiant. A theory of the learnable. Comm. ACM 27(11):1134-1142, 1984. (Original work)
• D. Haussler. Probably approximately correct learning. (Review)
• M. Kearns. Efficient noise-tolerant learning from statistical queries. (Review of noise methods)
• F. Denis et al. PAC learning with simple examples. (Simple examples)
Learning algorithms
• OR function
• Parity function
• OR of a few literals
• Open problems:
– OR in the non-feasible case
– Parity of a few literals