Computational Learning Theory
CS542 – Spring 2012
Kevin Small [Significant content from Dan Roth]

Learning Conjunctions
• Start with the set of all literals as candidates
• Eliminate any literal that is not active (set to 1) in a positively labeled instance
  h(x) = x1 ∧ x2 ∧ x3 ∧ ... ∧ x100
  – <(1,1,1,1,1,…,1,1), 1>
  – <(1,1,1,0,1,…,0,0), 0>  learned nothing
  – <(1,1,1,1,1,0,…,0,1,1), 1>
  h(x) = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x99 ∧ x100
  – <(1,0,1,1,1,0,…,0,1), 0>  learned nothing
  – <(0,1,1,1,1,…,0,1), 1>
  h(x) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
  – <(1,0,1,0,…,0,1,1), 0>  learned nothing
• The target is f(x) = x2 ∧ x3 ∧ x5 ∧ x100; our final hypothesis is close to the target.
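A minimal sketch of this elimination algorithm in Python; the function names and the tiny 5-feature toy data set are illustrative assumptions, not part of the original slides.

```python
def learn_conjunction(examples, d):
    """Elimination algorithm: start with all d (positive) literals and drop
    any literal that is 0 in a positively labeled example."""
    active = set(range(d))              # indices of literals still in h
    for x, y in examples:
        if y == 1:                      # negative examples are ignored ("learned nothing")
            active -= {i for i in active if x[i] == 0}
    return active

def predict(active, x):
    """h(x) is the conjunction of the surviving literals."""
    return int(all(x[i] == 1 for i in active))

# Toy run with d=5 features (indices 0..4 standing in for x1..x5)
examples = [
    ((1, 1, 1, 1, 1), 1),
    ((1, 1, 1, 0, 1), 0),   # negative example: no update
    ((0, 1, 1, 1, 1), 1),   # drops x1
]
h = learn_conjunction(examples, d=5)
print(sorted(i + 1 for i in h))         # surviving literals: [2, 3, 4, 5]
```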
Analyzing Performance
• Probabilistic Analysis
  – The distribution never generated x4=0 in a positive instance – maybe it never will
  – This is, at least, a low-probability event, so the resulting hypothesis should be pretty good (?)
  – PAC framework (next week)
• Mistake-Driven Analysis
  – The hypothesis is only updated when mistakes are made
  – If we count the number of mistakes made during learning, we can estimate performance on future data
  – Note: not all on-line algorithms are mistake driven

Perceptron Mistake-bound
Theorem [Novikoff, 1963]: Let (x_1, y_1), …, (x_m, y_m) be a sequence of labeled examples with x_i ∈ R^d, ||x_i|| ≤ R, and y_i ∈ {−1, 1} for all i. Let u ∈ R^d, γ > 0 be such that ||u|| = 1 and y_i (u^T x_i) ≥ γ for all i. Then the Perceptron makes at most O(R²/γ²) mistakes on the example sequence.
• The bound depends on the ratio of the size of the instance space (R) to the margin of the optimal classifier (γ).

Winnow Mistake-bound
• Winnow makes O(k log d) mistakes when learning a k-disjunction
• If u is the number of false negatives (promotions), then u < k log(2d)
• If v is the number of false positives (demotions), then v < 2(u + 1)
• u + v < 3u + 2 = O(k log d)

Summary of Algorithms
• Perceptron [Rosenblatt, 1958]
  – If f(x)=1 but h(x)=0: w_i = w_i + η for each x_i = 1 (promotion)
  – If f(x)=0 but h(x)=1: w_i = w_i − η for each x_i = 1 (demotion)
• Winnow [Littlestone, 1988]
  – If f(x)=1 but h(x)=0: w_i = w_i · η for each x_i = 1 (promotion)
  – If f(x)=0 but h(x)=1: w_i = w_i / η for each x_i = 1 (demotion)
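A minimal sketch of the two mistake-driven update rules summarized above, assuming Boolean features and a threshold prediction w·x ≥ θ; the toy disjunction target, the threshold values (0.5 for Perceptron, d for Winnow), the learning rates, and the single training pass are all illustrative assumptions.

```python
from itertools import product

def train_mistake_driven(examples, d, update, eta, w0, theta):
    """Mistake-driven learning: predict with w.x >= theta, update only on mistakes.
    Returns the learned weights and the mistake count."""
    w = [w0] * d
    mistakes = 0
    for x, y in examples:
        pred = int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
        if pred != y:
            mistakes += 1
            update(w, x, y, eta)
    return w, mistakes

def perceptron_update(w, x, y, eta):
    # additive: promote on false negatives, demote on false positives
    for i, xi in enumerate(x):
        if xi == 1:
            w[i] += eta if y == 1 else -eta

def winnow_update(w, x, y, eta):
    # multiplicative: promote on false negatives, demote on false positives
    for i, xi in enumerate(x):
        if xi == 1:
            w[i] = w[i] * eta if y == 1 else w[i] / eta

# Toy target (an assumption): the disjunction x1 v x2 over d=4 Boolean features
d = 4
data = [(x, int(x[0] or x[1])) for x in product([0, 1], repeat=d)]
print(train_mistake_driven(data, d, perceptron_update, eta=1.0, w0=0.0, theta=0.5))
print(train_mistake_driven(data, d, winnow_update, eta=2.0, w0=1.0, theta=d))
```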
PAC Learning
• Can we bound the error Error_D(h) = Pr_{x∈D}[f(x) ≠ h(x)] given only knowledge of the training instances?
  f(x) = x2 ∧ x3 ∧ x5 ∧ x100
  h(x) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
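One way to make the question concrete (a sketch, not from the slides): estimate Error_D(h) for the f and h above by sampling from an assumed distribution D; here D is taken to be uniform over {0,1}^100 purely for illustration.

```python
import random

d = 100
f_literals = [2, 3, 5, 100]        # target:     x2 ^ x3 ^ x5 ^ x100
h_literals = [2, 3, 4, 5, 100]     # hypothesis: x2 ^ x3 ^ x4 ^ x5 ^ x100

def conjunction(literals, x):
    return int(all(x[i - 1] == 1 for i in literals))   # x is 1-indexed via x[i-1]

def estimate_error(m=50_000, seed=0):
    """Monte Carlo estimate of Pr_{x~D}[f(x) != h(x)] under uniform D."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(m):
        x = [rng.randint(0, 1) for _ in range(d)]
        disagreements += conjunction(f_literals, x) != conjunction(h_literals, x)
    return disagreements / m

# Under uniform D, f and h disagree only when x2=x3=x5=x100=1 and x4=0,
# i.e. with probability (1/2)^5 = 0.03125; the estimate should be close to that.
print(estimate_error())
```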
PAC Learning of Conjunctions
• We require m > (d/ε)(ln(d) + ln(1/δ)) instances to ensure a probability of failure of less than δ
• {δ=0.1, ε=0.1, d=100} means 6907 instances
• {δ=0.1, ε=0.1, d=10} means 460 instances
• {δ=0.01, ε=0.1, d=10} means 690 instances

PAC Learnability
• Consider a concept class C defined over an instance space X with dimensionality d, and a learner L which operates in the hypothesis space H
• C is PAC learnable by L using H if
  1. for all f in C,
  2. for any distribution D over X,
  3. for any fixed ε>0, δ<1,
  L, when given a collection of m instances sampled iid according to D, produces with probability at least (1−δ) a hypothesis h in H with error at most ε, where m is polynomial in 1/ε, 1/δ, d, and size(C)
• C is efficiently PAC learnable if L produces h in time polynomial in 1/ε, 1/δ, d, and size(C)

PAC Learnability
• Learnability is most ostensibly determined by the number of training examples required (i.e., sample complexity)
• Can we derive an algorithm-independent bound on the number of training examples required to learn a consistent hypothesis?
• This assumes that a consistent hypothesis exists – which is the case if C ⊆ H
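A quick check of the three instance counts above (a sketch; natural logarithms are assumed, which is what reproduces the quoted values).

```python
from math import log, ceil

def conjunction_sample_bound(d, eps, delta):
    """m > (d/eps) * (ln d + ln(1/delta)) for learning conjunctions by elimination."""
    return ceil((d / eps) * (log(d) + log(1 / delta)))

for d, eps, delta in [(100, 0.1, 0.1), (10, 0.1, 0.1), (10, 0.1, 0.01)]:
    print(d, eps, delta, conjunction_sample_bound(d, eps, delta))
# Prints 6908, 461, 691 — matching the slide's 6907 / 460 / 690 up to rounding.
```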
ε-exhausting a Version Space
• Theorem: If the hypothesis space H is finite, and S is a sequence of m>0 instances drawn iid and labeled by c in C, then for any ε the probability that the version space is not ε-exhausted is at most |H|e^(−εm)
• This bounds the probability that m training instances fail to eliminate all "bad" consistent hypotheses
• This gives us a path to a PAC bound

Occam's Razor
• Claim: The probability that there exists an h in H that is consistent with m examples and satisfies Error(h) > ε is less than |H|(1−ε)^m
• This implies that (a gross over-estimate)
  m > (1/ε)(ln(|H|) + ln(1/δ))
• What kind of hypothesis space do we want?
• What about consistency?

Agnostic Learning
• Suppose f ∉ H (i.e., h may be inconsistent)
• The goal is then to find h in H with a small training error
  Error_S(h) = (1/m) Σ_{i=1}^{m} [f(x_i) ≠ h(x_i)]
• We want a guarantee that an h with small training error will be accurate on unseen data
  Error_D(h) = Pr_{x∈D}[f(x) ≠ h(x)]
Hoeffding Bound
• Characterizes the deviation between the true probability of some event and its observed frequency over m independent trials
  P(Error_D(h) > Error_S(h) + ε) < e^(−2mε²)
• This analysis is limited to binary variables

Agnostic Learning
• Using the union bound again,
  P(∃h ∈ H : Error_D(h) > Error_S(h) + ε) < |H|e^(−2mε²)
• If we bound this above by δ, we get a generalization bound – a bound on how much the true error will deviate from the observed error
• For any distribution D generating both training and testing instances, with probability at least (1−δ) over the sampling of S, for all h in H
  Error_D(h) < Error_S(h) + √((ln|H| + ln(1/δ)) / (2m))
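A small numeric illustration of the generalization bound above (a sketch; the choice |H| = 3^d for conjunctions over d=10 variables and the particular m and δ values are illustrative assumptions, though the 3-choices-per-variable count is consistent with the hypothesis-space sizes used later in the slides).

```python
from math import log, sqrt

def generalization_gap(H_size, m, delta):
    """Bound on Error_D(h) - Error_S(h) holding for all h in H w.p. >= 1 - delta."""
    return sqrt((log(H_size) + log(1 / delta)) / (2 * m))

# Conjunctions over d=10 Boolean variables: each variable appears positively,
# negatively, or not at all, so |H| = 3**10 (an assumption for this example).
for m in (100, 1000, 10000):
    print(m, round(generalization_gap(3 ** 10, m, delta=0.05), 3))
```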
Agnostic Learning
• Using these results, we can also get a sample complexity bound for agnostic learning in finite hypothesis spaces
  m ≥ (1/(2ε²))(ln|H| + ln(1/δ))
• Comparing to the consistent learner:
  m > (1/ε)(ln(|H|) + ln(1/δ))
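A side-by-side comparison of the two bounds (a sketch; |H| = 3^10 for conjunctions over d=10 variables is again just an illustrative assumption).

```python
from math import log, ceil

def consistent_m(H_size, eps, delta):
    return ceil((1 / eps) * (log(H_size) + log(1 / delta)))

def agnostic_m(H_size, eps, delta):
    return ceil((1 / (2 * eps ** 2)) * (log(H_size) + log(1 / delta)))

H_size, eps, delta = 3 ** 10, 0.1, 0.05
print(consistent_m(H_size, eps, delta))   # 140
print(agnostic_m(H_size, eps, delta))     # 700: the 1/eps factor becomes 1/(2*eps^2)
```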
Return to Occam's Razor
• The PAC bound for consistent learners is
  m > (1/ε)(ln(|H|) + ln(1/δ))
• What about learning conjunctions?
• What about unbiased learners?
• What about non-trivial Boolean functions?

Boolean Formula
• k-CNF – conjunction of any number of disjunctive clauses with at most k literals each
• k-term CNF – conjunction of at most k disjunctive clauses
• k-DNF – disjunction of any number of conjunctive terms with at most k literals each
• k-term DNF – disjunction of at most k conjunctive terms

k-CNF
  f(x) = ∧_{i=1}^{c} (l_{i1} ∨ … ∨ l_{ik})
• What is a learning algorithm to find h?
• Define a new set of literals, one for each clause of at most k literals:
  v_j = l_{i1} ∨ … ∨ l_{ik}
• Learn conjunctions over the new space
• How many literals exist in the new space?  (d choose k) < d^k
• What is the size of the hypothesis space?  3^O(d^k)
  m > (1/ε)(ln(|H|) + ln(1/δ))
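A minimal sketch of this reduction (illustrative only): enumerate all disjunctive clauses of at most k literals over the d variables and their negations, map each instance to the vector of clause values, and run the conjunction-elimination learner from earlier in the new space. The toy 2-CNF target is an assumption.

```python
from itertools import combinations, product

def clauses(d, k):
    """All disjunctive clauses with at most k literals over x1..xd and their negations.
    A literal is (index, polarity); a clause is a tuple of literals."""
    literals = [(i, b) for i in range(d) for b in (True, False)]
    return [c for size in range(1, k + 1) for c in combinations(literals, size)]

def clause_value(clause, x):
    return int(any((x[i] == 1) == pol for i, pol in clause))

def learn_kcnf(examples, d, k):
    """Elimination in the new space: keep every clause satisfied by all positives."""
    active = clauses(d, k)
    for x, y in examples:
        if y == 1:
            active = [c for c in active if clause_value(c, x) == 1]
    return active                       # h(x) = AND of the surviving clauses

def predict(h, x):
    return int(all(clause_value(c, x) for c in h))

# Toy target (an assumption): f(x) = (x1 v x2) ^ x3 over d=4 variables, k=2
data = [(x, int((x[0] or x[1]) and x[2])) for x in product([0, 1], repeat=4)]
h = learn_kcnf(data, d=4, k=2)
print(all(predict(h, x) == y for x, y in data))   # True: consistent on the sample
```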
k-term DNF
  f(x) = ∨_{i=1}^{k} (l_{i1} ∧ … ∧ l_{id})
• What is the size of the hypothesis space?  |H| < 3^(kd)
• Is it PAC-learnable?  m > (1/ε)(ln(|H|) + ln(1/δ))
• Determining if there is a 2-term DNF consistent with a set of instances is NP-hard
• Oh no!
• Even though the sample complexity is good, the computational complexity is bad

k-term DNF
• However, k-CNF subsumes k-term DNF, as every k-term DNF can be written as a k-CNF, e.g.
  T1 ∨ T2 ∨ T3 = ∧_{x∈T1, y∈T2, z∈T3} (x ∨ y ∨ z)
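A sketch of this distributive rewrite (illustrative; terms and clauses are represented simply as sets of literal strings).

```python
from itertools import product

def dnf_to_cnf(terms):
    """Rewrite T1 v T2 v ... v Tk (each term a set of literals) as a k-CNF:
    one clause per way of choosing a single literal from each term."""
    return {frozenset(choice) for choice in product(*terms)}

# T1 v T2 v T3 with T1 = x1^x2, T2 = x3, T3 = ~x1^x4 (a made-up formula)
terms = [{"x1", "x2"}, {"x3"}, {"~x1", "x4"}]
for clause in sorted(dnf_to_cnf(terms), key=sorted):
    print(" v ".join(sorted(clause)))   # each clause has at most k=3 literals
```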
• Since this is a polynomial-time transformation, while k-term DNF is not a properly PAC learnable hypothesis space, k-term DNF is PAC-learnable with H = k-CNF

Representation is Important!
• Concepts that cannot be learned using one representation can often be learned using a more expressive representation (figure: the concept class C sits inside a larger hypothesis space H)
• However, small hypothesis spaces also have advantages

Negative PAC results
• DNF
• CNF
• Deterministic Finite Automata (DFAs)
• Context Free Grammars (CFGs)
• Recursive Logic Programs

Learning Rectangles
• Assume the concept is an axis-parallel rectangle (figure: positively labeled points clustered inside the target rectangle, negative points outside)
• Can we learn an axis-parallel rectangle?
• Can we bound the error?
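A minimal sketch of the natural "tightest-fit" learner for this class (an assumption about the intended algorithm, since the slide only poses the question): return the smallest axis-parallel rectangle containing all positive points; its error region is the strip between the learned and target rectangles.

```python
def learn_rectangle(examples):
    """Tightest axis-parallel rectangle around the positive points.
    examples: iterable of ((x, y), label) pairs with label in {0, 1}."""
    xs = [p[0] for p, label in examples if label == 1]
    ys = [p[1] for p, label in examples if label == 1]
    if not xs:                          # no positives seen: predict all-negative
        return None
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    if rect is None:
        return 0
    x_lo, x_hi, y_lo, y_hi = rect
    return int(x_lo <= point[0] <= x_hi and y_lo <= point[1] <= y_hi)

# Toy data (made up): target rectangle roughly [1, 4] x [1, 3]
data = [((2, 2), 1), ((3, 1), 1), ((1, 3), 1), ((5, 2), 0), ((0, 0), 0)]
rect = learn_rectangle(data)
print(rect, predict(rect, (2.5, 2)))    # (1, 3, 1, 3) 1
```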
Infinite Hypothesis Spaces
• So far, PAC analysis has been restricted to finite hypothesis spaces
• Infinite hypothesis spaces also have varying levels of expressivity
  – Squares vs. rectangles vs. 23-sided convex polygons vs. 23-sided general polygons vs. …
  – Linear threshold functions vs. conjunctions of LTUs
• We need a measurement of expressivity
  – Vapnik-Chervonenkis (VC) dimension

Shattering Instances
• We want to measure the number of distinct instances from X that can be discriminated using H regardless of their labeling
• A set of instances S is shattered by H iff for every labeling of S there exists some hypothesis in H consistent with that labeling

Shattering Game
• You (desiring a large VC dimension) specify a set of points from X and a hypothesis space H
• I (desiring a low VC dimension) label the points in an adversarial manner, requiring that you find a hypothesis for every possible labeling
• You try to find an h in H that is consistent with each labeling
• Intuition: a rich set of functions can shatter large sets of points

Some examples
• To the 2-d chalk space!
• Left-bounded 1-d intervals ([0,a]): VC(H)=1
• General 1-d intervals ([a,b]): VC(H)=2
• 2-d half-spaces (linear separators): VC(H)=3

Sample Complexity
• Using VC(H) as a measure of expressivity, we have a sample complexity bound
  m ≥ (1/ε)(8·VC(H)·log₂(13/ε) + 4·log₂(2/δ))
• If m is polynomial in the relevant parameters, we have a PAC learning algorithm
• Note that analysis of the lower bounds shows this is a fairly tight bound
• Note that to shatter m examples it must be true that |H| ≥ 2^m, making VC(H) ≤ log₂(|H|)
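To make shattering concrete, here is a small brute-force check (illustrative code, not from the slides) that general 1-d intervals [a,b] shatter some pair of points but none of the 3-point subsets of a small sample, consistent with VC(H)=2 as listed above and on the next slide.

```python
from itertools import combinations, product

def interval_consistent(points, labels):
    """Is there an interval [a, b] labeling exactly the points marked 1 as positive?
    It suffices to try intervals whose endpoints are data points (plus an empty one)."""
    candidates = [(a, b) for a in points for b in points if a <= b]
    candidates.append((1.0, 0.0))                       # empty interval: all-negative
    for a, b in candidates:
        if all((a <= p <= b) == (lab == 1) for p, lab in zip(points, labels)):
            return True
    return False

def shatters(points):
    return all(interval_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

pts = [0.1, 0.4, 0.7, 0.9]
print(any(shatters(list(s)) for s in combinations(pts, 2)))   # True: some pair is shattered
print(any(shatters(list(s)) for s in combinations(pts, 3)))   # False: labeling (+,-,+) fails
```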
Some examples
• To the 2-d chalk space!
• Left-bounded 1-d intervals ([0,a]): VC(H)=1
• General 1-d intervals ([a,b]): VC(H)=2
• 2-d half-spaces (linear separators): VC(H)=3
• Axis-parallel rectangles: VC(H)=4
• Must also specify an efficient algorithm

Conclusions
• The PAC framework is a model for theoretically analyzing the effectiveness of learning algorithms
• The sample complexity of any consistent learner can be determined from the expressivity of H
• If the sample complexity is tractable, computational complexity is the governing factor
• Sample complexity bounds tend to be loose, but can still distinguish learnable from not
• Representation is very important

Conclusions
• Many additional models have been studied
  – Noisy data
  – Distributional assumptions
  – Probabilistic representations
  – Neural networks (and other nested structures)
  – Restricted finite automata
  – Active learning
• Recent work is most concerned with data-dependent PAC bounds
• COLT has recently had significant practical impact in the analysis of "modern" algorithms (SVM, Boosting)