Computational Learning Theory: Learning Conjunctions

CS542: Computational Learning Theory – Spring 2012, Kevin Small (3/28/12) [Significant content from Dan Roth]

Learning Conjunctions
•  Start with the set of all literals as candidates
•  Eliminate any literal that is not active (set to 1) in a positively labeled instance
   $h(x) = x_1 \wedge x_2 \wedge x_3 \wedge \dots \wedge x_{100}$
–  <(1,1,1,1,1,…,1,1), 1>
–  <(1,1,1,0,1,…,0,0), 0>  learned nothing
–  <(1,1,1,1,1,0,…,0,1,1), 1>   $h(x) = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{99} \wedge x_{100}$
–  <(1,0,1,1,1,0,…,0,1), 0>  learned nothing
–  <(0,1,1,1,1,…,0,1), 1>   $h(x) = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
–  <(1,0,1,0,…,0,1,1), 0>  learned nothing
•  Our final hypothesis $h(x) = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$ is close to the target $f(x) = x_2 \wedge x_3 \wedge x_5 \wedge x_{100}$ (see the sketch below)
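As a concrete illustration, here is a minimal Python sketch of the elimination procedure described above; the function names and the encoding of instances as 0/1 tuples are my own choices, not from the slides.

    def learn_conjunction(examples, d):
        """Elimination algorithm for monotone conjunctions over d Boolean features.

        examples: iterable of (x, y) pairs, where x is a 0/1 tuple of length d
                  and y is the label (1 = positive, 0 = negative).
        Returns the set of feature indices kept in the learned conjunction h.
        """
        # Start with every literal x_1 ... x_d as a candidate.
        candidates = set(range(d))
        for x, y in examples:
            if y == 1:
                # Drop any literal that is 0 in a positive example;
                # negative examples are ignored ("learned nothing").
                candidates = {i for i in candidates if x[i] == 1}
        return candidates

    def predict(candidates, x):
        # h(x) is the conjunction of all surviving literals.
        return int(all(x[i] == 1 for i in candidates))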
Analyzing Performance
•  Probabilistic Analysis
–  The distribution never generated $x_4=0$ in a positive instance; maybe it never will
–  This is at least a low-probability event, so the resulting hypothesis should be pretty good (?)
–  PAC framework (next week)
•  Mistake-Driven Analysis
–  The hypothesis is only updated when mistakes are made
–  If we count the number of mistakes made during learning, we can estimate performance on future data
–  Note: not all on-line algorithms are mistake driven

Perceptron Mistake-bound
Theorem [Novikoff, 1963]: Let $(x_1, y_1), \dots, (x_m, y_m)$ be a sequence of labeled examples with $x_i \in \mathbb{R}^d$, $\|x_i\| \le R$, and $y_i \in \{-1, 1\}$ for all $i$. Let $u \in \mathbb{R}^d$, $\gamma > 0$ be such that $\|u\| = 1$ and $y_i u^T x_i \ge \gamma$ for all $i$. In this case, the Perceptron makes at most $O(R^2/\gamma^2)$ mistakes on the example sequence.
•  The bound depends on the ratio of the size of the instance space (R) to the margin of the optimal classifier (γ)

Winnow Mistake-bound
•  Winnow makes $O(k \log d)$ mistakes when learning a k-disjunction
•  If $u$ is the number of false negatives (promotions), $u < k \log(2d)$
•  If $v$ is the number of false positives (demotions), $v < 2(u + 1)$
•  $u + v < 3u + 2 = O(k \log d)$
Summary of Algorithms
•  Perceptron [Rosenblatt, 1958]
–  If $f(x)=1$ but $h(x)=0$: $w_i \leftarrow w_i + \eta$ if $x_i=1$ (promotion)
–  If $f(x)=0$ but $h(x)=1$: $w_i \leftarrow w_i - \eta$ if $x_i=1$ (demotion)
•  Winnow [Littlestone, 1988]
–  If $f(x)=1$ but $h(x)=0$: $w_i \leftarrow w_i \cdot \eta$ if $x_i=1$ (promotion)
–  If $f(x)=0$ but $h(x)=1$: $w_i \leftarrow w_i / \eta$ if $x_i=1$ (demotion)
   (a sketch of both update rules follows below)
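The following Python sketch contrasts the two mistake-driven update rules; it is a minimal illustration assuming {0,1}-valued features, and the function names and default rates are mine, not from the slides.

    def perceptron_update(w, x, f_x, h_x, eta=1.0):
        """Additive update: promote on false negatives, demote on false positives."""
        if f_x == 1 and h_x == 0:          # false negative -> promotion
            w = [wi + eta if xi == 1 else wi for wi, xi in zip(w, x)]
        elif f_x == 0 and h_x == 1:        # false positive -> demotion
            w = [wi - eta if xi == 1 else wi for wi, xi in zip(w, x)]
        return w

    def winnow_update(w, x, f_x, h_x, eta=2.0):
        """Multiplicative update: same trigger conditions, but multiply/divide instead."""
        if f_x == 1 and h_x == 0:          # false negative -> promotion
            w = [wi * eta if xi == 1 else wi for wi, xi in zip(w, x)]
        elif f_x == 0 and h_x == 1:        # false positive -> demotion
            w = [wi / eta if xi == 1 else wi for wi, xi in zip(w, x)]
        return w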
PAC Learning
•  Can we bound the error $Error_D = \Pr_{x \sim D}[f(x) \ne h(x)]$ given knowledge of the training instances?
   $f(x) = x_2 \wedge x_3 \wedge x_5 \wedge x_{100}$
   $h(x) = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
PAC Learning of Conjunctions
•  We require
   $m > \frac{d}{\epsilon}\left(\log(d) + \log\frac{1}{\delta}\right)$
   instances to ensure a probability of failure of less than δ
•  {δ=0.1, ε=0.1, d=100} means 6907 instances
•  {δ=0.1, ε=0.1, d=10} means 460 instances
•  {δ=0.01, ε=0.1, d=10} means 690 instances (a quick numerical check follows below)
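As a sanity check on those numbers, a short computation; using natural logarithms, which reproduce the values quoted on the slide.

    from math import log, floor

    def conjunction_sample_bound(d, eps, delta):
        # m > (d / eps) * (ln d + ln(1/delta))
        return (d / eps) * (log(d) + log(1.0 / delta))

    print(floor(conjunction_sample_bound(100, 0.1, 0.1)))   # 6907
    print(floor(conjunction_sample_bound(10, 0.1, 0.1)))    # 460
    print(floor(conjunction_sample_bound(10, 0.1, 0.01)))   # 690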
PAC Learnability
•  Consider a concept class C defined over an instance space X with dimensionality d, and a learner L which operates in the hypothesis space H
•  C is PAC learnable by L using H if
1.  for all f in C,
2.  for any distribution D over X,
3.  for a fixed ε>0, δ<1,
   L, when given a collection of m instances sampled iid according to D, produces with probability at least (1−δ) a hypothesis h in H with error at most ε, where m is polynomial in 1/ε, 1/δ, d and size(C)
•  C is efficiently PAC learnable if L produces h in time polynomial in 1/ε, 1/δ, d and size(C)

PAC Learnability
•  Learnability is most ostensibly determined by the number of training examples (i.e., sample complexity)
•  Can we derive an algorithm-independent bound on the number of training examples required to learn a consistent hypothesis?
•  This assumes that a consistent hypothesis exists (which is the case if $C \subseteq H$)
ε-exhausting a Version Space
•  Theorem: If the hypothesis space H is finite, and S is a sequence of m>0 instances drawn iid and labeled by c in C, then for any ε the probability that the version space is not ε-exhausted is at most $|H|e^{-\epsilon m}$
•  This is a bound on the probability that m training instances fail to eliminate all “bad” consistent hypotheses
•  This gives us a path to a PAC bound

Occam’s Razor
•  Claim: The probability that there exists an h in H that is consistent with m examples and satisfies Error(h)>ε is less than $|H|(1-\epsilon)^m$
•  This implies that (as a gross over-estimate)
   $m > \frac{1}{\epsilon}\left(\log(|H|) + \log\frac{1}{\delta}\right)$
   instances suffice (a short derivation follows below)
•  What kind of hypothesis space do we want?
•  What about consistency?
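To fill in the step from the claim to the bound (my wording of the standard argument): require $|H|(1-\epsilon)^m \le \delta$, and since $1-\epsilon \le e^{-\epsilon}$,
   $|H|(1-\epsilon)^m \le |H|e^{-\epsilon m} \le \delta \;\Longrightarrow\; \log|H| - \epsilon m \le \log\delta \;\Longrightarrow\; m \ge \frac{1}{\epsilon}\left(\log|H| + \log\frac{1}{\delta}\right)$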
Agnostic Learning
•  Suppose $f \notin H$ (i.e., h may be inconsistent with the data)
•  The goal is then to find h in H with a small training error
   $Error_S(h) = \frac{1}{m}\sum_{i=1}^{m} [\![\, f(x_i) \ne h(x_i) \,]\!]$
•  We want a guarantee that an h with small training error will be accurate on unseen data
   $Error_D(h) = \Pr_{x \sim D}[f(x) \ne h(x)]$
Hoeffding Bound
•  Characterizes the deviation between the true probability of some event and its observed frequency over m independent trials
   $P(Error_D(h) > Error_S(h) + \epsilon) < e^{-2m\epsilon^2}$
•  This analysis is limited to binary variables
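A small simulation can make the bound concrete. The sketch below is my own illustration, not from the slides: it estimates how often the true error of a fixed hypothesis exceeds its observed error by more than ε, and compares that frequency to the Hoeffding bound $e^{-2m\epsilon^2}$.

    import math
    import random

    def deviation_frequency(true_error=0.3, m=200, eps=0.05, trials=10000, seed=0):
        """Fraction of trials where the true error exceeds the observed error by more than eps."""
        rng = random.Random(seed)
        exceed = 0
        for _ in range(trials):
            # Each training mistake is an independent Bernoulli(true_error) event.
            observed = sum(rng.random() < true_error for _ in range(m)) / m
            if true_error > observed + eps:   # the event Error_D(h) > Error_S(h) + eps
                exceed += 1
        return exceed / trials

    if __name__ == "__main__":
        freq = deviation_frequency()
        bound = math.exp(-2 * 200 * 0.05 ** 2)   # Hoeffding bound e^{-2 m eps^2}
        print(f"observed deviation frequency: {freq:.3f}, Hoeffding bound: {bound:.3f}")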
Agnostic Learning
•  Using the union bound again,
   $P(\exists h \in H : Error_D(h) > Error_S(h) + \epsilon) < |H|e^{-2m\epsilon^2}$
•  If we bound this above by δ, we get a generalization bound: a bound on how much the true error can deviate from the observed error
•  For any distribution D generating both training and testing instances, with probability at least (1−δ) over the sampling of S, for all h in H
   $Error_D(h) < Error_S(h) + \sqrt{\frac{\log|H| + \log(1/\delta)}{2m}}$
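To see the scale of this bound, a short helper; the example values for |H|, δ, and m below are mine, chosen only for illustration.

    from math import log, sqrt

    def generalization_gap(h_size, delta, m):
        """Finite-hypothesis-class bound: sqrt((log|H| + log(1/delta)) / (2m))."""
        return sqrt((log(h_size) + log(1.0 / delta)) / (2 * m))

    # Example: |H| = 3^10 (conjunctions over 10 variables), delta = 0.05
    for m in (100, 1000, 10000):
        print(m, round(generalization_gap(3 ** 10, 0.05, m), 3))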
Agnostic Learning
•  Using these results, we can also get a sample complexity bound for agnostic learning in finite hypothesis spaces
   $m \ge \frac{1}{2\epsilon^2}\left(\log|H| + \log\frac{1}{\delta}\right)$
•  Comparing to the consistent learner,
   $m > \frac{1}{\epsilon}\left(\log(|H|) + \log\frac{1}{\delta}\right)$
•  The agnostic bound pays an extra factor of $\frac{1}{2\epsilon}$ in the dependence on ε
Return to Occam’s Razor
•  The PAC bound for consistent learners is
   $m > \frac{1}{\epsilon}\left(\log(|H|) + \log\frac{1}{\delta}\right)$
•  What about learning conjunctions?
•  What about unbiased learners?
•  What about non-trivial Boolean functions?
Boolean Formula
•  k-CNF – a conjunction of any number of disjunctive clauses, each with at most k literals
•  k-term CNF – a conjunction of at most k disjunctive clauses
•  k-DNF – a disjunction of any number of conjunctive terms, each with at most k literals
•  k-term DNF – a disjunction of at most k conjunctive terms

k-CNF
   $f(x) = \wedge_{i=1}^{c}\,(l_{i_1} \vee \dots \vee l_{i_k})$
•  What is a learning algorithm to find h?
•  Define a new set of literals, one for each clause of size k:
   $v_j = l_{i_1} \vee \dots \vee l_{i_k}$
•  Learn conjunctions over the new space (a sketch of this reduction follows below)
•  How many literals exist in the new space?  $\binom{d}{k} < d^k$
•  What is the size of the hypothesis space?  $3^{O(d^k)}$
•  Sample complexity is therefore still polynomial:
   $m > \frac{1}{\epsilon}\left(\log(|H|) + \log\frac{1}{\delta}\right)$
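A minimal sketch of the reduction described above, assuming Boolean instances given as 0/1 tuples and literals represented as (index, polarity) pairs; it reuses the learn_conjunction eliminator from the earlier sketch, and all names here are my own.

    from itertools import combinations, product

    def all_clauses(d, k):
        """All disjunctive clauses with at most k literals over d Boolean variables.

        A literal is (i, polarity): polarity True means x_i, False means not x_i.
        """
        clauses = []
        for size in range(1, k + 1):
            for idxs in combinations(range(d), size):
                for pols in product((True, False), repeat=size):
                    clauses.append(tuple(zip(idxs, pols)))
        return clauses

    def clause_value(clause, x):
        # A disjunction is 1 if any of its literals is satisfied.
        return int(any((x[i] == 1) == pol for i, pol in clause))

    def expand(x, clauses):
        # Map the original instance into the "one feature per clause" space.
        return tuple(clause_value(c, x) for c in clauses)

    def learn_kcnf(examples, d, k):
        clauses = all_clauses(d, k)
        expanded = [(expand(x, clauses), y) for x, y in examples]
        kept = learn_conjunction(expanded, len(clauses))   # eliminator from the earlier sketch
        return [clauses[j] for j in kept]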
k-term DNF
   $f(x) = \vee_{i=1}^{k}\,(l_{i_1} \wedge \dots \wedge l_{i_d})$
•  What is the size of the hypothesis space?  $< 3^{kd}$
•  Is it PAC-learnable?
   $m > \frac{1}{\epsilon}\left(\log(|H|) + \log\frac{1}{\delta}\right)$
•  Determining whether there is a 2-term DNF consistent with a set of instances is NP-hard
•  Oh no!
•  Even though the sample complexity is good, the computational complexity is bad

k-term DNF
•  However, k-CNF subsumes k-term DNF, as every k-term DNF can be written as a k-CNF:
   $T_1 \vee T_2 \vee T_3 = \bigwedge_{x \in T_1,\, y \in T_2,\, z \in T_3} (x \vee y \vee z)$
   (a small worked example follows below)
•  Since this is a polynomial-time transformation, while k-term DNF is not properly PAC learnable as a hypothesis space, k-term DNF is PAC-learnable using H = k-CNF
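For instance, distributing a 2-term DNF over its terms (an illustration of the transformation, not from the slides):
   $(x_1 \wedge x_2) \vee (x_3 \wedge x_4) = (x_1 \vee x_3) \wedge (x_1 \vee x_4) \wedge (x_2 \vee x_3) \wedge (x_2 \vee x_4)$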
Representation is Important!
•  Concepts that cannot be learned using one representation can often be learned using a more expressive representation
   [Figure: Venn diagram of the concept class C contained inside a larger hypothesis space H]
•  However, small hypothesis spaces also have advantages

Negative PAC results
•  DNF
•  CNF
•  Deterministic Finite Automata (DFAs)
•  Context-Free Grammars (CFGs)
•  Recursive Logic Programs

Learning Rectangles
•  Assume the concept is an axis-parallel rectangle
   [Figure: positive (+) and negative (−) points in the plane, with the positives clustered inside an axis-parallel rectangle]
•  Can we learn an axis-parallel rectangle? (see the sketch below)
•  Can we bound the error?
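One natural learner, sketched below, returns the tightest axis-parallel rectangle enclosing the positive examples; this is a standard illustration and my own code, not an algorithm stated on the slides.

    def tightest_rectangle(examples):
        """Return (x_min, x_max, y_min, y_max) of the smallest axis-parallel
        rectangle containing all positive examples, or None if there are none.

        examples: iterable of ((x, y), label) with label 1 for positive, 0 for negative.
        """
        positives = [p for p, label in examples if label == 1]
        if not positives:
            return None
        xs = [x for x, _ in positives]
        ys = [y for _, y in positives]
        return (min(xs), max(xs), min(ys), max(ys))

    def inside(rect, point):
        x_min, x_max, y_min, y_max = rect
        x, y = point
        return x_min <= x <= x_max and y_min <= y <= y_max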
Infinite Hypothesis Spaces
•  So far, PAC analysis has been restricted to finite hypothesis spaces
•  Infinite hypothesis spaces also have varying levels of expressivity
–  Squares vs. rectangles vs. 23-sided convex polygons vs. 23-sided general polygons vs. …
–  Linear threshold functions vs. conjunctions of LTUs
•  We need a measurement of expressivity
–  Vapnik-Chervonenkis (VC) dimension

Shattering Instances
•  We want to measure the number of distinct instances from X that can be discriminated using H regardless of their labeling
•  A set of instances S is shattered by H iff for every labeling of S there exists some hypothesis in H consistent with that labeling

Shattering Game
•  You (desiring a large VC dimension) specify a set of points from X and a hypothesis space H
•  I (desiring a low VC dimension) label the points in an adversarial manner, requiring that you find a hypothesis for all possible labelings
•  You try to find an h in H that is consistent with each labeling
•  Intuition: a rich set of functions can shatter large sets of points

Some examples
•  To the 2-d chalk space!
•  Left-bounded 1-d intervals ([0,a])        VC(H)=1
•  General 1-d intervals ([a,b])             VC(H)=2
•  2-d half-spaces (linear separators)       VC(H)=3
   (a small shattering check for the first case appears below)
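A brute-force check of the first example, showing that hypotheses of the form "label 1 iff x ≤ a" shatter any single point but cannot shatter two distinct points; this enumeration is my own illustration.

    from itertools import product

    def shattered_by_left_intervals(points):
        """Can hypotheses h_a(x) = 1 if x <= a (else 0) realize every labeling of the points?"""
        # Thresholds just below and just above each point cover all distinct behaviors.
        thresholds = sorted({p - 1e-9 for p in points} | {p + 1e-9 for p in points})
        for labeling in product((0, 1), repeat=len(points)):
            realized = any(
                all(int(p <= a) == y for p, y in zip(points, labeling))
                for a in thresholds
            )
            if not realized:
                return False   # this labeling cannot be produced by any h_a
        return True

    print(shattered_by_left_intervals([0.5]))        # True  -> a single point is shattered
    print(shattered_by_left_intervals([0.3, 0.7]))   # False -> two points cannot be shattered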
Sample Complexity
•  Using VC(H) as a measure of expressivity, we have the sample complexity bound
   $m \ge \frac{1}{\epsilon}\left(8\,VC(H)\log\frac{13}{\epsilon} + 4\log\frac{2}{\delta}\right)$
•  If m is polynomial in the relevant parameters, we have a PAC learning algorithm
•  Note that analysis of the lower bounds shows this is a fairly tight bound
•  Note that to shatter m examples it must be true that $|H| \ge 2^m$, making $\log(|H|) \ge VC(H)$
   (a quick evaluation of this bound appears below)
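Plugging in numbers for axis-parallel rectangles (VC(H)=4); the base-2 logarithm and the particular ε, δ below are my own choices, since the slide does not specify them.

    from math import log2, ceil

    def vc_sample_bound(vc, eps, delta):
        # m >= (1/eps) * (8 * VC(H) * log2(13/eps) + 4 * log2(2/delta))
        return (1.0 / eps) * (8 * vc * log2(13.0 / eps) + 4 * log2(2.0 / delta))

    # Axis-parallel rectangles have VC dimension 4.
    print(ceil(vc_sample_bound(4, 0.1, 0.05)))   # about 2460 examples for eps=0.1, delta=0.05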
Some examples
•  To the 2-d chalk space!
•  Left-bounded 1-d intervals ([0,a])        VC(H)=1
•  General 1-d intervals ([a,b])             VC(H)=2
•  2-d half-spaces (linear separators)       VC(H)=3
•  Axis-parallel rectangles                  VC(H)=4
•  Must also specify an efficient algorithm

Conclusions
•  The PAC framework is a model for theoretically analyzing the effectiveness of learning algorithms
•  The sample complexity of any consistent learner can be determined by the expressivity of H
•  If the sample complexity is tractable, computational complexity is the governing factor
•  Sample complexity bounds tend to be loose, but can still distinguish learnable from not
•  Representation is very important

Conclusions
•  Many additional models have been studied
–  Noisy data
–  Distributional assumptions
–  Probabilistic representations
–  Neural networks (and other nested structures)
–  Restricted finite automata
–  Active learning
•  Recent work is most concerned with data-dependent PAC bounds
•  Practical COLT impact has seen recent significance in the analysis of “modern” algorithms (SVM, Boosting)