Separating Distribution-Free and Mistake-Bound Learning Models over
the Boolean Domain
(Extended Abstract)
Avrim Blum*
MIT Laboratory for Computer Science
Cambridge, MA 02139

*Supported by an NSF Graduate Fellowship, NSF grant CCR-8914428, and the Siemens Corporation. Author's email address: avrim@theory.lcs.mit.edu
Abstract
Two of the most commonly used models in computational learning theory are the distribution-free (PAC,
Valiant-style) model in which examples are chosen
from a fixed but arbitrary distribution, and the absolute mistake-bound (on-line, unrestricted equivalence
query) model in which examples are presented in order
by an adversary. Over the Boolean domain {0,1}^n, it
is known that if the learner is allowed unlimited computational resources then any concept class learnable
in one model is also learnable in the other. In addition,
any polynomial-time learning algorithm for a concept
class in the mistake-bound model can be transformed
into one that learns the class in the distribution-free
model.
This paper shows that if one-way functions exist, then the converse does not hold. We present
a concept class over {0,1}^n that is learnable in the
distribution-free model, but is not learnable in the absolute mistake-bound model if one-way functions exist.
In addition, the concept class remains hard to learn in
the mistake-bound model even if the learner is allowed
a polynomial number of membership queries.
The concepts considered are based upon the Goldreich, Goldwasser, and Micali random function construction [GGM86] and involve creating the following new cryptographic object: an exponentially long sequence of strings u_1, u_2, ..., u_r over {0,1}^n that is hard to compute in one direction (given u_i one cannot compute u_j for j < i) but is easy to compute and even make random-access jumps in the other direction (given u_i and j > i one can compute u_j, even if j is exponentially larger than i). This random-access property is stronger than those held by the sequences considered in [BBS86] and [BM84], which do not allow random-access jumps forward without knowledge of a seed allowing one to compute backwards as well.
1 Introduction

Two of the most popular theoretical models for learning from examples are the distribution-free model, also known as the probably-approximately-correct (PAC) or "Valiant-style" learning model [KLPV87a, Val84], and
the absolute mistake-bound model [Lit88]. In both
models, the goal of a learner is to approximately infer
some unknown target concept from positive and negative examples of that concept. In the distribution-free
model, an adversary chooses a distribution over the
labeled examples from which the learner may sample. The learner, from a polynomial number of samples, must produce a hypothesis that agrees with the
target concept over most of the distribution. In the
mistake-bound model, the adversary instead actually
chooses the order in which examples appear. Here, the
learner sees unlabeled examples, and after each must
predict whether it is positive or negative before being
told its true classification. The job of the learner in
the mistake-bound model is to make a polynomially
bounded number of mistakes.
In this paper, we will consider the "representation-independent" versions of the distribution-free and mistake-bound models, also known as the "prediction" model [HKLW88, HLW88], in which the learner's hypotheses need not be of the same form as the target concept. The mistake-bound model is proven equivalent in [Lit88] to Angluin's equivalence query model [Ang88] when query hypotheses are unrestricted. This last model is also called "LC-ARB" in [MT89].
It is known that any algorithm for learning a
concept class in the mistake-bound model can be
converted to one that will learn the class in the
distribution-free model [Ang88, KLPV87b]. If computational considerations are ignored, then over the Boolean domain {0,1}^n the converse holds as well. Any concept class over {0,1}^n learnable with polynomial sample size (and not necessarily polynomial time) in the distribution-free model can also be learned with a polynomial mistake bound (in not necessarily polynomial time) in the mistake-bound model [Ang88, Lit88]. At the heart of the proof for this last
direction is the use in the mistake-bound model of
the "halving algorithm". This algorithm enumerates all concepts in the class, predicts according to majority vote, and when a mistake is made throws out the concepts that predicted incorrectly. The equivalence of learnability in the non-computational setting has led researchers to wonder (e.g., [Kea89])
whether the models remain equivalent when computation is limited-especially in light of recent work
showing that the distribution-free model is equivalent
to a seemingly much easier form of learning known as
“weak-learning” [Sch89].
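To make the halving algorithm concrete, here is a minimal Python sketch (not from the paper); the finite concept class is represented simply as a list of classifier functions, and the adversary's stream supplies the true label after each prediction.

```python
def halving_algorithm(concepts, example_stream):
    """Sketch of the halving algorithm: predict by majority vote over the
    concepts still consistent with the feedback seen so far.

    `concepts` is a finite list of functions x -> 0/1 (the concept class);
    `example_stream` yields pairs (x, true_label) chosen by an adversary.
    Provided the target is in the class, the algorithm makes at most
    log2(len(concepts)) mistakes, since each mistake discards at least half
    of the remaining consistent concepts.
    """
    version_space = list(concepts)
    mistakes = 0
    for x, true_label in example_stream:
        votes = sum(c(x) for c in version_space)
        prediction = 1 if 2 * votes >= len(version_space) else 0
        if prediction != true_label:
            mistakes += 1
        # Keep only the concepts that agree with the revealed label.
        version_space = [c for c in version_space if c(x) == true_label]
    return mistakes

# Toy usage: the class of threshold concepts c_t(x) = 1 iff x >= t.
concepts = [lambda x, t=t: int(x >= t) for t in range(16)]
stream = [(x, int(x >= 11)) for x in [3, 14, 9, 12, 11, 10]]
print(halving_algorithm(concepts, stream))
```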
In this paper we show that if we consider only algorithms that run in polynomial time, then if cryptographically strong pseudorandom bit generators exist [BBS86, BM84, Yao82], the distribution-free and mistake-bound models over {0,1}^n are not equivalent. That is, we create a concept class, starting from a pseudorandom bit generator, which is learnable in polynomial time in the distribution-free model but not in
the mistake-bound model. In fact, the concept class
remains hard to learn in the mistake-bound model
even if the learning algorithm is allowed membership
queries-the ability to ask whether examples of its
own choosing belong to the target concept. So, for instance, the position of this concept class contrasts with that of deterministic finite automata (DFAs), which are learnable in a mistake-bound model with membership queries [Ang87] but not in the distribution-free model [KV89]. Recently, the existence of pseudorandom bit generators has been proven equivalent to the existence of one-way functions [Lev85, ILL89, Has90],
so the separation results shown here hold given the
existence of any one-way function.
The concept class created is based upon the Goldreich, Goldwasser, and Micali random function construction [GGM86] and involves creating the following new cryptographic object: an exponentially long sequence of strings u_1, u_2, ..., u_r over {0,1}^n that is hard to compute in one direction (given u_i one cannot compute u_j for j < i) but is easy to compute and even make random-access jumps in the other direction (given u_i and j > i one can compute u_j, even if j is exponentially larger than i). This sequence is "stronger" than those in [BBS86] and [BM84] in the sense that for those sequences, the only way to compute in the forward direction (without knowing the seed that allows one to compute in the other direction as well) is to compute the strings sequentially in order.

The sequence of strings is useful for separating the learning models for roughly the following reason. Attached to each u_i is a classification that can only be computed given u_j for j < i. So, an adversary can present the strings in the reverse order u_r, u_{r-1}, ... and cause the learner to make a mistake nearly half the time. However, for any distribution over the strings, if a learner collects m samples, we expect that all but about a 1/m fraction of the distribution falls on the "easy side" (the set of u_j for j > i) of the string u_i of least index seen. The "random-access forward" property of the sequence then allows the distribution-free learner to use u_i as a predictor for any u_j with j > i. Notice that without the random-access property, even on a uniform distribution the learner might still fail, because the examples seen would likely all be exponentially far away from each other.
2 Notation, definitions, and background
A concept is a subset of the domain (here {0,1}^n), and a concept class is a collection of concepts.* An example is an element x ∈ {0,1}^n, and for a given target concept c, a labeled example of c is a pair (x, c(x)) where x is an example and c(x) = 1 iff x ∈ c. An example x is a positive example of c if x ∈ c; otherwise it is a negative example. In both the distribution-free and mistake-bound models, the object of a learning algorithm for a concept class C is to approximately infer an unknown target concept c ∈ C from examples. The models differ in how examples are chosen and how successful learning is defined.

*To be rigorous, we should define a concept class C to be a family {C_n} where C_n is defined over {0,1}^n. In order to avoid overly cumbersome notation, we shall assume this type of indexing is done implicitly.
In the distribution-free model, we imagine that there is some distribution D over the set of labeled examples of the target concept. The learning algorithm is allowed to sample from D and, based on this information and knowledge of the class C to which the target concept belongs, must produce a hypothesis h that approximates the target concept. In particular, if c is the target concept, we say a hypothesis h has error ε if on a pair (x, c(x)) chosen from D, the probability that h(x) does not equal c(x) is ε. We will allow the hypothesis produced by the algorithm to not necessarily belong to concept class C (this version is sometimes termed "polynomial predictability" [HKLW88, HLW88]). More formally, an algorithm A learns in the distribution-free model a concept class C over {0,1}^n if for some polynomial P, for all target concepts c ∈ C, distributions D, and error parameters ε and δ: in time at most P(n, 1/ε, 1/δ, |c|), the algorithm with probability at least 1 − δ finds a polynomial-time evaluatable hypothesis h (not necessarily from C) with error at most ε. See Haussler et al. [HKLW88] for various equivalent formulations of the distribution-free model.
In the mistake-bound model, instead of a distribution, we imagine there is some adversary presenting
the examples in any order it wishes. Learning is done
on-line in a sequence of stages. In each stage the adversary presents an unlabeled example to the learner, the
learner suggests a labeling, and then the learner is told
whether or not that labeling was correct. The object
of the learning algorithm is to make at most a polynomial number of mistakes. Algorithm A learns in the mistake-bound model a concept class C over {0,1}^n if for some polynomial P, for all target concepts c ∈ C and all orderings of the examples, the algorithm makes at most P(n, |c|) mistakes, using polynomial time in each stage.
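The interaction just described can be pinned down by the following schematic Python harness (an illustration with hypothetical learner/adversary interfaces, not anything defined in the paper).

```python
def run_mistake_bound_protocol(learner, adversary, num_stages):
    """Schematic mistake-bound interaction (hypothetical interfaces).

    In each stage the adversary supplies an unlabeled example, the learner
    commits to a prediction, and only then is the true label revealed.  A
    successful learner keeps the returned mistake count polynomially bounded.
    """
    mistakes = 0
    for _ in range(num_stages):
        x = adversary.next_example()        # adversary chooses the order
        guess = learner.predict(x)          # learner must predict first
        label = adversary.label(x)          # true classification revealed afterwards
        if guess != label:
            mistakes += 1
        learner.update(x, label)
    return mistakes
```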
A membership query is a query in which the learner
selects an unlabeled example of its own choosing and
is then told that example's classification. We may incorporate membership queries into the mistake-bound
model by allowing the learner at each stage to choose
either to make a membership query or else to receive
an example from the adversary. We will say that Algorithm A learns in the mistake-bound model with membership queries a concept class C over {0,1}^n if for some polynomial P, for all target concepts c ∈ C, the algorithm makes at most P(n, |c|) mistakes and membership queries, using polynomial time in each stage.
We now review some useful notation and definitions from the cryptographic literature. If S is a set, we will use the notation s ∈_R S to mean that s is chosen uniformly at random from S. For a predicate f, let Pr[f(s_1, s_2, ...) : s_1 ∈_R S_1, s_2 ∈_R S_2, ...] be the probability that f(s_1, s_2, ...) = 1 given that each s_i is chosen randomly from set S_i. For convenience, if A is a probabilistic polynomial-time (PPT) algorithm† with {0,1} output, and g is some function, we will use P_d(A, g(s)) to mean Pr[A(g(s)) : s ∈_R {0,1}^d].

†In order to simplify the statements of some of the theorems, in this paper we will use the term "algorithm" to include circuit families (non-uniform algorithms).
Informally, a Cryptographically Strong pseudorandom Bit generator, or CSB generator to use the notation of GGM, is a deterministic polynomial-time algorithm that given a randomly chosen input produces a longer pseudorandom output. More formally,

Definition 1  For polynomial P, a deterministic polynomial-time program G is a CSB generator with stretch P if on input s ∈ {0,1}^k it produces a P(k)-bit output, and for all probabilistic polynomial-time algorithms A, for all polynomials Q, for sufficiently large k:

    |P_k(A, G(s)) − P_{P(k)}(A, s)| < 1/Q(k).

That is, no polynomial-time algorithm can distinguish with a 1/poly probability between a string chosen randomly from {0,1}^{P(k)} and the output of G on a string chosen randomly from {0,1}^k. For this paper, we will just need a CSB generator G with stretch 2k.
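For concreteness in the sketches that follow, here is a small Python stand-in for a generator G with stretch 2k, with byte strings playing the role of bit strings. It is not a CSB generator in the sense of Definition 1 (SHA-256 is only a heuristic stand-in and nothing here is proven pseudorandom); it merely fixes an interface.

```python
import hashlib

def G(seed: bytes) -> bytes:
    """Heuristic stand-in for a generator with stretch 2k: k bytes in,
    2k bytes out.  A real CSB generator would rest on an explicit hardness
    assumption; SHA-256 here is only for illustration."""
    k = len(seed)
    out = b""
    counter = 0
    while len(out) < 2 * k:
        out += hashlib.sha256(bytes([counter]) + seed).digest()
        counter += 1
    return out[:2 * k]

s = b"0123456789abcdef"   # a 16-byte "seed"
print(len(G(s)))          # 32: the output is twice as long as the seed
```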
For a CSB generator G that on input s ∈ {0,1}^k produces a 2k-bit output G(s) = b_1 ... b_{2k}, let us define the following notation.

• Let G_0(s) be the leftmost k bits b_1, ..., b_k of G(s), and let G_1(s) be the rightmost k bits b_{k+1}, ..., b_{2k}.

• Let G'_b(s) = G_1(s) if b = 0, and G'_b(s) = λ (the empty string) if b = 1.

• For i = i_1 ... i_d ∈ {0,1}^d, let G_{i_1...i_d}(s) = G_{i_d}(G_{i_{d-1}}(... G_{i_1}(s) ...)).

If x and y are strings, we will use x ∘ y to denote the concatenation xy. Also, if an example created by a concatenation of strings has length less than n, then we will implicitly pad with 0's, say on the right. Finally, let LSB[x] denote the rightmost bit of string x.
We now review the GGM random function construction. Let G be a CSB generator that on input s ∈ {0,1}^k produces a 2k-bit output. Goldreich, Goldwasser, and Micali define the function f_s : {0,1}^k → {0,1}^k on input i = i_1 ... i_k to be:

    f_s(i) = G_{i_1...i_k}(s) = G_{i_k}(G_{i_{k-1}}(... G_{i_1}(s) ...)).

The function f_s can be viewed as a full binary tree of depth k. The root is the seed s, and on input i, the output f_s(i) is the ith leaf in the tree if i is viewed as a binary number between 0 and 2^k. Though this fact will not be needed for our purposes, it is proved in [GGM86] that the collection F_k = {f_s}_{s ∈ {0,1}^k} is a poly-random collection of functions over {0,1}^k. Essentially, this means that no polynomial-time algorithm can distinguish between a function chosen randomly from F_k and a function chosen randomly from the class of all functions from {0,1}^k to {0,1}^k.
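The following Python sketch mirrors this tree construction, reusing the same illustrative stand-in for G (byte strings for bit strings; the most significant index bit is consumed first).

```python
import hashlib

def G(seed: bytes) -> bytes:
    """Illustrative length-doubling stand-in (not a proven CSB generator)."""
    k, out, ctr = len(seed), b"", 0
    while len(out) < 2 * k:
        out += hashlib.sha256(bytes([ctr]) + seed).digest()
        ctr += 1
    return out[:2 * k]

def G_half(seed: bytes, bit: int) -> bytes:
    """G_0 = left half of G(seed), G_1 = right half."""
    k = len(seed)
    return G(seed)[:k] if bit == 0 else G(seed)[k:]

def ggm(seed: bytes, index_bits):
    """f_s(i) = G_{i_k}(G_{i_{k-1}}(...G_{i_1}(s)...)): descend the full binary
    tree from the root `seed`, taking the left child on bit 0 and the right
    child on bit 1."""
    value = seed
    for bit in index_bits:
        value = G_half(value, bit)
    return value

s = b"0123456789abcdef"
print(ggm(s, [0, 1, 0, 1]).hex())   # the leaf G_{0101}(s) of a depth-4 tree
```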
We will use the basic form of the GGM construction
as a starting point to create a concept class that separates the distribution-free and mistake-bound learning
models.
3 A concept class to separate the learning models

We now define a concept class C over {0,1}^n that is learnable in the distribution-free model but not the mistake-bound model if one-way functions exist. Given a CSB generator G, we will fix k, the size of the seed s, to equal ⌊√n⌋.

Definition 2  Let C = {c_s}_{s ∈ {0,1}^k}, where the concepts c_s are defined as follows:

    c_s = { x_i^s : i ∈ {0,1}^k and LSB[G_{i_1...i_k}(s)] = 1 },

where for i = i_1 ... i_k,

    x_i^s = i ∘ G'_{i_1}(s) ∘ G'_{i_2}(G_{i_1}(s)) ∘ G'_{i_3}(G_{i_1 i_2}(s)) ∘ ... ∘ G'_{i_k}(G_{i_1...i_{k-1}}(s)).

So, for random s, concept c_s is a set of size 2^{Θ(√n)}, and the classification of example x_i^s is the least significant bit of the GGM function f_s(i). We can think of the labeled example (x_i^s, c_s(x_i^s)) as the pair (i, LSB[f_s(i)]) together with the following extra information: for each bit i_j equal to zero, we include the right half of G(G_{i_1...i_{j-1}}(s)). For example, if k = 4 and i = 0101, then we would have (see Figure 1):

    x_i^s = 0101 ∘ G_1(s) ∘ λ ∘ G_1(G_1(G_0(s))) ∘ λ = 0101 ∘ G_1(s) ∘ G_{011}(s),
    c_s(x_i^s) = LSB[G_{0101}(s)].

Figure 1: Darkened regions are the information given with (x_i^s, c_s(x_i^s)).

We will think of the index i as both a bit string and a binary number between 0 and 2^k. Notice that negative examples of c_s come in two forms: some are examples x_i^s for which LSB[f_s(i)] = 0, and some are just bit strings from {0,1}^n that are not of the form x_i^s for any i. We will call the examples x_i^s the "good examples" of c_s; so the positive examples of c_s are the good examples x_i^s such that LSB[f_s(i)] = 1. The examples x_i^s correspond to the strings u_i mentioned in the introduction.

For convenience, let us define the following additional notation.

• Let z_i^s be the (correctly) labeled example (x_i^s, c_s(x_i^s)).

• Let z̄_i^s be the incorrectly labeled example (x_i^s, 1 − c_s(x_i^s)).

• For i_1 ... i_d ∈ {0,1}^d, let G'_{i_1...i_d}(s) = G'_{i_1}(s) ∘ G'_{i_2}(G_{i_1}(s)) ∘ ... ∘ G'_{i_d}(G_{i_1...i_{d-1}}(s)).

So, example x_i^s is equal to i ∘ G'_{i_1...i_k}(s).
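To illustrate Definition 2, the sketch below builds a good example x_i^s and its label with the same SHA-256-based stand-in for G as before (byte strings for bit strings; the padding to length n is omitted). It is an illustration, not code from the paper.

```python
import hashlib

def G(seed: bytes) -> bytes:
    """Illustrative length-doubling stand-in for the CSB generator."""
    k, out, ctr = len(seed), b"", 0
    while len(out) < 2 * k:
        out += hashlib.sha256(bytes([ctr]) + seed).digest()
        ctr += 1
    return out[:2 * k]

def G_half(seed: bytes, bit: int) -> bytes:
    """G_0(seed) = left half, G_1(seed) = right half."""
    k = len(seed)
    return G(seed)[:k] if bit == 0 else G(seed)[k:]

def good_example(seed: bytes, index_bits):
    """Return (x_i^s, c_s(x_i^s)) for the good example of index i.

    x_i^s is the index followed by, for every 0-bit i_j, the right half
    G_1(G_{i_1...i_{j-1}}(s)) of the node above the step taken (for 1-bits
    nothing is appended); the label is the least significant bit of the
    leaf f_s(i) = G_{i_1...i_k}(s)."""
    pieces = [bytes(index_bits)]              # the index i itself
    node = seed                               # current node G_{i_1...i_{j-1}}(s)
    for bit in index_bits:
        if bit == 0:
            pieces.append(G_half(node, 1))    # include the right-hand sibling
        node = G_half(node, bit)              # descend to G_{i_1...i_j}(s)
    label = node[-1] & 1                      # LSB of the leaf f_s(i)
    return b"".join(pieces), label

s = b"0123456789abcdef"
x, label = good_example(s, [0, 1, 0, 1])      # the example with index i = 0101
print(len(x), label)
```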
We now demonstrate an algorithm that on input x_i^s and j > i computes the labeled example z_j^s in time O(nT_k), where T_k is the time to compute G on a k-bit input. The essential idea is that contained in example x_i^s are ancestors in the full binary tree of each piece of information contained in x_j^s.

Algorithm Compute-Forward

On input x_i^s and j > i:

• Say i = i_1 ... i_k and j = j_1 ... j_k. Let r be the least index such that i_r ≠ j_r. Since j > i, we have i_r = 0 and j_r = 1.

• Extract from x_i^s the portions:

    u = G'_{i_1}(s) ∘ G'_{i_2}(G_{i_1}(s)) ∘ ... ∘ G'_{i_{r-1}}(G_{i_1...i_{r-2}}(s)) = G'_{i_1...i_{r-1}}(s),

    v = G'_{i_r}(G_{i_1...i_{r-1}}(s)) = G_1(G_{i_1...i_{r-1}}(s)) = G_{j_r}(G_{j_1...j_{r-1}}(s)).

• Notice that G'_{j_r}(G_{j_1...j_{r-1}}(s)) = λ. Since v = G_{j_r}(G_{j_1...j_{r-1}}(s)), we can use v as an intermediate point in the computation of those parts of x_j^s that differ from x_i^s.

  If r = k, output: (j ∘ u ∘ λ, LSB[v]).

  Otherwise, output: (j ∘ u ∘ λ ∘ G'_{j_{r+1}...j_k}(v), LSB[G_{j_{r+1}...j_k}(v)]).

Algorithm Compute-Forward produces z_j^s as output for the following reason. By the definition of r, we know G_{i_1...i_{r-1}}(s) = G_{j_1...j_{r-1}}(s), so u ∘ λ = G'_{j_1...j_{r-1}}(s) ∘ G'_{j_r}(G_{j_1...j_{r-1}}(s)) = G'_{j_1...j_r}(s). Since v = G_{j_1...j_r}(s), we have:

    j ∘ u ∘ λ ∘ G'_{j_{r+1}...j_k}(v) = j ∘ G'_{j_1...j_r}(s) ∘ G'_{j_{r+1}...j_k}(G_{j_1...j_r}(s)) = j ∘ G'_{j_1...j_k}(s) = x_j^s,

and LSB[G_{j_{r+1}...j_k}(v)] = LSB[G_{j_1...j_k}(s)] = c_s(x_j^s). In addition, algorithm Compute-Forward uses at worst O(k^2) computations of the bit generator G.
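A Python sketch of Compute-Forward in the same illustrative setting follows. To avoid parsing a flat bit string, the good example is kept in a structured form: its index bits plus the list of included right halves, one per 0-bit of the index. The names and representation are mine, not the paper's.

```python
import hashlib

def G(seed: bytes) -> bytes:
    k, out, ctr = len(seed), b"", 0
    while len(out) < 2 * k:
        out += hashlib.sha256(bytes([ctr]) + seed).digest()
        ctr += 1
    return out[:2 * k]

def G_half(seed: bytes, bit: int) -> bytes:
    k = len(seed)
    return G(seed)[:k] if bit == 0 else G(seed)[k:]

def compute_forward(i_bits, i_pieces, j_bits):
    """Given a structured good example x_i^s (its index bits and the list of
    included right halves, one entry per 0-bit of i, in order) and an index
    j > i, return (pieces of x_j^s, label c_s(x_j^s)) without ever seeing s."""
    # r = least position where i and j differ; since j > i, i_r = 0 and j_r = 1.
    r = next(m for m in range(len(i_bits)) if i_bits[m] != j_bits[m])
    zeros_before_r = sum(1 for b in i_bits[:r] if b == 0)
    u = i_pieces[:zeros_before_r]        # pieces shared by x_i^s and x_j^s
    v = i_pieces[zeros_before_r]         # = G_1(G_{i_1...i_{r-1}}(s)) = G_{j_1...j_r}(s)
    # j_r = 1 contributes the empty string; continue down the tree from v.
    new_pieces, node = [], v
    for bit in j_bits[r + 1:]:
        if bit == 0:
            new_pieces.append(G_half(node, 1))
        node = G_half(node, bit)
    label = node[-1] & 1                 # LSB of the leaf G_{j_1...j_k}(s)
    return u + new_pieces, label

def good_example_pieces(seed, i_bits):
    """Build the list of included right halves for x_i^s (illustration only)."""
    pieces, node = [], seed
    for bit in i_bits:
        if bit == 0:
            pieces.append(G_half(node, 1))
        node = G_half(node, bit)
    return pieces, node[-1] & 1          # (pieces, true label of x_i^s)

s = b"0123456789abcdef"
i, j = [0, 1, 0, 1], [1, 0, 0, 1]        # j > i as binary numbers
pieces_i, _ = good_example_pieces(s, i)
pieces_j, label_j = good_example_pieces(s, j)
assert compute_forward(i, pieces_i, j) == (pieces_j, label_j)
```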
Theorem 1  Concept class C is learnable in the distribution-free model.

Proof: Algorithm Learn-C works as follows. Collect m ≥ (1/ε) ln(1/δ) labeled examples. If all examples are negative, hypothesize the empty concept. Otherwise, we know each positive example is a "good example", so let i be the index (first k bits) of the positive example x_i^s of least index seen so far. Produce the hypothesis:

    "On example z, let j be the first k bits of z. If j < i, predict 0. If j ≥ i, use algorithm Compute-Forward (or the stored example z_i^s itself when j = i) to produce the labeled example (x_j^s, c_s(x_j^s)). If z = x_j^s then predict c_s(x_j^s); otherwise predict 0."

One simple way to see that Learn-C learns in the distribution-free model is just to notice that it is an "Occam algorithm" in the sense of Blumer et al. [BEHW87], since for any size sample, the hypothesis produced has size O(n) and is consistent with all the data seen. Alternatively, one can use a direct argument of the type used for learning a subinterval [0, a] of the interval [0,1]. Let l be the least index such that the set of examples S_l = {z_j^s : j ≤ l and c_s(x_j^s) = 1} has probability at least ε. Algorithm Learn-C fails to learn with error less than ε exactly when it sees no example from S_l. So, the probability Learn-C fails is at most (1 − ε)^m, which is less than δ for m ≥ (1/ε) ln(1/δ). ∎
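A sketch of Learn-C follows (illustration only; the interfaces, such as the compute_forward callable, are assumptions rather than the paper's code). The learner anchors on the positive example of least index and predicts on larger indices via Compute-Forward.

```python
def learn_C(sample, compute_forward):
    """Sketch of Learn-C (an illustration; the interfaces are assumptions).

    `sample` is a list of labeled draws (j, x, label), where j is the integer
    value of the first k bits of the drawn string x.  `compute_forward(x_i, i, j)`
    returns the labeled good example (x_j, label_j) for any j > i, as in
    Algorithm Compute-Forward.  Returns a hypothesis h(j, z) -> 0/1 taking a
    string z together with the integer value j of its first k bits.
    """
    positives = [(j, x) for (j, x, label) in sample if label == 1]
    if not positives:
        return lambda j, z: 0                    # all negative: the empty concept
    i, x_i = min(positives)                      # positive good example of least index
    def hypothesis(j, z):
        if j < i:
            return 0                             # below the anchor: predict negative
        if j == i:
            return 1 if z == x_i else 0          # the anchor itself was seen positive
        x_j, label_j = compute_forward(x_i, i, j)
        return label_j if z == x_j else 0        # a good example: use its computed label
    return hypothesis
```

The sample-size bound m ≥ (1/ε) ln(1/δ) then plays the role described in the proof above.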
Theorem 2  Concept class C cannot be learned in the mistake-bound model if one-way functions exist.

In order to prove theorem 2, we show that the sequence of examples z_i^s is difficult to compute in the reverse direction, as described in the following lemma and corollary.

Lemma 1  For any PPT algorithm A and polynomial Q, for sufficiently large k, for all i ∈ {0,1}^k,

    |P_k(A, z_i^s) − P_k(A, z̄_i^s)| < 1/Q(k).

Recall: P_k(A, z_i^s) = Pr[A(z_i^s) : s ∈_R {0,1}^k] and P_k(A, z̄_i^s) = Pr[A(z̄_i^s) : s ∈_R {0,1}^k].

That is, for any algorithm A and index i, for random seeds s, algorithm A cannot distinguish between a correctly labeled z_i^s and an incorrectly labeled one. Using algorithm Compute-Forward, on input z_i^s one can easily compute any other labeled example z_j^s for j > i. So, lemma 1 could equivalently be written as:

Corollary 1  For any PPT algorithm A and polynomial Q, for sufficiently large k, for all i = i_1 ... i_k,

    |P_k(A^O, z_i^s) − P_k(A^O, z̄_i^s)| < 1/Q(k),

where O is an oracle that on any input j > i outputs the labeled example z_j^s.

Lemma 1 implies, as shown below, that any learning algorithm will be fooled in the mistake-bound model by an adversary that presents the examples z_i^s in reverse order. In fact, lemma 1 is stronger than we need, because it implies that any algorithm will fail not just on some c_s ∈ C but in fact on almost all of the concepts c_s.

Proof of theorem 2 (assuming lemma 1): Suppose there exists a learning algorithm A and polynomial P such that for sufficiently large n, for all c_s ∈ C over {0,1}^n, algorithm A makes at most P(n) mistakes. Consider an adversary that presents the good examples in reverse order: z_{2^k}^s, z_{2^k − 1}^s, .... Algorithm A makes a mistake on at most one fourth of the first 4P(n) examples presented. So there exists some index t (2^k − 4P(n) ≤ t ≤ 2^k) such that, over all c_s ∈ C, algorithm A makes an error on at most one fourth of the examples z_t^s. Thus, algorithm A contradicts corollary 1, since on input (x_t^s, b), we can use algorithm Compute-Forward to simulate the adversary presenting examples to A and then output 1 if A's prediction of c_s(x_t^s) equals b. ∎
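The reduction used in this proof can be sketched as follows (hypothetical interfaces, not the paper's code): the adversary's reverse-order presentation is replayed to the learner entirely from x_t via Compute-Forward, and the output is 1 exactly when the learner's prediction on x_t agrees with the proposed label b.

```python
def distinguisher(learner, t, x_t, b, compute_forward, num_warmup):
    """Turn a low-mistake on-line learner into a predictor for the label of x_t.

    `compute_forward(x_t, t, j)` returns the labeled example (x_j, label_j)
    for any j > t (Algorithm Compute-Forward); `learner.predict` and
    `learner.update` follow the mistake-bound protocol.  Outputs 1 iff the
    learner's prediction on x_t equals the proposed label b.
    """
    # Simulate the adversary: present indices t + num_warmup, ..., t + 1 in
    # decreasing order, labeling each one via Compute-Forward from x_t alone.
    for j in range(t + num_warmup, t, -1):
        x_j, label_j = compute_forward(x_t, t, j)
        learner.predict(x_j)          # prediction ignored during the warm-up
        learner.update(x_j, label_j)  # reveal the true label, as the protocol does
    return 1 if learner.predict(x_t) == b else 0
```

If the learner errs on x_t for only a small fraction of the concepts c_s, this procedure distinguishes correctly labeled from incorrectly labeled inputs, which is what Corollary 1 forbids.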
We now prove lemma 1, showing that given only example x_i^s it is difficult to calculate the classification c_s(x_i^s). The basic idea of the proof is a version of a standard cryptographic technique described by Yao [Yao82]. Roughly, if an algorithm can distinguish strings z_i^s from strings z̄_i^s, then we will slowly substitute bits in those strings with random bits and watch the performance of the algorithm degrade. At some point before the strings have become completely random the performance must degrade significantly, and we will focus on that location to break the generator.

Proof of Lemma 1: Suppose to the contrary there exists some PPT algorithm A and polynomial Q such that for infinitely many k, for some t(k) ∈ {0,1}^k, we have |P_k(A, z_{t(k)}^s) − P_k(A, z̄_{t(k)}^s)| ≥ 1/Q(k). We will use A to create an algorithm B that breaks generator G. Namely, for infinitely many k, we will have

    |P_k(B, G(s)) − P_{2k}(B, s)| ≥ 1/(4kQ(k)).‡

Let S'_0 = {0,1}^k and S'_1 = {λ} (to correspond to the notation for G'_0 and G'_1), and let t = t(k) = t_1 ... t_k. Define p_{1,k} = P_k(A, z_t^s) and p̄_{1,k} = P_k(A, z̄_t^s), and for 1 < d ≤ k, define:

    p_{d,k} = Pr[A(t ∘ r_1 ∘ ... ∘ r_{d-1} ∘ G'_{t_d...t_k}(s), b) : r_1 ∈_R S'_{t_1}, ..., r_{d-1} ∈_R S'_{t_{d-1}}, s ∈_R {0,1}^k],

    p̄_{d,k} = Pr[A(t ∘ r_1 ∘ ... ∘ r_{d-1} ∘ G'_{t_d...t_k}(s), b̄) : r_1 ∈_R S'_{t_1}, ..., r_{d-1} ∈_R S'_{t_{d-1}}, s ∈_R {0,1}^k],

where b = LSB[G_{t_d...t_k}(s)] and b̄ = 1 − LSB[G_{t_d...t_k}(s)]. So, for d > 1, p_{d,k} is the probability that A outputs 1 on input z_t^s where the first d − 1 "pieces" of x_t^s have been replaced by random strings of the appropriate length and the application of generator G begins at depth d in the full binary tree.

We have by assumption that |p_{1,k} − p̄_{1,k}| ≥ 1/Q(k) for infinitely many k. There are now two cases. First, suppose that |p_{k,k} − p̄_{k,k}| ≥ 1/(2Q(k)) for infinitely many k. We can then create algorithm B to break generator G as follows:

Algorithm B

1. On input y ∈ {0,1}^{2k}, let y_0 be the left half of y and y_1 be the right half of y. Let y'_0 = y_1 and y'_1 = λ.

2. For i ∈ {1, ..., k − 1} choose r_i ∈_R S'_{t_i}.

3. Flip a coin:

   If heads, output: A(t ∘ r_1 ∘ ... ∘ r_{k-1} ∘ y'_{t_k}, LSB[y_{t_k}]).

   If tails, output: 1 − A(t ∘ r_1 ∘ ... ∘ r_{k-1} ∘ y'_{t_k}, 1 − LSB[y_{t_k}]).

If the input y to B equals G(s) for random s, then the probability B outputs 1 is ½ p_{k,k} + ½(1 − p̄_{k,k}) = ½ + ½(p_{k,k} − p̄_{k,k}). If the input y to B is randomly chosen from {0,1}^{2k}, then the probability B outputs 1 is just ½ (since with probability one half, B outputs A(t ∘ r, b), and with probability one half, B outputs 1 − A(t ∘ r, b), for random r and b). So, for infinitely many k we have

    |P_k(B, G(s)) − P_{2k}(B, s)| = ½|p_{k,k} − p̄_{k,k}| ≥ 1/(4Q(k)),

a contradiction.

Otherwise, the second case is that for infinitely many k we have |p_{1,k} − p̄_{1,k}| ≥ 1/Q(k) and |p_{k,k} − p̄_{k,k}| < 1/(2Q(k)). This implies that either:

• for infinitely many k there exists 1 ≤ d ≤ k − 1 such that |p_{d,k} − p_{d+1,k}| > 1/(4kQ(k)), or

• for infinitely many k there exists 1 ≤ d ≤ k − 1 such that |p̄_{d,k} − p̄_{d+1,k}| > 1/(4kQ(k)).

Without loss of generality, let us assume that the first situation occurs. The algorithm B' to break generator G is then as follows.

Algorithm B'

1. On input y ∈ {0,1}^{2k}, let y_0 be the left half of y and y_1 be the right half of y. Let y'_0 = y_1 and y'_1 = λ.

2. For i ∈ {1, ..., d − 1} choose r_i ∈_R S'_{t_i}.

3. Output: A(t ∘ r_1 ∘ r_2 ∘ ... ∘ r_{d-1} ∘ y'_{t_d} ∘ G'_{t_{d+1}...t_k}(y_{t_d}), LSB[G_{t_{d+1}...t_k}(y_{t_d})]).

If the input y to B' equals G(s) for random s, then y'_{t_d} ∘ G'_{t_{d+1}...t_k}(y_{t_d}) = G'_{t_d...t_k}(s) and G_{t_{d+1}...t_k}(y_{t_d}) = G_{t_d...t_k}(s), so the probability B' outputs 1 is p_{d,k}. On the other hand, if y ∈_R {0,1}^{2k}, then y_{t_d} and y'_{t_d} are completely independently chosen strings, so the probability B' outputs 1 is p_{d+1,k}. By the choice of d, algorithm B' breaks the generator G on infinitely many k, a contradiction. ∎

‡Algorithm B will be non-uniform, in part because we have no guarantee that the index t(k) is easy to compute given k (to contradict lemma 1 just requires there exist some t(k)). One could modify (and make less "clean") the statement of lemma 1 so that this problem no longer occurs and still separate the learning models, since our adversary chooses the examples in an easy-to-compute ordering.
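For intuition, the hybrid inputs defined in the proof can be sampled as in the sketch below (illustration only, with the same heuristic stand-in for G): the first d − 1 pieces are replaced by fresh random strings of the appropriate length, and the remaining pieces, together with the label, are generated honestly from a fresh random node placed at depth d of the tree.

```python
import hashlib, os

def G(seed: bytes) -> bytes:
    k, out, ctr = len(seed), b"", 0
    while len(out) < 2 * k:
        out += hashlib.sha256(bytes([ctr]) + seed).digest()
        ctr += 1
    return out[:2 * k]

def G_half(seed: bytes, bit: int) -> bytes:
    k = len(seed)
    return G(seed)[:k] if bit == 0 else G(seed)[k:]

def hybrid_example(t_bits, d, k_bytes=16):
    """Sample the input/label pair whose acceptance probability is p_{d,k}:
    pieces 1..d-1 of x_t^s are random (a random k-byte string for t_j = 0,
    the empty string for t_j = 1), and pieces d..k plus the label come from
    applying G starting at a fresh random node at depth d."""
    pieces = [bytes(t_bits)]
    for bit in t_bits[:d - 1]:                  # randomized prefix
        pieces.append(os.urandom(k_bytes) if bit == 0 else b"")
    node = os.urandom(k_bytes)                  # plays the role of the depth-d seed
    for bit in t_bits[d - 1:]:                  # honest suffix
        if bit == 0:
            pieces.append(G_half(node, 1))
        node = G_half(node, bit)
    label = node[-1] & 1                        # correct label for this hybrid
    return b"".join(pieces), label

x, b = hybrid_example([0, 1, 0, 1], d=3)        # hybrid at depth d = 3
print(len(x), b)
```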
So, we now have shown that concept class C is learnable in the distribution-free model but not in the absolute mistake-bound model if one-way functions exist. In fact, algorithm A can do no better even if it is allowed to make membership queries. The reason is that if at any stage algorithm A has a probability greater than some inverse polynomial of constructing even an unlabeled good example x_j^s (j < 2^k) that it has not yet seen, the argument in the proof of lemma 1 can be used to show A can break the generator G. So, with a polynomial number of tries, with high probability, algorithm A will never make a query whose classification is 1. Thus, we have the following theorem.

Theorem 3  Concept class C cannot be learned in the mistake-bound model with membership queries if one-way functions exist.
Proof sketch: If there exists a learning algorithm that learns C in the mistake-bound model with membership queries, then it must at some point make a query of a good example that it has not yet seen. So, we can use the learning algorithm to create an algorithm A and polynomial Q such that for some t = t_1 ... t_k, the probability that A outputs G_{t_1...t_k}(s) on input z_t^s is at least 1/Q(k) (for sufficiently large k). The reason is that G_{t_1...t_k}(s) is computable from any good example z_i^s for i < t, so we may just guess which query of the learning algorithm is of a good example.

Analogous to the proof of lemma 1, let q_{d,k} be the probability that A outputs G_{t_d...t_k}(s) on input t ∘ r_1 ∘ ... ∘ r_{d-1} ∘ G'_{t_d...t_k}(s). We are given that q_{1,k} > 1/Q(k), but we know that q_{k,k} must be small: q_{k,k} is the probability that A outputs G_{t_k}(s) on input t ∘ r ∘ G'_{t_k}(s). If A does so with probability greater than 1/(2Q(k)), then it can be used to directly break generator G, since essentially it is able with a non-negligible probability to produce one half of G(s) from seeing the other half.

Thus, there must exist some value d such that |q_{d,k} − q_{d+1,k}| > 1/(2kQ(k)). We can now break G with a variant of algorithm B' of the proof of lemma 1. On input y ∈ {0,1}^{2k} the algorithm outputs 1 if A(t ∘ r_1 ∘ ... ∘ r_{d-1} ∘ y'_{t_d} ∘ G'_{t_{d+1}...t_k}(y_{t_d})) = G_{t_{d+1}...t_k}(y_{t_d}). So, if y = G(s) for random s, it outputs 1 with probability q_{d,k}, and if y ∈_R {0,1}^{2k}, then it outputs 1 with probability q_{d+1,k}, and thus breaks G. ∎
4 Conclusion

We have shown a concept class C over {0,1}^n that is learnable in the distribution-free model but not in the mistake-bound model if cryptographically secure bit generators (or equivalently one-way functions) exist. In fact, the assumption of one-way functions gives us something stronger. Not only is the concept class C hard to learn in the mistake-bound model in the sense that for any learning algorithm there is some c ∈ C hard to learn, but in fact we have that for any learning algorithm, nearly all c ∈ C are hard to learn. The fraction of concepts which an algorithm can learn is less than any polynomial fraction 1/Q(n) (for sufficiently large n). This fact leads one to ask the question: is it possible to separate the two models using a weaker assumption? Some assumption is apparently necessary, since if computational considerations are ignored, any concept class learnable in one model is learnable in the other. We conjecture that the learning models can be separated using just the assumption P ≠ NP.

The concept class C remains difficult to learn in the mistake-bound model even if the learner is allowed membership queries. This lies in contrast to the class of DFAs, which are learnable in the mistake-bound (equivalence query) model with membership queries [Ang87] but not in the distribution-free model [KV89]. Thus, neither model is strictly easier than the other.

I would like to thank Shafi Goldwasser, Silvio Micali, Ron Rivest, and Phil Rogaway for helpful discussions, and Rob Schapire for suggesting the simple "Occamness" argument for the proof of theorem 1.

References

[Ang87]    Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87-106, November 1987.

[Ang88]    D. Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, 1988.

[BBS86]    L. Blum, M. Blum, and M. Shub. A simple unpredictable pseudo-random number generator. SIAM J. Computing, 15(2):364-383, May 1986.

[BEHW87]   Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, April 1987.
[BM84]
M. Blum and S. Micali. How to generate cryptographically strong sequences of
pseudo-random bits. SIAM J. Computing, 13(4):850-863, November 1984.
[GGM86]
Oded Goldreich, Shafi Goldwasser, and
Silvio Micali. How to construct random functions. Journal of the A C M ,
33(4):792-807, October 1986.
[Has901
J . Hastad. Pseudo-random generators under uniform assumptions. In Proceedings
of the Twenty-Second Annual A C M Symposium on Theory of Computing, Baltimore, Maryland, May 1990.
[HKLW88] David Haussler, Michael Kearns, Nick
Littlestone, and Manfred K. Warmuth.
Equivalence of models for polynomial
learnability. In Proceedings of the 1988
Workshop on Computational Learning
Theory, pages 42-55, Cambridge, MA,
August 1988. Morgan-Kaufmann.
[HLW88]    David Haussler, Nick Littlestone, and Manfred K. Warmuth. Predicting {0,1}-functions on randomly drawn points. In Proceedings of the Twenty-Ninth Annual Symposium on Foundations of Computer Science, pages 100-109, White Plains, NY, 1988.

[ILL89]    R. Impagliazzo, L. A. Levin, and M. Luby. Pseudo-random generation from one-way functions. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 12-24, Seattle, May 1989.

[Kea89]    Michael Kearns. The Computational Complexity of Machine Learning. PhD thesis, Harvard University Center for Research in Computing Technology, May 1989. Technical Report TR-13-89.

[KLPV87a]  Michael Kearns, Ming Li, Leonard Pitt, and Leslie Valiant. On the learnability of boolean formulae. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 285-295, New York, New York, May 1987.

[KLPV87b]  Michael Kearns, Ming Li, Leonard Pitt, and Leslie Valiant. Recent results on boolean concept learning. In Proceedings of the Fourth International Workshop on Machine Learning, pages 337-352, University of California, Irvine, June 1987.

[KV89]     Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 433-444, Seattle, Washington, May 1989.

[Lev85]    L. A. Levin. One-way functions and pseudorandom generators. In Proceedings of the 17th ACM Symposium on Theory of Computing, pages 363-365, Providence, 1985. ACM.

[Lit88]    Nick Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

[MT89]     Wolfgang Maass and György Turán. On the complexity of learning from counterexamples. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, pages 262-267, October 1989.

[Sch89]    Robert E. Schapire. The strength of weak learnability. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, October 1989.

[Val84]    Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, November 1984.

[Yao82]    A. C. Yao. Theory and application of trapdoor functions. In Proceedings of the 23rd IEEE Symposium on Foundations of Computer Science, pages 80-91, Chicago, 1982. IEEE.