A THEORY OF THE LEARNABLE
L.G. Valiant
Aiken Computation Laboratory
Harvard University, Cambridge, Massachusetts
ABSTRACT.
Humans appear to be able to learn new
concepts without needing to be programmed explicitly
in any conventional sense.
In this paper we regard
learning as the phenomenon of knowledge acquisition
in the absence of explicit programming.
We give a
precise methodology for studying this phenomenon
from a computational viewpoint.
It consists of
choosing an appropriate information gathering
mechanism, the learning protocol, and exploring the
class of concepts that can be learnt using it in a
reasonable (polynomial) number of steps.
We find
that inherent algorithmic complexity appears to set
serious limits to the range of concepts that can be
so learnt.
The methodology and results suggest
concrete principles for designing realistic learning
systems.
1.
INTRODUCTION
This paper is concerned with precise computational models of the learning phenomenon. We shall restrict ourselves to skills that consist of recognizing whether a concept (or predicate) is true or not for given data. We shall say that a concept Q has been learnt if a program for recognizing it has been deduced (i.e. by some method other than the acquisition from the outside of the explicit program).
Computability theory became possible once precise models became available for modelling the commonplace phenomenon of mechanical calculation.
The theory that evolved has been used to explain
human experience and to suggest how artificial
computing devices should be built.
It is also
worth studying for its own sake.
The commonplace phenomenon of learning surely merits similar attention. The problem is to discover good models that are interesting to study for their own sake and promise to be relevant both to explaining human experience and to building devices that can learn. The models should also shed light on the limits of what can be learnt, just as computability does on what can be computed.

In this paper we shall say that a program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming. Among human skills some clearly appear to have a genetically preprogrammed element while some others consist of executing an explicit sequence of instructions that has been memorized. There remains a large area of skill acquisition where no such explicit programming is identifiable. It is this area that we describe here as learning. The recognition of familiar objects, such as tables, provides such examples. These skills often have the additional property that, although we have learnt them, we find it difficult to articulate what algorithm we are really using. In these cases it would be especially significant if machines could be made to acquire them by learning.
The main contribution of this paper is that it shows that it is possible to design learning machines that have all three of the following properties:
(A) The machines can provably learn whole
classes of concepts.
Furthermore these classes can
be characterized.
(B) The classes of concepts are appropriate
and nontrivial for general purpose knowledge, and
(C) The computational process by which the
machines deduce the desired programs requires a
feasible (i.e. polynomial) number of steps.
*This research was supported in part by National
Science Foundation Grant MCS-83-02385.
A learning machine consists of a learning protocol together with a deduction procedure.
The
former specifies the manner in which information is
obtained from the outside.
The latter is the
mechanism by which a correct recognition algorithm
for the concept to be learnt is deduced.
At the
broadest level the suggested methodology for
studying learning is the following:
define a
plausible learning protocol and investigate the
class of concepts for which recognition programs
can be deduced in polynomial time using the
protocol.
The specific protocols considered in this paper allow for two kinds of information supply. First the learner has access to a supply of typical data that positively exemplify the concept. More precisely, it is assumed that these positive examples have a probabilistic distribution determined arbitrarily by nature. A call of a routine EXAMPLES produces one such positive example. The relative probability of producing different examples is determined by the distribution. The second source of information that may be available is a routine ORACLE. In its most basic version, when presented with data it will tell the learner whether or not the data positively exemplifies the concept.
The deduction procedure will in each case output an expression that with high likelihood closely approximates the expression to be learnt. Such an approximate expression never says yes when it should not, but may say no on a small fraction of the probability space of positive examples. This fraction can be made arbitrarily small by increasing the runtime of the deduction procedure. Perhaps the main technical discovery contained in the paper is that with this probabilistic notion of learning highly convergent learning is possible for whole classes of Boolean functions. This appears to distinguish this approach from more traditional ones where learning is seen as a process of "inducing" some general rule from information that is insufficient for a reliable deduction to be made. The power of this probabilistic viewpoint is illustrated, for example, by the fact that an arbitrary set of polynomially many positive examples cannot be relied on to determine expressions consisting of even just a single monomial in any reliable way.
The major remaining design choice that has to be made is that of knowledge representation. Since our declared aim is to represent general knowledge, it seems almost unavoidable that we use some kind of logic rather than, for example, formal grammars or geometrical constructs. In this paper we shall represent concepts as Boolean functions of a set of propositional variables. The recognition algorithms that we attempt to deduce will therefore be Boolean circuits or expressions.
There is another aspect of our formulation that is worth emphasizing. In a learning system that is about to learn a new concept there may be an enormous number of propositional variables available. These may be primitive inputs, the values of preprogrammed concepts, or the values of concepts that have been learnt previously. We want the complexity of learning the new concept to be related only to the number of variables that may be set in natural examples of it, and not to the cardinality of the universe of available variables. Hence the questions asked of ORACLE and the values given by EXAMPLES will not be truth assignments to all the variables. In the archetypal case they will specify an assignment to a subset of the variables that is still sufficient to guarantee the truth of the function.
The adequacy of the propositional calculus for representing knowledge in practical learning systems is clearly a crucial question. Few would argue that this much power is not necessary. The question is whether it is enough. There are several arguments suggesting that such a system would, at least, be a very good start. First, when one examines the most famous examples of systems that embody preprogrammed knowledge, namely expert systems such as DENDRAL and MYCIN, essentially no logical notation beyond the propositional calculus is used. Surely it would be over-ambitious to try to base learning systems on representations that are more powerful than those that have been successfully managed in programmed systems. Second, the results in this paper can be negatively interpreted to suggest that the class of learnable concepts even within the propositional calculus is severely circumscribed. This suggests that the search for extensions to the propositional calculus that have substantially larger learnable classes may be a difficult one.
Whether the classes of learnable Boolean concepts can be extended significantly beyond the three classes given is an interesting question. There is circumstantial evidence from cryptography, however, that the whole class of functions computable by polynomial size circuits is not learnable. Consider a cryptographic scheme that encodes messages by evaluating the function E_k where k specifies the key. Suppose that this scheme is immune to chosen plaintext attack in the sense that even if the values of E_k are known for polynomially many different inputs, it is computationally infeasible to deduce an algorithm for E_k or for an approximation to it. This is equivalent to saying, however, that E_k is not learnable. The conjectured existence of good cryptographic functions that are easy to compute therefore implies that some easy to compute functions are not learnable.
The positive conclusions of this paper are that there are specific classes of concepts that are learnable in polynomial time using learning protocols of the kind described. These classes can all be characterized by defining the class of programs that recognize them. In each case the programs are special kinds of Boolean expressions. The three classes are: (i) conjunctive normal form expressions with a bounded number of literals in each clause, (ii) monotone disjunctive normal form expressions, and (iii) arbitrary expressions in which each variable occurs just once. In the first of these no calls of the oracle are necessary. In the last no access to typical examples is necessary but the oracles need to be more powerful than the one described above.
If the class of learnable concepts is as severely limited as suggested by our results then it would follow that the only way of teaching more complicated concepts is to build them up from such simple ones. Thus a good teacher would have to identify, name and sequence these intermediate concepts in the manner of a programmer. The results of learnability theory would then indicate the maximum granularity of the single concepts that can be acquired without programming.
In summary, this paper attempts to explore the limits of what is learnable as allowed by algorithmic complexity. The results are distinguishable from the diverse body of previous work on learning because they attempt to reconcile the three properties (A)-(C) mentioned earlier. Closest in rigour to our approach is the inductive inference literature (see Angluin and Smith [1] for a survey) that deals with inducing such things as recursive functions or formal grammars (but not Boolean functions) from examples. There is a large body of work on pattern recognition and classification, using statistical and other tools (e.g. [4]), but the question of general knowledge representation is not addressed there directly. Learning, in various less formal senses, has been studied widely as a branch of artificial intelligence. A survey and bibliography can be found in [2,6]. In their terminology the subject of this paper is concept learning.
2.
A LEARNING PROTOCOL FOR BOOLEAN FUNCTIONS

We consider t Boolean variables p1,...,pt, each of which can take the value 1 or 0 to indicate whether the associated proposition is true or false. There is no assumption about the independence of the variables. Indeed, they may be functions of each other.
A vector is an assignment to each of the t variables of a value from {0,1,*}. The symbol * denotes that a variable is undetermined. A vector is total if every variable is determined (i.e. is assigned a value from {0,1}). For example, the assignment p1 = p3 = 1, p4 = 0 and p2 = * is a vector that is not total.
A Boolean function F is a mapping from the set of 2^t total vectors to {0,1}. A Boolean function F becomes a concept F if its domain is extended to the set of all vectors as follows: For a vector v, F(v) = 1 if and only if F(w) = 1 for all total vectors w that agree with v on all variables for which v is determined. The purpose of this extension is that it permits us not to mention the variables on which F does not depend.
What are reasonable learning protocols to consider? First we must avoid giving the teacher too much power, namely the power to communicate a program instruction by instruction. For example, if a premeditated sequence of vectors with repetitions could be communicated then this could be used to encode the description of the program even if just two such vectors were used for binary notation. Secondly we must avoid giving the teacher what is evidently too little power. In particular, the protocol must provide some typical examples of vectors for which F is true, for otherwise, if F is true for just one vector which is total, only an exponential search or more powerful oracles would be able to find it.
Such considerations led us to consider the following learning protocol as a reasonable one. It gives the learner access to the following two routines:

(i) EXAMPLES: This has no input. It gives as output a vector v such that F(v) = 1. For each such v the probability that v is output on any single call is D(v).

(ii) ORACLE(): On vector v as input it outputs 1 or 0 according to whether F(v) = 1 or 0.
The first of these gives randomly chosen positive examples of the concept being learnt. The second provides a test for whether a vector which the deduction procedure considers to be critical is an example of the concept or not. In a real system the oracle may be a human expert, a database of past observations, some deduction system, or a combination of these.
Finally we observe that our particular choice of learning protocol was strongly influenced by the earlier decision to deal with concepts rather than raw Boolean functions. If we were to deal with the latter then many other alternatives are equally natural. For example, given a vector v, instead of asking, as we do, whether all completions of it make the function true, we could ask whether there exist any completions that make it true. This suggests natural alternative semantics for EXAMPLES and ORACLE. In Section 7 we shall use such more elaborate oracles.
3.
LEARNABILITY
We shall consider various classes of programs having p1,...,pt as inputs and show that in each case any program in the class can be deduced, with only small probability of error, in polynomial time using the protocol previously described. We assume that programs can take the values 1, 0 and undetermined.
Given a concept F we consider an arbitrary probability distribution D over the set of all vectors v such that F(v) = 1. In other words for each v such that F(v) = 1 it is the case that D(v) ≥ 0. Also ΣD(v) = 1 when summation is over this set of vectors. There are no other assumed restrictions on D, which is intended to describe the relative frequency with which the positive examples of F occur in nature.
More precisely we say that a class X of programs is learnable with respect to a given learning protocol if and only if there exists an algorithm A (the deduction procedure) invoking the protocol with the following properties:

(a) The algorithm runs in time polynomial both in an adjustable parameter h and in the various parameters that quantify the size of the program to be learnt, and
(b) For all programs f ∈ X and all distributions D over vectors v on which f outputs 1, the algorithm will deduce with probability at least (1 - h^{-1}) a program g ∈ X that never outputs one when it should not, but outputs one almost always when it should. In particular (i) for all vectors v, g(v) = 1 implies f(v) = 1, and (ii) the sum of D(v) over all v such that f(v) = 1 but g(v) ≠ 1 is at most h^{-1}.
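In symbols, conditions (i) and (ii) of part (b) can be transcribed as follows (our transcription; nothing beyond the prose above is intended):

```latex
% (b)(i): g makes no false positives;
% (b)(ii): the D-weight of positive examples of f missed by g is at most 1/h.
\forall v:\; g(v)=1 \;\Rightarrow\; f(v)=1,
\qquad
\sum_{v\,:\,f(v)=1,\ g(v)\neq 1} D(v) \;\le\; h^{-1}
```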
In our definition we have disallowed positive answers from g to be wrong, only because the particular classes X in this paper do not require it. This we call one-sided error learning. In other circumstances it may be efficacious to allow two-sided errors. In that more general case we would need to put a probability distribution D̄ on the set of vectors such that f(v) ≠ 1. Condition (i) in (b) is then replaced by: the sum of D̄(v) over all v such that f(v) ≠ 1 but g(v) = 1 is at most h^{-1}.
A second aspect of our definition is that the parameter h is used in two independent probabilistic bounds. This simplifies the statement of the results. It would be more precise, however, to work explicitly with two independent parameters h1 and h2.
Thirdly, we should note that programs that compute concepts should be distinguished from those that compute merely the Boolean function. This distinction has little consequence for the three classes of expressions that we consider since in these cases programs for the concepts follow directly from the specification of the expressions. For the sake of generality our definitions do allow the value of a program to be undefined since a nontotal vector will not, in general, determine the value of an expression as 0 or 1.
It would be interesting to obtain negative results other than the cryptographic evidence mentioned in the introduction, indicating that the class of unrestricted Boolean circuits is not learnable. In the contrary direction the reader can verify that the difficulty of this problem can be upper bounded in various ways. It can be shown that the assumption that P equals NP would imply that the class of Boolean circuits is learnable with two-sided error with respect to the protocol that provides random examples from both D and D̄.
4.
A COMBINATORIAL BOUND
The probabilistic analysis needed for our later results can be abstracted into a single lemma. The proof of the lemma is independent of the rest of the paper.

We define the function L(h,S) for any real number h greater than one and any positive integer S as follows: Let L(h,S) be the smallest integer such that in L(h,S) independent Bernoulli trials each with probability h^{-1} of success, the probability of having fewer than S successes is less than h^{-1}.

The following simple upper bound holds for the whole range of values of S and h and shows that L(h,S) is essentially linear both in h and in S.
PROPOSITION: For all integers S ≥ 1 and all real h > 1, L(h,S) < 2h(S + log_e h).
Proof. We use the following three well known inequalities, of which the last is due to Chernoff (see [5], p. 18):

(a) For all x > 0, (1 + x^{-1})^x < e.

(b) For all x > 0, (1 - x^{-1})^x < e^{-1}.

(c) In m independent trials each with probability p of success, the probability that there are at most k successes, where k < mp, is at most ((m-k)/(m-mp))^{m-k} (mp/k)^k.

The first factor in the expression in (c) above can be rewritten as (1 - (mp-k)/(m-k))^{((m-k)/(mp-k))(mp-k)}. Using (b) with x = (m-k)/(mp-k) we can upper bound the product in (c) by e^{-mp+k} (mp/k)^k. Substituting m = 2h(S + log_e h), p = h^{-1} and k = S gives the bound e^{-S-2log_e h} (2(1 + (log_e h)/S))^S. Rewriting this using (a) gives (1 + (log_e h)/S)^S < e^{log_e h} = h, and hence the whole expression is less than (2/e)^S · h^{-2} · h = (2/e)^S h^{-1} ≤ h^{-1}. []
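As a sanity check of the bound, the failure probability can be estimated empirically. The sketch below (ours; the function name and number of repetitions are arbitrary) simulates ceil(2h(S + log_e h)) Bernoulli trials of rate h^{-1} and, by the Proposition, should report a value below h^{-1}.

```python
import math
import random

def failure_rate(h, S, repetitions=20000):
    """Monte Carlo estimate of the probability of seeing fewer than S
    successes in m = ceil(2h(S + ln h)) Bernoulli trials of rate 1/h."""
    m = math.ceil(2 * h * (S + math.log(h)))
    failures = 0
    for _ in range(repetitions):
        successes = sum(random.random() < 1.0 / h for _ in range(m))
        if successes < S:
            failures += 1
    return failures / repetitions

# e.g. failure_rate(10, 5) should come out well under 0.1
```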
As an illustration of an application suppose that we have access to EXAMPLES, a source of natural positive examples of vectors for concept F. Suppose that we want to find an approximation to the subset P of variables that are determined in at least one natural example of F. Consider the following procedure. Pick the first L(h,|P|) vectors provided by EXAMPLES and let P' be the union of the sets of variables determined by these vectors. The proposition then implies that with probability at least 1 - h^{-1} the set P' will have the following property: if a vector is chosen with probability distribution D then the probability that it determines a variable in P - P' is less than h^{-1}. To see this observe that if P' does not have the claimed property then the following has happened: L(h,|P|) trials have been made (i.e. calls of EXAMPLES) each with probability greater than h^{-1} of success (i.e. finding a vector that determines a variable in P - P') but there have been fewer than |P| successes (i.e., discoveries of new additions to P'). The probability of such a happening is less than h^{-1} by the definition of L.

The above application shows that the set of variables that are determined in natural examples of F can be approximated by a procedure whose runtime is independent of t, the total number of variables.
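A minimal sketch of this procedure, assuming an EXAMPLES routine that returns partial assignments as dictionaries and a known upper bound on |P| (both assumptions of ours, as is the use of the Proposition's bound in place of the exact L(h,|P|)), might read:

```python
import math

def approximate_determined_set(EXAMPLES, size_bound_P, h):
    """Draw roughly L(h,|P|) positive examples and take the union of the
    variables they determine; size_bound_P is any known upper bound on |P|."""
    L = math.ceil(2 * h * (size_bound_P + math.log(h)))
    P_prime = set()
    for _ in range(L):
        v = EXAMPLES()            # a partial assignment {index: 0/1}
        P_prime.update(v.keys())  # variables determined in this example
    return P_prime
```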
5.
BOUNDED CNF EXPRESSIONS
A conjunctive normal form (CNF) expression is any product c1 c2 ... cr of clauses where each clause ci is a sum q_1 + q_2 + ... + q_{j_i} of literals. A literal q is either a variable p or the negation p̄ of a variable. For example, (p1 + p2)(p1 + p̄2 + p3) is a CNF expression. For a positive integer k a k-CNF expression is a CNF expression where each clause is the sum of at most k literals.
For CNF expressions in general we do not know of any polynomial time deduction procedure with respect to our learning protocol. The question of whether one exists is tantalizing because of its apparent simplicity.

In this section we shall prove that for any fixed integer k the class of k-CNF expressions is learnable. In fact it is learnable easily in the sense that calls of EXAMPLES suffice and no calls of ORACLE are required.
In this and subsequent sections we shall use the relation ⇒ on concepts as follows: F ⇒ G means that for all vectors v whenever F(v) = 1 it is also the case that G(v) = 1. (N.B. This is equivalent to the relation F ⇒ G when F, G are regarded as functions.) For brevity we shall often denote both expressions and even vectors by the concepts they represent. Thus the concept represented by vector w is true for exactly those vectors that agree with w on all variables on which w is determined. The concept represented by expression f is obtained by considering the Boolean function computed by f and extending its domain to become a concept. In this section we regard a clause ci to be a special kind of expression. In Sections 7 and 8 we consider monomials m, simple products of literals, as special forms of expressions. In all cases statements of the form f ⇒ g, v ⇒ ci, v ⇒ m or m ⇒ f can be understood by interpreting the two sides as concepts or functions in the obvious way.
THEOREM A: For any positive integer k the class of k-CNF expressions is learnable via an algorithm A that uses L = L(h,(2t)^{k+1}) calls of EXAMPLES and no calls of ORACLE, where t is the number of variables.
Proof. The algorithm is initialized with formula g as the product of all possible clauses of up to k literals from {p1, p̄1, p2, p̄2, ..., pt, p̄t}. Clearly the number of ways of choosing a clause of exactly k literals is at most (2t)^k and hence the number of ways of choosing a clause with up to k literals is less than 2t + (2t)^2 + ... + (2t)^k < (2t)^{k+1}. This bounds the initial number of clauses in g.

The algorithm then calls EXAMPLES L times to give positive examples of the concept represented by k-CNF formula f. For each vector v so obtained it deletes all of the remaining clauses in g that do not contain a literal that is determined to be true in v. (I.e. in the clauses deleted each literal is either not determined or is negated in v.)
More precisely the following is repeated L times:

begin
  v := EXAMPLES;
  for each ci in g delete ci if v ⇏ ci
end
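A compact rendering of Algorithm A in code, under our own representation choices (clauses as frozensets of signed literals, examples as dictionaries; none of this is prescribed by the paper), is sketched below; num_calls would be set to L(h,(2t)^{k+1}).

```python
from itertools import combinations

def learn_k_cnf(EXAMPLES, t, k, num_calls):
    """Sketch of the deduction procedure of Theorem A. A literal is a pair
    (i, b): variable index i, sign b (b=1 for p_i, b=0 for its negation).
    Start from all clauses of up to k literals and delete every clause not
    satisfied by some literal determined true in a drawn positive example."""
    literals = [(i, b) for i in range(t) for b in (0, 1)]
    g = set()
    for size in range(1, k + 1):
        for combo in combinations(literals, size):
            # skip clauses that mention the same variable twice
            if len({i for i, _ in combo}) == size:
                g.add(frozenset(combo))
    for _ in range(num_calls):
        v = EXAMPLES()  # partial assignment {index: 0/1}
        # keep only clauses containing a literal determined true in v
        g = {c for c in g if any(v.get(i) == b for (i, b) in c)}
    return g
```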
The claim is that with high probability the value of g after the L-th execution of this block will be the desired approximation to the formula f that is being learnt.

Initially the set of vectors {v | v ⇒ g} is clearly empty. We first claim that as the algorithm proceeds the set {v | v ⇒ g} will always be a subset of {v | v ⇒ f}. To prove this it is clearly sufficient to prove the same statement when both sets are restricted to only total vectors. Let B be the product of all the clauses c containing up to k literals with the property that "∀v if v ⇒ f then v ⇒ c". It is clear that in the course of the algorithm no clause in B can ever be deleted. Hence it is certainly true that {v | v ⇒ g} ⊆ {v | v ⇒ B}. To establish the claim it remains only to prove that B computes the same Boolean function as f. (In fact B will be the maximal k-CNF formula equivalent to f.) It is easy to see that f ⇒ B since by the definition of B, for every c in B it is the case that "for all v if v ⇒ f then v ⇒ c". To verify the converse, that B ⇒ f, we first note that, by definition, f is a k-CNF formula. If some clause in f did not occur in B then this clause c' would have the property that "∃v such that v ⇒ f but v ⇏ c'". But this is impossible since if c' is a clause of f and if v ⇏ c' then v ⇏ f. We conclude that every k-CNF representation of the function that f represents consists of a set of clauses that is a subset of the clauses in B. Hence B ⇒ f.
Let X = ΣD(v) with summation over all (not necessarily total) vectors v such that v ⇒ f but v ⇏ g. This quantity is defined for each intermediate value of g in the course of the algorithm and, as is easily verified, it is monotone decreasing with time. Now clauses will be removed from g whenever EXAMPLES outputs a vector v such that v ⇏ g. The probability of this happening at any moment is exactly the current value of X. Also, the process of running the algorithm to completion can have one of just two possible outcomes: (i) At some point X becomes less than h^{-1}, in which case the g found will approximate to f exactly as required by the definition of learnability. (ii) The value of X never goes below h^{-1} and hence g is not an acceptable approximation. The probability of the latter eventuality is, however, at most h^{-1} since it corresponds to the situation of performing L(h,(2t)^{k+1}) Bernoulli experiments (i.e. calls of EXAMPLES) each with probability greater than h^{-1} of success (i.e. finding a v such that v ⇏ g) and obtaining fewer than (2t)^{k+1} successes (a success being manifested by the removal of at least one clause from g). []
In conclusion we observe that a CNF expression g immediately yields a program for computing the associated concept. Given a vector v we substitute the determined truth values in g. The concept will be true for v if and only if all the clauses are made true by the substitution.
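Continuing the representation used in the sketch of Algorithm A above (our choice, not the paper's), such a program is essentially one line:

```python
def cnf_concept(g, v):
    """Evaluate the concept represented by CNF g (a set of clauses, each a
    frozenset of (variable, sign) literals) on the partial assignment v:
    the concept is 1 iff every clause contains a literal made true by v."""
    return int(all(any(v.get(i) == b for (i, b) in c) for c in g))
```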
6.
DNF EXPRESSIONS
A disjunctive normal form (DNF) expression is any sum m1 + m2 + ... + mr of monomials where each monomial mi is a product of literals. For example p1p2 + p1p3p4 is a DNF expression. Such expressions appear particularly easy for humans to comprehend. Hence we expect that any practical learning system would have to allow for them.
An expression is monotone if no variable is negated in it. We shall show that for monotone DNF expressions there exists a simple deduction procedure that uses both EXAMPLES and ORACLE. For unrestricted DNF a similar result can be proved with respect to a different size measure. In the unrestricted case there is the additional difficulty that we can guarantee to deduce a program only for the function, and not for the concept. This difficulty does not arise in the monotone case where given an expression we can always compute the value of the associated concept for a vector v by making the substitution and asking whether the resulting expression is identically true.
A monomial m is a prime implicant of a function F (or of an expression representing the function) if m ⇒ F and if m' ⇏ F for any m' obtained by deleting one literal from m. A DNF expression is prime if it consists of the sum of prime implicants none of which is redundant in the sense of being implied by the sum of the others. There is always a unique prime DNF expression in the monotone case, but not in general. We therefore define the degree of a DNF expression to be the largest number of monomials that a prime DNF expression equivalent to it can have. The unique prime DNF expression in the monotone case is simply the sum of all the prime implicants.
THEOREM B: The class of monotone DNF expressions is learnable via an algorithm B that uses L = L(h,d) calls of EXAMPLES and dt calls of ORACLE, where d is the degree of the DNF expression f to be learnt and t the number of variables.
Proof. The algorithm is initialized with formula g identically equal to the constant zero. The algorithm then calls EXAMPLES L times to produce positive examples of f. Each time a vector v is produced such that v ⇏ g a new monomial m is added to g. The monomial m is the product of those literals determined in v that are essential to make v ⇒ f. More precisely, the loop that is executed L times is the following:
begin
  v := EXAMPLES;
  if v ⇏ g then
    begin
      for i := 1 to t do
        if pi is determined in v then
          begin
            set v̄ equal to v but with pi := *;
            if ORACLE(v̄) = 1 then v := v̄
          end;
      set m equal to the product of all literals q such that v ⇒ q;
      g := g + m
    end
end

The test v ⇏ g amounts to asking whether none of the monomials of g is made true by the values determined to be true in v. Every time EXAMPLES produces a value v such that v ⇏ g the inner loop of the algorithm will find a prime implicant m to add to g. Each such m is different from any previously added (since the contrary would have implied that v ⇒ g). It follows that such a v will be found at most d times and hence ORACLE will be called at most dt times.

Let X = ΣD(w) with summation over all (not necessarily total) vectors w such that w ⇒ f but w ⇏ g. This quantity is defined for each intermediate value of g in the course of the algorithm, is initially unity, and decreases monotonically with time. Now a monomial will be added to g each time EXAMPLES outputs a vector v such that v ⇏ g. The probability of this occurring at any call of EXAMPLES is exactly X. The process of running the algorithm to completion can have just two possible outcomes: (i) At some time X has become less than h^{-1}, in which case the final expression g found will approximate to f as required by the definition of learnability. (ii) The value of X never goes below h^{-1} and hence g is not an acceptable approximation. The probability of this second eventuality is, however, at most h^{-1} since it corresponds to the situation of performing L(h,d) Bernoulli experiments (i.e. calls of EXAMPLES) each with probability greater than h^{-1} of success (i.e. of finding a v such that v ⇒ f but v ⇏ g) and obtaining fewer than d successes (each manifested by the addition of a new monomial). []
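A sketch of Algorithm B under the same illustrative representation as before (monomials as frozensets of variable indices, vectors as dictionaries; all names are ours) follows; num_calls would be L(h,d).

```python
def learn_monotone_dnf(EXAMPLES, ORACLE, t, num_calls):
    """Sketch of Algorithm B for monotone DNF. Whenever a drawn example is
    not yet implied by g, greedily make determined variables undetermined
    while ORACLE still accepts, and add the resulting prime implicant."""
    def implies_g(v, g):
        # monotone case: v implies g iff some monomial of g is all true in v
        return any(all(v.get(i) == 1 for i in m) for m in g)

    g = set()
    for _ in range(num_calls):
        v = dict(EXAMPLES())          # partial assignment {index: 0/1}
        if not implies_g(v, g):
            for i in range(t):
                if i in v:
                    trial = dict(v)
                    del trial[i]      # make p_i undetermined
                    if ORACLE(trial) == 1:
                        v = trial
            # monomial of the positive literals still determined in v
            g.add(frozenset(i for i, val in v.items() if val == 1))
    return g
```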
For unrestricted DNF expressions several problems arise. The main source of difficulty is the fact that the problem of determining whether a nontotal vector implies the function specified by a DNF formula is NP-hard. This is simply because the problem of determining whether the nowhere determined vector implies the function is the tautology question of Cook [3]. An immediate consequence of this is that it is unreasonable to assume that the program being learnt computes concepts rather than functions. Another consequence is that in any algorithm akin to Algorithm B the test v ⇒ g may not be feasible if v is not total. Algorithm B is sufficient, however, to establish the following:

THEOREM B': Suppose the notion of learnability is restricted to distributions D such that D(v) = 0 whenever v is not total. Then the class of DNF expressions is learnable, in the sense that programs for the functions (but not necessarily the concepts) can be deduced in L(h,d) calls of EXAMPLES and dt calls of ORACLE where d is the degree of the DNF expression to be learnt.

The reader should note that here there is the additional philosophical difficulty that availability of the expression does not in itself imply a polynomial time algorithm for ORACLE. On the other hand, the theorem does say that if an agent has some, maybe ad hoc, black box for ORACLE whose workings are unknown to him, he can use it to teach someone else a DNF expression that approximates the function.
Finally, it may be worth emphasizing that the monotone case is nonproblematic and the deduction procedure for it very simple-minded. It may be interesting to pursue more sophisticated deduction procedures for it if contexts can be found in which they can be proved advantageous. The question as to whether monotone DNF expressions can be learnt from EXAMPLES alone is open. A positive answer would be especially significant.
7.
p-EXPRESSIONS

Deducing expressions on which there are less severe structural restrictions than DNF or CNF appears much more difficult. The aim of this section is to give a paradigmatic example of how far one can go in that direction if one is willing to pay the price of oracles that are more sophisticated than the one previously described.

We have already seen a class of expressions, namely the k-CNF expressions, that can be learnt from positive examples alone. Here we shall consider the other extreme, a class that can be learnt using oracles alone.

A general expression over variable set {p1,...,pt} is defined recursively as follows:

(i) For each i (1 ≤ i ≤ t) "pi" and "p̄i" are expressions.

(ii) If f1,...,fr are expressions then "(f1 + f2 + ... + fr)" is an expression (called a plus expression).

(iii) If f1,...,fr are expressions then "(f1 × f2 × ... × fr)" is an expression (called a times expression).

A p-expression is an expression in which each variable appears at most once. We can assume that a p-expression is monotone since we can always relabel negated variables with new names that denote their negation. In the recursive definition of any p-expression there are clearly at most 2t - 1 intermediate expressions of which at most t are of type (i) and at most t - 1 of types (ii) or (iii). Without loss of generality we shall assume that in the definition of a p-expression rules (ii) and (iii) alternate. We shall regard two p-expressions as identical if their definitions can be made formally the same by reordering sums and products and relabelling as necessary.

For learning p-expressions we shall employ more powerful oracles. The boundary between reasonable and unreasonable oracles does not appear sharp. We make no claims about the reasonableness of these new oracles except that they may serve as vehicles for understanding learnability.

The definitions refer to the Boolean function F (not regarded as a concept here). The oracle of Section 2 will be renamed N-ORACLE since it is one of necessity: N-ORACLE(v) = 1 if and only if for all total vectors w such that w ⇒ v it is the case that F(w) = 1. The dual of this would be a possibility oracle: P-ORACLE(v) = 1 if and only if there exists a total vector w such that w ⇒ v and F(w) = 1. For brevity we shall also define the prime implicant oracle: PI(v) = 1 if and only if N-ORACLE(v) = 1 but N-ORACLE(w) = 0 for any w obtained from v by making one variable determined in v undetermined.

Finally we define two further oracles, ones of relevant possibility RP and of relevant accompaniment RA. For convenience we define the first by how it behaves on vectors represented as monomials: RP(m) = 1 iff for some monomial m', mm' is a prime implicant of f. For sets V, W of variables we define RA(V,W) = 1 iff every prime implicant of f that contains a variable from V also contains a variable from W.

THEOREM C: The class of p-expressions is learnable via a deduction procedure C that uses O(t^3) calls of N-ORACLE, RP and RA altogether, where t is the number of variables, and no calls of EXAMPLES. The procedure always deduces exactly the correct expression.

Proof. Let λ be the monomial that is the product of no literals. The algorithm will first compute RP(pi) for each pi to determine which of the variables occur in the prime implicants of the function that the expression f to be learnt represents. Let g1,...,gr be distinct single variable expressions, one for each pi for which RP(pi) = 1. Note that these are exactly the variables that occur in f since f is monotone.
With each gi we associate two monomials mi and m̂i. The former will be the single variable pj that gi represents. The latter is defined as any monomial m̂ having no variable in common with mi such that mi m̂ is a prime implicant of f. The algorithm will construct each such m̂i as the final value of m in the following procedure:
Set m := λ; while PI(m mi) = 0 find a pk not in m mi such that RP(pk m mi) = 1 and set m := pk m. Hence in at most t^2 calls of PI (i.e. t^3 calls of N-ORACLE) and t^2 calls of RP values for every m̂i will be found.
Once initialized the algorithm proceeds by alternately executing a plus-phase and a times-phase. At the start of each phase we have a set of expressions g1,...,gr where each gi is associated with two monomials mi and m̂i (having no variables in common) where mi is a prime implicant of gi and mi m̂i is a prime implicant of f. We distinguish the g's as plus or times expressions according to whether the outermost construction rule was addition or multiplication. A plus-phase will first compute an equivalence relation S on the subset of {gi} that are times expressions. For each equivalence class G such that no gi ∈ G already occurs as a summand in a sum, we construct a new expression that is the sum of the members of G and call this sum gk where k is a previously unused index. If some members of G already occur in a sum, say gj, (N.B. they are never distributed in more than one sum) then we modify the sum expression gj to equal the sum of every expression in G. A times-phase is exactly analogous except that it computes a different equivalence relation T, now on plus expressions, and will form new, or extend old, times expressions. In the above context single variable expressions will be regarded as both times and plus expressions. Also, it is immaterial which of the two kinds of phase is used to start the algorithm.
First we shall verify that S and T are indeed equivalence relations under the assumption that Claim 1 holds at the beginning of each phase. Clearly S and T are defined to be both reflexive and symmetric. Also T is transitive, for suppose that gi T gj and gj T gk. The former implies that every prime implicant of f containing some variable from gi also contains some variable from gj. The latter implies that every prime implicant of f containing some variable from gj also contains a variable from gk. Hence gi T gk follows.
In order to verify the transitivity of S we shall make a more general observation about any sub-expression of f.

The intention of the algorithm is best expressed by the claim to follow which will be verified later. Suppose that the expressions occurring in the definition of f are f1,...,fq (where q ≤ 2t - 1). We shall say that g ≤ fi iff the set of prime implicants of g is a subset of the set of prime implicants of f̂i where (i) if fi is a plus expression then f̂i = fi and (ii) if fi is a times expression then f̂i is the product of some subset of the multiplicands of fi.
Claim 2: If mi, mj are prime implicants of distinct times subexpressions fi, fj of f and if for some m, with no variable in common with either mi or mj, both mi m and mj m are prime implicants of f, then fi, fj must occur as summands in the same plus expression of f.
Claim 1: After every phase of the algorithm for every gi that has been constructed there is exactly one expression fi such that (i) gi ≤ fi and (ii) fi is of the same kind (i.e. plus or times) as gi.
Proof. Most easily verified by representing expression f as a directed graph (e.g. [7]) with edges labelled by the Boolean variables. The sets of labels along directed paths between a distinguished source node and a distinguished sink node correspond exactly to prime implicants in the case of p-expressions where all edges have different labels. []
The procedure builds up the rooted tree of the expression as rooted subtrees starting from the leaves. One evident difficulty is that there is no a priori knowledge available about the shape of the tree. Hence in the grafting process a subtree may become attached to another at just about any node.
Now to verify that S is transitive suppose
that gi Sgj
and g~ Sg k where, by Claim i,
gi ~fi, gj ~fj
and 3gk~.f k for appropriate times
expressions
Whenever a plus or times expression gi is created or enlarged its associated mi and m̂i is updated as follows. If gi is a sum then we let mi = mj and m̂i = m̂j for any gj that is a summand in gi. If gi is a product then mi will be the product of all the mj's that correspond to multiplicands in gi. Finally m̂i will be generated as the final value of m in the following procedure: Set m := λ; while PI(m mi) = 0 find a pk not in m mi such that RP(pk m mi) = 1 and set m := pk m. Since such an m̂i has to be found at most t times in the overall algorithm the total cost is at most t^2 calls of PI (i.e., t^3 calls of N-ORACLE) and t^2 calls of RP.
m.m.33
where
and
fi'fj 'fk"
mkmj^
mi,m q ,mk
Then it follows that
are all prime implicants of
have no variable in common with
J
gi S gk
follows.
Claim 3:
Suppose that after some phase of the
algorithm Claim 1 holds and times expressions
have been formed where
placed in the same sum expression at the next plus
phase if and only if
fi =fi' f'3 =~'3
and
are addends in the same sum expression in
(~)
If
gi ~ f i =~"
.......
(ii)
fi,fj
f.
if and only if
Proof.
(i)
gi,g 5
gi ~<fi' g-3 ~<f" and f. ,f.
J
z 3
Then gi and gi will be
are times expressions.
gi S gj
f
~..
It follows from Claim 2 that fi,fj
and fk
3
are addends in the same sum expression.
Hence
In order to complete the description of the
algorithm it remains only to define the equivalence
relations
S and T.
Definition:
m.l~'3 '
PI(mi~ j) = PI(mjm i) = i, and
fi,fj
prime implicants of
mi,~.
contains disjoint sets of
J
^
variables, as do mj,m i.
and
gj ~f. =f.
1
3
are in the same sum then
f. ,f..
i
j
will be prime implicants of
Also
mi,m ~
m.~n
i
and
]
will be
3
and
m.~.
3 I
f, and the variables
For defining T we shall denote by V i the
set of variables that occur in the expression
gi"
m. will be disjoint from those in m. as will
x
3
be those in m.
from those in ~n.. Hence gi S g:]
3
z
will hold and gi,g q will be in the same plus
Definition:
gi Tgj
RA(Vj,V i) = i.
expression after the next plus phase.
if and only if
in
RA(Vi,V j) =
(~)
gi S gj
Suppose that
holds.
Then
mi,m j
gi ~fi' gj ~<f.] and
will be prime implicants
of
gi'g"3
of
f.
of V 4 may have to be updated for up to t such
sets and hence t 2 calls of RA will suffice.
Hence O(t 3) calls of RA in the overall algrorithm will be enough.
0
respectively, m.d.,m.d,
will both be
i 3 3 3
prime implicants of f, and the variables in ~.
3
will be disjoint from those of both m.
and m..
3
It follows from Claim 2 that f. =f.
and f. =f.
l
•
3
3
must occur as summands in the same plus expression
[N.B.
8.
Here we are using the fact that
In this paper we have considered learning as
the process of deducing a program for performing a
task, from information that does not provide an
explicit description of such a program.
We have
given precise meaning to this notion of learning
and have shown that in some restricted but nontrivial contexts it is computationally feasible.
Claim 2 remains true if we allow a "times subexpression"
to be the product of any subset of the
multiplicands
Claim 4:
in a times expression in
f.]
Suppose that after some phase of the
algorithm Claim 1 holds and plus expressions
gi ~ f i
and
gJ ~f'3
(fi,fj
plus expressions)
Consider a world containing robots and elephants.
Suppose that one of the robots has discovered a recognition algorithm for elephants that
can be meaningfully
expressed in k-conjunctive
normal form.
Our Theorem A implies that this
robot can communicate its algorithm to the rest of
the robot population by simply exclaiming "elephant" whenever one appears.
have
been found.
f.
3
in
(i) If gi = f. and g-3 = f" and fi'
i
3
are multiplicands in the same times expressions
f
then
gi,g 5
will be placed in the same
times expression after the next times phase.
(ii)
If
gi,g 5
are placed in the same times
expression at any subsequent phase then
in the same product in
fi,fj
An important aspect of our approach, if cast
in its greatest generality, is that we require the
recognition algorithms of the teacher and learner
to agree on an overwhelming fraction of only the
natural inputs.
Their behavior on unnatural inputs
is irrelevant and hence descriptions of all
possible worlds are not necessary.
If followed to
its conclusion this idea has considerable philosophical implications:
A learnable concept is
nothing more than a short program that distinguishes
some natural inputs from some others.
If such a
concept is passed on among a population in a
distributed manner substantial variations in
meaning may arise.
More importantly, what consensus
there is will only be meaningful for natural inputs.
The behavior of an individual's program for unnatural inputs has no relevance.
Hence thought
experiments and logical arguments involving unnatural hypothetical situations may be meaningless
activities.
are
f.
Proof.
(i) If the conditions of (i) hold then
gi T g j will be discovered at the next times phase
and the claim follows.
( i i ) If fi,fj
are not in
the same product in f then
f contains some
prime implicant containing variables from one of
gi,gj
and not from the other.
Hence gi Tgj
will
never hold.
o
Proof of Claim 1. By induction on the number of
phases on Claims i, 3 and 4 simultaneously.
D
We define the depth of a formula fi to be
the maximum number of alternations of sum and
product required in its definition.
The second important aspects of the formulation is that the notion of oracles makes it
possible to discuss a whole range of teacherlearner interactions beyond the mere identification
of examples.
This is significant in the context of
artificial intelligence where a human may be willing
to go to great lengths to convey his skills to a
machine while being unable to articulate the algorithms
he himself uses in the practice of the skills.
We
expect that some explicit programming does become
essential for transmitting skills that are beyond
certain limits of difficulty.
The identification
of these limits is a major goal of the line of
work proposed in this paper.
Claim 5: If fi is an expression of depth k in
f then after k phases a gi identical to fi
will have been constructed by the algorithm.
Proof.
By induction on Claims 3 and 4.
REMARKS
[]
To conclude the proof of the theorem it remains
to analyze the runtime, iThis is dominated by the
cost of computing m i and hi, for which we have
already accounted, plus the cost of computing the
equivalence relations
S and T. We first note
that fewer than 2t expression names
gi are
used in the course of the algorithm.
When an expression is grafted into another at a point deep in
the latter then the semantics of all the subexpressions above will change.
By Claims 3 and 4 such
grafting can occur when a gi is added to a sum
expression but not when added to a times expression.
Hence the values
m i and h i do not need to be
changed for any expression when such grafting
occurs.
It follows that for computing
S only 2t
values of mi,~ i need to be considered.
Hence
O(t 2)
calls of PI or O(t 3) calls of N-ORACLE
suffice overall.
For computing T grafting may
cause a ripple effect.
On each occasion the value
9.
REFERENCES
[1] D. Angluin and C.H. Smith. "A Survey of Inductive Inference: Theory and Methods." Yale University, Computer Science Department, Tech. Report 250, October 1982.

[2] A. Barr and E.A. Feigenbaum. The Handbook of Artificial Intelligence, Vol. II, Wm. Kaufmann, Los Altos, Calif. (1982).

[3] S.A. Cook. "The Complexity of Theorem Proving Procedures." Proc. of Third ACM Symposium on Theory of Computing (1971), 151-158.

[4] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York (1973).

[5] P. Erdos and J. Spencer. Probabilistic Methods in Combinatorics. Academic Press, New York (1974).

[6] R.S. Michalski, J.G. Carbonell and T.M. Mitchell. Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Co., Palo Alto, Calif. (1983).

[7] S. Skyum and L.G. Valiant. "A Complexity Theory Based on Boolean Algebra." Proc. of 22nd IEEE Symp. on Foundations of Computer Science (1981), 244-253.