A THEORY OF THE LEARNABLE L.G. V a l i a n t Aiken Computation Laboratory Harvard University, Cambridge, Massachusetts explicit programming. Among human skills some clearly appear to have a genetically preprogrammed element while some others consist of executing an explicit sequence of instructions that has been memorized. There remains a large area of skill acquisition where no such explicit programming is identifiable. It is this area that we describe here as learning. The recognition of familiar objects, such as tables, provide such examples. These skills often have the additional property that, although we have learnt them, we find it difficult to articulate what algorithm we are really using. In these cases it w o u l d be especially significant if machines could be made to acquire them by learning. ABSTRACT. Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this p a p e r we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learnt using it in a reasonable (polynomial) number of steps. We find that inherent algorithmic complexity appears to set serious limits to the range of concepts that can be so learnt. The methodology and results suggest concrete principles for designing realistic learning systems. |. This p a p e r is concerned with precise computational models of the learning phenomenon. We shall restrict ourselves to skills that consist of recognizing w h e t h e r a concept (or predicate) is true or not for given data. We shall say that a concept Q has been learnt if a p r o g r a m for recognizing it has been deduced (i.e. by some m e t h o d other than the acquisition from the outside of the explicit program). INTRODUCTION Computability theory became possible once precise models became available for modelling the commonplance phenomenon of mechanical calculation. The theory that evolved has been used to explain human experience and to suggest how artificial computing devices should be built. It is also worth studying for its own sake. The main contribution of this p a p e r is that it shows that it is possible to design learning machines that have all three of the following properties. The commonplace phenomenon of learning surely merits similar attention. The p r o b l e m is to discover good models that are interesting to study for their own sake and promise to be relevant both to explaining human experience and to building devices that can learn. The models should also shed light on the limits of what can be learnt, just as computability does on what can be computed. (A) The machines can provably learn whole classes of concepts. Furthermore these classes can be characterized. (B) The classes of concepts are appropriate and nontrivial for general purpose knowledge, and In this p a p e r we shall say that a p r o g r a m for performing a task has been acquired by learning if it has been acquired by any means other than (C) The computational process by which the machines deduce the desired programs requires a feasible (i.e. polynomial) number of steps. *This research was supported in part by National Science Foundation Grant MCS-83-02385. A learning machine consists of a learning protogether with a deduction procedure. The former specifies the manner in which information is obtained from the outside. The latter is the mechanism by which a correct recognition algorithm for the concept to be learnt is deduced. At the broadest level the suggested methodology for studying learning is the following: define a plausible learning protocol and investigate the class of concepts for which recognition programs can be deduced in polynomial time using the protocol. tocol Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1984 ACM 0 - 8 9 7 9 i - 1 3 3 - 4 / 8 4 / 0 0 4 / 0 4 3 6 $00.75 436 The s p e c i f i c p r o t o c o l s c o n s i d e r e d in this p a p e r a l l o w for two kinds of i n f o r m a t i o n supply. F i r s t the learner has access to a supply of typical data that p o s i t i v e l y e x e m p l i f y the concept. More p r e c i s e l y , it is a s s u m e d that these p o s i t i v e examples h a v e a p r o b a b i l i s t i c d i s t r i b u t i o n determ i n e d a r b i t r a r i l y by nature. A call of a routine EXAMPLES p r o d u c e s one such p o s i t i v e example. The relative p r o b a b i l i t y of p r o d u c i n g d i f f e r e n t e x a m ples is d e t e r m i n e d by the distribution. The s e c o n d source of i n f o r m a t i o n that may be available is a routine ORACLE. In its m o s t b a s i c version, w h e n p r e s e n t e d w i t h data it w i l l tell the learner w h e t h e r or not the data p o s i t i v e l y e x e m p l i f i e s the concept. The deduction p r o c e d u r e w i l l in each case outp u t an e x p r e s s i o n that w i t h h i g h l i k e l i h o o d closely approximates the e x p r e s s i o n to be learnt. Such an a p p r o x i m a t e e x p r e s s i o n n e v e r says yes w h e n it s h o u l d not, b u t may say no on a small fraction of the p r o b a b i l i t y space of p o s i t i v e examples. This f r a c t i o n can be made a r b i t r a r i l y s m a l l by increasing the runtime of the deduction procedure. Perhaps the m a i n t e c h n i c a l discovery c o n t a i n e d in the p a p e r is that w i t h this p r o b a b i l i s t i c notion of learning highly c o n v e r g e n t learning is p o s s i b l e for w h o l e classes of B o o l e a n functions. This appears to d i s t i n g u i s h this approach f r o m more t r a d i t i o n a l ones w h e r e learning is seen as a p r o c e s s of "inducing" some g e n e r a l rule from i n f o r m a t i o n that is i n s u f f i c i e n t for a reliable d e d u c t i o n to be made. The p o w e r of this p r o b a b i l i s t i c v i e w p o i n t is illustrated, for example, by the fact that an arbitrary set of p o l y n o m i a l l y many p o s i t i v e examples cannot be relied on to determine expressions c o n s i s t i n g of even just a single m o n o m i a l in any reliable way. The m a j o r remaining design choice that has to be made is that of k n o w l e d g e representation. Since our d e c l a r e d aim is to represent g e n e r a l knowledge, it seems almost u n a v o i d a b l e that w e use some k i n d of logic r a t h e r than, for example, formal grammars or g e o m e t r i c a l constructs. In this p a p e r we shall r e p r e s e n t concepts as B o o l e a n functions of a set of p r o p o s i t i o n a l variables. The r e c o g n i t i o n algorithms that w e attempt to deduce w i l l be t h e r e f o r e B o o l e a n circuits or expressions. There is another aspect of our f o r m u l a t i o n that is w o r t h emphasizing. In a learning s y s t e m that is about to learn a new concept there may be an e n o r m o u s n u m b e r of p r o p o s i t i o n a l variables available. T h e s e may be p r i m i t i v e inputs, the values of p r e p r o g r a m m e d concepts, or the values of concepts that have b e e n learnt previously. We w a n t the c o m p l e x i t y of learning the new concept to be related only to t h e n u m b e r of variables that may be set in n a t u r a l examples o f it, and not on the cardinality of the u n i v e r s e of available variables. Hence the q u e s t i o n s asked of ORACLE and the values given by EX~4PLES w i l l not be truth assignments to all the variables. In the a r c h e t y p a l case they w i l l specify an assignment to a subset of the v a r i a b l e s that is s t i l l s u f f i c i e n t to guarantee the t r u t h of the function. The adequacy of the p r o p o s i t i o n a l calculus for r e p r e s e n t i n g k n o w l e d g e in p r a c t i c a l l e a r n i n g systems is clearly a crucial question. Few w o u l d argue that this much p o w e r is not necessary. The q u e s t i o n is w h e t h e r it is enough. T h e r e are s e v e r a l arguments s u g g e s t i n g that such a s y s t e m would, at least, be a very g o o d start. First, w h e n one examines the m o s t famous examples of systems that embody p r e p r o g r a m m e d k n o w l e d g e , namely expert systems such as D E N D R A L and MYCIN, e s s e n t i a l l y n o logical n o t a t i o n b e y o n d the prop o s i t i o n a l calculus is used. Surely it w o u l d be o v e r - a m b i t i o u s to try to base learning systems on r e p r e s e n t a t i o n s that are more p o w e r f u l than those that h a v e b e e n s u c c e s s f u l l y m a n a g e d in p r o g r a m m e d systems. Second, the results in this p a p e r can be n e g a t i v e l y i n t e r p r e t e d to s u g g e s t that the class of learnable concepts even w i t h i n the p r o p o s i t i o n a l calculus is severely circumscribed. This suggests that the search for extensions to the p r o p o s i t i o n a l calculus that have s u b s t a n t i a l l y larger learnable classes may be a d i f f i c u l t one. W h e t h e r the classes of learnable B o o l e a n concepts can b e e x t e n d e d s i g n i f i c a n t l y b e y o n d the three classes given is an i n t e r e s t i n g question. There is c i r c u m s t a n t i a l evidence from cryptography: however, that the w h o l e class of functions comp u t a b l e by p o l y n o m i a l size circuits is not learnable. C o n s i d e r a c r y p t o g r a p h i c scheme that e n c o d e s m e s s a g e s by e v a l u a t i n g the function Ek w h e r e k s p e c i f i e s the key. Suppose that this scheme is immune to chosen p l a i n t e x t attack in the sense that even if the values of E k are k n o w n for p o l y n o m i a l l y many d/fferent inputs, it is computationally i n f e a s i b l e to deduce an a l g o r i t h m for E k or for an a p p r o x i m a t i o n to it. This is e q u i v a l e n t to saying, however, that E k is not learnable. The c o n j e c t u r e d e x i s t e n c e of g o o d c r y p t o g r a p h i c functions that are easy to compute therefore implies that some easy to compute functions are not learnable. The p o s i t i v e conclusions of this p a p e r are that there are s p e c i f i c classes of concepts that are learnable in p o l y n o m i a l time u s i n g learning p r o t o c o l s of the k i n d described. These c l a s s e s can all be c h a r a c t e r i z e d b y d e f i n i n g the class of p r o g r a m s that recognize them. In each case the p r o g r a m s are special kinds of B o o l e a n expressions. The three classes are: (i) conjunctive normal form expressions w i t h a b o u n d e d n u m b e r of literals in each clause, (ii) m o n o t o n e d i s j u n c t i v e normal form expressions, and (iii) arbitrary expressions in w h i c h each v a r i a b l e occurs just once. In the first of these no calls of the oracle are necessary. In the last n o access to t y p i c a l examples is n e c e s s a r y b u t the oracles n e e d to be more p o w e r f u l than the one d e s c r i b e d above. If the class of learnable concepts is as severely l i m i t e d as s u g g e s t e d by our results then it w o u l d follow that the only way of t e a c h i n g more c o m p l i c a t e d concepts is to b u i l d t h e m up from such simple ones. Thus a g o o d t e a c h e r w o u l d have to identify, name and sequence these i n t e r m e d i a t e concepts in the m a n n e r of a programmer. The results of ZeoiPnabiZity theory w o u l d then indicate the m a x i m u m g r a n u l a r i t y of the single concepts that can he a c q u i r e d w i t h o u t programming. 437 In summary, this p a p e r attempts to e x p l o r e the limits of w h a t is l e a r n a b l e as a l l o w e d b y algorithm i c complexity. The r e s u l t s are d i s t i n g u i s h a b l e from the diverse b o d y of p r e v i o u s w o r k on l e a r n i n g b e c a u s e they attempt to r e c o n c i l e the three p r o p e r t i e s (A)-(C) m e n t i o n e d earlier. C l o s e s t in r i g o u r to our a p p r o a c h is the i n d u c t i v e i n f e r e n c e literature (see A n g l u i n a n d Smith [i] for a survey) that deals w i t h i n d u c i n g s u c h things as recursive functions or formal g r a m m a r s (but not B o o l e a n functions) from examples. T h e r e is a large b o d y of w o r k on p a t t e r n r e c o g n i t i o n and c l a s s i f i c a t i o n , u s i n g s t a t i s t i c a l a n d o t h e r tools, (e.g. [4]) b u t the q u e s t i o n of g e n e r a l k n o w l e d g e r e p r e s e n t a t i o n is not a d d r e s s e d t h e r e directly. Learning, in v a r i o u s less formal senses, has b e e n s t u d i e d w i d e l y as a b r a n c h of a r t i f i c i a l i n t e l l i g e n c e . A s u r v e y and b i b l i o g r a p h y can be f o u n d in [2,6]. In t h e i r t e r m i n o l o g y the s u b j e c t of this p a p e r is c o n c e p t learning. 2. vectors for w h i c h F is true, for otherwise, if F is true for just one v e c t o r w h i c h is total, only an e x p o n e n t i a l search or more p o w e r f u l oracles w o u l d be able to find it. Such c o n s i d e r a t i o n s led us to c o n s i d e r the f o l l o w i n g l e a r n i n g p r o t o c o l as a r e a s o n a b l e one. It gives the learner access to the f o l l o w i n g two routines: (i) EXAMPLES: This has no input. It gives as output a v e c t o r v s u c h t h a t F(v) = i. F o r each such v the p r o b a b i l i t y t h a t v is o u t p u t on any single call is D(v). (ii) ORACLE( ): On v e c t o r v outputs 1 or 0 a c c o r d i n g to w h e t h e r as i n p u t it F (v) = 1 or 0. The first of these gives r a n d o m l y chosen p o s i t i v e e x a m p l e s of the c o n c e p t b e i n g learnt. The s e c o n d p r o v i d e s a test for w h e t h e r a v e c t o r w h i c h the d e d u c t i o n p r o c e d u r e considers to b e critical is an e x a m p l e of the c o n c e p t or not. In a real s y s t e m the oracle may b e a h u m a n expert, a d a t a b a s e of p a s t o b s e r v a t i o n s , some d e d u c t i o n system, o r a c o m b i n a t i o n of these. A LEARNING P R O T O C O L FOR BOOLEAN FUNCTIONS We c o n s i d e r t Boolean variables pl,...,pt each of w h i c h can take the value 1 or 0 to i n d i c a t e w h e t h e r the a s s o c i a t e d p r o p o s i t i o n is true or false. T h e r e is n o a s s u m p t i o n about the i n d e p e n dence of the variables. Indeed, they may be functions of each other. F i n a l l y w e o b s e r v e that our p a r t i c u l a r choice of leazning p r o t o c o l was s t r o n g l y i n f l u e n c e d by the e a r l i e r d e c i s i o n to deal w i t h concepts r a t h e r than raw B o o l e a n functions. If w e w e r e to deal w i t h the latter t h e n many o t h e r a l t e r n a t i v e s are e q u a l l y natural. F o r example, g i v e n a v e c t o r v i n s t e a d of asking, as w e do, w h e t h e r all completions of it m ~ e s the f u n c t i o n true, we c o u l d ask w h e t h e r there e x i s t any c o m p l e t i o n s t h a t make it true. This s u g g e s t s n a t u r a l a l t e r n a t i v e semantics for E X A M P L E S a n d ORACLE. In S e c t i o n 7 w e shall use such m o r e e l a b o r a t e oracles. A vGdtor is an a s s i g n m e n t to each of the t v a r i a b l e s of a v a l u e f r o m {0,i,*}. The s y m b o l * denotes that a v a r i a b l e is undetermined. A vector is totaZ if e v e r y v a r i a b l e is determined (i.e. is a s s i g n e d a value from {0,i}.) F o r example, the assignment Pl =P3 =i' P4 =0 and P2 = * is a v e c t o r t h a t is not total. A Boolean function F is a m a p p i n g f r o m the set of 2 t t o t a l v e c t o r s to {0,i}. A Boolean f u n c t i o n F b e c o m e s a concept F if its domain is e x t e n d e d to the set of all v e c t o r s as follows: For a vector v F(v) = i if and o n l y if F(w) = I for all t o t a l v e c t o r s w that agree w i t h v on all v a r i a b l e s for w h i c h v is determined. The p u r p o s e of this e x t e n s i o n is t h a t it p e r m i t s us not to m e n t i o n the v a r i a b l e s on w h i c h F does n o t depend. 3. LEARNABILITY We s h a l l c o n s i d e r v a r i o u s classes of p r o g r a m s h a v i n g P l .... ,Pt as inputs and s h o w that in e a c h case any p r o g r a m in the class can be deduced, w i t h only s m a l l p r o b a b i l i t y of error, in p o l y n o m i a l time u s i n g the p r o t o c o l p r e v i o u s l y described. We assume t h a t p r o g r a m s can take the values I, 0 and undetermined. G i v e n a c o n c e p t F w e c o n s i d e r an arbitrary probability distribution D o v e r the set of all vectors v such t h a t F(v) = i. In o t h e r w o r d s for each v such that F(v) = i it is the case that D(v) 9 0 . Also ~D(v) = 1 w h e n s u m m a t i o n is o v e r this set of vectors. T h e r e are n o o t h e r a s s u m e d r e s t r i c t i o n s on D, w h i c h is i n t e n d e d to d e s c r i b e the r e l a t i v e f r e q u e n c y w i t h w h i c h the p o s i t i v e e x a m p l e s of F o c c u r in nature. More p r e c i s e l y w e say t h a t a class X of p r o g r a m s is learnable w i t h r e s p e c t to a g i v e n l e a r n i n g p r o t o c o l if and only if t h e r e exists an a l g o r i t h m A (the d e d u c t i o n procedure) i n v o k i n g the p r o t o c o l w i t h the f o l l o w i n g p r o p e r t i e s : (a) The a l g o r i t h m runs in time p o l y n o m i a l b o t h in an a d j u s t a b l e p a r a m e t e r h a n d in the v a r i o u s p a r a m e t e r s that q u a n t i f y the size of the p r o g r a m t o be learnt, and W h a t are r e a s o n a b l e l e a r n i n g p r o t o c o l s to consider? F i r s t w e m u s t a v o i d g i v i n g the t e a c h e r too m u c h p o w e r , n a m e l y the p o w e r to c o m m u n i c a t e a p r o g r a m i n s t r u c t i o n b y instruction. F o r example, if a p r e m e d i t a t e d s e q u e n c e of v e c t o r s w i t h r e p e t i t i o n s c o u l d b e c o m m u n i c a t e d then this c o u l d be u s e d to encode the d e s c r i p t i o n of the p r o g r a m even if just two s u c h v e c t o r s w e r e u s e d for b i n a r y notation. S e c o n d l y w e m u s t a v o i d g i v i n g the t e a c h e r w h a t is e v i d e n t l y too little power. In p a r t i c u l a r , the p r o t o c o l m u s t p r o v i d e some t y p i c a l e x a m p l e s of (b) F o r all p r o g r a m s f 6X a n d all d i s t r i b u tions D over vectors v on w h i c h f outputs 1 the a l g o r i t h m w i l l deduce w i t h p r o b a b i l i t y at least (1-h -1) a program g EX t h a t n e v e r o u t p u t s one w h e n it s h o u l d not, b u t o u t p u t s one almost always w h e n it should. In p a r t i c u l a r (i) for all v e c t o r s v g(v) = i implies f(v) = i , and (ii) the s u m of D (v) o v e r all v such t h a t f (v) = 1 b u t g (v) ~ 1 is at m o s t h -I. 438 In our definition we have disallowed positive answers from g to be wrong, only because the particular classes X in this paper do not require i t . This we call one-sided error learning. In other circumstances it may be efficacious to allow two-sided errors. In that more general case we would need to put a probability distribution E on the set of vectors such that f(v) fl. Condition (i) in (b) is then replaced by: CD(v) over all v such that f(v) fl but g(v) =l is at most h-l. A second aspect of our definition is that the parameter h is used in two independent probabilistic bounds. This simplifies the statement of the results. It would be more precise,.however, to work explicitly with two independent parameters hl and h 2' Thirdly, we should note that programs that compute concepts should be distinguished from those that compute merely the Boolean function. This distinction has little consequence for the three classes of expressions that we consider since in these cases programs for the concepts follow directly from the specification of the expressions. For the sake of generality our definitions do allow the value of a program to be undefined since a nontotal vector will not, in general, determine the value of an expressions as 0 or 1. It would be interesting to obtain negative results other than the cryptographic evidence mentioned in the introduction, indicating that the class of unrestricted Boolean circuits is not learnable. In the contrary direction the reader can verify that the difficulty of this problem can be upper bounded in various ways. It can be shown that the assumption that P equals NP would imply that the class of Boolean circuits is learnable with two sided error with respect to the protocol that provides random examples from both D and 6. 4. A COMBINATORIAL BOUND The probabilistic analysis needed for our later results can be abstracted into a single lemma. The proof of the the rest of the paper. lemma is independent of We define the function L(h,Sl for any real’ number h greater than one and any positive integer S as follows: Let L(h,S) be the smallest integer such that in L(h,S) independent Bernoulli trials each with probability h of success, the probability of having fewer than S successes is less than h- . The following simple upper bound holds for the whole range of values of S and h and shows that L(h,S) is essentially linear both in h and in S. PROPOSITION: For all integers S 21 a n d atZ real h >l. L(h,S) < 2h(S +logeh) . Proof. We use the following three well known inequalities of which the last is due to Chernoff (see [51, p. 18). (a) For all x>O (l+x-1)x<e (b) For all x >O ( 1 . -x-1)x <e-l. Cc) In m independent trials each with probability p of success the probability that there are at most k success, where k <mp, is at most pqk (!Q,” . above The first factor in the expression in (c) can be rewritten as (l-=&j (cm-k)/(mp-k)) (mp-k) using (b) with x = (m-k)/(mp-k) bound the product by e we can upper -mp+k bw/k)k . Substituting and using loga~i~~"'t?%~'~as~ bound .-Is . - 210gh+sm and k=S =h-l e gives the L2 (l+(log h),S) I (S/loghl log h . Rewriting this using (a) with x= (logeh)/S e-S -210gho2S ,log h <(2/e) S *e gives -logh&-1. 0 As an illustration of an application suppose that we have access to EXAMPLES, a source of natural positive examples of vectors for concept F. Suppose that we want to find an approximation to the subset P of variables that are determined in at least one natural example of F. Consider the following procedure. Pick the first L(h,lPI) vectors provided by EXAMPLES and let P' be the union of the sets of variables determined by these vectors. The proposition then implies that with probability at least 1-h-l the set P' will have the following property: if a vector is chosen with probability distribution D then the probability thaflit determines a variable in P -P' is less than h . To see this observe that if P' does not have the claimed property then the following has happened: L(h, IP 1) trials have been made (i.e. calls of EXAMPLES) each with probability greater than h--l of success (i.e. finding a vector that determines a variable in P-P') but there have been fewer than IPI successse~ ito;, discoveries of new additions to P'). bility of such a happening is less than h-l by the definition of L. The above application shows that the set of in natural examples o f F can be approximated by a procedure whose runtime is independent of t, the total number of variables that are determined variables. 5. BOUNDED CNF EXPRESSIONS A conjunctive normal form (CNF) e x p r e s s i o n is any p r o d u c t C l C 2 . . . c r of clauses w h e r e e a c h clause c i is a s u m q l + q 2 +--- + q J i of literals. A ZiteraZ q is e i t h e r a v a r i a b l e p or the n e g a tion p of a variable. F o r example, (Pl +P2) (Pl + ~ 2 + P 3 ) is a C N F expression. For a positive integer k a k-CNF expression is a CNF e x p r e s s i o n w h e r e each clause is the s u m of at m o s t k literals. F o r CNF e x p r e s s i o n s in g e n e r a l w e do not k n o w of any p o l y n o m i a l time d e d u c t i o n p r o c e d u r e w i t h r e s p e c t to o u r l e a r n i n g p r o t o c o l . The q u e s t i o n Of w h e t h e r one exists is t a n t a l i z i n g b e c a u s e of its a p p a r e n t simplicity. f In this s e c t i o n w e s h a l l p r o v e that for any i x e d i n t e g e r k the class of k - C N F e x p r e s s i o n s is learnable. In fact it is l e a r n a b l e e a s i l y in the sense t h a t calls of E X A M P L E S s u f f i c e and no calls of O R A C L E are required. In this and s u b s e q u e n t sections we shall use the r e l a t i o n ~ o n c o n c e p t s as follows: F ~G m e a n s t h a t for all v e c t o r s v whenever F(v) = I it is a l s o the case t h a t G(v) =i. (N.B. This is e q u i v a l e n t to the r e l a t i o n F ~G when F,G are r e g a r d e d as functions.) For brevity we shall often denote b o t h e x p r e s s i o n s and e v e n v e c t o r s by the concepts they represent. Thus the c o n c e p t represented by vector w is true for e x a c t l y those v e c t o r s that agree w i t h w on all v a r i a b l e s on which w is determined. The c o n c e p t r e p r e s e n t e d by e x p r e s s i o n f is o b t a i n e d by c o n s i d e r i n g the B o o l e a n f u n c t i o n c o m p u t e d by f a n d e x t e n d i n g its d o m a i n to b e c o m e a concept. In this s e c t i o n w e r e g a r d a clause c i to b e a s p e c i a l k i n d of expression. In S e c t i o n s 7 a n d 8 w e c o n s i d e r m o n o m i a l s m, simple p r o d u c t s of literals, as s p e c i a l forms o f expressions. In all cases s t a t e m e n t s of the f o r m f ~g, v ~ c i, v ~ m or m ~ f can b e u n d e r s t o o d b y i n t e r p r e t i n g the two sides as concepts or f u n c t i o n s in the o b v i o u s way. THEOREM A: For any positive integer k the class of k - C N F expressions is learnable via an algorithm A that uses L = L ( h , ( 2 t ) k+l) calls of EXAMPLES and no calls of ORACLE, where t is the number of variables. Proof. The a l g o r i t h m is i n i t i a l i z e d w i t h f o r m u l a g as the p r o d u c t of all p o s s i b l e clauses of up t o k literals f r o m { p l , ~ l , P 2 , P 2 , .... pt,Pt}. Clearly the n u m b e r of w a y s of c h o o s i n g a clause of e x a c t l y k literals is at m o s t (2t) k and h e n c e the n u m b e r of w a y s of c h o o s i n g a clause w i t h up t o k literals is less than 2t + ( 2 t ) 2 +... + (2t~ k < (2t) k+l. This b o u n d s the i n i t i a l n u m b e r o f clauses in g. The a l g o r i t h m then calls E X A M P L E S L times to give p o s i t i v e e x a m p l e s of the c o n c e p t r e p r e s e n t e d by k - C N F f o r m u l a f. F o r each v e c t o r v so obt a i n e d it deletes all of the r e m a i n i n g clauses in g t h a t do n o t c o n t a i n a l i t e r a l t h a t is d e t e r m i n e d to be true in v. (I.e. in the clauses d e l e t e d each literal is e i t h e r not d e t e r m i n e d o r is n e g a t e d in v.) M o r e p r e c i s e l y the f o l l o w i n g is r e p e a t e d L times: begin v :=EXAMPLES for each c i in g delete c. if v ~ c.. end 1 l The claim is t h a t w i t h h i g h p r o b a b i l i t y the value of g after the L-th e x e c u t i o n of this b l o c k w i l l be the d e s i r e d a p p r o x i m a t i o n to the f o r m u l a f that is b e i n g learnt. Initially the set of v e c t o r s {vlv~g} is clearly empty. We first c l a i m that as the algor i t h m p r o c e e d s the set {vlv~g} w i l l always b e a s u b s e t of {vlv~f}. T o p r o v e this it is clearly s u f f i c i e n t t o p r o v e the same s t a t e m e n t w h e n b o t h sets are r e s t r i c t e d to only t o t a l vectors. Let B b e the p r o d u c t of all the clauses c c o n t a i n i n g up to k literals w i t h the p r o p e r t y that "Vv if v~f then v-~c." It is clear t h a t in the course of the a l g o r i t h m no clause in B can be ever deleted. H e n c e it is c e r t a i n l y true t h a t { v l v ~ g } c { v l v ~ B }. To e s t a b l i s h the c l a i m it remains only to p r o v e t h a t B computes the same B o o l e a n f u n c t i o n as f. (In fact B w i l l be the m a x i m a l k - C N F f o r m u l a e q u i v a l e n t to f.) It is easy to see t h a t f =~B s i n c e by the d e f i n i t i o n of B, for every c in B it is the case t h a t "for all v if v ~ f then v ~c." To v e r i f y the converse, that B ~ f , w e first note that, by definition, f is a k - C N F formula. If some clause in f did not o c c u r in B then this clause c' w o u l d h a v e the p r o p e r t y that "By such t h a t v ~ f but v ~6 c'." But this is i m p o s s i b l e since if c' is a clause of f and if v ~6 c' then v ~6 f. We conclude that e v e r y k - C N F r e p r e s e n t a t i o n of the function t h a t f r e p r e s e n t s consists of a set of clauses t h a t is a s u b s e t of the clauses in B. Hence B ~f. Let X = ~ D ( v ) w i t h s u m m a t i o n o v e r all (not n e c e s s a r i l y total) v e c t o r s v such that v ~ f but v ~ g. This q u a n t i t y is d e f i n e d for each interm e d i a t e v a l u e of g in the course of the a l g o r i t h m and, as is easily v e r i f i e d , it is m o n o t o n e d e c r e a s ing w i t h time. N o w clauses w i l l be r e m o v e d f r o m g w h e n e v e r E X A M P L E S outputs a v e c t o r v such that v ~ g. The p r o b a b i l i t y of this h a p p e n i n g at any m o m e n t is e x a c t l y the current v a l u e of X. Also, the p r o c e s s of r u n n i n g the a l g o r i t h m to c o m p l e t i o n can h a v e one of just two p o s s i b l e outcomes: (i) A t some p o i n t X b e c o m e s less than h -1, in w h i c h case the g f o u n d w i l l a p p r o x i m a t e to f exactly as r e q u i r e d by the d e f i n i t i o n of learnability. (~i) The v a l u e of X n e v e r goes b e l o w h -I and hence g is not an a c c e p t a b l e approximation. The p r o b a b i l i t y of the latter e v e n t u a l i t y is, h o w e v e r , at m o s t h -I since it c o r r e s p o n d s to the s i t u a t i o n of p e r f o r m i n g L(h, (2t) k+l) Bernoulli experiments (i.e. calls of EXAMPLES) each w i t h p r o b a b i l i t y g r e a t e r than h -I of success (i.e. f i n d i n g a v such that v ~6 g) and o b t a i n i n g f e w e r than [2t) k + l s u c c e s s e s (a success b e i n g m a n i f e s t e d by the removal of at least one clause f r o m g). [] In c o n c l u s i o n w e o b s e r v e that a CNF e x p r e s s i o n g i m m e d i a t e l y y i e l d s a p r o g r a m for c o m p u t i n g the a s s o c i a t e d concept. G i v e n a v e c t o r v w e subs t i t u t e the d e t e r m i n e d t r u t h values in g. The c o n c e p t w i l l be true for v if and only if all the clauses are made true b y the substitution. 6. DNF EXPRESSIONS The t e s t v ~ g amounts to asking w h e t h e r n o n e of the m o n o m i a l s of g is m a d e true by the values d e t e r m i n e d to be true in v. Every time E X A M P L E S p r o d u c e s a value v such that v ~ g the inner loop of the a l g o r i t h m w i l l f i n d a p r i m e implicant m to a d d to g. E a c h such m is d i f f e r e n t f r o m any p r e v i o u s l y a d d e d (since the contrary w o u l d h a v e i m p l i e d t h a t v ~g). It follows that such a v w i l l be f o u n d at m o s t d times and hence O R A C L E w i l l be c a l l e d at m o s t dt times. A disjunctive normal form (DNF~ e x p r e s s i o n is any s u m m I + m 2 + ... + m r of m o n o m i a l s w h e r e each monomial m i is a p r o d u c t of literals. For example plP2 + p l P 3 P 4 is a DNF expression. Such exp r e s s i o n s a p p e a r p a r t i c u l a r l y easy for h u m a n s to comprehend. Hence we e x p e c t t h a t any p r a c t i c a l l e a r n i n g s y s t e m w o u l d h a v e to allow for them. An e x p r e s s i o n is monotone if no v a r i a b l e is n e g a t e d in it. We shall show that for m o n o t o n e DNF e x p r e s s i o n s there exists a simple d e d u c t i o n pro-cedure that uses b o t h E X A M P L E S and ORACLE. For u n r e s t r i c t e d DNF a s i m i l a r r e s u l t can be p r o v e d w i t h respect to a d i f f e r e n t size measure. In the u n r e s t r i c t e d case there is the a d d i t i o n a l d i f f i c u l t y that w e can g u a r a n t e e to deduce a p r o g r a m only for the function, and not for the concept. This difficulty does n o t arise in the m o n o t o n e case w h e r e given an e x p r e s s i o n we can always compute the value of the a s s o c i a t e d c o n c e p t for a v e c t o r v by m a k i n g the s u b s t i t u t i o n and asking w h e t h e r the r e s u l t i n g e x p r e s s i o n is i d e n t i c a l l y true. Let X = ZD(w) w i t h s u m m a t i o n o v e r all (not n e c e s s a r i l y total) v e c t o r s w such t h a t w~f but w ~ g. This q u a n t i t y is d e f i n e d for each i n t e r m e d i a t e value of g in the course of the algorithm, is i n i t i a l l y unity, and decreases monot o n i c a l l y w i t h time. N o w a m o n o m i a l w i l l be added to g each time E X A M P L F S outputs a v e c t o r v such that v ~ g. The p r o b a b i l i t y of this o c c u r r i n g at any call of E X A M P L E S is e x a c t l y X. The p r o c e s s of r u n n i n g the a l g o r i t h m to c o m p l e t i o n can have just two p o s s i b l e outcomes: (i) A t some time X has b e c o m e less than h -1, in w h i c h case the final expression g f o u n d w i l l a p p r o x i m a t e to f as r e q u i r e d by the d e f i n i t i o n of learnability. (li) The value of X n e v e r goes b e l o w h -I and hence g is n o t an a c c e p t a b l e approximation. The p r o b a b i l i t y of this s e c o n d e v e n t u a l i t y is, however, at m o s t h -I since it corresponds to the s i t u a t i o n of p e r f o r m i n g L(h,d) B e r n o u l l i e x p e r i m e n t s (i.e. calls of EXAMPLES) e a c h w i t h p r o b a b i l i t y g r e a t e r than h -I of success (i.e. of finding a v such that v~f but v ~ g) a n d o b t a i n i n g fewer t h a n d s u c c e s s e s (each m a n i f e s t e d by the a d d i t i o n of a new monomial). [] A monomial m is a prime implicant of a f u n c tion F (or of an e x p r e s s i o n r e p r e s e n t i n g the function) if m~F and if m' ~ F for any m' o b t a i n e d by d e l e t i n g one literal f r o m m. A DNF e x p r e s s i o n is prime if it consists of the s u m of p r i m e i m p l i c a n t s none of w h i c h is r e d u n d a n t in the sense of b e i n g i m p l i e d by the s u m of the others. T h e r e is always a unique p r i m e DNF e x p r e s s i o n in the m o n o t o n e case, b u t not in general. We therefore define the degree of a DNF e x p r e s s i o n to b e the largest n u m b e r of m o n o m i a l s t h a t a p r i m e DNF e x p r e s s i o n e q u i v a l e n t to it can have. The unique p r i m e DNF e x p r e s s i o n in the m o n o t o n e case is simply the s u m of all the p r i m e implicants. F o r u n r e s t r i c t e d DNF e x p r e s s i o n s several p r o b l e m s arise. The m a i n source of difficulty is the fact t h a t the p r o b l e m of d e t e r m i n i n g w h e t h e r a n o n t o t a l v e c t o r implies the f u n c t i o n s p e c i f i e d by a DNF f o r m u l a is N P - h a r d . This is simply b e c a u s e the p r o b l e m of d e t e r m i n i n g w h e t h e r the now h e r e d e t e r m i n e d v e c t o r implies the f u n c t i o n is the t a u t o l o g y q u e s t i o n of C o o k [3]. An immediate c o n s e q u e n c e of this is t h a t it is u n r e a s o n a b l e to assume that the p r o g r a m b e i n g learnt computes concepts r a t h e r than functions. Another consequence is that in any a l g o r i t h m akin to A l g o r i t h m B the test v ~ g m a y n o t be feasible if v is not total. A l g o r i t h m B is sufficient, however, to e s t a b l i s h the following: T H E O R E M B: The class of monotone DNF expressions is learnable via an algorithm B that uses L =L(h,d) calls of EXAMPLES and dt calls of ORACLE, where d is the degree of the DNF expression f to be learnt and t the number of variables. Proof. The a l g o r i t h m is i n i t i a l i z e d w i t h f o r m u l a g i d e n t i c a l l y equal to the c o n s t a n t zero. The a l g o r i t h m t h e n calls E X A M P L E S L times to p r o d u c e p o s i t i v e e x a m p l e s of f. E a c h time a v e c t o r v is p r o d u c e d such t h a t v ~ g a new monomial m is added to g. The m o n o m i a l m is the p r o d u c t of those literals d e t e r m i n e d in v t h a t are e s s e n t i a l to m a k e v~f. M o r e p r e c i s e l y , the loop that is e x e c u t e d L times is the following: begin TIIEOREM B': Suppose the notion of learnability is restricted to distributions D such that D(V) = 0 whenever v is not total. Then the class of DNF expressions is learnable, in the sense that programs for the functions (but not necessarily the concepts) can be deduced in L(h,d) calls of EXAMPLES and dt calls of ORACLE where d is the degree of the DNF expression to be learnt. v := E X A M P L E S if v ~ g then begin t do for i := 1 to if Pi is d e t e r m i n e d in v then begin set v equal to v b u t w i t h ~ p i :=*; if ORACLE(v~ = i t h e n v : = v end set m equal to the p r o d u c t of all literals q such t h a t v~q; g :=g + m The r e a d e r s h o u l d note that here there is the a d d i t i o n a l p h i l o s o p h i c a l d i f f i c u l t y that availability of the e x p r e s s i o n does n o t in itself imply a p o l y n o m i a l time a l g o r i t h m for ORACLE. On the o t h e r h a n d , the t h e o r e m does say that if an agent has some, may be ad hoc, b l a c k b o x for O R A C L E w h o s e w o r k i n g s are u n k n o w n to him, he can use it to teach end end 441 s o m e o n e else a DNF e x p r e s s i o n function. that approximates the case that The dual of this w o u l d b e a P-ORACLE(v) = 1 if and only if there exists a total v e c t o r w such that w ~v and F(w) = i. F o r b r e v i t y w e s h a l l a l s o define the prime implicant oracle: PI(v) = 1 if and o n l y if N-ORACLE(v) = 1 b u t N-ORACLE(w) = 0 for any w obtained from v by m a k i n g one v a r i a b l e determ i n e d in v undetermined. F i n a l l y , it may be w o r t h e m p h a s i z i n g t h a t the m o n o t o n e case is n o n p r o b l e m a t i c and the d e d u c t i o n p r o c e d u r e for it v e r y s i m p l e - m i n d e d . It may be i n t e r e s t i n g to p u r s u e m o r e s o p h i s t i c a t e d d e d u c t i o n p r o c e d u r e s f o r it if c o n t e x t s can be f o u n d in w h i c h they can be p r o v e d a d v a n t a g e o u s . The q u e s t i o n as to w h e t h e r m o n o t o n e DNF e x p r e s s i o n s can b e learnt f r o m E X A M P L E S alone is open. A pos i t i v e a n s w e r w o u l d be e s p e c i a l l y s i g n i f i c a n t . 7.. Finally we (iii) ones of C: The class of p-expressions is leqrnable via a deduction procedure C that uses 0(t °) calls of N-ORACLE, RP and AP altogether, where t is the number of variables, hand no calls of EXAMPLES.) The procedure always deduces exactly the correct expression. Proof. Let ~ be the m o n o m i a l t h a t is the p r o d u c t of n o literals. The a l g o r i t h m w i l l f i r s t compute RP(Pi) for e a c h Pi to d e t e r m i n e w h i c h of the v a r i a b l e s o c c u r in the p r i m e i m p l i c a n t s of the f u n c t i o n t h a t the e x p r e s s i o n f to be learnt represents. Let g l , . . . , g r be d i s t i n c t single v a r i a b l e e x p r e s s i o n s , one for e a c h Pi for w h i c h RP(Pi) = i. N o t e t h a t these are e x a c t l y the v a r i a b les t h a t o c c u r in f since f is m o n o t o n e . A g e n e r a l expression o v e r v a r i a b l e set {Pl,''" ,Pt } is d e f i n e d r e c u r s i v e l y as follows: (ii) oracles, THEOREM D e d u c i n g e x p r e s s i o n s on w h i c h t h e r e are less s e v e r e s t r u c t u r a l r e s t r i c t i o n s t h a n DNF or CNF appears m u c h m o r e difficult. The a i m of this s e c t i o n is to give a p a r a d i g m a t i c e x a m p l e of h o w f a r one can go in t h a t d i r e c t i o n if one is w i l l i n g t o p a y the p r i c e of o r a c l e s t h a t are m o r e s o p h i s t i c a t e d t h a n the one p r e v i o u s l y described. "Pi" two further h o w it b e h a v e s on v e c t o r s r e p r e s e n t e d as m o n o m i a l s : RP(m) = 1 iff f r o m some m o n o m i a l m' mm' is a p r i m e i m p l i c a n t of f. F o r sets, V , W of v a r i a b l e s we define RA(V,W) = 1 iff every p r i m e i m p l i c a n t of f t h a t contains a v a r i a b l e f r o m V also contains a variable from W. We h a v e a l r e a d y s e e n a class of e x p r e s s i o n s , n a m e l y the k - C N F e x p r e s s i o n s , t h a t can b e l e a r n t f r o m p o s i t i v e e x a m p l e s alone. Here we shall cons i d e r the o t h e r e x t r e m e , a class t h a t can be l e a r n t u s i n g o r a c l e s alone. F o r each i (i ~ i ~ t ) "Pi" are e x p r e s s i o n s . define relevant possibility ~ a n d of relevant accompaniment RA. F o r c o n v e n i e n c e w e d e f i n e the first by p-EXPRESSIONS (i) F(w) = I. possibility oracle: and If fl,...,fr are e x p r e s s i o n s t h e n "(fl + f 2 + ' ' " + f r )'' is an e x p r e s s i o n (called a p l u s e x p r e s s i o n ) . With each gi we a s s o c i a t e t w o m o n o m i a l s mi a n d mi" The f o r m e r w i l l be the s i n g l e v a r i a b l e pj that gi represents. The l a t t e r is d e f i n e d as any m o n o m i a l ~ h a v i n g n o v a r i a b l e in common w i t h mi such t h a t mi~ is a p r i m e i m p l i c a n t of f. The a l g o r i t h m w i l l c o n s t r u c t e a c h s u c h mi as the final v a l u e of m in the f o l l o w i n g p r o c e d u r e : Set m =~; while P I ( m m i) = 0 find a Pk n o t in mm i such that R P ( P k ~ i) = 1 a n d set m : = ~ k m. Hence in a t ' m o s t t2 calls of PI (i.e. t° calls of N-ORACLE) a n d t2 calls of RP v a l u e s for e v e r y ~. w i l l be found. l If fl .... ,fr are e x p r e s s i o n s t h e n "(fl x f 2 x . . . × f r ) " is an e x p r e s s i o n (called a times e x p r e s s i o n ) . A p-expression is an e x p r e s s i o n in w h i c h e a c h v a r i a b l e a p p e a r s at m o s t once. We can assume t h a t a p - e x p r e s s i o n is m o n o t o n e since w e can always r e l a b e l n e g a t e d v a r i a b l e s w i t h n e w names t h a t d e n o t e t h e i r negation. In the r e c u r s i v e d e f i n i t i o n of any p - e x p r e s s i o n there are clearly at m o s t 2t - i i n t e r m e d i a t e e x p r e s s i o n s of w h i c h at m o s t t are of type (i) a n d at m o s t t - 1 of types (ii) o r (iii). W i t h o u t loss of g e n e r a l i t y w e s h a l l a s s u m e t h a t in the d e f i n i t i o n of a p - e x p r e s s i o n rules (ii) a n d (iii) alternate. We s h a l l r e g a r d t w o p - e x p r e s s i o n s as i d e n t i c a l if t h e i r d e f i n i £ i o n s can be m a d e f o r m a l l y the same b y r e o r d e r i n g sums and p r o d u c t s and r e l a b e l l i n g as n e c e s s a r y . O n c e i n i t i a l i z e d the a l g o r i t h m p r o c e e d s by a l t e r n a t e l y e x e c u t i n g a plus-phase a n d a timesphase. A t the s t a r t of e a c h p h a s e w e h a v e a set of expressions gl,..-,gr where each gi is a s s o c i a t e d w i t h two m o n o m i a l s mi and mi (having n o v a r i a b l e s in common) w h e r e mi is a p r i m e i m p l i c a n t of gi and mi~ i is a p r i m e i m p l i c a n t of f. We disting u i s h the g's as plus o r times expressions a c c o r d i n g to w h e t h e r the o u t e r m o s t c o n s t r u c t i o n rule w a s addition or multiplication. A p l u s - p h a s e w i l l first c o m p u t e an e q u i v a l e n c e r e l a t i o n S o n the s u b s e t of {gi } t h a t are times e x p r e s s i o n s . F o r e a c h equiv a l e n c e class G s u c h t h a t no gi 6 G already o c c u r s as a s u m m a n d in a sum, w e c o n s t r u c t a n e w e x p r e s s i o n t h a t is the s u m of the m e m b e r s of G and call this s u m gk where k is a p r e v i o u s l y u n u s e d index. If some m e m b e r s of G a l r e a d y o c c u r in a sum, s a y gj, (N.B. they are n e v e r d i s t r i b u t e d in m o r e t h a n one sum) t h e n we m o d i f y the s u m e x p r e s s i o n F o r l e a r n i n g p - e x p r e s s i o n s w e shall e m p l o y m o r e p o w e r f u l oracles. The b o u n d a r y b e t w e e n r e a s o n a b l e a n d u n r e a s o n a b l e o r a c l e s does n o t a p p e a r sharp. We m a k e n o claims about the r e a s o n a b l e n e s s of these n e w o r a c l e s e x c e p t t h a t t h e y m a y serve as v e h i c l e s f o r u n d e r s t a n d i n g learnability. The d e f i n i t i o n s r e f e r to the B o o l e a n f u n c t i o n F (not r e g a r d e d as a c o n c e p t here,. The o r a c l e of S e c t i o n 2 w i l l be r e n a m e d N - O R A C L E since it is one of necessity: N - O R A C L E ( v ) = 1 if a n d o n l y if for all t o t a l v e c t o r s w such t h a t w ~v it is the 442 gj to equal the sum of every expression in G. A times phase is exactly analogous except that it computes a different equivalence relation T, now on plus expressions, and will form new, or extend old, times expressions. In the above context single variable expressions will be regarded as both times and plus expressions. Also, it is immaterial which of the two kinds of phase is used to start the algorithm. First we shall verify that S and T are indeed equivalence relations under the assumption that Claim 1 holds at the beginning of each phase. Clearly S and T are defined to be both reflexive and symmetric. Also T is transitive for supoose that gi Tgj and gj Tg k. The former implies that every prime implicant of f containing some variable from gi also contains some variable from g~. The latter implies that every • . 3 prime implzcant of f containing some variable from gj also contains a variable from gk" Hence gi Tgk follows. In order to verify the transitivity of S we shall make a more general observation about any sub-expression of f. The intention of the algorithm is best expressed by the claim to follow which will be verified later. Suppose that the expressions occurring in the definition of f are fl,...,fq (where q ~2t-l). We shall say that g ~ fi iff the set of prime implicants of g is a subset of the set of prime implieants of ~fi where (i) if fi is a plus expression then fi =fi and (ii) if fi is a times expression then fi is the product of some subset of the multiplicands of fi" Claim 2: If mi,m. are prime implicants of ~ n ~ t tzmes ' subexpressions ] fi,fj of f and if for some m, with no variable in common with either m i or m., both mim and mjm are prime implirants of 3f, then fi,fj must occur as summands in the same plus expression of f. Claim I: After every phase of the algorithm for every gi that has been constructed there is exactly one expression fi such that (i) gi ~ fi and (ii) fi is of the same kind (i.e. plus or times) as gi" Proof. Most easily verified by representing expression f as a directed graph (e.g. [7]) with edges labelled by the Boolean variables. The sets of labels along directed paths between a distinguished source node and a distinguished sink node correspond exactly to prime implicants in the case of Dexpressions where all edges have different labels. [] The procedure builds up the rooted tree of the expression as rooted subtrees starting from the leaves. One evident difficulty is that there is no a p ~ o P ~ knowledge available about the shape of the tree. Hence in the grafting process a subtree may become attached to another at just about any node. Now to verify that S is transitive suppose that gi Sgj and g~ Sg k where, by Claim i, gi ~fi, gj ~fj and 3gk~.f k for appropriate times expressions Whenever a plus or times expression gi is created or enlarged its associated m i and mi is updated as follows. If gi is a sum then we let m i =mj and mi =~j for any g j that is a surf=and in gi" If gi is a product rnnen m i will be the product of all the mj's that correspond to multiplicands in gi" F i n a l l y mi will be generated as the final value of m in the following procedure: Set m :=I; while PI(mmi) =0 find a Pk not in mm i such that RP(Pkmm i) = 1 and set m :=pk mSince such an mi has to be found at most t times in the overall algorithm the total cost is at most t 2 calls of PI (i.e., t 3 calls of NORACLE) and t 2 calls of RP. m.m.33 where and fi'fj 'fk" mkmj^ mi,m q ,mk Then it follows that are all prime implicants of have no variable in common with J gi S gk follows. Claim 3: Suppose that after some phase of the algorithm Claim 1 holds and times expressions have been formed where placed in the same sum expression at the next plus phase if and only if fi =fi' f'3 =~'3 and are addends in the same sum expression in (~) If gi ~ f i =~" ....... (ii) fi,fj f. if and only if Proof. (i) gi,g 5 gi ~<fi' g-3 ~<f" and f. ,f. J z 3 Then gi and gi will be are times expressions. gi S gj f ~.. It follows from Claim 2 that fi,fj and fk 3 are addends in the same sum expression. Hence In order to complete the description of the algorithm it remains only to define the equivalence relations S and T. Definition: m.l~'3 ' PI(mi~ j) = PI(mjm i) = i, and fi,fj prime implicants of mi,~. contains disjoint sets of J ^ variables, as do mj,m i. and gj ~f. =f. 1 3 are in the same sum then f. ,f.. i j will be prime implicants of Also mi,m ~ m.~n i and ] will be 3 and m.~. 3 I f, and the variables For defining T we shall denote by V i the set of variables that occur in the expression gi" m. will be disjoint from those in m. as will x 3 be those in m. from those in ~n.. Hence gi S g:] 3 z will hold and gi,g q will be in the same plus Definition: gi Tgj RA(Vj,V i) = i. expression after the next plus phase. if and only if in RA(Vi,V j) = (~) gi S gj 443 Suppose that holds. Then mi,m j gi ~fi' gj ~<f.] and will be prime implicants of gi'g"3 of f. of V 4 may have to be updated for up to t such sets and hence t 2 calls of RA will suffice. Hence O(t 3) calls of RA in the overall algrorithm will be enough. 0 respectively, m.d.,m.d, will both be i 3 3 3 prime implicants of f, and the variables in ~. 3 will be disjoint from those of both m. and m.. 3 It follows from Claim 2 that f. =f. and f. =f. l • 3 3 must occur as summands in the same plus expression [N.B. 8. Here we are using the fact that In this paper we have considered learning as the process of deducing a program for performing a task, from information that does not provide an explicit description of such a program. We have given precise meaning to this notion of learning and have shown that in some restricted but nontrivial contexts it is computationally feasible. Claim 2 remains true if we allow a "times subexpression" to be the product of any subset of the multiplicands Clalm 4: in a times expression in f.] Suppose that after some phase of the algorithm Claim 1 holds and plus expressions gi ~ f i and gJ ~f'3 (fi,fj plus expressions) Consider a world containing robots and elephants. Suppose that one of the robots has discovered a recognition algorithm for elephants that can be meaningfully expressed in k-conjunctive normal form. Our Theorem A implies that this robot can communicate its algorithm to the rest of the robot population by simply exclaiming "elephant" whenever one appears. have been found. f. 3 in (i) If gi = f. and g-3 = f" and fi' i 3 are multiplicands in the same times expressions f then gi,g 5 will be placed in the same times expression after the next times phase. (ii) If gi,g 5 are placed in the same times expression at any subsequent phase then in the same product in fi,fj An important aspect of our approach, if cast in its greatest generality, is that we require the recognition algorithms of the teacher and learner to agree on an overwhelming fraction of only the natural inputs. Their behavior on unnatural inputs is irrelevant and hence descriptions of all possible worlds are not necessary. If followed to its conclusion this idea has considerable philosophical implications: A learnable concept is nothing more than a short program that distinguishes some natural inputs from some others. If such a concept is passed on among a population in a distributed manner substantial variations in meaning may arise. More importantly, what consensus there is will only be meaningful for natural inputs. The behavior of an individual's program for unnatural inputs has no relevance. Hence thought experiments and logical arguments involving unnatural hypothetical situations may be meaningless activities. are f. Proof. (i) If the conditions of (i) hold then gi T g j will be discovered at the next times phase and the claim follows. ( i i ) If fi,fj are not in the same product in f then f contains some prime implicant containing variables from one of gi,gj and not from the other. Hence gi Tgj will never hold. o Proof of Claim 1. By induction on the number of phases on Claims i, 3 and 4 simultaneously. D We define the depth of a formula fi to be the maximum number of alternations of sum and product required in its definition. The second important aspects of the formulation is that the notion of oracles makes it possible to discuss a whole range of teacherlearner interactions beyond the mere identification of examples. This is significant in the context of artificial intelligence where a human may be willing to go to great lengths to convey his skills to a machine while being unable to articulate thealgorithms he himselfuses in the practice of the skills. We expect that some explicit programming does become essential for transmitting skills that are beyond certain limits of difficulty. The identification of these limits is a major goal of the line of work proposed in this paper. Claim 5: If fi is an expression of depth k in f then after k phases a gi identical to fi will have been constructed by the algorithm. Proof. By induction on Claims 3 and 4. REMARKS [3 To conclude the proof of the theorem it remains to analyze the runtime, iThis is dominated by the cost of computing m i and hi, for which we have already accounted, plus the cost of computing the equivalence relations S and T. We first note that fewer than 2t expression names gi are used in the course of the algorithm. When an expression is grafted into another at a point deep in the latter then the semantics of all the subexpressions above will change. By Claims 3 and 4 such grafting can occur when a gi is added to a sum expression but not when added to a times expression. Hence the values m i and h i do not need to be changed for any expression when such grafting occurs. It follows that for computing S only 2t values of mi,~ i need to be considered. Hence O(t 2) calls of PI or O(t 3) calls of N-ORACLE suffice overall. For computing T grafting may cause a ripple effect. On each occasion the value 444 9. REFERENCES [I] D. Angluin and C.H. Smith. "A Survey of Inductive Inference: Theory and Methods." Yale University, Computer Science Department Tech. Report 250, October 1982. [2] A. Barr and E.A. Feigenbaum. The Handbook of Artificial Intelligence, Vol. II, wm. Kaufmann, Los Altos, Calif. (1982). [3] S.A. Cook. "The Complexity of Theorem Proving Procedures." Proc. of Third ACM Symposium on Theory of Computing (1971), 151158. [4] R.O. Duda and P.F. Hart. Pattern Classifica- tion and Scene Analysis. Wiley, New York (1973). [5] P. Erdos and J. Spencer. Methods in Combinatorics. New York [6] Probabi lis tic Academic Press, (1974). R.S. Michalski, J.G. Carbonell and T.M. Mitchell. Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Co., Palo Alto, Calif. [7] (1983). S. Skyum and L.G. Valiant. "A Complexity Theory based on Boolean Algebra. " Proc. of 22nd IEEE Syrup. on Foundations of Computer Science, (1981) , 244-253. 445
© Copyright 2024 Paperzz