Holographic Reduced Representations: Convolution Algebra for Compositional Distributed Representations

Tony Plate
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada, M5S 1A4
tap@ai.utoronto.ca

Abstract

A solution to the problem of representing compositional structure using distributed representations is described. The method uses circular convolution to associate items, which are represented by vectors. Arbitrary variable bindings, short sequences of various lengths, frames, and reduced representations can be compressed into a fixed width vector. These representations are items in their own right, and can be used in constructing compositional structures. The noisy reconstructions given by convolution memories can be cleaned up by using a separate associative memory that has good reconstructive properties.

1 Introduction

Distributed representations [Hinton, 1984] are attractive for a number of reasons. They offer the possibility of representing concepts in a continuous space, they degrade gracefully with noise, and they can be processed in a parallel network of simple processing elements. However, the problem of representing compositional structure in distributed representations has been for some time a prominent concern of both followers and critics of the connectionist faith [Fodor and Pylyshyn, 1988; Hinton, 1990]. Using connectionist networks, e.g., back-propagation nets, Hopfield nets, Boltzmann machines, or Willshaw nets, it is easy to represent associations of a fixed number of items. The difficulty with representing compositional structure in all of these networks is that items and associations are represented in different spaces.
Hinton [1990] discusses this problem and proposes a framework in which "reduced descriptions" are used. This framework requires that a number of vectors be compressed (reduced) into a single vector of the same size as each of the original vectors. This vector acts as a reduced description of the set of vectors and can itself be a member of another set of vectors. The reduction must be reversible so that one can move in both directions in a part-whole hierarchy. In this way, compositional structure is represented. However, Hinton does not suggest any concrete way of performing this reducing mapping. Some researchers have built models or designed frameworks in which some compositional structure is present in distributed representations. For some examples see the papers of Touretzky, Pollack, or Smolensky in [AIJ, 1990].

In this paper I propose a new method for representing compositional structure in distributed representations. Circular convolution is used to construct associations of vectors. The representation of an association is a vector of the same dimensionality as the vectors which are associated. This allows the construction of representations of objects with compositional structure. I call these Holographic Reduced Representations (HRRs), since convolution- and correlation-based memories are closely related to holographic storage, and they provide an implementation of Hinton's [1990] reduced descriptions. I describe how HRRs and error-correcting associative item memories can be used to build distributed connectionist systems which manipulate complex structures.
The item memories are necessary to clean up the noisy items extracted from the convolution representations.

2 Associative memories

Associative memories are used to store associations between items which are represented in a distributed fashion as vectors.1 Nearly all work on associative memory has been concerned with storing items or pairs of items. Convolution-correlation memories (sometimes referred to as holographic-like memories) and matrix memories have been regarded as alternate methods for implementing associative memory [Willshaw, 1981; Murdock, 1983; Pike, 1984; Schonemann, 1987]. Matrix memories have received more interest, probably due to their relative simplicity and their higher capacity in terms of the number of elements in the items being associated. The properties of matrix memories are well understood. Two of the best known matrix memories are "Willshaw" networks [Willshaw, 1981] and Hopfield networks [Hopfield, 1982]. Matrix memories can be used to construct auto-associative (or "content addressable") memories for pattern correction and completion. They can also be used to represent associations between two vectors. After two vectors are associated, one can be used as a cue to retrieve the other.

There are three operations used in associative memories: encoding, decoding, and trace composition. There is a close relationship between matrix and convolution memories. A convolution of two vectors (whether circular or aperiodic) can be regarded as a compression of the outer product of those two vectors; the compression is achieved by summing along the transdiagonals of the outer product. In matrix memories the encoding operation is the outer product, and in convolution memories the encoding operation is convolution. Addition and superposition have both been used as the trace composition operation in matrix and convolution memories.

2.1 Convolution-correlation memories

In nearly all convolution memory models the aperiodic convolution operation has been used to form associations.2 The aperiodic convolution of two vectors with n elements each results in a vector with 2n - 1 elements. This result can be convolved with another vector (recursive convolution); if that vector has n elements, the result has 3n - 2 elements. Thus the resulting vectors grow with recursive convolution. This same growing property is exhibited in a much more dramatic form by both matrix memories and Smolensky's [1990] tensor product representations. Researchers have used three solutions to this problem of growth with recursive association: (a) limit the depth of composition (Smolensky [1990]), (b) discard elements outside the n central ones (Metcalfe [1982]), and (c) use infinite vectors (Murdock [1982]). The growth problem can be avoided entirely by the use of circular convolution, an operation well known in signal processing. The result of the circular convolution of two vectors of n elements has just n elements. Since circular convolution does not have the growth property, it can be used recursively in connectionist systems with fixed width vectors.

1 There are usually distributional constraints on the elements of the vectors, e.g., the elements should be drawn from independent distributions.
2 The exception is the non-linear correlograph of Willshaw [1981], first published in 1969.

2.2 Distributional constraints on vectors

For correlation to decode convolution, the elements of each vector must be independently distributed with mean zero and variance 1/n, so that the Euclidean length of each vector has a mean of one.
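Circular convolution and its approximate inverse, circular correlation, are easy to state directly. The following sketch (not from the paper; pure Python for clarity, although an FFT-based implementation would be far faster) draws random vectors obeying the distributional constraints of section 2.2 and checks that correlating an encoded pair with one item approximately recovers the other:

```python
import random

def cconv(a, b):
    # Circular convolution: t[j] = sum_k a[k] * b[(j - k) mod n]
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(a, b):
    # Circular correlation, the approximate inverse of convolution:
    # c[j] = sum_k a[k] * b[(j + k) mod n]
    n = len(a)
    return [sum(a[k] * b[(j + k) % n] for k in range(n)) for j in range(n)]

def randvec(n, rng):
    # Elements i.i.d. with mean 0 and variance 1/n (section 2.2),
    # so the Euclidean length of each vector is close to one.
    return [rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

rng = random.Random(0)
n = 512
a, b, c = (randvec(n, rng) for _ in range(3))

t = cconv(a, b)        # associate a with b in a single n-element trace
b_noisy = ccorr(a, t)  # decode with cue a: a noisy version of b

sim_target = dot(b_noisy, b)  # close to 1: b is recovered, noisily
sim_other = dot(b_noisy, c)   # close to 0: unrelated to c
```

The reconstruction is noisy, which is why the item memories discussed later are needed; the dot product with the correct item is nevertheless far larger than with an unrelated vector.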
Examples of suitable distributions are the normal distribution and the discrete distribution with values equiprobably ±1/√n. The analysis of signal strength and capacity depends on the elements of vectors being independently distributed. The tension between these constraints and the need for vectors to have meaningful features is discussed in Plate [1991].

2.3 The need for reconstructive item memories

If a system using convolution representations is to do some sort of recall (as opposed to recognition), then it must have an additional error-correcting associative item memory. This is needed to clean up the noisy vectors retrieved from the convolution traces. This reconstructive memory must store all the items that the system could produce. When given as input a noisy version of one of those items, it must either output the closest item or indicate that the input is not close enough to any of the stored items. For example, suppose the system is to store pairs of letters, and suppose each one of the 26 letters is represented by the random vectors a, b, ..., z. The item memory must store these 26 vectors and must be able to output the closest item for any input vector (the "clean" operation). Such a system is shown in Figure 3. The trace is a sum of convolved pairs, e.g., t = a ⊛ b + c ⊛ d + e ⊛ f, where ⊛ denotes circular convolution. When the system is given one item as an input cue, its task is to output the item that cue was associated with in the trace. It should also output a scalar value (the strength) which is high when the input cue was a member of a pair, and low when it was not. When given a as a cue it should produce b and a high strength. When given g as a cue it should give a low strength. The item it outputs is unimportant when the strength is low.

The convolution trace stores only a few associations or items, and the item memory stores many items. The item memory acts as an auto-associator to clean up the noisy items retrieved from the convolution trace. The exact method of implementation of the item memory is unimportant. Hopfield networks are probably not a good candidate because of their low capacity. Kanerva networks [Kanerva, 1988] have sufficient capacity, but can only store binary vectors.3 For experiments I have been using a nearest neighbor matching memory.

3 Although most of this paper assumes items are represented as real vectors, convolution memories also work with binary vectors [Willshaw, 1981].

2.4 How much information is stored

Since a convolution trace only has n numbers in it, it may seem strange that several pairs of vectors can be stored "in" it, since each of those vectors also has n numbers. The reason is that the vectors are stored with very poor fidelity; to successfully store a vector we only need to store enough information to discriminate it from the other vectors. If M vectors are used to represent M different (equiprobable) items, then about 2k log M bits of information are needed to represent k pairs of those items.4 The size of the vectors does not enter into this calculation; only the number of vectors matters.

4 Slightly less than 2k log M bits are required since the pairs are unordered.

3 Addition Memories

One of the simplest ways to store a set of vectors is to add them together. Such storage does not allow for recall or reconstruction of the stored items, but it does allow for recognition, i.e., determining whether a particular item has been stored or not. The principle of addition memory can be stated as "adding together two high dimensional vectors gives something which is similar to each and not very similar to anything else."5 This principle underlies both convolution and matrix memories, and the same sort of analysis can be applied to the linear versions of each. Addition memories are discussed at greater length in Plate [1991].

5 This applies to the degree that the elements of the vectors are randomly and independently distributed.

5 Representing more complex structure

Pairs of items are easy to represent in any type of associative memory, but convolution memory is also suited to the representation of more complex structure.

5.1 Sequences

Sequences can be represented in a number of ways using convolution encoding. An entire sequence can be represented in one memory trace (providing the soft capacity limits are not exceeded), or chunking can be used to represent a sequence of any length in a number of memory traces. Murdock [1983; 1987] proposes a chaining method of representing sequences in a single memory trace, and models a large number of psychological phenomena with it. The technique stores both item and pair information in the memory trace. For example, if the sequence of vectors to be stored is abc, then the trace has the form

t = w1 a + w2 a ⊛ b + w3 b + w4 b ⊛ c + w5 c

where the wi are suitable weighting constants less than 1, generated from two or three underlying parameters, and chosen so that a is the strongest item component. The retrieval of the sequence begins with retrieving the strongest component of the trace, which will be a. From there the retrieval is by chaining: correlating the trace with the current item to retrieve the next item. The end of the sequence is detected when the correlation of the trace with the current item is not similar to any item in the item memory. Another way to represent sequences is to use the entire previous sequence as context rather than just the previous item [Murdock, 1987].
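The pair-storage system of Figure 3 can be sketched in a few lines (an illustrative sketch, not the paper's implementation; it assumes a simple dot-product nearest-neighbour item memory and a hypothetical strength threshold of 0.3, with `cconv` and `ccorr` the circular convolution and correlation of section 2.1):

```python
import random

def cconv(a, b):
    # circular convolution
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(a, b):
    # circular correlation (approximate inverse of convolution)
    n = len(a)
    return [sum(a[k] * b[(j + k) % n] for k in range(n)) for j in range(n)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

rng = random.Random(1)
n = 512
# Item memory: 26 random "letter" vectors (the clean-up memory).
items = {ch: [rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]
         for ch in "abcdefghijklmnopqrstuvwxyz"}

def clean(x, threshold=0.3):
    # Nearest-neighbour matching: return the closest stored item and
    # its strength; a low strength signals "no good match".
    name, strength = max(((ch, dot(x, v)) for ch, v in items.items()),
                         key=lambda p: p[1])
    return (name if strength > threshold else None), strength

# Trace storing three pairs: t = a*b + c*d + e*f
t = [0.0] * n
for x, y in [("a", "b"), ("c", "d"), ("e", "f")]:
    t = [ti + ci for ti, ci in zip(t, cconv(items[x], items[y]))]

recalled, strength = clean(ccorr(items["a"], t))   # cue a: expect b, high strength
miss, miss_strength = clean(ccorr(items["g"], t))  # g was never stored: low strength
```

Decoding with a stored cue yields a high-strength match on its partner, while an unstored cue yields only noise, which is exactly the recognition-plus-recall behaviour the clean-up memory is meant to provide.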
This makes it possible to store sequences with repetitions of items. To store abc, the trace is:

t = a + a ⊛ b + a ⊛ b ⊛ c

This type of sequence can be retrieved in a similar way to the previous one, except that the retrieval cue must be built up using convolutions. The retrieval of later items in both these representations could be improved by subtracting off prefix components as the items in the sequence are retrieved. Yet another way to represent sequences is to use a fixed cue for each position of the sequence, so to store abc, the trace is:

t = p1 ⊛ a + p2 ⊛ b + p3 ⊛ c

The retrieval (and storage) cues pi can be arbitrary, or generated in some manner from a single vector, e.g., as successive convolutive powers of a single vector p.

5.2 Chunking of sequences

All of the above methods have soft limits on the length of sequences that can be stored. As the sequences get longer, the noise in the retrieved items increases until the items are impossible to identify. This limit can be overcome by chunking: creating new "non-terminal" items representing subsequences [Murdock, 1987].6 The second sequence representation method is the most suitable one to do chunking with. Suppose we want to represent the sequence abcdefgh. We can create three new items representing subsequences:

s_abc = a + a ⊛ b + a ⊛ b ⊛ c
s_de = d + d ⊛ e
s_fgh = f + f ⊛ g + f ⊛ g ⊛ h

6 These new items must be added to the item memory and marked in some way as non-terminals.

5.3 Variable binding

It is simple to implement variable binding with convolution: convolve the variable representation with the value representation. If it is desired that the representation of a variable binding should be somewhat similar to both the variable and the value, the vectors for those can be added in. This gives, for a variable r bound to a value a, r ⊛ a + r + a as the representation of the variable binding. This type of variable binding can also be implemented in other types of associative memory, e.g., the triple space of BoltzCONS [Touretzky and Hinton, 1985], or the outer product of roles and fillers in DUCS [Touretzky and Geva, 1987]. However, in those systems the variable and value objects were of a different dimension than the binding object. Thus it was not possible to add components of the variable and the value to the representation of the binding, nor to use the binding itself as an item in another association.

5.4 Frame-slot representations

Frames can be represented using convolution encoding in a manner analogous to the cross products of roles and fillers in [Hinton, 1981] or the frames of DUCS [Touretzky and Geva, 1987]. A frame consists of a frame label and a set of roles, each represented by a vector. An instantiated frame is the sum of the frame label and the roles (slots) convolved with their respective fillers. For example, suppose we have a (very simplified) frame for "seeing". The vector for the frame label is L_see and the vectors for the roles are r_agent and r_object. This frame could be instantiated with the fillers f_jane and f_spot, to represent "Jane saw Spot":

s = L_see + r_agent ⊛ f_jane + r_object ⊛ f_spot

Fillers (or roles) can be retrieved from the instantiated frame by correlating with the role (or filler). The vectors representing roles could be frame-specific, i.e., r_agent for "seeing" could differ from the agent roles of other frames, or the roles could be the same (or just similar) across frames. Uninstantiated frames can also be stored as the sums of the vectors representing their components, e.g., L_see + r_agent + r_object. Section 7.1 describes one way of manipulating uninstantiated frames and selecting appropriate roles to fill. The frame representation can be made similar to any vector by adding some of that vector to it. For example, we could add some of f_jane to the above instantiated frame to make the representation for Jane doing something have some similarity to the representation for Jane.
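The frame encoding and decoding just described can be sketched as follows (illustrative only; the vector names follow the running example, and the clean-up step is omitted, so the decoded filler is noisy):

```python
import random

def cconv(a, b):
    # circular convolution
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(a, b):
    # circular correlation (approximate inverse of convolution)
    n = len(a)
    return [sum(a[k] * b[(j + k) % n] for k in range(n)) for j in range(n)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

rng = random.Random(2)
n = 512
def randvec():
    return [rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

L_see, r_agent, r_object, f_jane, f_spot = (randvec() for _ in range(5))

# Instantiated frame for "Jane saw Spot":
#   s = L_see + r_agent * f_jane + r_object * f_spot
s = [l + x + y for l, x, y in zip(L_see,
                                  cconv(r_agent, f_jane),
                                  cconv(r_object, f_spot))]

# Decode the agent filler by correlating with the role vector.
agent = ccorr(r_agent, s)

# agent is a noisy version of f_jane; an item memory would clean it up.
sim_jane = dot(agent, f_jane)
sim_spot = dot(agent, f_spot)
```

Correlating with a role pulls out its filler plus crosstalk from the other frame components; the decoded vector is much more similar to f_jane than to f_spot.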
Returning to the chunked sequence of section 5.2, the representation for the whole sequence in terms of the three subsequence items s_abc, s_de, and s_fgh is:

t = s_abc + s_abc ⊛ s_de + s_abc ⊛ s_de ⊛ s_fgh

Decoding this chunked sequence is slightly more difficult, requiring the use of a stack and decisions on whether an item is a non-terminal that should be further decoded. A machine to decode such representations is described in section 7.2.

6 Reduced Representations

Using the types of representations described in the last section, it is a trivial step to building reduced representations which can represent complex hierarchical structure in a fixed width vector. We can use an instantiated frame7 as a filler instead of f_spot in the frame built in the previous section. For example, using "Spot ran.":

s = L_see + r_agent ⊛ f_jane + r_object ⊛ (L_run + r_agent ⊛ f_spot)

7 Normalization of lengths of vectors becomes an issue, but I do not consider this for lack of space.

This representation can be manipulated with or without chunking. Without chunking, we could extract the agent of the object by correlating with r_object ⊛ r_agent. Using chunking, we could extract the object by correlating with r_object, clean it up, and then extract its agent, giving a less noisy vector than without chunking. This implements Hinton's idea [1990] of a system being able to focus attention on constituents as well as being able to have the whole meaning present at once. It also suggests the possibility of sacrificing accuracy for speed: if chunks are not cleaned up, the retrievals are less accurate.

7 Simple Machines that use HRRs

In this section two simple machines that operate on complex convolution representations are described. Both of these machines have been successfully simulated on a convolution calculator using vectors with 1024 elements.
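The two decoding routes for a nested frame, with and without chunking, can be sketched as follows (an illustrative sketch with the vector names assumed in the running example; n = 512 here for speed rather than the 1024 used in the paper's simulations, and the clean-up memory is a bare nearest-neighbour match):

```python
import random

def cconv(a, b):
    # circular convolution
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(a, b):
    # circular correlation (approximate inverse of convolution)
    n = len(a)
    return [sum(a[k] * b[(j + k) % n] for k in range(n)) for j in range(n)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def add(*vs):
    return [sum(es) for es in zip(*vs)]

rng = random.Random(3)
n = 512
def randvec():
    return [rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

L_see, L_run, r_agent, r_object, f_jane, f_spot = (randvec() for _ in range(6))

s_ran = add(L_run, cconv(r_agent, f_spot))                      # "Spot ran"
s = add(L_see, cconv(r_agent, f_jane), cconv(r_object, s_ran))  # "Jane saw Spot run"

# Without chunking: extract the agent of the object in one step,
# correlating with r_object convolved with r_agent.
one = ccorr(cconv(r_object, r_agent), s)

# With chunking: extract the object, clean it up in the item memory
# (the noisy vector matches the stored chunk s_ran), then extract its agent.
memory = [("f_jane", f_jane), ("f_spot", f_spot), ("s_ran", s_ran)]
obj_noisy = ccorr(r_object, s)
name, chunk = max(memory, key=lambda it: dot(obj_noisy, it[1]))
two = ccorr(r_agent, chunk)
```

Both routes recover a vector similar to f_spot, but the chunked route decodes from a cleaned-up chunk, so on average it accumulates less noise, which is the accuracy-for-speed trade-off noted above.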
The machine that accomplishes this frame-manipulation operation (section 7.1) is shown in Figure 4.

7.2 Chunked sequence readout machine

A machine that reads out the chunked sequences described in section 5.2 can be built using two buffers, a stack, a classifier, a correlator, a clean-up memory, and three gating paths. The classifier tells whether the item most prominent in the trace is a terminal, a non-terminal (chunk), or nothing. At each iteration the machine executes one of three action sequences depending on the output of the classifier. The stack could be implemented in any of a number of ways, including the way suggested in [Plate, 1991], or in a network with fast weights. The machine is shown in Figure 5.

Figure 5: A chunked sequence readout machine.

The control loop for the chunked sequence readout machine is:

Loop (until the stack gives the END signal):
  Clean up the trace to recover the most prominent item: x = Clean(t).
  Classify x as a terminal, a non-terminal, or nothing (in which case "pop" is the appropriate action), and perform the corresponding action sequence.

This machine is an example of a system that can have the whole meaning present and that can also focus attention on constituents.

8 Mathematical properties

Mathematical properties of circular convolution and correlation are discussed in Plate [1991], including: algebraic properties; the reason correlation is an approximate inverse for convolution; the existence and usefulness of exact inverses of convolutions; and the variances of dot products of various convolution products.

9 Discussion

Circular convolution is a bilinear operation, and one consequence of the linearity is low storage efficiency. However, the storage efficiency is high enough to be usable and scales linearly. Convolution is endowed with several positive features by virtue of its linear properties. One is that it can be computed very quickly using FFTs. Another is that analysis of the capacity, scaling, and generalization properties is straightforward. Another is that there is a possibility that a system using HRRs could retain ambiguity while processing ambiguous input.

Convolution could be used as a fixed mapping in a connectionist network to replace one or more of the usual weight-matrix-by-vector mappings. Activations could be propagated forward very quickly using FFTs, and gradients could be propagated backward very quickly using FFTs as well. Such a network could learn to take advantage of the convolution mapping and could learn distributed representations for its inputs.

Memory models using circular convolution provide a way of representing compositional structure in distributed representations. The operations involved are linear and the properties of the scheme are relatively easy to analyze. There is no learning involved and the scheme works with a wide range of vectors. Systems employing this representation need to have an error-correcting auto-associative memory.

Acknowledgements

Conversations with Jordan Pollack, Janet Metcalfe, and Geoff Hinton have been essential to the development of the ideas expressed in this paper. This research has been supported in part by the Canadian Natural Sciences and Engineering Research Council.

References

[AIJ, 1990] Special issue on connectionist symbol processing. Artificial Intelligence, 46(1-2), 1990.
[Fodor and Pylyshyn, 1988] J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3-71, 1988.
[Hinton, 1981] G. E. Hinton. Implementing semantic networks in parallel hardware. In Parallel Models of Associative Memory. Erlbaum, Hillsdale, NJ, 1981.
[Hinton, 1984] G. E. Hinton. Distributed representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Pittsburgh, PA, 1984.
[Hinton, 1990] G. E. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1-2):47-76, 1990.
[Hopfield, 1982] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences U.S.A., 79:2554-2558, 1982.
[Kanerva, 1988] P. Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, 1988.
[Metcalfe Eich, 1982] Janet Metcalfe Eich. A composite holographic associative recall model. Psychological Review, 89:627-661, 1982.
[Murdock, 1982] Bennet B. Murdock. A theory for the storage and retrieval of item and associative information. Psychological Review, 89(6):316-338, 1982.
[Murdock, 1983] B. B. Murdock. A distributed memory model for serial-order information. Psychological Review, 90(4):316-338, 1983.
[Murdock, 1987] Bennet B. Murdock. Serial-order effects in a distributed-memory model. In David S. Gorfein and Robert R. Hoffman, editors, Memory and Learning: The Ebbinghaus Centennial Conference, pages 277-310. Lawrence Erlbaum Associates, 1987.
[Pike, 1984] Ray Pike. Comparison of convolution and matrix distributed memory systems for associative recall and recognition. Psychological Review, 91(3):281-294, 1984.
[Plate, 1991] T. A. Plate. Holographic reduced representations: Convolution algebra for compositional distributed representations. Technical Report CRG-TR-91-1, University of Toronto, 1991.
[Schonemann, 1987] P. H. Schonemann. Some algebraic relations between involutions, convolutions, and correlations, with applications to holographic memories. Biological Cybernetics, 56:367-374, 1987.
[Smolensky, 1990] P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159-216, 1990.
[Touretzky and Geva, 1987] D. S. Touretzky and S. Geva. A distributed connectionist representation for concept structures. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, 1987.
[Touretzky and Hinton, 1985] D. S. Touretzky and G. E. Hinton. Symbols among the neurons: Details of a connectionist inference architecture. In IJCAI 9, pages 238-243, 1985.
[Willshaw, 1981] D. Willshaw. Holography, associative memory, and inductive generalization. In Parallel Models of Associative Memory. Erlbaum, Hillsdale, NJ, 1981.