Holographic Reduced Representations:
Convolution Algebra for Compositional Distributed Representations
Tony Plate
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada, M5S 1A4
tap@ai.utoronto.ca
Abstract
A solution to the problem of representing compositional structure using distributed representations is described. The method uses circular convolution to associate items, which are represented by vectors. Arbitrary variable bindings, short sequences of various lengths, frames, and reduced representations can be compressed into a fixed width vector. These representations are items in their own right, and can be used in constructing compositional structures. The noisy reconstructions given by convolution memories can be cleaned up by using a separate associative memory that has good reconstructive properties.
1  Introduction
Distributed representations [Hinton, 1984] are attractive for a number of reasons. They offer the possibility of representing concepts in a continuous space, they degrade gracefully with noise, and they can be processed in a parallel network of simple processing elements. However, the problem of representing compositional structure in distributed representations has been for some time a prominent concern of both followers and critics of the connectionist faith [Fodor and Pylyshyn, 1988; Hinton, 1990].
Using connectionist networks, e.g., backpropagation nets, Hopfield nets, Boltzmann machines, or Willshaw nets, it is easy to represent associations of a fixed number of items. The difficulty with representing compositional structure in all of these networks is that items and associations are represented in different spaces. Hinton [1990] discusses this problem and proposes a framework in which "reduced descriptions" are used. This framework requires that a number of vectors be compressed (reduced) into a single vector of the same size as each of the original vectors. This vector acts as a reduced description of the set of vectors and itself can be a member of another set of vectors. The reduction must be reversible so that one can move in both directions in a part-whole hierarchy. In this way, compositional structure is represented. However, Hinton does not suggest any concrete way of performing this reducing mapping.
Some researchers have built models or designed frameworks in which some compositional structure is present in distributed representations. For some examples see the papers of Touretzky, Pollack, or Smolensky in [AIJ, 1990].

30    Architectures and Languages
In this paper I propose a new method for representing compositional structure in distributed representations. Circular convolution is used to construct associations of vectors. The representation of an association is a vector of the same dimensionality as the vectors which are associated. This allows the construction of representations of objects with compositional structure. I call these Holographic Reduced Representations (HRRs), since convolution and correlation based memories are closely related to holographic storage, and they provide an implementation of Hinton's [1990] reduced descriptions. I describe how HRRs and error correcting associative item memories can be used to build distributed connectionist systems which manipulate complex structures. The item memories are necessary to clean up the noisy items extracted from the convolution representations.
2  Associative memories
Associative memories are used to store associations between items which are represented in a distributed fashion as vectors. Nearly all work on associative memory has been concerned with storing items or pairs of items. Convolution-correlation memories (sometimes referred to as holographic-like) and matrix memories have been regarded as alternate methods for implementing associative memory [Willshaw, 1981; Murdock, 1983; Pike, 1984; Schonemann, 1987]. Matrix memories have received more interest, probably due to their relative simplicity and their higher capacity in terms of the number of elements in the items being associated.
The properties of matrix memories are well understood. Two of the best known matrix memories are "Willshaw" networks [Willshaw, 1981] and Hopfield networks [Hopfield, 1982]. Matrix memories can be used to construct auto-associative (or "content addressable") memories for pattern correction and completion. They can also be used to represent associations between two vectors. After two vectors are associated one can be used as a cue to retrieve the other.
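Pairwise association in a linear matrix memory can be illustrated in a few lines of Python. This is a minimal sketch of my own, not code from any of the cited models; the dimensionality, random seed, and variable names are arbitrary choices:

```python
import random

n = 64  # dimensionality of item vectors (arbitrary for this sketch)

def rand_vec(n):
    # elements i.i.d. N(0, 1/n), so each vector has expected Euclidean length 1
    return [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def cos(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) ** 0.5 * sum(q * q for q in y) ** 0.5)

random.seed(0)
a, b, c, d = (rand_vec(n) for _ in range(4))

# encoding: the outer product stores an association; superposition (addition)
# composes the traces for a -> b and c -> d into a single matrix T
T = [[b[i] * a[j] + d[i] * c[j] for j in range(n)] for i in range(n)]

# decoding: a matrix-vector product with the cue a gives b*(a.a) + d*(c.a),
# i.e., a noisy version of b
out = [sum(T[i][j] * a[j] for j in range(n)) for i in range(n)]
sim_b, sim_d = cos(out, b), cos(out, d)
print(round(sim_b, 2), round(sim_d, 2))  # out is close to b, nearly orthogonal to d
```

The retrieval is only approximate: the cross term d*(c.a) contributes noise, which is why a clean-up memory becomes necessary in larger systems.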
There are three operations used in associative memories: encoding, decoding, and trace composition. In matrix memories the encoding operation is the outer product, and in convolution memories the encoding operation is convolution. Addition and superposition have both been used as the trace composition operation in matrix and convolution memories.

There is a close relationship between matrix and convolution memories. A convolution of two vectors (whether circular or aperiodic) can be regarded as a compression of the outer product of those two vectors. The compression is achieved by summing along the trans-diagonals of the outer product.

2.1  Convolution-correlation memories

In nearly all convolution memory models the aperiodic convolution operation has been used to form associations.² The aperiodic convolution of two vectors with n elements each results in a vector with 2n - 1 elements. This result can be convolved with another vector (recursive convolution); and if that vector has n elements, the result has 3n - 2 elements. Thus the resulting vectors grow with recursive convolution. This same growing property is exhibited in a much more dramatic form by both matrix memories and Smolensky's [1990] tensor product representations.

Researchers have used three solutions to this problem of growth with recursive associations: (a) limit the depth of composition (Smolensky [1990]), (b) discard elements outside the n central ones (Metcalfe [1982]), and (c) use infinite vectors (Murdock [1982]).

The growth problem can be avoided entirely by the use of circular convolution, an operation well known in signal processing. The result of the circular convolution of two vectors of n elements has just n elements. Since circular convolution does not have the growth property, it can be used recursively in connectionist systems with fixed width vectors.

¹ There are usually distributional constraints on the elements of the vectors, e.g., the elements should be drawn from independent distributions.

² The exception is the non-linear correlograph of Willshaw [1981], first published in 1969.

Plate    31
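The fixed-width property is easy to check numerically. In this sketch (mine, not the paper's), `aconv` is the aperiodic convolution and `cconv` the circular convolution described above; n = 8 and the seed are arbitrary:

```python
import random

def aconv(x, y):
    # aperiodic (linear) convolution: z_m = sum_{k+l=m} x_k * y_l, m = 0..2n-2
    n = len(x)
    return [sum(x[k] * y[m - k] for k in range(max(0, m - n + 1), min(m, n - 1) + 1))
            for m in range(2 * n - 1)]

def cconv(x, y):
    # circular convolution: z_j = sum_k x_k * y_{(j-k) mod n}, j = 0..n-1
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

random.seed(1)
n = 8
x = [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]
y = [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

a, z = aconv(x, y), cconv(x, y)
print(len(a), len(z))  # 15 8 -- aperiodic grows to 2n - 1, circular stays at n

# circular convolution is just the aperiodic result wrapped around modulo n
wrapped = [a[j] + (a[j + n] if j + n < 2 * n - 1 else 0.0) for j in range(n)]
assert all(abs(w - v) < 1e-12 for w, v in zip(wrapped, z))
```

Because the circular result has exactly n elements, it can be fed back into further convolutions without any growth.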
2.2  Distributional constraints on vectors

For correlation to decode convolution the elements of each vector must be independently distributed with mean zero and variance 1/n so that the Euclidean length of each vector has a mean of one. Examples of suitable distributions are the normal distribution and the discrete distribution with values equiprobably ±1/√n. The analysis of signal strength and capacity depends on elements of vectors being independently distributed. The tension between these constraints and the need for vectors to have meaningful features is discussed in Plate [1991].

2.3  How much information is stored

Since a convolution trace only has n numbers in it, it may seem strange that several pairs of vectors can be stored "in" it, since each of those vectors also has n numbers. The reason is that the vectors are stored with very poor fidelity; to successfully store a vector we only need to store enough information to discriminate it from the other vectors. If M vectors are used to represent M different (equiprobable) items, then about 2k log M bits of information are needed to represent k pairs of those items.⁴ The size of the vectors does not enter into this calculation; only the number of vectors matters.

3  Addition Memories

One of the simplest ways to store a set of vectors is to add them together. Such storage does not allow for recall or reconstruction of the stored items, but it does allow for recognition, i.e., determining whether a particular item has been stored or not.

The principle of addition memory can be stated as "adding together two high dimensional vectors gives something which is similar to each and not very similar to anything else."⁵ This principle underlies both convolution and matrix memories and the same sort of analysis can be applied to the linear versions of each. Addition memories are discussed at greater length in Plate [1991].

⁴ Slightly less than 2k log M bits are required since the pairs are unordered.

⁵ This applies to the degree that the elements of the vectors are randomly and independently distributed.

4  The need for reconstructive item memories

If a system using convolution representations is to do some sort of recall (as opposed to recognition), then it must have an additional error correcting associative item memory. This is needed to clean up the noisy vectors retrieved from the convolution traces. This reconstructive memory must store all the items that the system could produce. When given as input a noisy version of one of those items it must either output the closest item or indicate that the input is not close enough to any of the stored items.

For example, suppose the system is to store pairs of letters, and suppose each one of the 26 letters is represented by the random vectors a, b, ..., z. The item memory must store these 26 vectors and must be able to output the closest item for any input vector (the "clean" operation). Such a system is shown in Figure 3. The trace is a sum of convolved pairs, e.g., t = a ⊛ b + c ⊛ d + e ⊛ f. When the system is given one item as an input cue its task is to output the item that cue was associated with in the trace. It should also output a scalar value (the strength) which is high when the input cue was a member of a pair, and low when the input cue was not a member of a pair. When given a as a cue it should produce b and a high strength. When given g as a cue it should give a low strength. The item it outputs is unimportant when the strength is low.

The convolution trace stores only a few associations or items, and the item memory stores many items. The item memory acts as an auto-associator to clean up the noisy items retrieved from the convolution trace. The exact method of implementation of the item memory is unimportant. Hopfield networks are probably not a good candidate because of their low capacity. Kanerva networks [Kanerva, 1988] have sufficient capacity, but can only store binary vectors.⁶ For experiments I have been using a nearest neighbor matching memory.
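The letter-pair example above can be sketched as follows. This is my own illustration, using a nearest neighbor item memory over a handful of letters rather than all 26; n = 512 and the seed are arbitrary:

```python
import random

def rand_vec(n):
    # i.i.d. N(0, 1/n) elements, as required for correlation to decode convolution
    return [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def cconv(x, y):
    # circular convolution (the encoding operation)
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(x, y):
    # circular correlation (the approximate inverse used for decoding)
    n = len(x)
    return [sum(x[k] * y[(j + k) % n] for k in range(n)) for j in range(n)]

def cos(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) ** 0.5 * sum(q * q for q in y) ** 0.5)

def clean(noisy, memory):
    # nearest neighbor item memory: closest stored item plus a strength
    name = max(memory, key=lambda k: cos(noisy, memory[k]))
    return name, cos(noisy, memory[name])

random.seed(2)
n = 512
memory = {letter: rand_vec(n) for letter in "abcdefg"}

# t = a (*) b + c (*) d + e (*) f
t = [0.0] * n
for u, v in [("a", "b"), ("c", "d"), ("e", "f")]:
    t = [s + w for s, w in zip(t, cconv(memory[u], memory[v]))]

hit, strength = clean(ccorr(memory["a"], t), memory)  # cue a: retrieves b
miss, weak = clean(ccorr(memory["g"], t), memory)     # g was not a member of any pair
print(hit, strength > weak)
```

The retrieval cued with a is a noisy version of b, but the item memory discriminates it easily; the strength for the unstored cue g is much lower, which is the signal the system uses to reject it.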
5  Representing more complex structure
Pairs of items are easy to represent in any type of associative memory, but convolution memory is also suited
to the representation of more complex structure.
5.1  Sequences
Sequences can be represented in a number of ways using convolution encoding. An entire sequence can be represented in one memory trace (providing the soft capacity limits are not exceeded), or chunking can be used to represent a sequence of any length in a number of memory traces.

Murdock [1983; 1987] proposes a chaining method of representing sequences in a single memory trace, and models a large number of psychological phenomena with it. The technique used stores both item and pair information in the memory trace; for example, if the sequence of vectors to be stored is abc, then the trace is

t = α₁a + α₂b + α₃c + β₁(a ⊛ b) + β₂(b ⊛ c),

where the αᵢ and βᵢ are suitable weighting constants less than 1, generated from two or three underlying parameters, with α₁ the largest. The retrieval of the sequence begins with retrieving the strongest component of the trace, which will be a. From there the retrieval is by chaining — correlating the trace with the current item to retrieve the next item. The end of the sequence is detected when the correlation of the trace with the current item is not similar to any item in the item memory.

Another way to represent sequences is to use the entire previous sequence as context rather than just the previous item [Murdock, 1987]. This makes it possible to store sequences with repetitions of items. To store abc, the trace is:

t = a + a ⊛ b + a ⊛ b ⊛ c.

This type of sequence can be retrieved in a similar way to the previous, except that the retrieval cue must be built up using convolutions. The retrieval of later items in both these representations could be improved by subtracting off prefix components as the items in the sequence are retrieved.

Yet another way to represent sequences is to use a fixed cue for each position of the sequence, so to store abc, the trace is:

t = p₁ ⊛ a + p₂ ⊛ b + p₃ ⊛ c.

The retrieval (and storage) cues pᵢ can be arbitrary or generated in some manner from a single vector, e.g., as the convolutive powers pᵢ = p ⊛ pᵢ₋₁ (with p₁ = p) of a single vector p.

⁶ Although most of this paper assumes items are represented as real vectors, convolution memories also work with binary vectors [Willshaw, 1981].
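The fixed positional-cue representation is the simplest of the three to sketch numerically. This is my own illustration, with arbitrary random cues p1, p2, p3 and the same helper functions as the earlier sketches:

```python
import random

def rand_vec(n):
    return [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def cconv(x, y):
    # circular convolution
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(x, y):
    # circular correlation (approximate inverse of convolution)
    n = len(x)
    return [sum(x[k] * y[(j + k) % n] for k in range(n)) for j in range(n)]

def cos(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) ** 0.5 * sum(q * q for q in y) ** 0.5)

random.seed(3)
n = 512
items = {name: rand_vec(n) for name in "abc"}
cues = [rand_vec(n) for _ in range(3)]  # fixed positional cues p1, p2, p3

# t = p1 (*) a + p2 (*) b + p3 (*) c
t = [0.0] * n
for p, name in zip(cues, "abc"):
    t = [s + w for s, w in zip(t, cconv(p, items[name]))]

# decode each position by correlating with its cue and cleaning up
decoded = []
for p in cues:
    noisy = ccorr(p, t)
    decoded.append(max(items, key=lambda k: cos(noisy, items[k])))
print("".join(decoded))
```

Each position decodes independently of the others, which is why this representation (unlike the chaining methods) does not accumulate errors along the sequence.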
5.2  Chunking of sequences
All of the above methods have soft limits on the length of sequences that can be stored. As the sequences get longer the noise in the retrieved items increases until the items are impossible to identify. This limit can be overcome by chunking — creating new "non-terminal" items representing subsequences [Murdock, 1987].

The second sequence representation method is the most suitable one to do chunking with. Suppose we want to represent the sequence abcdefgh. We can create three new items representing subsequences, e.g., s₁ = a + a ⊛ b + a ⊛ b ⊛ c encoding the subsequence abc, with s₂ and s₃ encoding the remaining letters in the same way. These new items must be added to the item memory and marked in some way as non-terminals. The representation for the whole sequence is:

t = s₁ + s₁ ⊛ s₂ + s₁ ⊛ s₂ ⊛ s₃.

Decoding this chunked sequence is slightly more difficult, requiring the use of a stack and decisions on whether an item is a non-terminal that should be further decoded. A machine to decode such representations is described in section 7.2.

5.3  Variable binding

It is simple to implement variable binding with convolution: convolve the variable representation with the value representation. If it is desired that the representation of a variable binding should be somewhat similar to both the variable and the value, the vectors for those can be added in. This gives r + a + r ⊛ a as the representation of the binding of the value a to the variable r.

This type of variable binding can also be implemented in other types of associative memory, e.g., the triple space of BoltzCONS [Touretzky and Hinton, 1985], or the outer product of roles and fillers in DUCS [Touretzky and Geva, 1987]. However, in those systems the variable and value objects were of a different dimension than the binding object. Thus it was not possible to add components of the variable and the value to the representation of the binding, nor to use the binding itself as an item in another association.

5.4  Frame-slot representations

Frames can be represented using convolution encoding in an analogous manner to cross products of roles and fillers in [Hinton, 1981] or the frames of DUCS [Touretzky and Geva, 1987]. A frame consists of a frame label and a set of roles, each represented by a vector. An instantiated frame is the sum of the frame label and the roles (slots) convolved with their respective fillers. For example, suppose we have a (very simplified) frame for "seeing". The vector for the frame label is L_see and the vectors for the roles are r_agent and r_object. This frame could be instantiated with the fillers f_jane and f_spot, to represent "Jane saw Spot":

L_see + r_agent ⊛ f_jane + r_object ⊛ f_spot.

Fillers (or roles) can be retrieved from the instantiated frame by correlating with the role (or filler). The vectors representing roles could be frame specific, i.e., r_agent-see could be different from the agent roles of other frames, or they could be the same (or just similar). Uninstantiated frames can also be stored as the sums of the vectors representing their components, e.g., L_see + r_agent + r_object. Section 7.1 describes one way of manipulating uninstantiated frames and selecting appropriate roles to fill.

The frame representation can be made similar to any vector by adding some of that vector to it. For example, we could add some of f_jane to the above instantiated frame to make the representation for Jane doing something have some similarity to the representation for Jane.
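The "Jane saw Spot" frame can be sketched numerically. This is my own illustration (random label, role, and filler vectors; helpers as in the earlier sketches):

```python
import random

def rand_vec(n):
    return [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def cconv(x, y):
    # circular convolution
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(x, y):
    # circular correlation (approximate inverse of convolution)
    n = len(x)
    return [sum(x[k] * y[(j + k) % n] for k in range(n)) for j in range(n)]

def cos(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) ** 0.5 * sum(q * q for q in y) ** 0.5)

random.seed(4)
n = 512
L_see, r_agent, r_object = rand_vec(n), rand_vec(n), rand_vec(n)
fillers = {"jane": rand_vec(n), "spot": rand_vec(n)}

# instantiated frame: label plus each role convolved with its filler
frame = [l + p + q for l, p, q in zip(L_see,
                                      cconv(r_agent, fillers["jane"]),
                                      cconv(r_object, fillers["spot"]))]

# retrieve the agent filler by correlating the frame with the agent role
noisy = ccorr(r_agent, frame)
agent = max(fillers, key=lambda k: cos(noisy, fillers[k]))
print(agent)
```

The role acts exactly like a variable in section 5.3: convolution binds it to its filler, and correlation with the role unbinds a noisy copy of the filler, which the item memory then cleans up.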
6  Reduced Representations
Using ' h e types of representations described in the last
Lection, it is a t r i v i a l step to b u i l d i n g reduced representations which can represent complex hierarchical structure in a fixed w i d t h vector. We can use an instantiated
f r a m e 7 as a filler instead of f s p o t in the frame b u i l t in the
previous section. For example, "Spot r a n . " :
7
Normalization of lengths of vectors becomes an issue, but
1 do not consider this for lack of space.
The machine that accomplishes this operation is shown in Figure 4.
This representation can be manipulated with or without chunking. Without chunking, we could extract the agent of the object by correlating with r_object ⊛ r_agent. Using chunking, we could extract the object by correlating with r_object, clean it up, and then extract its agent, giving a less noisy vector than without chunking.

This implements Hinton's idea [1990] of a system being able to focus attention on constituents as well as being able to have the whole meaning present at once. It also suggests the possibility of sacrificing accuracy for speed — if chunks are not cleaned up the retrievals are less accurate.
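Continuing the numerical sketch from the previous section (again my own illustration, with generic role vectors shared between the two frames), the agent of the object can be extracted in one step, without chunking:

```python
import random

def rand_vec(n):
    return [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

def cconv(x, y):
    # circular convolution
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

def ccorr(x, y):
    # circular correlation (approximate inverse of convolution)
    n = len(x)
    return [sum(x[k] * y[(j + k) % n] for k in range(n)) for j in range(n)]

def cos(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) ** 0.5 * sum(q * q for q in y) ** 0.5)

random.seed(5)
n = 512
L_see, L_run, r_agent, r_object = (rand_vec(n) for _ in range(4))
fillers = {"jane": rand_vec(n), "spot": rand_vec(n)}

# inner frame "Spot ran.": s2 = L_run + r_agent (*) f_spot
s2 = [l + p for l, p in zip(L_run, cconv(r_agent, fillers["spot"]))]

# outer frame with the inner frame as the object filler:
# t = L_see + r_agent (*) f_jane + r_object (*) s2
t = [l + p + q for l, p, q in zip(L_see,
                                  cconv(r_agent, fillers["jane"]),
                                  cconv(r_object, s2))]

# without chunking: correlate with r_object (*) r_agent in a single step
probe = cconv(r_object, r_agent)
noisy = ccorr(probe, t)
agent_of_object = max(fillers, key=lambda k: cos(noisy, fillers[k]))
print(agent_of_object)
```

Note that t has exactly n elements even though it encodes a two-level structure; the one-step retrieval is noisier than decoding via a cleaned-up chunk, which is the accuracy-for-speed trade-off mentioned above.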
7  Simple Machines that use HRRs

In this section two simple machines that operate on complex convolution representations are described. Both of these machines have been successfully simulated on a convolution calculator using vectors with 1024 elements.
7.2  Chunked sequence readout machine
A machine that reads out the chunked sequences described in section 5.2 can be built using two buffers, a stack, a classifier, a correlator, a clean up memory, and three gating paths. The classifier tells whether the item most prominent in the trace is a terminal, a non-terminal (chunk) or nothing. At each iteration the machine executes one of three action sequences depending on the output of the classifier. The stack could be implemented in any of a number of ways, including the way suggested in [Plate, 1991], or in a network with fast weights. The machine is shown in Figure 5.

The control loop for the chunked sequence readout machine is:

Loop: (until stack gives END signal)
    Clean up the trace to recover the most prominent item: x = Clean(t).
    Classify x as a terminal, non-terminal, or nothing (in which case "pop" is the appropriate action) and do the appropriate one of the following action sequences.

This machine is an example of a system that can have the whole meaning present and that can also focus attention on constituents.
8  Mathematical properties

Mathematical properties of circular convolution and correlation are discussed in Plate [1991], including: algebraic properties; the reason convolution is an approximate inverse for correlation; the existence and usefulness of exact inverses of convolutions; and the variances of dot products of various convolution products.
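Two of these properties are easy to verify numerically: circular convolution is an elementwise product in the frequency domain (which is why FFTs compute it in O(n log n)), and an exact inverse exists whenever no frequency component of a vector is zero. A sketch of my own, using a naive O(n²) DFT for clarity rather than an FFT:

```python
import cmath
import random

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)).real / n
            for j in range(n)]

def cconv(x, y):
    # circular convolution, computed directly in the time domain
    n = len(x)
    return [sum(x[k] * y[(j - k) % n] for k in range(n)) for j in range(n)]

random.seed(6)
n = 32
x = [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]
y = [random.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)]

t = cconv(x, y)

# convolution theorem: DFT(x (*) y) = DFT(x) * DFT(y) elementwise
t2 = idft([u * v for u, v in zip(dft(x), dft(y))])
assert max(abs(p - q) for p, q in zip(t, t2)) < 1e-9

# exact inverse: dividing out x's spectrum recovers y up to rounding error,
# provided no frequency component of x is zero
y2 = idft([u / v for u, v in zip(dft(t), dft(x))])
assert max(abs(p - q) for p, q in zip(y, y2)) < 1e-6
```

The exact inverse amplifies whatever falls on small frequency components, so for noisy traces the approximate inverse (correlation) is generally the safer decoder.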
Figure 5: A chunked sequence readout machine.

9  Discussion

Circular convolution is a bilinear operation, and one consequence of the linearity is low storage efficiency. However, the storage efficiency is high enough to be usable and scales linearly. Convolution is endowed with several positive features by virtue of its linear properties. One is that it can be computed very quickly using FFTs. Another is that analysis of the capacity, scaling, and generalization properties is straightforward. Another is that there is a possibility that a system using HRRs could retain ambiguity while processing ambiguous input.

Convolution could be used as a fixed mapping in a connectionist network to replace one or more of the usual weight-matrix by vector mappings. Activations could be propagated forward very quickly using FFTs, and gradients could be propagated backward very quickly using FFTs as well. Such a network could learn to take advantage of the convolution mapping and could learn distributed representations for its inputs.

Memory models using circular convolution provide a way of representing compositional structure in distributed representations. The operations involved are linear and the properties of the scheme are relatively easy to analyze. There is no learning involved and the scheme works with a wide range of vectors. Systems employing this representation need to have an error-correcting auto-associative memory.

Acknowledgements

Conversations with Jordan Pollack, Janet Metcalfe, and Geoff Hinton have been essential to the development of the ideas expressed in this paper. This research has been supported in part by the Canadian Natural Sciences and Engineering Research Council.

References

[AIJ, 1990] Special issue on connectionist symbol processing. Artificial Intelligence, 46(1-2), 1990.

[Fodor and Pylyshyn, 1988] J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3-71, 1988.

[Hinton, 1981] G. E. Hinton. Implementing semantic networks in parallel hardware. In Parallel Models of Associative Memory. Hillsdale, NJ: Erlbaum, 1981.

[Hinton, 1984] G. E. Hinton. Distributed representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Pittsburgh, PA, 1984.

[Hinton, 1990] G. E. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1-2):47-76, 1990.

[Hopfield, 1982] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences U.S.A., 79:2554-2558, 1982.

[Kanerva, 1988] P. Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, 1988.

[Metcalfe Eich, 1982] Janet Metcalfe Eich. A composite holographic associative recall model. Psychological Review, 89:627-661, 1982.

[Murdock, 1982] Bennet B. Murdock. A theory for the storage and retrieval of item and associative information. Psychological Review, 89(6):609-626, 1982.

[Murdock, 1983] B. B. Murdock. A distributed memory model for serial-order information. Psychological Review, 90(4):316-338, 1983.

[Murdock, 1987] Bennet B. Murdock. Serial-order effects in a distributed-memory model. In David S. Gorfein and Robert R. Hoffman, editors, MEMORY AND LEARNING: The Ebbinghaus Centennial Conference, pages 277-310. Lawrence Erlbaum Associates, 1987.

[Pike, 1984] Ray Pike. Comparison of convolution and matrix distributed memory systems for associative recall and recognition. Psychological Review, 91(3):281-294, 1984.

[Plate, 1991] T. A. Plate. Holographic reduced representations: Convolution algebra for compositional distributed representations. Technical Report CRG-TR-91-1, University of Toronto, 1991.

[Schonemann, 1987] P. H. Schonemann. Some algebraic relations between involutions, convolutions, and correlations, with applications to holographic memories. Biological Cybernetics, 56:367-374, 1987.

[Smolensky, 1990] P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159-216, 1990.

[Touretzky and Geva, 1987] D. S. Touretzky and S. Geva. A distributed connectionist representation for concept structures. In Proceedings of the Ninth Annual Cognitive Science Society Conference. Cognitive Science Society, 1987.

[Touretzky and Hinton, 1985] D. S. Touretzky and G. E. Hinton. Symbols among the neurons: Details of a connectionist inference architecture. In IJCAI 9, pages 238-243, 1985.

[Willshaw, 1981] D. Willshaw. Holography, associative memory, and inductive generalization. In Parallel Models of Associative Memory. Erlbaum, Hillsdale, NJ, 1981.