
Preventing Catastrophic Interference in Multiple-Sequence Learning
Using Coupled Reverberating Elman Networks
Bernard Ans, Stéphane Rousset,
Robert M. French & Serban Musca
(European Commission grant HPRN-CT-1999-00065)
The Problem of Multiple-Sequence Learning
• Real cognition requires the ability to learn sequences of patterns (or actions). (This is why SRNs – Elman networks – were originally developed.)
• But learning sequences really means being able to
learn multiple sequences without the most recently
learned ones erasing the previously learned ones.
• Catastrophic interference is a serious problem for the
sequential learning of individual patterns. It is far
worse when multiple sequences of patterns have to
be learned consecutively.
The Solution
• We have developed a “dual-network”
system using coupled Elman networks that
completely solves this problem.
• These two separate networks exchange
information by means of “reverberated
pseudopatterns.”
Pseudopatterns
• Assume a network-in-a-box learns a series of patterns produced
by a function f(x).
• These original patterns are no longer available.
How can you approximate f(x)?
[Diagram: a neural network (inputs at the bottom, outputs at the top) has learned patterns drawn from f(x). A random input (e.g., 1 0 0 1 1) is then fed through the network and the associated output (e.g., 1 1 0) is collected.]

This creates a pseudopattern:
ψ1: 1 0 0 1 1 → 1 1 0

A large enough collection of these pseudopatterns:
ψ1: 1 0 0 1 1 → 1 1 0
ψ2: 1 1 0 0 0 → 0 1 1
ψ3: 0 0 0 1 0 → 1 0 0
ψ4: 0 1 1 1 1 → 0 0 0
etc.
will approximate the originally learned function.
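A minimal sketch of this idea in Python (not the poster's code; the function and variable names, and the toy stand-in for the trained network, are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_pseudopatterns(forward, n_inputs, n_patterns):
    """Draw random binary inputs and record the trained network's outputs.

    `forward` is any callable mapping an input vector to an output vector
    (the already-trained "network-in-a-box" whose function we want to copy).
    """
    pseudopatterns = []
    for _ in range(n_patterns):
        x = rng.integers(0, 2, size=n_inputs).astype(float)  # random binary input
        y = forward(x)                                        # associated output
        pseudopatterns.append((x, y))
    return pseudopatterns

# Toy stand-in for the trained network: any fixed 5-input, 3-output function works.
toy_forward = lambda x: (x[:3] + x[2:]) % 2
psi = generate_pseudopatterns(toy_forward, n_inputs=5, n_patterns=4)
```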
Transferring information from Net 1 to Net 2
with pseudopatterns
[Diagram: a random input (1 0 0 1 1) is fed to Net 1; Net 1's associated output (1 1 0) is then used as the teacher (target) for Net 2, which receives the same input 1 0 0 1 1.]
Information transfer by pseudopatterns in
dual-network systems
• New information is presented to one network (Net 1).
• Pseudopatterns are generated by Net 2 where previously learned
information is stored.
• Net 1 then trains not only on the new pattern(s) to be learned,
but also on the pseudopatterns produced by Net 2.
• Once Net 1 has learned the new information, it generates (lots of) pseudopatterns that train Net 2.
This is why we say that information is continually
transferred between the two networks by means of
pseudopatterns.
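A rough sketch of this exchange schedule (not the authors' implementation: TinyNet, its delta-rule training, and all counts here are simplified stand-ins chosen for brevity; the real Nets 1 and 2 are backpropagation networks):

```python
import numpy as np

rng = np.random.default_rng(1)

class TinyNet:
    """A deliberately tiny stand-in for each network: a linear map trained
    by the delta rule. It only illustrates the *schedule* of information
    exchange between the two networks."""
    def __init__(self, n_in, n_out, lr=0.05):
        self.W = np.zeros((n_out, n_in))
        self.lr = lr

    def forward(self, x):
        return self.W @ x

    def train(self, patterns):
        for x, target in patterns:
            self.W += self.lr * np.outer(target - self.forward(x), x)

    def pseudopatterns(self, n, n_in):
        out = []
        for _ in range(n):
            x = rng.integers(0, 2, n_in).astype(float)   # random input
            out.append((x, self.forward(x)))             # associated output
        return out

n_in, n_out = 5, 3
net1, net2 = TinyNet(n_in, n_out), TinyNet(n_in, n_out)

new_patterns = [(rng.integers(0, 2, n_in).astype(float),
                 rng.integers(0, 2, n_out).astype(float)) for _ in range(3)]

for _ in range(100):
    rehearsal = net2.pseudopatterns(20, n_in)   # old knowledge, replayed by Net 2
    net1.train(new_patterns + rehearsal)        # Net 1: new patterns + rehearsal
net2.train(net1.pseudopatterns(1000, n_in))     # consolidate Net 1 back into Net 2
```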
Are all pseudopatterns created equal?
No.
Even though the simple dual-network system
(i.e., new learning in one network; long-term
storage in the other) using simple
pseudopatterns does eliminate catastrophic
interference, we can do better using
“reverberated” pseudopatterns.
Building a Network that uses “reverberated”
pseudopatterns.
Start with a standard backpropagation network (input layer, hidden layer, output layer).

Add an autoassociator: extra output units that reproduce the input pattern alongside the usual output.

A new pattern to be learned, P: Input → Target, is learned by associating the Input with both the Target (on the ordinary output units) and a copy of the Input (on the autoassociative output units).
What are “reverberated pseudopatterns” and how are they generated?

We start with a random input î_0, feed it through the network, and collect the output on the autoassociative side of the network. This output is fed back into the input layer (“reverberated”) and, again, the output on the autoassociative side is collected. This is done R times.
[Diagram: î_0 → î_1 → î_2 → î_3 → … → î_R, each reverberation feeding the autoassociative output back in as the next input.]
After R reverberations, we associate the reverberated
input and the “target” output.
This forms the reverberated pseudopattern:
ψ: î_R → t̂
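A minimal sketch of reverberated pseudopattern generation (the function names auto_forward and target_forward are ours, not the poster's, and the toy functions only stand in for the two output sides of a trained network):

```python
import numpy as np

rng = np.random.default_rng(2)

def reverberated_pseudopattern(auto_forward, target_forward, n_in, R=5):
    """Generate one reverberated pseudopattern.

    auto_forward(x)   -> the autoassociative output (the network's
                         reconstruction of its input)
    target_forward(x) -> the ordinary target output
    """
    i_hat = rng.integers(0, 2, size=n_in).astype(float)  # î_0: random input
    for _ in range(R):
        i_hat = auto_forward(i_hat)                       # î_1, î_2, ..., î_R
    t_hat = target_forward(i_hat)                         # t̂ associated with î_R
    return i_hat, t_hat                                   # pseudopattern ψ: î_R -> t̂

# Toy stand-ins with the right shapes, purely for illustration:
auto = lambda x: np.round(np.clip(x + rng.normal(0, 0.1, x.shape), 0, 1))
target = lambda x: x[:3]
psi = reverberated_pseudopattern(auto, target, n_in=5, R=5)
```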
This dual-network approach, using reverberated-pseudopattern information transfer between the two networks, effectively overcomes catastrophic interference in multiple-pattern learning.

Net 1: new-learning network
Net 2: storage network
But what about multiple-sequence learning?
• Elman networks are designed to learn sequences of patterns.
But they forget catastrophically when they attempt to learn
multiple sequences.
• Can we generalize the dual-network, reverberated pseudopattern technique to dual Elman networks and eliminate catastrophic interference in multiple-sequence learning? Yes.
Elman networks
(a.k.a. Simple Recurrent Networks)
[Diagram: output S(t+1); hidden layer H(t); inputs are the standard input S(t) plus the context units H(t-1), a copy of the hidden-unit activations from the previous time step.]
Learning a sequence S(1), S(2), …, S(n).
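A minimal numpy sketch of the Elman forward pass (the weight names and layer sizes are illustrative, not the poster's):

```python
import numpy as np

rng = np.random.default_rng(3)

def srn_forward(sequence, W_ih, W_ch, W_ho):
    """One pass of an Elman network over a sequence.

    At each step the hidden state H(t) is computed from the standard input
    S(t) and the context H(t-1) (a copy of the previous hidden activations);
    the output is the network's prediction of S(t+1).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h_prev = np.zeros(W_ih.shape[0])                # context H(t-1), initially zero
    predictions = []
    for s_t in sequence:
        h_t = sigmoid(W_ih @ s_t + W_ch @ h_prev)   # hidden H(t)
        predictions.append(sigmoid(W_ho @ h_t))     # predicted S(t+1)
        h_prev = h_t                                # copy hidden -> context
    return predictions

# Illustrative sizes: 100-bit items (as in the experiments below), 30 hidden units.
n_in, n_hid = 100, 30
W_ih = rng.normal(0, 0.1, (n_hid, n_in))
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))
W_ho = rng.normal(0, 0.1, (n_in, n_hid))
sequence = [rng.integers(0, 2, n_in).astype(float) for _ in range(11)]
predicted_next_items = srn_forward(sequence, W_ih, W_ch, W_ho)
```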
A “Reverberated Simple Recurrent Network” (RSRN):
an Elman network with an autoassociative part
[Diagram: the input layer holds the standard input S(t) and the context H(t-1); the hidden layer computes H(t); the output layer contains “target” nodes predicting S(t+1) (trained on the error against the teacher) plus “autoassociative” nodes that reproduce the input S(t) and H(t-1).]
RSRN technique for sequentially learning two
sequences A(t) and B(t).
• Net 1 learns A(t) completely.
• Reverberated pseudopattern transfer to Net 2.
• Net 1 makes one weight-change pass through B(t).
• Net 2 generates a few “static” reverberated pseudopatterns.
• Net 1 does one learning epoch on these pseudopatterns from Net 2.
• Continue until Net 1 has learned B(t).
• Test how well Net 1 has retained A(t). (A schematic sketch of this schedule follows.)
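The schedule above, as a Python skeleton. This is a sketch of the schedule only: net1 and net2 are assumed to be RSRN objects exposing the methods used below, and those method names (learn_sequence, train_on, one_epoch, reverberated_pseudopatterns, error_on) are ours, not the poster's.

```python
def learn_two_sequences(net1, net2, seq_a, seq_b,
                        n_transfer=10_000, n_static=20, max_rounds=1_000):
    """Schematic RSRN schedule for sequentially learning A(t) then B(t).

    All method names are hypothetical placeholders for an RSRN implementation.
    """
    net1.learn_sequence(seq_a)                                   # learn A(t) completely
    net2.train_on(net1.reverberated_pseudopatterns(n_transfer))  # transfer Net 1 -> Net 2
    for _ in range(max_rounds):
        net1.one_epoch(seq_b)                                    # one pass through B(t)
        psi = net2.reverberated_pseudopatterns(n_static)         # a few "static" pseudopatterns
        net1.one_epoch(psi)                                      # one epoch on them
        if net1.error_on(seq_b) == 0:                            # until B(t) is learned
            break
    return net1.error_on(seq_a)                                  # how much of A(t) was retained?
```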
Two sequences to be learned:
A(0), A(1), … A(10) and B(0), B(1), … B(10)
[Diagram: Net 1 learns (completely) sequence A(0), A(1), …, A(10); Net 2 is the storage network.]
Transferring the learning to Net 2

[Diagram: Net 1 produces 10,000 pseudopatterns such as ψ_Net1: 010110100110010 → 1110010011010. For each one, the input (010110100110010) is fed forward through Net 2, and Net 1's associated output (1110010011010) serves as the teacher for one backpropagation weight change in Net 2.]
For each of the 10,000 pseudopatterns produced
by Net 1, Net 2 makes 1 FF-BP pass.
Learning B(0), B(1), …, B(10) by Net 1

1. Net 1 does ONE learning epoch on sequence B(0), B(1), …, B(10).
2. Net 2 generates a few pseudopatterns.
3. Net 1 does one FF-BP pass on each Net 2 pseudopattern.

Continue until Net 1 has learned B(0), B(1), …, B(10).
Sequences chosen
• Twenty-two distinct random binary vectors of
length 100 are created.
• Half of these vectors are used to produce the first
ordered sequence of items, A, denoted by A(0),
A(1), …, A(10).
• The remaining 11 vectors are used to create a
second sequence of items, B, denoted by B(0),
B(1), …, B(10).
• In order to introduce a degree of ambiguity into
each sequence (so that a simple BP network would
not be able to learn them), we modify each
sequence so that A(8) = A(5) and B(5) = B(1).
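A short sketch of this construction (variable names are ours; with 100-bit random vectors, accidental duplicates are vanishingly unlikely, so the 22 vectors are distinct in practice):

```python
import numpy as np

rng = np.random.default_rng(4)

# 22 random binary vectors of length 100, split into A(0..10) and B(0..10),
# then made ambiguous by repeating one item within each sequence.
vectors = rng.integers(0, 2, size=(22, 100)).astype(float)
A = [vectors[i] for i in range(11)]       # A(0), A(1), ..., A(10)
B = [vectors[11 + i] for i in range(11)]  # B(0), B(1), ..., B(10)
A[8] = A[5].copy()                        # A(8) = A(5): within-sequence ambiguity
B[5] = B[1].copy()                        # B(5) = B(1): within-sequence ambiguity
```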
Test method
• First, sequence A is completely learned by
the network.
• Then sequence B is learned.
• During the course of learning, we monitor
at regular intervals how much of sequence A
has been forgotten by the network.
Normal Elman networks: Catastrophic forgetting
[Figure (a): Recall of Sequence B. Axes: serial position of Sequence B items (1–10), number of learning epochs for Sequence B (0–450), incorrect output units (%). Learning of sequence B (after having previously learned sequence A): by 450 epochs (an epoch corresponds to one pass through the entire sequence), sequence B has been completely learned.]

[Figure (b): Recall of Sequence A. Axes: serial position of Sequence A items (1–10), number of learning epochs for Sequence B (0–450), incorrect output units (%). The number of incorrect units (out of 100) for each serial position of sequence A during learning of sequence B: after 450 epochs, the SRN has, for all intents and purposes, completely forgotten the previously learned sequence A.]
Dual-RSRNs: Catastrophic forgetting is eliminated
Recall performance for sequences B and A during learning of sequence B by a dual-network RSRN.

[Figure (a): Recall of Sequence B. Axes: serial position of Sequence B items (1–10), number of learning epochs for Sequence B (0–400), incorrect output units (%). By 400 epochs, the second sequence B has been completely learned.]

[Figure (b): Recall of Sequence A. Axes: serial position of Sequence A items (1–10), number of learning epochs for Sequence B (0–400), incorrect output units (%). The previously learned sequence A shows virtually no forgetting: catastrophic forgetting of the previously learned sequence A has been completely overcome.]
[Side-by-side comparison: % error on Sequence A while Sequence B is being learned.
Left, Normal Elman Network: massive forgetting of Sequence A.
Right, Dual RSRN: no forgetting of Sequence A.]
Cognitive/Neurobiological plausibility?
• The brain, somehow, does not forget catastrophically.
• Separating new learning from previously learned
information seems necessary.
• McClelland, McNaughton, O’Reilly (1995) have suggested
the hippocampal-neocortical separation may be Nature’s
way of solving this problem.
• Pseudopattern transfer is not so far-fetched if we accept results that claim that neocortical memory consolidation is due, at least in part, to REM sleep.
Conclusions
• The RSRN reverberating dual-network architecture (Ans &
Rousset, 1997, 2000) can be generalized to sequential
learning of multiple temporal sequences.
• When learning multiple sequences of patterns, interleaving simple reverberated input-output pseudopatterns, each of which reflects the entire previously learned sequence(s), reduces (or eliminates entirely) forgetting of the initially learned sequence(s).