
Neural Networks
Part 4
Dan Simon
Cleveland State University
Outline
1. Learning Vector Quantization (LVQ)
2. The Optimal Interpolative Net (OINet)
Learning Vector Quantization (LVQ)
Invented by Teuvo Kohonen in 1981
Same architecture as the Kohonen Self Organizing Map
Supervised learning
[Figure: LVQ network architecture – inputs x1, …, xn fully connected through weights wik to output units y1, …, ym]
LVQ Notation:
x = [x1, …, xn] = training vector
T(x) = target; class or category to which x belongs
wk = weight vector of k-th output unit = [w1k, …, wnk]
a = learning rate
LVQ Algorithm:
Initialize reference vectors (that is, vectors which represent prototype inputs for each class)
while not (termination criterion)
    for each training vector x
        k0 = argmin_k || x – wk ||
        if the class of unit k0 equals T(x) then
            wk0 ← wk0 + a(x – wk0)
        else
            wk0 ← wk0 – a(x – wk0)
        end if
    end for
end while
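The update rule is easy to implement directly. Here is a minimal MATLAB sketch of the loop above (a sketch only, not the LVQ1.m/LVQ2.m/LVQ3.m programs used in the examples that follow; the function signature and the fixed learning rate are assumptions):

```matlab
% Minimal LVQ training sketch (not the course's LVQ1.m program).
% X: q-by-n training vectors (one per row), T: q-by-1 target classes
% W: m-by-n reference vectors, C: m-by-1 classes of the reference vectors
% a: learning rate, nEpochs: number of passes through the training data
function [W, C] = lvq_train(X, T, W, C, a, nEpochs)
    for epoch = 1:nEpochs
        for i = 1:size(X, 1)
            x = X(i, :);
            [~, k0] = min(sum((W - x).^2, 2));   % winning (closest) reference vector
            if C(k0) == T(i)
                W(k0, :) = W(k0, :) + a * (x - W(k0, :));   % move toward x
            else
                W(k0, :) = W(k0, :) - a * (x - W(k0, :));   % move away from x
            end
        end
    end
end
```

With training data like the LVQ1.m example below, one reasonable call is lvq_train(X, T, W0, C0, 0.1, 20) after initializing W0 with one vector from each class.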
LVQ Example
We have three input classes, and training input x is closest to w2.
If x ∈ class 2, then w2 ← w2 + a(x – w2); that is, move w2 towards x.
If x ∉ class 2, then w2 ← w2 – a(x – w2); that is, move w2 away from x.
[Figure: reference vectors w1, w2, w3, the training input x, and the difference vector x – w2]
LVQ reference vector initialization:
1. Use a random selection of training vectors, one from each class (see the sketch below).
2. Use randomly-generated weight vectors.
3. Use a clustering method (e.g., the Kohonen SOM).
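Initialization option 1 might look like this in MATLAB (a sketch; X, T, W0, and C0 are the same illustrative variables as in the earlier sketch):

```matlab
% Initialization option 1: one randomly chosen training vector per class.
classes = unique(T);                      % distinct class labels
W0 = zeros(numel(classes), size(X, 2));   % initial reference vectors
C0 = classes(:);                          % class assigned to each reference vector
for c = 1:numel(classes)
    idx = find(T == classes(c));          % training vectors in this class
    W0(c, :) = X(idx(randi(numel(idx))), :);
end
```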
LVQ Example: LVQ1.m
(1, 1, 0, 0) ∈ Class 1
(0, 0, 0, 1) ∈ Class 2
(0, 0, 1, 1) ∈ Class 2
(1, 0, 0, 0) ∈ Class 1
(0, 1, 1, 0) ∈ Class 2
Final weight vectors:
(1.04, 0.57, 0.04, 0.00)
(0.00, 0.30, 0.62, 0.70)
LVQ Example: LVQ2.m
[Figure: two panels, "Training Data" (left) and "Final Result" (right), plotted on the unit square]
Training data from Fausett, p. 190. Four initial weight vectors are at the corners of the training data.
Final classification results on the training data, and final weight vectors: 14 classification errors after 20 iterations.
LVQ Example: LVQ3.m
[Figure: two panels, "Training Data" (left) and "Final Result" (right)]
Training data from Fausett, p. 190. 20 initial weight vectors are randomly chosen with random classes.
Final classification results on the training data, and final weight vectors: four classification errors after 600 iterations.
In practice it would be better to use our training data to assign the classes of the initial weight vectors.
LVQ Extensions:
[Figure: reference vectors w1, w2, w3, the training input x, and the difference vector x – w2]
The graphical illustration of LVQ gives us some ideas for algorithmic modifications.
• Always move the correct vector towards x, and move the closest p vectors that are incorrect away from x.
• Move incorrect vectors away from x only if they are within a distance threshold.
• Popular modifications are called LVQ2, LVQ2.1, and LVQ3 (not to be confused with the names of our Matlab programs).
LVQ Applications to Control Systems:
• Most LVQ applications involve classification.
• Any classification algorithm can be adapted for control.
• Switching control – switch between control algorithms based on the system features (input type, system parameters, objectives, failure type, …).
• Training rules for a fuzzy controller – if input 1 is Ai and input 2 is Bk, then the output is Cik – LVQ can be used to classify the two inputs.
• User intent recognition – for example, a brain machine interface (BMI) can recognize what the user is trying to do.
The optimal interpolative net (OINet) – March 1992
Pattern classification; M classes; if x ∈ class k, then $y_i = \delta_{ki}$ (Kronecker delta)
The network grows during training, but only as large as needed.
[Figure: OINet architecture – N inputs x1, …, xN; q hidden neurons with input weight vectors v1, …, vq (weights vij); M outputs y1, …, yM with output weights wij]
vi = weight vector to i-th hidden neuron; vi is N-dimensional
vi = prototype; $\{v_i\} \subseteq \{x_i\}$
 (|| v1  x ||) 

T 
T
y W 

W

  (|| vq  x || 


11 
 
 
q1 
 
( M 1)  ( M  q )(q 1)
Suppose we have q training samples: y(xi) = yi, for i  {1, …, q}. Then:
11
T 
[ y1  yq ]  W 
1q

Y  W TG
1q 


qq 
Note if xi = xk for some i  k,
then G is singular
( M  q )  ( M  q )(q  q )
W  G 1Y T
12
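As a sanity check of the batch solution, here is a small MATLAB sketch that builds G and solves for W (the Gaussian basis function and its width are assumptions; the slides do not commit to a particular φ):

```matlab
% Batch solution of Y = W'*G, i.e., W = inv(G)*Y'  (sketch; Gaussian phi assumed)
% X: q-by-n training vectors, Y: M-by-q targets (column i is the target for x_i)
phi = @(d) exp(-d.^2 / 2);                % assumed radial basis function
q = size(X, 1);
G = zeros(q, q);
for i = 1:q
    for k = 1:q
        G(i, k) = phi(norm(X(i, :) - X(k, :)));   % G(i,k) = phi(||v_i - x_k||), {v} = {x}
    end
end
W = G \ Y';                               % G*W = Y'  =>  W = inv(G)*Y'
Yhat = W' * G;                            % reproduces Y on the training set (if G is nonsingular)
```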
The OINet works by selecting a set of {xi} to use as input weights.
These are the prototype vectors {vi}, i = 1, …, p.
Choose a set of {xi} to optimize the output weights W.
These are the subprototypes {zi}, i = 1, …, l.
Include xi in {zi} only if needed to correctly classify xi.
Include xi in {vi} only if G is not ill-conditioned, and only if it decreases
the total classification error.
$$Y = W^T G, \qquad G_{ik} = \varphi(\|v_i - z_k\|), \qquad (M \times l) = (M \times p)(p \times l)$$

Use the l subprototypes as the training inputs, and p hidden neurons.

$$W^T = Y G^T (G G^T)^{-1} = Y G^T R^{-1}$$

In more explicit notation:

$$(W_p^l)^T = Y_l (G_p^l)^T (R_p^l)^{-1}$$
$$[\,y_1 \;\cdots\; y_l\,] = (W_p^l)^T \begin{bmatrix} \varphi(\|v_1 - z_1\|) & \cdots & \varphi(\|v_1 - z_l\|) \\ \vdots & & \vdots \\ \varphi(\|v_p - z_1\|) & \cdots & \varphi(\|v_p - z_l\|) \end{bmatrix}$$

$$Y_l = (W_p^l)^T G_p^l, \qquad (M \times l) = (M \times p)(p \times l)$$
Suppose we have trained the network for a certain l and p.
All training inputs considered so far have been correctly classified.
We then consider xi, the next input in the training set.
Is xi correctly classified with the existing OINet?
If so, everything is fine, and we move on to the next training input.
If not, then we need to add xi to the subprototype set {zi} and obtain
a new set of output weights W. We also consider adding xi to the prototype
set {vi}, but only if it does not make G ill-conditioned, and only if it reduces
the error by enough.
Suppose we have trained the network for a certain l and p.
We have $W_p^l$, $G_p^l$, $R_p^l$, and $(R_p^l)^{-1}$.
Suppose $x_i$, the next input in the training set, is not correctly classified.
We need to add $x_i$ to the subprototype set $\{z_i\}$ and retrain W.

$$z_{l+1} = x_i$$

$$[\,y_1 \;\cdots\; y_l \;\; y_{l+1}\,] = (W_p^{l+1})^T \begin{bmatrix} \varphi(\|v_1 - z_1\|) & \cdots & \varphi(\|v_1 - z_l\|) & \varphi(\|v_1 - z_{l+1}\|) \\ \vdots & & \vdots & \vdots \\ \varphi(\|v_p - z_1\|) & \cdots & \varphi(\|v_p - z_l\|) & \varphi(\|v_p - z_{l+1}\|) \end{bmatrix}$$

$$Y_{l+1} = (W_p^{l+1})^T G_p^{l+1}, \qquad (M \times (l+1)) = (M \times p)(p \times (l+1))$$

The new last column of $G_p^{l+1}$ is $k_p^{l+1}$ (Eq. 11).

$$(W_p^{l+1})^T = Y_{l+1} (G_p^{l+1})^T (R_p^{l+1})^{-1}$$
This is going to get expensive if we have lots of data, and if we have to
perform a new matrix inversion every time we add a subprototype.
Note: Equation numbers from here on refer to those in Sin & deFigueiredo
Matrix inversion lemma (prove for homework):
$$(A + B D^{-1} C)^{-1} = A^{-1} - A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}$$

$$R_p^{l+1} = G_p^{l+1} (G_p^{l+1})^T = [\,G_p^l \;\; k_p^{l+1}\,] \begin{bmatrix} (G_p^l)^T \\ (k_p^{l+1})^T \end{bmatrix} = G_p^l (G_p^l)^T + k_p^{l+1} (k_p^{l+1})^T \quad \text{(Eq. 10)}$$

$$= R_p^l + k_p^{l+1} (k_p^{l+1})^T$$

$$(R_p^{l+1})^{-1} = \left( R_p^l + k_p^{l+1} (k_p^{l+1})^T \right)^{-1} = (R_p^l)^{-1} - \frac{(R_p^l)^{-1} k_p^{l+1} (k_p^{l+1})^T (R_p^l)^{-1}}{1 + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1}} \quad \text{(Eq. 17)}$$

We only need scalar division to compute the new inverse, because we already know the old inverse!
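A quick numerical check of Eq. 17 in MATLAB (random test data; this is only a verification sketch):

```matlab
% Verify the rank-one inverse update (Eq. 17) against a direct inverse.
p = 4; l = 6;
G  = randn(p, l);                 % stand-in for G_p^l
k  = randn(p, 1);                 % stand-in for k_p^{l+1}, the new column of G
R  = G * G';                      % R_p^l
Ri = inv(R);                      % the "old" inverse, assumed already known
Rnew   = R + k * k';              % R_p^{l+1} = R_p^l + k*k'
Ri_new = Ri - (Ri * k) * (k' * Ri) / (1 + k' * Ri * k);   % Eq. 17
disp(norm(Ri_new - inv(Rnew)))    % should be near zero
```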
$$(W_p^{l+1})^T = Y_{l+1} (G_p^{l+1})^T (R_p^{l+1})^{-1} \qquad\Longleftrightarrow\qquad W_p^{l+1} = (R_p^{l+1})^{-1} G_p^{l+1} Y_{l+1}^T$$

We can implement this equation directly, but we can also find a recursive solution.

$$W_p^{l+1} = \left[ (R_p^l)^{-1} - \frac{(R_p^l)^{-1} k_p^{l+1} (k_p^{l+1})^T (R_p^l)^{-1}}{1 + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1}} \right] [\,G_p^l \;\; k_p^{l+1}\,] \begin{bmatrix} Y_l^T \\ y_{l+1}^T \end{bmatrix}$$

$$= (R_p^l)^{-1} \left( G_p^l Y_l^T + k_p^{l+1} y_{l+1}^T \right) - \frac{(R_p^l)^{-1} k_p^{l+1} (k_p^{l+1})^T (R_p^l)^{-1} \left( G_p^l Y_l^T + k_p^{l+1} y_{l+1}^T \right)}{1 + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1}}$$

$$= W_p^l + (R_p^l)^{-1} k_p^{l+1} \, \frac{y_{l+1}^T + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1} \, y_{l+1}^T - (k_p^{l+1})^T \left( W_p^l + (R_p^l)^{-1} k_p^{l+1} y_{l+1}^T \right)}{1 + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1}}$$

$$= W_p^l + (R_p^l)^{-1} k_p^{l+1} \, \frac{y_{l+1}^T - (k_p^{l+1})^T W_p^l}{1 + (k_p^{l+1})^T (R_p^l)^{-1} k_p^{l+1}} \qquad \text{(Eq. 20)}$$

We have a recursive equation for the new weight matrix.
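The same kind of numerical check works for Eq. 20 (again only a verification sketch with random data):

```matlab
% Verify the recursive weight update (Eq. 20) against the batch solution.
M = 3; p = 4; l = 6;
G = randn(p, l);  Y = randn(M, l);        % current G_p^l and Y_l
k = randn(p, 1);  y = randn(M, 1);        % new column k_p^{l+1} and new target y_{l+1}
R = G * G';  Ri = inv(R);
W = Ri * G * Y';                          % W_p^l = (R_p^l)^{-1} * G_p^l * Y_l'
Wnew = W + (Ri * k) * (y' - k' * W) / (1 + k' * Ri * k);   % Eq. 20
Gnew = [G k];  Ynew = [Y y];              % augmented data
Wbatch = inv(Gnew * Gnew') * Gnew * Ynew';
disp(norm(Wnew - Wbatch))                 % should be near zero
```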
Now we have to decide if we should add xi to the prototype set {vi}.
We will do this if it does not make G ill-conditioned, and if it reduces
the error by enough.
$$v_{p+1} = x_i$$

$$[\,y_1 \;\cdots\; y_l \;\; y_{l+1}\,] = (W_{p+1}^{l+1})^T \begin{bmatrix} \varphi(\|v_1 - z_1\|) & \cdots & \varphi(\|v_1 - z_l\|) & \varphi(\|v_1 - z_{l+1}\|) \\ \vdots & & \vdots & \vdots \\ \varphi(\|v_p - z_1\|) & \cdots & \varphi(\|v_p - z_l\|) & \varphi(\|v_p - z_{l+1}\|) \\ \varphi(\|v_{p+1} - z_1\|) & \cdots & \varphi(\|v_{p+1} - z_l\|) & \varphi(\|v_{p+1} - z_{l+1}\|) \end{bmatrix}$$

$$Y_{l+1} = (W_{p+1}^{l+1})^T G_{p+1}^{l+1}, \qquad (M \times (l+1)) = (M \times (p+1))((p+1) \times (l+1))$$

I wonder if we can think of something clever to avoid the new matrix inversion …

$$G_{p+1}^{l+1} = \begin{bmatrix} G_p^{l+1} \\ r_{l+1}^T \;\; \Delta_{l+1} \end{bmatrix} \quad \text{(Eqs. 22-24)}$$

$$(W_{p+1}^{l+1})^T = Y_{l+1} (G_{p+1}^{l+1})^T \left[ G_{p+1}^{l+1} (G_{p+1}^{l+1})^T \right]^{-1} = Y_{l+1} (G_{p+1}^{l+1})^T (R_{p+1}^{l+1})^{-1}$$
$$R_{p+1}^{l+1} = G_{p+1}^{l+1} (G_{p+1}^{l+1})^T = \begin{bmatrix} G_p^l & k_p^{l+1} \\ r_{l+1}^T & \Delta_{l+1} \end{bmatrix} \begin{bmatrix} (G_p^l)^T & r_{l+1} \\ (k_p^{l+1})^T & \Delta_{l+1} \end{bmatrix} = \begin{bmatrix} R_p^{l+1} & G_p^l r_{l+1} + \Delta_{l+1} k_p^{l+1} \\ r_{l+1}^T (G_p^l)^T + \Delta_{l+1} (k_p^{l+1})^T & r_{l+1}^T r_{l+1} + \Delta_{l+1}^2 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \quad \text{(Eq. 26)}$$

$$(R_{p+1}^{l+1})^{-1} = \begin{bmatrix} A' & B' \\ C' & D' \end{bmatrix}$$

Another matrix inversion lemma (prove for homework):

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A' & B' \\ C' & D' \end{bmatrix}, \qquad \text{first define } E = D - C A^{-1} B$$

$$A' = A^{-1} + A^{-1} B E^{-1} C A^{-1}, \qquad B' = -A^{-1} B E^{-1}, \qquad C' = -E^{-1} C A^{-1}, \qquad D' = E^{-1}$$
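This lemma is also easy to check numerically; a MATLAB sketch with a random symmetric positive definite test matrix:

```matlab
% Verify the block matrix inversion lemma numerically.
n1 = 4; n2 = 2;
S = randn(n1 + n2);  S = S * S' + eye(n1 + n2);   % random SPD test matrix
A = S(1:n1, 1:n1);        B = S(1:n1, n1+1:end);
C = S(n1+1:end, 1:n1);    D = S(n1+1:end, n1+1:end);
E  = D - C * inv(A) * B;                          % "Schur complement" of A
Ap = inv(A) + inv(A) * B * inv(E) * C * inv(A);
Bp = -inv(A) * B * inv(E);
Cp = -inv(E) * C * inv(A);
Dp = inv(E);
disp(norm([Ap Bp; Cp Dp] - inv(S)))               % should be near zero
```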
Apply the lemma to the partition on the previous slide, with $A = R_p^{l+1}$:

$$A' = (R_p^{l+1})^{-1} + (R_p^{l+1})^{-1} \left( G_p^l r_{l+1} + \Delta_{l+1} k_p^{l+1} \right) \left[ r_{l+1}^T r_{l+1} + \Delta_{l+1}^2 - \left( r_{l+1}^T (G_p^l)^T + \Delta_{l+1} (k_p^{l+1})^T \right) (R_p^{l+1})^{-1} \left( G_p^l r_{l+1} + \Delta_{l+1} k_p^{l+1} \right) \right]^{-1} \left( r_{l+1}^T (G_p^l)^T + \Delta_{l+1} (k_p^{l+1})^T \right) (R_p^{l+1})^{-1}$$

Define:

$$\hat{r} = [\,r_{l+1}^T \;\; \Delta_{l+1}\,]^T \quad \text{(Eq. 29)}, \qquad G_p^l r_{l+1} + \Delta_{l+1} k_p^{l+1} = G_p^{l+1} \hat{r}$$

$$u = (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \quad \text{(Eq. 28)}$$

$$\beta = \hat{r}^T \hat{r} - \hat{r}^T (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \quad \text{(Eq. 30)}$$

Then:

$$A' = (R_p^{l+1})^{-1} + (R_p^{l+1})^{-1} (G_p^{l+1} \hat{r}) (G_p^{l+1} \hat{r})^T (R_p^{l+1})^{-1} / \beta = (R_p^{l+1})^{-1} + u u^T / \beta$$

$$(R_{p+1}^{l+1})^{-1} = \begin{bmatrix} (R_p^{l+1})^{-1} + u u^T / \beta & -u / \beta \\ -u^T / \beta & 1 / \beta \end{bmatrix} \quad \text{(Eq. 27)} \qquad \text{Homework: derive this}$$
We already have everything on the right side, so we can derive the new
inverse with only scalar division (no additional matrix inversion).
Small β → ill-conditioning.
So don't use $x_i$ as a prototype if $\beta < \varepsilon_1$ (threshold).
Even if $\beta > \varepsilon_1$, don't use $x_i$ as a prototype if the error only decreases by a small amount, because it won't be worth the extra network complexity.

$$E_p^{l+1} = Y_{l+1} - (W_p^{l+1})^T G_p^{l+1} \qquad M \times (l+1) \text{ matrix (Eq. 35)}$$

$$E_{p+1}^{l+1} = Y_{l+1} - (W_{p+1}^{l+1})^T G_{p+1}^{l+1} \qquad M \times (l+1) \text{ matrix}$$

$$e_p^{l+1} = \sum_{i,j} \left( E_p^{l+1} \right)_{ij}^2, \qquad e_{p+1}^{l+1} = \sum_{i,j} \left( E_{p+1}^{l+1} \right)_{ij}^2$$

$$\delta e = e_p^{l+1} - e_{p+1}^{l+1} \quad \text{(Eq. 38)}$$

Before we check $\delta e$, let's see if we can find a recursive formula for W …
$$W_{p+1}^{l+1} = (R_{p+1}^{l+1})^{-1} G_{p+1}^{l+1} Y_{l+1}^T = \begin{bmatrix} (R_p^{l+1})^{-1} + u u^T/\beta & -u/\beta \\ -u^T/\beta & 1/\beta \end{bmatrix} \begin{bmatrix} G_p^l & k_p^{l+1} \\ r_{l+1}^T & \Delta_{l+1} \end{bmatrix} \begin{bmatrix} Y_l^T \\ y_{l+1}^T \end{bmatrix}$$

$$= \begin{bmatrix} (R_p^{l+1})^{-1} G_p^l + u u^T G_p^l/\beta - u r_{l+1}^T/\beta & \; (R_p^{l+1})^{-1} k_p^{l+1} + u u^T k_p^{l+1}/\beta - u \Delta_{l+1}/\beta \\ -u^T G_p^l/\beta + r_{l+1}^T/\beta & \; -u^T k_p^{l+1}/\beta + \Delta_{l+1}/\beta \end{bmatrix} \begin{bmatrix} Y_l^T \\ y_{l+1}^T \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$

$$w_1 = (R_p^{l+1})^{-1} G_p^l Y_l^T + u u^T G_p^l Y_l^T/\beta - u r_{l+1}^T Y_l^T/\beta + (R_p^{l+1})^{-1} k_p^{l+1} y_{l+1}^T + u u^T k_p^{l+1} y_{l+1}^T/\beta - u \Delta_{l+1} y_{l+1}^T/\beta$$

$$= W_p^{l+1} - u \hat{r}^T Y_{l+1}^T/\beta + u u^T G_p^{l+1} Y_{l+1}^T/\beta$$

$$= W_p^{l+1} - (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \hat{r}^T Y_{l+1}^T/\beta + (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \hat{r}^T (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} Y_{l+1}^T/\beta$$

$$= W_p^{l+1} - (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \, \hat{r}^T \left[ Y_{l+1}^T - (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} Y_{l+1}^T \right]/\beta$$

$$= W_p^{l+1} - (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \, \hat{r}^T \left[ Y_{l+1}^T - (G_p^{l+1})^T W_p^{l+1} \right]/\beta$$

$$\Psi = E_p^{l+1} \hat{r}/\beta \quad \text{(Eq. 34)}, \qquad w_1 = W_p^{l+1} - (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \, \Psi^T \quad \text{(Eq. 33)}$$
$$w_2 = -u^T G_p^l Y_l^T/\beta + r_{l+1}^T Y_l^T/\beta - u^T k_p^{l+1} y_{l+1}^T/\beta + \Delta_{l+1} y_{l+1}^T/\beta$$

$$= -u^T G_p^{l+1} Y_{l+1}^T/\beta + \hat{r}^T Y_{l+1}^T/\beta$$

$$= \left[ -\hat{r}^T (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} Y_{l+1}^T + \hat{r}^T Y_{l+1}^T \right]/\beta$$

$$= \left[ -\hat{r}^T (G_p^{l+1})^T W_p^{l+1} + \hat{r}^T Y_{l+1}^T \right]/\beta$$

$$= \hat{r}^T \left( Y_{l+1}^T - (G_p^{l+1})^T W_p^{l+1} \right)/\beta = \hat{r}^T (E_p^{l+1})^T/\beta = \Psi^T$$

$$W_{p+1}^{l+1} = \begin{bmatrix} W_p^{l+1} - (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \, \Psi^T \\ \Psi^T \end{bmatrix} \quad \begin{matrix} (p \times M) \\ (1 \times M) \end{matrix} \quad \text{(Eq. 25)}$$

We have a recursive equation for the new weight matrix.
Back to computing $\delta e$ (see Eq. 38):
Suppose $Ax = b$, with dim(x) < dim(b), so the system is over-determined.
Least squares:

$$x = (A^T A)^{-1} A^T b$$

$$e_1 = \| Ax - b \|^2 = \| A (A^T A)^{-1} A^T b - b \|^2 = \| (A (A^T A)^{-1} A^T - I) b \|^2$$

$$= b^T (A (A^T A)^{-1} A^T - I)^T (A (A^T A)^{-1} A^T - I) b$$

$$= b^T \left( A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T - A (A^T A)^{-1} A^T - A (A^T A)^{-1} A^T + I \right) b$$

$$= b^T \left( I - A (A^T A)^{-1} A^T \right) b$$

Now suppose that we add another column to A and another element to x:

$$\hat{A} \hat{x} \approx b, \qquad \hat{A} = [A \;\; a], \qquad \hat{A}^T \hat{A} = \begin{bmatrix} A^T \\ a^T \end{bmatrix} [A \;\; a] = \begin{bmatrix} A^T A & A^T a \\ a^T A & a^T a \end{bmatrix}$$

We have more degrees of freedom in x, so the approximation error should decrease.
Matrix inversion lemma:

$$(\hat{A}^T \hat{A})^{-1} = \begin{bmatrix} (A^T A)^{-1} + g g^T/\gamma & -g/\gamma \\ -g^T/\gamma & 1/\gamma \end{bmatrix}$$

$$\gamma = a^T a - a^T A (A^T A)^{-1} A^T a = a^T \left( I - A (A^T A)^{-1} A^T \right) a, \qquad g = (A^T A)^{-1} A^T a$$

$$e_2 = b^T \left( I - \hat{A} (\hat{A}^T \hat{A})^{-1} \hat{A}^T \right) b$$

$$\delta e = e_1 - e_2 = b^T \hat{A} (\hat{A}^T \hat{A})^{-1} \hat{A}^T b - b^T A (A^T A)^{-1} A^T b$$

But notice:

$$b^T \hat{A} (\hat{A}^T \hat{A})^{-1} \hat{A}^T b = b^T [A \;\; a] \begin{bmatrix} (A^T A)^{-1} + g g^T/\gamma & -g/\gamma \\ -g^T/\gamma & 1/\gamma \end{bmatrix} \begin{bmatrix} A^T \\ a^T \end{bmatrix} b$$

$$= b^T \left( A (A^T A)^{-1} A^T + A g g^T A^T/\gamma - a g^T A^T/\gamma - A g a^T/\gamma + a a^T/\gamma \right) b$$

$$\Longrightarrow \quad \delta e = e_1 - e_2 = b^T \left( A g g^T A^T - a g^T A^T - A g a^T + a a^T \right) b \, / \, \gamma$$
$$\delta e = \left( b^T A g g^T A^T b - 2 b^T a g^T A^T b + b^T a a^T b \right)/\gamma = \left( b^T A g - b^T a \right)^2/\gamma$$

$$= \left( b^T A (A^T A)^{-1} A^T a - b^T a \right)^2/\gamma = \left[ b^T \left( I - A (A^T A)^{-1} A^T \right) a \right]^2/\gamma$$

Now consider $AX \approx B$ (error $e_1$), and $\hat{A}\hat{X} = [A \;\; a]\hat{X} \approx B$ (error $e_2$). $\hat{A}$ has one more column than $A$. Solve for $\hat{X}$.

$$\delta e = e_1 - e_2 = \left\| B^T \left( I - A (A^T A)^{-1} A^T \right) a \right\|^2/\gamma$$

This is like going from $(G_p^{l+1})^T W_p^{l+1} \approx Y_{l+1}^T$ (error $e_1$) to $(G_{p+1}^{l+1})^T W_{p+1}^{l+1} \approx Y_{l+1}^T$ (error $e_2$).
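The error-decrease formula above can also be checked numerically; this MATLAB sketch compares the least-squares errors before and after appending a column to A:

```matlab
% Verify the least-squares error decrease from adding one column to A.
m = 10; n = 3;
A = randn(m, n);  a = randn(m, 1);  b = randn(m, 1);
P  = A * inv(A' * A) * A';                    % projector onto the columns of A
e1 = b' * (eye(m) - P) * b;                   % error using A only
Ahat = [A a];
Phat = Ahat * inv(Ahat' * Ahat) * Ahat';
e2 = b' * (eye(m) - Phat) * b;                % error using the extra column
gamma   = a' * (eye(m) - P) * a;
de_pred = (b' * (eye(m) - P) * a)^2 / gamma;  % predicted decrease
disp([e1 - e2, de_pred])                      % the two numbers should match
```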
Applying that result with $A \rightarrow (G_p^{l+1})^T$, $B \rightarrow Y_{l+1}^T$, $a \rightarrow \hat{r}$, $\gamma \rightarrow \beta$:

$$\delta e = e_1 - e_2 = \left\| Y_{l+1} \left( I - (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} \right) \hat{r} \right\|^2/\beta$$

$$= \left\| \left( Y_{l+1} - (W_p^{l+1})^T G_p^{l+1} \right) \hat{r} \right\|^2/\beta = \left\| E_p^{l+1} \hat{r} \right\|^2/\beta = \beta \, \| \Psi \|^2$$

We have a very simple formula for the error decrease due to adding a prototype. Don't add the prototype unless $\delta e / e_1 > \varepsilon_2$ (threshold).
The OINet Algorithm
Training data {xi} → {yi}, i ∈ {1, …, q}
Initialize prototype set: V = {x1}, and # of prototypes = p = |V| = 1
Initialize subprototype set: Z = {x1}, and # of subprototypes = l = |Z| = 1
$$y_1 = (W_p^l)^T G_p^l = (W_p^l)^T \varphi_{11}, \qquad \varphi_{11} = \varphi(\|x_1 - x_1\|), \qquad W_p^l = y_1^T/\varphi_{11}$$

$$R_p^l = G_p^l (G_p^l)^T = \varphi_{11}^2, \qquad (R_p^l)^{-1} = 1/\varphi_{11}^2$$

What are the dimensions of these quantities?

[Figure: the initial OINet – N inputs x1, …, xN, 1 hidden neuron, and M outputs y1, …, yM]
Begin outer loop: Loop until all training patterns are correctly classified.
n=q–1
Re-index {x2 … xq} & {y2 … yq} from 1 to n
For i = 1 to n (training data loop)
Send xi through the network. If correctly classified, continue loop.
Begin subprototype addition:
$$z_{l+1} = x_i$$
$$k_p^{l+1} = \left[ \varphi(\|v_1 - z_{l+1}\|) \;\cdots\; \varphi(\|v_p - z_{l+1}\|) \right]^T \quad \text{(Eq. 11)}$$
$(R_p^{l+1})^{-1}$ computation (Eq. 17)
$W_p^{l+1}$ computation (Eq. 20)
Consider using $x_i$ as a prototype (note $v_{p+1} = z_{l+1}$):
$$r_{l+1} = \left[ \varphi(\|z_{l+1} - z_1\|) \;\cdots\; \varphi(\|z_{l+1} - z_l\|) \right]^T \quad \text{(Eq. 23)}$$
$$\Delta_{l+1} = \varphi(\|z_{l+1} - z_{l+1}\|) \quad \text{(Eq. 24)}$$
$$\hat{r} = \begin{bmatrix} r_{l+1} \\ \Delta_{l+1} \end{bmatrix} \quad \text{(Eq. 29)}$$
$$G_p^{l+1} = [\,G_p^l \;\; k_p^{l+1}\,] \quad \text{(Eq. 10)}, \qquad u = (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \quad \text{(Eq. 28)}$$
$$\beta = \hat{r}^T \hat{r} - \hat{r}^T (G_p^{l+1})^T (R_p^{l+1})^{-1} G_p^{l+1} \hat{r} \quad \text{(Eq. 30)}$$
$$E_p^{l+1} = Y_{l+1} - (W_p^{l+1})^T G_p^{l+1} \quad \text{(Eq. 35)}$$
$$\Psi = E_p^{l+1} \hat{r}/\beta \quad \text{(Eq. 34)}, \qquad \delta e = \beta \, \| \Psi \|^2 \quad \text{(Eq. 40)}$$
If $\beta > \varepsilon_1$ and $\delta e / e_p^{l+1} > \varepsilon_2$ then use $x_i$ as a prototype:
$$G_{p+1}^{l+1} = \begin{bmatrix} G_p^{l+1} \\ r_{l+1}^T \;\; \Delta_{l+1} \end{bmatrix} \quad \text{(Eq. 22)}$$
Update $W_p^{l+1} \rightarrow W_{p+1}^{l+1}$ (Eq. 25)
Update $R_p^{l+1} \rightarrow R_{p+1}^{l+1}$ (Eq. 26)
Update $(R_p^{l+1})^{-1} \rightarrow (R_{p+1}^{l+1})^{-1}$ (Eq. 27)
p ← p + 1
End prototype addition
l ← l + 1
End subprototype addition
End training data loop
End outer loop
Homework:
• Implement the OINet using FPGA technology for classifying
subatomic particles using experimental data. You may need to
build your own particle accelerator to collect data. Be careful not
to create any black holes.
• Find the typo in Sin and deFigueiredo’s original OINet paper.
References
• L. Fausett, Fundamentals of Neural Networks, Prentice Hall, 1994
• S. Sin and R. deFigueiredo, "An evolution-oriented learning algorithm for the optimal interpolative net," IEEE Transactions on Neural Networks, vol. 3, no. 2, pp. 315–323, March 1992