
Neural Networks
Part 2
Dan Simon
Cleveland State University
Outline
1. Preprocessing
– Input normalization
– Feature selection
– Principal component analysis (PCA)
2. Cascade Correlation
Preprocessing
First things first:
1. Preprocessing
– Input Normalization
– Feature Selection
2. Neural network training
If you use a poor preprocessing algorithm, then it doesn’t
matter what type of neural network you use. And if
you use a good preprocessing algorithm, then it doesn’t
matter what type of neural network you use.
Input normalization for independent variables
Training data: $x_{ni}$
$n \in [1, N]$, $N$ = # of training samples
$i \in [1, d]$, $d$ = input dimension

$\bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni} \quad (i = 1, \ldots, d)$   (mean of each input dimension)

$\sigma_i^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - \bar{x}_i)^2 \quad (i = 1, \ldots, d)$   (variance of each input dimension)

$\tilde{x}_{ni} = \dfrac{x_{ni} - \bar{x}_i}{\sigma_i} \quad (n = 1, \ldots, N), \; (i = 1, \ldots, d)$   (normalized inputs)
Sometimes weight updating takes care of normalization, but
what about weight initialization? Also, recall that RBF activation
is determined by Euclidean distance between input and center,
which we don’t want to be dominated by a single dimension.
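A minimal NumPy sketch of this normalization (the function name, array shapes, and toy data below are illustrative assumptions, not from the slides):

```python
import numpy as np

def normalize_inputs(X):
    """Normalize each input dimension to zero mean and unit variance.

    X : (N, d) array of training inputs, one sample per row.
    Returns the normalized inputs plus the per-dimension mean and standard
    deviation, which are needed to normalize future (test) inputs the same way.
    """
    x_bar = X.mean(axis=0)            # mean of each input dimension
    sigma = X.std(axis=0, ddof=1)     # sample standard deviation, 1/(N-1) version
    return (X - x_bar) / sigma, x_bar, sigma

# Toy data with wildly different scales in each dimension
X = np.random.rand(100, 3) * np.array([1.0, 50.0, 1000.0])
X_tilde, x_bar, sigma = normalize_inputs(X)
print(X_tilde.mean(axis=0))           # approximately 0 in each dimension
print(X_tilde.std(axis=0, ddof=1))    # approximately 1 in each dimension
```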
Input normalization for correlated variables (whitening)
Training data: $x_{ni}$
$n \in [1, N]$, $N$ = # of training samples
$i \in [1, d]$, $d$ = input dimension

$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$   ($d \times 1$ vectors)

$\Sigma = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$   ($d \times d$ covariance)

$\Sigma u_j = \lambda_j u_j \quad (j = 1, \ldots, d)$   (eigenvalue equation)

$U = [u_1 \; \cdots \; u_d], \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$

$\tilde{x}_n = \Lambda^{-1/2} U^T (x_n - \bar{x})$   (normalized inputs)
What is the mean and covariance of the normalized inputs?
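A small NumPy sketch of the whitening transform (the function name and the toy correlated Gaussian data are illustrative assumptions); it also answers the question above empirically:

```python
import numpy as np

def whiten(X):
    """Whiten correlated inputs so they have zero mean and identity covariance.

    X : (N, d) array of training inputs, one sample per row.
    Assumes the sample covariance is full rank (no zero eigenvalues).
    """
    x_bar = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)        # (d, d) sample covariance, 1/(N-1)
    lam, U = np.linalg.eigh(Sigma)         # eigenvalues / orthonormal eigenvectors
    W = np.diag(lam ** -0.5) @ U.T         # whitening matrix  Lambda^{-1/2} U^T
    return (X - x_bar) @ W.T

# Toy correlated 2-D data
X = np.random.multivariate_normal([1.0, 2.0], [[4.0, 1.9], [1.9, 1.0]], size=500)
Z = whiten(X)
print(np.round(Z.mean(axis=0), 6))           # ~ [0, 0]
print(np.round(np.cov(Z, rowvar=False), 6))  # ~ identity matrix
```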
[Figure: original distribution vs. whitened distribution]
Feature Selection: What features of the input data should we
use as inputs to the neural network?
Example: A 256 × 256 bitmap for character recognition gives
65,536 input neurons!
One Solution:
• Problem-specific clustering of features – for example,
“super pixels” in image processing
We went from 256 to 64 features.
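One way to implement the super-pixel idea, as a sketch: the 16 × 16 bitmap size, the 2 × 2 block size, and the pooling rule (a super pixel is on if any pixel in its block is on) are assumptions chosen so that 256 features become 64.

```python
import numpy as np

def superpixels(bitmap, block=2):
    """Group a square binary bitmap into non-overlapping block x block
    'super pixels'; each super pixel is 1 if any pixel in its block is 1.
    Returns the flattened super-pixel feature vector."""
    n = bitmap.shape[0]
    assert bitmap.shape == (n, n) and n % block == 0
    pooled = bitmap.reshape(n // block, block, n // block, block).max(axis=(1, 3))
    return pooled.ravel()

bitmap = (np.random.rand(16, 16) > 0.5).astype(int)  # 256 binary features
features = superpixels(bitmap, block=2)               # 64 super-pixel features
print(bitmap.size, "->", features.size)               # 256 -> 64
```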
Feature Selection: What features of the input data should we
use as inputs to the neural network?
Expert Knowledge: Use clever problem-dependent approaches
[Figure: the 64 binary pixels grouped into 16 regions, each replaced by an integer-valued feature]
We went from 64 to 16 features. The old features were binary, and the new features are not.
Clever problem-dependent approach to feature selection
Example: Use ECG data to diagnose heart disease
We have 24 hours of data at
500 Hz.
Cardiologists tell us that
primary indicators include:
• P wave duration
• P wave amplitude
• P wave energy
• P wave inflection point
This gives us a neural network
with four inputs
Feature Selection: If you reduce the number of features, make
sure you don’t lose too much information!
[Figure: two different inputs whose reduced feature vectors are both 8 8 6 6 8 8 — the reduction can no longer tell them apart]
Feature Selection in the case of no expert knowledge
Brute Force Search: If we want to reduce the number of
features from M to N, we find all possible N-element subsets of
the M features, and check neural network performance for
each subset.
How many N-element subsets can be taken from an M-element set?

Binomial coefficient: $\binom{M}{N} = \dfrac{M!}{N!\,(M-N)!}$

Example: $M = 64$, $N = 8$  ⇒  $\binom{64}{8} \approx 4.4$ billion
Hmm, I wonder if there are better ways …
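A quick check of that count (the loop that would actually train a network for every subset is only sketched in a comment, with a hypothetical train_and_evaluate helper):

```python
from itertools import combinations
from math import comb

M, N = 64, 8
print(comb(M, N))      # 4426165368 subsets -- about 4.4 billion

# Brute force would train and score one neural network per subset, e.g.:
# best = max(combinations(range(M), N),
#            key=lambda subset: train_and_evaluate(subset))  # hypothetical helper
```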
Feature Selection:
• Branch and Bound Method
This is based on the idea that deleting features cannot improve
performance.
Suppose we want to reduce the number of features from 5 to 2.
First, pick a “reasonable” pair of features (say, 1 and 2).
Train an ANN with those features.
This gives a performance threshold J.
Create a tree of eliminated features.
Move down the tree to accumulate deleted features.
Evaluate ANN performance P at each node of the tree.
If P < J, there is no need to consider that branch any further.
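A minimal sketch of this search: the evaluate callable stands in for "train an ANN on this feature subset and measure its performance," and is assumed to satisfy the monotonicity property above (deleting features cannot improve performance).

```python
def branch_and_bound(features, n_keep, evaluate):
    """Branch-and-bound feature selection: keep n_keep of the given features.

    evaluate : callable mapping a tuple of feature indices -> performance
               (higher is better); assumed monotonic in the feature set.
    """
    features = list(features)
    # Threshold J from a "reasonable" starting subset (here: the first n_keep features)
    best_subset = tuple(features[:n_keep])
    J = evaluate(best_subset)

    def descend(remaining, start):
        nonlocal best_subset, J
        P = evaluate(tuple(remaining))
        if P < J:                        # bound: deleting more features cannot help
            return
        if len(remaining) == n_keep:     # leaf: candidate subset of the desired size
            best_subset, J = tuple(remaining), P
            return
        # branch: delete one more feature; 'start' keeps deletions in increasing
        # position order so no subset is ever visited twice
        for idx in range(start, len(remaining)):
            descend(remaining[:idx] + remaining[idx + 1:], idx)

    descend(features, 0)
    return best_subset, J
```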
Branch and Bound
We want a reduction from 5 to 2 features. Optimal method, but it may require a lot of effort.
[Figure: search tree of deleted features, with nodes A, B, and C marked; the leaves correspond to 2-feature subsets]
Use features 1 and 2 to obtain performance at node A.
We find that B is worse than A, so no need to evaluate below B.
We also find C is worse than A, so no need to evaluate below C.
Feature Selection:
• Sequential Forward Selection
Find the feature f1 that gives the best performance
Find the feature f2 such that (f1 , f2 ) gives the best performance
Repeat for as many features as desired
Example: Find the best 3 out of 5 available features
Round 1: evaluate features 1, 2, 3, 4, 5 individually; 2 is best.
Round 2: evaluate {2, 1}, {2, 3}, {2, 4}, {2, 5}; {2, 4} is best.
Round 3: evaluate {2, 4, 1}, {2, 4, 3}, {2, 4, 5}; {2, 4, 1} is best.
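A sketch of sequential forward selection (again, the evaluate callable is a stand-in for training and scoring a network on the given subset):

```python
def sequential_forward_selection(all_features, n_select, evaluate):
    """Greedily grow the feature set, always adding whichever remaining
    feature gives the best performance in combination with those already chosen."""
    selected, remaining = [], list(all_features)
    while len(selected) < n_select:
        best_f = max(remaining, key=lambda f: evaluate(tuple(selected + [f])))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```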
Problem with sequential forward selection:
There may be two features such that either one alone provides little information, but the combination provides a lot of information.
[Figure: Class 1 and Class 2 plotted in the (x1, x2) plane]
Neither feature, if used alone,
provides information about the
class. But both features in
combination provide a lot of
information.
Feature Selection:
• Sequential Backward Elimination
Start with all features.
Eliminate the one that provides the least information.
Repeat until the desired number of features is obtained.
Example: Find the best 3 out of 5 available features
Start with {1, 2, 3, 4, 5}: use all features for best performance.
Round 1: evaluate {1,2,3,4}, {1,2,3,5}, {1,2,4,5}, {1,3,4,5}, {2,3,4,5}; eliminating feature 4 results in the least loss of performance, so keep {1, 2, 3, 5}.
Round 2: evaluate {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}; eliminating feature 1 results in the least loss of performance, so keep {2, 3, 5}.
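A sketch of sequential backward elimination (same evaluate stand-in as before):

```python
def sequential_backward_elimination(all_features, n_select, evaluate):
    """Start from all features and repeatedly drop the feature whose removal
    hurts performance the least, until n_select features remain."""
    selected = list(all_features)
    while len(selected) > n_select:
        drop = max(selected,
                   key=lambda f: evaluate(tuple(g for g in selected if g != f)))
        selected.remove(drop)
    return selected
```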
Principal Component Analysis (PCA)
This is not a feature selection method, but a feature
reduction method.
[Figure: Class 1 and Class 2 plotted in the (x1, x2) plane]
This is a reduced-dimension
problem, but no single feature
gives us enough information
about the classes.
Principal Component Analysis (PCA)
We are given input vectors xn (n=1,…N), and each vector
contains d elements
Goal: map xn vectors to zn vectors, where each zn vector has M
elements, and M < d.
$x_n = \sum_{i=1}^{d} z_{ni} u_i$,   $\{u_i\}$ = orthonormal basis vectors

Since $u_i^T u_k = \delta_{ik}$, we see that $z_{ni} = u_i^T x_n$.

$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{d} b_i u_i$,   $\{b_i\}$ = constants TBD

$x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_{ni} - b_i) u_i$

$E_M = \frac{1}{2} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_{ni} - b_i)^2$

We want to minimize $E_M$.
$z_{ni} = u_i^T x_n$

$E_M = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_{ni} - b_i)^2$

$\dfrac{dE_M}{db_i} = -\sum_{n=1}^{N} (z_{ni} - b_i) = 0 \;\;\Rightarrow\;\; b_i = \frac{1}{N} \sum_{n=1}^{N} z_{ni} = \frac{1}{N} \sum_{n=1}^{N} u_i^T x_n = u_i^T \bar{x}, \quad \text{where } \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$

$E_M = \frac{1}{2} \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left[ u_i^T (x_n - \bar{x}) \right]^2 = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T \left[ \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \right] u_i = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T P u_i$

This defines $P$ ($d \times d$ matrix).

We found the best $\{b_i\}$ values. What are the best $\{u_i\}$ vectors?
$E_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T P u_i, \qquad P = \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$

We want to minimize $E_M$ with respect to $\{u_i\}$.
$\{u_i\} = 0$ would work …
We need to constrain $\{u_i\}$ to be a set of orthonormal basis vectors; that is, $u_i^T u_k = \delta_{ik}$. Constrained optimization:

$\hat{E}_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T P u_i - \frac{1}{2} \sum_{i=M+1}^{d} \sum_{k=M+1}^{d} \mu_{ik} (u_i^T u_k - \delta_{ik})$,   $\mu_{ik}$ = Lagrange multipliers

$= \frac{1}{2} \operatorname{Tr}(U^T P U) - \frac{1}{2} \operatorname{Tr}[M (U^T U - I)]$,   with $M = M^T$ w.l.o.g.

where $U = [u_{M+1} \; \cdots \; u_d]$, $M = [\mu_{ik}]$, $I$ = identity matrix
$U$ is a $d \times (d-M)$ matrix, $M$ is a $(d-M) \times (d-M)$ matrix
$\hat{E}_M = \frac{1}{2} \operatorname{Tr}(U^T P U) - \frac{1}{2} \operatorname{Tr}[M (U^T U - I)]$

$\dfrac{\partial \hat{E}_M}{\partial U} = P U - U M = 0 \;\;\Rightarrow\;\; P U = U M \;\;\Rightarrow\;\; U^T P U = M$

Dimensions of $U^T P U = M$:  $[(d-M) \times d] \, [d \times d] \, [d \times (d-M)] = [(d-M) \times (d-M)]$

Add columns to $U$, and expand $M$:

$P \, [\, U \;\; U' \,] = [\, U \;\; U' \,] \begin{bmatrix} M & 0 \\ 0 & M' \end{bmatrix}$

We now have an eigenvector equation while still satisfying the original $U^T P U = M$ equation.

PCA Solution: $\{u_i\}$ = eigenvectors of $P$
$E_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T P u_i = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T \lambda_i u_i = \frac{1}{2} \sum_{i=M+1}^{d} \lambda_i$

The error is half of the sum of the $(d-M)$ smallest eigenvalues of $P$.

$\{u_i\}$ = principal components
PCA = Karhunen-Loève transformation
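A compact NumPy sketch of the whole result (the function name and random test data are illustrative; the data is centered before projection, which is equivalent to using the optimal $b_i = u_i^T \bar{x}$ from the derivation):

```python
import numpy as np

def pca(X, M):
    """Project d-dimensional inputs onto their first M principal components.

    X : (N, d) data array, one sample per row.
    Returns the (N, M) reduced data Z, the projection matrix U_M, the mean,
    and the error E_M (half the sum of the discarded eigenvalues of P).
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    P = Xc.T @ Xc                        # P = sum_n (x_n - xbar)(x_n - xbar)^T
    lam, U = np.linalg.eigh(P)           # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]       # reorder: largest eigenvalues first
    U_M = U[:, :M]                       # principal components (top M eigenvectors)
    Z = Xc @ U_M                         # z_ni = u_i^T (x_n - x_bar)
    E_M = 0.5 * lam[M:].sum()            # half the sum of the discarded eigenvalues
    return Z, U_M, x_bar, E_M

# Sanity check: the reconstruction error matches the eigenvalue formula
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
Z, U_M, x_bar, E_M = pca(X, M=2)
X_tilde = x_bar + Z @ U_M.T              # reconstruct from the reduced representation
print(np.isclose(E_M, 0.5 * np.sum((X - X_tilde) ** 2)))   # True
```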
PCA illustration in two
dimensions: All data points
are projected onto the u1
direction. Any variation in
the u2 direction is ignored.
Cascade Correlation
Scott Fahlman, 1988
This gives a way to automatically adjust the network size. Also,
it uses gradient-based weight optimization without the
complexity of backpropagation.
1. Begin with no hidden-layer neurons.
2. Add hidden neurons one at a time.
3. After adding hidden neuron Hi, optimize the weights from
“upstream” neurons to Hi to maximize the effect of Hi on
the outputs.
4. Optimize the output weights of Hi to minimize training
error.
Cascade Correlation Example
Two inputs and two outputs
Step 1 – Start with a two-layer network (no hidden layers)
[Figure: two-layer network (no hidden neurons) with inputs x1, x2, a bias input of 1, and outputs y1, y2]
$y_1 = f(x \cdot w_1)$
$w_1$ = weights from inputs to $y_1$
$w_1$ can be trained with a gradient method
Similarly, $w_2$ can be trained with a gradient method
Cascade Correlation Example
Two inputs and two outputs
Step 2 – Add a hidden neuron H1. Maximize the correlation
between H1 outputs and training error.
[Figure: hidden neuron H1 (output $z$) added, fed by x1, x2, and the bias; the weights into H1 are updated ("do update"), the weights into the outputs are not ("don't update")]

$n_o$ = number of outputs
$N$ = number of training samples
$e$ = training error before H1 is added
$\{w_i\}$ = weights from inputs to H1

Correlation:  $S = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (z_n - \bar{z})(e_{nk} - \bar{e}_k)$

$\dfrac{dS}{dw_i} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (e_{nk} - \bar{e}_k) \, f'(w \cdot x_n) \, x_{ni}$

Use a gradient method to maximize $|S|$ with respect to $\{w_i\}$.
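A NumPy sketch of this candidate-training step (the function, the tanh activation, the toy data, and the step size are illustrative assumptions; E holds the residual errors of the current network before the candidate is added):

```python
import numpy as np

def candidate_correlation_and_gradient(w, X, E, f, f_prime):
    """Correlation S between a candidate hidden unit's output and the residual
    training errors, plus dS/dw, for the cascade-correlation candidate step.

    w : (d,) candidate input weights     X : (N, d) inputs feeding the candidate
    E : (N, n_o) training errors of the current network
    """
    a = X @ w                        # net input w . x_n for every sample
    z = f(a)                         # candidate unit outputs z_n
    z_dev = z - z.mean()             # z_n - z_bar
    E_dev = E - E.mean(axis=0)       # e_nk - e_bar_k
    S = np.sum(z_dev[:, None] * E_dev)                 # sum over k and n
    dS_dw = (E_dev.sum(axis=1) * f_prime(a)) @ X       # dS/dw_i
    return S, dS_dw

# Simple gradient ascent on |S| with toy data (two inputs plus bias, two outputs)
X = np.hstack([np.random.randn(50, 2), np.ones((50, 1))])
E = np.random.randn(50, 2)
f, f_prime = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2
w = 0.1 * np.random.randn(3)
for _ in range(100):
    S, g = candidate_correlation_and_gradient(w, X, E, f, f_prime)
    w += 0.01 * np.sign(S) * g       # step uphill on |S|
```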
Cascade Correlation Example
Two inputs and two outputs
Step 3 – Optimize the output weights
[Figure: the weights into the output neurons y1, y2 are updated ("do update"); the input weights of H1 are frozen ("don't update")]
Use a gradient method to
minimize training error with
respect to the weights that are
connected to the output neurons.
Cascade Correlation Example
Two inputs and two outputs
Step 4 – Add another hidden neuron H2 and repeat step 2;
Maximize correlation between H2 outputs and training error.
[Figure: second hidden neuron H2 (output $z_2$) added, fed by x1, x2, the bias, and the output of H1; the weights into H2 are updated ("do update"), all other weights are not ("don't update")]

$n_o$ = number of outputs
$N$ = number of training samples
$e$ = training error before H2 is added
$\{w_i\}$ = weights "upstream" from H2

Correlation:  $S = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (z_n - \bar{z})(e_{nk} - \bar{e}_k)$

$\dfrac{dS}{dw_i} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (e_{nk} - \bar{e}_k) \, f'(w \cdot x_n) \, x_{ni}$

Use a gradient method to maximize $|S|$ with respect to $\{w_i\}$.
Cascade Correlation Example
Two inputs and two outputs
Step 5 – Optimize the output weights
[Figure: only the weights into the output neurons y1, y2 are updated ("do update"); the input weights of H1 and H2 are frozen ("don't update")]
Use a gradient method to
minimize training error with
respect to the weights that are
connected to the output neurons.
Repeat the previous two steps until the desired performance is obtained:
• Add a hidden neuron Hi
• Maximize corr. between Hi output and training error w/r to Hi input weights
• Freeze input weights
• Minimize training error w/r to output weights.
References
• C. Bishop, Neural Networks for Pattern Recognition
• D. Simon, Optimal State Estimation (Chapter 1)