
Multilayer Perceptrons
CS679 Lecture Note
by Jin Hyung Kim
Computer Science Department
KAIST
Multilayer Perceptron


Hidden layers of computation nodes
Input propagates in the forward direction, layer by layer
  - also called a multilayer feedforward network (MLP)
Error back-propagation algorithm
  - supervised learning algorithm
  - error-correction learning algorithm
  - Forward pass
      - the input vector is applied to the input nodes
      - its effects propagate through the network layer by layer
      - synaptic weights are held fixed
  - Backward pass
      - synaptic weights are adjusted in accordance with the error signal
      - the error signal propagates backward, layer by layer
FF NN
MLP Distinctive Characteristics

Non-linear activation function
  - differentiable
  - sigmoidal function, e.g. the logistic function
    $y_j = \frac{1}{1 + \exp(-v_j)}$
  - the nonlinearity prevents reduction to a single-layer perceptron

One or more layers of hidden neurons
  - progressively extract more meaningful features from the input patterns

High degree of connectivity

Nonlinearity and a high degree of connectivity make theoretical analysis difficult
The learning process is hard to visualize
BP is a landmark in NN research: computationally efficient training


Preliminaries

Function signal
  - the input signal comes in at the input end of the network
  - propagates forward to the output nodes

Error signal
  - originates from an output neuron
  - propagates backward to the input nodes

Two computations in training
  - computation of the function signal
  - computation of an estimate of the gradient vector
      - gradient of the error surface with respect to the weights
Back Propagation Algorithm





1. Initialization: Randomize the weights to small values.
2. Presentation: Apply a pattern to the input and calculate
the network output.
3. Error Computation: Compare the output with the desired
output and compute the error (difference).
4. Backward Computation: Backpropagate the error
through the network and adjust the weights to minimize the
error.
5. Iteration: Repeat steps 2-4 until a desired error goal is
reached.
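
The five steps above can be sketched as a short training loop. This is a minimal illustration, not the lecture's own code: the 2-2-1 layout, the OR training set, the learning rate, the error goal and the names `train` and `OR_DATA` are all assumptions chosen to keep the example small.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train(samples, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    """Sequential back-propagation for one hidden layer and one output neuron."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 1: initialization with small random weights (index 0 is the bias weight).
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, d in samples:
            # Step 2: presentation (forward pass).
            xin = [1.0] + list(x)
            h = [sigmoid(sum(w * xi for w, xi in zip(ws, xin))) for ws in w_hid]
            hin = [1.0] + h
            y = sigmoid(sum(w * hi for w, hi in zip(w_out, hin)))
            # Step 3: error computation.
            e = d - y
            total += 0.5 * e * e
            # Step 4: backward computation (local gradients, then delta rule).
            delta_out = e * y * (1.0 - y)
            delta_hid = [h[j] * (1.0 - h[j]) * delta_out * w_out[j + 1]
                         for j in range(n_hidden)]
            w_out = [w + eta * delta_out * hi for w, hi in zip(w_out, hin)]
            for j in range(n_hidden):
                w_hid[j] = [w + eta * delta_hid[j] * xi for w, xi in zip(w_hid[j], xin)]
        history.append(total / len(samples))
        # Step 5: iterate until a (hypothetical) error goal is reached.
        if history[-1] < 1e-3:
            break
    return history

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
history = train(OR_DATA)
print(history[0], history[-1])  # average error energy in the first and last epoch
```

The hidden-layer local gradients are computed before the output weights are changed, so both layers use the weights that produced the forward pass.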
Back-Propagation Algorithm


Error signal for neuron j at iteration n
  $e_j(n) = d_j(n) - y_j(n)$
Total error energy
  $E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
  - C is the set of output nodes
Average squared error energy
  $E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
  - average over all training samples
  - cost function as a measure of learning performance
Objective of the learning process
  - adjust the NN parameters (synaptic weights) to minimize $E_{av}$
Weights are updated on a pattern-by-pattern basis until one epoch is complete
  - epoch: one complete presentation of the entire training set
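BPA
Induced local field
  $v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)$
Output of neuron j
  $y_j(n) = \varphi_j(v_j(n))$
Gradient
  $\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$
  - sensitivity factor
  - determines the direction of search in weight space
  - follows from the chain rule, with the four factors
    $\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$
    $\frac{\partial e_j(n)}{\partial y_j(n)} = -1$
    $\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n))$
    $\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$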
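Gradient Descent
Therefore,
  $\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n) \varphi_j'(v_j(n)) y_i(n)$
By the delta rule
  $\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)}$
  which is gradient descent in weight space
Local gradient
  $\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n) \varphi_j'(v_j(n))$
  $\Delta w_{ji}(n) = \eta \delta_j(n) y_i(n)$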
Local Gradient (I)

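Neuron j is an output node
  $e_j(n) = d_j(n) - y_j(n)$
Neuron j is a hidden node
  - credit assignment problem
  - how to determine each hidden neuron's share of responsibility
  $\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \varphi_j'(v_j(n))$
For output neuron k
  $E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$
  $\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$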
Local Gradient (II)


Error in neuron k
  $e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n))$
Hence
  $\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi_k'(v_k(n))$
Since
  $v_k(n) = \sum_{j=0}^{m} w_{kj}(n) y_j(n)$,
  $\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$
Desired partial derivative
  $\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n) \varphi_k'(v_k(n)) w_{kj}(n) = -\sum_k \delta_k(n) w_{kj}(n)$
Back-propagation formula for hidden neuron j
  $\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) w_{kj}(n)$
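
One way to sanity-check the hidden-neuron formula is to compare it with a finite-difference estimate of the gradient. The sketch below assumes a logistic 2-2-1 network with hand-picked weights; the function names and the numbers are illustrative only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(w_hid, w_out, x):
    """Return the extended input, hidden outputs and network output."""
    xin = [1.0] + x                      # y_0 = 1 carries the bias weight
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, xin))) for ws in w_hid]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, [1.0] + h)))
    return xin, h, y

def error(w_hid, w_out, x, d):
    return 0.5 * (d - forward(w_hid, w_out, x)[2]) ** 2

def backprop_gradients(w_hid, w_out, x, d):
    """dE/dw for every weight, using delta_k = e_k * phi'(v_k) and
    delta_j = phi'(v_j) * sum_k delta_k * w_kj (a single output k here)."""
    xin, h, y = forward(w_hid, w_out, x)
    delta_k = (d - y) * y * (1.0 - y)
    delta_j = [hj * (1.0 - hj) * delta_k * w_out[j + 1] for j, hj in enumerate(h)]
    grad_out = [-delta_k * hi for hi in [1.0] + h]       # dE/dw_kj = -delta_k * y_j
    grad_hid = [[-dj * xi for xi in xin] for dj in delta_j]  # dE/dw_ji = -delta_j * y_i
    return grad_hid, grad_out

w_hid = [[0.1, -0.2, 0.3], [-0.3, 0.4, 0.2]]
w_out = [0.2, -0.5, 0.6]
x, d = [1.0, 0.5], 1.0
grad_hid, grad_out = backprop_gradients(w_hid, w_out, x, d)

# Central-difference estimate for one hidden weight.
eps = 1e-6
w_plus = [row[:] for row in w_hid]
w_plus[0][1] += eps
w_minus = [row[:] for row in w_hid]
w_minus[0][1] -= eps
numeric = (error(w_plus, w_out, x, d) - error(w_minus, w_out, x, d)) / (2 * eps)
print(grad_hid[0][1], numeric)  # the two values should agree closely
```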
BP Summary
  $\Delta w_{ji}(n) = \eta \cdot \delta_j(n) \cdot y_i(n)$
  (weight correction) = (learning-rate parameter) x (local gradient) x (input signal to neuron j)

Forward pass
  $v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)$
  $y_j(n) = \varphi_j(v_j(n))$

Backward pass
  - compute the local gradient $\delta$ recursively
  - from the output layer toward the input layer
  - synaptic weights are changed by the delta rule
Activation Function (logistic function)
  $\varphi_j(v_j(n)) = \frac{1}{1 + \exp(-a v_j(n))}$, with $a > 0$ and $-\infty < v_j(n) < \infty$
  $\varphi_j'(v_j(n)) = \frac{a \exp(-a v_j(n))}{[1 + \exp(-a v_j(n))]^2} = a y_j(n) [1 - y_j(n)]$
Local gradient
  - for an output node
    $\delta_j(n) = e_j(n) \varphi_j'(v_j(n)) = a [d_j(n) - o_j(n)] o_j(n) [1 - o_j(n)]$
  - for a hidden node
    $\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = a y_j(n) [1 - y_j(n)] \sum_k \delta_k(n) w_{kj}(n)$
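
The logistic derivative can be written entirely in terms of the neuron's output, which is what makes the local gradient cheap to compute. A small sketch, with illustrative function names and the slope parameter `a` as the only setting:

```python
import math

def logistic(v, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * v))

def logistic_prime(v, a=1.0):
    """phi'(v) = a * y * (1 - y), where y = phi(v)."""
    y = logistic(v, a)
    return a * y * (1.0 - y)

def output_delta(d, o, a=1.0):
    """Local gradient of an output node: a * (d - o) * o * (1 - o)."""
    return a * (d - o) * o * (1.0 - o)

print(logistic_prime(0.0))      # slope at the origin for a = 1
print(output_delta(1.0, 0.5))   # target 1, output 0.5
```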
Activation Function (hyperbolic tangent function)
  $\varphi_j(v_j(n)) = a \tanh(b v_j(n))$, with $(a, b) > 0$
  $\varphi_j'(v_j(n)) = a b \,\mathrm{sech}^2(b v_j(n)) = a b (1 - \tanh^2(b v_j(n))) = \frac{b}{a} [a - y_j(n)] [a + y_j(n)]$
Local gradient
  - for an output node
    $\delta_j(n) = e_j(n) \varphi_j'(v_j(n)) = \frac{b}{a} [d_j(n) - o_j(n)] [a - o_j(n)] [a + o_j(n)]$
  - for a hidden node
    $\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) w_{kj}(n) = \frac{b}{a} [a - y_j(n)] [a + y_j(n)] \sum_k \delta_k(n) w_{kj}(n)$
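
The same output-only form of the derivative holds for the hyperbolic tangent. In the sketch below the default values a = 1.7159 and b = 2/3 are commonly quoted choices and are an assumption here, not something the slide specifies.

```python
import math

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    return a * math.tanh(b * v)

def tanh_prime(v, a=1.7159, b=2.0 / 3.0):
    """phi'(v) = (b / a) * (a - y) * (a + y), where y = phi(v)."""
    y = tanh_act(v, a, b)
    return (b / a) * (a - y) * (a + y)

print(tanh_prime(0.0))  # equals a * b at the origin
```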
Momentum term

BP approximates the trajectory of steepest descent
  - a smaller learning-rate parameter makes the path smoother
Momentum increases the rate of learning while avoiding the danger of instability
  $\Delta w_{ji}(n) = \alpha \Delta w_{ji}(n-1) + \eta \delta_j(n) y_i(n)$
  where $\alpha$ is the momentum constant
  $\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t} \delta_j(t) y_i(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}$
If $0 \le |\alpha| < 1$, the series converges
  - if the partial derivative has the same sign on consecutive iterations, $\Delta w_{ji}(n)$ grows in magnitude: accelerated descent
  - if it has opposite signs, $\Delta w_{ji}(n)$ shrinks: stabilizing effect
Benefit: prevents the learning process from terminating in a shallow local minimum
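
The accelerating and stabilizing effects can be seen by feeding the momentum recursion a gradient sequence with constant sign and one with alternating sign. This is a one-weight sketch; `momentum_steps` and the values of eta and alpha are illustrative assumptions.

```python
def momentum_steps(gradients, eta=0.1, alpha=0.9):
    """Successive corrections dw(n) = alpha * dw(n-1) - eta * g(n),
    where g(n) stands for dE/dw at iteration n."""
    dw, out = 0.0, []
    for g in gradients:
        dw = alpha * dw - eta * g
        out.append(dw)
    return out

same_sign = momentum_steps([1.0] * 200)
alternating = momentum_steps([(-1.0) ** n for n in range(200)])

print(same_sign[-1])    # magnitude approaches eta / (1 - alpha)
print(alternating[-1])  # magnitude approaches eta / (1 + alpha)
```

With alpha = 0.9 the steady-state step is ten times the plain gradient step when the sign is constant, and roughly half of it when the sign alternates.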
Mode of Training
Epoch: one complete presentation of the training data
  - randomize the order of presentation for each epoch

Sequential mode
  - synaptic weights are updated after each training sample
  - requires less storage
  - converges much faster, particularly when the training data are redundant
  - random order makes trapping at a local minimum less likely

Batch mode
  - synaptic weights are updated at the end of each epoch
    $\Delta w_{ji} = -\eta \frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n) \frac{\partial e_j(n)}{\partial w_{ji}}$
  - may be robust to outliers
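
The difference between the two modes shows up even for a single linear neuron y = w x, for which e = d - w x and the derivative of e with respect to w is -x. The sketch below uses that simplified neuron and a two-sample training set, both assumptions made for illustration.

```python
def sequential_epoch(w, samples, eta):
    """Update the weight after every sample."""
    for x, d in samples:
        e = d - w * x
        w += eta * e * x
    return w

def batch_epoch(w, samples, eta):
    """Accumulate e * de/dw over the epoch, then update once."""
    n = len(samples)
    grad_sum = sum((d - w * x) * (-x) for x, d in samples)
    return w - (eta / n) * grad_sum

samples = [(1.0, 2.0), (2.0, 4.0)]   # targets follow d = 2x
print(sequential_epoch(0.0, samples, eta=0.1))
print(batch_epoch(0.0, samples, eta=0.1))
```

In sequential mode the second sample already sees the weight changed by the first, so the two modes end an epoch at different weights.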
Stopping Criteria


No well-defined stopping criteria
Terminate when the gradient vector g(W) = 0
  - located at a local or global minimum
Terminate when the error measure is stationary
Terminate if the NN's generalization performance is adequate
XOR Problem

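McCulloch-Pitts Model (threshold model)
[Figure: two-input network of three threshold units φ(·) that solves XOR. Values labeled in the figure: weights +1, +1, +1, +1, +1 and -2; biases -1.5, -0.5 and -0.5]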
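
A threshold network with these values can be checked directly. The assignment of weights and biases to neurons below is an assumption reconstructed from the labeled values (two hidden units with input weights +1 and biases -1.5 and -0.5, and an output unit with weights -2 and +1 and bias -0.5); the figure itself is not available in the text.

```python
def step(v):
    """McCulloch-Pitts threshold unit: fires when the induced local field is positive."""
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 1.5)          # fires only for (1, 1)
    h2 = step(x1 + x2 - 0.5)          # fires when at least one input is 1
    return step(-2 * h1 + h2 - 0.5)   # h2 and not h1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```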
Heuristics for making BP Better (I)

Training with BP is more an art than a science
  - the result of one's own experience
Sequential vs. batch update
Maximizing information content
  - use examples with the largest training error
  - use examples that are radically different from previous ones
Randomize the order of presentation
  - successive examples should rarely belong to the same class
Activation function
  - an antisymmetric function learns faster
    $\varphi(v) = a \tanh(b v)$
Target values should lie within the range of the sigmoid activation function
  - the target value should be offset by some $\epsilon$ from the limiting value
Heuristics for making BP Better (II)

Normalizing the inputs
  - preprocessed so that the mean value is close to zero
  - input variables should be uncorrelated
      - by principal component analysis
  - scaled so that their covariances are approximately equal
  - Fig 4.11

Weight initialization
  - large weight values => saturation
      - the local gradient is small => slow learning
  - small weight values => operation on a flat area => slow learning
  - choose somewhere between the two extremes
  - for the hyperbolic tangent function, set $\mu_w = 0$, $\sigma_w^2 = 1/m$, where m is the number of synaptic connections of a neuron
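
A zero-mean initializer with variance 1/m can be sketched as follows. The choice of a uniform distribution is an assumption (a uniform variable on [-L, L] has variance L^2 / 3, so L = sqrt(3/m)); `init_weights` and its parameters are illustrative names.

```python
import math
import random

def init_weights(m, n_neurons, seed=0):
    """Draw n_neurons weight vectors of length m with mean 0 and variance 1/m."""
    rng = random.Random(seed)
    limit = math.sqrt(3.0 / m)
    return [[rng.uniform(-limit, limit) for _ in range(m)] for _ in range(n_neurons)]

weights = init_weights(m=100, n_neurons=100)
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, var)  # close to 0 and to 1/m = 0.01
```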
Heuristics for making BP Better (III)

Learning from hints
  - prior information should be included in the learning process
  - invariance properties, symmetries, etc.

Learning rate
  - all the neurons should learn at the same rate
  - the last layer has larger local gradients (by the limiting effect)
      - the last layer learns faster
      - $\eta$ of the last layer should be assigned a smaller value
  - LeCun's suggestion: the learning rate is inversely proportional to the square root of the number of synaptic connections ($m^{-1/2}$)
Output Representation and
Decision rule

For an M-class pattern classification problem, the kth element of the desired response vector is
  $d_k = 1$ if the input vector x belongs to $C_k$, and $d_k = 0$ otherwise
  - the desired response vector is all zeros except for a 1 in the kth element
The conditional expectation of the desired response vector equals the a posteriori class probability $P(C_k | x)$, k = 1, 2, …, M
Bayes classification rule
  - classify x to class $C_k$ if $P(C_k | x) > P(C_j | x)$ for all $j \ne k$
Approximated Bayes rule by MLP
  - classify x to class $C_k$ if $F_k(x) > F_j(x)$ for all $j \ne k$
Multiple class assignment
  - classify x to class $C_k$ if $F_k(x) > t$, for a threshold t
Computer Experiment
Bayesian decision
  - Likelihood ratio
    $\Lambda(x) = \frac{\Pr(x | C_1)}{\Pr(x | C_2)}$
  - Decision boundary
    $\Lambda(x) = \frac{\Pr(x | C_1)}{\Pr(x | C_2)} = \frac{\Pr(C_2)}{\Pr(C_1)}$
  - Probability of error
    $P_e = P(C_1) P(e | C_1) + P(C_2) P(e | C_2)$

Optimal number of hidden neurons
  - note that a small mean-square error does not necessarily imply good generalization

Number of training samples: Chernoff bound
  $P(|p_N - p| > \epsilon) < 2 \exp(-2 \epsilon^2 N) = \delta$
  - $\epsilon = 0.01$, $\delta = 0.01$ yields N ≈ 26,500, so N was picked as 32,000
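
The sample size quoted for the Chernoff bound follows from solving 2 exp(-2 ε² N) = δ for N, which gives N = ln(2/δ) / (2 ε²). A quick check, with `chernoff_sample_size` as an illustrative name:

```python
import math

def chernoff_sample_size(eps, delta):
    """Smallest N (as a real number) with 2 * exp(-2 * eps**2 * N) = delta."""
    return math.log(2.0 / delta) / (2.0 * eps ** 2)

N = chernoff_sample_size(eps=0.01, delta=0.01)
print(round(N))  # about 26,500
```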
Computer Experiment

Optimal learning-rate and momentum constants
  - a small learning rate results in slower convergence
  - a small learning rate locates a deeper local minimum
  - increasing the momentum constant results in faster learning with a small learning rate
  - with a large learning rate, a small momentum constant is required to ensure learning stability
  - Figures 4.15, 4.16

Decision boundary by the BPA
  - Figure 4.17
  - decision boundaries are convex
Feature Detection


Hidden neurons act as feature detectors
As learning progresses, hidden neurons gradually discover the "salient" features that characterize the training data
  - nonlinear transformation of the input data to a feature space
Role of hidden neurons observed with linear output nodes
  - for a one-from-M coding scheme, the MLP maximizes a discriminant function that is the trace of the product of two matrices
      - the weighted between-class covariance matrix
      - the pseudo-inverse of the total covariance matrix
Close resemblance to Fisher's linear discriminant
  - Duda and Hart, pages 114-118
Fisher’s linear discriminant


The aim is to reduce the dimensionality of the feature space
Project d-dimensional data onto a line so that the classes are well separated
[Figure: panels a and b, projections of the same samples onto two different lines]
Fisher’s linear discriminant
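Samples $x_1, \dots, x_n$: $n_1$ from class $\omega_1$ (subset $\mathcal{X}_1$) and $n_2$ from class $\omega_2$ (subset $\mathcal{X}_2$)
  - linear combination of the components of x (a scalar): $y = w^t x$
  - $y_1, \dots, y_n$ are divided into subsets $\mathcal{Y}_1$ and $\mathcal{Y}_2$
  - $y_i$ is the projection of $x_i$ onto a line in the direction of w
  - measure of separation: $|\tilde{m}_1 - \tilde{m}_2| = |w^t (m_1 - m_2)|$
  - scatter: $\tilde{s}_i^2 = \sum_{y \in \mathcal{Y}_i} (y - \tilde{m}_i)^2$
Criterion function
  $J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
Fisher's linear discriminant is the linear function $w^t x$ for which $J(w)$ is maximum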
Fisher’s linear discriminant



Scatter matrices: define
  $S_i = \sum_{x \in \mathcal{X}_i} (x - m_i)(x - m_i)^t$, $S_W = S_1 + S_2$
then
  $\tilde{s}_i^2 = \sum_{x \in \mathcal{X}_i} (w^t x - w^t m_i)^2 = \sum_{x \in \mathcal{X}_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w$
  $S_B = (m_1 - m_2)(m_1 - m_2)^t$
Rewrite
  $J(w) = \frac{w^t S_B w}{w^t S_W w}$
A vector w that maximizes J must satisfy $S_W^{-1} S_B w = \lambda w$
Since $S_B w$ is always in the direction of $m_1 - m_2$,
  $w = S_W^{-1} (m_1 - m_2)$
  - this gives the maximum ratio of between-class scatter to within-class scatter
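
For two-dimensional data the Fisher direction can be computed by hand-coding the 2x2 inverse. The sketch below uses a small made-up data set and illustrative function names; it compares J(w) for the Fisher direction with J for a few other directions.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """S_i = sum over the class of (x - m_i)(x - m_i)^t, as a 2x2 list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s

def fisher_direction(c1, c2):
    """w = S_W^{-1} (m1 - m2); assumes S_W is invertible."""
    m1, m2 = mean(c1), mean(c2)
    s1, s2 = scatter(c1, m1), scatter(c2, m2)
    sw = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def criterion(w, c1, c2):
    """J(w) = |m~1 - m~2|^2 / (s~1^2 + s~2^2) computed on the projections."""
    y1 = [w[0] * p[0] + w[1] * p[1] for p in c1]
    y2 = [w[0] * p[0] + w[1] * p[1] for p in c2]
    mt1, mt2 = sum(y1) / len(y1), sum(y2) / len(y2)
    s1 = sum((y - mt1) ** 2 for y in y1)
    s2 = sum((y - mt2) ** 2 for y in y2)
    return (mt1 - mt2) ** 2 / (s1 + s2)

class1 = [(1, 2), (2, 3), (3, 3), (4, 5)]
class2 = [(4, 1), (5, 2), (6, 2), (7, 4)]
w = fisher_direction(class1, class2)
print(w, criterion(w, class1, class2))
print(criterion([1, 0], class1, class2))  # projecting on the first axis does worse
```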
Generalization



The input-output mapping is correct for data never seen before
The learning process is curve fitting: a nonlinear mapping
Overfitting (overtraining)
  - memorizes the training data, not the essence of the training data
  - learns idiosyncrasies and noise
  - loses the ability to generalize
Occam's razor
  - find the simplest function among those that satisfy the given conditions
  - the smoothest function
  - Figure 4.19
Training Set Size for Generalization

Generalization is influenced by
  - the size of the training set
  - the architecture of the neural network
Given an architecture, determine the size of the training set for good generalization
Given a set of training samples, determine the best architecture for good generalization
VC dimension: theoretical basis
  $N = O(W / \epsilon)$
Approximation of Functions

Non-linear input-output mapping
  - from an $m_0$-dimensional input space to an $m_L$-dimensional output space
What is the minimum number of hidden layers in an MLP that can approximate any continuous mapping?
Universal approximation theorem
  - existence of an approximation of an arbitrary continuous function
  - a single hidden layer is sufficient for an MLP to compute a uniform $\epsilon$ approximation to a given training set
  - it does not say that a single hidden layer is optimum in the sense of training time, ease of implementation, or generalization

Bound on the approximation error of a single-hidden-layer NN
  - the larger the number of hidden nodes, the more accurate the approximation
  - the smaller the number of hidden nodes, the more accurate the empirical fit
Curse of Dimensionality

For good generalization, $N > m_0 m_1 / \epsilon = W / \epsilon$
  - where W is the total number of synaptic weights
We need dense sample points to learn the mapping well
Dense samples are hard to find in high dimensions
  - exponential growth in complexity as the number of dimensions increases
Practical Consideration

Single hidden layer vs. double (multiple) hidden layers
  - a single-hidden-layer NN can approximate any continuous function
  - a double-hidden-layer NN may sometimes work better
Double (multiple) hidden layers
  - first hidden layer: local feature detection
  - second hidden layer: global feature detection
Cross-Validation

Validate the learned model on a different set to assess its generalization performance
  - guards against overfitting
Partition the training set into
  - an estimation subset
  - a validation subset
Cross-validation is used for
  - selecting the best model
  - determining when to stop training
Model selection


Choosing the MLP with the best number of free parameters for a given set of N training samples
The issue is to choose r
  - r determines the split of the training set between the estimation subset and the validation subset
  - the goal is to minimize the classification error of the model trained on the estimation subset when it is tested on the validation subset

Kearns (1996): qualitative properties of the optimum r
  - analysis with the VC dimension
  - for problems of small complexity (small compared to N), the performance of cross-validation is insensitive to r
  - a single fixed r is nearly optimal for a wide range of target functions
  - suggests r = 0.2
      - 80% of the training set is the estimation subset
Stopping method of training

Right time to stop training
  - to avoid overfitting
Early stopping method
  - after some training, compute the validation error with the synaptic weights held fixed
  - resume training after computing the validation error
[Figure: mean squared error versus number of epochs for the training sample and the validation sample, with the early stopping point marked]
Stopping method

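Amari (1996)
  - for N < W
      - early stopping improves generalization
  - for N < 30W
      - overfitting occurs
      - $r_{opt} = 1 - \frac{\sqrt{2W - 1} - 1}{2(W - 1)}$
      - $r_{opt} \approx 1 - \frac{1}{\sqrt{2W}}$ for large W
      - example: W = 100 gives a value of about 0.93, i.e. 93% for estimation and 7% for validation (r = 0.07 when r denotes the validation fraction)
  - for N > 30W
      - the improvement from early stopping is small

Leave-one-out method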
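
The split implied by the r_opt expression can be evaluated numerically. The sketch below treats the expression as written as the estimation fraction and its complement as the validation fraction; that reading is an interpretation chosen to match the W = 100 example, and the function names are illustrative.

```python
import math

def r_opt_expression(W):
    """1 - (sqrt(2W - 1) - 1) / (2 (W - 1)), evaluated as written."""
    return 1.0 - (math.sqrt(2 * W - 1) - 1.0) / (2.0 * (W - 1))

def validation_fraction(W):
    """Complement of the expression above: the share held out for validation."""
    return 1.0 - r_opt_expression(W)

W = 100
print(round(r_opt_expression(W), 2))     # about 0.93 (estimation)
print(round(validation_fraction(W), 2))  # about 0.07 (validation)
print(1.0 / math.sqrt(2 * W))            # large-W approximation of the validation share
```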
Network Pruning

Minimizing the network size improves generalization
  - less likely to learn idiosyncrasies or noise
Network growing
Network pruning
  - weakening or eliminating synaptic weights
Complexity regularization
  - tradeoff between the reliability of the training data and the goodness of the model
  - supervised learning by minimizing the risk function
    $R(w) = E_s(W) + \lambda E_c(W)$
    where $E_s(W)$ is the standard performance measure and $E_c(W)$ is the complexity penalty
Complexity-regularization

Weight decay
  $E_c(W) = \|W\|^2 = \sum_i w_i^2$
  - some weights are forced to take values close to zero
  - weights in the network are grouped into two categories
      - those of large influence
      - those of little or no influence: excess weights

Weight elimination
  $E_c(W) = \sum_i \frac{(w_i / w_0)^2}{1 + (w_i / w_0)^2}$
  - when $|w_i| \ll w_0$, the weight is eliminated
Approximate smoother
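
The two complexity penalties behave quite differently for large weights, which a few evaluations make visible. The sketch below uses illustrative function names; `risk` simply combines a given performance measure with the weight-decay penalty and a regularization parameter `lam`.

```python
def weight_decay(weights):
    """E_c = sum of squared weights; grows without bound with |w_i|."""
    return sum(w * w for w in weights)

def weight_elimination(weights, w0):
    """E_c = sum of (w_i/w0)^2 / (1 + (w_i/w0)^2); each term saturates at 1."""
    return sum((w / w0) ** 2 / (1.0 + (w / w0) ** 2) for w in weights)

def risk(es, weights, lam):
    """R = E_s + lambda * E_c, here with the weight-decay penalty."""
    return es + lam * weight_decay(weights)

print(weight_decay([1.0, 2.0, -3.0]))          # 14.0
print(weight_elimination([0.001], w0=1.0))     # near 0: |w| much smaller than w0
print(weight_elimination([1.0], w0=1.0))       # 0.5: |w| equal to w0
print(weight_elimination([1000.0], w0=1.0))    # near 1: |w| much larger than w0
```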
Hessian-based Network Pruning


Identify parameters whose deletion will cause the least increase in $E_{av}$
By Taylor series
  $E_{av}(w + \Delta w) = E_{av}(w) + g^t(w) \Delta w + \frac{1}{2} \Delta w^t H \Delta w + O(\|\Delta w\|^3)$
  - parameters are deleted after the training process has converged, so the gradient term vanishes
  - quadratic approximation
    $\Delta E_{av} = E(w + \Delta w) - E(w) = \frac{1}{2} \Delta w^t H \Delta w$
Eliminate the weight $w_i$: $\Delta w_i + w_i = 0$
Solve the constrained optimization problem:
  $\Delta w = -\frac{w_i}{[H^{-1}]_{i,i}} H^{-1} 1_i$
  - if $[H^{-1}]_{i,i}$ is small, even a small weight is important
Optimal Brain Surgeon

Saliency of $w_i$
  $S_i = \frac{w_i^2}{2 [H^{-1}]_{i,i}}$
  - represents the increase in the mean-squared error from the deletion of $w_i$
OBS procedure
  - the weight of smallest saliency is deleted
  - requires computation of the inverse of the Hessian
Optimal Brain Damage
  - assumes that the Hessian matrix is diagonal
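
For a two-weight network the saliencies and the OBS weight correction can be computed with an explicit 2x2 inverse. The Hessian and weights below are made-up numbers, and the function names are illustrative; the diagonal-Hessian variant is included for comparison.

```python
def inverse_2x2(h):
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [[h[1][1] / det, -h[0][1] / det], [-h[1][0] / det, h[0][0] / det]]

def obs_saliencies(w, h):
    """S_i = w_i^2 / (2 [H^-1]_ii), using the full inverse Hessian."""
    h_inv = inverse_2x2(h)
    return [w[i] ** 2 / (2.0 * h_inv[i][i]) for i in range(2)]

def obd_saliencies(w, h):
    """Same formula under the diagonal assumption, where [H^-1]_ii = 1 / H_ii."""
    return [h[i][i] * w[i] ** 2 / 2.0 for i in range(2)]

def obs_weight_update(w, h, i):
    """Delta w = -(w_i / [H^-1]_ii) * H^-1 1_i; drives w_i to zero and adjusts the rest."""
    h_inv = inverse_2x2(h)
    scale = -w[i] / h_inv[i][i]
    return [scale * h_inv[r][i] for r in range(2)]

H = [[2.0, 1.0], [1.0, 2.0]]
w = [1.0, 2.0]
print(obs_saliencies(w, H))        # [0.75, 3.0]
print(obd_saliencies(w, H))        # [1.0, 4.0]
print(obs_weight_update(w, H, 0))  # [-1.0, 0.5]
```

Deleting the first weight with this correction changes the error by one half of Δwᵗ H Δw, which equals its saliency of 0.75 in this example.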
Accelerated Convergence

Heuristics
1. Every adjustable weight should have its own learning-rate parameter
2. Every learning-rate parameter should be allowed to vary from one iteration to the next
3. If the sign of the derivative is the same for several consecutive iterations, the learning-rate parameter should be increased
  - apply the momentum idea even to the learning-rate parameters
4. If the sign of the derivative alternates for several consecutive iterations, the learning-rate parameter should be decreased
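
The four heuristics can be sketched as a per-weight learning-rate update. This is a simplified rule that compares only the current and previous derivative signs; the additive increase `up`, the multiplicative decrease `down` and the function name are hypothetical choices, not values from the lecture.

```python
def adapt_rates(rates, grads, prev_grads, up=0.01, down=0.5):
    """Return new per-weight learning rates.
    Same sign as the previous derivative: increase the rate additively.
    Opposite sign: decrease the rate multiplicatively.
    Zero product: leave the rate unchanged."""
    new_rates = []
    for eta, g, gp in zip(rates, grads, prev_grads):
        if g * gp > 0:
            eta += up
        elif g * gp < 0:
            eta *= down
        new_rates.append(eta)
    return new_rates

rates = adapt_rates([0.1, 0.1, 0.1], grads=[1.0, -1.0, 0.0], prev_grads=[2.0, 3.0, 1.0])
print(rates)  # first rate grows, second is halved, third is unchanged
```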