Supervised 3

BACKPROPAGATION (CONTINUED)
• Hidden unit transfer function is usually a sigmoid (s-shaped), a smooth curve.
• It limits the output (activation) of the unit to the range 0..1.
Take the net input to the unit and pass it through the transfer function: this gives the output of the unit.
Most common is the logistic function:

    output_i = 1 / (1 + exp(-net_i))
Transfer function
• Often the same function is used for the output units (when building a classifier, i.e. 0..1 classification).
• For regression problems (decimal outputs) the output transfer function used is linear (just sum the net inputs); see the sketch below.
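A minimal sketch of these two transfer functions in Python (NumPy used for the exponential; the example values are just illustrative):

import numpy as np

def logistic(net):
    # Sigmoid transfer function: squashes the net input into the range 0..1.
    return 1.0 / (1.0 + np.exp(-net))

def linear(net):
    # Linear output transfer function for regression: just the summed net input.
    return net

net = np.array([-2.0, 0.0, 2.0])
print(logistic(net))   # roughly [0.12, 0.5, 0.88], bounded to 0..1
print(linear(net))     # [-2.0, 0.0, 2.0], unbounded decimal outputs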
Local minima
• Backpropagation is a gradient descent process, each change in weights bringing the net closer to a minimum error in weight space.
• Because of this, it is easily trapped in a local minimum, as only 'downward' steps can be taken.
• Sometimes remedied by restarting with different random weights (starting from a different point on the error surface), as in the sketch below.
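A toy illustration of the restart idea, using gradient descent on a one-dimensional error surface with several local minima (the surface and all values are made up; a real network descends in a high-dimensional weight space):

import numpy as np

rng = np.random.default_rng(0)

def error(w):
    # Toy error surface with several local minima.
    return np.sin(3 * w) + 0.1 * w ** 2

def error_grad(w):
    return 3 * np.cos(3 * w) + 0.2 * w

def gradient_descent(w, lr=0.01, steps=500):
    # Only 'downward' steps: each change moves towards the nearest minimum.
    for _ in range(steps):
        w -= lr * error_grad(w)
    return w

best_w, best_e = None, np.inf
for _ in range(10):
    w0 = rng.uniform(-4.0, 4.0)       # a different starting point on the error surface
    w = gradient_descent(w0)
    if error(w) < best_e:
        best_w, best_e = w, error(w)

print(best_w, best_e)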
Critical parameters
• Cumulative versus incremental weight changing.
• Adjust the weights after the presentation of one pattern (incremental) or after all patterns (epoch)? See the sketch below.
• Epoch updating is faster, but more likely to fall into local minima, and requires more memory.
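A small sketch contrasting the two regimes on a single linear unit trained with the delta rule (the data, learning rate and function names are illustrative):

import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # input patterns
y = np.array([1.0, 1.0, 2.0])                        # targets
lr = 0.1

def incremental_epoch(w):
    # Adjust the weights after the presentation of each pattern.
    for x_i, t_i in zip(X, y):
        err = t_i - w @ x_i
        w = w + lr * err * x_i            # applied immediately
    return w

def batch_epoch(w):
    # Accumulate the changes over all patterns, then adjust once per epoch.
    delta = np.zeros_like(w)
    for x_i, t_i in zip(X, y):
        err = t_i - w @ x_i
        delta += lr * err * x_i           # stored, not yet applied
    return w + delta

print(incremental_epoch(np.zeros(2)))
print(batch_epoch(np.zeros(2)))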
Critical parameters
• Size of the learning constant.
• Too high: the network won't learn anything (the weights oscillate). Too low: it will take ages to find a solution. The toy example below shows both failure modes.
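A toy illustration of the effect, using gradient descent on the simple error surface E(w) = w^2 (gradient 2w); the learning rates shown are arbitrary:

def descend(lr, w=1.0, steps=20):
    for _ in range(steps):
        w -= lr * 2 * w        # step against the gradient
    return w

print(descend(lr=1.1))     # too high: overshoots and oscillates, |w| grows
print(descend(lr=0.001))   # too low: barely moves towards the minimum at 0
print(descend(lr=0.3))     # moderate: converges close to 0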
Critical parameters
• Momentum method.
• Supplement the current weight adjustment with a fraction of the most recent weight adjustment.
Standard back-propagation weight adjustment:

    weight(t + 1) = weight(t) + η·(Δweight(t))

i.e. the weight at time t+1 equals the weight at time t plus (learning rate × calculated weight change).
Back-propagation with momentum included, where α is the momentum constant:

    weight(t + 1) = weight(t) + η·(Δweight(t)) + α·(Δweight(t - 1))

i.e. we have included some of the weight change from the previous cycle of back-propagation.
Can significantly speed up learning.
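A minimal sketch of the momentum update above (the Δweight values are made up, and the choices of η and α are arbitrary):

import numpy as np

eta, alpha = 0.1, 0.9      # learning rate η and momentum constant α

def momentum_step(w, delta, prev_delta):
    # weight(t+1) = weight(t) + η·Δweight(t) + α·Δweight(t-1)
    return w + eta * delta + alpha * prev_delta

w = np.array([0.5, -0.3])
prev_delta = np.zeros_like(w)                       # no previous change on the first cycle
for delta in [np.array([0.2, -0.1]), np.array([0.15, -0.05])]:
    w = momentum_step(w, delta, prev_delta)
    prev_delta = delta                              # remember Δweight(t) for the next cycle
    print(w)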
Critical parameters
• Number of hidden units:
• Too few: the network won't learn the problem.
• Too many: the network generalises poorly to new data (it overfits the training data).
• Addition of bias units.
• A bias is a unit which always has an output of 1; it sometimes helps the convergence of the weights to a solution by providing an extra degree of freedom in weight space.
    net_j = Σ_{i=1..n} w_ji · x_i + θ_j        (1)

where θ_j is the bias weight multiplied by 1:

    θ_j = BIAS WEIGHT_j × 1
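A direct sketch of equation (1) in Python (the weights, inputs and bias weight are made-up values):

import numpy as np

def net_input(weights, inputs, bias_weight):
    # net_j = sum over i of w_ji * x_i, plus θ_j = bias weight * 1
    return weights @ inputs + bias_weight * 1.0

x = np.array([0.2, 0.7, 0.1])        # inputs x_i
w_j = np.array([0.5, -0.3, 0.8])     # weights w_ji into unit j
print(net_input(w_j, x, bias_weight=0.1))   # 0.1 - 0.21 + 0.08 + 0.1 = 0.07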
Bias error derivatives (analogous to the weight error derivatives in normal backprop) are calculated:

    output BED_i = output BED_i + (δ_output_i × 1)

i.e. the bias error derivative for an output unit is found using the delta of that output unit multiplied by the output of the bias unit it connects to (i.e. 1).
In a similar way the bias error derivatives for the hidden units are found:

    hidden BED_i = hidden BED_i + (δ_hidden_i × 1)
The bias weights are then changed in the same way as the other weights in the network, using the bias error derivatives and a bias change rate parameter β (analogous to η, the learning rate parameter for normal weights):

    output BIAS WEIGHT_i(t + 1) = output BIAS WEIGHT_i(t) + β·(output BED_i)
In the same way, the bias weights are changed for the hidden units:

    hidden BIAS WEIGHT_i(t + 1) = hidden BIAS WEIGHT_i(t) + β·(hidden BED_i)
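A sketch of the bias updates above for one backprop cycle (the delta values and β are made up; in a real network the deltas come from the backpropagated error):

import numpy as np

beta = 0.1                                    # bias change rate β

delta_output = np.array([0.05, -0.02])        # δ for each output unit (made up)
delta_hidden = np.array([0.01, -0.03, 0.02])  # δ for each hidden unit (made up)

# Accumulate bias error derivatives: BED_i = BED_i + (δ_i * 1),
# the 1 being the constant output of the bias unit.
output_bed = np.zeros_like(delta_output) + delta_output * 1.0
hidden_bed = np.zeros_like(delta_hidden) + delta_hidden * 1.0

# Change the bias weights: BIAS WEIGHT_i(t+1) = BIAS WEIGHT_i(t) + β * BED_i
output_bias_w = np.zeros_like(delta_output) + beta * output_bed
hidden_bias_w = np.zeros_like(delta_hidden) + beta * hidden_bed
print(output_bias_w, hidden_bias_w)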
Bias units
• Bias units are not mandatory, but as they involve relatively little extra computation most neural networks have them by default.
Overtraining
• Overtraining is a type of overfitting.
• A network can be trained too much.
• The network becomes very good at classifying the training set, but poor at classifying a test set it has not encountered before, i.e. it is not generalising well (it is overfitting the training data).
Overtraining
• Avoid by periodically presenting a validation set, recording the error and storing the weights. The set of weights giving the minimum error on the validation set is retrieved when training has finished (see the sketch below).
• Some neural network packages do this for you, in
a way that is hidden from the user.
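A sketch of this early-stopping idea; train_one_epoch and validation_error are hypothetical stand-ins for whatever the training loop and error measure actually look like:

import copy
import numpy as np

def train_with_early_stopping(net, max_epochs, train_one_epoch, validation_error):
    # Periodically present the validation set, record the error and store the weights.
    best_error = np.inf
    best_net = copy.deepcopy(net)
    for _ in range(max_epochs):
        train_one_epoch(net)              # one pass over the training set
        err = validation_error(net)       # error on the held-out validation set
        if err < best_error:
            best_error = err
            best_net = copy.deepcopy(net)   # remember the best weights seen so far
    return best_net                       # retrieved when training is finished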
Data selection
• Selection of training data is critical.
• When training a network to perform in a noisy environment, noisy input patterns must be included in the training set.
• MLPs are good at interpolation but not at extrapolation.
• The number of members of each class must be balanced in the training and validation sets (see the sketch below).
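A small sketch of one way to keep the classes balanced when splitting the data (balanced_split is a hypothetical helper, not part of any particular package):

import numpy as np

rng = np.random.default_rng(1)

def balanced_split(patterns, labels, train_fraction=0.8):
    # Split each class separately so the training and validation sets keep
    # roughly the same proportion of members from every class.
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return (patterns[train_idx], labels[train_idx],
            patterns[valid_idx], labels[valid_idx])

X = rng.normal(size=(10, 2))
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_tr, y_tr, X_va, y_va = balanced_split(X, y)
print(np.bincount(y_tr), np.bincount(y_va))   # both classes present in both sets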
Catastrophic forgetting
• It is unwise to train the network completely on patterns selected from one class and then switch to training on another class of patterns, as the network will forget the original training.
• The solution is to mix both classes in the same training set, as sketched below.
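A tiny sketch of that fix: build one training set containing both classes and shuffle the presentation order (the patterns are just placeholders):

import random

random.seed(2)

class_a = [("a", i) for i in range(5)]      # patterns from one class
class_b = [("b", i) for i in range(5)]      # patterns from the other class

# Mix both classes into the same training set and shuffle the order,
# so later patterns do not erase what was learned from the earlier class.
training_set = class_a + class_b
random.shuffle(training_set)
print(training_set)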