Supervised Learning in Neural Networks
Learning when all the correct answers are available.
Remembering -vs- Learning
• Remember: Store a given piece of information such that it
can be retrieved and reused in the future in the same (or
very similar) way that it was used earlier.
• Learn: Extract useful generalizations (or specializations)
from information such that, in the future, it may be:
– Applied to new (previously unseen) situations
– More effectively applied to previously-seen cases
Supervised Learning
• Generalizing (and specializing) useful knowledge from
data items, each of which contains both a situation
(context) and the proper response (action) for that
situation.
• In an educational setting, the teacher provides a problem
and THE CORRECT ANSWER.
• In Reinforcement Learning (RL) the teacher only responds
“Right” or “Wrong”.
Learning = Weight Adjustment
[Figure: node i's output xi feeds node j via weight wj,i; zj is node j's feedback signal.]
Generalized Hebbian Weight Adjustment:
– The sign of the weight change = the sign of the correlation between xi and zj:

  ∆wji ∝ xi zj

– zj is:
  • xj (Hopfield networks)
  • dj - xj (Perceptrons; dj = desired output)
  • dj - ∑i xi wji (ADALINEs)
Perceptrons
Perceptron: A machine that classifies input vectors by applying linear
functions to them (Rosenblatt, 1958).
Perceptron Learning Algorithm: A stochastic gradient-descent method for
finding a linear function that properly classifies a set of input vectors
(Minsky & Papert, 1969).
[Figure: perceptron with inputs X and Y, weights wx = -1 and wy = 1, and threshold 5.]
The node fires when x·wx + y·wy > 5, i.e., (-1)x + (1)y > 5.
Folding the threshold into a weight of -5 on a constant input of 1 gives the equivalent zero-threshold form:

  (-1)x + (1)y + (-5)(1) > 0
Perceptron Training Rule (Mitchell, 1997)
  ∆wij = η (ti - oi) xj

where η is the learning rate, ti is the expected output, and (ti - oi) is the error.
Apply to all weights after each misclassified training example.
Intuitive:
If sgn(error) = + , then oi needs to increase.
So if xj > 0, then increase wij.
Else if xj < 0, then decrease wij.
If sgn(error) = - , then oi needs to decrease.
So if xj > 0, then decrease wij.
Else if xj < 0, then increase wij.
If sgn(error) = 0, then leave wij alone.
[Figure: node i with input xj (weight wij) and a constant input 1, a threshold of 0, output oi, and error ti - oi.]
If the input vectors are linearly separable, and the learning rate is sufficiently small, then repeated presentation of the training inputs and application of this rule will lead to a set of weights that correctly classifies all training examples.
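A minimal sketch of this rule in Python (the toy data, learning rate, and {-1, 1} output convention are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgn(net):
    return 1 if net > 0 else -1  # step transfer function with outputs in {-1, 1}

# Toy linearly separable data: columns are (x, y, constant bias input 1).
X = np.array([[2.0, 6.0, 1.0], [1.0, 9.0, 1.0], [1.0, 3.0, 1.0], [5.0, 5.0, 1.0]])
t = np.array([1, 1, -1, -1])     # target classes

w = np.zeros(3)                  # weights wx, wy, wz (wz multiplies the bias input)
eta = 0.1                        # learning rate

for epoch in range(100):
    errors = 0
    for xd, td in zip(X, t):
        o = sgn(w @ xd)
        if o != td:                     # update only on misclassified examples
            w += eta * (td - o) * xd    # delta_wij = eta * (ti - oi) * xj
            errors += 1
    if errors == 0:                     # all training examples classified correctly
        break

print("weights:", w)
```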
Perceptrons & Adalines
• Descriptions of the two vary from book to book.
• I/O
  – Originally, perceptrons used {0, 1} and adalines {-1, 1}. Now perceptrons can use both.
  – Both sum their inputs and use a step function to compute a binary output.
• Training
  – Perceptron training rule: ∆wij = η(ti - oi)xj   (η = learning rate)
  – Delta rule: ∆wij = η(ti - neti)xj
  – neti -vs- oi is the only difference, but it is a big one (see the code sketch after this list):
    • The delta rule can asymptotically approach the error minimum of a non-linearly-separable problem; the perceptron rule cannot.
    • The perceptron rule can converge to a perfect classifier of a linearly separable problem; the delta rule cannot always do so.
• Terminology Warning!!
  – Some books use the term "multi-layer perceptron", but neither perceptrons nor adalines in their classic forms are appropriate for trainable multi-layer networks, since both use step transfer functions, which are non-continuous and hence non-differentiable.
  – The sigmoid units used in most multi-layer feed-forward networks are similar to both perceptrons and adalines, but they use a continuous, differentiable sigmoidal transfer function.
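To make the neti -vs- oi difference concrete, here is a hedged one-step comparison on made-up numbers (the data and weights are arbitrary):

```python
import numpy as np

x = np.array([0.5, -1.0, 1.0])   # one training input (last entry = bias)
w = np.array([0.2, 0.4, -0.1])   # current weights
t, eta = 1.0, 0.1                # target and learning rate

net = w @ x                          # net_i = sum_j w_ij * x_j
o = 1.0 if net > 0 else -1.0         # thresholded output

dw_perceptron = eta * (t - o) * x    # perceptron rule: error uses thresholded o
dw_delta      = eta * (t - net) * x  # delta rule: error uses the raw sum net

print("net =", net, " o =", o)
print("perceptron update:", dw_perceptron)
print("delta-rule update:", dw_delta)
```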
Function Learning
• The main application of feed-forward ANNs is to learn a general function (mapping) between a particular domain (D) and range (R) when given a set of examples: {(d, r): d in D, r in R}.
• D and R may contain vectors or scalars.
[Figure: mapping F sending domain elements d1, d2, d3, d4 to range elements r1, r2, r3.]
Example set = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}
Goal: Build an ANN that can take ANY element d of D on its input
layer and produce F(d) on its output layer.
Problem: The example set normally represents a very small fraction of
the complete mapping set (which may be infinite).
Standard ANN Approach
• Feed-forward neural net with back-propagation learning.
Examples = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}
[Figure: d1 is encoded onto the input layer (In); the output layer's activity (Out) is decoded to r*. The error E = r3 - r* is used to adjust the weights based on their contributions to E, so that E is reduced.]
Classification & Learning
case    x    y   sum*  class
1       1    3    -3    -1
2      -5    2     2    +1
3       1    3    -3    -1
4       1    9     3    +1
5      -2    4     1    +1
6      -7    2     4    +1
7       5    5    -5    -1

*Sum assumes values -1, 1, -5 for the weights (wx, wy, wz).
[Figure: perceptron with inputs X and Y, a constant input 1, weights wx, wy, wz, and threshold 0.]
Classification: The perceptron should compute the proper class for each input x-y pair. For a single perceptron, this is only possible when the input vectors are linearly separable.
Learning: Find the proper values for weights wx, wy and wz
so that the perceptron properly classifies all input cases.
This is a search problem in weight-vector space.
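The table above is easy to reproduce in code; this sketch hard-codes the stated weights (wx, wy, wz) = (-1, 1, -5) and the seven cases:

```python
# Weights from the slide: wx = -1, wy = 1, wz = -5 (wz multiplies the constant input 1).
wx, wy, wz = -1, 1, -5
cases = [(1, 3), (-5, 2), (1, 3), (1, 9), (-2, 4), (-7, 2), (5, 5)]

for i, (x, y) in enumerate(cases, start=1):
    s = wx * x + wy * y + wz * 1           # weighted sum, thresholded at 0
    cls = 1 if s > 0 else -1
    print(f"case {i}: x={x:3d} y={y:3d} sum={s:3d} class={cls:+d}")
```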
Training & Testing Phases
[Figure: the data set is split into a training set, which is run through the network (In to Out) N times, and a test set, which is run through once.]
Training:
• Epoch = one processing round of the entire training set.
• Run training through many epochs.
• Record error
• Use error to update weights and (eventually) achieve proper
discrimination among the classes present in the training cases.
Testing:
• Run test set through ANN 1 time
• Do not change weights
• Record error
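A self-contained sketch of the two phases (the data, the 75/25 split, and the single sigmoid unit are assumptions; any trainable network could stand in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set, split once into training and test sets.
X = rng.normal(size=(40, 3)); X[:, 2] = 1.0          # last column = bias input
t = (X[:, 0] + X[:, 1] > 0).astype(float)            # targets in {0, 1}
X_train, t_train, X_test, t_test = X[:30], t[:30], X[30:], t[30:]

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
w, eta = np.zeros(3), 0.5

# Training: run the training set through many epochs, updating weights each time.
for epoch in range(200):
    for xd, td in zip(X_train, t_train):
        od = sigmoid(w @ xd)
        w += eta * (td - od) * od * (1 - od) * xd    # sigmoid-unit delta rule

# Testing: one pass, record the error, and do NOT change the weights.
o_test = sigmoid(X_test @ w)
test_error = 0.5 * np.sum((t_test - o_test) ** 2)
print("test error:", test_error)
```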
Gradient Descent Weight Learning
[Figure: multi-layer network with weights wa, ..., wk, ..., wm; outputs o1, o2, o3 are compared with expected outputs t1, t2, t3 to yield the error.]
Base weight changes upon their contribution to the error, such that the updated weights will create LESS error on the same training cases.

  Contribution = ∂Error/∂wij
Gradient Descent
  ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn] = gradient of E w.r.t. the weight vector

[Figure: error surface E plotted over two weights, w31 and w32.]
Gradient-Descent Training Rule
• As described in Machine Learning (Mitchell, 1997).
• Also called the delta rule, although not quite the same as the adaline delta rule.
• Compute Ei, the error at output node i over the set of training instances, D (D = training set):

  Ei = (1/2) ∑d∈D (tid - oid)²

• Base weight updates on ∂Ei/∂wij:

  ∆wij = -η ∂Ei/∂wij    (= distance × direction to move in error space)
Intuitive: Do what you can to reduce Ei, so:
  If increases in wij will increase Ei (i.e., ∂Ei/∂wij > 0), then decrease wij;
  but if increases in wij will decrease Ei (i.e., ∂Ei/∂wij < 0), then increase wij.
• Compute ∂Ei/∂wij (i.e., wij's contribution to the error at node i) for every input weight to node i.
• Gradient Descent Method: updating all wij by the delta rule amounts to moving along the path of steepest descent on the error surface.
• Difficult part: computing ∂Ei/∂wij.
Computing ∂Ei/∂wij

  ∂Ei/∂wij = ∂/∂wij [ (1/2) ∑d∈D (tid - oid)² ]
           = (1/2) ∑d∈D 2 (tid - oid) ∂(tid - oid)/∂wij
           = -∑d∈D (tid - oid) ∂oid/∂wij
           = -∑d∈D (tid - oid) ∂fT(sumid)/∂wij

where sumid = ∑j wij xjd.

In general:

  ∂fT(sumid)/∂wij = [∂fT/∂sumid] · [∂sumid/∂wij] = [∂fT/∂sumid] · xjd
Computing ∂fT/∂sumid

Identity fT:

  fT(sumid) = sumid  =>  ∂fT(sumid)/∂sumid = 1

  so  ∂fT(sumid)/∂wij = [∂fT(sumid)/∂sumid] · [∂sumid/∂wij] = (1) · xjd = xjd

Sigmoidal fT:

  fT(sumid) = 1 / (1 + e^(-sumid))

  ∂fT(sumid)/∂sumid = e^(-sumid) / (1 + e^(-sumid))² = fT(sumid) (1 - fT(sumid))

But since fT(sumid) = oid:

  ∂fT(sumid)/∂wij = oid (1 - oid) xjd

Note: if fT is not continuous, and hence not differentiable everywhere, then we cannot use the delta rule.
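The sigmoid identity fT'(sum) = fT(sum)(1 - fT(sum)) can be sanity-checked numerically; the test points below are arbitrary:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

for s in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    h = 1e-6
    numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)   # central difference
    analytic = sigmoid(s) * (1 - sigmoid(s))                # f'(s) = f(s)(1 - f(s))
    assert abs(numeric - analytic) < 1e-8
print("sigmoid derivative identity verified")
```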
Weight Updates for Simple Units
fT = identity function
  ∂Ei/∂wij = -∑d∈D (tid - oid) ∂fT(sumid)/∂wij = -∑d∈D (tid - oid) xjd

  ∆wij = -η ∂Ei/∂wij = η ∑d∈D (tid - oid) xjd

[Figure: node i receives input xjd via weight wij; its output oid is compared with target tid to give the error Eid.]
Weight Updates for Sigmoidal Units
fT = sigmoidal function
  ∂Ei/∂wij = -∑d∈D (tid - oid) ∂fT(sumid)/∂wij = -∑d∈D (tid - oid) oid (1 - oid) xjd

  ∆wij = -η ∂Ei/∂wij = η ∑d∈D (tid - oid) oid (1 - oid) xjd

[Figure: same node diagram as above, but with a sigmoidal transfer function.]
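Both update formulas translate directly into code. Here is a sketch of full-batch training of a single sigmoidal unit (the OR-like toy data and learning rate are assumptions):

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

X = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])   # rows = x_d, last column = bias
t = np.array([0.0, 1.0, 1.0, 1.0])                 # OR-like targets
w, eta = np.zeros(3), 0.5

for epoch in range(1000):
    o = sigmoid(X @ w)                             # o_id for every d in D
    # delta_w_ij = eta * sum_d (t_id - o_id) * o_id * (1 - o_id) * x_jd
    w += eta * X.T @ ((t - o) * o * (1 - o))

print("outputs:", np.round(sigmoid(X @ w), 2))
```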
Incremental Gradient Descent
After each training instance, d, update the weights by:

  Simple unit:     ∆wij = η (tid - oid) xjd      (*same as the perceptron rule, with error term (tid - oid))
  Sigmoidal unit:  ∆wij = η (tid - oid) oid (1 - oid) xjd

[Figure: node i receives xjd via weight wij; target tid, output oid.]
Backpropagation Learning in Multi-Layer ANNs
• Still use the gradient-descent (delta) rule:

  ∆wij = -η ∂Ei/∂wij

  But now the effects of an arc's weight change on the error need to be computed across all nodes along all arcs from the current arc to the output layer.
• Starting from the output layer and moving back through the hidden nodes, compute an error term δi for each node i.
• Then, for each arc going into node i, compute the contribution of that arc's weight wij to the total error as ∆wij = η δi xjd, where xjd is the output value of node j on training example d.
• When an error term has been calculated for every node to which node j sends outputs, then node j's error term can be computed as the product of:
  – a) the influence of j's input sum (netj) upon j's output.
  – b) the sum of the contributions of j's output to each of its downstream neighbors' error terms.
    • Each such contribution is simply the weight along the arc times the error term of the node on the downstream end of that arc.
BackPropagation
For each input vector:
1. Propagate the inputs forward through the network.
2. Propagate the errors backward through the network.

  Error term for output units:   δi = -∂Ed/∂sumid = (tid - oid) oid (1 - oid)

  Error term for hidden units:   δi = -∂Ed/∂sumid = oid (1 - oid) ∑k∈Outputs wki δk

3. Compute all weight changes:   ∆wij = η δi xj

[Figure: node i receives xj from node j via weight wij; i's error term δi combines the error terms δk, δl, δm of its downstream nodes k, l, m via the weights wki, ..., wmi.]
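Putting the two error-term formulas and the weight-change rule together, here is a minimal incremental backprop sketch for one hidden layer (the network size, toy data, and learning rate are assumptions):

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
rng = np.random.default_rng(1)

# Assumed architecture: 2 inputs + bias -> 2 sigmoid hidden units -> 1 sigmoid output.
W_hid = rng.uniform(-0.5, 0.5, size=(2, 3))  # hidden weights (2 units x 3 inputs)
W_out = rng.uniform(-0.5, 0.5, size=3)       # output weights (2 hidden + bias)
eta = 0.5                                    # assumed learning rate

def backprop_step(x, t):
    """One forward + backward pass for a single training pair (x includes bias)."""
    global W_hid, W_out
    # 1. Propagate the inputs forward through the network.
    h = sigmoid(W_hid @ x)             # hidden outputs
    h_b = np.append(h, 1.0)            # add the constant bias input for the output layer
    o = sigmoid(W_out @ h_b)           # network output
    # 2. Propagate the errors backward through the network.
    delta_out = (t - o) * o * (1 - o)                  # output unit: (t - o) o (1 - o)
    delta_hid = h * (1 - h) * W_out[:2] * delta_out    # hidden: o(1-o) * sum_k w_ki delta_k
    # 3. Compute all weight changes: delta_w_ij = eta * delta_i * x_j.
    W_out += eta * delta_out * h_b
    W_hid += eta * np.outer(delta_hid, x)

backprop_step(np.array([1.0, 0.0, 1.0]), 1.0)   # one incremental update on a toy pair
```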
Incremental -vs- Batch Processing in
Backprop Learning
• In both cases, ∆wij needs to be computed after EACH training instance, since ∆wij is a function of xjd (for all j, d), i.e., the output of each node for each particular training instance.
• But we can delay the updating of wij until after ALL training instances have been processed (batch mode).
• Incremental mode:
  – wij = wij + ∆wij    (do after each training instance)
• Batch mode:
  – sum-∆wij = sum-∆wij + ∆wij    (do after each training instance)
  – wij = wij + sum-∆wij    (do after the complete training set)
  – So the same weight values will be used for each training instance in an epoch (= one processing round of the entire training set).
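A sketch of the two modes side by side (toy data assumed; delta_w is the sigmoid-unit rule from earlier):

```python
import numpy as np

def delta_w(w, xd, td, eta=0.1):
    od = 1.0 / (1.0 + np.exp(-(w @ xd)))                 # sigmoid unit
    return eta * (td - od) * od * (1 - od) * xd

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])       # toy set, bias column
t = np.array([0.0, 1.0, 1.0])

# Incremental mode: apply each delta_w immediately.
w = np.zeros(2)
for xd, td in zip(X, t):
    w += delta_w(w, xd, td)              # weights drift within the epoch

# Batch mode: accumulate, then apply once after the complete training set.
w, sum_dw = np.zeros(2), np.zeros(2)
for xd, td in zip(X, t):
    sum_dw += delta_w(w, xd, td)         # same weights used for every instance
w += sum_dw
```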
Explaining Error Terms
  δi ≡ -∂Ed/∂sumid        ∆wij = η δi xjd

• What is node i's effect on the total error, Ed? It affects Ed via its sum of inputs on case d, sumid.
• The negative sign is merely for convenience when updating wij.
• For any input weight wij to node i, its influence on Ed is simply its effect on sumid times sumid's effect on Ed:

  ∂Ed/∂wij = [∂sumid/∂wij] · [∂Ed/∂sumid]

• A weight's influence on the sum is simply xjd, since sumid = ∑k wik xkd gives ∂sumid/∂wij = xjd.
• So once we compute a node's influence upon Ed, we can use that value to compute the influences of each weight, and thus to update each weight via the negative of its influence.

[Figure: node j's output xjd feeds node i via weight wij.]
Output Node Error Term
• If node i is an output node, then the contribution of sumid to Ed is its contribution to the error on the output of node i, Eid:

  δi = -∂Ed/∂sumid = -∂Eid/∂sumid

• The standard error function is a quadratic:

  -∂Eid/∂sumid = -∂/∂sumid [ (1/2) (tid - oid)² ] = (tid - oid) ∂oid/∂sumid

• The influence of the sum on the output is simply the derivative of the transfer function with respect to the sum. For a sigmoid unit, we've already shown that value to be oid(1 - oid), so:

  δi = (tid - oid) ∂fT/∂sumid = (tid - oid) oid (1 - oid)
Hidden Node Error Term
• If node i is a hidden node, then the contribution of sumid to Ed is via its contributions to the error terms of each node k that i outputs to:

  δi = -∂Ed/∂sumid = -[∂oid/∂sumid] ∑k∈Outputs [∂sumkd/∂oid] · [∂Ed/∂sumkd]

• The influence of output oid on sumkd is simply the weight wki, since sumkd = ∑j∈Inputs wkj ojd gives ∂sumkd/∂oid = wki. The other two derivatives were computed earlier:

  ∂Ed/∂sumkd = -δk,  and  ∂oid/∂sumid = oid (1 - oid) for a sigmoid unit.

• Putting it all together:

  δi = oid (1 - oid) ∑k∈Outputs wki δk
Influence of Hidden Nodes on Ed
  ∂Ed/∂sumid = [∂oid/∂sumid] ∑k∈Outputs(i) [∂sumkd/∂oid] · [∂Ed/∂sumkd]
             = [∂oid/∂sumid] ∑k∈Outputs(i) wki · [∂Ed/∂sumkd]

[Figure: hidden node i's output oid fans out via weights wji, wki, wli, wmi to downstream nodes j, k, l, m; each path contributes its weight times that node's own error derivative (∂Ed/∂sumjd, ..., ∂Ed/∂summd) to i's influence on Ed.]
Learned XOR: Version I
Slightly sketchy: for this example, backpropagation was used with the perceptron training rule instead of the delta rule. This is necessary because fT of the perceptron is not differentiable, and thus not amenable to the delta rule. Cleaner approach: for perceptron nets, use the GA to find a good weight set.

[Figure: inputs A and B (plus constant-1 inputs) feed hidden nodes X and Y, which in turn (plus another constant-1 input) feed output node Z. Learned weights: -3.3, 6.5, -5.4, -5.7, 10, 1.1, 8.6, -3.2, 3.8.]

 A  B | X         A  B | Y         X  Y | Z
 1  1 | -1        1  1 |  1        1  1 |  1
 1 -1 |  1        1 -1 |  1        1 -1 |  1
-1  1 | -1       -1  1 | -1       -1  1 | -1
-1 -1 | -1       -1 -1 |  1       -1 -1 |  1

X = A ∧ ¬B        Y = A ∨ ¬B        Z = X ∨ ¬Y

Z = X ∨ ¬Y = (A ∧ ¬B) ∨ ¬(A ∨ ¬B) = (A ∧ ¬B) ∨ (¬A ∧ B) = XOR(A, B)
Learned XOR: Version II
*The concepts that nodes X and Z represent are different in the two versions.

[Figure: same topology as Version I. Learned weights: -3.7, -4.5, 4.8, .85, 4.7, -.35, 3.3, -.5, .8.]

 A  B | X         A  B | Y         X  Y | Z
 1  1 |  1        1  1 |  1        1  1 | -1
 1 -1 | -1        1 -1 |  1        1 -1 |  1
-1  1 |  1       -1  1 | -1       -1  1 |  1
-1 -1 |  1       -1 -1 |  1       -1 -1 |  1

X = ¬(A ∧ ¬B)     Y = A ∨ ¬B        Z = ¬(X ∧ Y)

Z = ¬(X ∧ Y) = ¬X ∨ ¬Y = (A ∧ ¬B) ∨ ¬(A ∨ ¬B) = (A ∧ ¬B) ∨ (¬A ∧ B) = XOR(A, B)
Learned XOR: Version III
*Nodes X, Y and Z represent different logical concepts in this version than in the previous 2 versions.

[Figure: same topology. Learned weights: -12.9, 9, -77, 74, -11.2, -.7, -1.8, -1.1, -.65.]

 A  B | X         A  B | Y         X  Y | Z
 1  1 |  1        1  1 | -1        1  1 | -1
 1 -1 | -1        1 -1 | -1        1 -1 | -1
-1  1 | -1       -1  1 | -1       -1  1 | -1
-1 -1 | -1       -1 -1 |  1       -1 -1 |  1

X = A ∧ B         Y = ¬(A ∨ B)      Z = ¬X ∧ ¬Y

Z = ¬X ∧ ¬Y = ¬(A ∧ B) ∧ (A ∨ B) = XOR(A, B)
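The learned solutions above can be reproduced, up to random initialization, by training a 2-hidden-unit sigmoid network on XOR. This sketch uses 0/1 encoding rather than the slides' -1/1, and a particular seed; some seeds land in a local minimum, which is exactly the motivation for Momentum below:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
rng = np.random.default_rng(4)

# XOR in 0/1 encoding; third input column is the constant bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])

W1 = rng.uniform(-1, 1, (2, 3))   # two hidden units (the X and Y of the figures)
W2 = rng.uniform(-1, 1, 3)        # output unit Z (2 hidden + bias)
eta = 2.0

for epoch in range(5000):
    for xd, td in zip(X, t):
        h = sigmoid(W1 @ xd); hb = np.append(h, 1.0)
        o = sigmoid(W2 @ hb)
        d_out = (td - o) * o * (1 - o)
        d_hid = h * (1 - h) * W2[:2] * d_out
        W2 += eta * d_out * hb
        W1 += eta * np.outer(d_hid, xd)

H = sigmoid(X @ W1.T)                        # hidden outputs for all 4 cases
Hb = np.hstack([H, np.ones((4, 1))])
print("outputs:", np.round(sigmoid(Hb @ W2), 2), " targets:", t)
```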
Recurrent Networks
Goal: Include previous states of the network as input, thereby
including history in decision making.
Applications:
Time Series Prediction (e.g. Stocks, Weather forecasts)
Control Systems (Robots, internal environments)
[Figure: feed-forward network predicting y(t+1) from current inputs x1(t), x2(t) and a hard-wired history window of previous outputs y(t), y(t-1), y(t-2).]
• This works fine
• Normal backprop applies
• But, length of the history is hard-wired.
Recurrent Networks (2)
[Figure: recurrent network with inputs x1(t), x2(t) feeding memory node m via weights wm1, wm2; m feeds node r, which also receives the fed-back output y(t); r feeds back into m with decay rate p; the network outputs y(t+1).]

• v(t) means "value of v after its t-th update".
• Assume unthresholded neurons in this example.
• r(t) = m(t) + y(t)
• m(t+1) = p·r(t) + X(t+1), where X(t) = x1(t)·wm1 + x2(t)·wm2 and 0 <= p <= 1 (decay/forgetting rate).

m(1) = p·r(0) + X(1) = X(1)
r(1) = m(1) + y(1) = X(1) + y(1)
m(2) = p·r(1) + X(2) = p[X(1) + y(1)] + X(2)
r(2) = m(2) + y(2) = p[X(1) + y(1)] + X(2) + y(2)
m(3) = p·r(2) + X(3) = p²[X(1) + y(1)] + p[X(2) + y(2)] + X(3)

In general:

  m(k) = ∑i=1..k-1 p^i [X(k-i) + y(k-i)] + X(k)

=> Decaying importance of earlier inputs & outputs.
High (low) p => slow (fast) decay.
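The closed form can be checked by simulating the recurrence on arbitrary toy sequences for X(t) and y(t):

```python
# Verify m(k) = sum_{i=1}^{k-1} p^i [X(k-i) + y(k-i)] + X(k) against the recurrence.
p = 0.7
Xs = {1: 0.5, 2: -1.0, 3: 2.0, 4: 0.3}   # X(t): weighted external input at step t
ys = {1: 1.0, 2: 0.2, 3: -0.5, 4: 0.8}   # y(t): network output at step t

m, r = 0.0, 0.0                           # r(0) = 0, so m(1) = X(1)
for k in range(1, 5):
    m = p * r + Xs[k]                     # m(k) = p * r(k-1) + X(k)
    r = m + ys[k]                         # r(k) = m(k) + y(k)
    closed = sum(p**i * (Xs[k - i] + ys[k - i]) for i in range(1, k)) + Xs[k]
    assert abs(m - closed) < 1e-12, (k, m, closed)
print("closed form verified for k = 1..4")
```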
Momentum
Gradient descent can easily get stuck at a local minimum of the error landscape. Momentum allows the previous search direction to influence the current direction:

  ∆wkj(t+1) = η δk xj + α ∆wkj(t)

where η δk xj is the standard delta-rule update (the best local move) and α, with 0 <= α < 1, is the momentum term.

[Figure: the resultant search direction is the vector sum of the best local move and the momentum term.]

Momentum smoothes the error landscape by guiding search in the best average direction (i.e., that which will, on average, decrease the error most) for a region (that may be quite jagged).
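In code, momentum just means remembering the previous weight change; the error term and inputs below are hypothetical stand-ins for values backprop would supply:

```python
import numpy as np

eta, alpha = 0.1, 0.9          # learning rate and momentum term, alpha in [0, 1)
w = np.zeros(3)
prev_dw = np.zeros(3)          # delta_w_kj(t), the previous update

def momentum_update(delta_k, x_j):
    """delta_w_kj(t+1) = eta * delta_k * x_j + alpha * delta_w_kj(t)"""
    global w, prev_dw
    dw = eta * delta_k * x_j + alpha * prev_dw
    w += dw
    prev_dw = dw

momentum_update(0.5, np.array([1.0, -1.0, 1.0]))   # toy error term and inputs
momentum_update(0.4, np.array([1.0, -1.0, 1.0]))   # repeated direction accelerates
print(w)
```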
Design Issues for Learning ANNs
• Initial Weights
– Random -vs- Biased
– Width of init range
• Typically: [-1 1] or [-0.5 0.5]
• Too wide => large weights will drive many sigmoids to saturation =>
all output 1 => takes a lot of training to undo.
• Frequency of Weight Updates
– Incremental - after each input.
• In some cases, all training instances are not available at the same
time, so the ANN must improve on-line.
• Sensitive to presentation order.
– Batch - after each epoch.
• Uses less computation time
• Insensitive to presentation order
Design Issues (2)
• Learning Rate
– Low value => slow learning
– High value => faster learning, but oscillation due to overshoot
– Typical range: [0.1 0.9] - Very problem specific!
– Dynamic learning rate (many heuristics):
• Gradually decrease over the epochs
• Increase (decrease) whenever performance improves (worsens)
• Use 2nd deriv of error function
      – d²E/dw² high => changing dE/dw => rough landscape => lower learning rate
      – d²E/dw² low => dE/dw ≈ constant => smooth landscape => raise learning rate
• Length of Training Period
– Too short => Poor discrimination/classification of inputs
– Too long => Overtraining => nearly perfect on training set, but not general enough to
handle new (test) cases.
– Many nodes & long training period = recipe for overtraining
– Adding noise to training instances can help prevent overtraining.
• (x1, x2, ..., xn) => add noise => (x1+e1, x2+e2, ..., xn+en)
Design Issues (3)
• Size of Training Set
– Heuristic: |Training Set| > k·|Set of Weights|, where k ≈ 5 or 10.
• Stopping Criteria
– Low error on training set
• Can lead to overtraining if threshold is too low
– Include extra validation set (preliminary test set) and test ANN on it after each
epoch.
• Stop when validation error is low enough.
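A sketch of the validation-based stopping criterion (the data, the single sigmoid unit, and the error threshold are assumptions):

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
rng = np.random.default_rng(7)

# Toy data split into training and validation (preliminary test) sets.
X = rng.normal(size=(60, 3)); X[:, 2] = 1.0
t = (X[:, 0] - X[:, 1] > 0).astype(float)
Xtr, ttr, Xva, tva = X[:40], t[:40], X[40:], t[40:]

w, eta = np.zeros(3), 0.3
for epoch in range(500):
    for xd, td in zip(Xtr, ttr):                  # one training epoch
        od = sigmoid(w @ xd)
        w += eta * (td - od) * od * (1 - od) * xd
    val_err = 0.5 * np.sum((tva - sigmoid(Xva @ w)) ** 2)
    if val_err < 0.5:          # assumed threshold: stop when validation error is low enough
        print(f"stopping at epoch {epoch}, validation error {val_err:.3f}")
        break
```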
Supervised Learning ANN Applications
• Classification D: feature vectors => R: classes
– Medical Diagnosis: symptoms => disease
– Letter Recognition: pixel array => letter
• Control    D: situation state vectors => R: responses/actions
  – CMU's ALVINN: road picture => steering angle
  – Chemical plants: temperature, pressure, chemical concentrations in a container => valve settings for heat, chemical inputs/outputs
• Prediction    D: time series of previous states s1, s2, ..., sn => R: next state sn+1
  – Finance: price of a stock on days 1...n => price on day n+1
  – Meteorology: cloud cover on days 1...n => cloud cover on day n+1
Supervised Learning in the Cerebellum
• The inferior olive gets sensory input from the skin, joints, muscle stretch receptors, etc. These indicate the stresses put on the body as a result of the motor action.
• Climbing fibers from the inferior olive provide feedback (i.e., an error signal) which causes LTD (i.e., depression/weakening) of the parallel-fiber-Purkinje synapses.
• Hence, learning in the cerebellum has a "supervised" flavor to it.
[Figure: cerebellar circuit (arrows denote signal direction). Thought signals from the cerebral neocortex travel along mossy fibers to granule cells, whose parallel fibers synapse on Purkinje cells. Climbing fibers from the inferior olive also contact the Purkinje cell, whose output goes to motor cortex (action!).]
Biology & Backpropagation
[Figure: pre-synaptic neuron N synapses onto post-synaptic neuron P; a depolarizing signal can spread from P's axon back into its dendrites.]
• When a post-synaptic neuron, P, fires an action potential,
this can have electrical (and then chemical) effects upon P’s dendrites.
• These effects may even spread to the pre-synaptic neuron, N, and even
further back.
• BUT, this is only qualitatively similar to formal ANN backprop. It has
no quantitative similarity to the delta rule; i.e. no error signal is being
sent backwards as the basis for altering synapses in a manner that will
reduce the error.
• The evidence for forms of Reinforcement Learning (RL) in the brain is more convincing.
Backpropagation in Brain Research
• Feedforward ANNs trained using backprop are useful in neurophysiology.
• Given: Pairs (Sensory Inputs, Behavioral Actions) {e.g. from psychological
experiments}
• Backprop produces:
– A general mapping/function that both
• covers the I/O pairs and
• can be used to PREDICT the behavioral effects of untested inputs.
– A neural model of how the mapping might be wired in the brain.
• The model tells neurobiologists WHAT to look for and WHERE.
• So even though the means of producing the proper (trained) network is not
biological, the RESULT can be very biological and useful.
• The brain has so many neurons and so many connections that biologists need
LOTS of help to narrow the search, both in terms of:
– the structures of neural circuits in a region, and
– the functions embodied in those circuits.
• Backprop finds error minima; evolution may have also done so!
Backprop Brain Applications
• Vestibulo-Ocular Reflex (VOR): discovering synaptic strengths in the circuitry that integrates signals from the ears (vestibular: balance and head velocity) and eyes (ocular: visual images) so that we can keep our eyes focused on an object when it and/or we are moving.
• LEECHNET: Tuning (hundreds of) weight parameters in
the neural circuitry that governs the bending of leech body
segments in response to touch stimuli.
• Stereo Vision: Tuning large networks for integrating
images from the right and left eye so as to detect object
depth.
Stereoscopic Vision
• From two 2-d images (taken from points a few cm apart), the brain creates a 3-d picture.
• Integration of the 2-d images occurs in the visual cortex.
Fixation Cells
• Fixation depth: the distance from the nose to the intersection of the two lines of sight.
• Vergence angle: the deviation of the eyes from parallel.
• Fixation cells: neurons of the visual cortex that are only active when their location in the visual cortex is receiving the SAME signal from both eyes.
  – They recognize objects at the fixation depth.

[Figure: left- and right-eye arrays of depth values (0-5); a fixation cell compares the same position in both arrays and fires where the values match, i.e., for objects at the fixation depth d.]
Near & Far Cells
Near and far cells detect correspondences between the left and right images that indicate objects in front of or behind the fixation depth.

[Figure: the same left- and right-eye depth arrays (depths 0-5) as above, but compared through shifted connections.]

• For near cells, the corresponding connections are shifted one unit to the left as compared to the fixation cells.
• There are near cells for many different levels of nearness; for greater levels of nearness, shift more.
• Far cells work similarly, but with the opposite shift.
ANN for Stereoscopic Vision
• Fusion.Net: Paul Churchland (1992).
• Uses artificial neurons to model the combination of 2-d images in the visual cortex.
• Given: two 2-d pictures.
• Detect: objects at the fixation depth and in the near & far zones.
• Uses the same type of neuron for fixation, near and far cells; only the connection patterns have different shifts.

[Figure: cell X receives a constant input (weight 1) and inhibition (weight -10) from each of two hidden nodes; each hidden node receives weight +1 from one eye's pixel and -1 from the other's.]
• X is a fixation, near or far cell.
• It acts like a NOT-XOR.
• If both pixels are the same,
then neither of the hidden nodes
fires. Hence, the output node is
dominated by the constant input,
and it fires.
• If the pixels differ, then exactly one
of the hidden nodes fires, and its output
will then inhibit the output node.
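The cell translates directly into step-threshold units. The {0, 1} pixel encoding and the exact hidden-unit wiring below are assumptions consistent with the weights shown (+1/-1 into the hidden pair, -10 inhibition, +1 constant input):

```python
def step(net):
    return 1 if net > 0 else 0        # step-threshold unit: fires when net input > 0

def fixation_cell(left_pixel, right_pixel):
    """NOT-XOR of two binary pixels, built from the Fusion.Net-style weights."""
    h1 = step(1 * right_pixel - 1 * left_pixel)   # fires iff right > left
    h2 = step(1 * left_pixel - 1 * right_pixel)   # fires iff left > right
    # The output gets +1 from the constant input and -10 from each hidden unit,
    # so a single firing hidden unit is enough to inhibit it.
    return step(1 * 1 - 10 * h1 - 10 * h2)

for l in (0, 1):
    for r in (0, 1):
        print(f"left={l} right={r} -> X fires: {fixation_cell(l, r)}")
```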
Fusion.Net Example
The ANN detects patterns at 3 levels:
Fixation depth (0)
Nearness depth (-1)
Farness depth (+1)
When seen through stereo
glasses, this is a stack of
3 boxes viewed from the
top.
Fusion.net detects the 3 boxes, one
at each depth level. Each box is flat, since
there are only 3 detection levels, and each
is a plane.
How Realistic is Fusion.net?
• Similar neurons have been found, but the
corresponding connections have not been verified
[Figure: the same NOT-XOR circuit mapped onto cortical layers 3, 4 and 5, with the same +1/-10/-1 weights.]
• Both the brain and Fusion.net need brightness
variation to stimulate stereopsis
Needed Improvements to Fusion.net:
• More depth levels
• Remove constant neurons
• Input layer cells shouldn’t have both excitatory
and inhibitory effects (biologically unrealistic)