Derivation of Backpropagation Rule

Artificial Neural Networks
• Approximate real-valued, discrete-valued, and vector-valued functions
• Continuous or discrete-valued attributes
• Handwritten characters, spoken words, faces…
• Handles noisy data
• Long training times
• Quick evaluation/test time
• Human interpretability of the learned network is not important
Perceptron
y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0, \quad x_0 \equiv 1 \\ 0 & \text{otherwise} \end{cases}
[Figure: perceptron unit with inputs x_0 = 1, x_1, …, x_d and weights w_0, …, w_d feeding a summation unit Σ; the thresholded sum \sum_{j=0}^{d} w_j x_j produces the output y. Inset: the step activation plotted over the range -3 to 3.]
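A minimal Python sketch of this thresholded unit (the weight and input values below are made up for illustration):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold the weighted sum of the inputs; x[0] is the bias input, fixed to 1."""
    return 1 if np.dot(w, x) > 0 else 0

# Hypothetical example: two real inputs plus the bias term x0 = 1.
w = np.array([-0.5, 1.0, 1.0])   # w0 (bias weight), w1, w2
x = np.array([1.0, 0.3, 0.4])    # x0 = 1, x1, x2
print(perceptron_output(w, x))   # prints 1, since -0.5 + 0.3 + 0.4 = 0.2 > 0
```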
Gradient Descent
• Delta rule
• Unthresholded perceptron
[Figure: linear (unthresholded) unit with inputs x_0 = 1, x_1, …, x_d and weights w_0, …, w_d feeding a summation unit Σ with output \hat{y}.]

\hat{y} = \sum_{j=0}^{d} w_j x_j = \mathbf{w} \cdot \mathbf{x}

E(\mathbf{w}) = \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in Tr} \left( y^{(i)} - \hat{y}^{(i)} \right)^2
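A small Python sketch of this error function over a made-up training set; X holds one instance per row, with the first column the bias input x_0 = 1:

```python
import numpy as np

def sum_squared_error(w, X, y):
    """E(w) = sum over the training instances of (y - w . x)^2."""
    y_hat = X @ w                       # predictions of the unthresholded linear unit
    return np.sum((y - y_hat) ** 2)

# Hypothetical training set: 3 instances, 2 inputs plus the bias column x0 = 1.
X = np.array([[1.0, 0.2, 0.7],
              [1.0, 0.9, 0.1],
              [1.0, 0.5, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(sum_squared_error(np.zeros(3), X, y))   # 2.0 with all-zero weights
```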
Representational Power
• Perceptron representational power
y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0, \quad x_0 \equiv 1 \\ 0 & \text{otherwise} \end{cases}

Decision boundary: \sum_{j=0}^{d} w_j x_j = 0

[Figure: a linearly separable set of + and - points split by the hyperplane \sum_{j=0}^{d} w_j x_j = 0, and a second set of + and - points that no single hyperplane can separate.]
Gradient Descent

[Figure-only slide]
Gradient Descent Rule
w_j \leftarrow w_j + \eta \sum_i \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}

w_j \leftarrow w_j + \eta \sum_i \left( y^{(i)} - \hat{P}\left( y^{(i)} = 1 \mid \mathbf{x}^{(i)}, \mathbf{w} \right) \right) x_j^{(i)}
Gradient Descent Algorithm
Initialize each w_j to a small random value
Until the termination condition is met, Do
    Initialize each \Delta w_j to zero
    For i = 1..N
        \hat{y}^{(i)} = \sum_{j=0}^{d} w_j x_j^{(i)}
        For j = 0..d
            \Delta w_j \leftarrow \Delta w_j + \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}
    For j = 0..d
        w_j \leftarrow w_j + \Delta w_j
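A hedged Python sketch of this batch procedure for a linear unit; the function name, step size, and data below are illustrative assumptions, not part of the slides:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, epochs=5000):
    """Batch gradient descent for a linear unit; X is assumed to carry a bias column x0 = 1."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])  # initialize each w_j to a small random value
    for _ in range(epochs):                      # termination condition: fixed number of epochs
        y_hat = X @ w                            # y_hat^(i) = sum_j w_j x_j^(i) for all i
        w += eta * X.T @ (y - y_hat)             # w_j <- w_j + eta * sum_i (y^(i) - y_hat^(i)) x_j^(i)
    return w

# Hypothetical noise-free data generated by a known linear target.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.random((50, 2))])
y = X @ np.array([0.5, 1.0, 2.0])
print(gradient_descent(X, y))   # the learned weights approach [0.5, 1.0, 2.0]
```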
Stochastic Gradient Descent
• Update the weights after each instance rather than summing over all of the training data
• Standard gradient descent follows the true gradient rather than instance-specific gradients
– Therefore, it can use a larger step size
• If there are multiple local minima in E(w), stochastic gradient descent might avoid falling into one
– What else can do this?
Stochastic Gradient Descent Algorithm
Initialize each w_j to a small random value
Until the termination condition is met, Do
    For i = 1..N
        \hat{y}^{(i)} = \sum_{j=0}^{d} w_j x_j^{(i)}
        For j = 0..d
            \Delta w_j \leftarrow \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}
        For j = 0..d
            w_j \leftarrow w_j + \Delta w_j
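A matching Python sketch of the stochastic version, updating the weights after every instance (the data generation mirrors the batch sketch above and is made up for illustration):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.05, epochs=200):
    """Per-instance gradient descent for a linear unit; bias column x0 = 1 assumed."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):            # update after each instance, not after a full pass
            y_hat = w @ x_i                   # y_hat^(i) = sum_j w_j x_j^(i)
            w += eta * (y_i - y_hat) * x_i    # w_j <- w_j + eta (y^(i) - y_hat^(i)) x_j^(i)
    return w

# Same kind of hypothetical noise-free data; the per-instance updates reach a similar answer.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.random((50, 2))])
y = X @ np.array([0.5, 1.0, 2.0])
print(stochastic_gradient_descent(X, y))   # approaches [0.5, 1.0, 2.0]
```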
Multilayer Networks
• Perceptron representational power
y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0, \quad x_0 \equiv 1 \\ 0 & \text{otherwise} \end{cases}

Decision boundary: \sum_{j=0}^{d} w_j x_j = 0

[Figure: a linearly separable set of + and - points split by the hyperplane \sum_{j=0}^{d} w_j x_j = 0, and a second set of + and - points that no single perceptron hyperplane can separate.]
Multilayer Networks

[Figure-only slide, from Machine Learning (Mitchell, 1997)]
Net of Linear Outputs still Linear
[Figure: a two-layer network of linear units with inputs x_0 = 1, x_1, …, x_d, hidden weights w_{jh}, and outputs y_1, …, y_K; each unit computes \hat{y} = \sum_{j=0}^{d} w_j x_j = \mathbf{w} \cdot \mathbf{x}, so the composed network is still a linear function of its inputs.]
Perceptron – Not Differentiable
y = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x_j > 0, \quad x_0 \equiv 1 \\ 0 & \text{otherwise} \end{cases}
[Figure: perceptron unit with inputs x_0 = 1, x_1, …, x_d, weights w_0, …, w_d, summation unit Σ, and thresholded output y; the step activation plotted over the range -3 to 3 jumps from 0 to 1, so it is not differentiable.]
Logistic – Nonlinear & Differentiable
[Figure: sigmoid unit with inputs x_0 = 1, x_1, …, x_d, weights w_0, …, w_d, summation unit Σ, and logistic output \hat{y}.]

g(\mathbf{x}) = \sum_{j=0}^{d} w_j x_j, \quad x_0 \equiv 1

\hat{y} = \sigma(\mathbf{x}) = \frac{1}{1 + e^{-g(\mathbf{x})}}

\frac{d\sigma(z)}{dz} = \sigma(z) \left( 1 - \sigma(z) \right)
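A short Python sketch of the sigmoid unit, plus a numerical check of the derivative identity above (the test point z = 0.3 is arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit_output(w, x):
    """Output of a sigmoid unit: sigma(sum_j w_j x_j), with x[0] the bias input = 1."""
    return sigmoid(np.dot(w, x))

# Check d sigma/dz = sigma(z)(1 - sigma(z)) numerically at z = 0.3.
z, eps = 0.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-9)   # True
```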
Multilayer Networks – Logistic σ(x)

[Figure-only slide, from Machine Learning (Mitchell, 1997)]
The Backpropagation Algorithm
• Sum the error over all N training instances, as before
• Also sum over the K outputs of the network
E(\mathbf{w}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_k^{(i)} - \hat{y}_k^{(i)} \right)^2
Backpropagation Algorithm

Create the network and initialize each w_{jh} and w_{hk} to a small random value
Until the termination condition is met, Do: (1000s of epochs)
    For each training instance <x, y>
        Calculate all unit outputs
        Error backpropagation:
            For all network outputs, k = 1..K
                \delta_k \leftarrow \hat{y}_k (1 - \hat{y}_k)(y_k - \hat{y}_k)
            For all hidden units, h = 0..H
                \delta_h \leftarrow o_h (1 - o_h) \sum_{k=1..K} w_{hk} \delta_k
        Update all weights:
            w_{hk} \leftarrow w_{hk} + \eta \, \delta_k o_h
            w_{jh} \leftarrow w_{jh} + \eta \, \delta_h x_j

[Figure: two-layer network with inputs x_0 = 1, x_1, …, x_d, input-to-hidden weights w_{jh}, hidden outputs o_h, hidden-to-output weights w_{hk}, and outputs y_1, …, y_K.]
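A compact Python sketch of one such update for a single-hidden-layer sigmoid network; the array names (W_jh, W_hk, o_hidden) and the tiny example network are assumptions made for this illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_update(x, y, W_jh, W_hk, eta=0.1):
    """One stochastic backpropagation step for a single-hidden-layer sigmoid network.

    x: inputs including the bias term x[0] = 1, shape (d+1,)
    y: target outputs, shape (K,)
    W_jh: input-to-hidden weights, shape (d+1, H)
    W_hk: hidden-to-output weights, shape (H+1, K), row 0 for the hidden bias o_0 = 1
    """
    # Forward pass: calculate all unit outputs.
    o_hidden = np.concatenate(([1.0], sigmoid(x @ W_jh)))   # o_0 = 1 plus H sigmoid units
    y_hat = sigmoid(o_hidden @ W_hk)                        # K sigmoid output units

    # Error backpropagation.
    delta_k = y_hat * (1 - y_hat) * (y - y_hat)             # output-unit deltas
    delta_h = o_hidden * (1 - o_hidden) * (W_hk @ delta_k)  # hidden-unit deltas (incl. bias slot)

    # Weight updates: w <- w + eta * delta * (input feeding that weight).
    W_hk += eta * np.outer(o_hidden, delta_k)
    W_jh += eta * np.outer(x, delta_h[1:])   # drop the bias unit's delta; it has no incoming weights
    return y_hat

# Hypothetical tiny network: 2 inputs (+ bias), 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W_jh = rng.normal(scale=0.1, size=(3, 3))
W_hk = rng.normal(scale=0.1, size=(4, 2))
backprop_update(np.array([1.0, 0.5, -0.2]), np.array([1.0, 0.0]), W_jh, W_hk)
```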
Stopping Criteria
• Until termination condition is met, Do
…
• Specified number of iterations
• Error on training data < ε
• Error on validation data < ε
• arg min_i error on validation data
• arg min_i \hat{f}(error on validation)
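One way to realize the last two criteria is to track the validation error across epochs and keep the best weights. A minimal sketch, assuming the caller supplies hypothetical train_step and validation_error callables:

```python
import copy

def train_with_early_stopping(model, train_step, validation_error, max_epochs=20000):
    """Keep the model (weights) that achieved the lowest validation error over all epochs."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    for _ in range(max_epochs):
        train_step(model)                 # one epoch of weight updates (hypothetical callable)
        err = validation_error(model)     # current error on the validation data
        if err < best_error:              # arg min over epochs of validation error
            best_error, best_model = err, copy.deepcopy(model)
    return best_model
```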
Derivation of Backpropagation Rule
Derivation of Backpropagation Rule
Derivation for the output units:

in_k \equiv \sum_h w_{hk} o_h = \text{weighted sum of inputs to output unit } k

\frac{\partial E}{\partial w_{hk}} = \frac{\partial E}{\partial in_k} \frac{\partial in_k}{\partial w_{hk}} = \frac{\partial E}{\partial in_k} o_h = \frac{\partial E}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial in_k} o_h = \frac{\partial}{\partial \hat{y}_k} \left[ \sum_{l=1..K} \left( y_l - \hat{y}_l \right)^2 \right] \frac{\partial \sigma(in_k)}{\partial in_k} o_h

[Figure: two-layer network with inputs x_0 = 1, x_1, …, x_d, input-to-hidden weights w_{jh}, hidden-to-output weights w_{hk}, and outputs y_1, …, y_K.]
Derivation of Backpropagation Rule
\frac{\partial E}{\partial w_{hk}} = \frac{\partial}{\partial \hat{y}_k} \left[ \sum_{l=1..K} \left( y_l - \hat{y}_l \right)^2 \right] \frac{\partial \sigma(in_k)}{\partial in_k} o_h

= \frac{\partial \left( y_k - \hat{y}_k \right)^2}{\partial \hat{y}_k} \, \sigma(in_k) \left( 1 - \sigma(in_k) \right) o_h

= 2 \left( y_k - \hat{y}_k \right) \frac{\partial \left( y_k - \hat{y}_k \right)}{\partial \hat{y}_k} \, \hat{y}_k \left( 1 - \hat{y}_k \right) o_h

\frac{\partial E}{\partial w_{hk}} = -2 \left( y_k - \hat{y}_k \right) \hat{y}_k \left( 1 - \hat{y}_k \right) o_h
Derivation of Backpropagation Rule
\frac{\partial E}{\partial w_{hk}} = \frac{\partial}{\partial w_{hk}} \sum_{l=1..K} \left( y_l - \hat{y}_l \right)^2

= \frac{\partial}{\partial w_{hk}} \left( y_k - \hat{y}_k \right)^2

= 2 \left( y_k - \hat{y}_k \right) \frac{\partial \left( y_k - \hat{y}_k \right)}{\partial w_{hk}}

= -2 \left( y_k - \hat{y}_k \right) \frac{\partial \hat{y}_k}{\partial w_{hk}}

= -2 \left( y_k - \hat{y}_k \right) \frac{\partial \sigma(\mathbf{w}_{\cdot k} \cdot \mathbf{o})}{\partial w_{hk}}

where \mathbf{w}_{\cdot k} \cdot \mathbf{o} = \sum_h w_{hk} o_h = in_k.
Derivation of Backpropagation Rule
\frac{\partial E}{\partial w_{hk}} = -2 \left( y_k - \hat{y}_k \right) \frac{\partial \sigma(\mathbf{w}_{\cdot k} \cdot \mathbf{o})}{\partial w_{hk}}

= -2 \left( y_k - \hat{y}_k \right) \sigma(\mathbf{w}_{\cdot k} \cdot \mathbf{o}) \left( 1 - \sigma(\mathbf{w}_{\cdot k} \cdot \mathbf{o}) \right) \frac{\partial \left( \mathbf{w}_{\cdot k} \cdot \mathbf{o} \right)}{\partial w_{hk}}

= -2 \left( y_k - \hat{y}_k \right) \hat{y}_k \left( 1 - \hat{y}_k \right) \frac{\partial \left( w_{hk} o_h \right)}{\partial w_{hk}}

\frac{\partial E}{\partial w_{hk}} = -2 \left( y_k - \hat{y}_k \right) \hat{y}_k \left( 1 - \hat{y}_k \right) o_h
Derivation of Backpropagation Rule
\frac{\partial E}{\partial w_{hk}} = -2 \left( y_k - \hat{y}_k \right) \hat{y}_k \left( 1 - \hat{y}_k \right) o_h
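The derived formula can be sanity-checked numerically. A small sketch comparing it against finite differences on a made-up output layer (the names o, W_hk and all values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single instance: hidden outputs o (with o[0] = 1) and targets y for K = 2 outputs.
rng = np.random.default_rng(0)
o = np.concatenate(([1.0], rng.random(3)))   # o_0 = 1 plus three hidden activations
W_hk = rng.normal(size=(4, 2))               # hidden-to-output weights
y = np.array([1.0, 0.0])

def E(W):
    """Squared error summed over the K outputs for this one instance."""
    return np.sum((y - sigmoid(o @ W)) ** 2)

# Analytic gradient from the derivation: dE/dw_hk = -2 (y_k - y_hat_k) y_hat_k (1 - y_hat_k) o_h.
y_hat = sigmoid(o @ W_hk)
analytic = -2.0 * np.outer(o, (y - y_hat) * y_hat * (1.0 - y_hat))

# Finite-difference estimate of each dE/dw_hk.
eps = 1e-6
numeric = np.zeros_like(W_hk)
for h in range(W_hk.shape[0]):
    for k in range(W_hk.shape[1]):
        Wp, Wm = W_hk.copy(), W_hk.copy()
        Wp[h, k] += eps
        Wm[h, k] -= eps
        numeric[h, k] = (E(Wp) - E(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))   # True: the derived formula checks out
```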
Arbitrary Acyclic ANNs
• ANNs are not limited to two layers of σ units
– Can be any depth
• ANNs are not limited to fully connected layered networks
– Can be any directed acyclic graph (DAG)
Convergence and Local Minima
• Learning is an iterative process following gradient descent
• The error surface can have several local minima
• Gradient descent might not reach the global minimum
• Reaching a local minimum with respect to one weight does not necessarily mean the search is trapped; it may still escape through other dimensions of the weight space
Avoiding Local Minima
• Add Momentum term
\Delta w_{hk}(n) = \eta \, \delta_k o_h + \alpha \, \Delta w_{hk}(n-1), \quad 0 \le \alpha < 1
– Keeps moving somewhat in the previous direction
– If the direction is unchanged, it takes increasingly larger steps
• Use stochastic gradient descent
– Different error surface for each training instance
• Train multiple networks with different initial w
– Select based on validation data or use ensemble
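A generic Python sketch of the momentum update, written in terms of the gradient dE/dw rather than δ_k o_h (the function and argument names are illustrative):

```python
def momentum_step(w, grad, prev_delta, eta=0.05, alpha=0.9):
    """Gradient step with momentum: delta_w(n) = -eta * grad + alpha * delta_w(n-1).

    grad is dE/dw, so -eta * grad is the plain gradient-descent step; alpha (0 <= alpha < 1)
    controls how much of the previous step is carried forward.
    """
    delta = -eta * grad + alpha * prev_delta   # new delta_w(n)
    return w + delta, delta
```

If the gradient keeps the same sign from step to step, the step size grows toward eta * |grad| / (1 - alpha), which is the "increasingly larger step" noted above.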
Representational Power
• Boolean functions can be represented by an appropriately sized single-hidden-layer network
• Bounded continuous functions can be approximated by a suitably sized single-HL network
• Arbitrary functions can be approximated by a two-HL network
• Caveat emptor: an optimal weight vector cannot necessarily be reached from a given initial w
Hidden Layer Representations
Inputs        Hidden values (rounded)                  Outputs
10000000  →   .89 (1)   .04 (0)   .08 (0)   →   10000000
01000000  →   .15 (0)   .99 (1)   .99 (1)   →   01000000
00100000  →   .01 (0)   .97 (1)   .27 (0)   →   00100000
00010000  →   .99 (1)   .97 (1)   .71 (1)   →   00010000
00001000  →   .03 (0)   .05 (0)   .02 (0)   →   00001000
00000100  →   .01 (0)   .11 (0)   .88 (1)   →   00000100
00000010  →   .80 (1)   .01 (0)   .98 (1)   →   00000010
00000001  →   .60 (1)   .94 (1)   .01 (0)   →   00000001

From Machine Learning (Mitchell, 1997)
Sum of Squared Errors

[Plot: sum of squared errors for each output unit vs. training iterations (0 to 2500). From Machine Learning (Mitchell, 1997)]
Hidden Unit Encoding for x = 01000000

[Plot: the three hidden unit values for input 01000000 vs. training iterations (0 to 2500). From Machine Learning (Mitchell, 1997)]
Weights to One Hidden Unit

[Plot: the weights from the inputs to one hidden unit vs. training iterations (0 to 2500). From Machine Learning (Mitchell, 1997)]
Generalization, Overfitting & Stopping

[Plot: training set error and test set error vs. training iterations (0 to 20000). From Machine Learning (Mitchell, 1997)]
Avoiding Overfitting
• Over many epochs, the weights grow large to fit idiosyncrasies of (and noise in) the training data
– Add a weight-decay term
• Use a validation data set to assess performance
– Roll back to the best weights
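One common form of weight decay penalizes the squared weights in the error. A minimal sketch, assuming the objective E(w) + λ‖w‖² with an illustrative coefficient λ (grad_E is the gradient of the data error at w):

```python
def weight_decay_step(w, grad_E, eta=0.05, lam=1e-4):
    """One gradient step on E(w) + lam * ||w||^2; the extra 2 * lam * w term shrinks large weights."""
    return w - eta * (grad_E + 2.0 * lam * w)
```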
Generalization, Overfitting & Stopping
[Plot: training set error and test set error vs. training iterations (0 to 6000). From Machine Learning (Mitchell, 1997)]
Advanced Topics in ANNs
• Alternative error minimization procedures
– Line search
– Conjugate gradient descent
– Etc.
Recurrent Networks

[Figure-only slide, from Machine Learning (Mitchell, 1997)]
Advanced Topics in ANNs
• Dynamically modifying network structure
– Start with no hidden units and add hidden units until error is at an acceptable level
– Remove connections with near-zero weights
– Optimal brain damage: remove connections if small changes in the weight have little effect on the error
Summary
• Real-valued, discrete-valued, and vector-valued functions
• Continuous or discrete-valued attributes
• One-HL networks can represent any Boolean function and approximate any bounded continuous function
• Two-HL networks can approximate any function
• Hidden units create new features not present in the input space
• Avoid overfitting, as with any ML algorithm
• Backpropagation is only one of many training algorithms