CS 478 - Machine Learning

Backpropagation
The Plague of Linear Separability
• The good news is:
– Learn-Perceptron is guaranteed to converge to a correct
assignment of weights if such an assignment exists
• The bad news is:
– Learn-Perceptron can only learn classes that are linearly
separable (i.e., separable by a single hyperplane)
• The really bad news is:
– There is a very large number of interesting problems that
are not linearly separable (e.g., XOR)
Linear Separability
• Let d be the number of inputs. Then:

lim_{d→∞} (number of LS functions) = ∞

lim_{d→∞} (number of LS functions / total number of functions) = 0

• Hence, there are too many functions that escape the algorithm
Historical Perspective
• The result on linear separability (Minsky & Papert,
1969) virtually put an end to connectionist research
• The solution was obvious: Since multi-layer networks
could in principle handle arbitrary problems, one
only needed to design a learning algorithm for them
• This proved to be a major challenge
• AI would have to wait over 15 years for a general
purpose NN learning algorithm to be devised by
Rumelhart in 1986
Towards a Solution
• Main problem:
– Learn-Perceptron implements discrete model of error (i.e.,
identifies the existence of error and adapts to it)
• First thing to do:
– Allow nodes to have real-valued activations (amount of
error = difference between computed and target output)
• Second thing to do:
– Design learning rule that adjusts weights based on error
• Last thing to do:
– Use the learning rule to implement a multi-layer algorithm
Real-valued Activation
• Replace the threshold unit (step function) with a linear unit, where:

out = Σ_{i=1}^{n} w_i x_i

• The error is no longer discrete:

E = (O_t − O_p) ∈ ℝ
Training Error
• We define the training error of a hypothesis, or
weight vector, by:
E(w) = (1/2) Σ_{d∈D} (t_d − o_d)²
which we will seek to minimize
The Delta Rule
• Implements gradient descent (i.e., steepest descent) on the
error surface:
Δw_i = η Σ_{d∈D} (t_d − o_d) x_id

Note how the x_id multiplicative factor implicitly
identifies “active” lines as in Learn-Perceptron
Gradient-descent Learning (batch)
• Initialize weights to small random values
• Repeat
– Initialize each Δwi to 0
– For each training example <x,t>
• Compute output o for x
• For each weight wi
– Δwi ← Δwi + η(t − o)xi
– For each weight wi
• wi ← wi + Δwi
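The batch loop above can be sketched in code; the learning rate eta and the toy data are illustrative choices, not values from the slides.

```python
# A sketch of one batch epoch for a linear unit: accumulate the
# weight changes over the whole training set, then apply them once.

def gd_batch_epoch(w, data, eta=0.1):
    delta = [0.0] * len(w)                        # initialize each Delta_w_i to 0
    for x, t in data:                             # for each training example <x, t>
        o = sum(wi * xi for wi, xi in zip(w, x))  # compute output o for x
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi        # Delta_w_i <- Delta_w_i + eta*(t - o)*x_i
    return [wi + di for wi, di in zip(w, delta)]  # w_i <- w_i + Delta_w_i

# Repeated epochs drive the weights toward the least-squares solution:
w = [0.0, 0.0]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
for _ in range(200):
    w = gd_batch_epoch(w, data)
```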
Gradient-descent Learning (incremental)
• Initialize weights to small random values
• Repeat
– For each training example <x,t>
• Compute output o for x
• For each weight wi
– wi ← wi + η(t − o)xi
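For contrast with the batch rule, the incremental (stochastic) version updates the weights immediately after each example; again, eta and the data are illustrative.

```python
# A sketch of one incremental (stochastic) epoch for a linear unit.

def gd_incremental_epoch(w, data, eta=0.1):
    w = list(w)
    for x, t in data:                             # for each training example <x, t>
        o = sum(wi * xi for wi, xi in zip(w, x))  # compute output o for x
        for i, xi in enumerate(x):
            w[i] += eta * (t - o) * xi            # w_i <- w_i + eta*(t - o)*x_i
    return w

w = [0.0, 0.0]
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
for _ in range(200):
    w = gd_incremental_epoch(w, data)
```

For small eta the incremental version closely approximates true gradient descent while often making faster progress per pass.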
Discussion
• Gradient-descent learning (with linear units) requires
more than one pass through the training set
• The good news is:
– Convergence is guaranteed if the problem is solvable
• The bad news is:
– Still produces only linear functions
– Even when used in a multi-layer context
• Needs to be further generalized!
Non-linear Activation
• Introduce non-linearity with a sigmoid function:
net = Σ_{i=1}^{n} w_i x_i

out = 1 / (1 + e^(−net))
1. Differentiable (required for gradient-descent)
2. Most unstable in the middle
Sigmoid Function
σ(x) = 1 / (1 + e^(−x))

dσ(x)/dx = σ(x)·(1 − σ(x))
• Derivative reaches maximum when output is most unstable.
Hence, change will be largest when output is most uncertain.
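A minimal sketch of the sigmoid and its derivative; note how the derivative can be computed from the output itself, which is what backpropagation exploits.

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # d sigma / dx = sigma(x) * (1 - sigma(x)): available from the
    # output alone, with no extra exp() call
    s = sigmoid(x)
    return s * (1.0 - s)
```

The derivative peaks at x = 0, where the output is 0.5 (most uncertain), and its maximum value is 0.25.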
Multi-layer Feed-forward NN
[Figure: feed-forward network with input nodes i, hidden nodes j, and output nodes k]
Backpropagation (i)
• Repeat
– Present a training instance
– Compute error k of output units
– For each hidden layer
• Compute error j using error from next layer
– Update all weights: wij  wij + wij
where wij = Oij
• Until (E < CriticalError)
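The loop above can be sketched for a single hidden layer; the network sizes, eta, stopping threshold, and toy data below are illustrative choices, not values from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_train(data, n_in, n_hid, n_out,
                   eta=0.5, critical_error=0.05, max_epochs=20000):
    rng = random.Random(0)
    # Initialize weights to small random values
    W_ih = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
    W_ho = [[rng.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]
    E = float("inf")
    for _ in range(max_epochs):
        E = 0.0
        for x, t in data:                         # present a training instance
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_ih]
            o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W_ho]
            # delta_k for output units; f'(net) = O(1 - O) for the sigmoid
            d_out = [(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, o)]
            # delta_j for hidden units, using the errors from the next layer
            d_hid = [hj * (1 - hj) * sum(d_out[k] * W_ho[k][j] for k in range(n_out))
                     for j, hj in enumerate(h)]
            # Update all weights: w_ij <- w_ij + eta * delta_j * O_i
            for k in range(n_out):
                for j in range(n_hid):
                    W_ho[k][j] += eta * d_out[k] * h[j]
            for j in range(n_hid):
                for i in range(n_in):
                    W_ih[j][i] += eta * d_hid[j] * x[i]
            E += 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))
        if E < critical_error:                    # until (E < CriticalError)
            break
    return W_ih, W_ho, E
```

On a simple two-example task the loop stops once the total error falls below the threshold.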
Error Computation
Oj = f(netj), where f is a sigmoid; f′(netj) is maximum when Oj is most unstable

[Figure: computation of errors flowing back from output nodes k through hidden nodes j to input nodes i]

Output: δk = (Tk − Ok)·f′(netk)
Weight change: ΔWjk = C·Oj·δk
Hidden: δj = [Σk δk·Wjk]·f′(netj)
Weight change: ΔWij = C·Oi·δj

Computation of errors
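A numeric sketch of the error formulas, using the identity f′(net) = O(1 − O) for the sigmoid; C and the sample values below are illustrative.

```python
def output_delta(T_k, O_k):
    # delta_k = (T_k - O_k) * f'(net_k), with f'(net_k) = O_k(1 - O_k)
    return (T_k - O_k) * O_k * (1 - O_k)

def hidden_delta(O_j, deltas_k, W_jk):
    # delta_j = [sum_k delta_k * W_jk] * f'(net_j)
    return sum(d * w for d, w in zip(deltas_k, W_jk)) * O_j * (1 - O_j)

C = 0.5                           # learning rate (illustrative)
O_j, O_k, T_k = 0.6, 0.53, 1.0    # sample activations and target
d_k = output_delta(T_k, O_k)      # error term for the output unit
dW_jk = C * O_j * d_k             # Delta W_jk = C * O_j * delta_k
```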
Example (I)
• Consider a simple network composed of:
– 3 inputs: a, b, c
– 1 hidden node: h
– 2 outputs: q, r
• Assume =0.5, all weights are initialized to 0.2 and
weight updates are incremental
• Consider the training set:
– 101–01
– 011–11
• 4 iterations over the training set
Example (II)
[Table: spreadsheet trace of the incremental updates over 4 passes through the training set, with columns W a-h, W b-h, W c-h, h, W h-q, W h-r, Out q, Out r, δq, ΔW h-q, δr, ΔW h-r, δh, ΔW a-h, ΔW b-h. All weights start at 0.2; presenting the first instance (1,0,1) → (0,1) gives h = 0.598688, Out q = Out r = 0.529899, δq = −0.132, δr = 0.117105, δh = −0.00072, after which the weights are updated, and so on for each instance.]
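The first incremental update of the example can be reproduced directly from the formulas; the values here follow from η = 0.5, all weights 0.2, and the first instance (1,0,1) → (0,1).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

eta = 0.5
w_ah = w_bh = w_ch = w_hq = w_hr = 0.2

# Forward pass for inputs (a, b, c) = (1, 0, 1)
net_h = w_ah * 1 + w_bh * 0 + w_ch * 1
O_h = sigmoid(net_h)                    # hidden activation, ~0.598688
O_q = sigmoid(w_hq * O_h)               # ~0.529899
O_r = sigmoid(w_hr * O_h)               # ~0.529899

# Error terms for targets (q, r) = (0, 1)
d_q = (0 - O_q) * O_q * (1 - O_q)               # ~ -0.132
d_r = (1 - O_r) * O_r * (1 - O_r)               # ~  0.117105
d_h = O_h * (1 - O_h) * (d_q * w_hq + d_r * w_hr)   # ~ -0.00072

# Sample weight changes
dW_hq = eta * O_h * d_q                 # ~ -0.03951
dW_ah = eta * 1 * d_h                   # ~ -0.00036
```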
Dealing with Local Minima
• No guarantee of convergence to the global
minimum
– Use a momentum term:
• Keep moving through small local minima or along flat regions

Δwij(n) = η·Oi·δj + α·Δwij(n−1)
– Use the incremental/stochastic version of the algorithm
– Train multiple networks with different starting weights
• Select best on hold-out validation set
• Combine outputs (e.g., weighted average)
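A minimal sketch of the momentum update; eta, alpha, and the sample values are illustrative.

```python
def momentum_step(O_i, delta_j, prev_delta_w, eta=0.1, alpha=0.9):
    # Delta w_ij(n) = eta * O_i * delta_j + alpha * Delta w_ij(n-1)
    return eta * O_i * delta_j + alpha * prev_delta_w

# On a flat region (delta_j = 0) the step keeps a fraction alpha of
# the previous step, letting the search coast through the plateau:
step = momentum_step(1.0, 0.0, 0.5)
```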
Discussion
• 3-layer backpropagation neural networks are
Universal Function Approximators
• Backpropagation is the standard
– Extensions have been proposed to automatically set the
various parameters (i.e., number of hidden layers, number
of nodes per layer, learning rate)
– Dynamic models have been proposed (e.g., ASOCS)
• Other neural network models exist: Kohonen maps,
Hopfield networks, Boltzmann machines, etc.