CSE 473/573 Computer Vision and Image Processing (CVIP)

Ifeoma Nwogu
Lecture 31 – Neural networks
Schedule
• Last class
– Markov random fields (MRFs) in vision
• Today
– Introduction to neural networks
• Readings for this week:
– Deep learning tutorial on neural networks (optional)
• See online schedule for link
Brief history
• Late 1800s – neural networks appeared as an analogy to biological systems
• 1960s and 70s – simple neural networks appeared
– Fell out of favor because the perceptron was not effective by itself, and there were no good algorithms for multilayer nets
• 1986 – the backpropagation algorithm appeared
– Neural networks had a resurgence in popularity
• 2006 – contrastive divergence for training deep neural nets
Basic idea
• Modeled on biological systems
– This association has become much looser
• Learn to classify objects
– Can do more than this
• Learn from given training data of the form
(x1...xn, output)
Properties
• Inputs are flexible
– any real values
– Highly correlated or independent
• Target function may be discrete-valued, real-valued, or vectors of discrete or real values
– Outputs are real numbers between 0 and 1
• Resistant to errors in the training data
• Long training time
• Fast evaluation
• The function produced can be difficult for humans to interpret
Diagram
[Figure: a single perceptron unit. Inputs x0 (bias), x1, x2, …, xn with weights w0, w1, w2, …, wn feed a summation y = Σi wi xi, followed by a threshold: output 1 if y > 0, −1 otherwise.]
Perceptrons
• Basic unit in a neural network
• Linear separator
• Parts
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs, y = w0x0 + w1x1 + ... + wnxn
– A threshold function, i.e., 1 if y > 0, −1 if y ≤ 0 (a minimal sketch follows below)
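A minimal Python sketch of the unit just described (the function and variable names are illustrative, not from the lecture):

```python
# A single perceptron unit: weighted sum of inputs plus bias, then a hard threshold.
# Illustrative sketch only; names (perceptron_output, w0) are not from the slides.

def perceptron_output(x, w, w0):
    """x: inputs x1..xn, w: weights w1..wn, w0: weight on the constant bias input x0 = 1."""
    y = w0 + sum(wi * xi for wi, xi in zip(w, x))  # y = w0*x0 + w1*x1 + ... + wn*xn
    return 1 if y > 0 else -1                      # threshold function
```

For example, perceptron_output([1.0, -2.0], [0.5, 0.3], 0.1) computes y = 0.1 + 0.5 − 0.6 = 0.0 and returns −1.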
Linear Separator
[Figure: two scatter plots in the (x1, x2) plane. Left ("This..."): positive and negative points separated by a single line. Right ("But not this (XOR)"): the XOR pattern, which no single line can separate.]
Perceptron training rule
1. Initialize the weights (either to zero or to a small random
value)
2. Pick a learning rate η (a number between 0.0 and 1.0)
3. Until a stopping condition is satisfied (e.g. the weights don't change):
4. For each training pattern (x, t):
– compute the output activation y = f(w · x)
– if y = t, don't change the weights
– if y ≠ t, update the weights:
– wj(new) = wj(old) + 2 η t xj; OR
– wj(new) = wj(old) + η (t − y) xj, for all j
• y = Σj wj xj − b; the output y of the perceptron is +1 if y > 0, and 0 otherwise.
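A hedged Python illustration of this training rule (using the (t − y) form, with targets in {−1, +1}); all names are made up for the example:

```python
import random

def train_perceptron(data, eta=0.1, max_epochs=100):
    """data: list of (x, t) pairs, x a list of features, target t in {-1, +1}."""
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]   # small random initial weights
    b = 0.0                                               # bias weight w0 (input x0 = 1)
    for _ in range(max_epochs):
        changed = False
        for x, t in data:
            y = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if y != t:                                    # update only on misclassification
                for j in range(n):
                    w[j] += eta * (t - y) * x[j]          # wj(new) = wj(old) + eta*(t - y)*xj
                b += eta * (t - y)
                changed = True
        if not changed:                                   # stopping condition: weights unchanged
            break
    return w, b
```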
Perceptron training rule (2)
p1 and p2 are initially misclassified.
If we choose p1 to update the weights (left), the weight vector is moved slightly in the direction of p1.
If we choose p2 (right), the weight vector is moved slightly in the direction of −p2.
In either case, we get a new, improved boundary (blue dashed line).
In summary
Gradient Descent
• Perceptron training rule may not converge if
points are not linearly separable
• Gradient descent addresses this by changing the weights according to the total error over all training points, rather than the error for individual points
– If the data is not linearly separable, then it will
converge to the best fit
LMS Learning
LMS = Least Mean Square learning systems, more general than the
previous perceptron learning rule. The concept is to minimize the
total error, as measured over all training examples, D.
O is the raw output, as calculated by O = Σi wi Ii + θ

Error(LMS) = (1/2) Σ_D (T_D − O_D)²

e.g. if we have two patterns, i.e. D = 2, with T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5, then
E = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145

We want to minimize the LMS error (c is the learning rate).
[Figure: the error E plotted against a weight W, with gradient descent stepping from W(old) to W(new) down the error curve.]
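A small Python sketch of this error measure, reproducing the two-pattern example above (illustrative only):

```python
def lms_error(targets, outputs):
    """Error(LMS) = (1/2) * sum over patterns D of (T_D - O_D)^2."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# The example from the slide: T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5
print(lms_error([1, 0], [0.8, 0.5]))   # 0.5 * (0.04 + 0.25) = 0.145
```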
Gradient descent (1)–(4)
[Four figure slides presenting the gradient-descent weight update; see the derivation sketched below.]
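The figures themselves are not reproduced here; the following is the standard textbook derivation of the gradient-descent (delta) rule for a linear unit, which these slides are assumed to cover:

```latex
% Standard delta-rule derivation for a linear unit O_D = \sum_i w_i x_{D,i}
% (assumed to match the figure slides above).
\begin{align*}
E(\mathbf{w}) &= \tfrac{1}{2}\sum_{D}\bigl(T_D - O_D\bigr)^2 \\
\frac{\partial E}{\partial w_i}
  &= \sum_{D}\bigl(T_D - O_D\bigr)\,\frac{\partial}{\partial w_i}\bigl(T_D - O_D\bigr)
   = -\sum_{D}\bigl(T_D - O_D\bigr)\,x_{D,i} \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta \sum_{D}\bigl(T_D - O_D\bigr)\,x_{D,i}
\end{align*}
```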
So far….
• Perceptron training rule guaranteed to succeed if
– Training examples are linearly separable
– Sufficiently small learning rate
• Linear unit training rule uses gradient descent
– Guaranteed to converge to hypothesis with minimum
squared error
– Given sufficiently small learning rate
– Even when training data contains noise
– Even when training data is not separable by a hyperplane H
LMS vs. Limiting Threshold
• With the new sigmoidal function that is differentiable, we
can apply the delta rule toward learning.
• Perceptron Method
– Forces the output to 0 or 1, while LMS uses the net (continuous) output
– Guaranteed to separate the data if it is noise-free and linearly separable
• Otherwise it may not converge
• Gradient Descent Method:
– May oscillate and not converge
– May converge to wrong answer
– Will converge to some minimum even if the classes are not
linearly separable, unlike the earlier perceptron training method
Gradient Descent Issues
• Converging to a local minimum can be very slow
– The while loop may have to run many times
• May converge to a local minimum
• Stochastic Gradient Descent
– Update the weights after each training example rather
than all at once
– Takes less memory
– Can sometimes avoid local minima
– η must decrease with time in order for it to converge
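A hedged Python sketch of the difference (names are illustrative): batch gradient descent accumulates the update over all examples before changing the weights, while stochastic gradient descent updates after each example.

```python
def sgd_step(w, x, t, eta):
    """One stochastic update for a linear unit: w <- w + eta * (t - o) * x."""
    o = sum(wi * xi for wi, xi in zip(w, x))              # output for this single example
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

def batch_step(w, data, eta):
    """One batch update: sum (t - o) * x over all examples, then change w once."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        grad = [g + (t - o) * xi for g, xi in zip(grad, x)]
    return [wi + eta * g for wi, g in zip(w, grad)]
```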
Differentiable Threshold Unit
• Our old threshold function (1 if y > 0, 0
otherwise) is not differentiable
• Gradient descent requires that we need a
differentiable threshold unit in order to
continue
• One solution is the sigmoid unit
Sigmoid unit
[Figure: a sigmoid unit, which applies σ(y) = 1 / (1 + e^(−y)) to the weighted sum of its inputs.]
Graph of Sigmoid Function
[Figure: the S-shaped sigmoid curve, approaching 0 and 1 asymptotically.]
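A minimal Python version of the sigmoid and of the derivative property used later by backpropagation (illustrative sketch):

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y)): a differentiable, S-shaped squashing of the weighted sum."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d(sigma)/dy = sigma(y) * (1 - sigma(y)) -- the convenient form backpropagation exploits."""
    s = sigmoid(y)
    return s * (1.0 - s)
```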
Multi-layer Neural Networks
• A single perceptron can only learn linearly
separable functions
• We would like to have a network of perceptrons,
but how do we determine the error of the output
for an internal node?
• Solution: the backpropagation algorithm
Multilayer network of sigmoid units
Backpropagation Networks
• Attributed to Rumelhart and McClelland, mid-1980s
• To bypass the linear classification problem, we can
construct multilayer networks. Typically we have fully
connected, feedforward networks.
[Figure: a fully connected feedforward network. Input layer x1, x2, x3 plus a bias input fixed at 1, connected by weights Wi,j to hidden units H1, H2 (plus a bias), connected by weights Wj,k to output units O1, O2. Each hidden and output unit applies a sigmoid to its weighted sum:]

H_j(x) = 1 / (1 + e^(−Σi w_i,j I_i))
O_k(x) = 1 / (1 + e^(−Σj w_j,k H_j))
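A forward pass through such a network, written as a hedged Python sketch (the weight layout and names are illustrative):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, W_ij, W_jk):
    """x: input values (a trailing bias input of 1 is appended here).
    W_ij[j]: weights from the inputs to hidden unit j.
    W_jk[k]: weights from the hidden units to output unit k."""
    xb = list(x) + [1.0]                                                   # bias input
    H = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W_ij]   # hidden activations
    Hb = H + [1.0]                                                         # bias for output layer
    O = [sigmoid(sum(w * hj for w, hj in zip(row, Hb))) for row in W_jk]   # output activations
    return H, O
```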
Backpropagation - learning
Backpropagation Algorithm
• Initialize all weights to small random numbers
• Until stopping condition, do
– For each training input, do
1. Input the training example to the network and propagate computations to the output
2. Adjust the weights according to the delta rule, propagating the errors back; the weights will begin to converge to a point where the network learns to give the desired output
3. Repeat; stop when there are no errors, or when enough epochs are completed
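A high-level Python sketch of this loop; `forward` and `backward` are hypothetical helpers standing in for the propagation and delta-rule steps detailed on the next slide:

```python
import random

def train_backprop(data, forward, backward, n_weights, max_epochs=1000):
    """data: (x, t) pairs. forward(weights, x) -> network outputs;
    backward(weights, x, outputs, t) -> per-weight changes (delta rule)."""
    weights = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]  # small random init
    for epoch in range(max_epochs):
        for x, t in data:
            outputs = forward(weights, x)                 # 1. propagate the input to the output
            deltas = backward(weights, x, outputs, t)     # 2. propagate the errors back
            weights = [w + d for w, d in zip(weights, deltas)]
        # 3. repeat; in practice also stop early once the error is small enough
    return weights
```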
Backpropagation - Modifying Weights
We had computed:

Δw_d = c · x_d · (T_j − O_j) · f′(ActivationFunction)

Hence, with the sigmoid f(sum) = 1 / (1 + e^(−sum)), whose derivative is f(sum)(1 − f(sum)):

Δw_d = c · x_d · (T_j − O_j) · f(sum) · (1 − f(sum))

For an output unit d, f(sum) = O_d. For the output units, this is:

Δw_j,d = c · H_j · (T_d − O_d) · O_d (1 − O_d)

For the hidden units, this is:

Δw_i,j = c · H_j (1 − H_j) · x_i · Σ_d (T_d − O_d) · O_d (1 − O_d) · w_j,d

[Figure: x → (Wi,j) → H → (Wj,k) → O]
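These update rules can be written directly in Python; the following sketch assumes a single hidden layer with activations H and outputs O already computed by a forward pass (names are illustrative):

```python
def backprop_updates(x, H, O, T, W_jd, c):
    """Weight changes matching the formulas above.
    x: inputs, H: hidden activations, O: outputs, T: targets,
    W_jd[j][d]: hidden-to-output weights, c: learning rate."""
    # Output units: dW_jd = c * H_j * (T_d - O_d) * O_d * (1 - O_d)
    delta_out = [(T[d] - O[d]) * O[d] * (1 - O[d]) for d in range(len(O))]
    dW_jd = [[c * H[j] * delta_out[d] for d in range(len(O))] for j in range(len(H))]
    # Hidden units: dW_ij = c * H_j * (1 - H_j) * x_i * sum_d delta_out[d] * w_jd
    dW_ij = [[c * H[j] * (1 - H[j]) * x[i] *
              sum(delta_out[d] * W_jd[j][d] for d in range(len(O)))
              for j in range(len(H))]
             for i in range(len(x))]
    return dW_ij, dW_jd
```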
Properties of backpropagation
• Is very powerful – with enough hidden units, it can represent essentially any function.
• Has the standard problem of generalization vs. memorization.
With too many units, the network will tend to memorize the input
and not generalize well. Some schemes exist to “prune” the
neural network.
• Networks require extensive training, many parameters to fiddle
with. Can be extremely slow to train. May also fall into local
minima.
• Inherently parallel algorithm, ideal for multiprocessor hardware.
• Despite the cons, is a very powerful algorithm that has seen
widespread successful deployment.
Momentum
• Add a fraction 0 ≤ α < 1 of the previous update for a weight to the current update
• May allow the learner to avoid local minima
• May speed up convergence to global
minimum
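A one-line sketch of the momentum update in Python (illustrative names; Δw(t) = −η·gradient + α·Δw(t−1)):

```python
def momentum_update(w, grad, prev_dw, eta=0.3, alpha=0.3):
    """Returns (new weights, this step's update), keeping a fraction alpha of the previous update."""
    dw = [-eta * g + alpha * p for g, p in zip(grad, prev_dw)]
    return [wi + d for wi, d in zip(w, dw)], dw
```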
When to Stop Learning
• Learn until error on the training set is below some
threshold
– Bad idea! Can result in overfitting
• If you match the training examples too well, your performance
on the real problems may suffer
• Instead, learn while tracking performance on some validation data
– Data from your training set that is not trained on, but
instead used to check the function
– Stop when the performance seems to be decreasing on
this, while saving the best network seen so far.
– There may still be local minima, so watch out!
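A hedged Python sketch of this early-stopping scheme; `train_one_epoch` and `validation_error` are hypothetical helpers, not part of any particular library:

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs):
    """Train for up to max_epochs, always keeping the best network seen on the validation data."""
    best_err, best_weights = float("inf"), None
    for epoch in range(max_epochs):
        weights = train_one_epoch()                 # one pass over the training data
        err = validation_error(weights)             # data held out from training
        if err < best_err:                          # save the best network seen so far
            best_err, best_weights = err, weights
    return best_weights
```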
Example: Face Recognition
• From Machine Learning by Tom M. Mitchell
• Input: 30 × 32 pixel images of people with the
following properties:
– Wearing eyeglasses or not
– Facial expression: happy, sad, angry, neutral
– Direction in which they are looking: left, right,
up, straight ahead
• Output: Determine which category it fits
into for one of these properties
Input Encoding
• Each pixel is an input
– 30*32 = 960 inputs
• The value of the pixel (0 – 255) is linearly
mapped onto the range of reals between 0
and 1
Output Encoding
• Could use a single output node with the
classifications assigned to 4 values (e.g. 0.2, 0.4,
0.6, and 0.8)
• Instead, use 4 output nodes (one for each value)
– 1-of-N output encoding
– Provides more degrees of freedom to the network
• Use values of 0.1 and 0.9 instead of 0 and 1
– The sigmoid function can never reach 0 or 1!
• Example: (0.9, 0.1, 0.1, 0.1) = left, (0.1, 0.9, 0.1,
0.1) = right, etc.
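A small Python illustration of this 1-of-N encoding with 0.1/0.9 targets (the category names are from the slide; the function name is made up):

```python
DIRECTIONS = ["left", "right", "up", "straight"]

def encode_target(direction):
    """1-of-N encoding using 0.1/0.9 instead of 0/1, since the sigmoid never reaches 0 or 1."""
    return [0.9 if d == direction else 0.1 for d in DIRECTIONS]

print(encode_target("left"))    # [0.9, 0.1, 0.1, 0.1]
print(encode_target("right"))   # [0.1, 0.9, 0.1, 0.1]
```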
Network structure
[Figure: the network used – 960 inputs (x1 … x960), 3 hidden units, and 4 output units.]
Other Parameters
• learning rate: η = 0.3
• momentum: α = 0.3
• Used full gradient descent (as opposed to
stochastic)
• Weights in the output units were initialized to small random values, but input weights were initialized to 0
– Yields better visualizations
• Result: 90% accuracy on test set!
Try it out!
• Get the code from
http://www.cs.cmu.edu/~tom/mlbook.html
– Go to the Software and Data page, then follow the
“Neural network learning to recognize faces” link
– Follow the documentation
Slide Credits
• Greg Grudic – University of Colorado
• Kenrick Mock – UAA
• Joel Anderson - UIUC
Next 2 classes
• Deep networks and vision applications
• Readings for next lecture:
– None (come and listen…)
• Readings for today:
– Deep learning tutorial on neural networks (optional)
Questions