Machine Learning

Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
  → must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw
sensor input)
• Output is discrete or real valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: the ALVINN network: a 30x32 sensor input retina feeding 4 hidden units, with output units for Sharp Left, Straight Ahead, and Sharp Right]
Perceptron
[Figure: perceptron with inputs x_0 = 1, x_1, ..., x_n, weights w_0, w_1, ..., w_n, a summation unit computing Σ_{i=0}^{n} w_i x_i, and a threshold output that is 1 if Σ_{i=0}^{n} w_i x_i > 0 and -1 otherwise]

o(x_1, ..., x_n) =  1 if w_0 + w_1 x_1 + ... + w_n x_n > 0
                   -1 otherwise

Sometimes we will use simpler vector notation:

o(x) =  1 if w · x > 0
       -1 otherwise
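As a concrete illustration (a minimal sketch, not from the slides; the weights and input are made up), the threshold output is just a sign test on the weighted sum:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w . x > 0 (x includes the constant input x_0 = 1), else -1."""
    return 1 if np.dot(w, x) > 0 else -1

# Made-up weights and input; w_0 plays the role of a (negated) threshold.
w = np.array([-0.3, 0.5, 0.5])
x = np.array([1.0, 1.0, 0.0])    # x_0 = 1, x_1 = 1, x_2 = 0
print(perceptron_output(w, x))   # 1, since -0.3 + 0.5 > 0
```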
Decision Surface of Perceptron
[Figure: two decision surfaces in the (x1, x2) plane: a linearly separable set of points split by a line, and a set that no single line can separate]
Represents some useful functions
• What weights represent g(x1,x2) = AND(x1,x2)? (one choice is given just below)
But some functions not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
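One possible answer to the AND question above, assuming 0/1 inputs (this weight choice is just one of many that work):

  w0 = -1.5, w1 = w2 = 1, so w0 + w1 x1 + w2 x2 > 0 ⟺ x1 + x2 > 1.5 ⟺ x1 = x2 = 1, i.e. o(x1, x2) = AND(x1, x2).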
Perceptron Training Rule
w_i ← w_i + Δw_i
where
  Δw_i = η (t - o) x_i
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate
Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small
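A minimal sketch of this rule in Python (the bias handled through x_0 = 1, the ±1 targets, and the toy OR data are my own assumptions):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    X has a leading column of 1s (x_0); t holds targets in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o = 1 if np.dot(w, x_d) > 0 else -1
            w += eta * (t_d - o) * x_d        # no change when o already equals t_d
    return w

# Linearly separable toy data: OR of two boolean inputs, with x_0 = 1 prepended.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1])
print(train_perceptron(X, t))
```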
Gradient Descent
To understand, consider a simple linear unit, where
  o = w_0 + w_1 x_1 + ... + w_n x_n
Idea: learn the w_i's that minimize the squared error
  E[w] = (1/2) Σ_{d∈D} (t_d - o_d)^2
where D is the set of training examples.
Gradient Descent

[Figure: the error surface E as a function of the weights, with gradient descent stepping in the direction of steepest descent]
Gradient Descent

Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

Training rule: Δw = -η ∇E[w]

i.e., Δw_i = -η ∂E/∂w_i
Gradient Descent

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_d (t_d - o_d)^2 ]
        = (1/2) Σ_d ∂/∂w_i (t_d - o_d)^2
        = (1/2) Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d)
∂E/∂w_i = Σ_d (t_d - o_d) (-x_{i,d})
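The last line can be sanity-checked numerically; the sketch below (made-up data, illustrative names like grad_E) compares the analytic gradient against a central-difference estimate of E:

```python
import numpy as np

def E(w, X, t):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2 for a linear unit o_d = w . x_d."""
    return 0.5 * np.sum((t - X @ w) ** 2)

def grad_E(w, X, t):
    """Analytic gradient from the slide: dE/dw_i = sum_d (t_d - o_d) * (-x_{i,d})."""
    return -(X.T @ (t - X @ w))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 made-up examples with 3 inputs each
t = rng.normal(size=5)
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[i], X, t) -
                     E(w - eps * np.eye(3)[i], X, t)) / (2 * eps) for i in range(3)])
print(np.allclose(numeric, grad_E(w, X, t)))   # True
```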
Gradient Descent
GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do
  - Initialize each Δw_i to zero
  - For each ⟨x, t⟩ in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight w_i, do
        Δw_i ← Δw_i + η (t - o) x_i
  - For each linear unit weight w_i, do
        w_i ← w_i + Δw_i
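A direct Python transcription of this procedure (a sketch: the fixed number of epochs standing in for the termination condition, and the toy data, are my own choices):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Batch gradient descent for a linear unit.
    training_examples: list of (x, t) pairs, where x already contains x_0 = 1."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, size=n)    # small random initial weights
    for _ in range(epochs):                        # stands in for "until termination"
        delta_w = np.zeros(n)                      # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                       # linear unit output
            delta_w += eta * (t - o) * x           # Delta w_i <- Delta w_i + eta (t - o) x_i
        w += delta_w                               # w_i <- w_i + Delta w_i
    return w

# Toy data with t = 2*x1 - 1 (made up); each x is [x0 = 1, x1].
examples = [(np.array([1.0, x1]), 2.0 * x1 - 1.0) for x1 in (0.0, 0.5, 1.0, 1.5)]
print(gradient_descent(examples))                  # converges toward [-1, 2]
```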
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with
minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied:
  - For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

E_D[w] = (1/2) Σ_{d∈D} (t_d - o_d)^2
E_d[w] = (1/2) (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
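The incremental version changes only where the update is applied; a sketch under the same assumptions as the batch transcription above:

```python
import numpy as np

def stochastic_gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Incremental (stochastic) gradient descent for a linear unit:
    the weights are updated after every single example, using E_d alone."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, size=n)
    for _ in range(epochs):
        for x, t in training_examples:
            o = np.dot(w, x)
            w += eta * (t - o) * x      # immediate update, no accumulation step
    return w

examples = [(np.array([1.0, x1]), 2.0 * x1 - 1.0) for x1 in (0.0, 0.5, 1.0, 1.5)]
print(stochastic_gradient_descent(examples))   # also converges toward [-1, 2]
```

With a small η both versions end up near the same weights, which is the point of the closing remark on this slide.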
Multilayer Networks of Sigmoid Units
[Figure: a multilayer network with inputs x1, x2, hidden units h1, h2, and output o; the training data regions d0-d3 are shown both in the original (x1, x2) space and in the hidden-unit (h1, h2) space]
Multilayer Decision Space

[Figure: the decision regions formed by the multilayer network]
Sigmoid Unit
[Figure: sigmoid unit with inputs x_0 = 1, x_1, ..., x_n and weights w_0, w_1, ..., w_n]

net = Σ_{i=0}^{n} w_i x_i
o = σ(net) = 1 / (1 + e^(-net))

σ(x) is the sigmoid function:
  σ(x) = 1 / (1 + e^(-x))

Nice property: dσ(x)/dx = σ(x)(1 - σ(x))

We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
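A minimal sketch of the sigmoid unit and the "nice property" (the function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit_output(w, x):
    """o = sigma(net), where net = sum_i w_i x_i and x includes x_0 = 1."""
    return sigmoid(np.dot(w, x))

# Check the "nice property" d sigma/dx = sigma(x)(1 - sigma(x)) at an arbitrary point.
x = 0.7
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.isclose(numeric, sigmoid(x) * (1 - sigmoid(x))))   # True
```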
The Sigmoid Function
[Figure: plot of the sigmoid σ(x) = 1 / (1 + e^(-x)); the output rises smoothly from 0 to 1 as the net input goes from -6 to 6]
Sort of a rounded step function
Unlike step function, can take derivative (makes learning
possible)
Error Gradient for a Sigmoid Unit
∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d - o_d)^2 ]
        = (1/2) Σ_d ∂/∂w_i (t_d - o_d)^2
        = (1/2) Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_d (t_d - o_d) (-∂o_d/∂w_i)
        = -Σ_d (t_d - o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)

But we know:
  ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)
  ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:
  ∂E/∂w_i = -Σ_{d∈D} (t_d - o_d) o_d (1 - o_d) x_{i,d}
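This gradient gives the training rule for a single sigmoid unit; below is a sketch (the toy AND data and the name train_sigmoid_unit are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sigmoid_unit(X, t, eta=0.5, epochs=5000):
    """Gradient descent on E = 1/2 sum_d (t_d - o_d)^2 for one sigmoid unit:
    w_i <- w_i + eta * sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d}."""
    w = np.random.uniform(-0.05, 0.05, size=X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)
        w += eta * (X.T @ ((t - o) * o * (1 - o)))
    return w

# Toy data (targets in [0, 1] since the sigmoid output lies there); x_0 = 1 prepended.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])
print(sigmoid(X @ train_sigmoid_unit(X, t)))   # small for the first three rows, larger for the last
```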
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k
       δ_k ← o_k (1 - o_k)(t_k - o_k)
  3. For each hidden unit h
       δ_h ← o_h (1 - o_h) Σ_{k∈outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}
     where Δw_{i,j} = η δ_j x_{i,j}
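A compact sketch of this procedure for one hidden layer of sigmoid units with stochastic updates (all names, the layer sizes, and the 8x3x8 example call are illustrative, not code from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop(X, T, n_hidden=3, eta=0.3, epochs=5000, seed=0):
    """Train a network with one hidden layer of sigmoid units by backpropagation.
    X: (n_examples, n_inputs); T: (n_examples, n_outputs); no bias columns needed."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-0.05, 0.05, size=(X.shape[1] + 1, n_hidden))  # +1 row: bias weight
    W_o = rng.uniform(-0.05, 0.05, size=(n_hidden + 1, T.shape[1]))
    for _ in range(epochs):                       # "until satisfied"
        for x, t in zip(X, T):                    # for each training example
            # 1. Compute the outputs (bias input 1 prepended at each layer).
            h_in = np.concatenate(([1.0], x))
            h = sigmoid(h_in @ W_h)
            o_in = np.concatenate(([1.0], h))
            o = sigmoid(o_in @ W_o)
            # 2. Output unit errors: delta_k = o_k (1 - o_k) (t_k - o_k).
            delta_o = o * (1 - o) * (t - o)
            # 3. Hidden unit errors: delta_h = o_h (1 - o_h) sum_k w_hk delta_k.
            delta_h = h * (1 - h) * (W_o[1:] @ delta_o)
            # 4. Weight updates: w_ij <- w_ij + eta * delta_j * x_ij.
            W_o += eta * np.outer(o_in, delta_o)
            W_h += eta * np.outer(h_in, delta_h)
    return W_h, W_o

# Example call: the 8x3x8 identity-mapping task discussed on the later slides.
X = np.eye(8)
W_h, W_o = backprop(X, X, n_hidden=3, eta=0.3, epochs=5000)
```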
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
– In practice, often works well (can run multiple times)
• Often include weight momentum α:
    Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n-1)
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
– Using network after training is fast
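As a sketch, the momentum term only changes how each weight update is applied; the helper name momentum_update and the value α = 0.9 are my own, not from the slides:

```python
def momentum_update(W, grad_term, prev_delta, eta=0.3, alpha=0.9):
    """Delta w(n) = eta * delta_j * x_{i,j} + alpha * Delta w(n-1), applied in place.
    grad_term is the usual backprop term (e.g., np.outer(x, delta) from the sketch above)."""
    delta = eta * grad_term + alpha * prev_delta
    W += delta
    return delta    # keep this as Delta w(n-1) for the next update
```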
Learning Hidden Layer Representations
[Figure: a network with 8 inputs, a small hidden layer, and 8 outputs, trained to reproduce its input at its output]

Input      →  Output
10000000   →  10000000
01000000   →  01000000
00100000   →  00100000
00010000   →  00010000
00001000   →  00001000
00000100   →  00000100
00000010   →  00000010
00000001   →  00000001
Learning Hidden Layer Representations
Input      →  Hidden values   →  Output
10000000   →  .89 .04 .08     →  10000000
01000000   →  .01 .11 .88     →  01000000
00100000   →  .01 .97 .27     →  00100000
00010000   →  .99 .97 .71     →  00010000
00001000   →  .03 .05 .02     →  00001000
00000100   →  .22 .99 .99     →  00000100
00000010   →  .80 .01 .98     →  00000010
00000001   →  .60 .94 .01     →  00000001
Output Unit Error during Training
[Figure: sum of squared errors for each output unit versus training iterations (0 to 2500)]
Hidden Unit Encoding
[Figure: hidden unit encoding for one input versus training iterations (0 to 2500)]
Input to Hidden Weights
[Figure: weights from the inputs to one hidden unit versus training iterations (0 to 2500)]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster
convergence
• Can train multiple networks and get different results (using
different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by network
with a single hidden layer
• But that might require an exponential number of hidden units (exponential in the
number of inputs)
Continuous functions:
• Every bounded continuous function can be approximated
with arbitrarily small error by a network with one hidden
layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by
a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Figure: error versus weight updates (example 1): training set and validation set error plotted against the number of weight updates, 0 to 20000]
Overfitting in ANNs
[Figure: error versus weight updates (example 2): training set and validation set error plotted against the number of weight updates, 0 to 6000]
Neural Nets for Face Recognition
[Figure: network with 30x32 image inputs and output units for head pose (left, straight, right, up); typical input images shown]

90% accurate at learning head pose and recognizing 1 of 20 faces
Learned Network Weights
[Figure: learned weights from the 30x32 inputs, for the network with output units left, straight, right, up; typical input images shown alongside]
Alternative Error Functions
Penalize large weights:
  E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_{kd} - o_{kd})^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:
  E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} - o_{kd})^2 + μ Σ_{j∈inputs} (∂t_{kd}/∂x_d^j - ∂o_{kd}/∂x_d^j)^2 ]

Tie together weights:
• e.g., in phoneme recognition
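A sketch of the first variant only: the weight penalty γ Σ_{i,j} w_{ji}^2 adds an extra -2γw term to each gradient step (the function name and the value of γ are mine):

```python
def weight_decay_step(W, grad_term, eta=0.3, gamma=1e-4):
    """One gradient step on E(w) = (squared error term) + gamma * sum_{i,j} w_ji^2.
    grad_term is the usual backprop update direction for the squared error part;
    the penalty contributes -2 * gamma * W, shrinking large weights each step."""
    W += eta * (grad_term - 2.0 * gamma * W)
    return W
```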
Recurrent Networks
[Figure: three networks computing y(t+1) from x(t): a feedforward network; a recurrent network whose output feeds back as an input; and the same recurrent network unfolded in time over x(t), x(t-1), x(t-2) and y(t), y(t-1)]