Introduction to Machine Learning

Lecture Slides for
INTRODUCTION TO
Machine Learning
ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 11:
Multilayer Perceptrons
Neural Networks






• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: $10^{10}$
• Large connectivity: $10^5$
• Parallel processing
• Distributed computation/memory
• Robust to noise, failures
Understanding the Brain
• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: from hardware to theory
• Parallel processing: SIMD vs MIMD
• Neural net: SIMD with modifiable local memory
• Learning: update by training/experience
Perceptron
$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$
$$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T, \qquad \mathbf{x} = [1, x_1, \ldots, x_d]^T$$
(Rosenblatt, 1962)
What a Perceptron Does
• Regression: $y = wx + w_0$
• Classification: $y = \mathbf{1}(wx + w_0 > 0)$

[Figure: the perceptron as a line fit for regression and as a thresholded line for classification; inputs $x$ with bias unit $x_0 = +1$, weights $w$, $w_0$, output $y$]

Soft (sigmoid) output:
$$y = \mathrm{sigmoid}(o) = \frac{1}{1 + \exp\left(-\mathbf{w}^T \mathbf{x}\right)}$$
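To make the three outputs concrete, here is a minimal sketch (ours; the weights and inputs are arbitrary example values, not from the book):

```python
import numpy as np

def perceptron_outputs(w, x):
    """w = [w0, w1, ..., wd]; x = [1, x1, ..., xd] (augmented with the bias input)."""
    o = w @ x                               # o = w^T x
    y_regression = o                        # linear output for regression
    y_class = 1.0 if o > 0 else 0.0         # thresholded output for classification
    y_sigmoid = 1.0 / (1.0 + np.exp(-o))    # smooth (sigmoid) output
    return y_regression, y_class, y_sigmoid

w = np.array([-0.5, 1.0, 2.0])   # w0, w1, w2
x = np.array([1.0, 0.3, 0.4])    # bias input x0 = 1, then x1, x2
print(perceptron_outputs(w, x))
```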
K Outputs
Regression:
$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}, \qquad \mathbf{y} = \mathbf{W}\mathbf{x}$$

Classification:
$$o_i = \mathbf{w}_i^T \mathbf{x}, \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k}$$
Choose $C_i$ if $y_i = \max_k y_k$
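A short sketch of the K-output case (ours; the rows of W hold the $\mathbf{w}_i$ and x is augmented with $x_0 = 1$; subtracting the maximum is only for numerical stability):

```python
import numpy as np

def k_outputs(W, x):
    """W has one row w_i per output; x is augmented with x_0 = 1."""
    o = W @ x                          # o_i = w_i^T x  (also y = Wx for regression)
    y = np.exp(o - o.max())            # softmax numerator
    y /= y.sum()                       # y_i = exp(o_i) / sum_k exp(o_k)
    return o, y, int(np.argmax(y))     # regression outputs, class posteriors, chosen class

W = np.array([[0.1, 1.0, -0.5],
              [0.0, -0.3, 0.8],
              [0.2, 0.5, 0.5]])
x = np.array([1.0, 0.7, 0.2])
print(k_outputs(W, x))
```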
Training
• Online (instances seen one by one) vs. batch (whole sample) learning:
  - No need to store the whole sample
  - Problem may change in time
  - Wear and degradation in system components
• Stochastic gradient descent: update after a single pattern
• Generic update rule (LMS rule):
$$\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t$$
$$\text{Update} = \text{LearningFactor} \cdot \left(\text{DesiredOutput} - \text{ActualOutput}\right) \cdot \text{Input}$$
Training a Perceptron: Regression
• Regression (linear output):
$$E^t\!\left(\mathbf{w} \mid \mathbf{x}^t, r^t\right) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left(r^t - \mathbf{w}^T\mathbf{x}^t\right)^2$$
$$\Delta w_j^t = \eta\left(r^t - y^t\right) x_j^t$$
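Putting this update rule into a complete online training loop; the synthetic data, learning factor, and names below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D regression data: r = 2x + 1 plus noise
X = rng.uniform(-1, 1, size=(100, 1))
r = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(100)

# Augment with the bias input x0 = 1
Xa = np.hstack([np.ones((100, 1)), X])

w = rng.normal(scale=0.01, size=2)   # small random initial weights
eta = 0.1                            # learning factor

for epoch in range(50):
    for t in rng.permutation(len(Xa)):   # one pattern at a time (online)
        y = w @ Xa[t]                    # y^t = w^T x^t
        w += eta * (r[t] - y) * Xa[t]    # Δw_j = η (r^t - y^t) x_j^t

print(w)   # should end up close to [1, 2]
```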
Classification
• Single sigmoid output:
$$y^t = \mathrm{sigmoid}\!\left(\mathbf{w}^T\mathbf{x}^t\right)$$
$$E^t\!\left(\mathbf{w} \mid \mathbf{x}^t, r^t\right) = -r^t \log y^t - \left(1 - r^t\right)\log\left(1 - y^t\right)$$
$$\Delta w_j^t = \eta\left(r^t - y^t\right) x_j^t$$

• $K > 2$ softmax outputs:
$$y_i^t = \frac{\exp\left(\mathbf{w}_i^T\mathbf{x}^t\right)}{\sum_k \exp\left(\mathbf{w}_k^T\mathbf{x}^t\right)}$$
$$E^t\!\left(\left\{\mathbf{w}_i\right\}_i \mid \mathbf{x}^t, \mathbf{r}^t\right) = -\sum_i r_i^t \log y_i^t$$
$$\Delta w_{ij}^t = \eta\left(r_i^t - y_i^t\right) x_j^t$$
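Note that the cross-entropy gradient gives exactly the same update form as the regression case. A small sketch (ours) of one online step for the single sigmoid output:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_step(w, x, r, eta=0.1):
    """One stochastic update for a single sigmoid output with cross-entropy error.
    x is augmented with x_0 = 1; r is 0 or 1."""
    y = sigmoid(w @ x)            # y^t = sigmoid(w^T x^t)
    return w + eta * (r - y) * x  # Δw_j = η (r^t - y^t) x_j^t

w = np.zeros(3)
x = np.array([1.0, 0.5, -1.2])
print(logistic_step(w, x, r=1))
```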
Sigmoid Unit
[Figure: a sigmoid unit with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, and bias input $x_0 = 1$ with weight $w_0$]
$$net = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$$
$\sigma(x)$ is the sigmoid function $1/(1 + e^{-x})$, with derivative $d\sigma(x)/dx = \sigma(x)\left(1 - \sigma(x)\right)$.
Derive gradient descent rules to train:
• one sigmoid unit:
$$\frac{\partial E}{\partial w_i} = -\sum_{d}\left(t_d - o_d\right) o_d \left(1 - o_d\right) x_{i,d}$$
• multilayer networks of sigmoid units: backpropagation
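A quick numerical check (ours) of the derivative identity, comparing it against a central finite difference:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
analytic = sigma(x) * (1 - sigma(x))
numeric = (sigma(x + 1e-6) - sigma(x - 1e-6)) / 2e-6   # central finite difference
print(analytic, numeric)   # the two values agree to many decimal places
```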
Learning Boolean AND
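Because AND is linearly separable, a single perceptron can represent it. One set of weights that works (our illustrative choice, not necessarily the values found by training) is $w_1 = w_2 = 1$, $w_0 = -1.5$:

```python
def and_perceptron(x1, x2):
    w0, w1, w2 = -1.5, 1.0, 1.0          # one weight choice realizing AND
    return 1 if w1 * x1 + w2 * x2 + w0 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))   # prints 1 only for (1, 1)
```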
XOR
• No $w_0, w_1, w_2$ satisfy:
$$w_0 \le 0$$
$$w_2 + w_0 > 0$$
$$w_1 + w_0 > 0$$
$$w_1 + w_2 + w_0 \le 0$$
(Minsky and Papert, 1969)
Multi-Layer Networks
[Figure: a feedforward network with an input layer, a hidden layer, and an output layer]
Multilayer Perceptrons
$$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
$$z_h = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T\mathbf{x}\right) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
(Rumelhart et al., 1986)
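These two equations give the forward pass directly. A minimal NumPy sketch (ours; the matrix shapes and names are our own convention, with the bias weights kept in column 0):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(W, V, x):
    """W: H x (d+1) hidden-layer weights w_hj (column 0 = w_h0).
       V: K x (H+1) output weights v_ih (column 0 = v_i0).
       x: input vector of length d."""
    xa = np.concatenate(([1.0], x))      # augment input with x_0 = 1
    z = sigmoid(W @ xa)                  # z_h = sigmoid(w_h^T x)
    za = np.concatenate(([1.0], z))      # augment hidden layer with z_0 = 1
    y = V @ za                           # y_i = sum_h v_ih z_h + v_i0
    return z, y

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(3, 3))   # d = 2 inputs, H = 3 hidden units
V = rng.normal(scale=0.5, size=(1, 4))   # K = 1 output
print(mlp_forward(W, V, np.array([0.2, -0.7])))
```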
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
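This decomposition is exactly what two hidden units can compute: one threshold unit for x1 AND NOT x2, another for NOT x1 AND x2, and an output unit that ORs them. The weights below are one possible hand-picked choice (ours), not the trained values in the figure:

```python
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)      # fires only for (1, 0):  x1 AND NOT x2
    z2 = step(-x1 + x2 - 0.5)     # fires only for (0, 1):  NOT x1 AND x2
    return step(z1 + z2 - 0.5)    # OR of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))   # 0, 1, 1, 0
```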
Backpropagation
$$y_i = \mathbf{v}_i^T\mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
$$z_h = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T\mathbf{x}\right) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$
Backpropagation Algorithm
• Initialize each $w_i$ to some small random value
• Until the termination condition is met, do:
  - For each training example $\langle(x_1, \ldots, x_n), t\rangle$, do:
    - Input the instance $(x_1, \ldots, x_n)$ to the network and compute the network outputs $o_k$
    - For each output unit $k$:  $\delta_k = o_k(1 - o_k)(t_k - o_k)$
    - For each hidden unit $h$:  $\delta_h = o_h(1 - o_h)\sum_k w_{h,k}\,\delta_k$
    - For each network weight $w_{i,j}$, do:  $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\,\delta_j\,x_{i,j}$
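The delta rules above translate almost line for line into NumPy. The sketch below is an illustration, not the book's code: the one-hidden-layer architecture, the XOR training data, and all names (train_backprop, n_hidden, eta) are our own choices; the stochastic updates follow the δ_k, δ_h, Δw formulas of the algorithm.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_backprop(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for one hidden layer of sigmoid units.
    X: (N, d) inputs, T: (N, K) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = T.shape[1]
    # Small random initial weights; column 0 of each matrix is the bias weight.
    W = rng.uniform(-0.05, 0.05, size=(n_hidden, d + 1))   # input  -> hidden
    V = rng.uniform(-0.05, 0.05, size=(K, n_hidden + 1))   # hidden -> output

    for _ in range(epochs):
        for t in rng.permutation(N):                # one training example at a time
            xa = np.concatenate(([1.0], X[t]))      # augmented input
            z = sigmoid(W @ xa)                     # hidden unit outputs
            za = np.concatenate(([1.0], z))
            o = sigmoid(V @ za)                     # network outputs o_k

            delta_o = o * (1 - o) * (T[t] - o)              # δ_k = o_k(1-o_k)(t_k-o_k)
            delta_h = z * (1 - z) * (V[:, 1:].T @ delta_o)  # δ_h = z_h(1-z_h) Σ_k v_kh δ_k

            V += eta * np.outer(delta_o, za)        # Δv_kh = η δ_k z_h (incl. bias)
            W += eta * np.outer(delta_h, xa)        # Δw_hj = η δ_h x_j (incl. bias)
    return W, V

# Example: learning XOR (may need more epochs or another seed to converge fully).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W, V = train_backprop(X, T)
for x in X:
    z = sigmoid(W @ np.concatenate(([1.0], x)))
    print(x, sigmoid(V @ np.concatenate(([1.0], z))))
```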
Backpropagation
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  - in practice it often works well (can be invoked multiple times with different initial weights)
• A weight momentum term is often included:
$$\Delta w_{i,j}(n) = \eta\,\delta_j\,x_{i,j} + \alpha\,\Delta w_{i,j}(n-1)$$
• Minimizes error over the training examples
  - will it generalize well to unseen instances (over-fitting)?
• Training can be slow: typically 1,000–10,000 iterations (use Levenberg–Marquardt instead of gradient descent)
• Using the network after training is fast
Regression

$$E\left(\mathbf{W}, \mathbf{v} \mid \mathcal{X}\right) = \frac{1}{2}\sum_t\left(r^t - y^t\right)^2$$
$$y^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h = \mathrm{sigmoid}\!\left(\mathbf{w}_h^T\mathbf{x}\right)$$

Forward: the input $\mathbf{x}$ is propagated to the hidden units $z_h$ and then to the output $y$. Backward: the error is propagated back, updating first $v_h$ and then $w_{hj}$:

$$\Delta v_h = \eta\sum_t\left(r^t - y^t\right) z_h^t$$
$$\Delta w_{hj} = -\eta\,\frac{\partial E}{\partial w_{hj}} = -\eta\sum_t\frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}} = \eta\sum_t\left(r^t - y^t\right) v_h\, z_h^t\left(1 - z_h^t\right) x_j^t$$
Regression with Multiple Outputs
$$E\left(\mathbf{W}, \mathbf{V} \mid \mathcal{X}\right) = \frac{1}{2}\sum_t\sum_i\left(r_i^t - y_i^t\right)^2$$
$$y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$
$$\Delta v_{ih} = \eta\sum_t\left(r_i^t - y_i^t\right) z_h^t$$
$$\Delta w_{hj} = \eta\sum_t\left[\sum_i\left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t\left(1 - z_h^t\right) x_j^t$$

[Figure: network with inputs $x_j$, hidden units $z_h$, and outputs $y_i$; first-layer weights $w_{hj}$, second-layer weights $v_{ih}$]
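The batch sums above vectorize directly. A sketch (ours, not the book's code) that computes the updates for the whole sample at once, with the bias weights kept in column 0 of each matrix:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_updates(W, V, X, R, eta=0.1):
    """W: (H, d+1) first-layer weights, V: (K, H+1) output weights,
    X: (N, d) inputs, R: (N, K) regression targets. Returns (dW, dV)."""
    N = len(X)
    Xa = np.hstack([np.ones((N, 1)), X])       # augment inputs with x_0 = 1
    Z = sigmoid(Xa @ W.T)                      # (N, H): hidden values z_h^t
    Za = np.hstack([np.ones((N, 1)), Z])
    Y = Za @ V.T                               # (N, K): linear outputs y_i^t

    E = R - Y                                  # r_i^t - y_i^t
    dV = eta * E.T @ Za                        # Δv_ih = η Σ_t (r_i^t - y_i^t) z_h^t
    back = (E @ V[:, 1:]) * Z * (1 - Z)        # Σ_i (r_i^t - y_i^t) v_ih, times z_h(1 - z_h)
    dW = eta * back.T @ Xa                     # Δw_hj = η Σ_t [...] x_j^t
    return dW, dV                              # added to W and V once per epoch
```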
8-3-8 Binary Encoder–Decoder

8 inputs, 3 hidden units, 8 outputs

Hidden values learned for the eight input patterns:

.89  .04  .08
.01  .11  .88
.01  .97  .27
.99  .97  .71
.03  .05  .02
.22  .99  .99
.80  .01  .98
.60  .94  .01
Sum of Squared Errors for the Output Units
Hidden Unit Encoding for Input 01000000
Convergence of Backprop
Gradient descent converges to some local minimum:
• perhaps not the global minimum
• add momentum
• use stochastic gradient descent
• train multiple nets with different initial weights
Nature of convergence:
• initialize weights near zero
• therefore, initial networks are near-linear
• increasingly non-linear functions become possible as training progresses
Expressive Capabilities of ANN
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• but it might require exponentially many (in the number of inputs) hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
[Figure: hidden-unit pre-activations $w_h x + w_{h0}$, their sigmoid outputs $z_h$, and the weighted terms $v_h z_h$ that are summed to form the output]
Two-Class Discrimination
• One sigmoid output $y^t$ for $P(C_1 \mid \mathbf{x}^t)$, with $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$
$$y^t = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$
$$E\left(\mathbf{W}, \mathbf{v} \mid \mathcal{X}\right) = -\sum_t \left[r^t \log y^t + \left(1 - r^t\right)\log\left(1 - y^t\right)\right]$$
$$\Delta v_h = \eta\sum_t\left(r^t - y^t\right) z_h^t$$
$$\Delta w_{hj} = \eta\sum_t\left(r^t - y^t\right) v_h\, z_h^t\left(1 - z_h^t\right) x_j^t$$
K>2 Classes
$$y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \equiv \hat{P}\left(C_i \mid \mathbf{x}^t\right), \qquad o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$
$$E\left(\mathbf{W}, \mathbf{v} \mid \mathcal{X}\right) = -\sum_t\sum_i r_i^t \log y_i^t$$
$$\Delta v_{ih} = \eta\sum_t\left(r_i^t - y_i^t\right) z_h^t$$
$$\Delta w_{hj} = \eta\sum_t\left[\sum_i\left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t\left(1 - z_h^t\right) x_j^t$$
Multiple Hidden Layers
• An MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple hidden layers may lead to simpler networks:
$$z_{1h} = \mathrm{sigmoid}\!\left(\mathbf{w}_{1h}^T\mathbf{x}\right) = \mathrm{sigmoid}\!\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1$$
$$z_{2l} = \mathrm{sigmoid}\!\left(\mathbf{w}_{2l}^T\mathbf{z}_1\right) = \mathrm{sigmoid}\!\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2$$
$$y = \mathbf{v}^T\mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0$$
Improving Convergence
• Momentum:
$$\Delta w_i^t = -\eta\,\frac{\partial E^t}{\partial w_i} + \alpha\,\Delta w_i^{t-1}$$
• Adaptive learning rate:
$$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
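Both ideas are small additions to the basic update. A minimal sketch (ours; the constants eta, alpha, a, and b are arbitrary illustrative values):

```python
# Momentum: keep the previous update and add a fraction alpha of it.
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    dw = -eta * grad + alpha * prev_dw     # Δw^t = -η ∂E/∂w + α Δw^{t-1}
    return w + dw, dw

# Adaptive learning rate: add a if the error decreased, shrink by a factor b otherwise.
def adapt_eta(eta, err_new, err_old, a=0.01, b=0.5):
    return eta + a if err_new < err_old else eta - b * eta
```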
Overfitting/Overtraining
Number of weights: $H(d + 1) + (H + 1)K$
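For example (values ours): with $d = 10$ inputs, $H = 5$ hidden units, and $K = 3$ outputs, the network has $5(10 + 1) + (5 + 1)\cdot 3 = 55 + 18 = 73$ weights to be estimated from the training sample.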
Structured MLP
(Le Cun et al., 1989)
Weight Sharing
Hints
(Abu-Mostafa, 1995)
• Invariance to translation, rotation, size
• Virtual examples
• Augmented error: $E' = E + \lambda_h E_h$
  If $\mathbf{x}'$ and $\mathbf{x}$ are the "same": $E_h = \left[g\left(\mathbf{x} \mid \theta\right) - g\left(\mathbf{x}' \mid \theta\right)\right]^2$
• Approximation hint:
$$E_h = \begin{cases} 0 & \text{if } g\left(\mathbf{x} \mid \theta\right) \in \left[a_x, b_x\right] \\ \left(g\left(\mathbf{x} \mid \theta\right) - a_x\right)^2 & \text{if } g\left(\mathbf{x} \mid \theta\right) < a_x \\ \left(g\left(\mathbf{x} \mid \theta\right) - b_x\right)^2 & \text{if } g\left(\mathbf{x} \mid \theta\right) > b_x \end{cases}$$
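As an illustration (ours; g, X_virtual, and lam_h are generic placeholders, not the book's notation), the invariance hint can be implemented by penalizing output differences between each input and a transformed copy of it:

```python
import numpy as np

def augmented_error(g, X, R, X_virtual, lam_h=0.1):
    """g: the fitted function, X/R: training inputs and targets,
    X_virtual: transformed copies of X that should give the same output (the hint)."""
    E = 0.5 * np.sum((R - g(X)) ** 2)          # usual error on the sample
    E_h = np.sum((g(X) - g(X_virtual)) ** 2)   # hint penalty: outputs should match
    return E + lam_h * E_h                     # E' = E + λ_h E_h
```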
Tuning the Network Size
• Destructive: weight decay
$$\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i} - \lambda w_i, \qquad E' = E + \frac{\lambda}{2}\sum_i w_i^2$$
• Constructive: growing networks (Ash, 1989; Fahlman and Lebiere, 1989)
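A one-line sketch (ours) of the weight-decay update, which adds a pull toward zero to the ordinary gradient step:

```python
def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    # Δw_i = -η ∂E/∂w_i - λ w_i  (gradient step plus a shrinkage toward zero)
    return w - eta * grad - lam * w
```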
Bayesian Learning
• Consider the weights $w_i$ as random variables with prior $p(w_i)$:
$$p\left(\mathbf{w} \mid \mathcal{X}\right) = \frac{p\left(\mathcal{X} \mid \mathbf{w}\right) p\left(\mathbf{w}\right)}{p\left(\mathcal{X}\right)}$$
$$\hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p\left(\mathbf{w} \mid \mathcal{X}\right)$$
$$\log p\left(\mathbf{w} \mid \mathcal{X}\right) = \log p\left(\mathcal{X} \mid \mathbf{w}\right) + \log p\left(\mathbf{w}\right) + C$$
$$p\left(\mathbf{w}\right) = \prod_i p\left(w_i\right) \quad\text{where}\quad p\left(w_i\right) = c\,\exp\!\left[-\frac{w_i^2}{2\left(1/2\lambda\right)}\right]$$
$$E' = E + \lambda\,\|\mathbf{w}\|^2$$
• Weight decay, ridge regression, regularization: cost = data misfit + $\lambda \cdot$ complexity
Dimensionality Reduction
Learning Time
• Applications:
  - Sequence recognition: speech recognition
  - Sequence reproduction: time-series prediction
  - Sequence association
• Network architectures:
  - Time-delay networks (Waibel et al., 1989)
  - Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time