
Artificial Intelligence
Statistical learning methods
Chapter 20, AIMA
(only ANNs & SVMs)
Artificial neural networks
The brain is a pretty intelligent system. Can we "copy" it?
There are approx. 10^11 neurons in the brain.
There are approx. 23×10^9 neurons in the male cortex (females have about 15% fewer).
The simple model
• The McCulloch-Pitts model (1943):

    y = g(w_0 + w_1·x_1 + w_2·x_2 + w_3·x_3)

[Figure: a neuron with inputs weighted by w_1, w_2, w_3, and a plot of neuron firing rate vs. depolarization. Image from Neuroscience: Exploring the Brain by Bear, Connors, and Paradiso.]
Transfer functions g(z)
• The Heaviside step function: g(z) = 1 if z ≥ 0, else 0.
• The logistic function: g(z) = 1 / (1 + exp(−z)).
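A minimal sketch of these two transfer functions in Python (assuming NumPy; not part of the slides):

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 for z >= 0, else 0."""
    return np.where(z >= 0, 1.0, 0.0)

def logistic(z):
    """Logistic (sigmoid) function: smooth, differentiable squashing to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```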
The simple perceptron
With the {−1, +1} representation:

    y(x) = sgn(w^T x) =  +1 if w^T x ≥ 0
                         −1 if w^T x < 0

where w^T x = w_0 + w_1 x_1 + w_2 x_2 + ...

Traditionally (early '60s) trained with perceptron learning.
Perceptron learning
Desired output:

    f(n) = +1 if x(n) belongs to class A
           −1 if x(n) belongs to class B

Repeat until no more errors are made:
1. Pick a random example [x(n), f(n)].
2. If the classification is correct, i.e. if y(x(n)) = f(n), then do nothing.
3. If the classification is wrong, then apply the following update to the parameters
   (h, the learning rate, is a small positive number):

    w_i ← w_i + h·f(n)·x_i(n)
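A minimal sketch of this learning rule in Python (assuming NumPy; the function name and the convention of a constant bias input x_0 = 1 are illustrative, mirroring w^T x = w_0 + w_1 x_1 + ...):

```python
import numpy as np

def perceptron_train(X, f, w, h=0.3, max_epochs=100):
    """Perceptron learning: repeat until no errors are made.

    X : (N, D) array of inputs, f : (N,) targets in {-1, +1},
    w : (D+1,) initial weights, w[0] being the bias weight (input x_0 = 1),
    h : learning rate (small positive number).
    """
    # Prepend the constant bias input x_0 = 1 to every example.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    for _ in range(max_epochs):
        errors = 0
        for n in np.random.permutation(len(Xb)):   # pick examples in random order
            y = 1 if Xb[n] @ w >= 0 else -1        # y(x) = sgn(w^T x)
            if y != f[n]:                          # wrong classification:
                w = w + h * f[n] * Xb[n]           # w_i <- w_i + h * f(n) * x_i(n)
                errors += 1
        if errors == 0:                            # a full error-free pass: done
            break
    return w
```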
Example: Perceptron learning
The AND function:

    x1  x2 |  f
     0   0 | −1
     0   1 | −1
     1   0 | −1
     1   1 | +1

Initial values: h = 0.3, w = (−0.5, 1, 1)^T
[Plot: the four training points and the current decision boundary in the (x1, x2) plane]
Example: Perceptron learning
The AND function, w = (−0.5, 1, 1)^T.
The highlighted example is correctly classified, so no action is taken.
Example: Perceptron learning
The AND function, w = (−0.5, 1, 1)^T.
The example x = (0, 1), f = −1 is incorrectly classified, so a learning action is taken:

    w0 ← w0 − h·1 = −0.8
    w1 ← w1 − h·0 = 1
    w2 ← w2 − h·1 = 0.7
Example: Perceptron learning
The AND function. After the learning action

    w0 ← w0 − h·1 = −0.8
    w1 ← w1 − h·0 = 1
    w2 ← w2 − h·1 = 0.7

the weights are w = (−0.8, 1, 0.7)^T.
Example: Perceptron learning
The AND function, w = (−0.8, 1, 0.7)^T.
The highlighted example is correctly classified, so no action is taken.
Example: Perceptron learning
The AND function, w = (−0.8, 1, 0.7)^T.
The example x = (1, 0), f = −1 is incorrectly classified, so a learning action is taken:

    w0 ← w0 − h·1 = −1.1
    w1 ← w1 − h·1 = 0.7
    w2 ← w2 − h·0 = 0.7
Example: Perceptron learning
The AND function. After the learning action

    w0 ← w0 − h·1 = −1.1
    w1 ← w1 − h·1 = 0.7
    w2 ← w2 − h·0 = 0.7

the weights are w = (−1.1, 0.7, 0.7)^T.
Example: Perceptron learning
The AND function, w = (−1.1, 0.7, 0.7)^T.
Final solution: all four examples are now correctly classified.
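Using the perceptron_train sketch above on the AND data (illustrative; since examples are picked in random order, the weights found may differ from the slides' trajectory, which ends at w = (−1.1, 0.7, 0.7)^T):

```python
import numpy as np

# The AND training set from the slides, in the {-1, +1} representation.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
f = np.array([-1, -1, -1, +1])

w = perceptron_train(X, f, w=np.array([-0.5, 1.0, 1.0]), h=0.3)
print(w)  # a separating weight vector; exact values depend on the example order
```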
Perceptron learning
• Perceptron learning is guaranteed to find a solution in finite time, if a solution exists.
• Perceptron learning cannot be generalized to more complex networks.
• Better to use gradient descent, based on formulating an error function with differentiable transfer functions:

    E(W) = Σ_{n=1}^{N} [ f(n) − y(W, n) ]²
Gradient search
"Go downhill" on the error surface E(W):

    ΔW = −h ∇_W E(W)
    W(k+1) = W(k) + ΔW(k)

The "learning rate" (h) is set heuristically.
[Plot: error surface E(W) with a step from W(k) to W(k+1)]
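A minimal numerical sketch of this update rule, assuming a simple quadratic error chosen purely for illustration:

```python
import numpy as np

def gradient_descent(grad_E, W, h=0.1, steps=100):
    """Iterate W(k+1) = W(k) - h * grad E(W(k))."""
    for _ in range(steps):
        W = W - h * grad_E(W)
    return W

# Example: E(W) = ||W||^2 has gradient 2W and its minimum at W = 0.
W_min = gradient_descent(lambda W: 2 * W, W=np.array([1.0, -2.0]), h=0.1)
print(W_min)  # close to [0, 0]
```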
The Multilayer Perceptron (MLP)
• Combine several single-layer perceptrons.
• Each single-layer perceptron uses a sigmoid transfer function, e.g.

    σ(z) = tanh(z)    or    σ(z) = [1 + exp(−z)]^(−1)

• Can be trained using gradient descent.
[Figure: layered network with inputs x_k, hidden units h_i, h_j, and outputs y_l]
Example: One hidden layer
• Can approximate any continuous function.

    y_i(x) = q( v_i0 + Σ_{j=1}^{J} v_ij h_j(x) )
    h_j(x) = φ( w_j0 + Σ_{k=1}^{D} w_jk x_k )

q(z) = sigmoid or linear, φ(z) = sigmoid.
[Figure: inputs x_k, hidden units h_j, outputs y_i]
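A minimal sketch of this forward pass (assuming NumPy, a logistic φ, and a linear output q; the weight-matrix layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W, V):
    """One-hidden-layer MLP.

    W : (J, D+1) hidden-layer weights, column 0 holds the biases w_j0.
    V : (I, J+1) output-layer weights, column 0 holds the biases v_i0.
    """
    h = sigmoid(W[:, 0] + W[:, 1:] @ x)   # h_j(x) = phi(w_j0 + sum_k w_jk x_k)
    y = V[:, 0] + V[:, 1:] @ h            # y_i(x) = q(v_i0 + sum_j v_ij h_j(x)), q linear here
    return y, h
```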
Training: Backpropagation (Gradient descent)

    Δw(t) = −h ∇_w E(t)

    Δv_ij(t) = h Σ_{n=1}^{N} δ_i(n,t) h_j[x(n), t]
    Δw_jk(t) = h Σ_{n=1}^{N} δ_j(n,t) x_k(n)

where

    δ_j(n,t) = [ Σ_i δ_i(n,t) v_ij(t) ] φ'_j[x(n), t]
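A minimal sketch of one batch gradient step for the one-hidden-layer network above, assuming a squared-error cost and a linear output unit (so δ_i = f_i − y_i); this is a simplified instance of the slides' general update formulas:

```python
import numpy as np

def backprop_step(X, F, W, V, h=0.1):
    """One batch gradient-descent step for the one-hidden-layer MLP above.

    X : (N, D) inputs, F : (N, I) targets. Returns updated (W, V).
    """
    dW = np.zeros_like(W)
    dV = np.zeros_like(V)
    for x, f in zip(X, F):
        y, hid = mlp_forward(x, W, V)
        delta_out = f - y                                       # delta_i for a linear output
        delta_hid = (V[:, 1:].T @ delta_out) * hid * (1 - hid)  # delta_j = (sum_i delta_i v_ij) * phi'
        dV[:, 0] += delta_out
        dV[:, 1:] += np.outer(delta_out, hid)                   # dv_ij accumulates delta_i h_j
        dW[:, 0] += delta_hid
        dW[:, 1:] += np.outer(delta_hid, x)                     # dw_jk accumulates delta_j x_k
    return W + h * dW, V + h * dV
```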
Support vector machines
Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error.
Which line should we choose?
Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error.
Which line should we choose?
⇒ Choose the line with the largest margin.
The "large margin classifier"
[Figure: separating line with its margin indicated]
Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error.
Which line should we choose?
⇒ Choose the line with the largest margin.
The "large margin classifier"
[Figure: the examples lying on the margin are the "support vectors"]
Computing the margin
The plane separating the two classes is defined by

    w^T x = a

The dashed planes are given by

    w^T x = a + b
    w^T x = a − b

[Figure: separating plane, its normal vector w, and the margin]
Computing the margin
Divide by b:

    w^T x / b = a/b + 1
    w^T x / b = a/b − 1

Define new w = w/b and a = a/b:

    w^T x = a + 1
    w^T x = a − 1

We have defined a scale for w and b.
Computing the margin
We have, for a point x on one dashed plane and x + λw on the other:

    w^T x = a − 1
    w^T (x + λw) = a + 1
    ||λw|| = margin

which gives

    margin = 2 / ||w||
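A quick numerical check of the formula, with an illustrative weight vector:

```python
import numpy as np

w = np.array([3.0, 4.0])           # example weight vector (assumed for illustration)
margin = 2.0 / np.linalg.norm(w)   # margin = 2 / ||w||  ->  0.4 here
print(margin)
```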
Linear classifier on a linearly separable problem
Maximizing the margin is equivalent to minimizing ||w||, subject to the constraints

    w^T x(n) − a ≥ +1 for all n with y(n) = +1
    w^T x(n) − a ≤ −1 for all n with y(n) = −1

This is a quadratic programming problem; the constraints can be included with Lagrange multipliers.
Quadratic programming problem
Minimize the cost (Lagrangian)

    L_p = ½ ||w||² − Σ_{n=1}^{N} α_n [ y(n)(w^T x(n) − a) − 1 ]

The minimum of L_p occurs at the maximum of (the Wolfe dual)

    L_D = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} α_n α_m y(n) y(m) x(n)^T x(m)

Only scalar products appear in the cost. IMPORTANT!
Linear Support Vector Machine
Test phase, the predicted output:

    ŷ(x) = sgn( w^T x − a ) = sgn( Σ_{n∈S} α_n y(n) x(n)^T x − a )

where S is the set of support vectors, and a is determined e.g. by looking at one of the support vectors.
Still only scalar products in the expression.
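A hedged sketch of training and testing a linear SVM with scikit-learn (an off-the-shelf solver, not the slides' own code); a large C approximates the hard-margin case, and the fitted dual coefficients correspond to the α_n y(n) of the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, assumed for illustration only.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, +1])

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)          # the support vectors x(n), n in S
print(clf.dual_coef_)                # alpha_n * y(n) for the support vectors
print(clf.predict([[0.9, 0.9]]))     # sgn(w^T x - a) for a test point
```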
How to deal with the nonlinear case?
• Project the data into a high-dimensional space Z; there we know that it will be linearly separable (due to the VC dimension of the linear classifier):

    z(n) = φ[x(n)]

• We don't even have to know the projection...!

Scalar product kernel trick
If we can find a kernel such that

    K(x(n), x(m)) = φ(x(n))^T φ(x(m))

then we don't even have to know the mapping to solve the problem...
Valid kernels (Mercer's theorem)
Define the matrix

        | K[x(1),x(1)]   K[x(1),x(2)]   ...   K[x(1),x(N)] |
    K = | K[x(2),x(1)]   K[x(2),x(2)]   ...   K[x(2),x(N)] |
        |     ...             ...       ...        ...     |
        | K[x(N),x(1)]   K[x(N),x(2)]   ...   K[x(N),x(N)] |

If K is symmetric, K = K^T, and positive semi-definite, then K[x(i), x(j)] is a valid kernel.
Examples of kernels

    K[x(i), x(j)] = exp( −||x(i) − x(j)||² / 2σ² )

    K[x(i), x(j)] = ( x(i)^T x(j) )^d

The first is the Gaussian kernel; the second is the polynomial kernel. With d = 1 we have the linear SVM.
The linear SVM is often used with good success on high-dimensional data (e.g. text classification).
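A minimal sketch (assuming NumPy) that builds a Gaussian kernel matrix for a small data set and checks the Mercer conditions numerically; σ = 1 is an illustrative choice:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x(i) - x(j)||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.randn(5, 2)
K = gaussian_kernel_matrix(X)
print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) > -1e-10))   # positive semi-definite (up to round-off)
```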
Example: Robot color vision
(Competition 1999)
Classify the Lego pieces into red, blue, and yellow.
Classify white balls, black sideboard, and green carpet.
What the camera sees (RGB space)
[Figure: 3-D scatter of camera pixels in RGB space, with the yellow, red, and green clusters labelled]
Mapping RGB (3D) to rgb (2D)

    r = R / (R + G + B)
    g = G / (R + G + B)
    b = B / (R + G + B)
Lego in normalized rgb space
[Figure: the training data in the normalized (x1, x2) plane]
Input is 2-D: x ∈ X ⊂ R²
Output is 6-D: c ∈ C = {red, blue, yellow, green, black, white}
MLP classifier
2-3-1 MLP
Levenberg-Marquardt training
E_train = 0.21%
E_test = 0.24%
Training time (150 epochs): 51 seconds
SVM classifier
Gaussian kernel with g = 1000:

    K(x, y) = exp( −g (x − y)² )

E_train = 0.19%
E_test = 0.20%
Training time: 22 seconds
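A hedged sketch of how such a classifier could be set up with scikit-learn (gamma plays the role of g; the data below is a synthetic stand-in for the robot-vision features, so the reported error rates and timings will not be reproduced):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalized-rgb features and the six color classes
# (the actual competition data is not included in the slides).
rng = np.random.default_rng(0)
centers = rng.random((6, 2)) * 0.8                     # one chromaticity center per class
X = np.vstack([c + 0.02 * rng.standard_normal((50, 2)) for c in centers])
y = np.repeat(np.arange(6), 50)                        # class labels 0..5
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=1000.0)                  # K(x, y) = exp(-gamma * ||x - y||^2)
clf.fit(X_train, y_train)
print("test error:", 1.0 - clf.score(X_test, y_test))
```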