Lecture 7

Non-Bayes classifiers: linear discriminants, neural networks.
Discriminant functions (1)

Bayes classification rule:

$P(\omega_1 \mid x) - P(\omega_2 \mid x) \gtrless 0 \;\Rightarrow\; \text{decide } \omega_1 \text{ or } \omega_2$

Instead, we might try to find a function $f_{\omega_1,\omega_2}(x)$ and classify by the rule

$f_{\omega_1,\omega_2}(x) \gtrless 0 \;\Rightarrow\; \text{decide } \omega_1 \text{ or } \omega_2$

$f_{\omega_1,\omega_2}(x)$ is called a discriminant function.

$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$ is the decision surface.
Discriminant functions (2)

[Figure: samples of Class 1 and Class 2 on either side of a linear decision boundary.]

Linear discriminant function:

$f_{\omega_1,\omega_2}(x) = w^T x + w_0$

The decision surface is a hyperplane:

$w^T x + w_0 = 0$
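As a small illustration, the following sketch (assuming NumPy and a hypothetical weight vector and bias) classifies a sample by the sign of $w^T x + w_0$:

```python
import numpy as np

# Hypothetical weight vector and bias for a 2-D feature space.
w = np.array([1.0, -2.0])
w0 = 0.5

def classify(x, w, w0):
    """Decide class 1 if w^T x + w0 > 0, otherwise class 2."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([3.0, 0.5]), w, w0))   # 1: lies on the positive side
print(classify(np.array([-1.0, 2.0]), w, w0))  # 2: lies on the negative side
```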
Linear discriminant – perceptron cost function

Replace $w \leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$.

Thus the decision function is now $f_{\omega_1,\omega_2}(x) = w^T x$ and the decision surface is $w^T x = 0$.

Perceptron cost function:

$J(w) = \sum_{x} \delta_x \, w^T x$

where

$\delta_x = \begin{cases} -1, & \text{if } x \in \omega_1 \text{ and } w^T x < 0 \\ +1, & \text{if } x \in \omega_2 \text{ and } w^T x > 0 \\ 0, & \text{if } x \text{ is correctly classified} \end{cases}$
Linear discriminant – perceptron cost function

[Figure: misclassified samples of Class 1 and Class 2 relative to the decision surface.]

Perceptron cost function:

$J(w) = \sum_{x} \delta_x \, w^T x$

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$.
Otherwise $J(w) > 0$ and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use the gradient descent algorithm.
Linear discriminant – perceptron algorithm

Gradient descent:

$w(t+1) = w(t) - \rho_t \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(t)}$

At points where $J(w)$ is differentiable,

$\frac{\partial J(w)}{\partial w} = \sum_{\text{misclassified } x} \delta_x \, x$

Thus

$w(t+1) = w(t) - \rho_t \sum_{\text{misclassified } x} \delta_x \, x$

The perceptron algorithm converges when the classes are linearly separable, under some conditions on the learning rate $\rho_t$.
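A minimal NumPy sketch of this update rule follows; the data layout (one sample per row), the treatment of boundary points as misclassified, and the fixed learning rate are assumptions made for illustration.

```python
import numpy as np

def perceptron_train(X1, X2, rho=0.1, max_iter=1000):
    """Perceptron sketch: X1, X2 hold class-1 / class-2 samples, one per row."""
    # Append a constant 1 to each sample so w also carries the bias w0,
    # making the decision surface w^T x = 0.
    X = np.vstack([np.hstack([X1, np.ones((len(X1), 1))]),
                   np.hstack([X2, np.ones((len(X2), 1))])])
    # delta_x convention from the cost function: -1 for class 1, +1 for class 2
    # (only counted when the sample is misclassified).
    delta = np.concatenate([-np.ones(len(X1)), np.ones(len(X2))])

    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ w
        # Class-1 samples should have w^T x > 0, class-2 samples w^T x < 0;
        # points exactly on the boundary are treated as misclassified here.
        mis = ((delta == -1) & (scores <= 0)) | ((delta == +1) & (scores >= 0))
        if not mis.any():
            break  # J(w) = 0: the classes are separated
        # Gradient of J(w): sum of delta_x * x over the misclassified samples.
        grad = (delta[mis, None] * X[mis]).sum(axis=0)
        w = w - rho * grad
    return w
```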
Sum of error squares estimation

Let $y(x) = \pm 1$ denote the desired output: $+1$ for one class and $-1$ for the other.

We want to find a discriminant function $f_{\omega_1,\omega_2}(x) = w^T x$ whose output is similar to $y(x)$.

Use the sum of error squares as the similarity criterion:

$J(w) = \sum_{i=1}^{N} \left( y_i - w^T x_i \right)^2$

$\hat{w} = \arg\min_{w} J(w)$
Sum of error squares estimation

Minimize the mean square error:

$\frac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i \left( y_i - w^T x_i \right) = 0$

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

Thus

$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i$
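In matrix form this is the ordinary least-squares solution $\hat{w} = (X^T X)^{-1} X^T y$; a small NumPy sketch (with assumed shapes: samples as rows, labels $\pm 1$) is:

```python
import numpy as np

def least_squares_discriminant(X, y):
    """Solve (sum_i x_i x_i^T) w = sum_i x_i y_i, i.e. (X^T X) w = X^T y."""
    A = X.T @ X   # sum of x_i x_i^T
    b = X.T @ y   # sum of x_i y_i
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(A, b)

# Toy usage with hypothetical data:
# X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
# y = np.array([1.0, 1.0, -1.0, -1.0])
# w_hat = least_squares_discriminant(X, y)
```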
Neurons

Artificial neuron.

[Figure: inputs $x_1, x_2, \ldots, x_l$ with weights $w_1, w_2, \ldots, w_l$ and bias $w_0$, feeding a summation node $\Sigma$ followed by the threshold function $f$.]

The figure above represents an artificial neuron computing:

$y = f\!\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$
Artificial neuron.

Threshold functions f:

Step function:

$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$

Logistic function:

$f(x) = \dfrac{1}{1 + e^{-ax}}$

[Figure: plots of the step and logistic threshold functions, both ranging between 0 and 1.]
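A short sketch of both threshold functions and of a single neuron's output, assuming NumPy (the slope parameter a and the helper names are illustrative):

```python
import numpy as np

def step(x):
    """Step threshold: 1 for x >= 0, 0 otherwise."""
    return np.where(x >= 0, 1.0, 0.0)

def logistic(x, a=1.0):
    """Logistic threshold 1 / (1 + exp(-a x)); a controls the slope."""
    return 1.0 / (1.0 + np.exp(-a * x))

def neuron_output(x, w, w0, f=logistic):
    """Single artificial neuron: y = f(w^T x + w0)."""
    return f(np.dot(w, x) + w0)
```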
Combining artificial neurons

[Figure: inputs $x_1, x_2, \ldots, x_l$ feeding a multilayer perceptron with 3 layers.]

Discriminating ability of the multilayer perceptron

Since a 3-layer perceptron can approximate any smooth function, it can approximate

$F(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$

the optimal discriminant function of the two classes.
Training of multilayer perceptron

[Figure: two consecutive layers of the network. The outputs $y_k^{r-1}$ of layer $r-1$ are weighted by $w_{jk}^r$ and summed into $v_j^r$, which is passed through $f$ to give the output $y_j^r$ of neuron $j$ in layer $r$.]
Training and cost function

Desired network output: $x(i) \rightarrow y(i)$

Trained network output: $x(i) \rightarrow \hat{y}(i)$

Cost function for one training sample:

$E(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2$

Total cost function:

$J = \sum_{i=1}^{N} E(i)$

Goal of the training: find the values of the weights $w_{jk}^r$ which minimize the cost function $J$.
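These two cost functions translate directly into code; a minimal sketch (assuming NumPy and that desired and trained outputs are stored row-wise) is:

```python
import numpy as np

def sample_cost(y, y_hat):
    """E(i) = 1/2 * sum_m (y_m(i) - y_hat_m(i))^2 for one training sample."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def total_cost(Y, Y_hat):
    """J = sum_i E(i), with one desired/trained output vector per row."""
    return sum(sample_cost(y, y_hat) for y, y_hat in zip(Y, Y_hat))
```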
Gradient descent

Denote:

$w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{j k_{r-1}}^r]^T$

Gradient descent:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial J}{\partial w_j^r}$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial E(i)}{\partial w_j^r}$
Gradient descent

Chain rule for differentiating composite functions:

$\frac{\partial E(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \cdot \frac{\partial v_j^r(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \, y^{r-1}(i)$

Denote:

$\delta_j^r(i) = \frac{\partial E(i)}{\partial v_j^r(i)}$
Backpropagation

If $r = L$, then

$\delta_j^L(i) = \frac{\partial E(i)}{\partial v_j^L(i)} = \frac{\partial}{\partial v_j^L(i)} \left[ \frac{1}{2} \sum_{m=1}^{k_L} \big( f(v_m^L(i)) - y_m(i) \big)^2 \right] = \big( f(v_j^L(i)) - y_j(i) \big) f'(v_j^L(i)) = e_j(i) \, f'(v_j^L(i))$

where $e_j(i) = \hat{y}_j(i) - y_j(i)$ is the output error.

If $r < L$, then

$\delta_j^{r-1}(i) = \frac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial E(i)}{\partial v_k^r(i)} \cdot \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i) \, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \left( \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r \right) f'(v_j^{r-1}(i))$
Backpropagation algorithm

• Initialization: initialize all weights with random values.
• Forward computations: for each training vector x(i), compute all $v_j^r(i)$ and $y_j^r(i)$.
• Backward computations: for each i, j and r = L, L-1, …, 2, compute $\delta_j^{r-1}(i)$.
• Update weights (a code sketch of the whole procedure follows this list):

$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial E(i)}{\partial w_j^r} = w_j^r(\text{old}) - \mu \, \delta_j^r(i) \, y^{r-1}(i)$
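The following is a minimal sketch of this algorithm for a network with one hidden layer and logistic threshold functions; the variable names, shapes, learning rate and per-sample update schedule are assumptions for illustration, not the lecture's reference implementation.

```python
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_deriv(v, a=1.0):
    s = logistic(v, a)
    return a * s * (1.0 - s)

def train_mlp(X, Y, n_hidden=5, mu=0.1, n_epochs=100, seed=0):
    """X: (N, l) input vectors; Y: (N, k_L) desired outputs in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    # Initialization: random weights, bias folded in as an extra column.
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_in + 1))
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden + 1))

    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            # Forward computations: v_j^r and y_j^r for each layer.
            x1 = np.append(x, 1.0)             # input with constant 1 for bias
            v1 = W1 @ x1                       # hidden-layer activations v^1
            y1 = np.append(logistic(v1), 1.0)  # hidden outputs y^1 (+ bias)
            v2 = W2 @ y1                       # output-layer activations v^L
            y_hat = logistic(v2)               # network output

            # Backward computations: delta^L = e * f'(v^L), then propagate.
            delta2 = (y_hat - y) * logistic_deriv(v2)
            delta1 = (W2[:, :-1].T @ delta2) * logistic_deriv(v1)

            # Update weights: w(new) = w(old) - mu * delta * y^(r-1).
            W2 -= mu * np.outer(delta2, y1)
            W1 -= mu * np.outer(delta1, x1)
    return W1, W2
```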
MLP issues

• What is the best network configuration?
• How should the learning parameter $\mu$ be chosen?
• When should training be stopped?
• Should a different threshold function f or cost function J be chosen?