Non-Bayes classifiers.
Linear discriminants,
neural networks.
Discriminant functions (1)
Bayes classification rule: decide $\omega_1$ if

$$P(\omega_1 \mid x) - P(\omega_2 \mid x) > 0,$$

otherwise decide $\omega_2$.

Instead, we might try to find a function $f_{\omega_1,\omega_2}(x)$ and decide $\omega_1$ if $f_{\omega_1,\omega_2}(x) > 0$, otherwise $\omega_2$. Such a function $f_{\omega_1,\omega_2}$ is called a discriminant function.

$$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$$

is the decision surface.
Discriminant functions (2)
[Figure: examples of decision surfaces separating Class 1 and Class 2.]
Linear discriminant function:

$$f_{\omega_1,\omega_2}(x) = w^T x + w_0$$

The decision surface is a hyperplane:

$$w^T x + w_0 = 0$$
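As a small illustration, a linear discriminant can be evaluated directly; the weight vector and bias below are made-up values, not taken from the slides:

```python
import numpy as np

# Hypothetical parameters of a linear discriminant f(x) = w^T x + w0.
w = np.array([1.0, -2.0])   # weight vector (assumed for illustration)
w0 = 0.5                    # bias term (assumed for illustration)

def classify(x):
    """Decide class 1 if f(x) > 0, class 2 otherwise."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([3.0, 1.0])))  # f = 3 - 2 + 0.5 = 1.5 > 0 -> class 1
print(classify(np.array([0.0, 1.0])))  # f = -2 + 0.5 = -1.5 < 0 -> class 2
```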
Linear discriminant – perceptron cost function
Replace $w$ and $x$ with their augmented versions:

$$w \leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}, \qquad x \leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$$

Thus the decision function is now $f_{\omega_1,\omega_2}(x) = w^T x$ and the decision surface is

$$w^T x = 0$$

Perceptron cost function:

$$J(w) = \sum_x \delta_x\, w^T x$$

where

$$\delta_x = \begin{cases} -1, & \text{if } x \in \omega_1 \text{ and } w^T x < 0 \\ +1, & \text{if } x \in \omega_2 \text{ and } w^T x > 0 \\ 0, & \text{if } x \text{ is correctly classified} \end{cases}$$
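A minimal NumPy sketch of this cost, assuming samples are already augmented and labels encode the class index (1 or 2); the function name and encoding are our own choices:

```python
import numpy as np

def perceptron_cost(w, X, labels):
    """Perceptron cost J(w) = sum over misclassified x of delta_x * w^T x.

    X      -- (N, d) array of augmented samples (last component is 1)
    labels -- array of 1 (class omega_1) or 2 (class omega_2)
    """
    J = 0.0
    for x, c in zip(X, labels):
        s = w @ x
        if c == 1 and s < 0:    # omega_1 sample on the wrong side: delta = -1
            J += -s
        elif c == 2 and s > 0:  # omega_2 sample on the wrong side: delta = +1
            J += s
    return J                    # 0 iff every sample is correctly classified
```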
Linear discriminant – perceptron cost function
[Figure: samples of Class 1 and Class 2, with misclassified samples on the wrong side of the decision surface.]

Perceptron cost function:

$$J(w) = \sum_x \delta_x\, w^T x$$

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$. Otherwise $J(w) > 0$, and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use a gradient descent algorithm.
Linear discriminant – Perceptron algorithm
Gradient descent:

$$w(t+1) = w(t) - \rho_t \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(t)}$$

At the points where $J(w)$ is differentiable,

$$\frac{\partial J(w)}{\partial w} = \sum_{\text{misclassified } x} \delta_x\, x$$

Thus

$$w(t+1) = w(t) - \rho_t \sum_{\text{misclassified } x} \delta_x\, x$$

The perceptron algorithm converges when the classes are linearly separable, under some conditions on $\rho_t$.
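A compact sketch of the resulting batch update rule under the same conventions as above; the fixed learning rate, the stopping rule, and the function name are illustrative choices:

```python
import numpy as np

def perceptron_train(X, labels, rho=0.1, max_epochs=100):
    """Batch perceptron: w(t+1) = w(t) - rho * sum of delta_x * x over misclassified x.

    X      -- (N, d) augmented samples; labels -- 1 for omega_1, 2 for omega_2.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        grad = np.zeros_like(w)
        for x, c in zip(X, labels):
            s = w @ x
            if c == 1 and s <= 0:   # boundary points treated as misclassified
                grad += -x          # delta_x = -1
            elif c == 2 and s >= 0:
                grad += x           # delta_x = +1
        if not grad.any():          # no misclassified samples: J(w) = 0
            break
        w = w - rho * grad
    return w
```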
Sum of error squares estimation
Denote by $y(x) = \pm 1$ the desired output function: $+1$ for one class and $-1$ for the other.

We want to find a discriminant function

$$f_{\omega_1,\omega_2}(x) = w^T x$$

whose output is similar to $y(x)$.

Use the sum of error squares as the similarity criterion:

$$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$$

$$\hat{w} = \arg\min_w J(w)$$
Sum of error squares estimation
Minimize the mean square error:

$$\frac{\partial J(w)}{\partial w} = -\sum_{i=1}^{N} 2\, x_i (y_i - w^T x_i) = 0$$

$$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$$

Thus

$$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i$$
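In matrix form this is the ordinary least-squares solution, which a short NumPy sketch can compute directly; the random data below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # N=100 augmented samples, d=3 (illustrative)
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))  # made-up +/-1 desired outputs

# w_hat = (sum_i x_i x_i^T)^{-1} sum_i x_i y_i, i.e. solve (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently: w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```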
Neurons
Artificial neuron.
[Figure: an artificial neuron with inputs $x_1, x_2, \ldots, x_l$, weights $w_1, w_2, \ldots, w_l$, bias weight $w_0$, and threshold function $f$.]

The figure above represents an artificial neuron calculating:

$$y = f\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$$
Artificial neuron.
Threshold functions $f$:

Step function:

$$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$

Logistic function:

$$f(x) = \frac{1}{1 + e^{-ax}}$$
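A minimal sketch of a single neuron with both threshold functions; the function names, the slope value a, and the sample inputs are our own illustrative choices:

```python
import numpy as np

def step(x):
    """Step threshold: 1 for x > 0, else 0."""
    return np.where(x > 0, 1.0, 0.0)

def logistic(x, a=1.0):
    """Logistic threshold f(x) = 1 / (1 + exp(-a x)); a controls the steepness."""
    return 1.0 / (1.0 + np.exp(-a * x))

def neuron(x, w, w0, f=logistic):
    """Artificial neuron: y = f(sum_i w_i x_i + w0)."""
    return f(np.dot(w, x) + w0)

print(neuron(np.array([0.5, -1.0]), np.array([2.0, 1.0]), w0=0.5, f=step))  # 1.0
```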
Combining artificial neurons
[Figure: inputs $x_1, x_2, \ldots, x_l$ feeding a multilayer perceptron with 3 layers.]
Discriminating ability of multilayer perceptron
Since a 3-layer perceptron can approximate any smooth function, it can approximate

$$F(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x),$$

the optimal discriminant function of two classes.
Training of multilayer perceptron
[Figure: two consecutive layers of a multilayer perceptron. The outputs $y_k^{r-1}$ of layer $r-1$ are combined through weights $w_{jk}^r$ to form the activations $v_j^r$ and outputs $y_j^r$ of layer $r$.]
Training and cost function
Desired network output:

$$x(i) \rightarrow y(i)$$

Trained network output:

$$x(i) \rightarrow \hat{y}(i)$$

Cost function for one training sample:

$$E(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2$$

Total cost function:

$$J = \sum_{i=1}^{N} E(i)$$

Goal of the training: find the values of $w_{jk}^r$ which minimize the cost function $J$.
Gradient descent
Denote:

$$w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{j k_{r-1}}^r]^T$$

Gradient descent:

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial J}{\partial w_j^r}$$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial E(i)}{\partial w_j^r}$$
Gradient descent
Chain rule for differentiating composite functions:

$$\frac{\partial E(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \frac{\partial v_j^r(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)}\, y^{r-1}(i)$$

Denote:

$$\delta_j^r(i) = \frac{\partial E(i)}{\partial v_j^r(i)}$$
Backpropagation
If $r = L$, then

$$\delta_j^L(i) = \frac{\partial E(i)}{\partial v_j^L(i)} = \frac{\partial}{\partial v_j^L(i)} \frac{1}{2} \sum_{m=1}^{k_L} \left( f(v_m^L(i)) - y_m(i) \right)^2 = \left( f(v_j^L(i)) - y_j(i) \right) f'(v_j^L(i)) = e_j(i)\, f'(v_j^L(i))$$

If $r < L$, then

$$\delta_j^{r-1}(i) = \frac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial E(i)}{\partial v_k^r(i)} \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \left( \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r \right) f'(v_j^{r-1}(i))$$
Backpropagation algorithm
• Initialization: initialize all weights with random values.
• Forward computations: for each training vector $x(i)$ compute all $v_j^r(i)$ and $y_j^r(i)$.
• Backward computations: for each $i$, $j$ and $r = L, L-1, \ldots, 2$ compute $\delta_j^{r-1}(i)$.
• Update weights (a sketch of the whole loop follows below):

$$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial E(i)}{\partial w_j^r} = w_j^r(\text{old}) - \mu\, \delta_j^r(i)\, y^{r-1}(i)$$
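A minimal NumPy sketch of this procedure for a network with one hidden layer and logistic thresholds; the layer sizes, learning rate mu, epoch count, and the XOR data are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp(X, Y, hidden=5, mu=0.5, epochs=1000):
    """Per-sample backpropagation for an MLP with one hidden layer.

    X -- (N, d) inputs; Y -- (N, k) desired outputs in [0, 1].
    """
    d, k = X.shape[1], Y.shape[1]
    W1 = rng.normal(scale=0.5, size=(hidden, d + 1))  # hidden weights (incl. bias)
    W2 = rng.normal(scale=0.5, size=(k, hidden + 1))  # output weights (incl. bias)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # Forward computations: activations v and outputs y^r of each layer.
            x1 = np.append(x, 1.0)
            y1 = np.append(logistic(W1 @ x1), 1.0)
            out = logistic(W2 @ y1)
            # Backward: delta_L = (f(v_L) - y) f'(v_L), with f'(v) = f(v)(1 - f(v)).
            d2 = (out - y) * out * (1.0 - out)
            # delta_{r-1} = (sum_k delta_k^r w_kj^r) f'(v_j^{r-1}); biases excluded.
            d1 = (W2[:, :-1].T @ d2) * y1[:-1] * (1.0 - y1[:-1])
            # Update: w(new) = w(old) - mu * delta * y^{r-1}.
            W2 -= mu * np.outer(d2, y1)
            W1 -= mu * np.outer(d1, x1)
    return W1, W2

# Illustrative usage: XOR, which no single linear discriminant can separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_mlp(X, Y)
```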
MLP issues
• What is the best network configuration?
• How do we choose a proper learning parameter $\mu$?
• When should training be stopped?
• Should we choose another threshold function $f$ or cost function $J$?