IUP GEII d'Amiens

unit #2
Giansalvo EXIN Cirrincione
STATISTICAL PATTERN RECOGNITION
An example: character recognition
Problem: distinguish handwritten versions of the characters a and b
The image is captured by a camera and represented by an array of pixel values, x = (x1, …, xd): each xi ranges from 0 to 1 according to the fraction of the pixel square occupied by black ink.
Goal: develop an algorithm which will assign any image, represented by
a vector x, to one of two classes, which we shall denote by Ck where
k=1,2, so that class C1 corresponds to a and class C2 corresponds to b.
data set (sample)
high dimensionality
For a 256x256 image, d = 65536; representing grey values by 8 bits implies 2^(8x256x256) ≈ 10^158000 different images.
generalization
feature x̃1 = height / width of the character, extracted from the pixel vector x = (x1, …, xd)
feature selection/extraction
TS histograms of x̃1 approximate the class-conditional pdf's; the two classes overlap.
ideal classifier = feature + threshold (threshold chosen to give the minimum # of misclassifications)
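As an illustration of the classifier = feature + threshold idea, here is a minimal Python sketch (not part of the course material): it computes a bounding-box height/width ratio as the feature x̃1 and compares it to a threshold. The toy images, the bounding-box definition of x̃1 and the threshold value 1.0 are all assumptions for illustration.

```python
import numpy as np

def feature_ratio(img):
    """Hypothetical feature x~1 = height / width of the ink bounding box;
    img is a 2-D array of pixel values in [0, 1]."""
    rows, cols = np.nonzero(img > 0.5)
    return (rows.max() - rows.min() + 1) / (cols.max() - cols.min() + 1)

def classify(img, threshold=1.0):
    """The threshold value is illustrative; it would be read off the TS histograms."""
    return "C2 (b)" if feature_ratio(img) > threshold else "C1 (a)"

# Toy 'images': a wide blob (a-like) and a tall blob (b-like).
a_like = np.zeros((16, 16)); a_like[6:10, 2:14] = 1.0
b_like = np.zeros((16, 16)); b_like[1:15, 5:9] = 1.0
print(classify(a_like), classify(b_like))   # -> C1 (a) C2 (b)
```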
How to improve the classification?
decision boundary
Consider a second feature!!!
Curse of dimensionality!!!
Classification outcome: y = 1 if x ∈ C1, y = 0 if x ∈ C2
Mapping: x ∈ ℝⁿ → y ∈ ℝᶜ   (c classes)
Model: yk = yk(x; w), where the weights w play a role like the threshold above
• regression problems: continuous outputs
• classification problems: discrete outputs
function approximation
prior knowledge
x → preprocessing (e.g. feature extraction) → x̃ → neural network → ỹ → postprocessing → y
The curse of dimensionality
PROBLEM: model a mapping x ∈ ℝⁿ → y ∈ ℝ on
the basis of a set of training data
SIMPLE SOLUTION: discretize the input variables in bins.
This leads to a division of the whole input space into cells.
Each of the training examples corresponds to a point in one
of the cells and carries an associated value of the output y.
Given a new point in input space, find which cell the point falls
in and return the average value of y for all points in that cell.
By increasing the number of divisions M along each axis we
could increase the precision with which the input is specified.
If each input variable is divided into M divisions, then the total
number of cells is M^d and this grows exponentially with the
dimensionality d of the input space. Since each cell must contain at
least one data point, this implies that the quantity of training data
needed to specify the mapping also grows exponentially.
For a limited quantity of data, increasing d leads to a
very poor representation of the mapping.
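A minimal sketch of the binning argument, assuming inputs scaled to [0, 1): it prints how the number of cells M^d explodes with d, and implements the naive cell-average predictor described above (returning None for empty cells).

```python
import numpy as np

M = 10  # divisions along each input axis

# The number of cells grows exponentially with the input dimension d.
for d in (1, 2, 3, 5, 10):
    print(f"d = {d:2d}: M**d = {M**d:,} cells")

# Naive cell-average predictor (inputs assumed scaled to [0, 1)).
def fit_cells(X, y, M):
    cells = {}
    for x, t in zip(X, y):
        key = tuple(np.minimum((x * M).astype(int), M - 1))
        cells.setdefault(key, []).append(t)
    return {k: float(np.mean(v)) for k, v in cells.items()}

def predict_cell(cells, x, M):
    key = tuple(np.minimum((np.asarray(x) * M).astype(int), M - 1))
    return cells.get(key)        # None: the cell contains no training point

rng = np.random.default_rng(0)
X = rng.random((50, 2))          # 50 points cannot fill 10**2 = 100 cells
y = np.sin(2 * np.pi * X[:, 0])  # arbitrary target, for illustration only
cells = fit_cells(X, y, M)
print(predict_cell(cells, [0.31, 0.74], M))
```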
homework
Another example: polynomial curve fitting
Problem: fit a polynomial to a set of N data
points by minimizing an error function
linear in w
supervised learning
training set: { xn, tn },  n = 1, …, N   (tn = target)
sum-of-squares error: E(w) = (1/2) Σn [ y(xn; w) − tn ]²
E is quadratic in w (the polynomial is linear in w), so it has a unique minimum w*
Figures: polynomial fits of order M = 1, 3, 10 to 11 data points generated from h(x) = 0.5 + 0.4 sin(2πx) with zero-mean Gaussian noise (σ = 0.05).
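A short sketch of the curve-fitting experiment above, under assumed settings (random seed, equally spaced x in [0, 1]): 11 noisy samples of h(x) = 0.5 + 0.4 sin(2πx) are fitted by least squares for M = 1, 3, 10; the training error shrinks as M grows even though the high-order fit is overfitting.

```python
import numpy as np

rng = np.random.default_rng(1)                 # assumed seed
N = 11
x = np.linspace(0.0, 1.0, N)
h = 0.5 + 0.4 * np.sin(2 * np.pi * x)          # underlying function from the slide
t = h + rng.normal(0.0, 0.05, size=N)          # zero-mean Gaussian noise, sigma = 0.05

for M in (1, 3, 10):
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: 1, x, ..., x**M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimizes the quadratic error in w
    E = 0.5 * np.sum((Phi @ w - t) ** 2)         # sum-of-squares training error E(w*)
    print(f"M = {M:2d}: E(w*) = {E:.6f}")        # E shrinks with M; M = 10 overfits
```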
overfitting in classification: a too-complex decision boundary does not allow for class overlap
model complexity
Occam’s razor
complexity control:  Ẽ = E + ν Ω
Ω = (1/2) ∫ (d²y/dx²)² dx
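A hedged sketch of the complexity-control idea Ẽ = E + νΩ for the polynomial model, assuming the curvature integral is taken over [0, 1]: the penalty then becomes a quadratic form (1/2) wᵀ R w, and the regularized minimum solves (ΦᵀΦ + νR) w = Φᵀ t.

```python
import numpy as np

def curvature_penalty(M):
    """R such that Omega = 0.5 * w @ R @ w = 0.5 * integral_0^1 (d2y/dx2)**2 dx
    for y(x) = sum_j w_j x**j (integration over [0, 1] is an assumption)."""
    R = np.zeros((M + 1, M + 1))
    for j in range(2, M + 1):
        for k in range(2, M + 1):
            R[j, k] = j * (j - 1) * k * (k - 1) / (j + k - 3)
    return R

def fit_regularized(x, t, M, nu):
    Phi = np.vander(x, M + 1, increasing=True)
    R = curvature_penalty(M)
    # Minimize E~ = 0.5*||Phi w - t||**2 + nu * 0.5 * w' R w
    # => (Phi' Phi + nu R) w = Phi' t
    return np.linalg.solve(Phi.T @ Phi + nu * R, Phi.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 11)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, size=11)

for nu in (1e-8, 1e-4, 1e-1):
    w = fit_regularized(x, t, M=10, nu=nu)
    print(f"nu = {nu:g}: max |w_j| = {np.abs(w).max():.3g}")  # larger nu -> smoother fit
```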
Bayes’ theorem
Goal: classify a new character in such a way as to
minimize the probability of misclassification
P(Ck ) : prior probability (given the TS, fraction of characters
labelled k in the limit of an infinite number of observations)
Problem: classify a new character
without seeing the corresponding image
Assign it to the class having the higher prior probability
Bayes’ theorem
The value of the feature variable x̃1 has been measured and assigned to one of a discrete set of values {X^l}.
Problem: seek a formalism which allows
this information to be combined with the
prior probabilities we already possess
Bayes' theorem
Prior probability P(Ck), e.g. P(C1)   (in the limit of an infinite number of images)
Joint probability P(Ck, X^l), e.g. P(C1, X^5)
Class-conditional probability P(X^l | Ck), e.g. P(X^5 | C1)
Unconditional probability P(X^l), e.g. P(X^5)

Bayes' theorem (also valid for degrees of belief):
P(Ck | X^l) = P(X^l | Ck) P(Ck) / P(X^l)
posterior = class-conditional × prior / normalization factor
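A small sketch of the discrete-case Bayes computation, using made-up counts n[k, l] (hypothetical, for illustration only) in place of real TS histograms: priors, joint, class-conditional and posterior probabilities are all obtained as fractions, and the posterior columns sum to one.

```python
import numpy as np

# Hypothetical counts n[k, l]: characters from class Ck whose feature value
# fell in bin X^l (the numbers are made up for illustration).
n = np.array([[12, 33, 41, 19,  5],    # class C1 ('a')
              [ 2,  8, 22, 38, 30]])   # class C2 ('b')
N = n.sum()

P_joint = n / N                              # P(Ck, X^l)
P_prior = P_joint.sum(axis=1)                # P(Ck)
P_x     = P_joint.sum(axis=0)                # P(X^l)
P_cond  = n / n.sum(axis=1, keepdims=True)   # P(X^l | Ck)

# Bayes' theorem: P(Ck | X^l) = P(X^l | Ck) P(Ck) / P(X^l)
P_post = P_cond * P_prior[:, None] / P_x[None, :]
print(np.round(P_post, 3))                   # each column sums to 1
```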
Bayes' theorem
The posterior probabilities can be computed directly by an ANN.
Different prior probabilities may apply (e.g. classification of normal tissue / tumour on medical X-ray images; P(C) = 0.6).
inference + decision making = classification process
Bayes' theorem (continuous variables)
The class-conditional density p(x | Ck) and the prior P(Ck) are combined with the observation x through Bayes' rule to give the posterior P(Ck | x).
For c classes and feature vector x:
P(Ck | x) = p(x | Ck) P(Ck) / p(x),   where   p(x) = Σ_{k=1}^{c} p(x | Ck) P(Ck)
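A minimal sketch of the continuous-variable formula, assuming hypothetical 1-D Gaussian class-conditionals and priors (all numbers are illustrative, not from the character data).

```python
import numpy as np

priors = np.array([0.6, 0.4])                                # hypothetical P(Ck)
mus, sigmas = np.array([0.3, 0.7]), np.array([0.15, 0.1])    # hypothetical p(x | Ck)

def posteriors(x):
    """P(Ck | x) = p(x | Ck) P(Ck) / p(x),  p(x) = sum_k p(x | Ck) P(Ck)."""
    pdf = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    joint = pdf * priors
    return joint / joint.sum()

print(posteriors(0.5))   # the posteriors sum to 1 for any observation x
```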
Decision making
Minimum misclassification rule
Assign feature vector x to class Ck if
P Ck x  P C j x
 
jk
p x Ck  P Ck   p x C j P C j 
jk
decision regions R1 , …, Rc
such that a point falling in Rk is assigned to Ck
The probability of misclassification involves terms such as the joint probability of x being assigned to C1 while the true class is C2. Equivalently, choose the regions Rk to maximize the probability of being correct:
P(correct) = Σ_{k=1}^{c} P(x ∈ Rk, Ck)
           = Σ_{k=1}^{c} P(x ∈ Rk | Ck) P(Ck)
           = Σ_{k=1}^{c} ∫_{Rk} p(x | Ck) P(Ck) dx
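The minimum-misclassification rule and P(correct) can be checked numerically; the sketch below (same hypothetical Gaussian setup as above, grid integration over an assumed interval) assigns each grid point to the class with the largest p(x | Ck) P(Ck) and sums the integrand over the resulting regions Rk.

```python
import numpy as np

priors = np.array([0.6, 0.4])                                # hypothetical P(Ck)
mus, sigmas = np.array([0.3, 0.7]), np.array([0.15, 0.1])    # hypothetical p(x | Ck)

x = np.linspace(-0.5, 1.5, 20001)                            # assumed feature range
pdf = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
joint = pdf * priors                                         # p(x | Ck) P(Ck)

# Minimum-misclassification rule: each x goes to the class with the largest joint,
# so the integrand of P(correct) at x is simply the row maximum.
dx = x[1] - x[0]
p_correct = joint.max(axis=1).sum() * dx    # ~ sum_k integral_{Rk} p(x|Ck) P(Ck) dx
print(f"P(correct) ~ {p_correct:.4f}")
```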
discriminant (decision) functions y1(x), …, yc(x): assign x to Ck if yk(x) > yj(x)   ∀ j ≠ k
e.g. yk(x) = p(x | Ck) P(Ck)
other discriminant functions: g(yk(x)), with g any monotonic function, e.g. yk(x) = ln p(x | Ck) + ln P(Ck)
decision boundaries: yk(x) = yj(x)
two-class decision problems: use a single discriminant
y(x) = y1(x) − y2(x)
e.g. y(x) = P(C1 | x) − P(C2 | x),  or  y(x) = ln [ p(x | C1) / p(x | C2) ] + ln [ P(C1) / P(C2) ]
• assign x to class C1 if y(x) > 0
• assign x to class C2 if y(x) < 0
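A tiny sketch of the two-class log discriminant, again with hypothetical Gaussian class-conditionals and priors: the sign of y(x) gives the decision.

```python
import numpy as np

priors = (0.6, 0.4)                          # hypothetical P(C1), P(C2)
mu1, s1, mu2, s2 = 0.3, 0.15, 0.7, 0.1       # hypothetical Gaussian p(x | Ck)

def log_gauss(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

def y(x):
    """y(x) = ln[p(x|C1)/p(x|C2)] + ln[P(C1)/P(C2)]: positive -> C1, negative -> C2."""
    return log_gauss(x, mu1, s1) - log_gauss(x, mu2, s2) + np.log(priors[0] / priors[1])

for x in (0.2, 0.5, 0.8):
    print(f"x = {x}: y(x) = {y(x):+.2f} -> {'C1' if y(x) > 0 else 'C2'}")
```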
Loss matrix L = (Lkj): Lkj = penalty associated with assigning a pattern to Cj when in fact it belongs to Ck
expected loss (risk) for patterns in Ck:
Rk = Σ_{j=1}^{c} Lkj ∫_{Rj} p(x | Ck) dx
minimizing risk
total risk:
R = Σ_{k=1}^{c} Rk P(Ck) = Σ_{j=1}^{c} ∫_{Rj} [ Σ_{k=1}^{c} Lkj p(x | Ck) P(Ck) ] dx
Choose regions Rj (minimizing the risk) such that x ∈ Rj when
Σ_{k=1}^{c} Lkj p(x | Ck) P(Ck) < Σ_{k=1}^{c} Lki p(x | Ck) P(Ck)   ∀ i ≠ j
homework
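A sketch of the minimum-risk rule with an assumed loss matrix (all numbers hypothetical): for each x it assigns the class Cj that minimizes Σ_k Lkj p(x | Ck) P(Ck); with the asymmetric losses chosen here the decision at x = 0.5 flips to C2 even though p(x | C1) P(C1) is larger there.

```python
import numpy as np

# Hypothetical loss matrix: L[k, j] = penalty for assigning a pattern to Cj
# when it actually belongs to Ck (values made up; missing a true C2 costs 10).
L = np.array([[ 0.0, 1.0],
              [10.0, 0.0]])
priors = np.array([0.6, 0.4])                                # hypothetical P(Ck)
mus, sigmas = np.array([0.3, 0.7]), np.array([0.15, 0.1])    # hypothetical p(x | Ck)

def min_risk_class(x):
    """Assign x to the class Cj minimizing sum_k L[k, j] p(x | Ck) P(Ck)."""
    weighted = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi)) * priors
    return int(np.argmin(L.T @ weighted)) + 1                # 1-based class index

# At x = 0.5 the asymmetric loss picks C2 even though p(x|C1)P(C1) is the larger joint.
for x in (0.3, 0.45, 0.5, 0.7):
    print(f"x = {x:.2f} -> C{min_risk_class(x)}")
```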
The reject option
θ = threshold in the range (0, 1)
if max_k P(Ck | x) > θ   then classify x
if max_k P(Ck | x) ≤ θ   then reject x
One way in which the reject option can be used is to design a
relatively simple but fast classifier system to cover the bulk of the
feature space, while leaving the remaining regions to a more
sophisticated system which might be relatively slow.
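A sketch of the reject option under the same hypothetical Gaussian setup, with an assumed threshold θ = 0.9: patterns whose largest posterior does not exceed θ are left to the more sophisticated (slower) system mentioned above.

```python
import numpy as np

theta = 0.9                                                  # assumed reject threshold
priors = np.array([0.6, 0.4])                                # hypothetical P(Ck)
mus, sigmas = np.array([0.3, 0.7]), np.array([0.15, 0.1])    # hypothetical p(x | Ck)

def classify_or_reject(x):
    joint = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi)) * priors
    post = joint / joint.sum()                               # P(Ck | x)
    if post.max() > theta:
        return f"C{post.argmax() + 1}"                       # confident enough: classify x
    return "reject"                                          # leave x to the slower system

for x in (0.25, 0.48, 0.72):
    print(f"x = {x:.2f}: {classify_or_reject(x)}")
```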