unit #2, Giansalvo Exin Cirrincione
STATISTICAL PATTERN RECOGNITION

An example: character recognition
Problem: distinguish handwritten versions of the characters "a" and "b". A camera captures each character as an array of pixel values $x_1, \dots, x_d$, where each $x_i$ ranges from 0 to 1 according to the fraction of the pixel square occupied by black ink.
Goal: develop an algorithm which will assign any image, represented by a vector $\mathbf{x}$, to one of two classes, denoted $C_k$ with $k = 1, 2$, so that class $C_1$ corresponds to "a" and class $C_2$ corresponds to "b". The algorithm is learned from a data set (sample) of labelled images.

High dimensionality and generalization
For a 256x256 image, $d = 65536$; representing grey values by 8 bits implies $2^{8 \times 256 \times 256} \approx 10^{158000}$ different images. No training set can cover more than a vanishing fraction of these, so the classifier must generalize to images it has never seen.

Feature selection/extraction
Instead of the raw pixels, use a single feature $\tilde{x}_1 = \text{height}/\text{width}$ of the character. Histograms of $\tilde{x}_1$ over the training set (TS) approximate the two class-conditional pdf's. Because the pdf's overlap, some misclassifications are unavoidable; the ideal classifier here is the feature plus a threshold, placed so as to minimize the number of misclassifications.
How to improve the classification? Consider a second feature: the decision boundary then becomes a curve in the two-dimensional feature space. But adding features indiscriminately leads to the curse of dimensionality (see below).

Classification as function approximation
Classification outcome:
$$ y = \begin{cases} 1 & \text{if } \mathbf{x} \in C_1 \\ 0 & \text{if } \mathbf{x} \in C_2 \end{cases} $$
Mapping: $\mathbf{x} \in \mathbb{R}^n \to \mathbf{y} \in \mathbb{R}^c$ ($c$ classes). Model: $y_k = y_k(\mathbf{x}; \mathbf{w})$, where the weights $\mathbf{w}$ play the same role as the threshold above. Regression problems have continuous outputs, classification problems discrete outputs; both are instances of function approximation.
Prior knowledge enters through the processing chain:
$\mathbf{x} \to$ preprocessing (e.g. feature extraction) $\to \tilde{\mathbf{x}} \to$ neural network $\to \tilde{\mathbf{y}} \to$ postprocessing $\to \mathbf{y}$

The curse of dimensionality
PROBLEM: model a mapping $\mathbf{x} \in \mathbb{R}^d \to y$ on the basis of a set of training data.
SIMPLE SOLUTION: discretize the input variables into bins, which divides the whole input space into cells. Each training example corresponds to a point in one of the cells and carries an associated value of the output $y$. Given a new point in input space, find which cell it falls in and return the average value of $y$ over all training points in that cell (a minimal sketch of this scheme follows below). By increasing the number of divisions $M$ along each axis we could increase the precision with which the input is specified. However, if each input variable is divided into $M$ divisions, the total number of cells is $M^d$, which grows exponentially with the dimensionality $d$ of the input space. Since each cell must contain at least one data point, the quantity of training data needed to specify the mapping also grows exponentially. For a limited quantity of data, increasing $d$ leads to a very poor representation of the mapping.
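A minimal sketch in Python of the cell-averaging scheme just described; the function names (`fit_cells`, `predict_cell`) and the toy data are illustrative choices, not from the slides.

```python
import numpy as np

def fit_cells(X, y, M, lo=0.0, hi=1.0):
    """Store the average target of the training points falling in each cell."""
    idx = np.clip(((X - lo) / (hi - lo) * M).astype(int), 0, M - 1)
    sums, counts = {}, {}
    for cell, t in zip(map(tuple, idx), y):
        sums[cell] = sums.get(cell, 0.0) + t
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def predict_cell(cells, x, M, lo=0.0, hi=1.0):
    """Return the average target of the cell containing x,
    or None if the cell holds no training data at all."""
    cell = tuple(np.clip(((x - lo) / (hi - lo) * M).astype(int), 0, M - 1))
    return cells.get(cell)

X = np.random.default_rng(0).random((50, 2))   # 50 points in [0, 1)^2
y = X.sum(axis=1)                              # toy target
cells = fit_cells(X, y, M=10)                  # M**d = 10**2 = 100 cells
print(predict_cell(cells, np.array([0.42, 0.17]), M=10))
# Even at d = 2, 50 points leave at least half of the 100 cells empty; at
# d = 10 there would be 10**10 cells and almost every query would return
# None: the curse of dimensionality.
```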
homework

Another example: polynomial curve fitting
Problem: fit a polynomial
$$ y(x; \mathbf{w}) = \sum_{j=0}^{M} w_j x^j $$
to a set of $N$ data points by minimizing an error function. The model is linear in $\mathbf{w}$. This is supervised learning with training set $\{x^n, t^n\}$, $n = 1, \dots, N$, where the $t^n$ are the targets. The sum-of-squares error
$$ E = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x^n; \mathbf{w}) - t^n \right\}^2 $$
is quadratic in $\mathbf{w}$, so its minimum $\mathbf{w}^*$ is found by solving a set of linear equations.
Experiment: 11 points generated from $h(x) = 0.5 + 0.4 \sin(2\pi x)$ plus Gaussian noise with zero mean and $\sigma = 0.05$. A polynomial of order $M = 1$ is too inflexible, $M = 3$ gives a good fit, and $M = 10$ passes through every data point but oscillates wildly between them: overfitting. (A runnable sketch of this experiment appears below, after the decision-theory material.)
Overfitting also occurs in classification: since the class distributions overlap, a model complex enough to classify every training point correctly merely fits the noise in the overlap region and is not allowed to be the goal. Model complexity should be limited (Occam's razor). Complexity control: minimize a regularized error $\tilde{E} = E + \nu \Omega$, where a penalty term such as
$$ \Omega = \frac{1}{2} \int \left( \frac{d^2 y}{dx^2} \right)^2 dx $$
discourages highly curved, over-complex solutions.

Bayes' theorem
Goal: classify a new character in such a way as to minimize the probability of misclassification.
$P(C_k)$: prior probability (given the TS: the fraction of characters labelled $k$, in the limit of an infinite number of observations).
Problem: classify a new character without seeing the corresponding image. The best we can do is assign it to the class having the higher prior probability.
Now suppose the value of the feature variable $\tilde{x}_1$ has been measured and assigned to one of a discrete set of values $\{X^l\}$. Problem: seek a formalism which allows this information to be combined with the prior probabilities we already possess. All the following probabilities are defined in the limit of an infinite number of images:
- prior probability $P(C_k)$, e.g. $P(C_1)$;
- joint probability $P(C_k, X^l)$, e.g. $P(C_1, X^5)$;
- class-conditional probability $P(X^l \mid C_k)$, e.g. $P(X^5 \mid C_1)$;
- unconditional probability $P(X^l)$, e.g. $P(X^5)$.
Bayes' theorem combines them:
$$ P(C_k \mid X^l) = \frac{P(X^l \mid C_k)\, P(C_k)}{P(X^l)} $$
posterior = (class-conditional $\times$ prior) / normalization factor. The theorem also holds when probabilities are interpreted as degrees of belief.
The posteriors can be computed directly by an ANN. The priors in use may differ from those of the TS (e.g. classifying normal tissue vs. tumour on medical X-ray images; the example uses a prior of 0.6).
inference + decision making = classification process

Bayes' theorem (continuous variables)
For $c$ classes and feature vector $\mathbf{x}$ (the observation), Bayes' rule becomes
$$ P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, P(C_k)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{k=1}^{c} p(\mathbf{x} \mid C_k)\, P(C_k). $$

Decision making
Minimum-misclassification rule: assign feature vector $\mathbf{x}$ to class $C_k$ if
$$ P(C_k \mid \mathbf{x}) > P(C_j \mid \mathbf{x}) \quad \forall j \neq k, $$
or equivalently
$$ p(\mathbf{x} \mid C_k)\, P(C_k) > p(\mathbf{x} \mid C_j)\, P(C_j) \quad \forall j \neq k. $$
This defines decision regions $R_1, \dots, R_c$ such that a point falling in $R_k$ is assigned to $C_k$. A misclassification corresponds to a joint event such as $\mathbf{x}$ being assigned to $C_1$ (falling in $R_1$) while the true class is $C_2$. The probability of a correct classification is
$$ P(\text{correct}) = \sum_{k=1}^{c} P(\mathbf{x} \in R_k, C_k) = \sum_{k=1}^{c} P(\mathbf{x} \in R_k \mid C_k)\, P(C_k) = \sum_{k=1}^{c} \int_{R_k} p(\mathbf{x} \mid C_k)\, P(C_k)\, d\mathbf{x}, $$
which is maximized by the rule above.

Discriminant (decision) functions $y_1(\mathbf{x}), \dots, y_c(\mathbf{x})$: assign $\mathbf{x}$ to $C_k$ if
$$ y_k(\mathbf{x}) > y_j(\mathbf{x}) \quad \forall j \neq k, $$
with decision boundaries where $y_k(\mathbf{x}) = y_j(\mathbf{x})$. The choice $y_k(\mathbf{x}) = p(\mathbf{x} \mid C_k)\, P(C_k)$ implements the minimum-misclassification rule. Any monotonic function $g$ yields equivalent discriminants $g(y_k(\mathbf{x}))$ with the same decision boundaries, for example
$$ y_k(\mathbf{x}) = \ln p(\mathbf{x} \mid C_k) + \ln P(C_k). $$
Two-class decision problems can use a single discriminant
$$ y(\mathbf{x}) = y_1(\mathbf{x}) - y_2(\mathbf{x}) = P(C_1 \mid \mathbf{x}) - P(C_2 \mid \mathbf{x}), \quad \text{or} \quad y(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \ln \frac{P(C_1)}{P(C_2)}: $$
- assign $\mathbf{x}$ to class $C_1$ if $y(\mathbf{x}) > 0$
- assign $\mathbf{x}$ to class $C_2$ if $y(\mathbf{x}) < 0$

Minimizing risk
Let $L_{kj}$ be the penalty associated with assigning a pattern to $C_j$ when in fact it belongs to $C_k$, collected in the loss matrix $L = (L_{kj})$. The expected loss for patterns in $C_k$ is
$$ R_k = \sum_{j=1}^{c} L_{kj} \int_{R_j} p(\mathbf{x} \mid C_k)\, d\mathbf{x}, $$
and the overall risk is
$$ R = \sum_{k=1}^{c} R_k\, P(C_k) = \sum_{j=1}^{c} \int_{R_j} \sum_{k=1}^{c} L_{kj}\, p(\mathbf{x} \mid C_k)\, P(C_k)\, d\mathbf{x}. $$
The risk is minimized by choosing the regions $R_j$ such that $\mathbf{x} \in R_j$ when
$$ \sum_{k=1}^{c} L_{kj}\, p(\mathbf{x} \mid C_k)\, P(C_k) < \sum_{k=1}^{c} L_{ki}\, p(\mathbf{x} \mid C_k)\, P(C_k) \quad \forall i \neq j. $$
homework
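A minimal sketch in Python of the minimum-risk rule just derived: assign $\mathbf{x}$ to the class $C_j$ whose expected loss $\sum_k L_{kj}\, p(\mathbf{x} \mid C_k)\, P(C_k)$ is smallest. The densities, priors, and loss values below are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                  # P(C_1), P(C_2)
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]   # p(x|C_1), p(x|C_2)
L = np.array([[0.0, 1.0],                      # L[k, j]: loss for deciding C_j
              [5.0, 0.0]])                     # when the true class is C_k

def decide(x):
    """Return the index of the class minimizing the expected loss at x."""
    joint = np.array([d.pdf(x) for d in densities]) * priors  # p(x|C_k) P(C_k)
    expected_loss = L.T @ joint        # one entry per candidate decision j
    return int(np.argmin(expected_loss))

print(decide(0.8))  # prints 1: the heavy penalty L[1, 0] outweighs
                    # class 0's larger joint probability at x = 0.8
```

With the symmetric 0/1 loss the same point would be assigned to class 0; the asymmetric penalty shifts the decision boundary away from the costly error.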
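Returning to the polynomial curve-fitting example above: a minimal sketch in Python of that experiment, assuming the slide's setup ($h(x) = 0.5 + 0.4 \sin 2\pi x$, 11 points, Gaussian noise with $\sigma = 0.05$); `np.polyfit` is used here as one way to solve the linear least-squares problem for $\mathbf{w}^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 11)                  # N = 11 training points
t = 0.5 + 0.4 * np.sin(2 * np.pi * x)          # h(x) from the slides
t += rng.normal(0.0, 0.05, x.shape)            # zero-mean noise, sigma = 0.05

for M in (1, 3, 10):                           # the three orders shown
    w = np.polyfit(x, t, M)                    # w* minimizing the sum-of-squares error E
    rmse = np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    print(f"M = {M:2d}  training RMSE = {rmse:.4f}")
# The M = 10 polynomial interpolates all 11 points (training RMSE ~ 0) yet
# oscillates between them: low training error, poor generalization.
# NumPy may warn that the M = 10 fit is poorly conditioned.
```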
The reject option
Introduce a threshold $\theta$ in the range $(0, 1)$:
- if $\max_k P(C_k \mid \mathbf{x}) \geq \theta$, then classify $\mathbf{x}$;
- otherwise, reject $\mathbf{x}$.
One way in which the reject option can be used is to design a relatively simple but fast classifier system to cover the bulk of the feature space, while leaving the remaining regions to a more sophisticated system which might be relatively slow.
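A minimal sketch of the reject option; the threshold value and the posterior vectors below are illustrative assumptions.

```python
import numpy as np

def classify_with_reject(posteriors, theta=0.9):
    """Return the winning class index if the largest posterior clears the
    threshold theta in (0, 1); otherwise return None to reject."""
    posteriors = np.asarray(posteriors)
    k = int(np.argmax(posteriors))
    return k if posteriors[k] >= theta else None

print(classify_with_reject([0.95, 0.05]))  # prints 0: confident, classify
print(classify_with_reject([0.55, 0.45]))  # prints None: ambiguous, hand off
                                           # to the slower, more accurate system
```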