Chemometrics: Classification of spectra
Vladimir Bochko, Jarmo Alander
University of Vaasa
November 1, 2010

Contents
• Terminology
• Introduction
• Big picture
• Support Vector Machine
  • Introduction
  • Linear SVM classifier
  • Nonlinear SVM classifier
• KNN classifier
• Cross-validation
• Performance evaluation

Terminology
The task of pattern recognition is to classify objects into a number of classes.
• Objects are called patterns or samples.
• Measurements of particular object parameters are called features, components or variables.
• The classifier computes the decision curve which divides the feature space into regions corresponding to the classes.
• A class is a group of objects characterized by similar features.
• The decision may not be correct; in this case a misclassification occurs.
• The patterns used to design the classifier are called training (or calibration) patterns.
• The patterns used to test the classifier are called test patterns.

Introduction
• If training data is available, we speak of supervised pattern recognition. If no training data is available, we speak of unsupervised pattern recognition, or clustering.
• We consider only supervised pattern recognition. In this case the training set consists of data X and class labels Y.
• When we test the classifier on the test data X, the classifier predicts the class labels Y.
• Thus, classification requires training (calibration) and testing. The SOLO GUI [3] has corresponding buttons: calibration and test/validation.

Abbreviations
• KNN - K-nearest neighbor classifier. The KNN classifier requires labels.
• SVM - Support Vector Machine. The SVM classifier requires labels.
• PCA - Principal Component Analysis. A mapping (compression) technique; labels are not needed.
• PLS - Partial Least Squares. A mapping (compression), regression and classification technique; PLS requires labels.
• DA - Discriminant Analysis, e.g. PLSDA, SVMDA. DA means that classification is used.
• MSC - Multiplicative Scatter Correction. A preprocessing technique.
• SNV - Standard Normal Variate transformation. A preprocessing technique.

Big picture
[Flow chart: measured spectra → preprocessing (MSC, SNV, smoothing, derivatives) → compression (PCA, PLS) → classifier design (KNN, SVM, PLSDA) on training X and training Y → model. For classification/prediction, the test X is fed to the model, which outputs the predicted Y; the whole system is then evaluated.]

Example
We have green, yellow, orange and red tomatoes. From the salesman's viewpoint, the orange and red tomatoes are suitable for sale. Therefore the tomatoes are divided into two classes: green/yellow and orange/red.

Measurement
• A typical measurement system is shown in the figure.
• Important! Write a MEMO during the measurement. The MEMO includes the name of the file, the physical or chemical parameters of the object, e.g. cheese fatness, and the class labels (see the sketch below).
[Figure: a light source and light probe illuminate the measured object; a sensor probe feeds the spectrometer, which is connected to a computer. The MEMO lists, for samples 1, 2, ..., N, the file name, the parameters and the class labels.]
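The following is a minimal sketch, not part of the lecture material, of how the measured spectra and the MEMO might be collected into a data matrix X and a label vector Y. It assumes Python with NumPy and pandas; the file names, the MEMO layout and the two-column file format are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical MEMO: one row per measured object (file name, parameter, class label).
memo = pd.DataFrame({
    "file_name": ["tomato_001.txt", "tomato_002.txt", "tomato_003.txt"],
    "parameter": [0.42, 0.55, 0.61],   # e.g. a measured physical/chemical value
    "class_label": [1, 1, 2],          # 1 = green/yellow, 2 = orange/red
})

spectra = []
for name in memo["file_name"]:
    # Each (hypothetical) file holds two columns: wavelength in nm and measured value.
    data = np.loadtxt(name)
    spectra.append(data[:, 1])         # keep only the spectral values

X = np.vstack(spectra)                 # samples in rows, wavelengths in columns
Y = memo["class_label"].to_numpy()     # class labels used for supervised training
```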
Spectral data
Spectra measured by a spectrometer are usually arranged as follows:
• The first row contains the wavelengths.
• The first column contains the sample numbers.
• The measured spectrum values in the table cells correspond to the wavelengths, given in nanometers, and to the spectrum numbers.
• Some spectrum values corrupted by noise are negative. The beginning and the end of the spectra contain mostly noise.

Spectral data
• The spectral values are obtained at intervals of about 0.27 nm in the range 195-1118 nm. The number of measurement points is 3648, which is unnecessarily high.
• The example shows how data may be arranged in the data file after measurement with the spectrometer. a) The data include wavelengths and sample numbers. b) The data are stored without wavelengths and sample numbers; in this case a vector of wavelengths should be kept in a separate file.
[Figure: a) a matrix whose first row holds the wavelengths (columns 1, 2, ..., 3648) and whose first column holds the sample numbers, the remaining entries being the spectrum values. b) the same information split into data X (spectrum values only) and labels Y.]

Preprocessing
• The spectra are usually limited to a useful range, e.g. 430-1000 nm. Smoothing the spectral signals and then downsampling them reduces the noise inside the useful interval. After smoothing, each spectrum is described by a smaller number of values, e.g. 50-500; in our example, 50 values.
[Figure: tomato spectrum and smoothed spectrum, transmittance (%) vs. wavelength (nm); the input data matrix with 3648 columns becomes a smoothed data matrix with 50 columns.]

Preprocessing
We return to the tomatoes for a while. We may see that tomatoes of different colors have different spectra, whereas tomatoes of the same color have similar spectra. This is good for classification.

Preprocessing, PCA
PCA generates new features, i.e. principal components, which are linear combinations of the input components or variables. One may see that the number of components is reduced from 50 to 2. During the exercises we will discuss how to select the number of principal components. Thus, PCA is a technique for data compression. a) SOLO GUI set up for compression. b) Illustration of PCA.
[Figure: a) the smoothed data matrix with 50 columns is compressed by PCA to 2 principal components (PCs). b) the data plotted in the space spanned by PC 1 and PC 2.]

PCA. Feature selection.
• How many features or principal components should we use for classification?
• One way is to analyze the plot of cumulative variances, or sum of variances, for the training set. The place where the curve changes sharply suggests the number of first principal components to be used. For example, in the figure generated by the SOLO toolbox, the first two PCs should be selected. However, very frequently such a point is not clearly seen. In addition, these components are not necessarily useful for classification.
[Figure (SOLO): cumulative variance captured (%) vs. principal component number.]

PCA. Feature selection.
• Another way, shown in the SOLO demo for PCA, is to analyze a set of plots: PC1 vs. PC2, PC1 vs. PC3, etc. One may then select only those features or components which are most efficient for discriminating the classes. This approach may also be inefficient in some applications.
• One may use probabilistic PCA and cross-validation to find the maximum log likelihood for the given training set. The largest log likelihood defines the number of principal components.
• In the exercises we will use the approach based on cumulative variances (see the sketch below).
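As a rough counterpart to the SOLO cumulative-variance plot, the criterion can be sketched as follows. This assumes Python with scikit-learn; the low-rank stand-in data and the 90 % threshold are illustrative assumptions, not values prescribed by the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for 60 smoothed spectra of 50 values: a low-rank structure plus noise,
# so that a few principal components capture most of the variance.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 50)) \
          + 0.05 * rng.normal(size=(60, 50))

pca = PCA().fit(X_train)
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100     # cumulative variance, %
n_components = int(np.searchsorted(cum_var, 90.0)) + 1       # first PC count reaching 90 %

scores = PCA(n_components=n_components).fit_transform(X_train)
print(n_components, scores.shape)      # the 50 variables are compressed to a few PCs
```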
PCA
• The 2-dimensional feature space spanned by the first two principal (or embedded) components is shown in the figure. One may see all the tomatoes mapped onto this space. The density of the tomato population is shown by gray levels. Remember that we use two classes: green/yellow and orange/red.

PCA and classification
• The figure shows the 2-dimensional feature space and the curve separating the two classes. This curve is determined during the classifier design stage.
• When a new test point arrives, it is mapped into one of the two regions related to the classes.

SVM. Linear discriminant functions and decision hyperplanes.
We consider N samples, two linearly separable classes ω1 and ω2, and linear discriminant functions. The decision hyperplane in the l-dimensional feature space is
$$ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 $$
where $\mathbf{w} = (w_1, w_2, \ldots, w_l)^T$ is a weight vector and $w_0$ is a threshold. The distance of a point from the hyperplane is
$$ z = \frac{|g(\mathbf{x})|}{\|\mathbf{w}\|} $$
The discriminant function g(x) takes positive values on one side of the plane and negative values on the other side.
[Figure: the hyperplane g(x) = 0 in the (x1, x2) plane, with w normal to it and z the distance of a point x from the plane; the + and - sides correspond to the sign of g(x).]

SVM. Linearly separable classes.
We scale w and w0 so that g(x) at the points x1 and x2 equals 1 for ω1 and -1 for ω2. The points x1 and x2 are those closest to the hyperplane. The margin is $2/\|\mathbf{w}\|$ (in the figure, b = w0). Minimizing the norm of the weight vector makes the margin maximal. The figure is taken from [2].

SVM. Linearly separable classes.
Our task is to compute the hyperplane, i.e. to compute w and w0. We introduce class labels yi, with yi = 1 for ω1 and yi = -1 for ω2. Thus the task is:
$$ \text{minimize } J(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1, \quad i = 1, 2, \ldots, N $$
where N is the number of samples. This is a quadratic optimization task, and the constraints are linear inequalities.

SVM
From a computational viewpoint it is better to represent this optimization problem in a dual form. The solution leads to the discriminant function
$$ g(\mathbf{x}) = \sum_{i=1}^{n} y_i \alpha_i \, k(\mathbf{x}, \mathbf{x}_i) + w_0 $$
where the $\alpha_i$, i = 1, 2, ..., n, are the coefficients (Lagrange multipliers) of the support vectors $\mathbf{x}_i$, and k(., .) is a kernel function, or kernel. In our case, i.e. the linear case, $k(\mathbf{x}, \mathbf{x}_i) = \langle \mathbf{x}, \mathbf{x}_i \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the dot product.

SVM. Linearly nonseparable classes
• For linearly nonseparable classes we have three groups of samples. In this case new variables $\xi_i$, called slack variables, are introduced:
• samples which are outside the margin and correctly classified, $\xi_i = 0$;
• samples which are inside the margin and correctly classified, $0 < \xi_i \le 1$;
• samples which are misclassified, $\xi_i > 1$.

SVM. Linearly nonseparable classes
• The optimization task is then as follows:
$$ \text{minimize } J(\mathbf{w}, w_0, \boldsymbol{\xi}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to } y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, 2, \ldots, N $$
where C is a positive constant called the penalization term. It is used in the SOLO toolbox.
• This problem is again reformulated and solved in a dual representation form (a code sketch of the resulting linear classifier follows below).
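Below is a minimal sketch of the linear C-support vector classifier described above, assuming Python with scikit-learn rather than the SOLO toolbox used in the course; the toy 2-D data (for instance, two principal components) are an assumption for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in a 2-D feature space.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],          # class ω1, label +1
              [-1.0, -1.0], [-1.5, -2.0], [-2.0, -1.5]])   # class ω2, label -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]     # weight vector w and threshold w0
print("g(x) = %.2f*x1 + %.2f*x2 + %.2f" % (w[0], w[1], w0))
print("support vectors:", clf.support_vectors_)
print("margin width 2/||w||:", 2.0 / np.linalg.norm(w))
print("predicted class of (0.5, 0.8):", clf.predict([[0.5, 0.8]]))
```

A new point is classified by the sign of g(x), exactly as in the separable formulation above.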
SVM. Nonlinear case
• To obtain the solution for the nonlinear case we take the solution for the linear case and replace the linear kernel function by a nonlinear kernel function. For example, in the discriminant function obtained earlier,
$$ g(\mathbf{x}) = \sum_{i=1}^{n} y_i \alpha_i \, k(\mathbf{x}, \mathbf{x}_i) + w_0, $$
we use a nonlinear kernel $k(\mathbf{x}, \mathbf{x}_i)$.

SVM. Nonlinear case
• The most widely used kernel functions are
  • linear: $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \mathbf{x}_i^T \mathbf{x}_j$;
  • polynomial: $k(\mathbf{x}_i, \mathbf{x}_j) = (a\,\mathbf{x}_i^T \mathbf{x}_j + r)^d$, $a > 0$;
  • radial basis function (RBF), or exponential function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, $\gamma > 0$.
• The radial basis function (RBF) is used in the SOLO toolbox.

SVM. Nonlinear case.
The figure shows the nonlinear discriminant function and nonseparable classes. You may see three groups of samples: outside the margin and correctly classified, inside the margin and correctly classified, and misclassified (shown with a cross). The figure is taken from [2].
We have considered C-support vector classifiers. There is also ν-support vector classification, which uses a parameter ν ∈ (0, 1]. This parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. C-support and ν-support vector classifiers give equivalent results.

SVM. Conclusions
• In many cases the SVM classifier demonstrates high performance in comparison with other classifiers. SVM classifiers are used in many applications, including digit recognition, face recognition, medical imaging and others.
• The disadvantage of SVM is that there is no technique for selecting the best kernel. Once the kernel is chosen, an optimization still has to be solved to find the values of its parameters, for example the RBF parameter γ and the penalization term C. This optimization is done in the SOLO toolbox (Build Model and Optimization Results).

K Nearest Neighbor Classifier
• The KNN algorithm is simple and performs very well.
• The KNN algorithm is frequently used as a benchmark method.
• The KNN classifier is a nonlinear classifier.
• The nearest neighbor (k = 1) algorithm has an asymptotic classification error rate no worse than twice the Bayesian optimal error rate.

K Nearest Neighbor Classifier
The KNN algorithm is as follows:
• Take a test sample and a distance measure.
• Out of the N training vectors, find the k nearest neighbors.
• Out of the k nearest neighbors, find the number of vectors ki belonging to class ωi, where i = 1, 2, ..., M.
• The test sample is assigned to the class ωi which has the maximum number ki of samples.
[Figure: with k = 3, the test sample x has k1 = 2 neighbors from ω1 and k2 = 1 from ω2, so it is assigned to ω1.]

K Nearest Neighbor Classifier
Distance measures:
• Euclidean distance. The test vector is assigned to the class ωi whose mean point μi gives the minimum Euclidean distance among all classes:
$$ d_e = \|\mathbf{x} - \boldsymbol{\mu}_i\| $$
• Mahalanobis distance. It is assumed that all classes have the same covariance matrix Σ. Again we compute the distances and select the minimum for the test sample:
$$ d_m = \left( (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)^{1/2} $$

K Nearest Neighbor Classifier
Remarks
• A drawback of the algorithm is that all training vectors must be stored.
• Another drawback is the complexity of searching for the nearest neighbors among the N training samples (a KNN code sketch follows below).
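A minimal sketch of the k = 3 rule described above, assuming Python with scikit-learn; the toy training vectors are hypothetical. The Euclidean distance is used; the commented line shows how a Mahalanobis distance with a shared covariance matrix could be supplied instead.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.2, 0.1], [0.4, 0.3], [0.3, 0.5],    # class ω1
                    [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]])   # class ω2
y_train = np.array([1, 1, 1, 2, 2, 2])

# k = 3 nearest neighbors, Euclidean distance; the majority class wins.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)
print("predicted class:", knn.predict([[0.5, 0.4]]))

# Mahalanobis distance with a shared (inverse) covariance matrix VI:
# KNeighborsClassifier(n_neighbors=3, metric="mahalanobis",
#                      metric_params={"VI": np.linalg.inv(np.cov(X_train.T))})
```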
Cross-validation
• Using the training set to evaluate the performance of the classification system may give a poor estimate of the predictive performance.
• To evaluate the performance of the classification system under different parameter settings, one may use three data sets: training, validation and test. After training, the performance is evaluated on the validation set, and the final evaluation is made on the test set. However, when the number of samples in the training and test sets is small, this approach cannot be used.

Cross-validation
• Cross-validation. In this case the available data are divided into several groups, where only one group is used for test/validation and the remaining groups are used for training. This is repeated so that all of the data are used, and the results are then averaged.
[Figure: five iterations of 5-fold cross-validation, each holding out a different group.]
• When the data set is very small, the leave-one-out method is used, so that the test/validation set consists of only one sample.

Confusion matrix
• The performance of a classification system can be assessed using the confusion matrix B(i, j). The entry (i, j) shows the number of samples belonging to class i which are assigned to class j.
• In the following example, 48 samples of class ω1 are correctly classified and 2 samples of class ω1 are misclassified. For class ω2, 45 samples are correctly classified and 5 samples are misclassified. For class ω3, all samples are correctly classified (see the sketch below).

        ω1   ω2   ω3
  ω1    48    2    0
  ω2     3   45    2
  ω3     0    0   50
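A minimal sketch of 5-fold cross-validation combined with a confusion matrix, assuming Python with scikit-learn; the three well-separated random clusters stand in for real spectral features, and KNN is used only as an example classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(4, 1, (50, 2)),
               rng.normal(8, 1, (50, 2))])     # three classes, 50 samples each
y = np.repeat([1, 2, 3], 50)

# Each sample is predicted by a model trained on the other four folds;
# with very small data sets, LeaveOneOut() could be used as the cv argument instead.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)

# Entry (i, j): number of samples of class i assigned to class j.
print(confusion_matrix(y, y_pred))
```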
References
[1] Bishop C. M., Pattern Recognition and Machine Learning. Springer, 2006.
[2] Schölkopf B. and Smola A. J., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[3] SOLO toolbox: http://wiki.eigenvector.com/index.php?title=Main_Page
[4] Theodoridis S. and Koutroumbas K., Pattern Recognition. Academic Press, 2006.

Questions