Chemometrics: Classification of spectra

Vladimir Bochko
Jarmo Alander
University of Vaasa
November 1, 2010
Contents
• Terminology
• Introduction
• Big picture
• Support Vector Machine
  • Introduction
  • Linear SVM classifier
  • Nonlinear SVM classifier
• KNN classifier
• Cross-validation
• Performance evaluation
Terminology
The task of pattern recognition is to classify objects into a number of classes.
• Objects are called patterns or samples.
• Measurements of particular object parameters are called features or
components or variables.
• The classifier computes the decision curve which divides the feature
space into regions corresponding to classes.
• A class is a group of objects characterized by similar features.
• The decision may not be correct. In this case a misclassification
occurs.
• The patterns used to design the classifier are called training (or
calibration) patterns.
• The patterns used to test the classifier are called test patterns.
Introduction
• If training data is available, we speak of supervised pattern recognition.
• If training data is not available, we speak of unsupervised pattern recognition, or clustering.
• We consider only supervised pattern recognition. In this case the training set consists of data X and class labels Y.
• When we test the classifier using the test data X, the classifier predicts class labels Y.
• Thus, classification requires training (or calibration) and testing. The SOLO GUI [3] has corresponding buttons: calibration and test/validation.
Abbreviations
• KNN - K-Nearest Neighbor classifier. The KNN classifier requires labels.
• PCA - Principal Component Analysis. A mapping (compression) technique. Labels are not needed.
• PLS - Partial Least Squares. A mapping (compression), regression and classification technique. PLS requires labels.
• SVM - Support Vector Machines. The SVM classifier requires labels.
• DA - Discriminant Analysis, e.g. PLSDA, SVMDA. DA means that classification is used.
• MSC - Multiplicative Scatter Correction. A preprocessing technique.
• SNV - Standard Normal Variate transformation. A preprocessing technique.
Big picture
[Flow diagram: measured spectra → preprocessing (MSC, SNV, smoothing, derivatives) → compression (PCA, PLS) → classifier design (KNN, SVM, PLS, e.g. SVMDA) using training X and training Y → model. Training/validation is followed by system evaluation; in classification/prediction, the model maps tested X to predicted Y.]
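A minimal scikit-learn sketch of this pipeline, not the SOLO toolbox itself; the array names, the synthetic data and the chosen steps (scaling, 2-component PCA, RBF SVM) are illustrative assumptions only:

```python
# Sketch of the big-picture flow: preprocessing -> compression -> classifier.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 50))    # 60 spectra, 50 smoothed values each
y_train = rng.integers(0, 2, size=60)  # two classes, e.g. green/yellow vs orange/red
X_test = rng.normal(size=(10, 50))     # tested X

model = Pipeline([
    ("scale", StandardScaler()),   # simple preprocessing stand-in
    ("pca", PCA(n_components=2)),  # compression
    ("svm", SVC(kernel="rbf")),    # classifier design
])
model.fit(X_train, y_train)        # training/calibration
y_pred = model.predict(X_test)     # classification/prediction: predicted Y
```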
Example
We have green, yellow, orange and red tomatoes. From the salesman's viewpoint, the orange and red tomatoes are suitable for sale. Therefore the tomatoes are divided into two classes: green/yellow and orange/red.
Measurement
• The typical measurement system is shown in the figure.
• Important! Write a MEMO during measurement. The MEMO includes the name of the file, the physical or chemical parameters of the object, e.g. cheese fatness, and the class labels.
[Figure: measurement setup. A light source and light probe illuminate the measured object; a sensor probe feeds the spectrometer, which is connected to a computer. The MEMO records file name, parameters and class labels for samples 1 to N.]
Spectral data
Spectra measured by a spectrometer are usually arranged as follows:
• The first row contains the wavelengths.
• The first column contains the sample numbers.
• The measured spectrum values given in table cells correspond to
wavelengths given in nanometers and spectrum numbers.
• Some spectrum values corrupted by noise are negative. The beginning
and the end of the spectra contain mostly noise.
Spectral data
• The spectral values are obtained at intervals of about 0.27 nm in the range 195-1118 nm. The number of measurement points is 3648, which is unnecessarily high.
• The example shows how data may be arranged in the data file after measurements with a spectrometer. a) Data includes wavelengths and sample numbers. b) Data without wavelengths and sample numbers; in this case a vector of wavelengths should be kept in a separate file.
[Figure: a) matrix whose first row holds the wavelengths (columns 1, 2, ..., 3648) and whose first column holds the sample numbers, the remaining entries being spectrum values; b) the same spectra stored as data X with labels Y, the wavelengths kept in a separate file.]
Preprocessing
• The spectra are usually limited to a useful range, e.g. 430-1000 nm. Smoothing the spectral signals and then downsampling reduce the noise inside the useful interval. After smoothing and downsampling, each spectrum is described by a smaller number of values, e.g. 50-500; in our example, 50 values (a sketch is given below the figure).
[Figure: tomato spectrum before and after smoothing; transmittance (%) versus wavelength (nm) over roughly 500-1000 nm. The input data with 3648 values per spectrum is reduced to smoothed data with 50 values per spectrum.]
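A small sketch of this step with SciPy, assuming a single measured spectrum of 3648 values; the variable names, the Savitzky-Golay settings and the synthetic data are illustrative and do not reproduce the SOLO preprocessing:

```python
# Limit to the useful range, smooth, then downsample to about 50 values.
import numpy as np
from scipy.signal import savgol_filter

wavelengths = np.linspace(195.0, 1118.0, 3648)        # ~0.27 nm spacing
spectrum = np.random.default_rng(1).normal(30, 5, 3648)  # stand-in for measured data

mask = (wavelengths >= 430) & (wavelengths <= 1000)   # useful range 430-1000 nm
wl, sp = wavelengths[mask], spectrum[mask]

smooth = savgol_filter(sp, window_length=51, polyorder=3)  # smoothing
idx = np.linspace(0, len(smooth) - 1, 50).astype(int)      # downsample to 50 points
wl50, sp50 = wl[idx], smooth[idx]
print(sp50.shape)   # (50,)
```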
Preprocessing
We return to the tomatoes for a while. We may see that tomatoes of different colors have different spectra, while tomatoes of the same color have similar spectra. This is good for classification.
Preprocessing, PCA
PCA generates new features, i.e. principal components, which are linear combinations of the input components or variables. One may see that the number of components is reduced from 50 to 2. During the exercises we will discuss how to select the number of principal components. Thus, PCA is a technique for data compression. a) SOLO GUI set up for compression. b) Illustration of PCA.
[Figure: a) the SOLO GUI set up for compression; b) the smoothed data (50 variables per spectrum) projected by PCA onto the first two principal components, PC 1 and PC 2.]
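A minimal PCA compression sketch with scikit-learn; the data matrix X below is an assumed placeholder for 100 smoothed spectra of 50 values each:

```python
# Compress 50-value smoothed spectra to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(2).normal(size=(100, 50))  # 100 smoothed spectra

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # rows: samples, columns: PC1, PC2
print(scores.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each PC
```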
PCA. Feature selection.
• How many features or principal components should we use for
classification?
• One way is to analyze the plot of cumulative variances (or the sum of variances) for the training set. The place where the curve sharply changes suggests the number of first principal components to be used. For example, in the figure generated by the SOLO toolbox, the first two PCs should be selected. However, very frequently such a point is not clearly seen. In addition, these components are not necessarily useful for classification.
[Figure: eigenvalue plot from the SOLO toolbox; cumulative variance captured (%) versus principal component number (1-11), with the curve bending sharply after the first two PCs.]
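A rough analogue of reading the cumulative-variance plot programmatically; the training matrix and the 95% threshold are example assumptions, not a rule from the slides:

```python
# Pick the number of PCs from the cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(3).normal(size=(100, 50))  # assumed training spectra

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_) * 100
n_pc = int(np.argmax(cumvar >= 95)) + 1   # first PC count reaching 95% variance
print(cumvar[:5])
print("selected number of PCs:", n_pc)
```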
PCA. Feature selection.
• Another way, shown in the SOLO demo for PCA, is to analyze a set of plots: PC1 vs. PC2, PC1 vs. PC3, etc. Then one may select only those features or components which are most efficient for discriminating the classes. This way may also be inefficient in some applications.
• One may use probabilistic PCA and cross-validation to find the maximum log likelihood for the given training set. The largest log likelihood defines the number of principal components.
• In the exercises we will use the approach based on cumulative variances.
PCA
• The 2-dimensional feature space is spanned by the first two principal (or embedded) components (Figure). One may see all the tomatoes mapped onto this space. The density of the tomato population is shown by gray levels. Remember that we use two classes: green/yellow and orange/red.
PCA and classification
• The 2-dimensional feature space and the curve separating the two classes (Figure). This curve is determined during the classifier design stage.
• When a new test point arrives, it is mapped into one of the two regions related to the classes.
SVM. Linear discriminant functions
and decision hyperplanes.
We consider N samples, two linearly separable classes ω1 and ω2, and linear discriminant functions. The decision hyperplane in the l-dimensional feature space is

  g(x) = w^T x + w_0

where w = (w_1, w_2, ..., w_l)^T is a weight vector and w_0 is a threshold. The distance of a point from the hyperplane is

  z = |g(x)| / ||w||
[Figure: the hyperplane g(x) = 0 in the (x1, x2) plane with normal vector w; a point x lies at distance z from it, with a positive (+) side and a negative (−) side.]
The discriminant function g(x) takes positive values on one side of the plane and
negative values on the other side.
SVM. Linearly separable classes.
We scale w and w_0 so that g(x) at the points x_1 and x_2 is equal to 1 for ω1 and −1 for ω2, respectively. The points x_1 and x_2 are the ones closest to the hyperplane. The margin is 2/||w||. In the figure, b = w_0. Minimizing the norm of the weight vector makes the margin maximum. The figure is taken from [2].
SVM. Linearly separable classes.
Our task is to compute the hyperplane. This means that we have to compute w and w_0. We introduce class labels y_i, with y_i = 1 for ω1 and y_i = −1 for ω2. Thus, the task is as follows:

  minimize    J(w) = (1/2) ||w||^2
  subject to  y_i (w^T x_i + w_0) ≥ 1,   i = 1, 2, ..., N

where N is the number of samples. This is a quadratic optimization task. The constraints are linear inequalities.
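A minimal linear SVM example with scikit-learn, which solves (the dual of) this quadratic optimization task; the two Gaussian clusters below are invented purely for illustration:

```python
# Train a linear SVM and read off the separating hyperplane w, w0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)),    # class -1
               rng.normal(+2, 1, size=(30, 2))])   # class +1
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)     # w and w0 of the separating hyperplane
print(clf.predict([[1.5, 2.0]]))     # class label of a new sample
```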
SVM
From a computational viewpoint, it is better to represent this optimization problem in a dual form. The solution leads to the discriminant function

  g(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + w_0

where the α_i, i = 1, 2, ..., n, are the coefficients (Lagrange multipliers) associated with the n support vectors x_i, and k(., .) is a kernel function, or kernel. In our case, i.e. the linear case, k(x, x_i) = <x, x_i>, where <., .> denotes the dot product.
SVM. Linearly nonseparable classes
• For linearly nonseparable classes we have three groups of samples. In this case new variables ξ_i, called slack variables, are introduced.
• Samples which are outside the margin and correctly
classified, ξi = 0.
• Samples which are inside the margin and correctly
classified, 0 < ξi ≤ 1.
• Samples which are misclassified, ξi > 1.
SVM. Linearly nonseparable classes
• Then the optimization task is as follows:

    minimize    J(w, w_0, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{N} ξ_i
    subject to  y_i (w^T x_i + w_0) ≥ 1 − ξ_i,   i = 1, 2, ..., N
                ξ_i ≥ 0

  where C is a positive constant called a penalization term. It is used in the SOLO toolbox.
• This problem is again reformulated and solved in a dual representation form.
SVM. Nonlinear case
• To obtain a solution for the nonlinear case, we take the solution for the linear case and replace the linear kernel function by a nonlinear kernel function. For example, in the discriminant function obtained earlier,

    g(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + w_0

  we use a nonlinear kernel k(x, x_i).
SVM. Nonlinear case
• The most widely used kernel functions are
  • Linear: k(x_i, x_j) = <x_i, x_j> = x_i^T x_j
  • Polynomial: k(x_i, x_j) = (a x_i^T x_j + r)^d, a > 0
  • Radial basis function (RBF), or exponential function: k(x_i, x_j) = exp(−γ ||x_i − x_j||^2), γ > 0
• The radial basis function (RBF) is used in the SOLO toolbox.
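A small RBF-kernel example with scikit-learn; the circular two-class data and the values of γ and C are invented for illustration and are not tuned:

```python
# Nonlinear SVM with the RBF kernel (the kernel used in the SOLO toolbox).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular class boundary

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.n_support_)                          # support vectors per class
print(clf.predict([[0.2, 0.1], [2.0, 2.0]]))   # inside vs outside the circle
```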
SVM. Nonlinear case.
The nonlinear discriminant function and nonseparable classes. You may see three groups of samples: outside the margin and correctly classified, inside the margin and correctly classified, and misclassified (shown with a cross). The figure is taken from [2].
We considered C-support vector classifiers. There is also ν-support vector classification, which uses a parameter ν ∈ (0, 1]. This parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. C-support and ν-support vector classifiers give equivalent results.
SVM. Conclusions
• In many cases the SVM classifier demonstrates high performance in comparison with other classifiers. SVM classifiers are used in many applications, including digit recognition, face recognition, medical imaging and others.
• The disadvantage of SVM is that there is no technique to select the best kernel. When the kernel is chosen, one still needs to solve an optimization problem to find the values of its parameters, for example the RBF parameter γ and the penalization term C. This optimization is done in the SOLO toolbox: Build Model and Optimization Results. A rough scikit-learn analogue is sketched below.
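This sketch of the parameter optimization uses a cross-validated grid search over γ and C; the data, the parameter grid and the fold count are placeholders, and SOLO's own Build Model/Optimization Results dialogs are not reproduced:

```python
# Cross-validated grid search over the RBF parameters gamma and C.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
y = rng.integers(0, 2, size=80)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.01, 0.1, 1.0], "C": [1, 10, 100]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # best (gamma, C) and its CV accuracy
```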
K Nearest Neighbor Classifier
• The KNN algorithm is simple and performs very well.
• The KNN algorithm is frequently used as a benchmark
method.
• The KNN classifier is a nonlinear classifier.
• The nearest neighbor (k = 1) algorithm has asymptotic
classification error rate no worse than twice the Bayesian
optimal error rate.
K Nearest Neighbor Classifier
The KNN algorithm is as follows:
• Given a test sample and a distance measure.
• Out of the N training vectors, find the k nearest neighbors.
• Out of the k nearest neighbors, count the number of vectors k_i belonging to class ω_i, where i = 1, 2, ..., M.
• The test sample is assigned to the class ω_i which has the maximum number k_i of samples.
[Figure: a test sample x with k = 3 nearest neighbors, k_1 = 2 from class ω1 and k_2 = 1 from class ω2, so x is assigned to ω1.]
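A minimal KNN example mirroring the algorithm above (k = 3, majority vote among the nearest training vectors); the two synthetic clusters and the test point are assumptions for illustration:

```python
# Classify a test sample with a 3-nearest-neighbor vote.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal(0, 1, size=(20, 2)),    # class 1
                     rng.normal(4, 1, size=(20, 2))])   # class 2
y_train = np.array([1] * 20 + [2] * 20)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
x_test = np.array([[1.0, 1.5]])
print(knn.predict(x_test))       # class with the maximum k_i
print(knn.kneighbors(x_test))    # distances and indices of the 3 nearest neighbors
```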
K Nearest Neighbor Classifier
The distance measures
• Euclidean distance. The test vector is assigned to the class ω_i if its Euclidean distance from the class mean point μ_i is the minimum among all classes:

    d_e = ||x − μ_i||

• Mahalanobis distance. It is assumed that all classes have the same covariance matrix Σ. Again we compute the distances and select the minimum for the test sample:

    d_m = ((x − μ_i)^T Σ^{-1} (x − μ_i))^{1/2}
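The two distances can be computed with NumPy/SciPy as below; the test vector x, class mean mu_i and shared covariance Sigma are assumed example values:

```python
# Euclidean and Mahalanobis distances for one test vector and one class mean.
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

x = np.array([1.0, 2.0])
mu_i = np.array([0.5, 1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])        # common covariance of all classes

d_e = euclidean(x, mu_i)                           # ||x - mu_i||
d_m = mahalanobis(x, mu_i, np.linalg.inv(Sigma))   # ((x-mu)^T Sigma^-1 (x-mu))^(1/2)
print(d_e, d_m)
```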
K Nearest Neighbor Classifier
Remarks
• A drawback of the algorithm is that all training vectors must be stored.
• Another drawback is the complexity of searching for the nearest neighbors among the N training samples.
Cross-validation
• Using the training set to evaluate the performance of a classification system may give a poor estimate of its predictive performance.
• To evaluate the performance of a classification system under different parameter settings, one may use three data sets: training, validation and test. After training, the performance may be evaluated using the validation set. The final evaluation is made using the test set. However, when the number of samples in the training and test sets is small, this approach cannot be used.
Cross-validation
• Cross-validation. In this case the available data is divided into several groups, where only one group is used for test/validation and the remaining groups are used for training. This is repeated so that all of the data are used. Then the results are averaged.
[Figure: 5-fold cross-validation; in each of iterations 1-5 a different group is held out for test/validation.]
• When the data set is very small, the leave-one-out method is used, so that the test/validation set contains only one sample.
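A short scikit-learn sketch of both schemes; the classifier, the data and the fold count are placeholder assumptions:

```python
# 5-fold and leave-one-out cross-validation, with averaged accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)
clf = KNeighborsClassifier(n_neighbors=3)

scores_5fold = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores_5fold.mean())            # result averaged over the 5 iterations

scores_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores_loo.mean())              # leave-one-out: one sample per validation set
```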
Confusion matrix
• The performance of a classification system can be assessed using the confusion matrix B(i, j). The entry (i, j) shows the number of samples belonging to class i which are assigned to class j.
• In the next example, 48 samples of class ω1 are correctly classified and 2 samples of class ω1 are misclassified. For class ω2, 45 samples are correctly classified and 5 samples are misclassified. For class ω3, all samples are correctly classified.

    B = | 48   2   0 |
        |  3  45   2 |
        |  0   0  50 |
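A confusion matrix can be built with scikit-learn as below; y_true and y_pred are small made-up label vectors, not the data from the slide:

```python
# Entry (i, j) counts samples of class i assigned to class j.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 3]
B = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(B)
# [[2 1 0]
#  [0 2 0]
#  [0 0 3]]
```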
References
[1] Bishop C. M., Pattern Recognition and Machine Learning. Springer, 2006.
[2] Schölkopf B. and Smola A. J., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[3] SOLO toolbox: http://wiki.eigenvector.com/index.php?title=Main Page
[4] Theodoridis S., Koutroumbas K., Pattern Recognition. Academic Press, 2006.
Questions