PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION

Richard Brereton
[email protected]
CLUSTER ANALYSIS - UNSUPERVISED PATTERN RECOGNITION
• Grouping of objects according to similarity
• No predefined classes
TAXONOMY
CHEMICAL TAXONOMY
Grouping organisms according to the similarity of their chemical fingerprints
• DNA base pairs, proteins
• NMR and pyrolysis of extracts
• NIR spectra
SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY
• Chemical archaeology
• Environmental samples
• Food
STEPS IN CLUSTER ANALYSIS
Similarity measures
Calculate the similarity between every pair of objects. Examples:
• Correlation coefficient: the higher, the more similar
• Euclidean distance: the smaller, the more similar, $d_{ij} = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$
• Manhattan distance: the smaller, the more similar, $d_{ij} = \sum_k |x_{ik} - x_{jk}|$
Use correlations for illustration.
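To make the three measures concrete, a minimal sketch assuming the NumPy/SciPy stack; the two measurement vectors are hypothetical, e.g. intensities at five wavelengths.

    # Three pairwise similarity measures for two objects x and y.
    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical object 1
    y = np.array([1.1, 2.1, 2.9, 4.2, 4.8])   # hypothetical object 2

    r = np.corrcoef(x, y)[0, 1]  # correlation: higher means more similar
    d_e = euclidean(x, y)        # Euclidean distance: smaller means more similar
    d_m = cityblock(x, y)        # Manhattan (city-block) distance: smaller means more similar
    print(f"correlation={r:.3f}  euclidean={d_e:.3f}  manhattan={d_m:.3f}")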
Group samples.
1. Find the most similar pair: the highest correlation. Here, objects 2 and 5.
2. Combine them into a new object 2&5.
3. Work out the correlation of the new object 2&5 with the other objects (1, 3, 4, 6), then repeat from step 1 until everything is joined.
Linkage methods – determination of the new similarity measures of groups.
Several methods:
• Nearest neighbour uses the highest correlation
• Furthest neighbour uses the lowest correlation
• Average linkage uses an average
Illustrate with nearest neighbour (sketched below).
Dendrograms
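A minimal sketch of the whole procedure, assuming the SciPy stack and a hypothetical data matrix of 6 objects: correlations are turned into distances (1 - r), grouped by nearest-neighbour ("single") linkage, and drawn as a dendrogram.

    # Agglomerative clustering on 1 - correlation with single linkage.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))          # 6 objects x 4 measurements (hypothetical)

    R = np.corrcoef(X)                   # 6 x 6 object-to-object correlations
    D = 1.0 - R                          # higher correlation -> smaller distance
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="single")  # nearest neighbour

    dendrogram(Z, labels=[f"obj {i + 1}" for i in range(6)])
    plt.ylabel("1 - correlation")
    plt.show()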
CLUSTER ANALYSIS : SUMMARY
• Similarity measures
• Linkage methods
• Dendrogram
CLASSIFICATION
Many methods.
CONVENTIONAL
LDA (Linear discriminant analysis)
Classical statistics: based on projections
Examples
• Orange juices: can we classify them by origin, and can we detect adulteration, from NIR spectra?
• Class modelling of mussels: can we find which come from a polluted site from GC data?
A detailed mathematical model follows.
PRINCIPLES : BIVARIATE EXAMPLE
[Figure: two classes of points with their centres marked; discriminant lines 1 and 2 separate class A from class B]
Often an exact cut-off is impossible.
[Figure: overlapping classes where neither line 1 nor line 2 gives an exact cut-off between class A and class B]
Class distance plots
[Figure: class distance plot showing each sample's distance to the centre of class A and the centre of class B]
Multivariate data: several measurements per class
Example – the Fisher iris data – four measurements per iris:
petal width, petal length, sepal width, sepal length
150 irises, divided into 50 of each species:
I. setosa
I. versicolor
I. virginica
SPECIAL DISTANCES USED
Linear discriminant function between classes A and B:
$w_{AB}(x) = (\bar{x}_A - \bar{x}_B)^T\, S_{AB}^{-1}\, x$
• The first term is simply the difference between the centres of each class, so a more positive value indicates class A.
• The middle term is the inverse of the "pooled" variance-covariance matrix $S_{AB}$.
What does this mean? Sometimes measurements are correlated, and sometimes classes are more dispersed than others; the inverse pooled covariance puts the distances on a common scale.
• The final term is the measurement vector for each object.
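A minimal sketch of the three terms on the Fisher iris data, assuming scikit-learn; the name w_AB follows the slides, everything else is illustrative.

    # Two-class linear discriminant score for I. versicolor (A) vs I. virginica (B).
    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    A = iris.data[iris.target == 1]      # I. versicolor
    B = iris.data[iris.target == 2]      # I. virginica

    centre_A, centre_B = A.mean(axis=0), B.mean(axis=0)
    # pooled variance-covariance matrix of the two classes
    S = ((len(A) - 1) * np.cov(A, rowvar=False) +
         (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)

    def w_AB(x):
        # difference of centres x inverse pooled covariance x measurement vector
        return (centre_A - centre_B) @ np.linalg.inv(S) @ x

    scores = np.array([w_AB(x) for x in np.vstack([A, B])])
    print(scores.min(), scores.max())    # raw scores before any shift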
[Figure: discriminant score against sample number for I. versicolor and I. virginica; scores run from about 0 down to -35]
Can shift the scale so that
• a positive score probably indicates class A,
• a negative score probably indicates class B.
Note some ambiguities remain in the shifted score w_AB.
[Figure: discriminant score against sample number, adjusted for the group means; scores now run from about +15 to -20, so the sign indicates the class]
Extending to more than 2 classes
Three classes – 2 out of 3 possible discriminant parameters.
If we have 3 classes and choose to use w_AB and w_AC as the functions, it is easy to see that
• an object belongs to class A if w_AB and w_AC are both positive,
• an object belongs to class B if w_AB is negative and w_AC is greater than w_AB, and
• an object belongs to class C if w_AC is negative and w_AB is greater than w_AC.
These rules are sketched in code after the figure.
[Figure: the plane of w_AB against w_AC divided into three regions, one for each of class A, class B and class C]
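The three rules translate directly into a small decision helper; this is a hypothetical sketch, not code from the course.

    # Assign a class from the two pairwise discriminant scores w_AB and w_AC.
    def assign_class(w_ab: float, w_ac: float) -> str:
        if w_ab > 0 and w_ac > 0:
            return "A"                   # A beats both B and C
        if w_ab < 0 and w_ac > w_ab:
            return "B"                   # B beats A, and w_BC = w_AC - w_AB > 0
        return "C"                       # w_ac < 0 and w_ab > w_ac

    print(assign_class(2.0, 1.5))        # A
    print(assign_class(-1.0, 0.5))       # B
    print(assign_class(0.5, -2.0))       # C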
Mahalanobis distance
A similar idea to the Euclidean distance, i.e. the distance to the centre of a class, but using the variance-covariance matrix for scaling:
$d_M(x) = \sqrt{(x - \bar{x}_A)^T\, S_A^{-1}\, (x - \bar{x}_A)}$
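A minimal sketch in NumPy, assuming X_A holds the training objects of one class as rows:

    # Mahalanobis distance of a sample to the centre of a class.
    import numpy as np

    def mahalanobis_to_class(x, X_class):
        centre = X_class.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(X_class, rowvar=False))
        diff = x - centre
        return np.sqrt(diff @ S_inv @ diff)

    rng = np.random.default_rng(1)
    X_A = rng.normal(size=(50, 4))       # hypothetical class-A training data
    print(mahalanobis_to_class(X_A[0], X_A))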
[Figure: Mahalanobis distance to class B plotted against distance to class A for each sample]
[Figure: the same class distance plot annotated: a tight cluster of class A samples, a cluster of class B samples, one ambiguous sample between the two, and an outlier far from both centres that may belong to another class]
[Figure: class distance plot for I. versicolor against I. virginica; distances run from 0 to 10 on both axes]
[Figure: class distance plot for all three species (I. versicolor, I. virginica, I. setosa); distances run from 0 to about 18]
Many classical statistical methods were developed first in biology.
Problem for chemists: the Mahalanobis distance requires more samples than variables, otherwise the pooled variance-covariance matrix cannot be inverted.
Spectroscopy and chromatography often give a huge number of measurements per sample.
Solutions (the second is sketched below)
• Variable selection
• PCA prior to performing classification
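A minimal sketch of the second solution, assuming scikit-learn: compress the measurements to a few PC scores, then run LDA on the scores so the covariance matrix stays invertible.

    # PCA for dimension reduction followed by LDA for classification.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)
    model = make_pipeline(PCA(n_components=2), LinearDiscriminantAnalysis())
    model.fit(X, y)
    print(model.score(X, y))             # accuracy on the training data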
Many diagnostics
• Modelling power of variables
• Discriminatory power of variables
• Quality of the class model
• Probabilities of class membership
• Ambiguous classifications: is the analytical data good enough?
MANY SOPHISTICATIONS
A large number of classification methods are based on LDA.
• Bayesian methods, based on prior probabilities.
• Methods that try to find optimal groupings before class modelling.
LOTS OF INFORMATION
• Class membership
• Outliers
• Whether a sample belongs to a new class
• Is a class well defined, or are there subclasses, e.g. subspecies or species from different environments?
• Which measurements are most useful for discrimination: can we reduce the number of measurements?
• Are there ambiguous samples, and if so do we need more or better measurements?
• Replicate analysis: is our method repeatable enough, e.g. for clinical diagnostics?
SIMCA is sometimes used in chemometrics as an alternative:
• Soft
• Independent
• Modelling of
• Class Analogy
Use PCA to model each class independently.
• Choose the optimal number of PCs
• Use the distance from the PC model as an indicator of class distance
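A minimal SIMCA-style sketch, assuming scikit-learn; a real implementation would also choose the number of PCs per class (fixed at 2 here purely for illustration).

    # One PCA model per class; the residual from each model is the class distance.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)
    models = {c: PCA(n_components=2).fit(X[y == c]) for c in np.unique(y)}

    def class_distances(x):
        # residual distance of sample x from each class's PC model
        out = {}
        for c, pca in models.items():
            scores = pca.transform(x.reshape(1, -1))
            residual = x - pca.inverse_transform(scores)
            out[c] = float(np.linalg.norm(residual))
        return out

    print(class_distances(X[0]))         # smallest distance -> most likely class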
VALIDATION OF A CLASS MODEL
Procedure (sketched below).
• Establish a training set.
• Assess the model with a test set.
• Use the model on real data.
Information
• Graphical, e.g. diagrams
• Quantitative: class distances
• Quantitative: probability of membership of a given class
[Figure: data divided into a training set and a test set]
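A minimal sketch of the procedure on the iris data, assuming scikit-learn; real data replaces the test set only once the model is validated.

    # Train on one part of the data, assess on a held-out test set.
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LinearDiscriminantAnalysis().fit(X_train, y_train)
    print("test-set accuracy:", model.score(X_test, y_test))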
SUMMARY
• Cluster analysis – unsupervised pattern recognition
  • Similarity measures
  • Linkage
  • Dendrograms
• Classification – supervised pattern recognition
  • Class models
  • Class distances
  • Graphical methods