basics of chemometrics - CRA-W

INTRODUCTION
- Chemometrics Introduction
What is this and why we need it
BASICS OF CHEMOMETRICS
- Some definitions
- Overview of methods
Juan Antonio Fernández Pierna
Vincent Baeten
Pierre Dardenne
- Examples
Walloon Agricultural Research Centre (CRA-W)
Valorisation of Agricultural Products Department
Gembloux, Belgium
1
X - METRICS
CHEMOMETRICS – INTRODUCTION
The use of multivariate analysis in the discipline X:
“Chemometrics is the chemical discipline that
uses mathematics and statistics to design or
select optimal experimental procedures, to
provide
maximum
relevant
chemical
information by analyzing chemical data, and to
obtain knowledge about chemical systems”
Statistical, mathematical or graphical
technique, considers multiple variables
simultaneously
- Biometrics (used in biology)
- Technometrics (used in engineering)
- Psychrometrics (used in phychology)
- Chemometrics (used in chemistry)
2
D. L. Massart
3
4
CHEMOMETRICS – INTRODUCTION
CHEMOMETRICS – INTRODUCTION
The scientific world today
Mathematics Computing
Biology
Industrial
Food
Feed
Engineering
Statistics
• The data flood generated by modern analytical
instrumentation produces large quantity of numbers
to understand and quantify phenomenons around us.
Analytical
chemistry
• The evolution of personal computers allows faster
acquisition, processing and interpretation of chemical
data.
Theoretical
and physical
chemistry
Organic
chemistry
• Every scientist uses software related to mathematical
methods or to processing of knowledge.
Meeting point of various disciplines
5
Chemometrics
CHEMOMETRICS – INTRODUCTION
• A deeper understanding of those methods and tools
for viewing all data simultaneously are needed.
6
CHEMOMETRICS – INTRODUCTION
Use of mathematical and statistical methods for
selecting optimal experiments
Statistical experimental design
Design of Experiments (DoE)…
CHEMOMETRICS
CHEMOMETRIC
Analytical
Analytical
Analytical
S
method
answer
request
Extracting maximum amount of information when
analysing multivariate (chemical) data
Classification
Process monitoring,
Multivariate calibration…
Useful at any point in an analysis, from the first
conception of an experiment until the data is
discarded.
7
8
CHEMOMETRICS – APPLICATIONS
CHEMOMETRICS – DEFINITIONS
Huge growth area in past 15 years
•
•
•
•
•
Linear algebra is the language of Chemometrics. One cannot expect
to truly understand most chemometric techniques without a basic
understanding of linear algebra (Wise and Gallagher, 1998)
Process Control and analysis
Food and feed analysis
Biology – metabolomics etc
Environmental monitoring
Analytical Chemistry
Matrix and vector operations
9
10
CHEMOMETRICS – DEFINITIONS
CHEMOMETRICS – DEFINITIONS
0.9
0.8
0.7
- Samples are referred to as OBJECTS
Absorbance
0.6
- Measurement results (e.g. concentration, absorbances, …) are referred
to as VARIABLES
0.5
0.4
0.3
0.2
K variables
x22
...
xn 2
... x1 p ù
... x2 p úú
... ... ú
ú
... xnp ûú
0.1
0
-0.1
1200
1400
1600
1800
Wavelength
2000
2200
2400
0.8
0.7
0.6
0.5
Absorbance
-A data table of K variables and M objects is referred to as a DATA
MATRIX OF SIZE MxK
x12
Samples
é x11
êx
21
X =ê
ê ...
ê
ëê xn1
0.4
0.3
0.2
Variables
X
0.1
0
-0.1
1200
1400
1600
1800
Wavelength
2000
2200
2400
1200
1400
1600
1800
Wavelength
2000
2200
2400
0.8
CHEMOMETRICS:
Extract meaningful information about the objects and the
variables from data matrices
0.7
0.6
0.5
Absorbance
If X has 3 rows and 5 columns
3 x 5 data matrix
M objects - observations
0.4
0.3
0.2
0.1
11
12
0
-0.1
CHEMOMETRICS – INTRODUCTION
CHEMOMETRICS – IMPORTANT DISCIPLINES
In summary…
é x11
êx
ê 21
ê ...
ê
ëê xn1
K variables
X
=
M objects
x12 ... x1 p ù
x22 ... x2 p úú
... ... ... ú
ú
xn 2 ... xnp ûú
Ø Sampling, selection of objects and variables
Ø Clustering
Ø Multivariate regressions, calibrations and predictions
Ø Neural Networks
Ø Validation
Ø Graphical display and outlier detection
13
14
Source: Chemometrics – Introduction - Jens C. Frisvad
Calibration data set (X)
Data Exploration
Pattern Recognition
INCLUDE IN THE CALIBRATION SET ALL THE FACTORS OF VARIATION:
Data matrix pre-treatment
Model construction
Calibration
Unsupervised analysis
Principal Component Analysis
(PCA)
Measurement factors:
- room temperature
- operator
- instrument setup
Supervised analysis
Regression models
Model building
Visualization
clustering tendency
SAMPLE SELECTION
Sample Selection
Selection optimal complexity
outlier detection
New data
Outliers
Prediction
15
Uncertainty estimation
Validation
Population sources of variation:
- origins of samples
- processes
- varieties
- storage conditions
- sample preparations (t°, particle size)
- residual moisture
16
PATTERN RECOGNITION
PRINCIPAL COMPONENT ANALYSIS
PCA i
sa chemomet
ric techni
que for
visual
izing high
dimensionaldata.PCA reducesthedimensionality(the
Groupingofobject
se.
g.how simil
arist
hebehaviourofcompounds,
how similar are products…
numberofvari
ables)ofadat
asetbymai
ntaini
ngasmuch
varianceaspossi
ble.
PCA
PrincipalComponentAnalysis
0.4
-3
PC1
x 10
0.2
-1
0
-2
-0.2
PC 2 (3.42%)
Lookingatrel
at
ionshi
ps:
Betweensamples:food samples / patients / people / spectra / …
-3
PC2
-0.4
-4
-0.6
-5
Betweenvariables:el
ementalcomposit
ions/compound
concentrations / spectral peaks / …
-0.8 0.8
0.7
17
-1
0.4
0.35
PC2 0.6 PC1
-1.2
-3
-2
-1
18
0.3
0.25
0.5
0.4 0
0.2
1
2
3
4
PC 1 (96.09%)
PRINCIPAL COMPONENT ANALYSIS
CALIBRATION
Quantit
at
iveorqual
it
at
iveest
imat
ion.Especial
lymixtures.
Samples/Scores Plot of Xoffset
0.2
0.2
0.15
0.15
Clusters
Scores on PC 3 (0.34%)
0.1
0.1
Univariat
ecal
ibrati
on:onemeasuremente.
g.apeakhight
0.05
0.05
0
Outliers
-0.05
M ult
ivariat
ecal
ibrati
on:severalmeasurementse.
g.spect
ra
Stat
ist
ical
,mat
hemat
icalorgraphicaltechnique,considers
mult
ipl
evariablessi
mult
aneousl
y
-0.1
-0.15
-0.2
-0.25
-3
-3
-2 -2
-1 -1
0 0
11
Scores on PC 1 (96.09%)
22
Object space
33
4
19
20
MULTIVARIATE CALIBRATION
MULTIVARIATE CALIBRATION
Calibration
X – Data
spectra
• Relateinstrumentalresponsetochemicalconcentrati
on.
• Impropercalibrationyieldsimproperresults.
• Linear,non-linearandmultivariatecalibrationalgorithms
areavailabl
e,dependingontheinstrumentandtheanalysis
involved.
+
Y – Data
Referencevalues
Calibration
model
-M oist
ure(regression)
-Con
centrati
onofprotein(regression)
-countryoforigin(discriminati
on)
-PDO (discriminati
on)
-…
Linki
ng two set
s of dat
a together : peak hi
ght t
o
concent
rati
on / Spect
ra t
o concent
rati
ons / bi
ologi
cal
act
ivi
tyt
ost
ruct
ure/…
21
MULTIVARIATE CALIBRATION - VALIDATION
-Infraredspect
roscopy(IR-Raman)
-Fourier-Transform Infraredspect
roscopy(FTIR)
-Nucl
earM agnet
icResonance(NM R)
-X-RayFluorescence(XRF)
-X-RayDiffract
ion(XRD)spect
roscopy
-…
22
MULTIVARIATE CALIBRATION
Calibration
X – Data
spectra
+
Y – Data
Refvalue
Y=f(X)=XB+E
Calibration
model
Y is a matrixnxm containingthe reference values
Validation
X – Data
spectra
+
Calibration
model
X is a matrixnxp containingthe spectra (NIR,MIR,
Raman,…)
Y – Data
//refvalue.
Calibration uses empiricaldata and prior knowledge for
determininghow to predict unknown quantitative
information yfrom available measurements x,via some
mathematicaltransfer functions
B is a matrixpxm containingthe regression coeficients
E is a matrixnxm explainingfor the modelerror.
23
24
Principal Component
Analysis (PCA)
Unsupervised - Pattern recognition
Unsupervised
Cluster Analysis
Multiple Linear Regression (MLR)
(CA)
Principal Component Regression (PCR)
Principal Component
Analysis (PCA)
Partial Least Squares (PLS)
Unsupervised
Cluster Analysis
Regression
Multivariate analysis
(CA)
Artificial Neural Networks (ANN)
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
LS–Support Vector Machines (LS-SVM)
Regression
Multivariate analysis
Artificial Neural Networks (ANN)
Supervised
Local techniques
LS–Support Vector Machines (LS-SVM)
Supervised
Local techniques
à Seeks similarities and regularities present in the data.
PLS-Discriminant Analysis (PLS-DA)
PLS-Discriminant Analysis (PLS-DA)
Discrimination
SIMCA
Discrimination
25
Algorithms
Support Vector Machines (SVM)
Supervised - Discrimination
Principal Component
Analysis (PCA)
Principal Component
Analysis (PCA)
à RegressionUnsupervised
based statisticaltechnique used in determining which
Cluster Analysis
Multipleof
Linear
particul
ar cl
assification or(CA)group an item
dRegression
ata or(MLR)
an object bel
ongs to
on the basis of its characteristics or essential
features.
Principal Component
Regression (PCR)
Unsupervised
Cluster Analysis
(CA)
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Partial Least Squares (PLS)
Regression
Regression
Multivariate analysis
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Supervised
26
Support Vector Machines (SVM)
Supervised - Regression
Multivariate analysis
SIMCA
K-Nearest Neighbours (k-NN)
K-Nearest Neighbours (k-NN)
LS–Support Vector Machines (LS-SVM)
Supervised
Local techniques
Local techniques
PLS-Discriminant Analysis (PLS-DA)
SIMCA
à It incl
udes anyDiscrimination
techniques for mod
el
ing and anal
yzing several
variables,when the focus is on thK-Nearest
e rel
ationsh
ip between a dependent
Neighbours
(k-NN)
variable (property of interest)
and one or more independent
Support Vector Machines (SVM)
variables (spectra).
PLS-Discriminant Analysis (PLS-DA)
Discrimination
SIMCA
K-Nearest Neighbours (k-NN)
27
Support Vector Machines (SVM)
28
VARIABLE SELECTION
VARIABLE SELECTION
Find a setofpre
dictorvari
ables which gi
ves a good
fi
t
,
pre
dicts t
he dependent val
ue well and i
s as smal
l as
possible.
Ø
Backwardel
iminat
ion:Startwit
hallthepredictorsand
potential
lydroppredict
orsinsubsequentsteps.
Ø
Forwardsel
ect
ion:Startwithnopredict
orsandpotential
l
y
addpredictorsinsubsequentst
eps.
Ø
Stepwiseregression:Combinat
ionofbackward
el
iminati
onandforwardsel
ect
ion.
29
VARIABLE SELECTION
30
CHEMOMETRICS – SOFTW ARE
Reduci
ngnumberofvariables
romet
ersbasedonareduced
Possi
bili
tyofconst
ructi
ngfastspect
number of variables ai
ming t
o a reducti
on of t
rai
ni
ng and
util
izat
ionti
mest
obeused,forinstance,i
naconveyerbelt
.
M oreover,t
hefactofgro
upi
ngt
hedi
fferentsel
ect
edvari
ablesi
n
regionsal
lowsaneasychemicalinterpret
at
ionofthespect
ra.
References
- ‘Backward variable selection in PLS (BVSPLS)’;J.
A.Fernández
Pierna,V.Baeten,P.Dardenne.AnalyticaChimicaActa642(2009)
pp.89-93.
- ‘Prediction errorimprovements using variable select
ion on small
calibration sets - a comparison of some recent methods’. S.
I.
Overgaard,J.
A.Fernández Pierna,V.Baeten,P.Dardenne & T.
Isaksson.J.NearInf
raredSpectrosc.20,329-337(2012).
31
32
EXAMPLE: FEED
EXAMPLE: FEED
FARIMAL
Other
Vegetal
Poultry
Bovine
+ Pig
Terrestrialanimal
Fish
Other
CAL
LOOCV
% classifiedas.
.
.
% classifiedas.
.
.
Belonging
to.
.
.
Fish
Fish
98.6
1.4
Rest
9.
5
90.
5
Rest
Fish
96.8
10
Rest
3.2
90
33
34
93.4 correct
%
classification
EXAMPLE: CEREALS - IMPURITIES
EXAMPLE: NUTS AND DRIED FRUITS
§
§
§
§
§
§
Walnut
Hazelnut
Brazil nut
Almond
Cashew
Raisin
SVM discrimination models
‘NIR hyperspectral imaging spectroscopy and chemometrics for the detection of undesirable substances in food and
feed’ J.A. Fernández Pierna, Ph. Vermeulen, O. Amand, A. Tossens, P. Dardenne and V. Baeten. Special issue
Chemometrics and Intelligent Laboratory Systems 117 (2012) 233-239
35
36
EXAMPLE: OLIVES - IMPURITIES
EXAMPLE: OLIVE OIL – QUALITY PARAMETERS
KNN - PCA models
‘Usinga visible vision system for on-line determination of qualityparameters of olive fruits’. E. Guzmán,
V.Baeten,J.
A.FernándezPierna,J.
A.García Mesa.Foofand Nutrition Sciences,4,90-98(2013).
37
38
‘Application of low-resolution Raman spectroscopyfor the analysis of oxidized olive oil’ E. Guzmán,
V.Baeten,J.
A.FernándezPierna,J.
A.García Mesa.(2011) Food control 22,2036-2040.
EXAMPLE: OLIVE OIL – ADULTERATION
EXAMPLE: SUGAR BEET PLANTS – CYST
optical microscopy
NIR hyperspectral imaging
SVM discrimination models
39
Detection ofthe Presence ofHazelnut Oil in Olive Oil by FT-Raman and FT-MIR Spectroscopy'V.Baeten,
J.
A.FernándezPierna,P.Dardenne,M.Meurens,D.
L.García González,R.Aparicio.Journal of
Agricultural and Food Chemistry,53 (16) (2005),6201-6206.
‘NIR
hyperspectral imagingspectroscopyand chemometrics for the detection of undesirable substances in food and
feed’ J.A. Fernández Pierna, Ph. Vermeulen, O. Amand, A. Tossens, P. Dardenne and V. Baeten. Special issue
Chemometrics and Intelligent Laboratory Systems 117 (2012) 233-239
40
EXAMPLE: SUGAR BEET PLANTS – CERCOSPORA
EXAMPLE: FEED PRODUCTS
Multivariate regression method comparison:
PLS, ANN and LS-SVM
41
EXAMPLE: FEED PRODUCTS
42
EXAMPLE: FEED PRODUCTS
Feed -------------------------------------------- (28676x700)
Ash, Fat, Fibre, Starch, Protein
Feed Ingredients --------------------------- (26652x700)
Ash, Fat, Fibre, Protein
Fresh Silages ------------------------------- (1035x700)
Dry Matter, Fibre, Protein
Soils ------------------------------------------- (1625x700)
CEC, COT_SK, N_Kj
Example of data distribution (Soils N_Kj)
43
Pre-processing: SNV + detrend + First derivative
44
EXAMPLE: FEED PRODUCTS
EXAMPLE: FEED PRODUCTS
Feed - fibre
Feed - Fat
1.4
1.2
val
0.4
ks
RMS
val
0.6
ks
0.7
rd
0.6
0.4
rd
0.2
0.2
0
0
pls
ann
pls
svm
ann
Feed ingredients - fat
Feed ingredients - ash
train
0.8
RMS
RMS
1
train
0.6
svm
0.5
train
0.4
val
0.3
ks
0.2
rd
RMS
1
0.8
0.1
0
Feed - ash
pls
ann
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
train
val
ks
rd
pls
svm
ann
svm
2.5
Feed ingredients - fibre
train
1.5
val
1
ks
0.5
0
pls
ann
svm
3.5
3
RMS
2
val
1.5
ks
1
rd
RMS
train
2.5
0.5
0
pls
ann
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
val
ks
rd
ann
svm
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
train
val
ks
rd
pls
ann
svm
train
val
ks
45
rd
pls
svm
train
pls
Feed - protein
Feed - starch
Feed ingredients - protein
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
rd
RMS
RMS
2
ann
46
svm
EXAMPLE: FEED PRODUCTS
EXAMPLE: TRANSFERT
CALIBRATION TRANSFER FROM DISPERSIVE
INSTRUM ENTSTO HANDHELD SPECTROM ETERS (M EM S)
Fresh silages - protein
Fresh silages - fibre
2.5
train
val
1
ks
rd
0.5
0
pls
ann
RMS
RMS
2
1.5
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
0
60
train
val
40
rd
30
pls
svm
50
ks
ann
svm
20
10
RMS
Fresh silages - dry matter
0
0
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
10
20
30
40
50
60
70
train
val
ks
rd
pls
ann
svm
47
48
‘Calibration Transfer from Dispersive Instruments to Handheld Spectrometers’, J.A. Fernández
Pierna, P. Vermeulen, B. Lecler, V. Baeten, P. Dardenne. Applied Spectroscopy 64 (6) (2010)
EXAMPLE: TRANSFERT
QUESTIONS ABOUT CHEMOMETRICS?
j.f
ernandez@ cra.
wallonie.
be
http://www.
cra.
wallonie.
be/
49
51
50