Multivariate statistics

Multivariate statistics: only the brave
16 February
Jos Hageman & Ron Wehrens
Eye-ball your data!
The human mind is the best pattern recognizer:
● outliers
● relationships
● expectations
2
R-session: visualisation methods
 Meet the Plotting functions
 Find out when to use which one
> library(StatsDemo)
> ExPlot()
3
Higher dimensions...
3D
Named colors in R's RGB space
4D? 5D?
1000D?
100,000D?
4
Principal Component Analysis and visualization
 Central ideas of PCA
● Variation equals information
● Find directions of greatest information
● Force orthogonality
● Hopefully, only a few factors are enough
● Back to 2D plots!
5
Matrix decomposition with SVD
Singular Value Decomposition (SVD):
X = (UD)Vᵀ = TPᵀ
► Scores (T = UD):
Location of the samples in the new coordinate system
► Loadings (P = V):
Weights of the original variables in the new ones
► Variances (D²):
How many components need to be assessed?
6
Matrix dimensions
► Data matrix X: n × p (n samples, p variables)
► Scores T: n × A (A = number of components)
► Loadings P: p × A (same dimensions as V)
► Singular values D: A × A (diagonal matrix)
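As an illustration (not part of the original slides), a minimal R sketch of this decomposition, assuming X is an n × p numeric data matrix:

## PCA via SVD: mean-centre first, then decompose Xc = U D V'
Xc  <- scale(X, center = TRUE, scale = FALSE)
dec <- svd(Xc)
T.scores   <- dec$u %*% diag(dec$d)     # scores   T = UD, n x A
P.loadings <- dec$v                     # loadings P = V,  p x A
variances  <- dec$d^2 / (nrow(Xc) - 1)  # variances from D^2, one per component
dim(T.scores)                           # n x A
dim(P.loadings)                         # p x A
## prcomp(X) gives the same result: $x holds the scores, $rotation the loadings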
7
Example data: Italian wines
8
PCA score plots
9
Loading plots - I
10
Loading plots - II
11
Biplots
12
Scaling
 General considerations:
● counts: use log- or √-scaling
● before PCA: always set the mean to zero (mean centering)
● different units: standardization / auto-scaling (scale to mean 0, s.d. 1)
● time series: equal variance
● application-dependent scaling: e.g. Pareto scaling, double-centering, first-derivative scaling, ... (a sketch in R follows below)
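A minimal sketch (not part of the slides) of some of these options in R, assuming X is an n × p matrix of measurements (rows = samples, columns = variables):

X.mc   <- scale(X, center = TRUE, scale = FALSE)          # mean centering (always, before PCA)
X.auto <- scale(X, center = TRUE, scale = TRUE)           # auto-scaling: mean 0, s.d. 1
X.log  <- scale(log1p(X), center = TRUE, scale = FALSE)   # counts: log before centering
X.par  <- scale(X, center = TRUE, scale = sqrt(apply(X, 2, sd)))  # Pareto scaling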
13
How many components in a PCA? Scree plots
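A minimal illustration (not the StatsDemo code) of a scree plot in R, assuming X is the scaled data matrix:

pca <- prcomp(X)
screeplot(pca, type = "lines")                   # variance per component
plot(100 * pca$sdev^2 / sum(pca$sdev^2), type = "b",
     xlab = "Component", ylab = "Explained variance (%)")   # or as percentages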
14
R session! PCA this time.
 get used to the stuff
 look at some typical data sets
> library(StatsDemo)
> ExPCA()
15
Batch-correction... Why?
Data from different "batches" are often not directly comparable
"Batches" come in several forms (time, place, operator, extraction, machine, ...)
Differences in mean intensity, in spread, or more complex differences
Also within-batch differences (drift, injection-order effects)
16
How 1/2
 Using QC samples
 Injection order may be used as a covariate
17
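A minimal sketch of the QC-based idea (not the code from the BatchCorrMetabolomics repository cited later); the function and its arguments are hypothetical, assuming one metabolite's intensities, the injection order, and a logical flag marking the QC samples:

## fit a linear injection-order trend on the QC samples only,
## then remove it from all samples while keeping the overall level
correct_drift <- function(intensity, inj.order, is.qc) {
  qc    <- data.frame(y = intensity[is.qc], x = inj.order[is.qc])
  fit   <- lm(y ~ x, data = qc)                                # trend from QCs only
  trend <- predict(fit, newdata = data.frame(x = inj.order))   # evaluated for all samples
  intensity - trend + mean(intensity[is.qc])
}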
QC samples
Drawbacks:
● Metabolites cannot be corrected if they do not show up in the QC samples
● May not be possible for all batches
● The linear trend is typically estimated from only a small number of samples
So...
18
How 2/2
 Alternative: use study samples
 Randomisation of samples is crucial!!
19
Correction example: PCA
20
https://github.com/rwehrens/BatchCorrMetabolomics
CBSG Tomato data
Centre for BioSystems Genomics
● a network of Dutch scientists in the field of plant genomics
Identifying the genetic and metabolic basis of factors determining taste in tomato
Tomato taste project: measurements on
● three groups of tomato varieties:
● Cherry tomatoes (18 varieties)
● Round tomatoes (55 varieties)
● Beef tomatoes (20 varieties)
● 25 sensory attributes and metabolic compounds
21
Biplot: metabolomics data
[Biplot of the metabolomics data: first principal component (53.79% explained variance) vs. second (20.21%). Loadings shown for volatiles (V_...) such as V_phenylethanol, V_phenylacetaldehyde, V_hexanal, V_cis_3_hexenal, V_beta_damascenone and V_6_methyl_5hepten2_one, and for derivatised compounds (DV_...) such as DV_sucrose, DV_fructose, DV_glucose, DV_citric_acid and DV_malic_acid.]
22
Biplot: sensory attributes
23
Spike-in apple data
• 10 control apples
• 10 treated apples - spiked
• 9 chemical compounds
• ESI+ and ESI−: 1,632 and 995 features, respectively
• In each: 22 "true" biomarkers
• Franceschi et al., J. Chemom. (2012)
• processed data in the BioMark package (CRAN)
• raw data at MetaboLights
24
Example: data from first control sample
25
Finding differences between groups – t tests
26
Finding differences between groups - t tests
27
Finding differences between groups - t tests
28
Finding differences between groups - t tests
29
Finding differences between groups - t tests
30
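As an illustration (hypothetical object names, not the StatsDemo code): one t test per feature with multiple-testing correction, assuming X is an n × p feature matrix and group a two-level factor (e.g. control vs. spiked):

pvals <- apply(X, 2, function(x) t.test(x ~ group)$p.value)
padj  <- p.adjust(pvals, method = "BH")    # Benjamini-Hochberg FDR correction
which(padj < 0.05)                         # candidate biomarkers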
Finding differences between groups - PCA
Luck
Works when group differences are a global phenomenon
Loadings tell you about the relevant variables...
31
PCA on the spiked-apple data
32
Finding differences between groups - PLS-DA
Similar to PCA:
● Compression to a lower number of dimensions
● Scores
● Loadings
● Percentage of explained variance
Different from PCA:
● Compression done on the basis of both X and Y
● Scores for X and Y
● Loadings for X and Y
● Percentage of explained variance for X and Y
● SUPERVISED
33
Fitting a PLS(DA) model
1. decide on the scaling
2. divide the data into training and test sets
3. do cross-validation on the training data
4. choose the optimal number of components
5. predict the test data to obtain an estimate of the prediction error
6. refit the model with the optimal number of components, using all data (a sketch in R follows below)
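A minimal sketch of these steps with the pls package (one possible implementation, not the course code; object names and the choice of 3 latent variables are illustrative), assuming X is an n × p matrix and y the response (or a 0/1 class coding for PLS-DA):

library(pls)
dat   <- data.frame(y = y, X = I(X))                # keep X as a matrix column
train <- sample(nrow(dat), round(0.7 * nrow(dat)))  # step 2: random split
## step 3: cross-validation on the training part (scale = TRUE covers step 1)
fit <- plsr(y ~ X, data = dat[train, ], ncomp = 10, scale = TRUE, validation = "CV")
## step 4: look for the inflection point in the RMSECV curve
plot(RMSEP(fit), legendpos = "topright")
nLV <- 3                                            # read off from the plot
## step 5: prediction error on the independent test set
pred <- predict(fit, newdata = dat[-train, ], ncomp = nLV)
## step 6: refit with the chosen number of components, using all data
final <- plsr(y ~ X, data = dat, ncomp = nLV, scale = TRUE)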
34
Step 2: Training and testing sets
When data is plentiful: divide it into 2 parts
● at random, to avoid confounding
[Diagram: the n × p data matrix split row-wise into a part used for model creation and a part kept aside for independent model validation]
35
Step 3: Cross-validation on training set
One part for model creation, the other part for testing (e.g. 5-fold)
36
Step 4: Optimal number of LVs
During cross-validation, models with different numbers of LVs are produced and tested
Which number of LVs to take?
● Look for the inflection point: adding more LVs after 3 hardly decreases the RMSECV
37
Step 5: Estimate of prediction error
Using the optimal number of LVs, create a model on the training set and predict the test set
[Diagram: model creation on the training set; independent model validation by predicting the test set]
Alternatively, if data is scarce, yet another cross-validation can be used → two nested cross-validations
38
Criterion for prediction quality: Q²
Percentage of explained variation
● from the test set or from cross-validation
Like R², but using results from the test set
Q² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²
(numerator: prediction error; denominator: prediction error of the model that predicts only the average)
Close to one means good predictions
Close to zero (or negative!) means bad predictions
39
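A one-line illustration (hypothetical object names): computing Q² from test-set predictions, with y.test the observed values and y.pred the predictions:

Q2 <- 1 - sum((y.test - y.pred)^2) / sum((y.test - mean(y.test))^2)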
R session! PLS
 See the good things
 See the bad things
> library(StatsDemo)
> ExPLS()
40
Wrap up
Statistics are an important part of most research steps
● Pose one clear question...
● Take care of your sample
● 3 R's: replication, randomisation, reduce noise
● Keep it simple: t test vs. PLS-DA
● Make the data convince you... be critical!
41
Acknowledgements
Wageningen University &
Research
Ron Wehrens
Robert Hall
Ric de Vos
Roland Mumm
Fred van Eeuwijk
Fondazione Edmund Mach
Pietro Franceschi
Fulvio Mattivi
Urska Vrhovsek
Panagiotis Arapitsas
Domenico Masuero
42