Multivariate statistics: only the brave
16 February
Jos Hageman & Ron Wehrens

Eye-ball your data!
The human mind is the best pattern recognizer:
● outliers
● relationships
● expectations

R session: visualisation methods
● Meet the plotting functions
● Find out when to use which one
> library(StatsDemo)
> ExPlot()

Higher dimensions...
3D: named colors in R (RGB space)
4D? 5D? 1000D? 100,000D?

Principal Component Analysis and visualization
Central ideas of PCA:
● Variation equals information
● Find directions of greatest information
● Force orthogonality
● Hopefully, only a few factors are enough
● Back at 2D plots!

Matrix decomposition with SVD
Singular Value Decomposition (SVD):
$X = (UD)V^{T} = TP^{T}$
► Scores ($T = UD$): location of samples in the new coordinate system
► Loadings ($P = V$): weights of the original variables in the new ones
► Variances (proportional to $D^{2}$): how many components need to be assessed?

Matrix dimensions
► Data matrix X: n × p
► Scores T: n × A
► Loadings P: p × A (same dimensions as V)
► Singular values D: A × A (diagonal matrix)

Example data: Italian wines

PCA score plots

Loading plots - I

Loading plots - II

Biplots

Scaling
General considerations:
● counts: use log or √-scaling
● before PCA: always set the mean to zero (mean centering)
● different units: standardization, also called auto-scaling (scale to mean 0, s.d. 1)
● time series: equal variance
● application-dependent scaling: e.g. Pareto scaling, double centering, first-derivative scaling, ...

How many components in a PCA?
Scree plots

R session! PCA this time.
● get used to the stuff
● look at some typical data sets
> library(StatsDemo)
> ExPCA()

Batch correction... Why?
● Data from different "batches" are often not directly comparable
● "Batches" come in several forms (time, place, operator, extraction, machine, ...)
● Differences in mean intensity, in spread, or more complex differences
● Also within-batch differences (drift, injection-order effects)

How 1/2
Using QC samples
Injection order may be used as a covariate

QC samples
Drawbacks:
● Metabolites cannot be corrected if they do not show up in the QC samples
● May not be possible for all batches
● The linear trend is typically estimated from only a few samples
So...

How 2/2
Alternative: use study samples
Randomisation of samples is crucial!!

Correction example: PCA
https://github.com/rwehrens/BatchCorrMetabolomics

CBSG Tomato data
Centre for BioSystems Genomics
● network of Dutch scientists in the field of plant genomics
● identifying the genetic and metabolic basis of factors determining taste in tomato
Tomato taste project: measurements on
● three groups of tomato varieties:
  ● Cherry tomatoes (18 varieties)
  ● Round tomatoes (55 varieties)
  ● Beef tomatoes (20 varieties)
● 25 sensory attributes and metabolic compounds

Biplot: metabolomics data
[Figure: biplot; the horizontal axis explains 53.79% and the vertical axis 20.21% of the variance. Variables shown: V_phenylethanol, V_phenylacetaldehyde, DV_sucrose, V_trans_2_hexenal, V_1_penten_3_one, DV_citric_acid, DV_myo_inositol, DV_glucose, V_2_methylbutanol, V_trans_2_heptenal, DV_fructose, DV_aspartic_acid, V_cis_3_hexenol, V_2_methylbutanal, V_3_methylbutanol, V_cis_3_hexenal, V_methylsalicylate, V_beta_ionone, DV_glutamic_acid, DV_malic_acid, V_hexanal, V_2_methoxyphenol, V_beta_damascenone, V_2_isobutylthiazol, V_6_methyl_5hepten2_one]

Biplot: sensory attributes

Spike-in apple data
● 10 control apples
● 10 treated apples, spiked with 9 chemical compounds
● ESI+ and ESI−: 1,632 and 995 features, respectively
● In each: 22 "true" biomarkers
● Franceschi et al., J. Chemom. (2012)
● processed data in the BioMark package (CRAN)
● raw data at MetaboLights

Example: data from first control sample

Finding differences between groups - t tests

Finding differences between groups - PCA
● Luck
● Works when group differences are a global phenomenon
● Loadings tell you about the relevant variables...

PCA on the spiked-apple data

Finding differences between groups - PLSDA
Similar to PCA:
● Compression to a lower number of dimensions
● Scores
● Loadings
● Percentage of explained variance
Different from PCA:
● SUPERVISED
● Compression done on the basis of both X and Y
● Scores for X and Y
● Loadings for X and Y
● Percentage of explained variance for X and Y

Fitting a PLS(DA) model
1. decide on the scaling
2. divide the data into training and testing sets
3. do cross-validation on the training data
4. choose the optimal number of components
5. predict the test data to obtain an estimate of prediction error
6. refit the model with the optimal number of components, using all data

Step 2: Training and testing sets
When data is plentiful: divide into 2 parts
● Random, to avoid confounding
[Diagram: the n samples of the n × p data matrix are split into one part for model creation and one part for independent model validation]

Step 3: Cross-validation on training set
One part for model creation, the other part for testing, e.g. 5-fold

Step 4: Optimal number of LVs
During cross-validation, models with different numbers of LVs are produced and tested
Which number of LVs to take?
● Look for the inflection point
● Adding more LVs after 3 hardly decreases the RMSECV

Step 5: Estimate of prediction error
Using the optimal number of LVs, create a model on the training set
[Diagram: model creation on the training set, followed by independent model validation on the predicted test samples]
Alternatively, if data is scarce, yet another cross-validation can be used: two nested cross-validations

Criterion for prediction quality: $Q^{2}$
Percentage of explained variation
● From the test set or from cross-validation
● Like $R^{2}$, but using results from the testing set
$Q^{2} = 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^{2}}{\sum_i (y_i - \bar{y})^{2}}$
Numerator: prediction error. Denominator: prediction error from an average-only model.
● Close to one means good predictions
● Close to zero (or negative!) means bad predictions

R session! PLS
● See the good things
● See the bad things
> library(StatsDemo)
> ExPLS()

Wrap up
Statistics is an important part of most research steps
● Pose one clear question...
● Take care of your sample
● 3 R's: replication, randomisation, reduce noise
● Keep it simple: t-test vs PLSDA
● Make the data convince you... be critical!

Acknowledgements
Wageningen University & Research: Ron Wehrens, Robert Hall, Ric de Vos, Roland Mumm, Fred van Eeuwijk
Fondazione Edmund Mach: Pietro Franceschi, Fulvio Mattivi, Urska Vrhovsek, Panagiotis Arapitsas, Domenico Masuero
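
Appendix sketch: PCA via SVD
The SVD slide states that scores are T = UD and loadings are P = V. The following is a minimal base-R sketch of that decomposition, not the StatsDemo implementation. The simulated matrix and all variable names are assumptions made for illustration only.

```r
# Minimal sketch: PCA through the SVD, on simulated (hypothetical) data.
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)   # n = 20 samples, p = 5 variables

# Mean centering before PCA
Xc <- scale(X, center = TRUE, scale = FALSE)

# SVD: Xc = U D V^T
sv <- svd(Xc)
scores   <- sv$u %*% diag(sv$d)   # T = U D, n x A
loadings <- sv$v                  # P = V,   p x A

# Component variances are proportional to the squared singular values
variances <- sv$d^2 / (nrow(Xc) - 1)
explained <- 100 * variances / sum(variances)   # percentages for a scree plot

# T P^T reproduces the centered data (up to rounding error)
recon_error <- max(abs(scores %*% t(loadings) - Xc))

# The first two columns of `scores` give the coordinates for a 2D score plot
scores_2d <- scores[, 1:2]
```

Here A equals p = 5 because all components are kept; truncating `scores` and `loadings` to their first few columns gives the low-dimensional view used in score and loading plots. The variances agree with `prcomp(X)$sdev^2`, which is a convenient cross-check.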
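
Appendix sketch: scaling
The scaling slide lists mean centering, auto-scaling and Pareto scaling. A minimal base-R sketch of the three follows, assuming the usual definition of Pareto scaling (mean centering followed by division by the square root of the standard deviation). The two simulated variables are hypothetical and only serve to show very different units.

```r
# Minimal sketch: three common scalings on simulated (hypothetical) data.
set.seed(2)
X <- cbind(a = rnorm(10, mean = 100, sd = 20),    # large values, large spread
           b = rnorm(10, mean = 1,   sd = 0.1))   # small values, small spread

# Mean centering: every column gets mean 0
centered <- scale(X, center = TRUE, scale = FALSE)

# Auto-scaling: every column gets mean 0 and s.d. 1
auto <- scale(X, center = TRUE, scale = TRUE)

# Pareto scaling: center, then divide each column by sqrt(s.d.)
col_sd <- apply(X, 2, sd)
pareto <- sweep(centered, 2, sqrt(col_sd), "/")

# Spread per column after each scaling
sd_after <- rbind(centered = apply(centered, 2, sd),
                  auto     = apply(auto, 2, sd),
                  pareto   = apply(pareto, 2, sd))
```

After Pareto scaling the column standard deviations equal the square roots of the original ones, so large variables are damped without being forced to the same variance as small ones.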
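
Appendix sketch: training/test split and cross-validation
Steps 2 to 5 of the PLS(DA) recipe can be sketched in base R. To stay free of extra packages, this sketch swaps in principal component regression (PCA scores fed into a linear model) as a stand-in for PLS, so it is not the method of the slides, only the same workflow. The simulated data, the helper `fit_predict`, and the use of the minimum RMSECV instead of an inflection point are all assumptions.

```r
# Minimal sketch of the workflow with a stand-in model (PCR, NOT PLS).
set.seed(3)
n <- 60; p <- 8
X <- matrix(rnorm(n * p), n, p)                      # hypothetical data
y <- X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.5)

# Step 2: random split into training and test samples
test_idx  <- sample(n, size = 20)
train_idx <- setdiff(seq_len(n), test_idx)

# Step 3: assign the training samples at random to 5 folds
k <- 5
folds <- sample(rep(seq_len(k), length.out = length(train_idx)))

# Hypothetical helper: fit PCR with `a` components, predict new samples
fit_predict <- function(Xtr, ytr, Xte, a) {
  pc  <- prcomp(Xtr, center = TRUE, scale. = FALSE)
  Ttr <- pc$x[, 1:a, drop = FALSE]                   # training scores
  Tte <- predict(pc, Xte)[, 1:a, drop = FALSE]       # projected new samples
  cf  <- coef(lm(ytr ~ Ttr))
  drop(cbind(1, Tte) %*% cf)
}

# Steps 3-4: RMSECV for 1..p components
rmsecv <- sapply(seq_len(p), function(a) {
  sq_err <- unlist(lapply(seq_len(k), function(f) {
    tr <- train_idx[folds != f]
    te <- train_idx[folds == f]
    (y[te] - fit_predict(X[tr, ], y[tr], X[te, , drop = FALSE], a))^2
  }))
  sqrt(mean(sq_err))
})
best <- which.min(rmsecv)   # simplification: minimum, not inflection point

# Step 5: prediction error on the independent test set
pred  <- fit_predict(X[train_idx, ], y[train_idx], X[test_idx, ], best)
rmsep <- sqrt(mean((y[test_idx] - pred)^2))
```

Plotting `rmsecv` against the number of components gives the curve in which the slides look for an inflection point; step 6 would then refit on all data with the chosen number of components.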
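
Appendix sketch: computing Q2
The Q2 criterion compares the prediction error with the error of an average-only model. A minimal base-R sketch follows; the function name `q2` and the example numbers are hypothetical, and the mean is taken over the observed values passed in, which is one of several conventions in use.

```r
# Minimal sketch: Q2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
q2 <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

y <- c(1, 2, 3, 4, 5)                      # observed test-set values (made up)

q2_good <- q2(y, c(1.1, 1.9, 3.2, 3.8, 5.0))   # close predictions: 0.99
q2_mean <- q2(y, rep(mean(y), length(y)))      # average-only model: 0
q2_bad  <- q2(y, rev(y))                       # worse than the average: -3
```

The three cases mirror the slide: values near one indicate good predictions, zero matches the average-only model, and negative values mean the model predicts worse than simply using the mean.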