Multivariate Data
Hans-Peter Helfrich
University of Bonn
Theodor-Brinkmann-Graduate School
Jan 25, 2017
Overview
1. Principal component analysis
2. Partial least squares
3. Cross-validation
4. References
An example: Body data
Data (N = 252, p = 19) given by [Johnson, 1996]
We consider a data set with N = 252 samples. For each sample, p = 19
features (FAT, DENSITY, ...) are given.
     FAT  DENSITY  WEIGHT  HEIGHT   BMI  FATFREE
1   12.6    1.071   154.2   67.75  23.7    134.9  ...
2    6.9    1.085   173.2   72.25  23.4    161.3  ...
3   24.6    1.041   154.0   66.25  24.7    116.0  ...
4   10.9    1.075   184.8   72.25  24.9    164.7  ...
5   27.8    1.034   184.2   71.25  25.6    133.1  ...
6   20.6    1.050   210.2   74.75  26.5    167.0  ...
7   19.0    1.055   181.0   69.75  26.2    146.6  ...
8   12.8    1.070   176.0   72.50  23.6    153.6  ...
...
252
Body data
Data set
We choose p = 16 features (out of 19) and get a data set with N = 252
samples.
Question
The measurement of the fat density is expensive. Can we approximate this
quantity from the values of the other features?
Answer
In principle, the fat density could be well approximated by a linear
regression. But multicollinearity among the other features makes the
regression coefficients unstable and hard to interpret.
Introducing new variables
With the methods of principal components and partial least squares, new
uncorrelated variables (features) can be introduced.
Principal component analysis
Data
Principal component analysis is a tool for identifying patterns in data by
reducing the dimensionality. We consider random variables
X1 , X2 , . . . , Xp , each with N samples. As a first step, we subtract from each
sample xik of Xk the mean value x̄k and write aik = xik − x̄k .
We come to the matrix

    A = ( a11  a12  ...  a1p )
        ( a21  a22  ...  a2p )
        ( .................. )
        ( aN1  aN2  ...  aNp )

Each column now contains N samples of the random variables
X1 − x̄1 , X2 − x̄2 , . . . , Xp − x̄p , each with mean value 0.
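The centering step can be sketched numerically. The following is a minimal illustration in Python with NumPy rather than R, since only the matrix algebra matters; the toy values are ours, not the body data:

```python
import numpy as np

# Toy data matrix: N = 4 samples of p = 3 variables X1, X2, X3.
X = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [2.0, 4.0, 6.0],
              [2.0, 4.0, 2.0]])

# Column means x̄_k.
xbar = X.mean(axis=0)

# a_ik = x_ik − x̄_k: each column of A now has mean 0.
A = X - xbar
print(A.mean(axis=0))  # → [0. 0. 0.]
```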
Transforming the data
Choice of the method
We have two choices:
Using the correlations: divide each value in column k by the sample
standard deviation sd(Xk ). This option is not applicable if a column has
constant values. In R it is requested with the option cor = T.
Using the covariances: the option cor = F, which is the default in R's
princomp.
Outline
We can decompose the centered matrix A by the singular value decomposition
A = U D V^T, where U and V are orthogonal matrices and D is a diagonal
matrix of singular values. The columns of the matrix AV are called scores;
they give the principal components, ordered by decreasing variance. In many
cases the data can be well approximated by the first few columns. The
columns of V are the loadings.
Scores and loadings
Scores
The scores S are given by

    S = A V        (S and A of size N × p, V of size p × p)

The scores S are the data in the new coordinate system defined by the
loadings V.
Loadings
We may reconstruct the data A from the scores, with the loadings as the
transformation matrix:

    A = S V^T
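Both relations can be checked numerically via the singular value decomposition. This is a sketch in Python with NumPy, not the R session of these slides; the random data are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 4))
A = A - A.mean(axis=0)          # centered data matrix

# Singular value decomposition A = U D V^T.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Scores: the data in the coordinate system of the loadings V.
S = A @ V

# Reconstruction: A = S V^T holds exactly.
assert np.allclose(A, S @ V.T)

# The variances of the score columns decrease from left to right.
print(S.var(axis=0, ddof=1))
```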
Principal components in R [Venables and Ripley, 2002]
Data (N = 252, p = 16) given by [Johnson, 1996]
We take as an example the body data given by [Johnson, 1996]. The data set
contains body data for N = 252 persons. For each person, p = 19 features
are given initially. We exclude the variables with numbers 1, 3, 5 (NR,
SIRI, AGE) and come to a data set with p = 16 features.
R program
# Read the data
fat <- read.table("fatdata.txt", header = T)
data <- fat[, -c(1, 3, 5)]  # excluding NR, SIRI, AGE
prc <- princomp(data, cor = T)
print(summary(prc))
print(prc$loadings)
print(prc$scores)
Output of princomp
Importance of components
The command summary gives, for each component, the standard deviation
and the proportion of variance.

Importance of components:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation       3.203   1.459   0.861   0.824   0.755
Proportion of Variance   0.641   0.133   0.046   0.042   0.036
Cumulative Proportion    0.641   0.774   0.821   0.862   0.899

The second row gives the proportion of variance of each component; the
third row gives the cumulative proportions. In our case, 89.9 % of the
variance of the data is explained by the first five components.
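The proportions are obtained from the component variances: the proportion for component i is sd_i^2 / (sd_1^2 + ... + sd_p^2), and the cumulative proportions are their running sums. A sketch of this rule in Python with NumPy; the standard deviations here are toy values, not the body data:

```python
import numpy as np

# Component standard deviations (toy values for illustration).
sdev = np.array([3.0, 1.5, 0.9, 0.5])

# Proportion of variance per component and cumulative proportion.
var = sdev ** 2
prop = var / var.sum()
cum = np.cumsum(prop)

print(np.round(prop, 3))
print(np.round(cum, 3))
```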
Principal components
Each principal component is obtained as a linear combination of the
variables.
Comp.1 = 0.2142 ⋅ FAT − 0.2094 ⋅ DENSITY + 0.3053 ⋅ WEIGHT + ⋅ ⋅ ⋅
Comp.2 = 0.4541 ⋅ FAT − 0.4604 ⋅ DENSITY − 0.0576 ⋅ WEIGHT + ⋅ ⋅ ⋅
Comp.3 = −0.3040 ⋅ FAT + 0.3045 ⋅ DENSITY − 0.0221 ⋅ WEIGHT + ⋅ ⋅ ⋅
The coefficients are shown by the matrix V of the loadings.
Loadings V
          Comp.1   Comp.2   Comp.3   Comp.4   Comp.5
FAT       0.2142   0.4541  -0.3040   0.0117   0.1247
DENSITY  -0.2094  -0.4604   0.3045  -0.0019  -0.1384
WEIGHT    0.3053  -0.0576  -0.0221  -0.0813  -0.1407
HEIGHT    0.0743  -0.4400  -0.8208   0.0243  -0.1581
BMI       0.2880   0.1624   0.1822  -0.0164  -0.0815
Scores
Scores
The scores are the data in the coordinate system of principal components.
We show here the scores for the observations in rows 42, 36, and 216.

      Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6
42     2.415   6.888   8.823  -1.973   0.928   0.627
36     2.770   4.691  -0.115   0.123  -0.705   0.431
216    4.998   5.349  -0.054   0.232   0.094  -1.061
Partial least squares [Mevik and Wehrens, 2007]
Partial least squares seeks directions that have high variance and high
correlation with the response, in contrast to principal components
regression, which keys only on high variance [Hastie et al., 2009].

require(stats); require(graphics); library(pls)
filename <- "mueller_pls.csv"
mueller <- read.table(filename, header = T, sep = ";", dec = ",")
waves <- mueller[, c(7, 42:256)]
p <- 6
ppwaves <- plsr(Energiebilanz ~ ., ncomp = p, data = waves,
                validation = "CV")
summary(ppwaves)
coef(ppwaves)
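The notion of a direction with high covariance with the response can be sketched with a single NIPALS-style PLS step. This is an illustration in Python with NumPy, not the kernelpls algorithm used by plsr; all names and data are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, p predictors, one response depending on two of them.
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=N)

# Center predictors and response, as in PCA.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS weight vector: the direction of maximal covariance with y.
w = Xc.T @ yc
w /= np.linalg.norm(w)

# First PLS score (component).
t = Xc @ w

# The first PLS component correlates strongly with the response.
r = np.corrcoef(t, yc)[0, 1]
print(round(r, 3))
```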
Summary
Data:   X dimension: 53 215
        Y dimension: 53 1
Fit method: kernelpls
Number of components considered: 6

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV            21.9    21.15    19.42    19.62    19.24
adjCV         21.9    21.15    19.36    19.50    18.57

TRAINING: % variance explained
               1 comps  2 comps  3 comps  4 comps
X                71.72    89.02    92.07    93.34
Energiebilanz    11.63    29.07    33.42    40.87
13 / 16
Choice of the optimal number Δ of features
Cross-validation
A rather general method:

    Test | Train | Train | Train | Train

Divide the data into M (say M = 5) equal parts.
For m = 1, 2, . . . , M: choose part m as the test part and the other parts
as the training set. Make a prediction based on the training set and the
parameter Δ. Determine the error rate for each m and compute the
average error rate.
Repeat the procedure for many values of Δ and choose the value with
the least error rate.
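The procedure above can be sketched as follows. This is a minimal Python/NumPy illustration, assuming squared error and a least-squares fit; cv_error and the toy data are ours, and Δ is taken to be the number of leading features:

```python
import numpy as np

def cv_error(X, y, fit, predict, M=5, seed=0):
    """Average squared test error over M cross-validation folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, M)
    errors = []
    for m in range(M):
        test = folds[m]
        train = np.concatenate([folds[j] for j in range(M) if j != m])
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        errors.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errors)

# Toy data: only the first two features carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 6))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=60)

def fit_ls(Xtr, ytr):
    # Least-squares coefficients on the selected columns.
    return np.linalg.lstsq(Xtr, ytr, rcond=None)[0]

def pred_ls(beta, Xte):
    return Xte @ beta

# Repeat for Delta = 1, ..., 6 leading features; pick the smallest error.
scores = {d: cv_error(X[:, :d], y, fit_ls, pred_ls) for d in range(1, 7)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```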
References I
Hastie, T., Tibshirani, R., and Friedman, J. (2009).
The Elements of Statistical Learning: Data Mining, Inference, and
Prediction.
Springer Series in Statistics. Springer-Verlag, New York, second edition.
Johnson, R. W. (1996).
Fitting percentage of body fat to simple body measurements.
Journal of Statistics Education, 4.
Mevik, B.-H. and Wehrens, R. (2007).
The pls package: Principal component and partial least squares
regression in R.
Journal of Statistical Software, 18:1–24.
References II
Venables, W. N. and Ripley, B. D. (2002).
Modern Applied Statistics with S.
Statistics and Computing. Springer-Verlag, New York, fourth edition.