Pitfalls of cross validation
Michael K. Tippett
International Research Institute for Climate and Society
The Earth Institute, Columbia University
Statistical Methods in Seasonal Prediction, ICTP
Aug 2-13, 2010
Main ideas
- Cross-validation is useful for estimating the performance of a model on independent data.
- It requires few assumptions.
- It can be computationally expensive.
- It can be misused.
Outline
- Cross-validation and linear regression
  - Computational method
  - Bias
- Pitfalls
  - Not including model/predictor selection in the cross-validation.
  - Not leaving out enough data for the training and validation samples to be independent.
- Cross-validation is expensive.
Cross-validation of linear regression
The cross-validated error of linear regression can be computed without refitting the regression n times:

\[ y(i) - \hat{y}_{cv}(i) = \frac{e(i)}{1 - v_{ii}} \]

where
- ŷ_cv(i) is the prediction from the regression with the i-th sample left out of the computation of the regression coefficients,
- e(i) = y(i) − ŷ(i) is the in-sample error,
- v_ii is the i-th diagonal element of the "hat" matrix V = X(X^T X)^{-1} X^T.

(Cook & Weisberg, Residuals and Influence in Regression, 1982)
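A minimal sketch in R (simulated x and y, not data from the slides) checking that this shortcut reproduces the residuals from an explicit leave-one-out loop:

set.seed(1)
n = 30
x = rnorm(n)
y = 0.6 * x + rnorm(n)
d = data.frame(x = x, y = y)

X = cbind(1, x)                          # design matrix with intercept column
V = X %*% solve(t(X) %*% X) %*% t(X)     # "hat" matrix
e = residuals(lm(y ~ x, data = d))       # in-sample errors e(i)
e_cv_shortcut = e / (1 - diag(V))        # y(i) - yhat_cv(i) from the formula

e_cv_loop = sapply(1:n, function(i) {
  fit = lm(y ~ x, data = d[-i, ])        # refit with the i-th sample left out
  y[i] - predict(fit, newdata = d[i, , drop = FALSE])
})

max(abs(e_cv_shortcut - e_cv_loop))      # essentially zero: the two agree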
Cross-validation for linear regression
What is the correlation of a climatological forecast? What is the cross-validated correlation of a climatological forecast?
- X = column of n ones.
- X^T X = n, (X^T X)^{-1} = 1/n.
- V = X(X^T X)^{-1} X^T = n × n matrix with every entry equal to 1/n.
- e(i) = y(i) − ȳ.

\[ \hat{y}_{cv}(i) = y(i) - \frac{e(i)}{1 - v_{ii}} = y(i) - \frac{y(i) - \bar{y}}{1 - 1/n} = \frac{n}{n-1}\,\bar{y} - \frac{1}{n-1}\,y(i) \]

What is the correlation between y and ŷ_cv? Why?
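A minimal sketch in R (simulated data, purely illustrative): the leave-one-out climatology forecast is a decreasing linear function of y(i), so its correlation with y is exactly −1.

set.seed(2)
n = 30
y = rnorm(n)
yhat_cv = sapply(1:n, function(i) mean(y[-i]))   # leave-one-out climatology forecast
cor(y, yhat_cv)                                  # exactly -1 (up to rounding)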
Cross-validation for linear regression
More intuitively:
- Leaving out a wet year gives a drier mean.
- Leaving out a dry year gives a wetter mean.
Hence the negative correlation.
The problem would be solved if somehow the mean did not change. How?
Cross-validation for linear regression
Some benefit to leaving out more years and predicting the middle year (n = 30).
[Figure: cross-validated correlation of the climatology forecast versus the number of years left out (1 to 15).]
Problem with trend?
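A minimal sketch in R reproducing the qualitative behaviour of this figure (white-noise data; an assumed simplification is that only years with a full surrounding block are forecast), averaged over many realizations:

cv_cor = function(y, m) {                 # climatology forecast leaving out m consecutive years
  n = length(y); h = (m - 1) / 2
  idx = (1 + h):(n - h)                   # only years with a full block around them
  yhat = sapply(idx, function(i) mean(y[-((i - h):(i + h))]))
  cor(y[idx], yhat)
}
set.seed(3)
rowMeans(replicate(500, sapply(c(1, 3, 5, 7), function(m) cv_cor(rnorm(30), m))))
# -1 when one year is left out, becoming less negative as more years are left out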
Cross-validation for linear regression
Some benefit to leaving out more years and predicting the middle year, but an increase in error variance (n = 30).
[Figure: cross-validated 1 − normalized error variance versus the number of years left out (1 to 15).]
Bias of cross validation
n = 30, y = ax + b.
[Figure: cross-validated correlation versus population correlation.]
Bias of cross validation
n = 30, y = ax + b.
[Figure: cross-validated 1 − normalized error variance versus population correlation.]
Pitfalls of cross validation
- Performing an initial analysis using the entire data set to identify the most informative features.
- Using cross-validation to assess several models, and only stating the results for the model with the best results.
- Allowing some of the training data to be (essentially) included in the test set.
(From Wikipedia.)
Example: Philippines
Problem: predict gridded April-June precipitation over the Philippines from the preceding (January-March) SST.
[Maps: SST anomaly, JFM 1971; rainfall anomalies, AMJ 1971.]
Example: Philippines
Simple regression model: either climatology or ENSO as predictors. Use AIC to choose.
[Map: "AIC selected model" (ENSO or climatology) at each grid point.]
Example: Philippines
- Some skill (cross-validated).
- Why the negative correlation?
[Maps: cross-validated correlation.]
Pitfall!
- Showed the model selected.
- Presented the cross-validated skill of that model.
What's wrong? The entire data set was used to select the predictors.
Solution?
Pitfall!
Solution: include predictor selection in the cross-validation.
[Maps: "min AIC selected model" and "max AIC selected model" (ENSO or climatology) at each grid point.]
Pitfall!
Negative impact on correlation in places with skill?
[Map: change in correlation.]
Why is the impact positive in some areas?
Pitfall!
Negative impact on normalized error.
[Map: change in 1 − error variance / climatological variance.]
Pitfall!
Points with little skill are most affected.
[Scatter plots: correlation and 1 − error variance / climatological variance, with cross-validated predictor selection versus without.]
Idea: conservative models don't go too far wrong . . .
What if the model had used many PCs?
A more nefarious (but real) example
- Observe that in a 40-member ensemble of GCM predictions some members have more skill than others.
- Pick the members with skill exceeding some threshold.
- Perform PCA and retain the PCs with skill exceeding some threshold as your predictors.
- Estimate skill using cross-validation.
Sounds harmless, maybe even clever.
What is the problem? What is the impact?
A more nefarious (but real) example
Cross-validated forecasts show good skill.
[Figure: observations and cross-validated forecasts, 1980-2010; correlation = 0.84.]
What is the real skill?
A more nefarious (but real) example
Apply this procedure 1000 times to random numbers.
[Histogram: frequency of cross-validated correlations, roughly 0.5 to 0.95; mean correlation = 0.8.]
Never underestimate the power of a screening procedure.
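A simplified sketch in R of the screening idea (illustrative only, not the exact procedure above: it keeps the five best-correlated members instead of applying a threshold, and averages them instead of taking PCs), showing that pure noise can produce an impressive cross-validated correlation:

set.seed(4)
n_years = 30; n_members = 40
obs = rnorm(n_years)                                    # random "observations"
members = matrix(rnorm(n_years * n_members), n_years)   # random "ensemble members"

skill = apply(members, 2, cor, y = obs)                 # screening uses ALL years
keep = order(skill, decreasing = TRUE)[1:5]             # keep the 5 "best" members
pred = rowMeans(members[, keep, drop = FALSE])

yhat_cv = sapply(1:n_years, function(i) {               # leave-one-out regression
  fit = lm(obs[-i] ~ pred[-i])
  coef(fit)[1] + coef(fit)[2] * pred[i]
})
cor(obs, yhat_cv)    # typically well above zero, even though everything is noise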
Cross-validation and Predictor selection: Bad
# Predictor selection happens here, BEFORE the loop, using all N years,
# so the left-out years influence which predictor is chosen (the pitfall).
predictor_selection(x, y)      # placeholder for the selection routine (e.g. minimum AIC)
N = length(y)                  # number of years
ypred = y + NA                 # vector of NAs with the same length as y
for (ii in 1:N) {
  out = (ii - 1):(ii + 1)      # leave out three consecutive years
  training = setdiff(1:N, out) # out-of-range indices at the ends are harmless
  xcv = x[training]
  ycv = y[training]
  model.cv = lm(ycv ~ xcv)
  ypred[ii] = predict(model.cv, list(xcv = x[ii]))  # forecast the middle left-out year
}
Cross-validation and Predictor selection: Good
# Predictor selection happens INSIDE the loop, using the training years only,
# so the left-out years never influence the choice of predictor.
N = length(y)
ypred = y + NA
for (ii in 1:N) {
  out = (ii - 1):(ii + 1)
  training = setdiff(1:N, out)
  xcv = x[training]
  ycv = y[training]
  predictor_selection(xcv, ycv)   # selection sees the training data only
  model.cv = lm(ycv ~ xcv)
  ypred[ii] = predict(model.cv, list(xcv = x[ii]))
}
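For concreteness, a hypothetical predictor_selection (not given on the slides) might choose by AIC between a climatology-only model and a one-predictor model and return the selected fit; the loop above would then forecast with predict() on the returned model rather than refitting.

predictor_selection = function(xcv, ycv) {
  m0 = lm(ycv ~ 1)      # climatology only
  m1 = lm(ycv ~ xcv)    # climatology plus the candidate predictor
  if (AIC(m1) < AIC(m0)) m1 else m0   # return the model preferred by AIC
}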
Summary
- Predictor selection needs to be included in the cross-validation.
- The impact varies.
Example: PCA and regression
We asked: is there any benefit to predicting the PCs of y rather than y itself? We compared regression at each grid point with regression between patterns.
- Compute PCs of SST.
- Compute PCs of rainfall.
- Estimate skill from cross-validated regression between the PCs.
What is the problem?
We cannot do PCA of "future" observations: the PCA of y needs to go inside the CV loop.
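A minimal sketch in R (simulated data, an assumed single-predictor setup with illustrative names) of doing the PCA of the predictand inside the cross-validation loop, so the patterns are estimated from the training years only:

set.seed(5)
n = 30; p = 12                            # years x grid points
x = rnorm(n)                              # predictor (e.g. an ENSO index)
Y = outer(x, rnorm(p)) + matrix(rnorm(n * p), n, p)   # predictand field

Yhat_cv = matrix(NA, n, p)
for (i in 1:n) {
  pca = prcomp(Y[-i, ])                   # PCA of y on the training years only
  pc1 = pca$x[, 1]                        # leading PC time series (training years)
  fit = lm(pc1 ~ x[-i])
  pc1_hat = coef(fit)[1] + coef(fit)[2] * x[i]
  Yhat_cv[i, ] = pca$center + pc1_hat * pca$rotation[, 1]   # back to grid-point space
}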
Example: Philippines
[Scatter plots: pattern (PC) correlation versus pointwise correlation, with the PCA of Y done outside versus inside the cross-validation.]
Example: Philippines
[Scatter plots: pattern (PC) 1 − error variance / climatological variance versus pointwise, with the PCA of Y done outside versus inside the cross-validation.]
Similar problem
- Do CCA.
- Find patterns and time series.
- Use the time series in a regression.
- Check skill using cross-validation.
What is the problem? What is a solution?
CPT
Climate Predictability Tool.
- Three consecutive years are left out.
- CCA is applied to the remaining years.
- The CCA depends on the number of PCs retained.
- The middle year of the left-out years is forecast.
- Repeat until all years are forecast.
- The cross-validated forecast depends on the number of PCs retained.
- Select the number of PCs that optimizes the cross-validated skill. This skill is reported as the forecast skill.
What is the problem? Solution?
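One commonly used fix (an illustration, not necessarily what CPT does) is double, or nested, cross-validation: the truncation is chosen by an inner cross-validation over the training years only, and the outer loop measures the skill of the whole select-then-fit procedure. A minimal sketch in R with simulated data, with an ordinary PCA regression standing in for the CCA and up to 5 PCs considered:

set.seed(6)
n = 30; p = 8
X = matrix(rnorm(n * p), n, p)            # predictor field (years x grid points)
y = rowMeans(X[, 1:2]) + rnorm(n)         # predictand

loo_rmse = function(S, y, k) {            # inner leave-one-out RMSE using k PCs
  sqrt(mean(sapply(seq_along(y), function(j) {
    fit = lm(y[-j] ~ S[-j, 1:k, drop = FALSE])
    (y[j] - sum(coef(fit) * c(1, S[j, 1:k])))^2
  })))
}

yhat_cv = rep(NA, n)
for (i in 1:n) {
  pca = prcomp(X[-i, ])                                   # PCA on training years only
  S = pca$x                                               # PC scores of the training years
  s_i = predict(pca, newdata = X[i, , drop = FALSE])      # scores of the left-out year
  k = which.min(sapply(1:5, function(kk) loo_rmse(S, y[-i], kk)))  # inner CV picks k
  fit = lm(y[-i] ~ S[, 1:k, drop = FALSE])
  yhat_cv[i] = sum(coef(fit) * c(1, s_i[1, 1:k]))
}
cor(y, yhat_cv)     # honest skill estimate of the whole selection-plus-fit procedure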
Three data sets
- Data set 1: estimate the parameters of the model.
- Data set 2: select the predictor/model.
- Data set 3: estimate the skill of the model.
Why are two data sets not enough?
"Example"
Suppose many models are compared, all with the same skill except for sampling differences, s ± δ. Picking the model with the largest estimated skill, s + δ, overstates the real skill s.
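A minimal sketch in R (purely illustrative numbers) of this selection bias: when many models share the same true skill, the best estimated skill is an overestimate.

set.seed(7)
n_models = 20
true_skill = 0.3
est = function() true_skill + rnorm(n_models, sd = 0.15)   # skill estimates with sampling noise
mean(replicate(1000, max(est())))    # typically well above the true skill of 0.3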
SST prediction
- Predict monthly SST anomalies from the monthly SST anomalies six months before.
- 1982-2009: 28 years.
- PCA of monthly SST anomalies.
- Cross-validation to pick the number of PCs.
- Look at the correlation of those cross-validated forecasts.
What are the problems?
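A minimal sketch in R (an assumed setup) of building leave-one-year-out folds for monthly data, so that all twelve months of the verification year are withheld together rather than one month at a time:

years = rep(1982:2009, each = 12)          # year label for each of the 336 months
folds = lapply(unique(years), function(yr) which(years == yr))
# folds[[k]] holds the indices of the 12 months withheld together in fold k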
SST prediction
- Leave-one-month-out cross-validation.
- Checked truncations up to 75 PCs.
- Lowest cross-validated error with 66 PCs.
What's wrong with this picture?
[Map: correlation of the cross-validated forecasts over the tropical Pacific; mean correlation = 0.64.]
SST prediction
- Leave-one-year-out cross-validation.
- Checked truncations up to 75 PCs.
- Lowest cross-validated error with 6 PCs.
[Map: correlation of the cross-validated forecasts over the tropical Pacific; mean correlation = 0.42.]
Summary
- There is an efficient method to compute leave-one-out cross-validation for linear regression.
- Cross-validation has some biases: climatology forecasts have negative cross-validated correlation.
- Include model/predictor selection in the cross-validation.
- The left-out data must be independent of the training data.