
LECTURE 6
MULTIVARIATE REGRESSION
Multivariate Regression; Selection Rules
Supplementary Readings:
Wilks, chapter 6;
Bevington, P.R., Robinson, D.K., Data Reduction and
Error Analysis for the Physical Sciences, McGraw-Hill,
1992.
SOLUTION OF MATRIX EQUATION

We can write

$$
\mathbf{A} = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1N} \\
a_{21} & a_{22} & \cdots & a_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
a_{N1} & a_{N2} & \cdots & a_{NN}
\end{pmatrix}, \qquad
\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \qquad
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{pmatrix}
$$

$$\mathbf{A}\mathbf{x} = \mathbf{b}$$

If A is invertible there is a unique solution:

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$$
Recall Linear Regression

$$y = a + bx$$

$$\sum_i y_i = na + b\sum_i x_i, \qquad \sum_i x_i y_i = a\sum_i x_i + b\sum_i x_i^2$$

We can write this as a matrix equation,

$$
\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}
$$
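A minimal sketch of forming and solving this 2x2 normal-equation system with made-up data (variable names and values are illustrative only); np.polyfit should recover the same intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, x.size)   # synthetic data

n = x.size
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

a_hat, b_hat = np.linalg.solve(A, rhs)   # intercept, slope
print(a_hat, b_hat)
print(np.polyfit(x, y, 1))               # returns [slope, intercept]; same values, reversed order
```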

We can generalize this for the case of multiple independent variables ("multiple linear regression"):

$$\hat{y} = b_0 + b_1\hat{x}_1 + b_2\hat{x}_2 + \ldots + b_M\hat{x}_M$$

Let us consider this in more detail...
Consider the general linear model,

$$\hat{y} = b_0 + b_1\hat{x}_1 + b_2\hat{x}_2 + \ldots + b_M\hat{x}_M$$

$$\sum_{i=1}^{N}\left(\hat{y} - y_i\right)^2 = \sum_{i=1}^{N}\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)^2$$

We seek to minimize the sum-of-squares with respect to the M+1 parameters $b_0, \ldots, b_M$:

$$\frac{\partial}{\partial b_j}\sum_{i=1}^{N}\left(\hat{y} - y_i\right)^2 = 0$$

$$\frac{\partial}{\partial b_j}\sum_{i=1}^{N}\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)^2 = 0$$

This is SIMULTANEOUS multiple linear regression.
This yields a system of M+1 equations:

$$\sum_{i=1}^{N} 2\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)\cdot 1 = 0 \qquad (1)$$

$$\sum_{i=1}^{N} 2\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)\hat{x}_{1i} = 0 \qquad (2)$$

$$\sum_{i=1}^{N} 2\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)\hat{x}_{2i} = 0 \qquad (3)$$

$$\vdots$$

$$\sum_{i=1}^{N} 2\left(b_0 + b_1\hat{x}_{1i} + b_2\hat{x}_{2i} + \ldots + b_M\hat{x}_{Mi} - y_i\right)\hat{x}_{Mi} = 0 \qquad (M+1)$$

This set of M+1 linear equations in M+1 unknowns can be rewritten in matrix format:

$$
\begin{pmatrix}
N & \sum\hat{x}_{1i} & \sum\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi} \\
\sum\hat{x}_{1i} & \sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
\sum\hat{x}_{2i} & \sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sum\hat{x}_{Mi} & \sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_M \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum y_i\hat{x}_{1i} \\ \sum y_i\hat{x}_{2i} \\ \vdots \\ \sum y_i\hat{x}_{Mi} \end{pmatrix}
$$

This can be written:

$$\mathbf{A}\mathbf{b} = \mathbf{c}$$

where A is the symmetric (M+1) x (M+1) matrix on the left, b is the column (M+1)-vector of regression coefficients, and c is the column (M+1)-vector on the right.

Solution is:

$$\mathbf{b} = \mathbf{A}^{-1}\mathbf{c}$$

As long as A is invertible!!
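As a hedged sketch of the full procedure (synthetic data, not from the lecture), one can assemble A and c from a design matrix whose first column is all ones and solve the normal equations directly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 3
Xhat = rng.normal(size=(N, M))                      # predictors x̂_1 ... x̂_M
y = 1.0 + Xhat @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.2, N)

X = np.column_stack([np.ones(N), Xhat])             # prepend a column of 1s for b0
A = X.T @ X                                         # (M+1) x (M+1) symmetric matrix of sums and cross-products
c = X.T @ y                                         # right-hand-side vector

b = np.linalg.solve(A, c)                           # b = A^{-1} c
print(b)                                            # close to [1.0, 0.5, -1.0, 2.0]
```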
















Let us consider some special
cases...
(1) The predictors and predictand are zero-mean:

$$
\begin{pmatrix}
N & 0 & 0 & \cdots & 0 \\
0 & \sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
0 & \sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
$$

The system thus reduces to the condition $b_0 = 0$ and the matrix equation

$$\mathbf{A}\mathbf{b} = \mathbf{c}$$

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
\sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
$$

What kind of matrix is this? Symmetric (M x M).

b is the column M-vector of regression coefficients and c is the column M-vector:

$$
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{pmatrix}, \qquad
\mathbf{c} = \begin{pmatrix} \sum y_i\hat{x}_{1i} \\ \sum y_i\hat{x}_{2i} \\ \vdots \\ \sum y_i\hat{x}_{Mi} \end{pmatrix}
$$
(2) The predictors/predictand are zero-mean AND the predictors are orthogonal:

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & 0 & \cdots & 0 \\
0 & \sum\hat{x}_{2i}^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}, \qquad
\mathbf{c} = \begin{pmatrix} \sum y_i\hat{x}_{1i} \\ \sum y_i\hat{x}_{2i} \\ \vdots \\ \sum y_i\hat{x}_{Mi} \end{pmatrix}
$$

$$\mathbf{A}\mathbf{b} = \mathbf{c}$$

Since the matrix is diagonal, this is readily solved as:

$$
\mathbf{b} = \begin{pmatrix}
\sum y_i\hat{x}_{1i} / \sum\hat{x}_{1i}^2 \\
\sum y_i\hat{x}_{2i} / \sum\hat{x}_{2i}^2 \\
\vdots \\
\sum y_i\hat{x}_{Mi} / \sum\hat{x}_{Mi}^2
\end{pmatrix}
$$
(3) The predictors/predictand are zero-mean AND have unit standard deviation AND the predictors are orthogonal:

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & 0 & \cdots & 0 \\
0 & \sum\hat{x}_{2i}^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}, \qquad
\mathbf{c} = \begin{pmatrix} \sum y_i\hat{x}_{1i} \\ \sum y_i\hat{x}_{2i} \\ \vdots \\ \sum y_i\hat{x}_{Mi} \end{pmatrix}
= N \begin{pmatrix} r_{x_1 y} \\ r_{x_2 y} \\ \vdots \\ r_{x_M y} \end{pmatrix}
$$

$$\mathbf{A}\mathbf{b} = \mathbf{c}$$

$$
\mathbf{b} = \begin{pmatrix}
\sum y_i\hat{x}_{1i} / \sum\hat{x}_{1i}^2 \\
\sum y_i\hat{x}_{2i} / \sum\hat{x}_{2i}^2 \\
\vdots \\
\sum y_i\hat{x}_{Mi} / \sum\hat{x}_{Mi}^2
\end{pmatrix}
= \begin{pmatrix} r_{x_1 y} \\ r_{x_2 y} \\ \vdots \\ r_{x_M y} \end{pmatrix}
$$

The vector of regression coefficients b is the vector of the individual correlation coefficients between the predictand and predictors!
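A quick numerical check of this claim, offered only as a sketch: the predictors below are synthetic, constructed to be exactly orthogonal, zero-mean, and standardized with the population (divide-by-N) convention.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 500, 3

# Build zero-mean, mutually orthogonal predictors, then scale to unit (population) std dev
Z = rng.normal(size=(N, M))
Z -= Z.mean(axis=0)
Q, _ = np.linalg.qr(Z)                 # orthonormal, still zero-mean columns
Xhat = np.sqrt(N) * Q                  # now sum(x̂_j^2) = N for each column

y = Xhat @ np.array([0.3, -0.7, 0.1]) + rng.normal(0.0, 1.0, N)
yhat = (y - y.mean()) / y.std()        # standardize the predictand

b = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ yhat)           # regression coefficients
r = np.array([np.corrcoef(Xhat[:, j], yhat)[0, 1] for j in range(M)])

print(np.allclose(b, r))               # True: coefficients equal the simple correlations
```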










(4) The predictors are not orthogonal, but both predictors and predictand are zero-mean and have unit standard deviation:

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
\sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
= N \begin{pmatrix}
1 & r_{x_1 x_2} & \cdots & r_{x_1 x_M} \\
r_{x_2 x_1} & 1 & \cdots & r_{x_2 x_M} \\
\vdots & \vdots & \ddots & \vdots \\
r_{x_M x_1} & r_{x_M x_2} & \cdots & 1
\end{pmatrix}
$$

$$
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{pmatrix}, \qquad
\mathbf{c} = \begin{pmatrix} \sum y_i\hat{x}_{1i} \\ \sum y_i\hat{x}_{2i} \\ \vdots \\ \sum y_i\hat{x}_{Mi} \end{pmatrix}
= N \begin{pmatrix} r_{x_1 y} \\ r_{x_2 y} \\ \vdots \\ r_{x_M y} \end{pmatrix}
$$

$$\mathbf{A}\mathbf{b} = \mathbf{c}, \qquad \mathbf{b} = \mathbf{A}^{-1}\mathbf{c} =
\begin{pmatrix} r'_{x_1 y} \\ r'_{x_2 y} \\ \vdots \\ r'_{x_M y} \end{pmatrix}
$$

The vector of regression coefficients b is the vector of partial correlation coefficients between the predictand and (correlated) predictors!
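A sketch with synthetic, correlated predictors: after standardization the system reduces to the predictor correlation matrix, and solving it reproduces the ordinary least-squares coefficients on the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 1000, 3

# Correlated predictors
cov = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, 0.4],
                [0.2, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(M), cov, size=N)
y = X @ np.array([0.8, -0.5, 0.2]) + rng.normal(0.0, 1.0, N)

# Standardize (zero mean, unit std) predictors and predictand
Xhat = (X - X.mean(axis=0)) / X.std(axis=0)
yhat = (y - y.mean()) / y.std()

R = (Xhat.T @ Xhat) / N               # A / N: the predictor correlation matrix
r = (Xhat.T @ yhat) / N               # c / N: simple correlations with the predictand

b = np.linalg.solve(R, r)             # standardized ("partial") regression coefficients
print(b)
print(np.linalg.lstsq(Xhat, yhat, rcond=None)[0])   # same result via least squares
```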
(5) One or more of the predictors are linearly dependent:

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
\sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
$$

is no longer full rank! We can no longer obtain the solution $\mathbf{b} = \mathbf{A}^{-1}\mathbf{c}$ by inverting, since A is NOT invertible!!

One could try to obtain a solution to $\mathbf{A}\mathbf{b} = \mathbf{c}$ by singular value decomposition (SVD), or simply try to eliminate the redundant predictor...
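A brief sketch of the rank-deficient case (synthetic data, with one predictor an exact multiple of another): direct inversion fails, but the SVD-based pseudoinverse, or np.linalg.lstsq, still returns a minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
x3 = 2.0 * x1                      # linearly dependent predictor
X = np.column_stack([x1, x2, x3])
X -= X.mean(axis=0)
y = x1 - 0.5 * x2 + rng.normal(0.0, 0.1, N)
y -= y.mean()

A = X.T @ X                        # rank 2, not invertible
c = X.T @ y

print(np.linalg.matrix_rank(A))    # 2 < 3
b_svd = np.linalg.pinv(A) @ c      # minimum-norm solution via the SVD
print(b_svd)
# np.linalg.solve(A, c) would raise LinAlgError or be numerically meaningless here
```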
The matrix

$$
\mathbf{A} = \begin{pmatrix}
\sum\hat{x}_{1i}^2 & \sum\hat{x}_{1i}\hat{x}_{2i} & \cdots & \sum\hat{x}_{1i}\hat{x}_{Mi} \\
\sum\hat{x}_{2i}\hat{x}_{1i} & \sum\hat{x}_{2i}^2 & \cdots & \sum\hat{x}_{2i}\hat{x}_{Mi} \\
\vdots & \vdots & \ddots & \vdots \\
\sum\hat{x}_{Mi}\hat{x}_{1i} & \sum\hat{x}_{Mi}\hat{x}_{2i} & \cdots & \sum\hat{x}_{Mi}^2
\end{pmatrix}
$$

is the variance-covariance matrix of the dataset. The principal axes of variance in the dataset are determined by

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x} \qquad \text{or} \qquad (\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = 0$$

where

$$\det(\mathbf{A} - \lambda\mathbf{I}) = 0$$

describes the non-trivial vectors x.

Principal Component Analysis (PCA): subject for a later lecture...
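Looking ahead, a minimal sketch (synthetic data) of the eigenvalue problem Ax = λx for a covariance matrix in NumPy; the details of PCA itself are deferred to the later lecture.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
X -= X.mean(axis=0)

A = X.T @ X / X.shape[0]                  # variance-covariance matrix of the dataset
eigvals, eigvecs = np.linalg.eigh(A)      # symmetric matrix: real eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]         # sort by decreasing explained variance
print(eigvals[order])                     # variances along the principal axes
print(eigvecs[:, order])                  # columns are the principal axes
```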
ANOVA for Multiple Linear Regression

Source        df        SS      MS                          F-test
Total         n-1       SST
Regression    K         SSR     MSR = SSR/K                 MSR/MSE
Residual      n-K-1     SSE     MSE = SSE/(n-K-1) = s_e^2

$$R^2 = \mathrm{SSR} / \mathrm{SST}$$

We can generalize the notion of the univariate coefficient of determination as the sum over the product of the partial and actual correlation coefficients between the predictand and the M predictors:

$$R^2 = \sum_{i=1}^{K} r'_{x_i y}\, r_{x_i y}$$
Recall results for linear regression:

$$t = \frac{b - 0}{\sigma_b}$$

$$b = \frac{n\sum x_i y_i - \sum y_i \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

$$\sigma_b = s_e \left[\frac{1}{\sum (x_i - \bar{x})^2}\right]^{1/2}$$
This generalizes to the multivariate case:

$$t = \frac{b_j - 0}{\sigma_{b_j}}$$

$$b_j = \frac{n\sum x_i^{(j)} y_i - \sum y_i \sum x_i^{(j)}}{n\sum \left(x_i^{(j)}\right)^2 - \left(\sum x_i^{(j)}\right)^2}$$

$$\sigma_{b_j} = s_e \left[\frac{1}{\sum \left(x_i^{(j)} - \bar{x}^{(j)}\right)^2}\right]^{1/2}$$
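A sketch of coefficient standard errors and t ratios for the multivariate case. Note that it uses the general matrix expression sigma_{b_j}^2 = s_e^2 [(X^T X)^{-1}]_{jj}, which is an addition of mine rather than a formula stated on the slides:

```python
import numpy as np

rng = np.random.default_rng(7)
n, M = 150, 2
X = rng.normal(size=(n, M))
y = 2.0 + X @ np.array([0.5, -0.3]) + rng.normal(0.0, 1.0, n)

Xd = np.column_stack([np.ones(n), X])
XtX_inv = np.linalg.inv(Xd.T @ Xd)
b = XtX_inv @ Xd.T @ y

resid = y - Xd @ b
s_e2 = resid @ resid / (n - M - 1)          # residual variance s_e^2
se_b = np.sqrt(s_e2 * np.diag(XtX_inv))     # standard errors of b_0 ... b_M
t = b / se_b                                # t ratios against H0: b_j = 0
print(b, se_b, t)
```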
Example #1: Statistical model for
global CO2 concentrations as a
function of time
Monthly Atmospheric CO2 (1959-1988)
Measurements of CO2 in
parts per million (ppm) at
Mauna Loa Observatory
Higher readings occur in winter when plants die and release CO2
to the atmosphere. Lower readings occur in summer when more
abundant vegetation absorbs CO2 from the atmosphere.
Monthly Atmospheric CO2 (1959-1988): linear trend (dashed) and quadratic least-squares trend (solid) shown along with the actual (dotted) data
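For illustration only, a sketch of fitting linear and quadratic trends to a CO2-like monthly series; the series below is synthetic, constructed to mimic the shape of the Mauna Loa record, not the actual data:

```python
import numpy as np

# Synthetic stand-in for the monthly record (1959-1988): trend, slight curvature, annual cycle
t = np.arange(360) / 12.0                          # years since the start of the record
co2 = (315.0 + 0.8 * t + 0.012 * t**2 + 3.0 * np.sin(2 * np.pi * t)
       + np.random.default_rng(8).normal(0.0, 0.3, t.size))

lin = np.polyfit(t, co2, 1)                        # linear trend (dashed in the figure)
quad = np.polyfit(t, co2, 2)                       # quadratic trend (solid in the figure)
print("linear:   ", lin)
print("quadratic:", quad)
```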
Monthly Atmospheric CO2 (1959-1988): ANOVA table and summary of regression
Results of multivariate regression are
simplest to interpret if predictors can
be specified based on objective a
priori considerations (e.g., for physical
reasons)...
Example #2: Statistical model of
Northern Hemisphere Temperatures as
a linear combination of climate
forcings
Figure: Relationship of variations in the Northern Hemisphere annual mean temperature reconstruction to estimates of three candidate climate forcings (solar irradiance, greenhouse gases, volcanism).

Mann, M.E., Bradley, R.S., Hughes, M.K., Global-Scale Temperature Patterns and Climate Forcing Over the Past Six Centuries, Nature, 392, 779-787, 1998.
Standardize and remove the mean from predictors and predictand, and treat this as a multivariate regression problem. We are interested in the partial correlations between the candidate forcings and the "response".
$$\mathbf{b} = \mathbf{A}^{-1}\mathbf{c}$$

$$
\mathbf{A} = N \begin{pmatrix}
1 & r_{SC} & r_{SV} \\
r_{SC} & 1 & r_{CV} \\
r_{SV} & r_{CV} & 1
\end{pmatrix}, \qquad
\mathbf{c} = N \begin{pmatrix} r_{ST} \\ r_{CT} \\ r_{VT} \end{pmatrix}, \qquad
\mathbf{b} = \begin{pmatrix} r'_{ST} \\ r'_{CT} \\ r'_{VT} \end{pmatrix}
$$

(Subscripts S, C, V, and T denote solar irradiance, greenhouse gases, volcanism, and temperature, respectively.)
How is the significance of the regression coefficients established? What is the appropriate null hypothesis here?

Correlations between a white noise predictand time series and 3 independent white noise predictors? Correlations between a red noise predictand time series and 3 independent red noise predictors?
Motivates a
Non-Parametric
Approach…
ALTERNATIVE
APPROACH
What if predictors cannot be specified
based on objective a priori
considerations?
Selection Rules
Ithaca Winter Snowfall Variations (1980-1986)

Training Sample (Developmental Data) vs. Verification Sample (Independent Data): the regression fit to the training sample fails cross-validation! A classic example of STATISTICAL OVERFITTING...

It is clear that we need some objective way of sorting out real predictors from spurious predictors!

Screening Regression
Forward Selection or Stepwise Regression
•Begin with M potential predictors and the assumption y = b0.
•Evaluate the strength of the linear relationship between the predictand and each of the M predictors. Select the most significant predictor x1 (i.e., the one with the greatest value of r or the largest t ratio).
•The model is now y = b0 + b1 x1 (b0 is not the same as above).
•Iteratively construct the best model y = b0 + b1 x1 + … + bK xK with K predictors (selecting at each step the xj that produces the largest t ratio, the greatest increase in R2, or the greatest value of F).
•Stop at some point K < M! (A sketch of the procedure is given below.)
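A minimal sketch of forward selection using the increase in R² as the selection criterion (synthetic data; a real screening regression would also apply a significance test as the stopping rule):

```python
import numpy as np

def r2(X, y):
    """R^2 of an OLS fit of y on the columns of X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean())**2)

rng = np.random.default_rng(9)
n, M = 200, 6
X = rng.normal(size=(n, M))
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + rng.normal(0.0, 1.0, n)   # only 2 real predictors

selected, remaining = [], list(range(M))
for step in range(3):                           # K = 3 steps, for illustration
    scores = [(r2(X[:, selected + [j]], y), j) for j in remaining]
    best_r2, best_j = max(scores)
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"step {step + 1}: added predictor {best_j}, R^2 = {best_r2:.3f}")
```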
Example of Forward Selection (Stepwise
Regression)
Backward Elimination
•Begin with a simultaneous multivariate regression against the M potential predictors: y = b0 + b1 x1 + … + bM xM.
•Evaluate the strength of the linear relationship between the predictand and the M predictors. Select the least significant predictor (i.e., the one with the smallest value of r or the smallest t ratio), and eliminate it from the regression equation.
•The new model is y = b0 + b1 x1 + … + bM-1 xM-1.
•Iteratively reduce to the best model y = b0 + b1 x1 + … + bK xK with K predictors (eliminating at each step the xj that has the smallest t ratio, produces the smallest decrease in R2, or the smallest value of F).
•Stop at some point K < M!
Accounting for Multiplicity of Predictors

[Figure: critical F-ratio accounting for the multiplicity of candidate predictors, for K=12, n=127.] This curve can be generated using Monte Carlo methods.
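A sketch of how such a curve can be generated by Monte Carlo: repeatedly screen K candidate white-noise predictors against a white-noise predictand of length n, record the best statistic obtained purely by chance, and take an upper percentile as the multiplicity-adjusted critical value. The statistic below is the largest |r| rather than the F-ratio, purely for brevity; K=12 and n=127 follow the figure.

```python
import numpy as np

rng = np.random.default_rng(10)
K, n, ntrials = 12, 127, 2000

best = np.empty(ntrials)
for i in range(ntrials):
    y = rng.normal(size=n)                       # white-noise predictand
    X = rng.normal(size=(n, K))                  # K independent white-noise predictors
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(K)])
    best[i] = np.abs(r).max()                    # best predictor chosen by screening

print(np.percentile(best, 95))                   # multiplicity-adjusted 95% critical |r|
```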
Stopping Point Choice: Cross-Validation
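A sketch of a leave-one-out cross-validation score that could serve as the stopping criterion: fit each candidate model with one case withheld, predict the withheld case, and stop adding predictors once the cross-validated error stops decreasing (synthetic data; the function name is mine, not from the lecture).

```python
import numpy as np

def loo_cv_mse(X, y):
    """Leave-one-out cross-validated mean squared error of an OLS fit."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xd = np.column_stack([np.ones(keep.sum()), X[keep]])
        b, *_ = np.linalg.lstsq(Xd, y[keep], rcond=None)
        errs[i] = y[i] - np.concatenate([[1.0], X[i]]) @ b
    return np.mean(errs**2)

rng = np.random.default_rng(11)
n = 60
X = rng.normal(size=(n, 5))
y = 1.0 + 0.7 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0.0, 1.0, n)

for k in range(1, 6):                            # nested models with the first k predictors
    print(k, loo_cv_mse(X[:, :k], y))            # CV error typically bottoms out near k = 2
```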