
Lecture 16: Generalized Additive Models
Regression III:
Advanced Methods
Bill Jacoby
Michigan State University
http://polisci.msu.edu/jacoby/icpsr/regress3
Goals of the Lecture
• Introduce Additive Models
– Explain how they extend from simple nonparametric
regression (i.e., local polynomial regression)
– Discuss estimation using backfitting
– Explain how to interpret their results
• Conclude with some examples of Additive Models
applied to real social science data
Limitations of the Multiple
Nonparametric Models
• Recall that the general nonparametric model (both the
lowess smooth and the smoothing spline) takes the
following form:
      Yi = f(Xi1, Xi2, ..., Xik) + εi
• As we see here, the multiple nonparametric model allows
all possible interactions between the independent variables
in their effects on Y—we specify a jointly conditional
functional form
• This model is ideal under the following circumstances:
1. There are no more than two predictors
2. The pattern of nonlinearity is complicated and thus
cannot be easily modelled with a simple transformation
or polynomial regression
3. The sample size is sufficiently large
Limitations of the Multiple
Nonparametric Models (2)
• The general nonparametric model becomes impossible to
interpret and unstable as we add more explanatory
variables, however
1. For example, in the lowess case, as the number of
variables increases, the window span must become
wider in order to ensure that each local regression has
enough cases. This process can create significant bias
(the curve becomes too smooth)
2. It is impossible to interpret general nonparametric
regression when there are more than two variables—
there are no coefficients, and we cannot graph effects in
more than three dimensions
• These limitations lead us to additive models
Additive Regression Models
• Additive regression models essentially apply local
regression to low dimensional projections of the data
• The nonparametric additive regression model is
      Yi = α + f1(Xi1) + f2(Xi2) + ... + fk(Xik) + εi
  The fj are arbitrary functions estimated from the data;
  the errors εi are assumed to have constant variance and
  a mean of 0
• Additive models create an estimate of the regression
surface by combining a collection of one-dimensional functions
• The estimated functions fj are the analogues of the
coefficients in linear regression
Additive Regression Models (2)
• The assumption that the contribution of each covariate
is additive is analogous to the assumption in linear
regression that each component is estimated separately
• Recall that the linear regression model is
      Yi = α + β1Xi1 + β2Xi2 + ... + βkXik + εi
  where the βj represent linear effects
• For the additive model we instead model Y as an additive
combination of arbitrary functions of the X's
• The fj represent arbitrary functions that can be estimated
by lowess or smoothing splines
Additive Regression Models (3)
• Now comes the question: How do we find these
arbitrary functions?
• If the X’s were completely independent—which will not be
the case—we could simply estimate each functional form
using a nonparametric regression of Y on each of the X’s
separately
– Similarly, in linear regression, when the X's are
completely uncorrelated, the partial regression slopes are
identical to the marginal regression slopes
• Since the X’s are related, however, we need to proceed in
another way, in effect removing the effects of other
predictors—which are unknown before we begin
• We use a procedure called backfitting to find each curve,
controlling for the effects of the others
Estimation and Backfitting
• Suppose that we had a two-predictor additive model:
      Yi = α + f1(Xi1) + f2(Xi2) + εi
• If we (unrealistically) knew the partial regression function f2
but not f1, we could rearrange the equation in order to solve
for f1:
      Yi − f2(Xi2) = α + f1(Xi1) + εi
• In other words, smoothing Yi − f2(Xi2) against Xi1 produces an
estimate of α + f1(Xi1).
• Simply put, knowing one function allows us to find the
other—in the real world, however we don’t know either so
we must proceed initially with estimates
Estimation and Backfitting (2)
1. We start by expressing the variables in mean deviation
form so that the partial regressions sum to zero, thus
eliminating the individual intercepts
2. We then take preliminary estimates of each function from
a least-squares regression of Y on the X’s
3. These estimates are then used as step (0) in an iterative
estimation process
4. We then find the partial residuals for X1, which remove
the preliminary estimate of the X2 effect from Y but retain the
relationship between Y and X1
Estimation and Backfitting (3)
The partial residuals for X1 are then
      Ei(1) = Yi − f2(Xi2)
using the current estimate of f2
5. The same procedure in step 4 is done for X2
6. Next we smooth these partial residuals against their
respective X's, providing a new estimate of each partial-regression
function:
      fj = Sj E(j)
where E(j) is the vector of partial residuals for Xj and Sj is the
(n × n) smoother transformation matrix for Xj that depends only on
the configuration of Xij for the jth predictor
Estimation and Backfitting (4)
• This process of finding new estimates of the functions by
smoothing the partial residuals is reiterated until the
partial functions converge
– That is, when the estimates of the smooth functions
stabilize from one iteration to the next we stop
• When this process is done, we obtain estimates of sj(Xij) for
every value of Xj
• More importantly, we will have reduced a multiple
regression to a series of two-dimensional partial
regression problems, making interpretation easy:
– Since each partial regression is only two-dimensional,
the functional forms can be plotted on two-dimensional
plots showing the partial effects of each Xj on Y
– In other words, perspective plots are no longer
necessary unless we include an interaction between two
smoother terms
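• A toy sketch of backfitting for two predictors, using loess as the
smoother (an illustration of the algorithm above, not the estimation
code used by gam; y, x1, and x2 are assumed numeric vectors):

  backfit <- function(y, x1, x2, tol = 1e-6, max.iter = 50) {
    alpha <- mean(y)
    yc <- y - alpha                          # step 1: express Y in mean-deviation form
    m0 <- lm(yc ~ x1 + x2)                   # step 2: preliminary linear estimates
    f1 <- coef(m0)["x1"] * (x1 - mean(x1))
    f2 <- coef(m0)["x2"] * (x2 - mean(x2))
    for (i in 1:max.iter) {                  # step 3: iterate from these starting values
      f1.old <- f1
      f1 <- fitted(loess(I(yc - f2) ~ x1))   # steps 4-6: smooth the partial residuals for x1
      f1 <- f1 - mean(f1)                    # keep each partial function summing to zero
      f2 <- fitted(loess(I(yc - f1) ~ x2))   # same for x2
      f2 <- f2 - mean(f2)
      if (max(abs(f1 - f1.old)) < tol) break # stop once the estimates stabilize
    }
    list(alpha = alpha, f1 = f1, f2 = f2)
  }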
Interpreting the Effects
• A plot of Xj versus sj(Xj) shows the relationship between
Xj and Y holding constant the other variables in the model
• Since Y is expressed in mean deviation form, the smooth
term sj(Xj) is also centered and thus each plot represents
how Y changes relative to its mean with changes in X
• Interpreting the scale of the graphs then becomes easy:
– The value of 0 on the Y-axis is the mean of Y
– The height of the curve gives the distance of the fitted
value from the mean: positive values of sj(Xj) are added to
the mean and negative values are subtracted. For example, if
the mean is 45, and for a particular X-value (say x = 15) the
curve is at sj(Xj) = 4, this means the fitted value of Y
controlling for all other explanatory variables is 45 + 4 = 49.
– If there are several nonparametric relationships, we can
add together the effects on the two graphs for any
particular observation to find its fitted value of Y
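• In R these centered contributions can be recovered directly from a
fitted additive model; a sketch, assuming a gam object such as the
mod.gam fitted later in the lecture:

  terms.mat <- predict(mod.gam, type = "terms")   # one centered column sj(Xij) per smooth term
  fit.y <- coef(mod.gam)["(Intercept)"] + rowSums(terms.mat)  # intercept (the mean of Y) plus the partial effects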
Additive Regression Models in R:
Example: Canadian prestige data
• Here we use the Canadian Prestige data to fit an additive
model with prestige regressed on income and education
• In R we use the gam function (for generalized additive
models) that is found in the mgcv package
– The gam function in mgcv fits only smoothing splines
(local polynomial regression can be done in S-PLUS)
– The formula takes the same form as the glm function
except now we have the option of having parametric
terms and smoothed estimates
– Smooths will be fit to any variable specified with the
s(variable) argument
• The simple R-script is as follows:
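• A minimal sketch of such a script (mod.gam is an assumed object
name; the Prestige data frame ships with the car package):

  library(mgcv)   # provides gam() with smoothing-spline terms
  library(car)    # provides the Canadian Prestige data
  mod.gam <- gam(prestige ~ s(income) + s(education), data = Prestige)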
Additive Regression Models in R:
Example: Canadian prestige data (2)
• The summary function returns tests for each smooth, the
degrees of freedom for each smooth, and an adjusted R-square
for the model. The deviance can be obtained from
the deviance(model) command
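• A minimal sketch of these commands (assuming the mod.gam object
from the previous slide):

  summary(mod.gam)    # tests, degrees of freedom for each smooth, adjusted R-square
  deviance(mod.gam)   # residual deviance of the additive model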
Additive Regression Models in R:
Example: Canadian prestige data (3)
• Again, as with other nonparametric models, we have no
slope parameters to investigate (we do have an
intercept, however)
• A plot of the regression surface is necessary
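• One way to draw such a surface (a sketch, not the lecture's script;
the grid size and viewing angle are arbitrary choices) is to predict
over a grid of income and education values and pass the result to persp:

  inc  <- seq(min(Prestige$income), max(Prestige$income), length = 25)
  ed   <- seq(min(Prestige$education), max(Prestige$education), length = 25)
  grid <- expand.grid(income = inc, education = ed)
  fit  <- matrix(predict(mod.gam, newdata = grid), nrow = length(inc))
  persp(inc, ed, fit, xlab = "Income", ylab = "Education", zlab = "Prestige",
        theta = 45, ticktype = "detailed")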
Additive Regression Models in R:
Example: Canadian prestige data (4)
Additive Model:
• We can see the nonlinear relationship for both education and income
with prestige, but there is no interaction between them—i.e., the slope
for income is the same at every value of education
• We can compare this model to the general nonparametric regression model
[Figure: perspective plot of the fitted additive-model surface, with
Prestige plotted against Income and Education]
Additive Regression Models in R:
Example: Canadian prestige data (5)
General Nonparametric Model:
• This model is quite similar to the additive model, but there are some
nuances—particularly in the mid-range of income—that the additive model
does not pick up because it does not allow the X's to interact
[Figure: perspective plot of the general (jointly conditional)
nonparametric fit, with Prestige plotted against Income and Education]
Additive Regression Models in R:
Example: Canadian prestige data (6)
• Perspective plots can also be made automatically using the
persp.gam function. These graphs include a 95% confidence region
[Figure: persp.gam perspective plot of the fitted additive surface,
prestige against income and education; the red and green surfaces
are +/- 2 standard errors]
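• In current versions of mgcv a comparable plot can be drawn with
vis.gam (shown here as a sketch in place of persp.gam):

  vis.gam(mod.gam, view = c("income", "education"), theta = 45, se = 2)
  # se = 2 adds surfaces at +/- 2 standard errors (drawn in red and green)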
Additive Regression Models in R:
Example: Canadian prestige data (7)
• Since the slices of the additive regression surface in the
direction of one predictor (holding the other constant) are
parallel, we can graph each partial-regression function separately
• This is the benefit of the additive model—we can graph as
many plots as there are variables, allowing us to
easily visualize the relationships
• In other words, a multidimensional regression has been
reduced to a series of two-dimensional partial-regression
plots
• To get these in R:
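• A minimal sketch (mod.gam is the assumed name of the model fitted
earlier):

  plot(mod.gam, pages = 1, se = TRUE)   # both smooths on one page, with bands at +/- 2 standard errors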
Additive Regression Models in R:
Example: Canadian prestige data (8)
[Figure: separate partial-regression plots from the additive model,
showing s(income,3.12) against income and s(education,3.18) against
education]
Additive Regression Models in R:
Example: Canadian prestige data (9)
[Figure: partial-regression plots of s(income,3.12) against income and
s(education,3.18) against education]
R-script for previous slide
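• A plausible sketch of such a script (not the original), drawing each
smooth separately and assuming mod.gam from before:

  plot(mod.gam, select = 1, residuals = TRUE)   # the income smooth
  plot(mod.gam, select = 2, residuals = TRUE)   # the education smooth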
Residual Sum of Squares
• As was the case for smoothing splines and lowess
smooths, statistical inference and hypothesis testing are
based on the residual sum of squares (or deviance in the
case of generalized additive models) and the degrees of
freedom
• The RSS for an additive model is easily defined in the
usual manner:
      RSS = Σi (Yi − Ŷi)²
• The approximate degrees of freedom, however, need to be
adjusted from the regular nonparametric case, because we are no
longer specifying a jointly-conditional functional form
Degrees of Freedom
• Recall that for nonparametric regression, the approximate
degrees of freedom are equal to the trace of the smoother
matrix (the matrix that projects Y onto Y-hat)
• We extend this to the additive model, so that for each smooth term
      dfj = trace(Sj) − 1
  1 is subtracted from each df reflecting the constraint
  that each partial regression function sums to zero (the
  individual intercepts have been removed)
• Parametric terms entered in the model each occupy a single
degree of freedom as in the linear regression case
• The individual degrees of freedom are then combined for a
single measure:
      df = 1 + Σj dfj
  1 is added to the final degrees of freedom to account
  for the overall constant in the model
Testing for Linearity
• I can compare the linear model of prestige regressed on
income and education with the additive model by
carrying out an analysis of deviance
• I begin by fitting the linear model using the gam
function
• Next I want the residual degrees of freedom from the
additive model
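• A minimal sketch of these two steps (the object names mod.lin and
mod.gam are assumptions carried over from the earlier code):

  mod.lin <- gam(prestige ~ income + education, data = Prestige)   # linear terms only
  deviance(mod.lin)       # deviance of the linear model
  deviance(mod.gam)       # deviance of the additive model
  mod.gam$df.residual     # residual degrees of freedom of the additive model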
Testing for Linearity (2)
• Now I simply calculate the difference in the deviance
between the two models relative to the difference in
degrees of freedom (difference in df = 7.3 − 2 = 5.3)
• This gives a Chi-square test for linearity
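• A sketch of this computation by hand (object names as assumed above):

  dev.diff <- deviance(mod.lin) - deviance(mod.gam)
  df.diff  <- mod.lin$df.residual - mod.gam$df.residual
  pchisq(dev.diff, df = df.diff, lower.tail = FALSE)   # p-value for the Chi-square test of linearity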
• The difference between the models is highly statistically
significant—the additive model describes the relationship of
prestige to education and income much better
Testing for Linearity (3)
• An anova function written by John Fox (see the R-script
for this class) makes the analysis of deviance simpler to
implement:
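• As a sketch of the same idea using mgcv's built-in anova method
(not Fox's function; object names as assumed above):

  anova(mod.lin, mod.gam, test = "Chisq")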
• As we see here, the results are identical to those found on
the previous slide