Statistical inference

Lecture 6: Statistical Inference for OLS
Regression III:
Advanced Methods
William G. Jacoby
Department of Political Science
Michigan State University
http://polisci.msu.edu/jacoby/icpsr/regress3
Statistical Inference for Linear Models:
Assumptions of Multiple Regression
• Recall that the multiple regression model simply extends the simple regression model by adding more predictors. The model takes the following form:
Yi = α + β1Xi1 + β2Xi2 + … + βkXik + εi
• The assumptions of the model concern the errors and are the same as those for simple regression:
1. Linearity (i.e., the preceding specification is correct)
2. Uncorrelated errors across observations
3. Constant error variance
4. X’s are independent of the errors
5. Normally distributed errors
• When these assumptions are met, the OLS estimators
have the same properties as the OLS estimators for simple
regression
2
Assumptions about the error terms in OLS
1. Linearity
• The expected value of Y is a linear function of the
X’s—it makes little sense to fit a linear model if the
functional form of the relationship is not linear
• This also means, then, that the average value of ε
given X is equal to 0
• Simply put, we assume that a straight line adequately
represents the relationship in the population
3
Assumptions about the error terms in OLS
2. Independent observations
• Each individual observation must be independent of the
others. More specifically, the error terms, or
equivalently the Y values, are independent of each other
• Although the assumption of independence is rarely
perfect, in practice a random sample from a large
population provides a close approximation
• Time series data, panel data, and clustered data often
do not satisfy this condition—in these cases,
dependencies among the errors can be quite strong
• If independence is violated, OLS is no longer the
optimal method of estimation—the estimates remain
unbiased, but the standard errors can be biased
(they are almost always biased downward). This
means that our estimates are not as precise as the
standard errors indicate
4
Assumptions about the error terms in OLS
3. Constant Error Variance
• Also known as homoscedasticity (Greek for “same variance”); the opposite is heteroscedasticity
• OLS assumes that the variance of the errors is the same regardless of X: V(ε|xi) = σε². In other words, the degree of random noise is the same regardless of the value of the X’s
• Since the distribution of the errors is the same as the distribution of Y around the regression line, the model also assumes constant conditional variance of Y given X:
V(Y|xi) = E[(Yi − α − βxi)²] = E(εi²) = σε²
• Nonconstant error variance does not bias the
estimates, but does affect efficiency—i.e., the
standard errors are usually inflated
• OLS is not optimal in the presence of nonconstant error
variance—weighted least squares provides a better
alternative
5
Assumptions about the error terms in OLS
4. Mean independence
• X’s are either fixed, or independent of ε
• Only in experimental research can the X’s be fixed. In
observational studies, X and Y are sampled together
and thus we must assume that the independent
variable and errors are independent in the population
from which they are drawn
• In other words, we assume that the error has the same distribution for all values of X in the population: N(0, σε²)
• Coefficients can be biased if this assumption is not
met
6
Assumptions about the error terms in OLS
5. Normality
• We also assume that the errors are normally distributed:
ε ∼ N(0, σε²)
• This also implies that the conditional distribution of Y is normally distributed: Yi ∼ N(α + βxi, σε²)
• The normality assumption does NOT apply to the X’s
7
Properties of the Least-Squares Estimators:
Best Linear Unbiased Estimators (BLUE)
• If the assumptions of the model are met, the sample
least-squares intercept and slope have some desirable
properties as statistical estimators of the population
parameters:
1. The slope B and intercept A are linear functions of the observations Yi. This facilitates deriving the sampling distributions of A and B
2. If the linearity assumption is met, A and B are also unbiased estimators of the population regression coefficients:
E(A) = α and E(B) = β
8
3. If the assumptions of linearity, constant variance, and independence are satisfied, A and B have simple sampling variances:
V(A) = σε² Σxi² / (n Σ(xi − x̄)²)    V(B) = σε² / Σ(xi − x̄)²
In order to examine when the estimates are precise, we can rewrite the equation for V(B) as follows:
V(B) = (1/(n − 1)) × (σε² / Sx²), where Sx² is the sample variance of the X-values
We see here that B is most precise with large sample sizes, when the error variance is small, and when the X-values are spread out
9
4. Under the assumptions of linearity, constant error variance, and independence, OLS estimators are the most efficient of all unbiased linear estimators (Gauss-Markov theorem): they have the smallest sampling variance and mean-squared error. If we further satisfy the assumption of normality, they are the most efficient of all unbiased estimators. Other robust methods are more efficient, however, if normality is not satisfied.
5. If all the assumptions are satisfied, the OLS coefficients
are the maximum likelihood estimators of α and β
6. Finally, under the assumption of normality, the
coefficients are also normally distributed
10
Measuring Goodness of Fit
• The process of fitting a model to data may be regarded as a way of replacing a set of data values (Y) by a set of fitted values derived from a model that usually involves a relatively small number of parameters
– In general the predicted values will not equal the observed
exactly
– It is important to assess, then, how discrepant the fitted
values are (relative to the observed values)
• The null model has one parameter, representing a common mean for all Y values
– All variation is consigned to the random component
• The “full” model has n parameters, one for each observation,
thus matching the data exactly
– All the variation in Y is assigned to the systematic
component, leaving none for the random component
• In practice, the null model is too simple and the full model
uninformative because it does not summarize the data
11
Correlation Coefficient
• R² is a relative measure of fit that determines the reduction in error attributable to the linear regression of Y on X
• It is calculated by comparing the regression sum of squares (RegSS) with the total sum of squares for Y (TSS). To calculate RegSS we first need to calculate the residual sum of squares (RSS):
TSS = Σ(Yi − Ȳ)²    RSS = Σ(Yi − Ŷi)² = ΣEi²    RegSS = TSS − RSS
• The ratio of RegSS to TSS gives us the proportional reduction in squared error associated with the regression model. This also defines the square of the correlation coefficient (see the sketch below):
r² = RegSS / TSS = 1 − RSS / TSS
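As an illustration, the following R sketch computes these quantities by hand, assuming the Duncan occupational prestige data from the car package (used later in the lecture); the choice of predictor here is only for illustration:
  library(car)                        # provides the Duncan data
  fit <- lm(prestige ~ income, data = Duncan)
  TSS   <- sum((Duncan$prestige - mean(Duncan$prestige))^2)
  RSS   <- sum(residuals(fit)^2)
  RegSS <- TSS - RSS
  RegSS / TSS                         # proportional reduction in squared error
  summary(fit)$r.squared              # R reports the same value as R-squared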
12
Correlation from the sample covariance
r can also be derived from the covariance and standard deviations of two random variables
• We first must define the sample covariance:
SXY = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)
• The correlation coefficient is then defined as:
r = SXY / (SX SY)
• Unlike regression, it is clear from the symmetry of the equation above that r does not assume one variable is functionally dependent on the other
• Also, unlike the slope coefficient B, which is measured in the units of the dependent variable, correlation has no units of measurement (see the sketch below)
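A companion sketch, again assuming the Duncan data, shows that the hand calculation from the covariance and standard deviations matches R's cor():
  library(car)
  sxy <- cov(Duncan$income, Duncan$prestige)        # sample covariance
  sxy / (sd(Duncan$income) * sd(Duncan$prestige))   # r from covariance and SDs
  cor(Duncan$income, Duncan$prestige)               # same value; r is symmetric in X and Y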
13
Standard Error of the Regression
• The standard deviation of the residuals, SE, also known as the standard error of the regression, provides another measure of how well the least-squares line fits the data
• Measured in the units of the dependent variable, the standard error of the regression represents an “average” of the residuals:
SE = √(ΣEi² / (n − 2))
• For example, if SE = 5000 for a regression of income on education, we would conclude that, on average, predicting income from education results in an error of $5000 (see the sketch below)
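A minimal sketch of this calculation in R, assuming the Duncan data and an illustrative simple regression:
  library(car)
  fit <- lm(prestige ~ income, data = Duncan)
  n <- nrow(Duncan)
  sqrt(sum(residuals(fit)^2) / (n - 2))   # standard error of the regression by hand
  sigma(fit)                              # the same quantity as reported by R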
14
Standard Error and R2 for
Multiple Regression
• The standard error of the regression can also be calculated for multiple regression:
SE = √(ΣEi² / (n − k − 1))
• As in simple regression, the variation in Y can be decomposed:
TSS = RegSS (explained) + RSS (unexplained)
• The squared multiple correlation R² tells us the proportion of variation in Y accounted for by the regression: R² = RegSS / TSS
15
Adjusted R2
• R² will always rise as more explanatory variables are added to a model; it will never decline
• As a result, some researchers prefer to use an adjusted R² that corrects for the degrees of freedom (i.e., the number of explanatory variables):
adjusted R² = 1 − [RSS / (n − k − 1)] / [TSS / (n − 1)]
• Unless the sample size is very small and/or some of the X’s are linear combinations of the others, the R² and the adjusted R² usually differ very little (see the sketch below)
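The adjustment can be verified by hand; a sketch assuming the Duncan data with two illustrative predictors:
  library(car)
  fit <- lm(prestige ~ income + education, data = Duncan)
  n <- nrow(Duncan); k <- 2
  RSS <- sum(residuals(fit)^2)
  TSS <- sum((Duncan$prestige - mean(Duncan$prestige))^2)
  1 - (RSS / (n - k - 1)) / (TSS / (n - 1))   # adjusted R-squared by hand
  summary(fit)$adj.r.squared                  # matches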
16
Confidence Intervals and Hypothesis Tests
• The error variance is never known (if it were, we would also know the population slopes and intercepts), so we must estimate it from the variance of the residuals:
SE² = ΣEi² / (n − 2)
• Now we can estimate the sampling variances of A and B by substituting SE² for σε² in the formulas given earlier:
V(A) = SE² Σxi² / (n Σ(xi − x̄)²)    V(B) = SE² / Σ(xi − x̄)²
• The estimated standard error of B is then:
SE(B) = SE / √Σ(xi − x̄)²
17
Confidence Intervals and Hypothesis Tests
• Reflecting the additional uncertainty of estimating the error variance from the sample, we use the t-distribution to construct hypothesis tests and confidence intervals
• The 100(1 − α)% confidence interval (CI) for the slope is:
β = B ± tα/2 SE(B), with n − 2 degrees of freedom
• For a 95% CI, t.025 ≈ 2, unless n is very small
• The formula to test the hypothesis H0: β = β0 is:
t0 = (B − β0) / SE(B)
18
Example of inference for simple regression:
Duncan data
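The slide's output is not reproduced here; the following sketch shows the kind of analysis it reports, assuming prestige is regressed on income in the Duncan data:
  library(car)                                 # Duncan occupational prestige data
  fit <- lm(prestige ~ income, data = Duncan)
  summary(fit)      # slope, standard error, t-statistic, and p-value
  confint(fit)      # 95% confidence intervals based on the t-distribution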
19
Sampling Variance of the Slope in
Multiple Regression (1)
• Our main concern in multiple regression is with the slope coefficients
• The individual slopes Bj have the following sampling variance:
V(Bj) = [1 / (1 − Rj²)] × [σε² / ((n − 1) Sj²)]
where Rj² is the squared multiple correlation from the regression of Xj on all of the other X’s, and Sj² is the sample variance of Xj
20
Sampling Variance of the Slope in
Multiple Regression (2)
• The second factor in the equation for the variance of a slope is basically the equation for the variance of the slope in simple regression:
σε² / ((n − 1) Sj²) = σε² / Σ(xij − x̄j)²
• The error variance, σε², is now smaller since some of the error from the simple regression is incorporated into the systematic part of the model
• The first factor, 1/(1 − Rj²), is the Variance Inflation Factor (VIF). If an X is strongly correlated with the other X’s, the VIF is large, and thus the estimate of the variance of the slope is large as well (see the sketch below)
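A sketch of the VIF calculation, assuming the Duncan data with income and education as predictors:
  library(car)
  fit <- lm(prestige ~ income + education, data = Duncan)
  r2.income <- summary(lm(income ~ education, data = Duncan))$r.squared
  1 / (1 - r2.income)    # VIF for income, computed by hand
  vif(fit)               # car's vif() reports the VIF for every predictor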
21
Confidence Intervals and Hypothesis Tests
for Slopes in Multiple Regression (1)
• These follow the same pattern as for simple regression
• Since we do not know the error variance, σε², we substitute the variance of the residuals, which provides an unbiased estimator:
SE² = ΣEi² / (n − k − 1)
• The estimate of the standard error of Bj is:
SE(Bj) = [1 / √(1 − Rj²)] × [SE / √Σ(xij − x̄j)²]
• Confidence intervals and hypothesis tests are based on the t-distribution with n − k − 1 degrees of freedom
22
Testing the effects of all slopes:
Omnibus F-Tests
• Tests whether all of the slopes are equal to zero:
H0: β1 = β2 = … = βk = 0
• The F-test for this null hypothesis is given by:
F0 = (RegSS / k) / (RSS / (n − k − 1))
• This test has an F-distribution with k and n − k − 1 degrees of freedom
• It is quite possible to have a model with a significant fit even though none of the individual coefficients is statistically significant; this can happen when there is a high degree of multicollinearity
23
Omnibus F-test
An example using Duncan data
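The slide's output is not reproduced here; a sketch of the omnibus test, assuming prestige regressed on income and education:
  library(car)
  fit <- lm(prestige ~ income + education, data = Duncan)
  summary(fit)    # the F-statistic at the bottom tests H0: all slopes equal zero
  anova(lm(prestige ~ 1, data = Duncan), fit)   # same test via the null model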
24
Incremental F-Tests (1)
Testing a subset of slopes
• Incremental F-tests compare the RegSS from one model with the RegSS from another model nested within the first one
• If the difference is not large relative to the difference in degrees of freedom, it will not be statistically significant
• F-tests can easily be calculated by fitting nested models and comparing the regression sums of squares, but they are produced more easily using the anova function in R (see the sketch below)
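A minimal sketch of an incremental F-test, assuming the Duncan data and an illustrative pair of nested models:
  library(car)
  m1 <- lm(prestige ~ income, data = Duncan)                    # restricted model
  m2 <- lm(prestige ~ income + education + type, data = Duncan) # full model
  anova(m1, m2)   # incremental F-test for the terms added in m2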
25
Incremental F-Tests (2)
Type I, II, and III Tests
Type I Tests add terms sequentially. Each test compares a new model with the previous model that did not include the term. Only after the last term is added is the comparison made against the full model; for earlier tests, the new model is NOT compared to the full model. Seldom a very helpful test.
Type II Tests assess each term after all others in the model, ignoring any higher-order relatives (i.e., interaction terms related to the variable). Generally the most useful test.
Type III Tests violate marginality, testing each term in the model after all of the others, including the interaction terms
• The function anova included in base R calculates Type I tests. Type II and III tests can be calculated using the Anova function in the car package (see the sketch below).
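A sketch contrasting the test types, assuming an illustrative Duncan model with an interaction:
  library(car)
  fit <- lm(prestige ~ income * type, data = Duncan)
  anova(fit)                 # Type I (sequential) tests
  Anova(fit, type = "II")    # Type II tests, respecting marginality
  Anova(fit, type = "III")   # Type III tests, which violate marginality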
26
Log Likelihood-Ratio Statistic
• The full model gives a baseline for measuring the discrepancy of an intermediate model with p parameters
• The discrepancy of a fit is proportional to twice the difference between the maximum log likelihood achievable and that achieved by the model under investigation
• This discrepancy is known as the deviance for the current model and is a function of the data only
• Simply put, the generalized likelihood-ratio test replaces the residual sum of squares with the deviance (G²):
G² = 2(loge Lfull − loge Lmodel)
Nested models are then compared by the difference in their deviances (see the sketch below)
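A sketch of the likelihood-ratio comparison for two nested linear models, assuming the Duncan data:
  library(car)
  m0 <- lm(prestige ~ income, data = Duncan)
  m1 <- lm(prestige ~ income + education, data = Duncan)
  G2 <- as.numeric(2 * (logLik(m1) - logLik(m0)))   # likelihood-ratio statistic
  G2
  pchisq(G2, df = 1, lower.tail = FALSE)            # asymptotic chi-squared p-value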
27
Properties of the
Log Likelihood-Ratio Statistic
• The likelihood-ratio test is asymptotically distributed as χ² under the null hypothesis
• The degrees of freedom associated with the deviance is
equal to n-p (where p is the number of parameters)
• The deviance is additive for nested models if
maximum-likelihood estimates are used
• An analysis of deviance is especially useful for testing
the effects of factors and interactions
• The likelihood-ratio test is generalizable to models other than linear regression; it can also be used in place of the F-test for linear models
28
Next Topics
• Linear Models II: Effective Presentation
• The Vector Representation of Regression
29