Lecture 6: Statistical Inference for OLS Regression III: Advanced Methods
William G. Jacoby
Department of Political Science, Michigan State University
http://polisci.msu.edu/jacoby/icpsr/regress3

Statistical Inference for Linear Models: Assumptions of Multiple Regression
• Recall that the multiple regression model simply extends the simple regression model by adding more predictors. The model takes the following form:
  Yi = α + β1Xi1 + β2Xi2 + … + βkXik + εi
• The assumptions of the model concern the errors and are the same as those for simple regression:
  1. Linearity (i.e., the preceding specification is correct)
  2. Uncorrelated errors across observations
  3. Constant error variance
  4. The X's are independent of the errors
  5. Normally distributed errors
• When these assumptions are met, the OLS estimators have the same properties as the OLS estimators for simple regression

Assumptions about the error terms in OLS: 1. Linearity
• The expected value of Y is a linear function of the X's; it makes little sense to fit a linear model if the functional form of the relationship is not linear
• This also means that the average value of ε given X is equal to 0
• Simply put, we assume that a straight line adequately represents the relationship in the population

Assumptions about the error terms in OLS: 2. Independent Observations
• Each individual observation must be independent of the others. More specifically, the error terms, or equivalently the Y values, are independent of each other
• Although the assumption of independence is rarely perfect, in practice a random sample from a large population provides a close approximation
• Time series data, panel data, and clustered data often do not satisfy this condition; in these cases, dependencies among the errors can be quite strong
• If independence is violated, OLS is no longer the optimal method of estimation. The estimates remain unbiased, but the standard errors can be biased (they are almost always biased downward), which means that our estimates are not as precise as the standard errors indicate

Assumptions about the error terms in OLS: 3. Constant Error Variance
• Also known as homoscedasticity (from the Greek for "same variance"); the opposite is heteroscedasticity
• OLS assumes that the variance of the errors is the same regardless of X: V(ε|Xi) = σε². In other words, the degree of random noise is the same regardless of the values of the X's
• Since the distribution of the errors is the same as the distribution of Y around the regression line, the model also assumes constant conditional variance of Y given X:
  V(Y|Xi) = E[(Yi − α − βXi)²] = E(εi²) = σε²
• Nonconstant error variance does not bias the estimates, but it does affect efficiency; that is, the standard errors are usually inflated
• OLS is not optimal in the presence of nonconstant error variance; weighted least squares provides a better alternative

Assumptions about the error terms in OLS: 4. Mean Independence
• The X's are either fixed or independent of ε
• Only in experimental research can the X's be fixed. In observational studies, X and Y are sampled together, so we must assume that the independent variables and the errors are independent in the population from which they are drawn
• In other words, we assume that the error has the same distribution for all values of X in the population: ε ∼ N(0, σε²)
• The coefficients can be biased if this assumption is not met

Assumptions about the error terms in OLS: 5. Normality
• We also assume that the errors are normally distributed: ε ∼ N(0, σε²)
• This also implies that the conditional distribution of Y is normally distributed: Yi ∼ N(α + βXi, σε²)
• The normality assumption does NOT apply to the X's
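A minimal R sketch (not part of the original slides) of how these assumptions are commonly examined in practice: it fits an OLS regression to the Duncan occupational-prestige data that ship with the car package (using that data set here is an assumption; the slides reference it only later) and draws the standard residual diagnostic plots.

library(car)       # also provides the Duncan occupational-prestige data
data(Duncan)

## Fit prestige on income and education by OLS
fit <- lm(prestige ~ income + education, data = Duncan)

## Standard diagnostic plots for the assumptions above:
##   Residuals vs Fitted   -> linearity / mean-zero errors
##   Normal Q-Q            -> normality of the errors
##   Scale-Location        -> constant error variance
##   Residuals vs Leverage -> unusually influential observations
par(mfrow = c(2, 2))
plot(fit)

Patterns in the first and third plots suggest nonlinearity or nonconstant error variance, in which case remedies such as the weighted least squares approach mentioned above can be considered.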
Properties of the Least-Squares Estimators: Best Linear Unbiased Estimators (BLUE)
• If the assumptions of the model are met, the sample least-squares intercept and slope have some desirable properties as statistical estimators of the population parameters:
1. The slope B and intercept A are linear functions of the observations Yi. This facilitates deriving the sampling distributions of A and B
2. If the linearity assumption is met, A and B are also unbiased estimators of the population regression coefficients: E(A) = α and E(B) = β
3. If the assumptions of linearity, constant variance, and independence are satisfied, A and B have simple sampling variances:
  V(A) = σε² ΣXi² / [n Σ(Xi − X̄)²]   and   V(B) = σε² / Σ(Xi − X̄)²
  To examine when the estimates are precise, the variance of B can be rewritten as:
  V(B) = σε² / [(n − 1)Sx²]
  We see here that B is most precise with large sample sizes, when the error variance is small, and when the X values are spread out
4. Under the assumptions of linearity, constant error variance, and independence, the OLS estimators are the most efficient of all unbiased linear estimators (the Gauss-Markov theorem); they have the smallest sampling variance and mean-squared error. If we further satisfy the assumption of normality, they are the most efficient of all unbiased estimators. Robust methods can be more efficient, however, if normality is not satisfied
5. If all the assumptions are satisfied, the OLS coefficients are the maximum likelihood estimators of α and β
6. Finally, under the assumption of normality, the coefficients are also normally distributed

Measuring Goodness of Fit
• The process of fitting a model to data may be regarded as a way of replacing a set of data values (Y) with a set of fitted values derived from a model involving, usually, a relatively small number of parameters
  – In general the fitted values will not equal the observed values exactly
  – It is important to assess, then, how discrepant the fitted values are relative to the observed values
• The null model has one parameter, representing a common mean for all Y values
  – All variation is consigned to the random component
• The "full" model has n parameters, one for each observation, thus matching the data exactly
  – All the variation in Y is assigned to the systematic component, leaving none for the random component
• In practice, the null model is too simple and the full model is uninformative because it does not summarize the data

Correlation Coefficient
• R² is a relative measure of fit that gives the reduction in error attributable to the linear regression of Y on X
• It is calculated by comparing the regression sum of squares (RegSS) with the total sum of squares for Y (TSS). To calculate RegSS we first need the residual sum of squares (RSS):
  RSS = Σ Ei² = Σ (Yi − Ŷi)²,   TSS = Σ (Yi − Ȳ)²,   RegSS = TSS − RSS = Σ (Ŷi − Ȳ)²
• The ratio of RegSS to TSS gives us the proportional reduction in squared error associated with the regression model. This also defines the square of the correlation coefficient:
  r² = RegSS / TSS = 1 − RSS / TSS
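To make the sums-of-squares decomposition concrete, here is a small R sketch (again using the Duncan data as an assumed example) that computes TSS, RSS, RegSS, and r² directly and checks the result against the R² reported by summary().

library(car)
data(Duncan)
fit <- lm(prestige ~ income + education, data = Duncan)

y    <- Duncan$prestige
yhat <- fitted(fit)

TSS   <- sum((y - mean(y))^2)   # total sum of squares
RSS   <- sum((y - yhat)^2)      # residual sum of squares
RegSS <- TSS - RSS              # regression (explained) sum of squares

R2 <- RegSS / TSS
all.equal(R2, summary(fit)$r.squared)   # TRUE: the two calculations agree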
Correlation from the Sample Covariance
• r can also be derived from the covariance and standard deviations of two random variables
• We first must define the sample covariance:
  Sxy = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)
• The correlation coefficient is then defined as:
  r = Sxy / (Sx Sy)
• Unlike regression, it is clear from the symmetry of this equation that r does not assume one variable is functionally dependent on the other
• Also, unlike the slope coefficient B, which is measured in the units of the dependent variable, the correlation has no units of measurement

Standard Error of the Regression
• The standard deviation of the residuals, SE, also known as the standard error of the regression, provides another measure of how well the least-squares line fits the data
• Measured in the units of the dependent variable, the standard error of the regression represents an "average" of the residuals:
  SE = √[ Σ Ei² / (n − 2) ]
• For example, if SE = 5000 for a regression of income on education, we would conclude that, on average, predicting income from education results in an error of $5000

Standard Error and R² for Multiple Regression
• The standard error of the regression can also be calculated for multiple regression:
  SE = √[ Σ Ei² / (n − k − 1) ]
• As in simple regression, the variation in Y can be decomposed: TSS = RegSS (explained) + RSS (unexplained)
• The squared multiple correlation R² = RegSS / TSS tells us the proportion of variation in Y accounted for by the regression

Adjusted R²
• R² will always rise as more explanatory variables are added to a model; it will never decline
• As a result, some researchers prefer an adjusted R² that corrects for the degrees of freedom (i.e., the number of explanatory variables):
  Adjusted R² = 1 − [RSS / (n − k − 1)] / [TSS / (n − 1)]
• Unless the sample size is very small and/or some of the X's are linear combinations of the others, R² and the adjusted R² usually differ very little

Confidence Intervals and Hypothesis Tests
• The error variance is never known (if it were, we would also know the population slopes and intercepts), so we must estimate it from the variance of the residuals:
  SE² = Σ Ei² / (n − 2)
• Now we can estimate the sampling variances of A and B; for the slope:
  V̂(B) = SE² / Σ (Xi − X̄)²
• The estimated standard error of B is then:
  SE(B) = SE / √[ Σ (Xi − X̄)² ]

Confidence Intervals and Hypothesis Tests (continued)
• Reflecting the additional uncertainty of estimating the error variance from the sample, we use the t-distribution to construct hypothesis tests and confidence intervals
• The 100(1 − α)% confidence interval (CI) for the slope is:
  β = B ± t(α/2, n−2) SE(B)
• For a 95% CI, t.025 ≈ 2, unless n is very small
• The statistic for testing the hypothesis H0: β = β0 is:
  t0 = (B − β0) / SE(B)

Example of Inference for Simple Regression: Duncan Data

Sampling Variance of the Slope in Multiple Regression (1)
• Our main concern in multiple regression is with the slope coefficients
• The individual slopes Bj have the following sampling variance:
  V(Bj) = [1 / (1 − Rj²)] × [σε² / ((n − 1) Sj²)]
  where Rj² is the squared multiple correlation from the regression of Xj on the other X's, and Sj² is the variance of Xj

Sampling Variance of the Slope in Multiple Regression (2)
• The second factor in the equation for the variance of a slope is basically the equation for the variance of the slope in simple regression:
  σε² / ((n − 1) Sj²)
• The error variance, σε², is now smaller, since some of the error from the simple regression is incorporated into the systematic part of the model
• The first term, 1 / (1 − Rj²), is the Variance Inflation Factor (VIF). If an X is strongly correlated with another X, the VIF is large, and thus the estimate of the variance of the slope is large as well
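The following R sketch (an illustration, not from the slides) reproduces these quantities for the assumed Duncan example: a 95% confidence interval for one slope computed from its estimate and standard error, the same interval from confint(), and the variance inflation factors from vif() in the car package.

library(car)
data(Duncan)
fit <- lm(prestige ~ income + education, data = Duncan)

## Hand-computed 95% CI for the income slope: B +/- t(.025, n-k-1) * SE(B)
est   <- coef(summary(fit))["income", "Estimate"]
se    <- coef(summary(fit))["income", "Std. Error"]
tcrit <- qt(0.975, df.residual(fit))     # residual df = n - k - 1
c(lower = est - tcrit * se, upper = est + tcrit * se)

confint(fit, "income", level = 0.95)     # matches the hand calculation

## Variance inflation factors: 1 / (1 - Rj^2) for each predictor
vif(fit)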
Confidence Intervals and Hypothesis Tests for Slopes in Multiple Regression
• These follow the same pattern as for simple regression
• Since we do not know the error variance, σε², we substitute the variance of the residuals, which provides an unbiased estimator:
  SE² = Σ Ei² / (n − k − 1)
• The estimate of the standard error of Bj is:
  SE(Bj) = [1 / √(1 − Rj²)] × [SE / (√(n − 1) Sj)]
• Confidence intervals and hypothesis tests are based on the t-distribution with n − k − 1 degrees of freedom

Testing the Effects of All Slopes: Omnibus F-Tests
• The omnibus test asks whether all of the slopes are equal to zero:
  H0: β1 = β2 = … = βk = 0
• The F-statistic for this null hypothesis is:
  F0 = (RegSS / k) / (RSS / (n − k − 1))
• This test statistic has an F-distribution with k and n − k − 1 degrees of freedom
• It is quite possible to have a model with a significant overall fit even though none of the individual coefficients is statistically significant; this happens in the case of high multicollinearity

Omnibus F-Test: An Example Using the Duncan Data

Incremental F-Tests (1): Testing a Subset of Slopes
• Incremental F-tests compare the RegSS from one model with the RegSS from another model nested within the first one
• If the difference is not large relative to the difference in degrees of freedom, it will not be statistically significant
• Incremental F-tests can be calculated by hand by fitting nested models and comparing their regression sums of squares, but they are produced more easily with the anova function in R (a brief sketch appears at the end of these notes)

Incremental F-Tests (2): Type I, II, and III Tests
• Type I tests add terms sequentially. Each test compares a new model with the previous model that did not include the term. Only after the last term is added is the full model compared; for early tests, the new model is NOT compared to the full model. Seldom a very helpful test
• Type II tests assess each term after all others in the model, ignoring any higher-order relatives (i.e., interaction terms related to the variable). Generally the most useful test
• Type III tests violate marginality, testing each term in the model after all of the others, including the interaction terms
• The anova function included with base R calculates Type I tests. Type II and III tests can be calculated using the Anova function in the car package

Log Likelihood-Ratio Statistic
• The full model gives a baseline for measuring the discrepancy of an intermediate model with p parameters
• The discrepancy of a fit is proportional to twice the difference between the maximum log likelihood achievable and that achieved by the model under investigation
• This discrepancy is known as the deviance for the current model and is a function of the data only
• Simply put, the generalized likelihood ratio replaces the residual sum of squares with the deviance (G²):
  G² = 2[ logeL(full model) − logeL(model under investigation) ]

Properties of the Log Likelihood-Ratio Statistic
• The likelihood-ratio test statistic is asymptotically distributed as χ² under the null hypothesis
• The degrees of freedom associated with the deviance are equal to n − p (where p is the number of parameters)
• The deviance is additive for nested models if maximum-likelihood estimates are used
• An analysis of deviance is especially useful for testing the effects of factors and interactions
• Because the likelihood-ratio test is generalizable to models other than linear regression, it can be used in place of the F-test for linear models

Next Topics
• Linear Models II: Effective Presentation
• The Vector Representation of Regression
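As a closing sketch (not part of the original slides, and again assuming the Duncan example), the code referenced in the incremental F-test discussion above shows the omnibus F-test from summary(), an incremental F-test from anova() on nested models, the sequential Type I tests from anova(), and Type II tests from Anova() in the car package.

library(car)
data(Duncan)

full    <- lm(prestige ~ income + education, data = Duncan)
reduced <- lm(prestige ~ income, data = Duncan)   # nested within full

summary(full)          # omnibus F-test appears in the final line of the output
anova(reduced, full)   # incremental F-test for adding education
anova(full)            # Type I (sequential) tests
Anova(full, type = 2)  # Type II tests from the car package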