ST430: Introduction to Regression Analysis, Ch. 3, Sec. 3.1-3.6
Luo Xiao
August 31, 2015

Simple Linear Regression

Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates) X1, X2, ..., Xp.

The straight-line probabilistic model

The simplest case of a regression model:

- one independent variable: p = 1, X1 ≡ X;
- linear dependence;
- model equation:

\[ E(Y) = \beta_0 + \beta_1 X, \quad \text{or equivalently} \quad Y = \beta_0 + \beta_1 X + \epsilon. \]

Interpreting the parameters:

- β0 is the intercept (so called because it is where the graph of y = β0 + β1 x meets the y-axis, at x = 0);
- β1 is the slope; that is, the change in E(Y) as X = x is changed to X = x + 1.

Note: if β1 = 0, X has no effect on Y; that will often be an interesting hypothesis to test.

[Figure: graph of a straight line.]

Advertising and Sales example

- x = monthly advertising expenditure, in hundreds of dollars;
- y = monthly sales revenue, in thousands of dollars;
- β0 = expected revenue with no advertising;
- β1 = expected revenue increase per $100 increase in advertising, in thousands of dollars.

Sample data for five months:

Advertising (x):  1  2  3  4  5
Revenue (y):      1  1  2  2  4

Scatterplot

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
plot(x, y, type = "p", pch = 19,
     xlim = c(0,5), ylim = c(0,4),
     xlab = "Advertising expenditure", ylab = "Sales revenue")

[R output: scatterplot of sales revenue against advertising expenditure.]

We could try various values of β0 and β1. For given values of β0 and β1, we get predictions

\[ p_i = \beta_0 + \beta_1 x_i, \quad i = 1, 2, 3, 4, 5. \]

The difference between the observed value yi and the prediction pi is the residual

\[ r_i = y_i - p_i, \quad i = 1, 2, 3, 4, 5. \]

A good choice of β0 and β1 gives accurate predictions and generally small residuals.

One candidate line (β0 = −0.1, β1 = 0.7):

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
plot(x, y, type = "p", pch = 19,
     xlim = c(0,5), ylim = c(0,4),
     xlab = "Advertising expenditure", ylab = "Sales revenue")
abline(a = -0.1, b = 0.7, col = "red")

[R output: the scatterplot with the candidate line drawn in red.]

Fitting the model

How do we measure the overall size of the residuals? The most common measure (but not the only possibility) is the sum of squares of the residuals:

\[ \sum r_i^2 = \sum (y_i - p_i)^2 = \sum \{y_i - (\beta_0 + \beta_1 x_i)\}^2 = S(\beta_0, \beta_1). \]

The least squares line is the one with the smallest sum of squares.

Note: the least squares line has the property that Σ ri = 0; Definition 3.1 (page 95) does not need to impose that as a constraint.
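To see the criterion in action, here is a minimal sketch that evaluates S(β0, β1) on the advertising data; the function name sse and the second candidate line are our own illustration, not from the slides.

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)

# S(b0, b1): sum of squared residuals for the candidate line y = b0 + b1*x
sse = function(b0, b1) {
  p = b0 + b1 * x   # predictions p_i
  r = y - p         # residuals r_i
  sum(r^2)
}

sse(-0.1, 0.7)   # 1.10 for the candidate line plotted above
sse(0, 0.5)      # 2.75: a worse-fitting candidate gives a larger sum of squares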
The least squares estimates of β0 and β1 are the coefficients of the least squares line. Some algebra shows that the least squares estimates are

\[ \hat\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \]

and

\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}. \]

Other criteria

Why square the residuals? We could instead use the least absolute deviations estimates, which minimize

\[ S_1(\beta_0, \beta_1) = \sum |y_i - (\beta_0 + \beta_1 x_i)|. \]

Convenience: we have closed-form expressions for the least squares estimates, but to find the least absolute deviations estimates we have to solve a linear programming problem.

Optimality: least squares estimates are BLUE if the errors are uncorrelated with constant variance, and MVUE if additionally the errors are normal.

Model assumptions

The least squares line gives point estimates of β0 and β1, and these estimates are always unbiased. To use the other forms of statistical inference (interval estimates, such as confidence intervals, and hypothesis tests), we need some assumptions about the random errors ε.

1. Zero mean: E(εi) = 0. As noted earlier, this is not really an assumption but a consequence of the definition ε = Y − E(Y).
2. Constant variance: V(εi) = σ². This is a nontrivial assumption, often violated in practice.
3. Normality: εi ∼ N(0, σ²). This is also a nontrivial assumption, always violated in practice, but sometimes a useful approximation.
4. Independence: εi and εj are statistically independent for i ≠ j. Another nontrivial assumption, often true in practice, but typically violated with time series and spatial data.

Notes:

- Assumptions 2 and 4 are the conditions under which least squares estimates are BLUE (Best Linear Unbiased Estimators);
- Assumptions 2, 3, and 4 are the conditions under which least squares estimates are MVUE (Minimum Variance Unbiased Estimators).

Estimating σ²

Recall that σ² is the variance of εi, which we have assumed to be the same for all i. That is,

\[ \sigma^2 = V(\epsilon_i) = V[Y_i - E(Y_i)] = V[Y_i - (\beta_0 + \beta_1 X_i)], \quad i = 1, 2, \ldots, n. \]

We observe Yi = yi and Xi = xi; if we knew β0 and β1, we would estimate σ² by

\[ \frac{1}{n} \sum \{y_i - (\beta_0 + \beta_1 x_i)\}^2 = \frac{1}{n} S(\beta_0, \beta_1). \]

We do not know β0 and β1, but we have the least squares estimates β̂0 and β̂1, so we could use S(β̂0, β̂1) as an approximation to S(β0, β1). But we know that S(β̂0, β̂1) < S(β0, β1), so (1/n) S(β̂0, β̂1) would be a biased estimate of σ²; it systematically underestimates it.

We can show that, under Assumptions 2 and 4,

\[ E[S(\hat\beta_0, \hat\beta_1)] = (n - 2)\sigma^2. \]

So

\[ s^2 = \frac{1}{n-2} S(\hat\beta_0, \hat\beta_1) = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2, \]

where ŷi = β̂0 + β̂1 xi, is an unbiased estimate of σ². This is sometimes written

\[ s^2 = \text{Mean Square Error} = \text{MSE} = \frac{\text{SSE}}{\text{degrees of freedom for Error}}. \]
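Putting this section's formulas together, here is a minimal sketch that computes β̂1, β̂0, and s² directly from the advertising data; the variable names b0, b1, s2 are ours.

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
n = length(x)

# Least squares estimates from the closed-form formulas
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 7/10 = 0.7
b0 = mean(y) - b1 * mean(x)                                     # 2 - 0.7*3 = -0.1

# Unbiased estimate of sigma^2: s^2 = SSE/(n - 2)
yhat = b0 + b1 * x
s2 = sum((y - yhat)^2) / (n - 2)   # 1.1/3 = 0.3667

Note that (β̂0, β̂1) = (−0.1, 0.7) is exactly the candidate line plotted earlier, and s² = 0.3667 matches the residual Mean Sq in the anova(fit) output shown later.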
Inferences about the line

We are often interested in whether X has any effect on E(Y). Since E(Y) = β0 + β1 X, the independent variable X has some effect whenever β1 ≠ 0. So we need to test the null hypothesis H0: β1 = 0.

We also need to construct a confidence interval for β1, to indicate how precisely we know its value. For both purposes, we need the standard error

\[ \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{SS_{xx}}}, \quad \text{where } SS_{xx} = \sum (x_i - \bar{x})^2. \]

As always, since σ is unknown, we replace it by its estimate s, to get the estimated standard error

\[ \hat\sigma_{\hat\beta_1} = \frac{s}{\sqrt{SS_{xx}}}. \]

A confidence interval for β1 is

\[ \hat\beta_1 \pm t_{\alpha/2, n-2} \times \hat\sigma_{\hat\beta_1}. \]

Note that we use the t-distribution with n − 2 degrees of freedom, because that is the number of degrees of freedom associated with s².

To test H0: β1 = 0, we use the test statistic

\[ t = \frac{\hat\beta_1}{\hat\sigma_{\hat\beta_1}}, \]

and reject H0 at the significance level α if |t| > t_{α/2, n−2}.

Compare the confidence interval and the hypothesis test

Note that we reject H0 if and only if the corresponding 100(1 − α)% confidence interval does not include 0.

Linear regression in R: Advertising and Sales example

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
fit = lm(y ~ x)
summary(fit)

Try it yourself to get the R output!

Construct a confidence interval in R

Use the R command confint(fit).

R output:

##                   2.5 %   97.5 %
## (Intercept) -2.12112485 1.921125
## x            0.09060793 1.309392

ANOVA table

Use the R command anova(fit).

R output:

## Analysis of Variance Table
##
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x          1    4.9  4.9000  13.364 0.03535 *
## Residuals  3    1.1  0.3667
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
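As a check on the confint(fit) and anova(fit) output, here is a minimal sketch that rebuilds the standard error, the t statistic, and the 95% confidence interval from the formulas in this section; the variable names are ours.

R code:

x = c(1,2,3,4,5)
y = c(1,1,2,2,4)
n = length(x)
fit = lm(y ~ x)

b1    = coef(fit)["x"]                     # 0.7
s     = sqrt(sum(resid(fit)^2) / (n - 2))  # s = sqrt(MSE) = 0.6055
SSxx  = sum((x - mean(x))^2)               # 10
se_b1 = s / sqrt(SSxx)                     # estimated standard error, 0.1915

t_stat = b1 / se_b1                        # 3.656; note t^2 = 13.36, the F value above
b1 + c(-1, 1) * qt(0.975, n - 2) * se_b1   # 0.0906 1.3094, matching confint(fit)["x", ]

Since |t| = 3.66 exceeds t_{0.025, 3} = 3.18, we reject H0: β1 = 0 at the 5% level, consistent with the p-value 0.0354 in the ANOVA table.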