Lean Six Sigma: Simple Linear Regression

Outline
• Introduction
• Simple Linear Regression
• Regression vs. ANOVA
• Correlation and R²
• Residuals
• Key Points

Introduction

In this section, we introduce simple linear regression as a method for studying the relationship between a continuous response and one continuous factor. One-way ANOVA and simple linear regression are conceptually very different. ANOVA allows us to determine whether different factor levels have a significant effect on the response. Regression allows us to model the relationship between the continuous experimental factor and the response.

• ANOVA: X nominal, Y continuous
• Regression: X continuous, Y continuous

Simple Linear Regression

Linear regression is used when experimental factors (also called regressors or predictors) are continuous. Linear regression is used to develop a model (a relationship) between the response and the factors. This model is used to predict the average response at various factor settings. When there is only one factor of interest, a simple linear regression model fits a linear relationship between the factor and the response.

When modeling a relationship, we need to verify that the proposed model fits the data. This is done by first fitting the model, then checking to see whether any systematic relationships have been missed.

Example: Suppose we want to understand the relationship between weights and heights for adults in a fitness study. We could plot weight versus height:

[Scatterplot: Weight (80–220 lb) versus Height (60–76 in) for the fitness study]

Does there seem to be a relationship between the two variables?

A scatterplot provides some understanding of the relationship. The graph suggests that the two variables may be linearly related: for given values of height, the average weights fall more or less on a straight line.

A simple linear regression model describes the mathematical relationship between two variables. The theoretical model for a simple regression line is:

    Y = β₀ + β₁X + ε

In this equation, Y is the response and X is the predictor. There are two coefficients, the intercept (β₀) and the slope (β₁). Coefficients define the mathematical relationship between the response and the predictor(s). Random variability is represented by ε.

For a regression analysis to be appropriate, the underlying population must satisfy certain assumptions:
• For any given X, the values of Y are normally distributed.
• The means of the response Y depend linearly on X.
• The standard deviation of Y does not depend on X; that is, it is constant.

In terms of the data obtained for fitting a regression, the values of Y must be independent of one another (a key assumption). For the most part, regression is fairly robust to the assumptions about the population. Only substantial departures from any of the assumptions invalidate conclusions from a regression analysis.

For each X, the response Y has a normal distribution with mean μ(Y|X) = β₀ + β₁X and variance σ².

[Diagram: normal distributions of Y, centered on the line μ(Y|X) = β₀ + β₁X, at several values of X]
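To make the model concrete, here is a minimal sketch of fitting the line by least squares. The height/weight values below are hypothetical stand-ins, since the fitness-study data are not listed in these slides; the estimates b₁ = Sxy/Sxx and b₀ = ȳ − b₁x̄ are the standard least-squares formulas.

```python
import numpy as np

# Hypothetical height (in) and weight (lb) data -- stand-ins for the
# fitness-study values, which are not listed in these slides.
height = np.array([62, 64, 66, 67, 69, 71, 73, 75], dtype=float)
weight = np.array([110, 125, 138, 145, 160, 172, 188, 205], dtype=float)

# Least-squares estimates of the model Y = b0 + b1*X + e:
#   slope b1 = Sxy / Sxx, intercept b0 = ybar - b1 * xbar
x_bar, y_bar = height.mean(), weight.mean()
sxx = np.sum((height - x_bar) ** 2)
sxy = np.sum((height - x_bar) * (weight - y_bar))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"fitted line: weight = {b0:.2f} + {b1:.2f} * height")
```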
Caution: Association, not causality! Regression, with observational data, cannot be used to establish a causal relationship between response variables and predictors. Regression analysis, along with ANOVA, can only be used to quantify an association between two or more variables. Establishing causality requires experimental proof that a change in X produces a change in Y; designed experiments are the most efficient method for obtaining such proof. Correlation does not imply causality!

Regression vs. ANOVA

How do the approaches differ?
• ANOVA: Do different factor levels have different effects on the response?
• Regression: What is the nature of the relationship between the response and the predictors? How does the response change as we increase or decrease the predictor settings? Predict the response one would obtain at given settings of the predictors.

[Side-by-side plots: Lumens by Phosphor level (an ANOVA view) and PSI versus %HrdWd with a linear fit (a regression view)]

Example: The tensile strength of a paper product is of interest. It is believed that percent hardwood, %HrdWd, improves strength, but no real knowledge exists. An experiment is designed with %HrdWd at four levels: 5%, 10%, 15%, and 20%. Six replications of each experimental setting are obtained, and the treatment conditions are reset for every run so that experimental variation is captured. The 24 trials are run in random order. The resulting data:

%HrdWd   PSI (six replicates)
5        7, 8, 15, 11, 9, 10
10       12, 17, 13, 18, 19, 15
15       14, 18, 19, 17, 16, 18
20       19, 25, 22, 18, 20, 23

Does there appear to be a relationship between percent hardwood and PSI? What is the apparent nature of the relationship? We will analyze the data using both ANOVA and regression to illustrate the differences between the two methods.

Based on the ANOVA analysis and the comparison of the means, we would conclude that 10% and 15% hardwood do not have different effects on PSI values.

Now, let's look at the regression analysis of the data, using simple linear regression. To obtain this analysis, change %HrdWd to a continuous factor. The response is PSI and the predictor is %HrdWd.

Bivariate Fit of PSI by %HrdWd

The scatterplot now shows a fitted regression line. The equation of this line allows us to predict PSI at different %HrdWd values. Of course, we should only make predictions within the range of the data. Additional options are available under the red arrow next to "Linear Fit".

Linear Fit: PSI = 7.25 + 0.6966667 %HrdWd

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1        364.00833       364.008   53.7642     <.0001
Error      22        148.95000         6.770
C. Total   23        512.95833

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   7.25        1.301005    5.57      <.0001
%HrdWd      0.6966667   0.095012    7.33      <.0001

Select Confid Curves Indiv for confidence intervals for individual values, and Confid Curves Fit for confidence intervals for the predicted response (the population mean) at different settings of %HrdWd.

[Plot: the fitted line with a 95% CI for the mean response (inner band) and a 95% CI for individual observations (outer band)]

Regression gives us a statistical model that allows us to predict PSI at different percent hardwood values; ANOVA gives us much less information. Do regression and ANOVA lead to different conclusions about the relationship between percent hardwood and PSI?
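As a cross-check on the JMP output above, here is a sketch that refits the same model in Python with SciPy (an assumed tool choice; the slides themselves use JMP). It reproduces the Linear Fit equation from the hardwood data and predicts PSI at 14% hardwood, a value used again later in this section.

```python
import numpy as np
from scipy import stats

# The 24 hardwood trials from the slides: four %HrdWd levels, six replicates each.
hrdwd = np.repeat([5, 10, 15, 20], 6).astype(float)
psi = np.array([ 7,  8, 15, 11,  9, 10,
                12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18,
                19, 25, 22, 18, 20, 23], dtype=float)

# Ordinary least squares fit of PSI on %HrdWd.
fit = stats.linregress(hrdwd, psi)
print(f"PSI = {fit.intercept:.2f} + {fit.slope:.7f} * %HrdWd")  # 7.25 + 0.6966667
print(f"p-value for the slope: {fit.pvalue:.2e}")               # < .0001

# Predicted mean PSI at 14% hardwood (within the 5-20% data range).
print(f"predicted PSI at 14%: {fit.intercept + fit.slope * 14:.2f}")  # ~17.00
```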
In a designed experiment, a continuous factor is only run at a small number of distinct settings. In such a case, both ANOVA and regression can be used to analyze the resulting data. When a factor is set at only two levels, the results of linear regression and ANOVA are identical. As the hardwood example shows, however, we may draw different conclusions depending on the approach we use to analyze our experimental results: when more than two levels of a truly continuous factor are used, ANOVA and regression can lead to very different conclusions.

Simple Linear Regression

The prediction equation estimates the (assumed) linear relationship between PSI and percent hardwood:

Linear Fit: PSI = 7.25 + 0.6966667 %HrdWd

This equation allows us to estimate PSI for different values of percent hardwood. For example, to predict PSI at 14% hardwood:

    PSI = 7.25 + 0.697(14) ≈ 17.0

Note: The regression equation is only valid for values of X over the range included in the analysis. Although extrapolation to other values of X is possible, it should be done only with extreme caution.

If the p-value for the predictor, %HrdWd, is < 0.05, we conclude that the slope is likely not equal to zero: there is a statistically significant relationship between percent hardwood and PSI.

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   7.25        1.301005    5.57      <.0001
%HrdWd      0.6966667   0.095012    7.33      <.0001

Correlation and R²

R Square (R²), the coefficient of determination, gives a measure of the predictive ability of the regression model.

Summary of Fit
RSquare                        0.709626
RSquare Adj                    0.696427
Root Mean Square Error         2.60201
Mean of Response               15.95833
Observations (or Sum Wgts)     24

R² is the proportion of the total variation in Y (SST) accounted for by the regression model (SS(Model)). R² is calculated from the sums of squares:

    R² = SS(Model)/SST = (SST − SSE)/SST

R² can take on values between 0.00 and 1.00; higher R² is better. An R² of 1.0 indicates that no random error exists in the response (the model accounts for all the variation in Y). Unfortunately, a high R² value does not necessarily indicate a good regression model.

R² has several shortcomings as a measure of fit for a regression model:
• R² does not have an associated probability distribution that can be used to assess the significance of the observed value, so there is no objective basis for deciding what a "good" or "significant" value of R² is.
• R² is partially redundant with the overall F test. Since the F test has an associated probability distribution, the significance of the F Ratio can be objectively evaluated while R² cannot.
• R² is systematically lowered by taking repeated measurements (or nearly repeated measurements) at the different values of X.
• The value of R² is sensitive to the range of the X values: by increasing the range of X, one can artificially improve R².
• R² can be increased simply by adding regressors to the model. We'll discuss R² Adjusted, which is adjusted for additional X's, in the Multiple Regression section.

The correlation coefficient, denoted r, measures the linear association between two sets of measurements. The correlation coefficient can take on values between −1 and 1. The closer r is to 1 or −1, the more closely the points fall on a line. If r is positive, the fitted line has positive slope; if r is negative, the fitted line has negative slope. If r is close to 0, there is little, if any, linear association. In the case of simple linear regression, the coefficient of determination, R², is the square of the correlation coefficient, as the notation implies.
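As a quick numerical check of that last claim, the sketch below (again an assumed Python workflow, reusing the hardwood data) computes r with NumPy and R² from the sums of squares, confirming that r² equals R² for a simple linear regression.

```python
import numpy as np

# Hardwood data from the earlier sketch.
hrdwd = np.repeat([5, 10, 15, 20], 6).astype(float)
psi = np.array([ 7,  8, 15, 11,  9, 10, 12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18, 19, 25, 22, 18, 20, 23], dtype=float)

# Correlation coefficient r between X and Y.
r = np.corrcoef(hrdwd, psi)[0, 1]

# R^2 from the sums of squares: R^2 = (SST - SSE) / SST.
b1 = (np.sum((hrdwd - hrdwd.mean()) * (psi - psi.mean()))
      / np.sum((hrdwd - hrdwd.mean()) ** 2))
b0 = psi.mean() - b1 * hrdwd.mean()
sse = np.sum((psi - (b0 + b1 * hrdwd)) ** 2)
sst = np.sum((psi - psi.mean()) ** 2)
r2 = 1 - sse / sst

print(f"r = {r:.6f}, r^2 = {r**2:.6f}, R^2 = {r2:.6f}")  # r^2 == R^2 ~ 0.709626
```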
Some common patterns:

[Four scatterplot panels: (1) r close to 1, strong positive association; (2) r close to −1, strong negative association; (3) r close to 0, no linear association; (4) r close to 0, no linear association, although there is a clear (curved) relationship between X and Y]

Residuals

To determine whether the regression assumptions are satisfied, we fit the regression model and then analyze how the data values behave relative to the fitted model to see if the fit is adequate. This analysis is based on residuals. A residual is the difference between the observed value at X and the predicted value at X:

    eᵢ = yᵢ − ŷᵢ

If the linear model is appropriate, the residuals should be normally distributed with mean 0 and constant variance given X.

Under the red arrow next to Linear Fit, select Plot Residuals to plot the residuals. Note that one can also Save Residuals and Save Predicteds to the data sheet.

Residuals should be approximately normally spread around zero, the centerline, and their variation should be approximately constant across the values of X. Are there any obvious trends or patterns? When evaluating the residual plots, look for the following:
• Curvature in the residuals: the model we fit is not adequate.
• Variation that increases as X increases: the homogeneity-of-variance assumption is violated.
• An outlier, with points not evenly dispersed on either side of the line: the unusual observation is influencing the model.
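A sketch of the same residual check outside JMP (assumed tooling: NumPy and Matplotlib), once more using the hardwood data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hardwood fit from the earlier sketches.
hrdwd = np.repeat([5, 10, 15, 20], 6).astype(float)
psi = np.array([ 7,  8, 15, 11,  9, 10, 12, 17, 13, 18, 19, 15,
                14, 18, 19, 17, 16, 18, 19, 25, 22, 18, 20, 23], dtype=float)
b1, b0 = np.polyfit(hrdwd, psi, 1)  # slope, intercept

# Residual = observed - predicted; should scatter evenly around zero.
predicted = b0 + b1 * hrdwd
residuals = psi - predicted

plt.scatter(hrdwd, residuals)
plt.axhline(0, linestyle="--")      # centerline at zero
plt.xlabel("%HrdWd")
plt.ylabel("Residual (PSI)")
plt.title("Residuals vs. %HrdWd")
plt.show()
```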
Key Points
• Regression cannot be used to establish causal relationships.
• Always plot the data.
• Neither R² nor R² Adjusted measures model adequacy; at best, they measure how well the fitted model predicts the observed values.
• Be very cautious when extrapolating outside the range of X values used.
• A high p-value and a low R² do not necessarily mean that there is no relationship between X and Y; the relationship may simply not be linear.
• Always plot residuals to check the adequacy of the model.