LINEAR REGRESSION & CORRELATION CHAPTER 7 BCT 2053 CONTENT 7.1 Simple Linear Regression Analysis and Correlation 7.2 Relationship Test and Prediction in Simple Linear Regression Analysis 7.3 Multiple Linear Regression Analysis and Correlation 7.4 Model Selection 7.1: Simple Linear Regression Analysis and Correlation OBJECTIVE: 1. Find a mathematical equation that can relate a dependent and independent variables x and y. 2. Plot a scatter diagram and graph the regression line 3. Calculate the strength of the linear relationship between x and y by correlation coefficient . INTRODUCTORY CONCEPTS Suppose you wish to investigate the relationship between a dependent variable (y) and independent variable (x) Independent variable (x) – the variables has been controlled Dependent variable (y) – the response variables In other word, the value of y depends on the value of x. Example A Suppose you wish to investigate the relationship between the numbers of hours student’s spent studying for an examination and the mark they achieved. Students A B C D E F G H numbers of hours (x) 5 8 9 10 10 12 13 15 Final marks ( (y) 49 60 55 72 65 80 82 85 Numbers of hours student’s spent studying for an examination ( x – Independent variable ) will cause the mark (y) they achieved. ( y – Dependent variable ) Other Examples The weight at the end of a spring (x) and the length of the spring (y) A student’s mark in Statistics test (x) and the mark in a Programming test (y) The diameter of the stem of a plant (x) and the average length of leaf of the plant (y) SCATTER DIAGRAM When pairs of values are plotted (x vs y), a scatter diagram is produced To see how the data looks like and relate with each other scatter plot Yield of hay (y) versus Amount of water (x) Yield of hay (y) 9 8 7 6 5 4 3 2 1 0 0 20 40 60 80 100 120 Amount of water (x) Exercise: Plot a scatter diagram for Example A 140 LINEAR CORRELATION AND SIMPLE LINEAR LINE Linear correlation If the points on the scatter diagram appear to lie near a straight line ( Simple regression line ) Or you would say that there is a linear correlation /linear relationship between x and y Sometimes a scatter plot shows a curvilinear (polynomial curve) or nonlinear (log, exp, etc…) relationship between data. Exercise: From the scatter diagram for Example A, is there any correlation between x and y? Positive Linear Correlation Negative Linear Correlation No Correlation No relationship between x and y INFERENCES IN CORRELATION The Pearson product moment correlation coefficient, r, is a numerical value between -1 and 1 inclusive which indicates the linear degree of scatter. 1 r 1 S xy r S xx x S xx S yy 2 x n 2 2 r = 1 indicates perfect positive linear y 2 correlation S yy y n r = -1 indicates perfect negative linear x y correlation S xy xy n r = 0 indicates no correlation INFERENCES IN CORRELATION The nearer the value of r is to 1 or -1, the closer the points on the scatter diagram are to the regression line Nearer to 1 is strong positive linear correlation/relationship Nearer to -1 is strong negative linear correlation/relationship Exercise: Calculate the correlation coefficient r for Example A and interpret the result. TIPS: USE YOUR CALCULATOR MODE REG LIN x,y M+ shift 1 OR shift 2 LINE OF BEST FIT A mathematical way of fitting the regression line (least of square method). The line of best fit must pass through the means of both sets of data, i.e. the point x , y . scatter plot Yield of hay (y) versus Amount of water (x) Yield of hay (y) 9 8 7 6 x, y 5 4 3 2 1 0 0 20 40 60 80 Amount of water (x) 100 120 140 Least square regression line of y on x Exercise: Find and draw the regression line for Example A 7.2: Relationship Test and Prediction in Simple Linear Regression Analysis OBJECTIVE: 1. Test the significance of regression slope. 2. Predict and estimate the new y value from the regression equation. RELATIONSHIP TESTS AND PREDICTION IN SIMPLE LINEAR REGRESSION ANALYSIS 1. HYPOTHESIS TESTING FOR THE SLOPE OF REGRESSION LINE 2. ESTIMATION AND PREDICTION 1: HYPOTHESIS TESTING FOR THE SLOPE OF REGRESSION LINE To test the linear relationship between x and y x and y have a linear relationship if the slope 0 Test the hypothesis, H : 0 and H : 0 o 1 with statistic test. t Var ~ tn 2 where S yy S xy n2 Var S xx If Ho is reject, x and y have a linear relationship Exercise: Test the linearity between x and y for Example A at 0.05 2. ESTIMATION AND PREDICTION When x is the independent variable and you want to estimate y for a given value of x estimate x for a given value of y. When neither variable is controlled and you want to estimate y for a given value of x The regression line y on x is used to make prediction when there is a linear correlation between x and y. Guideline for using regression equation If there is no linear correlation, don’t use the regression equation to make prediction When using the regression equation for predictions, stay within the scope of the available sample data A regression equation based on old data is not necessarily now. Don’t make predictions about a population that is different from the population from which the sample data were drawn. Exercise Use Example A to find: the estimate of y when x = 10 hours the estimate of x when y = 75 marks EXAMPLE B A study is done to see whether there is a relationship between a mother’s age and the number of children she has. The data are shown here. Mother’s age (x) 18 22 29 20 27 32 33 36 No. of children (y) 2 1 3 1 2 4 3 5 1. Plot a scatter diagram to illustrate the data. 2. Compute the value of the correlation coefficient, r and comment on the relationship between the value of r and the scatter plot. 3. Find the equation of the regression line of y on x. Then predict the number of children of a mother whose age is 34. 4. Test the linearity between x and y when α = 0.05. SOLVE SIMPLE LINEAR REGRESSION BY EXCEL Excel – key in data Tools – Data Analysis – Regression – enter the data range (y & x) – ok Computer Output - Excel Strong Linear positive correlation yˆ 26.89 4.06 x x and y have linear relationship ( P-value < 0.05) 7.3 Multiple Linear Regression Analysis and Correlation OBJECTIVE: 1. 2. To describe linear relationships involving more than two variables. Interpret the computer output for multiple linear regression analysis and make prediction. MULTIPLE LINEAR REGRESSION EQUATION A multiple regression equation is use to describe linear relationships involving more than two variables. A multiple linear regression equation expresses a linear relationship between a response variable y and two or more predictor variable (x1, x2,…,xk). The general form of a multiple regression equation is yˆ b0 b1 x1 b2 x2 bk xk A multiple linear regression equation identify the plane that gives the best fit to the data Notation: Multiple regression equation yˆ b0 b1 x1 b2 x2 bk xk where n : sample size k : number of independent variables yˆ : predicted value of y x1 , x2 , xk : independent variables b0 : estimate value of y -intercept b1 , b2 , , bk : estimate value of the independent variables coefficient Examples of real situation 1) A manufacturer of jams wants to know where to direct its marketing efforts when introducing a new flavour. Regression analysis can be used to help determine the profile of heavy users of jams. For instance, a company might predict the number of flavours of jam a household might have at any one time on the basis of a number of independent variables such as, number of children living at home, age of children, gender of children, income and time spent on shopping. Examples of real situation 2) Many companies use regression to study markets segments to determine which variables seem to have an impact on market share, purchase frequency, product ownerships, and product & brand loyalty, as well as many other areas. Examples of real situation 3) Personals directors explore the relationships of employee salary levels to geographic location, unemployment rates, industry growth, union membership, industry type, or competitive salaries. 4) Financial analysts look for causes of high stock prices by analysing dividend yields, earning per share, stock splits, consumer expectation of interest rates, savings levels and inflation rates. Examples of real situation 5) Medical researchers use regression analysis to seek links between blood pressure and independent variables such as age, social class, weight, smoking habits and race. 6) Doctors explore the impact of communications, number of contacts, and age of patient on patient satisfaction with service. Computing the Multiple Linear Regression Equation By using the least square method, the multiple linear regression equation is given by yˆ b0 b1 x1 b2 x2 bk xk Where the estimated regression coefficients b X X T 1 x1,1 Y1 1 x Y 2,1 Y 2, X Y n 1 xn ,1 1 x1,2 x2,2 xn ,2 T X Y x1, k b0 b x2, k , b 1 xn , k bk EXAMPLE C Assume that, a sales manager of Tackey Toys, needs to predict sales of Tackey products in selected market area. He believes that advertising expenditures and the population in each market area can be used to predict sales. He gathered sample of toy sales, advertising expenditures and the population as below. Find the linear multiple regression equation which the best fit to the data. Example, cont… Market Area Advertising Expenditures (Thousands of Dollars) x1 Population (Thousands) x2 Toy sales (Thousands of Dollars) y A 1.0 200 100 B 5.0 700 300 C 8.0 800 400 D 6.0 400 200 E 3.0 100 100 F 10.0 600 400 Solution Since we have 2 independent variables, so the multiple regression equation is given by: yˆ b0 b1 x1 b2 x2 where n 6 and k 2 yˆ predicted value of y x1 the value of the first independent variable x2 the value of the second independent variable b0 estimate value of y -intercept b1 slope associated with x1 (the change in yˆ if x2 is held constant and x1 varies by 1 unit) b2 slope associated with x2 (the change in yˆ if x1 is held constant and x2 varies by 1 unit) SOLVE MULTIPLE LINEAR REGRESSION BY EXCEL Excel – key in data Tools – Data Analysis – Regression – enter the data range (y & x) – ok Computer Output – Microsoft Excel yˆ 6.3972 20.4921x1 0.2805x2 Interpreting the Values in the Equation b0 = 6.3972 The value of estimated y when x1 and x2 are both zero is given by b0. When the population in thousands and the advertising expenditures in thousands dollars are zero then the estimated toy sales is given by 6.3972 thousands dollars . b1 = 20.4921 yˆ 6.3972 20.4921x1 0.2805x2 When the x2 is constant then the estimated y is increases by b1 for each x1. When the population in thousands is constant then the estimated toy sales increases by 20.4921 thousands dollars for each 1000 dollars of advertising expenditures. b2 = 0.2805 When the x1 is constant then the estimated y is increases by b2 for each x2. When the advertising expenditures in thousands dollars is constant then the estimated toy sales increases by 0.2805 thousands dollars for each 1000 people in the population. Making preliminary predictions with the multiple regression equation Assume that the sales manager needs a sales forecast for a market area Tackey Toys has recently spend $4,000 advertising in this market, which has a population of 500,000 people. So the point estimate of toy sales is given by: yˆ 6.3972 20.4921x1 0.2805 x2 6.3972 20.4921 4 0.2805 500 228.613 $228613 The Coefficient of Multiple Determinations Measure the percentage of variation in the y variable associated with the use of the set x variables A percentage that shows the variation in the y variable that’s explain by its relation to the combination of x1 and x2. . R2 SSR SST 1 SSR b T X T Y Y T 11T Y n where and 1 SST Y T Y Y T 11 T Y n bt 0 t 1, ,k R2 0 when all R 1 when all observations fall directly on the fitted response surface, i.e. when (the regression equation is good) 2 0 R2 1 Yt Yˆt for all t. Computer Output – Microsoft Excel 97.4% of Tackey Toy sales in the market area is explained by advertising expenditures and population size yˆ 6.3972 20.4921x1 0.2805x2 Multiple R and Adjusted R² The coefficient of multiple correlations R is the positive square root of R². R R2 The value of R can range from 0 to +1. The closer to +1, the stronger the relationship. The closer to 0, the weaker the relationship. The adjusted coefficient of determination is the multiple coefficient of determination R² modified to account for the number of variables and the sample size. adjusted R 2 n 1 1 1 R2 n k 1 When we compare a multiple regression equation to others, it is better to use the adjusted R². Computer Output – Microsoft Excel Strong relationship exist between sales and advertising expenditures and population size 97.4% of Tackey Toy sales in the market area is explained by advertising expenditures and population size yˆ 6.3972 20.4921x1 0.2805x2 Standard error and ANOVA test Standard error Measure the extent of the scatter, or dispersion, of the sample data points about the multiple regression plane. Compare the standard error between two or more regression equation Smallest standard error, less disperse ANOVA test H0: neither of the independent variables is related to the dependent variables (b1 = b2 = 0) H1: At least one of the independent variables is related to the dependent variables (b1 or b2 or both ≠ 0) Reject H0 if significance F < α significance F = P value Computer Output – Microsoft Excel Strong relationship exist between sales and advertising expenditures and population size 97.4% of Tackey Toy sales in the market area is explained by advertising expenditures and population size < 0.05 So reject H0 yˆ 6.3972 20.4921x1 0.2805x2 7.4: Model Selection OBJECTIVE: 1. Select the best multiple linear regression model for any given data set. TIPS: Model Selection in simple way Use common sense and practical considerations to include or exclude variables. Consider the P-value (the measure of the overall significance of multiple regression equation-significance F value) displayed by computer output. The smaller the better. Consider equation with high values of adjusted R² and try include only a few variables. Find the linear correlation coefficient r for each pair of variables being considered. If 2 predictor values have a very high r, there is no need to include them both. Exclude the variable with the lower value of r. EXAMPLE D The following table summarize the multiple regression analysis for the response variable (y) is weight (in pounds), and the predictor (x) variables are H ( height in inches), W (waist circumference in cm), and C (cholesterol in mg) x P-value R Adjusted R² Regression equation H, W, C 0.0000 0.880 0.870 -199 + 2.55H + 2.18W 0.005C H, W 0.0000 0.877 0.870 -206 + 2.66H + 2.15W H, C 0.0002 0.277 0.238 -148 + 4.65H + 0.006C W, C 0.0000 0.804 0.793 -42.8 + 2.41W + 0.01C H 0.0001 0.273 0.254 -139 + 4.55H W 0.0000 0.790 0.785 -44.1 + 2.37W C 0.874 0.001 0.000 173 - 0.003C Example D, cont… 1. If only one predictor variable is used to predict weight, which single variable is best? Why? 2. If exactly two predictor variables are used to predict weight, which two variables should be chosen? Why? 3. Which regression equation is best for predicting weight? Why? CONCLUSION This chapter introduces important methods (regression) for making inferences about a relationship between two or more variables and describing such a relationship with an equation that can be used for predicting value of one variable given the value of the other variable. Thank You
© Copyright 2026 Paperzz