Lecture 3
Selected material from Ch. 5: Summarizing bivariate data

Example: Bus travel in the US
One-way bus fares ($) from Rochester, New York, to cities less than 750 miles away. We got the data from the website of the bus company because we want to know how fares depend on distance. Therefore, we make a scatterplot with distance on the x axis and fare on the y axis.

    Destination City     Distance (miles)   Standard One-Way Fare ($)
    Albany, NY           240                 39
    Baltimore, MD        430                 81
    Buffalo, NY           69                 17
    Chicago, IL          607                 96
    Cleveland, OH        257                 61
    Montreal, QU         480                 70.5
    New York City, NY    340                 65
    Ottawa, ON           467                 82
    Philadelphia, PA     335                 67
    Potsdam, NY          239                 47
    Syracuse, NY          95                 20
    Toronto, ON          178                 35
    Washington, DC       496                 87

Making a scatterplot
[Figure: "Greyhound Bus Fares vs. Distance": Standard One-Way Fare ($) plotted against distance from Rochester, NY (miles). Each of the 13 cities gets one dot; one dot represents two variables, x = distance and y = fare for that city.]
© 2008 Brooks/Cole, a division of Thomson Learning, Inc.

Interpretation of the data
As would be expected, the fare (cost) increases the farther you travel. But for two cities nearly the same distance away from Rochester, the fare can be lower for one!
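The paired data can be entered directly. A minimal sketch (plain Python, no external libraries) that reproduces the summary statistics used later in the lecture:

```python
import math

# (distance in miles, standard one-way fare in $) for the 13 cities
data = [(240, 39), (430, 81), (69, 17), (607, 96), (257, 61),
        (480, 70.5), (340, 65), (467, 82), (335, 67), (239, 47),
        (95, 20), (178, 35), (496, 87)]
x = [d for d, f in data]   # distance
y = [f for d, f in data]   # fare
n = len(data)

mean_x = sum(x) / n   # sample mean of distance, ~325.615
mean_y = sum(y) / n   # sample mean of fare, ~59.0385
# sample standard deviations (divide by n - 1)
sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
print(mean_x, mean_y, sx, sy)
```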
Scatterplot of bus data
[Figure: the same scatterplot with two cities highlighted: Potsdam, NY (distance = 239, fare = $47) and Albany, NY (distance = 240, fare = $39).]

Why is it more expensive to travel to Potsdam than to Albany?
Albany, NY: population ≈ 100,000, close to major cities, has businesses and tourist attractions.
Potsdam, NY: population ≈ 15,000, far from any other city, little business or tourism (it is named after Potsdam, Germany).
The value of y (cost) is not determined solely by x (distance). It usually costs more to go to less travelled places.

Bus in the US
The y value tends to increase as x increases. We say that there is a positive relationship between the variables distance and fare. It appears that the y value (fare) could be predicted reasonably well from the x value (distance) by finding a line that is close to the points in the plot. We say the relationship is linear.

Scatterplot
The y and x axes need not start at (0, 0). Here, x starts at 50 miles and y at $10; choose convenient x and y axes so that all data points appear on the graph.

Association
Positive association: two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values similarly tend to occur together (i.e., the y values tend to increase as the x values increase).
Negative association: two variables are negatively associated when above-average values of one accompany below-average values of the other, and vice versa (i.e., the y values tend to decrease as the x values increase).

Pearson correlation coefficient
The Pearson correlation coefficient (r) is a measure of the strength of the linear relationship between two variables. Formula for r:

    r = Σ z_x z_y / (n - 1) = Σ [(x - x̄)/s_x] [(y - ȳ)/s_y] / (n - 1),

where s_x = sqrt( Σ (x_i - x̄)² / (n - 1) ) is the sample standard deviation of x, and similarly for s_y.

Properties of r (correlation)
The value of r does not depend on the unit of measurement (e.g. miles or $) for each variable.
The value of r does not depend on which of the two variables is labeled x.
The value of r is between -1 and +1. The correlation coefficient is -1 only when all the points lie on a downward-sloping line, and +1 only when all the points lie on an upward-sloping line.
The value of r is a measure of the extent to which x and y are linearly related.

Example: Strong negative correlation
[Figure: scatterplot showing a strong negative correlation.]

No correlation
[Figure: scatterplot showing no correlation.]
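The first two properties can be checked numerically. A small sketch (the data values here are made up purely for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the z-score formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy)
               for a, b in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 5.0, 8.0]
y = [2.1, 3.9, 6.2, 9.8, 16.1]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                      # swap which variable is x
r_rescaled = pearson_r([1.609 * v for v in x],   # miles -> km
                       [100 * v for v in y])     # $ -> cents
print(r, r_swapped, r_rescaled)
```

Swapping x and y or changing the units leaves r unchanged, as the properties state.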
Example: Strong positive association
[Figure: scatterplot showing a strong positive association.]

Interpretation of r
In general the correlation is:
• Strong: r ≤ -.8 or r ≥ .8
• Moderate: -.8 < r < -.5 or .5 < r < .8
• Weak: -.5 ≤ r ≤ .5

How to calculate: Bus example
x̄ = 325.615, s_x = 164.2125, ȳ = 59.0385, s_y = 25.506

      x      y     (x - x̄)/s_x   (y - ȳ)/s_y   product
    240     39       -0.5214       -0.7856      0.4096
    430     81        0.6357        0.8610      0.5473
     69     17       -1.5627       -1.6481      2.5755
    607     96        1.7135        1.4491      2.4831
    257     61       -0.4178        0.0769     -0.0321
    480     70.5      0.9402        0.4494      0.4225
    340     65        0.0876        0.2337      0.0205
    467     82        0.8610        0.9002      0.7751
    335     67        0.0571        0.3121      0.0178
    239     47       -0.5275       -0.4720      0.2489
     95     20       -1.4044       -1.5305      2.1494
    178     35       -0.8989       -0.9424      0.8472
    496     87        1.0376        1.0962      1.1374
                                      sum:     11.6021

    r = 11.6021 / (13 - 1) = 0.9668

Strong positive association!

Warning!
The Pearson correlation coefficient only measures linear association. You must look at the scatterplot first to check that it looks approximately linear before just computing r. If it looks nonlinear, then do not compute r. An example to illustrate what can go wrong:

Example
      x      y
    1.2    23.3
    2.5    21.5
    6.5    12.2
   13.1     3.9
   24.2     4.0
   34.1    18.0
   20.8     1.7
   37.5    26.1

x̄ = 17.488, s_x = 13.951, ȳ = 13.838, s_y = 9.721, and Σ z_x z_y = 0.007, so

    r = (1/(n - 1)) Σ [(x - x̄)/s_x] [(y - ȳ)/s_y] = (1/7)(0.007) = 0.001.

r near 0 means no linear association between x and y.
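Both correlations can be verified numerically; a sketch in plain Python:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of paired data x, y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = (sum((v - mx) ** 2 for v in x) / (n - 1)) ** 0.5
    sy = (sum((v - my) ** 2 for v in y) / (n - 1)) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

# Bus data: distance (miles) and fare ($) for the 13 cities
dist = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fare = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
r_bus = pearson_r(dist, fare)    # ~0.9668: strong positive

# The warning example: r is near 0 even though x and y are clearly related
x = [1.2, 2.5, 6.5, 13.1, 24.2, 34.1, 20.8, 37.5]
y = [23.3, 21.5, 12.2, 3.9, 4.0, 18.0, 1.7, 26.1]
r_curve = pearson_r(x, y)        # ~0.001
print(r_bus, r_curve)
```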
Example
[Figure: scatterplot of the example data above, y against x.]
But the scatterplot reveals an almost perfect quadratic relationship between x and y, so to say there is no association or relationship between them is incorrect!

Linear relations
Mathematical formula for linearly related variables x and y, where x is the explanatory variable and y the response variable:

    y = a + bx

is the equation of a straight line.
b is the slope of the line: the amount by which y increases when x increases by 1 unit.
a is the intercept (sometimes called the vertical intercept) of the line: the height of the line above the value x = 0, i.e., the value of y when x = 0.
[Figure: the line y = 7 + 3x; when x increases by 1, y increases by b = 3; the intercept is a = 7.]

How well does the line a + bx fit the data?
Suppose a line is drawn through the data points. How do we measure how well it fits the data? The measure we use is based on the vertical differences (sometimes called deviations) between the data values and the line:

    vertical difference = y - (a + bx).

The closer the differences are to 0, the better the fit.
[Figure: regression plot, Standard Fare = 10.1380 + 0.150179 Distance, S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%, with one vertical difference marked.]
Least squares criterion
The most widely used criterion for measuring the goodness of fit of a line y = a + bx to bivariate data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) is the sum of the squared deviations about the line:

    Σ [y - (a + bx)]² = [y₁ - (a + bx₁)]² + … + [yₙ - (a + bxₙ)]²,

where the first term is the squared vertical distance of the first data value from the line, and so on up to the last data value. The line that gives the best fit to the data is the one that minimizes this sum; it is called the least squares line or sample regression line. We find the best line by finding the a and b which give the smallest value of the sum of squared vertical differences (technically, set the partial derivatives with respect to a and b to 0).

Solution for a and b
The slope of the least squares line is

    b = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)²

and the y intercept is

    a = ȳ - b x̄.

The equation of the least squares line is ŷ = a + bx, where the ^ above y (read as "y-hat") emphasizes that ŷ is a prediction of y resulting from substitution of the fitted a and b (above) into the line equation.

Calculations for US bus example

      x      y      x - x̄      (x - x̄)²      y - ȳ     (x - x̄)(y - ȳ)
    240     39     -85.615     7329.994    -20.038        1715.60
    430     81     104.385    10896.148     21.962        2292.45
     69     17    -256.615    65851.456    -42.038       10787.72
    607     96     281.385    79177.302     36.962       10400.41
    257     61     -68.615     4708.071      1.962        -134.59
    480     70.5   154.385    23834.609     11.462        1769.49
    340     65      14.385      206.917      5.962          85.75
    467     82     141.385    19989.609     22.962        3246.41
    335     67       9.385       88.071      7.962          74.72
    239     47     -86.615     7502.225    -12.038        1042.72
     95     20    -230.615    53183.456    -39.038        9002.87
    178     35    -147.615    21790.302    -24.038        3548.45
    496     87     170.385    29030.917     27.962        4764.22
   4233    767.5              323589.08                  48596.19

So Σ (x - x̄)(y - ȳ) = 48596.19 and Σ (x - x̄)² = 323589.08, which gives

    b = 48596.19 / 323589.08 = 0.15018.

The fare increases on average by 15 cents for each mile travelled. Also, n = 13, Σx = 4233 and Σy = 767.5, so

    x̄ = 4233/13 = 325.615 and ȳ = 767.5/13 = 59.0385,

and therefore

    a = ȳ - b x̄ = 59.0385 - 0.15018 (325.615) = 10.138.

The regression line is ŷ = 10.138 + 0.15018 x.

Regression plot
Overlay the fitted line on the scatterplot. This is the best line you can fit to the data according to the least squares criterion.
[Figure: regression plot, Standard Fare = 10.1380 + 0.150179 Distance, S = 6.80319, R-Sq = 93.5%, R-Sq(adj) = 92.9%.]

Another formula for b
The formula

    b = Σ (x - x̄)(y - ȳ) / Σ (x - x̄)²

can be rewritten as

    b = [ Σxy - (Σx)(Σy)/n ] / [ Σx² - (Σx)²/n ].

This gives the same result and is easier to do on a calculator/computer.

Steps in a linear regression
1.) Determine which variable is the explanatory variable (x) and which is the response (y); the goal is to see how y changes with x.
2.) Look at the scatterplot to make sure the data could follow a linear relationship. If not, stop and consider a different model (e.g. quadratic).
3.) After fitting the regression model, check whether there are unusual values (outliers) or patterns indicating that something was wrong with the assumption of a linear relationship.
4.) Determine predictions from the regression equation and their accuracy.

Predicted/fitted values
The predicted or fitted values result from substituting each sample x value into the equation for the least squares line:

    ŷ₁ = a + bx₁ (1st predicted value), ŷ₂ = a + bx₂ (2nd predicted value), …, ŷₙ = a + bxₙ (nth predicted value).

The residuals for the least squares line are the values y₁ - ŷ₁, y₂ - ŷ₂, …, yₙ - ŷₙ.
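A sketch that fits the line both ways (definition formula and computational formula) for the bus data:

```python
# Least squares fit for the bus data, two equivalent ways (plain Python).
dist = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fare = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(dist)
mx, my = sum(dist) / n, sum(fare) / n

# Definition formula: b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
sxy = sum((x - mx) * (y - my) for x, y in zip(dist, fare))
sxx = sum((x - mx) ** 2 for x in dist)
b = sxy / sxx                     # ~0.15018
a = my - b * mx                   # ~10.138

# Computational formula: same b without forming the deviations
b2 = (sum(x * y for x, y in zip(dist, fare)) - sum(dist) * sum(fare) / n) \
     / (sum(x * x for x in dist) - sum(dist) ** 2 / n)
print(a, b, b2)
```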
Residuals
Residuals are the vertical differences between each data value (y) and the fitted regression line at its corresponding x. The residuals for the least squares line are the values y₁ - ŷ₁, y₂ - ŷ₂, …, yₙ - ŷₙ.
[Figure: regression plot with one vertical difference (residual) marked.]

Bus in the US: regression line

      x      y     predicted ŷ = 10.138 + 0.15018x   residual y - ŷ
    240     39        46.18                             -7.181
    430     81        74.72                              6.285
     69     17        20.50                             -3.500
    607     96       101.30                             -5.297
    257     61        48.73                             12.266
    480     70.5      82.22                            -11.724
    340     65        61.20                              3.801
    467     82        80.27                              1.728
    335     67        60.45                              6.552
    239     47        46.03                              0.969
     95     20        24.41                             -4.405
    178     35        36.87                             -1.870
    496     87        84.63                              2.373

Residual plot for bus example
A residual plot is a scatterplot of the data pairs (x, residual). Since residuals are vertical differences between the response (y) and what is fitted to y, residuals near 0 are good and large residuals are bad.
[Figure: residuals versus x for the bus data.]

From the scatterplot of the data you can see that the 6th data value (Cleveland, OH, 257 miles, $61) is far from the regression line, but it becomes more obvious in the residual plot. The two differences are the same; different y scales just make them look different.

US bus data
[Figure: residual plot annotated: in the middle distance range the predicted fares are too low, and near Rochester they are too high.]
It appears that the line systematically predicts fares that are too high for cities close to Rochester and fares that are too low for most cities between 200 and 500 miles. You can also see this in the scatterplot of the data: at the beginning the line falls above the data values, and afterwards below.

Other kinds of residual plots
Another common type of residual plot is a scatterplot of the data pairs (ŷ, residual), i.e., with the fitted value a + bx instead of x on the horizontal axis. The following plot was produced by Minitab for the Greyhound data. Notice that this residual plot shows the same type of systematic problems with the model.
[Figure: residuals versus the fitted values.]

Residual plot: what to look for
Isolated residuals or patterns in residuals indicate potential problems. Ideally, residuals should be randomly spread out above and below zero.
Interpretation:
1.) Residuals below 0 indicate over-prediction.
2.) Residuals above 0 indicate under-prediction.
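The fitted values and residuals above can be reproduced directly from the fitted line; a minimal sketch:

```python
# Residuals for the bus data from the fitted least squares line.
dist = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fare = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(dist)
mx, my = sum(dist) / n, sum(fare) / n
b = sum((x - mx) * (y - my) for x, y in zip(dist, fare)) \
    / sum((x - mx) ** 2 for x in dist)
a = my - b * mx

fitted = [a + b * x for x in dist]
residuals = [y - yhat for y, yhat in zip(fare, fitted)]

# Cleveland, OH (257 miles, $61) has the largest residual, about +12.27
largest = max(residuals, key=abs)
print(largest)
```

A useful side check: for a least squares line with an intercept, the residuals always sum to zero.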
Definitions and formulas
The total sum of squares, denoted by SSTo, is defined as

    SSTo = (y₁ - ȳ)² + (y₂ - ȳ)² + … + (yₙ - ȳ)² = Σ (y - ȳ)²,

the sum of squared differences from the mean of all y's. The residual sum of squares, denoted by SSResid, is defined as

    SSResid = (y₁ - ŷ₁)² + (y₂ - ŷ₂)² + … + (yₙ - ŷₙ)² = Σ (y - ŷ)²,

the sum of squared differences from the fitted regression line.

SSTo versus SSResid
SSTo measures the vertical differences of all data values from the horizontal line at ȳ = 59; SSResid measures them from the regression line.
[Figure: regression plot with the horizontal line ȳ = 59 added; at the rightmost data value the difference for SSTo is much larger than for SSResid.]

Calculation formulas
SSTo and SSResid are generally found as part of the standard output from most statistical packages, or can be obtained using the following computational formulas:

    SSTo = Σy² - (Σy)²/n
    SSResid = Σy² - a Σy - b Σxy

The coefficient of determination, r², can be computed as

    r² = 1 - SSResid/SSTo.

Higher r² means the slope improved the predictions more.

Coefficient of determination
The coefficient of determination, denoted by r² or R², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. Mathematics can show that the coefficient of determination (r²) is the square of the Pearson correlation coefficient (r). Recall that the Pearson correlation coefficient measures the strength of a linear relationship between x and y and does not depend on which variable is called x.

Bus in the US
n = 13, Σy = 767.5, Σy² = 53119.25, Σxy = 298506, a = 10.138, b = 0.150179.

    SSTo = Σy² - (Σy)²/n = 53119.25 - 767.5²/13 = 7807.23
    SSResid = Σy² - a Σy - b Σxy
            = 53119.25 - (10.138)(767.5) - (0.150179)(298506) = 509.00

SSTo > SSResid, as expected, because these data had a strong linear relationship!

Interpretation

    r² = 1 - SSResid/SSTo = 1 - 509/7807.23 = 0.9348

93.5% of the variation in bus fares can be attributed to the linear relationship between distance and fare. Distance traveled is a strong predictor of how much you expect to pay on the bus; the farther you travel, the more you pay. You can use this information to determine how far you can travel with only a certain amount of money to spend.

More on variability
The standard deviation about the least squares line is denoted sₑ and given by

    sₑ = sqrt( SSResid / (n - 2) ).

sₑ is interpreted as the "typical" amount by which an observation deviates from the least squares line. For the bus data:

    sₑ = sqrt( 509 / 11 ) = $6.80.

Computer programs can do regression for you: example output from Minitab

Regression Analysis: Standard Fare versus Distance

The regression equation is
Standard Fare = 10.1 + 0.150 Distance    (the least squares regression line; a = 10.138, b = 0.15018)

Predictor   Coef      SE Coef   T       P
Constant    10.138    4.327      2.34   0.039
Distance    0.15018   0.01196   12.56   0.000

S = 6.803 (= sₑ)   R-Sq = 93.5% (= r²)   R-Sq(adj) = 92.9%

Analysis of Variance
Source           DF   SS        MS       F        P
Regression        1   7298.1    7298.1   157.68   0.000
Residual Error   11   509.1     46.3
Total            12   7807.2

(The SS entry for Residual Error is SSResid; the SS entry for Total is SSTo.)

Bus in the US: add more data
Add 7 new cities throughout the US to increase the data from the previous 13 cities to 20 cities. Seattle is 4578.8 km (2848 miles) away from Rochester, but its fare is not that much higher!
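Before turning to the expanded data set, a quick numerical check of the goodness-of-fit quantities above for the original 13 cities:

```python
import math

# SSTo, SSResid, r^2 and s_e for the 13-city bus data (plain Python).
dist = [240, 430, 69, 607, 257, 480, 340, 467, 335, 239, 95, 178, 496]
fare = [39, 81, 17, 96, 61, 70.5, 65, 82, 67, 47, 20, 35, 87]
n = len(dist)
mx, my = sum(dist) / n, sum(fare) / n
b = sum((x - mx) * (y - my) for x, y in zip(dist, fare)) \
    / sum((x - mx) ** 2 for x in dist)
a = my - b * mx

ssto = sum((y - my) ** 2 for y in fare)                            # ~7807.23
ssresid = sum((y - (a + b * x)) ** 2 for x, y in zip(dist, fare))  # ~509.1
r2 = 1 - ssresid / ssto                                            # ~0.935
se = math.sqrt(ssresid / (n - 2))                                  # ~6.80
print(ssto, ssresid, r2, se)
```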
    Destination        Distance   Standard Fare
    Buffalo, NY             69       17
    New York City          340       65
    Cleveland, OH          257       61
    Baltimore, MD          430       81
    Washington, DC         496       87
    Atlanta, GA            998      115
    Chicago, IL            607       96
    San Francisco         2861      159
    Seattle, WA           2848      159
    Philadelphia, PA       335       67
    Orlando, FL           1478      109
    Phoenix, AZ           2569      149
    Houston, TX           1671      129
    New Orleans, LA       1381      119
    Syracuse, NY            95       20
    Albany, NY             240       39
    Potsdam, NY            239       47
    Toronto, ON            178       35
    Ottawa, ON             467       82
    Montreal, QU           480       70.5

First: look at the scatterplot
[Figure: scatterplot of fare versus distance, x axis expanded from 650 to 3000 miles.]
These new points do not follow the same trend as the previous ones. They are not rising as fast (different slope).

Second: are the results expected?
[Figure: the same scatterplot with Seattle, San Francisco, Phoenix, Houston, New Orleans, Atlanta and Orlando labeled.]
Examine these cities: all are major cities with tourist attractions, and all lie to the south or west; travel to them is cheaper than expected.

Linear regression
Correlation coefficient r = 0.921, R² = 0.849, sₑ = $17.42.
Regression line: Standard Fare = 46.058 + 0.043535 Distance.
[Figure: regression plot (S = 17.4230, R-Sq = 84.9%, R-Sq(adj) = 84.1%) and a bad residual plot of residuals versus distance.]
Even though the correlation coefficient is reasonably high and 84.9% of the variation in fares is explained, the regression line does not fit the data well and will not give good predictions.

Nonlinear regression
The scatterplot does not look linear; it appears to have a curved shape. We sometimes replace x or y or both with a transformation and then perform a linear regression using the transformed variables.
This may lead to a better fit. For this particular data, the shape of the curve is almost logarithmic, so we might try to replace the distance with log10(distance), the logarithm to the base 10 of the distance.

Nonlinear regression for bus example
Compute log10 of each of the distances.

    Destination        Distance   log10(distance)   Standard Fare
    Buffalo, NY             69       1.83885           17
    New York City          340       2.53148           65
    Cleveland, OH          257       2.40993           61
    Baltimore, MD          430       2.63347           81
    Washington, DC         496       2.69548           87
    Atlanta, GA            998       2.99913          115
    Chicago, IL            607       2.78319           96
    San Francisco         2861       3.45652          159
    Seattle, WA           2848       3.45454          159
    Philadelphia, PA       335       2.52504           67
    Orlando, FL           1478       3.16967          109
    Phoenix, AZ           2569       3.40976          149
    Houston, TX           1671       3.22298          129
    New Orleans, LA       1381       3.14019          119
    Syracuse, NY            95       1.97772           20
    Albany, NY             240       2.38021           39
    Potsdam, NY            239       2.37840           47
    Toronto, ON            178       2.25042           35
    Ottawa, ON             467       2.66932           82
    Montreal, QU           480       2.68124           70.5

Bus in the US
Regression Analysis: Standard Fare versus Log10(Distance)

The regression equation is
Standard Fare = -163 + 91.0 Log10(Distance)

Predictor   Coef      SE Coef   T        P
Constant    -163.25   10.59     -15.41   0.000
Log10(Di     91.039    3.826     23.80   0.000

S = 7.869   R-Sq = 96.9%   R-Sq(adj) = 96.7%

The typical error is now $7.87: better accuracy (decreased from $17.42). The r² is high: 96.9% of the variation is attributed to the model (increased from 84.9%).

Bus in the US: rest of the computer output

Analysis of Variance
Source           DF   SS      MS      F        P
Regression        1   35068   35068   566.30   0.000
Residual Error   18   1115    62
Total            19   36183

Unusual Observations
Obs   Log10(Di   Standard   Fit      SE Fit   Residual   St Resid
11    3.17       109.00     125.32   2.43     -16.32     -2.18R

R denotes an observation with a large standardized residual. The only outlier is Orlando and, as you'll see from the next two slides, it is not too bad.
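Both the poor linear fit and the improved log10 fit can be reproduced from the 20-city table (a sketch; the results should match the Minitab output up to rounding):

```python
import math

dist = [69, 340, 257, 430, 496, 998, 607, 2861, 2848, 335,
        1478, 2569, 1671, 1381, 95, 240, 239, 178, 467, 480]
fare = [17, 65, 61, 81, 87, 115, 96, 159, 159, 67,
        109, 149, 129, 119, 20, 39, 47, 35, 82, 70.5]

def least_squares(x, y):
    """Return intercept a, slope b and s_e for the least squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    ssresid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(ssresid / (n - 2))

a1, b1, se1 = least_squares(dist, fare)                           # s_e ~ 17.42
a2, b2, se2 = least_squares([math.log10(d) for d in dist], fare)  # s_e ~ 7.87
print(se1, se2)

# Predicted fare for a 1000-mile trip from the transformed model
pred = a2 + b2 * math.log10(1000)   # roughly $110
```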
Bus in the US
When we look at how the prediction curve looks on a graph that has Standard Fare and log10(Distance) axes, the result looks reasonably linear.
[Figure: regression plot, Standard Fare = -163.246 + 91.0389 Log10(Distance), S = 7.86930, R-Sq = 96.9%, R-Sq(adj) = 96.7%.]

When we look at how the prediction curve looks on a graph that has Distance instead of log10(Distance) as the x axis, the data and curve are nonlinear. The prediction model obtained by transforming distance now appears to work well, and shows clearly that fares level off at higher distances.

    Fare = -163.25 + 91.0 log10(Distance)

[Figure: the prediction model drawn through the fare-versus-distance scatterplot.]

Tutorial: Call of the Amazonian frog
From Herpetologica, 2004: pg. 420-29. The Amazonian tree frog uses vocal communication to call for a mate. A study was performed to assess whether the daily rainfall (mm) was associated with the proportion of frogs calling (the call rate). For proportions p that are between 0 and 1, a logistic transformation is often used:
• Logistic transform = ln[p/(1 - p)], where ln is the natural logarithm.
• For 0 < p < 1, -Infinity < ln[p/(1 - p)] < Infinity, as assumed for an outcome variable that follows the Normal distribution.
• Choosing the right transformation for data can be tricky; there are tips in Chapter 5 of the book. For exams and homework, we will guide you, as here.

Data: Call of the Amazonian frog
1.) Plots of p = call rate versus rainfall and of the logistic transform ln[p/(1-p)] versus rainfall are given below. Which plot suits a linear regression better?
[Figure: two panels, "p vs. rainfall" and "ln(p/(1-p)) vs. rainfall", each plotted against rainfall (mm).]

    rainfall (mm)   call rate
         0.2          0.17
         0.3          0.19
         0.4          0.20
         0.5          0.21
         0.7          0.27
         0.8          0.28
         0.9          0.29
         1.1          0.34
         1.2          0.39
         1.3          0.41
         1.5          0.46
         1.6          0.49
         1.7          0.53
         2.0          0.60
         2.2          0.67
         2.4          0.71
         2.6          0.75
         2.8          0.82
         2.9          0.84
         3.2          0.88
         3.5          0.90
         4.2          0.97
         4.9          0.98
         5.0          0.98

2.) Using the best transformation, calculate the intercept and slope of the fitted regression line.
3.) Interpret the meaning of the intercept and slope in 2.)
4.) Give an estimate of the regression line at 4 mm daily rainfall. What is this an estimate of?

Recall the computational formula for the slope b of the regression of y on x:

    b = [ Σxy - (Σx)(Σy)/n ] / [ Σx² - (Σx)²/n ].

Some useful summaries calculated in the stat package R:

    sum(log(callrate/(1-callrate)))             11.6
    sum(rainfall)                               47.9
    sum(rainfall^2)                            141.27
    sum(rainfall*log(callrate/(1-callrate)))    77.56
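A sketch of how questions 2.) and 4.) can be worked from the R summaries above (the numbers below follow from the rounded sums, so treat them as approximate):

```python
import math

# Summaries from R (y = ln[p/(1-p)], x = rainfall in mm, n = 24 days)
n = 24
sum_y = 11.6
sum_x = 47.9
sum_x2 = 141.27
sum_xy = 77.56

# Computational formula for the slope, then the intercept
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # ~1.19
a = sum_y / n - b * sum_x / n                                 # ~-1.89

# Question 4: the estimate at 4 mm rainfall is on the logistic scale ...
logit_4mm = a + b * 4
# ... and can be back-transformed to an estimated call rate p
p_4mm = 1 / (1 + math.exp(-logit_4mm))                        # ~0.95
print(b, a, p_4mm)
```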