NOTE: Do not turn this in. Homeworks will not be graded.
Homework in preparation for quiz #7 (October 27th). Use the SPSS exercises to start developing your computer skills.

1a) Write down the equation of the Pearson correlation coefficient (r) and explicitly explain how it relates to the covariance between X and Y (+0.250 for each part).

r = covariance(X,Y) / (SD_X * SD_Y)

The Pearson correlation coefficient is the ratio of the covariance (the shared variance of the two variables) to the product of the standard deviations of the two variables. In other words, the Pearson correlation coefficient is the standardized shared variance. Its values therefore range from -1 to +1 and are not affected by the units of the two variables.

1b) Write down the equation of the slope of the linear regression coefficient (b) and explicitly explain how it relates to the covariance between X and Y (+0.250 for each part).

b = covariance(X,Y) / variance(X)

The slope of the simple linear regression line is the ratio of the covariance to the variance of the independent variable (X). In other words, the slope is the shared variance relative to the variance of X. Its values therefore range from minus infinity to plus infinity and are affected by the units of the two variables.
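As an optional check of these two formulas (not required for the quiz), the ratios can be verified numerically. The sketch below assumes Python with numpy and scipy installed, and uses arbitrary example data:

import numpy as np
from scipy import stats

# Arbitrary example data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance
sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)    # sample standard deviations

r_manual = cov_xy / (sd_x * sd_y)            # r = covariance / (SD_X * SD_Y)
b_manual = cov_xy / x.var(ddof=1)            # b = covariance / variance(X)

r_scipy, _ = stats.pearsonr(x, y)            # matches r_manual
slope_scipy = stats.linregress(x, y).slope   # matches b_manual

print(r_manual, r_scipy)
print(b_manual, slope_scipy)

The same covariance appears in both ratios; only the denominator changes, which is why r is unitless and bounded while b keeps the units of Y per unit of X.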
2) Define the following terms and describe their range of possible values (+0.125 for each definition). Note, if applicable, please include a formula in your definition:

- Correlation coefficient (r):
Definition: This statistic quantifies the extent to which two variables co-vary (change together) in a linear fashion: r = covariance(X,Y) / (SD_X * SD_Y). The test requires interval or ratio data, and both distributions should be normally distributed.
Range of values: Ranges from +1 (perfect positive linear relationship) to -1 (perfect negative linear relationship), with a value of 0 indicating that the two variables do not vary together in a linear fashion.

- Correlation coefficient (rho, Spearman):
Definition: This statistic quantifies the extent to which two variables co-vary, calculated by applying the Pearson formula to the ranks of the data rather than to the raw values. It is therefore a distribution-free (non-parametric) statistic and makes no assumption of normality.
Range of values: Ranges from +1 (perfect positive monotonic relationship) to -1 (perfect negative monotonic relationship), with a value of 0 indicating that the ranks of the two variables do not vary together.

- Correlation coefficient (tau, Kendall):
Definition: This statistic also quantifies the association between two variables using their ranks, based on the number of concordant versus discordant pairs of observations. Like rho, it is a distribution-free (non-parametric) statistic and makes no assumption of normality.
Range of values: Ranges from +1 (all pairs concordant) to -1 (all pairs discordant), with a value of 0 indicating no association between the ranks of the two variables.

- Coefficient of determination (r^2):
Definition: This statistic, the square of the Pearson correlation coefficient, quantifies the proportion of variance that is shared between variable X and variable Y.
Range of values: Ranges from 0 (no variance shared) to 1 (100% of the variance shared).

- Intercept (b0):
Definition: This statistic is the value of Y when X equals 0; that is, the Y-intercept of the best-fit linear relationship between variables X and Y.
Range of values: Unbounded; it can take any value from minus infinity to plus infinity.

- Slope (b1):
Definition: This statistic quantifies the linear relationship between the dependent (Y) and the independent (X) variable: b1 = covariance(X,Y) / variance(X).
Range of values: Unbounded; it can take any value from minus infinity to plus infinity.

- R squared (R^2):
Definition: The ratio of the Sum of Squares for the Model (for the alternate hypothesis, calculated from the best-fit regression line) to the Total Sum of Squares (for the null hypothesis, calculated using the mean of the Y values): R^2 = SS_Model / SS_Total.
Range of values: Ranges from 0 (0% of the total variance explained by the best-fit linear regression model) to 1 (100% of the total variance explained by the best-fit linear regression model).

- F:
Definition: The ratio of the model Mean Square (the regression variance) to the residual Mean Square (the error variance): F = MS_Model / MS_Residual.
Range of values: Ranges from 0 upward, with no upper bound. The larger the F value, the stronger the model.

3a) Using the data in the matrix below (5 pairs of X-Y values), calculate the Pearson correlation coefficient (+0.125) and the coefficient of determination (+0.125). Show your work, including the way you calculated the deviations, the SDs and the r. Feel free to insert the tables from Excel into this document, if it helps you organize your work.

X: 0, 0, 0, 0, 0
Y: 5, 4, 3, 2, 1

Mean_X = 0 / 5 = 0
Mean_Y = (5 + 4 + 3 + 2 + 1) / 5 = 15 / 5 = 3

Calculate the deviations of each X and Y from their respective means. The sum of their cross-products, divided by n - 1, is the covariance.

(Xi - Xmean)   (Yi - Ymean)   Product
 0              2              0
 0              1              0
 0              0              0
 0             -1              0
 0             -2              0
Sum                            0

Covariance = 0 / 4 = 0

Calculate the deviations of each X from Xmean, square them, sum them, and divide by n - 1 to get the variance:

(Xi - Xmean)   (Xi - Xmean)^2
 0              0
 0              0
 0              0
 0              0
 0              0
Sum             0

Variance of X = 0 / 4 = 0

Calculate the deviations of each Y from Ymean, square them, sum them, and divide by n - 1 to get the variance:

(Yi - Ymean)   (Yi - Ymean)^2
 2              4
 1              1
 0              0
-1              1
-2              4
Sum            10

Variance of Y = 10 / 4 = 2.5

Pearson r = covariance / (sqrt(variance of X) * sqrt(variance of Y))
Pearson r = 0 / (sqrt(0) * sqrt(2.5)) = 0 / 0

Because X has zero variance, the denominator is 0 and the ratio 0 / 0 is strictly undefined; there is no linear relationship between X and Y, so we report:

Pearson correlation (r) = 0. The coefficient of determination (r squared) = 0.

3b) Using the data in the matrix below (5 pairs of X-Y values), calculate the Pearson correlation coefficient (+0.125) and the coefficient of determination (+0.125). Show your work, including the way you calculated the deviations, the SDs and the r. Feel free to insert the tables from Excel into this document.

Y: 0, 0, 0, 0, 0
X: 5, 4, 3, 2, 1

The result is the same: Pearson correlation (r) = 0. The coefficient of determination (r squared) = 0. Switching the X and the Y does not alter the result.

3c) How do the correlation coefficients and the coefficients of determination you calculated in 3a and 3b compare to each other? Are you surprised this happened when you switched the X and Y variables in the analysis? Explain why / why not (+0.25)?

The results do not change, and this makes sense: the correlation does not distinguish between the two variables (X and Y).

r(X and Y) = r(Y and X)
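As an optional numerical check of 3a and 3b (the sketch assumes numpy is available): with a constant X, both the covariance and the variance of X are 0, so the formula for r runs into 0 / 0, and swapping X and Y cannot change that, because cov(X, Y) = cov(Y, X).

import numpy as np

x = np.array([0, 0, 0, 0, 0], dtype=float)
y = np.array([5, 4, 3, 2, 1], dtype=float)

cov_xy = np.cov(x, y, ddof=1)[0, 1]   # 0.0: X never deviates from its mean
var_x = x.var(ddof=1)                 # 0.0
var_y = y.var(ddof=1)                 # 2.5

print(cov_xy, var_x, var_y)           # 0.0 0.0 2.5

# r = cov / (sd_x * sd_y) = 0 / 0 here, so r is undefined and is reported
# as "no linear relationship" in the answer above. Swapping x and y
# changes nothing, because cov(X, Y) = cov(Y, X).
print(np.cov(y, x, ddof=1)[0, 1])     # also 0.0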
4a) Using the data in the matrix below (5 pairs of X-Y values), calculate the simple linear regression and report the equation of the best fit line (intercept and slope) (+0.125), the R square (+0.125) and the F value (+0.125). Show your work, including the way you calculated the deviations and the MS terms. Feel free to insert the tables from Excel into this document, if it helps you organize your work.

X: 5, 4, 3, 2, 1
Y: -2, -4, -6, -8, -10

Mean_X = 3; Mean_Y = -6.

Calculate the deviations of each X and Y from their respective means. The sum of their cross-products, divided by n - 1, is the covariance.

(Xi - Xmean)   (Yi - Ymean)   Product
 2              4              8
 1              2              2
 0              0              0
-1             -2              2
-2             -4              8
Sum                           20

Covariance = 20 / 4 = 5

Calculate the deviations of each X from Xmean, square them, sum them, and divide by n - 1 to get the variance:

(Xi - Xmean)   (Xi - Xmean)^2
 2              4
 1              1
 0              0
-1              1
-2              4
Sum            10

Variance of X = 10 / 4 = 2.5

The slope (b) = covariance / variance of X = 5 / 2.5 = 2

To get the intercept, use the line equation: Y = bX + intercept, and plug in any pair of values.
Using x = 5 and y = -2: -2 = 2 * (5) + intercept, so intercept = -2 - 10 = -12.
Using x = 1 and y = -10: -10 = 2 * (1) + intercept, so intercept = -10 - 2 = -12.

Best-fit line: Y = 2X - 12.

R-square = SumSquares of model / SumSquares total

For the SumSquares total, we use the mean of Y as the value predicted by the null-hypothesis line (slope = 0). Mean_Y = -6.

Sum Squares total: deviations of the observations from the null model (slope = 0 line).

(Yi - Ymean)   (Yi - Ymean)^2
 4             16
 2              4
 0              0
-2              4
-4             16
Sum            40

SumSquares Total = 40

SumSquares model: the deviations between the model (values predicted using the slope and intercept calculated above) and the mean of Y (Ymean).

X    Ypredicted   (Ypredicted - Ymean)   (Ypredicted - Ymean)^2
1    -10          -4                      16
2     -8          -2                       4
3     -6           0                       0
4     -4           2                       4
5     -2           4                      16
Sum                                       40

SumSquares Model = 40

SumSquares error: the deviations between the model (predicted values) and the observed values (Yi). This assesses the fit of the best-fit line to the data.

X    Ypredicted   (Yi - Ypredicted)   (Yi - Ypredicted)^2
1    -10           0                   0
2     -8           0                   0
3     -6           0                   0
4     -4           0                   0
5     -2           0                   0
Sum                                    0

SumSquares Error = 0

Because there is no error about the best-fit line, the Sum Squares Error = 0 and the Sum Squares Model = Sum Squares Total.

R-square = SumSquares of model / SumSquares total = 40 / 40 = 1

The F statistic = Mean Square Model / Mean Square Error
Mean Square Model = Sum Squares Model / df = 40 / 1 = 40
Mean Square Error = Sum Squares Error / df = 0 / (n - 2) = 0 / 3 = 0

The F statistic is unbounded (40 / 0): it grows toward infinity because the error term is zero.
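As an optional cross-check of 4a (not part of the graded answer; assumes numpy and scipy are available), the sketch below reproduces the slope, intercept, sums of squares and R squared computed by hand above. With a perfect fit the residual mean square is 0, so F cannot be computed as a finite number.

import numpy as np
from scipy import stats

x = np.array([5, 4, 3, 2, 1], dtype=float)
y = np.array([-2, -4, -6, -8, -10], dtype=float)

cov_xy = np.cov(x, y, ddof=1)[0, 1]        # 5.0
slope = cov_xy / x.var(ddof=1)             # 5 / 2.5 = 2.0
intercept = y.mean() - slope * x.mean()    # -6 - 2 * 3 = -12.0

y_hat = intercept + slope * x
ss_total = ((y - y.mean()) ** 2).sum()     # 40.0
ss_model = ((y_hat - y.mean()) ** 2).sum() # 40.0
ss_error = ((y - y_hat) ** 2).sum()        # 0.0

r_squared = ss_model / ss_total            # 1.0
ms_model = ss_model / 1                    # df model = 1
ms_error = ss_error / (len(y) - 2)         # df error = n - 2 = 3

print(slope, intercept, r_squared)         # 2.0 -12.0 1.0
# F = ms_model / ms_error would divide by zero here, which is the
# numerical counterpart of "F is unbounded" for a perfect fit.

print(stats.linregress(x, y))              # same slope and intercept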
4b) Using the data in the matrix below (5 pairs of X-Y values), calculate the simple linear regression and report the equation of the best fit line (intercept and slope) (+0.125), the R square (+0.125) and the F value (+0.125). Show your work, including the way you calculated the deviations and the MS terms. Feel free to insert the tables from Excel into this document, if it helps you organize your work.

Y: 1, 2, 3, 4, 5
X: -2, -4, -6, -8, -10

Mean_X = -6; Mean_Y = 3.

The slope (b) = covariance / variance of X.
The covariance (calculated the same way as above) = -20 / 4 = -5.
The variance of X = 40 / 4 = 10.

(Xi - Xmean)   (Xi - Xmean)^2
 4             16
 2              4
 0              0
-2              4
-4             16
Sum            40

Slope = -5 / 10 = -0.5

Intercept: using x = -2 and y = 1: 1 = -0.5 * (-2) + intercept, so intercept = 0.

Best-fit line: Y = -0.5X + 0.

R-square = SumSquares of model / SumSquares total

For the SumSquares total, we use the mean of Y as the value predicted by the null-hypothesis line (slope = 0). Mean_Y = 3.

Sum Squares total: deviations of the observations from the null model (slope = 0 line).

(Yi - Ymean)   (Yi - Ymean)^2
-2              4
-1              1
 0              0
 1              1
 2              4
Sum            10

SumSquares Total = 10

SumSquares model: the deviations between the model (values predicted using the slope and intercept calculated above) and the mean of Y (Ymean).

X      Ypredicted   (Ypredicted - Ymean)   (Ypredicted - Ymean)^2
 -2     1           -2                      4
 -4     2           -1                      1
 -6     3            0                      0
 -8     4            1                      1
-10     5            2                      4
Sum                                        10

SumSquares Model = 10

SumSquares error: the deviations between the model (predicted values) and the observed values (Yi). This assesses the fit of the best-fit line to the data.

X      Ypredicted   (Yi - Ypredicted)   (Yi - Ypredicted)^2
 -2     1            0                   0
 -4     2            0                   0
 -6     3            0                   0
 -8     4            0                   0
-10     5            0                   0
Sum                                      0

SumSquares Error = 0

Because there is no error about the best-fit line, the Sum Squares Error = 0 and the Sum Squares Model = Sum Squares Total.

R-square = SumSquares of model / SumSquares total = 10 / 10 = 1

The F statistic = Mean Square Model / Mean Square Error
Mean Square Model = Sum Squares Model / df = 10 / 1 = 10
Mean Square Error = Sum Squares Error / df = 0 / (n - 2) = 0 / 3 = 0

The F statistic is unbounded (10 / 0): it grows toward infinity because the error term is zero.

4c) How do the slopes and the R square terms you calculated in 4a and 4b compare to each other? Are you surprised this happened when you switched the X and Y variables in the analysis? Explain why / why not (+0.25)?

When we switch the X and Y variables, the regression is not symmetric the way the correlation is: the slope changes, and its magnitude becomes the reciprocal of the original slope (2 in 4a versus 0.5 in 4b; the sign of the 4b slope reflects the direction of the relationship in the data as entered). However, because the best-fit lines match the observations perfectly and there is no variance due to error (Sum Squares Error and MS Error = 0), the R squared = 1 in both analyses and F is unbounded (a very large positive value) in both.
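As an optional illustration of the rule behind 4c (not part of the graded answer; assumes numpy is available): for any X-Y data set, the slope of Y on X times the slope of X on Y equals r^2, so when the fit is perfect the two slopes are exact reciprocals while R squared stays at 1. The sketch below uses the 4a data and its strict swap (the same pairs with the axes exchanged). Note that the 4b matrix as entered lists the Y values in the opposite order from a strict swap, which is why its computed slope is -0.5 rather than +0.5; the magnitude is still the reciprocal of 2.

import numpy as np

# 4a data and its strict swap (same pairs, axes exchanged)
x = np.array([5, 4, 3, 2, 1], dtype=float)
y = np.array([-2, -4, -6, -8, -10], dtype=float)

def slope(ind, dep):
    # Least-squares slope of dep on ind: covariance / variance of ind
    return np.cov(ind, dep, ddof=1)[0, 1] / ind.var(ddof=1)

b_yx = slope(x, y)                        # regress Y on X -> 2.0
b_xy = slope(y, x)                        # regress X on Y -> 0.5

r = np.corrcoef(x, y)[0, 1]
print(b_yx, b_xy, b_yx * b_xy, r ** 2)    # 2.0 0.5 1.0 1.0
# In general b_yx * b_xy = r^2; with a perfect fit (r^2 = 1) the two
# slopes are exact reciprocals, and R squared = 1 for both regressions.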
5) Load the data file "Biol4090_Hw#6.sav", containing data on advertising costs and record sales. Perform a linear regression so you can predict how much money the sales of a record will make depending on how much money you spend on advertising. NOTE: make sure you ask for the standardized residual plots by checking the boxes shown below.

- Calculate the R squared using the ratio of the Sum of Squares values from the output table:
SumSquares total: 1295952.000
SumSquares model: 433687.833
R-squared = 433687.833 / 1295952.000 = 0.335 (+0.10)

- Calculate the F statistic using the ratio of the Mean Square values from the output table:
MS error: 4354.870
MS model: 433687.833
F statistic = 433687.833 / 4354.870 = 99.587 (+0.10)

- Report the best-fit line:
Slope of the line (b1) +/- S.E.: 0.096 (0.010) (+0.10)
Intercept of the line (b0) +/- S.E.: 134.140 (7.537) (+0.10)

- Report the p values for the slope, the intercept and the F statistic:
p value of the slope: p < 0.001 (+0.10)
p value of the intercept: p < 0.001 (+0.10)
p value of the F statistic: p < 0.001 (+0.10)

Copy and paste the histogram of the residuals (+0.10). Does it look like a normal distribution?
The histogram looks like a normal distribution, with a mean of 0 and a standard deviation of 1. Also, notice that the P-P plot suggests there is an excess of observed values close to the mean (+0.10).

Create a scatterplot of the X-Y relationship, and paste the image below (+0.10):
To do this, I used the Chart Builder and selected the Simple Scatterplot (with symbols of one color). I dropped "Record Sales" on the Y drop zone and "Advertising Budget" on the X drop zone. Next, left-click on the image to open the Chart Editor window.

Add a reference line manually, using the best-fit equation, over the scatterplot. Hint: see the attached menu (+0.20).

Finally, use the best-fit equation (intercept and slope) of the line you developed with the linear regression to predict the amount of sales revenue you expect to make if you spent the following amounts on advertising. NOTE: if you cannot calculate the revenue, respond N.A.

- Spend $0.00: predicted sales = 134.140 (+0.10)
Y(predicted) = 134.140 + 0.096 * (0) = 134.140; this is the Y-intercept.

- Spend $2500.00: predicted sales = 374.140 (+0.10)
Y(predicted) = 134.140 + 0.096 * (2500) = 374.140
However, because this X value lies outside the range of X values used to fit the model, we should not use the current best-fit linear regression to make this estimate (extrapolation). (extra +0.20)

- Spend $1000.00: predicted sales = 230.140 (+0.10)
Y(predicted) = 134.140 + 0.096 * (1000) = 230.140
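As an optional check of the three predictions above (plain Python; the only inputs are the intercept and slope reported from the SPSS output):

# Best-fit line reported above: sales = 134.140 + 0.096 * advertising
b0, b1 = 134.140, 0.096

def predicted_sales(advertising):
    # Predicted record sales for a given advertising spend
    return b0 + b1 * advertising

for spend in (0, 1000, 2500):
    print(spend, predicted_sales(spend))
# 0    -> 134.140 (the intercept)
# 1000 -> 230.140
# 2500 -> 374.140, but this is an extrapolation beyond the observed range
#         of advertising budgets, so it should be treated with caution.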