STK120 Assignment 3: Simple Linear Regression
Section A: Homework
Exercise H1: Least Squares Method
Do the following exercises:
1. Given are 5 observations collected in a study on two variables:
   xi:  2  4  5  7  8
   yi:  2  3  2  6  4
a. Develop a scatter diagram for these data.
b. Develop the estimated regression equation for these data.
c. Use the estimated regression equation to predict the value of y when x = 4.
2. The following data were collected on the height (inches) and weight (pounds) of women swimmers:
   Height:   68   64   62   65   66
   Weight:  132  108  102  115  128
a. Develop a scatter diagram for these data with height as the independent variable.
b. What does the scatter diagram in part (a) indicate about the relationship between the two variables?
c. Try to approximate the relationship between height and weight by drawing a straight line through the data.
d. Develop the estimated regression equation for these data.
e. If a swimmer’s height is 63 inches, what would you estimate her weight to be?
Memorandum H1
1. a. [Scatter diagram of y (vertical axis, 0 to 7) against x (horizontal axis, 0 to 9).]
b. Summations needed to compute the slope and y-intercept are:
$\sum x_i = 26$, $\sum y_i = 17$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 11.6$, $\sum (x_i - \bar{x})^2 = 22.8$
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{11.6}{22.8} = 0.5088$
$b_0 = \bar{y} - b_1 \bar{x} = 3.4 - (0.5088)(5.2) = 0.7542$
$\hat{y} = 0.75 + 0.51x$
c. $\hat{y} = 0.75 + 0.51(4) = 2.79$
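The least squares calculation for question 1 can be checked with a few lines of Python. This is only a sketch using the five observations from the exercise; the function name `least_squares` and the variable names are illustrative choices, not part of the assignment.

```python
# Least squares fit for the five observations in Exercise H1, question 1.
x = [2, 4, 5, 7, 8]
y = [2, 3, 2, 6, 4]


def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products of deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
y_hat_4 = b0 + b1 * 4  # prediction at x = 4
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, prediction at x = 4: {y_hat_4:.2f}")
```

Carrying full precision gives an intercept of about 0.7544; the memorandum's 0.7542 comes from using the rounded slope 0.5088 in the intercept formula. Both round to the same equation, 0.75 + 0.51x.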
2. a.
b. There appears to be a linear relationship between x = height (inches) and y = weight (pounds).
c. Many different straight lines can be drawn to provide a linear approximation of the
relationship between x and y; in part d we will determine the equation of a straight line
that “best” represents the relationship according to the least squares criterion.
d. Summations needed to compute the slope and y-intercept are:
$\sum x_i = 325$, $\sum y_i = 585$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 110$, $\sum (x_i - \bar{x})^2 = 20$
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{110}{20} = 5.5$
$b_0 = \bar{y} - b_1 \bar{x} = 117 - (5.5)(65) = -240.5$
$\hat{y} = -240.5 + 5.5x$
e. $\hat{y} = -240.5 + 5.5x = -240.5 + 5.5(63) = 106$ pounds
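As a cross-check on part (d), the slope can also be computed from raw sums rather than from deviations. The sketch below uses the algebraically equivalent computational formula $b_1 = (\sum xy - \sum x \sum y / n) / (\sum x^2 - (\sum x)^2 / n)$, which is a swapped-in alternative to the deviation form used in the memorandum; the variable names are illustrative.

```python
# Swimmers data from Exercise H1, question 2.
height = [68, 64, 62, 65, 66]       # x, inches
weight = [132, 108, 102, 115, 128]  # y, pounds

n = len(height)
sum_x = sum(height)
sum_y = sum(weight)
sum_xy = sum(h * w for h, w in zip(height, weight))
sum_x2 = sum(h * h for h in height)

# Computational (raw-sum) form of the least squares slope and intercept.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

predicted_weight = b0 + b1 * 63  # estimated weight for a height of 63 inches
print(f"y-hat = {b0:.1f} + {b1:.1f}x, prediction at 63 inches: {predicted_weight:.0f} pounds")
```

The raw-sum form is convenient on a calculator because it needs only running totals, but it gives the same line as the deviation form.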
Exercise H2: Coefficient of Determination
Do the following exercise on p. 605-606 of the textbook:
No. 18
Memorandum H2
18.
a. The estimated regression equation and the mean for the dependent variable are:
$\hat{y} = 1790.5 + 581.1x$, $\bar{y} = 3650$
The sum of squares due to error and the total sum of squares are
$SSE = \sum (y_i - \hat{y}_i)^2 = 85,135.14$
$SST = \sum (y_i - \bar{y})^2 = 335,000$
Thus, SSR = SST - SSE = 335,000 - 85,135.14 = 249,864.86
b. $r^2 = SSR/SST = 249,864.86/335,000 = .746$
We see that 74.6% of the variability in y can be explained by the least squares regression equation.
c. $r = +\sqrt{.746} = +.8637$
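The steps in this memorandum can be reproduced from the two sums of squares alone. The following is a minimal sketch, assuming only the SSE, SST and slope quoted above; the variable names are illustrative.

```python
import math

# Values quoted in Memorandum H2.
sse = 85135.14   # sum of squares due to error
sst = 335000.0   # total sum of squares
b1 = 581.1       # estimated slope; only its sign is used below

ssr = sst - sse                # sum of squares due to regression
r_squared = ssr / sst          # coefficient of determination
# The correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), b1)

print(f"SSR = {ssr:.2f}, r^2 = {r_squared:.3f}, r = {r:+.4f}")
```

Taking the square root of the unrounded $r^2$ gives about +.8636; the memorandum's +.8637 results from first rounding $r^2$ to .746.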
Exercise H3: Testing for Significance
Do the following exercise on p.617 of the textbook:
No. 26
Memorandum H3
26. a. $s^2 = MSE = SSE / (n - 2) = 85,135.14 / 4 = 21,283.79$
$s = \sqrt{MSE} = \sqrt{21,283.79} = 145.89$
$\sum (x_i - \bar{x})^2 = 0.74$
$s_{b_1} = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \dfrac{145.89}{\sqrt{0.74}} = 169.59$
$t = \dfrac{b_1}{s_{b_1}} = \dfrac{581.08}{169.59} = 3.43$
$t_{0.025} = 2.776$ (4 degrees of freedom)
Since $t = 3.43 > t_{0.025} = 2.776$, we reject $H_0: \beta_1 = 0$. Hence there is a significant relationship between grade point average and monthly salary.
b. MSR = SSR / 1 = 249,864.86 / 1 = 249,864.86
F = MSR / MSE = 249,864.86 / 21,283.79 = 11.74
$F_{0.05} = 7.71$ (1 and 4 degrees of freedom)
Since $F = 11.74 > F_{0.05} = 7.71$, we reject $H_0: \beta_1 = 0$. Hence there is a significant relationship between grade point average and monthly salary.
c.
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Regression            249864.86        1                    249864.86     11.74
Error                 85135.14         4                    21283.79
Total                 335000           5
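The t-test and F-test in this memorandum follow the same recipe, which can be sketched in Python as below. The inputs are the values quoted in the memorandum (n = 6 follows from the 4 error degrees of freedom); the critical values are the table values quoted above, hard-coded rather than computed, and the variable names are illustrative.

```python
import math

# Quantities quoted in Memorandum H3.
n = 6             # sample size, so that n - 2 = 4
b1 = 581.08       # estimated slope
sse = 85135.14    # sum of squares due to error
ssr = 249864.86   # sum of squares due to regression
s_xx = 0.74       # sum of squared x-deviations (rounded in the memorandum)

mse = sse / (n - 2)             # estimate of sigma squared
s = math.sqrt(mse)              # standard error of the estimate
s_b1 = s / math.sqrt(s_xx)      # estimated standard deviation of b1
t_stat = b1 / s_b1              # t test statistic
f_stat = (ssr / 1) / mse        # F test statistic, MSR / MSE

# Critical values as quoted in the memorandum (from tables).
t_crit = 2.776   # t_0.025 with 4 degrees of freedom
f_crit = 7.71    # F_0.05 with 1 and 4 degrees of freedom

reject_by_t = abs(t_stat) > t_crit
reject_by_f = f_stat > f_crit
print(f"t = {t_stat:.2f}, F = {f_stat:.2f}, reject H0: {reject_by_t and reject_by_f}")
```

Both tests lead to the same decision, as expected in simple linear regression where there is only one independent variable.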
Exercise H4: Excel’s Regression Tool
Do the following exercise on p.631 of the textbook:
No. 41
Memorandum H4
32. a. $\hat{y} = 6.1092 + 0.8951x$
b. $t = \dfrac{b_1}{s_{b_1}} = \dfrac{0.8951}{0.149} = 6.01$
$t_{0.025} = 2.306$ (8 degrees of freedom)
Since $t = 6.01 > t_{0.025} = 2.306$, we reject $H_0: \beta_1 = 0$. Hence monthly maintenance expense is related to usage.
c. $r^2 = SSR/SST = 1575.76/1924.90 = 0.82$. A good fit.
Section B: Practical
Exercise P1
1. Re-do the ARMAND’S PIZZA PARLOR example (data on p. 587 and on Webfile in the Armand’s data file in the Chapter 14 Data folder), following the instructions on p.591-592 for the scatter plot and trend line. Compare your results to those given on p.592.
2. Calculate the coefficient of determination for the ARMAND’S PIZZA PARLOR example by following the instructions on p.603. Compare your results to those given on p.604.
Exercise P2
The following data concerns the demand for a product (in thousands) and its price (in Rand) in five different
market areas.
Figure 1
1. Determine the regression equation and the coefficient of determination by following the
instructions on p.591-592 and p.603.
Enter the data in an Excel spreadsheet. Obtain a scatter diagram between the price (x) and the demand (y). (Using similar instructions as on p.591)
Right click on any data point in the scatter diagram to obtain the estimated regression equation, $\hat{y} = b_0 + b_1 x$, and the coefficient of determination, $r^2$. (See Figure 2.)
Figure 2: Scatter diagram of the price (in Rand) and the demand (in thousands) of a product. Horizontal axis: Price (in Rand), 8 to 20; vertical axis: Demand (in thousands), 0 to 140. Fitted trend line: y = -12.1x + 235.4, with $R^2 = 0.9709$.
Note:
• From the graph it is evident that there is a very strong negative linear relationship between the price and the demand for the product.
• In this example it makes no sense to interpret the intercept $b_0 = 235.4$. Why?
• From the estimate of the slope, $b_1 = -12.1$, it is clear that the demand decreases by 12 100 units for every R1 increase in the price.
• According to the coefficient of determination, 97.09% of the variation in the demand is explained by the price.
2. Determine the regression equation by making use of the formulas on p. 589 - 590
The calculations for the estimates of the estimated regression equation, yˆ = b0 + b1 x , are given in the
following Excel spreadsheet. (See Fig 3.) Make sure that you are also able to do these calculations with
your calculator.
Figure 3: Formula Sheet
Figure 3: Value Sheet
Questions
i. Use the data above and write down the estimated regression equation.
ii. Calculate the estimated demand for a price of R15.
iii. Interpret the answer in (ii).
Answer:
i. The estimated regression equation is: $\hat{y} = 235.4 - 12.1x$
ii. For a price of R15: $\hat{y} = 235.4 - 12.1(15) = 53.9$.
iii. The estimated demand for a price of R15 is 53 900 units.
3. Determine SST, SSE and SSR by making use of the formulas on p.599-602.
Figure 4: Formula Sheet
Note: Columns C – F are hidden
Figure 4: Value Sheet
Note: Columns C – F are hidden
Note:
• Because SST = SSE + SSR, it is also possible to obtain SSR as SSR = SST - SSE = 6032 - 175.6 = 5856.4.
Questions:
i. Calculate and interpret the coefficient of determination.
ii. Calculate and interpret the correlation coefficient.
iii. Calculate the estimate of $\sigma$, the standard deviation of $\varepsilon$.
Answers:
Figure 5: Formula Sheet
Figure 5: Value Sheet
i. The value of the coefficient of determination is $r^2 = \dfrac{SSR}{SST} = 0.9709$. Consequently, 97.09% of the total sum of squares is explained by the sum of squares for regression.
ii. The value of the correlation coefficient is $r_{xy} = -\sqrt{r^2} = -0.9853$, which indicates a very strong negative linear relationship between the price (x) and the demand (y).
iii. The estimate of $\sigma$ is $s = \sqrt{MSE} = 7.6507$, which measures the average deviation of the y-values around the estimated regression line.
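The three answers above depend only on SST, SSE, the sample size and the sign of the slope, so they can be verified without the spreadsheet. A minimal sketch, assuming the values quoted in this exercise (n = 5 market areas, SST = 6032, SSE = 175.6, slope -12.1); the variable names are illustrative.

```python
import math

# Values quoted in Exercise P2.
n = 5          # five market areas
sst = 6032.0   # total sum of squares
sse = 175.6    # sum of squares due to error
b1 = -12.1     # estimated slope; its sign determines the sign of r

ssr = sst - sse                                  # sum of squares due to regression
r_squared = ssr / sst                            # coefficient of determination
r_xy = math.copysign(math.sqrt(r_squared), b1)   # correlation coefficient
s = math.sqrt(sse / (n - 2))                     # estimate of sigma, sqrt(MSE)

print(f"SSR = {ssr:.1f}, r^2 = {r_squared:.4f}, r = {r_xy:.4f}, s = {s:.4f}")
```

Because the demand is recorded in thousands, s is in the same units, that is, roughly 7 651 units of demand.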
Exercise P3
The following data concerns the demand for a product (in thousands) and its price (in Rand) in five different
market areas.
Figure 1
Note the calculations for the values are done in Exercise P2.
1. Consider the data in Figure 1 and test the hypothesis
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
with $\alpha = 0.01$ by using the t-test.
Figure 2: Formula Sheet
Figure 2: Value Sheet
• The standard error (or standard deviation) of $b_1$ is
$s_{b_1} = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = 1.2097$
• The value of the test statistic is
$t = \dfrac{b_1}{s_{b_1}} = -10.0026$
• According to the p-value, the null hypothesis is rejected at the 1% level of significance because p-value = 0.0021 < 0.01.
• According to the critical values $\pm t_{.005} = \pm 5.8408$ ($df = n - 2 = 3$), the null hypothesis is rejected at the 1% level of significance because $t = -10.0026 < -5.8408$.
Note: This is a two-sided test.
2. Test the hypothesis
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
with $\alpha = 0.01$ by making use of the F-test.
Figure 3: Formula Sheet
Figure 3: Value Sheet
• The value of the test statistic is
$F = \dfrac{MSR}{MSE} = 100.0524$
Note: The F-statistic is the square of the t-statistic, i.e. $F = 100.052 = (-10.0026)^2 = t^2$.
• According to the p-value, the null hypothesis is rejected at the 1% level of significance because p-value = 0.0021 < 0.01.
Note: The p-value is exactly the same as for the t-test.
• According to the critical value $F_{.01} = 34.1161$ with 1 and 3 degrees of freedom, the null hypothesis is rejected at the 1% level of significance because $F = 100.0524 > 34.1161$.
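The relationship between the two tests can be checked numerically. The sketch below rebuilds the F-statistic from the sums of squares found in Exercise P2 (SSR = 5856.4, SSE = 175.6, n = 5) and compares it with the square of the t-statistic quoted above; the critical value is the table value from this exercise, hard-coded rather than computed, and the variable names are illustrative.

```python
# Values quoted in Exercises P2 and P3.
n = 5             # five market areas
ssr = 5856.4      # sum of squares due to regression
sse = 175.6       # sum of squares due to error
t_stat = -10.0026 # t test statistic from the t-test above
f_crit = 34.1161  # F_.01 with 1 and 3 degrees of freedom, as quoted

msr = ssr / 1          # one independent variable, so 1 degree of freedom
mse = sse / (n - 2)    # n - 2 = 3 error degrees of freedom
f_stat = msr / mse

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}, reject H0: {f_stat > f_crit}")
```

The small difference between F and t squared comes only from the rounding of the quoted t-statistic.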
Exercise P4:
Do the following exercise by using the Excel Regression Tool: (Use similar instructions as on p.626)
A sample of 15 companies taken from the Stock Investor Pro was used to obtain the following data on the
price/earnings (P/E) ratio and the gross profit margin for each company.
Firm                        Gross Profit Margin (%)   P/E Ratio
Abbot Laboratories          23.7                      22.3
American Home Products      21.1                      22.6
Amoco                       11.0                      16.7
Bristol Meyers Squibb Co.   26.6                      25.9
Chevron                     11.6                      18.3
Exxon                       9.8                       18.7
General Electric Company    13.4                      13.1
Hewlett-Packard             9.7                       23.3
IBM                         11.5                      17.3
Merck & Co. Inc.            25.6                      26.2
Mobil                       8.2                       18.7
Pfizer                      25.1                      34.6
Pharmacia & Upjohn, Inc.    15.0                      22.3
Texaco                      7.3                       12.3
Travellers Group Inc.       17.8                      28.7
a. Determine the estimated regression equation that can be used to predict the price/earnings ratio given the
gross profit margin.
b. Use the F-test to determine whether the gross profit margin and the P/E ratio are related. What is your
conclusion on the 5% level of significance?
c. Use the t-test to determine whether the gross profit margin and the P/E ratio are related. What is your
conclusion on the 5% level of significance? Is the conclusion that you reached here the same as in part (b)?
Explain
d. Did the estimated regression equation provide a good fit to the data? Explain.
Excel Regression Tool output (see note below). The annotated output identifies the following quantities: the coefficient of determination, the standard error for $\varepsilon$, SSR, SSE, SST, MSR, MSE, the F-test statistic, the p-value for the F-test, $b_0$, $b_1$, the standard error for $b_1$, the t-test statistic and the p-value for the t-test.
Note: The Multiple R is the square root of the coefficient of determination; hence the correlation coefficient can be calculated as (sign of $b_1$) × (Multiple R).
a. $\hat{y} = 11.3332 + .6361x$, where x = Gross Profit Margin (%)
b. There is a significant relationship since the significance F (p-value) = .0017 < α = .05
c. There is a significant relationship since the significance t (p-value) = .0017 < α = .05
d. $r^2 = 0.5444$; not a good fit
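For readers without Excel at hand, the main quantities in the Regression Tool output can be reproduced from the 15 data pairs in the table. This is a rough sketch in plain Python, not the Excel procedure itself; the variable names are illustrative, and p-values are left out because the standard library has no t or F distribution functions.

```python
import math

# Data from the table in Exercise P4.
margin = [23.7, 21.1, 11.0, 26.6, 11.6, 9.8, 13.4, 9.7,
          11.5, 25.6, 8.2, 25.1, 15.0, 7.3, 17.8]   # x: gross profit margin (%)
pe = [22.3, 22.6, 16.7, 25.9, 18.3, 18.7, 13.1, 23.3,
      17.3, 26.2, 18.7, 34.6, 22.3, 12.3, 28.7]     # y: P/E ratio

n = len(margin)
x_bar = sum(margin) / n
y_bar = sum(pe) / n
s_xx = sum((x - x_bar) ** 2 for x in margin)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(margin, pe))

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # intercept

sst = sum((y - y_bar) ** 2 for y in pe)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(margin, pe))
ssr = sst - sse
r_squared = ssr / sst       # coefficient of determination

mse = sse / (n - 2)
f_stat = (ssr / 1) / mse                # F test statistic
t_stat = b1 / math.sqrt(mse / s_xx)     # t test statistic, b1 / s_b1

print(f"y-hat = {b0:.4f} + {b1:.4f}x, r^2 = {r_squared:.4f}")
print(f"F = {f_stat:.2f}, t = {t_stat:.2f}")
```

The F-statistic equals the square of the t-statistic, which is why the two p-values in the answers to (b) and (c) coincide.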
Section C: Typical Exam Questions
Questions 1 to 4 are based on the following information:
The relationship between the number of hours spent studying (x) for a test and the test marks (y)
achieved by 8 randomly selected students is examined. The data are recorded in the following table:
Hours spent     Test mark
studying (x)    (y)          $(x-\bar{x})^2$   $(y-\bar{y})^2$   $(x-\bar{x})(y-\bar{y})$   $(y-\hat{y})^2$
3               35           25                552.25            117.5                      10.60
4               39           16                380.25            78                         10.92
4               39           16                380.25            78                         10.92
5               45           9                 182.25            40.5                       1.83
7               67           1                 72.25             -8.5                       157.48
10              72           4                 182.25            27                         29.18
15              78           49                380.25            136.5                      78.19
16              93           64                1190.25           276                        4.45
Total:                       184               3320              745                        303.56
The regression model fitted here is $y = \beta_0 + \beta_1 x + \varepsilon$ and the estimated regression equation is $\hat{y} = b_0 + b_1 x$.
Question 1
The value of the slope of the estimated regression equation is:
(A) 25.7174
(B) 0.2244
(C) 2.4542
(D) 4.0489
(E) 0.2470
(2)
Question 2
The sum of squares due to the regression is:
(A) 3320.00
(B) 3623.56
(C) 3136.00
(D) 303.56
(E) 3016.44
(2)
Question 3
The estimate of $\sigma$, the standard deviation of $\varepsilon$, is:
(A) 7.113
(B) 57.619
(C) 23.523
(D) 50.593
(E) 17.423
(2)
Question 4
The test statistic for testing the hypothesis
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
at a 5% level of significance, has a t-distribution, with critical values:
(A) $\pm t_{0.05,8}$
(B) $\pm t_{0.025,7}$
(C) $\pm t_{0.05,7}$
(D) $\pm t_{0.025,6}$
(E) $\pm t_{0.05,6}$
(2)
Questions 5 and 6 are based on the following information:
A regression model was fitted, with one independent variable.
The following incomplete regression ANOVA table is given for a random sample of 20 paired
observations:
Source of variation   Sum of squares   Degrees of freedom   Mean square   F-statistic
Regression            ***              ***                  (a)           ***
Error                 ***              ***                  190
Total                 20 000           ***
Question 5
The mean square regression, (a), is:
(A) 1 000
(B) 1 052.63
(C) 3 420
(D) 16 580
(E) 8 290
(2)
Question 6
The value of the coefficient of determination is:
(A) 0.911
(B) 0.414
(C) 0.751
(D) 0.171
(E) 0.829
(2)
Memo: Q1 D, Q2 E, Q3 A, Q4 D, Q5 D, Q6 E
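The memo answers for Questions 1 to 3, 5 and 6 can be traced back to the column totals and the ANOVA table with a short calculation. The sketch below is one possible working, assuming the totals given in the data table (184, 3320, 745, 303.56) and the ANOVA entries (n = 20, SST = 20 000, MSE = 190); the variable names are illustrative.

```python
import math

# Questions 1 to 3: totals from the study-hours table, n = 8 students.
n = 8
s_xx = 184.0    # total of (x - x_bar)^2
sst = 3320.0    # total of (y - y_bar)^2
s_xy = 745.0    # total of (x - x_bar)(y - y_bar)
sse = 303.56    # total of (y - y_hat)^2

b1 = s_xy / s_xx                # Question 1: slope
ssr = sst - sse                 # Question 2: sum of squares due to regression
s = math.sqrt(sse / (n - 2))    # Question 3: estimate of sigma
df_error = n - 2                # Question 4: degrees of freedom for the t-test

# Questions 5 and 6: incomplete ANOVA table, 20 paired observations.
n_anova = 20
sst_anova = 20000
mse_anova = 190

sse_anova = mse_anova * (n_anova - 2)   # SSE = MSE x error degrees of freedom
ssr_anova = sst_anova - sse_anova
msr_anova = ssr_anova / 1               # Question 5: one independent variable
r_squared_anova = ssr_anova / sst_anova # Question 6

print(f"Q1 b1 = {b1:.4f}, Q2 SSR = {ssr:.2f}, Q3 s = {s:.3f}, Q4 df = {df_error}")
print(f"Q5 MSR = {msr_anova:.0f}, Q6 r^2 = {r_squared_anova:.3f}")
```

For Question 4, a two-sided test at the 5% level puts 0.025 in each tail with n - 2 = 6 degrees of freedom, which matches option (D).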