Chapter 14

Multiple Linear Regression and
Correlation Analysis
McGraw-Hill/Irwin
©The McGraw-Hill Companies, Inc. 2008
GOALS

1. Describe the relationship between several independent variables and a dependent variable using multiple regression analysis.
2. Set up, interpret, and apply an ANOVA table.
3. Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination.
4. Conduct a test of hypothesis to determine whether regression coefficients differ from zero.
5. Conduct a test of hypothesis on each of the regression coefficients.
6. Use residual analysis to evaluate the assumptions of multiple regression analysis.
7. Evaluate the effects of correlated independent variables.
8. Use and understand qualitative independent variables.
9. Understand and interpret the stepwise regression method. (skip)
10. Understand and interpret possible interaction among independent variables. (skip)
1. Multiple Regression Analysis

• We use more than one independent variable to explain or predict the dependent variable, Y.
• The general multiple regression equation with k independent variables is:

  Ŷ = a + b1X1 + b2X2 + ... + bkXk

• Again, the least squares criterion is used to develop this equation.
• Because determining b1, b2, etc. by hand is very tedious, a software package such as Excel or MINITAB is recommended.
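As a concrete illustration, here is a minimal Python sketch (with made-up data, not the textbook sample) that fits such an equation by least squares using numpy:

```python
import numpy as np

# Illustrative data only (not the textbook sample): n = 5 homes,
# k = 2 independent variables (e.g., temperature and insulation).
X = np.array([[35.0, 3.0],
              [29.0, 4.0],
              [36.0, 7.0],
              [60.0, 6.0],
              [65.0, 5.0]])
y = np.array([250.0, 360.0, 165.0, 43.0, 92.0])

# Least squares: prepend a column of 1s so the first estimate is the
# intercept a; the remaining estimates are b1, b2, ...
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

a, b = coef[0], coef[1:]
terms = " + ".join(f"({bi:.3f})X{i+1}" for i, bi in enumerate(b))
print(f"Yhat = {a:.3f} + {terms}")
```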
Multiple Regression Analysis

For two independent variables, the general form of the multiple regression equation is:

  Ŷ = a + b1X1 + b2X2

• X1 and X2 are the independent variables.
• a is the Y-intercept.
• b1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial (or net) regression coefficient.
• Graphically, the relationship is portrayed as a plane (as shown in Chart 14-1).
Regression Plane for a 2-Independent Variable Linear Regression Equation

[Chart 14-1: the regression plane fitted through the sample points in (X1, X2, Y) space.]
Multiple Linear Regression - Example

One of the questions most frequently asked by prospective home buyers is: if we purchase this home, how much can we expect to pay to heat it during the winter? A real estate agency has been asked to develop a guideline regarding heating costs for single-family homes.

Three variables are thought to relate to heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace (boiler).

To investigate, the research department selected a random sample of 20 recently sold homes and collected, for each home, the heating cost as well as the outside temperature, the inches of insulation, and the age of the furnace.
Multiple Linear Regression - Example

[Table: heating cost, mean outside temperature, attic insulation, and furnace age for the 20 sampled homes.]

Multiple Linear Regression – Excel Example

[Excel regression output for the heating-cost data.]
The Multiple Regression Equation – Interpreting the Regression Coefficients

The regression coefficient for outside temperature is −4.583. The coefficient is negative, showing an inverse relationship between heating cost and temperature: as the outside temperature increases, the heating cost decreases. The numeric value of the regression coefficient provides more information: if temperature increases by 1 degree, holding the other two independent variables constant, we estimate a decrease of $4.583 in heating cost. So if the mean temperature is 25 degrees in Boston and 35 degrees in Philadelphia, insulation and age of furnace being the same, we expect the heating cost to be $45.83 less in Philadelphia.

The insulation variable also shows an inverse relationship: the more insulation in the attic, the lower the heating cost, so the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the heating cost to decline by $14.83 per month, regardless of the outside temperature or the age of the furnace.

The age-of-furnace variable shows a direct relationship: with an older furnace, the cost to heat the home increases. Specifically, for each additional year of furnace age, we expect the heating cost to increase by $6.10 per month.
Applying the Model for Estimation

What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?
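A worked sketch of the substitution, assuming the fitted equation from the software output is Ŷ = 427.194 − 4.583X1 − 14.831X2 + 6.101X3 (the slopes match the coefficients interpreted above; the intercept is read from the printed output):

```python
# Coefficients assumed from the regression output discussed above.
a, b1, b2, b3 = 427.194, -4.583, -14.831, 6.101

temp, insulation, age = 30.0, 5.0, 10.0   # the scenario in the question
yhat = a + b1 * temp + b2 * insulation + b3 * age
print(f"Estimated heating cost: ${yhat:.2f}")   # about $276.56
```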
2. How well does the regression equation fit the data?

Several measures are used to describe how effectively the independent variables explain the variation of the dependent variable:
(1) Multiple Standard Error of Estimate
(2) ANOVA Table
(3) Coefficient of (Multiple) Determination
(4) Adjusted Coefficient of (Multiple) Determination
(1) Multiple Standard Error of Estimate

The multiple standard error of estimate is a measure of the effectiveness of the regression equation.
• It is based on the sum of the squared deviations from the regression line (the residuals), Σ(Y − Ŷ)².
• It is measured in the same units as the dependent variable (if Y is measured in dollars, the standard error is also in dollars).
• The formula is:

  S(Y.12...k) = √[ Σ(Y − Ŷ)² / (n − k − 1) ]
(2) The ANOVA Table

• The ANOVA table reports the total variation in the dependent variable (SS total), which is divided into two components:
  – the explained (or regression) variation (SSR), which is accounted for by the set of independent variables;
  – the unexplained (or residual, or error) variation (SSE), which is not accounted for by the independent variables.
• The degrees of freedom for the explained (regression) variation equal the number of independent variables, k.
• The degrees of freedom for the error variation equal n − k − 1.
• Each mean square is obtained by dividing the SS (sum of squares) by its matching df.
• The ANOVA table can be used to evaluate the regression result.
  – E.g., S(Y.123) = √MSE, or F = MSR/MSE (discussed later).
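The whole table can be rebuilt from two sums of squares. A minimal sketch using the SSR and SS total reported in the Minitab output (n = 20, k = 3):

```python
import math

n, k = 20, 3
ss_total = 212_916.0        # total variation, SS total
ssr = 171_220.0             # explained (regression) variation
sse = ss_total - ssr        # unexplained (error) variation

msr = ssr / k               # mean square regression, df = k
mse = sse / (n - k - 1)     # mean square error, df = n - k - 1

print(f"SSE = {sse:,.0f}")                               # 41,696
print(f"MSR = {msr:,.1f}, MSE = {mse:,.1f}")
print(f"s(Y.123) = sqrt(MSE) = {math.sqrt(mse):.2f}")    # about 51.05
print(f"F = MSR/MSE = {msr / mse:.2f}")                  # about 21.90
```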
Minitab – the ANOVA Table

[Minitab ANOVA output for the heating-cost regression.]
(3) Coefficient of Multiple Determination (R²)

Characteristics of the coefficient of multiple determination:
1. It is symbolized by a capital R squared, i.e., written as R², because it behaves like the square of a correlation coefficient.
2. It ranges from 0 to 1. A value near 0 means little association between the set of independent variables and the dependent variable; a value near 1 means a strong association.
   – It is easy to interpret, compare, and understand.
3. It cannot assume negative values, since it is a squared value.
Minitab – the ANOVA Table

  R² = SSR / SS total = 171,220 / 212,916 ≈ 0.804
(4) Adjusted Coefficient of Determination

• The number of independent variables in a multiple regression equation affects the size of the coefficient of determination (R²):
  – Each new independent variable makes SSE smaller and SSR larger, and thereby R² larger.
  – R² can increase merely because the number of independent variables increases, not because the added variable is a good predictor of Y.
  – If the number of independent variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0.
• To balance the effect the number of independent variables has on the value of R², we use an adjusted coefficient of determination:

  R²adj = 1 − [ SSE / (n − k − 1) ] / [ SS total / (n − 1) ]

  – Statistical software packages provide this R²adj.
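A quick numerical check of the adjustment, using the heating-cost sums of squares from the ANOVA table (SSE = SS total − SSR):

```python
n, k = 20, 3
sse, ss_total = 41_696.0, 212_916.0

r2 = 1 - sse / ss_total
r2_adj = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

print(f"R^2     = {r2:.3f}")      # about 0.804
print(f"R^2 adj = {r2_adj:.3f}")  # about 0.767, a bit below R^2
```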
3. Inferences in Multiple Linear Regression

• Now we treat multiple regression as inferential statistics (cf. descriptive statistics).
  – Data: a random sample taken from a population.
• We model the (unknown, stochastic) linear relationship in the population by the equation:

  E(Y) = α + β1X1 + β2X2 + ... + βkXk

  – Population parameters, denoted by the Greek letters α and βi, are estimated by the sample statistics a and bi, computed by the least squares method (point estimates).
  – Under a certain set of assumptions, the point estimates follow the normal distribution (or the t distribution), with the corresponding population parameter as the mean.
  – Inferences about the population parameters then become possible, based on the properties of the sampling distribution.
Assumptions for standard multiple regression (discussed later)

1. There is a linear relationship between the dependent variable and the set of independent variables.
2. The variation in the errors is the same for both large and small values of Y. That is, the spread of the errors does not depend on whether the estimated Y is large or small.
3. The errors follow the normal probability distribution, with mean 0.
4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated.
5. The errors are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.
(1) Global Test: Testing the Multiple Regression Model

• We test the ability of the independent variables, X1, X2, ..., Xk, to explain the change in the dependent variable Y.
• The 'global test' is used to investigate whether any of the independent variables have significant coefficients.
• The hypotheses are:

  H0: β1 = β2 = ... = βk = 0
  H1: Not all βs equal 0
Global Test continued

• The test statistic (from the ANOVA table) follows the F distribution with k (the number of independent variables) and n − k − 1 degrees of freedom, where n is the sample size.
  – The heating-cost example follows the F distribution with 3 and 16 degrees of freedom (n = 20).
• Decision rule: reject H0 if F > F(α, k, n−k−1).
Finding the Critical F

[F table: with α = 0.05 and 3, 16 degrees of freedom, the critical value is 3.24.]

Finding the Computed F

[From the ANOVA table, F = MSR/MSE = 21.90.]
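Both the critical and the computed F can be reproduced directly; a sketch with scipy, using the ANOVA sums of squares given earlier:

```python
from scipy import stats

n, k, alpha = 20, 3, 0.05
ssr, ss_total = 171_220.0, 212_916.0
sse = ss_total - ssr

f_stat = (ssr / k) / (sse / (n - k - 1))        # about 21.90
f_crit = stats.f.ppf(1 - alpha, k, n - k - 1)   # about 3.24

print(f"F = {f_stat:.2f}, critical F = {f_crit:.2f}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```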
Interpretation

• The computed value of F is 21.90, which is in the rejection region.
• The null hypothesis that all the regression coefficients are zero is rejected.
• Interpretation: some of the independent variables (temperature, insulation, furnace age) do have the ability to explain the variation in the dependent variable (heating cost).
• Logical question: which ones?
(2) Evaluating Individual Regression Coefficients (Test whether βi = 0)

• This test is used to determine which independent variables have non-zero regression coefficients.
• The variables with zero regression coefficients are usually dropped from the analysis.
• The test statistic follows the t distribution with n − (k + 1) degrees of freedom.
• The hypothesis test is as follows:

  H0: βi = 0
  H1: βi ≠ 0
  Reject H0 if t > t(α/2, n−k−1) or t < −t(α/2, n−k−1), where t = (bi − 0) / s(bi)
Critical t-stat for the Slopes

[t table: with α = 0.05 and df = 16, the critical values are −2.120 and 2.120.]
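The critical values can be reproduced with scipy; the computed t for any slope is then bi divided by its standard error. In this sketch the standard error shown is an assumed illustrative value, not taken from the output:

```python
from scipy import stats

n, k, alpha = 20, 3, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)   # df = 16
print(f"Reject H0 if t < {-t_crit:.3f} or t > {t_crit:.3f}")   # +/- 2.120

b_temp = -4.583     # temperature slope from the regression output
s_b = 0.772         # assumed standard error, for illustration only
t_stat = (b_temp - 0) / s_b
print(f"t = {t_stat:.2f} -> "
      f"{'reject H0' if abs(t_stat) > t_crit else 'fail to reject H0'}")
```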
Computed t-stat for the Slopes

[Minitab output: the computed t-statistic for each slope.]

Conclusion on Significance of Slopes

[The t-statistics for temperature and insulation fall in the rejection region; the t-statistic for furnace age does not, so age is insignificant.]
New Regression without Variable "Age" (delete the insignificant independent variable)

New Regression Model without Variable "Age" – Minitab

[Minitab output for the two-variable model using temperature and insulation.]

Testing the New Model for Significance

[ANOVA output and global F test for the new model.]
Critical t-stat for the New Slopes

Reject H0 if:
  t > t(α/2, n−k−1) or t < −t(α/2, n−k−1), where t = (bi − 0) / s(bi)

With n = 20 and k = 2, the degrees of freedom are 20 − 2 − 1 = 17, so reject H0 if:
  (bi − 0) / s(bi) > t(.025, 17) = 2.110 or (bi − 0) / s(bi) < −2.110
Conclusion on Significance of New Slopes

[Minitab output: the computed t-statistics for the two new slopes compared with ±2.110.]
Procedure for adjusting the regression equation

1. Conduct the global test.
   – Check whether the whole regression equation has some explanatory power.
2. Conduct tests for the individual coefficients.
   – Check the significance of the individual explanatory variables.
3. Rerun the regression after deleting one insignificant independent variable.
   – Delete the one with the lowest absolute t value (largest p-value) when two or more explanatory variables are insignificant.
   – Start a new round of tests.
4. Evaluating the Assumptions of Multiple Regression

The validity of the previous tests relies on several assumptions:
1. There is a linear relationship.
2. The variation in the errors is the same for both large and small values of Y.
3. The errors follow the normal probability distribution, with mean 0.
4. The independent variables should not be correlated.
5. The errors are independent.

Most of these assumptions concern the error terms.
Analysis of Residuals

• A residual is the difference between the actual value of Y and the predicted value of Y.
• Residuals are used to check the assumptions on the error term.
• Residual plot: a plot of the residuals against their corresponding fitted values (Ŷ) is used to show whether there are trends or patterns in the residuals.
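A residual plot is easy to produce once a model is fitted. A sketch with synthetic data (matplotlib; the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample: 20 homes, two predictors, plus random error.
rng = np.random.default_rng(0)
A = np.column_stack([np.ones(20),
                     rng.uniform(20, 60, 20),    # temperature
                     rng.uniform(2, 12, 20)])    # insulation
y = A @ np.array([400.0, -4.5, -15.0]) + rng.normal(0, 50, 20)

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted value (Yhat)")
plt.ylabel("Residual (Y - Yhat)")
plt.title("Residual plot: look for the absence of trends or patterns")
plt.show()
```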
(1) Linear Relationship: Scatter Diagram

[Scatter diagrams of Y against each independent variable give a first check on linearity.]

Linear Relationship: Residual Plot

[A residual plot with no visible pattern supports the linearity assumption.]
(2) Same variation in errors for large and small values of Y

• Homoscedasticity: the variation around the regression equation is the same for all values of the independent variables.
• Example of a violation of this assumption: salary is regressed on the age of the worker.
• The residual plots (in the previous slide) are used as a preliminary check on this assumption.
(3) Distribution of Residuals: normal distribution?

Histograms (and stem-and-leaf charts) are useful in checking this assumption. Both MINITAB and Excel offer a graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram.
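Both plots can be drawn in a few lines; a sketch using scipy's probplot on stand-in residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 50, 20)   # stand-in residuals, for illustration

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(residuals, bins=6)
ax1.set_title("Histogram of residuals")
stats.probplot(residuals, dist="norm", plot=ax2)   # normal probability plot
ax2.set_title("Normal probability plot")
plt.show()
```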
(4) Multicollinearity

• Multicollinearity exists when the independent variables (X's) are correlated.
• Correlated independent variables make it difficult to draw inferences about the individual regression coefficients and their individual effects on the dependent variable Y.
  – It may cause an unexpected sign, or make an important independent variable have an insignificant coefficient.
  – We need to select the independent variables carefully.
  – In reality, it is very difficult to get rid of the multicollinearity problem fully.
How to check multicollinearity: Variance Inflation Factor

• A general rule is that if the correlation between two independent variables is between −0.7 and 0.7, there likely is not a problem in using the two independent variables.
• A more precise test is to use the variance inflation factor (VIF):

  VIF = 1 / (1 − R²j)

• The term R²j refers to the coefficient of determination from a regression in which the selected independent variable is used as the dependent variable and the remaining independent variables are used as the independent variables.
• A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
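The definition translates directly into code: regress each Xj on the remaining X's and apply VIF = 1/(1 − R²j). A sketch with synthetic predictors:

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF for column j of X: regress X[:, j] on the other columns,
    take that regression's R^2_j, and return 1 / (1 - R^2_j)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])      # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2_j = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2_j)

# Synthetic stand-ins for temperature, insulation, and furnace age.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
for j, name in enumerate(["temperature", "insulation", "age"]):
    print(f"VIF({name}) = {vif(X, j):.2f}")
```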
Multicollinearity – Example

Refer to the data in the table, which relates heating cost to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem of multicollinearity?
Correlation Matrix

A correlation matrix is used to show all possible simple correlation coefficients among the variables.
– The matrix is useful for locating correlated independent variables.
– It also shows how strongly each independent variable is correlated with the dependent variable.
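With the data in a two-dimensional array (one row per home, one column per variable), the matrix comes from a single call; a sketch with stand-in data:

```python
import numpy as np

# Stand-in data for illustration: columns are cost, temp, insul, age.
rng = np.random.default_rng(3)
data = rng.normal(size=(20, 4))
names = ["cost", "temp", "insul", "age"]

corr = np.corrcoef(data, rowvar=False)   # all pairwise simple correlations
print("       " + " ".join(f"{n:>6}" for n in names))
for name, row in zip(names, corr):
    print(f"{name:>6} " + " ".join(f"{v:6.2f}" for v in row))
```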
Correlation Matrix - Minitab

[Minitab correlation matrix for heating cost, temperature, insulation, and furnace age.]
VIF – Minitab Example

[Minitab regression of temperature on the other independent variables gives the coefficient of determination R²j used in the VIF.]

The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.
(5) Independence Assumption

• The fifth assumption about regression analysis is that successive residuals should be independent.
  – There is no pattern to the residuals.
• When successive residuals are correlated, we refer to this condition as autocorrelation.
  – Autocorrelation frequently occurs when the data are collected over a period of time (time-series data).
  – A test for autocorrelation, called the Durbin-Watson test, is introduced in Ch. 16.
Residual Plot versus Fitted Values

• The graph below shows the residuals plotted on the vertical axis and the fitted values on the horizontal axis.
• Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.
Qualitative Independent Variables

• Frequently we need to use nominal-scale variables in our analysis; these are called qualitative variables.
  – Examples: gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team.
• To use a qualitative variable in regression analysis, we use a scheme of dummy variables.
  – Such a variable takes one of two possible values, either 0 or 1.
Qualitative Variable - Example

Suppose in the Salsberry Realty (heating cost) example that the independent variable "garage" is added. For homes without an attached garage, 0 is used; for homes with an attached garage, 1 is used. We will refer to this as the "garage" variable; the data from Table 14-2 are used.

Qualitative Variable - Minitab

[Minitab output for the regression that includes the garage dummy variable.]
Using the Model for Estimation

What is the effect of the garage variable? Suppose we have two houses exactly alike next to each other in Boston; one has an attached garage, and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Boston is 20 degrees.

For the house without an attached garage, a 0 is substituted for the garage variable in the regression equation. The estimated heating cost is $280.90 (without garage).

For the house with an attached garage, a 1 is substituted for the garage variable in the regression equation. The estimated heating cost is $358.30 (with garage).
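Note that the $77.40 difference is exactly the garage coefficient: the two houses share every other value, so all other terms cancel when the two predictions are subtracted. A quick check:

```python
cost_no_garage = 280.90   # prediction with the garage dummy = 0
cost_garage = 358.30      # prediction with the garage dummy = 1

# Intercept, temperature, and insulation terms are identical for the two
# houses, so the difference is b_garage * (1 - 0) = b_garage.
b_garage = cost_garage - cost_no_garage
print(f"Garage coefficient = ${b_garage:.2f}")   # $77.40
```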
Testing the Model for Significance

• We have shown the difference between the two types of homes to be $77.40, but is the difference significant?
• We conduct the following test of hypothesis for the dummy variable, as before:

  H0: βi = 0
  H1: βi ≠ 0
  Reject H0 if t > t(α/2, n−k−1) or t < −t(α/2, n−k−1)
Evaluating Individual Regression Coefficients (βi = 0)

• This test determines whether any independent variables have nonzero regression coefficients.
  – In this case, consider the coefficient for the garage variable.
  – The test statistic follows the t distribution with n − (k + 1), i.e., n − k − 1, degrees of freedom.
• The hypothesis test is as follows:

  H0: βi = 0
  H1: βi ≠ 0
  Reject H0 if t > t(α/2, n−k−1) or t < −t(α/2, n−k−1)
Reject H0 if:
  t > t(α/2, n−k−1) or t < −t(α/2, n−k−1), where t = (bi − 0) / s(bi)

With n = 20 and k = 3, the degrees of freedom are 20 − 3 − 1 = 16, so reject H0 if:
  (bi − 0) / s(bi) > t(.025, 16) = 2.120 or (bi − 0) / s(bi) < −2.120
Conclusion: The regression coefficient is not zero. The independent
variable garage should be included in the analysis.
Stepwise Regression
The advantages of the stepwise method are:
1. Only independent variables with significant regression
coefficients are entered into the equation.
2. The steps involved in building the regression equation are clear.
3. It is efficient in finding the regression equation with only
significant regression coefficients.
4. The changes in the multiple standard error of estimate and the
coefficient of determination are shown.
Stepwise Regression – Minitab Example

The stepwise MINITAB output for the heating cost problem follows. Temperature is selected first: this variable explains more of the variation in heating cost than any of the other three proposed independent variables. Garage is selected next, followed by Insulation.

[Minitab stepwise regression output.]
Regression Models with Interaction

• In Chapter 12 (ANOVA), interaction among independent variables was discussed. Suppose we are studying weight loss and assume that diet and exercise are related. The dependent variable is the amount of change in weight, and the independent variables are diet (yes or no) and exercise (none, moderate, significant). We are interested in whether there is interaction among the independent variables: if subjects maintain their diet and exercise significantly, will that increase the weight loss? Is the total weight loss more than the sum of the loss due to the diet effect and the loss due to the exercise effect?
• In regression analysis, interaction can be examined as a separate independent variable. An interaction prediction variable can be developed by multiplying the data values of one independent variable by the values of another, creating a new independent variable. A two-variable model that includes an interaction term is:

  Ŷ = a + b1X1 + b2X2 + b3X1X2
Regression Models with Interaction – Example

Refer to the heating cost example. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of the savings from a warmer temperature and the savings from increased insulation separately?
Regression Models with Interaction – Example

Creating the Interaction Variable: using the information from the table in the previous slide, an interaction variable is created by multiplying the temperature by the insulation. For the first sampled home, the temperature is 35 degrees and the insulation is 3 inches, so the value of the interaction variable is 35 × 3 = 105. The values of the other interaction products are found in a similar fashion.

[Table: the sample data with the added temperature × insulation column.]
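Building the new column is a single elementwise product; a sketch (only the first home's values are from the text, the rest are illustrative):

```python
import numpy as np

temperature = np.array([35.0, 29.0, 36.0, 60.0, 65.0])    # degrees
insulation = np.array([3.0, 4.0, 7.0, 6.0, 5.0])          # inches

interaction = temperature * insulation    # first home: 35 * 3 = 105
print(interaction)                        # [105. 116. 252. 360. 325.]

# The product is then added as one more independent variable:
# Yhat = a + b1*temp + b2*insul + b3*(temp * insul)
```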
Regression Models with Interaction – Example

The regression equation is shown in the software output. Is the interaction variable significant at the 0.05 significance level?
There are other situations that can occur when studying interaction among independent variables:
1. It is possible to have a three-way interaction among the independent variables. In the heating example, we might have considered the three-way interaction between temperature, insulation, and age of the furnace.
2. It is possible to have an interaction where one of the independent variables is qualitative. In our heating cost example, we could have studied the interaction between temperature and garage.
End of Chapter 14