Using and Applying Multiple Regression Analysis

Faculty Research Workshop
February 19, 2014
Tom Lehman, Ph.D.
Professor, Department of Economics
Indiana Wesleyan University
 Introduction to Correlation Analysis and OLS
 OLS Multiple Regression Analysis Basics
 OLS Hierarchical Multiple Regression
• Definition, Purposes and Technique
 Examples of OLS Hierarchical Modeling
 Q&A
 References
 Hypothesis testing in correlation analysis is designed to investigate whether two or more variables are correlated, or co-vary together, and whether this covariance is statistically significant.
 Examples:
• What is the relationship between housing costs and monthly rental prices in urban housing markets?
• What is the relationship between GDP growth and growth in capital investment expenditures in the macroeconomy?
• What is the relationship between education levels and the unemployment rate in an urban economy?
• Does advertising expenditure increase sales revenues? If we spend $100,000 on ad costs, what are predicted gross sales?
• What is the relationship between county levels of educational attainment and county household income?
 “Bivariate” = two variables
• Exploring the relationship between a dependent variable (DV) and a single independent variable (IV)
• Ideally both variables should be interval or ratio level (continuous)
• Categorical (nominal or ordinal) variables do not work as well in regression analysis (exception: dummy variables in multiple regression)
 Dependent Variable (DV):
• The variable (measurable data used to operationalize a concept) that is thought to “depend upon” or be influenced by another
• The variable whose value is being predicted or estimated
 Independent Variable (IV):
• The variable that is hypothesized to influence the behavior of the DV
• The IV is sometimes referred to as a “predictor” variable; it may “predict” the behavior of the DV
• We use the values of an IV to predict or estimate the value of the DV in the regression equation: Y′ = a + bX
[Figure: scatter plot of the DV ‘Y’ against the IV ‘X’. The “best fit” line (a.k.a. the predicted regression line) assumes a linear relationship; it traces a path through the scatter plot that is, on average, equidistant from each data point.]
[Figure: scatter plot of the DV ‘Y’ against the IV ‘X’. OLS regression attempts to minimize the sum of the squared distances between the observed data points and the regression line of predicted values.]
 A computed value between -1.00 and +1.00 that measures the strength of association between X (IV) and Y (DV)
 The closer the value of Pearson’s r is to ±1.00, the stronger the association
• A value of -1.00 is a perfect negative correlation
• A value of +1.00 is a perfect positive correlation
• A value of 0 indicates no correlation
 A positive Pearson’s r = a positive or direct relationship
 A negative Pearson’s r = a negative or inverse relationship
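As a minimal illustration (with hypothetical data, not the workshop’s), Pearson’s r can be computed directly in Python with NumPy:

```python
import numpy as np

# Hypothetical data: an IV (X) and a DV (Y) measured on the same 10 cases
x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20, 21], dtype=float)
y = np.array([9.5, 11.0, 14.2, 13.1, 16.8, 19.5, 18.2, 22.4, 25.0, 27.3])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson's r = {r:+.3f}")  # sign gives direction; magnitude gives strength
```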
[Figure: scatter plot divided into quadrants at the mean of the X variable and the mean of the Y variable, illustrating a positive correlation. An X value above the X mean correlates with a Y value above the Y mean (upper right quadrant), and an X value below the X mean correlates with a Y value below the Y mean (lower left quadrant); both lead to a “+” coefficient, with very few outliers in the opposing two quadrants.]
[Figure: scatter plot divided into quadrants at the mean of the X variable and the mean of the Y variable, illustrating a negative correlation. An X value below the X mean correlates with a Y value above the Y mean (upper left quadrant), and an X value above the X mean correlates with a Y value below the Y mean (lower right quadrant); both lead to a “-” coefficient, with very few outliers in the opposing two quadrants.]
 The coefficient of determination is the squared value of Pearson’s r, expressed as an absolute-value (positive) percentage
 Pearson’s R² is a measure of the percent of variation in the DV explained (or accounted for) by the variation in the IV
 Example: If r = +0.849, then R² = 0.721
• Interpretation: Roughly 72.1% of the variation in the DV can be explained by the variation (changes) in the IV
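A one-line check of the slide’s arithmetic, using the example value of r above:

```python
r = 0.849
print(f"R^2 = {r ** 2:.3f}")  # 0.721; a negative r of the same magnitude yields the same R^2
```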
 The regression equation: Y′ = a + bX
• Regression analysis and the regression equation are used to predict the best-fit regression line from the X-Y data
• Simply hand-drawing a best-fit line through a scatter plot is subjective and unreliable
• We need a precise statistical method to estimate the true best-fit regression line
• Estimated Y value (Y′) = Y-intercept + slope × (given value of X)
 Least-squares principle
• The best-fit regression line is statistically estimated by minimizing the sum of the squared vertical differences between the actual Y values (Y) and the predicted Y values (Y′)
• Minimizing the distances between the best-fit line (Y′) and the actual values of Y
• Minimizing Σ(Y − Y′)²
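A short sketch of the least-squares computation on hypothetical data, using the textbook formulas for the slope b and intercept a:

```python
import numpy as np

# Hypothetical X-Y data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])

# Least-squares estimates: b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), a = ybar - b * xbar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

y_hat = a + b * x                # predicted values Y'
sse = np.sum((y - y_hat) ** 2)   # sum of (Y - Y')^2, the quantity OLS minimizes
print(f"Y' = {a:.3f} + {b:.3f}X, SSE = {sse:.4f}")
```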
 “Multiple” = more than two variables (more precise and more thorough than simple bivariate regression analysis)
• Exploring the relationship between a dependent variable (DV) and two or more independent variables (IVs)
• Variables must be interval or ratio level (continuous)
 Dependent Variable (DV):
• The variable (measurable data used to operationalize a concept) that is thought to “depend upon” or be influenced by another
• The variable whose value is being predicted or estimated
 Independent Variables (IVs):
• The variables that are, together, hypothesized to influence the behavior of the DV
• The IVs are sometimes referred to as “predictor” variables; together, they may “predict” the behavior of the DV
• We use the values of the IVs to predict or estimate the value of the DV in the regression equation: Y′ = a + b₁X₁ + b₂X₂ + … + bₙXₙ
 Multiple regression analysis allows us to investigate the relationship or correlation between several IVs and a continuous DV while controlling for the effects of all the other IVs in the regression equation
 In other words, we can observe the impact of a single IV on the DV while controlling for the effects of several other IVs simultaneously
 Multiple regression allows us to “hold constant” the other IVs in the equation so that we can analyze the impact of each IV on the DV “net” of the disturbances of other factors
 See Grimm and Yarnold, 1995; Gujarati, 1995; Kennedy, 2008; Tabachnick and Fidell, 2012
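The workshop demonstrations use SPSS; as a rough Python equivalent (simulated data, statsmodels assumed available), a standard multiple regression looks like this, with each slope read as the effect of one IV holding the others constant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated IVs, deliberately correlated, plus a continuous DV
education = rng.normal(14, 2, n)                     # hypothetical years of schooling
labor_force = 0.5 * education + rng.normal(0, 1, n)  # hypothetical participation index
income = 5 + 2.0 * education + 1.5 * labor_force + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([education, labor_force]))  # adds the intercept a
fit = sm.OLS(income, X).fit()

print(fit.params)    # [a, b1, b2]: each b is "net" of the other IV
print(fit.rsquared)  # share of DV variance explained by the IVs jointly
```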
 For each value of X (IV) there is a group of Y (DV) values, and these values must be normally distributed
 The means of these Y values lie on the predicted regression line
 The DV must be a continuous variable (ratio or interval), not categorical
 The relationship between the DV and all IVs must be linear, not curvilinear
 The mean of the “residuals” (Y − Y′) must equal 0
 The DV must be statistically independent, with no autocorrelation (i.e., the DV cannot be autocorrelated with successive observations of itself; one DV value cannot have influenced another, as often occurs in time-series data)
 Homoscedasticity: the variance of the Y − Y′ residuals must be equal over the entire range of the predicted regression line; it must be the same for all values of X and cannot be heteroscedastic (Kennedy, 2008)
 Multiple IVs included in the regression model cannot suffer from multicollinearity with each other
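A hedged sketch of how these last two assumptions might be checked in Python (simulated data; the Breusch-Pagan test and VIF are standard diagnostics, not techniques named on the slide):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)   # deliberately correlated with x1
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.resid.mean())              # ~0 by construction in OLS

# Homoscedasticity: a small Breusch-Pagan p-value suggests heteroscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")

# Multicollinearity: VIF per column (values above ~10 are a common warning sign)
for i, name in enumerate(["const", "x1", "x2"]):
    print(name, variance_inflation_factor(X, i))
```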
[Figure: scatter plot of the DV ‘Y’ against the IV ‘X’ illustrating heteroscedasticity. The error terms or residuals (Y − Y′) are not equal along the entire regression line: as the value of the IV increases, the Y − Y′ residuals get larger and larger, and the data points “fan out” wider about the regression line.]
 “Standard” or “Simultaneous” Multiple Regression Technique
• All IVs are entered into the model simultaneously; reveals only the unique effects of each IV on the DV
• A single model is constructed with all IVs included at the same time
 “Hierarchical” or “Sequential” Multiple Regression Technique
• “Sets” of IVs are entered into each regression model systematically, perhaps one by one
• Allows the analyst to determine how much additional variance in the DV (R²) is explained by adding consecutive additional IVs in a systematic pattern
• Multiple regression models are generated, with each successive model containing more IVs than the previous models (Grimm and Yarnold, 1995)
 The first regression estimation (model) contains one or more predictors
 The next estimation (model) adds one or more new predictors to those used in the first analysis
 The change in R² between consecutive models represents the proportion of variance in the DV shared exclusively with the newly entered variables
 Caution: the partial coefficients in each consecutive model are not directly comparable to one another
• The impact of the variables entered in earlier steps is partialed out of correlations involving variables entered in later steps (Grimm and Yarnold, 1995)
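A minimal sketch of this sequential logic in Python (simulated data, statsmodels assumed): fit nested models and read off the R² change contributed by the newly entered block. In SPSS this corresponds to entering the IVs in successive blocks and requesting the R-squared-change statistic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
block1 = rng.normal(size=(n, 1))    # predictor entered at step 1
block2 = rng.normal(size=(n, 2))    # predictors added at step 2
y = 1 + 2 * block1[:, 0] + block2 @ np.array([0.5, -0.8]) + rng.normal(size=n)

# Model 1: first block only
m1 = sm.OLS(y, sm.add_constant(block1)).fit()

# Model 2: first block plus the newly entered predictors
m2 = sm.OLS(y, sm.add_constant(np.column_stack([block1, block2]))).fit()

# Delta R^2: additional DV variance explained exclusively by the new variables
print(f"R2 step 1 = {m1.rsquared:.3f}, R2 step 2 = {m2.rsquared:.3f}")
print(f"R2 change = {m2.rsquared - m1.rsquared:.3f}")
```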
 Standard or simultaneous multiple regression estimation on a set of IVs
 SPSS (Sweet and Grace-Martin, 2011)
 Variables:
• DV: county median household income, 2011
• IVs:
 Geographic border to metro county
 Educational attainment measures
 Interstate highway density
 Population density
 Labor force participation rate
 Hierarchical multiple regression estimation using the same IVs, entered consecutively or sequentially
 Variables:
• DV: county median household income, 2011
• IVs:
 Geographic border to metro county
 Educational attainment measures
 Interstate highway density
 Population density
 Labor force participation rate
 Hierarchical multiple regression estimation using a different model and variables
 Variables:
• DV: county per capita income, 2011
• IVs:
 County workforce composition: farming and professional
 Water amenities: square miles of county water area
 Taxation: aggregate R/E taxes per capita
 Immigration: county share of foreign-born population
 Berry, W.D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage Publications.
 Grimm, L.G. and Yarnold, P.R. (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.
 Gujarati, D.N. (1995). Basic Econometrics, 3rd edition. New York, NY: McGraw-Hill/Irwin.
 Kennedy, P. (2008). A Guide to Econometrics, 6th edition. New York, NY: Wiley-Blackwell.
 Lind, D.A., Marchal, W.G., and Wathen, S. (2011). Statistical Techniques in Business and Economics, 15th edition. Boston, MA: McGraw-Hill/Irwin.
 Sweet, S.A. and Grace-Martin, K. (2011). Data Analysis With SPSS: A First Course in Applied Statistics, 3rd edition. Boston, MA: Allyn and Bacon.
 Tabachnick, B.G. and Fidell, L.S. (2012). Using Multivariate Statistics, 6th edition. Boston, MA: Allyn and Bacon.
 Cimasi, R.J., Sharamitaro, A.P., and Seiler, R.L. (2013). The association between health literacy and preventable hospitalizations in Missouri: Implications in an era of reform. Journal of Health Care Finance, 40(2), 1-16.
 Ruiz-Palomino, P., Saez-Martinez, F.J., and Martinez-Canas, R. (2013). Understanding pay satisfaction: Effects of supervisor ethical leadership on job motivating potential influence. Journal of Business Ethics, 118, 31-43.
 Tartaglia, S. (2013). Different predictors of quality of life in urban environment. Social Indicators Research, 113, 1045-1053.
 Zhoutao, C., Chen, J., and Song, Y. (2013). Does total rewards reduce core employees’ turnover intention? International Journal of Business and Management, 8(20), 62-75.