Faculty Research Workshop
February 19, 2014
Tom Lehman, Ph.D.
Professor, Department of Economics
Indiana Wesleyan University

Introduction to Correlation Analysis and OLS
OLS Multiple Regression Analysis Basics
OLS Hierarchical Multiple Regression
• Definition, Purposes and Technique
Examples of OLS Hierarchical Modeling
Q&A
References

Correlation analysis hypothesis testing is designed to investigate if two or more variables are correlated or co-vary together, and if this covariance is statistically significant.

Examples:
• What is the relationship between housing costs and monthly rental prices in urban housing markets?
• What is the relationship between GDP growth and growth in capital investment expenditures in the macroeconomy?
• What is the relationship between education levels and the unemployment rate in an urban economy?
• Does advertising expenditure increase sales revenues? If we spend $100,000 on ad costs, what are predicted gross sales?
• What is the relationship between county levels of educational attainment and county household income?

“Bivariate” = two variables
• Exploring the relationship between a dependent variable (DV) and a single independent variable (IV)
• Ideally both variables should be interval or ratio level (continuous)
• Categorical (nominal or ordinal) variables do not work as well in regression analysis (exception: dummy variables in multiple regression)

Dependent Variable (DV):
• The variable (measurable data used to operationalize a concept) that is thought to “depend upon” or be influenced by another
• The variable the value of which is being predicted or estimated

Independent Variable (IV):
• The variable that is thought to (hypothesized to) influence the behavior of the DV.
• The IV is sometimes referred to as a “predictor” variable; it may “predict” the behavior of the DV
• We utilize the values of an IV to predict or estimate the value of a DV in a regression equation: $Y' = a + bX$

[Figure: scatter plot with best-fit line; vertical axis: values of the DV (Y); horizontal axis: values of the IV (X)]

The “best fit” line (a.k.a. the predicted regression line) assumes a linear relationship; it traces a path through the scatter plot that is, on average, as close as possible to all of the data points.

[Figure: scatter plot with predicted regression line; vertical axis: values of the DV (Y); horizontal axis: values of the IV (X)]

OLS regression attempts to minimize the sum of the squared vertical distances between the observed data points and the predicted regression line.

Pearson’s r: a computed value between -1.00 and +1.00 that measures the strength of association between X (IV) and Y (DV)
• The closer the value of Pearson’s r to ±1.00, the stronger the association
• A value of -1.00 is a perfect negative correlation
• A value of +1.00 is a perfect positive correlation
• A value of 0 indicates no linear correlation
• A positive Pearson’s r = positive or direct relationship
• A negative Pearson’s r = negative or inverse relationship

[Figure: positive correlation; scatter plot of Y against X, divided into quadrants by the mean of the X variable and the mean of the Y variable]
• An X value above the X mean correlates with a Y value above the Y mean in the upper right quadrant (leads to “+” coefficient)
• An X value below the X mean correlates with a Y value below the Y mean in the lower left quadrant (leads to “+” coefficient)
• Very few outliers in the opposing two quadrants

[Figure: negative correlation; scatter plot of Y against X, divided into quadrants by the mean of the X variable and the mean of the Y variable]
• An X value below the X mean correlates with a Y value above the Y mean in the upper left quadrant (leads to “-” coefficient)
• An X value above the X mean correlates with a Y value below the Y mean in the lower right quadrant (leads to “-” coefficient)
• Very few outliers in the opposing two quadrants

The coefficient of determination ($R^2$) is the squared value of Pearson’s r, expressed as a positive (absolute value) percentage
• $R^2$ is a measure of the percent of variation in the DV explained (or accounted for) by the variation in the IV
• Example: If r = +0.849, then $R^2$ = 0.721
• Interpretation: Roughly 72.1% of the variation in the DV can be explained by the variation (changes) in the IV

The regression equation: $Y' = a + bX$
• Regression analysis and the regression equation are used to predict the best-fit regression line from the X-Y data
• Simply hand-drawing a best-fit line through a scatter plot is subjective and unreliable
• We need to use a precise statistical method to estimate the true best-fit regression line
• Estimated Y value (Y') = Y-intercept + slope × (given value of X)

Least-squares principle
• The best-fit regression line is statistically estimated by minimizing the sum of the squared vertical differences between the actual Y values (Y) and the predicted Y values (Y')
• Minimizing the distances between the best-fit line (Y') and the actual values of Y
• Minimizing $\sum (Y - Y')^2$

“Multiple” = more than two variables (more precise, more thorough than simple bivariate regression analysis)
• Exploring the relationship between a dependent variable (DV) and two or more independent variables (IVs)
• Variables must be interval or ratio level (continuous)

Dependent Variable (DV):
• The variable (measurable data used to operationalize a concept) that is thought to “depend upon” or be influenced by another
• The variable the value of which is being predicted or estimated

Independent Variables (IVs):
• The variables that are, together, thought to (hypothesized to) influence the behavior of the DV.
• The IVs are sometimes referred to as “predictor” variables; together, they may “predict” the behavior of the DV
• We utilize the values of IVs to predict or estimate the value of a DV in a regression equation: $Y' = a + b_1X_1 + b_2X_2 + \dots + b_nX_n$

Multiple regression analysis allows us to investigate the relationship or correlation between several IVs and a continuous DV while controlling for the effects of all the other IVs in the regression equation
• In other words, we can observe the impact of a single IV on a DV while controlling for the effects of several other IVs simultaneously
• Multiple regression allows us to “hold constant” the other IVs in the equation so that we can analyze the impact of each IV on the DV “net” of the disturbances of other factors
• See Grimm and Yarnold, 1995; Gujarati, 1995; Kennedy, 2008; Tabachnick and Fidell, 2012

• For each value of X (IV) there is a group of Y (DV) values, and these values must be normally distributed
• The means of these Y values lie on the predicted regression line
• The DV must be a continuous variable (ratio or interval), not categorical
• The relationship between the DV and all IVs must be linear, not curvilinear
• The mean of the “residuals” ($Y - Y'$) must equal 0
• Observations of the DV must be statistically independent, with no autocorrelation (i.e., the DV cannot be autocorrelated with successive observations of itself; one of the DV values cannot have influenced another of the DV values, such as often occurs in time-series data)
• Homoscedasticity: the spread of the $Y - Y'$ residuals must be equal over the entire range of the predicted regression line; it must be the same for all values of X and cannot be heteroscedastic (Kennedy, 2008)
• Multiple IVs included in the regression model cannot suffer from multicollinearity with each other

Heteroscedasticity: the error terms or residuals $Y - Y'$ are not equal along the entire regression line. As the value of the IV increases, the $Y - Y'$ residuals get larger and larger, and the data points “fan out” wider about the regression line.
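As a rough illustration of the fan-out pattern described above, the following Python sketch fits an OLS line to synthetic data whose error spread grows with X, then compares the residual spread in the lower and upper halves of the X range. Everything in it is an assumption made for illustration and is not part of the workshop material (which uses SPSS): the use of NumPy, the synthetic data, the random seed, the "true" coefficients 2.0 and 3.0, and the hypothetical helper names `fit_ols` and `residual_spread_by_half`.

```python
import numpy as np


def fit_ols(X, y):
    """Return [a, b1, ..., bn], the coefficients minimizing sum((Y - Y')^2)."""
    # Prepend a column of ones so the first coefficient is the intercept a.
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def residual_spread_by_half(x, residuals):
    """Standard deviation of the residuals for the lower and upper halves of x."""
    order = np.argsort(x)
    half = len(x) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return low.std(), high.std()


# Synthetic, illustrative data (an assumption, not the workshop's data):
# the error spread is proportional to x, so the points "fan out".
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n) * x

a, b = fit_ols(x, y)
residuals = y - (a + b * x)  # Y - Y'
spread_low, spread_high = residual_spread_by_half(x, residuals)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"mean residual = {residuals.mean():.2e}")
print(f"residual SD, low X = {spread_low:.2f}; high X = {spread_high:.2f}")
```

In this sketch the mean residual is zero up to rounding error (a property of OLS with an intercept), while the residual spread in the upper half of the X range exceeds that in the lower half, which is the heteroscedastic pattern. Comparing the two halves is only an informal check, not a formal test.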
[Figure: scatter plot of heteroscedastic data; vertical axis: values of the DV (Y); horizontal axis: values of the IV (X)]

“Standard” or “Simultaneous” Multiple Regression Technique
• All IVs are entered into the model simultaneously; reveals only the unique effects of each IV on the DV
• A single model is constructed with all IVs included at the same time

“Hierarchical” or “Sequential” Multiple Regression Technique
• “Sets” of IVs are entered into each regression model systematically, perhaps one by one
• Allows the analyst to determine how much additional variance in the DV ($R^2$) is explained by adding consecutive additional IVs in a systematic pattern
• Multiple regression models are generated with each successive model containing more IVs than the previous models (Grimm and Yarnold, 1995)

• The first regression estimation (model) contains one or more predictors
• The next estimation (model) adds one or more new predictors to those used in the first analysis
• The change in $R^2$ between consecutive models represents the proportion of variance in the DV shared exclusively with the newly entered variables

Caution: the partial coefficients on each consecutive model are not directly comparable to one another.
• The impact of the variables entered in earlier steps is partialed from correlations involving variables entered in later steps (Grimm and Yarnold, 1995)

Standard or simultaneous multiple regression estimation on a set of IVs
SPSS (Sweet and Grace-Martin, 2011)
Variables:
• DV: county median household income 2011
• IVs:
  • Geographic border to metro county
  • Educational attainment measures
  • Interstate highway density
  • Population density
  • Labor force participation rate

Hierarchical multiple regression estimation using same IVs entered consecutively or sequentially
Variables:
• DV: county median household income 2011
• IVs:
  • Geographic border to metro county
  • Educational attainment measures
  • Interstate highway density
  • Population density
  • Labor force participation rate

Hierarchical multiple regression estimation using different model and variables
Variables:
• DV: county per capita income 2011
• IVs:
  • County workforce composition: farming and professional
  • Water amenities: square miles of county water area
  • Taxation: aggregate R/E taxes per capita
  • Immigration: county share foreign-born population

Berry, W.D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage Publications.
Grimm, L.G. and Yarnold, P.R. (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.
Gujarati, D.N. (1995). Basic Econometrics, 3rd edition. New York, NY: McGraw-Hill/Irwin.
Kennedy, P. (2008). A Guide to Econometrics, 6th edition. New York, NY: Wiley-Blackwell.
Lind, D.A., Marchal, W.G. and Wathen, S. (2011). Statistical Techniques in Business and Economics, 15th edition. Boston, MA: McGraw-Hill/Irwin.
Sweet, S.A. and Grace-Martin, K. (2011). Data Analysis With SPSS: A First Course in Applied Statistics, 3rd edition. Boston, MA: Allyn and Bacon.
Tabachnick, B.G. and Fidell, L.S. (2012). Using Multivariate Statistics, 6th edition. Boston, MA: Allyn and Bacon.

Cimasi, R.J., Sharamitaro, A.P., and Seiler, R.L. (2013). The association between health literacy and preventable hospitalizations in Missouri: Implications in an era of reform. Journal of Health Care Finance, 40(2), 1-16.
Ruiz-Palomino, P., Saez-Martinez, F.J., and Martinez-Canas, R. (2013). Understanding pay satisfaction: Effects of supervisor ethical leadership on job motivating potential influence. Journal of Business Ethics, 118, 31-43.
Tartaglia, S. (2013). Different predictors of quality of life in urban environment. Social Indicators Research, 113, 1045-1053.
Zhoutao, C., Chen, J., and Song, Y. (2013). Does total rewards reduce core employees’ turnover intention? International Journal of Business and Management, 8(20), 62-75.