Correlational analyses in R

Introduction to Correlation
Andrew Johnson
Load Libraries
We will need the psych library for this lab.
library(psych)
The Data
The International Personality Item Pool (IPIP) is a public domain item pool that may be used freely to create
personality scales and measures that assess a variety of commonly used personality constructs. One such
construct is neuroticism, a personality variable that has been assessed using such well-known instruments
as Costa and McCrae’s NEO-PI-R, and Eysenck’s EPI.
The epi.bfi dataset included with the psych package contains data on the IPIP version of the NEO-PI-R
Neuroticism scale, as well as data collected using the EPI Neuroticism scale. Finally, it contains data on the
Beck Depression inventory (BDI), as well as state and trait anxiety measures. A detailed description of the
dataset may be seen by typing ?epi.bfi.
data("epi.bfi")
We can use this data to illustrate correlation calculations, in the context of a few important validity concepts,
namely:
• if the IPIP Neuroticism scale is significantly correlated with the EPI Neuroticism scale, this provides
evidence of concurrent validity
• if the IPIP Neuroticism scale is significantly correlated with the BDI, and with state or trait anxiety
measures, this provides evidence of convergent validity
Let’s start our analysis by setting up our key variables as separate objects in the environment. We don’t
have to do this, but it will save us some typing later on.
ipip.neur <- epi.bfi$bfneur
epi.neur <- epi.bfi$epiNeur
bdi <- epi.bfi$bdi
anx.state <- epi.bfi$stateanx
anx.trait <- epi.bfi$traitanx
Evaluating the Concurrent Validity of the IPIP Neuroticism Scale
Our first analysis evaluates the concurrent validity of the IPIP Neuroticism scale, by evaluating the extent to
which it predicts scores on the EPI Neuroticism scale. Let’s start by fitting a linear model to the data using
the lm function.
The lm function allows us to specify a formula for our regression equation, in the form lm(y ~ x). Because
we are interested in determining the extent to which the IPIP scale predicts scores on the EPI scale, we
would specify our model as follows:
mod1 <- lm(epi.neur ~ ipip.neur)
This model can now be evaluated using the summary function:
summary(mod1)
## 
## Call:
## lm(formula = epi.neur ~ ipip.neur)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9422 -2.8613 -0.2533  2.4649 12.3752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.1786     0.9831  -1.199    0.232    
## ipip.neur     0.1318     0.0108  12.195   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.824 on 229 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3911 
## F-statistic: 148.7 on 1 and 229 DF,  p-value: < 2.2e-16
The correlation between the IPIP and EPI measures of neuroticism is equal to the square-root of the r-squared
for this model, which is 0.6274552, and this relationship is statistically significant, p < 0.001. Alternatively,
you could evaluate the correlation between these two variables using the cor function.
cor(ipip.neur, epi.neur)
## [1] 0.6274718
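As a quick sanity check, the correlation can also be recovered directly from the model object. A minimal sketch (repeating the setup from above so that it runs on its own; the sign of the correlation comes from the slope coefficient):

```r
library(psych)                 # provides the epi.bfi dataset
data("epi.bfi")

ipip.neur <- epi.bfi$bfneur
epi.neur  <- epi.bfi$epiNeur
mod1      <- lm(epi.neur ~ ipip.neur)

# r is the square root of R-squared, signed by the slope coefficient
r.from.mod <- sign(coef(mod1)[["ipip.neur"]]) * sqrt(summary(mod1)$r.squared)
r.from.cor <- cor(ipip.neur, epi.neur)

all.equal(r.from.mod, r.from.cor)
```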
Visualizing the Relationship
We can visualize this relationship by plotting a scatterplot, and then adding a “line of best fit” by plotting
our model onto the graph, using the abline function:
plot(ipip.neur, epi.neur)
abline(mod1)
[Scatterplot of epi.neur (y-axis) against ipip.neur (x-axis), with the fitted regression line from mod1 overlaid]
Testing Assumptions
This visualization is important, of course, in order to test the assumptions of linear regression and correlation:
1) the relationship between X and Y is linear
2) the variability of observations about the line of best fit is constant (i.e., the assumption of homoscedasticity)
3) values are normally distributed about the line of best fit (i.e., the assumption of bivariate normality)
In the simple “two-variable case”, these assumptions may be addressed by visual inspection of the scatterplot.
As we will see in subsequent examples, however, visual inspection of a scatterplot becomes problematic as the
complexity of the model increases. Thus, it is helpful to use other methods of testing these assumptions that
don’t rely on a simple bivariate scatterplot, namely evaluation of the residuals, or the “errors” made within
the model.
Fortunately, R can produce a number of diagnostic plots (6, to be exact) quite simply, by applying the
plot function to the model object from our regression. Running the plot function without any additional
parameters provides us with plots 1, 2, 3, and 5. We are only going to look at plots 1, 2, and 5 in this lab:
• Plot 1: Residuals vs. Fitted
• Plot 2: Normal Q-Q plot for the standardized residuals
• Plot 5: Residuals versus Leverage
If we run the plot command with multiple plots specified, R will cycle through the plots one by one (as we
press the return key). To illustrate the purpose of each of these three plots, we will look at each plot
separately.
Evaluating the Assumptions of Linearity and Homoscedasticity
plot(mod1, which = 1)
[Residuals vs Fitted plot for lm(epi.neur ~ ipip.neur): residuals plotted against fitted values, with observations 6, 12, and 138 labelled]
We can address our first assumption (linearity) with a plot of the residuals versus fitted values. If the
relationship is linear, then the red line in the centre of the graph will be fairly flat. As you can see, this
assumption is met for the present analysis.
We can also address our second assumption (homoscedasticity) with this plot. If the model evidences constant
variability across the range of fitted values, then the observations should be relatively evenly distributed
across the range of the fitted values - in other words, the error should be relatively constant across all of the
fitted values. Again, this assumption appears to be met for the present variable.
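The red trend line in plot 1 is simply a smoothed trend through the residuals. As a sketch, the same display can be built by hand from the model object (the setup from earlier in the lab is repeated so the block runs on its own):

```r
library(psych)
data("epi.bfi")
mod1 <- lm(epi.bfi$epiNeur ~ epi.bfi$bfneur)

res <- residuals(mod1)   # the "errors" made within the model
fit <- fitted(mod1)      # the model's predicted values

plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                # residuals should scatter evenly about zero
lines(lowess(fit, res), col = "red")  # smoothed trend; roughly flat if the relationship is linear
```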
Evaluating the Assumption of Bivariate Normality
plot(mod1, which = 2)
[Normal Q-Q plot for lm(epi.neur ~ ipip.neur): standardized residuals plotted against theoretical quantiles, with observations 6, 12, and 138 labelled]
We can address our third assumption (bivariate normality) with a normal Q-Q plot on the standardized
residuals. In a Q-Q plot, or “Quantile-Quantile” plot, we plot theoretical quantiles on the x-axis, and the
quantiles of the standardized residuals on the y-axis. If the model satisfies the requirement of bivariate
normality, the points should be lined up nicely along the red line - and as you can see, they do, for our
example.
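Plot 2 can likewise be reproduced directly from the standardized residuals, using the qqnorm and qqline functions. A sketch, with the setup repeated:

```r
library(psych)
data("epi.bfi")
mod1 <- lm(epi.bfi$epiNeur ~ epi.bfi$bfneur)

std.res <- rstandard(mod1)  # standardized residuals from the model
qqnorm(std.res)             # sample quantiles against theoretical normal quantiles
qqline(std.res)             # reference line through the first and third quartiles
```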
Looking for Points with Unusual Influence on the Regression Line
plot(mod1, which = 5)
[Residuals vs Leverage plot for lm(epi.neur ~ ipip.neur): standardized residuals plotted against leverage, with the most extreme observations labelled; no Cook's distance contours appear]
The third plot that we will look at is a plot of the residuals for each observation, against their corresponding
leverage. This is useful for identifying points that may change the regression line. It uses a concept called
“leverage” to describe the impact that outliers within your data may have on the slope of your regression
line. The extent to which an outlier can change the slope of your line is determined by its distance from the
centre of the data (its leverage), and the magnitude of its residual.
Leverage is a very descriptive and intuitive statistical term, if you imagine the regression line to be a lever
that has its fulcrum at the centre of your data (i.e., at the intersection of the mean of X and the mean of Y).
Consider an actual lever: force that is applied to a point that is further away from the fulcrum will have more
impact than force that is applied to a point that is closer to the fulcrum. So it is with statistical leverage
- points with greater leverage have greater potential to change the slope of the regression line than points
with less leverage. Further to this, the magnitude of the residual may be thought of as the amount of force
that is being applied. The larger the residual, the greater the amount of force that is being applied to move
the regression line. Finally, the sign of the residual may be thought of as the direction in which the force is
being applied. If the residual is positive, it will “pull the line upwards” (making the slope larger), and if the
residual is negative, it will “pull the line downwards” (making the slope smaller).
The importance of the leverage for each point is typically evaluated using a statistic called “Cook’s distance”
or “Cook’s D”. Cook’s D is a function of the leverage and the standardized residual for a point, and may be
used to estimate the amount that any given point will move the regression line. In terms of having an effect
on the regression line, observations that have a Cook’s D with an absolute value greater than 0.5 are likely to
be a problem, and observations that have a Cook’s D with an absolute value greater than 1.0 are almost
certainly going to be a problem. The third graph will draw dotted lines that represent Cook’s distance scores
of 0.5 and 1.0 when there are points that come close to these values, within the dataset. Such points will be
plotted and numbered (by their case number in the data) on this plot. As we can see, however, there are no
such outlier cases within our data.
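The rule of thumb above can also be checked numerically, using the cooks.distance function on the model object. A sketch, with the setup repeated:

```r
library(psych)
data("epi.bfi")
mod1 <- lm(epi.bfi$epiNeur ~ epi.bfi$bfneur)

cd <- cooks.distance(mod1)
max(cd)          # the largest Cook's D in the data
which(cd > 0.5)  # observations likely to be a problem (none expected here)
```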
Conclusion
Given the foregoing, we can interpret the correlation coefficient associated with this relationship (0.6274718),
and conclude that the IPIP Neuroticism scale demonstrates good concurrent validity when compared with a
previously validated scale (the EPI Neuroticism scale).
Evaluating the Convergent Validity of the IPIP Neuroticism Scale
Two of the key characteristics of individuals that score highly on neuroticism measures are anxiety and
depression. Thus, we can evaluate the convergent validity of the IPIP Neuroticism Scale by evaluating the
correlation between the IPIP Neuroticism Scale and scores on the Beck Depression Inventory, as well as scores
on both the State and Trait Anxiety measures.
We will be looking at several correlations for this analysis (three, to be specific), and so this is a good
opportunity to check out the excellent corr.test function in the psych package, as it produces outputs that
have the potential to save a lot of time in manuscript preparation.
Like the cor and lm functions, the corr.test function will accept input from individual vectors, or from
individual variables within data frames. If the input is a matrix or data frame, however, it will compute
correlations among all possible combinations of variables. To take advantage of this, we will create a data
frame with just the variables that we are interested in for our estimations of convergent validity.
ipip <- data.frame(ipip.neur, bdi, anx.state, anx.trait)
Now, we can use the corr.test function to produce a correlation matrix for all possible combinations of
these variables.
corr.test(ipip)$r
##           ipip.neur       bdi anx.state anx.trait
## ipip.neur 1.0000000 0.4661663 0.4920567 0.5930101
## bdi       0.4661663 1.0000000 0.6087400 0.6547648
## anx.state 0.4920567 0.6087400 1.0000000 0.5711474
## anx.trait 0.5930101 0.6547648 0.5711474 1.0000000
Note that we just asked for the correlation coefficients, by specifying that we wanted just the r value from
within the object that is created by this function. If we had left this specification off the function call, the
output would have included the correlation coefficients, the n-size for the analysis, and the p-values for each
of the correlation coefficients.
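For example, these other components can be pulled from the object returned by corr.test. A sketch, with the setup repeated (note that, by default, the p-values above the diagonal of the p matrix are adjusted for multiple tests):

```r
library(psych)
data("epi.bfi")
ipip <- data.frame(ipip.neur = epi.bfi$bfneur, bdi = epi.bfi$bdi,
                   anx.state = epi.bfi$stateanx, anx.trait = epi.bfi$traitanx)

ct <- corr.test(ipip)
ct$n            # the n-size for the analysis
round(ct$p, 3)  # p-values (above the diagonal: adjusted for multiple tests)
```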
More useful still is the confidence interval output that can be provided by the corr.test function.
corr.test(ipip)$ci
##                 lower         r     upper            p
## ipp.n-bdi   0.3586704 0.4661663 0.5614600 7.283063e-14
## ipp.n-anx.s 0.3875967 0.4920567 0.5840400 1.776357e-15
## ipp.n-anx.t 0.5023874 0.5930101 0.6707460 0.000000e+00
## bdi-anx.s   0.5205656 0.6087400 0.6840672 0.000000e+00
## bdi-anx.t   0.5742174 0.6547648 0.7227582 0.000000e+00
## anx.s-anx.t 0.4772540 0.5711474 0.6521472 0.000000e+00
This output provides us with all of the analytic information we need, in order to present the correlation
coefficient for our convergent validity calculations. The confidence interval is probably more useful than the
p-values, as it provides us with an explicit presentation of the limitations to interpretation for the correlation
coefficients - as well as providing us with an estimate of the statistical significance for each correlation
coefficient.
Regardless of how you choose to present your results, however, the message is fairly clear - the IPIP
Neuroticism scale is significantly correlated with both anxiety and depression, and so this provides solid
convergent validity evidence for this measure.
Testing Assumptions
We still need to test the assumptions for each of these three correlations, however, and can do so using the
same methods presented earlier. Our first step is to compute linear models for each of the three prediction
equations that we will be examining:
mod.bdi <- lm(bdi ~ ipip.neur)
mod.state <- lm(anx.state ~ ipip.neur)
mod.trait <- lm(anx.trait ~ ipip.neur)
We can now evaluate the diagnostic plots for the residuals within each of these regression models.
Testing Assumptions for the Prediction of BDI with IPIP Neuroticism
plot(mod.bdi, which=c(1,2))
[Residuals vs Fitted and Normal Q-Q plots for lm(bdi ~ ipip.neur), with observations 66, 76, and 138 labelled in both]
No marked departure from linearity or homoscedasticity is noted in plot #1 for these data. There is a
slight departure from normality in the upper end of the distribution of the standardized residuals (plot #2),
but, considered over the range of values within the data, this is unlikely to produce substantive problems
for interpretation.
Testing Assumptions for the Prediction of State Anxiety with IPIP Neuroticism
plot(mod.state, which=c(1,2))
[Residuals vs Fitted and Normal Q-Q plots for lm(anx.state ~ ipip.neur), with observations 5, 99, and 147 labelled in both]
No marked departure from linearity (plot #1), homoscedasticity (plot #1) or normality (plot #2) is noted.
Testing Assumptions for the Prediction of Trait Anxiety with IPIP Neuroticism
plot(mod.trait, which=c(1,2))
[Residuals vs Fitted and Normal Q-Q plots for lm(anx.trait ~ ipip.neur), with observations 93, 131, and 158 labelled in both]
Again, no marked departure from linearity or homoscedasticity is demonstrated in plot #1 for this
prediction equation. There is a slight departure from normality in the upper end of the distribution of the
standardized residuals, but, considered over the range of values within the data, this is unlikely to produce
substantive problems for interpretation.
Evaluating the Data for Influential Observations
par(mfrow=c(1,3))
plot(mod.bdi, which=5)
plot(mod.state, which=5)
plot(mod.trait, which=5)
[Three Residuals vs Leverage plots, side by side, for lm(bdi ~ ipip.neur), lm(anx.state ~ ipip.neur), and lm(anx.trait ~ ipip.neur); the most extreme observations are labelled, and no Cook's distance contours appear in any panel]
If we look at the plots of the residuals versus leverage for each of our three prediction equations, we can see
that there are no points within the data that have a Cook’s D that falls beyond an absolute value of 0.5 (as
evidenced by the fact that no dotted contour lines have been plotted for us).
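The same conclusion can be confirmed numerically by taking the largest Cook's D from each of the three models. A sketch, with the setup repeated so the block runs on its own:

```r
library(psych)
data("epi.bfi")
ipip.neur <- epi.bfi$bfneur
mod.bdi   <- lm(epi.bfi$bdi      ~ ipip.neur)
mod.state <- lm(epi.bfi$stateanx ~ ipip.neur)
mod.trait <- lm(epi.bfi$traitanx ~ ipip.neur)

# the largest Cook's D in each model -- all should fall well below 0.5
sapply(list(bdi = mod.bdi, state = mod.state, trait = mod.trait),
       function(m) max(cooks.distance(m)))
```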
Conclusion
Given that our assumptions have been met, we may confidently interpret the correlation coefficients that we
calculated for these three relationships. Given that the Neuroticism scale of the IPIP may be demonstrated to
be a statistically significant predictor of both depression and anxiety, we may conclude that there is evidence
of convergent validity for this measure.