SPSS Practical 4 – Correlation and Linear Regression
In this practical we are using a dataset on child development, development.sav. Locate
development.sav and save it to a network drive or your computer (right-click the online file,
select ‘Save as’ and choose the location to save the data to). Open SPSS and load the dataset
from where the file was saved [File → Open → Data, then navigate to the location where
you saved development.sav and open the data file]. A full list of variables in the
dataset is given below.
Physical
  Height – cm
  Weight – kg
  Head circumference – cm
  Sex – 0 if male, 1 if female

Developmental
  Kaufman assessment battery for children – a global measure of IQ. Must be >0; normal range is 105 ± 15
  Number of pictures recognised – out of 30 pictures shown

Socioeconomic
  Mother’s qualifications – GCSE = 0 / higher than GCSE = 1
  Number of rooms in home
  Parents’ marital status – 0 = married, 1 = cohabiting, 2 = separated, 3 = single
  Parents’ joint income – 0 = <£20,000; 1 = £20–30,000; 2 = £30–40,000; 3 = ≥£40,000
CORRELATION
Correlation is useful when exploring the linear relationship between two numerical variables
where they are likely to be associated but one probably does not depend on the other. The
correlation coefficient goes from -1 to +1: 1 is a perfect positive linear association, -1 is a
perfect negative linear association and 0 suggests no association. Large positive values closer
to 1 suggest that as one variable increases, so does the other, whilst large negative values
suggest that as one variable increases, the other decreases. We will use the example of
height and head circumference.
First we will examine a scatter plot of height against head circumference. Go to ‘Graphs →
Legacy Dialogs → Scatterplot’. Choose ‘height’ and ‘headcirc’ and move them to the right-hand
side so that they sit along the proposed X and Y axes. Click ‘Ok’. The scatter plot
suggests there may be an association between the variables; the scatter of the data points
suggests that individuals with larger head circumferences tend to have greater height.
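If you prefer to work in a syntax window, a command along these lines should reproduce the plot (a sketch assuming the variables are named height and headcirc in your copy of the dataset; check the Variable View if unsure):

GRAPH
  /SCATTERPLOT(BIVAR)=height WITH headcirc.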
Click ‘Analyze → Correlate → Bivariate.’ Choose ‘height’ and ‘headcirc’ and move them to
the ‘variables’ box. Click ‘Ok’. The resulting table gives you the correlation between these
two variables. You are also given a p-value testing the null hypothesis that the correlation
coefficient in question is zero against the alternative that it is not equal to zero.
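The equivalent syntax is roughly as follows (again assuming the variable names height and headcirc):

CORRELATIONS
  /VARIABLES=height headcirc
  /PRINT=TWOTAIL NOSIG.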
By default, Pearson’s correlation was used. Pearson’s correlation coefficient assumes a
linear relationship between the two variables. If it appears that your two variables are not
linearly related, Spearman’s correlation can be calculated instead, which relaxes this
assumption. Go back to the correlation dialog box, untick ‘Pearson’ and tick ‘Spearman’. Click
‘Ok’ again. If the two correlations are in close agreement then the assumption of linearity
was probably reasonable. In this example we see that the correlations are in close agreement,
therefore the assumption of linearity was acceptable; we interpret Pearson’s correlation
coefficient.
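In syntax, Spearman’s correlation comes from the nonparametric correlation command; something like the following should work (variable names assumed as before):

NONPAR CORR
  /VARIABLES=height headcirc
  /PRINT=SPEARMAN TWOTAIL NOSIG.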
Kendall’s tau-b is another non-parametric version of the correlation coefficient which is an
alternative to Spearman’s correlation. Explore the data set further. Go back to the correlate
dialog box and add more variables. This gives you a matrix of correlations for each pair of
variables selected. Try this and make sure you understand the output.
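A correlation matrix can also be requested in one command by listing several variables, for example (names other than height and headcirc are assumptions; substitute those shown in your Variable View):

CORRELATIONS
  /VARIABLES=height weight headcirc kaufman
  /PRINT=TWOTAIL NOSIG.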
CORRELATION QUESTIONS
1. What is the Pearson correlation between height and head circumference?
2. What does this correlation value suggest?
3. Is the association between height and head circumference significant?
4. What is the Pearson correlation between height and weight?
REGRESSION
Regression analysis is a general modelling technique, useful for quantifying associations and
for prediction when you have one numerical variable which you expect to depend upon one
or more other variables. Linear Regression quantifies the linear relationship between one
numerical outcome variable and one or more explanatory variables (factors believed to
affect the outcome) with an equation – the regression line. We are interested in knowing
about the effects of weight on intelligence, represented by Kaufman score. Are heavier
children more or less intelligent than their lighter counterparts?
Click ‘Analyze → Regression → Linear’. Select ‘Kaufman’ as the dependent variable and
‘Weight’ as the independent variable. Click the ‘Statistics’ box and tick ‘Confidence
intervals’, then click ‘Continue’. Click ‘Plots’, tick the box next to ‘Normal probability plot’
and indicate that you also want SPSS to produce the residuals versus fitted values plot by
moving ‘ZPRED’ into the ‘X:’ box and ‘ZRESID’ into the ‘Y:’ box. We will come back to the use
of this plot shortly. Click ‘Continue’ then ‘Ok’.
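The whole model, including the confidence intervals and diagnostic plots, can also be requested in syntax. A sketch, assuming the dependent and independent variables are named kaufman and weight:

REGRESSION
  /STATISTICS COEFF CI(95) R ANOVA
  /DEPENDENT kaufman
  /METHOD=ENTER weight
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS NORMPROB(ZRESID).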
The resulting coefficients table gives estimates of two parameters: the constant and the slope
(B relating to weight). The constant tells us the value of the dependent variable (Kaufman
score) when the independent variable (weight) is 0. In many cases this is not of interest –
there are no children with a weight of 0! The weight parameter B is of interest to us. This
gives the increase in Kaufman score associated with a 1kg increase in weight. There is also a
confidence interval around B and a significance test (testing the null hypothesis that the true
slope is 0).
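To make the interpretation concrete with made-up numbers: if the constant were 90 and B for weight were 0.8, the fitted line would be

  predicted Kaufman score = 90 + 0.8 × weight,

so a 15 kg child would have a predicted score of 90 + 0.8 × 15 = 102. Read the actual values from your own coefficients table.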
You also get the normal probability plot and residuals versus fitted values plot as outputs
which are important for assessing the assumptions of the linear regression model.
There are 4 assumptions made when a linear regression model is run:
1. Independent data observations
2. Linear relationship between the numerical outcome variable and explanatory
variable(s)
3. Residuals are normally distributed
4. Residuals have constant variance
If any of these assumptions is not met, the analysis may be invalid. Two observations are
independent (assumption 1) if the value of one observation gives no information about the
value of the other. To assess assumption 2 we consult the “Residuals versus fitted values”
plot in the output viewer: we hope to see a random scatter, and the assumption is violated
if a non-linear pattern is observed. We also assess assumption 4 using the residuals versus
fitted values plot; we do not want to see any ‘funnelling’ in the scatter of points, as we
want the variation in the residuals to be constant over the range of fitted values. Finally,
we assess assumption 3 using the normal probability plot, also in the output viewer. The
assumption is satisfied if the data points lie close to the diagonal line.
REGRESSION QUESTIONS
1. What is the effect (and 95% CI) of a 1kg increase in weight on the Kaufman score?
2. Therefore what would be the effect of a 5kg increase in weight on Kaufman score?
3. What is the R² for this linear regression model?
4. Are the assumptions of the linear regression model satisfied?
It is entirely possible that the effect of weight on intelligence is confounded. What if the
apparent effect arises simply because sex is associated with both weight and intelligence? We
need to adjust the estimate, B, for sex. To do this, open the regression dialog box again. Enter ‘sex’ as
an additional independent variable. Look at the coefficients output table again. Has the
estimate (B) corresponding to weight changed drastically?
Sex is a binary variable where 0 = Males and 1 = Females (see table of variables). B for sex
therefore quantifies the difference in the estimated mean value of the Kaufman score
between males and females after adjusting for weight.
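In syntax, the adjusted model simply lists both variables on the METHOD subcommand (names assumed as before):

REGRESSION
  /STATISTICS COEFF CI(95) R ANOVA
  /DEPENDENT kaufman
  /METHOD=ENTER weight sex
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS NORMPROB(ZRESID).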
QUESTIONS
5. Did the adjustment for sex you made change B for weight?
6. What is the effect (and 95% CI) of being female on the Kaufman score?
7. Did adding sex reduce the R² of your model?
You can continue like this and adjust for other variables you think are potential confounders
until you have reached a model you are happy with. As well as adjusting for confounders, you
may wish to produce a model which explains as much variation in the dependent variable
(Kaufman) as possible. This is measured by R², which is the proportion of variation explained
by the variables in your model. This is given in the Model Summary table. Try adding different
independent variables to your model to see which improve R² the most.
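For example, a larger model with some of the socioeconomic variables might look like the sketch below (mumqual and income are hypothetical names; substitute whatever the variables are called in your Variable View):

* mumqual and income below are hypothetical variable names.
REGRESSION
  /STATISTICS COEFF CI(95) R ANOVA
  /DEPENDENT kaufman
  /METHOD=ENTER weight sex mumqual income.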