SSS1 Lecture 8 - gwilympryce.co.uk

Social Science Statistics Module I
Gwilym Pryce
Lecture 8
Regression: Relationships between
continuous variables
Slides available from Statistics & SPSS page of www.gpryce.com
Notices:
• Register
• Revision lecture next week:
– Worked examples on:
• Confidence Intervals?
• Hypothesis Tests?
• Regression?
– Email me any particular issues
• Learning & Support strategy:
1. Independent learning:
– This is a PG course and a degree of independent learning is assumed.
– Do the reading, attend the labs, review the lectures, make use of the computer labs and online help in your own time.
2. Lab Overview & Feedback:
– Please feed back to the tutors & Class Reps how you think that is going and how it could be improved.
– Tutors and Class Reps will then report back to me how things are going each week.
3. Talk to tutors if you are struggling:
– Let the tutors know if you are struggling (assuming you have done the reading, attended labs etc.)
– Tutors cannot guarantee extra support, but it might be possible to arrange extra tutorials etc.
4. Support from Shazia Ahmed, the University’s Maths Adviser:
– If you have gone through steps 1 to 3, Shazia has agreed to run one-on-one sessions with students who are struggling with particular mathematical or statistical concepts (though she has made it clear that she cannot advise on SPSS problems, nor will she do the assignment for you).
– Students who have particular problems in this regard can contact her directly:
Shazia Ahmed, Maths Adviser, Student Learning Service
McMillan Reading Room, Tel: 330 5631 Fax: 330 8063
5. Departmental Support:
– Struggling students should enquire whether their own dept has support to offer.
– All the grad school courses are only intended to constitute a generic training component;
– Individual depts & supervisors should supplement with additional training and support as necessary.
6. Tutor of Last Resort:
– Students who have gone through steps 1 to 5 above, and who still feel they are not receiving enough support, can email me directly.
– I will try to arrange individual or small group meetings for people who have tried all other avenues.
• You will need to demonstrate that you have gone through steps 1 to 5.
Plan:
1. Linear & Non-linear Relationships
2. Fitting a line using OLS
3. Inference in Regression
4. Omitted Variables & R2
5. Categorical Explanatory Variables
6. Summary
1. Linear & Non-linear relationships between variables
• Often of greatest interest in social science is investigation into
relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is worker alienation related to job monotony?
• We are also interested in the direction of causation, but this is
more difficult to prove empirically:
– our empirical models are usually structured assuming a particular
theory of causation
Relationships between scale variables
• The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– it is traditional to:
• put the dependent variable (i.e. the “effect”) on the vertical axis
– or “y axis”
• put the explanatory variable (i.e. the “cause”) on the horizontal axis
– or “x axis”
Scatter plot of IQ and Income:
[Figure: scatter plot of INCOME (vertical axis) against IQ (horizontal axis)]
We would like to find the line of best fit:
$$\text{INCOME} = a + b \times \text{IQ}$$
where,
a = y intercept
b = slope of line
Predicted values (i.e. values of y lying on the line of best fit) are given by:
$$\hat{y} = a + bx$$
[Figure: scatter plot of INCOME against IQ with the line of best fit]
What does the output mean?

Model 1       Unstandardized Coefficients
              B             Std. Error
(Constant)    -8236.836     12…
IQ            258.523

a. Dependent Variable: INCOME
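As a minimal sketch of how to read this output, the two B values can be plugged into the line of best fit to give a predicted income. The function name `predict_income` and the example IQ of 100 are illustrative assumptions, not part of the SPSS output.

```python
# Coefficients copied from column B of the SPSS output.
INTERCEPT = -8236.836   # (Constant): the y-intercept a
SLOPE = 258.523         # IQ: the slope b


def predict_income(iq):
    """Return the predicted income (y-hat) for a given IQ, using y-hat = a + b*x."""
    return INTERCEPT + SLOPE * iq


# Illustrative input only: IQ = 100 is an assumption, not an observation.
example_prediction = predict_income(100)
print(round(example_prediction, 2))
```

The prediction is only as trustworthy as the fitted line; values of IQ far outside the range plotted on the scatter plot should be treated with caution.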
Sometimes the relationship appears non-linear:
[Figure: scatter plot of INCOME against IQ2 showing a curved pattern]
… straight line of best fit is not always very satisfactory:
[Figure: scatter plot of INCOME against IQ2 with a straight line of best fit]
Could try a quadratic line of best fit:
[Figure: scatter plot of INCOME against IQ2 with a quadratic line of best fit]
We can simulate a non-linear relationship by first transforming one of the variables:
[Figure: two panels, INC_SQ against IQ2 and INCOME against IQ2]
e.g. squaring IQ and taking the natural log of IQ:
[Figure: two panels, INCOME against IQ2 and INCOME against IQ2_LN]
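The idea of transforming a variable and then fitting a straight line can be sketched as follows. The data are hypothetical and constructed to follow y = 3 + 0.5x² exactly, and the helper `fit_line` is an illustrative name; it uses the least-squares formula covered in Section 2.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (intercept a, slope b)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data, constructed to follow y = 3 + 0.5 * x^2 exactly.
x = [1, 2, 3, 4, 5, 6]
y = [3 + 0.5 * xi ** 2 for xi in x]

# Transform the explanatory variable first, then fit a straight line to (x^2, y).
x_sq = [xi ** 2 for xi in x]
a, b = fit_line(x_sq, y)
print(a, b)
```

The relationship between y and x is curved, but the relationship between y and the transformed variable x² is a straight line, so an ordinary linear fit recovers it.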
… or a cubic line of best fit: (over-fitted?)
[Figure: scatter plot of INCOME against IQ2 with a cubic line of best fit]
Or could try two linear lines:
“structural break”
[Figure: scatter plot of INCOME against IQ2 with two straight-line segments meeting at a structural break]
2. Fitting a line using OLS
• The most popular algorithm for drawing the line of best
fit is one that minimises the sum of squared deviations
from the line to each observation:
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where:
$y_i$ = observed value of y
$\hat{y}_i$ = predicted value of $y_i$
= the value on the line of best fit corresponding to $x_i$
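To make the quantity being minimised concrete, here is a small sketch that computes the sum of squared deviations for two candidate lines. The observations and the two candidate lines are hypothetical, chosen only for illustration.

```python
def sum_squared_deviations(xs, ys, a, b):
    """Sum over i of (y_i - y_hat_i)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations (not the lecture's data).
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# Two candidate lines: one close to the points, one visibly worse.
sse_good = sum_squared_deviations(xs, ys, a=0.0, b=2.0)
sse_poor = sum_squared_deviations(xs, ys, a=1.0, b=1.0)
print(sse_good, sse_poor)
```

OLS picks, out of all possible values of a and b, the pair that makes this sum as small as possible.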
Example: School Performance in 8 Schools
y = school performance; x = ave. HH income of pupils (£000s)
etc
1. Write this model output as an equation.
2. When xi = 41 what is the value of yi?
3. When xi = 41 what is the value of y_hat?
4. What is the difference between yi and y_hat when xi = 41, and what does this difference mean?
5. Where does the line of best fit cut the vertical axis?
6. What is the value of school performance when average HH income of pupils is zero?
7. How sensitive is school performance to the economic status of its intake?
8. How is this sensitivity calculated?
1. y_hat = 6 + 2*xi
– yi = 6 + 2*xi + ei
2. From the table of observations we can see that, when xi = 41, yi = 91.7.
– NB if there was another school with xi = 41, the observed value of y might not be the same due to random variation.
3. When xi = 41 what is the value of y_hat?
• y_hat = 6 + 2*41 = 88
4. The difference between yi and y_hat when xi = 41 is 91.7 - 88.0 = 3.7. This difference is the “error” or “residual”.
– i.e. our model predicts that school performance will equal 88 when x = 41, but for this particular school, the actual performance is 91.7, so the model underpredicts performance by 3.7.
5. The line of best fit (our model) cuts the vertical axis where x = 0.
y_hat = 6 + 2*xi = 6 + 2*0 = 6
6. The value of school performance = 6 when average HH income of pupils, x, is zero.
7. The regression slope, also called b or the slope coefficient, is a measure of how sensitive the dependent variable is to change in the explanatory variables. SPSS has estimated that the slope in this case = 2.
– i.e. for every unit increase in the explanatory variable (average income of parents measured in £000s) school performance rises by two units.
– i.e. for every extra £1,000 average income, school performance goes up by two units.
8. How is this sensitivity calculated? Good question! It is the slope of the line of best fit, calculated using the OLS formula which minimises the sum of squared residuals…
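The arithmetic in answers 3 to 6 can be checked with a short sketch. The intercept 6, slope 2 and the observed value 91.7 at x = 41 come from the example; the function name `predict` is an illustrative choice.

```python
def predict(x, a=6.0, b=2.0):
    """Predicted school performance y_hat = a + b*x for the fitted line."""
    return a + b * x


x_i = 41            # average HH income of pupils, in £000s
observed_y = 91.7   # observed performance for this school

y_hat = predict(x_i)              # value on the line of best fit
residual = observed_y - y_hat     # how far the school lies above the line
intercept_value = predict(0)      # where the line cuts the vertical axis

print(y_hat, round(residual, 1), intercept_value)
```

A positive residual means the model underpredicts for that school; a negative one would mean it overpredicts.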
Regression estimates of a, b using Ordinary Least Squares (OLS):
• Solving the min[error sum of squares]
problem yields estimates of the slope b and
y-intercept a of the straight line:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 2$$
$$a = \bar{y} - b\bar{x} = 6$$
y_hat = 6 + 2*xi
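A minimal sketch of these two formulas in code follows. The eight (income, performance) pairs are hypothetical, since the lecture's table of observations is not reproduced here, so the estimates will not be exactly 6 and 2; the function name `ols` is also an illustrative choice.

```python
def ols(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: a = mean(y) - b * mean(x)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Hypothetical data for 8 schools (assumed values, for illustration only).
income = [20, 25, 30, 35, 41, 45, 50, 55]
performance = [47.0, 55.5, 64.8, 77.1, 91.7, 95.0, 104.9, 117.2]

a, b = ols(income, performance)
residuals = [y - (a + b * x) for x, y in zip(income, performance)]
print(round(a, 2), round(b, 2))
```

A useful check on any OLS fit is that the residuals sum to (approximately) zero and are uncorrelated with x; both follow from the minimisation.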
Now consider what would happen if we collected another
sample and calculated the line of best fit for this new sample…
A Second random sample of 8 schools:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 2.1$$
$$a = \bar{y} - b\bar{x} = 7.6$$
A Third Random Sample of 8 Schools:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 1.9$$
$$a = \bar{y} - b\bar{x} = 15.2$$
A Fourth Random Sample of 8 Schools:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 2.0$$
$$a = \bar{y} - b\bar{x} = 14.5$$
A Fifth Random Sample of 8 Schools:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 1.9$$
$$a = \bar{y} - b\bar{x} = 14.0$$
Further random samples…
[Figure: four panels showing the lines of best fit for Samples 6, 7, 8 and 9]
Sample 1: b = 2.0
Sample 2: b = 2.1
Sample 3: b = 1.9
Sample 4: b = 2.0
Sample 5: b = 1.9
Sample 6: b = 1.7
Sample 7: b = 1.8
Sample 8: b = 2.5
Sample 9: b = 2.2
Average b from 9 samples = 2.0
Standard deviation of b from 9 samples = 0.2
i.e. average deviation of b from sample to sample = 0.2 = Standard Error of the slope
• Notice that, in the second, third etc. samples we have found schools with exactly the same values of x as in the first sample.
• Despite this, we find random variation in the performance of the school for a given value of x.
• This means that the slope coefficient will also vary from sample to sample.
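The average and standard deviation quoted for the nine slopes can be reproduced with a few lines. This is a sketch: it assumes the sample standard deviation (n - 1 divisor), which the slide does not specify; either divisor rounds to 0.2 here.

```python
import statistics

# Slope estimates from the nine samples listed in the lecture.
slopes = [2.0, 2.1, 1.9, 2.0, 1.9, 1.7, 1.8, 2.5, 2.2]

mean_b = statistics.mean(slopes)   # average slope across samples
sd_b = statistics.stdev(slopes)    # sample SD (n - 1 divisor): an assumption

print(round(mean_b, 1), round(sd_b, 1))
```

The standard deviation of the slope across repeated samples is what the standard error of b is estimating when only one sample is available.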
• Q1/ What would the sampling distribution
of b look like if the sample size was
large?
• Q2/ What will the average of all sample slopes be and what symbol do we use to denote this value?
• Q3/ What section of that distribution are
we usually most interested in?
If n is large…
• A1/ the sample slope b is normally distributed if n is large.
• A2/ the average of all sample slopes = the population slope β
• A3/ we are usually most interested in the central 95% of the distribution of b
– We want to be 95% sure that the population value of the slope lies between some lower bound and some upper bound.
[Figure: sampling distribution of b, centred on the average b, which equals β]
• Q/ Why is it useful that b is normally
distributed?
• A/ If b is normally distributed, it means that we
can use the standard normal curve to help us
work out the lower and upper bounds of the
central 95% of the sampling distribution of b
[Figure: two histograms (NORM_2): the sampling distribution of b, and the same distribution converted to z values]
Convert to z value:
$$z = \frac{b - \beta}{s_b}$$
where $s_b$ is the SE of b and $\beta$ is the population slope
• Because the sampling distribution of the
regression slope from large samples is normal
(i.e. has a bell-shaped histogram), we can use
the standard normal curve (z distribution) to
work out confidence intervals and hypothesis
tests on b.
– i.e. we can use the known probabilities for areas under the standard normal curve to work out:
• The lower and upper bounds for the central 95% of b
• The probability of observing a sample like our own with a value of b at least as far away from the value assumed under H0
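As a rough sketch of the first of these calculations, the bounds of the central 95% can be found from the standard normal curve. The values b = 2.0 and SE = 0.2 are taken from the nine school samples; using them together in this way, and the function name, are illustrative assumptions.

```python
from statistics import NormalDist


def normal_confidence_interval(b, se_b, level=0.95):
    """Return (lower, upper) bounds b -/+ z * se_b using the z distribution."""
    # z cuts off (1 - level) / 2 in each tail, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return b - z * se_b, b + z * se_b


lower, upper = normal_confidence_interval(2.0, 0.2)
print(round(lower, 3), round(upper, 3))
```

With only a handful of observations per sample the t-distribution would be the more appropriate choice; the normal version is shown because it matches the large-sample argument here.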
Small samples
• If the sample is small, b will have a t-distribution.
• Since the t-distribution is asymptotically normal (i.e. tends towards the z distribution as n increases) we tend to use the t-distribution whether the sample is large or small.
[Figure: two histograms (NORM_2): the sampling distribution of b, and the same distribution converted to t values]
Convert to t value:
$$t = \frac{b - \beta}{s_b}$$
where $s_b$ is the SE of b and $\beta$ is the population slope
3. Hypothesis tests on the slope coefficient
• Regressions are usually run on samples, but
usually we want to say something about the
population value of the relationship between x
and y.
• Repeated samples would yield a range of values for estimates of b ~ N(β, s_b)
• i.e. b is normally distributed with mean = β = population mean = value of b if the regression were run on the population
• If there is no relationship in the population between x and y, then β = 0
• H0: β = 0, H1: β ≠ 0 is the hypothesis test which SPSS runs automatically on every regression you run; it reports the results in the two columns headed “t” and “Sig.” in the Coefficients table.
– i.e. every SPSS output table of coefficients includes the
results of a hypothesis test on whether there is any
relationship at all between x and y.
• Some examples…
Returning to our IQ example:
Coefficients(a)

Model 1       Unstandardized Coefficients     Standardized Coefficients
              B             Std. Error        Beta
(Constant)    -8236.836     1250.461
IQ            258.523       11.089            .993

a. Dependent Variable: INCOME
• Q1/ what is the estimate of slope in this sample and what does it
tell us?
• Q2/ what is the standard error and what does it mean?
• Q3/ what is the value of the intercept term and what does it mean?
• Q4/ how would we test the hypothesis that b = 0, and what does
this hypothesis mean?
• A1/ the estimate of the slope in this sample is 258.5, or roughly 260. This tells us that for every unit increase in IQ, income typically rises by around £260.
• A2/ the standard error tells us how much the estimate of the slope typically varies from sample to sample. We do not know the SE of b for sure, but SPSS estimates it at £11
– i.e. the slope estimate is likely to vary by around £11 from sample to sample.
• A3/ the value of the intercept term is estimated to be -8,237. The intercept term tells us the value of the dependent variable when the explanatory variables are all zero.
– i.e. where the line of best fit cuts the vertical axis
– So we estimate that for someone with zero IQ, their income will typically be -£8,237.
• A4/ we would test the hypothesis that β = 0 (i.e. that there is no relationship between IQ and income in the population) by calculating the probability of observing a sample with an estimated slope of £260 when the value of the population slope is zero.
– We would calculate this probability (= sig. = probability of falsely rejecting H0: β = 0) by calculating the associated value on the t-distribution and using this to work out the areas in the tails.
• tc = (258.5 - 0)/11.089 = 23.3, where tc is the value of t you have calculated. You then want to work out what proportion of the t-distribution lies above tc and below -tc.
– We would then look up this value for t in the t tables for the degrees of freedom associated with our regression = sample size - (1 + the number of explanatory variables).
$$t = \frac{b - \beta}{s_b}$$
Hypothesis test on b:
• (1) H0: β = 0
(i.e. slope coefficient, if regression run on population, would = 0)
H1: β ≠ 0
• (2) α = 0.05 or 0.01 etc.
$$t = \frac{b - \beta}{s_b} = \frac{b - 0}{s_b} = \frac{b}{s_b}, \qquad df = n - 2$$
• (3) Reject H0 iff P < α
• (N.B. Rule of thumb if n fairly large: P < 0.05 if |tc| ≥ 2)
• (4) Calculate P and conclude.
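Steps (1) to (3) can be sketched for the IQ example using the B and Std. Error values from its Coefficients table. The function name `t_statistic` is illustrative, and the final check uses only the rule of thumb, not an exact P value.

```python
def t_statistic(b, se_b, beta_null=0.0):
    """t = (b - beta) / s_b, where beta is the slope assumed under H0."""
    return (b - beta_null) / se_b


# Slope and standard error for IQ from the Coefficients table.
t_c = t_statistic(258.523, 11.089)

# Rule of thumb for fairly large n: |t| of about 2 or more implies P < 0.05.
reject_by_rule_of_thumb = abs(t_c) >= 2

print(round(t_c, 2), reject_by_rule_of_thumb)
```

SPSS reports this t value and its exact two-tailed significance in the “t” and “Sig.” columns, so in practice the calculation rarely has to be done by hand.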
Floor Area Example:
• You run a regression of house price on floor area which yields the following output. Use this output to answer the following questions:
Q/ What is the “Constant”? What does its value mean here?
Q/ What is the slope coefficient and what does it tell you here?
Q/ What is the estimated value of an extra square metre?
Q/ How would you test for the existence of a relationship between purchase price and floor area?
Q/ How much is a 200 m² house worth?
Q/ How much is a 100 m² house worth?
Q/ On average, how much is the slope coefficient likely to vary from sample to sample?
• NB Write down your answers – you’ll need them later!
[Table: SPSS Coefficients output for the regression of purchase price on floor area]
Floor area example:
• (1) H0: no relationship between house price and floor area.
H1: there is a relationship
• (2), (3), (4):
[Table: SPSS Coefficients output for the regression of purchase price on floor area]
• P = 1 - CDF.T(24.469, 554) = 0.000000
Reject H0
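The SPSS expression 1 - CDF.T(24.469, 554) is the area of the t-distribution (554 degrees of freedom) lying above t = 24.469. A standard-library sketch of that tail area is given below; it integrates the t density numerically with Simpson's rule, which is an illustrative approach chosen here, not necessarily how SPSS computes it, and the function names are assumptions.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=20000):
    """Approximate P(T > |t|) as 0.5 minus the area between 0 and |t|.

    The area is computed with Simpson's rule; steps must be even.
    """
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # Clamp at zero: for very large t the true tail is below rounding error.
    return max(0.0, 0.5 - area_zero_to_t)


p_one_tail = t_upper_tail(24.469, 554)   # values from the CDF.T call
print(p_one_tail < 0.000001)
```

A tail area this small is why SPSS displays 0.000000: the probability is not literally zero, just smaller than the number of decimal places shown.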
4. Omitted Variables, Goodness of Fit and R2
Q/ Is floor area the only factor?
Q/ How much of the variation in Price does it explain?
[Figure: scatter plot of Purchase Price against Floor Area (sq meters)]
R-square
• R-square tells you how much of the variation in y is explained by your model
0 < R² < 1
(NB: you want R² to be near 1).
• If you have more than one explanatory variable, use Adjusted R², which takes into account the distortion caused by adding extra variables.
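A small sketch of both measures follows. The formulas used are the usual textbook definitions (R² = 1 - SSres/SStot, and the adjustment for n observations and k explanatory variables), which the slide itself does not spell out; the data are hypothetical, with fitted values taken from the least-squares line y_hat = 0.15 + 1.94x at x = 1, 2, 3, 4.

```python
def r_squared(ys, y_hats):
    """Proportion of the variation in y explained by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                # total
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalise R-square for the number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observations and their OLS fitted values (one explanatory variable).
ys = [2.1, 3.9, 6.2, 7.8]
y_hats = [2.09, 4.03, 5.97, 7.91]

r2 = r_squared(ys, y_hats)
adj_r2 = adjusted_r_squared(r2, n=len(ys), k=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R² is never larger than R², and the gap widens as more explanatory variables are added relative to the sample size.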
[Table: SPSS Model Summary output for the regression of purchase price on floor area]
Now add number of bathrooms as an
extra explanatory variable…
House Price Example cont’d: Two explanatory variables
Q/ How has the estimated value of an extra square metre changed?
Q/ Do a hypothesis test for the existence of a relationship between price and number of bathrooms.
Q/ How much will an extra bathroom typically add to the value of a house?
Q/ What is the value of a 200 m² house with one bathroom? Compare your estimate with that from the previous model.
Q/ What is the value of a 100 m² house with one bathroom? Compare your estimate with that from the previous model.
Q/ What is the value of a 100 m² house with two bathrooms? Compare your estimate with that from the previous model.
Q/ On average, how much is the slope coefficient on floor area likely to vary from sample to sample?
[Tables: SPSS Model Summary and Coefficients output for the regression of purchase price on floor area and number of bathrooms]
Scatter plot (with floor spikes)
[Figure: 3D scatter plot of Purchase Price against Floor Area (sq meters) and Number of Bathrooms]
Non-linear effects can also be modelled when you
have more than one explanatory variable
3D Surface Plots: Construction, Price & Unemployment during a boom
$$Q = -246 + 27P - 0.2P^2 - 73U + 3U^2$$
[Figure: 3D surface plot of Q against P and U_{t-1}]
Construction Equation in a Slump
$$Q = 315 + 4P - 73U + 5U^2$$
[Figure: 3D surface plot of Q against P and U]
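The two construction equations can be evaluated directly to see how the predicted Q differs between boom and slump for the same inputs. This sketch assumes that P2 and U2 on the slides denote squared terms; the function names and the inputs P = 40, U = 5 are illustrative only.

```python
def construction_boom(p, u):
    """Q = -246 + 27P - 0.2P^2 - 73U + 3U^2 (boom equation)."""
    return -246 + 27 * p - 0.2 * p ** 2 - 73 * u + 3 * u ** 2


def construction_slump(p, u):
    """Q = 315 + 4P - 73U + 5U^2 (slump equation)."""
    return 315 + 4 * p - 73 * u + 5 * u ** 2


# Illustrative inputs (assumed, not taken from the data).
q_boom = construction_boom(40, 5)
q_slump = construction_slump(40, 5)
print(q_boom, q_slump)
```

Because of the squared terms, the effect of a one-unit change in P or U is not constant: it depends on the level at which the change occurs.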
5. Categorical Explanatory Variables
• Sometimes certain observations display
consistently higher y values for a particular
subgroup in the sample.
– i.e. for a particular category of observations.
• If you assume the slope will have the same
value, and that only the intercept is shifting,
you can model the effect of categorical
variables by including “dummy” variables
– A dummy variable is simply a binary variable
– e.g. male = 1 or 0
• To model the effect of a categorical explanatory
variable in this way you need to:
1. Decide on a “baseline” category. This is usually an
arbitrary decision, so just choose the largest or most
familiar category.
• E.g. if the category is UK Region, choose London as the
baseline
2. Create dummies (binary variables) for all remaining
categories
• E.g.
Compute yorksh_dum = 0.
if (Region = "Yorkshire") yorksh_dum = 1.
Execute.
3. Include in your regression the dummies for all
categories except your baseline category.
• E.g. suppose you only have two regions in your sample,
London and Yorkshire,
– you would do a regression of house price on floorarea and
yorksh_dum
• By including dummy variables you are saying that the
difference between categories can be modelled as a
parallel shift of the regression line above or below the
baseline category
– The value of the coefficient on the dummy variable tells you how much higher (or lower) the value of the dependent variable would be for observations in that category
• E.g. if the regression output were as follows:
price = -2000 + 500*floorarea - 27500*yorksh_dum
then the results tell us that a house of a given size is £27,500
cheaper in Yorkshire compared with London.
– i.e. the coefficient tells you the size of the intercept shift
associated with that category of observations…
Coefficient on Dummy Variable = size of Intercept Shift:
[Figure: parallel regression lines of House price against Floorarea for London and Yorkshire; the vertical gap between the lines is £27,500 and the slope is £500 for both areas]
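The intercept shift can be illustrated with the example equation price = -2000 + 500*floorarea - 27500*yorksh_dum. The sketch below builds the dummy from a list of region labels and compares two houses of the same size; the helper names, the three-house list and the 100 m² floor area are illustrative assumptions.

```python
def make_dummy(regions, category):
    """Return 1 for observations in the given category and 0 otherwise."""
    return [1 if region == category else 0 for region in regions]


def predict_price(floorarea, yorksh_dum):
    """Example regression equation from the lecture (London is the baseline)."""
    return -2000 + 500 * floorarea - 27500 * yorksh_dum


# Hypothetical sample of three houses.
regions = ["London", "Yorkshire", "London"]
yorksh_dum = make_dummy(regions, "Yorkshire")

london_price = predict_price(100, 0)      # baseline category
yorkshire_price = predict_price(100, 1)   # same size, dummy switched on
print(yorksh_dum, london_price, yorkshire_price)
```

Whatever floor area is chosen, the gap between the two predictions stays at £27,500, which is exactly the parallel shift the dummy coefficient represents.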
Summary
1. Linear & Non-linear Relationships
2. Fitting a line using OLS
3. Inference in Regression
4. Omitted Variables & R2
5. Categorical Explanatory variables
• Revision lecture next week:
– Worked examples on:
• Confidence Intervals?
• Hypothesis Tests?
• Regression?
Reading:
• Regression Analysis:
– *Pryce chapter on relationships.
– *Field, A. chapters on regression.
– *Moore and McCabe Chapters on regression.
– Kennedy, P. ‘A Guide to Econometrics’
– Bryman, Alan, and Cramer, Duncan (1999) “Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists”, Chapters 9 and 10.
– Achen, Christopher H. Interpreting and Using Regression (London: Sage, 1982).