Bivariates, Chi-Square

SPS 580 Lecture 2
Bivariate
Chi Square
I. BIVARIATE ANALYSIS: CROSSTABULATION
ASSIGNMENT #1 ASKED YOU: For each dependent variable . . . Comment on the findings
from the univariate percentage table and/or column chart and implications they might have for
the test of your theory . . .
Here’s what I was thinking . . .
IDEA: People from Chicago are more
likely to be in favor of a tax on Lakefront
beach use for non-City residents
THEORY: Place of Residence -> Opinion on Beach Tax
Opinion on Beach Tax
FREQUENCIES VARIABLES=bchtax
[Column chart of the four response categories (Agree Strongly, Agree Somewhat, Disagree Somewhat, Disagree Strongly); bar labels as extracted: 37%, 21%, 21%, 20%]
A majority opposes a beach tax for non-City residents
Overall, 41% of the general population favors a beach tax
IF THE THEORY IS RIGHT WE EXPECT SOMETHING LIKE THIS . . .
(ABSTRACTION ALERT)
Place of residence    "Favor" Beach Use Tax
Chicago               50%
Suburban Cook Co      40%
Collar Counties       30%
Total                 41%    <- “MARGINAL” TOTAL
A BIVARIATE PERCENTAGE TABLE
Frequencies = marginal distribution
II. BIVARIATE CROSSTABULATION: REAL DATA
A. RECODE X and Y variables so categories = the desired end product
RECODE bchtax (1 thru 2=1) (3 thru 4=2) (ELSE=9) INTO bchtax2.
VARIABLE LABELS bchtax2 'dichotomy'.
value labels bchtax2 1 'Favor tax' 2 'Oppose tax' 9 'other'.
missing values bchtax2 (9) .
RECODE region (1=1) (2=2) (3 thru 7=3) (ELSE=9) INTO region3.
VARIABLE LABELS region3 'region recoded'.
value labels region3 1 'Chicago' 2 'Suburban Cook Co' 3 'Collar Counties'.
missing values region3 (9).
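For readers who want to see the recode logic outside SPSS, here is a minimal Python sketch of the same two mappings. The function names and the MISSING constant are illustrative (they are not in the data set), and it assumes, as the value labels above imply, that codes 1-2 of bchtax are the "agree" answers and 3-4 the "disagree" answers.

```python
MISSING = 9  # same missing-value code the SPSS syntax assigns with ELSE=9

def recode_bchtax(code):
    """Collapse the four opinion codes into a dichotomy.

    1-2 -> 1 ('Favor tax'), 3-4 -> 2 ('Oppose tax'), anything else -> 9.
    """
    if code in (1, 2):
        return 1
    if code in (3, 4):
        return 2
    return MISSING

def recode_region(code):
    """1 -> 1 (Chicago), 2 -> 2 (Suburban Cook Co), 3-7 -> 3 (Collar Counties)."""
    if code in (1, 2):
        return code
    if code in range(3, 8):
        return 3
    return MISSING

# Hypothetical raw codes, just to exercise the functions.
print([recode_bchtax(c) for c in (1, 2, 3, 4, 8)])
print([recode_region(c) for c in (1, 2, 3, 7, 99)])
```

As in the SPSS version, any case that ends up coded 9 should be treated as missing and left out of the percentages.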
B. PERFORM A CROSSTABULATION OF X and Y VARIABLES
ANALYZE / DESCRIPTIVE STATISTICS / CROSSTABS / ROWS region3 / COLUMNS bchtax2 / CELLS row percentage / CONTINUE / OK
region3 region recoded * bchtax2 dichotomy Crosstabulation
% within region3 region recoded
                          bchtax2 dichotomy
region3 region recoded    1.00 Favor tax    2.00 Oppose tax    Total
1.00 Chicago              54.0%             46.0%              100.0%
2.00 Suburban Cook Co     33.1%             66.9%              100.0%
3.00 Collar Counties      37.0%             63.0%              100.0%
Total                     41.8%             58.2%              100.0%
Rows = categories of place of residence (X variable)
Columns = categories of opinion (Y variable)
Cells = %s, which add to 100% across each row
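To see where the row percentages come from, here is a small Python sketch (not part of the SPSS workflow) that recomputes them from the observed counts given in the chi square section of this lecture. The function names are illustrative.

```python
# Observed counts (Favor, Oppose) by place of residence, copied from the
# OBSERVED COUNT table in the chi square section of this lecture.
counts = {
    "Chicago": (619, 528),
    "Suburban Cook Co": (354, 716),
    "Collar Counties": (353, 601),
}

def row_percents(table):
    """Conditional percents %Y|X: each row is divided by its own row total."""
    result = {}
    for place, cells in table.items():
        n = sum(cells)
        result[place] = tuple(100.0 * c / n for c in cells)
    return result

def marginal_percents(table):
    """Marginal percents %Y: column totals divided by the grand total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    n = sum(col_totals)
    return tuple(100.0 * c / n for c in col_totals)

conditional = row_percents(counts)
marginal = marginal_percents(counts)

for place, pcts in conditional.items():
    print(f"{place:<18} {pcts[0]:5.1f}% {pcts[1]:5.1f}%")
print(f"{'Total':<18} {marginal[0]:5.1f}% {marginal[1]:5.1f}%")
```

The rows give the conditional distribution and the Total row gives the marginal distribution, which is the same thing FREQUENCIES reports for the recoded variable.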
SPSS HINT: When you COMPUTE new variables they appear at the end of the list of SPSS
variables, not in alphabetical order
A BIVARIATE PERCENTAGE TABLE
                      Categories of Y
Categories of X       "Favor" Beach Use Tax
Place of residence
Chicago               54%    <- Conditional percents %Y|X
Suburban Cook Co      33%    <- Conditional percents %Y|X
Collar Counties       37%    <- Conditional percents %Y|X
Total                 42%    <- Marginal total percent %Y
PQ Bivariate column chart
[Column chart: "Favor" Beach Use Tax by place of residence. Chicago 54%, Suburban Cook Co 33%, Collar Counties 37%. Y-axis 0% to 60%.]
Y axis = %Y given the value of X (conditional distribution, %Y|X)
X axis = independent variable, place of residence (categories of X)
MAKING A PRESENTATION QUALITY BIVARIATE GRAPH . . .
Highlight the Bivariate percentage table
INSERT COLUMN 2D-COLUMN
Delete gridlines
Delete legend
Add data labels
Re-set y-axis metric
Delete x-axis tick marks
Resize fonts



The theory is right . . . people from Chicago are more likely to support the beach tax (54%)
than people from the suburbs (33% and 37%).
Is it also right to say that people from the Collar Counties are more likely (37%) to support
the beach tax than people who live in suburban Cook County (33%)?
Well . . . It doesn’t make much sense, and it is probably not a statistically significant
difference.
III. Statistical Significance I
A. When we talk about statistical significance, we are talking about significance of differences
B. When you say a difference is not statistically significant, that means you think . . .
a. It’s too small
b. Smaller than what?
C. The pattern in the data could have arisen by chance, so don’t get all worked up about it
D. OPERATIONAL DEFINITION . . . “Chance” means the PERCENT in each place (each
category of X) is the same as the TOTAL PERCENT plus or minus some sampling error
Observed Data
Place of residence    "Favor" Beach Use Tax    # of people in the survey
Chicago               54%                      1,147
Suburban Cook Co      33%                      1,070
Collar Counties       37%                      954
Total                 42%                      3,171

Sampling Error Model for testing statistical significance
Place of residence    "Favor" Beach Use Tax
Chicago               42% +/- error
Suburban Cook Co      42% +/- error
Collar Counties       42% +/- error
Total                 42%
Y% does not depend on the value of X
aka . . . Null Hypothesis . . . “no” causal relationship
E. Test of Statistical Significance Means . . . Compare the Observed Data to the SEM (null
hypothesis) to see if the data are “statistically significant”
                      OBSERVED                 EXPECTED IF NULL IS RIGHT
Place of residence    "Favor" Beach Use Tax    "Favor" Beach Use Tax
Chicago               54%                      42%
Suburban Cook Co      33%                      42%
Collar Counties       37%                      42%
Total                 42%                      42%
Null Hypothesis => conditional %s = marginal %
F. What you get from a significance test is the likelihood of seeing differences this large if
the null is true. General rule for policy research: if that likelihood is 5% or less, then treat
the null as NOT TRUE
G. So, what’s the likelihood that the % in favor everywhere is 42% and the differences are
just sampling variation? = Likelihood the apparent differences could have arisen from
sampling variation
IV. The Chi Square Test
OBSERVED COUNT (XTAB observed data)
                    Favor tax    Oppose    Total
Chicago             619          528       1147     619 / 1147 = 54%
Suburban Cook Co    354          716       1070     354 / 1070 = 33%
Collar Counties     353          601       954      353 / 954 = 37%
Total               1326         1845      3171     1326 / 3171 = 42%

Expected Count if NULL is right
Calculate expected data if null is right
                    Favor tax    Oppose    Total
Chicago             480          667       1147     42% * 1147 = 480
Suburban Cook Co    447          623       1070     42% * 1070 = 447
Collar Counties     399          555       954      42% * 954 = 399
Total               1326         1845      3171     42% * 3171 = 1326
Note: 42% is rounded. The expected counts shown match the unrounded total percent, 1326 / 3171 (about 41.8%).
Calculate Observed minus Expected
O-E                 Favor tax    Oppose
Chicago             139          -139
Suburban Cook Co    -93          93
Collar Counties     -46          46
The positive differences show where there are “too many” cases if null is right
Calculate [ (O-E)^2 / E ]
(O-E)^2 / E         Favor tax    Oppose
Chicago             40           29
Suburban Cook Co    20           14
Collar Counties     5            4
Example (Chicago, Favor tax): (O-E)^2 = 19,222.65 (computed from the unrounded difference, about 138.65; 139 is the rounded value)
19,222.65 / 480 = 40
Add it all up SUM [ (O-E)^2 / E ] = 112
Determine degrees of freedom (df) = (# rows – 1) * (# columns – 1) = 2*1 = 2
Look up the “critical value” of chi square, given the df. . .
d.f.    p < .05 if chi sq >
1       3.841
2       5.991
3       7.815
4       9.488
5       11.070
6       12.592
7       14.067
8       15.507
9       16.919
10      18.307
. . . if chi square is greater than the critical value, then the likelihood of getting data like
these when the null hypothesis is true is less than 5%, so we treat the null as NOT TRUE
So in this case: chi square = 112, df = 2, critical value = 5.991, so the null is rejected (NOT TRUE)
It is very unlikely that the data arose from sampling variation alone.
THE DIFFERENCES IN ATTITUDE TOWARD THE BEACH TAX BY LOCATION
ARE STATISTICALLY SIGNIFICANT
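The hand calculation above can be sketched in a few lines of Python. This is only an illustration of the steps (expected counts, O-E, (O-E)^2 / E, sum, df), not the SPSS procedure; the function name is made up for this example. It works from the unrounded expected counts, so the result should land close to the lecture's 112 rather than match every rounded cell exactly.

```python
# Observed counts from the lecture: rows = place, columns = (Favor, Oppose).
observed = [
    [619, 528],   # Chicago
    [354, 716],   # Suburban Cook Co
    [353, 601],   # Collar Counties
]

def chi_square(table):
    """Return (chi_square, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count if the null is right: row total * column total / N,
    # i.e. the marginal percent applied to each row.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

chi2, df, expected = chi_square(observed)

# Critical values for p < .05, copied from the lecture's table.
CRITICAL_05 = {1: 3.841, 2: 5.991}
print(f"chi square = {chi2:.1f}, df = {df}, "
      f"significant = {chi2 > CRITICAL_05[df]}")
```

The same function works for any table size, which is why the later examples in this lecture only change the counts and the degrees of freedom.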
What about the difference between Suburban Cook and the Collar counties?
OBSERVED COUNT
                    Favor tax    Oppose    Total
Suburban Cook Co    354          716       1070
Collar Counties     353          601       954
Total               707          1317      2024     707 / 2024 = 35%
Expected Count if NULL is right
                    Favor tax    Oppose    Total
Suburban Cook Co    374          696       1070
Collar Counties     333          621       954
Total               707          1317      2024
O-E                 Favor tax    Oppose
Suburban Cook Co    -20          20
Collar Counties     20           -20

(O-E)^2 / E         Favor tax    Oppose
Suburban Cook Co    1.04         0.56
Collar Counties     1.17         0.63
SUM [ (O - E ) ^2 / E ] = 3.41
So in this case: chi square = 3.41, df = 1, critical value = 3.841, so the null is NOT rejected
The data could have arisen because of sampling variation.
THE DIFFERENCE BETWEEN SUBURBAN COOK AND THE COLLAR COUNTIES ON
ATTITUDE TOWARD THE BEACH TAX IS NOT STATISTICALLY SIGNIFICANT
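For a 2 x 2 table (df = 1) the likelihood itself is easy to compute, because a chi square with 1 df is a squared standard normal. The Python sketch below is an add-on to the lecture's table lookup, not something SPSS requires you to do by hand; the function names are illustrative.

```python
import math

# Observed counts from the lecture: rows = place, columns = (Favor, Oppose).
observed = [
    [354, 716],   # Suburban Cook Co
    [353, 601],   # Collar Counties
]

def chi_square_2x2(table):
    """Sum of (O-E)^2 / E over the four cells of a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected if null is right
            chi2 += (o - e) ** 2 / e
    return chi2

def p_value_df1(chi2):
    """Likelihood of a chi square at least this large when df = 1.

    With 1 df, chi square = Z^2, so P(chi sq > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

chi2 = chi_square_2x2(observed)
p = p_value_df1(chi2)
print(f"chi square = {chi2:.2f}, likelihood = {p:.3f}")
```

The likelihood comes out a little above 5%, which is why this comparison just misses the cutoff even though chi square (3.41) is not far below the critical value (3.841).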
V. CHI SQUARE – A BLANKET TEST
IDEA: Poor people are more likely to use public transit
THEORY: Income causes mode of transportation
tranmt Resp's Means Most Used To Get To Work
 1 Car/truck             76.9%
 2 Van                    1.1%
 3 Bus                    6.7%
 4 Subway Or Elevated     4.1%
 5 Railroad               4.1%
 6 Taxi                   0.2%
 7 Motorcycle             0.2%
 8 Bicycle                0.4%
 9 Walked                 3.2%
10 Worked At Home         1.8%
11 Airplane               0.4%
12 Other                  1.0%
Total                     100%
<- this is what’s in the data set (NOT PRESENTATION QUALITY)
Let’s say you coded it into these four groups . . .
Mode
Private          78.8%
Public           14.9%
Biked/Walked      3.6%
Other             2.7%
Total           100.0%
RECODE tranmt (1=1) (2=1) (6=1) (7=1) (11=1) (10=4) (12=4) (3 thru 5=2) (8 thru 9=3) (ELSE=SYSMIS).
value labels tranmt 1 'private' 2 'public' 3 'bike or walked' 4 'other'.
There’s a nice income variable already there, just needs DK assigned to missing
INCOME . . . missing values inc4gp (7,8,9).
BIVARIATE PERCENTAGE TABLE PQ
Mode of Transportation to Work
               Private    Public    Bike / walk    Other
<$20k          60%        28%       9%             4%
$20k - $40k    77%        16%       4%             2%
$40k -$70k     85%        11%       2%             2%
$70k+          80%        14%       2%             3%
TOTAL          79%        15%       4%             3%
BIVARIATE CHARTS (Ugly, but PQ) …
[Two column charts of Mode of Transportation to Work, y-axis 0% to 90%: one groups the bars by mode (Private, Public, Bike / walk, Other) with one bar per income group; the other groups the bars by income group (<$20k, $20k - $40k, $40k -$70k, $70k+) with one bar per mode.]
Chi Square = 188 df = 9
But this chi square tests the null hypothesis that ALL OF THE INCOME GROUPS MAKE
ALL THE SAME TRANSPORTATION CHOICES i.e., . . .
EXPECTED UNDER NULL HYPOTHESIS, 9 DF
Mode of Transportation to Work
               Private    Public    Bike / walk    Other
<$20k          79%        15%       4%             3%
$20k - $40k    79%        15%       4%             3%
$40k -$70k     79%        15%       4%             3%
$70k+          79%        15%       4%             3%
TOTAL          79%        15%       4%             3%
Chi square is a BLANKET TEST of all possible differences
Likelihood < 5 %
Null is FALSE
The income groups did not all make the same transportation choices
VI. FOCUSING CHI SQUARE TESTS ON SPECIFIC HYPOTHESES
It might be that the differences in PUBLIC transportation use are NOT SIGNIFICANT, even
though the overall pattern of differences is significant.
So you need to construct a chi square test that makes use of all the data available, but focuses
on the set of differences you are most interested in . . .
Mode of Transportation to Work
               Public    Non-Public
<$20k          28%       72%
$20k - $40k    16%       84%
$40k -$70k     11%       89%
$70k+          14%       86%
TOTAL          15%       85%
[Column chart: Public Transportation Use by income group: <$20k 28%, $20k - $40k 16%, $40k -$70k 11%, $70k+ 14%]
EXPECTED UNDER NULL
Mode of Transportation
               Public    Non-Public
<$20k          15%       85%
$20k - $40k    15%       85%
$40k -$70k     15%       85%
$70k+          15%       85%
TOTAL          15%       85%
Chi square = 99.6 df = 3
Likelihood < 5 %
Null is FALSE
Income groups are not equally likely to use public transportation
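Collapsing the table changes the degrees of freedom, and with them the critical value. The Python sketch below only re-checks the chi square values reported in this lecture against the lecture's own critical value table; the helper names are made up for this example.

```python
# Critical values for p < .05, copied from the lecture's table (df 1 to 10).
CRITICAL_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
    6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307,
}

def degrees_of_freedom(n_rows, n_cols):
    """df = (# rows - 1) * (# columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def is_significant(chi2, n_rows, n_cols):
    """True if chi square exceeds the p < .05 critical value for this table size."""
    return chi2 > CRITICAL_05[degrees_of_freedom(n_rows, n_cols)]

# (label, chi square reported in the lecture, rows, columns)
results = [
    ("Blanket test: 4 income groups x 4 modes", 188.0, 4, 4),
    ("Focused test: 4 income groups x Public / Non-Public", 99.6, 4, 2),
    ("Suburban Cook vs Collar Counties x Favor / Oppose", 3.41, 2, 2),
]
for label, chi2, rows, cols in results:
    df = degrees_of_freedom(rows, cols)
    print(f"{label}: df = {df}, critical = {CRITICAL_05[df]}, "
          f"significant = {is_significant(chi2, rows, cols)}")
```

The focused test has fewer degrees of freedom (3 instead of 9), so its critical value is lower, but it only answers the narrower question about public transportation.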
VII. PHI, SENSITIVITY OF CHI SQUARE TEST TO N
THEORY: Smoking is correlated with health status
COLLAPSE RARE CATEGORIES TO AVOID SMALL CELL COUNTS IN TABLES . . .
helthr Respondent's Health
1 Excellent     40%
2 Good          43%
3 Fair          14%
4 Poor           3%
Total          100%

Respondent's Health
Excellent       40%
Good            43%
Fair + Poor     17%
Total          100%

RECODE helthr (1=1) (2=2) (3=3) (4=3) (7 thru 9=9) INTO helthr3.
VARIABLE LABELS helthr3 '3 categories'.
value labels helthr3 1 'Excellent' 2 'Good' 3 'Fair or Poor'.
missing values helthr3 (7,8,9).
CHI SQUARE rule: can’t have cells with an expected count < 5.0
Rule of thumb: recode so there is always 10% + per final category
Rule of thumb: rare categories OK for DEPENDENT VARIABLES only
Health Status
             Smokers    Non-Smokers
Excellent    32%        42%
Good         48%        41%
Fair/Poor    20%        16%
Chi square = 152 df = 2
Null is not true: 152 > 5.99
Smoking is correlated with health status
But the difference is not very big: 42% - 32% = 10 percentage points
Chi square is large because N is large (N = 18,381)
If there were 500 cases with all other distributions the same, chi sq = 4.13 < 5.99, so the null is not rejected
CHI SQUARE IS SENSITIVE TO SAMPLE SIZE
values of phi
weak        .00 - .10
moderate    .10 - .30
strong      .30 +
PHI = SQUARE ROOT [ CHI SQ / N ]
Case 1 . . . n = 18,381 . . . phi = .09 (weak)
Case 2 . . . n = 500 . . . phi = .09 (weak)
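The two cases can be checked with a short Python sketch. It relies on one fact: if every cell count is multiplied by the same factor (same percentages, different N), chi square is multiplied by that factor too. The function names are illustrative.

```python
import math

def phi(chi2, n):
    """PHI = SQUARE ROOT [ CHI SQ / N ]."""
    return math.sqrt(chi2 / n)

def rescale_chi_square(chi2, n_old, n_new):
    """Chi square for the same percentage table with a different sample size.

    Every O and E scales by n_new / n_old, so each (O-E)^2 / E term,
    and therefore the sum, scales by the same factor.
    """
    return chi2 * n_new / n_old

# Case 1: values reported in the lecture.
chi2_big, n_big = 152.0, 18381
# Case 2: same distributions, only 500 cases.
n_small = 500
chi2_small = rescale_chi_square(chi2_big, n_big, n_small)

print(f"Case 1: chi sq = {chi2_big:.2f}, phi = {phi(chi2_big, n_big):.2f}")
print(f"Case 2: chi sq = {chi2_small:.2f}, phi = {phi(chi2_small, n_small):.2f}")
```

Chi square drops from 152 to about 4.13, crossing the 5.99 cutoff, while phi stays the same. That is the sense in which phi measures strength of the relationship and chi square does not.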
VIII. PUTTING IT ALL TOGETHER
THEORY: Age causes likelihood of unemployment
missing values umemp2 (7,8,9).  NOTE: the variable name is UMEMP2 (possibly a typo in data entry)
Likelihood of Unemployment in the Next 5 Years
            Very likely    Fairly likely    Not too likely    Not at all likely    Total
Under 30    7%             8%               32%               53%                  100%
31-45       7%             8%               29%               56%                  100%
46-64       11%            9%               24%               56%                  100%
65+         33%            16%              14%               37%                  100%
Total       9%             9%               28%               55%                  100%
big table
somewhat large N = 2,032
2 expected values = 6.0
Chi sq = 74 df = 9
Phi = .191
Null is not true: the likelihood of being unemployed is not the same in all age groups
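As a check on the phi value and its label, here is a minimal Python sketch using the chi square and N reported above and the weak / moderate / strong bands from the phi section. The bands in the lecture share their endpoints, so this sketch assumes that exactly .10 counts as moderate and exactly .30 as strong; that choice is an assumption, not something the lecture states.

```python
import math

def phi_strength(phi):
    """Label phi with the lecture's bands: weak, moderate, strong.

    Assumption: the shared endpoints go to the higher band
    (.10 -> moderate, .30 -> strong).
    """
    if phi < 0.10:
        return "weak"
    if phi < 0.30:
        return "moderate"
    return "strong"

chi2, n = 74.0, 2032            # values reported for the age x unemployment table
phi = math.sqrt(chi2 / n)       # PHI = SQUARE ROOT [ CHI SQ / N ]
print(f"phi = {phi:.3f} ({phi_strength(phi)})")
```

So the age differences are statistically significant and, by the lecture's bands, of moderate strength.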
Chart/graph . . . have to choose what to focus on because the whole table is unwieldy
[Column chart: "Very Likely" to be Unemployed in the Next 5 Years, by age group: Under 30 7%, 31-45 7%, 46-64 11%, 65+ 33%]
[Column chart: "Very Likely + Fairly Likely" to be Unemployed in the Next 5 Years, by age group: Under 30 15%, 31-45 15%, 46-64 20%, 65+ 49%]