Chapter 5

5. Two-way tables
The Practice of Statistics in the Life Sciences
Third Edition
© 2014 W.H. Freeman and Company
Objectives (PSLS Chapter 5)
Two-way tables

Two-way tables

Marginal distributions

Conditional distributions

Simpson’s paradox
Two-way tables
Two-way tables summarize data about two categorical variables (or
factors) collected on the same set of individuals.
Each factor can have any number of levels. If the row factor has “r”
levels, and the column factor has “c” levels, we say that the two-way
table is an “r by c” table.
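High school students were asked whether they smoke,
and whether their parents smoke:

First factor (rows): Parent smoking status
Second factor (columns): Student smoking status

                        Student smokes   Student does not smoke
Both parents smoke           400                 1380
One parent smokes            416                 1823
Neither parent smokes        188                 1168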
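As a minimal sketch, the table can be stored as a nested dictionary in Python, with one entry per level of the row factor. The names `table`, `r`, and `c` are illustrative assumptions, not part of the original slides.

```python
# Counts from the smoking example: rows are parent smoking status,
# columns are student smoking status.
table = {
    "Both parents smoke":    {"Student smokes": 400, "Student does not smoke": 1380},
    "One parent smokes":     {"Student smokes": 416, "Student does not smoke": 1823},
    "Neither parent smokes": {"Student smokes": 188, "Student does not smoke": 1168},
}

r = len(table)                       # number of levels of the row factor
c = len(next(iter(table.values())))  # number of levels of the column factor
print(f"This is a {r} by {c} table")
```

With three parental levels and two student levels, this is a 3 by 2 table.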
Marginal distributions
We can examine each factor in a two-way table separately by studying
the row totals and the column totals. They represent the marginal
distributions, expressed in counts or percents.
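                        Student smokes   Student does not smoke   Total
Both parents smoke           400                 1380              1780
One parent smokes            416                 1823              2239
Neither parent smokes        188                 1168              1356
Total                       1004                 4371              5375

The row totals (right column) give the marginal distribution for
parental smoking. The column totals (bottom row) give the marginal
distribution for student smoking.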
Computing marginal percents
Marginal percents are marginal counts divided by the table grand total.
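                        Student smokes   Student does not smoke   Total   Percent
Both parents smoke           400                 1380              1780    33.1%
One parent smokes            416                 1823              2239    41.7%
Neither parent smokes        188                 1168              1356    25.2%
Total                       1004                 4371              5375    100%
Percent                     18.7%                81.3%             100%

Percent of students who smoke = 1004/5375 = 18.7%
Percent of students with both parents smoking = 1780/5375 = 33.1%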
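The marginal computation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and variable names are assumptions, and the counts are those of the smoking example.

```python
# Counts from the smoking example: (parent status, student status) -> count.
counts = {
    ("Both", "Smokes"): 400, ("Both", "Does not smoke"): 1380,
    ("One", "Smokes"): 416, ("One", "Does not smoke"): 1823,
    ("Neither", "Smokes"): 188, ("Neither", "Does not smoke"): 1168,
}

grand_total = sum(counts.values())

# Marginal counts: add up the cells over the other factor.
row_totals = {}
col_totals = {}
for (parent, student), n in counts.items():
    row_totals[parent] = row_totals.get(parent, 0) + n
    col_totals[student] = col_totals.get(student, 0) + n

# Marginal percents: marginal count divided by the table grand total.
row_percents = {k: round(100 * v / grand_total, 1) for k, v in row_totals.items()}
col_percents = {k: round(100 * v / grand_total, 1) for k, v in col_totals.items()}

print(row_percents)
print(col_percents)
```

Every percent here uses the same denominator, the grand total of 5375.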
Graphs
The marginal distributions can be displayed on separate bar graphs,
typically expressed as percents instead of raw counts. Each graph
represents only one of the two variables, ignoring the second one.
Each marginal distribution can also be shown in a pie chart.

[Bar graph: parental smoking (Both, One, Neither); vertical axis:
percent of students interviewed]
[Bar graph: student smoking (Smoker, Nonsmoker); vertical axis:
percent of students interviewed]
Conditional distributions
A conditional distribution is the distribution of one factor for each level
of the other factor.
A conditional percent is computed using the counts within a single row
or a single column. The denominator is the corresponding row or
column total (rather than the table grand total).
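                        Student smokes   Student does not smoke   Row total
Both parents smoke           400                 1380               1780
One parent smokes            416                 1823               2239
Neither parent smokes        188                 1168               1356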
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Comparing conditional distributions
Comparing conditional distributions helps us describe the “relationship”
between the two categorical variables.
We can compare the percent of individuals in one level of factor 1 for
each level of factor 2. Substantial differences suggest an association
between factor 1 and factor 2.
Conditional distribution of student smokers for different parental smoking statuses:
Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
Percent of students who smoke when one parent smokes = 416/2239 = 18.6%
Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%
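The three conditional percents above can be reproduced with a short Python sketch. The variable names are illustrative assumptions. The key point is that each denominator is a row total, not the grand total.

```python
# (student smokes, student does not smoke) for each parental smoking status.
rows = {
    "Both parents smoke": (400, 1380),
    "One parent smokes": (416, 1823),
    "Neither parent smokes": (188, 1168),
}

percent_smokers = {}
for parent_status, (smokes, does_not_smoke) in rows.items():
    row_total = smokes + does_not_smoke  # denominator is the row total
    percent_smokers[parent_status] = round(100 * smokes / row_total, 1)

for parent_status, pct in percent_smokers.items():
    print(f"{parent_status}: {pct}% of students smoke")
```

The percent of student smokers drops as fewer parents smoke, which suggests an association between the two factors.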
Graphs
The conditional distributions can be compared graphically by displaying
the percents making up one factor, for each level of the other factor.
Conditional distribution of student smoking status for different levels
of parental smoking status:

                        Percent who smoke   Percent who do not smoke   Row total
Both parents smoke            22%                    78%                 100%
One parent smokes             19%                    81%                 100%
Neither parent smokes         14%                    86%                 100%

[Pie charts: percent who smoke vs. percent who do not smoke, among
students with 2 parents smoking (22% / 78%), 1 parent smoking
(19% / 81%), and 0 parents smoking (14% / 86%)]
Conditional distribution of parental smoking status for different levels
of student smoking status:

                                 Student smokes   Student does not smoke
Percent with 2 parents smoking        40%                  32%
Percent with 1 parent smoking         41%                  42%
Percent with 0 parents smoking        19%                  27%
Column total                          100%                 100%

[Pie charts: percent with 2, 1, and 0 parents smoking, among students
who smoke (40% / 41% / 19%) and among students who do not smoke
(32% / 42% / 27%)]
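Conditioning in the other direction works the same way, except that the denominator is now a column total. The following Python sketch, with illustrative names, recomputes these percents from the counts in the smoking example.

```python
# Counts grouped by student smoking status (the column factor).
columns = {
    "Student smokes": {"2 parents": 400, "1 parent": 416, "0 parents": 188},
    "Student does not smoke": {"2 parents": 1380, "1 parent": 1823, "0 parents": 1168},
}

# Each percent divides a cell count by its own column total.
column_percents = {
    student: {
        level: round(100 * n / sum(cells.values()))
        for level, n in cells.items()
    }
    for student, cells in columns.items()
}

for student, percents in column_percents.items():
    print(student, percents)
```

Because each value is rounded to a whole percent, the rounded values for students who do not smoke add up to 101 rather than exactly 100.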
A 2013 Gallup survey investigated how phrasing may affect the opinions of
American adults regarding physician-assisted suicide. Here are the findings:
                                        Should be   Should not   No        Number
                                        allowed     be allowed   opinion   interviewed
Form A: "End the patient's life by
        some painless means"            70%         27%          3%        719
Form B: "Assist the patient to
        commit suicide"                 51%         45%          4%        816
The value 70% is
A. a marginal value representing the proportion of respondents in favor of
physician-assisted suicide.
B. a conditional value representing the proportion of respondents in favor of
physician-assisted suicide, given that the question was asked in Form A.
                     Allowed   Not allowed   No opinion   Total
"Painless means"       503         194           22        719
"Commit suicide"       416         367           33        816
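As a quick check, the reported percents can be recovered from these counts. In this Python sketch (names are illustrative assumptions), each percent uses the number interviewed for that question form as its denominator.

```python
# Counts of responses for each wording of the question.
gallup_counts = {
    "Form A": {"Allowed": 503, "Not allowed": 194, "No opinion": 22},
    "Form B": {"Allowed": 416, "Not allowed": 367, "No opinion": 33},
}

# Percents within each form: cell count divided by that form's total.
gallup_percents = {
    form: {
        answer: round(100 * n / sum(answers.values()))
        for answer, n in answers.items()
    }
    for form, answers in gallup_counts.items()
}

for form, percents in gallup_percents.items():
    print(form, percents)
```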
Simpson’s paradox
Lurking variables are always a problem for interpretation, but their impact
can be even more drastic when dealing with categorical data.
An association that holds for all of several groups can reverse direction
when the data are combined to form a single group. This reversal is
called Simpson's paradox.
The table below compares the failure rates when removing kidney
stones in a sample of patients, using one of two procedures: open
surgery and PCNL (a minimally invasive technique).

              Open surgery   PCNL
Success            273        289
Failure             77         61
% failure          22%        17%
Can you think of a possible lurking variable here?
The procedures are not chosen randomly by surgeons! In fact, the minimally
invasive procedure is most likely used for smaller stones with a good chance of
success, whereas open surgery is likely used for more problematic conditions.
Small stones    Open surgery   PCNL
Success              81         234
Failure               6          36
% failure            7%         13%

Large stones    Open surgery   PCNL
Success             192          55
Failure              71          25
% failure           27%         31%
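The reversal can be checked numerically. The Python sketch below, with illustrative names, computes the failure percent within each stone size and again after pooling the two sizes.

```python
# (successes, failures) for each stone size and procedure.
data = {
    "Small stones": {"Open surgery": (81, 6), "PCNL": (234, 36)},
    "Large stones": {"Open surgery": (192, 71), "PCNL": (55, 25)},
}

def failure_pct(successes, failures):
    """Percent of procedures that failed."""
    return 100 * failures / (successes + failures)

# Failure percent within each stone size.
by_size = {
    size: {proc: failure_pct(*sf) for proc, sf in procs.items()}
    for size, procs in data.items()
}

# Failure percent after combining both stone sizes.
combined = {}
for proc in ("Open surgery", "PCNL"):
    successes = sum(data[size][proc][0] for size in data)
    failures = sum(data[size][proc][1] for size in data)
    combined[proc] = failure_pct(successes, failures)

for size, rates in by_size.items():
    print(size, {proc: round(pct) for proc, pct in rates.items()})
print("Combined", {proc: round(pct) for proc, pct in combined.items()})
```

Open surgery has the lower failure percent within each stone size, yet the higher failure percent once the sizes are combined, because most open surgeries were performed on large stones.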
In New York State (excluding New York City), 1,359 white men and 121 black men
died from prostate cancer in 1994. Based on how many white and black men lived
there in 1994, the prostate cancer mortality rates were as follows:
All ages

          Death from prostate cancer
          Yes       No           Total        Rate per 100,000
White     1,359     4,736,887    4,738,246    28.7
Black       121       418,871      418,992    28.9

Cancer mortality rates are similar in both groups.
But when the data are broken down by age group, we see that black men
had a much higher rate of prostate cancer death than white men.

Under 65 years of age

          Death from prostate cancer
          Yes       No           Total        Rate per 100,000
White        76     4,177,823    4,177,899    1.8
Black        18       396,899      396,917    4.5

Age 65 and older

          Death from prostate cancer
          Yes       No           Total        Rate per 100,000
White     1,282       559,075      560,357    228.8
Black       102        21,973       22,075    462.1
What is the source of this example of Simpson’s paradox?
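One way to explore this question is sketched below in Python, using only the counts in the tables above; the variable names are illustrative assumptions. It recomputes the rates per 100,000 and then compares what share of each population is age 65 and older, which is one place to look for a lurking variable.

```python
# (deaths, total population) by age group and race, New York State 1994.
groups = {
    "Under 65": {"White": (76, 4_177_899), "Black": (18, 396_917)},
    "65 and older": {"White": (1_282, 560_357), "Black": (102, 22_075)},
}
all_ages = {"White": (1_359, 4_738_246), "Black": (121, 418_992)}

def rate_per_100k(deaths, population):
    """Deaths per 100,000 people."""
    return 100_000 * deaths / population

overall = {race: round(rate_per_100k(*dp), 1) for race, dp in all_ages.items()}
by_age = {
    age: {race: round(rate_per_100k(*dp), 1) for race, dp in races.items()}
    for age, races in groups.items()
}

# Share of each population that is age 65 and older.
share_65_plus = {
    race: groups["65 and older"][race][1] / all_ages[race][1]
    for race in all_ages
}

print("All ages:", overall)
for age, rates in by_age.items():
    print(age + ":", rates)
for race, share in share_65_plus.items():
    print(f"{race}: {100 * share:.1f}% of the population is 65 and older")
```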