Chapter 12
The Analysis of Categorical Data and Goodness of Fit Tests
Suppose we wanted to determine if the proportions for the different colors in a large bag of M&M candies match the proportions that the company claims are in their candies.
We could record the color of each candy in the bag. This would be univariate, categorical data.
How many categories for color would there be?
There are six colors, so k = 6.
k is used to denote the number of categories for a categorical variable.
M&M Candies Continued . . .
We could count how many candies of each color are in the bag.

Color    Red   Blue   Green   Yellow   Orange   Brown
Count     23     28      21       19       22      25

A one-way frequency table is used to display the observed counts for the k categories.
A goodness-of-fit test will allow us to determine if these observed counts are consistent with what we expect to have.
Goodness-of-Fit Test Procedure
The goodness-of-fit test is used to analyze univariate categorical data from a single sample.
Null Hypothesis:
H0: p1 = hypothesized proportion for Category 1
...
pk = hypothesized proportion for Category k
Ha: H0 is not true
The goodness-of-fit statistic, denoted by X2 (read "chi-squared"), is a quantitative measure of the extent to which the observed counts differ from those expected when H0 is true. The X2 value can never be negative.
Test Statistic:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
Goodness-of-Fit Test Procedure
Continued . . .
P-values: When H0 is true and all expected counts are
at least 5, X2 has approximately a chi-square
distribution with df = k – 1. Therefore, the P-value
associated with the computed test statistic value is
the area to the right of X2 under the df = k – 1 chi-square curve.
Assumptions:
1) Observed cell counts are based on a random sample
2) The sample size is large enough as long as every
expected cell count is at least 5
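The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `chi2_sf` and `goodness_of_fit` are made up for this sketch, and the equal proportions of 1/6 are a hypothetical stand-in, because the company's claimed M&M proportions are not given here. Only the observed counts come from the one-way frequency table.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve (integer df)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum of y^i / i! for i = 0 .. df/2 - 1
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum of y^(i - 1/2) / Gamma(i + 1/2)
    tail = sum(y ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def goodness_of_fit(observed, proportions):
    """Return (X2 statistic, df, P-value) for a one-way table."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    if min(expected) < 5:
        raise ValueError("every expected cell count should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1          # df = k - 1
    return statistic, df, chi2_sf(statistic, df)

# Observed M&M counts from the one-way frequency table.
observed = [23, 28, 21, 19, 22, 25]
# HYPOTHETICAL proportions (equal shares), NOT the company's claimed values.
proportions = [1 / 6] * 6

statistic, df, p_value = goodness_of_fit(observed, proportions)
print(f"X2 = {statistic:.3f}, df = {df}, P-value = {p_value:.3f}")
```

The P-value helper uses the closed-form expression for integer degrees of freedom, so the sketch needs only the standard library.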
Facts About χ2 distributions
• Different df have different curves
• χ2 curves are skewed right
• As df increases, the χ2 curve shifts toward the right and becomes more like a normal curve
[Figure: χ2 curves for df = 3, df = 5, and df = 10]
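The shift to the right can be checked numerically. The sketch below evaluates the standard chi-square density on a grid and locates its peak for the three df values shown in the figure. It also prints the skewness, using the standard result that a chi-square distribution has skewness sqrt(8/df); that formula and the helper name `chi2_pdf` are additions for illustration, not part of the slides.

```python
import math

def chi2_pdf(x, df):
    """Chi-square density at x > 0 for the given degrees of freedom."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

# Grid of x values from 0.01 to 30.00 in steps of 0.01.
grid = [i / 100 for i in range(1, 3001)]

peaks = {}
skewness = {}
for df in (3, 5, 10):
    # Location of the highest point of the curve on the grid.
    peaks[df] = max(grid, key=lambda x, d=df: chi2_pdf(x, d))
    # Standard skewness of a chi-square distribution: sqrt(8 / df).
    skewness[df] = math.sqrt(8 / df)
    print(f"df = {df:2d}: peak near x = {peaks[df]:.2f}, skewness = {skewness[df]:.3f}")
```

The peak moves right as df grows while the skewness shrinks toward 0, which matches the statement that the curve becomes more like a normal curve.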
A common urban legend is that more babies than
expected are born during certain phases of the
lunar cycle, especially near the full moon.
The table below shows the number of days in the eight lunar phases with the number of births in each phase for 24 lunar cycles.
There are eight phases, so k = 8.
Lunar Phase        Number of Days    Number of Births
New Moon                 24                7680
Waxing Crescent         152              48,442
First Quarter            24                7579
Waxing Gibbous          149              47,814
Full Moon                24                7711
Waning Gibbous          150              47,595
Last Quarter             24                7733
Waning Crescent         152              48,230
Lunar Phases Continued . . .
Let:
p1 = proportion of births that occur during the new moon
p2 = proportion of births that occur during the waxing crescent moon
p3 = proportion of births that occur during the first quarter moon
p4 = proportion of births that occur during the waxing gibbous moon
p5 = proportion of births that occur during the full moon
p6 = proportion of births that occur during the waning gibbous moon
p7 = proportion of births that occur during the last quarter moon
p8 = proportion of births that occur during the waning crescent moon
There is a total of 699 days in the 24 lunar cycles. If there is no relationship between the number of births and lunar phase, then the expected proportions equal the number of days in each phase out of the total number of days.
The hypothesis statements would be:
H0: p1 = .0343, p2 = .2175, p3 = .0343, p4 = .2132, p5 = .0343,
p6 = .2146, p7 = .0343, p8 = .2175
Ha: H0 is not true
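The hypothesized proportions follow directly from the day counts: each is the number of days in the phase divided by the 699 total days. A short sketch of that arithmetic, with variable names chosen only for illustration:

```python
# Number of days in each lunar phase over the 24 lunar cycles (from the table).
days = {
    "New Moon": 24,
    "Waxing Crescent": 152,
    "First Quarter": 24,
    "Waxing Gibbous": 149,
    "Full Moon": 24,
    "Waning Gibbous": 150,
    "Last Quarter": 24,
    "Waning Crescent": 152,
}

total_days = sum(days.values())

# Hypothesized proportion = days in phase / total days, rounded to 4 places
# to match the values used in H0.
proportions = {phase: round(d / total_days, 4) for phase, d in days.items()}

for phase, p in proportions.items():
    print(f"{phase:16s} {p:.4f}")
```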
Lunar Phases Continued . . .
H0: p1 = .0343, p2 = .2175, p3 = .0343, p4 = .2132, p5 = .0343,
p6 = .2146, p7 = .0343, p8 = .2175
Ha: H0 is not true
Lunar Phase        Observed Number of Births    Expected Number of Births
New Moon                     7680                       7641.49
Waxing Crescent            48,442                     48,455.52
First Quarter                7579                       7641.49
Waxing Gibbous             47,814                     47,497.55
Full Moon                    7711                       7641.49
Waning Gibbous             47,595                     47,809.45
Last Quarter                 7733                       7641.49
Waning Crescent            48,230                     48,455.52

There is a total of 222,784 births in the sample. If there is no relationship between the number of births and lunar phase, then the expected counts for each category would equal n(hypothesized proportion).
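The expected counts in the table can be reproduced by multiplying the sample size by each hypothesized proportion. The sketch below uses the rounded proportions from H0, which appears to be how the tabled values were obtained; the variable names are illustrative.

```python
# Observed births per lunar phase, in table order.
births = [7680, 48442, 7579, 47814, 7711, 47595, 7733, 48230]
# Hypothesized proportions from H0, in the same order.
hypothesized = [0.0343, 0.2175, 0.0343, 0.2132, 0.0343, 0.2146, 0.0343, 0.2175]

n = sum(births)  # total number of births in the sample

# Expected count for each category = n * (hypothesized proportion).
expected = [round(n * p, 2) for p in hypothesized]

for e in expected:
    print(f"{e:,.2f}")
```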
Lunar Phases Continued . . .
H0: p1 = .0343, p2 = .2175, p3 = .0343, p4 = .2132, p5 = .0343,
p6 = .2146, p7 = .0343, p8 = .2175
Ha: H0 is not true
Test Statistic:
$$X^2 = \frac{(7680 - 7641.49)^2}{7641.49} + \frac{(48{,}442 - 48{,}455.52)^2}{48{,}455.52} + \cdots + \frac{(48{,}230 - 48{,}455.52)^2}{48{,}455.52} = 6.557$$
df = 7
The test statistic is smaller than the smallest entry in the df = 7 column of Appendix Table 8.
P-value > .10
α = .05
Since the P-value > α, we fail to reject H0. There is not sufficient evidence to conclude that lunar phases and number of births are related.
What type of error could we have potentially made with this decision? Type II
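The lunar-phase calculation can be checked with a short script. Appendix Table 8 is not reproduced here, so the sketch assumes its smallest df = 7 entry is the standard upper-tail 0.10 critical value, 12.017; that constant and the variable names are assumptions made for this illustration.

```python
# Observed births per lunar phase and the hypothesized proportions from H0.
observed = [7680, 48442, 7579, 47814, 7711, 47595, 7733, 48230]
hypothesized = [0.0343, 0.2175, 0.0343, 0.2132, 0.0343, 0.2146, 0.0343, 0.2175]

n = sum(observed)
expected = [n * p for p in hypothesized]

# X2 = sum over all cells of (observed - expected)^2 / expected
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # k - 1 = 7

# ASSUMPTION: 12.017 is the standard chi-square value with right-tail
# area 0.10 for df = 7, used here in place of Appendix Table 8.
CRITICAL_10_DF7 = 12.017
p_value_exceeds_10 = x2 < CRITICAL_10_DF7

print(f"X2 = {x2:.3f}, df = {df}, P-value > .10: {p_value_exceeds_10}")
```

Because the statistic falls below the critical value for a right-tail area of .10, the P-value exceeds .10 and therefore exceeds α = .05.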
A study was conducted to determine if collegiate soccer players had an increased risk of concussions over other athletes or students. The two-way frequency table below displays the number of previous concussions for students in independently selected random samples of 91 soccer players, 96 non-soccer athletes, and 53 non-athletes. A two-way frequency table is also called a contingency table.
This is univariate categorical data (number of concussions) from 3 independent samples.

                            Number of Concussions
                       0      1      2      3 or more    Total
Soccer Players        45     25     11         10          91
Non-Soccer Players    68     15      8          5          96
Non-Athletes          45      5      3          0          53
Total                158     45     22         15         240

The values in the body of the table are the observed counts. The row and column totals are the marginal totals. The value 240 is the grand total.
If there were no difference between these 3 populations in regards to the number of concussions, how many soccer players would you expect to have no concussions?
We would expect (158/240)(91).
X2 Test for Homogeneity
The χ2 Test for Homogeneity is used to analyze univariate categorical data from 2 or more independent samples.
Null Hypothesis:
H0: the true category proportions are the same for all the populations or treatments
Alternative Hypothesis:
Ha: the true category proportions are not all the
same for all the populations or treatments
Test Statistic:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
X2 Test for Homogeneity
Continued . . .
Expected Counts: (assuming H0 is true)
$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$
P-value: When H0 is true and all expected counts are at
least 5, X2 has approximately a chi-square distribution
with df = (number of rows – 1)(number of columns – 1).
The P-value associated with the computed test statistic
value is the area to the right of X2 under the appropriate
chi-square curve.
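As a rough sketch of the expected-count formula, the function below applies (row marginal total)(column marginal total) / grand total to every cell of a two-way table. It is run on the concussion counts from the soccer study; the function name `expected_counts` is chosen for this illustration.

```python
def expected_counts(table):
    """Expected cell counts under H0 for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed counts: rows are soccer players, non-soccer players, non-athletes;
# columns are 0, 1, 2, and 3 or more concussions (totals not included).
observed = [
    [45, 25, 11, 10],
    [68, 15, 8, 5],
    [45, 5, 3, 0],
]

expected = expected_counts(observed)
for row in expected:
    print("  ".join(f"{e:5.1f}" for e in row))

# Number of rows and columns, not including totals.
df = (len(observed) - 1) * (len(observed[0]) - 1)
smallest_expected = min(min(row) for row in expected)
print(f"df = {df}, smallest expected count = {smallest_expected:.1f}")
```

The first entry equals (158/240)(91), the expected number of soccer players with no concussions. The smallest expected count is below 5, so the large-sample condition is not met for the table in this form.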
X2 Test for Homogeneity
Continued . . .
Assumptions:
1) Data are from independently chosen random
samples or from subjects who were assigned
at random to treatment groups.
2) The sample size is large: all expected cell
counts are at least 5. If some expected
counts are less than 5, rows or columns of the
table may be combined to achieve a table with
satisfactory expected counts.
Soccer Players Continued . . .
State the hypotheses.
                            Number of Concussions
                       0      1      2      3 or more    Total
Soccer Players        45     25     11         10          91
Non-Soccer Players    68     15      8          5          96
Non-Athletes          45      5      3          0          53
Total                158     45     22         15         240
H0: Proportions in each response category (number of concussions) are the same for all three groups
Ha: Category proportions are not all the same for all three groups
df = (number of rows – 1)(number of columns – 1)
To find df, count the number of rows and columns, not including the totals!
df = (2)(3) = 6
Another way to find df: cover one row and one column, then count the number of cells left (not including totals).
Soccer Players Continued . . .
Expected counts are shown in the parentheses next to the observed counts.

                                 Number of Concussions
                       0            1            2           3 or more    Total
Soccer Players        45 (59.9)    25 (17.1)    11 (8.3)     10 (5.7)       91
Non-Soccer Players    68 (63.2)    15 (18.0)     8 (8.8)      5 (6.0)       96
Non-Athletes          45 (34.9)     5 (10.0)     3 (4.9)      0 (3.3)       53
Total                158           45           22           15            240

Notice that NOT all the expected counts are at least 5. So combine the column for 2 concussions and the column for 3 or more concussions.

                            Number of Concussions
                       0            1            2 or more    Total
Soccer Players        45 (59.9)    25 (17.1)    21 (14.0)      91
Non-Soccer Players    68 (63.2)    15 (18.0)    13 (14.8)      96
Non-Athletes          45 (34.9)     5 (10.0)     3 (8.2)       53
Total                158           45           37            240

This combined table has df = (2)(2) = 4.
Test Statistic:
$$X^2 = \frac{(45 - 59.9)^2}{59.9} + \cdots + \frac{(3 - 8.2)^2}{8.2} = 20.6$$
df = 4
P-value < .001
α = .05
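The column-combining step and the resulting test statistic can be sketched as follows. The P-value line uses the closed form exp(-x/2)(1 + x/2), which is the right-tail area of a chi-square distribution with df = 4 only; it is included as a convenience for this example rather than as a general method.

```python
import math

# Observed counts with columns 0, 1, 2, and 3 or more concussions.
original = [
    [45, 25, 11, 10],   # soccer players
    [68, 15, 8, 5],     # non-soccer players
    [45, 5, 3, 0],      # non-athletes
]

# Combine the last two columns into a single "2 or more" column.
combined = [row[:2] + [row[2] + row[3]] for row in original]

row_totals = [sum(row) for row in combined]
col_totals = [sum(col) for col in zip(*combined)]
grand_total = sum(row_totals)

# Expected count = (row total)(column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(combined, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(combined) - 1) * (len(combined[0]) - 1)

# Right-tail area for df = 4 ONLY: exp(-x/2) * (1 + x/2).
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

print(f"X2 = {statistic:.1f}, df = {df}, P-value = {p_value:.5f}")
```

After combining, every expected count is at least 5, so the large-sample condition holds for the 3-by-3 table.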
Soccer Players Continued . . .
                            Number of Concussions
                       0            1            2 or more    Total
Soccer Players        45 (59.9)    25 (17.1)    21 (14.0)      91
Non-Soccer Players    68 (63.2)    15 (18.0)    13 (14.8)      96
Non-Athletes          45 (34.9)     5 (10.0)     3 (8.2)       53
Total                158           45           37            240

Since the P-value < α, we reject H0. There is strong evidence to suggest that the category proportions for the number of concussions are not the same for the 3 groups.
Is that all I can say: that there is a difference in proportions for the groups?
We can look at the chi-square contributions: which of the cells above have the greatest contributions to the value of the X2 statistic?
The highlighted cells had the largest contributions to the X2 test statistic.
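One way to examine the chi-square contributions is to compute (observed - expected)^2 / expected separately for each cell of the combined table and sort the results. The sketch below does this with unrounded expected counts, so individual values may differ slightly from hand calculations based on the rounded counts in parentheses; the dictionary layout is an illustrative choice.

```python
rows = ["Soccer Players", "Non-Soccer Players", "Non-Athletes"]
cols = ["0", "1", "2 or more"]
observed = [
    [45, 25, 21],
    [68, 15, 13],
    [45, 5, 3],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = {}
for i, row_name in enumerate(rows):
    for j, col_name in enumerate(cols):
        e = row_totals[i] * col_totals[j] / grand_total
        contributions[(row_name, col_name)] = (observed[i][j] - e) ** 2 / e

# List cells from largest to smallest contribution.
for cell, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cell[0]:20s} {cell[1]:10s} {value:6.3f}")

total = sum(contributions.values())
print(f"Sum of contributions (X2) = {total:.1f}")
```

In this output the soccer-player and non-athlete rows account for nearly all of the statistic, while the non-soccer-player cells contribute very little.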
X2 Test for Independence
The χ2 Test for Independence is used to analyze bivariate categorical data from a single sample.
Null Hypothesis:
H0: The two variables are independent
Alternative Hypothesis:
Ha: The two variables are not independent
Test Statistic:
$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$
X2 Test for Independence
Continued . . .
Expected Counts: (assuming H0 is true)
$$\text{expected cell count} = \frac{(\text{row marginal total})(\text{column marginal total})}{\text{grand total}}$$
P-value: When H0 is true and assumptions for X2 test
are satisfied, X2 has approximately a chi-square
distribution with df = (number of rows – 1)(number of
columns – 1). The P-value associated with the computed
test statistic value is the area to the right of X2 under
the appropriate chi-square curve.
X2 Test for Independence
Continued . . .
Assumptions:
1) The observed counts are based on data from
a random sample.
2) The sample size is large: all expected cell
counts are at least 5. If some expected
counts are less than 5, rows or columns of the
table may be combined to achieve a table with
satisfactory expected counts.
The paper “Contemporary College Students and
Body Piercing” (Journal of Adolescent Health,
2004) described a survey of 490 undergraduate
students at a state university in the southwestern
region of the United States. Each student in the sample
was classified according to class standing (freshman,
sophomore, junior, senior) and body art category (body
piercing only, tattoos only, both tattoos and body
piercing, no body art). Is there evidence that there is an
association between class standing and response to the
body art question? Use α = .01.
State the hypotheses.

             Body Piercing    Tattoos    Both Body Piercing    No Body
             Only             Only       and Tattoos           Art
Freshman          61              7              14               86
Sophomore         43             11              10               64
Junior            20              9               7               43
Senior            21             17              23               54
Body Art Continued . . .
H0: class standing and body art category are independent
Ha: class standing and body art category are not independent
Assuming H0 is true, what are the expected counts?

             Body Piercing    Tattoos      Both Body Piercing    No Body
             Only             Only         and Tattoos           Art
Freshman     61 (49.7)         7 (15.1)    14 (18.5)             86 (84.7)
Sophomore    43 (37.9)        11 (11.5)    10 (14.1)             64 (64.5)
Junior       20 (23.4)         9 (7.1)      7 (8.7)              43 (39.8)
Senior       21 (34.0)        17 (10.3)    23 (12.7)             54 (58.0)

How many degrees of freedom does this two-way table have?
df = 9
Body Art Continued . . .
             Body Piercing    Tattoos      Both Body Piercing    No Body
             Only             Only         and Tattoos           Art
Freshman     61 (49.7)         7 (15.1)    14 (18.5)             86 (84.7)
Sophomore    43 (37.9)        11 (11.5)    10 (14.1)             64 (64.5)
Junior       20 (23.4)         9 (7.1)      7 (8.7)              43 (39.8)
Senior       21 (34.0)        17 (10.3)    23 (12.7)             54 (58.0)

Test Statistic:
$$X^2 = \frac{(61 - 49.7)^2}{49.7} + \cdots + \frac{(54 - 58.0)^2}{58.0} = 29.48$$
P-value < .001
α = .01
Body Art Continued . . .
             Body Piercing    Tattoos      Both Body Piercing    No Body
             Only             Only         and Tattoos           Art
Freshman     61 (49.7)         7 (15.1)    14 (18.5)             86 (84.7)
Sophomore    43 (37.9)        11 (11.5)    10 (14.1)             64 (64.5)
Junior       20 (23.4)         9 (7.1)      7 (8.7)              43 (39.8)
Senior       21 (34.0)        17 (10.3)    23 (12.7)             54 (58.0)

Since the P-value < α, we reject H0. There is sufficient evidence to suggest that class standing and the body art category are associated.
Which cell contributes the most to the X2 test statistic?
Seniors having both body piercing and tattoos contribute the most to the X2 statistic.
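The whole body art analysis can be reproduced with the sketch below. It works from unrounded expected counts, so the statistic comes out slightly different in the second decimal place from the 29.48 obtained with the rounded counts in the table. The helper `chi2_sf` is a hand-written right-tail area for integer df, included so the example needs only the standard library; its name and the other identifiers are illustrative.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve (integer df)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    tail = sum(y ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

standings = ["Freshman", "Sophomore", "Junior", "Senior"]
categories = ["Body Piercing Only", "Tattoos Only",
              "Both Body Piercing and Tattoos", "No Body Art"]
observed = [
    [61, 7, 14, 86],
    [43, 11, 10, 64],
    [20, 9, 7, 43],
    [21, 17, 23, 54],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Per-cell contribution (observed - expected)^2 / expected.
contributions = {}
for i, standing in enumerate(standings):
    for j, category in enumerate(categories):
        e = row_totals[i] * col_totals[j] / grand_total
        contributions[(standing, category)] = (observed[i][j] - e) ** 2 / e

statistic = sum(contributions.values())
df = (len(standings) - 1) * (len(categories) - 1)
p_value = chi2_sf(statistic, df)
largest_cell = max(contributions, key=contributions.get)

print(f"X2 = {statistic:.2f}, df = {df}, P-value = {p_value:.5f}")
print("Largest contribution:", largest_cell)
```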