Chi-square Basics

Unit #8 – Chapter 13
The Chi-Square Distribution!
Why is it used? Two purposes…
• Chi-square analysis is primarily used to
deal with categorical (frequency) data
(1) We measure the “goodness of fit”
between our observed outcome and the
expected outcome for some variable
(2) With two variables, we test in particular
whether they are independent of one
another using the same basic approach.
Section 13-1:
“Goodness of Fit” Tests
• Suppose we want to know how people in a
particular area will vote. We take an
SRS of n = 60 people and ask each one
which party they prefer.
Republican    Democrat    Independent
20            30          10
• How will we analyze what’s really going
on?
Goodness of Fit Example, continued
• Null Hypothesis: There is no preference –
all three parties are equally liked
• Solution: chi-square analysis to determine
if our outcome is different from what would
be expected if there was no preference
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Calculations
               Observed    Expected
Republican     20          20
Democrat       30          20
Independent    10          20

• Expected counts: if there were truly no preference (as the null
hypothesis states), then we would expect equal numbers for each
party, so 60/3 = 20.
• Plug into the formula:
$\chi^2 = \frac{(20 - 20)^2}{20} + \frac{(30 - 20)^2}{20} + \frac{(10 - 20)^2}{20} = 0 + 5 + 5 = 10$
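The same arithmetic can be sketched in Python. This is only an illustration: the function name `chi_square_gof` is made up for this example, and the expected counts assume the "no preference" null hypothesis from the slide.

```python
def chi_square_gof(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed counts from the survey: Republican, Democrat, Independent
observed = [20, 30, 10]

# Under H0 (no preference) each party is expected to get n / 3 responses
n = sum(observed)
expected = [n / len(observed)] * len(observed)

chi_sq = chi_square_gof(observed, expected)
print(chi_sq)  # 10.0
```

Changing the expected list is all it would take to test a different null hypothesis, such as an unequal split.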
More Calculations
• Test statistic: $\chi^2(2) = 10$, with df = 3 - 1 = 2
• Critical value: $\chi^2_{.05}(2) = 5.99$
• Since 10 > 5.99, we will Reject H0
• More on p-values on the board….
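As a preview of the p-value discussion, the p-value is the area to the right of the test statistic under the chi-square curve. The sketch below uses only Python's standard library. `chi_square_sf` is a hypothetical helper name, and the closed-form sums it uses are valid only for whole-number degrees of freedom.

```python
import math

def chi_square_sf(x, df):
    """Right-tail area P(chi-square with df degrees of freedom > x), integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum of y^j / j! for j = 0 .. df/2 - 1
        return math.exp(-y) * sum(y ** j / math.factorial(j) for j in range(df // 2))
    # Odd df: start from the df = 1 tail and add one term per extra pair of df
    tail = math.erfc(math.sqrt(y))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-y) * y ** (j - 0.5) / math.gamma(j + 0.5)
    return tail

p_value = chi_square_sf(10, 2)   # test statistic 10 with df = 2
print(round(p_value, 4))         # 0.0067
print(p_value < 0.05)            # True, which agrees with rejecting H0
```

A p-value below .05 and a test statistic above the .05 critical value are two views of the same decision.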
Conclusion
• Note that all we can really conclude is that our
data differ from the outcome expected under the
null hypothesis – that’s what the Alternative
Hypothesis says
– Although it would appear that the district will vote
Democratic, really we can only conclude that the
responses did not follow the equal split expected by chance
– Regardless of the order that I wrote the outcome
categories, the calculations would have been the
same
– In other words, it is a non-directional test regardless
of the prediction
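The point about category order can be checked directly. The short Python sketch below (variable names are illustrative only) recomputes the statistic for every ordering of the three parties, using the equal-split expected count of 20.

```python
from itertools import permutations

observed = {"Republican": 20, "Democrat": 30, "Independent": 10}
expected = 20  # equal split of n = 60 under H0

# Recompute the statistic for every possible ordering of the categories
results = set()
for order in permutations(observed):
    stat = sum((observed[party] - expected) ** 2 / expected for party in order)
    results.add(stat)

print(results)  # {10.0}
```

Every ordering gives the same value because each category contributes a squared difference, which also hides the direction of the difference.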
Summary: The Chi-square distribution
• Skewed to the right, but it becomes more
symmetric (and normal-looking) with
increasing degrees of freedom
• No, this is NOT a normal distribution, a t-distribution, or a uniform distribution – it’s
a different thing!
• So, different graph, different numbers, etc.
$\chi^2 = \sum_{i=1}^{n} z_i^2$
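A small simulation can illustrate both the sum-of-squared-z formula and the shape claim. This is a rough sketch, not a proof: the helper names, the seed, and the number of draws are arbitrary choices made for this example.

```python
import random

def simulate_chi_square(df, draws, rng):
    """Each simulated value is a sum of df squared standard normal z-scores."""
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(draws)]

def skewness(values):
    """Sample skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

rng = random.Random(13)  # fixed seed so the run is repeatable
skew_by_df = {}
for df in (2, 10, 30):
    sample = simulate_chi_square(df, 10000, rng)
    skew_by_df[df] = skewness(sample)
    # Print df, the sample mean, and the sample skewness
    print(df, round(sum(sample) / len(sample), 2), round(skew_by_df[df], 2))
```

The sample mean should land near the degrees of freedom, and the skewness should stay positive while shrinking as the degrees of freedom grow.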
Conditions for Chi-Square Tests
• Counts:
– We need an expected frequency of at least 5 in each category
• Inclusion of non-occurrences:
– Must include all responses, not just the positive ones
• Independence:
– Not that the variables are independent or related (that’s what the
test can be used for), but rather, as with our t-tests, the
observations (data points) don’t have any bearing on one
another. So, as usual, make sure it was an SRS.
• To help with the last two, make sure that your N equals
the total number of people who responded
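Two of these conditions can be checked mechanically, as in the Python sketch below. The function name `check_conditions` is hypothetical. Independence of observations is not something code can verify, since it depends on how the sample was collected.

```python
def check_conditions(observed, expected, n_respondents):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Counts: every expected frequency should be at least 5
    if any(e < 5 for e in expected):
        problems.append("an expected count is below 5")
    # Inclusion of non-occurrences: the categories must account for every respondent
    if sum(observed) != n_respondents:
        problems.append("observed counts do not add up to N")
    return problems

# Party-preference survey: n = 60, equal split expected under H0
print(check_conditions([20, 30, 10], [20, 20, 20], 60))  # []

# Leaving out the 10 Independents breaks the inclusion check
print(check_conditions([20, 30], [25, 25], 60))  # ['observed counts do not add up to N']
```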
Section 13-2:
Tests of Independence
between 2 categorical variables
• What do Stats kids do with their free time?
• We ask an SRS of n=200 students and get
these observed results:
           TV    Nap    Study    Stare at Ceiling
Males      30    40     20       10
Females    20    30     40       10
• Is there a relationship between gender
(X) and what the Stats kids do with their
free time (Y)?
           TV    Nap    Study    Stare at Ceiling    Totals:
Males      30    40     20       10                  100
Females    20    30     40       10                  100
Totals:    50    70     60       20                  200
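• Expected count for each cell: $E_{ij} = \frac{R_i \times C_j}{N}$,
where $R_i$ is the row total, $C_j$ is the column total, and N is the grand total
• Example for males/TV: the expected
count would be (100*50)/200 = 25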
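The row-total-times-column-total rule can be applied to every cell at once. The Python sketch below is one possible layout. The dictionary structure and variable names are choices made for this example, not part of the slides.

```python
observed = {
    "Males":   {"TV": 30, "Nap": 40, "Study": 20, "Stare at Ceiling": 10},
    "Females": {"TV": 20, "Nap": 30, "Study": 40, "Stare at Ceiling": 10},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
activities = list(observed["Males"])
col_totals = {a: sum(observed[row][a] for row in observed) for a in activities}
n = sum(row_totals.values())

# Expected count for each cell = (row total * column total) / N
expected = {
    row: {a: row_totals[row] * col_totals[a] / n for a in activities}
    for row in observed
}

print(expected["Males"]["TV"])  # 25.0
print(expected["Females"])
```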
The same chart, with the
Observed and (Expected) counts
               TV         Nap        Study      Stare at Ceiling    Totals:
Males (E)      30 (25)    40 (35)    20 (30)    10 (10)             100
Females (E)    20 (25)    30 (35)    40 (30)    10 (10)             100
Totals:        50         70         60         20                  200
• df = (R-1)(C-1) = (2-1)(4-1) = 3,
where R = number of rows and
C = number of columns
Calculations and Conclusion:
• Chi-Square Test of Independence (TOI):
Test statistic: $\chi^2(3) = 10.10$
Critical value: $\chi^2_{.05}(3) = 7.82$
• Since 10.10 > 7.82, Reject H0 (more on p-values on the board)
• Conclusion: There is some relationship
between gender and how Stats students
spend their free time
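The full test-of-independence calculation can be reproduced with a few lines of Python. This is a sketch using plain lists. The critical value is copied from the slide rather than computed, and the variable names are illustrative.

```python
# Observed counts: rows are Males, Females; columns are TV, Nap, Study, Stare at Ceiling
observed = [
    [30, 40, 20, 10],
    [20, 30, 40, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 7.82  # table value for df = 3 at the .05 level, as quoted on the slide

print(round(chi_sq, 2), df)      # 10.1 3
print(chi_sq > critical_value)   # True, so reject H0
```

The "Stare at Ceiling" column adds nothing to the statistic because its observed and expected counts match exactly.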
Conditions for both types of
Chi-Square Tests
• Counts:
– Rule of thumb is that we need an expected frequency of at least 5
in each category
• Inclusion of non-occurrences:
– Must include all responses, not just the positive ones
• Independence:
– Not that the variables are independent or related (that’s what the
test can be used for), but rather, as with our t-tests, the
observations (data points) don’t have any bearing on one
another. So, as usual, make sure it was an SRS.
• To help with the last two, make sure that your N equals
the total number of people who responded
Other
• Important point about the non-directional
nature of the test: the chi-square test by
itself cannot speak to specific hypotheses
about the way the results would come out
>> In other words, the null hypothesis is
always “no relationship or no preference”,
while the alternative hypothesis is always
“there is some relationship” or “there is
some sort of preference”
For example, we asked: “What do Stats kids do
with their free time?”
           TV    Nap    Study    Stare at Ceiling
Males      30    40     20       10
Females    20    30     40       10
• Even though we rejected the null hypothesis – concluding that
gender and free time behavior are associated with each other
– that’s our only conclusion.
• We can’t also conclude that males nap more or females study
more, for example, even though it looks that way, since that
wasn’t in our alternative hypothesis to start with.
Finally…..
• There are two different types of Chi-Square
Tests:
> Chi-Square Goodness of Fit Test (13-1)
> Chi-Square Test of Independence (13-2)
* When you write the name of the test (the
second C in HCCC), you must specify which
one it is – don’t just write “Chi-Square Test.”