Chi-Square Analysis

Goodness of Fit
"Linkage Studies of the Tomato" (Trans. Royal Canad. Inst. (1931))
We will develop the use of the $\chi^2$ distribution through an
example from biology.
Consider two different characteristics of tomatoes, leaf shape
and plant size. The leaf shape may be potato-leafed or cut-leafed, and the plant may be tall or dwarf.
If we cross a tall cut-leaf tomato with a dwarf potato-leaf
tomato and examine the progeny we will discover a uniform F1
generation.
The traits tall and cut-leaf are each dominant, while dwarf and
potato-leaf are recessive. We use the letter T for height, and C
for leaf shape, so the alleles are T, t, C, and c.
We will examine a Punnett square to illustrate this dihybrid
cross. Columns are the gametes of the tall cut-leaf tomato; rows
are the gametes of the dwarf potato-leaf tomato.

gametes   TC      TC
tc        TtCc    TtCc
tc        TtCc    TtCc

Notice the uniformity among the offspring: all are TtCc.
Now we cross the F1 among themselves to produce the F2:

gametes   TC      Tc      tC      tc
TC        TTCC    TTCc    TtCC    TtCc
Tc        TTCc    TTcc    TtCc    Ttcc
tC        TtCC    TtCc    ttCC    ttCc
tc        TtCc    Ttcc    ttCc    ttcc

Now we identify the tall cut-leaf tomatoes:
Of the 16 cells in the F2 square, 9 carry at least one T and at
least one C: TTCC (1 cell), TTCc (2), TtCC (2), TtCc (4).
Now we identify the tall potato-leaf tomatoes:
Of the 16 cells in the F2 square, 3 carry at least one T but no C:
TTcc (1 cell), Ttcc (2).
Next we identify the dwarf cut-leaf tomatoes:
Of the 16 cells in the F2 square, 3 carry no T but at least one C:
ttCC (1 cell), ttCc (2).
Finally, the last type of tomato is dwarf potato-leaf:
Only 1 of the 16 cells in the F2 square carries neither T nor C:
ttcc.
So now we have four phenotypes (different physical forms) of
tomatoes originating from the single phenotype of the F1
generation.
They are, along with their genotypes and expected frequencies:

Phenotype            Genotypes                  Expected frequency
Tall cut-leaf        TTCC, TTCc, TtCC, TtCc     9/16
Tall potato-leaf     TTcc, Ttcc                 3/16
Dwarf cut-leaf       ttCC, ttCc                 3/16
Dwarf potato-leaf    ttcc                       1/16

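The 9 : 3 : 3 : 1 pattern can be checked by enumerating all sixteen gamete pairings of the F1 cross. The following is a minimal Python sketch, not part of the original study; the names `gametes`, `phenotype`, and `expected_freq` are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Each F1 parent is TtCc, so it can pass on any of these four gametes.
gametes = ["TC", "Tc", "tC", "tc"]

def phenotype(g1, g2):
    """Phenotype of the offspring formed from two gametes.

    T (tall) and C (cut-leaf) are dominant, so one copy is enough.
    """
    height = "tall" if "T" in (g1[0], g2[0]) else "dwarf"
    leaf = "cut-leaf" if "C" in (g1[1], g2[1]) else "potato-leaf"
    return f"{height} {leaf}"

# Count how many of the 16 cells of the square give each phenotype.
counts = Counter(phenotype(a, b) for a, b in product(gametes, repeat=2))

# Convert cell counts to exact expected frequencies.
expected_freq = {name: Fraction(n, 16) for name, n in counts.items()}

for name, freq in expected_freq.items():
    print(name, freq)
```

Using `Fraction` keeps the frequencies exact, in the same spirit as entering 9/16 rather than a rounded decimal.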
If our understanding of genetics is correct and we have
constructed the crosses we believe we have, we expect the
proportions of the four phenotypes to fit our calculations.
With the $\chi^2$ distribution, we are able to test whether groups of
individuals are present in the same proportions as we expect.
This is rather like conducting multiple Z-tests for proportions,
all at once.
In this example we carry out the dihybrid cross to produce an
F1 generation, and, as expected, the F1 are all of the same
phenotype, tall and cut-leaf.
Further, the F1 are crossed among themselves to produce the F2
generation. We record the numbers of individuals in each
category.
The following table gives the observed numbers of each category.

Phenotype            Observed    Expected frequency
Tall cut-leaf        926         9/16
Tall potato-leaf     288         3/16
Dwarf cut-leaf       293         3/16
Dwarf potato-leaf    104         1/16

To make a test for “goodness of fit” we start as with all other
tests of significance, with a null hypothesis.
Step 1:
H0: The F2 generation is composed of four
phenotypes in the proportions predicted by our
calculations (based on Mendelian genetics).
Ha: The F2 generation is not composed of four phenotypes in
the proportions predicted by our calculations.
Another way of saying this is that under the null hypothesis the
population fits our expected pattern, and under the alternative
hypothesis it does not fit our pattern.
Step 2:
Assumptions: Our first assumption is that our
data are counts. (We cannot use proportions or
means.) With $\chi^2$, we do not always have a
sample of a population, and sometimes examine an entire
population, as with this example. When we work from a sample,
we must ensure that it is representative.
In order to check assumptions for this goodness of fit test we
must calculate the expected counts for each category. Then
we must meet two criteria:
1. All expected counts must be one or more.
2. No more than 20% of the expected counts may be less than 5.
We calculate the expected counts by finding the total number of
observations and multiplying that by each expected frequency.

Phenotype            Observed counts    Expected frequency    Expected counts
Tall cut-leaf        926                9/16                  (9/16)(1611) = 906.188
Tall potato-leaf     288                3/16                  (3/16)(1611) = 302.063
Dwarf cut-leaf       293                3/16                  (3/16)(1611) = 302.063
Dwarf potato-leaf    104                1/16                  (1/16)(1611) = 100.688

As you can see, all expected counts are greater than 5, so all
assumptions are met.
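The expected counts and the two criteria can also be checked in a few lines of Python. This is a sketch under the same numbers as the table; the variable names are illustrative, and the 20% rule is applied to the expected counts.

```python
from fractions import Fraction

# Observed F2 counts from the table.
observed = {
    "tall cut-leaf": 926,
    "tall potato-leaf": 288,
    "dwarf cut-leaf": 293,
    "dwarf potato-leaf": 104,
}

# Expected frequencies from the Punnett square, kept exact.
expected_freq = {
    "tall cut-leaf": Fraction(9, 16),
    "tall potato-leaf": Fraction(3, 16),
    "dwarf cut-leaf": Fraction(3, 16),
    "dwarf potato-leaf": Fraction(1, 16),
}

# Total number of observations (1611 for these data).
total = sum(observed.values())

# Expected count = total observations x expected frequency.
expected = {name: float(total * f) for name, f in expected_freq.items()}

# Criterion 1: every expected count is at least 1.
all_at_least_one = all(e >= 1 for e in expected.values())

# Criterion 2: no more than 20% of expected counts are below 5.
share_below_five = sum(e < 5 for e in expected.values()) / len(expected)

conditions_met = all_at_least_one and share_below_five <= 0.20

for name, e in expected.items():
    print(f"{name}: {e:.3f}")
print("conditions met:", conditions_met)
```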
Step 3:
The formula for the $\chi^2$ test statistic is:

$$\chi^2 = \sum \frac{(o - e)^2}{e}$$

where o = observed counts, and
e = expected counts
This calculation is best carried out on a graphing calculator.
Enter the observed counts in L1. Enter the expected frequencies
in L2, as exact numbers. (Enter numbers like 1/3 directly as
fractions; never round to just .3 or .33.)
In L3 multiply L2 by 1611, the sum of L1. This will give the
expected counts. The sum of L1 can be found using 1-Var Stats.
Now in L4, enter (L1 - L3)^2 / L3; this will give you the
$\chi^2$ contribution for each category.
Finally, $\chi^2$ is the sum of L4.
For this problem, the $\chi^2$ statistic is 1.4687.
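The calculator steps above translate directly into a few lines of Python. This is a sketch for checking the arithmetic, not the procedure the slides prescribe; the list names simply mirror the calculator lists L1 to L4.

```python
# L1: observed counts, in the order tall cut-leaf, tall potato-leaf,
# dwarf cut-leaf, dwarf potato-leaf.
observed = [926, 288, 293, 104]

# L2: expected frequencies, entered as exact fractions of 16.
freqs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

# L3: expected counts = total observations x expected frequency.
n = sum(observed)
expected = [n * f for f in freqs]

# L4: contribution of each category, (o - e)^2 / e.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# The test statistic is the sum of the contributions.
chi_sq = sum(contributions)

for o, e, c in zip(observed, expected, contributions):
    print(f"observed={o}  expected={e:.3f}  contribution={c:.4f}")
print(f"chi-square = {chi_sq:.4f}")
```

Printing the individual contributions shows which category departs most from its expected count; here it is the tall potato-leaf group.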
In $\chi^2$, we always need to know and report the degrees of
freedom. The degrees of freedom are the number of categories
minus one.
Here we have 3 degrees of freedom.
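With the statistic and the degrees of freedom in hand, the upper-tail area can be computed without a calculator. The sketch below uses a closed form that holds only for exactly 3 degrees of freedom, $P(X \ge x) = \operatorname{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$; it is a swapped-in shortcut for this example, not a general method. The function name `chi2_upper_tail_df3` is hypothetical. For other degrees of freedom, a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

def chi2_upper_tail_df3(x):
    """Upper-tail area P(X >= x) for a chi-square variable with 3 df.

    Valid only for 3 degrees of freedom and x >= 0.
    """
    if x < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Test statistic from the tomato data, with 4 - 1 = 3 degrees of freedom.
p_value = chi2_upper_tail_df3(1.4687)
print(f"p-value = {p_value:.4f}")
```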
Step 4:
Step 5:
$P(\chi^2 \ge 1.4687) = 0.6895$
The area can also be found with
χ²cdf(1.4687, 10^99, 3).
Step 6:
Fail to reject H0; a test statistic this large or larger may occur
by chance alone almost 70% of the time.
Step 7:
We lack strong evidence that the pattern of tomato
phenotypes is different from the expected. That is,
the data are consistent with the F2 phenotypes being
present in the expected proportions.