STATISTICAL LABORATORY, 28 May, 2010
AN EXAMPLE OF PAIRED DATA: DARWIN’S DATA ON ZEA MAYS
Mario Romanazzi
1 INTRODUCTION
Darwin (1809-1882) performed extensive investigations of the effects of cross- and self-fertilization on
plants (the results of his work were published in the book The effects of cross- and self-fertilisation in the
vegetable kingdom, John Murray, London, 1876). His hypothesis, now widely accepted, was that
cross-fertilization produces more robust and vigorous individuals. To check this hypothesis he collected a huge
amount of data, mostly related to height, weight and seed production. Here we present the Zea Mays
data set, which concerns maize. Darwin paired 15 cross-fertilized and 15 self-fertilized
seedlings and let the pairs grow in different pots. The aim was to control for differences in soil, watering and light
conditions. After a number of weeks, the stalk height of each plant was measured.
Darwin’s data can be used to answer the following questions.
1. Is there empirical evidence that cross-fertilized maize plants are taller than self-fertilized ones?
2. What is the linear correlation between the heights of cross-fertilized and self-fertilized maize plants? What
about outliers?
3. Are the assumptions of the t test satisfied?
4. What is the confidence interval for $\mu_{CROSS} - \mu_{SELF}$, the mean difference in stalk height in the
reference population? What is the p-value for the null hypothesis $\mu_{CROSS} - \mu_{SELF} = 0$?
2 DATA INPUT
> seed <- read.table("http://venus.unive.it/romanaz/statistics/data/darwin.txt",
+     header = TRUE)
> str(seed)
'data.frame':   15 obs. of  2 variables:
 $ cross: num  23.5 12 21 22 19.1 21.5 22.1 20.4 18.3 21.6 ...
 $ self : num  17.4 20.4 20 20 18.4 18.6 18.6 15.3 16.5 18 ...
> n <- dim(seed)[1]
> n
[1] 15
First of all we convert the heights from inches to centimeters, using 1 cm ≈ 0.3937 in.
> seed$cross1 <- seed$cross/0.3937
> seed$self1 <- seed$self/0.3937
> seed[, 3:4]
     cross1    self1
1  59.69012 44.19609
2  30.48006 51.81610
3  53.34011 50.80010
4  55.88011 50.80010
5  48.51410 46.73609
6  54.61011 47.24409
7  56.13411 47.24409
8  51.81610 38.86208
9  46.48209 41.91008
10 54.86411 45.72009
11 59.18212 41.40208
12 53.34011 45.72009
13 56.13411 32.51207
14 58.42012 39.37008
15 30.48006 45.72009
3 DATA SUMMARY AND DISTRIBUTIONAL PLOTS
> summary(seed[, 3:4])
     cross1          self1
 Min.   :30.48   Min.   :32.51
 1st Qu.:50.17   1st Qu.:41.66
 Median :54.61   Median :45.72
 Mean   :51.29   Mean   :44.67
 3rd Qu.:56.13   3rd Qu.:47.24
 Max.   :59.69   Max.   :51.82
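The summary function shows that the difference between the sample means is 51.29 − 44.67 = 6.62 cm.
Moreover, in both samples the mean lies below the median, which points to a few unusually low values;
the gap is larger in the cross-fertilized sample (about 3.3 cm) than in the self-fertilized one (about 1 cm).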
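The comparison of means and medians can be made explicit with a few lines of R. This is a minimal sketch, assuming the heights in centimeters are typed in from the listing of seed[, 3:4] instead of being downloaded; the names `cross_cm`, `self_cm`, `mean_diff` and `gap` are illustrative and do not appear in the original session.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

# Difference of the sample means (cross minus self).
mean_diff <- mean(cross_cm) - mean(self_cm)

# Mean minus median in each sample: a negative value means the mean
# is pulled below the median, as happens with a few low values.
gap <- c(cross = mean(cross_cm) - median(cross_cm),
         self  = mean(self_cm)  - median(self_cm))

print(round(mean_diff, 2))
print(round(gap, 2))
```

Both entries of `gap` come out negative, and the one for the cross-fertilized sample is the larger in absolute value, in line with the summary table.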
> stem(seed[, 3])
  The decimal point is 1 digit(s) to the right of the |
  3 | 00
  4 | 69
  5 | 2335566689
  6 | 0
> stem(seed[, 4])
  The decimal point is 1 digit(s) to the right of the |
  3 | 3
  3 | 99
  4 | 124
  4 | 666777
  5 | 112
> boxplot(seed$cross1, seed$self1, horizontal = TRUE, notch = TRUE,
+     xlab = "Height of Seedling (cm)", names = c("Cross", "Self"),
+     col = "lightgrey", main = "Darwin's Zea Mays Data")
> plot(seed$cross1, seed$self1, pch = 20, xlab = "Height of Cross-Fertilized Seedling (cm)",
+     ylab = "Height of Self-Fertilized Seedling (cm)", main = "Darwin's Zea Mays Data")
> cor(seed$cross1, seed$self1)
[1] -0.3378578
[Figure: two panels, both titled "Darwin's Seedling Data". Left: notched horizontal boxplots for Cross and Self, x-axis "Height of Seedling (cm)". Right: scatter plot, x-axis "Height of Cross-Fertilized Seedling (cm)", y-axis "Height of Self-Fertilized Seedling (cm)".]
The paired boxplots highlight some possibly important features of the data.
1. There are a few outliers (e.g., pairs no. 2, 13 and 15).
2. The distribution of the heights of the cross-fertilized seedlings is clearly shifted to the right with
respect to that of the self-fertilized seedlings.
3. The confidence intervals for the median heights (drawn as notches by the option notch=TRUE) do
not overlap, which suggests that the population medians are different.
Moreover, the scatter plot suggests that the negative linear correlation could be due to
the outlying observations. Below we recompute the linear correlation coefficient with pairs no. 2 and
15 removed from the sample.
> cor(seed$cross1[-c(2, 15)], seed$self1[-c(2, 15)])
[1] -0.1169845
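Dropping pairs no. 2 and 15 moves the coefficient from about −0.34 to about −0.12, so most of the negative correlation is carried by those two pairs. The sketch below repeats the check for every pair in turn (leave-one-out) and adds Spearman's rank correlation as a less outlier-sensitive summary. Both additions are illustrative and not part of the original analysis; the data are typed in from the listing of seed[, 3:4], and the names `out_pairs`, `r_loo` and `r_rank` are hypothetical.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

# Pearson correlation on all pairs and without the two suspect pairs.
r_all     <- cor(cross_cm, self_cm)
out_pairs <- c(2, 15)
r_trim    <- cor(cross_cm[-out_pairs], self_cm[-out_pairs])

# Leave-one-out correlations: element i is the coefficient computed
# with pair i removed, so large changes flag influential pairs.
r_loo <- sapply(seq_along(cross_cm),
                function(i) cor(cross_cm[-i], self_cm[-i]))
names(r_loo) <- seq_along(cross_cm)

# Spearman's rank correlation (a swapped-in alternative to Pearson's r).
r_rank <- cor(cross_cm, self_cm, method = "spearman")

print(round(c(all = r_all, trimmed = r_trim, spearman = r_rank), 4))
print(round(r_loo, 3))
```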
4 COMPUTATION OF DIFFERENCES
> diff <- seed$cross1 - seed$self1
> stem(diff)
  The decimal point is 1 digit(s) to the right of the |
  -2 | 1
  -1 | 5
  -0 |
   0 | 23557899
   1 | 3589
   2 | 4
> boxplot(diff, xlab = "Height Difference of Paired Seedlings",
+     horizontal = TRUE, col = "lightgrey", main = "Darwin's Zea Mays Data",
+     notch = TRUE)
[Figure: notched horizontal boxplot titled "Darwin's Seedling Data", x-axis "Height Difference of Paired Seedlings", with two points plotted as outliers on the low side.]
5 TESTING NORMALITY OF DIFF
In the present example the sample size is small, and the usual statistical inference procedures (t tests and
t confidence intervals) rely on the assumption that the data are normally distributed. To gain some insight into the
characteristics of the data, we first plot the ECDF of the differences and superimpose the normal CDF,
with parameters estimated from the sample. We then check normality empirically with a normal Q-Q plot
and, finally, with a Kolmogorov-Smirnov test.
> plot.ecdf(diff, xlab = "Height Difference of Paired Seedlings",
+     ylab = "ECDF", main = "Darwin's Seedling Data")
> plot(function(x) pnorm(x, mean(diff), sd(diff)), mean(diff) -
+     4 * sd(diff), mean(diff) + 4 * sd(diff), lwd = 2, col = "red",
+     add = TRUE)
> qqnorm(scale(diff), pch = 20, xlab = "Theoretical Normal Quantiles",
+     main = "Darwin's Seedling Data", xlim = c(-3, 3), ylim = c(-3,
+         3), asp = 1)
> qqline(scale(diff), col = "red", lwd = 2)
[Figure: two panels, both titled "Darwin's Seedling Data". Left: ECDF of the differences with the fitted normal CDF, x-axis "Height Difference of Paired Seedlings", y-axis "ECDF". Right: normal Q-Q plot, x-axis "Theoretical Normal Quantiles", y-axis "Sample Quantiles".]
> ks.test(diff, "pnorm", mean(diff), sd(diff))
        One-sample Kolmogorov-Smirnov test
data:  diff
D = 0.2096, p-value = 0.4635
alternative hypothesis: two-sided
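The p-value is high, suggesting that the observed departures from normality (see the boxplot of the
differences above) are not statistically significant and may be due to sampling error. Two caveats apply:
with only 15 observations the test has little power, and because the mean and standard deviation are
estimated from the same data the reported p-value is only approximate. Note that here again the outliers may
play a role.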
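As a cross-check, the Kolmogorov-Smirnov statistic can be reproduced from the listed heights and set beside a Shapiro-Wilk test, which is designed for the case where the normal parameters are unknown. The Shapiro-Wilk test is an addition for illustration, not part of the original analysis; the data are typed in from the listing of seed[, 3:4] and the names `d`, `ks` and `sw` are hypothetical.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

# Paired differences (cross minus self).
d <- cross_cm - self_cm

# Kolmogorov-Smirnov test against a normal CDF whose mean and standard
# deviation are estimated from the same sample, as in the text.
ks <- ks.test(d, "pnorm", mean(d), sd(d))

# Shapiro-Wilk test of normality (swapped-in alternative).
sw <- shapiro.test(d)

print(ks$statistic)
print(sw)
```

If the two tests disagree, rerunning them without pairs no. 2 and 15 shows how much of the discrepancy is due to the outliers.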
6 TESTING SIGNIFICANCE OF DIFFERENCES
> summary(diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-21.340   3.556   7.620   6.621  14.220  23.620
> sd(diff)
[1] 11.97059
> se <- sd(diff)/sqrt(n)
> se
[1] 3.090792
The observed mean difference, 6.621 cm, is rather extreme: it lies about 2.14 standard errors above zero
(the standard error is about 3.09 cm). The median difference, 7.620 cm, is even higher (why?).
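The ratio of the mean difference to its standard error is the paired t statistic, so the two-sided p-value and the 95% confidence interval can be worked out by hand from the t distribution with n − 1 degrees of freedom. This is a minimal sketch under the assumption that the heights are typed in from the listing of seed[, 3:4]; the names `d`, `t_obs`, `p_two` and `ci` are illustrative.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

d  <- cross_cm - self_cm      # paired differences
n  <- length(d)               # number of pairs
se <- sd(d) / sqrt(n)         # standard error of the mean difference

# t statistic for H0: mean difference = 0.
t_obs <- mean(d) / se

# Two-sided p-value from the t distribution with n - 1 df.
p_two <- 2 * pt(abs(t_obs), df = n - 1, lower.tail = FALSE)

# 95% confidence interval: mean +/- t quantile * standard error.
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(c(t = t_obs, p = p_two), 4))
print(round(ci, 3))
```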
> t.test(seed$cross1, seed$self1, mu = 0, paired = TRUE, alternative = "two.sided",
+     conf.level = 0.95)
        Paired t-test
data:  seed$cross1 and seed$self1
t = 2.1422, df = 14, p-value = 0.05025
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.008142555 13.250035706
sample estimates:
mean of the differences
               6.620947
The two-sided p-value (evaluated with respect to a t distribution with n − 1 = 14 degrees of freedom) is
slightly above 0.05. Here, however, we are interested in the directional alternative
$\mu_{DIFF} = \mu_{CROSS} - \mu_{SELF} > 0$, so the relevant one-sided p-value is half the two-sided one, about 2.5%.
This result may be distorted by the outliers, so we recompute the p-value and the confidence interval
after discarding pairs no. 2 and 15.
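The one-sided p-value can also be obtained directly by passing alternative = "greater" to t.test, or by taking the upper tail of the t distribution at the observed statistic. The sketch below assumes the heights are typed in from the listing of seed[, 3:4]; the names `res`, `p_one` and `p_tail` are illustrative.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

# Paired t test against the directional alternative
# "mean of cross minus self is greater than 0".
res   <- t.test(cross_cm, self_cm, mu = 0, paired = TRUE,
                alternative = "greater")
p_one <- res$p.value

# Same quantity from the upper tail of the t distribution.
p_tail <- pt(unname(res$statistic), df = unname(res$parameter),
             lower.tail = FALSE)

print(round(c(one_sided = p_one, upper_tail = p_tail), 4))
```

Halving the two-sided p-value is valid here only because the observed t statistic has the sign predicted by the alternative.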
> diff1 <- diff[-c(2, 15)]
> summary(diff1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.778   5.080   8.890  10.450  15.490  23.620
> sd(diff1)
[1] 6.805165
> sd(diff1)/sqrt(length(diff1))
[1] 1.887413
> t.test(seed$cross1[-c(2, 15)], seed$self1[-c(2, 15)], mu = 0,
+     paired = TRUE, alternative = "two.sided", conf.level = 0.95)
        Paired t-test
data:  seed$cross1[-c(2, 15)] and seed$self1[-c(2, 15)]
t = 5.5383, df = 12, p-value = 0.0001281
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  6.340778 14.565418
sample estimates:
mean of the differences
               10.45310
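The t statistic of the reduced sample can be recovered by hand as the mean difference divided by its standard error, provided the standard error uses the 13 pairs that remain after the two deletions. The sketch below takes the divisor from length() so that it always matches the vector being analysed; the data are typed in from the listing of seed[, 3:4], and the names `outliers`, `d1`, `se1` and `t1` are illustrative.

```r
# Heights in cm, copied from the listing of seed[, 3:4]
# (assumption: entered by hand rather than read from the URL).
cross_cm <- c(59.69012, 30.48006, 53.34011, 55.88011, 48.51410,
              54.61011, 56.13411, 51.81610, 46.48209, 54.86411,
              59.18212, 53.34011, 56.13411, 58.42012, 30.48006)
self_cm  <- c(44.19609, 51.81610, 50.80010, 50.80010, 46.73609,
              47.24409, 47.24409, 38.86208, 41.91008, 45.72009,
              41.40208, 45.72009, 32.51207, 39.37008, 45.72009)

# Differences with the two suspect pairs removed.
outliers <- c(2, 15)
d1 <- (cross_cm - self_cm)[-outliers]

# Standard error based on the number of pairs actually kept.
se1 <- sd(d1) / sqrt(length(d1))

# t statistic and two-sided p-value with length(d1) - 1 df.
t1 <- mean(d1) / se1
p1 <- 2 * pt(abs(t1), df = length(d1) - 1, lower.tail = FALSE)

print(round(c(n = length(d1), se = se1, t = t1), 4))
print(signif(p1, 4))
```

Comparing these values with the full-sample ones shows that removing the two pairs both raises the mean difference and roughly halves the standard deviation, which is why the evidence against the null hypothesis becomes so much stronger.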