LAB 4 – Inference for 2 Independent Population Means, Chi-Squared
Test for Independence, One-Way ANOVA
The ECA 225 lab has open hours if you need to finish LAB 4. The lab is open
Monday-Thursday 6:30-10:00pm and Saturday-Sunday 2:00-6:00pm. The last
day the lab is open coincides with the last day of regularly scheduled ASU
classes.
To download R onto your own personal computer, go to:
http://cran.r-project.org/bin/windows/base/
Click on the link for R-2.6.1-win32.exe. Save the file to your computer. Then
click on the file to start the installation to your computer.
Your submission to LAB 4 should consist of answering the numbered questions as you
work through the Lab.
***AS YOU ARE WORKING THROUGH THE LAB, copy and paste each output into a
blank Word file***
You can either print the completed Word file and turn it in, or you can e-mail the
Word file to me for your LAB 4 grade.
Everything MUST be done in R and included in your Word file.
***********************************************************************
***********************************************************************
Access R
On the desktop or through the Programs Menu, find the R icon and click on it.
You should be brought to a screen with a command prompt.
Two Independent Population Means
The dataset for this example can be found on my website and is saved as wings.txt.
This example looks at two subspecies of dark-eyed Juncos. One of the
subspecies migrates each year and the other does not. One of the variables
measured was wing length, in millimeters. We are interested in whether there
is a difference between the average wing length of migratory and
nonmigratory Juncos.
Now, conduct a t-test on the independent population means, first downloading the
data set.
>site="http://math.asu.edu/~coombs/wings.txt"
>wings<-read.table(file=site,header=T)
>attach(wings)
>names(wings)
OUTPUT: [1] "MIGRA" "NONMIGRA"
> t.test(MIGRA, NONMIGRA)
OUTPUT:
Welch Two Sample t-test
data: MIGRA and NONMIGRA
t = -4.6217, df = 25.614, p-value = 9.422e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.046220 -1.553780
sample estimates:
mean of x mean of y
     82.1      84.9
To interpret the hypothesis test, the test statistic is -4.6217, the Satterthwaite
degrees of freedom are 25.614, and the P-value is 0.00009422. The sentence underneath,
"alternative hypothesis: true difference in means is not equal to 0", tells you that R
used a two-tailed alternative.
My interpretation of the non-directional alternative is: At 5% significance, the data
provide evidence that there is a difference between the average wing length of migratory
and nonmigratory Juncos.
To interpret the confidence interval, R reports limits of -4.046220
and -1.553780 for the difference MIGRA minus NONMIGRA. Because the interval
consists entirely of negative numbers, it is telling you that the mean of
"NONMIGRA" is greater than the mean of "MIGRA". My interpretation of the interval would be:
At 95% confidence, the average wing length of nonmigratory Juncos is between 1.5538 and
4.0462 millimeters greater than the average wing length of migratory Juncos.
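The test statistic, Satterthwaite degrees of freedom, and P-value that t.test reports can be reproduced by hand. The following is a minimal sketch under stated assumptions: the wing lengths are made up for illustration (they are not the values in wings.txt), and welch_by_hand is a hypothetical helper name, not a built-in R function.

```r
# Hypothetical wing lengths in millimeters (NOT the values in wings.txt)
migra    <- c(81.2, 83.0, 82.5, 80.9, 82.8, 81.7, 83.4, 82.1)
nonmigra <- c(85.1, 84.0, 86.3, 83.7, 85.5, 84.8, 86.0)

# Hypothetical helper: Welch t statistic, Satterthwaite df, two-tailed P-value
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # estimated variance of mean(x)
  vy <- var(y) / length(y)   # estimated variance of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_two_sided <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p.value = p_two_sided)
}

by_hand  <- welch_by_hand(migra, nonmigra)
built_in <- t.test(migra, nonmigra)

print(by_hand)              # values computed from the formulas
print(built_in$statistic)   # t reported by t.test
print(built_in$parameter)   # df reported by t.test
```

The two sets of numbers should agree, which shows that the non-integer degrees of freedom in the output come from the Satterthwaite formula rather than from n1 + n2 - 2.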
If you wanted to do a one-tailed test, change the coding to:
> t.test(MIGRA,NONMIGRA, alt="less")
for left tailed or:
> t.test(MIGRA,NONMIGRA, alt="greater")
for right tailed
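If you wanted to use a significance level other than 5%, change the confidence
level of the interval to match. The P-value does not change; you simply compare
it to the new alpha. Change the coding to:
>t.test(MIGRA,NONMIGRA,conf.level=0.99)
for alpha = 1% (a 99% confidence interval)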
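To see what these options change, the sketch below runs t.test three ways on two small made-up samples (hypothetical values, not taken from any of the lab's data files). It assumes the first sample has the smaller mean, so the left-tailed P-value is half the two-tailed one.

```r
# Hypothetical samples (not from any of the lab's data files)
x <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.4)
y <- c(13.2, 13.9, 12.8, 14.1, 13.5, 13.0)

fit95   <- t.test(x, y)                         # two-tailed, 95% interval
fit99   <- t.test(x, y, conf.level = 0.99)      # two-tailed, 99% interval
fitless <- t.test(x, y, alternative = "less")   # left-tailed

# conf.level changes only the interval, not the P-value
same_p  <- isTRUE(all.equal(fit95$p.value, fit99$p.value))
width95 <- diff(fit95$conf.int)
width99 <- diff(fit99$conf.int)

# The decision at alpha = 1% is made by comparing the P-value to alpha
alpha  <- 0.01
reject <- fit99$p.value < alpha

print(c(two_tailed = fit95$p.value, left_tailed = fitless$p.value))
print(c(width95 = width95, width99 = width99))
```

Here alt="less" in the lab's code is shorthand for alternative="less"; R matches the abbreviated argument name.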
1. Do a hypothesis test for the dataset cloud.txt
The data set consists of results of a study on cloud seeding with silver nitrate. The
variable collected is rainfall amounts, in acre-feet, for unseeded and seeded
clouds. At 5% significance, is there evidence that average rainfall for seeded
clouds is greater than average rainfall for unseeded clouds? Include your code,
output, and interpretation.
2. Do a hypothesis test and confidence interval for the dataset homes.txt
The data set consists of random samples of homes from New York and Los
Angeles. The variable collected is home prices in thousands of dollars. At 10%
significance, is there evidence that average home price is different in New York
than in Los Angeles? Include your code, output, and interpretations of both the
significance test and the confidence interval.
Chi-Squared Test for Independence among Categorical Variables
This example is looking at a sample of 114 people to see if there is evidence that
hair color is associated with eye color.
The data set for this example is shown in the following table:
              Light Eyes   Dark Eyes
Light Hair        38           11
Dark Hair         14           51
First, you need to get the data into R as a matrix:
>hair<-matrix(c(38,14,11,51),nrow=2)
>hair
output:
     [,1] [,2]
[1,]   38   11
[2,]   14   51
Notice that you enter the data column-wise into the matrix.
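If you would rather type the counts row by row, as they read across the table, matrix() accepts byrow=TRUE. The sketch below builds the same hair/eye color matrix both ways and adds row and column labels; the label names Hair and Eyes are chosen here for illustration.

```r
# Counts from the hair/eye color example, entered column-wise (the default)
by_col <- matrix(c(38, 14, 11, 51), nrow = 2)

# The same counts entered row-wise, with labels added for readability
by_row <- matrix(c(38, 11, 14, 51), nrow = 2, byrow = TRUE,
                 dimnames = list(Hair = c("Light", "Dark"),
                                 Eyes = c("Light", "Dark")))

print(by_row)
print(all(by_col == by_row))   # TRUE when both layouts hold the same counts
```

Labels do not affect chisq.test, but they make it easier to check that each count landed in the intended cell.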
Now, to do the chi-squared test, enter the code below. We use the option
"correct=F" because we do not want to apply Yates' continuity correction. We just want
to do a Chi-Squared test similar to our homework.
>chisq.test(hair, correct=F)
Output:
Pearson's Chi-squared test
data: hair
X-squared = 35.3338, df = 1, p-value = 2.778e-09
To interpret the hypothesis test, the test statistic is 35.3338, there is 1 degree
of freedom, and the p-value is 0.000000002778. Therefore, at 5% significance, there is very
strong evidence that hair color is associated with eye color.
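The statistic can also be rebuilt from the expected counts, as in the homework. This is a sketch of the usual calculation (expected count = row total times column total divided by the grand total), using the hair/eye color counts from this example; the variable names are chosen here for illustration.

```r
# Observed counts from the hair/eye color example
hair <- matrix(c(38, 14, 11, 51), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(hair), colSums(hair)) / sum(hair)

# Chi-squared statistic, degrees of freedom, and upper-tail P-value
x2      <- sum((hair - expected)^2 / expected)
df      <- (nrow(hair) - 1) * (ncol(hair) - 1)
p_value <- pchisq(x2, df, lower.tail = FALSE)

print(round(expected, 2))
print(c(X.squared = x2, df = df, p.value = p_value))
```

The result should match the X-squared value printed by chisq.test(hair, correct=F). The expected counts are also available directly as chisq.test(hair, correct=F)$expected.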
3. Conduct a hypothesis test for the following dataset.
At 1% significance, is there evidence that being enrolled in a block program at a
university is associated with retention? 100 students were broken into two groups
of 50 at random. Fifty are in a block program and the others are not. The number
of years each student attends the college is then measured. Include your code,
output, and interpretation in context.
            Nonblock   Block
1 year         18        10
2 years        15         5
3 years         5         7
4 years         8        18
5+ years        4        10
4. Conduct a hypothesis test for the following dataset.
At 1% significance, is there evidence that accident type (none, minor, major) is
associated with age of driver? The sample consists of a random sample of drivers
who were asked if they had been in an accident in the previous year, and, if so,
whether it was a minor or major accident. Include your code, output, and
interpretation in context.
         Under 18 years   18-25 years   26-40   40-65   Over 65
None           67              42         75      56       57
Minor          10               6          8       4       15
Major           8               8          7       9        4
One-Way ANOVA (Analysis of Variance)
Analysis of variance is an extension of testing the difference between two
independent population means. However, instead of just two population means, you can
test for a difference between more than two population means. The null and alternative
hypotheses of a one-way ANOVA test for "I" different population means are:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_I$
$H_a$: not all of the means are the same
The data set for this example is found on my website and is called ozone.txt.
The data consist of atmospheric ozone concentrations measured in parts per hundred
million (pphm) in two commercial lettuce-growing gardens (garden A and garden B). We
will do an ANOVA test to see if there is a difference in the average ozone concentrations
between the two gardens.
> site="http://math.asu.edu/~coombs/ozone.txt"
> data<-read.table(file=site, header=T)
> attach(data)
> names(data)
OUTPUT: [1] "ozone" "garden"
The code for the ANOVA test is:
> summary(aov(ozone~garden))
OUTPUT:
            Df  Sum Sq Mean Sq F value   Pr(>F)
garden       1 20.0000 20.0000      15 0.001115 **
Residuals   18 24.0000  1.3333
The output is referred to as an ANOVA table. It reports statistics such as SSG, SSE,
MSG, and MSE. Most important, it reports the test statistic of 15 and the P-value of
0.001115.
If I interpreted this, I would say that at 5% significance there is evidence that the average
ozone concentrations differ between the two gardens.
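Each entry in the ANOVA table follows from the two sums of squares and their degrees of freedom. The sketch below starts from the Sum Sq and Df values printed in the table above and recomputes the mean squares, the F statistic, and the P-value; the variable names are chosen here for illustration.

```r
# Values read from the ANOVA table for the ozone example
ssg      <- 20   # sum of squares for groups (garden)
sse      <- 24   # sum of squares for error (Residuals)
df_group <- 1    # number of groups - 1
df_error <- 18   # total observations - number of groups

msg    <- ssg / df_group   # mean square for groups
mse    <- sse / df_error   # mean square for error
f_stat <- msg / mse        # F statistic

# P-value: upper tail of the F distribution with (df_group, df_error) df
p_value <- pf(f_stat, df_group, df_error, lower.tail = FALSE)

print(c(MSG = msg, MSE = mse, F = f_stat, p.value = p_value))
```

With only two groups, as here, a one-way ANOVA gives the same P-value as a pooled two-sample t-test (t.test with var.equal=TRUE), and F equals the square of that t statistic.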
5. Do an ANOVA test on the dataset earnings.txt
This dataset consists of independent random samples from workers in five
service-producing industries (transportation, wholesale trade, retail trade, finance,
services). The variable is the weekly earnings of randomly sampled workers from
each of the industries. At 5% significance, is there evidence that the average
weekly earnings are different amongst the five industries? Include code, output,
and interpretation.
6. Do an ANOVA test of the dataset bone.txt
This data set consists of independent samples of femurs of rats that were assigned
to one of three groups (three populations). The variable recorded
was the bone mineral density in the femur, in grams per square centimeter. One
group was a control group, one group received a low dose of Kudzu, and
one group received a high dose of Kudzu. Kudzu is a plant from Japan that is
believed to have beneficial effects on bones. At 1% significance, is there evidence
that the different treatments have different average bone densities in the rats' femurs?
Include your code, output, and interpretation.