LAB 3 – T-distributions, Confidence Intervals and Hypothesis Tests for
One Population Mean and Two Matched Population Means
ECA 225 has open lab hours if you need to finish LAB 3. The lab is open
Monday-Thursday 6:30-10:00pm and Saturday-Sunday 2:00-6:00pm.
To download R onto your own personal computer, go to:
http://cran.r-project.org/bin/windows/base/
Click on the link for R-2.6.1-win32.exe. Save the file to your computer. Then
click on the file to start the installation to your computer.
Your submission to LAB 3 should consist of answering the numbered questions as you
work through the Lab.
***AS YOU ARE WORKING THROUGH THE LAB, copy and paste each output into a
blank word file****
You can either print the completed Word file and turn it in, or you can e-mail the
Word file to me for your LAB 3 grade.
Everything MUST be done in R and included in your word file.
***********************************************************************
***********************************************************************
Access R
On the desktop or through the Programs Menu, find the R icon and click on it.
You should be brought to a screen with a command prompt.
T-distributions
A t-curve, or Studentized curve, is a density curve like a normal distribution.
You can graph t-curves or calculate areas under a t-curve similar to how we did in Lab 2.
First, let’s create a data set of x-values between –4 and 4 with increments of 0.01.
> data<-seq(-4,4,0.01)
Then, plot a normal curve (z-curve) as a dashed line (hence lty=2):
> plot(data, dnorm(data), type="l",lty=2)
Now, overlay a Student’s t curve with df = 5 as a solid line to see the difference.
> lines(data, dt(data, df=5))
1. Compare the t-curve and the normal curve above in your own words.
2. Overlay a second Student’s t curve, with df=10. Include your graph and compare
this third curve to the previous two.
We can also calculate areas under a Student’s t curve. In Lab 2, to calculate the area
to the left of 3 in a normal curve with mean 4 and standard deviation 0.5, the following
was the command:
>pnorm(3,mean=4,sd=0.5)
[1] 0.02275013
The command is similar in a t-curve. However, t-curves are always centered at
zero. If you wanted to find the area to the left of 3 in a Student’s t-curve with df = 2, the
following is the command.
>pt (3,df=2)
[1] 0.952267
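The questions that follow ask for areas to the right, while pt and pnorm return areas to the left by default. Here is a minimal sketch of two equivalent ways to get a right-tail area, reusing the df = 2 example above. The variable names left_area, right_area, and right_area2 are only illustrative.

```r
# Area to the left of 3 under a t-curve with df = 2 (same as above)
left_area <- pt(3, df = 2)

# Option 1: the total area under the curve is 1, so subtract the left area
right_area <- 1 - left_area

# Option 2: ask pt for the upper tail directly
right_area2 <- pt(3, df = 2, lower.tail = FALSE)

print(right_area)   # about 0.0477
print(right_area2)  # same value
```

Either form works for pnorm as well, since it accepts the same lower.tail argument.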
3. Find the area to the right of z = 2.5. For all questions, include code and output in
your lab.
4. Find the area to the right of t = 2.5, with df = 5.
5. Find the area to the right of t = 2.5, with df = 25.
6. What would you have estimated if you used the t-table in your book to answer
question 5?
7. Rank the areas of questions 3-5 from smallest to largest. Explain why the areas
are ranked in that order, in your own words.
This lab will look at some different data sets. To load a data set into R, you
have 2 options: you can download it from my website or you can save the data on your
computer. In either case, the data set needs to be formatted correctly as a text file, with
variables represented by columns and tabs separating the columns if there is more than
one variable.
To download data from my website:
The dataset for this example can be found on my website and is saved as streams.txt.
Create a name for your data set. For example, since this data set looks at
biodiversity scores for two sections of a stream, I will call it streams. The command to
load in your data is:
> site="http://math.asu.edu/~coombs/streams.txt"
> streams<-read.table(file=site, header=T)
Then, so that we can call the variables in the future by name, enter the following
lines of code:
>attach(streams)
>names(streams)
output: [1] "down" "up"
To upload data from a file on your computer:
First, you need to know the exact path where your file is saved. Notice below that
there are double “\\” separating directories.
>streams<-read.table("C:\\Documents and Settings\\Desktop\\streams.txt", header=T)
> attach(streams)
> names(streams)
output: [1] "down" "up"
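If you are unsure whether your own file is formatted correctly, a round trip like the following can help. This is a minimal sketch with made-up numbers (not the real streams data): it writes a small tab-separated file to a temporary location and reads it back. It also shows the $ notation as an alternative to attach, which avoids clashes between variable names and objects that already exist in R.

```r
# Made-up scores for illustration only (NOT the real streams data)
toy <- data.frame(down = c(5, 7, 6, 4), up = c(6, 9, 6, 7))

# Write a tab-separated text file with a header row to a temporary path
path <- tempfile(fileext = ".txt")
write.table(toy, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back the same way the lab reads streams.txt
toy2 <- read.table(file = path, header = TRUE)
print(names(toy2))

# Columns can be used without attach() by naming the data frame
print(toy2$up - toy2$down)
```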
Two Matched Population Means
As we saw in class, the one population t-test can be used in a two population matched
design. You first have to transform the data into a sample of differences and then do the
hypothesis test or confidence interval on the differences.
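As a quick check of this idea, the sketch below uses made-up before/after numbers (hypothetical, not one of the lab data sets). It shows that a one-sample t-test on the differences gives the same result as R's built-in paired option, t.test(x, y, paired=TRUE).

```r
# Hypothetical matched measurements on the same 6 subjects
before <- c(12.1, 10.4, 11.8, 13.0, 9.7, 12.5)
after  <- c(11.2, 10.1, 10.9, 12.8, 9.9, 11.4)

# Approach 1: build the differences, then run a one-sample t-test
d <- before - after
res_diff <- t.test(d)

# Approach 2: let R pair the observations
res_pair <- t.test(before, after, paired = TRUE)

# Both give the same t statistic, df, and P-value
print(c(res_diff$statistic, res_diff$parameter, p = res_diff$p.value))
print(c(res_pair$statistic, res_pair$parameter, p = res_pair$p.value))
```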
The dataset for this example can be found on my website and is saved as streams.txt.
This example looks at composite biodiversity scores based on samples of
aquatic invertebrates. The two samples are taken from the same river, one upstream
and one downstream from the same sewage outfall. Since the samples are taken from
the same river, the design is matched, or dependent.
The question of interest is whether or not there is a difference between the average
scores of upstream and downstream.
First, you need to create the differences between upstream and downstream:
>diff<- up-down
Now, conduct a t-test on the matched pair differences:
> t.test(diff)
output:
One Sample t-test
data:  diff
t = 3.0502, df = 15, p-value = 0.0081
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.2635612 1.4864388
sample estimates:
mean of x
0.875
To interpret the hypothesis test: the test statistic is 3.0502, the degrees of freedom
are 15, the P-value is 0.0081, and the sentence underneath, "alternative hypothesis: true
mean is not equal to 0", tells you that R used a two-tailed alternative.
At the bottom of the output, R tells you that x-bar, the sample average
difference, is 0.875. My interpretation of the non-directional alternative is: At 5%
significance, the data provide evidence that there is a difference between the average
biodiversity scores upstream and downstream.
If you wanted to do a one-tailed test, change the coding to:
> t.test(diff, alt="less")
for left-tailed, or:
> t.test(diff, alt="greater")
for right-tailed.
To interpret the confidence interval, R reports limits of 0.2635612 and
1.4864388. Since the interval consists of all positive numbers, it is telling you that "up"
is greater than "down", on average. My interpretation of the interval would be: At 95% confidence,
the average biodiversity score upstream is between 0.2636 and 1.4864 greater than the
average biodiversity score downstream.
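To see where these numbers come from, the sketch below rebuilds the P-value and the confidence interval from the summary values printed in the output above (mean difference 0.875, t = 3.0502, df = 15). It assumes the usual one-sample formulas, t = x-bar / SE and x-bar ± t* × SE. The variable names are only illustrative, and the results match the output up to rounding.

```r
xbar  <- 0.875    # sample mean of the differences (from the output)
tstat <- 3.0502   # test statistic (from the output)
df    <- 15       # degrees of freedom = n - 1

# Standard error implied by t = xbar / SE
se <- xbar / tstat

# Two-tailed P-value: twice the area beyond |t|
pval <- 2 * pt(-abs(tstat), df = df)

# 95% confidence interval: xbar +/- t* x SE
tcrit <- qt(0.975, df = df)
ci <- c(xbar - tcrit * se, xbar + tcrit * se)

print(round(pval, 4))
print(round(ci, 4))
```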
8. Do a hypothesis test and confidence interval for the dataset tire.txt
The data set consists of eleven tires that were measured for treadwear by two
methods, one based on weight and one based on groove wear. You are interested
in whether there is a difference in the two methods. The data is in thousands of
miles. Include your code, output, and interpretations.
9. Do a hypothesis test for the dataset eye.txt.
The data set consists of 8 people who have glaucoma in one eye but not in the
other eye. The variable compares corneal thickness, in microns, between the
two eyes. Do a one-tailed test to see if the normal eye is, on average, thicker than
the glaucoma eye. Include your code, output, and interpretations.
One Population Mean
Extending the previous idea to a one-population test should be just a few small
changes. It is the same test and code, except now your hypothesized mean, μ0, may not be zero, which is the
default value in R.
Here is an example of how to do a one population test about a mean in R.
This example uses the data set phones.txt found on my website. The data set
looks at monthly bills, in dollars, for a sample of 50 cell phone users.
> site="http://math.asu.edu/~coombs/phones.txt"
> phones<-read.table(file=site, header=T)
> attach(phones)
> names(phones)
output: [1] "BILL"
Suppose you want to see if this area’s average is different than the national
average of $40.
First, let’s check the normality assumption:
> qqnorm(BILL)
You can see from the normal probability plot that the sample is NOT normal:
the graph is not linear, it is definitely curved. However, since our
sample size is 50, it is large enough that, by the Central Limit Theorem, the x-bars
will be approximately normally distributed. So the normality assumption isn’t needed here. If the
sample size were small, however, you should not perform the following t-test.
> t.test(BILL, mu=40)
One Sample t-test
data:  BILL
t = 0.2843, df = 49, p-value = 0.7774
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
33.51006 48.62914
sample estimates:
mean of x
41.0696
You can see a test statistic that is very small and a P-value that is very large.
My interpretation of the hypothesis test would be: At 5% significance, the data do
not provide evidence that this area’s average monthly phone bill is different than the
national average of $40.
My interpretation of the confidence interval would be: At 95% confidence, the
average phone bill for the area is between $33.51 and $48.63 per month.
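One way to convince yourself that this is the same test as in the matched-pairs section: testing whether the mean equals some value μ0 is equivalent to testing whether the mean of (data − μ0) equals zero. Here is a minimal sketch with made-up bills (hypothetical numbers, not the phones data).

```r
# Hypothetical monthly bills in dollars (NOT the phones data)
bills <- c(35.2, 48.9, 41.0, 29.5, 52.3, 38.7, 44.1, 40.6)
mu0 <- 40   # hypothesized mean

# Test H0: mean = 40 directly
res_mu <- t.test(bills, mu = mu0)

# Shift the data by mu0 and test H0: mean = 0 (R's default)
res_shift <- t.test(bills - mu0)

# Same t statistic and P-value either way
print(c(res_mu$statistic, p = res_mu$p.value))
print(c(res_shift$statistic, p = res_shift$p.value))

# The confidence intervals differ only by the shift of mu0
print(res_mu$conf.int)
print(res_shift$conf.int + mu0)
```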
****FOR BOTH #10 and #11, check the normality assumption*****
10. Do a hypothesis test and confidence interval for the data set: mushroom.txt
At the 5% significance level, do the data provide evidence that the average cadmium
level in Boletus pinicola mushrooms is higher than the government’s
recommended limit of 0.5 ppm? The data set consists of 12 cadmium levels of
Boletus pinicola mushrooms in parts per million. Include code, output, normality
assumption checking, and interpretation. (If you want the confidence interval, the
t-test must be two-tailed. So, you’ll need two t-test codes to answer this question.)
11. Do a hypothesis test and confidence interval for the data set: stress.txt
At the 5% significance level, do the data provide evidence that the average blood
pressure of bus drivers exceeds the normal diastolic blood pressure of 80 mmHg?
The data set consists of 41 blood pressure measurements from bus drivers in
millimeters of mercury. Include code, output, normality assumption checking,
and interpretation.
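The note in question 10 about needing two t-tests can be seen in a small sketch. With made-up data (hypothetical values, not mushroom.txt or stress.txt), a one-tailed t.test reports a one-sided interval whose upper limit is infinite, while the default two-tailed call gives the usual two-sided interval.

```r
# Hypothetical measurements (NOT from the lab data sets)
x <- c(0.62, 0.48, 0.71, 0.55, 0.59, 0.66, 0.44, 0.68)

# One-tailed test of H0: mean = 0.5 against Ha: mean > 0.5
res_one <- t.test(x, mu = 0.5, alternative = "greater")
print(res_one$conf.int)   # lower bound only; upper limit is Inf

# Two-tailed call, used here only for the two-sided interval
res_two <- t.test(x, mu = 0.5)
print(res_two$conf.int)
```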