R-project: A free environment for statistical computing

Statistical Concepts and
Analysis in R
Fish 552: Lecture 9
Outline
• Probability Distributions
• Exploratory Data Analysis
• Comparison of two samples
Random variables
• For a sample space, S, a random variable is any rule that
associates a number with each outcome in S.
• A random variable X is continuous if its set of possible values is an
entire interval of numbers
• A random variable X is discrete if its set of possible values is a finite
set or an infinite sequence
• Often we describe discrete or continuous observations by a
probability model
– e.g. binomial, normal, . . .
Probability distributions in R
• R includes a comprehensive set of probability distributions that can be used to
  simulate and model data
• If the functions for the probability model are named xxx:
  – pxxx: evaluate the cumulative distribution function P(X ≤ x)
  – dxxx: evaluate the probability mass or density function f(x)
  – qxxx: evaluate the quantile function
    (given q, the smallest x such that P(X ≤ x) ≥ q)
  – rxxx: generate random variables from the model xxx
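For example, the binomial family illustrates all four prefixes (the commented values are exact for size = 10, prob = 0.5):

```r
# The four prefixes illustrated with the binomial distribution
# (10 trials, success probability 0.5)
dbinom(5, size = 10, prob = 0.5)    # P(X = 5)  = 0.2460938
pbinom(5, size = 10, prob = 0.5)    # P(X <= 5) = 0.6230469
qbinom(0.5, size = 10, prob = 0.5)  # smallest x with P(X <= x) >= 0.5, here 5
rbinom(3, size = 10, prob = 0.5)    # three random draws from Binomial(10, 0.5)
```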
Probability distributions in R
Distribution        R name    Additional arguments
beta                beta      shape1, shape2
binomial            binom     size, prob
Cauchy              cauchy    location, scale
chi-squared         chisq     df
exponential         exp       rate
F                   f         df1, df2
gamma               gamma     shape, scale
geometric           geom      prob
hypergeometric      hyper     m, n, k
log-normal          lnorm     meanlog, sdlog
logistic            logis     location, scale
negative binomial   nbinom    size, prob
normal              norm      mean, sd
Poisson             pois      lambda
Student's t         t         df
uniform             unif      min, max
Weibull             weibull   shape, scale
Wilcoxon            wilcox    m, n
Standard normal distribution
[Figure: density curve of the standard normal distribution; x axis from −4 to 4, y axis (density) from 0.0 to 0.4]
Standard normal distribution
• Quantile
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
– Do these numbers look familiar?
> quants <- qnorm(c(0.01, 0.025, 0.05, 0.95, 0.975, 0.99))
> round(quants, 2)
[1] -2.33 -1.96 -1.64  1.64  1.96  2.33
• Probability
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> pnorm(quants)
[1] 0.010 0.025 0.050 0.950 0.975 0.990
Standard normal distribution
• Density
dnorm(x, mean = 0, sd = 1, log = FALSE)
> dnorm(quants)
[1] 0.02665214 0.05844507 0.10313564 0.10313564 0.05844507 0.02665214
• Random normal variable
rnorm(n, mean = 0, sd = 1)
> rnorm(1)
[1] -0.9392975
• Did you get the same random number?
Random number generation
• Random number generators are actually pseudo-random number
  generators: they produce a deterministic sequence of numbers that
  behaves like a random one. The state of the generator can be viewed with
  – .Random.seed
• By default, R will initialize the random sequence based on the start
time of the program
• The user can initialize the random sequence with the set.seed()
function
– set.seed(seed)
– seed can be any integer from −2147483648 to 2147483647
Random number generation
• When simulating data or working with random numbers, ALWAYS
use set.seed() and save your script or the number used to
generate the random seed.
> set.seed(34)
> rnorm(1)
[1] -0.1388900
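A quick check that a fixed seed makes the sequence reproducible:

```r
set.seed(34)
x1 <- rnorm(3)
set.seed(34)       # resetting the seed restarts the same sequence
x2 <- rnorm(3)
identical(x1, x2)  # TRUE
```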
sample() function
• sample() can be used to generate random numbers from a discrete
  distribution
– With and without replacement
– Equal and weighted probabilities
• This is the “work-horse” function of many modern statistical
techniques
– Bootstrap, MCMC, . . .
• ?sample
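As a taste of the bootstrap mentioned above, here is a minimal sketch that uses sample() to estimate the standard error of a mean by resampling; the data and the seed are made up for illustration:

```r
set.seed(552)  # arbitrary seed, for reproducibility
x <- rnorm(50, mean = 10, sd = 2)

# Resample the data with replacement 1000 times and take each mean
bootMeans <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(bootMeans)      # bootstrap estimate of the standard error of the mean
sd(x) / sqrt(50)   # the usual analytic estimate, for comparison
```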
sample() function
• Roll a die 10 times
> sample(1:6, 10, replace = TRUE)
[1] 6 2 2 6 2 5 3 4 2 3
• Flip a coin ten times
> sample(c("H", "T"), 10, replace = TRUE)
[1] "T" "H" "H" "T" "T" "H" "H" "H" "H" "T"
• Pick 5 cards
paste() concatenates vectors element-wise (recycling the shorter one) after
converting them to strings
> cards <- paste(rep(c("A", 2:10, "J", "Q", "K"), 4),
                 c("Heart", "Diamond", "Spade", "Club"))
> sample(cards, 5)
[1] "A Club" "Q Club" "9 Diamond" "9 Club" "2 Heart"
replicate() function
• Useful way to avoid loops (FISH 553)
• replicate() repeatedly evaluates an expression n times
• We are interested in the statistical properties of the median as an estimator
  of central tendency for an exponential distribution with rate 1 and small
  sample sizes. What is the standard deviation of this estimator?
  – Simulate it!
> medianResults <- replicate(n = 999,
                             expr = median(rexp(n = 10, rate = 1)))
> sd(medianResults)
[1] 0.3012640
Hands-on exercise 1
• Generate 100 random normal numbers with mean 100 and standard
  deviation 10. What proportion of the random numbers are more than 2
  standard deviations away from the mean?
• Select 6 numbers from a lottery containing 56 balls. Go to:
http://www.walottery.com/sections/WinningNumbers/
Did you win?
• For a standard normal random variable, find the number z such that
P(-z ≤ Z ≤ z) = 0.23
– Use the symmetry of the normal distribution
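One possible sketch for the first and third parts (the lottery part is just sample(1:56, 6); the seed is an arbitrary choice):

```r
set.seed(1)  # arbitrary, for reproducibility

# Part 1: proportion of draws more than 2 SDs from the mean
x <- rnorm(100, mean = 100, sd = 10)
mean(abs(x - 100) > 2 * 10)

# Part 3: find z with P(-z <= Z <= z) = 0.23.
# By symmetry, P(Z <= z) = 0.5 + 0.23 / 2
z <- qnorm(0.5 + 0.23 / 2)
pnorm(z) - pnorm(-z)  # check: 0.23
```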
Exploratory data analysis
• Important starting point in any study, analysis, etc.
• Numerical summaries can be used to quantitatively assess
  characteristics of data
  – summary()
  – boxplot()
  – fivenum()
  – sd(), range(), etc.
• Visualizing characteristics of data is often more informative
  – There are a lot of built-in plots; find the one you want!
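For example, the numerical summaries applied to the built-in iris data used below:

```r
summary(iris$Sepal.Length)   # min, quartiles, mean, max
fivenum(iris$Sepal.Length)   # Tukey's five-number summary
sd(iris$Sepal.Length)        # about 0.83
range(iris$Sepal.Length)     # 4.3 7.9
```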
Edgar Anderson’s Iris Data
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
• Species is coded as a factor
  – Important for several plotting routines
> is.factor(iris$Species)
[1] TRUE
pairs()
• Produces a matrix of scatterplots
pairs(iris[,1:4], main = "Edgar Anderson's Iris Data", pch = 21,
bg = rep(c("red", "green3", "blue"), table(iris$Species)))
The rep() call repeats "red" once per setosa observation, "green3" once per
versicolor observation, and "blue" once per virginica observation, so each
point is colored by its species.
[Figure: pairs() scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, titled "Edgar Anderson's Iris Data"]
boxplot()
• ?boxplot
• boxplot(Sepal.Length ~ Species, data = iris, col = c("red", "green3", "blue"),
          main = "Edgar Anderson's Iris Data")
[Figure: boxplots of Sepal.Length for setosa, versicolor and virginica; y axis from 4.5 to 8.0, titled "Edgar Anderson's Iris Data"]
Histograms
• Useful for examining the shape, center and spread of data
• ?hist
  – Lots of options
• When values are continuous, measurements must be sub-divided into
  intervals. In R the user can specify this with
  – A vector of break points
  – A number of breaks
  – A character string specifying an algorithm (default = nclass.Sturges)
• Use the default to avoid introducing bias
hist(iris$Sepal.Length[iris$Species == "setosa"], col="red",
xlab="Setosa Sepal Length", main="Histogram")
[Figure: histogram of setosa Sepal.Length; x axis "Setosa Sepal Length" from 4.5 to 5.5, y axis "Frequency" from 0 to 12, titled "Histogram"]
Adding density curves
• Kernel density estimation is a non-parametric way of estimating a
probability density function
– The bandwidth determines how smooth this estimated curve is
• Smaller bandwidths = more wiggly
• Bias can be introduced and often it is best to let R choose the
optimal bandwidth
• There are also many kernel density methods and it is best for now to
just use the default
hist(iris$Sepal.Length[iris$Species == "setosa"], col="red", freq=FALSE, border="gray",
xlab="Setosa Sepal Length", main="Histogram")
lines(density(iris$Sepal.Length[iris$Species == "setosa"]), lwd=2)
[Figure: the same histogram plotted on the density scale (freq = FALSE) with the kernel density curve overlaid; y axis "Density" from 0.0 to 1.2]
Other useful plots
Plot                                               R function
Barplot                                            barplot()
Contour lines of a two-dimensional distribution    contour()
Plot of two variables conditioned on a third       coplot()
Dotchart                                           dotchart()
Pie chart                                          pie()
3-dimensional surface plot                         persp()
Quantile-quantile plot                             qqplot()
Stripchart                                         stripchart()
Basic statistical tests in R
• R has many built in functions to perform classical statistical tests
– correlation – cor.test()
– Chi squared – chisq.test()
– Test of equal proportions - prop.test()
• Covered in Lecture 10
– ANOVA – aov()
– Linear Models – lm()
Comparison of two samples
• The t-test tests whether two population means (with unknown
  population variances) differ significantly from each other
  (two-sided), or whether one is smaller/larger (one-sided).
H0: μ1 – μ2 = 0
H1: μ1 – μ2 ≠ 0
• Independent vs. paired t-test
• The two-sample independent t-test assumes
– Normality of populations (but not an issue for large samples)
– Equal variances (this can be relaxed)
– Independent samples
t.test()
• The t.test() function in R can be used to perform many variants
of the t-test
– ?t.test
• Specify direction, μ, α
• Two methods were used to determine the latent heat of fusion of ice.
  The investigator wishes to find out how much (if at all) the methods
  differed.
  – The data are entered as methodA and methodB
Assumptions
• If these populations were normal, then the points should fall around the line
  – A has a strong left skew; B has a right skew
  – qqnorm(methodA); qqline(methodA)
[Figure: normal Q-Q plots for methodA and methodB; theoretical quantiles from −1.5 to 1.5 against sample quantiles between about 79.94 and 80.04]
Assumptions
[Figure: side-by-side boxplots of methodA and methodB, values from about 79.94 to 80.04]
• Equal variance seems like a reasonable assumption
t-test
• The Welch two sample t-test (the default) assumes unequal variances
> t.test(methodA, methodB)

        Welch Two Sample t-test

data:  methodA and methodB
t = 3.274, df = 12.03, p-value = 0.006633
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01405393 0.06992684
sample estimates:
mean of x mean of y
 80.02062  79.97862
Accessing output
• The results shown on the previous slide can be accessed by assigning a
  name to the output and accessing individual elements as usual
  – What data type do you think this is?
> resultsAB <- t.test(methodA, methodB)
> names(resultsAB)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"
[6] "null.value"  "alternative" "method"      "data.name"
> resultsAB$p.value
[1] 0.006633411
t-test
• The box-plot suggested that the two variances might be
approximately equal
> var(methodA)
[1] 0.0005654231
> var(methodB)
[1] 0.0009679821
• More formally we can perform an F-test to test for equality in the
variances
F-test
• If we take two samples of size n1 and n2 from normal populations, then the
  ratio of the sample variances has an F-distribution with n1 – 1 and n2 – 1
  degrees of freedom.
H0: σ1² / σ2² = 1
H1: σ1² / σ2² ≠ 1
F-test
• There is no evidence of a significant difference between the two variances
> var.test(methodA, methodB)
F test to compare two variances
data: methodA and methodB
F = 0.5841, num df = 12, denom df = 7, p-value = 0.3943
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1251922 2.1066573
sample estimates:
ratio of variances
0.5841255
t-test
• We should now specify equal variances
> t.test(methodA, methodB, var.equal = TRUE)
Two Sample t-test
data:
methodA and methodB
t = 3.4977, df = 19, p-value = 0.002408
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01686368 0.06711709
sample estimates:
mean of x mean of y
80.02062
79.97862
Non-parametric tests
• The normality assumption (based on the qq-plots) is probably not
  met in this case, particularly with small sample sizes
• The two-sample Wilcoxon (Mann-Whitney) is a useful alternative
when the assumptions of the t-test are not met
– Have you seen this test before?
Mann-Whitney test
• How it works:
– Arrange all the observations into a ranked series
– Sum the ranks from sample 1 (R1) and sample 2
– Calculate the test-statistic U
n1(n1  1)
n1(n1  1) 

U  max{R1 
, n1n2   R1 
}

2
2


• The distribution under H0 can be found by enumerating all possible
  subsets of the ranks (assuming each subset is equally likely) and
  computing the probability of observing a test statistic at least as
  extreme as the one obtained.
  – This can be a cumbersome calculation for large sample sizes
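The calculation above can be carried out by hand in R; the two small samples below are made up for illustration:

```r
# Made-up samples with no ties
x <- c(1.1, 2.3, 3.7, 4.1)  # sample 1, n1 = 4
y <- c(0.9, 1.8, 2.0)       # sample 2, n2 = 3
n1 <- length(x); n2 <- length(y)

# Rank all observations together and sum the ranks of sample 1
allRanks <- rank(c(x, y))
R1 <- sum(allRanks[seq_len(n1)])

U1 <- R1 - n1 * (n1 + 1) / 2     # here 10
U  <- max(U1, n1 * n2 - U1)      # the U statistic from the formula above

# R's wilcox.test() reports W, which equals U1 for the first sample
wilcox.test(x, y)$statistic      # W = 10
```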
Mann-Whitney test
• When there are ties in the data, this method provides an approximate p-value
• If the sample size is less than 50 and there are no ties in the observations,
  by default R will calculate an exact p-value
  – When this is not the case, a normal approximation is used
• ?wilcox.test
Mann-Whitney test
• Once again we reach the same conclusion
> wilcox.test(methodA, methodB)

        Wilcoxon rank sum test with continuity correction

data:  methodA and methodB
W = 88.5, p-value = 0.008995
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(methodA, methodB) :
  cannot compute exact p-value with ties
• Ranks are sensitive to rounding, and R will produce a warning if rounded
  observations are nearly equal
Hands-on exercise 2
• Create 10 qqnorm plots (including qqline) by sampling 30 points
from the following distributions: normal, exponential, t, Cauchy
• Make between 2 and 5 of the plots come from a normal distribution
• Have your partner guess which plots are actually from a normal
distribution
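A possible sketch (the seed and the split of three normal panels are arbitrary choices within the exercise's 2-to-5 range):

```r
set.seed(99)  # arbitrary
samplers <- list(normal      = function(n) rnorm(n),
                 exponential = function(n) rexp(n),
                 t           = function(n) rt(n, df = 3),
                 Cauchy      = function(n) rcauchy(n))

# Three normal panels, the remaining seven from the other distributions,
# shuffled so the normal panels land in random positions
whichDist <- sample(c(rep(1, 3), sample(2:4, 7, replace = TRUE)))

par(mfrow = c(2, 5))
for (i in 1:10) {
  z <- samplers[[whichDist[i]]](30)
  qqnorm(z, main = paste("Plot", i))
  qqline(z)
}
```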