R: A Statistics Program
For Teaching & Research
Josué Guzmán
11 Nov. 2007
Some Useful R Links
• R Home Page www.r-project.org
• CRAN http://cran.r-project.org
• Precompiled Binary Distributions
• Windows (95 and later)
• R Manuals
© J. Guzmán, 2007
R: Stat. Prog. for Teach. & Res.
R Installation
• R: Statistical Analysis & Graphics
• Freely Available Under GPL
• Binary Distributions
• Installation – Standard Steps
Running R
Statistical Programming with R
• Learn Language Basics
• Learn Documentation / Help System
• Learn Data Manipulation & Graphics
• Perform Basic Statistical Analysis
First Steps: Interacting with R
• Type a Command & Press Enter
• R Executes (printing the result if relevant)
• R waits for more input
Some Examples
> 2 * 2
[1] 4
> exp(-2)
[1] 0.1353353
> rdmnorm = rnorm(1000)
R Functions
• exp, log and rnorm are functions
• Function calls are indicated by the presence
of parentheses
Example:
> hist(rdmnorm, col = "magenta")
Variables and Assignments
Assign values with the = operator; the <- operator also works
x = 2.2
y = x + 3.5
sqrt(x)
y
x ^ y
Variables and Assignments
• Variable names cannot start with a digit
• Names are Case-Sensitive
• Some common names are already used by R
• Examples: c, q, t, C, D, F, I, T
• Should be avoided
Vectorized Arithmetic
• Elementary data types in R are all vectors
• The c(...) construct is used to create vectors:
• Bolstad, 2004, exercise 13.2, page 253
fertilizer = c(1, 1.5, 2, 2.5,
3, 3.5, 4, 4.5)
fertilizer
Vectorized Arithmetic [cont.]
• Arithmetic operations (+, -, *, /, ^) and
mathematical functions (sin, cos, log, …)
work element-wise on vectors
yield = c(25, 31, 27, 28,
36, 35, 32, 34)
log(yield)
Vectorized Arithmetic [cont.]
sum.yield = sum(yield)
sum.yield
n = length(yield)
n
avg.yield = sum.yield/n
avg.yield
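The element-wise behaviour of arithmetic on vectors can be checked with a short sketch built on the yield values from the slides. The names doubled and centered are illustrative assumptions, not part of the original example.

```r
# Yield values copied from the slides
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

# Arithmetic applies to every element at once
doubled = yield * 2              # hypothetical rescaling, for illustration only
centered = yield - mean(yield)   # deviation of each value from the mean

# Mathematical functions are element-wise as well
log.yield = log(yield)
length(log.yield)                # same length as the input vector

# The manual average agrees with the built-in mean()
avg.yield = sum(yield) / length(yield)
avg.yield == mean(yield)
```

Because every operation returns a vector of the same length, no explicit loop is needed for this kind of calculation.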
Graphics
• plot(x, y) function – simple way to produce
R graphics:
plot(fertilizer, log(yield),
main = "Fertilizer vs. Yield")
Getting Help
• help.start( )
Starts a browser window with an HTML help
interface. Links to the manual "An Introduction to R",
as well as topic-wise listings.
• help(topic)
Help page for a particular topic or function. Every
R function has a help page.
• help.search("search string")
Subject/keyword search
Getting Help [cont.]
• Short-cut: question mark (?)
> help(plot)
> ? plot
• To search for a specific subject, use the
help.search function. Example:
> help.search("logarithm")
apropos( )
• apropos function - list of topics that partially
match its argument:
> apropos("plot")[1:10]
 [1] ".__C__recordedplot" "biplot"
 [3] "interaction.plot"   "lag.plot"
 [5] "monthplot"          "plot.TukeyHSD"
 [7] "plot.density"       "plot.ecdf"
 [9] "plot.lm"            "plot.mlm"
R Packages
• R makes use of a system of packages
• Each package is a collection of routines with a
common theme
• The core of R itself is a package called base
• A collection of packages is called a library
• Some packages are already loaded when R starts
up
• Other packages need to be loaded using the library
function
R Packages [cont.]
Several packages come pre-installed with R:
> installed.packages( )[, 1]
 [1] "ISwR"     "KernSmooth" "MASS"    "base"
 [5] "boot"     "class"      "cluster" "foreign"
 [9] "graphics" "grid"       "lattice" "methods"
[13] "mgcv"     "nlme"       "nnet"    "rpart"
[17] "spatial"  "splines"    "stats"   "stats4"
[21] "survival" "tcltk"      "tools"   "utils"
Contributed Packages
• Many packages are available from CRAN
• Some packages are already loaded when R starts up. To list
the currently loaded packages, use search:
> search( )
[1] ".GlobalEnv"    "package:tools"    "package:methods"
[4] "package:stats" "package:graphics" "package:utils"
[7] "Autoloads"     "package:base"
R Packages
• Can be loaded by the user. Example: UsingR package
> library(UsingR)
• New packages are downloaded using the
install.packages function:
> install.packages("UsingR")
> library(help = UsingR)
Data Types
• vector – Set of elements in a specified order
• matrix – Two-dimensional array of elements of the
same mode
• factor – Vector of categorical data
• data frame – Two-dimensional array whose
columns may represent data of different modes
• list – Set of components that can be any other
object type
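As a rough illustration of the five data types listed above, the following sketch builds one small object of each kind. All object names and values here are made up for the example.

```r
# vector: ordered set of elements of one mode
v = c(2.5, 3.1, 4.7)

# matrix: two-dimensional array, all elements of the same mode
m = matrix(1:6, nrow = 2, ncol = 3)

# factor: categorical data stored with a fixed set of levels
f = factor(c("low", "high", "low"))

# data frame: columns may hold data of different modes
df = data.frame(dose = v, group = f)

# list: components can be any other object type
lst = list(numbers = v, table = m, frame = df)

dim(m)              # number of rows and columns
levels(f)           # the distinct categories
sapply(df, class)   # class of each data frame column
names(lst)          # component names of the list
```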
Editing Data Sets
• Can create and modify data sets on the command line
> xx = seq(from = 1, to = 5)
> xx
> x2 = 1 : 5
> x2
> yy = scan( )
5 8 10 4 2 6 20
11 21 32 43 55
> yy
• Can edit a data set once it is created
> edit(mydata)
> data.entry(mydata)
Built-in Data
Data from a library:
> library(UsingR)
> attach(cfb)    # Consumer-Finances Survey
> cfb$INCOME
> cfb$EDUC
> educ.fac = factor(EDUC)
> plot(INCOME ~ educ.fac, xlab = "EDUCATION", ylab = "INCOME")
> detach(cfb)
Data Modes
• logical – Binary mode, values represented as
TRUE or FALSE
• numeric – Numeric mode [integer, single, &
double precision]
• complex – Complex numeric values
• character – Character values represented as
strings
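The mode of a value can be inspected with the base function mode( ). The short sketch below uses arbitrary example values, one per mode listed above.

```r
# One example value per mode; the values themselves are arbitrary
m.logical   = mode(TRUE)
m.numeric   = mode(3.14)
m.integer   = mode(2L)       # integers also report numeric mode
m.complex   = mode(1 + 2i)
m.character = mode("text")

c(m.logical, m.numeric, m.integer, m.complex, m.character)
```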
Data Frames
• read.table( ) – Reads in data from an external file
> read.table("data.txt" , header = T)
> read.table(file = file.choose( ), header = T)
• data.frame – Binds R objects of various kinds
read.table Function
• Reads ASCII file, creates a data frame
• Data in tables of rows and columns
• If first line contains column labels:
Use argument header = T
• Field separator is white space
• Also read.csv and read.csv2
– Assume , and ; separations, respectively
• Treats characters as factors (the default before R 4.0.0)
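To try read.table and read.csv without an existing data file, one option is to write a small table to a temporary file first. The file contents and the names tmp, tmp.csv, dat and dat2 below are assumptions made for this sketch.

```r
# Write a small whitespace-separated table to a temporary file
tmp = tempfile(fileext = ".txt")
writeLines(c("fertilizer yield",
             "1.0 25",
             "1.5 31",
             "2.0 27"), tmp)

# header = TRUE because the first line holds the column labels
dat = read.table(tmp, header = TRUE)
str(dat)

# The same data as comma-separated values, read back with read.csv
tmp.csv = tempfile(fileext = ".csv")
write.csv(dat, tmp.csv, row.names = FALSE)
dat2 = read.csv(tmp.csv)
all.equal(dat, dat2)
```

Writing TRUE in full rather than T is a little safer, because T is an ordinary variable that can be reassigned.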
save( ) and load( )
• Used for R Functions and Objects
• Saved files can be read back only with load( )
> x = 23
> y = 44
> save(x, y, file = "xy.Rdata")
> load("xy.Rdata")
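A round trip through save( ) and load( ) can be sketched as follows. A temporary file stands in for "xy.Rdata" so that nothing is written to the working directory; the names f and restored are illustrative.

```r
x = 23
y = 44

# Save both objects to a temporary file (stand-in for "xy.Rdata")
f = tempfile(fileext = ".RData")
save(x, y, file = f)

# Remove the objects, then restore them from the file
rm(x, y)
restored = load(f)   # load() returns the names of the restored objects
restored
x + y
```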
Comparison Operators
!=    Not Equal To
<     Less Than
<=    Less Than or Equal To
==    Exactly Equal To
>     Greater Than
>=    Greater Than or Equal To
Some Logical Operators
!    Not
|    Or (For Calculating Vectors and Arrays of Logicals)
&    And (For Calculating Vectors and Arrays of Logicals)
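Comparison and logical operators also work element-wise, which makes them useful for selecting values. The sketch below reuses the yield values from the earlier slides; the cut-offs 27, 32 and 36 are arbitrary choices for illustration.

```r
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

# Comparisons return one logical value per element
high = yield >= 32
low  = yield < 27

# Combine logical vectors element by element
extreme  = high | low
not.high = !high

# Logical vectors can index the original data
selected = yield[yield > 27 & yield != 36]
selected

# TRUE counts as 1, so sum() counts the matches
sum(high)
```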
Some Mathematical Functions
abs              Absolute Value
ceiling          Next Larger Integer
floor            Next Smaller Integer
cos, sin, tan    Trigonometric Functions
exp(x)           e^x [e = 2.71828 …]
log              Natural Logarithm
log10            Logarithm Base 10
sqrt             Square Root
Statistical Summary Functions
length      Length of Object
max         Maximum Value
mean        Arithmetic Mean
median      Median
min         Minimum Value
prod        Product of Values
quantile    Empirical Quantiles
sum         Sum
var         Variance - Covariance
sd          Standard Deviation
cor         Correlation Between Vectors or Matrices
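Several of the summary functions above can be tried on the fertilizer and yield vectors from the earlier slides. This is only a sketch; the result names are illustrative.

```r
fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
yield      = c(25, 31, 27, 28, 36, 35, 32, 34)

avg = mean(yield)                    # arithmetic mean
med = median(yield)                  # middle of the sorted values
rng = c(min(yield), max(yield))      # smallest and largest value
qts = quantile(yield, c(0.25, 0.75)) # empirical quartiles
v   = var(yield)                     # sample variance
s   = sd(yield)                      # sample standard deviation
r   = cor(fertilizer, yield)         # correlation between the two vectors

c(mean = avg, median = med, sd = s, cor = r)
```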
Sorting and Other Functions
rev        Put Values of Vectors in Reverse Order
sort       Sort Values of Vector
order      Permutation of Elements to Produce Sorted Order
rank       Ranks of Values in Vector
match      Detect Occurrences in a Vector
cumsum     Cumulative Sums of Values in Vector
cumprod    Cumulative Products
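The difference between sort, order and rank is easiest to see on a small vector with a tie. The vector x below is made up for this sketch.

```r
x = c(30, 10, 20, 10)

rev(x)              # 10 20 10 30
sort(x)             # 10 10 20 30
ord = order(x)      # positions that would sort x: 2 4 3 1
x[ord]              # same result as sort(x)
rk = rank(x)        # tied values share the average rank: 4.0 1.5 3.0 1.5
pos = match(20, x)  # position of the first occurrence of 20: 3
cumsum(x)           # 30 40 60 70
cumprod(1:4)        # 1 2 6 24
```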
Plotting Functions Useful for
One-Dimensional Data
barplot     Bar plot
boxplot     Box & Whisker plot
hist        Histogram
dotchart    Dot plot
pie         Pie chart
Plotting Functions Useful for
Two-Dimensional Data
plot      Creates a scatter plot:
> plot(x, y)
qqnorm    Quantile-quantile plot sample vs. N(0, 1):
> qqnorm(x)
qqplot    Plot quantile-quantile plot for two samples:
> qqplot(x , y)
pairs     Creates a pairs or scatter plot matrix:
> attach(babies)
> pairs(babies[ , c("gestation", "wt", "age", "inc" ) ] )
Three-Dimensional Plotting
Functions
contour    Contour plot
persp      Perspective plot
image      Image plot
Probability Distributions Using R
• Pseudo-random sampling
> sample(0:20, 5)                 # select 5 without replacement
> sample(0:20, 5, replace = T)    # select 5 with replacement
• Coin toss simulation [0 = tail; 1 = head]
20 tosses:
> sample(c(0, 1), 20, replace=T)
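Because sample( ) draws pseudo-random values, results differ from run to run unless a seed is set. The sketch below fixes a seed (the value 2007 is an arbitrary choice) and checks the properties that hold for any seed.

```r
set.seed(2007)   # arbitrary seed, only to make the run reproducible

# Without replacement: no value can appear twice
wor = sample(0:20, 5)
anyDuplicated(wor)     # 0 means there are no duplicates

# With replacement: repeated values are possible
wr = sample(0:20, 5, replace = TRUE)

# 20 coin tosses coded as 0 = tail, 1 = head
tosses = sample(c(0, 1), 20, replace = TRUE)
sum(tosses)            # number of heads
mean(tosses)           # proportion of heads
```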
For Any Probability Distribution
ddist    density or probability
pdist    cumulative probability
qdist    quantiles [percentiles]
rdist    pseudo-random selection
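The d/p/q/r naming pattern can be illustrated with the normal distribution, where dist is norm. This is a minimal sketch using the standard normal; the seed and the probability 0.3 are arbitrary choices.

```r
# d: density at a point (the standard normal peaks at 1/sqrt(2*pi))
d0 = dnorm(0)

# p: cumulative probability P(X <= 1.96)
p196 = pnorm(1.96)

# q: quantile, the inverse of the cumulative probability
q975 = qnorm(0.975)

# q and p undo each other
back = pnorm(qnorm(0.3))

# r: pseudo-random draws
set.seed(1)      # arbitrary seed for reproducibility
r5 = rnorm(5)

c(d0, p196, q975, back)
```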
Binomial Distribution
X ~ Binomial(n , p) ; x = 0, 1, …, n
dbinom(x , n , p )    Density or point probability
pbinom(x , n , p )    Cumulative distribution
qbinom(q , n , p )    Quantiles [0<q<1]
rbinom(m , n , p )    Pseudo-random numbers
Binomial Distribution
Coin toss simulation:
> x = 0:20                                  # num. of heads in 20 tosses
> px = dbinom(x , size = 20, prob = 0.5)
> plot(x , px, type = "h")                  # graph display
> curve(dnorm(x, 10, sqrt(20*.5*.5)), col=2, add=T)
[Figure: binomial probabilities px plotted against x, for x = 0 to 20]
Normal Distribution
X ~ Normal(µ, σ)
dnorm(x , µ, σ)    Density
pnorm(x , µ, σ)    Cumulative probability
qnorm(q , µ, σ)    Quantiles
rnorm(m , µ, σ)    Random numbers
Standard Normal
> x = seq(-3.5,3.5,0.1)    # x ~ N(0,1)
> prx = dnorm(x)           # M = 0 , SD = 1
> plot(x , prx , type = "l" )
Or using:
> curve(dnorm(x), from = -3.5 , to = 3.5)
Cumulative Normal & Quantiles
> curve(pnorm(x), from=-3.5,to=3.5)
> qnorm(.25)                     # Percentile 25, x~N(0,1)
> qnorm(.75, mean=50, sd=2)      # M=50,SD=2
> qnorm(c(.1,.3,.7,.9), mean=65, sd=3)
Poisson Distribution
X ~ Poisson( λ ) ; X = 0, 1, 2, 3, …
> x = 0:20                        # Suppose λ = 3.5
> prx = dpois(x, lambda = 3.5)
> plot(x , prx, type = "h", main = "Poisson Distribution")
> text(10, .10, "Lambda = 3.5")
[Figure: "Poisson Distribution", prx plotted against x for x = 0 to 20, annotated "Lambda = 3.5"]
Sampling Distributions
> n = 25; curve(dnorm(x , 0, 1/sqrt(n)), -3, 3,
    xlab = "Mean", ylab = "Densities of Sample Mean", bty = "l" )
> n=5 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)
> n=1 ; curve(dnorm(x, 0, 1/sqrt(n)), add=T)
[Figure: densities of the sample mean; x-axis "Mean" from -3 to 3, y-axis "Densities of Sample Mean"]
t-Distribution as df Increases
> curve(dnorm(x), -4, 4, main="Normal & t Distributions", ylab="Densities" )
> k=3; curve(dt(x , df = k ), lty = k, add = T)
> k=5; curve(dt(x , df = k ), lty = k, add = T)
> k=15; curve(dt(x , df = k ), lty = k, add = T)
> k=100; curve(dt(x , df = k ), lty = k, add = T)
[Figure: "Normal & t Distributions"; x-axis "x" from -4 to 4, y-axis "Densities"]
Binomial-Normal Approximation
• Coin toss example: n = 100, p = .5
• P(X ≤ 40)?
Using Larget’s prob.R file:
> source(file.choose( ) )
> gbinom(100, .5, b = 40 )
Normal approximation: µ = 50, σ = 5
> gnorm(50, 5, b = 40.5)
[Figure: "Binomial Distribution, n = 100 , p = 0.5", P(0 <= Y <= 40) = 0.028444; x-axis "Possible Values", y-axis "Probability"]
[Figure: Normal Distribution with µ = 50, σ = 5; P( X < 40.5 ) = 0.0287, P( X > 40.5 ) = 0.9713; x-axis "Possible Values", y-axis "Probability Density"]
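The two probabilities for the coin-toss example can also be computed without the prob.R file, using the base R functions pbinom and pnorm. This sketch is an alternative to gbinom and gnorm, not a reproduction of them; it returns only the numbers and draws no plots.

```r
n = 100
p = 0.5

# Exact binomial probability P(X <= 40)
exact = pbinom(40, size = n, prob = p)

# Normal approximation with continuity correction: P(X < 40.5)
mu    = n * p                    # 50
sigma = sqrt(n * p * (1 - p))    # 5
approx = pnorm(40.5, mean = mu, sd = sigma)

round(exact, 6)    # 0.028444
round(approx, 4)   # 0.0287
```

The two values differ only in the fourth decimal place, which is why the normal approximation is considered adequate here.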
One-Sample t-test
Ho: µ = µ0    Null Hypothesis
Ha: µ ≠ µ0    Two-sided
Ha: µ > µ0    One-sided
Ha: µ < µ0    One-sided
R One-Sample t.test
> x = c(x1, x2, …, xn)             # data set
> t.test(x, mu = Mo)               # two-sided
> t.test(x, mu = Mo, alt = "g")    # one-sided
> t.test(x, mu = Mo, alt = "l")    # one-sided
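The three forms of t.test can be tried on simulated data. The sample below is hypothetical (30 draws from a normal distribution with mean 510 and SD 40, tested against µ0 = 500) and is not taken from the course text; the full argument name alternative is spelled out where the slides abbreviate it as alt.

```r
set.seed(42)                          # arbitrary seed for reproducibility
x = rnorm(30, mean = 510, sd = 40)    # simulated data, for illustration only

two     = t.test(x, mu = 500)                            # Ha: mu != 500
greater = t.test(x, mu = 500, alternative = "greater")   # Ha: mu > 500
less    = t.test(x, mu = 500, alternative = "less")      # Ha: mu < 500

# All three tests share the same t statistic; only the p-value changes
c(two$p.value, greater$p.value, less$p.value)

# The one-sided p-values add up to 1, and the two-sided p-value
# is twice the smaller of them
greater$p.value + less$p.value
```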
R One-Sample t.test [cont.]
Example: Text, Problem 8.11, page 226
> library(UsingR)
> attach(stud.recs)
> x = sat.m            # Math SAT Scores
> hist(x)              # Visual display
> qqnorm(x)            # Normal quantile plot
> qqline(x, col=2)     # Add equality line
> t.test(x, mu = 500)
> detach(stud.recs)
Normality Test
Shapiro-Wilk test:
Ho: X ~ Normal
Ha: X !~ Normal
Command:
> shapiro.test(x)    # Examine p-value
Normality Test [cont.]
Example: On Base %
> data(OBP)
> summary(OBP)
> boxplot(OBP)
[Figure: box plot of OBP]
Normality Test [cont.]
> qqnorm(OBP)
> qqline(OBP, col=2)
> shapiro.test(OBP)
> wilcox.test(OBP, mu=.330)
[Figure: "Normal Q-Q Plot" of OBP; x-axis "Theoretical Quantiles", y-axis "Sample Quantiles"]
One-Sample Proportion Test
x = total successes; n = sample size
> prop.test(x, n, p = Po)               # two-sided
> prop.test(x, n, p = Po, alt= "g")     # one-sided
> prop.test(x, n, p = Po, alt= "l")     # one-sided
Or Using Binomial “Exact” Test
> binom.test(x, n, p = Po)
> binom.test(x, n, p = Po, alt = "g")
> binom.test(x, n, p = Po, alt = "l")
Proportion Test
Text, Example 8.3: Survey US Poverty Rate
Ho: P = 0.113    # Year 2000 Rate
Ha: P > 0.113    # Year 2001 Rate Increased
> x = 5850       # Sample people UPL
> n = 50000      # Sample size
> prop.test(x, n, p = 0.113, alt = "g")
> binom.test(x, n, p = 0.113, alt = "g")
Some Modeling Functions/Packages
Linear Models:    anova, car, lm, glm
Graphics:         graphics, grid, lattice
Multivariate:     mva, cluster
Survey:           survey
SQC:              qcc
Time Series:      tseries
Bayesian:         BRugs, MCMCpack, …
Simulation:       boot, bootstrap, Zelig
You Perform An Experiment
In Order To Learn,
Not To Prove.
W. Edwards Deming