R: A Statistics Program for Teaching & Research
Josué Guzmán
11 Nov. 2007
© J. Guzmán, 2007

Some Useful R Links
• R Home Page: www.r-project.org
• CRAN: http://cran.r-project.org
• Precompiled Binary Distributions
• Windows (95 and later)
• R Manuals

R Installation
• R: Statistical Analysis & Graphics
• Freely Available Under GPL
• Binary Distributions
• Installation – Standard Steps

Running R
[Slide image not recovered]

Statistical Programming with R
• Learn Language Basics
• Learn Documentation / Help System
• Learn Data Manipulation & Graphics
• Perform Basic Statistical Analysis

First Steps: Interacting with R
• Type a command and press Enter
• R executes it (printing the result if relevant)
• R waits for more input

Some Examples
    2 * 2
    [1] 4
    exp(-2)
    [1] 0.1353353
    rdmnorm = rnorm(1000)

R Functions
• exp, log and rnorm are functions
• Function calls are indicated by the presence of parentheses
Example:
    hist(rdmnorm, col = "magenta")

Variables and Assignments
• Assign with the = operator; the <- operator also works
    x = 2.2
    y = x + 3.5
    sqrt(x)
    y
    x ^ y

Variables and Assignments [cont.]
• Variable names cannot start with a digit
• Names are case-sensitive
• Some common names are already used by R (examples: c, q, t, C, D, F, I, T) and should be avoided

Vectorized Arithmetic
• Elementary data types in R are all vectors
• The c(...) construct is used to create vectors
• Data: Bolstad, 2004, exercise 13.2, page 253
    fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
    fertilizer

Vectorized Arithmetic [cont.]
• Arithmetic operations (+, -, *, /, ^) and mathematical functions (sin, cos, log, …) work element-wise on vectors
    yield = c(25, 31, 27, 28, 36, 35, 32, 34)
    log(yield)

Vectorized Arithmetic [cont.]
    sum.yield = sum(yield)
    sum.yield
    n = length(yield)
    n
    avg.yield = sum.yield/n
    avg.yield

Graphics
• The plot(x, y) function is a simple way to produce R graphics:
    plot(fertilizer, log(yield), main = "Fertilizer vs. Yield")

Getting Help
• help.start( ): starts a browser window with an HTML help interface, with links to the manual An Introduction to R and to topic-wise listings
• help(topic): help page for a particular topic or function. Every R function has a help page.
• help.search("search string"): subject/keyword search

Getting Help [cont.]
• Shortcut: question mark (?)
    help(plot)
    ? plot
• To find out about a specific subject, use the help.search function. Example:
    help.search("logarithm")

apropos( )
• The apropos function lists the objects whose names partially match its argument:
    apropos("plot")[1:10]
    [1] ".__C__recordedplot" "biplot"
    [3] "interaction.plot"   "lag.plot"
    [5] "monthplot"          "plot.TukeyHSD"
    [7] "plot.density"       "plot.ecdf"
    [9] "plot.lm"            "plot.mlm"
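A minimal sketch recapping the vectorized-arithmetic and help slides above, using the deck's own fertilizer/yield data. The names log.yield, avg.builtin and mean.like are introduced here for illustration only; mean( ) and apropos( ) are standard R functions.

```r
# Data from the vectorized-arithmetic slides (Bolstad, 2004, exercise 13.2)
fertilizer = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5)
yield = c(25, 31, 27, 28, 36, 35, 32, 34)

# Functions such as log() work element-wise: one result per input value
log.yield = log(yield)

# Mean computed by hand, as on the slide: 248 / 8 = 31
avg.yield = sum(yield) / length(yield)

# The built-in mean() gives the same value
avg.builtin = mean(yield)

# apropos() takes a regular expression and returns matching object names
mean.like = apropos("^mean")
```

Comparing avg.yield with avg.builtin is a quick way to convince students that the built-in summary functions do nothing more than the arithmetic they have just typed by hand.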
R Packages
• R makes use of a system of packages
• Each package is a collection of routines with a common theme
• The core of R itself is a package called base
• A collection of packages is called a library
• Some packages are already loaded when R starts up
• Other packages need to be loaded using the library function

R Packages [cont.]
Several packages come pre-installed with R:
    installed.packages( )[, 1]
     [1] "ISwR"     "KernSmooth" "MASS"    "base"
     [5] "boot"     "class"      "cluster" "foreign"
     [9] "graphics" "grid"       "lattice" "methods"
    [13] "mgcv"     "nlme"       "nnet"    "rpart"
    [17] "spatial"  "splines"    "stats"   "stats4"
    [21] "survival" "tcltk"      "tools"   "utils"

Contributed Packages
• Many packages are available from CRAN
• Some packages are already loaded when R starts up. To list the currently loaded packages, use search:
    search( )
    [1] ".GlobalEnv"    "package:tools"    "package:methods"
    [4] "package:stats" "package:graphics" "package:utils"
    [7] "Autoloads"     "package:base"

R Packages
• Packages can be loaded by the user. Example: the UsingR package
    library(UsingR)
• New packages are downloaded using the install.packages function:
    install.packages("UsingR")
    library(help = UsingR)

Data Types
• vector – Set of elements in a specified order
• matrix – Two-dimensional array of elements of the same mode
• factor – Vector of categorical data
• data frame – Two-dimensional array whose columns may represent data of different modes
• list – Set of components that can be any other object type

Editing Data Sets
• Data sets can be created and modified on the command line
    xx = seq(from = 1, to = 5)
    xx
    x2 = 1 : 5
    x2
    yy = scan( )    # type the values, then a blank line to finish
    5 8 10 4 2 6
    20 11 21 32 43 55

    yy
• A data set can be edited once it is created
    edit(mydata)          # returns the edited copy; assign it to keep the changes
    data.entry(mydata)

Built-in Data
Data from a library:
    library(UsingR)
    attach(cfb)    # Consumer-Finances Survey
    cfb$INCOME
    cfb$EDUC
    educ.fac = factor(EDUC)
    plot(INCOME ~ educ.fac, xlab = "EDUCATION", ylab = "INCOME")
    detach(cfb)

Data Modes
• logical – Binary mode, values represented as TRUE or FALSE
• numeric – Numeric mode [integer, single, & double precision]
• complex – Complex numeric values
• character – Character values represented as strings

Data Frames
• read.table( ) – Reads in data from an external file
    read.table("data.txt", header = T)
    read.table(file = file.choose( ), header = T)
• data.frame – Binds R objects of various kinds

read.table Function
• Reads an ASCII file and creates a data frame
• Data are in tables of rows and columns
• If the first line contains column labels, use the argument header = T
• The default field separator is white space
• Also read.csv and read.csv2, which assume , and ; separators, respectively
• Treats character columns as factors (the default before R 4.0.0)

save( ) and load( )
• Used for R functions and objects
• The saved file can be read only by load
    x = 23
    y = 44
    save(x, y, file = "xy.Rdata")
    load("xy.Rdata")

Comparison Operators
    !=    Not Equal To
    <     Less Than
    <=    Less Than or Equal To
    ==    Exactly Equal To
    >     Greater Than
    >=    Greater Than or Equal To

Some Logical Operators
    !     Not
    |     Or (For Calculating Vectors and Arrays of Logicals)
    &     And (For Calculating Vectors and Arrays of Logicals)

Some Mathematical Functions
    abs              Absolute Value
    ceiling          Next Larger Integer
    floor            Next Smaller Integer
    cos, sin, tan    Trigonometric Functions
    exp(x)           e^x [e = 2.71828 …]
    log              Natural Logarithm
    log10            Logarithm Base 10
    sqrt             Square Root

Statistical Summary Functions
    length      Length of Object
    max         Maximum Value
    mean        Arithmetic Mean
    median      Median
    min         Minimum Value
    prod        Product of Values
    quantile    Empirical Quantiles
    sum         Sum
    var         Variance - Covariance
    sd          Standard Deviation
    cor         Correlation Between Vectors or Matrices

Sorting and Other Functions
    rev        Put Values of Vector in Reverse Order
    sort       Sort Values of Vector
    order      Permutation of Elements to Produce Sorted Order
    rank       Ranks of Values in Vector
    match      Detect Occurrences in a Vector
    cumsum     Cumulative Sums of Values in Vector
    cumprod    Cumulative Products

Plotting Functions Useful for One-Dimensional Data
    barplot     Bar plot
    boxplot     Box & Whisker plot
    hist        Histogram
    dotchart    Dot plot
    pie         Pie chart

Plotting Functions Useful for Two-Dimensional Data
    plot      Creates a scatter plot: plot(x, y)
    qqnorm    Quantile-quantile plot, sample vs. N(0, 1): qqnorm(x)
    qqplot    Quantile-quantile plot for two samples: qqplot(x, y)
    pairs     Creates a pairs or scatter plot matrix:
        attach(babies)
        pairs(babies[ , c("gestation", "wt", "age", "inc")])

Three-Dimensional Plotting Functions
    contour    Contour plot
    persp      Perspective plot
    image      Image plot

Probability Distributions Using R
• Pseudo-random sampling
    sample(0:20, 5)                 # select 5 without replacement (WOR)
    sample(0:20, 5, replace = T)    # select 5 with replacement (WR)
• Coin toss simulation [0 = tail; 1 = head], 20 tosses:
    sample(c(0, 1), 20, replace = T)

For Any Probability Distribution
    ddist    density or probability
    pdist    cumulative probability
    qdist    quantiles [percentiles]
    rdist    pseudo-random selection
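The d/p/q/r naming scheme above can be sketched with the standard normal distribution, where "dist" becomes "norm". This is only an illustration: the variable names and the seed value are arbitrary choices made here, not part of the original slides.

```r
# d: density of N(0, 1) at 0, which equals 1 / sqrt(2 * pi)
d0 = dnorm(0)

# p: cumulative probability P(X <= 1.96), roughly 0.975
p196 = pnorm(1.96)

# q: quantile function, the inverse of p; qnorm(0.975) is roughly 1.96
q975 = qnorm(0.975)

# p and q undo each other
round.trip = pnorm(qnorm(0.975))

# r: pseudo-random draws; set.seed() makes them reproducible
set.seed(1)
draws = rnorm(5)
```

The same four prefixes apply to the other distributions in base R, for example binom, pois and t.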
Binomial Distribution
X ~ Binomial(n, p); x = 0, 1, …, n
    dbinom(x, n, p)    Density or point probability
    pbinom(x, n, p)    Cumulative distribution
    qbinom(q, n, p)    Quantiles [0 < q < 1]
    rbinom(m, n, p)    Pseudo-random numbers

Binomial Distribution
Coin toss simulation:
    x = 0:20                                # num. of heads in 20 tosses
    px = dbinom(x, size = 20, prob = 0.5)
    plot(x, px, type = "h")                 # graph display
    curve(dnorm(x, 10, sqrt(20*.5*.5)), col = 2, add = T)

[Figure: px plotted against x, for x from 0 to 20]

Normal Distribution
X ~ Normal(µ, σ)
    dnorm(x, µ, σ)    Density
    pnorm(x, µ, σ)    Cumulative probability
    qnorm(q, µ, σ)    Quantiles
    rnorm(m, µ, σ)    Random numbers

Standard Normal
    x = seq(-3.5, 3.5, 0.1)    # x ~ N(0,1)
    prx = dnorm(x)             # M = 0, SD = 1
    plot(x, prx, type = "l")
Or using:
    curve(dnorm(x), from = -3.5, to = 3.5)

Cumulative Normal & Quantiles
    curve(pnorm(x), from = -3.5, to = 3.5)
    qnorm(.25)                                # Percentile 25, x ~ N(0,1)
    qnorm(.75, mean = 50, sd = 2)             # M = 50, SD = 2
    qnorm(c(.1, .3, .7, .9), mean = 65, sd = 3)

Poisson Distribution
X ~ Poisson(λ); X = 0, 1, 2, 3, …
    x = 0:20                        # Suppose λ = 3.5
    prx = dpois(x, lambda = 3.5)
    plot(x, prx, type = "h", main = "Poisson Distribution")
    text(10, .10, "Lambda = 3.5")

[Figure: "Poisson Distribution", prx plotted against x, annotated "Lambda = 3.5"]

Sampling Distributions
    n = 25; curve(dnorm(x, 0, 1/sqrt(n)), -3, 3, xlab = "Mean",
                  ylab = "Densities of Sample Mean", bty = "l")
    n = 5; curve(dnorm(x, 0, 1/sqrt(n)), add = T)
    n = 1; curve(dnorm(x, 0, 1/sqrt(n)), add = T)
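The sampling-distribution curves above rest on the fact that the mean of n independent N(0, 1) values has standard deviation 1/sqrt(n). A minimal sketch of that fact follows; the seed, the 2000 replicates and all variable names are arbitrary choices made here for illustration.

```r
# Theoretical SD of the sample mean for N(0, 1) data: 1 / sqrt(n)
n = c(1, 5, 25)
se = 1 / sqrt(n)

# Height of each density curve at its centre; larger n gives a taller, narrower curve
peak = dnorm(0, mean = 0, sd = se)

# Simulation check for n = 25: draw many samples and keep each sample mean
set.seed(2007)
means = replicate(2000, mean(rnorm(25)))

# Should be close to the theoretical value 1 / sqrt(25) = 0.2
sd.sim = sd(means)
```

Calling hist(means) after running the sketch gives students an empirical picture to set beside the theoretical curve for n = 25.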
[Figure: densities of the sample mean; x-axis "Mean" from -3 to 3, y-axis "Densities of Sample Mean"]

t Distribution as df Increases
    curve(dnorm(x), -4, 4, main = "Normal & t Distributions",
          ylab = "Densities")
    k = 3;   curve(dt(x, df = k), lty = k, add = T)
    k = 5;   curve(dt(x, df = k), lty = k, add = T)
    k = 15;  curve(dt(x, df = k), lty = k, add = T)
    k = 100; curve(dt(x, df = k), lty = k, add = T)

[Figure: "Normal & t Distributions"; x-axis x from -4 to 4, y-axis "Densities"]

Binomial-Normal Approximation
• Coin toss example: n = 100, p = .5
• P(X ≤ 40)?
Using Larget’s prob.R file:
    source(file.choose( ))
    gbinom(100, .5, b = 40)
Normal approximation: µ = 50, σ = 5
    gnorm(50, 5, b = 40.5)

[Figure: "Binomial Distribution n = 100, p = 0.5"; P(0 <= Y <= 40) = 0.028444; x-axis "Possible Values", y-axis "Probability"]

[Figure: "Normal Distribution with µ = 50, σ = 5"; P(X < 40.5) = 0.0287, P(X > 40.5) = 0.9713; x-axis "Possible Values", y-axis "Probability Density"]

One-Sample t-test
    Ho: µ = µ0    Null Hypothesis
    Ha: µ ≠ µ0    Two-sided
    Ha: µ > µ0    One-sided
    Ha: µ < µ0    One-sided

R One-Sample t.test
    x = c(x1, x2, …, xn)              # data set
    t.test(x, mu = Mo)                # two-sided
    t.test(x, mu = Mo, alt = "g")     # one-sided
    t.test(x, mu = Mo, alt = "l")     # one-sided

R One-Sample t.test [cont.]
Example: Text, Problem 8.11, page 226
    library(UsingR)
    attach(stud.recs)
    x = sat.m              # Math SAT Scores
    hist(x)                # Visual display
    qqnorm(x)              # Normal quantile plot
    qqline(x, col = 2)     # Add reference line
    t.test(x, mu = 500)
    detach(stud.recs)
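As a sketch of what t.test( ) computes in the one-sample case above, the statistic and two-sided p-value can be reproduced by hand. The scores below are hypothetical values invented for illustration (they are not the stud.recs data), and mu0, t.manual and p.manual are names introduced here.

```r
# Hypothetical scores, invented for illustration only
x = c(520, 480, 610, 455, 530, 590, 470, 505, 560, 495)
mu0 = 500

# Two-sided test is the default alternative
res = t.test(x, mu = mu0)

# Same statistic by hand: (sample mean - mu0) / (s / sqrt(n))
n = length(x)
t.manual = (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p.manual = 2 * pt(-abs(t.manual), df = n - 1)
```

The components res$statistic, res$parameter and res$p.value hold the t value, the degrees of freedom and the p-value, so they can be compared directly with the hand calculation.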
Normality Test
Shapiro-Wilk test:
    Ho: X ~ Normal
    Ha: X is not Normal
Command:
    shapiro.test(x)    # Examine p-value

Normality Test [cont.]
Example: On Base %
    data(OBP)
    summary(OBP)
    boxplot(OBP)

[Figure: boxplot of OBP]

Normality Test [cont.]
    qqnorm(OBP)
    qqline(OBP, col = 2)
    shapiro.test(OBP)
    wilcox.test(OBP, mu = .330)

[Figure: "Normal Q-Q Plot"; x-axis "Theoretical Quantiles", y-axis "Sample Quantiles"]

One-Sample Proportion Test
x: total successes; n: sample size
    prop.test(x, n, p = Po)               # two-sided
    prop.test(x, n, p = Po, alt = "g")
    prop.test(x, n, p = Po, alt = "l")

Or Using Binomial “Exact” Test
    binom.test(x, n, p = Po)
    binom.test(x, n, p = Po, alt = "g")
    binom.test(x, n, p = Po, alt = "l")

Proportion Test
Text, Example 8.3: Survey US Poverty Rate
    Ho: P = 0.113    # Year 2000 Rate
    Ha: P > 0.113    # Year 2001 Rate Increased
    x = 5850         # Sample people UPL
    n = 50000        # Sample size
    prop.test(x, n, p = 0.113, alt = "g")
    binom.test(x, n, p = 0.113, alt = "g")

Some Modeling Functions/Packages
    Linear Models:    anova, car, lm, glm
    Graphics:         graphics, grid, lattice
    Multivariate:     mva, cluster
    Survey:           survey
    SQC:              qcc
    Time Series:      tseries
    Bayesian:         BRugs, MCMCpack, …
    Simulation:       boot, bootstrap, Zelig

You Perform An Experiment In Order To Learn, Not To Prove.
W. Edwards Deming