integrating R into > introductory statistics - Statistical Science

> integrating R into
> introductory statistics
Mine Çetinkaya-Rundel - Duke University
Andrew Bray - UCLA
JSM 2012 - August 2, 2012
> experience
> STA 101 AT DUKE
> first course in statistics for non-majors, mostly social sciences majors
> 80-120 students in lecture, 25 students / lab section
> weekly lab sessions using R
> labs designed for an interdisciplinary introductory course, can be modified for discipline-specific courses
> can also be used in a first data-analysis course for stats majors, ideally by reducing step-by-step instructions
> R
> WHY R?
unlike most software designed specifically for courses at this level, R is
> free and open-source
> powerful and flexible
> relevant beyond the introductory statistics classroom
> WHY NOT R?
> perceived challenge of teaching programming in addition to teaching statistical concepts
> labs and activities that try to find the right balance of standard and custom functions
> consistent syntax highlighting helps
> working with a command line tends to be more intimidating than traditional GUI based tools
> GUI tools also have a learning curve
> a user-friendly IDE (like RStudio)
> RSTUDIO
> what it helps resolve:
> loading and viewing data
> saving code
> code history
> workspace organization
> plot history
> what still remains a challenge:
> working with a command line
> balance
> BALANCE
> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation
> simulations
> sampling distributions > confidence levels > bootstrapping > randomization tests > ...
> hide implementation issues that are outside the scope of the course
> minimize coding for repeated mechanics that can be unified
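As an illustration of the first simulation topic listed above, here is a minimal sketch of a sampling distribution built by repeated sampling. The population `pop` is synthetic and the names `n_sims` and `sample_means` are made up for this example; none of them come from the course labs.

```r
# Sketch: simulate the sampling distribution of the sample mean.
# `pop` is a synthetic, right-skewed stand-in population (an assumption,
# not data from the course).
set.seed(101)
pop <- rexp(10000, rate = 1 / 1500)

n_sims <- 2000   # number of repeated samples
n <- 60          # size of each sample

# Draw n_sims samples and keep the mean of each one
sample_means <- replicate(n_sims, mean(sample(pop, n)))

# The simulated means should center on the population mean, with a
# spread close to sd(pop) / sqrt(n)
c(pop_mean = mean(pop), mean_of_means = mean(sample_means))
c(theory_se = sd(pop) / sqrt(n), simulated_se = sd(sample_means))

hist(sample_means, main = "Sampling distribution of the mean (n = 60)")
```

The only construct students need beyond `mean()` and `sample()` is `replicate()`, which is one way to keep the coding overhead small while still letting them watch the distribution take shape.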
> TEACH
concept: confidence levels
resample from the population many times and construct many confidence intervals (loops)
The data
In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loading that data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically, this is a simple random sample of size 60. Note that the data set has information on many variables on these houses, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
population <- ames$Gr.Liv.Area
sample <- sample(population, 60)
Exercise 1 Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are they identical? Why, or why not?
Confidence intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean size of our sample.
Here is the rough outline:
(1) Obtain a random sample.
(2) Calculate its mean and standard deviation.
(3) Use these statistics to calculate a confidence interval.
(4) Repeat steps (1)-(3) 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desired sample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save their means and standard deviations.
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper <- samp_mean + 1.96 * samp_sd/sqrt(n)
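The four steps of the outline can also be collected into a single function, which makes it easier to rerun the simulation with a different sample size or confidence level. This is only a sketch: `simulate_cis()`, its arguments, and the synthetic `fake_population` are hypothetical names introduced here, and the synthetic data stands in for `ames$Gr.Liv.Area` so that the example runs without a download.

```r
# Sketch: the lab's simulation as one reusable function.
# simulate_cis() and its arguments are hypothetical, not part of the lab.
simulate_cis <- function(population, n = 60, reps = 50, conf_level = 0.95) {
  # critical value; about 1.96 when conf_level = 0.95
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  samp_mean <- rep(NA, reps)
  samp_sd <- rep(NA, reps)
  for (i in 1:reps) {
    samp <- sample(population, n)   # step (1): obtain a random sample
    samp_mean[i] <- mean(samp)      # step (2): save its mean ...
    samp_sd[i] <- sd(samp)          #           ... and standard deviation
  }
  se <- samp_sd / sqrt(n)
  # step (3), done for all reps at once with vectorized arithmetic
  data.frame(lower = samp_mean - z_star * se,
             upper = samp_mean + z_star * se)
}

# Synthetic stand-in population (an assumption, not the Ames data)
set.seed(2012)
fake_population <- rgamma(3000, shape = 9, rate = 9 / 1500)

cis <- simulate_cis(fake_population)
head(cis, 3)
```

Whether to show students the wrapped version or the explicit loop is a judgment call; the loop keeps each step visible, which is the point of the TEACH stage.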
> HIDE
concept: confidence levels
plot these confidence intervals and highlight those that do not contain the true population parameter (custom function)
On your own
1. Using the following custom function, plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If no, explain why.
plot_ci(lower, upper, mean(population))
2. Pick a confidence level of your choosing. What is the appropriate critical value?
3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples, simply calculate new intervals based on the samples you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level you picked?
4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.
[plot: output of plot_ci, with the true population mean marked at mu = 1499.6904]
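The source of `plot_ci()` is not shown on the slides, so the following is a guess at what such a helper might do, written in base R graphics. `plot_ci_sketch()` is a hypothetical stand-in, and the inputs are synthetic. The sketch also covers exercises 2 and 3: `qnorm()` gives the critical value for a chosen confidence level (90% here), and the function returns the proportion of intervals that capture the true mean.

```r
# Sketch of what a plot_ci()-style helper might do. plot_ci_sketch() is a
# hypothetical stand-in; the lab's actual plot_ci() source is not shown.
plot_ci_sketch <- function(lower, upper, mu) {
  k <- length(lower)
  misses <- lower > mu | upper < mu     # intervals that do not contain mu
  plot(NA, xlim = range(lower, upper, mu), ylim = c(1, k),
       xlab = "interval", ylab = "sample")
  segments(lower, 1:k, upper, 1:k, col = ifelse(misses, "red", "black"))
  abline(v = mu, lty = 2)               # true population mean
  invisible(mean(!misses))              # proportion that capture mu
}

# Synthetic inputs (assumptions) so the sketch runs on its own
set.seed(60)
population <- rgamma(3000, shape = 9, rate = 9 / 1500)
n <- 60
samples <- replicate(50, sample(population, n))   # one sample per column
samp_mean <- colMeans(samples)
samp_sd <- apply(samples, 2, sd)

# Exercise 2: critical value for a 90% confidence level
z_star <- qnorm(1 - (1 - 0.90) / 2)

# Exercise 3: new intervals from the samples already collected
lower <- samp_mean - z_star * samp_sd / sqrt(n)
upper <- samp_mean + z_star * samp_sd / sqrt(n)

coverage <- plot_ci_sketch(lower, upper, mean(population))
coverage
```

With only 50 intervals the observed proportion will usually differ somewhat from the nominal level, which is what the first exercise asks students to explain.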
> MINIMIZE
concept: statistical inference
> traditional curriculum for an introductory statistics course includes various statistical inference techniques
> when introduced as disconnected topics, these can be overwhelming to students
> to help unify inferential concepts, use one function that does it all, but still requires students to think about the nature of the data and encourages them to conduct exploratory data analysis
> EXAMPLE
concept: statistical inference
function: inference() - theoretical and simulation based inference
inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),
success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,
alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),
method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){
...
}
> data: response variable, quantitative or categorical
> group: explanatory variable, categorical for grouping (optional)
> type: confidence interval or hypothesis test
> method: theoretical or simulation
> ...
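The body of `inference()` is elided on the slide, so the sketch below does not reproduce it. It only illustrates the idea of one function that branches on the structure of the data (`group`) and on the kind of inference (`type`). `inference_sketch()` is a hypothetical name; it handles means only, uses the normal approximation, and always runs a two-sided test.

```r
# Hypothetical sketch of a unified inference function.
# This is NOT the course's inference(); it covers means only.
inference_sketch <- function(data, group = NULL, type = c("ci", "ht"),
                             null = 0, conflevel = 0.95) {
  type <- match.arg(type)
  if (is.null(group)) {
    # one quantitative variable: single mean
    est <- mean(data)
    se <- sd(data) / sqrt(length(data))
  } else {
    # one quantitative and one categorical variable: difference of two means
    groups <- split(data, group)
    stopifnot(length(groups) == 2)
    est <- mean(groups[[1]]) - mean(groups[[2]])
    se <- sqrt(var(groups[[1]]) / length(groups[[1]]) +
               var(groups[[2]]) / length(groups[[2]]))
  }
  if (type == "ci") {
    z_star <- qnorm(1 - (1 - conflevel) / 2)
    list(estimate = est, se = se, ci = est + c(-1, 1) * z_star * se)
  } else {
    z <- (est - null) / se
    list(estimate = est, se = se, z = z, p_value = 2 * pnorm(-abs(z)))
  }
}

# Usage with made-up data
set.seed(1)
y <- c(rnorm(40, mean = 7.1), rnorm(30, mean = 6.8))
g <- rep(c("nonsmoker", "smoker"), times = c(40, 30))
inference_sketch(y, g, type = "ht")
inference_sketch(y, g, type = "ci")
```

The call looks the same for every scenario, so students still have to decide which variable is the response, whether there is a grouping variable, and which type of inference answers their question.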
> USE IN LABS
question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)
There is clearly an observed difference, but is this difference statistically significant? In order to answer this question we need to conduct a hypothesis test.
Exercise 3 Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions.
Exercise 4 Write the hypotheses for testing if the average weights of babies born to smoker and non-smoker mothers are different.
Next, we introduce a new function, inference, that we will use for conducting hypothesis tests and constructing confidence intervals.
input:
inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")
Let’s pause for a moment to go through the arguments of this custom function.
• The first argument is data, this is the response variable that we are interested in: weight.
• The second argument is the grouping variable, group, this is the variable that we use to split the data into two groups, smokers and nonsmokers: habit.
• The third argument (est) is the parameter we’re interested in: mean (other options are median, proportion).
• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).
• When doing a hypothesis test we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
• The alternative hypothesis can be less, greater, twosided.
output:
One quantitative and one categorical variable
Difference between two means
n_nonsmoker = 873 ; n_smoker = 126
Observed difference between means = 0.3155
H0: mu_nonsmoker - mu_smoker = 0
HA: mu_nonsmoker - mu_smoker != 0
Standard error = 0.134
Test statistic: Z = 2.359
p-value: 0.0184
[plots: weight by group (nonsmoker, smoker); distribution centered at 0 with -0.32 and 0.32 marked]
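The reported test statistic and p-value can be checked by hand from the other numbers in the output. The sketch below uses only values printed above; because the standard error is shown rounded to three decimals, the recomputed Z and p-value agree with the reported 2.359 and 0.0184 only approximately.

```r
# Values copied from the inference() output above
obs_diff <- 0.3155   # observed difference between means
se <- 0.134          # standard error (rounded in the output)
null <- 0            # null value for mu_nonsmoker - mu_smoker

z <- (obs_diff - null) / se       # test statistic
p_value <- 2 * pnorm(-abs(z))     # two-sided p-value, normal approximation

round(z, 2)
round(p_value, 3)
```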
> USE IN PROJECTS
question: compare numbers of sexual partners of males and females (source: national survey of family growth)
input:
inference(data = partners, group = gender, type = "ci", est = "mean", method = "theoretical")
output:
One quantitative and one categorical variable
Difference between two means
n_female = 12190 ; n_male = 10397
Observed difference between means = -0.432
Standard error = 0.0361
95 % Confidence interval = ( -0.5 , -0.36 )
[plot: number of partners by group (female, male)]
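The interval in this output follows from the point estimate and standard error by the same recipe used in the confidence-level lab. A short check, using only the numbers printed above and the normal critical value from `qnorm()`:

```r
# Values copied from the inference() output above
obs_diff <- -0.432   # observed difference between means (female - male)
se <- 0.0361         # standard error

z_star <- qnorm(0.975)                    # critical value for 95% confidence
ci <- obs_diff + c(-1, 1) * z_star * se   # point estimate +/- margin of error

round(ci, 2)
```

Rounded to two decimals this reproduces the reported interval of ( -0.5 , -0.36 ).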
> resources
> RESOURCES
openintro.org/stat/labs.php
> LAB 2 - PROBABILITY
> LAB 8 - MULTIPLE REGRESSION
> reactions
> STUDENT REACTIONS - LABS
positive:
> ``I like them. I feel like in the real world we’ll be using software to do stats, so I’m glad we’re learning how to use it.’’
> ``I LOVE the labs. They really help cement basic statistic ideas, and I especially love that you can finish them in class.’’
> ``The labs are a lot of fun. It’s great being able to create our own simulations and watch R Studio calculate everything. I also enjoy learning some code.’’
negative:
> ``The labs are alright. Sometimes I feel like I’m just plugging in stuff and I feel disconnected from what I’m really doing. It’s also frustrating when the code doesn’t work.’’
> ``Wish other students focused more.’’
> STUDENT REACTIONS - R
positive:
> ``Super useful and powerful software. It’s exciting to be introduced to it. Once again, don’t always feel comfortable writing code/ understanding what I’m doing.’’
> ``I like it! I kind of know MATLAB, which has helped with the coding a bit, but it’s a little more intuitive/easier, and very helpful.’’
> ``I am not a computer person at all, but I find RStudio very easy to use.’’
> ``I like it better than STATA which we used for [another class]. The user interface is easy and there is plenty of help for it online. Overall, it’s pretty good.’’
negative:
> ``I am not a fan of coding in general. I used Python before and RStudio is better (for me) than Python was, but I am not a fan of either.’’
> ``Easy to use, language is not too hard to understand although error messages could be more informative.’’
> ``I don’t think RStudio will have any use to me outside of this class.’’
> also...
> ADDITIONAL CONSIDERATIONS
> Labs should be fully integrated with the curriculum
``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’
> Works best in a classroom environment where students can collaborate with each other and get immediate support
``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.’’
> TAs need to be familiar and comfortable with the material
> thank you
contact : [email protected]
web : stat.duke.edu/~mc301
labs : openintro.org/stat/labs.php