A common task in business analytics is to take a random sample of a very large dataset in order to test your
analytics code. Note that most business analytics datasets are data.frames (records as rows and variables as
columns) in structure, or are bound to a database. This is partly due to a legacy of traditional analytics software.
Here is how we do it in R.
• Referring to parts of a data.frame rather than the whole dataset
Use square brackets to reference variable columns and rows:
The notation dataset[i,j] refers to the element in the ith row and jth column.
The notation dataset[i,] refers to all elements in the ith row, i.e. a record of the data.frame.
The notation dataset[,j] refers to all elements in the jth column, i.e. a variable of the data.frame.
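For example, a minimal sketch (the data.frame dataset below is hypothetical, created only to illustrate the notation):
# A tiny illustrative data.frame
dataset <- data.frame(id = 1:5,
                      sales = c(10, 20, 15, 30, 25),
                      region = c("N", "S", "E", "W", "N"))
dataset[2, 3]   # element in the 2nd row, 3rd column ("S")
dataset[2, ]    # all of the 2nd row, i.e. one record
dataset[, 2]    # all of the 2nd column, i.e. the sales variable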
For a data.frame dataset
> nrow(dataset) #This gives number of rows
> ncol(dataset) #This gives number of columns
An example of computing the correlation between only a few variables in a data.frame:
> cor(dataset1[,4:6])
Splitting a dataset into test and control groups:
ts.test=dataset2[1:200,] #First 200 rows
ts.control=dataset2[201:275,] #Next 75 rows
• Sampling
Random sampling enables us to work on a smaller subset of the whole dataset.
use sample to create a random permutation of the vector x.
Suppose we want to take a 5% sample of a data frame with no replacement.
Let us create a dataset ajay of random numbers
ajay=matrix( round(rnorm(200, 5,15)), ncol=10)
#This is the kind of code line that frightens most MBAs!!
Note we use the round function to round off values.
ajay=as.data.frame(ajay)
nrow(ajay)
[1] 20
> ncol(ajay)
[1] 10
This is a typical business data scenario: we want to select only a few records to do our analysis (or test
our code), but keep all the columns for those records. Let us assume we want to sample only 5% of the
whole data so we can run our code on it.
The number of rows in the new object will then be 0.05*nrow(ajay); that will be the size of the sample, which is
passed to sample() through the size parameter. We also use replace=FALSE (or F) so that the same row is not
picked again and again. The resulting vector of row indices, new_rows, is thus a 5% sample of the existing rows,
and square-bracket indexing with ajay[new_rows,] extracts those records. In one line:
b=ajay[sample(nrow(ajay),replace=F,size=0.05*nrow(ajay)),]
You can change the percentage from 5% to whatever you want accordingly.
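The same thing broken into two steps (a small sketch using the ajay data.frame above; the names new_rows and b are just illustrative):
new_rows=sample(nrow(ajay),size=0.05*nrow(ajay),replace=F)  #indices of the sampled rows
b=ajay[new_rows,]                                           #all columns, only the sampled records
nrow(b)                                                     #1 row here, since 5% of 20 rows is 1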
> I encountered the following error when I tried to take a sample inside a
> dataset.
> My code is:
>
>> data_ostrya <- sample(ostrya,200, replace=F)
> Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, :
>
cannot take a sample larger than the population when 'replace = FALSE'
>
> Why does it not work?
> The whole dataset is composed of 536 rows and I just want to sample
> randomly 200 of them...
The sample() works on vectors, not dataframes. Since dataframes are
lists containing the columns, it was trying to sample columns, not rows.
This would give you a sample of rows:
ostrya[sample(1:536, 200, replace=FALSE),]
Duncan Murdoch
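A slightly more general form of the same fix (a sketch, not from the original thread; it assumes ostrya is the data frame in question) avoids hard-coding the row count:
ostrya[sample(nrow(ostrya), 200, replace=FALSE), ]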
>
> Thank you in advance,
>
> Gian
>
> On 13 September 2012 14:01, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
>> On 12-09-13 7:43 AM, Gian Maria Niccolò Benucci wrote:
>>
>>> Thank you Duncan,
>>>
>>> I got the result of sampling, but it gave me only the row numbers. Is it
>>> possible to have the entire row with variables and other information?
>>> Because I need to re-sample inside my matrix the whole rows in reason to
>>> have 20 samples (i.e., rows) each year.
>>> Thank you for your invaluable help!
>>>
>>
>> Use those row numbers to index the dataframe or matrix, e.g.
>>
>> a[rows,]
>>
>> Duncan Murdoch
>>
>>
>>> Gian
>>>
>>>
>>> On 13 September 2012 13:32, Duncan Murdoch <murdoch.duncan at gmail.com>
>>> wrote:
>>>
>>>
On 12-09-13 7:18 AM, Gian Maria Niccolò Benucci wrote:
>>>>
>>>>
Thank you very much for your help,
>>>>>
>>>>> I was wondering if is possible to sample randomly specifying to select
>>>>> in
>>>>> a
>>>>> particular group of data inside the matrix, for example only within the
>>>>> whole samples collected in 2011 I would randomly choose 20 random
>>>>> samples...
>>>>>
>>>>>
>>>> You need two steps: find the rows that meet your condition, then sample
>>>> from those. For example,
>>>>
>>>> rows <- which( a$year == 2011 )
>>>> sample(rows, 20)
>>>>
>>>> There is one thing to watch out for: if you have a condition that only
>>>> matches one row, you will get unexpected results here, because the sample
>>>> will be taken from 1:rows. See the examples in ?sample for the
>>>> workaround
>>>> that uses sample.int.
>>>>
>>>> Duncan Murdoch
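To make that warning concrete, here is a small sketch (not from the original thread; it assumes a data frame a with a year column, as in the reply above):
rows <- which(a$year == 2011)
# If rows happened to have length 1, say rows == 42, then sample(rows, 20)
# would sample from 1:42 rather than from the single matching row.
# Sampling positions with sample.int avoids that trap:
rows[sample.int(length(rows), 20)]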
>>>>
>>>>
>>>>
Thanks again,
>>>>>
>>>>>
>>>>> Gian
>>>>>
>>>>> On 13 September 2012 12:26, anna freni sterrantino <annafreni at yahoo.it
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>>
Hello Gian,
>>>>>
>>>>>> sure sample function
>>>>>> will do it for your sampling.
>>>>>>
>>>>>> a=as.data.frame(matrix(1:20,4))
>>>>>>
>>>>>> sample(rownames(a),2)
>>>>>>
>>>>>> see ?sample for more details.
>>>>>> Hope it helps
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Anna
The Statistical Bootstrap and Other Resampling
Methods
This page has the following sections:
Preliminaries
The Bootstrap
R Software
The Bootstrap More Formally
Permutation Tests
Cross Validation
Simulation
Random Portfolios
Summary
Links
Preliminaries
The purpose of this document is to introduce the statistical bootstrap and related techniques in order to
encourage their use in practice. The examples work in R — see Impatient R for an introduction to using R.
However, you need not be an R user to follow the discussion. On the other hand, R is arguably the best
environment in which to perform these techniques.
A dataset of the daily returns of IBM and the S&P 500 index for 2006 is used in the examples. A tab separated
file of the data is available at: http://www.burns-stat.com/pages/Tutor/spx_ibm.txt
An R command — that you can do yourself — to create the data as used in the examples is:
spxibm <- as.matrix(read.table(
"http://www.burns-stat.com/pages/Tutor/spx_ibm.txt",
header=TRUE, sep='\t', row.names=1))
The above command reads the file from the Burns Statistics website and creates a two column matrix with 251
rows. The first few rows look like:
> head(spxibm)
                    spx        ibm
2006-01-03  1.629695715 -0.1727542
2006-01-04  0.366603354 -0.1359451
2006-01-05  0.001570512  0.6778851
2006-01-06  0.935554103  2.9173527
2006-01-09  0.364963909 -1.4419861
2006-01-10 -0.035661127  0.4106782
The following two commands each extract one column to create a vector.
spxret <- spxibm[, 'spx']
ibmret <- spxibm[,'ibm']
The calculations we talk about are random. If you want to be able to reproduce them exactly, there are two
choices: you can save the .Random.seed object in existence just before you start the computation, or you can
use the set.seed function.
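A minimal sketch of the second option (the seed value is arbitrary):
set.seed(101)   # any fixed integer makes the random draws below reproducible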
The Bootstrap
The idea: We have just one dataset. When we compute a statistic on the data, we only know that one statistic
— we don’t see how variable that statistic is. The bootstrap creates a large number of datasets that we might
have seen and computes the statistic on each of these datasets. Thus we get a distribution of the statistic. Key is
the strategy to create data that “we might have seen”.
Our example data are log returns (also known as continuously compounded returns). The log return for the year
is the sum of the daily log returns. The log return for the year for the S&P is 12.8% (the data are in percent
already). We can use the bootstrap to get an idea of the variability of that figure.
There are 251 daily returns in the year. One bootstrap sample is 251 randomly sampled daily returns. The
sampling is with replacement, so some of the days will be in the bootstrap sample multiple times and other days
will not appear at all. Once we have a bootstrap sample, we perform the calculation of interest on it — in this
case the sum of the values. We don’t stop at just one bootstrap sample though, typically hundreds or thousands
of bootstrap samples are created.
Below is some simple code to perform this bootstrap with 1000 bootstrap samples.
spx.boot.sum <- numeric(1000) # numeric vector 1000 long
for(i in 1:1000) {
    this.samp <- spxret[ sample(251, 251, replace=TRUE) ]
    spx.boot.sum[i] <- sum(this.samp)
}
The key step of the code above is the call to the sample function. The command says to sample from the
integers from 1 to 251, make the sample size 251 and sample with replacement. The effect is that this.samp is
a year of daily returns that might have happened (but probably didn’t). In the subsequent line we collect the
annual return from each of the hypothetical years. We can then plot the distribution of the bootstrapped annual
returns.
plot(density(spx.boot.sum), lwd=3, col="steelblue")
abline(v=sum(spxret), lwd=3, col='gold')
The plot shows the annual return to be quite variable. (Your plot won’t look exactly like this one, but should
look very similar.) It could easily have been anything from 0 to 25 percent. The actual annual return is squarely in the
middle of the distribution. That doesn’t always happen — there can be substantial bias for some statistics and
datasets.
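A couple of simple summaries of the bootstrap distribution can be computed directly (a sketch, not part of the original text):
quantile(spx.boot.sum, c(0.05, 0.95))   # rough 90% bootstrap interval for the annual return
sd(spx.boot.sum)                        # bootstrap standard error of the annual return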
Bootstrapping smooths
More than numbers can be bootstrapped. In the following example we bootstrap a smooth function over time.
spx.varsmu <- array(NA, c(251, 20)) # make 251 by 20 matrix
for(i in 1:20) {
    this.samp <- spxret[ sample(251, 251, replace=TRUE) ]
    spx.varsmu[,i] <- supsmu(1:251,
        (this.samp - mean(this.samp))^2)$y
}
plot(supsmu(1:251, (spxret-mean(spxret))^2), type='l',
xlab='Days', ylab='Variance')
matlines(1:251, spx.varsmu, lty=2, col='red')
The black line is a smooth of the variance of the real S&P data over the year, while the red lines are smooths
from bootstrapped samples. It isn’t absolutely clear that the black line is different from the red lines, but it is.
Market data experience the phenomenon of “volatility clustering”. There are periods of low volatility, and
periods of high volatility. Since the bootstrapping does not preserve time ordering, the bootstrap samples will
not have volatility clustering. We will return to volatility clustering later.
Bootstrap regression coefficients (version 1)
A statistic that can be of interest is the slope of the linear regression of a stock’s returns explained by the returns
of the “market”, that is, of an index like the S&P. In finance this number is called the beta of the stock. This is
often thought of as a fixed number for each stock. In fact betas change continuously, but we will ignore that
complication here.
The command below shows that our data gives an estimate of IBM’s beta of about 0.85.
> coef(lm(ibmret ~ spxret))
(Intercept) spxret
0.02887969 0.84553741
What is after the “>” on the first line is what the user typed (and you can copy and paste), the rest was the
response. If we start on the inside of the command, ibmret ~ spxret is a formula that says ibmret is to be
explained by spxret. lm stands for linear model. The result of the call to lm is an object representing the linear
regression. We then extract the coefficients from that object. To get just the beta, we can subscript for the
second element of the coefficients:
> coef(lm(ibmret ~ spxret))[2]
spxret
0.8455374
We are now ready to try bootstrapping beta in order to get a sense of its variability.
beta.obs.boot <- numeric(1000)
for(i in 1:1000) {
    this.ind <- sample(251, 251, replace=TRUE)
    beta.obs.boot[i] <- coef(lm(
        ibmret[this.ind] ~ spxret[this.ind]))[2]
}
plot(density(beta.obs.boot), lwd=3, col="steelblue")
abline(v=coef(lm(ibmret ~ spxret))[2], lwd=3, col='gold')
Each bootstrap takes a sample of the indices of the days in the year, creates the IBM return vector based on
those indices, creates the matching return vector for the S&P, and then performs the regression. Basically, a
number of hypothetical years are created.
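As with the annual return, the bootstrap distribution of beta can be summarized directly (a sketch):
quantile(beta.obs.boot, c(0.025, 0.975))   # rough 95% bootstrap interval for beta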
Bootstrap regression coefficients (version 2)
There is another approach to bootstrapping the regression coefficient. The response in a regression is identically
equal to the fit of the regression plus the residuals (the regression line plus distance to the line). We can take the
viewpoint that only the residuals are random.
Here’s the alternative bootstrapping approach. Sample from the residuals of the regression on the original data,
and then create synthetic response data by adding the bootstrapped residuals to the fitted value. The explanatory
variables are still the original data.
In our case we will use the original S&P returns because that is our explanatory variable. For each day we will
create a new IBM return by adding the fit of the regression for that day to the residual from some day. This
second method is performed below.
ibm.lm <- lm(ibmret ~ spxret)
ibm.fit <- fitted(ibm.lm)
ibm.resid <- resid(ibm.lm)
beta.resid.boot <- numeric(1000)
for(i in 1:1000) {
    this.ind <- sample(251, 251, replace=TRUE)
    beta.resid.boot[i] <- coef(lm(
        ibm.fit + ibm.resid[this.ind] ~ spxret))[2]
}
plot(density(beta.resid.boot), lwd=3, col="steelblue")
abline(v=coef(lm(ibmret ~ spxret))[2], lwd=3, col='gold')
In this case the results are equivalent. (An experiment with 50,000 bootstrap samples showed the two bootstrap
densities to be almost identical.) There are times, though, when the residual method is forced upon us. If we are
modeling volatility clustering, then sampling the observations will not work — that destroys the phenomenon
we are trying to study. We need to fit our model of volatility clustering and then sample from the residuals of
that model.
R Software
For data exploration the techniques that have just been presented are likely to be sufficient. If you are using R,
S-PLUS or a few other languages, then there is no need for any specialized software — you can just write a
simple loop. Often it is a good exercise to decide how to bootstrap your data. You are not likely to understand
your data unless you know how to mimic its variability with a bootstrap.
If formal inference is sought, then there are some technical tricks, such as bias corrected confidence intervals,
that are often desirable. Specialized software does generally make sense in this case.
There are a number of R packages that are either confined to or touch upon bootstrapping or its relatives. These
include:
boot: This package incorporates quite a wide variety of bootstrapping tricks.
bootstrap: A package of relatively simple functions for bootstrapping and related techniques.
coin: A package for permutation tests (which are discussed below).
MChtest: This package is for Monte Carlo hypothesis tests, that is, tests using some form of resampling. This
includes code for sampling rules where the number of samples taken depends on how certain the result is.
meboot: Provides a method of bootstrapping a time series.
permtest: A package containing a function for permutation tests of microarray data.
scaleboot: This package produces approximately unbiased hypothesis tests via bootstrapping.
simpleboot: A package of a few functions that perform (or present) bootstraps in simple situations, such as one
and two samples, and linear regression.
There are a large number of R packages that include bootstrapping. Examples include multtest that has the
boot.resample function, and Matching which has a function for a bootstrapped Kolmogorov-Smirnov test
(the equality of two probability distributions). The BurStMisc package has some simple functions for
permutation tests.
The Bootstrap More Formally
Bootstrapping is an alternative to the traditional statistical technique of assuming a particular probability
distribution. For example, it would be reasonably common practice to assume that our return data are normally
distributed. This is clearly not the case. However, there is decidedly no consensus on what distribution would
be believable. Bootstrapping outflanks this discussion by letting the data speak for itself.
As long as there are more than a few observations the data will reveal their distribution to a reasonable extent.
One way of describing bootstrapping is that it is sampling from the empirical distribution of the data.
A sort of a compromise is to do “smoothed bootstrapping”. This produces observations that are close to a
specific data observation rather than exactly equal to the data observation. An R statement that performs
smoothed bootstrapping on the S&P return vector is:
rnorm(251, mean=sample(spxret, 251, replace=TRUE), sd=.05)
This generates 251 random numbers with a normal distribution where the mean values are a standard bootstrap
sample and the standard deviation is small (relative to the spread of the original data). If zero were given for sd,
then it would be exactly the standard bootstrap sample.
Some statistics are quite sensitive to tied values (which are inherent in bootstrap samples). Smoothed
bootstrapping can be an improvement over the standard bootstrap for such statistics.
The usual assumption to make about data that are being bootstrapped is that the observations are independent
and identically distributed. If this is not the case, then the bootstrap can be misleading.
Let’s look at this assumption in the case of bootstrapping the annual return. If you consider just location, returns
are close to independent. However, independence is definitely shattered by volatility clustering. It is probably
easiest to think in terms of predictability. The predictability of the returns is close to (but not exactly) zero.
There is quite a lot of predictability to the squared returns though. The amount that the bootstrap is distorted by
predictability of the returns is infinitesimal. Distortion due to volatility clustering could be appreciable, though
unlikely to be overwhelming.
There are a number of books that discuss bootstrapping. Here are a few:
A book that briefly discusses bootstrapping along with a large number of additional topics in the context of R
and S-PLUS is Modern Applied Statistics with S, Fourth Edition by Venables and Ripley.
An Introduction to the Bootstrap by Efron and Tibshirani. (Efron is the inventor of the bootstrap.)
Bootstrap Methods and their Application by Davison and Hinkley.
Permutation Tests
The idea: Permutation tests are restricted to the case where the null hypothesis really is null — that is, that
there is no effect. If changing the order of the data destroys the effect (whatever it is), then a random
permutation test can be done. The test checks if the statistic with the actual data is unusual relative to the
distribution of the statistic for permuted data.
Our example permutation test is to test volatility clustering of the S&P returns. Below is an R function that
computes the statistic for Engle’s ARCH test.
engle.arch.test <- function (x, order=10)
{
    xsq <- x^2
    nobs <- length(x)
    inds <- outer(0:(nobs - order - 1), order:1, "+")
    xmat <- xsq[inds]
    dim(xmat) <- dim(inds)
    xreg <- lm(xsq[-1:-order] ~ xmat)
    summary(xreg)$r.squared * (nobs - order)
}
All you need to know is that the function returns the test statistic and that a big value means there is volatility
clustering, but here is an explanation of it if you are interested:
The test does a regression with the squared returns as the response and some number of lags (most recent
previous data) of the squared returns as explanatory variables. (An estimate of the mean of the returns is
generally removed first, but this has little impact in practice.) If the last few squared returns have power to
predict tomorrow’s squared return, then there must be volatility clustering. The tricky part of the function is the
line that creates inds. This object is a matrix of the desired indices of the squared returns for the matrix of
explanatory variables. Once the explanatory matrix is created, the regression is performed and the desired
statistic is returned. The default number of lags to use is 10.
A random permutation test compares the value of the test statistic for the original data to the distribution of test
statistics when the data are permuted. We do this below for our example.
spx.arch.perm <- numeric(1000)
for(i in 1:1000) {
    spx.arch.perm[i] <- engle.arch.test(sample(spxret))
}
plot(density(spx.arch.perm, from=0), lwd=3, col="steelblue")
abline(v=engle.arch.test(spxret), lwd=3, col='gold')
The simplest way to get a random permutation in R is to put your data vector as the only argument to the
sample function. The call:
sample(spxret)
is equivalent to:
sample(spxret, size=length(spxret), replace=FALSE)
A simple calculation for the p-value of the permutation test is to count the number of statistics from permuted
data that exceed the statistic for the original data and then divide by the number of permutations performed. In
R a succinct form of this calculation is:
mean(spx.arch.perm >= engle.arch.test(spxret))
In my case I get 0.01. That is, 10 of the 1000 permutations produced a test statistic larger than the statistic from
the real data. A test using 100,000 permutations gave a value of 0.0111.
There is a more pedantic version of the p-value computation that adds 1 to both numerator and denominator.
The p-value is a calculation assuming that the null hypothesis is true. Under this assumption the real data
produces a statistic at least as extreme as the statistic produced with the real data (that is, itself).
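In code, that version looks like this (a sketch of the adjustment, not from the original text):
(sum(spx.arch.perm >= engle.arch.test(spxret)) + 1) / (length(spx.arch.perm) + 1)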
The reason to do a permutation test is so that we don’t need to depend on an assumption about the distribution
of the data. In this case the standard assumption that the statistic follows a chisquare distribution gives a p-value
of 0.0096 which is in quite good agreement with the permutation test. But we wouldn’t necessarily know
beforehand that they would agree.
In the example test that we performed, we showed evidence of volatility clustering because the statistic from
the actual data was in the right tail of the statistic’s distribution with permuted data. If our null hypothesis had
been that there were some given amount of volatility clustering, then we couldn’t use a permutation test.
Permuting the data gives zero volatility clustering, and we would need data that had that certain amount of
volatility clustering.
Given current knowledge of market data, performing this volatility test is of little interest. Market data do have
volatility clustering. If a test does not show significant volatility clustering, then either it is a small sample or
the data are during a quiescent time period. In this case we have both.
We could have performed bootstrap sampling in our test rather than random permutations. The difference is
that bootstraps sample with replacement, and permutations sample without replacement. In either case, the time
order of the observations is lost and hence volatility clustering is lost — thus assuring that the samples are
under the null hypothesis of no volatility clustering. The permutations always have all of the same observations,
so they are more like the original data than bootstrap samples. The expectation is that the permutation test
should be more sensitive than a bootstrap test. The permutations destroy volatility clustering but do not add any
other variability.
Permutations can not always be used. If we are looking at the annual return (meaning the sum of the
observations), then all permutations of the data will yield the same answer. We get zero variability in this case.
The paper Permuting Super Bowl Theory has a further discussion of random permutation tests. The code from
that paper is in the BurStMisc package.
Cross Validation
The idea: Models should be tested with data that were not used to fit the model. If you have enough data, it is
best to hold back a random portion of the data to use for testing. Cross validation is a trick to get out-of-sample
tests but still use all the data. The sleight of hand is to do a number of fits, each time leaving out a different
portion of the data.
Cross validation is perhaps most often used in the context of prediction. Everyone wants to predict the stock
market. So let’s do it.
Below we predict tomorrow’s IBM return with a quadratic function of today’s IBM and S&P returns.
predictors <- cbind(spxret, ibmret, spxret^2, ibmret^2,
spxret * ibmret)[-251,]
predicted <- ibmret[-1]
predicted.lm <- lm(predicted ~ predictors)
The p-value for this regression is 0.048, so it is good enough to publish in a journal. However, you might want
to hold off putting money on it until we have tested it a bit more.
In cross validation we divide the data into some number of groups. Then for each group we fit the model with
all of the data that are not in the group, and test that fit with the data that are in the group. Below we divide the
data into 5 groups.
group <- rep(1:5, length=250)
group <- sample(group)
mse.group <- numeric(5)
for(i in 1:5) {
    group.lm <- lm(predicted[group != i] ~
        predictors[group != i, ])
    mse.group[i] <- mean((predicted[group == i] -
        cbind(1, predictors[group == i, ]) %*% coef(group.lm))^2)
}
The first command above repeats the numbers 1 through 5 to get a vector of length 250. We do not want to use
this as our grouping because we may be capturing systematic effects. In our case a group would be on one
particular day of the week until a holiday interrupted the pattern. Hence the next line permutes the vector so we
have random assignment, but still an equal number of observations in each group. The for loop estimates each
of the five models and computes the out-of-sample mean squared error.
The mean squared error of the predicted vector taking only its mean into account is 0.794. The in-sample mean
squared error from the regression is 0.759. This is a modest improvement, but in the context of market
prediction might be of use if the improvement were real. However, my cross validation mean squared error is
0.800 — even higher than from the constant model. This is evidence of overfitting (which this regression model
surely is doing).
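For reference, here is one way the three numbers above can be computed (a sketch; your cross validation figure will differ slightly because the group assignment is random):
mean((predicted - mean(predicted))^2)   # constant-only model, about 0.794
mean(resid(predicted.lm)^2)             # in-sample MSE of the regression, about 0.759
mean(mse.group)                         # cross validated MSE; the groups are equal-sized, so this is the overall out-of-sample MSE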
(Practical note: Cross validation destroys the time order of the data, and is not the best way to test this sort of
model. Better is to do a backtest — fit the model up to some date, test the performance on the next period,
move the “current” date forward, and repeat until reaching the end of the data.)
Doing several cross validations (with different group assignments) and averaging can be useful. See for
instance: http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf
Simulation
In a loose sense of the word, all the techniques we’ve discussed can be called simulations. Often the word is
restricted to the situation of getting output from a model given random inputs. For example, we might create a
large number of series that are possible 20-day price paths of a stock. Such paths might be used, for instance, to
price an option.
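A minimal sketch of that kind of simulation (all parameters here are assumptions chosen only for illustration):
# 1000 possible 20-day price paths for a stock starting at 100,
# with independent normal daily log returns (mean 0, sd 1% per day)
paths <- 100 * exp(apply(matrix(rnorm(20 * 1000, 0, 0.01), nrow=20), 2, cumsum))
matplot(paths[, 1:50], type="l", lty=1, xlab="Day", ylab="Price")   # plot the first 50 paths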
Random Portfolios
The idea: We can mimic the creation of a financial portfolio (containing, for example, stocks or bonds) by
satisfying a set of constraints, but otherwise allocating randomly.
Generating random portfolios is a technique in finance that is quite similar to bootstrapping. The process is to
produce a number of portfolios of assets that satisfy some set of constraints. The constraints may be those under
which a fund actually operates, or could be constraints that the fund hypothesizes to be useful.
If the task is to decide on the bound that should be imposed on some constraint, then the random portfolios are
precisely analogous to bootstrapping. The distribution (of the returns or volatility or …) is found, showing
location and variability.
Another use of random portfolios is to test if a fund exhibits skill. Random portfolios are generated which
match the constraints of the fund but have no predictive ability. The fraction of random portfolios that
outperform the actual fund is a p-value of the null hypothesis that the fund has zero skill. This is quite similar to
a permutation test with restrictions on the permutations.
See “Random Portfolios in Finance” for more.
Summary
When bootstrapping was invented in the late 1970s, it was outrageously computationally intense. Now a
bootstrap can sometimes be performed in the blink of an eye. Bootstrapping, random permutation tests and
cross validation should be standard tools for anyone analyzing data.
R Library: Introduction to bootstrapping
The programs
The R program (as a text file) for the code on this page.
In order to see more than just the results from the computations of the functions (i.e. if you want to see the
functions echoed back in the console as they are processed), use the echo=T option in the source function when
running the program.
source("d:/stat/bootstrap.txt", echo=T)
Introduction
Bootstrapping can be a very useful tool in statistics and it is very easily implemented in R. Bootstrapping comes
in handy when there is doubt that the usual distributional assumptions and asymptotic results are valid and
accurate. Bootstrapping is a nonparametric method which lets us compute estimated standard errors and
confidence intervals, and perform hypothesis tests.
Generally bootstrapping follows the same basic steps:
1. Resample a given data set a specified number of times
2. Calculate a specific statistic from each sample
3. Find the standard deviation of the distribution of that statistic
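Those three steps in miniature (a sketch using a made-up data vector and the mean as the statistic):
x <- rnorm(30)                                                  # made-up data
boot.means <- replicate(1000, mean(sample(x, replace=TRUE)))    # steps 1 and 2
sd(boot.means)                                                  # step 3: bootstrap standard error of the mean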
The sample function
A major component of bootstrapping is being able to resample a given data set and in R the function which does this is
the sample function.
sample(x, size, replace, prob)
The first argument is a vector containing the data set to be resampled or the indices of the data to be resampled.
The size option specifies the sample size with the default being the size of the population being resampled. The
replace option determines if the sample will be drawn with or without replacement where the default value is
FALSE, i.e. without replacement. The prob option takes a vector of length equal to the data set given in the
first argument containing the probability of selection for each element of x. The default value is for a random
sample where each element has equal probability of being sampled. In a typical bootstrapping situation we
would want to obtain bootstrapping samples of the same size as the population being sampled and we would
want to sample with replacement.
#using sample to generate a permutation of the sequence 1:10
sample(10)
[1] 4 8 3 5 1 10 6 2 9 7
#bootstrap sample from the same sequence
sample(10, replace=T)
[1] 1 3 9 4 10 3 5 1 6 4
#bootstrap sample from the same sequence with
#probabilities that favor the numbers 1-5
prob1 <- c(rep(.15, 5), rep(.05, 5))
prob1
[1] 0.15 0.15 0.15 0.15 0.15 0.05 0.05 0.05 0.05 0.05
sample(10, replace=T, prob=prob1)
[1] 4 2 1 7 6 5 4 4 2 9
#sample of size 5 from elements of a matrix
#creating the data matrix
y1 <- matrix( round(rnorm(25,5)), ncol=5)
y1
     [,1] [,2] [,3] [,4] [,5]
[1,]    6    4    6    4    5
[2,]    6    5    5    7    4
[3,]    5    4    5    7    6
[4,]    5    3    6    6    6
[5,]    3    4    4    5    5
#saving the sample of size 5 in the vector x1
x1 <- y1[sample(25, 5)]
x1
[1] 6 4 5 5 4
#sampling the rows of the a matrix
#creating the data matrix
y2 <- matrix( round(rnorm(40, 5)), ncol=5)
y2
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    5    4    7    4
[2,]    5    6    4    6    4
[3,]    5    4    4    6    3
[4,]    5    6    5    6    6
[5,]    6    5    4    4    4
[6,]    5    5    5    4    5
[7,]    4    5    5    5    4
[8,]    5    5    4    6    6
#saving the sample of rows in the matrix x2
x2 <- y2[sample(8, 3), ]
x2
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    5    5    5    4
[2,]    5    6    5    6    6
[3,]    6    5    4    4    4
A bootstrap example
In the following bootstrapping example we would like to obtain a standard error for the estimate of the median.
We will be using the lapply, sapply functions in combination with the sample function. (For more information
about the lapply and sapply function please look at the advanced function R library pages or consult the help
manuals.)
#calculating the standard error of the median
#creating the data set by taking 100 observations
#from a normal distribution with mean 5 and stdev 3
#we have rounded each observation to nearest integer
data <- round(rnorm(100, 5, 3))
data[1:10]
[1] 6 3 3 4 3 8 2 2 3 2
#obtaining 20 bootstrap samples
#display the first of the bootstrap samples
resamples <- lapply(1:20, function(i)
sample(data, replace = T))
resamples[1]
[[1]]:
  [1] 5 1 7 6 5 2 2 6 9 5 4 6 6 ...
 (output truncated: 100 resampled values in all)
#calculating the median for each bootstrap sample
r.median <- sapply(resamples, median)
r.median
[1] 4.0 4.5 4.0 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0
[19] 6.0 5.0
#calculating the standard deviation of the distribution of medians
sqrt(var(r.median))
[1] 0.5250313
#displaying the histogram of the distribution of the medians
hist(r.median)
We can put all these steps into a single function where all we would need to specify is which data set to use and
how many times we want to resample in order to obtain the adjusted standard error of the median. For more
information on how to construct functions please consult the R library pages on introduction to functions and
advanced functions.
#function which will bootstrap the standard error of the median
b.median <- function(data, num) {
    resamples <- lapply(1:num, function(i) sample(data, replace=T))
    r.median <- sapply(resamples, median)
    std.err <- sqrt(var(r.median))
    list(std.err=std.err, resamples=resamples, medians=r.median)
}
#generating the data to be used (same as in the above example)
data1 <- round(rnorm(100, 5, 3))
#saving the results of the function b.median in the object b1
b1 <- b.median(data1, 30)
#displaying the first of the 30 bootstrap samples
b1$resamples[1]
[[1]]:
  [1] 6 6 6 0 9 ...
 (output truncated: 100 resampled values in all)
#displaying the standard error
b1$std.err
[1] 0.5155477
#displaying the histogram of the distribution of medians
hist(b1$medians)
#we can input the data directly into the function and display
#the standard error in one line of code
b.median(rnorm(100, 5, 2), 50)$std.err
[1] 0.5104178
#displaying the histogram of the distribution of medians
hist(b.median(rnorm(100, 5, 2), 50)$medians)
It would be fairly simple to generalize the function to work for any summary statistic. We will not show that
generalized function but encourage the user to try and figure out how to do it before downloading the program
which has the answer.
Built in bootstrapping functions
R has numerous built-in bootstrapping functions, too many to mention all of them on this page; please refer to
the boot package.
#R example of the function boot
#bootstrap of the ratio of means using the city data included in the boot package
#loading the boot package and obtaining the data from it
library(boot)
data(city)
#defining the ratio function
ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
#using the boot function
boot(city, ratio, R=999, stype="w")
ORDINARY NONPARAMETRIC BOOTSTRAP

Bootstrap Statistics :
    original       bias    std. error
t1* 1.520313  0.04465751   0.2137274

For more information, see the documentation for the boot package.
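Confidence intervals can then be requested from the returned boot object; a sketch (the object name and the choice of interval types are just examples):
city.boot <- boot(city, ratio, R=999, stype="w")
boot.ci(city.boot, type=c("perc", "bca"))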
Randomly extract rows from dataset
> I am looking for a way to randomly extract a specified number of rows from a
> data frame. I was planning on binding a column of random numbers to the
> data frame and then sorting the data frame using this bound column. But I
> can't figure out how to use this column to sort the entire data frame so
> that the content of the rows remains together. Does anyone know how I can
> do this? Hints for other ways to approach this problem would also be
> appreciated.
>
> Cheers
> Amy
See ?sample
Using the 'iris' dataset in R:
# Select 2 random rows
> iris[sample(nrow(iris), 2), ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
96          5.7         3.0          4.2         1.2 versicolor
17          5.4         3.9          1.3         0.4     setosa
# Select 5 random rows
> iris[sample(nrow(iris), 5), ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
83          5.8         2.7          3.9         1.2 versicolor
12          4.8         3.4          1.6         0.2     setosa
63          6.0         2.2          4.0         1.0 versicolor
80          5.7         2.6          3.5         1.0 versicolor
49          5.3         3.7          1.5         0.2     setosa
Sample of rows from part of a dataframe?
If I just have data such as
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
then I can easily sample the ages of three of the Fs with
sample(age[gender == "F"], 3)
and get something like
[1] 31 35 29
but if I turn this data into a dataframe
mydf <- data.frame(gender, age)
I cannot use the obvious
sample(mydf[mydf$gender == "F", ], 3)
though I can concoct something convoluted with an absurd number of brackets like
mydf[sample((1:nrow(mydf))[mydf$gender == "F"], 3), ]
and get what I want which is something like
  gender age
7      F  35
4      F  29
1      F  23
Is there a better way that takes me less time to work out how to write?
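One common alternative (a sketch, not from the original thread) is to subset first and then sample rows from the subset:
females <- mydf[mydf$gender == "F", ]
females[sample(nrow(females), 3), ]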
Carrie Li wrote:
> Dear R-helpers,
>
> I would like to generate a variable that takes 0 or 1, and each subject has
> different probabilities of taking the draw.
>
> So, which of the following code I should use ?
>
<snip>
I don't think either.
Try this:
probs <- seq(0,1, by = .1)
sapply(probs, function(x) sample(0:1, 1, prob = c(1-x, x)))
probs will be the vector of your probabilities for obtaining a '1' per subject,
so set that to whatever you want, then run the second line.
Jorge I Velez
May 23, 2010; 5:16am
Bernoulli random variable with different probability
In reply to this post by Carrie Li
Hi Carrie,
Use the first approach:
n <- 5
p <- c(0.2, 0.9, 0.15, 0.8, 0.75)
rbinom(n, 1, p)
# [1] 0 0 0 1 1
rbinom(n, 1, p)
# [1] 1 1 0 1 1
To check, replicate the analysis 5000 times and then estimate the
probability for each subject:
rowMeans(replicate(5000, rbinom(n, 1, p)))
# 0.2002 0.9026 0.1550 0.8020 0.7562
Using Erik's solution you will get the "same" results:
> rowMeans(replicate(5000, sapply(p, function(x) sample(0:1, 1, prob =
c(1-x, x)))))
# [1] 0.1958 0.9006 0.1492 0.8066 0.7446
Note that in both cases the result is close to the original values of p.
HTH,
Jorge
rvbern {rv}
R Documentation
Generate a Random Vector from a Bernoulli Sampling Model
Description
rvbern generates a random vector where each simulation comes from a Bernoulli sampling distribution.
Usage
rvbern(n=1, prob, logical=FALSE)
Arguments
n        number of random scalars to draw
prob     probability of ``success''; may be a random vector itself
logical  logical; return a logical random variable instead
Details
rvbern is a special case of rvbinom with the argument size=1.
If logical is TRUE, the function returns a logical random variable which has TRUE for 1, FALSE for 0. (The
printed summary of this object is slightly different from a regular continuous numeric random variable.)
Value
A random vector (an rv object) of length n.
Note
The resulting vector will not be independent and identically distributed Bernoulli unless prob is a fixed
number.
Author(s)
Jouni Kerman
References
Kerman, J. and Gelman, A. (2007). Manipulating and Summarizing Posterior Simulations Using Random
Variable Objects. Statistics and Computing 17:3, 235-244.
See also vignette("rv").
Examples
rvbern(10, prob=0.5)
rvbinom(10, size=1, prob=0.5) # Equivalent
print(rvbern(1, 0.5))
print(rvbern(1, 0.5, logical=TRUE)) # won't show the quantiles
print(as.logical(rvbern(1, 0.5))) # equivalent
Binomial {stats}
R Documentation
The Binomial Distribution
Description
Density, distribution function, quantile function and random generation for the binomial distribution with
parameters size and prob.
Usage
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
Arguments
x, q        vector of quantiles.
p           vector of probabilities.
n           number of observations. If length(n) > 1, the length is taken to be the number required.
size        number of trials (zero or more).
prob        probability of success on each trial.
log, log.p  logical; if TRUE, probabilities p are given as log(p).
lower.tail  logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
Details
The binomial distribution with size = n and prob = p has density
p(x) = choose(n, x) p^x (1-p)^(n-x)
for x = 0, …, n. Note that binomial coefficients can be computed by choose in R.
If an element of x is not integer, the result of dbinom is zero, with a warning. p(x) is computed using Loader's
algorithm, see the reference below.
The quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function.
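For instance (a small illustration, not part of the help page), the median of a Binomial(10, 0.3) variable is the smallest x with F(x) >= 0.5:
qbinom(0.5, size=10, prob=0.3)   # returns 3, since pbinom(2,10,0.3) < 0.5 <= pbinom(3,10,0.3)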
Value
dbinom gives the density, pbinom gives the distribution function, qbinom gives the quantile function and
rbinom generates random deviates.
If size is not an integer, NaN is returned.
The length of the result is determined by n for rbinom, and is the maximum of the lengths of the numerical
parameters for the other functions.
The numerical parameters other than n are recycled to the length of the result. Only the first elements of the
logical parameters are used.
Source
For dbinom a saddle-point expansion is used: see
Catherine Loader (2000). Fast and Accurate Computation of Binomial Probabilities; available from
http://www.herine.net/stat/software/dbinom.html.
pbinom uses pbeta.
qbinom uses the Cornish–Fisher Expansion to include a skewness correction to a normal approximation,
followed by a search.
rbinom (for size < .Machine$integer.max) is based on
Kachitvichyanukul, V. and Schmeiser, B. W. (1988) Binomial random variate generation. Communications of
the ACM, 31, 216–222.
For larger values it uses inversion.
See Also
Distributions for other standard distributions, including dnbinom for the negative binomial, and dpois for the
Poisson distribution.
Examples
require(graphics)
# Compute P(45 < X < 55) for X Binomial(100,0.5)
sum(dbinom(46:54, 100, 0.5))
## Using "log = TRUE" for an extended range :
n <- 2000
k <- seq(0, n, by = 20)
plot (k, dbinom(k, n, pi/10, log = TRUE), type = "l", ylab = "log density",
main = "dbinom(*, log=TRUE) is better than log(dbinom(*))")
lines(k, log(dbinom(k, n, pi/10)), col = "red", lwd = 2)
## extreme points are omitted since dbinom gives 0.
mtext("dbinom(k, log=TRUE)", adj = 0)
mtext("extended range", adj = 0, line = -1, font = 4)
mtext("log(dbinom(k))", col = "red", adj = 1)
How to generate a binomial sample and plot histogram using R
Here's how you can do it:
#create a place to store your resamples
histdat=rep(NA,1000)
#loop
for(i in 1:1000){
histdat[i]=sum(sample(c(0,1),50,replace=T,prob=c(0.2,0.8)))
}
#plot
hist(histdat,breaks=seq(30,50,1))
#Check with theoretical curve
lines(seq(30,50,1),dbinom(seq(30,50,1),50,0.8)* 1e+3 ,col="green")
#seems to have worked!
Good luck,
How to generate a random number in R
By David Smith on July 21, 2010
As a language for statistical analysis, R has a comprehensive library of functions for generating random
numbers from various statistical distributions. In this post, I want to focus on the simplest of questions: How
do I generate a random number?
The answer depends on what kind of random number you want to generate. Let's illustrate by example.
Generate a random number between 5.0 and 7.5
If you want to generate a decimal number where any value (including fractional values) between the stated
minimum and maximum is equally likely, use the runif function. This function generates values from the
Uniform distribution. Here's how to generate one random number between 5.0 and 7.5:
> x1 <- runif(1, 5.0, 7.5)
> x1
[1] 6.715697
Of course, when you run this, you'll get a different number, but it will definitely be between 5.0 and 7.5. You
won't get the values 5.0 or 7.5 exactly, either.
If you want to generate multiple random values, don't use a loop. You can generate several values at once by
specifying the number of values you want as the first argument to runif. Here's how to generate 10 values
between 5.0 and 7.5:
> x2 <- runif(10, 5.0, 7.5)
> x2
[1] 6.339188 5.311788 7.099009 5.746380 6.720383 7.433535 7.159988
[8] 5.047628 7.011670 7.030854
Generate a random integer between 1 and 10
This looks like the same exercise as the last one, but now we only want whole numbers, not fractional values.
For that, we use the sample function:
> x3 <- sample(1:10, 1)
> x3
[1] 4
The first argument is a vector of valid numbers to generate (here, the numbers 1 to 10), and the second
argument indicates one number should be returned. If we want to generate more than one random number, we
have to add an additional argument to indicate that repeats are allowed:
> x4 <- sample(1:10, 5, replace=T)
> x4
[1] 6 9 7 6 5
Note the number 6 appears twice in the 5 numbers generated. (Here's a fun exercise: what is the probability of
running this command and having no repeats in the 5 numbers generated?)
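If you are curious, the exercise can be checked empirically (a sketch, not part of the original post):
mean(replicate(10000, !any(duplicated(sample(1:10, 5, replace=T)))))   # empirical estimate
prod(10:6) / 10^5    # exact probability: 10*9*8*7*6 / 10^5 = 0.3024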
Select 6 random numbers between 1 and 40, without replacement
If you wanted to simulate the lotto game common to many countries, where you randomly select 6 balls from
40 (each labelled with a number from 1 to 40), you'd again use the sample function, but this time without
replacement:
> x5 <- sample(1:40, 6, replace=F)
> x5
[1] 10 21 29 12 7 31
You'll get a different 6 numbers when you run this, but they'll all be between 1 and 40 (inclusive), and no
number will repeat. Also, you don't actually need to include the replace=F option -- sampling without
replacement is the default -- but it doesn't hurt to include it for clarity.
Select 10 items from a list of 50
You can use this same idea to generate a random subset of any vector, even one that doesn't contain numbers.
For example, to select 10 distinct states of the US at random:
> sample(state.name, 10)
[1] "Virginia"
"Oklahoma"
"Maryland"
[5] "Alaska"
"South Dakota" "Minnesota"
[9] "Indiana"
"Connecticut"
"Michigan"
"Idaho"
You can't sample more values than you have without allowing replacements:
> sample(state.name, 52)
Error in sample(state.name, 52) :
cannot take a sample larger than the population when 'replace = FALSE'
... but sampling exactly the number you do have is a great way to randomize the order of a vector. Here are the
50 states of the US, in random order:
> sample(state.name, 50)
[1] "California"
"Iowa"
[4] "Montana"
"South Dakota"
[7] "Louisiana"
"Maine"
[10] "New Hampshire" "Rhode Island"
[13] "Florida"
"North Carolina"
[16] "Arkansas"
"Pennsylvania"
[19] "Idaho"
"Connecticut"
[22] "South Carolina" "Illinois"
[25] "New Jersey"
"Indiana"
[28] "Mississippi"
"Michigan"
[31] "West Virginia" "Alaska"
[34] "Vermont"
"Virginia"
[37] "Washington"
"New Mexico"
[40] "Delaware"
"Nevada"
[43] "Kentucky"
"Missouri"
[46] "Tennessee"
"Arizona"
[49] "Kansas"
"Nebraska"
"Hawaii"
"North Dakota"
"Maryland"
"Texas"
"Minnesota"
"Colorado"
"Utah"
"Ohio"
"Wisconsin"
"Wyoming"
"Georgia"
"Oklahoma"
"New York"
"Alabama"
"Oregon"
"Massachusetts"
You could also have just used sample(state.name) for the same result -- sampling as many values as
provided is the default.
Further reading
For more information about how R generates random numbers, check out the following help pages: runif, and
sample. .Random.seed provides technical detail on the random number generator R uses, and how you can set
the random seed to recreate strings of random numbers.
Verzani - Random Data
Although Einstein said that god does not play dice, R can. For example
> sample(1:6,10,replace=T)
[1] 6 4 4 3 5 2 3 3 5 4
or with a function
> RollDie = function(n) sample(1:6,n,replace=T)
> RollDie(5)
[1] 3 6 1 2 2
In fact, R can create lots of different types of random numbers
ranging from familiar families of distributions to
specialized ones.
6.1 Random number generators in R-- the ``r'' functions.
As we know, random numbers are described by a distribution. That is, some function which specifies the
probability that a random number is in some range. For example P(a < X ≤ b). Often this is given by a
probability density (in the continuous case) or by a function P(X=k) = f(k) in the discrete case. R will give
numbers drawn from lots of different distributions. In order to use them, you only need familiarize yourselves
with the parameters that are given to the functions such as a mean, or a rate. Here are examples of the most
common ones. For each, a histogram is given for a random sample of size 100, and density (using the ``d''
functions) is superimposed as appropriate.
Uniform.
Uniform numbers are ones that are "equally likely" to be in the specified range. Often these numbers are
in [0,1] for computers, but in practice can be between [a,b] where a,b depend upon the problem. An
example might be the time you wait at a traffic light. This might be uniform on [0,2].
> runif(1,0,2)    # time at light; also runif(1,min=0,max=2)
[1] 1.490857
> runif(5,0,2)    # time at 5 lights
[1] 0.07076444 0.01870595 0.50100158 0.61309213 0.77972391
> runif(5)        # 5 random numbers in [0,1]
[1] 0.1705696 0.8001335 0.9218580 0.1200221 0.1836119
The general form is runif(n,min=0,max=1) which allows you to decide how many uniform random
numbers you want (n), and the range they are chosen from ([min,max]).
To see the distribution with min=0 and max=1 (the default) we have
> x=runif(100)    # get the random numbers
> hist(x,probability=TRUE,col=gray(.9),main="uniform on [0,1]")
> curve(dunif(x,0,1),add=T)
Figure 25: 100 uniformly random numbers on [0,1]
The only tricky thing was plotting the histogram with a background ``color''. Notice how the dunif
function was used with the curve function.
Normal.
Normal numbers are the backbone of classical statistical theory due to the central limit theorem. The
normal distribution has two parameters, a mean μ and a standard deviation σ. These are the location and
spread parameters. For example, IQs may be normally distributed with mean 100 and standard deviation
16, and human gestation may be normal with mean 280 and standard deviation about 10 (approximately).
The family of normals can be standardized to a normal with mean 0 (centered) and variance 1. This is
achieved by "standardizing" the numbers, i.e. Z=(X-μ)/σ.
Here are some examples
> rnorm(1,100,16)            # an IQ score
[1] 94.1719
> rnorm(1,mean=280,sd=10)    # how long for a baby (10 days early)
[1] 270.4325
Here the function is called as rnorm(n,mean=0,sd=1) where one specifies the mean and the standard
deviation.
To see the shape for the defaults (mean 0, standard deviation 1) we have (figure 26)
> x=rnorm(100)
> hist(x,probability=TRUE,col=gray(.9),main="normal mu=0,sigma=1")
> curve(dnorm(x),add=T)
## also for IQs using rnorm(100,mean=100,sd=16)
Figure 26: Normal(0,1) and normal(100,16)
Binomial.
The binomial random numbers are discrete random numbers. They have the distribution of the number
of successes in n independent Bernoulli trials where a Bernoulli trial results in success or failure,
success with probability p.
A single Bernoulli trial is given with n=1 in the binomial
> n=1; p=.5                  # set the probability
> rbinom(1,n,p)              # different each time
[1] 1
> rbinom(10,n,p)             # 10 different such numbers
[1] 0 1 1 0 1 0 1 0 1 0
A binomially distributed number is the same as the number of 1's in n such Bernoulli numbers. For the
last example, this would be 5. There are then two parameters n (the number of Bernoulli trials) and p
(the success probability).
To generate binomial numbers, we simply change the value of n from 1 to the desired number of trials.
For example, with 10 trials:
> n = 10; p=.5
> rbinom(1,n,p)              # 6 successes in 10 trials
[1] 6
> rbinom(5,n,p)              # 5 binomial numbers
[1] 6 6 4 5 4
The number of successes is of course discrete, but as n gets large, the number starts to look quite
normal. This is a case of the central limit theorem which states in general that (x̄ - μ)/(σ/√n) is normal in
the limit (note this is standardized as above) and in our specific case that
(p̂ - p) / √(pq/n)
is approximately normal, where p̂ = (number of successes)/n.
The graphs (figure 27) show 100 binomially distributed random numbers for 3 values of n and for
p=.25. Notice in the graph, as n increases the shape becomes more and more bell-shaped. These graphs
were made with the commands
> n=5;p=.25
# change as appropriate
> x=rbinom(100,n,p)
# 100 random numbers
> hist(x,probability=TRUE)
## use points, not curve as dbinom wants integers only for x
> xvals=0:n;points(xvals,dbinom(xvals,n,p),type="h",lwd=3)
> points(xvals,dbinom(xvals,n,p),type="p",lwd=3)
... repeat with n=15, n=50
Figure 27: Random binomial data with the theoretical distribution
Exponential
The exponential distribution is important for theoretical work. It is used to describe lifetimes of
electrical components (to first order). For example, if the mean life of a light bulb is 2500 hours one
may think its lifetime is random with exponential distribution having mean 2500. The one parameter is
the rate = 1/mean. We specify it as follows rexp(n,rate=1). Here is an example with the rate being
1/2500 (figure 28).
> x=rexp(100,1/2500)
> hist(x,probability=TRUE,col=gray(.9),main="exponential mean=2500")
> curve(dexp(x,1/2500),add=T)
Figure 28: Random exponential data with theoretical density
There are others of interest in statistics. Common ones are the Poisson, the Student t-distribution, the F
distribution, the beta distribution and the χ² (chi-squared) distribution.
6.2 Sampling with and without replacement using sample
R has the ability to sample with and without replacement. That is, choose at random from a collection of things
such as the numbers 1 through 6 in the dice rolling example. The sampling can be done with replacement (like
dice rolling) or without replacement (like a lottery). By default sample samples without replacement, with each
object having an equal chance of being picked. You need to specify replace=TRUE if you want to sample with
replacement. Furthermore, you can specify separate probabilities for each if desired.
Here are some examples
## Roll a die
> sample(1:6,10,replace=TRUE)          # no sixes!
[1] 5 1 5 3 3 4 5 4 2 1
## toss a coin
> sample(c("H","T"),10,replace=TRUE)
[1] "H" "H" "T" "T" "T" "T" "H" "H" "T" "T"
## pick 6 of 54 (a lottery)
> sample(1:54,6)                       # no replacement
[1] 6 39 23 35 25 26
## pick a card. (Fancy! Uses paste, rep)
> cards = paste(rep(c("A",2:10,"J","Q","K"),4),c("H","D","S","C"))
> sample(cards,5)                      # a pair of jacks, no replacement
[1] "J D" "5 C" "A S" "2 D" "J H"
## roll 2 die. Even fancier
> dice = as.vector(outer(1:6,1:6,paste))
> sample(dice,5,replace=TRUE)          # replace when rolling dice
[1] "1 1" "4 1" "6 3" "4 4" "2 6"
The last two illustrate things that can be done with a little typing and a lot of thinking using the fun commands
paste for pasting together strings, rep for repeating things and outer for generating all possible products.
6.3 A bootstrap sample
Bootstrapping is a method of sampling from a data set to make statistical inference. The intuitive idea is that by
sampling, one can get an idea of the variability in the data. The process involves repeatedly selecting samples
and then forming a statistic. Here is a simple illustration on obtaining a sample.
The built in data set faithful has a variable ``eruptions'' that measures the time between eruptions at Old
Faithful. It has an unusual distribution. A bootstrap sample is just a sample with replacement from the given
values. It can be found as follows
> data(faithful)                       # part of R's base
> names(faithful)                      # find the names for faithful
[1] "eruptions" "waiting"
> eruptions = faithful[['eruptions']]  # or attach and detach faithful
> sample(eruptions,10,replace=TRUE)
[1] 2.03 4.37 4.80 1.98 4.32 2.18 4.80 4.90 4.03 4.70
> hist(eruptions,breaks=25)            # the dataset
## the bootstrap sample
> hist(sample(eruptions,100,replace=TRUE),breaks=25)
Figure 29: Bootstrap sample
Notice that the bootstrap sample has a similar histogram, but it is different (figure 29).
6.4 d, p and q functions
The d functions were used to plot the theoretical densities above. As with the ``r'' functions, you need to specify
the parameters, but differently, you need to specify the x values (not the number of random numbers n).
Figure 30: Illustration of 'p' and 'q' functions
The p and q functions are for the cumulative distribution functions and the quantiles. As mentioned, the
distribution of a random number is specified by the probability that the number is between a and b for arbitrary
a and b, P(a < X ≤ b). In fact, the value F(x) = P(X ≤ x) is enough.
The p functions answer what is the probability that a random variable is less than x. Such as for a standard
normal, what is the probability it is less than .7?
> pnorm(.7)          # standard normal
[1] 0.7580363
> pnorm(.7,1,1)      # normal mean 1, std 1
[1] 0.3820886
Notationally, these answer P(Z ≤ .7) where Z is a standard normal or normal(1,1). To answer P(Z > .7) is also
easy. You can do the work by noting this is 1 - P(Z ≤ .7) or let R do the work, by specifying lower.tail=F as
in:
> pnorm(.7,lower.tail=F)
[1] 0.2419637
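Probabilities over an interval, P(a < X ≤ b), come from subtracting two values of the p function. For example (an added illustration), the chance that a standard normal lands between -1 and .7 is
> pnorm(.7) - pnorm(-1) # roughly 0.60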
The q functions are inverse to this: they ask what value corresponds to a given probability. This is the quantile, or the point in the data that splits it accordingly. For example, what value of z has .75 of the area to the left for a standard normal? (This is Q3.)
> qnorm(.75)
[1] 0.6744898
Notationally, this is finding the z which solves 0.75 = P(Z ≤ z).
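Since pnorm and qnorm are inverses of one another, feeding the output of one into the other recovers the starting value; this quick check (an added illustration) makes the relationship concrete:
> qnorm(pnorm(.7)) # gives back .7
> pnorm(qnorm(.75)) # gives back .75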
6.5 Standardizing, scale and z scores
To standardize a random variable you subtract the mean and then divide by the standard deviation. That is, Z = (X - μ) / σ. To do so requires knowledge of the mean and standard deviation.
You can also standardize a sample. There is a convenient function scale that will do this for you. This will
make your sample have mean 0 and standard deviation 1. This is useful for comparing random variables which
live on different scales.
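As a small added sketch: note that scale standardizes using the sample's own mean and standard deviation (not known population parameters), so the result always has mean 0 and standard deviation 1:
> x = rnorm(5,100,16)
> z = as.vector(scale(x)) # (x - mean(x)) / sd(x)
> mean(z) # essentially 0 (up to rounding)
> sd(z) # exactly 1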
Normal random variables are often standardized as the distribution of the standardized normal variable is again
normal with mean 0 and variance 1. (The ``standard'' normal.) The z-score of a normal number is the value of it
after standardizing.
If we have normal data with mean 100 and standard deviation 16 then the following will find the z-scores
> x = rnorm(5,100,16)
> x
[1] 93.45616 83.20455 64.07261 90.85523 63.55869
> z = (x-100)/16
> z
[1] -0.4089897 -1.0497155 -2.2454620 -0.5715479 -2.2775819
The z-score is traditionally used to look up the probability of being to the left of the value of x for the given random variable; this way only one table of normal probabilities is needed. With R this is not necessary, as we can use the pnorm function directly
> pnorm(z)
[1] 0.34127360 0.14692447 0.01236925 0.28381416 0.01137575
> pnorm(x,100,16) # enter in the parameters
[1] 0.34127360 0.14692447 0.01236925 0.28381416 0.01137575
6.6 Problems
6.1
Generate 10 random numbers from a uniform distribution on [0,10]. Use R to find the maximum and minimum values.
6.2
Generate 10 random normal numbers with mean 5 and standard deviation 5 (normal(5,5)). How many
are less than 0? (Use R)
6.3
Generate 100 random normal numbers with mean 100 and standard deviation 10. How many are 2
standard deviations from the mean (smaller than 80 or bigger than 120)?
6.4
Toss a fair coin 50 times (using R). How many heads do you have?
6.5
Roll a ``die'' 100 times. How many 6's did you see?
6.6
Select 6 numbers from a lottery containing 49 balls. What is the largest number? What is the smallest?
Answer these using R.
6.7
For normal(0,1), find a number z* solving P(Z ≤ z*) = .05 (use qnorm).
6.8
For normal(0,1), find a number z* solving P(-z* ≤ Z ≤ z*) = .05 (use qnorm and symmetry).
6.9
How much area (probability) is to the right of 1.5 for a normal(0,2)?
6.10
Make a histogram of 100 exponential numbers with mean 10. Estimate the median. Is it more or less
than the mean?
6.11
Can you figure out what this R command does?
> rnorm(5,mean=0,sd=1:5)
6.12
Use R to pick 5 cards from a deck of 52. Did you get a pair or better? Repeat until you do. How long did
it take?
Copyright © John Verzani, 2001-2. All rights reserved.
From the help page, we have seen that the sample function can take a number of different arguments. x must be
a vector of items, size must be a number. Since 1:6 gives a vector of the numbers from 1 to 6, we can set x=1:6
and size=5. Here are 5 examples (note that the first 4 are equivalent, although the actual result will differ due to
chance effects when sampling[1]).
### The next 4 lines are equivalent: 5 numbers are selected from the list 1..6
sample(x=1:6, size=5, replace=FALSE)   # sampling WITHOUT replacement: each number can appear only once
[1] 1 5 4 3 6
sample(replace=FALSE, size=5, x=1:6)   # you can change the order of the arguments
[1] 5 6 4 2 1
sample(x=1:6, size=5)                  # the same, because replace=FALSE by default
[1] 2 3 4 6 5
sample(1:6, 5)                         # we don't need x= and size= if arguments are in the same order as in the help file
[1] 1 6 3 5 4
### Now simulate a different model
sample(1:6, 5, TRUE)                   # sampling WITH replacement (the same number can appear twice)
[1] 3 6 2 1 3
As noted, our "fair die" model is one of sampling with replacement: the same number can appear twice (and
indeed does in our data). So our simulation model in R is simply
sample(1:6, 5, TRUE)
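Scaling the same model up gives a quick sanity check that the "die" is fair; this tabulation is an added sketch, and the exact counts will differ on every run:
rolls <- sample(1:6, 6000, replace=TRUE)
table(rolls)   # each face should appear roughly 1000 times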
r.sample (Sample Features)
Creates a random or stratified random sample of records in a feature table
Description
This tool uses the R sample command to create a random or stratified random sample of records in a feature
attribute table. Sampled records are coded with a 1 in a field in the attribute table, and all other records are 0.
By default this field is named RNDSAMP, although you can specify any other field names using the 'field'
parameter. This tool will not automatically overwrite an existing field unless you explicitly specify
'overwrite=TRUE' in the parameter list.
The number of features that are sampled can be specified as a count using the 'size' parameter, or as a
proportion of the total number of records using the 'proportion' parameter. In the case of stratified sampling, the
size parameter refers to the number of samples that will be selected within each stratum. By default, all strata
are processed. If you wish to randomly sample among strata first, and then sample within the selected strata, use the stratalimit parameter (this controls the number of strata that are randomly selected).
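The tool's internal code is not shown here, but the stratified idea can be sketched in plain R. This is an illustration only: 'dat' and its 'group' column are hypothetical names, and the sample.int idiom avoids sample()'s length-one surprise discussed later in this document.
## illustration only -- not the r.sample implementation
take_per_stratum <- 5
strata <- split(seq_len(nrow(dat)), dat$group)   # row indices grouped by stratum
sampled_rows <- unlist(lapply(strata, function(idx)
    idx[sample.int(length(idx), min(take_per_stratum, length(idx)))]))
dat$RNDSAMP <- 0
dat$RNDSAMP[sampled_rows] <- 1                   # mimic the 1/0 coding described above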
For simple random samples a weight field can be specified with values that are proportional to the relative
probability of selection. The values cannot be less than 0, but can be greater than 1 because the tool will
automatically standardize the values by dividing by the maximum value (this ensures all the values are within
the range 0-1). The weight field functionality is not yet implemented for the stratified sampling tool. If no
weight field is specified, all records or features have an equal probability of being selected.
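Underneath, this kind of weighted selection maps directly onto sample()'s prob argument. The sketch below is an illustration only, with a made-up weight vector w (it is not the tool's code); it mirrors the description above by dividing the weights by their maximum:
## illustration only: weighted random sampling of record numbers
w <- c(0.2, 1.5, 0.7, 3.0, 0.1)                        # hypothetical relative selection weights
picked <- sample(seq_along(w), size=2, prob=w/max(w))  # weights standardized to the range 0-1
picked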
When using the size parameter, if you specify a sample size that is greater than the number of available records
then a warning message will be generated, but all available features will be marked as being sampled.
This is a stochastic algorithm and is, therefore, very unlikely to yield the same results each time you run it. The
'verbose' option can be useful for checking that this algorithm is working correctly as it provides a detailed
report of the sampling process. For further information on the R sample command, type '? sample' at the R
prompt, and press Enter.
This command is driven by R. Type 'citation' to see the suggested citation for R.
Syntax
r.sample(in, size, proportion, [field], [weightfield], [stratified], [stratalimit], [overwrite], [verbose], [where]);
in            the input feature data source
size          the number of features to sample (integer); takes precedence over 'proportion' if both are specified
proportion    the proportion of features to sample (0.0-1.0)
[field]       the field that will record the selection (if it exists the program will stop, but see the overwrite option below) (default=RNDSAMP)
[weightfield] the field that contains relative probabilities of selection (does not apply to stratified samples; see documentation for details)
[stratified]  the field that describes the strata in the data (typically an integer field representing unique group IDs); the count or proportion options are applied at the level of the strata
[stratalimit] the number of strata to randomly sample (if undefined, all strata are processed)
[overwrite]   (TRUE/FALSE): if TRUE, an existing output field will automatically be deleted and recreated; if FALSE the program stops with an error message if the field exists (default=FALSE)
[verbose]     (TRUE/FALSE): if TRUE, reports the sequence of sampled record numbers in the output window (default=FALSE)
[where]       the selection statement that will be applied to the feature data source to identify a subset of features to process (see full Help documentation for further details)
Example
r.sample(in="C:\data\plots.shp", size=100, overwrite=TRUE);
r.sample(in="C:\data\plots.shp", size=100, overwrite=TRUE, weightfield="SELPROB");
r.sample(in="C:\data\locs.shp", field="STRSEL", proportion=0.1, stratified="ANIMALID", verbose=TRUE);
Random Samples and Permutations
sample {base}
R Documentation
Description
sample takes a sample of the specified size from the elements of x using either with or without replacement.
Usage
sample(x, size, replace = FALSE, prob = NULL)
sample.int(n, size = n, replace = FALSE, prob = NULL)
Arguments
x        either a vector of one or more elements from which to choose, or a positive integer. See 'Details.'
n        a positive number, the number of items to choose from. See 'Details.'
size     a non-negative integer giving the number of items to choose.
replace  should sampling be with replacement?
prob     a vector of probability weights for obtaining the elements of the vector being sampled.
Details
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from
1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls
such as sample(x). See the examples.
Otherwise x can be any R object for which length and subsetting by integers make sense: S3 or S4 methods for
these operations will be dispatched as appropriate.
For sample the default for size is the number of items inferred from the first argument, so that sample(x)
generates a random permutation of the elements of x (or 1:x).
It is allowed to ask for size = 0 samples with n = 0 or a length-zero x, but otherwise n > 0 or positive
length(x) is required.
Non-integer positive numerical values of n or x will be truncated to the next smallest integer, which has to be
no larger than .Machine$integer.max.
The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector
being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true,
Walker's alias method (Ripley, 1987) is used when there are more than 250 reasonably probable values: this
gives results incompatible with those from R < 2.2.0, and there will be a warning the first time this happens in a
session.
If replace is false, these probabilities are applied sequentially, that is the probability of choosing the next item
is proportional to the weights amongst the remaining items. The number of nonzero weights must be at least
size in this case.
sample.int is a bare interface in which both n and size must be supplied as integers.
As from R 3.0.0, n can be larger than the largest integer of type integer, up to the largest representable integer
in type double. Only uniform sampling is supported. Two random numbers are used to ensure uniform
sampling of large integers.
Value
For sample a vector of length size with elements drawn from either x or from the integers 1:x.
For sample.int, an integer vector of length size with elements from 1:n, or a double vector if n >= 2^31.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Ripley, B. D. (1987) Stochastic Simulation. Wiley.
See Also
RNG
about random number generation.
CRAN package sampling for other methods of weighted sampling without replacement.
Examples
x <- 1:12
# a random permutation
sample(x)
# bootstrap resampling -- only if length(x) > 1 !
sample(x, replace = TRUE)
# 100 Bernoulli trials
sample(c(0,1), 100, replace = TRUE)
## More careful bootstrapping -- Consider this when using sample()
## programmatically (i.e., in your function or simulation)!
# sample()'s surprise -- example
x <- 1:10
sample(x[x >  8]) # length 2
sample(x[x >  9]) # oops -- length 10!
sample(x[x > 10]) # length 0

resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(x[x >  8]) # length 2
resample(x[x >  9]) # length 1
resample(x[x > 10]) # length 0
## R 3.x.y only
sample.int(1e10, 12, replace = TRUE)
sample.int(1e10, 12) # not that there is much chance of duplicates