Stata Walkthrough #3: Sampling Distributions
We are going to create a bunch of simulated random samples. Here’s the scenario: X
is a random variable, which (in truth) is distributed with mean zero and standard
deviation of one. Each of our samples will consist of ten observations of X, which we
will call x1, x2, x3, x4, x5, x6, x7, x8, x9, and x10. We will create 1000
samples like this. Keep in mind that for this experiment, each “observation” in Stata
is a different sample. In reality, as a researcher, you would have only one of these
samples. However, what we are trying to see is what would happen if a thousand
researchers each collected one random sample, and each researcher created his
own estimate of the mean and a confidence interval for this estimate. [1]
Just as a note, any command listed in italics is something that I will not expect you to
commit to memory.
First, create your 1000 samples:
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)
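If you want your numbers to be reproducible from one run to the next, a minimal sketch along these lines may help. The seed value is arbitrary and is my own assumption, not part of the walkthrough. With no means() or sds() options, drawnorm draws independent variables with mean zero and standard deviation one.

```stata
* Start from an empty dataset and fix the random-number seed
* (12345 is an arbitrary, illustrative choice).
clear
set seed 12345

* Draw 1000 rows of ten independent standard normal variables.
* Each row plays the role of one researcher's sample.
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)

* Confirm the size of the dataset and peek at the first three samples.
count
list x1-x10 in 1/3
```

Rerunning these lines with the same seed should give the same 1000 samples each time.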
Each researcher has been asked to come up with an estimate of the mean and a
confidence interval for that estimate. You can calculate the mean of each
researcher’s sample:
gen xbar = (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)/10
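As a sanity check on the hand-typed sum, the sketch below compares it with Stata's egen function rowmean(), which averages across variables within each row. The variable names xbar_check and diff and the seed are illustrative assumptions.

```stata
* Rebuild the simulated samples (arbitrary seed, for reproducibility only).
clear
set seed 12345
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)

* The walkthrough's formula for each researcher's sample mean.
gen xbar = (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)/10

* Same quantity computed by egen across the ten variables in each row.
egen xbar_check = rowmean(x1-x10)

* The two versions should agree up to floating-point rounding.
gen diff = abs(xbar - xbar_check)
summarize diff
```

If the maximum of diff is not essentially zero, a variable was probably mistyped in the sum.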
You can look at the distribution of xbar. Almost never is it exactly zero:
list xbar
However, theory tells us that the average of these many sample means should be
close to zero: $\mu_{\bar{X}} = \mu_X$. We can check that this is true:
summ xbar
(Footnote 1: This type of experiment is often called a “Monte Carlo” simulation, and we use it to
verify that statistical theory is correct. In this case, the prediction is that the 95%
confidence interval will contain the true mean 95% of the time. In a Monte Carlo
simulation we create a bunch of random samples, where we know the true value of
the parameters. We then estimate this parameter for each of the samples, and see
whether the prediction is true for these samples.)
The average across these 1000 samples might not be exactly zero, but it should be
pretty darned close. Also, we know that the standard deviation of xbar should be
$\sigma_{\bar{X}} = \sigma_X / \sqrt{N} = 1 / \sqrt{10} = 0.3162$. This should also match closely with your standard
deviation of xbar. Here’s what I got for my 1000 researchers:
. summ xbar

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        xbar |      1000    .0082488    .3030376  -1.314279   .9897205
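To put the simulated numbers next to the theoretical ones, a sketch like the following may be useful. It relies on the fact that summarize leaves its results in r(mean) and r(sd). The scalar names mean_xbar and sd_xbar and the seed are assumptions for illustration.

```stata
* Rebuild the simulated samples and sample means (arbitrary seed).
clear
set seed 12345
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)
gen xbar = (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)/10

* summarize stores the mean and standard deviation in r().
quietly summarize xbar
scalar mean_xbar = r(mean)
scalar sd_xbar = r(sd)

* Compare the simulation with the theoretical values 0 and 1/sqrt(10).
display "mean of xbar (theory: 0)        = " mean_xbar
display "s.d. of xbar                    = " sd_xbar
display "theoretical s.d., 1/sqrt(10)    = " 1/sqrt(10)
```

The simulated values will differ a little from the theoretical ones, since 1000 samples is a large but finite number.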
Next, each researcher will calculate a 95% confidence interval for xbar, which is
$\bar{X} \pm t_{0.05/2} \cdot (s_X / \sqrt{N - 1})$, according to the formula on p. 236 of the textbook. (Why is
it $t_{0.05/2}$? We want the 95% confidence interval, and $0.05 = (100 - 95)/100$.)
In order to calculate this interval, each researcher needs to know $s_X^2$ and $s_X$ for
his sample:
gen varx = ( (x1-xbar)^2 + (x2-xbar)^2 + (x3-xbar)^2 + (x4-xbar)^2 + (x5-xbar)^2 + (x6-xbar)^2 + (x7-xbar)^2 + (x8-xbar)^2 + (x9-xbar)^2 + (x10-xbar)^2 )/(9)
gen sx = varx^0.5
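The long sum of squared deviations is easy to mistype, so here is a sketch of one way to check it. It uses egen's rowsd(), which computes the standard deviation across variables within each row with the same N - 1 divisor. The names sx_check and sd_diff and the seed are illustrative assumptions.

```stata
* Rebuild the simulated samples and sample means (arbitrary seed).
clear
set seed 12345
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)
gen xbar = (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)/10

* The walkthrough's sample variance and standard deviation (divisor 9).
gen varx = ( (x1-xbar)^2 + (x2-xbar)^2 + (x3-xbar)^2 + (x4-xbar)^2 + (x5-xbar)^2 + (x6-xbar)^2 + (x7-xbar)^2 + (x8-xbar)^2 + (x9-xbar)^2 + (x10-xbar)^2 )/(9)
gen sx = varx^0.5

* Row-wise standard deviation computed by egen, for comparison.
egen sx_check = rowsd(x1-x10)

* The two versions should agree up to floating-point rounding.
gen sd_diff = abs(sx - sx_check)
summarize sd_diff
```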
(Remember not to confuse $s_X$ and $\sigma_{\bar{X}}$; they are both standard deviations, but of
different random variables.)
When we have $N - 1 = 9$ “degrees of freedom” in our sample, the critical value of
$t_{0.025}$ is 2.262, according to the table in Appendix II. You can also get Stata to show
you this value:
display invttail(9,0.025)
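If you would rather not copy the critical value from a table, a small sketch like this stores it in a scalar and checks it in the other direction with ttail(), which returns the upper-tail probability for a given t value. The scalar name tcrit is an illustrative assumption.

```stata
* Critical value with 9 degrees of freedom and 0.025 in the upper tail.
scalar tcrit = invttail(9, 0.025)
display %6.3f tcrit

* Round trip: the upper-tail probability at tcrit should come back as 0.025.
display ttail(9, tcrit)
```

The first argument of invttail() is the degrees of freedom and the second is the upper-tail probability, which is why 0.025 appears rather than 0.05.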
Anyhow, the final task is for each researcher to form a confidence interval. He will
set the lower and upper bounds as:
gen lowerx = xbar - 2.262 * sx/9^0.5
gen upperx = xbar + 2.262 * sx/9^0.5
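The same bounds can be built from the computed critical value instead of the rounded 2.262. This is only a sketch. It uses egen's rowmean() and rowsd() as shortcuts for the xbar and sx formulas in the walkthrough, and the scalar name tcrit and the seed are assumptions.

```stata
* Rebuild the simulated samples (arbitrary seed).
clear
set seed 12345
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)

* Shortcuts for the walkthrough's xbar and sx (rowsd divides by N - 1 = 9).
egen xbar = rowmean(x1-x10)
egen sx = rowsd(x1-x10)

* Critical value from Stata rather than from the table.
scalar tcrit = invttail(9, 0.025)

* Bounds following the walkthrough's formula, sx divided by sqrt(N - 1).
gen lowerx = xbar - tcrit * sx/sqrt(9)
gen upperx = xbar + tcrit * sx/sqrt(9)

* Look at the first five researchers' intervals.
list xbar lowerx upperx in 1/5
```

One thing worth checking against the textbook's definition of $s_X$: when the standard deviation already uses the N - 1 divisor, as sx does here, many texts write the interval with $s_X / \sqrt{N}$ instead. The sketch keeps the walkthrough's version, which gives slightly wider intervals.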
This is the 95% confidence interval. This (supposedly) means that for 95% of these
samples, the true average (zero) will lie within this interval. Let’s see if this is true.
We’ll create a dummy variable that equals one if the researcher’s confidence interval
contains the true value:
gen containszero = (lowerx < 0 & 0 < upperx)
Now let’s look at the distribution of this variable:
tab containszero
You should find that about 95% of researchers gave confidence intervals that did, in
fact, contain the true mean (zero) inside them. In my database:
containszer |
          o |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         48        4.80        4.80
          1 |        952       95.20      100.00
------------+-----------------------------------
      Total |      1,000      100.00
You might have slightly different percentages, but approximately 95% of your
researchers should have gotten it right.
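To get a feel for how far from 95% your own percentage might land, the sketch below reruns the whole experiment and prints the share of intervals containing zero next to a rough simulation standard error for a proportion near 0.95 over 1000 samples. The egen shortcuts, the scalar name coverage, and the seed are assumptions for illustration.

```stata
* Rebuild the simulated samples (arbitrary seed).
clear
set seed 12345
drawnorm x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, n(1000)

* Shortcuts for the walkthrough's xbar and sx.
egen xbar = rowmean(x1-x10)
egen sx = rowsd(x1-x10)

* Confidence bounds and the indicator, as in the walkthrough.
gen lowerx = xbar - 2.262 * sx/9^0.5
gen upperx = xbar + 2.262 * sx/9^0.5
gen containszero = (lowerx < 0 & 0 < upperx)

* The mean of a 0/1 variable is the share of ones.
quietly summarize containszero
scalar coverage = r(mean)
display "share of intervals containing zero = " coverage

* Rough simulation standard error of that share, sqrt(p(1-p)/1000) at p = 0.95.
display "approximate simulation s.e.        = " sqrt(0.95*0.05/1000)
```

A share within a percentage point or two of 0.95 is in line with what this much simulation noise would produce.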