exercises stk 4600, fall 2013

EXERCISES STK4600, SPRING 2015
Exercise 1, for Wednesday, February 4.
1. Assume a simple random sample (SRS), and let
$s^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y}_s)^2$, where $\bar{y}_s = \frac{1}{n} \sum_{i \in s} y_i$. Show that
$E(s^2) = \sigma^2$,
where
$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu)^2$, $\mu = \sum_{i=1}^{N} y_i / N$.
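Before proving the identity in Problem 1, it can be checked numerically on a small population. The sketch below enumerates every simple random sample of size n and compares the average of s^2 over all samples with sigma^2, using exact fractions. The population values in `y` and the choice n = 3 are illustrative assumptions, not part of the exercise.

```python
from fractions import Fraction
from itertools import combinations

def sigma2(y):
    """Population variance with divisor N - 1."""
    N = len(y)
    mu = Fraction(sum(y), N)
    return sum((v - mu) ** 2 for v in y) / (N - 1)

def expected_s2(y, n):
    """Average of s^2 over all equally likely SRS samples of size n."""
    samples = list(combinations(y, n))
    total = Fraction(0)
    for s in samples:
        ybar = Fraction(sum(s), n)
        total += sum((v - ybar) ** 2 for v in s) / (n - 1)
    return total / len(samples)

y = [3, 7, 8, 12, 20]  # hypothetical population values
print(expected_s2(y, 3), sigma2(y))
```

A numerical match on one population is only a sanity check; the exercise still asks for a general proof.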
2. Assume SRS, and let yi = 1 if unit i has characteristic A and 0 otherwise. Let p be the
proportion in the population with characteristic A. Show that
$\sigma^2 = \frac{N}{N-1} p(1-p)$.
3. Consider the example on p. 43 of lecture notes. Show that
$\mathrm{Var}(\hat{t}_1) - \mathrm{Var}(\hat{t}_2) = \frac{1}{6} y_3 (3 y_2 - 3 y_1 - y_3)$.
4. (R-exercise 1). Consider data set “trees” in R, with y = Volume
a. Make a histogram of the y-data.
b. Select a simple random sample of size n = 10 and compute an approximate 95%
confidence interval for the mean volume of the 31 trees.
c. Do 1000 simulations with sample sizes 10, 15 and 20. Are the distributions of the
sample means close to normal?
d. What is the actual coverage of the 95% confidence intervals for the cases in part c?
Exercise 2, for Wednesday, February 11
1. Assume SRS and consider the ratio estimator. Show that, with f = n/N,
$\mathrm{Cov}(\bar{x}_s, \bar{y}_s) = \frac{\sigma_{xy}}{n} (1 - f)$.
Hint: Find the variance of the sample mean of $u_i = x_i + y_i$, $i \in s$.
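The covariance formula in Problem 1 can be checked by brute force on a small population. The sketch below enumerates all SRS samples and compares the exact covariance of the two sample means with sigma_xy (1 - f) / n. It assumes sigma_xy is the population covariance with divisor N - 1, matching the definition of sigma^2 in Exercise 1; the data are the four households from Problem 3 of this exercise set, used only as an illustration.

```python
from fractions import Fraction
from itertools import combinations

def sigma_xy(x, y):
    """Population covariance with divisor N - 1 (assumed definition)."""
    N = len(x)
    mx, my = Fraction(sum(x), N), Fraction(sum(y), N)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (N - 1)

def cov_sample_means(x, y, n):
    """Exact Cov(xbar_s, ybar_s) over all equally likely SRS samples of size n."""
    samples = list(combinations(range(len(x)), n))
    xbars = [Fraction(sum(x[i] for i in s), n) for s in samples]
    ybars = [Fraction(sum(y[i] for i in s), n) for s in samples]
    ex = sum(xbars) / len(samples)
    ey = sum(ybars) / len(samples)
    return sum((a - ex) * (b - ey) for a, b in zip(xbars, ybars)) / len(samples)

x = [1, 1, 2, 3]             # adults per household
y = [210, 350, 700, 1260]    # household income, kr. 1000
n = 2
lhs = cov_sample_means(x, y, n)
rhs = sigma_xy(x, y) / n * (1 - Fraction(n, len(x)))
print(lhs, rhs)
```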
2. Show that the ratio estimator is approximately unbiased, and derive variance formula (III) on p. 61 in the
lecture notes, by using Taylor linearization for a function of two variables. That is,
we linearize $\hat{R} = \bar{y}_s / \bar{x}_s$ as a function of two variables:
$f(x, y) \approx f(a, b) + \frac{\partial f}{\partial x}\Big|_{x=a, y=b} (x - a) + \frac{\partial f}{\partial y}\Big|_{x=a, y=b} (y - b)$, with $a = \mu_x$, $b = \mu_y$,
which gives
$\hat{R} \approx R - \frac{\mu_y}{\mu_x^2} (\bar{x}_s - \mu_x) + \frac{1}{\mu_x} (\bar{y}_s - \mu_y) = R + \frac{\bar{y}_s - R \bar{x}_s}{\mu_x}$.
3. Suppose the population consists of 4 households with total income (in units of kr. 1000) of 210,
350, 700 and 1260 for households 1, 2, 3, 4 respectively. The numbers of adults in the households
are 1, 1, 2 and 3 for households 1, 2, 3, 4. We take a simple random sample of size 2 for estimating
the total income t ( = 2520). Consider the two estimators $\hat{t}_e = N \bar{y}_s$ and $\hat{t}_R = X \cdot \bar{y}_s / \bar{x}_s$, where the
auxiliary variable x_i is the number of adults in household i.
a. Find the expected value and variance (exact values) of the two estimators. Which one
would you choose?
b. Compare the exact variance of the ratio estimator with the approximate value (III) on
p. 61 in the lecture notes.
c. Compute the approximate 95% confidence intervals (assuming approximate
normality) based on the two estimators for all 6 possible samples. What are the exact
confidence levels for the two CI procedures based on t̂e and t̂R?
Derive the expected lengths of the CIs based on t̂e and t̂R.
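With only six possible samples, the exact moments in Problem 3a can be obtained by enumeration. The sketch below does this with exact fractions. It assumes that X in the ratio estimator denotes the population total of x (7 adults); the helper names `t_e`, `t_R` and `mean_var` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

y = [210, 350, 700, 1260]   # household income, kr. 1000
x = [1, 1, 2, 3]            # number of adults
N, n = 4, 2
X = sum(x)                  # assumption: X is the population total of x

def t_e(s):
    """Expansion estimator N * ybar_s."""
    return N * Fraction(sum(y[i] for i in s), n)

def t_R(s):
    """Ratio estimator X * ybar_s / xbar_s (the factors 1/n cancel)."""
    return X * Fraction(sum(y[i] for i in s), sum(x[i] for i in s))

samples = list(combinations(range(N), n))  # all 6 equally likely samples

def mean_var(est):
    """Exact expectation and variance of an estimator under SRS."""
    vals = [est(s) for s in samples]
    m = sum(vals) / len(vals)
    v = sum((t - m) ** 2 for t in vals) / len(vals)
    return m, v

for name, est in [("t_e", t_e), ("t_R", t_R)]:
    m, v = mean_var(est)
    print(name, float(m), float(v))
```

The same list of per-sample estimates can be reused for part c by attaching a confidence interval to each sample and counting how many intervals contain t = 2520.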
4. (R-exercise 2). Consider the data set “trees” in R, with y = Volume. Let x = Girth (circumference).
a. Draw a simple random sample of size 15 and estimate the mean Volume using the ratio
estimator. Compute SE and the 95% confidence interval (CI) using R. Do the same for the
sample mean based estimator and compare.
b. Find an estimate of the correct confidence level of the 95% CI based on the ratio estimator
for n =5, 10, 15 and 20, based on 10000 simulations.
Exercise 3, for Wednesday, February 18
1. In many social surveys the population consists of households. Assume the population consists of 3
households. Household 1 consists of 3 persons with a total income of kr. 640 000, while the other two
are one-person households with kr. 280 000 and kr. 360 000. We want to estimate the total income
(which we know is kr. 1 280 000) based on a sample of 2 households. The sample is drawn by selecting
2 persons at random and consists of the 2 households the persons belong to. We draw households
without replacement. For example, if the person selected first belongs to household 1, then at the
second draw it is only possible to select one of the two persons from household 2 or 3.
a. Let π1, π2, π3 be the inclusion probabilities for households 1,2,3 respectively. Show that π1
= 9/10, π2 = π3 = 11/20.
b. Compute the Horvitz-Thompson and ratio estimates for the 3 possible samples {1,2},
{1,3} and {2,3}. For the ratio estimator let xi be the number of persons in household i.
c. Show that the sampling design is: p({1,2}) = 9/20, p({1,3}) = 9/20, p({2,3}) = 2/20
d. Find expected values and variance of the two estimators (exact values). Which estimator
do you prefer?
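The two-draw selection scheme in Problem 1 can be turned into a sampling design mechanically. The sketch below sums, for each pair of households, the probabilities of the two possible draw orders, then derives the inclusion probabilities and both estimates per sample. It assumes the ratio estimator has the form X * (sum of y in s) / (sum of x in s) with X = 5 persons; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

persons = {1: 3, 2: 1, 3: 1}        # household -> number of persons
income = {1: 640, 2: 280, 3: 360}   # household -> total income, kr. 1000

def household_design(persons):
    """p(s) for unordered pairs of households under the two-draw scheme."""
    total = sum(persons.values())
    p = {}
    for first, second in permutations(persons, 2):
        # First draw: any person. Second draw: any person outside the first household.
        prob = Fraction(persons[first], total) * Fraction(
            persons[second], total - persons[first])
        key = frozenset((first, second))
        p[key] = p.get(key, Fraction(0)) + prob
    return p

design = household_design(persons)
pi = {h: sum(p for s, p in design.items() if h in s) for h in persons}

def t_ht(s):
    """Horvitz-Thompson estimate of total income."""
    return sum(Fraction(income[h]) / pi[h] for h in s)

def t_ratio(s):
    """Ratio estimate with x = number of persons (assumed form)."""
    X = sum(persons.values())
    return X * Fraction(sum(income[h] for h in s), sum(persons[h] for h in s))

for s, p in design.items():
    print(sorted(s), p, float(t_ht(s)), float(t_ratio(s)))
```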
2. Consider a population of N = 10 units with y1 = …= y9 = c and y10 = 2c. To estimate t = 11c, a
sample of one single unit is to be drawn and the inclusion probabilities are π1 = … = π9 = 0.11 and
π10 = 0.01. Hence, the unit i = 10 has the largest y-value and smallest inclusion probability.
a. Compute the values of t̂HT and t̂w (see p.81) for the different possible samples.
b. Find the expected values and variances of the two estimators in part a. Which estimator
would you choose?
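For Problem 2, the Horvitz-Thompson part can be computed directly, because a sample of one unit means p({i}) equals the inclusion probability of unit i. The sketch below works in units of c (so c = 1) and covers only the Horvitz-Thompson estimator; the second estimator is defined on p. 81 of the lecture notes and is not reproduced here.

```python
from fractions import Fraction

c = 1  # work in units of c
y = [c] * 9 + [2 * c]
pi = [Fraction(11, 100)] * 9 + [Fraction(1, 100)]

# With a single-unit sample, the HT estimate for sample {i} is y_i / pi_i.
t_ht = [Fraction(yi) / p for yi, p in zip(y, pi)]

# Exact expectation and variance under the design p({i}) = pi_i.
mean = sum(p * t for p, t in zip(pi, t_ht))
var = sum(p * (t - mean) ** 2 for p, t in zip(pi, t_ht))
print(mean, var)
```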
3. Consider the elephant example on p. 74 in the lecture notes. Show that $E(\hat{t}_{eleph}) = 6.15$ and
$\mathrm{Var}(\hat{t}_{eleph}) = 2.2275$.
4.
a. Let $Z_i = 1$ if $i \in s$, and 0 otherwise. Let πi = P(Zi = 1) and πij = P(Zi = 1, Zj = 1). Show that
Var(Zi) = πi(1 - πi) and Cov(Zi , Zj) = πij - πi πj.
b. By writing $\hat{t}_{HT} = \sum_{i=1}^{N} \frac{y_i}{\pi_i} Z_i$, show that
$\mathrm{Var}(\hat{t}_{HT}) = \sum_{i=1}^{N} \frac{1 - \pi_i}{\pi_i} y_i^2 + 2 \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} y_i y_j$.
Hint: Use the general fact that
$\mathrm{Var}\left(\sum_{i=1}^{k} X_i\right) = \sum_{i=1}^{k} \mathrm{Var}(X_i) + \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{k} \mathrm{Var}(X_i) + 2 \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{Cov}(X_i, X_j)$.
c. Assume |s| = n is fixed in advance. Show, using that
$\sum_{j=1, j \ne i}^{N} (\pi_i \pi_j - \pi_{ij}) = \pi_i (1 - \pi_i)$, that
$\mathrm{Var}(\hat{t}_{HT}) = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\pi_i \pi_j - \pi_{ij}) \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2$.
5. Let N = 4 with sampling design p({1,4}) = 0.2, p({2,4}) = 0.3 and p({3,4}) = 0.5.
a. Derive the four univariate inclusion probabilities and all bivariate inclusion probabilities.
b. Find the variance estimate of the Horvitz-Thompson estimator when s = {i,4}. What can
you conclude?
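The inclusion probabilities in Problem 5a follow from summing p(s) over the samples that contain the unit or pair in question. The sketch below does exactly that with exact fractions; the dictionary names `pi` and `pi2` are hypothetical. It stops at part a, since the variance estimator for part b is the one given in the lecture notes.

```python
from fractions import Fraction
from itertools import combinations

design = {
    frozenset({1, 4}): Fraction(2, 10),
    frozenset({2, 4}): Fraction(3, 10),
    frozenset({3, 4}): Fraction(5, 10),
}
units = [1, 2, 3, 4]

# Univariate inclusion probabilities: total probability of samples containing i.
pi = {i: sum(p for s, p in design.items() if i in s) for i in units}

# Bivariate inclusion probabilities: total probability of samples containing both i and j.
pi2 = {(i, j): sum(p for s, p in design.items() if i in s and j in s)
       for i, j in combinations(units, 2)}

print(pi)
print(pi2)
```

Pairs that never occur together get probability 0, which is worth keeping in mind when interpreting the variance estimate in part b.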
 
N 1
N
3
Exercise 4, for Wednesday, February 25
1. At one university there were 807 faculty members and research specialists in the College of
Liberal Arts and Science in 1993; the list of faculty and their publications for 1992-1993 were
available on the computer system. For each faculty member, the number of refereed
publications was recorded. The number is not directly available on the database, so the
investigator is required to examine each record separately. A simple random sample of 50
faculty members was selected. The data were as follows.
Refereed publications    0   1   2   3   4   5   6   7   8   9
Faculty members         28   4   3   4   4   2   1   0   2   2
a. Estimate the mean number of publications per faculty member, compute the standard
error and coefficient of variation.
b. Estimate the proportion of faculty members with no publications and give a 95%
confidence interval.
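The quantities in Problem 1 can be computed from the frequency table as follows. This is a sketch under stated assumptions: the standard error uses the finite population correction (1 - n/N), the variance of the proportion uses divisor n - 1, and the interval uses the normal quantile 1.96. The lecture notes may state these formulas slightly differently.

```python
import math

# publications -> number of sampled faculty members (from the table above)
counts = {0: 28, 1: 4, 2: 3, 3: 4, 4: 4, 5: 2, 6: 1, 7: 0, 8: 2, 9: 2}
N = 807
n = sum(counts.values())
f = n / N  # sampling fraction

# Part a: mean, standard error and coefficient of variation.
ybar = sum(k * m for k, m in counts.items()) / n
s2 = sum(m * (k - ybar) ** 2 for k, m in counts.items()) / (n - 1)
se = math.sqrt((1 - f) * s2 / n)
cv = se / ybar

# Part b: proportion with no publications and an approximate 95% CI.
p_hat = counts[0] / n
se_p = math.sqrt((1 - f) * p_hat * (1 - p_hat) / (n - 1))
ci = (p_hat - 1.96 * se_p, p_hat + 1.96 * se_p)

print(ybar, se, cv)
print(p_hat, ci)
```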
2. Consider the same problem as in Exercise 1. In the SRS of 50 faculty members not all the
departments were represented. The SRS contained several faculty members from psychology
and from chemistry but none from the foreign languages. It was therefore decided to take a
stratified simple random sample, using the areas biological sciences, physical sciences, social
sciences and humanities as the strata. Proportional allocation was used in this sample. The
distribution of the strata for the population and the sample is given below.
Stratum                   Number of faculty members   Number of faculty members
                          in the stratum              in the sample
1- Biological sciences    102                          7
2- Physical sciences      310                         19
3- Social sciences        217                         13
4- Humanities             178                         11
Total                     807                         50
a. Show that this is proportional allocation.
The data from the stratified sample turned out to be
Number of refereed    Number of faculty members
publications          Biological   Physical   Social   Humanities
0                     1            10         9        8
1                     2            2          0        2
2                     0            0          1        0
3                     1            1          0        1
4                     0            2          2        0
5                     2            1          0        0
6                     0            1          1        0
7                     1            0          0        0
8                     0            2          0        0
b. Estimate the total number of refereed publications by faculty members in the college
and compute the standard error.
c. How does the result from part (b) compare with the result from the SRS in Exercise 1?
d. Estimate the proportion of faculty with no refereed publications and compute the
standard error and 95% confidence interval.
e. Compare the result in part (d) with the result from the SRS in Exercise 1.
f. Did the stratification increase precision for the two estimation problems considered in
parts (b) and (d) ? Explain why you think it did or did not.
g. Assume that $\sigma_h^2 / \sigma_4^2 = 6.25$ for $h = 1, 2, 3$. Find the optimal allocation for a stratified
sample of 50 faculty members.
h. Suppose the number of faculty members with no refereed publications for the
stratified sample with optimal allocation turned out to be 1, 12, 11 and 4 for strata
1,2,3,4 respectively. Estimate the proportion of faculty with no refereed publications
and compute the standard error and 95% confidence interval.
i. Compare the result in part (h) with the results in part (d) and Exercise 1. Did optimal
allocation increase the precision? Explain why you think it did or did not.
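For part b of Problem 2, the stratified estimate of the total and its standard error can be computed from the frequency table. The sketch below assumes the usual stratified formulas: the total is the sum of N_h times the stratum sample mean, and the variance estimate is the sum of N_h^2 (1 - n_h/N_h) s_h^2 / n_h. The layout of `counts` (one row per publication count, one column per stratum) is a choice made for this illustration.

```python
import math

N_h = [102, 310, 217, 178]  # stratum sizes: biological, physical, social, humanities
# counts[k][h] = sampled faculty in stratum h with k refereed publications
counts = [
    [1, 10, 9, 8],
    [2, 2, 0, 2],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 2, 2, 0],
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 0],
]

t_hat, var_hat, n_h = 0.0, 0.0, []
for h, Nh in enumerate(N_h):
    col = [row[h] for row in counts]          # frequency column for stratum h
    nh = sum(col)
    ybar = sum(k * m for k, m in enumerate(col)) / nh
    s2 = sum(m * (k - ybar) ** 2 for k, m in enumerate(col)) / (nh - 1)
    t_hat += Nh * ybar
    var_hat += Nh ** 2 * (1 - nh / Nh) * s2 / nh
    n_h.append(nh)

se = math.sqrt(var_hat)
print(n_h, round(t_hat, 1), round(se, 1))
```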
3. A public opinion researcher has a budget of kr. 120 000 for taking a survey. She
knows that 90% of all households have known telephone numbers. Telephone
interviews cost kr. 60 per household; in-person interviews cost kr. 180 each if all
interviews are conducted in person and kr. 240 each if only households with unknown
telephone numbers are interviewed in person (because there will be extra travel costs).
Assume that the variances in phone and nonphone strata are similar and that the fixed
costs are kr. 30 000.
a. How many households should be interviewed in each stratum if all households are
interviewed in person?
b. How many households should be interviewed in each stratum if households with a
known telephone number are contacted by telephone and the other households are
contacted in person?
c. Which of the two data collection methods would you choose? Give a justification for
your answer.
Suppose that for the main question in the survey, the variance in the phone stratum is only
half the variance in the nonphone stratum.
d. Use this information to determine the number of households to be interviewed in each
stratum.
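The allocation questions in Problem 3 can be explored with the standard cost-optimal rule for a linear cost function, n_h proportional to W_h sigma_h / sqrt(c_h), scaled so that the variable costs use the budget left after fixed costs. The sketch below assumes this is the rule intended by the lecture notes, expresses the standard deviations relative to the phone stratum, and assumes part d refers to the phone/in-person design of part b. The function name `allocate` is hypothetical, and the results are left unrounded.

```python
import math

def allocate(budget, fixed, W, sigma, cost):
    """Cost-optimal stratified allocation: n_h proportional to W_h * sigma_h / sqrt(c_h)."""
    weights = [w * s / math.sqrt(c) for w, s, c in zip(W, sigma, cost)]
    # Scale so that sum(c_h * n_h) equals the budget left after fixed costs.
    k = (budget - fixed) / sum(c * a for c, a in zip(cost, weights))
    return [k * a for a in weights]

W = [0.9, 0.1]  # shares of the phone and non-phone strata

part_a = allocate(120_000, 30_000, W, [1, 1], [180, 180])
part_b = allocate(120_000, 30_000, W, [1, 1], [60, 240])
# Part d: phone-stratum variance is half the non-phone variance,
# so the non-phone standard deviation is sqrt(2) times the phone one.
part_d = allocate(120_000, 30_000, W, [1, math.sqrt(2)], [60, 240])

print(part_a, part_b, part_d)
```

With equal costs and equal variances the rule reduces to proportional allocation, which gives a quick plausibility check for part a.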
4. Assume stratified simple random sampling. Show that the stratified estimator t̂ st (see
p.95 in lecture notes) is identical to the Horvitz-Thompson estimator.
5. (R-exercise 3). Consider data set “trees” in R.
a. Stratify the population according to the variable “Girth” (circumference) into
3 strata according to the value of x = Girth, with
stratum 1: $x \le 11.0$, stratum 2: $11.1 \le x \le 14.5$ and stratum 3: $x > 14.5$.
Compute the population means for the 3 strata.
b. Select a stratified simple random sample of sizes 3, 5, 3 for the 3 strata, and compute the
stratified estimate for the average height of the trees and a 95% confidence interval.
c. Regard the sample in part b as a simple random sample. Compute the SE and the CI in
this case and compare to the results in part b.
d. Select a random sample of size 11 and do the same statistical analysis as in part b.