Exercise Sheet

Non-parametric Inference and Resampling
Exercises
by David Wozabal
(Last update: 2 June 2010)
1 Basic Facts about Rank and Order Statistics
1.1 Ten students were asked about 'the amount of time they spend surfing the Internet per month':
Student i                        1   2   3   4   5   6   7   8   9   10
Amount of time x_i (in hours)    78  58  60  82  83  85  58  72  70  58
(a) What are the corresponding values of the order statistics?
(b) Determine the rank statistics $(r_1, \dots, r_{10})$.
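The definitions in Exercise 1.1 can be checked numerically with a small sketch. The tie convention is an assumption: the value 58 occurs three times, and the code below assigns midranks (tied observations share the average of the positions they occupy). The course may use a different convention, and the helper names `order_statistics` and `midranks` are purely illustrative.

```python
# Data from Exercise 1.1 (hours per month for students 1..10).
x = [78, 58, 60, 82, 83, 85, 58, 72, 70, 58]

def order_statistics(sample):
    """Return the sample sorted in increasing order: x_(1) <= ... <= x_(n)."""
    return sorted(sample)

def midranks(sample):
    """Rank of each observation; tied values share the average of their positions.

    Assumption: midranks are used for ties (other conventions exist).
    """
    ordered = sorted(sample)
    ranks = []
    for value in sample:
        first = ordered.index(value) + 1   # smallest 1-based position of this value
        count = ordered.count(value)       # how many observations share this value
        ranks.append(first + (count - 1) / 2)
    return ranks

print(order_statistics(x))  # [58, 58, 58, 60, 70, 72, 78, 82, 83, 85]
print(midranks(x))          # [7.0, 2.0, 4.0, 8.0, 9.0, 10.0, 2.0, 6.0, 5.0, 2.0]
```

With midranks the ranks still sum to n(n+1)/2 = 55, which is a quick sanity check.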
1.2 Suppose that we have a sample of size $n$ in the setting of Exercise 1.1. In terms of the order statistics,
which statistic would you use to study each of the following?
(a) the least amount of time spent
(b) the highest amount of time spent
(c) the median of the amount of time spent
Hint: Distinguish between the cases where $n$ is odd and where $n$ is even.
(d) the range of the amount of time spent
(e) Determine the values of the statistics discussed in (a)-(d) based on the data in Exercise 1.1.
1.3 The empirical distribution function $\hat F_n$ corresponding to the sample $x_1, \dots, x_n$ is given by
$$\hat F_n(t) = \frac{1}{n} \cdot (\text{number of samples not exceeding } t).$$
(a) Calculate (and draw the graph of) the empirical distribution function for Exercise 1.1.
(b) Can you express the empirical distribution function in terms of the order statistics?
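The definition of the empirical distribution function translates almost directly into code. The following is a minimal sketch, not a reference solution: it counts the samples not exceeding t by a binary search in the sorted sample. The function name `ecdf` is an illustrative choice.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical distribution function t -> #{i : x_i <= t} / n."""
    ordered = sorted(sample)          # the order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)

    def F_hat(t):
        # bisect_right gives the number of sorted values that are <= t.
        return bisect_right(ordered, t) / n

    return F_hat

# Data from Exercise 1.1.
x = [78, 58, 60, 82, 83, 85, 58, 72, 70, 58]
F_hat = ecdf(x)
for t in (57, 58, 70, 85):
    print(t, F_hat(t))
# 57 0.0
# 58 0.3
# 70 0.5
# 85 1.0
```

The function is a right-continuous step function that jumps at the observed values; the jump at 58 has height 3/10 because of the three tied observations.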
1.4 Let $X = (X_1, \dots, X_n)$ be a random vector with a continuous density such that the rank vector $R = (R_1, \dots, R_n)$ is uniformly
distributed on $S_n$ (the permutations of $n$ elements). Show that
(a) $P(R_i = k) = \frac{1}{n}$, $1 \le i, k \le n$
(b) $P(R_i = k, R_j = h) = \frac{1}{n(n-1)}$, $i \ne j$, $k \ne h$
(c) $P(R_i = R_j) = 0$, $1 \le i \ne j \le n$.
1.5 Let $X_1, \dots, X_n$ be i.i.d. random variables. Show that
$$f_{X_{(1)}, X_{(n)}}(x, y) = n(n-1)\,(F(y) - F(x))^{n-2} f(x) f(y), \qquad -\infty < x \le y < \infty.$$
Hint: Use the fact that
$$P(X_{(1)} \le x, X_{(n)} \le y) = P(X_{(n)} \le y) - P(X_{(1)} > x, X_{(n)} \le y)$$
to obtain the joint distribution of $(X_{(1)}, X_{(n)})$.
1.6 Show that the density of the midrange $V_n = (X_{(1)} + X_{(n)})/2$ for an i.i.d. sample $X_1, \dots, X_n$ from an
absolutely continuous distribution $F$ with density $f$ is given by
$$f_{V_n}(v) = 2n(n-1) \int_{-\infty}^{v} \{F(2v - x_1) - F(x_1)\}^{n-2} f(x_1) f(2v - x_1)\, dx_1, \qquad -\infty < v < \infty.$$
Hint: Use the joint density of $(X_{(1)}, X_{(n)})$ from 1.5 to calculate the joint density of $(X_{(1)}, V_n)$ by
finding a transformation $T: \mathbb{R}^2 \to \mathbb{R}^2$ such that $T(X_{(1)}, X_{(n)}) = (X_{(1)}, V_n)$ and noting that, by
the change of variable formula,
$$P(T(X) \in A) = P(X \in T^{-1}(A)) = \int_{x \in T^{-1}(A)} f_X(x)\, dx = \int_{x \in A} f_X(T^{-1}(x))\, |J_{T^{-1}}(x)|\, dx,$$
where $|J_{T^{-1}}|$ is the absolute value of the determinant of the Jacobian of $T^{-1}$. Now integrate out the dependence on $x_{(1)}$.
1.7 Calculate the explicit form of the density of the midrange $V_n$ when $X_i \sim F = U(0, 1)$ for all
$1 \le i \le n$.
1.8 For an i.i.d. sample X1 , . . . , Xn , calculate the density of the order statistic X(i) for 1 ≤ i ≤ n.
1.9 Using Exercise 1.8, calculate the density of $U_{(i)}$ for $U_i$ i.i.d. with $U_i \sim U(0, 1)$ for all $1 \le i \le n$.
1.10 For the $U_1, \dots, U_n$ defined in Exercise 1.9, prove that
$$E(U_{(i)}) = \frac{i}{n+1}.$$
Hint: Use the following result (Beta function):
$$B(x, y) = \int_0^1 t^{x-1} (1-t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)},$$
where $\Gamma$ is the Gamma function.
1.11 For the $U_1, \dots, U_n$ defined in Exercise 1.9, prove that
$$\operatorname{Var}(U_{(i)}) = \frac{i(n-i+1)}{(n+1)^2 (n+2)}.$$
Hint: Again use the Beta function and its relation to the Gamma function.
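Before proving the moment formulas in 1.10 and 1.11, it can help to see them hold approximately in a simulation. The sketch below is only a Monte Carlo sanity check, not a proof; the function name, the sample size n = 5, the number of repetitions, and the seed are arbitrary illustrative choices.

```python
import random

def order_statistic_moments(n, reps, seed=0):
    """Monte Carlo mean and variance of each U_(i) for i.i.d. U(0,1) samples of size n."""
    rng = random.Random(seed)
    sums = [0.0] * n
    sums_sq = [0.0] * n
    for _ in range(reps):
        ordered = sorted(rng.random() for _ in range(n))
        for i, u in enumerate(ordered):
            sums[i] += u
            sums_sq[i] += u * u
    means = [s / reps for s in sums]
    variances = [sq / reps - m * m for sq, m in zip(sums_sq, means)]
    return means, variances

n = 5
means, variances = order_statistic_moments(n, reps=20000)
for i in range(1, n + 1):
    exact_mean = i / (n + 1)
    exact_var = i * (n - i + 1) / ((n + 1) ** 2 * (n + 2))
    # Columns: i, simulated mean, exact mean, simulated variance, exact variance.
    print(i, round(means[i - 1], 4), round(exact_mean, 4),
          round(variances[i - 1], 4), round(exact_var, 4))
```

The simulated and exact columns should agree up to Monte Carlo error, which shrinks roughly like one over the square root of the number of repetitions.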
1.12 Show that in the i.i.d. case the joint density of two arbitrary order statistics $(X_{(i)}, X_{(j)})$ with $1 \le i < j \le n$
is given by (for $-\infty < x_i < x_j < \infty$)
$$f_{X_{(i)}, X_{(j)}}(x_i, x_j) = \frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\, F(x_i)^{i-1} (F(x_j) - F(x_i))^{j-i-1} (1 - F(x_j))^{n-j} f(x_i) f(x_j).$$
Hint: Use the formula for the joint density $\bar f$ of all the order statistics (why does the last equality
hold?)
$$\bar f(x_{(1)}, \dots, x_{(n)}) = \sum_{r \in S_n} f(x_{(r_1)}, \dots, x_{(r_n)}) = n!\, f(x_{(1)}, \dots, x_{(n)})$$
and integrate out the order statistics $x_{(k)}$ with $k \ne i$, $k \ne j$.
1.13 Let $U_1, \dots, U_n$ be an i.i.d. sample from the standard uniform distribution $U(0, 1)$. Use 1.12 and 1.8 to
show that the random variables $U_{(i)}/U_{(j)}$ and $U_{(j)}$, for $1 \le i < j \le n$, are statistically independent,
and calculate the respective distribution functions.
2 Simple Non-Parametric Tests
2.1 Use the sign test and the following i.i.d. sample
1.1232, 1.2990, 1.4409, −0.4921, 0.7120, 0.7379, −0.5078, −0.2420, 1.5823, 0.3685
to test the null hypothesis that the median $\phi$ of the underlying distribution equals 0 against each of the
alternatives $H_1: \phi \ne 0$ and $H_1: \phi > 0$ (at the level $\alpha = 0.05$).
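The mechanics of the sign test in 2.1 can be sketched as follows. This is a minimal illustration under the usual assumption that, under the null hypothesis, the number of positive observations is Binomial(n, 1/2); observations equal to the hypothesized median are dropped, and the two-sided p-value is taken as twice the smaller tail, which is one common convention among several. The function name `sign_test` is illustrative.

```python
from math import comb

sample = [1.1232, 1.2990, 1.4409, -0.4921, 0.7120,
          0.7379, -0.5078, -0.2420, 1.5823, 0.3685]

def sign_test(sample, median0=0.0):
    """Exact sign test of H0: median = median0.

    Returns (number of positive signs, effective n,
             p-value for H1: median > median0, two-sided p-value).
    """
    diffs = [x - median0 for x in sample if x != median0]  # drop ties with median0
    n = len(diffs)
    s = sum(d > 0 for d in diffs)

    def upper_tail(k):
        # P(S >= k) for S ~ Binomial(n, 1/2).
        return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

    p_greater = upper_tail(s)
    p_two_sided = min(1.0, 2 * min(upper_tail(s), upper_tail(n - s)))
    return s, n, p_greater, p_two_sided

s, n, p_greater, p_two_sided = sign_test(sample)
print(s, n, p_greater, p_two_sided)  # 7 10 0.171875 0.34375
```

Each p-value is then compared with the level 0.05 to decide whether the corresponding alternative is accepted.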
2.2 Repeat the test in 2.1 with the Wilcoxon test.
2.3 Let $X_1, \dots, X_{48}$ be a random sample from a distribution with distribution function $F$. To test
$H_0: F(41) = \frac{1}{4}$ against $H_1: F(41) < \frac{1}{4}$, use the statistic $Y$, which is the number of samples
less than or equal to 41. If the observed value of $Y$ is smaller than 7, reject $H_0$ and accept $H_1$. If
$p = F(41)$, find the power function $K(p)$, $0 < p < \frac{1}{4}$, of the test. Approximate $\alpha = K(\frac{1}{4})$.
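A numerical sketch can be used to check a closed-form answer to 2.3. It assumes that Y follows a Binomial(48, p) distribution with p = F(41) and reads "smaller than 7" as Y ≤ 6, so that K(p) = P(Y ≤ 6). The function name `power` and the grid of p values are illustrative choices.

```python
from math import comb

def power(p, n=48, cutoff=7):
    """K(p) = P(Y < cutoff) for Y ~ Binomial(n, p), i.e. the rejection probability."""
    return sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(cutoff))

# Tabulate the power function on a small grid; p = 0.25 gives the level alpha.
for p in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(p, round(power(p), 4))
```

The power decreases as p approaches 1/4 from below, and the value at p = 1/4 is the exact level that the exercise asks to approximate.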
2.4 A statistician decides that working with numbers is a tedious job and designs a universal test, which
rejects the null hypothesis in favor of the alternative hypothesis if a coin toss (whose outcome is
independent of the testing problem in question) yields heads. What is the power of this test?
2.5 Derive the probability generating function of $V = \sum_{i=1}^{n} \operatorname{sign}(X_i) R_i$, where $(X_1, \dots, X_n)$ is an i.i.d.
sample from a distribution which fulfills the assumptions of the signed rank (Wilcoxon) test and
$(R_1, \dots, R_n)$ are the ranks of the sample $(|X_1|, \dots, |X_n|)$.
2.6 Write a computer program that calculates the pdf of $V$ from 2.5 for different values of $n$.
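One possible approach to 2.6 is sketched below. It assumes the null hypothesis of a continuous distribution symmetric about 0, so that there are no ties or zeros and each rank j = 1, ..., n carries a sign +1 or −1 with probability 1/2, independently of the other ranks. Under that assumption the distribution of V can be built up one rank at a time, which mirrors multiplying out the factors of the generating function. The function name `signed_rank_pmf` is illustrative.

```python
from collections import Counter

def signed_rank_pmf(n):
    """Exact null distribution of V = sum_i sign(X_i) R_i as a dict {value: probability}.

    Assumption: each rank 1..n receives sign +1 or -1 with probability 1/2,
    independently (continuous distribution symmetric about 0, no ties).
    """
    counts = Counter({0: 1})                 # number of sign patterns giving each sum
    for j in range(1, n + 1):
        updated = Counter()
        for value, count in counts.items():
            updated[value + j] += count      # rank j gets a positive sign
            updated[value - j] += count      # rank j gets a negative sign
        counts = updated
    total = 2 ** n                           # all sign patterns are equally likely
    return {value: count / total for value, count in sorted(counts.items())}

print(signed_rank_pmf(3))
# {-6: 0.125, -4: 0.125, -2: 0.125, 0: 0.25, 2: 0.125, 4: 0.125, 6: 0.125}
```

The support runs from −n(n+1)/2 to n(n+1)/2 in steps of 2, and the distribution is symmetric about 0.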
2.7 Write a computer program that draws $m$ samples of size $n$ from
(a) a normal distribution with mean $\mu$ and $\sigma = 1$;
(b) a shifted t-distribution with $\nu$ degrees of freedom and mean $\mu$;
(c) a uniform distribution on $[\mu - c, \mu + c]$;
(d) an exponential distribution with mean 1;
(e) a distribution of your choice.
Use the samples generated above to assess the power of the
(a) t-test
(b) sign test
(c) Wilcoxon test
(d) Mann-Whitney-U test
when testing $H_0: \mu = 0$ against $H_{1a}: \mu \ne 0$ and $H_{1b}: \mu > 0$ for different values of $\mu$ and the above
distributions. Plot the empirically found power of the tests as a function of the true mean $\mu$. How
does the power depend on $n$ and $\alpha$? Generate meaningful plots to illustrate your results.
2.8 A generalized Wilcoxon statistic can be defined by $W = \sum_{i=1}^{n} \operatorname{sign}(X_i) d_i(X_i)$, where $d_i(X_i) = c_{R_i}$
for some $c_1 < c_2 < \dots < c_n$ and the ranks $R_i$ of $|X_i|$. Calculate the mean and the variance of the
so-called binary statistic $W_g = \sum_{i=1}^{n} \operatorname{sign}(X_i) d_i(X_i)$, where $c_j = 2^j$.
2.9 Check whether the test statistic $W_g = \sum_{i=1}^{n} \operatorname{sign}(X_i) d_i(X_i)$ from 2.8 is asymptotically normally distributed by checking whether the Liapounov condition
$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} E(|V_i - \mu_i|^3)}{\left(\sum_{i=1}^{n} \sigma_i^2\right)^{3/2}} = 0$$
is satisfied for $V_i = \operatorname{sign}(X_i) d_i(X_i)$ with mean $\mu_i$ and variance $\sigma_i^2$.
2.10 Show that if $d_i(X_i) \equiv 1$, then the generalized Wilcoxon test based on $W = \sum_{i=1}^{n} \operatorname{sign}(X_i) d_i(X_i)$ is
equivalent to the sign test.
2.11 Using the code from 2.7 empirically approximate the asymptotic relative efficiency of the tests in 2.7
for the case that the real data is distributed according to the distributions listed in 2.7. Which test
is asymptotically the best for which family of distributions? Generate meaningful plots to illustrate
your results.
2.12 Use the asymptotic distribution of the Wilcoxon statistic to evaluate the efficacy of the Wilcoxon
test statistic where the distribution $F_X$ of the data generating process $X$ is given by
(a) $X \sim N(\mu_X, \sigma^2)$;
(b) $X \sim U(-0.5, 0.5)$;
(c) $X$ is double exponential, i.e.
$$F_X(x) = \begin{cases} (1/2)\, e^{\lambda x}, & x \le 0 \\ 1 - (1/2)\, e^{-\lambda x}, & x > 0. \end{cases}$$
Hint: Use the fact that for $W_N = \sum_{i=1}^{N} \operatorname{sign}(X_i) \operatorname{rank}(|X_i|)$,
$$\operatorname{Var}(W_N)\big|_{\theta=0} = \frac{(2N+1)(N+1)N}{6}$$
and
$$E(W_N) = 2\left[N p_1 + \frac{N(N-1)}{2}\, p_2\right] - \frac{N(N+1)}{2},$$
where $p_1 = P(X_i > 0) = 1 - F_X(2\theta)$ and
$$p_2 = P(X_i + X_j > 0) = \int_{-\infty}^{\infty} F_X(x + 2\theta)\, dF_X(x).$$
Use the above to show that the efficacy is given by
$$\frac{24 N (N-1)^2}{(N+1)(2N+1)} \left[\frac{-2 f_X(0)}{N-1} + I\right]^2,$$
where $I = \int_{-\infty}^{\infty} f_X^2(x)\, dx$.
2.13 Calculate the efficacy of the Student’s t test statistic in cases (b) and (c) of the above problem.
2.14 Use the answers to Problems 2.12 and 2.13 to verify the following results for the asymptotic relative
efficiency of the Wilcoxon signed-rank test relative to Student's t test:
Normal    Uniform    Double exponential
3/π       1          3/2
3 Estimation of CDFs and plug-in estimates
3.1 Generate 100 observations from a N (0, 1) distribution. Compute a 95 percent confidence band for
the CDF based on the sample using the results covered in the course. Repeat this 100 times and see
how often the confidence band contains the true distribution function. Repeat the experiment using
data from a Cauchy distribution.
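The experiment in 3.1 can be set up along the following lines. The sketch assumes that "the results covered in the course" refers to the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which gives the band $\hat F_n(x) \pm \epsilon_n$ with $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$; if the course uses a different band, only `dkw_halfwidth` needs to change. All function names and the seed are illustrative.

```python
import math
import random

def dkw_halfwidth(n, alpha=0.05):
    """Half-width eps of the DKW band: P(sup_x |F_n(x) - F(x)| > eps) <= alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def sup_distance(sample, cdf):
    """sup_x |F_n(x) - F(x)| for a continuous cdf, attained at the jumps of F_n."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(ordered))

def coverage(draw, cdf, n=100, reps=100, alpha=0.05, seed=0):
    """Fraction of repetitions in which the band contains the true cdf everywhere."""
    rng = random.Random(seed)
    eps = dkw_halfwidth(n, alpha)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        # The band covers F at every x exactly when the sup distance is <= eps.
        hits += sup_distance(sample, cdf) <= eps
    return hits / reps

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cauchy_cdf(x):
    return 0.5 + math.atan(x) / math.pi

def draw_normal(rng):
    return rng.gauss(0, 1)

def draw_cauchy(rng):
    # Inverse-cdf sampling for the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

print("N(0,1) coverage:", coverage(draw_normal, normal_cdf))
print("Cauchy coverage:", coverage(draw_cauchy, cauchy_cdf))
```

Because the DKW bound does not depend on the underlying distribution, the observed coverage should be similar for the normal and the Cauchy case, at or slightly above 95 percent up to simulation noise.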
3.2 Let $X_1, \dots, X_n \sim \text{Bernoulli}(p)$. Find the plug-in estimator and estimated standard error of $p$. Find
an approximate 95% confidence interval for $p$.
3.3 Let $X_1, \dots, X_n \sim F$ and let $\hat F_n$ be the empirical distribution function. Let $a < b$ be fixed numbers
and define $\theta = T(F) = F(b) - F(a)$. Let $\hat\theta = T(\hat F_n) = \hat F_n(b) - \hat F_n(a)$. Find the influence function and
the estimated standard error of $\hat\theta$. Find an expression for an approximate $1 - \alpha$ confidence interval
for $\theta$.
3.4 Let $X_1, \dots, X_n \sim F$ and let $\hat F_n(x)$ be the empirical distribution function. For a fixed $x$, find the
limiting distribution of $\sqrt{\hat F_n(x)}$.
3.5 Let x and y be two distinct points. Find Cov(F̂n (x), F̂n (y)).
3.6 Download the data on magnitudes of earthquakes near Fiji from
http://www.stat.cmu.edu/~larry/all-of-nonpar/data.html.
Estimate the CDF F(x). Compute and plot a 95% confidence envelope for F. Find an approximate
95% confidence interval for $F(4.9) - F(4.3)$.
3.7 Suppose that
$$|T(F) - T(G)| \le C \sup_x |F(x) - G(x)| \qquad (1)$$
for some $C > 0$. Prove that $T(\hat F_n) \xrightarrow{\text{a.s.}} T(F)$. Suppose that $|X| \le M < \infty$ and show that $T(F) = \int x\, dF(x)$ fulfills (1).