MSB
(ST217: Mathematical Statistics B)
Aim
To review, expand & apply the ideas from MSA.
In particular, MSA mainly studied one unknown quantity at once. In MSB we’ll study interrelationships.
Lectures & Classes
Monday     12–1   R0.21
Wednesday  10–11  R0.21
Thursday   1–2    PLT
Examples classes will begin in week 3.
Style
• Lectures will be supplemented (NOT replaced!!) with printed notes.
Please take care of these notes—duplicates may not be readily available.
• I shall teach mainly by posing problems (both theoretical and applied) and working through them.
Contents
1. Overview of MSA.
2. Bivariate & Multivariate Probability Distributions.
Joint distributions, conditional distributions, marginal distributions; conditional expectation. The
χ2 , t, F and multivariate Normal distributions and their interrelationships.
3. Inference for Multiparameter Models.
Likelihood, frequentist and Bayesian inference, prediction and decision-making. Comparison between
various approaches. Point and interval estimation. Classical simple and composite hypothesis testing,
likelihood ratio tests, asymptotic results.
4. Linear Statistical Models.
Linear regression, multiple regression & analysis of variance models. Model choice, model checking
and residuals.
5. Further Topics (time permitting).
Nonlinear models, problems & paradoxes, etc.
Books
The books recommended for MSA are also useful for MSB. Excellent books on mathematical statistics are:
1. ‘Statistical Inference’ by George Casella & Roger L. Berger [C&B], Duxbury Press (1990),
2. ‘Probability and Statistics’ by Morris DeGroot, Addison-Wesley (2nd edition 1989).
A good book discussing the application and interpretation of statistical methods is ‘Introduction to the
Practice of Statistics’ by Moore & McCabe [M&M], Freeman (3rd edition 1998). Many of the data sets
considered below come from the ‘Handbook of Small Data Sets’ [HSDS] by Hand et al., Chapman & Hall,
London (1994).
There are many other useful references on mathematical statistics available in the library, including books
by Hogg & Craig [H&C], Lindgren, Mood, Graybill & Boes [MG&B], and Rice.
These notes are copyright © 1998, 1999, 2000, 2001 by J. E. H. Shaw
Chapter 1
Overview of MSA
1.1 Basic Ideas

1.1.1 What is 'Statistics'?
Statistics may be defined as:
‘The study of how information should be employed to reflect on, and give guidance for action
in, a practical situation involving uncertainty.’ [italics by JEHS]
Vic Barnett, Comparative Statistical Inference
Figure 1.1: A practical situation involving uncertainty
1.1.2 Statistical Modelling
The emphasis of modern statistics is on modelling the patterns and interrelationships in the existing data,
and then applying the chosen model(s) to predict future data.
Typically there is a measurable response (for example, reduction Y in patient’s blood pressure) that is
thought to be related to explanatory variables xj (for example, treatment applied, dose, patient’s age,
weight, etc.) We seek a formula that relates the observed responses to the corresponding explanatory
variables, and that can be used to predict future responses in terms of their corresponding explanatory
variables:
    Observed Response = Fitted Value + Residual,
    Future Response = Predicted Value + Error.
Here the fitted values should take account of all the consistent patterns in the data, and the residuals
represent the remaining random variation.
1.1.3 Prediction and Decision-Making
Always remember that the main aim in modelling as above is to predict (for example) the effects of different
medical treatments, and hence to decide which treatment to use, and in what circumstances.
The fundamental assumption is that the future data will be in some sense similar to existing data. The
ideas of exchangeability and conditional independence are crucial.
The following notation is useful:
X ⊥⊥ Y      'X is independent of Y', i.e. Y gives you no information about X;
X ⊥⊥ Y | Z  'X is conditionally independent of Y given Z', i.e. if you know the value taken by the RV Z, then Y gives you no further information about X.
Most methods of statistical inference proceed indirectly from what we know (the observed data and any
other relevant information) to what we really want to know (future, as yet unobserved, data), by assuming
that the random variation in the observed data can be thought of as a sample from an underlying population,
and learning about the properties of this population.
1.1.4 Known and Unknown Features of a Statistical Problem
A statistic is a property of a sample, whereas a parameter is a property of a population. Often it’s natural
to estimate a parameter θ (such as the population mean µ) by the corresponding property of the sample
(here the sample mean X). Note that θ may be a vector or more complicated object.
Unobserved quantities are treated mathematically as random variables. Potentially observable quantities are usually denoted by capital letters (Xi, X, Y etc.) Once the data have been observed, the values taken by these random variables are known (Xi = xi, X = x etc.) Unobservable or hypothetical quantities are usually denoted by Greek letters (θ, µ, σ² etc.), and estimators are often denoted by putting a hat on the corresponding symbol (θ̂, µ̂, σ̂² etc.)
Nearly all statistics books use the above style of notation, so it will be adopted in these notes. However, sometimes I shall wish to distinguish carefully between knowns and unknowns, and shall denote all
unknowns by capitals. Thus Θ represents an unknown parameter vector, and θ represents a particular
assumed value of Θ. This is especially useful when considering probability distributions for parameters;
one can then write fΘ (θ) and Pr(Θ = θ) by exact analogy with fX (x) and Pr(X = x).
The set of possible values for a RV X is called its sample space ΩX . Similarly the parameter space ΩΘ is
the set of possible values for the parameter Θ.
1.1.5 Likelihood
In general, we can infer properties θ of the population by comparing how compatible the various possible values of θ are with the observed data. This motivates the idea of likelihood (equivalently, log-likelihood or
support). We need a probability model for the data, in which the probability distribution of the random
variation is a member of a (realistic but mathematically tractable) family of probability distributions,
indexed by a parameter θ.
Likelihood-based approaches have both advantages and disadvantages.

Advantages:
• Unified theory (many practical problems can be tackled in essentially the same way).
• Often get simple sufficient statistics (hence we can summarise a huge data set by a few simple properties).
• The CLT suggests likelihood methods work well when there's loads of data.

Disadvantages:
• Is the theory directly relevant? (Is likelihood alone enough? And how do we balance realism and tractability?)
• If the probability model is wrong, then results can be misleading (e.g. if one assumes a Normal distribution when the true distribution is Cauchy).
• One seldom has loads of data!
1.1.6 Where Will We Go from Here?
• MSA provided the mathematical toolbox (e.g. probability theory and the idea of random variables)
for studying random variation.
• MSB will add to this toolbox and study interrelationships between (random) variables.
• We shall also consider some important general forms for the fitted/predicted values, in particular
linear models and their generalizations.
1.2 Sampling Distributions
Statistical analysis involves calculating various statistics from the data, for example the maximum likelihood estimator (MLE) θ̂ for θ. We want to understand the properties of these statistics; hence the importance of the central limit theorem (CLT) & its generalizations, and of studying the probability distributions of transformed random variables.
If we have a formula for a summary statistic S, e.g. S = ∑ Xi/n = X̄, and are prepared to make certain assumptions about the original random variables Xi, then we can say things about the probability distribution of S.
The probability distribution of a statistic S, i.e. the pattern of values S would take if it were calculated in successive samples similar to the one we actually have, is called its sampling distribution.
1.2.1 Typical Assumptions
1. Standard Assumption (IID RVs):
   Xi are IID (independent and identically distributed) with (unknown) mean µ and variance σ².
   This implies
   (a) E[X̄] = E[Xi] = µ, and
   (b) Var[X̄] = (1/n) Var[Xi] = σ²/n.
   (c) If we define the standardised random variables
           Zn = (X̄ − µ)/(σ/√n),
       then as n → ∞, the distribution of Zn tends to the standard Normal N(0, 1) distribution.

2. Additional Assumption (Normality):
   The Xi are IID Normal: Xi ∼ N(µ, σ²).
   This implies that X̄ ∼ N(µ, σ²/n).
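These facts are easy to check by simulation. The following sketch (my own illustration; the Exponential population, sample size and seed are arbitrary choices, picked so the Xi are decidedly non-Normal) estimates E[X̄], Var[X̄] and the CLT coverage:

```python
import numpy as np

# Simulation sketch: for an Exponential with mean mu, sigma = mu,
# so Var[Xbar] should be mu^2/n.
rng = np.random.default_rng(0)
mu, n, reps = 5.0, 30, 20000

samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean())        # close to mu = 5
print(xbar.var())         # close to sigma^2/n = 25/30
# CLT: Zn = (Xbar - mu)/(sigma/sqrt(n)) is approximately N(0, 1),
# so about 95% of the Zn should fall in (-1.96, 1.96).
z = (xbar - mu) / (mu / np.sqrt(n))
cover = np.mean(np.abs(z) < 1.96)
print(cover)
```

Even at n = 30 from a skewed population, the coverage is already close to the nominal 95%.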
1.2.2 Further Uses of Sampling Distributions
We can also
• compare various plausible estimators (e.g. to estimate the centre of symmetry of a supposedly symmetric distribution we might use the sample mean, median, or something more exotic),
• obtain interval estimates for unknown quantities (e.g. 95% confidence intervals, HPD intervals, support intervals),
• test hypotheses about unknown quantities.
Comments
1. Note the importance of expectations of (possibly transformed) random variables:

       E[X]          =  µ (measure of location)
       E[(X − µ)²]   =  σ² (measure of scale)
       E[e^(sX)]     =  moment generating function
       E[e^(itX)]    =  characteristic function
2. We must always consider whether the assumptions made are reasonable, both from general considerations (e.g.: is independence reasonable? is the assumption of identical distributions reasonable?
is it reasonable to assume that the data follow a Poisson distribution? etc.) and with reference to
the observed set of data (e.g. are there any ‘outliers’—unreasonably extreme values—or unexpected
patterns?)
3. Likelihood and other methods suggest estimators for unknown quantities of interest (parameters etc.)
under certain specified assumptions.
Even if these assumptions are invalid (and in practice they always will be to some extent!) we may still
want to use summary statistics as estimators of properties of the underlying population. Therefore
(a) We’ll want to investigate the properties of estimators under various relaxed assumptions, for
example partially specified models that use only the first and second moments of the unknown
quantities.
(b) It’s useful if the calculated statistics (e.g. MLEs) have an intuitive interpretation (like ‘sample
mean’ or ‘sample variance’).
1.3 (Revision?) Problems
1. First-year students attending a statistics course were asked to carry out the following procedure:
Toss two coins, without showing anyone else the results.
If the first coin showed ‘Heads’ then answer the following question:
“Did the second coin show ‘Heads’ ? (Yes or No)”
If the first coin showed ‘Tails’ then answer the following question:
“Have you ever watched a complete episode of ‘Teletubbies’ ? (Yes or No)”
The following results were recorded:

              Yes   No
   Males       84   48
   Females     23   24
For each sex, and for both sexes combined, estimate the proportion who have watched a complete
episode of ‘Teletubbies’.
Using a chi-squared test, or otherwise, test whether the proportions differ between the sexes.
Discuss the assumptions you have made in carrying out your analysis.
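A sketch of one possible analysis in Python (the estimator p̂ = 2 Pr(Yes) − 1/2 follows from assuming both coins are fair; the Pearson statistic is computed by hand to avoid any library dependence):

```python
import numpy as np

# Observed randomized-response counts: rows = males, females; cols = Yes, No.
obs = np.array([[84, 48],
                [23, 24]])

# With fair coins, Pr(Yes) = 1/2 * 1/2 + 1/2 * p, so p = 2 * Pr(Yes) - 1/2.
for label, (yes, no) in zip(["males", "females", "combined"],
                            [obs[0], obs[1], obs.sum(axis=0)]):
    p_yes = yes / (yes + no)
    print(label, 2 * p_yes - 0.5)

# Pearson chi-squared test of association between sex and answer (1 df).
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row * col / obs.sum()
chi2 = ((obs - expected) ** 2 / expected).sum()
print(chi2)   # about 3.12, below the 5% critical value of 3.84 for 1 df
```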
2. Let X and Y be IID RVs with a standard Normal N (0, 1) distribution, and define Z = X/Y .
(a) Write down the lower quartile, median and upper quartile of Z, i.e. the points z25 , z50 & z75
such that Pr(Z < zk ) = k/100.
(b) Show that Z has a Cauchy distribution, with PDF 1/(π(z² + 1)).
HINT : consider the transformation Z = X/Y and W = |Y |.
3. Let X1 , . . . Xn be mutually independent RVs, with respective MGFs (moment generating functions)
MX1 (t), . . . , MXn (t), and let a1 , . . . , an and b1 , . . . , bn be fixed constants.
Show that the MGF of Z = (a1 X1 + b1) + (a2 X2 + b2) + · · · + (an Xn + bn) is

    MZ(t) = exp(t ∑ bi) MX1(a1 t) × · · · × MXn(an t).
Hence or otherwise show that any linear combination of independent Normal RVs is itself Normally
distributed.
4. A workman has to move a rectangular stone block a short distance, but doesn’t want to strain himself.
He rapidly estimates:
• height of block = 10 cm, with standard deviation 1 cm.
• width of block = 20 cm, with standard deviation 3 cm.
• length of block = 25 cm, with standard deviation 4 cm.
• density of block = 4.0 g/cc, with standard deviation 0.5 g/cc.
Assuming these estimates are mutually independent, calculate his estimates of the volume V (cc)
and total weight W (kg) of the block, and their standard deviations.
The workman fears that he might hurt his back if W ≥ 30.
Using Chebyshev’s inequality, give an upper bound for his probability Pr(W ≥ 30).
[Chebyshev’s inequality states that if X has mean µ & variance σ 2 , then Pr(|X − µ| ≥ c) ≤
σ 2 /c2 —see MSA].
What is the workman’s value for Pr(W > 30) under the additional assumption that W is Normally
distributed? Compare this value with the bound found earlier.
How reasonable are the independence and Normality assumptions used in the above analysis?
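For independent RVs, E[XY] = E[X]E[Y] and E[(XY)²] = E[X²]E[Y²], so means and variances of products propagate exactly. A sketch of the whole calculation (units: cm, g/cc, kg):

```python
import math

# Each quantity given as (mean, sd); mutual independence assumed, as stated.
H, W_, L_, D = (10, 1), (20, 3), (25, 4), (4.0, 0.5)

# For independent X, Y: Var[XY] = (m_x^2 + s_x^2)(m_y^2 + s_y^2) - m_x^2 m_y^2.
def product(a, b):
    (ma, sa), (mb, sb) = a, b
    m = ma * mb
    var = (ma**2 + sa**2) * (mb**2 + sb**2) - m**2
    return m, math.sqrt(var)

V = product(product(H, W_), L_)          # volume in cc
Wt_g = product(V, D)                     # weight in grams
Wt = (Wt_g[0] / 1000, Wt_g[1] / 1000)    # convert to kg
print(V)    # mean 5000 cc
print(Wt)   # mean 20 kg, sd about 5.5 kg

# Chebyshev: Pr(W >= 30) <= Pr(|W - 20| >= 10) <= sd^2 / 10^2.
cheb = Wt[1]**2 / 10**2
# Normal assumption: Pr(W > 30) = 1 - Phi((30 - 20)/sd).
z = (30 - Wt[0]) / Wt[1]
normal_tail = 0.5 * math.erfc(z / math.sqrt(2))
print(cheb, normal_tail)   # Chebyshev bound ~0.30 vs Normal tail ~0.035
```

The Chebyshev bound is valid under any distribution but very loose; the Normal value is far smaller, but rests on an extra assumption.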
5. Calculate the MLE of the centre of symmetry θ, given IID RVs X1 , X2 , . . . , Xn , where the common
PDF fX (x) of the Xi s is
(a) Normal (or Gaussian):
        fX(x|θ, σ) = (1/(√(2π)σ)) exp(−½((x − θ)/σ)²)
(b) Laplacian (or Double Exponential):
        fX(x|θ, σ) = (1/(2σ)) exp(−|x − θ|/σ)
(c) Uniform (or Rectangular):
        fX(x|θ) = 1 if θ − ½ < x < θ + ½, and 0 otherwise.
Do you consider these MLEs to be intuitively reasonable?
6. Calculate E[X], E[X 2 ], E[X 3 ] and E[X 4 ] under each of the following assumptions:
(a) X ∼ Poi(λ), i.e. X has PMF (probability mass function)
        Pr(X = x|λ) = λ^x exp(−λ)/x!,   x = 0, 1, 2, . . .
(b) X ∼ Exp(β), i.e. X has PDF (probability density function)
        fX(x|β) = βe^(−βx) if x > 0, and 0 otherwise.
(c) X ∼ N(µ, σ²), i.e. X has PDF
        fX(x|µ, σ) = (1/(√(2π)σ)) exp(−½((x − µ)/σ)²)
7. Describe briefly how, and under what circumstances, you might approximate
(a) a binomial distribution by a Normal distribution,
(b) a binomial distribution by a Poisson distribution,
(c) a Poisson distribution by a Normal distribution.
Suppose X ∼ Bin(100, 0.1), Y ∼ Poi(10), and Z ∼ N(10, 3²). Calculate, or look up in tables,

    (i) Pr(X ≥ 6),     (ii) Pr(Y ≥ 6),     (iii) Pr(Z > 5.5),
    (iv) Pr(X > 16),   (v) Pr(Y > 16),     (vi) Pr(Z > 16.5),
and comment on the accuracy of the approximations here.
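The exact values can be computed directly, which makes the comparison concrete. This sketch uses only the standard library (note Pr(X > 16) = Pr(X ≥ 17) for the discrete cases):

```python
from math import comb, exp, factorial, sqrt, erfc

def binom_tail(n, p, k):          # Pr(X >= k) for X ~ Bin(n, p)
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def pois_tail(lam, k):            # Pr(Y >= k) for Y ~ Poi(lam)
    return 1 - sum(exp(-lam) * lam**x / factorial(x) for x in range(k))

def norm_tail(mu, sd, z):         # Pr(Z > z) for Z ~ N(mu, sd^2)
    return 0.5 * erfc((z - mu) / (sd * sqrt(2)))

# (i)-(iii): all close to 0.93
print(binom_tail(100, 0.1, 6), pois_tail(10, 6), norm_tail(10, 3, 5.5))
# (iv)-(vi): the approximations are noticeably worse in the tail
print(binom_tail(100, 0.1, 17), pois_tail(10, 17), norm_tail(10, 3, 16.5))
```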
8. The t distribution with n degrees of freedom, denoted tn or t(n), has the PDF

       f(t) = [Γ(½(n + 1)) / (Γ(½n) √(nπ))] (1 + t²/n)^(−(n+1)/2),   −∞ < t < ∞,

   and the F distribution with m and n degrees of freedom, denoted Fm,n or F(m, n), has PDF

       f(x) = [Γ(½(m + n)) / (Γ(½m) Γ(½n))] m^(m/2) n^(n/2) x^((m/2)−1) / (mx + n)^((m+n)/2),   0 < x < ∞,

   with f(x) = 0 for x ≤ 0.
   Show that if T ∼ tn and X ∼ Fm,n, then T² and X⁻¹ both have F distributions.
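Before proving this, a quick Monte Carlo sanity check (not a proof; the degrees of freedom and seed are arbitrary choices of mine) that T² behaves like F(1, n) and 1/X like F(n, m):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 4, 7, 200000

# Claim 1: if T ~ t_n then T^2 ~ F(1, n).
t = rng.standard_t(n, size=N)
f1 = rng.f(1, n, size=N)
print(np.quantile(t**2, [0.5, 0.9]), np.quantile(f1, [0.5, 0.9]))

# Claim 2: if X ~ F(m, n) then 1/X ~ F(n, m).
x = rng.f(m, n, size=N)
f2 = rng.f(n, m, size=N)
print(np.quantile(1/x, [0.5, 0.9]), np.quantile(f2, [0.5, 0.9]))
```

The matching quantiles suggest the distributional identities; the proof is a change of variables.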
9. Table 1.1 shows the estimated total resident population (thousands) of England and Wales at
30 June 1993:
Age          <1     1–14     15–44     45–64    65–74     ≥75      Total
Persons    669.6  9,268.0  21,875.0  11,435.8  4,595.9  3,594.9  51,439.2
Males      343.1  4,756.9  11,115.6   5,676.6  2,081.7  1,224.5  25,198.4
Females    326.5  4,511.1  10,759.4   5,759.2  2,514.2  2,370.4  26,240.8

Table 1.1: Estimated resident population of England & Wales, mid 1993, by sex and
age-group (simplified from Table 1 of the 1993 mortality tables)
Table 1.2, also extracted from the published 1993 Mortality Statistics, shows the number of deaths
in 1993 among the resident population of England and Wales, categorised by sex, age-group and
underlying cause of death.
Assume that the rates observed in Tables 1.1 and 1.2 hold exactly, and suppose that an individual
I is chosen at random from the population. Define the random variables S (sex), A (age group), D
(death) and C (cause) as follows:
S = 0 if I is male, 1 if I is female;
A = 1 if I is under 1 year old, 2 if I is aged 1–14, 3 if I is aged 15–44,
    4 if I is aged 45–64, 5 if I is aged 65–74, 6 if I is 75 years old or over;
D = 0 if I survives the year, 1 if I dies;
C = cause of death (0–17).
For example,

    Pr(S=0)            = 25198.4/51439.2,
    Pr(S=0 & A=6)      = 1224.5/51439.2,
    Pr(D=0|S=0 & A=6)  = 1 − 138.239/1224.5,
    Pr(C=8|S=0 & A=6)  = 28.645/1224.5,

etc.
(a) Calculate Pr(D=1|S=0), and Pr(D=1|S=0 & A=a) for a = 1, 2, 3, 4, 5, 6.
Also calculate Pr(S=0|D=1), and Pr(S=0|D=1 & A=a) for a = 1, 2, 3, 4, 5, 6.
If you were an actuary, and were asked by a non-expert “is the death rate for males higher or
lower than that for females?”, how would you respond based on the above calculations? Justify
your answer.
(b) Similarly, explain how you would respond to the questions
i. “is the death rate from neoplasms higher for males or for females?”
ii. “is the death rate from mental disorders higher for males or for females?”
iii. “is the death rate from diseases of the circulatory system higher for males or for females?”
iv. “is the death rate from diseases of the respiratory system higher for males or for females?”
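Part (a) can be sketched numerically using just the 'Total' rows of Tables 1.1 and 1.2 (populations are in thousands, hence the factor 1000). The punchline, worth discovering for yourself, is a Simpson's-paradox effect: compare the overall rates with the age-specific ones.

```python
# Death rates by sex, England & Wales 1993 (totals rows of Tables 1.1 and 1.2).
age_groups = ["<1", "1-14", "15-44", "45-64", "65-74", ">=75"]
pop_m = [343.1, 4756.9, 11115.6, 5676.6, 2081.7, 1224.5]   # thousands
pop_f = [326.5, 4511.1, 10759.4, 5759.2, 2514.2, 2370.4]
deaths_m = [2410, 1107, 11900, 46669, 78833, 138239]
deaths_f = [1832, 797, 6317, 28994, 55205, 205867]

# Overall death rates Pr(D=1|S=s):
rate_m = sum(deaths_m) / (1000 * sum(pop_m))
rate_f = sum(deaths_f) / (1000 * sum(pop_f))
print(rate_m, rate_f)          # overall: females slightly HIGHER

# Age-specific rates Pr(D=1|S=s & A=a):
for a, pm, pf, dm, df_ in zip(age_groups, pop_m, pop_f, deaths_m, deaths_f):
    print(a, dm / (1000 * pm), df_ / (1000 * pf))   # males higher at EVERY age
```

The reversal arises because far more women survive into the high-mortality oldest age group, which dominates the overall rate.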
                                                       Age at death (years)
Cause of death                          Sex  All ages      <1   1–14   15–44    45–64    65–74      ≥75
0  Deaths below 28 days                  M      1,603   1,603      −       −        −        −        −
   (no cause specified)                  F      1,192   1,192      −       −        −        −        −
1  Infectious & parasitic diseases       M      1,954      60     79     565      390      346      514
                                         F      1,452      46     44     169      193      283      717
2  Neoplasms                             M     74,480      16    195   2,000   16,372   25,644   30,253
                                         F     67,966       8    138   2,551   15,026   19,141   31,102
3  Endocrine, nutritional & metabolic    M      3,515      28     43     208      639      959    1,638
   diseases and immunity disorders       F      4,403      17     37     153      474      901    2,821
4  Diseases of blood and                 M        897       5     12      62      106      204      508
   blood-forming organs                  F      1,084       3     14      28       73      163      803
5  Mental disorders                      M      2,530       −      8     281      169      334    1,738
                                         F      5,189       −      1      83       99      297    4,709
6  Diseases of the nervous system        M      4,403      59    136     530      675      890    2,113
   and sense organs                      F      4,717      42    118     313      546      809    2,889
7  Diseases of the circulatory system    M    123,717      41     66   1,997   20,682   37,195   63,736
                                         F    134,439      44     45     834    7,783   23,185  102,548
8  Diseases of the respiratory system    M     41,802      86     79     608    3,157    9,227   28,645
                                         F     49,068      59     74     322    2,145    6,602   39,866
9  Diseases of the digestive system      M      7,848      10     27     511    1,706    2,058    3,536
                                         F     10,574      20     14     298    1,193    1,921    7,128
10 Diseases of the genitourinary         M      3,008       4      6      57      215      676    2,050
   system                                F      3,710       4      7      55      219      535    2,890
11 Complications of pregnancy,           M          −       −      −       −        −        −        −
   childbirth and the puerperium         F         27       −      −      27        −        −        −
12 Diseases of the skin and              M        269       1      1       7       22       62      176
   subcutaneous tissue                   F        748       −      −      15       30       80      623
13 Diseases of the musculoskeletal       M        785       1      5      28      106      151      494
   system and connective tissue          F      2,639       −      5      43      173      385    2,033
14 Congenital anomalies                  M        660     131    114     158      118       58       81
                                         F        675     136    116     133      101       87      102
15 Certain conditions originating        M        186      93      8      13       18       16       38
   in the perinatal period               F        114      60      5       3        4       10       32
16 Signs, symptoms and                   M      1,642     238     17     126      111       72    1,078
   ill-defined conditions                F      5,146     171     17      50       53       75    4,780
17 External causes of                    M      9,859      34    311   4,749    2,183      941    1,641
   injury and poisoning                  F      5,869      30    162   1,240      882      731    2,824
   Total                                 M    279,158   2,410  1,107  11,900   46,669   78,833  138,239
                                         F    299,012   1,832    797   6,317   28,994   55,205  205,867

Table 1.2: Deaths in England & Wales, 1993, by underlying cause, sex and age-group
(extracted from Table 2 of the 1993 mortality tables)
(c) Now treat the data in Tables 1.1 & 1.2 as subject to statistical fluctuations. One can still estimate

        psac = Pr(S=s & A=a & C=c),   p·ac = Pr(A=a & C=c),   ps·· = Pr(S=s),   etc.

    from the data, for example p̂0,·,14 = 660/25198400 = 2.62 × 10⁻⁵. Similarly estimate p1,·,14 and p·,a,14 for a = 1, . . . , 6. Using a chi-squared test or otherwise, investigate whether the relative risk of death from a congenital anomaly between males and females is the same at all ages, i.e. whether it is reasonable to assume that

        ps,a,14 = ps,·,14 × p·,a,14.
10. Data were collected on litter size and sex ratios for a large number of litters of piglets. The following
table gives the data for all litters of size between four and twelve:
Number                        Litter size
of males     4     5     6     7     8     9    10    11    12
    0        1     2     3     0     1     0     0     0     0
    1       14    20    16    21     8     2     7     1     0
    2       23    41    53    63    37    23     8     3     1
    3       14    35    78   117    81    72    19    15     8
    4        1    14    53   104   162   101    79    15     4
    5              4    18    46    77    83    82    33     9
    6                    0    21    30    46    48    13    18
    7                           2     5    12    24    12    11
    8                                 1     7    10     8    15
    9                                       0     0     1     4
   10                                             0     1     0
   11                                                   0     0
   12                                                         0
Total       53   116   221   374   402   346   277   102    70
(a) Discuss briefly what sort of probability distributions it might be reasonable to assume for the
total size N of a litter, and for the number M of males in a litter of size N = n.
(b) Suppose now that the litter size N follows a Poisson distribution with mean λ. Write down
an expression for Pr(N = n|4 ≤ N ≤ 12). Hence or otherwise give an expression for the
log-likelihood `(λ; . . .) given the above table of data.
(c) Evaluate `(λ; . . .) at λ = 7.5, 8 and 8.5. By fitting a quadratic to these values, provide point
and interval estimates of λ.
(d) Using a chi-squared test or otherwise, check how well your model fits the data.
(e) Comment on the following argument: ‘Provided λ isn’t too small, we could approximate the
Poisson distribution Poi (λ) by the Normal distribution N (λ, λ). This is symmetric, so we may
simply estimate the mean λ by the mode of the data (8 in our case). The standard deviation is
therefore nearly 3, and so we would expect the counts at litter size 8 ± 3 to be nearly 60% the
count at 8 (note that for a standard Normal, φ(1)/φ(0) = exp(−0.5) ≈ 0.6). Since there are far
fewer litters of size 5 & 11 than this, the Poisson distribution must be a poor fit.’
Data from HSDS, set 176
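Parts (b) and (c) are easy to automate. This sketch (my own implementation; the grid points and the quadratic summary follow the question) evaluates the truncated-Poisson log-likelihood and reads off point and interval estimates from the fitted quadratic:

```python
from math import exp, factorial, log, sqrt

# Litter-size counts for n = 4..12, from the table above.
counts = {4: 53, 5: 116, 6: 221, 7: 374, 8: 402, 9: 346, 10: 277, 11: 102, 12: 70}

def loglik(lam):
    # Pr(N = n | 4 <= N <= 12) = Poi pmf renormalised over {4, ..., 12}.
    pois = {n: lam**n * exp(-lam) / factorial(n) for n in range(4, 13)}
    norm = sum(pois.values())
    return sum(c * log(pois[n] / norm) for n, c in counts.items())

l1, l2, l3 = loglik(7.5), loglik(8.0), loglik(8.5)
print(l1, l2, l3)

# Fit l(lam) ~ a*lam^2 + b*lam + c through the three points (step h = 0.5):
a = (l1 - 2 * l2 + l3) / (2 * 0.5**2)      # half the second difference
b = (l3 - l1) / (2 * 0.5) - 2 * a * 8.0    # matches the central slope
lam_hat = -b / (2 * a)                     # vertex = point estimate
se = sqrt(-1 / (2 * a))                    # from the observed curvature
print(lam_hat, lam_hat - 1.96 * se, lam_hat + 1.96 * se)
```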
Education is what survives when what has been learnt has been forgotten.
Burrhus Frederic Skinner
Chapter 2
Bivariate & Multivariate Distributions
MSA largely concerned IID (independent & identically distributed) random variables.
However in practice we are usually most interested in several random variables simultaneously, and their
interrelationships. Therefore we need to consider the probability distributions of random vectors, i.e. the
joint distribution of the individual random variables.
Bivariate Examples
A. (X1 , X2 ), the number of male & female pigs in a litter.
B. (X, Y ), the systolic and diastolic blood pressure of an individual.
C. (X, Y ), the age and height of an individual.
D. (X, Y ), the height and weight of an individual.
E. (µ̂, σ̂²), the estimated common mean and variance of n IID random variables X1, . . . , Xn.
F. (Θ, X) where Θ ∼ U(0, 1) and X|Θ ∼ Bin(n, Θ), i.e.
       fΘ(θ) = 1 if 0 < θ < 1, and 0 otherwise;
       fX(x|Θ = θ) = (n choose x) θ^x (1 − θ)^(n−x),   x = 0, 1, . . . , n.
Definition 2.1 (Bivariate CDF)
The joint cumulative distribution function of 2 RVs X & Y is the function
    FX,Y(x, y) = Pr(X ≤ x & Y ≤ y),   (x, y) ∈ ℝ².   (2.1)
Comments
1. The joint cumulative distribution function (or joint CDF) may also be called the ‘joint distribution
function’ or ‘joint DF’.
2. If there’s no ambiguity, then we may simply write F (x, y) for FX,Y (x, y).
2.1 Discrete Bivariate Distributions
If RVs X & Y are discrete, then they have a discrete joint distribution and a probability mass function
(PMF) that, similarly to the univariate case, is usually written fX,Y (x, y) or more simply f (x, y):
Definition 2.2 (Bivariate PMF)
The joint probability mass function of discrete RVs X and Y is
f (x, y) = Pr(X = x & Y = y).
Exercise 2.1
Suppose that the numbers X1 and X2 of male and female piglets follow independent Poisson distributions
with means λ1 & λ2 respectively. Find the joint PMF.
Exercise 2.2
Now assume the model N ∼ Poi(λ), (X1|N) ∼ Bin(N, θ), i.e. the total number N of piglets follows a Poisson distribution, and, conditional on N = n, X1 has a Bin(n, θ) distribution (in particular θ = 0.5 if the sexes are equally likely). Again find the joint PMF.
Exercise 2.3
Verify that the two models given in Exercises 2.1 & 2.2 give identical fitted values, and are therefore in
practice indistinguishable.
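Exercise 2.3's equivalence is the 'thinning' property of the Poisson distribution: splitting Poi(λ) counts independently with probability θ gives independent Poi(λθ) and Poi(λ(1−θ)) components. A numerical check (λ1 and λ2 are arbitrary values of my own choosing):

```python
from math import comb, exp, factorial

# Model 1: X1 ~ Poi(lam1), X2 ~ Poi(lam2), independent.
# Model 2: N ~ Poi(lam1 + lam2), X1|N=n ~ Bin(n, theta), X2 = N - X1,
#          with theta = lam1 / (lam1 + lam2).
lam1, lam2 = 3.2, 2.7
lam, theta = lam1 + lam2, lam1 / (lam1 + lam2)

def pois(mu, k):
    return mu**k * exp(-mu) / factorial(k)

for x1 in range(10):
    for x2 in range(10):
        p_indep = pois(lam1, x1) * pois(lam2, x2)
        n = x1 + x2
        p_hier = pois(lam, n) * comb(n, x1) * theta**x1 * (1 - theta)**(n - x1)
        assert abs(p_indep - p_hier) < 1e-12
print("joint PMFs agree")
```

The algebraic identity behind the check is e^(−λ) λⁿ/n! × (n choose x1) θ^x1 (1−θ)^x2 = e^(−λ1) λ1^x1/x1! × e^(−λ2) λ2^x2/x2!.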
2.1.1 Manipulation
A discrete RV has a countable sample space, which without loss of generality can be represented as
N = {0, 1, 2, . . .}. Values of a discrete joint distribution f(x, y) can therefore be tabulated:

                     Y
              0      1      2     . . .
     X   0   f00    f01    f02    . . .
         1   f10    f11    f12    . . .
         .    .      .      .
         .    .      .      .

and the probability of any event E obtained by simple summation:

    Pr((X, Y) ∈ E) = ∑ f(xi, yj), summing over all (xi, yj) ∈ E.
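As a toy illustration of such a tabulation (my own example, not from the notes): take X = the score on one fair die and Y = the larger of that score and a second die's score, tabulate f(x, y), and read event probabilities off the table by summation.

```python
from fractions import Fraction

# Tabulate the joint PMF of X = first die, Y = max(first die, second die).
f = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        key = (d1, max(d1, d2))
        f[key] = f.get(key, Fraction(0)) + Fraction(1, 36)

assert sum(f.values()) == 1          # it really is a PMF
# Pr((X, Y) in E) by summation, e.g. E = {X = Y} (first die is the max):
p = sum(prob for (x, y), prob in f.items() if x == y)
print(p)   # Fraction(7, 12)
```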
Exercise 2.4
Continuing Exercise 2.2, find the PMF of X1 , and hence identify the distribution of X1 .
Exercise 2.5
The RV Q is defined on the rational numbers in [0, 1] by Q = X/Y, where f(x, y) = (1 − α)α^(y−1)/(y + 1), 0 < α < 1, y ∈ {1, 2, . . .}, x ∈ {0, 1, . . . , y}.
Show that Pr(Q = 0) = (α − 1)(α + log(1 − α))/α².
2.2 Continuous Bivariate Distributions
Definition 2.3 (Continuous bivariate distribution)
Random variables X & Y have a continuous joint distribution if there exists a function f from R2 to
[0, ∞) such that
    Pr((X, Y) ∈ A) = ∬_A f(x, y) dx dy   ∀ A ⊆ ℝ².   (2.2)
Definition 2.4 (Bivariate PDF)
The function f (x, y) defined by Equation 2.2 is called the joint probability density function of X & Y .
Comments
1. f (x, y) may be written more explicitly as fX,Y (x, y).
2. ∬ f(x, y) dx dy = 1, integrating over the whole of ℝ².
3. f(x, y) is not unique: it could be arbitrarily redefined at a countable set of points (xi, yi) (more generally, any 'set with measure zero') without changing the value of ∬_A f(x, y) dx dy for any set A.
4. f (x, y) ≥ 0 at all continuity points (x, y) ∈ R2 .
Examples
1. As in Example E of the Bivariate Examples above, we will want to know properties of the joint distribution of (µ̂, σ̂²), the MLEs of µ and σ² respectively given X1, . . . , Xn IID ∼ N(µ, σ²).
2. In the situation of Example B above, where X is the systolic blood pressure and Y the diastolic blood pressure of an individual, it might be reasonable to assume that

       X ∼ N(µS, σS²),   Y|X ∼ N(α + βX, σD²),

   and hence obtain

       fX,Y(x, y) = fX(x) fY|X(y|x).
Comment
As in Exercise 2.2, a family of multivariate distributions is most easily built up hierarchically using simple univariate distributions and conditional distributions like that of Y |X. Conditional distributions are
considered formally in Section 2.4.
2.2.1 Visualising and Displaying a Continuous Joint Distribution
A continuous bivariate distribution can be represented by a contour or other plot of its joint PDF (Fig. 2.1).
Comments
1. The joint distribution of X and Y may be neither discrete nor continuous, for example:
• Either X or Y may have both continuous and discrete components,
• One of X and Y may have a continuous distribution, the other discrete (like Example F above).
2. Higher dimensional joint distributions are obviously much more difficult to interpret and to represent
graphically, with or without computer help.
Figure 2.1: Contour and perspective plots of a bivariate distribution
2.3 Marginal Distributions
Given a joint CDF FX,Y (x, y), the distributions defined by the CDFs FX (x) = limy→∞ FX,Y (x, y) and
FY (y) = limx→∞ FX,Y (x, y) are called the marginal distributions of X and Y respectively:
Definition 2.5 (Marginal CDF, PMF and PDF—bivariate case)
FX (x) = limy→∞ FX,Y (x, y) is the marginal CDF of X.
If X has a discrete distribution, then fX (x) = Pr(X = x) is the marginal PMF of X.
If X has a continuous distribution, then fX(x) = (d/dx) FX(x) is the marginal PDF of X.
Marginal CDFs and PDFs of Y , and of other RVs for higher-dimensional joint distributions, are defined
similarly.
Exercise 2.6
Suppose that you are given a bag containing five coins: 1 double-tailed, 1 with Pr(head) = 1/4, 2 fair, and 1 double-headed.
You pick one coin at random (each with probability 1/5), then toss it twice.
By finding the joint distribution of Θ = Pr(head) and X = number of heads, or otherwise, calculate the
distribution of the number of heads obtained.
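The exercise can be checked by direct enumeration (a sketch; exact arithmetic with fractions keeps the table tidy):

```python
from fractions import Fraction

# Theta = Pr(head) of the chosen coin; X | Theta ~ Bin(2, Theta).
coins = {Fraction(0): 1, Fraction(1, 4): 1, Fraction(1, 2): 2, Fraction(1): 1}

joint = {}
for theta, count in coins.items():
    prior = Fraction(count, 5)                    # Pr(Theta = theta)
    for x in range(3):
        binom = [1, 2, 1][x]                      # C(2, x)
        px = binom * theta**x * (1 - theta)**(2 - x)
        joint[(theta, x)] = prior * px

# Marginal distribution of the number of heads (column sums of the table):
fx = [sum(p for (t, x), p in joint.items() if x == k) for k in range(3)]
print(fx)   # [33/80, 11/40, 5/16]
```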
Comments
1. If you've tabulated Pr(Θ = θ & X = x), then it's simple to find fΘ(θ) and fX(x) by writing the row sums and column sums in the margins of the table of Pr(Θ = θ & X = x), hence the name 'marginal distribution'.
2. Although the most satisfactory general definition of marginal distributions is in terms of their CDFs, in practice it's usually easiest to work with PMFs or PDFs.
2.4 Conditional Distributions

2.4.1 Discrete Case
If X and Y are discrete RVs then, by definition,
    Pr(Y=y|X=x) = Pr(X=x & Y=y)/Pr(X=x).   (2.3)
In other words (or, more accurately, in other symbols):
Definition 2.6 (Conditional PMF—bivariate case)
If X and Y have a discrete joint distribution with PMF fX,Y(x, y), then the conditional PMF fY|X of Y given X = x is

    fY|X(y|x) = fX,Y(x, y)/fX(x),   (2.4)

where fX(x) = ∑y fX,Y(x, y) is the marginal PMF of X.
Exercise 2.7
Continuing Exercise 2.6, what are the conditional distributions of [X |Θ = 1/4] and [Θ|X = 0]?
2.4.2 Continuous Case
Now suppose that X and Y have a continuous joint distribution. If we observe X = x, then we will want
to know the conditional CDF FY |X (y|X = x). But we CAN’T use Equation 2.3 directly, which would
entail dividing by zero. Therefore, by analogy with Equation 2.4, we adopt the following definition:
Definition 2.7 (Conditional PDF—bivariate case)
If X and Y have a continuous joint distribution with PDF fX,Y(x, y), then the conditional PDF fY|X of Y given that X = x is

    fY|X(y|x) = fX,Y(x, y)/fX(x),   (2.5)

defined for all x ∈ ℝ such that fX(x) > 0.
2.4.3 Independence

Recall that two RVs X and Y are independent (X ⊥⊥ Y) if, for any two sets A, B ⊆ ℝ,

    Pr(X ∈ A & Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).   (2.6)
Exercise 2.8
Show that X and Y are independent according to Formula 2.6 if and only if

    FX,Y(x, y) = FX(x) FY(y),   −∞ < x, y < ∞,   (2.7)

or equivalently if and only if

    fX,Y(x, y) = fX(x) fY(y),   −∞ < x, y < ∞,   (2.8)

(where the functions f are interpreted as PMFs or PDFs in the discrete or continuous case respectively).
2.5 Problems
1. Let the function f(x, y) be defined by

       f(x, y) = 6xy²  if 0 < x < 1 and 0 < y < 1,
                 0     otherwise.
(a) Show that f (x, y) is a probability density function.
(b) If X and Y have the joint PDF f (x, y) above, show that Pr(X + Y ≥ 1) = 9/10.
(c) Find the marginal PDF fX (x) of X.
(d) Show that Pr(0.5 < X < 0.75) = 5/16.
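A Monte Carlo cross-check of parts (b) and (d) (a sketch; under this f the coordinates happen to be independent with fX(x) = 2x and fY(y) = 3y², so inversion sampling gives X = √U, Y = V^(1/3)):

```python
import numpy as np

# Sample from f(x, y) = 6xy^2 on the unit square by inversion:
# F_X(x) = x^2 and F_Y(y) = y^3, so X = sqrt(U), Y = V**(1/3).
rng = np.random.default_rng(2)
u, v = rng.random(500000), rng.random(500000)
x, y = np.sqrt(u), v ** (1 / 3)

print(np.mean(x + y >= 1))               # should be close to 9/10
print(np.mean((0.5 < x) & (x < 0.75)))   # should be close to 5/16 = 0.3125
```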
2. Suppose that the random vector (X, Y ) takes values in the region A = {(x, y)|0 ≤ x ≤ 2, 0 ≤ y ≤ 2},
and that its CDF within A is given by FX,Y (x, y) = xy(x + y)/16.
(a) Find FX,Y (x, y) for values of (X, Y ) outside A.
(b) Find the marginal CDF FX (x) of X.
(c) Find the joint PDF fX,Y (x, y).
3. Suppose that X and Y are RVs with joint PDF

       f(x, y) = cx²y  if x² ≤ y ≤ 1,
                 0     otherwise.
(a) Find the value of c.
(b) Find Pr(X ≥ Y ).
(c) Find the marginal PDFs fX(x) & fY(y).
4. For each of the following joint PDFs f of X and Y , determine the constant c, find the marginal PDFs
of X and Y , and determine whether or not X and Y are independent.
   (a) f(x, y) = ce^(−(x+2y)) for x, y ≥ 0, and 0 otherwise.
   (b) f(x, y) = cy²/2 for 0 ≤ x ≤ 2 and 0 ≤ y ≤ 1, and 0 otherwise.
   (c) f(x, y) = cxe^(−y) for 0 ≤ x ≤ 1 and 0 ≤ y < ∞, and 0 otherwise.
   (d) f(x, y) = cxy for x, y ≥ 0 and x + y ≤ 1, and 0 otherwise.
5. Suppose that X and Y are continuous RVs with joint PDF f(x, y) = e^(−y) on 0 < x < y < ∞.
(a) Find Pr(X + Y ≥ 1)
[HINT : write this as 1 − Pr(X + Y < 1)].
(b) Find the marginal distribution of X.
(c) Find the conditional distribution of Y given that X = x.
6. Assume that X and Y are random variables each taking values in [0, 1]. For each of the following CDFs, show that the marginal distributions of X and Y are both uniform U(0, 1), and determine the conditional CDF FX|Y(x|Y = 0.5) in each case:
   (a) F(x, y) = xy,
   (b) F(x, y) = min(x, y),
   (c) F(x, y) = 0 if x + y < 1, and x + y − 1 if x + y ≥ 1.
7. Suppose that Θ is a random variable uniformly distributed on (0, 1), i.e. Θ ∼ U (0, 1), and that, once
Θ = θ has been observed, the random variable X is drawn from a binomial distribution [X|θ] ∼
Bin(2, θ).
(a) Find the joint CDF F (θ, x).
(b) How might you display the joint distribution of Θ and X graphically?
(c) What (as simply as you can express them) are the marginal CDFs F1 (θ) of Θ and F2 (x) of X?
8. Suppose that X and Y are two RVs having a continuous joint distribution. Show that X and Y are
independent if and only if fX|Y (x|y) = fX (x) for each value of y such that fY (y) > 0, and for all x.
9. Suppose that X ∼ U (0, 1) and [Y |X = x] ∼ U (0, x). Find the marginal PDFs of X and Y .
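Problem 9 can be checked numerically. The following sketch (plain Python, not part of the original notes) simulates the hierarchy X ∼ U(0, 1), [Y|X = x] ∼ U(0, x); the marginal PDF works out to fY(y) = −ln y on (0, 1), so EY = 1/4.

```python
import random

random.seed(0)
# X ~ U(0,1); given X = x, Y ~ U(0,x).  The marginal density of Y is
# f_Y(y) = integral over x in (y,1) of (1/x) dx = -ln(y), so E[Y] = 1/4.
N = 200_000
ys = []
for _ in range(N):
    x = random.random()              # X ~ U(0, 1)
    ys.append(random.uniform(0, x))  # [Y | X = x] ~ U(0, x)

mean_y = sum(ys) / N                 # should be close to 0.25
```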
2.6 Multivariate Distributions
2.6.1 Introduction
Given a random vector X = (X1 , X2 , . . . , Xn )T , the joint distribution of the random variables X1 , X2 , . . . , Xn
is called a multivariate distribution.
Definition 2.8 (Joint CDF)
The joint cumulative distribution function of RVs X1, X2, . . . , Xn is the function
    FX(x1, x2, . . . , xn) = Pr(Xk ≤ xk ∀ k = 1, 2, . . . , n).          (2.9)
Comments
1. Formula 2.9 can be written succinctly as FX (x) = Pr(X ≤ x), in an ‘obvious’ vector notation.
2. FX (x) can be called simply the CDF of the random vector X.
3. Properties of FX are similar to the bivariate case. Unfortunately the notation is messier, particularly
for the things we’re generally most interested in for statistical inference, such as
(a) marginal distributions of unknown quantities and vectors,
(b) conditional distributions of unknown quantities and vectors, given what we know.
4. It’s often simpler to blur the distinction between row and column vectors, i.e. to let X denote either
(X1 , X2 , . . . , Xn ) or (X1 , X2 , . . . , Xn )T , depending on context.
Definition 2.9 (Discrete multivariate distribution)
The RV X ∈ Rn has a discrete distribution if it can take only a countable number of possible values.
Definition 2.10 (Multivariate PMF)
If X has a discrete distribution, then its probability mass function (PMF) is
    f(x) = Pr(X = x),   x ∈ ℝⁿ,          (2.10)
[i.e. the RVs X1 . . . Xn have joint PMF f (x1 . . . xn ) = Pr(X1 = x1 & · · · &Xn = xn )].
Definition 2.11 (Continuous multivariate distribution)
The RV X = (X1 , X2 , . . . , Xn ) has a continuous distribution if there is a nonnegative function f (x),
where x = (x1 , x2 , . . . , xn ), such that for any subset A ⊂ Rn ,
    Pr((X1, X2, . . . , Xn) ∈ A) = ∫ · · · ∫_A f(x1, x2, . . . , xn) dx1 dx2 . . . dxn.          (2.11)
Definition 2.12 (Multivariate PDF)
The function f in 2.11 is the (joint) probability density function of X.
Comments
1. Without loss of generality, if X is discrete, then we can take its possible values to be Nn (i.e. each
coordinate Xi of X is a nonnegative integer).
2. Equation 2.11 could be simply written
       Pr(X ∈ A) = ∫_A f(x) dx.          (2.12)
3. As usual, f (·) may be written more explicitly fX (·), etc.
4. By the fundamental theorem of calculus,
       fX(x1, . . . , xn) = ∂ⁿFX(x1, . . . , xn)/(∂x1 · · · ∂xn)          (2.13)
   at all points (x1, . . . , xn) where this derivative exists, i.e. fX(x) = ∂ⁿFX(x)/∂x.
5. Mixed distributions (neither continuous nor discrete) can be handled using appropriate combinations
of summation and integration.
2.6.2 Useful Notation for Marginal & Conditional Distributions
We’ll sometimes adopt the following notation from DeGroot, particularly when the components Xi of X
are in some way similar, as in the multivariate Normal distribution (see later).
    F(x)          denotes the CDF of X = (X1, X2, . . . , Xn) at x = (x1, x2, . . . , xn),
    f(x)          denotes the corresponding joint PMF (discrete case) or PDF (continuous case),
    fj(xj)        denotes the marginal PMF (PDF) of Xj (integrating over x1 . . . xj−1, xj+1 . . . xn),
    fjk(xj, xk)   denotes the marginal joint PDF of Xj & Xk (integrating over the remaining xi s),
    gj(xj |x1 . . . xj−1, xj+1 . . . xn)   denotes the conditional PMF (PDF) of Xj given Xi = xi, i ≠ j,
    Fj(xj)        denotes the marginal CDF of Xj,
    Gjk           denotes the conditional CDF of (Xj, Xk) given the values xi of all Xi, i ≠ j, k, etc.
2.7 Expectation
2.7.1 Introduction
The following are important definitions and properties involving expectations, variances and covariances:

    Var(X)         = E[(X − µ)²]     where µ = EX
                   = E[X²] − µ²,
    E[aX + b]      = a EX + b        where a and b are constants,
    E[(aX + b)²]   = a² E[X²] + 2ab EX + b²,
    Var(aX + b)    = a² Var(X),
    E[X1 X2]       = (EX1)(EX2)      if X1 ⊥⊥ X2,
    Cov(X1, X2)    = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2,
    SD(X)          = √Var(X),
    corr(X1, X2)   = ρ(X1, X2) = Cov(X1, X2)/(SD(X1) SD(X2)).
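A quick numerical sanity check of two of the identities above (an illustrative sketch, not part of the original notes). Both identities hold exactly for sample moments, so the agreement below is to floating-point accuracy, not merely Monte Carlo accuracy.

```python
import random

random.seed(1)
# A sample from an arbitrary (here Exponential) distribution.
xs = [random.expovariate(0.5) for _ in range(100_000)]
a, b = 3.0, 2.0

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Var(aX + b) = a^2 Var(X): the shift b drops out; the scale enters squared.
lhs = var([a * x + b for x in xs])
rhs = a ** 2 * var(xs)

# Cov(X1, X2) = E[X1 X2] - mu1 mu2, with X2 = X1^2 to get dependence.
ys = [x * x for x in xs]
mx, my = mean(xs), mean(ys)
cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
alt = mean([x * y for x, y in zip(xs, ys)]) - mx * my
```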
Note that the definition of expectation applies directly in the multivariate case:
Definition 2.13 (Multivariate expectation)
    E[h(X)] = Σ_x h(x) f(x)           if X is discrete,
            = ∫_{ℝⁿ} h(x) f(x) dx     if X is continuous.
For example, if X = (X1, X2, X3) has a continuous distribution, then
    E[X1] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f(x1, x2, x3) dx1 dx2 dx3.
Exercise 2.9
Let X and Y be independent continuous RVs. Prove that, for arbitrary functions g(·) and h(·),
    E[g(X)h(Y)] = E[g(X)] E[h(Y)].
k
Exercise 2.10
Let X, Y and Z have independent Poisson distributions with means λ, µ, ν respectively. Find E[X 2 Y Z].
k
Exercise 2.11
[Cauchy–Schwarz] By considering E[(tX − Y)²], or otherwise, prove the Cauchy–Schwarz inequality for
expectations, i.e. for any two RVs X and Y with finite second moments, [E(XY)]² ≤ E[X²] E[Y²], with
equality if and only if Pr(Y = cX) = 1 for some constant c.
Hence or otherwise prove that the correlation ρX,Y between X and Y satisfies |ρX,Y | ≤ 1.
Under what circumstances does ρX,Y = 1?
k
2.8 Approximate Moments of Transformed Distributions
The moments of a transformed RV g(X) can often be well approximated via a Taylor series:
Exercise 2.12
[delta method] Let X1, X2, . . . , Xn be independent, each with mean µ and variance σ², and let g(·) be a
function with a continuous derivative g′(·).
By considering a Taylor series expansion involving
    Zn = (X̄ − µ)/√(σ²/n),
show that
    E g(X̄) = g(µ) + O(n⁻¹),                         (2.14)
    Var g(X̄) = n⁻¹ σ² g′(µ)² + O(n⁻³/²).            (2.15)
k
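The two delta-method formulae can be checked by simulation. This sketch (not part of the original notes) takes Xi ∼ Exp(1), so µ = σ² = 1, and g = log, so the predicted variance g′(µ)²σ²/n is simply 1/n.

```python
import math
import random

random.seed(2)
# Delta method check: Xi ~ Exp(1) (mu = 1, sigma^2 = 1), g = log.
# Prediction (2.15): Var g(Xbar) ~ g'(mu)^2 * sigma^2 / n = 1/n.
n, reps = 50, 20_000
gbars = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    gbars.append(math.log(xbar))

m = sum(gbars) / reps                           # should be near g(mu) = 0
v = sum((g - m) ** 2 for g in gbars) / reps     # should be near 1/50 = 0.02
```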
Comments
1. There is similarly a multivariate delta method, outside the scope of this course.
2. Important uses of expansions like the delta method include identifying useful transformations g(·), for
   example to remove skewness or, when Var(X) is a function of µ, to make Var g(X̄) (approximately)
   independent of µ.
3. A useful transformation g is sometimes in practice applied to the original RVs on the (often
   reasonable) assumption that the properties of (Σ g(Xi))/n will be similar to those of g((Σ Xi)/n).
Exercise 2.13
[Variance stabilising transformations]
Suppose that X1, X2, . . . , Xn are IID and that the (common) variance of each Xi is a function Var(µ) of the
(common) mean µ = EXi.
Show that the variance of g(X̄) is approximately constant if
    g′(µ) = 1/√Var(µ).
If X ∼ Poi(µ), show that Y = √X has approximately constant variance.
k
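A numerical illustration (a sketch, not part of the original notes): for X ∼ Poi(µ) the claim is that Var(√X) ≈ 1/4 whatever µ is, since g′(µ)² Var(X) = µ/(4µ) = 1/4.

```python
import math
import random

random.seed(3)

def poisson(mu):
    # Knuth's multiplication method; fine for the moderate mu used here.
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

vs = {}
for mu in (5, 20, 80):
    ys = [math.sqrt(poisson(mu)) for _ in range(20_000)]
    m = sum(ys) / len(ys)
    vs[mu] = sum((y - m) ** 2 for y in ys) / len(ys)
# each vs[mu] should be roughly 1/4, even though Var(X) = mu varies 16-fold
```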
2.9 Problems
1. The discrete random vector (X1, X2, X3) has the following PMF:

   (X1 = 1)         X2                      (X1 = 2)         X2
                  1     2     3                            1     2     3
            1    .02   .04   .02                     1    .08   .12   .05
   X3       2    .03   .06   .03            X3       2    .04   .11   .05
            3    .05   .10   .05                     3    .03   .07   .05

   (a) Calculate the marginal PMFs f1(x1), f2(x2), f3(x3) and f12(x1, x2).
   (b) Are X1 and X2 independent?
   (c) What are the conditional PMFs g1(x1|X2 = 1, X3 = 3), g2(x2|X1 = 1, X3 = 3),
       g3(x3|X1 = 1, X2 = 3), and g12(x1, x2|X3 = 3)?
2. The RVs A, B, C etc. count the number of times the corresponding letter appears when a word is
chosen at random from the following list (each being chosen with probability 1/16):
   MASCARA, MOVIE, RITE, SQUID, MASK, PREY, SEAT, TENDER, MERCY, REPLICA, SNAKE,
   TIME, MONSTER, REPTILES, SOMBRE, TROUT.
   (a) Complete the following table of the joint distribution of E, M and R:

                   E = 0            E = 1            E = 2
                M = 0  M = 1     M = 0  M = 1     M = 0  M = 1
        R = 0   1/16   1/16
        R = 1                                     2/16
   (b) Calculate all three bivariate marginal distributions, and hence find which of the following statements are true:
       (a) E ⊥⊥ M,   (b) E ⊥⊥ R,   (c) M ⊥⊥ R.
   (c) Similarly discover which of the following statements are true:
       (d) M ⊥⊥ R | E = 0,   (e) M ⊥⊥ R | E = 1,   (f) M ⊥⊥ R | E = 2,
       (g) M ⊥⊥ R | E,       (h) E ⊥⊥ R | M,       (i) E ⊥⊥ M | R.
3. Find variance stabilizing transformations for
(a) the exponential distribution,
(b) the binomial distribution.
4. Let Z ∼ N(0, 1) and define the RV X by
       Pr(X = −√3) = 1/6,   Pr(X = 0) = 4/6,   Pr(X = +√3) = 1/6.
   (a) Show that X has the same mean and variance as Z, and that X² has the same mean and
       variance as Z².
   (b) Suppose the RV Y has mean µ and variance σ². Compare the delta method for estimating the
       mean and variance of the RV T = g(Y) with the alternative estimates
           µ̂(T) ≈ E[g(µ + σX)],   V̂ar(T) ≈ Var[g(µ + σX)].
       [Try a few simple distributions for Y and transformations g(·).]
2.10 Conditional Expectation
2.10.1 Introduction
A common practical problem arises when X1 and X2 aren’t independent, we observe X2 = x2 , and we
want to know the mean of the resulting conditional distribution of X1 .
Definition 2.14 (Conditional expectation)
The conditional expectation of X1 given X2 is denoted E[X1 |X2 ]. If X2 = x2 then
    E[X1|x2] = ∫_{−∞}^{∞} x1 g1(x1|x2) dx1   (continuous case)          (2.16)
             = Σ_{x1} x1 g1(x1|x2)            (discrete case)            (2.17)
where g1(x1|x2) is the conditional PDF or PMF respectively.
Comment
Note that before X2 is known to take the value x2 , E[X1 |X2 ] is itself a random variable, being a function
of the RV X2 . We’ll be interested in the distribution of the RV E[X1 |X2 ], and (for example) comparing it
with the unconditional expectation EX1 . The following is an important result:
Theorem 2.1 (Marginal expectation)
For any two RVs X1 & X2 ,
E E[X1 |X2 ] = EX1 .
(2.18)
Exercise 2.14
Prove Equation 2.18 (i) for continuous RVs X1 and X2 , (ii) for discrete RVs X1 and X2 .
k
Exercise 2.15
Suppose that the RV X has a uniform distribution, X ∼ U (0, 1), and that, once X = x has been observed,
the conditional distribution of Y is [Y |X = x] ∼ U (x, 1).
Find E[Y |x] and hence, or otherwise, show that EY = 3/4.
k
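A simulation sketch of Exercise 2.15 (not part of the original notes): E[Y|x] = (1 + x)/2, so by Theorem 2.1, EY = E[(1 + X)/2] = 3/4.

```python
import random

random.seed(13)
# X ~ U(0,1); given X = x, Y ~ U(x,1).  E[Y|x] = (1+x)/2, so E Y = 3/4.
N = 200_000
ys = []
for _ in range(N):
    x = random.random()
    ys.append(random.uniform(x, 1))

mean_y = sum(ys) / N   # should be close to 0.75
```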
Exercise 2.16
Suppose that Θ ∼ U (0, 1) and (X|Θ) ∼ Bin(2, Θ).
Find E[X|Θ] and hence or otherwise show that EX = 1.
k
2.10.2 Conditional Expectations of Functions of RVs
By extending Theorem 2.1, we can relate the conditional and marginal expectations of functions of RVs
(in particular, their variances).
Theorem 2.2 (Marginal expectation of a transformed RV)
For any RVs X1 & X2, and for any function h(·),
    E(E[h(X1)|X2]) = E[h(X1)].          (2.19)
Exercise 2.17
Prove Equation 2.19 (i) for discrete RVs X1 and X2 , (ii) for continuous RVs X1 and X2 .
k
An important consequence of Equation 2.19 is the following theorem relating marginal variance to conditional variance and conditional expectation:
Theorem 2.3 (Marginal variance)
For any RVs X1 & X2,
    Var(X1) = E(Var(X1|X2)) + Var(E[X1|X2]).          (2.20)
Comments
1. Equation 2.20 is easiest to remember in English:
   'marginal variance = expectation of conditional variance + variance of conditional expectation'.
2. A useful interpretation of Equation 2.20 is:
       Var(X1) = average random variation inherent in X1 even if X2 were known
               + random variation due to not knowing X2 and hence not knowing EX1 ;
i.e. the uncertainty involved in predicting the value x1 taken by a random variable X1 splits into two
components. One component is the unavoidable uncertainty due to random variation in X1 , but the
other can be reduced by observing quantities (here the value x2 of X2 ) related to X1 .
Exercise 2.18
[Proof of Theorem 2.3] Expand E(Var(X1|X2)) and Var(E[X1|X2]).
Hence show that Var(X1) = E(Var(X1|X2)) + Var(E[X1|X2]).
k
Exercise 2.19
Continuing Exercise 2.16, in which Θ ∼ U(0, 1), (X|Θ) ∼ Bin(2, Θ), and E[X|Θ] = 2Θ, find Var(E[X|Θ])
and E(Var(X|Θ)). Hence or otherwise show that Var X = 2/3, and comment on the effect on the uncertainty in X of observing Θ.
k
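Exercise 2.19 can also be checked by simulation (an illustrative sketch, not part of the notes): both terms in Theorem 2.3 equal 1/3 here, giving Var X = 2/3.

```python
import random

random.seed(4)
# Theta ~ U(0,1); [X|Theta] ~ Bin(2, Theta).
# E Var(X|Theta) = E[2 Theta(1-Theta)] = 1/3, Var E[X|Theta] = Var(2 Theta) = 1/3,
# so Theorem 2.3 gives Var X = 2/3 (and E X = 1).
N = 200_000
xs = []
for _ in range(N):
    th = random.random()
    xs.append(sum(1 for _ in range(2) if random.random() < th))

m = sum(xs) / N
v = sum((x - m) ** 2 for x in xs) / N
```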
2.11 Problems
1. Two fair coins are tossed independently. Let A1, A2 and A3 be the following events:
       A1 = 'coin 1 comes down heads',
       A2 = 'coin 2 comes down heads',
       A3 = 'results of both tosses are the same'.
   (a) Show that A1, A2 and A3 are pairwise independent (i.e. A1 ⊥⊥ A2, A1 ⊥⊥ A3 and A2 ⊥⊥ A3) but
       not mutually independent.
   (b) Hence or otherwise construct three random variables X1, X2, X3 such that E[X3|X1 = x1] and
       E[X3|X2 = x2] are constant, but E[X3|X1 = x1 & X2 = x2] isn't.
2. Construct three random variables X1, X2, X3 with continuous distributions such that X1 ⊥⊥ X2,
   X1 ⊥⊥ X3 and X2 ⊥⊥ X3, but any two Xi's determine the remaining one.
3. (a) Show that for any random variables X and Y,
       i. E[Y] = E(E[Y|X]),
       ii. Var[Y] = E(Var[Y|X]) + Var(E[Y|X]).
   (b) Suppose that the random variables Xi and Pi, i = 1, . . . , n, have the following distributions:
           Xi = 1 with probability Pi,   0 with probability 1 − Pi,
           Pi ∼IID Beta(α, β),
       i.e. Pi has density
           f(p) = [Γ(α + β)/(Γ(α) Γ(β))] p^{α−1} (1 − p)^{β−1}
       with mean µ and variance σ² given by
           µ = E[Pi] = α/(α + β),   σ² = Var[Pi] = αβ/[(α + β)²(α + β + 1)],
       and Xi has a Bernoulli(Pi) distribution.
       Find
           i. E[X1|P1],
           ii. Var[X1|P1],
           iii. Var(E[X1|P1]), and
           iv. E(Var[X1|P1]).
       Hence find E[Y] where Y = Σ_{i=1}^{n} Xi, and show that Var[Y] = nαβ/(α + β)².
(c) Express E[Y ] and Var[Y ] in terms of µ and σ 2 , and comment on the result.
From Warwick ST217 exam 1998
4. Suppose that the number N of bye-elections occurring in Government-held seats over a 12-month
period follows a Poisson distribution with mean 10.
Suppose also that, independently for each such bye-election, the probability that the Government
hold onto the seat is 1/4. The number X of seats retained in the N bye-elections therefore follows a
binomial distribution:
[X|N ] ∼ Bin(N, 0.25).
(a) What are E[N ], Var[N ], E[X|N ] and Var[X|N ]?
(b) What are E[X] and Var[X]?
(c) What is the distribution of X?
[HINT : try using generating functions—see MSA]
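A Monte Carlo check of problem 4 (a sketch, not from the notes). The generating-function argument suggested by the hint gives the Poisson 'thinning' property: X ∼ Poi(10 × 1/4) = Poi(2.5), so the mean and variance of X should both be close to 2.5.

```python
import math
import random

random.seed(5)
# N ~ Poisson(10); given N, X ~ Bin(N, 1/4).  Thinning: X ~ Poisson(2.5).
def poisson(mu):
    # Knuth's multiplication method
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

xs = []
for _ in range(100_000):
    n = poisson(10)
    xs.append(sum(1 for _ in range(n) if random.random() < 0.25))

m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
# mean and variance should both be close to 2.5
```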
5. (a) For continuous random variables X and Y, define
       i. the marginal density fX(x) of X,
       ii. the conditional density fY|X(y|x) of Y given X = x,
       iii. the conditional expectation E[Y|X] of Y given X, and
       iv. the conditional variance Var[Y|X] of Y given X.
   (b) Show that
       i. E[g(Y)] = E(E[g(Y)|X]), for an arbitrary function g(·), and
       ii. Var[Y] = E(Var[Y|X]) + Var(E[Y|X]).
   (c) Suppose that the random variables X and Y have a continuous joint distribution, with PDF
       f(x, y), means µX & µY respectively, variances σX² & σY² respectively, and correlation ρ. Also
       suppose the conditional mean of Y given X = x is a linear function of x:
           E[Y|x] = β0 + β1 x.
       Show that
           i. ∫_{−∞}^{∞} y f(x, y) dy = (β0 + β1 x) fX(x),
           ii. µY = β0 + β1 µX, and
           iii. ρσX σY + µX µY = β0 µX + β1 (σX² + µX²).
       (Hint: use the fact that E[XY] = E[E[XY|X]].)
(d) Hence or otherwise express β0 and β1 in terms of µX , µY , σX , σY & ρ, and write down (or
derive) the maximum likelihood estimates of β0 & β1 under the assumption that the data
(x1 , y1 ), . . . , (xn , yn ) are i.i.d observations from a bivariate Normal distribution.
From Warwick ST217 exam 1997
6. For discrete random variables X and Y , define:
(i) The conditional expectation of Y given X, E[Y |X], and
(ii) The conditional variance of Y given X, Var[Y |X].
Show that
(iii) E[Y] = E(E[Y|X]), and
(iv) Var[Y] = E(Var[Y|X]) + Var(E[Y|X]).
(v) Show also that if E[Y|X] = β0 + β1X for some constants β0 and β1, then
    E[XY] = β0 E[X] + β1 E[X²].
The random variable X denotes the number of leaves on a certain plant at noon on Monday, Y
denotes the number of greenfly on the plant at noon on Tuesday, and Z denotes the number of
ladybirds on the plant at noon on Wednesday.
Suppose that, given X = x, Y has a Poisson distribution with mean µx. If X has a Poisson
distribution with mean λ, show that
E[Y ] = λµ
and
Var[Y ] = λµ(1 + µ),
(you may assume that for a Poisson distribution the mean and variance are equal).
Suppose further that, given Y = y, Z has a Poisson distribution with mean νy. Find E[Z], Var[Z],
and the correlation between X and Z.
From Warwick ST217 exam 1996
7. Using the relationship
       E(E[h(X1)|X2]) = E[h(X1)],
   where
       h(x1) = (x1 − E[X1|x2] + E[X1|x2] − EX1)²,
   prove that
       Var(X1) = E(Var(X1|X2)) + Var(E[X1|X2])
   for any two random variables X1 & X2.
8. Prove that, for any three RVs X, Y and Z for which the various expectations exist,
   (a) X and Y − E(Y|X) are uncorrelated,
   (b) Var(Y − E(Y|X)) = E(Var(Y|X)),
   (c) if X and Y are uncorrelated then E(Cov(X, Y|Z)) = −Cov(E(X|Z), E(Y|Z)),
   (d) Cov(Z, E(Y|Z)) = Cov(Z, Y).
In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies
in the word ‘simplest’. It is really an aesthetic canon such as we find implicit in our criticisms
of poetry or painting.
J. B. S. Haldane
All models are wrong, some models are useful.
G. E. P. Box
A child of five would understand this. Send somebody to fetch a child of five.
Groucho Marx
Chapter 3
The Multivariate Normal Distribution
3.1 Motivation
A Normally distributed RV X ∼ N(µ, σ²) has PDF
    f(x; µ, σ²) = constant × exp( −½ (x − µ)²/σ² )          (3.1)
where
    µ            is the mean of X,
    σ²           is the variance of X, and
    'constant'   is there to make f integrate to 1.
The Normal distribution is important because, by the CLT, as n → ∞, the CDF of an MLE such as
θ̂ = ΣXi/n or θ̂ = Σ(Xi − ΣXj/n)²/n tends uniformly (under reasonable conditions) to the CDF of a
Normal RV with the appropriate mean and variance; i.e. the log-likelihood tends to a quadratic in θ.
Similarly it can be shown that, for a model with parameter vector θ = (θ1, . . . , θp)ᵀ, under reasonable
conditions the log-likelihood will tend to a quadratic in (θ1, . . . , θp).
Therefore, by analogy with Equation 3.1, we will want to define a distribution with PDF
    f(x; µ, V) = constant × exp( −½ (x − µ)ᵀ V⁻¹ (x − µ) )          (3.2)
where
    µ            is a (p × 1) matrix or column vector,
    V            is a (p × p) matrix, and
    'constant'   is again there to make f integrate to 1.
As an example of a PDF of this form, if X1, X2, . . . , Xp ∼IID N(0, 1), then
    f(x) = f1(x1) × f2(x2) × · · · × fp(xp)   by independence
         = (2π)^{−p/2} exp(−½ Σxi²) = (2π)^{−p/2} exp(−½ xᵀx).          (3.3)
Definition 3.1 (Multivariate standard Normal)
The distribution with PDF
    f(z) = f(z1, z2, . . . , zp) = (2π)^{−p/2} exp(−½ zᵀz)
is called the multivariate standard Normal distribution.
The statement 'Z has a multivariate standard Normal distribution' is often written
    Z ∼ N(0, I),   Z ∼ MVN(0, I),   Z ∼ Np(0, I),   or Z ∼ MVNp(0, I),
and the CDF and PDF of Z are often written Φ(z) and φ(z), or Φp(z) and φp(z), respectively.
In the more general case, where the component RVs X1 , X2 , . . . , Xp in Equation 3.2 aren’t independent,
we need an expression for the constant term.
3.2 Digression: Transforming a Random Vector
Exercise 3.1
Suppose that the RVs Z1 , Z2 , . . . , Zn have a continuous joint distribution, with joint PDF fZ (z).
Consider a 1-1 transformation (i.e. a bijection between the corresponding sample spaces) to new RVs
X1 , X2 , . . . , Xn . What is the PDF fX (x) of the transformed RVs? Solution: Because the transformation
is 1-1 we can invert it and write
    Z = u(X),                                          (3.4)
i.e. a given point (z1, . . . , zn) transforms to (x1, . . . , xn), where
    z1 = u1(x1, . . . , xn),
    z2 = u2(x1, . . . , xn),
    . . .
    zn = un(x1, . . . , xn).
Now assume that each function ui(·) is continuous and differentiable.
Then we can form the matrix
    ∂u/∂x = ( ∂ui/∂xj ),   i, j = 1, 2, . . . , n,     (3.5)
[the (n × n) matrix whose (i, j) entry is ∂ui/∂xj] and its determinant J, which is called the Jacobian of
the transformation u [i.e. of the joint transformation (u1, . . . , un)].
Then it can be shown that
    fX(x) = |J| × fZ(z)
at all points in the 'sample space' (i.e. set of possible values) of X.
k
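A one-dimensional sanity check of fX(x) = |J| fZ(z) (an illustrative sketch, not part of the notes), using the map X = e^Z with Z ∼ N(0, 1): here z = u(x) = log x, J = du/dx = 1/x, and the claimed density of X is the standard log-normal density (1/x)φ(log x).

```python
import math
import random

random.seed(6)
phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # N(0,1) PDF

xs = [math.exp(random.gauss(0, 1)) for _ in range(200_000)]

# empirical density of X near x0, from the fraction of points in a window
x0, h = 1.0, 0.05
emp = sum(1 for x in xs if x0 - h < x < x0 + h) / (len(xs) * 2 * h)
pred = (1 / x0) * phi(math.log(x0))   # |J| * f_Z(u(x0))
```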
Figure 3.1: Bivariate Parameter Transformation. [Diagram: an infinitesimal δ1 × δ2 rectangle at z, with
density fZ(z), area δ1δ2 and hence probability content δ1δ2 fZ(z), maps under u⁻¹ to an infinitesimal
parallelogram with vertices u⁻¹(z), u⁻¹(z + δ1), u⁻¹(z + δ2), u⁻¹(z + δ1 + δ2) and area δ1δ2/|J|;
equating probability contents gives density |J| × fZ(z).]
3.3 The Bivariate Normal Distribution
Suppose that Z1 and Z2 are IID with N(0, 1) distributions, i.e. (as in Equation 3.3):
    fZ(z1, z2) = (1/2π) exp(−½(z1² + z2²)).
Now let µ1, µ2 ∈ (−∞, ∞), σ1, σ2 ∈ (0, ∞) & ρ ∈ (−1, 1), and define (as in DeGroot §5.12):
    X1 = σ1 Z1 + µ1,
    X2 = σ2 (ρZ1 + √(1 − ρ²) Z2) + µ2.          (3.6)
Then the Jacobian of the transformation from Z to X is given by
    J = | σ1      0             |  =  √(1 − ρ²) σ1 σ2.
        | ρσ2     √(1 − ρ²) σ2  |
Therefore the Jacobian of the inverse transformation from X to Z is 1/(√(1 − ρ²) σ1 σ2), and the PDF of
X is given by Equations 3.7 & 3.8 below.
Definition 3.2 (Bivariate Normal Distribution)
The continuous bivariate distribution with PDF
    fX(x) = (1/|J|) fZ(z)
          = (1/2π) × 1/(√(1 − ρ²) σ1 σ2) × exp( −Q / (2(1 − ρ²)) ),          (3.7)
where
    Q = ((x1 − µ1)/σ1)² − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)²,          (3.8)
is called the bivariate Normal distribution.
Exercise 3.2
If the RV X = (X1, X2) has PDF given by Equations 3.7 & 3.8, then show by substituting
    v = (x2 − µ2)/σ2   followed by   w = (v − ρ(x1 − µ1)/σ1)/√(1 − ρ²),
or otherwise, that X1 ∼ N(µ1, σ1²).
Hence or otherwise show that the conditional distribution of X1 given X2 = x2 is Normal with mean
µ1 + (ρσ1/σ2)(x2 − µ2) and variance σ1²(1 − ρ²).
k
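The construction (3.6) and the conditional-mean result of Exercise 3.2 can be checked by simulation (a sketch, not part of the notes; the parameter values below are illustrative choices).

```python
import math
import random

random.seed(7)
# Simulate (X1, X2) via Equation 3.6; illustrative parameters.
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.6
pairs = []
for _ in range(200_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = s1 * z1 + mu1
    x2 = s2 * (rho * z1 + math.sqrt(1 - rho ** 2) * z2) + mu2
    pairs.append((x1, x2))

n = len(pairs)
m1 = sum(p[0] for p in pairs) / n
m2 = sum(p[1] for p in pairs) / n
corr = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n / (s1 * s2)

# E[X1 | X2 = x2] = mu1 + (rho s1/s2)(x2 - mu2): condition on X2 near mu2,
# where the conditional mean should be just mu1.
sel = [p[0] for p in pairs if abs(p[1] - mu2) < 0.01]
cond_mean = sum(sel) / len(sel)
```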
Comments
1. It's easy to show (problem 3.4.2, page 31) that EXi = µi, VarXi = σi² and corr(X1, X2) = ρ. This
   suggests that we will be able to write
       X = (X1, X2)ᵀ ∼ MVN(µ, V),
   where
       µ = (µ1, µ2)ᵀ                 is the 'mean vector' of X, and
       V = ( σ1²       ρ σ1σ2 )
           ( ρ σ1σ2    σ2²    )      is the 'variance-covariance matrix' of X.
2. The 'level curves' (i.e. contours in 2-d) of the bivariate Normal PDF are given by Q = constant in
   formula 3.8; i.e. ellipses provided the discriminant is negative:
       (ρ/(σ1σ2))² − 1/(σ1²σ2²) = (ρ² − 1)/(σ1²σ2²) < 0.
   This holds as we are only considering 'nonsingular' bivariate Normal distributions with ρ ≠ ±1.
3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!!
3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!!
Exercise 3.3
Show that the inverse of the variance-covariance matrix
    V = ( σ1²       ρ σ1σ2 )
        ( ρ σ1σ2    σ2²    )
is
    V⁻¹ = 1/(1 − ρ²) ( 1/σ1²       −ρ/σ1σ2 )
                     ( −ρ/σ1σ2     1/σ2²   ).
k
3.4 Problems
1. Suppose that the RVs X1, X2, . . . , Xn have a continuous joint distribution with PDF fX(x), and that
   the RVs Y1, Y2, . . . , Yn are defined by Y = AX, where the (n × n) matrix A is nonsingular. Show
   that the joint density of the Yi s is given by
       fY(y) = (1/|det A|) fX(A⁻¹ y)   for y ∈ ℝⁿ.
   Hence or otherwise show carefully that if X1 and X2 are independent RVs with PDFs f1 and f2
   respectively, then the PDF of Y = X1 + X2 is given by
       fY(y) = ∫_{−∞}^{∞} f1(y − z) f2(z) dz   for −∞ < y < ∞,
   or equivalently by
       fY(y) = ∫_{−∞}^{∞} f1(z) f2(y − z) dz   for −∞ < y < ∞.
   If Xi ∼IID Exp(1), i = 1, 2, then what is the distribution of X1 + X2?
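The final part of problem 1 can be confirmed numerically (a sketch, not part of the notes): the convolution of two Exp(1) densities is f(y) = y e^{−y}, a Gamma(2, 1) density with mean 2 and variance 2.

```python
import random

random.seed(8)
# X1 + X2 with X1, X2 ~ IID Exp(1) should behave like Gamma(2, 1).
N = 200_000
ys = [random.expovariate(1.0) + random.expovariate(1.0) for _ in range(N)]

m = sum(ys) / N                          # Gamma(2,1) mean: 2
v = sum((y - m) ** 2 for y in ys) / N    # Gamma(2,1) variance: 2
```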
2. Suppose that Z1 and Z2 are i.i.d. random variables with standard Normal N(0, 1) distributions.
   Define the random vector (X1, X2) by:
       X1 = µ1 + σ1 Z1,
       X2 = µ2 + σ2 [ρZ1 + √(1 − ρ²) Z2],
   where σ1, σ2 > 0 and −1 ≤ ρ ≤ 1.
   (a) Show that E[X1] = µ1, E[X2] = µ2, Var[X1] = σ1², Var[X2] = σ2², and corr[X1, X2] = ρ.
   (b) Find E[X2|X1] and Var[X2|X1].
   (c) Derive the joint PDF f(x1, x2).
   (d) Find the distribution of [X2|X1]. Hence or otherwise show that two r.v.s. with a joint bivariate
       Normal distribution are independent if and only if they are uncorrelated.
   (e) Now suppose that σ1 = σ2. Show that the RVs Y1 = X1 + X2 and Y2 = X1 − X2 are independent.
3. Suppose that X and Y have the joint density
       fX,Y(x, y) = 1/(2π σX σY √(1 − ρ²))
                    × exp( −1/(2(1 − ρ²)) [ ((x − µX)/σX)² − 2ρ ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] ).
   (a) Show by substituting u = (x − µX)/σX and v = (y − µY)/σY followed by w = (u − ρv)/√(1 − ρ²),
       or otherwise, that fX,Y does indeed integrate to 1.
   (b) Show that the 'joint MGF' MX,Y(s, t) = E exp(sX + tY) is given by
       MX,Y(s, t) = exp( µX s + µY t + ½(σX² s² + 2ρσX σY st + σY² t²) ).
   (c) Show that
       ∂MX,Y/∂s |_{s,t=0} = µX,
       ∂²MX,Y/∂s² |_{s,t=0} = µX² + σX²,   &
       ∂²MX,Y/∂s∂t |_{s,t=0} = µX µY + ρσX σY.
   (d) Guess the formula for the MGF MX(s) of X, where X ∼ MVN(µ, V).
4. Suppose that (X1 , X2 ) have a bivariate Normal distribution. Show that any linear combination
Y = a0 + a1 X1 + a2 X2 has a univariate Normal distribution.
3.5 The Multivariate Normal Distribution
Definition 3.3 (Multivariate Normal distribution)
Let µ = (µ1, µ2, . . . , µp) be a p-vector, and let V be a symmetric positive-definite (p × p) matrix.
Then the multivariate probability density defined by
    fX(x; µ, V) = 1/√((2π)^p |V|) × exp( −½ (x − µ)ᵀ V⁻¹ (x − µ) )          (3.9)
is called a multivariate Normal PDF with mean vector µ and variance-covariance matrix V.
Comments
1. Expression 3.9 is a natural generalisation of the univariate Normal density, with V taking the rôle of
σ 2 in the exponent, and its determinant |V| taking the rôle of σ 2 in the ‘normalising constant’ that
makes the whole thing integrate to 1. Many of the properties of the MVN distribution are guessable
from properties of the univariate Normal distribution—in particular, it’s helpful to think of 3.9 as
‘exponential of a quadratic’.
2. The statement ‘X = (X1 , X2 , . . . , Xp ) has a multivariate Normal distribution with mean vector µ
and variance-covariance matrix V’ may be written
X ∼ N (µ, V),
X ∼ MVN(µ, V),
X ∼ N p (µ, V),
or X ∼ MVNp (µ, V).
3. The mean vector µ is sometimes called just the mean, and the variance-covariance matrix V is
sometimes called the dispersion matrix, or simply the variance matrix or covariance matrix.
4. µ = EX, (or equivalently, componentwise, EXi = µi , i = 1, 2, . . . , p). This fact should be obvious
from the name ‘mean vector’, and can be proved in various ways, e.g. by differentiating a multivariate
generalization of the MGF, or simply by symmetry.
5. V = E[(X − µ)(X − µ)ᵀ] = E(XXᵀ) − µµᵀ; i.e. V is the (p × p) matrix whose (j, k) entry is
       vjk = E(Xj Xk) − µj µk,
   so the diagonal entries E(Xj²) − µj² are the variances and the off-diagonal entries are the
   covariances. Writing V = (vjk), say, it follows that
       V = ( σ1²          ρ12 σ1σ2     · · ·   ρ1p σ1σp )
           ( ρ12 σ1σ2     σ2²          · · ·   ρ2p σ2σp )
           (    ⋮             ⋮          ⋱         ⋮     )
           ( ρ1p σ1σp     ρ2p σ2σp     · · ·   σp²      )          (3.10)
where σi is the standard deviation of Xi and ρij is the correlation between Xi and Xj . Again these
results can be proved using a multivariate generalization of the MGF.
6. The p-dimensional MVNp(µ, V) distribution can therefore be parametrised by—
       p         means µi,
       p         variances σi², and
       ½p(p − 1) correlations ρij
   —a total of ½p(p + 3) parameters.
7. Given n random vectors Xi = (Xi1, Xi2, . . . , Xip) ∼IID MVN(µ, V), i = 1, 2, . . . , n,
   a set of minimal sufficient statistics for the unknown parameters is given by:
       Σ_{i=1}^{n} Xij         j = 1, . . . , p,
       Σ_{i=1}^{n} Xij²        j = 1, . . . , p,
   &   Σ_{i=1}^{n} Xij Xik     j = 2, . . . , p,  k = 1, . . . , (j − 1),          (3.11)
   and MLEs for µ and V are given by:
       µ̂j  = (1/n) Σi Xij,                                       (3.12)
       σ̂j² = (1/n) Σi (Xij − µ̂j)²,                              (3.13)
       ρ̂jk = [(1/n) Σi (Xij − µ̂j)(Xik − µ̂k)] / (σ̂j σ̂k),       (3.14)
   or, in matrix notation,
       µ̂ = (1/n) Σ_{i=1}^{n} Xi,                                 (3.15)
       V̂ = (1/n) Σ_{i=1}^{n} (Xi − µ̂)(Xi − µ̂)ᵀ                 (3.16)
          = (1/n) Σ_{i=1}^{n} Xi Xiᵀ − µ̂ µ̂ᵀ.                    (3.17)
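Equations 3.15–3.17 translate directly into a few lines of code. This sketch (not part of the notes) simulates a bivariate standard-Normal sample with correlation 0.8 and computes the MLEs with plain lists:

```python
import random

random.seed(9)
# Bivariate sample with true mean (0, 0) and true V = [[1, 0.8], [0.8, 1]].
n, rho = 50_000, 0.8
data = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    data.append((z1, rho * z1 + (1 - rho ** 2) ** 0.5 * z2))

# (3.15): mu_hat is the vector of sample means
mu_hat = [sum(x[j] for x in data) / n for j in range(2)]
# (3.16): V_hat averages the centred outer products
V_hat = [[sum((x[j] - mu_hat[j]) * (x[k] - mu_hat[k]) for x in data) / n
          for k in range(2)] for j in range(2)]
```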
8. The fact that V is positive-definite implies various (messy!) constraints on the correlations ρij .
9. Surfaces of constant density form concentric (hyper-)ellipsoids (concentric hyper-spheres in the case
of the standard MVN distribution). In particular, the contours of a bivariate Normal density form
concentric ellipses (or concentric circles for the standard bivariate Normal).
10. It can be proved that all conditional and marginal distributions of a MVN are themselves MVN. The
proof of this important fact is quite straightforward, quite tedious, and mercifully omitted from this
course.
3.6 Distributions Related to the MVN
Because of the CLT, the MVN distribution is important throughout statistics. For example, the joint
distribution of the MLEs θ̂1, θ̂2, . . . , θ̂p of unknown parameters θ1, θ2, . . . , θp will, under reasonable
conditions, tend to a MVN as the size of the sample from which θ̂ = (θ̂1, θ̂2, . . . , θ̂p)ᵀ was calculated
increases. Therefore various distributions arising from the MVN by transformation are also important.
Throughout this Section we shall usually denote independent standard Normal RVs by Zi, i.e.:
    Zi ∼IID N(0, 1),   i = 1, 2, . . . ,
i.e.
    Z = (Z1, Z2, . . . , Zn)ᵀ ∼ MVN(0, I).
Exercise 3.4
Show that if a is a constant (n × 1) column vector, B is a constant nonsingular (n × n) matrix, and Z =
(Z1, Z2, . . . , Zn)ᵀ is a random n-vector with a MVN(0, I) distribution, then Y = a + BZ ∼ MVN(a, BBᵀ).
k
3.6.1 The Chi-squared Distribution
Definition 3.4 (Chi-squared Distribution)
If Zi ∼IID N(0, 1) for i = 1, 2, . . . , n, then the distribution of
    X = Z1² + Z2² + · · · + Zn²
is called a Chi-squared distribution on n degrees of freedom, and we write X ∼ χ2n .
Comments
1. In particular, if Z ∼ N (0, 1), then Z 2 ∼ χ21 .
2. The above construction of the χ²n distribution shows that if X ∼ χ²m, Y ∼ χ²n, and X ⊥⊥ Y,
   then (X + Y) ∼ χ²m+n.
This summation property accounts for the importance and usefulness of the χ2 distribution:
essentially a squared length is split into two orthogonal components, as in Pythagoras’ theorem.
3. If X ∼ χ²n, then the (unmemorable) density of X can be shown to be
       fX(x) = 1/(2^{n/2} Γ(n/2)) x^{(n/2)−1} e^{−x/2}   for x > 0,          (3.18)
   with fX(x) = 0 for x ≤ 0.
Comparing this with the definition of a Gamma distribution (MSA) shows that a Chi-squared distribution on n degrees of freedom is just a Gamma distribution with α = n/2 and β = 1/2 (in the
usual parametrisation).
4. It can be shown that if X ∼ χ2n then EX = n and VarX = 2n.
Note that this implies that E[X/n] = 1 and Var[X/n] = 2/n.
5. The χ2 distributions are positively skewed—for example, χ22 is just an exponential distribution with
mean 2. However, because of the CLT, the χ2n distribution tends (slowly!) to Normality as n → ∞.
6. The PDF 3.18 cannot be integrated analytically except for the special case n = 2. Therefore the
CDFs of χ2n distributions for various n are given in standard Statistical Tables.
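The defining construction makes the moment results of comment 4 easy to check by simulation (a sketch, not part of the notes):

```python
import random

random.seed(10)
# chi^2_5 as a sum of five squared standard Normals: E X = 5, Var X = 10.
n = 5
xs = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(100_000)]

m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
```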
Figure 3.2: Chi-squared distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84%
and 97.5% points (which for N (0, 1) are at −2, −1, 0, 1, 2).
3.6.2 Student's t Distribution
Definition 3.5 (t Distribution)
If Z ∼ N(0, 1), Y ∼ χ²n and Y ⊥⊥ Z, then the distribution of
    X = Z/√(Y/n)
is called a (Student's) t distribution on n degrees of freedom, and we write X ∼ tn.
Comments
1. The shape of the t distribution is like that of a Normal, but with heavier tails (since there is variability
in the denominator of t as well as in the Normally-distributed numerator Z).
However, as n → ∞, the denominator becomes more and more concentrated around 1, so (loosely
speaking!) ‘tn → N (0, 1) as n → ∞’.
2. The (highly unmemorable) PDF of X ∼ tn can be shown to be
       fX(x) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) × (1 + x²/n)^{−(n+1)/2}   for −∞ < x < ∞.          (3.19)
Figure 3.3: t distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5%
points.
3. The t distribution on 1 degree of freedom is also called the Cauchy distribution—note that it arises
   as the distribution of Z1/Z2 where Zi ∼IID N(0, 1).
   The Cauchy distribution is infamous for not having a mean. More generally, only the first n − 1
   moments of the tn distribution exist.
4. Note that if Xi ∼IID N(0, σ²), then the RV
       T = X1 / √( Σ_{i=2}^{n} Xi² / (n − 1) )
   has a tn−1 distribution, and is a measure of the length of X1 compared to the root mean square
   length of the other Xi s; i.e. if X has a spherical MVN(0, σ²I) distribution, then we would expect
   T not to be too large. This is, in effect, how the t distribution usually arises in practice.
5. The PDF 3.19 cannot be integrated analytically in general (exception: n = 1 d.f.). The CDF must
be looked up in Statistical Tables or approximated using a computer.
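The heavy tails of comment 1 show up directly if one builds tn from its definition (an illustrative sketch, not part of the notes): Pr(|T| > 2) is about 0.10 for t5, against roughly 0.05 for t50 (already close to the Normal value 0.046).

```python
import math
import random

random.seed(11)

def t_sample(n):
    # Z / sqrt(Y/n) with Z ~ N(0,1) and Y ~ chi^2_n, independent
    z = random.gauss(0, 1)
    y = sum(random.gauss(0, 1) ** 2 for _ in range(n))
    return z / math.sqrt(y / n)

reps = 100_000
tail5 = sum(1 for _ in range(reps) if abs(t_sample(5)) > 2) / reps
tail50 = sum(1 for _ in range(reps) if abs(t_sample(50)) > 2) / reps
```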
3.6.3 Snedecor's F Distribution
Definition 3.6 (F Distribution)
If Y ∼ χ²m, Z ∼ χ²n and Y ⊥⊥ Z, then the distribution of
    X = (Y/m)/(Z/n)
is called an F distribution on m & n degrees of freedom, and we write X ∼ Fm,n.
Figure 3.4: F distributions for selected d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points.
Comments
1. Note that the numerator Y /m and denominator Z/n of X both have mean 1. Therefore, provided
both m and n are large, X will usually take values around 1.
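Simulating Definition 3.6 shows this concentration directly (an illustrative sketch, not part of the notes; the d.f. 50 & 50 and the window (0.5, 2) are arbitrary choices):

```python
import random

random.seed(2)

def chi2_draw(df: int) -> float:
    # chi-squared_df as a sum of df squared N(0,1) variates
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(m: int, n: int) -> float:
    # Definition 3.6: X = (Y/m) / (Z/n)
    return (chi2_draw(m) / m) / (chi2_draw(n) / n)

xs = [f_draw(50, 50) for _ in range(5000)]
near_one = sum(0.5 < x < 2.0 for x in xs) / len(xs)
print(near_one)  # the bulk of F(50,50) draws lie near 1
```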
2. If X ∼ F_{m,n}, then the (extraordinarily unmemorable) density of X can be shown to be

       f_X(x) = [ Γ((m+n)/2) / ( Γ(m/2) Γ(n/2) ) ] × m^(m/2) n^(n/2) x^((m/2)−1) / (mx + n)^((m+n)/2)    for x > 0,    (3.20)

with f_X(x) = 0 for x ≤ 0.
3.7  Problems

1. Let Z ∼ N(0, 1) & Y = Z², and let φ(·) & Φ(·) denote the PDF & CDF respectively of the standard
Normal N(0, 1) distribution.
(a) Show that F_Y(y) = Φ(√y) − Φ(−√y).
(b) Express f_Y(y) in terms of φ(√y).
(c) Hence show that

        f_Y(y) = (1/√(2π)) y^(−1/2) e^(−y/2)    for y > 0.

(d) Find the MGF of Y.
2. Using Formula 3.18 for the PDF of the χ² distribution, show that if X ∼ χ²_n then the MGF of X is

        M_X(t) = (1 − 2t)^(−n/2).    (3.21)

Deduce that if X ∼ χ²_m & Y ∼ χ²_n with X ⊥⊥ Y, then (X + Y) ∼ χ²_{m+n}.
3. Given Z1, Z2 ∼ IID N(0, 1), what is the probability that the point (Z1, Z2) lies
(a) in the square {(z1, z2) | −1 < z1 < 1 & −1 < z2 < 1},
(b) in the circle {(z1, z2) | (z1² + z2²) < 1}?
4. Let Z = (Z1, Z2, . . . , Zn)^T ∼ MVN_n(0, I), and define Y = (Y1, Y2, . . . , Yn)^T by Y = AZ, where
A = (a_ij) is an n × n orthogonal matrix, i.e. A⁻¹ = A^T.
(a) Show that Σ_{i=1}^n Y_i² = Σ_{i=1}^n Z_i².
(b) Show that Y ∼ MVN_n(0, I).
(c) Show that the following matrix A is orthogonal, where for k = 1, . . . , n−1 row k consists of
k entries equal to 1/√(k(k+1)), then −k/√(k(k+1)), then zeros, and row n has every entry
equal to 1/√n:

            ( 1/√(2×1)       −1/√(2×1)        0              ···      0                  )
            ( 1/√(3×2)        1/√(3×2)       −2/√(3×2)       ···      0                  )
      A  =  (    ⋮               ⋮               ⋮            ⋱        ⋮                  )
            ( 1/√(n(n−1))     1/√(n(n−1))     1/√(n(n−1))    ···    −(n−1)/√(n(n−1))     )
            ( 1/√n            1/√n            1/√n           ···      1/√n               )

(d) With the above definition of A, show that Σ_{i=1}^{n−1} Y_i² = Σ_{i=1}^n (Z_i − Z̄)² and that Y_n = √n Z̄.
(e) Hence show that the RVs Z̄ and Σ_{i=1}^n (Z_i − Z̄)² are independent and have N(0, 1/n) and χ²_{n−1}
distributions respectively.
(f) Hence or otherwise show that if X1, X2, . . . , Xn ∼ IID N(µ, σ²), then the RVs µ̂ = Σ_{i=1}^n X_i/n and
σ̂² = Σ_{i=1}^n (X_i − µ̂)²/n satisfy
    i. µ̂ ⊥⊥ σ̂²,
    ii. µ̂ ∼ N(µ, σ²/n),
    iii. nσ̂²/σ² ∼ χ²_{n−1}.
5. Let Z = (Z1, Z2, . . . , Z_{m+n})^T ∼ MVN_{m+n}(0, I).
(a) Describe the distribution of Y = Z / √( Σ_{i=1}^{m+n} Z_i² ).
(b) Show that the RV X = ( n Σ_{i=1}^m Y_i² ) / ( m Σ_{i=m+1}^{m+n} Y_i² ) has an F_{m,n} distribution.
(c) Hence show that if Y = (Y1, Y2, . . . , Y_{m+n})^T has any continuous spherically symmetric
distribution centred at the origin, then X = ( n Σ_{i=1}^m Y_i² ) / ( m Σ_{i=m+1}^{m+n} Y_i² ) has an F_{m,n}
distribution.
6. Suppose that the Y_i are independent RVs with Poisson distributions: Y_i ∼ Poi(λ_i), i = 1, . . . , k.
(a) Assuming that λ_i is large, what is the approximate distribution of Z_i = (Y_i − λ_i)/√λ_i?
(b) Hence or otherwise show that if all the λ_i s are large, then the RV X = Σ_{i=1}^k (Y_i − λ_i)²/λ_i has
approximately a χ²_k distribution.
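A quick numerical check of part (b) — an illustrative sketch, with k = 5 and all λ_i = 100 chosen arbitrarily; the sample mean of X should sit near E[χ²_5] = 5.

```python
import math
import random

random.seed(3)

def poisson_draw(lam: float) -> int:
    # Knuth's product-of-uniforms Poisson sampler (adequate for moderate lam)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

lams = [100.0] * 5
reps = 2000
xs = [sum((poisson_draw(l) - l) ** 2 / l for l in lams) for _ in range(reps)]
mean_x = sum(xs) / reps
print(mean_x)  # close to k = 5, the mean of a chi-squared_5 distribution
```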
7. Suppose that the RVs O_i have independent Poisson distributions: O_i ∼ Poi(np_i), i = 1, . . . , k, where
Σ_{i=1}^k p_i = 1.
(a) Find E O_i and Var O_i. Hence or otherwise show that E[O_i − np_i] = 0 and Var[O_i − np_i] = np_i.
(b) Define the RV N by N = Σ_{i=1}^k O_i. What is the distribution of N?
(c) Define the RVs E_i = N p_i, i = 1, . . . , k. Show that E E_i = np_i and Var E_i = n p_i².
(d) By writing E[O1 E1] = p1 E[O1²] + E[O1 Σ_{i=2}^k O_i], or otherwise, show that Cov(O1, E1) = n p1².
(e) Deduce that the RV (O_i − E_i) has mean 0 and variance np_i(1 − p_i) for i = 1, . . . , k.
8. Suppose that X has a χ2n distribution with PDF given by Formula 3.18. Find the mean, mode &
variance of X, and an approximate variance-stabilising transformation.
9. Suppose that Z_i ∼ IID N(0, 1), i = 1, 2, . . . What is the distribution of the following RVs?

(a) X1 = Z1 + Z2 − Z3
(b) X2 = (Z1 + Z2) / (Z1 − Z2)
(c) X3 = (Z1 − Z2)² / (Z1 + Z2)²
(d) X4 = [ (Z1 + Z2)² + (Z1 − Z2)² ] / 2
(e) X5 = 2Z1 / √( Z2² + Z3² + Z4² + Z5² )
(f) X6 = (Z1 + Z2 + Z3) / √( Z4² + Z5² + Z6² )
(g) X7 = 3(Z1 + Z2 + Z3 + Z4)² / [ (Z1 + Z2 − Z3 − Z4)² + (Z1 − Z2 + Z3 − Z4)² + (Z1 − Z2 − Z3 + Z4)² ]
(h) X8 = 2Z1² + (Z2 + Z3 + Z4 + Z5)²
10. For each of the RVs Xi defined in the previous question, use Statistical Tables to find ci (i = 1 . . . 8)
such that Pr(Xi > ci ) = 0.95.
11. Let z(m, n, P) denote the P% point of the F_{m,n} distribution. Without looking in statistical tables,
what can you say about the relationships between the following values:
(a) z(2, 2, 50) and z(20, 20, 50),
(b) z(2, 20, 50) and z(20, 2, 50),
(c) z(2, 20, 16) and z(20, 2, 84),
(d) z(20, 20, 2.5) and z(20, 20, 97.5)?
12. Show that the PDFs of the t and F distributions (definitions 3.5 & 3.6) are indeed given by formulae
3.19 & 3.20.
13. (a) Define the Standard Multivariate Normal distribution MVN(0, I).
(b) Given Z = (Z1, Z2, . . . , Z_{m+n})^T ∼ MVN(0, I), write down transformed random variables X(Z),
T(Z) and Y(Z) with the following distributions:
    i. X ∼ χ²_n,
    ii. T ∼ t_n,
    iii. Y ∼ F_{m,n}.
(c) Given that the PDF of X ∼ χ²_n is

        f_X(x) = x^((n/2)−1) e^(−x/2) / ( 2^(n/2) Γ(n/2) )    for x > 0,

and f_X(x) = 0 elsewhere, show that
    i. E[X] = n,
    ii. E[X²] = n² + 2n, and
    iii. E[1/X] = 1/(n − 2) (provided n > 2).
(d) Hence or otherwise find
    i. the variance σ²_X of X ∼ χ²_n,
    ii. the mean µ_Y of Y ∼ F_{m,n} and
    iii. the mean µ_T and variance σ²_T of T ∼ t_n,
stating under what conditions σ²_X, µ_Y, µ_T and σ²_T exist.
From Warwick ST217 exam 1998
Theory is often just practice with the hard bits left out.
J. M. Robson
Get a bunch of those 3–D glasses and wear them at the same time. Use enough to get it up to
a good, say, 10– or 12–D.
Rod Schmidt
The Normal . . . is the Ordinary made beautiful; it is also the Average made lethal.
Peter Shaffer
Symmetry, as wide or as narrow as you define is meaning, is one idea by which man through
the ages has tried to comprehend and create order, beauty and perfection.
Hermann Weyl
Chapter 4

Inference for Multiparameter Models

4.1  Introduction: General Concepts

4.1.1  Modelling
Given a random vector X = (X1 , X2 , . . . , Xp ), we can describe the joint distribution of the Xi s by the
CDF FX (x) or, usually more conveniently, by the PMF or PDF fX (x).
Interrelationships between the X_i s can be described using
1. marginals F_i(x_i), f_i(x_i), F_ij(x_i, x_j), etc.,
2. conditionals G_i(x_i | x_j, j ≠ i), g_i(x_i | x_j, j ≠ i), G_ij(x_i, x_j | x_k, k ≠ i, j), etc.,
3. conditional expectations E[X_i | X_j], Var[X_i | X_j], etc.
Often F_X(x) is assumed to lie in a family of probability distributions:

    F = { F(x|θ) | θ ∈ Ω_Θ }    (4.1)

where Ω_Θ is the ‘parameter space’.
The process of formulating, choosing within, & checking the reasonableness of, such families F, is called
statistical modelling (or probability modelling, or just modelling).
Exercise 4.1
The data-set in Table 4.1, plotted in Figure 1.1 (page 2), shows patients’ blood pressures before and after
treatment. Suggest some reasonable models for the data.
k
4.1.2  Data

In practice, we typically have a set of data in which d variables are measured on each of n ‘cases’ (or
‘individuals’ or ‘units’):

                 var.1   var.2   ···   var.d
       case.1     x11     x12    ···    x1d
  D =  case.2     x21     x22    ···    x2d                                    (4.2)
         ⋮         ⋮       ⋮      ⋱      ⋮
       case.n     xn1     xn2    ···    xnd
    Patient          Systolic                    Diastolic
    Number    before   after   change     before   after   change
       1        210     201      -9         130     125      -5
       2        169     165      -4         122     121      -1
       3        187     166     -21         124     121      -3
       4        160     157      -3         104     106       2
       5        167     147     -20         112     101     -11
       6        176     145     -31         101      85     -16
       7        185     168     -17         121      98     -23
       8        206     180     -26         124     105     -19
       9        173     147     -26         115     103     -12
      10        146     136     -10         102      98      -4
      11        174     151     -23          98      90      -8
      12        201     168     -33         119      98     -21
      13        198     179     -19         106     110       4
      14        148     129     -19         107     103      -4
      15        154     131     -23         100      82     -18
Table 4.1: Supine systolic and diastolic blood pressures of 15 patients with moderate
hypertension (high blood pressure), immediately before and 2 hours after taking 25mg
of the drug captopril.
Data from HSDS, set 72
Definition 4.1 (Data Matrix)
A set of data D arranged in the form of 4.2 is called a data matrix or a cases-by-variables array.
The data-set D is assumed to be a representative sample (of size n) from an underlying population of
potential cases. This population may be actual, e.g. the resident population of England & Wales at noon
on June 30th 1993, or purely theoretical/hypothetical, e.g. MVN(µ, V).
Exercise 4.2
Table 4.2 presents data on ten asthmatic subjects, each tested with 4 drugs. Describe various ways that
the data might be set out as a data matrix for analysis by a statistical computing package.
k
4.1.3  Statistical Inference
Statistical inference is the art/science of using the sample to learn about the population (and hence,
implicitly, about future samples).
Typically we use statistics (properties of the sample)
to learn about parameters (properties of the population).
This activity might be:
1. Part of analysing a formal probability model,
e.g. calculating the MLE θ̂ of θ, after making an assumption as in Expression 4.1, or
2. Purely to summarise the data as a part of ‘data analysis’ (Section 4.2).
For example, given X1, X2, . . . , Xn ∼ IID F_X (unknown), the statistics

    S1 = (1/n) Σ X_i = X̄,    S2 = (1/n) Σ (X_i − X̄)²,    S3 = (1/n) Σ (X_i − X̄)³
                                            Patient number
    Drug    Time         1     2     3     4     5     6     7      8     9    10
    P       −5 mins    0.0   2.3   2.4   1.9   1.6   4.8   0.6    2.7   0.9   1.3
            +15 mins   3.8   9.2   5.4   3.3   4.2  15.1   1.3    6.7   4.2   3.1
    C       −5 mins    0.5   1.0   2.0   1.1   2.1   6.8   0.6    3.1   1.5   3.0
            +15 mins   2.0   5.3   7.5   6.4   4.1   9.1   0.6   14.8   2.4   2.3
    D       −5 mins    0.8   2.3   0.8   0.8   1.2   9.6   1.1    9.7   0.8   4.9
            +15 mins   2.4   4.8   2.4   1.9   1.2  12.5   1.7   12.5   4.3   8.1
    K       −5 mins    0.2   1.7   2.2   0.1   1.7   9.2   0.6   12.7   1.1   2.8
            +15 mins   0.4   3.4   2.0   1.3   3.4   6.7   1.1   12.5   2.7   5.7
Table 4.2: NCF (Neutrophil Chemotactic Factor) of ten individuals, each tested with
4 drugs: P (Placebo), C (Clemastine), D (DSCG), K (Ketotifen). On a given day,
an individual was administered the chosen drug, and his NCF measured 5 minutes
before, and 15 minutes after, being given a ‘challenge’ of allergen.
Data from Dr. R. Morgan of Bart’s Hospital
provide measures of location, scale and skewness.
Note that here we’re implicitly estimating the corresponding population quantities

    µ_X = E X,    E[(X − µ_X)²],    E[(X − µ_X)³],

and using these as measures of population location, scale and skewness. Without a formal probability
model, it can be hard to judge whether these or some other measures may be most appropriate.
In both cases, the CLT & its generalisations (to higher dimensions and to ‘near-independence’) show that,
under reasonable conditions, the joint distribution of the statistics of interest, such as θ̂ or (S1, S2, S3), is
approximately MVN. This approximation improves if
1. the sample size n → ∞, and/or
2. the joint distribution of the random variables being summed (e.g. the original random vectors
X1, X2, . . . , Xn) is itself close to MVN.
QUESTIONS: How should we interpret this? How should we try to link probability models to reality?
4.2  Data Analysis
Data analysis is the art of summarising data while attempting to avoid probability theory.
For example, you can calculate summary statistics such as means, medians, modes, ranges, standard
deviations etc., thus summarising in a few numbers the main features of a possibly huge data-set. For
example, the (0%, 25%, 50%, 75%, 100%) points of the data distribution (i.e. minimum, lower quartile,
median, upper quartile and maximum) form the five-number summary, and the inter-quartile range (IQR =
upper quartile − lower quartile) is a measure of spread, containing the ‘middle 50%’ of the data.
These summaries can be formalised as follows:
Definition 4.2 (Order statistics)
Given RVs X1 , X2 , . . . , Xn , one can order them and denote the smallest of the Xi s by X(1) , the second
smallest by X(2) , etc. Then X(k) is called the kth order statistic.
Thus X(1) , X(2) , . . . , X(n) are a permutation of X1 , X2 , . . . , Xn , and x(n) , the observed value of X(n) , denotes
the largest observed value in a sample of size n.
Given ordered data x_(1) ≤ x_(2) ≤ · · · ≤ x_(n), one can define:

Definition 4.3 (Sample median)

    x_M = x_((n+1)/2)                      if n is odd,
    x_M = ½ x_(n/2) + ½ x_(n/2+1)          if n is even.

We can always write x_M = x_(n/2 + 1/2), provided we adopt the following convention:
1. If the number in brackets is exactly half-way between two integers, then take the average of the two
corresponding order statistics.
2. Otherwise round the bracketed subscript to the nearest integer, and take the corresponding order
statistic.
Similarly the quartiles etc. can be formally defined as follows:

Definition 4.4 (Sample lower quartile)
    x_L = x_(n/4 + 1/2).

Definition 4.5 (Sample upper quartile)
    x_U = x_(3n/4 + 1/2).

Definition 4.6 (100p th sample percentile)
    x_100p% = x_(pn + 1/2).

Definition 4.7 (Five number summary)
    ( x_(1), x_L, x_M, x_U, x_(n) ).
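Definitions 4.3–4.7, together with the rounding convention above, translate into a few lines of code; a minimal sketch (illustrative, not from the notes):

```python
def order_stat(xs, k: float):
    """x_(k), using the notes' convention: average the two neighbouring order
    statistics when k is exactly half-way between integers, else round k."""
    s = sorted(xs)
    if abs(k - int(k)) == 0.5:
        return 0.5 * (s[int(k) - 1] + s[int(k)])
    return s[round(k) - 1]

def five_number_summary(xs):
    n = len(xs)
    return (min(xs),
            order_stat(xs, n / 4 + 0.5),         # x_L, Definition 4.4
            order_stat(xs, n / 2 + 0.5),         # x_M, Definition 4.3
            order_stat(xs, 3 * n / 4 + 0.5),     # x_U, Definition 4.5
            max(xs))

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))  # (1, 2.5, 4.5, 6.5, 8)
```

With n = 8, all three bracketed subscripts fall exactly half-way between integers, so each quartile is an average of two order statistics.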
4.3  Classical Inference

4.3.1  Introduction

In ‘classical statistical inference’, the typical procedure is:
1. Choose a family F of models indexed by θ (Formula 4.1).
2. Assume temporarily that the true distribution lies in F,
i.e. data D ∼ F(d|θ) for some true but unknown parameter vector θ ∈ Ω_Θ.
3. Compare possible models according to some criterion of compatibility between the model & the data
(equivalently, between the population & the sample).
4. Assess the chosen model(s), and go back to step (1) or (2) if the model proves inadequate.
Comments
1. Step 1 is a compromise between
(a) what we believe is the true underlying mechanism that produced the data, and
(b) what we can do mathematically.
If in doubt, keep it simple.
2. Step 2, by assuming a true θ exists, implicitly interprets probability as a property of Nature
e.g. a ‘fair’ coin is assumed to have an intrinsic property: if you toss it n times, then the proportion
of ‘heads’ tends to 1/2 as n → ∞.
Thus probability represents a ‘long-run relative frequency’.
3. Most statistical computer packages currently use the classical approach, and we’ll mainly be using
classical inference in MSB.
4. There are many possible criteria at step 3. For example, hypothesis-testing and likelihood approaches
are both discussed briefly below.
4.3.2  Point Estimation (Univariate)

Given RVs X = (X1, X2, . . . , Xn), a point estimator for an unknown parameter Θ ∈ Ω_Θ is simply a function
Θ̂(X) taking values in the parameter space Ω_Θ. Once data X = x are obtained, one can calculate the
corresponding point estimate θ̂ = Θ̂(x).
There are many plausible criteria for Θ̂ to be considered a ‘good’ estimator of Θ. For example:

1. Mean Squared Error
One would like the mean squared error (MSE) of Θ̂ to be small whatever the true value θ of Θ, where

    MSE(Θ̂) = E[(Θ̂ − θ)²].    (4.3)

In particular, an estimator Θ̂ has minimum mean squared error if

    MSE(Θ̂) = min_{Θ̂′} MSE(Θ̂′).

2. Unbiasedness

Definition 4.8 (Bias)
The bias of an estimator Θ̂ is

    Bias(Θ̂) = E[Θ̂ − θ | Θ = θ].    (4.4)

Exercise 4.3
Show that MSE(Θ̂) = Var(Θ̂) + (Bias Θ̂)².
k

Definition 4.9 (Unbiasedness)
An estimator Θ̂ for a parameter Θ is called unbiased if E[Θ̂ | Θ = θ] = θ for all possible true
values θ of Θ.
Example
Given a random sample X1, X2, . . . , Xn, i.e. X_i ∼ IID F_X(x), where F_X is a member of some family F
of probability distributions,
(a) X̄ = Σ_{i=1}^n X_i/n is an unbiased estimate of the mean µ_X = E X of F_X.
(b) More generally, any statistic of the form Σ_{i=1}^n w_i X_i, where Σ_{i=1}^n w_i = 1, is an unbiased estimate
of µ_X.
(c) σ̂₁² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) is an unbiased estimate of the variance σ²_X of F_X, but
(d) σ̂₂² = Σ_{i=1}^n (X_i − X̄)²/n is NOT an unbiased estimate of the variance σ²_X of F_X.
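Points (c) and (d) are easy to check by simulation; an illustrative sketch (the choices σ² = 4 and n = 5 are arbitrary):

```python
import random

random.seed(4)

n, reps = 5, 40000
sum1 = sum2 = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]   # true variance is 4
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sum1 += ss / (n - 1)   # divisor n-1: unbiased
    sum2 += ss / n         # divisor n: biased downwards by the factor (n-1)/n
mean1, mean2 = sum1 / reps, sum2 / reps
print(mean1, mean2)  # roughly 4.0 and 3.2
```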
3. Efficiency & Minimum Variance Unbiased Estimation
Given two unbiased estimators Θ̂1 & Θ̂2 for a parameter Θ, the efficiency of Θ̂1 relative to Θ̂2 is
defined by

    Eff(Θ̂1, Θ̂2) = Var(Θ̂1) / Var(Θ̂2).    (4.5)

Definition 4.10 (MVUE)
The Minimum Variance Unbiased Estimator of a parameter Θ is the unbiased estimator Θ̂, out
of all possible unbiased estimators, that has minimum variance.
Example
Given X_i ∼ IID F_X(x) ∈ F, the family of all probability distributions with finite mean & variance, it can
be shown that
(a) X̄ is the MVUE of the mean µ_X = E X of F_X, and
(b) Σ_{i=1}^n (X_i − X̄)²/(n − 1) is the MVUE of the variance σ²_X of F_X.
Note that there are major problems with using MVUE as a criterion for estimation:
(a) The MVUE may not exist (e.g. in general there is no unbiased estimator for the underlying
standard deviation σX of X).
(b) The MVUE may exist but be nonsensical (see Problems).
(c) Even if the MVUE exists and appears reasonable, other (biased) estimators may be better by
other criteria, for example by having smaller mean squared error, which is much more important
in practice than being unbiased.
4. Consistency

Definition 4.11 (Consistency)
A sequence of estimators Θ̂1, Θ̂2, . . . is consistent for Θ ∈ Ω_Θ if, for all ε > 0 and for all θ ∈ Ω_Θ,

    lim_{n→∞} Pr( |Θ̂_n − θ| > ε | Θ = θ ) = 0.

5. Sufficiency
Θ̂(X1, . . . , Xn) is sufficient for Θ if the conditional distribution of (X1, . . . , Xn) given Θ̂ = θ̂ & Θ = θ
does not depend on θ. See MSA.

6. Maximum likelihood
See MSA.

7. Invariance
See Casella & Berger, page 300.
8. The ‘plug in’ property
If θ is a specified property of the CDF F(x), then θ̂ is the corresponding property of the empirical
CDF

    F̂(x) = (1/n) × (number of X_i ≤ x).    (4.6)

For example (assuming the named quantities exist):
(a) the sample mean θ̂ = x̄ = Σ x_i/n is the plug-in estimate of the population mean θ = E X,
(b) the sample median is the plug-in estimate of the population median F⁻¹(0.5).
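Formula 4.6 and the plug-in property in a few lines (an illustrative sketch; the data are made up):

```python
def ecdf(xs):
    """Empirical CDF of Formula 4.6: F-hat(x) = #{X_i <= x} / n."""
    n = len(xs)
    return lambda x: sum(xi <= x for xi in xs) / n

data = [3, 1, 4, 1, 5]
F_hat = ecdf(data)
print(F_hat(1), F_hat(3.5))  # 0.4 0.6

# Plug-in median: the smallest data value x with F-hat(x) >= 0.5
plug_in_median = min(x for x in data if F_hat(x) >= 0.5)
print(plug_in_median)  # 3
```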
4.3.3  Hypothesis Testing (Introduction)
In this approach you
1. Choose a statistic T that has a known distribution F0 (t) if the true parameter value is θ = θ 0 (for
some particular parameter value θ 0 of interest). The statistic T should provide a measure of the
discrepancy of the data D from what would be reasonable if θ = θ 0 .
2. Test the hypothesis ‘θ = θ 0 ’ using the tail probabilities of F0 .
An example is the ‘Chi-squared’ statistic used in MSA. Hypothesis testing will be covered in more detail
in chapter 5.
Some problems with the standard hypothesis testing approach are:
1. In practice, we don’t really believe that θ = θ 0 is ‘true’ and all other possible values of θ are ‘false’;
instead we just wish to adopt ‘θ = θ 0 ’ as a convenient assumption, because it’s as good as, and
simpler than, other models.
2. If we really do want to make a decision [e.g. to give drug ‘A’ or drug ‘B’ to a particular patient],
then we should weigh up the possible consequences.
3. It’s hard to create appropriate hypothesis tests in complex situations, such as to test whether or not
θ lies in a particular subset Ω0 of the parameter space Ω.
Unfortunately, real life is a complex situation.
4.3.4  Likelihood Methods

Use the likelihood function

    L(θ; D) = Pr(D|θ)                     (discrete case)
            = (constant) × f(D|θ)         (continuous case),    (4.7)

or equivalently the log-likelihood or ‘support’ function

    ℓ(θ; D) = log L(θ; D)    (4.8)

as a measure of the compatibility between data D and parameter θ.
In particular, the MLE corresponds to the particular F_θ̂ ∈ F that is most compatible with the data D.
Likelihood underlies the most useful general approaches to statistics:
1. It can handle several parameters simultaneously.
2. The CLT implies that in many cases the log-likelihood will be approximately quadratic in θ (at least
near the MLE).
This makes both theory and numerical computation easier.
However, there are difficulties with basing inference solely on likelihood:
1. How should we handle ‘nuisance parameters’ (i.e. components θi that we’re not interested in)?
Note that it makes no sense to integrate over values of θi to get a ‘marginal likelihood’ for the other
θj s, since L(θ; d) is NOT a probability density or probability function—we would get a different
marginal likelihood if we reparametrised say by θi 7→ log θi .
2. A more fundamental problem is that likelihood takes no account of how far-fetched the model might
be (‘high likelihood’ does NOT mean ‘likely’ !)
This suggests that in practice we may wish to incorporate information not contained in the likelihood:
1. Prior information/Expert opinion: Are there external reasons for doubting some values of θ more
than others?
2. For decision-making: How relatively important are the possible consequences of our inferences?
[e.g. an innocent person is punished / a murderer walks free].
4.4  Problems

1. How might the mortality data in Tables 1.1 and 1.2 (pages 8 & 9) be set out as a data matrix?

2. Suppose that θ̂(X1, . . . , Xn) is unbiased. Show that θ̂ is consistent iff lim_{n→∞} Var(θ̂(X1, . . . , Xn)) = 0.

3. Given X_i ∼ IID F_X(x), where F_X is a member of some family F of probability distributions, show that
(a) Any statistic of the form Σ_{i=1}^n w_i X_i, where Σ_{i=1}^n w_i = 1, is an unbiased estimate of µ_X = E X,
(b) The mean X̄ = Σ_{i=1}^n X_i/n is the unique minimum-variance estimator of this form,
(c) σ̂² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) is an unbiased estimate of the variance σ²_X of F_X.
4. The number of mistakes made each lecture by a certain lecturer follows independent Poisson distributions, each with mean λ > 0.
You decide to attend the Monday lecture, note the number of mistakes X, and use X to estimate
the probability p that there will be no mistakes in the remaining two lectures that week.
(a) Show that p = exp(−2λ).
(b) Show that the only unbiased estimator of p (and hence, trivially, the MVUE) is

        p̂ = +1 if X is even,   −1 if X is odd.

(c) What is the maximum likelihood estimator of p?
(d) Discuss (briefly) the relative merits of the MLE and the MVUE in this case.
5. Let T be an unbiased estimator for g(θ), let S be a sufficient statistic for θ, and let φ(S) = E[T |S].
Prove the Rao-Blackwell theorem:
φ(S) is also an unbiased estimator of g(θ), and Var[φ(S)|θ] ≤ Var[T |θ], for all θ,
and interpret this result.
6. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ.
(b) Show, using moment generating functions or otherwise, that if X1 & X2 have independent
Poisson distributions with means λ1 & λ2 respectively, then their sum (X1 + X2 ) follows a
Poisson distribution with mean (λ1 + λ2 ).
(c) A particular sports game comprises four ‘quarters’, each lasting 15 minutes, and a statistician
attending the game wishes to predict the probability p that no further goals will be scored before
full time.
The statistician assumes that the numbers Xk of goals scored in the kth quarter follow independent Poisson distributions, each with (unknown) mean λ, so that
        Pr(X_k = x) = (λ^x / x!) e^(−λ)    (k = 1, 2, 3, 4;  x = 0, 1, 2, . . .).

Suppose that the statistician makes his prediction halfway through the match (i.e. after observing
X1 = x1 & X2 = x2). Show that an unbiased estimator of p is

        T = 1 if (x1 + x2) = 0,   0 otherwise.

(d) Suppose the statistician also made a prediction after 15 minutes. Show that in this case the
ONLY unbiased estimator of p given X1 = x1 is

        T = 2^{x1} if x1 is even,   −2^{x1} if x1 is odd.
(e) What are the maximum likelihood estimators of p after 15 and after 30 minutes?
(f) Briefly compare the advantages of maximum likelihood and unbiased estimation for this situation.
From Warwick ST217 exam 1997
7. (a) Explain what is meant by a minimum variance unbiased estimator (MVUE).
(b) Let X and Y be random variables. Write down (without proof) expressions relating E[Y ] and
Var[Y ] to the conditional moments E[Y |X] and Var[Y |X].
(c) Let S be a sufficient statistic for a parameter θ, let T be an unbiased estimator for τ (θ), and
define W = E[T |S]. Show that
i. W is an unbiased estimator for τ (θ), and
ii. Var[W ] ≤ Var[T ] for all θ.
Deduce that a MVUE, if one exists, must be a function of a sufficient statistic.
(d) Let X1, X2, . . . , Xn be IID Bernoulli random variables, i.e.

        Pr(X_i = 1) = θ,   Pr(X_i = 0) = 1 − θ,    i = 1, 2, . . . , n.

    i. Show that S = Σ_{i=1}^n X_i is a sufficient statistic for θ.
    ii. Define T by

        T = 1 if X1 = 1 and X2 = 0,   0 otherwise.

    What is E[T]?
iii. Find E[T |S], and hence show that S(n − S)/(n − 1) is an MVUE of Var[S] = nθ(1 − θ).
From Warwick ST217 exam 1999
8. Given X_i ∼ IID Poi(θ), compare the following possible estimators for θ in terms of unbiasedness,
consistency, relative efficiency, etc.

    θ̂1  = X̄ = (1/n) Σ_{k=1}^n X_k,
    θ̂2  = (1/n) ( 100 + Σ_{k=1}^n X_k ),
    θ̂3  = ½ (X2 − X1)²,
    θ̂4  = (1/n) Σ_{k=1}^n (X_k − X̄)²,
    θ̂5  = (1/(n−1)) Σ_{k=1}^n (X_k − X̄)²,
    θ̂6  = (θ̂1 + θ̂5)/2,
    θ̂7  = median(X1, X2, . . . , Xn),
    θ̂8  = mode(X1, X2, . . . , Xn),
    θ̂9  = (2/(n(n+1))) Σ_{k=1}^n k X_k,
    θ̂10 = (1/(n−1)) Σ_{k=2}^n X_k.

9. [Light relief]
Discuss the following possible defence submission at a murder trial:
‘The supposed DNA match placing the defendant at the scene of the crime would have arisen with
even higher probability if the defendant had a secret identical twin
[the more people with that DNA, the more chances of getting a match at the crime scene].
‘Now assume that my client has been cloned θ times, θ ∈ {0, 1, . . . , n} for some n > 0. Clearly the
larger the value of θ, the higher the probability of obtaining the observed DNA results
[every increase in θ means another clone who might have been at the scene of the crime].
‘Therefore the MLE of θ is n.
‘But then, even assuming somebody with my client’s DNA committed this terrible crime, the probability that it was my client is only 1/(n + 1) (under reasonable assumptions).
‘Therefore you cannot say that my client is, beyond a reasonable doubt, guilty.
‘The defence rests.’
4.5  Bayesian Inference

4.5.1  Introduction
Classical inference regards probability as a property of physical objects (e.g. a ‘fair coin’).
An alternative interpretation uses probability to represent an individual’s (lack of) understanding of an
uncertain situation.
Examples
1. ‘I have no reason to suspect that “heads” or “tails” are more likely. Therefore, by symmetry, my
current probability for this particular coin’s coming down “heads” is 1/2.’
2. ‘I doubt the accused has any previously-unknown identical siblings. I’d bet 100,000 to 1 against’
(i.e. if θ is the number of identical siblings, then my probability for θ > 0 is 1/100001).
Different people, with different knowledge, can legitimately have different probabilities for real-world events
(therefore it’s good discipline to say ‘my probability for. . . ’ rather than ‘the probability of. . . ’).
As you learn, your probabilities can be continually updated using Bayes’ theorem, i.e.

    Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B)    (4.9)

assuming Pr(B) is positive, and using the fact that Pr(A & B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A).
The Bayesian approach to statistical inference treats all uncertainty via probability, as follows:
1. You have a probability model for the data, with PMF p(D|Θ).
2. Your prior PMF for Θ (i.e. your PMF for Θ based on a combination of expert opinion, previous
experience, and your own prejudice), is p(θ).
3. Then Bayes’ theorem says

    p(θ|D) = p(D|θ) p(θ) / p(D)

or, since once the data have been obtained p(D) is a constant,

    p(θ|D) ∝ p(D|θ) p(θ) ∝ L(θ; D) p(θ),

i.e. ‘posterior probability’ ∝ ‘likelihood’ × ‘prior’.    (4.10)

Formula 4.10 also applies in the continuous case, in which case p(·) represents a PDF.
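A tiny discrete illustration of Formula 4.10 (the grid of θ values, the uniform prior, and the data — 7 heads in 10 tosses of a coin with P(heads) = θ — are all made up for the sketch):

```python
from math import comb

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]   # grid of candidate values of theta
prior = [0.2] * 5                    # uniform prior over the grid

def likelihood(theta, heads=7, n=10):
    # Binomial PMF: probability of the observed data given theta
    return comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]   # likelihood x prior
posterior = [u / sum(unnorm) for u in unnorm]                 # normalise
print([round(p, 3) for p in posterior])
```

With a uniform prior the posterior simply re-normalises the likelihood, so it peaks at the grid value nearest the observed proportion 0.7.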
Comments
1. Further applications to decision theory are given in the third year course ST301.
2. Note that if θ = (θ1 , θ2 , . . . , θp ), then p(θ|D) is a p-dimensional function, and may prove difficult to
manipulate, summarise or visualise.
3. Treating all uncertainty via probability has the advantage that one-off events (e.g. management
decisions, or the results of horse races) can be handled. However, it’s not at all obvious that all
uncertainty can be treated via probability!
4. As with Classical inference, a Bayesian analysis of a problem should involve checking whether the
assumptions underlying p(D|θ) and p(θ) are reasonable, and rethinking & reanalysing the model if
necessary.
Exercise 4.4
Describe the Bayesian approach to statistical inference, denoting the data by x, the prior by fΘ (θ), and
the likelihood by L(θ; x) = fX|Θ (x|θ).
k
4.6  Nonparametric Methods

Standard Classical and Bayesian methods make strong assumptions, e.g. X_i ∼ IID F(x|θ) for some θ ∈ Ω.
Assumptions of independence are critical (what aspects of the problem provide information about other
aspects?)
Assumptions about the form of probability distributions are often less important, at least provided the
sample size n is large. However, there are exceptions to this:
1. It might be that the probability distribution encountered in practice is fundamentally different from
the form assumed in our model. For example, some probability distributions are so ‘heavy-tailed’
that their means don’t exist (e.g. the Cauchy distribution with f(x) = 1/[π(1 + x²)], x ∈ R).
2. Some data may be recorded incorrectly, or there may be a few atypically large/small data values
(‘outliers’), etc.
3. In any case, what if n is small and the CLT can’t be invoked?
‘Nonparametric’ methods don’t assume that the actual probability distribution F (·|θ) lies in a particular
parametric family F; instead they make more general assumptions, for example
1. ‘F (x) is symmetric about some unknown value Θ’.
Note that this may be a reasonable assumption even if EX doesn’t exist.
Θ is the (unknown) median of the population, i.e. Pr(X < Θ) = Pr(X > Θ).
Therefore one could estimate Θ by the median of the data (though better methods may exist).
2. ‘F(x, y) is such that if (X_i, Y_i) ∼ IID F, (i = 1, 2), then Pr(Y1 < Y2 | X1 < X2) = 1/2.’
This is a nonparametric version of the statement ‘X & Y are uncorrelated’.
Many statistical methods involve estimating means, as we’ll see in the rest of the course (t-tests, linear
regression, many MLEs etc.)
Corresponding nonparametric methods typically involve medians—or equivalently, various probabilities.
Exercise 4.5
Suppose that X has a continuous distribution. Show that a test of the statement ‘median of X is θ0 ’ is
equivalent to a test of the statement ‘Pr(X < θ0 ) = 1/2’.
If Xi are IID, what is the distribution of R = (number of Xi < θT ), where θT is the true value of θ?
k
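For IID sampling from a continuous distribution, R ∼ Binomial(n, 1/2) when θ0 is the true median, which is the basis of the ‘sign test’. A minimal sketch (illustrative, not from the notes; the example data are invented):

```python
from math import comb

def sign_test_pvalue(xs, theta0):
    """Two-sided sign test of 'median = theta0' via R = #{X_i < theta0},
    which is Binomial(n, 1/2) under the hypothesis."""
    n = len(xs)
    r = sum(x < theta0 for x in xs)
    cdf = lambda k: sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    tail = min(cdf(r), 1 - cdf(r - 1))   # smaller of the two tail probabilities
    return min(1.0, 2 * tail)

# Only 1 of 10 observations falls below 5, so 'median = 5' looks doubtful:
print(sign_test_pvalue([3, 7, 8, 9, 12, 14, 15, 17, 20, 21], 5))
```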
Other nonparametric methods involve ranking the data Xi : replacing the smallest Xi by 1, the next
smallest by 2, etc. Classical statistical methods can then be applied to the ranks. Note that the effect of
outliers will be reduced.
Example
Given data (Xi , Yi ), i = 1, . . . , n from a continuous bivariate distribution, ‘Spearman’s rank correlation’
(often written ρS ) can be calculated as follows:
1. replace the Xi values by their ranks Ri ,
2. similarly replace the Yi values by their ranks Si ,
3. calculate the usual (‘product-moment’ or ‘Pearson’s’) correlation between the Ri s and Si s.
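The three steps above, in code — a sketch (illustrative, not from the notes), with tied values given averaged ranks:

```python
def ranks(xs):
    """Rank each value 1..n; tied values share the average of their ranks."""
    s = sorted(xs)
    return [(s.index(x) + 1 + s.index(x) + s.count(x)) / 2 for x in xs]

def pearson(xs, ys):
    # The usual product-moment correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Steps 1-3: product-moment correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))  # 1.0: a perfectly monotone relationship
```

Note that a perfectly monotone but nonlinear relationship gives ρS = 1 even though the Pearson correlation of the raw data is below 1.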
Comments
1. If the distribution of the original RVs is not continuous, then some data values may be repeated (‘tied
ranks’). Repeated Xi s are given averaged ranks (for example, if there are two Xi with the smallest
value, then they are each given rank 1.5 = (1 + 2)/2).
2. If X ⊥⊥ Y, so the ‘true’ ρS is zero, then the distribution of the calculated ρS is easily approximated
(using the standard formulae for Σ_{i=1}^n i^k).
3. ‘Easily approximated’ does not necessarily mean ‘well approximated’ !
4. Most books give another formula for ρS, which is equivalent unless there are tied ranks, but which
obscures the relationship with the standard product-moment correlation

    ρ = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² Σ (y_i − ȳ)² ).
5. Other, perhaps better, types of nonparametric correlation have been defined (‘Kendall’s τ ’).
4.7  Graphical Methods
A vital part of data analysis is to plot the data using bar-charts, histograms, scatter diagrams etc. Plotting
the data is important no matter what further formal statistical methods will be used:
1. It enables you to ‘get a feel for’ the data,
2. It helps you look for patterns and anomalies,
3. It helps in checking assumptions (such as independence, linearity or Normality).
Many useful plots can be easily churned out using a computer, though sometimes you have to devise original
plots to display the data in the most appropriate way.
Exercise 4.6
The following table shows 66 measurements on the speed of light, made by S. Newcomb in 1882. Values
are the times in nanoseconds (ns), less 24,800 ns, for light to travel from his laboratory to a mirror and
back. Values are to be read row-by-row, thus the first two observations are 24,828 ns and 24,826 ns.
28  26  33  24  34 -44  27  16  40  -2
29  22  24  21  25  30  23  29  31  19
24  20  36  32  36  28  25  21  28  29
37  25  28  26  30  32  36  26  30  22
36  23  27  27  28  27  31  27  26  33
26  32  32  24  39  28  24  25  32  25
29  27  28  29  16  23
Produce a histogram, a Normal probability plot and a time plot of Newcomb’s data. Decide which (if any)
observations to ignore, and produce a normal probability plot of the remaining reduced data set. Finally
compare the mean of this reduced data set with (i) the mean and (ii) the 10% trimmed mean of the original
data. Solution: Plots are shown in Figure 4.1. There are clearly 2 large outliers, but the time plot also
suggests that the 6th to 10th observations are unusually variable, and that the last two observations are
atypically low (both being lower than the previous 20 observations).
The Normal probability plot is constructed by calculating y(i) (the sorted data) and zi as follows, and plotting y(i) against zi .

     i    y(i)    xi = (i − 0.5)/(n + 1)    zi = Φ⁻¹(xi)
     1    −44     0.0075                    −2.434
     2     −2     0.0224                    −2.007
     3     16     0.0373                    −1.783
     4     16     0.0522                    −1.624
     .      .       .                         .
     .      .       .                         .
    65     39     0.9776                     2.007
    66     40     0.9925                     2.434
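The plotting positions and quantiles above can be computed with the Python standard library. This is a sketch: `statistics.NormalDist.inv_cdf` supplies Φ⁻¹, and the plotting position (i − 0.5)/(n + 1) is the convention assumed here.

```python
# Normal probability plot coordinates: z_i = Phi^{-1}(x_i) for plotting
# positions x_i = (i - 0.5)/(n + 1).
from statistics import NormalDist

def plotting_positions(n):
    return [(i - 0.5) / (n + 1) for i in range(1, n + 1)]

def normal_quantiles(n):
    nd = NormalDist()                       # standard Normal distribution
    return [nd.inv_cdf(p) for p in plotting_positions(n)]

n = 66
x = plotting_positions(n)
z = normal_quantiles(n)
# To draw the plot itself, sort the data and plot y_(i) against z_i.
```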
Omitting the first 10 and the last 2 recorded observations leaves a data-set where the Normality and
independence assumptions are much more reasonable—see plot (d) of Figure 4.1.
Location estimates are: (i) mean of the full data set 26.2, (ii) 10% trimmed mean 27.4, and mean of the reduced data set 27.9. The trimmed mean is reasonably close to the mean of observations 11–64.
Figure 4.1: Plots of Newcomb’s data: (a) histogram, (b) Normal probability plot, (c) time plot, (d) Normal
probability plot of data after excluding the first 10 and last 2 observations.
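These location estimates can be reproduced directly. A sketch follows; the data are listed in time order, and the ‘10% trimmed mean’ is taken here to drop int(0.1·n) = 6 observations from each end of the sorted sample, an assumed convention that reproduces the quoted value.

```python
# Newcomb's 66 speed-of-light measurements, in time order (read row-by-row).
newcomb = [
    28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
    29, 22, 24, 21, 25, 30, 23, 29, 31, 19,
    24, 20, 36, 32, 36, 28, 25, 21, 28, 29,
    37, 25, 28, 26, 30, 32, 36, 26, 30, 22,
    36, 23, 27, 27, 28, 27, 31, 27, 26, 33,
    26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
    29, 27, 28, 29, 16, 23,
]

def mean(xs):
    return sum(xs) / len(xs)

def trimmed_mean(xs, prop=0.10):
    k = int(prop * len(xs))            # observations removed from EACH end
    return mean(sorted(xs)[k:len(xs) - k])

full = mean(newcomb)                   # mean of all 66 values
trimmed = trimmed_mean(newcomb)        # 10% trimmed mean
reduced = mean(newcomb[10:-2])         # omit first 10 and last 2 observations
```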
k
4.8 Bootstrapping
‘Bootstrap’ methods have become increasingly used over the past few years. They address the general
question:
‘What are the properties of the calculated statistics (e.g. MLEs θ̂), given that the underlying distributional assumptions may be false (and, in reality, will be false)?’
Bootstrapping uses the observed data directly as an estimate of the underlying population, then uses
‘plug-in’ estimation, and typically involves computer simulation.
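As a minimal sketch of the plug-in idea: resample the observed data with replacement, recompute the statistic each time, and read off its variability. The data values below are illustrative only, and the bootstrap standard error of the mean is compared with the familiar s/√n formula.

```python
# Bootstrap estimate of the standard error of a statistic: treat the data
# themselves as the population, and resample from them with replacement.
import random
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_se(data, stat, n_boot=5000, seed=0):
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in data]) for _ in range(n_boot)]
    m = mean(reps)
    return sqrt(sum((r - m) ** 2 for r in reps) / (n_boot - 1))

data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.3, 2.5, 3.9]   # illustrative sample
se_boot = bootstrap_se(data, mean)

# Compare with the usual formula s/sqrt(n):
n = len(data)
s = sqrt(sum((x - mean(data)) ** 2 for x in data) / (n - 1))
se_formula = s / sqrt(n)
```

For the mean the two answers agree closely; the point of the bootstrap is that the same recipe works for statistics with no convenient standard-error formula.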
Several other computer-intensive approaches to statistical inference have also become very popular recently.
4.9 Problems
1. [Light relief]
Discuss the following quote:
‘As a statistician, I want to use mathematics to help deal with practical uncertainty. The natural
mathematical way to handle uncertainty is via probability.
‘About the simplest practical probability statement I can think of is “The probability that a fair coin,
tossed at random, will come down ‘heads’ is 1/2”.
‘Now try to define “fair coin”, “at random” and “probability 1/2” without using subjective probability
or circular definitions.
‘Summary: if a practical probability statement is not subjective, then it must be tautologous, ill-defined, or useless.
‘Of course, for balance, some of the time I teach subjective methods, and some of the time I teach
useless methods :-).’
Ewart Shaw (Internet posting 13–Aug–1993).
2. (a) Plot the captopril data (Table 4.1), and suggest what sort of models seem reasonable.
(b) Roughly estimate from your graph(s) the effect of captopril (C) on systolic and diastolic blood
pressure (SBP & DBP).
(c) Suggest a single summary measure (SBP, DBP or a combination of the two) to quantify the
effect of treatment.
(d) Do you think a transformation of the data would be appropriate?
(e) Comment on the number of parameters in your model(s).
(f) Calculate ρS and ρ between ∆S , the change (after−before) in SBP, and ∆D , the change (after−before) in DBP.
Suggest some advantages and disadvantages in using ρS and ρ here.
(g) Calculate some further summary statistics such as means, variances, correlations and five-number summaries, and comment on how useful they are as summaries of the data.
(h) Are there any problems in using the data to estimate the effect of captopril? What further
information would be useful?
(i) What advantages/disadvantages would there be in using bootstrapping here, i.e. using the discrete distribution that assigns probability 1/15 to each of the 15 points x1 = (210, 201, 130, 125),
x2 = (169, 165, 122, 121), . . . , x15 = (154, 131, 100, 82) as an estimate of the underlying population, and working out the properties of ρS , ρ, etc. based on that assumption?
Chapter 5
Hypothesis Testing
5.1 Introduction
A hypothesis is a claim about the real world; statisticians will be interested in hypotheses like:
1. ‘The probabilities of a male panda or a female panda being born are equal’,
2. ‘The number of flying bombs falling on a given area of London during World War II follows a Poisson
distribution’,
3. ‘The mean systolic blood pressure of 35-year-old men is no higher than that of 40-year-old women’,
4. ‘The mean value of Y = log(systolic blood pressure) is independent of X = age’
(i.e. E[Y |X = x] = constant).
These hypotheses can be translated into statements about parameters within a probability model:
1. ‘p1 = p2 ’,
2. ‘N ∼ Poi (λ) for some λ > 0’, i.e.: pn = Pr(N = n) = λⁿ exp(−λ)/n! (within the general probability model pn ≥ 0 ∀ n = 0, 1, . . . ; Σ pn = 1),
3. ‘θ1 ≤ θ2 ’ and
4. ‘β1 = 0’ (assuming the linear model E[Y |x] = β0 + β1 x).
Definition 5.1 (Hypothesis test)
A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a reasonable
simplifying assumption, or to reject it as unreasonable in the light of the data.
Definition 5.2 (Null hypothesis)
The null hypothesis H0 is the simplifying assumption we are considering making.
Definition 5.3 (Alternative hypothesis)
The alternative hypothesis H1 is the alternative explanation(s) we are considering for the data.
Definition 5.4 (Type I error)
A type I error is made if H0 is rejected when H0 is true.
Definition 5.5 (Type II error)
A type II error is made if H0 is accepted when H0 is false.
Comments
1. In the first example above (pandas) the null hypothesis is H0 : p1 = p2 .
2. The alternative hypothesis in the first example would usually be H1 : p1 ≠ p2 , though it could also be (for example)
(a) H1 : p1 < p2 ,
(b) H1 : p1 > p2 , or
(c) H1 : p1 − p2 = δ for some specified δ ≠ 0.
5.2 Simple Hypothesis Tests
The simplest type of hypothesis testing occurs when the probability distribution giving rise to the data is
specified completely under the null and alternative hypotheses.
Definition 5.6 (Simple hypotheses)
A simple hypothesis is of the form Hk : θ = θk ,
i.e. the probability distribution of the data is specified completely.
Definition 5.7 (Composite hypotheses)
A composite hypothesis is of the form Hk : θ ∈ Ωk ,
i.e. the parameter θ lies in a specified subset Ωk of the parameter space ΩΘ .
Definition 5.8 (Simple hypothesis test)
A simple hypothesis test tests a simple null hypothesis H0 : θ = θ0 against a simple alternative
H1 : θ = θ1 , where θ parametrises the distribution of our experimental random variables X = (X1 , X2 , . . . , Xn ).
There may be many seemingly sensible approaches to testing a given hypothesis. A reasonable criterion
for choosing between them is to attempt to minimise the chance of making a mistake: incorrectly rejecting
a true null hypothesis, or incorrectly accepting a false null hypothesis.
Definition 5.9 (Size)
A test of size α is one which rejects the null hypothesis H0 : θ = θ0 in favour of the alternative
H1 : θ = θ1 iff
X ∈ Cα
where Pr(X ∈ Cα | θ = θ0 ) = α
for some subset Cα of the sample space S of X.
Definition 5.10 (Critical region)
The set Cα in Definition 5.9 is called the critical region or rejection region of the test.
Definition 5.11 (Power & power function)
The power function of a test with critical region Cα is the function
β(θ) = Pr(X ∈ Cα | θ),
and the power is β = β(θ1 ), i.e. the probability that we reject H0 in favour of H1 when H1 is true.
A hypothesis test typically uses a test statistic T (X), whose distribution is known under H0 , and such that extreme values of T (X) are more compatible with H1 than H0 .
Many useful hypothesis tests have the following form:
Definition 5.12 (Simple likelihood ratio test)
A simple likelihood ratio test (SLRT) of H0 : θ = θ0 against H1 : θ = θ1 rejects H0 iff

      X ∈ Cα* = { x : L(θ0 ; x)/L(θ1 ; x) ≤ Aα },

where L(θ; x) is the likelihood of θ given the data x, and the number Aα is chosen so that the size of the test is α.
Exercise 5.1
Suppose that X1 , X2 , . . . , Xn ∼IID N (θ, 1). Show that the likelihood ratio for testing H0 : θ = 0 against H1 : θ = 1 can be written

      λ(x) = exp{ n ( x̄ − 1/2 ) }.

Hence show that the corresponding SLRT of size α rejects H0 when the test statistic T (X) = X̄ satisfies T > Φ⁻¹(1 − α)/√n.
k
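A quick numerical check of Exercise 5.1 can be done by simulation: under H0 the rejection rate of the test T > Φ⁻¹(1 − α)/√n should be close to α. This is a sketch; n = 25, the simulation size and the seed are arbitrary choices.

```python
# Monte Carlo check of the size of the SLRT in Exercise 5.1.
import random
from math import sqrt
from statistics import NormalDist

alpha, n = 0.05, 25
threshold = NormalDist().inv_cdf(1 - alpha) / sqrt(n)   # Phi^{-1}(1-alpha)/sqrt(n)

rng = random.Random(42)
n_sim = 20000
rejections = 0
for _ in range(n_sim):
    xbar = sum(rng.gauss(0, 1) for _ in range(n)) / n   # data under H0: theta = 0
    if xbar > threshold:
        rejections += 1
size_hat = rejections / n_sim     # should be close to alpha = 0.05
```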
Comments
1. For a simple hypothesis test, both H0 and H1 are ‘point hypotheses’, each specifying a particular
value for the parameter θ rather than a region of the parameter space.
2. The size α is the probability of rejecting H0 when H0 is in fact true; clearly we want α to be small
(α = 0.05, say).
3. Clearly for a fixed size α of test, the larger the power β of a test the better.
However, there is an inevitable trade-off between small size and high power (as in a jury trial: the
more careful one is not to convict an innocent defendant, the more likely one is to free a guilty one
by mistake).
4. In practice, no hypothesis will be precisely true, so the whole foundation of classical hypothesis testing
seems suspect!
5. Regarding likelihood as a measure of compatibility between data and model, an SLRT compares the
compatibility of θ0 and θ1 with the observed data x, and accepts H0 iff the ratio is sufficiently large.
6. One reason for the importance of likelihood ratio tests is the following theorem, which shows that
out of all tests of a given size, an SLRT (if one exists) is ‘best’ in a certain sense.
Theorem 5.1 (The Neyman-Pearson lemma)
Given random variables X1 , X2 , . . . , Xn , with joint density f (x|θ), the simple likelihood ratio test of
a fixed size α for testing H0 : θ = θ0 against H1 : θ = θ1 is at least as powerful as any other test of
the same size.
Exercise 5.2
[Proof of Theorem 5.1] Prove the Neyman-Pearson lemma. Solution: Fix the size of the test to be α. Let A be a positive constant and C0 a subset of the sample space satisfying
1. Pr(X ∈ C0 | θ = θ0 ) = α,
2. X ∈ C0 ⇐⇒ L(θ0 ; x)/L(θ1 ; x) = f (x|θ0 )/f (x|θ1 ) ≤ A.
Suppose that there exists another test of size α, defined by the critical region C1 , i.e. reject H0 iff x ∈ C1 , where Pr(x ∈ C1 | θ = θ0 ) = α.
Let B1 = C0 ∩ C1 , B2 = C0 ∩ C1^c , B3 = C0^c ∩ C1 .
Note that B1 ∪ B2 = C0 , B1 ∪ B3 = C1 , and B1 , B2 & B3 are disjoint.

Figure 5.1: Proof of the Neyman-Pearson lemma (the regions C0 , C1 and B1 , B2 , B3 within the sample space ΩX ).
Let the power of the likelihood ratio test be I0 = Pr(X ∈ C0 | θ = θ1 ),
and the power of the other test be I1 = Pr(X ∈ C1 | θ = θ1 ).
We want to show that I0 − I1 ≥ 0.
But

      I0 − I1 = ∫_{C0} f (x|θ1 ) dx − ∫_{C1} f (x|θ1 ) dx
              = ∫_{B1 ∪ B2} f (x|θ1 ) dx − ∫_{B1 ∪ B3} f (x|θ1 ) dx
              = ∫_{B2} f (x|θ1 ) dx − ∫_{B3} f (x|θ1 ) dx.
Also B2 ⊆ C0 , so f (x|θ1 ) ≥ A⁻¹ f (x|θ0 ) for x ∈ B2 ;
similarly B3 ⊆ C0^c , so f (x|θ1 ) ≤ A⁻¹ f (x|θ0 ) for x ∈ B3 .
Therefore

      I0 − I1 ≥ A⁻¹ [ ∫_{B2} f (x|θ0 ) dx − ∫_{B3} f (x|θ0 ) dx ]
              = A⁻¹ [ ∫_{C0} f (x|θ0 ) dx − ∫_{C1} f (x|θ0 ) dx ]
              = A⁻¹ [α − α]
              = 0,

as required.
k
5.3 Simple Null, Composite Alternative
Suppose that we wish to test the simple null hypothesis H0 : θ = θ0 against the composite alternative
hypothesis H1 : θ ∈ Ω1 .
The easiest way to investigate this is to imagine the collection of simple hypothesis tests with null hypothesis
H0 : θ = θ0 and alternative H1 : θ = θ1 , where θ1 ∈ Ω1 . Then, for any given θ1 , an SLRT is the most
powerful test for a given size α. The only problem would be if different values of θ1 result in different
SLRTs.
Definition 5.13 (UMP Tests)
A hypothesis test is called a uniformly most powerful test of H0 : θ = θ0 against H1 : θ = θ1 , θ1 ∈ Ω1 ,
if
1. There exists a critical region Cα corresponding to a test of size α not depending on θ1 ,
2. For all values of θ1 ∈ Ω1 , the critical region Cα defines a most powerful test of H0 : θ = θ0
against H1 : θ = θ1 .
Exercise 5.3
Suppose that X1 , X2 , . . . , Xn ∼IID N (0, σ 2 ).
1. Find the UMP test of H0 : σ 2 = 1 against H1 : σ 2 > 1.
2. Find the UMP test of H0 : σ 2 = 1 against H1 : σ 2 < 1.
3. Show that no UMP test of H0 : σ 2 = 1 against H1 : σ 2 ≠ 1 exists.
k
Comments
1. If a UMP test exists, then it is clearly the appropriate test to use.
2. Often UMP tests don’t exist!
3. A UMP test involves the data only via a likelihood ratio, so is a function of the sufficient statistics.
4. The critical region Cα therefore often has a simple form, and is usually easily found once the distribution of the sufficient statistics has been determined (hence the importance of the χ2 , t and F distributions).
5. The above three examples illustrate how important the form of the alternative hypothesis is. The first two are one-sided alternatives, whereas H1 : σ 2 ≠ 1 is a two-sided alternative hypothesis, since σ 2 could lie on either side of 1.
5.4 Composite Hypothesis Tests
The most general situation we’ll consider is where the parameter space Ω is divided into two subsets:
Ω = Ω0 ∪ Ω1 , where Ω0 ∩ Ω1 = ∅, and the hypotheses are H0 : θ ∈ Ω0 , H1 : θ ∈ Ω1 .
For example, one may want to test the null hypothesis that the data come from an exponential distribution
against the alternative that the data come from a more general gamma distribution. Note that here, as in
many other cases, dim(Ω0 ) < dim(Ω1 ) = dim(Ω).
One possible approach to this situation is to regard the maximum possible likelihood over θ ∈ Ωi as a
measure of compatibility between the data and the hypothesis Hi (i = 0, 1). It’s therefore convenient to
define the following:
θ̂    is the MLE of θ over the whole parameter space Ω,
θ̂0   is the MLE of θ over Ω0 , i.e. under the null hypothesis H0 , and
θ̂1   is the MLE of θ over Ω1 , i.e. under the alternative hypothesis H1 .

Note that θ̂ must therefore be the same as either θ̂0 or θ̂1 , since Ω = Ω0 ∪ Ω1 .
One might consider using the likelihood ratio criterion L(θ̂1 ; x)/L(θ̂0 ; x), by direct analogy with the SLRT. However, it’s generally easier to use the equivalent ratio L(θ̂; x)/L(θ̂0 ; x):
Definition 5.14 (Likelihood Ratio Test (LRT))
A likelihood ratio test rejects H0 : θ ∈ Ω0 in favour of the alternative H1 : θ ∈ Ω1 = Ω \ Ω0 iff

      λ(x) = L(θ̂; x) / L(θ̂0 ; x) ≥ λ*,        (5.1)

where θ̂ is the MLE of θ over the whole parameter space Ω, θ̂0 is the MLE of θ over Ω0 , and the value λ* is fixed so that

      sup_{θ∈Ω0} Pr(λ(X) ≥ λ* | θ) = α,

where α, the size of the test, is some chosen value.
Equivalently, the test criterion uses the log LRT statistic:

      r(x) = ℓ(θ̂; x) − ℓ(θ̂0 ; x) ≥ λ′,        (5.2)

where ℓ(θ; x) = log L(θ; x), and λ′ is chosen to give chosen size α = sup_{θ∈Ω0} Pr(r(X) ≥ λ′ | θ).
Comments
1. The size α is typically chosen by convention to be 0.05 or 0.01.
2. Note that high values of the test statistic λ(x), or equivalently of r(x), are taken as evidence against
the null hypothesis H0 .
3. The test given in Definition 5.14 is sometimes referred to as a generalized likelihood ratio test, and
Equation 5.1 a generalized likelihood ratio test statistic.
4. Equation 5.2 is often easier to work with than Equation 5.1—see the exercises and problems.
Exercise 5.4
[Paired t-test] Suppose that X1 , X2 , . . . , Xn ∼IID N (µ, σ 2 ), and let X̄ = Σ Xi /n, S 2 = Σ (Xi − X̄)2 /(n − 1).
What is the distribution of T = X̄/(S/√n)?
Is the test based on rejecting H0 : µ = 0 for large T a likelihood ratio test?
Assuming that the observed differences in diastolic blood pressure (after−before) are IID and Normally distributed with mean δD , use the captopril data (Table 4.1) to test the null hypothesis H0 : δD = 0 against the alternative hypothesis H1 : δD ≠ 0.
Comment: this procedure is called the paired t-test.
k
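The paired t statistic of Exercise 5.4 is straightforward to compute. This sketch applies it to illustrative differences only (the captopril differences from Table 4.1 would be used in the exercise itself):

```python
# Paired t-test: T = dbar / (s / sqrt(n)) for within-pair differences
# d_i = after - before; compare T with the t_{n-1} distribution.
from math import sqrt

def paired_t(diffs):
    n = len(diffs)
    dbar = sum(diffs) / n
    s2 = sum((d - dbar) ** 2 for d in diffs) / (n - 1)
    return dbar / sqrt(s2 / n)

# Illustrative differences only (NOT the captopril data):
t_stat = paired_t([1, 2, 3, 4, 5])
```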
Exercise 5.5
[Two sample t-test] Suppose X1 , X2 , . . . , Xm ∼IID N (µX , σ 2 ) and Y1 , Y2 , . . . , Yn ∼IID N (µY , σ 2 ).
1. Derive the LRT for testing H0 : µX = µY versus H1 : µX 6= µY .
2. Show that the LRT can be based on the test statistic

      T = ( X̄ − Ȳ ) / ( Sp √( 1/m + 1/n ) ),        (5.3)

where

      Sp2 = [ Σ_{i=1}^m (Xi − X̄)2 + Σ_{i=1}^n (Yi − Ȳ)2 ] / (m + n − 2).        (5.4)

3. Show that, under H0 , T ∼ t_{m+n−2} .
4. Two groups of female rats were placed on diets with high and low protein content, and the gain
in weight (grammes) between the 28th and 84th days of age was measured for each rat, with the
following results:
High protein diet
134 146 104 119 124 161 107 83 113 129 97 123
Low protein diet
70 118 101 85 107 132 94
Using the test statistic T above, test the null hypothesis that the mean weight gain is the same under
both diets.
Comment: this is called the two sample t-test, and Sp2 is the pooled estimate of variance.
k
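Applying the pooled two-sample t statistic of (5.3)–(5.4) to the rat weight-gain data gives the following (a sketch; the resulting value is compared with t tables on m + n − 2 = 17 degrees of freedom):

```python
# Two-sample pooled t statistic for the rat weight-gain data.
from math import sqrt

high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low = [70, 118, 101, 85, 107, 132, 94]

def ss(xs):
    """Corrected sum of squares: sum of (x_i - xbar)^2."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

m, n = len(high), len(low)
sp2 = (ss(high) + ss(low)) / (m + n - 2)     # pooled variance estimate
t = (sum(high) / m - sum(low) / n) / (sqrt(sp2) * sqrt(1 / m + 1 / n))
# Under H0 (equal mean gains), t is an observation from t_{m+n-2} = t_17.
```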
Exercise 5.6
[F -test] Suppose X1 , X2 , . . . , Xm ∼IID N (µX , σX2 ) and Y1 , Y2 , . . . , Yn ∼IID N (µY , σY2 ), where µX , µY , σX and σY are all unknown.
Suppose we wish to test the hypothesis H0 : σX2 = σY2 against the alternative H1 : σX2 ≠ σY2 .
1. Let SX2 = Σ_{i=1}^m (Xi − X̄)2 and SY2 = Σ_{i=1}^n (Yi − Ȳ)2 .
   What are the distributions of SX2 /σX2 and SY2 /σY2 ?
2. Under H0 , what is the distribution of the statistic

      V = [ SX2 /(m − 1) ] / [ SY2 /(n − 1) ] ?
3. Taking values of V much larger or smaller than 1 as evidence against H0 , and given data with m = 16, n = 16, Σ xi = 84, Σ yi = 18, Σ xi2 = 563, Σ yi2 = 72, test the null hypothesis H0 .
Comment: with the alternative hypothesis H1 : σX2 > σY2 , the above procedure is called an F test.
k
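The summary statistics in part 3 are enough to compute V directly, using the identity SX2 = Σxi2 − (Σxi )2 /m (a sketch):

```python
# F statistic from the summary data of Exercise 5.6.
m, n = 16, 16
sum_x, sum_y = 84, 18
sum_x2, sum_y2 = 563, 72

ssx = sum_x2 - sum_x ** 2 / m      # S_X^2 = sum of (x_i - xbar)^2
ssy = sum_y2 - sum_y ** 2 / n      # S_Y^2 = sum of (y_i - ybar)^2
v = (ssx / (m - 1)) / (ssy / (n - 1))
# Compare v with two-sided critical values of the F_{15,15} distribution.
```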
Even in simple cases like this, the null distribution of the log likelihood ratio test statistic r(x) (5.2) can be
difficult or impossible to find analytically. Fortunately, there is a very powerful and very general theorem
that gives the approximate distribution of r(x):
Theorem 5.2 (Wald’s Theorem)
Let X1 , X2 , . . . , Xn ∼IID f (x|θ) where θ ∈ Ω, and let r(x) denote the log likelihood ratio test statistic

      r(x) = ℓ(θ̂; x) − ℓ(θ̂0 ; x),

where θ̂ is the MLE of θ over Ω and θ̂0 is the MLE of θ over Ω0 ⊂ Ω.
Then, under reasonable conditions on the PDF (or PMF) f (·|·), the distribution of 2r(x) converges to a χ2 distribution on dim(Ω) − dim(Ω0 ) degrees of freedom as n → ∞.
Comments
1. A proof is beyond the scope of this course, but may be found in e.g. Kendall & Stuart, ‘The Advanced
Theory of Statistics’, Vol. II.
2. Wald’s theorem implies that, provided the sample size is large, you only need tables of the χ2
distribution to find the critical regions for a wide range of hypothesis tests.
Another important theorem, see Problem 3.7.4, page 38, is the following:
Theorem 5.3 (Sample Mean and Variance of Xi ∼IID N (µ, σ 2 ))
Let X1 , X2 , . . . , Xn ∼IID N (µ, σ 2 ). Then
1. X̄ = Σ Xi /n and Y = Σ (Xi − X̄)2 are independent RVs,
2. X̄ has a N (µ, σ 2 /n) distribution,
3. Y /σ 2 has a χ2_{n−1} distribution.
Exercise 5.7
Suppose X1 , X2 , . . . , Xn ∼IID N (θ, 1), with hypotheses H0 : θ = 0 and H1 : θ arbitrary.
Show that 2r(x) = n x̄2 , and hence that Wald’s theorem holds exactly in this case.
k
Exercise 5.8
Suppose now that Xi ∼ N (θi , 1), i = 1, . . . , n are independent, with null hypothesis H0 : θi = θ ∀i and alternative hypothesis H1 : θi arbitrary.
Show that 2r(x) = Σ_{i=1}^n (xi − x̄)2 , and hence (quoting any other theorems you need) that Wald’s theorem again holds exactly.
k
5.5 Problems
1. Suppose X1 , X2 , . . . , Xn ∼IID N (µ, σ 2 ) with null hypothesis H0 : σ 2 = 1 and alternative H1 : σ 2 arbitrary. Show that the LRT will reject H0 for large values of the test statistic 2r(x) = n(v̂ − 1 − log v̂), where v̂ = Σ_{i=1}^n (xi − x̄)2 /n.
2. Suppose that X ∼ Bin(n, p). Under the null hypothesis H0 : p = p0 , what are EX and Var X?
   Show that if n is large and p0 is not too close to 0 or 1, then

      ( X/n − p0 ) / √( p0 (1 − p0 )/n ) ∼ N (0, 1)

   approximately.
Out of 1000 tosses of a given coin, 560 were heads and 440 were tails. Is it reasonable to assume that
the coin is fair? Justify your answer.
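For the coin data the approximate Normal test statistic is quick to compute (a sketch):

```python
# Approximate Normal test of H0: p = 1/2, with 560 heads in 1000 tosses.
from math import sqrt

n, heads, p0 = 1000, 560, 0.5
z = (heads / n - p0) / sqrt(p0 * (1 - p0) / n)
# |z| far exceeds the usual 1.96 cut-off, so a fair coin looks implausible.
```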
3. Out of 370 new-born babies at a Hospital, 197 were male and 173 female.
Test the null hypothesis H0 : p < 1/2 versus H1 : p ≥ 1/2, where p denotes the probability that a
baby born at the Hospital will be male.
Discuss any assumptions you make.
4. (a) Define the size and power of a hypothesis test of a simple null hypothesis H0 : θ = θ0 against a
simple alternative hypothesis H1 : θ = θ1 .
(b) State and prove the Neyman-Pearson Lemma for continuous random variables X1 , . . . , Xn when
testing the null hypothesis H0 : θ = θ0 against the alternative H1 : θ = θ1 .
(c) Assume that a particular bus service runs at regular intervals of θ minutes, but that you do not know θ. Assume also that the times you find you have to wait for a bus on n occasions, X1 , . . . , Xn , are independent and identically distributed with density

      f (x|θ) = θ⁻¹ if 0 ≤ x ≤ θ, and 0 otherwise.
i. Discuss briefly when the above assumptions would be reasonable in practice.
ii. Find the likelihood L(θ; x) for θ given the data (X1 , . . . , Xn ) = x = (x1 , . . . , xn ).
iii. Find the most powerful test of size α of the hypothesis H0 : θ = θ0 = 20 against the
alternative H1 : θ = θ1 > 20.
From Warwick ST217 exam 1997
5. Let X1 , . . . , Xn be independent, each with density

      f (x) = λ x⁻² e^{−λ/x} if x > 0, and 0 otherwise,

   where λ is an unknown parameter.
(a) Show that the UMP test of H0 : λ = 1/2 against H1 : λ > 1/2 is of the form:
    ‘reject H0 if Σ_{i=1}^n Xi⁻¹ ≤ A∗ ’, where A∗ is chosen to fix the size of the test.
(b) Find the distribution of Σ_{i=1}^n Xi⁻¹ under the null & alternative hypotheses.
(c) You observe values 0.59, 0.36, 0.71, 0.86, 0.13, 0.01, 3.17, 1.18, 3.28, 0.49 for X1 , . . . , X10 .
Test H0 against H1 , & comment on the test in the light of any assumptions made.
6. (a) Define the size and power function of a hypothesis test procedure.
(b) State and prove the Neyman-Pearson lemma in the case of a test statistic that has a continuous
distribution.
(c) Let X1 , X2 , . . . , Xn ∼IID N (µ, σ 2 ), where σ 2 is known. Find the likelihood ratio

      fX (x|µ1 )/fX (x|µ0 )

    and hence show that the most powerful test of size α for testing the null hypothesis H0 : µ = µ0 against the alternative H1 : µ = µ1 , for some µ1 < µ0 , has the form:

      ‘Reject H0 if X̄ < µ0 + σ Φ⁻¹(α)/√n ’,

    where X̄ = Σ_{i=1}^n Xi /n is the sample mean, and Φ⁻¹(α) is the 100α% point of the standard Normal N (0, 1) distribution.
(d) Define a uniformly most powerful (UMP) test, and show that the above test is UMP for testing
H0 : µ = µ0 against H1 : µ < µ0 .
(e) What is the UMP test of H0 : µ = µ0 against H1 : µ > µ0 ?
(f) Deduce that no UMP test of size α exists for testing H0 : µ = µ0 against H1 : µ ≠ µ0 .
(g) What test would you choose to test H0 : µ = µ0 against H1 : µ ≠ µ0 , and why?
From Warwick ST217 exam 1999
7. X is a single observation whose density is given by

      f (x) = (1 + θ) x^θ if 0 < x < 1, and 0 otherwise.
Find the most powerful size α test of H0 : θ = 0 against H1 : θ = 1.
Is there a U.M.P. test of H0 : θ ≤ 0 against H1 : θ > 0? If so, what is it?
8. Let X1 , X2 , . . . , Xn ∼IID Exp(θ), i.e. f (x|θ) = θ e^{−θx} , for θ ∈ (0, ∞).
   Show that a likelihood ratio test for H0 : θ ≤ θ0 versus H1 : θ > θ0 has the form:

      ‘Reject H0 iff θ0 x̄ < k, where k is given by α = ∫₀^{nk} z^{n−1} e^{−z} / Γ(n) dz’.

   Show that a test of this form is UMP for testing H0 : θ = θ0 versus H1 : θ > θ0 .
9. Hypothesis test procedures can be inverted to produce confidence intervals or more generally confidence regions. Thus, given a size α test of the null hypothesis H0 : θ = θ0 , the set of all values θ0
that would NOT be rejected forms a ‘100(1 − α)% confidence interval for θ’.
An amateur statistician argues as follows:
Suppose something starts at time t0 and ends at time t1 . Then at time t ∈ (t0 , t1 ), the ratio r of its remaining lifetime (t1 − t) to its current age (t − t0 ), i.e.

      r(t) = (t1 − t)/(t − t0 ),

is clearly a monotonic decreasing function of t. Also it is easy to check that r = 39 after (1/40)th of the total lifetime, and that r = 1/39 after (39/40)th of the total lifetime. Therefore, for 95% of something’s existence, its remaining lifetime lies in the interval

      ( (t − t0 )/39, 39(t − t0 ) ),

where t is the time under consideration, and t0 is the time the thing came into existence.
The statistician is also an amateur theologian, and firmly believes that the World came into existence 6006 years ago. Using his pet procedure outlined above, he says he is ‘95% confident that the World will end sometime between 154 years hence, and 234234 years hence’.
His friend, also an amateur statistician, says she has an even more general procedure to produce
confidence intervals:
In any situation I simply roll an icosahedral (fair 20-sided) die. If the die shows ‘13’ then I
quote the empty set ∅ as a 95% confidence interval, otherwise I quote the whole real line R.
She rolls the die, which comes up 13. She therefore says she is ‘95% confident that the World ended
before it even began (although presumably no-one has noticed yet).’
Discuss.
10. The following problem is quoted verbatim from Osborn (1979), ‘Statistical Exercises in Medical
Research’ :
A study of immunoglobulin levels in mycetoma patients in the Sudan involved 22 patients to be
compared to 22 normal individuals. The levels of IgG recorded for the 22 mycetoma patients are
shown below. The mean level for the normal individuals was calculated to be 1,477 mg/100ml before
the data for this group was lost overboard from a punt on the river Nile. Use the data below to
estimate the within group variance and hence perform a ‘t’ test to investigate the significance of the
difference between the mean levels of IgG in mycetoma patients and normals.
IgG levels (mg/100ml) in 22 mycetoma patients

1,047  1,377  1,210  1,103  1,270  1,135  1,375  1,067    907  1,230  1,350
  804  1,032    960  1,122  1,062  1,002    960  1,345  1,204  1,053    936
Osborn (1979) 4.6.16
11. A group of clinicians wish to study survival after heart attack, by classifying new heart attack patients
according to
(a) whether they survive at least 7 days after admission, and
(b) whether they currently smoke 10 or more cigarettes per day.
From previous experience, the clinicians predict that after N days the observed counts
              Survive   Die
Smoker          R1       R2
Non-smoker      R3       R4
will follow independent Poisson distributions with means

              Survive   Die
Smoker          N r1     N r2
Non-smoker      N r3     N r4
The clinicians intend to estimate the population log-odds ratio ` = log(r1 r4 /r2 r3 ) by the sample
value L = log(R1 R4 /R2 R3 ), and they wish to choose N to give a probability 1 − β of being able to
reject the hypothesis H0 : ` = 0 at the 100α% significance level, when the true value of ` is `0 > 0.
Using the formula Var f (X) ≈ [f ′(EX)]2 Var(X), show that L has approximate variance

      1/(N r1 ) + 1/(N r2 ) + 1/(N r3 ) + 1/(N r4 ),
and hence, assuming a Normal approximation to the distribution of L, that the required number of days is roughly

      N = (1/ℓ0 2 ) ( 1/r1 + 1/r2 + 1/r3 + 1/r4 ) ( Φ⁻¹(α/2) + Φ⁻¹(β) )2 ,
where Φ is the standard Normal cumulative distribution function.
Comment critically on the clinicians’ method for choosing N .
From Warwick ST332 exam 1988
5.6 The Multinomial Distribution and χ2 Tests

5.6.1 Multinomial Data
Definition 5.15 (Multinomial Distribution)
The multinomial distribution Mn(n, θ) is a probability distribution on points y = (y1 , y2 , . . . , yk ), where yi ∈ {0, 1, 2, . . .}, i = 1, 2, . . . , k, and Σ_{i=1}^k yi = n, with PMF

      f (y1 , y2 , . . . , yk ) = [ n! / (y1 ! y2 ! · · · yk !) ] ∏_{i=1}^k θi^{yi} ,        (5.5)

where θi > 0 for i = 1, . . . , k, and Σ_{i=1}^k θi = 1.
Comments
1. The multinomial distribution arises when one has n independent observations, each classified in one
of k ways (e.g. ‘eye colour’ classified as ‘Brown’, ‘Blue’ or ‘Other’; here k = 3).
Let θi denote the probability that any given observation lies in category number i, and let Yi denote
the number of observations falling in category i. Then the random vector Y = (Y1 , Y2 , . . . , Yk ) has a
Mn(n, θ) distribution.
2. A binomial distribution is the special case k = 2, and is usually parametrised by p = θ1 (so θ2 = 1−p).
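The PMF (5.5) can be evaluated directly (a sketch using standard-library factorials; the eye-colour probabilities below are illustrative):

```python
# Multinomial PMF: f(y) = n!/(y_1! ... y_k!) * prod_i theta_i^{y_i}.
from math import factorial, prod

def multinomial_pmf(y, theta):
    n = sum(y)
    coef = factorial(n)
    for yi in y:
        coef //= factorial(yi)          # n! / (y_1! y_2! ... y_k!)
    return coef * prod(t ** yi for t, yi in zip(theta, y))

# e.g. 4 people classified by eye colour with theta = (0.5, 0.3, 0.2):
p = multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2))
```

With k = 2 this reduces to the binomial PMF, as in Comment 2 above.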
Exercise 5.9
By partial differentiation of the likelihood function, show that the MLEs θ̂i of the parameters θi of the Mn(n, θ) satisfy the equations

      yi /θ̂i − yk /( 1 − Σ_{j=1}^{k−1} θ̂j ) = 0        (i = 1, . . . , k − 1),

and hence that θ̂i = yi /n for i = 1, . . . , k.
k

5.6.2 Chi-Squared Tests
Suppose one wishes to test the null hypothesis H0 that, in the multinomial distribution (5.5), θ is some function θ(φ) of another parameter φ. The alternative hypothesis H1 is that θ is arbitrary.
Exercise 5.10
Suppose H0 is that X1 , X2 , . . . , Xn ∼IID Bin(3, φ). Let Yi (for i = 1, 2, 3, 4) denote the number of observations Xj taking value i − 1. What is the null distribution of Y = (Y1 , Y2 , Y3 , Y4 )?
k
The log likelihood ratio test statistic r(X) is given by

      r(X) = Σ_{i=1}^k Yi log θ̂i − Σ_{i=1}^k Yi log θi (φ̂),        (5.6)

where θ̂i = yi /n for i = 1, . . . , k.
By Wald’s theorem, under H0 , 2r(X) has approximately a χ2 distribution:

      2 Σ_{i=1}^k Yi [ log θ̂i − log θi (φ̂) ] ∼ χ2_{k1 −k0} ,        (5.7)

where θ̂i = Yi /n, k0 is the dimension of the parameter φ, and k1 = k − 1 is the dimension of θ under the constraint Σ_{i=1}^k θi = 1.
Comments
1. In Exercise 5.10, k = 4, k0 = 1, k1 = 3 and φ̂ = X̄/3, where X̄ = Σ_{i=1}^4 (i − 1)Yi /n.
We would reject H0 , that the sample comes from a Bin(3, φ) distribution for some φ, if 2r(x) is greater than the 95% point of the χ22 distribution, where r(x) is given in Formula 5.6.
2. It is straightforward to check, using a Taylor series expansion of the log function, that provided EYi is large ∀ i,

      2 Σ_{i=1}^k Yi [ log θ̂i − log θi (φ̂) ] ≈ Σ_{i=1}^k (Yi − µi )2 /µi ,        (5.8)

where µi = n θi (φ̂) is the expected number of individuals (under H0 ) in the ith category.
Definition 5.16 (Chi-squared Goodness of Fit Statistic)

      X 2 = Σ_{i=1}^k (oi − ei )2 /ei ,        (5.9)

where oi is the observed count in the ith category and ei is the corresponding expected count under the null hypothesis, is called the χ2 goodness-of-fit statistic.
Comments
1. Under H0 , X 2 has approximately a χ2 distribution, with number of degrees of freedom equal to (number of categories) − 1 − (number of parameters estimated under H0 ).
   This approximation works well provided all the expected counts are reasonably large (say all are at least 5).
This approximation works well provided all the expected counts are reasonably large (say all are at
least 5).
2. This χ2 test was suggested by Karl Pearson before the theory of hypothesis testing was fully developed.
5.7 Problems
1. In a genetic experiment, peas were classified according to their shape (‘round’ or ‘angular’) and
colour (‘yellow’ or ‘green’). Out of 556 peas, 315 were round+yellow, 108 were round+green, 101
were angular+yellow and 32 were angular+green.
Test the null hypothesis that the probabilities of these four types are 9/16, 3/16, 3/16 and 1/16
respectively.
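The goodness-of-fit statistic (5.9) for the pea data can be computed directly (a sketch; no parameters are estimated under H0 , so the degrees of freedom are 4 − 1 = 3):

```python
# Pearson chi-squared statistic for the 9:3:3:1 hypothesis on pea types.
observed = [315, 108, 101, 32]
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
n = sum(observed)

expected = [n * p for p in probs]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Compare x2 with the chi-squared distribution on 3 degrees of freedom
# (95% point about 7.81): here x2 is tiny, so H0 is not rejected.
```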
2. A sample of 300 people was selected from a population, and classified into blood type (O/A/B/AB,
and Rhesus positive/negative), as shown in the following table:
                 O     A     B    AB
Rh positive     82    89    54    19
Rh negative     13    27     7     9
The null hypothesis H0 is that being Rhesus negative is independent of whether an individual’s blood
group is O, A, B or AB. Estimate the probabilities under H0 of falling into each of the 8 categories,
and hence test the hypothesis H0 .
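Under H0 the estimated cell probabilities multiply the row and column proportions, so each expected count is (row total × column total)/300. A sketch of the computation:

```python
# Chi-squared test of independence for the blood-group table.
counts = [
    [82, 89, 54, 19],   # Rh positive: O, A, B, AB
    [13, 27, 7, 9],     # Rh negative: O, A, B, AB
]
row_tot = [sum(r) for r in counts]
col_tot = [sum(c) for c in zip(*counts)]
n = sum(row_tot)

x2 = 0.0
for i, row in enumerate(counts):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n       # expected count under H0
        x2 += (o - e) ** 2 / e
# Degrees of freedom = (2-1)*(4-1) = 3; x2 exceeds the 95% point of
# chi-squared_3 (about 7.81), so H0 is rejected at the 5% level.
```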
3. The random variables X1 , X2 , . . . , Xn are IID with Pr(Xi = j) = pj for j = 1, 2, 3, 4, where Σ pj = 1 and pj > 0 for each j = 1, 2, 3, 4.
Interest centres on the hypothesis H0 that p1 = p2 and simultaneously p3 = p4 .
(a) Define the following terms
i. a hypothesis test,
ii. simple and composite hypotheses, and
iii. a likelihood ratio test.
(b) Letting θ = (p1 , p2 , p3 , p4 ), X = (X1 , . . . , Xn )T with observed values x = (x1 , . . . , xn )T , and
letting yj denote the number of x1 , x2 , . . . , xn equal to j, what is the likelihood L(θ|x)?
(c) Assume the usual regularity conditions, i.e. that the distribution of −2 log λ(x), where λ(x) is the likelihood ratio statistic, tends to χ2ν as the sample size n → ∞. What are the dimension of the parameter space Ωθ and the number of degrees of freedom ν of the asymptotic chi-squared distribution?
(d) By partial differentiation of the log-likelihood, or otherwise, show that the maximum likelihood
estimator of pj is yj /n.
(e) Hence show that the asymptotic test statistic of H0 : p1 = p2 and p3 = p4 is

−2 log λ(x) = 2 Σj yj log(yj /mj )    (sum over j = 1, . . . , 4),

where m1 = m2 = (y1 + y2 )/2 and m3 = m4 = (y3 + y4 )/2.
(f) In a hospital casualty unit, the numbers of limb fractures seen over a certain period of time are:

              Arm   Leg
Side  Left     46    22
      Right    49    32
Using the test developed above, test the hypothesis that limb fractures are equally likely to
occur on the right side as on the left side.
Discuss briefly whether the assumptions underlying the test appear reasonable here.
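A quick numerical sketch of part (f) (plain Python; the mapping y1 , y2 = arm left/right and y3 , y4 = leg left/right is an assumption about how the table corresponds to the notation above):

```python
from math import log

# Likelihood ratio statistic 2 * sum(yj * log(yj / mj)) for
# H0: p1 = p2 (arm fractures) and p3 = p4 (leg fractures).
y = [46, 49, 22, 32]                    # arm left, arm right, leg left, leg right
m = [(y[0] + y[1]) / 2] * 2 + [(y[2] + y[3]) / 2] * 2
stat = 2 * sum(yj * log(yj / mj) for yj, mj in zip(y, m))
print(round(stat, 2))                   # ≈ 1.96 on 2 d.f.: not significant
```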
From Warwick ST217 exam 1998
Prudens quaestio dimidium scientiae.
Half of science is asking the right questions.
Roger Bacon
We all learn by experience, and your lesson this time is that you should never lose sight of the
alternative.
Sir Arthur Conan Doyle
One forms provisional theories and then waits for time or fuller knowledge to explode them.
Sir Arthur Conan Doyle
What used to be called prejudice is now called a null hypothesis.
A. W. F. Edwards
The conventional view serves to protect us from the painful job of thinking.
John Kenneth Galbraith
Science must begin with myths, and with the criticism of myths.
Sir Karl Raimund Popper
Chapter 6
Linear Statistical Models
6.1 Introduction
Definition 6.1 (Response Variable)
A response variable is a random variable Y whose value we wish to predict.
Definition 6.2 (Explanatory Variable)
An explanatory variable is a random variable X whose values can be used to predict Y .
Definition 6.3 (Linear Model)
A linear model is a prediction function for Y in terms of the values x1 , x2 , . . . , xk of X1 , X2 , . . . , Xk
of the form
E[Y |x1 , x2 , . . . , xk ] = β0 + β1 x1 + β2 x2 + · · · + βk xk
(6.1)
Thus if Y1 , Y2 , . . . , Yn are the responses for cases 1, 2, . . . , n, and xij is the value of Xj (j = 1, . . . , k) for
case i, then
E[Y|X] = Xβ
(6.2)
where

Y = (Y1 , Y2 , . . . , Yn )T

is the vector of responses,

X = (xij ), where xi0 = 1 for i = 1, . . . , n,

is the matrix of explanatory variables, and

β = (β0 , β1 , . . . , βk )T

is the (unknown) parameter vector.
Examples
Consider the captopril data (page 42), and let

X1 = Diastolic BP before treatment,    X2 = Systolic BP before treatment,
X3 = Diastolic BP after treatment,     X4 = Systolic BP after treatment,
Z1 = 2X1 + X2 ,                        Z2 = 2X3 + X4 .
Some possible linear models of interest are:
1. Response Y = X4 ,
(a) explanatory variable X2 (this is a ‘simple linear regression model ’, with just 1 explanatory
variable),
(b) explanatory variable X3
(c) explanatory variables X1 and X2 (a ‘multiple regression model ’).
2. Response Y = Z2 ,
(a) explanatory variable Z1
(b) explanatory variables Z1 and Z12 (a ‘quadratic regression model ’).
Note how new explanatory variables may be obtained by transforming and/or combining old ones.
3. Looking just at the interrelationship between SBP and DBP at a given time:
(a) response Y = X2 , explanatory variable X1 ,
(b) response Y = X1 , explanatory variable X2 ,
(c) response Y = X4 , explanatory variable X3 , etc.
Comments
1. A linear relationship is the simplest possible relationship between response variables and explanatory
variables, so linear models are easy to understand, interpret and also to check for plausibility.
2. One can (in theory) approximate an arbitrarily complicated relationship by a linear model; for example, quadratic regression can obviously be extended to ‘polynomial regression’
E[Y |x] = β0 + β1 x + β2 x2 + · · · + βm xm .
3. Linear models have nice links with
• geometry,
• linear algebra,
• conditional expectations and variances,
• the Normal distribution.
4. Distributional assumptions (if any!) will typically be made ONLY about the response variable Y ,
NOT about the explanatory variables.
Therefore the model makes sense even if the Xi s are chosen nonrandomly (‘designed experiments’).
5. The response variable Y is sometimes called the ‘dependent variable’, and the explanatory variables
are sometimes called ‘predictor variables’, ‘regressor variables’, or (very misleadingly) ‘independent
variables’.
6.2 Simple Linear Regression
Definition 6.4
A simple linear regression model is a linear model with one response variable Y and one explanatory
variable X, i.e. a model of the form
E[Y |x1 ] = β0 + β1 x1 .
(6.3)
Typically in practice we have n data points (xi , yi ) for i = 1, . . . , n, and we want to predict a future
response Y from the corresponding observed value x of X.
Often there’s a natural candidate for which variable should be treated as the response:
1. X may precede Y in time, for example
(a) X is BP before treatment and Y is BP after treatment, or
(b) X is number of hours revision and Y is exam mark;
2. X may be in some way more fundamental, for example
(a) X is age and Y is height or
(b) X is height and Y is weight;
3. X may be easier or cheaper to observe, so we hope in future to estimate Y without measuring it.
In simple linear regression we don’t know β0 or β1 , but need to estimate them in order to predict Y by
Ŷ = β̂0 + β̂1 x.
To make accurate predictions we require the prediction error

Y − Ŷ = Y − (β̂0 + β̂1 x)

to be small.
This suggests that, given data (xi , yi ) for i = 1, . . . , n, we should fit β̂0 and β̂1 by simultaneously making
all the vertical deviations of the observed data points from the fitted line y = β̂0 + β̂1 x small.
The easiest way to do this is to minimise the sum of squared deviations Σ(yi − ŷi )², i.e. to use the ‘least
squares’ criterion.
6.3 Method of Least Squares
For simple linear regression,

ŷi = β0 + β1 xi    (i = 1, . . . , n).    (6.4)

Therefore to estimate β0 and β1 by least squares, we need to minimise

Q = Σi [yi − (β0 + β1 xi )]²    (sum over i = 1, . . . , n).    (6.5)
Exercise 6.1
Show that Q in equation 6.5 is minimised at values β0 and β1 satisfying the simultaneous equations

β0 n + β1 Σ xi = Σ yi ,
β0 Σ xi + β1 Σ xi² = Σ xi yi ,    (6.6)

and hence that

β̂1 = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²),    (6.7)
β̂0 = ȳ − β̂1 x̄.    (6.8)
k
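Formulae 6.7 and 6.8 translate directly into code; a minimal sketch in plain Python (the data points below are made up for illustration):

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimising sum((y - (b0 + b1*x))**2), via 6.7 and 6.8."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / \
         (sum(x * x for x in xs) - n * xbar ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 1 + 2x are recovered exactly.
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)   # 1.0 2.0
```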
Comments
1. Forming ∂²Q/∂β0², ∂²Q/∂β1² and ∂²Q/∂β0 ∂β1 verifies that Q is minimised at β = β̂.
2. Equations 6.6 are called the ‘normal equations’ for β0 and β1 (‘normal’ as in ‘perpendicular’ rather
than as in ‘standard’ or as in ‘Normal distribution’).
3. y = β̂0 + β̂1 x is called the ‘least squares fit’ to the data.
4. From equations 6.7 and 6.8, the least squares fitted line passes through (x̄, ȳ), the centroid of the
data points.
5. Concentrate on understanding and remembering the method for finding β̂, rather than on memorising
the formulae 6.7 and 6.8 for β̂0 and β̂1 .
6. Geometrical interpretation
We have a vector y = (y1 , y2 , . . . , yn )T of observed responses, i.e. a point in n-dimensional space,
together with a surface S representing possible joint predicted values under the model (for simple
linear regression, it’s the 2-dimensional surface β0 + β1 x for real values of β0 and β1 ).
Minimising Σ(yi − ŷi )² is equivalent to dropping a perpendicular from the point y to the surface S;
the perpendicular hits the surface at ŷ. Thus we are literally finding the model closest to the data.
6.4 Problems
1. Show that the expression Σ xi yi − n x̄ ȳ occurring in the formula for β̂1 could also be written as
Σ(xi − x̄)(yi − ȳ), Σ(xi − x̄)yi , or Σ xi (yi − ȳ).
2. Show that the ‘residual sum of squares’, Σi (yi − ŷi )², satisfies the following identity:

Σi (yi − ŷi )² = Σi (yi − β̂0 − β̂1 xi )² = Σi (yi − ȳ)² − β̂1 Σi (xi − x̄)(yi − ȳ),

where all sums run over i = 1, . . . , n.
3. For the captopril data, find the least squares lines
(a) to predict SBP before captopril from DBP before captopril,
(b) to predict SBP after captopril from DBP after captopril,
(c) to predict DBP before captopril from SBP before captopril.
Compare these three lines.
Discuss whether it is sensible to combine the before and after measurements in order to obtain a
better prediction of SBP at a given time from DBP measured at that time.
4. Illustrate the geometrical interpretation of least squares (see above comments) in the following two
cases
(a) model E[Y |x] = β0 + β1 x with 3 data points (x1 , y1 ), (x2 , y2 ) and (x3 , y3 ),
(b) model E[Y |x] = βx with 2 data points (x1 , y1 ) and (x2 , y2 ).
What does Pythagoras’ theorem tell us in the second case?
6.5 The Normal Linear Model (NLM)

6.5.1 Introduction
Definition 6.5 (NLM)
Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables xTi , the
NLM makes the following assumptions:
1. (Conditional) Independence
The Yi are mutually independent given the xTi .
2. Linearity
The expected value of the response variable is linearly related to the unknown parameters β:
EYi = xTi β.
3. Normality
The random variation Yi |xi is Normally distributed.
4. Homoscedasticity (Equal Variances)
i.e. Yi |xi ∼ N(xTi β, σ 2 ).
6.5.2 Matrix Formulation of NLM
The NLM for responses y = (y1 , y2 , . . . , yn )T can be recast as follows
1. E[Y] = Xβ for some parameter vector β = (β1 , β2 , . . . , βp )T ,
2. ε = Y − E[Y] ∼ MVN(0, σ² I), where I is the (n × n) identity matrix.
It can be shown that the least squares estimates of β are given by solving the simultaneous linear equations

XT y = XT Xβ    (6.9)

(the normal equations), with solution (assuming that XT X is nonsingular)

β̂ = (XT X)−1 XT y.    (6.10)
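Equation 6.10 can be sketched directly in code; a self-contained example in plain Python (a tiny Gaussian elimination routine is included so no libraries are assumed, and the data are invented to lie exactly on y = 1 + 2x1 + 3x2):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y):
    """Least squares via the normal equations X'X beta = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Rows are (1, x1, x2); responses follow y = 1 + 2*x1 + 3*x2 exactly.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
beta = fit(X, y)
print([round(b, 6) for b in beta])   # [1.0, 2.0, 3.0]
```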
Comments
1. Note that, by formula 6.10, each estimator β̂j is a linear combination of the Yi s.
Therefore under the NLM, β̂ has a MVN distribution.
2. Even if the Normality assumption doesn’t hold, the CLT implies that, provided the number n of
cases is large, the distribution of the estimator β̂ will still be approximately MVN.
3. The most important assumption is independence, since it’s relatively easy to modify the standard
NLM to account for
• nonlinearity: transform the data, or include e.g. x2ij as an explanatory variable,
• unequal variances (‘heteroscedasticity’): e.g. transform from yi − ŷi to zi = (yi − ŷi )/σ̂i ,
• non-Normality: transform, or simply get more data!
4. In the general formulation the constant term β0 is omitted, though in practice the first column of the
matrix X will often contain 1’s and the corresponding parameter β1 will be the ‘constant term’.
5. The corresponding fitted values are ŷ = Xβ̂, and the vector of residuals is r = y − ŷ,
i.e. ri = yi − ŷi , where ŷi = xTi β̂ = Σj xij β̂j (sum over j = 1, . . . , p).
Definition 6.6 (RSS)
The residual sum of squares (RSS) in the fitted NLM is
s² = (y − Xβ̂)T (y − Xβ̂) = Σi (yi − ŷi )².    (6.11)
Important Fact about the RSS
Considering the RSS s² to be the observed value of a corresponding RV S², it can be shown that
• S²/σ² ∼ χ²(n−p) ,
• S² is independent of β̂.
Exercise 6.2
1. Show that the log-likelihood function for the NLM is
(constant) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)T (y − Xβ).    (6.12)
2. Show that the maximum likelihood estimate of β is identical to the least squares estimate.
What is the distribution of β̂?
3. Show that the MLE σ̂² of σ² is

σ̂² = s²/n.    (6.13)

What are the mean and variance of σ̂²?
4. Show that an unbiased estimator of σ² is given by the formula

(Residual Sum of Squares) / (Residual Degrees of Freedom).
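The competing variance estimates can be compared on a toy fit (plain Python; the four data points are invented, and the intercept and slope below are their least squares values from formulae 6.7 and 6.8):

```python
# Simple regression fit y = 0.1 + 0.6 x to four invented points, then
# RSS, the MLE s^2/n, and the unbiased estimate RSS/(n - p) with p = 2.
xs, ys = [0, 1, 2, 3], [0, 1, 1, 2]
b0, b1 = 0.1, 0.6                       # least squares estimates for these data
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
n, p = len(xs), 2
print(round(rss, 3), round(rss / n, 3), round(rss / (n - p), 3))
# RSS = 0.2; MLE = 0.05; unbiased estimate = 0.1
```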
k

6.5.3 Examples of the NLM
1. Simple Linear Regression (again)
Yi = β0 + β1 xi + εi ,    (6.14)

where the εi are IID N (0, σ²).
2. Two-sample t-test
y = (x1 , x2 , . . . , xm , y1 , y2 , . . . , yn )T ,

X is the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (0, 1),

β = (β0 , β1 )T ,    (6.15)

and we’re interested in the hypothesis H0 : (β0 − β1 ) = 0.
3. Paired t-test
Some quantity Y is measured on each of n individuals under 2 different conditions (e.g. drugs A
and B), and we want to test whether the mean of Y can be assumed equal in both circumstances.
y = (y11 , y21 , . . . , yn1 , y12 , y22 , . . . , yn2 )T ,

X is the 2n × (n + 1) matrix obtained by stacking the n × n identity matrix twice, with a final
column that is 0 for the first measurements and 1 for the second,

β = (α1 , α2 , . . . , αn , δ)T ,    (6.16)
where δ is the difference between the expected responses under the two conditions, and the αi are
‘nuisance parameters’ representing the overall level of response for the ith individual.
The null hypothesis is H0 : δ = 0.
4. Multiple Regression (example thereof)
Y = SBP after captopril, x1 = SBP before captopril, x2 = DBP before captopril,

y = (201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131)T ,

X has rows (1, x1i , x2i ) with

x1 = (210, 169, 187, 160, 167, 176, 185, 206, 173, 146, 174, 201, 198, 148, 154)T ,
x2 = (130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100)T ,

β = (β0 , β1 , β2 )T ,    (6.17)
where (roughly speaking) β1 represents the increase in EY per unit increase in SBP before captopril
(x1 ), allowing for the fact that EY also depends partly on DBP before captopril (x2 ), and β2 has a
similar interpretation in terms of the effect of x2 allowing for x1 .
In all the above examples, it’s straightforward to calculate β̂ = (XT X)−1 XT y, and also (for example) to
calculate the sampling distribution of β̂i under the null hypothesis H0 : βi = 0.
Exercise 6.3
Verify the following calculations from the data given in 6.17 above:
          15     2654    1685                2370
XT X =   2654  475502  300137 ,    XT y =  424523 ,
         1685  300137  190817              268373

              8.563     −0.009165  −0.06120               −20.7
(XT X)−1 =   −0.009165   0.0003026 −0.0003951 ,    β̂ =     0.724 .
             −0.06120   −0.0003951  0.001167               0.450
k
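These calculations can be verified numerically; a self-contained sketch in plain Python (Cramer's rule is used just to keep the 3 × 3 solve dependency-free; the data are those given in 6.17):

```python
# Check beta-hat = (X'X)^{-1} X'y for the captopril example by solving
# the 3x3 normal equations with Cramer's rule.
y  = [201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131]
x1 = [210, 169, 187, 160, 167, 176, 185, 206, 173, 146, 174, 201, 198, 148, 154]
x2 = [130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100]
X = [[1, a, b] for a, b in zip(x1, x2)]

A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]  # X'X
c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]            # X'y

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

d = det3(A)
beta = []
for i in range(3):
    Ai = [row[:] for row in A]
    for r in range(3):
        Ai[r][i] = c[r]
    beta.append(det3(Ai) / d)
print([round(b, 3) for b in beta])   # ≈ [-20.7, 0.724, 0.450]
```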
6.6 Checking Assumptions of the NLM

Clearly it’s very important in practice to check that your assumptions seem reasonable; there are various
ways to do this.
6.6.1 Formal hypothesis testing

χ² tests are not very powerful, but are simple and general: count the number of data points satisfying
various (exhaustive & mutually exclusive) conditions, and compare with the expected counts under your
assumptions.
Other tests, for example to test for Normality, have been devised. However, a general problem with
statistical tests is that they don’t usually suggest what to do if your null hypothesis is rejected.
Exercise 6.4
How might you use a χ2 test to check whether SBP after captopril is independent of SBP before captopril?
k
Exercise 6.5
A possible test for linearity in the simple Normal linear regression model (i.e. the NLM with just one
explanatory variable x) is to fit the quadratic NLM
EY = β0 + β1 x + β2 x²    (6.18)
and test the null hypothesis H0 : β2 = 0.
Suppose that Y is SBP and x is dose of drug, and that you have rejected the above null hypothesis.
Comment on the advisability of using Formula 6.18 for predicting Y given x.
k
6.6.2 Graphical Methods and Residuals
If all the assumptions of the NLM are valid, then the residuals
ri = yi − ŷi = yi − xTi β̂    (6.19)
should resemble observations on IID Normal random variables.
Therefore plots of ri against ANYTHING should be patternless (SEE LECTURE).
Comments
1. Before fitting a formal statistical model (including e.g. performing a t-test), you should plot the data,
particularly the response variable against each explanatory variable.
2. After fitting a model, produce several residual plots. The computer is your friend!
3. Note that it’s the residual plots that are most informative. For example, the NLM DOESN’T assume
that the Yi are Normally distributed about µY , but DOES assume that each Yi is Normally distributed
about EYi |xi .
i.e. it’s the conditional distributions, not the marginal distributions, that are important.
6.7 Problems
1. Show that the following is an equivalent formulation of the two-sample t-test to that given above in
Formulae 6.15
Y = (x1 , x2 , . . . , xm , y1 , y2 , . . . , yn )T ,

X is the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (1, 1),

β = (β0 , β1 )T ,    (6.20)
with null hypothesis H0 : β1 = 0.
2. Independent samples of 10 U.S. men aged 25–34 years, and 15 U.S. men aged 45–54 years were taken.
Their heights (in inches) were as follows:
(a) Age 25–34
73.3 64.8 72.1 68.9 68.7 70.4 66.8 70.7 74.4 71.8
(b) Age 45–54
73.2 68.5 62.4 65.5 71.3 69.5 74.5 70.6 69.3 67.1 64.7 73.0 66.7 68.1 64.3
Use a two-sample t-test to test the hypothesis that the population means of the two age-groups are
equal (the 90%, 95%, 97.5%, and 99% points of the t23 distribution are 1.319, 1.714, 2.069 and 2.500
respectively).
Comment on whether the underlying assumptions of the two-sample t-test appear reasonable for this
set of data.
Comment also on whether the data can be used to suggest that the population of the U.S. has (or
hasn’t) tended to get taller over the last 20 years.
3. The following data-set shows average January minimum temperature in degrees Fahrenheit (y), together with Latitude (x1 ) and Longitude (x2 ) for 28 US cities. Plot y against x1 , and comment on
what this plot suggests about the reasonableness of the various assumptions underlying the NLM for
predicting y from x1 and x2 .
 y   x1     x2       y   x1     x2       y   x1     x2
44  31.2   88.5     38  32.9   86.8     35  33.6  112.5
31  35.4   92.8     47  34.3  118.7     42  38.4  123.0
15  40.7  105.3     22  41.7   73.4     26  40.5   76.3
30  39.7   77.5     45  31.0   82.3     65  25.0   82.0
58  26.3   80.7     37  33.9   85.0     22  43.7  117.1
19  42.3   88.0     21  39.8   86.9     11  41.8   93.6
22  38.1   97.6     27  39.0   86.5     45  30.8   90.2
12  44.2   70.5     25  39.7   77.3     23  42.7   71.4
21  43.1   83.9      2  45.9   93.9     24  39.3   90.5
 8  47.1  112.4
Data from HSDS, set 262
4. Table 6.1, originally from Narula & Wellington (1977), shows data on selling prices of 28 houses
in Erie, Pennsylvania, together with explanatory variables that could be used to predict the selling
price. The variables are:
X1 = current taxes (local, school and county) ÷ 100,
X2 = number of bathrooms,
X3 = lot size ÷ 1000 (square feet),
X4 = living space ÷ 1000 (square feet),
X5 = number of garage spaces,
X6 = number of rooms,
X7 = number of bedrooms,
X8 = age of house (years),
X9 = number of fireplaces,
Y = actual sale price ÷ 1000 (dollars).
Find a function of X1 –X9 that predicts Y reasonably accurately (such functions are used to fix
property taxes, which should be based on the current market value of each property).
   X1     X2    X3      X4     X5   X6  X7  X8  X9    Y
 4.9176   1.0   3.4720  0.9980 1.0   7   4  42   0   25.9
 5.0208   1.0   3.5310  1.5000 2.0   7   4  62   0   29.5
 4.5429   1.0   2.2750  1.1750 1.0   6   3  40   0   27.9
 4.5573   1.0   4.0500  1.2320 1.0   6   3  54   0   25.9
 5.0597   1.0   4.4550  1.1210 1.0   6   3  42   0   29.9
 3.8910   1.0   4.4550  0.9880 1.0   6   3  56   0   29.9
 5.8980   1.0   5.8500  1.2400 1.0   7   3  51   1   30.9
 5.6039   1.0   9.5200  1.5010 0.0   6   3  32   0   28.9
15.4202   2.5   9.8000  3.4200 2.0  10   5  42   1   84.9
14.4598   2.5  12.8000  3.0000 2.0   9   5  14   1   82.9
 5.8282   1.0   6.4350  1.2250 2.0   6   3  32   0   35.9
 5.3003   1.0   4.9883  1.5520 1.0   6   3  30   0   31.5
 6.2712   1.0   5.5200  0.9750 1.0   5   2  30   0   31.0
 5.9592   1.0   6.6660  1.1210 2.0   6   3  32   0   30.9
 5.0500   1.0   5.0000  1.0200 0.0   5   2  46   1   30.0
 8.2464   1.5   5.1500  1.6640 2.0   8   4  50   0   36.9
 6.6969   1.5   6.9020  1.4880 1.5   7   3  22   1   41.9
 7.7841   1.5   7.1020  1.3760 1.0   6   3  17   0   40.5
 9.0384   1.0   7.8000  1.5000 1.5   7   3  23   0   43.9
 5.9894   1.0   5.5200  1.2560 2.0   6   3  40   1   37.5
 7.5422   1.5   4.0000  1.6900 1.0   6   3  22   0   37.9
 8.7951   1.5   9.8900  1.8200 2.0   8   4  50   1   44.5
 6.0931   1.5   6.7265  1.6520 1.0   6   3  44   0   37.9
 8.3607   1.5   9.1500  1.7770 2.0   8   4  48   1   38.9
 8.1400   1.0   8.0000  1.5040 2.0   7   3   3   0   36.9
 9.1416   1.5   7.3262  1.8310 1.5   8   4  31   0   45.8
12.0000   1.5   5.0000  1.2000 2.0   6   3  30   1   41.0
Table 6.1: House price data
Weisberg (1980)
5. (a) Assuming the model
E[Y |x] = β0 + β1 x,
Var[Y |x] = σ 2 independently of x,
derive formulae for the least squares estimates βb0 and βb1 from data (xi , yi ), i = 1, . . . , n.
What advantages are gained if the corresponding random variables Yi |xi can be assumed to be
independently Normally distributed?
(b) The following table shows the tensile strength (y) of different batches of cement after being
‘cured’ (dried) for various lengths of time x: 3 batches were cured for 1 day, 3 for 2 days, 5 for
3 days, etc. The batch means and standard deviations (s.d.) are also given.
Curing time   Tensile strength y (kg/cm²)        mean   s.d.
x (days)
  1           13.0 13.3 11.8                     12.7   0.8
  2           21.9 24.5 24.7                     23.7   1.6
  3           29.8 28.0 24.1 24.1 26.2           26.5   2.5
  7           32.4 30.4 34.5 33.1 35.7           33.2   2.0
 28           41.8 42.6 40.3 35.7 37.3           40.0   3.0
Plot y against x and discuss briefly how reasonable seem each of the following assumptions:
(i) linearity: E[Yi |xi ] = β0 + β1 xi for some constants β0 and β1 .
(ii) independence: the Yi are mutually independent given the xi .
If conditional independence (ii) is assumed true, then how reasonable here are the further assumptions:
(iii) homoscedasticity: Var[Yi |xi ] = σ 2 for all i = 1, . . . , n,
(iv) Normality: the random variables Yi are each Normally distributed.
Say briefly whether you consider any of the above assumptions (i)–(iv) would be more plausible
following
(A) transforming from y to y 0 = loge (y), and/or
(B) transforming x in an appropriate way.
NOTE: you do not need to carry out numerical calculations such as finding the
least-squares fit explicitly.
From Warwick ST217 exam 2000
6. The numbers of ‘hits’ recorded on J.E.H.Shaw’s WWW homepage in late 1999 are given below. ‘Local’
means the homepage was accessed from within Warwick University, ‘Remote’ means it was accessed
from outside. Data for the week beginning 7–Nov–1999 were unavailable. Note that there was an
exam on Wednesday 8–Dec–1999 for the course ST104, taught by J.E.H.Shaw.
Week             Number of Hits
Beginning     Local  Remote  Total
26 Sept           0     182    182
 3 Oct           35     253    288
10 Oct          901     315   1216
17 Oct          641     443   1084
24 Oct         1549     525   2074
31 Oct          823     344   1167
 7 Nov            —       —      —
14 Nov         1136     383   1519
21 Nov         2114     584   2698
28 Nov         2097     536   2633
 5 Dec         3732     461   4193
12 Dec            5     352    357
19 Dec            0     296    296
(a) Fit a linear least-squares regression line to predict the number of remote hits (Y ) in a week from
the observed number x of local hits.
(b) Calculate the residuals and plot them against date. Does the plot give any evidence that the
interrelationship between X and Y changes over time?
(c) Using both general considerations and residual plots, comment on how reasonable here are the
assumptions underlying the simple Normal linear regression model, and suggest possible ways
to improve the prediction of Y .
7. The following table shows the assets x (billions of dollars) and net income y (millions of dollars) for
the 20 largest US banks in 1973.
Bank    x      y    Bank    x      y    Bank    x      y    Bank    x      y
  1   49.0  218.8     6   14.2   63.6    11   11.6   42.9    16    6.7   42.7
  2   42.3  265.6     7   13.5   96.9    12    9.5   32.4    17    6.0   28.9
  3   36.3  170.9     8   13.4   60.9    13    9.4   68.3    18    4.6   40.7
  4   16.4   85.9     9   13.2  144.2    14    7.5   48.6    19    3.8   13.8
  5   14.9   88.1    10   11.8   53.6    15    7.2   32.2    20    3.4   22.2
(a) Plot income (y) against assets (x), and also log(income) against log(assets).
(b) Verify that the least squares fit regression lines are
fit 1: y = 4.987 x + 7.57,
fit 2: log(y) = 0.963 log(x) + 1.782    (Note: logs to base e),
and show the fitted lines on your plots.
(c) Produce Normal probability plots of the residuals from each fit.
(d) Which (if either) of these models would you use to describe the relationship between total assets
and net income? Why?
(e) Bank number 19 (the Franklin National Bank) failed in 1974, and was the largest ever US bank to
fail. Identify the point representing this bank on each of your plots, and discuss briefly whether,
from the data presented, one might have expected beforehand that the Franklin National Bank
was in trouble.
8. The following data show the blood alcohol levels (mg/100ml) at post mortem for traffic accident
victims. Blood samples in each case were taken from the leg (A) and from the heart (B). Do these
results indicate that blood alcohol levels differ systematically between samples from the leg and the
heart?
Case    A     B      Case    A     B
  1     44    44      11    265   277
  2    265   269      12     27    39
  3    250   256      13     68    84
  4    153   154      14    230   228
  5     88    83      15    180   187
  6    180   185      16    149   155
  7     35    36      17    286   290
  8    494   502      18     72    80
  9    249   249      19     39    50
 10    204   208      20    272   290
Osborn (1979) 4.6.5
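A paired analysis treats the 20 heart-minus-leg differences as a single sample; a sketch in plain Python (the pairing of A and B values follows the table above):

```python
from math import sqrt

# Paired t statistic for heart (B) minus leg (A) blood alcohol levels.
A = [44, 265, 250, 153, 88, 180, 35, 494, 249, 204,
     265, 27, 68, 230, 180, 149, 286, 72, 39, 272]
B = [44, 269, 256, 154, 83, 185, 36, 502, 249, 208,
     277, 39, 84, 228, 187, 155, 290, 80, 50, 290]
d = [b - a for a, b in zip(A, B)]
n = len(d)
dbar = sum(d) / n                               # mean difference 5.8
s2 = sum((x - dbar) ** 2 for x in d) / (n - 1)  # sample variance of differences
t = dbar / sqrt(s2 / n)
print(round(t, 2))                              # t ≈ 4.37 on 19 d.f.
```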
9. (a) Explain what is meant by a linear model with response vector Y = (Y1 , Y2 , . . . , Yn )T , design
matrix X, parameter vector β = (β1 , β2 , . . . , βp )T and error variance σ².
(b) Suppose that XT X is nonsingular. By writing Y − Xβ = (Y − Xβ̂) + X(β̂ − β), or otherwise,
show that

(Y − Xβ)T (Y − Xβ)

is minimised at β = β̂ = (XT X)−1 XT Y.
(c) Show that E[β̂] = β and that Var[β̂] = σ² (XT X)−1 .
(d) Let A = X(XT X)−1 XT and let In denote the n × n identity matrix. Show that A and In − A
are both idempotent, i.e. AA = A and (In − A)(In − A) = In − A.
(e) For the particular case of a Normal linear model, find the joint distribution of the fitted values
Ŷ = Xβ̂, and show that Y − Ŷ is independent of Ŷ. Quote carefully any properties of the
Normal distribution you use.
From Warwick ST217 exam 1999
10. Verify that the least squares estimates in simple linear regression

β̂1 = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²),    β̂0 = ȳ − β̂1 x̄,

are a special case of the general formula β̂ = (XT X)−1 XT y.
6.8 The Analysis of Variance (ANOVA)

6.8.1 One-Way Analysis of Variance: Introduction
This is a generalization of the two-sample t-test to p > 2 groups.
Suppose there are observations yij (j = 1, 2, . . . , ni ) in the ith group (i = 1, 2, . . . , p),
and let n = n1 + n2 + · · · + np denote the total number of observations.
Denote the corresponding RVs by Yij , and assume that Yij ∼ N (βi , σ 2 ) independently.
Traditionally the main aim has been to test the null hypothesis
H0 : β1 = β2 = . . . = βp ,    i.e. β = β0 = (β0 , β0 , . . . , β0 )T .
The idea is to fit MLEs β̂ and β̂0 and apply a likelihood ratio test, i.e. test whether the ratio

change in RSS / RSS = (squared distance from ŷ to ŷ0 ) / (squared distance from y to ŷ)

(where ŷ and ŷ0 are the corresponding fitted values) is larger than would be expected by chance.
A useful notation for group means etc. uses overbars and ‘+’ suffices as follows:

ȳi+ = (1/ni ) Σj yij ,    ȳ++ = (1/n) Σi Σj yij = (1/n) Σi ni ȳi+ ,

etc.
The underlying models fit naturally in the NLM framework:
Definition 6.7 (One-Way ANOVA)
The one-way ANOVA model is a NLM of the form
Y ∼ MVN(Xβ, σ 2 I),
where
Y = (Y1 , Y2 , . . . , Yn )T ,    β = (β1 , β2 , . . . , βp )T ,

and X is the n × p matrix whose rows are the indicator vectors (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . ,
(0, . . . , 0, 1),    (6.21)
where X has n1 rows of the first type, . . . np rows of the last type, and n1 + n2 + · · · + np = n.
Exercise 6.6
Show that for one-way ANOVA, XT X = diag(n1 , n2 , . . . , np ), and hence β̂ = (Ȳ1+ , Ȳ2+ , . . . , Ȳp+ )T .
k
6.8.2 One-Way Analysis of Variance: ANOVA Table
Let

β0 = E[Ȳ++ ] = (1/n) Σi Σj E Yij = (1/n) Σi ni βi ,
αi = βi − β0    (i = 1, 2, . . . , p).
Typically the p groups correspond to p different treatments, and αi is then called the ith treatment effect.
We’re interested in the hypotheses
H0 : αi = 0 (i = 1, 2, . . . , p),
H1 : the αi are arbitrary.
Note that
1. Ȳ++ is the MLE of β0 under H0 ,
2. Ȳi+ is the MLE of β0 + αi , i.e. the mean response given the ith treatment.
Hence the fitted values under H0 and H1 are given by Ȳ++ and Ȳi+ respectively.
If we also include the ‘null model’ that all the βi are zero, then the possible models of interest are:
Model                                 # params   d.f.    RSS
βi = 0 ∀ i,    i.e. ŷij = 0               0       n      Σi,j yij²             (1)
βi = β0 ∀ i,   i.e. ŷij = ȳ++             1      n − 1   Σi,j (yij − ȳ++ )²    (2)
βi arbitrary,  i.e. ŷij = ȳi+             p      n − p   Σi,j (yij − ȳi+ )²    (3)
The calculations needed to test H0 , involving the RSS formulae given above, can be conveniently presented
in an ‘ANOVA table’:
Source of variation   Degrees of freedom (d.f.)   Sum of squares (SS)
Overall mean                      1               (1)–(2) = n ȳ++²
Treatment                       p − 1             (2)–(3) = Σi ni (ȳi+ − ȳ++ )²
Residual                        n − p             (3) = Σi,j (yij − ȳi+ )²
Total                             n               (1) = Σi,j yij²
Note: DeGroot presents a more general version.
Finally, calculate the ‘F ratio’

F = [Treatment SS/(p − 1)] / [Residual SS/(n − p)]    (6.22)

which, under H0 , has an F distribution on (p − 1) and (n − p) d.f.
Large values of F are evidence against H0 .
Note: DON’T try to remember formulae for sums of squares in an ANOVA table.
Instead THINK OF THE MODELS BEING FITTED. The ‘lack of fit’ of each model is given by the
corresponding RSS, & the formulae for the differences in RSS simplify.
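The model-comparison view above translates directly into code; a minimal one-way ANOVA sketch in plain Python (the two small groups are invented for illustration):

```python
def one_way_anova(groups):
    """Return (treatment SS, residual SS, F) for a list of samples."""
    n = sum(len(g) for g in groups)
    p = len(groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_treat = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_resid = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    F = (ss_treat / (p - 1)) / (ss_resid / (n - p))
    return ss_treat, ss_resid, F

# Two groups of three observations: F on (1, 4) d.f.
ss_t, ss_r, F = one_way_anova([[1, 2, 3], [2, 3, 4]])
print(ss_t, ss_r, F)   # 1.5 4.0 1.5
```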
6.9 Problems
1. Show that the formulae for sums of squares in one-way ANOVA simplify:
Σi ni (Ȳi+ − Ȳ++ )² = Σi ni Ȳi+² − n Ȳ++² ,

Σi Σj (Yij − Ȳi+ )² = Σi Σj Yij² − Σi ni Ȳi+² .
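Both identities can be spot-checked numerically (plain Python; the two groups are invented):

```python
# Numerical check of the two sum-of-squares identities above.
groups = [[1, 2, 3], [4, 6]]
n = sum(len(g) for g in groups)
grand = sum(sum(g) for g in groups) / n
means = [sum(g) / len(g) for g in groups]

lhs1 = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
rhs1 = sum(len(g) * m ** 2 for g, m in zip(groups, means)) - n * grand ** 2

lhs2 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
rhs2 = (sum(y ** 2 for g in groups for y in g)
        - sum(len(g) * m ** 2 for g, m in zip(groups, means)))

print(round(lhs1, 9) == round(rhs1, 9), round(lhs2, 9) == round(rhs2, 9))
```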
2. (a) Define the Normal Linear Model, and describe briefly how each of its assumptions may be
informally checked by plotting residuals.
(b) The following data summarise the number of days survived by mice inoculated with three strains
of typhoid (31 mice with ‘9D’, 60 mice with ‘11C’ and 133 mice with ‘DSCI’).
Days to    Numbers of Mice Inoculated with. . .
Death       9D   11C   DSCI   Total
  2          6     1      3      10
  3          4     3      5      12
  4          9     3      5      17
  5          8     6      8      22
  6          3     6     19      28
  7          1    14     23      38
  8          —    11     22      33
  9          —     4     14      18
 10          —     6     14      20
 11          —     2      7       9
 12          —     3      8      11
 13          —     1      4       5
 14          —     —      1       1
Total       31    60    133     224
Σ Xi       125   442   1037    1604
Σ Xi²      561  3602   8961   13124
(Xi is the survival time of the ith mouse in the given group).
Without carrying out any calculations, discuss briefly how reasonable seem the assumptions
underlying a one-way ANOVA on the data, and whether a transformation of the data may be
appropriate.
(c) Carry out a one-way ANOVA on the untransformed data. What do you conclude about the
responses to the three strains of typhoid?
From Warwick ST217 exam 1997
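Part (c) can be carried out entirely from the summary rows Σ Xi and Σ Xi² in the table; a sketch in plain Python (the formulae follow from expanding the sums of squares as in Problem 1 of Section 6.9):

```python
# One-way ANOVA for the typhoid data using only the group summaries.
ns     = [31, 60, 133]          # numbers of mice per strain
sums   = [125, 442, 1037]       # sum of survival times per strain
sum_sq = [561, 3602, 8961]      # sum of squared survival times per strain

n, G = sum(ns), sum(sums)
between = sum(t * t / ni for t, ni in zip(sums, ns))
ss_treat = between - G * G / n          # treatment sum of squares
ss_resid = sum(sum_sq) - between        # residual sum of squares
p = len(ns)
F = (ss_treat / (p - 1)) / (ss_resid / (n - p))
print(round(F, 1))                      # F ≈ 31.1 on (2, 221) d.f.
```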
3. The amount of nitrogen-bound bovine serum albumin produced by three groups of mice was measured.
The groups were: normal mice treated with a placebo (i.e. an inert substance), alloxan-diabetic mice
treated with a placebo, and alloxan-diabetic mice treated with insulin. The resulting data are shown
in the following table:
Normal        Alloxan-diabetic   Alloxan-diabetic
+ placebo     + placebo          + insulin
156           391                 82
282            46                100
197           469                 98
297            86                150
116           174                243
127           133                 68
119            13                228
 29           499                131
253           168                 73
122            62                 18
349           127                 20
110           276                100
143           176                 72
 64           146                133
 26           108                465
 86           276                 40
122            50                 46
455            73                 34
655                               44
 14
(a) Produce appropriate graphical display(s) and numerical summaries of these data, and comment
on what can be learnt from these.
(b) Carry out a one-way analysis of variance on the three groups. You may feel it necessary to
transform the data first.
Data from HSDS, set 304
4. The following table shows measurements of the steady-state haemoglobin levels for patients with
different types of sickle-cell anaemia (‘HB SS’, ‘HB S/β-thalassaemia’ and ‘HB SC’). Construct an
ANOVA table and hence test whether the steady-state haemoglobin levels differ between the three
types.
HB SS   HB S/β-thalassaemia   HB SC
 7.2      8.1                  10.7
 7.7      9.2                  11.3
 8.0     10.0                  11.5
 8.1     10.4                  11.6
 8.3     10.6                  11.7
 8.4     10.9                  11.8
 8.4     11.1                  12.0
 8.5     11.9                  12.1
 8.6     12.0                  12.3
 8.7     12.1                  12.6
 9.1                           12.6
 9.1                           13.3
 9.1                           13.3
 9.8                           13.8
10.1                           13.9
10.3
Data from HSDS, set 310
5. The data in Table 6.2, collected by Brian Everitt, are described in HSDS as being the ‘weights, in
kg, of young girls receiving three different treatments for anorexia over a fixed period of time with
the control group receiving the standard treatment’.
(a) Using a one-way ANOVA on the weight gains, compare the three methods of treatment.
(b) Plot the data so as to clarify the effects of the three treatments, and discuss whether the above
formal analysis was appropriate.
Cognitive behavioural      Control                  Family therapy
treatment
Weight (kg)                Weight (kg)              Weight (kg)
before  after              before  after            before  after
 80.5    82.2               80.7    80.2             83.8    95.2
 84.9    85.6               89.4    80.1             83.3    94.3
 81.5    81.4               91.8    86.4             86.0    91.5
 82.6    81.9               74.0    86.3             82.5    91.9
 79.9    76.4               78.1    76.1             86.7   100.3
 88.7   103.6               88.3    78.1             79.6    76.7
 94.9    98.4               87.3    75.1             76.9    76.8
 76.3    93.4               75.1    86.7             94.2   101.6
 81.0    73.4               80.6    73.5             73.4    94.9
 80.5    82.1               78.4    84.6             80.5    75.2
 85.0    96.7               77.6    77.4             81.6    77.8
 89.2    95.3               88.7    79.5             82.1    95.5
 81.3    82.4               81.3    89.6             77.6    90.7
 76.5    72.5               78.1    81.4             83.5    92.5
 70.0    90.9               70.5    81.8             89.9    93.8
 80.4    71.3               77.3    77.3             86.0    91.7
 83.3    85.4               85.2    84.2             87.3    98.0
 83.0    81.6               86.0    75.4
 87.7    89.1               84.1    79.5
 84.2    83.9               79.7    73.0
 86.4    82.7               85.5    88.3
 76.5    75.7               84.4    84.7
 80.2    82.6               79.6    81.4
 87.8   100.4               77.5    81.2
 83.3    85.2               72.3    88.2
 79.7    83.6               89.0    78.8
 84.5    84.6
 80.8    96.2
 87.4    86.7
Table 6.2: Anorexia data
Data from HSDS, set 285
6. To monitor an industrial process for converting ammonia to nitric acid, the percentage of ammonia
lost (y) was measured on each of 21 consecutive days, together with explanatory variables representing
air flow (x1 ), cooling water temperature (x2 ) and acid concentration (x3 ). The data, together with
the residuals after fitting the model ŷ = 3.614 + 0.072 x1 + 0.130 x2 − 0.152 x3 , are given in the
following table:
         Air     Water   Acid
Day   y  Flow    Temp.   Conc.   Resid.
         (x1 )   (x2 )   (x3 )
 1   4.2   80     27     58.9     0.323
 2   3.7   80     27     58.8    −0.192
 3   3.7   75     25     59.0     0.456
 4   2.8   62     24     58.7     0.570
 5   1.8   62     22     58.7    −0.171
 6   1.8   62     23     58.7    −0.301
 7   1.9   62     24     59.3    −0.239
 8   2.0   62     24     59.3    −0.139
 9   1.5   58     23     58.7    −0.314
10   1.4   58     18     58.0     0.127
11   1.4   58     18     58.9     0.264
12   1.3   58     17     58.8     0.278
13   1.1   58     18     58.2    −0.143
14   1.2   58     19     59.3    −0.005
15   0.8   50     18     58.9     0.236
16   0.7   50     18     58.6     0.091
17   0.8   50     19     57.2    −0.152
18   0.8   50     19     57.9    −0.046
19   0.9   50     20     58.0    −0.060
20   1.5   56     20     58.2     0.141
21   1.5   70     20     59.1    −0.724
Some residual plots are shown on the next page (Fig. 6.1).
(a) Discuss whether the pattern of residuals casts doubt on any of the assumptions underlying the
Normal Linear Model (NLM).
Describe any further plots or calculations that you think would help you assess whether the
fitted NLM is appropriate here.
Continued. . .
89
(b) Various suggestions could be made for improving the model, such as
i. transforming the response (e.g. to log y or to y/x1 ),
ii. transforming some or all of the explanatory variables,
iii. deleting outliers,
iv. including quadratic or even higher-order terms (e.g. x2² ),
v. including interaction terms (e.g. x1 x3 ),
vi. carrying out a nonparametric analysis of the data,
vii. applying a bootstrap procedure,
viii. fitting a nonlinear model.
Outline the merits and disadvantages of each of these suggestions here. What would be your
next step in analysing this data-set?
Figure 6.1: Residual plots
From Warwick ST217 exam 1999
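The stated fit can be reproduced numerically from the tabulated data; a least-squares sketch (for the NLM, least squares coincides with maximum likelihood, and `lstsq` is a library routine, not part of the course notation):

```python
import numpy as np

# Least-squares fit of y on x1, x2, x3 for the ammonia data above.
y  = np.array([4.2, 3.7, 3.7, 2.8, 1.8, 1.8, 1.9, 2.0, 1.5, 1.4, 1.4,
               1.3, 1.1, 1.2, 0.8, 0.7, 0.8, 0.8, 0.9, 1.5, 1.5])
x1 = np.array([80, 80, 75, 62, 62, 62, 62, 62, 58, 58, 58, 58, 58, 58,
               50, 50, 50, 50, 50, 56, 70], dtype=float)
x2 = np.array([27, 27, 25, 24, 22, 23, 24, 24, 23, 18, 18, 17, 18, 19,
               18, 18, 19, 19, 20, 20, 20], dtype=float)
x3 = np.array([58.9, 58.8, 59.0, 58.7, 58.7, 58.7, 59.3, 59.3, 58.7, 58.0,
               58.9, 58.8, 58.2, 59.3, 58.9, 58.6, 57.2, 57.9, 58.0, 58.2, 59.1])

X = np.column_stack([np.ones_like(y), x1, x2, x3])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)         # (b0, b1, b2, b3)
resid = y - X @ beta                                 # residuals for the plots
```

With an intercept in the model the residuals sum to zero, and their pattern (e.g. the large negative residual on day 21) can then be examined as in part (a).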
90
7. The following data come from a study of pollution in inland waterways. In each of seven localities,
five pike were caught and the log concentration of copper in their livers measured.
Locality                   Log concentration of copper (ppm)
1. Windermere              0.187    0.836    0.704    0.938    0.124
2. Grassmere               0.449    0.769    0.301    0.045    0.846
3. River Stour             0.628    0.193    0.810    0.000    0.855
4. Wimbourne St Giles      0.412    0.286    0.497    0.417    0.337
5. River Avon              0.243    0.258   −0.276   −0.538    0.041
6. River Leam              0.134    0.281    0.529    0.305    0.459
7. River Kennett           0.471    0.371    0.297    0.691    0.535
(a) The data are plotted in Figure 6.2. Discuss briefly what the plot suggests about the relative
copper pollution in the various localities.
Figure 6.2: Concentration of copper in pike livers
(b) Carry out a one-way analysis of variance to test for differences between localities. Do the
results of the formal analysis agree with your subjective impressions from
Figure 6.2?
91
6.10 Two-Way Analysis of Variance
Here there are 2 factors (e.g. 2 treatments, or patient number and treatment given) that can be varied
independently.
Factor A has I ‘levels’ (e.g. treatment 1, 2, . . . , I)
Factor B has J ‘levels’ (e.g. supplementary treatment 1, 2, . . . , J, or patient number 1, 2, . . . , J, every patient
receiving each treatment in turn).
Data can be conveniently tabulated:

                     Factor B
Factor A      1      2     ...     J
   1         Y11    Y12    ...    Y1J
   2         Y21    Y22    ...    Y2J
   3         Y31    Y32    ...    Y3J
   .          .      .             .
   I         YI1    YI2    ...    YIJ
i.e. there is precisely one observation Yij at each (i, j) combination of factor levels.
Again assume the NLM with

E[Yij ] = θi + φj    for i = 1, . . . , I and j = 1, . . . , J,

i.e.

Yij ∼ N (θi + φj , σ²)  independently.    (6.23)

A problem here is that one could transform θi ↦ θi + c and φj ↦ φj − c for each i and j, where c is
arbitrary. Therefore for identifiability one needs to impose some (arbitrary) constraints.
The most convenient reformulation for the two-way ANOVA model is

Yij ∼ N (µ + αi + βj , σ²),    (6.24)

where Σ_{i=1}^I αi = 0 and Σ_{j=1}^J βj = 0.
Exercise 6.7
What is the matrix formulation of the model 6.24?
k
Particular models of interest within the framework of Formulae 6.24 are:

(1) Yij ∼ N (0, σ²),
    RSS = Σi,j Yij² ,    DF = n = IJ.

(2) Yij ∼ N (µ, σ²),
    RSS = Σi,j (Yij − Ȳ++ )² ,    DF = n − 1 = IJ − 1.

92

(3) Yij ∼ N (µ + αi , σ²),    Ŷij = µ̂ + α̂i = Ȳi+ .
    Therefore RSS = Σi,j (Yij − Ȳi+ )² ,    DF = n − I = I(J − 1).

(4) Yij ∼ N (µ + βj , σ²),    Ŷij = µ̂ + β̂j = Ȳ+j .
    Therefore RSS = Σi,j (Yij − Ȳ+j )² ,    DF = n − J = (I − 1)J.

(5) Yij ∼ N (µ + αi + βj , σ²),    Ŷij = µ̂ + α̂i + β̂j = Ȳi+ + Ȳ+j − Ȳ++ .
    Therefore RSS = Σi,j (Yij − Ȳi+ − Ȳ+j + Ȳ++ )² ,    DF = n − I − J + 1 = (I − 1)(J − 1).
Again, we can form an ANOVA table summarising the independent ‘sources of variation’.
The degrees of freedom are the differences between the DFs associated with the various models.
The sums of squares are the differences between the RSSs associated with the various models.
Source of variation     Degrees of freedom (d.f.)     Sum of squares (SS)
Overall mean            1                             (1) − (2)
Effect of Factor A      I − 1                         (2) − (3)
Effect of Factor B      J − 1                         (2) − (4)
Residuals               (I − 1)(J − 1)                (5)

Total                   IJ = n                        (1)

Table 6.3: Two-way ANOVA table
Comments
1. DeGroot gives a more general version.
2. As with one-way ANOVA, one can test H0 : αi = 0, i = 1 . . . I, by comparing

   [(SS due to A)/(I − 1)] / [(Residual SS)/((I − 1)(J − 1))]

   with the 95% point of F(I−1),(I−1)(J−1) .
3. Similarly one can test H0 : βj = 0, j = 1 . . . J, by comparing

   [(SS due to B)/(J − 1)] / [(Residual SS)/((I − 1)(J − 1))]

   with the 95% point of F(J−1),(I−1)(J−1) .
4. The above two F tests are using completely separate aspects of the data (row sums of the Yij table,
column sums of the Yij table).
5. The case J = 2 is equivalent to the paired t-test (Exercise 5.4).
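The decomposition in Table 6.3 can be verified numerically on any complete I × J layout; a minimal numpy sketch (the array values below are made up purely for illustration):

```python
import numpy as np

# Two-way ANOVA decomposition, one observation per cell
# (illustrative data: I = 3 levels of Factor A, J = 4 levels of Factor B).
Y = np.array([[2.6, 2.0, 3.0, 2.1],
              [2.9, 2.5, 3.1, 3.0],
              [2.9, 2.6, 3.4, 2.6]])
I, J = Y.shape
grand = Y.mean()              # Ybar_++
row = Y.mean(axis=1)          # Ybar_i+
col = Y.mean(axis=0)          # Ybar_+j

SS_A = J * ((row - grand) ** 2).sum()                              # I - 1 d.f.
SS_B = I * ((col - grand) ** 2).sum()                              # J - 1 d.f.
SS_res = ((Y - row[:, None] - col[None, :] + grand) ** 2).sum()    # (I-1)(J-1) d.f.
SS_tot = ((Y - grand) ** 2).sum()                                  # IJ - 1 d.f.

F_A = (SS_A / (I - 1)) / (SS_res / ((I - 1) * (J - 1)))
```

The identity SS_tot = SS_A + SS_B + SS_res holds exactly for a complete layout, which is what makes the table's differencing of RSSs work.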
93
6.11 Problems
1. Three pertussis vaccines were tested on each of ten days. The following table shows estimates of the
log doses of vaccine (in millions of organisms) required to protect 50% of mice against a subsequent
infection with pertussis organisms.
                Vaccine
Day        A        B        C       Total
 1        2.64     2.93     2.93      8.50
 2        2.00     2.52     2.56      7.08
 3        3.04     3.05     3.35      9.44
 4        2.07     2.97     2.55      7.59
 5        2.54     2.44     2.45      7.43
 6        2.76     3.18     3.25      9.19
 7        2.03     2.30     2.17      6.50
 8        2.20     2.56     2.18      6.94
 9        2.38     2.99     2.74      8.11
10        2.42     3.20     3.14      8.76

Total    24.08    28.14    27.32     79.54
Test the statistical significance of the differences between days and between vaccines.
Osborn (1979) 8.1.2
2. Table 6.4 shows purported IQ scores of identical twins, one raised in a foster home (Y ), and the
other raised by natural parents (X). The data are also categorised according to the social class of
the natural parents (upper, middle, low). The data come from Burt (1966), and are also available in
Weisberg (1980).
upper class
Case    1     2     3     4     5     6     7
Y      82    80    88   108   116   117   132
X      82    90    91   115   115   129   131

middle class
Case    8     9    10    11    12    13
Y      71    75    93    95    88   111
X      78    79    82    97   100   107

lower class
Case   14    15    16    17    18    19    20    21    22    23    24    25    26    27
Y      63    77    86    83    93    97    87    94    96   112   113   106   107    98
X      68    73    81    85    87    87    93    94    95    97    97   103   106   111
Table 6.4: Burt’s twin IQ data
(a) Plot the data.
(b) Fit simple linear regression models to predict Y from X within each social class.
(c) Fit parallel lines predicting Y from X within each social class (i.e. fit regression models with
the same slope in each of the three classes, but possibly different intercepts).
(d) Produce an ANOVA table and an F -test to test whether the parallelism assumption is reasonable. Comment on the calculated F ratio.
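The ‘parallel lines’ model of part (c) is an NLM with a dummy-variable design matrix; a sketch using the Burt data (the 0/1 coding below is one possible choice, not the only one):

```python
import numpy as np

# Common slope for X, separate intercept for each social class.
Y = np.array([82, 80, 88, 108, 116, 117, 132,          # upper, cases 1-7
              71, 75, 93, 95, 88, 111,                 # middle, cases 8-13
              63, 77, 86, 83, 93, 97, 87, 94, 96,
              112, 113, 106, 107, 98], dtype=float)    # lower, cases 14-27
X = np.array([82, 90, 91, 115, 115, 129, 131,
              78, 79, 82, 97, 100, 107,
              68, 73, 81, 85, 87, 87, 93, 94, 95,
              97, 97, 103, 106, 111], dtype=float)
cls = np.repeat([0, 1, 2], [7, 6, 14])                 # class labels

# Design matrix: one indicator column per class (three intercepts) plus X.
D = np.column_stack([(cls == k).astype(float) for k in range(3)] + [X])
beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
intercepts, slope = beta[:3], beta[3]
```

Comparing this fit's RSS with that of the three separate regressions of part (b) gives the F test of parallelism in part (d).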
94
3. For the two-way analysis of variance (table 6.3, page 93), find simplified formulae for the sums of
squares analogous to those found for the one-way ANOVA (exercise 6.9.1).
4. Table 4.2, page 43, presented data on the preventive effect of four different drugs on allergic response
in ten patients.
A simple way to analyse the data is via a two-way ANOVA on a suitable measure of patient response,
such as the increase in √NCF, which is tabulated below (for example, 1.95 = √3.8 − √0.0 and
1.52 = √9.2 − √2.3).
                                  Patient number
Drug     1      2      3      4      5      6      7      8      9      10
 P      1.95   1.52   0.77   0.44   0.78   1.69   0.37   0.95   1.10   0.62
 C      0.71   1.30   1.32   1.48   0.58   0.41   0.00   2.09   0.32  −0.22
 D      0.65   0.67   0.65   0.48   0.00   0.44   0.26   0.42   1.18   0.63
 K      0.19   0.54  −0.07   0.82   0.54  −0.44   0.27  −0.03   0.59   0.71
Test the statistical significance of the differences between drugs and between patients.
For we know in part, and we prophesy in part.
But when that which is perfect is come, then that which is in part shall be done away.
1 Corinthians 13:9–10
Everything should be made as simple as possible, but not simpler.
Albert Einstein
A theory is a good theory if it satisfies two requirements: it must accurately describe a large
class of observations on the basis of a model that contains only a few arbitrary elements, and
it must make definite predictions about the results of future observations.
Stephen William Hawking
The purpose of models is not to fit the data but to sharpen the question.
Samuel Karlin
Science may be described as the art of systematic oversimplification.
Sir Karl Raimund Popper
95
This page intentionally left blank (except for this sentence).
96
Chapter 7
Further Topics
7.1 Generalisations of the Linear Model
You can generalise the systematic part of the linear model, i.e. the formula for E[Y |x]
and/or the random part, i.e. the distribution of Y − E[Y |x].
7.1.1 Nonlinear Models
These are models of the form

E[Y |x] = g(x, β)    (7.1)

where Y is the response, x is a vector of explanatory variables, β = (β1 . . . βp )ᵀ is a parameter vector, and
the function g is nonlinear in the βi s.
Examples
1. Asymptotic regression:

   Yi = α − βγ^{xi} + εi    (i = 1, 2, . . . , n),    εi ∼IID N (0, σ²).

   There are four parameters to be estimated: (α, β, γ, σ²)ᵀ.
Assuming that 0 < γ < 1, we have:
(a) E[Y |x] is monotonic increasing in x,
(b) E[Y |x = 0] = α − β,
(c) as x → ∞, E[Y |x] → α.
This ‘asymptotic regression’ model might be appropriate, for example, if
(a) x = age of an animal,
y = height or weight, or
(b) x = time spent training,
y = height jumped
(for n people of similar build).
2. The ‘Michaelis-Menten’ equation in enzyme kinetics:

   E[Y |x] = β1 x / (β2 + x),

   with various possible distributional assumptions, the simplest of which is
   [Y |x] ∼ N (β1 x/(β2 + x), σ²).
97
Comments
1. Nonlinear models can be fitted, in principle, by maximum likelihood.
2. In practice one needs computers and iteration.
3. Even if the random variation is assumed to be Normal, the likelihood may have a very non-Normal
shape.
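Comments 1–2 can be illustrated by an iterative least-squares fit of the Michaelis-Menten curve; a sketch on simulated data (the true parameter values, noise level and starting values are chosen arbitrarily for the illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def mm(x, b1, b2):
    # Michaelis-Menten mean function E[Y|x] = b1*x/(b2 + x).
    return b1 * x / (b2 + x)

rng = np.random.default_rng(0)
x = np.linspace(0.2, 8.0, 40)
y = mm(x, 2.0, 1.0) + rng.normal(0.0, 0.05, size=x.size)  # Normal errors

# curve_fit iterates from the starting values p0, as comment 2 anticipates.
(b1_hat, b2_hat), cov = curve_fit(mm, x, y, p0=[1.0, 1.0])
```

Under the Normal-error assumption this least-squares fit is the maximum-likelihood fit of comment 1.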
7.1.2 Generalised Linear Models
Definition 7.1 (GLM)
A generalized linear model (GLM) has a random part and a systematic part:
Random Part
1. The ith response Yi has a probability distribution with mean µi .
2. The distributions are all of the same form (e.g. all Normal with variance σ 2 , or all Poisson, etc.)
3. The Yi s are independent.
Systematic Part

g(µi ) = xiᵀ β = Σ_{j=1}^p βj xij ,    where
1. xi = (xi1 . . . xip )T is a vector of explanatory variables,
2. β = (β1 . . . βp )T is a parameter vector, and
3. g(·) is a monotonic function called the link function.
Comments
1. If Yi ∼ N (µi , σ 2 ) and g(·) is the identity function, then we have the NLM.
2. Other GLMs typically must have their parameters estimated by maximising the likelihood numerically
(iteratively in a computer).
3. The principles behind fitting GLMs are similar to those for fitting NLMs.
Example: ‘logistic regression’
1. Random part: binary response
e.g. Yi |xi =
1 if individual i survived
0 if individual i died
(and all Yi s are conditionally independent given the corresponding xi s).
Note that µi = E[Yi |xi ] is here the probability of surviving given explanatory variables xi , and is
usually written pi or πi .
2. Systematic part:

   g(πi ) = log( πi / (1 − πi ) ).

98
Exercise 7.1
Show that under the logistic regression model, if n patients have identical explanatory variables x say, then

1. each of these n patients has probability of survival given by

   π = exp(xᵀβ) / (1 + exp(xᵀβ)),

2. the number R surviving out of n has expected value nπ and variance nπ(1 − π).
k
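The link and its inverse can be checked numerically; a minimal numpy sketch (the values of β and x are hypothetical):

```python
import numpy as np

beta = np.array([-1.0, 0.8])          # intercept and slope (hypothetical values)
x = np.array([1.0, 2.5])              # explanatory vector; leading 1 = intercept

eta = x @ beta                        # linear predictor x^T beta
pi = np.exp(eta) / (1 + np.exp(eta))  # inverse link: survival probability

logit = np.log(pi / (1 - pi))         # logit of pi recovers eta exactly

n = 50
mean_R, var_R = n * pi, n * pi * (1 - pi)   # moments of R ~ Bin(n, pi)
```

Because the logit is monotonic, any η ∈ (−∞, ∞) maps to a valid probability π ∈ (0, 1), which is the point of using a link function for a binary response.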
7.2 Simpson’s Paradox
Simpson’s paradox occurs when there are three RVs X, Y and Z, such that the conditional distributions
[X, Y |Z] show a relationship between [X|Z] and [Y |Z], but the marginal distribution [X, Y ] apparently
shows a very different relationship between X and Y . For example,
1. X(Y )=male(female) death rate, Z=age,
2. X(Y )=male(female) admission rate to University, Z=admission rate for student’s chosen course.
7.3 Problems
1. (a) Explain what is meant by
i. the Normal linear model,
ii. simple linear regression, and
iii. nonlinear regression.
(b) For simple linear regression applied to data (xi , yi ), i = 1, . . . , n, show that the maximum likelihood estimators β̂0 and β̂1 of the intercept β0 and slope β1 satisfy the simultaneous equations

    β̂0 n + β̂1 Σ_{i=1}^n xi = Σ_{i=1}^n yi

and

    β̂0 Σ_{i=1}^n xi + β̂1 Σ_{i=1}^n xi² = Σ_{i=1}^n xi yi .
Hence find βb0 and βb1 .
(c) The following table shows Y , the survival time (weeks) of leukaemia patients and x, the corresponding log of initial white blood cell count.
 x       Y        x       Y        x       Y
3.36     65      4.00    121      4.54     22
2.88    156      4.23      4      5.00      1
3.63    100      3.73     39      5.00      1
3.41    134      3.85    143      4.72      5
3.78     16      3.97     56      5.00     65
4.02    108      4.51     26
Plot the data and, without carrying out any calculations, discuss how reasonable are the assumptions underlying simple linear regression in this case.
From Warwick ST217 exam 1998
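The normal equations of part (b) can be solved directly for the data in part (c); a numerical sketch:

```python
import numpy as np

# x = log initial white blood cell count, Y = survival time (weeks).
x = np.array([3.36, 2.88, 3.63, 3.41, 3.78, 4.02, 4.00, 4.23, 3.73, 3.85,
              3.97, 4.51, 4.54, 5.00, 5.00, 4.72, 5.00])
Y = np.array([65, 156, 100, 134, 16, 108, 121, 4, 39, 143, 56, 26,
              22, 1, 1, 5, 65], dtype=float)
n = len(x)

# Normal equations in matrix form: [[n, Sx], [Sx, Sxx]] (b0, b1) = (Sy, Sxy).
A = np.array([[n, x.sum()], [x.sum(), (x * x).sum()]])
b = np.array([Y.sum(), (x * Y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, b)
```

The fitted slope is negative (survival falls with log white blood cell count), and the first normal equation forces the fitted line through (x̄, Ȳ).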
99
2. Because of concerns about sex discrimination, a study was carried out by the Graduate Division at
the University of California, Berkeley. In fall 1973, there were 8,442 male applications and 4,321
female applications to graduate school. It was found that about 44% of the men and 35% of the
women were admitted.
When the data were investigated further, it was found that just 6 of the more than 100 majors
accounted for over one-third of the total number of applicants. The data for these six majors (which
Berkeley forbids identifying by name) are summarized in the table below.
                   Men                        Women
Major     Number of    Percent      Number of    Percent
          applicants   admitted     applicants   admitted
  A          825          62           108          82
  B          560          63            25          68
  C          325          37           593          34
  D          417          33           375          35
  E          191          28           393          24
  F          373           6           341           7
Discuss the possibility of sex discrimination in admission, with particular reference to explanatory
variables, conditional probability, independence and Simpson’s paradox.
Data from Freedman et al. (1991), page 17
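The paradox is visible directly from the table; a numerical sketch of the aggregate versus per-major comparison:

```python
import numpy as np

# Berkeley admissions data for the six majors tabulated above.
men_apps   = np.array([825, 560, 325, 417, 191, 373])
men_pct    = np.array([62, 63, 37, 33, 28, 6])
women_apps = np.array([108, 25, 593, 375, 393, 341])
women_pct  = np.array([82, 68, 34, 35, 24, 7])

# Aggregate admission rates (weighted by numbers of applicants).
men_rate   = (men_apps * men_pct / 100).sum() / men_apps.sum()
women_rate = (women_apps * women_pct / 100).sum() / women_apps.sum()

# Majors in which the women's admission rate is at least the men's.
women_ge_men = (women_pct >= men_pct)
```

Aggregated over the six majors, men are admitted at a markedly higher rate, yet within four of the six majors the women's rate is at least the men's: the marginal and conditional comparisons point in different directions because women applied disproportionately to the majors with low admission rates.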
3. (a) At a party, the POTAS1 of your dreams approaches you, and says by way of introduction:
Hi—I’m working on a study of human pheromones, and need some statistical help. Can
you explain to me what’s meant by ‘logistic regression’, and why the idea’s important?
Give a brief verbal explanation of logistic regression, without (i) using any formulae, (ii) saying anything that’s technically incorrect, (iii) boring the other person senseless and ruining a
potentially beautiful friendship, (iv) otherwise embarrassing yourself.
(b) Repeat the exercise, replacing logistic regression successively with:
Bayesian inference,
a multinomial distribution,
nuisance parameters,
the Poisson distribution,
statistical independence,
conditional expectation,
multiple regression,
one-way ANOVA,
a linear model,
a t-test.
likelihood,
the Neyman-Pearson lemma,
order statistics,
size & power,
(c) Suddenly, a somewhat inebriated student (SIS) appears and interrupts your rather impressive
explanation with the following exchange:
SIS:
POTASOYD:
SIS:
Think of a number from 1 to 10.
Erm—seven?
Wrong. Get your clothes off.
You then watch aghast while he starts introducing himself in the same way to everyone in the
room. As a statistician, you of course note down the numbers xi he is given, namely
7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7, 3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7.
His response yi is ‘Wrong’ in each case, and you formulate the hypotheses

H0 : yi = ‘Wrong’ irrespective of xi ,
H1 : yi = ‘Right’ if xi = x0 , and yi = ‘Wrong’ if xi ≠ x0 , for some x0 ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
How might you test the null hypothesis H0 against the alternative H1 ?
1 Person Of The Appropriate Sex
100
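One informal starting point for the previous problem's part (c): under H1 the response at x = x0 is deterministic (‘Right’), so every value of x0 that actually occurred and was met with ‘Wrong’ is refuted. A sketch (exploratory only; a full answer would also weigh how likely the surviving x0 is to have escaped by chance under H0):

```python
# The numbers noted down, as listed in the problem.
xs = [7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7,
      3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7]

# Candidate x0 values in {1,...,10} not yet refuted by a 'Wrong' response.
candidates = set(range(1, 11)) - set(xs)
```

Only one value of x0 survives, so H1 is not yet refuted outright, and a sensible test weighs how surprising it is that this value never came up in 31 trials.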
4.
(i) Explain what is meant by:
(a) a generalised linear model,
(b) a nonlinear model.
(ii) Discuss the models you would most likely consider for the following data sets:
(a) Data on the age, sex, and weight of 100 people who suffered a heart attack (for the first
time), and whether or not they were still alive two years later.
(b) Data on the age, sex and weight of 100 salmon in a fish farm.
From Warwick ST217 exam 1996
I have yet to see any problem, however complicated, which, when you looked at it the right
way, did not become still more complicated.
Poul Anderson
The manipulation of statistical formulas is no substitute for knowing what one is doing.
Hubert M. Blalock, Jr.
A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance
foisted upon him.
Thomas Carlyle
The best material model of a cat is another, or preferably the same, cat.
A. Rosenblueth & Norbert Wiener
A little inaccuracy sometimes saves tons of explanation.
Saki (Hector Hugh Munro)
karma police arrest this man he talks in maths he buzzes like a fridge he’s like a detuned radio.
Thom Yorke
Better is the end of a thing than the beginning thereof.
Ecclesiastes 7:8
101
1. Problem 1.3.1, page 6
Let

Prob(first coin shows ‘Heads’) = p1 ,             Prob(first coin shows ‘Tails’) = 1 − p1 ,
Prob(second coin shows ‘Heads’) = p2 ,            Prob(second coin shows ‘Tails’) = 1 − p2 ,
Prob(individual has watched ‘Teletubbies’) = θ,   Prob(individual has not watched ‘Teletubbies’) = 1 − θ.

Then

Prob(individual says ‘Yes’) = φ = p1 p2 + (1 − p1 )θ,
Prob(individual says ‘No’) = 1 − φ = p1 (1 − p2 ) + (1 − p1 )(1 − θ).

In particular, if p1 = p2 = 1/2, then φ = 1/4 + θ/2.
Now let

Prob(student has watched ‘Teletubbies’) = θ,
Prob(male student has watched ‘Teletubbies’) = θM ,
Prob(female student has watched ‘Teletubbies’) = θF .

Then estimating the unknown probabilities by the observed proportions gives

1/4 + θ̂/2 = (84 + 23)/(84 + 23 + 48 + 24),
1/4 + θ̂M /2 = 84/(84 + 48),
1/4 + θ̂F /2 = 23/(23 + 24).

Therefore

θ̂ = 2(107/179 − 1/4) = 249/358 = 0.696,
θ̂M = 2(84/132 − 1/4) = 17/22 = 0.773,
θ̂F = 2(23/47 − 1/4) = 45/94 = 0.479.
If the probability of having watched ‘Teletubbies’ is the same for males and females, then the expected
counts corresponding to the data are:

            Yes        No
Males      132φ     132(1 − φ)
Females     47φ      47(1 − φ)

where φ is estimated by φ̂ = (84 + 23)/(84 + 23 + 48 + 24) = 0.598, i.e. expected counts are:

            Yes      No
Males      78.9    53.1
Females    28.1    18.9

I
∴ X² = Σ over categories of (Observed − Expected)²/Expected
     = (84 − 78.9)²/78.9 + (48 − 53.1)²/53.1 + (23 − 28.1)²/28.1 + (24 − 18.9)²/18.9
     = 3.12 (with 1 degree of freedom).

From tables, the 90% and 95% points of the χ²₁ distribution are 2.706 and 3.841 respectively, so
the null hypothesis (that the underlying proportion is the same for both sexes) is accepted at the
5% level.

Thus the point estimates from the data suggest that about 77% of males and 48% of females have watched
a complete episode of ‘Teletubbies’. However, the data don’t provide strong evidence against the
hypothesis that the same proportion (estimated to be about 70%) of males and females from whatever
population the data come from have watched a complete episode of ‘Teletubbies’.
Discussion of assumptions made
• The formal statistical inference implicitly assumes that the data arise as a representative sample
from any population, which is highly dubious. For example, if students judge that a first-year
statistics lecture is in some way similar to an episode of ‘Teletubbies’, then students who enjoy
‘Teletubbies’ will be disproportionately attracted to the course!
• The assumption that each coin has the same probability (which, moreover, is 1/2) of landing
‘Heads’ is dubious: in reality, different types of coins, and even coins minted in different years,
will be (slightly) biased one way or the other. However, experience suggests that the assumption
is reasonable (and certainly convenient).
If we actually knew p1 and p2 rather than having to estimate them as 1/2, then we would be
able to estimate θ more accurately. Obtaining separately a count of what proportion of ‘Heads’
appeared on each toss would help.
• We have assumed that everyone told the truth, which seems reasonable here.
Note that if p1 were closer to 0 then we would obtain more accurate answers. For example,
students could be asked to roll a die and toss a coin, and answer the ‘Teletubbies’ question
unless the die showed a ‘6’. However, if p1 were so small that people were worried about being
identified as ‘almost certain Teletubbies addicts’ if they answered ‘Yes’, then they might be
tempted to lie (this is of course more relevant to genuinely ‘difficult’ subjects like drugs or
abortion).
2. Problem 1.3.2, page 6
(a) Z is positive if X & Y have the same sign, otherwise Z is negative. Therefore

    Pr(Z > 0) = Pr((X < 0) & (Y < 0)) + Pr((X > 0) & (Y > 0))
              = Pr(X < 0) Pr(Y < 0) + Pr(X > 0) Pr(Y > 0)    by independence,
              = (1/2)(1/2) + (1/2)(1/2)                      by symmetry of X & Y,
              = 1/2.

    Thus z50 = 0. Also, since X and Y are IID, the distribution of 1/Z is the same as that of Z.
    Therefore

    Pr(Z < 1|Z > 0) = Pr((1/Z) < 1|Z > 0) = Pr(Z > 1|Z > 0) = 1/2,

    since Pr(Z < 1|Z > 0) + Pr(Z > 1|Z > 0) = 1.

    ∴ Pr(Z < 1) = Pr(Z < 0) + Pr(Z > 0 & Z < 1)
                = Pr(Z < 0) + Pr(Z < 1|Z > 0) Pr(Z > 0) = 1/2 + (1/2)(1/2) = 3/4.

    Thus (z25 , z50 , z75 ) = (−1, 0, 1).

II
⇐
(b) C&B 4.3.4 page 152
3. Problem 1.3.3, page 6
MZ (t) = E[exp(tZ)]
       = E[exp(Σi tbi + Σi tai Xi )]
       = exp(t Σi bi ) E[exp(ta1 X1 ) × . . . × exp(tan Xn )]      since Σi tbi is a constant,
       = exp(t Σi bi ) E[exp(ta1 X1 )] × . . . × E[exp(tan Xn )]   by independence,
       = exp(t Σi bi ) MX1 (a1 t) × · · · × MXn (an t).

If X ∼ N (µ, σ²), then MX (t) = exp(µt + σ²t²/2) [CHECK].
Substituting into the above formula for MZ (·) gives

MZ (t) = exp[ (Σi bi + Σi ai µi )t + (Σi ai² σi² )t²/2 ],

which can be recognised as the MGF of a Normal distribution with mean Σi bi + Σi ai µi and
variance Σi ai² σi² . The result follows by the uniqueness of MGFs.
4. Problem 1.3.4, page 6
Denote the height, width, length & density of the block by X1 , X2 , X3 & D respectively. Then

EV = E(X1 X2 X3 ) = (EX1 )(EX2 )(EX3 )    by independence,
   = 10 × 20 × 25 = 5000 cc,

EV² = E[(X1 X2 X3 )²] = E[X1² ] E[X2² ] E[X3² ]    by independence,
    = (Var(X1 ) + (EX1 )²)(Var(X2 ) + (EX2 )²)(Var(X3 ) + (EX3 )²)    since Var X = E[X²] − (EX)²,
    = (1² + 10²)(3² + 20²)(4² + 25²) = 26479069,

∴ SD(V ) = √(26479069 − 5000²) = 1216 cc.

Similarly EW = 20.0 kg and SD(W ) = 5.5 kg.

Pr(W ≥ 30) ≤ Pr(|W − 20| ≥ 10) ≤ 5.5²/10² = 0.3025    by Chebyshev’s inequality.

Assuming Normality, (W − 20)/5.5 ∼ N (0, 1). Therefore Pr(W ≥ 30) = 1 − Φ((30 − 20)/5.5) =
1 − Φ(1.82) = 0.0344, where Φ(·) represents the CDF of the standard Normal N (0, 1) density.
Assuming Normality allows clearer inferences to be made. Note that under Normality, extreme values
are highly unlikely. Thus the Normality assumption is often convenient but dubious.
Independence of the Xi seems highly dubious: if the workman is good at estimating relative dimensions, then the Xi will tend to be all overestimates, or else all underestimates (and therefore the SD
of W will have been underestimated).
There’s no good reason to assume Normality, but if X1 , X2 , X3 and D were independent lognormal,
then W = X1 X2 X3 D would also be lognormal. A safer and perhaps more realistic assumption for
W might be that it has a unimodal distribution on (0, ∞), with heavier tails than the Normal (e.g. a
Gamma distribution with appropriate mean & variance).
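The moment calculations above can be checked by simulation; a sketch assuming the independent Normal model X1 ∼ N (10, 1²), X2 ∼ N (20, 3²), X3 ∼ N (25, 4²) used in the SD(V ) calculation:

```python
import numpy as np

# Monte Carlo check of EV = 5000 cc and SD(V) = 1216 cc for V = X1*X2*X3.
rng = np.random.default_rng(1)
n = 1_000_000
X1 = rng.normal(10, 1, n)
X2 = rng.normal(20, 3, n)
X3 = rng.normal(25, 4, n)
V = X1 * X2 * X3

mean_V, sd_V = V.mean(), V.std()   # should be close to 5000 and 1216
```

The simulated histogram of V also makes visible how far the distribution of a product departs from Normality.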
5. Problem 1.3.5, page 7
⇐
cf C&B p112
III
6. Problem 1.3.6, page 7
⇐
cf C&B 2.29 page 80
7. Problem 1.3.7, page 7
(a) If Xn ∼ Bin(n, p) then Xn has mean np and variance np(1 − p).
    Also, Xn can be considered as arising from the sum of n independent Bernoulli trials:

    Xn = Σ_{i=1}^n Ti ,    where Ti ∼IID Bin(1, p).

    Therefore, by the CLT, the distribution of

    Yn = (Xn − np)/√(np(1 − p))

    tends to N (0, 1) as n → ∞ (provided p ≠ 0, 1). Hence the distribution of Xn is approximately
    N (np, np(1 − p)) for large n. This approximation is traditionally considered adequate if np and
    n(1 − p) are both at least 5.
    The statement ‘the continuous RV Z is a good approximation to the integer-valued RV X’
    implies that Pr(X = x) ≈ Pr(x − 0.5 < Z < x + 0.5) for integer x.
(b) The Poisson distribution Poi (λ) arises as the limit of the binomial Bin(n, λ/n) as n → ∞.
Therefore Bin(n, p) can be approximated by Poi (np) provided p is small and n is large (here
there appear to be no traditional recommendations as to what is meant by ‘small’ and ‘large’).
(c) From the above approximations it follows that if λ is large then the distribution Poi (λ) can be
approximated by N (λ, λ), since the mean and variance of Poi (λ) are both λ.
(i)   Pr(X ≥ 6) = 1 − Pr(X ≤ 5) = 1 − Σ_{k=0}^5 (100 choose k) 0.1^k 0.9^{100−k} = 0.9424.
(ii)  Pr(Y ≥ 6) = 1 − Pr(Y ≤ 5) = 1 − Σ_{k=0}^5 10^k e^{−10}/k! = 0.9329.
(iii) Pr(Z > 5.5) = 1 − Φ(−1.5) = Φ(1.5) = 0.9332.
(iv)  Pr(X > 16) = 0.0206 similarly to (i) above.
(v)   Pr(Y > 16) = 0.0270.
(vi)  Pr(Z > 16.5) = 1 − Φ(2.166) = 0.0151.
The approximations are relatively poorer in the tails of the distributions.
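The three calculations for Pr(X ≥ 6) can be reproduced with scipy.stats; a verification sketch (X ∼ Bin(100, 0.1), with Poisson and continuity-corrected Normal approximations):

```python
from scipy import stats

exact   = 1 - stats.binom.cdf(5, 100, 0.1)          # binomial: 0.9424
poisson = 1 - stats.poisson.cdf(5, 10)              # Poisson(np): 0.9329
normal  = 1 - stats.norm.cdf(5.5, loc=10, scale=3)  # N(np, np(1-p)), continuity corr.
```

Repeating with the tail event {X > 16} shows the approximations deteriorating, as noted above.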
8. Problem 1.3.8, page 7
C&B x 5.30, 5.31 (p 241)
⇐
9. Problem 1.3.9, page 8
⇐
10. Problem 1.3.10, page 10
(a) For N , want a discrete distribution on the non-negative integers, probably unimodal, with
no upper bound [since it seems unreasonable to pick an n such that Pr(N = n) > 0 but
Pr(N = n + 1) = 0]. The obvious choice is the Poisson [which would apply if each sperm
independently had the same tiny probability of fertilising an egg].
    For M |N , want a probability distribution with positive probability on {0, 1, 2, . . . , n}. Expect
    (both from general considerations and from the data) that E[M |N ] ≈ N/2. Binomial is obvious
    choice [would apply if each fertilised egg has, independently, the same probability of being male].
    Would like to know more about the biology before suggesting other models.
IV
(b)

    Pr(N = n|4 ≤ N ≤ 12) = [λⁿ e^{−λ}/n!] / [Σ_{k=4}^{12} λᵏ e^{−λ}/k!] = [λⁿ/n!] / [Σ_{k=4}^{12} λᵏ/k!].

    ∴ ℓ(λ; data) = Σ_{n=4}^{12} xn [ log(λⁿ/n!) − log( Σ_{k=4}^{12} λᵏ/k! ) ]

    (omitting a constant term), where x4 = 53, x5 = 116, . . . , x12 = 70.
(c) Straightforward calculation (computer or calculator) gives

    λ             7.5        8.0        8.5
    ℓ(λ; data)    254.577    257.468    257.971

    Fitted quadratic:

    ℓ̂(λ) = a(λ − 8)² + b(λ − 8) + c

    where

    c = 257.468,
    b = 257.971 − 254.577 = 3.394,
    a = 4(257.971 − c − b/2) = −4.776.

    ℓ̂′(λ) = 0 at λ − 8 = −b/2a = 0.355, i.e. MLE is λ̂ = 8.355.
    Estimated ‘2-unit support interval’ (assuming fitted quadratic) is

    8.355 ± √(2/4.776) = (7.708, 9.002).
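The quadratic interpolation can be reproduced numerically; a sketch using numpy.polyfit (which here fits the parabola exactly through the three tabulated points):

```python
import numpy as np

# Quadratic through the three tabulated (lambda, log-likelihood) points.
lams = np.array([7.5, 8.0, 8.5])
ells = np.array([254.577, 257.468, 257.971])

# Fit ell = a*(lam - 8)^2 + b*(lam - 8) + c; polyfit returns (a, b, c).
a, b, c = np.polyfit(lams - 8.0, ells, 2)

lam_hat = 8.0 - b / (2 * a)          # vertex of the parabola: the MLE
half_width = np.sqrt(2 / -a)         # 2-unit drop in the fitted log-likelihood
interval = (lam_hat - half_width, lam_hat + half_width)
```

This reproduces λ̂ = 8.355 and the support interval (7.708, 9.002) from the hand calculation.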
(d) Suppose λ = λ̂ = 8.355. Then the observed & expected counts in each category are as follows:

    n      4      5      6      7      8      9      10     11     12
    OBS   53    116    221    374    402    346    277    102     70
    EXP  105.9  177.0  246.4  294.1  307.2  285.2  238.3  181.0  126.0

    for example, 105.9 = 1961 × Pr(N = 4|4 ≤ N ≤ 12).
    Then Σ (OBS − EXP)²/EXP = 180 on 7 d.f. (9 categories, have fitted one parameter).
    Reject model decisively!
(e) The given argument involves a lot of hand-waving and approximation, but is basically sound,
and correctly suggests that the data are less dispersed than would be expected under the Poisson
assumption. It involves much less work than (a)–(d)!
1. Problem 2.1, page 12
f (x1 , x2 ) = Pr((X1 = x1 ) & (X2 = x2 ))
            = Pr(X1 = x1 ) Pr(X2 = x2 )    by independence
            = [λ1^{x1} e^{−λ1}/x1 !] × [λ2^{x2} e^{−λ2}/x2 !]
            = λ1^{x1} λ2^{x2} e^{−(λ1 +λ2 )}/(x1 ! x2 !).    (7.2)

V
2. Problem 2.2, page 12
f (x1 , x2 ) = Pr(X1 = x1 & X2 = x2 )
            = Pr(X1 = x1 & N = x1 + x2 )
            = Pr(X1 = x1 |N = x1 + x2 ) × Pr(N = x1 + x2 )
            = (x1 + x2 choose x1 ) θ^{x1} (1 − θ)^{x2} × λ^{x1 +x2} e^{−λ}/(x1 + x2 )!
            = (λθ)^{x1} (λ(1 − θ))^{x2} e^{−λ}/(x1 ! x2 !).    (7.3)
3. Problem 2.3, page 12
Writing λ1 = λθ and λ2 = λ(1 − θ) shows that the PMFs and hence the fitted values are the same.
4. Problem 2.4, page 12
Pr(X1 = x1 ) = Σ_{x2 =0,1,...} f (x1 , x2 )
            = Σ_{x2 =0}^∞ (λθ)^{x1} (λ(1 − θ))^{x2} e^{−λ}/(x1 ! x2 !)
            = [(λθ)^{x1} e^{−λ}/x1 !] Σ_{x2 =0}^∞ (λ(1 − θ))^{x2}/x2 !
            = (λθ)^{x1} e^{−λθ}/x1 !
i.e. under the model of Exercise 2.2, X1 ∼ Poi (λθ).
5. Problem 2.5, page 12
Pr(Q = 0) = (1 − α) Σ_{y=1}^∞ α^{y−1}/(y + 1).

But

Σ_{y=1}^∞ α^{y+1}/(y + 1) = ∫₀^α Σ_{y=1}^∞ x^y dx = ∫₀^α (1/(1 − x) − 1) dx = ∫₀^α x/(1 − x) dx
                          = [−log(1 − x) − x]₀^α = −(α + log(1 − α)).

Therefore

Pr(Q = 0) = ((1 − α)/α²) Σ_{y=1}^∞ α^{y+1}/(y + 1) = ((α − 1)/α²)(α + log(1 − α)).
6. Problem 2.6, page 14
The joint PMF

fX,Θ (x, θ) = Pr(X = x & Θ = θ) = Pr(X = x|Θ = θ) Pr(Θ = θ) = (2 choose x) θ^x (1 − θ)^{2−x} Pr(Θ = θ)

is easy to calculate & tabulate:

VI

                   x
θ           0      1      2
0          16      0      0
1/4         9      6      1
1/2         8     16      8
1           0      0     16

Table of fX,Θ (x, θ), all entries ×1/80
(for example, the (θ = 1/2, x = 1) entry is 2 · (1/2) · (1/2) · (2/5) = 16/80).

The marginal CDF FX (x) is then by definition FX (x) = limθ→∞ FX,Θ (x, θ) = FX,Θ (x, 1), and is
given by the last row in the following table of FX,Θ (x, θ) (obtained by summing fX,Θ (x′ , θ′ ) over
all x′ ≤ x and θ′ ≤ θ in the above table of fX,Θ ):

                   x
θ           0      1      2
0          16     16     16
1/4        25     31     32
1/2        33     55     64
1          33     55     80

Table of FX,Θ (x, θ), all entries ×1/80

i.e.

x           0        1       2
FX (x)    33/80    55/80     1

More simply, fX (x) can be obtained just by summing the columns of the table of fX,Θ (x, θ):

x           0        1        2
fX (x)    33/80    22/80    25/80
7. Problem 2.7, page 15
x                       0        1        2
fX|Θ (x|θ = 1/4)      25/88    31/88    32/88

θ                       0       1/4      1/2       1
fΘ|X (θ|x = 0)       16/107   25/107   33/107   33/107
8. Problem 2.8, page 15
(2.6 ⇒ 2.7)

FX,Y (x, y) = Pr(X ≤ x & Y ≤ y) = Pr(X ≤ x) Pr(Y ≤ y)    (by 2.6)
           = FX (x)FY (y).

VII

(2.7 ⇒ 2.8)

fX,Y (x, y) = ∂²/∂x∂y [FX,Y (x, y)]    (by the ‘fundamental theorem of calculus’)
           = ∂²/∂x∂y [FX (x)FY (y)]    (by 2.7)
           = fX (x)fY (y).

(2.8 ⇒ 2.6)

Pr(X ∈ A & Y ∈ B) = ∫_A ∫_B fX,Y (x, y) dy dx
                  = ∫_A ∫_B fX (x)fY (y) dy dx    (by 2.8)
                  = ∫_A fX (x) dx ∫_B fY (y) dy = Pr(X ∈ A) Pr(Y ∈ B).
9. Problem 2.5.1, page 16
(a)
    ∫_{−∞}^∞ ∫_{−∞}^∞ f (x, y) dx dy = ∫₀¹ ∫₀¹ 6xy² dx dy = ∫₀¹ 3y² dy = [y³]₀¹ = 1.

(b)
    Pr(X + Y ≥ 1) = ∫₀¹ ∫_{1−y}^1 6xy² dx dy = 9/10.

(c)
    fX (x) = ∫_{−∞}^∞ f (x, y) dy = ∫₀¹ 6xy² dy = 2x if 0 ≤ x ≤ 1, and 0 otherwise.

(d)
    Pr(1/2 < X < 3/4) = ∫_{1/2}^{3/4} 2x dx = 5/16.
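Parts (a) and (b) can be checked by numerical integration; a sketch using scipy (note that `dblquad` integrates f(y, x) with the y-limits given as functions of x, so part (b) uses the equivalent x-outer description of the region x + y ≥ 1):

```python
from scipy import integrate

# Joint density f(x, y) = 6xy^2 on the unit square; dblquad expects f(y, x).
f = lambda y, x: 6 * x * y**2

total, _ = integrate.dblquad(f, 0, 1, lambda x: 0, lambda x: 1)       # part (a): 1
p_sum, _ = integrate.dblquad(f, 0, 1, lambda x: 1 - x, lambda x: 1)   # part (b): 9/10
```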
10. Problem 2.5.2, page 16
(a)
    FX,Y (x, y) = 0 if x < 0 or y < 0;
    FX,Y (x, y) = 1 if x ≥ 2 and y ≥ 2;
    FX,Y (x, y) = FX (x) = FX,Y (x, 2) = x(x + 2)/8 if 0 ≤ x < 2 and y ≥ 2;
    FX,Y (x, y) = FY (y) = FX,Y (2, y) = y(y + 2)/8 if x ≥ 2 and 0 ≤ y < 2.

(b)
    FX (x) = lim_{y→∞} FX,Y (x, y) = FX,Y (x, 2) = { 0 if x < 0;  x(x + 2)/8 if 0 ≤ x < 2;  1 if x ≥ 2 }.

(c)
    fX,Y (x, y) = ∂²FX,Y (x, y)/∂x∂y = (x + y)/8 if 0 < x < 2 and 0 < y < 2, and 0 otherwise.

VIII
11. Problem 2.5.3, page 16
(a)
    ∫_{−∞}^∞ ∫_{−∞}^∞ f (x, y) dy dx = ∫_{−1}^1 ∫_{x²}^1 cx²y dy dx = (4/21)c.

    Therefore c = 21/4.

(b) (The support of f is the region between y = x² and y = 1.) For Pr(X ≥ Y ) we need to
    integrate over the part of the support bounded by y = x and y = x²:

    Pr(X ≥ Y ) = ∫₀¹ ∫_y^{√y} (21/4)x²y dx dy = ∫₀¹ [ (7/4)x³y ]_{x=y}^{x=√y} dy
               = ∫₀¹ (7/4)(y^{5/2} − y⁴) dy = (7/4)(2/7) − (7/4)(1/5) = 1/2 − 7/20 = 3/20.

(c)
    fX (x) = ∫_{−∞}^∞ f (x, y) dy = ∫_{x²}^1 (21/4)x²y dy = (21/8)x²(1 − x⁴),

    fY (y) = ∫_{−∞}^∞ f (x, y) dx = ∫_{−√y}^{√y} (21/4)x²y dx = (7/2)y^{5/2} .
12. Problem 2.5.6, page 17
First note that

FX|Y (x|y) = Pr(X ≤ x|Y = y) = ∫_{−∞}^x fX|Y (t|y) dt = ∫_{−∞}^x [fX,Y (t, y)/fY (y)] dt,

but that in general

FX|Y (x|y) ≠ FX,Y (x, y)/FY (y).

For example, suppose that fX,Y (x, y) = 2 on the triangle 0 ≤ y ≤ x ≤ 1.
Then the set {(x, 0.5) | x < 0.5} lies entirely outside this triangle, whereas the set
{(t, y) | 0 < t < x, 0 < y < 0.5} intersects it for all x ∈ (0, 1).
Therefore FX|Y (x|y = 0.5) = 0 for 0 < x < 0.5, whereas FX,Y (x, 0.5) > 0 for 0 < x < 0.5.

(a)
    FX (x) = lim_{y→∞} FX,Y (x, y) = FX,Y (x, 1) = { 0 if x < 0;  x if 0 ≤ x < 1;  1 if x ≥ 1 },
    FY (y) = lim_{x→∞} FX,Y (x, y) = FX,Y (1, y) = { 0 if y < 0;  y if 0 ≤ y < 1;  1 if y ≥ 1 }.

IX

    For (x, y) ∈ (0, 1)²,

    fX,Y (x, y) = ∂²FX,Y (x, y)/∂x∂y = ∂²(xy)/∂x∂y = 1    and    fY (y) = ∂FY (y)/∂y = 1.

    Therefore

    FX|Y (x|0.5) = Pr(X ≤ x|Y = 0.5) = ∫₀^x fX|Y (t|0.5) dt = ∫₀^x [fX,Y (t, 0.5)/fY (0.5)] dt = ∫₀^x 1 dt = x,

    with FX|Y (x|0.5) = 0 for x ≤ 0 and FX|Y (x|0.5) = 1 for x > 1.

(b)
    FX (x) = lim_{y→∞} FX,Y (x, y) = FX,Y (x, 1) = { 0 if x ≤ 0;  min(x, 1) = x if 0 < x ≤ 1;  1 if x > 1 },
    FY (y) = lim_{x→∞} FX,Y (x, y) = FX,Y (1, y) = { 0 if y ≤ 0;  min(y, 1) = y if 0 < y ≤ 1;  1 if y > 1 }.

    FX,Y (x, y) is discontinuous for x = y, but fX,Y (x, y) = ∂²FX,Y (x, y)/∂x∂y = 0 elsewhere.
    Therefore the distribution lies entirely on the line x = y. Therefore (Y = 0.5) ⇒ (X = 0.5) and

    FX|Y (x|0.5) = Pr(X ≤ x|Y = 0.5) = { 0 if x < 0.5;  1 if x ≥ 0.5 }.

(c)
    FX (x) = lim_{y→∞} FX,Y (x, y) = FX,Y (x, 1) = { 0 if x ≤ 0;  x + 1 − 1 = x if 0 < x ≤ 1;  1 if x > 1 },
    FY (y) = lim_{x→∞} FX,Y (x, y) = FX,Y (1, y) = { 0 if y ≤ 0;  1 + y − 1 = y if 0 < y ≤ 1;  1 if y > 1 }.

    Similarly to the previous part, the distribution lies entirely on the line x + y = 1. Therefore again

    FX|Y (x|0.5) = Pr(X ≤ x|Y = 0.5) = { 0 if x < 0.5;  1 if x ≥ 0.5 }.

Note: the three distributions are in fact
i.   Uniform on (0, 1) × (0, 1),
ii.  Uniform on {(x, y) | 0 < x = y < 1},
iii. Uniform on {(x, y) | 0 < x = (1 − y) < 1}.
13. Problem 2.5.7, page 17
(a) F(θ, x) = Pr(Θ ≤ θ, X ≤ x).
e.g., for 0 ≤ θ ≤ 1 and 1 ≤ x < 2 we have
    F(θ, x) = Pr(Θ ≤ θ, X = 0) + Pr(Θ ≤ θ, X = 1)
            = ∫_0^θ (1 − φ)² dφ + ∫_0^θ 2φ(1 − φ) dφ
            = θ − θ³/3,
and F(θ, x) is as shown in the following table:

    range of x     θ ∈ (−∞, 0)    θ ∈ [0, 1]         θ ∈ (1, ∞)
    (−∞, 0)        0              0                   0
    [0, 1)         0              θ − θ² + θ³/3       1/3
    [1, 2)         0              θ − θ³/3            2/3
    [2, ∞)         0              θ                   1

(b) X = 0: f(θ, x) = (1 − θ)²
    X = 1: f(θ, x) = 2θ(1 − θ)
    X = 2: f(θ, x) = θ²
(c)
    F1(θ) = { 0 (θ < 0),  θ (0 ≤ θ ≤ 1),  1 (θ > 1) },
    F2(x) = { 0 (x < 0),  1/3 (0 ≤ x < 1),  2/3 (1 ≤ x < 2),  1 (x ≥ 2) }.
14. Problem 2.5.8, page 17
⇐
DeGroot p.138 (bottom)
15. Problem 2.5.5, page 16
[Sketch: the (x, y) plane with the line x = y and the line x + y = 1; the event X + Y ≥ 1 corresponds to the unbounded region A, its complement to the triangular region B below x + y = 1.]
The diagram shows that Pr(X + Y < 1) gives a simpler region (B rather than A):
(a)
    Pr(X + Y ≥ 1) = 1 − Pr(X + Y < 1)
                  = 1 − ∫_0^{1/2} ∫_x^{1−x} e^{−y} dy dx
                  = 1 − ∫_0^{1/2} ( e^{−x} − e^{−(1−x)} ) dx
                  = 2e^{−1/2} − e^{−1}.
(b)
    f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_x^{∞} e^{−y} dy = e^{−x}.
(c)
    f_{Y|X}(y|x) = f(x, y)/f_X(x) = { 0/e^{−x} = 0 if y ≤ x,  e^{−y}/e^{−x} = e^{−(y−x)} if y > x }.
16. Problem 2.5.9, page 17
(a)
    f_X(x) = { 1 (0 < x < 1),  0 (otherwise) }     (given).
(b)
    f_{Y|X}(y|x) = { 1/x (0 < y < x),  0 (otherwise) }.
Therefore
    f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_{−∞}^{∞} f_{Y|X}(y|x) f_X(x) dx = ∫_y^1 (1/x) dx = −log y
for 0 < y < 1; f_Y(y) = 0 elsewhere.
17. Problem 2.5.4, page 16
⇐
DeGroot p 131 ex.5, pp 132–3 ex.2,6,7. c = 2, 3, 2, 24
18. Problem 2.9, page 19
    E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f(x, y) dx dy
                = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) f_X(x) f_Y(y) dx dy
                = ( ∫_{−∞}^{∞} g(x) f_X(x) dx ) ( ∫_{−∞}^{∞} h(y) f_Y(y) dy )
                = E[g(X)] E[h(Y)].
19. Problem 2.10, page 19
There are many equivalent ways to find the solution, e.g.
    E[X²YZ] = (EX²)(EY)(EZ)                              by independence
            = ( Var(X) + (EX)² )(EY)(EZ) = λ(1 + λ)µν
(since, for a Poisson distribution, variance = mean).
A sneaky trick to show that (mean = variance) for a Poisson is:
    E[X²] = E[X(X − 1) + X] = Σ_{x=0}^{∞} x(x − 1) λ^x e^{−λ}/x! + Σ_{x=0}^{∞} x λ^x e^{−λ}/x! = … = λ² + λ,
since x(x − 1)/x! simplifies.
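As a quick sanity check (an illustration, not part of the original notes), the identity E[X²YZ] = λ(1 + λ)µν can be verified by simulation; the parameter values below are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of E[X^2 Y Z] = lam*(1+lam)*mu*nu for independent
# Poisson variables X ~ Poi(lam), Y ~ Poi(mu), Z ~ Poi(nu).
rng = np.random.default_rng(0)
lam, mu, nu = 2.0, 3.0, 1.5          # arbitrary example parameters
n = 1_000_000

x = rng.poisson(lam, n)
y = rng.poisson(mu, n)
z = rng.poisson(nu, n)

estimate = np.mean(x.astype(float)**2 * y * z)
exact = lam * (1 + lam) * mu * nu    # = 27.0 here
print(estimate, exact)               # estimate should be close to 27
```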
20. Problem 2.11, page 19
Let
    h(t) = E[(tX − Y)²] = E[X²]t² − 2E[XY]t + E[Y²].
Since (tX − Y)² ≥ 0, we must have that h(t) ≥ 0 for all t.
If h(t) > 0 ∀ t then h(t) can have no real roots, so 4(E[XY])² − 4E[X²]E[Y²] < 0.
However, if h(t) = 0 for some value of t (c, say), then E[(cX − Y)²] = 0, i.e. Pr(Y = cX) = 1, and then (E[XY])² = E[X²]E[Y²].
Summarising, (E[XY])² ≤ E[X²]E[Y²], with equality iff Pr(Y = cX) = 1 for some c.
Using this Cauchy-Schwarz inequality on the RVs X − EX and Y − EY shows that ρ²_{X,Y} ≤ 1,
and |ρ_{X,Y}| = 1 if & only if one variable is a linear function of the other with probability one.
21. Problem 2.12, page 20
[Outline Solution] The idea is to make the following approximations:
    g(X̄) = g(µ + Z_n σ/√n)
         = g(µ) + (Z_n σ/√n) g′(µ) + ½ (Z_n σ/√n)² g″(µ) + …
    ∴ E[g(X̄)] = g(µ) + (σ/√n) g′(µ) E[Z_n] + ½ (σ/√n)² g″(µ) E[Z_n²] + …
              = g(µ) + 0 + (σ²/2n) g″(µ) + …
and
    Var g(X̄) ≈ E[ (g(X̄) − g(µ))² ] = E[ ( Z_n σ g′(µ)/√n + ½ (Z_n σ)² g″(µ)/n )² ]
             = … = σ² g′(µ)²/n + O(n^{−3/2}).
More formally,
    g(X̄) = g(µ + n^{−1/2} σ Z_n) = g(µ) + n^{−1/2} σ Z_n g′(µ) + ½ n^{−1} σ² δ_n² g″(µ),     0 < |δ_n| < |Z_n|,
and the results follow, e.g. by Chebyshev's inequality.
22. Problem 2.13, page 20
    Var g(X̄) ≈ Var(µ) g′(µ)²/n          from equation 2.15
             = 1/n                       if g′(µ) = 1/√Var(µ).
If X ∼ Poi(µ), then we have the above situation with n = 1, X̄ = X (= X1) and Var(µ) = µ.
Therefore the ‘variance stabilising’ transformation for a Poisson is given by
    g′(µ) = 1/√µ,
i.e. (integrating): g(µ) = 2√µ. Therefore g(X) = 2√X, or more simply g(X) = √X, has approximately constant variance.
Note: the transformation g(Xi) = √Xi + √(Xi + 1) in fact has slightly better properties as a transformation of the independent RVs Xi ∼ Poi(µi).
23. Problem 2.9.1, page 21
(a) Marginal PMFs:

    x1        1     2
    f1(x1)    0.4   0.6

    x2        1     2     3
    f2(x2)    0.25  0.5   0.25

    x3        1     2     3
    f3(x3)    0.33  0.32  0.35

    f12(x1, x2)    x2 = 1   x2 = 2   x2 = 3
    x1 = 1         0.10     0.20     0.10
    x1 = 2         0.15     0.30     0.15

(b) X1 ⊥⊥ X2, since f12(x1, x2) = f1(x1) × f2(x2).
(c) Conditional PMFs:

    x1                       1     2
    g1(x1|X2=1, X3=3)        5/8   3/8

    x2                       1     2     3
    g2(x2|X1=1, X3=3)        1/4   2/4   1/4

    x3                       1     2     3
    g3(x3|X1=1, X2=3)        2/10  3/10  5/10

    g12(x1, x2|X3=3)    x2 = 1   x2 = 2   x2 = 3
    x1 = 1              5/35     10/35    5/35
    x1 = 2              3/35     7/35     5/35
⇐
24. Problem 2.9.2, page 21
25. Problem 2.9.3, page 21
(a) If f_X(x) = (1/µ) exp(−x/µ) for x > 0, then Var(X) = µ². Then
    g′(µ) = 1/µ  ⇒  g = log(µ),
i.e. log(X) has approximately constant variance.
(b) Similarly if X ∼ Bin(n, p) then E(X) = µ = np and Var(X) = np(1 − p).
Want
    g(t) = ∫_0^t dp/√(p(1 − p)).
Substituting p = (sin θ)² gives g = sin^{−1}√p (up to a constant factor), i.e. sin^{−1}√(X/n) has approximately constant variance.
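An illustration (not from the original notes): the arcsine transform sin^{−1}√(X/n) approximately stabilises the variance of a binomial proportion at 1/(4n), whatever p is. The trial count and success probabilities below are arbitrary choices.

```python
import numpy as np

# For X ~ Bin(n_trials, p), Var(arcsin(sqrt(X/n_trials))) is approximately
# 1/(4*n_trials) for every p, so the scaled variance below should be near 1.
rng = np.random.default_rng(2)
n_trials, n_sim = 200, 500_000

ratios = []
for p in [0.1, 0.3, 0.5]:            # arbitrary example success probabilities
    x = rng.binomial(n_trials, p, n_sim)
    y = np.arcsin(np.sqrt(x / n_trials))
    ratios.append(np.var(y) * 4 * n_trials)
    print(p, ratios[-1])             # each ratio should be near 1
```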
⇐
26. Problem 2.9.4, page 21
27. Problem 2.14, page 22
(i) [as usual, start with the more complicated side of the equality, and simplify]
    E[E[X1|X2]] = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x1 f(x1|x2) dx1 ) f(x2) dx2
                = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 [f(x1, x2)/f(x2)] dx1 f(x2) dx2     by definition of f(x1|x2)
                = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f(x1, x2) dx1 dx2 = E[X1]           by definition.
(ii) The discrete case is similar (replace integration by summation).
28. Problem 2.15, page 22
One could find the density f_{X,Y}(x, y) over the triangle on which it is non-zero, and work out E[Y] by integration. However, it's much simpler to say
    E[Y|x] = (x + 1)/2                                   by symmetry,
    ∴ E[Y] = E[E[Y|X]] = E[(X + 1)/2] = E[X]/2 + 1/2 = 3/4.
29. Problem 2.16, page 22
Θ ∼ U (0, 1)
X |Θ ∼ Bin(2, Θ)
∴ E[X |Θ] = 2Θ
∴ E[X] = E E[X |Θ] = 2EΘ = 1.
30. Problem 2.17, page 23
(i)
    E[E[h(X1)|X2]] = Σ_{x2} E[h(X1)|x2] f2(x2)
                   = Σ_{x2} Σ_{x1} h(x1) g1(x1|x2) f2(x2)
                   = Σ_{x1,x2} h(x1) f(x1, x2) = E[h(X1)].
(ii) The continuous case is proved similarly (replace sums by integrals etc.)
31. Problem 2.18, page 23
Note that Var(X1|x2) = E[X1²|x2] − (E[X1|x2])².
Therefore E[Var(X1|X2)] = E[E[X1²|X2]] − E[(E[X1|X2])²].
Also Var(E[X1|X2]) = E[(E[X1|X2])²] − (E[E[X1|X2]])².
∴ E[Var(X1|X2)] + Var(E[X1|X2]) = E[X1²] − (E[X1])² = Var(X1).
32. Problem 2.19, page 23
    Var(E[X|Θ]) = E[(2Θ)²] − (E[2Θ])²
                = ∫_0^1 4θ² dθ − ( ∫_0^1 2θ dθ )²
                = [4θ³/3]_0^1 − ( [θ²]_0^1 )² = 4/3 − 1 = 1/3.
Also, (X|θ) ∼ Bin(2, θ),
    ∴ Var(X|θ) = 2θ(1 − θ),
    ∴ E[Var(X|Θ)] = ∫_0^1 2θ(1 − θ) dθ = [θ² − 2θ³/3]_0^1 = 1/3.
    ∴ Var(X) = E[Var(X|Θ)] + Var(E[X|Θ]) = 2/3.
The variance of X is reduced from 2/3 to 1/3 on average by observing Θ.
[Sketch: Var(X|θ) = 2θ(1 − θ) plotted against θ ∈ (0, 1), peaking at 1/2 when θ = 1/2, with its average value 1/3 marked.]
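The decomposition above can be checked by simulation (an illustration, not part of the original notes):

```python
import numpy as np

# Simulation check of Problem 2.19: Theta ~ U(0,1), X | Theta ~ Bin(2, Theta),
# so Var(X) = E[Var(X|Theta)] + Var(E[X|Theta]) = 1/3 + 1/3 = 2/3.
rng = np.random.default_rng(3)
n = 1_000_000

theta = rng.uniform(0.0, 1.0, n)
x = rng.binomial(2, theta)

var_x = np.var(x)                             # close to 2/3
var_cond_mean = np.var(2 * theta)             # Var(E[X|Theta]), close to 1/3
mean_cond_var = np.mean(2 * theta * (1 - theta))  # E[Var(X|Theta)], close to 1/3
print(var_x, var_cond_mean, mean_cond_var)
```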
33. Problem 2.11.1, page 24
(a) Pr(A1 ) = Pr(A2 ) = Pr(A3 ) = 1/2.
Pr(A1 &A2 ) = Pr(‘HH’) = 1/4 = Pr(A1 ) Pr(A2 ),
Pr(A1 &A3 ) = Pr(‘HH’) = 1/4 = Pr(A1 ) Pr(A3 ),
Pr(A2 &A3 ) = Pr(‘HH’) = 1/4 = Pr(A2 ) Pr(A3 ),
but (A1 &A2 &A3 ) occurs iff both tosses are ‘H’,
∴ Pr(A1 &A2 &A3 ) = Pr(‘HH’) = 1/4 6= 1/8 = Pr(A1 ) Pr(A2 ) Pr(A3 ).
Therefore A1 , A2 & A3 are not mutually independent (& in fact knowing any two of them gives
you the third).
(b) Let Xi = { 1 if Ai occurs,  0 otherwise },   i = 1, 2, 3.
Then E[X3|X1=x1] = 1/2 whether x1 = 0 or 1, and E[X3|X2=x2] = 1/2 whether x2 = 0 or 1,
but E[X3|X1=x1 & X2=x2] = { 1 if x1 = x2,  0 otherwise }.
34. Problem 2.11.2, page 24
One possibility: X1, X2 IID ∼ U[0, 1],
    X3 = { X1 + X2       if X1 + X2 < 1,
         { X1 + X2 − 1   if X1 + X2 ≥ 1.
Physical model: rotating a spinner randomly through 2πX1, followed by a second rotation through 2πX2. Spinner now points at angle 2πX3 to its original direction.
35. Problem 2.11.3, page 24
(a)
i.
    E[E[Y|X]] = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy ) f_X(x) dx
              = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{X,Y}(x, y) dy dx        by definition of f_{Y|X}
              = E[Y].
The discrete case is similar (replace integration by summation).
ii.
    E[Var[Y|X]] = E[ E[Y²|X] − (E[Y|X])² ] = E[Y²] − E[(E[Y|X])²],
    Var[E[Y|X]] = E[(E[Y|X])²] − (E[E[Y|X]])²,
    ∴ E[Var[Y|X]] + Var[E[Y|X]] = E[Y²] − E[(E[Y|X])²] + E[(E[Y|X])²] − (E[Y])² = Var[Y].
(b)
i.
    E[X1|P1] = 1×P1 + 0×(1 − P1) = P1.
ii.
    Var[X1|P1] = E[X1²|P1] − (E[X1|P1])² = [1×P1 + 0×(1 − P1)] − P1² = P1(1 − P1).
iii.
    Var[E[X1|P1]] = Var[P1] = αβ / ( (α + β)²(α + β + 1) )        (given).
iv.
    E[Var[X1|P1]] = E[P1 − P1²] = E[P1] − Var[P1] − (E[P1])²
                  = α/(α + β) − αβ/( (α + β)²(α + β + 1) ) − α²/(α + β)²
                  = αβ / ( (α + β)(α + β + 1) ).
Therefore
    E[Y] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} E[E[Xi|Pi]] = Σ_{i=1}^{n} E[Pi] = nα/(α + β).
Also, the Xi s are mutually independent since the Pi s are. Therefore
    Var[Y] = Σ_{i=1}^{n} Var[Xi] = Σ_{i=1}^{n} ( Var[E[Xi|Pi]] + E[Var[Xi|Pi]] )
           = n ( αβ/( (α + β)²(α + β + 1) ) + αβ/( (α + β)(α + β + 1) ) )
           = nαβ / (α + β)².
(c)
    E[Y] = nµ,
    Var[Y] = n × [α/(α + β)] × [β/(α + β)] = nµ(1 − µ).
NB: the same as for Bin(n, µ).
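The perhaps surprising conclusion, that mixing over independent Beta success probabilities leaves the Bin(n, µ) mean and variance unchanged, can be checked by simulation (an illustration, not part of the original notes; the parameter values are arbitrary):

```python
import numpy as np

# P_i ~ Beta(a, b) independent, X_i | P_i ~ Bernoulli(P_i), Y = sum X_i.
# Theory: E[Y] = n*a/(a+b) and Var[Y] = n*a*b/(a+b)^2, as for Bin(n, mu).
rng = np.random.default_rng(4)
a, b, n = 2.0, 3.0, 10               # arbitrary example parameters
n_sim = 500_000

p = rng.beta(a, b, size=(n_sim, n))
x = rng.binomial(1, p)
y = x.sum(axis=1)

mu = a / (a + b)
print(y.mean(), n * mu)              # both close to 4.0
print(y.var(), n * mu * (1 - mu))    # both close to 2.4
```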
36. Problem 2.11.4, page 24
(a) EN = Var N = 10 (property of Poisson),
    E[X|N] = (1/4)N (property of binomial),
    Var[X|N] = (1/4)(3/4)N (property of binomial).
(b) EX = E[E[X|N]] = (1/4)EN = 2.5,
    Var X = Var(E[X|N]) + E(Var[X|N]) = Var((1/4)N) + E((3/16)N) = (1/16)Var N + (3/16)EN = 2.5.
(c) Poisson, mean 2.5 (compare ‘pig litter’ example in lectures).
37. Problem 2.11.5, page 25
(a) bookwork
(b) bookwork
(c)
i.
    f(x, y) = f_X(x) f_{Y|X}(y|x)          (definition of f_{Y|X}),
therefore
    ∫_{−∞}^{∞} y f(x, y) dy = ∫_{−∞}^{∞} y f_X(x) f_{Y|X}(y|x) dy
                            = f_X(x) ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy = f_X(x) E[Y|x]
                            = (β0 + β1 x) f_X(x).
ii.
    µ_Y = E[Y] = E[E[Y|X]] = E[β0 + β1 X] = β0 + β1 µ_X.
iii.
    ρ = ( E[XY] − µ_X µ_Y ) / (σ_X σ_Y),
therefore
    ρ σ_X σ_Y + µ_X µ_Y = E[XY] = E[E[XY|X]]
                        = E[X(β0 + β1 X)]            (since X is constant in inner bracket)
                        = β0 E[X] + β1 E[X²] = β0 µ_X + β1 (σ_X² + µ_X²).
(d) ii & iii above are 2 simultaneous linear equations for β0 & β1, with solution
    β0 = µ_Y − ρ (σ_Y/σ_X) µ_X,
    β1 = ρ σ_Y/σ_X.
Therefore the MLEs of β0 & β1 are
    β̂0 = µ̂_Y − ρ̂ (σ̂_Y/σ̂_X) µ̂_X,
    β̂1 = ρ̂ σ̂_Y/σ̂_X,
where
    µ̂_X = (1/n) Σ xi,
    µ̂_Y = (1/n) Σ yi,
    σ̂_X = √( (1/n) Σ (xi − µ̂_X)² ),
    σ̂_Y = √( (1/n) Σ (yi − µ̂_Y)² ),
    ρ̂ = ( (1/n) Σ xi yi − µ̂_X µ̂_Y ) / (σ̂_X σ̂_Y).
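A sketch (not from the original notes) of computing these moment-based estimates from data; the true coefficients and distributions below are arbitrary choices for the simulation:

```python
import numpy as np

# Compute beta0_hat and beta1_hat from sample moments, as in part (d).
rng = np.random.default_rng(6)
beta0_true, beta1_true = 1.0, 2.0    # arbitrary example coefficients
n = 200_000

x = rng.normal(5.0, 1.5, n)
y = beta0_true + beta1_true * x + rng.normal(0.0, 1.0, n)

mu_x, mu_y = x.mean(), y.mean()
sd_x = np.sqrt(np.mean((x - mu_x)**2))   # 1/n divisor, as in the notes
sd_y = np.sqrt(np.mean((y - mu_y)**2))
rho = (np.mean(x * y) - mu_x * mu_y) / (sd_x * sd_y)

beta1_hat = rho * sd_y / sd_x
beta0_hat = mu_y - beta1_hat * mu_x
print(beta0_hat, beta1_hat)          # close to 1.0 and 2.0
```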
38. Problem 2.11.6, page 25
(i–iv) bookwork.
(v)
    E[XY] = E[E[XY|X]] = E[X E[Y|X]] = E[β0 X + β1 X²] = β0 E[X] + β1 E[X²].
    E[Y] = E[E[Y|X]] = E[µX] = λµ.
    Var[Y] = Var(E[Y|X]) + E(Var[Y|X]) = Var[µX] + E[µX] = µ²λ + µλ = λµ(1 + µ).
Similarly
    E[Z] = λµν,
    Var[Z] = Var(E[Z|Y]) + E(Var[Z|Y]) = λµν(1 + ν + µν),
    E[XZ] = E[E[XZ|X]] = E[X E[Z|X]] = E[µνX²] = µν(λ + λ²).
Hence Cov(X, Z) = λµν and
    corr(X, Z) = λµν / √( λ × λµν(1 + ν + µν) ) = √( µν/(1 + ν + µν) ) = 1/√( 1 + 1/µ + 1/(µν) ).
⇐
39. Problem 2.11.7, page 26
40. Problem 2.11.8, page 26
(a)
    Cov(X, Y − E[Y|X]) = E[ XY − X E[Y|X] ] − (EX) E[ Y − E[Y|X] ]
                       = E[ XY − E[XY|X] ] − (EX)(EY − EY)
(Note that E[E[XY|X]] = E[X E[Y|X]] since given X, X is constant!)
                       = E[XY] − E[XY] − 0        (since E[E[h(X)|X]] = E[h(X)])
                       = 0.
(b)
    Var(Y − E[Y|X]) = Var( E[ Y − E[Y|X] | X ] ) + E( Var[ Y − E[Y|X] | X ] )
                    = Var(0) + E( Var[Y|X] )      (since E[Y|X] is a constant for a fixed X)
                    = E[ Var[Y|X] ]               (in equivalent notation).
(c)
    E[Cov[X, Y|Z]] = E[ E[XY|Z] − E[X|Z] E[Y|Z] ]
                   = E[XY] − E[ E[X|Z] E[Y|Z] ]
                   = (EX)(EY) − E[ E[X|Z] E[Y|Z] ]        (since X, Y uncorrelated)
                   = E[E[X|Z]] E[E[Y|Z]] − E[ E[X|Z] E[Y|Z] ]
                   = −Cov( E[X|Z], E[Y|Z] ).
(d)
    Cov(Z, E(Y|Z)) = E[ Z E[Y|Z] ] − E[Z] E[ E[Y|Z] ]
                   = E[ E[ZY|Z] ] − E[Z] E[ E[Y|Z] ]      (since, given Z, Z is constant)
                   = E[ZY] − E[Z]E[Y] = Cov(Z, Y).
1. Problem 3.2, page 30
If v = (x2 − µ2)/σ2 then dv = dx2/σ2. Therefore
    f_{X1}(x1) = ∫_{−∞}^{∞} [1/(2π √(1 − ρ²) σ1)] exp( −Q′/(2(1 − ρ²)) ) dv,
where
    Q′ = ( (x1 − µ1)/σ1 )² − 2ρ( (x1 − µ1)/σ1 ) v + v²
       = ( v − ρ(x1 − µ1)/σ1 )² + (1 − ρ²)( (x1 − µ1)/σ1 )²        (completing the square)
       = (1 − ρ²)( w² + ( (x1 − µ1)/σ1 )² ).
Also if w = ( v − ρ(x1 − µ1)/σ1 )/√(1 − ρ²) then dw = dv/√(1 − ρ²). Therefore
    f_{X1}(x1) = [1/(2πσ1)] ∫_{−∞}^{∞} exp( −½{ w² + ( (x1 − µ1)/σ1 )² } ) dw
               = [1/(√(2π) σ1)] exp( −½ ( (x1 − µ1)/σ1 )² ),
i.e. the PDF of N(µ1, σ1²).
Similarly by symmetry X2 ∼ N(µ2, σ2²). Therefore
    f_{X1|X2}(x1|x2) = f_{X1,X2}(x1, x2) / f_{X2}(x2)              (by definition)
                     = [1/(√(2π) σ1 √(1 − ρ²))] exp( −[ x1 − µ1 − ρ(σ1/σ2)(x2 − µ2) ]² / (2σ1²(1 − ρ²)) ),
i.e. the PDF of N( µ1 + ρ σ1 (x2 − µ2)/σ2 , σ1²(1 − ρ²) ).
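The conditional mean and variance above can be checked by simulation (an illustration, not part of the original notes; the parameter values are arbitrary, and the conditioning is approximated by a narrow window around the chosen x2):

```python
import numpy as np

# Bivariate Normal: check E[X1 | X2 = x2] = mu1 + rho*s1*(x2 - mu2)/s2
# and Var[X1 | X2 = x2] = s1^2*(1 - rho^2) by conditioning on a narrow window.
rng = np.random.default_rng(7)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 3.0, 0.6   # arbitrary example values
n = 2_000_000

z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x1 = mu1 + s1 * z1
x2 = mu2 + s2 * (rho * z1 + np.sqrt(1 - rho**2) * z2)

x2_star = 0.0
sel = np.abs(x2 - x2_star) < 0.05          # approximate conditioning
cond_mean = x1[sel].mean()
cond_var = x1[sel].var()
print(cond_mean, mu1 + rho * s1 * (x2_star - mu2) / s2)   # both near 1.8
print(cond_var, s1**2 * (1 - rho**2))                     # both near 2.56
```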
2. Problem 3.3, page 30
Straightforward multiplication of V and the alleged V −1 .
3. Problem 3.4.1, page 31
Since x = A⁻¹y, and det(AB) = det(A) det(B), it follows from exercise 3.1 (page 28) that
    f_Y(y) = |det A⁻¹| f_X(x) = [1/|det A|] f_X(A⁻¹y).
Now let Y = X1 + X2 & Z = X2, and consider the transformation from (X1, X2)ᵀ to (Y, Z)ᵀ:
    (Y, Z)ᵀ = [ 1  1 ; 0  1 ] (X1, X2)ᵀ,       X1 = Y − Z,   X2 = Z.
Therefore
    f_{Y,Z}(y, z) = [1/|det A|] f_X(y − z, z) = f1(y − z) f2(z),
    ∴ f_Y(y) = ∫_{−∞}^{∞} f_{Y,Z}(y, z) dz = ∫_{−∞}^{∞} f1(y − z) f2(z) dz.
Similarly putting Z = X1, or alternatively arguing by symmetry,
    f_Y(y) = ∫_{−∞}^{∞} f1(z) f2(y − z) dz.
If Xi IID ∼ Exp(1), then
    f_Y(y) = ∫_{−∞}^{∞} f1(y − z) f2(z) dz.
But f1(y − z) f2(z) = 0 if y < z or z < 0. Therefore the range of integration reduces to z ∈ (0, y), i.e.
    f_Y(y) = ∫_0^y e^{−(y−z)} e^{−z} dz = ∫_0^y e^{−y} dz = { y e^{−y}  (y ≥ 0),  0  (y < 0) }.
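The convolution result can be checked by simulation (an illustration, not part of the original notes): the density y e^{−y} is that of a Gamma(2, 1) distribution, with mean 2, variance 2, and CDF 1 − (1 + y)e^{−y}.

```python
import numpy as np

# Sum of two independent Exp(1) variables has density y*exp(-y), i.e.
# Gamma(2, 1): mean 2, variance 2, CDF at 1 equal to 1 - 2*exp(-1).
rng = np.random.default_rng(8)
n = 1_000_000

y = rng.exponential(1.0, n) + rng.exponential(1.0, n)
print(y.mean(), y.var())                  # both close to 2
print(np.mean(y < 1.0), 1 - 2 * np.exp(-1))   # both close to 0.264
```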
4. Problem 3.4.2, page 31
(a) E[X1] = E[µ1] + σ1 E[Z1] = µ1,
    E[X2] = E[µ2] + σ2( ρ E[Z1] + √(1 − ρ²) E[Z2] ) = µ2,
    Var[X1] = σ1² Var[Z1] = σ1²,
    Var[X2] = σ2²( ρ² Var[Z1] + (1 − ρ²) Var[Z2] ) = σ2²,
    E[X1X2] = E[ µ1µ2 + σ1σ2ρZ1² + aZ1 + bZ1Z2 ]    for constants a & b,
            = µ1µ2 + σ1σ2ρ,
    ∴ corr[X1, X2] = ( E[X1X2] − µ1µ2 )/(σ1σ2) = ρ.
(b) X2 = µ2 + σ2 ρ(X1 − µ1)/σ1 + σ2 √(1 − ρ²) Z2,
    therefore E[X2|X1] = µ2 + (X1 − µ1) ρ σ2/σ1,
    and Var[X2|X1] = (1 − ρ²) σ2² Var[Z2] = (1 − ρ²) σ2².
Note in particular that if ρ = 0 then E[X2|X1] = µ2 = E[X2] and Var[X2|X1] = σ2² = Var[X2],
i.e. the conditional (Normal) distribution of [X2|X1] has the same mean & variance as the marginal (Normal) distribution of X2.
This shows that if two RVs have a bivariate Normal distribution, then they are independent if & only if they are uncorrelated.
(c) Bivariate Normal by construction; the PDF is given in lecture notes (formulae 3.7 & 3.8).
(d) Therefore [X2|X1] is Normal (exercise 3.2), with mean & variance as given above. Note that
Var[X2|X1] is constant and E[X2|X1] depends on X1 if & only if ρ ≠ 0.
Therefore [X2|X1] has a distribution independent of X1 if & only if ρ = 0.
(e) The transformation from (X1, X2) to (Y1, Y2) has constant Jacobian, therefore Y1 & Y2 have a bivariate Normal distribution.
If σ1 = σ2 then
    Cov(Y1, Y2) = E[ (X1 + X2)(X1 − X2) ] − E[X1+X2] E[X1−X2]
                = E[X1² − X2²] − (µ1 + µ2)(µ1 − µ2)
                = ( Var[X1] + µ1² ) − ( Var[X2] + µ2² ) − µ1² + µ2²
                = σ1² − σ2² = 0.
Thus Y1 & Y2 are uncorrelated & have a bivariate Normal distribution, so are independent.
5. Problem 3.4.3, page 31
(a)
    ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [1/(2π√(1 − ρ²))] exp(−Q) du dv,
where
    Q = [1/(2(1 − ρ²))] ( u² − 2ρuv + v² )
      = [1/(2(1 − ρ²))] ( (u − ρv)² + (1 − ρ²)v² )                 (completing the square).
Now substitute w = (u − ρv)/√(1 − ρ²), so that dw = du/√(1 − ρ²), giving
    ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (1/2π) exp(−w²/2 − v²/2) dw dv
        = ( ∫_{−∞}^{∞} (1/√(2π)) exp(−w²/2) dw ) ( ∫_{−∞}^{∞} (1/√(2π)) exp(−v²/2) dv )
        = 1 × 1 = 1.
(b) Again substitute u = (x − µ_X)/σ_X and v = (y − µ_Y)/σ_Y:
    M_{X,Y}(s, t) = E[exp(sX + tY)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(sx + ty) f_{X,Y}(x, y) dx dy
                  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [1/(2π√(1 − ρ²))] exp(−Q) du dv,
where
    Q = [1/(2(1 − ρ²))] ( u² − 2ρuv + v² ) − s(µ_X + σ_X u) − t(µ_Y + σ_Y v)
      = [1/(2(1 − ρ²))] ( u² − u[ 2(1 − ρ²)sσ_X + 2ρv ] + v² ) − (sµ_X + tµ_Y + tσ_Y v)
      = [ u − ρv − (1 − ρ²)sσ_X ]² / (2(1 − ρ²))
        + { v² − v(2ρsσ_X + 2tσ_Y) − 2sµ_X − 2tµ_Y − (1 − ρ²)s²σ_X² } / 2
                                                     (completing the square for u)
      = [ u − ρv − (1 − ρ²)sσ_X ]² / (2(1 − ρ²))
        + ( v − ρsσ_X − tσ_Y )²/2 − ( 2sµ_X + 2tµ_Y + s²σ_X² + 2ρst σ_X σ_Y + t²σ_Y² )/2
                                                     (completing the square for v).
Finally substituting
    w = [ u − ρv − (1 − ρ²)sσ_X ]/√(1 − ρ²)    and    z = v − ρsσ_X − tσ_Y
gives
    M_{X,Y}(s, t) = exp( µ_X s + µ_Y t + ½(σ_X² s² + 2ρσ_Xσ_Y st + σ_Y² t²) )
                    × ∫_{−∞}^{∞} ∫_{−∞}^{∞} (1/2π) exp(−w²/2 − z²/2) dw dz
                  = exp( µ_X s + µ_Y t + ½(σ_X² s² + 2ρσ_Xσ_Y st + σ_Y² t²) ).
(c) Straightforward (partial) differentiation.
(d) Generalising the univariate & bivariate cases,
    M_X(s) = exp( µᵀs + ½ sᵀVs ).
6. Problem 3.4.4, page 31
By the construction of the bivariate Normal in formulae 3.6, Y can also be written as a linear
combination of Z1 and Z2 , and is therefore Normally distributed by problem 1.3.3, page 6.
7. Problem 3.4, page 34
The Jacobian of the transformation from z to y = a + Bz is the matrix of partial derivatives
    ∂y/∂z = ( ∂y_j/∂z_i ) = B,
i.e. it is constant. Also if y = a + Bz then z = B⁻¹(y − a).
    ∴ f_Y(y) ∝ f_Z(z) ∝ exp( −½ zᵀz )
             = exp( −½ [B⁻¹(y − a)]ᵀ [B⁻¹(y − a)] )
             = exp( −½ (y − a)ᵀ (B⁻¹)ᵀ B⁻¹ (y − a) ).
But (B⁻¹)ᵀ B⁻¹ = (Bᵀ)⁻¹ B⁻¹ = (BBᵀ)⁻¹.
Therefore f_Y(y) ∝ exp( −½ (y − a)ᵀ (BBᵀ)⁻¹ (y − a) ), i.e. f_Y(y) is the PDF of N(a, BBᵀ).
8. Problem 3.7.1, page 38
(a)
    F_Y(y) = Pr(Y ≤ y) = Pr(X² ≤ y)
           = Pr(−√y ≤ X ≤ √y) = Pr(X ≤ √y) − Pr(X < −√y)
           = Φ(√y) − Φ(−√y)                      (for y ≥ 0).
(b)
    f_Y(y) = (d/dy) F_Y(y) = φ(√y) · [1/(2√y)] + φ(−√y) · [1/(2√y)]     (for y ≥ 0).
(c)
    ∴ f_Y(y) = [1/(2√y)] ( (1/√(2π)) e^{−y/2} + (1/√(2π)) e^{−y/2} ) = (1/√(2π)) y^{−1/2} e^{−y/2}.
(d) The MGF M_Y(t) is given by
    M_Y(t) = ∫_0^∞ (1/√(2π)) y^{−1/2} e^{−y/2} e^{ty} dy
           = [1/√(1 − 2t)] ∫_0^∞ (1/√(2π)) x^{−1/2} e^{−x/2} dx       (putting x = (1 − 2t)y)
           = 1/√(1 − 2t)                          for t < ½.
9. Problem 3.7.2, page 38
    M_X(t) = Π_i M_{Zi²}(t) = ( 1/√(1 − 2t) )ⁿ         (from exercise 3.7.1)
    ∴ M_{X+Y}(t) = ( 1/√(1 − 2t) )^{m+n},
    ∴ X + Y ∼ χ²_{m+n}          by the inversion theorem for MGFs.
⇐
Note: the fact that (X + Y) ∼ χ²_{m+n} also follows directly from our definition of the chi-squared distribution.
10. Problem 3.7.3, page 38
(a) Since Z1 ⊥⊥ Z2,
    Pr( (−1 < Z1 < 1) & (−1 < Z2 < 1) ) = Pr(−1 < Z1 < 1) Pr(−1 < Z2 < 1)
                                        = ( Φ(1) − Φ(−1) )² = (0.6826)² = 0.466.
(b) Z1² + Z2² ∼ χ²₂, i.e. exponential with mean 2.
Therefore Pr(Z1² + Z2² < 1) = 1 − exp(−½) ≈ 0.393.
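Both probabilities can be checked by simulation (an illustration, not part of the original notes):

```python
import numpy as np

# Check Pr(|Z1| < 1 and |Z2| < 1) = (Phi(1) - Phi(-1))^2 ~ 0.466
# and Pr(Z1^2 + Z2^2 < 1) = 1 - exp(-1/2) ~ 0.393 for IID standard Normals.
rng = np.random.default_rng(9)
n = 2_000_000

z1 = rng.normal(size=n)
z2 = rng.normal(size=n)

p_square = np.mean((np.abs(z1) < 1) & (np.abs(z2) < 1))
p_circle = np.mean(z1**2 + z2**2 < 1)
print(p_square, p_circle)            # close to 0.466 and 0.393
```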
11. Problem 3.7.4, page 38
(a) Σ_{i=1}^{n} Yi² = YᵀY = Zᵀ AᵀA Z = ZᵀZ = Σ_{i=1}^{n} Zi².
(b) f_Y(y) = |J| f_Z(z) where J = det(∂z/∂y) = det A⁻¹
(easy to check componentwise using Z = A⁻¹Y).
J = det A⁻¹ = det(Aᵀ), therefore 1 = det(AᵀA) = det(Aᵀ) det(A) = J², so |J| = 1.
Therefore f_Y(y) = f_Z(z), i.e. Y ∼ N_n(0, I).
(c) Straightforward to verify that
    Σ_{k=1}^{n} a_{ik} a_{jk} = { 0 (i ≠ j),  1 (i = j) }.
(d)
    Y_n = (1/√n) Σ_{i=1}^{n} Zi = √n Z̄,
    Σ_{i=1}^{n−1} Yi² = Σ_{i=1}^{n} Yi² − Y_n² = Σ_{i=1}^{n} Zi² − n Z̄² = Σ_{i=1}^{n} (Zi − Z̄)².
(e) Y_n ⊥⊥ Y1, …, Y_{n−1}, ∴ Y_n/√n ⊥⊥ Σ_{i=1}^{n−1} Yi², i.e. Z̄ ⊥⊥ Σ_{i=1}^{n} (Zi − Z̄)².
Z̄ = Y_n/√n where Y_n ∼ N(0, 1), therefore Z̄ ∼ N(0, 1/n).
Σ_{i=1}^{n} (Zi − Z̄)² = Σ_{i=1}^{n−1} Yi² where Yi IID ∼ N(0, 1). Therefore Σ_{i=1}^{n} (Zi − Z̄)² ∼ χ²_{n−1}.
(f) Write Zi = (Xi − µ)/σ (i = 1, …, n) & result follows.
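The conclusions of (e) can be illustrated by simulation (not part of the original notes); independence is only spot-checked here via the sample correlation:

```python
import numpy as np

# For IID N(0,1) samples of size n: Zbar is uncorrelated with the sum of
# squared deviations (they are in fact independent), Zbar ~ N(0, 1/n),
# and the sum of squared deviations ~ chi-squared with n - 1 df (mean n - 1).
rng = np.random.default_rng(10)
n, n_sim = 5, 400_000

z = rng.normal(size=(n_sim, n))
zbar = z.mean(axis=1)
ss = ((z - zbar[:, None])**2).sum(axis=1)

corr = np.corrcoef(zbar, ss)[0, 1]
print(corr)                          # close to 0
print(ss.mean())                     # close to n - 1 = 4
print(zbar.var())                    # close to 1/n = 0.2
```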
12. Problem 3.7.5, page 38
(a) Note first that Y is a vector (also the numerator and denominator of each Yi are NOT independent), so the answer is NOT just a t distribution.
The easiest way to find the distribution of Y is to argue as follows:
i. √(Σ_i Zi²) is the length of the vector Z, so Y is a vector of length 1, i.e. Y lies on the unit sphere in (m + n) dimensions.
ii. f_Z(z) is spherically symmetric (since from definition 3.1 it is a function of zᵀz alone).
iii. The construction of Y preserves the spherical symmetry. Therefore Y has a spherically symmetric distribution on the unit sphere, i.e. is uniformly distributed on the unit sphere.
(b)
    X = [ n Σ_{1}^{m} Yi² ] / [ m Σ_{m+1}^{m+n} Yi² ]
      = [ n Σ_{1}^{m} Zi² / Σ_{1}^{m+n} Zi² ] / [ m Σ_{m+1}^{m+n} Zi² / Σ_{1}^{m+n} Zi² ]
      = [ n Σ_{1}^{m} Zi² ] / [ m Σ_{m+1}^{m+n} Zi² ] = (A/m)/(B/n),   say,
where A ⊥⊥ B, A ∼ χ²_m and B ∼ χ²_n. Therefore X ∼ F_{m,n}.
(c) By the same argument as part 3.7.5a, W = Y/√( Σ_{1}^{m+n} Yi² ) is uniformly distributed on the unit sphere.
But
    X = [ n Σ_{1}^{m} Wi² ] / [ m Σ_{m+1}^{m+n} Wi² ] = [ n Σ_{1}^{m} Yi² ] / [ m Σ_{m+1}^{m+n} Yi² ],
and therefore by part 3.7.5b has an F_{m,n} distribution.
13. Problem 3.7.6, page 39
(a) If λi is large then Yi ∼ N(λi, λi) approximately; therefore Zi ∼ N(0, 1) approximately.
(b) The Zi are mutually independent, so X = Σ_{i=1}^{k} Zi² ∼ χ²_k approximately.
14. Problem 3.7.7, page 39
(a) Note that E[Oi] = npi = Var[Oi] (property of Poisson).
    ∴ E[Oi − npi] = 0,
    & Var[Oi − npi] = Var[Oi] = npi          since npi is constant.
(b) N = Σ_{i=1}^{k} Oi ∼ Poi( Σ_{i=1}^{k} npi ) (since the sum of independent Poissons has itself a Poisson distribution), i.e. N ∼ Poi(n).
(c)
    E[Ei] = E[N pi] = npi,
    Var[Ei] = Var[N pi] = pi² Var[N] = npi².
(d)
    E[O1E1] = E[O1 N p1] = p1 E[ O1² + O1 Σ_{i=2}^{k} Oi ]
            = p1 ( Var[O1] + (E[O1])² ) + p1 E[O1] E[ Σ_{i=2}^{k} Oi ]
              (definition of Var)              (by independence)
            = p1( np1 + n²p1² ) + p1 (np1) n(1 − p1) = np1² + n²p1².
    ∴ Cov(O1, E1) = E[O1E1] − E[O1]E[E1] = np1² + n²p1² − (np1)(np1) = np1².
Similarly Cov(Oi, Ei) = npi² for each i.
(e)
    E[Oi − Ei] = npi − npi = 0,
    Var[Oi − Ei] = Var(Oi) − 2 Cov(Oi, Ei) + Var(Ei) = npi − 2npi² + npi² = npi(1 − pi).
15. Problem 3.7.8, page 39
If X ∼ χ²_n, then its PDF is given (Formula 3.18) by:
    f_X(x) = c(n) x^{(n/2)−1} e^{−x/2}          (for x > 0),
where c(n) = 1/( 2^{n/2} Γ(n/2) ).
Therefore
    EX = c(n) ∫_0^∞ x · x^{(n/2)−1} e^{−x/2} dx
       = c(n)/c(n + 2)          since the PDF for χ²_{n+2} is c(n + 2) x^{((n+2)/2)−1} e^{−x/2}
       = 2 Γ(n/2 + 1) / Γ(n/2)
       = 2 (n/2) = n            (property of gamma function).
Similarly, to find the variance,
    E[X²] = c(n) ∫_0^∞ x² · x^{(n/2)−1} e^{−x/2} dx
          = c(n)/c(n + 4)
          = 4 Γ(n/2 + 2) / Γ(n/2)
          = 4 ((n + 2)/2)(n/2) = n² + 2n,
therefore Var(X) = E[X²] − (EX)² = 2n.
Or, neater, show that the MGF M_X(t) is (1 − 2t)^{−n/2} (for t < ½), hence find M′_X(0) & M″_X(0), etc.
The mode x_M occurs when f′_X(x_M) = 0, i.e.
    (n/2 − 1) x^{n/2−2} e^{−x/2} − (1/2) x^{n/2−1} e^{−x/2} = 0,
    i.e. (n − 2 − x) x^{n/2−2} e^{−x/2} = 0.
But e^{−x/2} > 0, therefore either (n − 2 − x) = 0 or x^{n/2−2} = 0, i.e. x = (n − 2) or (for n ≥ 2) 0.
The form of f_X(x) shows that for n > 2, the mode is x_M = n − 2 (and f_X(0) = 0), and for n = 1, f_X(x) → ∞ as x → 0, so the mode is x_M = 0.
To find an approximate variance-stabilizing transformation, note that if X has a χ² distribution with mean µ, then it has variance 2µ, so we need to solve
    g′(µ) = 1/√Var(µ) = 1/√(2µ).
Therefore
    g(µ) = ∫_0^µ 1/√(2λ) dλ = (1/√2) · 2µ^{1/2} = √(2µ).
Therefore the transformed random variable Y = √(2X), or simply Y = √X, has approximately constant variance.
Note: As an example of when this might be useful, suppose we assume that Xi ∼ χ²_{µi} for i = 1, 2, …, where the µi are known. Then a plot of the corresponding observed xi against µi should lie scattered about the line x = µ, but the variability would increase with µ. However, a plot of yi = √xi against νi = √µi should roughly follow the line y = ν, with roughly equal variability throughout. This makes the linearity assumption (here equivalent to EXi = µi) much easier to judge visually.
16. Problem 3.7.9, page 39
(a) X1 ∼ N(0, 3).
(b) X2 ∼ t1 (i.e. Cauchy).
Note that E[(Z1+Z2)(Z1−Z2)] = E[Z1² − Z2²] = 0, i.e. numerator & denominator have bivariate Normal distribution & are uncorrelated, therefore independent.
(c) X3 ∼ F_{1,1}.
(d) X4 ∼ χ²₂, since (Z1+Z2)/√2 & (Z1−Z2)/√2 are IID N(0, 1).
(e) t4 .
(f) t3 .
(g) F1,3 .
Note that
Y1
= Z1 + Z2 + Z3 + Z4 ,
Y2
= Z1 + Z2 − Z3 − Z4 ,
Y3
= Z1 − Z2 + Z3 − Z4 ,
Y4
= Z1 − Z2 − Z3 + Z4 ,
are mutually uncorrelated & have a joint MVN distribution, so are independent and Normally
distributed.
(h) X8 /2 ∼ χ22 (exponential with mean 2). Therefore X8 has an exponential distribution with mean
4.
17. Problem 3.7.10, page 39
⇐
18. Problem 3.7.11, page 40
⇐
19. Problem 3.7.12, page 40
⇐
20. Problem 3.7.13, page 40
(a) The MVN(0, I) distribution is the continuous p-dimensional distribution with PDF
    f(z) = f(z1, z2, …, z_p) = (2π)^{−p/2} exp( −½ zᵀz ).
(b)
    i. X = Σ_{i=1}^{n} Zi²
    ii. T = Z_{n+1} / √( Σ_{i=1}^{n} Zi²/n )
    iii. Y = [ n Σ_{i=1}^{m} Zi² ] / [ m Σ_{i=m+1}^{m+n} Zi² ]
(c)
i.
    E[X] = ∫_0^∞ x f_X(x) dx = [1/(2^{n/2} Γ(n/2))] ∫_0^∞ x^{[(n+2)/2−1]} e^{−x/2} dx
         = 2^{(n+2)/2} Γ((n + 2)/2) / ( 2^{n/2} Γ(n/2) ) = n       (since Γ(n/2 + 1) = (n/2) Γ(n/2)).
Similarly
ii.
    E[X²] = 2^{(n+4)/2} Γ((n + 4)/2) / ( 2^{n/2} Γ(n/2) ) = 2² (n/2 + 1)(n/2) = n(n + 2).
iii.
    E[X⁻¹] = 2^{(n−2)/2} Γ((n − 2)/2) / ( 2^{n/2} Γ(n/2) ) = 1/(n − 2)       (provided n > 2).
(d)
i. σ_X² = E[X²] − (E[X])² = n² + 2n − n² = 2n for all n.
ii.
    µ_Y = (n/m) E[W_m / W_n]          where W_i ∼ χ²_i and W_m ⊥⊥ W_n
        = (n/m) E[W_m] E[1/W_n]       by independence
        = (n/m) × m × 1/(n − 2) = n/(n − 2)       provided n > 2.
iii.
    µ_T = 0 by symmetry (provided n > 1),
    σ_T² = E[T²] − (E[T])² = n E[W_1 / W_n]       where W_i ∼ χ²_i and W_1 ⊥⊥ W_n
         = n E[W_1] E[1/W_n]                      by independence
         = n/(n − 2)                              (provided n > 2).
1. Problem 4.4.1, page 48
⇐
2. Problem 4.4.2, page 48
⇐
3. Problem 4.4.3, page 48
(a)
    T = Σ_{i=1}^{n} wi Xi,
    ∴ ET = E[ Σ wi Xi ] = Σ wi EXi = Σ wi µ_X = µ_X        (since Σ wi = 1).
(b)
    Var( Σ wi Xi ) = Σ wi² σ_X²                   (writing Var(Xi) = σ_X²)
                   = σ_X² Σ ( wi − 1/n + 1/n )²
                   = σ_X² [ Σ (wi − 1/n)² + 1/n ]    since Σ wi = Σ (1/n) = 1,
which is minimised when wi = 1/n for all i. [Could also use Lagrange multipliers.]
(c)
    E[ Σ (Xi − X̄)² ] = E[ Σ (Xi − µ_X + µ_X − X̄)² ]
                     = E[ Σ (Xi − µ_X)² ] + 2E[ Σ (Xi − µ_X) (µ_X − X̄) ] + n E[ (µ_X − X̄)² ]
                     = n σ_X² − n E[ (µ_X − X̄)² ]
                     = n σ_X² − σ_X² = (n − 1) σ_X².
4. Problem 4.4.4, page 48
(a)
    Let p = Pr(No errors in either remaining lecture)
          = Pr(No errors in 2nd lecture) × Pr(No errors in 3rd lecture)
          = e^{−λ} · e^{−λ}          since Pr(X = x) = λ^x e^{−λ}/x! for X ∼ Poi(λ)
          = e^{−2λ}.
(b) Want E[p̂] = e^{−2λ} whatever the value of λ.
Can express p̂ generally by
    p̂ = p_x   if X = x   (x = 0, 1, 2, …).
Then
    E[p̂] = Σ_{x=0}^{∞} p_x Pr(X = x) = Σ_{x=0}^{∞} p_x λ^x e^{−λ}/x! = exp(−2λ) ∀ λ,
    ∴ Σ_{x=0}^{∞} p_x λ^x/x! = e^{−λ} = Σ_{x=0}^{∞} (−λ)^x/x!,
    ∴ p_x = { 1 (x even),  −1 (x odd) }          (equating coefficients of the two series).
(c) λ̂ = X (see MSA), ∴ MLE of p is e^{−2X} (since transform of MLE is also MLE).
(d) MLE is biased but gives sensible values (between 0 & 1, and decreases as X increases), whereas the MVUE gives ridiculous estimates and also ignores the magnitude of X.
5. Problem 4.4.5, page 48
    E[φ(S)] = E[ E[T|S] ] = E[T] = g(θ),
    ∴ φ(S) is also unbiased.
    Var[φ(S)|θ] = Var[ E[T|S] | θ ],
    but Var[T|θ] = Var[ E[T|S] ] + E[ Var[T|S] ] ≥ Var[φ(S)|θ]       since E[Var[T|S]] ≥ 0.
Thus for any unbiased estimator T, E[T|S] is also unbiased and has variance ≤ Var[T].
Therefore if a MVUE exists, then it is a function of a sufficient statistic for θ.
Note: if S′ is not sufficient for θ, then the distribution of φ(S′) = E[T|S′] may depend on θ,
i.e. φ(S′) is not just a function of the data and is therefore not an estimator.
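A worked illustration of this Rao-Blackwell argument (not part of the original notes; the Poisson example and parameter values are my own choices): for Poisson data, to estimate g(λ) = Pr(X = 0) = e^{−λ}, the crude unbiased estimator T = 1{X1 = 0} can be conditioned on the sufficient statistic S = ΣXi, giving φ(S) = E[T|S] = ((n−1)/n)^S, which is still unbiased but has much smaller variance.

```python
import numpy as np

# Rao-Blackwellisation for Poisson data: both estimators below are unbiased
# for exp(-lam), but conditioning on the sufficient statistic S shrinks
# the variance.
rng = np.random.default_rng(12)
lam, n, n_sim = 1.0, 10, 400_000

x = rng.poisson(lam, size=(n_sim, n))
t = (x[:, 0] == 0).astype(float)          # crude estimator T = 1{X1 = 0}
s = x.sum(axis=1)                         # sufficient statistic
phi = ((n - 1) / n) ** s                  # phi(S) = E[T|S]

print(t.mean(), phi.mean(), np.exp(-lam))  # all close to 0.368
print(t.var(), phi.var())                  # Var(phi) much smaller than Var(T)
```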
6. Problem 4.4.6, page 49
(a) bookwork
(b) bookwork
(c)
    E[T] = 1 × Pr(X1 + X2 = 0) + 0 × Pr(X1 + X2 > 0)
         = e^{−2λ} = Pr(X3 + X4 = 0).
Therefore T is unbiased.
(d) Let T be an unbiased estimator of Pr(X2 + X3 + X4 = 0), i.e. of e^{−3λ}.
T is a function of the data X1 alone, & can therefore be expressed as
    T = t_k   if X1 = k,   k = 0, 1, 2, …
We want T to be unbiased, i.e.
    e^{−3λ} = E[T] = Σ_{k=0}^{∞} t_k Pr(X1 = k) = Σ_{k=0}^{∞} t_k λ^k e^{−λ}/k!      ∀ λ > 0,
therefore
    e^{−2λ} = Σ_{k=0}^{∞} t_k λ^k/k!      ∀ λ > 0.
But e^{−2λ} = Σ_{k=0}^{∞} (−2)^k λ^k/k!.
Equating coefficients gives the result.
Note: students had previously seen the similar problem of unbiased estimation of exp(−2λ) rather than exp(−3λ).
(e) MLEs are e^{−3x1} and e^{−(x1+x2)} respectively.
(f) The MLEs are sensible, being between 0 and 1, and decreasing as the number of goals so far increases.
The unbiased estimators are unrealistic, and in case (d) silly: not restricted to [0, 1], and taking account of the evenness of x1 rather than its magnitude.
7. Problem 4.4.7, page 49
(a) bookwork
(b) bookwork
(c)
    i. E[W] = E[E[T|S]] = E[T] = τ(θ).
    ii. Var[W] = Var[T] − E[Var[T|S]] ≤ Var[T] ∀ θ.
(d)
    i. f(x|θ) = (n choose s) θ^s (1 − θ)^{n−s}, therefore S is sufficient.
    ii. E[T] = θ(1 − θ).
    iii.
        E[T|S] = Pr(T = 1|S) = Pr(X1 = 1|S) Pr(X2 = 0|S, X1 = 1) = (s/n) × ((n − s)/(n − 1)).
Therefore s(n − s)/(n − 1) is an MVUE of nθ(1 − θ).
8. Problem 4.4.8, page 50

    Est.   Unbiased?  Consistent?  Comments
    θ̂1     Y          Y            maximum likelihood & most efficient
    θ̂2     N          Y            silly but consistent!
    θ̂3     Y          N            E θ̂3 = ½EX1² − (EX1)(EX2) + ½EX2² = … = θ
    θ̂4     N          Y            biased estimator of variance
    θ̂5     Y          Y            unbiased estimator of variance
    θ̂6     Y          Y            since θ̂1 & θ̂5 are both unbiased & consistent
    θ̂7     N          N            biased since Poisson skewed; always integer or half-integer
    θ̂8     N          N            always integer
    θ̂9     Y          Y            less efficient than θ̂1 (see 4.4.3)
    θ̂10    Y          Y            less efficient than θ̂1 (see 4.4.3)

Note: θ̂7 will be consistent for median(F_X), unless Pr(X ≤ x) = 0.5 exactly for some x. Similarly θ̂8 will be consistent for mode(F_X), unless F_X has two equal modes at (say) x and x + 1.
9. Problem 4.4.9, page 50
Just something to think about! Shows that likelihood on its own is insufficient & perhaps misleading.
10. Problem 4.9.1, page 55
Just something to think about! (but represents JEHS position reasonably well).
11. Problem 4.9.2, page 55
(a) A possible plot is shown on page 2 of the notes; diamonds represent initial values, arrows point
to final values.
Let the RVs S1 , D1 , S2 , D2 represent systolic & diastolic pressure, before & after; corresponding
observed values s1 , d1 , s2 , d2 .
Natural to let response variable be after treatment.
Examples of possible models:
i. (S1 , D1 , S2 , D2 )T ∼ N (µ, V)
ii. (Si , Di )T ∼ N (µi , Vi )
iii. (Si , Di )T ∼ N (µi , V) (scatter plots before & after roughly lie in ellipses of same size, same
shape, same orientation but different location)
iv. S2 − S1 ∼ N (µ, V ) (model change in systolic pressure but take no account of diastolic
pressure)
v. S2 − S1 ∼ N (α + βs1 , V ) (allow for fact that change probably depends on S1 )
vi. S2 ∼ N (β0 + β1 s1 + β2 d1 , V )
. . . etc. (could similarly model D2 as the response).
Other possibilities include letting the random variation have a distribution other than Normal,
or allowing the variance to depend on s1 and d1 .
(b) SBP drops by about 20 mm Hg, DBP by about 10 mm Hg. Variance-covariance matrix of SBP
& DBP similar before & after.
(c) All SBPs dropped, all but 2 of the DBPs dropped. Effect on SBP looks larger & more consistent,
suggests use S2 − S1 . Could also define (e.g.) Ti = Si + Di and use T2 − T1 .
(d) As BPs must be positive (for live patients!), there is a case for taking logs, but a transformation
like log would have little effect here (since coefficient of variation is small). Also the effect of
treatment seems fairly consistent whatever the values of S1 and D1, so no reason to transform.
(e) There are only 15 patients, not enough data to fit a model with many parameters.
(f) ρ is an interesting parameter of the MVN distribution, whereas ρS doesn’t have such a natural
meaning. Given that data seem consistent with MVN, ρ is more useful in this case. [However,
other data might have extreme outliers]
(g) (to follow)
(h) Don’t have any info on how the patients would have done without being given Captopril. Would
like data on a ‘control group’, e.g. 15 comparable hypertensive individuals given some other
treatment (e.g. standard available treatment, or placebo). Because of natural variation, peoples’
BP fluctuates. People whose BP just happens to be higher than usual at a given time are more
likely to be labelled hypertensive, and their BP is likely to be lower later. Therefore would
expect an automatic apparent drop in BP (after-before) (‘regression to the mean’).
(i) Bootstrapping—will give an answer relevant to the actual data obtained, without assuming (e.g.)
underlying Normality. Here Normality seems reasonable, so little advantage in bootstrapping.
Also bootstrapping approximates a continuous distribution by a discrete one, which may be
misleading with a small sample (as here).
However, bootstrapping is a very widely applicable procedure, if one wanted to apply the same
approach to almost any problem (to reduce workload), makes sense to adopt the bootstrap.
1. Problem 5.1, page 59
The likelihood is

    L(θ; x) = (2π)^(−n/2) exp{−½ Σ(x_i − θ)²}.

∴ (taking θ0 = 0, θ1 = 1) reject H0 if

    L(θ0 ; x)/L(θ1 ; x) = exp{−½ Σx_i² + ½ Σ(x_i − 1)²} = exp(−Σx_i + n/2)

is too small. Equivalently, reject H0 if λ(x) = exp{n(x̄ − ½)} is too large.

But λ(x) is monotonic in x̄, therefore reject H0 if the test statistic T (X) = X̄ is greater than some c
(chosen to give the appropriate size).

For a size α test, want Pr(T > c) = α. But under H0 , T ∼ N (0, 1/n), therefore Z = √n T ∼ N (0, 1),
and (T > c) ≡ (Z > √n c).

    ∴ Pr(T > c) = 1 − Φ(√n c) = α
    ∴ c = Φ⁻¹(1 − α)/√n.
2. Problem 5.3, page 61
X1 , X2 , . . . , Xn iid ∼ N (0, σ²),

    L(σ²; x) = (2πσ²)^(−n/2) exp{−Σx_i²/(2σ²)}.

Therefore the likelihood ratio is

    t(x) = L(1; x)/L(σ²; x) = σⁿ exp{−½ (1 − 1/σ²) Σx_i²}.

(a) (σ² > 1): t(x) is monotonic decreasing in Σx_i².
Therefore reject H0 : σ² = 1 in favour of H1 : σ² = σ1² > 1 for sufficiently large values of Σx_i².
Under H0 , Σx_i² ∼ χ²_n . Therefore reject H0 iff Σx_i² > 100(1 − α)% point of χ²_n .
Same critical region ∀ σ1², therefore test is UMP.

(b) (σ² < 1): t(x) is now monotonic increasing in Σx_i².
Therefore (similarly to previous case) reject H0 : σ² = 1 in favour of H1 : σ² = σ1² < 1 iff
Σx_i² < 100α% point of χ²_n .
Again same critical region ∀ σ1², therefore again UMP.

(c) (σ² ≠ 1): Critical region Cα of the UMP test differs between the two cases σ² < 1, σ² > 1.
Therefore no UMP test for the combined case σ² ≠ 1 exists.
3. Problem 5.4, page 62
    T = X̄/(S/√n) ∼ t_{n−1}    (see exercise 3.7.4).

The likelihood is given by

    L(µ, σ²; x) = ∏_{i=1}^n (1/(√(2π) σ)) exp{−½ ((x_i − µ)/σ)²}
                = (1/((2π)^(n/2) σⁿ)) exp{−Σ(x_i − µ)²/(2σ²)}.

Under H0 : µ = 0, the MLE of σ² is σ̂0² = Σx_i²/n. Therefore the maximum possible likelihood under
H0 is

    L(0, σ̂0²; x) = (1/((2π)^(n/2) σ̂0ⁿ)) exp{−Σx_i²/(2σ̂0²)}
                 = (1/((2π)^(n/2) σ̂0ⁿ)) exp(−n/2).

Under H1 : µ ≠ 0, the MLE of (µ, σ²) is (µ̂, σ̂²) = (x̄, Σ(x_i − x̄)²/n), and the maximum possible
likelihood under H1 simplifies to

    L(µ̂, σ̂²; x) = (1/((2π)^(n/2) σ̂ⁿ)) exp(−n/2).

The likelihood ratio is

    λ = σ̂0ⁿ/σ̂ⁿ = (Σx_i² / Σ(x_i − x̄)²)^(n/2)
      = (1 + nx̄²/Σ(x_i − x̄)²)^(n/2)    (since Σ(x_i − x̄)² = Σx_i² − nx̄²)
      = (1 + t²/(n − 1))^(n/2).
Therefore λ is monotonic in t2 , i.e. the test based on rejecting H0 for large T 2 (or equivalently for
large |T |) is the same as the likelihood ratio test.
Similarly the test based on rejecting H0 for large T is a likelihood ratio test against the one-sided
alternative H1 : µ > 0.
For the diastolic blood pressure data, n = 15. Therefore under H0 : δD = 0, T ∼ t14 , and a size 0.05
test would reject H0 if |T | > 2.145 (the 97.5% point of the t14 distribution).
Calculations:

    x:  −5  −12  −1  −4  −3  −8  2  −21  −11  4  −16  −4  −23  −18  −19

    n = 15,   x̄ = −9.267,   s² = 74.21,   t = x̄/(s/√n) = −4.16.
Therefore reject H0 : δD = 0. In other words ‘there is statistically significant evidence at the 5% level
that the mean change in DBP is nonzero’.
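The arithmetic can be reproduced directly (a check, not part of the original solution):

```python
# One-sample t statistic for the 15 DBP changes (H0: mean change = 0).
import math
import statistics

x = [-5, -12, -1, -4, -3, -8, 2, -21, -11, 4, -16, -4, -23, -18, -19]
n = len(x)
xbar = statistics.mean(x)       # about -9.267
s2 = statistics.variance(x)     # about 74.21 (divisor n-1)
t = xbar / math.sqrt(s2 / n)    # about -4.17 (quoted as -4.16 in the notes)
```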
4. Problem 5.5, page 62
(a) Under H0 : µX = µY = µ0 (say), MLEs are:

    µ̂0  = (Σ_{i=1}^m Xi + Σ_{i=1}^n Yi)/(m + n),
    σ̂0² = [Σ_{i=1}^m (Xi − µ̂0)² + Σ_{i=1}^n (Yi − µ̂0)²]/(m + n).

Under H1 , MLEs are:

    µ̂X = X̄,   µ̂Y = Ȳ,
    σ̂² = [Σ_{i=1}^m (Xi − X̄)² + Σ_{i=1}^n (Yi − Ȳ)²]/(m + n).

By manipulations similar to, but more tedious than, those in exercise 5.4, the likelihood ratio is

    λ = (1 + [(mn/(m+n)) (x̄ − ȳ)²] / [Σ(xi − x̄)² + Σ(yi − ȳ)²])^((m+n)/2).

(b) Similarly to exercise 5.4,

    λ = (1 + t²/(m + n − 2))^((m+n)/2),

which is monotonic in

    t = (x̄ − ȳ)/(s_p √(1/m + 1/n)),

the observed value of T .

(c)

    Σ(Xi − X̄)²/σ² ∼ χ²_{m−1},   Σ(Yi − Ȳ)²/σ² ∼ χ²_{n−1},
    ∴ (m + n − 2)S_p²/σ² ∼ χ²_{m+n−2}   (since X ⊥⊥ Y).

Under H0 ,

    X̄ ∼ N (µX , σ²/m),   Ȳ ∼ N (µY , σ²/n),
    ∴ X̄ − Ȳ ∼ N (0, σ²(1/m + 1/n)).

Therefore T can be written

    T = A/√(B/(m + n − 2))

where

    A = (X̄ − Ȳ)/(σ √(1/m + 1/n)) ∼ N (0, 1),
    B = (m + n − 2)S_p²/σ² ∼ χ²_{m+n−2},

and A ⊥⊥ B. Therefore T ∼ t_{m+n−2}.
(d) m = 12, x = 120.0, n = 7, y = 101.0, s2p = 446.12, t = 1.89.
The 97.5% point of t17 is 2.110. Therefore accept H0 using a 2-sided test of size 0.05 (equivalently,
the result is not significant at the 5% level: P > 0.05).
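The pooled two-sample t statistic in part (d) can be recomputed from the quoted summary values (a check only):

```python
# Pooled two-sample t statistic from the summaries in part (d).
import math

m, xbar = 12, 120.0
n, ybar = 7, 101.0
sp2 = 446.12                      # pooled variance estimate
t = (xbar - ybar) / (math.sqrt(sp2) * math.sqrt(1 / m + 1 / n))
# about 1.89 < 2.110 (the 97.5% point of t_17), so H0 is not rejected
```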
5. Problem 5.6, page 63
(a) χ2m−1 , χ2n−1 respectively.
(b) Fm−1,n−1 , also sometimes written F (m − 1, n − 1).
(c)

    F = (563 − 84²/16)/(72 − 18²/16) = 2.37.
The 97.5% point of the F15,15 distribution is 2.86, the 95% point is 2.40. Therefore accept H0
with a test of size α = 0.05; also accept (just!) if α = 0.1.
6. Problem 5.7, page 64
    ℓ(θ; x) = n log(1/√(2π)) − ½ Σ(x_i − θ)²
    ∴ r(x) = −½ Σ(x_i − x̄)² − (−½ Σx_i²)
           = nx̄²/2    (since Σ(x_i − x̄)² = Σx_i² − nx̄²)
    ∴ 2r(x) = nx̄².

But X̄ ∼ N (0, 1/n) under H0 , so nX̄² = (√n X̄)² ∼ χ²₁, i.e. 2r(X) ∼ χ²₁ under H0 .
7. Problem 5.8, page 64
    r(x) = −½ Σ(x_i − θ̂_i)² + ½ Σ(x_i − θ̂)²
         = ½ Σ(x_i − x̄)²    (since θ̂_i = x_i and θ̂ = x̄).
Therefore, from theorem 5.3, Wald’s theorem holds exactly.
8. Problem 5.5.1, page 64
    L(µ, σ²; x) = ∏_{i=1}^n (1/(√(2π) σ)) exp{−½ ((x_i − µ)/σ)²},
    ∴ ℓ(µ, σ²; x) = −(n/2) log(2π) − n log σ − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)².

Under H0 , µ̂ = x̄, σ² = 1.
Under H1 , µ̂ = x̄, σ̂² = v̂, where v̂ = Σ(x_i − x̄)²/n.
Therefore the log likelihood ratio is:

    ℓ(x̄, v̂; x) − ℓ(x̄, 1; x) = [−(n/2) log v̂ − nv̂/(2v̂)] − [−nv̂/2]
                              = (n/2)(v̂ − 1 − log v̂),

i.e. reject H0 for large n(v̂ − 1 − log v̂).
9. Problem 5.5.2, page 64
Note that X ∼ Bin(n, p) arises when X = Σ_{i=1}^n X_i , where

    X_i = 1 with probability p,
          0 with probability 1 − p,

and the X_i are mutually independent. This gives EX = np, Var X = np(1 − p).
Also, by the CLT,

    (X/n − p)/√(p(1 − p)/n) → N (0, 1)   as n → ∞.

Therefore, under H0 : p = p0 ,

    Y = (X/n − p0)/√(p0(1 − p0)/n) ∼ N (0, 1) approx.
If x = 560, n = 1000 and p0 = 12 , then y = 3.79, roughly the 99.99% point of N (0, 1). Thus there is
VERY strong evidence against H0 .
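A quick check of the normal-approximation arithmetic (not part of the original solution):

```python
# Normal-approximation z statistic for H0: p = 1/2 with x = 560, n = 1000.
import math
from statistics import NormalDist

x, n, p0 = 560, 1000, 0.5
y = (x / n - p0) / math.sqrt(p0 * (1 - p0) / n)   # about 3.79
p_value = 1 - NormalDist().cdf(y)                  # upper-tail area, well below 1e-4
```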
10. Problem 5.5.3, page 64
Assume that X ∼ Bin(n, p), where X is the number of male babies in a total of n births. The
likelihood is

    L(p; n, x) = (n choose x) p^x (1 − p)^(n−x) = (370!/(197! 173!)) p^197 (1 − p)^173.

Under H0 , the MLE of p obviously occurs at p = p̂0 = ½ (since the observed proportion is greater
than ½).
Under H1 , the MLE of p is p̂1 = 197/370.
Therefore, after maximizing the likelihoods under H0 and under H1 , we’re effectively testing H0′ :
p = 1/2 against H1′ : p = p̂1 = 197/370.
Under H0′ , X ∼ Bin(370, ½), and large values of X are evidence against H0′ and in favour of H1′ .
Therefore X ∼ N (370/2, 370/4) approx., and

    (x − 370/2 − 1/2)/√(370/4) = (197 − 185.5)/9.62 = 1.20,

which is less than the 95% point of N (0, 1).
Therefore accept H0 using a test of size 0.05.
The binomial assumption seems reasonable but cannot be exactly correct, for example there is the
possibility of identical twins. However the Normal approximation to the binomial will be very accurate
given the large sample size.
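The continuity-corrected calculation above can be verified directly (a check only):

```python
# Normal approximation with continuity correction for X ~ Bin(370, 1/2).
import math

x, n = 197, 370
z = (x - n / 2 - 0.5) / math.sqrt(n / 4)   # about 1.20
# 1.20 < 1.645, the 95% point of N(0,1), so H0 is not rejected
```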
11. Problem 5.5.4, page 64
(a) bookwork
(b) bookwork
(c)
i. Regularity assumption reasonable if: no traffic jams, no variation in timetabling (e.g. in
rush hour/at weekends), no breakdowns, no staffing problems, only one bus company. . . .
Waiting times IID if person arrives at random times, doesn’t learn from experience.
However, in practice:
• buses don’t run regularly (for all above reasons),
• buses don’t arrive & depart instantaneously,
• people tend to arrive at the bus stop at particular times (e.g. when bus is due, or after
a lecture).
ii.

    L(θ; x) = ∏_{i=1}^n f (x_i |θ) = θ^(−n) if 0 ≤ x_i ≤ θ ∀ i, and 0 otherwise,

i.e. L(θ; x) = θ^(−n) on {θ : θ ≥ max_{i=1...n} x_i }.

iii. Under H0 ,

    L(θ; x) = 20^(−n) if max(x_i) ≤ 20, and 0 otherwise.

Under H1 ,

    L(θ; x) = θ1^(−n) if max(x_i) ≤ θ1 (where θ1 > 20), and 0 otherwise.

The likelihood ratio is

    λ = 0 if max(x_i) > 20,
    λ = (20/θ1)^(−n) if max(x_i) ≤ 20.
Thus the LRT is based on test statistic T = max(Xi ), and large values of T are evidence
against H0 .
To find a test of size α,

    Pr(T ≤ c|H0) = Pr(X_i ≤ c ∀ i|H0)
                 = ∏_{i=1}^n Pr(X_i ≤ c|H0)    (by independence)
                 = (c/20)^n.
Therefore the likelihood ratio test rejects H0 iff T = max(Xi ) > c, where c satisfies (c/20)n =
1 − α, i.e. c = 20(1 − α)1/n .
Note: the critical region is the same for all θ1 > 20, so the test is UMP.
Also note that the above test is the easiest to specify and seems most logical, but one could also
use the test
‘reject H0 iff T ∈ C’
where C is the union of (20, ∞) and any subset of [0, 20] of measure 20[1 − (1 − α)^(1/n)].
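The critical value can be evaluated numerically; the values α = 0.05 and n = 10 below are illustrative, not from the problem.

```python
# Critical value for the uniform LRT: reject H0 iff max(x_i) > c,
# where (c/20)^n = 1 - alpha.
alpha, n = 0.05, 10                 # illustrative choices
c = 20 * (1 - alpha) ** (1 / n)     # just below 20
```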
12. Problem 5.5.5, page 65
(a)

    ℓ(λ; x) = n log λ − 2 Σ log x_i − λ Σ x_i^(−1)
    ∴ ℓ(λ; x) − ℓ(½; x) = n log λ − n log ½ − (λ − ½) Σ x_i^(−1).

For λ > ½ this is monotonic decreasing in Σ x_i^(−1), and is therefore of the specified form.

(b) Given fX (x) = λx^(−2) e^(−λ/x), let Y = 1/X. Then

    fY (y) = fX (1/y) |dx/dy| = λ y² e^(−λy) · (1/y²) = λ e^(−λy),

i.e. Y has an exponential distribution with mean 1/λ.
Therefore Σ X_i^(−1) has a gamma distribution (using MGFs, see MSA), and in particular, under
H0 , Σ X_i^(−1) ∼ χ²_{2n}.

(c) Σ x_i^(−1) = 118. The 95% point of χ²₂₀ is 31.41 (and the 99.9% point is 45.31). Therefore reject
H0 (P < 0.001).
The large value of Σ x_i^(−1) is almost entirely due to one unusually low value (x6 = 0.01)—it’s
important to check that this value was recorded correctly and that there is no reason that X6
might have come from a different distribution.
13. Problem 5.5.6, page 65
(a) bookwork
(b) bookwork
(c)

    fX (x|µ) = (2πσ²)^(−n/2) exp{−½ Σ_{i=1}^n ((x_i − µ)/σ)²}.

Therefore

    fX (x|µ1)/fX (x|µ0) = exp{(1/(2σ²)) Σ_{i=1}^n [(x_i − µ0)² − (x_i − µ1)²]},

which is a function of x̄ after cancelling the Σx_i². Therefore the most powerful test is based on
X̄ and the result follows.

(d) The critical region is the same for all µ1 < µ0 , therefore the test is UMP.

(e) By symmetry, reject H0 if X̄ > µ0 + σΦ⁻¹(1 − α)/√n.

(f) There are different critical regions for the two cases µ1 < µ0 and µ1 > µ0 . Therefore the UMP
tests differ between the two cases, and there is no overall UMP test.

(g) A sensible procedure is to reject H0 if X̄ ∉ µ0 ± σΦ⁻¹(1 − α/2)/√n.
14. Problem 5.5.7, page 65
15. Problem 5.5.8, page 65: see MGB p.420.
16. Problem 5.5.9, page 66
17. Problem 5.5.10, page 66
18. Problem 5.5.11, page 66
19. Problem 5.9, page 68
    L(θ; y) ∝ ∏_{i=1}^k θ_i^(y_i)    where Σ_{i=1}^k θ_i = 1,

i.e.

    L(θ; y) ∝ [∏_{j=1}^{k−1} θ_j^(y_j)] (1 − Σ_{j=1}^{k−1} θ_j)^(y_k)

    ∴ ℓ(θ; y) = <const.> + Σ_{j=1}^{k−1} y_j log θ_j + y_k log(1 − Σ_{j=1}^{k−1} θ_j)

    ∴ ∂ℓ/∂θ_i = y_i/θ_i − y_k/(1 − Σ_{j=1}^{k−1} θ_j) = 0 at MLE.

    ∴ y_i/θ̂_i = y_k/θ̂_k = c (say) for 1 ≤ i ≤ k − 1.

But Σ_{i=1}^k θ̂_i = 1 and Σ_{i=1}^k y_i = n, therefore c = n, i.e. θ̂_i = y_i/n for 1 ≤ i ≤ k.
20. Problem 5.10, page 68
Let θ_{i+1} = Pr(X = i). Then under H0 ,

    θ_{i+1} = (3 choose i) φ^i (1 − φ)^(3−i),

i.e. under H0 , Y ∼ Mn(n, θ0) where

    θ0 = ((1 − φ)³, 3φ(1 − φ)², 3φ²(1 − φ), φ³)ᵀ.
21. Problem 5.7.1, page 69
The observed & expected counts are shown in the following table (R=round, A=angular, G=green,
Y=yellow):
            RY       RG       AY       AG
    Obs.    315      108      101      32
    Exp.    312.75   104.25   104.25   34.75

(the expected counts are 556 × 9/16, 556 × 3/16, 556 × 3/16 and 556 × 1/16).

    X² = Σ (Obs − Exp)²/Exp = 0.47
on 3 d.f. (since no parameters have been estimated).
This is less than the 10% point of χ23 , therefore accept the null hypothesis at size α = 0.05 (and in
fact at α = 0.9).
The fit is remarkably good, as was the case with many of Mendel’s genetic experiments on peas.
This strongly suggests that he (or, possibly, a minion) fiddled the data!
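The chi-squared arithmetic can be reproduced directly (a check, not part of the original solution):

```python
# Pearson chi-squared statistic for Mendel's 9:3:3:1 pea data.
obs = [315, 108, 101, 32]                 # RY, RG, AY, AG
ratio = [9, 3, 3, 1]
total = sum(obs)                          # 556
exp = [total * r / 16 for r in ratio]     # 312.75, 104.25, 104.25, 34.75
x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))   # about 0.47 on 3 d.f.
```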
22. Problem 5.7.2, page 69
Let pO,+ = Pr(type O & Rh+) etc.
Under H0 , let pO = Pr(type O) etc., and p+ = Pr(Rh+), etc.
Then under H0 , pO,+ = pO p+ etc. by independence.
The MLEs of the unknown probabilities are the corresponding observed proportions, i.e.

    p̂O  = (82 + 13)/300 = 95/300,     p̂+ = (82 + 89 + 54 + 19)/300 = 244/300,
    p̂A  = (89 + 27)/300 = 116/300,    p̂− = (13 + 27 + 7 + 9)/300 = 56/300,
    p̂B  = (54 + 7)/300 = 61/300,
    p̂AB = (19 + 9)/300 = 28/300,

    ∴ p̂O,+ = (95/300) × (244/300) = 0.258,

and other probabilities under H0 can be estimated similarly:

            O        A        B        AB
    Rh+     0.2576   0.3145   0.1654   0.0759
    Rh−     0.0591   0.0722   0.0380   0.0174

giving the following table of expected values (300 × probabilities):

            O       A       B       AB
    Rh+     77.3    94.4    49.6    22.8
    Rh−     17.7    21.7    11.4    5.2

Therefore the chi-squared statistic is

    Σ (O − E)²/E = 8.60   on 8 − 1 − (3 + 1) d.f.

(need to estimate pO , pA , pB and p+ ).
The 95% point of χ23 is 7.815, therefore reject H0 using a test of size 0.05.
Note: for a general contingency table, with r rows and c columns, the χ2 test for independence is
associated with (r − 1)(c − 1) d.f.
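The whole calculation can be reproduced directly from the observed table, using the usual row-total × column-total / n expected counts (a check only):

```python
# Chi-squared test for independence of blood group and Rhesus factor.
obs = [[82, 89, 54, 19],    # Rh+ row: O, A, B, AB
       [13, 27, 7, 9]]      # Rh- row
row = [sum(r) for r in obs]                 # [244, 56]
col = [sum(c) for c in zip(*obs)]           # [95, 116, 61, 28]
n = sum(row)                                # 300
x2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(4))
# about 8.60 on (2-1)(4-1) = 3 d.f.; exceeds 7.815, so reject H0
```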
23. Problem 5.7.3, page 69
(a)
i. A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a
reasonable simplifying assumption, or to reject it as unreasonable in the light of the data.
ii. A simple hypothesis is of the form Hk : θ = θk ,
i.e. the probability distribution of the data is specified completely.
A composite hypothesis is of the form Hk : θ ∈ Ωk ,
i.e. the parameter θ lies in a specified subset Ωk of the parameter space ΩΘ .
iii. A likelihood ratio test rejects H0 : θ ∈ Ω0 in favour of the alternative H1 : θ ∈ Ω1 = Ω \ Ω0
iff

    λ(x) = L(θ̂; x)/L(θ̂0 ; x) ≥ λ̄,

where θ̂ is the MLE of θ over the whole parameter space Ω, θ̂0 is the MLE of θ over Ω0 ,
and the value λ̄ is fixed so that

    sup_{θ∈Ω0} Pr(λ(X) ≥ λ̄ | θ) = α,

where α, the size of the test, is some chosen value.
(b) L(θ; x) = p1^(y1) p2^(y2) p3^(y3) p4^(y4).

(c) θ comprises 4 parameters with one constraint (Σ θ_i = 1), so ΩΘ is 3-dimensional.
Under H0 there are 3 constraints, so Ω0 is 1-dimensional. Therefore ν = 3 − 1 = 2.
(d) Parameterizing ΩΘ by p1 , p2 & p3 ,

    L(p1 , p2 , p3 ; x) = p1^(y1) p2^(y2) p3^(y3) (1 − p1 − p2 − p3)^(y4).

Corresponding log-likelihood is

    ℓ(p1 , p2 , p3 ; x) = y1 log p1 + y2 log p2 + y3 log p3 + y4 log(1 − p1 − p2 − p3)

    ∴ ∂ℓ/∂p_i = y_i/p_i − y4/(1 − p1 − p2 − p3) = 0 at p_i = p̂_i (i = 1, 2, 3)

    ∴ y1/p̂1 = y2/p̂2 = y3/p̂3 = y4/(1 − p̂1 − p̂2 − p̂3) = y4/p̂4.

Therefore p̂_i ∝ y_i (i = 1 . . . 4), therefore p̂_i = y_i/n (i = 1 . . . 4) since Σ p̂_i = 1.
∴
(e) Under H0 there are in effect 2 categories (Xi = 1 or 2) and (Xi = 3 or 4). Therefore, as above,
under H0 the MLEs of the probabilities of falling into each of these 2 categories are the observed
proportions (y1 + y2 )/n and (y3 + y4 )/n respectively.
Therefore the test statistic is 2× the difference between the maximised log likelihoods, where

    ℓ(θ̂; x) − ℓ(θ̂0 ; x) = y1 log(y1/n) + y2 log(y2/n) + y3 log(y3/n) + y4 log(y4/n)
                           − [y1 log(m1/n) + y2 log(m2/n) + y3 log(m3/n) + y4 log(m4/n)]
                         = Σ_{j=1}^4 y_j log(y_j/m_j).
(f) (y1 , y2 , y3 , y4 ) = (46, 49, 22, 32), therefore m1 = m2 = 47.5, m3 = m4 = 27.
∴ 2 Σ_{j=1}^4 y_j log(y_j/m_j) = 1.957. But the 75% point of the χ²₂ distribution is 2.77, so the null
hypothesis H0 is accepted (P > 0.25) using the above test.
The fundamental assumption is that the Xi are IID.
Independence is highly doubtful since (for example) a single individual may fracture limbs on
more than one occasion, or fracture more than one limb in a bad accident.
[ASIDE: 1 month before setting this question I broke my right ankle; 6 months later I broke it
again!]
Also the Xi may not be identically distributed, for example if arm or leg fractures are relatively
more common at different times of the year (skiing holidays etc.?)
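The likelihood-ratio statistic in part (f) can be recomputed directly (a check only):

```python
# Likelihood-ratio statistic 2 * sum y_j log(y_j/m_j) for the fracture data.
import math

y = [46, 49, 22, 32]
m = [47.5, 47.5, 27, 27]     # expected counts under H0
g2 = 2 * sum(yj * math.log(yj / mj) for yj, mj in zip(y, m))   # about 1.96
```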
1. Problem 6.1, page 73
Partially differentiating 6.6:
Partially differentiating 6.6:

    ∂Q/∂β0 = −2 Σ [y_i − (β0 + β1 x_i)],
    ∂Q/∂β1 = −2 Σ [y_i − (β0 + β1 x_i)] x_i .

Therefore, at the minimum,

    β̂0 n     + β̂1 Σx_i  = Σy_i       (A),
    β̂0 Σx_i  + β̂1 Σx_i² = Σx_i y_i   (B).

Then n(B) − Σx_i (A) gives formula (6.7); which can be substituted back into (A) to give (6.8).
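The elimination step can be sketched numerically; the three (x, y) pairs below are made up purely for illustration.

```python
# Solving the normal equations (A) and (B) directly for beta0-hat, beta1-hat.
x = [1.0, 2.0, 3.0]          # illustrative data only
y = [2.0, 4.0, 5.0]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))
# n(B) - (sum x_i)(A) eliminates beta0, giving the slope:
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = (sy - b1 * sx) / n      # substitute back into (A) for the intercept
```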
2. Problem 6.4.1, page 74
    Σ(x_i − x̄)(y_i − ȳ) = Σx_i y_i − ȳ Σx_i − x̄ Σy_i + nx̄ȳ = Σx_i y_i − nx̄ȳ,
    Σ(x_i − x̄) y_i = Σx_i y_i − x̄ Σy_i = Σx_i y_i − nx̄ȳ,
    Σx_i (y_i − ȳ) = Σx_i y_i − ȳ Σx_i = Σx_i y_i − nx̄ȳ.
3. Problem 6.4.2, page 74
    Σ_{i=1}^n (y_i − ŷ_i)² = Σ(y_i − β̂0 − β̂1 x_i)²                   (substituting for ŷ_i)
        = Σ(y_i − ȳ + ȳ − β̂0 − β̂1 x_i)²                             (usual sneaky trick)
        = Σ[y_i − ȳ − β̂1 (x_i − x̄)]²                                (since β̂0 = ȳ − β̂1 x̄)
        = Σ(y_i − ȳ)² − 2β̂1 Σ(y_i − ȳ)(x_i − x̄) + β̂1² Σ(x_i − x̄)²
        = Σ(y_i − ȳ)² − β̂1 Σ(x_i − x̄)(y_i − ȳ)                     (from formula for β̂1 and (6.4.1)).
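The identity can be checked numerically on a small made-up data set (the four (x, y) pairs are illustrative only):

```python
# Numerical check of:  RSS = Syy - beta1_hat * Sxy
x = [1.0, 2.0, 3.0, 4.0]     # illustrative data only
y = [2.0, 3.0, 5.0, 4.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)
b1 = sxy / sxx
b0 = ybar - b1 * xbar
rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
assert abs(rss - (syy - b1 * sxy)) < 1e-9   # the identity holds numerically
```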
4. Problem 6.4.3, page 74
(a) EX2 = 30.3 + 1.31x1 ,
(b) EX4 = 49.3 + 1.05x3 ,
(c) EX1 = 52.4 + 0.34x2 .
• Lines (a) & (b) are similar in the range of interest, though the SBP for a given DBP appears
lower after treatment. The lines (a) & (b) cross over at DBP ≈ 73.
• (a) & (b) have slopes greater than 1: drop in SBP is clearly greater than drop in DBP.
• (c) & (a) have very different slopes (1/0.339 vs. 1.31),
i.e. have very different lines to predict SBP from DBP, or DBP from SBP.
Note lines (a) & (c) intersect at (mean DBP, mean SBP) [WHY?]
• Comparison of (b) & (c) is pretty meaningless.
[Sketch: DBP against SBP, with the ‘before’ points (◦) and ‘after’ points (×) forming two separate
clusters.] Extreme case: combining before & after measurements could in general be misleading.
Note: Might consider omitting the intercept term, since if someone’s DBP is 0, then they’re dead,
so their SBP must also be 0. However, it might be that the relationship between DBP & SBP
is nonlinear, but well approximated by a straight line in the relatively small range of interest.
There’s no guarantee that this approximating straight line passes near the origin.
5. Problem 6.4.4, page 74
(a) 3-d case: plot the points

    0 = (0, 0, 0)ᵀ,  1 = (1, 1, 1)ᵀ,  x = (x1 , x2 , x3)ᵀ  &  y = (y1 , y2 , y3)ᵀ

in 3-d space. [Sketch: the vector y, the lines β0·1 and β1·x, and the plane P through 0, 1 & x,
i.e. the set of points of the form β0·1 + β1·x.]
The set of possible fitted values ŷ is then the plane P through 0, 1 and x;
P can be parametrised by (β0 , β1).
The actual ŷ is obtained by dropping a perpendicular from y to P,
i.e. literally finding the closest model to the data.
(b) 2-d case: plot the points

    0 = (0, 0)ᵀ,  x = (x1 , x2)ᵀ  &  y = (y1 , y2)ᵀ

in the plane. [Sketch: ŷ = (ŷ1 , ŷ2)ᵀ is the foot of the perpendicular from y to the line β1·x.]
Pythagoras ⇒

    y1² + y2² = ŷ1² + ŷ2² + (y1 − ŷ1)² + (y2 − ŷ2)².
Note: Shall see similar examples later:
‘Total sum of squares’ = ‘Fitted sum of squares’ + ‘Residual sum of squares’.
6. Problem 6.2, page 76
(a) From the formula (3.9) for the PDF of the MVN,

    f (y|β, σ², X) = (1/√((2π)ⁿ |σ²I|)) exp{−½ (y − Xβ)ᵀ(σ²I)⁻¹(y − Xβ)}
    ∴ ℓ(β, σ²; y, X) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)ᵀ(y − Xβ).

(b) For any given value of σ², the log-likelihood is maximised by minimising the sum of squares
Σ_{i=1}^n (y_i − ŷ_i)² = (y − Xβ̂)ᵀ(y − Xβ̂). In particular, this is true if σ² = σ̂², the MLE of σ².
Therefore the maximum likelihood estimator of β is also the least-squares estimate,

    β̂ = (XᵀX)⁻¹XᵀY.

To obtain the distribution of β̂, first note that if Y is a RV & A is a constant matrix, then

    E[AY] = A EY,
    Var[AY] = E[(AY)(AY)ᵀ] − E(AY) E(AY)ᵀ
            = A E[YYᵀ]Aᵀ − A(EY)(EYᵀ)Aᵀ = A(Var Y)Aᵀ.

Since (XᵀX)⁻¹Xᵀ is a constant matrix, and Y is MVN, β̂ also has a MVN distribution, with
mean and variance given by:

    Eβ̂ = (XᵀX)⁻¹Xᵀ EY = (XᵀX)⁻¹XᵀXβ = β,
    Var(β̂) = (XᵀX)⁻¹Xᵀ(Var Y)X(XᵀX)⁻¹
           = σ²(XᵀX)⁻¹XᵀIX(XᵀX)⁻¹
           = (XᵀX)⁻¹σ².

    ∴ β̂ ∼ N (β, (XᵀX)⁻¹σ²).

(c) From formula 6.12,

    ∂ℓ/∂(σ²) = −n/(2σ²) + s²/(2σ⁴) = 0   when s²/σ² = n.

Therefore σ̂² = s²/n.

    S²/σ² = nσ̂²/σ² ∼ χ²_{n−p},
    ∴ ES² = (n − p)σ²
    ∴ Eσ̂² = ((n − p)/n) σ².

Similarly

    Var S² = 2(n − p)(σ²)²,
    ∴ Var σ̂² = 2(σ²/n)²(n − p) = (2(n − p)/n²) σ⁴.
(d) E[S 2 ] = (n − p)σ 2 , therefore E[S 2 /(n − p)] = σ 2 .
7. Problem 6.3, page 77
Straightforward calculation.
b
Note: there are quicker and more numerically stable methods to calculate β.
8. Problem 6.4, page 78
One possibility: divide SBP into ranges (say n1 intervals before treatment and n2 after treatment),
count the number of cases in each of the resulting (n1 × n2) categories, and use a χ² test for independence. This is a simple, useful procedure when there is a lot of data.
However, the total number of points is here only 15, so one must choose n1 = n2 = 2 giving just four
categories (or ‘contingencies’); for example
                     After
                < 150    ≥ 150
    Before
    < 170        4        2
    ≥ 170        2        7
The value of the χ² statistic is

    (4 − 2.4)²/2.4 + (2 − 3.6)²/3.6 + (2 − 3.6)²/3.6 + (7 − 5.4)²/5.4 = 2.96   (with 1 d.f.)
However, because of the small expected values even with only four categories, χ21 is here a poor
approximation to the null distribution.
There are alternative tests in the case of 2 × 2 contingency tables (Yates’ correction, Fisher’s exact
test).
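The statistic (without Yates’ correction, matching the calculation in the text) can be checked directly:

```python
# Chi-squared statistic for the 2x2 SBP table, expected counts from the margins.
obs = [[4, 2], [2, 7]]
row = [sum(r) for r in obs]             # [6, 9]
col = [sum(c) for c in zip(*obs)]       # [6, 9]
n = sum(row)                            # 15
x2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(2))   # about 2.96 on 1 d.f.
```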
9. Problem 6.5, page 78
(a) A quadratic may give a better fit than a straight line, but may still be a very poor fit and
therefore useless for prediction.
Always remember that rejecting H0 just means that H1 is more plausible than H0 , it DOESN’T
necessarily mean that H1 represents a good model.
(b) A quadratic is symmetrical, but the effect of a unit increase in x for low x (perhaps making the
dosage effective) is fundamentally different from that at high x (e.g. toxicity).
In general, when fitting quadratics or higher-order polynomials, the predictions for large x
depend to a large extent on the pattern of the data for low x, and vice versa. In almost all
practical circumstances, this is nonsensical.
10. Problem 6.6.2, page 78
(a) ri vs. ybi
i. OK
ii. nonlinearity
iii. heteroscedasticity, e.g. Y ∼ N (ŷ, σ²ŷ) or Y ∼ N (ŷ, σ²ŷ²); c.f. variance-stabilising transformations
(b) histogram (or probability plot) of ri
i. OK
ii. nonnormality
(c) ri vs. explanatory variable in model (x1 say). Nonlinearity suggests including x21 in model, or
transforming x1 and/or y
(d) ri vs. explanatory variable NOT in model (x2 say). Correlation (e.g.) suggests including x2
11. Problem 6.7.1, page 79
Model is equivalent to previous representation of the t-test, with new β0 = old β0 and new β1 = old β1 − old β0 .
12. Problem 6.7.2, page 79
    m = 10,   n = 15,   x̄ = 70.19,   ȳ = 68.58,   s_p² = 11.05,
    t = (x̄ − ȳ)/(s_p √(1/m + 1/n)) = 1.19.
Therefore accept H0 : µX = µY (P > 0.2, since |t| < 1.319).
A plot suggests the data are consistent with the assumptions of Normality & equal variances. However:
(a) Obviously Normality can’t be exactly right since heights are positive.
(b) One tends to get more outliers at low heights (say < 50 inches) rather than very tall (say > 90
inches). Thus if the population mean really is 70 inches, the distribution of heights is somewhat
negatively skewed. But this hardly matters because of the central limit theorem.
(c) Independence can only be assumed, but would break down if some men were genetically related.
Can’t deduce anything about trends in average height over time—height is fairly constant between
ages 25 and 54, but what happens if tall people tend to die younger (or to live longer?)
13. Problem 6.7.3, page 79
Linearity: Points lie fairly close to a straight line, but (to JEHS’s eyes!) it looks more as though
there’s a line below which points don’t fall, rather than that points are scattered about a line.
Possibility that lower boundary is curved rather than straight.
Independence: Can’t comment without knowing more about how the cities were chosen
(e.g. may have been chosen to be easy for researchers to get to & hence not be evenly distributed
geographically).
Normality: Rather suspect (see comments under linearity)—distribution seems positively skewed.
Note possible outliers (34.3, 47) & (38.4, 42).
Note also that these are the points with the highest longitude—suggests strongly that both x1
and x2 should be included in model.
Equal variances: Highly suspect—apart from the two outliers, y is more variable at high x1 .
However, plotting different symbols to represent different values of x2 suggests that outliers (large or
low values of y for given x1 ) occur at large x2 . This is easy on a computer!
This indicates
(a) Assessment of ‘reasonableness of assumptions’ can change when extra explanatory variables are
included, i.e. it’s dangerous to use the marginal distribution [Y |x1 ] instead of the conditional
distribution [Y |x1 , x2 ].
(b) Relationship between y and (x1 , x2) is nonlinear in x2 [if linear, would expect size of circles to
change monotonically and smoothly at each x1 ]—therefore need to consider a model like

    EY = β0 + β1 x1 + β2 x2 + β3 x2².
It’s also important to take the geography into account—e.g. to include the explanatory variable
x3 =height above sea level.
Another useful plot is x1 against x2 , with size of symbol denoting the magnitude of y (produces a
rotated & reflected map of the USA).
14. Problem 6.7.4, page 79
15. Problem 6.7.5, page 81
Formulae for βbi : bookwork.
If the Yi |xi are IID Normal then the least squares estimates are also maximum likelihood estimates,
and statistical procedures involving distributional assumptions (predictive distributions, hypothesis
tests etc.) can be carried out.
(i) Relationship between x and y is clearly nonlinear, but points lie close to a smooth, monotonically
increasing, concave curve
(ii) Need more information about the conduct of the experiment to assess independence; if ambient
conditions were kept constant throughout then independence seems plausible.
(iii) Variance clearly is not constant, but increases as x & y increase.
(iv) There are no clear outliers, consistent skewness etc., so Normality assumption is reasonable.
(A) Transforming from y to y 0 = loge (y) would on its own make the relationship still less linear. It
would not affect independence, but would make the homoscedasticity assumption plausible since
the sample s.d. of the y values at a given x is roughly proportional to their mean. Normality
would be barely affected, since the s.d.s are much less than the mean.
(B) Transforming x would only affect the linearity assumption (and in fact transforming x, y to
1/x, log y makes the simple linear regression model reasonable).
16. Problem 6.7.6, page 82
(a) The least squares line is
y = 305 + 0.0782 x.
Giving the calculated coefficients to any greater precision would convey a misleading impression;
the scatter in the data is such that (for example) the constant term could easily be 290 or 320.
(b) Residuals are given in the following table (denoting weeks beginning 26–Sept–99 to 19–Dec–99
by 1, 2, . . . , 13; no data are available for week 7):

    Week    1     2     3     4     5     6
    x       0     35    901   641   1549  823
    y       182   253   315   443   525   344
    resid   −123  −54   −60   88    99    −25

    Week    8     9     10    11    12    13
    x       1136  2114  2097  3732  5     0
    y       383   584   536   461   352   296
    resid   −10   114   67    −135  47    −9
The large positive residuals are all in the middle half of the time period; the large negative
residuals are at the beginning and end.
The largest negative residual (week 11, i.e. week beginning 5–Dec–99) is associated with an
unusually high x value, and could be regarded as an outlier (and I shall regard it as an outlier
in what follows). The three next largest negative residuals occur in weeks 1, 2 & 3, i.e. the least
squares line consistently overestimates y at the start of the time period.
Thus there is clear evidence that the relationship between x & y is changing over time.
(c) A useful plot of the original data is the following scattergram of x & y, in which each point is
labelled with the week it represents. The plot shows that up to week 10 the points lie reasonably
around a straight line (which by eye has an equation something like y = 200 + x/5), but the
points for weeks 11, 12 & 13 lie well off the line. Each of the last three points has a fairly typical
y value, but the x value for week 11 is exceptionally large, and the x values for weeks 12 & 13
are zero or near zero. This suggests that one can fit a decent model for weeks 1–10, but that it
breaks down thereafter.
Reasonableness of assumptions:
Independence: Three of the four largest negative residuals occur together (weeks 1–3). Thus
successive ‘random errors’ appear to be related, and the independence assumption is implausible.
Also an individual may access the website many times, and many of the ‘remote’ hits may
represent Warwick students working from home. Both these facts affect independence.
Linearity: The plot of y against x shows that week 11 lies clearly off any line passing through
the other points. Even without the plot, the tabulated data and the information that there
was an exam in week 11 suggest that there are particular circumstances at the end of the
time period which may cause a general systematic formula for predicting Y from x to break
down. Linearity is therefore an unreasonable assumption.
Normality: The unusual value for week 11 suggests that the random variation has heavier
tails than has a Normal distribution. Also the response variable is a non-negative integer,
so the Normality assumption is somewhat dubious anyway.
Note: assessing Normality is hard (and also not particularly useful!) if the more fundamental
assumptions of conditional independence and/or the form of the systematic part of the model
(here linearity) seem unreasonable.
Equal variances: The presence of an outlier (week 11) suggests that in some circumstances
the prediction errors are unusually large, i.e. in some circumstances the variance of the
random errors is large.
Also one might expect greater prediction errors associated with large counts Y than with
small counts.
Possible ways to improve the prediction of Y :
• Transforming y: A reasonable simple assumption might be that the number Y of remote
hits in a given week follows a Poisson distribution (for discussion: do you think this is a
reasonable assumption?) To equalise variances one might therefore use √Y (see ‘variance
stabilising transformations’ in the lecture notes) as the response variable to be predicted.
• Transforming x: The x value for week 11 is exceptionally large compared to the other x
values, and, not to get too technical, totally screws up the model fitting process. It therefore
makes some sense to transform x so as to squash up large values (e.g. take logs or square
roots). Some of the x values are 0, so taking logs looks like a bad idea. On the other hand,
taking square roots exhibits some consistency with the proposed transformation of Y , so is
worth trying.
• Omitting outliers etc.: Probably the most important commentary on the simple least
squares fit is to acknowledge that a reasonable model for weeks 1–10 will probably not work
for weeks 11–13 (as explained above). This in turn means that including weeks 11–13 when
calculating the least squares line distorts the picture: the model for predicting y from x in
a ‘typical’ week during term-time should be recalculated using just the data from weeks 1
to 10, perhaps after taking square roots.
This gives the formula

    √y = 13.8 + 0.205 √x.
For example, given the information x = 2000, one would predict √y = 13.8 + 0.205 × √2000 =
23.0, i.e. y = 23.0² = 529. This compares with y = 305 + 0.0782 × 2000 = 461 for the original
least squares line.
• To make the predictions more useful in practice, one should also give some idea of the
uncertainty in the predicted value (for example, a ‘confidence interval for prediction’). This
could be derived using the general formula for β̂ in the Normal Linear Model; the specific
formulae in the case of simple linear regression are given in many textbooks, and the actual
numerical values in a given situation are produced by most statistics packages when fitting
the model to the data-set.
Note that I only asked you to suggest ways to improve the prediction of Y . I was NOT
expecting anyone to calculate a confidence interval or anything; the important thing is
to realise that giving an idea of one’s uncertainty is a valuable part of making usable
predictions, and that the appropriate formulae in standard situations can be looked up. Anyone
who at least suggested qualifying predictions with a measure of the uncertainty involved,
got brownie points.
• One needs more data, particularly from outside term time and around exam time, in order
to make useful predictions in other circumstances.
• Other explanatory variables might help prediction, for example (if available) the number of
individual people who accessed the web-page, not just the total number of hits.
Some other comments:
• The number of ‘hits’ should not be taken to represent the number of separate people accessing the web-page. For example, many people (particularly students eager to get last-minute
information and help as the exam approaches) will access the web-page a large number of
times in a given week.
• It’s extremely important to use common sense and additional information as well as mathematical ability and computer packages. Here the knowledge that there was an exam on
8–Dec–99 and that term ended shortly afterwards (which could be guessed from the data
even if you didn’t know beforehand) suggests that weeks 11–13 are special cases, and that
any model may break down there.
• Note how the underlying assumptions of the Normal linear model are often interlinked—if
the systematic part of the model isn’t reasonable throughout the whole range of the data,
then all four assumptions may break down together.
Contrariwise, sometimes (as here) by excluding a small ‘atypical’ subset of the data, one
may simultaneously make all the assumptions more reasonable and hence find a simple
model that works well in all but exceptional circumstances.
• There are other sensible ways to analyse the data, for example one might exclude week 11
but include weeks 12 & 13. Similarly one might exclude week 1 (which lies outside term
time).
• It seems highly likely that the outlier in week 11 is due to the exam, but be careful to say
(e.g.) ‘presumably caused by’ or ‘probably due to’, rather than just ‘caused by’. One needs
a lot more information, not to say hubris, to deduce causation.
• The unavailability of data for week 7 is unfortunate—if the point were to lie close to a line
through points 4, 5, 9 & 10 (see the plot of ‘JEHS Webpage Hits’) then weeks 6 & 8 would
seem to be anomalies. However if the point for week 7 was found to lie close to those for
weeks 6 & 8, then one might look for some explanation.
On the other hand, if the data for week 7 are missing because of some problem with the
University ‘hit counter’, then this problem might have been partly present in the adjacent
weeks, explaining why points 6 & 8 lie away from the line through weeks 4, 5, 9 & 10.
• Disclaimer: there is always the danger of reading too much into the data, particularly when
we have so little!
17. Problem 6.7.7, page 82
(a) See plots.
(b) Straightforward calculation; see plots for fitted lines.
(c) Apologies for not having defined Normal probability plots—I thought they had been used in
MSA. I therefore marked this part of the question quite leniently.
(d) Some comparisons between fits 1 & 2:
i. The linearity assumption seems reasonable for both models.
ii. The scatter about the first fitted line increases as x increases, whereas the ‘homoscedasticity’
assumption seems reasonable for fit 2.
iii. Bank 9 (13.2, 144.2) looks like an outlier for fit 1, but is not particularly out of line for fit 2.
iv. The Normal probability plot of the residuals from fit 1 curves upwards (i.e. the distribution
of the random variation appears skewed), whereas the Normal probability plot for fit 2 is
perhaps straighter.
v. Note that the values x in fit 1 are highly skewed. This makes the three outlying points with
high x particularly influential, i.e. the least squares fit ‘takes particular notice of them’ and
here passes pretty well through their centroid
$$\left(\frac{49.0 + 42.3 + 36.3}{3},\ \frac{218.8 + 265.6 + 170.9}{3}\right) = (42.5,\ 218.4).$$
Fit 1 is thus most sensitive to the y’s corresponding to the largest x values, and may mislead
about the relationship for low/moderate x.
For these reasons, fit 2 seems preferable to fit 1. Note that this doesn’t necessarily imply that
fit 2 is an adequate description of the observed relationship between assets and income. For
example, important explanatory variables have been omitted (see below). Also note that net
income could be negative (if a bank really is in trouble), which suggests that square root or log
transformations may be dubious even if all the observed values happen to be positive.
(e) Bank 19 is marked with a cross on the plots, and doesn’t seem particularly out of line for fit 2.
However, log(net income) is noticeably lower for bank 19 than for any of the other banks. This
might be worrying if, for example, the actual income and expenditure were both large. It would
therefore be useful to have further information such as
x1 actual income ($) in 1973,
x2 actual expenditure ($) in 1973 (so x = x1 − x2 ),
x3 number of investors in 1973,
and corresponding values pre-1973.
In addition, it would be useful to have information on further banks with net income similar to
that of bank 19. However, from the data presented, and without the benefit of hindsight, there
is little reason to suspect that bank 19 was in trouble.
18. Problem 6.7.8, page 83
19. Problem 6.7.9, page 83
(a) bookwork
(b)
$$(Y - X\beta)^T (Y - X\beta) = \left[(Y - X\hat\beta) + X(\hat\beta - \beta)\right]^T \left[(Y - X\hat\beta) + X(\hat\beta - \beta)\right]$$
$$= (Y - X\hat\beta)^T (Y - X\hat\beta) \qquad (A)$$
$$\quad + 2(Y - X\hat\beta)^T X(\hat\beta - \beta) \qquad (B)$$
$$\quad + (\hat\beta - \beta)^T X^T X (\hat\beta - \beta), \qquad (C)$$
where
$$(B/2) = \left(Y - X(X^T X)^{-1} X^T Y\right)^T X(\hat\beta - \beta) = \left(Y^T X - Y^T X (X^T X)^{-1} X^T X\right)(\hat\beta - \beta) = 0,$$
and $(C) \ge 0$ (since it’s an inner product, i.e. a sum of squares),
$$\therefore (Y - X\beta)^T (Y - X\beta) \ge (Y - X\hat\beta)^T (Y - X\hat\beta),$$
with equality only for $\beta = \hat\beta$.
(c)
$$E[\hat\beta] = E[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T X \beta = \beta,$$
$$\mathrm{Var}[\hat\beta] = (X^T X)^{-1} X^T \mathrm{Var}[Y]\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} X^T I_n X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$$
(d)
$$AA = X(X^T X)^{-1} X^T X (X^T X)^{-1} X^T = A,$$
$$(I_n - A)(I_n - A) = I_n - 2A + AA = I_n - A.$$
(e) $\hat\beta$ is a linear transformation of Y where Y is MVN. Therefore $\hat\beta$ is also MVN.
$$\therefore \hat\beta \sim N_p\!\left(\beta,\ \sigma^2 (X^T X)^{-1}\right), \qquad \therefore \hat{Y} \sim N_n(X\beta,\ \sigma^2 A).$$
$$\therefore E[Y - \hat{Y}] = 0 \quad \& \quad E[(Y - \hat{Y})^T \hat{Y}] = E[Y^T (I_n - A)^T A Y] = 0.$$
Thus each component of $(Y - \hat{Y})$ is uncorrelated with each component of $\hat{Y}$.
Therefore $(Y - \hat{Y})$ and $\hat{Y}$ are independent (property of MVN).
20. Problem 6.7.10, page 83
For SLR,
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},$$
$$X^T X = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix},$$
$$\therefore (X^T X)^{-1} = \frac{1}{\det(X^T X)} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix},$$
$$X^T y = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix},$$
$$\therefore \hat\beta = (X^T X)^{-1} X^T y
= \frac{1}{n\sum x_i^2 - \left(\sum x_i\right)^2} \begin{pmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ -\sum x_i \sum y_i + n \sum x_i y_i \end{pmatrix}
= \frac{1}{\sum x_i^2 - n\bar{x}^2} \begin{pmatrix} \sum x_i^2\,\bar{y} - \bar{x}\sum x_i y_i \\ \sum x_i y_i - n\bar{x}\bar{y} \end{pmatrix}.$$
$$\therefore \hat\beta_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2},$$
$$\hat\beta_0 = \frac{\sum x_i^2\,\bar{y} - \bar{x}\sum x_i y_i}{\sum x_i^2 - n\bar{x}^2}
= \bar{y}\,\frac{\sum x_i^2 - n\bar{x}^2}{\sum x_i^2 - n\bar{x}^2} - \bar{x}\,\frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
= \bar{y} - \hat\beta_1 \bar{x}.$$
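A quick numerical check (with made-up data) that the closed-form SLR estimates agree with the general matrix formula $\hat\beta = (X^T X)^{-1} X^T y$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 9.8, 16.1])
n = len(x)

# closed-form SLR estimates from the derivation above
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
b0 = y.mean() - b1 * x.mean()

# compare with beta-hat = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([b0, b1], beta)
```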
21. Problem 6.6, page 84
Straightforward calculation.
22. Problem 6.9.1, page 86
$$\sum n_i \left(\bar{Y}_{i+} - \bar{Y}_{++}\right)^2
= \sum n_i \bar{Y}_{i+}^2 - 2\sum n_i \bar{Y}_{i+}\bar{Y}_{++} + \sum n_i \bar{Y}_{++}^2
= \sum n_i \bar{Y}_{i+}^2 - 2\bar{Y}_{++}\left(n\bar{Y}_{++}\right) + n\bar{Y}_{++}^2
= \sum n_i \bar{Y}_{i+}^2 - n\bar{Y}_{++}^2.$$
Hence result; the 2nd equality is proved similarly.
Note: These formulae give simpler formulae for the sums of squares in the ANOVA table (page 85).
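The shortcut identity is easily verified numerically on arbitrary data (the group sizes below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(size=m) for m in (5, 8, 12)]   # arbitrary group sizes
ni = np.array([len(g) for g in groups])
gi = np.array([g.mean() for g in groups])           # group means Y-bar_{i+}
n = ni.sum()
gbar = sum(g.sum() for g in groups) / n             # grand mean Y-bar_{++}

lhs = np.sum(ni * (gi - gbar) ** 2)
rhs = np.sum(ni * gi ** 2) - n * gbar ** 2          # the shortcut formula
assert np.isclose(lhs, rhs)
```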
23. Problem 6.9.2, page 86
(a) Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables
xTi , the NLM makes the following assumptions:
i. Independence i.e. the Yi are mutually independent given the xTi .
This is usually best checked by considering how the data arose, but for example if successive
observations are not independent, then a plot of the residuals $r_i = y_i - \hat{y}_i$ against i may
show a trend.
ii. Linearity
i.e. the expected value of the response is linearly related to the unknown parameters β:
EYi = xTi β.
A plot of $r_i$ against the predicted response $\hat{y}_i = x_i^T \hat\beta$ should show points lying randomly
about the straight line r = 0.
iii. Normality i.e. Yi |xi is Normally distributed.
A histogram of ri should have a shape like that of a Normal density (more sophisticated
plots like a ‘Normal probability plot’ or a ‘Q-Q plot’ may also be useful).
iv. Homoscedasticity (Equal Variances) i.e. Yi |xi ∼ N (xTi β, σ 2 ).
A plot of the residuals ri against any other variable (e.g. i, ybi or the jth explanatory variable
xij ) should show similar scatter for all values of this other variable.
(b)
i. Independence: plausible since the observations are on separate units, but we’d like more
information. Were some of the mice genetically related? Were they kept separate?
ii. Identically distributed random variation: unlikely since the responses must be nonnegative,
& the experiment is only being carried out because some difference in response level between
groups is expected. This suggests trying a log, √, or similar transformation.
iii. Equal variances: similar discussion to ‘identical distributions’ assumption. In particular,
group 9D seems less spread out as well as obviously having the lowest mean. This again
suggests trying a log or similar transformation.
iv. Normality: could only be a crude approximation since data are > 0, and are also rounded
to the nearest whole day. However, the sample sizes are fairly large, so the Normality
assumption is relatively unimportant.
v. ‘Linearity’: $EY_i = x_i^T \beta$ automatically holds for one-way ANOVA, since the fitted $\hat\beta$ agree
with the observed group means.
(c) Calculations:
$$1604^2/224 = 11486, \qquad 125^2/31 + 442^2/60 + 1037^2/133 = 11846.$$
ANOVA table:

Source of      Degrees of   Sum of    Mean
Variation      Freedom      Squares   Square
Overall mean        1        11486
Group               2          360    180.00
Residuals         221         1278      5.78
Total             224        13124

F = 180/5.78 = 31 on (2, 221) d.f., P < 0.001. Therefore reject H0 decisively.
There is a clear overall difference in response to the different strains (assuming that the experiment has been properly conducted & independence reasonable!)
The group means are 4.03, 7.37, 7.80 showing that mice survive on average half as long when
inoculated with 9D compared with either of the other 2 strains.
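The whole table follows from the group totals read off the printed calculation (totals 125, 442 and 1037 on 31, 60 and 133 mice respectively, with total sum of squares 13124); a sketch:

```python
# Group sizes and totals as read off the calculation above
ni = [31, 60, 133]
ti = [125.0, 442.0, 1037.0]
ss_total = 13124.0
n, t = sum(ni), sum(ti)                  # n = 224, grand total = 1604

ss_mean = t ** 2 / n                                             # 1604^2/224
ss_group = sum(tt ** 2 / m for tt, m in zip(ti, ni)) - ss_mean   # group SS
ss_resid = ss_total - ss_mean - ss_group                         # residual SS
F = (ss_group / 2) / (ss_resid / 221)
```

The computed values reproduce the table entries (11486, 360, 1278) and F ≈ 31 to the rounding used in the notes.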
24. Problem 6.9.3, page 86
(a) Some statistics + box-plots for each group (calculated by the statistics package S-Plus) are given
below. LQ=lower quartile (25% point), UQ=upper quartile (75% point).
Group    no.   min    LQ   med.    UQ   max   mean     var   s.d.
N+P       20    14   104    124   260   655    186   25229    159
AD+P      18    13    76    140   251   499    182   20981    145
AD+I      19    18    45     82   132   465    113   11191    106
Note that many other equally useful plots are possible, and there are other sensible estimates
for the quartiles.
The general impression is that the N+P and AD+P groups are similar, but the values for the
AD+I group tend to be lower and also less spread out.
(b) If the group means were proportional to their s.d.s then a square root transformation would
be approximately variance-stabilizing. If the group means were proportional to their variances
then a log transformation would be approximately variance-stabilizing. Both transformations
would also reduce skewness, and either transformation could be tried here.
The one-way analysis of variance tables are as follows (your answer only needs one of them!)
ANOVA on Untransformed Data

Source         d.f.   sum of squares
Overall mean      1          1465607
Group effect      2            64357
Residual         54          1037470
Total            57          2567434

F = (64357/2)/(1037470/54) = 1.675

ANOVA on √Data

Source         d.f.   sum of squares
Overall mean      1           7685.8
Group effect      2             94.7
Residual         54           1359.5
Total            57           9140.0

F = (94.7/2)/(1359.5/54) = 1.882

ANOVA on log(Data)

Source         d.f.   sum of squares
Overall mean      1          1263.23
Group effect      2             2.75
Residual         54            44.61
Total            57          1310.59

F = (2.75/2)/(44.61/54) = 1.665
The 80%, 90% and 95% points of the F(2,54) distribution are 1.658, 2.404, 3.168 respectively.
Thus in each case, the null hypothesis of ‘no group effect’ is accepted (P > 0.1). The data
provide no strong evidence that the underlying level of nitrogen-bound bovine serum albumin
differs between the three populations.
25. Problem 6.9.4, page 87
The sufficient statistics are:
Group     HB SS   HB S/-t   HB SC
n_i          16        10      15
ȳ_i+       8.71     10.63   12.30

with n = 41, ȳ_++ = 10.49, and total sum of squares $\sum_i \sum_j y_{ij}^2 = 4652$.
ANOVA table (using the shortcut formulae in problem 6.9.1):

Source         d.f.          SS
Overall mean      1          41(10.49)^2 = 4514.0
Treatment     p − 1 = 2      16(8.71 − 10.49)^2 + 10(10.63 − 10.49)^2 + 15(12.30 − 10.49)^2 = 99.9
Residual      n − p = 38     4652 − 16(8.71)^2 − 10(10.63)^2 − 15(12.30)^2 = 38.0
Total            41          4652

$$F = \frac{99.9/2}{38.0/38} = 50.0$$
The 95%, 99% and 99.9% points of F(2,38) are 3.245, 5.211, 8.331 respectively, i.e. reject H0 decisively.
Comments
(a) Given the sums of squares so near to 50× and 1× the d.f., the data appear to have been fiddled
for teaching purposes!
(b) Note what the various sums of squares ‘mean’, e.g.
$$41(10.49)^2 = \left[7.2^2 + 7.7^2 + \ldots + 13.9^2\right] - \left[(7.2-10.49)^2 + (7.7-10.49)^2 + \ldots + (13.9-10.49)^2\right]$$
& similarly the other formulae are each equivalent to differences between the lack-of-fits of two
models.
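The table can be recomputed from the sufficient statistics alone (a sketch; because the printed group means are rounded, the sums of squares come out slightly different from 99.9 and 38.0, but the conclusion is unchanged):

```python
ni = [16, 10, 15]
ybar = [8.71, 10.63, 12.30]    # group means from the table of sufficient statistics
ss_total = 4652.0
n = sum(ni)                                        # 41
gbar = sum(m * y for m, y in zip(ni, ybar)) / n    # overall mean, ~10.49

ss_mean = n * gbar ** 2
ss_trt = sum(m * (y - gbar) ** 2 for m, y in zip(ni, ybar))
ss_resid = ss_total - ss_mean - ss_trt
F = (ss_trt / 2) / (ss_resid / 38)   # ~49 here, vs the printed 50.0 (rounding)
```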
26. Problem 6.9.5, page 88
27. Problem 6.9.6, page 89
GENERAL COMMENT
The most important feature is the time trend:
• the largest 3 positive residuals are on days 1,3,4;
the largest negative residual is on day 21.
• also the largest x1 are at the ends of the series;
the largest x2 are at the start of the series.
(a) Independence is violated—see above.
Linearity is plausible, but the outliers will have affected the estimation of the coefficients for x2 and possibly for x1 .
Homoscedasticity: the variance of Y appears to increase with x3 , and possibly also with x1 and x2 (not clear if the outliers are omitted).
Normality: there are some suspicious-looking outliers, particularly day 21, but we need to reestimate the coefficients in the model after omitting (say) days 1–4 & 21, and replot.
A useful additional plot would be residual vs. day.
It would also be helpful to have the estimated dispersion matrix of (X1 , X2 , X3 ).
(b) One really needs a model taking time into account (outside the scope of MSB). A reasonable
suggestion is to fit the same model to days 5–20 only.
Some comments on the suggestions (many others are possible!):
i. The largest (positive and negative) residuals seem associated with the largest y; transforming
to (e.g.) log y looks plausible.
ii. Transforming the explanatory variables would have little effect since they each have small
coefficient of variation.
iii. Deleting outliers is the most sensible next step: the outliers are at the extremes of the series
of data and therefore have an a priori reason for being suspect.
iv. Fitting a polynomial is too sophisticated and misleading given the presence and nature of
the outliers.
v. Similarly fitting interactions is misleading here.
vi. The model is presumably to be used for prediction; this is not so easy to do nonparametrically.
vii. For bootstrapping, one needs to know the mechanism giving rise to the data. The interdependence between all the variables makes it hard to use bootstrapping sensibly (especially
with a short series of observations).
viii. No sensible nonlinear model suggests itself—better to keep things simple.
28. Problem 6.9.7, page 90
29. Problem 6.7, page 92
Suppose data are Yij = yij , i = 1 . . . I, j = 1 . . . J. Then
Y = (y11 , y12 . . . y1J , y21 . . . y2J , . . . , yI1 . . . yIJ )T
Note that after fitting $\bar{y}_{i+}$ and $\bar{y}_{+j}$, we have fitted 1 + (I − 1) + (J − 1) parameters. Therefore if X
has I + J or more columns, then $X^T X$ is singular (see linear algebra).
Therefore want to parameterise so that X has (I + J − 1) columns. One possibility:
X = (1 | A | B), where 1 is a column of ones, A is the n × (I − 1) matrix of first-factor indicators and B is the n × (J − 1) matrix of second-factor indicators: the row of X for observation (i, j) is a 1, followed by a 1 in column (i − 1) of A (if i ≥ 2) and a 1 in column (j − 1) of B (if j ≥ 2), with zeros elsewhere. For example, with I = 2, J = 3 and observations ordered (1,1), (1,2), (1,3), (2,1), (2,2), (2,3):
$$X = \begin{pmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1
\end{pmatrix}.$$
Here the first column corresponds to the parameter µ, the mean response when i = j = 1.
The next (I − 1) columns correspond to parameters α2 , α3 . . . αI , representing the differences between levels i = 2 . . . I and level 1 of the first factor.
The final (J − 1) columns correspond to parameters β2 , β3 . . . βJ , representing the differences between levels j = 2 . . . J and level 1 of the second factor.
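This parameterisation can be sketched in code; the construction below (illustrative, not from the notes) builds the (IJ) × (I + J − 1) design matrix and confirms that $X^T X$ is nonsingular.

```python
import numpy as np

def design(I, J):
    """(I*J) x (I+J-1) two-way layout design matrix: a column of ones, then
    indicators for rows 2..I and columns 2..J, observations ordered (1,1),(1,2),..."""
    rows = []
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            alpha = [1.0 if i == k else 0.0 for k in range(2, I + 1)]
            beta = [1.0 if j == k else 0.0 for k in range(2, J + 1)]
            rows.append([1.0] + alpha + beta)
    return np.array(rows)

X = design(3, 4)
assert X.shape == (12, 6)
# X^T X has full rank I + J - 1 = 6, so the parameterisation is identifiable
assert np.linalg.matrix_rank(X.T @ X) == 6
```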
30. Problem 6.11.1, page 94
Source         d.f.      s.s.
Overall mean      1    210.887
Days              9      3.007
Vaccines          2      0.922
Resid            18      0.637
Total            30    215.452

Test of H0 : ‘no day effect’:
$$F = \frac{3.007/9}{0.637/18} = 9.4 \qquad (P < 0.001)$$
Test of H0 : ‘no vaccine effect’:
$$F = \frac{0.922/2}{0.637/18} = 13.0 \qquad (P < 0.001)$$
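The two variance ratios can be recomputed directly from the table; note that the arithmetic pairs 9.4 with the day effect on (9, 18) d.f. and 13.0 with the vaccine effect on (2, 18) d.f.

```python
ms_resid = 0.637 / 18
F_days = (3.007 / 9) / ms_resid      # day mean square over residual mean square
F_vaccines = (0.922 / 2) / ms_resid  # vaccine mean square over residual mean square
```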
31. Problem 6.11.2, page 94
32. Problem 6.11.3, page 95
For example,
$$(2) - (3) = \sum_{ij} \left(Y_{ij} - \bar{Y}_{++}\right)^2 - \sum_{ij} \left(Y_{ij} - \bar{Y}_{i+}\right)^2$$
$$= \left[\sum_{ij} Y_{ij}^2 - 2\sum_{ij} Y_{ij}\bar{Y}_{++} + IJ\bar{Y}_{++}^2\right]
- \left[\sum_{ij} Y_{ij}^2 - 2\sum_i \sum_j Y_{ij}\bar{Y}_{i+} + \sum_i J\bar{Y}_{i+}^2\right]$$
$$= -IJ\bar{Y}_{++}^2 + 2J\sum_i \bar{Y}_{i+}^2 - J\sum_i \bar{Y}_{i+}^2
= J\left(\sum_i \bar{Y}_{i+}^2 - I\bar{Y}_{++}^2\right),$$
or equivalently
$$= J \sum_i \left(\bar{Y}_{i+} - \bar{Y}_{++}\right)^2.$$
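This sum-of-squares identity can be checked numerically on an arbitrary two-way array:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 4, 6
Y = rng.normal(size=(I, J))
row_means = Y.mean(axis=1)   # Y-bar_{i+}
grand = Y.mean()             # Y-bar_{++}

lhs = np.sum((Y - grand) ** 2) - np.sum((Y - row_means[:, None]) ** 2)
rhs = J * np.sum((row_means - grand) ** 2)
assert np.isclose(lhs, rhs)
```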
33. Problem 6.11.4, page 95
1. Problem 7.1, page 99
2. Problem 7.3.1, page 99
(a)
i. Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables
xTi , the NLM makes the following assumptions:
A. Independence
i.e. the Yi are mutually independent given the xTi .
B. Linearity
i.e. the expected value of the response variable is linearly related to the unknown parameters β:
EYi = xTi β.
C. Normality
i.e. Yi |xi is Normally distributed.
D. Homoscedasticity (Equal Variances)
i.e. Yi |xi ∼ N (xTi β, σ 2 ).
ii. A simple linear regression model is a linear model with one response variable Y and one
explanatory variable X, i.e. a model of the form
E[Y |x1 ] = β0 + β1 x1 .
iii. A nonlinear model has the form
E[Y |x] = g(x, β)
where Y is the response, x is a vector of explanatory variables, β = (β1 . . . βp )T is a
parameter vector, and the function g is nonlinear in the βi s.
(b) If Yi ∼ N (β0 + β1 xi , σ 2 ) and the Yi are independent given the xi s, then the log likelihood is:
$$\ell(\beta_0, \beta_1, \sigma^2; x, y) = -n\log\left(\sqrt{2\pi}\right) - n\log\sigma - \frac{1}{2}\sum (y_i - \beta_0 - \beta_1 x_i)^2/\sigma^2$$
($\sum$ denotes $\sum_{i=1}^n$).
Whatever the value of σ, ℓ is maximised over β0 and β1 by minimising
$$Q = \sum (y_i - \beta_0 - \beta_1 x_i)^2/\sigma^2.$$
At the joint MLE,
$$\frac{\partial Q}{\partial \beta_0} = -\frac{2}{\sigma^2}\sum (y_i - \beta_0 - \beta_1 x_i) = 0,
\qquad \therefore \hat\beta_0\, n + \hat\beta_1 \sum x_i = \sum y_i,$$
$$\frac{\partial Q}{\partial \beta_1} = -\frac{2}{\sigma^2}\sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0,
\qquad \therefore \hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum x_i y_i.$$
Solving the simultaneous equations in $\hat\beta_0$ and $\hat\beta_1$:
(n × second equation) − ($\sum x_i$ × first equation) gives
$$\hat\beta_1 \left(n\sum x_i^2 - \left(\sum x_i\right)^2\right) = n\sum x_i y_i - \sum x_i \sum y_i.$$
Therefore
$$\hat\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad \hat\beta_0 = \left(\sum y_i - \hat\beta_1 \sum x_i\right)/n.$$
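As a sanity check (with made-up data), the closed-form estimates satisfy both normal equations and agree with a library least-squares fit:

```python
import numpy as np

x = np.array([0.5, 1.2, 2.0, 3.1, 4.4, 5.0])
y = np.array([1.1, 2.0, 3.2, 4.1, 5.9, 6.2])
n = len(x)

b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b0 = (np.sum(y) - b1 * np.sum(x)) / n

# the MLEs satisfy both normal equations
assert np.isclose(b0 * n + b1 * np.sum(x), np.sum(y))
assert np.isclose(b0 * np.sum(x) + b1 * np.sum(x ** 2), np.sum(x * y))
# and agree with numpy's least-squares fit (polyfit returns slope first)
assert np.allclose(np.polyfit(x, y, 1), [b1, b0])
```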
(c) The plot shows clearly that Y tends to decrease as x increases. On its own the plot doesn’t
make the NLM assumptions unreasonable. However there are several comments that could be
made given what the data represent:
• Clearly Y cannot be negative; the Normality assumption is therefore suspect, particularly
where EY is small, i.e. x is large.
• Over the range of the data a straight line fit looks reasonable, but Y cannot be negative,
so one might consider a model like E[Y |x] = max(β0 + β1 x, 0), or perhaps transform Y to
log Y . However, log Y will have a large variance for large x (heteroscedasticity).
• The fact that the three largest x values are all 5 suggests the data may have been truncated
(to 100,000 on the original scale).
• Later data points tend to have higher x and correspondingly lower Y , suggesting a time
trend and perhaps non-independence.
3. Problem 7.3.2, page 100
4. Problem 7.3.3, page 100
(a)
(b)
(c) Wait till the SIS asks you, then say ‘four’. It’s a trick question.
5. Problem 7.3.4, page 101
Bibliography
[1] V. Barnett. Comparative Statistical Inference. John Wiley and Sons, New York, second edition, 1982.
[2] C. Burt. The genetic determination of differences in intelligence: A study of monozygotic twins reared
together and apart. Brit. J. Psych., 57:137–153, 1966.
[3] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA,
1990.
[4] M. H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, Mass., second edition, 1989.
[5] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics
and Applied Probability. Chapman and Hall, New York, 1993.
[6] D. Freedman, R. Pisani, R. Purves, and A. Adhikari. Statistics. W. W. Norton, New York, second
edition, 1991.
[7] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice.
Chapman and Hall, London, 1996.
[8] D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, editors. A Handbook of Small
Data Sets. Chapman and Hall, London, 1994.
[9] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. MacMillan, New York, 1970.
[10] B. W. Lindgren. Statistical Theory. Chapman and Hall, London, fourth edition, 1994.
[11] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill,
New York, third edition, 1974.
[12] D. S. Moore and G. S. McCabe. Introduction to the Practice of Statistics. W. H. Freeman & Company
Limited, Oxford, UK, third edition, 1998.
[13] S. C. Narula and J. F. Wellington. Prediction, linear regression and minimum sum of relative errors.
Technometrics, 19:185–190, 1977.
[14] O.P.C.S. 1993 Mortality Statistics, volume 20 of DH2. Her Majesty’s Stationery Office, London, 1995.
[15] J. F. Osborn. Statistical Exercises in Medical Research. Blackwell Scientific Publications, Oxford, UK,
1979.
[16] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth, Pacific Grove, CA, second edition,
1995.
[17] P. Sprent. Data Driven Statistical Methods. Chapman and Hall, London, 1998.
[18] S. Weisberg. Applied Linear Regression. John Wiley and Sons, New York, 1980.