
A Very Brief Summary of Statistical Inference, and Examples
Hilary Term 2007
Prof. Gesine Reinert
Data x = x1, x2, . . . , xn, realisations of random variables X1, X2, . . . , Xn
with distribution (model) f (x1, x2, . . . , xn; θ)
Frequentist: θ ∈ Θ unknown constant
1. Likelihood and Sufficiency
(Fisherian) Likelihood approach:
L(θ) = L(θ, x) = f (x1, x2, . . . , xn; θ)
Often: X1, X2, . . . , Xn independent, identically distributed (i.i.d.);
then L(θ, x) = ∏_{i=1}^n f(xi, θ)
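As an illustration (not part of the original notes), here is a minimal Python sketch of evaluating an i.i.d. log-likelihood; the Exp(θ) model and the toy data are purely illustrative.

import numpy as np

def log_likelihood(theta, x):
    # log L(theta, x) = sum_i log f(x_i, theta) for the i.i.d. case,
    # here with f(x, theta) = theta * exp(-theta * x) as an illustrative model
    x = np.asarray(x)
    return np.sum(np.log(theta) - theta * x)

x = np.array([0.5, 1.2, 0.3, 2.1])     # toy data
print(log_likelihood(1.0, x), log_likelihood(0.5, x))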
Summarize information about θ: find a minimal sufficient statistic t(x). From the Factorization Theorem: T = t(X) is sufficient for
θ if and only if there exist functions g(t, θ) and h(x) such that for
all x and θ
f (x, θ) = g(t(x), θ)h(x).
Moreover, T = t(X) is minimal sufficient when it holds that
f(x, θ)/f(y, θ) is constant in θ ⇐⇒ t(x) = t(y)
Example
Let X1, . . . , Xn be a random sample from the truncated exponential
distribution, where
fXi(xi) = e^{θ−xi},  xi > θ,
or, using the indicator function notation,
fXi(xi) = e^{θ−xi} I(θ,∞)(xi).
Show that Y1 = min(Xi) is sufficient for θ.
Let T = T (X1, . . . , Xn) = Y1. We need the pdf fT (t) of the
smallest order statistic. Calculate the cdf for Xi,
F(x) = ∫_θ^x e^{θ−z} dz = e^θ [e^{−θ} − e^{−x}] = 1 − e^{θ−x}.
Now
P(T > t) = ∏_{i=1}^n (1 − F(t)) = (1 − F(t))^n.
Differentiating gives that fT(t) equals
n[1 − F(t)]^{n−1} f(t) = n e^{(θ−t)(n−1)} × e^{θ−t} = n e^{n(θ−t)},  t > θ.
So the conditional density of X1, . . . , Xn given T = t is
e^{θ−x1} e^{θ−x2} · · · e^{θ−xn} / (n e^{n(θ−t)}) = e^{−∑ xi} / (n e^{−nt}),  xi ≥ t, i = 1, . . . , n,
which does not depend on θ for each fixed t = min(xi). Note that
since xi ≥ t, i = 1, . . . , n, neither the expression nor the range space
depends on θ, so the first order statistic, X(1), is a sufficient statistic
for θ.
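A quick Monte Carlo check of this density (an added sketch, not from the notes; the values of θ and n are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 5, 200_000
x = theta + rng.exponential(scale=1.0, size=(reps, n))   # X_i has density e^{theta - x}, x > theta
t = x.min(axis=1)                                        # T = X_(1)

# compare the exact tail P(T > s) = exp(n*(theta - s)) with its empirical counterpart
for s in [2.1, 2.3, 2.5]:
    print(s, (t > s).mean(), np.exp(n * (theta - s)))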
Alternatively, use Factorization Theorem:
fX(x) = ∏_{i=1}^n e^{θ−xi} I(θ,∞)(xi)
= I(θ,∞)(x(1)) ∏_{i=1}^n e^{θ−xi}
= e^{nθ} I(θ,∞)(x(1)) e^{−∑_{i=1}^n xi}
with
g(t, θ) = e^{nθ} I(θ,∞)(t)
and
h(x) = e^{−∑_{i=1}^n xi}.
2. Point Estimation
Estimate θ by a function t(x1, . . . , xn) of the data; often by maximum likelihood:
θ̂ = arg max_θ L(θ),
or by the method of moments.
Neither is unbiased in general, but the m.l.e. is asymptotically
unbiased and asymptotically efficient; under some regularity assumptions,
θ̂ ≈ N(θ, In(θ)^{−1}),
where
In(θ) = E[(∂ℓ(θ, X)/∂θ)²]
is the Fisher information (matrix)
Under more regularity,
In(θ) = −E[∂²ℓ(θ, X)/∂θ²]
If x is a random sample, then In(θ) = nI1(θ); often abbreviate
I1(θ) = I(θ)
The m.l.e. is a function of a sufficient statistic; recall that, for
scalar θ, there is a nice theory using the Cramer-Rao lower bound
and the Rao-Blackwell theorem on how to obtain minimum variance
unbiased estimators based on a sufficient statistic and an unbiased
estimator
Nice invariance property: The m.l.e. of a function φ(θ) is φ(θ̂)
Example
Suppose X1, X2, . . . , Xn random sample from Gamma distribution
with density
f(x; c, β) = x^{c−1} e^{−x/β} / (Γ(c) β^c),  x > 0,
where c > 0, β > 0; θ = (c, β);
ℓ(θ) = −n log Γ(c) − nc log β + (c − 1) log(∏ xi) − (1/β) ∑ xi
Put D1(c) = (∂/∂c) log Γ(c); then
∂ℓ/∂β = −nc/β + (1/β²) ∑ xi
∂ℓ/∂c = −nD1(c) − n log β + log(∏ xi)
Setting equal to zero yields
β̂ = x̄/ĉ,
where ĉ solves
D1(ĉ) − log(ĉ) = log([∏ xi]^{1/n}/x̄)
(Could calculate that the sufficient statistic is indeed (∑ xi, ∏ xi), and
is minimal sufficient)
Calculate the Fisher information: Put D2(c) = (∂²/∂c²) log Γ(c); then
∂²ℓ/∂β² = nc/β² − (2/β³) ∑ xi
∂²ℓ/∂β∂c = −n/β
∂²ℓ/∂c² = −nD2(c)
Use that EXi = cβ to obtain
I(θ) = n ( c/β²    1/β
           1/β     D2(c) )
Recall also that there is an iterative method to compute m.l.e.s,
related to the Newton-Raphson method.
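For instance, the equation D1(ĉ) − log(ĉ) = log([∏ xi]^{1/n}/x̄) above has no closed-form solution; the sketch below (an added illustration with made-up data) uses scipy's digamma for D1 and Brent's root-finding method rather than Newton-Raphson.

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

x = np.array([1.2, 0.7, 3.4, 2.2, 1.9, 0.4, 2.8])        # toy data, x_i > 0
rhs = np.mean(np.log(x)) - np.log(np.mean(x))            # log([prod x_i]^{1/n} / xbar), <= 0

# D1(c) - log(c) is increasing in c, so the root is bracketed by a wide interval
c_hat = brentq(lambda c: digamma(c) - np.log(c) - rhs, 1e-6, 1e6)
beta_hat = np.mean(x) / c_hat                            # beta_hat = xbar / c_hat
print(c_hat, beta_hat)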
3. Hypothesis Testing
For a simple null hypothesis and a simple alternative, the Neyman-Pearson Lemma says that the most powerful tests are likelihood-ratio
tests. These can sometimes be generalized to one-sided alternatives
in such a way as to yield uniformly most powerful tests.
Example
Suppose as above that X1, . . . , Xn is a random sample from the
truncated exponential distribution, where
fXi(xi; θ) = e^{θ−xi},  xi > θ.
Find a UMP test of size α for testing H0 : θ = θ0 against H1 : θ > θ0:
Let θ1 > θ0; then the likelihood ratio is
f(x, θ1)/f(x, θ0) = e^{n(θ1−θ0)} I(θ1,∞)(x(1)),
which increases with x(1), so we reject H0 if the smallest observation,
T = X(1), is large. Under H0, we have calculated that the pdf of T is
fT(y1) = n e^{n(θ0−y1)},  y1 > θ0.
So for t > θ0,
P(T > t) = ∫_t^∞ n e^{n(θ0−s)} ds = e^{n(θ0−t)}.
Choosing t such that this probability equals α gives
t = θ0 − (ln α)/n.
The LR test rejects H0 if X(1) > θ0 − (ln α)/n. The test is the same for
all θ1 > θ0, so it is UMP.
For general null hypothesis and general alternative:
H0 : θ ∈ Θ0
H1 : θ ∈ Θ1 = Θ \ Θ0
The (generalized) LR test uses the likelihood ratio statistic
T = max_{θ∈Θ} L(θ; X) / max_{θ∈Θ0} L(θ; X)
and rejects H0 for large values of T; use chisquare asymptotics 2 log T ≈
χ²_p, where p = dim Θ − dim Θ0 (nested models only); or use score
tests, which are based on the score function ∂ℓ/∂θ, with asymptotics
in terms of the normal distribution N(0, I(θ))
An important example is Pearson’s Chisquare test, which we derived as a score test, and we saw that it is asymptotically equivalent
to the generalized likelihood ratio test
Example
Let X1, . . . , Xn be a random sample from a geometric distribution
with parameter p;
P(Xi = k) = p(1 − p)^k,
k = 0, 1, 2, . . . .
Then
L(p) = p^n (1 − p)^{∑ xi}
and
ℓ(p) = n ln p + n x̄ ln(1 − p),
so that
∂ℓ/∂p = n (1/p − x̄/(1 − p))
and
∂²ℓ/∂p² = n (−1/p² − x̄/(1 − p)²).
We know that EX1 = (1 − p)/p. Calculate the information
I(p) = n / (p²(1 − p)).
Suppose n = 20, x̄ = 3, H0 : p = 0.15 and H1 : p ≠ 0.15. Then
∂ℓ/∂p(0.15) = 62.7 and I(0.15) = 1045.8. Test statistic
Z = 62.7/√1045.8 = 1.9388.
Compare to 1.96 for a test at level α = 0.05: do not reject H0.
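The numbers can be reproduced with a few lines (a sketch, not part of the notes):

import numpy as np

n, xbar, p0 = 20, 3.0, 0.15
score = n * (1 / p0 - xbar / (1 - p0))     # dl/dp at p0, approx 62.7
info = n / (p0**2 * (1 - p0))              # I(p0), approx 1045.8
print(score, info, score / np.sqrt(info))  # Z approx 1.94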
Example continued.
For a generalized LRT the test statistic is based on 2 log[L(θ̂)/L(θ0)] ≈
χ²_1, and is thus
2(ℓ(θ̂) − ℓ(θ0)) = 2n(ln θ̂ − ln θ0 + x̄(ln(1 − θ̂) − ln(1 − θ0))).
We calculate that the m.l.e. is
θ̂ = 1/(1 + x̄).
Suppose again n = 20, x̄ = 3, H0 : θ = 0.15 and H1 : θ ≠ 0.15.
The m.l.e. is θ̂ = 0.25, and ℓ(0.25) = −44.98; ℓ(0.15) = −47.69.
Calculate
χ² = 2(47.69 − 44.98) = 5.4
and compare to chisquare distribution with 1 degree of freedom: 3.84
at 5 percent level, so reject H0.
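Again, a short sketch (not from the notes) reproduces these numbers:

import numpy as np

n, xbar, theta0 = 20, 3.0, 0.15
theta_hat = 1 / (1 + xbar)                              # m.l.e. = 0.25

def loglik(theta):
    return n * np.log(theta) + n * xbar * np.log(1 - theta)

print(loglik(theta_hat), loglik(theta0))                # approx -44.98, -47.69
print(2 * (loglik(theta_hat) - loglik(theta0)))         # approx 5.4, compare to 3.84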
4. Confidence Regions
If we can find a pivot, a function t(X, θ) of a sufficient statistic
whose distribution does not depend on θ, then we can find confidence
regions in a straightforward manner.
Otherwise we may have to resort to approximate confidence regions, for example using the approximate normality of the m.l.e.
Recall that confidence intervals are equivalent to hypothesis tests
with simple null hypothesis and one- or two-sided alternatives
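As an added illustration (not in the original notes), an approximate 95% confidence interval based on the normality of the m.l.e., θ̂ ± 1.96 I(θ̂)^{−1/2}, for the geometric example above:

import numpy as np

n, xbar = 20, 3.0
theta_hat = 1 / (1 + xbar)                       # m.l.e. from the geometric example
info = n / (theta_hat**2 * (1 - theta_hat))      # I(p) = n / (p^2 (1 - p)) at the m.l.e.
half_width = 1.96 / np.sqrt(info)
print(theta_hat - half_width, theta_hat + half_width)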
Not to forget about:
Profile likelihood
Often θ = (ψ, λ) where ψ contains the parameters of interest;
then base inference on profile likelihood for ψ,
LP (ψ) = L(ψ, λ̂ψ ).
Again we can use (generalized) LRT or score test; if ψ scalar, H0 :
ψ = ψ0, H1+ : ψ > ψ0, H1− : ψ < ψ0, use test statistic
T = ∂ℓ(ψ0, λ̂0; X)/∂ψ,
where λ̂0 is the MLE for λ when H0 true
Large positive values of T indicate H1+
Large negative values indicate H1−
T ≈ ℓ′_ψ − I_{ψ,λ} I_{λ,λ}^{−1} ℓ′_λ ≈ N(0, 1/I^{ψ,ψ}),
where I^{ψ,ψ} = (I_{ψ,ψ} − I_{ψ,λ}² I_{λ,λ}^{−1})^{−1} is the top left element of I^{−1}
Estimate by substituting the null hypothesis values; calculate the
practical standardized form of T as
Z = T/√Var(T) ≈ ℓ′_ψ(ψ, λ̂ψ) [I^{ψ,ψ}(ψ, λ̂ψ)]^{1/2} ≈ N(0, 1)
Bias and variance approximations: the delta method
If we cannot calculate the mean and variance directly: suppose T =
g(S) where ES = β and Var S = V. A Taylor expansion gives
T = g(S) ≈ g(β) + (S − β)g′(β).
Taking the mean and variance of the r.h.s.:
ET ≈ g(β),  Var T ≈ [g′(β)]² V.
Also works for vectors S, β, with T still a scalar:
g′(β)_i = ∂g/∂βi and g′′(β) the matrix of second derivatives,
Var T ≈ [g′(β)]^T V g′(β)
and
ET ≈ g(β) + (1/2) trace[g′′(β)V].
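An added sketch (not from the notes) comparing the delta-method approximation with simulation, for T = log(S) where S is the mean of a normal sample; all numbers are illustrative:

import numpy as np

rng = np.random.default_rng(2)
n, mu, sigma2 = 50, 2.0, 1.5
beta, V = mu, sigma2 / n                                 # E S and Var S for the sample mean

# g(s) = log(s), g'(s) = 1/s, so ET ~ g(beta) and Var T ~ [g'(beta)]^2 V
approx_mean, approx_var = np.log(beta), V / beta**2

S = rng.normal(mu, np.sqrt(sigma2), size=(200_000, n)).mean(axis=1)
T = np.log(S)
print(approx_mean, T.mean(), approx_var, T.var())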
Exponential family
For distributions in the exponential family, many calculations have
been standardized, see Notes.
A Very Brief Summary of Bayesian Inference, and Examples
Data x = x1, x2, . . . , xn, realisations of random variables X1, X2, . . . , Xn
with distribution (model) f (x1, x2, . . . , xn|θ)
Bayesian: θ ∈ Θ random; prior π(θ)
1. Priors, Posteriors, Likelihood, and Sufficiency
Posterior distribution of θ given x is
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ
Shorter: posterior ∝ prior × likelihood
The (prior) predictive distribution of the data x on the basis of π is
p(x) = ∫ f(x|θ)π(θ) dθ
Suppose data x1 is available, want to predict additional data:
p(x2|x1) = ∫ f(x2|θ)π(θ|x1) dθ
Note that x2 and x1 are assumed conditionally independent given θ
They are not, in general, unconditionally independent.
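As an added illustration (not part of the notes), "posterior ∝ prior × likelihood" and the posterior predictive can be computed numerically on a grid; the Bernoulli model and uniform prior below are just an example.

import numpy as np

theta = np.linspace(0.001, 0.999, 999)           # grid over Theta
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                      # uniform prior
x1 = np.array([1, 0, 1, 1, 0, 1])                # observed Bernoulli data

lik = theta**x1.sum() * (1 - theta)**(x1.size - x1.sum())
post = prior * lik
post /= post.sum() * dtheta                      # normalize by int f(x|theta) pi(theta) dtheta

# posterior predictive p(x2 = 1 | x1) = int f(x2 = 1 | theta) pi(theta | x1) dtheta
print((theta * post).sum() * dtheta)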
Example Suppose y1, y2, . . . , yn are independent normally distributed random variables, each with variance 1 and with means
βx1, . . . , βxn, where β is an unknown real-valued parameter and
x1, x2, . . . , xn are known constants. Suppose prior π(β) = N(µ, α²).
Then the likelihood is
L(β) = (2π)^{−n/2} ∏_{i=1}^n exp(−(yi − βxi)²/2)
and
π(β) = (2πα²)^{−1/2} exp(−(β − µ)²/(2α²))
∝ exp(−β²/(2α²) + βµ/α²)
and the posterior of β given y1, y2, . . . , yn can be calculated as
π(β|y) ∝ exp( −(1/2) ∑(yi − βxi)² − (1/(2α²))(β − µ)² )
∝ exp( β ∑ xiyi − (1/2) β² ∑ xi² − β²/(2α²) + βµ/α² ).
Abbreviate sxx = ∑ xi², sxy = ∑ xiyi; then
π(β|y) ∝ exp( β sxy − (1/2) β² sxx − β²/(2α²) + βµ/α² )
= exp( −(β²/2)(sxx + 1/α²) + β(sxy + µ/α²) ),
which we recognize as
N( (sxy + µ/α²)/(sxx + 1/α²), (sxx + 1/α²)^{−1} ).
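A tiny numerical check of this posterior (an added sketch with made-up x, µ and α²):

import numpy as np

rng = np.random.default_rng(3)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
beta_true, mu, alpha2 = 1.3, 0.0, 4.0
y = beta_true * x + rng.normal(size=x.size)      # y_i ~ N(beta * x_i, 1)

sxx, sxy = np.sum(x**2), np.sum(x * y)
post_var = 1 / (sxx + 1 / alpha2)
post_mean = (sxy + mu / alpha2) * post_var
print(post_mean, post_var)                       # posterior mean and variance of beta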
Summarize information about θ: find a minimal sufficient statistic t(x); T = t(X) is sufficient for θ if and only if for all x and θ
π(θ|x) = π(θ|t(x))
As in frequentist approach; posterior (inference) based on sufficient
statistic
Choice of prior
There are many ways of choosing a prior distribution, for example
using a coherent belief system. Often it is convenient to restrict the
class of priors to a particular family of distributions. When the
posterior is in the same family of models as the prior, i.e. when one
has a conjugate prior, then updating the distribution under new data
is particularly convenient.
For the regular k-parameter exponential family,
f(x|θ) = f(x) g(θ) exp{ ∑_{i=1}^k ci φi(θ) hi(x) },  x ∈ X,
conjugate priors can be derived in a straightforward manner, using
the sufficient statistics.
Non-informative priors favour no particular values of the parameter over others. If Θ is finite, choose uniform prior; if Θ is infinite,
there are several ways (some may be improper)
For a location density f(x|θ) = f(x − θ), the non-informative
location-invariant prior is π(θ) = π(0) constant; usually choose
π(θ) = 1 for all θ (improper prior)
For a scale density f(x|σ) = (1/σ) f(x/σ) for σ > 0, the scale-invariant
non-informative prior is π(σ) ∝ 1/σ; usually choose π(σ) = 1/σ (improper)
Jeffreys prior: π(θ) ∝ I(θ)^{1/2} if the information I(θ) exists; it is
invariant under reparametrization; may or may not be improper
Example continued. Suppose y1, y2, . . . , yn are independent,
yi ∼ N (βxi, 1), where β is an unknown parameter and x1, x2, . . . , xn
are known. Then
I(β) = −E[∂² log L(β, y)/∂β²] = −E[ ∂/∂β ∑(yi − βxi)xi ] = sxx
and so the Jeffreys prior is ∝ √sxx; constant and improper. With
this Jeffreys prior, the calculation of the posterior distribution is
equivalent to putting α² = ∞ in the previous calculation, yielding
π(β|y) is N( sxy/sxx, (sxx)^{−1} ).
If Θ is discrete, with constraints
Eπ gk(θ) = µk,  k = 1, . . . , m,
choose the distribution with the maximum entropy under these constraints:
π̃(θi) = exp( ∑_{k=1}^m λk gk(θi) ) / ∑_i exp( ∑_{k=1}^m λk gk(θi) ),
where the λi are determined by the constraints. There is a similar
formula for the continuous case, maximizing the entropy relative to
a particular reference distribution π0 under constraints. For π0 one
would choose the “natural” invariant noninformative prior.
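For the discrete case the λk can be found numerically; a sketch (not from the notes) with Θ = {0, . . . , 5} and a single mean constraint:

import numpy as np
from scipy.optimize import brentq

theta = np.arange(0, 6)                          # Theta = {0, 1, ..., 5}
g, mu = theta, 2.0                               # one constraint: E_pi g(theta) = 2

def constraint_gap(lam):
    w = np.exp(lam * g)
    return (g * w).sum() / w.sum() - mu          # E_pi g(theta) under lambda, minus mu

lam = brentq(constraint_gap, -10, 10)            # the gap is increasing in lambda
pi = np.exp(lam * g)
pi /= pi.sum()
print(lam, pi, (g * pi).sum())                   # the last value should equal mu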
For inference, check the influence of the choice of prior, for example
by using an ε-contamination class of priors.
2. Point Estimation
Under suitable regularity conditions and random sampling, when n is
large the posterior is approximately N(θ̂, (nI1(θ̂))^{−1}), where θ̂
is the m.l.e., provided that the prior is non-zero in a region surrounding θ̂. In particular, if θ0 is the true parameter, then the posterior
will become more and more concentrated around θ0.
Often reporting the posterior distribution is preferred to point
estimation.
Bayes estimators are constructed to minimize the posterior expected loss. Recall from Decision Theory: Θ is the set of all possible
states of nature (values of parameter), D is the set of all possible decisions (actions); a loss function is any function
L : Θ × D → [0, ∞)
For point estimation: D = Θ, L(θ, d) loss in reporting d when θ is
true.
Bayesian: For a prior π and data x ∈ X , the posterior expected
loss of a decision is
ρ(π, d|x) = ∫_Θ L(θ, d) π(θ|x) dθ
For a prior π the integrated risk of a decision rule δ is
r(π, δ) = ∫_Θ ∫_X L(θ, δ(x)) f(x|θ) dx π(θ) dθ
An estimator minimizing r(π, δ) can be obtained by selecting, for
every x ∈ X , the value δ(x) that minimizes ρ(π, δ|x). A Bayes
estimator associated with prior π, loss L, is any estimator δ π which
minimizes r(π, δ). Then r(π) = r(π, δ π ) is the Bayes risk.
Valid for proper priors, and for improper priors if r(π) < ∞. If
r(π) = ∞, define a generalized Bayes estimator as the minimizer,
for every x, of ρ(π, d|x)
For strictly convex loss functions, Bayes estimators are unique.
For squared error loss L(θ, d) = (θ − d)2, the Bayes estimator δ π
associated with prior π is the posterior mean. For absolute error loss
L(θ, d) = |θ − d|, the posterior median is a Bayes estimator.
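For instance (an added sketch, with an arbitrary Gamma posterior), the two Bayes estimates are simply the posterior mean and median:

from scipy.stats import gamma

posterior = gamma(a=7.0, scale=1 / 5.0)          # e.g. a Gamma(7, 5) posterior (shape, rate)
print(posterior.mean())                          # Bayes estimate under squared error loss
print(posterior.median())                        # Bayes estimate under absolute error loss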
Example. Suppose that x1, . . . , xn is a random sample from a
Poisson distribution with unknown mean θ. Two models for the prior
distribution of θ are contemplated;
π1(θ) = e^{−θ},  θ > 0,  and  π2(θ) = θ e^{−θ},  θ > 0.
Then calculate the Bayes estimator of θ under model 1, with quadratic
loss, using
π1(θ|x) ∝ e^{−nθ} θ^{∑ xi} e^{−θ} = e^{−(n+1)θ} θ^{∑ xi},
which we recognize as Gamma(∑ xi + 1, n + 1). The Bayes estimator
is the expected value of the posterior distribution,
(∑ xi + 1)/(n + 1).
For model 2,
π2(θ|x) ∝ e^{−nθ} θ^{∑ xi} e^{−θ} θ = e^{−(n+1)θ} θ^{∑ xi + 1},
which we recognize as Gamma(∑ xi + 2, n + 1). The Bayes estimator
is the expected value of the posterior distribution,
(∑ xi + 2)/(n + 1).
Note that the first model has greater weight for smaller values of θ,
so the posterior distribution is shifted to the left.
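A quick check with made-up counts (added sketch, not from the notes):

import numpy as np

x = np.array([2, 0, 3, 1, 2])                    # toy Poisson data
n, s = x.size, x.sum()
# model 1: Gamma(s + 1, n + 1) posterior; model 2: Gamma(s + 2, n + 1) posterior
print((s + 1) / (n + 1), (s + 2) / (n + 1))      # the two Bayes estimators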
Recall that we have seen more in Decision Theory: least favourable
distributions.
3. Hypothesis Testing
H0 : θ ∈ Θ0,
D = {accept H0, reject H0} = {1, 0},
where 1 stands for acceptance; loss function
L(θ, φ) = 0   if θ ∈ Θ0, φ = 1
          a0  if θ ∈ Θ0, φ = 0
          0   if θ ∉ Θ0, φ = 0
          a1  if θ ∉ Θ0, φ = 1
Under this loss function, the Bayes decision rule associated with a
prior distribution π is
φ^π(x) = 1 if P^π(θ ∈ Θ0|x) > a1/(a0 + a1),
         0 otherwise
The Bayes factor for testing H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 is
B^π(x) = [P^π(θ ∈ Θ0|x)/P^π(θ ∈ Θ1|x)] / [P^π(θ ∈ Θ0)/P^π(θ ∈ Θ1)]
= p(x|θ ∈ Θ0)/p(x|θ ∈ Θ1);
this is the ratio of how likely the data is under H0 and how likely the data
is under H1; it measures the extent to which the data x will change the
odds of Θ0 relative to Θ1
Note that we have to make sure that our prior distribution puts
mass on H0 (and on H1). If H0 is simple, this is usually achieved by
choosing a prior that has some point mass on H0 and otherwise lives
on H1.
Example revisited. Suppose that x1, . . . , xn is a random sample from a Poisson distribution with unknown mean θ; possible models for the prior distribution are
π1(θ) = e^{−θ},  θ > 0,  and  π2(θ) = θ e^{−θ},  θ > 0.
The prior probabilities of model 1 and model 2 are assessed at probability 1/2 each. Then the Bayes factor is
B(x) = ∫_0^∞ e^{−nθ} θ^{∑ xi} e^{−θ} dθ / ∫_0^∞ e^{−nθ} θ^{∑ xi} e^{−θ} θ dθ
= [Γ(∑ xi + 1)/(n + 1)^{∑ xi + 1}] / [Γ(∑ xi + 2)/(n + 1)^{∑ xi + 2}]
= (n + 1)/(∑ xi + 1).
Note that in this setting
B(x) = P(model 1|x)/P(model 2|x) = P(model 1|x)/(1 − P(model 1|x)),
so that
P(model 1|x) = (1 + B(x)^{−1})^{−1}.
Hence
P(model 1|x) = (1 + (∑ xi + 1)/(n + 1))^{−1} = (1 + (x̄ + 1/n)/(1 + 1/n))^{−1},
which is decreasing for x̄ increasing.
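Continuing the sketch with the same toy counts as above (added, not from the notes):

import numpy as np

x = np.array([2, 0, 3, 1, 2])
n, s = x.size, x.sum()
B = (n + 1) / (s + 1)                            # Bayes factor B(x)
print(B, 1 / (1 + 1 / B))                        # B(x) and P(model 1 | x)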
For robustness: least favourable Bayesian answers. Suppose H0 :
θ = θ0, H1 : θ ≠ θ0, and the prior probability on H0 is ρ0 = 1/2. For G
a family of priors on H1, put
B(x, G) = inf_{g∈G} f(x|θ0) / ∫_Θ f(x|θ)g(θ) dθ.
Any prior g ∈ G on H1 will then give posterior probability
at least P(x, G) = (1 + 1/B(x, G))^{−1} on H0 (for ρ0 = 1/2). If θ̂ is the
m.l.e. of θ, and GA is the set of all prior distributions, then
B(x, GA) = f(x|θ0)/f(x|θ̂(x))
and
P(x, GA) = (1 + f(x|θ̂(x))/f(x|θ0))^{−1}.
4. Credible intervals
A (1 − α) (posterior) credible interval (region) is an interval
(region) of θ−values within which 1 − α of the posterior probability
lies
Often we would like to find an HPD (highest posterior density) region:
a (1 − α) credible region that has minimal volume.
When the posterior density is unimodal, this is often straightforward.
Example continued. Suppose y1, y2, . . . , yn are independent
normally distributed random variables, each with variance 1 and with
means βx1, . . . , βxn, where β is an unknown real-valued parameter and x1, x2, . . . , xn are known constants; Jeffreys prior, posterior
N( sxy/sxx, (sxx)^{−1} ),
then a 95%-credible interval for β is
sxy/sxx ± 1.96 √(1/sxx).
Bayesian interpretation: conditional on the observed x; the randomness relates to the distribution of θ.
In contrast, a frequentist confidence interval applies before x is observed; the randomness relates to the distribution of x.
5. Not to forget about: Nuisance parameters
θ = (ψ, λ), where λ nuisance parameter
π(θ|x) = π((ψ, λ)|x)
marginal posterior of ψ:
π(ψ|x) = ∫ π(ψ, λ|x) dλ
Just integrate out the nuisance parameter
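An added sketch (not from the notes): integrating out a nuisance parameter numerically on a grid, for normal data with mean ψ and standard deviation λ and flat priors; everything here is illustrative.

import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 2.0, size=30)

psi = np.linspace(-2, 4, 301)                    # parameter of interest
lam = np.linspace(0.2, 6, 301)                   # nuisance parameter (standard deviation)
P, L = np.meshgrid(psi, lam, indexing="ij")

loglik = -y.size * np.log(L) - ((y[None, None, :] - P[..., None])**2).sum(-1) / (2 * L**2)
joint = np.exp(loglik - loglik.max())            # joint posterior, up to a constant
marg = joint.sum(axis=1)                         # integrate out lambda (grid sum)
marg /= marg.sum() * (psi[1] - psi[0])
print(psi[np.argmax(marg)])                      # approximate marginal posterior mode of psi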