Vladimir Spokoiny, Thorsten Dickhaus
Basics of Modern Mathematical
Statistics ∗
– Textbook –
April 26, 2014
Springer
Berlin Heidelberg New York
Hong Kong London
Milan Paris Tokyo
∗ Many people made useful and extremely helpful suggestions which allowed us to
improve the composition and presentation and to clean up mistakes and
typos. The authors especially thank Wolfgang Härdle and Vladimir Panov, who
assisted us on the long way of writing this book.
To my children Seva, Irina, Daniel, Alexander, Michael, and
Maria
To my family
Preface
Preface of the first author
This book was written on the basis of a graduate course on mathematical
statistics given at the mathematical faculty of the Humboldt University Berlin.
The classical theory of parametric estimation, since the seminal works by
Fisher, Wald and Le Cam, among many others, has now reached maturity
and an elegant form. It can be considered as more or less complete, at least
for the so-called “regular case”. The question of the optimality and efficiency
of the classical methods has been rigorously studied and typical results state
the asymptotic normality and efficiency of the maximum likelihood and/or
Bayes estimates; see an excellent monograph by Ibragimov and Khas’minskij
(1981) for a comprehensive study.
In the time around 1984, when I started my own PhD at the Lomonosov
University, a popular joke in our statistical community in Moscow was that
all problems of parametric statistical theory had been solved and described in a
complete way in Ibragimov and Khas’minskij (1981), and that there was nothing
left to do for mathematical statisticians; at most, a few nonparametric problems
remained open. After finishing my PhD I also moved to nonparametric statistics
for a while, with a focus on local adaptive estimation. In the year 2005 I started
to write a monograph on nonparametric estimation using local parametric methods
which was supposed to systematize my previous experience in this area. The very
first draft of this book was available already in the autumn of 2005, and it only
included a few sections about the basics of parametric estimation. However, attempts
to prepare a more systematic and more general presentation of the nonparametric
theory led me back to the very basic parametric concepts. In 2007 I significantly
extended the part about parametric methods. In the spring of 2009 I taught a graduate course on
parametric statistics at the mathematical faculty of the Humboldt University
Berlin. My intention was to present a “modern” version of the classical theory
which in particular addresses the following questions:
“what do you need to know from parametric statistics to work on
modern parametric and nonparametric methods?”
“how to identify the borderline between the classical parametric and
the modern nonparametric statistics?”
The basic assumptions of the classical parametric theory are that the parametric
specification is exact and that the sample size is large relative to the dimension
of the parameter space. Unfortunately, this viewpoint limits the applicability of
the classical theory: it is usually unrealistic to assume that the parametric
specification is fulfilled exactly. So, the modern version of the parametric theory
has to include a possible model misspecification. The issue of large samples is
even more critical. Many modern applications face a situation where the number of
parameters p is not only comparable with the sample size n but can even be much
larger than n . It is probably the main challenge of the modern parametric theory
to include in a rigorous way the case of “large p , small n ”. One can say that a
parametric theory that is able to systematically treat the issues of model
misspecification and of small or fixed samples already includes nonparametric
statistics. The present study aims at reconsidering the basics of the parametric
theory in this sense. The “modern parametric” view can be summarized as follows:
- any model is parametric;
- any parametric model is wrong;
- even a wrong model can be useful.
The model mentioned in the first item can be understood as a set of assumptions describing the unknown distribution of the underlying data. This
description is usually given in terms of some parameters. The parameter space
can be large or even infinite-dimensional; however, the model is uniquely specified
by the parameter value. In this sense “any model is parametric”.
The second statement “any parametric model is wrong” means that any
imaginary model is only an idealization (approximation) of reality. It is unrealistic to assume that the data exactly follow the parametric model, even
if this model is flexible and involves a lot of parameters. Model misspecification naturally leads to the notion of the modeling bias measuring the distance
between the underlying model and the selected parametric family. It also separates the parametric and the nonparametric viewpoints. The parametric approach
focuses on “estimation within the model” ignoring the modeling bias. The
nonparametric approach attempts to account for the modeling bias and to
optimize the joint impact of two kinds of errors: estimation error within the
model and the modeling bias. This volume is limited to parametric estimation
and testing for some special models like exponential families or linear models.
However, it prepares some important tools for doing the general parametric
theory presented in the second volume.
The last statement “even a wrong model can be useful” introduces the
notion of a “useful” parametric specification. In some sense it indicates a
change of paradigm in parametric statistics. Trying to find the true
model is hopeless anyway. Instead, one aims at choosing a potentially wrong
parametric model which, however, possesses some useful properties. Among
others, one can point out the following “useful” features:
- a nice geometric structure of the likelihood leading to a numerically
efficient estimation procedure;
- parameter identifiability.
Lack of identifiability is just an indication that the considered parametric
model is poorly selected. A proper parametrization should involve a reasonable
regularization ensuring both features: numerical efficiency/stability and proper
parameter identification. The present volume discusses some examples of “useful”
models like linear models and exponential families.
The second volume will extend such models to a quite general regular case involving some smoothness and moment conditions on the log-likelihood process
of the considered parametric family.
This book does not pretend to systematically cover the scope of the classical parametric theory. Some very important and even fundamental issues
are not considered at all in this book. One characteristic example is given
by the notion of sufficiency, which can hardly be combined with model misspecification. At the same time, much more attention is paid to the questions
of nonasymptotic inference under model misspecification, including concentration and confidence sets as functions of the sample size and the dimensionality
of the parameter space. In the first volume we especially focus on linear models. This can be explained by their role for the general theory in which a linear
model naturally arises from local approximation of a general regular model.
This volume can be used as a textbook for a graduate course in mathematical statistics. It assumes that the reader is familiar with the basic notions
of probability theory including the Lebesgue measure, the Radon-Nikodym
derivative, etc. Knowledge of basic statistics is not required. I tried to be as
self-contained as possible; most of the presented results are proved in a
rigorous way. Sometimes the details are left to the reader as exercises; in those
cases some hints are given.
Preface of the second author
It was in early 2012 when Prof. Spokoiny approached me with the idea of a
joint lecture on Mathematical Statistics at Humboldt-University Berlin, where
I was a junior professor at that time. Up to then, my own education in statistical inference had been based on the German textbooks by Witting (1985)
and Witting and Müller-Funk (1995), and for teaching in English I had always used the books by Lehmann and Casella (1998), Lehmann and Romano
(2005), and Lehmann (1999). However, I was aware of Prof. Spokoiny’s own
textbook project and so the question was which text to use as the basis for
the lecture. Finally, the first part of the lecture (estimation theory) was given
by Prof. Spokoiny based on the by then already substantiated Chapters 1–5 of the present work, while I gave the second part on test theory based on
my own teaching material which was mainly based on Lehmann and Romano
(2005).
This joint teaching activity turned out to be the starting point of a collaboration between Prof. Spokoiny and myself, and I was invited to join him
as a co-author of the present work for the Chapters 6–8 on test theory,
matching my own research interests. By the summer term of 2013, the book
manuscript had substantially been extended, and I used it as the sole basis for
the Mathematical Statistics lecture. During the course of this 2013 lecture,
I received many constructive comments and suggestions from students and
teaching assistants, which led to a further improvement of the text.
Contents
1 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Example of a Bernoulli experiment . . . . . . . . . . . . 13
1.2 Least squares estimation in a linear model . . . . . . . . . . . . 16
1.3 General parametric model . . . . . . . . . . . . 19
1.4 Statistical decision problem . . . . . . . . . . . . 20
1.5 Efficiency . . . . . . . . . . . . 21
2 Parameter estimation for an i.i.d. model . . . . . . . . . . . . 23
2.1 Empirical distribution. Glivenko-Cantelli Theorem . . . . . . . . . . . . 23
2.2 Substitution principle. Method of moments . . . . . . . . . . . . 28
2.2.1 Method of moments. Univariate parameter . . . . . . . . . . . . 28
2.2.2 Method of moments. Multivariate parameter . . . . . . . . . . . . 29
2.2.3 Method of moments. Examples . . . . . . . . . . . . 29
2.3 Unbiased estimates, bias, and quadratic risk . . . . . . . . . . . . 34
2.3.1 Univariate parameter . . . . . . . . . . . . 34
2.3.2 Multivariate case . . . . . . . . . . . . 35
2.4 Asymptotic properties . . . . . . . . . . . . 36
2.4.1 Root-n normality. Univariate parameter . . . . . . . . . . . . 36
2.4.2 Root-n normality. Multivariate parameter . . . . . . . . . . . . 39
2.5 Some geometric properties of a parametric family . . . . . . . . . . . . 41
2.5.1 Kullback-Leibler divergence . . . . . . . . . . . . 42
2.5.2 Hellinger distance . . . . . . . . . . . . 44
2.5.3 Regularity and the Fisher Information. Univariate parameter . . . . . . . . . . . . 45
2.6 Cramér–Rao Inequality . . . . . . . . . . . . 49
2.6.1 Univariate parameter . . . . . . . . . . . . 49
2.6.2 Exponential families and R-efficiency . . . . . . . . . . . . 51
2.7 Cramér–Rao inequality. Multivariate parameter . . . . . . . . . . . . 52
2.7.1 Regularity and Fisher Information. Multivariate parameter . . . . . . . . . . . . 53
2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger distance . . . . . . . . . . . . 54
2.7.3 Multivariate Cramér–Rao Inequality . . . . . . . . . . . . 55
2.7.4 Exponential families and R-efficiency . . . . . . . . . . . . 57
2.8 Maximum likelihood and other estimation methods . . . . . . . . . . . . 58
2.8.1 Minimum distance estimation . . . . . . . . . . . . 58
2.8.2 M-estimation and Maximum likelihood estimation . . . . . . . . . . . . 59
2.9 Maximum Likelihood for some parametric families . . . . . . . . . . . . 62
2.9.1 Gaussian shift . . . . . . . . . . . . 63
2.9.2 Variance estimation for the normal law . . . . . . . . . . . . 65
2.9.3 Univariate normal distribution . . . . . . . . . . . . 66
2.9.4 Uniform distribution on [0, θ] . . . . . . . . . . . . 66
2.9.5 Bernoulli or binomial model . . . . . . . . . . . . 66
2.9.6 Multinomial model . . . . . . . . . . . . 67
2.9.7 Exponential model . . . . . . . . . . . . 67
2.9.8 Poisson model . . . . . . . . . . . . 68
2.9.9 Shift of a Laplace (double exponential) law . . . . . . . . . . . . 68
2.10 Quasi Maximum Likelihood approach . . . . . . . . . . . . 69
2.10.1 LSE as quasi likelihood estimation . . . . . . . . . . . . 69
2.10.2 LAD and robust estimation as quasi likelihood estimation . . . . . . . . . . . . 71
2.11 Univariate exponential families . . . . . . . . . . . . 72
2.11.1 Natural parametrization . . . . . . . . . . . . 72
2.11.2 Canonical parametrization . . . . . . . . . . . . 75
2.11.3 Deviation probabilities for the maximum likelihood . . . . . . . . . . . . 78
2.12 Historical remarks and further reading . . . . . . . . . . . . 83
3 Regression Estimation . . . . . . . . . . . . 85
3.1 Regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.1.3 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.1.4 Regression function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.2 Method of substitution and M-estimation . . . . . . . . . . . . . . . . . . . 88
3.2.1 Mean regression. Least squares estimate . . . . . . . . . . . . . . 89
3.2.2 Median regression. Least absolute deviation estimate . . . 90
3.2.3 Maximum likelihood regression estimation . . . . . . . . . . . . 90
3.2.4 Quasi Maximum Likelihood approach . . . . . . . . . . . . . . . . 92
3.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.3.1 Projection estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.2 Polynomial approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.3.3 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.3.4 Chebyshev polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3.5 Legendre polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.3.6 Lagrange polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3.7 Hermite polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.3.8 Trigonometric series expansion . . . . . . . . . . . . . . . . . . . . . . 110
3.4 Piecewise methods and splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.4.1 Piecewise constant estimation . . . . . . . . . . . . . . . . . . . . . . . 111
3.4.2 Piecewise linear univariate estimation . . . . . . . . . . . . . . . . 113
3.4.3 Piecewise polynomial estimation . . . . . . . . . . . . . . . . . . . . . 115
3.4.4 Spline estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.5 Generalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.5.1 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.5.2 Logit regression for binary data . . . . . . . . . . . . . . . . . . . . . 122
3.5.3 Parametric Poisson regression . . . . . . . . . . . . . . . . . . . . . . . 123
3.5.4 Piecewise constant methods in generalized regression . . . 124
3.5.5 Smoothing splines for generalized regression . . . . . . . . . . 126
3.6 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 127
4 Estimation in linear models . . . . . . . . . . . . 129
4.1 Modeling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2 Quasi maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . 130
4.2.1 Estimation under the homogeneous noise assumption . . 132
4.2.2 Linear basis transformation . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.2.3 Orthogonal and orthonormal design . . . . . . . . . . . . . . . . . . 134
4.2.4 Spectral representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.3 Properties of the response estimate fe . . . . . . . . . . . . 136
4.3.1 Decomposition into a deterministic and a stochastic
component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.3.2 Properties of the operator Π . . . . . . . . . . . . . . . . . . . . . . . 137
4.3.3 Quadratic loss and risk of the response estimation . . . . . 139
4.3.4 Misspecified “colored noise” . . . . . . . . . . . . . . . . . . . . . . . . 140
4.4 Properties of the MLE θe . . . . . . . . . . . . 140
4.4.1 Properties of the stochastic component . . . . . . . . . . . . . . . 141
4.4.2 Properties of the deterministic component . . . . . . . . . . . . 142
4.4.3 Risk of estimation. R-efficiency . . . . . . . . . . . . . . . . . . . . . . 143
4.4.4 The case of a misspecified noise . . . . . . . . . . . . . . . . . . . . . 146
4.5 Linear models and quadratic log-likelihood . . . . . . . . . . . . . . . . . . 147
4.6 Inference based on the maximum likelihood . . . . . . . . . . . . . . . . . 149
4.6.1 A misspecified LPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.6.2 A misspecified noise structure . . . . . . . . . . . . . . . . . . . . . . . 153
4.7 Ridge regression, projection, and shrinkage . . . . . . . . . . . . . . . . . 154
4.7.1 Regularization and ridge regression . . . . . . . . . . . . . . . . . . 155
4.7.2 Penalized likelihood. Bias and variance . . . . . . . . . . . . . . . 156
4.7.3 Inference for the penalized MLE . . . . . . . . . . . . . . . . . . . . . 159
4.7.4 Projection and shrinkage estimates . . . . . . . . . . . . . . . . . . 160
4.7.5 Smoothness constraints and roughness penalty approach . . . . . . 163
4.8 Shrinkage in a linear inverse problem . . . . . . . . . . . . . . . . . . . . . . . 163
4.8.1 Spectral cut-off and spectral penalization. Diagonal
estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.8.2 Galerkin method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.9 Semiparametric estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.9.1 (θ, η) - and υ -setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.9.2 Orthogonality and product structure . . . . . . . . . . . . . . . . . 168
4.9.3 Partial estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.9.4 Profile estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.9.5 Semiparametric efficiency bound . . . . . . . . . . . . . . . . . . . . 175
4.9.6 Inference for the profile likelihood approach . . . . . . . . . . . 176
4.9.7 Plug-in method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.9.8 Two step procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.9.9 Alternating method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.10 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 181
5 Bayes estimation . . . . . . . . . . . . 183
5.1 Bayes formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.2 Conjugated priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
5.3 Linear Gaussian model and Gaussian priors . . . . . . . . . . . . . . . . . 187
5.3.1 Univariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3.2 Linear Gaussian model and Gaussian prior . . . . . . . . . . . 189
5.3.3 Homogeneous errors, orthogonal design . . . . . . . . . . . . . . . 192
5.4 Non-informative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.5 Bayes estimate and posterior mean . . . . . . . . . . . . . . . . . . . . . . . . 194
5.6 Posterior mean and ridge regression . . . . . . . . . . . . . . . . . . . . . . . . 196
5.7 Bayes and minimax risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
5.8 Van Trees inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
5.9 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 203
6 Testing a statistical hypothesis . . . . . . . . . . . . 205
6.1 Testing problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.1.1 Simple hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.1.2 Composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.1.3 Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.1.4 Errors of the first kind, test level . . . . . . . . . . . . . . . . . . . . 207
6.1.5 Randomized tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.1.6 Alternative hypotheses, error of the second kind,
power of a test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.2 Neyman-Pearson test for two simple hypotheses . . . . . . . . . . . . . 210
6.2.1 Neyman-Pearson test for an i.i.d. sample . . . . . . . . . . . . . 213
6.3 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.4 Likelihood ratio tests for parameters of a normal distribution . 216
6.4.1 Distributions related to an i.i.d. sample from a normal
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.4.2 Gaussian shift model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.4.3 Testing the mean when the variance is unknown . . . . . . . 220
6.4.4 Testing the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.5 LR-tests. Further examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.5.1 Bernoulli or binomial model . . . . . . . . . . . . . . . . . . . . . . . . 223
6.5.2 Uniform distribution on [0, θ] . . . . . . . . . . . . . . . . . . . . . . . 224
6.5.3 Exponential model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.5.4 Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6 Testing problem for a univariate exponential family . . . . . . . . . . 226
6.6.1 Two-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.6.2 One-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.6.3 Interval hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
6.7 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 231
7 Testing in linear models . . . . . . . . . . . . 233
7.1 Likelihood ratio test for a simple null . . . . . . . . . . . . . . . . . . . . . . 233
7.1.1 General errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.1.2 I.i.d. errors, known variance . . . . . . . . . . . . . . . . . . . . . . . . 234
7.1.3 I.i.d. errors with unknown variance . . . . . . . . . . . . . . . . . . 239
7.2 Likelihood ratio test for a subvector . . . . . . . . . . . . . . . . . . . . . . . 241
7.3 Likelihood ratio test for a linear hypothesis . . . . . . . . . . . . . . . . . 244
7.4 Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.5 Analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.5.1 Two-sample test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.5.2 Comparing K treatment means . . . . . . . . . . . . . . . . . . . . . 248
7.5.3 Randomized blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
7.6 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 252
8 Some other testing methods . . . . . . . . . . . . 253
8.1 Method of moments for an i.i.d. sample . . . . . . . . . . . . . . . . . . . . 253
8.1.1 Series expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
8.1.2 Testing a parametric hypothesis . . . . . . . . . . . . . . . . . . . . . 256
8.2 Minimum distance method for an i.i.d. sample . . . . . . . . . . . . . . 257
8.2.1 Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
8.2.2 ω 2 test (Cramér-Smirnov-von Mises) . . . . . . . . . . . . . . . . 260
8.3 Partially Bayes tests and Bayes testing . . . . . . . . . . . . . . . . . . . . . 261
8.3.1 Partial Bayes approach and Bayes tests . . . . . . . . . . . . . . 261
8.3.2 Bayes approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
8.4 Score, rank and permutation tests . . . . . . . . . . . . . . . . . . . . . . . . . 262
8.4.1 Score tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
8.4.2 Rank tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8.4.3 Permutation tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
8.5 Historical remarks and further reading . . . . . . . . . . . . . . . . . . . . . 275
9 Deviation probability for quadratic forms . . . . . . . . . . . . 277
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.2 Gaussian case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
9.3 A bound for the ℓ2 -norm . . . . . . . . . . . . 280
9.4 A bound for a quadratic form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
9.5 Rescaling and regularity condition . . . . . . . . . . . . . . . . . . . . . . . . . 283
9.6 A chi-squared bound with norm-constraints . . . . . . . . . . . . . . . . . 284
9.7 A bound for the ℓ2 -norm under Bernstein conditions . . . . . . . . . . . . 285
9.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
1
Basic notions
The starting point of any statistical analysis is data, also called observations
or a sample. A statistical model is used to explain the nature of the data. A
standard approach assumes that the data are random and utilizes some probabilistic framework. In contrast to probability theory, the distribution of
the data is not known precisely, and the goal of the analysis is to infer on this
unknown distribution.
The parametric approach assumes that the distribution of the data is
known up to the value of a parameter θ from some subset Θ of a finite-dimensional space IRp . In this case the statistical analysis is naturally reduced to the estimation of the parameter θ : as soon as θ is known, we know
the whole distribution of the data. Before introducing the general notion of a
statistical model, we discuss some popular examples.
1.1 Example of a Bernoulli experiment
Let Y = (Y1 , . . . , Yn )> be a sequence of binary digits zero or one. We distinguish between deterministic and random sequences. Deterministic sequences
appear e.g. from the binary representation of a real number, or from digitally
coded images, etc. Random binary sequences appear e.g. from coin throws,
games, etc. In many situations incomplete information can be treated as random data: the classification of healthy and sick patients, individual vote results, the bankruptcy of a firm or credit default, etc.
Basic assumptions behind a Bernoulli experiment are:
• the observed data Yi are independent and identically distributed;
• each Yi assumes the value one with probability θ ∈ [0, 1] .
The parameter θ completely identifies the distribution of the data Y . Indeed,
for every i ≤ n and y ∈ {0, 1} ,
    IP(Yi = y) = θ^y (1 − θ)^{1−y} ,
and the independence of the Yi ’s implies for every sequence y = (y1 , . . . , yn ) that

    IP(Y = y) = ∏_{i=1}^{n} θ^{yi} (1 − θ)^{1−yi} .        (1.1)
To indicate this fact, we write IPθ in place of IP .
The equation (1.1) can be rewritten as
    IPθ(Y = y) = θ^{sn} (1 − θ)^{n−sn} ,    where    sn = ∑_{i=1}^{n} yi .
The value sn is often interpreted as the number of successes in the sequence
y.
Probability theory focuses on the probabilistic properties of the data Y
under the given measure IPθ . The aim of the statistical analysis is to infer on
the measure IPθ for an unknown θ based on the available data Y . Typical
examples of statistical problems are:
1. Estimate the parameter θ , i.e., build a function θe of the data Y with values in
[0, 1] which approximates the unknown value θ as well as possible;
2. Build a confidence set for θ , i.e., a random (data-based) set (usually an
interval) containing θ with a prescribed probability;
3. Test a simple hypothesis that θ coincides with a prescribed value θ0 ,
e.g. θ0 = 1/2 ;
4. Test a composite hypothesis that θ belongs to a prescribed subset Θ0
of the interval [0, 1] .
Usually any statistical method is based on a preliminary probabilistic analysis of the model under the given θ .
Theorem 1.1.1. Let Y be i.i.d. Bernoulli with the parameter θ . Then the
mean and the variance of the sum Sn = Y1 + . . . + Yn satisfy
    IEθ Sn = nθ,        Varθ Sn =def IEθ (Sn − IEθ Sn )^2 = nθ(1 − θ).
Exercise 1.1.1. Prove this theorem.
This result suggests that the empirical mean θe = Sn /n is a reasonable estimate of θ . Indeed, the result of the theorem implies
    IEθ θe = θ,        IEθ (θe − θ)^2 = θ(1 − θ)/n.
The first equation means that θe is an unbiased estimate of θ , that is,
IEθ θe = θ for all θ . The second equation yields a kind of concentration (consistency) property of θe : with n growing, the estimate θe concentrates in a
small neighborhood of the point θ . By the Chebyshev inequality
    IPθ( |θe − θ| > δ ) ≤ θ(1 − θ)/(n δ^2 ).
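The concentration property is easy to check numerically. The following short sketch (an added illustration, not part of the original text; it assumes Python with numpy and arbitrary illustrative values of θ , n and δ ) draws repeated Bernoulli samples, computes θe = Sn /n , and compares the empirical frequency of a deviation |θe − θ| > δ with the Chebyshev bound above.

import numpy as np

rng = np.random.default_rng(0)
theta, n, delta, n_rep = 0.3, 200, 0.05, 10000

# draw n_rep Bernoulli samples of size n and compute theta_tilde = S_n / n for each
samples = rng.binomial(1, theta, size=(n_rep, n))
theta_tilde = samples.mean(axis=1)

# empirical probability of a deviation larger than delta
freq = np.mean(np.abs(theta_tilde - theta) > delta)

# Chebyshev bound from the text: theta (1 - theta) / (n delta^2)
bound = theta * (1 - theta) / (n * delta ** 2)
print("empirical deviation probability:", round(float(freq), 4))
print("Chebyshev bound:                ", round(float(bound), 4))

The Chebyshev bound is typically far from tight; the de Moivre-Laplace refinement below explains why the actual deviation probability is much smaller.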
This result is refined by the famous de Moivre-Laplace theorem.
Theorem 1.1.2. Let Y be i.i.d. Bernoulli with the parameter θ . Then for
every k ≤ n
    IPθ( Sn = k ) = \binom{n}{k} θ^k (1 − θ)^{n−k} ≈ (2πnθ(1 − θ))^{−1/2} exp{ − (k − nθ)^2 / (2nθ(1 − θ)) },

where an ≈ bn means an /bn → 1 as n → ∞ . Moreover, for any fixed z > 0 ,

    IPθ( |Sn /n − θ| > z √(θ(1 − θ)/n) ) ≈ (2/√(2π)) ∫_z^∞ e^{−t^2 /2} dt .
This concentration result yields that the estimate θe deviates from a root-n neighborhood A(z, θ) =def {u : |θ − u| ≤ z √(θ(1 − θ)/n)} with probability of order e^{−z^2 /2} .
This result bounding the difference |θe − θ| can also be used to build random confidence intervals around the point θe . Indeed, by the result of the theorem, the random interval E∗(z) = {u : |θe − u| ≤ z √(θ(1 − θ)/n)} fails to cover the true point θ with approximately the same probability:

    IPθ( E∗(z) ∌ θ ) ≈ (2/√(2π)) ∫_z^∞ e^{−t^2 /2} dt .        (1.2)
Unfortunately, the construction of this interval E∗(z) is not entirely data-based. Its width involves the true unknown value θ . A data-based confidence set can be obtained by replacing the population variance σ^2 =def IEθ (Y1 − θ)^2 = θ(1 − θ) with its empirical counterpart

    σ̃^2 =def (1/n) ∑_{i=1}^{n} (Yi − θe)^2 .

The resulting confidence set E(z) reads as

    E(z) =def {u : |θe − u| ≤ z n^{−1/2} σ̃ }.
It possesses the same asymptotic properties as E ∗ (z) including (1.2).
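As a numerical illustration (again a sketch added to the text, assuming numpy and a hard-coded normal quantile), one can check that the data-based interval E(z) covers the true θ with probability close to the nominal level implied by (1.2).

import numpy as np

rng = np.random.default_rng(1)
theta, n, n_rep = 0.3, 500, 5000
z = 1.96   # approximately the 97.5% quantile of the standard normal law

samples = rng.binomial(1, theta, size=(n_rep, n))
theta_tilde = samples.mean(axis=1)
# for Bernoulli data the empirical variance (1/n) sum (Y_i - theta_tilde)^2 equals theta_tilde (1 - theta_tilde)
sigma_tilde = np.sqrt(theta_tilde * (1 - theta_tilde))

# E(z) = {u : |theta_tilde - u| <= z n^{-1/2} sigma_tilde}; check how often it covers the true theta
covered = np.abs(theta_tilde - theta) <= z * sigma_tilde / np.sqrt(n)
print("empirical coverage:", round(float(covered.mean()), 3), "(nominal approx. 0.95)")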
The hypothesis that the value θ is equal to a prescribed value θ0 , e.g.
θ0 = 1/2 , can be checked by examining the difference |θe − 1/2| . If this value
is too large compared to σ n^{−1/2} or to σ̃ n^{−1/2} , then the hypothesis is wrong
with high probability. Similarly one can consider a composite hypothesis that
θ belongs to some interval [θ1 , θ2 ] ⊂ [0, 1] . If θe deviates from this interval
by at least the value z σ̃ n^{−1/2} with a large z , then the data significantly
contradict this hypothesis.
1.2 Least squares estimation in a linear model
A linear model assumes a linear systematic dependence of the output
(also called response or explained variable) Y on the input (also called regressor or explanatory variable) Ψ , which in general can be multidimensional.
The linear model is usually written in the form
IE Y = Ψ> θ ∗
with an unknown vector of coefficients θ ∗ = (θ∗_1 , . . . , θ∗_p )> . Equivalently one
writes
Y = Ψ> θ ∗ + ε
(1.3)
where ε stands for the individual error with zero mean: IEε = 0 . Such a
linear model is often used to describe the influence of the regressor Ψ on the
response Y on the basis of a collection of data given in the form of a sample (Yi , Ψi ) for
i = 1, . . . , n .
Let θ be a vector of coefficients considered as a candidate for θ ∗ . Then
each observation Yi is approximated by Ψi> θ . One often measures the quality of approximation by the sum of quadratic errors |Yi − Ψi> θ|^2 . Under the model assumption (1.3), the expected value of this sum is

    IE ∑ |Yi − Ψi> θ|^2 = IE ∑ ( Ψi> (θ ∗ − θ) + εi )^2 = ∑ ( Ψi> (θ ∗ − θ) )^2 + ∑ IEε_i^2 .
The cross term cancels in view of IEεi = 0 . Note that minimizing this expression w.r.t. θ is equivalent to minimizing the first sum because the second
sum does not depend on θ . Therefore,
    argmin_θ IE ∑ |Yi − Ψi> θ|^2 = argmin_θ ∑ ( Ψi> (θ ∗ − θ) )^2 = θ ∗ .
In words, the true parameter vector θ ∗ minimizes the expected quadratic
error of fitting the data with a linear combination of the Ψi ’s. The least
squares estimate of the parameter vector θ ∗ is defined by minimizing in θ
its empirical counterpart, that is, the sum of the squared errors (Yi − Ψi> θ)^2
over all i :

    θe =def argmin_θ ∑_{i=1}^{n} (Yi − Ψi> θ)^2 .
This equation can be solved explicitly under some condition on the Ψi ’s. Define the p×n design matrix Ψ = (Ψ1 , . . . , Ψn ) . The aforementioned condition
means that this matrix is of rank p .
Theorem 1.2.1. Let Yi = Ψi> θ ∗ + εi for i = 1, . . . , n , where the εi are independent and satisfy IEεi = 0 , IEε_i^2 = σ^2 . Suppose that the matrix Ψ is of rank p . Then

    θe = ( Ψ Ψ> )^{−1} Ψ Y ,

where Y = (Y1 , . . . , Yn )> . Moreover, θe is unbiased in the sense that

    IEθ∗ θe = θ ∗

and its variance satisfies Var(θe) = σ^2 ( Ψ Ψ> )^{−1} . For each vector h ∈ IRp , the random value ã = ⟨h, θe⟩ = h> θe is an unbiased estimate of a∗ = h> θ ∗ :

    IEθ∗ (ã) = a∗        (1.4)

with the variance

    Var(ã) = σ^2 h> ( Ψ Ψ> )^{−1} h .

Proof. Define

    Q(θ) =def ∑_{i=1}^{n} (Yi − Ψi> θ)^2 = ‖Y − Ψ> θ‖^2 ,

where ‖y‖^2 =def ∑_i y_i^2 . The normal equation dQ(θ)/dθ = 0 can be written as Ψ Ψ> θ = Ψ Y , yielding the representation of θe . Now the model equation yields IEθ∗ Y = Ψ> θ ∗ and thus

    IEθ∗ θe = ( Ψ Ψ> )^{−1} Ψ IEθ∗ Y = ( Ψ Ψ> )^{−1} Ψ Ψ> θ ∗ = θ ∗

as required.
Exercise 1.2.1. Check that Var(θe) = σ^2 ( Ψ Ψ> )^{−1} .
Similarly one obtains IEθ∗ (ã) = IEθ∗ (h> θe) = h> θ ∗ = a∗ , that is, ã is an unbiased estimate of a∗ . Also

    Var(ã) = Var( h> θe ) = h> Var(θe) h = σ^2 h> ( Ψ Ψ> )^{−1} h ,

which completes the proof.
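The explicit representation θe = (ΨΨ>)^{−1} Ψ Y is straightforward to evaluate numerically. The following sketch (an added illustration, not part of the original text; it assumes Python with numpy, a hypothetical polynomial design and arbitrary parameter values) generates data from the linear model and solves the normal equations.

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 200, 3, 0.5
theta_star = np.array([1.0, -2.0, 0.5])

# p x n design matrix Psi = (Psi_1, ..., Psi_n); here a simple polynomial design
x = np.linspace(0.0, 1.0, n)
Psi = np.vstack([np.ones(n), x, x ** 2])

# generate data from the model Y_i = Psi_i^T theta_star + eps_i
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

# least squares estimate from the normal equations (Psi Psi^T) theta = Psi Y
theta_tilde = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
print("theta_star :", theta_star)
print("theta_tilde:", theta_tilde)

# variance of the linear functional a_tilde = h^T theta_tilde: sigma^2 h^T (Psi Psi^T)^{-1} h
h = np.array([1.0, 0.5, 0.0])
var_a = sigma ** 2 * h @ np.linalg.inv(Psi @ Psi.T) @ h
print("Var(a_tilde):", var_a)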
The next result states that the proposed estimate ã is in some sense the best possible one. Namely, we consider the class of all linear unbiased estimates satisfying the identity (1.4). It appears that the variance σ^2 h> ( Ψ Ψ> )^{−1} h of ã is the smallest possible in this class.

Theorem 1.2.2 (Gauss-Markov). Let Yi = Ψi> θ ∗ + εi for i = 1, . . . , n with uncorrelated εi satisfying IEεi = 0 and IEε_i^2 = σ^2 . Let rank(Ψ) = p . Suppose that the value a∗ =def ⟨h, θ ∗⟩ = h> θ ∗ is to be estimated for a given vector h ∈ IRp . Then ã = ⟨h, θe⟩ = h> θe is an unbiased estimate of a∗ . Moreover, ã has the minimal possible variance over the class of all linear unbiased estimates of a∗ .
This result was historically one of the first optimality results in statistics.
It presents a lower efficiency bound of any statistical procedure. Under the
imposed restrictions it is impossible to do better than the LSE does. This and
more general results will be proved later in Chapter 4.
Define also the vector of residuals

    ε̂ =def Y − Ψ> θe .

If θe is a good estimate of the vector θ ∗ , then, due to the model equation, ε̂ is a good estimate of the vector ε of individual errors. Many statistical procedures utilize this observation by checking the quality of estimation via the analysis of the estimated vector ε̂ . In the case when this vector still shows a nonzero systematic component, there is evidence that the assumed linear model is incorrect. This vector can also be used to estimate the noise variance σ^2 .
Theorem 1.2.3. Consider the linear model Yi = Ψi> θ ∗ + εi with independent homogeneous errors εi . Then the variance σ^2 = IEε_i^2 can be estimated by

    σ̂^2 = ‖Y − Ψ> θe‖^2 / (n − p) = ‖ε̂‖^2 / (n − p)

and σ̂^2 is an unbiased estimate of σ^2 , that is, IEθ∗ σ̂^2 = σ^2 for all θ ∗ and σ .
Theorems 1.2.2 and 1.2.3 can be used to describe the concentration properties of the estimate ã and to build confidence sets based on ã and σ̂ , especially if the errors εi are normally distributed.
Theorem 1.2.4. Let Yi = Ψi> θ ∗ + εi for i = 1, . . . , n with εi ∼ N(0, σ^2 ) . Let rank(Ψ) = p . Then it holds for the estimate ã = h> θe of a∗ = h> θ ∗ that

    ã − a∗ ∼ N(0, s^2 )    with    s^2 = σ^2 h> ( Ψ Ψ> )^{−1} h .
Corollary 1.2.1 (Concentration). If for some α > 0 , zα is the 1 − α/2 quantile of the standard normal law (i.e. Φ(zα ) = 1 − α/2 ) then
    IPθ∗ ( |ã − a∗ | > zα s ) = α .
Exercise 1.2.2. Check Corollary 1.2.1.
The next result describes the confidence set for a∗ . The unknown variance s^2 is replaced by its estimate

    ŝ^2 =def σ̂^2 h> ( Ψ Ψ> )^{−1} h .

Corollary 1.2.2 (Confidence set). If E(zα ) =def {a : |ã − a| ≤ ŝ zα } , then

    IPθ∗ ( E(zα ) ∌ a∗ ) ≈ α .
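The following sketch (an added illustration under Gaussian errors, assuming numpy and arbitrary illustrative parameters) builds the interval E(zα ) of Corollary 1.2.2 from σ̂^2 = ‖ε̂‖^2 /(n − p) and checks its empirical coverage by simulation.

import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, z_alpha = 100, 2, 1.0, 1.96   # z_alpha approx. the 97.5% normal quantile
theta_star = np.array([2.0, -1.0])
h = np.array([1.0, 1.0])
a_star = h @ theta_star

x = np.linspace(-1.0, 1.0, n)
Psi = np.vstack([np.ones(n), x])           # p x n design matrix
G_inv = np.linalg.inv(Psi @ Psi.T)         # (Psi Psi^T)^{-1}

hits, n_rep = 0, 2000
for _ in range(n_rep):
    Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)
    theta_tilde = G_inv @ (Psi @ Y)
    resid = Y - Psi.T @ theta_tilde
    sigma2_hat = resid @ resid / (n - p)          # unbiased estimate of sigma^2
    s_hat = np.sqrt(sigma2_hat * h @ G_inv @ h)   # estimated standard deviation of a_tilde
    hits += abs(h @ theta_tilde - a_star) <= z_alpha * s_hat
print("empirical coverage:", hits / n_rep, "(nominal approx. 0.95)")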
1.3 General parametric model
Let Y denote the observed data with values in the observation space Y . In
most cases, Y ∈ IRn , that is, Y = (Y1 , . . . , Yn )> . Here n denotes the sample
size (number of observations). The basic assumption about these data is that
the vector Y is a random variable on a probability space (Y, B(Y), IP ) , where
B(Y) is the Borel σ -algebra on Y . The probabilistic approach assumes that
the probability measure IP is known and studies the distributional (population) properties of the vector Y . In contrast, the statistical approach
assumes that the data Y are given and tries to recover the distribution IP
on the basis of the available data Y . One can say that the statistical problem
is inverse to the probabilistic one.
The statistical analysis is usually based on the notion of statistical experiment. This notion assumes that a family P of probability measures on
(Y, B(Y)) is fixed and the unknown underlying measure IP belongs to this
family. Often this family is parameterized by the value θ from some parameter set Θ : P = (IPθ , θ ∈ Θ) . The corresponding statistical experiment can
be written as
    ( Y, B(Y), (IPθ , θ ∈ Θ) ) .
The value θ ∗ denotes the “true” parameter value, that is, IP = IPθ∗ .
The statistical experiment is dominated if there exists a dominating σ-finite measure µ0 such that all the IPθ are absolutely continuous w.r.t. µ0 .
In what follows we assume without further mention that the considered statistical models are dominated. Usually the choice of a dominating measure is
unimportant and any one can be used.
The parametric approach assumes that Θ is a subset of a finite-dimensional
Euclidean space IRp . In this case, the unknown data distribution is specified
by the value of a finite-dimensional parameter θ from Θ ⊆ IRp . Since in this
case the parameter θ completely identifies the distribution of the observations
Y , the statistical estimation problem is reduced to recovering (estimating)
this parameter from the data. The nice feature of the parametric theory is
that the estimation problem can be solved in a rather general way.
1.4 Statistical decision problem. Loss and Risk
The statistical decision problem is usually formulated in terms of game theory,
the statistician playing as it were against nature. Let D denote the decision
space that is assumed to be a topological space. Next, let ℘(·, ·) be a loss
function given on the product D × Θ . The value ℘(d, θ) denotes the loss
associated with the decision d ∈ D when the true parameter value is θ ∈
Θ . The statistical decision problem is composed of a statistical experiment
(Y, B(Y), P) , a decision space D and a loss function ℘(·, ·) .
A statistical decision ρ = ρ(Y ) is a measurable function of the observed
data Y with values in the decision space D . Clearly, ρ(Y ) can be considered
as a random D -valued element on the space (Y, B(Y)) . The corresponding
loss under the true model (Y, B(Y), IPθ∗ ) reads as ℘(ρ(Y ), θ ∗ ) . Finally, the
risk is defined as the expected value of the loss:
    R(ρ, θ ∗ ) =def IEθ∗ ℘(ρ(Y ), θ ∗ ).
Below we present a list of typical statistical decision problems.
Example 1.4.1 (Point estimation problem). Let the target of analysis be the
true parameter θ ∗ itself, that is, D coincides with Θ . Let ℘(·, ·) be a kind
of distance on Θ , that is, ℘(θ, θ ∗ ) denotes the loss of estimation, when the
selected value is θ while the true parameter is θ ∗ . Typical examples of the
loss function are the quadratic loss ℘(θ, θ ∗ ) = ‖θ − θ ∗ ‖^2 , the ℓ1 -loss ℘(θ, θ ∗ ) = ‖θ − θ ∗ ‖_1 , or the sup-loss ℘(θ, θ ∗ ) = ‖θ − θ ∗ ‖_∞ = max_{j=1,...,p} |θj − θ∗_j | .
If θe is an estimate of θ ∗ , that is, θe is a Θ -valued function of the data Y , then the corresponding risk is

    R(θe, θ ∗ ) =def IEθ∗ ℘(θe, θ ∗ ).

In particular, the quadratic risk reads as IEθ∗ ‖θe − θ ∗ ‖^2 .
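Since the risk is an expectation, it can be approximated by Monte Carlo. A minimal sketch (an added illustration assuming numpy, reusing the Bernoulli model of Section 1.1 as the example):

import numpy as np

rng = np.random.default_rng(4)
theta_star, n, n_rep = 0.3, 100, 20000

# Monte Carlo approximation of the quadratic risk of theta_tilde = S_n / n
theta_tilde = rng.binomial(1, theta_star, size=(n_rep, n)).mean(axis=1)
risk_mc = np.mean((theta_tilde - theta_star) ** 2)

print("Monte Carlo quadratic risk:  ", risk_mc)
print("exact value theta(1-theta)/n:", theta_star * (1 - theta_star) / n)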
Example 1.4.2 (Testing problem). Let Θ0 and Θ1 be two complementary
subsets of Θ , that is, Θ0 ∩ Θ1 = ∅ , Θ0 ∪ Θ1 = Θ . Our target is to check
whether the true parameter θ ∗ belongs to the subset Θ0 . The decision space
consists of two points {0, 1} for which d = 0 means the acceptance of the
hypothesis H0 : θ ∗ ∈ Θ0 while d = 1 rejects H0 in favor of the alternative
H1 : θ ∗ ∈ Θ1 . Define the loss
℘(d, θ) = 1(d = 1, θ ∈ Θ0 ) + 1(d = 0, θ ∈ Θ1 ).
A test φ is a binary valued function of the data, φ = φ(Y ) ∈ {0, 1} . The
corresponding risk R(φ, θ ∗ ) = IEθ∗ φ(Y ) can be interpreted as the probability
of selecting the wrong subset.
Example 1.4.3 (Confidence estimation). Let the target of analysis again be
the parameter θ ∗ . However, we aim to identify a subset A of Θ , as small
as possible, that covers the true value θ ∗ with a prescribed probability. Our
decision space D is now the set of all measurable subsets in Θ . For any
A ∈ D , the loss function is defined as ℘(A, θ ∗ ) = 1(A ∌ θ ∗ ) . A confidence set
is a random set E selected from the data Y , E = E(Y ) . The corresponding
risk R(E, θ ∗ ) = IEθ∗ ℘(E, θ ∗ ) is just the probability that E does not cover
θ∗ .
Example 1.4.4 (Estimation of a functional). Let the target of estimation be a
given function f (θ ∗ ) of the parameter θ ∗ with values in another space F . A
typical example is given by a single component of the vector θ ∗ . An estimate
ρ of f (θ ∗ ) is a function of the data Y into F : ρ = ρ(Y ) ∈ F . The loss
function ℘ is defined on the product F × F , yielding the loss ℘(ρ(Y ), f (θ ∗ ))
and the risk R(ρ(Y ), f (θ ∗ )) = IEθ∗ ℘(ρ(Y ), f (θ ∗ )) .
Exercise 1.4.1. Define the statistical decision problem for testing a simple
hypothesis θ ∗ = θ 0 for a given point θ 0 .
1.5 Efficiency
After the statistical decision problem is stated, one can ask for its optimal
solution. Equivalently one can say that the aim of statistical analysis is to
build a decision with the minimal possible risk. However, a comparison of any
two decisions on the basis of risk can be a nontrivial problem. Indeed, the risk
R(ρ, θ ∗ ) of a decision ρ depends on the true parameter value θ ∗ . It may
happen that one decision performs better for some points θ ∗ ∈ Θ but worse
at other points θ ∗ . An extreme example of such an estimate is the trivial
deterministic decision θe = θ0 which sets the estimate equal to the value θ0
whatever the data is. This is, of course, a very strange and poor estimate, but
it clearly outperforms all other methods if the true parameter θ ∗ is indeed
θ0 .
Two approaches are typically used to compare different statistical decisions: the minimax approach considers the maximum R(ρ) of the risks
R(ρ, θ) over the parameter set Θ while the Bayes approach is based on the
weighted sum (integral) Rπ (ρ) of such risks with respect to some measure π
on the parameter set Θ which is called the prior distribution:
    R(ρ) = sup_{θ∈Θ} R(ρ, θ),        Rπ (ρ) = ∫ R(ρ, θ) π(dθ).
The decision ρ∗ is called minimax if
    R(ρ∗ ) = inf_ρ R(ρ) = inf_ρ sup_{θ∈Θ} R(ρ, θ),
where the infimum is taken over the set of all possible decisions ρ . The value
R∗ = R(ρ∗ ) is called the minimax risk.
Similarly, the decision ρπ is called Bayes for the prior π if
    Rπ (ρπ ) = inf_ρ Rπ (ρ).
The corresponding value Rπ (ρπ ) is called the Bayes risk.
Exercise 1.5.1. Show that the minimax risk is greater than or equal to the
Bayes risk whatever the prior measure π is.
Hint: show that for any decision ρ , it holds R(ρ) ≥ Rπ (ρ) .
Usually the problem of finding a minimax or Bayes estimate is quite hard
and a closed form solution is available only in very few special cases. A standard way out of this problem is to switch to an asymptotic set-up in which
the sample size grows to infinity.
2
Parameter estimation for an i.i.d. model
This chapter is very important for understanding the whole book. It starts
with very classical stuff: Glivenko-Cantelli results for the empirical measure
that motivate the famous substitution principle. Then the method of moments
is studied in more detail including the risk analysis and asymptotic properties.
Some other classical estimation procedures are briefly discussed including the
methods of minimum distance, M-estimation and its special cases: least squares,
least absolute deviations and maximum likelihood estimation. The concept of
efficiency is discussed in the context of the Cramér–Rao risk bound, which is given
in the univariate and multivariate cases. The last sections of Chapter 2 start a
kind of smooth transition from classical to “modern” parametric statistics
and they reveal the approach of the book. The presentation is focused on the
(quasi) likelihood-based concentration and confidence sets. The basic concentration result is first introduced for the simplest Gaussian shift model and
then extended to the case of a univariate exponential family in Section 2.11.
Below in this chapter we consider the estimation problem for a sample of independent identically distributed (i.i.d.) observations. Throughout
the chapter the data Y are assumed to be given in the form of a sample
(Y1 , . . . , Yn )> . We assume that the observations Y1 , . . . , Yn are independent
identically distributed; each Yi is from an unknown distribution P also called
a marginal measure. The joint data distribution IP is the n -fold product of
P : IP = P ⊗n . Thus, the measure IP is uniquely identified by P and the
statistical problem can be reduced to recovering P .
The further step in model specification is based on a parametric assumption (PA): the measure P belongs to a given parametric family.
2.1 Empirical distribution. Glivenko-Cantelli Theorem
Let Y = (Y1 , . . . , Yn )> be an i.i.d. sample. For simplicity we assume that the
Yi ’s are univariate with values in IR . Let P denote the distribution of each
Yi :
    P (B) = IP (Yi ∈ B),    B ∈ B(IR).
One often says that Y is an i.i.d. sample from P . Let also F be the corresponding distribution function (cdf):
F (y) = IP (Y1 ≤ y) = P ((−∞, y]).
The assumption that the Yi ’s are i.i.d. implies that the joint distribution IP
of the data Y is given by the n -fold product of the marginal measure P :
IP = P ⊗n .
Let also Pn (resp. Fn ) be the empirical measure (resp. empirical distribution function (edf))

    Pn (B) = (1/n) ∑ 1(Yi ∈ B),        Fn (y) = (1/n) ∑ 1(Yi ≤ y).

Here and everywhere in this chapter the symbol ∑ stands for ∑_{i=1}^{n} . One can consider Fn as the distribution function of the empirical measure Pn defined as the atomic measure at the Yi ’s:

    Pn (A) =def (1/n) ∑_{i=1}^{n} 1(Yi ∈ A).
So, Pn (A) is the empirical frequency of the event A , that is, the fraction
of observations Yi belonging to A . By the law of large numbers one can
expect that this empirical frequency is close to the true probability P (A) if
the number of observations is sufficiently large.
An equivalent definition of the empirical measure and empirical distribution function can be given in terms of the empirical mean IEn g for a measurable function g :
    IEn g =def ∫_{−∞}^{∞} g(y) IPn (dy) = ∫_{−∞}^{∞} g(y) dFn (y) = (1/n) ∑_{i=1}^{n} g(Yi ).
The first result claims that indeed, for every Borel set B on the real line,
the empirical mass Pn (B) (which is random) is close in probability to the
population counterpart P (B) .
Theorem 2.1.1. For any Borel set B , it holds
1. IE Pn (B) = P (B) .
2. Var Pn (B) = n^{−1} σ_B^2 with σ_B^2 = P (B)(1 − P (B)) .
3. Pn (B) → P (B) in probability as n → ∞ .
4. √n {Pn (B) − P (B)} −→^w N(0, σ_B^2 ) .
Proof. Denote ξi = 1(Yi ∈ B) . This is a Bernoulli r.v. with parameter P (B) = IEξi . The first statement holds by definition of Pn (B) = n^{−1} ∑_i ξi . Next, for each i ≤ n ,

    Var ξi =def IEξ_i^2 − (IEξi )^2 = P (B)(1 − P (B))

in view of ξ_i^2 = ξi . Independence of the ξi ’s yields

    Var Pn (B) = Var( n^{−1} ∑_{i=1}^{n} ξi ) = n^{−2} ∑_{i=1}^{n} Var ξi = n^{−1} σ_B^2 .
The third statement follows by the law of large numbers for the i.i.d. r.v. ξi :
    (1/n) ∑_{i=1}^{n} ξi −→^IP IEξ1 .

Finally, the last statement follows by the Central Limit Theorem for the ξi :

    (1/√n) ∑_{i=1}^{n} (ξi − IEξi ) −→^w N(0, σ_B^2 ) .
The next important result shows that the edf Fn is a good approximation
of the cdf F in the uniform norm.
Theorem 2.1.2 (Glivenko-Cantelli). It holds
    sup_y |Fn (y) − F (y)| → 0,    n → ∞ .
Proof. Consider first the case when the function F is continuous in y . Fix any
integer N and define with ε = 1/N the points t1 < t2 < . . . < tN = +∞
such that F (tj ) − F (tj−1 ) = ε for j = 2, . . . , N . For every j , by (3) of
Theorem 2.1.1, it holds Fn (tj ) → F (tj ) . This implies that for some n(N ) , it
holds for all n ≥ n(N ) with a probability at least 1 − ε
    |Fn (tj ) − F (tj )| ≤ ε,    j = 1, . . . , N.        (2.1)

Now for every t ∈ [tj−1 , tj ] , it holds by definition

    F (tj−1 ) ≤ F (t) ≤ F (tj ),        Fn (tj−1 ) ≤ Fn (t) ≤ Fn (tj ).

This together with (2.1) implies

    IP( |Fn (t) − F (t)| > 2ε ) ≤ ε .
If the function F (·) is not continuous, then for every positive ε , there exists
a finite set Sε of points of discontinuity sm with F (sm ) − F (sm − 0) ≥ ε .
One can proceed as in the continuous case by adding the points from Sε to
the discrete set {tj } .
Exercise 2.1.1. Check the details of the proof of Theorem 2.1.2.
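The Glivenko-Cantelli statement is easy to observe in simulation. The following sketch (an added illustration, assuming numpy and scipy for the normal cdf; all values are arbitrary) computes the uniform distance sup_y |Fn (y) − F (y)| for standard normal samples of growing size; for a continuous F the supremum is attained at the jump points of Fn , i.e., at the order statistics.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

for n in (100, 1000, 10000):
    Y = np.sort(rng.standard_normal(n))
    i = np.arange(1, n + 1)
    cdf = norm.cdf(Y)                       # true cdf F evaluated at the order statistics
    # for a continuous F the supremum of |Fn - F| is attained at a jump point of Fn
    d_n = np.max(np.maximum(i / n - cdf, cdf - (i - 1) / n))
    print("n =", n, "  sup_y |Fn(y) - F(y)| =", round(float(d_n), 4))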
The results of Theorems 2.1.1 and 2.1.2 can be extended to certain functionals of the distribution P . Let g(y) be a function on the real line. Consider
its expectation
    s0 =def IEg(Y1 ) = ∫_{−∞}^{∞} g(y) dF (y).

Its empirical counterpart is defined by

    Sn =def ∫_{−∞}^{∞} g(y) dFn (y) = (1/n) ∑_{i=1}^{n} g(Yi ).
It appears that Sn indeed well estimates s0 , at least for large n .
Theorem 2.1.3. Let g(y) be a function on the real line such that
    ∫_{−∞}^{∞} g^2 (y) dF (y) < ∞ .

Then

    Sn −→^IP s0 ,        √n (Sn − s0 ) −→^w N(0, σ_g^2 ),    n → ∞,

where

    σ_g^2 =def ∫_{−∞}^{∞} g^2 (y) dF (y) − s_0^2 = ∫_{−∞}^{∞} (g(y) − s0 )^2 dF (y).

Moreover, if h(z) is a twice continuously differentiable function on the real line and h′(s0 ) ≠ 0 , then

    h(Sn ) −→^IP h(s0 ),        √n ( h(Sn ) − h(s0 ) ) −→^w N(0, σ_h^2 ),    n → ∞,

where σ_h^2 =def |h′(s0 )|^2 σ_g^2 .
Proof. The first statement is again the CLT for the i.i.d. random variables
ξi = g(Yi ) having mean value s0 and variance σ_g^2 . It also implies the second statement in view of the Taylor expansion

    h(Sn ) − h(s0 ) ≈ h′(s0 ) (Sn − s0 ) .
Exercise 2.1.2. Complete the proof.
Hint: use the first result to show that Sn belongs with high probability
to a small neighborhood U of the point s0 .
Then apply the Taylor expansion of second order to h(Sn ) − h(s0 ) = h( s0 + n^{−1/2} ξn ) − h(s0 ) with ξn = √n (Sn − s0 ) :

    | n^{1/2} [h(Sn ) − h(s0 )] − h′(s0 ) ξn | ≤ n^{−1/2} H ∗ ξ_n^2 /2,

where H ∗ = max_U |h″(y)| . Show that n^{−1/2} ξ_n^2 −→^IP 0 because ξn is stochastically bounded by the first statement of the theorem.
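A quick numerical check of the second claim of Theorem 2.1.3 (an added sketch assuming numpy; the exponential sample, g(y) = y and h(z) = z^2 are illustrative choices): with Yi ∼ Exp(1) one has s0 = 1 , σ_g = 1 and |h′(s0 )| σ_g = 2 , so the empirical standard deviation of √n (h(Sn ) − h(s0 )) should be close to 2.

import numpy as np

rng = np.random.default_rng(6)
n, n_rep = 500, 10000

# Y_i ~ Exp(1), g(y) = y: s0 = 1 and sigma_g^2 = 1
S_n = rng.exponential(1.0, size=(n_rep, n)).mean(axis=1)

# h(z) = z^2, so the limiting standard deviation is |h'(s0)| sigma_g = 2
stat = np.sqrt(n) * (S_n ** 2 - 1.0)
print("empirical std of sqrt(n)(h(S_n) - h(s0)):", round(float(stat.std()), 3), "(theory: 2)")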
The results of Theorems 2.1.2 and 2.1.3 can be extended to the case of a vectorial function g(·) : IR1 → IRm , that is, g(y) = (g1 (y), . . . , gm (y))> for y ∈ IR1 . Then s0 = (s0,1 , . . . , s0,m )> and its empirical counterpart S n = (Sn,1 , . . . , Sn,m )> are vectors in IRm as well:

    s0,j =def ∫_{−∞}^{∞} gj (y) dF (y),        Sn,j =def ∫_{−∞}^{∞} gj (y) dFn (y),    j = 1, . . . , m.
Theorem 2.1.4. Let g(y) be an IRm -valued function on the real line with a
bounded covariance matrix Σ = (Σjk )j,k=1,...,m :
    Σjk =def ∫_{−∞}^{∞} (gj (y) − s0,j )(gk (y) − s0,k ) dF (y) < ∞,    j, k ≤ m .

Then

    S n −→^IP s0 ,        √n ( S n − s0 ) −→^w N(0, Σ),    n → ∞.

Moreover, if H(z) is a twice continuously differentiable function on IRm and Σ H′(s0 ) ≠ 0 , where H′(z) stands for the gradient of H at z , then

    H(S n ) −→^IP H(s0 ),        √n ( H(S n ) − H(s0 ) ) −→^w N(0, σ_H^2 ),    n → ∞,

where σ_H^2 =def H′(s0 )> Σ H′(s0 ) .
Exercise 2.1.3. Prove Theorem 2.1.4.
Hint: consider for every h ∈ IRm the scalar products h> g(y) , h> s0 , h> S n . For the first statement, it suffices to show that

    h> S n −→^IP h> s0 ,        √n h> ( S n − s0 ) −→^w N(0, h> Σ h),    n → ∞.

For the second statement, consider the expansion

    | n^{1/2} [H(S n ) − H(s0 )] − ξ_n> H′(s0 ) | ≤ n^{−1/2} H ∗ ‖ξ n ‖^2 /2 −→^IP 0,

with ξ n = n^{1/2} (S n − s0 ) and H ∗ = max_{y∈U} ‖H″(y)‖ for a neighborhood U of s0 . Here ‖A‖ means the maximal eigenvalue of a symmetric matrix A .
2.2 Substitution principle. Method of moments
By the Glivenko-Cantelli theorem the empirical measure Pn (resp. edf Fn )
is a good approximation of the true measure P (resp. cdf F ), at least if
n is sufficiently large. This leads to the important substitution method of
statistical estimation: represent the target of estimation as a function of the
distribution P , then replace P by Pn .
Suppose that there exists some functional g of a measure Pθ from the
family P = (Pθ , θ ∈ Θ) such that the following identity holds:
    θ ≡ g(Pθ ),    θ ∈ Θ.
This particularly implies θ ∗ = g(Pθ∗ ) = g(P ) . The substitution estimate is
defined by substituting Pn for P :
    θe = g(Pn ).

Sometimes the obtained value θe can lie outside the parameter set Θ . Then one can redefine the estimate θe as the value providing the best fit of g(Pn ) :

    θe = argmin_θ ‖g(Pθ ) − g(Pn )‖ .
Here ‖ · ‖ denotes some norm on the parameter set Θ , e.g. the Euclidean
norm.
2.2.1 Method of moments. Univariate parameter
The method of moments is a special but at the same time the most frequently
used case of the substitution method. For illustration, we start with the univariate case. Let Θ ⊆ IR , that is, θ is a univariate parameter. Let g(y) be a
function on IR such that the first moment
    m(θ) =def Eθ g(Y1 ) = ∫ g(y) dPθ (y)

is continuous and monotonic. Then the parameter θ can be uniquely identified by the value m(θ) , that is, there exists an inverse function m^{−1} satisfying

    θ = m^{−1}( ∫ g(y) dPθ (y) ) .
The substitution method leads to the estimate
    θe = m^{−1}( ∫ g(y) dPn (y) ) = m^{−1}( (1/n) ∑ g(Yi ) ) .
Usually g(x) = x or g(x) = x2 , which explains the name of the method. This
method was proposed by Pearson and is historically the first regular method
of constructing a statistical estimate.
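As a quick illustration of the substitution step, here is a minimal Python sketch (not part of the text; the exponential model with $g(y) = y^2$, hence $m(\theta) = 2\theta^2$, is our own choice of example):

import numpy as np

# Univariate method of moments, assuming an exponential model P_theta with
# mean theta, g(y) = y^2 and m(theta) = E_theta Y^2 = 2 * theta^2, which is
# monotonic on (0, infinity).
rng = np.random.default_rng(0)
theta_star = 2.0
Y = rng.exponential(scale=theta_star, size=1000)

def m_inverse(s):
    # invert m(theta) = 2 * theta^2 on the positive half-line
    return np.sqrt(s / 2.0)

M_n = np.mean(Y ** 2)          # empirical counterpart of m(theta*)
theta_tilde = m_inverse(M_n)   # substitution estimate theta_tilde = m^{-1}(M_n)
print(theta_tilde)             # should be close to theta_star = 2.0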
2.2.2 Method of moments. Multivariate parameter
The method of moments can be easily extended to the multivariate case. Let $\Theta \subseteq \mathbb{R}^p$, and let $g(y) = \bigl(g_1(y), \ldots, g_p(y)\bigr)^\top$ be a function with values in $\mathbb{R}^p$. Define the moments $m(\theta) = \bigl(m_1(\theta), \ldots, m_p(\theta)\bigr)^\top$ by
\[
m_j(\theta) = E_\theta\, g_j(Y_1) = \int g_j(y)\, dP_\theta(y).
\]
The main requirement on the choice of the vector function $g$ is that the function $m$ is invertible, that is, the system of equations
\[
m_j(\theta) = t_j, \qquad j = 1, \ldots, p,
\]
has a unique solution for any $t \in \mathbb{R}^p$. The empirical counterpart $M_n$ of the true moments $m(\theta^*)$ is given by
\[
M_n \stackrel{\mathrm{def}}{=} \int g(y)\, dP_n(y)
  = \Bigl( \frac{1}{n}\sum g_1(Y_i), \ldots, \frac{1}{n}\sum g_p(Y_i) \Bigr)^\top.
\]
Then the estimate $\tilde\theta$ can be defined as
\[
\tilde\theta \stackrel{\mathrm{def}}{=} m^{-1}(M_n)
  = m^{-1}\Bigl( \frac{1}{n}\sum g_1(Y_i), \ldots, \frac{1}{n}\sum g_p(Y_i) \Bigr).
\]
2.2.3 Method of moments. Examples
This section lists some widely used parametric families and discusses the problem of constructing the parameter estimates by different methods. In all the examples we assume that an i.i.d. sample from a distribution $P$ is observed, and this measure $P$ belongs to a given parametric family $(P_\theta, \theta \in \Theta)$, that is, $P = P_{\theta^*}$ for $\theta^* \in \Theta$.
Gaussian shift
Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and known variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads as
\[
p(y, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl\{ -\frac{(y-\theta)^2}{2\sigma^2} \Bigr\}.
\]
It holds $I\!E_\theta Y_1 = \theta$ and $\operatorname{Var}_\theta(Y_1) = \sigma^2$, leading to the moment estimate
\[
\tilde\theta = \int y\, dP_n(y) = \frac{1}{n}\sum Y_i
\]
with mean $I\!E_\theta \tilde\theta = \theta$ and variance
\[
\operatorname{Var}_\theta(\tilde\theta) = \sigma^2/n.
\]
Univariate normal distribution
Let $Y_i \sim N(\alpha, \sigma^2)$ as in the previous example, but both the mean $\alpha$ and the variance $\sigma^2$ are unknown. This leads to the problem of estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$ from the i.i.d. sample $Y$.
The method of moments suggests to estimate the parameters from the first two empirical moments of the $Y_i$'s using the equations $m_1(\theta) = I\!E_\theta Y_1 = \alpha$, $m_2(\theta) = I\!E_\theta Y_1^2 = \alpha^2 + \sigma^2$. Inverting these equalities leads to
\[
\alpha = m_1(\theta), \qquad \sigma^2 = m_2(\theta) - m_1^2(\theta).
\]
Substituting the empirical measure $P_n$ yields the expressions for $\tilde\theta$:
\[
\tilde\alpha = \frac{1}{n}\sum Y_i, \qquad
\tilde\sigma^2 = \frac{1}{n}\sum Y_i^2 - \Bigl( \frac{1}{n}\sum Y_i \Bigr)^2
  = \frac{1}{n}\sum \bigl( Y_i - \tilde\alpha \bigr)^2.
\tag{2.2}
\]
As previously for the case of a known variance, it holds under $I\!P = I\!P_\theta$:
\[
I\!E\,\tilde\alpha = \alpha, \qquad \operatorname{Var}_\theta(\tilde\alpha) = \sigma^2/n.
\]
However, for the estimate $\tilde\sigma^2$ of $\sigma^2$, the result is slightly different, as described in the next theorem.

Theorem 2.2.1. It holds
\[
I\!E_\theta\, \tilde\sigma^2 = \frac{n-1}{n}\,\sigma^2, \qquad
\operatorname{Var}_\theta(\tilde\sigma^2) = \frac{2(n-1)}{n^2}\,\sigma^4.
\]
Proof. We use vector notation. Consider the unit vector e = n−1/2 (1, . . . , 1)> ∈
IRn and denote by Π1 the projector on e :
Π1 h = (e> h)e.
Then by definition α
e = n−1/2 e> Π1 Y and σ
e2 = n−1 kY − Π1 Y k2 . Moreover,
1/2
the model equation Y = n αe + ε implies in view of Π1 e = e that
Π1 Y = (n1/2 αe + Π1 ε).
2.2 Substitution principle. Method of moments
31
Now
ne
σ 2 = kY − Π1 Y k2 = kε − Π1 εk2 = k(In − Π1 )εk2
where In is the identity operator in IRn and In − Π1 is the projector on
the hyperplane in IRn orthogonal to the vector e . Obviously (In − Π1 )ε is
a Gaussian vector with zero mean and the covariance matrix V defined by
V = IE (In − Π1 )εε> (In − Π1 ) = (In − Π1 )IE(εε> )(In − Π1 )
= σ 2 (In − Π1 )2 = σ 2 (In − Π1 ).
It remains to note that for any Gaussian vector ξ ∼ N(0, V ) it holds
IEkξk2 = tr V,
Var kξk2 = 2 tr(V 2 ).
Exercise 2.2.1. Check the details of the proof.
Hint: reduce to the case of diagonal V .
Exercise 2.2.2. Compute the covariance IE(e
α − α)(e
σ 2 − σ 2 ) . Show that α
e
2
and σ
e are independent.
Hint: represent α
e − α = n−1/2 e> Π1 ε and σ
e2 = n−1 k(In − Π1 )εk2 . Use
that Π1 ε and (In − Π1 )ε are independent if Π1 is a projector and ε is a
Gaussian vector.
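The statement of Theorem 2.2.1 can be checked by a small Monte Carlo experiment; the following Python sketch (our own illustration, with arbitrary values of $n$, $\alpha$, $\sigma^2$) compares the empirical mean and variance of $\tilde\sigma^2$ with the theoretical values:

import numpy as np

# Monte Carlo check of Theorem 2.2.1 (illustrative sketch only).
rng = np.random.default_rng(2)
n, alpha, sigma2 = 10, 1.0, 4.0
reps = 200_000

Y = rng.normal(alpha, np.sqrt(sigma2), size=(reps, n))
sigma2_tilde = Y.var(axis=1)   # (1/n) sum (Y_i - alpha_tilde)^2, as in (2.2)

print(sigma2_tilde.mean(), (n - 1) / n * sigma2)               # bias:     ~3.6
print(sigma2_tilde.var(), 2 * (n - 1) / n ** 2 * sigma2 ** 2)  # variance: ~2.88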
Uniform distribution on [0, θ]
Let $Y_i$ be uniformly distributed on the interval $[0, \theta]$ of the real line, where the right end point $\theta$ is unknown. The density $p(y, \theta)$ of $P_\theta$ w.r.t. the Lebesgue measure is $\theta^{-1}\mathbf{1}(y \le \theta)$. It is easy to compute that for an integer $k$
\[
I\!E_\theta (Y_1^k) = \theta^{-1}\int_0^\theta y^k\, dy = \theta^k/(k+1),
\]
or $\theta = \bigl( (k+1) I\!E_\theta(Y_1^k) \bigr)^{1/k}$. This leads to the family of estimates
\[
\tilde\theta_k = \Bigl( \frac{k+1}{n}\sum Y_i^k \Bigr)^{1/k}.
\]
Letting $k$ tend to infinity leads to the estimate
\[
\tilde\theta_\infty = \max\{Y_1, \ldots, Y_n\}.
\]
This estimate is quite natural in the context of the uniform distribution. Later it will appear once again as the maximum likelihood estimate. However, it is not a moment estimate.
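A short simulation sketch (our own, not from the text) comparing the moment estimate $\tilde\theta_1 = 2\bar Y$ with $\tilde\theta_\infty = \max_i Y_i$ illustrates how much more accurate the maximum is for this model:

import numpy as np

# Compare the k = 1 moment estimate and the maximum for the uniform model on [0, theta].
rng = np.random.default_rng(3)
theta_star, n, reps = 1.0, 50, 20_000
Y = rng.uniform(0.0, theta_star, size=(reps, n))

theta_1 = 2.0 * Y.mean(axis=1)      # moment estimate with k = 1
theta_inf = Y.max(axis=1)           # limiting case: the maximum

print(np.mean((theta_1 - theta_star) ** 2))    # MSE of order theta^2 / (3 n)
print(np.mean((theta_inf - theta_star) ** 2))  # MSE of order theta^2 / n^2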
Bernoulli or binomial model
Let Pθ be a Bernoulli law for θ ∈ [0, 1] . Then every Yi is binary with
IEθ Yi = θ.
This leads to the moment estimate
Z
1X
θe = ydPn (y) =
Yi .
n
Exercise 2.2.3. Compute the moment estimate for g(y) = y k , k ≥ 1 .
Multinomial model
The multinomial distribution Bθm describes the number of successes in m
experiments when each success has the probability θ ∈ [0, 1] . This distribution can be viewed as the sum of m binomials with the same parameter θ .
Observed is the sample Y where each Yi is the number of successes in the
i th experiment. One has
m k
Pθ (Y1 = k) =
θ (1 − θ)m−k ,
k = 0, . . . , m.
k
Exercise 2.2.4. Check that method of moments with g(x) = x leads to the
estimate
1 X
θe =
Yi .
mn
e .
Compute Varθ (θ)
Hint: Reduce the multinomial model to the sum of m Bernoulli.
Exponential model
Let Pθ be an exponential distribution on the positive semiaxis with the parameter θ . This means
IPθ (Y1 > y) = e−y/θ .
Exercise 2.2.5. Check that method of moments with g(x) = x leads to the
estimate
1X
θe =
Yi .
n
e .
Compute Varθ (θ)
Poisson model
Let Pθ be the Poisson distribution with the parameter θ . The Poisson random
variable Y1 is integer-valued with
Pθ (Y1 = k) =
θk −θ
e .
k!
Exercise 2.2.6. Check that method of moments with g(x) = x leads to the
estimate
1X
θe =
Yi .
n
e .
Compute Varθ (θ)
Shift of a Laplace (double exponential) law
Let P0 be a symmetric distribution defined by the equations
P0 (|Y1 | > y) = e−y/σ ,
y ≥ 0,
for some given σ > 0 . Equivalently one can say that the absolute value of Y1
is exponential with parameter σ under P0 . Now define Pθ by shifting P0
by the value θ . This means that
Pθ (|Y1 − θ| > y) = e−y/σ ,
y ≥ 0.
It is obvious that IE0 Y1 = 0 and IEθ Y1 = θ .
Exercise 2.2.7. Check that method of moments leads to the estimate
1X
Yi .
θe =
n
e .
Compute Varθ (θ)
Shift of a symmetric density
Let the observations Yi be defined by the equation
Yi = θ∗ + εi
where θ∗ is an unknown parameter and the errors εi are i.i.d. with a density
symmetric around zero and finite second moment σ 2 = IEε21 . This particularly
yields that IEεi = 0 and IEYi = θ∗ . The method of moments immediately
yields the empirical mean estimate
1X
θe =
Yi
n
e = σ 2 /n .
with Varθ (θ)
2.3 Unbiased estimates, bias, and quadratic risk
Consider a parametric i.i.d. experiment corresponding to a sample $Y = (Y_1, \ldots, Y_n)^\top$ from a distribution $P_{\theta^*} \in (P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$. By $\theta^*$ we denote the true parameter from $\Theta$. Let $\tilde\theta$ be an estimate of $\theta^*$, that is, a function of the available data $Y$ with values in $\Theta$: $\tilde\theta = \tilde\theta(Y)$.
An estimate $\tilde\theta$ of the parameter $\theta^*$ is called unbiased if
\[
I\!E_{\theta^*}\tilde\theta = \theta^*.
\]
This property seems to be rather natural and desirable. However, it is often just a matter of parametrization. Indeed, if $g\colon \Theta \to \Theta$ is a linear transformation of the parameter set $\Theta$, that is, $g(\theta) \stackrel{\mathrm{def}}{=} A\theta + b$, then the estimate $\tilde\vartheta \stackrel{\mathrm{def}}{=} A\tilde\theta + b$ of the new parameter $\vartheta = A\theta + b$ is again unbiased. However, if $m(\cdot)$ is a nonlinear transformation, then the identity $I\!E_{\theta^*} m(\tilde\theta) = m(\theta^*)$ is not preserved.
Example 2.3.1. Consider the Gaussian shift experiments for Yi i.i.d. N(θ∗ , σ 2 )
with known variance σ 2 but the shift parameter θ∗ is unknown. Then
θe = n−1 (Y1 +. . .+Yn ) is an unbiased estimate of θ∗ . However, for m(θ) = θ2 ,
it holds
e 2 = |θ∗ |2 + σ 2 /n,
IEθ∗ |θ|
e 2 of |θ∗ |2 is slightly biased.
that is, the estimate |θ|
The property of “no bias” is especially important in connection with the
e . To illustrate this point, we first consider the
quadratic risk of the estimate θ
case of a univariate parameter.
2.3.1 Univariate parameter
Let $\theta \in \Theta \subseteq \mathbb{R}^1$. Denote by $\operatorname{Var}_{\theta^*}(\tilde\theta)$ the variance of the estimate $\tilde\theta$:
\[
\operatorname{Var}_{\theta^*}(\tilde\theta) = I\!E_{\theta^*}\bigl( \tilde\theta - I\!E_{\theta^*}\tilde\theta \bigr)^2.
\]
The quadratic risk of $\tilde\theta$ is defined by
\[
R(\tilde\theta, \theta^*) \stackrel{\mathrm{def}}{=} I\!E_{\theta^*}\bigl| \tilde\theta - \theta^* \bigr|^2.
\]
It is obvious that $R(\tilde\theta, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde\theta)$ if $\tilde\theta$ is unbiased. It turns out that the quadratic risk of $\tilde\theta$ is larger than the variance when this property is not fulfilled. Define the bias of $\tilde\theta$ as
\[
b(\tilde\theta, \theta^*) \stackrel{\mathrm{def}}{=} I\!E_{\theta^*}\tilde\theta - \theta^*.
\]

Theorem 2.3.1. It holds for any estimate $\tilde\theta$ of the univariate parameter $\theta^*$:
\[
R(\tilde\theta, \theta^*) = \operatorname{Var}_{\theta^*}(\tilde\theta) + b^2(\tilde\theta, \theta^*).
\]
Due to this result, the bias $b(\tilde\theta, \theta^*)$ contributes the value $b^2(\tilde\theta, \theta^*)$ to the quadratic risk. This particularly explains why one is interested in considering unbiased or at least nearly unbiased estimates.
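Since the proof of Theorem 2.3.1 is not spelled out here, we record a one-line verification (a sketch in the notation above): writing $\tilde\theta - \theta^* = (\tilde\theta - I\!E_{\theta^*}\tilde\theta) + b(\tilde\theta, \theta^*)$ and expanding the square,
\[
I\!E_{\theta^*}\bigl| \tilde\theta - \theta^* \bigr|^2
  = \operatorname{Var}_{\theta^*}(\tilde\theta)
  + 2\, b(\tilde\theta, \theta^*)\, I\!E_{\theta^*}\bigl( \tilde\theta - I\!E_{\theta^*}\tilde\theta \bigr)
  + b^2(\tilde\theta, \theta^*),
\]
and the cross term vanishes because $I\!E_{\theta^*}\bigl( \tilde\theta - I\!E_{\theta^*}\tilde\theta \bigr) = 0$.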
2.3.2 Multivariate case
e
Now we extend the result to the multivariate case with θ ∈ Θ ⊆ IRp . Then θ
e
is a vector in IRp . The corresponding variance-covariance matrix Varθ∗ (θ)
is defined as
e def
e − IEθ∗ θ
e θ
e − IEθ∗ θ
e > .
Varθ∗ (θ)
= IEθ∗ θ
e is unbiased if IEθ∗ θ
e = θ ∗ , and
As previously, θ
e − θ∗ .
IEθ∗ θ
e in the
The quadratic risk of the estimate θ
defined via the Euclidean norm of the difference
e is b(θ,
e θ ∗ ) def
the bias of θ
=
multivariate case is usually
e − θ∗ :
θ
2
e θ ∗ ) def
R(θ,
= IEθ∗ e
θ − θ∗ .
Theorem 2.3.2. It holds
e θ ∗ ) = tr Varθ∗ (θ)
e + b(θ,
e θ ∗ )2
R(θ,
Proof. The result follows similarly to the univariate case using the identity
kvk2 = tr(vv > ) for any vector v ∈ IRp .
Exercise 2.3.1. Complete the proof of Theorem 2.3.2.
2.4 Asymptotic properties
The properties of the previously introduced estimate $\tilde\theta$ heavily depend on the sample size $n$. We therefore use the notation $\tilde\theta_n$ to highlight this dependence. A natural extension of the condition that $\tilde\theta$ is unbiased is the requirement that the bias $b(\tilde\theta, \theta^*)$ becomes negligible as the sample size $n$ increases. This leads to the notion of consistency.

Definition 2.4.1. A sequence of estimates $\tilde\theta_n$ is consistent if
\[
\tilde\theta_n \xrightarrow{\;I\!P\;} \theta^*, \qquad n \to \infty.
\]
$\tilde\theta_n$ is mean consistent if
\[
I\!E_{\theta^*}\bigl\| \tilde\theta_n - \theta^* \bigr\| \to 0, \qquad n \to \infty.
\]
Clearly mean consistency implies consistency and also asymptotic unbiasedness:
\[
b(\tilde\theta_n, \theta^*) = I\!E\,\tilde\theta_n - \theta^* \to 0, \qquad n \to \infty.
\]
The property of consistency means that the difference $\tilde\theta - \theta^*$ is small for $n$ large. The next natural question to address is how fast this difference tends to zero with $n$. The Glivenko-Cantelli result suggests that $\sqrt{n}\,\bigl( \tilde\theta_n - \theta^* \bigr)$ is asymptotically normal.

Definition 2.4.2. A sequence of estimates $\tilde\theta_n$ is root-n normal if
\[
\sqrt{n}\,\bigl( \tilde\theta_n - \theta^* \bigr) \xrightarrow{\;w\;} N(0, V^2)
\]
for some fixed matrix $V^2$.
We aim to show that the moment estimates are consistent and asymptotically root-n normal under very general conditions. We start again with the
univariate case.
2.4.1 Root- n normality. Univariate parameter
Our first result describes the simplest situation, when the parameter of interest $\theta^*$ can be represented as an integral $\int g(y)\, dP_{\theta^*}(y)$ for some function $g(\cdot)$.

Theorem 2.4.1. Suppose that $\Theta \subseteq \mathbb{R}$ and a function $g(\cdot)\colon \mathbb{R} \to \mathbb{R}$ satisfies for every $\theta \in \Theta$
\[
\int g(y)\, dP_\theta(y) = \theta, \qquad
\int \bigl( g(y) - \theta \bigr)^2\, dP_\theta(y) = \sigma^2(\theta) < \infty.
\]
Then the moment estimates $\tilde\theta_n = n^{-1}\sum g(Y_i)$ satisfy the following conditions:
1. Each $\tilde\theta_n$ is unbiased, that is, $I\!E_{\theta^*}\tilde\theta_n = \theta^*$.
2. The normalized quadratic risk $n I\!E_{\theta^*}\bigl( \tilde\theta_n - \theta^* \bigr)^2$ fulfills
\[
n I\!E_{\theta^*}\bigl( \tilde\theta_n - \theta^* \bigr)^2 = \sigma^2(\theta^*).
\]
3. $\tilde\theta_n$ is asymptotically root-n normal:
\[
\sqrt{n}\,\bigl( \tilde\theta_n - \theta^* \bigr) \xrightarrow{\;w\;} N\bigl(0, \sigma^2(\theta^*)\bigr).
\]
This result has already been proved, see Theorem 2.1.3. Next we extend this result to the more general situation when $\theta^*$ is defined implicitly via the moment $s_0(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$. This means that there exists another function $m(\theta)$ such that $m(\theta^*) = \int g(y)\, dP_{\theta^*}(y)$.

Theorem 2.4.2. Suppose that $\Theta \subseteq \mathbb{R}$ and functions $g(y)\colon \mathbb{R} \to \mathbb{R}$ and $m(\theta)\colon \Theta \to \mathbb{R}$ satisfy
\[
\int g(y)\, dP_\theta(y) \equiv m(\theta), \qquad
\int \bigl( g(y) - m(\theta) \bigr)^2\, dP_\theta(y) \equiv \sigma_g^2(\theta) < \infty.
\]
We also assume that $m(\cdot)$ is monotonic and twice continuously differentiable with $m'\bigl( m(\theta^*) \bigr) \ne 0$. Then the moment estimates $\tilde\theta_n = m^{-1}\bigl( n^{-1}\sum g(Y_i) \bigr)$ satisfy the following conditions:
1. $\tilde\theta_n$ is consistent, that is, $\tilde\theta_n \xrightarrow{\;I\!P\;} \theta^*$.
2. $\tilde\theta_n$ is asymptotically root-n normal:
\[
\sqrt{n}\,\bigl( \tilde\theta_n - \theta^* \bigr) \xrightarrow{\;w\;} N\bigl(0, \sigma^2(\theta^*)\bigr),
\tag{2.3}
\]
where $\sigma^2(\theta^*) = \bigl| m'\bigl( m(\theta^*) \bigr) \bigr|^{-2}\sigma_g^2(\theta^*)$.
This result also follows directly from Theorem 2.1.3 with h(s) = m−1 (s) .
The property of asymptotic normality allows us to study the asymptotic
concentration of θen and to build asymptotic confidence sets.
Corollary 2.4.1. Let $\tilde\theta_n$ be asymptotically root-n normal, see (2.3). Then for any $z > 0$
\[
\lim_{n\to\infty} I\!P_{\theta^*}\Bigl( \sqrt{n}\,\bigl| \tilde\theta_n - \theta^* \bigr| > z\,\sigma(\theta^*) \Bigr) = 2\Phi(-z),
\]
where $\Phi(z)$ is the cdf of the standard normal law.

In particular, this result implies that the estimate $\tilde\theta_n$ lies outside the small root-n neighborhood
\[
A(z) \stackrel{\mathrm{def}}{=} \bigl[ \theta^* - n^{-1/2}\sigma(\theta^*) z,\; \theta^* + n^{-1/2}\sigma(\theta^*) z \bigr]
\]
only with probability about $2\Phi(-z)$, which is small provided that $z$ is sufficiently large.

Next we briefly discuss the problem of interval (or confidence) estimation of the parameter $\theta^*$. This problem differs from the problem of point estimation: the target is to build an interval (a set) $E_\alpha$ on the basis of the observations $Y$ such that $I\!P(E_\alpha \ni \theta^*) \approx 1 - \alpha$ for a given $\alpha \in (0,1)$. This problem can be attacked similarly to the problem of concentration by considering the interval of width $2\sigma(\theta^*) z$ centered at the estimate $\tilde\theta$. However, the major difficulty is raised by the fact that this construction involves the true parameter value $\theta^*$ via the variance $\sigma^2(\theta^*)$. In some situations this variance does not depend on $\theta^*$: $\sigma^2(\theta^*) \equiv \sigma^2$ with a known value $\sigma^2$. In this case the construction is immediate.

Corollary 2.4.2. Let $\tilde\theta_n$ be asymptotically root-n normal, see (2.3). Let also $\sigma^2(\theta^*) \equiv \sigma^2$. Then for any $\alpha \in (0,1)$, the set
\[
E^{\circ}(z_\alpha) \stackrel{\mathrm{def}}{=} \bigl[ \tilde\theta_n - n^{-1/2}\sigma z_\alpha,\; \tilde\theta_n + n^{-1/2}\sigma z_\alpha \bigr],
\]
where $z_\alpha$ is defined by $2\Phi(-z_\alpha) = \alpha$, satisfies
\[
\lim_{n\to\infty} I\!P_{\theta^*}\bigl( E^{\circ}(z_\alpha) \ni \theta^* \bigr) = 1 - \alpha.
\tag{2.4}
\]

Exercise 2.4.1. Check Corollaries 2.4.1 and 2.4.2.

Next we consider the case when the variance $\sigma^2(\theta^*)$ is unknown. Instead we assume that a consistent variance estimate $\tilde\sigma^2$ is available. Then we plug this estimate into the construction of the confidence set in place of the unknown true variance $\sigma^2(\theta^*)$, leading to the following confidence set:
\[
E(z_\alpha) \stackrel{\mathrm{def}}{=} \bigl[ \tilde\theta_n - n^{-1/2}\tilde\sigma z_\alpha,\; \tilde\theta_n + n^{-1/2}\tilde\sigma z_\alpha \bigr].
\tag{2.5}
\]
Theorem 2.4.3. Let $\tilde\theta_n$ be asymptotically root-n normal, see (2.3). Let $\sigma(\theta^*) > 0$ and let $\tilde\sigma^2$ be a consistent estimate of $\sigma^2(\theta^*)$ in the sense that $\tilde\sigma^2 \xrightarrow{\;I\!P\;} \sigma^2(\theta^*)$. Then for any $\alpha \in (0,1)$, the set $E(z_\alpha)$ is asymptotically $\alpha$-confident in the sense of (2.4).

One natural estimate of the variance $\sigma^2(\theta^*)$ can be obtained by plugging the estimate $\tilde\theta$ in place of $\theta^*$, leading to $\tilde\sigma = \sigma(\tilde\theta)$. If $\sigma(\theta)$ is a continuous function of $\theta$ in a neighborhood of $\theta^*$, then consistency of $\tilde\theta$ implies consistency of $\tilde\sigma$.

Corollary 2.4.3. Let $\tilde\theta_n$ be asymptotically root-n normal and let the variance $\sigma^2(\theta)$ be a continuous function of $\theta$ at $\theta^*$. Then $\tilde\sigma \stackrel{\mathrm{def}}{=} \sigma(\tilde\theta_n)$ is a consistent estimate of $\sigma(\theta^*)$ and the set $E(z_\alpha)$ from (2.5) is asymptotically $\alpha$-confident.
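A minimal Python sketch of the plug-in interval (2.5) (our own example: the exponential model with mean $\theta$, where $\sigma^2(\theta) = \theta^2$ and hence $\tilde\sigma = \tilde\theta$):

import numpy as np
from scipy.stats import norm

# Plug-in asymptotic confidence interval for the exponential mean.
rng = np.random.default_rng(4)
theta_star, n, alpha = 2.0, 200, 0.05
Y = rng.exponential(scale=theta_star, size=n)

theta_tilde = Y.mean()
sigma_tilde = theta_tilde                    # plug-in estimate of sigma(theta*) = theta*
z_alpha = norm.ppf(1.0 - alpha / 2.0)        # 2 * Phi(-z_alpha) = alpha

E = (theta_tilde - sigma_tilde * z_alpha / np.sqrt(n),
     theta_tilde + sigma_tilde * z_alpha / np.sqrt(n))
print(E)   # asymptotic (1 - alpha) confidence interval for theta*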
2.4.2 Root- n normality. Multivariate parameter
Let now $\Theta \subseteq \mathbb{R}^p$ and let $\theta^*$ be the true parameter vector. The method of moments requires at least $p$ different moment functions for identifying $p$ parameters. Let $g(y)\colon \mathbb{R} \to \mathbb{R}^p$ be a vector of moment functions, $g(y) = \bigl(g_1(y), \ldots, g_p(y)\bigr)^\top$. Suppose first that the true parameter can be obtained just by integration: $\theta^* = \int g(y)\, dP_{\theta^*}(y)$. This yields the moment estimate $\tilde\theta_n = n^{-1}\sum g(Y_i)$.

Theorem 2.4.4. Suppose that a vector function $g(y)\colon \mathbb{R} \to \mathbb{R}^p$ satisfies the following conditions:
\[
\int g(y)\, dP_\theta(y) = \theta, \qquad
\int \bigl( g(y) - \theta \bigr)\bigl( g(y) - \theta \bigr)^\top dP_\theta(y) = \Sigma(\theta).
\]
Then it holds for the moment estimate $\tilde\theta_n = n^{-1}\sum g(Y_i)$:
1. $\tilde\theta_n$ is unbiased, that is, $I\!E_{\theta^*}\tilde\theta_n = \theta^*$.
2. $\tilde\theta_n$ is asymptotically root-n normal:
\[
\sqrt{n}\,\bigl( \tilde\theta_n - \theta^* \bigr) \xrightarrow{\;w\;} N\bigl(0, \Sigma(\theta^*)\bigr).
\tag{2.6}
\]
3. The normalized quadratic risk $n I\!E_{\theta^*}\bigl\| \tilde\theta_n - \theta^* \bigr\|^2$ fulfills
\[
n I\!E_{\theta^*}\bigl\| \tilde\theta_n - \theta^* \bigr\|^2 = \operatorname{tr}\Sigma(\theta^*).
\]
Similarly to the univariate case, this result yields corollaries about concentration and confidence sets with intervals replaced by ellipsoids. Indeed, due
to the second statement, the vector
def
ξn =
√
n{Σ(θ ∗ )}−1/2 (θ − θ ∗ )
w
is asymptotically standard normal: ξn −→ ξ ∼ N(0, Ip ) . This also implies
that the squared norm of ξn is asymptotically χ2p -distributed where ξp2 is
the law of kξk2 = ξ12 + . . . + ξp2 . Define the value zα via the quantiles of χ2p
by the relation
IP kξk > zα = α.
(2.7)
en is root-n normal, see (2.6). Define for a
Corollary 2.4.4. Suppose that θ
given z the ellipsoid
def
A(z) = {θ : (θ − θ ∗ )> {Σ(θ ∗ )}−1 (θ − θ ∗ ) ≤ z 2 /n}.
en in the sense
Then A(zα ) is asymptotically (1 − α) -concentration set for θ
that
e 6∈ A(zα ) = α.
lim IP θ
n→∞
w
The weak convergence ξn −→ ξ suggests to build confidence sets also
in form of ellipsoids with the axis defined by the covariance matrix Σ(θ ∗ ) .
Define for α > 0
√ def e − θ) ≤ zα .
E ◦ (zα ) = θ : n{Σ(θ ∗ )}−1/2 (θ
The result of Theorem 2.4.4 implies that this set covers the true value θ ∗
with probability approaching 1 − α .
Unfortunately, in typical situations the matrix Σ(θ ∗ ) is unknown because
it depends on the unknown parameter θ ∗ . It is natural to replace it with the
e replacing the true value θ ∗ with its consistent estimate θ
e . If
matrix Σ(θ)
e
Σ(θ) is a continuous function of θ , then Σ(θ) provides a consistent estimate
of Σ(θ ∗ ) . This leads to the data-driven confidence set:
def
E(zα ) =
√ e −1/2 (θ
e − θ) ≤ z .
θ : n{Σ(θ)}
en is root-n normal, see (2.6), with a nonCorollary 2.4.5. Suppose that θ
∗
degenerate matrix Σ(θ ) . Let the matrix function Σ(θ) be continuous at
θ ∗ . Let zα be defined by (2.7). Then E ◦ (zα ) and E(zα ) are asymptotically
(1 − α) -confidence sets for θ ∗ :
lim IP E ◦ (zα ) 3 θ ∗ = lim IP E(zα ) 3 θ ∗ = 1 − α.
n→∞
n→∞
Exercise 2.4.2. Check Corollaries 2.4.4 and 2.4.5 about the set E ◦ (zα ) .
Exercise 2.4.3. Check Corollary 2.4.5 about the set E(zα ) .
e is consistent and Σ(θ) is continuous and invertible at θ ∗ . This
Hint: θ
implies
IP
e − Σ(θ ∗ ) −→ 0,
Σ(θ)
IP
e −1 − {Σ(θ ∗ )}−1 −→ 0,
{Σ(θ)}
and hence, the sets E ◦ (zα ) and E(zα ) are nearly the same.
Finally we discuss the general situation when the target parameter is a function of the moments. This means the relations
\[
m(\theta) = \int g(y)\, dP_\theta(y), \qquad \theta = m^{-1}\bigl( m(\theta) \bigr).
\]
Of course, these relations assume that the vector function $m(\cdot)$ is invertible. The substitution principle leads to the estimate
\[
\tilde\theta \stackrel{\mathrm{def}}{=} m^{-1}(M_n),
\]
where $M_n$ is the vector of empirical moments:
\[
M_n \stackrel{\mathrm{def}}{=} \int g(y)\, dP_n(y) = \frac{1}{n}\sum g(Y_i).
\]
The central limit theorem (see Theorem 2.1.4) implies that $M_n$ is a consistent estimate of $m(\theta^*)$ and that the vector $\sqrt{n}\,\bigl( M_n - m(\theta^*) \bigr)$ is asymptotically normal with some covariance matrix $\Sigma_g(\theta^*)$. Moreover, if $m^{-1}$ is differentiable at the point $m(\theta^*)$, then $\sqrt{n}\,(\tilde\theta - \theta^*)$ is asymptotically normal as well:
\[
\sqrt{n}\,\bigl( \tilde\theta - \theta^* \bigr) \xrightarrow{\;w\;} N\bigl(0, \Sigma(\theta^*)\bigr),
\]
where $\Sigma(\theta^*) = H^\top \Sigma_g(\theta^*) H$ and $H$ is the $p \times p$ Jacobi matrix of $m^{-1}$ at $m(\theta^*)$: $H \stackrel{\mathrm{def}}{=} \frac{d}{d\theta} m^{-1}\bigl( m(\theta^*) \bigr)$.
2.5 Some geometric properties of a parametric family
The parametric situation means that the true marginal distribution $P$ belongs to some given parametric family $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$. By $\theta^*$ we denote the true value, that is, $P = P_{\theta^*} \in (P_\theta)$. The natural target of estimation in this situation is the parameter $\theta^*$ itself. Below we assume that the family $(P_\theta)$ is dominated, that is, there exists a dominating measure $\mu_0$. The corresponding density is denoted by
\[
p(y, \theta) = \frac{dP_\theta}{d\mu_0}(y).
\]
We also use the notation
\[
\ell(y, \theta) \stackrel{\mathrm{def}}{=} \log p(y, \theta)
\]
for the log-density.
The following two important characteristics of the parametric family $(P_\theta)$ will be frequently used in the sequel: the Kullback-Leibler divergence and the Fisher information.
2.5.1 Kullback-Leibler divergence
For any two parameters $\theta, \theta'$, the value
\[
K(P_\theta, P_{\theta'}) = \int \log\frac{p(y,\theta)}{p(y,\theta')}\; p(y,\theta)\, d\mu_0(y)
  = \int \bigl( \ell(y,\theta) - \ell(y,\theta') \bigr)\, p(y,\theta)\, d\mu_0(y)
\]
is called the Kullback–Leibler divergence (KL-divergence) between $P_\theta$ and $P_{\theta'}$. We also write $K(\theta, \theta')$ instead of $K(P_\theta, P_{\theta'})$ if there is no risk of confusion. Equivalently one can represent the KL-divergence as
\[
K(\theta, \theta') = E_\theta \log\frac{p(Y,\theta)}{p(Y,\theta')} = E_\theta\bigl( \ell(Y,\theta) - \ell(Y,\theta') \bigr),
\]
where $Y \sim P_\theta$. An important feature of the Kullback-Leibler divergence is that it is always non-negative and equal to zero iff the measures $P_\theta$ and $P_{\theta'}$ coincide.

Lemma 2.5.1. For any $\theta, \theta'$ it holds
\[
K(\theta, \theta') \ge 0.
\]
Moreover, $K(\theta, \theta') = 0$ implies that the densities $p(y,\theta)$ and $p(y,\theta')$ coincide $\mu_0$-a.s.
Proof. Define Z(y) = p(y, θ 0 )/p(y, θ) . Then
Z
Z
Z(y)p(y, θ)dµ0 (y) =
p(y, θ 0 )dµ0 (y) = 1
2
d
−2
< 0,
because p(y, θ 0 ) is the density of Pθ0 w.r.t. µ0 . Next, dt
2 log(t) = −t
thus, the log-function is strictly concave. The Jensen inequality implies
K(θ, θ 0 ) = −
Z
Z
log(Z(y))p(y, θ)dµ0 (y) ≥ − log
Z(y)p(y, θ)dµ0 (y)
= − log(1) = 0.
Moreover, the strict concavity of the log-function implies that the equality
in this relation is only possible if Z(y) ≡ 1 Pθ -a.s. This implies the last
statement of the lemma.
The two mentioned features of the Kullback-Leibler divergence suggest to
consider it as a kind of distance on the parameter space. In some sense, it
measures how far Pθ0 is from Pθ . Unfortunately, it is not a metric because
it is not symmetric:
K(θ, θ 0 ) 6= K(θ 0 , θ)
with very few exceptions for some special situations.
Exercise 2.5.1. Compute KL-divergence for the Gaussian shift, Bernoulli,
Poisson, volatility and exponential families. Check in which cases it is symmetric.
Exercise 2.5.2. Consider the shift experiment given by the equation Y =
θ+ε where ε is an error with the given density function p(·) on IR . Compute
the KL-divergence and check for symmetry.
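As a numerical cross-check for the Gaussian shift case of Exercise 2.5.1, the following Python sketch (our own) integrates the KL formula and compares it with the closed form $K(\theta, \theta') = (\theta - \theta')^2/(2\sigma^2)$, which is indeed symmetric:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# KL-divergence for the Gaussian shift family, numerically vs. closed form.
sigma, theta, theta_prime = 1.5, 0.0, 1.0

def integrand(y):
    p = norm.pdf(y, loc=theta, scale=sigma)
    p_prime = norm.pdf(y, loc=theta_prime, scale=sigma)
    return np.log(p / p_prime) * p

K_numeric, _ = quad(integrand, -np.inf, np.inf)
K_closed = (theta - theta_prime) ** 2 / (2 * sigma ** 2)
print(K_numeric, K_closed)    # both approximately 0.2222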
One more important feature of the KL-divergence is its additivity.
(1)
(2)
Lemma 2.5.2. Let (Pθ , θ ∈ Θ) and (Pθ , θ ∈ Θ) be two parametric fam(1)
(2)
ilies with the same parameter set Θ , and let (Pθ = Pθ × Pθ , θ ∈ Θ) be
0
the product family. Then for any θ, θ ∈ Θ
(1)
(1)
(2)
(2)
K(Pθ , Pθ0 ) = K(Pθ , Pθ0 ) + K(Pθ , Pθ0 )
Exercise 2.5.3. Prove Lemma 2.5.2. Extend the result to the case of the
m -fold product of measures.
Hint: use that the log-density `(y1 , y2 , θ) of the product measure Pθ
fulfills `(y1 , y2 , θ) = `(1) (y1 , θ) + `(2) (y2 , θ) .
The additivity of the KL-divergence helps to easily compute the KL
quantity for two measures IPθ and IPθ0 describing the i.i.d. sample Y =
(Y1 , . . . , Yn )> . The log-density of the measure IPθ w.r.t. µ0 = µ⊗n
at the
0
point y = (y1 , . . . , yn )> is given by
X
L(y, θ) =
`(yi , θ).
An extension of the result of Lemma 2.5.2 yields
def
K(IPθ , IPθ0 ) = IEθ L(Y , θ) − L(Y , θ 0 ) = nK(θ, θ 0 ).
2.5.2 Hellinger distance
Another useful characteristic of a parametric family $(P_\theta)$ is the so-called Hellinger distance. For a fixed $\mu \in [0,1]$ and any $\theta, \theta' \in \Theta$, define
\[
h(\mu, P_\theta, P_{\theta'}) = E_\theta\Bigl( \frac{dP_{\theta'}}{dP_\theta}(Y) \Bigr)^{\mu}
  = \int \Bigl( \frac{p(y,\theta')}{p(y,\theta)} \Bigr)^{\mu} dP_\theta(y)
  = \int p^{\mu}(y,\theta')\, p^{1-\mu}(y,\theta)\, d\mu_0(y).
\]
Note that this function can be represented as an exponential moment of the log-likelihood ratio $\ell(Y, \theta, \theta') = \ell(Y, \theta) - \ell(Y, \theta')$:
\[
h(\mu, P_\theta, P_{\theta'}) = E_\theta \exp\bigl\{ \mu\,\ell(Y, \theta', \theta) \bigr\}
  = E_\theta\Bigl( \frac{dP_{\theta'}}{dP_\theta}(Y) \Bigr)^{\mu}.
\]
It is obvious that $h(\mu, P_\theta, P_{\theta'}) \ge 0$. Moreover, $h(\mu, P_\theta, P_{\theta'}) \le 1$. Indeed, the function $x^{\mu}$ for $\mu \in [0,1]$ is concave and, by the Jensen inequality,
\[
E_\theta\Bigl( \frac{dP_{\theta'}}{dP_\theta}(Y) \Bigr)^{\mu}
  \le \Bigl( I\!E_\theta \frac{dP_{\theta'}}{dP_\theta}(Y) \Bigr)^{\mu} = 1.
\]
Similarly to the Kullback-Leibler divergence, we often write $h(\mu, \theta, \theta')$ in place of $h(\mu, P_\theta, P_{\theta'})$.
Typically the Hellinger distance is considered for $\mu = 1/2$. Then
\[
h(1/2, \theta, \theta') = \int p^{1/2}(y,\theta')\, p^{1/2}(y,\theta)\, d\mu_0(y).
\]
In contrast to the Kullback-Leibler divergence, this quantity is symmetric and can be used to define a metric on the parameter set $\Theta$.
Introduce
\[
m(\mu, \theta, \theta') = -\log h(\mu, \theta, \theta') = -\log E_\theta \exp\bigl\{ \mu\,\ell(Y, \theta', \theta) \bigr\}.
\]
The property $h(\mu, \theta, \theta') \le 1$ implies $m(\mu, \theta, \theta') \ge 0$.
The rate function, similarly to the KL-divergence, is additive.
(1)
(2)
Lemma 2.5.3. Let (Pθ , θ ∈ Θ) and (Pθ , θ ∈ Θ) be two parametric fam(1)
(2)
ilies with the same parameter set Θ , and let (Pθ = Pθ × Pθ , θ ∈ Θ) be
0
the product family. Then for any θ, θ ∈ Θ and any µ ∈ [0, 1]
(1)
(1)
(2)
(2)
m(µ, Pθ , Pθ0 ) = m(Pθ , Pθ0 ) + m(Pθ , Pθ0 ).
Exercise 2.5.4. Prove Lemma 2.5.3. Extend the result to the case of a m fold product of measures.
Hint: use that the log-density `(y1 , y2 , θ) of the product measure Pθ
fulfills `(y1 , y2 , θ) = `(1) (y1 , θ) + `(2) (y2 , θ) .
Application of this lemma to the i.i.d. product family yields
def
M(µ, θ 0 , θ) = − log IEθ exp µL(Y , θ, θ 0 ) = n m(µ, θ 0 , θ).
2.5.3 Regularity and the Fisher Information. Univariate parameter
An important assumption on the considered parametric family $(P_\theta)$ is that the corresponding density function $p(y, \theta)$ is absolutely continuous w.r.t. the parameter $\theta$ for almost all $y$. Then the log-density $\ell(y, \theta)$ is differentiable as well, with
\[
\nabla\ell(y, \theta) \stackrel{\mathrm{def}}{=} \frac{\partial \ell(y, \theta)}{\partial\theta}
  = \frac{1}{p(y, \theta)}\frac{\partial p(y, \theta)}{\partial\theta},
\]
with the convention $\frac{1}{0}\log(0) = 0$. In the case of a univariate parameter $\theta \in \mathbb{R}$, we also write $\ell'(y, \theta)$ instead of $\nabla\ell(y, \theta)$.
Moreover, we usually assume some regularity conditions on the density $p(y, \theta)$. The next definition presents one possible set of such conditions for the case of a univariate parameter $\theta$.

Definition 2.5.1. The family $(P_\theta, \theta \in \Theta \subset \mathbb{R})$ is regular if the following conditions are fulfilled:
1. The sets $A(\theta) \stackrel{\mathrm{def}}{=} \{ y : p(y, \theta) = 0 \}$ are the same for all $\theta \in \Theta$.
2. Differentiability under the integration sign: for any function $s(y)$ satisfying
\[
\int s^2(y)\, p(y, \theta)\, d\mu_0(y) \le C, \qquad \theta \in \Theta,
\]
it holds
\[
\frac{\partial}{\partial\theta}\int s(y)\, dP_\theta(y)
  = \frac{\partial}{\partial\theta}\int s(y)\, p(y, \theta)\, d\mu_0(y)
  = \int s(y)\,\frac{\partial p(y, \theta)}{\partial\theta}\, d\mu_0(y).
\]
3. Finite Fisher information: the log-density $\ell(y, \theta)$ is differentiable in $\theta$ and its derivative is square integrable w.r.t. $P_\theta$:
\[
\int \bigl| \ell'(y, \theta) \bigr|^2 dP_\theta(y)
  = \int \frac{|p'(y, \theta)|^2}{p(y, \theta)}\, d\mu_0(y) < \infty.
\tag{2.8}
\]
The quantity in condition (2.8) plays an important role in asymptotic statistics. It is usually referred to as the Fisher information.

Definition 2.5.2. Let $(P_\theta, \theta \in \Theta \subset \mathbb{R})$ be a regular parametric family with a univariate parameter. Then the quantity
\[
\mathbb{F}(\theta) \stackrel{\mathrm{def}}{=} \int \bigl| \ell'(y, \theta) \bigr|^2 p(y, \theta)\, d\mu_0(y)
  = \int \frac{|p'(y, \theta)|^2}{p(y, \theta)}\, d\mu_0(y)
\]
is called the Fisher information of $(P_\theta)$ at $\theta \in \Theta$.

The definition of $\mathbb{F}(\theta)$ can be rewritten as
\[
\mathbb{F}(\theta) = I\!E_\theta\bigl| \ell'(Y, \theta) \bigr|^2
\]
with $Y \sim P_\theta$.
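A Monte Carlo sketch (our own) of this representation for the Gaussian shift family, where $\ell'(y, \theta) = (y - \theta)/\sigma^2$ and the closed form is $\mathbb{F}(\theta) = 1/\sigma^2$:

import numpy as np

# Fisher information as E_theta |l'(Y, theta)|^2, estimated by simulation.
rng = np.random.default_rng(6)
theta, sigma = 0.7, 2.0
Y = rng.normal(theta, sigma, size=1_000_000)

score = (Y - theta) / sigma ** 2
print(np.mean(score ** 2), 1 / sigma ** 2)   # both approximately 0.25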
A simple sufficient condition for regularity of a family (Pθ ) is given by
the next lemma.
Lemma 2.5.4. Let the log-density `(y, θ) = log p(y, θ) of a dominated family
(Pθ ) be differentiable in θ and let the Fisher information F(θ) be a continuous function on Θ . Then (Pθ ) is regular.
The proof is technical and can be found e.g. in Borovkov (1998). Some
useful properties of the regular families are listed in the next lemma.
Lemma 2.5.5. Let (Pθ ) be a regular family. Then for any θ ∈ Θ and Y ∼
Pθ
R
1.
Eθ `0 (Y, θ) = `0 (y, θ) p(y,R θ) dµ0 (y) = 0 and F(θ) = Varθ `0 (Y, θ) .
2.
F(θ) = −Eθ `00 (Y, θ) = − `00 (y, θ)p(y, θ)dµ0 (y).
R
R
Proof. Differentiating the identity p(y, θ)dµ0 (y) = exp{`(y, θ)}dµ0 (y) ≡
1 implies under the regularity conditions the first statement of the lemma.
Differentiating once more yields the second statement with another representation of the Fisher information.
Like the KL-divergence, the Fisher information possesses the important
additivity property.
(1)
(2)
Lemma 2.5.6. Let (Pθ , θ ∈ Θ) and (Pθ , θ ∈ Θ) be two parametric fami(1)
(2)
lies with the same parameter set Θ , and let (Pθ = Pθ × Pθ , θ ∈ Θ) be the
product family. Then for any θ ∈ Θ , the Fisher information F(θ) satisfies
F(θ) = F(1) (θ) + F(2) (θ)
(1)
where F(1) (θ) (resp. F(2) (θ) ) is the Fisher information for (Pθ ) (resp. for
(2)
(Pθ ) ).
Exercise 2.5.5. Prove Lemma 2.5.6.
Hint: use that the log-density of the product experiment can be represented
as `(y1 , y2 , θ) = `1 (y1 , θ) + `2 (y2 , θ) . The independence of Y1 and Y2 implies
F(θ) = Varθ `0 (Y1 , Y2 , θ) = Varθ `01 (Y1 , θ) + `02 (Y2 , θ)
= Varθ `01 (Y1 , θ) + Varθ `02 (Y2 , θ) .
Exercise 2.5.6. Compute the Fisher information for the Gaussian shift,
Bernoulli, Poisson, volatility and exponential families. Check in which cases
it is constant.
Exercise 2.5.7. Consider the shift experiment given by the equation Y =
θ+ε where ε is an error with the given density function p(·) on IR . Compute
the Fisher information and check whether it is constant.
Exercise 2.5.8. Check that the i.i.d. experiment from the uniform distribution on the interval [0, θ] with unknown θ is not regular.
Now we consider the properties of the i.i.d. experiment from a given regular
family (Pθ ) . The distribution of the whole i.i.d. sample Y is described by the
product measure IPθ = Pθ⊗n which is dominated by the measure µ0 = µ⊗n
0 .
The corresponding log-density L(y, θ) is given by
def
L(y, θ) = log
X
dIPθ
(y) =
`(yi , θ).
dµ0
The function exp L(y, θ) is the density of IPθ w.r.t. µ0 and hence, for any
r.v. ξ
IEθ ξ = IE0 ξ exp L(Y , θ) .
In particular, for ξ ≡ 1 , this formula leads to the indentity
Z
IE0 exp L(Y , θ) = exp L(y, θ) µ0 (dy) ≡ 1.
(2.9)
The next lemma claims that the product family (IPθ ) for an i.i.d. sample
from a regular family is also regular.
Lemma 2.5.7. Let (Pθ ) be a regular family and IPθ = Pθ⊗n . Then
Q
def
1. The set An = {y = (y1 , . . . , yn )> :
p(yi , θ) = 0} is the same for all
θ ∈ Θ.
2. For any r.v. S = S(Y ) with IEθ S 2 ≤ C , θ ∈ Θ , it holds
∂
∂
IEθ S =
IE0 S exp L(Y , θ) = IE0 SL0 (Y , θ) exp L(Y , θ) ,
∂θ
∂θ
def ∂
∂θ L(Y
where L0 (Y , θ) =
, θ) .
3. The derivative L0 (Y , θ) is square integrable and
2
IEθ L0 (Y , θ) = nF(θ).
Local properties of the Kullback-Leibler divergence and Hellinger distance
Here we show that the quantities introduced so far are closely related to each
other. We start with the Kullback-Leibler divergence.
Lemma 2.5.8. Let (Pθ ) be a regular family. Then the KL-divergence K(θ, θ0 )
satisfies:
K(θ, θ0 ) 0
= 0,
θ =θ
d
= 0,
K(θ, θ0 ) 0
0
dθ
θ =θ
d2
0 = F(θ).
K(θ,
θ
)
0
θ =θ
dθ0 2
In a small neighborhood of θ , the KL-divergence can be approximated by
K(θ, θ0 ) ≈ F(θ)|θ0 − θ|2 /2.
Similar properties can be established for the rate function m(µ, θ, θ0 ) .
Lemma 2.5.9. Let (Pθ ) be a regular family. Then the rate function m(µ, θ, θ0 )
satisfies:
m(µ, θ, θ0 )
= 0,
θ 0 =θ
d
m(µ, θ, θ0 )
= 0,
0
dθ
θ 0 =θ
d2
0 m(µ,
θ,
θ
)
= µ(1 − µ)F(θ).
0
θ =θ
dθ0 2
In a small neighborhood of θ , the rate function m(µ, θ, θ0 ) can be approximated by
m(µ, θ, θ0 ) ≈ µ(1 − µ)F(θ)|θ0 − θ|2 /2.
Moreover, for any θ, θ0 ∈ Θ
m(µ, θ, θ0 )
= 0,
µ=0
d
m(µ, θ, θ0 )
= Eθ `(Y, θ, θ0 ) = K(θ, θ0 ),
dµ
µ=0
d2
0 m(µ,
θ,
θ
)
= − Varθ `(Y, θ, θ0 ) .
2
dµ
µ=0
This implies an approximation for µ small
m(µ, θ, θ0 ) ≈ µK(θ, θ0 ) −
µ2
Varθ `(Y, θ, θ0 ) .
2
Exercise 2.5.9. Check the statements of Lemmas 2.5.8 and 2.5.9.
2.6 Cramér–Rao Inequality
Let $\tilde\theta$ be an estimate of the parameter $\theta^*$. We are interested in establishing a lower bound for the risk of this estimate. This bound indicates that under some conditions the quadratic risk of this estimate can never be below a specific value.

2.6.1 Univariate parameter

We again start with the univariate case and consider the case of an unbiased estimate $\tilde\theta$. Suppose that the family $(P_\theta, \theta \in \Theta)$ is dominated by a $\sigma$-finite measure $\mu_0$ on the real line and denote by $p(y, \theta)$ the density of $P_\theta$ w.r.t. $\mu_0$:
\[
p(y, \theta) \stackrel{\mathrm{def}}{=} \frac{dP_\theta}{d\mu_0}(y).
\]

Theorem 2.6.1 (Cramér–Rao inequality). Let $\tilde\theta = \tilde\theta(Y)$ be an unbiased estimate of $\theta$ for an i.i.d. sample from a regular family $(P_\theta)$. Then
\[
I\!E_\theta|\tilde\theta - \theta|^2 = \operatorname{Var}_\theta(\tilde\theta) \ge \frac{1}{n\mathbb{F}(\theta)},
\]
with equality iff $\tilde\theta - \theta = \{n\mathbb{F}(\theta)\}^{-1} L'(Y, \theta)$ almost surely. Moreover, if $\tilde\theta$ is not unbiased and $\tau(\theta) \stackrel{\mathrm{def}}{=} I\!E_\theta\tilde\theta$, then with $\tau'(\theta) = \frac{d}{d\theta}\tau(\theta)$ it holds
\[
\operatorname{Var}_\theta(\tilde\theta) \ge \frac{|\tau'(\theta)|^2}{n\mathbb{F}(\theta)}
\]
and
\[
I\!E_\theta|\tilde\theta - \theta|^2 = \operatorname{Var}_\theta(\tilde\theta) + |\tau(\theta) - \theta|^2
  \ge \frac{|\tau'(\theta)|^2}{n\mathbb{F}(\theta)} + |\tau(\theta) - \theta|^2.
\]
Proof. Consider first the case of an unbiased estimate θe with IEθ θe ≡ θ .
Differentiating the identity (2.9) IEθ exp L(Y , θ) ≡ 1 w.r.t. θ yields
Z
0≡
0
L (y, θ) exp L(y, θ) µ0 (dy) = IEθ L0 (Y , θ).
(2.10)
Similarly, the identity IEθ θe = θ implies
Z
0
0
e (Y , θ) exp L(Y , θ) µ (dy) = IEθ θL
e (Y , θ) .
1≡
θL
0
Together with (2.10), this gives
IEθ (θe − θ)L0 (Y , θ) ≡ 1.
(2.11)
Define h = {nF(θ)}−1 L0 (Y , θ) . Then IE hL0 (Y , θ) = 1 and (2.11) yields
IEθ (θe − θ − h)h ≡ 0.
Now
IEθ (θe − θ)2 = IEθ (θe − θ − h + h)2 = IEθ (θe − θ − h)2 + IEθ h2
−1 −1
= IEθ (θe − θ − h)2 + nF(θ)
≥ nF(θ)
with the equality iff θe − θ = {nF(θ)}−1 L0 (Y , θ) almost surely. This implies
the first assertion.
Now we consider the general case. The proof is similar. The property (2.10)
continues to hold. Next, the identity IEθ θe = θ is replaced with IEθ θe = τ (θ)
yielding
0
e (Y , θ) ≡ τ 0 (θ)
IEθ θL
and
IEθ {θe − τ (θ)}L0 (Y , θ) ≡ τ 0 (θ).
Again by the Cauchy-Schwartz inequality
0 2
τ (θ) = IEθ2 {θe − τ (θ)}L0 (Y , θ)
≤ IEθ {θe − τ (θ)}2 IEθ |L0 (Y , θ)|2
e nF(θ)
= Varθ (θ)
and the second assertion follows. The last statement is the usual decomposition of the quadratic risk into the squared bias and the variance of the
estimate.
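A small simulation sketch (our own) of the Cramér–Rao bound for the Poisson family, where $\mathbb{F}(\theta) = 1/\theta$ and the empirical mean attains the bound $\{n\mathbb{F}(\theta)\}^{-1} = \theta/n$:

import numpy as np

# The variance of the empirical mean matches the Cramér–Rao lower bound.
rng = np.random.default_rng(7)
theta, n, reps = 3.0, 25, 100_000
Y = rng.poisson(theta, size=(reps, n))

theta_tilde = Y.mean(axis=1)
print(theta_tilde.var())   # simulated variance of the estimate
print(theta / n)           # Cramér–Rao lower bound = 0.12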
2.6.2 Exponential families and R-efficiency
An interesting question is how sharp the Cramér–Rao lower bound is, and in particular when it becomes an equality. Indeed, if we restrict ourselves to unbiased estimates, no estimate can have quadratic risk smaller than $\{n\mathbb{F}(\theta)\}^{-1}$. If an estimate has exactly the risk $\{n\mathbb{F}(\theta)\}^{-1}$, then this estimate is automatically efficient in the sense that it is the best in the class in terms of the quadratic risk.

Definition 2.6.1. An unbiased estimate $\tilde\theta$ is R-efficient if
\[
\operatorname{Var}_\theta(\tilde\theta) = \bigl( n\mathbb{F}(\theta) \bigr)^{-1}.
\]

Theorem 2.6.2. An unbiased estimate $\tilde\theta$ is R-efficient if and only if
\[
\tilde\theta = n^{-1}\sum U(Y_i),
\]
where the function $U(\cdot)$ on $\mathbb{R}$ satisfies $\int U(y)\, dP_\theta(y) \equiv \theta$ and the log-density $\ell(y, \theta)$ of $P_\theta$ can be represented as
\[
\ell(y, \theta) = C(\theta)\, U(y) - B(\theta) + \ell(y),
\tag{2.12}
\]
for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $\mathbb{R}$.
Proof. Suppose first that the representation (2.12) for the log-density is correct. Then `0 (y, θ) = C 0 (θ)U (y) − B 0 (θ) and the identity Eθ `0 (y, θ) = 0
implies the relation between the functions B(·) and C(·) :
θC 0 (θ) = B 0 (θ).
(2.13)
Next, differentiating the equality
Z
Z
0 ≡ {U (y) − θ}dPθ (y) = {U (y) − θ}eL(y,θ) dµ0 (y)
w.r.t. θ implies in view of (2.13)
2
1 ≡ IEθ {U (Y ) − θ} × C 0 (θ)U (Y ) − B 0 (θ) = C 0 (θ)IEθ U (Y ) − θ .
This yields Varθ U (Y ) = 1/C 0 (θ) . This leads to the following representation
for the Fisher information:
F(θ) = Varθ `0 (Y, θ)
= Varθ {C 0 (θ)U (Y ) − B 0 (θ)}
2
= C 0 (θ) Varθ U (Y ) = C 0 (θ).
P
The estimate θe = n−1 U (Yi ) satisfies
IEθ θe = θ,
that is, it is unbiased. Moreover,
n1 X
o
1 X
1
1
Varθ θe = Varθ
U (Yi ) = 2
Var U (Yi ) =
=
n
n
nC 0 (θ)
nF(θ)
and θe is R-efficient.
Now we show a reverse statement. Due to the proof of the Cramér–Rao
inequality, the only possibility of getting the equality in this inequality is if
L0 (Y , θ) = nF(θ) (θe − θ).
This implies for some fixed θ0 and any θ◦
L(Y , θ◦ ) − L(Y , θ0 ) =
Z
θ◦
L0 (Y , θ)dθ
θ0
Z
θ
=
e
nF(θ)(θe − θ)dθ = n θC(θ)
− B(θ)
θ0
Rθ
F(θ)dθ and B(θ) = θ0 θF(θ)dθ . Applying this equality to a
e 1 ) , and
sample with n = 1 yields U (Y1 ) = θ(Y
with C(θ) =
Rθ
θ0
`(Y1 , θ) = `(Y1 , θ0 ) + C(θ)U (Y1 ) − B(θ).
The desired representation follows.
Exercise 2.6.1. Apply the Cramér–Rao
P inequality and check R-efficiency to
the empirical mean estimate θe = n−1 Yi for the Gaussian shift, Bernoulli,
Poisson, exponential and volatility families.
2.7 Cramér–Rao inequality. Multivariate parameter
This section extends the notions and results of the previous sections from the
case of a univariate parameter to the case of a multivariate parameter with
θ ∈ Θ ⊂ IRp .
2.7.1 Regularity and Fisher Information. Multivariate parameter
The definition of regularity naturally extends to the case of a multivariate
parameter θ = (θ1 , . . . , θp )> . It suffices to check the same conditions as in
the univariate case for every partial derivative ∂p(y, θ)/∂θj of the density
p(y, θ) for j = 1, . . . , p .
Definition 2.7.1. The family (Pθ , θ ∈ Θ ⊂ IRp ) is regular if the following
conditions are fulfilled:
def
1. The sets A(θ) = {y : p(y, θ) = 0} are the same for all θ ∈ Θ .
2. Differentiability under the integration sign: for any function s(y) satisfying
Z
s2 (y)p(y, θ)dµ0 (y) ≤ C,
θ∈Θ
it holds
Z
Z
Z
∂
∂p(y, θ)
∂
s(y)dPθ (y) =
s(y)p(y, θ)dµ0 (y) = s(y)
dµ0 (y).
∂θ
∂θ
∂θ
3. Finite Fisher information: the log-density function `(y, θ) is differentiable in θ and its derivative ∇`(y, θ) = ∂`(y, θ)/∂θ is square integrable
w.r.t. Pθ :
Z
∇`(y, θ)2 dPθ (y) =
Z
|∇p(y, θ)|2
dµ0 (y) < ∞.
p(y, θ)
In the case of a multivariate parameter, the notion of the Fisher information leads to the Fisher information matrix.
Definition 2.7.2. Let (Pθ , θ ∈ Θ ⊂ IRp ) be a parametric family. The matrix
def
Z
∇`(y, θ)∇> `(y, θ)p(y, θ)dµ0 (y)
Z
∇p(y, θ)∇> p(y, θ)
F(θ) =
=
1
dµ0 (y)
p(y, θ)
is called the Fisher information matrix of (Pθ ) at θ ∈ Θ .
This definition can be rewritten as
F(θ) = IEθ ∇`(Y1 , θ){∇`(Y1 , θ)}> .
The additivity property of the Fisher information extends to the multivariate
case as well.
Lemma 2.7.1. Let (Pθ , θ ∈ Θ) be a regular family. Then the n -fold product
family (IPθ ) with IPθ = Pθ⊗n is also regular. The Fisher information matrix
F(θ) satisfies
IEθ ∇L(Y , θ){∇L(Y , θ)}> = nF(θ).
(2.14)
Exercise 2.7.1. Compute the Fisher information matrix for the i.i.d. experiment Yi = θ + σεi with unknown θ and σ and εi i.i.d. standard normal.
2.7.2 Local properties of the Kullback-Leibler divergence and
Hellinger distance
The local relations between the Kullback-Leibler divergence, rate function and
Fisher information naturally extend to the case of a multivariate parameter.
We start with the Kullback-Leibler divergence.
Lemma 2.7.2. Let (Pθ ) be a regular family. Then the KL-divergence K(θ, θ 0 )
satisfies:
K(θ, θ 0 ) 0
= 0,
θ =θ
d
0 = 0,
0 K(θ, θ ) 0
θ =θ
dθ
d2
0 K(θ,
θ
)
= F(θ).
0
2
θ =θ
dθ 0
In a small neighborhood of θ , the KL-divergence can be approximated by
K(θ, θ 0 ) ≈ (θ 0 − θ)> F(θ) (θ 0 − θ)/2.
Similar properties can be established for the rate function m(µ, θ, θ 0 ) .
Lemma 2.7.3. Let (Pθ ) be a regular family. Then the rate function m(µ, θ, θ 0 )
satisfies:
m(µ, θ, θ 0 ) 0
= 0,
θ =θ
d
0 m(µ,
θ,
θ
)
= 0,
0
θ =θ
dθ 0
d2
0 m(µ,
θ,
θ
)
= µ(1 − µ)F(θ).
0
2
θ =θ
dθ 0
In a small neighborhood of θ , the rate function can be approximated by
m(µ, θ, θ 0 ) ≈ µ(1 − µ)(θ 0 − θ)> F(θ) (θ 0 − θ)/2.
Moreover, for any θ, θ 0 ∈ Θ
m(µ, θ, θ 0 )
= 0,
µ=0
d
= Eθ `(Y, θ, θ 0 ) = K(θ, θ 0 ),
m(µ, θ, θ 0 )
dµ
µ=0
d2
0 = − Varθ `(Y, θ, θ 0 ) .
m(µ,
θ,
θ
)
2
dµ
µ=0
This implies an approximation for µ small:
m(µ, θ, θ 0 ) ≈ µK(θ, θ 0 ) −
µ2
Varθ `(Y, θ, θ 0 ) .
2
Exercise 2.7.2. Check the statements of Lemmas 2.7.2 and 2.7.3.
2.7.3 Multivariate Cramér–Rao Inequality
Let $\tilde\theta = \tilde\theta(Y)$ be an estimate of the unknown parameter vector. This estimate is called unbiased if
\[
I\!E_\theta\tilde\theta \equiv \theta.
\]

Theorem 2.7.1 (Multivariate Cramér–Rao inequality). Let $\tilde\theta = \tilde\theta(Y)$ be an unbiased estimate of $\theta$ for an i.i.d. sample from a regular family $(P_\theta)$. Then
\[
\operatorname{Var}_\theta(\tilde\theta) \ge \bigl( n\mathbb{F}(\theta) \bigr)^{-1}, \qquad
I\!E_\theta\|\tilde\theta - \theta\|^2 = \operatorname{tr}\operatorname{Var}_\theta(\tilde\theta)
  \ge \operatorname{tr}\bigl( n\mathbb{F}(\theta) \bigr)^{-1}.
\]
Moreover, if $\tilde\theta$ is not unbiased and $\tau(\theta) = I\!E_\theta\tilde\theta$, then with $\nabla\tau(\theta) \stackrel{\mathrm{def}}{=} \frac{d}{d\theta}\tau(\theta)$ it holds
\[
\operatorname{Var}_\theta(\tilde\theta) \ge \nabla\tau(\theta)\,\bigl( n\mathbb{F}(\theta) \bigr)^{-1}\bigl( \nabla\tau(\theta) \bigr)^\top
\]
and
\[
I\!E_\theta\|\tilde\theta - \theta\|^2 = \operatorname{tr}\operatorname{Var}_\theta(\tilde\theta) + \|\tau(\theta) - \theta\|^2
  \ge \operatorname{tr}\Bigl( \nabla\tau(\theta)\,\bigl( n\mathbb{F}(\theta) \bigr)^{-1}\bigl( \nabla\tau(\theta) \bigr)^\top \Bigr) + \|\tau(\theta) - \theta\|^2.
\]
e with IEθ θ
e ≡ θ.
Proof. Consider first the case of an unbiased estimate θ
Differentiating the identity (2.9) IEθ exp L(Y , θ) ≡ 1 w.r.t. θ yields
Z
0 ≡ ∇L(y, θ) exp L(y, θ) µ0 (dy) = IEθ ∇L(Y , θ) ≡ 0. (2.15)
e = θ implies
Similarly, the identity IEθ θ
Z
>
e
e {∇L(Y , θ)}> .
I ≡ θ(y)
∇L(y, θ) exp L(y, θ) µ0 (dy) = IEθ θ
Together with (2.15), this gives
e − θ) {∇L(Y , θ)}> ≡ I.
IEθ (θ
(2.16)
Consider the random vector
def
h =
−1
nF(θ)
∇L(Y , θ).
By (2.15) IEθ h = 0 and by (2.14)
> Varθ (h) = IEθ hh> = n−2 IEθ I −1 (θ)∇L(Y , θ) I −1 (θ)∇L(Y , θ)
−1
= n−2 I −1 (θ)IEθ ∇L(Y , θ){∇L(Y , θ)}> I −1 (θ) = nF(θ)
.
and the identities (2.15) and (2.16) imply that
e − θ − h)h> = 0.
IEθ (θ
(2.17)
e − θ = 0 and IEθ (θ
e − θ)(θ
e − θ)> =
The “no bias” property yields IEθ θ
e . Finally by the orthogonality (2.17) and
Varθ (θ)
e = Varθ (h) + Var θ
e−θ−h
Varθ (θ)
−1
e−θ−h
= nF(θ)
+ Varθ θ
e is not smaller than nF(θ) −1 . Moreover, the equality
and the variance of θ
e − θ − h is equal to zero almost surely.
is only possible if θ
Now we consider the general case. The proof is similar. The property (2.15)
e = θ is replaced with IEθ θ
e = τ (θ)
continues to hold. Next, the identity IEθ θ
yielding
e {∇L(Y , θ)}> ≡ ∇τ (θ)
IEθ θ
and
IEθ
e − τ (θ) ∇L(Y , θ) > ≡ ∇τ (θ).
θ
Define
−1
def
h = ∇τ (θ) nF(θ)
∇L(Y , θ).
Then similarly to the above
−1 >
IEθ hh> = ∇τ (θ) nF(θ)
∇τ (θ) ,
e − θ − h)h> = 0,
IEθ (θ
and the second assertion follows. The statements about the quadratic risk
follow from its usual decomposition into squared bias and the variance of the
estimate.
2.7.4 Exponential families and R-efficiency
The notion of R-efficiency naturally extends to the case of a multivariate parameter.

Definition 2.7.3. An unbiased estimate $\tilde\theta$ is R-efficient if
\[
\operatorname{Var}_\theta(\tilde\theta) = \bigl( n\mathbb{F}(\theta) \bigr)^{-1}.
\]

Theorem 2.7.2. An unbiased estimate $\tilde\theta$ is R-efficient if and only if
\[
\tilde\theta = n^{-1}\sum U(Y_i),
\]
where the vector function $U(\cdot)$ on $\mathbb{R}$ satisfies $\int U(y)\, dP_\theta(y) \equiv \theta$ and the log-density $\ell(y, \theta)$ of $P_\theta$ can be represented as
\[
\ell(y, \theta) = C(\theta)^\top U(y) - B(\theta) + \ell(y),
\tag{2.18}
\]
for some functions $C(\cdot)$ and $B(\cdot)$ on $\Theta$ and a function $\ell(\cdot)$ on $\mathbb{R}$.
Proof. Suppose first that the representation (2.18) for the log-density is correct. Denote by C 0 (θ) the p × p Jacobi matrix of the vector function C :
def
d
C(θ) . Then ∇`(y, θ) = C 0 (θ)U (y) − ∇B(θ) and the identity
C 0 (θ) = dθ
Eθ ∇`(y, θ) = 0 implies the relation between the functions B(·) and C(·) :
C 0 (θ) θ = ∇B(θ).
Next, differentiating the equality
(2.19)
Z
0≡
U (y) − θ dPθ (y) =
Z
[U (y) − θ]eL(y,θ) dµ0 (y)
w.r.t. θ implies in view of (2.19)
>
I ≡ IEθ {U (Y ) − θ} C 0 (θ)U (Y ) − ∇B(θ)
= C 0 (θ)IEθ {U (Y ) − θ} {U (Y ) − θ}> .
This yields Varθ U (Y ) = [C 0 (θ)]−1 . This leads to the following representation for the Fisher information matrix:
F(θ) = Varθ ∇`(Y, θ) = Varθ [C 0 (θ)U (Y ) − ∇B(θ)]
2
= C 0 (θ) Varθ U (Y ) = C(θ).
e = n−1 P U (Yi ) satisfies
The estimate θ
e = θ,
IEθ θ
that is, it is unbiased. Moreover,
X
e = Varθ 1
U (Yi )
Varθ θ
n
−1
1 X
1 0 −1 = 2
Var U (Yi ) =
C (θ)
= nF(θ)
n
n
e is R-efficient.
and θ
As in the univariate case, one can show that equality in the Cramér–Rao
e − θ are linearly dependent. This
bound is only possible if ∇L(Y , θ) and θ
leads again to the exponential family structure of the likelihood function.
Exercise 2.7.3. Complete the proof of the Theorem 2.7.2.
2.8 Maximum likelihood and other estimation methods
This section presents some other popular methods of estimating the unknown
parameter including minimum distance and M-estimation, maximum likelihood procedure, etc.
2.8.1 Minimum distance estimation
Let $\rho(P, P')$ denote some functional (distance) defined for measures $P, P'$ on the real line. We assume that $\rho$ satisfies the following conditions: $\rho(P_{\theta_1}, P_{\theta_2}) \ge 0$ and $\rho(P_{\theta_1}, P_{\theta_2}) = 0$ iff $\theta_1 = \theta_2$. This implies for every $\theta^* \in \Theta$ that
\[
\operatorname*{argmin}_{\theta \in \Theta} \rho(P_\theta, P_{\theta^*}) = \theta^*.
\]
The Glivenko-Cantelli theorem states that $P_n$ converges weakly to the true distribution $P_{\theta^*}$. Therefore, it is natural to define an estimate $\tilde\theta$ of $\theta^*$ by replacing in this formula the true measure $P_{\theta^*}$ by its empirical counterpart $P_n$, that is, by minimizing the distance $\rho$ between the measures $P_\theta$ and $P_n$ over the set $(P_\theta)$. This leads to the minimum distance estimate
\[
\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta} \rho(P_\theta, P_n).
\]
2.8.2 M -estimation and Maximum likelihood estimation
Another general method of building an estimate of $\theta^*$, the so-called M-estimation, is defined via a contrast function $\psi(y, \theta)$ given for every $y \in \mathbb{R}$ and $\theta \in \Theta$. The principal condition on $\psi$ is that the integral $I\!E_\theta\,\psi(Y_1, \theta')$ is minimized at $\theta' = \theta$:
\[
\theta = \operatorname*{argmin}_{\theta'}\int \psi(y, \theta')\, dP_\theta(y), \qquad \theta \in \Theta.
\tag{2.20}
\]
In particular,
\[
\theta^* = \operatorname*{argmin}_{\theta \in \Theta}\int \psi(y, \theta)\, dP_{\theta^*}(y),
\]
and the M-estimate is again obtained by substitution, that is, by replacing the true measure $P_{\theta^*}$ with its empirical counterpart $P_n$:
\[
\tilde\theta = \operatorname*{argmin}_{\theta \in \Theta}\int \psi(y, \theta)\, dP_n(y)
  = \operatorname*{argmin}_{\theta \in \Theta}\frac{1}{n}\sum \psi(Y_i, \theta).
\]
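In practice the empirical contrast is minimized numerically. A minimal Python sketch (our own; the Huber-type contrast is an illustrative choice, not from the text):

import numpy as np
from scipy.optimize import minimize_scalar

# Generic M-estimation: minimize n^{-1} sum psi(Y_i, theta) over theta.
def psi(y, theta, c=1.345):
    # Huber-type contrast: quadratic near zero, linear in the tails (our choice)
    r = np.abs(y - theta)
    return np.where(r <= c, 0.5 * r ** 2, c * (r - 0.5 * c))

rng = np.random.default_rng(8)
Y = rng.standard_t(df=3, size=500) + 1.0    # heavy-tailed sample centered at 1

result = minimize_scalar(lambda theta: np.mean(psi(Y, theta)))
print(result.x)    # M-estimate of the center, close to 1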
Exercise 2.8.1. Let Y beRan i.i.d. sample from P ∈ (Pθ , θ ∈ Θ ⊂ IR) .
(i) Let also g(y) satisfy g(y)dPθ (y) ≡ θ , leading to the moment estimate
θe = n−1
X
g(Yi ).
Show that this estimate can be obtained as the M-estimate for a properly
selected function
ψ(·) .
R
(ii) Let
g(y)dPθ (y) ≡ m(θ) for the given functions g(·) and m(·)
whereas m(·) is monotonous.
Show that the moment estimate θe = m−1 (Mn )
P
with Mn = n−1 g(Yi ) can be obtained as the M-estimate for a properly
selected function ψ(·) .
We mention three prominent examples of the contrast function ψ and
the resulting estimates: least squares, least absolute deviation and maximum
likelihood.
Least squares estimation
The least squares estimate (LSE) corresponds to the quadratic contrast
ψ(y, θ) = kg(y) − θk2 ,
where g(y) is a p -dimensional function of the observation y satisfying
Z
g(y)dPθ (y) ≡ θ,
θ ∈ Θ.
Then the true parameter θ ∗ fulfills the relation
Z
∗
θ = argmin kg(y) − θk2 dPθ∗ (y)
θ∈Θ
because
Z
kg(y) − θk2 dPθ∗ (y) = kθ ∗ − θk2 +
Z
kg(y) − θ ∗ k2 dPθ∗ (y).
∗
e
The substitution method leads to the estimate
by minimizaR θ of θ defined
tion of the empirical version of the integral kg(y) − θk2 dPθ∗ (y) :
e def
θ
= argmin
Z
kg(y) − θk2 dPn (y) = argmin
θ∈Θ
X
kg(Yi ) − θk2 .
θ∈Θ
This is again a quadratic optimization problem having a closed form solution
called least squares or ordinary least squares estimate.
Lemma 2.8.1. It holds
e = argmin
θ
θ∈Θ
X
kg(Yi ) − θk2 =
1X
g(Yi ).
n
e coincides with the moment estimate based
One can see that the LSE θ
R
on the function g(·) . Indeed, the equality g(y)dPθ∗ (y) = θ ∗ leads directly
e = n−1 P g(Yi ) .
to the LSE θ
Least absolute deviation (median) estimation
The next example of an M-estimate is given by the absolute deviation contrast
fit. For simplicity of presentation, we consider here only the case of a univariate
def
parameter. The contrast function ψ(y, θ) is given by ψ(y, θ) = |y − θ| . The
solution of the related optimization problem (2.20) is given by the median
med(Pθ ) of the distribution Pθ .
Definition 2.8.1. The value t is called the median of a distribution function
F if
F (t) ≥ 1/2,
F (t−) < 1/2.
If F (·) is a continuous function then the median t = med(F ) satisfies F (t) =
1/2 .
Theorem 2.8.1. For any cdf F , the median med(F ) satisfies
Z
Z
inf
|y − θ| dF (y) = |y − med(F )| dF (y).
θ∈IR
Proof. Consider for simplicity the case of a continuous distribution function
F . One has |y − θ| = (θ − y)1(y < θ) + (y − θ)1(y ≥ θ) . Differentiating
w.r.t.
R
θ yields the following equation for any extreme point of |y − θ| dF (y) :
Z
θ
−
Z
∞
dF (y) +
−∞
dF (y) = 0.
θ
The median is the only solution of this equation.
Let the family $(P_\theta)$ be such that $\theta = \operatorname{med}(P_\theta)$ for all $\theta \in \mathbb{R}$. Then the M-estimation approach leads to the least absolute deviation (LAD) estimate
\[
\tilde\theta \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\theta \in \mathbb{R}}\int |y - \theta|\, dF_n(y)
  = \operatorname*{argmin}_{\theta \in \mathbb{R}}\sum |Y_i - \theta|.
\]
Due to Theorem 2.8.1, the solution of this problem is given by the median of the edf $F_n$.
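A short Python sketch (our own) illustrating that the LAD contrast is minimized at the sample median:

import numpy as np

# The empirical contrast sum |Y_i - theta| over a grid of theta values
# is minimized (up to grid resolution) at the sample median.
rng = np.random.default_rng(9)
Y = rng.laplace(loc=2.0, scale=1.0, size=501)

theta_grid = np.linspace(Y.min(), Y.max(), 2001)
contrast = np.abs(Y[:, None] - theta_grid[None, :]).sum(axis=0)

theta_lad = theta_grid[contrast.argmin()]
print(theta_lad, np.median(Y))   # grid minimizer approximately equals the sample median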
Maximum likelihood estimation
Let now $\psi(y, \theta) = -\ell(y, \theta) = -\log p(y, \theta)$, where $p(y, \theta)$ is the density of the measure $P_\theta$ at $y$ w.r.t. some dominating measure $\mu_0$. This choice leads to the maximum likelihood estimate (MLE):
\[
\tilde\theta = \operatorname*{argmax}_{\theta \in \Theta}\frac{1}{n}\sum \log p(Y_i, \theta).
\]
The condition (2.20) is fulfilled because
\[
\operatorname*{argmin}_{\theta'}\int \psi(y, \theta')\, dP_\theta(y)
  = \operatorname*{argmin}_{\theta'}\int \bigl( \psi(y, \theta') - \psi(y, \theta) \bigr)\, dP_\theta(y)
  = \operatorname*{argmin}_{\theta'}\int \log\frac{p(y, \theta)}{p(y, \theta')}\, dP_\theta(y)
  = \operatorname*{argmin}_{\theta'} K(\theta, \theta') = \theta.
\]
Here we used that the Kullback-Leibler divergence K(θ, θ 0 ) attains its minimum equal to zero at the point θ 0 = θ which in turn follows from the
concavity of the log-function by the Jensen inequality.
Note that the definition of the MLE does not depend on the choice of the
dominating measure µ0 .
e does not change if another dominatExercise 2.8.2. Show that the MLE θ
ing measure is used.
Computing an M -estimate orP
MLE leads to solving an optimization problem for the empirical quantity
ψ(Yi , θ) w.r.t. the parameter θ . If the
function ψ is differentiable w.r.t. θ then the solution can be found from the
estimating equation
∂ X
ψ(Yi , θ) = 0.
∂θ
Exercise 2.8.3. Show that any M -estimate and particularly the MLE can
be represented as minimum distance estimate with a properly defined distance
ρ.
R
Hint: define ρ(Pθ , Pθ∗ ) as
ψ(y, θ) − ψ(y, θ ∗ ) dPθ∗ (y) .
e is defined by maximizing the expression L(θ) =
Recall that the MLE θ
def
`(Yi , θ) w.r.t. θ . Below we use the notation L(θ, θ 0 ) = L(θ) − L(θ 0 ) ,
often called the log-likelihood ratio.
e
In our study we will
P focus on the value of the maximum L(θ) =
maxθ L(θ) . Let L(θ) = `(Yi , θ) be the likelihood function. The value
P
def
e = max L(θ)
L(θ)
θ
e
is called the maximum log-likelihood or fitted log-likelihood. The excess L(θ)−
∗
L(θ ) is the difference between the maximum of the likelihood function L(θ)
over θ and its particular value at the true parameter θ ∗ :
def
e θ ∗ ) = max L(θ) − L(θ ∗ ).
L(θ,
θ
e and
The next section collects some examples of computing the MLE θ
the corresponding maximum log-likelihood.
2.9 Maximum Likelihood for some parametric families
The examples of this section focus on the structure of the log-likelihood and
e and the maximum log-likelihood L(θ)
e .
the corresponding MLE θ
2.9.1 Gaussian shift
Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and known variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads as
\[
p(y, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl\{ -\frac{(y-\theta)^2}{2\sigma^2} \Bigr\}.
\]
The log-likelihood $L(\theta)$ is
\[
L(\theta) = \sum \log p(Y_i, \theta)
  = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (Y_i - \theta)^2.
\]
The corresponding normal equation $L'(\theta) = 0$ yields
\[
-\frac{1}{\sigma^2}\sum (Y_i - \theta) = 0,
\tag{2.21}
\]
leading to the empirical mean solution $\tilde\theta = n^{-1}\sum Y_i$.
The computation of the fitted likelihood is a bit more involved.

Theorem 2.9.1. Let $Y_i = \theta^* + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$. For any $\theta$
\[
L(\tilde\theta, \theta) = n\sigma^{-2}(\tilde\theta - \theta)^2/2.
\tag{2.22}
\]
Moreover,
\[
L(\tilde\theta, \theta^*) = n\sigma^{-2}(\tilde\theta - \theta^*)^2/2 = \xi^2/2,
\]
where $\xi$ is a standard normal r.v., so that $2L(\tilde\theta, \theta^*)$ has the fixed $\chi^2_1$ distribution with one degree of freedom. If $z_\alpha$ is the quantile of $\chi^2_1/2$ with $P(\xi^2/2 > z_\alpha) = \alpha$, then
\[
E(z_\alpha) = \bigl\{ u : L(\tilde\theta, u) \le z_\alpha \bigr\}
\tag{2.23}
\]
is an $\alpha$-confidence set: $I\!P_{\theta^*}\bigl( E(z_\alpha) \not\ni \theta^* \bigr) = \alpha$.
For every $r > 0$,
\[
I\!E_{\theta^*}\bigl| 2L(\tilde\theta, \theta^*) \bigr|^{r} = c_r,
\]
where $c_r = E|\xi|^{2r}$ with $\xi \sim N(0,1)$.
def
e θ) = L(θ)
e − L(θ) as a function of the paramProof (Proof 1). Consider L(θ,
eter θ . Obviously
e θ) = −
L(θ,
1 X
e 2 − (Yi − θ)2 ,
(Yi − θ)
2σ 2
e θ) is a quadratic function of θ . Next, it holds L(θ,
e θ) e = 0
so that L(θ,
θ=θ
d
e θ) e = − d L(θ) e = 0 due to the normal equation (2.21).
L(θ,
and dθ
dθ
θ=θ
θ=θ
Finally,
2
d2
e θ) e = − d L(θ) e = n/σ 2 .
L(
θ,
2
2
θ=
θ
θ=θ
dθ
dθ
e θ) at θ = θe :
This implies by the Taylor expansion of a quadratic function L(θ,
e θ) =
L(θ,
n e
(θ − θ)2 .
2σ 2
Proof 2. First observe that for any two points θ0 , θ , the log-likelihood ratio
L(θ0 , θ) = log(dIPθ0 /dIPθ ) = L(θ0 ) − L(θ) can be represented in the form
L(θ0 , θ) = L(θ0 ) − L(θ) = σ −2 (S − nθ)(θ0 − θ) − nσ −2 (θ0 − θ)2 /2.
Substituting the MLE θe = S/n in place of θ0 implies
e θ) = nσ −2 (θe − θ)2 /2.
L(θ,
e θ∗ ) .
Now we consider the second statement about the distribution of L(θ,
∗
∗
The substitution θ = θ in (2.22) and the model equation Yi = θ + εi imply
θe − θ∗ = n−1/2 σξ , where
def
ξ =
1 X
√
εi
σ n
is standard normal. Therefore,
e θ∗ ) = ξ 2 /2.
L(θ,
This easily implies the result of the theorem.
e θ∗ ) is χ2 distributed with one
We see that under IPθ∗ the variable 2L(θ,
1
degree of freedom, and this distribution does not depend on the sample size
n and the scale parameter σ . This fact is known in a more general form as
chi-squared theorem.
Exercise 2.9.1. Check that the confidence sets
def
E ◦ (zα ) = [θe − n−1/2 σzα , θe + n−1/2 σzα ],
where zα is defined by 2Φ(−zα ) = α , and E(zα ) from (2.23) coincide.
Exercise 2.9.2. Compute the constant cr from Theorem 2.9.1 for r =
0.5, 1, 1.5, 2 .
Already now we point out an interesting feature of the fitted log-likelihood
e θ∗ ) . It can be viewed as the normalized squared loss of the estimate θe
L(θ,
e θ∗ ) = nσ −2 |θe− θ∗ |2 . The last statement of Theorem 2.9.1 yields
because L(θ,
that
IEθ∗ |θe − θ∗ |2r = cr σ 2r n−r .
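The distributional statement of Theorem 2.9.1 is easy to check by simulation; a Python sketch (our own, with arbitrary $n$ and $\sigma$):

import numpy as np
from scipy.stats import chi2

# For the Gaussian shift, 2 L(theta_tilde, theta*) = n sigma^{-2} (theta_tilde - theta*)^2
# is chi^2_1 distributed, whatever n and sigma are.
rng = np.random.default_rng(10)
theta_star, sigma, n, reps = 1.0, 3.0, 40, 200_000
Y = rng.normal(theta_star, sigma, size=(reps, n))

theta_tilde = Y.mean(axis=1)
two_L = n * (theta_tilde - theta_star) ** 2 / sigma ** 2

# compare empirical quantiles with chi^2_1 quantiles
for q in (0.5, 0.9, 0.99):
    print(np.quantile(two_L, q), chi2.ppf(q, df=1))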
2.9.2 Variance estimation for the normal law
Let $Y_i$ be i.i.d. normal with mean zero and unknown variance $\theta^*$:
\[
Y_i \sim N(0, \theta^*), \qquad \theta^* \in \mathbb{R}_{+}.
\]
The log-likelihood function reads as
\[
L(\theta) = \sum \log p(Y_i, \theta) = -\frac{n}{2}\log(2\pi\theta) - \frac{1}{2\theta}\sum Y_i^2.
\]
The normal equation $L'(\theta) = 0$ yields
\[
L'(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum Y_i^2 = 0,
\]
leading to
\[
\tilde\theta = \frac{1}{n}S_n
\]
with $S_n = \sum Y_i^2$. Moreover, for any $\theta$
\[
L(\tilde\theta, \theta) = -\frac{n}{2}\log(\tilde\theta/\theta) - \frac{S_n}{2}\bigl( 1/\tilde\theta - 1/\theta \bigr) = n K(\tilde\theta, \theta),
\]
where
\[
K(\theta, \theta') = -\frac{1}{2}\Bigl( \log\frac{\theta}{\theta'} + 1 - \frac{\theta}{\theta'} \Bigr)
\]
is the Kullback-Leibler divergence between the two Gaussian measures $N(0, \theta)$ and $N(0, \theta')$.
2.9.3 Univariate normal distribution
Let Yi be as in previous example N{α, σ 2 } but neither the mean α nor the
variance σ 2 are known. This leads to estimating the vector θ = (θ1 , θ2 ) =
(α, σ 2 ) from the i.i.d. sample Y .
The maximum likelihood approach leads to maximizing the log-likelihood
w.r.t. the vector θ = (α, σ 2 )> :
L(θ) =
X
log p(Yi , θ) = −
n
1 X
log(2πθ2 ) −
(Yi − θ1 )2 .
2
2θ2
Exercise 2.9.3. Check that the ML approach leads to the same estimates
(2.2) as the method of moments.
2.9.4 Uniform distribution on [0, θ]
Let Yi be uniformly distributed on the interval [0, θ∗ ] of the real line where
the right end point θ∗ is unknown. The density p(y, θ) of Pθ w.r.t. the
Lebesgue measure is θ−1 1(y ≤ θ) . The likelihood reads as
Z(θ) = θ−n 1(max Yi ≤ θ).
i
This density is positive iff θ ≥ maxi Yi and it is maximized exactly for θe =
maxi Yi . One can see that the MLE θe is the limiting case of the moment
estimate θek as k grows to infinity.
2.9.5 Bernoulli or binomial model
Let $P_\theta$ be a Bernoulli law for $\theta \in [0,1]$. The density of $Y_i$ under $P_\theta$ can be written as
\[
p(y, \theta) = \theta^{y}(1-\theta)^{1-y}.
\]
The corresponding log-likelihood reads as
\[
L(\theta) = \sum \bigl( Y_i\log\theta + (1 - Y_i)\log(1-\theta) \bigr)
  = S_n\log\frac{\theta}{1-\theta} + n\log(1-\theta)
\]
with $S_n = \sum Y_i$. Maximizing this expression w.r.t. $\theta$ results again in the empirical mean
\[
\tilde\theta = S_n/n.
\]
This implies
\[
L(\tilde\theta, \theta) = n\tilde\theta\log\frac{\tilde\theta}{\theta} + n(1 - \tilde\theta)\log\frac{1-\tilde\theta}{1-\theta} = n K(\tilde\theta, \theta),
\]
where $K(\theta, \theta') = \theta\log(\theta/\theta') + (1-\theta)\log\bigl\{ (1-\theta)/(1-\theta') \bigr\}$ is the Kullback-Leibler divergence for the Bernoulli law.
2.9.6 Multinomial model
The multinomial distribution Bθm describes the number of successes in m
experiments when one success has the probability θ ∈ [0, 1] . This distribution
can be viewed as the sum of m binomials with the same parameter θ .
One has
m k
Pθ (Y1 = k) =
θ (1 − θ)m−k ,
k = 0, . . . , m.
k
Exercise 2.9.4. Check that the ML approach leads to the estimate
1 X
Yi .
θe =
mn
e θ) .
Compute L(θ,
2.9.7 Exponential model
Let $Y_1, \ldots, Y_n$ be i.i.d. exponential random variables with parameter $\theta^* > 0$. This means that the $Y_i$ are nonnegative and satisfy $I\!P(Y_i > t) = e^{-t/\theta^*}$. The density of the exponential law w.r.t. the Lebesgue measure is $p(y, \theta^*) = e^{-y/\theta^*}/\theta^*$. The corresponding log-likelihood can be written as
\[
L(\theta) = -n\log\theta - \sum_{i=1}^{n} Y_i/\theta = -S/\theta - n\log\theta,
\]
where $S = Y_1 + \ldots + Y_n$.
The ML estimating equation yields $S/\theta^2 = n/\theta$, or
\[
\tilde\theta = S/n.
\]
For the fitted log-likelihood $L(\tilde\theta, \theta)$ this gives
\[
L(\tilde\theta, \theta) = -n\bigl( 1 - \tilde\theta/\theta \bigr) - n\log(\tilde\theta/\theta) = n K(\tilde\theta, \theta).
\]
Here once again $K(\theta, \theta') = \theta/\theta' - 1 - \log(\theta/\theta')$ is the Kullback-Leibler divergence for the exponential law.
2.9.8 Poisson model
Let Y1, . . . , Yn be i.i.d. Poisson random variables satisfying IP(Yi = m) = (θ∗)^m e^{−θ∗}/m! for m = 0, 1, 2, . . . . The corresponding log-likelihood can be written as

L(θ) = Σ_{i=1}^{n} log( θ^{Yi} e^{−θ}/Yi! ) = log θ Σ_{i=1}^{n} Yi − nθ − Σ_{i=1}^{n} log(Yi!) = S log θ − nθ + R,

where S = Y1 + . . . + Yn and R = −Σ_{i=1}^{n} log(Yi!). Here we use the convention 0! = 1.
The ML estimating equation immediately yields S/θ = n or

θ̃ = S/n.

For the fitted log-likelihood L(θ̃, θ) this gives

L(θ̃, θ) = n θ̃ log(θ̃/θ) − n(θ̃ − θ) = nK(θ̃, θ).

Here again K(θ, θ′) = θ log(θ/θ′) − (θ − θ′) is the Kullback-Leibler divergence for the Poisson law.
2.9.9 Shift of a Laplace (double exponential) law
Let P0 be the symmetric distribution defined by the equations

P0(|Y1| > y) = e^{−y/σ},   y ≥ 0,

for some given σ > 0. Equivalently one can say that the absolute value of Y1 is exponential with parameter σ under P0. Now define Pθ by shifting P0 by the value θ. This means that

Pθ(|Y1 − θ| > y) = e^{−y/σ},   y ≥ 0.
The density of Y1 − θ under Pθ is p(y) = (2σ)^{−1} e^{−|y|/σ}. The maximum likelihood approach leads to maximizing the sum

L(θ) = −n log(2σ) − Σ |Yi − θ|/σ,

or equivalently to minimizing the sum Σ |Yi − θ|:

θ̃ = argmin_θ Σ |Yi − θ|.    (2.24)

This is just the least absolute deviation estimate given by the median of the edf:

θ̃ = med(Fn).
Exercise 2.9.5. Show that the median solves the problem (2.24).
Hint: suppose that n is odd. Consider the ordered observations Y(1) ≤
Y(2) ≤ . . . ≤ Y(n) . Show that the median of Pn is given by Y((n+1)/2) . Show
that this point solves (2.24).
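The claim of the exercise is easy to check numerically: the sum of absolute deviations, evaluated on a fine grid, is minimized at (or very near) the sample median. The sketch below uses an illustrative Laplace sample of odd size.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 51                                        # odd sample size, as in the hint
Y = rng.laplace(loc=1.0, scale=0.5, size=n)   # illustrative Laplace (double exponential) sample

def sad(theta):
    # criterion of (2.24): sum of absolute deviations
    return np.sum(np.abs(Y - theta))

grid = np.linspace(Y.min(), Y.max(), 2001)
theta_grid = grid[np.argmin([sad(t) for t in grid])]
print(np.median(Y), theta_grid)               # both values agree up to the grid resolution
```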
2.10 Quasi Maximum Likelihood approach
Let Y = (Y1, . . . , Yn)⊤ be a sample from a marginal distribution P. Let also (Pθ, θ ∈ Θ) be a given parametric family with the log-likelihood ℓ(y, θ). The parametric approach is based on the assumption that the underlying distribution P belongs to this family. The quasi maximum likelihood method applies the maximum likelihood approach for the family (Pθ) even if the underlying distribution P does not belong to this family. This leads again to the estimate θ̃ that maximizes the expression L(θ) = Σ ℓ(Yi, θ) and is called the quasi MLE. It might happen that the true distribution belongs to some other parametric family for which one can also construct the MLE. However, there could be serious reasons for applying the quasi maximum likelihood approach even in this misspecified case. One of them is that the properties of the estimate θ̃ are essentially determined by the geometric structure of the log-likelihood. The use of a parametric family with a nice geometric structure (e.g. quadratic or convex in the parameter) can seriously simplify the algorithmic burden and improve the behavior of the method.
2.10.1 LSE as quasi likelihood estimation
Consider the model

Yi = θ∗ + εi    (2.25)

where θ∗ is the parameter of interest from IR and εi are random errors satisfying IEεi = 0. The assumption that εi are i.i.d. normal N(0, σ²) leads to the quasi log-likelihood

L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ (Yi − θ)².
Maximizing the expression L(θ) leads to minimizing the sum of squared residuals (Yi − θ)²:

θ̃ = argmin_θ Σ (Yi − θ)² = (1/n) Σ Yi.

This estimate is called a least squares estimate (LSE) or ordinary least squares estimate (oLSE).
Example 2.10.1. Consider the model (2.25) with heterogeneous errors, that is, εi are independent normal with zero mean and variances σi². The corresponding log-likelihood reads

L°(θ) = −(1/2) Σ { log(2πσi²) + (Yi − θ)²/σi² }.
The MLE θ̃° is

θ̃° := argmax_θ L°(θ) = N^{−1} Σ Yi/σi²,   N = Σ σi^{−2}.

We now compare the estimates θ̃ and θ̃°.
Lemma 2.10.1. The following assertions hold for the estimate θ̃:
1. θ̃ is unbiased: IEθ∗ θ̃ = θ∗.
2. The quadratic risk of θ̃ is equal to the variance Var(θ̃) given by

   R(θ̃, θ∗) := IEθ∗ |θ̃ − θ∗|² = Var(θ̃) = n^{−2} Σ σi².

3. θ̃ is not R-efficient unless all σi² are equal.
Now we consider the MLE θ̃°.

Lemma 2.10.2. The following assertions hold for the estimate θ̃°:
1. θ̃° is unbiased: IEθ∗ θ̃° = θ∗.
2. The quadratic risk of θ̃° is equal to the variance Var(θ̃°) given by

   R(θ̃°, θ∗) := IEθ∗ |θ̃° − θ∗|² = Var(θ̃°) = N^{−2} Σ σi^{−2} = N^{−1}.

3. θ̃° is R-efficient.
Exercise 2.10.1. Check the statements of Lemmas 2.10.1 and 2.10.2.
Hint: compute the Fisher information for the model (2.25) using the property of additivity:

F(θ) = Σ F_{(i)}(θ) = Σ σi^{−2} = N,

where F_{(i)}(θ) is the Fisher information in the marginal model Yi = θ + εi with just one observation Yi. Apply the Cramér–Rao inequality for one observation of the vector Y.
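A small Monte Carlo experiment illustrates the two lemmas: the plain mean and the precision-weighted mean are both unbiased, but their variances match n^{−2} Σ σi² and N^{−1}, respectively. The noise levels and sample size below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta_star = 20, 1.0                           # illustrative values
sigma = rng.uniform(0.5, 3.0, n)                  # heterogeneous noise levels sigma_i
w = sigma**-2
N = w.sum()

n_mc = 20000
plain = np.empty(n_mc)
weighted = np.empty(n_mc)
for k in range(n_mc):
    Y = theta_star + sigma * rng.standard_normal(n)
    plain[k] = Y.mean()                           # theta_tilde: ignores the heterogeneity
    weighted[k] = np.sum(w * Y) / N               # theta_tilde_o: precision-weighted mean

print(plain.mean(), weighted.mean())              # both close to theta_star (unbiasedness)
print(plain.var(), (sigma**2).sum() / n**2)       # Var(theta_tilde) = n^{-2} sum sigma_i^2
print(weighted.var(), 1.0 / N)                    # Var(theta_tilde_o) = N^{-1} (Cramer-Rao bound)
```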
2.10.2 LAD and robust estimation as quasi likelihood estimation
Consider again the model (2.25). The classical least squares approach faces
serious problems if the available data Y are contaminated with outliers. The
reasons for contamination could be missing data or typing errors, etc. Unfortunately, even a single outlier can significantly disturb the sum L(θ) and
thus, the estimate θ̃. A typical approach proposed and developed by Huber is to apply another “influence function” ψ(Yi − θ) in the sum L(θ) in place of the squared residual |Yi − θ|², leading to the M-estimate

θ̃ = argmin_θ Σ ψ(Yi − θ).    (2.26)
A popular ψ-function for robust estimation is the absolute value |Yi − θ|. The resulting estimate

θ̃ = argmin_θ Σ |Yi − θ|

is called least absolute deviation and the solution is the median of the empirical distribution Pn. Another proposal is called the Huber function: it is quadratic in a vicinity of zero and linear outside:

ψ(x) = x² if |x| ≤ t,   ψ(x) = a|x| + b otherwise.
Exercise 2.10.2. Show that for each t > 0 , the coefficients a = a(t) and b =
b(t) can be selected to provide that ψ(x) and its derivative are continuous.
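One admissible choice (consistent with the exercise, though not stated explicitly in the text) is a(t) = 2t and b(t) = −t²: then ψ(t) = t² from both sides and ψ′(t) = 2t from both sides. The sketch below checks this continuity numerically for an illustrative threshold t.

```python
import numpy as np

def huber_psi(x, t):
    # quadratic on [-t, t], linear outside; a = 2t, b = -t^2 (assumed choice) glue the two pieces
    x = np.asarray(x, dtype=float)
    a, b = 2 * t, -t**2
    return np.where(np.abs(x) <= t, x**2, a * np.abs(x) + b)

t, eps = 1.345, 1e-8                                 # illustrative threshold
print(huber_psi(t - eps, t), huber_psi(t + eps, t))  # values match at x = t
left = (huber_psi(t, t) - huber_psi(t - eps, t)) / eps
right = (huber_psi(t + eps, t) - huber_psi(t, t)) / eps
print(left, right)                                   # numerical derivatives match: both close to 2t
```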
A remarkable fact about this approach is that every such estimate can be viewed as a quasi MLE for the model (2.25). Indeed, for a given function ψ, define the measure Pθ with the log-density ℓ(y, θ) = −ψ(y − θ). Then the log-likelihood is L(θ) = −Σ ψ(Yi − θ) and the corresponding (quasi) MLE coincides with (2.26).
Exercise 2.10.3. Suggest a σ-finite measure μ such that exp{−ψ(y − θ)} is the density of Yi for the model (2.25) w.r.t. the measure μ.
Hint: suppose for simplicity that

Cψ := ∫ exp{−ψ(x)} dx < ∞.

Show that Cψ^{−1} exp{−ψ(y − θ)} is a density w.r.t. the Lebesgue measure for any θ.
Exercise 2.10.4. Show that the LAD θ̃ = argmin_θ Σ |Yi − θ| is the quasi MLE for the model (2.25) when the errors εi are assumed Laplacian (double exponential) with density p(x) = (1/2) e^{−|x|}.
2.11 Univariate exponential families
Most parametric families considered in the previous sections are particular cases of exponential family (EF) distributions. This includes the Gaussian shift, Bernoulli, Poisson, exponential, and volatility models. The notion of an EF
already appeared in the context of the Cramér–Rao inequality. Now we study
such families in further detail.
We say that P is an exponential family if all measures Pθ ∈ P are dominated by a σ-finite measure μ0 on Y and the density functions p(y, θ) = dPθ/dμ0(y) are of the form

p(y, θ) := (dPθ/dμ0)(y) = p(y) e^{yC(θ) − B(θ)}.

Here C(θ) and B(θ) are some given nondecreasing functions on Θ and p(y) is a nonnegative function on Y.
Usually one assumes some regularity conditions on the family P . One
possibility was already given when we discussed the Cramér–Rao inequality;
see Definition 2.5.1. Below we assume that that condition is always fulfilled.
It basically means that we can differentiate w.r.t. θ under the integral sign.
For an EF, the log-likelihood admits an especially simple representation, nearly linear in y:

ℓ(y, θ) := log p(y, θ) = y C(θ) − B(θ) + log p(y),

so that the log-likelihood ratio for θ, θ′ ∈ Θ reads as

ℓ(y, θ, θ′) := ℓ(y, θ) − ℓ(y, θ′) = y{ C(θ) − C(θ′) } − { B(θ) − B(θ′) }.
2.11.1 Natural parametrization
Let P = (Pθ) be an EF. By Y we denote one observation from the distribution Pθ ∈ P. In addition to the regularity conditions, one often assumes the natural parametrization for the family P, which means the relation Eθ Y = θ. Note that this relation is fulfilled for all the examples of EF's that we considered so far in the previous section. It is obvious that the natural parametrization is only possible if the following identifiability condition is fulfilled: for any two different measures from the considered parametric family, the corresponding mean values are different. Under this condition the natural parametrization is always possible: just define θ as the expectation of Y. Below we use the abbreviation EFn for an exponential family with natural parametrization.
Some properties of an EFn
The natural parametrization implies an important property for the functions
B(θ) and C(θ) .
Lemma 2.11.1. Let (Pθ) be a naturally parameterized EF. Then

B′(θ) = θ C′(θ).
Proof. Differentiating both sides of the equation ∫ p(y, θ) μ0(dy) = 1 w.r.t. θ yields

0 = ∫ { y C′(θ) − B′(θ) } p(y, θ) μ0(dy)
  = ∫ { y C′(θ) − B′(θ) } Pθ(dy)
  = θ C′(θ) − B′(θ)

and the result follows.
The next lemma computes the important characteristics of a natural EF such as the Kullback-Leibler divergence K(θ, θ′) = Eθ log{ p(Y, θ)/p(Y, θ′) }, the Fisher information F(θ) := Eθ |ℓ′(Y, θ)|², and the rate function m(μ, θ, θ′) := − log Eθ exp{ μ ℓ(Y, θ, θ′) }.
Lemma 2.11.2. Let (Pθ) be an EFn. Then with θ, θ′ ∈ Θ fixed, it holds for
• the Kullback-Leibler divergence K(θ, θ′) = Eθ log{ p(Y, θ)/p(Y, θ′) }:

  K(θ, θ′) = ∫ log{ p(y, θ)/p(y, θ′) } Pθ(dy)
           = { C(θ) − C(θ′) } ∫ y Pθ(dy) − { B(θ) − B(θ′) }
           = θ{ C(θ) − C(θ′) } − { B(θ) − B(θ′) };    (2.27)

• the Fisher information F(θ) := Eθ |ℓ′(Y, θ)|²:

  F(θ) = C′(θ);

• the rate function m(μ, θ, θ′) = − log Eθ exp{ μ ℓ(Y, θ, θ′) }:

  m(μ, θ, θ′) = K( θ, θ + μ(θ′ − θ) );

• the variance Varθ(Y):

  Varθ(Y) = 1/F(θ) = 1/C′(θ).    (2.28)
Proof. Differentiating the equality

0 ≡ ∫ (y − θ) Pθ(dy) = ∫ (y − θ) e^{ℓ(y,θ)} μ0(dy)

w.r.t. θ implies in view of Lemma 2.11.1

1 ≡ IEθ (Y − θ){ C′(θ) Y − B′(θ) } = C′(θ) IEθ (Y − θ)².

This yields Varθ(Y) = 1/C′(θ). This leads to the following representation of the Fisher information:

F(θ) = Varθ( ℓ′(Y, θ) ) = Varθ[ C′(θ) Y − B′(θ) ] = { C′(θ) }² Varθ(Y) = C′(θ).
Exercise 2.11.1. Check the equations for the Kullback-Leibler divergence
and Fisher information from Lemma 2.11.2.
MLE and maximum likelihood for an EFn
Now we discuss the maximum likelihood estimation for a sample from an EFn.
The log-likelihood can be represented in the form

L(θ) = Σ_{i=1}^{n} log p(Yi, θ) = C(θ) Σ_{i=1}^{n} Yi − B(θ) Σ_{i=1}^{n} 1 + Σ_{i=1}^{n} log p(Yi)    (2.29)
     = S C(θ) − n B(θ) + R,

where

S = Σ_{i=1}^{n} Yi,   R = Σ_{i=1}^{n} log p(Yi).
The remainder term R is unimportant because it does not depend on θ
and thus it does not enter in the likelihood ratio. The maximum likelihood estimate θ̃ is defined by maximizing L(θ) w.r.t. θ, that is,

θ̃ = argmax_{θ ∈ Θ} L(θ) = argmax_{θ ∈ Θ} { S C(θ) − n B(θ) }.
In the case of an EF with the natural parametrization, this optimization
problem admits a closed form solution given by the next theorem.
Theorem 2.11.1. Let (Pθ) be an EFn. Then the MLE θ̃ fulfills

θ̃ = S/n = n^{−1} Σ_{i=1}^{n} Yi.

It holds

IEθ θ̃ = θ,   Varθ(θ̃) = [n F(θ)]^{−1} = [n C′(θ)]^{−1},

so that θ̃ is R-efficient. Moreover, the fitted log-likelihood L(θ̃, θ) := L(θ̃) − L(θ) satisfies for any θ ∈ Θ:

L(θ̃, θ) = nK(θ̃, θ).    (2.30)

Proof. Maximization of L(θ) w.r.t. θ leads to the estimating equation nB′(θ) − SC′(θ) = 0. This and the identity B′(θ) = θC′(θ) yield the MLE

θ̃ = S/n.

The variance Varθ(θ̃) is computed using (2.28) from Lemma 2.11.2. The formula (2.27) for the Kullback-Leibler divergence and (2.29) yield the representation (2.30) for the fitted log-likelihood L(θ̃, θ) for any θ ∈ Θ.
One can see that the estimate θ̃ is the mean of the Yi's. As for the
Gaussian shift model, this estimate can be motivated by the fact that the
expectation of every observation Yi under Pθ is just θ and by the law of
large numbers the empirical mean converges to its expectation as the sample
size n grows.
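Theorem 2.11.1 can be illustrated for the Poisson family, which is an EFn with C(θ) = log θ and B(θ) = θ. The sketch below (illustrative parameter values) checks unbiasedness, the variance formula [nC′(θ)]^{−1} = θ/n by Monte Carlo, and the identity (2.30) on a single sample.

```python
import numpy as np

rng = np.random.default_rng(5)
n, theta_star = 100, 2.5                       # illustrative values

def K(t, t1):
    # Poisson Kullback-Leibler divergence: K(t, t') = t log(t/t') - (t - t')
    return t * np.log(t / t1) - (t - t1)

n_mc = 20000
est = rng.poisson(theta_star, (n_mc, n)).mean(axis=1)   # MLE theta_tilde = S/n in each run
print(est.mean(), theta_star)                  # unbiasedness
print(est.var(), theta_star / n)               # Var = [n C'(theta)]^{-1} = theta/n

Y = rng.poisson(theta_star, n)

def L(t):
    # S log(theta) - n theta (the remainder R does not depend on theta and is dropped)
    return np.sum(Y) * np.log(t) - n * t

tt, th = Y.mean(), 2.0
print(L(tt) - L(th), n * K(tt, th))            # fitted log-likelihood equals n K(theta_tilde, theta)
```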
2.11.2 Canonical parametrization
Another useful representation of an EF is given by the so-called canonical
parametrization. We say that υ is the canonical parameter for this EF if the
density of each measure Pυ w.r.t. the dominating measure µ0 is of the form:
p(y, υ) := (dPυ/dμ0)(y) = p(y) exp{ yυ − d(υ) }.

Here d(υ) is a given convex function on Θ and p(y) is a nonnegative function on Y. The abbreviation EFc will indicate an EF with the canonical parametrization.
Some properties of an EFc
The next relation is an obvious corollary of the definition:
Lemma 2.11.3. An EFn (Pθ) always permits a unique canonical representation. The canonical parameter υ is related to the natural parameter θ by υ = C(θ), d(υ) = B(θ), and θ = d′(υ).
Proof. The first two relations follow from the definition. They imply B′(θ) = d′(υ) · dυ/dθ = d′(υ) · C′(θ), and the last statement follows from B′(θ) = θC′(θ).

The log-likelihood ratio ℓ(y, υ, υ1) for an EFc reads as

ℓ(Y, υ, υ1) = Y(υ − υ1) − d(υ) + d(υ1).
The next lemma collects some useful facts about an EFc.

Lemma 2.11.4. Let P = (Pυ, υ ∈ U) be an EFc and let the function d(·) be two times continuously differentiable. Then it holds for any υ, υ1 ∈ U:
(i). The mean Eυ Y and the variance Varυ(Y) fulfill

  Eυ Y = d′(υ),   Varυ(Y) = Eυ(Y − Eυ Y)² = d″(υ).

(ii). The Fisher information F(υ) := Eυ |ℓ′(Y, υ)|² satisfies

  F(υ) = d″(υ).

(iii). The Kullback-Leibler divergence Kc(υ, υ1) = Eυ ℓ(Y, υ, υ1) satisfies

  Kc(υ, υ1) = ∫ log{ p(y, υ)/p(y, υ1) } Pυ(dy) = d′(υ)(υ − υ1) − { d(υ) − d(υ1) } = d″(ῠ)(υ1 − υ)²/2,

  where ῠ is a point between υ and υ1. Moreover, for υ ≤ υ1 ∈ U,

  Kc(υ, υ1) = ∫_{υ}^{υ1} (υ1 − u) d″(u) du.

(iv). The rate function m(μ, υ1, υ) := − log IEυ exp{ μ ℓ(Y, υ1, υ) } fulfills

  m(μ, υ1, υ) = μ Kc(υ, υ1) − Kc( υ, υ + μ(υ1 − υ) ).
Proof. Differentiating the equation ∫ p(y, υ) μ0(dy) = 1 w.r.t. υ yields

∫ { y − d′(υ) } p(y, υ) μ0(dy) = 0,

that is, Eυ Y = d′(υ). The expression for the variance can be proved by one more differentiation of this equation. Similarly one can check (ii). The item (iii) can be checked by simple algebra and (iv) follows from (i).
Further, for any υ, υ1 ∈ U, it holds

ℓ(Y, υ1, υ) − Eυ ℓ(Y, υ1, υ) = (υ1 − υ){ Y − d′(υ) },

and with u = μ(υ1 − υ)

log Eυ exp{ u(Y − d′(υ)) } = −u d′(υ) + d(υ + u) − d(υ) + log Eυ exp{ uY − d(υ + u) + d(υ) }
                           = d(υ + u) − d(υ) − u d′(υ) = Kc(υ, υ + u),

because

Eυ exp{ uY − d(υ + u) + d(υ) } = Eυ [ dPυ+u/dPυ ] = 1,

and (iv) follows by (iii).
Table 2.1 presents the canonical parameter and the Fisher information for
the examples of exponential families from Section 2.9.
Table 2.1. υ(θ), d(υ), F(υ) = d″(υ), and θ = θ(υ) for the examples from Section 2.9.

Model                 υ(θ)             d(υ)               F(υ) = d″(υ)        θ(υ)
Gaussian regression   θ/σ²             υ²σ²/2             σ²                  σ²υ
Bernoulli model       log{θ/(1 − θ)}   log(1 + e^υ)       e^υ/(1 + e^υ)²      e^υ/(1 + e^υ)
Poisson model         log θ            e^υ                e^υ                 e^υ
Exponential model     1/θ              −log υ             1/υ²                1/υ
Volatility model      −1/(2θ)          −(1/2) log(−2υ)    1/(2υ²)             −1/(2υ)
Exercise 2.11.2. Check (iii) and (iv) in Lemma 2.11.4.
Exercise 2.11.3. Check the entries of Table 2.1.
Exercise 2.11.4. Check that Kc(υ, υ′) = K( θ(υ), θ(υ′) ).
Exercise 2.11.5. Plot Kc (υ ∗ , υ) as a function of υ for the families from
Table 2.1.
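For the Poisson family the claim of Exercise 2.11.4 can be checked directly: with d(υ) = e^υ and θ(υ) = e^υ from Table 2.1, the canonical divergence d′(υ)(υ − υ1) − {d(υ) − d(υ1)} coincides with the Poisson divergence θ log(θ/θ′) − (θ − θ′). A minimal numerical check with illustrative parameter values:

```python
import numpy as np

# Poisson family in the canonical parametrization (Table 2.1): d(u) = exp(u), theta(u) = exp(u)
def K_canonical(u, u1):
    # Kc(u, u1) = d'(u)(u - u1) - (d(u) - d(u1)), cf. Lemma 2.11.4 (iii)
    return np.exp(u) * (u - u1) - (np.exp(u) - np.exp(u1))

def K_natural(t, t1):
    # Poisson Kullback-Leibler divergence in the natural parameter
    return t * np.log(t / t1) - (t - t1)

u, u1 = 0.3, 1.1                                              # illustrative canonical parameters
print(K_canonical(u, u1), K_natural(np.exp(u), np.exp(u1)))   # the two values coincide
```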
Maximum likelihood estimation for an EFc
The structure of the log-likelihood in the case of the canonical parametrization
is particularly simple:
L(υ) = Σ_{i=1}^{n} log p(Yi, υ) = υ Σ_{i=1}^{n} Yi − d(υ) Σ_{i=1}^{n} 1 + Σ_{i=1}^{n} log p(Yi)
     = Sυ − n d(υ) + R,

where

S = Σ_{i=1}^{n} Yi,   R = Σ_{i=1}^{n} log p(Yi).
Again, as in the case of an EFn, we can ignore the remainder term R. The estimating equation dL(υ)/dυ = 0 for the maximum likelihood estimate υ̃ reads as

d′(υ) = S/n.
This and the relation θ = d′(υ) lead to the following result.

Theorem 2.11.2. The maximum likelihood estimates θ̃ and υ̃ for the natural and canonical parametrization are related by the equations

θ̃ = d′(υ̃),   υ̃ = C(θ̃).
The next result describes the structure of the fitted log-likelihood and
basically repeats the result of Theorem 2.11.1.
Theorem 2.11.3. Let (Pυ) be an EF with canonical parametrization. Then for any υ ∈ U the fitted log-likelihood L(υ̃, υ) := max_{υ′} L(υ′, υ) satisfies

L(υ̃, υ) = nKc(υ̃, υ).
Exercise 2.11.6. Check the statement of Theorem 2.11.3.
2.11.3 Deviation probabilities for the maximum likelihood
Let Y1 , . . . , Yn be i.i.d. observations from an EF P . This section presents a
probability bound for the fitted likelihood. To be more specific we assume that
P is canonically parameterized, P = (Pυ ) . However, the bound applies to the
natural and any other parametrization because the value of the maximum of the
likelihood process L(θ) does not depend on the choice of parametrization.
The log-likelihood ratio L(υ′, υ) is given by the expression (2.29) and its maximum over υ′ leads to the fitted log-likelihood L(υ̃, υ) = nKc(υ̃, υ).

Our first result concerns a deviation bound for L(υ̃, υ). It utilizes the
representation for the fitted log-likelihood given by Theorem 2.11.1. As usual,
we assume that the family P is regular. In addition, we require the following
condition.
(Pc) P = (Pυ, υ ∈ U ⊆ IR) is a regular EF. The parameter set U is convex. The function d(υ) is two times continuously differentiable and the Fisher information F(υ) = d″(υ) satisfies F(υ) > 0 for all υ.

The condition (Pc) implies that for any compact set U0 there is a constant a = a(U0) > 0 such that

| F(υ1)/F(υ2) |^{1/2} ≤ a,   υ1, υ2 ∈ U0.
Theorem 2.11.4. Let Yi be i.i.d. from a distribution Pυ∗ which belongs to an EFc satisfying (Pc). For any z > 0

IPυ∗( L(υ̃, υ∗) > z ) = IPυ∗( nKc(υ̃, υ∗) > z ) ≤ 2e^{−z}.
Proof. The proof is based on two properties of the log-likelihood. The first one is that the expectation of the likelihood ratio is just one: IEυ∗ exp{ L(υ, υ∗) } = 1. This and the exponential Markov inequality imply for z ≥ 0

IPυ∗( L(υ, υ∗) ≥ z ) ≤ e^{−z}.    (2.31)
The second property is specific to the considered univariate EF and is based
on geometric properties of the log-likelihood function: linearity in the observations Yi and convexity in the parameter υ . We formulate this important
fact in a separate statement.
Lemma 2.11.5. Let the EFc P fulfill (Pc). For given z and any υ0 ∈ U, there exist two values υ⁺ > υ0 and υ⁻ < υ0 satisfying Kc(υ±, υ0) = z/n such that

{ L(υ̃, υ0) > z } ⊆ { L(υ⁺, υ0) > z } ∪ { L(υ⁻, υ0) > z }.
Proof. It holds

{ L(υ̃, υ0) > z } = { sup_υ [ S(υ − υ0) − n( d(υ) − d(υ0) ) ] > z }
                 ⊆ { S > inf_{υ > υ0} [ z + n( d(υ) − d(υ0) ) ]/(υ − υ0) } ∪ { −S > inf_{υ < υ0} [ z + n( d(υ) − d(υ0) ) ]/(υ0 − υ) }.
Define for every u > 0

f(u) = [ z + n( d(υ0 + u) − d(υ0) ) ] / u.
This function attains its minimum at a point u satisfying the equation

z/n + d(υ0 + u) − d(υ0) − d′(υ0 + u) u = 0,

or, equivalently,

K(υ0 + u, υ0) = z/n.

The condition (Pc) provides that there is only one solution u ≥ 0 of this equation.
Exercise 2.11.7. Check that the equation K(υ0 + u, υ0 ) = z/n has only one
positive solution for any z > 0 .
Hint: use that K(υ0 + u, υ0 ) is a convex function of u with minimum at
u = 0.
Now, it holds with υ⁺ = υ0 + u

{ S > inf_{υ > υ0} [ z + n( d(υ) − d(υ0) ) ]/(υ − υ0) } = { S > [ z + n( d(υ⁺) − d(υ0) ) ]/(υ⁺ − υ0) } ⊆ { L(υ⁺, υ0) > z }.

Similarly

{ −S > inf_{υ < υ0} [ z + n( d(υ) − d(υ0) ) ]/(υ0 − υ) } = { −S > [ z + n( d(υ⁻) − d(υ0) ) ]/(υ0 − υ⁻) } ⊆ { L(υ⁻, υ0) > z }

for some υ⁻ < υ0.
The assertion of the theorem is now easy to obtain. Indeed,

IPυ∗( L(υ̃, υ∗) ≥ z ) ≤ IPυ∗( L(υ⁺, υ∗) ≥ z ) + IPυ∗( L(υ⁻, υ∗) ≥ z ) ≤ 2e^{−z},

yielding the result.
Exercise 2.11.8. Let (Pυ) be a Gaussian shift experiment, that is, Pυ = N(υ, 1).
• Check that L(υ̃, υ) = n|υ̃ − υ|²/2;
• Given z ≥ 0, find the points υ⁺ and υ⁻ such that

  { L(υ̃, υ∗) > z } ⊆ { L(υ⁺, υ∗) > z } ∪ { L(υ⁻, υ∗) > z }.

• Plot the mentioned sets {υ : L(υ̃, υ) > z}, {υ : L(υ⁺, υ) > z}, and {υ : L(υ⁻, υ) > z} as functions of υ for a fixed S = Σ Yi.
Remark 2.11.1. Note that the mentioned result only utilizes the geometric structure of the univariate EFc. The most important feature of the log-likelihood ratio L(υ, υ∗) = S(υ − υ∗) − n{ d(υ) − d(υ∗) } is its linearity w.r.t. the stochastic term S. This allows us to replace the maximum over the whole set U by the maximum over the set consisting of the two points υ±. Note that the proof does not rely on the distribution of the observations Yi. In particular, Lemma 2.11.5 continues to hold even within the quasi likelihood approach when L(υ) is not the true log-likelihood. However, the bound (2.31) relies on the nature of L(υ, υ∗). Namely, it utilizes that IE exp{ L(υ±, υ∗) } = 1, which is true under IP = IPυ∗ but generally false in the quasi likelihood setup. Nevertheless, the exponential bound can be extended to the quasi likelihood approach under the condition of bounded exponential moments for L(υ, υ∗): for some μ > 0, it should hold IE exp{ μL(υ, υ∗) } = C(μ) < ∞.
Theorem 2.11.4 yields a simple construction of a confidence interval for the parameter υ∗ and the concentration property of the MLE υ̃.
Theorem 2.11.5. Let Yi be i.i.d. from Pυ∗ ∈ P with P satisfying (Pc).
1. If zα satisfies e^{−zα} ≤ α/2, then

   E(zα) = { υ : nKc(υ̃, υ) ≤ zα }

   is an α-confidence set for the parameter υ∗.
2. Define for any z > 0 the set A(z, υ∗) = { υ : Kc(υ, υ∗) ≤ z/n }. Then

   IPυ∗( υ̃ ∉ A(z, υ∗) ) ≤ 2e^{−z}.
The second assertion of the theorem claims that the estimate υ̃ belongs with a high probability to the vicinity A(z, υ∗) of the central point υ∗ defined by the Kullback-Leibler divergence. Due to Lemma 2.11.4, (iii), Kc(υ, υ∗) ≈ F(υ∗)(υ − υ∗)²/2, where F(υ∗) is the Fisher information at υ∗. This vicinity is an interval around υ∗ of length of order n^{−1/2}. In other words, this result implies the root-n consistency of υ̃.
The deviation bound for the fitted log-likelihood from Theorem 2.11.4 can be viewed as a bound for the normalized loss of the estimate υ̃. Indeed, define the loss function ℘(υ′, υ) = K^{1/2}(υ′, υ). Then Theorem 2.11.4 yields that the loss is with high probability bounded by √(z/n) provided that z is sufficiently large. Similarly one can establish the bound for the risk.
Theorem 2.11.6. Let Yi be i.i.d. from the distribution Pυ∗ which belongs
to a canonically parameterized EF satisfying (P c) . The following properties
hold:
(i). For any r > 0 there is a constant r_r such that

  IEυ∗ L^r(υ̃, υ∗) = n^r IEυ∗ K^r(υ̃, υ∗) ≤ r_r.

(ii). For every λ < 1

  IEυ∗ exp{ λL(υ̃, υ∗) } = IEυ∗ exp{ λnK(υ̃, υ∗) } ≤ (1 + λ)/(1 − λ).
Proof. By Theorem 2.11.4

IEυ∗ L^r(υ̃, υ∗) = −∫_{z ≥ 0} z^r dIPυ∗( L(υ̃, υ∗) > z )
               = r ∫_{z ≥ 0} z^{r−1} IPυ∗( L(υ̃, υ∗) > z ) dz
               ≤ r ∫_{z ≥ 0} 2 z^{r−1} e^{−z} dz,

and the first assertion is fulfilled with r_r = 2r ∫_{z ≥ 0} z^{r−1} e^{−z} dz. The assertion (ii) is proved similarly.
Deviation bound for other parameterizations
The results for the maximum likelihood and their corollaries have been stated
for an EFc. An immediate question that arises in this respect is whether the
use of the canonical parametrization is essential. The answer is “no”: a similar
result can be stated for any EF whatever the parametrization is used. This
fact is based on the simple observation that the maximum likelihood is the
value of the maximum of the likelihood process; this value does not depend
on the parametrization.
Lemma 2.11.6. Let (Pθ) be an EF. Then for any θ

L(θ̃, θ) = nK(Pθ̃, Pθ).    (2.32)
Exercise 2.11.9. Check the result of Lemma 2.11.6.
Hint: use that both sides of (2.32) depend only on the measures Pθ̃, Pθ and not
on the parametrization.
Below we write as before K(θ̃, θ) instead of K(Pθ̃, Pθ). The property (2.32) and the exponential bound of Theorem 2.11.4 imply the bound for a general EF:
Theorem 2.11.7. Let (Pθ) be a univariate EF. Then for any z > 0

IPθ∗( L(θ̃, θ∗) > z ) = IPθ∗( nK(θ̃, θ∗) > z ) ≤ 2e^{−z}.
This result allows us to build confidence sets for the parameter θ∗ and concentration sets for the MLE θ̃ in terms of the Kullback-Leibler divergence:

A(z, θ∗) = { θ : K(θ, θ∗) ≤ z/n },   E(z) = { θ : K(θ̃, θ) ≤ z/n }.
Corollary 2.11.1. Let (Pθ) be an EF. If e^{−zα} = α/2, then

IPθ∗( θ̃ ∉ A(zα, θ∗) ) ≤ α   and   IPθ∗( E(zα) ∌ θ∗ ) ≤ α.

Moreover, for any r > 0

IEθ∗ L^r(θ̃, θ∗) = n^r IEθ∗ K^r(θ̃, θ∗) ≤ r_r.
Asymptotic against likelihood-based approach
The asymptotic approach recommends to apply symmetric confidence and concentration sets with width of order [nF(θ∗)]^{−1/2}:

An(z, θ∗) = { θ : F(θ∗)(θ − θ∗)² ≤ 2z/n },
En(z) = { θ : F(θ∗)(θ − θ̃)² ≤ 2z/n },
E′n(z) = { θ : F(θ̃)(θ − θ̃)² ≤ 2z/n }.

Then asymptotically, i.e. for large n, these sets do approximately the same job as the non-asymptotic sets A(z, θ∗) and E(z). However, the difference for finite samples can be quite significant. In particular, in some cases, e.g. the Bernoulli or Poisson families, the sets An(z, θ∗) and E′n(z) may extend beyond the parameter set Θ.
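The following minimal sketch illustrates this effect for the Bernoulli family with a small sample and an estimate close to the boundary (the numbers are purely illustrative): the likelihood-based set E(z) stays inside [0, 1], while the Fisher-based interval E′n(z) has a negative left end point.

```python
import numpy as np

def K(t, t1):
    # Bernoulli Kullback-Leibler divergence
    t = np.clip(t, 1e-12, 1 - 1e-12)
    return t * np.log(t / t1) + (1 - t) * np.log((1 - t) / (1 - t1))

n, S = 20, 1                          # illustrative: one success out of 20 trials
theta_tilde = S / n
alpha = 0.05
z = np.log(2 / alpha)                 # so that e^{-z} = alpha/2

# likelihood-based set E(z) = {theta : n K(theta_tilde, theta) <= z}, found on a grid
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
E = grid[n * K(theta_tilde, grid) <= z]
print("KL-based set:   [%.4f, %.4f]" % (E.min(), E.max()))

# asymptotic set E'_n(z) = {theta : F(theta_tilde)(theta - theta_tilde)^2 <= 2z/n}
F_tilde = 1.0 / (theta_tilde * (1 - theta_tilde))
half = np.sqrt(2 * z / (n * F_tilde))
print("asymptotic set: [%.4f, %.4f]" % (theta_tilde - half, theta_tilde + half))
# the left end point is negative, i.e. the interval leaves the parameter set [0, 1]
```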
2.12 Historical remarks and further reading
The main part of the chapter is inspired by the nice textbook Borovkov (1999).
The concept of exponential families is credited to Edwin Pitman, Georges Darmois, and Bernard Koopman in 1935–1936.
The notion of Kullback–Leibler divergence was originally introduced by
Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions. Many of its useful properties are studied in the monograph Kullback (1997).
The Fisher information was discussed by several early statisticians, notably Francis Edgeworth. Maximum-likelihood estimation was recommended,
analyzed, and vastly popularized by Ronald Fisher between 1912 and 1922,
although it had been used earlier by Carl Gauss, Pierre-Simon Laplace, Thorvald Thiele, and Francis Edgeworth.
The Cramér–Rao inequality was independently obtained by Maurice Fréchet,
Calyampudi Rao, and Harald Cramér around 1943–1945.
For further reading we recommend textbooks by Lehmann and Casella
(1998), Borovkov (1999), and Strasser (1985). The deviation bound of Theorem 2.11.4 follows Polzehl and Spokoiny (2006).
3 Regression Estimation
This chapter discusses the estimation problem for the regression model. First
a linear regression model is considered, then generalized linear modeling is
discussed. We also mention median and quantile regression.
3.1 Regression model
The (mean) regression model can be written in the form IE(Y | X) = f(X), or equivalently,

Y = f(X) + ε,    (3.1)
where Y is the dependent (explained) variable and X is the explanatory
variable (regressor) which can be multidimensional. The target of analysis is
the systematic dependence of the explained variable Y on the explanatory
variable X . The regression function f describes the dependence of the mean
of Y as a function of X . The value ε can be treated as an individual deviation
(error). It is usually assumed to be random with zero mean. Below we discuss
in more detail the components of the regression model (3.1).
3.1.1 Observations
In almost all practical situations, regression analysis is performed on the basis
of available data (observations) given in the form of a sample of pairs (Xi , Yi )
for i = 1, . . . , n , where n is the sample size. Here Y1 , . . . , Yn are observed
values of the regression variable Y and X1 , . . . , Xn are the corresponding
values of the explanatory variable X . For each observation Yi , the regression
model reads as:
Yi = f (Xi ) + εi
where εi is the individual i th error.
3.1.2 Design
The set X1 , . . . , Xn of the regressor’s values is called a design. The set X of
all possible values of the regressor X is called the design space. If this set X
is compact, then one speaks of a compactly supported design.
The nature of the design can be different for different statistical models.
However, it is important to mention that the design is always observable.
Two kinds of design assumptions are usually used in statistical modeling. A
deterministic design assumes that the points X1 , . . . , Xn are nonrandom and
given in advance. Here are typical examples:
Example 3.1.1 (Time series). Let Yt0 , Yt0 +1 , . . . , YT be a time series. The time
points t0 , t0 + 1, . . . , T build a regular deterministic design. The regression
function f explains the trend of the time series Yt as a function of time.
Example 3.1.2 (Imaging). Let Yij be the observed grey value at the pixel
(i, j) of an image. The coordinate Xij of this pixel is the corresponding
design value. The regression function f (Xij ) gives the true image value at
Xij which is to be recovered from the noisy observations Yij .
If the design is supported on a cube in IRd and the design points Xi form
a grid in this cube, then the design is called equidistant. An important feature
of such a design is that the number NA of design points in any “massive”
subset A of the unit cube is nearly the volume of this subset VA multiplied
by the sample size n : NA ≈ nVA . Design regularity means that the value
NA is nearly proportional to nVA , that is, NA ≈ cnVA for some positive
constant c which may depend on the set A .
In some applications, it is natural to assume that the design values Xi are
randomly drawn from some design distribution. Typical examples are given
by sociological studies. In this case one speaks of a random design. The design
values X1 , . . . , Xn are assumed to be independent and identically distributed
from a law PX on the design space X which is a subset of the Euclidean
space IRd . The design variables X are also assumed to be independent of
the observations Y .
One special case of random design is the uniform design when the design
distribution is uniform on the unit cube in IRd . The uniform design possesses
a similar, important property to an equidistant design: the number of design
points in a “massive” subset of the unit cube is on average close to the volume
of this set multiplied by n . The random design is called regular on X if the
design distribution is absolutely continuous with respect to the Lebesgue measure and the design density ρ(x) = dPX (x)/dλ is positive and continuous on
X . This again ensures with a probability close to one the regularity property
NA ≈ cnVA with c = ρ(x) for some x ∈ A .
It is worth mentioning that the case of a random design can be reduced to
the case of a deterministic design by considering the conditional distribution
of the data given the design variables X1 , . . . , Xn .
3.1.3 Errors
The decomposition of the observed response variable Y into the systematic
component f (x) and the error ε in the model equation (3.1) is not formally
defined and cannot be done without some assumptions on the errors εi . The
standard approach is to assume that the mean value of every εi is zero.
Equivalently this means that the expected value of the observation Yi is just
the regression function f (Xi ) . This case is called mean regression or simply
regression. It is usually assumed that the errors εi have finite second moments.
The homogeneous errors case means that all the errors εi have the same variance σ² = Var(εi). The variance of heterogeneous errors εi may vary with i. In
many applications not only the systematic component f (Xi ) = IEYi but also
the error variance Var Yi = Var εi depend on the regressor (location) Xi .
Such models are often written in the form
Yi = f (Xi ) + σ(Xi )εi .
The observation (noise) variance σ 2 (x) can be the target of analysis similarly
to the mean regression function.
The assumption of zero mean noise, IEεi = 0 , is very natural and has
a clear interpretation. However, in some applications, it can cause trouble,
especially if data are contaminated by outliers. In this case, the assumption of
a zero mean can be replaced by a more robust assumption of a zero median.
This leads to the median regression model which assumes IP (εi ≤ 0) = 1/2 ,
or, equivalently,

IP( Yi − f(Xi) ≤ 0 ) = 1/2.
A further important assumption concerns the joint distribution of the errors
εi . In the majority of applications the errors are assumed to be independent.
However, in some situations, the dependence of the errors is quite natural.
One example can be given by time series analysis. The errors εi are defined
as the difference between the observed values Yi and the trend function fi at
the i th time moment. These errors are often serially correlated and indicate
short or long range dependence. Another example comes from imaging. The
neighbor observations in an image are often correlated due to the imaging
technique used for recoding the images. The correlation particularly results
from the automatic movement correction.
For theoretical study one often assumes that the errors εi are not only
independent but also identically distributed. This, of course, yields a homogeneous noise. The theoretical study can be simplified even further if the error
distribution is normal. This case is called Gaussian regression and is denoted
as εi ∼ N(0, σ 2 ) . This assumption is very useful and greatly simplifies the
theoretical study. The main advantage of Gaussian noise is that the observations and their linear combinations are also normally distributed. This is an
exclusive property of the normal law which helps to simplify the exposition
and avoid technicalities.
Under the given distribution of the errors, the joint distribution of the
observations Yi is determined by the regression function f (·) .
3.1.4 Regression function
By the equation (3.1), the regression variable Y can be decomposed into a
systematic component and a (random) error ε . The systematic component is
a deterministic function f of the explanatory variable X called the regression
function. Classical regression theory considers the case of linear dependence,
that is, one fits a linear relation between Y and X :
f (x) = a + bx
leading to the model equation
Yi = θ1 + θ2 Xi + εi .
Here θ1 and θ2 are the parameters of the linear model. If the regressor x is
multidimensional, then θ2 is a vector from IRd and θ2 x becomes the scalar
product of two vectors. In many practical examples the assumption of linear
dependence is too restrictive. It can be extended in several ways. One can
try a more sophisticated functional dependence of Y on X , for instance
polynomial. More generally, one can assume that the regression function f is
known up to the finite-dimensional parameter θ = (θ1 , . . . , θp )> ∈ IRp . This
situation is called parametric regression and denoted by f (·) = f (·, θ) . If the
function f (·, θ) depends on θ linearly, that is, f (x, θ) = θ1 ψ1 (x) + . . . +
θp ψp (x) for some given functions ψ1 , . . . , ψp , then the model is called linear
regression. An important special case is given by polynomial regression when
f (x) is a polynomial function of degree p−1 : f (x) = θ1 +θ2 x+. . .+θp xp−1 .
In many applications a parametric form of the regression function cannot
be justified. Then one speaks of nonparametric regression.
3.2 Method of substitution and M-estimation
Observe that the parametric regression equation can be rewritten as

εi = Yi − f(Xi, θ).

If θ̃ is an estimate of the parameter θ, then the residuals ε̃i = Yi − f(Xi, θ̃) are estimates of the individual errors εi. So, the idea of the method is to select the parameter estimate θ̃ in a way that the empirical distribution Pn of the residuals ε̃i mimics as well as possible certain prescribed features of
the error distribution. We consider one approach called minimum contrast
or M-estimation. Let ψ(y) be an influence or contrast function. The main
condition on the choice of this function is that
IEψ(εi + z) ≥ IEψ(εi )
for all i = 1, . . . , n and all z. Then the true value θ∗ clearly minimizes the expectation of the sum Σ_i ψ( Yi − f(Xi, θ) ):

θ∗ = argmin_θ IE Σ_i ψ( Yi − f(Xi, θ) ).
This leads to the M-estimate

θ̃ = argmin_{θ ∈ Θ} Σ_i ψ( Yi − f(Xi, θ) ).
This estimation method can be treated as replacing the true expectation of
the errors by the empirical distribution of the residuals.
We specify this approach for regression estimation by the classical examples of least squares, least absolute deviation, and maximum likelihood estimation, corresponding to ψ(x) = x², ψ(x) = |x|, and ψ(x) = −log ρ(x), where ρ(x) is the error density. All these examples fall within the framework of M-estimation and the quasi maximum likelihood approach.
3.2.1 Mean regression. Least squares estimate
The observations Yi are assumed to follow the model

Yi = f(Xi, θ∗) + εi,   IEεi = 0    (3.2)

with an unknown target θ∗. Suppose in addition that σi² = IEεi² < ∞. Then for every θ ∈ Θ and every i ≤ n, due to (3.2),

IEθ∗( Yi − f(Xi, θ) )² = IEθ∗( εi + f(Xi, θ∗) − f(Xi, θ) )² = σi² + ( f(Xi, θ∗) − f(Xi, θ) )².
This yields for the whole sample

IEθ∗ Σ ( Yi − f(Xi, θ) )² = Σ { σi² + ( f(Xi, θ∗) − f(Xi, θ) )² }.

This expression is clearly minimized at θ = θ∗. This leads to the idea of estimating the parameter θ∗ by minimizing its empirical counterpart. The resulting estimate is called the (ordinary) least squares estimate (LSE):
θ̃_LSE = argmin_{θ ∈ Θ} Σ ( Yi − f(Xi, θ) )².
This estimate is very natural and requires minimal information about the
errors εi . Namely, one only needs IEεi = 0 and IEε2i < ∞ .
3.2.2 Median regression. Least absolute deviation estimate
Consider the same regression model as in (3.2), but the errors εi are not
zero-mean. Instead we assume that their median is zero:
Yi = f(Xi, θ∗) + εi,   med(εi) = 0.
As previously, the target of estimation is the parameter θ ∗ . Observe that
εi = Yi − f (Xi , θ ∗ ) and hence, the latter r.v. has median zero. We now use
the following simple fact: if med(ε) = 0, then for any z ≠ 0
IE|ε + z| ≥ IE|ε|.
(3.3)
Exercise 3.2.1. Prove (3.3).
The property (3.3) implies for every θ

IEθ∗ Σ | Yi − f(Xi, θ) | ≥ IEθ∗ Σ | Yi − f(Xi, θ∗) |,

that is, θ∗ minimizes over θ the expectation under the true measure of the sum Σ | Yi − f(Xi, θ) |. This leads to the empirical counterpart of θ∗ given by
θ̃ = argmin_{θ ∈ Θ} Σ | Yi − f(Xi, θ) |.

This procedure is usually referred to as the least absolute deviations regression estimate.
3.2.3 Maximum likelihood regression estimation
Let the density function ρ(·) of the errors εi be known. The regression equation (3.2) implies εi = Yi − f (Xi , θ ∗ ) . Therefore, every Yi has the density
ρ(y − f (Xi , θ ∗ )) . Independence of the Yi ’s implies the product structure of
the density of the joint distribution:
∏ ρ( yi − f(Xi, θ) ),

yielding the log-likelihood

L(θ) = Σ ℓ( Yi − f(Xi, θ) )
with ℓ(t) = log ρ(t). The maximum likelihood estimate (MLE) is the point of maximum of L(θ):

θ̃ = argmax_{θ ∈ Θ} L(θ) = argmax_{θ ∈ Θ} Σ ℓ( Yi − f(Xi, θ) ).
A closed form solution for this equation exists only in some special cases like
linear Gaussian regression. Otherwise this equation has to be solved numerically.
Consider an important special case corresponding to the i.i.d. Gaussian
errors when ρ(y) is the density of the normal law with mean zero and variance
σ². Then

L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ ( Yi − f(Xi, θ) )².
The corresponding MLE maximizes L(θ) or, equivalently, minimizes the sum Σ( Yi − f(Xi, θ) )²:

θ̃ = argmax_{θ ∈ Θ} L(θ) = argmin_{θ ∈ Θ} Σ ( Yi − f(Xi, θ) )².    (3.4)
This estimate has already been introduced as the ordinary least squares estimate (oLSE).
An extension of the previous example is given by inhomogeneous Gaussian
regression, when the errors εi are independent Gaussian zero-mean but the
variances depend on i: IEεi² = σi². Then the log-likelihood L(θ) is given by the sum

L(θ) = Σ { −( Yi − f(Xi, θ) )²/(2σi²) − (1/2) log(2πσi²) }.

Maximizing this expression w.r.t. θ is equivalent to minimizing the weighted sum Σ σi^{−2}( Yi − f(Xi, θ) )²:
θ̃ = argmin_{θ ∈ Θ} Σ σi^{−2} ( Yi − f(Xi, θ) )².

Such an estimate is also called the weighted least squares estimate (wLSE).
Another example corresponds to the case when the errors εi are i.i.d. double exponential, so that IP(±ε1 > t) = e^{−t/σ} for some given σ > 0. Then ρ(y) = (2σ)^{−1} e^{−|y|/σ} and

L(θ) = −n log(2σ) − σ^{−1} Σ | Yi − f(Xi, θ) |.
The MLE θ̃ maximizes L(θ) or, equivalently, minimizes the sum Σ | Yi − f(Xi, θ) |:

θ̃ = argmax_{θ ∈ Θ} L(θ) = argmin_{θ ∈ Θ} Σ | Yi − f(Xi, θ) |.

So the maximum likelihood regression with Laplacian errors leads back to the least absolute deviations (LAD) estimate.
3.2.4 Quasi Maximum Likelihood approach
This section very briefly discusses an extension of the maximum likelihood approach. A more detailed discussion will be given in the context of linear modeling in Chapter 4. To be specific, consider a regression model

Yi = f(Xi) + εi.

The maximum likelihood approach requires to specify the two main ingredients of this model: a parametric class {f(x, θ), θ ∈ Θ} of regression functions and the distribution of the errors εi. Sometimes such information is lacking. One or even both modeling assumptions can be misspecified. In such situations one speaks of a quasi maximum likelihood approach, where the estimate θ̃ is defined via maximizing over θ the random function L(θ) even though it is not necessarily the real log-likelihood. Some examples of this approach have already been given.
Below we distinguish between misspecification of the first and second kind.
The first kind corresponds to the parametric assumption about the regression
function: assumed is the equality f (Xi ) = f (Xi , θ ∗ ) for some θ ∗ ∈ Θ . In
reality one can only expect a reasonable quality of approximating f (·) by
f (·, θ ∗ ) . A typical example is given by linear (polynomial) regression. The
linear structure of the regression function is useful and tractable but it can
only be a rough approximation of the real relation between Y and X . The
quasi maximum likelihood approach suggests ignoring this misspecification and proceeding as if the parametric assumption were fulfilled. This approach raises a number of questions: what is the target of estimation and what is really estimated by such a quasi ML procedure? In Chapter 4 we show in the context
of linear modeling that the target of estimation can be naturally defined as
the parameter θ ∗ providing the best approximation of the true regression
function f (·) by its parametric counterpart f (·, θ) .
The second kind of misspecification concerns the assumption about the
errors εi. In most applications, the distribution of the errors is unknown. Moreover, the errors can be dependent or non-identically distributed. The assumption of a specific i.i.d. structure leads to a model misspecification and thus, to the quasi maximum likelihood approach. We illustrate this situation by a few examples.
Consider the regression model Yi = f(Xi, θ∗) + εi and suppose for a moment that the errors εi are i.i.d. normal. Then the principal term of the corresponding log-likelihood is given by the negative sum of the squared residuals, −Σ( Yi − f(Xi, θ) )², and its maximization leads to the least squares method.
So, one can say that the LSE method is the quasi MLE when the errors are
assumed to be i.i.d. normal. That is, the LSE can be obtained as the MLE
for the imaginary Gaussian regression model when the errors εi are not necessarily i.i.d. Gaussian.
If the data are contaminated or the errors have heavy tails, it could be
unwise to apply the LSE method. The LAD method is known to be more
robust against outliers and data contamination. At the same time, it has
already been shown in Section 3.2.3 that the LAD estimate is the MLE when
the errors are Laplacian (double exponential). In other words, LAD is the
quasi MLE for the model with Laplacian errors.
Inference for the quasi ML approach is discussed in detail in Chapter 4 in
the context of linear modeling.
3.3 Linear regression
One standard way of modeling the regression relationship is based on a linear
expansion of the regression function. This approach is based on the assumption
that the unknown regression function f (·) can be represented as a linear
combination of given basis functions ψ1 (·), . . . , ψp (·) :
f (x) = θ1 ψ1 (x) + . . . + θp ψp (x).
A couple of popular examples are listed in this section. More examples are given below in Section 3.3.1 in the context of projection estimation.
Example 3.3.1 (Multivariate linear regression). Let x = (x1 , . . . , xd )> be d dimensional. The linear regression function f (x) can be written as
f (x) = a + b1 x1 + . . . + bd xd .
Here we have p = d + 1 and the basis functions are ψ1(x) ≡ 1 and ψm(x) = x_{m−1} for m = 2, . . . , p. The coefficient a is often called the intercept and b1, . . . , bd are the slope coefficients. The vector of coefficients
θ = (a, b1 , . . . , bd )> uniquely describes the linear relation.
Example 3.3.2 (Polynomial regression). Let x be univariate and f (·) be a
polynomial function of degree p − 1 , that is,
f (x) = θ1 + θ2 x + . . . + θp xp−1 .
Then the basis functions are ψ1(x) ≡ 1, ψ2(x) ≡ x, . . . , ψp(x) ≡ x^{p−1}, while
θ = (θ1 , . . . , θp )> is the corresponding vector of coefficients.
Exercise 3.3.1. Let the regressor x be d -dimensional, x = (x1 , . . . , xd )> .
Describe the basis system and the corresponding vector of coefficients for the
case when f is a quadratic function of x .
Linear regression is often described using vector-matrix notation. Let Ψi be the vector in IRp whose entries are the values ψm(Xi) of the basis functions at the design point Xi, m = 1, . . . , p. Then f(Xi) = Ψi⊤ θ∗, and the linear regression model can be written as

Yi = Ψi⊤ θ∗ + εi,   i = 1, . . . , n.

Denote by Y = (Y1, . . . , Yn)⊤ the vector of observations (responses), and by ε = (ε1, . . . , εn)⊤ the vector of errors. Let finally Ψ be the p × n matrix with columns Ψ1, . . . , Ψn, that is, Ψ = ( ψm(Xi) ), m = 1, . . . , p, i = 1, . . . , n. Note that each row of Ψ is composed of the values of the corresponding basis function ψm at the design points Xi. Now the regression equation reads as

Y = Ψ⊤ θ∗ + ε.
The estimation problem for this linear model will be discussed in detail in
Chapter 4.
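As a small illustration of this notation, the sketch below builds the p × n matrix Ψ for a polynomial basis at an illustrative design and computes the ordinary least squares estimate by solving the normal equations (Ψ Ψ⊤)θ = Ψ Y; the basis, sample size, noise level, and "true" function are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 3                                       # illustrative sample size and basis dimension
X = np.sort(rng.uniform(-1, 1, n))

def f_true(x):
    # assumed "true" regression function for the illustration
    return 1.0 + 0.5 * x - 2.0 * x**2

Y = f_true(X) + 0.2 * rng.standard_normal(n)

# Psi is the p x n matrix with entries psi_m(X_i); here psi_m(x) = x^{m-1}
Psi = np.vstack([X**m for m in range(p)])          # shape (p, n)

# oLSE: minimize |Y - Psi^T theta|^2, i.e. solve (Psi Psi^T) theta = Psi Y
theta_hat = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
print(theta_hat)                                   # close to (1.0, 0.5, -2.0)
```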
3.3.1 Projection estimation
Consider a (mean) regression model
Yi = f(Xi) + εi,   i = 1, . . . , n.    (3.5)
The target of the analysis is the unknown nonparametric function f which
has to be recovered from the noisy data Y . This approach is usually considered within the nonparametric statistical theory because it avoids fixing any
parametric specification of the model function f , and thus, of the distribution
of the data Y . This section discusses how this nonparametric problem can
be put back into the parametric theory.
The standard way of estimating the regression function f is based on
some smoothness assumption about this function. It enables us to expand the
given function w.r.t. some given functional basis and to evaluate the accuracy
of approximation by finite sums. More precisely, let ψ1 (x), . . . , ψm (x), . . . be
a given system of functions. Specific examples are trigonometric (Fourier,
cosine), orthogonal polynomial (Chebyshev, Legendre, Jacobi), and wavelet
systems among many others. The completeness of this system means that a
given function f can be uniquely expanded in the form

f(x) = Σ_{m=1}^{∞} θm ψm(x).    (3.6)
A very desirable feature of the basis system is orthogonality:

∫ ψm(x) ψm′(x) μX(dx) = 0,   m ≠ m′.
Here μX can be some design measure on X or the empirical design measure n^{−1} Σ δ_{Xi}. However, the expansion (3.6) is intractable because it involves infinitely many coefficients θm. A standard procedure is to truncate this expansion after the first p terms leading to the finite approximation

f(x) ≈ Σ_{m=1}^{p} θm ψm(x).    (3.7)
Accuracy of such an approximation becomes better and better as the number
p of terms grows. A smoothness assumption helps to estimate the rate of
convergence to zero of the remainder term f − θ1 ψ1 − . . . − θp ψp :
‖ f − θ1ψ1 − . . . − θpψp ‖ ≤ rp,    (3.8)
where rp describes the accuracy of approximation of the function f by the
considered finite sums uniformly over the class of functions with the prescribed
smoothness. The norm used in the definition (3.8) as well as the basis {ψm }
depend on the particular smoothness class. Popular examples are given by
Hölder classes for the L∞ norm, Sobolev smoothness for L2 -norm or more
generally Ls -norm for some s ≥ 1 .
A choice of a proper truncation value p is one of the central problems in
nonparametric function estimation. With p growing, the quality of approximation (3.7) improves in the sense that rp → 0 as p → ∞ . However, the
growth of the parameter dimension yields the growth of model complexity: one has to estimate more and more coefficients. Section 4.7 below briefly discusses how the problem can be formalized and how one can define the optimal choice. However, a rigorous solution is postponed until the next volume. Here we suppose that the value p is fixed for some reason and apply the quasi maximum likelihood parametric approach. Namely, the approximation (3.7) is assumed to be an exact equality: f(x) ≡ f(x, θ∗) := θ1∗ ψ1(x) + . . . + θp∗ ψp(x).
Model misspecification, f(·) ≢ f(·, θ) := θ1 ψ1(·) + . . . + θp ψp(·) for any vector θ ∈ Θ, means a modeling error, or modeling bias. The parametric approach ignores this modeling error and focuses on the error within the model which describes the accuracy of the qMLE θ̃.
The qMLE procedure requires to specify the error distribution which appears in the log-likelihood. In the most general form, let IP0 be the joint
distribution of the error vector ε , and let ρ(n) (ε) be its density function on
IRn . The identities εi = Yi − f (Xi , θ) yield the log-likelihood
L(θ) = log ρ^{(n)}( Y − f(X, θ) ).    (3.9)
If the errors εi are i.i.d. with the density ρ(y), then

L(θ) = Σ log ρ( Yi − f(Xi, θ) ).    (3.10)
The most popular least squares method (3.4) implicitly assumes Gaussian homogeneous noise: εi are i.i.d. N(0, σ 2 ) . The LAD approach is based on the
assumption of Laplace error distribution. Categorical data are modeled by a
proper exponential family distribution; see Section 3.5. Below we assume that one or another assumption about the errors is fixed and the log-likelihood
is described by (3.9) or (3.10). This assumption can be misspecified and the
qMLE analysis has to be done under the true error distribution. Some examples of this sort for linear models are given in Section 4.6.
In the rest of this section we only discuss how the regression function f
in (3.5) can be approximated by different series expansions. With the selected
expansion and the assumption on the errors, the approximating parametric
model is fixed due to (3.9). In most of the examples we only consider a univariate design with d = 1.
3.3.2 Polynomial approximation
It is well known that any smooth function f (·) can be approximated by a
polynomial. Moreover, the larger the smoothness of f(·), the better the accuracy
of approximation. The Taylor expansion yields an approximation in the form
f(x) ≈ θ0 + θ1 x + θ2 x² + . . . + θm x^m.    (3.11)
Such an approximation is very natural; however, it is rarely used in statistical applications. The main reason is that the different power functions ψm(x) = x^m are highly correlated with each other. This makes it difficult to identify the corresponding coefficients. Instead one can use different polynomial systems which fulfill certain orthogonality conditions.
We say that f (x) is a polynomial of degree m , if it can be represented
in the form (3.11) with θm ≠ 0. Any sequence 1, ψ1(x), . . . , ψm(x) of such
polynomials yields a basis in the vector space of polynomials of degree m .
Exercise 3.3.2. Let for each j ≤ m a polynomial ψj(x) of degree j be fixed. Then any polynomial Pm(x) of degree m can be represented in a unique way in the form

Pm(x) = c0 + c1 ψ1(x) + . . . + cm ψm(x).

Hint: define cm = Pm^{(m)}/ψm^{(m)} and apply induction to Pm(x) − cm ψm(x).
3.3.3 Orthogonal polynomials
Let μ be any measure on the real line satisfying the condition

∫ x^m μ(dx) < ∞    (3.12)

for any integer m. This enables us to define the scalar product of two polynomial functions f, g by

⟨f, g⟩ := ∫ f(x) g(x) μ(dx).

With such a Hilbert structure we aim to define an orthonormal system of polynomials ψm of degree m for m = 0, 1, 2, . . . such that

⟨ψj, ψm⟩ = δ_{j,m} = 1I(j = m),   j, m = 0, 1, 2, . . . .
Theorem 3.3.1. Given a measure μ satisfying the condition (3.12), there exists a unique orthonormal polynomial system ψ1, ψ2, . . . . Any polynomial Pm of degree m can be represented as

Pm(x) = a0 + a1 ψ1(x) + . . . + am ψm(x)

with

aj = ⟨Pm, ψj⟩.    (3.13)
Proof. We construct the functions ψm successively. The function ψ0 is a constant defined by

∫ ψ0² μ(dx) = 1.

Suppose now that the orthonormal polynomials ψ1, . . . , ψm−1 have already been constructed. Define the coefficients

aj := ∫ x^m ψj(x) μ(dx),   j = 0, 1, . . . , m − 1,

and consider the function

gm(x) := x^m − a0 ψ0 − a1 ψ1(x) − . . . − am−1 ψm−1(x).
This is obviously a polynomial of degree m. Moreover, by orthonormality of the ψj's, for j < m

∫ gm(x) ψj(x) μ(dx) = ∫ x^m ψj(x) μ(dx) − aj ∫ ψj²(x) μ(dx) = 0.

So, one can define ψm by normalization of gm:

ψm(x) := ⟨gm, gm⟩^{−1/2} gm(x).
One can also easily see that the ψm defined in this way is the only polynomial of degree m which is orthogonal to ψj for j < m and fulfills ⟨ψm, ψm⟩ = 1, because the number of constraints is equal to the number of coefficients θ0, . . . , θm of ψm(x).
Let now Pm be a polynomial of degree m. Define the coefficients aj by (3.13). Similarly to the above one can show that

Pm(x) − { a0 + a1 ψ1(x) + . . . + am ψm(x) } ≡ 0,

which implies the second claim.
Exercise 3.3.3. Let {ψm} be an orthonormal polynomial system. Show that for any polynomial Pj(x) of degree j < m, it holds

⟨Pj, ψm⟩ = 0.
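The construction in the proof of Theorem 3.3.1 is essentially a Gram–Schmidt procedure, and it can be carried out numerically for a discrete measure. The sketch below orthonormalizes the monomials with respect to the empirical design measure n^{−1} Σ δ_{Xi}; the design distribution and the number of basis functions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 4
X = rng.uniform(-1, 1, n)                     # illustrative design

def inner(u, v):
    # scalar product <f, g> w.r.t. the empirical measure: mean of f(X_i) g(X_i)
    return np.mean(u * v)

psi = []
for m in range(p):
    g = X**m * 1.0                            # start from the monomial x^m
    for q in psi:                             # subtract projections on the already built psi_j
        g = g - inner(g, q) * q
    psi.append(g / np.sqrt(inner(g, g)))      # normalize

gram = np.array([[inner(u, v) for v in psi] for u in psi])
print(np.round(gram, 10))                     # identity matrix: the system is orthonormal
```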
Finite approximation and the associated kernel
Let f be a function satisfying

∫ f²(x) μ(dx) < ∞.    (3.14)

Then the scalar product aj = ⟨f, ψj⟩ is well defined for all j ≥ 0, leading for each m ≥ 1 to the following approximation:

fm(x) := Σ_{j=0}^{m} aj ψj(x) = Σ_{j=0}^{m} ∫ f(u) ψj(u) μ(du) ψj(x) = ∫ f(u) Φm(x, u) μ(du)    (3.15)

with

Φm(x, u) = Σ_{j=0}^{m} ψj(x) ψj(u).
Completeness
The accuracy of approximation of f by fm with m growing is one of the central questions in approximation theory. The answer depends on the regularity of the function f and on the choice of the system {ψm}. Let F be a linear space of functions f on the real line satisfying (3.14). We say that the basis system {ψm(x)} is complete in F if the identities ⟨f, ψm⟩ = 0 for all m ≥ 0 imply f ≡ 0. As ψm(x) is a polynomial of degree m, this definition is equivalent to the condition

⟨f, x^m⟩ = 0, m = 0, 1, 2, . . .   ⟺   f ≡ 0.
Squared bias and accuracy of approximation
Let f ∈ F be a function in L2 satisfying (3.14), and let {ψm} be a complete basis. Consider the error f(x) − fm(x) of the finite approximation fm(x) from (3.15). The Parseval identity yields

∫ f²(x) μ(dx) = Σ_{m=0}^{∞} am².

This yields that the finite sums Σ_{j=0}^{m} aj² converge to the infinite sum Σ_{m=0}^{∞} am² and the remainder bm = Σ_{j=m+1}^{∞} aj² tends to zero with m:

bm := ‖ f − fm ‖² = ∫ ( f(x) − fm(x) )² μ(dx) = Σ_{j=m+1}^{∞} aj² → 0

as m → ∞. The value bm is often called the squared bias. Below in this section we briefly overview some popular polynomial systems used in the approximation theory.
3.3.4 Chebyshev polynomials
Chebyshev polynomials are frequently used in approximation theory because of their very useful features. These polynomials can be defined in many ways: by explicit formulas, recurrent relations, or differential equations, among others.
A trigonometric definition
Chebyshev polynomials are usually defined in the trigonometric form:

Tm(x) = cos( m arccos(x) ).    (3.16)
Exercise 3.3.4. Check that Tm (x) from (3.16) is a polynomial of degree m .
Hint: use the formula cos( (m + 1)u ) = 2 cos(u) cos(mu) − cos( (m − 1)u ) and
induction arguments.
Recurrent formula
The trigonometric identity cos( (m + 1)u ) = 2 cos(u) cos(mu) − cos( (m − 1)u ) yields the recurrent relation between Chebyshev polynomials:

Tm+1(x) = 2x Tm(x) − Tm−1(x),   m ≥ 1.    (3.17)
Exercise 3.3.5. Describe the first 5 polynomials Tm .
Hint: use that T0 (x) ≡ 1 and T1 (x) ≡ x and use the recurrent formula (3.17).
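The recursion (3.17) is also convenient numerically. The following small Python sketch (ours, not part of the original text) generates the coefficient vectors of T0 , . . . , T4 in the monomial basis by applying (3.17) directly; the helper name chebyshev_coeffs is purely illustrative.

import numpy as np

def chebyshev_coeffs(max_degree):
    """Monomial coefficients (lowest order first) of T_0, ..., T_max_degree via (3.17)."""
    T = [np.array([1.0]), np.array([0.0, 1.0])]        # T_0 = 1, T_1 = x
    for m in range(1, max_degree):
        # T_{m+1}(x) = 2x T_m(x) - T_{m-1}(x)
        two_x_Tm = np.concatenate(([0.0], 2.0 * T[m]))  # multiplication by 2x = shift + scale
        Tm_minus1 = np.concatenate((T[m - 1], np.zeros(len(two_x_Tm) - len(T[m - 1]))))
        T.append(two_x_Tm - Tm_minus1)
    return T[: max_degree + 1]

for m, c in enumerate(chebyshev_coeffs(4)):
    print(f"T_{m}:", c)   # e.g. T_2 = -1 + 2x^2, T_3 = -3x + 4x^3, T_4 = 1 - 8x^2 + 8x^4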
The leading coefficient
The recurrence relation (3.17) and the formulas T0(x) ≡ 1 and T1(x) ≡ x imply by induction arguments that the leading coefficient of Tm(x) is equal to 2^{m−1} . Equivalently,
Tm^{(m)}(x) ≡ 2^{m−1} m! .
Orthogonality and normalization
Consider the measure µ(dx) on the open interval (−1, 1) with the density (1 − x^2)^{−1/2} with respect to the Lebesgue measure. By the change of variables x = cos(u) we obtain for all j ≠ m
∫_{−1}^{1} Tm(x) Tj(x) (1 − x^2)^{−1/2} dx = ∫_{0}^{π} cos(mu) cos(ju) du = 0.
Moreover, for m ≥ 1
∫_{−1}^{1} Tm^2(x) (1 − x^2)^{−1/2} dx = ∫_{0}^{π} cos^2(mu) du = (1/2) ∫_{0}^{π} ( 1 + cos(2mu) ) du = π/2 .
Finally,
∫_{−1}^{1} (1 − x^2)^{−1/2} dx = ∫_{0}^{π} du = π .
So, the orthonormal system can be defined by normalizing the Chebyshev
polynomials Tm (x) :
ψ0(x) ≡ π^{−1/2} ,    ψm(x) = √(2/π) Tm(x),    m ≥ 1.
The moment generating function
The bivariate function
f(t, x) = Σ_{m=0}^{∞} Tm(x) t^m    (3.18)
is called the moment generating function. For the Chebyshev polynomials it holds
f(t, x) = (1 − tx) / (1 − 2tx + t^2) .
This fact can be proven by using the recurrent formula (3.17) and the following
relation:
f(x, t) = 1 + tx + t Σ_{m=1}^{∞} Tm+1(x) t^m
       = 1 + tx + t Σ_{m=1}^{∞} ( 2x Tm(x) − Tm−1(x) ) t^m
       = 1 + tx + 2tx ( f(x, t) − 1 ) − t^2 f(x, t).    (3.19)
Exercise 3.3.6. Check (3.19) and (3.18).
Roots of Tm
The identity cos( π(k − 1/2) ) ≡ 0 for all integer k yields the roots of the polynomial Tm(x) :
xk,m = cos( π(k − 1/2)/m ),    k = 1, . . . , m.    (3.20)
This means that Tm (xk,m ) = 0 for k = 1, . . . , m and hence, Tm (x) has
exactly m roots on the interval [−1, 1] .
Discrete orthogonality
Let x1,N , . . . , xN,N be the roots of TN due to (3.20): xk,N = cos( π(2k − 1)/(2N) ) . Define the discrete inner product
⟨Tm , Tj⟩_N = Σ_{k=1}^{N} Tm(xk,N) Tj(xk,N).
Then it holds, similarly to the continuous case,
⟨Tm , Tj⟩_N = 0 if m ≠ j ,    ⟨Tm , Tm⟩_N = N/2 if m ≠ 0 ,    ⟨T0 , T0⟩_N = N .    (3.21)
Exercise 3.3.7. Prove (3.21).
Hint: use that for all m > 0
Σ_{k=1}^{N} cos( πm(k − 1/2)/N ) = 0 ,
yielding for all m′ ≠ m
Σ_{k=1}^{N} cos( πm(k − 1/2)/N ) cos( πm′(k − 1/2)/N ) = 0 .
Extremes of Tm (x)
Obviously |Tm(x)| ≤ 1 because the cosine function is bounded by one in absolute value. Moreover, cos(kπ) = (−1)^k yields the extreme points ek with Tm(ek) = (−1)^k for
ek = cos( πk/m ),    k = 0, 1, . . . , m.    (3.22)
In particular, the edge points x = 1 and x = −1 are extremes of Tm (x) .
Exercise 3.3.8. Check that Tm(ek) = (−1)^k for ek from (3.22). Show that Tm(1) = 1 and Tm(−1) = (−1)^m . Show that |Tm(x)| < 1 for x ≠ ek on [−1, 1] .
Hint: Tm is a polynomial of degree m , hence, it can have at most m − 1
extreme points inside the interval (−1, 1) , which are e1 , . . . , em−1 .
Sup-norm
An important feature of the Chebyshev polynomials which makes them very useful in approximation theory is that each of them minimizes the sup-norm over all polynomials of the same degree with the same leading coefficient.
Theorem 3.3.2. The scaled Chebyshev polynomial fm(x) = 2^{1−m} Tm(x) minimizes the sup-norm ‖f‖∞ = sup_{x∈[−1,1]} |f(x)| over the class of all polynomials of degree m with leading coefficient 1.
Proof. As |Tm(x)| ≤ 1 , the sup-norm of fm fulfills ‖fm‖∞ = 2^{1−m} . Let w(x) be any other polynomial of degree m with leading coefficient one and |w(x)| < 2^{1−m} on [−1, 1] . Consider the difference fm(x) − w(x) at the extreme points ek from (3.22). Then fm(ek) − w(ek) > 0 for all even k = 0, 2, 4, . . . and fm(ek) − w(ek) < 0 for all odd k = 1, 3, 5, . . . . Hence this difference has at least m roots on [−1, 1] , which is impossible because it is a non-zero polynomial of degree at most m − 1 (the leading terms cancel).
Expansion by Chebyshev polynomials and discrete cosine transform
Let f (x) be a measurable function satisfying
∫_{−1}^{1} f^2(x) (1 − x^2)^{−1/2} dx < ∞ .
Then this function can be uniquely expanded by Chebyshev polynomials:
f(x) = Σ_{m=0}^{∞} am Tm(x).
The coefficients am in this expansion can be obtained by projection onto Tm : since ⟨Tm , Tm⟩ = π/2 for m ≥ 1 and ⟨T0 , T0⟩ = π ,
am = ⟨f, Tm⟩ / ⟨Tm , Tm⟩ = (2/π) ∫_{−1}^{1} f(x) Tm(x) (1 − x^2)^{−1/2} dx ,    m ≥ 1 ,
and a0 = (1/π) ∫_{−1}^{1} f(x) (1 − x^2)^{−1/2} dx .
However, this method is numerically intensive. Instead, one can use the discrete orthogonality (3.21). Let some N be fixed and xk,N = cos( π(k − 1/2)/N ) . Then for m ≥ 1 , the coefficients can be approximated by
am ≈ (2/N) Σ_{k=1}^{N} f(xk,N) cos( πm(k − 1/2)/N ) .
This sum can be computed very efficiently via the discrete cosine transform.
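A hedged numerical illustration (ours, not from the text): the sketch below evaluates these cosine sums directly at the Chebyshev nodes; scipy.fft.dct with type=2 computes the same sums (up to the 2/N and 1/N scaling) in O(N log N), which is what the remark about the discrete cosine transform refers to. The helper name chebyshev_coefficients is an assumption.

import numpy as np

def chebyshev_coefficients(f, N):
    """Approximate a_0, ..., a_{N-1} in f(x) = sum_m a_m T_m(x) from the N Chebyshev nodes.

    Direct O(N^2) evaluation of the discrete sums above; scipy.fft.dct(values, type=2)
    returns the same cosine sums up to scaling, in O(N log N).
    """
    k = np.arange(1, N + 1)
    nodes = np.cos(np.pi * (k - 0.5) / N)          # roots x_{k,N} of T_N
    values = f(nodes)
    m = np.arange(N)[:, None]
    cos_table = np.cos(np.pi * m * (k - 0.5) / N)  # cos(pi m (k - 1/2) / N)
    a = (2.0 / N) * cos_table @ values
    a[0] /= 2.0                                    # the constant term uses 1/N, not 2/N
    return a

# quick check on f(x) = x^3 = (3 T_1 + T_3)/4
print(np.round(chebyshev_coefficients(lambda x: x**3, 8), 10))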
3.3.5 Legendre polynomials
The Legendre polynomials Pm (x) are often used in physics and in harmonic
analysis. They form an orthogonal polynomial system on the interval [−1, 1] w.r.t. the Lebesgue measure, that is,
∫_{−1}^{1} Pm(x) Pm′(x) dx = 0 ,    m ≠ m′.    (3.23)
They also can be defined as solutions of the Legendre differential equation
d/dx [ (1 − x^2) d/dx Pm(x) ] + m(m + 1) Pm(x) = 0.    (3.24)
An explicit representation is given by the Rodrigues formula
Pm(x) = (1 / (2^m m!)) d^m/dx^m (x^2 − 1)^m .    (3.25)
Exercise 3.3.9. Check that Pm(x) from (3.25) fulfills (3.24).
Hint: differentiate m + 1 times the identity
(x^2 − 1) d/dx (x^2 − 1)^m = 2mx (x^2 − 1)^m ,
yielding
2Pm(x) + 2x d/dx Pm(x) + (x^2 − 1) d^2/dx^2 Pm(x) = 2mPm(x) + 2mx d/dx Pm(x).
Orthogonality
The orthogonality property (3.23) can be checked by using the Rodrigues’
formula.
Exercise 3.3.10. Check that for m < m′
∫_{−1}^{1} Pm(x) Pm′(x) dx = 0.    (3.26)
Hint: integrate (3.26) by parts m + 1 times with Pm from (3.25) and use that the (m + 1)th derivative of Pm vanishes.
Recursive definition
It is easy to check that P0(x) ≡ 1 and P1(x) ≡ x . Bonnet's recursion formula relates three consecutive Legendre polynomials: for m ≥ 1
(m + 1) Pm+1(x) = (2m + 1) x Pm(x) − m Pm−1(x).    (3.27)
From Bonnet’s recursion formula one obtains by induction the explicit representation
Pm(x) = Σ_{k=0}^{m} (−1)^k (m choose k)^2 ( (1 + x)/2 )^{m−k} ( (1 − x)/2 )^k .
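Bonnet's recursion (3.27) is also how the Legendre polynomials are usually evaluated in practice. The following Python sketch (ours, purely illustrative) implements it directly; the function name legendre_values is an assumption.

import numpy as np

def legendre_values(x, max_degree):
    """Evaluate P_0(x), ..., P_max_degree(x) via Bonnet's recursion (3.27).

    Illustrative sketch only; x may be a scalar or a numpy array.
    """
    x = np.asarray(x, dtype=float)
    P = [np.ones_like(x), x.copy()]                    # P_0 = 1, P_1 = x
    for m in range(1, max_degree):
        # (m+1) P_{m+1} = (2m+1) x P_m - m P_{m-1}
        P.append(((2 * m + 1) * x * P[m] - m * P[m - 1]) / (m + 1))
    return P[: max_degree + 1]

# P_2(x) = (3x^2 - 1)/2, so P_2(0.5) = -0.125
print(legendre_values(0.5, 3))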
More recursions
Further, the definition (3.25) yields the following three-term recursion:
( (x^2 − 1)/m ) d/dx Pm(x) = x Pm(x) − Pm−1(x).    (3.28)
Useful for the integration of Legendre polynomials is another recursion:
(2m + 1) Pm(x) = d/dx ( Pm+1(x) − Pm−1(x) ).
Exercise 3.3.11. Check (3.28) by using the definition (3.25).
Hint: use that
x d^m/dx^m [ (1 − x^2)^m / m ] = 2x d^{m−1}/dx^{m−1} [ x (1 − x^2)^{m−1} ]
    = 2x^2 d^{m−1}/dx^{m−1} (1 − x^2)^{m−1} − 2x d^{m−2}/dx^{m−2} (1 − x^2)^{m−1} .
Exercise 3.3.12. Check (3.27).
Hint: use that
d^m/dx^m d/dx [ (1 − x^2)^{m+1} / (m + 1) ] = −2 d^m/dx^m [ x (1 − x^2)^m ]
    = −2x d^m/dx^m (1 − x^2)^m − 2m d^{m−1}/dx^{m−1} (1 − x^2)^m .
Generating function
The Legendre generating function is defined by
f(t, x) = Σ_{m=0}^{∞} Pm(x) t^m .
It holds
f(t, x) = (1 − 2tx + t^2)^{−1/2} .    (3.29)
Exercise 3.3.13. Check (3.29).
3.3.6 Lagrange polynomials
In numerical analysis, Lagrange polynomials are used for polynomial interpolation. Lagrange polynomials are also widely applied in cryptography, for example in Shamir’s Secret Sharing scheme.
Given a set of p+1 data points (X0 , Y0 ) , . . . , (Xp , Yp ) , where no two Xj
are the same, the interpolation polynomial in the Lagrange form is a linear
combination
L(x) = Σ_{m=0}^{p} ℓm(x) Ym    (3.30)
of Lagrange basis polynomials
ℓm(x) = Π_{j=0,...,p, j≠m} (x − Xj)/(Xm − Xj)
      = (x − X0)/(Xm − X0) · . . . · (x − Xm−1)/(Xm − Xm−1) · (x − Xm+1)/(Xm − Xm+1) · . . . · (x − Xp)/(Xm − Xp).
This definition yields that ℓm(Xm) = 1 and ℓm(Xj) = 0 for j ≠ m . Hence, L(Xm) = Ym for the polynomial L(x) from (3.30). One can easily see that L(x) is the only polynomial of degree at most p that fulfills L(Xm) = Ym for all m .
The main disadvantage of the Lagrange form is that any change of the design X0 , . . . , Xp requires changing each basis function ℓm(x) . Another problem is that the Lagrange basis polynomials ℓm(x) are not necessarily orthogonal. This explains why these polynomials are rarely used in statistical applications.
Barycentric interpolation
Introduce a polynomial `(x) of degree p + 1 by
ℓ(x) = (x − X0) · . . . · (x − Xp).
Then the Lagrange basis polynomials can be rewritten as
ℓm(x) = ℓ(x) wm / (x − Xm)
with the barycentric weights wm defined by
wm^{−1} = Π_{j=0,...,p, j≠m} (Xm − Xj) ,
which is commonly referred to as the first form of the barycentric interpolation formula. The advantage of this representation is that the interpolation
polynomial may now be evaluated as
L(x) = ℓ(x) Σ_{m=0}^{p} ( wm / (x − Xm) ) Ym ,
which, if the weights wm have been pre-computed, requires only O(p) operations (evaluating ℓ(x) and the terms wm/(x − Xm) ) as opposed to O(p^2) for evaluating the Lagrange basis polynomials ℓm(x) individually.
The barycentric interpolation formula can also easily be updated to incorporate a new node Xp+1 by dividing each of the wm by (Xm − Xp+1 ) and
constructing the new wp+1 as above.
We can further simplify the first form by first considering the barycentric
interpolation of the constant function g(x) ≡ 1 :
g(x) = ℓ(x) Σ_{m=0}^{p} wm / (x − Xm) .
Dividing L(x) by g(x) does not modify the interpolation, yet yields
L(x) = [ Σ_{m=0}^{p} wm Ym / (x − Xm) ] / [ Σ_{m=0}^{p} wm / (x − Xm) ] ,
which is referred to as the second form or true form of the barycentric interpolation formula. This second form has the advantage that `(x) need not be
evaluated for each evaluation of L(x) .
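As a concrete illustration of the second form, here is a minimal Python sketch (ours, not part of the original text) that precomputes the barycentric weights and evaluates L(x) ; function names and the toy data are assumptions.

import numpy as np

def barycentric_weights(X):
    """Barycentric weights w_m, with w_m^{-1} = prod_{j != m} (X_m - X_j)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None] - X[None, :]
    np.fill_diagonal(diff, 1.0)            # exclude the j = m factor
    return 1.0 / diff.prod(axis=1)

def barycentric_interpolate(x, X, Y, w):
    """Second (true) form of the barycentric formula; x must not coincide with a node."""
    terms = w / (x - np.asarray(X, dtype=float))
    return (terms * Y).sum() / terms.sum()

X = [0.0, 1.0, 2.0, 3.0]
Y = [x**3 for x in X]                      # a cubic is interpolated exactly by p + 1 = 4 nodes
w = barycentric_weights(X)
print(barycentric_interpolate(1.5, X, Y, w))   # 3.375 = 1.5**3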
3.3.7 Hermite polynomials
The Hermite polynomials build an orthogonal system on the whole real line.
The explicit representation is given by
Hm(x) = (−1)^m e^{x^2} d^m/dx^m e^{−x^2} .
Sometimes one uses a “probabilistic” definition
H̆m(x) = (−1)^m e^{x^2/2} d^m/dx^m e^{−x^2/2} .
Exercise 3.3.14. Show that each Hm (x) and H̆m (x) is a polynomial of
degree m .
Hint: Use induction arguments to show that d^m/dx^m e^{−x^2} can be represented in the form Pm(x) e^{−x^2} with a polynomial Pm(x) of degree m .
Exercise 3.3.15. Check that the leading coefficient of Hm(x) is equal to 2^m while the leading coefficient of H̆m(x) is equal to one.
Orthogonality
The Hermite polynomials are orthogonal on the whole real line with the weight function w(x) = e^{−x^2} : for j ≠ m
∫_{−∞}^{∞} Hm(x) Hj(x) w(x) dx = 0.    (3.31)
Note first that each Hm(x) is a polynomial, so the scalar product (3.31) is well defined. Suppose that m > j . It is obvious that it suffices to check that
∫_{−∞}^{∞} Hm(x) x^j w(x) dx = 0 ,    j < m.    (3.32)
Define fm(x) = d^m/dx^m e^{−x^2} . Obviously f′m−1(x) ≡ fm(x) . Integration by parts yields for any j ≥ 1
∫_{−∞}^{∞} Hm(x) x^j w(x) dx = (−1)^m ∫_{−∞}^{∞} x^j fm(x) dx = (−1)^m ∫_{−∞}^{∞} x^j f′m−1(x) dx = (−1)^{m−1} j ∫_{−∞}^{∞} x^{j−1} fm−1(x) dx .
By the same arguments, for m ≥ 1
∫_{−∞}^{∞} fm(x) dx = ∫_{−∞}^{∞} f′m−1(x) dx = fm−1(∞) − fm−1(−∞) = 0.    (3.33)
This implies (3.32) and hence the orthogonality property (3.31).
Now we compute the scalar product of Hm(x) with itself. The same arguments applied with j = m together with (3.33) imply
∫_{−∞}^{∞} Hm(x) x^m w(x) dx = m! ∫_{−∞}^{∞} e^{−x^2} dx = √π m! .
As the leading coefficient of Hm(x) is equal to 2^m , this implies
∫_{−∞}^{∞} Hm^2(x) w(x) dx = 2^m ∫_{−∞}^{∞} Hm(x) x^m w(x) dx = √π 2^m m! .
Exercise 3.3.16. Prove the orthogonality of the probabilistic Hermite polynomials H̆m (x) . Compute their norm.
Recurrent formula
The definition yields
H0(x) ≡ 1,    H1(x) = 2x.
Further, for m ≥ 1 , we use the formula
d/dx Hm(x) = (−1)^m d/dx [ e^{x^2} d^m/dx^m e^{−x^2} ] = 2x Hm(x) − Hm+1(x)    (3.34)
yielding the recurrence relation
Hm+1(x) = 2x Hm(x) − H′m(x).
Moreover, integration by parts, the formula (3.34), and the orthogonality property yield for j < m − 1
∫_{−∞}^{∞} H′m(x) Hj(x) w(x) dx = − ∫_{−∞}^{∞} Hm(x) H′j(x) w(x) dx = 0 .
This means that H′m(x) is a polynomial of degree m − 1 and it is orthogonal to all Hj(x) for j < m − 1 . Thus, H′m(x) coincides with Hm−1(x) up to a multiplicative factor. The leading coefficient of H′m(x) is equal to m 2^m while the leading coefficient of Hm−1(x) is equal to 2^{m−1} , yielding
H′m(x) = 2m Hm−1(x).
This results in another recurrence equation
Hm+1(x) = 2x Hm(x) − 2m Hm−1(x).
Exercise 3.3.17. Derive the recurrent formulas for the probabilistic Hermite
polynomials H̆m (x) .
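For illustration, the three-term recurrence Hm+1(x) = 2x Hm(x) − 2m Hm−1(x) is easy to evaluate numerically; the Python sketch below (ours, names illustrative) is a minimal implementation.

import numpy as np

def hermite_values(x, max_degree):
    """Evaluate the physicists' Hermite polynomials H_0(x), ..., H_max_degree(x)
    via the recursion H_{m+1} = 2x H_m - 2m H_{m-1}. Illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    H = [np.ones_like(x), 2.0 * x]                 # H_0 = 1, H_1 = 2x
    for m in range(1, max_degree):
        H.append(2.0 * x * H[m] - 2.0 * m * H[m - 1])
    return H[: max_degree + 1]

# H_2(x) = 4x^2 - 2 and H_3(x) = 8x^3 - 12x
print(hermite_values(1.0, 3))   # [1.0, 2.0, 2.0, -4.0]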
Generating function
The exponential generating function f (x, t) for the Hermite polynomials is
defined as
f(x, t) = Σ_{m=0}^{∞} Hm(x) t^m / m! .
It holds
f(x, t) = exp( 2xt − t^2 ).    (3.35)
It can be proved by checking the formula
∂^m/∂t^m f(x, t) = Hm(x − t) f(x, t).    (3.36)
Exercise 3.3.18. Check the formula (3.36) and derive (3.35).
Completeness
This property means that the system of the normalized Hermite polynomials builds an orthonormal basis in the L2 Hilbert space of functions f(x) on the real line satisfying
∫_{−∞}^{∞} f^2(x) w(x) dx < ∞ .
3.3.8 Trigonometric series expansion
The trigonometric functions are frequently used in approximation theory, in particular due to their relation to spectral theory. One usually applies either the Fourier basis or the cosine basis.
The Fourier basis is composed of the constant function F0 ≡ 1 and the functions F2m−1(x) = sin(2mπx) and F2m(x) = cos(2mπx) for m = 1, 2, . . . . These functions are considered on the interval [0, 1] and are all periodic: f(0) = f(1) . Therefore, this basis can only be used for approximation of periodic functions.
The cosine basis is composed of the functions S0 ≡ 1 and Sm(x) = cos(mπx) for m ≥ 1 . These functions are periodic for even m and antiperiodic for odd m , which allows one to approximate functions which are not necessarily periodic.
Orthogonality
Trigonometric identities imply orthogonality
⟨Fm , Fj⟩ = ∫_{0}^{1} Fm(x) Fj(x) dx = 0 ,    j ≠ m.    (3.37)
Also, for m ≥ 1 ,
∫_{0}^{1} Fm^2(x) dx = 1/2 .    (3.38)
Exercise 3.3.19. Check (3.37) and (3.38).
Exercise 3.3.20. Check that
∫_{0}^{1} Sj(x) Sm(x) dx = (1/2) 1I(j = m).
Many nice features of the Chebyshev polynomials can be translated to the cosine basis by a simple change of variable: with u = cos(πx) , it holds Sm(x) = Tm(u) . So, any expansion of a function f(u) by the Chebyshev polynomials yields an expansion of f( cos(πx) ) by the cosine system.
3.4 Piecewise methods and splines
This section discusses piecewise polynomial methods of approximation of the
univariate regression functions.
3.4.1 Piecewise constant estimation
Any continuous function can be locally approximated by a constant. This
naturally leads to the basis consisting of piecewise constant functions. Let
A1 , . . . , AK be a non-overlapping partition of the design space X :
X = ∪_{k=1,...,K} Ak ,    Ak ∩ Ak′ = ∅ for k ≠ k′.    (3.39)
We approximate the function f by a finite sum
f(x) ≈ f(x, θ) = Σ_{k=1}^{K} θk 1I(x ∈ Ak).    (3.40)
Here θ = (θ1 , . . . , θp)> with p = K . A nice feature of this approximation is that the basis indicator functions ψ1 , . . . , ψK are orthogonal because they have non-overlapping supports. For the case of independent errors, this makes the computation of the qMLE θ̃ very simple. In fact, every coefficient θ̃k can be estimated independently of the others. Indeed, the general formula (3.10) yields
θ̃ = argmax_θ L(θ) = argmax_θ Σ_{i=1}^{n} ℓ(Yi − f(Xi, θ)) = argmax_{θ=(θk)} Σ_{k=1}^{K} Σ_{Xi∈Ak} ℓ(Yi − θk).    (3.41)
Exercise 3.4.1. Show that θ̃k can be obtained by the constant approximation of the data Yi for Xi ∈ Ak :
θ̃k = argmax_{θk} Σ_{Xi∈Ak} ℓ(Yi − θk),    k = 1, . . . , K.    (3.42)
A similar formula can be obtained for the target θ∗ = (θ∗k) = argmax_θ IEL(θ) :
θ∗k = argmax_{θk} Σ_{Xi∈Ak} IEℓ(Yi − θk),    k = 1, . . . , K.
The estimator θ̃ can be computed explicitly in some special cases. In particular, if ρ corresponds to the density of a normal distribution, then the resulting estimator θ̃k is nothing but the mean of the observations Yi over the piece Ak . For Laplacian errors, the solution is the median of the observations over Ak . First we consider the case of the Gaussian likelihood.
Theorem 3.4.1. Let `(y) = −y 2 /(2σ 2 ) + R be a log-density of a normal law.
Then for every k = 1, . . . , K
θ̃k = (1/Nk) Σ_{Xi∈Ak} Yi ,    θ∗k = (1/Nk) Σ_{Xi∈Ak} IEYi ,
where Nk stands for the number of design points Xi within the piece Ak :
Nk = Σ_{Xi∈Ak} 1 = #{ i : Xi ∈ Ak }.
Exercise 3.4.2. Check the statements of Theorem 3.4.1.
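As an illustration of Theorem 3.4.1 and of the remark on Laplacian errors, the following Python sketch (ours, not part of the original text) computes the piecewise constant estimates as bin-wise means (Gaussian quasi log-likelihood) or bin-wise medians (Laplacian one) on an equidistant partition of [0, 1] ; the function name and the simulated data are purely illustrative.

import numpy as np

def piecewise_constant_fit(X, Y, K, loss="gaussian"):
    """theta_k per piece A_k of an equidistant partition of [0, 1]: bin means for the
    Gaussian quasi log-likelihood, bin medians for the Laplacian one."""
    edges = np.linspace(0.0, 1.0, K + 1)
    bins = np.clip(np.digitize(X, edges[1:-1]), 0, K - 1)   # index of the piece containing X_i
    estimate = np.full(K, np.nan)
    for k in range(K):
        Yk = Y[bins == k]
        if Yk.size:
            estimate[k] = Yk.mean() if loss == "gaussian" else np.median(Yk)
    return estimate

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)   # f need not be piecewise constant
print(piecewise_constant_fit(X, Y, K=5))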
The properties of each estimator θ̃k repeat those of the MLE for the sample restricted to Ak ; see Section 2.9.1.
Theorem 3.4.2. Let θ̃ be defined by (3.41) for a normal density ρ(y) . Then with θ∗ = (θ∗1 , . . . , θ∗K)> = argmax_θ IEL(θ) , it holds
IE θ̃k = θ∗k ,    Var(θ̃k) = (1/Nk^2) Σ_{Xi∈Ak} Var(Yi).
Moreover,
L(θ̃, θ∗) = Σ_{k=1}^{K} (Nk / (2σ^2)) (θ̃k − θ∗k)^2 .
The statements follow by direct calculation on each interval separately.
If the errors εi = Yi − IEYi are normal and homogeneous, then the distribution of the maximum likelihood L(θ̃, θ∗) is available.
Theorem 3.4.3. Consider a Gaussian regression Yi ∼ N(f (Xi ), σ 2 ) for i =
1, . . . , n . Then θek ∼ N(θk∗ , σ 2 /Nk ) and
L(θ̃, θ∗) = Σ_{k=1}^{K} (Nk / (2σ^2)) (θ̃k − θ∗k)^2 ,    so that 2L(θ̃, θ∗) ∼ χ²_K ,
where χ2K stands for the chi-squared distribution with K degrees of freedom.
This result is again a combination of the results from Section 2.9.1 for the different pieces Ak . It is worth mentioning once again that the regression function f(·) is not assumed to be piecewise constant; it can be an arbitrary function. Each θ̃k estimates the mean θ∗k of f(·) over the design points Xi within Ak .
The results on the behavior of the maximum likelihood L(θ̃, θ∗) are often used for studying the properties of the chi-squared test; see Section 7.1 for more details.
A choice of the partition is an important issue in the piecewise constant approximation. The presented results indicate that the variance of the estimate θ̃k of θ∗k is inversely proportional to the number of points Nk within
each piece Ak . In the univariate case one usually applies the equidistant
partition: the design interval is split into p equal intervals Ak leading to approximately equal values Nk . Sometimes, especially if the design is irregular,
a non-uniform partition can be preferable. In general it can be recommended
to split the whole design space into intervals with approximately the same
number Nk of design points Xi .
A constant approximation is often not accurate enough for a regular regression function. One often uses a linear or polynomial approximation. The
next sections explain this approach for the case of a univariate regression.
3.4.2 Piecewise linear univariate estimation
The piecewise constant approximation can be naturally extended to piecewise
linear and piecewise polynomial construction. The starting point is again a
non-overlapping partition of X into intervals Ak for k = 1, . . . , K . First we
explain the idea for the linear approximation of the function f on each interval
Ak . Any linear function on Ak can be represented in the form ak + ck x
with some coefficients ak , ck . This yields in total p = 2K coefficients: θ =
(a1 , c1 , . . . , aK , cK )> . The corresponding function f (·, θ) can be represented
as
114
3 Regression Estimation
f (x) ≈ f (x, θ) =
K
X
(ak + ck x) 1I(x ∈ Ak ).
k=1
The non-overlapping structure of the sets Ak yields orthogonality of basis
functions for different pieces. As a corollary, one can optimize the linear approximation on every interval Ak independently of the others.
Exercise 3.4.3. Show that ãk , c̃k can be obtained by the linear approximation of the data Yi for Xi ∈ Ak :
(ãk , c̃k) = argmax_{(ak,ck)} Σ_i ℓ(Yi − ak − ck Xi) 1I(Xi ∈ Ak),    k = 1, . . . , K.
On every piece Ak , the constant and the linear function x are not orthogonal except in some very special situations. However, one can easily achieve
orthogonality by a shift of the linear term.
Exercise 3.4.4. For each k ≤ K , there exists a point xk such that
Σ_i (Xi − xk) 1I(Xi ∈ Ak) = 0.    (3.43)
Introduce for each k ≤ K two basis functions φj−1 (x) = 1I(x ∈ Ak ) and
φj (x) = (x − xk ) 1I(x ∈ Ak ) with j = 2k .
Exercise 3.4.5. Assume (3.43) for each k ≤ K . Check that any piecewise
linear function can be uniquely represented in the form
f(x) = Σ_{j=1}^{p} θj φj(x)
with p = 2K and the functions φj orthogonal in the sense that for j ≠ j′
Σ_{i=1}^{n} φj(Xi) φj′(Xi) = 0.
In addition, for each k ≤ K
‖φj‖² = Σ_{i=1}^{n} φj²(Xi) = Nk for j = 2k − 1 , and = Vk² for j = 2k ,
where Nk = Σ_{Xi∈Ak} 1 and Vk² = Σ_{Xi∈Ak} (Xi − xk)² .
In the case of Gaussian regression, orthogonality of the basis helps to gain a simple closed form for the estimators θ̃ = (θ̃j) :
θ̃j = (1/‖φj‖²) Σ_{i=1}^{n} Yi φj(Xi) = (1/Nk) Σ_{Xi∈Ak} Yi for j = 2k − 1 , and = (1/Vk²) Σ_{Xi∈Ak} Yi (Xi − xk) for j = 2k ;
see Section 4.2 in the next chapter for a comprehensive study.
3.4.3 Piecewise polynomial estimation
Local linear expansion of the function f (x) on each piece Ak can be extended
to a piecewise polynomial case. The basic idea is to apply a polynomial approximation of a certain degree q on each piece Ak independently. One can
use for each piece a basis of the form (x − xk )m 1I(x ∈ Ak ) for m = 0, 1, . . . , q
with xk from (3.43) yielding the approximation
f(x) 1I(x ∈ Ak) ≈ f(x, ak) 1I(x ∈ Ak) ,    i.e.,
f(x) ≈ f(x, θ) = Σ_{k=1}^{K} ( a0,k + a1,k (x − xk) + . . . + aq,k (x − xk)^q ) 1I(x ∈ Ak)
for ak = (a0,k , a1,k , . . . , aq,k)> and θ = (a1 , . . . , aK) . This involves q + 1 parameters for each piece and p = K(q + 1) parameters in total. A nice feature of the piecewise approach is that the coefficients ak of the piecewise polynomial approximation can be estimated independently for each piece. Namely,
ãk = argmin_a Σ_{Xi∈Ak} ( Yi − f(Xi, a) )^2 .
The properties of this estimator will be discussed in detail in Chapter 4.
3.4.4 Spline estimation
The main drawback of the piecewise polynomial approximation is that the
resulting function f is discontinuous at the edge points between different
pieces. A natural way of improving the boundary effect is to force some conditions on the boundary behavior. One important special case is given by the
spline system. Let X be an interval on the real line, perhaps infinite. Let also
t0 < t1 < . . . < tK be some ordered points in X such that t0 is the left
edge and tK the right edge of X . Such points are called knots. We say that
a function f is a spline of degree q at knots (tk) if it is a polynomial of degree q on each span (tk−1 , tk) for k = 1, . . . , K and satisfies the boundary conditions
f^{(m)}(tk−) = f^{(m)}(tk+),    m = 0, . . . , q − 1,    k = 1, . . . , K − 1.
Here f^{(m)}(t−) stands for the left derivative of f at t . In words, the function f and its first q − 1 derivatives are continuous on X and only the q th derivative may have discontinuities at the knots tk . It is obvious that the q th derivative f^{(q)}(t) of a spline of degree q is a piecewise constant function on the spans Ak = [tk−1 , tk) .
The spline is called uniform if the knots are equidistant, or, in other words,
if all the spans Ak have equal length. Otherwise it is non-uniform.
Lemma 3.4.1. The set of all splines of degree q at knots (tk) is a linear space, that is, any linear combination of such splines is again a spline. Any function having a continuous m th derivative for each m < q and a piecewise constant q th derivative on the spans Ak is a q -spline.
Splines of degree zero are just piecewise constant functions studied in
Section 3.4.1. Linear splines are particularly transparent: this is the set of
all piecewise linear continuous functions on X . Each of them can be easily
constructed from left to right or from right to left: start with a linear function
a1 + c1 x on the piece A1 = [t0 , t1] . Then f(t1) = a1 + c1 t1 . On the piece A2 the slope of f can be changed to c2 , leading to the function f(x) = f(t1) + c2 (x − t1) for x ∈ [t1 , t2] . Similarly, at t2 the slope of f can change to c3 , yielding f(x) = f(t2) + c3 (x − t2) on A3 , and so on. Splines of higher
order can be constructed similarly step by step: one fixes the polynomial form
on the very first piece A1 and then continues the spline function to every next
piece Ak using the boundary conditions and the value of the q th derivative
of f on Ak . This construction explains the next result.
Lemma 3.4.2. Each spline f of degree q and knots (tk ) is uniquely described by the vector of coefficients a1 on the first span and the values f (q) (x)
for each span A1 , . . . , AK .
This result shows that the dimension of the linear space of q -splines is q + K . One possible basis in this space is given by the monomials x^m for m = 0, 1, . . . , q and the functions φk(x) = (x − tk)^q_+ for k = 1, . . . , K − 1 .
Exercise 3.4.6. Check that these q + K functions form a basis in the linear spline space, and any q -spline f can be represented as
f(x) = Σ_{m=0}^{q} αm x^m + Σ_{k=1}^{K−1} θk φk(x).    (3.44)
Hint: check that the basis functions are linearly independent and that each q th derivative φ^{(q)}_k(x) is piecewise constant.
B–splines
Unfortunately, the basis functions {φk(x)} with φk(x) = (x − tk)^q_+ are only useful for theoretical study. The main problem is that the functions φk(x) are strongly correlated, and recovering the coefficients θk in the expansion (3.44) is a hard numerical task. For this reason, one often uses another basis called B–splines. The idea is to build splines of the given degree with minimal support. Each B–spline basis function bk,q(x) is non-zero only on the q + 1 neighboring spans Ak , Ak+1 , . . . , Ak+q for k = 1, . . . , K − q .
Exercise 3.4.7. Let f(x) be a q -spline with support on q′ < q neighboring spans Ak , Ak+1 , . . . , Ak+q′−1 . Then f(x) ≡ 0 .
Hint: consider any spline of the form f(x) = Σ_{j=k}^{k+q′−1} cj φj(x) . Show that the boundary conditions f^{(m)}(tk+q′) = 0 for m = 0, 1, . . . , q yield cj ≡ 0 .
The basis B–spline functions can be constructed successively. For q = 0 , the B–splines bk,0(x) coincide with the functions φk(x) = 1I(x ∈ Ak) , k = 1, . . . , K . Each linear B–spline bk,1(x) has a triangle shape on the two connected intervals Ak and Ak+1 . It can be defined by the formula
bk,1(x) = ((x − tk−1)/(tk − tk−1)) bk,0(x) + ((tk+1 − x)/(tk+1 − tk)) bk+1,0(x),    k = 1, . . . , K − 1.
One can continue this way, leading to the Cox–de Boor recursion formula
bk,m(x) = ((x − tk−1)/(tk+m−1 − tk−1)) bk,m−1(x) + ((tk+m − x)/(tk+m − tk)) bk+1,m−1(x)
for k = 1, . . . , K − m .
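A minimal Python sketch of this recursion (ours, not from the text), following the indexing used here: the knots are t0 , . . . , tK and bk,0 is the indicator of Ak = [tk−1 , tk) ; the function name b_spline and the toy knot vector are assumptions.

import numpy as np

def b_spline(k, m, x, t):
    """b_{k,m}(x) via the Cox-de Boor recursion above; t = [t_0, ..., t_K] are the knots
    and b_{k,0} is the indicator of A_k = [t_{k-1}, t_k). Indexing follows the text (k >= 1)."""
    if m == 0:
        return np.where((t[k - 1] <= x) & (x < t[k]), 1.0, 0.0)
    left = (x - t[k - 1]) / (t[k + m - 1] - t[k - 1]) * b_spline(k, m - 1, x, t)
    right = (t[k + m] - x) / (t[k + m] - t[k]) * b_spline(k + 1, m - 1, x, t)
    return left + right

t = np.linspace(0.0, 1.0, 6)               # K = 5 equidistant spans
x = np.linspace(0.0, 1.0, 201)
b = b_spline(1, 3, x, t)                   # cubic B-spline, non-zero on A_1, ..., A_4
print(round(b.max(), 3))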
Exercise 3.4.8. Check by induction for each function bk,m(x) the following conditions:
1. bk,m(x) is a polynomial of degree m on each span Ak , . . . , Ak+m and zero outside;
2. bk,m(x) can be uniquely represented as a sum bk,m(x) = Σ_{l=0}^{m−1} cl,k φk+l(x) ;
3. bk,m(x) is an m -spline.
The formulas simplify for uniform splines with equal span length ∆ = |Ak| :
bk,m(x) = ((x − tk−1)/(m∆)) bk,m−1(x) + ((tk+m − x)/(m∆)) bk+1,m−1(x)
for k = 1, . . . , K − m .
Exercise 3.4.9. Check that
bk,m(x) = Σ_{l=0}^{m} ωl,m φk+l(x)
with
ωl,m = (−1)^l / ( ∆^m l! (m − l)! ) .
Smoothing splines
Such a spline system naturally arises as a solution of a penalized maximum
likelihood problem. Suppose we are given the regression data (Yi , Xi ) with the
univariate design X1 ≤ X2 ≤ . . . ≤ Xn . Consider the mean regression model
Yi = f (Xi ) + εi with zero mean errors εi . The assumption of independent
homogeneous Gaussian errors leads to the Gaussian log-likelihood
L(f) = − Σ_{i=1}^{n} ( Yi − f(Xi) )^2 / (2σ^2).    (3.45)
Maximization of this expression w.r.t. all possible functions f or, equivalently, all vectors ( f(X1), . . . , f(Xn) )> results in the trivial solution f(Xi) = Yi . This means that the full dimensional maximum likelihood estimator perfectly reproduces the original noisy data. Some additional assumptions are needed to enforce any desirable feature of the reconstructed function. One popular example is given by smoothness of the function f . The degree of smoothness (or, inversely, the degree of roughness) can be measured by the value
Rq(f) = ∫_X ( f^{(q)}(x) )^2 dx.    (3.46)
One can try to optimize the fit (3.45) subject to the constraint on the amount
of roughness from (3.46). Equivalently, one can optimize the penalized loglikelihood
Lλ(f) = L(f) − λ Rq(f) = − (1/(2σ^2)) Σ_{i=1}^{n} ( Yi − f(Xi) )^2 − λ ∫_X ( f^{(q)}(x) )^2 dx ,
where λ > 0 is a Lagrange multiplier. The corresponding maximizer is the penalized maximum likelihood estimator:
f̃λ = argmax_f Lλ(f),    (3.47)
where the maximum is taken over the class of all measurable functions. It is
remarkable that the solution of this optimization problem is a spline of degree
q with the knots X1 , . . . , Xn .
Theorem 3.4.4. For any λ > 0 and any integer q , the problem (3.47) has
a unique solution which is a q -spline with knots at design points (Xi ) .
For the proof we refer to Green and Silverman (1994). Due to this result, one can simplify the problem and look for a spline f which maximizes the objective Lλ(f) . A solution to (3.47) is called a smoothing spline. If f is a q -spline, the integral Rq(f) can be easily computed. Indeed, f^{(q)}(x) is piecewise constant, that is, f^{(q)}(x) = ck for x ∈ Ak , and
Rq(f) = Σ_{k=1}^{K} ck^2 |tk − tk−1|.
For a uniform design, the formula simplifies even more, and after a change of the multiplier λ , one can use Rq(f) = Σ_k ck^2 . The use of any parametric representation of a spline function f allows one to represent the optimization problem (3.47) as a penalized least squares problem. Estimation and inference in such problems are studied below in Section 4.7.
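To make the last remark concrete, here is a small sketch (ours, not part of the text) of the penalized least squares problem that arises once f is written as f(x) = Σ_j θj φj(x) : maximizing Lλ(f) is then equivalent to minimizing ‖Y − Ψ>θ‖² + λ′ θ>Gθ with a rescaled multiplier λ′ = 2σ²λ and a penalty matrix G encoding Rq . The basis (cubic monomials), the matrix G , and all names are illustrative assumptions.

import numpy as np

def penalized_lse(Psi, Y, G, lam):
    """theta_hat = argmin ||Y - Psi^T theta||^2 + lam * theta^T G theta.

    Psi is the p x n design matrix with columns Psi_i; G is a symmetric positive
    semi-definite p x p penalty matrix encoding the roughness R_q."""
    A = Psi @ Psi.T + lam * G              # normal equations of the penalized problem
    return np.linalg.solve(A, Psi @ Y)

# toy usage: cubic monomial basis on [0, 1] with the exact second-derivative penalty
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 1.0, 50))
Y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(50)
Psi = np.vstack([X**m for m in range(4)])              # rows = basis vectors
G = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 6.0],
              [0.0, 0.0, 6.0, 12.0]])                  # theta^T G theta = int_0^1 (f'')^2 dx here
print(penalized_lse(Psi, Y, G, lam=0.1))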
3.5 Generalized regression
Let the response Yi be observed at the design point Xi ∈ IRd , i = 1, . . . , n .
A (mean) regression model assumes that the observed values Yi are independent and can be decomposed into the systematic component f (Xi ) and the
individual centered stochastic error εi . In some cases such a decomposition is
questionable. This especially concerns the case when the data Yi are categorical, e.g. binary or discrete. Another striking example is given by nonnegative
observations Yi . In such cases one usually assumes that the distribution of
Yi belongs to some given parametric family (Pυ , υ ∈ U) and only the parameter of this distribution depends on the design point Xi . We denote this
parameter value as f (Xi ) ∈ U and write the model in the form
Yi ∼ Pf (Xi ) .
As previously, f (·) is called a regression function and its values at the design
points Xi completely specify the joint data distribution:
Y ∼ ∏_i Pf(Xi) .
Below we assume that (Pυ) is a univariate exponential family with the log-density ℓ(y, υ) .
The parametric modeling approach assumes that the regression function f can be specified by a finite-dimensional parameter θ ∈ Θ ⊂ IRp : f(x) = f(x, θ) . As usual, by θ∗ we denote the true parameter value. The log-likelihood function for this model reads
L(θ) = Σ_i ℓ( Yi , f(Xi, θ) ).
The corresponding MLE θ̃ maximizes L(θ) :
θ̃ = argmax_θ Σ_i ℓ( Yi , f(Xi, θ) ).
The estimating equation ∇L(θ) = 0 reads as
Σ_i ℓ′( Yi , f(Xi, θ) ) ∇f(Xi, θ) = 0 ,
where ℓ′(y, υ) = ∂ℓ(y, υ)/∂υ .
The approach essentially depends on the parametrization of the considered EF. Usually one applies either the natural or the canonical parametrization. In the case of the natural parametrization, ℓ(y, υ) = C(υ) y − B(υ) , where the functions C(·), B(·) satisfy B′(υ) = υ C′(υ) . This implies ℓ′(y, υ) = y C′(υ) − B′(υ) = (y − υ) C′(υ) and the estimating equation reads as
Σ_i ( Yi − f(Xi, θ) ) C′( f(Xi, θ) ) ∇f(Xi, θ) = 0 .
Unfortunately, a closed form solution for this equation exists only in very
special cases. Even the questions of existence and uniqueness of the solution
cannot be studied in whole generality. Some numerical algorithms are usually
applied to solve the estimating equation.
Exercise 3.5.1. Specify the estimating equation for generalized EFn regression and find the solution for the case of the constant regression function
f (Xi , θ) ≡ θ .
Hint: If f (Xi , θ) ≡ θ , then the Yi are i.i.d. from Pθ .
The equation can be slightly simplified by using the canonical parametrization. If (Pυ ) is an EFc with the log-density `(y, υ) = yυ − d(υ) , then the
log-likelihood L(θ) can be represented in the form
L(θ) = Σ_i [ Yi f(Xi, θ) − d( f(Xi, θ) ) ].
The corresponding estimating equation is
Σ_i ( Yi − d′( f(Xi, θ) ) ) ∇f(Xi, θ) = 0.
Exercise 3.5.2. Specify the estimating equation for generalized EFc regression and find the solution for the case of constant regression with f (Xi , υ) ≡
υ . Relate the natural and canonical representation.
A generalized regression with a canonical link is often applied in combination
with linear modeling of the regression function considered in the next section.
3.5.1 Generalized linear models
Consider the generalized regression model
Yi ∼ Pf (Xi ) ∈ P.
In addition we assume a linear (in parameters) structure of the regression
function f (X) . Such modeling is particularly useful to combine with the
canonical parametrization of the considered EF with the log-density `(y, υ) =
yυ−d(υ) . The reason is that the stochastic part in the log-likelihood of an EFc
linearly depends on the parameter. So, below we assume that P = (Pυ , υ ∈ U)
is an EFc.
Linear regression f(Xi) = Ψi>θ with given feature vectors Ψi ∈ IRp leads to the model with the log-likelihood
L(θ) = Σ_i [ Yi Ψi>θ − d( Ψi>θ ) ].
Such a setup is called a generalized linear model (GLM). Note that the log-likelihood can be represented as
L(θ) = S>θ − A(θ),
where
S = Σ_i Yi Ψi ,    A(θ) = Σ_i d( Ψi>θ ).
The corresponding MLE θ̃ maximizes L(θ) . Again, a closed form solution only exists in special cases. However, an important advantage of the GLM approach is that the solution always exists and is unique. The reason is that the log-likelihood function L(θ) is concave in θ .
Lemma 3.5.1. The MLE θ̃ solves the following estimating equation:
∇L(θ) = S − ∇A(θ) = Σ_i ( Yi − d′(Ψi>θ) ) Ψi = 0.    (3.48)
The solution exists and is unique.
Proof. Define the matrix
B(θ) = Σ_i d″(Ψi>θ) Ψi Ψi> .    (3.49)
Since d″(υ) is strictly positive for all υ , the matrix B(θ) is positive definite as well. It holds
∇²L(θ) = −∇²A(θ) = − Σ_i d″(Ψi>θ) Ψi Ψi> = −B(θ).
Thus, the function L(θ) is strictly concave w.r.t. θ and the estimating equation ∇L(θ) = S − ∇A(θ) = 0 has the unique solution θ̃ .
The solution of (3.48) can be easily obtained numerically by the Newton–Raphson algorithm: select an initial estimate θ^(0) . Then for every k ≥ 0 apply
θ^(k+1) = θ^(k) + [ B(θ^(k)) ]^{−1} ( S − ∇A(θ^(k)) )    (3.50)
until convergence.
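A minimal Python sketch of the iteration (3.50) (ours, not from the text): it takes the derivatives d′ and d″ of the canonical link as arguments, so the same routine covers the logit and Poisson examples below; all names are illustrative.

import numpy as np

def glm_newton_raphson(Psi, Y, d_prime, d_double_prime,
                       theta0=None, tol=1e-8, max_iter=100):
    """Iterate (3.50): theta <- theta + B(theta)^{-1} (S - grad A(theta)).

    Psi is the p x n matrix with columns Psi_i, Y the response vector,
    d_prime and d_double_prime the first two derivatives of d(.) in the EFc log-density."""
    p, n = Psi.shape
    S = Psi @ Y
    theta = np.zeros(p) if theta0 is None else np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        eta = Psi.T @ theta                      # linear predictors Psi_i^T theta
        grad = S - Psi @ d_prime(eta)            # S - grad A(theta)
        B = (Psi * d_double_prime(eta)) @ Psi.T  # sum_i d''(eta_i) Psi_i Psi_i^T
        step = np.linalg.solve(B, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta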
Below we consider two special cases of GLMs for binary and Poissonian
data.
3.5.2 Logit regression for binary data
Suppose that the observed data Yi are independent and binary, that is, each
Yi is either zero or one, i = 1, . . . , n . Such models are often used in, e.g., sociological and medical studies, two-class classification, and binary imaging, among many other fields. We treat each Yi as a Bernoulli r.v. with the corresponding parameter fi = f(Xi) . This is a special case of generalized regression, also called a binary response model. The parametric modeling assumption means that the regression function f(·) can be represented in the form f(Xi) = f(Xi, θ) for a given class of functions {f(·, θ), θ ∈ Θ ⊂ IRp} . Then the log-likelihood L(θ) reads as
L(θ) = Σ_i ℓ( Yi , f(Xi, θ) ),    (3.51)
where ℓ(y, υ) is the log-density of the Bernoulli law. For linear modeling, it is more useful to work with the canonical parametrization. Then ℓ(y, υ) = yυ − log(1 + e^υ) , and the log-likelihood reads
L(θ) = Σ_i [ Yi f(Xi, θ) − log( 1 + e^{f(Xi,θ)} ) ].
In particular, if the regression function f(·, θ) is linear, that is, f(Xi, θ) = Ψi>θ , then
L(θ) = Σ_i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ].    (3.52)
The corresponding estimate reads as
θ̃ = argmax_θ L(θ) = argmax_θ Σ_i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ] .
This modeling is usually referred to as logit regression.
Exercise 3.5.3. Specify the estimating equation for the case of logit regression.
Exercise 3.5.4. Specify the step of the Newton-Raphson procedure for the
case of logit regression.
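For the reader who wants to experiment, the logit case plugs d(υ) = log(1 + e^υ) , hence d′(υ) = e^υ/(1 + e^υ) and d″(υ) = d′(υ)(1 − d′(υ)) , into the Newton–Raphson sketch given after (3.50); the Python snippet below (ours) uses simulated data and purely illustrative names.

import numpy as np

sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))          # d'(v) for the logit model

rng = np.random.default_rng(2)
n, p = 500, 3
Psi = np.vstack([np.ones(n), rng.standard_normal((p - 1, n))])   # columns Psi_i, first row = intercept
theta_true = np.array([-0.5, 1.0, 2.0])
Y = rng.binomial(1, sigmoid(Psi.T @ theta_true))

theta_hat = glm_newton_raphson(Psi, Y, d_prime=sigmoid,
                               d_double_prime=lambda eta: sigmoid(eta) * (1.0 - sigmoid(eta)))
print(np.round(theta_hat, 2))                              # should be close to theta_true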
3.5.3 Parametric Poisson regression
Suppose that the observations Yi are nonnegative integer numbers. The Poisson distribution is a natural candidate for modeling such data. It is supposed
that the underlying Poisson parameter depends on the regressor Xi . Typical
examples arise in different types of imaging including medical positron emission and magnet resonance tomography, satellite and low-luminosity imaging,
queueing theory, high frequency trading, etc. The regression equation reads
Yi ∼ Poisson(f (Xi )).
The Poisson regression function f(Xi) is usually the target of estimation. The parametric specification f(·) ∈ { f(·, θ), θ ∈ Θ } reduces this problem to estimating the parameter θ . Under the assumption of independent observations Yi , the corresponding log-likelihood L(θ) is given by
L(θ) = Σ_i [ Yi log f(Xi, θ) − f(Xi, θ) ] + R,
where the remainder R does not depend on θ and can be omitted. Obviously, the constant function family f(·, θ) ≡ θ leads back to the case of i.i.d. modeling studied in Section 2.11. A further extension is given by linear Poisson regression: f(Xi, θ) = Ψi>θ for some given factors Ψi . The corresponding log-likelihood reads
L(θ) = Σ_i [ Yi log(Ψi>θ) − Ψi>θ ].    (3.53)
Exercise 3.5.5. Specify the estimating equation and the Newton-Raphson
procedure for the linear Poisson regression (3.53).
An obvious problem of linear Poisson modeling is that it requires all the values Ψi>θ to be positive. The use of the canonical parametrization helps to avoid this problem. The linear structure is assumed for the canonical parameter, leading to the representation f(Xi) = exp(Ψi>θ) . Then the general log-likelihood process L(θ) from (3.51) translates into
L(θ) = Σ_i [ Yi Ψi>θ − exp(Ψi>θ) ];    (3.54)
cf. (3.52).
Exercise 3.5.6. Specify the estimating equation and the Newton-Raphson
procedure for the canonical link linear Poisson regression (3.54).
If the factors Ψi are properly scaled, then the scalar products Ψi>θ for all i and all θ ∈ Θ0 belong to some bounded interval. For the matrix B(θ) from (3.49), it holds
B(θ) = Σ_i exp(Ψi>θ) Ψi Ψi> .
Initializing the ML optimization problem with θ = 0 leads to the oLSE
θ̃^(0) = ( Σ_i Ψi Ψi> )^{−1} Σ_i Ψi Yi .
The further steps of the algorithm (3.50) can be done as a weighted LSE with the weights exp( Ψi>θ̃^(k) ) for the estimate θ̃^(k) obtained at the previous step.
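In the notation of the sketch given after (3.50), the canonical-link Poisson model (3.54) corresponds to d(υ) = e^υ , so d′ = d″ = exp ; a hypothetical call could look as follows (simulated data, illustrative names, reusing the glm_newton_raphson sketch above).

import numpy as np

rng = np.random.default_rng(3)
n = 400
Psi = np.vstack([np.ones(n), rng.uniform(-1.0, 1.0, n)])     # columns Psi_i
theta_true = np.array([0.2, 1.0])
Y = rng.poisson(np.exp(Psi.T @ theta_true))

# canonical link: d(v) = exp(v), hence d' = d'' = exp
theta_hat = glm_newton_raphson(Psi, Y, d_prime=np.exp, d_double_prime=np.exp)
print(np.round(theta_hat, 2))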
3.5.4 Piecewise constant methods in generalized regression
Consider a generalized regression model
Yi ∼ Pf (Xi ) ∈ (Pυ )
for a given exponential family (Pυ ) . Further, let A1 , . . . , AK be a nonoverlapping partition of the design space X ; see (3.39). A piecewise constant
approximation (3.40) of the regression function f (·) leads to the additive
log-likelihood structure: for θ = (θ1 , . . . , θK )>
θ̃ = argmax_θ L(θ) = argmax_{θ1,...,θK} Σ_{k=1}^{K} Σ_{Xi∈Ak} ℓ(Yi , θk);
cf. (3.41). Similarly to the mean regression case, the global optimization w.r.t. the vector θ can be decomposed into K separate simple optimization problems:
θ̃k = argmax_θ Σ_{Xi∈Ak} ℓ(Yi , θ);
cf. (3.42). The same decomposition can be obtained for the target θ∗ = (θ∗1 , . . . , θ∗K)> :
θ∗ = argmax_θ IEL(θ) = argmax_{θ1,...,θK} Σ_{k=1}^{K} Σ_{Xi∈Ak} IEℓ(Yi , θk).
The properties of each estimator θ̃k repeat those of the qMLE for a univariate EFn; see Section 2.11.
Theorem 3.5.1. Let ℓ(y, θ) = C(θ) y − B(θ) be the log-density of an EFn, so that the functions B(θ) and C(θ) satisfy B′(θ) = θ C′(θ) . Then for every k = 1, . . . , K
θ̃k = (1/Nk) Σ_{Xi∈Ak} Yi ,    θ∗k = (1/Nk) Σ_{Xi∈Ak} IEYi ,
where Nk stands for the number of design points Xi within the piece Ak :
Nk = Σ_{Xi∈Ak} 1 = #{ i : Xi ∈ Ak }.
Moreover, it holds IE θ̃k = θ∗k and
L(θ̃, θ∗) = Σ_{k=1}^{K} Nk K(θ̃k , θ∗k),    (3.55)
where K(θ, θ′) = Eθ [ ℓ(Yi , θ) − ℓ(Yi , θ′) ] .
These statements follow from Theorem 3.5.1 and Theorem 2.11.1 of Section 2.11. For the presented results, the true regression function f(·) can be of arbitrary structure, and the true distribution of each Yi can differ from Pf(Xi) .
Exercise 3.5.7. Check the statements of Theorem 3.5.1.
If PA is correct, that is, if f is indeed piecewise constant and the distribution of Yi is indeed Pf(Xi) , the deviation bound for the excess L(θ̃k , θ∗k) from Theorem 2.11.4 can be applied to each piece Ak , yielding the following result.
Theorem 3.5.2. Let (Pθ) be an EFn and let Yi ∼ Pθk for Xi ∈ Ak and k = 1, . . . , K . Then for any z > 0
IP( L(θ̃, θ∗) > Kz ) ≤ 2K e^{−z} .
Proof. By (3.55) and Theorem 2.11.4
IP( L(θ̃, θ∗) > Kz ) = IP( Σ_{k=1}^{K} Nk K(θ̃k , θ∗k) > Kz ) ≤ Σ_{k=1}^{K} IP( Nk K(θ̃k , θ∗k) > z ) ≤ 2K e^{−z}
and the result follows.
A piecewise linear generalized regression can be treated in a similar way.
The main benefit of piecewise modeling remains preserved: a global optimization over the vector θ can be decomposed into a set of small optimization
problems for each piece Ak . However, a closed form solution is available only
in some special cases like Gaussian regression.
3.5.5 Smoothing splines for generalized regression
Consider again the generalized regression model
Yi ∼ Pf (Xi ) ∈ P
for an exponential family P with canonical parametrization. Now we do not
assume any specific parametric structure for the function f . Instead, the
function f is supposed to be smooth and its smoothness is measured by the
roughness Rq (f ) from (3.46). Similarly to the regression case of Section 3.4.4,
the function f can be estimated directly by optimizing the penalized log-likelihood Lλ(f) :
f̃λ = argmax_f Lλ(f) = argmax_f { L(f) − λ Rq(f) }
    = argmax_f { Σ_i [ Yi f(Xi) − d( f(Xi) ) ] − λ ∫_X ( f^{(q)}(x) )^2 dx }.    (3.56)
The maximum is taken over the class of all regular q -times differentiable functions. In the regression case, the function d(·) is quadratic and the solution is a spline function with knots Xi . This conclusion can be extended to the case of any convex function d(·) ; thus, the problem (3.56) yields a smoothing spline solution. Numerically this problem is usually solved by iterations. One starts with a quadratic function d(υ) = υ²/2 to obtain an initial approximation f̃^(0)(·) of f(·) by a standard smoothing spline regression. Further, at each new step k + 1 , the use of the estimate f̃^(k)(·) from the previous step k for k ≥ 0 helps to approximate the problem (3.56) by a weighted regression.
The corresponding iterations can be written in the form (3.50).
3.6 Historical remarks and further reading
A nice introduction to the use of smoothing splines in statistics can be found in Green and Silverman (1994) and Wahba (1990). For further properties of spline approximation and the algorithmic use of splines see de Boor (2001). Orthogonal polynomials have a long history and have been applied in many different fields of mathematics. We refer to Szegö (1939) and Chihara (2011) for the classical results and history around different polynomial systems. Some further methods in regression estimation and their features are described, e.g., in Lehmann and Casella (1998), Fan and Gijbels (1996), and Wasserman (2006).
4 Estimation in linear models
This chapter studies the estimation problem for a linear model. The first four
sections are fairly classical and the presented results are based on the direct
analysis of the linear estimation procedures. Section 4.5 and 4.6 reproduce in
a very short form the same results but now based on the likelihood analysis.
The presentation is based on the celebrated chi-squared phenomenon which
appears to be the fundamental fact yielding the exact likelihood based concentration and confidence properties. The further sections are complementary
and can be recommended for a more profound reading. The issues like regularization, shrinkage, smoothness and roughness are usually studied within
the nonparametric theory, here we try to fit them to the classical linear parametric setup. A special focus is on semiparametric estimation in Section 4.9.
In particular, efficient estimation and chi-squared result are extended to the
semiparametric framework.
The main tool of the study is the quasi maximum likelihood method.
We especially focus on the validity of the presented results under possible
model misspecification. Another important issue is the way of measuring the
estimation loss and risk. We distinguish below between response estimation
or prediction and the parameter estimation. The most advanced results, like the chi-squared result in Section 4.6, are established under the assumption of Gaussian noise. However, a misspecification of the noise structure is allowed and
addressed.
4.1 Modeling assumptions
A linear model assumes that the observations Yi follow the equation:
Yi = Ψi>θ∗ + εi    (4.1)
for i = 1, . . . , n , where θ ∗ = (θ1∗ , . . . , θp∗ )> ∈ IRp is an unknown parameter
vector, Ψi are given vectors in IRp and the εi ’s are individual errors with
zero mean. A typical example is given by linear regression (see Section 3.3)
when the vectors Ψi are the values of a set of functions (e.g polynomial,
trigonometric) series at the design points Xi .
A linear Gaussian model assumes in addition that the vector of errors
ε = (ε1 , . . . εn )> is normally distributed with zero mean and a covariance
matrix Σ :
ε ∼ N(0, Σ).
In this chapter we suppose that Σ is given in advance. We will distinguish
between three cases:
1. the errors εi are i.i.d. N(0, σ 2 ) , or equivalently, the matrix Σ is equal to
σ 2 In with In being the unit matrix in IRn .
2. the errors are independent but not homogeneous, that is, IEεi² = σi² . Then the matrix Σ is diagonal: Σ = diag(σ1², . . . , σn²) .
3. the errors εi are dependent with a covariance matrix Σ .
In practical applications one mostly starts with the white Gaussian noise
assumption and more general cases 2 and 3 are only considered if there are
clear indications of the noise inhomogeneity or correlation. The second situation is typical e.g. for the eigenvector decomposition in an inverse problem.
The last case is the most general and includes the first two.
4.2 Quasi maximum likelihood estimation
Denote by Y = (Y1 , . . . , Yn )> (resp. ε = (ε1 , . . . , εn )> ) the vector of observations (resp. of errors) in IRn and by Ψ the p × n matrix with columns Ψi .
Let also Ψ> denote its transpose. Then the model equation can be rewritten
as:
Y = Ψ> θ ∗ + ε,
ε ∼ N(0, Σ).
An equivalent formulation is that Σ−1/2 (Y − Ψ> θ) is a standard normal vector in IRn . The log-density of the distribution of the vector Y =
(Y1 , . . . , Yn )> w.r.t. the Lebesgue measure in IRn is therefore of the form
L(θ) = − (n/2) log(2π) − (1/2) log det Σ − (1/2) ‖Σ^{−1/2}(Y − Ψ>θ)‖²
     = − (n/2) log(2π) − (1/2) log det Σ − (1/2) (Y − Ψ>θ)> Σ^{−1} (Y − Ψ>θ).
In case 1 this expression can be rewritten as
L(θ) = − (n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (Yi − Ψi>θ)² .
In case 2 the expression is similar:
L(θ) = − Σ_{i=1}^{n} { (1/2) log(2πσi²) + (Yi − Ψi>θ)² / (2σi²) } .
The maximum likelihood estimate (MLE) θ̃ of θ∗ is defined by maximizing the log-likelihood L(θ) :
θ̃ = argmax_{θ∈IRp} L(θ) = argmin_{θ∈IRp} (Y − Ψ>θ)> Σ^{−1} (Y − Ψ>θ).    (4.2)
We omit the other terms in the expression of L(θ) because they do not
depend on θ . This estimate is the least squares estimate (LSE) because it
minimizes the sum of squared distances between the observations Yi and the
linear responses Ψi>θ . Note that (4.2) is a quadratic optimization problem which has a closed form solution. Differentiating the right-hand side of (4.2) w.r.t. θ yields the normal equation
ΨΣ^{−1}Ψ> θ̃ = ΨΣ^{−1}Y .
If the p × p matrix ΨΣ^{−1}Ψ> is non-degenerate, then the normal equation has the unique solution
θ̃ = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = SY ,    (4.3)
where
S = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}
is a p × n matrix. We denote by θ̃m , m = 1, . . . , p , the entries of the vector θ̃ . If the matrix ΨΣ^{−1}Ψ> is degenerate, then the normal equation has infinitely many solutions. However, one can still apply the formula (4.3) where ( ΨΣ^{−1}Ψ> )^{−1} is a pseudo-inverse of the matrix ΨΣ^{−1}Ψ> .
The ML approach leads to the parameter estimate θ̃ . Note that due to the model (4.1), the product f̃ = Ψ>θ̃ is an estimate of the mean f∗ = IEY of the vector of observations Y :
f̃ = Ψ>θ̃ = Ψ> ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = ΠY ,
where
Π = Ψ> ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}
is an n × n matrix (linear operator) in IRn . The vector f̃ is called a prediction
or response regression estimate.
Below we study the properties of the estimates θ̃ and f̃ . In this study we try to address both types of possible model misspecification: due to a wrong assumption about the error distribution and due to a possibly wrong linear parametric structure. Namely, we consider the model
Yi = fi + εi ,    ε ∼ N(0, Σ0).    (4.4)
The response values fi are usually treated as the values of the regression function f(·) at the design points Xi . The parametric model (4.1) can be viewed as an approximation of (4.4), while Σ is an approximation of the true covariance matrix Σ0 . If f∗ is indeed equal to Ψ>θ∗ and Σ = Σ0 , then θ̃ and f̃ are MLEs, otherwise quasi MLEs. In our study we mostly restrict ourselves to the case 1 assumption about the noise ε : ε ∼ N(0, σ²In) . The general case can be reduced to this one by a simple data transformation, namely, by multiplying the equation (4.4), Y = f∗ + ε , by the matrix Σ^{−1/2} ; see Section 4.6 for more detail.
4.2.1 Estimation under the homogeneous noise assumption
If a homogeneous noise is assumed, that is, Σ = σ²In and ε ∼ N(0, σ²In) , then the formulae for the MLEs θ̃, f̃ slightly simplify. In particular, the variance σ² cancels and the resulting estimate is the ordinary least squares estimate (oLSE):
θ̃ = ( ΨΨ> )^{−1} ΨY = SY    with S = ( ΨΨ> )^{−1} Ψ . Also
f̃ = Ψ> ( ΨΨ> )^{−1} ΨY = ΠY    with Π = Ψ> ( ΨΨ> )^{−1} Ψ .
Exercise 4.2.1. Derive the formulae for θ̃, f̃ directly from the log-likelihood L(θ) for homogeneous noise.
If the assumption ε ∼ N(0, σ 2 In ) about the errors is not precisely fulfilled,
then the oLSE can be viewed as a quasi MLE.
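A hypothetical numerical illustration of the oLSE formulas (ours, not part of the text): it forms θ̃ = (ΨΨ>)^{−1}ΨY and the fitted response ΠY for a simulated design; numpy.linalg.lstsq could equally be used and is numerically more stable than forming the normal equations. All names and the data are assumptions.

import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
Psi = np.vstack([np.ones(n),                           # p x n design matrix with columns Psi_i
                 rng.standard_normal((p - 1, n))])
theta_star = np.array([1.0, -2.0, 0.5])
Y = Psi.T @ theta_star + 0.3 * rng.standard_normal(n)

theta_tilde = np.linalg.solve(Psi @ Psi.T, Psi @ Y)    # oLSE: (Psi Psi^T)^{-1} Psi Y
f_tilde = Psi.T @ theta_tilde                          # fitted response Pi Y
print(np.round(theta_tilde, 2))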
4.2.2 Linear basis transformation
Denote by ψ1> , . . . , ψp> the rows of the matrix Ψ . Then the ψi ’s are vectors in IRn and we call them the basis vectors. In the linear regression case the
ψ i ’s are obtained as the values of the basis functions at the design points.
Our linear parametric assumption simply means that the underlying vector
f ∗ can be represented as a linear combination of the vectors ψ 1 , . . . , ψ p :
f ∗ = θ1∗ ψ 1 + . . . + θp∗ ψ p .
In other words, f ∗ belongs to the linear subspace in IRn spanned by the
vectors ψ 1 , . . . , ψ p . It is clear that this assumption still holds if we select
another basis in this subspace.
Let U be any linear orthogonal transformation in IRp with U U > = Ip .
Then the linear relation f ∗ = Ψ> θ ∗ can be rewritten as
f ∗ = Ψ> U U > θ ∗ = Ψ̆> u∗
with Ψ̆ = U>Ψ and u∗ = U>θ∗ . Here the rows of Ψ̆ are the new basis vectors ψ̆m in the same subspace, while u∗ is the vector of coefficients describing the decomposition of the vector f∗ w.r.t. this new basis:
f ∗ = u∗1 ψ̆ 1 + . . . + u∗p ψ̆ p .
The natural question is how the expressions for the MLEs θ̃ and f̃ change with the change of the basis. The answer is straightforward. For notational simplicity, we only consider the case with Σ = σ²In . The model can be rewritten as
Y = Ψ̆>u∗ + ε
yielding the solutions
ũ = ( Ψ̆Ψ̆> )^{−1} Ψ̆Y = S̆Y ,    f̃ = Ψ̆> ( Ψ̆Ψ̆> )^{−1} Ψ̆Y = Π̆Y ,
where Ψ̆ = U>Ψ implies
S̆ = ( Ψ̆Ψ̆> )^{−1} Ψ̆ = U>S ,    Π̆ = Ψ̆> ( Ψ̆Ψ̆> )^{−1} Ψ̆ = Π .
This yields
ũ = U>θ̃ ,
and moreover, the estimate f̃ is not changed by any linear transformation of the basis. The first statement can be expected in view of θ∗ = Uu∗ , while the second one will be explained in the next section: Π is the linear projector onto the subspace spanned by the basis vectors, and this projector is invariant w.r.t. basis transformations.
Exercise 4.2.2. Consider univariate polynomial regression of degree p − 1 .
This means that f is a polynomial function of degree p − 1 observed at the
points Xi with errors εi that are assumed to be i.i.d. normal. The function
f can be represented as
f (x) = θ1∗ + θ2∗ x + . . . + θp∗ xp−1
using the basis functions ψm(x) = x^{m−1} for m = 1, . . . , p . At the same
time, for any point x0 , this function can also be written as
f (x) = u∗1 + u∗2 (x − x0 ) + . . . + u∗p (x − x0 )p−1
using the basis functions ψ̆m = (x − x0 )m−1 .
• Write the matrices Ψ and ΨΨ> and similarly Ψ̆ and Ψ̆Ψ̆> .
• Describe the linear transformation A such that u = Aθ for p = 1 .
• Describe the transformation A such that u = Aθ for p > 1 .
Hint: use the formula
u∗m = (1/(m − 1)!) f^{(m−1)}(x0),    m = 1, . . . , p ,
to identify the coefficients u∗m via θ∗m , . . . , θ∗p .
4.2.3 Orthogonal and orthonormal design
Orthogonality of the design matrix Ψ means that the basis vectors ψ1 , . . . , ψp are orthogonal in the sense
ψm>ψm′ = Σ_{i=1}^{n} ψm,i ψm′,i = 0 if m ≠ m′ , and = λm if m = m′ ,
for some positive values λ1 , . . . , λp . Equivalently one can write
ΨΨ> = Λ = diag(λ1 , . . . , λp ).
This feature of the design is very useful: it essentially simplifies the computation and the analysis of the properties of θ̃ . Indeed, ΨΨ> = Λ implies
θ̃ = Λ^{−1}ΨY ,    f̃ = Ψ>θ̃ = Ψ>Λ^{−1}ΨY
with Λ^{−1} = diag(λ1^{−1} , . . . , λp^{−1}) . In particular, the first relation means
θ̃m = λm^{−1} Σ_{i=1}^{n} Yi ψm,i ,
that is, θ̃m is (up to the factor λm^{−1} ) the scalar product of the data and the basis vector ψm for m = 1, . . . , p . The estimate of the response f reads as
f̃ = θ̃1 ψ1 + . . . + θ̃p ψp .
Theorem 4.2.1. Consider the model Y = Ψ>θ∗ + ε with homogeneous errors ε : IEεε> = σ²In . If the design Ψ is orthogonal, that is, if ΨΨ> = Λ for a diagonal matrix Λ , then the estimated coefficients θ̃m are uncorrelated: Var(θ̃) = σ²Λ^{−1} . Moreover, if ε ∼ N(0, σ²In) , then θ̃ ∼ N(θ∗, σ²Λ^{−1}) .
An important message of this result is that the orthogonal design allows for splitting the original multivariate problem into a collection of independent univariate problems: each coefficient θ∗m is estimated by θ̃m independently of the remaining coefficients.
The calculus can be further simplified in the case of an orthogonal design with ΨΨ> = Ip . Then one speaks about an orthonormal design. This also implies that every basis function (vector) ψm is standardized: ‖ψm‖² = Σ_{i=1}^{n} ψm,i² = 1 . In the case of an orthonormal design, the estimate θ̃ is particularly simple: θ̃ = ΨY . Correspondingly, the target of estimation θ∗ satisfies θ∗ = Ψf∗ . In other words, the target is the collection (θ∗m) of the Fourier coefficients of the underlying function (vector) f∗ w.r.t. the basis Ψ , while the estimate θ̃ is the collection of empirical Fourier coefficients θ̃m :
θ∗m = Σ_{i=1}^{n} fi ψm,i ,    θ̃m = Σ_{i=1}^{n} Yi ψm,i .
An important feature of the orthonormal design is that it preserves the noise homogeneity:
Var(θ̃) = σ²Ip .
4.2.4 Spectral representation
Consider a linear model
Y = Ψ> θ + ε
(4.5)
with homogeneous errors ε : Var(ε) = σ²In . The rows of the matrix Ψ can be viewed as basis vectors in IRn and the product Ψ>θ is a linear combination of these vectors with the coefficients (θ1 , . . . , θp) . Effectively, linear least squares estimation does a kind of projection of the data onto the subspace generated by the basis functions. This projection is of course invariant w.r.t. a basis transformation within this linear subspace. This fact can be used to reduce the model to the case of an orthogonal design considered in the previous section. Namely, one can always find a linear orthogonal transformation U : IRp → IRp ensuring the orthogonality of the transformed basis. This
means that the rows of the matrix Ψ̆ = U Ψ are orthogonal and the matrix
Ψ̆Ψ̆> is diagonal:
Ψ̆Ψ̆> = U ΨΨ> U > = Λ = diag(λ1 , . . . , λp ).
The original model reads after this transformation in the form
Y = Ψ̆> u + ε,
Ψ̆Ψ̆> = Λ,
where u = Uθ ∈ IRp . Within this model, the transformed parameter u can be estimated using the empirical Fourier coefficients Zm = ψ̆m>Y , where ψ̆m is the m th row of Ψ̆ , m = 1, . . . , p . The original parameter vector θ can be recovered via the equation θ = U>u . This set of equations can be written in the form
Z = Λu + Λ^{1/2}ξ ,    (4.6)
where Z = Ψ̆Y = UΨY is a vector in IRp and ξ = Λ^{−1/2}Ψ̆ε = Λ^{−1/2}UΨε ∈ IRp . The equation (4.6) is called the spectral representation of the linear model (4.5). The reason is that the basis transformation U can be built by a singular value decomposition of Ψ . This representation is widely used in the context of linear inverse problems; see Section 4.8.
Theorem 4.2.2. Consider the model (4.5) with homogeneous errors ε , that
is, IEεε> = σ 2 In . Then there exists an orthogonal transform U : IRp →
IRp leading to the spectral representation (4.6) with homogeneous uncorrelated
errors ξ : IEξξ> = σ²Ip . If ε ∼ N(0, σ²In) , then the vector ξ is normal as well: ξ ∼ N(0, σ²Ip) .
Exercise 4.2.3. Prove the result of Theorem 4.2.2.
Hint: select any U ensuring UΨΨ>U> = Λ . Then
IEξξ> = Λ^{−1/2}UΨ IEεε> Ψ>U>Λ^{−1/2} = σ² Λ^{−1/2}UΨΨ>U>Λ^{−1/2} = σ²Ip .
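A small numerical sketch (ours, not from the text) of how the transform U and the spectral data Z in (4.6) can be obtained in practice: we diagonalize ΨΨ> by an eigendecomposition, which is equivalent to using the SVD of Ψ ; all variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 50
Psi = rng.standard_normal((p, n))                 # p x n design matrix
theta = np.array([1.0, 0.5, -1.0])
Y = Psi.T @ theta + 0.1 * rng.standard_normal(n)

lam, V = np.linalg.eigh(Psi @ Psi.T)              # Psi Psi^T = V diag(lam) V^T
U = V.T                                           # then U Psi Psi^T U^T = diag(lam)
Psi_breve = U @ Psi                               # rows are the orthogonal basis vectors
Z = Psi_breve @ Y                                 # spectral data: Z = Lambda u + Lambda^{1/2} xi
u_tilde = Z / lam                                 # estimate of u = U theta
print(np.round(U.T @ u_tilde, 2))                 # back-transformed: close to theta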
A special case of the spectral representation corresponds to the orthonormal design with ΨΨ> = Ip . In this situation, the spectral model reads as
Z = u + ξ , that is, we simply observe the target u corrupted with a homogeneous noise ξ . Such an equation is often called the sequence space model and
it is intensively used in the literature for the theoretical study; cf. Section 4.7
below.
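The construction of the spectral representation can be checked numerically. The following Python/NumPy sketch (the dimensions and the random design are illustrative assumptions, not taken from the text) builds the orthogonal transform \( U \) from the eigen-decomposition of \( \Psi\Psi^{\top} \) and verifies the identity (4.6).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 4, 1.0

Psi = rng.standard_normal((p, n))          # generic (non-orthogonal) design, rows = basis vectors
theta = rng.standard_normal(p)
eps = sigma * rng.standard_normal(n)
Y = Psi.T @ theta + eps                    # model (4.5)

# Orthogonal transform U with U Psi Psi^T U^T = Lambda (diagonal);
# an SVD of Psi would deliver the same U.
lam, V = np.linalg.eigh(Psi @ Psi.T)
U = V.T
Lam = np.diag(lam)

Psi_breve = U @ Psi                        # transformed basis: rows are orthogonal
print(np.round(Psi_breve @ Psi_breve.T, 6))    # diagonal matrix Lambda

u = U @ theta
Z = Psi_breve @ Y                          # empirical Fourier coefficients in the new basis
xi = np.diag(lam ** -0.5) @ Psi_breve @ eps

# Spectral representation (4.6): Z = Lambda u + Lambda^{1/2} xi
print(np.allclose(Z, Lam @ u + np.diag(lam ** 0.5) @ xi))   # True
```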
4.3 Properties of the response estimate \( \widetilde{f} \)

This section discusses some properties of the estimate \( \widetilde{f} = \Psi^{\top}\widetilde{\theta} = \Pi Y \) of the response vector \( f^* \). It is worth noting that the first and essential part of the analysis does not rely on the underlying model distribution, only on our parametric assumptions that \( f^* = \Psi^{\top}\theta^* \) and \( \operatorname{Cov}(\varepsilon) = \Sigma = \sigma^2 I_n \). The real model only appears when studying the risk of estimation. We will comment on the cases of misspecified \( f^* \) and \( \Sigma \).
When \( \Sigma = \sigma^2 I_n \), the operator \( \Pi \) in the representation \( \widetilde{f} = \Pi Y \) of the estimate \( \widetilde{f} \) reads as
\[
\Pi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi. \tag{4.7}
\]
First we make use of the linear structure of the model (4.1) and of the estimate \( \widetilde{f} \) to derive a number of its simple but important properties.
4.3.1 Decomposition into a deterministic and a stochastic component

The model equation \( Y = f^* + \varepsilon \) yields
\[
\widetilde{f} = \Pi Y = \Pi(f^* + \varepsilon) = \Pi f^* + \Pi\varepsilon. \tag{4.8}
\]
The first element of this sum, \( \Pi f^* \), is purely deterministic, but it depends on the unknown response vector \( f^* \). Moreover, it will be shown in the next lemma that \( \Pi f^* = f^* \) if the parametric assumption holds and the vector \( f^* \) indeed can be represented as \( \Psi^{\top}\theta^* \). The second element is stochastic as a linear transformation of the stochastic vector \( \varepsilon \), but it is independent of the model response \( f^* \). The properties of the estimate \( \widetilde{f} \) heavily rely on the properties of the linear operator \( \Pi \) from (4.7), which we collect in the next section.
4.3.2 Properties of the operator \( \Pi \)

Let \( \psi_1, \ldots, \psi_p \) be the columns of the matrix \( \Psi^{\top} \). These are vectors in \( \mathbb{R}^n \), also called the basis vectors.

Lemma 4.3.1. Let the matrix \( \Psi\Psi^{\top} \) be non-degenerate. Then the operator \( \Pi \) fulfills the following conditions:
(i) \( \Pi \) is symmetric (self-adjoint), that is, \( \Pi^{\top} = \Pi \).
(ii) \( \Pi \) is a projector in \( \mathbb{R}^n \), i.e. \( \Pi^{\top}\Pi = \Pi^2 = \Pi \) and \( \Pi(I_n - \Pi) = 0 \), where \( I_n \) denotes the identity operator in \( \mathbb{R}^n \).
(iii) For an arbitrary vector \( v \) from \( \mathbb{R}^n \), it holds \( \|v\|^2 = \|\Pi v\|^2 + \|v - \Pi v\|^2 \).
(iv) The trace of \( \Pi \) is equal to the dimension of its image, \( \operatorname{tr}\Pi = p \).
(v) \( \Pi \) projects the linear space \( \mathbb{R}^n \) onto the linear subspace \( L_p = \langle \psi_1, \ldots, \psi_p \rangle \) spanned by the basis vectors \( \psi_1, \ldots, \psi_p \), that is,
\[
\|f^* - \Pi f^*\| = \inf_{g \in L_p} \|f^* - g\|.
\]
(vi) The matrix \( \Pi \) can be represented in the form \( \Pi = U^{\top}\Lambda_p U \), where \( U \) is an orthonormal matrix and \( \Lambda_p \) is a diagonal matrix with the first \( p \) diagonal elements equal to one and the remaining \( n - p \) elements equal to zero:
\[
\Lambda_p = \operatorname{diag}\{\underbrace{1, \ldots, 1}_{p}, \underbrace{0, \ldots, 0}_{n-p}\}.
\]

Proof. It holds
\[
\bigl\{\Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi\bigr\}^{\top} = \Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi
\]
and
\[
\Pi^2 = \Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi\Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi = \Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi = \Pi,
\]
which proves the first two statements of the lemma. The third one follows directly from the first two. Next,
\[
\operatorname{tr}\Pi = \operatorname{tr}\bigl\{\Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi\bigr\} = \operatorname{tr}\bigl\{\Psi\Psi^{\top}(\Psi\Psi^{\top})^{-1}\bigr\} = \operatorname{tr} I_p = p.
\]
The second property means that \( \Pi \) is a projector in \( \mathbb{R}^n \) and the fourth one means that the dimension of its image space is equal to \( p \). The basis vectors \( \psi_1, \ldots, \psi_p \) are the rows of the matrix \( \Psi \). It is clear that
\[
\Pi\Psi^{\top} = \Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi\Psi^{\top} = \Psi^{\top}.
\]
Therefore, the vectors \( \psi_m \) are invariant under the operator \( \Pi \) and, in particular, all these vectors belong to the image space of this operator. If now \( g \) is a vector in \( L_p \), then it can be represented as \( g = c_1\psi_1 + \ldots + c_p\psi_p \), and therefore \( \Pi g = g \) and \( \Pi L_p = L_p \). Finally, the non-singularity of the matrix \( \Psi\Psi^{\top} \) means that the vectors \( \psi_1, \ldots, \psi_p \) forming the rows of \( \Psi \) are linearly independent. Therefore, the space \( L_p \) spanned by the vectors \( \psi_1, \ldots, \psi_p \) is of dimension \( p \), and hence it coincides with the image space of the operator \( \Pi \).
The last property is the usual diagonal decomposition of a projector.

Exercise 4.3.1. Consider the case of an orthonormal design with \( \Psi\Psi^{\top} = I_p \). Specify the projector \( \Pi \) of Lemma 4.3.1 for this situation, in particular its decomposition from (vi).
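A quick numerical check of Lemma 4.3.1 can be done with a few lines of Python/NumPy; the random design and its dimensions below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
Psi = rng.standard_normal((p, n))

# Projector Pi = Psi^T (Psi Psi^T)^{-1} Psi from (4.7)
Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)

print(np.allclose(Pi, Pi.T))              # (i)  symmetry
print(np.allclose(Pi @ Pi, Pi))           # (ii) idempotence
print(np.isclose(np.trace(Pi), p))        # (iv) trace equals p

# (iii) Pythagoras for an arbitrary vector v
v = rng.standard_normal(n)
print(np.isclose(v @ v, (Pi @ v) @ (Pi @ v) + (v - Pi @ v) @ (v - Pi @ v)))

# (v) Pi leaves the basis vectors (rows of Psi) unchanged
print(np.allclose(Pi @ Psi.T, Psi.T))
```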
4.3.3 Quadratic loss and risk of the response estimation

In this section we study the quadratic risk of estimating the response \( f^* \). The reason for studying this risk will become clear when we discuss the properties of the fitted likelihood in the next section.
The loss \( \wp(\widetilde{f}, f^*) \) of the estimate \( \widetilde{f} \) can be naturally defined as the squared norm of the difference \( \widetilde{f} - f^* \):
\[
\wp(\widetilde{f}, f^*) = \|\widetilde{f} - f^*\|^2 = \sum_{i=1}^{n} |\widetilde{f}_i - f^*_i|^2 .
\]
Correspondingly, the quadratic risk of the estimate \( \widetilde{f} \) is the mean of this loss:
\[
R(\widetilde{f}) = \mathbb{E}\,\wp(\widetilde{f}, f^*) = \mathbb{E}\bigl[(\widetilde{f} - f^*)^{\top}(\widetilde{f} - f^*)\bigr]. \tag{4.9}
\]
The next result describes the loss and risk decomposition for two cases: when the parametric assumption \( f^* = \Psi^{\top}\theta^* \) is correct and in the general case.

Theorem 4.3.1. Suppose that the errors \( \varepsilon_i \) from (4.1) are independent with \( \mathbb{E}\varepsilon_i = 0 \) and \( \mathbb{E}\varepsilon_i^2 = \sigma^2 \), i.e. \( \Sigma = \sigma^2 I_n \). Then the loss \( \wp(\widetilde{f}, f^*) = \|\Pi Y - f^*\|^2 \) and the risk \( R(\widetilde{f}) \) of the LSE \( \widetilde{f} \) fulfill
\[
\wp(\widetilde{f}, f^*) = \|f^* - \Pi f^*\|^2 + \|\Pi\varepsilon\|^2, \qquad
R(\widetilde{f}) = \|f^* - \Pi f^*\|^2 + p\sigma^2 .
\]
Moreover, if \( f^* = \Psi^{\top}\theta^* \), then
\[
\wp(\widetilde{f}, f^*) = \|\Pi\varepsilon\|^2, \qquad R(\widetilde{f}) = p\sigma^2 .
\]

Proof. We apply (4.9) and the decomposition (4.8) of the estimate \( \widetilde{f} \). It follows that
\[
\wp(\widetilde{f}, f^*) = \|\widetilde{f} - f^*\|^2 = \|\Pi\varepsilon - (f^* - \Pi f^*)\|^2
= \|f^* - \Pi f^*\|^2 - 2(f^* - \Pi f^*)^{\top}\Pi\varepsilon + \|\Pi\varepsilon\|^2 .
\]
The cross term vanishes because \( (f^* - \Pi f^*)^{\top}\Pi\varepsilon = {f^*}^{\top}(I_n - \Pi)\Pi\varepsilon = 0 \) by Lemma 4.3.1, (ii); this yields the decomposition of the loss of \( \widetilde{f} \). Next we compute the mean of \( \|\Pi\varepsilon\|^2 \), applying again Lemma 4.3.1. Indeed,
\[
\mathbb{E}\|\Pi\varepsilon\|^2 = \mathbb{E}(\Pi\varepsilon)^{\top}\Pi\varepsilon = \mathbb{E}\operatorname{tr}\bigl\{\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr\}
= \mathbb{E}\operatorname{tr}\bigl\{\Pi\varepsilon\varepsilon^{\top}\Pi^{\top}\bigr\}
= \operatorname{tr}\bigl\{\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr\} = \sigma^2\operatorname{tr}(\Pi^2) = p\sigma^2 .
\]
Now consider the case when \( f^* = \Psi^{\top}\theta^* \). By Lemma 4.3.1, \( f^* = \Pi f^* \), and the last two statements of the theorem clearly follow.
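The risk decomposition of Theorem 4.3.1 is easy to reproduce by simulation. The sketch below (Python/NumPy; the choice of the response, noise level and dimensions is an illustrative assumption) uses a response that is not exactly linear in the basis, so that the bias term is nonzero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 60, 4, 0.7
Psi = rng.standard_normal((p, n))
Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)

# A response that is NOT exactly linear in the basis -> nonzero bias term
f_star = np.sin(np.linspace(0, 3, n))
bias2 = np.sum((f_star - Pi @ f_star) ** 2)

# Monte Carlo estimate of the risk E || Pi Y - f* ||^2
M = 20000
losses = np.empty(M)
for k in range(M):
    Y = f_star + sigma * rng.standard_normal(n)
    losses[k] = np.sum((Pi @ Y - f_star) ** 2)

print("Monte Carlo risk      :", losses.mean())
print("bias^2 + p * sigma^2  :", bias2 + p * sigma ** 2)
```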
4.3.4 Misspecified “colored noise”

Here we briefly comment on the case when \( \varepsilon \) is not a white noise. Our assumption about the errors \( \varepsilon_i \) is that they are uncorrelated and homogeneous, that is, \( \Sigma = \sigma^2 I_n \), while the true covariance matrix is given by \( \Sigma_0 \). Many properties of the estimate \( \widetilde{f} = \Pi Y \) which are simply based on the linearity of the model (4.1) and of the estimate \( \widetilde{f} \) itself continue to apply. In particular, the loss \( \wp(\widetilde{f}, f^*) = \|\widetilde{f} - f^*\|^2 \) can again be decomposed as
\[
\|\widetilde{f} - f^*\|^2 = \|f^* - \Pi f^*\|^2 + \|\Pi\varepsilon\|^2 .
\]

Theorem 4.3.2. Suppose that \( \mathbb{E}\varepsilon = 0 \) and \( \operatorname{Var}(\varepsilon) = \Sigma_0 \). Then the loss \( \wp(\widetilde{f}, f^*) \) and the risk \( R(\widetilde{f}) \) of the LSE \( \widetilde{f} \) fulfill
\[
\wp(\widetilde{f}, f^*) = \|f^* - \Pi f^*\|^2 + \|\Pi\varepsilon\|^2, \qquad
R(\widetilde{f}) = \|f^* - \Pi f^*\|^2 + \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]
Moreover, if \( f^* = \Psi^{\top}\theta^* \), then
\[
\wp(\widetilde{f}, f^*) = \|\Pi\varepsilon\|^2, \qquad R(\widetilde{f}) = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]

Proof. The decomposition of the loss from Theorem 4.3.1 only relies on the geometric properties of the projector \( \Pi \) and does not use the covariance structure of the noise. Hence, it only remains to check the expectation of \( \|\Pi\varepsilon\|^2 \). Observe that
\[
\mathbb{E}\|\Pi\varepsilon\|^2 = \mathbb{E}\operatorname{tr}\bigl\{\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr\} = \operatorname{tr}\bigl\{\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr\} = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr),
\]
as required.
4.4 Properties of the MLE \( \widetilde{\theta} \)

In this section we focus on the properties of the quasi MLE \( \widetilde{\theta} \) built for the idealized linear Gaussian model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \sigma^2 I_n) \). As in the previous section, we do not assume the parametric structure of the underlying model and consider a more general model \( Y = f^* + \varepsilon \) with an unknown vector \( f^* \) and errors \( \varepsilon \) with zero mean and covariance matrix \( \Sigma_0 \). Due to (4.3), it holds \( \widetilde{\theta} = SY \) with \( S = (\Psi\Psi^{\top})^{-1}\Psi \). An important feature of this estimate is its linear dependence on the data. The linear model equation \( Y = f^* + \varepsilon \) and the linear structure of the estimate \( \widetilde{\theta} = SY \) allow us to decompose the vector \( \widetilde{\theta} \) into a deterministic and a stochastic term:
\[
\widetilde{\theta} = SY = S(f^* + \varepsilon) = Sf^* + S\varepsilon. \tag{4.10}
\]
The first term \( Sf^* \) is deterministic but depends on the unknown vector \( f^* \), while the second term \( S\varepsilon \) is stochastic but does not involve the model response \( f^* \). Below we study the properties of each component separately.
4.4.1 Properties of the stochastic component

The next result describes the distributional properties of the stochastic component \( \delta = S\varepsilon \) for \( S = (\Psi\Psi^{\top})^{-1}\Psi \) and thus of the estimate \( \widetilde{\theta} \).

Theorem 4.4.1. Assume \( Y = f^* + \varepsilon \) with \( \mathbb{E}\varepsilon = 0 \) and \( \operatorname{Var}(\varepsilon) = \Sigma_0 \). The stochastic component \( \delta = S\varepsilon \) in (4.10) fulfills
\[
\mathbb{E}\delta = 0, \qquad W^2 \stackrel{\mathrm{def}}{=} \operatorname{Var}(\delta) = S\Sigma_0 S^{\top}, \qquad
\mathbb{E}\|\delta\|^2 = \operatorname{tr} W^2 = \operatorname{tr}\bigl(S\Sigma_0 S^{\top}\bigr).
\]
Moreover, if \( \Sigma = \Sigma_0 = \sigma^2 I_n \), then
\[
W^2 = \sigma^2\bigl(\Psi\Psi^{\top}\bigr)^{-1}, \qquad
\mathbb{E}\|\delta\|^2 = \operatorname{tr}(W^2) = \sigma^2\operatorname{tr}\bigl\{\bigl(\Psi\Psi^{\top}\bigr)^{-1}\bigr\}. \tag{4.11}
\]
Similarly, for the estimate \( \widetilde{\theta} \) it holds
\[
\mathbb{E}\widetilde{\theta} = Sf^*, \qquad \operatorname{Var}(\widetilde{\theta}) = W^2 .
\]
If the errors \( \varepsilon \) are Gaussian, then both \( \delta \) and \( \widetilde{\theta} \) are Gaussian as well:
\[
\delta \sim N(0, W^2), \qquad \widetilde{\theta} \sim N(Sf^*, W^2).
\]

Proof. For the variance \( W^2 \) of \( \delta \) it holds
\[
\operatorname{Var}(\delta) = \mathbb{E}\delta\delta^{\top} = \mathbb{E} S\varepsilon\varepsilon^{\top}S^{\top} = S\Sigma_0 S^{\top}.
\]
Next we use that \( \mathbb{E}\|\delta\|^2 = \mathbb{E}\delta^{\top}\delta = \mathbb{E}\operatorname{tr}(\delta\delta^{\top}) = \operatorname{tr} W^2 \). If \( \Sigma = \Sigma_0 = \sigma^2 I_n \), then (4.11) follows by simple algebra.
If \( \varepsilon \) is a Gaussian vector, then \( \delta \), as its linear transformation, is Gaussian as well. The properties of \( \widetilde{\theta} \) follow directly from the decomposition (4.10).

With \( \Sigma_0 \neq \sigma^2 I_n \), the variance \( W^2 \) can be represented as
\[
W^2 = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]

Exercise 4.4.1. Let \( \delta \) be the stochastic component of \( \widetilde{\theta} \) built for the misspecified linear model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \operatorname{Var}(\varepsilon) = \Sigma \), and let the true noise variance be \( \Sigma_0 \). Then \( \operatorname{Var}(\widetilde{\theta}) = W^2 \) with
\[
W^2 = \bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}\Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^{\top}\bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}. \tag{4.12}
\]

The main finding of the presented study is that the stochastic part \( \delta = S\varepsilon \) of the estimate \( \widetilde{\theta} \) is completely independent of the structure of the vector \( f^* \). In other words, the behavior of the stochastic component \( \delta \) does not change even if the linear parametric assumption is misspecified.
4.4.2 Properties of the deterministic component

Now we study the deterministic term, starting with the parametric situation \( f^* = \Psi^{\top}\theta^* \). Here we only state the results for the case \( \Sigma = \sigma^2 I_n \).

Theorem 4.4.2. Let \( f^* = \Psi^{\top}\theta^* \). Then \( \widetilde{\theta} = SY \) with \( S = (\Psi\Psi^{\top})^{-1}\Psi \) is unbiased, that is, \( \mathbb{E}\widetilde{\theta} = Sf^* = \theta^* \).

Proof. For the proof, just observe that \( Sf^* = (\Psi\Psi^{\top})^{-1}\Psi\Psi^{\top}\theta^* = \theta^* \).

Now we briefly discuss what happens when the linear parametric assumption is not fulfilled, that is, \( f^* \) cannot be represented as \( \Psi^{\top}\theta^* \). In this case it is not yet clear what \( \widetilde{\theta} \) really estimates. The answer is given in the context of the general theory of minimum contrast estimation. Namely, define \( \theta^* \) as the point which maximizes the expectation of the (quasi) log-likelihood \( L(\theta) \):
\[
\theta^* = \operatorname*{argmax}_{\theta} \mathbb{E} L(\theta). \tag{4.13}
\]

Theorem 4.4.3. The solution \( \theta^* \) of the optimization problem (4.13) is given by
\[
\theta^* = Sf^* = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f^* .
\]
Moreover,
\[
\Psi^{\top}\theta^* = \Pi f^* = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f^* .
\]
In particular, if \( f^* = \Psi^{\top}\theta^* \), then the solution of (4.13) coincides with \( \theta^* \).

Proof. The use of the model equation \( Y = f^* + \varepsilon \) and of the properties of the stochastic component \( \delta \) yields by simple algebra
\[
\operatorname*{argmax}_{\theta} \mathbb{E} L(\theta)
= \operatorname*{argmin}_{\theta} \mathbb{E}\bigl[(f^* - \Psi^{\top}\theta + \varepsilon)^{\top}(f^* - \Psi^{\top}\theta + \varepsilon)\bigr]
= \operatorname*{argmin}_{\theta} \bigl\{(f^* - \Psi^{\top}\theta)^{\top}(f^* - \Psi^{\top}\theta) + \mathbb{E}(\varepsilon^{\top}\varepsilon)\bigr\}
= \operatorname*{argmin}_{\theta} (f^* - \Psi^{\top}\theta)^{\top}(f^* - \Psi^{\top}\theta).
\]
Differentiating w.r.t. \( \theta \) leads to the equation
\[
\Psi(f^* - \Psi^{\top}\theta) = 0
\]
with the solution \( \theta^* = (\Psi\Psi^{\top})^{-1}\Psi f^* \), which is exactly the expected value of \( \widetilde{\theta} \) by Theorem 4.4.1.

Exercise 4.4.2. State the results of Theorems 4.4.2 and 4.4.3 for the MLE \( \widetilde{\theta} \) built in the model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \operatorname{Var}(\varepsilon) = \Sigma \).
Hint: check that the statements continue to apply with \( S = (\Psi\Sigma^{-1}\Psi^{\top})^{-1}\Psi\Sigma^{-1} \).

The last results and the decomposition (4.10) explain the behavior of the estimate \( \widetilde{\theta} \) in a very general situation. The considered model is \( Y = f^* + \varepsilon \). We assume a linear parametric structure and independent homogeneous noise. The estimation procedure performs in fact a kind of projection of the data \( Y \) onto the \( p \)-dimensional linear subspace of \( \mathbb{R}^n \) spanned by the given basis vectors \( \psi_1, \ldots, \psi_p \). This projection, as a linear operator, can be decomposed into a projection of the deterministic vector \( f^* \) and a projection of the random noise \( \varepsilon \). If the linear parametric assumption \( f^* \in \langle \psi_1, \ldots, \psi_p \rangle \) is correct, that is, \( f^* = \theta^*_1\psi_1 + \ldots + \theta^*_p\psi_p \), then this projection keeps \( f^* \) unchanged and only the random noise is reduced by the projection. If \( f^* \) cannot be exactly expanded using the basis \( \psi_1, \ldots, \psi_p \), then the procedure recovers the projection of \( f^* \) onto this subspace. The latter projection can be written as \( \Psi^{\top}\theta^* \), and the vector \( \theta^* \) can be viewed as the target of estimation.
4.4.3 Risk of estimation. R-efficiency

This section briefly discusses how the obtained properties of the estimate \( \widetilde{\theta} \) can be used to evaluate the risk of estimation. A particularly important question is the optimality of the MLE \( \widetilde{\theta} \). The main result of the section claims that \( \widetilde{\theta} \) is R-efficient if the model is correctly specified and is not if there is a misspecification.

We start with the case of a correct parametric specification \( Y = \Psi^{\top}\theta^* + \varepsilon \), that is, the linear parametric assumption \( f^* = \Psi^{\top}\theta^* \) is exactly fulfilled and the noise \( \varepsilon \) is homogeneous: \( \varepsilon \sim N(0, \sigma^2 I_n) \). Later we extend the result to the case when the LPA \( f^* = \Psi^{\top}\theta^* \) is not fulfilled and to the case when the noise is not homogeneous but still correctly specified. Finally we discuss the case when the noise structure is misspecified.

Under the LPA \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \sigma^2 I_n) \), the estimate \( \widetilde{\theta} \) is also normal with mean \( \theta^* \) and variance \( W^2 = \sigma^2 SS^{\top} = \sigma^2(\Psi\Psi^{\top})^{-1} \). Define a \( p \times p \) symmetric matrix \( D \) by the equation
\[
D^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n} \Psi_i\Psi_i^{\top} = \frac{1}{\sigma^2}\,\Psi\Psi^{\top},
\]
where \( \Psi_i \) denotes the \( i \)-th column of \( \Psi \). Clearly \( W^2 = D^{-2} \).

Now we show that \( \widetilde{\theta} \) is R-efficient. Actually this fact can be derived from the Cramér–Rao theorem because the Gaussian model is a special case of an exponential family. However, we check this statement directly by computing the Cramér–Rao efficiency bound. Recall that the Fisher information matrix \( \mathcal{F}(\theta) \) for the log-likelihood \( L(\theta) \) is defined as the variance of \( \nabla L(\theta) \) under \( \mathbb{P}_{\theta} \).

Theorem 4.4.4 (Gauss–Markov). Let \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \sigma^2 I_n) \). Then \( \widetilde{\theta} \) is an R-efficient estimate of \( \theta^* \): \( \mathbb{E}\widetilde{\theta} = \theta^* \),
\[
\mathbb{E}\bigl[(\widetilde{\theta} - \theta^*)(\widetilde{\theta} - \theta^*)^{\top}\bigr] = \operatorname{Var}(\widetilde{\theta}) = D^{-2},
\]
and for any unbiased linear estimate \( \widehat{\theta} \) satisfying \( \mathbb{E}_{\theta}\widehat{\theta} \equiv \theta \), it holds
\[
\operatorname{Var}(\widehat{\theta}) \geq \operatorname{Var}(\widetilde{\theta}) = D^{-2}.
\]

Proof. Theorems 4.4.1 and 4.4.2 imply that \( \widetilde{\theta} \sim N(\theta^*, W^2) \) with \( W^2 = \sigma^2(\Psi\Psi^{\top})^{-1} = D^{-2} \). Next we show that for any \( \theta \)
\[
\operatorname{Var}\bigl\{\nabla L(\theta)\bigr\} = D^2,
\]
that is, the Fisher information does not depend on the model function \( f^* \). The log-likelihood \( L(\theta) \) for the model \( Y \sim N(\Psi^{\top}\theta^*, \sigma^2 I_n) \) reads as
\[
L(\theta) = -\frac{1}{2\sigma^2}(Y - \Psi^{\top}\theta)^{\top}(Y - \Psi^{\top}\theta) - \frac{n}{2}\log(2\pi\sigma^2).
\]
This yields for its gradient \( \nabla L(\theta) \):
\[
\nabla L(\theta) = \sigma^{-2}\Psi(Y - \Psi^{\top}\theta),
\]
and, in view of \( \operatorname{Var}(Y) = \Sigma = \sigma^2 I_n \), it holds
\[
\operatorname{Var}\bigl\{\nabla L(\theta)\bigr\} = \sigma^{-4}\Psi \operatorname{Var}(Y)\Psi^{\top} = \sigma^{-2}\Psi\Psi^{\top},
\]
as required.

The R-efficiency of \( \widetilde{\theta} \) follows from the Cramér–Rao efficiency bound because \( \operatorname{Var}(\widetilde{\theta}) = \bigl[\operatorname{Var}\{\nabla L(\theta)\}\bigr]^{-1} \). However, we present an independent proof of this fact. Actually we prove a sharper result: the variance of a linear unbiased estimate \( \widehat{\theta} \) coincides with the variance of \( \widetilde{\theta} \) only if \( \widehat{\theta} \) coincides almost surely with \( \widetilde{\theta} \); otherwise it is larger. The idea of the proof is quite simple. Consider the difference \( \widehat{\theta} - \widetilde{\theta} \) and show that the condition \( \mathbb{E}\widehat{\theta} = \mathbb{E}\widetilde{\theta} = \theta^* \) implies the orthogonality \( \mathbb{E}\bigl[(\widehat{\theta} - \widetilde{\theta})\widetilde{\theta}^{\top}\bigr] = 0 \). This, in turn, implies \( \operatorname{Var}(\widehat{\theta}) = \operatorname{Var}(\widetilde{\theta}) + \operatorname{Var}(\widehat{\theta} - \widetilde{\theta}) \geq \operatorname{Var}(\widetilde{\theta}) \). So it remains to check the orthogonality of \( \widetilde{\theta} \) and \( \widehat{\theta} - \widetilde{\theta} \). Let \( \widehat{\theta} = AY \) for a \( p \times n \) matrix \( A \) with \( \mathbb{E}_{\theta}\widehat{\theta} \equiv \theta \) for all \( \theta \). This unbiasedness and \( \mathbb{E}_{\theta}Y = \Psi^{\top}\theta \) imply \( A\Psi^{\top}\theta \equiv \theta \), i.e. \( A\Psi^{\top} \) is the identity \( p \times p \) matrix. The same is true for \( \widetilde{\theta} = SY \), yielding \( S\Psi^{\top} = I_p \). Next, in view of \( \mathbb{E}\widehat{\theta} = \mathbb{E}\widetilde{\theta} = \theta^* \),
\[
\mathbb{E}\bigl[(\widehat{\theta} - \widetilde{\theta})\widetilde{\theta}^{\top}\bigr] = \mathbb{E}\bigl[(A - S)\varepsilon\varepsilon^{\top}S^{\top}\bigr] = \sigma^2(A - S)\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1} = 0,
\]
and the assertion follows.

Exercise 4.4.3. Check the details of the proof of the theorem. Show that the statement \( \operatorname{Var}(\widehat{\theta}) \geq \operatorname{Var}(\widetilde{\theta}) \) only uses that \( \widehat{\theta} \) is unbiased and that \( \mathbb{E} Y = \Psi^{\top}\theta^* \) and \( \operatorname{Var}(Y) = \sigma^2 I_n \).

Exercise 4.4.4. Compute \( \nabla^2 L(\theta) \). Check that it is non-random, does not depend on \( \theta \), and fulfills for every \( \theta \) the identity
\[
\nabla^2 L(\theta) \equiv -\operatorname{Var}\bigl\{\nabla L(\theta)\bigr\} = -D^2 .
\]
A colored noise

The majority of the presented results continue to apply in the case of heterogeneous and even dependent noise with \( \operatorname{Var}(\varepsilon) = \Sigma_0 \). The key facts behind this extension are the decomposition (4.10) and the properties of the stochastic component \( \delta \) from Section 4.4.1: \( \delta \sim N(0, W^2) \). In the case of a colored noise, the definition of \( W \) and \( D \) changes to
\[
D^2 \stackrel{\mathrm{def}}{=} W^{-2} = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]

Exercise 4.4.5. State and prove the analog of Theorem 4.4.4 for the colored noise \( \varepsilon \sim N(0, \Sigma_0) \).
A misspecified LPA

An interesting feature of our results so far is that they apply equally to the correct linear specification \( f^* = \Psi^{\top}\theta^* \) and to the case when the identity \( f^* = \Psi^{\top}\theta \) is not precisely fulfilled whatever \( \theta \) is taken. In this situation the target of analysis is the vector \( \theta^* \) describing the best linear approximation of \( f^* \) by \( \Psi^{\top}\theta \). We already know from the results of Sections 4.4.1 and 4.4.2 that the estimate \( \widetilde{\theta} \) is also normal with mean \( \theta^* = Sf^* = (\Psi\Psi^{\top})^{-1}\Psi f^* \) and variance \( W^2 = \sigma^2 SS^{\top} = \sigma^2(\Psi\Psi^{\top})^{-1} \).

Theorem 4.4.5. Assume \( Y = f^* + \varepsilon \) with \( \varepsilon \sim N(0, \sigma^2 I_n) \). Let \( \theta^* = Sf^* \). Then \( \widetilde{\theta} \) is an R-efficient estimate of \( \theta^* \): \( \mathbb{E}\widetilde{\theta} = \theta^* \),
\[
\mathbb{E}\bigl[(\widetilde{\theta} - \theta^*)(\widetilde{\theta} - \theta^*)^{\top}\bigr] = \operatorname{Var}(\widetilde{\theta}) = D^{-2},
\]
and for any unbiased linear estimate \( \widehat{\theta} \) satisfying \( \mathbb{E}_{\theta}\widehat{\theta} \equiv \theta \), it holds
\[
\operatorname{Var}(\widehat{\theta}) \geq \operatorname{Var}(\widetilde{\theta}) = D^{-2}.
\]

Proof. The proof only utilizes that \( \widetilde{\theta} \sim N(\theta^*, W^2) \) with \( W^2 = D^{-2} \). The only small remark concerns the equality \( \operatorname{Var}\{\nabla L(\theta)\} = D^2 \) from Theorem 4.4.4.

Exercise 4.4.6. Check the identity \( \operatorname{Var}\{\nabla L(\theta)\} = D^2 \) from Theorem 4.4.4 for \( \varepsilon \sim N(0, \Sigma_0) \).
4.4.4 The case of a misspecified noise

Here we again consider the linear parametric assumption \( Y = \Psi^{\top}\theta^* + \varepsilon \). However, contrary to the previous section, we admit that the noise \( \varepsilon \) is not homogeneous normal: \( \varepsilon \sim N(0, \Sigma_0) \), while our estimation procedure is the quasi MLE based on the assumption of noise homogeneity \( \varepsilon \sim N(0, \sigma^2 I_n) \). We already know that the estimate \( \widetilde{\theta} \) is unbiased with mean \( \theta^* \) and variance \( W^2 = S\Sigma_0 S^{\top} \), where \( S = (\Psi\Psi^{\top})^{-1}\Psi \). This gives
\[
W^2 = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]
The question is whether the estimate \( \widetilde{\theta} \) based on the misspecified distributional assumption is efficient. The Cramér–Rao result delivers the lower bound for the quadratic risk in the form \( \operatorname{Var}(\widetilde{\theta}) \geq \bigl[\operatorname{Var}\{\nabla L(\theta)\}\bigr]^{-1} \). We already know that the use of the correctly specified covariance matrix of the errors leads to an R-efficient estimate \( \widetilde{\theta} \). The next result shows that the use of a misspecified matrix \( \Sigma \) results in an estimate which is unbiased but not R-efficient; that is, the best estimation risk is achieved if we apply the correct model assumptions.

Theorem 4.4.6. Let \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma_0) \). Then
\[
\operatorname{Var}\bigl\{\nabla L(\theta)\bigr\} = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]
The estimate \( \widetilde{\theta} = (\Psi\Psi^{\top})^{-1}\Psi Y \) is unbiased, that is, \( \mathbb{E}\widetilde{\theta} = \theta^* \), but it is not R-efficient unless \( \Sigma_0 = \Sigma \).

Proof. Let \( \widetilde{\theta}_0 \) be the MLE for the correct model specification with the noise \( \varepsilon \sim N(0, \Sigma_0) \). As \( \widetilde{\theta} \) is unbiased, the difference \( \widetilde{\theta} - \widetilde{\theta}_0 \) is orthogonal to \( \widetilde{\theta}_0 \), and it holds for the variance of \( \widetilde{\theta} \)
\[
\operatorname{Var}(\widetilde{\theta}) = \operatorname{Var}(\widetilde{\theta}_0) + \operatorname{Var}(\widetilde{\theta} - \widetilde{\theta}_0);
\]
cf. the proof of the Gauss–Markov Theorem 4.4.4.

Exercise 4.4.7. Compare directly the variances of \( \widetilde{\theta} \) and \( \widetilde{\theta}_0 \).
4.5 Linear models and quadratic log-likelihood

Linear Gaussian modeling leads to a specific log-likelihood structure; see Section 4.2. Namely, the log-likelihood function \( L(\theta) \) is quadratic in \( \theta \), the coefficients of the quadratic terms are deterministic, and the cross term is linear both in \( \theta \) and in the observations \( Y_i \). Here we show that this geometric structure of the log-likelihood characterizes linear models. We say that \( L(\theta) \) is quadratic if it is a quadratic function of \( \theta \) and there is a deterministic symmetric matrix \( D^2 \) such that for any \( \theta^{\circ}, \theta \)
\[
L(\theta) - L(\theta^{\circ}) = (\theta - \theta^{\circ})^{\top}\nabla L(\theta^{\circ}) - (\theta - \theta^{\circ})^{\top}D^2(\theta - \theta^{\circ})/2 . \tag{4.14}
\]
Here \( \nabla L(\theta) \stackrel{\mathrm{def}}{=} \frac{dL(\theta)}{d\theta} \). As usual we define
\[
\widetilde{\theta} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} L(\theta), \qquad
\theta^* \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} \mathbb{E} L(\theta).
\]
The next result describes some properties of the estimate \( \widetilde{\theta} \) which are entirely based on the geometric (quadratic) structure of the function \( L(\theta) \). All the results are stated using the matrix \( D^2 \) and the vector \( \zeta = \nabla L(\theta^*) \).

Theorem 4.5.1. Let \( L(\theta) \) be quadratic with a matrix \( D^2 > 0 \). Then for any \( \theta^{\circ} \)
\[
\widetilde{\theta} - \theta^{\circ} = D^{-2}\nabla L(\theta^{\circ}). \tag{4.15}
\]
In particular, with \( \theta^{\circ} = 0 \), it holds \( \widetilde{\theta} = D^{-2}\nabla L(0) \). Taking \( \theta^{\circ} = \theta^* \) yields
\[
\widetilde{\theta} - \theta^* = D^{-2}\zeta \tag{4.16}
\]
with \( \zeta \stackrel{\mathrm{def}}{=} \nabla L(\theta^*) \). Moreover, \( \mathbb{E}\zeta = 0 \), and it holds with \( V^2 \stackrel{\mathrm{def}}{=} \operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^{\top} \)
\[
\mathbb{E}\widetilde{\theta} = \theta^*, \qquad \operatorname{Var}(\widetilde{\theta}) = D^{-2}V^2 D^{-2}.
\]
Further, for any \( \theta \),
\[
L(\widetilde{\theta}) - L(\theta) = (\widetilde{\theta} - \theta)^{\top}D^2(\widetilde{\theta} - \theta)/2 = \|D(\widetilde{\theta} - \theta)\|^2/2 . \tag{4.17}
\]
Finally, it holds for the excess \( L(\widetilde{\theta}, \theta^*) \stackrel{\mathrm{def}}{=} L(\widetilde{\theta}) - L(\theta^*) \)
\[
2L(\widetilde{\theta}, \theta^*) = (\widetilde{\theta} - \theta^*)^{\top}D^2(\widetilde{\theta} - \theta^*) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^2 \tag{4.18}
\]
with \( \xi = D^{-1}\zeta \).

Proof. The extremal point equation \( \nabla L(\theta) = 0 \) for the quadratic function \( L(\theta) \) from (4.14) yields (4.15). Differentiating (4.14) w.r.t. \( \theta \) implies for any \( \theta^{\circ}, \theta \) that \( \nabla L(\theta) = \nabla L(\theta^{\circ}) - D^2(\theta - \theta^{\circ}) \); with \( \theta^{\circ} = \theta^* \) this gives
\[
\nabla L(\theta) = \zeta - D^2(\theta - \theta^*). \tag{4.19}
\]
Therefore, it holds for the expectation \( \mathbb{E} L(\theta) \)
\[
\nabla\mathbb{E} L(\theta) = \mathbb{E}\zeta - D^2(\theta - \theta^*),
\]
and the equation \( \nabla\mathbb{E} L(\theta^*) = 0 \) implies \( \mathbb{E}\zeta = 0 \).
To show (4.17), apply again the property (4.14) with \( \theta^{\circ} = \widetilde{\theta} \):
\[
L(\theta) - L(\widetilde{\theta}) = (\theta - \widetilde{\theta})^{\top}\nabla L(\widetilde{\theta}) - (\theta - \widetilde{\theta})^{\top}D^2(\theta - \widetilde{\theta})/2
= -(\widetilde{\theta} - \theta)^{\top}D^2(\widetilde{\theta} - \theta)/2 .
\]
Here we used that \( \nabla L(\widetilde{\theta}) = 0 \) because \( \widetilde{\theta} \) is an extreme point of \( L(\theta) \). The last result (4.18) is a special case with \( \theta = \theta^* \) in view of (4.16).

This theorem delivers an important message: the main properties of the MLE \( \widetilde{\theta} \) can be explained via the geometric (quadratic) structure of the log-likelihood. An interesting question to clarify is whether a quadratic log-likelihood structure is specific to the linear Gaussian model. The answer is positive: there is a one-to-one correspondence between linear Gaussian models and quadratic log-likelihood functions. Indeed, the identity (4.19) with \( \theta^{\circ} = \theta^* \) can be rewritten as
\[
\nabla L(\theta) + D^2\theta \equiv \zeta + D^2\theta^* .
\]
If we fix any \( \theta \) and define \( \mathcal{Y} = \nabla L(\theta) + D^2\theta \), this yields
\[
\mathcal{Y} = D^2\theta^* + \zeta .
\]
Similarly, \( \mathcal{Y} \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^2\theta\bigr\} \) yields the equation
\[
\mathcal{Y} = D\theta^* + \xi, \tag{4.20}
\]
where \( \xi = D^{-1}\zeta \). We can summarize as follows.

Theorem 4.5.2. Let \( L(\theta) \) be quadratic with a non-degenerate matrix \( D^2 \). Then \( \mathcal{Y} \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^2\theta\bigr\} \) does not depend on \( \theta \), and \( L(\theta) - L(\theta^*) \) is the quasi log-likelihood ratio for the linear Gaussian model (4.20) with \( \xi \) standard normal. It is the true log-likelihood if and only if \( \zeta \sim N(0, D^2) \).

Proof. The model (4.20) with \( \xi \sim N(0, I_p) \) leads to the log-likelihood ratio
\[
(\theta - \theta^*)^{\top}D(\mathcal{Y} - D\theta^*) - \|D(\theta - \theta^*)\|^2/2 = (\theta - \theta^*)^{\top}\zeta - \|D(\theta - \theta^*)\|^2/2
\]
in view of the definition of \( \mathcal{Y} \). The definition (4.14) implies
\[
L(\theta) - L(\theta^*) = (\theta - \theta^*)^{\top}\nabla L(\theta^*) - \|D(\theta - \theta^*)\|^2/2 .
\]
As these two expressions coincide, it follows that \( L(\theta) \) is the true log-likelihood if and only if \( \xi = D^{-1}\zeta \) is standard normal.
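The identities (4.15) and (4.17) can be verified numerically for the Gaussian linear log-likelihood. The following Python/NumPy sketch (the design, data and dimensions are illustrative assumptions) drops the constants in \( L(\theta) \), which do not affect the differences appearing in the identities.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 3, 1.0
Psi = rng.standard_normal((p, n))
Y = rng.standard_normal(n)                       # any data vector works for these identities

D2 = Psi @ Psi.T / sigma ** 2                    # D^2 = sigma^{-2} Psi Psi^T

def L(th):                                        # quadratic (quasi) log-likelihood, constants dropped
    r = Y - Psi.T @ th
    return -0.5 * (r @ r) / sigma ** 2

def gradL(th):
    return Psi @ (Y - Psi.T @ th) / sigma ** 2

theta_tilde = np.linalg.solve(Psi @ Psi.T, Psi @ Y)   # maximizer of L

# (4.15): theta_tilde - theta0 = D^{-2} grad L(theta0) for an arbitrary theta0
theta0 = rng.standard_normal(p)
print(np.allclose(theta_tilde - theta0, np.linalg.solve(D2, gradL(theta0))))

# (4.17): 2 {L(theta_tilde) - L(theta)} = (theta_tilde - theta)^T D^2 (theta_tilde - theta)
theta = rng.standard_normal(p)
lhs = 2 * (L(theta_tilde) - L(theta))
rhs = (theta_tilde - theta) @ D2 @ (theta_tilde - theta)
print(np.isclose(lhs, rhs))
```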
4.6 Inference based on the maximum likelihood

All the results presented above for linear models were based on the explicit representation of the (quasi) MLE \( \widetilde{\theta} \). Here we present the approach based on the analysis of the maximum likelihood. This approach does not require any analytic expression for the point of maximum of the (quasi) likelihood process \( L(\theta) \). Instead, we work directly with the maximum of this process. We establish exponential inequalities for the excess, or maximum likelihood, \( L(\widetilde{\theta}, \theta^*) \). We also show how these results can be used to study the accuracy of the MLE \( \widetilde{\theta} \), in particular, for building confidence sets.

One more benefit of the ML-based approach is that it applies equally to a homogeneous and to a heterogeneous noise, provided that the noise structure is not misspecified. The celebrated chi-squared result about the maximum likelihood \( L(\widetilde{\theta}, \theta^*) \) claims that the distribution of \( 2L(\widetilde{\theta}, \theta^*) \) is chi-squared with \( p \) degrees of freedom, \( \chi^2_p \), and does not depend on the noise covariance; see Theorem 4.6.2 below.

Now we specify the setup. The starting point of the ML approach is the linear Gaussian model assumption \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \). The corresponding log-likelihood ratio \( L(\theta) \) can be written as
\[
L(\theta) = -\frac{1}{2}(Y - \Psi^{\top}\theta)^{\top}\Sigma^{-1}(Y - \Psi^{\top}\theta) + R, \tag{4.21}
\]
where the remainder term \( R \) does not depend on \( \theta \). Now one can see that \( L(\theta) \) is a quadratic function of \( \theta \). Moreover, \( -\nabla^2 L(\theta) = \Psi\Sigma^{-1}\Psi^{\top} \), so that \( L(\theta) \) is quadratic with \( D^2 = \Psi\Sigma^{-1}\Psi^{\top} \). This enables us to apply the general results of Section 4.5, which are only based on the geometric (quadratic) structure of the log-likelihood \( L(\theta) \): the true data distribution can be arbitrary.

Theorem 4.6.1. Consider \( L(\theta) \) from (4.21). For any \( \theta \), it holds with \( D^2 = \Psi\Sigma^{-1}\Psi^{\top} \)
\[
L(\widetilde{\theta}, \theta) = (\widetilde{\theta} - \theta)^{\top}D^2(\widetilde{\theta} - \theta)/2 . \tag{4.22}
\]
In particular, if \( \Sigma = \sigma^2 I_n \), then the fitted log-likelihood is proportional to the quadratic loss \( \|\widetilde{f} - f_{\theta}\|^2 \) for \( \widetilde{f} = \Psi^{\top}\widetilde{\theta} \) and \( f_{\theta} = \Psi^{\top}\theta \):
\[
L(\widetilde{\theta}, \theta) = \frac{1}{2\sigma^2}\bigl\|\Psi^{\top}(\widetilde{\theta} - \theta)\bigr\|^2 = \frac{1}{2\sigma^2}\bigl\|\widetilde{f} - f_{\theta}\bigr\|^2 .
\]
If \( \theta^* \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta}\mathbb{E} L(\theta) = D^{-2}\Psi\Sigma^{-1}f^* \) for \( f^* = \mathbb{E} Y \), then
\[
2L(\widetilde{\theta}, \theta^*) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^2 \tag{4.23}
\]
with \( \zeta \stackrel{\mathrm{def}}{=} \nabla L(\theta^*) \) and \( \xi = D^{-1}\zeta \).

Proof. The results (4.22) and (4.23) follow from Theorem 4.5.1; see (4.17) and (4.18).

If the model assumptions are not misspecified, one can establish the remarkable \( \chi^2 \) result.

Theorem 4.6.2. Let \( L(\theta) \) from (4.21) be the log-likelihood for the model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \). Then \( \xi = D^{-1}\zeta \sim N(0, I_p) \) and \( 2L(\widetilde{\theta}, \theta^*) \sim \chi^2_p \) is chi-squared with \( p \) degrees of freedom.

Proof. By direct calculation,
\[
\zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}(Y - \Psi^{\top}\theta^*) = \Psi\Sigma^{-1}\varepsilon .
\]
So \( \zeta \) is a linear transformation of the Gaussian vector \( Y \) and thus it is Gaussian as well. By Theorem 4.5.1, \( \mathbb{E}\zeta = 0 \). Moreover, \( \operatorname{Var}(\varepsilon) = \Sigma \) implies
\[
\operatorname{Var}(\zeta) = \mathbb{E}\bigl[\Psi\Sigma^{-1}\varepsilon\varepsilon^{\top}\Sigma^{-1}\Psi^{\top}\bigr] = \Psi\Sigma^{-1}\Psi^{\top} = D^2,
\]
yielding that \( \xi = D^{-1}\zeta \) is standard normal.

The last result, \( 2L(\widetilde{\theta}, \theta^*) \sim \chi^2_p \), is sometimes called the “chi-squared phenomenon”: the distribution of the maximum likelihood only depends on the number of parameters to be estimated and is independent of the design \( \Psi \), of the noise covariance matrix \( \Sigma \), etc. This explains the use of the word “phenomenon” in the name of the result.
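A Monte Carlo sketch can make the chi-squared phenomenon visible. In the following Python/NumPy code (the particular design, the covariance \( \Sigma \) and all dimensions are illustrative assumptions), the likelihood is built with the correct covariance, and the empirical mean and variance of \( 2L(\widetilde{\theta}, \theta^*) \) are close to \( p \) and \( 2p \), as a \( \chi^2_p \) distribution requires.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
Psi = rng.standard_normal((p, n))

# A non-trivial noise covariance Sigma, used both for simulation and in the likelihood
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)
root = np.linalg.cholesky(Sigma)

theta_star = rng.standard_normal(p)
D2 = Psi @ Sigma_inv @ Psi.T

def twice_excess(Y):
    # 2 L(theta_tilde, theta*) = zeta^T D^{-2} zeta with zeta = Psi Sigma^{-1} (Y - Psi^T theta*)
    zeta = Psi @ Sigma_inv @ (Y - Psi.T @ theta_star)
    return zeta @ np.linalg.solve(D2, zeta)

samples = []
for _ in range(20000):
    Y = Psi.T @ theta_star + root @ rng.standard_normal(n)
    samples.append(twice_excess(Y))
samples = np.array(samples)

print("mean (should be near p = 4)     :", samples.mean())
print("variance (should be near 2p = 8):", samples.var())
```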
Exercise 4.6.1. Check that the linear transformation \( \check{Y} = \Sigma^{-1/2}Y \) of the data does not change the value of the log-likelihood ratio \( L(\theta, \theta^*) \) and hence of the maximum likelihood \( L(\widetilde{\theta}, \theta^*) \).
Hint: use the representation
\[
L(\theta) = -\frac{1}{2}(Y - \Psi^{\top}\theta)^{\top}\Sigma^{-1}(Y - \Psi^{\top}\theta) + R
= -\frac{1}{2}(\check{Y} - \check{\Psi}^{\top}\theta)^{\top}(\check{Y} - \check{\Psi}^{\top}\theta) + R
\]
and check that the transformed data \( \check{Y} \) is described by the model \( \check{Y} = \check{\Psi}^{\top}\theta^* + \check{\varepsilon} \) with \( \check{\Psi} = \Psi\Sigma^{-1/2} \) and \( \check{\varepsilon} = \Sigma^{-1/2}\varepsilon \sim N(0, I_n) \), yielding the same log-likelihood ratio as in the original model.

Exercise 4.6.2. Assume homogeneous noise in (4.21) with \( \Sigma = \sigma^2 I_n \). Then it holds
\[
2L(\widetilde{\theta}, \theta^*) = \sigma^{-2}\|\Pi\varepsilon\|^2
\]
where \( \Pi = \Psi^{\top}(\Psi\Psi^{\top})^{-1}\Psi \) is the projector in \( \mathbb{R}^n \) onto the subspace spanned by the vectors \( \psi_1, \ldots, \psi_p \).
Hint: use that \( \zeta = \sigma^{-2}\Psi\varepsilon \), \( D^2 = \sigma^{-2}\Psi\Psi^{\top} \), and
\[
\sigma^{-2}\|\Pi\varepsilon\|^2 = \sigma^{-2}\varepsilon^{\top}\Pi^{\top}\Pi\varepsilon = \sigma^{-2}\varepsilon^{\top}\Pi\varepsilon = \zeta^{\top}D^{-2}\zeta .
\]
We write the result of Theorem 4.6.2 in the form \( 2L(\widetilde{\theta}, \theta^*) \sim \chi^2_p \), where \( \chi^2_p \) stands for the chi-squared distribution with \( p \) degrees of freedom. This result can be used to build likelihood-based confidence ellipsoids for the parameter \( \theta^* \). Given \( z > 0 \), define
\[
\mathcal{E}(z) = \bigl\{\theta : L(\widetilde{\theta}, \theta) \leq z\bigr\} = \Bigl\{\theta : \sup_{\theta'} L(\theta') - L(\theta) \leq z\Bigr\}. \tag{4.24}
\]

Theorem 4.6.3. Assume \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \) and consider the MLE \( \widetilde{\theta} \). Define \( z_{\alpha} \) by \( \mathbb{P}\bigl(\chi^2_p > 2z_{\alpha}\bigr) = \alpha \). Then \( \mathcal{E}(z_{\alpha}) \) from (4.24) is an \( \alpha \)-confidence set for \( \theta^* \).

Exercise 4.6.3. Let \( D^2 = \Psi\Sigma^{-1}\Psi^{\top} \). Check that in the case of linear modeling the likelihood-based CS \( \mathcal{E}(z_{\alpha}) \) coincides with the estimate-based elliptic set:
\[
\mathcal{E}(z_{\alpha}) = \bigl\{\theta : \|D(\widetilde{\theta} - \theta)\|^2 \leq 2z_{\alpha}\bigr\}.
\]

Another corollary of the chi-squared result is a concentration bound for the maximum likelihood. A similar result was stated for the univariate exponential family model: the value \( L(\widetilde{\theta}, \theta^*) \) is stochastically bounded with exponential moments, and the bound does not depend on the particular family, parameter value, sample size, etc. Now we can extend this result to the case of a linear Gaussian model. Indeed, Theorem 4.6.2 states that the distribution of \( 2L(\widetilde{\theta}, \theta^*) \) is chi-squared and only depends on the number of parameters to be estimated. The latter distribution concentrates on a ball of radius of order \( p^{1/2} \), and the deviation probability is exponentially small.

Theorem 4.6.4. Assume \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \). Then for every \( x > 0 \), it holds with \( \varkappa \geq 6.6 \)
\[
\mathbb{P}\Bigl(2L(\widetilde{\theta}, \theta^*) > p + \sqrt{\varkappa x p} \vee (\varkappa x)\Bigr)
= \mathbb{P}\Bigl(\|D(\widetilde{\theta} - \theta^*)\|^2 > p + \sqrt{\varkappa x p} \vee (\varkappa x)\Bigr) \leq \exp(-x). \tag{4.25}
\]

Proof. Define \( \xi \stackrel{\mathrm{def}}{=} D(\widetilde{\theta} - \theta^*) \). By Theorem 4.4.4, \( \xi \) is a standard normal vector in \( \mathbb{R}^p \), and by Theorem 4.6.1, \( 2L(\widetilde{\theta}, \theta^*) = \|\xi\|^2 \). Now the statement (4.25) follows from the general deviation bound for Gaussian quadratic forms; see Theorem 9.2.1.

The main message of this result can be explained as follows: the probability that the estimate \( \widetilde{\theta} \) does not belong to the elliptic set \( \{\theta : \|D(\widetilde{\theta} - \theta)\| \leq z\} \) starts to vanish when \( z^2 \) exceeds the dimensionality \( p \) of the parameter space. Similarly, the probability that the true parameter \( \theta^* \) is not covered by the confidence set \( \mathcal{E}(z) \) starts to vanish when \( 2z \) exceeds \( p \).

Corollary 4.6.1. Assume \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \). Then for every \( x > 0 \), it holds with \( 2z = p + \sqrt{\varkappa x p} \vee (\varkappa x) \) for \( \varkappa \geq 6.6 \)
\[
\mathbb{P}\bigl(\mathcal{E}(z) \not\ni \theta^*\bigr) \leq \exp(-x).
\]

Exercise 4.6.4. Compute \( z \) ensuring a coverage of 95% in the dimensions \( p = 1, 2, 10, 20 \).
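One possible numerical route for such a computation (an illustrative sketch in Python with SciPy; it presumes the convention \( \mathbb{P}(\chi^2_p > 2z_{\alpha}) = \alpha \) of Theorem 4.6.3) is:

```python
from scipy.stats import chi2

# 2 z_alpha is the (1 - alpha)-quantile of chi^2_p, so z_alpha = chi2.ppf(0.95, p) / 2
for p in (1, 2, 10, 20):
    z = chi2.ppf(0.95, p) / 2
    print(f"p = {p:2d}:  z_alpha ~= {z:.2f}")
```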
4.6.1 A misspecified LPA

Now we discuss the behavior of the fitted log-likelihood under a misspecified linear parametric assumption \( \mathbb{E} Y = \Psi^{\top}\theta^* \). Let the response function \( f^* \) not be linearly expandable as \( f^* = \Psi^{\top}\theta^* \). Following Theorem 4.4.3, define \( \theta^* = Sf^* \) with \( S = (\Psi\Sigma^{-1}\Psi^{\top})^{-1}\Psi\Sigma^{-1} \). This point provides the best approximation of the nonlinear response \( f^* \) by a linear parametric fit \( \Psi^{\top}\theta \).

Theorem 4.6.5. Assume \( Y = f^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \). Let \( \theta^* = Sf^* \). Then \( \widetilde{\theta} \) is an R-efficient estimate of \( \theta^* \) and
\[
2L(\widetilde{\theta}, \theta^*) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^2 \sim \chi^2_p ,
\]
where \( D^2 = \Psi\Sigma^{-1}\Psi^{\top} \), \( \zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}\varepsilon \), \( \xi = D^{-1}\zeta \) is a standard normal vector in \( \mathbb{R}^p \), and \( \chi^2_p \) is a chi-squared random variable with \( p \) degrees of freedom. In particular, \( \mathcal{E}(z_{\alpha}) \) is an \( \alpha \)-CS for the vector \( \theta^* \) and the bound of Corollary 4.6.1 applies.

Exercise 4.6.5. Prove the result of Theorem 4.6.5.
4.6.2 A misspecified noise structure

This section addresses the features of the maximum likelihood in the case when the likelihood is built under a wrong assumption about the noise structure. As one can expect, the chi-squared result is not valid anymore in this situation, and the distribution of the maximum likelihood depends on the true noise covariance. However, the nice geometric structure of the maximum likelihood manifested by Theorems 4.6.1 and 4.6.3 does not rely on the true data distribution; it is only based on our structural assumptions on the considered model. This helps to get rigorous results about the behavior of the maximum likelihood and particularly about its concentration properties.

Theorem 4.6.6. Let \( \widetilde{\theta} \) be built for the model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \varepsilon \sim N(0, \Sigma) \), while the true noise covariance is \( \Sigma_0 \): \( \mathbb{E}\varepsilon = 0 \) and \( \operatorname{Var}(\varepsilon) = \Sigma_0 \). Then
\[
\mathbb{E}\widetilde{\theta} = \theta^*, \qquad \operatorname{Var}(\widetilde{\theta}) = D^{-2}W^2 D^{-2},
\]
where
\[
D^2 = \Psi\Sigma^{-1}\Psi^{\top}, \qquad W^2 = \Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^{\top}.
\]
Further,
\[
2L(\widetilde{\theta}, \theta^*) = \|D(\widetilde{\theta} - \theta^*)\|^2 = \|\xi\|^2, \tag{4.26}
\]
where \( \xi \) is a random vector in \( \mathbb{R}^p \) with \( \mathbb{E}\xi = 0 \) and
\[
\operatorname{Var}(\xi) \stackrel{\mathrm{def}}{=} B = D^{-1}W^2 D^{-1}.
\]
Moreover, if \( \varepsilon \sim N(0, \Sigma_0) \), then \( \widetilde{\theta} \sim N\bigl(\theta^*, D^{-2}W^2 D^{-2}\bigr) \) and \( \xi \sim N(0, B) \).

Proof. The moments of \( \widetilde{\theta} \) have been computed in Theorem 4.5.1, while the equality \( 2L(\widetilde{\theta}, \theta^*) = \|D(\widetilde{\theta} - \theta^*)\|^2 = \|\xi\|^2 \) is given in Theorem 4.6.1. Next, \( \zeta = \nabla L(\theta^*) = \Psi\Sigma^{-1}\varepsilon \) and
\[
W^2 \stackrel{\mathrm{def}}{=} \operatorname{Var}(\zeta) = \Psi\Sigma^{-1}\operatorname{Var}(\varepsilon)\Sigma^{-1}\Psi^{\top} = \Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^{\top}.
\]
This implies that
\[
\operatorname{Var}(\xi) = \mathbb{E}\xi\xi^{\top} = D^{-1}\operatorname{Var}(\zeta)D^{-1} = D^{-1}W^2 D^{-1}.
\]
It remains to note that if \( \varepsilon \) is a Gaussian vector, then \( \zeta = \Psi\Sigma^{-1}\varepsilon \), \( \xi = D^{-1}\zeta \), and \( \widetilde{\theta} - \theta^* = D^{-2}\zeta \) are Gaussian as well.

Exercise 4.6.6. Check that \( \Sigma_0 = \Sigma \) leads back to the \( \chi^2 \)-result.

One can see that the chi-squared result is no longer valid if the noise structure is misspecified. An interesting question is whether the CS \( \mathcal{E}(z) \) can still be applied in the case of a misspecified noise under a proper adjustment of the value \( z \). Surprisingly, the answer is not entirely negative. The reason is that the vector \( \xi \) from (4.26) is zero mean and its norm behaves similarly to the case of the correct noise specification: the probability \( \mathbb{P}\bigl(\|\xi\| > z\bigr) \) starts to degenerate when \( z^2 \) exceeds \( \mathbb{E}\|\xi\|^2 \). A general bound from Theorem 9.2.2 in Section 9.1 implies the following bound for the coverage probability.

Corollary 4.6.2. Under the conditions of Theorem 4.6.6, for every \( x > 0 \), it holds with \( \mathrm{p} = \operatorname{tr}(B) \), \( v^2 = 2\operatorname{tr}(B^2) \), and \( a^* = \|B\|_{\infty} \)
\[
\mathbb{P}\Bigl(2L(\widetilde{\theta}, \theta^*) > \mathrm{p} + (2vx^{1/2}) \vee (6a^*x)\Bigr) \leq \exp(-x).
\]

Exercise 4.6.7. Show that an overestimation of the noise, in the sense \( \Sigma \geq \Sigma_0 \), preserves the coverage probability for the CS \( \mathcal{E}(z_{\alpha}) \); that is, if \( 2z_{\alpha} \) is the \( (1 - \alpha) \)-quantile of \( \chi^2_p \), then \( \mathbb{P}\bigl(\mathcal{E}(z_{\alpha}) \not\ni \theta^*\bigr) \leq \alpha \).
4.7 Ridge regression, projection, and shrinkage

This section discusses the important situation when the number of predictors \( \psi_j \), and hence the number of parameters \( p \) in the linear model \( Y = \Psi^{\top}\theta^* + \varepsilon \), is not small relative to the sample size. Then the least squares or maximum likelihood approach meets serious problems. The first one relates to numerical issues: the definition of the LSE \( \widetilde{\theta} \) involves the inversion of the \( p \times p \) matrix \( \Psi\Psi^{\top} \), and such an inversion becomes a delicate task for large \( p \). The other problem concerns the inference for the estimated parameter \( \theta^* \). The risk bound and the width of the confidence set are proportional to the parameter dimension \( p \); thus, with large \( p \), the inference statements become almost uninformative. In particular, if \( p \) is of the order of the sample size \( n \), even consistency is not achievable. One faces a really critical situation. We already know that the MLE is the efficient estimate in the class of all unbiased estimates. At the same time, it is highly inefficient in overparametrized models. The only way out of this situation is to sacrifice the unbiasedness property in favor of reducing the model complexity: some procedures can be more efficient than the MLE even if they are biased. This section discusses one way of resolving these problems by regularization or shrinkage. To be more specific, for the rest of the section we consider the following setup. The observed vector \( Y \) follows the model
\[
Y = f^* + \varepsilon \tag{4.27}
\]
with a homogeneous error vector \( \varepsilon \): \( \mathbb{E}\varepsilon = 0 \), \( \operatorname{Var}(\varepsilon) = \sigma^2 I_n \). Noise misspecification is not considered in this section.

Furthermore, we assume that a basis or a collection of basis vectors \( \psi_1, \ldots, \psi_p \) is given with \( p \) large. This allows for approximating the response vector \( f^* = \mathbb{E} Y \) in the form \( f^* = \Psi^{\top}\theta^* \), or, equivalently,
\[
f^* = \theta^*_1\psi_1 + \ldots + \theta^*_p\psi_p .
\]
In many cases we will assume that the basis is already orthonormalized: \( \Psi\Psi^{\top} = I_p \). The model (4.27) can be rewritten as
\[
Y = \Psi^{\top}\theta^* + \varepsilon, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 I_n .
\]
The MLE or LSE of the parameter vector \( \theta^* \) for this model reads as
\[
\widetilde{\theta} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y, \qquad
\widetilde{f} = \Psi^{\top}\widetilde{\theta} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y .
\]
If the matrix \( \Psi\Psi^{\top} \) is degenerate or badly conditioned, computing the MLE \( \widetilde{\theta} \) is a hard task. Below we discuss how this problem can be treated.
4.7.1 Regularization and ridge regression

Let \( R \) be a positive symmetric \( p \times p \) matrix. Then the sum \( \Psi\Psi^{\top} + R \) is positive symmetric as well and can be inverted whatever the matrix \( \Psi \) is. This suggests replacing \( (\Psi\Psi^{\top})^{-1} \) by \( (\Psi\Psi^{\top} + R)^{-1} \), leading to the regularized least squares estimate \( \widetilde{\theta}_R \) of the parameter vector \( \theta^* \) and the corresponding response estimate \( \widetilde{f}_R \):
\[
\widetilde{\theta}_R \stackrel{\mathrm{def}}{=} \bigl(\Psi\Psi^{\top} + R\bigr)^{-1}\Psi Y, \qquad
\widetilde{f}_R \stackrel{\mathrm{def}}{=} \Psi^{\top}\bigl(\Psi\Psi^{\top} + R\bigr)^{-1}\Psi Y . \tag{4.28}
\]
Such a method is also called ridge regression. An example of choosing \( R \) is a multiple of the unit matrix: \( R = \alpha I_p \), where \( \alpha > 0 \) and \( I_p \) stands for the unit matrix. This method is also called Tikhonov regularization and it results in the parameter estimate \( \widetilde{\theta}_{\alpha} \) and the response estimate \( \widetilde{f}_{\alpha} \):
\[
\widetilde{\theta}_{\alpha} \stackrel{\mathrm{def}}{=} \bigl(\Psi\Psi^{\top} + \alpha I_p\bigr)^{-1}\Psi Y, \qquad
\widetilde{f}_{\alpha} \stackrel{\mathrm{def}}{=} \Psi^{\top}\bigl(\Psi\Psi^{\top} + \alpha I_p\bigr)^{-1}\Psi Y . \tag{4.29}
\]
A proper choice of the matrix \( R \) for the ridge regression method (4.28) or of the parameter \( \alpha \) for the Tikhonov regularization (4.29) is an important issue. Below we discuss several approaches which lead to the estimate (4.28) with a specific choice of the matrix \( R \). The properties of the estimates \( \widetilde{\theta}_R \) and \( \widetilde{f}_R \) will be studied in the context of penalized likelihood estimation in the next section.
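A brief numerical sketch of the ridge estimate (4.29) in Python/NumPy follows; the ill-conditioned design, the sparse coefficient vector and the grid of values of \( \alpha \) are illustrative assumptions chosen only to show how the estimate behaves as the regularization grows.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 100, 40, 1.0

# Ill-conditioned design: the rows of Psi have rapidly decaying scales,
# so Psi Psi^T is hard to invert and the plain LSE (alpha = 0) is very unstable.
Psi = np.diag(1.0 / np.arange(1, p + 1) ** 2) @ rng.standard_normal((p, n))
theta_star = np.zeros(p); theta_star[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

def ridge(alpha):
    # Tikhonov regularization (4.29): theta_alpha = (Psi Psi^T + alpha I_p)^{-1} Psi Y
    return np.linalg.solve(Psi @ Psi.T + alpha * np.eye(p), Psi @ Y)

for alpha in (0.0, 0.01, 0.1, 1.0):
    err = np.linalg.norm(ridge(alpha) - theta_star)
    print(f"alpha = {alpha:5.2f}   ||theta_alpha - theta*|| = {err:10.3f}")
```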
4.7.2 Penalized likelihood. Bias and variance

The estimate (4.28) can be obtained in a natural way within the (quasi) ML approach using penalized least squares. The classical unpenalized method is based on minimizing the sum of squared residuals:
\[
\widetilde{\theta} = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmin}_{\theta} \|Y - \Psi^{\top}\theta\|^2
\]
with \( L(\theta) = -(2\sigma^2)^{-1}\|Y - \Psi^{\top}\theta\|^2 \). (Here we omit the terms which do not depend on \( \theta \).) Now we introduce an additional penalty on the objective function which penalizes the complexity of the candidate vector \( \theta \), expressed by the value \( \|G\theta\|^2/2 \) for a given symmetric matrix \( G \). This choice of complexity measure implicitly assumes that the vector \( \theta \equiv 0 \) has the smallest complexity, equal to zero, and that the complexity increases with the norm of \( G\theta \). Define the penalized log-likelihood
\[
L_G(\theta) \stackrel{\mathrm{def}}{=} L(\theta) - \|G\theta\|^2/2
= -(2\sigma^2)^{-1}\|Y - \Psi^{\top}\theta\|^2 - \|G\theta\|^2/2 - (n/2)\log(2\pi\sigma^2). \tag{4.30}
\]
The penalized MLE reads as
\[
\widetilde{\theta}_G = \operatorname*{argmax}_{\theta} L_G(\theta)
= \operatorname*{argmin}_{\theta} \bigl\{(2\sigma^2)^{-1}\|Y - \Psi^{\top}\theta\|^2 + \|G\theta\|^2/2\bigr\}.
\]
A straightforward calculation leads to the expression (4.28) for \( \widetilde{\theta}_G \) with \( R = \sigma^2 G^2 \):
\[
\widetilde{\theta}_G \stackrel{\mathrm{def}}{=} \bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi Y . \tag{4.31}
\]
We see that \( \widetilde{\theta}_G \) is again a linear estimate: \( \widetilde{\theta}_G = S_G Y \) with \( S_G = (\Psi\Psi^{\top} + \sigma^2 G^2)^{-1}\Psi \). The results of Section 4.4 explain that \( \widetilde{\theta}_G \) in fact estimates the value \( \theta_G \) defined by
\[
\theta_G = \operatorname*{argmax}_{\theta} \mathbb{E} L_G(\theta)
= \operatorname*{argmin}_{\theta} \mathbb{E}\bigl\{\|Y - \Psi^{\top}\theta\|^2 + \sigma^2\|G\theta\|^2\bigr\}
= \bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi f^* = S_G f^* . \tag{4.32}
\]
In particular, if \( f^* = \Psi^{\top}\theta^* \), then
\[
\theta_G = \bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi\Psi^{\top}\theta^* \tag{4.33}
\]
and \( \theta_G \neq \theta^* \) unless \( G = 0 \). In other words, the penalized MLE \( \widetilde{\theta}_G \) is biased.

Exercise 4.7.1. Check that \( \mathbb{E}\widetilde{\theta}_{\alpha} = \theta_{\alpha} \) for \( \theta_{\alpha} = (\Psi\Psi^{\top} + \alpha I_p)^{-1}\Psi\Psi^{\top}\theta^* \), and that the bias \( \|\theta_{\alpha} - \theta^*\| \) grows with the regularization parameter \( \alpha \).

The penalized MLE \( \widetilde{\theta}_G \) leads to the response estimate \( \widetilde{f}_G = \Psi^{\top}\widetilde{\theta}_G \).

Exercise 4.7.2. Check that the penalized ML approach leads to the response estimate
\[
\widetilde{f}_G = \Psi^{\top}\widetilde{\theta}_G = \Psi^{\top}\bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi Y = \Pi_G Y
\]
with \( \Pi_G = \Psi^{\top}(\Psi\Psi^{\top} + \sigma^2 G^2)^{-1}\Psi \). Show that \( \Pi_G \) is a sub-projector in the sense that \( \|\Pi_G u\| \leq \|u\| \) for any \( u \in \mathbb{R}^n \).
Exercise 4.7.3. Let \( \Psi \) be orthonormal: \( \Psi\Psi^{\top} = I_p \). Then the penalized MLE \( \widetilde{\theta}_G \) can be represented as
\[
\widetilde{\theta}_G = \bigl(I_p + \sigma^2 G^2\bigr)^{-1}Z,
\]
where \( Z = \Psi Y \) is the vector of empirical Fourier coefficients. Specify the result for the case of a diagonal matrix \( G = \operatorname{diag}(g_1, \ldots, g_p) \) and describe the corresponding response estimate \( \widetilde{f}_G \).

The previous results indicate that introducing the penalization leads to some bias of estimation. One can ask about the benefit of using a penalized procedure. The next result shows that penalization decreases the variance of estimation and thus makes the procedure more stable.

Theorem 4.7.1. Let \( \widetilde{\theta}_G \) be the penalized MLE from (4.31). Then \( \mathbb{E}\widetilde{\theta}_G = \theta_G \), see (4.33), and under noise homogeneity \( \operatorname{Var}(\varepsilon) = \sigma^2 I_n \), it holds
\[
\operatorname{Var}(\widetilde{\theta}_G)
= \bigl(\sigma^{-2}\Psi\Psi^{\top} + G^2\bigr)^{-1}\sigma^{-2}\Psi\Psi^{\top}\bigl(\sigma^{-2}\Psi\Psi^{\top} + G^2\bigr)^{-1}
= D_G^{-2}D^2 D_G^{-2}
\]
with \( D_G^2 = \sigma^{-2}\Psi\Psi^{\top} + G^2 \). In particular, \( \operatorname{Var}(\widetilde{\theta}_G) \leq D_G^{-2} \). If \( \varepsilon \sim N(0, \sigma^2 I_n) \), then \( \widetilde{\theta}_G \) is also normal: \( \widetilde{\theta}_G \sim N\bigl(\theta_G, D_G^{-2}D^2 D_G^{-2}\bigr) \).
Moreover, the bias \( \|\theta_G - \theta^*\| \) monotonously increases in \( G^2 \), while the variance monotonously decreases with the penalization \( G \).

Proof. The first two moments of \( \widetilde{\theta}_G \) are computed from \( \widetilde{\theta}_G = S_G Y \). Monotonicity of the bias and variance of \( \widetilde{\theta}_G \) is proved below in Exercise 4.7.6.

Exercise 4.7.4. Let \( \Psi \) be orthonormal: \( \Psi\Psi^{\top} = I_p \). Describe \( \operatorname{Var}(\widetilde{\theta}_G) \). Show that the variance decreases with the penalization \( G \) in the sense that \( G_1 \geq G \) implies \( \operatorname{Var}(\widetilde{\theta}_{G_1}) \leq \operatorname{Var}(\widetilde{\theta}_G) \).

Exercise 4.7.5. Let \( \Psi\Psi^{\top} = I_p \) and let \( G = \operatorname{diag}(g_1, \ldots, g_p) \) be a diagonal matrix. Compute the squared bias \( \|\theta_G - \theta^*\|^2 \) and show that it monotonously increases in each \( g_j \) for \( j = 1, \ldots, p \).

Exercise 4.7.6. Let \( G \) be a symmetric matrix and \( \widetilde{\theta}_G \) the corresponding penalized MLE. Show that the variance \( \operatorname{Var}(\widetilde{\theta}_G) \) decreases while the bias \( \|\theta_G - \theta^*\| \) increases in \( G^2 \).
Hint: with \( D^2 = \sigma^{-2}\Psi\Psi^{\top} \), show that for any vector \( w \in \mathbb{R}^p \) and \( u = D^{-1}w \), it holds
\[
w^{\top}\operatorname{Var}(\widetilde{\theta}_G)w = u^{\top}\bigl(I_p + D^{-1}G^2 D^{-1}\bigr)^{-2}u
\]
and this value decreases with \( G^2 \) because \( I_p + D^{-1}G^2 D^{-1} \) increases. Show in a similar way that
\[
\|\theta_G - \theta^*\|^2 = \bigl\|\bigl(D^2 + G^2\bigr)^{-1}G^2\theta^*\bigr\|^2 = {\theta^*}^{\top}\Gamma^{-1}\theta^*
\]
with \( \Gamma = \bigl(I_p + G^{-2}D^2\bigr)\bigl(I_p + D^2 G^{-2}\bigr) \). Show that the matrix \( \Gamma \) monotonously increases, and thus \( \Gamma^{-1} \) monotonously decreases, as a function of the symmetric matrix \( B = G^{-2} \).

Putting together the results about the bias and the variance of \( \widetilde{\theta}_G \) yields the statement about the quadratic risk.

Theorem 4.7.2. Assume the model \( Y = \Psi^{\top}\theta^* + \varepsilon \) with \( \operatorname{Var}(\varepsilon) = \sigma^2 I_n \). Then the estimate \( \widetilde{\theta}_G \) fulfills
\[
\mathbb{E}\|\widetilde{\theta}_G - \theta^*\|^2 = \|\theta_G - \theta^*\|^2 + \operatorname{tr}\bigl(D_G^{-2}D^2 D_G^{-2}\bigr).
\]
This result is called the bias-variance decomposition. The choice of a proper regularization is usually based on this decomposition: one selects a regularization from a given class to provide the minimal possible risk. This approach is referred to as the bias-variance trade-off.
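The bias-variance decomposition of Theorem 4.7.2 can be reproduced by simulation. The sketch below (Python/NumPy; the orthonormal design, the diagonal penalty and the decaying coefficients are illustrative assumptions) compares the Monte Carlo risk of \( \widetilde{\theta}_G \) with the sum of the squared bias and the variance term.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 80, 10, 1.0
Psi = np.linalg.qr(rng.standard_normal((n, p)))[0].T       # orthonormal rows for simplicity
theta_star = 2.0 ** -np.arange(p)                          # decaying true coefficients
f_star = Psi.T @ theta_star

G = np.diag(np.arange(p, dtype=float))                     # a diagonal roughness penalty (illustrative)
S_G = np.linalg.solve(Psi @ Psi.T + sigma ** 2 * G @ G, Psi)   # S_G = (Psi Psi^T + sigma^2 G^2)^{-1} Psi

theta_G = S_G @ f_star                                     # biased target from (4.33)
bias2 = np.sum((theta_G - theta_star) ** 2)
var_tr = sigma ** 2 * np.trace(S_G @ S_G.T)                # tr Var(theta_G estimate) = sigma^2 tr(S_G S_G^T)

risks = []
for _ in range(20000):
    Y = f_star + sigma * rng.standard_normal(n)
    risks.append(np.sum((S_G @ Y - theta_star) ** 2))

print("Monte Carlo risk   :", np.mean(risks))
print("bias^2 + variance  :", bias2 + var_tr)
```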
4.7.3 Inference for the penalized MLE

Here we discuss some properties of the penalized MLE \( \widetilde{\theta}_G \). In particular, we focus on the construction of confidence and concentration sets based on the penalized log-likelihood. We know that the regularized estimate \( \widetilde{\theta}_G \) is the empirical counterpart of the value \( \theta_G \) which solves the regularized deterministic problem (4.32). We also know that the key results are expressed via the value of the supremum \( \sup_{\theta} L_G(\theta) - L_G(\theta_G) \). The next result extends Theorem 4.6.1 to the penalized likelihood.

Theorem 4.7.3. Let \( L_G(\theta) \) be the penalized log-likelihood from (4.30). Then
\[
2L_G(\widetilde{\theta}_G, \theta_G) = \bigl(\widetilde{\theta}_G - \theta_G\bigr)^{\top}D_G^2\bigl(\widetilde{\theta}_G - \theta_G\bigr) \tag{4.34}
\]
\[
= \sigma^{-2}\varepsilon^{\top}\Pi_G\varepsilon \tag{4.35}
\]
with \( \Pi_G = \Psi^{\top}\bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi \).

In general the matrix \( \Pi_G \) is not a projector and hence \( \sigma^{-2}\varepsilon^{\top}\Pi_G\varepsilon \) is not \( \chi^2 \)-distributed; the chi-squared result does not apply.

Exercise 4.7.7. Prove (4.34).
Hint: apply the Taylor expansion of \( L_G(\theta) \) at \( \widetilde{\theta}_G \). Use that \( \nabla L_G(\widetilde{\theta}_G) = 0 \) and \( -\nabla^2 L_G(\theta) \equiv \sigma^{-2}\Psi\Psi^{\top} + G^2 \).

Exercise 4.7.8. Prove (4.35).
Hint: show that \( \widetilde{\theta}_G - \theta_G = S_G\varepsilon \) with \( S_G = \bigl(\Psi\Psi^{\top} + \sigma^2 G^2\bigr)^{-1}\Psi \).

The straightforward corollaries of Theorem 4.7.3 are the concentration and confidence probabilities. Define the confidence set \( \mathcal{E}_G(z) \) for \( \theta_G \) as
\[
\mathcal{E}_G(z) \stackrel{\mathrm{def}}{=} \bigl\{\theta : L_G(\widetilde{\theta}_G, \theta) \leq z\bigr\}.
\]
The definition implies the following result for the coverage probability:
\[
\mathbb{P}\bigl(\mathcal{E}_G(z) \not\ni \theta_G\bigr) \leq \mathbb{P}\bigl(L_G(\widetilde{\theta}_G, \theta_G) > z\bigr).
\]
Now the representation (4.35) for \( L_G(\widetilde{\theta}_G, \theta_G) \) reduces the problem to a deviation bound for a quadratic form. We apply the general result of Section 9.1.

Theorem 4.7.4. Let \( L_G(\theta) \) be the penalized log-likelihood from (4.30) and let \( \varepsilon \sim N(0, \sigma^2 I_n) \). Then it holds with \( p_G = \operatorname{tr}(\Pi_G) \) and \( v_G^2 = 2\operatorname{tr}(\Pi_G^2) \) that
\[
\mathbb{P}\Bigl(2L_G(\widetilde{\theta}_G, \theta_G) > p_G + (2v_G x^{1/2}) \vee (6x)\Bigr) \leq \exp(-x).
\]

Similarly one can state the concentration result. With \( D_G^2 = \sigma^{-2}\Psi\Psi^{\top} + G^2 \),
\[
2L_G(\widetilde{\theta}_G, \theta_G) = \bigl\|D_G\bigl(\widetilde{\theta}_G - \theta_G\bigr)\bigr\|^2
\]
and the result of Theorem 4.7.4 can be restated as the concentration bound:
\[
\mathbb{P}\Bigl(\bigl\|D_G\bigl(\widetilde{\theta}_G - \theta_G\bigr)\bigr\|^2 > p_G + (2v_G x^{1/2}) \vee (6x)\Bigr) \leq \exp(-x).
\]
In other words, \( \widetilde{\theta}_G \) concentrates on the set \( \mathcal{A}(z, \theta_G) = \bigl\{\theta : \|D_G(\theta - \theta_G)\|^2 \leq 2z\bigr\} \) for \( 2z > p_G \).
4.7.4 Projection and shrinkage estimates

Consider a linear model \( Y = \Psi^{\top}\theta^* + \varepsilon \) in which the matrix \( \Psi \) is orthonormal in the sense \( \Psi\Psi^{\top} = I_p \). Then the multiplication by \( \Psi \) maps this model into the sequence space model \( Z = \theta^* + \xi \), where \( Z = \Psi Y = (z_1, \ldots, z_p)^{\top} \) is the vector of empirical Fourier coefficients \( z_j = \psi_j^{\top}Y \). The noise \( \xi = \Psi\varepsilon \) inherits the features of the original noise \( \varepsilon \): if \( \varepsilon \) is zero mean and homogeneous, the same applies to \( \xi \). The number of coefficients \( p \) can be large or even infinite. To get a sensible estimate, one has to apply some regularization method. The simplest one is called projection: one just keeps the first \( m \) empirical coefficients \( z_1, \ldots, z_m \) and drops the others. The corresponding parameter estimate \( \widetilde{\theta}_m \) reads as
\[
\widetilde{\theta}_{m,j} =
\begin{cases}
z_j & \text{if } j \leq m, \\
0 & \text{otherwise.}
\end{cases}
\]
The response vector \( f^* = \mathbb{E} Y \) is estimated by \( \Psi^{\top}\widetilde{\theta}_m \), leading to the representation
\[
\widetilde{f}_m = z_1\psi_1 + \ldots + z_m\psi_m
\]
with \( z_j = \psi_j^{\top}Y \). In other words, \( \widetilde{f}_m \) is just the projection of the observed vector \( Y \) onto the subspace \( L_m \) spanned by the first \( m \) basis vectors \( \psi_1, \ldots, \psi_m \): \( L_m = \langle \psi_1, \ldots, \psi_m \rangle \). This explains the name of the method.
Clearly one can study the properties of \( \widetilde{\theta}_m \) or \( \widetilde{f}_m \) using the methods of the previous sections. However, one more question for this approach is still open: a proper choice of \( m \). The standard way of addressing this issue is based on the analysis of the quadratic risk.
Consider first the prediction risk defined as \( R(\widetilde{f}_m) = \mathbb{E}\|\widetilde{f}_m - f^*\|^2 \). Below we focus on the case of a homogeneous noise with \( \operatorname{Var}(\varepsilon) = \sigma^2 I_p \). An extension to the colored noise is possible. Recall that \( \widetilde{f}_m \) effectively estimates the vector \( f^*_m = \Pi_m f^* \), where \( \Pi_m \) is the projector onto \( L_m \); see Section 4.3.3. Moreover, the quadratic risk \( R(\widetilde{f}_m) \) can be decomposed as
\[
R(\widetilde{f}_m) = \|f^* - \Pi_m f^*\|^2 + \sigma^2 m = \sigma^2 m + \sum_{j=m+1}^{p} {\theta^*_j}^2 .
\]
Obviously the squared bias \( \|f^* - \Pi_m f^*\|^2 \) decreases with \( m \), while the variance \( \sigma^2 m \) grows linearly with \( m \). Risk minimization leads to the so-called bias-variance trade-off: one selects the \( m \) which minimizes the risk \( R(\widetilde{f}_m) \) over all possible \( m \):
\[
m^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{m} R(\widetilde{f}_m) = \operatorname*{argmin}_{m} \bigl\{\|f^* - \Pi_m f^*\|^2 + \sigma^2 m\bigr\}.
\]
Unfortunately this choice requires some information about the bias \( \|f^* - \Pi_m f^*\| \), which depends on the unknown vector \( f^* \). As this information is not available in typical situations, the value \( m^* \) is also called an oracle choice. A data-driven choice of \( m \) is one of the central issues in nonparametric statistics.
The situation does not change if we consider the estimation risk \( \mathbb{E}\|\widetilde{\theta}_m - \theta^*\|^2 \). Indeed, the basis orthonormality \( \Psi\Psi^{\top} = I_p \) implies for \( f^* = \Psi^{\top}\theta^* \)
\[
\|\widetilde{f}_m - f^*\|^2 = \|\Psi^{\top}\widetilde{\theta}_m - \Psi^{\top}\theta^*\|^2 = \|\widetilde{\theta}_m - \theta^*\|^2,
\]
and minimization of the estimation risk coincides with minimization of the prediction risk.
A disadvantage of the projection method is that it either keeps each empirical coefficient \( z_j \) or completely discards it. An extension of the projection method is called shrinkage: one multiplies every empirical coefficient \( z_j \) by a factor \( \alpha_j \in [0, 1] \). This leads to the shrinkage estimate \( \widetilde{\theta}_{\alpha} \) with
\[
\widetilde{\theta}_{\alpha,j} = \alpha_j z_j .
\]
Here \( \alpha \) stands for the vector of coefficients \( \alpha_j \) for \( j = 1, \ldots, p \). The projection method is a special case of this shrinkage with \( \alpha_j \) equal to one or zero. Another popular choice of the coefficients \( \alpha_j \) is given by
\[
\alpha_j = (1 - j/m)^{\beta}\,\mathbf{1}(j \leq m) \tag{4.36}
\]
for some \( \beta > 0 \) and \( m \leq p \). This choice ensures that the coefficients \( \alpha_j \) smoothly approach zero as \( j \) approaches the value \( m \), and \( \alpha_j \) vanishes for \( j > m \). In this case, the vector \( \alpha \) is completely specified by the two parameters \( m \) and \( \beta \). The projection method corresponds to \( \beta = 0 \). The design orthogonality \( \Psi\Psi^{\top} = I_p \) yields again that the estimation risk \( \mathbb{E}\|\widetilde{\theta}_{\alpha} - \theta^*\|^2 \) coincides with the prediction risk \( \mathbb{E}\|\widetilde{f}_{\alpha} - f^*\|^2 \).
Exercise 4.7.9. Let \( \operatorname{Var}(\varepsilon) = \sigma^2 I_p \). The risk \( R(\widetilde{f}_{\alpha}) \) of the shrinkage estimate \( \widetilde{f}_{\alpha} \) fulfills
\[
R(\widetilde{f}_{\alpha}) \stackrel{\mathrm{def}}{=} \mathbb{E}\|\widetilde{f}_{\alpha} - f^*\|^2
= \sum_{j=1}^{p} {\theta^*_j}^2(1 - \alpha_j)^2 + \sum_{j=1}^{p} \alpha_j^2\sigma^2 .
\]
Specify the cases of \( \alpha = \alpha(m, \beta) \) from (4.36). Evaluate the variance term \( \sum_j \alpha_j^2\sigma^2 \).
Hint: approximate the sum over \( j \) by the integral \( \int (1 - x/m)_{+}^{2\beta}\,dx \).

The oracle choice is again defined by risk minimization:
\[
\alpha^* \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_{\alpha} R(\widetilde{f}_{\alpha}),
\]
where the minimization is taken over the class of all considered coefficient vectors \( \alpha \).

One way of obtaining a shrinkage estimate in the sequence space model \( Z = \theta^* + \xi \) is by using a roughness penalization. Let \( G \) be a symmetric matrix. Consider the regularized estimate \( \widetilde{\theta}_G \) from (4.31). The next result claims that if \( G \) is a diagonal matrix, then \( \widetilde{\theta}_G \) is a shrinkage estimate. Moreover, a general penalized MLE can be represented as a shrinkage estimate after an orthogonal basis transformation.

Theorem 4.7.5. Let \( G \) be a diagonal matrix, \( G = \operatorname{diag}(g_1, \ldots, g_p) \). The penalized MLE \( \widetilde{\theta}_G \) in the sequence space model \( Z = \theta^* + \xi \) with \( \xi \sim N(0, \sigma^2 I_p) \) coincides with the shrinkage estimate \( \widetilde{\theta}_{\alpha} \) for \( \alpha_j = (1 + \sigma^2 g_j^2)^{-1} \leq 1 \). Moreover, a penalized MLE \( \widetilde{\theta}_G \) for a general matrix \( G \) can be reduced to a shrinkage estimate by a basis transformation in the sequence space model.

Proof. The first statement for a diagonal matrix \( G \) follows from the representation \( \widetilde{\theta}_G = (I_p + \sigma^2 G^2)^{-1}Z \). Next, let \( U \) be an orthogonal transform leading to the diagonal representation \( G^2 = U^{\top}D^2 U \) with a diagonal matrix \( D^2 \). Then
\[
U\widetilde{\theta}_G = \bigl(I_p + \sigma^2 D^2\bigr)^{-1}UZ,
\]
that is, \( U\widetilde{\theta}_G \) is a shrinkage estimate in the transformed model \( UZ = U\theta^* + U\xi \).

In other words, roughness penalization results in some kind of shrinkage. Interestingly, the inverse statement holds as well.

Exercise 4.7.10. Let \( \widetilde{\theta}_{\alpha} \) be a shrinkage estimate for a vector \( \alpha = (\alpha_j) \). Then there is a diagonal penalty matrix \( G \) such that \( \widetilde{\theta}_{\alpha} = \widetilde{\theta}_G \).
Hint: define the \( j \)-th diagonal entry \( g_j \) by the equation \( \alpha_j = (1 + \sigma^2 g_j^2)^{-1} \).
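A short sequence-space experiment in Python/NumPy can contrast the projection weights with the smooth shrinkage weights (4.36); the decay of the true coefficients, the noise level, \( \beta = 1 \) and the grid of cut-offs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
p, sigma = 100, 0.1
theta_star = 1.0 / np.arange(1, p + 1) ** 2        # fast-decaying "smooth" coefficients
Z = theta_star + sigma * rng.standard_normal(p)    # sequence space model Z = theta* + xi

def risk(alpha):
    # risk formula of Exercise 4.7.9: sum (1 - alpha_j)^2 theta_j*^2 + sigma^2 sum alpha_j^2
    return np.sum((1 - alpha) ** 2 * theta_star ** 2) + sigma ** 2 * np.sum(alpha ** 2)

j = np.arange(1, p + 1)
for m in (2, 5, 10, 30):
    proj = (j <= m).astype(float)                        # projection: keep the first m coefficients
    smooth = np.clip(1 - j / m, 0.0, None) * (j <= m)    # weights (4.36) with beta = 1
    print(f"m = {m:2d}   projection risk = {risk(proj):.5f}"
          f"   shrinkage risk = {risk(smooth):.5f}"
          f"   realized projection loss = {np.sum((proj * Z - theta_star) ** 2):.5f}")
```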
4.7.5 Smoothness constraints and the roughness penalty approach

Another way of reducing the complexity of the estimation procedure is based on smoothness constraints. The notion of smoothness originates from regression estimation. A non-linear regression function \( f \) is expanded using a Fourier or some other functional basis, and \( \theta^* \) is the corresponding vector of coefficients. Smoothness properties of the regression function imply a certain rate of decay of the corresponding Fourier coefficients: the larger the frequency is, the less information about the regression function is contained in the related coefficient. This leads to the natural idea of replacing the original optimization problem over the whole parameter space by a constrained optimization over a subset of “smooth” parameter vectors. Here we consider one popular example of Sobolev smoothness constraints, which effectively means that the \( s \)-th derivative of the function \( f^* \) has a bounded \( L_2 \)-norm. A general Sobolev ball can be defined using a diagonal matrix \( G \):
\[
\mathcal{B}_G(R) \stackrel{\mathrm{def}}{=} \bigl\{\theta : \|G\theta\| \leq R\bigr\}.
\]
Now we consider a constrained ML problem:
\[
\widetilde{\theta}_{G,R} = \operatorname*{argmax}_{\theta \in \mathcal{B}_G(R)} L(\theta)
= \operatorname*{argmin}_{\theta \in \Theta\colon \|G\theta\| \leq R} \|Y - \Psi^{\top}\theta\|^2 . \tag{4.37}
\]
The Lagrange multiplier method leads to an unconstrained problem
\[
\widetilde{\theta}_{G,\lambda} = \operatorname*{argmin}_{\theta} \bigl\{\|Y - \Psi^{\top}\theta\|^2 + \lambda\|G\theta\|^2\bigr\}.
\]
A proper choice of \( \lambda \) ensures that the solution \( \widetilde{\theta}_{G,\lambda} \) belongs to \( \mathcal{B}_G(R) \) and also solves the problem (4.37). So the approach based on a Sobolev smoothness assumption leads back to regularization and shrinkage.
4.8 Shrinkage in a linear inverse problem

This section extends the previous approaches to the situation with indirect observations. More precisely, we focus on the model
\[
Y = Af^* + \varepsilon, \tag{4.38}
\]
where \( A \) is a given linear operator (matrix) and \( f^* \) is the target of analysis. With the obvious change of notation this problem can be put back into the general linear setup \( Y = \Psi^{\top}\theta + \varepsilon \). The special focus is due to the fact that the target can be high dimensional or even functional and that the product \( A^{\top}A \) is usually badly posed, so that its inversion is a hard task. Below we consider separately the case when the spectral representation for this problem is available and the general case.
4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates

Suppose that the eigenvectors of the matrix \( A^{\top}A \) are available. This allows for reducing the model to the spectral representation by an orthogonal change of the coordinate system: \( Z = \Lambda u + \Lambda^{1/2}\xi \) with a diagonal matrix \( \Lambda = \operatorname{diag}\{\lambda_1, \ldots, \lambda_p\} \) and a homogeneous noise \( \operatorname{Var}(\xi) = \sigma^2 I_p \); see Section 4.2.4. Below we assume without loss of generality that the eigenvalues \( \lambda_j \) are ordered and decrease with \( j \). This spectral representation means that one observes empirical Fourier coefficients \( z_j \) described by the equation \( z_j = \lambda_j u_j + \lambda_j^{1/2}\xi_j \) for \( j = 1, \ldots, p \). The LSE or qMLE estimate of the spectral parameter \( u \) is given by
\[
\widetilde{u} = \Lambda^{-1}Z = \bigl(\lambda_1^{-1}z_1, \ldots, \lambda_p^{-1}z_p\bigr)^{\top}.
\]

Exercise 4.8.1. Consider the spectral representation \( Z = \Lambda u + \Lambda^{1/2}\xi \). Check that the LSE \( \widetilde{u} \) reads as \( \widetilde{u} = \Lambda^{-1}Z \).

If the dimension \( p \) of the model is high or, specifically, if the spectral values \( \lambda_j \) rapidly go to zero, it might be useful to track only a few coefficients \( u_1, \ldots, u_m \) and to set all the remaining ones to zero. The corresponding estimate \( \widetilde{u}_m = (\widetilde{u}_{m,1}, \ldots, \widetilde{u}_{m,p})^{\top} \) reads as
\[
\widetilde{u}_{m,j} \stackrel{\mathrm{def}}{=}
\begin{cases}
\lambda_j^{-1}z_j & \text{if } j \leq m, \\
0 & \text{otherwise.}
\end{cases}
\]
It is usually referred to as a spectral cut-off estimate.

Exercise 4.8.2. Consider the linear model \( Y = Af^* + \varepsilon \). Let \( U \) be an orthogonal transform in \( \mathbb{R}^p \) providing \( UA^{\top}AU^{\top} = \Lambda \) with a diagonal matrix \( \Lambda \), leading to the spectral representation for \( Z = UAY \). Write the corresponding spectral cut-off estimate \( \widetilde{f}_m \) for the original vector \( f^* \). Show that computing this estimate only requires knowing the first \( m \) eigenvalues and eigenvectors of the matrix \( A^{\top}A \).

Similarly to the direct case, the spectral cut-off can be extended to spectral shrinkage: one multiplies every empirical coefficient \( z_j \) by a factor \( \alpha_j \in [0, 1] \). This leads to the spectral shrinkage estimate \( \widetilde{u}_{\alpha} \) with \( \widetilde{u}_{\alpha,j} = \alpha_j\lambda_j^{-1}z_j \). Here \( \alpha \) stands for the vector of coefficients \( \alpha_j \) for \( j = 1, \ldots, p \). The spectral cut-off method is a special case of this shrinkage with \( \alpha_j \) equal to one or zero.

Exercise 4.8.3. Specify the spectral shrinkage \( \widetilde{u}_{\alpha} \) with a given vector \( \alpha \) for the situation of Exercise 4.8.2.

The spectral cut-off method can be described as follows. Let \( \psi_1, \psi_2, \ldots \) be the intrinsic orthonormal basis of the problem composed of the standardized eigenvectors of \( A^{\top}A \) and leading to the spectral representation \( Z = \Lambda u + \Lambda^{1/2}\xi \) with the target vector \( u \). In terms of the original target \( f^* \), one is looking for a solution or an estimate in the form \( f = \sum_j u_j\psi_j \). The design orthogonality allows one to estimate every coefficient \( u_j \) independently of the others using the empirical Fourier coefficient \( \psi_j^{\top}Y \). Namely, \( \widetilde{u}_j = \lambda_j^{-1}\psi_j^{\top}Y = \lambda_j^{-1}z_j \). The LSE procedure tries to recover \( f^* \) as the full sum \( \widetilde{f} = \sum_j \widetilde{u}_j\psi_j \). The projection method suggests cutting this sum at the index \( m \): \( \widetilde{f}_m = \sum_{j \leq m} \widetilde{u}_j\psi_j \), while the shrinkage procedure is based on downweighting the empirical coefficients \( \widetilde{u}_j \): \( \widetilde{f}_{\alpha} = \sum_j \alpha_j\widetilde{u}_j\psi_j \).

Next we study the risk of the shrinkage method. Orthonormality of the basis \( \psi_j \) allows us to represent the loss as \( \|\widetilde{u}_{\alpha} - u^*\|^2 = \|\widetilde{f}_{\alpha} - f^*\|^2 \). Under the noise homogeneity one obtains the following result.

Theorem 4.8.1. Let \( Z = \Lambda u^* + \Lambda^{1/2}\xi \) with \( \operatorname{Var}(\xi) = \sigma^2 I_p \). It holds for the shrinkage estimate \( \widetilde{u}_{\alpha} \)
\[
R(\widetilde{u}_{\alpha}) \stackrel{\mathrm{def}}{=} \mathbb{E}\|\widetilde{u}_{\alpha} - u^*\|^2
= \sum_{j=1}^{p} |\alpha_j - 1|^2 {u^*_j}^2 + \sum_{j=1}^{p} \alpha_j^2\sigma^2\lambda_j^{-1}.
\]

Proof. The empirical Fourier coefficients \( z_j \) are uncorrelated with \( \mathbb{E} z_j = \lambda_j u^*_j \) and \( \operatorname{Var}(z_j) = \sigma^2\lambda_j \). This implies
\[
\mathbb{E}\|\widetilde{u}_{\alpha} - u^*\|^2
= \sum_{j=1}^{p} \mathbb{E}\bigl|\alpha_j\lambda_j^{-1}z_j - u^*_j\bigr|^2
= \sum_{j=1}^{p} \bigl\{|\alpha_j - 1|^2 {u^*_j}^2 + \alpha_j^2\sigma^2\lambda_j^{-1}\bigr\},
\]
as required.

Risk minimization leads to the oracle choice of the vector \( \alpha \):
\[
\alpha^* = \operatorname*{argmin}_{\alpha} R(\widetilde{u}_{\alpha}),
\]
where the minimum is taken over the set of all admissible vectors \( \alpha \).
A similar analysis can be done for the spectral cut-off method.

Exercise 4.8.4. The risk of the spectral cut-off estimate \( \widetilde{u}_m \) fulfills
\[
R(\widetilde{u}_m) = \sum_{j=1}^{m} \lambda_j^{-1}\sigma^2 + \sum_{j=m+1}^{p} {u^*_j}^2 .
\]
Specify the choice of the oracle cut-off index \( m^* \).
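The trade-off behind the oracle cut-off index is easy to visualize numerically. The sketch below (Python/NumPy; the polynomial decay of \( \lambda_j \) and \( u^*_j \) and the noise level are illustrative assumptions) evaluates the risk formula of Exercise 4.8.4 for all cut-off indices and reports its minimizer.

```python
import numpy as np

p, sigma = 50, 0.05
lam = 1.0 / np.arange(1, p + 1) ** 2          # decaying spectral values lambda_j
u_star = 1.0 / np.arange(1, p + 1)            # target Fourier coefficients

# risk of the spectral cut-off estimate (Exercise 4.8.4):
#   R(m) = sum_{j<=m} sigma^2 / lambda_j + sum_{j>m} u_j*^2
def risk(m):
    return sigma ** 2 * np.sum(1.0 / lam[:m]) + np.sum(u_star[m:] ** 2)

risks = np.array([risk(m) for m in range(1, p + 1)])
m_star = int(np.argmin(risks)) + 1
print("oracle cut-off m* =", m_star, "  with risk", round(risks[m_star - 1], 5))
print("risk at m = 1, 5, 20, 50:", [round(risk(m), 5) for m in (1, 5, 20, 50)])
```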
166
4 Estimation in linear models
4.8.2 Galerkin method
A general problem with the spectral shrinkage approach is that it requires precise knowledge of the intrinsic basis ψ_1, ψ_2, …, or equivalently of the eigenvalue decomposition of A leading to the spectral representation. After this basis is fixed, one can apply the projection or shrinkage method using the corresponding Fourier coefficients. In some situations this basis is hardly available or difficult to compute. A possible way out of this problem is to take some other orthogonal basis φ_1, φ_2, … which is tractable and convenient but does not lead to the spectral representation of the model. The Galerkin method is based on projecting the original high dimensional problem to a lower dimensional problem in terms of the new basis {φ_j}. Namely, without loss of generality suppose that the target function f^* can be decomposed as

  f^* = Σ_j u_j φ_j.

This can be achieved e.g. if f^* belongs to some Hilbert space and {φ_j} is an orthonormal basis in this space. Now we cut this sum and replace this exact decomposition by a finite approximation

  f^* ≈ f_m = Σ_{j≤m} u_j φ_j = Φ_m^⊤ u_m,

where u_m = (u_1, …, u_m)^⊤ and the matrix Φ_m is built of the vectors φ_1, …, φ_m: Φ_m = (φ_1, …, φ_m). Now we plug this decomposition into the original equation Y = Af^* + ε. This leads to the linear model Y = AΦ_m^⊤ u_m + ε = Ψ_m^⊤ u_m + ε with Ψ_m^⊤ = AΦ_m^⊤. The corresponding (quasi) MLE reads

  ũ_m = (Ψ_m Ψ_m^⊤)^{-1} Ψ_m Y.

Note that for computing this estimate one only needs to evaluate the action of the operator A on the basis functions φ_1, …, φ_m and on the data Y. With this estimate ũ_m of the vector u^*, one obtains the response estimate f̃_m of the form

  f̃_m = Φ_m^⊤ ũ_m = ũ_1 φ_1 + … + ũ_m φ_m.

The properties of this estimate can be studied in the same way as for a general qMLE in a linear model: the true data distribution follows (4.38) while we use the approximating model Y = Af_m^* + ε with ε ∼ N(0, σ²I) for building the quasi likelihood.

A further extension of the qMLE approach concerns the case when the operator A is not precisely known. Instead, an approximation or an estimate Ã is available. The pragmatic way of tackling this problem is to use the model Y = Ãf_m^* + ε for building the quasi likelihood. The use of the Galerkin method is quite natural in this situation because the spectral representation for Ã will not necessarily result in a similar representation for the true operator A.
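The following toy sketch (my own setup, not from the book) illustrates the Galerkin-type computation for a discretized inverse problem Y = Af^* + ε; the operator A, the cosine basis and all sizes are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 200, 8, 0.01
t = np.linspace(0, 1, n)
A = np.exp(-5 * np.abs(t[:, None] - t[None, :])) / n       # smoothing operator (assumed)
f_star = np.cos(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)
Y = A @ f_star + sigma * rng.standard_normal(n)

Phi = np.array([np.cos(np.pi * j * t) for j in range(m)])  # basis phi_1..phi_m, shape m x n
Psi_T = A @ Phi.T                                          # Psi_m^T = A Phi_m^T (action of A on the basis)
u_tilde = np.linalg.solve(Psi_T.T @ Psi_T, Psi_T.T @ Y)    # (Psi_m Psi_m^T)^{-1} Psi_m Y
f_tilde = Phi.T @ u_tilde                                  # response estimate f_m
print("relative error:", np.linalg.norm(f_tilde - f_star) / np.linalg.norm(f_star))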
4.9 Semiparametric estimation
This section discusses the situation when the target of estimation does not
coincide with the parameter vector. This problem is usually referred to as
semiparametric estimation. One typical example is the problem of estimating
a part of the parameter vector. More generally one can try to estimate a
given function/functional of the unknown parameter. We focus here on linear
modeling, that is, the considered model and the considered mapping of the
parameter space to the target space are linear. For the ease of presentation
we assume everywhere the homogeneous noise with Var(ε) = σ 2 In .
4.9.1 (θ, η) - and υ -setup
This section presents two equivalent descriptions of the semiparametric problem. The first one assumes that the total parameter vector can be decomposed
into the target parameter θ and the nuisance parameter η . The second one
operates with the total parameter υ and the target θ is a linear mapping of
υ.
We start with the (θ, η)-setup. Let the response Y be modeled in dependence of two sets of factors: {ψ_j, j = 1, …, p} and {φ_m, m = 1, …, p_1}. We are mostly interested in understanding the impact of the first set {ψ_j} but we cannot ignore the influence of the {φ_m}'s; otherwise the model would be incomplete. This situation can be described by the linear model

  Y = Ψ^⊤θ^* + Φ^⊤η^* + ε,          (4.39)

where Ψ is the p×n matrix with the columns ψ_j, while Φ is the p_1×n matrix with the columns φ_m. We primarily aim at recovering the vector θ^*, while the coefficients η^* are of secondary importance. The corresponding (quasi) log-likelihood reads as

  L(θ, η) = −(2σ²)^{-1} ‖Y − Ψ^⊤θ − Φ^⊤η‖² + R,

where R denotes the remainder term which does not depend on the parameters θ, η. The more general υ-setup considers a general linear model

  Y = Υ^⊤υ^* + ε,          (4.40)

where Υ is a p^*×n matrix of p^* factors, and the target of estimation is a linear mapping θ^* = Pυ^* for a given operator P from IR^{p^*} to IR^p. Obviously the (θ, η)-setup is a special case of the υ-setup. However, a general υ-setup can be reduced back to the (θ, η)-setup by a change of variables.

Exercise 4.9.1. Consider the sequence space model Y = υ^* + ξ in IR^p and let the target of estimation be the sum of the coefficients υ_1^* + … + υ_p^*. Describe the υ-setup for the problem. Reduce it to the (θ, η)-setup by an orthogonal change of the basis.

In the υ-setup, the (quasi) log-likelihood reads as

  L(υ) = −(2σ²)^{-1} ‖Y − Υ^⊤υ‖² + R,

where R is the remainder which does not depend on υ. This implies quadraticity of the log-likelihood L(υ), with

  𝒟² = −∇²L(υ) = Var{∇L(υ)} = σ^{-2}ΥΥ^⊤.          (4.41)

Exercise 4.9.2. Check the statements in (4.41).

Exercise 4.9.3. Show that for the model (4.39) it holds with Υ = \begin{pmatrix} Ψ \\ Φ \end{pmatrix}

  𝒟² = σ^{-2}ΥΥ^⊤ = σ^{-2}\begin{pmatrix} ΨΨ^⊤ & ΨΦ^⊤ \\ ΦΨ^⊤ & ΦΦ^⊤ \end{pmatrix},
  ∇L(υ^*) = σ^{-2}Υε = σ^{-2}\begin{pmatrix} Ψε \\ Φε \end{pmatrix}.          (4.42)
4.9.2 Orthogonality and product structure
Consider the model (4.39) under the orthogonality condition ΨΦ^⊤ = 0. This condition effectively means that the factors of interest {ψ_j} are orthogonal to the nuisance factors {φ_m}. An important feature of this orthogonal case is that the model has a product structure leading to the additive form of the log-likelihood. Consider the partial θ-model Y = Ψ^⊤θ + ε with the (quasi) log-likelihood

  L(θ) = −(2σ²)^{-1} ‖Y − Ψ^⊤θ‖² + R.

Similarly, L_1(η) = −(2σ²)^{-1} ‖Y − Φ^⊤η‖² + R_1 denotes the log-likelihood in the partial η-model Y = Φ^⊤η + ε.

Theorem 4.9.1. Assume the condition ΨΦ^⊤ = 0. Then

  L(θ, η) = L(θ) + L_1(η) + R(Y),          (4.43)

where R(Y) is independent of θ and η. This implies the block diagonal structure of the matrix 𝒟² = σ^{-2}ΥΥ^⊤:

  𝒟² = σ^{-2}\begin{pmatrix} ΨΨ^⊤ & 0 \\ 0 & ΦΦ^⊤ \end{pmatrix} = \begin{pmatrix} D² & 0 \\ 0 & H² \end{pmatrix},          (4.44)

with D² = σ^{-2}ΨΨ^⊤, H² = σ^{-2}ΦΦ^⊤.

Proof. The formula (4.44) follows from (4.42) and the orthogonality condition. The statement (4.43) follows if we show that the difference

  L(θ, η) − L(θ) − L_1(η)

does not depend on θ and η. This is a quadratic expression in (θ, η), so it suffices to check its first and second derivatives w.r.t. the parameters. For the first derivative, it holds by the orthogonality condition

  ∇_θ L(θ, η) := ∂L(θ, η)/∂θ = σ^{-2}Ψ(Y − Ψ^⊤θ − Φ^⊤η) = σ^{-2}Ψ(Y − Ψ^⊤θ),

which coincides with ∇L(θ). Similarly ∇_η L(θ, η) = ∇L_1(η), yielding

  ∇{L(θ, η) − L(θ) − L_1(η)} = 0.

The identities (4.41) and (4.42) imply that ∇²{L(θ, η) − L(θ) − L_1(η)} = 0. This implies the desired assertion.

Exercise 4.9.4. Check the statement (4.43) by direct computations. Describe the term R(Y).
Now we demonstrate how the general case can be reduced to the orthogonal one by a linear transformation of the nuisance parameter. Let C be a p×p_1 matrix. Define η̆ = η + C^⊤θ. Then the model equation Y = Ψ^⊤θ + Φ^⊤η + ε can be rewritten as

  Y = Ψ^⊤θ + Φ^⊤(η̆ − C^⊤θ) + ε = (Ψ − CΦ)^⊤θ + Φ^⊤η̆ + ε.

Now we select C to ensure the orthogonality. This leads to the equation

  (Ψ − CΦ)Φ^⊤ = 0,

or C = ΨΦ^⊤(ΦΦ^⊤)^{-1}. So, the original model can be rewritten as

  Y = Ψ̆^⊤θ + Φ^⊤η̆ + ε,    Ψ̆ = Ψ − CΦ = Ψ(I − Π_η),          (4.45)

where Π_η = Φ^⊤(ΦΦ^⊤)^{-1}Φ is the projector on the linear subspace spanned by the nuisance factors {φ_m}. This construction has a natural interpretation: correcting the θ-factors ψ_1, …, ψ_p by removing their interaction with the nuisance factors φ_1, …, φ_{p_1} reduces the general case to the orthogonal one. We summarize:
Theorem 4.9.2. The linear model (4.39) can be represented in the orthogonal form

  Y = Ψ̆^⊤θ + Φ^⊤η̆ + ε,

where Ψ̆ from (4.45) satisfies Ψ̆Φ^⊤ = 0 and η̆ = η + C^⊤θ for C = ΨΦ^⊤(ΦΦ^⊤)^{-1}. Moreover, it holds for υ = (θ, η)

  L(υ) = L̆(θ) + L_1(η̆) + R(Y)          (4.46)

with

  L̆(θ) = −(2σ²)^{-1} ‖Y − Ψ̆^⊤θ‖² + R,    L_1(η) = −(2σ²)^{-1} ‖Y − Φ^⊤η‖² + R_1.

Exercise 4.9.5. Show that for C = ΨΦ^⊤(ΦΦ^⊤)^{-1}

  ∇L̆(θ) = ∇_θ L(υ) − C∇_η L(υ).

Exercise 4.9.6. Show that the remainder term R(Y) in the equation (4.46) is the same as in the orthogonal case (4.43).

Exercise 4.9.7. Show that Ψ̆Ψ̆^⊤ < ΨΨ^⊤ if ΨΦ^⊤ ≠ 0.
Hint: for any vector γ ∈ IR^p, it holds with h = Ψ^⊤γ

  γ^⊤Ψ̆Ψ̆^⊤γ = ‖(Ψ − ΨΠ_η)^⊤γ‖² = ‖h − Π_η h‖² ≤ ‖h‖².

Moreover, equality here for every γ is only possible if

  Π_η h = Φ^⊤(ΦΦ^⊤)^{-1}ΦΨ^⊤γ ≡ 0,

that is, if ΦΨ^⊤ = 0.
4.9.3 Partial estimation
This section explains the important notion of partial estimation which is quite natural and transparent in the (θ, η)-setup. Let some value η◦ of the nuisance parameter be fixed. A particular case of this sort is just ignoring the factors {φ_m} corresponding to the nuisance component, that is, one uses η◦ ≡ 0. This approach is reasonable in certain situations, e.g. in the context of the projection method or spectral cut-off.

Define the estimate θ̃(η◦) by partial optimization of the joint log-likelihood L(θ, η◦) w.r.t. the first parameter θ:

  θ̃(η◦) = argmax_θ L(θ, η◦).

Obviously θ̃(η◦) is the MLE in the residual model Y − Φ^⊤η◦ = Ψ^⊤θ^* + ε:

  θ̃(η◦) = (ΨΨ^⊤)^{-1}Ψ(Y − Φ^⊤η◦).

This allows for describing the properties of the partial estimate θ̃(η◦) similarly to the usual parametric situation.

Theorem 4.9.3. Consider the model (4.39). Then the partial estimate θ̃(η◦) fulfills

  IE θ̃(η◦) = θ^* + (ΨΨ^⊤)^{-1}ΨΦ^⊤(η^* − η◦),    Var θ̃(η◦) = σ²(ΨΨ^⊤)^{-1}.

In words, θ̃(η◦) has the same variance as the MLE in the partial model Y = Ψ^⊤θ^* + ε but it is biased if ΨΦ^⊤(η^* − η◦) ≠ 0. The ideal situation corresponds to the case η◦ = η^*. Then θ̃(η^*) is the MLE in the correctly specified θ-model: with Y(η^*) := Y − Φ^⊤η^*,

  Y(η^*) = Ψ^⊤θ^* + ε.

An interesting and natural question is the legitimation of the partial estimation method: under which conditions is it justified and free of estimation bias? The answer is given by Theorem 4.9.1: the orthogonality condition ΨΦ^⊤ = 0 ensures the desired feature because of the decomposition (4.43).

Theorem 4.9.4. Assume orthogonality ΨΦ^⊤ = 0. Then the partial estimate θ̃(η◦) does not depend on the nuisance value η◦ used:

  θ̃ = θ̃(η◦) = θ̃(η^*) = (ΨΨ^⊤)^{-1}ΨY.

In particular, one can ignore the nuisance parameter and estimate θ^* from the partial incomplete model Y = Ψ^⊤θ^* + ε.

Exercise 4.9.8. Check that the partial derivative ∂L(θ, η)/∂θ does not depend on η under the orthogonality condition.

The partial estimation can also be considered in the context of estimating the nuisance parameter η by inverting the roles of θ and η. Namely, given a fixed value θ◦, one can optimize the joint log-likelihood L(θ, η) w.r.t. the second argument η, leading to the estimate

  η̃(θ◦) := argmax_η L(θ◦, η).

In the orthogonal situation the initial point θ◦ is not important and one can use the partial incomplete model Y = Φ^⊤η^* + ε.
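A small simulation sketch (my own toy design and Gaussian errors, not from the book) illustrating Theorem 4.9.3: the partial estimate θ̃(η◦) has the variance σ²(ΨΨ^⊤)^{-1} of the partial model but carries the bias (ΨΨ^⊤)^{-1}ΨΦ^⊤(η^* − η◦) when the two groups of factors are not orthogonal.

import numpy as np

rng = np.random.default_rng(2)
n, p, p1, sigma = 200, 3, 2, 1.0
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
theta_star = np.array([1.0, -2.0, 0.5])
eta_star = np.array([3.0, -1.0])
eta0 = np.zeros(p1)                               # partial estimation ignoring the nuisance

G = np.linalg.inv(Psi @ Psi.T)
bias_theory = G @ Psi @ Phi.T @ (eta_star - eta0)

est = []
for _ in range(5000):
    Y = Psi.T @ theta_star + Phi.T @ eta_star + sigma * rng.standard_normal(n)
    est.append(G @ Psi @ (Y - Phi.T @ eta0))      # theta-tilde(eta0)
est = np.array(est)
print("empirical bias:", est.mean(axis=0) - theta_star, " predicted:", bias_theory)
print("empirical var :", np.diag(np.cov(est.T)), " predicted:", np.diag(sigma**2 * G))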
4.9.4 Profile estimation
This section discusses a general profile likelihood method of estimating the target parameter θ in the semiparametric situation. Later we show its optimality and R-efficiency. The method suggests to first estimate the entire parameter vector υ by the (quasi) ML method. Then the operator P is applied to the obtained estimate υ̃ to produce the estimate θ̃. One can describe this method as

  υ̃ = argmax_υ L(υ),    θ̃ = Pυ̃.          (4.47)

The first step here is the usual LS estimation of υ^* in the linear model (4.40):

  υ̃ = arginf_υ ‖Y − Υ^⊤υ‖² = (ΥΥ^⊤)^{-1}ΥY.

The estimate θ̃ is obtained by applying P to υ̃:

  θ̃ = Pυ̃ = P(ΥΥ^⊤)^{-1}ΥY = SY          (4.48)

with S = P(ΥΥ^⊤)^{-1}Υ. The properties of this estimate can be studied using the decomposition Y = f^* + ε with f^* = IE Y; cf. Section 4.4. In particular, it holds

  IE θ̃ = Sf^*,    Var(θ̃) = S Var(ε) S^⊤.          (4.49)

If the noise ε is homogeneous with Var(ε) = σ²I_n, then

  Var(θ̃) = σ²SS^⊤ = σ²P(ΥΥ^⊤)^{-1}P^⊤.          (4.50)

The next theorem summarizes our findings.

Theorem 4.9.5. Consider the model (4.40) with homogeneous errors Var(ε) = σ²I_n. The profile MLE θ̃ follows (4.48). Its mean and variance are given by (4.49) and (4.50).

The profile MLE is usually written in the (θ, η)-setup. Let υ = (θ, η). Then the target θ is obtained by projecting the MLE (θ̃, η̃) onto the θ-coordinates. This procedure can be formalized as

  θ̃ = argmax_θ max_η L(θ, η).

Another way of describing the profile MLE is based on the partial optimization considered in the previous section. Define for each θ the value L̆(θ) by optimizing the log-likelihood L(υ) under the constraint Pυ = θ:

  L̆(θ) := sup_{υ: Pυ=θ} L(υ) = sup_η L(θ, η).          (4.51)

Then θ̃ is defined by maximizing the partial fit L̆(θ):

  θ̃ := argmax_θ L̆(θ).          (4.52)
Exercise 4.9.9. Check that (4.47) and (4.52) lead to the same estimate θ̃.

We use for the function L̆(θ) obtained by partial optimization (4.51) the same notation as for the function obtained by the orthogonal decomposition (4.43) in Section 4.9.2. Later we show that these two functions indeed coincide. This helps in understanding the structure of the profile estimate θ̃.

Consider first the orthogonal case ΨΦ^⊤ = 0. This assumption greatly simplifies the study. In particular, the result of Theorem 4.9.4 for partial estimation can obviously be extended to the profile method in view of the product structure (4.43): when estimating the parameter θ, one can ignore the nuisance parameter η and proceed as if the partial model Y = Ψ^⊤θ^* + ε were correct. Theorem 4.9.1 implies:

Theorem 4.9.6. Assume that ΨΦ^⊤ = 0 in the model (4.39). Then the profile MLE θ̃ from (4.52) coincides with the MLE in the partial model Y = Ψ^⊤θ^* + ε:

  θ̃ = argmax_θ L(θ) = argmin_θ ‖Y − Ψ^⊤θ‖² = (ΨΨ^⊤)^{-1}ΨY.

It holds IE θ̃ = θ^* and

  θ̃ − θ^* = D^{-2}ζ = D^{-1}ξ

with D² = σ^{-2}ΨΨ^⊤, ζ = σ^{-2}Ψε, and ξ = D^{-1}ζ. Finally, L̆(θ) from (4.51) fulfills

  2{L̆(θ̃) − L̆(θ^*)} = ‖D(θ̃ − θ^*)‖² = ζ^⊤D^{-2}ζ = ‖ξ‖².

Moreover, if Var(ε) = σ²I_n, then Var(ξ) = I_p. If ε ∼ N(0, σ²I_n), then ξ is standard normal in IR^p.
The general case can be reduced to the orthogonal one by the construction from Theorem 4.9.2. Let

  Ψ̆ = Ψ − ΨΠ_η = Ψ − ΨΦ^⊤(ΦΦ^⊤)^{-1}Φ

be the corrected Ψ-factors after removing their interactions with the Φ-factors.
Theorem 4.9.7. Consider the model (4.39), and let the matrix D̆² = σ^{-2}Ψ̆Ψ̆^⊤ be non-degenerate. Then the profile MLE θ̃ reads as

  θ̃ = argmin_θ ‖Y − Ψ̆^⊤θ‖² = (Ψ̆Ψ̆^⊤)^{-1}Ψ̆Y.          (4.53)

It holds IE θ̃ = θ^* and

  θ̃ − θ^* = (Ψ̆Ψ̆^⊤)^{-1}Ψ̆ε = D̆^{-2}ζ̆ = D̆^{-1}ξ̆          (4.54)

with D̆² = σ^{-2}Ψ̆Ψ̆^⊤, ζ̆ = σ^{-2}Ψ̆ε, and ξ̆ = D̆^{-1}ζ̆. Furthermore, L̆(θ) from (4.51) fulfills

  2{L̆(θ̃) − L̆(θ^*)} = ‖D̆(θ̃ − θ^*)‖² = ζ̆^⊤D̆^{-2}ζ̆ = ‖ξ̆‖².          (4.55)

Moreover, if Var(ε) = σ²I_n, then Var(ξ̆) = I_p. If ε ∼ N(0, σ²I_n), then ξ̆ ∼ N(0, I_p).

Finally we present the same result in terms of the original log-likelihood L(υ).

Theorem 4.9.8. Write 𝒟² = −∇²IE L(υ) for the model (4.39) in the block form

  𝒟² = \begin{pmatrix} D² & A \\ A^⊤ & H² \end{pmatrix}.          (4.56)

Let D² and H² be invertible. Then D̆² and ζ̆ in (4.54) can be represented as

  D̆² = D² − AH^{-2}A^⊤,    ζ̆ = ∇_θ L(υ^*) − AH^{-2}∇_η L(υ^*).

Proof. In view of Theorem 4.9.7, it suffices to check the formulas for D̆² and ζ̆. One has for Ψ̆ = Ψ(I_n − Π_η) and A = σ^{-2}ΨΦ^⊤

  D̆² = σ^{-2}Ψ̆Ψ̆^⊤ = σ^{-2}Ψ(I_n − Π_η)Ψ^⊤ = σ^{-2}ΨΨ^⊤ − σ^{-2}ΨΦ^⊤(ΦΦ^⊤)^{-1}ΦΨ^⊤ = D² − AH^{-2}A^⊤.

Similarly, by AH^{-2} = ΨΦ^⊤(ΦΦ^⊤)^{-1}, ∇_θ L(υ^*) = σ^{-2}Ψε, and ∇_η L(υ^*) = σ^{-2}Φε,

  ζ̆ = σ^{-2}Ψ̆ε = σ^{-2}Ψε − ΨΦ^⊤(ΦΦ^⊤)^{-1}σ^{-2}Φε = ∇_θ L(υ^*) − AH^{-2}∇_η L(υ^*)

as required.
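The identity behind (4.53) can be checked numerically. The sketch below (toy Gaussian design, my own choice) verifies that the profile MLE Pυ̃, i.e. the θ-block of the full LSE, coincides with the estimate built from the corrected factors Ψ̆ = Ψ(I − Π_η).

import numpy as np

rng = np.random.default_rng(3)
n, p, p1 = 50, 3, 4
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Y = rng.standard_normal(n)                          # the identity holds for any data vector

Upsilon = np.vstack([Psi, Phi])                     # full design, upsilon = (theta, eta)
ups_hat = np.linalg.solve(Upsilon @ Upsilon.T, Upsilon @ Y)
theta_profile = ups_hat[:p]                         # P = projector onto the theta-coordinates

Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)  # projector on the nuisance factors
Psi_b = Psi @ (np.eye(n) - Pi_eta)                  # corrected Psi-factors, (4.45)
theta_corrected = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)   # formula (4.53)
print(np.allclose(theta_profile, theta_corrected))  # True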
It is worth stressing again that the result of Theorems 4.9.6 through 4.9.8 is purely geometric. We only used the condition IE ε = 0 in the model (4.39) and the quadratic structure of the log-likelihood function L(υ). The distribution of the vector ε does not enter the results and proofs. However, the representation (4.54) allows for a straightforward analysis of the probabilistic properties of the estimate θ̃.

Theorem 4.9.9. Consider the model (4.39) and let Var(Y) = Var(ε) = Σ_0. Then

  Var(θ̃) = σ^{-4}D̆^{-2}Ψ̆Σ_0Ψ̆^⊤D̆^{-2},    Var(ξ̆) = σ^{-4}D̆^{-1}Ψ̆Σ_0Ψ̆^⊤D̆^{-1}.

In particular, if Var(Y) = σ²I_n, this implies that

  Var(θ̃) = D̆^{-2},    Var(ξ̆) = I_p.

Exercise 4.9.10. Check the result of Theorem 4.9.9. Specify this result for the orthogonal case ΨΦ^⊤ = 0.
4.9.5 Semiparametric efficiency bound
The main goal of this section is to show that the profile method in semiparametric estimation leads to R-efficient procedures. Recall that the target of estimation is θ^* = Pυ^* for a given linear mapping P. The profile MLE θ̃ is a natural candidate. The next result claims its optimality.

Theorem 4.9.10 (Gauss–Markov). Let Y follow Y = Υ^⊤υ^* + ε for homogeneous errors ε. Then the estimate θ̃ of θ^* = Pυ^* from (4.48) is unbiased and

  Var(θ̃) = σ²P(ΥΥ^⊤)^{-1}P^⊤

yielding

  IE‖θ̃ − θ^*‖² = σ² tr{P(ΥΥ^⊤)^{-1}P^⊤}.

Moreover, this risk is minimal in the class of all unbiased linear estimates of θ^*.
Proof. The statements about the properties of θ̃ have already been proved. The lower bound can be proved by the same arguments as in the case of MLE estimation in Section 4.4.3. We only outline the main steps. Let θ̂ be any unbiased linear estimate of θ^*. The idea is to show that the difference θ̂ − θ̃ is orthogonal to θ̃ in the sense IE{(θ̂ − θ̃)θ̃^⊤} = 0. This implies that the variance of θ̂ is the sum of Var(θ̃) and Var(θ̂ − θ̃) and therefore larger than Var(θ̃).

Let θ̂ = BY for some matrix B. Then IE θ̂ = B IE Y = BΥ^⊤υ^*. The no-bias property yields the identity IE θ̂ = θ^* = Pυ^* and thus

  BΥ^⊤ − P = 0.          (4.57)

Next, IE θ̃ = IE θ̂ = θ^* and thus

  IE θ̃θ̃^⊤ = θ^*θ^{*⊤} + Var(θ̃),
  IE θ̂θ̃^⊤ = θ^*θ^{*⊤} + IE(θ̂ − IE θ̂)(θ̃ − IE θ̃)^⊤.

Obviously θ̂ − IE θ̂ = Bε and θ̃ − IE θ̃ = Sε, yielding Var(θ̃) = σ²SS^⊤ and IE{Bε(Sε)^⊤} = σ²BS^⊤. So

  IE(θ̂ − θ̃)θ̃^⊤ = σ²(B − S)S^⊤.

The identity (4.57) implies

  (B − S)S^⊤ = {B − P(ΥΥ^⊤)^{-1}Υ}Υ^⊤(ΥΥ^⊤)^{-1}P^⊤ = (BΥ^⊤ − P)(ΥΥ^⊤)^{-1}P^⊤ = 0

and the result follows.

Now we specify the efficiency bound for the (θ, η)-setup (4.39). In this case P is just the projector onto the θ-coordinates.
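A hedged numerical sketch (toy design of my own choice) of the Gauss–Markov statement in Theorem 4.9.10: among unbiased linear estimates of θ^* = Pυ^*, the profile/LS estimate SY has the smallest quadratic risk. As a competitor we take an unbiased weighted LS estimate with arbitrary positive weights.

import numpy as np

rng = np.random.default_rng(4)
n, p_star, p, sigma = 100, 6, 2, 1.0
Upsilon = rng.standard_normal((p_star, n))
P = np.hstack([np.eye(p), np.zeros((p, p_star - p))])    # target: first p coordinates

S = P @ np.linalg.solve(Upsilon @ Upsilon.T, Upsilon)    # profile estimate theta = S Y
W = np.diag(rng.uniform(0.2, 5.0, size=n))               # arbitrary positive weights
B = P @ np.linalg.solve(Upsilon @ W @ Upsilon.T, Upsilon @ W)   # competing unbiased estimate

print("competitor unbiased:", np.allclose(B @ Upsilon.T, P))    # (4.57): B Upsilon^T = P
print("risk of profile LS :", sigma**2 * np.trace(S @ S.T))
print("risk of weighted LS:", sigma**2 * np.trace(B @ B.T))     # never smaller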
4.9.6 Inference for the profile likelihood approach
This section discusses the construction of confidence and concentration sets for the profile ML estimation. The key fact behind this construction is the chi-squared result which extends without any change from the parametric to the semiparametric framework.

The definition (4.52) of θ̃ suggests to define a CS for θ^* as a level set of L̆(θ) = sup_{υ: Pυ=θ} L(υ):

  E(z) := {θ : L̆(θ̃) − L̆(θ) ≤ z}.

This definition can be rewritten as

  E(z) := {θ : sup_υ L(υ) − sup_{υ: Pυ=θ} L(υ) ≤ z}.

Obviously, the unconstrained maximum of the log-likelihood L(υ) w.r.t. υ is not smaller than the maximum under the constraint Pυ = θ. The point θ belongs to E(z) if the difference between these two values does not exceed z. As usual, the main question is the choice of a value z which ensures the prescribed coverage probability of θ^*. This naturally leads to studying the deviation probability

  IP( sup_υ L(υ) − sup_{υ: Pυ=θ^*} L(υ) > z ).

Such a study is especially simple in the orthogonal case. The answer can be expected: the expression and the value are exactly the same as in the case without any nuisance parameter η; the nuisance simply has no impact. In particular, the chi-squared result still holds.
In this section we follow the line and the notation of Section 4.9.4. In particular, we use the block notation (4.56) for the matrix 𝒟² = −∇²L(υ).

Theorem 4.9.11. Consider the model (4.39). Let the matrix 𝒟² be non-degenerate. If ε ∼ N(0, σ²I_n), then

  2{L̆(θ̃) − L̆(θ^*)} ∼ χ²_p,          (4.58)

that is, 2{L̆(θ̃) − L̆(θ^*)} is chi-squared with p degrees of freedom.

Proof. The result is based on the representation (4.55): 2{L̆(θ̃) − L̆(θ^*)} = ‖ξ̆‖² from Theorem 4.9.7. It remains to note that normality of ε implies normality of ξ̆, and the moment conditions IE ξ̆ = 0, Var(ξ̆) = I_p imply (4.58).

This result means that the chi-squared result continues to hold in the general semiparametric framework as well. One possible explanation is as follows: it applies in the orthogonal case, and the general situation can be reduced to the orthogonal case by a change of coordinates which preserves the value of the maximum likelihood.

The statement (4.58) of Theorem 4.9.11 has an interesting geometric interpretation which is often used in analysis of variance. Consider the expansion

  L̆(θ̃) − L̆(θ^*) = {L̆(θ̃) − L(θ^*, η^*)} − {L̆(θ^*) − L(θ^*, η^*)}.

The quantity L_1 := L̆(θ̃) − L(υ^*) coincides with L(υ̃, υ^*); see (4.47). Thus, 2L_1 is chi-squared with p^* degrees of freedom by the chi-squared result. Moreover, 2σ²L(υ̃, υ^*) = ‖Π_υ ε‖², where Π_υ = Υ^⊤(ΥΥ^⊤)^{-1}Υ is the projector on the linear subspace spanned by the joint collection of factors {ψ_j} and {φ_m}. Similarly, the quantity L_2 := L̆(θ^*) − L(θ^*, η^*) = sup_η L(θ^*, η) − L(θ^*, η^*) is the maximum likelihood in the partial η-model. Therefore, 2L_2 is also chi-squared distributed, with p_1 degrees of freedom, and 2σ²L_2 = ‖Π_η ε‖², where Π_η = Φ^⊤(ΦΦ^⊤)^{-1}Φ is the projector on the linear subspace spanned by the η-factors {φ_m}. Now we use the decomposition Π_υ = Π_η + (Π_υ − Π_η), in which Π_υ − Π_η is also a projector, on a subspace of dimension p. This explains the result (4.58): the difference of these two quantities is chi-squared with p = p^* − p_1 degrees of freedom. The above consideration leads to the following result.

Theorem 4.9.12. It holds for the model (4.39) with Π̆_θ = Π_υ − Π_η

  2L̆(θ̃) − 2L̆(θ^*) = σ^{-2}{‖Π_υ ε‖² − ‖Π_η ε‖²} = σ^{-2}‖Π̆_θ ε‖² = σ^{-2}ε^⊤Π̆_θ ε.          (4.59)
Exercise 4.9.11. Check the formula (4.59). Show that it implies (4.58).
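A Monte-Carlo sketch (toy Gaussian design, all sizes are my assumptions) of the semiparametric chi-squared result (4.58): twice the profile log-likelihood difference behaves like χ²_p even in the presence of a p_1-dimensional nuisance.

import numpy as np

rng = np.random.default_rng(5)
n, p, p1, sigma = 120, 3, 5, 1.0
Psi, Phi = rng.standard_normal((p, n)), rng.standard_normal((p1, n))
Upsilon = np.vstack([Psi, Phi])
theta_star, eta_star = np.ones(p), rng.standard_normal(p1)

def breve_L(theta, Y):
    # L-breve(theta) = sup_eta L(theta, eta): fit the residual by the eta-model
    r = Y - Psi.T @ theta
    eta_hat = np.linalg.solve(Phi @ Phi.T, Phi @ r)
    return -np.sum((r - Phi.T @ eta_hat) ** 2) / (2 * sigma**2)

stats = []
for _ in range(4000):
    Y = Psi.T @ theta_star + Phi.T @ eta_star + sigma * rng.standard_normal(n)
    ups_hat = np.linalg.solve(Upsilon @ Upsilon.T, Upsilon @ Y)
    stats.append(2 * (breve_L(ups_hat[:p], Y) - breve_L(theta_star, Y)))
print("mean, var:", np.mean(stats), np.var(stats), "(chi^2_3 has mean 3, variance 6)")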
4.9.7 Plug-in method
Although the profile MLE can be represented in a closed form, its computation can be a hard task if the dimensionality p_1 of the nuisance parameter is high. Here we discuss an approach which simplifies the computations but leads to a suboptimal solution.

We start with the approach called plug-in. It is based on the assumption that a pilot estimate η̂ of the nuisance parameter η is available. Then one obtains the estimate θ̂ of the target θ^* from the residuals Y − Φ^⊤η̂. This means that the residual vector Ŷ = Y − Φ^⊤η̂ is used as observations and the estimate θ̂ is defined as the best fit to such observations in the θ-model:

  θ̂ = argmin_θ ‖Ŷ − Ψ^⊤θ‖² = (ΨΨ^⊤)^{-1}ΨŶ.          (4.60)

A very particular case of the plug-in method is partial estimation from Section 4.9.3 with η̂ ≡ η◦.

The plug-in method can be naturally described in the context of partial estimation. We use the following representation of the plug-in method: θ̂ = θ̃(η̂).

Exercise 4.9.12. Check the identity θ̂ = θ̃(η̂) for the plug-in method. Describe the plug-in estimate for η̂ ≡ 0.

The behavior of θ̂ heavily depends upon the quality of the pilot η̂. A detailed study is complicated and a closed form solution is only available for the special case of a linear pilot estimate. Let η̂ = AY. Then (4.60) implies

  θ̂ = (ΨΨ^⊤)^{-1}Ψ(Y − Φ^⊤AY) = SY

with S = (ΨΨ^⊤)^{-1}Ψ(I_n − Φ^⊤A). This is a linear estimate whose properties can be studied in the usual way.
4.9.8 Two step procedure
The ideas of partial and plug-in estimation can be combined yielding the so-called two step procedures. One starts with an initial guess θ◦ for the target θ^*. A very special choice is θ◦ ≡ 0. This leads to the partial η-model Y(θ◦) = Φ^⊤η + ε for the residuals Y(θ◦) = Y − Ψ^⊤θ◦. Next compute the partial MLE η̃(θ◦) = (ΦΦ^⊤)^{-1}ΦY(θ◦) in this model and use it as a pilot for the plug-in method: compute the residuals

  Ŷ(θ◦) = Y − Φ^⊤η̃(θ◦) = Y − Π_η Y(θ◦)

with Π_η = Φ^⊤(ΦΦ^⊤)^{-1}Φ, and then estimate the target parameter θ by fitting Ψ^⊤θ to the residuals Ŷ(θ◦). This method results in the estimate

  θ̂(θ◦) = (ΨΨ^⊤)^{-1}ΨŶ(θ◦).          (4.61)

A simple comparison with the formula (4.53) reveals that the pragmatic two step approach is suboptimal: the resulting estimate does not coincide with the profile MLE θ̃ unless we have an orthogonal situation with ΨΠ_η = 0. In particular, the estimate θ̂(θ◦) from (4.61) is biased.

Exercise 4.9.13. Consider the orthogonal case with ΨΦ^⊤ = 0. Show that the two step estimate θ̂(θ◦) coincides with the partial MLE θ̃ = (ΨΨ^⊤)^{-1}ΨY.

Exercise 4.9.14. Compute the mean of θ̂(θ◦). Show that there exists some θ^* such that IE θ̂(θ◦) ≠ θ^* unless the orthogonality condition ΨΦ^⊤ = 0 is fulfilled.

Exercise 4.9.15. Compute the variance of θ̂(θ◦).
Hint: use that Var Y(θ◦) = Var(Y) = σ²I_n. Derive that Var Ŷ(θ◦) = σ²(I_n − Π_η).

Exercise 4.9.16. Let Ψ be orthonormal, i.e. ΨΨ^⊤ = I_p. Show that

  Var θ̂(θ◦) = σ²(I_p − ΨΠ_η Ψ^⊤).
4.9.9 Alternating method
The idea of partial and two step estimation can be applied in an iterative way. One starts with some initial value θ◦ and sequentially performs the two steps of partial estimation. Set

  η̂_0 = η̃(θ◦) = argmin_η ‖Y − Ψ^⊤θ◦ − Φ^⊤η‖² = (ΦΦ^⊤)^{-1}Φ(Y − Ψ^⊤θ◦).

With this estimate fixed, compute θ̂_1 = θ̃(η̂_0) and continue in this way. Generically, with θ̂_k and η̂_k computed, one recomputes

  θ̂_{k+1} = θ̃(η̂_k) = (ΨΨ^⊤)^{-1}Ψ(Y − Φ^⊤η̂_k),          (4.62)
  η̂_{k+1} = η̃(θ̂_{k+1}) = (ΦΦ^⊤)^{-1}Φ(Y − Ψ^⊤θ̂_{k+1}).          (4.63)

The procedure is especially transparent if the partial design matrices Ψ and Φ are orthonormal: ΨΨ^⊤ = I_p, ΦΦ^⊤ = I_{p_1}. Then

  θ̂_{k+1} = Ψ(Y − Φ^⊤η̂_k),    η̂_{k+1} = Φ(Y − Ψ^⊤θ̂_{k+1}).

In words, having an estimate θ̂ of the parameter θ^*, one computes the residuals Ŷ = Y − Ψ^⊤θ̂ and then builds the estimate η̂ of the nuisance η^* by the empirical coefficients ΦŶ. Then this estimate η̂ is used in a similar way to recompute the estimate of θ^*, and so on.

It is worth noting that every doubled step of alternation improves the current value L(θ̂_k, η̂_k). Indeed, θ̂_{k+1} is defined by maximizing L(θ, η̂_k), that is, L(θ̂_{k+1}, η̂_k) ≥ L(θ̂_k, η̂_k). Similarly, L(θ̂_{k+1}, η̂_{k+1}) ≥ L(θ̂_{k+1}, η̂_k), yielding

  L(θ̂_{k+1}, η̂_{k+1}) ≥ L(θ̂_k, η̂_k).          (4.64)

A very interesting question is whether the procedure (4.62), (4.63) converges and whether it converges to the maximum likelihood solution. The answer is positive, and in the simplest orthogonal case the result is straightforward.

Exercise 4.9.17. Consider the orthogonal situation with ΨΦ^⊤ = 0. Show that the above procedure stabilizes in one step with the solution from Theorem 4.9.4.

In the non-orthogonal case the situation is much more complicated. The idea is to show that the alternating procedure can be represented as a sequence of actions of a shrinking linear operator on the data. The key observation behind the result is the following recurrent formula for Ψ^⊤θ̂_k and Φ^⊤η̂_k:

  Ψ^⊤θ̂_{k+1} = Π_θ(Y − Φ^⊤η̂_k) = (Π_θ − Π_θΠ_η)Y + Π_θΠ_ηΨ^⊤θ̂_k,          (4.65)
  Φ^⊤η̂_{k+1} = Π_η(Y − Ψ^⊤θ̂_{k+1}) = (Π_η − Π_ηΠ_θ)Y + Π_ηΠ_θΦ^⊤η̂_k,          (4.66)

with Π_θ = Ψ^⊤(ΨΨ^⊤)^{-1}Ψ and Π_η = Φ^⊤(ΦΦ^⊤)^{-1}Φ.

Exercise 4.9.18. Show (4.65) and (4.66).
This representation explains necessary and sufficient conditions for convergence of the alternating procedure. Namely, the spectral norm ‖Π_ηΠ_θ‖_∞ (the largest singular value) of the product operator Π_ηΠ_θ should be strictly less than one, and similarly for Π_θΠ_η.

Exercise 4.9.19. Show that ‖Π_θΠ_η‖_∞ = ‖Π_ηΠ_θ‖_∞.

Theorem 4.9.13. Suppose that ‖Π_ηΠ_θ‖_∞ = λ < 1. Then the alternating procedure converges geometrically, the limiting values θ̂ and η̂ are unique and fulfill

  Ψ^⊤θ̂ = (I_n − Π_θΠ_η)^{-1}(Π_θ − Π_θΠ_η)Y,
  Φ^⊤η̂ = (I_n − Π_ηΠ_θ)^{-1}(Π_η − Π_ηΠ_θ)Y,          (4.67)

and θ̂ coincides with the profile MLE θ̃ from (4.52).

Proof. The convergence will be discussed below. Now we comment on the identity θ̂ = θ̃. A direct comparison of the formulas for these two estimates can be a hard task. Instead we use the monotonicity property (4.64). By definition, (θ̃, η̃) maximizes L(θ, η) globally. If we start the procedure with θ◦ = θ̃, the value L cannot decrease by (4.64); since L(θ̃, η̃) is already the global maximum, the procedure stabilizes with θ̂_k = θ̃ and η̂_k = η̃ for every k. As the limiting values are unique, θ̂ = θ̃.

Exercise 4.9.20. 1. Show by induction arguments that

  Φ^⊤η̂_{k+1} = A_{k+1}Y + (Π_ηΠ_θ)^k Φ^⊤η̂_1,

where the linear operator A_k fulfills A_1 = 0 and

  A_{k+1} = Π_η − Π_ηΠ_θ + Π_ηΠ_θ A_k = Σ_{i=0}^{k-1} (Π_ηΠ_θ)^i (Π_η − Π_ηΠ_θ).

2. Show that A_k converges to A = (I_n − Π_ηΠ_θ)^{-1}(Π_η − Π_ηΠ_θ) and evaluate ‖A − A_k‖_∞ and ‖Φ^⊤(η̂_k − η̂)‖.
Hint: use that ‖Π_η − Π_ηΠ_θ‖_∞ ≤ 1 and ‖(Π_ηΠ_θ)^i‖_∞ ≤ ‖Π_ηΠ_θ‖_∞^i ≤ λ^i.

3. Prove (4.67) by inserting η̂ in place of η̂_k and η̂_{k+1} in (4.66).
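A hedged sketch (toy non-orthogonal design, my own sizes) of the alternating procedure (4.62)–(4.63): starting from θ◦ = 0, it converges geometrically to the profile MLE, in line with Theorem 4.9.13.

import numpy as np

rng = np.random.default_rng(6)
n, p, p1 = 80, 2, 3
Psi, Phi = rng.standard_normal((p, n)), rng.standard_normal((p1, n))
Y = Psi.T @ np.array([1.0, -1.0]) + Phi.T @ rng.standard_normal(p1) \
    + 0.5 * rng.standard_normal(n)

theta = np.zeros(p)                                        # initial guess theta0 = 0
for _ in range(200):
    eta = np.linalg.solve(Phi @ Phi.T, Phi @ (Y - Psi.T @ theta))   # partial eta-step (4.63)
    theta = np.linalg.solve(Psi @ Psi.T, Psi @ (Y - Phi.T @ eta))   # partial theta-step (4.62)

Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_profile = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)         # profile MLE (4.53)
print(np.allclose(theta, theta_profile))                   # True after convergence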
4.10 Historical remarks and further reading
The least squares method goes back to Carl Gauss (around 1795, but first published in 1809) and Adrien-Marie Legendre in 1805.

The notion of linear regression was introduced by Francis Galton around 1868 for a biological problem and then extended by Karl Pearson and Ronald Fisher between 1912 and 1922.

The chi-squared distribution was first described by the German statistician Friedrich Robert Helmert in papers of 1875/1876. The distribution was independently rediscovered by Karl Pearson in the context of goodness of fit.

The Gauss–Markov theorem is attributed to Gauß (1995) (originally published in Latin in 1821/3) and Markoff (1912).

Classical references for the sandwich formula (see, e.g., (4.12)) for the variance of the maximum likelihood estimator in a misspecified model are Huber (1967) and White (1982).

It seems that the term ridge regression was first used by Hoerl (1962); see also the original paper by Tikhonov (1963) for what is nowadays referred to as Tikhonov regularization. An early reference on penalized maximum likelihood is Good and Gaskins (1971), who discussed the usage of a roughness penalty.

The original reference for the Galerkin method is Galerkin (1915). Its application in the context of regularization of inverse problems has been described, e.g., by Donoho (1995). The theory of shrinkage estimation started with the fundamental article by Stein (1956).

A systematic treatment of profile maximum likelihood is provided by Murphy and van der Vaart (2000), but the origins of this concept can be traced back to Fisher (1956).

The alternating procedure has been introduced by Dempster et al. (1977) in the form of the expectation-maximization algorithm.
5
Bayes estimation
This chapter discusses the Bayes approach to parameter estimation. This approach differs essentially from classical parametric modeling, also called the frequentist approach. Classical frequentist modeling assumes that the observed data Y follow a distribution law IP from a given parametric family (IP_θ, θ ∈ Θ ⊂ IR^p), that is,

  IP = IP_{θ^*} ∈ (IP_θ).

Suppose that the family (IP_θ) is dominated by a measure μ_0 and denote by p(y|θ) the corresponding density:

  p(y|θ) = dIP_θ/dμ_0(y).

The likelihood is defined as the density at the observed point, and the maximum likelihood approach tries to recover the true parameter θ^* by maximizing this likelihood over θ ∈ Θ.

In the Bayes approach, the paradigm changes and the true data distribution is not assumed to be specified by a single parameter value θ^*. Instead, the unknown parameter is considered to be a random variable ϑ with a distribution π on the parameter space Θ called a prior. The measure IP_θ can be considered as the data distribution conditioned on the event that the randomly selected parameter is exactly θ. The target of analysis is not a single value θ^*; this value is no longer defined. Instead one is interested in the posterior distribution of the random parameter ϑ given the observed data:

  what is the distribution of ϑ given the prior π and the data Y?

In other words, one aims at inferring on the distribution of ϑ on the basis of the observed data Y and our prior knowledge π. Below we distinguish between the random variable ϑ and its particular values θ. However, one often uses the same symbol θ for denoting both objects.
5.1 Bayes formula
The Bayes modeling assumptions can be put together in the form

  Y|θ ∼ p(·|θ),    ϑ ∼ π(·).

The first line has to be understood as the conditional distribution of Y given the particular value θ of the random parameter ϑ: Y|θ means Y|ϑ = θ. This section formalizes and states the Bayes approach in a mathematical way. The answer is given by the Bayes formula for the conditional distribution of ϑ given Y. First consider the joint distribution IP of Y and ϑ. If B is a Borel set in the space 𝒴 of observations and A is a measurable subset of Θ, then

  IP(B × A) = ∫_A ∫_B IP_θ(dy) π(dθ).

The marginal or unconditional distribution of Y is given by averaging the joint probability w.r.t. the distribution of ϑ:

  IP(B) = ∫_Θ ∫_B IP_θ(dy) π(dθ) = ∫_Θ IP_θ(B) π(dθ).

The posterior (conditional) distribution of ϑ given the event Y ∈ B is defined as the ratio of the joint and marginal probabilities:

  IP(ϑ ∈ A | Y ∈ B) = IP(B × A)/IP(B).

Equivalently one can write this formula in terms of the related densities. In what follows we denote by the same letter π the prior measure π and its density w.r.t. some other measure λ, e.g. the Lebesgue or uniform measure on Θ. Then the joint measure IP has the density

  p(y, θ) = p(y|θ)π(θ),

while the marginal density p(y) is the integral of the joint density w.r.t. the prior π:

  p(y) = ∫_Θ p(y, θ) dθ = ∫_Θ p(y|θ)π(θ) dθ.

Finally, the posterior (conditional) density p(θ|y) of ϑ given y is defined as the ratio of the joint density p(y, θ) and the marginal density p(y):

  p(θ|y) = p(y, θ)/p(y) = p(y|θ)π(θ) / ∫_Θ p(y|θ)π(θ) dθ.

Our definitions are summarized in the next lines:

  Y|θ ∼ p(y|θ),    ϑ ∼ π(θ),
  Y ∼ p(y) = ∫_Θ p(y|θ)π(θ) dθ,
  ϑ|Y ∼ p(θ|Y) = p(Y, θ)/p(Y) = p(Y|θ)π(θ) / ∫_Θ p(Y|θ)π(θ) dθ.          (5.1)

Note that given the prior π and the observations Y, the posterior density p(θ|Y) is uniquely defined and can be viewed as the solution or target of analysis within the Bayes approach. The expression (5.1) for the posterior density is called the Bayes formula.

The value p(y) of the marginal density of Y at y does not depend on the parameter θ. Given the data Y, it is just a numerical normalizing factor. Often one skips this factor, writing

  ϑ|Y ∝ p(Y|θ)π(θ).

Below we consider a couple of examples.
Example 5.1.1. Let Y = (Y_1, …, Y_n)^⊤ be a sequence of zeros and ones considered as a realization of a Bernoulli experiment for n = 10. Let also the underlying parameter θ be random and let it take the values 1/2 or 1, each with probability 1/2, that is,

  π(1/2) = π(1) = 1/2.

Then the probability of observing y = “10 ones” is

  IP(y) = (1/2) IP(y | ϑ = 1/2) + (1/2) IP(y | ϑ = 1).

The first probability is quite small, namely 2^{-10}, while the second one is just one. Therefore, IP(y) = (2^{-10} + 1)/2. If we observed y = (1, …, 1)^⊤, then the posterior probability of ϑ = 1 is

  IP(ϑ = 1 | y) = IP(y | ϑ = 1) IP(ϑ = 1) / IP(y) = 1/(2^{-10} + 1),

that is, it is quite close to one.
Exercise 5.1.1. Consider the Bernoulli experiment Y = (Y_1, …, Y_n)^⊤ with n = 10 and let

  π(1/2) = π(0.9) = 1/2.

Compute the posterior distribution of ϑ if we observe y = (y_1, …, y_n)^⊤ with
• y = (1, …, 1)^⊤;
• the number of successes S = y_1 + … + y_n equal to 5.
Show that the posterior density p(θ|y) only depends on the number of successes S.
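A small sketch (the two-point prior below matches Exercise 5.1.1; the code itself is mine) computing the discrete posterior IP(ϑ = θ_k | y) directly from the Bayes formula (5.1).

import numpy as np

def posterior(s, n=10, thetas=(0.5, 0.9), prior=(0.5, 0.5)):
    """Posterior over a finite set of Bernoulli parameters given s successes in n trials."""
    lik = np.array([t**s * (1 - t) ** (n - s) for t in thetas])   # depends on y only via s
    joint = lik * np.array(prior)
    return joint / joint.sum()

print(posterior(s=10))   # all ones: posterior concentrates on theta = 0.9
print(posterior(s=5))    # five successes: posterior strongly favours theta = 0.5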
5.2 Conjugated priors
Let (IP_θ) be a dominated parametric family with the density function p(y|θ). For a prior π with the density π(θ), the posterior density is proportional to p(y|θ)π(θ). Now consider the case when the prior π belongs to some parametric family indexed by a parameter α, that is, π(θ) = π(θ, α). A very desirable feature of the Bayes approach is that the posterior density also belongs to this family. Then computing the posterior is equivalent to fixing the related parameter α = α(Y). Such priors are usually called conjugated.

To illustrate this notion, we present some examples.

Example 5.2.1 (Gaussian shift). Let Y_i ∼ N(θ, σ²) with σ known, i = 1, …, n. Consider ϑ ∼ N(τ, g²), α = (τ, g²). Then for y = (y_1, …, y_n)^⊤ ∈ IR^n

  p(y|θ)π(θ, α) = C exp{ − Σ_i (y_i − θ)²/(2σ²) − (θ − τ)²/(2g²) },

where the normalizing factor C does not depend on θ and y. The expression in the exponent is a quadratic form in θ, and expanding around θ = τ implies

  ϑ|Y ∝ exp{ − Σ_i (y_i − τ)²/(2σ²) + (θ − τ)σ^{-2} Σ_i (y_i − τ) − (θ − τ)²(nσ^{-2} + g^{-2})/2 }.

This representation indicates that the conditional distribution of ϑ given Y is normal. The parameters of the posterior will be computed in the next section.

Example 5.2.2 (Bernoulli). Let the Y_i be Bernoulli r.v.'s with IP(Y_i = 1) = θ. Then for y = (y_1, …, y_n)^⊤, it holds p(y|θ) = Π_{i=1}^{n} θ^{y_i}(1 − θ)^{1−y_i}. Consider the family of priors from the Beta-distribution: π(θ, α) = c(α) θ^a (1 − θ)^b for α = (a, b), where c(α) is a normalizing constant. It follows

  ϑ|y ∝ θ^a (1 − θ)^b Π_i θ^{y_i}(1 − θ)^{1−y_i} = θ^{s+a}(1 − θ)^{n−s+b}

for s = y_1 + … + y_n. Obviously, given s, this is again a distribution from the Beta-family.

Example 5.2.3 (Exponential). Let the Y_i be exponential r.v.'s with IP(Y_i ≥ y) = e^{−yθ}. Then p(y|θ) = Π_{i=1}^{n} θ e^{−y_i θ}. Consider the family of priors from the Gamma-distribution with π(θ, α) = c(α) θ^a e^{−θb} for α = (a, b). One has for the vector y = (y_1, …, y_n)^⊤ with s = y_1 + … + y_n

  ϑ|y ∝ θ^a e^{−θb} Π_{i=1}^{n} θ e^{−y_i θ} = θ^{n+a} e^{−(s+b)θ},

which yields that the posterior is Gamma with the parameters n + a and s + b.

All the previous examples can be systematically treated as special cases of an exponential family. Let Y = (Y_1, …, Y_n) be an i.i.d. sample from a univariate EF (P_θ) with the density

  p_1(y|θ) = p_1(y) exp{ yC(θ) − B(θ) }

for some fixed functions C(θ) and B(θ). For y = (y_1, …, y_n)^⊤, the joint density at the point y is given by

  p(y|θ) ∝ exp{ sC(θ) − nB(θ) },    s = y_1 + … + y_n.

This suggests to take a prior from the family π(θ, α) = c(α) exp{ aC(θ) − bB(θ) } with α = (a, b)^⊤. This yields for the posterior density

  ϑ|y ∝ p(y|θ)π(θ, α) ∝ exp{ (s + a)C(θ) − (n + b)B(θ) },

which is from the same family with the new parameters α(Y) = (S + a, n + b).

Exercise 5.2.1. Build a conjugate prior for the Poisson family.

Exercise 5.2.2. Build a conjugate prior for the variance of the normal family with mean zero and unknown variance.
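A hedged illustration (standard conjugate computation; the numbers are mine) of Example 5.2.2: with a prior proportional to θ^a(1 − θ)^b, the posterior after s successes in n trials is proportional to θ^{s+a}(1 − θ)^{n−s+b}, i.e. again of the Beta type.

import numpy as np

rng = np.random.default_rng(7)
a, b = 2.0, 2.0                      # prior exponents: pi(theta) ~ theta^a (1-theta)^b
theta_true, n = 0.7, 50
y = rng.binomial(1, theta_true, size=n)
s = y.sum()

# posterior is Beta(s + a + 1, n - s + b + 1); its mean pulls the MLE toward the prior
post_mean = (s + a + 1) / (n + a + b + 2)
print(f"s = {s}, posterior mean {post_mean:.3f} vs MLE {s / n:.3f}")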
5.3 Linear Gaussian model and Gaussian priors
An interesting and important class of prior distributions is given by Gaussian
priors. The very nice and desirable feature of this class is that the posterior
distribution for the Gaussian model and Gaussian prior is also Gaussian.
5.3.1 Univariate case
We start with the case of a univariate parameter and one observation Y ∼ N(θ, σ²), where the variance σ² is known and only the mean θ is unknown. The Bayes approach suggests to treat θ as a random variable. Suppose that the prior π is also normal with mean τ and variance r².

Theorem 5.3.1. Let Y|θ ∼ N(θ, σ²), and let the prior π be the normal distribution N(τ, r²):

  Y|θ ∼ N(θ, σ²),    ϑ ∼ N(τ, r²).

Then the joint, marginal, and posterior distributions are normal as well. Moreover, it holds

  Y ∼ N(τ, σ² + r²),
  ϑ|Y ∼ N( (τσ² + Yr²)/(σ² + r²), σ²r²/(σ² + r²) ).

Proof. It holds Y = ϑ + ε with ϑ ∼ N(τ, r²) and ε ∼ N(0, σ²) independent of ϑ. Therefore, Y is normal with mean IE Y = IE ϑ + IE ε = τ and variance

  Var(Y) = IE(Y − τ)² = r² + σ².

This implies the formula for the marginal density p(Y). Next, for ρ = σ²/(r² + σ²),

  IE{(ϑ − τ)(Y − τ)} = IE(ϑ − τ)² = r² = (1 − ρ)Var(Y).

Thus, the random variables Y − τ and ζ with

  ζ = ϑ − τ − (1 − ρ)(Y − τ) = ρ(ϑ − τ) − (1 − ρ)ε

are Gaussian and uncorrelated, and therefore independent. The conditional distribution of ζ given Y coincides with the unconditional one and hence is normal with mean zero and variance

  Var(ζ) = ρ²Var(ϑ) + (1 − ρ)²Var(ε) = ρ²r² + (1 − ρ)²σ² = σ²r²/(σ² + r²).

This yields the result because ϑ = ζ + ρτ + (1 − ρ)Y.
Exercise 5.3.1. Check the result of Theorem 5.3.1 by direct calculation using
Bayes formula (5.1).
Now consider the i.i.d. model from N(θ, σ²), where the variance σ² is known.

Theorem 5.3.2. Let Y = (Y_1, …, Y_n)^⊤ be i.i.d. and for each Y_i

  Y_i|θ ∼ N(θ, σ²),          (5.2)
  ϑ ∼ N(τ, r²).          (5.3)

Then for Ȳ = (Y_1 + … + Y_n)/n

  ϑ|Y ∼ N( (τσ²/n + Ȳr²)/(r² + σ²/n), (r²σ²/n)/(r² + σ²/n) ).

So, the posterior mean of ϑ is a weighted average of the prior mean τ and the sample estimate Ȳ; the sample estimate is pulled back (or shrunk) toward the prior mean. Moreover, the weight ρ on the prior mean is close to one if σ²/n is large relative to r² (i.e. our prior knowledge is more precise than the data information), producing substantial shrinkage. If σ²/n is small relative to r² (i.e. our prior knowledge is imprecise relative to the data information), ρ is close to zero and the direct estimate Ȳ is moved very little towards the prior mean.

Exercise 5.3.2. Prove Theorem 5.3.2 using the technique of the proof of Theorem 5.3.1.
Hint: consider Y_i = ϑ + ε_i, Ȳ = S/n, and define ζ = ϑ − τ − (1 − ρ)(Ȳ − τ). Check that ζ and Ȳ are uncorrelated and hence independent.

The result of Theorem 5.3.2 can formally be derived from Theorem 5.3.1 by replacing the n i.i.d. observations Y_1, …, Y_n with the single observation Ȳ with conditional mean θ and variance σ²/n.
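A quick sketch (my own numbers; the formulas restate Theorem 5.3.2) of the posterior for an i.i.d. N(θ, σ²) sample with a N(τ, r²) prior, written via the shrinkage weight ρ = (σ²/n)/(r² + σ²/n) on the prior mean.

import numpy as np

rng = np.random.default_rng(8)
sigma, tau, r, n, theta_true = 2.0, 0.0, 1.0, 25, 1.5
Y = theta_true + sigma * rng.standard_normal(n)

rho = (sigma**2 / n) / (r**2 + sigma**2 / n)
post_mean = rho * tau + (1 - rho) * Y.mean()          # weighted average of tau and Y-bar
post_var = (r**2 * sigma**2 / n) / (r**2 + sigma**2 / n)
print(f"rho={rho:.3f}  posterior mean={post_mean:.3f}  posterior sd={post_var**0.5:.3f}")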
5.3.2 Linear Gaussian model and Gaussian prior
Now we consider the general case when both Y and ϑ are vectors. Namely, we consider the linear model Y = Ψ^⊤ϑ + ε with Gaussian errors ε in which the random parameter vector ϑ is multivariate normal as well:

  ϑ ∼ N(τ, Γ),    Y|θ ∼ N(Ψ^⊤θ, Σ).          (5.4)

Here Ψ is a given p×n design matrix, and Σ is a given error covariance matrix. Below we assume that both Σ and Γ are non-degenerate. The model (5.4) can be represented in the form

  ϑ = τ + ξ,    ξ ∼ N(0, Γ),          (5.5)
  Y = Ψ^⊤τ + Ψ^⊤ξ + ε,    ε ∼ N(0, Σ),    ε ⊥ ξ,          (5.6)
where ξ ⊥ ε means independence of the error vectors ξ and ε . This representation makes clear that the vectors ϑ, Y are jointly normal. Now we state
the result about the conditional distribution of ϑ given Y .
Theorem 5.3.3. Assume (5.4). Then the joint distribution of (ϑ, Y) is normal with

  IE\begin{pmatrix} ϑ \\ Y \end{pmatrix} = \begin{pmatrix} τ \\ Ψ^⊤τ \end{pmatrix},    Var\begin{pmatrix} ϑ \\ Y \end{pmatrix} = \begin{pmatrix} Γ & ΓΨ \\ Ψ^⊤Γ & Ψ^⊤ΓΨ + Σ \end{pmatrix}.

Moreover, the posterior ϑ|Y is also normal. With B = Γ^{-1} + ΨΣ^{-1}Ψ^⊤,

  IE(ϑ|Y) = τ + ΓΨ(Ψ^⊤ΓΨ + Σ)^{-1}(Y − Ψ^⊤τ) = B^{-1}Γ^{-1}τ + B^{-1}ΨΣ^{-1}Y,          (5.7)
  Var(ϑ|Y) = B^{-1}.          (5.8)

Proof. The following technical lemma explains a very important property of the normal law: conditional distributions of a normal vector are again normal.

Lemma 5.3.1. Let ξ and η be jointly normal. Denote U = Var(ξ), W = Var(η), C = Cov(ξ, η) = IE(ξ − IEξ)(η − IEη)^⊤. Then the conditional distribution of ξ given η is also normal with

  IE(ξ|η) = IEξ + CW^{-1}(η − IEη),    Var(ξ|η) = U − CW^{-1}C^⊤.

Proof. First consider the case when ξ and η are zero-mean. Then the vector

  ζ := ξ − CW^{-1}η

is also normal and zero-mean and fulfills

  IE(ζη^⊤) = IE{(ξ − CW^{-1}η)η^⊤} = IE(ξη^⊤) − CW^{-1}IE(ηη^⊤) = 0,
  Var(ζ) = IE{(ξ − CW^{-1}η)(ξ − CW^{-1}η)^⊤} = IE{(ξ − CW^{-1}η)ξ^⊤} = U − CW^{-1}C^⊤.

The vectors ζ and η are jointly normal and uncorrelated, thus independent. This means that the conditional distribution of ζ given η coincides with the unconditional one. It remains to note that ξ = ζ + CW^{-1}η. Hence, conditionally on η, the vector ξ is just a shift of the normal vector ζ by the fixed vector CW^{-1}η. Therefore, the conditional distribution of ξ given η is normal with mean CW^{-1}η and variance Var(ζ) = U − CW^{-1}C^⊤.
Exercise 5.3.3. Extend the proof of Lemma 5.3.1 to the case when the vectors ξ and η are not zero-mean.

It remains to deduce the desired result about the posterior distribution from this lemma. The formulas for the first two moments of ϑ and Y follow directly from (5.5) and (5.6). Now we apply Lemma 5.3.1 with U = Γ, C = ΓΨ, W = Ψ^⊤ΓΨ + Σ. It follows that the vector ϑ conditioned on Y is normal with

  IE(ϑ|Y) = τ + ΓΨW^{-1}(Y − Ψ^⊤τ),
  Var(ϑ|Y) = Γ − ΓΨW^{-1}Ψ^⊤Γ.          (5.9)

Now we apply the following identities: for any p×n matrix A,

  I_p − A(I_n + A^⊤A)^{-1}A^⊤ = (I_p + AA^⊤)^{-1},
  A(I_n + A^⊤A)^{-1} = (I_p + AA^⊤)^{-1}A.          (5.10)

The first identity in (5.10) implies with A = Γ^{1/2}ΨΣ^{-1/2} that

  Γ − ΓΨW^{-1}Ψ^⊤Γ = Γ^{1/2}{ I_p − A(I_n + A^⊤A)^{-1}A^⊤ }Γ^{1/2}
    = Γ^{1/2}(I_p + AA^⊤)^{-1}Γ^{1/2} = (Γ^{-1} + ΨΣ^{-1}Ψ^⊤)^{-1} = B^{-1}.

Similarly, the second identity in (5.10) yields with the same A

  ΓΨW^{-1} = Γ^{1/2}A(I_n + A^⊤A)^{-1}Σ^{-1/2} = Γ^{1/2}(I_p + AA^⊤)^{-1}AΣ^{-1/2} = B^{-1}ΨΣ^{-1}.

This implies (5.7) by (5.9).

Exercise 5.3.4. Check the details of the proof of Theorem 5.3.3.

Exercise 5.3.5. Derive the result of Theorem 5.3.3 by direct computation of the density of ϑ given Y.
Hint: use that ϑ and Y are jointly normal vectors. Consider their joint density p(θ, Y) for Y fixed and obtain the conditional density by analyzing its linear and quadratic terms w.r.t. θ.

Exercise 5.3.6. Show that Var(ϑ|Y) < Var(ϑ) = Γ.
Hint: use that Var(ϑ|Y) = B^{-1} and B := Γ^{-1} + ΨΣ^{-1}Ψ^⊤ > Γ^{-1}.

The last exercise delivers an important message: the variance of the posterior is smaller than the variance of the prior. This is intuitively clear because the posterior utilizes both sources of information: the one contained in the prior and the one we get from the data Y. However, even in the simple Gaussian case, the proof is quite complicated. Another interpretation of this fact will be given later: the Bayes approach effectively performs a kind of regularization and thus leads to a reduction of the variance; cf. Section 4.7. Another conclusion from the formulas (5.7), (5.8) is that the moments of the posterior distribution approach the moments of the MLE θ̃ = (ΨΣ^{-1}Ψ^⊤)^{-1}ΨΣ^{-1}Y as Γ grows.
5.3.3 Homogeneous errors, orthogonal design
Consider a linear model Y_i = Ψ_i^⊤ϑ + ε_i for i = 1, …, n, where the Ψ_i are given vectors in IR^p and the ε_i are i.i.d. normal N(0, σ²). This model is a special case of the model (5.4) with Ψ = (Ψ_1, …, Ψ_n) and uncorrelated homogeneous errors ε yielding Σ = σ²I_n. Then Σ^{-1} = σ^{-2}I_n, B = Γ^{-1} + σ^{-2}ΨΨ^⊤, and

  IE(ϑ|Y) = B^{-1}Γ^{-1}τ + σ^{-2}B^{-1}ΨY,    Var(ϑ|Y) = B^{-1},          (5.11)

where ΨΨ^⊤ = Σ_i Ψ_iΨ_i^⊤. If the prior variance is also homogeneous, that is, Γ = r²I_p, then the formulas can be further simplified. In particular,

  Var(ϑ|Y) = (r^{-2}I_p + σ^{-2}ΨΨ^⊤)^{-1}.

The most transparent case corresponds to an orthogonal design with ΨΨ^⊤ = η²I_p for some η² > 0. Then

  IE(ϑ|Y) = (σ²/r²)/(η² + σ²/r²) · τ + 1/(η² + σ²/r²) · ΨY,          (5.12)
  Var(ϑ|Y) = σ²/(η² + σ²/r²) · I_p.          (5.13)

Exercise 5.3.7. Derive (5.12) and (5.13) from Theorem 5.3.3 with Σ = σ²I_n, Γ = r²I_p, and ΨΨ^⊤ = η²I_p.

Exercise 5.3.8. Show that the posterior mean is a convex combination of the MLE θ̃ = η^{-2}ΨY and the prior mean τ:

  IE(ϑ|Y) = ρτ + (1 − ρ)θ̃,

with ρ = (σ²/r²)/(η² + σ²/r²). Moreover, ρ → 0 as η → ∞, that is, the posterior mean approaches the MLE θ̃.
5.4 Non-informative priors
The Bayes approach requires to fix a prior distribution on the values of the parameter ϑ. What happens if no such information is available? Is the Bayes approach still applicable? An immediate answer is “no”, however it is a bit hasty. Actually one can still apply the Bayes approach with priors which do not give any preference to one point over the others. Such priors are called non-informative. Consider first the case when the set Θ is finite: Θ = {θ_1, …, θ_M}. Then the non-informative prior is just the uniform measure on Θ giving every point θ_m the equal probability 1/M. Then the joint probability of Y and ϑ is the average of the measures IP_{θ_m} and the same holds for the marginal distribution of the data:

  p(y) = (1/M) Σ_{m=1}^{M} p(y|θ_m).

The posterior distribution is already “informative” and it differs from the uniform prior:

  p(θ_k|y) = p(y|θ_k)π(θ_k)/p(y) = p(y|θ_k) / Σ_{m=1}^{M} p(y|θ_m),    k = 1, …, M.

Exercise 5.4.1. Check that the posterior measure is non-informative iff all the measures IP_{θ_m} coincide.

A similar situation arises if the set Θ is a non-discrete bounded subset of IR^p. A typical example is given by the case of a univariate parameter restricted to a finite interval [a, b]. Define π(θ) = 1/π(Θ), where

  π(Θ) := ∫_Θ dθ.

Then

  p(y) = 1/π(Θ) ∫_Θ p(y|θ) dθ,
  p(θ|y) = p(y|θ)π(θ)/p(y) = p(y|θ) / ∫_Θ p(y|θ) dθ.          (5.14)

In some cases the non-informative uniform prior can be used even for unbounded parameter sets. Indeed, what we really need is that the integral in the denominator of the last formula is finite:

  ∫_Θ p(y|θ) dθ < ∞    ∀y.

Then we can apply (5.14) even if Θ is unbounded.
Exercise 5.4.2. Consider the Gaussian shift model (5.2) and (5.3).
(i) Check that for n = 1, the value ∫_{−∞}^{∞} p(y|θ) dθ is finite for every y and the posterior distribution of ϑ coincides with the distribution of Y.
(ii) Compute the posterior for n > 1.

Exercise 5.4.3. Consider the Gaussian regression model Y = Ψ^⊤ϑ + ε, ε ∼ N(0, Σ), and the non-informative prior π given by the Lebesgue measure on the space IR^p. Show that the posterior for ϑ is normal with mean θ̃ = (ΨΣ^{-1}Ψ^⊤)^{-1}ΨΣ^{-1}Y and variance (ΨΣ^{-1}Ψ^⊤)^{-1}. Compare with the result of Theorem 5.3.3.

Note that the result of this exercise can be formally derived from Theorem 5.3.3 by replacing Γ^{-1} with 0.

Another way of tackling the case of an unbounded parameter set is to consider a sequence of priors that approaches the uniform distribution on the whole parameter set. In the case of linear Gaussian models and normal priors, a natural way is to let the prior variance tend to infinity. Consider first the univariate case; see Section 5.3.1. A non-informative prior can be approximated by the normal distribution with mean zero and variance r² tending to infinity. Then

  ϑ|Y ∼ N( Yr²/(σ² + r²), σ²r²/(σ² + r²) )  →  N(Y, σ²),    r → ∞

(weak convergence). It is interesting to note that the case of an i.i.d. sample in fact reduces the situation to the case of a non-informative prior. Indeed, the result of Theorem 5.3.2 implies with r_n² = nr² and τ = 0

  ϑ|Y ∼ N( Ȳ r_n²/(σ² + r_n²), (σ²/n) · r_n²/(σ² + r_n²) ).

One says that the prior information “washes out” from the posterior distribution as the sample size n tends to infinity.
5.5 Bayes estimate and posterior mean
Given a loss function ℘(θ, θ′) on Θ × Θ, the Bayes risk of an estimate θ̂ = θ̂(Y) is defined as

  R_π(θ̂) := IE ℘(θ̂, ϑ) = ∫_Θ ∫_𝒴 ℘(θ̂(y), θ) p(y|θ) μ_0(dy) π(θ) dθ.

Note that ϑ in this formula is treated as a random variable that follows the prior distribution π. One can represent this formula symbolically in the form

  R_π(θ̂) = IE{ IE[ ℘(θ̂, ϑ) | ϑ ] } = IE R(θ̂, ϑ).

Here the external integration averages the pointwise risk R(θ̂, ϑ) over all possible values of ϑ due to the prior distribution.

The Bayes formula p(y|θ)π(θ) = p(θ|y)p(y) and a change of the order of integration can be used to represent the Bayes risk via the posterior density:

  R_π(θ̂) = ∫_𝒴 ∫_Θ ℘(θ̂(y), θ) p(θ|y) dθ p(y) μ_0(dy) = IE{ IE[ ℘(θ̂, ϑ) | Y ] }.

The estimate θ̃_π is called Bayes or π-Bayes if it minimizes the corresponding risk:

  θ̃_π = argmin_{θ̂} R_π(θ̂),

where the infimum is taken over the class of all feasible estimates. The most widespread choice of the loss function is the quadratic one:

  ℘(θ, θ′) := ‖θ − θ′‖².

The great advantage of this choice is that the Bayes solution can be given explicitly; it is the posterior mean

  θ̃_π := IE(ϑ|Y) = ∫_Θ θ p(θ|Y) dθ.

Note that due to the Bayes formula, this value can be rewritten as

  θ̃_π = (1/p(Y)) ∫_Θ θ p(Y|θ) π(θ) dθ,    p(Y) = ∫_Θ p(Y|θ) π(θ) dθ.
Theorem 5.5.1. It holds for any estimate θ̂

  R_π(θ̂) ≥ R_π(θ̃_π).

Proof. The main feature of the posterior mean is that it provides a kind of projection of the data. This property can be formalized as follows:

  IE(θ̃_π − ϑ | Y) = ∫_Θ (θ̃_π − θ) p(θ|Y) dθ = 0,

yielding for any estimate θ̂ = θ̂(Y)

  IE( ‖θ̂ − ϑ‖² | Y )
    = IE( ‖θ̃_π − ϑ‖² | Y ) + IE( ‖θ̃_π − θ̂‖² | Y ) + 2(θ̂ − θ̃_π)^⊤ IE(θ̃_π − ϑ | Y)
    = IE( ‖θ̃_π − ϑ‖² | Y ) + IE( ‖θ̃_π − θ̂‖² | Y )
    ≥ IE( ‖θ̃_π − ϑ‖² | Y ).

Here we have used that both θ̂ and θ̃_π are functions of Y and can be considered as constants when taking the conditional expectation w.r.t. Y. Now

  R_π(θ̂) = IE‖θ̂ − ϑ‖² = IE{ IE(‖θ̂ − ϑ‖² | Y) } ≥ IE{ IE(‖θ̃_π − ϑ‖² | Y) } = R_π(θ̃_π)

and the result follows.

Exercise 5.5.1. Consider the univariate case with the loss function |θ − θ′|. Check that the posterior median minimizes the Bayes risk.
5.6 Posterior mean and ridge regression
Here we again consider the case of a linear Gaussian model

  Y = Ψ^⊤ϑ + ε,    ε ∼ N(0, σ²I_n).

(To simplify the presentation, we focus here on the case of homogeneous errors with Σ = σ²I_n.) Recall that the maximum likelihood estimate θ̃ for this model reads as

  θ̃ = (ΨΨ^⊤)^{-1}ΨY.

A penalized MLE θ̃_G for the roughness penalty ‖Gθ‖² is given by

  θ̃_G = (ΨΨ^⊤ + σ²G²)^{-1}ΨY;

see Section 4.7. It turns out that a similar estimate appears in quite a natural way within the Bayes approach. Consider the normal prior distribution ϑ ∼ N(0, G^{-2}). The posterior will be normal as well, with the posterior mean

  θ̃_π = σ^{-2}B^{-1}ΨY = (ΨΨ^⊤ + σ²G²)^{-1}ΨY;

see (5.11). It appears that θ̃_π = θ̃_G for the normal prior π = N(0, G^{-2}). One can say that the Bayes approach with a Gaussian prior leads to a regularization of the least squares method which is similar to quadratic penalization. The degree of regularization is inversely proportional to the variance of the prior: the larger the variance, the closer the prior is to the non-informative one and the posterior mean θ̃_π to the MLE θ̃.
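A hedged numerical check (toy design, my own numbers) of the identity discussed above: the posterior mean under the prior N(0, G^{-2}) equals the penalized LSE with roughness penalty ‖Gθ‖².

import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 60, 4, 0.7
Psi = rng.standard_normal((p, n))
Y = Psi.T @ rng.standard_normal(p) + sigma * rng.standard_normal(n)
G2 = np.diag([1.0, 2.0, 3.0, 4.0])             # penalty matrix G^2 (assumed)

ridge = np.linalg.solve(Psi @ Psi.T + sigma**2 * G2, Psi @ Y)     # penalized MLE
B = G2 + Psi @ Psi.T / sigma**2                # B = Gamma^{-1} + sigma^{-2} Psi Psi^T
post_mean = np.linalg.solve(B, Psi @ Y) / sigma**2                # (5.11) with tau = 0
print(np.allclose(ridge, post_mean))           # True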
5.7 Bayes and minimax risks
Consider the parametric model Y ∼ IP ∈ (IP_θ, θ ∈ Θ ⊂ IR^p). Let θ̃ be an estimate of the parameter ϑ from the available data Y. Formally, θ̃ is a measurable function of Y with values in Θ:

  θ̃ = θ̃(Y) : 𝒴 → Θ.

The quality of estimation is assessed by a loss function ϱ(θ, θ′). In estimation problems one usually selects this function in the form ϱ(θ, θ′) = ϱ_1(θ − θ′) for another function ϱ_1 of one argument. Typical examples are given by the quadratic loss ϱ_1(θ) = ‖θ‖² or the absolute loss ϱ_1(θ) = ‖θ‖. Given such a loss function, the pointwise risk of θ̃ at θ is defined as

  R(θ̃, θ) := IE_θ ϱ(θ̃, θ).

The minimax risk is defined as the maximum of the pointwise risks over all θ ∈ Θ:

  R(θ̃) := sup_{θ∈Θ} R(θ̃, θ) = sup_{θ∈Θ} IE_θ ϱ(θ̃, θ).

Similarly, the Bayes risk for a prior π is defined by weighting the pointwise risks according to the prior distribution:

  R_π(θ̃) := ∫ R(θ̃, θ) π(θ) dθ = ∫ IE_θ ϱ(θ̃, θ) π(θ) dθ.

It is obvious that the Bayes risk is always smaller than or equal to the minimax one, whatever the prior measure is:

  R_π(θ̃) ≤ R(θ̃).

The famous Le Cam theorem states that the minimax risk can be recovered by taking the supremum over all priors:

  R(θ̃) = sup_π R_π(θ̃).

Moreover, the supremum can be taken over all discrete priors with finite support.
5.8 Van Trees inequality
The Cramér–Rao inequality yields a lower bound on the risk of any unbiased estimator. However, it appears to be sub-optimal if the condition of unbiasedness is dropped. Another way to get a general bound on the quadratic risk of an arbitrary estimator is to bound from below the Bayes risk for any suitable prior and then to maximize this lower bound over a class of such priors.

Let Y be an observed vector in IR^n and (IP_θ) be the corresponding parametric family with the density function p(y|θ) w.r.t. a measure μ_0 on IR^n. Let also θ̂ = θ̂(Y) be an arbitrary estimator of θ. For any prior π, we aim to lower bound the π-Bayes risk R_π(θ̂). We already know that this risk is minimized by the posterior mean estimator θ̃_π. However, the presented bound does not rely on a particular structure of the considered estimator. Similarly to the Cramér–Rao inequality, it is entirely based on some geometric properties of the log-likelihood function log p(y|θ).

To simplify the explanation, consider first the case of a univariate parameter θ ∈ Θ ⊆ IR. Below we assume that the prior π has a positive, continuously differentiable density π(θ). In addition we suppose that the parameter set Θ is an interval, possibly infinite, and that the prior density π(θ) vanishes at the edges of Θ. By F_π we denote the Fisher information of the prior distribution π:

  F_π := ∫_Θ |π′(θ)|²/π(θ) dθ.

Recall also the definition of the full Fisher information of the family (IP_θ):

  F(θ) := IE_θ | (∂/∂θ) log p(Y|θ) |² = ∫ |p′(y|θ)|²/p(y|θ) μ_0(dy).

These quantities will enter the risk bound. In what follows we also use the notation

  p_π(y, θ) = p(y|θ)π(θ),    p′_π(y, θ) := dp_π(y, θ)/dθ.

The Bayesian analog of the score function is

  L′_π(Y, θ) := p′_π(Y, θ)/p_π(Y, θ) = p′(Y|θ)/p(Y|θ) + π′(θ)/π(θ).          (5.15)
The fact that p_π(y, θ) vanishes at the edges of Θ implies for any θ̂ = θ̂(y) and any y ∈ IR^n

  ∫_Θ [ (θ̂(y) − θ) p_π(y, θ) ]′ dθ = (θ̂(y) − θ) p_π(y, θ) |_{∂Θ} = 0,

and hence

  ∫_Θ (θ̂(y) − θ) p′_π(y, θ) dθ = ∫_Θ p_π(y, θ) dθ.          (5.16)

This is an interesting identity. It holds for each y with μ_0-probability one and the estimate θ̂ only appears on its left hand side. This can be explained by the fact that ∫_Θ p′_π(y, θ) dθ = 0, which follows by the same calculus. Based on (5.16), one can compute the expectation of the product of θ̂(Y) − ϑ and L′_π(Y, ϑ) = p′_π(Y, ϑ)/p_π(Y, ϑ):

  IE_π[ (θ̂(Y) − ϑ) L′_π(Y, ϑ) ]
    = ∫_{IR^n} ∫_Θ (θ̂(y) − θ) p′_π(y, θ) dθ μ_0(dy)
    = ∫_{IR^n} ∫_Θ p_π(y, θ) dθ μ_0(dy) = 1.          (5.17)
Again, the remarkable feature of this equality is that the estimate θ̂ only appears on its left hand side. Now the idea of the bound is very simple. We introduce the r.v. h(Y, ϑ) = L′_π(Y, ϑ)/IE_π|L′_π(Y, ϑ)|² and use the orthogonality of θ̂(Y) − ϑ − h(Y, ϑ) and h(Y, ϑ) together with the Pythagoras theorem to show that the squared risk of θ̂ is not smaller than IE_π h²(Y, ϑ). More precisely, denote

  I_π := IE_π[ |L′_π(Y, ϑ)|² ],    h(Y, ϑ) := I_π^{-1} L′_π(Y, ϑ).

Then IE_π h²(Y, ϑ) = I_π^{-1} and (5.17) implies

  IE_π[ (θ̂(Y) − ϑ − h(Y, ϑ)) h(Y, ϑ) ]
    = I_π^{-1} IE_π[ (θ̂(Y) − ϑ) L′_π(Y, ϑ) ] − IE_π h²(Y, ϑ) = 0,

and

  IE_π[ (θ̂(Y) − ϑ)² ]
    = IE_π[ { (θ̂(Y) − ϑ − h(Y, ϑ)) + h(Y, ϑ) }² ]
    = IE_π[ (θ̂(Y) − ϑ − h(Y, ϑ))² ] + IE_π h²(Y, ϑ) + 2 IE_π[ (θ̂(Y) − ϑ − h(Y, ϑ)) h(Y, ϑ) ]
    = IE_π[ (θ̂(Y) − ϑ − h(Y, ϑ))² ] + IE_π h²(Y, ϑ)
    ≥ IE_π h²(Y, ϑ) = I_π^{-1}.
0
It remains to compute Iπ . RWe use
the representation (5.15) for Lπ (Y , θ) .
Further we use the identity p(y θ)dµ0 (y) ≡ 1 which implies by differentiating in θ
Z
p0 (y θ) dµ0 (y) ≡ 0.
This yields that two random variables p0 (Y θ)/p(Y θ) and π 0 (θ)/π(θ) are
orthogonal under the measure IPπ :
0
Z Z
p (Y θ) π 0 (θ)
=
p0 (y θ) π 0 (θ) dµ0 (y) dθ
IEπ
π(θ)
p(Y θ)
Θ IRn
Z Z
=
p0 (y θ) dµ0 (y) π 0 (θ)dθ = 0.
IRn
Θ
Now by the Pythagorus Theorem
2
0
0
0
2
p (Y ϑ) 2
π (ϑ)
+
I
E
IEπ Lπ (Y , ϑ) = IEπ
π
π(ϑ)
p(Y ϑ)
0
2
0
p (Y ϑ) 2 π (ϑ)
ϑ
+
I
E
= IEπ IEϑ
π
π(ϑ)
p(Y ϑ)
Z
=
F(θ) π(θ) dθ + Fπ .
Θ
Now we can summarize the derivations in the form of van Trees’ inequality.
Theorem 5.8.1 (van Trees). Let Θ be an interval on IR and let the prior
density π(θ) have piecewise continuous first derivate, π(θ) be positive in the
interior of Θ and vanish at the edges. Then for any estimator θb of θ
IEπ (θb − ϑ)2 ≥
−1
Z
F(θ) π(θ) dθ + Fπ
Θ
.
5.8 Van Trees inequality
201
Now we consider a multivariate extension. We will say that a real function
g(θ) , θ ∈ Θ , is nice if it is piecewise continuously differentiable in θ for
almost all values of θ . Everywhere g 0 (·) means the derivative w.r.t. θ , that
d
g(θ) .
is, g 0 (θ) = dθ
We consider the following assumptions:
1. p(y θ) is nice in θ for almost all y ;
2. The full Fisher information matrix
0
0
p (Y θ) p0 (Y θ) >
p (Y θ)
def
= IEθ
F(θ) = Varθ
p(Y θ)
p(Y θ) p(Y θ)
exists and continuous in θ ;
3. Θ is compact with boundary which is piecewise C1 -smooth;
4. π(θ) is nice; π(θ) is positive on the interior of Θ and zero on its boundary. The Fisher information matrix Fπ of the prior π is positive and
finite:
def
Fπ = IEπ
> π 0 (ϑ) π 0 (ϑ)
< ∞.
π(ϑ) π(ϑ)
b = θ(Y
b ),
Theorem 5.8.2. Let the assumptions 1–4 hold. For any estimate θ
it holds
h
i
b ) − ϑ θ(Y
b ) − ϑ > ≥ I −1 ,
IEπ θ(Y
(5.18)
π
where
def
Z
Iπ =
F(θ) π(θ) dθ + Fπ .
Θ
Proof. The use of pπ (y, θ) = p(y θ)π(θ) ≡ 0 at the boundary of Θ implies
b = θ(y)
b
for any θ
and any y ∈ IRn by Stokes’ theorem
Z
0
b
θ(y)
− θ pπ (y, θ) dθ = 0,
Θ
and hence
Z
Θ
p0π (y, θ)
>
b
θ(y)
− θ dθ = Ip
Z
pπ (y, θ)dθ,
Θ
def
where Ip is the identity matrix. Therefore, the random vector L0π (Y , θ) =
p0π (Y , θ)/pπ (Y , θ) fulfills
202
5 Bayes estimation
Z
h
> i
0
b
=
IEπ Lπ (Y , ϑ) θ(Y ) − ϑ
Z
IRn
>
b
p0π (y, θ) θ(y)
− θ dθ dµ0 (y)
Θ
Z
Z
= Ip
IRn
pπ (y, θ) dθ dµ0 (y) = Ip . (5.19)
Θ
Denote
h
> i
def
Iπ = IEπ L0π (Y , ϑ) L0π (Y , ϑ)
,
def
h(Y , ϑ) = Iπ−1 L0π (Y , ϑ).
> Then IEπ h(Y , ϑ) h(Y , ϑ)
= Iπ−1 and (5.19) implies
h
i
b ) − ϑ − h(Y , ϑ) >
IEπ h(Y , ϑ) θ(Y
h
h
i
i
b ) − ϑ > − IEπ h(Y , ϑ) h(Y , ϑ) > = 0
= Iπ−1 IEπ L0π (Y , ϑ) θ(Y
and hence
h
i
b ) − ϑ θ(Y
b )−ϑ >
IEπ θ(Y
h
i
b ) − ϑ − h(Y , ϑ) + h(Y , ϑ) θ(Y
b ) − ϑ − h(Y , ϑ) + h(Y , ϑ) >
= IEπ θ(Y
h
i
b ) − ϑ − h(Y , ϑ) θ(Y
b ) − ϑ − h(Y , ϑ) >
= IEπ θ(Y
h
> i
+ IEπ h(Y , ϑ) h(Y , ϑ)
h
> i
≥ IEπ h(Y , ϑ) h(Y , ϑ)
= Iπ−1 .
It remains to compute Iπ . The definition implies
def
L0π (Y , θ) =
1
1
1 0
p0 (Y θ) +
p0π (Y , θ) =
π (θ).
pπ (Y , θ)
π(θ)
p(Y θ)
Further, the identity
R
p(y θ)dµ0 (y) ≡ 1 yields by differentiating in θ
Z
p0 (y θ) dµ0 (y) ≡ 0.
Using this identity, we show that the random vectors p0 (Y θ)/p(Y θ) and
π 0 (θ)/π(θ) are orthogonal w.r.t. the measure IPπ . Indeed,
5.9 Historical remarks and further reading
IEπ
203
Z
>
Z
p0 (Y θ) π 0 (θ) >
0
0
=
π (θ)
p (y θ) dµ0 (y) dθ
p(Y θ) π(θ)
Θ
IRn
>
Z
Z
0
0
=
π (θ)
p (y θ) dµ0 (y) dθ = 0.
Θ
IRn
Now by usual Pythagorus calculus, we obtain
IEπ
h
> i
L0π (Y , ϑ) L0π (Y , ϑ)
0
> p0 (Y ϑ) p0 (Y ϑ) >
π (ϑ) π 0 (ϑ)
+
I
E
= IEπ
π
π(ϑ) π(ϑ)
p(Y ϑ) p(Y ϑ)
0
0
0
> p (Y ϑ)
p (Y ϑ) > π (ϑ) π 0 (ϑ)
ϑ
+
I
E
= IEπ IEϑ
π
π(ϑ) π(ϑ)
p(Y ϑ)
p(Y ϑ)
Z
=
F(θ) π(θ) dθ + Fπ ,
Θ
and the result follows.
This matrix inequality can be used for obtaining a number of L2 bounds.
b ) − ϑk2 .
We present only two bounds for the squared norm kθ(Y
Corollary 5.8.1. Under the same conditions 1–4, it holds
b ) − ϑk2 ≥ tr I −1 ≥ p2 / tr(Iπ ).
IEπ kθ(Y
π
Proof. The first inequality follows directly from the bound (5.18) of Theorem 5.8.2. For the second one, it suffices to note that for any positive symmetric p × p matrix B , it holds
tr(B) tr(B −1 ) ≥ p2 .
(5.20)
This fact can be proved by the Cauchy-Schwarz inequality.
Exercise 5.8.1. Prove (5.20).
Hint: use the Cauchy-Schwarz inequality for the scalar product tr(B 1/2 B −1/2 )
2
of two matrices B 1/2 and B −1/2 (considered as vectors in IRp ).
5.9 Historical remarks and further reading
The origin of the Bayesian approach to statistics was the article by Bayes
(1763). Further theoretical foundations are due to de Finetti (1937), Savage
(1954), and Jeffreys (1957).
204
5 Bayes estimation
The theory of conjugated priors was developed by Raiffa and Schlaifer
(1961). Conjugated priors for exponential families have been characterized
by Diaconis and Ylvisaker (1979). Non-informative priors were considered by
Jeffreys (1961).
Bayes optimality of the posterior mean estimator under quadratic loss is
a classical result which can be found, for instance, in Section 4.4.2 of Berger
(1985).
The van Trees inequality is orgininally due to Van Trees (1968), p. 72. Gill
and Levit (1995) applied it to the problem of establishing a Bayesian version
of the Cramér-Rao bound.
For further reading, we recommend the books by Berger (1985), Bernardo
and Smith (1994), and Robert (2001).
6
Testing a statistical hypothesis
Let Y be the observed sample. The hypothesis testing problem assumes that
there is some external information (hypothesis) about the distribution of this
sample and the target is to check this hypothesis on the basis of the available
data.
6.1 Testing problem
This section specifies the main notions of the theory of hypothesis testing.
We start with a simple hypothesis. Afterwards a composite hypothesis will be
discussed. We also introduce the notions of the testing error, level, power, etc.
6.1.1 Simple hypothesis
The classical testing problem consists in checking a specific hypothesis that
the available data indeed follow an externally precisely given distribution. We
illustrate this notion by several examples.
Example 6.1.1 (Simple game). Let Y = (Y1 , . . . , Yn )> be a Bernoulli sequence of zeros and ones. This sequence can be viewed as the sequence of
successes, or results of throwing a coin, etc. The hypothesis about this sequence is that wins (associated with one) and losses (associated with zero)
are equally frequent in the long run. This hypothesis can be formalized as
follows: IP = (IPθ∗ )⊗n with θ∗ = 1/2 , where IPθ describes the Bernoulli
experiment with parameter θ .
Example 6.1.2 (No treatment effect). Let (Yi , Ψi ) be experimental results,
i = 1, . . . , n . The linear regression model assumes a certain dependence of
the form Yi = Ψ>
i θ + εi with errors εi having zero mean. The “no effect”
hypothesis means that there is no systematic dependence of Yi on the factors
Ψi , i.e., θ = θ ∗ = 0 , and the observations Yi are just noise.
206
6 Testing a statistical hypothesis
Example 6.1.3 (Quality control). Assume that the Yi are the results of a
production process which can be represented in the form Yi = θ∗ + εi , where
θ∗ is a nominal value and εi is a measurement error. The hypothesis is that
the observed process indeed follows this model.
The general problem of testing a simple hypothesis is stated as follows:
to check on the basis of the available observations Y that their distribution
is described by a given measure IP . The hypothesis is often called a null
hypothesis or just null.
6.1.2 Composite hypothesis
More generally, one can treat the problem of testing a composite hypothesis.
Let (IPθ , θ ∈ Θ ⊂ IRp ) be a given parametric family, and let Θ0 ⊂ Θ be
a non-empty subset in Θ . The hypothesis is that the data distribution IP
belongs to the set (IPθ , θ ∈ Θ0 ) . Often, this hypothesis and the subset Θ0
are identified with each other and one says that the hypothesis is given by
Θ0 .
We give some typical examples where such a formulation is natural.
Example 6.1.4 (Testing a subvector). Assume that the vector θ ∈ Θ can be
decomposed into two parts: θ = (γ, η) . The subvector γ is the target of
analysis while the subvector η matters for the distribution of the data but
is not the target of analysis. It is often called the nuisance parameter. The
hypothesis we want to test is γ = γ ∗ for some fixed value γ ∗ . A typical
situation in factor analysis where such problems arise is to check on “no
effect” for one particular factor in the presence of many different, potentially
interrelated factors.
Example 6.1.5 (Interval testing). Let Θ be the real line and Θ0 be an interval. The hypothesis is that IP = IPθ∗ for θ∗ ∈ Θ0 . Such problems are typical
for quality control or warning (monitoring) systems when the controlled parameter should be in the prescribed range.
Example 6.1.6 (Testing a hypothesis about the error distribution). Consider
the regression model Yi = Ψ>
i θ + εi . The typical assumption about the errors
εi is that they are zero-mean normal. One may be interested in testing this
assumption, having in mind that the cases of discrete, or heavy-tailed, or
heteroscedastic errors can also occur.
6.1.3 Statistical tests
A test is a statistical decision on the basis of the available data whether the
pre-specified hypothesis is rejected or retained. So the decision space consists
of only two points, which we denote by zero and one. A decision φ is a
mapping of the data Y to this space and is called a test:
6.1 Testing problem
207
φ : Y → {0, 1}.
The event {φ = 1} means that the null hypothesis is rejected and the opposite
event means non-rejection (acceptance) of the null. Usually the testing results
are qualified in the following way: rejection of the null hypothesis means that
the data are not consistent with the null, or, equivalently, the data contain
some evidence against the null hypothesis. Acceptance simply means that the
data do not contradict the null. Therefore, the term ”non-rejection” is often
considered more appropriate.
The region of acceptance is a subset of the observation space Y on which
φ = 0 . One also says that this region is the set of values for which we fail to
reject the null hypothesis. The region of rejection or critical region is on the
other hand the subset of Y on which φ = 1 .
6.1.4 Errors of the first kind, test level
In the hypothesis testing framework one distinguishes between errors of the
first and second kind. An error of the first kind means that the null hypothesis
is falsely rejected when it is correct. We formalize this notion first for the case
of a simple hypothesis and then extend it to the general case.
Let H0 : Y ∼ IPθ∗ be a null hypothesis. The error of the first kind is the
situation in which the data indeed follow the null, but the decision of the test
is to reject this hypothesis: φ = 1 . Clearly the probability of such an error is
IPθ∗ (φ = 1) . The latter number in [0, 1] is called the size of the test φ . One
says that φ is a test of level α for some α ∈ (0, 1) if
IPθ∗ (φ = 1) ≤ α.
The value α is called level of the test or significance level. Often, size and
level of a test coincide; however, especially in discrete models, it is not always
possible to attain the significance level exactly by the chosen test φ , meaning
that the actual size of φ is smaller than α , see Example 6.1.7 below.
If the hypothesis is composite, then the level of the test is the maximum
rejection probability over the null subset Θ0 . Here, a test φ is of level α if
sup IPθ (φ = 1) ≤ α.
θ∈Θ0
Example 6.1.7 (One-sided binomial test). Consider again the situation from
Example 6.1.1 (an i.i.d. sample Y = (Y1 , . . . , Yn )> from a Bernoulli distribution is observed). We let n = 13 and Θ0 = [0, 1/5] . For instance, one may
want to test if the cancer-related mortality in a subpopulation of individuals which are exposed to some environmental risk factor is not significantly
larger than in the general population, in which it is equal to 1/5 . To this
end, death causes are assessed for 13 decedents and for every decedent we
208
6 Testing a statistical hypothesis
get the information if s/he died because of cancer or not. From Section 2.6,
∗
we know
Pnthat Sn /n efficiently estimates the success probability θ , where
Sn = i=1 Yi . Therefore, it appears natural to use Sn also as the basis for
a solution of the test problem. Under θ∗ , it holds Sn ∼ Bin(n, θ∗ ) . Since a
large value of Sn implies evidence against the null, we choose a test of the
form φ = 1I(cα ,n] (Sn ) , where the constant cα has to be chosen such that φ
is of level α . This condition can equivalently be expressed as
IPθ (Sn ≤ cα ) ≥ 1 − α.
inf
0≤θ≤1/5
For fixed k ∈ {0, . . . , n} , we have
k X
n `
IPθ (Sn ≤ k) =
θ (1 − θ)n−` = F (θ, k) (say).
`
`=0
Exercise 6.1.1. Show that for all k ∈ {0, . . . , n} , the function F (·, k) is
decreasing on Θ0 = [0, 1/5] .
Due to Exercise 6.1.1, we have to calculate cα under the least favorable
parameter configuration (LFC) θ = 1/5 and we obtain
(
cα = min k ∈ {0, . . . , n} :
k ` n−`
X
4
n
1
`=0
5
`
5
)
≥1−α ,
because we want to exhaust the significance level α as tightly as possible. For
the standard choice of α = 0.05 , we have
4 ` 13−`
X
13
1
4
`=0
`
5
5
≈ 0.901,
5 ` 13−`
X
13
1
4
`=0
`
5
5
≈ 0.9700,
hence, we choose cα = 5 . However, the size of the test φ = 1I(5,13] (Sn ) is
strictly smaller than the significance level α = 0.05 , namely
sup
IPθ (Sn > 5) = IP1/5 (Sn > 5) = 1 − IP1/5 (Sn ≤ 5) ≈ 0.03 < α.
0≤θ≤1/5
6.1.5 Randomized tests
In some situations it is difficult to decide about acceptance or rejection of the
hypothesis. A randomized test can be viewed as a weighted decision: with a
certain probability the hypothesis is rejected, otherwise retained. The decision
space for a randomized test φ is the unit interval [0, 1] , that is, φ(Y ) is a
number between zero and one. The hypothesis H0 is rejected with probability
6.1 Testing problem
209
φ(Y ) on the basis of the observed data Y . If φ(Y ) only admits the binary
values 0 and 1 for every Y , then we are back at the usual non-randomized
test that we have considered before. The probability of an error of the first
kind is naturally given by the value IEφ(Y ) . For a simple hypothesis H0 :
IP = IPθ∗ , a test φ is now of level α if
IEφ(Y ) = α.
For a randomized test φ , the significance level α is typically attainable
exactly, even for discrete models. In the case of a composite hypothesis
H0 : IP ∈ (IPθ , θ ∈ Θ0 ) , the level condition reads as
sup IEφ(Y ) ≤ α
θ∈Θ0
as before. In what follows we mostly consider non-randomized tests and only
comment on whether a randomization can be useful. Note that any randomized test can be reduced to a non-randomized test by extending the probability
space.
Exercise 6.1.2. Construct for any randomized test φ its non-randomized
version using a random data generator.
Randomized tests are a satisfactory solution of the test problem from a
mathematical point of view, but they are disliked by practitioners, because
the test result may not be reproducible, due to randomization.
Example 6.1.8 (Example 6.1.7 continued). Under the setup of Example 6.1.7,
consider the randomized test φ , given by
0, Sn < 5
φ(Y ) = 2/7, Sn = 5
1, Sn > 5
It is easy to show that under the LFC θ = 1/5 , the size of φ is (up to
rounding) exactly equal to α = 5% .
6.1.6 Alternative hypotheses, error of the second kind, power of a
test
The setup of hypothesis testing is asymmetric in the sense that it focuses on
the null hypothesis. However, for a complete analysis, one has to specify the
data distribution when the hypothesis is false. Within the parametric framework, one usually makes the assumption that the unknown data distribution
belongs to some parametric family (IPθ , θ ∈ Θ ⊆ IRp ) . This assumption has
to be fulfilled independently of whether the hypothesis is true or false. In
210
6 Testing a statistical hypothesis
other words, we assume that IP ∈ (IPθ , θ ∈ Θ) and there is a subset Θ0 ⊂ Θ
corresponding to the null hypothesis. The measure IP = IPθ for θ 6∈ Θ0
def
is called an alternative. Furthermore, we call Θ1 = Θ \ Θ0 the alternative
hypothesis.
Now we can consider the performance of a test φ when the hypothesis H0
is false. The decision to retain the hypothesis when it is false is called the error
of the second kind. The probability of such an error is equal to IP (φ = 0) ,
whenever IP is an alternative. This value certainly depends on the alternative
IP = IPθ for θ 6∈ Θ0 . The value β(θ) = 1 − IPθ (φ = 0) is often called the
test power at θ 6∈ Θ0 . The function β(θ) of θ ∈ Θ \ Θ0 given by
def
β(θ) = 1 − IPθ (φ = 0)
is called power function. Ideally one would desire to build a test which simultaneously and separately minimizes the size and maximizes the power. These
two wishes are somehow contradictory. A decrease of the size usually results
in a decrease of the power and vice versa. Usually one imposes the level α
constraint on the size of the test and tries to optimize its power under this
constraint. Under the general framework of statistical decision problems as
discussed in Section 1.4, one can thus regard R(φ, θ) = 1 − β(θ), θ ∈ Θ1 , as
a risk function. If we agree on this risk measure and restrict attention to level
α tests, then the test problem, regarded as a statistical decision problem, is
already completely specified by (Y, B(Y), (IPθ : θ ∈ Θ), Θ0 ) .
Definition 6.1.1. A test φ∗ is called uniformly most powerful (UMP) test
of level α if it is of level α and for any other test of level α , it holds
1 − IPθ (φ∗ = 0) ≥ 1 − IPθ (φ = 0),
θ 6∈ Θ0 .
Unfortunately, such UMP tests exist only in very few special models; otherwise, optimization of the power given the level is a complicated task.
In the case of a univariate parameter θ ∈ Θ ⊂ IR1 and a simple hypothesis
θ = θ∗ , one often considers one-sided alternatives
H1 : θ ≥ θ∗
or
H1 : θ ≤ θ ∗
or a two-sided alternative
H1 : θ 6= θ∗ .
6.2 Neyman-Pearson test for two simple hypotheses
This section discusses one very special case of hypothesis testing when both
the null hypothesis and the alternative are simple one-point sets. This special
6.2 Neyman-Pearson test for two simple hypotheses
211
situation by itself can be viewed as a toy problem, but it is very important
from the methodological point of view. In particular, it introduces and justifies
the so-called likelihood ratio test and demonstrates its efficiency.
For simplicity we write IP0 for the measure corresponding to the null
hypothesis and IP1 for the alternative measure. A test φ is a measurable
function of the observations with values in the two-point set {0, 1} . The
event φ = 0 is treated as acceptance of the null hypothesis H0 while φ = 1
means rejection of the null hypothesis and, consequently, decision in favor of
H1 .
For ease of presentation we assume that the measure IP1 is absolutely
continuous w.r.t. the measure IP0 and denote by Z(Y ) the corresponding
derivative at the observation point:
def
Z(Y ) =
dIP1
(Y ).
dIP0
Similarly L(Y ) means the log-density:
def
L(Y ) = log Z(Y ) = log
dIP1
(Y ).
dIP0
The solution of the test problem in the case of two simple hypotheses is known
as the Neyman-Pearson test: reject the hypothesis H0 if the log-likelihood
ratio L(Y ) exceeds a specific critical value z :
def
φ∗z = 1I L(Y ) > z = 1I Z(Y ) > ez .
The Neyman-Pearson test is known as the one minimizing the weighted sum
of the errors of the first and second kind. For a non-randomized test this sum
is equal to
℘0 IP0 (φ = 1) + ℘1 IP1 (φ = 0),
while the weighted error of a randomized test φ is
℘0 IE0 φ + ℘1 IE1 (1 − φ).
(6.1)
Theorem 6.2.1. For every two positive values ℘0 and ℘1 , the test φ∗z with
z = log(℘0 /℘1 ) minimizes (6.1) over all possible (randomized) tests φ :
def
φ∗z = 1I(L(Y ) ≥ z) = argmin ℘0 IE0 φ + ℘1 IE1 (1 − φ) .
φ
Proof. We use the formula for a change of measure:
IE1 ξ = IE0 ξZ(Y )
212
6 Testing a statistical hypothesis
for any r.v. ξ . It holds for any test φ with z = log(℘0 /℘1 )
℘0 IE0 φ + ℘1 IE1 (1 − φ) = IE0 ℘0 φ − ℘1 Z(Y )φ + ℘1
= −℘1 IE0 [(Z(Y ) − ez )φ] + ℘1
≥ −℘1 IE0 [Z(Y ) − ez ]+ + ℘1
with equality for φ = 1I(L(Y ) ≥ z) .
The Neyman-Pearson test belongs to a large class of tests of the form
φ = 1I(T ≥ z),
where T is a function of the observations Y . This random variable is usually
called a test statistic while the threshold z is called a critical value. The
hypothesis is rejected if the test statistic exceeds the critical value. For the
Neyman-Pearson test, the test statistic is the log-likelihood ratio L(Y ) and
the critical value is selected as a suitable quantile of this test statistic.
The next result shows that the Neyman-Pearson test φ∗z with a proper
critical value z can be constructed to maximize the power IE1 φ under the
level constraint IE0 φ ≤ α .
Theorem 6.2.2. Given α ∈ (0, 1) , let zα be such that
IP0 (L(Y ) ≥ zα ) = α.
(6.2)
Then it holds
def
φ∗zα = 1I L(Y ) ≥ zα = argmax IE1 φ .
φ:IE0 φ≤α
Proof. Let φ satisfy IE0 φ ≤ α . Then
IE1 φ − αzα ≤ IE0 Z(Y )φ − ezα IE0 φ
= IE0 Z(Y ) − ezα φ
≤ IE0 [Z(Y ) − ezα ]+ ,
again with equality for φ = 1I(L(Y ) ≥ zα ) .
The previous result assumes that for a given α there is a critical value zα
such that (6.2) is fulfilled. However, this is not always the case.
Exercise 6.2.1. Let L(Y ) = log dIP1 (Y )/dIP0 .
•
Show that the relation (6.2) can always be fulfilled with a proper choice
of zα if the pdf of L(Y ) under IP0 is a continuous function.
6.2 Neyman-Pearson test for two simple hypotheses
•
213
Suppose that the pdf of L(Y ) is discontinuous and zα fulfills
IP0 (L(Y ) ≥ zα ) > α,
IP0 (L(Y ) ≤ zα ) > 1 − α.
Construct a randomized test that fulfills IE0 φ = α and maximizes the test
power IE1 φ among all such tests.
The Neyman-Pearson test can be viewed as a special case of the general
likelihood ratio test. Indeed, it decides in favor of the null or the alternative
by looking at the likelihood ratio. Informally one can say: we decide in favor
of the alternative if it is significantly more likely at the point of observation
Y .
An interesting question that arises in relation with the Neyman-Pearson
result is how to interpret it when the true distribution IP does not coincide
either with IP0 or with IP1 and probably it is not even within the considered
parametric family (IPθ ) . Wald called this situation the third-kind error. It is
worth mentioning that the test φ∗z remains meaningful: it decides which of
two given measures IP0 and IP1 describes the given data better. However, it
is not any more a likelihood ratio test. In analogy with estimation theory, one
can call it a quasi likelihood ratio test.
6.2.1 Neyman-Pearson test for an i.i.d. sample
Let Y = (Y1 , . . . , Yn )> be an i.i.d. sample from a measure P . Suppose that
P belongs to some parametric family (Pθ , θ ∈ Θ ⊂ IRp ) , that is, P = Pθ∗
for θ ∗ ∈ Θ . Let also a special point θ 0 be fixed. A simple null hypothesis can
be formulated as θ ∗ = θ 0 . Similarly, a simple alternative is θ ∗ = θ 1 for some
other point θ 1 ∈ Θ . The Neyman-Pearson test situation is a bit artificial: one
reduces the whole parameter set Θ to just these two points θ 0 and θ 1 and
tests θ 0 against θ 1 .
As usual, the distribution of the data Y is described by the product
def
measure IPθ = Pθ⊗n . If µ0 is a dominating measure for (Pθ ) and `(y, θ) =
log[dPθ (y)/dµ0 ] , then the log-likelihood L(Y , θ) is
def
L(Y , θ) = log
X
dIPθ
(Y ) =
`(Yi , θ),
µ0
i
where µ0 = µ⊗n
0 . The log-likelihood ratio of IPθ 1 w.r.t. IPθ 0 can be defined
as
def
L(Y , θ 1 , θ 0 ) = L(Y , θ 1 ) − L(Y , θ 0 ).
The related Neyman-Pearson test can be written as
def
φ∗z = 1I L(Y , θ 1 , θ 0 ) > z .
214
6 Testing a statistical hypothesis
6.3 Likelihood ratio test
This section introduces a general likelihood ratio test in the framework of
parametric testing theory. Let, as usual, Y be the observed data, and IP be
their distribution. The parametric assumption is that IP ∈ (IPθ , θ ∈ Θ) , that
is, IP = IPθ∗ for θ ∗ ∈ Θ . Let now two subsets Θ0 and Θ1 of the set Θ be
given. The hypothesis H0 that we would like to test is that IP ∈ (IPθ , θ ∈
Θ0 ) , or equivalently, θ ∗ ∈ Θ0 . The alternative is that θ ∗ ∈ Θ1 .
The general likelihood approach leads to comparing the (maximum) likelihood values L(Y , θ) on the hypothesis and alternative sets. Namely, the
hypothesis is rejected if there is one alternative point θ 1 ∈ Θ1 such that
the value L(Y , θ) exceeds all corresponding values for θ ∈ Θ0 by a certain
amount which is defined by assigning losses or by fixing a significance level.
In other words, rejection takes place if observing the sample Y under alternative IPθ1 is significantly more likely than under any measure IPθ from the
null. Formally this relation can be written as:
sup L(Y , θ) + z < sup L(Y , θ),
θ∈Θ0
θ∈Θ1
where the constant z makes the term ”significantly” explicit. In particular, a
simple hypothesis means that the set Θ0 consists of one single point θ 0 and
the latter relation takes of the form
L(Y , θ 0 ) + z < sup L(Y , θ).
θ∈Θ1
In general, the likelihood ratio (LR) test corresponds to the test statistic
def
T = sup L(Y , θ) − sup L(Y , θ).
θ∈Θ1
(6.3)
θ∈Θ0
The hypothesis is rejected if this test statistic exceeds some critical value z .
Usually this critical value is selected to ensure the level condition:
IP T > zα ≤ α
for a given level α , whenever IP is a measure under the null hypothesis.
We have already seen that the LR test is optimal for testing a simple
hypothesis against a simple alternative. Later we show that this optimality
property can be extended to some more general situations. Now and in the
following Section 6.4 we consider further examples of a LR test.
Example 6.3.1 (Chi-square test for goodness-of-fit). Let the observation space
(which is a subset of IR1 ) be split into non-overlapping subsets A1 , . . . , Ad
and assume one observes indicators 1I(Yi ∈ Aj ) for 1 ≤ i ≤ n and 1 ≤
6.3 Likelihood ratio test
215
R
j ≤ d . Define θj = P0 (Aj ) = Aj P0 (dy) for 1 ≤ j ≤ d . Let counting
Pn
variables Nj , 1 ≤ j ≤ d , be given by Nj = i=1 1I(Yi ∈ Aj ) . The vector
N = (N1 , . . . , Nd ) follows the multinomial distribution with parameters n ,
d , and θ = (θ1 , . . . , θd )> , where we assume n and d as fixed, leading to
dim(Θ) = d . More specifically, it holds
d
n
o
X
Θ = θ = (θ1 , . . . , θd )> ∈ [0, 1]d :
θj = 1 .
j=1
The likelihood statistic for this model with respect to the counting measure
is given by
Z(N, p) = Qd
n!
d
Y
j=1 Nj ! `=1
θ`N` ,
ej = Nj /n, 1 ≤ j ≤ d . Now, consider the point
and the MLE is given by θ
hypothesis θ = p for a fixed given vector p ∈ Θ . We obtain the likelihood
ratio statistic
T = sup log
θ∈Θ
Z(N, θ)
,
Z(N, p)
leading to
T =n
d
X
θej
θej log ,
pj
j=1
In practice, this LR test is often carried out as Pearson’s chi-square test. To
this end, consider the function h : IR → IR , given by h(x) = x log(x/x0 ) for
a fixed real number x0 ∈ (0, 1) . Then, the Taylor expansion of h(x) around
x0 is given by
h(x) = (x − x0 ) +
1
(x − x0 )2 + o[(x − x0 )2 ] as x → x0 .
2x0
e close to p , the use of P θej = P pj = 1 implies
Consequently, for θ
j
j
2T ≈
d
X
(Nj − npj )2
npj
j=1
in probability under the null hypothesis. The statistic Q , given by
216
6 Testing a statistical hypothesis
def
Q =
d
X
(Nj − npj )2
,
npj
j=1
is called Pearson’s chi-square statistic.
6.4 Likelihood ratio tests for parameters of a normal
distribution
For all examples considered in this section, we assume that the data Y in
form of an i.i.d. sample (Y1 , . . . , Yn )> follow the model Yi = θ∗ + εi with
εi ∼ N(0, σ 2 ) for σ 2 known. Equivalently Yi ∼ N(θ∗ , σ 2 ) . The log-likelihood
L(Y , θ) (which we also denote by L(θ) ) reads as
L(θ) = −
n
1 X
n
(Yi − θ)2
log(2πσ 2 ) − 2
2
2σ i=1
(6.4)
and the log-likelihood ratio L(θ, θ0 ) = L(θ) − L(θ0 ) is given by
L(θ, θ0 ) = σ −2 (S − nθ0 )(θ − θ0 ) − n(θ − θ0 )2 /2
(6.5)
def
with S = Y1 + . . . + Yn .
6.4.1 Distributions related to an i.i.d. sample from a normal
distribution
As a preparation for the subsequent sections, we introduce here some important probability distributions which correspond to functions of an i.i.d. sample
Y = (Y1 , . . . , Yn )> from a normal distribution.
Lemma 6.4.1. If Y follows the standard normal distribution on IR , then
Y 2 has the gamma distribution Γ(1/2, 1/2) .
Exercise 6.4.1. Prove Lemma 6.4.1.
Corollary 6.4.1. Let Y = (Y1 , . . . , Yn )> denote an i.i.d. sample from the
standard normal distribution on IR . Then, it holds that
n
X
Yi2 ∼ Γ(1/2, n/2).
i=1
We call Γ(1/2, n/2) the chi-square distribution with n degrees of freedom,
χ2n for short.
6.4 Likelihood ratio tests for parameters of a normal distribution
217
Proof. From Lemma 6.4.1, we have that Y12 ∼ Γ(1/2, 1/2) . Convolution stability of the family of gamma distributions with respect to the second parameter yields the assertion.
Lemma 6.4.2. Let α, r, s > 0 non-negative constants and X, Y indendent
random variables with X ∼ Γ(α, r) and Y ∼ Γ(α, s) . Then S = X + Y and
R = X/(X + Y ) are independent with S ∼ Γ(α, r + s) and R ∼ Beta(r, s) .
Exercise 6.4.2. Prove Lemma 6.4.2.
Theorem 6.4.1. Let X1 , . . . , Xm , Y1 , . . . , Yn i.i.d., with X1 following the
standard normal distribution on IR . Then, the ratio
def
Fm,n = m−1
m
X
Xi2
.
(n−1
i=1
n
X
Yj2 )
j=1
has the following pdf with respect to the Lebesgue measure.
fm,n (x) =
xm/2−1
mm/2 nn/2
1I(0,∞) (x).
B(m/2, n/2) (n + mx)(m+n)/2
The distribution of Fm,n is called Fisher’s F -distribution with m and n
degrees of freedom (Sir R. A. Fisher, 1890-1962).
Exercise 6.4.3. Prove Theorem 6.4.1.
Corollary 6.4.2. Let X, Y1 , . . . , Yn i.i.d., with X following the standard
normal distribution on IR . Then, the statistic
X
T =q
Pn
2
−1
n
j=1 Yj
has the Lebesgue density
n+1
−1
t2 − 2 √
n B(1/2, n/2)
.
t 7→ τn (t) = 1 +
n
The distribution of T is called Student’s t -distribution with n degrees of
freedom.
Proof. According to Theorem 6.4.1, T 2 √
∼ F1,n . Thus, due to the transformation formula for densities, |T | = T 2 has Lebesgue density t 7→
2tf1,n (t2 ), t > 0 . Because of the symmetry of the standard normal density,
also the distribution of T is symmetric around 0 , i. e., T and −T have the
same distribution. Hence, T has the Lebesgue density t 7→ |t|f1,n (t2 ) = τn (t) .
218
6 Testing a statistical hypothesis
Theorem 6.4.2 (Student (1908)).
In the Gaussian product model (IRn , B(IRn ), (N (µ, σ 2 )n )θ=(µ,σ2 )∈Θ ) , where
Θ = IR × (0, ∞) , it holds for all θ ∈ Θ :
Pn
Pn
def
def
(a) The statistics Y n = n−1 j=1 Yj and σ
e2 = (n − 1)−1 i=1 (Yi − Y n )2
are independent.
σ 2 /σ 2 ∼ χ2n−1 .
(b) Y n ∼ N (µ, σ 2 /n) and (n − 1)e
√
def
(c) The statistic Tn = n(Y n − µ)/e
σ is distributed as tn−1 .
6.4.2 Gaussian shift model
Under the measure IPθ0 , the variable S − nθ0 is normal zero-mean
with the
√
variance nσ 2 . This particularly implies that (S − nθ0 )/ nσ 2 is standard
normal under IPθ0 :
1
L √ (S − nθ0 ) IPθ0 = N(0, 1).
σ n
We start with the simplest case of a simple null and simple alternative.
Simple null and simple alternative.
Let the null H0 : θ∗ = θ0 be tested against the alternative H1 : θ∗ = θ1 for
some fixed θ1 6= θ0 . The log-likelihood L(θ1 , θ0 ) is given by (6.5) leading to
the test statistic
T = σ −2 (S − nθ0 )(θ1 − θ0 ) − n(θ1 − θ0 )2 /2 .
The proper critical value z can be selected from the condition of α -level:
IPθ0 (T > zα ) = α . We use that the sum S − nθ0 is under the null normal
zero-mean with variance nσ 2 , and hence, the random variable
√
ξ = (S − nθ0 )/ nσ 2
is under θ0 standard normal: ξ ∼ N(0, 1) . The level condition can be rewritten as
1
√ σ 2 zα + n(θ1 − θ0 )2 /2 = α.
IPθ0 ξ >
|θ1 − θ0 |σ n
As ξ is standard normal under θ0 , the proper zα can be computed as a
quantile of the standard normal law: if zα is defined by IPθ0 (ξ > zα ) = α ,
then
1
√ σ 2 zα + n|θ1 − θ0 |2 /2 = zα
|θ1 − θ0 |σ n
6.4 Likelihood ratio tests for parameters of a normal distribution
219
or
√
zα = σ −2 zα |θ1 − θ0 |σ n − n|θ1 − θ0 |2 /2 .
It is worth noting that this value actually does not depend on θ0 . It only
depends on the difference |θ1 − θ0 | between the null and the alternative. This
is a very important and useful property of the normal family and is called
pivotality. Another way of selecting the critical value z is given by minimizing
the sum of the first and second-kind error probabilities. Theorem 6.2.1 leads
to the choice z = 0 , or equivalently, to the test
φ = 1I{S/n ≷ (θ0 + θ1 )/2},
θ1 ≷ θ 0 ,
= 1I{θe ≷ (θ0 + θ1 )/2},
θ 1 ≷ θ0 .
This test is also called the Fisher discrimination. It naturally appears in classification problems.
Two-sided test.
Now we consider a more general situation when the simple null θ∗ = θ0
is tested against the alternative θ∗ 6= θ0 . Then the LR test compares the
likelihood at θ0 with the maximum likelihood over Θ \ {θ0 } which clearly
coincides with the maximum over the whole parameter set. This leads to the
test statistic:
T = max L(θ, θ0 ) =
θ
n e
|θ − θ0 |2 .
2σ 2
(see Section 2.9), where θe = S/n is the MLE. The LR test rejects the null
if T ≥ z for a critical value z . The value z can be selected from the level
condition:
IPθ0 T > z = IPθ0 nσ −2 |θe − θ0 |2 > 2z = α.
Now we use that nσ −2 |θe − θ0 |2 is χ21 -distributed according to Lemma 6.4.1.
If zα is defined by IP (ξ 2 ≥ 2zα ) = α for standard normal ξ , then the test
φ = 1I(T > zα ) is of level α . Again, this value does not depend on the null
point θ0 , and the LR test is pivotal.
Exercise 6.4.4. Compute the power function of the resulting two-sided test
φ = 1I(T > zα ).
220
6 Testing a statistical hypothesis
One-sided test.
Now we consider the problem of testing the null θ∗ = θ0 against the onesided alternative H1 : θ > θ0 . To apply the LR test we have to compute the
maximum of the log-likelihood ratio L(θ, θ0 ) over the set Θ1 = {θ > θ0 } .
Exercise 6.4.5. Check that
(
sup L(θ, θ0 ) =
θ>θ0
nσ −2 |θe − θ0 |2 /2
0
if θe ≥ θ0 ,
otherwise.
Hint: if θe ≥ θ0 , then the supremum over Θ1 coincides with the global maximum, otherwise it is attained at the edge θ0 .
Now the LR test rejects the null if θe > θ0 and nσ −2 |θe − θ0 |2 > 2z for a
critical value (CV) z . That is,
φ = 1I θe − θ0 > σ
p
2z/n .
√
The CV z can be again chosen by the level condition. As ξ = n(θe√
− θ0 )/σ is
standard normal under IPθ0 , one has to select z such that IP (ξ > 2z) = α ,
leading to
√ φ = 1I θe > θ0 + σz1−α / n ,
where z1−α denotes the (1−α) -quantile of the standard normal distribution.
6.4.3 Testing the mean when the variance is unknown
This section discusses the Gaussian shift model Yi = θ∗ + σ ∗ εi with standard
normal errors εi and unknown variance σ ∗ 2 . The log-likelihood function is
still given by (6.4) but now σ ∗ 2 is a part of the parameter vector.
Two-sided test problem.
Here, we are considered with the two-sided testing problem H0 : θ∗ = θ0
against H1 : θ1 6= θ0 . Notice that the null hypothesis is composite, because
it involves the unknown variance σ ∗ 2 .
Maximizing the log-likelihood L(θ, σ 2 ) under the null leads to the value
L(θ0 , σ
e02 ) with
def
σ
e02 = argmax L(θ0 , σ 2 ) = n−1
σ2
X
(Yi − θ0 )2 .
i
As in Section 2.9.2 for the problem of variance estimation, it holds for any σ
6.4 Likelihood ratio tests for parameters of a normal distribution
221
L(θ0 , σ
e02 ) − L(θ0 , σ 2 ) = nK(e
σ02 , σ 2 ).
At the same time, maximizing L(θ, σ 2 ) over the alternative is equivalent to
eσ
the global maximization leading to the value L(θ,
e2 ) with
θe = S/n,
σ
e2 =
1X
e 2.
(Yi − θ)
n i
The LR test statistic reads as
eσ
T = L(θ,
e2 ) − L(θ0 , σ
e02 ).
This expression can be decomposed in the following way:
eσ
T = L(θ,
e2 ) − L(θ0 , σ
e2 ) + L(θ0 , σ
e2 ) − L(θ0 , σ
e02 )
=
1 e
(θ − θ0 )2 − nK(e
σ02 , σ
e2 ).
2e
σ2
In order to derive the CV z , notice that
exp(T ) =
eσ
exp(L(θ,
e2 ))
=
exp(L(θ0 , σ
e02 ))
σ
e02
σ
e2
n/2
.
Consequently, the LR test rejects for large values of
n(θe − θ0 )2
σ
e02
= 1 + Pn
.
2
e2
σ
e
i=1 (Yi − θ)
In view of Theorems 6.4.1 and 6.4.2, z is therefore a deterministic transformation of the suitable quantile from Fisher’s F -distribution with 1 and n − 1
degrees of freedom or from Student’s t -distribution with n − 1 degrees of
freedom.
Exercise 6.4.6. Derive the explicit form of zα for given significance level α
in terms of Fisher’s F -distribution and in terms of Student’s t -distribution.
Often one considers the case in which the variance is only estimated under the
alternative, that is, σ
e is used in place of σ
e0 . This is quite natural because
the null can be viewed as a particular case of the alternative. This leads to
the test statistic
n e
eσ
T ∗ = L(θ,
e2 ) − L(θ0 , σ
e2 ) =
(θ − θ0 )2 .
2e
σ2
Since T ∗ is an isotone transformation of T , both tests are equivalent.
222
6 Testing a statistical hypothesis
One-sided test problem.
In analogy to the considerations in Section 6.4.2, the LR test for the one-sided
problem H0 : θ∗ = θ0 against H1 : θ∗ > θ0 rejects if θe > θ0 and T exceeds
a suitable critical value z .
Exercise 6.4.7. Derive the explicit form of the LR test for the one-sided test
problem. Compute zα for given significance level α in terms of Student’s
t -distribution.
6.4.4 Testing the variance
In this section, we consider the LR test for the hypothesis H0 : σ 2 = σ02
against H1 : σ 2 > σ02 or H0 : σ 2 = σ02 against H1 : σ 2 6= σ02 , respectively. In
this, we assume that θ∗ is known. The case of unknown θ∗ can be treated
similarly, cf. Exercise 6.4.10 below. As discussed before, maximization of the
likelihood under the constraint of known mean yields
def
σ
e∗2 = argmax L(θ∗ , σ 2 ) = n−1
σ2
X
(Yi − θ∗ )2 .
i
The LR test for H0 against H1 rejects the null hypothesis if
T = L(θ∗ , σ
e∗2 ) − L(θ∗ , σ02 )
exceeds a critical value z . For determining the rejection regions, notice that
2T = n(Q − log(Q) − 1),
def
Q = σ
e∗2 /σ02 .
(6.6)
Exercise 6.4.8. (a) Verify representation (6.6).
(b) Show that x 7→ x−log(x)−1 is a convex function on (0, ∞) with minimum
value 0 at x = 1 .
Combining Exercise 6.4.8 and Corollary 6.4.1, we conclude that the critical
value z for the LR test for the one-sided test problem is a deterministic
transformation of a suitable quantile of the χ2n -distribution. For the two-sided
test problem, the acceptance region of the LR test is an interval for Q which
is bounded by deterministic transformations of lower and upper quantiles of
the χ2n -distribution.
Exercise 6.4.9. Derive the rejection regions of the LR tests for the one- and
the two-sided test problems explicitly.
Exercise 6.4.10. Derive one- and two-sided LR tests for σ 2 in the case of
unknown θ∗ . Hint: Use Theorem 6.4.2.(b).
6.5 LR-tests. Further examples
223
6.5 LR-tests. Further examples
We return to the models investigated in Section 2.9 and derive LR tests for
the respective parameters.
6.5.1 Bernoulli or binomial model
Assume
that (Yi , . . . , Yn ) are i.i.d. with Y1 ∼ Bernoulli(θ∗ ) . Letting Sn =
P
Yi , the log-likelihood is given by
L(θ) =
X
Yi log θ + (1 − Yi ) log(1 − θ)
= Sn log
θ
+ n log(1 − θ).
1−θ
Simple null versus simple alternative.
The LR statistic for testing the simple null hypothesis H0 : θ∗ = θ0 against
the simple alternative H1 : θ∗ = θ1 reads as
T = L(θ1 ) − L(θ0 ) = Sn log
1 − θ1
θ1 (1 − θ0 )
+ n log
.
θ0 (1 − θ1 )
1 − θ0
For a fixed significance level α , the resulting LR test rejects if
θe ≷
zα
1 − θ1
− log
n
1 − θ0
.
log
θ1 (1 − θ0 )
,
θ0 (1 − θ1 )
θ1 ≷ θ0 ,
where θe = Sn /n . For the determination of zα , it is convenient to notice that
for any pair (θ0 , θ1 ) ∈ (0, 1)2 , the function
x 7→ x log
θ1 (1 − θ0 )
1 − θ1
+ log
θ0 (1 − θ1 )
1 − θ0
is increasing (decreasing) in x ∈ [0, 1] if θ1 > θ0 ( θ1 < θ0 ). Hence, the LR
statistic T is an isotone (antitone) transformation of θe if θ1 > θ0 ( θ1 < θ0 ).
Since Sn = nθe is under H0 binomially distributed with parameters n and
θ0 , the LR test φ is given by
(
φ=
−1
1I{Sn > FBin(n,θ
(1 − α)},
0)
1I{Sn <
−1
FBin(n,θ
(α)},
0)
θ1 > θ0 ,
θ1 < θ0 .
(6.7)
224
6 Testing a statistical hypothesis
Composite alternatives.
Obviously, the LR test in (6.7) depends on the value of θ1 only via the sign
of θ1 − θ0 . Therefore, the LR test for the one-sided test problem H0 : θ∗ = θ0
against H1 : θ∗ > θ0 rejects if
−1
θe > FBin(n,θ
(1 − α)
0)
and the LR test for H0 against H1 : θ∗ < θ0 rejects if
−1
θe < FBin(n,θ
(α).
0)
The LR test for the two-sided test problem H0 against H1 : θ∗ 6= θ0 rejects
if
−1
−1
θe 6∈ [FBin(n,θ
(α/2), FBin(n,θ
(1 − α/2)].
0)
0)
6.5.2 Uniform distribution on [0, θ]
Consider again the model from Section 2.9.4, i. e., Y1 , . . . , Yn are i.i.d. with
Y1 ∼ UNI[0, θ∗ ] , where the upper endpoint θ∗ of the support is unknown. It
holds that
Z(θ) = θ−n 1I(max Yi ≤ θ)
i
and that the maximum of Z(θ) over (0, ∞) is obtained for θe = maxi Yi . Let
us consider the two-sided test problem H0 : θ∗ = θ0 against H1 : θ∗ 6= θ0 for
some given value θ0 > 0 . We get that
e
Z(θ)
=
exp(T ) =
Z(θ0 )
(
n
θ0 /θe , maxi Yi ≤ θ0 ,
∞,
maxi Yi > θ0 .
It follows that exp(T ) > z if θe > θ0 or θe < θ0 z−1/n . We compute the critical
value zα for a level α test by noticing that
n
IPθ0 ({θe > θ0 } ∪ {θe < θ0 z−1/n }) = IPθ0 (θe < θ0 z−1/n ) = FY1 (θ0 z−1/n ) = 1/z.
Thus, the LR test at level α for the two-sided problem H0 against H1 is
given by
φ = 1I{θe > θ0 } + 1I{θe < θ0 α1/n }.
6.5 LR-tests. Further examples
225
6.5.3 Exponential model
We return to the model considered in Section 2.9.7 and assume that Y1 , . . . , Yn
are i.i.d. exponential random variables with parameter θ∗ > 0 . The corresponding log-likelihood can be written as
L(θ) = −n log θ −
n
X
Yi /θ = −S/θ − n log θ,
i=1
where S = Y1 + . . . + Yn .
In order to derive the LR test for the simple hypothesis H0 : θ∗ = θ0
against the simple alternative H0 : θ∗ = θ1 , notice that the LR statistic T
is given by
θ0
θ1 − θ0
+ n log .
T = L(θ1 ) − L(θ0 ) = S
θ1 θ0
θ1
Since the function
x 7→ x
θ1 − θ0
θ1 θ0
+ n log(θ0 /θ1 )
is increasing (decreasing) in x > 0 whenever θ1 > θ0 ( θ1 < θ0 ), the LR test
rejects for large values of S in the case that θ1 > θ0 and for small values of S
if θ1 < θ0 . Due to the facts that the exponential distribution with parameter
θ0 is identical to Gamma (θ0 , 1) and the family of gamma distributions is
convolution-stable with respect to its second parameter whenever the first parameter is fixed, we obtain that S is under θ0 distributed as Gamma (θ0 , n) .
This implies that the LR test φ for H0 against H1 at significance level α
is given by
(
−1
1I{S > FGamma(θ
(1 − α)}, θ1 > θ0 ,
0 ,n)
φ=
−1
1I{S < FGamma(θ0 ,n) (α)},
θ1 < θ0 .
Moreover, composite alternatives can be tested in analogy to the considerations in Section 6.5.1.
6.5.4 Poisson model
Let Y1 , . . . , Yn be i.i.d. Poisson random variables satisfying IP (Yi = m) =
∗
|θ∗ |m e−θ /m! for m = 0, 1, 2, . . . . According to Section 2.9.8, we have that
L(θ) = S log θ − nθ + R,
L(θ1 ) − L(θ0 ) = S log
θ1
+ n(θ0 − θ1 ),
θ0
226
6 Testing a statistical hypothesis
where the remainder term R does not depend on θ . In order to derive the
LR test for the simple hypothesis H0 : θ∗ = θ0 against the simple alternative
H0 : θ∗ = θ1 , we again check easily that x 7→ x log(θ1 /θ0 ) + n(θ0 − θ1 ) is
increasing (decreasing) in x > 0 if θ1 > θ0 ( θ1 < θ0 ). Convolution stability of
the family of Poisson distributions entails that the LR test φ for H0 against
H1 at significance level α is given by
φ=
(
−1
1I{S > FPoisson(nθ
(1 − α)},
0)
1I{S <
−1
FPoisson(nθ
(α)},
0)
θ1 > θ0 ,
θ1 < θ0 .
Moreover, composite alternatives can be tested in analogy to Section 6.5.1.
6.6 Testing problem for a univariate exponential family
Let (Pθ , θ ∈ Θ ⊆ IR1 ) be a univariate exponential family. The choice of
parametrization is unimportant, any of parametrization can be taken. To be
specific, we assume the natural parametrization that simplifies the expression
for the maximum likelihood estimate.
We assume that the two functions C(θ) and B(θ) of θ are fixed, with
which the log-density of Pθ can be written in the form:
def
`(y, θ) = log p(y, θ) = yC(θ) − B(θ) − `(y)
for some other function `(y) . The function C(θ) is monotonic in θ and C(θ)
and B(θ) are related (for the case of an EFn) by the identity B 0 (θ) = θC 0 (θ) ,
see Section 2.11.
Let now Y = (Y1 , . . . , Yn ) be an i.i.d. sample from Pθ∗ for θ∗ ∈ Θ . The
task is to test a simple hypothesis θ∗ = θ0 against an alternative θ∗ ∈ Θ1
for some subset Θ1 that does not contain θ0 .
6.6.1 Two-sided alternative
We start with the case of a simple hypothesis H0 : θ∗ = θ0 against a full
two-sided alternative H1 : θ∗ 6= θ0 . The likelihood ratio approach suggests to
compare the likelihood at θ0 with the maximum of the likelihood over the alternative, which effectively means the maximum over the whole parameter set
Θ . In the case of a univariate exponential family, this maximum is computed
in Section 2.11. For
def
L(θ, θ0 ) = L(θ) − L(θ0 ) = S C(θ) − C(θ0 ) − n B(θ) − B(θ0 )
with S = Y1 + . . . + Yn , it holds
6.6 Testing problem for a univariate exponential family
227
def
e θ0 ),
T = sup L(θ, θ0 ) = nK(θ,
θ
where K(θ, θ0 ) = Eθ `(θ, θ0 ) is the Kullback–Leibler divergence between the
measures Pθ and Pθ0 . For an EFn, the MLE θe is the empirical mean of the
observations Yi , θe = S/n , and the KL divergence K(θ, θ0 ) is of the form
K(θ, θ0 ) = θ C(θ) − C(θ0 ) − B(θ) − B(θ0 ) .
Therefore, the test statistic T is a function of the empirical mean θe = S/n :
e θ0 ) = nθe C(θ)
e − C(θ0 ) − n B(θ)
e − B(θ0 ) .
T = nK(θ,
(6.8)
The LR test rejects H0 if the test statistic T exceeds a critical value z . Given
α ∈ (0, 1) , a proper CV zα can be specified by the level condition
IPθ0 (T > zα ) = α.
e θ0 ) between
In view of (6.8), the LR test rejects the null if the “distance” K(θ,
e
the estimate θ and the null θ0 is significantly larger than zero. In the case
of an exponential family, one can simplify the test just by considering the
estimate θe as test statistic. We use the following technical result for the KL
divergence K(θ, θ0 ) :
Lemma 6.6.1. Let (Pθ ) be an EFn. Then for every z there are two positive
values t− (z) and t+ (z) such that
{θ : K(θ, θ0 ) ≤ z} = {θ : θ0 − t− (z) ≤ θ ≤ θ0 + t+ (z)}.
(6.9)
In other words, the conditions K(θ, θ0 ) ≤ z and θ0 − t− (z) ≤ θ ≤ θ0 + t+ (z)
are equivalent.
Proof. The function K(θ, θ0 ) of the first argument θ fulfills
∂K(θ, θ0 )
= C(θ) − C(θ0 ),
∂θ
∂ 2 K(θ, θ0 )
= C 0 (θ) > 0.
∂θ2
Therefore, it is convex in θ with minimum at θ0 , and it can cross the level z
only once from the left of θ0 and once from the right. This yields that for any
z > 0 , there are two positive values t− (z) and t+ (z) such that (6.9) holds.
Note that one or even both of these values can be infinite.
Due to the result of this lemma, the LR test can be rewritten as
φ = 1I T > z = 1 − 1I T ≤ z
= 1 − 1I −t− (z) ≤ θe − θ0 ≤ t+ (z)
= 1I θe > θ0 + t+ (z) + 1I θe < θ0 − t− (z) ,
228
6 Testing a statistical hypothesis
that is, the test rejects the null hypothesis if the estimate θe deviates significantly from θ0 .
6.6.2 One-sided alternative
Now we consider the problem of testing the same null H0 : θ∗ = θ0 against the
one-sided alternative H1 : θ∗ > θ0 . Of course, the other one-sided alternative
H1 : θ∗ < θ0 can be considered analogously.
The LR test requires computing the maximum of the log-likelihood over
the alternative set {θ : θ > θ0 } . This can be done as in the Gaussian shift
model. If θe > θ0 then this maximum coincides with the global maximum over
all θ . Otherwise, it is attained at θ = θ0 .
Lemma 6.6.2. Let (Pθ ) be an EFn. Then
(
sup L(θ, θ0 ) =
θ>θ0
e θ0 )
nK(θ,
0
if θe ≥ θ0 ,
otherwise.
Proof. It is only necessary to consider the case θe < θ0 . The difference L(θ) −
L(θ0 ) can be represented as S[C(θ) − C(θ0 )] − n[B(θ) − B(θ0 )] . Next, usage
of the identities θe = S/n and B 0 (θ) = θC 0 (θ) yields
e θ)
e θ)
∂L(θ,
∂K(θ,
∂L(θ, θ0 )
=−
= −n
= n(θe − θ)C 0 (θ) < 0
∂θ
∂θ
∂θ
for any θ > θe . This implies that L(θ) − L(θ0 ) becomes negative as θ grows
beyond θ0 and thus, L(θ, θ0 ) has its supremum over {θ : θ > θ0 } at θ = θ0 ,
yielding the assertion.
This fact implies the following representation of the LR test in the case of
a one-sided alternative.
Theorem 6.6.1. Let (Pθ ) be an EFn. Then the α -level LR test for the null
H0 : θ∗ = θ0 against the one-sided alternative H1 : θ∗ > θ0 is
φ = 1I(θe > θ0 + tα ),
(6.10)
where tα is selected to ensure IPθ0 θe > θ0 + tα = α .
Proof. Let T be the LR test statistic. Due to Lemmas 6.6.2 and 6.6.1, the
event {T ≥ z} can be rewritten as {θe > θ0 + t(z)} for some constant t(z) . It
remains to select a proper value t(z) = tα to fulfill the level condition.
This result can be extended naturally to the case of a composite null
hypothesis H0 : θ∗ ≤ θ0 .
6.6 Testing problem for a univariate exponential family
229
Theorem 6.6.2. Let (Pθ ) be an EFn. Then the α -level LR test for the composite null H0 : θ∗ ≤ θ0 against the one-sided alternative H1 : θ∗ > θ0
is
φ∗α = 1I(θe > θ0 + tα ),
(6.11)
where tα is selected to ensure IPθ0 θe > θ0 + tα = α .
Proof. The same arguments as in the proof of Theorem 6.6.1 lead to exactly
the same LR test statistic T and thus to the test of the form (6.10). In
particular, the estimate θe should significantly deviate from the null set. It
remains to check that the level condition for the edge point θ0 ensures the
level for all θ < θ0 . This follows from the next monotonicity property.
Lemma 6.6.3. Let (Pθ ) be an EFn. Then for any t ≥ 0
IPθ (θe > θ0 + t) ≤ IPθ0 (θe > θ0 + t),
∀θ < θ0 .
Proof. Let θ < θ0 . We apply
h
i
IPθ (θe > θ0 + t) = IEθ0 exp L(θ, θ0 ) 1I(θe > θ0 + t) .
Now the monotonicity of L(θ, θ0 ) w.r.t. θ (see the proof of Lemma 6.6.2) implies
e . This yields the result.
L(θ, θ0 ) < 0 on the set {θ < θ0 < θ}
Therefore, if the level is controlled under IPθ0 , it is well checked for all other
points in the null set.
A very nice feature of the LR test is that it can be universally represented
in terms of θe independently of the form of the alternative set. In particular, for
the case of a one-sided alternative, this test just compares the estimate θe with
the value θ0 + tα . Moreover, the value tα only depends on the distribution of
θe under IPθ0 via the level condition. This and the monotonicity of the error
probability from Lemma 6.6.3 allow us to state the nice optimality property
of this test: φ∗α is uniformly most powerful in the sense of Definition 6.1.1,
that is, it maximizes the test power under the level constraint.
Theorem 6.6.3. Let (Pθ ) be an EFn, and let φ∗α be the test from (6.11) for
testing H0 : θ∗ ≤ θ0 against H1 : θ∗ > θ0 . For any (randomized) test φ
satisfying IEθ0 ≤ α and any θ > θ0 , it holds
IEθ φ ≤ IPθ (φ∗α = 1).
In fact, this theorem repeats the Neyman-Pearson result of Theorem 6.2.2,
because the test φ∗α is at the same time the LR α -level test of the simple
hypothesis θ∗ = θ0 against θ∗ = θ1 , for any value θ1 > θ0 .
230
6 Testing a statistical hypothesis
6.6.3 Interval hypothesis
In some applications, the null hypothesis is naturally formulated in the form
that the parameter θ∗ belongs to a given interval [θ0 , θ1 ] . The alternative
H1 : θ∗ ∈ Θ \ [θ0 , θ1 ] is the complement of this interval. The likelihood ratio
test is based on the test statistic T from (6.3) which compares the maximum
of the log-likelihood L(θ) under the null [θ0 , θ1 ] with the maximum over the
alternative set. The special structure of the log-likelihood in the case of an
EFn permits representing this test statistics in terms of the estimate θe : the
hypothesis is rejected if the estimate θe significantly deviates from the interval
[θ0 , θ1 ] .
Theorem 6.6.4. Let (Pθ ) be an EFn. Then the α -level LR test for the null
H0 : θ ∈ [θ0 , θ1 ] against the alternative H1 : θ 6∈ [θ0 , θ1 ] can be written as
−
e
φ = 1I(θe > θ1 + t+
α ) + 1I(θ < θ0 − tα ),
(6.12)
−
e
where the non-negative constants t+
α and tα are selected to ensure IPθ0 (θ <
+
e
θ0 − t−
α ) + IPθ1 (θ > θ1 + tα ) = α .
Exercise 6.6.1. Prove the result of Theorem 6.6.4.
Hint: Consider three cases: θe ∈ [θ0 , θ1 ] , θe > θ1 , and θe < θ0 . For every case,
e θ) in θ .
apply the monotonicity of L(θ,
One can consider the alternative of the interval hypothesis as a combination of two one-sided alternatives. The LR test φ from (6.12) involves only
+
one critical value z and the parameters t−
α and tα are related via the structure of this test: they are obtained by transforming the inequality T > zα into
−
e
θe > θ1 + t+
α and θ < θ0 − tα . However, one can just apply two one-sided tests
separately: one for the alternative H1− : θ∗ < θ0 and one for H1+ : θ∗ > θ1 .
This leads to the two tests
def
def
φ− = 1I θe < θ0 − t− ,
φ+ = 1I θe > θ1 + t+ .
(6.13)
The values t− , t+ can be chosen by the so-called Bonferroni rule: just perform
each of the two tests at level α/2 .
+
Exercise 6.6.2. For fixed α ∈ (0, 1) , let the values t−
α , tα be selected to
ensure
IPθ0 θe < θ0 − t−
IPθ1 θe > θ1 + t+
α = α/2,
α = α/2.
Then for any θ ∈ [θ0 , θ1 ] , the test φ , given by φ = φ− + φ+ (cf. (6.13))
fulfills
IPθ (φ = 1) ≤ α.
Hint: Use the monotonicity from Lemma 6.6.3.
6.7 Historical remarks and further reading
231
6.7 Historical remarks and further reading
The theory of optimal tests goes back to Jerzy Neyman (1894 - 1981) and
Egon Sharpe Pearson (1895 - 1980). Some early considerations with respect
to likelihood ratio tests can be found in Neyman and Pearson (1928). Neyman
and Pearson (1933)’s fundamental lemma is core to the derivation of most
powerful (likelihood ratio) tests. Fisher (1934) showed that uniformly best
tests (over the parameter subspace defining a composite alternative) only
exist in one-parametric exponential families. More details about the origins of
the theory of optimal tests can be found in the book by Lehmann (2011).
Important contributions to the theory of tests for parameters of a normal
distribution have moreover been made by William Sealy Gosset (1876-1937;
under the pen name ”Student”, see Student (1908)) and Ernst Abbe (1840 1905; cf. Kendall (1971) for an account of Abbe’s work).
The χ2 -test for goodness-of-fit is due to Pearson (1900). The phenomenon,
that the limiting distribution of twice the log-likelihood ratio statistic in nested
models is under regularity conditions a chi-square distribution, has been discovered by Wilks (1938).
An excellent textbook reference for the theory of testing statistical hypotheses is Lehmann and Romano (2005).
7
Testing in linear models
This chapter discusses testing problems for linear Gaussian models given by
the equation
Y = f∗ + ε
(7.1)
with the vector of observations Y , response vector f ∗ , and vector of errors
ε in IRn . The linear parametric assumption (linear PA) means that
Y = Ψ> θ ∗ + ε,
ε ∼ N(0, Σ),
(7.2)
where Ψ is the p × n design matrix. By θ we denote the p -dimensional
target parameter vector, θ ∈ Θ ⊆ IRp . Usually we assume that the parameter
set coincides with the whole space IRp , i.e. Θ = IRp . The most general
assumption about the vector of errors ε = (ε1 , . . . , εn )> is Var(ε) = Σ , which
permits for inhomogeneous and correlated errors. However, for most results
we assume i.i.d. errors εi ∼ N(0, σ 2 ) . The variance σ 2 could be unknown
as well. As in previous chapters, θ ∗ denotes the true value of the parameter
vector (assumed that the model (7.2) is correct).
7.1 Likelihood ratio test for a simple null
This section discusses the problem of testing a simple hypothesis H0 : θ ∗ = θ 0
for a given vector θ 0 . A natural “non-informative” alternative is H1 : θ ∗ 6=
θ0 .
7.1.1 General errors
We start from the case of general errors with known covariance matrix Σ .
The results obtained for the estimation problem in Chapter 4 will be heavily
e of θ ∗ is
used in our study. In particular, the MLE θ
234
7 Testing in linear models
e = ΨΣ−1 Ψ>
θ
−1
ΨΣ−1 Y
and the corresponding maximum likelihood is
e θ0 ) =
L(θ,
>
1 e
e − θ0
θ − θ0 B θ
2
with a p × p - matrix B given by
B = ΨΣ−1 Ψ> .
This immediately leads to the following representation for the likelihood ratio
(LR) test in this set-up:
def
T = sup L(θ, θ 0 ) =
θ
>
1 e
e − θ0 .
θ − θ0 B θ
2
(7.3)
Moreover, Wilks’ phenomenon claims that under IPθ0 , the test statistic T
has a fixed distribution: namely, 2T is χ2p -distributed (chi-squared with p
degrees of freedom).
Theorem 7.1.1. Consider the model (7.2) with ε ∼ N(0, Σ) for a known
matrix Σ . Then the LR test statistic T is given by (7.3). Moreover, if zα
fulfills IP (ζp > 2zα ) = α with ζp ∼ χ2p , then the LR test φ with
φ := 1( T > z_α )   (7.4)
is of exact level α :
IPθ0 (φ = 1) = α.
This result follows directly from Theorems 4.6.1 and 4.6.2. We see again
the important pivotal property of the test: the critical value zα only depends
on the dimension p of the parameter space Θ. It does not depend on the design matrix Ψ, the error covariance Σ, or the null value θ0.
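As a numerical illustration (not part of the original text), the following sketch simulates model (7.2) with a known, inhomogeneous error covariance and evaluates the LR test of Theorem 7.1.1. The design, the covariance and all parameter values are arbitrary choices, and numpy/scipy are assumed to be available.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p, alpha = 200, 3, 0.05

# Simulated design (rows of Psi span the p-dimensional subspace) and a known,
# inhomogeneous (diagonal) error covariance Sigma
Psi = rng.standard_normal((p, n))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))
theta0 = np.zeros(p)                          # simple null H0: theta* = theta0

# Data generated under the null
eps = rng.multivariate_normal(np.zeros(n), Sigma)
Y = Psi.T @ theta0 + eps

# MLE and LR statistic T = (theta_tilde - theta0)' B (theta_tilde - theta0) / 2
Sigma_inv = np.linalg.inv(Sigma)
B = Psi @ Sigma_inv @ Psi.T
theta_tilde = np.linalg.solve(B, Psi @ Sigma_inv @ Y)
T = 0.5 * (theta_tilde - theta0) @ B @ (theta_tilde - theta0)

# Wilks: 2T ~ chi^2_p under the null; reject when 2T exceeds the chi^2_p quantile
crit = chi2.ppf(1 - alpha, df=p)              # equals 2 * z_alpha
print(f"2T = {2 * T:.2f}, critical value = {crit:.2f}, reject = {2 * T > crit}")
```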
7.1.2 I.i.d. errors, known variance
We now specify this result for the case of i.i.d. errors. We also focus on the
residuals
ε̃ := Y − Ψ^⊤θ̃ = Y − f̃,
where f̃ = Ψ^⊤θ̃ is the estimated response of the true regression function f∗ = Ψ^⊤θ∗. We start with some geometric properties of the residuals ε̃ and the test statistic T from (7.3).
Theorem 7.1.2. Consider the model (7.1). Let T be the LR test statistic
built under the assumptions f ∗ = Ψ> θ ∗ and Var(ε) = σ 2 In with a known
value σ². Then T is given by
T = (1/(2σ²)) ‖Ψ^⊤(θ̃ − θ0)‖² = (1/(2σ²)) ‖f̃ − f0‖².   (7.5)
Moreover, the following decompositions for the vector of observations Y and for the errors ε = Y − f0 hold:
Y − f0 = (f̃ − f0) + ε̃,   (7.6)
‖Y − f0‖² = ‖f̃ − f0‖² + ‖ε̃‖²,   (7.7)
where f̃ − f0 is the estimation error and ε̃ = Y − f̃ is the vector of residuals.
Proof. The key step of the proof is the representation of the estimated response f̃ under the model assumption Y = f∗ + ε as a projection of the data on the p-dimensional linear subspace L in IRⁿ spanned by the rows of the matrix Ψ:
f̃ = ΠY = Π(f∗ + ε),
where Π = Ψ^⊤(ΨΨ^⊤)^{-1}Ψ is the projector onto L; see Section 4.3. Note that this decomposition is valid for the general linear model; the parametric form of the response f and the normality of the noise are not required. The identity Ψ^⊤(θ̃ − θ0) = f̃ − f0 follows directly from the definition, implying the representation (7.5) for the test statistic T. The identity (7.6) follows from the definition. Next, Πf0 = f0 and thus f̃ − f0 = Π(Y − f0). Similarly,
ε̃ = Y − f̃ = (I_n − Π)Y.
As Π and I_n − Π are orthogonal projectors, it follows that
‖Y − f0‖² = ‖(I_n − Π)Y + Π(Y − f0)‖² = ‖(I_n − Π)Y‖² + ‖Π(Y − f0)‖²
and the decomposition (7.7) follows.
The decomposition (7.6), although straightforward, is very important for
understanding the structure of the residuals under the null and under the
alternative. Under the null H0 , the response f ∗ is assumed to be known and
coincides with f0, so the residuals ε̃ coincide with the errors ε. The sum of squared residuals is usually abbreviated as RSS; under the null,
RSS₀ := ‖Y − f0‖².
Under the alternative, the response is unknown and is estimated by f̃. The residuals are ε̃ = Y − f̃, resulting in the RSS
RSS := ‖Y − f̃‖².
The decomposition (7.7) can be rewritten as
RSS₀ = RSS + ‖f̃ − f0‖².   (7.8)
We see that the RSS under the null and under the alternative can differ essentially only if the estimate f̃ significantly deviates from the null assumption f∗ = f0. The test statistic T from (7.3) can be written as
T = (RSS₀ − RSS)/(2σ²).
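The decomposition (7.8) and the representation T = (RSS₀ − RSS)/(2σ²) can be checked numerically. The sketch below, with a simulated design and data generated under the null, is only an illustration under these assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 4, 1.5
Psi = rng.standard_normal((p, n))
theta0 = np.zeros(p)
f0 = Psi.T @ theta0
Y = f0 + sigma * rng.standard_normal(n)       # data generated under the null

# Projection onto the subspace spanned by the rows of Psi
Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)
f_tilde = Pi @ Y

RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - f_tilde) ** 2)
T = (RSS0 - RSS) / (2 * sigma ** 2)

# Decomposition (7.8): RSS0 = RSS + ||f_tilde - f0||^2
print(np.isclose(RSS0, RSS + np.sum((f_tilde - f0) ** 2)))   # True
print(f"T = {T:.3f}")
```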
For the proofs in the remainder of this chapter, the following results concerning the distribution of quadratic forms of Gaussians are helpful.
Theorem 7.1.3. 1. Let X ∼ Nn (µ, Σ) , where Σ is symmetric and positive
definite. Then, (X − µ)> Σ−1 (X − µ) ∼ χ2n .
2. Let X ∼ Nn (0, In ) , R a symmetric, idempotent (n × n) -matrix with
rank (R) = r and B a (p × n) -matrix with p ≤ n . Then it holds
(a) X > RX ∼ χ2r
(b) From BR = 0 , it follows that X > RX is independent of BX .
3. Let X ∼ Nn (0, In ) and R , S symmetric and idempotent (n × n) matrices with rank (R) = r , rank (S) = s and RS = 0 . Then it holds
(a) X > RX and X > SX are independent.
(b) ( X^⊤RX/r ) / ( X^⊤SX/s ) ∼ F_{r,s}.
Proof. For proving the first assertion, let Σ^{1/2} denote the uniquely defined, symmetric and positive definite matrix fulfilling Σ^{1/2} · Σ^{1/2} = Σ, with inverse matrix Σ^{-1/2}. Then Z := Σ^{-1/2}(X − µ) ∼ N_n(0, I_n). The assertion Z^⊤Z ∼ χ²_n follows from the definition of the chi-squared distribution.
For the second part, notice that, due to symmetry and idempotence of R, there exists an orthogonal matrix P with R = P D_r P^⊤, where D_r = diag(I_r, 0). Orthogonality of P implies W := P^⊤X ∼ N_n(0, I_n). We conclude
X^⊤RX = X^⊤R²X = (RX)^⊤(RX) = (P D_r W)^⊤(P D_r W) = W^⊤ D_r P^⊤ P D_r W = W^⊤ D_r W = Σ_{i=1}^r W_i² ∼ χ²_r.
Furthermore, we have Z1 := BX ∼ N_p(0, BB^⊤), Z2 := RX ∼ N_n(0, R), and Cov(Z1, Z2) = Cov(BX, RX) = B Cov(X) R^⊤ = BR = 0. As uncorrelatedness implies independence under Gaussianity, statement 2 follows.
To prove the third part, let Z1 := SX ∼ N_n(0, S) and Z2 := RX ∼ N_n(0, R) and notice that Cov(Z1, Z2) = S Cov(X) R = SR = S^⊤R^⊤ = (RS)^⊤ = 0. Assertion (a) is then implied by the identities X^⊤SX = Z1^⊤Z1 and X^⊤RX = Z2^⊤Z2. Assertion (b) immediately follows from 3.(a), 2.(a), and the definition of the Fisher distribution.
Exercise 7.1.1. Prove Theorem 6.4.2.
Hint: Use Theorem 7.1.3.
Now we show that if the model assumptions are correct, the LR test based
on T from (7.3) has the exact level α and is pivotal.
Theorem 7.1.4. Consider the model (7.1) with ε ∼ N(0, σ 2 In ) for a known
value σ 2 , implying that the εi are i.i.d. normal. The LR test φ from (7.4)
is of exact level α. Moreover, f̃ − f0 and ε̃ are under IP_{θ0} zero-mean independent Gaussian vectors satisfying
2T = σ^{-2} ‖f̃ − f0‖² ∼ χ²_p,   σ^{-2} ‖ε̃‖² ∼ χ²_{n−p}.   (7.9)
Proof. The null assumption f∗ = f0 together with Πf0 = f0 implies the following decomposition:
f̃ − f0 = Πε,   ε̃ = ε − Πε = (I_n − Π)ε.
Next, Π and I_n − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors Πε and (I_n − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.9) for the distribution of Πε was proved in Theorem 4.6.1. For the distribution of ε̃ = (I_n − Π)ε, we can apply Theorem 7.1.3.
Next we discuss the power of the LR test φ defined as the probability of
detecting the alternative when the response f ∗ deviates from the null f 0 .
In the next result we do not assume that the true response f ∗ follows the
linear PA f ∗ = Ψ> θ and show that the test power depends on the value
kΠ(f ∗ − f 0 )k2 .
Theorem 7.1.5. Consider the model (7.1) with Var(ε) = σ 2 In for a known
value σ 2 . Define
∆ = σ −2 kΠ(f ∗ − f 0 )k2 .
Then the power of the LR test φ only depends on ∆ , i.e. it is the same for
all f ∗ with equal ∆ -value. It holds
IP( φ = 1 ) = IP( (ξ₁ + √∆)² + ξ₂² + … + ξ_p² > 2z_α )
with ξ = (ξ₁, …, ξ_p)^⊤ ∼ N(0, I_p).
Proof. It follows from f̃ = ΠY = Π(f∗ + ε) and f0 = Πf0 for the test statistic T = (2σ²)^{-1} ‖f̃ − f0‖² that
T = (2σ²)^{-1} ‖Π(f∗ − f0) + Πε‖².
Now we show that the distribution of T depends on the response f ∗ only
via the value ∆ . For this we compute the Laplace transform of T .
Lemma 7.1.1. It holds for µ < 1
g(µ) := log IE exp(µT) = µ∆/(2(1 − µ)) − (p/2) log(1 − µ).
Proof. For a standard Gaussian random variable ξ and any a, it holds
IE exp( µ(ξ + a)²/2 ) = e^{µa²/2} (2π)^{-1/2} ∫ exp( µax + µx²/2 − x²/2 ) dx
= exp( µa²/2 + µ²a²/(2(1 − µ)) ) (2π)^{-1/2} ∫ exp( −((1 − µ)/2) (x − aµ/(1 − µ))² ) dx
= exp( µa²/(2(1 − µ)) ) (1 − µ)^{-1/2}.
The projector Π can be represented as Π = U^⊤ Λ_p U for an orthogonal transform U and the diagonal matrix Λ_p = diag(1, …, 1, 0, …, 0) with exactly p unit eigenvalues. This permits representing T in the form
T = Σ_{j=1}^p (ξ_j + a_j)²/2
with i.i.d. standard normal r.v.'s ξ_j and numbers a_j satisfying Σ_j a_j² = ∆. The independence of the ξ_j's implies
g(µ) = Σ_{j=1}^p ( µa_j²/2 + µ²a_j²/(2(1 − µ)) − (1/2) log(1 − µ) ) = µ∆/(2(1 − µ)) − (p/2) log(1 − µ)
as required.
The result of Lemma 7.1.1 claims that the Laplace transform of T depends
on f ∗ only via ∆ and so this also holds for the distribution of T .
The distribution of the squared norm ‖ξ + h‖² for ξ ∼ N(0, I_p) and any fixed vector h ∈ IR^p with ‖h‖² = ∆ is called non-central chi-squared with non-centrality parameter ∆. In particular, for each α, α1 one can define the minimal value ∆ providing the prescribed error α1 of the second kind under the given level α by the condition
IP( ‖ξ + h‖² ≥ 2z_α ) ≥ 1 − α1 for all ‖h‖² ≥ ∆,   subject to IP( ‖ξ‖² ≥ 2z_α ) ≤ α.   (7.10)
The results from Section 4.6 indicate that the value z_α can be bounded from above by p + √(2p log α^{-1}) for moderate values of α^{-1}. For evaluating the value ∆, the following decomposition is useful:
‖ξ + h‖² − ‖h‖² − p = ‖ξ‖² − p + 2h^⊤ξ.
The right-hand side of this equality is a sum of centered Gaussian quadratic and linear forms. In particular, the cross term 2h^⊤ξ is a centered Gaussian r.v. with variance 4‖h‖², while Var(‖ξ‖²) = 2p. These arguments suggest taking ∆ of order p to ensure the prescribed power 1 − α1.
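Since ‖ξ + h‖² with ‖h‖² = ∆ is non-central chi-squared with p degrees of freedom and non-centrality ∆, the power of the LR test can be evaluated directly. The following sketch does this with scipy's ncx2 distribution; the chosen values of p, α and ∆ are illustrative assumptions only.

```python
from scipy.stats import chi2, ncx2

p, alpha = 5, 0.05
crit = chi2.ppf(1 - alpha, df=p)        # threshold for 2T, i.e. 2 * z_alpha

# Power of the LR test as a function of Delta = sigma^{-2} ||Pi(f* - f0)||^2:
# ||xi + h||^2 with ||h||^2 = Delta is non-central chi-squared (df=p, nc=Delta)
for Delta in (0.0, 5.0, 10.0, 20.0):
    power = ncx2.sf(crit, df=p, nc=Delta) if Delta > 0 else chi2.sf(crit, df=p)
    print(f"Delta = {Delta:5.1f}  power = {power:.3f}")
```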
Theorem 7.1.6. For each α, α1 ∈ (0, 1) , there are absolute constants C and
C 1 such that (7.10) is fulfilled for khk2 ≥ ∆ with
∆^{1/2} = √( C p log α^{-1} ) + √( C₁ p log α₁^{-1} ).
The result of Theorem 7.1.6 reveals a problem with the power of the LR test when the dimensionality of the parameter space grows. Indeed, the test remains insensitive for all alternatives in the zone σ^{-2}‖Π(f∗ − f0)‖² ≤ C p, and this zone becomes larger and larger with p.
7.1.3 I.i.d. errors with unknown variance
This section briefly discusses the case when the errors εi are i.i.d. but the
variance σ 2 = Var(εi ) is unknown. A natural idea in this case is to estimate
the variance from the data. The decomposition (7.8) and the independence of RSS = ‖Y − f̃‖² and ‖f̃ − f0‖² are particularly helpful. Theorem 7.1.4 suggests to estimate σ² from RSS by
σ̃² = RSS/(n − p) = ‖Y − f̃‖²/(n − p).
Indeed, due to the result (7.9), σ^{-2} RSS ∼ χ²_{n−p}, yielding
IE σ̃² = σ²,   Var σ̃² = 2σ⁴/(n − p),   (7.11)
and therefore σ̃² is an unbiased, root-n consistent estimate of σ².
Exercise 7.1.2. Check (7.11). Show that σ̃² − σ² → 0 in IP-probability.
Now we consider the LR test (7.5) in which the true variance is replaced by its estimate σ̃²:
T̃ := (1/(2σ̃²)) ‖f̃ − f0‖² = (n − p) ‖f̃ − f0‖² / (2‖Y − f̃‖²) = (RSS₀ − RSS)/(2 RSS/(n − p)).
The result of Theorem 7.1.4 implies the pivotal property for this test statistic
as well.
Theorem 7.1.7. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for an unknown value σ². Then the distribution of the test statistic T̃ under IP_{θ0} only depends on p and n − p:
2p^{-1} T̃ = ((n − p)/p) (RSS₀ − RSS)/RSS ∼ F_{p,n−p},
where F_{p,n−p} denotes the Fisher distribution with parameters p and n − p:
F_{p,n−p} = L( (‖ξ_p‖²/p) / (‖ξ_{n−p}‖²/(n − p)) ),
where ξ_p and ξ_{n−p} are two independent standard Gaussian vectors of dimension p and n − p; see Theorem 6.4.1. In particular, it does not depend on the design matrix Ψ, the noise variance σ², or the true parameter θ0.
This result suggests to fix the critical value for the test statistic T̃ using the quantiles of the Fisher distribution: if t_α is such that F_{p,n−p}(t_α) = 1 − α, then z̃_α = p t_α/2.
Theorem 7.1.8. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for an unknown value σ². If F_{p,n−p}(t_α) = 1 − α and z̃_α = p t_α/2, then the test φ̃ = 1(T̃ ≥ z̃_α) is a level-α test:
IP_{θ0}( φ̃ = 1 ) = IP_{θ0}( T̃ ≥ z̃_α ) = α.
Exercise 7.1.3. Prove the result of Theorem 7.1.8.
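A minimal sketch of the exact F-test of Theorems 7.1.7 and 7.1.8 on simulated data is given below; the design and parameter values are arbitrary assumptions made for illustration.

```python
import numpy as np
from scipy.stats import f as fisher_f

rng = np.random.default_rng(2)
n, p, alpha = 80, 3, 0.05
Psi = rng.standard_normal((p, n))
theta0 = np.array([1.0, -0.5, 2.0])
Y = Psi.T @ theta0 + rng.standard_normal(n)   # true sigma^2 = 1, treated as unknown

Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)
f0 = Psi.T @ theta0                           # response under H0: theta* = theta0
RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - Pi @ Y) ** 2)

# 2T~/p = ((n-p)/p) (RSS0 - RSS)/RSS ~ F_{p, n-p} under H0 (Theorem 7.1.7)
F_stat = (n - p) / p * (RSS0 - RSS) / RSS
t_alpha = fisher_f.ppf(1 - alpha, dfn=p, dfd=n - p)
print(f"F = {F_stat:.3f}, F quantile = {t_alpha:.3f}, reject = {F_stat > t_alpha}")
```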
If the sample size n is sufficiently large, then σ̃² is very close to σ² and one can apply an approximate choice of the critical value z_α from the case of known σ²:
φ̆ = 1( T̃ ≥ z_α ).
This test is not of exact level α but it is of asymptotic level α. Its power function is also close to the power function of the test φ corresponding to the known variance σ².
Theorem 7.1.9. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for an unknown value σ². Then
lim_{n→∞} IP_{θ0}( φ̆ = 1 ) = α.   (7.12)
Moreover,
lim_{n→∞} sup_{f∗} | IP_{f∗}( φ̃ = 1 ) − IP_{f∗}( φ̆ = 1 ) | = 0.   (7.13)
Exercise 7.1.4. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for σ² unknown and prove (7.12) and (7.13).
Hints:
• The consistency of σ̃² permits to restrict to the case |σ̃²/σ² − 1| ≤ δ_n for δ_n → 0.
• The independence of ‖f̃ − f0‖² and σ̃² permits to consider the distribution of 2T̃ = ‖f̃ − f0‖²/σ̃² as if σ̃² were a fixed number close to σ².
• Use that for ζ_p ∼ χ²_p,
IP( ζ_p ≥ z_α(1 + δ_n) ) − IP( ζ_p ≥ z_α ) → 0,   n → ∞.
7.2 Likelihood ratio test for a subvector
The previous section dealt with the case of a simple null hypothesis. This
section considers a more general situation when the null hypothesis concerns
a subvector of θ . This means that the whole model is given by (7.2) but the
vector θ is decomposed into two parts: θ = (γ, η) , where γ is of dimension
p0 < p . The null hypothesis assumes that η = η 0 for all γ . Usually η 0 = 0
but the particular value of η 0 is not important. To simplify the presentation
we assume η 0 = 0 leading to the subset Θ0 of Θ , given by
Θ0 = {θ = (γ, 0)}.
Under the null hypothesis, the model is still linear:
Y = Ψ_γ^⊤ γ + ε,
where Ψ_γ denotes the submatrix of Ψ composed of the rows of Ψ corresponding to the γ-components of θ.
Fix any point θ0 ∈ Θ0, e.g., θ0 = 0, and define the corresponding response f0 = Ψ^⊤θ0. The LR test statistic T can be written in the form
T = max_{θ∈Θ} L(θ, θ0) − max_{θ∈Θ0} L(θ, θ0).   (7.14)
The results of both maximization problems are known:
max_{θ∈Θ} L(θ, θ0) = (1/(2σ²)) ‖f̃ − f0‖²,   max_{θ∈Θ0} L(θ, θ0) = (1/(2σ²)) ‖f̃0 − f0‖²,
where f̃0 and f̃ are the estimates of the response under the null and under the alternative, respectively. As in Theorem 7.1.2 we can establish the following geometric decomposition.
Theorem 7.2.1. Consider the model (7.1). Let T be the LR test statistic
from (7.14) built under the assumptions f ∗ = Ψ> θ ∗ and Var(ε) = σ 2 In
with a known value σ². Then T is given by
T = (1/(2σ²)) ‖f̃ − f̃0‖².
Moreover, the following decompositions for the vector of observations Y and for the residuals ε̃0 = Y − f̃0 from the null hold:
Y − f̃0 = (f̃ − f̃0) + ε̃,
‖Y − f̃0‖² = ‖f̃ − f̃0‖² + ‖ε̃‖²,   (7.15)
where f̃ − f̃0 is the difference between the estimated responses under the null and under the alternative, and ε̃ = Y − f̃ is the vector of residuals from the alternative.
Proof. The proof is similar to the proof of Theorem 7.1.2. We use that f̃ = ΠY, where Π = Ψ^⊤(ΨΨ^⊤)^{-1}Ψ is the projector onto the space L spanned by the rows of Ψ. Similarly, f̃0 = Π0 Y, where Π0 = Ψ_γ^⊤(Ψ_γΨ_γ^⊤)^{-1}Ψ_γ is the projector onto the subspace L0 spanned by the rows of Ψ_γ. This yields the decomposition f̃ − f0 = Π(Y − f0). Similarly,
f̃ − f̃0 = (Π − Π0)Y,   ε̃ = Y − f̃ = (I_n − Π)Y.
As Π − Π0 and I_n − Π are orthogonal projectors, it follows that
‖Y − f̃0‖² = ‖(I_n − Π)Y + (Π − Π0)Y‖² = ‖(I_n − Π)Y‖² + ‖(Π − Π0)Y‖²
and the decomposition (7.15) follows.
The decomposition (7.15) can again be represented as RSS₀ = RSS + 2σ²T, where RSS = ‖Y − f̃‖² is the sum of squared residuals from the alternative, while RSS₀ = ‖Y − f̃0‖² is defined as in the case of a simple null with f0 replaced by the estimate f̃0 under the null.
Now we show how a pivotal test of exact level α based on T can be
constructed if the model assumptions are correct.
Theorem 7.2.2. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for a known value σ², i.e., the ε_i are i.i.d. normal. Then f̃ − f̃0 and ε̃ are under IP_{θ0} zero-mean independent Gaussian vectors satisfying
2T = σ^{-2} ‖f̃ − f̃0‖² ∼ χ²_{p−p0},   σ^{-2} ‖ε̃‖² ∼ χ²_{n−p}.   (7.16)
Let z_α fulfill IP( ζ_{p−p0} ≥ z_α ) = α. Then the test φ = 1(T ≥ z_α/2) is an LR test of exact level α.
Proof. The null assumption θ∗ ∈ Θ0 implies f∗ ∈ L0. This, together with Π0 f∗ = f∗, implies the following decomposition:
f̃ − f̃0 = (Π − Π0)ε,   ε̃ = ε − Πε = (I_n − Π)ε.
Next, Π − Π0 and I_n − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors (Π − Π0)ε and (I_n − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.16) is similar to (7.9) and can easily be verified by making use of Theorem 7.1.3.
If the variance σ² of the noise is unknown, one can proceed exactly as in the case of a simple null: estimate the variance from the residuals, using their independence of the test statistic T. This leads to the estimate
σ̃² = RSS/(n − p) = ‖Y − f̃‖²/(n − p)
and to the test statistic
T̃ = (n − p) ‖f̃ − f̃0‖² / (2‖Y − f̃‖²) = (RSS₀ − RSS)/(2 RSS/(n − p)).
The property of pivotality is preserved here as well: properly scaled, the test statistic T̃ has a Fisher distribution.
Theorem 7.2.3. Consider the model (7.1) with ε ∼ N(0, σ²I_n) for an unknown value σ². Then 2T̃/(p − p0) has the Fisher distribution F_{p−p0,n−p} with parameters p − p0 and n − p. If t_α is the 1 − α quantile of this distribution, then the test φ̃ = 1( T̃ > (p − p0)t_α/2 ) is of exact level α.
If the sample size is sufficiently large, one can proceed as if σ̃² were the true variance, ignoring the error of variance estimation. This leads to the critical value z_α from Theorem 7.2.2, and the corresponding test is of asymptotic level α.
Exercise 7.2.1. Prove Theorem 7.2.3.
The study of the power of the test based on T does not differ from the case of a simple hypothesis. One only needs to redefine ∆ as
∆ := σ^{-2} ‖(Π − Π0)f∗‖².
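The subvector test of Theorem 7.2.3 amounts to comparing the residual sums of squares of the full and the restricted fits. The sketch below illustrates this on simulated data; the partition of θ, the sample sizes and the use of numpy/scipy are assumptions for the example.

```python
import numpy as np
from scipy.stats import f as fisher_f

rng = np.random.default_rng(3)
n, p, p0, alpha = 120, 5, 2, 0.05
Psi = rng.standard_normal((p, n))
gamma = np.array([1.0, -1.0])                             # first p0 components, unrestricted
theta_true = np.concatenate([gamma, np.zeros(p - p0)])    # eta = 0: the null is true
Y = Psi.T @ theta_true + rng.standard_normal(n)

def rss(Psi_sub, Y):
    Pi = Psi_sub.T @ np.linalg.solve(Psi_sub @ Psi_sub.T, Psi_sub)
    return np.sum((Y - Pi @ Y) ** 2)

RSS = rss(Psi, Y)              # full model
RSS0 = rss(Psi[:p0], Y)        # restricted model uses only the gamma-rows of Psi

# 2T~/(p - p0) ~ F_{p-p0, n-p} under H0 (Theorem 7.2.3)
F_stat = (n - p) / (p - p0) * (RSS0 - RSS) / RSS
t_alpha = fisher_f.ppf(1 - alpha, dfn=p - p0, dfd=n - p)
print(f"F = {F_stat:.3f}, reject H0: eta = 0 -> {F_stat > t_alpha}")
```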
7.3 Likelihood ratio test for a linear hypothesis
In this section, we generalize the test problem for the Gaussian linear model
further and assume that we want to test the linear hypothesis H0 : Cθ = d for
a given contrast matrix C ∈ IRr×p with rank(C) = r ≤ p and a right-hand
side vector d ∈ IRr . Notice that the point null hypothesis H0 : θ = θ 0 and the
hypothesis H0: η = η0 considered in Section 7.2 are linear hypotheses. Here, we restrict our attention to the case that the error variance σ² is unknown, as is typically the case in practice.
Theorem 7.3.1. Assume model (7.1) with ε ∼ N(0, σ²I_n) for an unknown value σ² and consider the linear hypothesis H0: Cθ = d. Then it holds:
(a) The likelihood ratio statistic T̃ is an isotone transformation of F := ((n − p)/r) ∆RSS/RSS, where ∆RSS := RSS₀ − RSS.
(b) The restricted MLE θ̃0 is given by
θ̃0 = θ̃ − (ΨΨ^⊤)^{-1} C^⊤ ( C(ΨΨ^⊤)^{-1}C^⊤ )^{-1} (Cθ̃ − d).
(c) ∆RSS and RSS are independent.
(d) Under H0, ∆RSS/σ² ∼ χ²_r and F ∼ F_{r,n−p}.
Proof. For proving (a), verify that 2T̃ = n log(σ̂₀²/σ̂²), where σ̂² and σ̂₀² are the unrestricted and restricted (i.e., under H0) MLEs of the error variance σ², respectively. Plugging in the explicit representations for σ̂² and σ̂₀² yields that 2T̃ = n log(∆RSS/RSS + 1), implying the assertion. Part (b) is an application of the following well-known result from quadratic optimization theory.
Lemma 7.3.1. Let A ∈ IR^{r×p} and b ∈ IR^r be fixed and define M = {z ∈ IR^p : Az = b}. Moreover, let f : IR^p → IR be given by f(z) = z^⊤Qz/2 − c^⊤z for a symmetric, positive semi-definite (p × p)-matrix Q and a vector c ∈ IR^p. Then the unique minimum of f over the search space M is characterized by solving the system of linear equations
Qz − A^⊤y = c,   Az = b
for (y, z). The component z of this solution minimizes f over M.
For part (c), we utilize the explicit form of θ̃0 from part (b) and write ∆RSS in the form
∆RSS = (Cθ̃ − d)^⊤ ( C(ΨΨ^⊤)^{-1}C^⊤ )^{-1} (Cθ̃ − d).
This shows that ∆RSS is a deterministic transformation of θ̃. Since θ̃ and RSS are independent, the assertion follows. This representation of ∆RSS moreover shows that ∆RSS/σ² is a quadratic form of a (under H0) standard normal random vector, and part (d) follows from Theorem 7.1.3.
To sum up, Theorem 7.3.1 shows that general linear hypotheses can be
tested with F -tests under model (7.1) with ε ∼ N(0, σ 2 In ) for an unknown
value σ 2 .
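A short sketch of the F-test for a general linear hypothesis Cθ = d, following Theorem 7.3.1 on simulated data, is given below; the particular contrast matrix C, the vector d and all sample values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import f as fisher_f

rng = np.random.default_rng(4)
n, p, alpha = 150, 4, 0.05
Psi = rng.standard_normal((p, n))
theta = np.array([2.0, 2.0, -1.0, 0.5])
Y = Psi.T @ theta + rng.standard_normal(n)

# Linear hypothesis H0: C theta = d with r = 2 restrictions (true for theta above)
C = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
d = np.array([0.0, -1.0])
r = C.shape[0]

G = np.linalg.inv(Psi @ Psi.T)                 # (Psi Psi^T)^{-1}
theta_tilde = G @ (Psi @ Y)                    # unrestricted MLE
RSS = np.sum((Y - Psi.T @ theta_tilde) ** 2)

# Delta RSS = (C theta~ - d)' [C (Psi Psi^T)^{-1} C']^{-1} (C theta~ - d)
resid = C @ theta_tilde - d
delta_rss = resid @ np.linalg.solve(C @ G @ C.T, resid)

F_stat = (n - p) / r * delta_rss / RSS
crit = fisher_f.ppf(1 - alpha, dfn=r, dfd=n - p)
print(f"F = {F_stat:.3f}, reject = {F_stat > crit}")
```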
7.4 Wald test
The drawback of the LR test method for testing a linear hypothesis H0: Cθ = d without assuming Gaussian noise is that a constrained maximization of the log-likelihood function under the constraints encoded by C and d has to be performed. This computationally intensive step can be avoided by using Wald statistics. The Wald statistic for testing H0 is given by
W = (Cθ̃ − d)^⊤ ( C V̂ C^⊤ )^{-1} (Cθ̃ − d),
where θ̃ and V̂ denote the MLE in the full (unrestricted) model and the estimated covariance matrix of θ̃, respectively.
Theorem 7.4.1. Under model (7.2), the statistic W is asymptotically equivalent to the LR test statistic 2T. In particular, under H0, the distribution of W converges weakly to χ²_r: W →^w χ²_r as n → ∞. In the case of normally distributed noise, it holds that W = rF, where F is as in Theorem 7.3.1.
Proof. For proving the asymptotic χ²_r-distribution of W, we use the asymptotic normality of the MLE θ̃. If the model is regular, it holds that
√n (θ̃_n − θ0) →^w N(0, F(θ0)^{-1}) under θ0 for n → ∞,
and, consequently,
F(θ0)^{1/2} n^{1/2} (θ̃_n − θ0) →^w N(0, I_r),   where r := dim(θ0).
Applying the Continuous Mapping Theorem, we get
(θ̃_n − θ0)^⊤ n F(θ0) (θ̃_n − θ0) →^w χ²_r.
If the Fisher information is continuous and F̂(θ̃_n) is a consistent estimator for F(θ0), it still holds that
(θ̃_n − θ0)^⊤ n F̂(θ̃_n) (θ̃_n − θ0) →^w χ²_r.
Substituting θ̃_n = Cθ̃ − d and θ0 = 0 ∈ IR^r, we obtain the assertion concerning the asymptotic distribution of W under H0. For proving the relationship between W and F in the case of Gaussian noise, we notice that
F = ((n − p)/r) ∆RSS/RSS = ((n − p)/r) (Cθ̃ − d)^⊤ ( C(ΨΨ^⊤)^{-1}C^⊤ )^{-1} (Cθ̃ − d) / ((n − p)σ̃²) = (Cθ̃ − d)^⊤ ( C V̂ C^⊤ )^{-1} (Cθ̃ − d) / r = W/r
as required.
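The Wald statistic requires only the unrestricted MLE and its estimated covariance V̂ = σ̃²(ΨΨ^⊤)^{-1}. The following sketch, under a simulated Gaussian linear model (all values illustrative assumptions), computes W and compares it with the χ²_r critical value.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, p, alpha = 150, 4, 0.05
Psi = rng.standard_normal((p, n))
theta = np.array([2.0, 2.0, -1.0, 0.5])
Y = Psi.T @ theta + rng.standard_normal(n)

C = np.array([[1.0, -1.0, 0.0, 0.0]])         # H0: theta_1 = theta_2 (true here)
d = np.array([0.0])
r = C.shape[0]

G = np.linalg.inv(Psi @ Psi.T)
theta_tilde = G @ (Psi @ Y)
sigma2_hat = np.sum((Y - Psi.T @ theta_tilde) ** 2) / (n - p)
V_hat = sigma2_hat * G                        # estimated covariance of the MLE

resid = C @ theta_tilde - d
W = resid @ np.linalg.solve(C @ V_hat @ C.T, resid)
print(f"W = {W:.3f}, chi2_r critical value = {chi2.ppf(1 - alpha, df=r):.3f}")
```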
7.5 Analysis of variance
Important special cases of linear models arise when all entries of the design
matrix Ψ are binary, i. e., Ψij ∈ {0, 1} for all 1 ≤ i ≤ p , 1 ≤ j ≤ n . Every
row of Ψ then has the interpretation of a group indicator, where Ψij = 1
if and only if observational unit j belongs to group i . The target of such
an analysis of variance is to determine if the mean response differs between
groups. In this section, we specialize the theory of testing in linear models to
this situation.
7.5.1 Two-sample test
First, we consider the case of p = 2 , meaning that exactly two samples corresponding to the two groups A and B are given. We let nA denote the
number of observations in group A and nB = n − nA the number of observations in group B . The parameter of interest θ ∈ IR2 in this model consists
of the two population means η_A and η_B, and the model reads
Y_{ij} = η_i + ε_{ij},   i ∈ {A, B},   j = 1, …, n_i.
Noticing that (ΨΨ^⊤)^{-1} = diag(1/n_A, 1/n_B), we get that θ̃ = (Ȳ_A, Ȳ_B)^⊤, where Ȳ_A and Ȳ_B denote the two sample means. With this, we immediately obtain that the estimator for the noise variance in this model is given by
σ̃² = RSS/(n − p) = (1/(n_A + n_B − 2)) ( Σ_{j=1}^{n_A} (Y_{A,j} − Ȳ_A)² + Σ_{j=1}^{n_B} (Y_{B,j} − Ȳ_B)² ).
The quantity s² := σ̃² is called the pooled sample variance. Denoting the group-specific sample variances by
s_A² = (1/n_A) Σ_{j=1}^{n_A} (Y_{A,j} − Ȳ_A)²,   s_B² = (1/n_B) Σ_{j=1}^{n_B} (Y_{B,j} − Ȳ_B)²,
we have that
s² = (n_A s_A² + n_B s_B²)/(n_A + n_B − 2)
is a weighted average of s_A² and s_B².
Under this model, we are interested in testing the null hypothesis H0: η_A = η_B of equal group means against the two-sided alternative H1: η_A ≠ η_B or the one-sided alternative H1+: η_A > η_B. The one-sided alternative hypothesis H1−: η_A < η_B can be treated by switching the group labels.
The null hypothesis H0: η_A − η_B = 0 is a linear hypothesis in the sense of Section 7.3, where the number of restrictions is r = 1. Under H0, the grand average
Ȳ := n^{-1} Σ_{i∈{A,B}} Σ_{j=1}^{n_i} Y_{ij}
is the MLE for the (common) population mean. Straightforwardly, we calculate that the sum of squared residuals under H0 is given by
RSS₀ = Σ_{i∈{A,B}} Σ_{j=1}^{n_i} (Y_{ij} − Ȳ)².
For computing ∆ RSS = RSS0 − RSS and the statistic F from Theorem 7.3.1.(a), we obtain the following results.
Lemma 7.5.1.
RSS₀ = RSS + n_A (Ȳ_A − Ȳ)² + n_B (Ȳ_B − Ȳ)²,   (7.17)
F = ( n_A n_B / (n_A + n_B) ) (Ȳ_A − Ȳ_B)² / s².   (7.18)
Exercise 7.5.1. Prove Lemma 7.5.1.
We conclude from Theorem 7.3.1 that, if the individual errors are homogeneous between samples, the test statistic
t = (Ȳ_A − Ȳ_B) / ( s √(1/n_A + 1/n_B) )
is t-distributed with n_A + n_B − 2 degrees of freedom under the null hypothesis H0.
For a given significance level α, define the quantile q_α of Student's t-distribution with n_A + n_B − 2 degrees of freedom by the equation
IP( t₀ > q_α ) = α,
where t₀ represents the distribution of t in the case that H0 is true. Utilizing the symmetry property
IP( |t₀| > q_{α/2} ) = 2 IP( t₀ > q_{α/2} ) = α,
the one-sided two-sample t-test for H0 versus H1+ rejects if t > q_α, and the two-sided two-sample t-test for H0 versus H1 rejects if |t| > q_{α/2}.
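A minimal sketch of the two-sample t-test with pooled variance, on simulated data, is given below; the group sizes and means are arbitrary assumptions, and the cross-check against scipy.stats.ttest_ind assumes scipy is available.

```python
import numpy as np
from scipy.stats import t as student_t, ttest_ind

rng = np.random.default_rng(6)
YA = rng.normal(loc=0.0, scale=1.0, size=30)
YB = rng.normal(loc=0.5, scale=1.0, size=40)
nA, nB = len(YA), len(YB)

# Pooled sample variance s^2 = (nA sA^2 + nB sB^2)/(nA + nB - 2),
# with sA^2, sB^2 the group-specific mean squared deviations
sA2 = np.mean((YA - YA.mean()) ** 2)
sB2 = np.mean((YB - YB.mean()) ** 2)
s2 = (nA * sA2 + nB * sB2) / (nA + nB - 2)

t_stat = (YA.mean() - YB.mean()) / np.sqrt(s2 * (1 / nA + 1 / nB))
p_two_sided = 2 * student_t.sf(abs(t_stat), df=nA + nB - 2)
print(f"t = {t_stat:.3f}, two-sided p-value = {p_two_sided:.4f}")

# Cross-check against scipy's pooled-variance two-sample t-test
print(ttest_ind(YA, YB, equal_var=True))
```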
7.5.2 Comparing K treatment means
This section deals with the more general one-factorial analysis of variance in the presence of K ≥ 2 groups. We assume that K samples are given, each of size
nk for k = 1, . . . , K . The following quantities are relevant for testing the null
hypothesis of equal group (treatment) means.
Sample mean in treatment group k:
Ȳ_k = (1/n_k) Σ_i Y_{ki}
Sum of squares (SS) within treatment group k:
S_k = Σ_i (Y_{ki} − Ȳ_k)²
Sample variance within treatment group k:
s_k² = S_k/(n_k − 1)
Denoting the pooled sample size by N = n₁ + … + n_K, we furthermore define the following pooled measures.
Pooled (grand) mean:
Ȳ = (1/N) Σ_k Σ_i Y_{ki}
Within-treatment sum of squares:
S_R = S₁ + … + S_K
Between-treatment sum of squares:
S_T = Σ_k n_k (Ȳ_k − Ȳ)²
Within- and between-treatment mean squares:
s_R² = S_R/(N − K),   s_T² = S_T/(K − 1)
In analogy to Lemma 7.5.1, the following result regarding the decomposition of spread holds.
Lemma 7.5.2. Let the overall variation in the data be defined by
S_D := Σ_k Σ_i (Y_{ki} − Ȳ)².
Then it holds that
S_D = Σ_k Σ_i (Y_{ki} − Ȳ_k)² + Σ_k n_k (Ȳ_k − Ȳ)² = S_R + S_T.
Exercise 7.5.2. Prove Lemma 7.5.2.
The variance components in a one-factorial analysis of variance can be
summarized in an analysis of variance table, cf. Table 7.1.
source of variation | sum of squares | degrees of freedom | mean square
average | S_A = N Ȳ² | ν_A = 1 | s_A² = S_A/ν_A
between treatments | S_T = Σ_k n_k (Ȳ_k − Ȳ)² | ν_T = K − 1 | s_T² = S_T/ν_T
within treatments | S_R = Σ_k Σ_i (Y_{ki} − Ȳ_k)² | ν_R = N − K | s_R² = S_R/ν_R
total | S_D = Σ_k Σ_i (Y_{ki} − Ȳ)² | N − 1 |
Table 7.1. Variance components in a one-factorial analysis of variance
For testing the global hypothesis H0: η₁ = η₂ = … = η_K against the (two-sided) alternative H1: ∃(i, j) with η_i ≠ η_j, we again apply Theorem 7.3.1, leading to the following result.
Corollary 7.5.1. Under the model of the analysis of variance with K groups,
assume homogeneous Gaussian noise with noise variance σ 2 > 0 . Then it
holds:
(i) SR /σ 2 ∼ χ2N −K .
(ii) Under H0 , ST /σ 2 ∼ χ2K−1 .
(iii) SR and ST are stochastically independent.
(iv) Under H0, the statistic F := ( S_T/(K − 1) ) / ( S_R/(N − K) ) is distributed as F_{K−1,N−K}.
Therefore, H0 is rejected by a level α F -test if the value of F exceeds the
quantile FK−1,N −K;1−α of Fisher’s F -distribution with K − 1 and N − K
degrees of freedom.
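The F-test of Corollary 7.5.1 can be computed directly from the within- and between-treatment sums of squares. The sketch below does this for simulated groups (group means and sizes are illustrative assumptions) and cross-checks against scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f as fisher_f, f_oneway

rng = np.random.default_rng(7)
groups = [rng.normal(loc=m, scale=1.0, size=n_k)
          for m, n_k in zip((0.0, 0.0, 0.5), (20, 25, 30))]
K = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

S_T = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between treatments
S_R = sum(np.sum((g - g.mean()) ** 2) for g in groups)             # within treatments

F_stat = (S_T / (K - 1)) / (S_R / (N - K))
p_val = fisher_f.sf(F_stat, dfn=K - 1, dfd=N - K)
print(f"F = {F_stat:.3f}, p = {p_val:.4f}")

# scipy's one-way ANOVA gives the same statistic
print(f_oneway(*groups))
```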
Treatment effects.
For estimating the amount of shift in the mean response caused by treatment
k (the k -th treatment effect), it is convenient to re-parametrize the model as
follows. We start with the basic model, given by
Y_{ki} = η_k + ε_{ki},   ε_{ki} ∼ N(0, σ²) i.i.d.,
where η_k is the true treatment mean.
Now, we introduce the averaged treatment mean η and the k-th treatment effect τ_k, given by
η := (1/N) Σ_k n_k η_k,   τ_k := η_k − η.
The re-parametrized model is then given by
Y_{ki} = η + τ_k + ε_{ki}.   (7.19)
It is important to notice that this new model representation involves K + 1 unknowns. In order to achieve maximal rank of the design matrix, one therefore has to take the constraint Σ_{k=1}^K n_k τ_k = 0 into account when building Ψ, for instance by coding
τ_K = −n_K^{-1} Σ_{k=1}^{K−1} n_k τ_k.
For the data analysis, it is helpful to consider the decomposition of the observed data points:
Y_{ki} = Ȳ + (Ȳ_k − Ȳ) + (Y_{ki} − Ȳ_k).
Routine algebra then leads to the following results concerning inference in the
treatment effect representation of the analysis of variance model.
Theorem 7.5.1. Under Model (7.19) with Σ_{k=1}^K n_k τ_k = 0, it holds:
(i) The MLEs for the unknown model parameters are given by
η̃ = Ȳ,   τ̃_k = Ȳ_k − Ȳ,   1 ≤ k ≤ K.
(ii) The F-statistic for testing the global hypothesis H0: τ₁ = τ₂ = … = τ_{K−1} = 0 is identical to the one given in Corollary 7.5.1.(iv), i.e., F = ( S_T/(K − 1) ) / ( S_R/(N − K) ).
7.5.3 Randomized blocks
This section deals with a special case of the two-factorial analysis of variance.
We assume that the observational units are grouped into n blocks, where in
each block the K treatments under investigation are applied exactly once.
The total sample size is then given by N = nK. Since there may be a "block effect" on the mean response, we consider the model
Y_{ki} = η + β_i + τ_k + ε_{ki},   1 ≤ k ≤ K,   1 ≤ i ≤ n,   (7.20)
where η is the general mean, β_i the block effect, and τ_k the treatment effect. The data can be decomposed as
Y_{ki} = Ȳ + (Ȳ_i − Ȳ) + (Ȳ_k − Ȳ) + (Y_{ki} − Ȳ_i − Ȳ_k + Ȳ),
where Ȳ is the grand average, Ȳ_i the block average, Ȳ_k the treatment average, and Y_{ki} − Ȳ_i − Ȳ_k + Ȳ the residual.
Applying decomposition of spread to the model given by (7.20), we arrive
at Table 7.2. Based on this, the following corollary summarizes the inference
techniques under the model defined by (7.20).
Corollary 7.5.2. Under the model defined by (7.20), it holds:
(i) The MLEs for the unknown model parameters are given by
η̃ = Ȳ,   β̃_i = Ȳ_i − Ȳ,   τ̃_k = Ȳ_k − Ȳ.
source of variation | sum of squares | degrees of freedom
average (correction factor) | S = nKȲ² | 1
between blocks | S_B = K Σ_i (Ȳ_i − Ȳ)² | n − 1
between treatments | S_T = n Σ_k (Ȳ_k − Ȳ)² | K − 1
residuals | S_R = Σ_k Σ_i (Y_{ki} − Ȳ_i − Ȳ_k + Ȳ)² | (n − 1)(K − 1)
total | S_D = Σ_k Σ_i (Y_{ki} − Ȳ)² | N − 1 = nK − 1
Table 7.2. Analysis of variance table for a randomized block design
(ii) The F-statistic for testing the hypothesis H_B of no block effects is given by
F_B = ( S_B/(n − 1) ) / ( S_R/[(n − 1)(K − 1)] ).
Under H_B, the distribution of F_B is Fisher's F-distribution with (n − 1) and (n − 1)(K − 1) degrees of freedom.
(iii) The F-statistic for testing the null hypothesis H_T of no treatment effects is given by
F_T = ( S_T/(K − 1) ) / ( S_R/[(n − 1)(K − 1)] ).
Under H_T, the distribution of F_T is Fisher's F-distribution with (K − 1) and (n − 1)(K − 1) degrees of freedom.
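A sketch of the two F-tests of Corollary 7.5.2 for a simulated randomized block design is given below; the numbers of blocks and treatments and the effect sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import f as fisher_f

rng = np.random.default_rng(8)
K, n_blocks = 4, 6                              # K treatments, n blocks
tau = np.array([0.0, 0.3, -0.2, 0.5])           # treatment effects
beta = rng.normal(scale=0.5, size=n_blocks)     # block effects
# Y[k, i] = eta + beta_i + tau_k + noise, model (7.20)
Y = 1.0 + beta[None, :] + tau[:, None] + rng.normal(scale=1.0, size=(K, n_blocks))

grand = Y.mean()
block_means = Y.mean(axis=0)                    # averages over treatments within block
treat_means = Y.mean(axis=1)                    # averages over blocks within treatment

S_B = K * np.sum((block_means - grand) ** 2)
S_T = n_blocks * np.sum((treat_means - grand) ** 2)
resid = Y - block_means[None, :] - treat_means[:, None] + grand
S_R = np.sum(resid ** 2)

df_R = (n_blocks - 1) * (K - 1)
F_T = (S_T / (K - 1)) / (S_R / df_R)
F_B = (S_B / (n_blocks - 1)) / (S_R / df_R)
print(f"F_T = {F_T:.3f} (p = {fisher_f.sf(F_T, K - 1, df_R):.3f}), "
      f"F_B = {F_B:.3f} (p = {fisher_f.sf(F_B, n_blocks - 1, df_R):.3f})")
```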
7.6 Historical remarks and further reading
Hotelling (1931) derived a deterministic transformation of Fisher’s F -distribution
and demonstrated its usage in the context of testing for differences among several Gaussian means with a likelihood ratio test. The general idea of the Wald
test goes back to Wald (1943).
A classical textbook on the analysis of variance is that of Scheffé (1959).
The general theory of testing linear hypotheses in linear models is described,
e. g., in the textbook by Searle (1971).
8
Some other testing methods
This chapter discusses some nonparametric testing methods. First, we treat
classical testing procedures such as the Kolmogorov-Smirnov and the Cramér-Smirnov-von Mises tests as particular cases of the substitution approach. Then, we are concerned with Bayesian approaches towards hypothesis testing. Finally, Section 8.4 deals with locally best tests. It is demonstrated that the
score function is the natural equivalent to the LR statistic if no uniformly
best tests exist, but locally best tests are aimed at, assuming that the model
is differentiable in the mean. Conditioning on the ranks of the observations
leads to the theory of rank tests. Due to the close connection of rank tests
and permutation tests (the null distribution of a rank test is a permutation
distribution), we end the chapter with some general remarks on permutation
tests.
Let Y = (Y1 , . . . , Yn )> be an i.i.d. sample from a distribution P . The
joint distribution IP of Y is the n -fold product of P , so a hypothesis about
IP can be formulated as a hypothesis about the marginal measure P . A simple hypothesis H0 means the assumption that P = P0 for a given measure
P0 . The empirical measure Pn is a natural empirical counterpart of P leading to the idea of testing the hypothesis by checking whether Pn significantly
deviates from P0 . As in the estimation problem, this substitution idea can
be realized in several different ways. We briefly discuss below the method of
moments and the minimal distance method.
8.1 Method of moments for an i.i.d. sample
Let g(·) be any d -vector function on IR1 . The assumption P = P0 leads to
the population moment
m0 = IE0 g(Y1 ).
The empirical counterpart of this quantity is given by
M_n = IE_n g(Y) = (1/n) Σ_i g(Y_i).
The method of moments (MOM) suggests to consider the difference M_n − m0 for building a reasonable test. The properties of M_n were stated in Section 2.4. In particular, under the null P = P0, the first two moments of the vector M_n − m0 can be easily computed: IE_0(M_n − m0) = 0 and
Var_0(M_n) = IE_0[ (M_n − m0)(M_n − m0)^⊤ ] = n^{-1} V,   V := IE_0[ (g(Y) − m0)(g(Y) − m0)^⊤ ].
For simplicity of presentation we assume that the moment function g is selected to ensure a non-degenerate matrix V . Standardization by the covariance matrix leads to the vector
ξ n = n1/2 V −1/2 (M n − m0 ),
which has under the null measure zero mean and a unit covariance matrix.
Moreover, ξ n is, under the null hypothesis, asymptotically standard normal,
i. e., its distribution is approximately standard normal if the sample size n is
sufficiently large; see Theorem 2.4.4. The MOM test rejects the null hypothesis if the vector ξ_n computed from the available data Y is very unlikely for a standard normal vector, that is, if it deviates significantly from zero. We specify the procedure
separately for the univariate and multivariate cases.
Univariate case.
Let g(·) be a univariate function with IE_0 g(Y) = m0 and IE_0( g(Y) − m0 )² = σ². Define the linear test statistic
T_n = (1/√(nσ²)) Σ_i ( g(Y_i) − m0 ) = n^{1/2} σ^{-1} (M_n − m0),
leading to the test
φ = 1( |T_n| > z_{α/2} ),   (8.1)
where z_α denotes the upper α-quantile of the standard normal law.
Theorem 8.1.1. Let Y be an i.i.d. sample from P. Then, under H0: P = P0, the test statistic T_n is asymptotically standard normal and the test φ from (8.1) is of asymptotic level α, that is,
IP_0( φ = 1 ) → α,   n → ∞.
Similarly one can consider a one-sided alternative H1+ : m > m0 or H1− :
m < m0 about the moment m = Eg(Y ) of the distribution P and the
corresponding one-sided tests
φ+ = 1(Tn > zα ),
φ− = 1(Tn < −zα ).
As in Theorem 8.1.1, both tests φ+ and φ− are of asymptotic level α .
Multivariate case.
The components of the vector function g(·) ∈ IRd are usually associated with
“directions” in which the null hypothesis is tested. The multivariate situation
means that we test simultaneously in d > 1 directions. The most natural test
statistic is the squared Euclidean norm of the standardized vector ξ_n:
T_n := ‖ξ_n‖² = n ‖V^{-1/2}(M_n − m0)‖².   (8.2)
By Theorem 2.4.4 the vector ξ_n is asymptotically standard normal, so that T_n is asymptotically chi-squared with d degrees of freedom. This yields the natural definition of the test φ using quantiles of χ²_d, i.e.,
φ = 1( T_n > z_α )   (8.3)
with z_α denoting the upper α-quantile of the χ²_d distribution.
Theorem 8.1.2. Let Y be an i.i.d. sample from P. If z_α fulfills IP( χ²_d > z_α ) = α, then the test statistic T_n from (8.2) is asymptotically χ²_d-distributed under H0: P = P0 and the test φ from (8.3) is of asymptotic level α.
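As an illustration of the multivariate MOM test, the sketch below tests H0: P0 = N(0, 1) in the two directions g(y) = (y, y² − 1); the directions and the simulated sample are illustrative assumptions, and the matrix V is computed analytically under this particular null.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, alpha = 500, 0.05
Y = rng.standard_normal(n)                      # data generated under H0: P0 = N(0, 1)

# Directions g(y) = (y, y^2 - 1): both have mean zero under the null
g = np.column_stack([Y, Y ** 2 - 1])
m0 = np.zeros(2)
# V = Cov_0(g(Y)): Var(Y) = 1, Var(Y^2 - 1) = 2, Cov(Y, Y^2 - 1) = E[Y^3] = 0
V = np.diag([1.0, 2.0])

M_n = g.mean(axis=0)
xi = np.sqrt(n) * np.linalg.solve(np.linalg.cholesky(V), M_n - m0)
T_n = xi @ xi
print(f"T_n = {T_n:.3f}, chi2_2 critical value = {chi2.ppf(1 - alpha, df=2):.3f}")
```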
8.1.1 Series expansion
A standard method of building the moment tests or, alternatively, of choosing
the directions g(·) is based on some series expansion. Let ψ1 , ψ2 , . . . , be a
given set of basis functions in the related functional space. It is especially
useful to select these basis functions to be orthonormal under the measure
P0:
∫ ψ_j(y) P0(dy) = 0,   ∫ ψ_j(y) ψ_{j'}(y) P0(dy) = δ_{j,j'},   ∀ j, j'.   (8.4)
Select a fixed index d and take the first d basis functions ψ1 , . . . , ψd as
“directions” or components of g . Then
m_{j,0} := ∫ ψ_j(y) P0(dy) = 0
is the j-th population moment under the null hypothesis H0, and it is tested by checking whether the empirical moments M_{j,n} with
M_{j,n} := (1/n) Σ_i ψ_j(Y_i)
do not deviate significantly from zero. The condition (8.4) effectively permits to test each direction ψ_j independently of the others.
For each d one obtains a test statistic T_{n,d} with
T_{n,d} := n ( M_{1,n}² + … + M_{d,n}² ),
leading to the test
φ_d = 1( T_{n,d} > z_{α,d} ),
where z_{α,d} is the upper α-quantile of χ²_d. In practical applications the choice of d is particularly relevant and is the subject of various studies.
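A sketch of such a series-expansion test for H0: P0 = UNI[0, 1] is given below, using shifted and normalized Legendre polynomials, which satisfy (8.4) under the uniform measure; the choices of n and d and the use of scipy.special are assumptions for the example.

```python
import numpy as np
from scipy.special import eval_legendre
from scipy.stats import chi2

rng = np.random.default_rng(10)
n, d, alpha = 400, 4, 0.05
U = rng.uniform(size=n)                         # data generated under H0: P0 = UNI[0, 1]

# psi_j(u) = sqrt(2j+1) * P_j(2u - 1), j >= 1, are orthonormal under UNI[0,1]
# and integrate to zero, so they satisfy condition (8.4)
M = np.array([np.sqrt(2 * j + 1) * eval_legendre(j, 2 * U - 1).mean()
              for j in range(1, d + 1)])

T_nd = n * np.sum(M ** 2)
print(f"T_n,d = {T_nd:.3f}, chi2 critical value = {chi2.ppf(1 - alpha, df=d):.3f}")
```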
8.1.2 Testing a parametric hypothesis
The method of moments can be extended to the situation when the null
hypothesis is parametric: H0 : P ∈ (Pθ , θ ∈ Θ0 ) . It is natural to apply the
method of moments both to estimate the parameter θ under the null and to
test the null. So, we assume two different moment vector functions g 0 and
g 1 to be given. The first one is selected to fulfill
θ ≡ Eθ g 0 (Y1 ),
θ ∈ Θ0 .
This permits estimating the parameter θ directly by the empirical moment:
θ̃ = (1/n) Σ_i g_0(Y_i).
The second vector of moment functions is composed of directional alternatives. An identifiability condition suggests to select the directional alternative functions orthogonal to g_0 in the following sense. We choose g_1 = (g_1^{(1)}, …, g_1^{(k)})^⊤ : IR^r → IR^k such that for all θ ∈ Θ0 it holds g_1(m_1, …, m_r) = 0 ∈ IR^k, where (m_ℓ : 1 ≤ ℓ ≤ r) denote the first r (population) moments of the distribution P_θ.
Theorem 8.1.3. Let m̃_ℓ = n^{-1} Σ_{i=1}^n Y_i^ℓ denote the ℓ-th sample moment for 1 ≤ ℓ ≤ r. Then, under the regularity assumptions discussed in Section 2.4 and assuming that each g_1^{(j)} is continuously differentiable and that all (m_ℓ : 1 ≤ ℓ ≤ 2r) are continuous functions of θ, it holds that the distribution of
T_n := n g_1^⊤(m̃_1, …, m̃_r) V^{-1}(θ̃) g_1(m̃_1, …, m̃_r)
converges under H0 weakly to χ²_k, where
V(θ) := J(g_1) Σ J(g_1)^⊤ ∈ IR^{k×k},   J(g_1) := ( ∂g_1^{(j)}(m_1, …, m_r)/∂m_ℓ )_{1≤j≤k, 1≤ℓ≤r} ∈ IR^{k×r},
and Σ = (σ_{ij}) ∈ IR^{r×r} with σ_{ij} = m_{i+j} − m_i m_j.
Theorem 8.1.3, which is an application of the Delta method in connection with the asymptotic normality of MOM estimators, leads to the goodness-of-fit test
φ = 1( T_n > z_α ),
where z_α is the upper α-quantile of χ²_k, for testing H0.
8.2 Minimum distance method for an i.i.d. sample
The method of moments is especially useful for the case of a simple hypothesis because it compares the population moments computed under the null
with their empirical counterpart. However, if a more complicated composite
hypothesis is tested, the population moments can not be computed directly:
the null measure is not specified precisely. In this case, the minimum distance
idea appears to be useful. Let (Pθ , θ ∈ Θ ⊂ IRp ) be a parametric family and
Θ0 be a subset of Θ . The null hypothesis about an i.i.d. sample Y from
P is that P ∈ (Pθ , θ ∈ Θ0 ) . Let ρ(P, P 0 ) denote some functional (distance)
defined for measures P, P 0 on the real line. We assume that ρ satisfies the
following conditions: ρ(Pθ1 , Pθ2 ) ≥ 0 and ρ(Pθ1 , Pθ2 ) = 0 iff θ 1 = θ 2 . The
condition P ∈ (Pθ , θ ∈ Θ0 ) can be rewritten in the form
inf_{θ∈Θ0} ρ(P, P_θ) = 0.
Now we can apply the substitution principle: use P_n in place of P. Define the value T by
T := inf_{θ∈Θ0} ρ(P_n, P_θ).   (8.5)
Large values of the test statistic T indicate a possible violation of the null
hypothesis.
In particular, if H0 is a simple hypothesis, that is, if the set Θ0 consists of one point θ0, the test statistic reads as T = ρ(P_n, P_{θ0}). The critical value for this test is usually selected by the level condition:
IP_{θ0}( ρ(P_n, P_{θ0}) > t_α ) ≤ α.
Note that the test statistic (8.5) can be viewed as a combination of two different steps. First we estimate under the null the parameter θ ∈ Θ0 which provides the best possible parametric fit under the assumption P ∈ (P_θ, θ ∈ Θ0):
θ̃0 = arginf_{θ∈Θ0} ρ(P_n, P_θ).
Next we formally apply the minimum distance test with the simple hypothesis given by θ0 = θ̃0.
Below we discuss some standard choices of the distance ρ .
8.2.1 Kolmogorov-Smirnov test
Let P0, P1 be two distributions on the real line with distribution functions F0, F1: F_j(y) = P_j(Y ≤ y) for j = 0, 1. Define
ρ(P0, P1) = ρ(F0, F1) = sup_y |F0(y) − F1(y)|.   (8.6)
Now consider the related test, starting from the case of a simple null hypothesis P = P0 with corresponding c.d.f. F0. Then the distance ρ from (8.6) (properly scaled) leads to the Kolmogorov-Smirnov test statistic
T_n := sup_y n^{1/2} |F0(y) − F_n(y)|.
A nice feature of this test is the property of asymptotic pivotality.
Theorem 8.2.1 (Kolmogorov). Let F0 be a continuous c.d.f. Then
T_n = sup_y n^{1/2} |F0(y) − F_n(y)| →^D η,
where η is a fixed random variable (the maximum of a Brownian bridge on [0, 1]).
Proof. Idea of the proof: The c.d.f. F0 is monotone and continuous. Therefore, its inverse function F0^{-1} is uniquely defined. Consider the r.v.'s U_i := F0(Y_i).
The basic fact about this transformation is that the Ui ’s are i.i.d. uniform on
the interval [0, 1] .
Lemma 8.2.1. The r.v.'s U_i are i.i.d. with values in [0, 1] and for any u ∈ [0, 1] it holds
IP( U_i ≤ u ) = u.
By definition of F0^{-1}, it holds for any u ∈ [0, 1]
F0( F0^{-1}(u) ) ≡ u.
Moreover, if G_n is the empirical c.d.f. of the U_i's, that is, if
G_n(u) := (1/n) Σ_i 1(U_i ≤ u),
then
G_n(u) ≡ F_n( F0^{-1}(u) ).   (8.7)
Exercise 8.2.1. Check Lemma 8.2.1 and (8.7).
Now by the change of variable y = F0^{-1}(u) we obtain
T_n = sup_{u∈[0,1]} n^{1/2} |F0(F0^{-1}(u)) − F_n(F0^{-1}(u))| = sup_{u∈[0,1]} n^{1/2} |u − G_n(u)|.
It is obvious that the right-hand side of this expression does not depend on
the original model. Actually, it is for fixed n a precisely described random
variable, and so its distribution only depends on n . It only remains to show
that this distribution for large n is close to some fixed limit distribution with
a continuous c.d.f. allowing for a choice of a proper critical value. We indicate
the main steps of the proof.
Given a sample U_1, …, U_n, define the random function
ξ_n(u) := n^{1/2} ( u − G_n(u) ).
Clearly T_n = sup_{u∈[0,1]} |ξ_n(u)|. Next, convergence of the random functions ξ_n(·) would imply the convergence of their maximum over u ∈ [0, 1], because the maximum is a continuous functional of a function. Finally, the weak convergence ξ_n(·) →^w ξ(·) can be checked by showing that for any continuous function h(u) it holds
⟨ξ_n, h⟩ := n^{1/2} ∫_0^1 h(u)( u − G_n(u) ) du →^w ⟨ξ, h⟩ := ∫_0^1 h(u) ξ(u) du.
Now the result can be derived from the representation
⟨ξ_n, h⟩ = n^{1/2} ∫_0^1 h(u)( u − G_n(u) ) du = n^{-1/2} Σ_{i=1}^n ( m(h) − H(U_i) ),
where H(v) := ∫_v^1 h(u) du and m(h) := ∫_0^1 u h(u) du = IE H(U_1), and from the central limit theorem for a sum of i.i.d. random variables.
The case of a composite hypothesis.
If H0 : P ∈ (Pθ , θ ∈ Θ0 ) , then the test statistic is described by (8.5). As
we already mentioned, testing of a composite hypothesis can be viewed as
e0 and in the
a two-step procedure. In the first step, θ is estimated by θ
second step, the goodness-of-fit test based on Tn is carried out, where F0 is
replaced by the c.d.f. corresponding to Pθe0 . It turns out that pivotality of the
distribution of Tn is preserved if θ is a location and/or scale parameter, but a
general (asymptotic) theory allowing to derive tα analytically is not available.
Therefore, computer simulations are typically employed to approximate tα .
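The sketch below illustrates both cases with scipy.stats.kstest: the simple null with fully specified F0, and a composite Gaussian null where t_α (equivalently, a p-value) is approximated by the simulation recipe just described, refitting the parameters in every replication. The data and the number of replications are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(11)
Y = rng.normal(loc=0.2, scale=1.0, size=200)

# Simple null H0: P = N(0, 1); kstest returns sup_y |F_n(y) - F_0(y)| and an asymptotic p-value
print(kstest(Y, "norm", args=(0.0, 1.0)))

# Composite null H0: P = N(mu, sigma^2) for some (mu, sigma):
# plug in the estimates and calibrate the critical value by simulation
mu_hat, sigma_hat = Y.mean(), Y.std(ddof=1)
T_obs = kstest(Y, "norm", args=(mu_hat, sigma_hat)).statistic

B, n = 1000, len(Y)
T_sim = np.empty(B)
for b in range(B):
    Yb = rng.normal(mu_hat, sigma_hat, size=n)
    T_sim[b] = kstest(Yb, "norm", args=(Yb.mean(), Yb.std(ddof=1))).statistic
print("simulated p-value:", np.mean(T_sim >= T_obs))
```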
8.2.2 ω 2 test (Cramér-Smirnov-von Mises)
Here we briefly discuss another distance also based on the c.d.f. of the null
measure. Namely, define for a measure P on the real line with c.d.f. F
ρ(P_n, P) = ρ(F_n, F) = n ∫ ( F_n(y) − F(y) )² dF(y).   (8.8)
For the case of a simple hypothesis P = P0, the Cramér-Smirnov-von Mises (CSvM) test statistic is given by (8.8) with F = F0. This is another functional of the path of the random function n^{1/2}( F_n(y) − F0(y) ). The Kolmogorov test uses the maximum of this function, while the CSvM test uses the integral of this function squared. The property of pivotality is preserved for the CSvM test statistic as well.
Theorem 8.2.2. Let F0 be a continuous c.d.f. Then
T_n = n ∫ ( F_n(y) − F0(y) )² dF0(y) →^D η,
where η is a fixed random variable (the integral of a squared Brownian bridge on [0, 1]).
Proof. The idea of the proof is the same as in the case of the Kolmogorov-Smirnov test. First, the transformation by F0^{-1} translates the general case to the case of the uniform distribution on [0, 1]. Next, one can again use the functional convergence of the process ξ_n(u).
8.3 Partially Bayes tests and Bayes testing
In the above sections we mostly focused on the likelihood ratio testing approach. As in estimation theory, the LR approach is very general and possesses
some nice properties. This section briefly discusses some possible alternative
approaches including partially Bayes and Bayes approaches.
8.3.1 Partial Bayes approach and Bayes tests
Let Θ0 and Θ1 be two subsets of the parameter set Θ . We test the null
hypothesis H0 : θ ∗ ∈ Θ0 against the alternative H1 : θ ∗ ∈ Θ1 . The LR
approach compares the maximum of the likelihood process over Θ0 with the
similar maximum over Θ1 . Let now two measures π0 on Θ0 and π1 on Θ1
be given. Now instead of the maximum of L(Y , θ) we consider its weighted
sum (integral) over Θ0 (resp. Θ1 ) with weights π0 (θ) resp. π1 (θ) . More
precisely, we consider the value
T_{π0,π1} = ∫_{Θ1} L(Y, θ) π1(θ) λ(dθ) − ∫_{Θ0} L(Y, θ) π0(θ) λ(dθ).
Significantly positive values of this expression indicate that the hypothesis is
likely to be wrong.
Similarly, and more commonly used, we may define measures g0 and g1 such that
π(θ) = π0 g0(θ) for θ ∈ Θ0,   π(θ) = π1 g1(θ) for θ ∈ Θ1,
where π is a prior on the entire parameter space Θ and π_i := ∫_{Θi} π(θ) λ(dθ) = IP(Θ_i) for i = 0, 1. Then the Bayes factor for comparing H0 and H1 is given by
B_{0|1} = ( ∫_{Θ0} L(Y, θ) g0(θ) λ(dθ) ) / ( ∫_{Θ1} L(Y, θ) g1(θ) λ(dθ) ) = ( IP(Θ0 | Y) / IP(Θ1 | Y) ) / ( IP(Θ0) / IP(Θ1) ).   (8.9)
The representation of the Bayes factor on the right-hand side of (8.9) shows
that it can be interpreted as the ratio of the posterior odds ratio for H0 and
the prior odds ratio for H0 . The resulting test rejects the null hypothesis
for significantly small values of B0|1 , or, equivalently, for significantly large
values of B1|0 = 1/B0|1 . In the special case that H0 and H1 are two simple
hypotheses, i. e., Θ0 = {θ 0 } and Θ1 = {θ 1 } , the Bayes factor is simply given
by
B_{0|1} = L(Y, θ0) / L(Y, θ1);
hence, in such a case the testing approach based on the Bayes factor is equivalent to the LR approach.
8.3.2 Bayes approach
Within the Bayes approach the true data distribution and the true parameter
value are not defined. Instead one considers the prior and posterior distribution of the parameter. The parametric Bayes model can be represented as
Y | θ ∼ p(y | θ),   θ ∼ π(θ).
The posterior density p(θ | Y) can be computed via the Bayes formula:
p(θ | Y) = p(Y | θ) π(θ) / p(Y),
with the marginal density p(Y) = ∫_Θ p(Y | θ) π(θ) λ(dθ). The Bayes approach suggests, instead of checking the hypothesis about the location of the parameter θ, to look directly at the posterior distribution. Namely, one can construct
so-called credible sets which contain a prespecified fraction, say 1 − α , of the
mass of the whole posterior distribution. Then one can say that the probability for the parameter θ to lie outside of this credible set is at most α . So,
the testing problem in the frequentist approach is replaced by the problem of
confidence estimation for the Bayes method.
Example 8.3.1 (Example 5.2.2 continued). Consider again the situation of a Bernoulli product likelihood for Y = (Y_1, …, Y_n)^⊤ with unknown success probability θ. In Example 5.2.2 we saw that this family of likelihoods is conjugate to the family of Beta distributions as priors on [0, 1]. More specifically, if θ ∼ Beta(a, b), then θ | Y = y ∼ Beta(a + s, b + n − s), where s = Σ_{i=1}^n y_i denotes the observed number of successes. Under quadratic risk, the Bayes-optimal point estimate for θ is given by IE[θ | Y = y] = (a + s)/(a + b + n), and a credible interval can be constructed around this value by utilizing quantiles of the posterior Beta(a + s, b + n − s) distribution.
8.4 Score, rank and permutation tests
8.4.1 Score tests
Testing a composite null hypothesis H0 against a composite alternative H1
is in general a challenging problem, because only in some special cases uniformly (in θ ∈ H1 ) most powerful level α -tests exist. In all other cases, one
has to decide against which regions in H1 optimal power is targeted. One
class of procedures is given by locally best tests, optimizing power in regions
close to H0 . To formalize this class mathematically, one needs the concept of
differentiability in the mean.
Definition 8.4.1 (Differentiability in the mean). Let (Y, B(Y), (IP_θ)_{θ∈Θ}) denote a statistical model and assume that (IP_θ)_{θ∈Θ} is a family of measures dominated by µ, where Θ ⊆ IR. Then (Y, B(Y), (IP_θ)_{θ∈Θ}) is called differentiable in the mean in θ0 ∈ Θ̊ if a function g ∈ L1(µ) exists with
‖ t^{-1} ( dIP_{θ0+t}/dµ − dIP_{θ0}/dµ ) − g ‖_{L1(µ)} → 0   as t → 0.
The function g is called the L1(µ) derivative of θ ↦ IP_θ in θ0. In the sequel, we choose w.l.o.g. θ0 ≡ 0.
Theorem 8.4.1 (§18 in Hewitt and Stromberg (1975)). Under the assumptions of Definition 8.4.1, let θ0 = 0 and let densities be given by f_θ(y) := dIP_θ/dµ(y). Assume that there exists an open neighbourhood U of 0 such that for µ-almost all y the mapping U ∋ θ ↦ f_θ(y) is absolutely continuous, i.e., there exists an integrable function τ ↦ ḟ(y, τ) on U with
∫_{θ1}^{θ2} ḟ(y, τ) dτ = f_{θ2}(y) − f_{θ1}(y),   θ1 < θ2,
and assume that ∂/∂θ f_θ(y)|_{θ=0} = ḟ(y, 0) µ-almost everywhere. Furthermore, assume that for θ ∈ U the function y ↦ ḟ(y, θ) is µ-integrable with
∫ |ḟ(y, θ)| dµ(y) → ∫ |ḟ(y, 0)| dµ(y),   θ → 0.
Then θ ↦ IP_θ is differentiable in the mean in 0 with g = ḟ(·, 0).
Theorem 8.4.2. Under the assumptions of Definition 8.4.1, assume that the densities θ ↦ f_θ are differentiable in the mean in 0 with L1(µ) derivative g. Then
θ^{-1} log( f_θ(y)/f_0(y) ) = θ^{-1} ( log f_θ(y) − log f_0(y) )
converges for θ → 0 to L̇(y) (say) in IP_0-probability. We call L̇ the derivative of the (logarithmic) likelihood ratio, or score function. It holds that
L̇(y) = g(y)/f_0(y),   ∫ L̇ dIP_0 = 0,   {f_0 = 0} ⊆ {g = 0}
IP_0-almost surely.
Proof. θ^{-1}( f_θ/f_0 − 1 ) → g/f_0 converges in L1(IP_0) and, consequently, in IP_0-probability. The chain rule yields L̇(y) = g(y)/f_0(y). Noting that ∫ (f_θ − f_0) dµ = 0, we conclude
∫ L̇ dIP_0 = ∫ g dµ = 0.
Example 8.4.1. (a) Location parameter model:
Let Y = θ + X, θ ≥ 0, and assume that X has a density f which is absolutely continuous with respect to the Lebesgue measure and does not depend on θ. Then the densities θ ↦ f(y − θ) of Y under θ are differentiable in the mean in zero with score function L̇(y) = −f'(y)/f(y) (differentiation with respect to y).
(b) Scale parameter model:
Let Y = exp(θ)X and assume again that X has density f with the properties stated in part (a). Moreover, assume that ∫ |x f'(x)| dx < ∞. Then the densities θ ↦ exp(−θ) f(y exp(−θ)) of Y under θ are differentiable in the mean in zero with score function L̇(y) = −(1 + y f'(y)/f(y)).
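For a concrete check (not from the original text), take f to be the standard normal density: the location score is L̇(y) = y and the scale score is L̇(y) = y² − 1. The sketch below verifies both formulas against a finite-difference derivative of the log-density in θ; the step size and grid are arbitrary assumptions.

```python
import numpy as np

# Log-densities of Y under the two models of Example 8.4.1 with standard normal f
def log_density(y, theta, model):
    if model == "location":            # density f(y - theta)
        return -0.5 * (y - theta) ** 2 - 0.5 * np.log(2 * np.pi)
    else:                              # scale: exp(-theta) f(y exp(-theta))
        return -theta - 0.5 * (y * np.exp(-theta)) ** 2 - 0.5 * np.log(2 * np.pi)

y = np.linspace(-2, 2, 5)
h = 1e-6
# Exact scores: location Ldot(y) = y, scale Ldot(y) = y^2 - 1
for model, exact in (("location", y), ("scale", y ** 2 - 1)):
    numeric = (log_density(y, h, model) - log_density(y, 0.0, model)) / h
    print(model, np.allclose(numeric, exact, atol=1e-4))
```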
Lemma 8.4.1. Assume that the family θ ↦ IP_θ is differentiable in the mean with score function L̇ in θ0 = 0 and that c_i, 1 ≤ i ≤ n, are real constants. Then also θ ↦ ⊗_{i=1}^n IP_{c_i θ} is differentiable in the mean in zero, and has score function
(y_1, …, y_n)^⊤ ↦ Σ_{i=1}^n c_i L̇(y_i).
Exercise 8.4.1. Prove Lemma 8.4.1.
Definition 8.4.2 (Score test). Let θ ↦ IP_θ be differentiable in the mean in θ0 with score function L̇. Then every test φ of the form
φ(y) = 1 if L̇(y) > c,   φ(y) = γ if L̇(y) = c,   φ(y) = 0 if L̇(y) < c,
is called a score test. Here, γ ∈ [0, 1] denotes a randomization constant.
Definition 8.4.3 (Locally best test). Let (IP_θ)_{θ∈Θ} with Θ ⊆ IR denote a family which is differentiable in the mean in θ0 ∈ Θ̊. A test φ∗ with IE_{θ0}[φ∗] = α is called a locally best test among all tests with expectation α under θ0 for the test problem H0 = {θ0} versus H1 = {θ > θ0} if
(d/dθ) IE_θ[φ∗] |_{θ=θ0} ≥ (d/dθ) IE_θ[φ] |_{θ=θ0}
for all tests φ with IE_{θ0}[φ] = α.
Figure 8.1 illustrates the situation considered in Definition 8.4.3 graphically.
[Figure 8.1 shows the power functions IE_θ[φ] and IE_θ[φ∗] plotted against θ, both taking the value α at θ = θ0.]
Fig. 8.1. Locally best test φ∗ with expectation α under θ0
Theorem 8.4.3. Under the assumptions of Definition 8.4.3, the score test
φ(y) = 1 if L̇(y) > c(α),   φ(y) = γ if L̇(y) = c(α) (γ ∈ [0, 1]),   φ(y) = 0 if L̇(y) < c(α),
with IE_{θ0}[φ] = α is a locally best test for testing H0 = {θ0} against H1 = {θ > θ0}.
Proof. We notice that for any test φ, it holds that
(d/dθ) IE_θ[φ] |_{θ=θ0} = IE_{θ0}[φ L̇].
Hence, we have to optimize ∫ φ(y) L̇(y) IP_{θ0}(dy) with respect to φ under the level constraint, yielding the assertion in analogy to the argumentation in the proof of Theorem 6.2.1.
Theorem 8.4.3 shows that in the theory of locally best tests the score
function L̇ takes the role that the likelihood ratio has in the LR theory.
Notice that, for an i.i.d. sample Y = (Y_1, …, Y_n), the joint product measure IP_θ^n has score function (y_1, …, y_n) ↦ Σ_{i=1}^n L̇(y_i) according to Lemma 8.4.1, and Theorem 8.4.3 can be applied to test H0 = {θ0} against H1 = {θ > θ0} based on Y.
Moreover, for k-sample problems with k ≥ 2 groups and n jointly independent observations, Lemma 8.4.1 can be utilized to test the homogeneity hypothesis
H0 = { IP^{Y_1} = IP^{Y_2} = … = IP^{Y_n} : IP^{Y_1} continuous }.   (8.10)
To this end, one considers parametric families θ ↦ IP_{n,θ} which belong to H0 only in case of θ = 0, i.e., IP_{n,0} ∈ H0. For θ ≠ 0, IP_{n,θ} is a product measure with non-identical factors.
Example 8.4.2. (a) Regression model for a location parameter:
Let Y_i = c_i θ + X_i, 1 ≤ i ≤ n, where θ ≥ 0. In this, assume that the X_i are i.i.d. with Lebesgue density f which is independent of θ. Now, for a two-sample problem with n_1 observations in the first group and n_2 = n − n_1 observations in the second group, we set c_1 = c_2 = … = c_{n_1} = 1 and c_i = 0 for all n_1 + 1 ≤ i ≤ n. Under alternatives, the observations in the first group are shifted by θ > 0.
(b) Regression model for a scale parameter:
Let c_i, 1 ≤ i ≤ n, denote real regression coefficients and consider the model Y_i = exp(c_i θ)X_i, 1 ≤ i ≤ n, θ ∈ IR, where we assume again that the X_i are i.i.d. with Lebesgue density f which is independent of θ. Then it holds that
dIP_{n,θ}/dλ^n (y) = ∏_{i=1}^n exp(−c_i θ) f( y_i exp(−c_i θ) ).
Under θ0 = 0, the product measure IP_{n,0} belongs to H0, while under alternatives it does not.
8.4.2 Rank tests
In this section, we will consider the case that only the ranks of the observations are trustworthy (or available). Theorem 8.4.4 will be utilized to define
resulting rank tests based on parametric families IPn,θ as considered in Example 8.4.2. It turns out that the score test based on ranks has a very simple
structure.
Theorem 8.4.4. Let θ ↦ IP_θ denote a parametric family which is differentiable in the mean in θ0 = 0 with respect to some reference measure µ (L1(µ)-differentiable for short), with score function L̇. Furthermore, let S : Y → S be measurable. Then θ ↦ IP_θ^S is L1(µ^S)-differentiable with score function s ↦ IE_0[L̇ | S = s].
Proof. First, we show that the L1(µ^S) derivative of θ ↦ IP_θ^S is given by s ↦ IE_µ[g | S = s], where g is the L1(µ) derivative of θ ↦ IP_θ. To this end, notice that
dIP_θ^S/dµ^S (s) = IE_µ[ f_θ | S = s ],   where f_θ = dIP_θ/dµ.
Linearity of IE_µ[· | S] and transformation of measures lead to
∫ | θ^{-1}( dIP_θ^S/dµ^S − dIP_0^S/dµ^S )(s) − IE_µ[g | S = s] | dµ^S(s) ≤ ∫ IE_µ[ | θ^{-1}(f_θ − f_0) − g | | S ] dµ.
Applying Jensen's inequality and Vitali's theorem, we conclude that s ↦ IE_µ[g | S = s] is the L1(µ^S) derivative of θ ↦ IP_θ^S. Now, the chain rule yields that the score function of IP_θ^S in zero is given by
s ↦ IE_µ[g | S = s] { IE_µ[ dIP_0/dµ | S = s ] }^{-1},
and the assertion follows by substituting g = L̇ · dIP_0/dµ and verifying that
IE_0[L̇ | S] · IE_µ[ dIP_0/dµ | S ] = IE_µ[ L̇ · dIP_0/dµ | S ]   µ-almost surely.
For applying the latter theorem to rank statistics, we need to gather some
basic facts about ranks and order statistics.
Definition 8.4.4. Let y = (y1 , . . . , yn ) be a point in IRn . Assume that the
yi are pairwise distinct and denote their ordered values by y1:n < y2:n < . . . <
yn:n .
(a) For 1 ≤ i ≤ n, the integer ri ≡ ri(y) := #{j ∈ {1, . . . , n} : yj ≤ yi} is called the rank of yi (in y). The vector r(y) := (r1(y), . . . , rn(y)) ∈ Sn is called the rank vector of y.
(b) The inverse permutation d(y) = (d1(y), . . . , dn(y)) := [r(y)]^{−1} is called the vector of antiranks of y, and the integer di(y) is called the antirank of i (the index that corresponds to the i-th smallest observation in y).
Now, let Y1 , . . . , Yn with Yi : Yi → IR be stochastically independent, continuously distributed random variables and denote the joint distribution of
(Y1 , . . . , Yn ) by IP .
(c) Because of IP(∪_{i≠j} {Yi = Yj}) = 0, the following objects are IP-almost surely uniquely defined: Yi:n is called the i-th order statistic of Y = (Y1, . . . , Yn), Ri(Y) := n F̂n(Yi) = ri(Y1, . . . , Yn) is called the rank of Yi, R(Y) := (R1(Y), . . . , Rn(Y)) is called the vector of rank statistics of Y, Di(Y) := di(Y1, . . . , Yn) is called the antirank of i with respect to Y, and D(Y) := d(Y) is called the vector of antiranks of Y.
Lemma 8.4.2. Under the assumptions of Definition 8.4.4, it holds
(a) i = rdi = dri , yi = yri :n , yi:n = ydi .
(b) If Y1 , . . . , Yn are exchangeable random variables, then R(Y ) is uniformly
distributed on Sn , i. e., IP (R(Y ) = σ) = 1/n! for all permutations σ =
(r1 , . . . , rn ) ∈ Sn .
(c) If U1 , . . . , Un are i.i.d. with U1 ∼ UNI[0, 1] , and Yi = F −1 (Ui ) , 1 ≤ i ≤
n , for some distribution function F , then it holds Yi:n = F −1 (Ui:n ) . If
F is continuous, then it holds R(Y ) = R(U1 , . . . , Un ) .
(d) If (Y1, . . . , Yn) are i.i.d. with c.d.f. F of Y1, then we have
(i) IP(Yi:n ≤ y) = Σ_{j=i}^n \binom{n}{j} F(y)^j (1 − F(y))^{n−j}.
(ii) dIP^{Yi:n}/dIP^{Y1}(y) = n \binom{n−1}{i−1} F(y)^{i−1} (1 − F(y))^{n−i}. If IP^{Y1} has Lebesgue density f, then IP^{Yi:n} has Lebesgue density fi:n, given by fi:n(y) = n \binom{n−1}{i−1} F(y)^{i−1} (1 − F(y))^{n−i} f(y).
(iii) Letting µ := IP^{Y1}, (Yi:n)_{1≤i≤n} has the joint µ^n-density (y1, . . . , yn) ↦ n! 1I{y1<y2<...<yn}. If µ has Lebesgue density f, then (Yi:n)_{1≤i≤n} has λ^n-density (y1, . . . , yn) ↦ n! ∏_{i=1}^n f(yi) 1I{y1<y2<...<yn}.
Remark 8.4.1. Lemma 8.4.2.(c) (quantile transformation) shows the special
importance of the distribution of order statistics of i.i.d. UNI[0, 1] -distributed
random variables U1 , . . . , Un . According to Lemma 8.4.2.(d), the order statistic Ui:n has a Beta (i, n − i + 1) distribution with IE[Ui:n ] = i/(n + 1) and
Var(Ui:n ) = (i(n − i + 1))/((n + 1)2 (n + 2)) . For computing the joint distribution function of (U1:n , . . . , Un:n ) , efficient recursive algorithms exist, for
instance Bolshev’s recursion and Steck’s recursion (see Shorack and Wellner
(1986), p. 362 ff.).
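As a quick numerical illustration of these facts (our own sketch, not part of the text), one can check the Beta(i, n − i + 1) law of Ui:n against a Monte Carlo sample of uniform order statistics.

```python
# Check of Remark 8.4.1: U_{i:n} ~ Beta(i, n-i+1), so mean and variance should
# match i/(n+1) and i(n-i+1)/((n+1)^2 (n+2)).
import numpy as np
from scipy.stats import beta

n, i = 10, 3
print(beta.mean(i, n - i + 1), i / (n + 1))
print(beta.var(i, n - i + 1), i * (n - i + 1) / ((n + 1) ** 2 * (n + 2)))

# Monte Carlo confirmation via simulated order statistics
rng = np.random.default_rng(1)
u_sorted = np.sort(rng.uniform(size=(100_000, n)), axis=1)
print(u_sorted[:, i - 1].mean(), u_sorted[:, i - 1].var())
```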
Theorem 8.4.5. Let Y = (Y1 , . . . , Yn ) be real-valued i.i.d. random variables
with continuous µ = IP Y1 .
(a) The random vectors R(Y ) and (Yi:n )1≤i≤n are stochastically independent.
(b) Let T : IRn → IR denote a mapping such that the statistic T (Y ) is
integrable. For any σ = (r1 , . . . , rn ) ∈ Sn , it holds
IE[T(Y) | R(Y) = σ] = IE[T((Yri:n)_{1≤i≤n})].
Proof. For proving part (a), let σ = (r1, . . . , rn) ∈ Sn and Borel sets Ai, 1 ≤ i ≤ n, be arbitrary but fixed and define (d1, . . . , dn) := σ^{−1}. We note that R(Y) = σ if and only if Yd1 < Yd2 < . . . < Ydn and that Ydi = Yi:n ∈ Ai if and only if Yi ∈ Ari.
Define B := {y ∈ IR^n : y1 < y2 < . . . < yn}. Then we obtain that

IP(R(Y) = σ, ∀1 ≤ i ≤ n : Yi:n ∈ Ai)
= IP(∀1 ≤ i ≤ n : Ydi ∈ Ai, (Ydi)_{1≤i≤n} ∈ B)
= ∫_{×_{i=1}^n Ari} 1I_B(yd1, . . . , ydn) dµ^n(y1, . . . , yn)
= ∫_{×_{i=1}^n Ai} 1I_B(y1, . . . , yn) dµ^n(y1, . . . , yn),

because µ^n is invariant under the transformation (y1, . . . , yn) ↦ (yd1, . . . , ydn) due to exchangeability.
Summation over all σ ∈ Sn yields

IP(∀1 ≤ i ≤ n : Yi:n ∈ Ai) = n! ∫_{×_{i=1}^n Ai} 1I_B(y1, . . . , yn) dµ^n(y1, . . . , yn).

Making use of Lemma 8.4.2.(b), we conclude

IP(R(Y) = σ, ∀1 ≤ i ≤ n : Yi:n ∈ Ai) = IP(R(Y) = σ) IP(∀1 ≤ i ≤ n : Yi:n ∈ Ai),

hence the assertion of part (a). For proving part (b), we verify

IE[T(Y) | R(Y) = σ] = ∫_{ {R(Y)=σ} } T(Y) dIP / IP(R(Y) = σ)
= IE[T((Yri:n)_{1≤i≤n}) | R(Y) = σ]
= IE[T((Yri:n)_{1≤i≤n})],

where we used that Y = (Yri:n)_{i=1}^n if R(Y) = σ in the second line and part (a) in the third line.
Now we are ready to apply Theorem 8.4.4 to vectors of rank statistics.
Corollary 8.4.1. Let (IPθ)_{θ∈Θ} with Θ ⊆ IR denote an L1(µ)-differentiable family with score function L̇ in θ0 = 0. Let Y = (Y1, . . . , Yn) be a sample from IPn,θ = ⊗_{i=1}^n IPciθ. Then, the family θ ↦ IP^R_{n,θ} of distributions of the rank vector R(Y) has score function

σ = (r1, . . . , rn) ↦ IEn,0[ Σ_{i=1}^n ci L̇(Yi) | R(Y) = σ ]
= Σ_{i=1}^n ci IEn,0[ L̇(Yi) | R(Y) = σ ]
= Σ_{i=1}^n ci IEn,0[ L̇(Yri:n) ] = Σ_{i=1}^n ci a(ri)

with IEn,0 denoting the expectation with respect to IPn,0 and scores a(i) := IEn,0[L̇(Yi:n)].
Remark 8.4.2. (a) The test statistic T(Y) = Σ_{i=1}^n ci a(Ri(Y)) is called a linear rank statistic.
(b) The hypothesis H0 from (8.10) leads under conditioning on R(Y) to a simple null hypothesis on Sn, namely, the discrete uniform distribution on Sn, see Lemma 8.4.2.(b). Therefore, the critical value c(α) for the rank test φ = φ(R(Y)), given by

φ(y) = 1 if T(y) > c(α),  γ if T(y) = c(α),  0 if T(y) < c(α),

can be computed by traversing all possible permutations σ ∈ Sn and thereby determining the discrete distribution of T(Y) under H0. For large n, we can approximate c(α) by only traversing B < n! randomly chosen permutations σ ∈ Sn.
(c) For the scores, it holds Σ_{i=1}^n a(i) = 0. If L̇ is isotone, then it holds a(1) ≤ a(2) ≤ . . . ≤ a(n).
(d) Due to the relation Yi:n =_D F^{−1}(Ui:n) (equality in distribution), the scores are often given in the form a(i) = IE[L̇ ◦ F^{−1}(Ui:n)] and the function L̇ ◦ F^{−1} is called the score-generating function. For large n, one can approximately work with b(i) := L̇ ◦ F^{−1}(i/(n + 1)) (since IE[Ui:n] = i/(n + 1), see Remark 8.4.1) or with b̃(i) = n ∫_{(i−1)/n}^{i/n} L̇ ◦ F^{−1}(u) du instead of a(i).
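The following minimal sketch (our own helper names, not from the text) computes a linear rank statistic with Wilcoxon-type scores and approximates the critical value c(α) by B randomly chosen permutations, exactly in the spirit of part (b).

```python
# A sketch of Remark 8.4.2: T = sum_i c_i a(R_i(Y)) and a Monte Carlo
# approximation of c(alpha) from B random permutations of the scores.
import numpy as np

def ranks(y):
    r = np.empty(len(y), dtype=int)
    r[np.argsort(y)] = np.arange(1, len(y) + 1)   # r[i] = rank of y[i]
    return r

def linear_rank_statistic(y, c, score):
    return float(np.sum(c * score[ranks(y) - 1]))

def permutation_critical_value(c, score, alpha=0.05, B=10_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Under H0, R(Y) is uniform on S_n, so the scores may be permuted at random.
    stats = np.array([np.sum(c * rng.permutation(score)) for _ in range(B)])
    return np.quantile(stats, 1 - alpha)

n1, n2 = 25, 30
n = n1 + n2
c = np.concatenate([np.ones(n1), np.zeros(n2)])
i = np.arange(1, n + 1)
score = 2 * i / (n + 1) - 1            # Wilcoxon-type scores (logistic Ldot o F^{-1})

rng = np.random.default_rng(2)
y = np.concatenate([0.8 + rng.standard_normal(n1), rng.standard_normal(n2)])
print(linear_rank_statistic(y, c, score), permutation_critical_value(c, score, rng=rng))
```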
In the case that the score function is isotone, rank tests can also be used
to test for stochastically ordered distributions in two-sample problems.
Lemma 8.4.3 (Two-sample problems with stochastically ordered distributions). Assume that a(1) ≤ a(2) ≤ . . . ≤ a(n) , cf. Remark 8.4.2.(c),
and let φ denote a rank test at level α for H0 from (8.10), i.e., IEH0 [φ] = α .
Assume that Y1 , . . . , Yn1 are i.i.d. with c.d.f. F1 of Y1 and Yn1 +1 , . . . , Yn
i.i.d. with c.d.f. F2 of Yn1 +1 .
(a) If F1 ≥ F2 , then IE[φ] ≤ α .
(b) If F1 ≤ F2 , then IE[φ] ≥ α .
Proof. Lemma 4.4 in Janssen (1998).
In location parameter models as considered in Example 8.4.2.(a), the score function is isotone if and only if the density f is strongly unimodal, i.e., log-concave. The following
example discusses some specific instances of such densities and derives the
corresponding rank tests.
Example 8.4.3. [Two-sample rank tests in location parameter models with "stochastically larger" alternatives]
(i) Fisher-Yates test:
Let f denote the density of N(0, 1) . Then it holds L̇(y) = y and we
obtain
T = Σ_{i=1}^{n1} a(Ri)   with   a(i) = IE[Yi:n].
Here, Yi:n denotes the i-th order statistic of i.i.d. random variables
Y1 , . . . , Yn with Y1 ∼ N(0, 1) .
(ii) Van der Waerden test:
Let f be as in part (i). The score-generating function is given by u ↦ Φ^{−1}(u). Following Remark 8.4.2.(d), b(i) = Φ^{−1}(i/(n + 1)) are approximate scores, leading to the test statistic

T = Σ_{i=1}^{n1} Φ^{−1}( Ri/(n + 1) ).
(iii) Wilcoxon’s rank sum test:
Let f be the density of the logistic distribution, given by f (y) =
exp(−y)(1+exp(−y))−2 with corresponding cdf F (y) = (1+exp(−y))−1 .
The score-generating function is in this case given by u 7→ 2u − 1 , leading
to the scores
a(i) = IE[L̇ ◦ F^{−1}(Ui:n)] = 2i/(n + 1) − 1.
These scores are an affine transformation of the identity and therefore,
the test can equivalently be carried out by means of the test statistic
T = Σ_{i=1}^{n1} Ri(Y),
which is the sum of the ranks in the first group.
(iv) Median test:
The Lebesgue density of the Laplace distribution is given by f (y) =
exp(− |y|)/2 , with induced score-generating function u 7→ sgn(ln(2u)) =
sgn(2u − 1) . Approximate scores are therefore given by
b(i) = L̇ ◦ F^{−1}( i/(n + 1) ) = 1 if i > (n + 1)/2,  0 if i = (n + 1)/2,  −1 if i < (n + 1)/2.
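The two-sample statistics above are easy to compute in practice; the following short sketch (assumed helper names, not from the text) evaluates the van der Waerden and Wilcoxon rank-sum statistics for a layout with n1 observations in the first group.

```python
# Van der Waerden and Wilcoxon rank-sum statistics of Example 8.4.3.
import numpy as np
from scipy.stats import norm

def ranks(y):
    r = np.empty(len(y), dtype=int)
    r[np.argsort(y)] = np.arange(1, len(y) + 1)
    return r

def van_der_waerden_stat(y, n1):
    n = len(y)
    return float(np.sum(norm.ppf(ranks(y)[:n1] / (n + 1))))   # sum of Phi^{-1}(R_i/(n+1))

def wilcoxon_rank_sum(y, n1):
    return float(np.sum(ranks(y)[:n1]))                       # sum of ranks in the first group

rng = np.random.default_rng(3)
y = np.concatenate([0.5 + rng.standard_normal(20), rng.standard_normal(25)])
print(van_der_waerden_stat(y, 20), wilcoxon_rank_sum(y, 20))
```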
We conclude this section with the Savage test (or log-rank test), an example for a scale parameter test, cf. Example 8.4.2.(b).
Example 8.4.4. Under the scale parameter model considered in Example 8.4.2.(b), assume that X is exponentially distributed with density f(x) = exp(−x) 1I_{(0,∞)}(x). Then we obtain for y > 0 the score function

L̇(y) = −( 1 + y f′(y)/f(y) ) = y − 1.
Exercise 8.4.2. Show that for i.i.d. random variables Y1 , . . . , Yn with Y1 ∼
Exp(1) , it holds
IE[Yi:n] = Σ_{j=1}^{i} 1/(n + 1 − j).
Making use of the latter result, exact scores are given by
a(i) = Σ_{j=1}^{i} 1/(n + 1 − j) − 1.
Since X is almost surely positive, the model Y = exp(θ)X can be transformed into the location parameter model log(Y ) = θ + log(X) . For X ∼
Exp(1) , it holds that log(X) possesses a reflected Gumbel distribution, satisfying
IP( log(X) ≤ x ) = 1 − exp(−exp(x)),   x ∈ IR.
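A short sketch (helper names are ours) of the exact Savage scores a(i) = Σ_{j≤i} 1/(n+1−j) − 1 and of the corresponding linear rank statistic for the first group of a two-sample scale problem:

```python
# Exact Savage (log-rank) scores from Example 8.4.4 and the rank statistic.
import numpy as np

def savage_scores(n):
    inv = 1.0 / (n + 1 - np.arange(1, n + 1))      # 1/n, 1/(n-1), ..., 1/1
    return np.cumsum(inv) - 1.0                    # a(i) = sum_{j<=i} 1/(n+1-j) - 1

def ranks(y):
    r = np.empty(len(y), dtype=int)
    r[np.argsort(y)] = np.arange(1, len(y) + 1)
    return r

rng = np.random.default_rng(4)
n1, n2 = 15, 20
y = np.concatenate([np.exp(0.5) * rng.exponential(size=n1), rng.exponential(size=n2)])
a = savage_scores(n1 + n2)
T = float(np.sum(a[ranks(y)[:n1] - 1]))            # linear rank statistic with Savage scores
print(a[:5], T)
```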
8.4.3 Permutation tests
Permutation tests can be regarded as special instances of rank tests for k-sample problems.
Example 8.4.5 (Two-sample problem in Gaussian location parameter model).
Let (Yi )1≤i≤n denote real-valued, stochastically independent random variables, where Y1 , . . . , Yn1 are i.i.d. with Y1 ∼ F1 and Yn1 +1 , . . . , Yn are i.i.d.
with Yn1 +1 ∼ F2 . Assume that the test problem of interest is given by
H0 : {F1 = F2 }
versus
H1 : {F1 6= F2 }.
(8.11)
In the special case that F1 and F2 are Gaussian cdfs which only differ in
their means, one would compare the empirical group means to carry out a test
for problem (8.11). More specifically, we let n2 := n − n1 and define the group means by Ȳn1 := n1^{−1} Σ_{i=1}^{n1} Yi and Ȳn2 := n2^{−1} Σ_{j=n1+1}^{n} Yj, assuming that 0 < n1 < n. The test statistic of the resulting two-sample Z-test is then given by T̃ := |Ȳn1 − Ȳn2|, and the test for (8.11) can easily be calibrated by noticing that Ȳn1 − Ȳn2 is again normally distributed under H0.
However, in the case of general F1 and F2, exact distributional results for T̃ are difficult to obtain. Assuming that F1 and F2 are continuous, we
consider more general statistics of the form
T = Σ_{i=1}^n ci g(Yi) = Σ_{i=1}^n c_{Di(Y)} g(Yi:n)   (8.12)
for a given function g : IR → IR and real numbers (ci )1≤i≤n .
The representation of T on the right-hand side of (8.12) establishes the
connection to rank tests. For example, |T| equals T̃ if we choose g = id, ci = n1^{−1} for i ≤ n1 and ci = −n2^{−1} for i > n1. Under H0 from (8.11), the antiranks D(Y) = (Di(Y))_{1≤i≤n} and the order statistics (Yi:n)_{1≤i≤n}
are stochastically independent, see Theorem 8.4.5. Due to this property, the
two-sample permutation test based on T can be carried out according to the
following resampling scheme.
Example 8.4.6 (Resampling scheme for a two-sample permutation test). The following resampling scheme is appropriate for a one-sided "stochastically larger" alternative. The two-sided case is obtained by obvious modifications.
(A) Consider the order statistics (Yi:n)_{1≤i≤n} and regard a(i) := g(Yi:n) as random scores.
(B) Denote by D̃ = (D̃i)_{1≤i≤n} a random vector which is uniformly distributed on Sn and let c = c(α, (Yi:n)_{1≤i≤n}) denote the (1 − α)-quantile of the discretely distributed random variable D̃ ↦ Σ_{i=1}^n c_{D̃i} a(i).
(C) The permutation test φ for testing (8.11) is then given by

φ = 1 if T > c,  γ if T = c,  0 if T < c,
where γ ∈ [0, 1] denotes a randomization constant.
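A minimal implementation sketch of this scheme (function names are ours, not from the text): the scores a(i) = g(Yi:n) are held fixed and the coefficients are permuted; the critical value is the (1 − α)-quantile of the permutation distribution, approximated by B random permutations.

```python
# Two-sample permutation test following the scheme of Example 8.4.6.
import numpy as np

def two_sample_permutation_test(y, n1, g=lambda t: t, alpha=0.05, B=20_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    c = np.concatenate([np.full(n1, 1.0 / n1), np.full(n - n1, -1.0 / (n - n1))])
    a = g(np.sort(y))                              # random scores a(i) = g(Y_{i:n})
    T = float(np.sum(c * g(np.asarray(y))))        # observed statistic, cf. (8.12)
    perm_stats = np.array([np.sum(rng.permutation(c) * a) for _ in range(B)])
    crit = np.quantile(perm_stats, 1 - alpha)      # conditional critical value
    return T, crit, T > crit

rng = np.random.default_rng(5)
y = np.concatenate([0.6 + rng.standard_normal(30), rng.standard_normal(35)])
print(two_sample_permutation_test(y, 30, rng=rng))
```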
Remark 8.4.3. If we choose g = id and (cj )1≤j≤n as in Example 8.4.5, leading
to |T| = T̃, then the test φ from Example 8.4.6 is called Pitman's permutation test, see Pitman (1937).
The permutation test principle can be adapted to test the more general null
hypothesis
H0 : Y1, . . . , Yn are i.i.d.   (8.13)
In the generalized form, the Yj : 1 ≤ j ≤ n are not even restricted to be
real-valued. The modified resampling scheme is given as follows.
Example 8.4.7 (Modified resampling scheme for general permutation tests.).
(A) Consider n random variates Yj , 1 ≤ j ≤ n with values in some space Y
and a real-valued test statistic T = T (Y1 , . . . , Yn ) .
(B) In the remainder, consider permutations π with values in Sn which are
independent of Y1 , . . . , Yn .
(C) Denote by Q0 the uniform distribution on Sn and let c = c(Y1 , . . . , Yn )
denote the (1 − α) -quantile of t 7→ Q0 ({π ∈ Sn : T (Yπ(1) , . . . , Yπ(n) ) ≤
t}) .
(D) The modified permutation test φ̃ for testing (8.13) is then given by

φ̃ = 1 if T > c,  γ if T = c,  0 if T < c.
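A sketch of this modified scheme for an arbitrary real-valued statistic T of possibly non-real-valued data (all names below are illustrative): the permutation distribution of T(Yπ(1), . . . , Yπ(n)) is approximated by B random permutations drawn independently of the data.

```python
# General permutation test following Example 8.4.7.
import numpy as np

def general_permutation_test(y, statistic, alpha=0.05, B=10_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    T = statistic(y)
    perm_stats = np.array([statistic(y[rng.permutation(len(y))]) for _ in range(B)])
    crit = np.quantile(perm_stats, 1 - alpha)
    return T, crit, T > crit

# Example: difference of group means as the statistic for a two-sample layout.
rng = np.random.default_rng(6)
n1 = 30
y = np.concatenate([0.5 + rng.standard_normal(n1), rng.standard_normal(40)])
stat = lambda z: float(np.mean(z[:n1]) - np.mean(z[n1:]))
print(general_permutation_test(y, stat, rng=rng))
```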
Theorem 8.4.6. Under the respective assumptions, the permutation test φ defined in Example 8.4.6 and the modified permutation test φ̃ defined in Example 8.4.7 are, under the null hypothesis H0 from (8.11) or (8.13), respectively, tests of exact level α for any fixed n ∈ N.
Proof. Conditionally on the order statistics (Example 8.4.6) or on the data themselves (Example 8.4.7), the critical value c and the randomization constant γ are chosen such that

IE_{L(D̃)}[φ | Y = y] = IE_{Q0}[φ̃ | Y = y] = α

holds true. Furthermore, the antiranks D(Y) are under H0 from (8.11) stochastically independent of the order statistics. Analogously, the random permutations π are chosen stochastically independent of (Y1, . . . , Yn) in the case of φ̃. The result of the theorem follows by averaging with respect to the distribution of Y.
8.5 Historical remarks and further reading
The Kolmogorov-Smirnov test goes back to Kolmogorov (1933) and Smirnov
(1948). The origins of the ω 2 test can be traced back to Cramér (1928) and
the German lecture notes by von Mises (1931). The limiting ω 2 distribution
has been derived in the work by Smirnov (1937). A comprehensive resource
for (nonparametric) goodness-of-fit tests is the book edited by D’Agostino and
Stephens (1986).
The concept of Bayes factors goes back to Jeffreys (1935) and is treated
comprehensively by Kass and Raftery (1995). Bayesian approaches to hypothesis testing are discussed in Section 4.3.3 of Berger (1985); see also the
references therein.
Our treatment of score tests mainly follows Janssen (1998). The theory
of rank tests is developed in the textbook by Hájek and Šidák (1967). The
classical reference for permutation tests is Pitman (1937). Recent textbook
and monograph references on the subject are Good (2005), Edgington and
Onghena (2007), and Pesarin and Salmaso (2010).
9 Deviation probability for quadratic forms
9.1 Introduction
This chapter presents a number of deviation probability bounds for a quadratic
form kξk2 or more generally kBξk2 of a random p vector ξ satisfying a general exponential moment condition. Such quadratic forms arise in many applications. Baraud (2010) lists some statistical tasks relying on such deviation
bounds including hypothesis testing for linear models or linear model selection. We also refer to Massart (2007) for an extensive overview and numerous
results on probability bounds and their applications in statistical model selection. Limit theorems for quadratic forms can be found e.g. in Götze and
Tikhomirov (1999) and Horváth and Shao (1999). Some concentration bounds
for U-statistics are available in Bretagnolle (1999), Giné et al. (2000), Houdré
and Reynaud-Bouret (2003). Most of these results assume that the components of
the vector ξ are independent and bounded.
Hsu et al. (2012) study the tail behavior of the quadratic form under the
condition of sub-Gaussianity of the random vector ξ and show that the deviation probabilities are essentially the same as in the Gaussian case. However,
the assumption that the vector ξ has finite exponential moments of arbitrary
order is quite strict and is not fulfilled in many applications. A particular
example is given by the Poisson and exponential cases. In the present work
we only suppose that some exponential moments of ξ are finite. This makes
the problem much more involved and requires new approaches and tools.
If ξ is standard normal then kξk2 is chi-squared with p degrees of freedom. We aim to extend this behavior to the case of a general vector ξ satisfying the following exponential moment condition:
log IE exp(γ⊤ξ) ≤ kγk2/2,   γ ∈ IR^p, kγk ≤ g.   (9.1)
Here g is a positive constant which appears to be very important in our
results. Namely, it determines the frontier between the Gaussian and non-Gaussian type deviation bounds. Our first result shows that under (9.1) the
deviation bounds for the quadratic form kξk2 are essentially the same as in
the Gaussian case, if the value g2 exceeds Cp for a fixed constant C . Further
we extend the result to the case of a more general form kBξk2 . An important
advantage of the presented approach which makes it different from all the previous studies is that no additional conditions on the structure or origin of the vector ξ are imposed. For instance, we do not assume that ξ is a sum of independent or weakly dependent random variables, or that the components of ξ are independent. The results are exact, stated in a non-asymptotic fashion; all the constants are explicit and the leading terms are sharp.
As a motivating example, we consider a linear regression model Y = Ψ⊤θ* + ε in which Y is an n-vector of observations, ε is the vector of errors with zero mean, and Ψ is a p × n design matrix. The ordinary least squares estimator θ̃ for the parameter vector θ* ∈ IR^p reads as

θ̃ = (ΨΨ⊤)^{−1} ΨY

and it can be viewed as the maximum likelihood estimator in a Gaussian linear model with a diagonal covariance matrix, that is, Y ∼ N(Ψ⊤θ, σ2 In). Define the p × p matrix

D02 := ΨΨ⊤.

Then

D0(θ̃ − θ*) = D0−1 ζ

with ζ := Ψε. The likelihood ratio test statistic for this problem is exactly kD0−1ζk2/2. Similarly, the model selection procedure is based on comparing
such quadratic forms for different matrices D0 ; see e.g. Baraud (2010).
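The chi-squared behavior of this quadratic form is easy to see in a simulation; the following sketch (ours, not from the text) generates a design, computes kD0−1ζk2 for standard Gaussian errors and compares its tail with the χ²_p tail.

```python
# For Y = Psi^T theta* + eps with standard Gaussian errors, the quantity
# ||D0^{-1} zeta||^2 with D0^2 = Psi Psi^T and zeta = Psi eps is chi^2_p.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
p, n, n_rep = 5, 200, 20_000
Psi = rng.standard_normal((p, n))                 # p x n design matrix
D0_sq = Psi @ Psi.T                               # D0^2 = Psi Psi^T

stats = np.empty(n_rep)
for k in range(n_rep):
    eps = rng.standard_normal(n)
    zeta = Psi @ eps
    stats[k] = zeta @ np.linalg.solve(D0_sq, zeta)   # zeta^T D0^{-2} zeta

z = chi2.ppf(0.95, df=p)
print((stats > z).mean(), chi2.sf(z, df=p))       # empirical vs exact tail
```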
Now we indicate how this situation can be reduced to a bound for a vector
ξ satisfying the condition (9.1). Suppose for simplicity that the entries εi of
the error vector ε are independent and have exponential moments.
(e1) There exist some constants ν0 and g1 > 0, and for every i a constant si such that IE(εi/si)2 ≤ 1 and

log IE exp(λεi/si) ≤ ν02 λ2/2,   |λ| ≤ g1.   (9.2)
Here g1 is a fixed positive constant. One can show that if this condition is fulfilled for some g1 > 0 and a constant ν0 ≥ 1, then one can get a similar condition with ν0 arbitrarily close to one and g1 slightly decreased. A natural candidate for si is σi where σi2 = IEεi2 is the variance of εi. Under (9.2), introduce a p × p matrix V0 defined by

V02 := Σ_i si2 Ψi Ψi⊤,
where Ψ1 , . . . , Ψn ∈ IRp are the columns of the matrix Ψ . Define also
ξ = V0−1 Ψε,   N−1/2 := max_i sup_{γ∈IR^p} si |Ψi⊤γ| / kV0γk.

A simple calculation shows that for kγk ≤ g = g1 N1/2

log IE exp(γ⊤ξ) ≤ ν02 kγk2/2,   γ ∈ IR^p, kγk ≤ g.
We conclude that (9.1) is nearly fulfilled under (e1 ) and moreover, the value
g2 is proportional to the effective sample size N. The results below allow one to obtain a nearly χ2-behavior of the test statistic kξk2, which is a finite-sample version of the famous Wilks phenomenon; see e.g. Fan et al. (2001); Fan and
Huang (2005), Boucheron and Massart (2011).
Section 9.2 recalls the classical results about the deviation probability of a Gaussian quadratic form. These results are presented only for comparison and to make the presentation self-contained.
Section 9.3 studies probabilities of the form IP(kξk > y) under the condition

log IE exp(γ⊤ξ) ≤ ν02 kγk2/2,   γ ∈ IR^p, kγk ≤ g.

The general case can be reduced to ν0 = 1 by rescaling ξ and g:

log IE exp(γ⊤ξ/ν0) ≤ kγk2/2,   γ ∈ IR^p, kγk ≤ ν0 g,
that is, ν0−1 ξ fulfills (9.1) with a slightly increased g .
The obtained result is extended to the case of a general quadratic form
in Section 9.4. Some more extensions motivated by different statistical problems are given in Section 9.6 and Section 9.7. They include the bound with
sup-norm constraint and the bound under Bernstein conditions. Among the
statistical problems demanding such bounds is estimation of the regression
model with Poissonian or bounded random noise. More examples can be found
in Baraud (2010). All the proofs are collected in Section 9.8.
9.2 Gaussian case
Our benchmark will be a deviation bound for kξk2 for a standard Gaussian
vector ξ . The ultimate goal is to show that under (9.1) the norm of the vector
ξ exhibits behavior expected for a Gaussian vector, at least in the region of
moderate deviations. For comparison, we begin by stating the
result for a Gaussian vector ξ . We use the notation a ∨ b for the maximum
of a and b , while a ∧ b = min{a, b} .
Theorem 9.2.1. Let ξ be a standard normal vector in IRp . Then for any
u > 0 , it holds
IP( kξk2 > p + u ) ≤ exp( −(p/2)φ(u/p) )

with

φ(t) := t − log(1 + t).

Let φ−1(·) stand for the inverse of φ(·). For any x,

IP( kξk2 > p + p φ−1(2x/p) ) ≤ exp(−x).

This particularly yields with κ = 6.6

IP( kξk2 > p + √(κxp) ∨ (κx) ) ≤ exp(−x).
This is a simple version of a well-known result and we present it only for comparison with the non-Gaussian case. The message of this result is that the squared norm of the Gaussian vector ξ concentrates around the value p and its deviation over the level p + √(xp) is exponentially small in x.
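A numerical illustration (our own sketch) of Theorem 9.2.1: the bound exp(−(p/2)φ(u/p)) versus the exact chi-squared tail IP(kξk2 > p + u).

```python
# Gaussian deviation bound of Theorem 9.2.1 versus the exact chi^2_p tail.
import numpy as np
from scipy.stats import chi2

def gaussian_deviation_bound(u, p):
    t = u / p
    return np.exp(-0.5 * p * (t - np.log1p(t)))   # exp(-(p/2) phi(u/p))

p = 10
for u in (5.0, 10.0, 20.0, 40.0):
    print(u, chi2.sf(p + u, df=p), gaussian_deviation_bound(u, p))
```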
A similar bound can be obtained for a norm of the vector Bξ where B
is some given deterministic matrix. For notational simplicity we assume that
B is symmetric. Otherwise one should replace it with (B⊤B)1/2.
Theorem 9.2.2. Let ξ be standard normal in IRp . Then for every x > 0
and any symmetric matrix B , it holds with p = tr(B 2 ) , v2 = 2 tr(B 4 ) , and
a∗ = kB 2 k∞
IP( kBξk2 > p + (2vx1/2) ∨ (6a∗x) ) ≤ exp(−x).
Below we establish similar bounds for a non-Gaussian vector ξ obeying
(9.1).
9.3 A bound for the ℓ2-norm
This section presents a general exponential bound for the probability IP(kξk > y) under (9.1). The main result tells us that if y is not too large, namely if y ≤ yc with yc2 of the order g2, then the deviation probability is essentially the same as in the Gaussian case.
To describe the value yc , introduce the following notation. Given g and
p , define the values w0 = gp−1/2 and wc by the equation
wc(1 + wc) / (1 + wc2)1/2 = w0 = g p−1/2.   (9.3)
It is easy to see that w0/√2 ≤ wc ≤ w0. Further define

µc := wc2/(1 + wc2),   yc := √((1 + wc2)p),   xc := 0.5 p [ wc2 − log(1 + wc2) ].   (9.4)

Note that for g2 ≥ p, the quantities yc and xc can be evaluated as yc2 ≥ wc2 p ≥ g2/2 and xc ≳ p wc2/2 ≥ g2/4.
Theorem 9.3.1. Let ξ ∈ IRp fulfill (9.1). Then it holds for each x ≤ xc
IP( kξk2 > p + √(κxp) ∨ (κx), kξk ≤ yc ) ≤ 2 exp(−x),

where κ = 6.6. Moreover, for y ≥ yc, it holds with gc = g − √(µc p) = g wc/(1 + wc)

IP( kξk > y ) ≤ 8.4 exp( −gc y/2 − (p/2) log(1 − gc/y) ) ≤ 8.4 exp( −xc − gc(y − yc)/2 ).
The statements of Theorem 9.3.1 can be simplified under the assumption g2 ≥ p.
Corollary 9.3.1. Let ξ fulfill (9.1) and g2 ≥ p . Then it holds for x ≤ xc
IP( kξk2 ≥ z(x, p) ) ≤ 2e−x + 8.4e−xc,   (9.5)

z(x, p) := p + √(κxp) for x ≤ p/κ,   and   z(x, p) := p + κx for p/κ < x ≤ xc,   (9.6)

with κ = 6.6. For x > xc

IP( kξk2 ≥ zc(x, p) ) ≤ 8.4e−x,   zc(x, p) := yc2 + 2(x − xc)/gc.

This result implicitly assumes that p ≤ κxc, which is fulfilled if w02 = g2/p ≥ 1:

κxc = 0.5κ [ w02 − log(1 + w02) ] p ≥ 3.3 [ 1 − log(2) ] p > p.
For x ≤ xc, the function z(x, p) mimics the quantile behavior of the chi-squared distribution χ2p with p degrees of freedom. Moreover, increasing the value g enlarges the sub-Gaussian zone. In particular, for g = ∞,
a general quadratic form kξk2 has under (9.1) the same tail behavior as in
the Gaussian case.
Finally, in the large deviation zone x > xc the deviation probability decays as e^{−c x^{1/2}} for some fixed c. However, if the constant g in the condition (9.1) is
sufficiently large relative to p , then xc is large as well and the large deviation
zone x > xc can be ignored at a small price of 8.4e−xc and one can focus on
the deviation bound described by (9.5) and (9.6).
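The critical quantities and thresholds of this section are easy to evaluate numerically; the following sketch (ours, under the notation of this section) solves (9.3) for wc by root finding and returns z(x, p) according to (9.5)–(9.6), switching to the large-deviation threshold for x > xc.

```python
# Critical quantities (9.3)-(9.4) and the threshold z(x, p) of Corollary 9.3.1.
import numpy as np
from scipy.optimize import brentq

KAPPA = 6.6

def critical_quantities(g, p):
    w0 = g / np.sqrt(p)
    f = lambda w: w * (1 + w) / np.sqrt(1 + w * w) - w0
    wc = brentq(f, 1e-12, w0 + 1.0)           # w0/sqrt(2) <= wc <= w0
    mu_c = wc ** 2 / (1 + wc ** 2)
    y_c = np.sqrt((1 + wc ** 2) * p)
    x_c = 0.5 * p * (wc ** 2 - np.log(1 + wc ** 2))
    g_c = g - np.sqrt(mu_c * p)
    return wc, mu_c, y_c, x_c, g_c

def z_threshold(x, g, p):
    wc, mu_c, y_c, x_c, g_c = critical_quantities(g, p)
    if x <= p / KAPPA:
        return p + np.sqrt(KAPPA * x * p)
    if x <= x_c:
        return p + KAPPA * x
    return y_c ** 2 + 2 * (x - x_c) / g_c     # large deviation zone x > x_c

print(critical_quantities(g=10.0, p=20))
print([z_threshold(x, g=10.0, p=20) for x in (1.0, 5.0, 50.0)])
```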
9.4 A bound for a quadratic form
Now we extend the result to a more general bound for kBξk2 = ξ⊤B2ξ with a given matrix B and a vector ξ obeying the condition (9.1). Similarly to the Gaussian case we assume that B is symmetric. Define the important characteristics of B

p = tr(B2),   v2 := 2 tr(B4),   λB := kB2k∞ = λmax(B2).

For simplicity of formulation we suppose that λB = 1, otherwise one has to replace p and v2 with p/λB and v2/λB.
Let g be as in (9.1). Define, similarly to the ℓ2-case, wc by the equation

wc(1 + wc) / (1 + wc2)1/2 = g p−1/2.

Define also µc = wc2/(1 + wc2) ∧ 2/3. Note that wc2 ≥ 2 implies µc = 2/3. Further define

yc2 = (1 + wc2)p,   2xc = µc yc2 + log det{IIp − µc B2}.   (9.7)

Similarly to the case with B = Ip, under the condition g2 ≥ p, one can bound yc2 ≥ g2/2 and xc ≳ g2/4.
Theorem 9.4.1. Let a random vector ξ in IRp fulfill (9.1). Then for each
x < xc
IP( kBξk2 > p + (2vx1/2) ∨ (6x), kBξk ≤ yc ) ≤ 2 exp(−x).

Moreover, for y ≥ yc, with gc = g − √(µc p) = g wc/(1 + wc), it holds

IP( kBξk > y ) ≤ 8.4 exp( −xc − gc(y − yc)/2 ).
Now we describe the value z(x, B) ensuring a small value for the large deviation probability IP( kBξk2 > z(x, B) ). For ease of formulation, we suppose that g2 ≥ 2p, yielding µc−1 ≤ 3/2. The other case can be easily adjusted.
Corollary 9.4.1. Let ξ fulfill (9.1) with g2 ≥ 2p . Then it holds for x ≤ xc
with xc from (9.7):
IP( kBξk2 ≥ z(x, B) ) ≤ 2e−x + 8.4e−xc,

z(x, B) := p + 2vx1/2 for x ≤ v/18,   and   z(x, B) := p + 6x for v/18 < x ≤ xc.   (9.8)

For x > xc

IP( kBξk2 ≥ zc(x, B) ) ≤ 8.4e−x,   zc(x, B) := yc2 + 2(x − xc)/gc.
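The moderate-deviation part of this bound can be checked by simulation with a Gaussian ξ, which satisfies (9.1) with g = ∞; the sketch below (ours) computes p = tr(B2), v2 = 2 tr(B4) and the threshold p + (2vx^{1/2}) ∨ (6x) for a symmetric matrix normalized so that kB2k∞ = 1.

```python
# Characteristics of B and the moderate-deviation threshold, checked by simulation.
import numpy as np

def quadratic_form_threshold(x, B):
    B2 = B @ B
    p = np.trace(B2)
    v = np.sqrt(2.0 * np.trace(B2 @ B2))
    u = max(2.0 * v * np.sqrt(x), 6.0 * x)        # (2 v x^{1/2}) vee (6 x)
    return p + u

rng = np.random.default_rng(10)
A = rng.standard_normal((8, 8))
B = (A + A.T) / 2
B = B / np.abs(np.linalg.eigvalsh(B)).max()       # normalize so that ||B^2||_infty = 1

x = 3.0
z = quadratic_form_threshold(x, B)
xi = rng.standard_normal((200_000, 8))
emp = np.mean(np.sum((xi @ B) ** 2, axis=1) > z)
print(z, emp, 2 * np.exp(-x))                     # empirical tail vs the bound 2 e^{-x}
```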
9.5 Rescaling and regularity condition
The result of Theorem 9.4.1 can be extended to a more general situation when
the condition (9.1) is fulfilled for a vector ζ rescaled by a matrix V0 . More
precisely, let the random p-vector ζ fulfill for some p × p matrix V0 the condition

sup_{γ∈IR^p} log IE exp( λ γ⊤ζ / kV0γk ) ≤ ν02 λ2/2,   |λ| ≤ g,   (9.9)
with some constants g > 0 , ν0 ≥ 1 . Again, a simple change of variables
reduces the case of an arbitrary ν0 ≥ 1 to ν0 = 1 . Our aim is to bound
the squared norm kD0−1 ζk2 of a vector D0−1 ζ for another p × p positive
symmetric matrix D02 . Note that condition (9.9) implies (9.1) for the rescaled
vector ξ = V0−1 ζ . This leads to bounding the quadratic form kD0−1 V0 ξk2 =
kBξk2 with B 2 = D0−1 V02 D0−1 . It obviously holds
p = tr(B 2 ) = tr(D0−2 V02 ).
Now we can apply the result of Corollary 9.4.1.
Corollary 9.5.1. Let ζ fulfill (9.9) with some V0 and g . Given D0 , define
B 2 = D0−1 V02 D0−1 , and let g2 ≥ 2p . Then it holds for x ≤ xc with xc from
(9.7):
IP( kD0−1ζk2 ≥ z(x, B) ) ≤ 2e−x + 8.4e−xc,

with z(x, B) from (9.8). For x > xc

IP( kD0−1ζk2 ≥ zc(x, B) ) ≤ 8.4e−x,   zc(x, B) := yc2 + 2(x − xc)/gc.
In the regular case with D0 ≥ aV0 for some a > 0 , one obtains kBk∞ ≤
a−1 and
v2 = 2 tr(B 4 ) ≤ 2a−2 p.
9.6 A chi-squared bound with norm-constraints
This section extends the results to the case when the bound (9.1) is imposed under a constraint on a norm of γ other than the ℓ2-norm. Namely, we suppose that

log IE exp(γ⊤ξ) ≤ kγk2/2,   γ ∈ IR^p, kγk◦ ≤ g◦,   (9.10)
where k · k◦ is a norm which differs from the usual Euclidean norm. Our
driving example is given by the sup-norm case with kγk◦ ≡ kγk∞. We are interested in checking whether the previous results of Section 9.3 still apply. The
answer depends on how massive the set A(r) = {γ : kγk◦ ≤ r} is in terms of
the standard Gaussian measure on IRp . Recall that the quadratic norm kεk2
of a standard Gaussian vector ε in IRp concentrates around p at least for
p large. We need a similar concentration property for the norm k · k◦ . More
precisely, we assume for a fixed r∗ that
IP kεk◦ ≤ r∗ ≥ 1/2,
ε ∼ N(0, Ip ).
(9.11)
This implies for any value u◦ > 0 and all u ∈ IRp with kuk◦ ≤ u◦ that
IP kε − uk◦ ≤ r∗ + u◦ ≥ 1/2,
ε ∼ N(0, Ip ).
For each z > p , consider
µ(z) = (z − p)/z.
Given u◦, denote by z◦ = z◦(u◦) the root of the equation

g◦/µ(z◦) − r∗/µ1/2(z◦) = u◦.   (9.12)

One can easily see that this value exists and is unique if u◦ ≥ g◦ − r∗, and it can be defined as the largest z for which g◦/µ(z) − r∗/µ1/2(z) ≥ u◦. Let µ◦ = µ(z◦) be the corresponding µ-value. Define also x◦ by
2x◦ = µ◦ z◦ + p log(1 − µ◦ ).
If u◦ < g◦ − r∗ , then set z◦ = ∞ , x◦ = ∞ .
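Equation (9.12) has no closed-form solution in general, but it is a one-dimensional root-finding problem; the following sketch (ours, with illustrative input values) solves it numerically and evaluates µ◦ and x◦.

```python
# Numerical solution of (9.12) for z_circ, with mu(z) = (z - p)/z.
import numpy as np
from scipy.optimize import brentq

def z_circ(g_o, r_star, u_o, p):
    if u_o < g_o - r_star:
        return np.inf, np.inf                      # convention z_circ = x_circ = infinity
    mu = lambda z: (z - p) / z
    h = lambda z: g_o / mu(z) - r_star / np.sqrt(mu(z)) - u_o
    z_hi = 2.0 * p
    for _ in range(200):                           # expand until the root is bracketed
        if h(z_hi) <= 0:
            break
        z_hi *= 2.0
    z0 = brentq(h, p * (1 + 1e-10), z_hi)          # h decreases from +inf to g_o - r_star - u_o
    mu0 = mu(z0)
    x0 = 0.5 * (mu0 * z0 + p * np.log(1 - mu0))    # 2 x_circ = mu_circ z_circ + p log(1 - mu_circ)
    return z0, x0

print(z_circ(g_o=6.0, r_star=3.0, u_o=5.0, p=10))
```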
Theorem 9.6.1. Let a random vector ξ in IRp fulfill (9.10). Suppose (9.11)
and let, given u◦ , the value z◦ be defined by (9.12). Then it holds for any
u > 0

IP( kξk2 > p + u, kξk◦ ≤ u◦ ) ≤ 2 exp( −(p/2)φ(u/p) ),   (9.13)

yielding for x ≤ x◦

IP( kξk2 > p + √(κxp) ∨ (κx), kξk◦ ≤ u◦ ) ≤ 2 exp(−x),   (9.14)

where κ = 6.6. Moreover, for z ≥ z◦, it holds

IP( kξk2 > z, kξk◦ ≤ u◦ ) ≤ 2 exp( −µ◦z/2 − (p/2) log(1 − µ◦) ) = 2 exp( −x◦ − µ◦(z − z◦)/2 ).
It is easy to check that the result continues to hold for the norm of Πξ
for a given sub-projector Π in IR^p satisfying Π = Π⊤, Π2 ≤ Π. As above, denote p := tr(Π2), v2 := 2 tr(Π4). Let r∗ be fixed to ensure

IP( kΠεk◦ ≤ r∗ ) ≥ 1/2,   ε ∼ N(0, Ip).
The next result is stated for g◦ ≥ r∗ + u◦ , which simplifies the formulation.
Theorem 9.6.2. Let a random vector ξ in IR^p fulfill (9.10) and let Π satisfy Π = Π⊤, Π2 ≤ Π. Let some u◦ be fixed. Then for any µ◦ ≤ 2/3 with g◦µ◦−1 − r∗µ◦−1/2 ≥ u◦,

IE exp{ (µ◦/2)(kΠξk2 − p) } 1I{ kΠ2ξk◦ ≤ u◦ } ≤ 2 exp(µ◦2 v2/4),   (9.15)

where v2 = 2 tr(Π4). Moreover, if g◦ ≥ r∗ + u◦, then for any x ≥ 0 and any z ≥ p + (2vx1/2) ∨ (6x)

IP( kΠξk2 > z, kΠ2ξk◦ ≤ u◦ ) ≤ IP( kΠξk2 > p + (2vx1/2) ∨ (6x), kΠ2ξk◦ ≤ u◦ ) ≤ 2 exp(−x).
9.7 A bound for the ℓ2-norm under Bernstein conditions
For comparison, we specify the results to the case considered recently in Baraud (2010). Let ζ be a random vector in IRn whose components ζi are
independent and satisfy the Bernstein type conditions: for all |λ| < c−1
log IE e^{λζi} ≤ λ2σ2 / (1 − c|λ|).   (9.16)
Denote ξ = ζ/(2σ) and consider kγk◦ = kγk∞. Fix g◦ = σ/c. If kγk◦ ≤ g◦, then 1 − c|γi|/(2σ) ≥ 1/2 and

log IE exp(γ⊤ξ) ≤ Σ_i log IE exp( γiζi/(2σ) ) ≤ Σ_i |γi/(2σ)|2 σ2 / (1 − c|γi|/(2σ)) ≤ kγk2/2.
Let also S be some linear subspace of IR^n with dimension p and let ΠS denote the projector onto S. For applying the result of Theorem 9.6.1, the value r∗ has to be fixed. We use that the infinity norm kεk∞ concentrates around √(2 log p).
Lemma 9.7.1. It holds for a standard normal vector ε ∈ IR^p with r∗ = √(2 log p) that

IP( kεk◦ ≤ r∗ ) ≥ 1/2.

Indeed,

IP( kεk◦ > r∗ ) ≤ IP( kεk∞ > √(2 log p) ) ≤ p IP( |ε1| > √(2 log p) ) ≤ 1/2.
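The union bound used here is easy to verify numerically (our own check, for p ≥ 2):

```python
# Check of Lemma 9.7.1: p * IP(|eps_1| > sqrt(2 log p)) <= 1/2 for p >= 2.
import numpy as np
from scipy.stats import norm

for p in (2, 5, 10, 100, 1000):
    r_star = np.sqrt(2 * np.log(p))
    print(p, p * 2 * norm.sf(r_star))      # union bound on IP(||eps||_inf > r_star)
```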
Now the general bound of Theorem 9.6.1 is applied to bound the norm kΠS ξk. For simplicity of formulation we assume that g◦ ≥ u◦ + r∗.
Theorem 9.7.1. Let S be some linear subspace of IRn with dimension p .
Let g◦ ≥ u◦ + r∗ . If the coordinates ζi of ζ are independent and satisfy
(9.16), then for all x ,
IP( (4σ2)−1 kΠS ζk2 > p + √(κxp) ∨ (κx), kΠS ζk∞ ≤ 2σu◦ ) ≤ 2 exp(−x).
The bound of Baraud (2010) reads
IP( kΠS ζk2 > 3σ√(xp) ∨ (6cu◦ x) + 3p, kΠS ζk∞ ≤ 2σu◦ ) ≤ e−x.
As expected, in the region x ≤ xc of Gaussian approximation, the bound of
Baraud is not sharp and actually quite rough.
9.8 Proofs
Proof of Theorem 9.2.1
The proof utilizes the following well-known fact, which can be obtained by straightforward calculus: for µ < 1

log IE exp( µkξk2/2 ) = −0.5p log(1 − µ).

Now consider any u > 0. By the exponential Chebyshev inequality

IP( kξk2 > p + u ) ≤ exp( −µ(p + u)/2 ) IE exp( µkξk2/2 ) = exp( −µ(p + u)/2 − (p/2) log(1 − µ) ).   (9.17)

It is easy to see that the value µ = u/(u + p) maximizes µ(p + u) + p log(1 − µ) w.r.t. µ, yielding

µ(p + u) + p log(1 − µ) = u − p log(1 + u/p).

Further we use that x − log(1 + x) ≥ a0 x2 for x ≤ 1 and x − log(1 + x) ≥ a0 x for x > 1 with a0 = 1 − log(2) ≥ 0.3. This implies with x = u/p for u = √(κxp) or u = κx and κ = 2/a0 < 6.6 that

IP( kξk2 ≥ p + √(κxp) ∨ (κx) ) ≤ exp(−x)
as required.
Proof of Theorem 9.2.2
The matrix B2 can be represented as U⊤ diag(a1, . . . , ap) U for an orthogonal matrix U. The vector ξ̃ = Uξ is also standard normal and kBξk2 = ξ̃⊤ U B2 U⊤ ξ̃. This means that one can reduce the situation to the case of a diagonal matrix B2 = diag(a1, . . . , ap). We can also assume without loss of generality that a1 ≥ a2 ≥ . . . ≥ ap. The expressions for the quantities p and v2 simplify to

p = tr(B2) = a1 + . . . + ap,   v2 = 2 tr(B4) = 2(a12 + . . . + ap2).

Moreover, rescaling the matrix B2 by a1 reduces the situation to the case with a1 = 1.
Lemma 9.8.1. It holds

IE kBξk2 = tr(B2),   Var( kBξk2 ) = 2 tr(B4).

Moreover, for µ < 1

IE exp( µkBξk2/2 ) = det(IIp − µB2)−1/2 = ∏_{i=1}^p (1 − µai)−1/2.   (9.18)

Proof. If B2 is diagonal, then kBξk2 = Σ_i ai ξi2 and the summands ai ξi2 are independent. It remains to note that IE(ai ξi2) = ai, Var(ai ξi2) = 2ai2, and for µai < 1,

IE exp( µai ξi2/2 ) = (1 − µai)−1/2,

yielding (9.18).
Given u, fix µ < 1. The exponential Markov inequality yields

IP( kBξk2 > p + u ) ≤ exp( −µ(p + u)/2 ) IE exp( µkBξk2/2 )
≤ exp( −µu/2 − (1/2) Σ_{i=1}^p [ µai + log(1 − µai) ] ).

We start with the case when x1/2 ≤ v/3. Then u = 2x1/2v fulfills u ≤ 2v2/3. Define µ = u/v2 ≤ 2/3 and use that t + log(1 − t) ≥ −t2 for t ≤ 2/3. This implies

IP( kBξk2 > p + u ) ≤ exp( −µu/2 + (1/2) Σ_{i=1}^p µ2ai2 ) = exp( −u2/(4v2) ) = e−x.   (9.19)

Next, let x1/2 > v/3. Set µ = 2/3. It holds similarly to the above

Σ_{i=1}^p [ µai + log(1 − µai) ] ≥ − Σ_{i=1}^p µ2ai2 ≥ −2v2/9 ≥ −2x.

Now, for u = 6x and µu/2 = 2x, (9.19) implies

IP( kBξk2 > p + u ) ≤ exp( −(2x − x) ) = exp(−x)
as required.
Proof of Theorem 9.3.1
The main step of the proof is the following exponential bound.
Lemma 9.8.2. Suppose (9.1). For any µ < 1 with g2 > pµ , it holds
IE exp( µkξk2/2 ) 1I( kξk ≤ g/µ − √(p/µ) ) ≤ 2(1 − µ)−p/2.   (9.20)
Proof. Let ε be a standard normal vector in IR^p and u ∈ IR^p. The bound IP( kεk2 > p ) ≤ 1/2 and the triangle inequality imply for any vector u and any r with r ≥ kuk + p1/2 that IP( ku + εk ≤ r ) ≥ 1/2. Let us fix some ξ with kξk ≤ g/µ − √(p/µ) and denote by IPξ the conditional probability given ξ. The previous arguments yield

IPξ( kε + µ1/2 ξk ≤ µ−1/2 g ) ≥ 0.5.

It holds with cp = (2π)−p/2

cp ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk ≤ g) dγ
= cp exp( µkξk2/2 ) ∫ exp( −(1/2) kµ−1/2γ − µ1/2ξk2 ) 1I( µ−1/2kγk ≤ µ−1/2g ) dγ
= µ^{p/2} exp( µkξk2/2 ) IPξ( kε + µ1/2 ξk ≤ µ−1/2 g )
≥ 0.5 µ^{p/2} exp( µkξk2/2 ),

because kµ1/2 ξk + p1/2 ≤ µ−1/2 g. This implies in view of p < g2/µ that

exp( µkξk2/2 ) 1I( kξk ≤ g/µ − √(p/µ) ) ≤ 2µ^{−p/2} cp ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk ≤ g) dγ.

Further, by (9.1)

cp IE ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk ≤ g) dγ
≤ cp ∫ exp( −(µ−1 − 1) kγk2/2 ) 1I(kγk ≤ g) dγ
≤ cp ∫ exp( −(µ−1 − 1) kγk2/2 ) dγ
≤ (µ−1 − 1)−p/2

and (9.20) follows.
Due to this result, the scaled squared norm µkξk2/2 after a proper truncation possesses the same exponential moments as in the Gaussian case. A straightforward implication is the probability bound IP( kξk2 > p + u ) for moderate values u. Namely, given u > 0, define µ = u/(u + p). This value optimizes the inequality (9.17) in the Gaussian case. Now we can apply a similar bound under the constraint kξk ≤ g/µ − √(p/µ). Therefore, the bound is only meaningful if √(u + p) ≤ g/µ − √(p/µ) with µ = u/(u + p), or, with w = √(u/p), if w ≤ wc; see (9.3).
The largest value u for which this constraint is still valid is given by p + u = yc2. Hence, (9.20) yields for p + u ≤ yc2

IP( kξk2 > p + u, kξk ≤ yc )
≤ exp( −µ(p + u)/2 ) IE[ exp( µkξk2/2 ) 1I( kξk ≤ g/µ − √(p/µ) ) ]
≤ 2 exp( −0.5 [ µ(p + u) + p log(1 − µ) ] )
= 2 exp( −0.5 [ u − p log(1 + u/p) ] ).

Similarly to the Gaussian case, this implies with κ = 6.6 that

IP( kξk2 ≥ p + √(κxp) ∨ (κx), kξk ≤ yc ) ≤ 2 exp(−x).
The Gaussian case means that (9.1) holds with g = ∞, yielding yc = ∞. In the non-Gaussian case with a finite g, we have to accompany the moderate deviation bound with a large deviation bound for IP( kξk > y ) for y ≥ yc. This is done by combining the bound (9.20) with standard slicing arguments.

Lemma 9.8.3. Let µ0 ≤ g2/p. Define y0 = g/µ0 − √(p/µ0) and g0 = µ0 y0 = g − √(µ0 p). It holds for y ≥ y0

IP( kξk > y ) ≤ 8.4 (1 − g0/y)−p/2 exp( −g0 y/2 )   (9.21)
≤ 8.4 exp( −x0 − g0(y − y0)/2 ),   (9.22)

with x0 defined by

2x0 = µ0 y02 + p log(1 − µ0) = g2/µ0 − p + p log(1 − µ0).
Proof. Consider the growing sequence yk with y1 = y and g0 yk+1 = g0 y + k. Define also µk = g0/yk. In particular, µk ≤ µ1 = g0/y. Obviously

IP( kξk > y ) = Σ_{k=1}^∞ IP( kξk > yk, kξk ≤ yk+1 ).

Now we try to evaluate every slicing probability in this expression. We use that

µk+1 yk2 = (g0 y + k − 1)2 / (g0 y + k) ≥ g0 y + k − 2,

and also g/µk − √(p/µk) ≥ yk because g − g0 = √(µ0 p) > √(µk p) and

g/µk − √(p/µk) − yk = µk−1( g − √(µk p) − g0 ) ≥ 0.

Hence by (9.20)

IP( kξk > y ) = Σ_{k=1}^∞ IP( kξk > yk, kξk ≤ yk+1 )
≤ Σ_{k=1}^∞ exp( −µk+1 yk2/2 ) IE[ exp( µk+1 kξk2/2 ) 1I( kξk ≤ yk+1 ) ]
≤ Σ_{k=1}^∞ 2 (1 − µk+1)−p/2 exp( −µk+1 yk2/2 )
≤ 2 (1 − µ1)−p/2 Σ_{k=1}^∞ exp( −(g0 y + k − 2)/2 )
= 2e^{1/2} (1 − e^{−1/2})−1 (1 − µ1)−p/2 exp( −g0 y/2 )
≤ 8.4 (1 − µ1)−p/2 exp( −g0 y/2 )

and the first assertion follows. For y = y0, it holds

g0 y0 + p log(1 − µ0) = µ0 y02 + p log(1 − µ0) = 2x0

and (9.21) implies IP( kξk > y0 ) ≤ 8.4 exp(−x0). Now observe that the function f(y) = g0 y/2 + (p/2) log(1 − g0/y) fulfills f(y0) = x0 and f′(y) ≥ g0/2, yielding f(y) ≥ x0 + g0(y − y0)/2. This implies (9.22).
The statements of the theorem are obtained by applying the lemmas with
µ0 = µc = wc2/(1 + wc2). This also implies y0 = yc, x0 = xc, and g0 = gc = g − √(µc p); cf. (9.4).
Proof of Theorem 9.4.1
The main steps of the proof are similar to the proof of Theorem 9.3.1.
Lemma 9.8.4. Suppose (9.1). For any µ < 1 with g2/µ ≥ p, it holds

IE exp( µkBξk2/2 ) 1I( kB2ξk ≤ g/µ − √(p/µ) ) ≤ 2 det(IIp − µB2)−1/2.   (9.23)

Proof. With cp(B) = (2π)−p/2 det(B−1),

cp(B) ∫ exp( γ⊤ξ − kB−1γk2/(2µ) ) 1I(kγk ≤ g) dγ
= cp(B) exp( µkBξk2/2 ) ∫ exp( −(1/2) kµ1/2Bξ − µ−1/2B−1γk2 ) 1I(kγk ≤ g) dγ
= µ^{p/2} exp( µkBξk2/2 ) IPξ( kµ−1/2Bε + B2ξk ≤ g/µ ),

where ε denotes a standard normal vector in IR^p and IPξ means the conditional probability given ξ. Moreover, for any u ∈ IR^p and r ≥ p1/2 + kuk, it holds in view of IP( kBεk2 > p ) ≤ 1/2 that

IP( kBε − uk ≤ r ) ≥ IP( kBεk ≤ √p ) ≥ 1/2.

This implies

exp( µkBξk2/2 ) 1I( kB2ξk ≤ g/µ − √(p/µ) ) ≤ 2µ^{−p/2} cp(B) ∫ exp( γ⊤ξ − kB−1γk2/(2µ) ) 1I(kγk ≤ g) dγ.

Further, by (9.1)

cp(B) IE ∫ exp( γ⊤ξ − kB−1γk2/(2µ) ) 1I(kγk ≤ g) dγ
≤ cp(B) ∫ exp( kγk2/2 − kB−1γk2/(2µ) ) dγ
≤ det(B−1) det(µ−1B−2 − Ip)−1/2 = µ^{p/2} det(IIp − µB2)−1/2

and (9.23) follows.
Now we evaluate the probability IP( kBξk > y ) for moderate values of y.

Lemma 9.8.5. Let µ0 < 1 ∧ (g2/p). With y0 = g/µ0 − √(p/µ0), it holds for any u > 0

IP( kBξk2 > p + u, kB2ξk ≤ y0 ) ≤ 2 exp( −0.5µ0(p + u) − 0.5 log det(IIp − µ0B2) ).   (9.24)

In particular, if B2 is diagonal, that is, B2 = diag(a1, . . . , ap), then

IP( kBξk2 > p + u, kB2ξk ≤ y0 ) ≤ 2 exp( −µ0u/2 − (1/2) Σ_{i=1}^p [ µ0ai + log(1 − µ0ai) ] ).   (9.25)

Proof. The exponential Chebyshev inequality and (9.23) imply

IP( kBξk2 > p + u, kB2ξk ≤ y0 )
≤ exp( −µ0(p + u)/2 ) IE[ exp( µ0kBξk2/2 ) 1I( kB2ξk ≤ g/µ0 − √(p/µ0) ) ]
≤ 2 exp( −0.5µ0(p + u) − 0.5 log det(IIp − µ0B2) ).
Moreover, the standard change-of-basis arguments allow us to reduce the problem to the case of a diagonal matrix B2 = diag(a1, . . . , ap) where 1 = a1 ≥ a2 ≥ . . . ≥ ap > 0. Note that p = a1 + . . . + ap. Then the claim (9.24) can be written in the form (9.25).
Now we evaluate the large deviation probability that kBξk > y for a large y. Note that the condition kB2k∞ ≤ 1 implies kB2ξk ≤ kBξk. So, the bound (9.24) continues to hold when kB2ξk ≤ y0 is replaced by kBξk ≤ y0.

Lemma 9.8.6. Let µ0 < 1 and µ0 p < g2. Define g0 by g0 = g − √(µ0 p). For any y ≥ y0 := g0/µ0, it holds

IP( kBξk > y ) ≤ 8.4 det{IIp − (g0/y)B2}−1/2 exp( −g0 y/2 ) ≤ 8.4 exp( −x0 − g0(y − y0)/2 ),   (9.26)

where x0 is defined by

2x0 = g0 y0 + log det{IIp − (g0/y0)B2}.
Proof. The slicing arguments of Lemma 9.8.3 apply here in the same manner. One has to replace kξk by kBξk and (1 − µ1 )−p/2 by det{IIp −
(g0/y)B2}−1/2. We omit the details. In particular, with y = y0 = g0/µ0, this yields

IP( kBξk > y0 ) ≤ 8.4 exp(−x0).
Moreover, for the function f(y) = g0 y + log det{IIp − (g0/y)B2}, it holds f′(y) ≥ g0 and hence f(y) ≥ f(y0) + g0(y − y0) for y > y0. This implies
(9.26).
One important feature of the results of Lemma 9.8.5 and Lemma 9.8.6 is that the value µ0 < 1 ∧ (g2/p) can be selected arbitrarily. In particular, for y ≥ yc, Lemma 9.8.6 with µ0 = µc yields the large deviation probability IP( kBξk > y ). For bounding the probability IP( kBξk2 > p + u, kBξk ≤ yc ), we use the inequality log(1 − t) ≥ −t − t2 for t ≤ 2/3. It implies for µ ≤ 2/3 that

−log IP( kBξk2 > p + u, kBξk ≤ yc ) ≥ µ(p + u) + Σ_{i=1}^p log(1 − µai) ≥ µ(p + u) − Σ_{i=1}^p (µai + µ2ai2) ≥ µu − µ2v2/2.   (9.27)
Now we distinguish between µc = 2/3 and µc < 2/3, starting with µc = 2/3. The bound (9.27) with µ = 2/3 and with u = (2vx1/2) ∨ (6x) yields

IP( kBξk2 > p + u, kBξk ≤ yc ) ≤ 2 exp(−x);

see the proof of Theorem 9.2.2 for the Gaussian case.
Now consider µc < 2/3. For x1/2 ≤ µcv/2, use u = 2vx1/2 and µ0 = u/v2. It holds µ0 = u/v2 ≤ µc and u2/(4v2) = x, yielding the desired bound by (9.27). For x1/2 > µcv/2, we select again µ0 = µc. It holds with u = 4µc−1x that µcu/2 − µc2v2/4 ≥ 2x − x = x. This completes the proof.
Proof of Theorem 9.6.1
The arguments behind the result are the same as in the one-norm case of
Theorem 9.3.1. We only outline the main steps.
Lemma 9.8.7. Suppose (9.10) and (9.11). For any µ < 1 with g◦ > µ1/2 r∗, it holds

IE exp( µkξk2/2 ) 1I( kξk◦ ≤ g◦/µ − r∗/µ1/2 ) ≤ 2(1 − µ)−p/2.   (9.28)

Proof. Let ε be a standard normal vector in IR^p and u ∈ IR^p. Let us fix some ξ with µ1/2 kξk◦ ≤ µ−1/2 g◦ − r∗ and denote by IPξ the conditional probability given ξ. It holds by (9.11) with cp = (2π)−p/2

cp ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk◦ ≤ g◦) dγ
= cp exp( µkξk2/2 ) ∫ exp( −(1/2) kµ1/2ξ − µ−1/2γk2 ) 1I( kµ−1/2γk◦ ≤ µ−1/2g◦ ) dγ
= µ^{p/2} exp( µkξk2/2 ) IPξ( kε − µ1/2 ξk◦ ≤ µ−1/2 g◦ )
≥ 0.5 µ^{p/2} exp( µkξk2/2 ).

This implies

exp( µkξk2/2 ) 1I( kξk◦ ≤ g◦/µ − r∗/µ1/2 ) ≤ 2µ^{−p/2} cp ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk◦ ≤ g◦) dγ.

Further, by (9.10)

cp IE ∫ exp( γ⊤ξ − kγk2/(2µ) ) 1I(kγk◦ ≤ g◦) dγ ≤ cp ∫ exp( −(µ−1 − 1) kγk2/2 ) dγ ≤ (µ−1 − 1)−p/2

and (9.28) follows.
As in the Gaussian case, (9.28) implies for z > p with µ = µ(z) = (z−p)/z
the bounds (9.13) and (9.14). Note that the value µ(z) clearly grows with z
from zero to one, while g◦ /µ(z) − r∗ /µ1/2 (z) is strictly decreasing. The value
z◦ is defined exactly as the point where g◦ /µ(z) − r∗ /µ1/2 (z) crosses u◦ , so
that g◦ /µ(z) − r∗ /µ1/2 (z) ≥ u◦ for z ≤ z◦ .
For z > z◦, the choice µ = µ(z) conflicts with g◦/µ(z) − r∗/µ1/2(z) ≥ u◦. So, we apply µ = µ◦, yielding by the Markov inequality

IP( kξk2 > z, kξk◦ ≤ u◦ ) ≤ 2 exp( −µ◦z/2 − (p/2) log(1 − µ◦) ),
and the assertion follows.
Proof of Theorem 9.6.2
Arguments from the proof of Lemmas 9.8.4 and 9.8.7 yield, in view of g◦µ◦−1 − r∗µ◦−1/2 ≥ u◦,

IE exp( µ◦kΠξk2/2 ) 1I( kΠ2ξk◦ ≤ u◦ ) ≤ IE exp( µ◦kΠξk2/2 ) 1I( kΠ2ξk◦ ≤ g◦/µ◦ − r∗/µ◦1/2 ) ≤ 2 det(IIp − µ◦Π2)−1/2.

The inequality log(1 − t) ≥ −t − t2 for t ≤ 2/3 and the symmetry of the matrix Π imply

−log det(IIp − µ◦Π2) ≤ µ◦p + µ◦2 v2/2,

cf. (9.27); the assertion (9.15) follows.
References
Baraud, Y. (2010). A Bernstein-type inequality for suprema of random processes with applications to model selection in non-Gaussian regression.
Bernoulli, 16(4):1064–1085.
Bayes, T. (1763). An essay towards solving a problem in the doctrine
of chances. Philosophical Transactions of the Royal Society of London.,
53:370–418.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nd
ed. Springer Series in Statistics. New York etc.: Springer-Verlag.
Bernardo, J. M. and Smith, A. F. (1994). Bayesian theory. Chichester: John
Wiley & Sons.
Borovkov, A. (1999). Mathematical Statistics. Taylor & Francis.
Boucheron, S. and Massart, P. (2011). A high-dimensional Wilks phenomenon.
Probability Theory and Related Fields, 150:405–433. doi:10.1007/s00440-010-0278-7.
Bretagnolle, J. (1999). A new large deviation inequality for U -statistics of
order 2. ESAIM, Probab. Stat., 3:151–162.
Chihara, T. (2011). An Introduction to Orthogonal Polynomials. Dover Books
on Mathematics. Dover Publications, Incorporated.
Cramér, H. (1928). On the composition of elementary errors. I. Mathematical
deductions. II. Statistical applications. Skand. Aktuarietidskr., 11:13–74,
141–180.
D’Agostino, R. B. and Stephens, M. A. (1986). Goodness-of-fit techniques.
Statistics: Textbooks and Monographs, Vol. 68. New York - Basel: Marcel
Dekker, Inc.
de Boor, C. (2001). A Practical Guide to Splines. Number Bd. 27 in Applied
Mathematical Sciences. Springer.
de Finetti, B. (1937). Foresight: Its logical laws, its subjective sources (in
French). Annales de l’Institut Henri Poincaré, 7:1–68.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Discussion. J. R. Stat. Soc., Ser.
B, 39:1–38.
Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Stat., 7:269–281.
Donoho, D. L. (1995). Nonlinear solution of linear inverse problems by
wavelet-vaguelette decomposition. Appl. Comput. Harmon. Anal., 2(2):101–
126.
Edgington, E. S. and Onghena, P. (2007). Randomization tests. With CDROM. 4th ed. Boca Raton, FL: Chapman & Hall/CRC, 4th ed. edition.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66. Chapman
& Hall/CRC Monographs on Statistics & Applied Probability. Taylor &
Francis.
Fan, J. and Huang, T. (2005). Profile likelihood inferences on semiparametric
varying-coefficient partially linear models. Bernoulli, 11(6):1031–1057.
Fan, J., Zhang, C., and Zhang, J. (2001). Generalized likelihood ratio statistics
and Wilks phenomenon. Ann. Stat., 29(1):153–193.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
R. Soc. Lond., Ser. A, 144:285–307.
Fisher, R. A. (1956). Statistical methods and scientific inference. Edinburgh-London: Oliver & Boyd, Ltd.
Galerkin, B. G. (1915). On electrical circuits for the approximate solution of
the Laplace equation (In Russian). Vestnik Inzh., 19:897–908.
Gauß, C. F. (1995). Theory of the combination of observations least subject
to errors. Part one, part two, supplement. Translated by G. W. Stewart.
Philadelphia, PA: SIAM.
Gill, R. D. and Levit, B. Y. (1995). Applications of the van Trees inequality:
A Bayesian Cramér-Rao bound. Bernoulli, 1(1-2):59–79.
Giné, E., Latala, R., and Zinn, J. (2000). Exponential and moment inequalities
for U -statistics. Giné, Evarist (ed.) et al., High dimensional probability II.
2nd international conference, Univ. of Washington, DC, USA, August 1-6,
1999. Boston, MA: Birkhäuser. Prog. Probab. 47, 13-38 (2000).
Good, I. and Gaskins, R. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277.
Good, P. (2005). Permutation, parametric and bootstrap tests of hypotheses.
3rd ed. New York, NY: Springer, 3rd ed. edition.
Götze, F. and Tikhomirov, A. (1999). Asymptotic distribution of quadratic
forms. Ann. Probab., 27(2):1072–1098.
Green, P. J. and Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. London: Chapman &
Hall.
Hájek, J. and Šidák, Z. (1967). Theory of rank tests. New York-London:
Academic Press; Prague: Academia, Publishing House of the Czechoslovak
Academy of Sciences.
Hewitt, E. and Stromberg, K. (1975). Real and abstract analysis. A modern
treatment of the theory of functions of a real variable. 3rd printing. Graduate Texts in Mathematics 25. New York - Heidelberg - Berlin: Springer-Verlag.
Hoerl, A. E. (1962). Application of Ridge Analysis to Regression Problems.
Chemical Engineering Progress, 58:54–59.
Horváth, L. and Shao, Q.-M. (1999). Limit theorems for quadratic forms with
applications to Whittle’s estimate. Ann. Appl. Probab., 9(1):146–187.
Hotelling, H. (1931). The generalization of student’s ratio. Ann. Math. Stat.,
2:360–378.
Houdré, C. and Reynaud-Bouret, P. (2003). Exponential inequalities, with
constants, for U-statistics of order two. Giné, Evariste (ed.) et al., Stochastic inequalities and applications. Selected papers presented at the Euroconference on “Stochastic inequalities and their applications”, Barcelona, June
18–22, 2002. Basel: Birkhäuser. Prog. Probab. 56, 55-69 (2003).
Hsu, D., Kakade, S. M., and Zhang, T. (2012). A tail inequality for quadratic
forms of subgaussian random vectors. Electron. Commun. Probab., 17:no.
52, 6.
Huber, P. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. 5th Berkeley Symp. Math. Stat. Probab., Univ.
Calif. 1965/66, 1, 221-233 (1967).
Ibragimov, I. and Khas’minskij, R. (1981). Statistical estimation. Asymptotic
theory. Transl. from the Russian by Samuel Kotz. New York - Heidelberg
-Berlin: Springer-Verlag .
Janssen, A. (1998). Zur Asymptotik nichtparametrischer Tests, Lecture Notes.
Skripten zur Stochastik Nr. 29. Gesellschaft zur Förderung der Mathematischen Statistik, Münster.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proc. Camb. Philos. Soc., 31:203–222.
Jeffreys, H. (1957). Scientific inference. 2. ed. Cambridge University Press.
Jeffreys, H. (1961). Theory of probability. 3rd ed. (The International Series
of Monographs on Physics.) Oxford: Clarendon Press.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. J. Am. Stat. Assoc.,
90(430):773–795.
Kendall, M. (1971). Studies in the history of probability and statistics. XXVI:
The work of Ernst Abbe. Biometrika, 58:369–373.
Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4:83–91.
Kullback, S. (1997). Information Theory and Statistics. Dover Books on
Mathematics. Dover Publications.
Lehmann, E. and Casella, G. (1998). Theory of point estimation. 2nd ed. New
York, NY: Springer, 2nd ed. edition.
Lehmann, E. L. (1999). Elements of large-sample theory. New York, NY:
Springer.
Lehmann, E. L. (2011). Fisher, Neyman, and the creation of classical statistics. New York, NY: Springer.
Lehmann, E. L. and Romano, J. P. (2005). Testing statistical hypotheses. 3rd
ed. New York, NY: Springer, 3rd ed. edition.
Markoff, A. (1912). Wahrscheinlichkeitsrechnung. Nach der 2. Auflage des
russischen Werkes übersetzt von H. Liebmann. Leipzig u. Berlin: B. G.
Teubner.
Massart, P. (2007). Concentration inequalities and model selection. Ecole
d’Eté de Probabilités de Saint-Flour XXXIII-2003. Lecture Notes in Mathematics. Springer.
Murphy, S. and van der Vaart, A. (2000). On profile likelihood. J. Am. Stat.
Assoc., 95(450):449–485.
Neyman, J. and Pearson, E. S. (1928). On the use and interpretation of
certain test criteria for purposes of statistical inference. I, II. Biometrika
20, A; 175-240, 263-294 (1928).
Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond., Ser. A,
Contain. Pap. Math. Phys. Character, 231:289–337.
Pearson, K. (1900). On the criterion, that a given system of deviations from
the probable in the case of a correlated system of variables is such that it
can be reasonably supposed to have arisen from random sampling. Phil.
Mag. (5), 50:157–175.
Pesarin, F. and Salmaso, L. (2010). Permutation Tests for Complex Data:
Theory, Applications and Software. Wiley Series in Probability and Statistics.
Pitman, E. (1937). Significance Tests Which May be Applied to Samples From
any Populations. Journal of the Royal Statistical Society, 4(1):119–130.
Polzehl, J. and Spokoiny, V. (2006). Propagation-separation approach for
local likelihood estimation. Probab. Theory Relat. Fields, 135(3):335–362.
Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Cambridge, MA: The M.I.T. Press. Republished in Wiley Classic Library Series
(2000).
Robert, C. P. (2001). The Bayesian choice. From decision-theoretic foundations to computational implementation. 2nd ed. New York, NY: Springer,
2nd ed. edition.
Savage, L. J. (1954). The foundations of statistics. Wiley Publications in
Statistics. New York: John Wiley & Sons, Inc. XV, 294 p. (1954).
Scheffé, H. (1959). The analysis of variance. A Wiley Publication in Mathematical Statistics. New York: John Wiley & Sons, Inc.; London: Chapman
& Hall, Ltd.
Searle, S. (1971). Linear models. Wiley Series in Probability and Mathematical Statistics. New York etc.: John Wiley and Sons, Inc.
Shorack, G. R. and Wellner, J. A. (1986). Empirical processes with applications
to statistics. Wiley Series in Probability and Mathematical Statistics. New
York, NY: Wiley.
Smirnov, N. (1937). On the distribution of Mises’ ω 2 -criterion (in Russian).
Mat. Sb. (Nov. Ser.), 2(5):973–993.
Smirnov, N. (1948). Table for Estimating the Goodness of Fit of Empirical
Distributions. Ann. Math. Statist., 19(2):279–281.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a
multivariate normal distribution. Proc. 3rd Berkeley Sympos. Math. Statist.
Probability 1, 197-206 (1956).
Strasser, H. (1985). Mathematical theory of statistics. Statistical experiments
and asymptotic decision theory. De Gruyter Studies in Mathematics, 7.
Berlin - New York: de Gruyter.
Student (1908). The probable error of a mean. Biometrika, 6:1–25.
Szegö, G. (1939). Orthogonal Polynomials. Number Bd. 23 in American Mathematical Society colloquium publications. American Mathematical Society.
Tikhonov, A. (1963). Solution of incorrectly formulated problems and the
regularization method. Sov. Math., Dokl., 5:1035–1038.
Van Trees, H. L. (1968). Detection, estimation, and modulation theory. Part
I. New York - London - Sydney: John Wiley and Sons, Inc. XIV, 697 p.
(1968).
von Mises, R. (1931). Vorlesungen aus dem Gebiete der angewandten Mathematik. Bd. 1. Wahrscheinlichkeitsrechnung und ihre Anwendung in der
Statistik und theoretischen Physik. Leipzig u. Wien: Franz Deuticke.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters
when the number of observations is large. Trans. Am. Math. Soc., 54:426–
482.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in
Statistics. Springer.
White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica, 50:1–25.
Wilks, S. (1938). The large-sample distribution of the likelihood ratio for
testing composite hypotheses. Ann. Math. Stat., 9:60–62.
Witting, H. (1985). Mathematische Statistik I: Parametrische Verfahren bei
festem Stichprobenumfang. Stuttgart: B. G. Teubner.
Witting, H. and Müller-Funk, U. (1995). Mathematische Statistik II. Asymptotische Statistik: Parametrische Modelle und nichtparametrische Funktionale. Stuttgart: B. G. Teubner.
Index
alternating procedure, 172
alternative hypothesis, 202
analysis of variance, 237
analysis of variance table, 240
antirank, 259
asymptotic confidence set, 37–39
asymptotic pivotality, 250
B–spline, 112
Bayes approach, 22
Bayes estimate, 187
Bayes factor, 253
Bayes formula, 176, 177
Bayes risk, 186
Bayes testing, 254
Bernoulli experiment, 13
Bernoulli model, 32, 64, 215
between treatment sum of squares, 239
bias-variance decomposition, 151
bias-variance trade-off, 151, 153
binary response models, 117
binomial model, 32
binomial test, 200
block effect, 242
Bolshev’s recursion, 260
Bonferroni rule, 222
canonical parametrization, 73
Chebyshev polynomials, 96
chi-square distribution, 209
chi-square test, 207
chi-squared result, 142
chi-squared theorem, 169
colored noise, 137
complete basis, 95
composite hypothesis, 198
concentration set, 39
confidence ellipsoid, 144
confidence intervals, 15
conjugated prior, 178
consistency, 35
constrained ML problem, 155
Continuous Mapping Theorem, 236
contrast function, 87
contrast matrix, 234
cosine basis, 106
Cox–de Boor formula, 112
Cramér–Rao Inequality, 48, 53
Cramér-Smirnov-von Mises test, 252
credible set, 254
critical region, 199
critical value, 204
data, 13
decision space, 20
decomposition of spread, 240
Delta method, 249
design, 84
deviation bound, 76
differentiability in the mean, 255
double exponential model, 33
efficiency, 22
empirical distribution function, 24
empirical measure, 24
error of the first kind, 199
error of the second kind, 202
estimating equation, 60
excess, 60
exponential family, 56, 70
exponential model, 32, 65, 216
Fisher discrimination, 211
Fisher information, 45
Fisher information matrix, 52
Fisher's F-distribution, 209
Fisher-Yates test, 263
fitted log-likelihood, 60
Fourier basis, 105
Galerkin method, 158
Gauss-Markov Theorem, 136, 167
Gaussian prior, 179, 181
Gaussian regression, 86
Gaussian shift, 61
Gaussian shift model, 210
generalized linear model, 116
generalized regression, 114
Glivenko-Cantelli theorem, 25
goodness-of-fit test, 249
group indicator, 237
Hellinger distance, 43
Hermite polynomials, 103
heterogeneous errors, 85
homogeneous errors, 85
hypothesis testing problem, 197
i.i.d. model, 23
integral of a Brownian bridge, 253
interval hypothesis, 221
interval testing, 199
knots, 111
Kolmogorov-Smirnov test, 250
Kullback–Leibler divergence, 41, 218
Lagrange polynomials, 101
least absolute deviation, 59, 69
least absolute deviations, 88
least favorable configuration (LFC), 201
least squares estimate, 58, 88, 123
Legendre polynomials, 99
level of a test, 200
likelihood ratio test, 205, 206
linear Gaussian model, 121
linear hypothesis, 234
linear model, 16, 121
linear rank statistic, 262
linear regression, 90
locally best test, 256
log-density, 41
log-likelihood ratio, 206
log-rank test, 264
logit regression, 117
loss and risk, 20
loss function, 20
M-estimate, 57, 87
marginal distribution, 176
maximum likelihood estimate, 59, 89
maximum log-likelihood, 60
maximum of a Brownian bridge, 251
median, 59, 69
median regression, 85
median test, 263
method of moments, 28, 29, 246
minimax approach, 22
minimum distance estimate, 57
minimum distance method, 249
misspecified LPA, 138, 145
misspecified noise, 138, 145
multinomial model, 32, 64
natural parametrization, 70
Neyman-Pearson test, 203
no treatment effect, 198
noise misspecification, 132
non-central chi-squared, 229
non-informative prior, 185
nonparametric regression, 91
normal equation, 123
nuisance, 159
observations, 13
odds ratio, 254
one-factorial analysis of variance, 239
one-sided alternative, 203
oracle choice, 153, 157
order statistic, 259
ordinary least squares, 124
orthogonal design, 126
orthogonal projector, 225
orthogonality, 160
orthonormal design, 127
orthonormal polynomials, 94
parametric approach, 13
parametric model, 19
partial estimation, 162
partially Bayes test, 253
Pearson’s chi-square statistic, 208
penalization, 149
penalized log-likelihood, 149
penalized MLE, 149
permutation test, 264
piecewise constant estimation, 106
piecewise linear estimation, 109
piecewise polynomial estimation, 110
Pitman’s permutation test, 265
pivotality, 211
plug-in method, 170
Poisson model, 33, 65, 217
Poisson regression, 118
polynomial regression, 90
pooled sample variance, 238
posterior distribution, 175
posterior mean, 187
power function, 202
profile likelihood, 164
profile MLE, 165
projection estimate, 152
quadratic forms, 226
quadratic log-likelihood, 139
quadratic loss, 131
quadratic risk, 131
quadraticity, 139
quality control, 198
quasi likelihood ratio test, 205
quasi maximum likelihood method, 67
quasi ML estimation, 119
R-efficiency, 49, 136
R-efficient, 55
R-efficient estimate, 50
randomized blocks, 242
randomized tests, 201
rank, 259
rank test, 258
region of acceptance, 199
regression function, 83, 86, 114
regression model, 83
regular family, 44, 51
resampling, 265
residuals, 87, 225
response estimate, 123
ridge regression, 148
risk minimization, 154, 157
root-n normal, 36
sample, 13
Savage test, 264
score, 261
score function, 255
score test, 255
semiparametric estimation, 159
sequence space model, 128
series expansion, 248
shrinkage estimate, 153
significance level, 200
simple game, 198
simple hypothesis, 198
size of a test, 200
smoothing splines, 113
smoothness assumption, 91
spectral cut-off, 156
spectral representation, 128
spline, 111
spline estimation, 111
squared bias, 96
statistical decision problem, 20
statistical problem, 14
statistical tests, 199
Steck's recursion, 260
Student's t-distribution, 210
substitution approach, 245
substitution estimate, 28
substitution principle, 27, 250
subvector, 232
sum of squared residuals, 226
T, 171
target, 159
test power, 202
test statistic, 204
testing a parametric hypothesis, 248
testing a subvector, 198
Tikhonov regularization, 148
treatment effect, 242
treatment effects, 241
trigonometric series, 105
two-factorial analysis of variance, 242
two-sample test, 237
two-sided alternative, 203
two-step procedure, 170
unbiased estimate, 15, 17, 34, 54
uniform distribution, 31, 64, 216
uniformly most powerful (UMP) test,
203
univariate exponential family, 218
univariate normal distribution, 63
van der Waerden test, 263
van Trees’ inequality, 192
van Trees' inequality, multivariate, 193
Wald statistic, 236
Wald test, 236
weighted least squares estimate, 89
Wilcoxon’s rank sum test, 263
Wilks’ phenomenon, 224
within-treatment sum of squares, 239