The “Three Plus One” Likelihood-Based Test Statistics: Unified Geometrical and Graphical Interpretations
Vito M. R. Muggeo & Gianfranco Lovison
The American Statistician (2014), 68:4, 302–306. DOI: 10.1080/00031305.2014.955212
Link to this article: http://dx.doi.org/10.1080/00031305.2014.955212
Accepted author version posted online: 27 Aug 2014. Published online: 19 Nov 2014.
The “Three Plus One” Likelihood-Based Test Statistics: Unified
Geometrical and Graphical Interpretations
Vito M. R. MUGGEO and Gianfranco LOVISON
The presentations of the well-known likelihood ratio, Wald, and score test statistics in textbooks appear to lack a unified graphical and geometrical interpretation. We present two simple graphical representations on a common scale for these three test statistics, as well as for the recently proposed gradient test statistic. These unified graphical displays may foster a better understanding of the geometrical meaning of the likelihood-based statistics and provide useful insights into their connections.
KEY WORDS: Statistical inference; Geometrical interpretation; Gradient statistic; Graphical display; Likelihood ratio;
Score; Wald.
Vito M. R. Muggeo (E-mail: [email protected]) and Gianfranco Lovison (E-mail: [email protected]), Dipartimento Scienze Statistiche e Matematiche ‘Vianelli,’ viale delle Scienze, edificio 13, Palermo 90128, Italy. The authors thank the referee, the Associate Editor, and the Editor, Prof. Ronald Christensen, for their comments and suggestions, and Amanda Ross for carefully revising the final manuscript.

1. INTRODUCTION

In courses on statistical inference based on the likelihood paradigm, hypothesis testing is typically discussed in terms of three well-known test statistics: the likelihood ratio (Neyman and Pearson 1928; Wilks 1938), Wald (Wald 1943), and score (Rao 1948) tests, the latter also referred to as the Lagrange multiplier test in the econometric literature. They are covered in almost every book on statistical inference, for example, Azzalini (2001), Casella and Berger (2002), and Boos and Stefanski (2013); in addition, reviews are also presented in some books on specialized areas of data analysis (e.g., Agresti 2007). To emphasize their key role in statistical inference, Rao (2005) named them “the Holy Trinity.” While analytic representations and asymptotic equivalences of these likelihood-based statistics are fully discussed and easily handled via Taylor expansions, their actual geometrical meaning, with its implications, remains somewhat vague and, to some extent, unclear to most students. In our experience, they tend to learn them separately, recognizing that these are reasonable statistics, each measuring the “distance” between the null hypothesis and the sample evidence on its own appropriate scale, but without fully grasping their deep connections.

In addition to the “Holy Trinity,” a fourth likelihood-based test statistic was introduced by Terrell (2002): the gradient statistic. Although it has not yet entered the mainstream of teaching practice, it is useful to include it here, since it is also based on the likelihood and is asymptotically equivalent to the “Holy Trinity.”

In this article, we show that all four of these test statistics have a geometrical interpretation on the same scale: either that of the log-likelihood function or that of the score function. Consequently, this allows a common graphical representation. We believe that this unified approach may enable a fuller and better understanding of the relationships among these statistics, improving students’ ability to learn and compare them with a “critical eye.”

2. THE LIKELIHOOD-BASED STATISTICS
For ease of presentation, we focus on the case of a scalar parameter and consider a regular statistical model with log-likelihood ℓ(θ), where θ ∈ Θ ⊆ ℝ. We assume the maximum likelihood estimate θ̂ = arg max ℓ(θ) to exist and to be unique. We refer to the usual hypotheses H0: θ = θ0 versus H1: θ ≠ θ0. To avoid overloading the notation, when necessary we assume θ0 is also the true parameter value. Moreover, let us define the usual quantities based on log-likelihood derivatives: ℓ′(θ) = u(θ), the score, and I(θ) and J(θ) = −ℓ″(θ), the expected and observed information. Notice that even if the observed and expected information are usually evaluated at θ0 and θ̂, respectively, we generically define them as functions of θ in the observed sample. When needed, we add “Y” to refer to the corresponding random variables; thus, for instance, E[u(θ0; Y)] = 0 or I(θ) = E[J(θ; Y)] for each θ ∈ Θ.

Under the usual regularity conditions and correct model specification, the variances of u(θ0; Y) and θ̂(Y) are given by the information and its inverse, respectively (the latter typically providing only the asymptotic variance). In general, these variances must be estimated, and this can be done by using either the expected or the observed information, evaluated at the appropriate parameter value: the ML estimate θ̂ or the value postulated by the null hypothesis, θ0.

The usual approach when building the Wald and the score test statistics relies upon using the expected information, evaluated respectively at the ML estimate, I(θ̂), and under the null hypothesis, I(θ0). However, for the purpose of the unified representation proposed in this article, we use the observed information at θ0 and θ̂; thus, we estimate the two variances by V̂[u(θ0; Y)] = J(θ0) and V̂[θ̂(Y)] = J(θ̂)⁻¹. Also, we make no distinction between exact and asymptotic variance. In this context, we think this choice is justified by the following considerations: (i) although I(θ̂) and I(θ0) are more commonly used for standardizing the Wald and score statistics, respectively, the use of the observed information is also discussed in some textbooks (e.g., Pawitan 2001, pp. 244–247); (ii) in this article, which focuses on the geometrical interpretation of asymptotically equivalent statistics, the distinction between observed and expected information is a minor issue, since they are asymptotically equivalent, and even identical in important cases such as the natural exponential family. On the same grounds, the distinction between exact and asymptotic variance is not crucial here and will, therefore, be neglected.

With this specification, the “Holy Trinity” (Rao 2005) becomes

d = 2{ℓ(θ̂) − ℓ(θ0)},   (1)
w = (θ̂ − θ0)² / J(θ̂)⁻¹,   (2)
s = u(θ0)² / J(θ0).   (3)

As reported in most textbooks, each test uses its own scale: d works on the log-likelihood scale, w on the parameter scale, and s on the first-derivative scale. To better explain them, some textbooks also report a classical graphical display, like the one depicted in Figure 1: see, for instance, Figure 4.2 on p. 122 in Azzalini (2001), Figure 3.7 on p. 89 of Agresti (2007), and Figure 3.1 on p. 126 of Boos and Stefanski (2013).

[Figure 1: the log-likelihood plotted against the parameter, with θ0 and θ̂ marked on the x-axis.]
Figure 1. Comparing the three test statistics according to the traditional plot: the likelihood ratio is reported on the y scale, the Wald on the x scale, and the score on the first-derivative scale. The different scales do not favor understanding of the underlying connections.

While this representation is absolutely correct, we propose that the presentation and discussion of the connections among these three test statistics can be fruitfully enhanced using a graphical illustration for all the test statistics on the same scale.

As mentioned in the introduction, we also discuss the new gradient statistic (Terrell 2002), defined as

g = u(θ0)(θ̂ − θ0).   (4)

The gradient statistic is justified by noting that the product of the two standard Normal variates, u(θ0; Y)/V[u(θ0; Y)]¹ᐟ² × (θ̂(Y) − θ0)/V[θ̂(Y)]¹ᐟ², simplifies to (4) when V[θ̂(Y)] ≈ V[u(θ0; Y)]⁻¹. As for the Holy Trinity, the large-sample null distribution of the gradient statistic is χ₁².

It should be mentioned that no test, including the gradient statistic, is uniformly most powerful in all settings; thus, it makes sense to learn each of them, and appreciating their geometrical interpretation may help to reach this goal.

3. COMPARISON ON THE LOG-LIKELIHOOD SCALE

Under the assumed regularity conditions, ℓ(θ) is smooth up to the second order, and it is possible to approximate it via second-order polynomials. As is well known, two quadratic approximations are available. The former is based on the Taylor expansion at θ = θ̂:

Pw(θ) = ℓ(θ̂) + (θ − θ̂)ℓ′(θ̂) + ½(θ − θ̂)²ℓ″(θ̂),   (5)

where clearly ℓ′(θ̂) = 0; the latter approximation relies on a Taylor expansion at θ = θ0:

Ps(θ) = ℓ(θ0) + (θ − θ0)ℓ′(θ0) + ½(θ − θ0)²ℓ″(θ0).   (6)

Traditionally, the Wald and score tests are discussed with reference to the usual formulas (2) and (3), which are employed to justify their rationale and, at the same time, to illustrate their different scales. The fact that they are based on the two aforementioned quadratic approximations is only mentioned to explain how the numerators (θ̂ − θ0)² and u(θ0)² are weighted by the log-likelihood curvature. For instance, Boos and Stefanski (2013, p. 126) wrote, “The likelihood ratio test statistic is a multiple of the difference of the log-likelihood; the Wald test statistic is a multiple of the squared difference of (θ̂ − θ0); and the score test statistic is a multiple of the squared slope at θ0.” Thus, students are not presented with the actual relationships of w and s with the log-likelihood itself. On learning that ℓ(θ) measures just the plausibility of the different θ values in the observed sample, students find it very intuitive to reject H0 if the model plausibility under H0, that is, ℓ(θ0), is smaller than the sample evidence representing the maximum plausibility, that is, ℓ(θ̂). In our view, when introducing the Wald and score tests, it would be helpful to present them using the same approach used for the likelihood ratio, that is, as alternative ways of comparing the same two plausibility values, but employing the quadratic approximations (5) and (6) instead of the log-likelihood itself.

It is trivial to check that subtracting and doubling the plausibility values at θ0 and θ̂ provided by approximation (5) yields

2{Pw(θ̂) − Pw(θ0)} = w.

Similarly, s in (3) can be obtained by subtracting and doubling the plausibility values at the maximum and under H0 according to approximation (6). Notice that the maximum plausibility of (6) is attained at θ̃0 = θ0 − ℓ′(θ0)/ℓ″(θ0). The reader can obtain it by equating P′s(θ) to zero, or by recognizing that θ̃0 is simply the update from the Newton algorithm step to maximize ℓ(θ) starting from θ0. Therefore, it is easy to check that

2{Ps(θ̃0) − Ps(θ0)} = s.
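As a concrete illustration of formulas (1)–(4) and of the plausibility-difference identities above, the following minimal Python sketch (our own toy example, not part of the article; the exponential model, sample size, and null value θ0 = 1.2 are arbitrary choices) computes the four statistics for testing the rate of an exponential sample:

```python
import numpy as np

# Toy model: exponential sample with rate θ, so that
# ℓ(θ) = n·log θ − θ·Σy, u(θ) = n/θ − Σy, J(θ) = n/θ².
rng = np.random.default_rng(1)
y = rng.exponential(scale=1.0, size=50)     # data generated with rate 1
n, S = len(y), y.sum()
theta0 = 1.2                                # hypothesized null value

loglik = lambda th: n * np.log(th) - th * S
u = lambda th: n / th - S                   # score ℓ′(θ)
J = lambda th: n / th**2                    # observed information −ℓ″(θ)
theta_hat = n / S                           # MLE: solves u(θ) = 0

d = 2 * (loglik(theta_hat) - loglik(theta0))   # likelihood ratio, Eq. (1)
w = (theta_hat - theta0)**2 * J(theta_hat)     # Wald, Eq. (2)
s = u(theta0)**2 / J(theta0)                   # score, Eq. (3)
g = u(theta0) * (theta_hat - theta0)           # gradient, Eq. (4)

# The plausibility-difference identities of Section 3:
Pw = lambda th: loglik(theta_hat) - 0.5 * (th - theta_hat)**2 * J(theta_hat)
Ps = lambda th: loglik(theta0) + (th - theta0) * u(theta0) \
                - 0.5 * (th - theta0)**2 * J(theta0)
Pg = lambda th: loglik(theta0) + (th - theta0) * u(theta0)
theta_tilde = theta0 + u(theta0) / J(theta0)   # Newton step from θ0

assert np.isclose(2 * (Pw(theta_hat) - Pw(theta0)), w)
assert np.isclose(2 * (Ps(theta_tilde) - Ps(theta0)), s)
assert np.isclose(Pg(theta_hat) - Pg(theta0), g)
print(d, w, s, g)   # four nearby but generally distinct values
```

Because the score is nonlinear in θ here, the four values differ, yet each identity holds exactly: the Wald and score statistics really are doubled differences of plausibility values on the quadratic approximations (5) and (6).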
[Figure 2: four panels (LIKELIHOOD RATIO, WALD, SCORE, GRADIENT), each plotting the log-likelihood against the parameter, with θ0, θ̂, and, in the score panel, θ̃0 marked on the x-axis; arrows labeled “×2” indicate the doubled differences.]
Figure 2. Comparing the four test statistics on the log-likelihood scale. On each plot the log-likelihood is illustrated (black line) along with the
relevant approximation underlying the test statistic: in the Wald panel the gray line is Pw (θ), in the score panel it is Ps (θ), and in the gradient
panel it is Pg (θ ). The arrows on the left side quantify the corresponding observed test statistic; the longer the arrow, the larger the evidence
against H0 . Notice that for the likelihood ratio, Wald, and score, the arrow lengths have to be doubled to obtain the actual values comparable to
those from the gradient statistic.
Finally, the gradient statistic can also be expressed as a difference in plausibility at θ0 and θ̂ on an appropriate log-likelihood approximation, simply using a first-order Taylor expansion of ℓ(θ) at θ0, Pg(θ) = ℓ(θ0) + (θ − θ0)ℓ′(θ0). Here we do not need to double the difference to obtain

{Pg(θ̂) − Pg(θ0)} = g.

Figure 2 presents the graphical representations of the four test statistics on the common log-likelihood scale. Of course, for inferences about the location parameter of Gaussian models, the two approximations Pw and Ps coincide with the log-likelihood, thus d = w = s; it is also simple to show that the gradient statistic reduces to the same value. In fact, the four test statistics are identical in this special case.
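The Gaussian identity just stated is easy to verify numerically; this small sketch (our own made-up example, with a Normal sample and known σ, where the score is linear in θ) shows the four statistics coinciding:

```python
import numpy as np

# Toy Gaussian location model with known sigma (hypothetical data).
rng = np.random.default_rng(0)
sigma = 2.0
y = rng.normal(loc=0.5, scale=sigma, size=30)
n, ybar = len(y), y.mean()
theta0 = 0.0

loglik = lambda th: -0.5 * np.sum((y - th)**2) / sigma**2   # up to a constant
u = lambda th: n * (ybar - th) / sigma**2                   # score, linear in θ
J = n / sigma**2                                            # information, free of θ

d = 2 * (loglik(ybar) - loglik(theta0))
w = (ybar - theta0)**2 * J
s = u(theta0)**2 / J
g = u(theta0) * (ybar - theta0)

# Linear score ⇒ Pw and Ps coincide with ℓ itself, so
# d = w = s = g = n(ȳ − θ0)²/σ² exactly.
print(d, w, s, g)
```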
4. COMPARISON ON THE SCORE SCALE
The four test statistics have also a common representation on
the score scale. In Figure 3, the gray area of the “triangle” A in
each plot is half the value of the corresponding statistic.
The likelihood ratio statistic, again, is simple. Since ℓ(θ) equals ∫u(θ)dθ up to a constant, clearly the area of the curvilinear triangle A is

A = ∫_{θ0}^{θ̂} u(θ) dθ = ℓ(θ̂) − ℓ(θ0) = ½d.
In each remaining triangle displayed, let b be the base, h the height, and recall that u′(θ) provides the slope at each θ ∈ Θ. In the “Wald” panel, b = (θ̂ − θ0) and, from basic trigonometric relationships, h/b = −u′(θ̂); thus h = (θ̂ − θ0){−u′(θ̂)}. Hence,

A = ½bh = ½(θ̂ − θ0)²{−u′(θ̂)} = ½(θ̂ − θ0)²/{−u′(θ̂)}⁻¹ = ½w,

where {−u′(θ̂)} = J(θ̂).
In the “score” panel, h = u(θ0) and, using the same trigonometric relationship h/b = −u′(θ0), we get b = u(θ0)/{−u′(θ0)}. Hence,

A = ½bh = ½u(θ0)²/{−u′(θ0)} = ½s,

where {−u′(θ0)} = J(θ0).
Curiously, when showing the three triangles representing the likelihood ratio, Wald, and score statistics in the classroom, a student of ours asked about the “missing triangle.” He meant precisely the one now shown in the lower right corner. After looking at the first three, he expected this fourth triangle as the last, natural, and consequential step of a logical sequence. At the time, we had only three triangles in our graphical display and were unable to answer our student’s question, as we did not yet know the gradient statistic. Now the missing triangle has appeared, and it makes sense: in the “gradient” panel in the lower right corner, it is immediately seen that b = (θ̂ − θ0) and h = u(θ0), and then
A = ½bh = ½u(θ0)(θ̂ − θ0) = ½g.
For inferences about the location parameter of Gaussian models the score is linear, and therefore the four triangles coincide.
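The score-scale picture can also be checked numerically. In this sketch (again our own exponential-rate toy example, not the article's), the curvilinear area under the score reproduces ½d, and the three straight-sided triangles give half of w, s, and g:

```python
import numpy as np

# Hypothetical exponential-rate example: the "triangles" of Figure 3 in numbers.
rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0, size=40)
n, S = len(y), y.sum()
theta0, theta_hat = 1.3, n / S

loglik = lambda th: n * np.log(th) - th * S
u  = lambda th: n / th - S          # score
du = lambda th: -n / th**2          # u′(θ), so J(θ) = −u′(θ)

# Likelihood ratio: curvilinear area under u between θ0 and θ̂ (trapezoidal rule).
th = np.linspace(theta0, theta_hat, 20001)
A_lr = np.sum((u(th)[:-1] + u(th)[1:]) / 2 * np.diff(th))

# Straight-sided triangles: base × height / 2.
A_wald  = 0.5 * (theta_hat - theta0)**2 * (-du(theta_hat))   # = w/2
A_score = 0.5 * u(theta0)**2 / (-du(theta0))                 # = s/2
A_grad  = 0.5 * u(theta0) * (theta_hat - theta0)             # = g/2

d = 2 * (loglik(theta_hat) - loglik(theta0))
print(A_lr, d / 2)   # the curvilinear area matches ½d
```

The relative sizes of the four areas depend on how curved the score is between θ0 and θ̂, which is exactly the point of Figure 3.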
[Figure 3: four panels (LIKELIHOOD RATIO, WALD, SCORE, GRADIENT), each plotting the score against the parameter with the zero line marked; the shaded triangles have vertices at θ0, θ̂, and, in the score panel, θ̃0.]
Figure 3. Illustrating the four test statistics on the score scale. In each plot, the gray area represents half the actual empirical value of the test statistic. The bigger the “triangle,” the larger the evidence against H0. In this example, clearly g > d > w (or s), but in general the areas of the triangles depend on the shape of the score u(·) (in turn depending on the assumed model) and on the locations of θ̂ and θ0.
5. A NONSTANDARD EXAMPLE: THE MONOTONIC
LIKELIHOOD PROBLEM
The common score-scale representation of the likelihood-based statistics enables us to better understand and appreciate the applicability and potential fallibility of some test statistics in specific nonstandard situations; see Fears, Benichou, and Gail (1996) for an example focusing on the variance of random effects in a simple balanced one-way ANOVA. Another well-known example concerns the monotonic likelihood problem: in a simple 2 × 2 table, the odds ratio estimate does not exist when there is “quasi-separation” between the response conditional distributions and, consequently, a cell count is zero (Hauck and Donner 1977). As the parameter of interest θ increases, the log-likelihood increases monotonically and plateaus at large values of θ. Accordingly, the first derivative, that is, the score, approaches zero without crossing the x-axis. The maximum likelihood estimate usually returned by estimating algorithms and routine software is the parameter value at which the log-likelihood does not increase further; its numeric value depends on the tolerance stopping value of the algorithm and is normally quite large.
The graphical illustration on the score scale reported in
Figure 4 aids understanding of the fallibility of the Wald and
also of the gradient test statistic, which, interestingly, is due to
their opposite behaviors.
[Figure 4: two panels (WALD, GRADIENT), each plotting the score against the parameter with the zero line marked, θ0 and θ̂ on the x-axis, and the shaded triangle areas.]
Figure 4. Illustrating the fallibility of the Wald and gradient statistics when testing for association in a logistic regression model with quasi-separation between the response conditional distributions, which causes the log-likelihood to be monotonic and the score function not to cross the x-axis. The gray areas represent the empirical values of the observed test statistics.

In fact, the score is very flat at θ̂ (that is, u′(θ̂) ≈ 0), making the area of the Wald triangle very tiny: this means that the corresponding empirical value
is substantially zero. On the contrary, the area of the gradient triangle becomes huge, indicating a very large empirical value. Thus, in terms of performance of statistical tests, the Wald and gradient statistics are questionable if the probability of observing a zero count is not negligible owing to small sample size and/or a strong predictor of the response: the power of the Wald test tends to zero, and the size of the gradient test is out of control, potentially much larger than the nominal level. On the other hand, it is evident that the likelihood ratio and score tests continue to hold, and in a likelihood framework they can provide reliable results and trustworthy p-values.
A possible, probably better, alternative is to assume Jeffreys’ invariant prior for the model parameters; this guarantees that the score function always crosses the x-axis, namely, that a finite estimate exists even in cases of quasi-separation of the data. While this approach clearly has a Bayesian flavor, it is also accommodated within a likelihood-based paradigm via penalized likelihood; see Firth (1993) for details.
6. CONCLUSION
We have presented two graphical illustrations on common scales for the likelihood ratio, Wald, score, and gradient statistics. These alternative views can facilitate comprehension of the underlying relationships among the tests. While we do not claim that these relationships are new, it appears that graphical illustrations and comparisons based on them have been neglected in standard graduate textbooks on statistical inference. To the best of our knowledge, no textbook includes a similar treatment. Indeed, we have experienced quite positive reactions from our students when introducing the test statistics via graphical displays. An important benefit is that the derived formulas for w, s, and g based on plausibility differences clarify and emphasize the role of the likelihood itself.
[Received October 2013. Revised August 2014.]
306 Teacher’s Corner
REFERENCES
Agresti, A. (2007), An Introduction to Categorical Data Analysis (2nd ed.), New York: Wiley.

Azzalini, A. (2001), Inferenza Statistica: Una Presentazione Basata sul Concetto di Verosimiglianza, Milano: Springer.

Boos, D. D., and Stefanski, L. A. (2013), Essential Statistical Inference: Theory and Methods, New York: Springer.

Casella, G., and Berger, R. L. (2002), Statistical Inference, Belmont, CA: Duxbury.

Fears, T. R., Benichou, J., and Gail, M. H. (1996), “A Reminder of the Fallibility of the Wald Statistic,” The American Statistician, 50, 226–227.

Firth, D. (1993), “Bias Reduction of Maximum Likelihood Estimates,” Biometrika, 80, 27–38.

Hauck, W. W., and Donner, A. (1977), “Wald’s Test as Applied to Hypotheses in Logit Analysis,” Journal of the American Statistical Association, 72, 851–853.

Neyman, J., and Pearson, E. S. (1928), “On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference,” Biometrika, 20A, 175–240.

Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, New York: Oxford University Press.

Rao, C. R. (1948), “Large Sample Tests of Statistical Hypotheses Concerning Several Parameters With Applications to Problems of Estimation,” Proceedings of the Cambridge Philosophical Society, 44, 50–57.

——— (2005), “Score Test: Historical Review and Recent Developments,” in Advances in Ranking and Selection, Multiple Comparisons, and Reliability, eds. N. Balakrishnan, N. Kannan, and H. N. Nagaraja, Boston, MA: Birkhäuser.

Terrell, G. (2002), “The Gradient Statistic,” Computing Science and Statistics, 34, 206–215.

Wald, A. (1943), “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large,” Transactions of the American Mathematical Society, 54, 426–482.

Wilks, S. S. (1938), “The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses,” Annals of Mathematical Statistics, 9, 60–62.