The American Statistician, Vol. 68, No. 4 (November 2014), pp. 302–306. Accepted author version posted online: 27 Aug 2014; published online: 19 Nov 2014. © 2014 American Statistical Association.

To cite this article: Vito M. R. Muggeo & Gianfranco Lovison (2014), "The 'Three Plus One' Likelihood-Based Test Statistics: Unified Geometrical and Graphical Interpretations," The American Statistician, 68:4, 302–306, DOI: 10.1080/00031305.2014.955212.
The "Three Plus One" Likelihood-Based Test Statistics: Unified Geometrical and Graphical Interpretations

Vito M. R. MUGGEO and Gianfranco LOVISON

The presentations of the well-known likelihood ratio, Wald, and score test statistics in textbooks appear to lack a unified graphical and geometrical interpretation. We present two simple graphical representations on a common scale for these three test statistics, and also for the recently proposed gradient test statistic. These unified graphical displays may favor better understanding of the geometrical meaning of the likelihood-based statistics and provide useful insights into their connections.

KEY WORDS: Statistical inference; Geometrical interpretation; Gradient statistic; Graphical display; Likelihood ratio; Score; Wald.

1. INTRODUCTION

In courses on statistical inference based on the likelihood paradigm, hypothesis testing is typically discussed in terms of three well-known test statistics: the likelihood ratio (Neyman and Pearson 1928; Wilks 1938), Wald (Wald 1943), and score (Rao 1948) tests, the latter also referred to as the Lagrange multiplier test in the econometric literature. They are covered in almost every book on statistical inference, for example, Azzalini (2001), Casella and Berger (2002), and Boos and Stefanski (2013); in addition, reviews are also presented in some books on specialized areas of data analysis (e.g., Agresti 2007). To emphasize their key role in statistical inference, Rao (2005) named them "the Holy Trinity."

While analytic representations and asymptotic equivalences of these likelihood-based statistics are fully discussed and easily handled via Taylor expansions, their actual geometrical meaning, with its implications, remains somewhat vague and, to some extent, unclear to most students. In our experience, students tend to learn them separately, recognizing that these are reasonable statistics, each measuring the "distance" between the null hypothesis and the sample evidence on its own appropriate scale, but without fully grasping their deep connections.

In addition to the "Holy Trinity," a fourth test statistic based on the likelihood was introduced by Terrell (2002): the gradient statistic. Although this has not yet entered the mainstream of teaching practice, it is useful to include it here, since it is also based on the likelihood and is asymptotically equivalent to the "Holy Trinity."

In this article, we show that all four of these test statistics have a geometrical interpretation on the same scale: either that of the log-likelihood function or that of the score function. Consequently, this allows a common graphical representation. We believe that this unified approach may enable a fuller and better understanding of the relationships among these statistics, improving the students' ability to learn and compare them with a "critical eye."

Vito M. R. Muggeo (E-mail: [email protected]) and Gianfranco Lovison (E-mail: [email protected]), Dipartimento Scienze Statistiche e Matematiche "Vianelli," viale delle Scienze, edificio 13, Palermo 90128, Italy. The authors thank the referee, the Associate Editor, and the Editor, Prof. Ronald Christensen, for their comments and suggestions, and Amanda Ross for carefully revising the final manuscript.

2. THE LIKELIHOOD-BASED STATISTICS

For ease of presentation, we focus on the case of a scalar parameter and consider a regular statistical model with log-likelihood ℓ(θ), where θ ∈ Θ, a subset of ℝ.
We assume the maximum likelihood estimate θ̂ = arg maxθ ℓ(θ) to exist and to be unique. We refer to the usual hypotheses H0: θ = θ0 versus H1: θ ≠ θ0. To avoid overloading the notation, when necessary we assume θ0 is also the true parameter value. Moreover, let us define the usual quantities based on log-likelihood derivatives: ℓ′(θ) = u(θ), the score, and I(θ) and J(θ) = −ℓ″(θ), the expected and observed information. Notice that even if the observed and expected information are usually evaluated at θ0 and θ̂, respectively, we generically define them as functions of θ in the observed sample. When needed, we add "Y" to refer to the corresponding random variables; thus, for instance, E[u(θ0; Y)] = 0 or I(θ) = E[J(θ; Y)] for each θ ∈ Θ.

Under the usual regularity conditions and correct model specification, the variances of u(θ0; Y) and θ̂(Y) are given by the information and its inverse, respectively; the latter typically provides only the asymptotic variance. In general, these variances must be estimated, and this can be done by using either the expected or the observed information, evaluated at the appropriate parameter value: the ML estimate θ̂ or the value postulated by the null hypothesis, θ0. The usual approach when building the Wald and the score test statistics relies upon using the expected information, evaluated at the ML estimate, I(θ̂), and under the null hypothesis, I(θ0), respectively. However, for the purpose of the unified representation proposed in this article, we use the observed information at θ0 and θ̂; thus, we estimate the two variances by V̂[u(θ0; Y)] = J(θ0) and V̂[θ̂(Y)] = J(θ̂)⁻¹. Also, we make no distinction between exact and asymptotic variance. In this context, we think this choice is justified by the following considerations: (i) although I(θ̂) and I(θ0) are more commonly used for standardizing the Wald and score statistics, respectively, the use of the observed information is also discussed by some textbooks (e.g., Pawitan 2001, pp. 244–247); (ii) in this article, which focuses on the geometrical interpretation of asymptotically equivalent statistics, the distinction between observed and expected information is a minor issue, since they are asymptotically equivalent, and even identical in important cases such as the natural exponential family. On the same grounds, the distinction between exact and asymptotic variance is not crucial here and will, therefore, be neglected.

With this specification, the "Holy Trinity" (Rao 2005) becomes

d = 2{ℓ(θ̂) − ℓ(θ0)},   (1)
w = (θ̂ − θ0)² / J(θ̂)⁻¹,   (2)
s = u(θ0)² / J(θ0).   (3)

As reported in most textbooks, each test uses its own scale: d works on the log-likelihood scale, w on the parameter scale, and s on the first-derivative scale. To better explain them, some textbooks also report a classical graphical display, like the one depicted in Figure 1: see, for instance, Figure 4.2 on p. 122 of Azzalini (2001), Figure 3.7 on p. 89 of Agresti (2007), and Figure 3.1 on p. 126 of Boos and Stefanski (2013). While this representation is absolutely correct, we propose that the presentation and discussion of the connections among these three test statistics can be fruitfully enhanced using a graphical illustration for all the test statistics on the same scale.

As mentioned in the introduction, we also discuss the new gradient statistic (Terrell 2002), defined as

g = u(θ0)(θ̂ − θ0).   (4)

The gradient statistic is justified by noting that the product of the two standard Normal variates, u(θ0; Y)/V[u(θ0; Y)]^{1/2} × (θ̂(Y) − θ0)/V[θ̂(Y)]^{1/2}, simplifies to (4) when V[θ̂(Y)] ≈ V[u(θ0; Y)]⁻¹. As for the Holy Trinity, the large-sample null distribution of the gradient statistic is χ₁². It should be mentioned that no test, including the gradient statistic, is uniformly most powerful in all settings; thus it makes sense to learn each of them, and appreciating their geometrical interpretation may help to reach this goal.
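As a concrete companion to (1)–(4), the four statistics can be computed side by side in a few lines of code. The Poisson sample below (n = 10, Σy = 30, so θ̂ = 3, tested against H0: θ = 2) and all function names are our own illustration, not part of the article.

```python
import math

# Hypothetical example: an iid Poisson(theta) sample with n = 10 and
# sum(y) = 30, so theta_hat = ybar = 3; we test H0: theta = 2.
n, S = 10, 30
theta_hat, theta0 = S / n, 2.0

def loglik(t):      # log-likelihood, additive constants dropped
    return S * math.log(t) - n * t

def u(t):           # score: first derivative of the log-likelihood
    return S / t - n

def J(t):           # observed information: minus the second derivative
    return S / t**2

d = 2 * (loglik(theta_hat) - loglik(theta0))   # likelihood ratio, eq. (1)
w = (theta_hat - theta0) ** 2 * J(theta_hat)   # Wald, eq. (2)
s = u(theta0) ** 2 / J(theta0)                 # score, eq. (3)
g = u(theta0) * (theta_hat - theta0)           # gradient, eq. (4)
print(d, w, s, g)
```

With these numbers w = s = 10/3, while d ≈ 4.33 and g = 5: the four values differ, but each is referred to the same χ₁² null distribution.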
3. COMPARISON ON THE LOG-LIKELIHOOD SCALE

Under the assumed regularity conditions, ℓ(θ) is smooth up to the second order, and it is possible to approximate it via second-order polynomials. As is well known, two quadratic approximations are available. The former is based on the Taylor expansion at θ = θ̂:

Pw(θ) = ℓ(θ̂) + (θ − θ̂) ℓ′(θ̂) + ½ (θ − θ̂)² ℓ″(θ̂),   (5)

where clearly ℓ′(θ̂) = 0; the latter relies on a Taylor expansion at θ = θ0:

Ps(θ) = ℓ(θ0) + (θ − θ0) ℓ′(θ0) + ½ (θ − θ0)² ℓ″(θ0).   (6)

Traditionally, the Wald and score tests are discussed with reference to the usual formulas (2) and (3), which are employed to justify their rationale and, at the same time, to illustrate their different scales. The fact that they are based on the two aforementioned quadratic approximations is only mentioned to explain how the numerators (θ̂ − θ0)² and u(θ0)² are weighted by the log-likelihood curvature. For instance, Boos and Stefanski (2013, p. 126) wrote, "The likelihood ratio test statistic is a multiple of the difference of the log-likelihood; the Wald test statistic is a multiple of the squared difference of (θ̂ − θ0); and the score test statistic is a multiple of the squared slope at θ0." Thus, students are not presented with the actual relationships of w and s with the log-likelihood itself.

On learning that ℓ(θ) measures just the plausibility of the different θ values in the observed sample, students find it very intuitive to reject H0 if the model plausibility under H0, that is, ℓ(θ0), is smaller than the sample evidence representing the maximum plausibility, that is, ℓ(θ̂). In our view, when introducing the Wald and score tests, it would be helpful to present them using the same approach used for the likelihood ratio, that is, as alternative ways of comparing the same two plausibility values, but employing the quadratic approximations (5) and (6), instead of the log-likelihood itself.
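The two expansions (5) and (6) are easy to explore numerically. The sketch below, again with a made-up Poisson sample (our own illustration, not from the article), builds both parabolas and checks that each matches the log-likelihood exactly at its expansion point and tracks it closely nearby.

```python
import math

# Quadratic approximations (5) and (6) for a made-up Poisson(theta) sample
# with n = 10 and sum(y) = 30 (so theta_hat = 3); H0: theta0 = 2.
n, S = 10, 30
theta_hat, theta0 = 3.0, 2.0

def loglik(t):
    return S * math.log(t) - n * t      # additive constants dropped

def d1(t):                              # first derivative (the score)
    return S / t - n

def d2(t):                              # second derivative
    return -S / t**2

def P_w(t):                             # expansion (5), around theta_hat
    return (loglik(theta_hat) + (t - theta_hat) * d1(theta_hat)
            + 0.5 * (t - theta_hat) ** 2 * d2(theta_hat))

def P_s(t):                             # expansion (6), around theta0
    return (loglik(theta0) + (t - theta0) * d1(theta0)
            + 0.5 * (t - theta0) ** 2 * d2(theta0))

# Each parabola matches the log-likelihood at its own expansion point
# (exactly zero by construction) and tracks it closely nearby:
print(P_w(theta_hat) - loglik(theta_hat))
print(P_s(theta0) - loglik(theta0))
print(P_w(2.9) - loglik(2.9), P_s(2.1) - loglik(2.1))
```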
It is trivial to check that subtracting and doubling the plausibility values at θ0 and θ̂ provided by approximation (5) yields

2{Pw(θ̂) − Pw(θ0)} = w.

Figure 1. Comparing the three test statistics according to the traditional plot: the likelihood ratio is reported on the y scale, the Wald on the x scale, and the score on the first-derivative scale. The different scales do not favor understanding of the underlying connections.

Similarly, s in (3) can be obtained by subtracting and doubling plausibility values at the maximum and under H0 according to approximation (6). Notice the maximum plausibility of (6) is attained at θ̃0 = θ0 − ℓ′(θ0)/ℓ″(θ0). The reader can obtain it by equating Ps′(θ) to zero, or by recognizing that θ̃0 is simply the update from the Newton algorithm step to maximize ℓ(θ) starting from θ0. Therefore, it is easy to check that

2{Ps(θ̃0) − Ps(θ0)} = s.

Figure 2. Comparing the four test statistics on the log-likelihood scale. On each plot the log-likelihood is illustrated (black line) along with the relevant approximation underlying the test statistic: in the Wald panel the gray line is Pw(θ), in the score panel it is Ps(θ), and in the gradient panel it is Pg(θ). The arrows on the left side quantify the corresponding observed test statistic; the longer the arrow, the larger the evidence against H0.
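Both doubling identities, 2{Pw(θ̂) − Pw(θ0)} = w and 2{Ps(θ̃0) − Ps(θ0)} = s, can be verified numerically; the Poisson numbers below are our own illustration, not the article's.

```python
import math

# Numerical check of the two doubling identities, with made-up Poisson
# numbers: n = 10, sum(y) = 30, theta_hat = 3, theta0 = 2.
n, S = 10, 30
theta_hat, theta0 = 3.0, 2.0

loglik = lambda t: S * math.log(t) - n * t
u = lambda t: S / t - n                  # score
J = lambda t: S / t**2                   # observed information = -ell''

# Approximation (5): the linear term vanishes since the score is 0 at theta_hat
P_w = lambda t: loglik(theta_hat) - 0.5 * (t - theta_hat) ** 2 * J(theta_hat)
# Approximation (6):
P_s = lambda t: (loglik(theta0) + (t - theta0) * u(theta0)
                 - 0.5 * (t - theta0) ** 2 * J(theta0))

# Newton step from theta0 (theta0 - ell'/ell'' = theta0 + u/J) maximizes P_s:
theta_tilde = theta0 + u(theta0) / J(theta0)

w = (theta_hat - theta0) ** 2 * J(theta_hat)
s = u(theta0) ** 2 / J(theta0)
print(2 * (P_w(theta_hat) - P_w(theta0)), w)     # equal (up to rounding)
print(2 * (P_s(theta_tilde) - P_s(theta0)), s)   # equal (up to rounding)
```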
Notice that for the likelihood ratio, Wald, and score statistics, the arrow lengths have to be doubled to obtain the actual values comparable to those from the gradient statistic.

Finally, the gradient statistic can also be expressed as a difference in plausibility at θ0 and θ̂ on an appropriate log-likelihood approximation, simply using a first-order Taylor expansion of ℓ(θ) at θ0,

Pg(θ) = ℓ(θ0) + (θ − θ0) ℓ′(θ0).

Here we do not need to double the difference to obtain

{Pg(θ̂) − Pg(θ0)} = g.

Figure 2 presents the graphical representations of the four test statistics on the common log-likelihood scale. Of course, for inferences about the location parameter of Gaussian models, the two approximations Pw and Ps coincide with the log-likelihood, thus d = w = s; also, it is simple to show that the gradient statistic reduces to the same value. In fact, the four test statistics are identical in this special case.

4. COMPARISON ON THE SCORE SCALE

The four test statistics also have a common representation on the score scale. In Figure 3, the gray area of the "triangle" A in each plot is half the value of the corresponding statistic. The likelihood ratio statistic, again, is simple. Since ℓ′(θ) = u(θ), the log-likelihood is, up to a constant, the antiderivative of the score, and clearly the area of the curvilinear triangle A is

A = ∫_{θ0}^{θ̂} u(θ) dθ = ℓ(θ̂) − ℓ(θ0) = ½ d.

In each remaining triangle displayed, let b be the base, h be the height, and recall that u′(θ) provides the slope at each θ ∈ Θ. In the "Wald" panel, b = θ̂ − θ0 and, from basic trigonometric relationships, h/b = −u′(θ̂); thus h = (θ̂ − θ0){−u′(θ̂)}. Hence,

A = ½ bh = ½ (θ̂ − θ0)² {−u′(θ̂)} = ½ (θ̂ − θ0)² / {−u′(θ̂)}⁻¹ = ½ w,

where {−u′(θ̂)} = J(θ̂). In the "score" panel, h = u(θ0) and, using the same trigonometric relationship h/b = −u′(θ0), we get b = u(θ0)/{−u′(θ0)}. Hence,

A = ½ bh = ½ u(θ0)² / {−u′(θ0)} = ½ s,

where {−u′(θ0)} = J(θ0).
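The three areas can be recomputed for a made-up Poisson sample (our own numbers, not the article's); the curvilinear triangle is obtained exactly here, since the antiderivative of the score is the log-likelihood.

```python
import math

# Score-scale "triangles" for a made-up Poisson(theta) example
# (n = 10, sum(y) = 30, theta_hat = 3, theta0 = 2): each area is half
# the corresponding statistic.
n, S = 10, 30
theta_hat, theta0 = 3.0, 2.0

u = lambda t: S / t - n     # score; its antiderivative is the log-likelihood
J = lambda t: S / t**2      # = -u'(theta)

# Curvilinear triangle: exact integral of the score from theta0 to theta_hat
A_lr = (S * math.log(theta_hat) - n * theta_hat) \
     - (S * math.log(theta0) - n * theta0)                 # = d/2
# Wald triangle: base theta_hat - theta0, slope -u'(theta_hat)
A_w = 0.5 * (theta_hat - theta0) ** 2 * J(theta_hat)       # = w/2
# Score triangle: height u(theta0), slope -u'(theta0)
A_s = 0.5 * u(theta0) ** 2 / J(theta0)                     # = s/2
print(A_lr, A_w, A_s)
```

Each printed area is exactly half the value of d, w, and s computed from (1)–(3) for the same numbers.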
Curiously, when showing the three triangles representing the likelihood ratio, Wald, and score statistics in the classroom, a student of ours asked about the "missing triangle." He meant just the one now shown in the lower right corner. After looking at the first three, he expected this fourth triangle as the last, natural, and consequential step of a logical sequence. At the time, we had only three triangles in our graphical display and were unable to answer our student's question, as we did not yet know the gradient statistic. Now the missing triangle has appeared and it makes sense: in the "gradient" panel in the lower right corner it is immediately seen that b = θ̂ − θ0 and h = u(θ0), and then

A = ½ bh = ½ u(θ0)(θ̂ − θ0) = ½ g.

For inferences about the location parameter of Gaussian models the score is linear, and therefore the four triangles coincide.

Figure 3. Illustrating the four test statistics on the score scale. In each plot the gray area represents half the actual empirical value of the test statistic. The bigger the "triangle," the larger the evidence against H0. In this example, clearly g > d > w (or s), but in general the areas of the triangles depend on the shape of the score u(·) (in turn depending on the assumed model), and on the locations of θ̂ and θ0.

5. A NONSTANDARD EXAMPLE: THE MONOTONIC LIKELIHOOD PROBLEM

The common score-scale representation of the likelihood-based statistics enables us to better understand and appreciate the applicability and potential fallibility of some test statistics in specific nonstandard situations; see Fears, Benichou, and Gail (1996) for an example focusing on the variance of random effects in a simple balanced one-way ANOVA.
Another well-known example concerns the monotonic likelihood problem: in a simple 2 × 2 table the odds-ratio estimate does not exist when there is "quasi-separation" between the response conditional distributions and, consequently, a cell count is zero (Hauck and Donner 1977). As the parameter of interest θ increases, the log-likelihood increases monotonically and plateaus at large values of θ. Accordingly, the first derivative, that is, the score, approaches zero without crossing the x-axis. The maximum likelihood estimate usually returned by estimating algorithms and routine software is the parameter value at which the log-likelihood does not increase further; its numeric value depends on the tolerance stopping value of the algorithm and is normally quite large. The graphical illustration on the score scale reported in Figure 4 aids understanding of the fallibility of the Wald and also of the gradient test statistic, which, interestingly, is due to their opposite behaviors. In fact, the score is very flat and close to zero at θ̂, making the area of the Wald triangle very tiny: this means that the corresponding empirical value is substantially zero. On the contrary, the area of the gradient triangle becomes huge, indicating a very large empirical value.

Figure 4. Illustrating the fallibility of the Wald and gradient statistics in testing for association in a logistic regression model with quasi-separation between the response conditional distributions, causing the log-likelihood to be monotonic and the score function not to cross the x-axis. The gray areas represent the empirical values of the observed test statistics.
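The opposite behaviors of the Wald and gradient statistics can be sketched numerically with a deliberately minimal hypothetical model: a single binomial sample with all n trials successful and θ the logit of the success probability, so that the log-likelihood is monotone. This toy model and all numbers are our own, standing in for the article's 2 × 2 table.

```python
import math

# Hypothetical minimal instance of the monotonic-likelihood problem:
# y = n successes out of n Bernoulli trials, theta = logit(p). The
# log-likelihood n*theta - n*log(1 + exp(theta)) increases monotonically
# and plateaus, so the score stays positive and never crosses the x-axis.
n, theta0 = 10, 0.0

loglik = lambda t: n * t - n * math.log(1.0 + math.exp(t))
u = lambda t: n / (1.0 + math.exp(t))                      # score, always > 0
J = lambda t: n * math.exp(t) / (1.0 + math.exp(t)) ** 2   # observed information

# The "estimate" returned by software grows as the stopping tolerance
# shrinks; watch the four statistics as it does:
for theta_hat in (5.0, 10.0, 20.0):
    d = 2 * (loglik(theta_hat) - loglik(theta0))
    w = (theta_hat - theta0) ** 2 * J(theta_hat)
    s = u(theta0) ** 2 / J(theta0)
    g = u(theta0) * (theta_hat - theta0)
    print(f"theta_hat={theta_hat:5.1f}  d={d:7.3f}  w={w:.2e}  s={s:5.2f}  g={g:6.1f}")
# d and s stabilize at finite values; w collapses toward 0; g grows without bound
```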
Thus, in terms of performance of statistical tests, the Wald and gradient statistics are questionable if the probability of observing a zero count is not negligible, owing to small sample size and/or a strong predictor of the response: the power of the Wald test tends to zero, and the size of the gradient test is out of control and potentially much larger than the nominal level. On the other hand, it is evident that the likelihood ratio and the score tests remain valid, and in a likelihood framework they can provide reliable results and trustworthy p-values. A possible, probably better, alternative is to assume the Jeffreys invariant prior for the model parameters; this guarantees that the score function always crosses the x-axis, namely that a finite estimate exists even in cases of quasi-separation of the data. While this approach clearly has a Bayesian flavor, it is also accommodated in a likelihood-based paradigm via penalized likelihood; see Firth (1993) for details.

6. CONCLUSION

We have presented two graphical illustrations on common scales for the likelihood ratio, Wald, score, and gradient statistics. These alternative views can facilitate comprehension of the underlying relationships among the tests. While we do not claim that these relationships are new, it appears that graphical illustrations and comparisons based on them have been neglected in standard graduate textbooks on statistical inference. To the best of our knowledge, no textbook includes a similar treatment. Indeed, we experienced quite positive reactions from our students when introducing the test statistics via these graphical displays. An important benefit is that the derived formulas for w, s, and g based on plausibility differences clarify and emphasize the role of the likelihood itself.

[Received October 2013. Revised August 2014.]

REFERENCES

Agresti, A. (2007), An Introduction to Categorical Data Analysis (2nd ed.), New York: Wiley.

Azzalini, A.
(2001), Inferenza Statistica: Una Presentazione Basata sul Concetto di Verosimiglianza, Milano: Springer.

Boos, D. D., and Stefanski, L. A. (2013), Essential Statistical Inference: Theory and Methods, New York: Springer.

Casella, G., and Berger, R. L. (2002), Statistical Inference, Belmont, CA: Duxbury.

Fears, T. R., Benichou, J., and Gail, M. H. (1996), "A Reminder of the Fallibility of the Wald Statistic," The American Statistician, 50, 226–227.

Firth, D. (1993), "Bias Reduction of Maximum Likelihood Estimates," Biometrika, 80, 27–38.

Hauck, W. W., and Donner, A. (1977), "Wald's Test as Applied to Hypotheses in Logit Analysis," Journal of the American Statistical Association, 72, 851–853.

Neyman, J., and Pearson, E. S. (1928), "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference," Biometrika, 20A, 175–240.

Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, New York: Oxford University Press.

Rao, C. R. (1948), "Large Sample Tests of Statistical Hypotheses Concerning Several Parameters With Applications to Problems of Estimation," Proceedings of the Cambridge Philosophical Society, 44, 50–57.

——— (2005), "Score Test: Historical Review and Recent Developments," in Advances in Ranking and Selection, Multiple Comparisons, and Reliability, eds. N. Balakrishnan, N. Kannan, and H. N. Nagaraja, Boston, MA: Birkhäuser.

Terrell, G. (2002), "The Gradient Statistic," Computing Science and Statistics, 34, 206–215.

Wald, A. (1943), "Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large," Transactions of the American Mathematical Society, 54, 426–482.

Wilks, S. S. (1938), "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses," Annals of Mathematical Statistics, 9, 60–62.