
J. NEYMAN (Warszawa - Polonia) (*)
ON METHODS OF TESTING HYPOTHESES
The purpose of the present paper is to give an account of some work I
have carried out partly in cooperation with Dr. E. S. PEARSON of University
College, London.
The problem of the application of the theory of probability to testing hypotheses is, as is well known, a very old one. The first solution of the problem, given in Bayes' Theorem, has however been very seriously attacked, since its application in practice requires a knowledge of the a priori probability law, which can only in quite exceptional cases be deduced from the conditions of the particular problem under consideration. Doubt has even been expressed whether problems exist at all in which the a priori law of probability is given. If rarely met with, such problems do exist, as for example in connection with Mendelism, where we may be testing hypotheses regarding the genetic components of the parents of certain offspring, when those of the grandparents are known.
In other cases, when the a priori law is not given by the nature of the problem, the application of Bayes' Theorem generally needs some new assumptions which, in most practical cases, and particularly when dealing with small samples, will influence the result considerably and, being arbitrary, put it in danger of being useless.
Certain attempts have been made to show that under some conditions, when the number of trials is very large, the actual a priori probability law will not greatly influence the final result of the application of the theorem (2).
I have succeeded in proving the following two theorems:
1°) Let Σ be a sample of N individuals falling into k groups with relative frequencies q₁, q₂, ..., qₖ.
(*) Biometric Laboratory, Nencki Institute, Soc. Scient. ac Litt. Varsoviensis.
(2) K. PEARSON: Biometrika, vol. XIII and XVI. E. S. PEARSON: Biometrika, vol. XVII. Besides these, several authors have proved theorems of the kind under consideration but did not publish them, or the publications are not obtainable. Such are the theorems of S. BERNSTEIN (published in a lithographic edition of his lectures at Kharkoff in 1917), of E. BOREL (lectures at the Sorbonne, 1926) and of Miss A. MIKLASZEWSKA in Warsaw (not published).
This sample has been randomly drawn from some population π, divided also into k groups, in which the corresponding group proportions are p₁, p₂, ..., pₖ. The p's are unknown, and we assume that the a priori law of probability is a function φ(p₁, p₂, ..., pₖ) which satisfies the two following conditions: at the point Σ, that is for pᵢ = qᵢ (i = 1, 2, ..., k), the function φ is positive, and at the same point it is continuous.
If these two conditions are satisfied, then the a posteriori probability that the unknown numbers p₁, p₂, ..., pₖ satisfy the condition
$$N \sum_{i=1}^{k} \frac{(p_i - q_i)^2}{q_i} \leq \chi_1^2,$$
where χ₁ is any given positive number, tends to
$$\int_0^{\chi_1} x^{k-2} e^{-\frac{x^2}{2}} dx \Big/ \int_0^{\infty} x^{k-2} e^{-\frac{x^2}{2}} dx$$
when N → ∞, the ratios qᵢ being constant.
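The limiting value just stated is the cumulative distribution function of the χ distribution with k − 1 degrees of freedom evaluated at χ₁. Purely as an illustration added in editing, and not as part of the original argument, the following minimal Python sketch (assuming NumPy and SciPy, with hypothetical values of k and χ₁) evaluates it both through that distribution and through the ratio of the two integrals.

```python
# Limiting a posteriori probability of theorem 1: ratio of the integrals of
# x^(k-2) * exp(-x^2/2) over (0, chi_1) and (0, infinity).
# k and chi_1 below are hypothetical illustration values.
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi

k = 4        # number of groups (hypothetical)
chi_1 = 1.5  # the given positive number chi_1 (hypothetical)

# Direct evaluation as the CDF of the chi distribution with k - 1 degrees of freedom.
limit_prob = chi.cdf(chi_1, df=k - 1)

# The same value from the ratio of the two integrals in the statement of the theorem.
num, _ = quad(lambda x: x**(k - 2) * np.exp(-x**2 / 2), 0, chi_1)
den, _ = quad(lambda x: x**(k - 2) * np.exp(-x**2 / 2), 0, np.inf)

print(limit_prob, num / den)   # the two results agree
```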
2°) Suppose that the groups into which the sample and the population are classed correspond to different values x₁, x₂, ..., xₖ of a certain character of their individuals, and that x̄ and m are the mean values of this character in the sample and the population respectively. Further let s be the standard deviation in the sample and write σ = s/√N. If the a priori law of probability φ(p₁, p₂, ..., pₖ) satisfies the same conditions as in theorem 1°), then the a posteriori probability that a ≤ m ≤ b, where a < b are two arbitrary numbers, tends to
$$\frac{1}{\sigma\sqrt{2\pi}} \int_a^b e^{-\frac{(m-\bar{x})^2}{2\sigma^2}} dm$$
as N → ∞, the ratios qᵢ being constant.
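In the same illustrative spirit, and again not part of the original text, the limit of theorem 2°) is simply a normal integral with mean x̄ and standard deviation σ = s/√N; the sketch below assumes NumPy and SciPy, and the observations and the limits a, b are hypothetical.

```python
# Limiting a posteriori probability that a <= m <= b (theorem 2):
# the normal integral with mean xbar and standard deviation s / sqrt(N).
import numpy as np
from scipy.stats import norm

sample = np.array([9.8, 10.1, 10.4, 9.7, 10.0, 10.3])  # hypothetical observations
N = sample.size
xbar = sample.mean()
s = sample.std()            # standard deviation in the sample
sigma = s / np.sqrt(N)

a, b = 9.9, 10.2            # two arbitrary numbers, a < b
prob = norm.cdf(b, loc=xbar, scale=sigma) - norm.cdf(a, loc=xbar, scale=sigma)
print(prob)
```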
This second theorem has of course been stated before (1), but the nature of the assumptions involved in reaching the limit has perhaps not been fully examined before.
(1) See for instance BOWLEY: Elements of Statistics, Part II, p. 416.
It has been possible to show that there are discontinuous functions φ for which the first theorem does not hold, and also that, if that function is continuous, the condition φ > 0 is necessary for the validity of the same theorem. It seems probable that the condition of continuity at the point Σ cannot be weakened very much. Since in applications this condition means a practical constancy of the a priori probability, at any rate near the sample point Σ, we see that the necessity of arbitrary assumptions regarding the a priori law of probability is not removed even in the case when the number of observations is very large, and that it is probably impossible for it to be removed.
In addition to that, in both cases the order of approximation of the actual value of the a posteriori probability to its limit depends closely, for a given N, upon the variability of the function φ at the point Σ, and so, even if the assumption as to continuity were true, we can never be sure in practice whether the limiting value is a reasonable approximation to the probability itself.
Although the two theorems have a theoretical interest, they are of more doubtful value in practice and help to indicate the uncertainty that must always be associated with the method of inverse probabilities (1).
After this method had been thrown into doubt, a new principle for testing hypotheses seems to have been very generally accepted, which can be formulated as follows (2).
If the observed event has a character which, from the point of view of the hypothesis considered, is improbable, then the hypothesis itself is improbable.
Clearly not every character of the observed event is suitable for testing hypotheses, and E. BOREL and P. LEVY state that such characters should be from some point of view « remarquable ». They explain the meaning of this word by several examples, but have given no definition of it.
Our general considerations on these points and some applications have been published in the last volume of Biometrika (XX-A) (3).
(1) Since the time when the above results were presented to the International Mathematical Congress at Bologna, they have been considerably extended and are already published. See: J. NEYMAN: Contribution to the Theory of Certain Test Criteria, Bulletin de l'Institut International de Statistique, t. XXIV, 2e Livraison, pp. 44-87. With regard to the theorem under 1°) I must apologize that I overlooked it in the paper by R. v. MISES in the Mathematische Zeitschrift, 1919. Unfortunately nobody present during my reading of the paper in Bologna noticed that the theorem was not new, and its authorship has been wrongly attributed.
(2) E. BOREL: Le Hasard, Paris, 1920. P. LEVY: Calcul des Probabilités, Paris, 1925, pp. 91 ff.
(3) J. NEYMAN and E. S. PEARSON: On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part I. Biometrika, Vol. XX-A, pp. 175-240.
The principle which is used in testing hypotheses must, to be useful, follow our intuition, and intuitively we are sometimes inclined to accept a hypothesis explaining the event even if the probability of the event, were the hypothesis true, is very small, provided there is no alternative hypothesis according to which the chance of the event is greater. On the other hand we are unwilling to accept an hypothesis when others exist which, if they were true, would give rise to the event perhaps a thousand times more often. It has therefore seemed to us that it is impossible to test a hypothesis without taking into account alternative ones, and we have sought some test criterion based on this principle.
This has led us to make use of the idea of likelihood, introduced by Dr. R. A. FISHER. His considerations have been mainly concerned with the estimation of the most probable population from a knowledge of the sample; we have tried to apply the idea to the question of testing hypotheses.
It will be observed that the errors arising in testing hypotheses are of two different kinds: 1°) we sometimes reject a hypothesis when it is true, and 2°) we sometimes accept a false hypothesis. Now it is easy to give rules which will reduce the probability of committing an error of the first kind to any given level, as low as desired, but the control of errors of the second kind is much more difficult. What it seems possible to do is to avoid the acceptance of hypotheses with small likelihoods, where this term is used in a sense which will now be defined.
We distinguish between simple and composite hypotheses. A hypothesis is simple if it is sufficient to determine completely the probability of the observed event. A hypothesis is composite if this is not the case, and additional assumptions are necessary to determine the probability of the event. As these assumptions are arbitrary, we see that what we call a composite hypothesis is really a set of simple ones. This may be illustrated as follows. If the hypothesis consists, for example, in the assumption that the mean and the standard deviation of a normally distributed population have given values, say a and σ, this is a simple hypothesis. If however one of these constants is not specified, the hypothesis is composite. The hypothesis that a population is normally distributed is also a composite one.
Let Ω be the set of all admissible simple hypotheses about the sampled population π and let Σ be a random sample drawn from π. Further let H denote a simple hypothesis and P(H) the probability of Σ which follows from H, while P(max) is the upper bound of the numbers P(H).
We then call the likelihood of the hypothesis H the ratio
$$\lambda(H) = P(H)/P(\max).$$
If we consider a composite hypothesis H′, it will be associated with a subset ω of Ω. Let P(H′) denote the upper bound of the probabilities P(H) corresponding to the simple hypotheses included in ω. The likelihood of the composite hypothesis H′ we then define as
$$\lambda(H') = P(H')/P(\max).$$
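To make these definitions concrete, the sketch below (an editorial illustration, not drawn from the original) computes both likelihoods for a grouped sample, taking P(H) to be the multinomial probability of the observed counts; it assumes NumPy and SciPy, and the counts, the simple hypothesis and the one-parameter family defining the composite hypothesis are all hypothetical.

```python
# Likelihood lambda(H) = P(H) / P(max) for a grouped sample.  P(H) is the
# multinomial probability of the observed counts under the simple hypothesis H;
# P(max), attained at p_i = q_i, is its upper bound over Omega.
import numpy as np
from scipy.special import gammaln, xlogy

counts = np.array([18, 32, 27, 23])   # hypothetical sample: N = 100, k = 4 groups
N = counts.sum()
q = counts / N                        # group proportions in the sample

def log_P(p):
    """Logarithm of the multinomial probability of the sample under hypothesis p."""
    return gammaln(N + 1) - gammaln(counts + 1).sum() + xlogy(counts, p).sum()

log_P_max = log_P(q)                  # upper bound of P(H) over Omega

# Simple hypothesis H: all four groups equally probable.
lam_simple = np.exp(log_P(np.full(4, 0.25)) - log_P_max)

# Composite hypothesis H': p = (a, a, 1/2 - a, 1/2 - a) for some 0 < a < 1/2.
# Its likelihood uses the upper bound of P(H) over the subset omega.
a_grid = np.linspace(0.01, 0.49, 481)
log_P_bar = max(log_P(np.array([a, a, 0.5 - a, 0.5 - a])) for a in a_grid)
lam_composite = np.exp(log_P_bar - log_P_max)

print(lam_simple, lam_composite)
```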
Where, as is generally the case, it is impossible to express in exact terms the
relative a priori probabilities of the different populations making up Ω and ω, we are inclined to think that these two ratios provide us with a kind of numerical measure which it is rational to use in forming a judgment.
We now consider the method to be employed in controlling the two sources of error involved in testing hypotheses. Let H be a simple hypothesis concerning the sampled population π and let x₁, x₂, ..., xₖ be the numbers which specify the sample Σ. If the observations are not grouped, these numbers will represent the variate values in a sample of k individuals; if grouped, then the proportions of individuals falling into k groups. In either case we may consider the x's as the coordinates of a point Σ in a k-dimensional space Ω. The hypothesis H connects with every such point, or small element of volume surrounding such a point, a number P(H, Σ) representing the chance of drawing the sample Σ in the case when the hypothesis H is true. The sum of the numbers P(H, Σ) (or their integral), taken over the whole space Ω for fixed H, is clearly equal to unity.
Now let ε be an arbitrarily small positive number, and let W be any region whatever in the space Ω such that the sum of P(H, Σ) corresponding to points inside W is equal to ε₁ < ε.
If now we adopt the rule of rejecting the hypothesis H every time the sample Σ lies within the region W, we can be sure that we shall make the error of the first kind (that is to say, reject a true hypothesis) in an average proportion ε₁ of the cases in which we are dealing with a true hypothesis.
This will be true whatever be the region W. To control errors of the second kind we propose to reject the hypothesis H only when the likelihood λ(H) is very small. That is to say, we choose for the region W one bounded by a hypersurface on which λ(H) is constant. Each of these hypersurfaces will be associated with a different value of ε₁, and the practical method of testing a hypothesis consists in finding the value of ε₁ corresponding to the hypersurface of constant λ(H) passing through the sample point Σ. The smaller λ(H), and therefore ε₁, the more inclined we are to reject the hypothesis. As λ(H) may be considered as a character of the sample (though of the hypothesis H and of the set Ω also), and as ε₁ is the probability of drawing a sample with a value of λ(H) as small as or smaller than that observed, we find here again the principle of E. BOREL, with the only modification that the notion of a « remarquable » character is now defined. Such a « remarquable » character will be λ(H) itself, but sometimes it is preferable to calculate some function of it.
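For grouped samples the value of ε₁ attached to the likelihood contour through the observed point can be approximated by simulation. The following sketch is again an editorial illustration only and does not reproduce any computation from the original; it assumes NumPy, and the observed counts and the hypothesis H tested are hypothetical.

```python
# Estimate epsilon_1: the chance, under H, of drawing a sample whose
# likelihood lambda(H) is as small as or smaller than the value observed.
import numpy as np

rng = np.random.default_rng(0)

counts_obs = np.array([18, 32, 27, 23])   # hypothetical observed sample
N = counts_obs.sum()
p_H = np.full(4, 0.25)                    # the simple hypothesis H under test

def lam(counts):
    """lambda(H) = P(H) / P(max); the multinomial coefficients cancel in the ratio."""
    q = counts / N
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(p_H / q), 0.0)
    return np.exp(terms.sum())

lam_obs = lam(counts_obs)

# Draw many samples from the population specified by H and count how often
# lambda(H) falls at or below the value observed.
sims = rng.multinomial(N, p_H, size=20000)
epsilon_1 = np.mean([lam(c) <= lam_obs for c in sims])
print(lam_obs, epsilon_1)
```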
In the paper referred to above we have applied these principles to testing hypotheses concerning various distributions of the sampled population.
When we pass to testing composite hypotheses there are certain complications. The region W, which we will now denote W′, will be bounded by the hypersurface on which λ(H′) is constant. If the simple hypotheses H included
in the set ω corresponding to the composite hypothesis H′ can be specified by ascribing definite values to certain parameters, say $\alpha_1, \alpha_2, \ldots, \alpha_c$, which may vary continuously, then the new region W′ will be limited by the envelope of the hypersurfaces of constant likelihood corresponding to the simple hypotheses included in ω. Fix a certain hypothesis H of that set, and let P(H, Σ) be the probability, following from that hypothesis, of drawing a given sample Σ. If now we sum P(H, Σ) over all sample points inside W′, we shall get SP(H, Σ), the chance of rejecting a true hypothesis in the cases when that true hypothesis is H. An important case arises when SP(H, Σ) does not depend upon the particular hypothesis H chosen from the set ω corresponding to H′. In such cases SP(H, Σ) is the probability of rejecting a true hypothesis, whatever be the true hypothesis out of the set ω. In other words, if the sampled population π conforms to one or other of the simple hypotheses constituting the composite one H′, the chance that it will be rejected by using the proposed test is equal to SP(H, Σ). If however this expression depends upon H, we cannot calculate the probability of rejecting a true hypothesis, although by taking its upper bound, if this can be effectively found, we obtain a limit which it cannot exceed. In the cases we have considered, either SP(H, Σ) has been found independent of H, or we could not solve the question whether it is dependent or not.
It is worth noticing that in the case when the composite hypothesis H′ consists in the assumption that the normal population from which a sample has been drawn has its mean equal to a given value, the standard deviation being unspecified, the method of testing which follows from our principles is identical with that of « STUDENT » (*).
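By way of illustration (an editorial addition, not taken from the original), this composite hypothesis can be tested with any standard implementation of « Student's » test; the sketch assumes SciPy, and the observations and the assigned mean m₀ are hypothetical.

```python
# Test of the composite hypothesis that a normal population has mean m0,
# the standard deviation being unspecified ("Student's" test).
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.1, 10.4, 9.7, 10.0, 10.3])  # hypothetical observations
m0 = 10.5                                               # mean assigned by the hypothesis

t_statistic, p_value = stats.ttest_1samp(sample, popmean=m0)
print(t_statistic, p_value)   # p_value plays the role of epsilon_1
```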
In the case when the sampled population is grouped and a simple hypothesis consists in ascribing definite values to the group proportions pᵢ (i = 1, 2, ..., k) in the population, the surfaces of constant likelihood correspond approximately to the equation
$$\chi^2 = N \sum_{i=1}^{k} \frac{(q_i - p_i)^2}{p_i} = \text{constant},$$
where the q's are the group proportions in the sample. We reach here, from another point of view, the well-known (P, χ²) test of Prof. KARL PEARSON.
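A short numerical sketch of this criterion follows (again an editorial illustration, not part of the original); it assumes NumPy and SciPy, and the counts and the group proportions assigned by the hypothesis are hypothetical.

```python
# Pearson's (P, chi^2) test for a simple hypothesis assigning group proportions p_i.
import numpy as np
from scipy.stats import chi2

counts = np.array([18, 32, 27, 23])      # hypothetical sample counts
N = counts.sum()
q = counts / N                           # group proportions in the sample
p = np.array([0.25, 0.25, 0.25, 0.25])   # proportions assigned by the hypothesis

chi_square = N * np.sum((q - p) ** 2 / p)
k = len(counts)
P_value = chi2.sf(chi_square, df=k - 1)  # Pearson's P, playing the role of epsilon_1
print(chi_square, P_value)
```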
If on the other hand we have a composite hypothesis which assumes that the group probabilities p are given functions of c independent parameters $\alpha_1, \alpha_2, \ldots, \alpha_c$, the surfaces of constant likelihood are approximately those of constant minimum χ², and it is possible to show that under certain conditions
$$SP(H, \Sigma) = \text{const.} \times \int_{\chi_1}^{\infty} \chi^{k-c-2} e^{-\frac{1}{2}\chi^2} d\chi,$$
where k means the number of groups in the sample, N the size of the sample, and χ₁ the value found by minimising the above expression of χ with regard to the variable parameters $\alpha_1, \alpha_2, \ldots, \alpha_c$.
(*) « STUDENT », Biometrika, Vol. VI, p. 1 et seq.
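To illustrate this last result (an editorial sketch under assumed details, not a computation from the original), the example below fits c = 1 hypothetical parameter by minimising χ² and then evaluates SP(H, Σ) from the χ² distribution with k − c − 1 degrees of freedom, which is the normalised form of the integral above; it assumes NumPy and SciPy.

```python
# Minimum chi^2 for a hypothetical one-parameter (c = 1) family of group
# probabilities, and the corresponding SP(H, Sigma) with k - c - 1 degrees of freedom.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

counts = np.array([18, 32, 27, 23])   # hypothetical sample counts, k = 4 groups
N = counts.sum()
q = counts / N

def p_of(a):
    """Hypothetical family of group probabilities depending on one parameter a."""
    return np.array([a, a, 0.5 - a, 0.5 - a])

def chi_square(a):
    p = p_of(a)
    return N * np.sum((q - p) ** 2 / p)

# Minimise chi^2 over the variable parameter.
result = minimize_scalar(chi_square, bounds=(0.01, 0.49), method="bounded")
min_chi_square = result.fun           # this is the square of chi_1

k, c = len(counts), 1
SP = chi2.sf(min_chi_square, df=k - c - 1)
print(min_chi_square, SP)
```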
An equivalent result has been given by Dr. R. A. FISHER (1), but we have followed a somewhat different method of proof, with a more detailed examination of the nature of the limiting conditions and the limiting integral (2). This general result has frequent applications in statistical practice.
(1) Jour. Roy. Stat. Soc., vol. 87.
(2) J. NEYMAN and E. S. PEARSON: On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Part II. Biometrika, Vol. XX-A, pp. 263-294.