On Certain Types of Compound Frequency Distributions in Which

Biometrika Trust
On Certain Types of Compound Frequency Distributions in Which the Components Can be
Individually Described by Binomial Series
Author(s): Karl Pearson
Reviewed work(s):
Source: Biometrika, Vol. 11, No. 1/2 (Nov., 1915), pp. 139-144
Published by: Biometrika Trust
Stable URL: http://www.jstor.org/stable/2331886 .
Accessed: 27/09/2012 06:35
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact [email protected].
.
Biometrika Trust is collaborating with JSTOR to digitize, preserve and extend access to Biometrika.
http://www.jstor.org
139
lfTiscellanea
On certain Types of Compound Frequency Distributions in which the
Components can be individually described by Binomial Series.
By KARL PEARSON, F.R.S.
Certaindifficultieswith regardto the interpretationof negativebinomials,which are of constant
occurrencein observational frequency series, have suggested the following investigation.
Consider a number auof binomial series of which the sth is v. (ps + q)fl and let us suppose
a frequency series compounded by adding together the rth terms of all these series, such will be
the compound frequency series it is proposed to discuss. We can realise its nature a little more
concretely by supposing n balls drawn out of a bag containing Np white and Nq black balls,
N being very large as compared with n, or else each ball returned before a fresh draw, while
the values of p and q cbange discontinuously at the v, + 1, vI + v2 + 1, V1 + V2 + V3 + 1, etc.
draws. We shall take as origin the point at which the sum of the first terms of all the binomials
may be supposed to be plotted. S will denote summation to u terms.
Let N = S (v,), and NMi', NU2' be the moments of the compound svstem about this origin,
NIul, Np.2its moments about its mean. Thus
Np.j' = S (v8nq8),
{v. (np8q8 + n 2q2) }.
Np.2' =S
Hence
N/12= nS {v8q8(1 - q,)} +
= NI1' +
a,2 (S (V.q82)
- [nS {vYv, (q8 - qs8)2}
-
_
_(
-_
NS
(v8q82)],
since N = S (v,). Accordingly, if a-be the standard deviation and anthe mean of the compound
series, i.e. k2= (T2, 1' = m, then
(
ff
(2
nS{
(q,-q8)2}
(V8
vv
q8)
2)
-ANS (vq
v~).i
Now suppose we had endeavoured to fit a binomial N (P + Q)Kto the compound series, we
should have had
m=-KQ,
a.2KpQ
and accordingly have found
Q
Sv,
1
,,
q.
-qs,
N~S(v,q,2
{ S (v.....................q)
q).
,
f Iv. v.,I~S~
V.
n nS
((q
qs - qst,21
...................
f,
NS (V8 q 2) .
-@ *-....(iii)
** (ii
Thus, had we attempted to fit a binomial to the heterogeneous series, we should have found
Q negative and P greater than unity provided
nS {IvSv (q8 - q8')2} be > NS (v.q 2),
a condition which will frequently be found to be satisfied, especially if q8be small and n large.
In the limit let us take nq, -- m, and q, vanishingly smnall,i.e. suppose the sth binomial to be
replaced by the Poisson series
K
q-
e- "s I1+ m.3
1+
-
+
*-+
+
)
then we have at once
S
Q
4 rsv8
(i8
-S,)2
NS {8rvnm}
+ S{VS,v (ms - .Ms,)2
P-1 I
P= +
K
-
NS (v,M,)
...........................
(IV)
(S {V,M8tn)2
S -VS V,,
MXs-es,)2}
Thus, if two or more Poisson's series be combined term by term from the first, then the
compound will always be a negative binomial. This theorem was first pointed out to me by
"Student" and suggested by him as a possible explanation of negative binomials occurring in
Mfiscellauea
140
material which theoretically should obey the Law of Small Numbers, e.g. "Student's" own
Haemacytometer counts*. Of course the negative binomial may quite conceivably arise from
other sourcest than heterogeneity, but if this be the source of its origin in the material of Bortkewitsch, Mortaraand McKendrickt, it is certainly most unfortunatethat such material should have
been selected to illustrate Poisson's limit to the binomial.
Now we know that the values about the mean of the successive moment coefficients of the
binomial are
npq"
/12=
l3
/A4
nPq (P
-
........................................
q)p
( 1 + 3 (n - 2) pq)
=npq
(V)
J
Further the mean is at a distance nq from the first term pn. We shall call this distance m.
Let "' $2", $3' and P4 be the moment coefficientsround the start of each binomial. Then
=nq,
2'=npq+n2q2,
$1A.
3
-
-npq(p
q) + 3npq X nq +
- 2) pq) + 4npq
U4=nq(+3(n
n3 q3
............
(p - q) nq +
6 (npq)
n2 q2
+
k(vi)
n4q4.J
From these equations we deduce
nq,
2
- 32'+
21'=n(n
JAI'
3
/14'
NoWlet
-
6/13'
a, =
t=n
(n -
- 6$1' = n(n
+ 11/2'
l
) q2^
- l) (n - 2) q3
-
(vii)
1) (n
-
2) (n
3)
-
q4.
for the combined series,
Mi' for the combined series,
a3 =
3t - 3P2' - 2,1' for the combined series,
a4 = 4' - 6P3' + 11/2' - 6Mi' for the combined series.
Then we have, if X1= v1/N, X2 = v2/N:
1 = X1+ X2,
a2
=
/IA'
M2' -
n
n(n
a2
-
X ql + X2q2,
12
=X1q
+
2
Xq2 22........................
a3
n (n - 1) (n - 2)
(viii)
Xq3 + Xq3,
a4
.=XqL
2q'
1) (n - 2) (n - 3) = X1q1'+ 2q2'.
Multiply each equation by q1 and subtract from that below it and we find:
n (n
-
= X22(q2-
n
a2
n (n
n(n - 1) (na4-- 2) (n
-
1)
a3(n -2) -
-
3)
-n
(n
aq(q2=
n (na2- 1)
q1
- a3q(
1)(n al)q
=
ql),
2q2
-2q2
............
(ix)
(2-q
2) =Xq3(2-q)
= X2q2(q2
-
* Biometrika, Vol. v. p. 356, and see also L. Whitaker's examination of these
data, Vol. x. p. 48.
t Pearson, Biometrika, Vol. iv. p. 209.
t For an examination of Bortkewitsch and Mortara's instances see L. Whitaker, loc. cit. pp. 49-66.
McKendrick has recently reached Poisson's series (Proceeding8 of the London Mathematical Society,
Vol. xm. (1914), pp. 405 et Beq ) without apparently recognising that he was on familiar ground, and
has suggested its application to the frequencies obtained by counts of the bacilli ingested by leucocytes.
He has fitted his series not by moments, but from the first two terms, and has failed to recognise that
a large proportion of such leucocyte counts give also negative binomials.
JMiscellanea
141
Hence dividing each equation by the preceding one:
a2
alq1
a3
n(n - 1)
n
n(n - l)(n - 2)
a1
-
a2
n
a4
-
1)(n-
2)(n-
n(n-
1)(n-
n(n - 1)(n - 2)
3)
a2q1q
a3
n
n(n - 1)
2)
q2 =
- (n
-
n(n-
PI, we obtain
1) aIP, + n (n - 1) P2 = 0,
(n
2) a2p. + (n - 1) (n - 2) ajLP2= 0,.*## ....................
(n - 3) a3P, + (1n - 2) (n - 3) a2P2 = o,J
Writing q, q2 =P2 and q, +
3
1)
a1q,
1)
n(n-
a2
a3 q3
a4
a2q,
n(n-
-
(xi)
three equations to determine n, P, and P2.
Eliminating - P1 and P2 we find:
a2
(n-
n(n
(n - 3)a3
(n - 1) (n - 2) a,
(n - 2)(n - 3)a2
1)a,
a3 (n -2)a2
a4
which expanded gives us the cubic for n:
n3 (2a1a2a3 - a23 - aI2a, - a2a4 - a32) + n2
+
n (22a1a2a3
-
16a23
-
5a,2a4
A root of this cubic substituted
the quadratic
+
2a2a4
3a32)
+
o= .(x...
+ 7a23 + 4a,2a4
12a1a2a3
(
-
-I)
12a,a2a3
(-
+
..
-
(xii)
3a2a4 + 4a32)
12a23 + 2a12a4)
0. ...(Xii)bl8
=
of (xi) will give P1 and P2 and then
in the first two equations
(xiii)
p2 -pp + P2-O*- ...........................
the
to
value
of
The
n.
first
equations
two
corresponding
q2
q1
and
will determine the two ialues
of (viii) then complete the solution by providing X, and X2.
Until the roots of the cubic (xii)blS have been discussed we can only assume that three
solutions are possible. As a matter of fact in the examples so far dealt with some of these solutions
have usually to be discarded.
For the special case of Poisson's limit to the binomial, we make n indefinitely large, q indefinitely small, and nq = m finite. Hence equations (viii) become
1 = X1+ X2'
a, = X m, + k2m2,
|
....
a2 = X1m12 + X2m22
(xiv)
a3 = X1m13 + X2m23,
a4 = XImI4
a2 - a,Q?
leading to
+
k2m24,)
Q2= 0,
+
a3 - a2QI + a1Q2
a4 - a3Ql
= mI + M2,
ifQt
0,
=
0,
+ a2Q2 =
Q2 = MiM2.
Thus we find:
subject to the condition*
-
Q=
(a3 - a,a2)/(a2
Q2=
(a3a1 - a22)/(a2 - a12),
a4 (a2 - a,2)
+ 2a1a2a3
-
a,2),
a32 -
a23 = 0 .........
.
(xv)
Hence ml and M2 are roots of
m2 (a2 - a12) - m (a3 -
a1a2)+ 3al
- a22 _.
..................
(x'Vi)
* Of course equations of condition hold for the 5th and higher moments in the case of the two
binomial components. But they are of small service as the probable errors of these high moments are
usually very considerable.
142
MJiiscellanea
Further
1
X
mI
-
X
2
M2
M2'
-
(xvii)
...........................
ml'
which deterimiineX, and X2.
It is thus quite easy to resolve a series into the sum of two Poisson's binomial limits provided
the roots of the above quadratic are real. As illustration I take "Student's" first count of yeast
cells on the 400 squares of a haemacytometer. He found:
No. of yeast
1
Frequencv
0
1
2
3
4
5
Total
213
128
37
18
3
1
400
=6825, M2 = *8117, u3- 10876.
giving: mean
From the above values of "Student" I determined
/3' = 3-075.
M2'=
1-2775,
a3-= 6000,
a1= 6825,
a2= 5950,
Whence
and the resulting quadratic is
129,194m2 - *193,913m + -055,475 = 0.
=
'3847 and M2= 1-1163, leading to X1= -59,295 and X2= 40,705, or, the
The roots are ml
series has for its two components
ml=
VI= 237-18,
*3847,
162-82,
m2 = 1 1163.
Calculatingout the Poisson's series for these components we have:
V2=
No.ofyeast
0
1
2
3
4
5
6
7
*01
*77
.00
14
.00
*02
lst Compt.
2nd Compt.
161P44
53-32
62-11
59-52
11-95
32-22
1P53
12X36
*15
3-45
Round totals
215
122
44
14
4
1
128
37
18
3
1
Observed
213
The test for " goodness of fit" for these six groups gives x2 = 2X82and P = 73, or the fit is
very good. The negative binomial gave P = *52and a single Poisson's series only P = '04. But
the double Poisson's series places of course one constant more at our disposal than the binomial,
and we can do still better with a double binomial, as we have four constants and only six
frequencies, while the double Poisson has three constants to six frequencies. It is clear that
neither of the above componentsforming 41 % and 59 % of the total number of cells, and having
their means at *3847and 1 1163instead of *6825,gives any idea of a dominant constitution in the
solution sampled. If in this case heterogeneity accounts for the negative binomial, then the
differenceof the components is not slight, and the heterogeneity being gross would indicate some
considerable failure in technique.
If we assume that the oounts with a haemacytometer ought tlofollow the Poisson distribution,
-and this seems to be theoretically prQbable,-then the criterion of the binomial might well be
adopted to ascertain the possibility of some failure in technique. The actual binomial in
"Student's" first case should be AU + igg)273 and any binomialwith p very small and np = *6825
would effectively represent the series; we could not anticipate getting n = 273 and p
closely from the data, but we might certainlv anticipate a positive binomial, if the theorv of a
Poisson distribution be correct. If on the other hand we say in this and many similar cases
that the negative binomial arises from heterogeneity, then it appears to me that we have saved
our theory at the expense of our technique. I propose now to test this point further by considering the component binomials. If the theory of heterogeneity be correct, unless it be very
143
Miscellanea
manifold, we might anticipate two binomial components, v, (Pi + qj)7 + '2 (P2 + q2)nl, with
n positive and large, both q, and q2 being small, while nlqt and n2q2 would be approximately *3847
and 1*1163, the frequencies v, and V2 being roughly in the ratio of 3 to 2.
Returning to "Student's" data we find p4' = 8-9275, whence
6000,
a2 = -5950,
a4 = -4800.
a3=
a, = 6825,
Substituting in (xii)his we obtain for the cubic:
*021,3269n3 + 028,2321n2 + -363,3020n + *051,0825 = 0.
This has three real roots, approximately.
n' = 4-89997,
and
n" = - *14234,
n."
-
- 343390.
The first two equations of
We will consider in succession these cases: (i) n' = 4-89997.
(xi) provide
5950 - 2 66173P1 + 19'10974P2 = 0,
6000 - 1-72548P1 + 7-71894P2 = 0,
*55304, P2 = -04590, and the quadratic
leading to P=
q2 - .55304q + *04590 = 0,
whence we deduce the binomial factors
q, = *4514, Pi = *5486, and q2 -- *1017, P2 = *8983.
The first two equations of (viii) give
X + A2,
1 X1
.6825/4.89997 = *451,355X1 + *101,685Xk2,
leading to
A2 = 892,465,
XI=*107,535,
or, in a population of 400,
=}1=43-014,
'2
= 356-986.
Accordingly the compound series is given by
43-014 (.5486 + .4514)4.89997 + 356-986 (.8983 + .1017)4.89997,
with means of the components at
ml = 2-2118 and m2 = *4983.
We see that neither the sizes of the component populations nor their means have any relation
to the previously discussed component Poisson series; further the present series* diverge widely
f rom Poisson series, n is not large nor q1 or q2 very small. Calculated to the nearest whole numbers
we obtain:
No. of
yeast cells
0
1
2
2
3
3
4
4
5
Ist Compt.
2nd Compt.
2
211
9
117
15
26
12
3
4
0
1
0
Combination
213
126
41
15
4
1
Observed
213
128
18
3
1
37
5
which leads to X2
1-27 and P=
93.
Thus the fit is excellent, but it does not correspond to the heterogeneity of a double Poisson
series.
(ii)
and give
with
n" = - *14234.
Here the first two equations of (xi) provide
5950 + *77965P1 + *16260P2 = 0,
6000 + 1 27469P1 + 1-75727P2 = 0,
P2 = *24996,
P1 = - *81529,
q2 + *81529q + *24996 = 0.
* It should be noted that such fractional binomial series tend ultimately to become negative,
although with negligibly small frequencics.
144
Miscellauea
This gives imaginary values of q, and q2and thus the solution can for the present purpose be
discarded.
- 3 43390.
We deduce
(iii) n"'
*5950 + 3*02614P1 + 15*22557PZ = 0,
*6000 + 3*23317P1+ 16*44372P2= 0,
P2 = 20229
P1 = - 1-21441,
giving
q2 + 1.21441q+ *20229= 0.
and
q2 - *1993,
q1= - 1-0151,
pi = 2-0151,
P2 = 1-1993.
Hence
The X equations are
1 = X1 + X2,
- 1-0151X1- *1993X2,
- .6845/3.4339
\1 = 000043,
leading to
2 = *999957,
V2= 399-9828.
or,
vj.= *0172,
Thus the component series is
*0172(2-0151 - 1.0151)-3-4339 + 399-9828(1.1993 - *1993)3-4339
with means at
ml = 3-4858 and m2 = *6847.
The first of these components is negligible, it contains roughly only *02 individuals in 400,
and the second is sensibly identical with the negative binomial obtained by "Student," i.e.
400 (1-1893 - .1893)-3.6054
with slightly modified constants. It provides:
No.of
yeast cells
Calculated
Observed
O
214
213
122
128
2
3
4
5
45
37
14
18
4
3
1
1
leading to x2 = 3d12and P = *68,which for all practical purposesis as good as the double Poisson.
Concluaions. It having been suggested that the appearance of negativc binomials as better
"fits" than Poisson's series for material that is supposed to follow the law of small numbers is due
to heterogeneity, formulae have been provided for testing whether this heterogeneity is due to
a second component. If so this component should be small and the first component should
substantially agree with the primary Poisson's series. The smallness of the second component
would measure the goodness of the technique in haemacytometer or opsonic index counts.
Applied to "Student's" first series of counts of yeast cells we obtain (a) two Poisson's series
neither of which dominates the data or approximates to the primary Poisson's series; (b) two
positive binomials, neither of which has any approachto a Poisson series or any agreement with
the components of (a); and lastly (c) two negativebinomials, one of which dominates the series,
and agrees with the primary negative binomial. This investigation as far as it goes suggests
either that "Student's" first count is really described homogeneouslyby a negative binomial, or,
if it be heterogeneous,then the heterogeneity is manifold, and no weight can be given to the
results of fitting by the primary Poisson's series.
The general numerical discussion by the formulae of this paper of a variety of data assumed
to follow the "law of small numbers" is in hand and will shortly be published.