Johnson, Norman L., Kotz, Samuel, and Wu, Xizhi. "Quota Fulfilment and Errors in Stratification"

QUOTA FULFILMENT AND ERRORS IN STRATIFICATION
Norman L. Johnson
Department of Statistics
University of North Carolina
at Chapel Hill
Chapel Hill, NC 27599-3260, USA
Samuel Kotz
Department of Management
Science & Statistics
University of Maryland
College Park, MD 20742, USA
Xizhi Wu
Department of Statistics
University of North Carolina
at Chapel Hill
SUMMARY
It is desired to draw a random sample containing specified numbers
of individuals from each stratum of a population. First a random
sample of size N is chosen from the whole population, and the stratum
of each individual ascertained; then any shortfall is made up by
selecting individuals with known stratum affiliation. Optimal values
of N are sought, allowing for cost structure and also the possibility
of error in ascertaining the strata to which individuals in the first
sample belong.
Some Key Words: Optimization; Quota; Stratified Sampling
1. Introduction.
Johnson (1957, 1963) considered a problem arising when it is
desired to obtain a random sample containing specified numbers
$n_1, n_2, \ldots, n_k$ of individuals from strata
$\pi_1, \pi_2, \ldots, \pi_k$ respectively by
(i) first taking a random sample of size $N$ from the whole
population, and subsequently determining to which stratum each chosen
individual belongs, and
(ii) if the number, $M_i$ say, of individuals chosen from $\pi_i$ is
less than $n_i$, making good the shortfall by random selection of a
further $(n_i - M_i)$ individuals from a set known to belong to $\pi_i$
(for $i = 1, 2, \ldots, k$).
The problem was to find an optimal value for $N$, the size of the
first sample. It was supposed that the cost of obtaining this sample
is $(a + cN)$, and that the cost of obtaining an individual from $\pi_i$
in the second step is $c_i'$. It is to be expected that $c_i'$ exceeds
$c$, at least for some values of $i$; otherwise the optimum value of $N$
would be zero. Any excess individuals from $\pi_i$ (if $M_i > n_i$) were
supposed to have value $v_i'$ each, though this value might be zero.
Clearly, realistic values of $v_i'$ must be less than $c$, at least for
some $i$; otherwise one could gain by taking $N$ as large as possible.
In Johnson (1957), sampling from an effectively infinite population was
discussed; in Johnson (1963), finite population size was allowed for.
2. The Problem.
In the present paper, we will consider only the case of an
effectively infinite population, with proportions $p_1, p_2, \ldots, p_k$
of its members in $\pi_1, \pi_2, \ldots, \pi_k$ respectively
($p_1 + p_2 + \cdots + p_k = 1$). The essential new feature is that in
step (i) it will not be assumed that determination of the stratum to
which an individual belongs is achieved without error.
This extension is closely related to problems in faulty inspection
sampling which we have studied over the last few years (Johnson & Kotz
(1985)). It also reflects aspects of currently prevailing models of
stratified sampling, which attempt to be more realistic than earlier
ones.
We will denote the probability that an individual, actually
belonging to $\pi_i$, is assigned to $\pi_j$ by $P_{j|i}$. The number of
individuals, in the first sample of size $N$, actually belonging to
$\pi_i$ will be denoted by $Y_i$; the number ascribed to $\pi_i$ will be
denoted by $M_i$ (as above). We have
$$\sum_{i=1}^{k} M_i = \sum_{i=1}^{k} Y_i = N.$$
Assuming that determinations of strata of different individuals are
mutually independent, $Y_i$ is distributed binomially with parameters
$(N, p_i)$; $M_i$ is distributed binomially with parameters $(N, w_i)$,
where
$$w_i = \sum_{j=1}^{k} p_j P_{i|j}.$$
Given $Y_i$, $M_i$ is distributed as the convolution of binomial
distributions with parameters $(Y_i, P_{i|i})$ and $(N - Y_i, w_i')$,
where
$$w_i' = (1-p_i)^{-1} \sum_{j \ne i} p_j P_{i|j} = (1-p_i)^{-1} (w_i - p_i P_{i|i}).$$
It will be assumed, here, that no error occurs in the selection of
individuals in step (ii); that is, when $(n_i - M_i)$ individuals are
chosen 'from $\pi_i$', they really do belong to $\pi_i$. (Allowance for
the possibility of errors in selection can be made in a straightforward
manner, though it leads to greater complexity in the formulae.)
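As a small illustration of the definitions above, the following Python sketch computes $w_i$ and $w_i'$ from the stratum proportions and a misclassification matrix. The function name `assignment_probs`, the list-of-lists layout `P[j][i]` for $P_{j|i}$, and the numerical values are assumptions made purely for illustration; none of them comes from the paper.

```python
def assignment_probs(p, P):
    """Return (w, w_prime) for stratum proportions p and misclassification matrix P.

    P[j][i] is taken to mean P_{j|i}: the probability that an individual
    actually in stratum i is assigned to stratum j (an assumed layout).
    """
    k = len(p)
    # w_i = sum_j p_j * P_{i|j}: overall probability of being assigned to stratum i.
    w = [sum(p[j] * P[i][j] for j in range(k)) for i in range(k)]
    # w_i' = (w_i - p_i * P_{i|i}) / (1 - p_i): probability that an individual
    # NOT in stratum i is nevertheless assigned to stratum i.
    w_prime = [(w[i] - p[i] * P[i][i]) / (1.0 - p[i]) for i in range(k)]
    return w, w_prime


# Hypothetical three-stratum example; each column of P sums to 1.
p = [0.5, 0.3, 0.2]
P = [[0.90, 0.10, 0.05],
     [0.06, 0.85, 0.05],
     [0.04, 0.05, 0.90]]
w, w_prime = assignment_probs(p, P)
print(w, w_prime)
```

Because every individual is assigned to exactly one stratum, the $w_i$ returned here should sum to 1, which gives a quick sanity check on any matrix supplied.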
3. Costs.
In general, $M_i \ne Y_i$, so the final sample may be deficient in
numbers for some strata and in excess for others. We introduce the
following symbols to represent the cost of the sample:
$v_i$ - a 'penalty' for each individual lacking from $\pi_i$;
$v_i'$ - the value (if any) of each individual from $\pi_i$ in excess of
the required number, $n_i$.
So we have, with
$$z^+ = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$
the following cost components:
COMPONENT
Obtaining First Sample
Obtaining Second Sample
Penalty
Value of excess
The expected total cost is
$$
\begin{aligned}
C_N = {} & a + cN + \sum_{i=1}^{k} \big\{ v_i E[n_i - Y_i \mid Y_i < n_i] \Pr[Y_i < n_i] - v_i' E[Y_i - n_i \mid Y_i \ge n_i] \Pr[Y_i \ge n_i] \big\} \\
& + \sum_{i=1}^{k} c_i' E[n_i - M_i \mid M_i < n_i] \Pr[M_i < n_i] \\
& - \sum_{i=1}^{k} \Big[ \big\{ v_i E[n_i - Y_i \mid M_i \le Y_i < n_i] + v_i' E[Y_i - M_i \mid M_i \le Y_i < n_i] \big\} \Pr[M_i \le Y_i < n_i] \\
& \qquad + v_i E[n_i - M_i \mid Y_i < M_i < n_i] \Pr[Y_i < M_i < n_i] + v_i' E[n_i - M_i \mid M_i < n_i \le Y_i] \Pr[M_i < n_i \le Y_i] \Big].
\end{aligned}
\qquad (1)
$$
On the right-hand side of (1), the first line is the expected cost
associated with the first sample of size $N$, allowing for shortfall
penalties and value of excess individuals; the second line is the
expected cost of sampling in step (ii); the third and fourth lines
represent the savings from the expected value of the extra individuals
chosen in step (ii) to make up shortfalls.
Direct minimization of $C_N$ with respect to $N$ is a formidable task,
even with the introduction of some approximations. As in Johnson
(1957), we will approach the problem by considering the change in cost,
$$\Delta C_N = C_{N+1} - C_N,$$
if the size of the first sample is increased from $N$ to $(N+1)$.
The immediate increase in sampling cost is $c$. If the additional
individual is from $\pi_i$ and is classified as belonging to $\pi_j$
(probability $p_i P_{j|i}$), this extra cost is offset by (a) the value
of the extra individual, which is $v_i$ if $Y_i < n_i$, $v_i'$ if
$Y_i \ge n_i$, and (b) if $M_j < n_j$, the saving ($c_j'$) arising from
reduction (by 1) in the number of individuals to be chosen from $\pi_j$
in step (ii), less the value of this individual if it had been chosen
(immediately after the first sample), which would be $v_j$ if
$Y_j < n_j$, $v_j'$ if $Y_j \ge n_j$. Hence
$$
\begin{aligned}
\Delta C_N = c & - \sum_{i=1}^{k} p_i \big\{ v_i \Pr[Y_i < n_i] + v_i' \Pr[Y_i \ge n_i] \big\} \\
& - \sum_{i=1}^{k} p_i \Big[ \sum_{j=1}^{k} P_{j|i} \big\{ c_j' \Pr[M_j < n_j] - v_j \Pr[(M_j < n_j) \cap (Y_j < n_j)] - v_j' \Pr[(M_j < n_j) \cap (Y_j \ge n_j)] \big\} \Big]
\end{aligned}
\qquad (2)
$$
$$
\begin{aligned}
\phantom{\Delta C_N} = c & - \sum_{i=1}^{k} p_i \big\{ v_i' + (v_i - v_i') \Pr[Y_i < n_i] \big\} \\
& - \sum_{i=1}^{k} p_i \Big[ \sum_{j=1}^{k} P_{j|i} \big\{ (c_j' - v_j') \Pr[M_j < n_j] - (v_j - v_j') \Pr[(M_j < n_j) \cap (Y_j < n_j)] \big\} \Big],
\end{aligned}
\qquad (2)'
$$
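with
$$\Pr[Y_i < n_i] = \sum_{y=0}^{n_i - 1} \binom{N}{y} p_i^{y} (1-p_i)^{N-y},$$
$$\Pr[M_i < n_i] = \sum_{m=0}^{n_i - 1} \binom{N}{m} w_i^{m} (1-w_i)^{N-m},$$
$$
\begin{aligned}
\Pr[(M_i < n_i) \cap (Y_i < n_i)] & = \sum_{y=0}^{n_i - 1} \Pr[(M_i < n_i) \cap (Y_i = y)] \\
& = \sum_{y=0}^{n_i - 1} \binom{N}{y} p_i^{y} (1-p_i)^{N-y} \mathop{\sum\sum}_{u+v<n_i} \binom{y}{u} P_{i|i}^{u} (1-P_{i|i})^{y-u} \binom{N-y}{v} (w_i')^{v} (1-w_i')^{N-y-v}.
\end{aligned}
$$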
Under the conditions $c_i' > (v_i, c) > v_i'$ (for all $i$), $\Delta C_N$
increases as $N$ increases, and $\Delta C_0 < 0$. The optimal $N$ is then
the integer part of the solution of the equation $\Delta C_N = 0$.
This solution depends on the values of $c$, $\{c_i'\}$, $\{v_i\}$ and
$\{v_i'\}$ only through the ratios
$\{c - v_i'\} : \{c_i' - v_i'\} : \{v_i - v_i'\}$.
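The search for the optimal $N$ described above can be sketched in Python as follows. This is a minimal illustration of formula (2)' and the three probabilities that accompany it, not the authors' FORTRAN code. The function names, the argument layout (`c2[i]` for $c_i'$, `v2[i]` for $v_i'$, `P[j][i]` for $P_{j|i}$), the convention of returning the smallest $N$ with $\Delta C_N > 0$, and the two-stratum numbers at the end are all assumptions made for the example.

```python
from math import comb


def pmf(m, q, x):
    """Pr[X = x] for X ~ Binomial(m, q)."""
    return comb(m, x) * q**x * (1.0 - q) ** (m - x) if 0 <= x <= m else 0.0


def below(m, q, t):
    """Pr[X < t] for X ~ Binomial(m, q)."""
    return sum(pmf(m, q, x) for x in range(min(t, m + 1)))


def joint_below(N, n_i, p_i, P_ii, w_prime_i):
    """Pr[(M_i < n_i) and (Y_i < n_i)].

    Given Y_i = y, M_i = U + V with U ~ Bin(y, P_ii) and V ~ Bin(N - y, w_i').
    """
    total = 0.0
    for y in range(min(n_i, N + 1)):
        inner = sum(pmf(y, P_ii, u) * below(N - y, w_prime_i, n_i - u)
                    for u in range(y + 1))
        total += pmf(N, p_i, y) * inner
    return total


def delta_cost(N, c, p, P, n, c2, v, v2):
    """Delta C_N following (2)'."""
    k = len(p)
    w = [sum(p[j] * P[i][j] for j in range(k)) for i in range(k)]
    wp = [(w[i] - p[i] * P[i][i]) / (1.0 - p[i]) for i in range(k)]
    pr_y = [below(N, p[i], n[i]) for i in range(k)]
    pr_m = [below(N, w[i], n[i]) for i in range(k)]
    pr_my = [joint_below(N, n[i], p[i], P[i][i], wp[i]) for i in range(k)]
    total = c
    for i in range(k):
        total -= p[i] * (v2[i] + (v[i] - v2[i]) * pr_y[i])
        for j in range(k):
            total -= p[i] * P[j][i] * ((c2[j] - v2[j]) * pr_m[j]
                                       - (v[j] - v2[j]) * pr_my[j])
    return total


def optimal_N(c, p, P, n, c2, v, v2, n_max=2000):
    """Smallest N with Delta C_N > 0 (assumes Delta C_N increases with N)."""
    for N in range(n_max + 1):
        if delta_cost(N, c, p, P, n, c2, v, v2) > 0:
            return N
    raise ValueError("no sign change found up to n_max")


# Hypothetical two-stratum example satisfying c_i' > (v_i, c) > v_i'.
p = [0.6, 0.4]
P = [[0.9, 0.2],
     [0.1, 0.8]]
print(optimal_N(c=1.0, p=p, P=P, n=[10, 10],
                c2=[4.0, 4.0], v=[1.5, 1.5], v2=[0.5, 0.5]))
```

The linear scan from $N = 0$ relies on the monotonicity of $\Delta C_N$ stated above; if the cost conditions fail, the first sign change found need not be the minimum of $C_N$.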
4. Special Cases
We now consider a succession of special cases.
If $v_i = v$, $v_i' = v'$ and $c_i' = c'$ for all $i = 1, 2, \ldots, k$, then
$$
\begin{aligned}
\Delta C_N = c - v' & - (c' - v') \Big[ \sum_{i=1}^{k} p_i \sum_{j=1}^{k} P_{j|i} \Pr[M_j < n_j] \Big] \\
& - (v - v') \Big[ \sum_{i=1}^{k} p_i \Big\{ \Pr[Y_i < n_i] - \sum_{j=1}^{k} P_{j|i} \Pr[(M_j < n_j) \cap (Y_j < n_j)] \Big\} \Big].
\end{aligned}
\qquad (3)
$$
The solution of $\Delta C_N = 0$ now depends only on the ratios
$c - v' : c' - v' : v - v'$.
If we suppose, further, that $P_{i|i} = P$ and
$P_{j|i} = (1-p_i)^{-1} p_j (1-P)$ for all $i$ and all $j \ne i$
($i = 1, 2, \ldots, k$), so that $P$ is the probability of correct
classification, and incorrectly assigned individuals are ascribed to
other strata with probabilities proportional to stratum sizes, then
$$w_i = p_i \Big\{ P + (1-P) \sum_{j \ne i} p_j (1-p_j)^{-1} \Big\} \qquad (4.1)$$
and
$$w_i' = (1-p_i)^{-1} (1-P)\, p_i \sum_{j \ne i} p_j (1-p_j)^{-1}. \qquad (4.2)$$
In the completely symmetric case when, in addition to all the
above-specified conditions, we have $p_i = k^{-1}$ ($i = 1, 2, \ldots, k$)
and $n_1 = n_2 = \cdots = n_k = n$, say, the variables $M_i$ and $Y_i$
each have a binomial distribution with parameters $(N, k^{-1})$. They
are not, of course, independent. Generally, the correlation between
$Y_i$ and $M_j$ is
$$\frac{p_i \big( P_{j|i} - \sum_{h=1}^{k} p_h P_{j|h} \big)}{\sqrt{p_i (1-p_i)\, w_j (1-w_j)}}.$$
(In the completely symmetric case, the correlation between $Y_i$ and
$M_i$ is $(k-1)^{-1}(kP - 1)$.) In this case
$$w_j' = P_{j|i} = (k-1)^{-1} (1-P) \quad \text{for } j \ne i; \qquad (5.1)$$
$$\Pr[Y_i < n] = \Pr[M_i < n] = P(n) = k^{-N} \sum_{y=0}^{n-1} \binom{N}{y} (k-1)^{N-y} \quad \text{for all } i; \qquad (5.2)$$
$$\Pr[(Y_i < n) \cap (M_i < n)] = P^*(n) = k^{-N} \sum_{y=0}^{n-1} \binom{N}{y} \mathop{\sum\sum}_{u+v<n} \binom{y}{u} \binom{N-y}{v} P^{u} (1-P)^{y-u+v} (k-2+P)^{N-y-v} \quad \text{for all } i; \qquad (5.3)$$
and
$$\Delta C_N = c - v' - (c' - v') P(n) - (v - v') \{ P(n) - P^*(n) \}. \qquad (5.4)$$
In this completely symmetric case, $P$ appears only in $P^*(n)$ and so
one might expect that the optimal value of $N$ would not depend much on
$P$, unless $v - v'$ is large, relative to $c - v'$ and $c' - v'$.
Table 1 supports this conjecture to a remarkable extent. Indeed, so
weak is the dependence on $P$ that it would seem reasonable to use the
optimal values corresponding to $P = 1$ (errorless inspection) except,
perhaps, for values of $P$ so small as to be very unlikely.
Of course the minimized value of expected cost will depend
substantially on $P$, even though the optimal value of $N$ does not.
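The completely symmetric case lends itself to a short numerical check. The Python sketch below evaluates $P(n)$, $P^*(n)$ and $\Delta C_N$ directly from (5.2) to (5.4) and scans for the smallest $N$ with $\Delta C_N > 0$. The function names, the arguments `a1`, `a2`, `a3` (standing for $c - v'$, $c' - v'$ and $v - v'$) and the chosen inputs are assumptions for illustration; the authors' own programs were written in FORTRAN and are not reproduced here. The plain floating-point sums are adequate for modest $N$ only; for large $N$ a rescaled form would be preferable.

```python
from math import comb


def P_n(N, k, n):
    """P(n) = Pr[Y_i < n] = Pr[M_i < n], as in (5.2)."""
    return sum(comb(N, y) * (k - 1) ** (N - y)
               for y in range(min(n, N + 1))) / k**N


def P_star(N, k, n, P):
    """P*(n) = Pr[(Y_i < n) and (M_i < n)], as in (5.3)."""
    total = 0.0
    for y in range(min(n, N + 1)):
        cy = comb(N, y)
        for u in range(y + 1):
            cu = cy * comb(y, u) * P**u
            # v runs over 0 <= v <= N - y with u + v < n.
            for v in range(min(n - u, N - y + 1)):
                total += (cu * comb(N - y, v)
                          * (1.0 - P) ** (y - u + v)
                          * (k - 2 + P) ** (N - y - v))
    return total / k**N


def delta_C(N, k, n, P, a1, a2, a3):
    """Delta C_N from (5.4), with a1 = c - v', a2 = c' - v', a3 = v - v'."""
    pn = P_n(N, k, n)
    return a1 - a2 * pn - a3 * (pn - P_star(N, k, n, P))


def optimal_N(k, n, P, a1, a2, a3, n_max=1000):
    """Smallest N with Delta C_N > 0 (assumes Delta C_N increases with N)."""
    for N in range(n_max + 1):
        if delta_C(N, k, n, P, a1, a2, a3) > 0:
            return N
    raise ValueError("no sign change found up to n_max")


# Two strata, quota n = 25 each, ratios 1 : 3 : 1/2, two values of P.
for P in (0.85, 0.70):
    print(P, optimal_N(k=2, n=25, P=P, a1=1.0, a2=3.0, a3=0.5))
```

Running the loop for several values of $P$ with the other inputs held fixed gives a direct way to compare against the entries of Table 1 and to see how little the optimum moves with $P$.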
(Note: Two FORTRAN programs were prepared for calculations of optimal
values of $N$. One is for a personal computer, and the other, a faster
one, is suitable for a mainframe. These programs are available on
request from the third author. Computation time for $n < 20$ is
negligible, but for $n \ge 50$ it is quite substantial. A bivariate
normal approximation to the joint distribution of $Y_i$ and $M_i$ might
be used to evaluate $P^*(n)$, but it should be noted that although the
regressions are linear, variation about the regression line is not
homoscedastic.)
REFERENCES
Johnson, N.L. (1957) Optimal sampling for quota fulfilment,
Biometrika, 44, 518-523.
Johnson, N.L. (1963) Quota fulfilment in finite populations, In
Classical and Contagious Discrete Distributions (G.P. Patil, ed.),
Statistical Publ. Soc., Calcutta, India and Pergamon Press, pp.
419-426.
Johnson, N.L. and Kotz, S. (1985) Some distributions arising as a
consequence of errors in inspection, Naval Res. Logist. Qtly., 32,
35-43.
TABLE 1: OPTIMAL VALUES OF N

Ratios c-v' : c'-v' : v-v'
              1 : 3 : 1/2     1 : 3 : 2       1 : 5 : 1/2     1 : 5 : 2
  k    n  P = 0.85   0.70     0.85   0.70     0.85   0.70     0.85   0.70
  2   25        53     53       54     55       56     56       57     57
  2   50       104    105      106    107      109    109      110    110
  2  100       206    207      209    210      212    213      213    214
  3   17        55     55       56     57       59     60       60     61
  3   33       105    105      107    108      111    111      112    113
  3   67       210    210      213    214      218    218      220    221
  4   13        57     57       59     59       62     63       63     64
  4   25       107    108      109    111      115    115      116    117
  4   50       211    211      214    215      221    221      223    224
  5   10        55     56       57     58       62     62       63     63
  5   20       108    109      111    112      117    117      118    119
  5   40       212    213      216    217      224    224      226    227
  5  100       520    521      525    528      ...    539      542    543
  6    8        54     54       56     57      ...    ...      ...    ...
  6   17       111    111      114    115      ...    ...      ...    ...
  6   33       211    212      215    217      ...    ...      ...    ...
  6   83       520    521      526    529      ...    ...      ...    ...
  7    7        55     55       57    ...      ...    ...      ...    ...
  7   14       108    108      110    112      ...    ...      ...    ...
  7   29       218    218      222    224      ...    ...      ...    ...
  7   71       521    522      527    531      ...    ...      ...    ...
  8    6        54     55       57    ...      ...    ...      ...    ...
  8   13       115    115      118    120      ...    ...      ...    ...
  8   25       215    216      220    222      ...    ...      ...    ...
  8   63       ...    531      537    541      ...    ...      ...    ...

N = min{N : \Delta C_N > 0}
[Entries shown as "..." and the rows for k = 9 (n = 6, 11, 22, 56) and
k = 10 (n = 5, 10, 20, 50) are not legible in the source.]