Biometrika (1979), 66, 1, pp. 125-32
Printed in Great Britain
On sampling and the estimation of rare errors
BY D. R. COX AND E. J. SNELL
Department of Mathematics, Imperial College, London
SUMMARY
For each individual in a finite population a recorded value is available. A small proportion
of the values is in error, all the errors being of the same sign and the maximum fractional
error being known. It is required to obtain an upper confidence limit for the total error in
the population. Various solutions, including a Bayesian one, are discussed for sampling with
probability proportional to the recorded values, and the appropriateness of this form of
sampling is examined.
Some key words: Auditing; Bayesian inference; Finite population; Length biased sampling; Optimal
sampling; Poisson distribution; Probability proportional to size; Sampling; Superpopulation model.
1. INTRODUCTION
The work described in this paper was done in connexion with the auditing of accounts.
Nevertheless we describe the problem in a general setting, partly to emphasize the breadth
of potential applications and partly because it would be out of place to discuss in detail
considerations specific to auditing.
Consider a finite population of $N$ items with recorded values $x_1, \ldots, x_N$, assumed all positive. Suppose that the items are subject to errors $y_1, \ldots, y_N$ and hence to fractional errors $z_1, \ldots, z_N$, where $y_i = x_i z_i$. In more conventional language, $x_i$ is the 'observed' value and $x_i - y_i$ the 'true' value of the $i$th item. Let
$$d_i = \begin{cases} 1 & (z_i \neq 0), \\ 0 & (z_i = 0). \end{cases}$$
Then $T_x = \sum x_i$ is the total recorded value of the population, $T_y = \sum y_i$ is the total error and $T_d = \sum d_i$ is the number of items in error. Note that $T_{xd} = \sum x_i d_i$ is the total recorded value of those items in error.
Suppose that it is required to estimate $T_y$ by sampling. Now, $T_y$ being a population total, the standard theoretical results of sample survey theory are available giving, on the basis of suitable sampling schemes, unbiased estimates of $T_y$ and unbiased estimates of the sampling variance of these estimates. We consider here, however, situations in which most of the items are error-free. Indeed we may have populations and sample sizes such that with appreciable probability the whole sample is error-free. We may still wish to estimate $T_y$ and in particular to give an upper confidence limit for its value. It is clear that estimated sampling variances are not much use for this, and are totally useless when no errors are observed.
Some assumption about the nature of the errors is needed and, with auditing partly in mind, we take initially situations in which $0 < z_i \le 1$. This is relevant primarily when the recorded values are, if in error, too large and when true values and recorded values are non-negative. There is an immediate extension to the case when the errors all have the same sign and $0 \le z_i \le a$, with $a$ known, and the results can, with a little more difficulty, be extended to $-b \le z_i \le a$, with $a, b > 0$ both known; that is, if errors of both signs can occur, bounds on
the fractional error must be known or assumed, or some other corresponding assumption
made.
A fairly typical situation in auditing is where a sample of perhaps $10^2$–$10^3$ items gives either no items in error or perhaps one or a few items in error, with the corresponding fractional errors observed. We shall assume throughout that for the sampled items the $z_i$ are observed without error, i.e. that the checking process is perfect. Of course in applications that would need critical examination if realistic upper limits are to be found for $T_y$ and, under certain circumstances, some effort devoted to independent rechecking would be sensible.
A more general interpretation still of the investigation is that $y_i$ is any further property of the $i$th item, most of the $y_i$ being zero and the constraints on the $z_i$ being satisfied.
2. A SAMPLING SCHEME
The most appropriate form of sampling will depend, among other things, on the relative liability to error of items of various sizes, and on the relative costs of checking items of different sizes; see § 4. Often, however, the items with larger values will be relatively more important and sampling with probability proportional to size, that is $x$, will be sensible.
To implement such sampling imagine the values cumulated, hence defining a series of points along a value-axis at $x_1,\ x_1 + x_2,\ \ldots,\ x_1 + \cdots + x_N = T_x$. A systematic sample of interval $h$ is then taken, where $h$ is determined according to the size of sample required. In auditing this is called monetary unit sampling (Arkin, 1963; Anderson & Teitlebaum, 1973; McRae, 1974, Chapter 17; Smith, 1976).
Obviously all items with value at least h are certain to be selected, indeed possibly more
than once, i.e. they define a stratum with 100% sampling. Throughout the following
discussion, we suppose that any such items have been removed for separate study. Then
the chance of selecting an item is proportional to its x value. Throughout the mathematical argument we treat the sampling procedure as equivalent to random sampling with
probability proportional to size. The approximation involved will be improved by
randomizing the order of the items before sampling; sometimes it may be good to divide
the population into subpopulations for separate sampling, possibly with different values of h.
One of the attractions of the procedure is that it is easily implemented either by hand or
on the computer.
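As an aside not in the original paper, the cumulative scheme is short to code. The following Python sketch is a minimal illustration under the conventions above; the function name, the random starting point and the example figures are ours, and items of value at least $h$ are assumed to have been set aside beforehand as the 100% stratum.

```python
import random

def monetary_unit_sample(values, h):
    """Systematic sampling with interval h along the cumulated values
    x1, x1 + x2, ..., as in § 2.  Items of value at least h are certain
    to be selected (possibly more than once) and should have been
    removed as a 100% stratum beforehand.  Returns the indices of the
    selected items, once per sampling point falling within the item."""
    start = random.uniform(0, h)       # random start in the first interval
    selected, cum, next_point = [], 0.0, start
    for i, x in enumerate(values):
        cum += x
        while next_point < cum:        # every sampling point in this
            selected.append(i)         # item's span selects the item
            next_point += h
    return selected

# Hypothetical recorded values; h chosen for an expected sample of two.
values = [120.0, 15.5, 300.2, 8.0, 45.0, 230.0, 60.3, 90.0]
print(monetary_unit_sample(values, h=sum(values) / 2))
```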
The description of this procedure in the auditing literature is based on the notion of
treating each unit, pound, dollar, etc., in a particular item, as an entity free from or subject
to error. Because the proportion of units subject to error is small, various results based on the
Poisson distribution are suggested. The different units forming an item are, however, not
independent, and when one is sampled they are all sampled. Arguments based on treating
the constituent 'pounds' as independent elements seem to us therefore suspect, although the
conclusions are broadly correct.
3. INFINITE POPULATION THEORY
There are various ways of approaching a theoretical treatment of this problem. The simplest is to treat the finite population as effectively infinite and the values $x_i$ as random variables of density $p(x)$ and mean $\mu_x$. Then
$$T_x/N = \mu_x. \tag{1}$$
Suppose that items of value $x$ have small probability $\rho(x)$ of containing an error; conditional on a nonzero error, let the density of $y$, the magnitude of error, be $q(y \mid x)$. Then the total
error and the total value of items subject to error are given respectively by
$$T_y/N = \iint y\, q(y \mid x)\, \rho(x)\, p(x)\, dx\, dy, \tag{2}$$
$$T_{xd}/N = \int x\, \rho(x)\, p(x)\, dx. \tag{3}$$
Now, because sampling is with probability proportional to size, the density of $x$ in the sample is $x p(x)/\mu_x$ and therefore the chance that an arbitrary item in the sample contains an error has the small value
$$\int x\, \rho(x)\, p(x)\, dx \big/ \mu_x = T_{xd}/(N\mu_x) = T_{xd}/T_x, \tag{4}$$
and therefore the number, $m$, of errors observed in the sample has a Poisson distribution of mean $n T_{xd}/T_x$. Therefore if $\varpi_\alpha(m)$ is the upper $1-\alpha$ confidence limit for the mean of a Poisson distribution, when $m$ is observed, the upper confidence limit for $T_{xd}$ is
$$T_x \varpi_\alpha(m)/n, \tag{5}$$
where $m$ is the number of items in the sample with error, and a natural point estimate is $m T_x/n$.
In particular if the sample is error-free the upper 95% limit for $T_{xd}$ is $3.00\, T_x/n$. This, interpreted as a conservative upper limit for $T_y$, is a central result in monetary unit sampling, as described in the auditing literature, and in particular can be used to fix a suitable $n$, and hence a suitable sampling interval $h = T_x/n$.
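By way of illustration (ours, not the paper's), $\varpi_\alpha(m)$ is available from the standard chi-squared relation $\varpi_\alpha(m) = \frac12 \chi^2_{2m+2;\,1-\alpha}$, so the limit (5) is a few lines of Python with scipy; the numerical inputs below are hypothetical.

```python
from scipy.stats import chi2

def poisson_upper_limit(m, alpha=0.05):
    """Upper 1 - alpha confidence limit varpi_alpha(m) for a Poisson
    mean when m events are observed: half a chi-squared quantile."""
    return 0.5 * chi2.ppf(1 - alpha, 2 * m + 2)

def upper_limit_Txd(Tx, n, m, alpha=0.05):
    """Upper confidence limit (5) for T_xd under sampling
    proportional to size."""
    return Tx * poisson_upper_limit(m, alpha) / n

print(poisson_upper_limit(0))                 # 2.9957..., the 3.00 factor
print(upper_limit_Txd(Tx=1.0e6, n=500, m=0))  # about 5991.5
```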
In fact, however, the population property of interest is typically not $T_{xd}$ but $T_y$ and it follows from (2) and (3) that
$$\mu_z = E_w(Z \mid Z \neq 0) = \iint (y/x)\, x\, \rho(x)\, p(x)\, q(y \mid x)\, dx\, dy \Big/ \int x\, \rho(x)\, p(x)\, dx,$$
so that
$$T_y = T_{xd}\, \mu_z, \tag{6}$$
where $E_w(\cdot)$ denotes expectation under the weighted scheme of sampling used. Formula (6) illustrates the difficulty of the problem when no or very few errors are observed, for it is then impossible or difficult to estimate $\mu_z$. An unbiased point estimate of $T_y$ is
$$m T_x \bar{z}/n, \tag{7}$$
where $\bar{z}$ is the mean of the observed nonzero proportional errors.
From another point of view (7) is just the Horvitz-Thompson estimator for a population total (Cochran, 1963, § 9.14); as already noted, when $m$ is small and in particular zero, a sampling variance attached to (7) is useless for the analysis of the data. Possible ways of obtaining an upper limit for $T_y$ include
(a) the typically very conservative assumption that $\mu_z = 1$, $T_y = T_{xd}$, leading to the use of (5);
(b) the use on the basis of previous experience of an a priori upper bound for $\mu_z$, say $\mu_z^*$, leading to the conservative upper limit $T_x \varpi_\alpha(m)\, \mu_z^*/n$;
(c) the use of an empirical Bayesian argument in which information is incorporated from
previous similar experience, (b) being in a sense an extreme form;
(d) a somewhat empirical modification of (b), such as multiplying (5) by $\mu_z^*$ if $\bar{z} \le \mu_z^*$ and by 1 if $\bar{z} > \mu_z^*$;
(e) application of a method recommended in the auditing literature based on the ordered nonzero proportional errors $z_{(1)} \ge \ldots \ge z_{(m)}$. In this the factor $\varpi_\alpha(m)$ in (5) is replaced by
$$\sum_{r=0}^{m} \{\varpi_\alpha(r) - \varpi_\alpha(r-1)\}\, z_{(r)}, \tag{8}$$
where $z_{(0)} = 1$ and $\varpi_\alpha(-1) = 0$; a computational sketch is given after this list. It was conjectured, by Anderson & Teitlebaum (1973), that substitution of (8) into (5) leads to a conservative $1-\alpha$ upper confidence limit for $T_y$, whatever the distribution of errors. So far as we know, no proof of or counterexample to this is known. Numerical work at Imperial College by Mr D. Dutt not reported here has confirmed this, the limits often being extremely conservative;
(f) the addition of a semiempirical constant to the Horvitz-Thompson estimated variance,
suggested in an unpublished paper by H. L. Jones, the constant being chosen so that
sensible results are obtained even when m = 0, when the Horvitz-Thompson estimated
variance vanishes.
(g) calculation of a conservative upper confidence limit based on a set of multinomial probabilities, corresponding to the probabilities that 0, 1, ..., 100p out of each pound examined are in error (Fienberg, Neter & Leitch, 1977; Neter, Leitch & Fienberg, 1978). The method of calculation is, however, quite complicated unless the number of items in error is very small and gives answers broadly similar to those of (e).
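The computational sketch promised under (e): a direct Python transcription of (8) substituted into (5), with $\varpi_\alpha$ computed from the chi-squared relation as in the earlier sketch; the function names and worked figures are ours.

```python
from scipy.stats import chi2

def varpi(r, alpha=0.05):
    """Poisson upper limit varpi_alpha(r), with varpi_alpha(-1) = 0."""
    return 0.0 if r < 0 else 0.5 * chi2.ppf(1 - alpha, 2 * r + 2)

def factor_8(z, alpha=0.05):
    """The replacement (8) for varpi_alpha(m): the nonzero proportional
    errors z taken in decreasing order, with z_(0) = 1.  For an
    error-free sample (z empty) this reduces to varpi_alpha(0)."""
    zs = [1.0] + sorted(z, reverse=True)
    return sum((varpi(r, alpha) - varpi(r - 1, alpha)) * zs[r]
               for r in range(len(zs)))

def upper_limit_Ty(Tx, n, z, alpha=0.05):
    """Conservative upper limit for T_y: (8) substituted into (5)."""
    return Tx * factor_8(z, alpha) / n

# Two observed fractional errors, 0.10 and 0.40, in a sample of 500:
print(upper_limit_Ty(Tx=1.0e6, n=500, z=[0.10, 0.40]))  # about 7700
```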
In § 5 of the present paper we outline (c); a Bayesian theory is developed under special
assumptions, taking into account the finite size of the population. For practical use,
especially when m is likely to be very small, application of the simple methods (b) and (d)
is, in our opinion, often likely to be sensible.
Note that in the above discussion no special assumptions are made concerning the relation
between the occurrence of error and the value of the item.
4. CHOICE OF SAMPLING SCHEME
We now consider briefly the circumstances under which the sampling procedure of § 2 is likely to be reasonably efficient; the discussion is similar to that used in studies of the efficiency of sampling with probability proportional to size. While in the present context estimated variances are not useful for the analysis of data, variances can be used as a reasonable guide in design.
Let the average cost of checking an item of recorded value $x$ be $c(x)$ and consider for simplicity a family of sampling schemes in which the probability of selecting an item of value $x$ is proportional to $x^k$; such schemes are easily implemented by a modification of the cumulative scheme described in § 2. If $E_0(\cdot)$ denotes expectation with respect to unweighted sampling, the expected cost of a sample of size $n$ is
$$n E_0\{X^k c(X)\}/E_0(X^k). \tag{9}$$
We concentrate for the present discussion on the variance of the usual unbiased estimate of $T_y$, namely
$$\hat{T}_y^{(k)} = N E_0(X^k) \Bigl(\sum y_i/x_i^k\Bigr)\Big/ n. \tag{10}$$
Let
$$\phi(x) = \mathrm{pr}(Y \neq 0 \mid X = x), \quad \mu(x) = E(Y \mid Y \neq 0, X = x), \quad \nu^2(x) = \mathrm{var}(Y \mid Y \neq 0, X = x)$$
describe the error distribution. It is then easy to show that for a given total cost $C$,
$$\mathrm{var}_k(\hat{T}_y) = \frac{N^2}{C}\Bigl[E_0\{X^k c(X)\}\, E_0(Y^2 X^{-k}) - \{E_0(Y)\}^2\, E_0\{X^k c(X)\}/E_0(X^k)\Bigr]. \tag{11}$$
A simple special case that gives a general idea of the behaviour of (11) is obtained with
$$c(x) = c_0 x^{c_1}, \quad \mu(x) = \mu_0 x^{\mu_1}, \quad \nu(x) = \nu_0 x^{\nu_1}, \quad \phi(x) = \phi_0 x^{\phi_1} \tag{12}$$
and to take $\log X$ as normal, with mean $g$ and variance $\tau^2$; note that these are not consistent with the constraints on these quantities for all parameter values. Of course assumptions alternative to (12) may be called for in particular cases, for example $\phi(x) = \phi_0 + \phi_1 x$.
Introduction of (12) gives for the variance (11), on making the further approximation that terms in $\phi_0^2$ are negligible,
$$\frac{N^2 c_0 \phi_0}{C} \exp\{(c_1 + \phi_1) g + \tfrac12 \tau^2 (k + c_1)^2\} \times \bigl[\nu_0^2 \exp(2\nu_1 g) \exp\{\tfrac12 \tau^2 (\phi_1 + 2\nu_1 - k)^2\} + \mu_0^2 \exp(2\mu_1 g) \exp\{\tfrac12 \tau^2 (\phi_1 + 2\mu_1 - k)^2\}\bigr]. \tag{13}$$
In particular the minimizing value of $k$ can be obtained. One special case is when the cost and probability of error are independent of $x$, $c_1 = \phi_1 = 0$, and the magnitude of error is proportional to $x$, $\mu_1 = \nu_1 = 1$. The optimal $k$ minimizes $k^2 + (2-k)^2$ and so is $k = 1$. Use of unweighted sampling, $k = 0$, would increase (13) by a factor $e^{\tau^2} = 1 + \gamma_x^2$, where $\gamma_x$ is the coefficient of variation of $x$ in the population. With the same assumptions except that $c_1 = 1$, the cost being thus proportional to $x$, the function to be minimized is $(k+1)^2 + (2-k)^2$, leading to $k = \frac12$. The values of (13) at $k = 0, \frac12, 1$ are now in the ratio $e^{\tau^2/4} : 1 : e^{\tau^2/4}$. Thus the choice of sampling scheme is not critical unless the relative variation of $x$ is extremely large.
Note finally from (13) that simple discussion of the choice of $k$ is possible whenever $\mu_1 = \nu_1$, that is whenever the coefficient of variation of the magnitude of error is independent of $x$. For then the optimizing value minimizes $(k + c_1)^2 + (\phi_1 + 2\mu_1 - k)^2$ and the optimum is at $2k = \phi_1 + 2\mu_1 - c_1$, showing a range of circumstances under which $k = 1$ is optimal.
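Because the exponent in (13) is quadratic in $k$ when $\mu_1 = \nu_1$, the optimum is available in closed form; a minimal sketch (our own illustration of the algebra above) is:

```python
def optimal_k(c1, phi1, mu1):
    """Minimizer of (k + c1)^2 + (phi1 + 2*mu1 - k)^2, valid when
    mu1 = nu1 so that both terms of (13) share the same exponent."""
    return (phi1 + 2 * mu1 - c1) / 2

# Cost and error probability free of x, error magnitude proportional
# to x: k = 1, i.e. sampling proportional to size.
print(optimal_k(c1=0, phi1=0, mu1=1))   # 1.0
# Cost proportional to x shifts the optimum to k = 1/2.
print(optimal_k(c1=1, phi1=0, mu1=1))   # 0.5
```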
5. PARAMETRIC FORMULATION: BAYESIAN APPROACH
In the discussion of § 3 no parametric assumptions have been made. An alternative
approach is to formulate the problem parametrically, either in infinite population form or
in terms of a finite population drawn from an infinite superpopulation. Further a Bayesian
analysis of the problem is natural whenever a series of populations of broadly similar properties is examined, the prior distribution being estimated from relevant data.
Of course the parametric formulation must depend on the context. Here we investigate
the simplest form. We suppose that the probability $\theta$ that an item is defective is constant, independent of $x$, and small, so that Poisson approximations can be used. Given that an item is in error, let the proportional error have density $g(z; \mu)$, where $\mu$ is an unknown parameter, which can be taken when scalar to be the mean. The errors attaching to different items are assumed independent and identically distributed. We thus observe $m$, which has a Poisson distribution of mean $n\theta$, and values $z_1, \ldots, z_m$ in a random sample drawn from $g(z; \mu)$. In the infinite model formulation it is required to estimate $\theta\mu$, whereas in the finite population version it is required in effect to estimate the total error in the nonsampled part
of the population. This is the random variable $T^{(u)}$ with moment generating function
$$E\{\exp(-s T^{(u)})\} = \prod_{i=1}^{R} \{1 - \theta + \theta\, g^*(s x_i; \mu)\}, \tag{14}$$
where $s$ is the argument of the moment generating function, $g^*(\cdot\,; \mu)$ is the moment generating function corresponding to the density $g(\cdot\,; \mu)$, and $R = N - n$ is the number of individuals in the population that are not sampled. Note that in the auditing problem errors detected will certainly be corrected, so that the population property of interest is precisely $T^{(u)}$.
The mean and variance from (14) are respectively
$$\theta\mu\, T_x^{(u)}, \qquad \{\theta\sigma^2 + \theta(1-\theta)\mu^2\}\, \{T_x^{(u)}\}^2 (1 + C_x^2)/R, \tag{15}$$
where $T_x^{(u)}$ is the total recorded value of the unsampled items, $C_x$ is the coefficient of variation of $x$ in the unsampled units and $\sigma^2$ is the variance of the density $g(\cdot\,; \mu)$.
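As a small illustration (ours, not the paper's), the moments (15) are immediate to code; the sketch assumes the notation above, with the exponential case $\sigma^2 = \mu^2$ supplying the worked figures.

```python
def T_u_moments(theta, mu, sigma2, Tx_u, R, Cx):
    """Mean and variance (15) of the unsampled total error T^(u):
    theta the error probability, mu and sigma2 the mean and variance
    of the proportional-error density g, Tx_u the unsampled recorded
    total, R the number of unsampled items, Cx their coefficient of
    variation of x."""
    mean = theta * mu * Tx_u
    var = ((theta * sigma2 + theta * (1 - theta) * mu ** 2)
           * Tx_u ** 2 * (1 + Cx ** 2) / R)
    return mean, var

# Exponential errors of mean 0.2 (sigma2 = 0.04) at a 1% error rate:
print(T_u_moments(theta=0.01, mu=0.2, sigma2=0.04,
                  Tx_u=9.0e5, R=9500, Cx=1.5))
```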
For the remainder of the discussion we suppose that the density $g(z; \mu)$ is exponential with mean $\mu$. This may be reasonable under some circumstances. In others, as in the auditing problem, it is best regarded as a conservative approximation to a truncated exponential distribution over $(0, 1)$. The theory for this, and for the truncated exponential distribution over $(0, 1)$ plus an atom of probability at $z = 1$, can be developed, but is appreciably more complicated. Under the exponential model the minimal sufficient statistic is $(m, \bar{z})$, where $\bar{z} = (z_1 + \ldots + z_m)/m$; of course $\bar{z}$ is not defined if $m = 0$. As noted previously, the occurrence of $m = 0$ with nonnegligible probability makes an 'exact' confidence interval solution of even the infinite model problem impossible.
For a Bayesian analysis, in the absence of specific information about a particular application, we investigate the simplest conjugate prior, giving $\theta$ and $\mu$ independent prior densities of the form
$$\frac{(a/\theta_0)^a}{\Gamma(a)}\, \theta^{a-1} \exp(-a\theta/\theta_0), \qquad \frac{\{(b-1)\mu_0\}^b}{\Gamma(b)}\, \mu^{-(b+1)} \exp\{-(b-1)\mu_0/\mu\}, \tag{16}$$
so that $\theta_0$ and $\mu_0$ are the prior means. Combining with the likelihood gives the posterior form
$$\theta\mu \sim \frac{(m+a)\{m\bar{z} + (b-1)\mu_0\}}{(m+b)(n + a/\theta_0)}\, F_{2(m+a),\, 2(m+b)}, \tag{17}$$
where $F_{d_1, d_2}$ is a random variable with the standard $F$ distribution with $(d_1, d_2)$ degrees of freedom.
Now it is in principle desirable to use a homogeneous set of populations in defining the prior distributions, but this may often not be possible. Then, from the interpretation of $1/\sqrt{a}$ and $1/\sqrt{b}$ as coefficients of variation of $\theta$ and $1/\mu$, it seems reasonable to consider $a$ in the range say $\frac12$ to 2 and $b$ to be at least 3, the condition $b > 1$ being necessary to ensure that the prior mean is finite; if the exponential distribution for $z$ is to be used to approximate to a truncated exponential distribution, then some combination such as $\mu_0 = \frac12$, $b = 3$ will be conservative. Finally $\theta_0$ will typically be very small, with $a/\theta_0$ nevertheless small compared with $n$, corresponding to the neglect of the exponential factor in the prior for $\theta$. It would be quite wrong to suggest a Bayesian solution for universal use regardless of context, but the above choice, say $a = 1$, $b = 3$, $\mu_0 = \frac12$, leads to
$$\theta\mu \sim \frac{(m+1)(m\bar{z} + 1)}{(m+3)\, n}\, F_{2(m+1),\, 2(m+3)} \tag{18}$$
and hence to an upper confidence limit for the population total error remaining, which is $T^{(u)}$, on multiplying by $T_x^{(u)}$ or, to the order of approximation involved, in the infinite population version, by $T_x$.
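A sketch of this calculation, assuming the posterior form (17) as reconstructed above and neglecting $a/\theta_0$ in comparison with $n$; the Python function and the numerical inputs are ours, using scipy's $F$ quantiles.

```python
from scipy.stats import f

def bayes_upper_limit(n, m, zbar, a=1.0, b=3.0, mu0=0.5, alpha=0.05):
    """Upper 1 - alpha posterior limit for theta*mu from (17),
    neglecting a/theta_0 in comparison with n.
    Taking a = b = 1 recovers the vague-prior form (20) below."""
    scale = (m + a) * (m * zbar + (b - 1) * mu0) / ((m + b) * n)
    return scale * f.ppf(1 - alpha, 2 * (m + a), 2 * (m + b))

# Two errors with mean fractional error 0.25 in a sample of 500;
# multiply by T_x (or T_x^(u)) for the total-error limit.
Tx = 1.0e6
print(Tx * bayes_upper_limit(n=500, m=2, zbar=0.25))
```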
Two other possibilities merit brief consideration. The first is that $\mu$ is assumed known and equal to $\mu_0$; for example, it might be reasonable to assume that fractional errors are equally likely to be anywhere between 0 and 1, when $\mu_0 = \frac12$. Then we let $b \to \infty$ and
$$\theta\mu = \mu_0 (m + a)\, F_{2(m+a),\, \infty}/n, \tag{19}$$
to be compared with the result from § 3 that the upper confidence limit is $\mu_0\, \varpi_\alpha(m)/n$; it is known that especially when $a = \frac12$ the agreement is close.
The other special possibility is to try for effective approximate confidence limits from (17) for $m \neq 0$ by passing to a prior corresponding to initial very vague knowledge, for which one reasonable choice is $a = 1$, $b = 1$. This leads for $m \neq 0$ to
$$\theta\mu = m\bar{z}\, F_{2m+2,\, 2m+2}/n. \tag{20}$$
The above discussion is for the infinite population version of the problem, where the quantity of interest is essentially $\theta\mu$. For the finite population version of the problem, we use the moment generating function (14), or alternatively the mean and variance derived from it, plus the posterior density (17), to show that, conditionally on $m$ and $\bar{z}$, the mean of $T^{(u)}$ is that derived from (17), whereas the variance is that from (17) multiplied by
$$1 + \frac{2n(m + b - 1)(1 + C_x^2)}{R(2m + a + b - 1)}, \tag{21}$$
and the simplest, if rather crude, way of allowing for this is to apply (17) with the factor $m + a$ and the degrees of freedom $2(m + a)$ both divided by (21).
It is natural to assess (20) by examining whether formal upper $1-\alpha$ limits for $\theta\mu$ calculated from the formula have approximately the desired confidence interval property. There is a difficulty of principle in that if $m = 0$, that is if there are no observed errors, (20) is inapplicable; we cannot estimate $\theta\mu$ if there is no information about $\mu$. The most satisfactory approach seems to be the following. Inference about $\theta$, i.e. in effect $T_{xd}$, can be based on the Poisson distribution as in (5) and the required confidence interval property is assured. At least in principle we imagine limits computed for any $m$ that arises. Inference about $\theta\mu$ can be attempted only when $m \neq 0$ and hence the question for consideration is whether an upper limit calculated from (20) has approximately the correct frequency property over a set of repetitions in which $m$ and $\bar{z}$ vary, but taken conditionally on $m \neq 0$.
Some information may be gleaned from extreme cases. If all proportional errors, $z$, are the same, the limits (20) are too large by a factor very close to 1.2. If the numbers of nonzero errors are large, so that asymptotic theory is applicable, the limits are conservative, or not, according as the standard deviation of the distribution $g(z; \mu)$ is less than or exceeds the mean. Of the situations that might arise in an auditing context, the least favourable for the use of (20) is a mixture of a fairly high probability, $p$, of a very small value of $z = z_0$, say, with a complementary probability $1-p$ of unit $z$; this is broadly comparable with the asymptotic situation with a high coefficient of variation.
We are grateful to Mr J. Connolly, Mann Judd & Co., for presenting the problem which
prompted this work.
REFERENCES

ANDERSON, R. & TEITLEBAUM, A. D. (1973). Dollar-unit sampling. Can. Chartered Accountant 102, No. 4, 30-9.
ARKIN, H. (1963). Handbook of Sampling for Auditing and Accounting. New York: McGraw-Hill.
COCHRAN, W. G. (1963). Sampling Techniques, 2nd edition. New York: Wiley.
FIENBERG, S. E., NETER, J. & LEITCH, R. A. (1977). Estimating the total overstatement error in accounting populations. J. Am. Statist. Assoc. 72, 295-302.
McRAE, T. W. (1974). Statistical Sampling for Audit and Control. New York: Wiley.
NETER, J., LEITCH, R. A. & FIENBERG, S. E. (1978). Dollar unit sampling: multinomial bounds for total overstatement and understatement errors. Accounting Rev. 53, 77-93.
SMITH, T. M. F. (1976). Statistical Sampling for Accountants. London: Accountancy Age Books.
[Received January 1978. Revised August 1978]