Biometrika (1979), 66, 1, pp. 125-32
Printed in Great Britain

On sampling and the estimation of rare errors

BY D. R. COX AND E. J. SNELL
Department of Mathematics, Imperial College, London

SUMMARY

For each individual in a finite population a recorded value is available. A small proportion of the values is in error, all the errors being of the same sign and the maximum fractional error being known. It is required to obtain an upper confidence limit for the total error in the population. Various solutions, including a Bayesian one, are discussed for sampling with probability proportional to the recorded values, and the appropriateness of this form of sampling is examined.

Some key words: Auditing; Bayesian inference; Finite population; Length biased sampling; Optimal sampling; Poisson distribution; Probability proportional to size; Sampling; Superpopulation model.

1. INTRODUCTION

The work described in this paper was done in connexion with the auditing of accounts. Nevertheless we describe the problem in a general setting, partly to emphasize the breadth of potential applications and partly because it would be out of place to discuss in detail considerations specific to auditing.

Consider a finite population of $N$ items with recorded values $x_1, \ldots, x_N$, assumed all positive. Suppose that the items are subject to errors $y_1, \ldots, y_N$ and hence to fractional errors $z_1, \ldots, z_N$, where $y_i = x_i z_i$. In more conventional language, $x_i$ is the 'observed' value and $x_i - y_i$ the 'true' value of the $i$th item. Let

$$ d_i = \begin{cases} 1 & (z_i \neq 0), \\ 0 & (z_i = 0). \end{cases} $$

Then $T_x = \sum x_i$ is the total recorded value of the population, $T_y = \sum y_i$ is the total error and $T_d = \sum d_i$ is the number of items in error. Note that $T_{xd} = \sum x_i d_i$ is the total recorded value of those items in error.

Suppose that it is required to estimate $T_y$ by sampling. Now, $T_y$ being a population total, the standard theoretical results of sample survey theory are available, giving, on the basis of suitable sampling schemes, unbiased estimates of $T_y$ and unbiased estimates of the sampling variance of these estimates. We consider here, however, situations in which most of the items are error-free. Indeed we may have populations and sample sizes such that with appreciable probability the whole sample is error-free. We may still wish to estimate $T_y$ and in particular to give an upper confidence limit for its value. It is clear that estimated sampling variances are not much use for this, and are totally useless when no errors are observed.
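To fix ideas, here is a minimal simulation of this setup. It is our illustration, not part of the paper; the population size, error rate and lognormal value distribution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.lognormal(mean=4.0, sigma=1.0, size=N)   # recorded values, all positive
d = rng.random(N) < 0.005                        # about 0.5% of items in error
z = np.where(d, rng.uniform(0.0, 1.0, N), 0.0)   # fractional errors, 0 < z <= 1
y = x * z                                        # errors y_i = x_i z_i

T_x, T_y = x.sum(), y.sum()       # total recorded value; total error
T_d, T_xd = d.sum(), x[d].sum()   # number of items in error; their recorded value

# With T_d errors among N items, a simple random sample of n = 200 items
# is entirely error-free with appreciable probability:
n = 200
print((1 - T_d / N) ** n)         # roughly exp(-n T_d / N)
```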
Some assumption about the nature of the errors is needed and, with auditing partly in mind, we take initially situations in which $0 < z_i \leq 1$. This is relevant primarily when the recorded values are, if in error, too large and when true values and recorded values are non-negative. There is an immediate extension to the case when the errors all have the same sign and $0 < z_i \leq a$, with $a$ known, and the results can, with a little more difficulty, be extended to $-b \leq z_i \leq a$, with $a, b > 0$ both known; that is, if errors of both signs can occur, bounds on the fractional error must be known or assumed, or some other corresponding assumption made.

A fairly typical situation in auditing is where a sample of perhaps $10^2$ to $10^3$ items gives either no items in error or perhaps one or a few items in error, with the corresponding fractional errors observed. We shall assume throughout that for the sampled items the $z_i$ are observed without error, i.e. that the checking process is perfect. Of course in applications that would need critical examination if realistic upper limits are to be found for $T_y$ and, under some circumstances, some effort devoted to independent rechecking would be sensible.

A more general interpretation still of the investigation is that $y_i$ is any further property of the $i$th item, most of the $y_i$ being zero and the constraints on the $z_i$ being satisfied.

2. A SAMPLING SCHEME

The most appropriate form of sampling will depend, among other things, on the relative liability to error of items of various sizes, and on the relative costs of checking items of different sizes; see §4. Often, however, the items with larger values will be relatively more important and sampling with probability proportional to size, that is to $x$, will be sensible. To implement such sampling, imagine the values cumulated, hence defining a series of points along a value-axis at $x_1,\; x_1 + x_2,\; \ldots,\; x_1 + \cdots + x_N = T_x$. A systematic sample of spacing $h$ is then taken, where $h$ is determined according to the size of sample required. In auditing this is called monetary unit sampling (Arkin, 1963; Anderson & Teitlebaum, 1973; McRae, 1974, Chapter 17; Smith, 1976). Obviously all items with value at least $h$ are certain to be selected, indeed possibly more than once, i.e. they define a stratum with 100% sampling. Throughout the following discussion, we suppose that any such items have been removed for separate study. Then the chance of selecting an item is proportional to its $x$ value.

Throughout the mathematical argument we treat the sampling procedure as equivalent to random sampling with probability proportional to size. The approximation involved will be improved by randomizing the order of the items before sampling; sometimes it may be good to divide the population into subpopulations for separate sampling, possibly with different values of $h$. One of the attractions of the procedure is that it is easily implemented either by hand or on the computer, as the sketch at the end of this section illustrates.

The description of this procedure in the auditing literature is based on the notion of treating each unit, pound, dollar, etc., in a particular item as an entity free from or subject to error. Because the proportion of units subject to error is small, various results based on the Poisson distribution are suggested. The different units forming an item are, however, not independent, and when one is sampled they are all sampled. Arguments based on treating the constituent 'pounds' as independent elements seem to us therefore suspect, although the conclusions are broadly correct.
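The following Python sketch shows one way the cumulative scheme might be programmed; it is our illustration, not the authors' code, and the function name, the use of numpy and the uniform random start are our own conventions. Items with $x_i \geq h$ are assumed already removed for 100% checking, as above, so no item can be selected twice.

```python
import numpy as np

def monetary_unit_sample(x, n, rng=None):
    """Systematic PPS ('monetary unit') sample of n items.

    x : array of positive recorded values x_1, ..., x_N, all below the
        sampling interval h = T_x / n (larger items checked separately).
    Returns the indices of the selected items.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Randomize item order first, as the paper suggests, to bring the
    # systematic scheme closer to random PPS sampling.
    order = rng.permutation(len(x))
    cum = np.cumsum(x[order])           # points x_1, x_1 + x_2, ..., T_x
    h = cum[-1] / n                     # sampling interval
    start = rng.uniform(0, h)           # random start in [0, h)
    points = start + h * np.arange(n)   # systematic selection points
    # Select the item whose cumulative interval contains each point;
    # since every x_i < h, each item is hit at most once.
    return order[np.searchsorted(cum, points)]
```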
3. INFINITE POPULATION THEORY

There are various ways of approaching a theoretical treatment of this problem. The simplest is to treat the finite population as effectively infinite and the values $x_i$ as random variables of density $p(x)$ and mean $\mu_x$. Then

$$ T_x/N = \mu_x. \qquad (1) $$

Suppose that items of value $x$ have small probability $\phi(x)$ of containing an error; conditionally on a nonzero error, let the density of $y$, the magnitude of error, be $q(y \mid x)$. Then the total error and the total value of items subject to error are given respectively by

$$ T_y/N = \int\!\!\int y\, q(y \mid x)\, \phi(x)\, p(x)\, dx\, dy, \qquad (2) $$

$$ T_{xd}/N = \int x\, \phi(x)\, p(x)\, dx. \qquad (3) $$

Now, because sampling is with probability proportional to size, the density of $x$ in the sample is $x p(x)/\mu_x$ and therefore the chance that an arbitrary item in the sample contains an error has the small value

$$ \int x\, \phi(x)\, p(x)\, dx \big/ \mu_x = T_{xd}/(N\mu_x) = T_{xd}/T_x, \qquad (4) $$

and therefore the number, $m$, of errors observed in a sample of size $n$ has, to a close approximation, a Poisson distribution of mean $n T_{xd}/T_x$. Therefore if $w_\alpha(m)$ is the upper $1-\alpha$ confidence limit for the mean of a Poisson distribution, when $m$ is observed, the upper confidence limit for $T_{xd}$ is

$$ T_x w_\alpha(m)/n, \qquad (5) $$

where $m$ is the number of items in the sample with error, and a natural point estimate is $m T_x/n$. In particular if the sample is error-free the upper 95% limit for $T_{xd}$ is $3{\cdot}00\, T_x/n$. This, interpreted as a conservative upper limit for $T_y$, is a central result in monetary unit sampling, as described in the auditing literature, and in particular can be used to fix a suitable $n$, and hence a suitable sampling interval $h \simeq T_x/n$.

In fact, however, the population property of interest is typically not $T_{xd}$ but $T_y$, and it follows from (2) and (3) that

$$ \mu_z = E_w(Z \mid Z \neq 0) = \int\!\!\int y\, \phi(x)\, p(x)\, q(y \mid x)\, dx\, dy \Big/ \int x\, \phi(x)\, p(x)\, dx, $$

so that

$$ T_y = T_{xd}\, \mu_z, \qquad (6) $$

where $E_w(\cdot)$ denotes expectation under the weighted scheme of sampling used. Formula (6) illustrates the difficulty of the problem when no or very few errors are observed, for it is then impossible or difficult to estimate $\mu_z$. An unbiased point estimate of $T_y$ is

$$ m T_x \bar{z}/n, \qquad (7) $$

where $\bar{z}$ is the mean of the observed nonzero fractional errors. From another point of view (7) is just the Horvitz-Thompson estimator for a population total (Cochran, 1963, §9.14); as already noted, when $m$ is small and in particular zero, a sampling variance attached to (7) is useless for the analysis of the data.
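The quantities in (5) and (7) are easy to compute using the standard chi-squared representation of the Poisson upper limit, $w_\alpha(m) = \tfrac{1}{2}\chi^2_{2m+2;\,1-\alpha}$. The sketch below is our own, assuming scipy is available; for $m = 0$ and $\alpha = 0{\cdot}05$ it reproduces the factor $3{\cdot}00$ quoted above.

```python
from scipy.stats import chi2

def poisson_upper_limit(m, alpha=0.05):
    """Upper 1-alpha confidence limit w_alpha(m) for a Poisson mean when m
    events are observed; w_0.05(0) = -log(0.05) = 3.00."""
    return 0.5 * chi2.ppf(1 - alpha, 2 * m + 2)

def bounds(T_x, n, z_obs, alpha=0.05):
    """Point estimate (7) for T_y and conservative upper limit (5) for T_xd,
    given the nonzero fractional errors z_obs found in a sample of size n."""
    m = len(z_obs)
    zbar = sum(z_obs) / m if m else 0.0
    upper_Txd = T_x * poisson_upper_limit(m, alpha) / n   # formula (5)
    point_Ty = m * T_x * zbar / n                         # formula (7)
    return point_Ty, upper_Txd
```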
Possible ways of obtaining an upper limit for $T_y$ include:

(a) the typically very conservative assumption that $\mu_z = 1$, $T_y = T_{xd}$, leading to the use of (5);

(b) the use, on the basis of previous experience, of an a priori upper bound for $\mu_z$, say $\mu_z^*$, leading to the conservative upper limit $T_x w_\alpha(m)\, \mu_z^*/n$;

(c) the use of an empirical Bayesian argument in which information is incorporated from previous similar experience, (b) being in a sense an extreme form;

(d) somewhat empirical modification of (b), such as multiplication of $T_x w_\alpha(m)/n$ by $\mu_z^*$ if $\bar z \leq \mu_z^*$ and by 1 if $\bar z > \mu_z^*$;

(e) application of a method recommended in the auditing literature based on the ordered nonzero proportional errors $z_{(1)} \geq \cdots \geq z_{(m)}$. In this the factor $w_\alpha(m)$ in (5) is replaced by

$$ \sum_{r=0}^{m} \{w_\alpha(r) - w_\alpha(r-1)\}\, z_{(r)}, \qquad (8) $$

where $z_{(0)} = 1$ and $w_\alpha(-1) = 0$; a sketch of this bound is given at the end of this section. It was conjectured by Anderson & Teitlebaum (1973) that substitution of (8) into (5) leads to a conservative $1-\alpha$ upper confidence limit for $T_y$, whatever the distribution of errors. So far as we know, no proof of or counterexample to this is known. Numerical work at Imperial College by Mr D. Dutt, not reported here, has confirmed the conservatism of (8), the limits often being extremely conservative;

(f) the addition of a semiempirical constant to the Horvitz-Thompson estimated variance, suggested in an unpublished paper by H. L. Jones, the constant being chosen so that sensible results are obtained even when $m = 0$, when the Horvitz-Thompson estimated variance vanishes;

(g) calculation of a conservative upper confidence limit based on a set of multinomial probabilities, corresponding to the probabilities that $0, 1, \ldots, 100$ pence out of each pound examined are in error (Fienberg, Neter & Leitch, 1977; Neter, Leitch & Fienberg, 1978). The method of calculation is, however, quite complicated unless the number of items in error is very small, and gives answers broadly similar to those of (e).

In §5 of the present paper we outline (c); a Bayesian theory is developed under special assumptions, taking into account the finite size of the population. For practical use, especially when $m$ is likely to be very small, application of the simple methods (b) and (d) is, in our opinion, often likely to be sensible.

Note that in the above discussion no special assumptions are made concerning the relation between the occurrence of error and the value of the item.
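As a companion to the earlier sketch, the replacement factor (8) of method (e) might be computed as follows; this is again our own illustrative code, reusing the chi-squared representation of $w_\alpha(m)$.

```python
from scipy.stats import chi2

def poisson_upper_limit(m, alpha=0.05):
    # Upper 1-alpha limit for a Poisson mean, as in the previous sketch.
    return 0.5 * chi2.ppf(1 - alpha, 2 * m + 2)

def ordered_errors_factor(z_obs, alpha=0.05):
    """Replacement factor (8) for w_alpha(m) in (5), method (e):
    sum over r of {w_alpha(r) - w_alpha(r-1)} z_(r), with the conventions
    z_(0) = 1, w_alpha(-1) = 0 and z_(1) >= ... >= z_(m)."""
    z_ordered = [1.0] + sorted(z_obs, reverse=True)   # z_(0), z_(1), ..., z_(m)
    total, w_prev = 0.0, 0.0                          # w_alpha(-1) = 0
    for r, z_r in enumerate(z_ordered):
        w_r = poisson_upper_limit(r, alpha)
        total += (w_r - w_prev) * z_r
        w_prev = w_r
    return total

# Conjectured conservative upper limit for T_y:
#   T_x * ordered_errors_factor(z_obs, alpha) / n
# For m = 0 this reduces to the limit (5), i.e. 3.00 * T_x / n at 95%.
```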
4. CHOICE OF SAMPLING SCHEME

We now consider briefly the circumstances under which the sampling procedure of §2 is likely to be reasonably efficient; the discussion is similar to that used in studies of the efficiency of sampling with probability proportional to size. While in the present context estimated variances are not useful for the analysis of data, variances can be used as a reasonable guide in design.

Let the average cost of checking an item of recorded value $x$ be $c(x)$ and consider for simplicity a family of sampling schemes in which the probability of selecting an item of value $x$ is proportional to $x^k$; such schemes are easily implemented by a modification of the cumulative scheme described in §2. If $E_0(\cdot)$ denotes expectation with respect to unweighted sampling, the expected cost of a sample of size $n$ is

$$ n E_0\{X^k c(X)\}/E_0(X^k). \qquad (9) $$

We concentrate for the present discussion on the variance of the usual unbiased estimate of $T_y$, namely

$$ \hat T_y^{(k)} = N E_0(X^k) \sum (y_i/x_i^k)/n, \qquad (10) $$

where the summation is over the sampled items. Let

$$ \phi(x) = \mathrm{pr}(Y \neq 0 \mid X = x), \quad \rho(x) = E(Y \mid Y \neq 0, X = x), \quad v^2(x) = \mathrm{var}(Y \mid Y \neq 0, X = x) $$

describe the error distribution. It is then easy to show that for a given total cost $C$,

$$ \mathrm{var}_k(\hat T_y) \simeq \frac{N^2}{C}\left[ E_0\{X^k c(X)\}\, E_0(Y^2 X^{-k}) - \{E_0(Y)\}^2\, \frac{E_0\{X^k c(X)\}}{E_0(X^k)} \right]. \qquad (11) $$

A simple special case that gives a general idea of the behaviour of (11) is obtained by taking

$$ c(x) = c_0 x^{c_1}, \quad \rho(x) = \rho_0 x^{\rho_1}, \quad v(x) = v_0 x^{v_1}, \quad \phi(x) = \phi_0 x^{\phi_1}, \qquad (12) $$

and taking $\log X$ as normal, with mean $\xi$ and variance $\tau^2$; note that these are not consistent with the constraints on these quantities for all parameter values. Of course assumptions alternative to (12) may be called for in particular cases, for example $\phi(x) = \phi_0 + \phi_1 x$.

Introduction of (12) gives for the variance (11), on making the further approximation that terms in $\phi_0^2$ are negligible,

$$ \frac{N^2 c_0 \phi_0}{C} \exp\{(c_1 + \phi_1)\xi + \tfrac{1}{2}\tau^2(k + c_1)^2\} \left[ v_0^2 \exp(2 v_1 \xi) \exp\{\tfrac{1}{2}\tau^2(\phi_1 + 2 v_1 - k)^2\} + \rho_0^2 \exp(2 \rho_1 \xi) \exp\{\tfrac{1}{2}\tau^2(\phi_1 + 2 \rho_1 - k)^2\} \right]. \qquad (13) $$

In particular the minimizing value of $k$ can be obtained. One special case is when the cost and probability of error are independent of $x$, $c_1 = \phi_1 = 0$, and the magnitude of error is proportional to $x$, $\rho_1 = v_1 = 1$. The optimal $k$ minimizes $k^2 + (2-k)^2$ and so is $k = 1$. Use of unweighted sampling, $k = 0$, would increase (13) by a factor $e^{\tau^2} = 1 + \gamma_x^2$, where $\gamma_x$ is the coefficient of variation of $x$ in the population. With the same assumptions except that $c_1 = 1$, the cost being thus proportional to $x$, the function to be minimized is $(k+1)^2 + (2-k)^2$, leading to $k = \tfrac{1}{2}$. The values of (13) at $k = 0, \tfrac{1}{2}, 1$ are now in the ratio $e^{\tau^2/4} : 1 : e^{\tau^2/4}$. Thus the choice of sampling scheme is not critical unless the relative variation of $x$ is extremely large.

Note finally from (13) that simple discussion of the choice of $k$ is possible whenever $\rho_1 = v_1$, that is whenever the coefficient of variation of the magnitude of error is independent of $x$. For then the optimizing value minimizes $(k + c_1)^2 + (\phi_1 + 2\rho_1 - k)^2$ and the optimum is at $2k = \phi_1 + 2\rho_1 - c_1$, showing a range of circumstances under which $k = 1$ is optimal.
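When $\rho_1 = v_1$, the optimal $k$ can equally be found numerically by minimizing the $k$-dependent exponent of (13); a small sketch of this, with hypothetical parameter values of our own choosing, is given below. It recovers $k = \tfrac{1}{2}$ for cost proportional to $x$.

```python
import numpy as np

def log_variance_exponent(k, c1, phi1, rho1, tau2):
    """The k-dependent part of log(13) when rho1 = v1, up to constants:
    0.5 * tau^2 * {(k + c1)^2 + (phi1 + 2*rho1 - k)^2}."""
    return 0.5 * tau2 * ((k + c1) ** 2 + (phi1 + 2 * rho1 - k) ** 2)

# Illustrative (hypothetical) values: cost proportional to x, error
# magnitude proportional to x, error probability independent of x.
c1, phi1, rho1, tau2 = 1.0, 0.0, 1.0, 1.0
ks = np.linspace(0.0, 2.0, 201)
k_opt = ks[np.argmin(log_variance_exponent(ks, c1, phi1, rho1, tau2))]
print(k_opt)   # ~0.5, agreeing with 2k = phi1 + 2*rho1 - c1
```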
5. PARAMETRIC FORMULATION: BAYESIAN APPROACH

In the discussion of §3 no parametric assumptions have been made. An alternative approach is to formulate the problem parametrically, either in infinite population form or in terms of a finite population drawn from an infinite superpopulation. Further, a Bayesian analysis of the problem is natural whenever a series of populations of broadly similar properties is examined, the prior distribution being estimated from relevant data.

Of course the parametric formulation must depend on the context. Here we investigate the simplest form. We suppose that the probability $\theta$ that an item is defective is constant, independent of $x$, and small, so that Poisson approximations can be used. Given that an item is in error, let the proportional error have density $g(z; \mu)$, where $\mu$ is an unknown parameter, which can be taken when scalar to be the mean. The errors attaching to different items are assumed independent and identically distributed. We thus observe $m$, which has a Poisson distribution of mean $n\theta$, and values $z_1, \ldots, z_m$ in a random sample drawn from $g(z; \mu)$.

In the infinite model formulation it is required to estimate $\psi = \theta\mu$, whereas in the finite population version it is required in effect to estimate the total error in the nonsampled part of the population. This is the random variable $T^{(u)}$ with moment generating function

$$ E\{\exp(-s T^{(u)})\} = \prod_{i=1}^{R} \{1 - \theta + \theta g^*(s x_i; \mu)\}, \qquad (14) $$

where $s$ is the argument of the moment generating function, $g^*(\cdot\,; \mu)$ is the moment generating function corresponding to the density $g(\cdot\,; \mu)$, and $R = N - n$ is the number of individuals in the population that are not sampled. Note that in the auditing problem errors detected will certainly be corrected, so that the population property of interest is precisely $T^{(u)}$. The mean and variance from (14) are respectively

$$ \theta\mu\, T_x^{(u)}, \qquad \{\theta\sigma^2 + \theta(1-\theta)\mu^2\}\, \{T_x^{(u)}\}^2 (1 + c_x^2)/R, \qquad (15) $$

where $T_x^{(u)}$ is the total recorded value of the unsampled units, $c_x$ is the coefficient of variation of $x$ in the unsampled units and $\sigma^2$ the variance of the density $g(\cdot\,; \mu)$.

For the remainder of the discussion we suppose that the density $g(z; \mu)$ is exponential with mean $\mu$. This may be reasonable in some circumstances. In others, as in the auditing problem, it is best regarded as a conservative approximation to a truncated exponential distribution over (0, 1). The theory for this, and for the truncated exponential distribution over (0, 1) plus an atom of probability at $z = 1$, can be developed, but is appreciably more complicated.

Under the exponential model the minimal sufficient statistic is $(m, \bar z)$, where $\bar z = (z_1 + \cdots + z_m)/m$; of course $\bar z$ is not defined if $m = 0$. As noted previously, the occurrence of $m = 0$ with nonnegligible probability makes an 'exact' confidence interval solution of even the infinite model problem impossible.

For a Bayesian analysis, in the absence of specific information about a particular application, we investigate the simplest conjugate prior, giving $\theta$ and $\mu$ independent prior densities of the form

$$ \frac{(a/\theta_0)^a\, \theta^{a-1} \exp(-a\theta/\theta_0)}{\Gamma(a)}, \qquad \frac{\{(b-1)\mu_0\}^b\, \mu^{-b-1} \exp\{-(b-1)\mu_0/\mu\}}{\Gamma(b)}, \qquad (16) $$

so that $\theta_0$ and $\mu_0$ are the prior means of $\theta$ and $\mu$. On combining the prior with the likelihood, the posterior distribution of $\psi = \theta\mu$ takes the form

$$ \psi \sim \frac{(m+a)\{m\bar z + (b-1)\mu_0\}}{n(m+b)}\, F_{2(m+a),\, 2(m+b)}, \qquad (17) $$

where $F_{d_1, d_2}$ is a random variable with the standard $F$ distribution with $(d_1, d_2)$ degrees of freedom.

Now it is in principle desirable to use a homogeneous set of populations in defining the prior distributions, but this may often not be possible. Then, from the interpretation of $1/\surd a$ and $1/\surd b$ as coefficients of variation of $\theta$ and $1/\mu$, it seems reasonable to consider $a$ in the range say $\tfrac{1}{2}$ to 2 and $b$ to be at least 3, the condition $b > 1$ being necessary to ensure that the prior mean of $\mu$ is finite; if the exponential distribution for $z$ is to be used to approximate to a truncated exponential distribution, then some combination such as $\mu_0 = \tfrac{1}{2}$, $b = 3$ will be conservative. Finally $a/\theta_0$ will typically be small compared with $n$, corresponding to the neglect of the exponential factor in the prior for $\theta$ in deriving (17).

It would be quite wrong to suggest a Bayesian solution for universal use regardless of context, but the above choice, say $a = 1$, $b = 3$, $\mu_0 = \tfrac{1}{2}$, leads to

$$ \psi \sim \frac{(m+1)(m\bar z + 1)}{n(m+3)}\, F_{2(m+1),\, 2(m+3)}, \qquad (18) $$

and hence to an upper confidence limit for the population total error remaining, which is $T^{(u)}$, or, to the order of approximation involved in the infinite population version, the limit for $\psi$ multiplied by $T_x$.

Two other possibilities deserve brief consideration. One is that $\mu$ is assumed known and equal to $\mu_0$; for example, it might be reasonable to assume that fractional errors are equally likely to be anywhere between 0 and 1, when $\mu_0 = \tfrac{1}{2}$. Then we let $b \to \infty$ and

$$ \psi \sim \mu_0 (m+a)\, F_{2(m+a),\, \infty}/n, \qquad (19) $$

to be compared with the result from §3 that the upper confidence limit is $\mu_0 w_\alpha(m)/n$; the agreement is close, and for $a = 1$ the two limits coincide. The other special possibility is to try for effective approximate confidence limits from (17) for $m \neq 0$ by passing to a prior corresponding to initially very vague knowledge, for which one reasonable choice is $a = 1$, $b = 1$. This leads, for $m \neq 0$, to

$$ \psi \sim \frac{m \bar z}{n}\, F_{2m+2,\, 2m+2}. \qquad (20) $$

The above discussion is for the infinite population version of the problem, where the quantity of interest is essentially $\psi = \theta\mu$. For the finite population version of the problem, we use the moment generating function (14), or alternatively the mean and variance derived from it, together with the posterior density (17), to show that, conditionally on $m$ and $\bar z$, the mean of $T^{(u)}$ is that derived from (17), whereas the variance is that from (17) multiplied by

$$ 1 + (1 + c_x^2)\, \frac{2n(m+b-1)}{R(2m+a+b-1)}, \qquad (21) $$

and the simplest, if rather crude, way of allowing for this is to apply (17) with the factor $m+a$ and the degrees of freedom $2(m+a)$ both divided by (21).
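The posterior limits (17)-(21) are straightforward to evaluate; the sketch below is our own, assuming scipy's $F$ quantile function, and is not a definitive implementation. The default $(a, b, \mu_0) = (1, 3, \tfrac{1}{2})$ reproduces (18), and the optional arguments apply the crude finite-population adjustment by (21).

```python
from scipy.stats import f

def bayes_upper_limit(m, zbar, n, a=1.0, b=3.0, mu0=0.5, alpha=0.05,
                      R=None, cx2=None):
    """Upper 1-alpha posterior limit for psi = theta * mu from (17).

    If R (number of unsampled items) and cx2 (squared coefficient of
    variation of x among them) are given, the factor m + a and the degrees
    of freedom 2(m + a) are both divided by the inflation factor (21)."""
    shape = m + a
    if R is not None:
        shape /= 1 + (1 + cx2) * 2 * n * (m + b - 1) / (R * (2 * m + a + b - 1))
    scale = (m * zbar + (b - 1) * mu0) / (n * (m + b))
    return shape * scale * f.ppf(1 - alpha, 2 * shape, 2 * (m + b))

# Multiplying by T_x (or by the recorded total of the unsampled part) gives
# an upper limit for the total error remaining; note m * zbar = 0 when m = 0,
# so zbar may then be passed as 0.
```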
It is natural to assess (20) by examining whether formal upper $1-\alpha$ limits for $\psi = \theta\mu$ calculated from the formula have approximately the desired confidence interval property. There is a difficulty of principle in that if $m = 0$, that is if there are no observed errors, (20) is inapplicable; we cannot estimate $\theta\mu$ if there is no information about $\mu$. The most satisfactory approach seems to be the following. Inference about $\theta$, i.e. in effect about $T_{xd}$, can be based on the Poisson distribution as in (5) and the required confidence interval property is assured; at least in principle we imagine limits computed for any $m$ that arises. Inference about $\theta\mu$ can be attempted only when $m \neq 0$ and hence the question for consideration is whether an upper limit calculated from (20) has approximately the correct frequency property over a set of repetitions in which $m$ and $\bar z$ vary, but taken conditionally on $m \neq 0$.

Some information may be gleaned from extreme cases. If all proportional errors $z$ are the same, the limits (20) are too large by a factor very close to $\surd 2$. If the numbers of nonzero errors are large, so that asymptotic theory is applicable, the limits are conservative, or not, according as the standard deviation of the distribution $g(z; \mu)$ is less than or exceeds the mean. Of the situations that might arise in an auditing context, the least favourable for the use of (20) is a mixture of a fairly high probability, $p$, of a very small value of $z = z_0$, say, with a complementary probability $1-p$ of unit $z$; this is broadly comparable with the asymptotic situation with a high coefficient of variation.
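The conditional frequency assessment just described can be approximated by simulation. The following sketch, ours rather than the authors', uses the exponential error model with arbitrary illustrative parameter values; under that model the conditional coverage would be expected to be roughly the nominal level.

```python
import numpy as np
from scipy.stats import f

def coverage_of_20(theta, mu, n, alpha=0.05, reps=20_000, seed=1):
    """Monte Carlo check of the upper limit (20), conditionally on m != 0,
    with exponential proportional errors of mean mu."""
    rng = np.random.default_rng(seed)
    psi = theta * mu
    covered = total = 0
    for _ in range(reps):
        m = rng.poisson(n * theta)
        if m == 0:
            continue                      # (20) is inapplicable when m = 0
        zbar = rng.exponential(mu, m).mean()
        limit = (m * zbar / n) * f.ppf(1 - alpha, 2 * m + 2, 2 * m + 2)
        total += 1
        covered += psi <= limit
    return covered / total

# Hypothetical values: 1% error rate, mean proportional error 0.3.
print(coverage_of_20(theta=0.01, mu=0.3, n=500))   # expected near 0.95
```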
We are grateful to Mr J. Connolly, Mann Judd & Co., for presenting the problem which prompted this work.

REFERENCES

ANDERSON, R. & TEITLEBAUM, A. D. (1973). Dollar-unit sampling. Can. Chartered Accountant 102, No. 4, 30-9.
ARKIN, H. (1963). Handbook of Sampling for Auditing and Accounting. New York: McGraw-Hill.
COCHRAN, W. G. (1963). Sampling Techniques, 2nd edition. New York: Wiley.
FIENBERG, S. E., NETER, J. & LEITCH, R. A. (1977). Estimating the total overstatement error in accounting populations. J. Am. Statist. Assoc. 72, 295-302.
MCRAE, T. W. (1974). Statistical Sampling for Audit and Control. New York: Wiley.
NETER, J., LEITCH, R. A. & FIENBERG, S. E. (1978). Dollar unit sampling: multinomial bounds for total overstatement and understatement errors. Accounting Rev. 53, 77-93.
SMITH, T. M. F. (1976). Statistical Sampling for Accountants. London: Accountancy Age Books.

[Received January 1978. Revised August 1978]