Working Papers in Statistics
No 2016:1
Department of Statistics
School of Economics and Management
Lund University
Sequential Search Algorithm
for Estimation of the Number of Classes
in a Given Population
MICHAEL J. KLASS, UNIVERSITY OF CALIFORNIA, BERKELEY
KRZYSZTOF NOWICKI, LUND UNIVERSITY
Abstract

Let N be the number of classes in a population to be estimated. Fix any preassigned error probability 0 < ε < exp(−2) (roughly). We present a sequential search algorithm to estimate the exact value of N, with an error probability of at most ε, regardless of the value of N.
Key words and phrases: Unobserved species, estimation of population size, sequential estimation procedure, error probability
1 Introduction
The classic "species problem", that of estimating the number N of categories in a given population based on a random sample, was first introduced by Fisher et al. (1943). It has been widely studied in ecology and later extended to many other applications; see, for instance, Thisted and Efron (1987) and Mao and Lindsay (2007). Bunge and Fitzpatrick (1993) provide a review of statistical methods for estimating the number of unseen species.
The estimation of the number of unseen species is closely related to the problem of estimating the expected number of new species that would be seen in an additional sample of any given size; see Good and Toulmin (1956), Efron and Thisted (1976) and Boneh et al. (1998).
Several authors have proposed estimation methods for the number of classes utilizing sequential random sampling with various stopping rules; see Goodman (1953), Samuel (1968, 1969), Holst (1981) and Nayak and Kundu (2007). In the algorithm we propose, the stopping rule is connected to a preassigned error probability P(N̂ ≠ N), where N̂ is our estimate of N.
The problem addressed in this paper can be formalized as follows. Let N be a fixed but unknown positive integer. Let X_1, X_2, ... be i.i.d. random variables taking values in {1, 2, ..., N}, each of the N outcomes being equally likely. Fix any preassigned error probability 0 < ε < exp(−2) (roughly). We want to estimate the exact value of N, with an error probability of at most ε, regardless of the value of N.
2 The search algorithm
To teach ourselves how to proceed, suppose we begin by asking: when might one decide it would no longer be advantageous to continue searching/sampling (gathering observations) for items yet unseen?
Given that j different objects have already been recorded, let W_{j+1} denote the random waiting time describing the number of additional observations needed until the (j+1)st item surfaces, with W_{j+1} = ∞ if there are only j items. Formally, define T_1 = W_1 = 1 and, having defined W_j for 1 ≤ j ≤ k, let T_j = W_1 + ... + W_j and define

    W_{j+1} = first i ≥ 1 : X_{i+T_j} ∉ {X_1, ..., X_{T_j}},  or ∞ if no such i exists,   (1)

and T_{j+1} = W_1 + ... + W_{j+1}.
Suppose N = n for some integer n > 0. For each n we use the notation W_{n,j+1} and T_{n,j+1}. Imagine that we have already found j different objects and have conducted another s observations without finding anything new. What information does this provide? Since each observation falls among the j seen objects with probability j/n,

    P(W_{n,j+1} > s) = (j/n)^s,  for 1 ≤ j ≤ n − 1.   (2)
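As a quick sanity check, (2) can be confirmed by simulation. The sketch below is our own illustration (the function name is ours, not from the paper): with j of the n classes already seen, each further draw stays among the seen classes with probability j/n.

```python
import random

def waiting_tail_estimate(n, j, s, trials=100_000, seed=1):
    """Monte Carlo estimate of P(W_{n,j+1} > s): the chance that s further
    uniform draws from n classes all fall among the j classes already seen."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # by symmetry, label the j already-seen classes 0, ..., j-1
        if all(rng.randrange(n) < j for _ in range(s)):
            hits += 1
    return hits / trials

n, j, s = 10, 7, 5
exact = (j / n) ** s              # (2): (j/n)^s
approx = waiting_tail_estimate(n, j, s)
print(round(exact, 4), round(approx, 4))
```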
Suppose our search has currently found j objects and then W_{j+1} exceeds ⌈A(j+k)⌉, where k is to be determined. Let

    t = first j ≥ 1 : W_{j+1} > ⌈A(j+k)⌉.   (3)

We stop at such time t and declare that N̂, our best guess as to the value of N, is t. At that point we have conducted

    Q ≡ 1 + W_2 + ... + W_t + ⌈A(t+k)⌉   (4)

searches. What is the maximum error probability incurred by this rule over all possible values of N ≥ 1?
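In code, the stopping rule (3) together with the search count (4) can be sketched as follows. This is our own illustrative implementation (names are ours, not from the paper): it scans a stream of observations, tracks the number j of distinct classes seen so far, and stops once ⌈A(j + k)⌉ consecutive draws produce nothing new, declaring N̂ = j.

```python
import math
import random

def estimate_num_classes(stream, A=2.0, k=4):
    """Sequential search sketch: stop once the run of draws since the last
    new class reaches ceil(A*(j+k)), where j = distinct classes seen so far.
    Returns (N_hat, searches); searches corresponds to Q in (4)."""
    seen = set()
    wait = 0       # consecutive draws that produced no new class
    searches = 0
    for x in stream:
        searches += 1
        if x in seen:
            wait += 1
            # wait == ceil(A*(j+k)) old draws establish W_{j+1} > ceil(A*(j+k))
            if wait >= math.ceil(A * (len(seen) + k)):
                return len(seen), searches
        else:
            seen.add(x)
            wait = 0
    return None, searches  # stream ended before the rule triggered

rng = random.Random(0)
N = 20
draws = iter(lambda: rng.randrange(N), None)  # endless uniform draws
n_hat, used = estimate_num_classes(draws)
print(n_hat, used)
```

With A = 2 and k = 4 the error probability is bounded by roughly 0.16 (Theorem 2.1 below), so a single run will usually, but not always, return N̂ = N.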
Let N̂ be the integer which our procedure guesses. If n = 1, N̂ is always 1. For N = n ≥ 2,

    P(N̂ ≠ n) = P(∪_{j=1}^{n−1} {W_{n,j+1} > ⌈A(j+k)⌉}).   (5)
To upper-bound this expression we introduce the following lemma.

Lemma 2.1  Let E_{j+1}, 1 ≤ j ≤ n − 1, be a set of independent events and let E*_{j+1}, 1 ≤ j ≤ n − 1, be a set of independent events such that P(E_{j+1}) ≤ P(E*_{j+1}) for 1 ≤ j ≤ n − 1. Then

    P(∪_{j=1}^{n−1} E_{j+1}) ≤ Σ_{j=1}^{n−1} P(E*_{j+1}) − P(E*_n)P(E*_{n−1}).   (6)
Proof: First observe that

    P(∪_{j=1}^{n−1} E_{j+1}) = 1 − P(∩_{j=1}^{n−1} E^c_{j+1})
                             = 1 − ∏_{j=1}^{n−1} P(E^c_{j+1})   (by independence)
                             ≤ 1 − ∏_{j=1}^{n−1} P((E*_{j+1})^c)   (by the assumption)   (7)
                             = P(∪_{j=1}^{n−1} E*_{j+1}).
Moreover, by Boole's inequality,

    P(∪_{j=1}^{n−1} E*_{j+1}) ≤ P(E*_n ∪ E*_{n−1}) + Σ_{j=1}^{n−2} P(E*_{j+1}),   (8)

and by independence P(E*_n ∪ E*_{n−1}) = P(E*_n) + P(E*_{n−1}) − P(E*_n)P(E*_{n−1}), from which (6) follows.
For 1 ≤ j ≤ n − 1, A > 0 and fixed k ≥ 1, let

    E_{n,j+1} = {W_{n,j+1} > ⌈A(j+k)⌉}   (9)

and

    P(E*_{n,j+1}) = (j/n)^{A(j+k)}.   (10)

Clearly, P(E_{n,j+1}) ≤ P(E*_{n,j+1}), since (j/n)^{⌈A(j+k)⌉} ≤ (j/n)^{A(j+k)}. Using Lemma 2.1, (5) can be upper-bounded by

    P(∪_{j=1}^{n−1} E_{n,j+1}) ≤ P(∪_{j=1}^{n−1} E*_{n,j+1})
                               ≤ Σ_{j=1}^{n−1} P(E*_{n,j+1}) − P(E*_{n,n})P(E*_{n,n−1}).   (11)
Next we introduce upper bounds for the terms in (11).

Lemma 2.2  For 1 ≤ j ≤ 2k and n ≥ j + 1,

    P(E*_{n,n−j+1}) ≤ exp(−Aj).   (12)
Proof: For k ≥ 1, 1 ≤ j ≤ n − 1 and A > 0,

    P(W_{n,j+1} > ⌈A(j+k)⌉) = (j/n)^{⌈A(j+k)⌉}   (by definition of W_{n,j+1})
                            ≤ (j/n)^{A(j+k)}
                            = exp(−A(j+k) ln(n/j)) ≡ P(E*_{n,j+1}).   (13)

Replacing j by n − j and n by x, for x ≥ j + 1 let

    f(x) = −(x − j + k) ln(x/(x − j)).   (14)
Notice that P(E*_{n,n−j+1}) = exp(Af(n)). Then

    f′(x) = −ln(x/(x−j)) − (x − j + k)(1/x − 1/(x−j))
          = −ln(x/(x−j)) − (k−j)/x + k/(x−j)   (15)

and

    f″(x) = −1/x + 1/(x−j) + (k−j)/x² − k/(x−j)²
          = j/(x(x−j)) + [(k−j)(x−j)² − kx²]/(x²(x−j)²)
          = [jx² − j²x − jx² − 2j(k−j)x + j²(k−j)]/(x²(x−j)²)
          = [j(j−2k)x + j²(k−j)]/(x²(x−j)²)   (16)
          ≤ 0  for k ≤ j ≤ 2k, or for 1 ≤ j ≤ k − 1 since (2k−j)x ≥ (k−j)j.

Therefore f(x) is concave. Notice that lim_{x→∞} f′(x) = 0; therefore f′(x) > 0 for all x > j + 1. Consequently

    sup_{x≥j+1} f(x) = lim_{x→∞} f(x) = lim_{x→∞} x ln(1 − j/x) = −j.   (17)

Hence

    sup_{n≥j+1} P(E*_{n,n−j+1}) = lim_{n→∞} P(E*_{n,n−j+1}) = lim_{n→∞} exp(Af(n)) = exp(−Aj).   (18)
Theorem 2.1  For all n ≥ 2, A ≥ 2 and k = 4,

    P(∪_{j=1}^{n−1} E*_{n,j+1}) < exp(−A) + exp(−2A) + exp(−3A).   (19)
Proof: We treat various n separately. For n = 1 the error probability is zero.

Case 1: 2 ≤ n ≤ 4. Inequality (12) combined with Boole's inequality yields

    P(∪_{j=1}^{n−1} E*_{n,j+1}) ≤ exp(−A) + exp(−2A) + exp(−3A).   (20)
Cases n ≥ 5. Invoking (11) and (12), for n ≥ 5,

    P(∪_{j=1}^{n−1} E*_{n,j+1}) < exp(−A) + exp(−2A) + exp(−3A) − P(E*_{n,n})P(E*_{n,n−1})
                                  + Σ_{j=1}^{n−4} P(E*_{n,j+1}).   (21)
Case n = 5. Since

    P(E*_{5,2}) = exp(−5A ln 5),   (22)

we have

    P(E*_{5,5})P(E*_{5,4}) = exp(−(Γ_{5,5} + Γ_{5,4})A)
                           > exp(−5.4A)   (by (36) and (37) below)
                           > exp(−5A ln 5)
                           = P(E*_{5,2}),   (23)

so by (21) Theorem 2.1 holds for n = 5.
Case n = 6. Applying (21) for n = 6, we need to verify the inequality

    P(E*_{6,2}) + P(E*_{6,3}) ≤ P(E*_{6,6})P(E*_{6,5}).   (24)
Since Γ_{n,n} and Γ_{n,n−1} decrease in n,

    P(E*_{6,6})P(E*_{6,5}) = exp(−A(Γ_{6,6} + Γ_{6,5}))
                           > exp(−A(Γ_{5,5} + Γ_{5,4}))
                           > exp(−5.4A)
                           > exp(−5A ln 6) + exp(−6A ln 3)
                           = P(E*_{6,2}) + P(E*_{6,3}),   (25)

which confirms Theorem 2.1 for n = 6.
Case n = 7. Since Γ_{n,n} and Γ_{n,n−1} decrease in n,

    P(E*_{7,7})P(E*_{7,6}) > P(E*_{5,5})P(E*_{5,4}) > exp(−5.4A)
                           > exp(−5A ln 7) + exp(−6A ln(7/2)) + exp(−7A ln(7/3))
                           = P(E*_{7,2}) + P(E*_{7,3}) + P(E*_{7,4})   (for A ≥ 2),   (26)

whence Theorem 2.1 holds for n = 7.
Case n ≥ 8. We begin by considering Σ_{j=1}^{n−4} P(E*_{n,j+1}). Splitting this sum into three parts,

    Σ_{j=1}^{n−4} P(E*_{n,j+1}) = P(E*_{n,2}) + Σ_{2≤j≤n/e} P(E*_{n,j+1}) + Σ_{n/e<j≤n−4} P(E*_{n,j+1}).   (27)

We will treat each of the sums separately. First, since ln(n/j) ≥ 1 for j ≤ n/e,

    Σ_{2≤j≤n/e} P(E*_{n,j+1}) = Σ_{2≤j≤n/e} exp(−A(j+4) ln(n/j))
                              ≤ Σ_{j=2}^{∞} exp(−A(j+4)) = exp(−6A)/(1 − exp(−A)).   (28)
Second, substituting i = n − 3 − j,

    Σ_{n/e<j≤n−4} P(E*_{n,j+1}) = Σ_{n/e<j≤n−4} exp(−A(j+4) ln(n/j))
        = Σ_{1≤i<n−⌈n/e⌉−3} (1/i²) exp(2 ln(i) − A(n+1−i) ln(n/(n−i−3)))
        ≤ [max_{1≤i≤n−⌈n/e⌉−3} exp(2 ln(i) − A(n+1−i) ln(n/(n−i−3)))] Σ_{i=1}^{∞} 1/i²   (29)
        ≤ (π²/6) exp(−An ln(n/(n−4)))   (by Lemma 3.2 with k = 4).

Hence, since P(E*_{n,2}) = exp(−5A ln n) ≤ exp(−5A ln 8) for n ≥ 8,

    Σ_{j=1}^{n−4} P(E*_{n,j+1}) ≤ exp(−5A ln 8) + exp(−6A)/(1 − exp(−A)) + (π²/6) exp(−An ln(n/(n−4)))
        = exp(−5A ln 8) + exp(−6A)/(1 − exp(−A)) + (π²/6) P(E*_{n,n−3}).   (30)

Next we show

    exp(−5A ln 8) + exp(−6A)/(1 − exp(−A)) ≤ P(E*_{n,n})P(E*_{n,n−1}) − (π²/6) P(E*_{n,n−3}).   (31)
For n ≥ 8 and A ≥ 2, Lemma 3.1 gives P(E*_{n,n})P(E*_{n,n−1}) ≥ exp(−4.5A) and P(E*_{n,n−3}) ≤ exp(−5A), so

    P(E*_{n,n})P(E*_{n,n−1}) − (π²/6) P(E*_{n,n−3}) ≥ exp(−4.5A) − (π²/6) exp(−5A)
        = exp(−4.5A)(1 − (π²/6) exp(−0.5A))
        ≥ exp(−4.5A)(1 − (π²/6) exp(−1))   (32)
        > exp(−5A ln 8) + exp(−6A)/(1 − exp(−A)),

whence (31) holds for n ≥ 8 and A ≥ 2.
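The bound of Theorem 2.1 can also be checked empirically. The sketch below is our own code (not from the paper): with A = 2 and k = 4 it runs the stopping rule repeatedly and compares the observed error frequency with exp(−A) + exp(−2A) + exp(−3A) ≈ 0.156.

```python
import math
import random

def run_once(N, A, k, rng):
    """One run of the sequential rule on uniform draws from N classes."""
    seen, wait = set(), 0
    while True:
        x = rng.randrange(N)
        if x in seen:
            wait += 1
            # ceil(A*(j+k)) consecutive old draws trigger the stop
            if wait >= math.ceil(A * (len(seen) + k)):
                return len(seen)  # the estimate N_hat
        else:
            seen.add(x)
            wait = 0

A, k, trials = 2.0, 4, 2000
bound = math.exp(-A) + math.exp(-2 * A) + math.exp(-3 * A)
rng = random.Random(42)
rates = {N: sum(run_once(N, A, k, rng) != N for _ in range(trials)) / trials
         for N in (2, 5, 20)}
print(rates, "bound:", round(bound, 4))
```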
Remark 2.1  Notice that for all j ≥ 1,

    lim_{n→∞} P(E_{n,n−j+1}) = lim_{n→∞} exp(−⌈A(n−j+4)⌉ ln(n/(n−j))) = exp(−Aj).   (33)

Hence, as the number N = n of objects to be found tends to infinity, the probability that we fail to guess their exact number tends to

    lim_{n→∞} P(∪_{j=1}^{n−1} E_{n,j+1}) = lim_{n→∞} (1 − P(∩_{j=1}^{n−1} E^c_{n,j+1}))
        = lim_{n→∞} (1 − ∏_{j=1}^{n−1} (1 − P(E_{n,n−j+1})))   (34)
        = 1 − ∏_{j=1}^{∞} (1 − exp(−Aj)).
An alternative expression for this limit may be obtained by writing

    ∏_{j=1}^{∞} (1 − exp(−Aj)) = exp(Σ_{j=1}^{∞} ln(1 − exp(−Aj)))
        = exp(−Σ_{j=1}^{∞} Σ_{k=1}^{∞} exp(−jkA)/k)   (35)
        = exp(−Σ_{k=1}^{∞} exp(−kA)/(k(1 − exp(−kA)))).
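The two expressions for this limit can be compared numerically; the code below is our own illustrative check, truncating the infinite product and sum.

```python
import math

def limit_error_product(A, terms=200):
    """1 - prod_{j>=1} (1 - exp(-A*j)), truncated after `terms` factors."""
    prod = 1.0
    for j in range(1, terms + 1):
        prod *= 1.0 - math.exp(-A * j)
    return 1.0 - prod

def limit_error_series(A, terms=200):
    """The same limit via the alternative sum over k."""
    s = sum(math.exp(-m * A) / (m * (1.0 - math.exp(-m * A)))
            for m in range(1, terms + 1))
    return 1.0 - math.exp(-s)

for A in (2.0, 3.0):
    print(A, limit_error_product(A), limit_error_series(A))
```

For A = 2 both evaluate to about 0.1536, close to but below the bound exp(−A) + exp(−2A) + exp(−3A) of Theorem 2.1.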
3 Appendix
We lower- and upper-bound P(E*_{n,k}) for k = n, n − 1 and n − 3.

Lemma 3.1  Let Γ_{n,k} = −(1/A) ln(P(E*_{n,k})). For A ≥ 2 and n ≥ 2,

    exp(−A(1 + 7/(2n) + 11/(6n²) + 5/(4n²(n−1)))) ≤ P(E*_{n,n}) ≤ exp(−A(1 + 7/(2n))),   (36)

    exp(−A(2 + 6/n + 20/(3n²) + 28/(3n²(n−2)))) ≤ P(E*_{n,n−1}) ≤ exp(−A(2 + 6/n)),   (37)

and

    exp(−A(4 + 8/n + 64/(3n²) + 64/(n²(n−4)))) ≤ P(E*_{n,n−3}) ≤ exp(−A(4 + 8/n)).   (38)
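Since P(E*_{n,j+1}) = (j/n)^{A(j+4)} when k = 4, the relevant exponents are Γ_{n,n} = (n+3) ln(n/(n−1)), Γ_{n,n−1} = (n+2) ln(n/(n−2)) and Γ_{n,n−3} = n ln(n/(n−4)). The bounds (36)-(38) can then be verified numerically; the check below is our own illustration.

```python
import math

def gamma_nn(n):  return (n + 3) * math.log(n / (n - 1))
def gamma_nn1(n): return (n + 2) * math.log(n / (n - 2))
def gamma_nn3(n): return n * math.log(n / (n - 4))

def bounds_hold(n):
    """Exponent form of (36)-(38): lower bound <= Gamma <= upper bound."""
    return (
        1 + 7/(2*n) <= gamma_nn(n)
            <= 1 + 7/(2*n) + 11/(6*n**2) + 5/(4*n**2*(n - 1))
        and 2 + 6/n <= gamma_nn1(n)
            <= 2 + 6/n + 20/(3*n**2) + 28/(3*n**2*(n - 2))
        and 4 + 8/n <= gamma_nn3(n)
            <= 4 + 8/n + 64/(3*n**2) + 64/(n**2*(n - 4))
    )

# Gamma_{n,n-3} requires n > 4
print(all(bounds_hold(n) for n in range(5, 500)))
```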
Proof: We derive upper and lower bounds for P(E*_{n,k}) by bounding Γ_{n,k} for k = n, k = n − 1 and k = n − 3. First,

    Γ_{n,n} = (n+3) ln(n/(n−1)) = −(n+3) ln(1 − 1/n)
            = Σ_{j=1}^{∞} 1/(j n^{j−1}) + Σ_{j=1}^{∞} 3/(j n^j)   (39)
            = 1 + Σ_{j=1}^{∞} (1/(j+1) + 3/j)(1/n^j).

Clearly,

    Γ_{n,n} ≥ 1 + 7/(2n)   (40)

and

    Γ_{n,n} = 1 + 7/(2n) + 11/(6n²) + Σ_{j=3}^{∞} (4j+3)/((j+1)j n^j)
            ≤ 1 + 7/(2n) + 11/(6n²) + (5/4) Σ_{j=3}^{∞} 1/n^j   (41)
            = 1 + 7/(2n) + 11/(6n²) + 5/(4n²(n−1)),

which gives (36). Second,

    Γ_{n,n−1} = (n+2) ln(n/(n−2)) = −(n+2) ln(1 − 2/n)
              = 2 + Σ_{j=1}^{∞} (2^j/n^j)(2/(j+1) + 2/j),   (42)
which gives (37) by calculations similar to those used in (40) and (41). Third, we lower-bound P(E*_{n,n−3}) = exp(−AΓ_{n,n−3}) by upper-bounding

    Γ_{n,n−3} = n ln(n/(n−4)) = −n ln(1 − 4/n)
              = 4 + Σ_{j=1}^{∞} 4^{j+1}/((j+1) n^j)
              = 4 + 8/n + 64/(3n²) + Σ_{j=3}^{∞} (4^{j+1}/(j+1))(1/n^j)   (43)
              ≤ 4 + 8/n + 64/(3n²) + Σ_{j=3}^{∞} (4/n)^j
              = 4 + 8/n + 64/(3n²) + 64/(n²(n−4)),

which gives (38).
Lemma 3.2  Let A ≥ 2, k ≥ 1, B = n − k + 1, n > k and

    g(x) = 2 ln(B − x) − A(x + k) ln(n/x).   (44)

Then

    sup_{n/e ≤ x ≤ B−1} g(x) = g(B−1) = −An ln(n/(n−k)).   (45)
Proof: Toward this end,

    g′(x) = −2/(B−x) − A ln(n/x) + A + Ak/x
          = −2/(B−x) + A ln(ex/n) + Ak/x
          ≥ −2/(B−x) + 2 ln(ex/n) + 2k/x   (46)
          ≡ 2h(x).

For x = n/e we have ln(ex/n) = 0. Hence

    g′(x) ≥ (2/(x(B−x)))(Bk − (k+1)x) ≥ 0   (47)

for n/e ≤ x ≤ Bk/(k+1). We need to prove that g′(x) ≥ 0 for Bk/(k+1) ≤ x ≤ B − 1. Consider

    h′(x) = −1/(B−x)² + 1/x − k/x².   (48)

For Bk/(k+1) ≤ x < B,

    h″(x) = −2/(B−x)³ − 1/x² + 2k/x³
          < −2(k+1)³/B³ + 2(k+1)³/(B³k²)   (49)
          ≤ 0  for k ≥ 1.

Hence h(x) is concave on Bk/(k+1) ≤ x ≤ B − 1. Therefore

    inf_{Bk/(k+1) ≤ x ≤ B−1} h(x) = min{h(Bk/(k+1)), h(B−1)} ≥ 0  iff  h(B−1) ≥ 0.
Now

    h(B−1) = −1 + ln(e(n−k)/n) + k/(n−k) ≡ q(n)   (50)

and

    q′(n) = 1/(n−k) − 1/n − k/(n−k)²   (51)
          = k/(n(n−k)) − k/(n−k)²
          = (k(n−k) − nk)/(n(n−k)²) < 0.   (52)

Thus q(n) ↘ lim_{n→∞} q(n) = 0, whence h(B−1) > 0 and so g(B−1) = sup_{n/e ≤ x ≤ B−1} g(x).
References

[1] Boneh, S., Boneh, A. and Caron, R.J. Estimating the prediction function and the number of unseen species in sampling with replacement, Journal of the American Statistical Association, 93(441) (1998), 372–379.

[2] Bunge, J. and Fitzpatrick, M. Estimating the number of species: a review, Journal of the American Statistical Association, 88(421) (1993), 364–373.

[3] Efron, B. and Thisted, R. Estimating the number of unseen species: How many words did Shakespeare know?, Biometrika, 63(3) (1976), 435–447.

[4] Fisher, R.A., Corbet, A.S. and Williams, C.B. The relation between the number of species and the number of individuals in a random sample of an animal population, Journal of Animal Ecology, 12 (1943), 42–58.

[5] Good, I.J. and Toulmin, G.H. The number of new species, and the increase in population coverage, when a sample is increased, Biometrika, 43 (1956), 45–63.

[6] Goodman, L.A. Sequential sampling tagging for population size problems, The Annals of Mathematical Statistics, 24 (1953), 56–69.

[7] Holst, L. Some asymptotic results for incomplete multinomial or Poisson samples, Scandinavian Journal of Statistics, 8 (1981), 243–246.

[8] Mao, C.X. and Lindsay, B.G. Estimating the number of classes, The Annals of Statistics, 35(2) (2007), 917–930.

[9] Nayak, T.P. and Kundu, S. Comparison of stopping rules in sequential estimation of the number of classes in a population, Sequential Analysis, 26 (2007), 367–381.

[10] Samuel, E. Sequential maximum likelihood estimation of the size of a population, The Annals of Mathematical Statistics, 39(3) (1968), 1057–1068.

[11] Samuel, E. Comparison of sequential rules for estimation of the size of a population, Biometrics, 25 (1969), 517–527.

[12] Thisted, R. and Efron, B. Did Shakespeare write a newly-discovered poem?, Biometrika, 74(3) (1987), 445–455.