EDWARD W. FREES. On Construction of Sequential Age Replacement Policies via Stochastic Approximation. (Under the direction of DAVID RUPPERT and GORDON SIMONS.)
Stochastic approximation (SA), a stochastic analog of iterative techniques for finding the zeroes of a function (e.g., Newton-Raphson), is applied to a problem in resource allocation, the age replacement policy (ARP) problem. Under an age replacement policy, a stochastically failing unit is replaced at failure or at time t, whichever comes first. When the form of the failure distribution is unknown, the question of estimating φ, the optimal replacement time (optimal in terms of achieving the smallest long-run expected cost), is important. SA is shown to be superior to the existing sequential methodology introduced by Bather (1977). Further, this application motivates novel ways in which the theory of SA is extended.
This research was supported in part by the National Science Foundation (Grant MCS-8100748).
ACKNOWLEDGEMENTS
Learning to do statistical research is like following a very mucky trail. It can be full of pitfalls, false paths and dead ends. I would now like to acknowledge some of the people who have helped me learn how to stick to the right trails.
Thanks go to my co-adviser, David Ruppert, for the
generous use of his time.
With the proper amount of guidance from him, solving this dissertation problem became a very educational and worthwhile experience for me.
Thanks also go to Prof. Gordon Simons, co-adviser, for
suggesting the problem and helping me with some initial rocky
stages.
To Prof. W. L. Smith, chairman of the dissertation
committee, for allowing me to use the department's word
processor and his personal text-editing language (DQ3) on
which this document was produced.
The many painful drafts
made for a much better product in the end.
Thanks also go to the other members of the dissertation committee, Professors Janet Begun, M. R. Leadbetter, and P. K. Sen.
The gratitude that I have for my parents and family is a constant in my life and need not really be mentioned. I remind them anyway of how much they are appreciated.
Finally, thanks go to my nit-picking English editor,
Marie Davidian, for her support in a non-stochastic manner.
Table of Contents

                                                                 Page
Chapter 0   INTRODUCTION AND NOTATION                               1

Chapter 1   THE ARP MODEL
    1.1  The classical model                                        6
    1.2  The classical model with an unknown
         distribution function                                     10
    1.3  The classical model viewed sequentially                   13

Chapter 2   STOCHASTIC APPROXIMATION AND AN APPLICATION
    2.1  Introduction                                              19
    2.2  Stochastic approximation-motivation                       21
    2.3  SA-short historical development                           24
    2.4  Stochastic approximation applied to the ARP               33
    2.5  Proofs                                                    38

Chapter 3   DEVELOPING AN OPTIMAL SA ARP METHODOLOGY
    3.1  Reducing the order of bias                                51
    3.2  Reducing the asymptotic mean square error                 59
    3.3  An adaptive SA ARP methodology                            63

Chapter 4   ANOTHER APPLICATION OF SA-ESTIMATING THE MODE
    4.1  Introduction                                              74
    4.2  Notation and assumptions                                  77
    4.3  Univariate results and remarks                            78
    4.4  Multivariate analogs                                      82
    4.5  Proofs                                                    86

Chapter 5   SA-SOME EXTENSIONS OF THE THEORY
    5.1  Adaptive K-W procedures                                   93
    5.2  SA-representation theorem                                100
    5.3  SA-sequential fixed-width confidence interval            105
    5.4  ARP-sequential fixed-width confidence interval           111

Chapter 6   MONTE-CARLO STUDIES
    6.1  Introduction                                             117
    6.2  Preliminary investigation                                118
    6.3  Description of the simulation                            120
    6.4  Summary of results                                       125

Appendix A - Deterministic graphs and calculations                130
Appendix B - Monte-Carlo output                                   133
Appendix C - Some alternative models                              139

Bibliography                                                      146
Index - Notation and assumptions                                  153
CHAPTER 0

INTRODUCTION AND NOTATION

Consider a functioning unit with specified life distribution F, where the probability of survival to age x is 1-F(x) = S(x). Let C_1 and C_2 be fixed, known costs with C_1 > C_2 > 0. If the unit fails prior to φ units of time after its installation, it is replaced at that failure time with cost C_1. Otherwise, the unit is replaced φ units of time after its installation with cost C_2. It is assumed that replacement is immediate. Under the age replacement policy (ARP), the replacement unit is available from a sequence of such units that fail independently with the same distribution function F. The objective is to minimize the long-run accumulation of costs in some sense. A practical bound on the cost corresponds to φ = ∞, which is merely a failure replacement policy (replace only at failure). The cost function here is the expected long-run average cost,

(1)    R_1(φ) = {C_1 F(φ) + C_2 S(φ)} / ∫_0^φ S(u) du.

This cost function is motivated and developed in §1.1 (Section 1 of Chapter 1). Alternative cost functions are introduced in Appendix C.
This dissertation approaches the ARP problem as a problem in statistical estimation. In particular, we are interested in estimating φ, the (assumed unique and finite) optimal replacement time. We give a short review for i.i.d. (identically and independently distributed) data in §1.2, focusing on nonparametric methods. In §1.3 we introduce sequentially conducted experiments, where the estimators (φ_n) of the optimal replacement time are actually used in replacing the unit. In particular, we show that if φ_n estimates φ well, then

(2)    t^{-1} Σ_{i=1}^{N(t)} {C_1 I(X_i < φ_i) + C_2 I(X_i ≥ φ_i)} → R_1(φ)    a.s.,

where X_1, X_2, ... are i.i.d. observations with distribution function F and N(t) is the number of replacements by time t. Thus, the cost actually achieved by the experimenter is asymptotically the same as if that person knew the optimal replacement time all along!

To provide sequential estimators of φ with properties sufficient for (2), we introduce stochastic approximation (SA) as a sequential estimation technique. Readers familiar with this methodology may wish to skip §2.2 and §2.3, which motivate SA and give a short historical development. A simple, easy to follow recipe for calculating the estimators (φ_n) is given in §2.4. In this section the assumptions are laid out and some asymptotic properties of these estimators are stated. The proofs are given in §2.5.
An estimator is said to be better than another if it has a better rate of convergence or, having the same rate of convergence, a smaller asymptotic mean square error (MSE). While this criterion is not unique, it is certainly not unreasonable. Chapter 3 is devoted to improving the estimator introduced in §2.4 to achieve an optimal estimator. The procedure in §2.4 uses a simple estimator for the density of the units at a point. In §3.1 we replace this estimator with a more sophisticated one (primarily using kernel methods, although more generality is sought). The resulting procedure gives estimators that have optimal rates. Given these rates, §3.2 shows how parameters of the procedure should be chosen to minimize the asymptotic MSE. §3.3 gives a modification of §3.1 to actually achieve this optimal MSE.

In Chapter 4 we allow ourselves a bit of a digression. Using techniques similar to those introduced in Chapters 2 and 3, we show how to estimate the mode of a distribution. While the goals of this chapter are independent of the others, the techniques of the proofs are not. In particular, the proofs of §4.5 require an intimate familiarity with stochastic approximation. For this reason we have split the details of the proofs off into a separate section. The results given here are stronger than those currently in the literature and the procedure is practicable, but we felt these results are not all one might hope for. Hence, this
chapter represents the current state of a work in progress.
We begin Chapter 5 by extending the notion of an
adaptive stochastic approximation process in the K-W case.
The first section is independent of the others in this
chapter.
In §5.2 we prove a representation theorem that is a modification of Ruppert's (1982). The modified version we give is easier to apply to specific SA problems; we apply this result to the ARP problem in §5.4. In §5.3 we develop a sequential fixed-width confidence interval from the representation. This result is a considerable improvement over previous results concerning sequential confidence intervals in SA, since those results dealt only with the Robbins-Monro case. As an example, in §5.4 we show how to develop a sequential fixed-width confidence interval for the ARP problem.
Having discussed the asymptotic properties of several
estimators, we felt a need to provide an investigation of the
finite sample properties of these estimators.
Chapter 6
gives the details of a Monte-Carlo study performed using the
procedure proposed in §2.4. The raw output of this study is
The raw output of this study is
given in Appendices A and B.
While most of the notation is introduced as it is needed, we mention some here. (2) refers to the second equation of the current chapter; (1.2) refers to the second equation of the first chapter. For some distribution function (d.f.) F, we use E and P to denote expectation and probability. σ(X) is the sigma-field generated by the random variable (r.v.) X, F_n is a sigma-field generated by previous events (to be specified with each application), and E_{F_n} is the conditional expectation given F_n. R^m denotes m-dimensional Euclidean space, and B^m are the Borel sets of R^m. Generally, when we write →, we mean deterministic or almost sure (a.s.) convergence. The symbol →_D is used for convergence in distribution. The symbol ∈ is used to denote membership of a set. Let {X_n} and {Y_n} be sequences of random variables. We write X_n = O(Y_n) if there exists a r.v. z such that |X_n|/|Y_n| ≤ z a.s. for all n. If |X_n|/|Y_n| → 0 a.s. we write X_n = o(Y_n). We use O_p(·) and o_p(·) for the corresponding relationships in probability.
CHAPTER 1

THE ARP MODEL

§1.1  The classical model

An important area in resource allocation theory (operations research) concerns the optimal replacement and/or maintenance of a stochastically deteriorating system. Lately, this has been an active area of research for applied probabilists, cf. Beichelt (1981), Aven (1982). The age replacement policy (ARP) arises as a special case of the theory of maintenance policies, e.g., systems designed to maintain a network of components. By considering the more restrictive ARP introduced in Chapter 0, more detailed results can be presented.
One of the earliest ARP's minimizes the long-run expected cost per unit of time. This model assumes the time to replace a unit is negligible and that all costs are absorbed in C_1 and C_2. Denote by N_1(t) (N_2(t)) the number of failure (planned) replacements in the time interval [0,t). Thus the cost over the interval is

    C(t) = C_1 N_1(t) + C_2 N_2(t),

and it is desired to find φ to minimize lim E[C(t)/t]. We take φ to be a fixed but unknown constant and call the semi-open interval [0,φ) the optimal replacement interval.
This problem may be viewed as a simple application of renewal theory. Let X_1, ..., X_n be an i.i.d. sample from lifetime distribution F, and Z_i = min(X_i, φ), i = 1, ..., n. When we speak of a "lifetime" distribution function, we mean a d.f. F such that F(0) = 0. We also assume left-continuity of the d.f. and that the mean (μ) of F is finite. Define, for all t > 0, a random variable N*(t) = inf{n: S_n > t}, where S_n = Z_1 + ... + Z_n. It is trivial to show that for fixed t, N*(t) is a proper stopping variable for {S_n}, and that S_{N*(t)-1} ≤ t < S_{N*(t)}. Since the Z_i's are non-negative, we may employ a lemma due to Wald (cf. Chow-Robbins-Siegmund, 1971, page 22). We have

    E Σ_{i=1}^{N*(t)} U_i = E N*(t) E U_1,

where {U_i} is any sequence of random variables that is independently adapted to {σ(X_1, ..., X_n)}, not necessarily independent of N*(t). Now, denoting N(t) = N_1(t) + N_2(t), and noting that N(t) = N*(t) - 1 - I(S_{N(t)+1} = t), we get (using I(·) to denote the indicator function)

    E[N_1(t)] = E[Σ_{i=1}^{N(t)} I(X_i < φ)]
              = E Σ_{i=1}^{N*(t)} I(X_i < φ) - E I(X_{N*(t)} < φ) - E I(S_{N(t)+1} = t) I(X_{N(t)+1} < φ)
              = E N*(t) F(φ) - E I(X_{N*(t)} < φ) - P(E),

where E is the event that S_{N(t)+1} = t and X_{N(t)+1} < φ.
Similarly,

    E[N_2(t)] = E[Σ_{i=1}^{N(t)} I(X_i ≥ φ)] = E N*(t) S(φ) - P(X_{N*(t)} ≥ φ) - P(E).

A well-known result from renewal theory (cf. Feller, 1971, page 330) gives lim E N(t)/t = 1/E Z and lim E N*(t)/t = 1/E Z. Thus

    lim E C(t)/t = lim {C_1 E N*(t) F(φ) + C_2 E N*(t) S(φ)}/t
                 = {C_1 F(φ) + C_2 S(φ)}/E Z
                 = {C_1 F(φ) + C_2 S(φ)}/∫_0^φ S(u) du.

Thus, with R_1(·) defined in (0.1), we have

(1)    lim_{t→∞} E C(t)/t = R_1(φ).

Alternatively, readers more familiar with renewal theory may recognize Σ_{i=1}^{N(t)} I(Z_i < φ) as a cumulative process. The proof of (1) then follows immediately from Lemma 4 of Smith (1955).
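The limit (1) can be checked by simulation: by renewal-reward reasoning, the average cost per replacement cycle divided by the average cycle length should approach R_1(φ). The Monte Carlo sketch below is an illustration only; the Weibull shape-2 lifetimes, C_1 = 5, C_2 = 1, and the replacement age φ = 0.6 are all assumptions, not values from the text.

```python
import math, random

# Illustrative Monte Carlo check of lim E C(t)/t = R1(phi): simulate
# replacement cycles and compare cost/time with the closed-form ratio
# {C1 F(phi) + C2 S(phi)} / E min(X, phi).  All parameters are assumed.

random.seed(1)
C1, C2, phi = 5.0, 1.0, 0.6
S = lambda u: math.exp(-u * u)        # assumed Weibull(2) survival

total_cost = total_time = 0.0
for _ in range(200_000):
    x = (-math.log(1.0 - random.random())) ** 0.5   # Weibull(2) draw
    if x < phi:
        total_cost += C1; total_time += x           # failure replacement
    else:
        total_cost += C2; total_time += phi         # planned replacement
simulated = total_cost / total_time

# numerical value of R1(phi) for comparison (trapezoid rule)
h = phi / 4000
integral = sum(0.5 * h * (S(h * i) + S(h * (i + 1))) for i in range(4000))
exact = (C1 * (1.0 - S(phi)) + C2 * S(phi)) / integral
```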
Solving for the optimal replacement interval [0,φ) for given C_1, C_2, and distribution function F is thus reduced to a problem in analysis. Several examples appear in the literature wherein a parametric family is specified and values of φ are calculated for given costs and values of the parameters (see Barlow & Proschan, 1964, 1965). Glasser (1967) gives detailed graphs of costs and parameters for the truncated Normal, Gamma, and Weibull distributions.
Assume now that F is absolutely continuous. To find min R_1(t), set

    0 = (d/dt) R_1(t) |_{t=φ}
    => 0 = (C_1 - C_2) f(φ) ∫_0^φ S(u) du - S(φ)[C_1 F(φ) + C_2 S(φ)].

This gives

(2)    r(φ) ∫_0^φ S(u) du - F(φ) = C_2/(C_1 - C_2),

where f = F' and r(·) = f(·)/S(·) is the failure or hazard rate. To ensure unique solutions to (2), it is common to restrict consideration to lifetime distributions having a strictly increasing failure rate. Further, if the failure rate increases to ∞, then the solution is finite. It was shown by Denardo and Fox (1967) that φ cannot occur where the failure rate is decreasing. If the mean (μ) of F exists and is finite, we see that R_1(∞) = C_1/μ, and thus a measure of asymptotic relative efficiency of the ARP procedure is

    ρ(φ) = R_1(φ)/R_1(∞) = μ(C_1 - C_2) r(φ)/C_1.

A lower bound on the ARP cost R_1(t) is C_2/μ (this would occur if the unit were replaced an instant before it fails). This gives a lower bound for our measure of relative efficiency,

    ρ(φ) ≥ (C_2/μ)/R_1(∞) = C_2/C_1.

For this measure, the lower the value the better the performance of the policy (see Glasser, 1967).
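Once F is specified, equation (2) is easy to solve numerically. The sketch below is hypothetical (Weibull shape-2 lifetimes with hazard r(x) = 2x, so the failure rate is strictly increasing to ∞, and costs C_1 = 5, C_2 = 1; none of these values come from the text): it finds φ by bisection and checks the identity R_1(φ) = (C_1 - C_2) r(φ) implicit in the efficiency measure ρ(φ) above.

```python
import math

# Illustrative numerical solution of the first-order condition (2),
#   r(phi) * integral_0^phi S(u) du - F(phi) = C2/(C1 - C2),
# for an assumed Weibull(shape 2, scale 1) lifetime; C1, C2 assumed.

C1, C2 = 5.0, 1.0
S = lambda x: math.exp(-x * x)
r = lambda x: 2.0 * x                 # hazard rate f/S for this Weibull

def integral_S(t, steps=4000):
    """Trapezoid-rule value of integral_0^t S(u) du."""
    h = t / steps
    return sum(0.5 * h * (S(h * i) + S(h * (i + 1))) for i in range(steps))

def g(t):
    """Left side of (2) minus the right side; root is phi."""
    return r(t) * integral_S(t) - (1.0 - S(t)) - C2 / (C1 - C2)

lo, hi = 1e-6, 5.0                    # g changes sign on this bracket
for _ in range(60):                   # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
phi = 0.5 * (lo + hi)

def R1(t):
    return (C1 * (1.0 - S(t)) + C2 * S(t)) / integral_S(t)
```

At the computed root, R_1(φ) agrees with (C_1 - C_2) r(φ), and φ is a local minimizer of R_1.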
It was shown by Berg (1976) that the ARP is an optimal decision rule against a class of "reasonable" alternative maintenance policies. Thus, we confine our attention to rules of this type. Appendix C gives some variants of the age replacement policy.
§1.2  The classical model with an unknown distribution function

Consider the ARP as described in §1.1, but assume that the experimenter does not know the probability distribution underlying the failure of the units. The objective is still to determine the optimal replacement age, φ, in the most efficient manner possible. Let X_1, ..., X_n be i.i.d. times to failure of the first n units. From these observations, an estimate of φ can be constructed, φ_n = T_n(X_1, ..., X_n). Let Ω be the set of all lifetime distribution functions with finite mean. Without restrictions on Ω there is little hope of determining finite sample properties of φ_n. Thus, we look at asymptotic properties of φ_n to determine the appropriate estimator.
We assume there are sufficient conditions so that φ uniquely minimizes R_1(t) (e.g., the failure rate strictly increasing to ∞). One obvious choice of φ_n is that it be chosen so as to minimize R_n, the sample cost function defined by replacing F with F_n, the empirical distribution function. This estimator has been shown to be strongly consistent (cf. Bergman, 1979). Using a slightly modified estimator, Arunkumar (1972) was able to get rates of convergence for his estimator and establish an asymptotic distribution. His work shall now be briefly outlined, since there turn out to be interesting connections with stochastic approximation's asymptotic theory.
A key item in Arunkumar's procedure was his recognizing that the unmodified order statistics generate an empirical distribution function (EDF) that does not have the necessary asymptotic properties for getting convergence of φ_n to a nondegenerate distribution. By using a modified EDF, Arunkumar was able to establish these rates. As an example of a modified order statistic, let {X_i} be an i.i.d. sample having a uniform density over (0,1) and {X_{i:n}} the corresponding order statistics. It is well known that the grid formed by {X_{i:n}}_{i=1}^n is dense with probability one on (0,1). For many applications, the grid formed by the order statistics is too narrow and we would like to have a "wider" one. Let W_{i:n} = X_{[i n^α]:n}, where α ∈ (0,1) and [·] is the greatest integer function. Again, it is well known that this grid becomes dense with probability one, and, for all η > 0,
    P{n^β (W_{i+1:n} - W_{i:n}) > η} ≤ n^β E{W_{i+1:n} - W_{i:n}}/η
                                     = n^β ([(i+1) n^α] - [i n^α])/((n+1) η)
                                     ≤ n^{α+β}/((n+1) η) = o(1)

for β < 1-α. Thus, in our example, W_{i+1:n} - W_{i:n} = o_p(n^{-β}), uniformly in i.

Now let {X_i} be an i.i.d. sample with an absolutely continuous d.f. F. The failure rate is assumed to be continuous and strictly increasing to ∞.
We establish two grids that are dense with probability one for Arunkumar's procedure. Let {w_{i,n}} be a possibly random grid on [0,∞) such that w_{i+1,n} - w_{i,n} = o_p(n^{-1/3}), and let {v_{i,n}} be a grid on [0,1] such that v_{i+1,n} - v_{i,n} = o_p(n^{-2/3}), both uniformly in i. Picking a point, w*_{j,n}, to be an arbitrary point from [w_{j,n}, w_{j+1,n}), an analog of the EDF (F_n) is defined as

    F*_n(x) = F_n(w*_{j,n})    for w_{j,n} ≤ x < w_{j+1,n}.
Recall that the objective cost function to be minimized is R_1(t), which is given in (0.1). Now, define the transform

    μ_F(y) = ∫_0^{F^{-1}(y)} (1 - F(u)) du

and, if y = F(t), we have

    R_1(t) = R_1(F^{-1}(y)) = {C_1 y + C_2 (1-y)}/μ_F(y) = U_F(y).

To get the estimate of φ, we use the t*_n ∈ {v_{i,n}} that minimizes U_{F*_n}(y) and take φ*_n = F*_n^{-1}(t*_n). Not only does Arunkumar demonstrate the strong consistency of φ*_n, but he also obtains φ*_n - φ = o_p(n^{-1/3}), thus establishing a rate of convergence.
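The change of variables y = F(t) can be illustrated directly. In the sketch below everything concrete is an assumption (Weibull shape-2 lifetimes, C_1 = 5, C_2 = 1, and a fixed y-grid standing in for the random grid {v_{i,n}}): we compute μ_F and U_F, minimize over the grid, and map the minimizer back to the time scale.

```python
import math

# Illustrative computation of the transform for an assumed Weibull(2)
# lifetime:  mu_F(y) = integral_0^{F^{-1}(y)} (1 - F(u)) du and
#            U_F(y)  = (C1*y + C2*(1-y)) / mu_F(y) = R1(F^{-1}(y)).

C1, C2 = 5.0, 1.0
S = lambda x: math.exp(-x * x)
F_inv = lambda y: math.sqrt(-math.log(1.0 - y))   # Weibull(2) quantile

def mu_F(y, steps=2000):
    t = F_inv(y)
    h = t / steps
    return sum(0.5 * h * (S(h * i) + S(h * (i + 1))) for i in range(steps))

def U_F(y):
    return (C1 * y + C2 * (1.0 - y)) / mu_F(y)

grid = [0.01 * i for i in range(1, 99)]    # y-grid inside (0,1)
y_star = min(grid, key=U_F)                # minimize the transform
phi_est = F_inv(y_star)                    # map back to the time scale
```

Because U_F(y) = R_1(F^{-1}(y)), the minimizer on the y-scale maps back to (approximately) the optimal replacement age.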
§1.3  The classical model viewed sequentially

The fixed sample procedure described by Arunkumar gives a method for finding an optimal φ under prescribed assumptions. It can be more desirable to introduce a procedure that may be used on an ongoing basis, constantly updating the estimate φ_n based on the new information from subsequent observations. In deciding whether or not to censor (truncate) an observation, there is a trade-off between minimizing costs and obtaining information about the tails of the distribution. Of course, as one gets more information about the tails, one should be able to construct a better estimator, or at least one as good.
We assume that the experimenter is not willing to make an assumption about the parametric form of the survival distribution, S(x) = 1 - F(x), of the components, and thus we operate in a nonparametric context. From the work of the previous section (Arunkumar), any procedure devised should provide estimates which converge in probability, and preferably almost surely. Second, this convergence should be fast enough so that the optimal asymptotic cost is achieved. Arunkumar's procedure does not use the estimator of φ to truncate the observations, and hence does not attempt to attain an optimal long-run cost. As a third criterion, we choose one procedure over another based on the rate of that convergence and, if the rates are the same, the asymptotic variance.
Bather (1977) introduces a scheme that meets the first two criteria. Suppose we have a sequence of i.i.d. lifetime random variables {X_i} with a finite mean and truncating random variables {φ_i}. The procedure depends on two sequences of constants, {b_n} and {p_n}, where

(3)    1 = p_1 ≥ p_2 ≥ p_3 ≥ ...,    and    Σ_{n=1}^∞ p_n = ∞.

Let {α_n} be an independent sequence of Bernoulli random variables with mean p_n and independent of {X_n}. When α_n = 1, we shall allow the observation to continue uncensored; otherwise we use the censoring time. Because of (3), we allow uncensored observations less and less often as n → ∞, but still infinitely many times. We use the sequence {b_n} to get an approximation η_n of φ_n, yet to be defined, as follows:

    η_n = max{b_m : b_m ≤ φ_n, m ≥ 0}    when α_n = 0,
        = ∞                              when α_n = 1.
Denote the observed outcomes as Z_n = min(X_n, η_n). We use a reduced sample (RS) estimator S**_n(x) to estimate S(x), where

    S**_n(x) = Σ_j I(Z_j ≥ x) / Σ_j I(η_j ≥ x).

Properties of the RS estimator have been studied where the censoring times η_n are i.i.d. random variables (cf. Breslow and Crowley, 1974).

To complete the description of the sequential procedure, define an approximation to the cost function R_1(x) by

    R_n(x) = [C_1 - (C_1 - C_2) S**_n(x)]/μ_n(x),

where μ_n(x) = ∫_0^x S**_n(u) du. It is assumed that φ uniquely minimizes R_1(x). The estimate φ_{n+1} is determined by minimizing R_n(x) with respect to x.
Under the above assumptions, the following results have been shown by Bather:

(a)    S**_n(x) → S(x)    a.s.;
(b)    sup_x [μ_n(x) - ∫_0^x S(u) du] → 0    a.s.;
(c)    φ_n → φ    a.s.;
(d)    the cost per unit time realized by the procedure converges to R_1(φ) a.s., i.e., the optimal asymptotic cost is attained.
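A hypothetical sketch of the estimation step: compute the reduced-sample estimate S**_n defined above and minimize the plug-in cost R_n over a grid of candidate ages. For brevity the truncation level is frozen at 1.0 (rather than updated to φ_n at each step, as in Bather's scheme), lifetimes are Weibull shape-2, C_1 = 5, C_2 = 1, and p_j = j^{-1/2}; all of these concrete choices are illustrative assumptions, not values from the text.

```python
import bisect, math, random

# Illustrative single pass of a Bather-style estimation step.  alpha_j
# is Bernoulli with assumed rate p_j = j^(-1/2), so the censoring time
# eta_j is infinity (uncensored) or the frozen truncation level 1.0.

random.seed(2)
C1, C2, n = 5.0, 1.0, 4000

X = [(-math.log(1.0 - random.random())) ** 0.5 for _ in range(n)]
eta = [math.inf if random.random() < (j + 1) ** -0.5 else 1.0
       for j in range(n)]
Z = sorted(min(x, e) for x, e in zip(X, eta))   # observed outcomes
eta_sorted = sorted(eta)

def S_rs(x):
    """Reduced-sample estimate S**_n(x) = #{Z_j >= x} / #{eta_j >= x}."""
    num = n - bisect.bisect_left(Z, x)
    den = n - bisect.bisect_left(eta_sorted, x)
    return num / den if den else 0.0

def R_n(x, steps=400):
    """Plug-in cost [C1 - (C1-C2) S**_n(x)] / mu_n(x), trapezoid rule."""
    h = x / steps
    mu = sum(0.5 * h * (S_rs(h * i) + S_rs(h * (i + 1))) for i in range(steps))
    return (C1 - (C1 - C2) * S_rs(x)) / mu

grid = [0.2 + 0.02 * i for i in range(41)]   # candidate replacement ages
phi_next = min(grid, key=R_n)                # the next estimate phi_{n+1}
```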
Remarks concerning the Bather scheme:

(a) The introduction of {b_n} is a technical device needed to get the uniform a.s. convergence of μ_n(x), which, in turn, is essential for result (c).

(b) The use of the RS estimator S**_n(x) for the survival distribution does not use the information concerning observations censored by time x. There are other estimators of the survival distribution (e.g., the Kaplan-Meier product limit estimator, piecewise exponential, and the cumulative hazard; see Miller, 1981) which use more of the information and thus are thought to be better. When the censoring times are i.i.d. random variables independent of the lifetimes, this can be shown by comparing asymptotic variances.
(c) The use of the {α_n} sequence of random variables to ensure the occasional untruncated observation is only one possible technique. Equivalently, the experimenter could specify a sequence of integers {r_k}, where r_1 < r_2 < r_3 < ..., such that r_k → ∞ as k → ∞, and when n ∈ {r_k} the observation is uncensored. Requiring that lim_{k→∞} r_k/k = ∞ is analogous to p_n → 0. Censoring with Bernoulli random variables was adopted by Bather for mathematical convenience.

The important idea behind the paper of Bather is the introduction of a sequential scheme that an experimenter may employ to achieve the optimal asymptotic cost with only minimal knowledge of the d.f. In the theorem below, we give some simple conditions so that any sequential methodology meeting these assumptions will achieve the optimal asymptotic cost (see (0.2)). This opens the door for competing sequential methodologies. We then advocate stochastic approximation in the following chapters as a superior sequential scheme.
Theorem 1.1

Let {X_n} be a sequence of i.i.d. r.v.'s with d.f. F, having finite variance σ². Define F_n = σ(X_1, ..., X_n). Suppose there exists a sequence of r.v.'s {φ_i} such that φ_n is F_{n-1}-measurable and φ_n → φ a.s., φ a continuity point of F. The sample cost of the first n items is

    C_n = Σ_{j=1}^n {C_1 I(X_j < φ_j) + C_2 I(X_j ≥ φ_j)}.

Let N(t) be the number of replacements by time t. Then,

    C_{N(t)}/t → R_1(φ)    a.s.
Proof:

Define U_n = Σ_{j=1}^n {I(X_j < φ_j) - F(φ_j)}. By construction, {U_n, F_n} is a martingale and |U_{n+1} - U_n| ≤ 1. Thus Σ_n E_{F_n}(U_{n+1} - U_n)² n^{-2} < ∞, and by Theorem 5 of Chow (1965; see also Stout, 1974, page 137), we have

(4)    U_n/n → 0    a.s.

Since φ_j → φ a.s., we have n^{-1} Σ F(φ_j) → F(φ) a.s. By (4) and the continuity of F at φ, we have

(5)    n^{-1} Σ_{j=1}^n I(X_j < φ_j) → F(φ)    a.s.

Similarly, define V_n = Σ_{j=1}^n {min(X_j, φ_j) - ∫_0^{φ_j} S(u) du}. Thus {V_n, F_n} is a martingale and |V_{n+1} - V_n| ≤ X_{n+1} + μ, where E X_i = ∫_0^∞ S(u) du = μ. Thus,

    Σ_n E_{F_n}(V_{n+1} - V_n)² n^{-2} ≤ Σ_n E(X_n + μ)² n^{-2} = (σ² + 4μ²) Σ_n n^{-2} < ∞,

giving

(6)    V_n/n → 0    a.s.

Now, by continuity of indefinite integrals, lim ∫_0^{φ_n} S(u) du = ∫_0^φ S(u) du a.s., and thus n^{-1} Σ_j ∫_0^{φ_j} S(u) du → ∫_0^φ S(u) du a.s. By (6), we have

(7)    n^{-1} Σ_{j=1}^n min(X_j, φ_j) → ∫_0^φ S(u) du    a.s.

Now, let L_n = Σ_{j=1}^n min(X_j, φ_j) = sample time of the first n units. By construction, L_{N(t)} ≤ t < L_{N(t)+1}, giving

(8)    [C_{N(t)}/N(t)] [N(t)/(N(t)+1)] / [L_{N(t)+1}/(N(t)+1)] ≤ C_{N(t)}/t ≤ [C_{N(t)}/N(t)] / [L_{N(t)}/N(t)].

Now, the right hand side of (8) converges to R_1(φ) a.s. as t → ∞, by (5) and (7), since C_{N(t)}/N(t) → C_1 F(φ) + C_2 S(φ) and L_{N(t)}/N(t) → ∫_0^φ S(u) du a.s. Similarly, since N(t)/(N(t)+1) → 1, the left hand side of (8) goes to R_1(φ) a.s., and we have the result. □
CHAPTER 2

STOCHASTIC APPROXIMATION AND AN APPLICATION

§2.1  Introduction
This chapter introduces stochastic approximation (SA) to
the reader and shows how it may be used in the age
replacement policy problem.
We begin, in §2.2, by attempting to motivate SA as a sequential estimation scheme.
As a statistical tool, we then
outline situations when it is appropriate to use SA.
Given
its use, we discuss broad issues that are important for its
efficient implementation.
The following section moves from the broad overview to a
more detailed discussion of SA by giving a short historical
development.
We begin with a description of the important
early cases, the Robbins-Monro (R-M) and Kiefer-Wolfowitz (K-W) processes, and reasons for their importance.
More or less
chronologically, we then give highlights of the properties of
these cases.
Searching for optimal properties, various
modifications (such as transformations of the observations
and adaptive procedures) and generalizations (such as more general algorithms) of the special cases are described.
Also
in this development, we mention some applications of SA, both
to standard statistical problems and real-world situations.
Remarks on stopping rules proposed are then made.
Finally,
we mention other areas of interest in SA not reviewed here.
In §2.4 we develop a simple procedure for constructing
estimates of the optimal replacement time via SA.
All
assumptions and resulting properties are stated here.
Further, we give a simple technique of transforming the
observations to avoid constrained SA.
In the following section we give the details of the
proofs of the properties stated in §2.4.
The useful theorems
of Robbins-Siegmund (1971) and (a univariate version) of Fabian (1968ii) are stated.
Further, many intermediate
lemmas are proved in forms slightly more general than
necessary for Chapter 2, as they will be immediately
applicable without revision for the proofs of Chapter 3
results.
§2.2.
Stochastic approximation-motivation
Suppose that you, as a weapons advisor to a smaller,
less technologically developed nation, have just purchased a
large number of sophisticated bombs from a more advanced
society.
One of the pieces of information that the sellers
neglected to give you (perhaps intentionally) was at what
height a bomb should be dropped so that it will explode, say,
99 percent of the time.
You know from your experience with
other bombs that a sound bomb (not a dud) will explode
sometimes at a certain height but not always.
Factors such
as prevailing winds, rain, even whether one is bombing an
area containing forests versus a mountainous region come into
You would like to find this 99 th percentile point of
play.
heights for purposes such as fuel economy and radar evasion.
Keeping in mind the rather expensive nature of each bomb, you
would like to have a scientific testing scheme to determine
the point.
The above is a rather macabre example of how stochastic
approximation might be utilized in a real-world setting (but
there may be very little "practical" value in dropping
bombs).
Generally, there is an unknown function of the real
line, R(x), and the objective is to get information about a
single value.
Typically, we wish to find θ such that R(θ) = α, where α is fixed, or to find θ such that R(θ) = sup_x R(x). This would be a standard problem in analysis except that it is presumed that there are errors in the measurement of R. In fact, we let

    Y_k = R(X_k) + ε_k,

where Y_k is the outcome of the experiment measured at X_k and {ε_i} is a sequence of independent "error" (mean zero) random variables. The function R can be thought of as the regression of Y on X, i.e., E(Y|X). In the above example, we desired a procedure to find θ such that R(θ) = .99. Each Y_k was the outcome of dropping the bomb at height X_k (in this very simple example, either 0 or 1).
Stochastic approximation (SA) is an iterative procedure which determines a sequence of values that converges in some sense to the desired point θ. Thus, we use X_{n+1} = X_n + G_n(Y_n), where G_n is a known function of the observations at each stage. It can be thought of as a stochastic analog of well-known iterative procedures in analysis (e.g., the Newton-Raphson method). Indeed, much of the historical impetus for developing stochastic approximation comes from the desire to have iterative procedures where there are errors in the measurements.
As an estimation
technique, SA procedures have three main advantages:
(a)
The procedure is nonparametric in form, i.e., no
assumption concerning the distribution of the errors need be
made.
(b)
Experimental effort is not wasted in attempting to
estimate the entire function R.
We are merely interested in
one special value it may take on.
(c)
The procedure makes no assumption that the regression
is linear in form, or indeed, of any other functional form.
The procedure should be useful in difficult nonlinear
regression problems.
(Thus, SA procedures are nonparametric
in a second sense as well as (a) above.)
Given an initial starting value X_1, the function of the outcomes of the experiment G_n, and the outcomes Y_n, we have identified {X_n}, a sequence of random variables. Note that at each stage of computing X_{n+1} = X_n + G_n(Y_n), there is no reason to take only one observation. After specifying the problem by making assumptions about the form of G_n and the errors involved, there are four main areas of problems to be attacked in stochastic approximation:
(a) The question of the convergence of the sequence to θ, and the nature and rate of convergence.
(b)
The asymptotic distribution of the sequence.
(c)
The choice of the transformation Gn giving the smallest
asymptotic variance.
(d)
The optimal stopping time of the sequence.
For the earliest procedures proposed, Robbins-Monro (1951) and Kiefer-Wolfowitz (1952), the first three questions have largely been answered. The fourth is more open, although Sielken (1973), McLeish (1976), and Stroup and Braun (1982) have provided some useful results.
§2.3  Stochastic approximation-short historical development

The first use of stochastic approximation appeared in 1951 by Robbins and Monro (R-M). They were interested in finding a unique value θ such that R(θ) = α (without loss of generality, we may take α = 0), R being a function that could be measured with error. They suggested the iterative procedure X_{n+1} = X_n - a_n Y_n, where {a_n} is a sequence of positive decreasing constants, and Y_n is a conditionally (given X_1, ..., X_n) unbiased estimator of R(X_n). Providing certain conditions on {a_n} and R hold, they showed that X_n is a weakly consistent estimator of θ (i.e., X_n - θ = o_p(1)).
The following year, 1952, Kiefer and Wolfowitz (K-W) showed how these techniques could be applied to finding a maximum (assumed unique) of R(x). Their suggested iterative procedure is X_{n+1} = X_n + (a_n/c_n)(Y_{2n} - Y_{2n-1}), where {a_n} and {c_n} are sequences of positive decreasing constants, and Y_{2n} and Y_{2n-1} are conditionally unbiased estimators of R(X_n + c_n) and R(X_n - c_n), respectively. Again, providing that certain conditions on {a_n}, {c_n} and R hold, K-W showed X_n - θ = o_p(1).
The K-W result is important because it demonstrates how SA as
introduced by R-M may be extended to find the zeroes of any
derivative of a function (see Burkholder, 1956).
Having established these two basic procedures, various
generalizations and development of the convergence properties
of {X_n} have been investigated over the years.

Blum (1954i) established the strong consistency of {X_n} for both the R-M and K-W procedures using martingale techniques. Later that year, Blum (1954ii) demonstrated how SA may be applied in a multivariate context. Asymptotic normality of the estimates X_n, when suitably standardized, was first established by Chung (1954). He employed a method of moments technique
of moments technique and concurrently established orders of
magnitude for the moments.
Sacks (1958) showed asymptotic
normality via characteristic functions and under weaker
conditions.
Fabian (1968ii) proved a very useful result for
demonstrating asymptotic normality, stated in §2.5.
Proofs of consistency and other properties of the
estimators have been made more elegant and under weaker
conditions as researchers have made greater use of
probabilistic tools, especially martingale theory.
Heyde (1974) emphasized this point and easily obtained a law of the iterated logarithm for the R-M and K-W procedures.
Robbins
and Siegmund (1971) proved a very useful inequality which
will be stated and exploited later.
Kersting (1977ii) and
Ruppert (1982) gave representations of the estimators as
weighted averages of the errors for the R-M and K-W
procedures, respectively.
Whereas Kersting considered only
the one dimensional procedure with independent errors,
Ruppert allowed for a multivariate procedure with dependent
errors.
These two papers subsume much of the earlier work on
asymptotic normality and laws of iterated logarithm.
Attempts have been made to alter the R-M and K-W
processes to improve the quality of results.
Transformation of the observations was first suggested in 1971.  For the R-M case, i.e., where we wish to find θ such that R(θ) = α, assume Y_n − R(X_n) has a known symmetric density g about zero.  Then the observations can be transformed (via g'(·)/g(·)), so that the resulting estimator is asymptotically normal with mean 0 and variance equal to the Cramér-Rao lower bound.
This result was
proved independently by Anbar (1973) and Abdelhamid (1973).
The K-W analog was also established by Abdelhamid.
Fabian
(1973) and Obremski (1976) gave modifications for an unknown
g for the R-M and K-W cases, respectively.
Another modification allows m observations to be taken at each stage of the procedure, where m is a fixed positive integer.  Using m > 1 is advantageous only for the K-W procedure.  Questions of design enter and have been partially answered by Fabian (1968i).  The choice of how large m should be is unanswered.
Venter (1967) proposed using random variables in lieu of the {a_n} sequence of the R-M procedure.  This modification shall be termed an adaptive procedure, as it adapts the values of a_n closer to the optimal ones as the procedure progresses.  His motivation was to provide a practical modification which achieves the best asymptotic variance.  Conditions imposed by Venter were subsequently weakened by Fabian (1968ii).  Later, Fabian (1971) gave an adaptive procedure for the multivariate K-W situation to achieve the best asymptotic mean square error for the choice of the {a_n} sequence.
All of these procedures require taking extra
observations at various finite stages.
Heuristically, this
is considered undesirable since the information in these
observations is not used to speed the process towards
convergence.
While asymptotically negligible, the effect of
the observations could be important in finite samples.
Anbar
(1978) gave a procedure for the univariate R-M process which
uses only one observation at a time.
A least-squares technique is employed to recursively estimate the slope.  The conditions imposed on the slope of the regression function are, however, rather restrictive.
These notions were formalized by Lai and Robbins in a
series of papers.
Calling Σ_{i=1}^n (X_i − θ)² the cost of the process at the nth stage, an adaptive univariate R-M procedure was given which minimizes this cost.
It was
demonstrated how the restrictive conditions imposed by Anbar
can be removed.
Lai and Robbins (1978, 1979 and 1981) gave a
general theory for the univariate R-M process which subsumes
much of the earlier work on this process.
In particular,
using techniques (involving functions of slow growth)
borrowed from Gaposhkin and Krasulina (1974), they gave a
generalized version of the almost sure representation of
Kersting (1977ii).
This representation yields (after
verifying some rather messy technical details) several
important large sample results, such as central limit
theorems and laws of iterated logarithm.
Aside from potential application to real-world problems,
stochastic approximation has been useful as a new method
being applied to standard statistical problems.
Several
authors have investigated SA applied to non-linear regression
problems which began with the work of Albert and Gardner
(1967)
(cf., Has'minskii, 1977, Anbar, 1976i).
Dupac (1977)
and Lai and Robbins (1981) use SA in linear regression models
having errors in the variables.
Sakrison (1965) initiated
research towards using SA to get asymptotically efficient
recursive estimators.
Assume that the density of the errors
is known up to a finite number of parameters that we are
trying to estimate.
We then have available an unbiased
estimator of the gradient of the Kullback-Leibler information
number.
Based on this observation, a sequence of estimators
can be constructed that converges to the point of minimum for
the Kullback-Leibler information.
This sequence is not only
consistent, but is efficient in the sense that when
standardized it is asymptotically normally distributed with
the smallest possible variance, the Cramer-Rao bound.
Further work to weaken the conditions of Sakrison has been
done by Has'minskii (1975) and Fabian (1978).
In other applications, Fritz (1973) used stochastic
approximation to find the mode of a multivariate density.
Dupac (1977) noted several interesting applications, such as finding the maximal eigenvalue of a symmetric matrix that can only be observed with error, and finding the unique solution of a linear system of equations that can only be observed with error (i.e., find x_0 such that Ax_0 = b, where Ax is observed with error).
Nevelson (1975) applied stochastic approximation to M-estimation (minimization) problems, where we wish to find t_0 as a solution to the equation ∫ ψ(x,t) dF(x) = 0, where ψ is non-decreasing.  Almost sure convergence and asymptotic normality of the recursive procedure is shown.
An estimate of the derivative is used to
get the same asymptotic variance as the standard M-estimate,
and hence is adaptive.
The procedure requires truncation
above and below of the estimate of the function and of the
derivative.
Some sufficient conditions are given to remove
the truncation of the estimate of the function.
Conditions
imposed by Nevelson requiring i.i.d. observations were
weakened by Holst (1980, 1982).
At times, mathematical techniques developed for use in SA schemes have been useful in investigating the properties of other recursive estimators.  Revesz (1977) showed how a recursive estimator for a regression function has desirable asymptotic properties using techniques from the SA literature.
More specifically, let {X_i, Y_i} be a sequence of i.i.d. random variables where we wish to estimate E(Y | X = x), equal to, say, r(x).  Let A be some positive constant, a_n = A n^{−α} where 0 < α < 1, and let K be a suitable kernel function.  For r_0(x) = 0, let

r_{n+1}(x) = r_n(x) + K((x − X_n)/a_n)(Y_n − r_n(x))/(n a_n)

define the estimator of r(x).  Revesz demonstrated the strong consistency and asymptotic normality (when standardized) of these estimators.
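The recursion above translates directly into code.  All specifics in the sketch are assumed for illustration: the Epanechnikov kernel, the gain a_n = n^{−1/4}, and the model X ~ U(0,1), Y = X² + noise, for which r(0.5) = 0.25.

```python
import random

def revesz_estimate(data, x, A=1.0, alpha=0.25):
    """Revesz-style recursive regression estimate of r(x) = E(Y | X = x):
    r_0(x) = 0 and r_{n+1}(x) = r_n(x) + K((x - X_n)/a_n)(Y_n - r_n(x))/(n a_n),
    with a_n = A * n**(-alpha), built in one pass over the data."""
    kernel = lambda u: 0.75 * (1 - u * u) if abs(u) < 1 else 0.0  # Epanechnikov
    r = 0.0
    for n, (xn, yn) in enumerate(data, start=1):
        an = A * n ** (-alpha)
        r += kernel((x - xn) / an) * (yn - r) / (n * an)
    return r

# Assumed illustrative model: X ~ U(0,1), Y = X^2 + N(0, 0.05) noise, so r(0.5) = 0.25.
random.seed(2)
data = [(x, x * x + random.gauss(0, 0.05))
        for x in (random.random() for _ in range(50000))]
r_hat = revesz_estimate(data, 0.5)
```

Unlike a batch kernel smoother, each pair (X_n, Y_n) is used once and discarded, which is what makes the estimator recursive.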
Using different SA-type techniques, Isogai
(1980) defines a similar recursive density estimator and
demonstrates its asymptotic properties.
Examples in the literature of SA actually being employed are rather limited.  There are only two cases known to us.  The more recent is an application to the harvesting of Atlantic Menhaden (Ruppert et al., 1982).  Earlier, Janac (1971) used SA in a Monte-Carlo simulation to find optimum parameters of an automobile suspension system.  Fabian (1971) mentions some unpublished applications of SA schemes in chemical research.  Textbook examples are more abundant.  For example, several examples are provided in the monographs of Wasan (1969), Albert and Gardner (1967), and Wetherill (1966).
Generally, SA is applied to questions that are framed in
such a general manner so that there is no real hope of
adequate finite sample results.
Naturally, asymptotic theory
enters the picture and a question the practical statistician
asks is how many observations (or at what stage) are
necessary to guarantee that the large sample results are
approximately correct.
For a univariate R-M procedure, Sielken (1973) gave a stopping time N_{d,α} for a fixed-width confidence interval that is asymptotically consistent in the usual stopping-time sense, i.e.,

lim_{d→0} P(|X_{N_{d,α}} − θ| ≤ d) = 1 − α.
More generally, for a
univariate R-M process with martingale difference errors,
McLeish (1976) provided a functional central limit theorem
using weak convergence arguments.
An immediate corollary is
a useful stopping rule for the process.
We note here that
Dvoretsky (1956) gave a finite sample result, and we give
some details of those results in §6.2.
Many advances in the theory and application of SA have
not been reviewed at this time.
The literature is vast.
Some authors, e.g., Dvoretsky (1956), Ljung (1978), Kushner
and Clark (1978) take a broader view of stochastic
approximation and liken it to a stochastic version of
recursive numerical analysis.
They prefer to analyze more
general algorithms (of which R-M and K-W are special cases)
which may be viewed as the sum of a deterministic convergent
process and a driving stochastic component.
From this
viewpoint, their algorithms often exhibit the behavior of
solutions to stochastic differential equations.
Other areas
of interest arise naturally from this viewpoint, e.g.,
stochastic approximation algorithms where the solutions are
constrained to lie in some specified set (see Kushner and
Clark, 1978).
Another generalization which some researchers
have addressed is the problem where the function itself
changes with time.
This was first formulated by Dupac (1965,
1966), who termed this problem dynamic SA, and has since been
addressed by Uosaki (1974) and Ruppert (1979).
§2.4
Stochastic approximation applied to the ARP
We now develop a sequential procedure to estimate the
optimal replacement time in a cost efficient manner (see
Theorem 1.1).
In particular, we use SA as an estimation technique to find φ, the (assumed) unique, finite minimum of the cost function R_1(·) (see (0.1)).
All assumptions are
explicitly stated in this section as well as the properties
of the resulting estimators.
The proofs of these properties
are in §2.5.
Define

M(t) = (C_1 − C_2) f(t) ∫_0^t S(u) du − S(t){C_1 F(t) + C_2 S(t)}.

Now ∂/∂t R_1(t) = K_t M(t), where K_t is positive, and thus, by assuming that R_1(t) is uniquely minimized at some finite point φ, we have that M(t){t − φ} > 0 for each t ≠ φ.  Thus, instead of looking for the minimum of R_1(t), we wish to find the zero of M(t).  If there existed an unbiased estimator for M(t), we could use this estimator in conjunction with the R-M procedure and straightforwardly get "nice" estimates of φ.  Unfortunately, such is not the case, and we must deal with the bias in the Kiefer-Wolfowitz fashion.  In this section, observations shall be taken only in pairs and we introduce the simplest practicable estimator.
Suppose φ_1 is a nonnegative random variable such that E φ_1² < ∞.  Let {X_in}_{n=1}^∞, i = 1, 2, be i.i.d. sequences of r.v.'s with distribution function F, where {X_1n} is independent of {X_2n}.  The procedure will use two sequences of nonnegative real constants, {a_n} and {c_n}.  We define

Z_in = min(X_in, φ_n + c_n),   i = 1, 2,   n = 1, 2, ...,

and let the succeeding estimates of the optimal censoring time, φ_n, be defined by the recursive formula

φ_{n+1} = φ_n − a_n M_n(φ_n),

where M_n(t) is our estimate of M(t), yet to be defined.
n
Before discussing our choice of M and the properties of
n .
the resulting estimator, it is convenient at this time to
deal with the possibility of the procedure forcing ~ <0 for
n
some finite n.
If this is the case, then Mn (~ n ) is not
properly defined.
This can be handled in two ways.
We define the truncation operation

[x]_a^b = a if x < a;   x if a ≤ x ≤ b;   b if x > b,

applied to each step of the recursive formula.  For our case, we choose b = ∞ and a to be some small positive number smaller than φ.  One way to choose a is to presume there exists a known φ_0 such that φ_0 ≤ φ.  Then, if a = C_2 φ_0/C_1 and t ≤ a,

R_1(t) ≥ C_2/t ≥ C_1/φ_0 ≥ C_1/φ ≥ R_1(φ),

so truncation at a cannot exclude the minimizer.
For an application of the approach, see Albert and Gardner
(1967, page 9).
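The truncation is just a clamp applied after every update; a minimal sketch (the particular bounds are hypothetical):

```python
def truncate(x, a, b):
    """The truncation operation [x]_a^b: clamp x to the interval [a, b]."""
    return a if x < a else b if x > b else x

# In the ARP recursion the update would be applied as
#   phi_next = truncate(phi - a_n * M_n(phi), a, float("inf")),
# with b = infinity and a a small positive number below the optimum.
```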
Another method of dealing with this issue is to introduce a known, strictly increasing function g: R → [0, ∞).  We shall assume the first s+1 derivatives of g exist everywhere and are bounded, where s is a fixed positive integer (e.g., g(t) = log(1 + e^t)).  With the assumptions on g, we now find that the value t (say φ*) minimizing R_1(g(t)) gives φ = g(φ*).  Since

sign{∂/∂t R_1(g(t))} = sign{g'(t) M(g(t))},

we may use as our procedure

(1)   φ*_{n+1} = φ*_n − a_n M_{g,n}(φ*_n),

where M_{g,n} is an estimator of g'(t) M(g(t)), yet to be defined.  In many practical situations, φ will be sufficiently far from 0 (relative to the sequences {a_n} and {c_n}), so there is little chance that negative values of the approximations will be obtained, and thus we may use g(t) = t.  We drop the star superscript of φ*, and note that, as a final step in our iteration, we take g(φ_n) as an estimate of the true optimal replacement time.
As the final step in specifying the SA procedure, we define an estimator of g'(t) M(g(t)), called M_{g,n}(t).  For i = 1, 2, let Z_in = min(X_in, g(φ_n + c_n)), F_in(t) = I{Z_in ≤ t}, S_in(t) = 1 − F_in(t), and fg_in(t) = I{g(t − c_n) ≤ Z_in < g(t + c_n)}/(2c_n), and define

(2)   M_{g,n}(t) = (C_1 − C_2) fg_1n(t) ∫_0^{g(t)} S_2n(u) du − g'(t) S_1n(g(t)){C_1 F_2n(g(t)) + C_2 S_2n(g(t))}.

We will be able to calculate the conditional expectation of M_{g,n} and quantify the resulting biases in a meaningful way.  We use (F*g)^(r)(x) = ∂^r/∂t^r F(g(t))|_{t=x}.  For convenience, a list of the most important assumptions is provided below:
A0.  The distribution function F of the i.i.d. observations is absolutely continuous having density f, with F(0) = 0 and ∫ t² dF(t) < ∞.

A1.  g is a known, strictly increasing, continuous function such that g: R → [0, ∞), and the first s+1 derivatives exist and are bounded.

A2.  For each x ≠ φ, (x − φ) M(g(x)) > 0.

A3.  Σ a_n = ∞, Σ a_n c_n < ∞, Σ a_n²/c_n < ∞, and lim c_n = 0.

A4.  (F*g)^(2)(x) exists for each x ∈ R, and there exist constants A, B > 0 such that |(F*g)^(2)(x)| ≤ A + B g(x) for all x.

A4'. (F*g)^(1)(x) and (F*g)^(2)(x) exist for each x ∈ R and are bounded.

A5.  (F*g)^(3)(x) exists for each x ∈ R and is bounded.

A5'. (F*g)^(1)(x) and (F*g)^(3)(x) exist for each x ∈ R and are bounded.

A6.  There exists p > 2 such that ∫ t^p dF(t) < ∞.  Let q be defined by 2/p + 1/q = 1.

A7.  For some A, C > 0 and γ ∈ (0,1) such that 1 − γ < 2Γ, we have a_n = A n^{-1} and c_n = C n^{-γ}, where Γ = A (g'(φ))² M'(g(φ)).

A0 is always assumed to hold true.  We note here that the above assumptions are not independent.  A3 is a weaker condition than A7.  If A4' (A5') is true, A4 (A5) is satisfied.  We use the weaker assumptions initially, and replace these with stronger assumptions as we strive for better results.  We remark here that a typical SA assumption, stronger than A2,

inf{|M(g(x))| : δ < |x − φ| < δ^{-1}} > 0 for every δ > 0,

is not needed here due to the assumed continuity of M(·) and g(·).

The asymptotic variance of the φ_n will be proportional to Σ, where

Σ = (C_1 − C_2)² (F*g)^(1)(φ) ∫_0^{g(φ)} u S(u) du.

We now state some asymptotic properties of our procedure.
Theorem 2.1
Assume A1-A3 and A4 or A5.  Then, for the procedure defined in (1) and (2), φ_n → φ a.s.

Theorem 2.2
Assume A1, A2, A6, A7, and either
(a) A4' holds and 1/3 < γ ≤ 1/q, or
(b) A5' holds and 1/5 < γ ≤ 1/q.
Then,

n^{(1−γ)/2}(φ_n − φ) →_D N(0, A² C^{-1} Σ/(2Γ − 1 + γ)).

(c) Further, assume A5' holds, γ = 1/5, and (F*g)^(3)(t) exists and is continuous in a neighborhood of φ.  Then,

n^{2/5}(φ_n − φ) →_D N(2T/(2Γ − 4/5), A² C^{-1} Σ/(2Γ − 4/5)),

where T = A C² (C_1 − C_2)(∫_0^{g(φ)} S(u) du)(F*g)^(3)(φ)/6.
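The procedure (1)-(2) with g(t) = t can be sketched as a simulation.  All specifics below are assumed for illustration and are not part of the theory: Weibull(shape 2) failure times, C_1 = 5, C_2 = 1, gains a_n = n^{-1}, c_n = n^{-1/4}, and a lower truncation point.  The update itself follows the definitions of Z_in, fg_1n, S_1n and F_2n above.

```python
import random

def sa_arp(sample_failure, c1, c2, phi1, n_steps, A=1.0, C=1.0, gamma=0.25, lo=0.05):
    """SA estimate of the optimal age-replacement time, procedure (1)-(2) with g(t)=t.
    c1 = cost of a failure replacement, c2 < c1 = cost of a planned replacement."""
    phi = phi1
    for n in range(1, n_steps + 1):
        an, cn = A / n, C / n ** gamma
        x1, x2 = sample_failure(), sample_failure()      # two independent observations
        z2 = min(x2, phi + cn)
        fg1 = (1.0 if phi - cn <= x1 < phi + cn else 0.0) / (2 * cn)  # histogram estimate
        s1 = 1.0 if x1 > phi else 0.0                    # S_1n(phi)
        f2 = 1.0 if z2 <= phi else 0.0                   # F_2n(phi)
        m = (c1 - c2) * fg1 * min(z2, phi) - s1 * (c1 * f2 + c2 * (1 - f2))
        phi = max(lo, phi - an * m)                      # truncation [x]_a^inf keeps phi > 0
    return phi

# Assumed example: Weibull(scale 1, shape 2) failures, c1 = 5, c2 = 1.  The optimum
# solves M(t) = 0, here roughly t = 0.51 (found numerically for this distribution).
random.seed(3)
phi_hat = sa_arp(lambda: random.weibullvariate(1.0, 2.0),
                 c1=5.0, c2=1.0, phi1=1.0, n_steps=200000)
```

Note that min(z2, phi) plays the role of ∫_0^{φ_n} S_2n(u) du, since S_2n is the indicator that the (censored) second observation exceeds u.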
§2.5
Proofs

We shall be working with martingale techniques.  Thus, let F_n = σ(φ_1, Z_ij, i = 1, 2, j = 1, ..., n−1), and write E_Fn for conditional expectation given F_n.  Define U_n = φ_n − φ.  The bias term is represented by

Δ_n = E_Fn{M_{g,n}(φ_n) − M(g(φ_n)) g'(φ_n)}

and the important part of the variance term by Σ_n = E_Fn M_{g,n}²(φ_n).  We work with only parts of M_{g,n}(t) at a time, thus define

M_1n(g(t)) = (C_1 − C_2){fg_1n(t) ∫_0^{g(t)} S_2n(u) du − g'(t) f(g(t)) ∫_0^{g(t)} S(u) du}

M_2n(g(t)) = g'(t){S(g(t))[C_1 F(g(t)) + C_2 S(g(t))] − S_1n(g(t))[C_1 F_2n(g(t)) + C_2 S_2n(g(t))]}.

This gives

(3)   M_{g,n}(t) − g'(t) M(g(t)) = M_1n(g(t)) + M_2n(g(t)).

A finite sample estimate of Σ is V_n², where V_n = c_n^{1/2}(M_{g,n}(φ_n) − g'(φ_n) M(g(φ_n)) − Δ_n).  Let μ = ∫_0^∞ t dF(t).  All relationships between random variables are meant to hold with probability one, unless specified otherwise.  Finally, K_1, K_2, etc. will denote the appropriate constants to be used in our inequalities.

Many of the lemmas will be given in a form more general than immediately required.  The proof of Theorem 2.1 depends on a result due to Robbins and Siegmund (1971), given here for convenience.
Theorem (Robbins-Siegmund)
Let F_n be a nondecreasing sequence of sub σ-fields of F.  Suppose that X_n, β_n, ξ_n and ζ_n are nonnegative F_n-measurable random variables such that

E_Fn X_{n+1} ≤ X_n(1 + β_n) + ξ_n − ζ_n,   for n = 1, 2, ....

Then lim_{n→∞} X_n exists and Σ ζ_n < ∞ on {Σ β_n < ∞, Σ ξ_n < ∞}.
Lemma 2.1
Assume A1 and A2 hold.  Then there exist nonnegative constants K_1 and K_2 such that
(a) if A4 holds, then |Δ_n| ≤ c_n[K_1 + K_2|φ_n − φ|];
(b) if A5 holds, then |Δ_n| ≤ c_n² K_1.

Proof:
By the definition of Δ_n, we have

(4)   Δ_n = E_Fn{M_{g,n}(φ_n) − g'(φ_n) M(g(φ_n))}
          = (C_1 − C_2) ∫_0^{g(φ_n)} S(u) du · E_Fn{fg_1n(φ_n) − g'(φ_n) f(g(φ_n))}.

Since ∫_0^t S(u) du ≤ μ for all t, we have

(5)   |Δ_n| ≤ K_3 |[F(g(φ_n + c_n)) − F(g(φ_n − c_n))]/(2c_n) − (F*g)^(1)(φ_n)|.

If A4 is true, then

[F(g(φ_n + c_n)) − F(g(φ_n − c_n))]/(2c_n) = (F*g)^(1)(φ_n) + (c_n/4)[(F*g)^(2)(θ_1) − (F*g)^(2)(θ_2)],

where |θ_i − φ_n| ≤ c_n, i = 1, 2.  Hence,

(6)   |Δ_n| ≤ (K_3 c_n/4)|(F*g)^(2)(θ_1) − (F*g)^(2)(θ_2)|.

Now, |(F*g)^(2)(x)| ≤ K_4 + K_5 g(x) and, since g' is bounded, g(x) ≤ g(φ) + K_6|x − φ|, giving

(7)   |(F*g)^(2)(x)| ≤ K_0 + K_1|x − φ|.

Since |θ_i − φ| ≤ |φ_n − φ| + c_n, i = 1, 2, we have

|Δ_n| ≤ (K_3 c_n/4) · 2(K_0 + K_1(|φ_n − φ| + c_n)),

and thus the result in (a) is true.  If A5 holds,

(8)   [F(g(φ_n + c_n)) − F(g(φ_n − c_n))]/(2c_n) = (F*g)^(1)(φ_n) + (c_n²/12)[(F*g)^(3)(θ_1) + (F*g)^(3)(θ_2)],

where |θ_i − φ_n| ≤ c_n, i = 1, 2.  From (5), (8) and the boundedness of (F*g)^(3), we have |Δ_n| ≤ K_1 c_n².  Thus, the result in (b) holds. □
Lemma 2.2
For some nonnegative constants K_1-K_8, if A1, A2, and A4 or A5 hold, then
(a) Σ_n ≤ K_1 + K_2 E_Fn fg_1n²(φ_n), and
(b) Σ_n ≤ (K_5 + K_6 U_n²)/c_n.

Proof:
M_{g,n}²(t) = [(C_1 − C_2) fg_1n(t) ∫_0^{g(t)} S_2n(u) du − g'(t) S_1n(g(t)){C_1 F_2n(g(t)) + C_2 S_2n(g(t))}]²
≤ 2(C_1 − C_2)² fg_1n²(t) Z_2n² + 2(g'(t))² C_1².

By conditional independence, we get

Σ_n = E_Fn M_{g,n}²(φ_n) ≤ 2(C_1 g'(φ_n))² + 2(C_1 − C_2)² E(Z_2n²) E_Fn fg_1n²(φ_n)
    ≤ K_1 + K_2 E_Fn fg_1n²(φ_n),

proving (a).  Further,

E_Fn fg_1n²(φ_n) = (1/(2c_n)) E_Fn fg_1n(φ_n) = (1/(4c_n²))[F(g(φ_n + c_n)) − F(g(φ_n − c_n))].

If A4 holds,

E_Fn fg_1n²(φ_n) = (1/(4c_n²)){2c_n (F*g)^(1)(φ_n) + (c_n²/2)[(F*g)^(2)(θ_1) − (F*g)^(2)(θ_2)]}.

Now,

(F*g)^(1)(φ_n) = (F*g)^(1)(φ) + (φ_n − φ)(F*g)^(2)(θ_3)
≤ (F*g)^(1)(φ) + |φ_n − φ|[K_1 + K_2 g(θ_3)]
≤ K_4 + |φ_n − φ|[K_5 + K_6|φ_n − φ|]
≤ K_7 + K_8 U_n².

Thus, (b) holds.  If A5 holds, from (a) and (8),

Σ_n ≤ K_1 + K_2[(1/(4c_n²)){2c_n (F*g)^(1)(φ_n) + (c_n³/6)[(F*g)^(3)(θ_1) + (F*g)^(3)(θ_2)]}].

Since

|(F*g)^(1)(φ_n)| = |(F*g)^(1)(φ) + (φ_n − φ)(F*g)^(2)(φ) + ((φ_n − φ)²/2)(F*g)^(3)(θ)| ≤ K_9 + K_10 U_n²,

we get the result in (b) when A5 holds. □
Proof of Theorem 2.1:
Note first that φ_n is F_n-measurable.  Now,

U_{n+1}² = (φ_{n+1} − φ)² = (φ_{n+1} − φ_n + φ_n − φ)²
         = (a_n M_{g,n}(φ_n))² + U_n² − 2a_n(φ_n − φ) M_{g,n}(φ_n).

Since

E_Fn (φ − φ_n) M_{g,n}(φ_n) = (φ − φ_n)[g'(φ_n) M(g(φ_n)) + Δ_n]
≤ |U_n Δ_n| − (φ_n − φ) g'(φ_n) M(g(φ_n)),

we get, by taking conditional expectations with respect to F_n,

(9)   E_Fn U_{n+1}² ≤ U_n² + 2a_n|U_n Δ_n| + a_n² Σ_n − 2a_n(φ_n − φ) g'(φ_n) M(g(φ_n)).

By Lemma 2.1, if A4 holds, |Δ_n| ≤ c_n(K_1 + K_2|U_n|); if A5 holds, |Δ_n| ≤ c_n² K_1.  Thus, by (9) and Lemma 2.2(b),

E_Fn U_{n+1}² ≤ U_n² + 2a_n c_n(K_3 + K_4 U_n²) + a_n²(K_5 + K_6 U_n²)/c_n − 2a_n g'(φ_n)(φ_n − φ) M(g(φ_n))
= U_n²[1 + 2K_4 a_n c_n + K_6 a_n²/c_n] + 2K_3 a_n c_n + K_5 a_n²/c_n − 2a_n g'(φ_n)(φ_n − φ) M(g(φ_n)).

By Robbins-Siegmund and A3, we get U_n² → z a.s. for some finite r.v. z, and

Σ_{n=1}^∞ a_n g'(φ_n)(φ_n − φ) M(g(φ_n)) < ∞   a.s.

Since Σ a_n = ∞, we have the result. □

We now show the asymptotic normality of the estimators, suitably standardized, in a series of lemmas culminating with the application of Fabian's result.  We begin with an easy result which quantifies the order of magnitude of the bias.
We now show the asymptotic normality of the estimators,
suitably standardized, in a series of lemmas culminating with
the application of Fabian's result.
We begin with an easy
result which quantifies the order of magnitude of the bias.
Lemma 2.3
Assume A1-A3 hold.  Then
(a) if A4 holds, Δ_n = o(c_n);
(b) if A5 holds, Δ_n = O(c_n²);
(c) if A4 or A5 holds, and (F*g)^(3)(t) exists and is continuous in a neighborhood of φ, then

lim_{n→∞} Δ_n/c_n² = (C_1 − C_2)(∫_0^{g(φ)} S(u) du)(F*g)^(3)(φ)/6   a.s.

Proof:
By (6) of Lemma 2.1, if A4 holds, we have

|Δ_n| ≤ (K_3 c_n/4)|(F*g)^(2)(θ_1) − (F*g)^(2)(θ_2)|,

where |θ_i − φ_n| ≤ c_n, i = 1, 2.  By Theorem 2.1, θ_i → φ a.s., i = 1, 2, and we get the result in (a).  The result in (b) is true from Lemma 2.1(b).  If the assumption in (c) holds, from (4) and (8), we get

Δ_n = (C_1 − C_2) ∫_0^{g(φ_n)} S(u) du · E_Fn{fg_1n(φ_n) − (F*g)^(1)(φ_n)}
    = (C_1 − C_2)(∫_0^{g(φ_n)} S(u) du)(c_n²/12){(F*g)^(3)(θ_1) + (F*g)^(3)(θ_2)}.

We get the result from Theorem 2.1. □
We now give a univariate statement of the theorem due to Fabian (1968ii).  The succeeding lemmas then check that the conditions of Fabian's theorem hold.

Theorem (Fabian)
Suppose F_n is a nondecreasing sequence of sub σ-fields of F.  Suppose U_n, V_n, T_n, Γ_n and Φ_n are random variables such that Γ_n, Φ_{n−1} and V_{n−1} are F_n-measurable.  Let α, β, T, Γ, Φ and Σ be real constants with Γ > 0 such that

Γ_n → Γ,   Φ_n → Φ,   and T_n → T or E|T_n − T| → 0,

E_Fn V_n = 0,   and   E_Fn V_n² → Σ².

Let σ_{j,r}² = E[V_j² I[V_j² ≥ r j^α]], and suppose that either

lim_j σ_{j,r}² = 0 for every r,   or   α = 1 and lim_n n^{-1} Σ_{j=1}^n σ_{j,r}² = 0 for every r.

Suppose that 0 < α ≤ 1, 0 < β, β⁺ = β I{α = 1}, β⁺ < 2Γ, and

U_{n+1} = U_n[1 − n^{-α} Γ_n] + n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.

Then,

n^{β/2} U_n →_D N(T/(Γ − β⁺/2), Σ² Φ²/(2Γ − β⁺)).
The lemma below is stated in a more general form than
immediately necessary, because we will use it when discussing
more general density estimators.
Lemma 2.4
Let fg_n be an estimate of the density used in M_{g,n} that is conditionally (given F_n) independent of Z_2n.  Assume A1, A6, A4 or A5, and that (F*g)^(1)(x) is bounded.  For the p in A6, let K_1, K_2, K_3 and t be nonnegative real constants with t ≤ p.  Then,

E_Fn V_n^t ≤ c_n^{t/2}[K_1 + K_2 E_Fn|fg_n(φ_n)|^t + K_3 |E_Fn fg_n(φ_n)|^t].

Proof:
Now, V_n = c_n^{1/2}(M_1n(g(φ_n)) + M_2n(g(φ_n)) − Δ_n), and

(i)   |M_2n(g(φ_n))| ≤ C_1 g'(φ_n) ≤ K_4,

(ii)  |Δ_n| = |(C_1 − C_2) ∫_0^{g(φ_n)} S(u) du · E_Fn{fg_n(φ_n) − (F*g)^(1)(φ_n)}|
      ≤ K_5 + K_6|E_Fn fg_n(φ_n)|,

(iii) |M_1n(g(φ_n))| = |(C_1 − C_2){fg_n(φ_n) ∫_0^{g(φ_n)} S_2n(u) du − (F*g)^(1)(φ_n) ∫_0^{g(φ_n)} S(u) du}|
      ≤ K_7 + K_8 Z_2n|fg_n(φ_n)|.

Recall that for any nonnegative constants a, b, c, d,

(a + b + c)^d ≤ 3^d(a^d + b^d + c^d).

Thus, we have

(V_n²/c_n)^{t/2} ≤ 3^t[|M_1n(g(φ_n))|^t + |M_2n(g(φ_n))|^t + |Δ_n|^t].

We get the result by the above inequalities and taking conditional expectations. □
Lemma 2.5
Assume A1-A3, and A4 or A5.  Then

E_Fn V_n = 0   and   lim E_Fn V_n² = Σ   a.s.

Proof:
Recall V_n = c_n^{1/2}(M_{g,n}(φ_n) − g'(φ_n) M(g(φ_n)) − Δ_n).  The first part is obvious by the definition of Δ_n.  Now,

M_1n(g(t)) = (C_1 − C_2)[fg_1n(t) ∫_0^{g(t)} S_2n(u) du − g'(t) f(g(t)) ∫_0^{g(t)} S(u) du].

Note that for v > u, S_2n(v) S_2n(u) = S_2n(v), and E(∫_0^t S_2n(u) du)² ≤ E Z_2n² < ∞.  Thus,

∞ > E(∫_0^t S_2n(u) du)² = E ∫_0^t ∫_0^t S_2n(v) S_2n(u) dv du
= E{∫_0^t ∫_u^t S_2n(v) S_2n(u) dv du + ∫_0^t ∫_0^u S_2n(v) S_2n(u) dv du}
= E{∫_0^t ∫_u^t S_2n(v) dv du + ∫_0^t ∫_0^u S_2n(u) dv du}
= ∫_0^t ∫_u^t S(v) dv du + ∫_0^t u S(u) du
= ∫_0^t ∫_0^v S(v) du dv + ∫_0^t u S(u) du = 2 ∫_0^t u S(u) du.

Giving,

(10)   H(t) := 2 ∫_0^t u S(u) du = E(∫_0^t S_2n(u) du)² < ∞.

Now,

(11)   c_n E_Fn fg_1n(φ_n) = [F(g(φ_n + c_n)) − F(g(φ_n − c_n))]/2 → 0   a.s.,

since c_n → 0 and φ_n → φ by Theorem 2.1.  Similarly,

c_n[g'(φ_n) f(g(φ_n)) ∫_0^{g(φ_n)} S(u) du]² → 0   a.s.,

giving, by the conditional independence of Z_1n and Z_2n,

(12)   lim c_n E_Fn M_1n²(g(φ_n)) = lim c_n (C_1 − C_2)² H(g(φ_n)) E_Fn fg_1n²(φ_n),

and since

c_n E_Fn fg_1n²(φ_n) = [F(g(φ_n + c_n)) − F(g(φ_n − c_n))]/(4c_n) → [g'(φ) f(g(φ))]/2   a.s.,

we have

lim_{n→∞} c_n E_Fn M_1n²(g(φ_n)) = (C_1 − C_2)² H(g(φ))(F*g)^(1)(φ)/2 = Σ   a.s.

By Lemma 2.3, Δ_n = o(c_n), and from (11), we get

(13)   lim c_n E_Fn M_1n(g(φ_n)) = 0   a.s.

By definition, M_{g,n}(φ_n) − g'(φ_n) M(g(φ_n)) = M_1n(g(φ_n)) + M_2n(g(φ_n)), and since |M_2n(g(φ_n))| ≤ C_1 g'(φ_n), we have

lim E_Fn V_n² = lim c_n[E_Fn{M_1n(g(φ_n)) + M_2n(g(φ_n))}² − Δ_n²]
= lim c_n E_Fn[M_1n²(g(φ_n)) + 2 M_1n(g(φ_n)) M_2n(g(φ_n))]
= Σ,   by (12) and (13). □
Lemma 2.6
Assume A1-A3, A6 and A4' or A5'.  Then
(a) E_Fn V_n² is bounded for each n, and
(b) for any t ≤ p, lim_{n→∞} E(c_n^{1/2} V_n)^t = 0.

Proof:
By Lemma 2.4,

E_Fn V_n^t ≤ c_n^{t/2}[K_1 + K_2 E_Fn|fg_n(φ_n)|^t + K_3 |E_Fn fg_n(φ_n)|^t].

Now, E_Fn fg_n(φ_n) = (F*g)^(1)(φ_n) + R_n*, where R_n* is a remainder term equal to (c_n/4)[(F*g)^(2)(θ_1) − (F*g)^(2)(θ_2)] or (c_n²/12)[(F*g)^(3)(θ_1) + (F*g)^(3)(θ_2)] (where |θ_i − φ_n| ≤ c_n, i = 1, 2), depending on whether A4' or A5' is valid.  In either case, the term E_Fn fg_n(φ_n) is bounded.  For t = 2,

c_n E_Fn fg_n²(φ_n) = (1/2) E_Fn fg_n(φ_n),

which is bounded, and thus (a) is true.  For 0 ≤ t ≤ p,

c_n^t E_Fn fg_n^t(φ_n) = 2^{-t}[F(g(φ_n + c_n)) − F(g(φ_n − c_n))] = O(c_n),

so that E_Fn(c_n^{1/2} V_n)^t = c_n^{t/2} E_Fn V_n^t = o(1).  Result (b) follows immediately from the bounded version of the Lebesgue dominated convergence theorem. □
Lemma 2.7
Define, for r = 1, 2, ... and n = 1, 2, ...,

σ_{n,r}² = E[V_n² I[V_n² ≥ r n]].

Assume A6, A7, E(c_n^{1/2} V_n)^p = O(c_n) and, for the q in A6, 0 < γ ≤ 1/q.  Then

lim_{n→∞} σ_{n,r}² = 0   for each r.

Proof:
By Hölder's and Markov's inequalities, we get

σ_{n,r}² ≤ (E V_n^p)^{2/p}[P(V_n² ≥ r n)]^{1/q}
≤ E V_n^p/(r n)^{p/(2q)}
= O(c_n^{1-p/2})/n^{p/(2q)}
= O(1) n^{γ(p/2-1) - p/(2q)} → 0,

since γ ≤ 1/q implies p/(2q) − γ(p/2 − 1) ≥ γ > 0. □
Proof of Theorem 2.2:
With Lemmas 2.5-2.7, this is an easy application of Fabian's (1968ii) result.  Now, from (1) and (2), with U_n = φ_n − φ, β = 1 − γ and α = 1, a mean-value expansion gives

M(g(φ_n)) = (φ_n − φ) g'(θ_n) M'(g(θ_n)),   where |θ_n − φ| ≤ |φ_n − φ|,

and M_{g,n}(φ_n) = c_n^{-1/2} V_n + g'(φ_n) M(g(φ_n)) + Δ_n.  Thus,

U_{n+1} = U_n − a_n[c_n^{-1/2} V_n + g'(φ_n) M(g(φ_n)) + Δ_n]
        = U_n[1 − a_n g'(φ_n) g'(θ_n) M'(g(θ_n))] − a_n c_n^{-1/2} V_n − a_n Δ_n.

By Lemma 2.5, E_Fn V_n² → Σ a.s.  Now, let

Γ_n = A g'(φ_n) g'(θ_n) M'(g(θ_n)) → A(g'(φ))² M'(g(φ)) = Γ,
Φ_n = A C^{-1/2},   and   T_n = A n^{(1-γ)/2} Δ_n.

This gives

(14)   U_{n+1} = U_n[1 − n^{-1} Γ_n] − n^{-1+γ/2} Φ_n V_n − n^{γ/2-3/2} T_n.

Now, if A4' holds, by Lemma 2.3,

|T_n| ≤ K n^{(1-γ)/2} o(n^{-γ}) = o(n^{(1-3γ)/2}) = o(1)   for γ ≥ 1/3.

If A5' holds, by Lemma 2.3,

|T_n| ≤ K n^{(1-γ)/2} O(n^{-2γ}) = O(n^{(1-5γ)/2}) = o(1)   for γ > 1/5.

Thus the bias T = 0, and the result is immediate from Lemmas 2.5-2.7 and Fabian's Theorem applied to (14) with β = 1 − γ.

For part (c), by (14),

U_{n+1} = U_n[1 − n^{-1} Γ_n] − n^{-1+γ/2} Φ_n V_n − n^{γ/2-3/2} T_n,

where T_n = A n^{(1-γ)/2} Δ_n and γ = 1/5.  Now, by Lemma 2.3,

lim_{n→∞} T_n = A C² lim_{n→∞} n^{-2γ} n^{(1-γ)/2}(Δ_n/c_n²)
             = A C² lim(Δ_n/c_n²)
             = A C²(C_1 − C_2)(∫_0^{g(φ)} S(u) du)(F*g)^(3)(φ)/6 = T   a.s.

The result is immediate from Lemmas 2.5-2.7 and Fabian's Theorem. □
CHAPTER 3
DEVELOPING AN OPTIMAL SA ARP
METHODOLOGY
§3.1
Reducing the order of bias
From the proof of Theorem 2.2, we note that the rate of convergence of φ_n to asymptotic normality could be made quicker if the order of magnitude of the bias term, Δ_n, were smaller.
By taking several observations at each stage,
Fabian (1967) showed how to modify the Kiefer-Wolfowitz
procedure to reduce the order of magnitude of the bias term.
We are able to achieve the same effect by taking advantage of
the special nature of density estimators rather than taking
additional observations at each stage.
We now focus on a
more general class of estimators that achieve a better rate
of convergence.
Recall, from (2.3), that

Δ_n = E_Fn{M_{g,n}(φ_n) − g'(φ_n) M(g(φ_n))}
    = E_Fn{M_1n(g(φ_n)) + M_2n(g(φ_n))}.

Since E_Fn M_2n(g(φ_n)) = 0, this gives (2.4), where

fg_1n(t) = I[g(t − c_n) < Z_1n ≤ g(t + c_n)]/(2c_n)
         = I[−1 < (g^{-1}(Z_1n) − t)/c_n ≤ 1]/(2c_n).
Let p and r be integers such that 0 ≤ p < r, and let f*g_n^(p)(x) be an estimator of (F*g)^(p+1)(x) using the nth observation.  We use f*g_n(x) for f*g_n^(0)(x).  We can achieve better convergence rates for the SA ARP estimators by constructing estimators f*g_n^(p)(x) which satisfy:

A8:
(a) sup_x |E f*g_n^(p)(x) − (F*g)^(p+1)(x)| = O(n^{-(r-p)/(2r+1)}), and
(b) for t > −1, sup_x E|f*g_n(x)|^{t+1} = O(n^{t/(2r+1)}).

A9: Let t > −1 and let {x_n} be a sequence of constants tending to x_0.  Then, there exist constants κ_1 and κ_{2,t} such that
(a) lim_{n→∞} n^{(r-p)/(2r+1)}[E f*g_n^(p)(x_n) − (F*g)^(p+1)(x_n)] = κ_1, and
(b) lim_{n→∞} n^{-t/(2r+1)} E(f*g_n(x_n))^{t+1} = κ_{2,t}.
Recall that (F*g)^(p+1) is the pth derivative of the density function when g(t) = t.  If g(t) = t, the number of derivatives of the density that will be assumed to exist and be bounded is r.  In A10 we will give more precise conditions on the distribution and g(·) for a specific value of r.  In this section we only use f*g_n, that is, p = 0.  In Lemma 3.4 we will need the estimator for p = 1, and in Lemma 3.5 we will need a modified version with p = r.  We now describe one method of creating such estimators, using kernel functions.  Lemmas will be given providing sufficient conditions for A8 and A9.
We first present a small review of the method of kernel estimators of a density, primarily following Singh (1977).  Let B_0 be the class of all Borel-measurable real-valued functions k(·) where k(·) is bounded and equals zero outside [0,1].  For integers r and p where 0 ≤ p < r, define

M = {k ∈ B_0 : (1/j!) ∫_0^1 y^j k(y) dy = 1 if j = p, and 0 if j = 0, 1, ..., r−1, j ≠ p}

M* = {k ∈ M : k is continuous and of bounded variation}.

Consider X_1, ..., X_n, an i.i.d. sample having density f.  Let {c_n} be a sequence of positive constants tending to 0.  For a fixed k ∈ M, our kernel estimate of f^(p), the pth derivative of f, is

f̂^(p)(x) = 1/(n c_n^{p+1}) Σ_{j=1}^n k((X_j − x)/c_n).

In practice, k(·) can be taken to be a polynomial of order r.
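A polynomial kernel in the class M can be produced by solving the moment conditions as a linear system: with k(y) = Σ_i b_i y^i, the jth condition reads Σ_i b_i/(i + j + 1) = j! δ_{jp}.  The sketch below uses exact rational arithmetic; for r = 2, p = 0 it returns k(y) = 4 − 6y, which satisfies ∫_0^1 k = 1 and ∫_0^1 y k = 0.

```python
import math
from fractions import Fraction

def kernel_coeffs(r, p):
    """Coefficients b_0..b_{r-1} of a polynomial k(y) = sum b_i y^i on [0,1]
    satisfying the class-M moment conditions (1/j!) * int_0^1 y^j k(y) dy = delta_{jp},
    j = 0,...,r-1.  The j-th condition reads sum_i b_i/(i+j+1) = j! * delta_{jp}."""
    A = [[Fraction(1, i + j + 1) for i in range(r)] for j in range(r)]
    b = [Fraction(math.factorial(j)) if j == p else Fraction(0) for j in range(r)]
    # Gauss-Jordan elimination with exact rational arithmetic
    for col in range(r):
        piv = next(row for row in range(col, r) if A[row][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(r):
            if row != col:
                f = A[row][col] / A[col][col]
                A[row] = [a - f * c for a, c in zip(A[row], A[col])]
                b[row] = b[row] - f * b[col]
    return [b[i] / A[i][i] for i in range(r)]

coeffs = kernel_coeffs(2, 0)   # k(y) = 4 - 6y
```

Larger r gives higher-order kernels, e.g. kernel_coeffs(3, 0) yields k(y) = 9 − 36y + 30y², annihilating the first two moments.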
We now state three important (to us) results in density
estimation.
Theorem A
If k ∈ M and f^(r) is bounded, then
(a) sup_x (1/c_n^{r-p})|E f̂^(p)(x) − f^(p)(x)| = O(1).
Further, if for some t > 1, ∫|f^(r)(x)|^t dx < ∞, then
(b) sup_x (1/c_n^{r-p-1/t})|E f̂^(p)(x) − f^(p)(x)| = O(1).

Theorem B
Let k ∈ M* and c_n = C n^{-w/2} where w = 1/(r+1).  If the assumptions of Theorem A(a) hold, then
(a) sup_x n^{(r-p)/(r+1)}(f̂^(p)(x) − f^(p)(x))²/(log log n) = O(1)   a.s.
Now, let k ∈ M* and c_n = C n^{-w/2} where w = 1/(r+1−1/t).  If the assumptions of Theorem A hold, then
(b) sup_x n^{w(r-p-1/t)}(f̂^(p)(x) − f^(p)(x))²/(log log n) = O(1)   a.s.

Theorem C
Let k ∈ M, f and f^(r) be bounded, and c_n = C n^{-1/(2r+1)}.  Then

sup_x E(f̂^(p)(x) − f^(p)(x))² = O(n^{-2(r-p)/(2r+1)}).
In the stochastic approximation scenario, we are
interested in the properties of a single observation
estimator of (F*g)(l)(x).
We give below our analog to the
important part (a) of Theorem A.
But first, we need to
impose the following slightly stronger condition on the
distribution function:
A10: For some integer r ≥ 1, assume that (F*g)^(1)(x) and (F*g)^(r+1)(x) exist for each x, are bounded on the entire real line, and are continuous in a neighborhood of φ.
Lemma 3.1
Let keN and for some integers rand p, let
Suppose (F*g) (r+l}(x) is bounded.
O~p<r.
Define
fg(P)(x) = k[(g-l(X }-x)/c l/c p + l •
n
n
n
n
Then,
(a)
supx
lE
fg~P} (x}-(F*g) (p+1) (x)
I
Further, if (F*g}(r+I}(X) is continuous in a
neighborhood of a fixed x ' and {x } is a sequence of
o
n
constants tending to x , then
o
(b)
[E fg(P} (
}-(F*g) (p+l) (x ) l
lim c-(r-p}
n
n
xn
n
I
= (F*g) (r+l) (xo).(r! ~o yr key) dye
Proof:
By a change of variables, we have
E fg_n^(p)(x) = c_n^{-p-1} ∫ k[(g^{-1}(s) - x)/c_n] f(s) ds
= c_n^{-p} ∫_0^1 k(y) g'(x + c_n y) f(g(x + c_n y)) dy
= c_n^{-p} ∫_0^1 k(y) (F*g)^(1)(x + c_n y) dy.
A Taylor-series expansion gives
(F*g)^(1)(x + c_n y) = (F*g)^(1)(x) + (c_n y)(F*g)^(2)(x) + ... + (c_n y)^{r-1}/(r-1)! (F*g)^(r)(x) + R_n(y),
where R_n(y) = (c_n y)^r/r! (F*g)^(r+1)(θ) for some θ such that |θ - x| ≤ c_n y. Thus
|R_n(y)| ≤ K_1 (c_n y)^r/r! ≤ K_2 c_n^r.
By the boundedness of k and its orthogonality to y^j, j = 1, ..., r-1, we have
c_n^p E fg_n^(p)(x) = c_n^p (F*g)^(p+1)(x) + ∫_0^1 k(y) R_n(y) dy,
which gives result (a). For part (b),
c_n^{-(r-p)} [E fg_n^(p)(x_n) - (F*g)^(p+1)(x_n)] = c_n^{-r} ∫_0^1 k(y) R_n(y) dy
= ∫_0^1 k(y) y^r/r! (F*g)^(r+1)(θ) dy
→ (F*g)^(r+1)(x_0) ∫_0^1 k(y) y^r/r! dy
through an application of the bounded convergence theorem. □
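The rate in part (a) is easy to check numerically. The sketch below takes g to be the identity (so that (F*g)^(1) = f), p = 0, r = 2, the kernel k(y) = 4 - 6y (which integrates to 1 and is orthogonal to y), and an Exp(1) lifetime density; all of these concrete choices are illustrative assumptions, not values from the text. Halving c_n should cut the bias by a factor of roughly 2^{r-p} = 4.

```python
import numpy as np

# Midpoint-rule quadrature for E fg_n(x) = (1/c) * integral k((s-x)/c) f(s) ds
#                                       = integral_0^1 k(y) f(x + c*y) dy,
# with g the identity, so the single-observation estimator targets f itself.
def expected_fg(x, c, f, ngrid=100000):
    y = (np.arange(ngrid) + 0.5) / ngrid
    k = 4.0 - 6.0 * y            # k in K: integral k = 1, integral y*k = 0 (p=0, r=2)
    return np.sum(k * f(x + c * y)) / ngrid

f = lambda s: np.exp(-s)         # Exp(1) density, an illustrative lifetime law
x = 1.0
bias = {c: expected_fg(x, c, f) - f(x) for c in (0.2, 0.1)}
ratio = bias[0.2] / bias[0.1]    # should be near 2^(r-p) = 4
print(round(ratio, 2))
```

The ratio is slightly below 4 because of the O(c_n^{r+1}) correction term in the expansion above.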
We remark here that, with A10, the results (a) and (b) of Lemma 3.1 show that the kernel estimator meets assumptions A8(a) and A9(a), respectively. With A10, the results (a) and (b) of Lemma 3.2 will show that the kernel estimator meets assumptions A8(b) and A9(b), respectively. We note here that, since we take only a single observation, one would not expect a result analogous to Theorem B to be true. An analog of Theorem C will be of consequence.
Lemma 3.2
Assume k ∈ K and that (F*g)^(1)(x) is bounded and continuous at x_0. Let {x_n} be a sequence of constants tending to x_0. If t > -1, then
(a) sup_x E |fg_n(x)|^{t+1} = O(c_n^{-t}), and
(b) lim c_n^t E fg_n^{t+1}(x_n) = (F*g)^(1)(x_0) ∫_0^1 k^{t+1}(y) dy.
Proof:
The proof is an easy application of the bounded convergence theorem. Since
E |fg_n(x)|^{t+1} = c_n^{-(t+1)} ∫ |k((g^{-1}(s) - x)/c_n)|^{t+1} f(s) ds
= c_n^{-t} ∫_0^1 |k(y)|^{t+1} (F*g)^(1)(x + c_n y) dy,
result (a) is true by the boundedness of (F*g)^(1) and k(·). Further, by that boundedness,
lim c_n^t E fg_n^{t+1}(x_n) = ∫_0^1 k^{t+1}(y) lim (F*g)^(1)(x_n + c_n y) dy. □
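Part (b) can be checked in the same way as the bias rate above. Under the same illustrative choices (g the identity, Exp(1) density, k(y) = 4 - 6y, t = 1), the scaled second moment c_n E fg_n(x)² should approach (F*g)^(1)(x) ∫_0^1 k²(y) dy = 4 f(x) as c_n → 0; this is a minimal sketch, not the document's setting.

```python
import numpy as np

# c * E fg_n(x)^2 = integral_0^1 k(y)^2 f(x + c*y) dy -> f(x) * integral k^2 = 4 f(x),
# computed by midpoint quadrature for a shrinking sequence of bandwidths c.
def scaled_second_moment(x, c, f, ngrid=100000):
    y = (np.arange(ngrid) + 0.5) / ngrid
    k = 4.0 - 6.0 * y
    return np.sum(k * k * f(x + c * y)) / ngrid

f = lambda s: np.exp(-s)
x = 1.0
limit = 4.0 * f(x)               # integral_0^1 (4-6y)^2 dy = 4
errs = [abs(scaled_second_moment(x, c, f) - limit) for c in (0.4, 0.2, 0.1, 0.05)]
print([round(e, 3) for e in errs])
```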
We now prove results about the SA ARP estimator using a general density estimator f*g_n^(p) which satisfies A8 and A9. As pointed out, kernel estimators are one type of such estimator. All quantities, e.g., M_{g,n}, Δ_n, etc., are defined as in Chapter 2 but using the new density estimator f*g_n in place of the histogram estimator fg_{1n}.
Theorem 3.1
(a) Assume A1-A3, A8 and A9. Then φ_n → φ a.s.
(b) Assume A1, A2 and A6-A9, and let γ = 1/(2r+1) ≥ 1/q. Then
n^{(1-γ)/2}(φ_n - φ) →_D N(μ_0, σ_0²),
where, with Γ = A(g'(φ))² M'(g(φ)) and τ_1, τ_{2,1} the constants of A9,
μ_0 = 2A(C_1 - C_2)(∫_0^{g(φ)} S(u) du) τ_1 / (2Γ - 1 + γ)
σ_0² = 2A²(C_1 - C_2)²(∫_0^{g(φ)} u S(u) du) τ_{2,1} / (2Γ - 1 + γ).
Proof:
Recall U_n = φ_n - φ. Now, by A8(a) (with p = 0), (2.4), and γ = 1/(2r+1), we get |Δ_n| ≤ O(n^{-γr}). Hence, by (2.9), Lemma 2.2(a) and A8(b) (with t = 1), we get
E_{F_n} U_{n+1}² ≤ U_n² - 2a_n(φ_n - φ)g'(φ_n)M(g(φ_n)) + a_n²[K_1 + K_2 O(n^γ)] + 2a_n(1 + U_n²)O(n^{-γr})
= -2a_n(φ_n - φ)g'(φ_n)M(g(φ_n)) + U_n²[1 + O(a_n n^{-γr})] + a_n² O(n^γ) + O(a_n n^{-γr}).
Thus, by the Robbins-Siegmund result, A2 and A3, we get part (a).
By equation (2.14), we have
U_{n+1} = U_n[1 - n^{-1}Γ_n] - n^{-1+γ/2} β_n V_n - n^{γ/2-3/2} T_n,
where
V_n = c_n^{1/2}(M_{g,n}(φ_n) - g'(φ_n)M(g(φ_n)) - Δ_n),
Γ_n = A g'(φ_n) g'(θ_n) M'(g(θ_n)) → A(g'(φ))² M'(g(φ)) = Γ,
β_n = A C^{-1/2} = β, and T_n = A n^{(1-γ)/2} Δ_n.
By A9(a) (with p = 0) and part (a),
lim_n T_n = lim A n^{r/(2r+1)} (C_1 - C_2) ∫_0^{g(φ_n)} S(u) du [E_{F_n} f*g_n(φ_n) - (F*g)^(1)(φ_n)]
= A(C_1 - C_2)(∫_0^{g(φ)} S(u) du) τ_1 = T, say.
By equations (2.10), (2.12) and A9(b) (with t = 1),
lim c_n E_{F_n} M_{f,n}²(g(φ_n)) = (C_1 - C_2)² lim H(g(φ_n)) c_n E_{F_n} fg_n²(x)
= 2(C_1 - C_2)² ∫_0^{g(φ)} u S(u) du τ_{2,1} = S_0², say.
Thus, as in Lemma 2.5, we get E_{F_n} V_n² → S_0² a.s. Now, by A8(b) (with t = 0, 1) and Lemma 2.4, we get the boundedness of E_{F_n} V_n². The result of the theorem comes immediately from an application of Fabian's Theorem, provided we show that
σ_{n,ρ}² = E[I(V_n² > ρn) V_n²] → 0 for each ρ.
By Lemma 2.7, we need only show, for t ≤ p, that E(c_n^{1/2} V_n)^t → 0. By Lemma 2.4,
E_{F_n} V_n^t ≤ c_n^{t/2}[K_1 + K_2 E_{F_n}|fg_n(φ_n)|^t + K_3 |E_{F_n} fg_n(φ_n)|^t].
Now, c_n fg_n(φ_n) and E_{F_n} fg_n(φ_n) are bounded. By Lemma 2.4 and A8(b) (with t = 0 and p-1), we have E_{F_n}(c_n^{1/2} V_n)^t = O(1). The bounded version of Lebesgue's Dominated Convergence Theorem gives E(c_n^{1/2} V_n)^t = o(1), and thus the theorem. □
Corollary 3.1
Define the kernel estimator
(1) fg_n^(p)(x) = k[(g^{-1}(Z_{1n}) - x)/c_n]/c_n^{p+1}.
Write fg_n(x) for fg_n^(0)(x) and let f*g_n(x) = fg_n(x). Then,
(a) if A1-A3 and A10 hold, then φ_n → φ a.s.;
(b) if A1, A2, A6, A7 and A10 hold, and γ = 1/(2r+1) ≥ 1/q, then
n^{(1-γ)/2}(φ_n - φ) →_D N(μ_1, σ_1²),
where
μ_1 = 2A C^r (C_1 - C_2)(∫_0^{g(φ)} S(u) du)(F*g)^(r+1)(φ)(∫_0^1 y^r k(y)/r! dy)/[2Γ - 1 + γ]
σ_1² = 2A² C^{-1} (C_1 - C_2)²(∫_0^{g(φ)} u S(u) du)(F*g)^(1)(φ)(∫_0^1 k²(y) dy)/[2Γ - 1 + γ].
Proof:
Immediate from Lemmas 3.1 and 3.2 and Theorem 3.1. □
§3.2 Reducing the asymptotic mean square error
In §3.1, we demonstrated how to define a sequential procedure,
(2) φ_{n+1} = φ_n - A n^{-1} M_{g,n}(φ_n),
that under certain reasonable conditions produces estimators that are strongly consistent and asymptotically normal. In this section, we investigate recommendations for the choice of the parameters A and C. To be explicit, only choices for the kernel estimators are presented.
From Corollary 3.1 of the previous section, we see that under certain conditions n^{(1-γ)/2}(φ_n - φ) →_D X, where X is a normally distributed random variable with mean μ_1 and variance σ_1². One criterion for the selection of parameters, suggested by Abdelhamid (1973), is to choose A and C to minimize E X², the asymptotic mean square error. Elementary calculations show that
(3) A_opt = [(g'(φ))² M'(g(φ))]^{-1}
(4) C_opt = {2γ(r+1)/r ∫_0^1 k²(y) dy}^{1/(2r+1)} {2(C_1 - C_2)(F*g)^(r+1)(φ) ∫_0^{g(φ)} S(u) du ∫_0^1 y^r k(y)/r! dy}^{-2/(2r+1)}.
We note that the asymptotic distributions of the processes presented in §2.4 and §3.1 depend on the choice of the transform function g(·). We now discuss how to choose the parameters A and C so that the asymptotic distribution of the resulting procedure is invariant to the choice of g(·).
We first re-introduce the star superscript notation used in (2.1). Let φ be the t that uniquely minimizes R_1(t), and let φ* be the t that uniquely minimizes R_1(g(t)). Thus φ = g(φ*). Using the φ_n* estimators defined in (2.1) and (2.2) with kernel estimators defined in (3.1), under certain conditions we showed in Corollary 3.1 that n^{(1-γ)/2}(φ_n* - φ*) is asymptotically normal. Since φ* is a continuity point of g(·), the δ-method tells us that n^{(1-γ)/2}(g(φ_n*) - g(φ*)) is asymptotically normal with mean g'(φ*)μ_1 and variance (g'(φ*))²σ_1². Now,
g'(φ*)μ_1 = K_1 A C^r (F*g)^(r+1)(φ*)/{2A(g'(φ*))² M'(φ) - 1 + γ}
(g'(φ*))²σ_1² = K_2 A² C^{-1} (g'(φ*))³/{2A(g'(φ*))² M'(φ) - 1 + γ},
where K_1 and K_2 are constants involving ∫_0^1 y^r k(y)/r! dy and ∫_0^1 k²(y) dy, respectively.
To make the denominators invariant with respect to g(·), we choose A = A_0(g'(φ*))^{-2}, where A_0 is some positive constant. This gives
g'(φ*)μ_1 = K_3 C^r (g'(φ*))^{-1} (F*g)^(r+1)(φ*)
(g'(φ*))²σ_1² = K_4 C^{-1} (g'(φ*))^{-1},
where
K_3 = K_1 A_0/{2A_0 M'(φ) - 1 + γ}
K_4 = K_2 A_0²/{2A_0 M'(φ) - 1 + γ}.
As with the first criterion, the best choice of A_0 is (M'(φ))^{-1}. Now, by using C = C_0(g'(φ*))^{-1}, we make the asymptotic variance independent of the choice of g(·) and the asymptotic mean nearly so. More explicitly, with this choice of C,
g'(φ*)μ_1 = K_5 (F*g)^(r+1)(φ*)(g'(φ*))^{-(r+1)}
(g'(φ*))²σ_1² = K_6,
where K_5 = K_3 C_0^r and K_6 = K_4 C_0^{-1}. As before, the best choice of C_0 is C_opt g'(φ*), where C_opt is given in (4); equivalently, C_opt = C_0(g'(φ*))^{-1}. Unfortunately, this choice of C_0 does depend on g(·).
Now suppose that g(·) is approximately a line in a neighborhood of φ*, or at least at φ*. More generally, we need only require that g^(i)(φ*) = 0 for i = 2, ..., r+1. Then
(F*g)^(r+1)(φ*) = (g'(φ*))^{r+1} f^(r+1)(g(φ*))
and
g'(φ*)μ_1 = K_5 f^(r+1)(φ),
which does not depend on g(·). Thus, within this more restrictive class of transforms than defined by A1, we see that the asymptotic distribution is invariant with respect to the choice of the transform.
In Chapter 6 we investigate some finite sample properties of these optimal choices of A and C using simulation techniques. The choice of k(y) ∈ K to minimize E X² has not been solved yet.

§3.3 An adaptive ARP procedure
As pointed out in §3.2, for a fixed distribution function F, and known transform g and kernel k, there is a choice of A and C that minimizes the asymptotic MSE. Unfortunately, these best choices of A and C depend on knowledge of F and φ, which are generally unknown prior to conducting the experiment. In this section, A and C are replaced by estimators in an adaptive manner such that the procedure attains the optimal MSE without a priori knowledge of F and φ.
Let {X_{i,n}}, i = 1, 2, be two sequences of i.i.d. random variables that are mutually independent, each having d.f. F. {φ_n}, {a_n*} and {c_n*} are sequences of random variables. Use ∨ and ∧ to denote max and min, respectively. For positive constants Z_1, Z_2, Z_3, Z_4, a, b, c and γ, define
(5) a_n = (Z_1 (log n)^{-1} ∨ a_n*) ∧ Z_2 n^a
(6) c_n = (Z_3 n^{-b} ∨ c_n*) ∧ Z_4 n^c,
where a + rc < rγ, a + b + γ < 1/2, γ = 1/(2r+1) and b < rγ/(r+1).
This type of truncation device has been used by other researchers in stochastic approximation (cf. Fabian, 1971, Theorem 2.4). These particular truncating functions are not unique and could be replaced by others which are more complicated (cf. Lai and Robbins, 1981, Theorem 3). For convenience we choose these simple, sufficient conditions over the more complex, yet only slightly weaker, conditions.
Let Z_{in} = X_{in} ∧ g(φ_n + c_n n^{-γ}) and define
F_n = σ(φ_1; Z_{ij}, i = 1, 2, j = 1, ..., n-1).
We assume a_n* and c_n* are F_n-measurable. Definitions for all quantities (F_n, S_n, fg_n, M_{g,n}, etc.) are as in §3.1 (using (1) for the density estimator), except now based on the new truncated observations {Z_{in}}, and replacing A and C by a_n and c_n, respectively. Our adaptive procedure is
(7) φ_{n+1} = φ_n - a_n n^{-1} M_{g,n}(φ_n).
Some additional notation will be useful; define
A_n(t) = (C_1 - C_2) ∫_0^{g(t)} S_{2n}(u) du
(8) A(t) = (C_1 - C_2) ∫_0^{g(t)} S(u) du
B_n(t) = g'(t) S_{1n}(g(t)) [C_1 F_{2n}(g(t)) + C_2 S_{2n}(g(t))]
B(t) = g'(t) S(g(t)) [C_1 F(g(t)) + C_2 S(g(t))].
From (1) and (2.2), we have
M_{g,n}(t) = A_n(t) fg_n(t) - B_n(t).
Recall that
(9) M_{g,n}(φ_n) = g'(φ_n) M(g(φ_n)) + (c_n n^{-γ})^{-1/2} V_n + Δ_n,
where
Δ_n = A(φ_n)[E_{F_n} fg_n(φ_n) - (F*g)^(1)(φ_n)]
V_n = (c_n n^{-γ})^{1/2} {fg_n(φ_n) A_n(φ_n) - E_{F_n}[fg_n(φ_n) A_n(φ_n)] - [B_n(φ_n) - B(φ_n)]}.
Even when using the modified procedure, we retain the conditional unbiasedness of parts of the estimator, i.e.,
E_{F_n} S_{in}(t) = E_{F_n} I(Z_{in} > t) = S(t) for t ≤ g(φ_n + c_n n^{-γ}),
giving E_{F_n} A_n(φ_n) = A(φ_n) and E_{F_n} B_n(φ_n) = B(φ_n). Further, we have the following
Lemma 3.3
Assume A1-A3, A10, and (5)-(7). Then
(10) Δ_n = O((c_n n^{-γ})^r) ≤ O(n^{(c-γ)r}).
Proof:
Via a Taylor-series expansion,
E_{F_n} fg_n(t) = (c_n n^{-γ})^{-1} E_{F_n} k[(g^{-1}(Z_{1n}) - t)/(c_n n^{-γ})]
= ∫_0^1 k(y) (F*g)^(1)(t + c_n n^{-γ} y) dy
= (F*g)^(1)(t) + (c_n n^{-γ})^r ∫_0^1 y^r k(y)/r! (F*g)^(r+1)(θ) dy,
where |θ - t| ≤ c_n n^{-γ}, which gives the result. □
We note from the proof of the lemma that E_{F_n} fg_n(φ_n) is bounded. We have now placed enough structure on the procedure to give the following:
Theorem 3.2
Assume A1-A3, A10 and (5)-(7). Then
(11) φ_n → φ a.s.
Proof:
Using (9) in (7) gives
(12) φ_{n+1} = φ_n - a_n n^{-1}[g'(φ_n)M(g(φ_n)) + (c_n n^{-γ})^{-1/2} V_n + Δ_n].
Squaring and taking conditional expectations with respect to F_n gives
(13) E_{F_n}(φ_{n+1} - φ)² = (φ_n - φ)² - 2n^{-1}a_n(φ_n - φ)[g'(φ_n)M(g(φ_n)) + Δ_n]
+ n^{-2}a_n²[(g'(φ_n)M(g(φ_n)) + Δ_n)² + c_n^{-1} n^γ E_{F_n} V_n²].
Note that, by A10, g'(t)M(g(t)) is bounded. By the Robbins-Siegmund Theorem, sufficient for (11) are
(14) Σ n^{-1} a_n |Δ_n| < ∞ a.s.,
(15) Σ n^{-2} a_n² < ∞ a.s., and
(16) Σ n^{-2} a_n² c_n^{-1} n^γ E_{F_n} V_n² < ∞ a.s.
(15) is true by the definition of {a_n}; (14) is true by Lemma 3.3. From (8) and the remark following Lemma 3.3, we have that B, B_n, A_n and E_{F_n} fg_n(φ_n) are bounded, giving
E_{F_n}((c_n n^{-γ})^{-1/2} V_n)² ≤ K_1 + K_2 E_{F_n} A_n²(φ_n) E_{F_n}(fg_n(φ_n))²
for some constants K_1, K_2. Now, E_{F_n} A_n²(φ_n) is bounded and E_{F_n}(fg_n(φ_n))² = O((c_n n^{-γ})^{-2}) ≤ O(n^{2(γ+b)}). Since we require a + b + γ < 1/2, this is sufficient for (16), and hence the theorem. □
We now specify the sequences {a_n*} and {c_n*} so that
(17) a_n* → A_opt and c_n* → C_opt a.s.,
where A_opt and C_opt are given in (3) and (4). To estimate (F*g)^(r+1)(φ), an additional mild assumption on the density is needed.
A11. For some d > 0,
(F*g)^(r+1)(x) = (F*g)^(r+1)(φ) + O(|x - φ|^d) for each x.
In the following two lemmas, we construct F_n-measurable sequences {a_n*} and {c_n*} that satisfy (17). Recall the definition of fg_n^(1) and fg_n in (1).
Lemma 3.4
Assume A1, A2, A6, A7, A10 and (5)-(7). Define
α_n = A_n(φ_n)[fg_n^(1)(φ_n) - g''(φ_n) fg_n(φ_n)] + g'(φ_n) fg_n(φ_n)[C_1 F_{2n}(g(φ_n)) + C_2 S_{2n}(g(φ_n))],
a_{n+1}* = [n^{-1} Σ_{j=1}^n α_j]^{-1},
and a_n as in (5). Then, if γ + b < 1/3, we have a_n → A_opt a.s.
Proof:
We begin by noting that
E_{F_n} α_n = A(φ_n)[E_{F_n} fg_n^(1)(φ_n) - g''(φ_n) E_{F_n} fg_n(φ_n)] + g'(φ_n) E_{F_n} fg_n(φ_n)[C_1 F(g(φ_n)) + C_2 S(g(φ_n))]
→ A(φ)[(F*g)^(2)(φ) - g''(φ)(F*g)^(1)(φ)] + g'(φ)(F*g)^(1)(φ)[C_1 F(g(φ)) + C_2 S(g(φ))]
= (g'(φ))² M'(g(φ)),
from Theorem 3.2 and since E_{F_n} fg_n^(p)(φ_n) → (F*g)^(p+1)(φ) (using the same argument as in Lemma 3.1(b)). Hence
n^{-1} Σ_{k=1}^n E_{F_k} α_k → A_opt^{-1}.
Define the martingale W_n = Σ_{k=1}^{n-1}(α_k - E_{F_k} α_k). From the result of Chow (1965), sufficient for the proof of the lemma is
Σ n^{-2} E_{F_n}(W_{n+1} - W_n)² < ∞ a.s.
Now, (W_{n+1} - W_n)² = (α_n - E_{F_n} α_n)², so E_{F_n}(W_{n+1} - W_n)² ≤ E_{F_n} α_n². Since E_{F_n} A_n²(φ_n), g' and g'' are bounded, there exist positive constants K_1, K_2 and K_3 such that
E_{F_n}(W_{n+1} - W_n)² ≤ K_1 + K_2 E_{F_n} fg_n²(φ_n) + K_3 E_{F_n}(fg_n^(1)(φ_n))².
Now,
(18) E_{F_n}(fg_n(φ_n))² = O((c_n n^{-γ})^{-2}) = O(n^{2(γ+b)}),
and thus Σ n^{-2} E_{F_n}(fg_n(φ_n))² < ∞ a.s. Further,
(c_n n^{-γ})³ E_{F_n}(fg_n^(1)(φ_n))² = (c_n n^{-γ})^{-1} ∫ k²[(g^{-1}(s) - φ_n)/(c_n n^{-γ})] f(s) ds
= ∫_0^1 k²(y)(F*g)^(1)(φ_n + c_n n^{-γ} y) dy,
so E_{F_n}(fg_n^(1)(φ_n))² = O((c_n n^{-γ})^{-3}) = O(n^{3(γ+b)}). Thus
Σ n^{-2} E_{F_n}(fg_n^(1)(φ_n))² = Σ n^{-2} O(n^{3(γ+b)}) < ∞ a.s.,
since γ + b < 1/3. □
To estimate C_opt, we see from (4) that there are three unknown quantities. As a preliminary, to estimate the r-th derivative of the transformed density, we need to introduce k_r, a bounded function equal to zero outside (0,1), such that
∫_0^1 y^r k_r(y)/r! dy = 1 and ∫_0^1 y^j k_r(y) dy = 0, j = 0, ..., r-1.
We define
fg_n^(r)(t) = k_r[(g^{-1}(Z_{1n}) - t)/(c_n n^{-γ})]/(c_n n^{-γ})^{r+1}
as a quantity to be used in our estimate.
Lemma 3.5
Assume A11 and that the conditions of Lemma 3.4 hold, and let c_n be as in (6). Then
(19) n^{-1} Σ_{j=1}^n A_j(φ_j) → (C_1 - C_2) ∫_0^{g(φ)} S(u) du a.s.,
(20) n^{-1} Σ_{j=1}^n fg_j^(r)(φ_j) → (F*g)^(r+1)(φ) a.s., and
n^{-1} Σ_{j=1}^n c_j j^{-γ} S_j² → A²(φ)(F*g)^(1)(φ) ∫_0^1 k²(y) dy a.s.,
which gives
(21) c_{n+1}* → C_opt a.s.
Proof:
(19) is an easy application of Chow's Theorem and Theorem 3.2. To prove (20),
E_{F_n} fg_n^(r)(t) = (c_n n^{-γ})^{-r} ∫_0^1 k_r(y)(F*g)^(1)(t + c_n n^{-γ} y) dy
= ∫_0^1 y^r k_r(y)/r! (F*g)^(r+1)(θ) dy for some θ such that |θ - t| ≤ c_n n^{-γ}
= (F*g)^(r+1)(t) + ∫_0^1 y^r k_r(y)/r! O(|θ - t|^d) dy.
Thus
E_{F_n} fg_n^(r)(φ_n) - (F*g)^(r+1)(φ_n) = O((c_n n^{-γ})^d) = O(n^{d(c-γ)}) = o(1).
By Theorem 3.2 and Kronecker's Lemma, we have
n^{-1} Σ_j E_{F_j} fg_j^(r)(φ_j) → (F*g)^(r+1)(φ) a.s.
We will have proved (20) with Chow's Theorem if
(22) Σ n^{-t} E_{F_n} |fg_n^(r)(φ_n) - E_{F_n} fg_n^(r)(φ_n)|^t < ∞ a.s.
holds for some t such that 0 < t ≤ 2. If we can find such a t > 1, we need only show
(23) Σ n^{-t} E_{F_n} |fg_n^(r)(φ_n)|^t < ∞ a.s.
This is because we can apply the algebraic inequality (a+b)^t ≤ 2^t(a^t + b^t) for all nonnegative constants a and b, and Σ n^{-t} < ∞ for t > 1. Now,
E_{F_n} |fg_n^(r)(φ_n)|^t = O((c_n n^{-γ})^{1-t(r+1)}) = O(n^{(-b-γ)(1-t(r+1))})
from (6). We will have shown (23) if we can find t ∈ (1,2) such that
(24) -t - (b+γ){1 - t(r+1)} < -1.
First note that b < rγ/(r+1) iff 1 + b/{γ(γ+b)} < r+1. We can pick a z such that 1 + b/{γ(γ+b)} < z < r+1 and let
t = {1 - z(b+γ)}/{1 - (r+1)(b+γ)}.
Easy algebraic calculations show that the resulting t ∈ (1,2) and satisfies (24). Thus we have proved (20).
To show (21), we define a martingale difference
W_n = c_n n^{-γ}[S_n² - E_{F_n} S_n²].
For the p in A6, note that
E_{F_n} |S_n|^t ≤ K_1 + K_2 E_{F_n} |fg_n(φ_n)|^t ≤ K_1 + K_3 (c_n n^{-γ})^{-(t-1)}.
By the theorem on martingale differences due to Chow (1965), for 0 < t ≤ 2,
Σ n^{-t} E_{F_n} |W_n|^t < ∞ a.s. implies convergence. Since (γ+b)(t-1) - t < -1 iff t > 1, we choose t = min(2, p/2), giving
(25) n^{-1} Σ_j c_j j^{-γ}[S_j² - E_{F_j} S_j²] → 0 a.s.
Thus, if we prove
(26) c_n n^{-γ} E_{F_n} S_n² → A²(φ)(F*g)^(1)(φ) ∫_0^1 k²(y) dy a.s.,
this and (25) will give (21) and thus the lemma. From c_n n^{-γ} = O(1), the boundedness of E_{F_n} fg_n(φ_n), and Lemma 3.2(b),
lim c_n n^{-γ} E_{F_n} S_n² = lim c_n n^{-γ} E_{F_n}[fg_n(φ_n) A_n(φ_n)]²,
which gives (26). □
Because of the wealth of information in the density
estimator, we have succeeded in providing adaptive estimators
for the parameters of our stochastic approximation algorithm.
This is a much harder task for a general K-W situation.
We
now give the final theorem of this chapter, the culmination
of our previous efforts.
Theorem 3.3
Assume A1, A2, A6, A7, A10, A11, (5)-(7) and (17). Let γ = 1/(2r+1) and assume γ + b < 1/3. Define
T = C_opt^r (C_1 - C_2)(∫_0^{g(φ)} S(u) du)(F*g)^(r+1)(φ)(∫_0^1 y^r k(y)/r! dy),
μ_2 = 2 A_opt T/(1 + γ), and
σ_2² = 2 A_opt² C_opt^{-1} (C_1 - C_2)²(∫_0^{g(φ)} u S(u) du)(F*g)^(1)(φ)(∫_0^1 k²(y) dy)/(1 + γ).
Then, for the procedure described in (7),
n^{(1-γ)/2}(φ_n - φ) →_D N(μ_2, σ_2²),
where A_opt and C_opt are given in (3) and (4).
Proof:
With the preceding lemmas, the proof is a straightforward application of Fabian's (1968ii) result. By a Taylor-series expansion,
g'(φ_n)M(g(φ_n)) = (φ_n - φ)[g''(θ)M(g(θ)) + (g'(θ))² M'(g(θ))]
for some θ such that |θ - φ| ≤ |φ_n - φ|. This, (7), (17) and Theorem 3.2 give
(28) φ_{n+1} - φ = (φ_n - φ)(1 - n^{-1}Γ_n) + n^{-1+γ/2} β_n V_n + n^{-3/2+γ/2} T_n,
where
Γ_n = a_n[g''(θ)M(g(θ)) + (g'(θ))² M'(g(θ))] → Γ = 1,
β_n = a_n c_n^{-1/2} → A_opt C_opt^{-1/2}, and
T_n = a_n n^{(1-γ)/2} Δ_n → A_opt T.
Since E_{F_n} V_n = 0, the proof of the theorem will be complete upon showing
(29) E_{F_n} V_n² → S² a.s., and
(30) E[I(V_n² > ρn) V_n²] → 0 for each ρ.
By the boundedness of g', A, B, B_n and E_{F_n} fg_n(φ_n), and since
c_n n^{-γ} E_{F_n}(fg_n(φ_n))² = (c_n n^{-γ})^{-1} ∫ k²[(g^{-1}(s) - φ_n)/(c_n n^{-γ})] f(s) ds
= ∫_0^1 k²(y)(F*g)^(1)(φ_n + c_n n^{-γ} y) dy
is bounded, we get the boundedness of E_{F_n} V_n², and (29) follows by (26). To show (30), by Lemma 2.7, we need only show, for t ≤ p,
(31) E((c_n n^{-γ})^{1/2} V_n)^t → 0.
By Lemma 2.4,
E_{F_n} V_n^t ≤ c_n^{t/2} n^{-γt/2}[K_1 + K_2 E_{F_n}|fg_n(φ_n)|^t + K_3 |E_{F_n} fg_n(φ_n)|^t].
Now, c_n n^{-γ}, E_{F_n} fg_n(φ_n) and |c_n n^{-γ} fg_n(φ_n)|^t are bounded. Since E_{F_n}((c_n n^{-γ})^{1/2} V_n)^t = O(1), the bounded version of Lebesgue's Dominated Convergence Theorem proves (31) and hence the result. □
CHAPTER 4
ANOTHER APPLICATION OF SA: ESTIMATING THE MODE

§4.1 Introduction
The asymptotic rate of convergence of the sequential procedures introduced in Chapter 2 and refined in Chapter 3 is largely dominated by the behavior of the density estimator. For this reason, and because density estimation is such an important problem in statistical estimation, in this chapter we focus on estimators such that the p-th derivative of the density at the estimator converges (in some sense) to a specified value. These estimators are obtained in a sequential manner via stochastic approximation and are in some ways better than those currently existing in the literature.
More specifically, let f^(p) denote the p-th derivative of the density function f, p = 0, 1, .... Fixing α ∈ R, the goal is to find x_0 ∈ R such that f^(p)(x_0) = α. Without loss of generality, we take α = 0. As motivation, note that if we impose some mild restrictions on f and let p = 1, our formulation is the problem of finding the mode of a distribution function. This is a difficult problem for unknown densities which cannot be assumed to be symmetric. We assume the range of the density is R and that the x_0 we seek is unique. In many cases, x_0 will not be unique over the whole real line, as in the case of finding points of inflection of a symmetric distribution. In this case, we assume there exist known constants a and b with a < b (where a or b may take on infinite values) such that x_0 is the unique root in [a, b]. We may then apply a truncated procedure as introduced in §2.4.
The procedure proposed will be shown to have the same rate of convergence as the best estimator among those known, such as the one introduced by Eddy (1980), which is described below. However, our procedure solves a broader class of problems. Further, using well-known techniques of stochastic approximation, we easily handle the multivariate version and introduce certain types of dependency structures into the observations.
According to the procedure proposed by Eddy, let {a_n} be a nonincreasing sequence of positive constants and k a bounded function (k is called the kernel). A kernel estimator for the density f is
f_n(t) = (n a_n)^{-1} Σ_{i=1}^n k((X_i - t)/a_n),
where {X_i} is an i.i.d. sample from a distribution having density f. Define the functional
M(f) = inf{t ∈ R: f(t) = sup_s f(s)}.
Let θ = M(f) and θ_n = M(f_n), where, under suitable conditions on f, θ is the mode of f and θ_n is an estimator of the mode. For a class of kernel functions k and sequences {a_n}, Eddy demonstrated the asymptotic normality of θ_n - θ, suitably standardized. These rates are superior to earlier efforts (cf. Chernoff, 1964).
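A minimal sketch of this kernel-argmax idea: form f_n on a grid and evaluate the functional M(f_n) as the leftmost grid maximizer. The Gaussian kernel, the Weibull(2) sample (mode 1/sqrt(2) ≈ 0.707) and the bandwidth are illustrative assumptions and not Eddy's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sqrt(rng.exponential(size=4000))     # Weibull(2) sample; true mode 1/sqrt(2)

def kde(t, data, h):
    """f_n(t) = (n h)^{-1} sum_i phi((X_i - t)/h), phi the standard normal pdf."""
    u = (data[None, :] - t[:, None]) / h
    return np.exp(-0.5 * u * u).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(0.0, 3.0, 601)
mode_hat = grid[np.argmax(kde(grid, x, h=0.2))]   # M(f_n): leftmost maximizer
print(round(float(mode_hat), 2))
```

In the multivariate case this grid search is exactly the computational burden that motivates the stochastic approximation alternative below.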
As Fritz (1973) argued, stochastic approximation is a natural procedure for estimating the mode. Especially in a multivariate situation, there could be computational difficulties in calculating an empirical density and using its maximum as an estimator of the mode. This is essentially the method used by Eddy in the univariate case. The multivariate case is examined by Samanta (1973), Konankov (1974) and Sager (1978), among others. Fritz used the kernel functions of Bhattacharya (1967) and showed that his m-dimensional estimators of the mode converged almost surely. This chapter uses more sophisticated estimators given by Singh (1977). Almost sure convergence and convergence in distribution (nondegenerate) for suitably standardized estimators is achieved here. Fritz's process has some additional interesting features which we remark on in §4.4.
§4.2 Notation and Assumptions
Let B_0 be the class of all Borel-measurable real-valued functions k where k is bounded and equal to zero outside [0,1]. For fixed integers p and r, where 0 ≤ p < r, define
K = {k ∈ B_0: (1/j!) ∫_0^1 y^j k(y) dy = 1 if j = p, 0 if j ≠ p, j = 0, 1, ..., r-1}.
We assume there exist sequences of positive decreasing constants {a_n} and {c_n}. Define the estimator of f^(p), the p-th derivative of f, by
(1) f_n^(p)(x) = k((Z_n - x)/c_n)/c_n^{p+1},
where {Z_n} is an i.i.d. sequence of random variables having density f. We note that the definition in (1) allows for negative estimates of the density. This flexibility considerably enhances the rate of convergence which we will be able to achieve (cf. Singh, 1977).
Take X_1 to be an arbitrary random variable with finite second moment and let F_n = σ(X_1, Z_1, ..., Z_{n-1}). Define the remainder of the sequence {X_n} by the recursive equation
(2) X_{n+1} = X_n - a_n f_n^(p)(X_n).
We show in the succeeding sections strong consistency of the sequence {X_n} to x_0 and calculate rates of convergence. We begin by listing the most important assumptions.
B1. For some integers p and r, where 0 ≤ p < r, we assume f(x) and f^(r)(x) exist for each x and are bounded on the entire real line.
B2. For each x ≠ x_0, (x - x_0) f^(p)(x) > 0.
B3. lim c_n = 0, Σ_1^∞ a_n = ∞, Σ_1^∞ a_n c_n^{r-p} < ∞ and Σ_1^∞ a_n² c_n^{-(2p+1)} < ∞.
B4. f^(p+1) is bounded on R. f, f^(p+1) and f^(r) are continuous at x_0.
B5. For some A, C ≥ 0 and γ = 1/(2r+1), let a_n = A n^{-1} and c_n = C n^{-γ}. Assume that γ(r-p) < Γ_0, where Γ_0 = A f^(p+1)(x_0).
Remark: While B2 may appear strange (for p = 0) since densities are always nonnegative, recall that we are really interested in finding f(x_0) = α for some α.
Let T_0 = f^(r)(x_0) ∫_0^1 y^r k(y)/r! dy, which will turn out to be a factor in our asymptotic bias. The asymptotic variance of X_n will be proportional to Σ_0, where
Σ_0 = f(x_0) ∫_0^1 k²(y) dy.
§4.3 Univariate Results and Remarks
Because the univariate results are most often used in practice and investigated in the literature, we give these results separately in this section. Their proofs will be direct corollaries of the multivariate version given in §4.4.
Theorem 4.1
Assume B1-B3. Then, for the procedure defined in (2), X_n → x_0 a.s.
Theorem 4.2
Assume B1, B2, B4 and B5. Then, for the procedure defined in (2), we have
n^{(r-p)/(2r+1)}(X_n - x_0) →_D N(μ_3, σ_3²),
where
μ_3 = A C^{r-p} T_0/(Γ_0 - γ(r-p))
σ_3² = A² C^{-(2p+1)} Σ_0/(2Γ_0 - 2γ(r-p)).
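A univariate sketch of recursion (2) for mode estimation (p = 1, r = 2), using the kernel k(y) = 12y - 6, which lies in K for p = 1 (its integral is 0 and ∫ y k(y) dy = 1). Since f' decreases through zero at the mode, the step sign is flipped relative to (2); equivalently, the procedure is applied to -f' so that B2 holds. The Weibull(2) target (mode 1/sqrt(2)) and all tuning constants are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
A, Cc, gamma = 1.0, 1.0, 0.2            # a_n = A/n, c_n = Cc * n^{-1/(2r+1)}, r = 2
draw = lambda: float(np.sqrt(rng.exponential()))   # Weibull(2); mode = 1/sqrt(2)

x, tail = 1.5, []
for n in range(1, 200001):
    c = Cc * n ** (-gamma)
    y = (draw() - x) / c
    # f_n^{(1)}(x) = k((Z_n - x)/c_n)/c_n^2 with k(y) = 12y - 6 supported on [0,1]
    deriv = (12.0 * y - 6.0) / (c * c) if 0.0 <= y <= 1.0 else 0.0
    x = min(max(x + (A / n) * deriv, 0.05), 3.0)   # '+' climbs toward the mode
    if n > 150000:
        tail.append(x)

mode_hat = sum(tail) / len(tail)
print(round(mode_hat, 2))
```

The slow n^{(r-p)/(2r+1)} = n^{1/5} rate of Theorem 4.2 is visible here: even after many steps the estimate carries noticeable noise and a small kernel-induced bias.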
For p = 1, we achieve the same rates of convergence as Eddy but using weaker conditions. Eddy required the existence of a bounded (r+1)st derivative. In fact, if we assume Eddy's conditions we get better rates. These rates are optimal in the sense that they are the same as given by Stone (1980) under similar conditions. The choice of the kernel is an open question.
Although we have chosen [0,1] to be the interval of support for our class of kernel estimators, the bandwidth is actually controlled by our choice of c_n = C n^{-γ}. We have specified γ to achieve a best rate of convergence, but we still have the flexibility of specifying C > 0. One criterion that has been proposed is to minimize the asymptotic mean square error.
By Theorem 4.2, we have that under certain conditions n^{(r-p)/(2r+1)}(X_n - x_0) →_D X, where X is a normally distributed random variable with mean A C^{r-p} T_0/(Γ_0 - γ(r-p)) and variance A² C^{-(2p+1)} Σ_0/(2Γ_0 - 2γ(r-p)). The criterion of selecting A and C to minimize E X² was suggested in a stochastic approximation scheme by Abdelhamid (1973). Elementary calculations show that
A = r/{(r-p) f^(p+1)(x_0)}
C^{2r+1} = [(2p+1)(r+p+1)(1-γ) Σ_0]/[8(r-p) T_0²]
E X² = A² Σ_0/{4 C^{2p+1} r (r+p+1) γ²}.
Unfortunately, values for the optimal A and C are still given in terms of the unknown density f. In other SA problems, it has been suggested that A and C be replaced by random variables that converge almost surely to the best A and C, i.e., those that give the best asymptotic mean square error. See Fabian (1971) for a good review, or Lai and Robbins (1979) for a more recent approach. Adaptive estimators of A and C would not be hard to construct using the procedures introduced in §3.3.
We conjecture here that several generalizations of the procedures should be immediately possible (beyond the multivariate extension of §4.4).
(a) Chapter 4 uses only one type of kernel estimator. More general results are possible, so that many types of density estimators would be applicable, as in §3.1.
(b) Papers in the previous literature have made the assumption that {Z_n} is an i.i.d. sequence. This independence assumption may not be necessary when using stochastic approximation estimators. The basis of the proof in §4.5 will apply Fabian's (1968ii) result. Using results of Ruppert (1982), these types of dependencies may be further weakened.
(c) In the situation where X_n is a lifetime random variable, the estimator of §4.3 uses information on the lifetime of a unit only up to X_n + c_n. This truncated procedure could result in potential cost savings if, for example, the cost of the experiment were in some way related to how long a unit survives, e.g., clinical trials. Lai and Robbins (1979) have formalized this idea with respect to the R-M procedure and shown optimality properties. This should be possible for the sequential mode estimator as well.
§4.4 Multivariate Analogs
In this section we reformulate the results of §4.2 in a multivariate setting. §4.5 proves these results in this more general setting. Let f be a probability density function defined on R^m (m-dimensional Euclidean space) and define
D(x) = (∂^{p_1} f(x)/∂x_1^{p_1}, ∂^{p_2} f(x)/∂x_2^{p_2}, ..., ∂^{p_m} f(x)/∂x_m^{p_m})',
where 0 ≤ p_i < r, i = 1, ..., m, for integers p_i and r. The prime denotes transpose and x_i represents the i-th entry of the column vector x. We shall use an underscore to emphasize a vector. For a fixed α = (α_1, ..., α_m)' ∈ R^m, we wish to find that x_0 ∈ R^m such that D(x_0) = α, where x_0 is assumed unique. If p_1 = ... = p_m = 0, this is the problem of finding the point at which the density takes on a certain value α. For p_1 = ... = p_m = 1 and α = (0, ..., 0)', this is the problem of estimating the mode of a multivariate density. This is a straightforward generalization that handles all the important problems. Other extensions should be possible, say, in finding levels of mixed partial derivatives of a density (cf. Singh, 1976).
Let {Z_i} be an i.i.d. sample having density f and denote Z_j = (Z_{1,j}, ..., Z_{m,j})'. We define our kernel functions k_i by requiring that they be bounded, measurable functions from R to R, equal to 0 outside [0,1], and such that
(1/j!) ∫_0^1 y^j k_i(y) dy = 1 if j = p_i, 0 if j ≠ p_i, j = 0, ..., r-1.
Our estimator of D(x) is
(3) D_n(x) = (f_n^{(p_1)}(x), ..., f_n^{(p_m)}(x))',
where
f_n^{(p_i)}(x) = c_n^{-p_i-m} k_i((Z_{i,n} - x_i)/c_n) Π_j k_0((Z_{j,n} - x_j)/c_n).
The product is over the set {j: 1 ≤ j ≤ m, i ≠ j}. By k_0(y) we mean k_i(y) with p_i = 0. Without loss of generality, take α = (0, ..., 0)'.
Let p_a = max{p_i} and let ||·|| denote the usual m-dimensional Euclidean norm. The stochastic approximation algorithm that we use is
(4) X_{n+1} = X_n - a_n D_n(X_n),
where X_1 is an arbitrary random vector such that E ||X_1||² < ∞.
Before giving results, it is convenient here to introduce some more notation. Let H denote the matrix of partial derivatives of D, i.e., H = (∂^{p_i+1} f(x)/∂x_j ∂x_i^{p_i})_{ij}. We assume that H is positive definite in a neighborhood of x_0. Let C_n, E ∈ R^{m×m} (the space of real m×m matrices) be such that C_n = diag(c_n^{p_i + m/2}) and E = diag(I(p_i = p_a)). Recall that I(·) is the indicator function. Use I for an m×m identity matrix. Let F_n = σ(X_1, Z_1, ..., Z_{n-1}). A conditional bias term is Δ_n = E_{F_n}(D_n(X_n) - D(X_n)). The asymptotic variance is proportional to Σ_0 = (σ_{ij})_{ij}, where
σ_{jj} = f(x_0)(∫_0^1 k_0²(y) dy)^{m-1}(∫_0^1 k_j²(y) dy), and
σ_{ij} = f(x_0)(∫_0^1 k_0²(y) dy)^{m-2}(∫_0^1 k_i(y) k_0(y) dy)(∫_0^1 k_j(y) k_0(y) dy), i ≠ j.
Define an orthogonal matrix P so that P'H(x_0)P is diagonal, say equal to Λ. For i = 1, ..., m, let
t_i = (1/r!)[∂^r f(x_0)/∂x_i^r ∫_0^1 y^r k_i(y) dy + ∫_0^1 y^r k_0(y) dy {Σ_{j=1}^m (∂^r f(x_0)/∂x_j^r) I(i ≠ j)}].
As in §4.2, a list of important assumptions is provided.
C1. Let r be some integer, and (p_1, ..., p_m) an m-tuple of integers such that 0 ≤ p_i < r, i = 1, ..., m. We assume that f(x) and ∂^r f(x)/∂x_i^r, i = 1, ..., m, exist for each x ∈ R^m and are bounded on R^m, and that ∂^{p_i} f(x)/∂x_i^{p_i}, i = 1, ..., m, exist for each x.
C2. inf{D(x)'(x - x_0): ||x - x_0|| > δ} > 0 for every δ > 0.
C3. c_n → 0, Σ_1^∞ a_n = ∞, Σ_1^∞ a_n c_n^{r-p_a} < ∞, and Σ_1^∞ a_n² c_n^{-(2p_a+m)} < ∞.
C4. H(x) exists for each x and is bounded. Let f, H, and ∂^r f(x)/∂x_i^r, i = 1, ..., m, be continuous at x_0.
C5. For some A, C ≥ 0, let γ = 1/(2r+m), a_n = A n^{-1}, and c_n = C n^{-γ}. Assume γ(r-p_a) < min_i λ_i(Γ_0), where Γ_0 = A H(x_0) and λ_i(Γ_0) is the i-th eigenvalue of Γ_0.
We now proceed to give the main results, the details of the proofs being relegated to the following section.
Theorem 4.3
Assume C1-C3. Then, for the procedure defined in (4), X_n → x_0 a.s.
Theorem 4.4
Assume C1, C2, C4 and C5. Then, for the procedure defined in (4),
n^{(r-p_a)/(2r+m)}(X_n - x_0) →_D N(μ_4, Σ_4),
where, with T_0 = (t_1, ..., t_m)',
μ_4 = A C^{r-p_a} [Γ_0 - I γ(r-p_a)]^{-1} E T_0
Σ_4 = A² C^{-(2p_a+m)} P M P'
M_{ij} = (P' E Σ_0 E P)_{ij} (Λ_{ii} + Λ_{jj} - 2γ(r-p_a))^{-1}.
We easily get a convergence result for a multivariate estimator of the mode.
Corollary 4.1
Assume C1, C2, C4, C5 and that p_1 = ... = p_m = 1. Then the sequence {X_n} defined in (4) is a strongly consistent estimator of the mode x_0, and
n^{(r-1)/(2r+m)}(X_n - x_0) →_D N(μ_5, Σ_5),
where
μ_5 = A C^{r-1} [A H(x_0) - I γ(r-1)]^{-1} T_0
Σ_5 = A² C^{-(m+2)} P M P'
M_{ij} = (P' Σ_0 P)_{ij} (Λ_{ii} + Λ_{jj} - 2γ(r-1))^{-1}.
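A two-dimensional sketch of recursion (4) for the mode (p_1 = p_2 = 1, r = 2, m = 2, γ = 1/(2r+m) = 1/6), using the product kernels of (3): k_1(y) = 12y - 6 on the differentiated coordinate and k_0(y) = 4 - 6y on the other. As in the univariate case the step sign is flipped so the iterates climb toward the mode. The independent Weibull(2) coordinates (mode (1/sqrt(2), 1/sqrt(2))) and all constants are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
A, Cc, gamma, m = 1.0, 1.5, 1.0 / 6.0, 2
k1 = lambda y: (12.0 * y - 6.0) * ((y >= 0.0) & (y <= 1.0))   # p = 1 member of K
k0 = lambda y: (4.0 - 6.0 * y) * ((y >= 0.0) & (y <= 1.0))    # p = 0 member of K

x = np.array([1.5, 0.3])
tail_sum, tail_cnt = np.zeros(m), 0
for n in range(1, 200001):
    c = Cc * n ** (-gamma)
    z = np.sqrt(rng.exponential(size=m))       # one bivariate observation
    y = (z - x) / c
    # D_n(x): entry i is c^{-p_i-m} k_i(y_i) times the product of k_0 over j != i
    d = np.array([k1(y[0]) * k0(y[1]), k0(y[0]) * k1(y[1])]) / c**3
    x = np.clip(x + (A / n) * d, 0.05, 3.0)    # truncated to a known box [a, b]^2
    if n > 100000:
        tail_sum += x
        tail_cnt += 1

mode_hat = tail_sum / tail_cnt
print(np.round(mode_hat, 2))
```

The n^{(r-1)/(2r+m)} = n^{1/6} rate of Corollary 4.1 is very slow, so the estimate is only coarse even after many steps; no grid over R² is ever built, which is the computational point of the SA formulation.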
We remark here that Fritz (1973) allowed for multimodal densities and introduced a stochastic approximation procedure that converged almost surely to one of the local maxima. Whether allowing for more than one local maximum inhibits convergence to a nondegenerate distribution, when suitably standardized, is an open question.

§4.5 Proofs
We begin the proofs of Theorems 4.3 and 4.4 with a few preparatory lemmas. Let f^{(p_i)}(x) = ∂^{p_i} f(x)/∂x_i^{p_i}. Unless otherwise indicated, all relationships between random variables are assumed to hold with probability one.
Lemma 4.1
Assume C1 and that there exists a sequence of constants {x_n} tending to x_0. Then,
(a) sup_x |E f_n^{(p_i)}(x) - f^{(p_i)}(x)| = O(c_n^{r-p_i}), and
(b) lim c_n^{-(r-p_i)} {E f_n^{(p_i)}(x_n) - f^{(p_i)}(x_n)} = t_i.
Proof:
Through an application of the transformation theorem, we have
E f_n^{(p_i)}(x) = c_n^{-p_i-m} ∫ k_i((s_i - x_i)/c_n) Π_j k_0((s_j - x_j)/c_n) f(s) ds
= c_n^{-p_i} ∫_{[0,1]^m} k_i(y_i) Π_j k_0(y_j) f(x + c_n y) dy.
Via a Taylor series expansion, f(x + c_n y) is a sum of terms of order less than r in c_n y plus a remainder
R*(θ) = c_n^r/r! Σ_{i_1, ..., i_r} ∂^r f(θ)/(∂x_{i_1} ... ∂x_{i_r}) Π_{j=1}^r y_{i_j}.
There is only one term with the power y_1^0 y_2^0 ... y_{i-1}^0 y_i^{p_i} y_{i+1}^0 ... y_m^0; this term has coefficient c_n^{p_i}/p_i! {∂^{p_i} f(x)/∂x_i^{p_i}}. Since k_i ∈ K,
E f_n^{(p_i)}(x) = c_n^{-p_i} ∫ k_i(y_i) Π_j k_0(y_j) {y_i^{p_i} c_n^{p_i}/p_i! f^{(p_i)}(x) + R*(θ)} dy
= f^{(p_i)}(x) + R_n(θ),
where
R_n(θ) = c_n^{-p_i} ∫ k_i(y_i) Π_j k_0(y_j) R*(θ) dy.
Result (a) follows directly from the boundedness of the k_i and f^(r). To prove (b),
lim_n c_n^{-(r-p_i)} [E f_n^{(p_i)}(x_n) - f^{(p_i)}(x_n)] = lim_n c_n^{-(r-p_i)} R_n(θ*)
for some θ* with ||θ* - x_n|| ≤ c_n ||y||, and, by the orthogonality of the kernels to lower powers, only the pure r-th order terms survive:
c_n^{-(r-p_i)} R_n(θ*) → Σ_{j=1}^m (1/r!)(∫_0^1 y^r k_{ij}(y) dy) ∂^r f(x_0)/∂x_j^r = t_i,
where k_{ij} = k_i if j = i and k_{ij} = k_0 if j ≠ i. From a simple application of the bounded convergence theorem, we get the result. □
Lemma 4.2

Under the assumptions of Lemma 4.1, for t > 0 there exists a constant M > 0 such that,

(a) sup_x c_n^{t(p_i+m)−m} E(f_n^{(p_i)}(x))^t < M,

(b) lim_n c_n^{t(p_i+m)−m} E(f_n^{(p_i)}(x_n))^t = f(x_0) ∫_{[0,1]^m} k_i^t(y_i) Π_{j≠i} k_0^t(y_j) dy.

Proof:

An easy application of the bounded convergence theorem. We have

c_n^{t(p_i+m)−m} E(f_n^{(p_i)}(x))^t = c_n^{−m} ∫ k_i^t((s_i − x_i)/c_n) Π_{j≠i} k_0^t((s_j − x_j)/c_n) f(s) ds
= ∫_{[0,1]^m} k_i^t(y_i) Π_{j≠i} k_0^t(y_j) f(x + c_n y) dy.

By the boundedness of f and k_i, result (a) is true. Further,

lim_n c_n^{t(p_i+m)−m} E(f_n^{(p_i)}(x_n))^t = ∫ k_i^t(y_i) Π_{j≠i} k_0^t(y_j) lim_n f(x_n + c_n y) dy,

which gives result (b). □
Lemma 4.3

Under the conditions of Lemma 4.1, for i ≠ j, and for some constant M > 0, we have,

(a) c_n^{p_i+p_j+m} E(f_n^{(p_i)}(x) f_n^{(p_j)}(x)) < M,

(b) lim_n c_n^{p_i+p_j+m} E(f_n^{(p_i)}(x_n) f_n^{(p_j)}(x_n)) = f(x_0) ∫_{[0,1]^m} k_i(y_i) k_0(y_i) k_j(y_j) k_0(y_j) Π_{k≠i,j} k_0^2(y_k) dy.

Proof:

For i ≠ j,

c_n^{p_i+p_j+m} E(f_n^{(p_i)}(x_n) f_n^{(p_j)}(x_n))
= c_n^{−m} ∫ k_i((s_i − x_{ni})/c_n) k_0((s_i − x_{ni})/c_n) k_j((s_j − x_{nj})/c_n) k_0((s_j − x_{nj})/c_n) Π_{k≠i,j} k_0^2((s_k − x_{nk})/c_n) f(s) ds
= ∫ k_i(y_i) k_0(y_i) k_j(y_j) k_0(y_j) Π_{k≠i,j} k_0^2(y_k) f(x_n + c_n y) dy.

We get result (a) by the boundedness of f and k_i. Result (b) is due to the bounded convergence theorem and the fact that x_n → x_0. □
Proof of Theorem 4.3:

By a Taylor-series expansion, D(X_n) = H(θ_n)(X_n − x_0), where ||θ_n − x_0|| ≤ ||X_n − x_0||. Let U_n = (X_n − x_0)′(X_n − x_0). Thus,

U_{n+1} = (X_{n+1} − X_n + X_n − x_0)′(X_{n+1} − X_n + X_n − x_0)
= U_n + a_n^2 D_n(X_n)′D_n(X_n) − 2 a_n D_n(X_n)′(X_n − x_0).

By Lemma 4.2(a) (t = 2), E_{F_n}(f_n^{(p_i)}(X_n))^2 ≤ K c_n^{−(2p_i+m)}. Thus,

E_{F_n} U_{n+1} ≤ U_n + K a_n^2 c_n^{−(2p_a+m)} − 2 a_n Λ_n′(X_n − x_0) − 2 a_n D(X_n)′(X_n − x_0)
≤ U_n (1 + 2 a_n ||Λ_n||) + 2 a_n ||Λ_n|| + K a_n^2 c_n^{−(2p_a+m)} − 2 a_n D(X_n)′(X_n − x_0).

From Lemma 4.1(a), we have ||Λ_n|| = O(c_n^{(r−p_a)}). The result is immediate from C3 and the theorem due to Robbins and Siegmund. □
Proof of Theorem 4.4:

We will show that the conditions of Fabian's (1968ii) theorem are met and use that result to give the conclusion of the theorem. Define

(5) V_n = c_n^{p_a+m/2} (D_n(X_n) − D(X_n) − Λ_n).

Obviously, E_{F_n} V_n = 0. By Theorem 4.3 and Lemma 4.1(a), D(X_n) = o(1) and Λ_n = O(c_n^{(r−p_a)}) = o(1). Since c_n = o(1), we have

(6) E_{F_n} V_n V_n′ = c_n^{2p_a+m} E_{F_n} D_n(X_n) D_n(X_n)′ + o(1).

By Lemmas 4.2(b) (t = 2) and 4.3(b), c_n^{2p_a+m} E_{F_n} D_n(X_n) D_n(X_n)′ tends to Σ_0, giving E_{F_n} V_n V_n′ → Σ_0. By Lemma 4.1(a), Λ_n is bounded; Lemma 4.2(a) (t = 1) gives E_{F_n} D_n(X_n) bounded, and D(X_n) is bounded due to C1. E_{F_n} D_n(X_n) D_n(X_n)′ is bounded by Lemma 4.2(a) (t = 2). Together, there exists a constant M such that

(7) ||E_{F_n} V_n V_n′|| < M for each n.

Now, define σ²_{n,r} = E[I(||V_n||² ≥ rn) ||V_n||²]. For t > 0,

||V_n||^t = c_n^{(2p_a+m)t/2} [{D_n(X_n) − D(X_n) − Λ_n}′{D_n(X_n) − D(X_n) − Λ_n}]^{t/2}
≤ K c_n^{(2p_a+m)t/2} [{D_n(X_n)′D_n(X_n)}^{t/2} + {2 D_n(X_n)′(D(X_n) + Λ_n)}^{t/2} + {(D(X_n) + Λ_n)′(D(X_n) + Λ_n)}^{t/2}]

for some positive constant K. D(X_n) + Λ_n is bounded and o(1). By the Cauchy-Schwarz inequality, to bound E_{F_n}||V_n||^t we need only bound c_n^{(2p_a+m)t/2} E_{F_n}||D_n(X_n)||^t, but this is obvious from Lemma 4.2(a), which gives E_{F_n}||V_n||^t = O(c_n^{−m(t−2)/2}). For a fixed t > 1, define s by s^{−1} + t^{−1} = 1. By Hölder's and Markov's inequalities,

σ²_{n,r} ≤ (E||V_n||^{2t})^{1/t} [P(||V_n||² ≥ rn)]^{1/s}
≤ (E||V_n||^{2t})^{1/t} [E||V_n||^{2t}/(rn)^t]^{1/s}
= O(c_n^{−m(t−1)}) n^{−t/s}
= O(n^{(γm−1)(t−1)})
= o(1)

for each r, since γm < 1; in particular, t = 1 + m/(2r) will do. Thus we get,

(8) σ²_{n,r} → 0 as n → ∞.

Let U_n = X_n − x_0. From (4), we get

U_{n+1} = U_n − a_n D_n(X_n)
= U_n − a_n (c_n^{−(p_a+m/2)} V_n + D(X_n) + Λ_n)
= [I − a_n H(θ_n)] U_n − a_n c_n^{−(p_a+m/2)} V_n − a_n Λ_n,

where ||θ_n − x_0|| ≤ ||X_n − x_0||. Let Γ_n = A H(θ_n); then Γ_n → A H(x_0) = Γ_0 by C1 and Theorem 4.3. Let T_n = A n^{γ(r−p_a)} Λ_n; by Lemma 4.1(b), T_n → A C^{−(r−p_a)} t_0. With a_n = A n^{−1} and c_n = C n^{−γ}, γ = 1/(2r+m), we may write

U_{n+1} = [I − n^{−1} Γ_n] U_n − n^{−1/2−γ(r−p_a)} Φ_n V_n − n^{−1−γ(r−p_a)} T_n,

where Φ_n → Φ = A C^{−(p_a+m/2)}.
We have thus constructed a recursive algorithm that is in the form specified by Fabian (1968ii) with the sufficient properties (5)-(8). Our Theorem 4.4 is an immediate corollary of his result. □
CHAPTER 5
SA - SOME EXTENSIONS OF THE THEORY

§5.1 Adaptive K-W procedures
Adaptive stochastic approximation procedures have received attention recently in the literature in the Robbins-Monro case, cf. Anbar (1978), Lai and Robbins (1978, 1979 and 1981). As stated in §2.3, much of this work was motivated by the paper of Venter (1967). Some attempts have been made to efficiently adapt the Kiefer-Wolfowitz (K-W) procedure (cf., Fabian, 1971). In this section, conditions are given for achieving an optimal K-W procedure (optimal in a sense to be defined).
Consider the Robbins-Monro (R-M) procedure,

(1) X_{n+1} = X_n − A n^{−1} Y_n

where Y_n is a conditionally (given X_1,…,X_n) unbiased estimator of f(X_n) and A is a fixed, known constant. The recursive procedure described in (1) produces a sequence of estimators for finding the root θ of f. Under mild conditions, the following asymptotic properties hold,

(2) lim_{n→∞} X_n = θ a.s.

(3) n^{1/2}(X_n − θ) →_D N(0, A^2 σ^2/(2 A f^{(1)}(θ) − 1)),

where σ^2 = lim E_F([Y_n − f(X_n)]^2 | X_1,…,X_n). The choice of A to minimize the asymptotic mean square error (MSE) is (f^{(1)}(θ))^{−1}.
For an f(.) known up to a finite number of parameters, it was proposed by Albert and Gardner (1967) and Sakrison (1965), under different scenarios, to replace A by a sequence of estimators. Sufficient conditions were given to retain properties (2) and (3) for the modified procedure. Venter (1967) proposed modifying the procedure in (1) in the case where f is unknown to get estimators of A. Under certain conditions, this modified algorithm was shown to have the asymptotic properties (2) and (3), with A = (f^{(1)}(θ))^{−1}. Fabian (1968i) weakened the conditions imposed by Venter. Alternative least squares estimators have been proposed by Anbar (1978) and Lai and Robbins (1978), whose properties have been further investigated by Lai and Robbins (1979 and 1981).
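As a toy illustration of (1)-(3) (a sketch of my own: the linear f, the gain A, and the noise level are assumptions), the R-M recursion can be run directly:

```python
import numpy as np

rng = np.random.default_rng(1)

def robbins_monro(noisy_f, x1=0.0, A=1.0, n_steps=5000):
    """R-M recursion X_{n+1} = X_n - (A/n) Y_n, with Y_n a conditionally
    unbiased noisy observation of f(X_n)."""
    x = x1
    for n in range(1, n_steps + 1):
        x = x - (A / n) * noisy_f(x)
    return x

theta = 1.5
# f(x) = x - theta has root theta and f'(theta) = 1, so the MSE-optimal
# gain is A = (f'(theta))**-1 = 1.
root = robbins_monro(lambda x: (x - theta) + rng.normal(0.0, 1.0))
```

With this choice the asymptotic variance A^2 σ^2/(2 A f^{(1)}(θ) − 1) reduces to σ^2, i.e. the estimator behaves like a sample mean of n observations.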
For the K-W procedure, define,

(4) X_{n+1} = X_n − A n^{−1} Y_n

where Y_n is a conditionally (given X_1,…,X_n) unbiased observation of [f(X_n + c_n) − f(X_n − c_n)]/(2 c_n). Let {c_n} be a sequence of fixed, known constants such that c_n = C n^{−γ}. Assume that θ is the unique maximum of f and

σ_i^2 = lim_n E([Y_in − f(X_n − (−1)^i c_n)]^2 | X_1,…,X_n), i = 1, 2.

Then, under certain mild conditions,

(5) lim_{n→∞} X_n = θ a.s.,

(6) n^{1/3}(X_n − θ) →_D N(μ_6, A^2 C^{−2} σ^2/[2 A f^{(2)}(θ) − 2/3]),

where γ = 1/6 and μ_6 = A C^2 f^{(3)}(θ)/[6(2 A f^{(2)}(θ) − 2/3)]. To minimize the mean square error of the asymptotic distribution, elementary calculations show that we should choose parameters A and C such that

(7) A_opt = (f^{(2)}(θ))^{−1} and C_opt = [24 σ^2/(f^{(3)}(θ))^2]^{1/6}.

Unfortunately, values for f^{(2)}(θ), f^{(3)}(θ) and σ^2 are usually not known prior to conducting the experiment.
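A direct simulation of (4)-(7) (again a sketch of my own; the quadratic f and noise scale are assumptions) shows the role of the optimal constants:

```python
import numpy as np

rng = np.random.default_rng(2)

def kiefer_wolfowitz(noisy_f, x1=0.0, A=-1.0, C=1.0, n_steps=4000):
    """K-W recursion X_{n+1} = X_n - (A/n) Y_n with Y_n the noisy central
    difference [f(X_n + c_n) - f(X_n - c_n)]/(2 c_n), c_n = C n**(-1/6).
    At a maximum f''(theta) < 0, so the optimal gain A = 1/f''(theta) is
    negative."""
    x = x1
    for n in range(1, n_steps + 1):
        c = C * n ** (-1.0 / 6.0)
        y = (noisy_f(x + c) - noisy_f(x - c)) / (2.0 * c)
        x = x - (A / n) * y
    return x

# f(x) = -(x - 1)^2/2 has maximum at theta = 1, f''(theta) = -1, A_opt = -1
xhat = kiefer_wolfowitz(lambda x: -(x - 1.0) ** 2 / 2.0 + rng.normal(0.0, 0.1))
```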
For a more sophisticated estimator of f^{(1)}(X_n), and in the multivariate case, Fabian (1971) demonstrated how the algorithm in (4) may be modified so that A may be replaced with a strongly consistent estimator such that (5) and (6) continue to hold. His modifications start with {m_k}, a sequence of positive integers increasing to ∞. At each stage n = m_k, k = 1, 2,…, take additional observations to estimate a_n* and c_n* (Fabian did not concern himself with estimating C_opt with c_n*; however, the extension is immediate). The {m_k} are chosen going to infinity as k → ∞, but at a slow rate compared to the rate of convergence of the process, so as to be asymptotically negligible. A major drawback with Fabian's approach is that it requires taking extra observations at various stages. In Chapter 3, an adaptive procedure was developed for the ARP problem without taking additional observations at various stages. The goal of this section is to give some sufficient conditions so that (5) and (6) hold with the best possible choices of A and C without taking additional observations. Along the lines of Lai and Robbins, this procedure will be called an adaptive Kiefer-Wolfowitz procedure.
We wish to find the (assumed) unique maximum of the function f, say θ. Let X_1 be an arbitrary random variable with finite second moment. We use sequences of random variables {X_n}, {c_n*}, {a_n*} and {Y_in}, i = 1, 2, where Y_1n is independent of Y_2n. Define F_n = σ(X_1,…,X_{n−1}), and assume c_n* and a_n* are F_n-measurable. For some positive constants Z_1, Z_2, Z_3, Z_4, a, b, c and γ, let

(8) a_n = (Z_1 (log n)^{−1} ∨ a_n*) ∧ Z_2 n^a

(9) c_n = (Z_3 n^{−b} ∨ c_n*) ∧ Z_4 n^c

where a/2 + c < γ, a + b + γ < 1/2 and γ = 1/6. Further, let Y_n = (Y_1n − Y_2n)/(2 c_n n^{−γ}). The procedure is given by

(10) X_{n+1} = X_n − n^{−1} a_n Y_n.

We make the following assumptions:
D1. E_{F_n} Y_1n = f(X_n + c_n n^{−γ}) and E_{F_n} Y_2n = f(X_n − c_n n^{−γ}).

D2. For some positive constant σ^2, E_{F_n}[Y_in − f(X_n − (−1)^i c_n n^{−γ})]^2 → σ^2 a.s., i = 1, 2.

D3. The function f has two continuous derivatives and sup_x |f^{(2)}(x)| < ∞.

D4. For each x ∈ R, and some d > 0, f^{(3)}(x) = f^{(3)}(θ) + O(|x − θ|^d).

D5. For some t > 2, E_{F_n}[Y_in − f(X_n − (−1)^i c_n n^{−γ})]^t < ∞ a.s., i = 1, 2.

D6. For each ε > 0, inf{(θ − x) f^{(1)}(x): |x − θ| > ε} > 0 and inf{f(θ) − f(x): |x − θ| > ε} > 0.

D7. a_n* → A_opt and c_n* → C_opt a.s.
Theorem 5.1

Assume D1-D6. For the procedure described in (10),

(11) lim_{n→∞} X_n = θ a.s.
Proof:

By a Taylor-series expansion,

E_{F_n} Y_n = [f(X_n + c_n n^{−γ}) − f(X_n − c_n n^{−γ})]/(2 c_n n^{−γ})
= f^{(1)}(X_n) + (c_n n^{−γ})^3/(3!·2 c_n n^{−γ}) [f^{(3)}(θ_1) + f^{(3)}(θ_2)],

where |θ_i − X_n| ≤ c_n n^{−γ}. Let B_n = (c_n n^{−γ})^2/12 [f^{(3)}(θ_1) + f^{(3)}(θ_2)] and ξ_n = Y_n − E_{F_n} Y_n, giving

(12) X_{n+1} − X_n = −n^{−1} a_n (f^{(1)}(X_n) + B_n + ξ_n).

Squaring and taking conditional expectations with respect to F_n gives,

E_{F_n}(X_{n+1} − θ)^2 = (X_n − θ)^2 − 2 n^{−1} a_n (X_n − θ)(f^{(1)}(X_n) + B_n) + n^{−2} a_n^2 [(f^{(1)}(X_n) + B_n)^2 + E_{F_n} ξ_n^2].

By D4,

|B_n| ≤ O(n^{2(c−γ)}) [2 f^{(3)}(θ) + |θ_1 − θ|^d + |θ_2 − θ|^d]
≤ O(n^{2(c−γ)}) [2 f^{(3)}(θ) + 2 (c_n n^{−γ})^d + 2 |X_n − θ|^d].

Thus,

(13) E_{F_n}(X_{n+1} − θ)^2 ≤ (X_n − θ)^2 (1 + β_n) − 2 n^{−1} a_n (X_n − θ) f^{(1)}(X_n) + β_n, where Σ_n β_n < ∞.

Indeed, since f^{(1)}(X_n) = f^{(1)}(θ) + (X_n − θ) f^{(2)}(η) ≤ K |X_n − θ|, we get |B_n f^{(1)}(X_n)| ≤ O(n^{2(c−γ)})[O(1) + O((X_n − θ)^2)] and (f^{(1)}(X_n))^2 ≤ O((X_n − θ)^2); further, E_{F_n} ξ_n^2 = O(n^{2(b+γ)}), and the exponent restrictions below (9) make the corresponding coefficients summable. From (13), D6 and the Robbins-Siegmund theorem, we get the result. □
Thus, we have established convergence of the algorithm in (10) for a wide range of sequences {a_n*} and {c_n*}. Our goal is to specify procedures for calculating a_n* and c_n* such that

(14) a_n* → A_opt and c_n* → C_opt a.s.,

and such that (5) and (6) hold. This turns out to be quite difficult for the general K-W procedure, so we mention this as an open problem. However, for more specific problems, such as the ARP and estimation of the mode, these estimators are easily calculated. This is due to the fact that both problems revolve around density estimation.
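For intuition only, here is a crude adaptive variant in the spirit of (8)-(10): a running second-difference estimate of f''(θ) is truncated and plugged in as a_n*. The curvature estimator and the truncation constants are my own stand-ins; the estimators of the text come from density estimation, not from second differences.

```python
import numpy as np

rng = np.random.default_rng(3)

def adaptive_kw(noisy_f, x1=0.0, gamma=1.0 / 6.0, n_steps=4000):
    x, d2_sum = x1, 0.0
    for n in range(1, n_steps + 1):
        c = n ** (-gamma)
        yp, y0, ym = noisy_f(x + c), noisy_f(x), noisy_f(x - c)
        d2_sum += (yp - 2.0 * y0 + ym) / c ** 2     # noisy estimate of f''
        d2_bar = min(d2_sum / n, -0.1)              # keep it negative at a maximum
        a_star = 1.0 / d2_bar                       # plug-in gain, cf. A_opt
        # truncation in the spirit of (8)
        a_n = -np.clip(abs(a_star), 1.0 / np.log(n + 2), np.log(n + 2))
        x = x - (a_n / n) * (yp - ym) / (2.0 * c)
    return x

xhat = adaptive_kw(lambda x: -(x - 1.0) ** 2 / 2.0 + rng.normal(0.0, 0.1))
```

Theorem 5.2 below shows that estimators satisfying D7 let such a plug-in procedure attain the same limit law as the oracle choice (7).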
For the purposes of the following theorem, we shall assume that there exist consistent, F_n-measurable sequences a_n* and c_n* (D7). We remark here that upon completion of the nth trial, the observations Y_1n and Y_2n are available and should be used in the computation of a_n and c_n prior to calculating X_{n+1}. This, however, destroys the expedient martingale property. Lai and Robbins (1981) note this same problem for their adaptive R-M scenario, and with great effort give sufficient conditions to use this extra information. We merely note the dilemma and label it as an area for future research.
We now prove the optimality of the modified K-W procedure in the following

Theorem 5.2

Assume D1-D7. Then, for the procedure described in (10),

(15) n^{1/3}(X_n − θ) →_D N(μ_6, A_opt^2 C_opt^{−2} σ^2/[2 A_opt f^{(2)}(θ) − 2/3])

where μ_6 = A_opt C_opt^2 f^{(3)}(θ)/[6(2 A_opt f^{(2)}(θ) − 2/3)], and A_opt and C_opt are given in (7).
Proof:

We apply Fabian's (1968ii) classic result on the asymptotic normality of stochastic approximation processes. By a Taylor-series expansion, f^{(1)}(X_n) = (X_n − θ) f^{(2)}(θ~) for some θ~ such that |θ~ − θ| ≤ |X_n − θ|. This, (12), and Theorem 5.1, give

(16) (X_{n+1} − θ) = (X_n − θ)(1 − n^{−1} Γ_n) − n^{−5/6} Φ_n V_n − n^{−1−1/3} T_n,

where

Γ_n = a_n f^{(2)}(θ~) → A_opt f^{(2)}(θ) = f^{(2)}(θ)/f^{(2)}(θ) = 1,
Φ_n = a_n/(√2 c_n) → Φ = (C_opt √2 f^{(2)}(θ))^{−1},
V_n = √2 c_n n^{−1/6} ξ_n,
T_n = n^{1/3} a_n B_n, with E_{F_n} T_n → C_opt^2 f^{(3)}(θ)/(6 f^{(2)}(θ)).

From D2, E_{F_n} V_n^2 converges to a positive constant. Thus, we need only show

(17) E I(V_n^2 ≥ rn) V_n^2 → 0.

By D5, E_{F_n} ξ_n^t = O(c_n^{−t} n^{γt}), giving E_{F_n} V_n^t = O(1). Define s by s^{−1} + 2 t^{−1} = 1 and use Hölder's inequality to get,

E_{F_n} V_n^2 I(V_n^2 ≥ rn) ≤ (E_{F_n} V_n^t)^{2/t} (E_{F_n} I(V_n^2 ≥ rn))^{1/s}
≤ (E_{F_n} V_n^t)^{2/t} (E_{F_n} V_n^t/(rn)^{t/2})^{1/s}
= O(1) n^{−t/(2s)}
= o(1).

Since V_n^2 I(V_n^2 ≥ rn) is dominated by V_n^2 and E V_n^2 < ∞, we apply Lebesgue's dominated convergence theorem to get (17). □
§5.2 SA - Representation Theorem

We consider stochastic approximation processes of the form

(18) X_{n+1} = X_n − n^{−1}(f(X_n) + n^{τ−1/2} B_n + n^{τ} V_n)

where X_1 is an arbitrary random vector (m×1) with finite second moments, f is an unknown function from R^m to R^m, B_n and V_n are random vectors in R^m, and τ ∈ R with 0 ≤ τ ≤ 1/2. The algorithm in (18) is a special case of that studied by Kushner (1977) and Ljung (1978) for finding the (assumed) unique θ such that f(θ) = 0. One may think of the vector V_n as the random error at the nth stage, due to the fact that f(X_n) can only be measured with error. The B_n vector may be thought of as a bias term in the measurement of f(X_n). For example, if m = 1 and F(x) = ∫_{−∞}^x f(t) dt can be measured with error, we may use [F(X_n + h_n) − F(X_n − h_n)]/(2 h_n) as an approximation to f(X_n) for some small h_n > 0, giving B_n proportional to f(X_n) − [F(X_n + h_n) − F(X_n − h_n)]/(2 h_n). Many important functions can only be estimated with bias, e.g., estimators of a probability density function. To represent R-M processes, we take B_n = 0 for each n.
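A minimal simulation of this finite-difference setup (my own illustration: the cubic F, the gain A, and the bandwidth rate are assumptions) makes the bias term concrete:

```python
import numpy as np

rng = np.random.default_rng(4)

# f(x) = x**2 - 1 (root theta = 1) is observed only through noisy values of
# its antiderivative F(x) = x**3/3 - x; the central difference
# [F(x+h) - F(x-h)]/(2h) = f(x) + h**2/3 carries the bias of (18).
def noisy_F(x):
    return x ** 3 / 3.0 - x + rng.normal(0.0, 0.1)

def biased_sa(x1=2.0, A=0.5, gamma=1.0 / 5.0, n_steps=20000):
    x = x1
    for n in range(1, n_steps + 1):
        h = n ** (-gamma)                               # shrinking difference step
        y = (noisy_F(x + h) - noisy_F(x - h)) / (2.0 * h)
        x = x - (A / n) * y
    return x

root = biased_sa()
```

Because h_n shrinks, the bias h_n^2/3 vanishes and the iterate approaches the root, at a rate governed by the interplay of bias and noise that the representation theorem quantifies.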
Motivation for studying a general algorithm has also been given by Ruppert (1982). Ruppert considers a very similar algorithm,

(19) X_{n+1} = X_n − n^{−1}(f(X_n) + n^{−2τ} β_n + n^{τ} ξ_n)

where 0 ≤ τ ≤ 1/4. He then proves a representation theorem which we will use in our development of a random central limit theorem (CLT). After having argued that the representation that we give is appropriate, methods popularized by Billingsley (1968) are used in §5.3 to prove a random CLT for the process described in (18). As an easy corollary to our random CLT, we get a sequentially determined bounded-length confidence interval for the general stochastic approximation process described in (18). In §5.4, we show that these conditions are sufficient when applied to the ARP problem.

We choose to use a slight modification of the algorithm considered by Ruppert (19), as given in (18). Our interest is in applying this algorithm to problems that have non-zero bias terms (B_n) but with rates of convergence (hopefully) greater than one third. To employ (19) directly (and Ruppert's main Theorem 3.1), we would need to include these biases in the term ξ_n, thus greatly clouding the natural interpretation of these vectors. By considering (18), however, Ruppert's proof goes through with only minute changes. The algorithm (18) is in a form more closely related to the easily accessible algorithm given by Fabian (1968ii).
Throughout the remainder of this chapter, we assume that all random variables are defined on a fixed probability space (Ω, A, P). When speaking of weak convergence, for simplicity we shall only consider functions defined on [0,∞) (i.e., m = 1). Thus, let D = D[0,∞) be the space of all functions defined on [0,∞) having left-hand limits and continuous from the right. Endow D with Stone's (1963) extension of Skorokhod's (1956) J1-topology and use ⇒_w to denote convergence in this topology. Also, use →_D and =_D to denote convergence and equivalence in distribution, respectively. For M ∈ R^{m×m}, let exp(M) = Σ_{i=0}^∞ M^i/i!, t^M = exp[(log t) M], and let λ_i(M) be the i-th eigenvalue of M. ||·|| denotes the m-dimensional Euclidean norm, and recall that [·] is the greatest integer function. We state below the important assumptions and two preparatory lemmas.
E1. Let ρ > 0, B ∈ R^m and B_n = B + O(n^{−ρ}).

E2. Let δ > 0 and G ∈ R^{m×m}. We suppose that f(x) = G(x − θ) + O(||x − θ||^{1+δ}).

E3. For each M ∈ R^{m×m} with min_i λ_i(M) > 1/2, suppose Σ_{n=1}^∞ n^{−M} V_n < ∞ a.s.

E4. lim_{n→∞} X_n = θ a.s.

E5. There exists a standard Brownian motion B on [0,∞) and an ε > 0 such that Σ_{k≤t} V_k = σ B(t) + O(t^{1/2−ε}) a.s.
Lemma 5.1

Consider the process defined in (18) and assume E1-E4. Then, for χ = 1/2 − τ, there exists an ε > 0 such that

(20) n^χ (X_{n+1} − θ) = −n^{−(G−χ)} Σ_{k≤n} k^{G−χ−1/2} V_k + B/(G − χ) + O(n^{−ε}).

Proof:

For convenience, take θ = 0. Fix b < χ and define Y_n = (n−1)^b X_n. We first show that

(21) n^b X_n = o(1) for each b < χ.

From (18), E2 and E4, we have

(22) Y_{n+1} = Y_n (1 − n^{−1} G + n^{−1} U_n) − n^{b−1+τ−1/2} B_n − n^{b−1+τ} V_n,

where U_n satisfies U_n = o(1). Since B_n = O(1) and b − 1 + τ − 1/2 < −1, we have

(23) Σ_n n^{b−1+τ−1/2} B_n converges.

By E3 and b − 1 + τ < −1/2,

(24) Σ_n n^{b−1+τ} V_n converges.

(22)-(24) are sufficient for Y_{n+1} = Y_n(1 − n^{−1} G + n^{−1} U_n) + d_n, where Σ_n d_n converges. This is equation 3.1 of Ruppert's (1982) Lemma 3.1. Following the arguments of that result, we get (21). The remainder of the argument follows Ruppert (1982, Theorem 3.1), replacing the appropriate exponents. □
Lemma 5.2 (Ruppert, 1982, Lemma 4.1)

For m = 1, assume a > −1/2 and E5. Then there exists a standard Brownian motion B_a and an ε′ > 0 such that

(26) Σ_{k≤t} k^a V_k = σ B_a(t^{2a+1} (2a+1)^{−1}) + O(t^{a+1/2−ε′}).

If a < −1/2, then lim_{t→∞} Σ_{k≤t} k^a V_k exists and is finite a.s.
Theorem 5.3

Suppose E1-E5 are satisfied for the process defined in (18), m = 1 and χ = 1/2 − τ. Let

W_n(t) = [nt]^χ (X_{[nt]+1} − θ) − B/(G − χ)

and assume G > χ. Then, there exists a standard Brownian motion process B defined on [0,∞), such that

(27) W_n(t) ⇒_w Z(t)

where Z(t) = {2(G−χ)}^{−1/2} σ t^{−(G−χ)} B(t^{2(G−χ)}).

Proof:

From Lemmas 5.1 and 5.2, with a = G + τ − 1 = G − χ − 1/2,

W_n(t) = −[nt]^{−(G−χ)} Σ_{k≤[nt]} k^{G−χ−1/2} V_k + O(n^{−ε})
= −[nt]^{−(G−χ)} {σ B_a([nt]^{2(G−χ−1/2)+1} {2(G−χ−1/2)+1}^{−1}) + O([nt]^{(G−χ−1/2)+1/2−ε′})} + O(n^{−ε}).

Let,

V_n(t) = −[nt]^{−(G−χ)} σ {2(G−χ)}^{−1/2} B([nt]^{2(G−χ)})
=_D −[nt]^{−(G−χ)} n^{G−χ} σ {2(G−χ)}^{−1/2} B([nt]^{2(G−χ)} n^{−2(G−χ)})
⇒_w t^{−(G−χ)} σ {2(G−χ)}^{−1/2} B(t^{2(G−χ)}) = Z(t)

by the deterministic convergence of [nt]/n and the almost sure continuity of Brownian motion. Since sup_{t≤T} {W_n(t) − V_n(t)} → 0 a.s., and hence in probability, we get W_n(t) ⇒_w Z(t) by Theorem 4.1 of Billingsley (1968), giving the result of the theorem. □
§5.3 SA - sequential fixed-width confidence interval

As an application of the results of §5.2, in this section we show the asymptotic normality of the stochastic process defined by (18) when indexed by a stopping time. An easy corollary gives a bounded-length confidence interval. As argued in §5.2, the algorithm (18) includes both the R-M and K-W cases and allows for weaker dependency assumptions on the errors. This result is a considerable improvement over Sielken (1973), McLeish (1976), and Stroup and Braun (1982). These authors considered a univariate R-M process assuming the errors were martingale differences. Our approach is somewhat different. In §5.2, we demonstrated how the SA process may be strongly approximated by a Gaussian process. In this section we demonstrate that the weak convergence properties are inherited by certain randomized versions. This is related to the approach advocated by Csörgő and Révész (1981, Chapter 7). They dealt with the strong approximation of partial sums of weakly dependent random variables by a Gaussian process. They demonstrated that the distribution of the process, when suitably standardized, is unaffected by a random change in time. This result, with the strong approximation, gives the weak convergence of the randomized partial sums. While starting with strong approximations, we choose to stay with the conventional weak convergence arguments. In the classical case of partial sums of i.i.d. random variables, both approaches achieve weak convergence of the randomized partial sums. Thus, we would not strengthen our results by resorting strictly to strong approximation arguments.

Given below is a result due to Chow and Robbins (1965) which is typical of the results we shall prove.
Theorem (Chow and Robbins)

Let {X_i} be an i.i.d. sequence from a population with unknown mean μ and unknown, finite variance σ^2. Let z_α denote the (1−α) quantile of the standard normal distribution. For each d > 0, let N_d = inf{n: n ≥ z_{α/2}^2 σ_n^2/d^2}, where σ_n^2 is the sample variance. Then,

(a) lim_{d→0} d^2 N_d = z_{α/2}^2 σ^2 a.s.

(b) lim_{d→0} d^2 E N_d = z_{α/2}^2 σ^2 (asymptotic efficiency)

(c) lim_{d→0} P(|X̄_{N_d} − μ| ≤ d) = 1 − α (asymptotic consistency).
The result (c) is not all one might hope for. A better result is P(|X̄_{N_d} − μ| ≤ d) ≥ 1 − α, that is, a result true for a fixed d. As this result is not available even in the i.i.d. case, we shall not move in this direction for the SA ARP estimators. In the above example, we refer to I_{N_d} = [X̄_{N_d} − d, X̄_{N_d} + d] as an interval estimator for μ. The Chow-Robbins result is sometimes referred to as an "absolute accuracy" result. In some cases, one may be concerned with "proportional accuracy". For some p such that 0 < p < 1, define M_p = inf{n: n ≥ z_{α/2}^2 σ^2/(X̄_n^2 p^2)}. Then

lim_{p→0} P(|X̄_{M_p} − μ|/μ ≤ p) = 1 − α.

Results for a mixing of these two criteria are also available (Nadas, 1969). We shall only be concerned with absolute accuracy. See Sen (1981, Chapter 10) for a more in-depth description of sequential interval estimation and associated stopping times.
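A direct simulation of the Chow-Robbins rule (a sketch; the normal population, tolerance d, and burn-in size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
z = 1.959964  # z_{alpha/2} for alpha = 0.05

def chow_robbins(draw, d, n0=30):
    """Sample until n >= z**2 * s_n**2 / d**2, where s_n**2 is the sample
    variance, then report the fixed-width interval [mean - d, mean + d]."""
    xs = [draw() for _ in range(n0)]
    while len(xs) < z ** 2 * np.var(xs, ddof=1) / d ** 2:
        xs.append(draw())
    m = float(np.mean(xs))
    return m - d, m + d, len(xs)

# population N(3, 2**2); asymptotically d**2 * N_d ~ z**2 * sigma**2
lo, hi, n = chow_robbins(lambda: rng.normal(3.0, 2.0), d=0.1)
```

The interval width is exactly 2d by construction; the randomness is entirely in the stopping time N_d, which concentrates near z^2 σ^2/d^2.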
We fix the asymptotic coverage probability at 1 − α. We will define a sequence of random intervals {I_n} and a stopping random variable N_d so that N_d is the first n such that length(I_n) ≤ 2d and lim_{d→0} P(θ ∈ I_{N_d}) = 1 − α. Each of the following two assumptions gives conditions needed for a randomly indexed CLT.

E6. Let

(28) N_n/m_n − N = o_P(1)

where N is a positive random variable, {m_n} are integers going to infinity and {N_n} is a sequence of random variables.

E7. Let {B_n}, {σ_n} and {G_n} be sequences of random variables such that B_n − B = o(1), σ_n − σ = o(1) and G_n − G = o(1).

With E7 we can define an appropriate stopping rule and a 2d-width confidence interval. Let z_α be the (1−α)th quantile of the standard normal distribution. For d > 0, define

(29) N_d = inf{n ≥ 1: d ≥ z_{α/2} σ_n n^{−χ} {2(G_n − χ)}^{−1/2}}, = ∞ if no such n exists,

(30) I_n = [X_n − n^{−χ}(B_n/(G_n − χ) + z_{α/2} σ_n {2(G_n − χ)}^{−1/2}), X_n − n^{−χ}(B_n/(G_n − χ) − z_{α/2} σ_n {2(G_n − χ)}^{−1/2})].

Remark: We may define the sequence {n_d} where n_d = inf{n ≥ 1: d ≥ z_{α/2} σ n^{−χ} {2(G − χ)}^{−1/2}}. It is immediate that, under the assumptions of Theorem 5.3 and E7, lim_{d→0} P(θ ∈ I_{n_d}) = 1 − α. Further, from Lemma 3.6 of McLeish (1976), we get lim_{d→0} N_d/n_d = 1 a.s. We also need,
E8. Let F_a^b = σ(V_a,…,V_b). Then, for each t ≥ 1, n ≥ 0, A ∈ F_1^t, B ∈ F_{t+n}^∞,

|P(AB) − P(A)P(B)| ≤ Φ(n), where Φ(n) → 0 as n → ∞.
Theorem 5.4

Suppose E1-E6 and E8 are satisfied for the process defined in (18), m = 1 and χ = 1/2 − τ. Then, for the Gaussian process Z(·) defined in Theorem 5.3,

(31) [N_n t]^χ (X_{[N_n t]+1} − θ) − B/(G − χ) ⇒_w Z(t).

Corollary 5.4

Suppose E1-E5, E7 and E8 are satisfied for the process defined in (18), m = 1 and χ = 1/2 − τ. Then, for N_d defined in (29), we have,

lim_{d→0} P(θ ∈ I_{N_d}) = 1 − α.

Proof of Corollary:

By (29) and E7, we have,

d N_d^χ {2(G − χ)}^{1/2}/σ − z_{α/2} = o_P(1) as d → 0,

and thus E6 is satisfied. Thus, from Theorem 5.4, for t = 1, we have,

(32) N_d^χ (X_{N_d} − θ) →_D N(B/(G − χ), σ^2/{2(G − χ)}) as d → 0.

The proof of the corollary is two easy steps from (32). □

The proof of Theorem 5.4 is merely an alteration of Billingsley's (1968) Theorem 17.2, given here for the reader's convenience.
Proof of Theorem 5.4:

Define ψ_n(t) = t N_n/m_n if N_n ≤ m_n, and equal to tN otherwise. By E6, we have ψ_n(t) →_P tN = ψ(t) for each t > 0. By (27), we have W_{m_n}(t) ⇒_w Z(t). Now, assume

(33) (W_n, ψ_n) ⇒_w (Z, ψ),

where Z and ψ are independent processes. Then,

W_{N_n}(t) = W_{m_n}(ψ_n(t)) I(N_n ≤ m_n) + W_{N_n}(t) I(N_n > m_n) ⇒_w Z(ψ(t)).

But, since Z and ψ are independent,

Z(ψ(t)) = {2(G−χ)}^{−1/2} σ (tN)^{−(G−χ)} B((tN)^{2(G−χ)})
=_D {2(G−χ)}^{−1/2} σ (tN)^{−(G−χ)} N^{G−χ} B(t^{2(G−χ)})
=_D Z(t).

Thus, we need only show (33). Fix T > 0. If we show weak convergence for D[0,T], by Stone (1963), this is sufficient for weak convergence on D[0,∞) (cf., Sen, 1981, Theorem 2.3.6, page 24). We now follow Billingsley (1968), and assume N is bounded with probability one. Redefining the sequence {m_n} if necessary, we have 0 < N < K < ∞ a.s. Define

W′_n(t) = −[nt]^{−(G−χ)} Σ_{p_n ≤ k ≤ [nt]} k^{G−χ−1/2} V_k,

where {p_n} is a sequence of integers going to infinity such that p_n/n → 0. By (27), W_n(t) = O_P(1) and thus

|W_n(t) − W′_n(t)| = [nt]^{−(G−χ)} |Σ_{k < p_n} k^{G−χ−1/2} V_k| + O(n^{−ε})
= n^{−(G−χ)} p_n^{G−χ} O_P(1) = o_P(1),

and thus,

(34) sup_{t≤T} |W_n(t) − W′_n(t)| → 0 in probability.

Define B^m to be the Borel sets of R^m and H_0 the field consisting of the sets of the form {ω: (V_1(ω), V_2(ω),…,V_m(ω)) ∈ B^m} for m ≥ 1. Let E ∈ H_0 and A be a Z-continuity set of D. Since p_n → ∞, we have for large n,

P((W′_n ∈ A) ∩ E) − P(W′_n ∈ A) P(E) → 0 and P(W′_n ∈ A) P(E) → Z(A) P(E)

by (27), (34) and E8, where Z(A) = P(Z ∈ A). By Billingsley (1968), Theorem 4.5, we have,

(W′_n, ψ_n) ⇒_w (Z, ψ)

in the product topology, where ψ is independent of Z. This and (34) give

(35) (W_n, ψ_n) ⇒_w (Z, ψ),

thus proving (33) for N bounded. For N not bounded, we employ a simple truncation device and use Billingsley's Theorem 4.2. This completes the proof. □

§5.4 ARP - sequential fixed-width confidence interval
We now show that the results of §5.2 and §5.3 are sufficiently general to include the ARP problem. All quantities are defined as in §3.1, with the kernel functions used to estimate the density. Recall that A10 is sufficient for A8 and A9. Further, recall the algorithm given in (3.2),

(36) φ_{n+1} = φ_n − A n^{−1} M_{g,n}(φ_n) = φ_n − A n^{−1}(g′(φ_n) M(g(φ_n)) + c_n^{−1/2} V_n + Λ_n).

We first show that assumptions E1-E5 are true given the assumptions of Chapters 2 and 3, letting

(37) τ = γ/2, X_n − θ = φ_n − φ, f(x) = A g′(x) M(g(x)), B_n = A n^{(1−γ)/2} Λ_n = T_n, B = μ_1 [G − (1−γ)/2], and the error vector of (18) equal to A C^{−1/2} V_n.

To show E5 we employ a result due to Strassen (cf., Sen, 1981, Theorem 2.5.1, page 34).
Lemma 5.3 (Strassen)

Let {ξ_n, F_{n+1}} be a martingale difference sequence. Define Y_n = Σ_{k≤n} E_{F_k} ξ_k^2, S(Y_n) = Σ_{k=1}^n ξ_k, and S(t) by linear interpolation. Let h be a nonnegative, nondecreasing function on [0,∞) such that t^{−1} h(t) is nonincreasing. If,

(38) Y_n → ∞ a.s., and

(39) Σ_n E_{F_n}{ξ_n^2 I(ξ_n^2 > h(Y_n))} (h(Y_n))^{−1} < ∞ a.s.,

then there exists a Brownian motion process B on [0,∞) such that

(40) S(t) − B(t) = o((log t)(t h(t))^{1/4}).
Lemma 5.4

Assume A1, A2, A6, A7, A10, (37) and that p > 2 + 1/r for the process defined in (36). Then, there exists a standard Brownian motion process B on [0,∞) and an ε > 0 such that

Σ_{k≤t} ξ_k = {σ_f^2 (2r − 1 + γ)}^{1/2} B(t) + O(t^{1/2−ε}).

Proof:

We first show that the assumptions of Lemma 5.3 are true. As in Lemma 2.5, {ξ_n, F_n} is a martingale difference sequence and Y_n/n → σ_f^2 (2r − 1 + γ) a.s., where σ_f^2 is defined in Theorem 3.1. Thus, (38) is true. To show (39), we use the conditional version of Hölder's inequality. For the p and q in A6, we have

(h(Y_n))^{−1} E_{F_n}{ξ_n^2 I(ξ_n^2 > h(Y_n))}
≤ (h(Y_n))^{−1} (E_{F_n} ξ_n^p)^{2/p} [E_{F_n}(I(ξ_n^2 > h(Y_n)))]^{1/q}
≤ (h(Y_n))^{−1} (E_{F_n} ξ_n^p)^{2/p} [E_{F_n} ξ_n^p]^{1/q} [h(Y_n)]^{−p/(2q)}
= E_{F_n} ξ_n^p (h(Y_n))^{−p/2}.

In the proof of Theorem 3.1, we showed that c_n^{p/2} E_{F_n} V_n^p is bounded a.s. Thus, there exists an M > 0 such that E_{F_n} ξ_n^p ≤ M n^{γp/2} a.s. Let ε > 0; taking h(t) = t^{1−ε} and γ = 1/(2r+1), we see that (39) converges a.s. since

(Y_n^{1−ε})^{−p/2} n^{γp/2} = (Y_n^{1−ε} n^{−γ})^{−p/2} = ((Y_n/n)^{1−ε} n^{1−γ−ε})^{−p/2} = O(1) n^{−rpγ+εp/2}.

Since rpγ > 1 by assumption, we can choose ε sufficiently small so that Σ_n n^{−rpγ+εp/2} converges. Thus, we have by Lemma 5.3 that for some standard Brownian motion B_0,

S(Y_n) − B_0(Y_n) = o(Y_n^{1/2−ε})

by taking h(t) = t^{1−5ε}. Taking n large and using Y_n/n → σ_f^2 (2r − 1 + γ) gives

B_0(Y_n) = B_0(n σ_f^2 (2r − 1 + γ)) + o(n^{1/2−ε}) = {σ_f^2 (2r − 1 + γ)}^{1/2} B(n) + o(n^{1/2−ε}),

where B(·) is also a standard Brownian motion. Thus,

Σ_{k≤n} ξ_k = {σ_f^2 (2r − 1 + γ)}^{1/2} B(n) + o(n^{1/2−ε}).

We get the result for a general t (not necessarily an integer) since B(n+1) − B(n) = O_P(1). □
Lemma 5.5

Assume A1, A2, A6, A7, A10, (37) and that p > 2 + 1/r for the process defined in (36). Then E2-E4 are satisfied with G = A (g′(φ))^2 M′(g(φ)).

Proof:

E4 is satisfied by Theorem 3.1(a). E2 is satisfied by assumptions A1 and A10 (take δ = 1). Lemmas 5.2 and 5.4 satisfy E3. □
Lemma 5.6

Assume A1, A2, A6, A7, A10, A11 for the process defined in (36). Then there exists a ρ > 0 such that n^ρ (B_n − B) = o(1).

Proof:

From A1, we get g(x) = g(φ) + O(|x − φ|). From the proof of Theorem 3.1, we get

(B − B_n)/{A C^r (C_1 − C_2)} = g(φ) ∫_0^φ S(u) du (F*g)^{(r+1)}(φ) ∫_0^1 (y^r/r!) k(y) dy − g(φ_n) ∫_0^{φ_n} S(u) du c_n^{−r}{E_F f_{g,n}(φ_n) − (F*g)^{(1)}(φ_n)}

and, from Lemma 3.1(b),

= g(φ) ∫_0^φ S(u) du (F*g)^{(r+1)}(φ) ∫_0^1 (y^r/r!) k(y) dy − g(φ_n) ∫_0^{φ_n} S(u) du {∫_0^1 (y^r/r!) k(y) (F*g)^{(r+1)}(θ_n) dy},

where |θ_n − φ_n| ≤ c_n y. Thus,

|B − B_n|/{A C^r (C_1 − C_2)} ≤ g(φ) ∫_0^φ S(u) du |∫_0^1 (y^r/r!) k(y){(F*g)^{(r+1)}(θ_n) − (F*g)^{(r+1)}(φ)} dy| + o(|φ_n − φ|)
= O(|θ_n − φ|^d) + o(|φ_n − φ|) = o(n^{−dγ}) + o(n^{−rγ}). □
Theorem 5.5

Assume A1, A2, A6, A7, A10, A11, (37), and p > 2 + 1/r for the process defined in (36). Let μ_1 and σ_1 be defined as in Corollary 3.1. Then, there exists a Brownian motion process B defined on [0,∞) such that

[nt]^{(1−γ)/2} (φ_{[nt]+1} − φ) − μ_1 ⇒_w Z*(t)

where Z*(t) = σ_1 t^{−(2r−1+γ)/2} B(t^{2r−1+γ}).

Proof:

An immediate application of Theorem 5.3 and Lemmas 5.4-5.6. □
Corollary 5.5

Let φ_n, μ_n, σ_n and (F*g)_n^{(r+1)} be defined as in Lemmas 3.4 and 3.5, respectively, and let G_n, B_n and σ_n be the corresponding F_n-measurable estimators of G, B and σ_1. For each d > 0, define

N_d = inf{n ≥ 1: d ≥ z_{α/2} n^{−(1−γ)/2} σ_n}, = ∞ if no such n exists.

Define the sequential 2d-width confidence interval,

I_n = [φ_n − n^{−(1−γ)/2} B_n/(2 G_n − 1 + γ) − n^{−(1−γ)/2} z_{α/2} σ_n, φ_n − n^{−(1−γ)/2} B_n/(2 G_n − 1 + γ) + n^{−(1−γ)/2} z_{α/2} σ_n].

Then, under the assumptions of Theorem 5.5,

lim_{d→0} P(φ ∈ I_{N_d}) = 1 − α.

Proof:

A slight modification of Lemmas 3.4 and 3.5 gives G_n − G = o(1), σ_n − σ_1 = o(1) and B_n − B = o(1). This is sufficient for E7. We have constructed N_d as in (29), and thus the corollary is true as a special case of Corollary 5.4. □
CHAPTER 6
S6.1.
MONTE CARLO STUDIES
Introduction
In this chapter we investigated the finite sample
properties of the estimators and the procedure proposed.
purpose is really two-fold.
The
First, we wanted to verify the
usefulness of the asymptotic approximations in finite
samples.
Second, we wanted to compare the performance of the
estimators in finite samples when the parameters are chosen
having various values in the "neighborhood" of the
asymptotically optimal values.
Further, as discussed below,
the performance of the estimators at finite stages is
improved by parameters of the procedure that do not appear in
the asymptotic theory.
These parameters are introduced in
S6.3.
The following section, S6.2, describes an investigation
of the ARP which was made prior to the Monte-Carlo
experiment.
The goal of this investigation was to develop
systems of computer programs that calculated various
deterministic characteristics of the ARP problem (such as the
Page 118
optimal replacement time and mean lifetime of a unit), given
various input parameters (such as costs and the lifetime
distribution).
A short description of the Monte-Carlo experiment is given in §6.3.  In this section we specify the assumptions of the model and the basis for calculating the expected value of the estimators.  The true values of the estimators come from §6.2.  This leads to a short discussion of a criterion for comparison of these estimators.  The candidate adopted here is a version of the mean square error (MSE).
Based on the preceding sections, §6.4 gives a summary of the results of the experiment.  Some remarks about the nature of the experiment and the accuracy of results are made.  We describe the tables included in the appendix and follow this description with remarks on the highlights of the data.
§6.2  Preliminary investigation

Given the probability distribution of the errors and the costs associated with the types of failure, the actual calculation of the optimal replacement time φ is straightforward in principle, but requires careful numerical work.  This section gives the details of that calculation and the calculation of other parameters that depend upon φ.  The work of this section is not new (cf., Glasser, 1967), but is needed in §6.3.
In all cases, we assumed that errors have a Weibull distribution with input parameters alpha (location) and lambda (scale).  The Weibull is a standard lifetime distribution in survival analysis having moments that can be calculated given alpha and lambda.  Further, if alpha is greater than one then the Weibull distribution has an increasing failure rate.  Even when the transform function g is the identity (e.g., g(t)=t, see §2.4), α>1 is sufficient to ensure that φ will take on a unique, finite value.
Other input parameters for this phase are the costs, C_1 and C_2, associated with failure and planned replacements, respectively.  It can be seen from (0.1) that determining φ depends only on the ratio of the costs.  Thus, we took C_2 = 1 in all cases.
All computing was done on an IBM 370/155 operating at the Computation Center of the University of North Carolina at Chapel Hill.  The operating system is IBM OS/360 MVT, the programming language is FORTRAN and the compiler, version H.  To calculate the definite integral ∫_0^x S(u)du and the variances of the Weibull distribution, the IMSL routine DCADRE was used.  The IMSL routine GGWIB produced the random deviates.  All calculations were done in double precision.  DCADRE was also used with SAS/GRAPH to produce plots of the function R_1(·) that we wanted to minimize.  These plots are given in Appendix A.
The most important output variable of the program was φ.  This was calculated by using a Newton-Raphson routine to find the root of M(x) (see (2.2) for the definition of M(·)).  Note that at each stage of the iteration, it was required to evaluate a definite integral via DCADRE.  With a good initial guess (one), the algorithm converged very quickly (in three iterations).  Having calculated φ, the calculation of other important variables was straightforward.  These results are summarized in Table A1.
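A present-day sketch of this deterministic phase is given below.  It is not the original FORTRAN/IMSL program: DCADRE is replaced by a fixed Simpson rule, the Newton-Raphson search by bisection, and the root condition is the classical first-order condition for minimizing R_1 (cf. Glasser, 1967), all under the Table A1 parameter values.

```python
import math

# Hedged re-implementation of Sec. 6.2's deterministic phase (assumed values:
# Weibull "location" alpha = 2.2, scale lam = 2, costs C1 = 5, C2 = 1).
ALPHA, LAM, C1, C2 = 2.2, 2.0, 5.0, 1.0

def S(u):                       # survival function of the Weibull lifetime
    return math.exp(-((u / LAM) ** ALPHA))

def r(u):                       # failure rate; increasing since alpha > 1
    return (ALPHA / LAM) * (u / LAM) ** (ALPHA - 1.0)

def Q(x, m=400):                # Simpson approximation of the integral of S on [0, x]
    h = x / m
    acc = S(0.0) + S(x) + sum((4 if j % 2 else 2) * S(j * h) for j in range(1, m))
    return acc * h / 3.0

def M(x):
    # First-order condition for minimizing R_1: r(x)Q(x) - F(x) = C2/(C1 - C2).
    return r(x) * Q(x) - (1.0 - S(x)) - C2 / (C1 - C2)

def solve(lo=1e-6, hi=5.0, tol=1e-10):
    # Bisection instead of Newton-Raphson: slower but needs no initial guess.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if M(mid) < 0 else (lo, mid)
    return (lo + hi) / 2.0

x_star = solve()                         # optimal replacement age on the g scale
phi = math.log(math.exp(x_star) - 1.0)   # pull back through g(t) = log{1+exp(t)}
print(round(x_star, 4), round(phi, 4))
```

With these inputs the pulled-back root lands near the value φ = .53349 reported in Table A1.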
§6.3  Description of the simulation

The asymptotic theory of Chapters 2 and 3 deals with a wide variety of SA procedures for solving the sequential ARP problem.  To make the Monte-Carlo study tractable, we first restricted ourselves to a certain subset of those procedures, which is described below.
All comments concerning the costs and assumptions on the distribution of the errors made in §6.2 continue to hold.  We were interested in the relative performance of the estimators in finite samples for various values of the parameters.  Thus, it was felt that not much would be gained by allowing the mean and variance to vary, so we fixed α=2.2 and λ=2.  This gives a mean of 1.77125 and standard deviation of .8499 for the lifetime of the units.
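The quoted moments can be checked directly from the standard Weibull moment formulas (a sketch, assuming the shape/scale reading of α and λ above):

```python
import math

# Check of the quoted lifetime moments for a two-parameter Weibull with
# shape alpha = 2.2 (the "location" input of Sec. 6.2) and scale lam = 2.
alpha, lam = 2.2, 2.0
mean = lam * math.gamma(1.0 + 1.0 / alpha)
var = lam ** 2 * (math.gamma(1.0 + 2.0 / alpha) - math.gamma(1.0 + 1.0 / alpha) ** 2)
print(round(mean, 5), round(var ** 0.5, 4))   # about 1.77125 and .8499
```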
The transformation function g allowed us to look at a broader class of distribution functions and to use unconstrained SA.  Thus, in doing the Monte-Carlo work we felt compelled to take g other than the identity function.  However, this transformation will not affect the speed of the convergence (see Theorem 2.2 and §3.2).  To keep the amount of simulation feasible, all simulations used the function g(t)=log{1+exp(t)}.
The kernel function used to estimate the density was the simplest one presented, the indicator function in Chapter 2.  While it would be interesting to investigate the effects of different kernels (or other types of density estimators) with different parameters of the procedure, we limit ourselves here to investigating the effect on the procedure for different values of parameters and for a fixed kernel.  Choice of a particular density estimator has been investigated by other researchers (cf., Wegman, 1972, for some interesting numerical comparison of probability density estimators).  Because of our interest in the parameters and since we did not necessarily use the procedure with the best rate of convergence (we do not take advantage of the smoothness of the underlying d.f. with a better choice of the kernel), no Monte-Carlo work was done on the adaptive procedures of §3.3.
Having mentioned all of these restrictions, the reader may wonder what variables were actually investigated.  Recall the equations:

    φ_{n+1} = φ_n − a_n M_{g,n}(φ_n)
    a_n = A n^{−1}
    c_n = C n^{−.2}
    S_{in}(t) = I{Z_{in} > t},  i = 1, 2
    f̂_{1n}(t) = I{g(t−c_n) ≤ Z_{1n} ≤ g(t+c_n)}/(2c_n)
    M_{g,n}(t) = (C_1−C_2) f̂_{1n}(t) ∫_0^{g(t)} S_{2n}(u)du
                 − g′(t) S_{1n}(g(t)){C_1 − (C_1−C_2) S_{2n}(g(t))}.
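The recursion above is easy to prototype.  The sketch below is an illustrative reimplementation (Python rather than the study's FORTRAN; Weibull and cost values from §6.2, and the damped gains a_n = A(n+k_A)^{−1}, c_n = C(n+k_C)^{−.2} with k_A = k_C = 50 that are introduced just below):

```python
import math, random

# Illustrative sketch of the SA recursion -- not the dissertation's program.
# Assumed values: Weibull(alpha=2.2, lam=2) lifetimes, C1=5, C2=1, A=2.3, C=1.5.
A, Cc, C1, C2 = 2.3, 1.5, 5.0, 1.0
ALPHA, LAM, KA, KC = 2.2, 2.0, 50, 50

def lifetime():                       # Weibull deviate via inverse transform
    return LAM * (-math.log(random.random())) ** (1.0 / ALPHA)

def g(t):                             # transform keeping the iterate unconstrained
    return math.log1p(math.exp(t))

def g_prime(t):
    return 1.0 / (1.0 + math.exp(-t))

def M_hat(t, z1, z2, c):
    # One-stage estimate of M_g(t) built from the indicator kernels; an
    # observation censored exactly at g(t+c) falls outside the upper window.
    f1 = (1.0 if g(t - c) <= z1 < g(t + c) else 0.0) / (2.0 * c)
    s1 = 1.0 if z1 > g(t) else 0.0               # S_1n(g(t))
    s2 = 1.0 if z2 > g(t) else 0.0               # S_2n(g(t))
    int_s2 = min(z2, g(t))                       # integral of S_2n over [0, g(t)]
    return (C1 - C2) * f1 * int_s2 - g_prime(t) * s1 * (C1 - (C1 - C2) * s2)

random.seed(1)
phi = 1.0                                        # starting value phi_1
for n in range(1, 251):
    a_n = A / (n + KA)                           # damped gain sequences
    c_n = Cc * (n + KC) ** -0.2
    cens = g(phi + c_n)                          # replacement age at this stage
    z1 = min(lifetime(), cens)                   # censored observations Z_1n, Z_2n
    z2 = min(lifetime(), cens)
    phi -= a_n * M_hat(phi, z1, z2, c_n)
print(round(phi, 3))                             # typically wanders near .533
```

A single run is noisy; the tables of Appendix B average 1000 such paths.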
We certainly were interested in the effect on the procedure for different values of A, C, and φ_1, the starting value of the sequence.  Further, we found that the value of φ_n oscillated wildly in the early stages of the procedure (when n is small).  Dvoretsky (1956), while investigating the R-M case, replaced a_n by A(n+k_A)^{−1}, with a positive k_A.  For a very simple example, he showed how this form was optimal in a minimax sense (a finite sample result, uncommon in SA theory).  The optimal choice of k_A is related to the variance of the units and the starting value of the procedure.
In this Monte-Carlo study, we used

    a_n = A(n+k_A)^{−1}  and  c_n = C(n+k_C)^{−.2},

and investigated possible values of k_A and k_C.  We note here that these replacements do not alter the asymptotic distribution (see Theorem 3.3).
To choose the best possible A, C, φ_1, k_A and k_C, we used as our criterion the performance of the resulting estimator φ_n.  For notational convenience, let φ_n be used for φ_n(A,C,φ_1,k_A,k_C).  The estimators were judged based on their resulting biases and mean square errors, computed as follows.  Each simulation (for a fixed A, C, φ_1, k_A and k_C) was based on 1000 independent Monte-Carlo trials.  Denoting φ_{i,n} to be the estimator of φ at the nth stage on the ith trial, we used

    Ê φ_n = (.001) Σ_{i=1}^{1000} φ_{i,n}

as our estimator of the expected value of φ_n.  Thus, for the bias at the nth stage, we used BIAS_n = Ê φ_n − φ.  Similarly, for the mean square error, we used

    MSE_n = (.001) Σ_{i=1}^{1000} (φ_{i,n} − φ)².
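As a concrete instance of these summary formulas, the sketch below applies them to 1000 stand-in trial values (synthetic normal draws, not output of the actual SA runs; the spread is chosen to roughly match the stage-250 MSE of Table B1):

```python
import random

# Monte-Carlo summaries applied to hypothetical trial estimates phi_{i,n}.
random.seed(0)
PHI = 0.53349                                    # true optimum from Table A1
trials = [random.gauss(PHI + 0.01, 0.19) for _ in range(1000)]
e_hat = 0.001 * sum(trials)                      # estimate of E[phi_n]
bias_n = e_hat - PHI                             # BIAS_n
mse_n = 0.001 * sum((t - PHI) ** 2 for t in trials)   # MSE_n
print(round(bias_n, 3), round(mse_n, 3))
```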
The asymptotic theory (Theorem 2.2) indicates that MSE_n = O(n^{−.8}).  However, on examining a standardized version of the mean square error, SMSE_n = n^{.8} MSE_n, we found this to be unstable.  It turned out that the estimator is highly sensitive to the choice of k_A.  Heuristically, in replacing A n^{−1} with A(n+k_A)^{−1}, the procedure believes it is at the (n+k_A)th stage when only n iterations have been performed.  We thus used an adjusted standardized mean square error, ASMSE_n = (n+k_A)^{.8} MSE_n, as our criterion for comparing different estimators.  For values of n = 10, 50, 250, we found this criterion to be very satisfactory.
While we were very interested in the rates of convergence for our estimators, from a practical standpoint the effect of the procedure on the actual cost is even more important.  In Theorem 1.1 we guarantee the optimal cost would be obtained asymptotically, so we were interested in how useful this result is in finite samples.  Denote X_{ijk} to be our ith sample (i=1,2) at the jth stage (j=10,50,250) from the kth trial (k=1,...,1000).  With φ_{jk} as before, define Z_{ijk} = min(X_{ijk}, g(φ_{jk}+c_j)) and b_{ijk} = I(Z_{ijk} < g(φ_{jk}+c_j)).  For the kth trial, the actual sample cost per unit time at the nth stage is

    SC_{n,k} = Σ_{j=1}^n {C_1(b_{1jk}+b_{2jk}) + C_2(2−b_{1jk}−b_{2jk})} / Σ_{j=1}^n (Z_{1jk}+Z_{2jk}).

The mean sample cost per unit time at the nth stage is

    MSC_n = (.001) Σ_{k=1}^{1000} SC_{n,k}.
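A worked instance of SC_{n,k} with made-up numbers (n = 3 stages, two units per stage; the b's and Z's below are hypothetical, not study output):

```python
# Each stage j runs two units; b_ijk = 1 flags a failure replacement (cost C1),
# b_ijk = 0 a planned one (cost C2); total cost is divided by total run time.
C1, C2 = 5.0, 1.0
z = [(0.62, 1.00), (1.00, 0.38), (1.00, 1.00)]   # run times (Z_1jk, Z_2jk)
b = [(1, 0), (0, 1), (0, 0)]                     # failure indicators (b_1jk, b_2jk)
cost = sum(C1 * (b1 + b2) + C2 * (2 - b1 - b2) for b1, b2 in b)
time = sum(z1 + z2 for z1, z2 in z)
sc = cost / time
print(sc)   # 14.0 / 5.0 = 2.8 per unit time
```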
§6.4  Summary of results

We found the results of the Monte-Carlo experiment to be very satisfactory, especially given the highly variable nature of the estimators we used.  This variability is due to the estimator of the density which appears explicitly in M(·).  The magnitude of the variability can be illustrated in the following example.  Suppose that Corollary 2.2 is true not only asymptotically but also for finite n, i.e., that

    n^{2/5}(φ_n − φ) ~ N(BIAS, VAR),

where VAR is the asymptotic variance of Corollary 2.2.  Ignoring the BIAS term for the moment, a 100(1−α)% confidence interval for φ of length 2d requires that n^{2/5} > (z_{α/2} VAR^{1/2})/d, where z_α is the (1−α)th quantile of the normal distribution.  Assume we want a 95% confidence interval for φ with error of no more than .05.  Using the optimal values of A and C for the values given in Table A1, we have

    n^{2/5} > (1.96 (.80551)^{1/2})/(.05)  ⟹  n ≥ 7,342.

Even in this simple case, a large sample size is required for moderately precise results.
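The arithmetic can be replayed directly:

```python
import math

# Replaying the sample-size calculation above: n^{2/5} must exceed
# z_{alpha/2} * VAR^{1/2} / d with z = 1.96, VAR = .80551 and d = .05.
z, var, d = 1.96, 0.80551, 0.05
n_min = (z * math.sqrt(var) / d) ** 2.5
print(math.ceil(n_min))   # 7342
```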
We now describe the Tables of Appendix B where the raw output of the Monte-Carlo study can be found, then give the highlights of the findings of that study.  Each table gives the bias (BIAS_n), mean square error (MSE_n), adjusted standardized mean square error (ASMSE_n) and mean sample cost (MSC_n) for various values of A, C, φ_1, k_A and k_C.  These quantities were defined in §6.3.  We chose stages n = 10, 50, 250 to reflect what we considered to be small, moderate and large samples.  Table B1 begins with the optimal values of A and C, as given in Table A1, and a relatively close starting value, φ_1=1.  The performance for selected values of k_A and k_C is then presented.  The best results seem to occur at k_A=k_C=50 (although k_A=k_C=25 is close), so we retained these values for some of the subsequent investigations.  Tables B2, B3, and B4 display the results of using different choices of A, C and φ_1, respectively.  We were also interested in the effect of starting very far from the optimal value on the best choice of k_A and k_C.  Tables B5 and B6 display the results for high and low values of φ_1, respectively.
In general, the data from the Monte-Carlo study is very satisfactory.  In all of the intermediate range trials, the bias and mean cost decrease with the stage of the experiment.  The adjusted standardized mean square error (ASMSE) is relatively constant, but does decrease slightly with n.  This change is more drastic when k_A is small than when large, as n+k_A will change more.  This is due to the extra sampling variability at the earlier stages of the experiment.  At n=250, the ASMSE are approximately 15% higher than the standardized mean square error (SMSE).  Adjusted in this way, the mean square errors appear to be in the same area as the theoretical MSE = .8161, given in Table A1 (recall that the SMSE's were adjusted to make comparisons between stages).  Also, for most intermediate range experiments, the algorithm initially homed in on the optimal value quickly and then more slowly as the stochastic portion dominated.
The results of the study indicate that the performance of the algorithm was greatly enhanced by introducing the parameters k_A and k_C.  From Table B1, we see that k_A is the dominant parameter.  The introduction of k_C improved the behavior of the estimators slightly (in terms of the ASMSE of the estimators), but a dramatic improvement was caused by the introduction of k_A in the performance of φ_n.  We also note that by using too large a k_A and k_C the rate of convergence of the algorithm is slowed down considerably.
The algorithm varied with the parameters A and C as expected.  Table B2 shows that too large a value of A caused large oscillations in the early stages which calmed down in later stages when the asymptotics took over.  In Table B3, we see that too large a value of C means that the bandwidth of the density estimator is too wide, even by stage 250.  For intermediate ranges, the performance was relatively insensitive to different values of A and C, performing best near the theoretical optimal values.  For small values of A, the procedure was noticeably worse.  Recall, to achieve convergence in distribution we required

    A > (1−γ)/(2(g′(φ))² M′(g(φ))) = .5834.

For small values of C, the performance was not as good but we did not notice the dramatic shift as with A.
In Table B4 we investigated how sensitive the procedure is to the starting value.  Starting far away, either high or low, we notice the usual pattern of high ASMSE's at early stages which decrease as n increases.  The magnitude of the ASMSE increases as φ_1 moves away from φ.  Recall that we allowed starting from a negative φ_1 due to the g function.  When we started very close to the optimal value, the ASMSE was small in the early stages then increased to the level of the other experiments.  This is because of the very small bias terms in the earlier stages, the stochastic portion eventually becoming dominant.
Tables B5 and B6 investigate the effect on k_A and k_C when starting very high (φ_1=2.5) and low (φ_1=−1), with respect to φ (=.533).  When starting high, reducing k_A and k_C by one half gave the best results.  The ASMSE's crept back up when k_A=k_C=10.  Conversely, for a low starting value, there was improvement for k_A=k_C=25 but we did even better when we let k_A=k_C=10.  This is in line with Dvoretsky's (1956) result that the best choice of k_A depends not only on the variance of the observations but also on the starting value.
Convergence to the vicinity of the optimal replacement time φ requires a large number of experiments relative to other Monte-Carlo trials.  This is to be expected, as the function itself is flat around φ.  While this may be a dismal prospect to the practitioner, there is a bright note.  Recall that the long run cost for a failure replacement policy is C_1/μ = 5/1.77125 = 2.823.  In virtually every experiment in Tables B1-B6 we achieved a lower mean cost by the 10th stage (the exceptions being when k_A=0 and φ_1=3.5).  These results are especially significant since the best mean cost we could hope for is 1.90386 (see Table A1).  Thus, we have achieved considerable cost reductions even for very small samples, an important practical consideration when deciding whether or not to use a SA age replacement policy.
TABLE A1 - SUMMARY OF DETERMINISTIC CALCULATIONS

ASSUMPTIONS
    Distribution function F is Weibull with α = 2.2 and λ = 2.
    Failure cost C_1 = 5 and replacement cost C_2 = 1.
    Transform function g(t) = log{1+exp(t)}.
    SA algorithm parameters A = 2.3, C = 1.5.

OUTPUT VARIABLES
    φ = .53349
    ∫_0^{g(φ)} u S(u)du = .53760
    R_1(g(φ)) = 1.90386
    (g′(φ))² M′(g(φ)) = .68559
    VAR = A² C⁻¹ (.53760)/{2A (g′(φ))² M′(g(φ)) − 1 + .2} = .80551
    T = ((C_1−C_2)/6)(∫_0^{g(φ)} S(u)du)(F*g)^{(3)}(φ) = .12124
    MEAN = 2T/{2A (g′(φ))² M′(g(φ)) − 1 + .2} = .10302
    MSE = VAR + MEAN² = .8161
ARP COST FUNCTION
Weibull model with location parameter = 2.2 and scale parameter = 2.0.
[Plot of cost against value of age replacement time, 0.0 to 2.0.]
ARP COST FUNCTION TRANSFORMED BY G FUNCTION
Weibull model with location parameter = 2.2 and scale parameter = 2.0.
[Plot of transformed cost against value of age replacement time, 0.0 to 2.0.]
TABLE B1 - PERFORMANCE OF ESTIMATORS
A = 2.3   C = 1.5   φ_1 = 1.0

                              Stage of Algorithm
  k_A    k_C                 10        50       250
   50     50   BIAS_n      .3544     .1284     .0113
               MSE_n       .2016     .1091     .0377
               ASMSE_n     5.335     4.344     3.617
               MSC_n       2.268     2.159     2.053
    0      0   BIAS_n     -1.205    -1.248    -1.230
               MSE_n       40.03     39.24     38.68
               ASMSE_n     252.6     897.3     3204.
               MSC_n       4.767     13.09     48.47
    0     50   BIAS_n     -1.109    -1.112    -1.069
               MSE_n       34.89     34.16     33.56
               ASMSE_n     220.1     781.1     2781.
               MSC_n       4.900     13.18     43.44
   50      0   BIAS_n      .3896     .1659     .0247
               MSE_n       .2030     .1082     .0374
               ASMSE_n     5.371     4.307     3.584
               MSC_n       2.472     2.261     2.091
   25     25   BIAS_n      .2836     .0486    -.0130
               MSE_n       .2661     .1385     .0413
               ASMSE_n     4.574     4.380     3.693
               MSC_n       2.291     2.158     2.049
 1000   1000   BIAS_n      .4576     .4254     .3118
               MSE_n       .2099     .1833     .1038
               ASMSE_n     53.15     47.88     31.17
               MSC_n       2.155     2.111     2.077
TABLE B2 - PERFORMANCE OF ESTIMATORS
C = 1.5   φ_1 = 1.0   k_A = 50   k_C = 50

                           Stage of Algorithm
    A                     10        50       250
  2.3   BIAS_n          .3544     .1284     .0113
        MSE_n           .2016     .1091     .0377
        ASMSE_n         5.335     4.344     3.617
        MSC_n           2.268     2.159     2.053
   .5   BIAS_n          .4389     .3617     .2466
        MSE_n           .1974     .1418     .0721
        ASMSE_n         5.222     5.645     6.910
        MSC_n           2.285     2.208     2.124
  2.0   BIAS_n          .3658     .1584     .0289
        MSE_n           .1940     .1045     .0359
        ASMSE_n         5.134     4.162     3.440
        MSC_n           2.269     2.161     2.059
  3.0   BIAS_n          .3267     .0831    -.0128
        MSE_n           .2239     .1286     .0467
        ASMSE_n         5.922     5.121     4.480
        MSC_n           2.262     2.145     2.045
  5.0   BIAS_n          .2506    -.0358    -.0324
        MSE_n           .3323     .1775     .0752
        ASMSE_n         8.790     7.066     7.208
        MSC_n           2.255     2.128     2.038
 10.0   BIAS_n          .1180    -.0934    -.0501
        MSE_n           .7349     .5089     .1362
        ASMSE_n         19.44     20.26     13.06
        MSC_n           2.240     2.128     2.054
TABLE B3 - PERFORMANCE OF ESTIMATORS
A = 2.3   φ_1 = 1.0   k_A = 50   k_C = 50

                           Stage of Algorithm
    C                     10        50       250
  1.5   BIAS_n          .3544     .1284     .0113
        MSE_n           .2016     .1091     .0377
        ASMSE_n         5.335     4.344     3.617
        MSC_n           2.268     2.159     2.053
   .5   BIAS_n          .3239     .0741    -.0477
        MSE_n           .3154     .2494     .0992
        ASMSE_n         8.343     9.928     9.505
        MSC_n           2.102     2.037     1.983
  1.0   BIAS_n          .3365     .1250    -.0057
        MSE_n           .2260     .1495     .0520
        ASMSE_n         5.979     5.953     4.986
        MSC_n           2.172     2.078     2.002
  2.0   BIAS_n          .3683     .1609     .0348
        MSE_n           .1959     .1046     .0357
        ASMSE_n         5.181     4.165     3.424
        MSC_n           2.367     2.253     2.126
  3.0   BIAS_n          .4138     .2577     .1036
        MSE_n           .2188     .1407     .0455
        ASMSE_n         5.788     5.600     4.364
        MSC_n           2.562     2.459     2.306
  5.0   BIAS_n          .5325     .6000     .5070
        MSE_n           .3253     .4521     .3189
        ASMSE_n         8.607     18.00     30.57
        MSC_n           2.798     2.747     2.673
TABLE B5 - PERFORMANCE OF ESTIMATORS
A = 2.3   C = 1.5   φ_1 = 2.5

                              Stage of Algorithm
  k_A    k_C                 10        50       250
   50     50   BIAS_n      1.582     .7228     .1177
               MSE_n       2.640     .7360     .0649
               ASMSE_n     69.84     29.30     6.218
               MSC_n       2.745     2.522     2.210
   50     25   BIAS_n      1.585     .7246     .1196
               MSE_n       2.627     .7147     .0634
               ASMSE_n     69.50     28.45     6.078
               MSC_n       2.763     2.542     2.221
   25     50   BIAS_n      1.320     .4422     .0450
               MSE_n       2.137     .5244     .0511
               ASMSE_n     36.74     16.59     4.567
               MSC_n       2.694     2.419     2.148
   25     25   BIAS_n      1.320     .4489     .0446
               MSE_n       2.100     .5046     .0508
               ASMSE_n     36.10     15.96     4.540
               MSC_n       2.716     2.437     2.158
   10     50   BIAS_n      .9191     .2402     .0180
               MSE_n       1.871     .5454     .0925
               ASMSE_n     20.55     14.43     7.905
               MSC_n       2.593     2.314     2.105
   10     10   BIAS_n      .9539     .2250     .0192
               MSE_n       1.768     .4644     .0871
               ASMSE_n     19.43     12.29     7.449
               MSC_n       2.650     2.359     2.118
TABLE B6 - PERFORMANCE OF ESTIMATORS
A = 2.3   C = 1.5   φ_1 = -1.0

                              Stage of Algorithm
  k_A    k_C                 10        50       250
   50     50   BIAS_n     -1.434    -1.132    -.4743
               MSE_n       2.058     1.289     .2473
               ASMSE_n     54.44     51.31     23.71
               MSC_n       2.256     2.136     1.970
   50     25   BIAS_n     -1.435    -1.136    -.4810
               MSE_n       2.061     1.298     .2524
               ASMSE_n     54.51     51.67     24.20
               MSC_n       2.184     2.098     1.963
   25     50   BIAS_n     -1.349    -.8867    -.2604
               MSE_n       1.824     .8129     .1020
               ASMSE_n     31.35     25.71     9.121
               MSC_n       2.229     2.049     1.947
   25     25   BIAS_n     -1.350    -.8921    -.2623
               MSE_n       1.828     .8208     .1034
               ASMSE_n     31.43     25.96     9.269
               MSC_n       2.162     2.019     1.943
   10     50   BIAS_n     -1.149    -.5442    -.1232
               MSE_n       1.354     .3901     .0586
               ASMSE_n     14.87     10.32     5.010
               MSC_n       2.167     1.988     1.960
   10     10   BIAS_n     -1.157    -.5542    -.1265
               MSE_n       1.366     .3916     .0585
               ASMSE_n     15.01     10.36     5.000
               MSC_n       2.042     1.953     1.959
APPENDIX C

SOME ALTERNATIVE MODELS

§1  Random Replacement model
An early modification of the basic ARP model considers φ as a random variable rather than a fixed but unknown constant, making [0,φ) a "random replacement interval."  While this model is more general, no savings are made and much simplicity is lost.  As before, let X_1,...,X_n be a random sample having d.f. F and now suppose that φ is an independent r.v. with d.f. G (assume G is left-continuous).  Then Z_i = min(X_i, φ) has distribution function 1 − S(1−G) since P(Z ≥ t) = P(X ≥ t, φ ≥ t) = S(t)(1−G(t)).  Thus,

    lim E[C(t)/t] = {C_1 P(Z < φ) + C_2 P(Z ≥ φ)}/E[Z]
                  = {C_1 ∫_0^∞ F(u)dG(u) + C_2 ∫_0^∞ G(u)dF(u)} / {∫_0^∞ (1−G(u))S(u)du}
                  = {∫_0^∞ P(u)dG(u)} / {∫_0^∞ Q(u)dG(u)},

where P(t) = C_1 F(t) + C_2 S(t) and Q(t) = ∫_0^t S(u)du.  Since R_1(t) = P(t)/Q(t), we assume there exist sufficient conditions such that R_1(t) may be uniquely optimized.  Let φ* be such a minimization point.  Then

    lim E[C(t)/t] = {∫_0^∞ P(u)dG(u)}/{∫_0^∞ Q(u)dG(u)} ≥ P(φ*)/Q(φ*),

the expected cost under nonrandom replacement.  Thus, in this broad class of problems, we see there is little to be gained by considering a random replacement interval.  See Barlow and Proschan (1965) for more details.
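The inequality is easy to check numerically.  The sketch below uses the Weibull example of Chapter 6 and an arbitrary two-point replacement distribution G; both choices are illustrative assumptions, not part of the argument above.

```python
import math

# P(t), Q(t) as defined above, for Weibull(alpha=2.2, lam=2) lifetimes with
# C1 = 5, C2 = 1; G puts mass 1/2 on each of two arbitrary replacement ages.
C1, C2, ALPHA, LAM = 5.0, 1.0, 2.2, 2.0

def S(u):
    return math.exp(-((u / LAM) ** ALPHA))

def P(t):
    return C1 * (1.0 - S(t)) + C2 * S(t)

def Q(t, m=200):                 # Simpson rule for the integral of S on [0, t]
    h = t / m
    acc = S(0.0) + S(t) + sum((4 if j % 2 else 2) * S(j * h) for j in range(1, m))
    return acc * h / 3.0

# Deterministic optimum min_t R_1(t) = P(t)/Q(t), found on a crude grid:
best = min(P(0.05 * k) / Q(0.05 * k) for k in range(1, 60))
# Randomized policy cost {int P dG}/{int Q dG} for the two-point G:
ages = (0.6, 1.4)
randomized = sum(P(t) for t in ages) / sum(Q(t) for t in ages)
print(round(best, 3), round(randomized, 3))   # randomized is never smaller
```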
§2  Discounted ARP

An interesting and useful modification of the replacement problem is the introduction of the time value of costs into the problem.  Consider the ordinary age replacement policy with planned (failure) replacement of a unit with associated cost C_2 (C_1).  Under the discount ARP model, the objective is to find a fixed but unknown φ such that the expected cost of the process is minimized, where all costs are discounted back to (a fixed but arbitrary) time zero.  No work seems to exist in the literature for the random replacement model with a discount feature.
Under the discount model, let d = log(1+r) be the discount factor (r the prevailing interest rate), t_i = Z_1 + ... + Z_i the time until the ith replacement, and C(Z_i) the cost of the ith unit (= C_2 if planned, C_1 otherwise).  Thus the discounted cost of the process is a random variable Y, where

    Y = Σ_{i=1}^∞ C(Z_i) exp{−d t_i},
and we wish to find φ to minimize the expected value of Y.  Computing E Y, we get

    E Y = Σ_{i=1}^∞ E[C(Z_i) exp(−d t_i)]
        = Σ_{i=1}^∞ E[C(Z) exp(−dZ)] (E[exp(−dZ)])^{i−1}
        = E[C(Z) exp(−dZ)] / {1 − E[exp(−dZ)]}.

Now, E[C(Z) exp(−dZ)] = C_1 ∫_0^φ exp(−du)dF(u) + C_2 exp(−dφ)S(φ), and, via integration by parts, we get

    ∫_0^φ exp(−du) S(u)du = {1 − E[exp(−dZ)]}/d.

Hence,

(1)    R_2(φ) = {C_1 ∫_0^φ exp(−du)dF(u) + C_2 exp(−dφ)S(φ)} / {d ∫_0^φ exp(−du)S(u)du}.
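For concreteness, the display can be evaluated numerically.  The sketch below borrows the Weibull example of Chapter 6 (an assumption; the appendix itself fixes no distribution):

```python
import math

# R_2(t) of (1) for Weibull(alpha=2.2, lam=2) lifetimes with C1 = 5, C2 = 1.
C1, C2, ALPHA, LAM = 5.0, 1.0, 2.2, 2.0

def S(u):
    return math.exp(-((u / LAM) ** ALPHA))

def f(u):                              # Weibull density
    return (ALPHA / LAM) * (u / LAM) ** (ALPHA - 1.0) * S(u)

def simpson(h, t, m=400):
    w = t / m
    acc = h(0.0) + h(t) + sum((4 if j % 2 else 2) * h(j * w) for j in range(1, m))
    return acc * w / 3.0

def R2(d, t):
    num = C1 * simpson(lambda u: math.exp(-d * u) * f(u), t) \
          + C2 * math.exp(-d * t) * S(t)
    den = d * simpson(lambda u: math.exp(-d * u) * S(u), t)
    return num / den

d = 0.001
print(round(d * R2(d, 1.0), 3))   # d*R2(d,t) is close to R_1(1.0) for small d
```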
In 1966, Fox gave sufficient conditions under which a finite unique optimal interval exists.  His result is that, if the failure rate r(·) exists, is continuous and strictly increasing to ∞, then a unique, finite φ exists satisfying (2).  This was strengthened by Ran and Rosenlund and is given through introducing maintenance costs into the model (see §3).

Fox also showed that as the discount rate approaches zero, we return to the long-run expected cost per unit time ("classical") ARP model.  Hence, with small prevailing interest rates, the classical ARP model is approximately the same as the discount ARP model.  Explicitly denoting the dependence of R_2(t) on d by R_2(d,t), Fox showed that:

(3)    ∀ t > 0,  lim_{d→0+} d R_2(d,t) = R_1(t),

(4)    lim_{d→0+} d R_2(d,φ_d) = R_1(φ),

and, if r(·) is continuous and strictly increasing to ∞, then

(5)    lim_{d→0+} φ_d = φ.

As with the usual ARP, it has been shown that for a fixed d the optimal replacement time φ_d cannot occur in an interval where the failure rate is decreasing (Denardo and Fox, 1967).
§3  Discounted ARP with explicit cost function

The introduction of a cost intensity or maintenance costs makes for a more realistic if less wieldy model.  Scheaffer (1971) first introduced the notion of explicitly accounting for a cost related to the age of the unit.  Three typical reasons cited why there may be a cost associated with a unit are:
(a) the adjustments that need to be made,
(b) the unit may perform less efficiently as it ages, and
(c) replacement costs may increase due to depreciation or wear.
One could consider a policy for replacing gasoline engines whose cost of operation increases with time due to increased gas and oil consumption.  Another example is rubber tires, whose salvage value will decrease as wear increases.

Scheaffer assumed an increasing cost factor (increasing with the age of the unit) and sought to minimize the long-run cost per unit time, i.e., the classical ARP objective function.  For explicit cost functions he showed how to derive the optimal time interval and also noted that, along the lines of Fox, a random replacement policy is superfluous.  By properly choosing the cost factor, one may broaden the class of life distributions considered to include the (negative) exponential distribution.  In his examples, Scheaffer considered the exponential life distribution whose ARP is a failure replacement policy.

Cleroux and Hanscom (1974) gave a reasonable model where the costs were not necessarily increasing (arbitrary) and when the costs occurred at fixed, equal length times of a unit's life.  This was generalized by Ran and Rosenlund (1976) who considered any continuous cost intensity function.  Of course, a sequence of continuous cost functions may be used to approximate the discrete ones of Cleroux and Hanscom arbitrarily closely.  While Cleroux and Hanscom used the usual ARP objective function, Ran and Rosenlund minimized expected discounted costs.
Letting g(t) be our continuous cost intensity function, we can define A(t) = ∫_0^t exp(−du) S(u)du, the expected discounted usage per unit, and

    O(t) = ∫_0^t exp(−du) S(u) g(u)du,

the expected cost due to the cost intensity.  As before, the discounted cost can be represented via the random variable Y, where

    Y = Σ_{i=1}^∞ {C(Z_i) exp(−d t_i) + exp(−d t_{i−1}) ∫_0^{Z_i} exp(−du) g(u)du},

and Ran and Rosenlund showed that

    E[Y] = R_3(φ) = R_2(φ) + O(φ)/(d A(φ)),

where R_2(φ) is defined in (1).  Ran and Rosenlund extended a proof first given by Cleroux and Hanscom to give sufficient conditions under which a finite optimal interval exists.  If there exists φ* < ∞ such that

    {(C_1−C_2)r(φ) + g(φ)}/d − C_2 > min{R_3(t): 0 < t < ∞}   for all φ > φ*,

then φ_d ≤ φ*.  The immediate corollary is that if either r(t) or g(t) goes to ∞ as t goes to ∞, then φ_d exists and is finite, which strengthens an earlier result of Fox.  For finding minima (local or otherwise), the equation corresponding to (2) is (6).
§4  Block replacement policies

Any review of age replacement policies, however sketchy, would be incomplete without mentioning its major competitor, block replacement policies.  Under block replacement, a stochastically failing unit is replaced at failure with cost C_1, or at times φ, 2φ, 3φ, ... with cost C_2, where 0 < C_2 < C_1.  To optimize this policy, again the choice of φ depends on C_1, C_2, and the failure distribution of the units.  This policy has obvious intuitive appeal when several units are on test simultaneously, e.g., the lightbulb replacement problem.  The choice of which policy (or their many variants) depends on the physical situation, and guidelines for choosing have been discussed by many authors (cf., Barlow and Proschan, 1964 and Gertsbakh, 1977, pg. 99-103).
BIBLIOGRAPHY

Abdelhamid, S.N. (1973). Transformation of observations in stochastic approximation. Ann. Statist. 1, 1158-1174.

Albert, A.E. and Gardner, L.A. (1967). Stochastic Approximation and Non-Linear Regression. M.I.T. Press, Cambridge, Mass.

Anbar, D. (1973). On optimal estimation methods using stochastic approximation procedure. Ann. Statist. 1, 1175-1184.

Anbar, D. (1976i). An application of a theorem of Robbins and Siegmund. Ann. Statist. 4, 1018-1021.

Anbar, D. (1976ii). An asymptotically optimal inspection policy. Naval Res. Logist. Quart. 23, 211-218.

Anbar, D. (1978). A stochastic Newton-Raphson method. J. Statist. Plan. Inf. 2, 153-163.

Aoki, M. (1977). Optimal Control and Systems Theory in Dynamic Economic Analysis. North-Holland Publishing Co., New York.

Arunkumar, S. (1972). Nonparametric age replacement policy. Sankhya Ser. A 34, 251-256.

Aven, T. (1982). Optimal replacement times - a general set-up. Statist. Res. Report 1, 1982. Univ. of Oslo, Norway.

Barlow, R.E. and Proschan, F. (1964). Comparison of replacement policies and renewal theory implications. Ann. Math. Statist. 35, 577-589.

Barlow, R.E. and Proschan, F. (1965). Mathematical Theory of Reliability. Wiley, New York.

Bather, J.A. (1977). On the sequential construction of an optimal age replacement policy. Bulletin of the International Statistical Institute 47, 253-266.

Beichelt, F. (1981). Replacement policies based on system age and maintenance cost limits. Math. Operationsforsch. Statist., Ser. Statistics 12, 621-627.

Berg, M. (1976). A proof of optimality for age replacement policies. Journal of Applied Probability 13, 751-759.
Bergman, B. (1979). On age replacement and the total time on test concept. Scandinavian Journal of Statistics 6, 161-168.

Bhattacharya, P.K. (1967). Estimation of a probability density function and its derivatives. Sankhya Ser. A 29, 373-382.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Blum, J.R. (1954i). Approximation methods which converge with probability one. Ann. Math. Statist. 25, 382-386.

Blum, J.R. (1954ii). Multidimensional stochastic approximation methods. Ann. Math. Statist. 25, 737-744.

Breslow, N. and Crowley, J. (1974). A large sample study of the life table and product limit estimates under random censorship. Ann. Statist. 2, 437-453.

Burkholder, D.L. (1956). On a class of stochastic approximation processes. Ann. Math. Statist. 27, 1044-1059.

Chernoff, H. (1964). Estimation of the mode. Ann. Inst. Statist. Math. 16, 31-41.

Chung, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25, 463-483.

Chow, Y.S. (1965). Local convergence of martingales and the law of large numbers. Ann. Math. Statist. 36, 552-558.

Chow, Y.S., Robbins, H. and Siegmund, D. (1971). Great Expectations: The Theory of Optimal Stopping. Houghton-Mifflin Co., Boston.

Cleroux, R. and Hanscom, M. (1974). Replacement with adjustment and depreciation costs and interest charges. Technometrics 16, 235-239.

Csorgo, M. and Revesz, P. (1981). Strong Approximation in Probability and Statistics. Akademiai Kiado, Budapest.

Denardo, E.V. and Fox, B.L. (1967). Nonoptimality of planned replacement in intervals of decreasing failure rate. Operations Research 15, 358-359.

Dupac, V. (1965). A dynamic stochastic approximation method. Ann. Math. Statist. 36, 1695-1702.

Dupac, V. (1966). Stochastic approximation in the presence of a trend. Czech. Math. J. 16 (91), 454-461.
Dupac, V. (1977). Stochastic approximation methods in linear regression model (with consideration of errors in regressors). Math. Operationsforsch. Statist., Ser. Statistics 8, 107-118.

Dupac, V. and Kral, F. (1972). Robbins-Monro procedure with both variables subject to experimental error. Ann. Math. Statist. 43, 1089-1095.

Dvoretsky, A. (1956). On stochastic approximation. Proc. Third Berkeley Symp. Math. Statist. Probab. 1 (J. Neyman, ed.), 39-55. Univ. California Press.

Eddy, W. (1980). Optimal kernel estimators of the mode. Ann. Statist. 8, 870-882.

Fabian, V. (1967). Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38, 191-200.

Fabian, V. (1968i). On the choice of design in stochastic approximation methods. Ann. Math. Statist. 39, 457-465.

Fabian, V. (1968ii). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39, 1327-1332.

Fabian, V. (1971). Stochastic approximation. Optimizing Methods in Statistics, J.S. Rustagi (ed.), 439-460.

Fabian, V. (1973). Asymptotically efficient stochastic approximation; the R-M case. Ann. Statist. 1, 486-495.

Fabian, V. (1978). On asymptotically efficient recursive estimation. Ann. Statist. 6, 854-866.

Feller, W. (1971). An Introduction to Probability Theory and its Applications II. Wiley, New York.

Fox, B. (1966). Age replacement with discounting. Operations Research 14, 533-537.

Fritz, J. (1973). Stochastic approximation for finding local maxima of probability densities. Studia Sci. Math. Hungar. 8, 309-322.

Gaposkin, V.F. and Krasulina, T.P. (1974). On the law of the iterated logarithm in stochastic approximation processes. Theor. Probab. Appl. 19, 844-850.

Gertsbakh, I.B. (1977). Models of Preventive Maintenance. North-Holland, New York.
Page 149
Glasser, G.J. (1967).
Technometrics 9, 83-91.
The age replacement problem.
Govindarajulu, z. (1975). Sequential Statistical
Procedures. Academic Press, New York.
.
Has'minskii, R. (1975). Sequential estimation and
recursive asymptotically optimal procedures of estimation and
observation control. Proc. Prague ~ . Asymptotic Statist .
1, 157-178.
Has'minskii, R. (1977). Stochastic approximation methods
in non-linear regression models. Math. Operationsforsch.
Statist., Sere Statistics 8, 95-106.
Heyde, C. (1974). On martingale limit theory and strong
convergence results for stochastic approximation procedures.
Stoch. Proc. Appl. 2, 359-370.
Holst, U. (1980). Convergence of a recursive stochastic
algorithm with m-dependent observations. Scand. J. Statist.
7, 207-215.
Holst, U. (1982). Convergence of a recursive stochastic
algorithm with strongly regular observations. TFMS-3026,
Dept. Math. Statist., U. of Lund, Sweden.
Isogai, E. (1980). Strong consistency and optimality of a
sequential density estimator. Bull. Math. Statist. 19, 55-69.
Janac, K. (1971). Adaptive stochastic approximations.
Simulation 16, no. 2, 51-58.
Kersting, G. (1977i). Some results in the asymptotic
behaviour of the Robbins-Monro procedure. Bull. Int.
Statist. Inst. 47, II 327-335.
Kersting, G. (1977ii). Almost sure approximation of the
Robbins-Monro process by sums of independent random
variables. Ann. Probability 5, 954-965.
Kiefer, J. and Wolfowitz, J. (1952). Stochastic
estimation of the maximum of a regression function. Ann.
Math. Statist. 23, 462-466.
Konakov, V.D. (1974). On the asymptotic normality of the
mode of multidimensional distributions. Theor. Probab. Appl.
19, 794-799.
Kushner, H.J. (1977). General convergence results for
stochastic approximations via weak convergence theory. J.
Math. Anal. Appl. 61, 490-503.
Kushner, H.J. and Clark, D.S. (1978). Stochastic
Approximation Methods for Constrained and Unconstrained
Systems. Springer-Verlag, New York.
Lai, T.L. and Robbins, H. (1978). Limit theorems for
weighted sums and stochastic approximation processes. Proc.
Natl. Acad. Sci. USA 75, no. 3, 1068-1070.
Lai, T.L. and Robbins, H. (1979). Adaptive designs and
stochastic approximation. Ann. Statist. 7, 1196-1221.
Lai, T.L. and Robbins, H. (1981). Consistency and
asymptotic efficiency of slope estimates in stochastic
approximation schemes. Z. Wahrscheinlichkeitstheorie Verw.
Geb. 56, 329-360.
Ljung, L. (1978). Strong convergence of a stochastic
approximation algorithm. Ann. Statist. 6, 680-696.
McLeish, D.L. (1976). Functional and random central limit
theorems for the Robbins-Monro process. J. Appl. Prob. 13,
148-154.
Miller, R.G. (1981). Survival Analysis. Wiley, New York.
Nadas, A. (1969). An extension of a theorem of Chow and
Robbins on sequential confidence intervals for the mean.
Ann. Math. Statist. 40, 667-671.
Nevelson, M. (1975). On the properties of the recursive
estimates for a functional of an unknown distribution
function. Limit Theorems of Probability Theory, P. Revesz
(ed.). Coll. Math. Soc. Janos Bolyai 11, 227-251.
Obremski, T.E. (1976). A Kiefer-Wolfowitz type stochastic
approximation procedure. Ph.D. dissertation. Dept. Statist.
and Probab., Mich. S. Univ., East Lansing, Mich.
Parzen, E. (1962). On estimation of a probability density
function and mode. Ann. Math. Statist. 33, 1065-1076.
Ran, A. and Rosenlund, S.I. (1976). Age replacement with
discounting for a continuous maintenance cost model.
Technometrics 18, 459-465.
Revesz, P. (1977). How to apply the method of stochastic
approximation in nonparametric estimation of a regression
function. Math. Operationsforsch. Statist., Ser. Statistics
8, 119-126.
Robbins, H. and Monro, S. (1951). A stochastic
approximation method. Ann. Math. Statist. 22, 400-407.
Robbins, H. and Siegmund, D. (1971). A convergence
theorem for nonnegative almost supermartingales and some
applications. Optimizing Methods in Statistics (J.S.
Rustagi, Ed.), 233-257. Academic Press, N.Y.
Ruppert, D. (1979). A new dynamic stochastic
approximation procedure. Ann. Statist. 7, 1179-1195.
Ruppert, D. (1982). Almost sure approximations to the
Robbins-Monro and Kiefer-Wolfowitz processes with dependent
noise. Ann. Statist. 10, 178-187.
Ruppert, D., Reish, R.L., Deriso, R.B., and Carroll,
R.J. (1982). Monte Carlo optimization by stochastic
approximation (with application to harvesting of Atlantic
menhaden). Institute of Statistics Mimeo Series #1500,
Chapel Hill, North Carolina.
Sacks, J. (1958). Asymptotic distribution of stochastic
approximation procedures. Ann. Math. Statist. 29, 373-405.
Sager, T.W. (1978). Estimation of a multivariate mode.
Ann. Statist. 6, 802-812.
Sakrison, D. (1965). Efficient recursive estimation;
applications to estimating the parameters of a covariance
function. Int. J. Engng. Sci. 3, 461-483.
Samanta, M. (1973). Nonparametric estimation of the mode
of a multivariate density. South African Statist. J. 7, 109-117.
Scheaffer, R.L. (1971). Optimum age replacement policies
with an increasing cost factor. Technometrics 9, 83-91.
Sen, P.K. (1981). Sequential Nonparametrics: Invariance
Principles and Statistical Inference. Wiley, New York.
Sielken, R.L. (1973). Stopping times for stochastic
approximation procedures. Z. Wahrscheinlichkeitstheorie
Verw. Geb. 26, 67-75.
Singh, R.S. (1976). Nonparametric estimation of mixed
partial derivatives of a multivariate density. J. Multi.
Anal. 6, 111-112.
Singh, R.S. (1977). Improvement on some known
nonparametric uniformly consistent estimators of derivatives
of a density. Ann. Statist. 5, 394-399.
Skorohod, A.V. (1956). Limit theorems for stochastic
processes. Theor. Probab. Appl. 1, 261-290.
Smith, W.L. (1955). Regenerative stochastic processes.
Proc. Royal Soc., Series A 232, 6-31.
Stone, C.J. (1963). Weak convergence of stochastic
processes defined on semi-infinite time intervals. Proc.
Amer. Math. Soc. 14, 694-696.
Stone, C.J. (1980). Optimal rates for nonparametric
estimators. Ann. Statist. 8, 1348-1360.
Stout, W.F. (1974). Almost Sure Convergence. Academic
Press, New York.
Stroup, D.F. and Braun, I. (1982). On a stopping rule for
stochastic approximation. Z. Wahrscheinlichkeitstheorie
Verw. Geb. 60, 535-554.
Uosaki, K. (1974). Some generalizations of dynamic
stochastic approximation processes. Ann. Statist. 2, 1042-1048.
Venter, J.H. (1967). An extension of the Robbins-Monro
procedure. Ann. Math. Statist. 38, 181-190.
Wegman, E.J. (1972). Nonparametric probability density
estimation: II. A comparison of density estimation methods.
J. Statist. Comp. Sim. 1, 225-245.
Wetherill, G.B. (1966). Sequential Methods in Statistics.
Methuen, London.
INDEX

NOTATION AND ASSUMPTIONS

Symbol            Short description                                        Page where first introduced

F(.), S(.)        failure, survival distribution                           1
C1, C2            failure, planned replacement cost                        1
φ                 optimal replacement time                                 1
R1(φ)             long-run expected cost per unit time                     1
N(t)              number of replacements by time t                         2
E, P              expectation, probability operators                       4
Fn                sigma-field of the past                                  4
σ(.)              sigma-field generated by (.)                             4
En                conditional expectation                                  5
o, O, op, Op      asymptotic relationships for orders of magnitude         5
Rm, Bm            m-dimensional Euclidean space and associated Borel sets  5
∈                 membership in a set                                      5
C(t)              cost by time t                                           5
N1(t), N2(t)      number of failure, planned replacements by time t        6
μ                 mean of distribution function F                          7
I(.)              indicator function                                       7
f(.)              density function for distribution function F             9
ARE               asymptotic relative efficiency of the ARP                9
[.]               greatest integer function                                11
M(t)              function associated with cost                            33
{an}, {cn}        sequences used in the SA ARP procedure                   34
φn                recursive estimator of φ                                 34
[.]a^b            truncation operator                                      34
g(.)              transform function                                       35
φ*                g^{-1}(φ)                                                35
Mg,n(.)           estimator of M(g(.))                                     36
fg,n(.)           histogram estimator of the density                       36
F1n, S1n          estimators of F(.), S(.)                                 36
(F*g)(p)          pth derivative of F(g(.))                                36
A, C              constants associated with {an}, {cn}                     37
γ                 constant which determines the rate of convergence        37
Γ                 = A(g'(φ))^2 M'(g(φ)); factor in the ARP
                  asymptotic variance                                      37
N(a,b)            random variable that is distributed normally
                  with mean a and variance b                               37
Bn                bias term (φn - φ)                                       38
M1n, M2n          factors of Mg,n and M(g(.))                              39
σ²                variance of distribution function F                      39
λn                estimator of λ                                           39
f*g,n(1), f*g,n(p+1)  general estimators of (F*g)(1) and (F*g)(p+1)
                  satisfying A8 and A9                                     52
                  constants used in A9                                     52
                  class of kernel functions                                52
                  class of orthogonal kernel functions                     53
                  class of orthogonal kernel functions that are
                  continuous and of bounded variation                      53
fg,n(p+1)         SA kernel estimator of (F*g)(p+1)                        55
μ0, σ0²           asymptotic mean and variance in Theorem 3.1              57
μ1, σ1²           asymptotic mean and variance in Corollary 3.1            59
Aopt, Copt        optimal choice of A, C                                   60
∨, ∧              max, min                                                 63
                  quantities used in Mg,n(.) and M(g(.))                   64
                  used to adaptively estimate Aopt                         67
                  used to estimate (F*g)(r+1)                              68
                  to adaptively estimate Copt                              69
μ2, σ2²           asymptotic mean and variance in Theorem 3.3              72
f(p)              pth derivative of f, a density                           74
fn(p)             SA kernel estimator for f(p)                             77
r0, δ0            factors in SA density estimator                          78
μ3, σ3²           asymptotic mean and variance for SA univariate
                  density estimator                                        79
D(.)              mixed partial derivative of f(.)                         82
Dn(.)             SA estimator of D(.)                                     83
Rm×m              space of real m×m matrices                               83
Cn                ∈ Rm×m, = diag((cn^(pi+m/2)))                            83
I                 m×m identity matrix                                      83
E                 ∈ Rm×m, = diag((I(pi = p)))                              83
H(.)              matrix of partial derivatives                            83
||.||             m-dimensional Euclidean norm                             83
μ4, Σ4            asymptotic mean and covariance matrix of SA
                  multivariate density estimator                           85
μ5, Σ5            asymptotic mean and covariance matrix of SA
                  multivariate mode estimator                              85
f(pi)             ∂^pi f(x)/∂xi^pi                                         86
σ6²               factor in SA K-W asymptotic variance                     94
μ6                asymptotic mean of SA K-W procedure                      95
Bn                vector of bias                                           100
εn                vector of errors                                         100
D                 space of functions on [0,∞) having left hand
                  limits and continuous from the right                     102
⇒W                weak convergence in Stone's topology of D                102
B(.)              Brownian motion on [0,∞)                                 103
zα                (1-α)th quantile of the standard normal
                  distribution                                             107
Assumptions       Page where first introduced

A0-A7             36-37
A8-A9             52
A10               55
A11               67
B1-B5             78
C1-C5             84
D1-D7             96-97
E1-E5             103
E6-E8             108