Johnson, M.A. (1969). "On the Kiefer-Wolfowitz Process and Some of Its Modifications."

This research was supported by the National Institutes of Health,
Institute of General Medical Sciences, Grant No. GM-12868-04.

ON THE KIEFER-WOLFOWITZ PROCESS AND SOME OF ITS MODIFICATIONS

Mark Allyn Johnson
University of North Carolina
Institute of Statistics Mimeo Series No. 644
September 1969
ABSTRACT

JOHNSON, MARK ALLYN. On the Kiefer-Wolfowitz Process and Some of Its Modifications. (Under the direction of JAMES E. GRIZZLE.)

The Kiefer-Wolfowitz (K-W) process is a stochastic approximation procedure used to estimate the value of a controlled variable that maximizes the expected response M(x) of a corresponding dependent random variable.
An extension of a theorem on the asymptotic distribution of a certain class of stochastic approximation procedures, which includes the K-W process, is given. This extension admits an increasing number of observations to be taken on each step, constrains the observed random variable to lie between predetermined bounds (not necessarily finite), and extends the class of admissible norming sequences. Each of these extensions is examined in the context of its usefulness in the application of the K-W process.
Approximations of the small sample mean and variance of the K-W process are given and studied through an examination of various special cases of the response function. Traditionally, the estimation of the variance of the process involves the estimation of M''(θ). It is shown that an efficient estimator of M''(θ) based on the observations generated by the process provides a consistent estimator of M''(θ).
A modification of the K-W process is developed for which the dependent random variables influence the direction, but not the magnitude, of the correction on each step. In addition to obtaining the asymptotic distribution of this process, an approximation for the small sample variance is given. An estimator of this variance is given which does not entail the estimation of unknown parameters. The estimator is shown to be consistent.

Finally, examples illustrating the use of these procedures are given.
BIOGRAPHY
The author was born September 30, 1943, in Weld County, Colorado. He was reared on an irrigated farm near Eaton, Colorado. He graduated from Eaton High School in 1956 and went directly to college.

He received the Bachelor of Science degree with a major in zoology at Colorado State University in 1960. On receiving a grant from the National Institutes of Health, he studied experimental statistics in the graduate school of the same university for two years. In September, 1966, he entered the Graduate School of North Carolina State University at Raleigh to continue studying experimental statistics. He has accepted a position as a statistical consultant for the Upjohn Company in Kalamazoo, Michigan.

The author married Miss Martha McLendon Willey from Nashville, North Carolina, in 1968. As yet they have no children.
ACKNOWLEDGMENTS
The author wishes to thank his advisor, Dr. James Grizzle, who
suggested the topic of this dissertation, for his freely given counsel
throughout his doctoral program.
He also wishes to thank Professors
N. L. Johnson, Pranab K. Sen, Harry Smith, Jr., and H. R. van der Vaart
for serving on his advisory committee and offering assistance during
this study.
The author wishes to thank Mrs. Gay Goss for typing the manuscript and to thank Mrs. Mary Donelan for assisting in the typing of the rough draft of this manuscript.
Finally, the author wishes to thank his parents, Mr. and Mrs. Walter F. Johnson, for their inspiration throughout his academic career, and to thank his dearest wife, Martha, for her patience and her assistance in the proofreading of each manuscript and in the typing of the rough draft.
TABLE OF CONTENTS

LIST OF FIGURES

1. INTRODUCTION AND REVIEW OF LITERATURE
   1.1 Introduction
   1.2 Summary of Literature
   1.3 Summary of Results

2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_0^T
   2.1 Constrained Stochastic Approximation Procedures
   2.2 Constrained Random Variables
   2.3 Asymptotic Distribution of a Type A_0^T Process

3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
   3.1 Definition of the Constrained Kiefer-Wolfowitz Process
   3.2 Asymptotic Properties of the CKW Process
   3.3 Effect of Constraining the K-W Process
   3.4 Expected Small Sample Behavior of the K-W Process
   3.5 Approximations of the Small Sample Variance
   3.6 Estimation of E(X_n - θ)²

4. THE SIGNED KIEFER-WOLFOWITZ PROCESS
   4.1 Definition of the Signed Kiefer-Wolfowitz Process
   4.2 Asymptotic Theory of the SKW Process
   4.3 Sample Properties of the SKW Process
   4.4 Estimation of E(X_n - θ)²

5. TWO EXAMPLES

6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
   6.1 Summary
   6.2 Suggestions for Further Research

LIST OF REFERENCES
LIST OF FIGURES

Figure 1. Graphical Definition of μ(c_n)
Figure 2. Optimal Determination of c_n (= C)
Figure 3. Graph of the Sequence {b_n}
Figure 4. Graph of Reproduction versus Temperature
CHAPTER 1. INTRODUCTION AND REVIEW OF LITERATURE

1.1 Introduction
The problem of finding the maximum of a response curve is often encountered in applied statistics. Two methods for locating this maximum have been developed extensively. The method of steepest ascent, which relies on multiple regression principles, is the most widely accepted. The second method is a sequential procedure. Heretofore, this sequential procedure has been developed mainly along asymptotic lines. We shall develop two modifications of this latter method with special emphasis given to its practical application.
With this in mind, let X be a variable that can be controlled by the experimenter, and let Y(X) be an associated random variable such that E[Y(X)] = M(X). In other words, let M(x) be a function of x, and assume Y(x) is an observable random variable with a distribution function F(·|x) such that

    M(x) = ∫_{-∞}^{∞} y dF(y|x).

It will be assumed throughout this paper that M(x) has a unique maximum at x = θ and that M(x) is increasing or decreasing as x is less than or greater than θ.
The foundations of the method of estimation, which Kiefer and Wolfowitz [8] adapted for the particular problem at hand, are contained in a paper by Robbins and Monro [9]. The spirit of this estimation procedure is to generate a sequence of estimates {X_n} that converges to θ. The procedure is such that the n'th estimate (X_n) equals X_{n-1} plus a correction term that is a function of an observable random variable Y(X_{n-1}) depending on X_i for i = 1, ..., n-1.
These procedures have certain advantages and disadvantages when they are compared to the usual methods of response surface methodology. They are simple to execute and easy to evaluate. They are generally applicable in practice, and they admit a complete mathematical formulation. On the negative side, they have the disadvantage of most sequential procedures in that they require evaluation after each step. Moreover, their variance often depends upon parameters for which a good estimator cannot be calculated from the observations.
The stochastic approximation procedure proposed by Kiefer and Wolfowitz [8] was later called the Kiefer-Wolfowitz process. In this paper it will be referred to as the K-W process. Its definition follows.
Definition: Let {a_n} and {c_n} be sequences of positive numbers, and suppose Y(x) is a random variable with distribution function F(y|x) where M(x) = ∫_{-∞}^{∞} y dF(y|x). Let X_1 be arbitrary, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n)),        (1.1)

where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, and X_n = x_n, are distributed independently as F(y|x_n - c_n) and F(y|x_n + c_n).
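As a concrete illustration of recursion (1.1), the following sketch simulates the K-W process for the quadratic response M(x) = -(x - θ)² observed with Gaussian noise. The response function, the noise level, and the sequences a_n = 1/n and c_n = n^{-1/3} are illustrative choices of ours, not prescriptions from this paper.

```python
import random

def kw_process(theta=2.0, x1=0.0, n_steps=2000, sigma=0.1, seed=1):
    """Simulate recursion (1.1) for the response M(x) = -(x - theta)^2."""
    rng = random.Random(seed)

    def y(x):
        # Noisy observation of the response at the controlled level x.
        return -(x - theta) ** 2 + rng.gauss(0.0, sigma)

    x = x1
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n           # step sizes with sum a_n = infinity
        c_n = n ** (-1.0 / 3)   # spacings with c_n -> 0
        # Move opposite to the observed difference Y(x - c_n) - Y(x + c_n).
        x = x - a_n / c_n * (y(x - c_n) - y(x + c_n))
    return x

estimate = kw_process()
```

With these choices the iterates settle near the maximizing value θ.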
n
Before summarizing the literature on this process, some
motiva~
tion for the assumptions on the sequences {a } and {c } will be given.
n
n
It follows from the definition that
If x
n
- c
> 8, then M(x - c ) - M(x
nn
n
n
x + c <
n
n-
e,
then M(x
n
+
c ) is positive, and if
n
- c ) - M(x + c ) is negative.
n
n
n
Obviously, under
suitable regularity assumptions on M(x), there exists a point
that M(x
n
zero as x
n
~(c
n
) such
- c ) - M(x + c ) is greater than, equal to, or less than
n
n
n
is greater than,equal to, or less than
~(c).
See Figure 1
n
for an illustration.
[Figure: the points μ(c_n) - c_n, θ, μ(c_n), and μ(c_n) + c_n marked on the x-axis.]

Figure 1. Graphical Definition of μ(c_n)
It is equally clear that if μ(c_n) exists, then |μ(c_n) - θ| ≤ c_n, and that if M(x) is symmetric about θ, then μ(c_n) = θ. If c_n → c, intuition tells us that X_n should tend to μ(c) rather than θ. Thus, if M(x) is not required to be symmetric about θ, it is usually assumed that c_n → 0. Moreover,

    E(X_3 - θ | X_1 = x_1) ≥ E(X_2 - θ - 2a_2σ | X_1 = x_1) ≥ (x_1 - θ) - 2σ Σ_{m=1}^{2} a_m.

By induction,

    E(X_{n+1} - θ | X_1 = x_1) ≥ (x_1 - θ) - 2σ Σ_{m=1}^{n} a_m.

If Σ a_m < ∞, it is possible to choose (x_1 - θ) > 2σ Σ_{m=1}^{∞} a_m, implying an asymptotic bias; thus it is necessary to assume that Σ a_n = ∞.
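The necessity of Σ a_n = ∞ can also be seen numerically: with a summable step-size sequence the total distance the process can travel is bounded, so a start far from θ leaves a permanent bias. A small sketch under assumed illustrative settings (quadratic response, Gaussian noise; none of these choices come from the paper):

```python
import random

def kw(theta, x1, a_seq, c=0.5, seed=0):
    """Recursion (1.1) with a fixed spacing c and a given step-size sequence."""
    rng = random.Random(seed)
    y = lambda u: -(u - theta) ** 2 + rng.gauss(0.0, 0.05)
    x = x1
    for a_n in a_seq:
        x = x - a_n / c * (y(x - c) - y(x + c))
    return x

theta, x1 = 5.0, 0.0
summable = [0.01 * 2.0 ** (-n) for n in range(50)]   # sum a_n = 0.02 < infinity
divergent = [1.0 / (n + 1) for n in range(50)]       # partial sums grow without bound
biased = kw(theta, x1, summable)       # stays near x1: an asymptotic bias
consistent = kw(theta, x1, divergent)  # reaches the neighborhood of theta
```

The summable sequence leaves the estimate stranded near the starting value, while the divergent one does not.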
5
.e
.
Th east
common
1
.
assumpt~on
is somewhat less intuitive.
tend to zero.
~s that ~~a2c-2 <
nn
~
00.
This assumption
2 -2
However, it is easy to see why a c
must
nn
In fact, it follows from (1.1) and the inequalities
= E(-ann
c-l[(y(x -c)
n n
- M(x -c »
n n
- (y(x +c ) - M(x +c »])2~
n n
n n
These three assumptions have almost always been made in theorems
implying X
n
that c
e in
+
any usual mode of convergence.
The exception is
does not have to converge to zero if M(x) is symmetric about
n
e.
The validity of these assumptions will always be assumed in the following review.
1.2 Summary of Literature

In their introduction of the K-W process, Kiefer and Wolfowitz [8] proved that X_n → θ in probability if M(x) is bounded by two quadratic functions, E(Y(x))² < ∞, and Σ a_n c_n < ∞. In later papers (J. R. Blum [1], D. L. Burkholder [2], and A. Dvoretzky [4]) the assumption that Σ a_n c_n < ∞ was dropped, and the conclusion was strengthened to that of convergence with probability one. Burkholder's theorem relaxes the assumption of quadratic bounds on M(x) with an assumption which implies, for practical purposes, that |M'(x)| > 0 if x ≠ θ. Burkholder also noted that this assumption admits response functions taking values on bounded value sets (e.g., exp(-x²)).
The first significant results on the asymptotic distribution of X_n were derived by Burkholder [2], who used methods suggested by K. L. Chung [3]. Burkholder established that the moments of n^{β/2}(X_n - θ) converge to those of a central normal random variable if the sequences {a_n} and {c_n} are of the form a_n = O(n^{-1}) and c_n = O(n^{-w}), where β/2 = 1/2 - w. Burkholder also assumed that w > (4 + 2η)^{-1}, where |μ(ε) - θ| = O(ε^{1+η}). He proved that, if the third derivative of M(x) is continuous, then η ≥ 1. Using a method developed by Hodges and Lehmann [7], he was able to drop the assumption of the quadratic bounds, but he had to weaken the conclusion to that of convergence in distribution.
Sacks [10] used a different approach to show that n^{1/3}(ln n)^{-1}(X_n - θ) was asymptotically normal under similar restrictions on the sequence {a_n} and the assumption of quadratic bounds on M(x). He did this by considering a slightly different class of sequences {c_n}. He also proved that, in general, n^{1/3}(X_n - θ) is not asymptotically normal under the assumptions on the sequences {a_n} and {c_n} discussed in the introduction.
More recently, Fabian [6] proved a very general theorem about which he stated, ". . . [the theorem] implies in a simple way all known results on asymptotic normality in various cases of stochastic approximation." He used his result to obtain the asymptotic normality of his generalization of the K-W process (Fabian [5]), which can be constructed to obtain an asymptotic variance of order n^{-(1-ε)} for any ε > 0.

Fabian achieves this smaller variance by taking observations at more than two points on each step. His results have interesting possibilities, especially if a very precise estimate of θ is needed. However, the K-W process seems to have the better small sample properties unless one has substantial a priori knowledge as to the location of θ.
J. H. Venter [12] has proposed that observations be taken at X_n - c_n, X_n, and X_n + c_n on each iteration and thereby estimate M''(θ). He noted that this estimate could be used to minimize the variance of n^{β/2}(X_n - θ) if a_n is of the form An^{-1}. Again, this is mainly an asymptotic result and will not be considered here.
The only small sample results relevant to the K-W process seem to be those of B. Springer's [11] simulation study of a process proposed by Wilde [13]. This process is a modification of the K-W process in which the sign of Y(X_n - c_n) - Y(X_n + c_n) is used rather than the difference. In this study he used a geometric sequence for {a_n}, which obviously does not satisfy the usual assumption that Σ a_n diverge. He then unjustifiably criticized the process as a whole for the bias that he observed.
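Wilde's modification is easy to state computationally: only the sign of the observed difference enters the correction, so the magnitude of each step is the deterministic a_n. The sketch below uses our own illustrative response and sequences (and a properly divergent Σ a_n, unlike the geometric sequence criticized above):

```python
import random

def sign_kw(theta=1.0, x1=4.0, n_steps=3000, sigma=0.3, seed=3):
    """Sign-based (Wilde-type) modification of recursion (1.1)."""
    rng = random.Random(seed)
    y = lambda u: -(u - theta) ** 2 + rng.gauss(0.0, sigma)
    x = x1
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n           # sum a_n diverges, as the theory requires
        c_n = n ** (-1.0 / 3)
        d = y(x - c_n) - y(x + c_n)
        # The direction comes from the data; the magnitude is a_n alone.
        x = x - a_n * (1 if d > 0 else -1)
    return x
```

A divergent {a_n} lets the iterates travel arbitrarily far, so no bias of the kind Springer observed is forced on the process.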
1.3 Summary of Results

The K-W process has been largely of theoretical interest, and the behavior of this process has been studied mainly through its asymptotic properties. With the exception of some remarks by a few writers, little knowledge of use to the consulting statistician can be gleaned from a review of the literature. It is the intent of this paper to fill this gap.
Chapter 2 will be devoted to a generalization of Burkholder's [2] theorem on the asymptotic distribution of a certain class of stochastic approximation procedures. These extensions will admit sequences {a_n} of the form O(n^{-α}) where 1/2 < α ≤ 1, admit the use of constrained random variables (to be defined later), and allow the experimenter to take an increasing number of observations at each step. The usefulness of these extensions will be discussed in the following chapters as they arise.
In the third chapter, a generalization of the K-W process will be proposed, and its asymptotic properties will be obtained from the results of Chapter 2. Approximate bounds will be given for the small sample variance of the process when M(x) is a quadratic function, and this result will be used to obtain approximate bounds in cases of more general functions. The effects of the response function, the value of α, the variance of Y(x), and the starting value (X_1) on the mean square error will be studied in some detail.
In the fourth chapter we shall develop the modification of the
K-W process proposed by Wilde [13].
Various asymptotic and small
sample properties will be established.
In addition, a method of
estimating the variance of the estimator will be developed.
In the final chapter two examples will be given to illustrate
some of the results in the third and fourth chapters.
CHAPTER 2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_0^T

2.1 Constrained Stochastic Approximation Procedures

In this chapter we shall extend Burkholder's [2] Theorem 3 on the asymptotic normality of a class of stochastic approximation procedures. This extension will improve the practicality and the flexibility of these procedures. The main discussion of the usefulness of this extension will be made in Chapter 3 and in Chapter 4, in which more concrete examples will be examined.
In the development that follows, F^{[r]}(·) will denote the distribution function of the mean of [r] ([r] denotes the greatest integer less than or equal to r) independent random variables with distribution function F(·). We shall denote by Y^T the random variable defined by

    Y^T = T if Y > T;  Y if -T ≤ Y ≤ T;  -T if Y < -T,

where Y is any random variable. The random variable Y^T will be called a constrained random variable, and if R(Y) is any function or functional of Y, then R^T(Y) will denote the corresponding function or functional of Y^T (e.g., if M(x) = E(Y(x)), then M^T(x) = E(Y^T(x))).
When it is possible to do so without confusion, the random variable sequence defined by

    Y_n^{T_n} = T_n if Y_n > T_n;  Y_n if -T_n ≤ Y_n ≤ T_n;  -T_n if Y_n < -T_n

will be denoted by Y^T. We shall say that a function sequence {D_n(x)} is continuously convergent to A at x = θ if for every ε > 0 there exist a δ(ε) > 0 and an N(ε) > 0 such that if |x - θ| < δ and n > N(ε), then |D_n(x) - A| < ε. The usual symbols o(·) and O(·) will be used to indicate the order of convergence.
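In computational terms Y^T is simply Y clamped to the interval [-T, T]; a one-line helper (the name `constrain` is ours, not the dissertation's):

```python
def constrain(y, t):
    """The constrained random variable Y^T: clamp an observation y to [-t, t]."""
    return max(-t, min(t, y))
```

For example, constrain(5.0, 2.0) gives 2.0 and constrain(-3.0, 2.0) gives -2.0.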
Definition (2.1): Let {a_n'} and {T_n} be sequences of positive numbers such that if T_n ≠ ∞ for every n greater than some n_0, then inf T_n > 0. Let Z_n(x) denote a random variable with distribution function G_n(·|x). Let X_1 be fixed, and define the random variable sequence {X_n} by

    X_{n+1} = X_n - a_n' Z_n^T(X_n),        (2.1)

where Z_n^T(x) is a random variable with conditional distribution G_n^T(·|x) given Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_{n-1}, and X_n = x. The stochastic approximation sequence {X_n} will be said to be of Type A_0^T.

This definition is, in fact, Burkholder's [2] definition of a Type A_0 process. However, the superscript T is added to emphasize the fact that, although the sequence {X_n} depends on constrained random variables, it is usually more reasonable to make assumptions on the unconstrained random variables.
In order to get an appreciation for this class of processes and
for some functions that will be defined later, we shall derive a few
well-known results for the K-W process.
That the K-W process is of Type A_0^T can be seen by letting a_n' = a_n c_n^{-1}, T_n ≡ ∞, and Z_n^T(X) = Y(X_n - c_n) - Y(X_n + c_n). Let R_n(x) = E(Z_n(x)). Then R_n(x) = M(x - c_n) - M(x + c_n), and, as was noted in the introduction, under suitable regularity assumptions there exists a sequence {μ_n} such that

    R_n(μ_n) = M(μ_n - c_n) - M(μ_n + c_n) = 0,

and (x - μ_n) R_n(x) > 0 for x ≠ μ_n. By considering the Taylor series expansion of R_n(x) about μ_n, we can show that

    R_n(x) = -2c_n(x - μ_n)M''(μ_n) + o(c_n(x - μ_n))

for x ≠ μ_n. Letting D_n(x) = R_n(x)/c_n(x - μ_n) for x ≠ μ_n, we find that the function sequence {D_n(x)} is continuously convergent to -2M''(θ) at x = θ since |μ_n - θ| → 0 as c_n → 0.

2.2 Constrained Random Variables
The following lemmas establish certain relationships between constrained and unconstrained random variables that are used in Theorem (2.1).

Lemma (2.1): Let Y be a random variable with finite variance σ². Then Var(Y^T) ≤ σ².

Proof: Without loss of generality assume that E(Y) = 0, and recall that T > 0. Define the random variable Y' by

    Y' = T if Y > T;  Y if Y ≤ T.

Then |Y'| ≤ |Y| pointwise, so that

    Var(Y') ≤ E(Y')² ≤ E(Y²) = σ².

Applying the same truncation from below to Y' yields Y^T with |Y^T| ≤ |Y'|. Therefore

    Var(Y^T) ≤ E(Y^T)² ≤ E(Y')² ≤ σ².
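Lemma (2.1) is easy to check by simulation: clamping draws from a finite-variance distribution cannot increase the variance. A quick Monte Carlo sketch with assumed standard-normal draws:

```python
import random

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(20000)]
T = 1.0
ys_T = [max(-T, min(T, y)) for y in ys]   # the constrained variable Y^T

v, v_T = sample_variance(ys), sample_variance(ys_T)
```

Here v_T comes out well below v, as the lemma guarantees for the population variances.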
Lemma (2.2): Let {T_n} and {r_n} be sequences satisfying Definition (2.1), and let {Z_n(x)} denote a sequence of random variables with distribution functions {F^{[r_n]}(·|x)} such that R(x) = ∫_{-∞}^{∞} z dF(z|x). Assume that there exists a finite number B and an even integer K ≥ 2 such that

    sup_x ∫_{-∞}^{∞} (z - R(x))^K dF(z|x) ≤ B.

Then for 0 < C < 1 there exist positive constants D_1, D_2, and D_3 such that

a. if |R(x)| ≥ CT_n, then

    R_n^T(x) sgn(R(x)) ≥ D_1 T_n - D_2 r_n^{-K/2} T_n^{-(K-1)};

b. and if |R(x)| < CT_n, then

    |R(x) - R_n^T(x)| ≤ D_3 r_n^{-K/2} T_n^{-(K-1)}.

Proof: Case (a): Without loss of generality, assume R(x) > 0. Choose C' such that 0 < C' < C. We have then

    E(Z_n^T(x)) = -T_n P(Z_n(x) < -T_n) + ∫_{-T_n}^{T_n} z dF^{[r_n]}(z|x) + T_n P(Z_n(x) > T_n)
                ≥ -T_n P(Z_n(x) < (C - C')T_n) + (C - C')T_n P(Z_n(x) ≥ (C - C')T_n)
                = (C - C')T_n - (1 + C - C')T_n P(Z_n(x) < (C - C')T_n).

But since R(x) ≥ CT_n we have

    P(Z_n(x) < (C - C')T_n) ≤ P(|Z_n(x) - R(x)| ≥ C'T_n)
                            ≤ (C'T_n)^{-K} E(Z_n(x) - R(x))^K
                            ≤ (C'T_n)^{-K} r_n^{-K/2} E(Z(x) - R(x))^K.

We have then

    E(Z_n^T(x)) ≥ (C - C')T_n - (1 + C - C')(C')^{-K} B T_n^{-(K-1)} r_n^{-K/2}.

The desired result is obtained by letting D_1 = (C - C') and D_2 = (1 + C - C')(C')^{-K} B.

Case (b): Again we can assume R(x) > 0. We have that

    |R(x) - R_n^T(x)| ≤ (|R(x)| + T_n) P([Z_n(x) < -T_n] ∪ [Z_n(x) > T_n])
                      + ∫_{-∞}^{-T_n} |z - R(x)| dF^{[r_n]}(z|x) + ∫_{T_n}^{∞} |z - R(x)| dF^{[r_n]}(z|x)
                      ≤ (|R(x)| + T_n) P(|Z_n(x) - R(x)| > T_n - R(x))
                      + |T_n + R(x)|^{-(K-1)} ∫_{-∞}^{∞} |z - R(x)|^K dF^{[r_n]}(z|x),

since |R(x)| < CT_n < T_n. Again using Chebyshev bounds, we see that

    |R(x) - R_n^T(x)| ≤ (1 + C)T_n ((1 - C)T_n)^{-K} B r_n^{-K/2} + ((1 - C)T_n)^{-(K-1)} r_n^{-K/2} B.

The desired result is obtained by letting D_3 = 2B(1 + C)(1 - C)^{-K}.

2.3 Asymptotic Distribution of a Type A_0^T Process
We shall now give an extension of Burkholder's [2] Theorem 3. It is helpful to recall the discussion on the K-W process preceding Lemma (2.1) and to bear in mind that the assumptions are based on the unconstrained random variables even though the theorem concerns a Type A_0^T process.

Before giving Theorem (2.1) we shall give a theorem by Burkholder [2] and a theorem by Fabian [6] which play integral parts in the proof of Theorem (2.1).
Theorem (Burkholder [2]): Suppose {X_n^*} is a stochastic approximation process of Type A_0 and θ is a real number such that

(1) there is a function Q from the positive real numbers into the integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n^*(x) > 0;

(2) sup_{n,x} (|R_n^*(x)|/(1 + |x|)) < ∞;

(3) sup_{n,x} Var(Z_n^*(x)) < ∞;

(4) if 0 < δ_1 < δ_2 < ∞, then Σ a_n^* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n^*(x)| ) = ∞;

(5) Σ (a_n^*)² < ∞;

(6) if n is a positive integer, then R_n^*(x) and Var(Z_n^*(x)) are Borel measurable.

Then P(lim X_n = θ) = 1.
It will be useful to note that condition (1) can be replaced by the condition that

(1') there is a function Q from the positive real numbers into the positive integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ)R_n^*(x) ≥ ε_n(x) + δ_n(x), where ε_n(x) ≥ 0 and Σ a_n^* sup_x |δ_n(x)| < ∞.
The following theorem is a one-dimensional version of Fabian's [6] theorem on asymptotic normality. In the theorem let R denote the real numbers. Let P denote the set of all measurable transformations from the measurable space (Ω, S) to R. Let χ[A] denote the indicator function of the set A, and, unless indicated otherwise, all convergence notions will mean convergence almost everywhere.
Theorem (Fabian): Suppose K is a positive integer, {S_n} is a non-decreasing sequence of σ-fields with S_n ⊂ S. Suppose U_n, V_n, T_n, Γ_n, Φ_n ∈ P; α, β, Γ, Φ, T, σ² ∈ R; and Γ > 0. Suppose Γ_n, Φ_{n-1}, and V_{n-1} are S_n-measurable, and that

    Γ_n → Γ,  Φ_n → Φ,  and either T_n → T or E|T_n - T| → 0,        (2.2)

    E(V_n | S_n) = 0  and  C > |E(V_n² | S_n) - σ²| → 0,        (2.3)

and, with σ_{j,r}² = E(χ[V_j² ≥ rj^α] V_j²), let either

    lim_j σ_{j,r}² = 0 for every r > 0 if α < 1, or
    lim_n n^{-1} Σ_{j=1}^{n} σ_{j,r}² = 0 for every r > 0 if α = 1.        (2.4)

Suppose that

    0 < α ≤ 1,  0 ≤ β,  β_+ = β if α = 1, β_+ = 0 if α < 1,  β_+ < 2Γ,        (2.5)

and

    U_{n+1} = (1 - n^{-α} Γ_n) U_n + n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.        (2.6)

Then the asymptotic distribution of n^{β/2} U_n is normal with mean (Γ - β_+/2)^{-1} T and variance σ² Φ² / (2Γ - β_+).
Theorem (2.1): Let {X_n} be a stochastic approximation process of Type A_0^T with EZ_n(x) = R_n(x) and G_n(·|x) = G^{[n^ρ]}(·|x). Suppose that {μ_n} is a real number sequence, {c_n} is a positive number sequence, θ is a real number, ρ, w, and ξ are non-negative numbers, and each of n_0, A, C, σ², v, α, and λ is a positive number such that

i. {R_n(x)} is a function sequence such that if n > n_0, then R_n(μ_n) = 0, and if x ≠ μ_n, then (x - μ_n) R_n(x) > 0;

ii. sup_{n,x} n^w |R_n(x)| / (1 + |x|) < ∞;

iii. if 0 < δ_1 < δ_2 < ∞, then Σ a_n' ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) = ∞, and if T_n ≠ ∞ for all n > n_0, then lim_n ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) exists;

iv. {D_n(x)} is a function sequence defined by D_n(x) = R_n(x)/c_n(x - μ_n) if x ≠ μ_n (the definition of D_n(μ_n) is arbitrary except for the restriction that D_n(μ_n) → A), and is continuously convergent to A at x = θ;

v. sup_x Var(Z(x)) < ∞, Var(Z(x)) is continuous and converges to σ² at x = θ, there exists a finite open interval I, containing θ, such that for n > n_0,

    lim_{j→∞} E{χ[(Z(x) - R(x))² ≥ ηj^α] (Z(x) - R(x))²} = 0

uniformly for x ∈ I and every η > 0, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Z(x) - R(x)|^r) < ∞ for r an even integer ≥ 2;

vi. G(z|·) is Borel measurable for all z;

vii. (a) 1/2 < α ≤ 1, |μ_n - θ| = O(n^{-v}), n^{α-w} a_n' → λ/C, (rρ/2 + (r-1)ξ) > w + (1 - α), and β/2 = α/2 + ρ/2 - w ≤ v;

(b) if ρ/2 < w, then ρ > 1 - 2(α - w);

(c) if T_n ≠ ∞ for every n > n_0, then β/2 < (rρ/2 + (r-1)ξ - w);

(d) and β_+ = 0 if α < 1, = β < 2Aλ if α = 1.

Then X_n → θ a.e., and n^{β/2}(X_n - θ) is asymptotically normal (0, σ²λ² / C²(2Aλ - β_+)).

Proof:
First we shall use Burkholder's theorem to show X_n → θ a.e. by considering two cases.

Case I: ρ/2 ≤ w.

Burkholder has shown that condition (i) implies that, for every ε > 0, there exists a Q(ε) such that if |x - θ| > ε, then (x - θ)R_n(x) > 0 for every n > Q(ε). Thus if T_n = ∞,

    (x - θ)R_n^*(x) = (x - θ) n^{ρ/2} R_n(x) > 0.

If T_n ≠ ∞ for every n > n_0, then by Lemma (2.2) (with [r_n] = [n^ρ]) the error of constraint is O(n^{-rρ/2 - (r-1)ξ} max{D_2, D_3}), so that condition (1') holds.

To verify condition (4), denote

    lim_n ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| )

by L. This limit exists and is finite by (ii) and (iii). If L > 0, Lemma (2.2) and (vii. (a)) give the divergence of the series for 0 < D_4 < min{L, D_1}. Since a_n' = O(n^{-(α-w)}), w ≥ 0, and α ≤ 1, again the series diverges. If L = 0, then the bound of part (b) of Lemma (2.2) applies for some 0 < D < 1 and sufficiently large n, since lim inf n^{-ξ}T_n > 0. Thus, by (b) of Lemma (2.2) and the fact that (rρ/2 + (r-1)ξ) > w + (1 - α) ≥ 0, the divergence of the series follows from the divergence for the case T_n = ∞. Condition (4) therefore holds in all cases.

Condition (5) follows by (vii. (b)).

Finally, Burkholder [2] has shown that if G(z|·) is Borel measurable, then G^{[n^ρ]}(z|·) is Borel measurable, and if sup_x Var(Z(x)) < ∞, then the measurability of R(x) and Z(x) is implied by that of G(z|·). Since the measurability of G^{[n^ρ]}(z|·) clearly implies the same for the distribution function of n^{ρ/2} Z_n^T(x), the measurability of R_n^*(x) and Var Z_n^*(x) is implied by the fact that sup_{x,n} Var Z_n^*(x) < ∞. Thus condition (6) is satisfied. Therefore, X_n → θ a.e.

Case II: ρ/2 > w.

Let a_n^* = a_n' c_n and Z_n^*(x) = c_n^{-1} Z_n^T(x). Conditions (1), (2), (3), (4), and (6) are established as in Case I. Condition (5) follows by (vii. (a)). Thus, in either case, X_n → θ a.e.
We shall now establish the asymptotic normality of n^{β/2}(X_n - θ) using Fabian's theorem. From the definition of {X_n} we have

    X_{n+1} - θ = X_n - θ - a_n' R_n^T(X_n) - a_n'(Z_n^T(X_n) - R_n^T(X_n))
                = X_n - θ - a_n' R_n(X_n) - a_n'((R_n^T(X_n) - R_n(X_n)) + (Z_n^T(X_n) - R_n^T(X_n))).

But by conditions (i) and (iv),

    R_n(x) = (x - μ_n) c_n D_n(x) = c_n(θ - μ_n)D_n(x) + c_n(x - θ)D_n(x),

and so

    X_{n+1} - θ = (1 - a_n' c_n D_n(X_n))(X_n - θ) - a_n'(Z_n^T(X_n) - R_n^T(X_n))
                - a_n'((R_n^T(X_n) - R_n(X_n)) + c_n(θ - μ_n)D_n(X_n)).        (2.7)

Make the correspondence with Fabian's theorem as follows:

    U_n = X_n - θ,  Γ_n = n^α a_n' c_n D_n(X_n),
    Φ_n = n^{(α+β)/2} [n^ρ]^{-1/2} a_n',  V_n = -[n^ρ]^{1/2} (Z_n^T(X_n) - R_n^T(X_n)),

and

    T_n = T_n(X_n) = -n^{α+β/2} a_n'((R_n^T(X_n) - R_n(X_n)) + c_n D_n(X_n)(θ - μ_n)).

(Note the possible confusion of T_n(X_n) with the values at which Z_n^T(X) is constrained.)

Let S_n be the σ-field generated by X_1, Z_1^T, ..., X_n. Clearly this is the same σ-field generated by Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_{n-1}. Moreover, S_n is increasing, and Φ_{n-1} and V_{n-1} are measurable with respect to S_n. Since R_n^T(X_n) = E(Z_n^T(X_n)|X_n), R_n(X_n) is measurable with respect to S_n, implying the same for D_n(X_n) and therefore Γ_n.

To establish (2.2) we recall that X_n → θ a.e. Since D_n(x) is continuously convergent to A at x = θ, it follows that D_n(X_n) → A a.e., and therefore R_n(X_n) → 0 a.e. Thus

    Γ_n → Aλ a.e.

by conditions (iv) and (vii. (a)). Again by (vii. (a)),

    Φ_n → λ/C

since β/2 = α/2 + ρ/2 - w. Finally, since a_n' = O(n^{-(α-w)}), the first term of T_n(X_n) tends to zero a.e.; that

    n^{β/2}(θ - μ_n)D_n(X_n) → 0 a.e.

follows from the fact that |θ - μ_n| = O(n^{-v}) and β/2 ≤ v. Thus T_n → 0 a.e.

We have

    E(V_n | S_n) = E(E(V_n | X_n) | S_n) = E(0 | S_n) = 0.

If T_n = ∞, then there exists a C > 0 such that

    C > |E(V_n² | S_n) - σ²| → 0 a.e.

by (v). If T_n ≠ ∞ for every n > n_0, we have from (vii. (a)) that rρ/2 + (r-1)ξ > w + 1 - α ≥ 0, and thus |E(V_n² | S_n) - σ²| → 0 a.e. in either case. This establishes (2.3).

To show that lim_j σ_{j,t}² = 0 for every t > 0, let I be defined as in (v). Since E(V_n) = 0,

    σ_{j,t}² ≤ sup_x E(V_j²) ≤ sup_x Var Z(x) < ∞

by Lemma (2.1) and condition (v). Letting D denote this last upper bound, we obtain

    σ_{j,t}² = Eχ{[V_j² ≥ tj^α] ∩ [X_j ∈ I]}V_j² + Eχ{[V_j² ≥ tj^α] ∩ [X_j ∉ I]}V_j²
             ≤ Eχ{[V_j² ≥ tj^α] ∩ [X_j ∈ I]}V_j² + D P(X_j ∉ I).

The first term goes to zero by (v), and the last term goes to zero since P(X_n ∈ I) → 1. Thus condition (2.4) is satisfied with t = r.

Finally, by noting that Γ = Aλ and β < 2Aλ if α = 1, it follows that n^{β/2}(X_n - θ) is asymptotically normal (0, σ²λ² / C²(2Aλ - β_+)).
Although Theorem (2.1) does not extend the conclusions of Fabian's theorem to a larger class of stochastic approximation procedures, the assumptions of Theorem (2.1) are more related to the conditions of an experiment in which the use of these procedures may be attractive. Theorem (2.1) also has the added attraction of removing the implied assumption that X_n → θ a.e. and including it in the conclusion of the theorem.

Theorem (2.1) extends Burkholder's [2] Theorem 3 in three directions. Firstly, it increases the possible values for α from the single value 1 to the interval (1/2, 1]. Since the constant a_n (= An^{-α}) can be thought of as the weight assigned to the n'th observation, this wider range of values for α gives the experimenter more flexibility in weighting the observations. This greater flexibility often enables the experimenter to decrease the mean squared error during the early stages of the search for θ. Secondly, it allows the use of constrained random variables. An example in Chapter 3 points out when such constraining can improve the behavior of the K-W process. And, thirdly, it allows an increasing number of observations to be taken on each step. Again an example in Chapter 3 describes situations in which, by taking more observations on each step, we can obtain the same variance as by taking the same number of observations using more steps. The practical importance of decreasing the number of times the experimenter must obtain observations is quite clear.
Due to the generality of Theorem (2.1), it is difficult to comment on the practical consequences of each assumption, with the exception of assumptions (v) and (vi). It is sufficient to guarantee assumption (v) that our observations be bounded. Since this can almost always be assumed in practice, assumption (v) is not practically restrictive for any finite r. Thus if either ξ or ρ is greater than zero, the assumptions that

    rρ/2 + (r - 1)ξ > w + (1 - α)

and

    β/2 < (rρ/2 + (r - 1)ξ - w)

are almost always satisfied. Assumption (vi) seems never to be questioned in practice.
Theorem (2.1) expresses the order of the variance in terms of the number of steps rather than the number of observations. Let N(n) be the total number of observations necessary to obtain X_n. Then

    N(n) = Σ_{m=1}^{n-1} m^ρ = O(n^{ρ+1}).

If γ is such that (N(n))^γ = n^{β/2}, then

    (ρ + 1)γ = α/2 + ρ/2 - w,

or

    γ = (α/2 + ρ/2 - w)/(ρ + 1).

We shall discuss the value of γ for the K-W process in the next chapter, and shall only note here that w = 0 for the Robbins-Monro [9] process, giving γ = (α + ρ)/2(ρ + 1).
·e
CHAPTER 3.
3.1
THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS
Definition of the Constrained Kiefer-Wolfowitz Process
The Kiefer-Wolfowitz process is not generally accepted as a practical procedure for seeking a maximum in experimental work. The basic reasons for this are: its sequential nature, insufficient knowledge of its small sample properties, lack of a good estimate of its variance, and an unfamiliarity of experimental statisticians with its attributes.
We shall attempt to relieve these deficiencies somewhat. Following the definition of the Constrained Kiefer-Wolfowitz process, we shall obtain, under certain mild restrictions, its asymptotic distribution as a corollary to Theorem (2.1). Various examples will be given to study the effect of the parameters of the process and to illustrate its small sample behavior. Finally, bounds will be obtained for expressions related to the small sample variance, and a method of estimating this variance will be discussed.
Definition (3.1): Let {a_n}, {c_n}, {r_n}, and {T_n} denote sequences of positive numbers such that [r_n] ≥ 1 and such that, if T_n ≠ ∞ for all n greater than some n_0, then inf T_n > 0. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^{T_n},    (3.1)

where the superscript T_n denotes that the difference is constrained to lie in [-T_n, T_n], and where Y(X_n - c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X_1 - c_1), ..., Y(X_{n-1} + c_{n-1}), X_1, ..., X_{n-1}, X_n = x_n, are distributed independently as F^{[r_n]}(y|x_n - c_n) and F^{[r_n]}(y|x_n + c_n).

This will be called the Constrained Kiefer-Wolfowitz process, and it will be referred to as the CKW process.
The CKW process is a generalization of the K-W process. The CKW process bounds the difference (Y(X_n - c_n) - Y(X_n + c_n)) by ±T_n, and it allows the experimenter to vary the number of observations on each step. It clearly reduces to the K-W process when r_n ≡ r ≥ 1 and T_n ≡ ∞.

3.2 Asymptotic Properties of the CKW Process
We shall now examine certain asymptotic properties of the CKW process through the following corollary to Theorem (2.1). In reviewing the corollary, it will be helpful to bear in mind that the assumptions are based on the unmodified random variables.
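To make Definition (3.1) concrete, the following is a minimal simulation sketch of recursion (3.1). The quadratic response M(x) = -x²/2, the Gaussian noise, and all parameter values are illustrative assumptions, not prescriptions from the text:

```python
# Sketch of the Constrained Kiefer-Wolfowitz (CKW) recursion (3.1).
# Response model and parameter values are illustrative assumptions.
import random

def ckw(x1, steps, A=1.0, C=1.0, alpha=1.0, w=0.25, a_exp=0.0, T=5.0, sigma=1.0):
    """Run the CKW process and return the final estimate of theta."""
    M = lambda x: -0.5 * x * x          # response function with maximum at theta = 0
    x = x1
    for n in range(1, steps + 1):
        a_n = A * n**(-alpha)           # step sizes a_n = A n^(-alpha)
        c_n = C * n**(-w)               # spacings  c_n = C n^(-w)
        r_n = max(1, int(n**(2 * a_exp * w)))  # [r_n] observations per point
        # Means of [r_n] observations at x - c_n and x + c_n.
        y_minus = M(x - c_n) + random.gauss(0, sigma) / r_n**0.5
        y_plus = M(x + c_n) + random.gauss(0, sigma) / r_n**0.5
        diff = y_minus - y_plus
        diff = max(-T, min(T, diff))    # constrain the difference to [-T, T]
        x = x - a_n * diff / c_n        # recursion (3.1)
    return x

random.seed(1)
print(ckw(x1=3.0, steps=2000))   # should land near theta = 0
```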
Corollary (3.1): Let {X_n} be a CKW process with r_n = n^{2aw}. Let θ be a real number, let a and b be non-negative numbers, and let each of n_0, A, C, σ², α, w, and Λ be positive numbers such that

a. M'(x) is continuous, (x - θ)M'(x) > 0 if x ≠ θ, and sup_x (|M'(x)|/(1 + |x|)) < ∞;

b. the third derivative of M(x) exists and is continuous for x in an open interval (I) containing θ, and M''(θ) = -Λ/2;

c. Var Y(x) = σ²(x) is bounded and continuous, there exists an open interval (I) containing θ such that for any t > 0

    lim_{j→∞} E{χ[(Y(x) - M(x))² > t j^a] (Y(x) - M(x))²} = 0

uniformly for x ∈ I, and, if T_n = +∞ for every n > n_0, then sup_x E(|Y(x) - M(x)|^r) < ∞ for r an even integer ≥ 2;

d. F(y|·) is Borel measurable for all y;

e. (1) 1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, lim inf n^{-bw} r_n > 0, (ra + (r - 1)b) > 1 + (1 - α)/w, and β/2 = α/2 + (a - 1)w < 2w;
   (2) if α < 1, then a > 1 - (2α - 1)/2w;
   (3) if T_n = +∞ for every n > n_0, then β/2 < (ra + (r - 1)b - 1)w;
   (4) and β₊ = 0 if α < 1, = β < 2ΛA if α = 1.

Then X_n → θ almost everywhere, and n^{β/2}(X_n - θ) is asymptotically normal (0, 2A²σ²/(C²(2ΛA - β₊))).
The basic ideas of Corollary (3.1) have already been noted by
Burkholder [2].
However, since Corollary (3.1) is an extension of
his work and since it is instructive to do so, we shall give a proof
of this corollary.
Proof: Make the correspondence with Theorem (2.1) as follows:

    a_n' = a_n c_n^{-1},  Z_n(x) = (Y(x - c_n) - Y(x + c_n)),  and  R_n(x) = M(x - c_n) - M(x + c_n).

Condition (a) implies that M(x) is continuous, and that M(x) is strictly increasing or decreasing as x is less than or greater than θ, respectively.
Thus, there exists a sequence {μ_n} such that (x - μ_n)R_n(x) > 0 for x ≠ μ_n.

From conditions (a) and (e.(1)) we obtain that

    sup_{n,x} n^w |R_n(x)|/(1 + |x|) = sup_{n,x} n^w (c_n |O(M'(x))|/(1 + |x|)) < ∞.

This is condition (ii).

To verify condition (iii), we note that by (a) and (e.(1))

    Σ_n a_n' inf_{δ₁ ≤ |x-θ| ≤ δ₂} |R_n(x)| = Σ_n a_n c_n^{-1} c_n inf_{δ₁ ≤ |x-θ| ≤ δ₂} |O(M'(x))| = Σ_n O(a_n) = ∞.

Moreover, lim inf_{δ₁ < |x-θ| < δ₂} |R_n(x)| = lim c_n O(1) = 0 by the above argument and the fact that c_n → 0.
The function sequence {D_n(x)} in condition (iv) is defined by

    D_n(x) = c_n^{-1} (x - μ_n)^{-1} (M(x - c_n) - M(x + c_n))  if x ≠ μ_n,

using condition (b). For x ∈ I we have, since R_n(μ_n) = 0,

    M(x - c_n) - M(x + c_n) = (x - μ_n)(M'(ξ_n - c_n) - M'(ξ_n + c_n)) = -2 c_n (x - μ_n) M''(η_n) + O(c_n²(x - μ_n)),

where |ξ_n - x| ≤ |μ_n - x|, η_n lies between ξ_n - c_n and ξ_n + c_n, and |μ_n - θ| < c_n for x ∈ I. Hence D_n(x) = -2M''(η_n) + O(c_n). Since the third derivative exists and M''(θ) = -Λ/2, so that D_n(x) → Λ, it follows that for every ε > 0 there exists a δ > 0 and an N such that, for |x - θ| < δ and n > N, |D_n(x) - Λ| < ε. Thus condition (iv) holds.

Condition (v) is a direct consequence of (c) with Var Z(x) = 2σ².
Finally, Burkholder has noted that condition (d) implies (vi). He has also shown that the existence of a continuous third derivative in an open interval containing θ implies that |μ_n - θ| = O(c_n²) = O(n^{-2w}). Letting ν = 2w, ρ = 2aw, and ζ = bw, condition (vii) can be verified using simple arithmetic. Thus the corollary is proven.
Except in a few pathological cases, an experimenter can assume that the derivatives of M(x) exist, that the r'th central moment of Y(x) exists and is continuous, and that F(y|x) is a continuous function of x for every y. Moreover, it is a rare case when he cannot specify a finite interval which should contain θ. By restricting X_n to this interval, all the assumptions of Corollary (3.1) hold for r arbitrarily large. If the above assumptions are satisfied, condition (e) is equivalent to (e'):

    1/2 < α ≤ 1, n^α a_n → A, n^w c_n → C, a + b > 0,
    β/2 = α/2 + (a - 1)w < 2w, and
    β₊ = 0 if α < 1, = β < 2ΛA if α = 1.
Thus, except for assumption (e'), the assumptions of Corollary (3.1) should not be looked upon as restrictions, but rather as factors influencing the behavior of the process.

Since in experimental work it can usually be assumed that the supremum over x of any central moment of Y(x) is finite (at least over an interval known to contain θ), the assumptions that

    (ra + (r - 1)b) > 1 + (1 - α)/w

and

    α/2 + (a - 1)w < (ra + (r - 1)b - 1)w

are generally not restrictive. Practically speaking then, the constant a must satisfy

    α/2 + (a - 1)w < 2w    (3.2)

and

    1 - (2α - 1)/2w < a,

or, equivalently,

    1 - (2α - 1)/2w < a < 3 - α/2w.    (3.3)
As in Chapter 2, let N(n) denote the number of observations required to obtain X_n, and let γ be defined by (N(n))^γ = n^{β/2}. By (2.8),

    γ = (α/2 + (a - 1)w)/(2aw + 1).    (3.4)

Thus, if a = 0 (i.e., the same number of observations are taken on each step), γ = α/2 - w. Since (3.2) implies

    w > α/(2(3 - a)),    (3.5)

we obtain, for a = 0, γ = β/2 = α/2 - w < α/2 - α/6 = α/3.

Suppose a > 0. We have

    (2aw + 1)² dγ/dw = (2aw + 1)(a - 1) - (α/2 + (a - 1)w)2a = a(1 - α) - 1.    (3.6)
It follows that dγ/dw is greater than, equal to, or less than zero as a is greater than, equal to, or less than (1 - α)^{-1}.

Case I: a < (1 - α)^{-1}. In this case γ is maximized by minimizing w. Thus by (3.4) and (3.5)

    γ < (α/2 + α(a - 1)/(2(3 - a)))/(aα/(3 - a) + 1) = α/(3 - a(1 - α)) ≤ 1/3.    (3.7)

The equality sign holds if either a = 0 or α = 1.

Case II: a ≥ (1 - α)^{-1}. By (3.3), (1 - α)^{-1} ≤ 3 - α/2w. Since 1/2 < α ≤ 1, this inequality is not satisfied if either w ≤ 1/4 or α > 2/3. Assuming the inequality is satisfied, substitution of a = (1 - α)^{-1} into (3.4) gives

    γ = (α/2 + αw/(1 - α))/(2w/(1 - α) + 1) = α/2 ≤ 1/3.

If a > (1 - α)^{-1}, γ is maximized by maximizing w. Taking the limit of (3.4) as w → ∞, we obtain

    γ = (a - 1)/2a.

By (3.3), a < 3, and so γ < 1/3. Thus in all cases γ < 1/3. We might add that a and w can be adjusted so that γ is arbitrarily close to the bound obtained in each case.
We shall see that the bias is minimized by increasing the number of steps and usually not affected by increasing the number of observations taken on each step. The restriction that α > 1/2 implies for Case II that a > 2, and so this case is usually not of practical significance.

It follows from (3.7) that, for Case I, the order of the variance increases as either a decreases or α increases. Moreover, the effect of a change in a becomes less pronounced as α increases, and it vanishes at α = 1. Thus, if the number of steps is sufficient to discount a serious bias, the experimenter loses little efficiency in a statistical sense by increasing moderately the number of observations on each step and decreasing the number of steps when α is close to 1. However, by taking observations at fewer levels of the controlled variable (X), the experimenter decreases the time required for the experiment and possibly the cost.
3.3 Effect of Constraining the K-W Process
It is difficult to study the advantages and disadvantages of constraining the random variables via a strict mathematical treatment. Rather, using an intuitive approach, we shall study an example which illustrates when we may improve the K-W process by constraining the random variables.

Let Z ~ N(μ, σ²), and let μ^T denote the mean of Z constrained to [-T, T]. It is clear by Lemma 2.1 and the symmetry of the normal distribution that

    1. sgn(μ^T) = sgn(μ);
    2. μ^T ≈ μ if |μ| << T;
    3. |μ^T| < |μ| if |μ| > T;    (3.9)
    4. Var Z^T ≈ σ² if |μ| << T and σ << T;
    5. and Var Z^T < σ² if σ > T.

Partition the x-axis into the intervals I_1 = (-∞, θ - k), I_2 = [θ - k, θ + k], and I_3 = (θ + k, ∞). Assume that M'(a) << |M'(b)| for a ∈ I_1 and b ∈ I_3. Assume that T_n ≡ T, c_n ≡ C, and σ are such that T < σ and |M(x - C) - M(x + C)| << T for x ∈ I_1. (These assumptions on M(x) are often realized in various threshold phenomena.) Let X_n and X_n^T denote the K-W process and the CKW process respectively, and assume that X_1 = X_1^T = x.
To compare the processes we can use the observations in (3.9) with Z = Y(x - C) - Y(x + C). Using (5) and the fact that T < σ, we obtain

    Var (Y(x - C) - Y(x + C))^T < Var (Y(x - C) - Y(x + C)).

It follows that, at least for a few steps, the K-W process is more variable about its expected value. In addition to this, as long as both processes remain in the same interval, we can visualize the following cases.
Case I: x ∈ I_1. Since |M(x - C) - M(x + C)| << T for x ∈ I_1, (2) of (3.9) implies the processes approach θ at the same expected rate. The fact that M'(a) << M'(b) for a ∈ I_1 and b ∈ I_3 implies that this expected rate is much less than the corresponding rate for either process approaching θ from I_3. That is, the probability that either process moves from I_1 to I_2 in a few steps is relatively small. It is possibly a little larger for the K-W process due to its more erratic nature.
Case II: x ∈ I_2. While in this region, the behavior of both processes is characterized by random movement since M(x - C) - M(x + C) ≈ 0 << T. This movement is more erratic for the K-W process if T < 2σ. Moreover, if by chance either process leaves this region, the movement away from θ should most likely be greater for the K-W process due to its sensitivity to observations from the tails of the distribution. This difference increases with an increase in σ or a decrease in T.
Case III: x ∈ I_3. It follows from (3) of (3.9) that the expected rate of approach toward θ is faster for the K-W process. This difference increases with a decrease in T. However, the expected rate of approach toward θ is fast for either process, and so the probability that either process moves from I_3 to I_2 in a few steps is relatively large.
Since both processes tend to move out of the region I_3 more quickly than out of region I_1, we can expect both processes to be biased after a few steps given that X_1 is in a neighborhood of θ. The CKW is less biased for two reasons: Firstly, it tends to move from I_3 into I_2 more slowly, and it tends to move from I_1 into I_2 at about the same rate. Secondly, but probably more importantly, on leaving I_2 the K-W process tends to be thrown farther into regions I_1 and I_3. Due to the usually quick recovery from I_3 and the usually slow recovery from I_1, this can greatly increase the difference in the bias of the estimates.
If X_1 ∈ I_1, neither process seems to be the better, and although the K-W process seems better if X_1 ∈ I_3, the difference quickly disappears since both processes approach θ at a rapid rate. Thus the CKW process seems to be preferable for this example.

The essential feature of this example is the existence of two regions, both excluding θ, that are differentiated by the magnitude of M'(x). Other examples with this feature can be constructed, but we note only that regression functions of the form e^{-x²} have this feature. The advantage of the CKW process is that of lessening the probability of a deep penetration into the region for which |M'(x)| is small.
The optimal choice of T is largely determined by σ² and the regions I_2 and I_3. Clearly, if 4σ < T, where σ² = Var Y(x), constraining is of little consequence in I_2, and if it is of consequence in I_3, it is a negative factor. Thus, we must choose T so as to decrease the variability in I_2 without sacrificing too much of the information that I_3 contains with respect to the direction of θ. As either more observations are taken at each step, or σ² is decreased, this becomes more difficult to do.
Before leaving this subject it should be observed that as a_n → 0, the influence of each iteration goes to zero. This implies that both processes become characterized by their expected rate of approach toward θ. In any small neighborhood of θ, toward which both processes tend, this is very similar.
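Observation (5) of (3.9), that constraining at ±T reduces the variance when σ > T, is easy to check by Monte Carlo; the distribution and values below are illustrative assumptions:

```python
# Monte Carlo sketch of observation (5) in (3.9): constraining Z ~ N(mu, sigma^2)
# to [-T, T] reduces its variance when sigma > T.
import random

def truncated_var(mu, sigma, T, n=200_000):
    """Sample variance of Z constrained (clipped) to [-T, T]."""
    random.seed(0)
    zs = [max(-T, min(T, random.gauss(mu, sigma))) for _ in range(n)]
    m = sum(zs) / n
    return sum((z - m) ** 2 for z in zs) / n

sigma, T = 2.0, 1.0          # sigma > T, so constraining should shrink the variance
v = truncated_var(mu=0.0, sigma=sigma, T=T)
print(v, sigma**2)           # v is well below sigma^2 = 4
```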
3.4 Expected Small Sample Behavior of the K-W Process
In this section we shall discuss the effects that the parameters α and w, the starting value (X_1), and the shape of the response curve have on the behavior of the K-W process. Although the CKW process will not be discussed, it will be seen that most of the results in this section apply to it as well.

Let I denote an interval containing θ, and assume that |M'(x)| = B for x ∈ I^c. Assume that the distribution function of Y(x) - M(x) is independent of x. Let

    Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) - (M(X_n - c_n) - M(X_n + c_n)),

and let Var Z(X_n) = 2σ². We have by (3.1) that

    X_{n+1} - θ = X_n - θ - a_n c_n^{-1}(M(X_n - c_n) - M(X_n + c_n)) - a_n c_n^{-1} Z(X_n).    (3.10)

If X_k >> θ, the process will remain in I^c for a few iterations with a high probability. Thus if n - k is not too large, (3.10) becomes

    X_{n+1} - θ = (X_n - θ) - 2B a_n - a_n c_n^{-1} Z(X_n).

By induction,

    X_{n+1} - θ = (X_k - θ) - 2B Σ_{m=k}^{n} a_m - Σ_{m=k}^{n} a_m c_m^{-1} Z(X_m),

where, conditional on X_k >> θ, Z(X_m), m = k, ..., n, are independent random variables. It follows that

    E(X_{n+1} - θ | X_k >> θ) = (X_k - θ) - 2B Σ_{m=k}^{n} a_m,    (3.11)

and

    Var(X_{n+1} - θ | X_k >> θ) = 2σ² Σ_{m=k}^{n} a_m² c_m^{-2}.    (3.12)
Let a_n = An^{-α} and c_n = Cn^{-w}. It is a well known result that

    Σ_{m=k}^{n} m^{-p} ≈ ∫_k^{n+1} x^{-p} dx    (3.13)

for p > 0. Substituting the expressions for a_n and c_n into (3.11) and (3.12) and applying (3.13), we obtain

    E(X_{n+1} - θ | X_k >> θ) ≈ X_k - θ - 2BA(ln(n+1) - ln k),  α = 1,

and

    Var(X_{n+1} - θ | X_k >> θ) ≈ A²C^{-2}σ²(α - w - 1/2)^{-1}(k^{1-2(α-w)} - (n+1)^{1-2(α-w)}).

The above expressions indicate that if we are approaching θ along a constant slope, our expected rate of approach is proportional to the slope and increases very rapidly as α decreases. Furthermore, the variance of our rate of approach decreases as either α increases or w decreases.

Thus if we expect to approach θ in a region in which M'(x) is fairly constant, we should generally decrease both α and w. It may be wise to let c_n and a_n be constant until the sequence {X_n} no longer shows a trend in a particular direction. This lack of a trend indicates we may be in a neighborhood of θ.
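The constant-slope approximations (3.11) and (3.12) can be checked by Monte Carlo. The slope model M'(x) ≡ -B, the Gaussian noise, and all parameter values below are illustrative assumptions:

```python
# Monte Carlo sketch of the constant-slope approximations (3.11)-(3.12).
# With a constant slope -B, the recursion is exactly linear, so the empirical
# mean and variance should match (3.11) and (3.12) up to Monte Carlo error.
import random

B, sigma, A, C, alpha, w = 0.5, 1.0, 1.0, 1.0, 1.0, 0.25
k, n, reps = 5, 40, 20000
a = {m: A * m**(-alpha) for m in range(k, n + 1)}
c = {m: C * m**(-w) for m in range(k, n + 1)}

finals = []
random.seed(0)
for _ in range(reps):
    x = 10.0                               # X_k, far to the right of theta
    for m in range(k, n + 1):
        # Y(x - c) - Y(x + c) has mean 2*B*c and variance 2*sigma^2 here.
        diff = 2 * B * c[m] + random.gauss(0, sigma * 2**0.5)
        x -= a[m] / c[m] * diff            # unconstrained K-W step (3.1)
    finals.append(x)

mean = sum(finals) / reps
var = sum((f - mean) ** 2 for f in finals) / reps
pred_mean = 10.0 - 2 * B * sum(a.values())                     # (3.11)
pred_var = 2 * sigma**2 * sum(a[m]**2 / c[m]**2 for m in a)    # (3.12)
print(mean, pred_mean)
print(var, pred_var)
```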
In this example we are trying to decrease the bias for a given number of observations. It is clear from (3.11) that the bias is not decreased by increasing the number of observations on each step. Therefore, unless it is practically restrictive, it is best to take a minimum number of observations on each step until we are in a neighborhood of θ.

The previous example reveals the behavior of the K-W process for a few iterations while in a region in which M'(x) is fairly constant. However, the assumptions of this example are certainly violated in a small neighborhood of θ, or in a region in which M'(x) cannot be approximated by a constant.
A more reasonable approximation, at least in a neighborhood of θ, is to assume that M(x) is a quadratic function of x. That is, assume

    M(x) = -(B/2)(x - θ)²    (3.14)

for B > 0. Let a_n = An^{-α} and c_n = Cn^{-w}. Let

    Z(X_n) = Y(X_n - c_n) - Y(X_n + c_n) - (M(X_n - c_n) - M(X_n + c_n)),

and assume that the distribution function of Z(X_n), conditional on X_n = x, is independent of x. Let Var Z(X_n) = 2σ². Using (3.10) we have

    X_{n+1} - θ = X_n - θ - An^{-α} c_n^{-1} (B/2)((X_n + c_n - θ)² - (X_n - c_n - θ)²) - AC^{-1} n^{-(α-w)} Z(X_n)
                = (1 - 2ABn^{-α})(X_n - θ) - AC^{-1} n^{-(α-w)} Z(X_n).    (3.15)
Since, conditional on X_n = x, Z(X_n) is independent of x, we obtain by induction that

    X_{n+1} - θ = Π_{j=k}^{n} (1 - 2ABj^{-α})(X_k - θ) - AC^{-1} Σ_{m=k}^{n} Π_{j=m+1}^{n} (1 - 2ABj^{-α}) m^{-(α-w)} Z(X_m),

where Π_{j=n+1}^{n} (1 - 2ABj^{-α}) = 1. It follows that

    E(X_{n+1} - θ | X_k) = Π_{j=k}^{n} (1 - 2ABj^{-α})(X_k - θ),

and

    E((X_{n+1} - θ)² | X_k) = D(k,n)(X_k - θ)² + 2A²C^{-2}σ² Σ_{m=k}^{n} D(m+1,n) m^{-2(α-w)},    (3.16)

where D(m,n) = Π_{j=m}^{n} (1 - 2ABj^{-α})². The last two expressions of (3.16) are, conditional on X_k, the square of the bias and the variance of X_{n+1}, respectively.
Using results similar to (3.13), it has been shown that there exists ε_k → 0 such that for every k, n

    (1 - ε_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})} ≤ D(k,n)
        ≤ (1 + ε_k) exp{-4AB(1-α)^{-1}(n^{1-α} - k^{1-α})},  α < 1,    (3.17)

and

    (1 - ε_k)(k/n)^{4AB} ≤ D(k,n) ≤ (1 + ε_k)(k/n)^{4AB},  α = 1.    (3.18)

We can use the results of (3.17), (3.18) and Corollary (3.1) to obtain asymptotic expressions for the bias and variance of X_n in this example. In fact, it follows from (3.16), (3.17), and (3.18) that

    (E(X_{n+1} - θ | X_k))² = ((X_k - θ)² + o(1)) exp{-4AB(1-α)^{-1} n^{1-α}},  α < 1,
                            = ((X_k - θ)² + o(1)) n^{-4AB},  α = 1.    (3.19)
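The quality of the approximations (3.17) and (3.18) for D(k,n) can be inspected numerically by comparing logarithms; the constants below are illustrative assumptions:

```python
# Numeric sketch: log of D(k,n) = prod_{j=k}^{n} (1 - 2AB j^{-alpha})^2 versus
# the exponents suggested by (3.17) (alpha < 1) and (3.18) (alpha = 1).
import math

def logD(k, n, twoAB, alpha):
    """log of D(k,n) = prod_{j=k}^n (1 - 2AB * j**(-alpha))**2."""
    return 2 * sum(math.log(1 - twoAB * j**(-alpha)) for j in range(k, n + 1))

twoAB = 0.5      # the product 2*A*B; illustrative value
k, n = 10, 5000

# alpha < 1: exponent is -4AB/(1-alpha) * (n^{1-alpha} - k^{1-alpha}), as in (3.17).
alpha = 0.6
approx = -2 * twoAB / (1 - alpha) * (n**(1 - alpha) - k**(1 - alpha))
ratio = logD(k, n, twoAB, alpha) / approx
print(ratio)     # close to 1

# alpha = 1: D(k,n) ~ (k/n)^{4AB}, as in (3.18).
approx1 = 2 * twoAB * math.log(k / n)
ratio1 = logD(k, n, twoAB, 1.0) / approx1
print(ratio1)    # close to 1
```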
Although Corollary (3.1) does not guarantee that Var(X_n) exists, it does imply that n^{β/2}(X_n - θ), with β/2 = (α - 2w)/2, tends to be distributed as a normal random variable with variance equal to

    2A²σ²/(C²(4AB - (α - 2w))),  α = 1,    (3.20)

for A > (α - 2w)/4B if α = 1. Many writers have noted that the variance expression for α = 1 is minimized with respect to A if A = (α - 2w)/2B.

It is reasonably evident from the nature of D(k,n) that the bias of the K-W procedure is diminished by decreasing α. Asymptotically this is clear from (3.19). However, it follows from (3.6) that β/2 = α/2 - w < α/3. Thus the asymptotic variance increases as α decreases. Again, until X_n is in a neighborhood of θ, it is probably best to let α and w be small. Once X_n is in a neighborhood of θ, α should be increased.
It is important to note that in this example the bias is not a function of c_n, and, as is to be expected, the variance decreases as either C increases or w decreases. It follows that we can set w = 0. By doing so γ (see (3.4)) becomes α/2. Hence with α = 1 we obtain an asymptotic variance of O(n^{-1}). As was noted in Chapter 1, it is the symmetry of M(x) about θ that allows us to assume w = 0. It is clear that if M(x) is reasonably symmetric about θ, and if only a rough estimate of θ is needed, then it is best to let w ≈ 0.
Suppose we can assume that M(x) is symmetric about θ. This implies that we can set c_n = C. Asymptotically, the optimal choice of C maximizes the difference |M(x - C) - M(x + C)| for small deviations of x about θ. It is quite clear then how C should be chosen. We only note here that, for a quadratic response surface, C should be made arbitrarily large, and, for surfaces of the form e^{-x²}, C is determined as in Figure 2.

[Figure 2. Optimal Choice of c_n (= C)]
In this example we have assumed that Var Z(X_n) ≈ 2σ². If Var Z(X_n) = 2[n^{2aw}]^{-1}σ², (3.16) becomes

    E(X_{n+1} - θ | X_k) = D^{1/2}(k,n)(X_k - θ)

and

    E((X_{n+1} - θ)² | X_k) = D(k,n)(X_k - θ)² + 2A²C^{-2}σ² Σ_{m=k}^{n} D(m+1,n) m^{-2(α+(a-1)w)}.    (3.21)
3.5 Approximations of the Small Sample Variance
The importance of evaluating the last expression in equation (3.21) is quite clear. This expression is, conditional on X_k, the small sample variance of X_n if M(x) is a quadratic function. If each X_m, m = 1, ..., n, lies in a neighborhood of θ in which M(x) is reasonably approximated by a quadratic function, it is a close approximation to this variance for more general response functions.

At the present, the only simple, but adequate, approximations to this expression rely heavily on asymptotic theory. We now obtain bounds for this expression using an appeal to asymptotic theory that is satisfied for values of k and n usually realized in practice. Let

    b(m,n) = m^{-φ} Π_{j=m+1}^{n} (1 - Dj^{-α})²,    (3.22)

where D, φ > 0. For simplicity, let b_m = b(m,n), keeping in mind the dependence on n.
Using (3.22) we obtain

    b_m - b_{m-1} = b_m (1 - b_m^{-1} b_{m-1}).    (3.23)

Thus, for k sufficiently large, {b_m} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ. Since m^{-α} - (m-1)^{-α} = O(m^{-(1+α)}), the second difference is given by

    (b_m - b_{m-1}) - (b_{m-1} - b_{m-2}) = (2Dm^{-α} - φm^{-1} + O(m^{-2α}))(b_{m-1} - b_{m-2}).    (3.24)

Thus, for k sufficiently large, {b_m - b_{m-1}} is an increasing sequence for m > k if either α < 1, or α = 1 and 2D > φ.

The last expression of (3.21) is of the form Σ_{m=k}^{n} b_m. It is clear from equations (3.23) and (3.24) that, for sufficiently large k, upper and lower bounds can be obtained for this sum using a geometrical argument; Figure 3 gives a representation of the elements b_m that illustrates this approach. In fact,

    Σ_{m=k}^{n} b_m ≥ Area △ADE - Area △ABC,    (3.25)

and

    Σ_{m=k}^{n} b_m ≤ (n - k) b_n.

[Figure 3. Graph of the Sequence {b_n}]
This geometrical approach will be used in only one case in the following lemmas. However, every bound will be obtained by finding sequences of functions {f_n(x)} such that either

    f_n(m) ≥ b_m,  m = k, ..., n,

or

    f_n(m) ≤ b_m,  m = k, ..., n.

The respective bound will be obtained by integrating f_n(x) over a region such as (k, n+1) or (k-1, n).
Lemma (3.1): Let the sequence {b_m} be defined by (3.22) with α = 1. Let λ_1, λ_2, φ, and D be real numbers such that φ < λ_1 < 2D < λ_2. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2}(n^{λ_2-(φ-1)} - (k-1)^{λ_2-(φ-1)}) < Σ_{m=k}^{n} b_m
        < (λ_1 - (φ-1))^{-1} n^{-λ_1}((n+1)^{λ_1-(φ-1)} - k^{λ_1-(φ-1)}).
Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = n^{-λ} x^{λ-φ}    (3.26)

for x > 0. Clearly, by (3.22),

    f_n(n,λ) = n^{-φ} = b_n.    (3.27)

We have, using (3.26), that

    f_n(m,λ) ≥ b_m  (≤ b_m)    (3.28)

if and only if

    n^{-λ} ≥ m^{φ-λ} b_m  (≤ m^{φ-λ} b_m).    (3.29)

Let d_m(λ) = m^{φ-λ} b_m. We have

    d_m(λ)/d_{m-1}(λ) = 1 + (2D - λ)m^{-1} + O(m^{-2})

by (3.23). Since λ_1 < 2D < λ_2, there exists an N = N(λ_1, λ_2) such that the sequences {d_m(λ_1)}_{m>N} and {d_m(λ_2)}_{m>N} are increasing and decreasing respectively.

But (3.28), and hence (3.29), holds for m = n. Thus, by the monotonicity of the sequences {d_m(λ_1)} and {d_m(λ_2)}, (3.29), and hence (3.28), holds for all m > N(λ_1, λ_2).

Since φ < λ_1 < λ_2, {f_n(x,λ_1)} and {f_n(x,λ_2)} are sequences of increasing functions. It follows that, for k > N(λ_1, λ_2),

    ∫_{k-1}^{n} n^{-λ_2} x^{λ_2-φ} dx < Σ_{m=k}^{n} b_m < ∫_{k}^{n+1} n^{-λ_1} x^{λ_1-φ} dx.    (3.30)

Lemma (3.1) is obtained by integrating the above expression.
Lemma (3.2): Let the sequence {b_m} be defined as in Lemma (3.1). Let λ_1, λ_2, φ, and D be such that 0 < φ - 1 < λ_1 < 2D < λ_2 < φ. Then there exists an N(λ_1, λ_2) such that, for every k, n > N(λ_1, λ_2),

    (λ_2 - (φ-1))^{-1} n^{-λ_2}((n+1)^{λ_2-(φ-1)} - k^{λ_2-(φ-1)}) < Σ_{m=k}^{n} b_m
        < (λ_1 - (φ-1))^{-1} n^{-λ_1}(n^{λ_1-(φ-1)} - (k-1)^{λ_1-(φ-1)}).

Proof: The proof is a repetition of the proof of Lemma (3.1) except, in this case, {f_n(x,λ)} is a sequence of decreasing functions. Thus the ranges of integration in (3.30) must be reversed. The restriction that λ_1 > φ - 1 is to prevent f_n(x,λ_1) from being of the form x^{-1}.
Lemma (3.3): Let the sequence {b_m} be defined by (3.22) with α < 1. Let 0 < λ < 2D. Then there exists an N(λ) such that, for k, n > N(λ),

    b_n²(b_n - b_{n-1})^{-1}(1 - exp{λn^{-α}(k-n)})²/2 < Σ_{m=k}^{n} b_m
        < λ^{-1} n^{-φ+α}(exp{λn^{-α}} - exp{λn^{-α}(k-n)}).

Proof: Consider the function sequence {f_n(x,λ)} defined by

    f_n(x,λ) = b_n exp{λn^{-α}(x-n)}    (3.31)

for x > 0. We have

    f_n(m,λ) ≥ b_m    (3.32)

if and only if

    b_n exp{-λn^{-α}n} ≥ b_m exp{-λn^{-α}m}.    (3.33)

Let d_m = b_m exp{-λmn^{-α}}. Then

    d_m/d_{m-1} = exp{-λn^{-α}}(1 - Dm^{-α})^{-2}(1 - m^{-1})^{φ}    (3.34)

by (3.23). Since λ < 2D, there exists an N(λ) such that {d_m(λ)}_{m>N(λ)} is an increasing sequence.

But (3.32), and hence (3.33), holds for m = n. Thus, by the monotonicity of the sequence {d_m(λ)}_{m>N(λ)}, (3.33), and hence (3.32), holds for all m > N(λ).

Since f_n(x,λ) is an increasing function for all n,

    Σ_{m=k}^{n} b_m ≤ b_n exp{-λn^{1-α}} ∫_{k}^{n+1} exp{λn^{-α}x} dx.

The right hand inequality of Lemma (3.3) follows by integration and by noting that b_n = n^{-φ}.

To obtain the left hand inequality of Lemma (3.3), we note that the slope of the line AE (see Figure 3) is given by b_n - b_{n-1}. Therefore D - A = b_n(b_n - b_{n-1})^{-1} and (B - A) = (C - B)(b_n - b_{n-1})^{-1}, where A, B, C, D, and E here denote the points of Figure 3. But, for k > N(λ), (3.31) and (3.32) imply that

    b_k ≤ b_n exp{λn^{-α}(k-n)}.

It follows from (3.25) that, for k > N(λ),

    Σ_{m=k}^{n} b_m ≥ Area △ADE - Area △ABC ≥ b_n²(b_n - b_{n-1})^{-1}(1 - exp{λn^{-α}(k-n)})²/2.

Thus Lemma (3.3) is proven.
We now use Lemmas (3.1), (3.2), and (3.3) to calculate asymptotic bounds for the variance.

Theorem (3.2): Let the sequence {b_m} be defined by (3.22). Then, for fixed k, if α < 1,

    (4D)^{-1} + o(1) ≤ n^{φ-α} Σ_{m=k}^{n} b_m ≤ (2D)^{-1} + o(1),

and if α = 1 and 2D > φ - 1 > 0,

    n^{φ-α} Σ_{m=k}^{n} b_m = (2D - (φ-1))^{-1} + o(1).

Proof: We see from (3.16) and (3.22) that the last expression of (3.21) is of the form Σ b_m if D = 2AB. Using (3.16), (3.19), and the fact that if α < 1, then 2D > φ - 1, we have n^{φ-α} b_k → 0 for any fixed k. Thus we need consider only the sum from m > N, where N is chosen to satisfy the previous lemmas.

The case in which α = 1 and 2D ≠ φ follows directly from Lemmas (3.1) and (3.2) since λ_1 and λ_2 of these lemmas can be chosen arbitrarily close to 2D. It is true for φ = 2D since Σ_{m=k}^{n} b_m is a monotonic function of D for every k and n.

Again, the fact that λ is arbitrary in Lemma (3.3) implies the right hand inequality of Theorem (3.2) when α < 1. The left hand inequality holds from Lemma (3.3) since

    n^{φ-α} b_n²(b_n - b_{n-1})^{-1}/2 = (4D)^{-1} + o(1)

by (3.22) and (3.23).
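The α = 1 conclusion of Theorem (3.2) can be checked numerically; the values of D and φ below are illustrative choices satisfying 2D > φ - 1 > 0:

```python
# Numeric sketch of Theorem (3.2) for alpha = 1: with b(m,n) defined by (3.22),
# n^{phi-1} * sum_{m=k}^{n} b(m,n) should approach 1/(2D - (phi - 1)).

def tail_sum(k, n, D, phi):
    """sum_{m=k}^n b(m,n), with b(m,n) = m^{-phi} * prod_{j=m+1}^n (1 - D/j)^2."""
    total = 0.0
    prod = 1.0                       # running value of prod_{j=m+1}^n (1 - D/j)^2
    for m in range(n, k - 1, -1):    # accumulate from m = n down to m = k
        total += m**(-phi) * prod
        prod *= (1 - D / m) ** 2     # extend the product to include j = m
    return total

D, phi, k = 2.0, 1.5, 10
for n in (1000, 10000):
    print(n**(phi - 1) * tail_sum(k, n, D, phi))   # approaches 1/(2D - 0.5) = 2/7
```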
We can apply Theorem (3.2) to the example in which M(x) is a quadratic function by letting D = 2AB and φ = 2(α - w). Thus, conditional on X_k, the asymptotic variance of n^{(α-2w)/2}(X_n - θ) is equal to

    ρ 2A²σ²/(C²(4AB - β₊)),

where β₊ = α - 2w if α = 1, β₊ = 0 if α < 1, and where ρ = 1 if α = 1 and 1/2 ≤ ρ ≤ 1 if α < 1. This is in agreement with (3.20). It also points out the important fact that if M(x) is quadratic, the variance of X_n converges to the variance of its asymptotic distribution given by Corollary (3.1). The same result for a more general class of regression functions has been obtained by many writers using different approaches.
We now consider the extent to which Lemmas (3.1), (3.2), and (3.3) rely on asymptotic theory. We first note that it is not necessary for the sequence {d_m} to be monotonic in order that either (3.29) or (3.33) hold.

Consider the upper bounds established in Lemmas (3.1) and (3.2). The appeal to asymptotic results lies in the value of N such that, for all m > N,

    (1 - Dm^{-1})² ≥ (1 - m^{-1})^{λ_2},    (3.35)

where λ_2 < 2D. If D = 1, (3.35) holds for N = 1. Using the Taylor expansion, we find that (3.35) is equivalent to

    1 - 2Dm^{-1} + D²m^{-2} ≥ Σ_{j=0}^{∞} (Π_{i=0}^{j-1} (λ_2 - i)) (-m^{-1})^j / j!.    (3.36)

By (3.35), the right hand side of (3.36) decreases as λ_2 increases. Thus N is less than or equal to the value of N obtained for λ_2 = 2D. Assuming λ_2 = 2D and considering terms up to order m^{-3}, we find N is approximately the solution of the resulting inequality; this solution is given by x = 2(2D - 1)/3. Actually, if 2D > 3, the coefficient of m^{-4} is positive, implying the value of N is less than 2(2D - 1)/3 up to order m^{-4}.

It is clear that the upper bounds in Lemmas (3.1) and (3.2) are very practical bounds for moderate D. Moreover, we have already noted that, for α = 1, (3.20) is minimized at A = (1 - 2w)/2B. The corresponding value of 2D is 2(1 - 2w), giving N ≈ 2.
The appeal to asymptotic theory for the upper bound of Lemma (3.3) lies in the minimum value of N such that

    exp{λn^{-α}}(1 - Dm^{-α})² ≤ (1 - m^{-1})^{φ}    (3.37)

for all m, n > N (see (3.23) and (3.34)).

For the K-W process, φ = 2(α + (a-1)w). Since a is usually less than one, we shall consider the case φ < 2. It follows from (3.23) that if 2D is not significantly less than φ, then (1 - Dm^{-α})²(1 - m^{-1})^{-φ} is increasing in m for quite small m. Let N_0 be such that this is the case for m > N_0. Then, if (3.37) holds for m = N_0, it holds for all m > N_0. Let N_1 be the smallest value for which (3.37) holds with m = n, and for which N_1 ≥ N_0. Thus, if either 2D ≈ φ or 2D > φ, N_1 is a reasonable approximation for the minimum solution of (3.37).

If we let m = n and neglect terms of order n^{-3α} in (3.37), we obtain an approximation of N_1 given by the minimum solution of

    1 + (λ - 2D)x^{-α} + (λ²/2 + D² - 2Dλ)x^{-2α} ≤ 1 - φx^{-1} + φ(φ-1)x^{-2}/2.

Since φ > 1 for the K-W process, the coefficient of x^{-2} is positive. We therefore drop this term, knowing that the resulting minimum solution is increased.

If we let λ = CD for 0 < C < 2, the above inequality is surely satisfied if x is greater than the minimum solution of

    (2 - C)D x^{-α} - D²(C²/2 - 2C + 1) x^{-2α} ≥ φ x^{-1},

or, equivalently, of

    (2 - C)D x^{1-α} - D²(C²/2 - 2C + 1) x^{1-2α} ≥ φ.

For 2 - √2 < C < 2 the quadratic in C is negative. Thus if C is so restrained, our approximation is less than the solution of (2 - C)D x^{1-α} = φ, or

    x = (φ/(2D - λ))^{1/(1-α)}.
3.6 Estimation of E(X_n - θ)²

A comparison of (3.11) and (3.12) with (3.16) quickly reveals that (3.16) is generally not the correct expression for E((X_{n+1} - θ)² | X_{k_0}). However, it often serves as a reasonable approximation when the variables {X_m ± c_m}_{m=k_0}^{n} lie in a region about θ in which M(x) is reasonably approximated by a quadratic function. We assume in this section that the process has been in such a region for m = k_0, ..., n.

A natural estimator of n^{α-2w} E(X_n - θ)² is the right hand expression of (3.16) after the parameters have been replaced by their estimators and the expression has been suitably normalized. Using this estimator, the problem becomes one of choosing a good value for k and estimating the parameters.
Assume, for the moment, that estimates of the parameters are available. The squared bias expression in (3.16) is minimized as either k decreases or X_k approaches θ. This suggests that a good choice for k (k ≥ k_0) would be one for which X_k is close to X_{n+1} (the estimate of θ). Using this method, the squared bias expression can be ignored if n - k is of any consequence.

A natural estimate of the variance of X_n, conditional on k, would be to substitute the estimate for -4AM''(θ) (= 2D) into the upper bounds given in Lemmas (3.1), (3.2), and (3.3). If n - k is moderately large, this estimate (after normalizing by n^{α-2w}) closely approximates the corresponding estimate of the asymptotic variance given in Corollary (3.1).
The parameters, for which we have assumed estimates exist, are σ² and M''(θ). Carrying the assumption that M(x) is approximately quadratic a little further, we can obtain estimates of these parameters using regression analysis.
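A sketch of such a regression estimate follows. The data-generating model, the uniform design points, and all numeric values are illustrative assumptions; the procedure itself (fit a quadratic, take twice the leading coefficient as the estimate of M''(θ), and use the residual mean square for σ²) is the one suggested by the text:

```python
# Sketch: estimate M''(theta) and sigma^2 by fitting a quadratic
# y = b0 + b1*x + b2*x^2 to observations near theta by least squares.
import random

random.seed(2)
B, sigma, theta = 1.5, 0.4, 0.0
xs = [random.uniform(-1, 1) for _ in range(400)]   # design points near theta
ys = [-(B / 2) * (x - theta) ** 2 + random.gauss(0, sigma) for x in xs]

def quad_fit(xs, ys):
    """Solve the 3x3 normal equations for y ~ b0 + b1*x + b2*x^2."""
    S = [sum(x**p for x in xs) for p in range(5)]              # power sums
    T = [sum(y * x**p for x, y in zip(xs, ys)) for p in range(3)]
    A = [[S[0], S[1], S[2]], [S[1], S[2], S[3]], [S[2], S[3], S[4]]]
    b = T[:]
    for i in range(3):                                         # Gaussian elimination
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]; b[i], b[p] = b[p], b[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
            b[r] -= f * b[i]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                                        # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef

b0, b1, b2 = quad_fit(xs, ys)
m2_hat = 2 * b2                      # estimate of M''(theta); true value is -B
resid = [y - (b0 + b1 * x + b2 * x * x) for x, y in zip(xs, ys)]
sigma2_hat = sum(r * r for r in resid) / (len(xs) - 3)
print(m2_hat, sigma2_hat)            # near -B = -1.5 and sigma^2 = 0.16
```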
It is interesting to consider Fisher's information for M''(θ) when Y(x) ~ N(-B(x-θ)², σ²). We have

    -d²/dB² (C - (y + B(x-θ)²)²/(2σ²)) = d/dB ((y + B(x-θ)²)(x-θ)²/σ²) = (x-θ)⁴/σ².

Thus, conditional on X_i, i = k_0, ..., n, the information is given by (assuming 2[m^{2aw}] observations are taken on each step)

    Σ_{m=k_0}^{n} [m^{2aw}]((X_m - c_m - θ)⁴ + (X_m + c_m - θ)⁴)/σ² ≥ O(Σ_{m=k_0}^{n} c_m⁴ m^{2aw}).

If c_n = n^{-w} and w is arbitrarily close to its lower bound (α/(2(3-a))), the information is bounded below by a quantity of O(n^{1 - α(2-a)/(3-a) - ε}) for arbitrarily small ε > 0. This expression increases as either α decreases or as a increases until a = 2. Moreover, since (2-a)/(3-a) < 2/3 for small a, the usual regression estimators of σ² and M''(θ) will yield consistent estimates if M(x) is quadratic.
CHAPTER 4. THE SIGNED KIEFER-WOLFOWITZ PROCESS

4.1 Definition of the Signed Kiefer-Wolfowitz Process

Chapter 3 is a development of a generalization of the K-W process in which the corrective variable ((Y(X_n - c_n) - Y(X_n + c_n))^{T_n}) is constrained to lie between ±T_n. This corrective variable can be characterized as influencing both the direction and the magnitude of the adjustment at the n'th step. In this chapter we shall develop a modification of the K-W process in which the corrective variable influences only the direction of this adjustment.
Definition (4.1). Let {a_n}, {c_n}, and {r_n} be sequences of positive numbers where [r_n] ≥ 1. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let

    δ(x,c_n) = [r_n]^{-1} Σ_{i=1}^{[r_n]} δ_i(x,c_n),

where

    δ_i(x,c_n) =  1 if Y_i(x - c_n) > Y_i(x + c_n),
                  0 if Y_i(x - c_n) = Y_i(x + c_n),
                 -1 if Y_i(x - c_n) < Y_i(x + c_n),

i = 1, ..., [r_n]. Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} δ(X_n,c_n),    (4.1)

where Y_i(X_n - c_n) and Y_i(X_n + c_n), i = 1, ..., [r_n], are random variables which, conditional on X_n = x_n, δ(X_1,c_1), ..., δ(X_{n-1},c_{n-1}), X_1, ..., X_{n-1}, are distributed independently as F(·|x_n - c_n) and F(·|x_n + c_n).

This process will be called the Signed Kiefer-Wolfowitz process, and will be referred to as the SKW process.
When [r_n] ≡ r (a positive integer) the SKW becomes the process first proposed by Wilde [13]. Wilde did not attempt to develop the process except to note that it seemed less sensitive to the variance of Y(x) than the K-W process. The only other attempt to study this process seems to be Springer's [11] work, which was discussed in Chapter 1.
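Definition (4.1) is easy to simulate. The sketch below (the response function, noise, and all constants in the usage are hypothetical choices, not values from the thesis) implements the recursion with a_n = A n^{-a} and c_n = C n^{-w}:

```python
def skw_process(sample_y, x1, n_steps, A=1.0, a=1.0, C=1.0, w=0.25, r=1):
    """Simulate the SKW recursion of Definition (4.1):
        X_{n+1} = X_n - a_n c_n^{-1} delta(X_n, c_n),
    with a_n = A*n**-a, c_n = C*n**-w, and delta(x, c_n) the average of
    [r] signs of Y(x - c_n) - Y(x + c_n)."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = A * n ** (-a)
        c_n = C * n ** (-w)
        s = 0
        for _ in range(r):
            d = sample_y(x - c_n) - sample_y(x + c_n)
            s += (d > 0) - (d < 0)      # delta_i in {-1, 0, 1}
        x -= (a_n / c_n) * (s / r)      # recursion (4.1)
    return x
```

With a unimodal response such as M(x) = -(x - θ)² plus noise, the iterates drift toward θ regardless of the noise scale, since only the sign of each paired difference is used.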
The SKW process is intrinsically different from the K-W process. The K-W process seeks the value (θ) such that E[Y(θ)] ≥ E[Y(x)] for all x. The SKW process seeks the value (θ') such that

    P[Y(θ') > Y(x) | Y(θ') ≠ Y(x)] ≥ 1/2

for all x. Notice that if Y(x) is such that E[Y(x)] > E[Y(x')] for x ≠ x' implies that P[Y(x) > Y(x') | Y(x) ≠ Y(x')] > 1/2, then θ = θ'. In particular, if Y(x) ~ N(M(x), σ²(x)), then θ = θ'.
However, the behavior of the two processes can be markedly different because of their different emphases on the magnitude of the corrective variable.
When [r_n] = 1, the SKW process can be considered as a limiting case of the CKW process. Let a_n = A^s n^{-a} for the SKW process, and let a_n = A^s T^{-1} n^{-a} for the CKW process with T_n ≡ T. Then the corrective term for the CKW process,

    A^s T^{-1} n^{-a} c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^{T},

converges in probability to the correction term of the SKW process,

    A^s n^{-a} c_n^{-1} δ(X_n,c_n),

for any fixed n as T → 0.
Before deriving any basic results it is helpful to discuss the random variable δ(x,c_n). We have

    E δ(x,c_n) = P[Y(x - c_n) > Y(x + c_n)] - P[Y(x - c_n) < Y(x + c_n)].    (4.2)

If Y(x) is a continuous random variable for all x, (4.2) can be expressed as

    E δ(x,c_n) = ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x + c_n)) dF(y|x - c_n)
               - ∫_{-∞}^{∞} (∫_{-∞}^{y} dF(z|x - c_n)) dF(y|x + c_n).    (4.3)
We now consider the important special case in which Y(x) is a normal random variable with density function

    f(y,x) = (2πσ²(x))^{-1/2} exp{-(1/2)((y - M(x))/σ(x))²}.

Substitution into (4.3) gives

    E δ(x,ε) = ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x+ε) dz) f(y,x-ε) dy
             - ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,x-ε) dz) f(y,x+ε) dy.    (4.4)
Since we shall be interested in the behavior of δ(x,c_n) for c_n small and x in a neighborhood of θ, we shall need estimates of (∂/∂x) E δ(x,ε) and (∂²/∂ε∂x) E δ(x,ε). Assume that

1. M'(x) is continuous and there exists an open interval I containing θ such that for x ∈ I, M''(x) is continuous;

2. 0 < σ²(x) < ∞, σ'(x) is continuous, and σ''(x) is continuous for x ∈ I.
Then the Taylor expansion of f(y,x+ε) gives

    f(y,x+ε) = f(y,x) + ε ((∂/∂τ) f(y,τ))|_{τ=ξ}    (4.5)

for some ξ between x and x+ε. Since

    (∂/∂τ) f(y,τ) = f(y,τ)[(y - M(τ))M'(τ)/σ²(τ) + ((y - M(τ))²/σ²(τ) - 1)σ'(τ)/σ(τ)],    (4.6)

assumptions (1) and (2) guarantee that the last expression in (4.5) can be dominated by an integrable function for x in a bounded interval. We can, therefore, express (4.4) as

    E δ(x,ε) = ∫_{-∞}^{∞} [∫_{-∞}^{y} (f(z,x+ε) - f(z,x-ε)) dz] (f(y,x) + ε g(y,x,ε)) dy    (4.7)

where g(y,x,ε) can be dominated by an integrable function.
Since the term in brackets is bounded by 1 and goes to zero for every y as ε → 0, we have from (4.7) and assumptions (1) and (2) that

    lim_{ε→0} (∂/∂ε) E δ(x,ε) = 2 ∫_{-∞}^{∞} ((∂/∂x) ∫_{-∞}^{y} f(z,x) dz) f(y,x) dy
                              = 2σ(x) ∫_{-∞}^{∞} ((∂/∂x)[(y - M(x))/σ(x)]) f²(y,x) dy
                              = -M'(x)/√π σ(x).    (4.8)

In order to evaluate (∂²/∂ε∂x) E δ(x,ε) at x = θ and ε = 0, we note that, for x in a closed subinterval of I, the continuity of M''(x) and σ''(x) is sufficient to guarantee the existence and continuity of (∂/∂ε)(∂/∂x) E δ(x,ε) and (∂/∂x)(∂/∂ε) E δ(x,ε). Moreover, these second partials are equal, and we have by (4.6), (4.7), and (4.8) that

    lim_{c_n→0} (∂²/∂ε∂x) E δ(x,ε)|_{x=θ, ε=c_n} = -M''(θ)/σ(θ)√π    (4.9)

since M'(θ) = 0.
Thus if Y(x) is a normal random variable, we have

    E δ(x,c_n) = -c_n M'(x)/σ(x)√π + o(c_n),

and, for |x - θ| sufficiently small,

    E δ(x,c_n) = -c_n (x - θ) M''(θ)/σ(θ)√π + o(c_n(x - θ)).
If, more generally, Y(x) is a continuous random variable whose density function f(y|x) is continuous in x for all y, then it follows from Definition (4.1) that E δ_i²(x,c_n) ≡ 1 and E δ_i(x,c_n) → 0 as c_n → 0. Since, conditional on X_n = x, the δ_i(x,c_n), i = 1, ..., [r_n], are independent random variables, we have

    lim_{c_n→0} Var δ(θ,c_n) = 1.    (4.10)
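For normal Y(x) with constant standard deviation σ, (4.2) gives a closed form directly: the difference Y(x-c) - Y(x+c) is N(M(x-c) - M(x+c), 2σ²), so E δ(x,c) = 2Φ((M(x-c) - M(x+c))/σ√2) - 1. The sketch below (with a hypothetical M, not one from the thesis) checks this against a Monte Carlo estimate of E δ:

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def e_delta_exact(M, sigma, x, c):
    """E delta(x,c) for Y(x) ~ N(M(x), sigma^2):
    Y(x-c) - Y(x+c) ~ N(M(x-c) - M(x+c), 2*sigma^2)."""
    return 2.0 * normal_cdf((M(x - c) - M(x + c)) / (sigma * math.sqrt(2.0))) - 1.0

def e_delta_mc(M, sigma, x, c, n=200_000, seed=0):
    """Monte Carlo estimate of E delta(x,c) from the definition."""
    rng = random.Random(seed)
    s = 0
    for _ in range(n):
        d = (M(x - c) + rng.gauss(0.0, sigma)) - (M(x + c) + rng.gauss(0.0, sigma))
        s += (d > 0) - (d < 0)
    return s / n
```

Note the sign behavior: for x to the right of the maximizing value the expectation is positive, so the recursion (4.1) moves the process to the left.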
4.2 Asymptotic Theory of the SKW Process

Corollary (4.1): Let {X_n} be a SKW process with r_n = n^{2aw}. Let {V_n} be a real number sequence, θ be a real number, a and w be non-negative numbers, and n_0, A^s, C, v, and σ be positive numbers such that

a. if n > n_0, then E δ(V_n,c_n) = 0, and if x ≠ V_n, then (x - V_n) E δ(x,c_n) > 0;

b. for every x, E δ(x,0) = 0 and (∂/∂ε) E δ(x,ε) is continuous, and, if B = sup c_n, then sup_{|ε|≤B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞;

c. for every x in an open interval (I) containing θ and for every |ε| < B, ((∂/∂x) E δ(x,ε))|_{x=θ, ε=0} equals zero, (∂²/∂ε∂x) E δ(x,ε) is continuous, and the sequence {((∂²/∂ε∂x) E δ(x,ε))|_{x=θ, ε=c_n}} converges to A > 0;

d. Var δ(x,c_n) is continuously convergent to σ² at x = θ;

e. F(y|x) is, for each y, a Borel measurable function of x;

f. (1) 1/2 < a ≤ 1, n^a a_n → A^s, n^w c_n → C, |V_n - θ| = o(n^{-v});
   (2) if a < 1, then v > 1 - (2a - 1)/2w;
   (3) β⁺ = 0 if a < 1, and β⁺ = β < 2A^sA if a = 1, where β = a - 2w.

Then n^{β/2}(X_n - θ) converges in distribution to N(0, (A^s)²σ²/C²(2A^sA - β⁺)).

Proof:
Make the correspondence with Theorem (2.1) as follows: a'_n = a_n c_n^{-1}, Z_n(x) = δ(x,c_n), and R_n(x) = E δ(x,c_n).
To verify that conditions (i) through (vii) of Theorem (2.1) are
satisfied, we first note that condition (i) is a direct consequence
of condition (a).
It follows from condition (b) that
    sup_{n,x} n^w |E δ(x,c_n)|/(1 + |x|) = sup_{n,x} n^w c_n |((∂/∂ε) E δ(x,ε))|_{ε=ξ_n(x)}|/(1 + |x|),

where |ξ_n(x)| < c_n. The last expression is finite by condition (b), and therefore condition (ii) is verified.
To verify condition (iii) we note that condition (b) implies the existence of a function sequence {ξ_n(x)} such that

    |E δ(x,c_n)| = c_n |((∂/∂ε) E δ(x,ε))|_{ε=ξ_n(x)}|,

where |ξ_n(x)| < c_n. By condition (f.(1)), c_n converges to a limit, and hence the function sequence {ξ_n(x)} converges to a function, say ξ(x). Moreover, condition (b) implies ξ(x) is continuous, and therefore the function |((∂/∂ε) E δ(x,ε))|_{ε=ξ(x)}| is continuous and attains its infimum (k(δ_1,δ_2)) over the set [δ_1 ≤ |x - θ| ≤ δ_2]. But V_n → θ by (f.(1)). Therefore if 0 < δ_1 ≤ δ_2 < ∞, there exists an N such that for every n > N we have that

    [δ_1 ≤ |x - θ| ≤ δ_2] ∩ [|x - θ| ≤ |V_n - θ|] = φ.

This implies by (a) that k(δ_1,δ_2) > 0. Thus condition (iii) is verified by noting that the divergence of

    Σ a'_n inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)|

is implied by the divergence of Σ a_n.
We have by (a) and (c) that

    E δ(x,c_n)/c_n(x - V_n) = c_n^{-1} ((∂/∂x) E δ(x,c_n))|_{x=η} = ((∂²/∂ε∂x) E δ(x,ε))|_{x=η, ε=ξ},

where |η - V_n| < |x - V_n| and |ξ| < c_n. The last expression is continuously convergent to A at x = θ by condition (c). Thus condition (iv) is satisfied.
Condition (v) follows from condition (d) and the fact that |δ(x,c_n)| ≤ 1.
Since P[Y(x - c_n) > Y(x + c_n)] can be written as the limit of Lebesgue-Stieltjes sums, which by condition (e) are Borel measurable, it is clear that the distribution function of δ(x,c_n) is Borel measurable for all y. Thus condition (vi) is satisfied.

Condition (vii) can be shown to hold by direct substitution for the case T_n ≡ ∞ and p = 2aw. Thus Corollary (4.1) is proven.
The conditions in this corollary are somewhat unintuitive. This arises from the fact that the SKW process explores a response curve (E δ(x,c_n)) which, although often closely related, is not the response curve of the observed random variables. However, as in Corollary (3.1), the assumptions of Corollary (4.1) are satisfied in practice almost without exception. This fact will become clearer in the following examples.
Example 1. Suppose Y(x) ~ N(M(x), σ²(x)). Let {X_n} be a SKW process which satisfies assumptions (1) and (2) preceding (4.5) and assumption (f.) of Corollary (4.1) with w > 0. We have yet to determine v. In addition, assume that

3. (x - θ) M'(x) < 0 for x ≠ θ, and = 0 for x = θ;

4. M'''(x) is continuous for x ∈ I;

5. sup_x (|M'(x)| + |σ'(x)|)/σ(x)(1 + |x|) < ∞.

We shall show, using Corollary (4.1), that

    n^{β/2}(X_n - θ) → N(0, (A^s)²/C²(2A^sA - β⁺)),

where A = -M''(θ)/σ(θ)√π.

We first note that (3) implies condition (a) since P[Y(x) > Y(x')] > 1/2 if and only if E Y(x) > E Y(x') for independent non-degenerate normal random variables. It then follows by condition (4) and Corollary (3.1) that v = 2w.
Condition (e) is an immediate consequence of the continuity of
M(x) and σ(x).
We now summarize the relevant results obtained earlier for this special case:

1. (∂/∂ε) E δ(x,ε) is continuous;

2. {Var δ(x,c_n)} is continuously convergent to 1 at x = θ;

3. for every x ∈ I and every |ε| < B, ((∂/∂x) E δ(x,ε))|_{x=θ, ε=0} = 0, the sequence {((∂²/∂ε∂x) E δ(x,ε))|_{x=θ, ε=c_n}} converges to A (= -M''(θ)/σ(θ)√π), and (∂²/∂ε∂x) E δ(x,ε) is continuous.
Thus, the assertion of asymptotic normality is established if we can show that

    sup_{|ε| < B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) < ∞.
Taking the partial derivative with respect to ε under the integral signs in (4.3) (the continuity of M(x) and σ(x) assures us that this can be done), we obtain integrals of the form

    ∫_{-∞}^{∞} (∫_{-∞}^{y} (∂/∂ε) f(z,ε) dz) f(y,ε) dy

and

    ∫_{-∞}^{∞} (∫_{-∞}^{y} f(z,ε) dz) (∂/∂ε) f(y,ε) dy.

An examination of (4.5) reveals that assumption (5) is sufficient to guarantee that the above supremum is finite.
Thus the desired asymptotic normality of n^{β/2}(X_n - θ) is established.

The variances of the respective normal distributions to which the SKW process and the K-W process converge can be compared directly, both when a = 1 and when 1/2 < a < 1 (4.11). It is evident that the K-W process is asymptotically more sensitive to the variance of the observations. Before leaving this example we note that if A^s = A σ(θ)√π, the expressions in (4.11) are identical.
Example 2: Suppose Y(x) has an exponential density function given by

    f(y,x) = λ(x) e^{-λ(x)y},  y ≥ 0,
           = 0,                y < 0.

Let {X_n} be a SKW process satisfying assumption (f) of Corollary (4.1) with w > 0. Let M(x) = λ^{-1}(x) and assume that

1. M'(x) is continuous and (x - θ) M'(x) < 0 for x ≠ θ;

2. sup_x |M'(x)|/M(x)(1 + |x|) < ∞;

3. and there exists an open interval I containing θ such that M'''(x) is continuous.
We have

    P[Y(x-ε) < Y(x+ε)] = 1 - λ(x+ε)/(λ(x+ε) + λ(x-ε))
                       = 1 - M(x-ε)/(M(x+ε) + M(x-ε))

by the last expression in (4.3). It follows that

    E δ(x,ε) = (M(x-ε) - M(x+ε))/(M(x-ε) + M(x+ε)).    (4.12)

It is clear from (4.12) that the value of v is less than 2w, as in Corollary (3.1).
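Equation (4.12) can be checked directly: for independent exponentials with means M(x-ε) and M(x+ε), P[Y(x-ε) < Y(x+ε)] = M(x+ε)/(M(x-ε) + M(x+ε)). A small sketch (the mean function used in the check is hypothetical):

```python
import random

def e_delta_exp(M, x, eps):
    """Exact E delta(x, eps) from (4.12) for exponential Y with mean M(.)."""
    a, b = M(x - eps), M(x + eps)
    return (a - b) / (a + b)

def e_delta_exp_mc(M, x, eps, n=200_000, seed=0):
    """Monte Carlo estimate of E delta(x, eps) from the definition."""
    rng = random.Random(seed)
    s = 0
    for _ in range(n):
        d = rng.expovariate(1.0 / M(x - eps)) - rng.expovariate(1.0 / M(x + eps))
        s += (d > 0) - (d < 0)
    return s / n
```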
With the possible exception of condition (b), the conditions of Corollary (4.1) can be easily verified using assumptions (1) and (2). We have

    sup_{|ε| < B, x} |(∂/∂ε) E δ(x,ε)|/(1 + |x|) ≤ sup_{|ε| < B, x} |M'(x)|/M(x)(1 + |x|) < ∞

by assumption (2). It remains to evaluate Var δ(θ,0) and lim_{c_n→0} Var δ(x,c_n). The Var δ(x,c_n) → 1 by (4.10) for all x and, in particular, for x = θ. By assumption (3) the value of A is given by

    A = (∂/∂x)(∂/∂ε) E δ(x,ε)|_{ε=0, x=θ}    (4.13)
      = -M''(θ)/M(θ)
      = -M''(θ)/σ(θ)

since M(θ) = λ^{-1}(θ) = σ(θ).
A comparison of (4.13) with (4.9) suggests that A may be generally of the form -kM''(θ)/σ(θ), where k is a constant characterizing a family of distribution functions. The next example gives an important case when this is not true.
Example 3: Suppose Y(x) is a Bernoulli random variable with P[Y(x) = 1] = p(x). Suppose that assumptions (1) and (3) of Example 2 hold with M(x) replaced by p(x). In addition, assume that

2. sup_x |p'(x)|(1 + |x|) < ∞.

It is easily seen that these assumptions are sufficient to verify the conditions of both Corollary (4.1) and Corollary (3.1). Moreover, in this case the SKW process and the K-W process are identical. Thus, using the results of Corollary (3.1), we find that

    n^{β/2}(X_n - θ) → N(0, (A^s)²σ²/C²(2A^sA - β⁺)),

where A = -2p''(θ) and σ² = 2p(θ)(1 - p(θ)).

This example shows that the corrective variable in the SKW process is a transformation of the observed random variable into a discrete, bounded random variable with a regression function that is less than, equal to, or greater than zero as x is less than, equal to, or greater than μ_n. That is, the SKW process is simply a K-W process after this transformation on the observed random variables has been made.
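The identity behind this example is elementary: a difference of two {0,1}-valued observations already lies in {-1, 0, 1}, so taking its sign changes nothing. A one-line check:

```python
def sign(d):
    """The SKW corrective variable for a single pair of observations."""
    return (d > 0) - (d < 0)

# For Bernoulli observations y1, y2 in {0, 1} the difference y1 - y2 already
# lies in {-1, 0, 1}, so the sign transformation changes nothing and the SKW
# and K-W corrective variables coincide.
all_pairs_agree = all(sign(y1 - y2) == y1 - y2 for y1 in (0, 1) for y2 in (0, 1))
```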
These examples indicate that, at least asymptotically, the SKW process is more dependent than the K-W process on the family of distributions (e.g., normal, exponential) from which the observed random variables come. However, for the special cases of the normal and exponential families, the SKW process is less sensitive to the variance of the parent distribution.
4.3 Small Sample Properties of the SKW Process

Unlike the K-W process, whose small sample properties are largely determined by M(x) and σ(x), the small sample properties of the SKW process often depend on the distribution function in a more intrinsic manner. This makes a general discussion of the small sample properties of the SKW process very difficult.

However, a few general remarks can be made. Firstly, the effect of using the sign of (Y(X_n - c_n) - Y(X_n + c_n)) is somewhat similar to that of constraining an observed random variable; both operations lessen the effect of an observation from the tails of the distribution.
Secondly, if the SKW process is approaching θ along a constant, relatively small slope, the expected behavior is fairly well approximated by (3.11) and (3.12), in which the constants B and σ² are characteristic of the family of observed random variables. And, thirdly, if the SKW process is approaching θ along a steep slope, the approach is very uniform, and the rate is determined by the sequences {a_n} and {c_n}.
It is clear from the nature of the SKW process that it is impossible to consider an example corresponding to M(x) being quadratic for the K-W process. That is, since δ(x,c_n) is bounded by ±1, it is impossible for E δ(x,c_n) = k c_n(x - θ) to hold for all x. This implies that the moments of n^{β/2}(X_n - θ) generally do not converge to those of its asymptotic normal distribution unless possibly when X_n is restricted to lie in a bounded interval containing θ. To study the expected rate of approach to θ therefore requires a slight modification of the method used for the K-W process.
Let {X_n} be a SKW process. Let I be an open interval containing θ such that

    E δ(x,c_n) = c_n T_n(x)(x - θ),    (4.14)

where 0 < T ≤ T_n(x) ≤ T̄ < ∞ for x ∈ I. Let A_k be the event that X_m ∈ I for m ≥ k, and let X*_n denote the random variable X_n conditional on A_k. That is, assume that each member of the random variable sequence {X_m}, m = k, k+1, ..., lies in I.
It is not unreasonable to study the random variable sequence {X*_n}, for two reasons. Since X_n → θ a.e. by Corollary (4.1), P(A_k) → 1 as k → ∞, and since P(A_k) → 1, n^{β/2}(X*_n - θ) → n^{β/2}(X_n - θ) in probability and therefore has the same asymptotic distribution.

We have, for a_n = A^s n^{-a} and c_n = C n^{-w},

    X*_{n+1} - θ = X*_n - θ - a_n c_n^{-1} δ(X*_n,c_n)
                 = (1 - A^s T_n(X*_n) n^{-a})(X*_n - θ) - A^s C^{-1} n^{-(a-w)} Z(X*_n),    (4.15)

where, conditional on X*_n, E Z(X*_n) = 0 and Var Z(X*_n) = Var δ(X*_n,c_n).
In much the same manner that we obtained (3.16), we obtain

    E((X*_{n+1} - θ)² | A_k) ≤ D(k,n)(X*_k - θ)² + (A^s)² C^{-2} σ̄² Σ_{m=k}^{n} D(m+1,n) m^{-2(a-w)},    (4.16)

where D(m,n) = Π_{j=m}^{n} (1 - A^s T j^{-a})² and σ̄² = sup_{n,x} Var δ(x,c_n). The reverse inequality holds if T and σ̄² are replaced by T̄ and the corresponding infimum of Var δ(x,c_n). Quite clearly, since T_n(X*_n) → A a.e. and Var δ(X*_n,c_n) → Var δ(θ,0) a.e., it could be shown that

    lim n^{β} E(X*_n - θ)² = φ,    (4.17)

where β = a - 2w.
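The right side of (4.16) is easy to evaluate numerically once values of A^s, C, the lower bound T, and σ̄² are supplied (the values used in the check below are hypothetical). A sketch:

```python
def variance_bound(k, n, As, C, T_low, sigma2_bar, a, w, x2_start):
    """Right side of (4.16):
       D(k,n)*x2_start + (As/C)**2 * sigma2_bar
           * sum_{m=k..n} D(m+1,n) * m**(-2*(a-w)),
    with D(m,n) = prod_{j=m..n} (1 - As*T_low*j**-a)**2."""
    def D(m, nn):
        p = 1.0
        for j in range(m, nn + 1):     # empty product (m > nn) gives 1
            p *= (1.0 - As * T_low * j ** (-a)) ** 2
        return p
    total = D(k, n) * x2_start
    for m in range(k, n + 1):
        total += (As / C) ** 2 * sigma2_bar * D(m + 1, n) * m ** (-2.0 * (a - w))
    return total
```

With a = 1 and w = 1/4, the bound decays roughly like n^{-(a-2w)} = n^{-1/2}, in agreement with (4.17).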
Since it will be useful in the next section, we shall show that

    lim n^{2β} E(X*_n - θ)⁴ = 3φ².

If we raise (4.15) to the 4'th power and take expectations, the terms that are odd in Z(X*_n) drop out, since E[(X*_n - θ)³ Z(X*_n) | X*_n] = 0. Using (4.16) and Hölder's inequality, the remaining cross terms are of smaller order than the second moment term. Thus we can write
    E((X*_{n+1} - θ)⁴ | A_k) ≤ D²(k,n)(X*_k - θ)⁴ + 6φ(A^s)² C^{-2}(σ̄² + o(1)) Σ_{m=k}^{n} D²(m+1,n) m^{-3a+4w}.    (4.18)

It is seen that Lemmas (3.1), (3.2), and (3.3), as well as Theorem (3.2), hold with 2D replaced by ηD. Thus the last expression in (4.18) is of order

    n^{-2(a-2w)} 3φ² (1 + o(1)).

It should be noted that, when a = 1, the restriction that a - 2w < 2A^sA implies 2β = 2(a - 2w) < 4A^sA = ηD in the above expressions. Using (3.17) and (3.18) with 2B = A^sA, we obtain

    lim n^{2β} E(X*_n - θ)⁴ = 3φ²    (4.19)

as we might have expected.
4.4 Estimation of E(X_n - θ)²

The fact that the form of A is generally unknown for the SKW process adds a new complexity to the estimation of the variance of X_n. If we can assume a particular distribution for Y(x), we can often evaluate A, as was illustrated in the preceding examples. An estimate of the variance of X_n could then be calculated using the method developed in Chapter 3. However, the distribution of Y(x) is not always known. We now develop a method of estimating the variance of X_n which does not require this knowledge.

Denote the Var X*_n by V_n. Let {b_m} be an increasing subsequence of the integers 1, ..., n. Define V̂_n by

    V̂_n = N^{-1} Σ_{m=1}^{N} b_m^{β} (X*_{b_m} - X*_n)².    (4.20)

We shall study the properties of V̂_n through a study of the properties of

    V'_n = N^{-1} Σ_{m=1}^{N} b_m^{β} (X*_{b_m} - θ)².    (4.21)

The asterisk, which indicates that X_m, m = k, ..., n, all lie in a neighborhood of θ in which (4.14) is a reasonable approximation for E δ(x,c_n), will be omitted in the rest of this discussion.

It follows from (4.17) that V'_n is asymptotically unbiased.
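A sketch of the estimator (4.20), using a geometric subsequence b_m = s^m (a choice examined later in this section for the case a = 1); the trajectory used in the check is a deterministic stand-in, not output of an actual process:

```python
def variance_estimator(xs, n, beta, s=2):
    """The estimator (4.20),
        V_n = N^{-1} * sum_m b_m**beta * (X_{b_m} - X_n)**2,
    along the geometric subsequence b_m = s**m with b_N <= n.
    xs[i] holds X_i for i = 1, ..., n (index 0 is unused)."""
    bs = []
    b = s
    while b <= n:
        bs.append(b)
        b *= s
    return sum(bm ** beta * (xs[bm] - xs[n]) ** 2 for bm in bs) / len(bs)
```

Because each summand b_m^β (X_{b_m} - θ)² has expectation near φ, the average estimates φ without any knowledge of A.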
To obtain the variance of V'_n we note that

    Var[V'_n] = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^{β} Cov[(X_{b_m} - θ)², (X_{b_{m'}} - θ)²]
              = N^{-2} Σ_{m,m'=1}^{N} (b_m b_{m'})^{β} (E[(X_{b_m} - θ)²(X_{b_{m'}} - θ)²] - E(X_{b_m} - θ)² E(X_{b_{m'}} - θ)²).

For m' > m the first expectation can be evaluated by conditioning on X_{b_m} and using the conditional mean and variance of X_{b_{m'}} - θ given X_{b_m} obtained from (4.15). Using (4.19) we obtain

    Var(V'_n) ≤ (6φ² + o(1)) N^{-2} Σ_{m'>m} D(b_m, b_{m'}).    (4.22)
Consider the case a = 1. We have from (3.22) and (3.28) that for every Λ < 2A^sT there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) < (b_m/b_{m'})^{Λ}.    (4.23)

Let m' = m + 1, and suppose that the sequence {b_m} is chosen such that (b_m/b_{m+1}) is independent of m. Then, for some constant B,

    ln b_m - ln b_1 = Σ_{i=1}^{m-1} (ln b_{i+1} - ln b_i) = (m - 1)B,

or, equivalently, b_m is of the form (s^m).
Let b_m be so defined with s > 1. Then from (4.22) and (4.23) we obtain

    Var(V'_n) < (6φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N - m) s^{-mΛ}.

Since N is approximately the solution of s^N = n, we have

    ln(n) Var(V'_n) ≤ 6φ² ln(s)/(1 - s^{-Λ}) + o(1).
Consider the case a < 1. We have from (3.22) and (3.32) that for every Λ < 2A^sT there exists an N(Λ) such that, for every m, n > N(Λ),

    D(b_m, b_{m'}) < (b_m/b_{m'})^{ψ} exp[Λ b_{m'}^{-a}(b_m - b_{m'})],    (4.24)

where ψ > 0. Let m' = m + 1 and suppose the sequence {b_m} is chosen such that the term in brackets is asymptotically independent of m. Then, if b_{m+1} = b_m(1 + O(m^{-1})), it is easily seen that b_m = [(τm)^{1/(1-a)}] has the required property. Let b_m be so defined. Then, for fixed k and m' = m + k, the term in brackets becomes [-Λτk + o(1)]. Thus by (4.22) and (4.24) we obtain a bound on Var(V'_n) of O(N^{-1}). Since N is approximately the solution of (τN)^{1/(1-a)} = n, we have Var(V'_n) = O(n^{-(1-a)}).
Since θ is unknown, it is natural to use V̂_n instead of V'_n. We have

    E V̂_n = N^{-1} Σ_{m=1}^{N} b_m^{β} E(X_{b_m} - X_n)²
          = N^{-1} Σ_{m=1}^{N} b_m^{β} (E(X_{b_m} - θ)² - 2 E(X_{b_m} - θ)(X_n - θ) + E(X_n - θ)²)
          = N^{-1} Σ_{m=1}^{N} (φ + o(1)).

Thus V̂_n is asymptotically unbiased.
n
Moreover, for a < 1,
O(b- S exp{-n- a (n-b )}
m
..
m
'.
Thus for a < 1, V is usually a conservative estimate of V •
n
n
This method of estimation is obviously very inefficient for a close to one. It does have the advantages of being relatively easy to calculate and of being generally applicable. It should be noted that the order of the variance of V̂_n depends on the number of steps and not on the number of observations. This is undesirable when we wish to economize by taking more observations on each step and taking fewer steps.
-e
CHAPTER 5.
TWO EXAMPLES
We now illustrate the application of some of the results of
Chapter 3 and of Chapter 4 by means of two hypothetical examples.
Both
examples entail designing an experiment to emphasize the factors that
influence the choice of the sequences {a }, {c }, and {T }.
n
n
n
Example 1: A research worker is studying the relationship between temperature and the rate of reproduction of a certain strain of bacteria. One of his objectives is to estimate the temperature (θ) that maximizes the average rate of reproduction. He emphasizes that he wishes to minimize E(M(θ̂) - M(θ))² rather than E(θ̂ - θ)².

From past experience he is reasonably confident that θ lies between 80 and 110 degrees Fahrenheit and that the average rate of reproduction is reasonably constant in the interval (θ-5, θ+5). For temperatures below θ - 5 degrees the average rate of reproduction changes approximately one unit for each 10 degree change in temperature, and for temperatures above θ + 5 degrees the average rate of reproduction changes two to three units for each 10 degree change in temperature. The standard deviation is expected to be approximately three units at temperatures below θ and to moderately increase as temperatures rise in a twenty degree range above θ. He also mentions that for practical reasons he can perform the experiment at only 20 different temperatures.
The above discussion indicates that the relation between the average rate of reproduction and temperature may be represented by Figure 4.

[Figure 4. Graph of Reproduction versus Temperature: the average rate of reproduction (about 16 to 22 units) plotted against temperature (80 to 120 degrees), with the interval (θ-5, θ+5) and the points D, E, F, G marked on the curve.]
We shall illustrate the design of this experiment using two CKW processes. In order that the first process (X'_n) approaches θ along a moderate slope and that the second process (X''_n) approaches θ along a steep slope, we set X'_1 = 80 and X''_1 = 110. A reason in favor of using two such processes in practice rather than one is that two processes would usually generate observations more uniformly and over a larger range of temperatures. This allows the experimenter to study the whole relationship more reliably.
Since so many of the results in this paper depend on (3.15) being a reasonable approximation of the process, we must define an interval I of θ for which this is true. Then we must design the process such that X_n is in the region at least for the last few steps. Since we are practically restricted to 20 steps, we shall choose the sequence {a_n} such that with a high probability X_n has reached this region after 10 steps.

Before doing this we must choose the sequence {c_n}. This sequence should be chosen with two thoughts in mind. Firstly, c_n should be large as long as |X_n - θ| > c_n; that is, while X_n is still searching for a neighborhood of θ. Secondly, when |X_n - θ| < c_n, c_n should be small enough so that the expected correction term always directs the process to a region in which M(x) ≈ M(θ).

The maximum value of c_n for which the last statement is true can be approximated by considering Figure 4. Let D - F denote the length of the line DF, and let c = (D - F)/2. We see from the picture that if c_n < c, then |μ(c_n) - θ| < 5. The experimenter has noted that there is not a biologically significant difference between M(μ(c)) and M(θ) if |μ(c) - θ| < 5. Thus we seek the maximum value of c for which |μ(c) - θ| < 5.
Let p be such that (G - F) = 2pc. Then if s_1 and s_2 represent the slopes of lines FE and ED, we have

    E - G = 2cps_1 = 2c(1 - p)s_2,

and the requirement 5 > |μ(c) - θ| for s_2 > s_1 determines the maximum value of c. Thus, for our example,

    c = 5(.1 + .2)/(.2 - .1) = 15.
Consider the choice of c_n for X'_n. Initially c_n should be large enough so that if X_n + c_n < θ, then X'_{n+1} > X'_n with a high probability. Since P[|Y(x) - M(x)| < σ] ≈ .6, we may choose c_1 (= C) as the solution of

    2σ = M(x_1 + C) - M(x_1 - C) ≈ .2C,

or C ≈ 30. If, in addition, we require that c_20 ≈ .4C, then w ≈ .3. With these values for C and w we find that c_20 ≈ 11.5.

Similarly, for the choice of c_n for X''_n we obtain

    2σ = M(x_1 + C) - M(x_1 - C),

or C ≈ 15. In this case, c_1 is sufficiently small to discount a serious bias. Thus, we can set w = 0, or c_n ≡ 15.
It should be noted that we could start the process at n = k > 1 and let n = k, ..., 20 + k. By doing so, we can adjust the rate of change of the chosen constants.
An examination of Figure 4 reveals that if c_n < 15 and |X_n - θ| < 5, then (3.15) is a reasonable approximation for X_{n+1}. Equations (3.11) and (3.12) are useful for choosing the sequence {a_n} such that X_n reaches the interval I by the n'th step.
In choosing a_n (= An^{-a}) for X'_n, we note first that this process will be somewhat exploratory due to the small slope. Therefore, we do not want a_n to decrease too rapidly, as that would make the process too sensitive to the first few steps. Thus we might set a = 3/4. Since θ ∈ (80, 110), |X'_1 - (θ - 5)| < 25. Thus a conservative value for A, which reasonably assures that X_n ∈ I for some n ≤ 10, is the solution of

    2(.1)A Σ_{m=1}^{10} m^{-.75} ≈ .2(3.8)A = 25    (4.25)

by (3.11). We therefore set A = 30.
In choosing a_n (= An^{-a}) for X''_n, we note first that this process should be in a neighborhood of θ quite early due to the steep slope. Thus, we want the sequence {a_n} to optimize the behavior of X''_n under this assumption. Asymptotically this is done by setting a = 1 (with β = a - 2w = 1) and requiring A > β/2B, where B is a conservative estimate based on |M''(θ)|. To obtain such an estimate we note from (3.14) and Figure 4 that

    -B(θ - 10 - θ) ≈ M'(θ - 10) ≈ .1.

Thus A > 2.5. Furthermore, by the note following (3.20), asymptotically the optimal choice of A is 5. However, if we require A to be greater than the solution of the equation corresponding to (4.25), we must have A > 25.
There are numerous bases on which the sequence {T_n} of constraints may be chosen. We illustrate one method for X''_n. In the region of steepest slope the expected correction term is

    AC^{-1}n^{-1}(M(x - c_n) - M(x + c_n)) ≈ 6AC^{-1}n^{-1}.

If we set T_n = DCA^{-1}, then the maximum correction would be D/n. It is clear that for D > 6 this constraint does not seriously affect the approach of X''_n in the region of steep slope. By letting D < 10, the possible bias arising from a large movement to the left of θ in the early stages of the experiment is diminished. Obviously, the effect of T_n diminishes very quickly as n gets large.
We can get an approximation for the variance of X'_n and X''_n using (3.16) and Lemmas (3.1) and (3.3). We have for X'_n that

    D(m,n) ≤ Π_{j=m+1}^{n} (1 - 2(.1)30 j^{-.75})²,

giving, by Lemma (3.3), the bound (4.26) on Var(X'_21 - θ | X_k) for k = 10. If |X'_10 - θ| < 5, then

    E((X'_20 - θ)² | X_10) ≤ Π_{j=11}^{20} (1 - 6j^{-.75})² 25 ≈ (11/20)^{-1.5+.6} exp{-12(10)(20)^{-.75}} 25 ≈ 0

by (4.24) with b_m = 10 and b_{m'} = 20.

We have for X''_n that

    D(m,n) ≈ Π_{j=m+1}^{n} (1 - 2(.1)25 j^{-1})².

Using Lemma (3.1), we obtain

    Var(X''_21 - θ | X_10) ≈ .14.    (4.27)

If |X''_n - θ| < 5, then

    E((X''_20 - θ)² | X_10) ≤ Π_{j=11}^{20} (1 - 5j^{-1})² 25 ≤ .03.    (4.28)
Before comparing (4.28) and (4.27), we must adjust for the fact that X''_n approached θ along the steepest slope. The corresponding adjustment in (4.25) changes the restriction on A^s to A^s > 15; the comparison is then made with this new value for A^s. We conclude this example by noting that much the same procedure would be followed if the SKW process were considered.
Example 2. Suppose it has been noted that the addition of a moderate amount of a certain medication to the daily diet of some chickens has improved their health, and it is desired to estimate the optimal amount.

In this type of problem, it is often difficult to quantify the observed variable (health). Moreover, the shape of the response function depends on the manner in which we choose to quantify this variable. Thus it seems natural to use a method of estimation that depends on relative magnitudes rather than absolute magnitudes. We therefore illustrate this example using the SKW process.
The corrective variable (δ(X_n,c_n)) is defined, as in Definition (4.1), through comparisons of the health of the chickens which are given the amount of medication (X_n - c_n) with the health of the chickens which are given the amount of medication (X_n + c_n). As we noted in the discussion of Example 3, the random variable δ(x,c_n) generates a response curve that equals zero at μ_n. We shall suppose that the experimenter is reluctant to assume that μ_n = θ. If for no other reason than that it is somewhat meaningless to view the expected health as a function of diet, we shall also assume that the experimenter wishes to minimize E(θ̂ - θ)².
Since it is not too unreasonable to think of μ_n - θ as the bias of X_n, we might let c_n equal some constant (greater than 1) times the bias we are willing to accept. The values C and w should be chosen such that c_m is large during the period of time we expect the process to be searching for the region of θ, and approximately equal to the acceptable value when in this region.

We have such little knowledge of A in this case that it is probably wise to set a < 1 and thereby obviate the risk of not satisfying the inequality 2A^sA > β for the case a = 1. Moreover, due to the unknown nature of A, we are restricted to estimating the variance of X_n using the method proposed in the previous section. This latter fact may cause us to diminish a still further and to compensate for the corresponding increase in the variance by decreasing A^s.

Using the principles employed in Example 1, we could evaluate the sequences {a_n} and {c_n} more specifically.
..
CHAPTER 6.
6.1
SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
Sununary
A large class of stochastic approximation procedures can be
represented by
(6.1)
where, conditioned on X , Z(X ) is an observable random variable
n
n
independent of X. and Z(X.),i
1
.
1
= 1,
.•. , n-l.
In particular the Kiefer-
Wolfowitz (K-W) process admits this representation with a'
n
and Z(X )
n
= Y(Xn-c n )
- Y(X +c ) where, conditional on X·
n n
n
= an
= x,
c
-1
n
Y(X -c )
n n
and Y(X +c ) are independent random variables with means M(x-c ) and
n
n
n
M(x+c ) respectively.
n
An extension of Burkholder's [2] theorem on the asymptotic distribution of the stochastic approximation procedures, which can be
represented by (6.1), is given in Chapter 2.
This extension allows
the number of observations taken on each step to be of O(n-p)~· p ~ 0,
allows the random variable Iz(x )/ to be constrained to lie between
n
bounds of O(n- S), S ~ 0, and admits a more general sequence of constants a'.
n
The practical consequences of this extension for the K-W
process are discussed in Chapter 3.
The small sample mean and variance of X ' conditional on Xk '
n
are studied through a consideration of some specific examples.
The
case in which M(x) is a quadratic function is considered since M(x)
is often closely approximated by a quadratic function in a neighborhood of θ. It is shown by means of some approximations that, after suitable normalization, the variance of the normal random variable to which the K-W process converges in distribution is a conservative estimate of Var(X_n | X_k). This asymptotic variance depends on the unknown parameters σ² and M''(θ). It is shown that an efficient estimator of M''(θ) is also a consistent estimator of M''(θ) when Y(x) ~ N(M(x), σ²).
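The thesis's efficient estimator of M''(θ) is built from the observations generated by the process itself; as a simpler stand-in that illustrates estimating a second derivative from noisy Gaussian responses, one can average second differences. The function names, the spacing h, and the test response below are assumptions for this sketch, not the estimator constructed in the text.

```python
import random

def second_diff_estimate(y, x, h=0.5, reps=1000):
    """Illustrative (not the thesis's) estimator of M''(x): average the noisy
    second difference (Y(x+h) - 2 Y(x) + Y(x-h)) / h^2 over fresh observations.
    For quadratic M the second difference equals M'' exactly for any h > 0."""
    total = 0.0
    for _ in range(reps):
        total += (y(x + h) - 2.0 * y(x) + y(x - h)) / h ** 2
    return total / reps

# Example with Y(x) ~ N(M(x), 0.25) and M(x) = -(x - 2)^2, so M''(theta) = -2.
random.seed(2)
noisy_y = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.5)
m2_hat = second_diff_estimate(noisy_y, x=2.0)
```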
A modification of the K-W process first proposed by Wilde [13] is defined in Chapter 4. It satisfies (6.1) with a'_n = a_n c_n^{-1} and with Z(X_n) being the mean of [r_n] independent random variables defined by

                 1 if Y_i(X_n - c_n) > Y_i(X_n + c_n),
δ_i(x, c_n) =    0 if Y_i(X_n - c_n) = Y_i(X_n + c_n),
                -1 if Y_i(X_n - c_n) < Y_i(X_n + c_n),

where, conditional on X_n = x, Y_i(X_n - c_n) and Y_i(X_n + c_n), i = 1, ..., [r_n], are independent random variables with means M(x - c_n) and M(x + c_n), respectively. This process is termed the Signed Kiefer-Wolfowitz (SKW) process. The asymptotic distribution of the SKW process is obtained under quite general assumptions on Y(x) if the sequences a_n, c_n, and r_n are each of the form n^p. It is seen that the orders of the variances of the K-W process and the SKW process are the same, but that the variance of the SKW process shows a greater dependency on the nature of the distribution of Y(x).
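A minimal sketch of the SKW step: each correction has the fixed magnitude a_n/c_n, with its direction (and, through averaging, its fraction) set by the mean of [r_n] sign variables δ_i. The batch-size sequence r_n, the gain constants, and the test function are illustrative assumptions.

```python
import math
import random

def sign(v):
    return (v > 0) - (v < 0)

def signed_kw(y, x0, n_steps=2000):
    """Signed K-W sketch: Z(X_n) is the mean of [r_n] variables
    delta_i = sign(Y_i(X_n - c_n) - Y_i(X_n + c_n)), so the responses
    control only the direction of the fixed step a_n / c_n."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n                        # illustrative
        c_n = n ** -0.25                     # illustrative
        r_n = max(1, math.floor(n ** 0.25))  # illustrative batch size
        z = sum(sign(y(x - c_n) - y(x + c_n)) for _ in range(r_n)) / r_n
        x -= (a_n / c_n) * z
    return x

# Same illustrative response as before: maximum of M at theta = 2.
random.seed(3)
noisy_y = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.5)
estimate = signed_kw(noisy_y, x0=0.0)
```

Because |Z(X_n)| ≤ 1, every correction is bounded by a_n/c_n regardless of how wild a single response is, which is the sense in which the responses influence the direction but not the magnitude of each step.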
The variance of X_n conditional on X_k is again studied by considering specific examples. It is seen that this variance depends on an
unknown parameter, the nature of which depends on the form of the distribution of Y(x). The estimator of the variance,

    Var(X_n) = N^{-1} Σ_{m=1}^{N} b_m^{2S} (X_{b_m} - X_n)^2,

where {b_m} is an increasing subsequence of the integers 1, ..., n and n^S (X_n - θ) is asymptotically normal, is proposed since this estimator does not depend on the unknown parameter. This estimator is shown to be consistent for a somewhat restricted class of sequences {b_m}.
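The subsequence estimator can be sketched directly, assuming the form N^{-1} Σ b_m^{2S} (X_{b_m} - X_n)^2; the synthetic trajectory and the particular subsequence {b_m} below are illustrative assumptions, not the restricted class for which consistency is proved.

```python
import random

def subsequence_variance(xs, S, subseq):
    """Sketch of the proposed estimator
        N^{-1} * sum_m  b_m^{2S} * (X_{b_m} - X_n)^2
    computed from one trajectory xs = [X_1, ..., X_n], where subseq = {b_m}
    is an increasing subsequence of 1..n and n^S (X_n - theta) is
    asymptotically normal."""
    x_n = xs[-1]
    return sum(b ** (2 * S) * (xs[b - 1] - x_n) ** 2 for b in subseq) / len(subseq)

# Toy trajectory X_m = theta + m^{-S} e_m with e_m ~ N(0, 4): the estimator
# should land roughly near Var(e_m) = 4 (biased upward slightly by the noise
# still present in X_n itself).
random.seed(4)
S = 1.0 / 3.0
theta = 5.0
xs = [theta + m ** -S * random.gauss(0.0, 2.0) for m in range(1, 20001)]
v_hat = subsequence_variance(xs, S, subseq=range(100, 2001, 100))
```

Taking b_m well below n keeps the factor (b_m/n)^{2S} small, so the residual noise in X_n contributes little; this is one intuition for why only a restricted class of subsequences yields consistency.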
In Chapter 5, two hypothetical examples are given to illustrate the design of an experiment in which these procedures are used.
6.2 Suggestions for Further Research
Although the results in this paper provide some added insight into the behavior of the K-W and SKW processes, and methods of estimating the variance of the estimate of θ are presented, both of these areas need further study. One very promising possibility would be a numerical study comparing these stochastic approximation procedures with the Method of Steepest Ascent.
LIST OF REFERENCES

1. Blum, J. R. 1954. Approximation methods which converge with probability one. Ann. Math. Statist. 25:382-386.

2. Burkholder, D. L. 1956. On a class of stochastic approximation processes. Ann. Math. Statist. 27:1044-1059.

3. Chung, K. L. 1954. On a stochastic approximation method. Ann. Math. Statist. 25:463-483.

4. Dvoretzky, A. 1956. On stochastic approximation. Proc. Third Berkeley Symposium. 1:39-56.

5. Fabian, V. 1967. Stochastic approximation with improved asymptotic speed. Ann. Math. Statist. 38:191-200.

6. Fabian, V. 1968. On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39:1327-1332.

7. Hodges, J. L., Jr., and E. L. Lehmann. 1956. Two approximations to the Robbins-Monro process. Proc. Third Berkeley Symposium. 1:95-104.

8. Kiefer, J., and J. Wolfowitz. 1952. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23:462-466.

9. Robbins, H. E., and S. Monro. 1951. A stochastic approximation method. Ann. Math. Statist. 22:400-407.

10. Sacks, J. 1958. Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29:373-405.

11. Springer, B. G. F. 1969. Numerical optimization in the presence of random variability: the single factor case. Biometrika 56:65-74.

12. Venter, J. H. 1967. An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38:181-190.

13. Wilde, D. J. 1964. Optimum Seeking Methods. Prentice-Hall, Inc., Englewood Cliffs, N.J.