This research was supported by the National Institutes of Health, Institute of General Medical Sciences, Grant No. GM-12868-04.

ON THE KIEFER-WOLFOWITZ PROCESS AND SOME OF ITS MODIFICATIONS

Mark Allyn Johnson

University of North Carolina Institute of Statistics Mimeo Series No. 644

September 1969

ABSTRACT

JOHNSON, MARK ALLYN. On the Kiefer-Wolfowitz Process and Some of Its Modifications. (Under the direction of JAMES E. GRIZZLE.)

The Kiefer-Wolfowitz (K-W) process is a stochastic approximation procedure used to estimate the value of a controlled variable that maximizes the expected response (M(x)) of a corresponding dependent random variable. An extension of a theorem on the asymptotic distribution of a certain class of stochastic approximation procedures, which includes the K-W process, is given. This extension admits an increasing number of observations to be taken on each step, constrains the observed random variable to lie between predetermined bounds (not necessarily finite), and extends the class of admissible norming sequences. Each of these extensions is examined in the context of its usefulness to the application of the K-W process.

Approximations of the small sample mean and variance of the K-W process are given and studied through an examination of various special cases of the response function. Traditionally the estimation of the variance of the process involves the estimation of M''(θ). It is shown that an efficient estimator based on the observations generated by the process provides a consistent estimator of M''(θ).

A modification of the K-W process is developed for which the dependent random variables influence the direction, but not the magnitude, of the correction on each step. In addition to obtaining the asymptotic distribution of this process, an approximation for the small sample variance is given. An estimator of this variance is given which does not entail the estimation of unknown parameters.
The estimator is shown to be consistent. Finally, examples illustrating the use of these procedures are given.

BIOGRAPHY

The author was born September 30, 1943, in Weld County, Colorado. He was reared on an irrigated farm near Eaton, Colorado. He graduated from Eaton High School in 1956 and went directly to college. He received the Bachelor of Science degree with a major in zoology at Colorado State University in 1960. On receiving a grant from the National Institutes of Health, he studied experimental statistics in the graduate school of the same university for two years. In September, 1966, he entered the Graduate School of North Carolina State University at Raleigh to continue studying experimental statistics. He has accepted a position as a statistical consultant for the Upjohn Company in Kalamazoo, Michigan.

The author married Miss Martha McLendon Willey from Nashville, North Carolina, in 1968. As yet they have no children.

ACKNOWLEDGMENTS

The author wishes to thank his advisor, Dr. James Grizzle, who suggested the topic of this dissertation, for his freely given counsel throughout his doctoral program. He also wishes to thank Professors N. L. Johnson, Pranab K. Sen, Harry Smith, Jr., and H. R. van der Vaart for serving on his advisory committee and offering assistance during this study. The author wishes to thank Mrs. Gay Goss for typing the manuscript and to thank Mrs. Mary Donelan for assisting in the typing of the rough draft of this manuscript. Finally, the author wishes to thank his parents, Mr. and Mrs. Walter F. Johnson, for their inspiration throughout his academic career, and to thank his dearest wife, Martha, for her patience and her assistance in the proofreading of each manuscript and in the typing of the rough draft.

TABLE OF CONTENTS

LIST OF FIGURES . . . v

1. INTRODUCTION AND REVIEW OF LITERATURE . . . 1
   1.1 Introduction . . . 1
   1.2 Summary of Literature . . . 5
   1.3 Summary of Results . . . 7
2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_0^T . . . 9
   2.1 Constrained Stochastic Approximation Procedures . . . 9
   2.2 Constrained Random Variables . . . 11
   2.3 Asymptotic Distribution of a Type A_0^T Process . . . 14

3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS . . . 27
   3.1 Definition of the Constrained Kiefer-Wolfowitz Process . . . 27
   3.2 Asymptotic Properties of the CKW Process . . . 28
   3.3 Effect of Constraining the K-W Process . . . 34
   3.4 Expected Small Sample Behavior of the K-W Process . . . 38
   3.5 Approximations of the Small Sample Variance . . . 44
   3.6 Estimation of E(X_n - θ)² . . . 54

4. THE SIGNED KIEFER-WOLFOWITZ PROCESS . . . 57
   4.1 Definition of the Signed Kiefer-Wolfowitz Process . . . 57
   4.2 Asymptotic Theory of the SKW Process . . . 62
   4.3 Sample Properties of the SKW Process . . . 70
   4.4 Estimation of E(X_n - θ)² . . . 73

5. TWO EXAMPLES . . . 79

6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH . . . 88
   6.1 Summary . . . 88
   6.2 Suggestions for Further Research . . . 90

LIST OF REFERENCES . . . 91

LIST OF FIGURES

1. Graphical Definition of μ(c_n) . . . 3
2. Optimal Determination of c_n (= C) . . . 44
3. Graph of the Sequence {b_n} . . . 46
4. Graph of Reproduction versus Temperature . . . 80

CHAPTER 1. INTRODUCTION AND REVIEW OF LITERATURE

1.1 Introduction

The problem of finding the maximum of a response curve is often encountered in applied statistics. Two methods for locating this maximum have been developed extensively. The method of steepest ascent, which relies on multiple regression principles, is the most widely accepted. The second method is a sequential procedure. Heretofore, this sequential procedure has been developed mainly along asymptotic lines. We shall develop two modifications of this latter method with special emphasis given to the practical application of this method.
With this in mind, let x be a variable that can be controlled by the experimenter, and let Y(x) be an associated random variable such that E[Y(x)] = M(x). In other words, let M(x) be a function of x, and assume Y(x) is an observable random variable with a distribution function F(·|x) such that

    M(x) = ∫_{-∞}^{∞} y dF(y|x).

It will be assumed throughout this paper that M(x) has a unique maximum at x = θ and that M(x) is increasing or decreasing as x is less than or greater than θ.

The foundations of the method of estimation, which Kiefer and Wolfowitz [8] adapted for the particular problem at hand, are contained in a paper by Robbins and Monro [9]. The spirit of this estimation procedure is to generate a sequence of estimates {X_n} that converges to θ. The procedure is such that the n'th estimate (X_n) equals X_{n-1} plus a correction term that is a function of an observable random variable Y(X_{n-1}) depending on X_i for i = 1, ..., n-1.

These procedures have certain advantages and disadvantages when they are compared to the usual methods of response surface methodology. They are simple to execute and easy to evaluate. They are generally applicable in practice, and they admit a complete mathematical formulation. On the negative side, they have the disadvantage of most sequential procedures in that they require evaluation after each step. Moreover, their variance often depends upon parameters for which a good estimator cannot be calculated from the observations.

The stochastic approximation procedure proposed by Kiefer and Wolfowitz [8] was later called the Kiefer-Wolfowitz process. In this paper it will be referred to as the K-W process. Its definition follows.

Definition: Let {a_n} and {c_n} be sequences of positive numbers, and suppose Y(x) is a random variable with distribution function F(y|x) where

    M(x) = ∫_{-∞}^{∞} y dF(y|x).

Let X_1 be arbitrary, and let the random variable sequence {X_n} be defined by
Let Xl be arbitrary, and let the random -00 variable sequence {X } be defined by n l X + ~ X - a c- (Y(X - c ) - Y(X + c», n l n nn· n n n n (1.1) where Y(X - c ) and Y(X + c ) are random variables which, conditional n n n n .e 3 distributed independently as F(y!x n - c ) and F(yjx + c ). n n n Before summarizing the literature on this process, some motiva~ tion for the assumptions on the sequences {a } and {c } will be given. n n It follows from the definition that If x n - c > 8, then M(x - c ) - M(x nn n n x + c < n n- e, then M(x n + c ) is positive, and if n - c ) - M(x + c ) is negative. n n n Obviously, under suitable regularity assumptions on M(x), there exists a point that M(x n zero as x n ~(c n ) such - c ) - M(x + c ) is greater than, equal to, or less than n n n is greater than,equal to, or less than ~(c). See Figure 1 n for an illustration. I I I 1I I I I ~ (c n )-c Figure 1. n e T I I I I I I ~(c n ~(c ) Graphical Definition of ~(c n ) n )+c n 4 .e It is equally clear that if j.l(c ) exists, then /j.l(c ) - el < c , and n n - n that if M(x) is symmetric about e, then j.l(c ) n = e. If C tion tells us that X should tend to j.l(c) rather than e. n n ~ c, intui- Thus, if M(x) is not required to be symmetric about e, it is usually assumed that C n + O. Moreover, > E(x -8-2a 0/X =x ) 2 2 l l 2 > (x -8)-20 L a . l m=l m By induction 00 If La < m 00, it is possible to choose (xl-e) > 20 L a implying an asymp1 m totic bias; thus it is necessary to assume that La n = 00. 5 .e . Th east common 1 . assumpt~on is somewhat less intuitive. tend to zero. ~s that ~~a2c-2 < nn ~ 00. This assumption 2 -2 However, it is easy to see why a c must nn In fact, it follows from (1.1) and the inequalities = E(-ann c-l[(y(x -c) n n - M(x -c » n n - (y(x +c ) - M(x +c »])2~ n n n n These three assumptions have almost always been made in theorems implying X n that c e in + any usual mode of convergence. 
The exception is that c_n does not have to converge to zero if M(x) is symmetric about θ. The validity of these assumptions will always be assumed in the following review.

1.2 Summary of Literature

In their introduction of the K-W process, Kiefer and Wolfowitz [8] proved that X_n → θ in probability if M(x) is bounded by two quadratic functions, E(Y(x))² < ∞, and Σ a_n c_n < ∞. In later papers (J. R. Blum [1], D. L. Burkholder [2], and A. Dvoretzky [4]) the assumption that Σ a_n c_n < ∞ was dropped, and the conclusion was strengthened to that of convergence with probability one. Burkholder's theorem relaxes the assumption of quadratic bounds on M(x) with an assumption which implies for practical purposes that |M'(x)| > 0 if x ≠ θ. Burkholder also noted that this assumption admits response functions taking values on bounded value sets (e.g., exp(-x²)).

The first significant results on the asymptotic distribution of X_n were derived by Burkholder [2], who used methods suggested by K. L. Chung [3]. Burkholder established that the moments of n^{1/2 - w}(X_n - θ) converged to those of a central normal random variable if the sequences {a_n} and {c_n} are of the form a_n = O(n^{-1}) and c_n = O(n^{-w}), where w ≥ (4 + 2η)^{-1} and |μ(ε) - θ| = O(ε^{1+η}). He proved that, if the third derivative of M(x) is continuous, then η ≥ 1. Using a method developed by Hodges and Lehmann [7], he was able to drop the assumption of the quadratic bounds, but he had to weaken the conclusion to that of convergence in distribution.

Sacks [10] used a different approach to show that n^{1/3}(ln n)^{-1}(X_n - θ) was asymptotically normal under similar restrictions on the sequence {a_n} and the assumption of quadratic bounds on M(x). He did this by considering a slightly different class of sequences {c_n}.
He also proved that, in general, n^{1/3}(X_n - θ) is not asymptotically normal under the assumptions on the sequences {a_n} and {c_n} discussed in the introduction.

More recently, Fabian [6] proved a very general theorem about which he stated, ". . . [the theorem] implies in a simple way all known results on asymptotic normality in various cases of stochastic approximation." He used his result to obtain the asymptotic normality of his generalization of the K-W process (Fabian [5]), which can be constructed to obtain an asymptotic variance of order n^{-(1-ε)} for any ε > 0. Fabian achieves this smaller variance by taking observations at more than two points on each step. His results have interesting possibilities, especially if a very precise estimate of θ is needed. However, the K-W process seems to have the better small sample properties unless one has substantial a priori knowledge as to the location of θ.

J. H. Venter [12] has proposed that observations be taken at X_n - c_n, X_n, and X_n + c_n on each iteration and thereby estimate M''(θ). He noted that this estimate could be used to minimize the asymptotic variance of the process if a_n is of the form A n^{-1}. Again, this is mainly an asymptotic result and will not be considered here.

The only small sample results relevant to the K-W process seem to be those of B. Springer's [11] simulation study of a process proposed by Wilde [13]. This process is a modification of the K-W process in which the sign of Y(X_n - c_n) - Y(X_n + c_n) is used rather than the difference itself. In this study he used a geometric sequence for {a_n}, which obviously does not satisfy the usual assumption that Σ a_n diverge. He then unjustifiably criticized the process as a whole for the bias that he observed.

1.3 Summary of Results

The K-W process has been largely of theoretical interest, and the behavior of this process has been studied mainly through its asymptotic properties.
With the exception of some remarks by a few writers, little knowledge of use to the consulting statistician can be gleaned from a review of the literature. It is the intent of this paper to fill this gap.

Chapter 2 will be devoted to a generalization of Burkholder's [2] theorem on the asymptotic distribution of a certain class of stochastic approximation procedures. These extensions will admit sequences {a_n} of the form O(n^{-α}) where 1/2 < α ≤ 1, admit the use of constrained random variables (to be defined later), and allow the experimenter to take an increasing number of observations at each step. The usefulness of these extensions will be discussed in the following chapters as they arise.

In the third chapter, a generalization of the K-W process will be proposed and its asymptotic properties will be obtained from the results of Chapter 2. Approximate bounds will be given for the small sample variance of the process when M(x) is a quadratic function, and this result will be used to obtain approximate bounds in cases of more general functions. The effects of the response function, the value of α, the variance of Y(x), and the starting value (X_1) on the mean square error will be studied in some detail.

In the fourth chapter we shall develop the modification of the K-W process proposed by Wilde [13]. Various asymptotic and small sample properties will be established. In addition, a method of estimating the variance of the estimator will be developed. In the final chapter two examples will be given to illustrate some of the results in the third and fourth chapters.

CHAPTER 2. STOCHASTIC APPROXIMATION PROCEDURES OF TYPE A_0^T

2.1 Constrained Stochastic Approximation Procedures

In this chapter we shall extend Burkholder's [2] Theorem 3 on the asymptotic normality of a class of stochastic approximation procedures. This extension will improve the practicality and the flexibility of these procedures. The main discussion of the usefulness of this extension will be made in Chapter 3 and in Chapter 4, in which more concrete examples will be examined.

In the development that follows, F^{[r]}(·) will denote the distribution function of the mean of [r] ([r] denotes the greatest integer less than or equal to r) independent random variables with distribution function F(·). We shall denote by Y^T the random variable defined by

    Y^T = T if Y > T;  Y if -T ≤ Y ≤ T;  -T if Y < -T,

where Y is any random variable. The random variable Y^T will be called a constrained random variable, and if R(Y) is any function or functional of Y, then R^T(Y) will denote the corresponding function or functional of Y^T. (E.g., if M(x) = E(Y(x)), then M^T(x) = E(Y^T(x)).) When it is possible to do so without confusion, the random variable sequence {Y_n^{T_n}} defined by

    Y_n^{T_n} = T_n if Y_n > T_n;  Y_n if -T_n ≤ Y_n ≤ T_n;  -T_n if Y_n < -T_n

will be denoted by {Y_n^T}. We shall say that a function sequence {D_n(x)} is continuously convergent to A at x = θ if for every ε > 0 there exist a δ(ε) > 0 and an N(ε) > 0 such that if |x - θ| < δ and n > N(ε), then |D_n(x) - A| < ε. The usual symbols O(·) and o(·) will be used to indicate the order of convergence.

Definition (2.1): Let {a_n'} and {T_n} be sequences of positive numbers such that if T_n ≠ ∞ for every n greater than some n_0, then inf T_n > 0. Let Z_n(x) denote a random variable with distribution function G_n(·|x). Let X_1 be fixed, and define the random variable sequence {X_n} by

    X_{n+1} = X_n - a_n' Z_n^T(X_n),    (2.1)

where Z_n^T(x) is a random variable with conditional distribution G_n^T(·|x) given Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_{n-1}, and X_n = x. The stochastic approximation sequence {X_n} will be said to be of Type A_0^T.

This definition is, in fact, Burkholder's [2] definition of a Type A_0 process.
However, the superscript T is added to emphasize the fact that, although the sequence {X_n} depends on constrained random variables, it is usually more reasonable to make assumptions on the unconstrained random variables.

In order to get an appreciation for this class of processes and for some functions that will be defined later, we shall derive a few well-known results for the K-W process. That the K-W process is of Type A_0^T can be seen by letting a_n' = a_n c_n^{-1}, T_n ≡ ∞, Z_n^T(x) = Y(X_n - c_n) - Y(X_n + c_n), and R_n(x) = E(Z_n(x)). Then R_n(x) = M(x - c_n) - M(x + c_n), and, as was noted in the introduction, under suitable regularity assumptions there exists a sequence {μ_n} such that R_n(μ_n) = M(μ_n - c_n) - M(μ_n + c_n) = 0, and (x - μ_n) R_n(x) > 0 for x ≠ μ_n. By considering the Taylor series expansion of R_n(x) about μ_n, we can show that

    R_n(x) = -2 c_n (x - μ_n) M''(μ_n) + o(c_n (x - μ_n)).

Letting D_n(x) = R_n(x)/c_n(x - μ_n) for x ≠ μ_n, we find that the function sequence {D_n(x)} is continuously convergent to -2M''(θ) at x = θ since |μ_n - θ| → 0 as c_n → 0.

2.2 Constrained Random Variables

The following lemmas establish certain relationships between constrained and unconstrained random variables that are used in Theorem (2.1).

Lemma (2.1): Let Y be a random variable with finite variance (σ²). Then Var(Y^T) ≤ σ².

Proof: Without loss of generality assume that E(Y) = 0. Define the random variable Y' by Y' = T if Y > T and Y' = Y if Y ≤ T. Since T > 0, |Y'| ≤ |Y| pointwise, so that E(Y'²) ≤ E(Y²) = σ². Applying the same argument to the truncation of Y' at -T gives E((Y^T)²) ≤ E(Y'²), and therefore

    Var(Y^T) ≤ E((Y^T)²) ≤ σ².

Lemma (2.2): Let {T_n} and {r_n} be sequences satisfying Definition (2.1), and let {Z_n(x)} denote a sequence of random variables with distribution functions {F^{[r_n]}(·|x)} such that

    R(x) = ∫_{-∞}^{∞} z dF(z|x).

Assume that there exist a finite number B and an even integer K ≥ 2 such that

    sup_x ∫_{-∞}^{∞} (z - R(x))^K dF(z|x) ≤ B.

Then for 0 < C < 1 there exist positive constants D_1, D_2, and D_3 such that

a. if |R(x)| ≥ C T_n, then

    R_n^{T_n}(x) sgn(R(x)) ≥ D_1 T_n - D_2 r_n^{-K/2} T_n^{-(K-1)};

b. and if |R(x)| < C T_n, then

    |R(x) - R_n^{T_n}(x)| ≤ D_3 r_n^{-K/2} T_n^{-(K-1)}.

Proof: Case (a): Without loss of generality, assume R(x) > 0. Choose C' such that 0 < C' < C. We have then

    E(Z_n^T(x)) = -T_n P(Z_n(x) < -T_n) + ∫_{-T_n}^{T_n} z dF^{[r_n]}(z|x) + T_n P(Z_n(x) > T_n)
                ≥ -T_n P(Z_n(x) < (C - C')T_n) + (C - C')T_n P(Z_n(x) ≥ (C - C')T_n)
                = (C - C')T_n - (1 + C - C')T_n P(Z_n(x) < (C - C')T_n).

But since R(x) ≥ C T_n we have

    P(Z_n(x) < (C - C')T_n) ≤ P(|Z_n(x) - R(x)| > C' T_n)
                            ≤ (C' T_n)^{-K} E(Z_n(x) - R(x))^K
                            ≤ (C' T_n)^{-K} r_n^{-K/2} E(Z(x) - R(x))^K.

We have then

    E(Z_n^T(x)) ≥ (C - C')T_n - (1 + C - C')(C')^{-K} B T_n^{-(K-1)} r_n^{-K/2}.

The desired result is obtained by letting D_1 = (C - C') and D_2 = (1 + C - C')(C')^{-K} B.

Case (b): Again we can assume R(x) > 0. We have that

    |R(x) - R_n^{T_n}(x)| ≤ (|R(x)| + T_n) P([Z_n(x) < -T_n] ∪ [Z_n(x) > T_n]) + ((1 - C)T_n)^{-(K-1)} ∫_{-∞}^{∞} |z - R(x)|^K dF^{[r_n]}(z|x),

since |R(x)| < C T_n < T_n. Again using Chebyshev bounds, we see that

    |R(x) - R_n^{T_n}(x)| ≤ (1 + C)T_n ((1 - C)T_n)^{-K} B r_n^{-K/2} + ((1 - C)T_n)^{-(K-1)} B r_n^{-K/2}.

The desired result is obtained by letting D_3 = 2B(1 + C)(1 - C)^{-K}.

2.3 Asymptotic Distribution of a Type A_0^T Process

We shall now give an extension of Burkholder's [2] Theorem 3. It is helpful to recall the discussion on the K-W process preceding Lemma (2.1) and to bear in mind that the assumptions are based on the unconstrained random variables even though the theorem concerns a Type A_0^T process. Before giving Theorem (2.1) we shall give a theorem by Burkholder [2] and a theorem by Fabian [6] which play integral parts in the proof of Theorem (2.1).
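Before turning to those theorems, the two constructions of Section 2.2, the constraining operator Y ↦ Y^T and the replacement of one observation by the mean of [r] observations (the distribution F^{[r]}), can be sketched in code. The normal distribution below is an assumed example of F; the final checks illustrate Lemma (2.1) (constraining cannot increase the variance) and the variance reduction obtained by averaging.

```python
import random

def constrain(y, t):
    """The constrained variable Y^T: Y clipped to the interval [-T, T]."""
    return max(-t, min(t, y))

def mean_of_r(draw, r):
    """One draw from F^{[r]}: the mean of [r] independent draws from F."""
    return sum(draw() for _ in range(r)) / r

def sample_var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(1)
draw = lambda: rng.gauss(0.0, 1.0)      # assumed example distribution F

ys = [draw() for _ in range(10000)]
ys_constrained = [constrain(y, 1.5) for y in ys]

# Lemma (2.1): Var(Y^T) <= Var(Y).  Constraining shrinks every pairwise
# distance, so the inequality also holds for sample variances of a fixed sample.
assert sample_var(ys_constrained) <= sample_var(ys)

# Averaging [r] observations scales the variance of a single draw by 1/[r].
zbars = [mean_of_r(draw, 25) for _ in range(2000)]
assert sample_var(zbars) < 0.1          # roughly 1/25, versus roughly 1 above
```

These two operations are exactly what distinguish a Type A_0^T process with G_n(·|x) = G^{[n^p]}(·|x) from Burkholder's original Type A_0 process.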
Theorem (Burkholder [2]): Suppose {X_n*} is a stochastic approximation process of Type A_0 and θ is a real number such that

(1) there is a function Q from the positive real numbers into the positive integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) > 0;

(2) sup_{n,x} (|R_n*(x)|/(1 + |x|)) < ∞;

(3) sup_{n,x} Var(Z_n*(x)) < ∞;

(4) if 0 < δ_1 < δ_2 < ∞, then Σ a_n* ( inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n*(x)| ) = ∞;

(5) Σ (a_n*)² < ∞;

(6) if n is a positive integer, then R_n*(x) and Var(Z_n*(x)) are Borel measurable.

Then P(lim X_n* = θ) = 1.

It will be useful to note that condition (1) can be replaced by the condition that

(1') there is a function Q from the positive real numbers into the positive integers such that if ε > 0, |x - θ| > ε, and n > Q(ε), then (x - θ) R_n*(x) > ε_n(x) + δ_n(x), where ε_n(x) ≥ 0 and Σ a_n* sup_x |δ_n(x)| < ∞.

The following theorem is a one-dimensional version of Fabian's [6] theorem on asymptotic normality. In the theorem let R denote the real numbers. Let P denote the set of all measurable transformations from the measurable space (Ω, S) to R. Let X[A] denote the indicator function of the set A, and, unless indicated otherwise, all convergence notions will mean convergence almost everywhere.

Theorem (Fabian): Suppose {S_n} is a non-decreasing sequence of σ-fields, S_n ⊂ S; U_n, V_n, T_n, Γ_n, Φ_n ∈ P; Γ, Φ, T, Σ ∈ R with Γ > 0; α, β ∈ R; and Γ_n, Φ_{n-1}, V_{n-1} are S_n-measurable. Suppose that

    Γ_n → Γ, Φ_n → Φ, and either T_n → T or E|T_n - T| → 0,    (2.2)

    E(V_n | S_n) = 0 and C ≥ |E(V_n² | S_n) - Σ²| → 0,    (2.3)

and, with σ_{j,r}² = E(X[V_j² ≥ r j^α] V_j²),

    lim_n n^{-1} Σ_{j=1}^n σ_{j,r}² = 0 for every r > 0 if α = 1, and lim_j σ_{j,r}² = 0 for every r > 0 if α < 1.    (2.4)

Suppose further that

    0 < α ≤ 1, 0 ≤ β, β_+ = β if α = 1, β_+ = 0 if α < 1, and β_+ < 2Γ,    (2.5)

and

    U_{n+1} = (1 - n^{-α} Γ_n) U_n + n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.    (2.6)

Then the asymptotic distribution of n^{β/2} U_n is normal with mean (Γ - β_+/2)^{-1} T and variance Σ² Φ²/(2Γ - β_+).

Theorem (2.1): Let {X_n} be a stochastic approximation process of Type A_0^T with E Z_n(x) = R_n(x) and G_n(·|x) = G^{[n^p]}(·|x). Suppose that {μ_n} is a real number sequence, {c_n} is a positive number sequence, θ is a real number, p, w, v, and ξ are non-negative numbers, and each of n_0, A, Λ, C, σ², and α is a positive number such that

i. {R_n(x)} is a function sequence such that if n > n_0, then R_n(μ_n) = 0, and if x ≠ μ_n, then (x - μ_n) R_n(x) > 0;

ii. sup_{n,x} n^w |R_n(x)|/(1 + |x|) < ∞;

iii. if 0 < δ_1 < δ_2 < ∞, then Σ a_n' inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| = ∞, and if T_n ≠ ∞ for all n > n_0, then lim_n ( n^w inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ) exists;

iv. {D_n(x)} is a function sequence defined by D_n(x) = R_n(x)/c_n(x - μ_n) if x ≠ μ_n and D_n(μ_n) = Λ*, and {D_n(x)} is continuously convergent to Λ at x = θ (the definition of D_n(μ_n) is arbitrary except for the restriction that D_n(μ_n) → Λ);

v. sup_x Var(Z(x)) < ∞, Var(Z(x)) is continuous and converges to σ² at x = θ, there exists a finite open interval (I), containing θ, such that for every t > 0

    lim_{j→∞} E{X[(Z(x) - R(x))² ≥ t j^α] (Z(x) - R(x))²} = 0

uniformly for x ∈ I, and if T_n ≠ ∞ for every n > n_0, then sup_x E(|Z(x) - R(x)|^r) < ∞ for r an even integer ≥ 2;

vi. G(z|·) is Borel measurable for all z;

vii. (a) 1/2 < α ≤ 1, |μ_n - θ| = O(n^{-v}), n^{α-w} a_n' → A/C,
(rp/2 + (r-1)ξ) > w + (1 - α), and β/2 = α/2 + p/2 - w < v;

(b) if p/2 < w, then p > 1 - 2(α - w);

(c) if T_n ≠ ∞ for every n > n_0, then β/2 < (rp/2 + (r-1)ξ - w);

(d) β_+ = 0 if α < 1, and β_+ = β < 2AΛ if α = 1.

Then X_n → θ a.e., and n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/(C²(2AΛ - β_+))).

Proof: First we shall use Burkholder's theorem to show X_n → θ a.e. by considering two cases.

Case I: p/2 < w. Let a_n* = n^{-p/2} a_n' and Z_n*(x) = n^{p/2} Z_n^T(x), so that R_n*(x) = n^{p/2} R_n^{T_n}(x). Burkholder has shown that condition (i) implies that, for every ε > 0, there exists a Q(ε) such that if |x - θ| > ε, then (x - θ) R_n(x) > 0 for every n > Q(ε). Thus if T_n = ∞, then

    (x - θ) R_n*(x) = (x - θ) n^{p/2} R_n(x) > 0.

If T_n ≠ ∞ for every n > n_0, then by Lemma (2.2) (with [r_n] = [n^p]) the difference between R_n^{T_n}(x) and R_n(x) is O(n^{-rp/2 - (r-1)ξ}) max{D_2, D_3}, so that condition (1') holds. Let

    L = lim_n ( n^w inf_{δ_1 ≤ |x-θ| ≤ δ_2} |R_n(x)| ).

This limit exists and is finite by (ii) and (iii). If L > 0, then, since n^{α-w} a_n' → A/C, w ≥ 0, and α ≤ 1, the series in condition (4) diverges. If L = 0, then |R_n(x)| < C T_n on δ_1 ≤ |x - θ| ≤ δ_2 for sufficiently large n, since lim inf n^{-ξ} T_n > 0. Thus, by (b) of Lemma (2.2) and the fact that (rp/2 + (r-1)ξ) > w + (1 - α) ≥ 0, the divergence of the series follows from the divergence for the case T_n = ∞. Condition (4) therefore holds in all cases. Condition (5) follows from (vii.(b)). Finally, Burkholder [2] has shown that if G(z|·) is Borel measurable, then G^{[n^p]}(z|·) is Borel measurable, and if sup_x Var(Z(x)) < ∞, then the measurability of R(x) and Var Z(x) is implied by that of G(z|·). Since the measurability of G^{[n^p]}(z|·) clearly implies the same for the distribution function of n^{p/2} Z_n^T(x), the measurability of R_n*(x) and Var Z_n*(x) follows, and sup_{x,n} Var Z_n*(x) < ∞. Thus conditions (2), (3), and (6) are satisfied. Therefore, X_n → θ a.e.

Case II: p/2 ≥ w. Let a_n* = a_n' c_n and Z_n*(x) = c_n^{-1} Z_n^T(x). Conditions (1), (2), (3), (4), and (6) are established as in Case I. Condition (5) follows from (vii.(a)).

Thus, in either case, X_n → θ a.e.

We shall now establish the asymptotic normality of n^{β/2}(X_n - θ) using Fabian's theorem. From the definition of {X_n} we have

    X_{n+1} - θ = X_n - θ - a_n' R_n^T(X_n) - a_n'(Z_n^T(X_n) - R_n^T(X_n))
                = X_n - θ - a_n' R_n(X_n) - a_n'((R_n^T(X_n) - R_n(X_n)) + (Z_n^T(X_n) - R_n^T(X_n))).

But by conditions (i) and (iv),

    R_n(x) = c_n (x - μ_n) D_n(x) = c_n (x - θ) D_n(x) + c_n (θ - μ_n) D_n(x),

and so

    X_{n+1} - θ = (1 - a_n' c_n D_n(X_n))(X_n - θ) - a_n'(Z_n^T(X_n) - R_n^T(X_n)) - a_n'((R_n^T(X_n) - R_n(X_n)) + c_n (θ - μ_n) D_n(X_n)).    (2.7)

Make the correspondence with Fabian's theorem as follows: U_n = X_n - θ, Γ_n = n^α a_n' c_n D_n(X_n), Φ_n = n^{(α+β)/2} [n^p]^{-1/2} a_n', V_n = -[n^p]^{1/2} (Z_n^T(X_n) - R_n^T(X_n)), and

    T_n = -n^{α+β/2} a_n' ((R_n^T(X_n) - R_n(X_n)) + c_n D_n(X_n)(θ - μ_n)).

(Note the possible confusion of T_n(X) with the values at which Z_n^T(x) is constrained.)

Let S_n be the σ-field generated by X_1, Z_1^T, ..., Z_{n-1}^T. Clearly this is the same σ-field generated by Z_1^T, ..., Z_{n-1}^T, X_1, ..., X_n. Moreover, S_n is increasing, and Φ_{n-1} and V_{n-1} are measurable with respect to S_n. Since R_n^T(X_n) = E(Z_n^T(X_n)|X_n), R_n(X_n) is measurable with respect to S_n, implying the same for D_n(X_n) and therefore Γ_n.

To establish (2.2) we recall that X_n → θ a.e. Since D_n(x) is continuously convergent to Λ at x = θ, it follows that D_n(X_n) → Λ a.e., and therefore, by conditions (iv) and (vii.(a)), Γ_n → AΛ a.e. Again by (vii.(a)), Φ_n → A/C since β/2 = α/2 + p/2 - w. Finally, since a_n' = O(n^{-(α-w)}), the first part of T_n(X) tends to zero a.e.; that the second part tends to zero follows from the fact that |θ - μ_n| = O(n^{-v}) and β/2 < v. Thus T_n → 0 a.e.

We have E(V_n|S_n) = E(E(V_n|X_n)|S_n) = E(0|S_n) = 0. If T_n = ∞, then there exists a C > 0 such that C ≥ |E(V_n²|S_n) - σ²| → 0 a.e. by (v). If T_n ≠ ∞ for every n > n_0, we have from (vii.(a)) that rp/2 + (r-1)ξ - w - (1-α) > 0, and thus |E(V_n²|S_n) - σ²| → 0 a.e. in either case. This establishes (2.3).
To show that lim_j σ_{j,t}² = 0 for every t > 0, let I be defined as in (v). Since E(V_j|S_j) = 0,

    σ_{j,t}² ≤ sup_x E(V_j²) ≤ sup_x Var Z(x) < ∞

by Lemma (2.1) and condition (v). Letting D denote this last upper bound, we obtain

    σ_{j,t}² = E{X[V_j² ≥ t j^α, X_j ∈ I] V_j²} + E{X[V_j² ≥ t j^α, X_j ∉ I] V_j²}
             ≤ E{X[V_j² ≥ t j^α, X_j ∈ I] V_j²} + D P(X_j ∉ I).

The first term goes to zero by (v), and the last term goes to zero since P(X_j ∈ I) → 1. Thus condition (2.4) is satisfied.

Finally, by noting that Γ = AΛ and β_+ < 2AΛ if α = 1, it follows that n^{β/2}(X_n - θ) is asymptotically normal (0, σ²A²/(C²(2AΛ - β_+))).

Although Theorem (2.1) does not extend the conclusions of Fabian's theorem to a larger class of stochastic approximation procedures, the assumptions of Theorem (2.1) are more related to the conditions of an experiment in which the use of these procedures may be attractive. Theorem (2.1) also has the added attraction of removing the implied assumption that X_n → θ a.e. and including it in the conclusion of the theorem.

Theorem (2.1) extends Burkholder's [2] Theorem 3 in three directions. Firstly, it increases the possible values for α from the single value 1 to the interval (1/2, 1]. Since the constant a_n (= A n^{-α}) can be thought of as the weight assigned to the n'th observation, this wider range of values for α gives the experimenter more flexibility in weighting the observations. This greater flexibility often enables the experimenter to decrease the mean squared error during the early stages of the search for θ. Secondly, it allows the use of constrained random variables. An example in Chapter 3 points out when such constraining can improve the behavior of the K-W process. And, thirdly, it allows an increasing number of observations to be taken on each step.
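The three extensions just described can be collected into one sketch by modifying the basic K-W update. Everything here, the response function, the noise, and the tuning constants α, w, p, and T, is an illustrative assumption rather than a prescription from the text: gains a_n = A n^{-α} with 1/2 < α ≤ 1, the observed difference constrained to [-T, T], and the mean of [n^p] observations taken at each of the two points on step n.

```python
import random

def extended_kw(m, noise_sd, x1, n_steps, A=1.0, C=1.0, alpha=0.8, w=0.25,
                p=0.5, T=3.0, seed=0):
    """K-W update with the three extensions of Theorem (2.1):
    (1) alpha may lie anywhere in (1/2, 1], not only alpha = 1;
    (2) the observed difference is constrained to [-T, T];
    (3) [n^p] observations are averaged at each of the two points."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        a_n, c_n = A * n ** (-alpha), C * n ** (-w)
        r_n = max(1, int(n ** p))            # [n^p] observations per point
        y_minus = sum(m(x - c_n) + noise_sd * rng.gauss(0.0, 1.0)
                      for _ in range(r_n)) / r_n
        y_plus = sum(m(x + c_n) + noise_sd * rng.gauss(0.0, 1.0)
                     for _ in range(r_n)) / r_n
        diff = max(-T, min(T, y_minus - y_plus))   # the constrained difference
        x = x - (a_n / c_n) * diff
    return x

M = lambda x: 5.0 - (x - 2.0) ** 2           # illustrative response, maximum at 2
est = extended_kw(M, noise_sd=0.5, x1=0.0, n_steps=2000)
```

With p > 0 the total observation count after n steps grows like n^{p+1} while the variance of X_n is of order n^{-β} with β/2 = α/2 + p/2 - w, which is the bookkeeping between steps and observations that the text develops next.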
Again, an example in Chapter 3 describes situations in which, by taking more observations on each step, we can obtain the same variance as by taking the same number of observations using more steps. The practical importance of decreasing the number of times the experimenter must obtain observations is quite clear.

Due to the generality of Theorem (2.1), it is difficult to comment on the practical consequences of each assumption, with the exception of assumptions (v) and (vi). It is sufficient to guarantee assumption (v) that our observations be bounded. Since this can almost always be assumed in practice, assumption (v) is not practically restrictive for any finite r. Thus if either ξ or p is greater than zero, the assumptions that rp/2 + (r-1)ξ > w + (1-α) and β/2 < (rp/2 + (r-1)ξ - w) are almost always satisfied. Assumption (vi) seems never to be questioned in practice.

Theorem (2.1) expresses the order of the variance in terms of the number of steps rather than the number of observations. Let N(n) be the total number of observations necessary to obtain X_n. Then

    N(n) = Σ_{m=1}^{n-1} m^p = O(n^{p+1}).

If γ is such that (N(n))^γ = n^{β/2}, then (p+1)γ = α/2 + p/2 - w, or

    γ = (α/2 + p/2 - w)/(p + 1).

We shall discuss the value of γ for the K-W process in the next chapter, and shall only note here that w = 0 for the Robbins-Monro [9] process, giving γ = (α + p)/(2(p + 1)).

CHAPTER 3. THE CONSTRAINED KIEFER-WOLFOWITZ PROCESS

3.1 Definition of the Constrained Kiefer-Wolfowitz Process

The Kiefer-Wolfowitz process is not generally accepted as a practical procedure for seeking a maximum in experimental work. The basic reasons for this are: its sequential nature, insufficient knowledge of its small sample properties, lack of a good estimate of its variance, and an unfamiliarity of experimental statisticians with its attributes. We shall attempt to relieve these deficiencies somewhat.

Following the definition of the Constrained Kiefer-Wolfowitz process, we shall obtain, under certain mild restrictions, its asymptotic distribution as a corollary to Theorem (2.1). Various examples will be given to study the effect of the parameters of the process and to illustrate its small sample behavior. Finally, bounds will be obtained for expressions related to the small sample variance, and a method of estimating this variance will be discussed.

Definition (3.1): Let {a_n}, {c_n}, {r_n}, and {T_n} denote sequences of positive numbers such that [r_n] ≥ 1 and, if T_n ≠ ∞ for all n greater than some n_0, then inf T_n > 0. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let X_1 be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n - a_n c_n^{-1} (Y(X_n - c_n) - Y(X_n + c_n))^{T_n},    (3.1)
Following the definition of the Constrained Kiefer-Wolfowitz process, we shall obtain, under certain mild restrictions, its asymptotic distribution as a corollary to Theorem (2.1). Various examples will be given to study the effect of the parameters of the process and to illustrate its small sample behavior. Finally, bounds will be obtained for expressions related to the small sample variance, and a method of estimating this variance will be discussed.

Definition (3.1): Let {a_n}, {c_n}, {r_n}, and {T_n} denote sequences of positive numbers such that [r_n] ≥ 1 and, if T_n < ∞ for all n greater than some n₀, then inf_n T_n > 0. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let X₁ be fixed, and let the random variable sequence {X_n} be defined by

X_{n+1} = X_n − a_n c_n^{-1}(Y(X_n − c_n) − Y(X_n + c_n))^{T_n},  (3.1)

where Y(X_n − c_n) and Y(X_n + c_n) are random variables which, conditional on Y(X₁ − c₁), …, Y(X_{n−1} + c_{n−1}), X₁, …, X_{n−1}, X_n = x_n, are distributed independently as F^{[r_n]}(y|x_n − c_n) and F^{[r_n]}(y|x_n + c_n). This will be called the Constrained Kiefer-Wolfowitz process, and it will be referred to as the CKW process.

The CKW process is a generalization of the K-W process. The CKW process bounds the difference (Y(X_n − c_n) − Y(X_n + c_n)) by ±T_n, and it allows the experimenter to vary the number of observations on each step. It clearly reduces to the K-W process when r_n ≡ r ≥ 1 and T_n ≡ ∞.

3.2 Asymptotic Properties of the CKW Process

We shall now examine certain asymptotic properties of the CKW process through the following corollary to Theorem (2.1). In reviewing the corollary, it will be helpful to bear in mind that the assumptions are based on the unmodified random variables.
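Before turning to the corollary, the recursion (3.1) is easy to simulate. The following minimal sketch runs one CKW sequence for a quadratic response with normal noise; the particular response, noise level, and parameter values are illustrative assumptions, not part of the definition.

```python
import random

def ckw_run(theta=0.0, B=1.0, sigma=1.0, x1=5.0, n_steps=4000,
            A=1.0, C=1.0, alpha=1.0, omega=0.25, T=float("inf"),
            r=1, seed=0):
    """One run of the CKW recursion (3.1) with a_n = A n^-alpha,
    c_n = C n^-omega, r responses averaged at each level, and the
    response difference constrained to [-T, T]."""
    rng = random.Random(seed)
    M = lambda x: -(B / 2.0) * (x - theta) ** 2        # illustrative M(x), (3.14)
    def y(x):                                          # mean of r noisy responses
        return sum(M(x) + rng.gauss(0.0, sigma) for _ in range(r)) / r
    x = x1
    for n in range(1, n_steps + 1):
        a_n = A * n ** (-alpha)
        c_n = C * n ** (-omega)
        z = y(x - c_n) - y(x + c_n)
        z = max(-T, min(T, z))                         # constrain by +-T_n
        x -= a_n / c_n * z                             # recursion (3.1)
    return x

estimate = ckw_run()
```

With T = ∞ and r = 1 this is the ordinary K-W process; a finite T gives the constrained version.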
Corollary (3.1): Let {X_n} be a CKW process with r_n = n^{2aω}. Let θ be a real number, let a and b be non-negative numbers, and let each of n₀, A, C, σ², α, ω, and Λ be positive numbers such that:

a. M'(x) is continuous, (x − θ)M'(x) < 0 if x ≠ θ, and sup_x (|M'(x)|/(1 + |x|)) < ∞;

b. the third derivative of M(x) exists and is continuous for x in an open interval (I) containing θ, and M''(θ) = −Λ/2;

c. Var Y(x) = σ²(x) is bounded and continuous, there exists an open interval (I) containing θ such that, for any t > 0,

lim_{j→∞} E{χ[(Y(x) − M(x))² > t j^α](Y(x) − M(x))²} = 0

uniformly for x ∈ I, and, if T_n = +∞ for every n > n₀, then sup_x E(|Y(x) − M(x)|^r) < ∞ for r an even integer ≥ 2;

d. F(y|·) is Borel measurable for all y;

e. (1) 1/2 < α ≤ 1, n^α a_n → A, n^ω c_n → C, lim inf n^{−bω} T_n > 0, (ra + (r − 1)b) > 1 + (1 − α)/ω, and β/2 = α/2 + (a − 1)ω < 2ω;
(2) if α < 1, then a > 1 − (2α − 1)/(2ω);
(3) if T_n = +∞ for every n > n₀, then β/2 < (ra + (r − 1)b − 1)ω;
(4) and β₊ = 0 if α < 1, β₊ = β < 2AΛ if α = 1.

Then X_n → θ almost everywhere, and n^{β/2}(X_n − θ) is asymptotically normal (0, 2A²σ²/(C²(2AΛ − β₊))).

The basic ideas of Corollary (3.1) have already been noted by Burkholder [2]. However, since Corollary (3.1) is an extension of his work, and since it is instructive to do so, we shall give a proof of this corollary.

Proof: Make the correspondence with Theorem (2.1) as follows:

a'_n = a_n c_n^{-1}, Z_n(x) = Y(X_n − c_n) − Y(X_n + c_n), and R_n(x) = M(x − c_n) − M(x + c_n).

Condition (a) implies that M(x) is continuous, and that M(x) is strictly increasing or decreasing as x is less than or greater than θ, respectively. Thus, there exists a sequence {η_n} such that (x − η_n)R_n(x) > 0 for x ≠ η_n. From conditions (a) and (e.(1)) we obtain that

sup_{n,x} n^ω |R_n(x)|/(1 + |x|) = sup_{n,x} n^ω c_n O(|M'(x)|)/(1 + |x|) < ∞.

This is condition (ii). To verify condition (iii), we note that, by (a) and (e.(1)),
Σ_n a'_n inf_{δ₁≤|x−θ|≤δ₂} |R_n(x)| = Σ_n a_n inf_{δ₁≤|x−θ|≤δ₂} O(|M'(x)|) = ∞,

since Σ_n a_n = ∞. Moreover, lim_n inf_{δ₁≤|x−θ|≤δ₂} |R_n(x)| > 0 and η_n = θ + o(1), by the above argument and the fact that c_n → 0.

The function sequence {D_n(x)} in condition (iv) is defined by

D_n(x) = c_n^{-1}(x − η_n)^{-1}(M(x − c_n) − M(x + c_n)) if x ≠ η_n,

using condition (b). For x ∈ I we have, since R_n(η_n) = 0,

R_n(x) = M(x − c_n) − M(x + c_n)
 = (x − η_n)(M'(η_n − c_n) − M'(η_n + c_n)) + O((x − η_n)²)(M''(ξ₁ − c_n) − M''(ξ₂ + c_n))
 = −2c_n(x − η_n)M''(η_n) + O(c_n(x − η_n)),

where |ξ_i − x| < |η_n − x|, i = 1, 2, for x ∈ I. Since the third derivative exists and D_n(x) → −2M''(θ) = Λ at x = θ, it follows that for every ε > 0 there exists a δ > 0 and an N such that, for |x − θ| < δ and n > N, |D_n(x) − Λ| < ε. Thus condition (iv) holds.

Condition (v) is a direct consequence of (c) with Var Z(x) = 2σ².

Finally, Burkholder has noted that condition (d) implies (vi). He has also shown that the existence of a continuous third derivative in an open interval containing θ implies that |η_n − θ| = O(c_n²) = O(n^{−2ω}). Letting ν = 2ω, ρ = 2aω, and ζ = bω, condition (vii) can be verified using simple arithmetic. Thus the corollary is proven.

Except in a few pathological cases, an experimenter can assume that the derivatives of M(x) exist, that the r'th central moment of Y(x) exists and is continuous, and that F(y|x) is a continuous function of x for every y. Moreover, it is a rare case when he cannot specify a finite interval which should contain θ. By restricting X_n to this interval, all the assumptions of Corollary (3.1) hold for r arbitrarily large. If the above assumptions are satisfied, condition (e) is equivalent to

e'. 1/2 < α ≤ 1, n^α a_n → A, n^ω c_n → C, a + b > 0, β/2 = α/2 + (a − 1)ω < 2ω, and β₊ = 0 if α < 1, β₊ = β < 2AΛ if α = 1.
Thus, except for assumption (e'), the assumptions of Corollary (3.1) should not be looked upon as restrictions, but rather as factors influencing the behavior of the process. Since in experimental work it can usually be assumed that the supremum over x of any central moment of Y(x) is finite (at least over an interval known to contain θ), the assumptions that (ra + (r − 1)b) > 1 + (1 − α)/ω and α/2 + (a − 1)ω < (ra + (r − 1)b − 1)ω are generally not restrictive. Practically speaking, then, the constant a must satisfy

α/2 + (a − 1)ω < 2ω  (3.2)

and 1 − (2α − 1)/(2ω) < a, or, equivalently,

1 − (2α − 1)/(2ω) < a < 3 − α/(2ω).  (3.3)

As in Chapter 2, let N(n) denote the number of observations required to obtain X_n, and let γ be defined by (N(n))^γ = n^{β/2}. By (2.8),

γ = (α/2 + (a − 1)ω)/(2aω + 1).  (3.4)

Thus, if a = 0 (i.e., the same number of observations are taken on each step), γ = α/2 − ω. Since (3.2) implies

ω > α/(2(3 − a)),  (3.5)

we obtain, for a = 0, γ = β/2 < α/2 − α/6 = α/3.

Suppose a > 0. We have

(2aω + 1)² dγ/dω = (2aω + 1)(a − 1) − (α/2 + (a − 1)ω)2a = a(1 − α) − 1.  (3.6)

It follows that dγ/dω is greater than, equal to, or less than zero as a is greater than, equal to, or less than (1 − α)^{-1}.

Case I: a < (1 − α)^{-1}. In this case γ is maximized by minimizing ω. Thus by (3.4) and (3.5),

γ < (α/2 + (a − 1)α/(2(3 − a)))/(aα/(3 − a) + 1) = α/(3 − a(1 − α)) ≤ 1/3.  (3.7)

Equality in the last inequality requires α = 1; for a = 0 the middle expression reduces to α/3.

Case II: a ≥ (1 − α)^{-1}. By (3.3), (1 − α)^{-1} ≤ 3 − α/(2ω). Since 1/2 < α ≤ 1, this inequality is not satisfied if either ω ≤ 1/4 or α > 2/3. Assuming these conditions are satisfied, substitution of a = (1 − α)^{-1} into (3.4) gives

γ = (α/2 + αω/(1 − α))/(2ω/(1 − α) + 1) = α/2 < 1/3.

If a > (1 − α)^{-1}, γ is maximized by maximizing ω. Taking the limit of (3.4) as ω → ∞ we obtain γ = (a − 1)/(2a). By (3.3), a < 3, and so γ < 1/3.

Thus in all cases γ < 1/3.
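The case analysis above is easy to check numerically. The sketch below (an illustration with an arbitrary grid, not part of the proof) evaluates (3.4) over parameter triples satisfying the strict inequalities (3.2) and (3.3), and confirms that γ stays below 1/3.

```python
def gamma(alpha, omega, a):
    # (3.4): gamma = (alpha/2 + (a-1)*omega) / (2*a*omega + 1)
    return (alpha / 2.0 + (a - 1.0) * omega) / (2.0 * a * omega + 1.0)

def admissible(alpha, omega, a):
    # (3.2) and the lower bound of (3.3); the upper bound of (3.3)
    # is implied by (3.2).
    return (0.5 < alpha <= 1.0 and omega > 0.0 and a >= 0.0
            and alpha / 2.0 + (a - 1.0) * omega < 2.0 * omega
            and a > 1.0 - (2.0 * alpha - 1.0) / (2.0 * omega))

best = 0.0
for ai in range(0, 60):             # a     = 0.00 .. 2.95
    for al in range(51, 101):       # alpha = 0.51 .. 1.00
        for om in range(1, 101):    # omega = 0.05 .. 5.00
            a, alpha, omega = ai / 20.0, al / 100.0, om / 20.0
            if admissible(alpha, omega, a):
                best = max(best, gamma(alpha, omega, a))
```

On this grid `best` comes within a few percent of 1/3 without reaching it, in line with the remark that the bound can be approached arbitrarily closely.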
We might add that a and ω can be adjusted so that γ is arbitrarily close to the bound obtained in each case.

We shall see that the bias is minimized by increasing the number of steps, and is usually not affected by increasing the number of observations taken on each step. The restriction that α > 1/2 implies for Case II that a > 2, and so this case is usually not of practical significance.

It follows from (3.7) that, for Case I, the order of the variance increases as either a decreases or α increases. Moreover, the effect of a change in a becomes less pronounced as α increases, and it vanishes at α = 1. Thus, if the number of steps is sufficient to discount a serious bias, the experimenter loses little efficiency in a statistical sense by moderately increasing the number of observations on each step and decreasing the number of steps when α is close to 1. However, by taking observations at fewer levels of the controlled variable (X), the experimenter decreases the time required for the experiment and possibly the cost.

3.3 Effect of Constraining the K-W Process

It is difficult to study the advantages and disadvantages of constraining the random variables via a strict mathematical treatment. Rather, using an intuitive approach, we shall study an example which illustrates when we may improve the K-W process by constraining the random variables.

Let Z ~ N(μ, σ²), and let Z^T denote Z constrained to lie between ±T. It is clear by Lemma (2.1) and the symmetry of the normal distribution that

1. sgn E(Z^T) = sgn(μ);
2. E(Z^T) ≈ μ if |μ| << T;  (3.9)
3. |E(Z^T)| < |μ| if |μ| > T;
4. Var Z^T ≈ σ² if |μ| << T and σ << T;
5. and Var Z^T < σ² if σ > T.

Partition the x-axis into the intervals I₁ = (−∞, θ − k), I₂ = [θ − k, θ + k], and I₃ = (θ + k, ∞). Assume that M'(a) << |M'(b)| for a ∈ I₁ and b ∈ I₃. Assume that T_n ≡ T, c_n ≡ C, and σ are such that T < σ and |M(x − C) − M(x + C)| << T for x ∈ I₁. (These assumptions on M(x) are often realized in various threshold phenomena.)
Let X_n and X_n^T denote the K-W process and the CKW process respectively, and assume that X₁ = X₁^T = x. To compare the processes we can use the observations in (3.9) with Z = Y(x − C) − Y(x + C). Using (5) and the fact that T < σ, we obtain

Var (Y(x − C) − Y(x + C))^T < Var (Y(x − C) − Y(x + C)).

It follows that, at least for a few steps, the K-W process is more variable about its expected value. In addition to this, as long as both processes remain in the same interval, we can visualize the following cases.

Case I: x ∈ I₁. Since |M(x − C) − M(x + C)| << T for x ∈ I₁, (2) of (3.9) implies the processes approach θ at the same expected rate. The fact that M'(a) << |M'(b)| for a ∈ I₁ and b ∈ I₃ implies that this expected rate is much less than the corresponding rate for either process approaching θ from I₃. That is, the probability that either process moves from I₁ to I₂ in a few steps is relatively small. It is possibly a little larger for the K-W process due to its more erratic nature.

Case II: x ∈ I₂. While in this region, the behavior of both processes is characterized by random movement, since M(x − C) − M(x + C) ≈ 0 << T. This movement is more erratic for the K-W process if T < 2σ. Moreover, if by chance either process leaves this region, the movement away from θ should most likely be greater for the K-W process due to its sensitivity to observations from the tails of the distribution. This difference increases with an increase in σ or a decrease in T.

Case III: x ∈ I₃. It follows from (3) of (3.9) that the expected rate of approach toward θ is faster for the K-W process. This difference increases with a decrease in T. However, the expected rate of approach toward θ is fast for either process, and so the probability that either process moves from I₃ to I₂ in a few steps is relatively large.
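The three cases above can be mirrored in a small Monte Carlo sketch. Everything below is an illustrative assumption (a response of the threshold form e^{−x²}, σ = 2, T = 1, a start inside I₂); the same routine gives the K-W process when T = ∞ and the CKW process when T is finite, so the two mean absolute errors can be compared.

```python
import math
import random

def run_once(T, rng, theta=0.0, sigma=2.0, x1=0.5, n_steps=500,
             A=0.2, C=0.5, alpha=1.0, omega=0.2):
    """One run of (3.1) on M(x) = exp(-(x - theta)^2).
    T = inf is the K-W process; a finite T is the CKW process."""
    M = lambda x: math.exp(-(x - theta) ** 2)
    x = x1
    for n in range(1, n_steps + 1):
        a_n, c_n = A * n ** (-alpha), C * n ** (-omega)
        z = (M(x - c_n) + rng.gauss(0, sigma)) - (M(x + c_n) + rng.gauss(0, sigma))
        x -= a_n / c_n * max(-T, min(T, z))       # constrained correction
    return abs(x - theta)

rng = random.Random(7)
reps = 200
err_kw = sum(run_once(float("inf"), rng) for _ in range(reps)) / reps
err_ckw = sum(run_once(1.0, rng) for _ in range(reps)) / reps
```

Nothing here is a proof; the comparison of `err_ckw` with `err_kw` depends strongly on T, σ, and the starting value, exactly as the discussion indicates.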
Since both processes tend to move out of the region I₃ more quickly than out of region I₁, we can expect both processes to be biased after a few steps given that X₁ is in a neighborhood of θ. The CKW process is less biased for two reasons. Firstly, it tends to move from I₃ into I₂ more slowly, and it tends to move from I₁ into I₂ at about the same rate. Secondly, but probably more importantly, on leaving I₂ the K-W process tends to be thrown farther into regions I₁ and I₃. Due to the usually quick recovery from I₃ and the usually slow recovery from I₁, this can greatly increase the difference in the bias of the estimates.

If X₁ ∈ I₁, neither process seems to be the better, and although the K-W process seems better if X₁ ∈ I₃, the difference quickly disappears since both processes approach θ at a rapid rate. Thus the CKW process seems to be preferable for this example.

The essential feature of this example is the existence of two regions, both excluding θ, that are differentiated by the magnitude of M'(x). Other examples with this feature can be constructed, but we note only that regression functions of the form e^{−x²} have this feature. The advantage of the CKW process is that of lessening the probability of a deep penetration into the region for which |M'(x)| is small.

The optimal choice of T is largely determined by σ² = Var Y(x) and the regions I₂ and I₃. Clearly, if 4σ < T, constraining is of little consequence in I₂, and if it is of consequence in I₃, it is a negative factor. Thus, we must choose T so as to decrease the variability in I₂ without sacrificing too much of the information that I₃ contains with respect to the direction of θ. As either more observations are taken at each step or σ² is decreased, this becomes more difficult to do.

Before leaving this subject it should be observed that, as a_n → 0, the influence of each iteration goes to zero.
This implies that both processes become characterized by their expected rate of approach toward θ. In any small neighborhood of θ, toward which both processes tend, this is very similar.

3.4 Expected Small Sample Behavior of the K-W Process

In this section we shall discuss the effects that the parameters α and ω, the starting value (X₁), and the shape of the response curve have on the behavior of the K-W process. Although the CKW process will not be discussed, it will be seen that most of the results in this section apply to it as well.

Let I denote an interval containing θ, and assume that M'(x) = −B, with B > 0, for x ∈ I^c with x > θ. Assume that the distribution function of Y(x) − M(x) is independent of x. Let Z(X_n) = Y(X_n − c_n) − Y(X_n + c_n) − (M(X_n − c_n) − M(X_n + c_n)), and let Var Z(X_n) = 2σ². We have by (3.1) that

X_{n+1} − θ = X_n − θ − a_n c_n^{-1}(M(X_n − c_n) − M(X_n + c_n)) − a_n c_n^{-1} Z(X_n).  (3.10)

If X_k >> θ, the process will remain in I^c for a few iterations with high probability. Thus, if n − k is not too large, (3.10) becomes

X_{n+1} − θ = (X_n − θ) − 2Ba_n − a_n c_n^{-1} Z(X_n).

By induction,

X_{n+1} − θ = (X_k − θ) − 2B Σ_{m=k}^n a_m − Σ_{m=k}^n a_m c_m^{-1} Z(X_m),

where, conditional on X_k >> θ, Z(X_m), m = k, …, n, are independent random variables. It follows that

E(X_{n+1} − θ | X_k >> θ) = (X_k − θ) − 2B Σ_{m=k}^n a_m,  (3.11)

and

Var(X_{n+1} − θ | X_k >> θ) = 2σ² Σ_{m=k}^n a_m² c_m^{-2}.  (3.12)

Let a_n = An^{-α} and c_n = Cn^{-ω}. It is a well known result that

Σ_{m=k}^n m^{-p} ≈ ∫_k^{n+1} x^{-p} dx  (3.13)

for p > 0. Substituting the expressions for a_n and c_n into (3.11) and (3.12) and applying (3.13), we obtain

E(X_{n+1} − θ | X_k >> θ) ≈ X_k − θ − 2BA(ln(n+1) − ln k), α = 1,

and

Var(X_{n+1} − θ | X_k >> θ) ≈ σ²A²C^{-2}(α − ω − 1/2)^{-1}(k^{1−2(α−ω)} − (n+1)^{1−2(α−ω)}).

The above expressions indicate that, if we are approaching θ along a constant slope, our expected rate of approach is proportional to the slope and increases very rapidly as α decreases. Furthermore, the variance of our rate of approach decreases as either α increases or ω decreases.
Thus, if we expect to approach θ in a region in which M'(x) is fairly constant, we should generally decrease both α and ω. It may be wise to let c_n and a_n be constant until the sequence {X_n} no longer shows a trend in a particular direction. This lack of a trend indicates we may be in a neighborhood of θ.

In this example we are trying to decrease the bias for a given number of observations. It is clear from (3.11) that the bias is not decreased by increasing the number of observations on each step. Therefore, unless it is practically restrictive, it is best to take a minimum number of observations on each step until we are in a neighborhood of θ.

The previous example reveals the behavior of the K-W process for a few iterations while in a region in which M'(x) is fairly constant. However, the assumptions of this example are certainly violated in a small neighborhood of θ, or in a region in which M'(x) cannot be approximated by a constant. A more reasonable approximation, at least in a neighborhood of θ, is to assume that M(x) is a quadratic function of x. That is, assume

M(x) = −(B/2)(x − θ)²  (3.14)

for B > 0. Let a_n = An^{-α} and c_n = Cn^{-ω}. Let Z(X_n) = Y(X_n − c_n) − Y(X_n + c_n) − (M(X_n − c_n) − M(X_n + c_n)), and assume that the distribution function of Z(X_n), given X_n = x, is independent of x. Let Var Z(X_n) = 2σ². Using (3.10) we have

X_{n+1} − θ = X_n − θ − An^{-α}c_n^{-1}(B/2)((X_n + c_n − θ)² − (X_n − c_n − θ)²) − AC^{-1}n^{-(α−ω)}Z(X_n)
 = (1 − 2ABn^{-α})(X_n − θ) − AC^{-1}n^{-(α−ω)}Z(X_n).  (3.15)

Since, conditional on X_n = x, the distribution of Z(X_n) is independent of x, we obtain by induction that

X_{n+1} − θ = ∏_{j=k}^n (1 − 2ABj^{-α})(X_k − θ) − AC^{-1} Σ_{m=k}^n ∏_{j=m+1}^n (1 − 2ABj^{-α}) m^{-(α−ω)} Z(X_m),

where ∏_{j=n+1}^n (1 − 2ABj^{-α}) = 1. It follows that

E(X_{n+1} − θ | X_k) = ∏_{j=k}^n (1 − 2ABj^{-α})(X_k − θ),

and

E((X_{n+1} − θ)² | X_k) = D(k,n)(X_k − θ)² + 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α−ω)},  (3.16)

where D(m,n) = ∏_{j=m}^n (1 − 2ABj^{-α})². The last two expressions of (3.16) are, conditional on X_k, the square of the bias and the variance of X_{n+1}, respectively.
Using results similar to (3.13), it has been shown that there exists δ_k → 0 such that, for every k,

(1 − δ_k) exp{−4AB(1 − α)^{-1}(n^{1−α} − k^{1−α})} ≤ D(k,n) ≤ (1 + δ_k) exp{−4AB(1 − α)^{-1}(n^{1−α} − k^{1−α})}, α < 1,  (3.17)

and

(1 − δ_k)(k/n)^{4AB} ≤ D(k,n) ≤ (1 + δ_k)(k/n)^{4AB}, α = 1.  (3.18)

We can use the results of (3.17), (3.18), and Corollary (3.1) to obtain asymptotic expressions for the bias and variance of X_n in this example. In fact, it follows from (3.16), (3.17), and (3.18) that the square of the bias is

((X_k − θ)² + o(1)) exp{−4AB(1 − α)^{-1} n^{1−α}}, α < 1,
((X_k − θ)² + o(1)) n^{−4AB}, α = 1.  (3.19)

Although Corollary (3.1) does not guarantee that Var(X_n) exists, it does imply that n^{β/2}(X_n − θ), with β/2 = (α − 2ω)/2, tends to be distributed as a normal random variable with variance equal to

2A²σ²/(C²(4AB − (α − 2ω))), α = 1,  (3.20)

for A > (α − 2ω)/4B. Many writers have noted that the variance expression for α = 1 is minimized with respect to A if A = (α − 2ω)/2B.

It is reasonably evident from the nature of D(k,n) that the bias of the K-W procedure is diminished by decreasing α. Asymptotically this is clear from (3.19). However, it follows from (3.5) that β/2 = (α − 2ω)/2 decreases with α. Thus the asymptotic variance increases as α decreases. Again, until X_n is in a neighborhood of θ, it is probably best to let α and ω be small. Once X_n is in a neighborhood of θ, α should be increased.

It is important to note that in this example the bias is not a function of c_n, and, as is to be expected, the variance decreases as either C increases or ω decreases. It follows that we can set ω = 0. By doing so, γ (see (3.4)) becomes α/2. Hence with α = 1 we obtain an asymptotic variance of O(n^{-1}). As was noted in Chapter 1, it is the symmetry of M(x) about θ that allows us to assume ω = 0. It is clear that, if M(x) is reasonably symmetric about θ, and if only a rough estimate of θ is needed, then it is best to let ω ≈ 0.
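For the quadratic case the exact conditional moments (3.16) are cheap to evaluate, which gives a direct check of (3.20). The sketch below accumulates the suffix products D(m+1, n) once and then compares the normalized variance with the asymptotic value; all parameter values are illustrative.

```python
def moments(k, n, A, B, C, alpha, omega, sigma2, dev0):
    """Squared bias and variance of X_{n+1} from (3.16), quadratic M."""
    suffix = [0.0] * (n + 2)                 # suffix[m] = D(m, n)
    suffix[n + 1] = 1.0                      # empty product
    for j in range(n, k - 1, -1):
        suffix[j] = suffix[j + 1] * (1.0 - 2.0 * A * B * j ** (-alpha)) ** 2
    bias2 = suffix[k] * dev0 ** 2
    var = (2.0 * A * A * sigma2 / (C * C)) * sum(
        suffix[m + 1] * m ** (-2.0 * (alpha - omega)) for m in range(k, n + 1))
    return bias2, var

A, B, C, alpha, omega, sigma2 = 1.0, 0.6, 1.0, 1.0, 0.25, 1.0
n = 20000
bias2, var = moments(3, n, A, B, C, alpha, omega, sigma2, dev0=1.0)
asym = 2.0 * A * A * sigma2 / (C * C * (4.0 * A * B - (alpha - 2.0 * omega)))  # (3.20)
ratio = n ** (alpha - 2.0 * omega) * var / asym
```

At this n the squared bias is negligible and `ratio` is close to 1, illustrating the convergence of the small sample variance to the asymptotic variance discussed below.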
Suppose we can assume that M(x) is symmetric about θ. This implies that we can set c_n = C. Asymptotically, the optimal choice of C maximizes the difference |M(x − C) − M(x + C)| for small deviations of x about θ. It is quite clear then how C should be chosen. We only note here that, for a quadratic response surface, C should be made arbitrarily large, and, for surfaces of the form e^{−x²}, C is determined as in Figure 2.

[Figure 2. Optimal Choice of c_n (= C).]

In this example we have assumed that Var Z(X_n) = 2σ². If Var Z(X_n) = 2[n^{2aω}]^{-1}σ², (3.16) becomes

E((X_{n+1} − θ)² | X_k) = D(k,n)(X_k − θ)² + 2A²C^{-2}σ² Σ_{m=k}^n D(m+1,n) m^{-2(α+(a−1)ω)}.  (3.21)

3.5 Approximations of the Small Sample Variance

The importance of evaluating the last expression in equation (3.21) is quite clear. This expression is, conditional on X_k, the small sample variance of X_{n+1} if M(x) is a quadratic function. If each X_m, m = 1, …, n, lies in a neighborhood of θ in which M(x) is reasonably approximated by a quadratic function, it is a close approximation to this variance for more general response functions. At present, the only simple, but adequate, approximations to this expression rely heavily on asymptotic theory. We now obtain bounds for this expression using an appeal to asymptotic theory that is satisfied for values of k and n usually realized in practice.

Let

b(m,n) = m^{-φ} ∏_{j=m+1}^n (1 − Dj^{-α})²,  (3.22)

where D, φ > 0. For simplicity, let b_m = b(m,n), keeping in mind the dependence on n.
-a (m-l) Since m -a = O(n-(l+a», the second difference is given by (b -b m m-l •• ) - (b m-l -b m-2 ) =(2Dm- a _ + Oem -2a ym '-l)"(b'b ') 'm-' m'-l (3.24) )(b m-b m- 1)' Thus, for k sufficiently large, {b -b I} is an increasing sequence m m- for m > k if either a < 1, or a = 1 and 2 D > Y n The last expression of (3.21) is of the form 1: b It is m=k m clear from equations (3.23) and (3.24) that, for sufficiently large k, upper and lower bounds can be obtained for this sum using a geometrical argument. See Figure 3 for a renresentation of the elements b illustrates this approach. m that In fact, Area 6 ADE - Area 6 ABC, k > A n 1: b m=k (3.25 ) > m- Area 6 ADE • and , k < A 46 b k ------- (f) n-l Figure 3. n Graph of the Sequence {b } n n L: b m=k < (n - k) b . m- n This geometrical approach will be used in only one case in the following lemmas. However, every bound will be obtained by finding sequences of functions {f (x)} such that either n f n = k, (m) > b , m (m) < b, m = k, ... , n. - m .•• , n or f n - m The respective bound will be obtained by integrating f (x) over a n region such as (k,n+l) or (k-l,n). Lemma (3.1): a = 1. Let AI' A , 2 ¢ < Al < 2 D < 1. 2 . k, n > N (1. ,1. ), 1 2 Let the sequence {b } be defined by (3.22) with m ¢, and D be real numbers such that Then there exists an N(A ,A ) such that, for every 1 2 47 A-(ep-l) n (k-l) 2 ) < r b m=k < (A - I - (ep- 1»-1n -A A -(ep-1) I «n+l) I ). Consider the function sequence {f (X,A)} <;lefined by Proof: n f n (x ' A) = n for x > O. A -(ep-l) _ k I m -A A-ep x· , (3.26) (n , A) = n -ep = b • (3.27) Clearly, by (3.22), f n n We have using (3.26) that f n (3.28) (m,A) > b «b ) - m- m i f and only i f n Let d (A) m = -A ep-A > m mep-A b • m m «mC/l-A b ). - (3.29) m We have = by (3.23). b -1 I + (A - 2D) m + Oem-2 ) Since Al < 2 D < A , there exists an N = N(A I ,A 2 ) such 2 that the sequences {dm(AI)}m>N and {dm(A 2 )}m>N are increasing and decreasing respectively. 
But (3.28), and hence (3.29), holds for m = n. Thus, by the monotonicity of the sequences {d_m(λ₁)} and {d_m(λ₂)}, (3.29), and hence (3.28), holds for all m > N(λ₁, λ₂).

Since φ < λ₁ < λ₂, {f_n(x, λ₁)} and {f_n(x, λ₂)} are sequences of increasing functions. It follows that, for k > N(λ₁, λ₂),

∫_{k−1}^n n^{-λ₂} x^{λ₂−φ} dx < Σ_{m=k}^n b_m < ∫_k^{n+1} n^{-λ₁} x^{λ₁−φ} dx.  (3.30)

Lemma (3.1) is obtained by integrating the above expression.

Lemma (3.2): Let the sequence {b_m} be defined as in Lemma (3.1). Let λ₁, λ₂, φ, and D be such that 0 < φ − 1 < λ₁ < 2D < λ₂ < φ. Then there exists an N(λ₁, λ₂) such that, for every k, n > N(λ₁, λ₂),

(λ₂ − (φ − 1))^{-1} n^{-λ₂}((n + 1)^{λ₂−(φ−1)} − k^{λ₂−(φ−1)}) < Σ_{m=k}^n b_m < (λ₁ − (φ − 1))^{-1} n^{-λ₁}(n^{λ₁−(φ−1)} − (k − 1)^{λ₁−(φ−1)}).

Proof: The proof is a repetition of the proof of Lemma (3.1) except that, in this case, {f_n(x, λ)} is a sequence of decreasing functions. Thus the ranges of integration in (3.30) must be reversed. The restriction that λ₁ > φ − 1 is to prevent f_n(x, λ₁) from being of the form x^{-1}.

Lemma (3.3): Let the sequence {b_m} be defined by (3.22) with α < 1. Let 0 < λ < 2D. Then there exists an N(λ) such that, for every k, n > N(λ),

b_n²(b_n − b_{n−1})^{-1}(1 − exp{λn^{-α}(k − n)})²/2 < Σ_{m=k}^n b_m < λ^{-1} n^{α−φ}(exp{λn^{-α}} − exp{λn^{-α}(k − n)}).

Proof: Consider the function sequence {f_n(x, λ)} defined by

f_n(x, λ) = b_n exp{λn^{-α}(x − n)}  (3.31)

for x > 0. We have

f_n(m, λ) ≥ b_m  (3.32)

if and only if

d_m(λ) ≤ d_n(λ),  (3.33)

where

d_m(λ) = b_m exp{−λn^{-α} m}.  (3.34)

Then d_m(λ)/d_{m−1}(λ) = (1 + 2Dm^{-α} + O(m^{-1})) exp{−λn^{-α}} by (3.23). Since λ < 2D, there exists an N(λ) such that {d_m(λ)}_{m>N(λ)} is an increasing sequence for m ≤ n. But (3.32), and hence (3.33), holds for m = n. Thus, by the monotonicity of the sequence {d_m(λ)}_{m>N(λ)}, (3.33), and hence (3.32), holds for all m > N(λ). Since f_n(x, λ) is an increasing function for all n,

Σ_{m=k}^n b_m ≤ b_n exp{−λn^{1−α}} ∫_k^{n+1} exp{λn^{-α} x} dx.

The right hand inequality of Lemma (3.3) follows by integration and by noting that b_n = n^{-φ}.
To obtain the left hand inequality of Lemma (3.3), we note that the slope of the line AE (see Figure 3) is given by b_n − b_{n−1}; the vertical leg of ΔADE has length b_n − f_n(k, λ), and its horizontal leg has length (b_n − f_n(k, λ))(b_n − b_{n−1})^{-1}. But, for k > N(λ), (3.31) and (3.32) imply that

b_k < b_n exp{λn^{-α}(k − n)}.

It follows from (3.25) that, for k > N(λ),

Σ_{m=k}^n b_m ≥ Area ΔADE − Area ΔABC ≥ b_n²(b_n − b_{n−1})^{-1}(1 − exp{λn^{-α}(k − n)})²/2.

Thus Lemma (3.3) is proven.

We now use Lemmas (3.1), (3.2), and (3.3) to calculate asymptotic bounds for the variance.

Theorem (3.2): Let the sequence {b_m} be defined by (3.22). Then, for fixed k,

(4D)^{-1} + o(1) ≤ n^{φ−α} Σ_{m=k}^n b_m ≤ (2D)^{-1} + o(1) if α < 1,

and, if α = 1 and 2D > φ − 1 > 0,

n^{φ−α} Σ_{m=k}^n b_m = (2D − (φ − 1))^{-1} + o(1).

Proof: We see from (3.16) and (3.22) that b_m = D(m+1, n)m^{-φ} if D = 2AB and φ = 2(α − ω). Using (3.16), (3.19), and the fact that, if α < 1, then 2D > φ − 1, we have n^{φ−α} b_k → 0 for any fixed k. Thus we need consider only the sum from m > N, where N is chosen to satisfy the previous lemmas. The case in which α = 1 and 2D ≠ φ follows directly from Lemmas (3.1) and (3.2), since λ₁ and λ₂ of these lemmas can be chosen arbitrarily close to 2D. It is true for φ = 2D since Σ_{m=k}^n b_m is a monotonic function of D for every k and n. Again, the fact that λ is arbitrary in Lemma (3.3) implies the right hand inequality of Theorem (3.2) when α < 1. The left hand inequality holds from Lemma (3.3) since

n^{φ−α} b_n²(b_n − b_{n−1})^{-1}/2 = (4D)^{-1} + o(1)

by (3.22) and (3.23).

We can apply Theorem (3.2) to the example in which M(x) is a quadratic function by letting D = 2AB and φ = 2(α − ω). Thus, conditional on X_k, the asymptotic variance of n^{(α−2ω)/2}(X_n − θ) is equal to 2ρA²σ²/(C²(4AB − β₊)), where ρ = 1 and β₊ = α − 2ω if α = 1, and 1/2 ≤ ρ ≤ 1 and β₊ = 0 if α < 1. This is in agreement with (3.20). It also points out the important fact that, if M(x) is quadratic, the variance of X_n converges to the variance of its asymptotic distribution given by Corollary (3.1). The same result for a more general class of regression functions has been obtained by many writers using different approaches.
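The limit in Theorem (3.2) can be verified directly from (3.22). The sketch below evaluates the normalized sum for α = 1 by accumulating the product from the top down; the values of D, φ, k, and n are illustrative.

```python
def normalized_sum(n, k, D, phi, alpha):
    """n^(phi - alpha) * sum_{m=k}^{n} b(m, n), with b(m, n) as in (3.22)."""
    tail = 1.0                          # prod_{j=m+1}^{n} (1 - D j^-alpha)^2
    total = 0.0
    for m in range(n, k - 1, -1):
        total += tail * m ** (-phi)     # add b_m = m^-phi * tail
        tail *= (1.0 - D * m ** (-alpha)) ** 2
    return n ** (phi - alpha) * total

D, phi, alpha = 2.0, 1.5, 1.0           # alpha = 1 and 2D > phi - 1 > 0
value = normalized_sum(100000, 5, D, phi, alpha)
limit = 1.0 / (2.0 * D - (phi - 1.0))   # Theorem (3.2): (2D - (phi-1))^-1
```

For these values `value` agrees with `limit` to a few decimal places, consistent with the o(1) remainder.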
The same 11 result for a more general class of regression functions has been obtained by many writers using different approaches. 52 We now consider the extent to which Lemmas (3.1), (3.2), and (3.3) rely on asymptotic theory. We first note that it ,is not neces- serry for the sequence {d } to be monotonic in order that ei·ther (3.29) m or (3.33) hold. Consider the upper bounds established in lemmas (3.1) and (3.2). The appeal to asymptotic results lies in the value of N such that, for all m > N, (3.35) where .\2 < 2 D. If D = 1, (3.35) holds for N = 1. Using the Taylor expansion, we find that (3.35) is equivalent to 1 - 2Dm- l + D2m- 2 < 00 j-l L: ( 11 j=O i=O (3.36) By (3.35), the right hand side of (3.36) decreases as .\Zincreases. Thus N is less than or equal to the value of N obtained for .\2 -3 Assuming .\2 .::: 2D and considering terms up to orders (m ~ 2D. ), we find N is approximately the solution of This solution is given by x = 2(2D - 1)/3. Actually, i f 2D > 3, the coefficient of (m- 4 ) is positive implying the value of Nis less than 2(2D-l)/3 up to orders (m -4 ). It is clear that the upper bounds in lemmas (3.1) and (3.2) are " very practical bounds for moderate D. that, for a = 1, Moreover, we have already noted (3.20) is maximized at A = (1-2w)/2B. The 53 corresponding value of D is 2(1-2w), giving N ~ 2. The appeal to asymptotic theory for the upper bound of Lemma (3.3) lies in the minimum value of N such that exp{An -a }(l-Dm-a ) 2 (3.37) for allm,n > N (see (3.23)' and (3.34» . For the K-W process, cj> = 2(0. + (a-l)w). than one, we shall consider the case cj> < 2. Since a is usually less It follows from (3.23) a that if 2D is not significantly less than cj>, then (1_Dm- )2 (1-m-1)-cj> is increasing in m for quite small m. case for m > N. m~> Let N be such that this is the o N . 
Then, if (3.37) holds for m = N₀, it holds for all m > N₀. Let N₁ be the smallest value, with N₁ ≥ N₀, for which (3.37) holds with m = n. We then have (3.37) for all m, n ≥ N₁. Thus, if either 2D ≈ φ or 2D > φ, N₁ is a reasonable approximation for the minimum solution of (3.37).

If we let m = n and neglect terms of orders n^{-3α} in (3.37), we obtain an approximation of N₁ given by the minimum solution of

1 + (λ − 2D)x^{-α} + (λ²/2 + D² − 2Dλ)x^{-2α} ≤ 1 − φx^{-1} + φ(φ − 1)x^{-2}/2.

Since φ > 1 for the K-W process, the coefficient of x^{-2} is positive. We therefore drop this term, knowing that the resulting minimum solution is increased. If we let λ = CD for 0 < C < 2, the above inequality is surely satisfied if x is greater than the minimum solution of

(2D − λ)x^{-α} − (λ²/2 + D² − 2Dλ)x^{-2α} ≥ φx^{-1},

or, equivalently,

(2 − C)Dx^{-α} − D²(C²/2 − 2C + 1)x^{-2α} ≥ φx^{-1}.

For 2 − √2 < C < 2 the quadratic in C is negative. Thus, if C is so restrained, our approximation is less than the solution of

(2D − λ)x^{-α} = φx^{-1},

or

y = (φ/(2D − λ))^{1/(1−α)}.

3.6 Estimation of E(X_n − θ)²

A comparison of (3.11) and (3.12) with (3.16) quickly reveals that (3.16) is generally not the correct expression for E((X_{n+1} − θ)² | X_k). However, it often serves as a reasonable approximation when the variables {X_m ± c_m}_{m=k₀}^n lie in a region about θ in which M(x) is reasonably approximated by a quadratic function. We assume in this section that the process has been in such a region for m = k₀, …, n.

A natural estimator of n^{α−2ω} E(X_n − θ)² is the right hand expression of (3.16) after the parameters have been replaced by their estimators and the expression has been suitably normalized. Using this estimator, the problem becomes one of choosing a good value for k and estimating the parameters.

Assume, for the moment, that estimates of the parameters are available. The squared bias expression in (3.16) is minimized as either k decreases or X_k approaches θ. This suggests that a good choice for k
would be a k ≥ k₀ for which X_k is close to X_{n+1} (the estimate of θ). Using this method, the squared bias expression can be ignored if n − k₀ is of any consequence.

A natural estimate of the variance of X_n, conditional on X_k, would be obtained by substituting the estimate for −4AM''(θ) (= 2D) into the upper bounds given in Lemmas (3.1), (3.2), and (3.3). If n − k is moderately large, this estimate (after normalizing by n^{α−2ω}) closely approximates the corresponding estimate of the asymptotic variance given in Corollary (3.1).

The parameters, for which we have assumed estimates exist, are σ² and M''(θ). Carrying the assumption that M(x) is approximately quadratic a little further, we can obtain estimates of these parameters using regression analysis.

It is interesting to consider Fisher's information for M''(θ) when Y(x) ~ N(−B(x − θ)², σ²). We have

−(∂²/∂B²)(const − (y + B(x − θ)²)²/2σ²) = (x − θ)⁴/σ².

Thus, conditional on X_i, i = k₀, …, n, the information is given by (assuming 2[m^{2aω}] observations are taken on each step)

Σ_{m=k₀}^n [m^{2aω}]((X_m − c_m − θ)⁴ + (X_m + c_m − θ)⁴)/σ² = O(Σ_{m=k₀}^n c_m⁴ m^{2aω}).

If c_n = n^{-ω} and ω is arbitrarily close to its lower bound (α/(2(3 − a))), the information is bounded below by a quantity of order n^{1−α(2−a)/(3−a)−ε} for arbitrarily small ε. This expression increases as either α decreases or as a increases, until a = 2. Moreover, since (2 − a)/(3 − a) ≤ 2/3, the usual regression estimators of σ² and M''(θ) will yield consistent estimates if M(x) is quadratic.

CHAPTER 4. THE SIGNED KIEFER-WOLFOWITZ PROCESS

4.1 Definition of the Signed Kiefer-Wolfowitz Process

Chapter 3 is a development of a generalization of the K-W process in which the corrective variable ((Y(X_n − c_n) − Y(X_n + c_n))^{T_n}) is constrained to lie between ±T_n. This corrective variable can be characterized as influencing both the direction and the magnitude of the adjustment at the n'th step.
In this chapter we shall develop a modification of the K-W process in which the corrective variable influences only the direction of this adjustment.

Definition (4.1). Let {a_n}, {c_n}, and {r_n} be sequences of positive numbers with [r_n] ≥ 1. Let Y(x) denote a random variable with mean M(x) and distribution function F(·|x). Let

    δ(x, c_n) = [r_n]^{-1} Σ_{i=1}^{[r_n]} δ_i(x, c_n),

where

    δ_i(x, c_n) =  1  if Y_i(x − c_n) > Y_i(x + c_n),
                =  0  if Y_i(x − c_n) = Y_i(x + c_n),
                = −1  if Y_i(x − c_n) < Y_i(x + c_n),    i = 1, …, [r_n].

Let X₁ be fixed, and let the random variable sequence {X_n} be defined by

    X_{n+1} = X_n − a_n c_n^{-1} δ(X_n, c_n),                                (4.1)

where Y_i(X_n − c_n) and Y_i(X_n + c_n), i = 1, …, [r_n], are random variables which, conditional on X_n = x_n, δ(X₁, c₁), …, δ(X_{n−1}, c_{n−1}), X₁, …, X_{n−1}, are distributed independently as F(·|x_n − c_n) and F(·|x_n + c_n). This process will be called the Signed Kiefer-Wolfowitz process, and will be referred to as the SKW process.

When [r_n] ≡ r (a positive integer), the SKW becomes the process first proposed by Wilde [13]. Wilde did not attempt to develop the process except to note that it seemed less sensitive to the variance of Y(x) than the K-W process. The only other attempt to study this process seems to be Springer's [11] work, which was discussed in Chapter 1.

The SKW process is intrinsically different from the K-W process. The K-W process seeks the value θ such that E[Y(θ)] ≥ E[Y(x)] for all x. The SKW process seeks the value θ′ such that P[Y(θ′) > Y(x) | Y(θ′) ≠ Y(x)] ≥ 1/2 for all x. Notice that if Y(x) is such that E[Y(x)] > E[Y(x′)] for x ≠ x′ implies that P[Y(x) > Y(x′) | Y(x) ≠ Y(x′)] > 1/2, then θ = θ′. In particular, if Y(x) ∼ N(M(x), σ²(x)), then θ = θ′. However, the behavior of the two processes can be markedly different because of their different emphases on the magnitude of the corrective variable.

When [r_n] = 1, the SKW process can be considered as a limiting case of the CKW process.
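To make Definition (4.1) concrete, the recursion can be sketched in a few lines of code. This illustration is added here and is not part of the original text: the quadratic response function, the normal noise, and the gain constants A, C, a, w below are hypothetical choices.

```python
import random

def skw(x1, m, sigma, steps, A=3.0, C=1.0, a=1.0, w=0.25, r=1):
    """One run of the Signed Kiefer-Wolfowitz recursion (4.1).

    m     -- regression function M(x), maximized at the unknown theta
    sigma -- standard deviation of the (normal) noise in Y(x)
    Gains: a_n = A*n**-a and c_n = C*n**-w; [r_n] = r observation pairs per step.
    """
    x = x1
    for n in range(1, steps + 1):
        an, cn = A * n ** -a, C * n ** -w
        # delta(x, c_n): average sign of Y_i(x - c_n) - Y_i(x + c_n)
        total = 0
        for _ in range(r):
            d = random.gauss(m(x - cn), sigma) - random.gauss(m(x + cn), sigma)
            total += (d > 0) - (d < 0)
        x -= an / cn * (total / r)      # X_{n+1} = X_n - a_n c_n^{-1} delta
    return x

random.seed(1)
# hypothetical response with maximum at theta = 2
est = skw(x1=0.0, m=lambda x: -(x - 2.0) ** 2, sigma=1.0, steps=4000)
```

Because only the sign of the paired difference enters, an individual observation, however extreme, moves the iterate by at most a_n/c_n.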
To see this, let a_n = An^{-a} for the CKW process, with T_n ≡ T, and let a_n = A_s n^{-a} for the SKW process. The corrective term of the CKW process,

    AT^{-1} n^{-a} c_n^{-1} (Y(X_n − c_n) − Y(X_n + c_n))^{T},

converges in probability to the corrective term of the SKW process,

    An^{-a} c_n^{-1} δ(X_n, c_n),

for any fixed n as T → 0.

Before deriving any basic results it is helpful to discuss the random variable δ(x, c_n). We have

    E δ(x, c_n) = P[Y(x − c_n) > Y(x + c_n)] − P[Y(x − c_n) < Y(x + c_n)].          (4.2)

If Y(x) is a continuous random variable for all x, (4.2) can be expressed as

    E δ(x, c_n) = ∫_{−∞}^{∞} ( ∫_{−∞}^{y} dF(z|x + c_n) ) dF(y|x − c_n)
                  − ∫_{−∞}^{∞} ( ∫_{−∞}^{y} dF(z|x − c_n) ) dF(y|x + c_n).          (4.3)

We now consider the important special case in which Y(x) is a normal random variable with density function

    f(y, x) = (2πσ²(x))^{-1/2} exp{−(1/2)((y − M(x))/σ(x))²}.

Substitution into (4.3) gives

    E δ(x, ε) = ∫_{−∞}^{∞} ( ∫_{−∞}^{y} f(z, x + ε) dz ) f(y, x − ε) dy
                − ∫_{−∞}^{∞} ( ∫_{−∞}^{y} f(z, x − ε) dz ) f(y, x + ε) dy.          (4.4)

Since we shall be interested in the behavior of δ(x, c_n) for c_n small and x in a neighborhood of θ, we shall need estimates of (∂/∂x) E δ(x, ε) and (∂²/∂ε∂x) E δ(x, ε). Assume that

1. M′(x) is continuous, and there exists an open interval I containing θ such that M″(x) is continuous for x ∈ I;
2. 0 < σ²(x) < ∞, σ′(x) is continuous, and σ″(x) is continuous for x ∈ I.

Then the Taylor expansion of f(y, x + ε) gives

    f(y, x + ε) = f(y, x) + ε (∂/∂τ) f(y, x + τ)|_{τ=ξ}   for some 0 < ξ < ε.       (4.5)

Since

    (∂/∂τ) f(y, x + τ) = f(y, x + τ)[ ((y − M(x+τ))/σ²(x+τ)) M′(x+τ)
                           + (((y − M(x+τ))²/σ²(x+τ)) − 1) σ′(x+τ)/σ(x+τ) ],         (4.6)

assumptions (1) and (2) guarantee that the last expression in (4.5) can be dominated by an integrable function for x in a bounded interval. We can therefore express (4.4) as

    E δ(x, ε) = ∫_{−∞}^{∞} [ ∫_{−∞}^{y} (f(z, x + ε) − f(z, x − ε)) dz ] (f(y, x) + ε g(y, x, ε)) dy,   (4.7)

where g(y, x, ε) can be dominated by an integrable function. Since the term in brackets is bounded by 1 for every y and goes to zero as ε → 0, we have from (4.7) and assumptions (1) and (2) that

    lim_{ε→0} (∂/∂ε) E δ(x, ε) = 4 ∫_{−∞}^{∞} ( (∂/∂x) ∫_{−∞}^{y} f(z, x) dz ) f(y, x) dy
                                = 4σ(x) ∫_{−∞}^{∞} ( (∂/∂x)[(y − M(x))/σ(x)] ) f²(y, x) dy
                                = −2M′(x)/(σ(x)√π),                                  (4.8)

where we have used the fact that ∫ F(y|x)(∂/∂x)f(y, x) dy = −∫ ((∂/∂x)F(y|x)) f(y, x) dy, since ∫ F(y|x) f(y, x) dy ≡ 1/2.

In order to evaluate (∂/∂ε) E δ(x, ε) at x = θ and ε = 0, we note that, for x in a closed subinterval of I, the continuity of M″(x) and σ″(x) is sufficient to guarantee the existence and continuity of (∂²/∂ε∂x) E δ(x, ε) and (∂²/∂x∂ε) E δ(x, ε). Moreover, these second partials are equal, and we have by (4.6), (4.7), and (4.8) that

    lim_{c_n→0} (∂²/∂ε∂x) E δ(x, ε)|_{x=θ, ε=c_n} = −2M″(θ)/(σ(θ)√π),               (4.9)

since M′(θ) = 0. Thus if Y(x) is a normal random variable, we have

    E δ(x, c_n) = −2c_n M′(x)/(σ(x)√π) + o(c_n),

and, for |x − θ| sufficiently small,

    E δ(x, c_n) = −2c_n (x − θ) M″(θ)/(σ(θ)√π) + o(c_n(x − θ)).

If, more generally, Y(x) is a continuous random variable whose density function f(y|x) is continuous in x for all y, then it follows from Definition (4.1) that E δ_i²(x, c_n) = 1 and E δ_i(x, c_n) → 0 as c_n → 0. Since, conditional on X_n = x, the δ_i(x, c_n), i = 1, …, [r_n], are independent random variables, we have

    lim_{c_n→0} [r_n] Var δ(θ, c_n) = 1.                                             (4.10)

4.2 Asymptotic Theory of the SKW Process

Corollary (4.1): Let {X_n} be a SKW process with r_n = n^{2aw}. Let {v_n} be a real number sequence, θ a real number, σ² and w non-negative numbers, and n₀, A_s, C, v, and a positive numbers such that

a. if n > n₀, then E δ(v_n, c_n) = 0, and if x ≠ v_n and n > n₀, then (x − v_n) E δ(x, c_n) > 0;
b. for every x, E δ(x, 0) = 0 and (∂/∂ε) E δ(x, ε) is continuous; and, with B = sup_n c_n, sup_{|ε|<B, x} |(∂/∂ε) E δ(x, ε)|/(1 + |x|) < ∞;
c. for every x in an open interval I containing θ and for every |ε| < B, (∂/∂x) E δ(x, ε)|_{x=θ, ε=0} equals zero, (∂²/∂ε∂x) E δ(x, ε) is continuous, and the sequence {(∂²/∂ε∂x) E δ(x, ε)|_{x=θ, ε=c_n}} converges to A > 0;
d. [r_n] Var δ(x, c_n) is continuously convergent to σ² at x = θ;
e. F(y|x) is Borel measurable in x for all y;
f. (1) 1/2 < a ≤ 1, n^a a_n → A_s, n^w c_n → C, and |v_n − θ| = o(n^{-v}); (2) if a < 1, then v > (a − 2w)/2; (3) β⁺ = 0 if a < 1, and β⁺ = β < 2A_sA if a = 1, where β = a − 2w.

Then n^{β/2}(X_n − θ) converges in distribution to N(0, A_s²σ²/(C²(2A_sA − β⁺))).

Proof:
Make the correspondence with Theorem (2.1) as follows: a′_n = a_n c_n^{-1}, Z_n(x) = δ(x, c_n), and R_n(x) = E δ(x, c_n).

To verify that conditions (i) through (vii) of Theorem (2.1) are satisfied, we first note that condition (i) is a direct consequence of condition (a). It follows from condition (b) that

    sup_{n,x} n^w |E δ(x, c_n)|/(1 + |x|) = sup_{n,x} n^w c_n |(∂/∂ε) E δ(x, ε)|_{ε=ξ_n(x)}|/(1 + |x|),

where |ξ_n(x)| < c_n. The last expression is finite by condition (b), and therefore condition (ii) is verified.

To verify condition (iii), we note that condition (b) implies the existence of a function sequence {ξ_n(x)} such that

    |E δ(x, c_n)| = c_n |(∂/∂ε) E δ(x, ε)|_{ε=ξ_n(x)}|,   where |ξ_n(x)| < c_n.

By condition (f.(1)), c_n converges to a limit, and hence the function sequence {ξ_n(x)} converges to a function, say ξ(x). Moreover, condition (b) implies that ξ(x) is continuous, and therefore the function |(∂/∂ε) E δ(x, ε)|_{ε=ξ(x)}| is continuous and attains its infimum k(δ₁, δ₂) over the set [δ₁ ≤ |x − θ| ≤ δ₂]. But v_n → θ by (f.(1)). Therefore, if 0 < δ₁ ≤ δ₂ < ∞, there exists an N such that for every n > N we have [δ₁ ≤ |x − θ| ≤ δ₂] ∩ [|x − θ| ≤ |v_n − θ|] = ∅. This implies by (a) that k(δ₁, δ₂) > 0. Thus condition (iii) is verified by noting that the divergence of Σ a′_n inf_{δ₁ ≤ |x−θ| ≤ δ₂} |R_n(x)| is implied by the divergence of Σ a_n.

We have by (a) and (c) that

    E δ(x, c_n)/(c_n(x − v_n)) = c_n^{-1} (∂/∂x) E δ(x, c_n)|_{x=η_n}
                               = (∂²/∂ε∂x) E δ(x, ε)|_{x=η_n, ε=ξ_n},

where |η_n − v_n| < |x − v_n| and |ξ_n| < c_n. The last expression is continuously convergent to A at x = θ by condition (c). Thus condition (iv) is satisfied.

Condition (v) follows from condition (d) and the fact that |δ(x_n, c_n)| ≤ 1.

Since P[Y(x − c_n) > Y(x + c_n)] can be written as the limit of Lebesgue-Stieltjes sums, which by condition (e) are Borel measurable, it is clear that the distribution function of δ(x, c_n) is Borel measurable in x for every y. Thus condition (vi) is satisfied.
Condition (vii) can be shown to hold by direct substitution for the case T_n ≡ ∞ and p = 2aw. Thus Corollary (4.1) is proven.

The conditions in this corollary are somewhat unintuitive. This arises from the fact that the SKW process explores a response curve (E δ(x, c_n)) which, although often closely related, is not the response curve of the observed random variables. However, as in Corollary (3.1), the assumptions of Corollary (4.1) are satisfied in practice almost without exception. This fact will become clearer in the following examples.

Example 1. Suppose Y(x) ∼ N(M(x), σ²(x)). Let {X_n} be a SKW process which satisfies assumptions (1) and (2) preceding (4.5) and assumption (f.) of Corollary (4.1) with w > 0. We have yet to determine v. In addition, assume that

3. (x − θ)M′(x) < 0 for x ≠ θ, and M′(x) = 0 for x = θ;
4. M‴(x) is continuous for x ∈ I;
5. sup_x (|M′(x)| + |σ′(x)|)/(σ(x)(1 + |x|)) < ∞.

We shall show, using Corollary (4.1), that

    n^{β/2}(X_n − θ) → N(0, A_s²/(C²(2A_sA − β⁺))),   where A = −2M″(θ)/(σ(θ)√π).

We first note that (3) implies condition (a), since P[Y(x) > Y(x′)] > 1/2 if and only if E Y(x) > E Y(x′) for independent non-degenerate normal random variables. It then follows by condition (4) and Corollary (3.1) that v = 2w. Condition (e) is an immediate consequence of the continuity of M(x) and σ(x). We now summarize the relevant results obtained earlier for this special case:

1. (∂/∂ε) E δ(x, ε) is continuous;
2. {[r_n] Var δ(x, c_n)} is continuously convergent to 1 at x = θ;
3. for every x ∈ I and every |ε| < B, (∂/∂x) E δ(x, ε)|_{x=θ, ε=0} = 0, (∂²/∂ε∂x) E δ(x, ε) is continuous, and the sequence {(∂²/∂ε∂x) E δ(x, ε)|_{x=θ, ε=c_n}} converges to A (= −2M″(θ)/(σ(θ)√π)).

Thus, the assertion of asymptotic normality is established if we can show that

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x, ε)|/(1 + |x|) < ∞.

Taking the partial with respect to ε under the integral signs in (4.3) (the continuity of M(x) and σ(x) assures us that this can be done), we obtain integrals of the forms

    ∫_{−∞}^{∞} ( ∫_{−∞}^{y} (∂/∂ε) f(z, x ± ε) dz ) f(y, x ∓ ε) dy

and

    ∫_{−∞}^{∞} ( ∫_{−∞}^{y} f(z, x ± ε) dz ) (∂/∂ε) f(y, x ∓ ε) dy.

An examination of (4.5) reveals that assumption (5) is sufficient to guarantee that the above supremum is finite. Thus the desired asymptotic normality of n^{β/2}(X_n − θ) is established.

The variances of the respective normal distributions to which the SKW process and the K-W process converge are, respectively,

    A_s²/(C²(4A_s|M″(θ)|/(σ(θ)√π) − β))   and   2A²σ²(θ)/(C²(4A|M″(θ)| − β))

when a = 1, and

    A_s σ(θ)√π/(4C²|M″(θ)|)   and   Aσ²(θ)/(2C²|M″(θ)|)                              (4.11)

when 1/2 < a < 1. It is evident that the K-W process is asymptotically more sensitive to the variance of the observations. Before leaving this example we note that if A_s = 2Aσ(θ)/√π, the expressions in (4.11) are identical.

Example 2: Suppose Y(x) has an exponential density function given by

    f(y, x) = λ(x) e^{−λ(x)y},  y ≥ 0;   f(y, x) = 0,  y < 0.

Let {X_n} be a SKW process satisfying assumption (f) of Corollary (4.1) with w > 0. Let M(x) = λ^{-1}(x), and assume that

1. M′(x) is continuous and (x − θ)M′(x) < 0 for x ≠ θ;
2. sup_x |M′(x)|/(M(x)(1 + |x|)) < ∞;
3. there exists an open interval I containing θ such that M‴(x) is continuous.

We have

    P[Y(x − ε) < Y(x + ε)] = 1 − λ(x + ε)/(λ(x + ε) + λ(x − ε))
                           = 1 − M(x − ε)/(M(x + ε) + M(x − ε))

by the last expression in (4.3). It follows that

    E δ(x, ε) = (M(x − ε) − M(x + ε))/(M(x − ε) + M(x + ε)).                          (4.12)

It is clear from (4.12) that the value of v is 2w, as in Corollary (3.1). With the possible exception of condition (b), the conditions of Corollary (4.1) can be easily verified using assumptions (1) and (2). We have

    sup_{|ε|<B, x} |(∂/∂ε) E δ(x, ε)|/(1 + |x|) < ∞

by assumption (2). It remains to evaluate the limiting variance and A. As at (4.10), the limit of [r_n] Var δ(x, c_n) is 1 for all x and, in particular, for x = θ. By assumption (3) the value of A is given by

    A = (∂²/∂ε∂x) E δ(x, ε)|_{x=θ, ε=0} = −M″(θ)/M(θ) = −M″(θ)/σ(θ),                (4.13)

since M(θ) = λ^{-1}(θ) = σ(θ).
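Both closed forms above are easy to check numerically. The sketch below is added here and is not part of the original; the response and scale functions are hypothetical. It compares a Monte-Carlo estimate of E δ(x, ε) with 2Φ(m/s) − 1 in the normal case, m and s being the mean and standard deviation of Y(x−ε) − Y(x+ε), and with (4.12) in the exponential case.

```python
import math
import random

def e_delta_mc(draw_minus, draw_plus, n=200_000):
    """Monte-Carlo estimate of E delta = P[Y- > Y+] - P[Y- < Y+]."""
    total = 0
    for _ in range(n):
        d = draw_minus() - draw_plus()
        total += (d > 0) - (d < 0)
    return total / n

random.seed(0)
x, eps = 1.5, 0.2

# Normal case: Y(x-eps) - Y(x+eps) ~ N(m, s^2), so E delta = 2*Phi(m/s) - 1.
M = lambda t: -(t - 1.0) ** 2          # hypothetical M(x), maximum at 1
sig = 1.0
m = M(x - eps) - M(x + eps)
s = math.sqrt(2.0) * sig
exact_norm = math.erf(m / (s * math.sqrt(2.0)))   # 2*Phi(z) - 1 = erf(z/sqrt(2))
mc_norm = e_delta_mc(lambda: random.gauss(M(x - eps), sig),
                     lambda: random.gauss(M(x + eps), sig))

# Exponential case with mean M(t) > 0: formula (4.12).
Me = lambda t: 2.0 + 0.5 * t           # hypothetical positive mean function
exact_exp = (Me(x - eps) - Me(x + eps)) / (Me(x - eps) + Me(x + eps))
mc_exp = e_delta_mc(lambda: random.expovariate(1.0 / Me(x - eps)),
                    lambda: random.expovariate(1.0 / Me(x + eps)))
```

Differentiating the normal closed form with respect to ε at ε = 0 reproduces the slope computation of (4.8).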
A comparison of (4.13) with (4.9) suggests that A may generally be of the form −kM″(θ)/σ(θ), where k is a constant characterizing a family of distribution functions. The next example gives an important case where this is not true.

Example 3: Suppose Y(x) is a Bernoulli random variable with P[Y(x) = 1] = p(x). Suppose that assumptions (1) and (3) of Example 2 hold with M(x) replaced by p(x). In addition, assume that

2. sup_x |p′(x)|/(1 + |x|) < ∞.

It is easily seen that these assumptions are sufficient to verify the conditions of both Corollary (4.1) and Corollary (3.1). Moreover, in this case the SKW process and the K-W process are identical. Thus, using the results of Corollary (3.1), we find that

    n^{β/2}(X_n − θ) → N(0, A_s²σ²/(C²(2A_sA − β⁺))),

where A = −2p″(θ) and σ² = 2p(θ)(1 − p(θ)).

This example shows that the corrective variable in the SKW process is a transformation of the observed random variable into a discrete, bounded random variable with a regression function that is less than, equal to, or greater than zero as x is less than, equal to, or greater than v_n. That is, the SKW process is simply a K-W process after this transformation of the observed random variables has been made.

These examples indicate that, at least asymptotically, the SKW process is more dependent than the K-W process on the family of distributions (e.g., normal, exponential) from which the observed random variables come. However, for the special cases of the normal and exponential families, the SKW process is less sensitive to the variance of the parent distribution.

4.3 Small Sample Properties of the SKW Process

Unlike the K-W process, whose small sample properties are largely determined by M(x) and σ(x), the small sample properties of the SKW process often depend on the distribution function in a more intrinsic manner. This makes a general discussion of the small sample properties of the SKW process very difficult. However, a few general remarks can be made.
Firstly, the effect of using the sign of (Y(X_n − c_n) − Y(X_n + c_n)) is somewhat similar to that of constraining an observed random variable; both operations lessen the effect of an observation from the tails of the distribution. Secondly, if the SKW process is approaching θ along a constant, relatively small slope, the expected behavior is fairly well approximated by (3.11) and (3.12), in which the constants B and σ² are characteristic of the family of observed random variables. And, thirdly, if the SKW process is approaching θ along a steep slope, the approach is very uniform, and the rate is determined by the sequences {a_n} and {c_n}.

It is clear from the nature of the SKW process that it is impossible to consider an example corresponding to M(x) being quadratic for the K-W process. That is, since δ(x, c_n) is bounded by ±1, it is impossible for E δ(x, c_n) = kc_n(x − θ) to hold for all x. This implies that the moments of n^{β/2}(X_n − θ) generally do not converge to those of its asymptotic normal distribution, except possibly when X_n is restricted to lie in a bounded interval containing θ. To study the expected rate of approach to θ therefore requires a slight modification of the method used for the K-W process.

Let {X_n} be a SKW process. Let I be an open interval containing θ such that

    E δ(x, c_n) = c_n T_n(x)(x − θ),                                                  (4.14)

where 0 < τ ≤ T_n(x) ≤ T̄ < ∞. Let A_k be the event that X_m ∈ I for m ≥ k, and let X*_n denote the random variable X_n conditional on A_k. That is, assume that each member of the random variable sequence {X_m}_{m=k}^{∞} lies in I.

It is not unreasonable to study the random variable sequence {X*_n}, for two reasons. Since X_n → θ a.e. by Corollary (4.1), P(A_k) → 1 as k → ∞; and since P(A_k) → 1, n^{β/2}(X*_n − θ) → n^{β/2}(X_n − θ) in probability and therefore has the same asymptotic distribution.
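The first remark, that taking the sign damps observations from the tails, can be seen in a small side-by-side simulation. This sketch is added here with hypothetical constants: the K-W recursion uses the raw difference Y(X_n−c_n) − Y(X_n+c_n), the SKW recursion only its sign, and with a large observation variance the SKW estimates cluster more tightly.

```python
import random

THETA, SIG = 2.0, 5.0                       # hypothetical optimum and noise s.d.
M = lambda x: -(x - THETA) ** 2             # quadratic response, maximum at THETA

def run(signed, steps=1500, x=0.0, A=3.0, C=1.0, a=1.0, w=0.25):
    for n in range(1, steps + 1):
        an, cn = A * n ** -a, C * n ** -w
        d = random.gauss(M(x - cn), SIG) - random.gauss(M(x + cn), SIG)
        if signed:                          # SKW: keep only the direction
            d = (d > 0) - (d < 0)
        x -= an / cn * d                    # common form of recursion (4.1)
    return x

random.seed(2)
runs = 100
mse_kw  = sum((run(False) - THETA) ** 2 for _ in range(runs)) / runs
mse_skw = sum((run(True)  - THETA) ** 2 for _ in range(runs)) / runs
```

With these constants the asymptotic variances of Section 4.2 predict roughly a four- to five-fold advantage for the signed process at σ = 5; for small σ the ordering reverses, in line with the variance comparison of Example 1.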
We have, for a_n = A_s n^{-a} and c_n = Cn^{-w},

    X*_{n+1} − θ = X*_n − θ − a_n c_n^{-1} δ(X*_n, c_n)
                 = (1 − A_s T_n(X*_n) n^{-a})(X*_n − θ) − A_s C^{-1} n^{-(a−w)} Z(X*_n),   (4.15)

where, conditional on X*_n, E Z(X*_n) = 0 and Var Z(X*_n) = Var δ(X*_n, c_n).

In much the same manner that we obtained (3.16), we obtain

    E((X*_{n+1} − θ)² | X*_k) ≤ D(k, n)(X*_k − θ)²
        + A_s² C^{-2} σ̄² Σ_{m=k}^{n} D(m+1, n) m^{-2(a−w)},                          (4.16)

where D(m, n) = Π_{j=m+1}^{n} (1 − A_s τ j^{-a})² and σ̄² = sup_{n,x} Var δ(x, c_n). The reverse inequality holds if τ and σ̄² are replaced by T̄ and the corresponding infimum. Quite clearly, since T_n(X*_n) → A a.e. and Var δ(X*_n, c_n) → Var δ(θ, 0) a.e., it could be shown that

    n^{β} E(X*_n − θ)² → φ ≡ A_s²σ²/(C²(2A_sA − β⁺)),                                 (4.17)

where β = a − 2w. Since it will be useful in the next section, we shall show that

    E((X*_n − θ)⁴ | X*_k) ≤ 3φ² n^{-2β}(1 + o(1)).

If we raise (4.15) to the 4'th power and take expectations, we obtain terms of the forms E[(X*_n − θ)⁴ | X*_k], E[(X*_n − θ)³ Z(X*_n) | X*_k] (which is zero, since E[Z(X*_n) | X*_n] = 0), E[(X*_n − θ)² Z²(X*_n) | X*_k], E[(X*_n − θ) Z³(X*_n) | X*_k], and E[Z⁴(X*_n) | X*_k]. Using (4.16) and Hölder's inequality, the last two expressions are of smaller order than the term in (X*_n − θ)² Z²(X*_n). Thus we can write

    E((X*_{n+1} − θ)⁴ | X*_k) ≤ D²(k, n)(X*_k − θ)⁴
        + 6A_s² C^{-2}(σ̄²φ + o(1)) Σ_{m=k}^{n} D²(m+1, n) m^{-3a+4w}.                (4.18)

Letting D₂(m, n) = Π_{j=m+1}^{n} (1 − A_s τ j^{-a})⁴, it is seen that Lemmas (3.1), (3.2), and (3.3), as well as Theorem (3.2), hold with 2D replaced by 4D. Thus the last expression in (4.18) is less than

    6A_s² C^{-2} σ²φ(1 + o(1)) n^{-2(a−2w)}/(2(2A_sA − β⁺)),

or, equivalently, 3φ²(1 + o(1)) n^{-2(a−2w)}. It should be noted that, when a = 1, the restriction a − 2w < 2A_sA implies 2(a − 2w) < 4A_sA (= 4D) in the above expressions. Using (3.17) and (3.18) with B = A_sA, we obtain

    E((X*_n − θ)⁴ | X*_k) ≤ 3φ² n^{-2(a−2w)}(1 + o(1)),                               (4.19)

as we might have expected, 3φ²n^{-2β} being the fourth moment of the N(0, φn^{-β}) distribution.

4.4 Estimation of E(X_n − θ)²

The fact that the form of A is generally unknown for the SKW process adds a new complexity to the estimation of the variance of X_n. If we can assume a particular distribution for Y(x), we can often evaluate A, as was illustrated in the preceding examples. An estimate of the variance of X_n could then be calculated using the method developed in Chapter 3. However, the distribution of Y(x) is not always known. We now develop a method of estimating the variance of X_n which does not require this knowledge.

Denote Var X*_n by V_n; by (4.17), n^β V_n → φ. Let {b_m}_{m=1}^{N} be an increasing subsequence of the integers 1, …, n. Define V̂_n by

    V̂_n = N^{-1} Σ_{m=1}^{N} b_m^{β} (X*_{b_m} − X*_n)².                              (4.20)

We shall study the properties of V̂_n through a study of the properties of

    V′_n = N^{-1} Σ_{m=1}^{N} b_m^{β} (X*_{b_m} − θ)².                                (4.21)

The asterisk, which indicates that X_m, m = k, …, n, all lie in a neighborhood of θ in which (4.14) is a reasonable approximation for E δ(x, c_n), will be omitted in the rest of this discussion.

It follows from (4.17) that V′_n is asymptotically unbiased as an estimator of φ. To obtain the variance of V′_n we note that

    Var[V′_n] = N^{-2} Σ_{m,m′=1}^{N} (b_m b_{m′})^{β} Cov[(X_{b_m} − θ)², (X_{b_{m′}} − θ)²]
              = N^{-2} Σ_{m,m′=1}^{N} (b_m b_{m′})^{β} ( E[(X_{b_m} − θ)²(X_{b_{m′}} − θ)²]
                  − E(X_{b_m} − θ)² E(X_{b_{m′}} − θ)² ).

Writing, for m′ > m, E[(X_{b_{m′}} − θ)² | X_{b_m}] = D(b_m, b_{m′})(X_{b_m} − θ)² + Var(X_{b_{m′}} − θ | X_{b_m}) and using (4.19), we obtain

    Var(V′_n) ≤ (6φ² + o(1)) N^{-2} Σ_{m′>m} D(b_m, b_{m′}).                          (4.22)

Consider the case a = 1. We have from (3.22) and (3.28) that for every λ < 2A_sτ there exists an N(λ) such that, for every m′ > m > N(λ),

    D(b_m, b_{m′}) < (b_m/b_{m′})^{λ}.                                                (4.23)

Let m′ = m + 1, and suppose that the sequence {b_m} is chosen such that b_m/b_{m+1} is independent of m. Then, for some constant B,

    Σ_{i=1}^{m−1} (ln b_{i+1} − ln b_i) = (m − 1)B,

or, equivalently, b_m = b₁ e^{(m−1)B}. Thus b_m is of the form s^m. Let b_m be so defined, with s > 1. Then from (4.22) and (4.23) we obtain

    Var(V′_n) ≤ (6φ² + o(1)) N^{-2} Σ_{m=0}^{N} (N − m) s^{-mλ}.

Since N is approximately the solution of s^N = n, we have

    ln(n) Var(V′_n) ≤ 6φ² ln(s)/(1 − s^{-λ}) + o(1).

Consider the case a < 1. We have from (3.22) and (3.32) that for every λ < 2A_sτ there exists an N(λ) such that, for every m′ > m > N(λ),

    D(b_m, b_{m′}) < (b_m/b_{m′})^{ψ} exp[λ b_{m′}^{-a}(b_m − b_{m′})],               (4.24)

where ψ > 0. Let m′ = m + 1, and suppose the sequence {b_m} is chosen such that the term in brackets is asymptotically independent of m.
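As a sketch of how the estimator (4.20) might be computed, the following code is added here, using the geometric subsequence b_m = s^m suggested above for a = 1 and a simulated SKW path with hypothetical constants. Note that V̂_n estimates the limit φ of n^β Var(X_n), so the variance of X_n itself is estimated by V̂_n n^{-β}.

```python
import random

def skw_path(m_fn, sigma, steps, A=3.0, C=1.0, a=1.0, w=0.25, x=0.0):
    """Record the SKW trajectory X_1, ..., X_steps under recursion (4.1)."""
    path = []
    for n in range(1, steps + 1):
        an, cn = A * n ** -a, C * n ** -w
        d = random.gauss(m_fn(x - cn), sigma) - random.gauss(m_fn(x + cn), sigma)
        x -= an / cn * ((d > 0) - (d < 0))
        path.append(x)
    return path

def v_hat(path, s=2.0, beta=0.5):
    """Estimator (4.20) with the geometric subsequence b_m = s^m."""
    xn = path[-1]
    bs, b = [], 1.0
    while int(b * s) <= len(path):
        b *= s
        bs.append(int(b))
    return sum(bm ** beta * (path[bm - 1] - xn) ** 2 for bm in bs) / len(bs)

random.seed(3)
path = skw_path(lambda x: -(x - 2.0) ** 2, sigma=1.0, steps=4096)
phi_hat = v_hat(path)                     # estimates phi = lim n^beta Var(X_n)
var_hat = phi_hat * len(path) ** -0.5     # estimate of Var(X_n); beta = a - 2w
```

No knowledge of A, M″(θ), or the distribution of Y(x) is needed, which is the point of the construction; the price, as the text notes, is that the variance of the estimator is of order 1/ln n when a = 1.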
Then, if b_{m+1} = b_m(1 + O(m^{-1})), the term in brackets remains bounded. It is easily seen that b_m = [(τm)^{1/(1−a)}] has the required property. Let b_m be so defined. Then, for fixed k and m′ = m + k, the term in brackets becomes [−λτk + o(1)]. Thus by (4.22) and (4.24) we obtain Var(V′_n) = O(N^{-1}). Since N is approximately the solution of (τN)^{1/(1−a)} = n, i.e., N ≈ τ^{-1} n^{1−a}, we have

    n^{1−a} Var(V′_n) = O(1).

Since θ is unknown, it is natural to use V̂_n instead of V′_n. We have

    E V̂_n = N^{-1} Σ_{m=1}^{N} b_m^{β} E(X_{b_m} − X_n)²
          = N^{-1} Σ_{m=1}^{N} b_m^{β} ( E(X_{b_m} − θ)² − 2E[(X_{b_m} − θ)(X_n − θ)] + E(X_n − θ)² )
          = N^{-1} Σ_{m=1}^{N} (φ + o(1)).

Thus V̂_n is asymptotically unbiased. Moreover, for a < 1, the cross-moment E[(X_{b_m} − θ)(X_n − θ)] is of order O(b_m^{-β} exp{−λn^{-a}(n − b_m)}). Thus, for a < 1, V̂_n is usually a conservative estimate.

This method of estimation is obviously very inefficient for a close to one. It does have the advantages of being relatively easy to calculate and of being generally applicable. It should be noted that the order of the variance of V̂_n depends on the number of steps and not on the number of observations. This is undesirable when we wish to economize by taking more observations on each step and taking fewer steps.

CHAPTER 5. TWO EXAMPLES

We now illustrate the application of some of the results of Chapter 3 and of Chapter 4 by means of two hypothetical examples. Both examples entail designing an experiment, with emphasis on the factors that influence the choice of the sequences {a_n}, {c_n}, and {T_n}.

Example 1: A research worker is studying the relationship between temperature and the rate of reproduction of a certain strain of bacteria. One of his objectives is to estimate the temperature (θ) that maximizes the average rate of reproduction. He emphasizes that he wishes to minimize E(M(θ̂) − M(θ))² rather than E(θ̂ − θ)². From past experience he is reasonably confident that θ lies between 80 and 110 degrees Fahrenheit and that the average rate of reproduction is reasonably constant in the interval (θ−5, θ+5).
For temperatures below θ−5 degrees the average rate of reproduction changes approximately one unit for each 10-degree change in temperature, and for temperatures above θ+5 degrees the average rate of reproduction changes two to three units for each 10-degree change in temperature. The standard deviation is expected to be approximately three units at temperatures below θ and to increase moderately as temperatures rise in a twenty-degree range above θ. He also mentions that for practical reasons he can perform the experiment at only 20 different temperatures.

The above discussion indicates that the relation between the average rate of reproduction and temperature may be represented by Figure 4.

[Figure 4 sketches average reproduction (roughly 16 to 22 units) against temperature (80 to 120 degrees), with slope .1 below θ−5, a plateau on (θ−5, θ+5), slope .2 above θ+5, and the labeled points D, E, F, and G used below.]

Figure 4. Graph of Reproduction versus Temperature

We shall illustrate the design of this experiment using two CKW processes. In order that the first process (X′_n) approach θ along a moderate slope and that the second process (X″_n) approach θ along a steep slope, we set X′₁ = 80 and X″₁ = 110. A reason in favor of using two such processes in practice rather than one is that two processes would usually generate observations more uniformly and over a larger range of temperatures. This allows the experimenter to study the whole relationship more reliably.

Since so many of the results in this paper depend on (3.15) being a reasonable approximation of the process, we must define an interval I containing θ for which this is true. Then we must design the process such that X_n is in this region at least for the last few steps. Since we are practically restricted to 20 steps, we shall choose the sequence {a_n} such that with high probability X_n has reached this region after 10 steps.

Before doing this we must choose the sequence {c_n}. This sequence should be chosen with two thoughts in mind.
Firstly, c_n should be large as long as |X_n − θ| > c_n; that is, while X_n is still searching for a neighborhood of θ. Secondly, when |X_n − θ| < c_n, c_n should be small enough that the expected correction term always directs the process to a region in which M(x) is near M(θ).

The maximum value of c_n for which the last statement is true can be approximated by considering Figure 4. Let D − F denote the length of the line DF, let c̄ = (D − F)/2, and let μ(c) denote the point toward which the expected correction directs the process when c_n = c. We see from the picture that if c < c̄, then |μ(c) − θ| < 5. The experimenter has noted that there is not a biologically significant difference between M(μ(c)) and M(θ) if |μ(c) − θ| < 5. Thus we seek the maximum value of c for which |μ(c) − θ| < 5.

Let p be such that G − F = 2pc. Then, if s₁ and s₂ represent the slopes of the lines FE and ED, we have

    E − G = 2cps₁ = 2c(1 − p)s₂.

For s₂ > s₁ these relations give c ≤ 5(s₁ + s₂)/(s₂ − s₁). Thus, for our example,

    c ≤ 5(.1 + .2)/(.2 − .1) = 15.

Consider the choice of c_n (= Cn^{-w}) for X′_n. Initially c_n should be large enough so that, if X_n + c_n < θ, then X′_{n+1} > X′_n with a high probability. Since P[|Y(x) − M(x)| < σ] ≈ .6, we may choose c₁ (≈ C) as the solution of

    2σ = M(x₁ + C) − M(x₁ − C) ≈ .2C,

or C ≈ 30. If, in addition, we require that c₂₀ ≤ 15, then w ≈ .3. With these values for C and w we find that c₂₀ ≈ .4C ≈ 12.

Similarly, for the choice of c_n for X″_n we obtain

    2σ = M(x₁ − C) − M(x₁ + C) ≈ .4C,

or C ≈ 15. In this case c₁ is sufficiently small to discount a serious bias due to the curvature near θ. Thus we can set w = 0, so that c_n ≡ 15.

It should be noted that we could start the process at n = k > 1 and let n = k, …, 20 + k. By doing so, we can adjust the rate of change of a_n and c_n through the chosen constants.
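The design arithmetic above is elementary enough to verify directly; this check is added here, using the example's slopes s₁ = .1, s₂ = .2 and σ = 3.

```python
# Maximum usable c from the slope geometry of Figure 4.
s1, s2, sigma = 0.1, 0.2, 3.0
c_max = 5 * (s1 + s2) / (s2 - s1)     # 5(.1+.2)/(.2-.1) = 15 degrees

# Initial c for X': solve 2*sigma = M(x1+C) - M(x1-C), slope .1 per degree.
C_prime = 2 * sigma / (2 * s1)        # .2C = 6 gives C = 30
# Initial c for X'': the steep side has slope .2 per degree.
C_dprime = 2 * sigma / (2 * s2)       # .4C = 6 gives C = 15
# With C = 30 and w = .3, the final step of X' uses
c20 = C_prime * 20 ** -0.3            # about .4C, i.e. roughly 12 degrees
```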
n In choosing a (=~a) for X' we note first that this process will n n be somewhat exploratory due to the small slope. want a n Therefore, we do not to decrease too rapidly as that would make the process too sensitive to the first few steps. 8£(80, 110), IXI Thus we might set a - (8 - 5)1 < 25. Since Thus a conservative value for A, which reasonably assures that X £1 for some n n 10 -.75 2(1.1)A L m m=l by (3.11). = 3/4. ~ .2(3.8)A = 25, ~ 10, is the solution of (4.25) We therefore set A = 30. In choosing a (=An-a ) for X" we note first that this process n n' should be in a neighborhood of e quite early due to the steep slope. Thus, we want the sequence {a } to optimize the behavior of X" under n this assumption. n Asymptotically this is done by setting a I, where B = 1 IM"(8) I, we note from =1 with the restriction that A > B/2 Mil (8) - 2w = 1. a conservative estimate of (3.14) and Figure 4. that -B(8-10~8) ~ ~ or M'(8-10) .1, To obtain 84 B Thus A > 2.5. - ~M"(e) Furthermore, by the note following (3.20), asymptotically the optimal choice of A is 5. However, if we require A to be greater than the solution of the equation corresponding to (4.25), we must have A > 25. There are numerous bases on which the sequence {T } of constraints n may be chosen. We illustrate one method for X". n In the region of steepest slope the expected correction term is AC- l n- l (M(x-c ) - M(x+c )) ~ 6AC- l n- l n n- If we set T = DCA-In then the maximum correction would be th n It is clear that for D > 6 this constraint does not seriously affect the approach of X' in the region of steep slope. n By letting D < 10, the possible bias arising from a large movement to the left of early stages of the experiment is diminished. of T n e in the Obviously, the effect diminishes very quickly as n gets large. We can get an approximation for the (3.16) and Lemmas (3.1) and (3.3). 
varianc~ of X' and X" using n n We have for X' that n n D(m,n) < IT (1 - 2(.1)30j-a)2 j=m+l giving Var(X 21 - elxk ) 2(9)20 -. ~ ';;;;"'>':1:-:2~ 75+. 6 75 (exp{6(20)-'} - 75) } - exp{-(20 ... k)6(20)' 85 by Lemma (3.3). = 10, For k we have • (4.26) If X 10 81 < 5, then 20 <IT· a·/'x10 ) _ .. 75 )2 25 (1 - 6.-. J j=ll ~ (11/20)-1.5+.6 exp{-12(10)(20)-·75}25 ~ by (4.24) with b m O. = 20. 10 and b , = m We have for X" that n D(m,n) ~ n .-1 2 '..II (1 - 2(.1)25 J ) • j=m+l Using Lemma (3.1), we obtain alx10 ).:. Var(X "1 2 ~ If Ix"n - al 2 (9) (~5) (20- 1 - (.5)5 10-1 ) (15) (10-1) .14. (4.27) < 5, then 2 E (X" 20 - alx ) 20 < 'll 10 - j=l1 (1 - 5j-l)2 25 :::.. (11/20) 10 :::.. .03 . ... 86 Before comparing (4.28) and (4.27), we must adjust for the fact that X" approached n e along the steepest slope. The corresponding ad- justment in (4.25) changes the restriction on AS to AS > 15. With this new value for AS we obtain We conclude this example by noting that much the same procedure would be followed if the SKW process were considered. Example 2. Suppose it has been noted that the addition of a moderate amount of a certain medication to the daily diet of some chickens has improved their health, and it is desired to estimate the optimal amount. In this type of problem, it is often difficult to quantify the observed variable (health). Moreover, the shape of the response func- tion depends on the manner in which we choose to quantify this variable. Thus it seems natural to use a method of estimation that depends on relative magnitudes rather than absolute magnitudes. We therefore illustrate this example using the SKW process. The corrective variable (o(X ,c » is defined to be -1 if the n n chickens which are given the amount of medication (X - c ) are n n healthier than the chickens which are given the amount of medication (X + c). n n As we noted in the discussion of Example 3t the random variable o(x,c ) generates a response curve that equals zero at n ~. 
n We shall suppose that the experimenter is reluctant to assume that •• 87 f lln r) 8. - If for no other reason than'"it is somewhat meaningless to view the" expected health as a function of diet, we shall also assume that the experimenter wishes to minimize E(8 - 8)2. Since it is not too unreasonable to think of ).In - 8 as the bias of X , we might let c equal some constant (greater than 1) times the n n bias we are willing to accept. such that c m The values C and w should be chosen is large during the period of time we expect the process to be searching for the region of 8, and approximately equal to c n when in this region. We have such little knowledge of A in this case that is is probably wise to set a < I and thereby obviate the risk of not satisfying the inequality 2A s ~ > S for the case a = 1. Moreover, due to the unknown nature of A, we are restricted to estimating the variance of X using the method proposed in the previous section. n This latter fact may cause us to diminish a still further and to compensate for the corresponding increase in the variance by decreasing AS. Using the principles employed in Example 4, we could evaluate the sequences {a } and {c } more specifically. n n .. CHAPTER 6. 6.1 SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH Sununary A large class of stochastic approximation procedures can be represented by (6.1) where, conditioned on X , Z(X ) is an observable random variable n n independent of X. and Z(X.),i 1 . 1 = 1, .•. , n-l. In particular the Kiefer- Wolfowitz (K-W) process admits this representation with a' n and Z(X ) n = Y(Xn-c n ) - Y(X +c ) where, conditional on X· n n n = an = x, c -1 n Y(X -c ) n n and Y(X +c ) are independent random variables with means M(x-c ) and n n n M(x+c ) respectively. n An extension of Burkholder's [2] theorem on the asymptotic distribution of the stochastic approximation procedures, which can be represented by (6.1), is given in Chapter 2. 
This extension allows the number of observations taken on each step to be of O(n^p), p ≥ 0, allows the random variable |Z(X_n)| to be constrained to lie between bounds of O(n^s), s ≥ 0, and admits a more general sequence of constants a′_n.

The practical consequences of this extension for the K-W process are discussed in Chapter 3. The small sample mean and variance of X_n, conditional on X_k, are studied through a consideration of some specific examples. The case in which M(x) is a quadratic function is considered, since M(x) is often closely approximated by a quadratic function in a neighborhood of θ. It is shown by means of some approximations that, after suitable normalization, the variance of the normal random variable to which the K-W process converges in distribution is a conservative estimate of Var(X_n | X_k). This asymptotic variance depends on the unknown parameters σ² and M″(θ). It is shown that an efficient estimator of M″(θ) is also a consistent estimator of M″(θ) when Y(x) ∼ N(M(x), σ²).

A modification of the K-W process first proposed by Wilde [13] is defined in Chapter 4. It satisfies (6.1) with a′_n = a_n c_n^{-1} and with Z(X_n) being the mean of [r_n] independent random variables defined by

    δ_i(x, c_n) =  1  if Y_i(X_n − c_n) > Y_i(X_n + c_n),
                =  0  if Y_i(X_n − c_n) = Y_i(X_n + c_n),
                = −1  if Y_i(X_n − c_n) < Y_i(X_n + c_n),    i = 1, …, [r_n],

where, conditional on X_n = x, Y_i(X_n − c_n) and Y_i(X_n + c_n), i = 1, …, [r_n], are independent random variables with means M(x − c_n) and M(x + c_n), respectively. This process is termed the Signed Kiefer-Wolfowitz (SKW) process. The asymptotic distribution of the SKW process is obtained under quite general assumptions on Y(x) if the sequences a_n and c_n are of the form n^{-p}, p > 0, and r_n is of the form n^p, p ≥ 0. It is seen that the orders of the variances of the K-W process and the SKW process are the same, but that the variance of the SKW process shows a greater dependency on the nature of the distribution of Y(x).
The variance of X_n conditional on X_k is again studied by considering specific examples. It is seen that this variance depends on an unknown parameter, the nature of which depends on the form of the distribution of Y(x). The estimator of the variance,

    V̂ar(X_n) = N^{-1} Σ_{m=1}^{N} b_m^S (X_{b_m} - X_n)^2,

where {b_m} is an increasing subsequence of the integers 1, ..., n, and n^S(X_n - θ) is asymptotically normal, is proposed since this estimator does not depend on the unknown parameter. This estimator is shown to be consistent for a somewhat restricted class of sequences {b_m}.

In Chapter 5, two hypothetical examples are given to illustrate the designing of an experiment in which these procedures are used.

6.2 Suggestions for Further Research

Although the results in this paper provide some added insight into the behavior of the K-W and SKW processes, and methods of estimating the variance of the estimate of θ are presented, both of these areas need further study. One very promising possibility would be a numerical study comparing these stochastic approximation procedures with the Method of Steepest Ascent.

LIST OF REFERENCES

1. Blum, J. R. 1954. Approximation methods which converge with probability one. Ann. Math. Statist. 25:382-386.

2. Burkholder, D. L. 1956. On a class of stochastic approximation processes. Ann. Math. Statist. 27:1044-1059.

3. Chung, K. L. 1954. On a stochastic approximation method. Ann. Math. Statist. 25:463-483.

4. Dvoretzky, A. 1956. On stochastic approximation. Proc. Third Berkeley Symposium. 1:39-56.

5. Fabian, V. 1967. Stochastic approximation with improved asymptotic speed. Ann. Math. Statist. 38:191-200.

6. Fabian, V. 1968. On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39:1327-1332.

7. Hodges, J. L., Jr., and E. L. Lehmann. 1956. Two approximations to the Robbins-Monro process. Proc. Third Berkeley Symposium. 1:95-104.

8. Kiefer, J. and J. Wolfowitz. 1952.
Stochastic approximation of the maximum of a regression function. Ann. Math. Statist. 23:462-466.

9. Robbins, H. E. and S. Monro. 1951. A stochastic approximation method. Ann. Math. Statist. 22:400-407.

10. Sacks, J. 1958. Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29:373-405.

11. Springer, B. G. F. 1969. Numerical optimization in the presence of random variability: the single factor case. Biometrika. 56:65-74.

12. Venter, J. H. 1967. An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38:181-190.

13. Wilde, D. J. 1964. Optimum seeking methods. Prentice-Hall, Inc., Englewood Cliffs, N. J.