EDWARD W. FREES. On Construction of Sequential Age Replacement Policies via Stochastic Approximation. (Under the direction of DAVID RUPPERT and GORDON SIMONS.)

Stochastic approximation (SA), a stochastic analog of iterative techniques for finding the zeroes of a function (e.g., Newton-Raphson), is applied to a problem in resource allocation, the age replacement policy (ARP) problem. Under an age replacement policy, a stochastically failing unit is replaced at failure or at time t, whichever comes first. When the form of the failure distribution is unknown, the question of estimating φ, the optimal replacement time (optimal in terms of achieving the smallest long-run expected cost), is important. SA is shown to be superior to an existing sequential methodology introduced by Bather (1977). Further, this application motivates novel ways in which the theory of SA is extended.

This research was supported in part by the National Science Foundation (Grant MCS-8100748).

ACKNOWLEDGEMENTS

Learning to do statistical research is like following a very mucky trail. It can be full of pitfalls, false paths and dead ends. I would now like to acknowledge some of the people who have helped me learn how to stick to the right trails.

Thanks go to my co-adviser, David Ruppert, for the generous use of his time. With the proper amount of guidance from him, solving this dissertation problem became a very educational and worthwhile experience for me. Thanks also go to Prof. Gordon Simons, co-adviser, for suggesting the problem and helping me through some initial rocky stages. To Prof. W. L. Smith, chairman of the dissertation committee, thanks for allowing me to use the department's word processor and his personal text-editing language (DQ3), on which this document was produced. The many painful drafts made for a much better product in the end. Thanks also go to the other members of the dissertation committee, Professors Janet Begun, M. R. Leadbetter and P. K. Sen.
The gratitude that I have for my parents and family is a constant in my life and need not really be mentioned. I remind them anyway of how much they are appreciated. Finally, thanks go to my nit-picking English editor, Marie Davidian, for her support in a non-stochastic manner.

Table of Contents

                                                              Page
Chapter 0  INTRODUCTION AND NOTATION                             1
Chapter 1  THE ARP MODEL
  1.1  The classical model                                       6
  1.2  The classical model with an unknown distribution
       function                                                 10
  1.3  The classical model viewed sequentially                  13
Chapter 2  STOCHASTIC APPROXIMATION AND AN APPLICATION
  2.1  Introduction                                             19
  2.2  Stochastic approximation-motivation                      21
  2.3  SA-short historical development                          24
  2.4  Stochastic approximation applied to the ARP              33
  2.5  Proofs                                                   38
Chapter 3  DEVELOPING AN OPTIMAL SA ARP METHODOLOGY
  3.1  Reducing the order of bias                               51
  3.2  Reducing the asymptotic mean square error                59
  3.3  An adaptive SA ARP methodology                           63
Chapter 4  ANOTHER APPLICATION OF SA-ESTIMATING THE MODE
  4.1  Introduction                                             74
  4.2  Notation and assumptions                                 77
  4.3  Univariate results and remarks                           78
  4.4  Multivariate analogs                                     82
  4.5  Proofs                                                   86
Chapter 5  SA-SOME EXTENSIONS OF THE THEORY
  5.1  Adaptive K-W procedures                                  93
  5.2  SA-representation theorem                               100
  5.3  SA-sequential fixed-width confidence interval           105
  5.4  ARP-sequential fixed-width confidence interval          111
Chapter 6  MONTE-CARLO STUDIES
  6.1  Introduction                                            117
  6.2  Preliminary investigation                               118
  6.3  Description of the simulation                           120
  6.4  Summary of results                                      125
Appendix A - Deterministic graphs and calculations             130
Appendix B - Monte-Carlo output                                133
Appendix C - Some alternative models                           139
Bibliography                                                   146
Index - Notation and assumptions                               153

CHAPTER 0

INTRODUCTION AND NOTATION

Consider a functioning unit with a specified life distribution F, where the probability of survival to age x is 1-F(x) = S(x). Let C₁ and C₂ be fixed, known costs with C₁ > C₂ > 0.
If the unit fails prior to φ units of time after its installation, it is replaced at that failure time with cost C₁. Otherwise, the unit is replaced φ units of time after its installation with cost C₂. It is assumed that replacement is immediate. Under the age replacement policy (ARP), the replacement unit is available from a sequence of such units that fail independently with the same distribution function F. The objective is to minimize the long-run accumulation of costs in some sense. A practical bound on the cost corresponds to φ = ∞, which is merely a failure replacement policy (replace only at failure). The cost function here is the expected long-run average cost,

    R₁(t) = {C₁F(t) + C₂S(t)} / ∫_0^t S(u)du.                        (1)

This cost function is motivated and developed in §1.1 (Section 1 of Chapter 1). Alternative cost functions are introduced in Appendix C.

This dissertation approaches the ARP problem as a problem in statistical estimation. In particular, we are interested in estimating φ, the (assumed unique and finite) optimal replacement time. We give a short review for i.i.d. (identically and independently distributed) data in §1.2, focusing on nonparametric methods. In §1.3 we introduce sequentially conducted experiments, where the estimators (φₙ) of the optimal replacement time are actually used in replacing the units. In particular, we show that if φₙ estimates φ well, then

    t⁻¹ Σ_{i=1}^{N(t)} {C₁I(Xᵢ < φᵢ) + C₂I(Xᵢ ≥ φᵢ)} → R₁(φ)  a.s.,   (2)

where X₁,...,Xₙ are i.i.d. observations with distribution function F and N(t) is the number of failures by time t. Thus, the actual cost achieved by the experimenter asymptotically is the same as if that person knew the optimal replacement time all along! To provide sequential estimators of φ with properties sufficient for (2), we introduce stochastic approximation (SA) as a sequential estimation technique. Readers familiar with this methodology may wish to skip §2.2 and §2.3, which motivate SA and give a short historical development.
A simple, easy-to-follow recipe for calculating the estimators (φₙ) is given in §2.4. In this section the assumptions are laid out and some asymptotic properties of these estimators are stated. The proofs are given in §2.5.

An estimator is said to be better than another if it has a better rate of convergence, or, having the same rate of convergence, a smaller asymptotic mean square error (MSE). While this criterion is not unique, it is certainly not unreasonable. Chapter 3 is devoted to improving the estimator introduced in §2.4 to achieve an optimal estimator. The procedure in §2.4 uses a simple estimator for the density of the units at a point. In §3.1 we replace this estimator with a more sophisticated one (primarily using kernel methods, although more generality is sought). The resulting procedure gives estimators that have optimal rates. Given these rates, §3.2 shows how parameters of the procedure should be chosen to minimize the asymptotic MSE. §3.3 gives a modification of §3.1 to actually achieve this optimal MSE.

In Chapter 4 we allow ourselves a bit of a digression. Using techniques similar to those introduced in Chapters 2 and 3, we show how to estimate the mode of a distribution. While the goals of this chapter are independent of the others, the techniques of the proofs are not. In particular, the proofs of §4.5 require an intimate familiarity with stochastic approximation. For this reason we have split the details of the proofs off into a separate section. The results given here are stronger than those currently in the literature and the procedure is practicable, but we felt these results are not all one might hope for. Hence, this chapter represents the current state of a work in progress.

We begin Chapter 5 by extending the notion of an adaptive stochastic approximation process in the K-W case. The first section is independent of the others in this chapter. In §5.2 we prove a representation theorem that is a modification of Ruppert's (1982).
The modified version we give is easier to apply to specific SA problems. In §5.3 we develop a sequential fixed-width confidence interval from the representation. This result is a considerable improvement over previous results concerning sequential confidence intervals in SA, since those results dealt only with the Robbins-Monro case. As an example, in §5.4 we apply this result to the ARP problem, showing how to develop a sequential fixed-width confidence interval for the ARP problem.

Having discussed the asymptotic properties of several estimators, we felt a need to provide an investigation of the finite sample properties of these estimators. Chapter 6 gives the details of a Monte-Carlo study performed using the procedure proposed in §2.4. The raw output of this study is given in Appendices A and B.

While most of the notation is introduced as it is needed, we mention some here. (2) refers to the second equation of the current chapter. (1.2) refers to the second equation of the first chapter. For some distribution function (d.f.) F, we use E and P to denote expectation and probability. σ(X) is the sigma field generated by the random variable (r.v.) X, Fₙ is a sigma field generated by previous events (to be specified with each application) and E_{Fₙ} is the conditional expectation given Fₙ. R^m denotes m-dimensional Euclidean space, and B^m are the Borel sets of R^m. Generally, when we write →, we mean deterministic or almost sure (a.s.) convergence. The symbol →_D is used for convergence in distribution. The symbol ∈ is used to denote membership of a set. Let {Xₙ} and {Yₙ} be sequences of random variables. We write Xₙ = O(Yₙ) if there exists a r.v. z such that |Xₙ|/|Yₙ| ≤ z a.s. for all n. If |Xₙ|/|Yₙ| → 0 a.s. we write Xₙ = o(Yₙ). We use O_p(·) and o_p(·) for the corresponding relationships in probability.
CHAPTER 1

THE ARP MODEL

§1.1 The classical model

An important area in resource allocation theory (operations research) concerns the optimal replacement and/or maintenance of a stochastically deteriorating system. Lately, this has been an active area of research for applied probabilists, cf. Beichelt (1981), Aven (1982). The age replacement policy (ARP) arises as a special case of the theory of maintenance policies, e.g., systems designed to maintain a network of components. By considering the more restrictive ARP introduced in Chapter 0, more detailed results can be presented.

One of the earliest ARP's minimizes the long-run expected cost per unit of time. This model assumes the time to replace a unit is negligible and that all costs are absorbed in C₁ and C₂. Denote N₁(t) (N₂(t)) to be the number of failure (planned) replacements in the time interval [0,t). Thus the cost over the interval is

    C(t) = C₁N₁(t) + C₂N₂(t),

and it is desired to find φ to minimize lim_{t→∞} E[C(t)/t]. We take φ to be a fixed but unknown constant and call the semi-open interval [0,φ) the optimal replacement interval.

This problem may be viewed as a simple application of renewal theory. Let X₁,...,Xₙ be an i.i.d. sample from the lifetime distribution F, and Zᵢ = min(Xᵢ,φ), i = 1,...,n. When we speak of a "lifetime" distribution function, we mean a d.f. F such that F(0) = 0. We also assume left-continuity of the d.f. and that the mean of F is finite. Define, ∀ t > 0, a random variable N*(t) = inf{n: Sₙ > t}, where Sₙ = Z₁+...+Zₙ. It is trivial to show that for fixed t, N*(t) is a proper stopping variable for {Sₙ}, and that S_{N*(t)-1} ≤ t < S_{N*(t)}. Since the Zᵢ's are non-negative, we may employ a lemma due to Wald (cf., Chow-Robbins-Siegmund, 1971, page 22). We have

    E Σ_{i=1}^{N*(t)} Uᵢ = E N*(t) E U₁,

where {Uᵢ} is any sequence of random variables that is independently adapted to {σ(X₁,...,Xₙ)}, not necessarily independent of N*(t).
Now, denoting N(t) = N₁(t) + N₂(t), and noting that N(t) = N*(t) - 1 - I(S_{N(t)+1} = t), we get (using I(·) to denote the indicator function)

    E[N₁(t)] = E[ Σ_{i=1}^{N(t)} I(Xᵢ < φ) ]
             = E N*(t) F(φ) - P(X_{N*(t)} < φ) - P(B),

where B is the event that S_{N(t)+1} = t and X_{N(t)+1} < φ. Similarly,

    E[N₂(t)] = E[ Σ_{i=1}^{N(t)} I(Xᵢ ≥ φ) ]
             = E N*(t) S(φ) - P(X_{N*(t)} ≥ φ) - P(B).

A well-known result from renewal theory (cf., Feller, 1971, page 330) gives lim_{t→∞} E N*(t)/t = 1/E Z, and thus lim_{t→∞} E N(t)/t = 1/E Z. Thus

    lim_{t→∞} E C(t)/t = lim_{t→∞} {C₁ E N*(t) F(φ) + C₂ E N*(t) S(φ)}/t
                       = {C₁F(φ) + C₂S(φ)}/E Z
                       = {C₁F(φ) + C₂S(φ)}/∫_0^φ S(u)du.

Thus, with R₁(·) defined in (0.1), we have

    lim_{t→∞} E C(t)/t = R₁(φ).                                      (1)

Alternatively, readers more familiar with renewal theory may recognize Σ_{i=1}^{N(t)} I(Zᵢ < φ) as a cumulative process. The proof of (1) then follows immediately from Lemma 4 of Smith (1955).

Solving for the optimal replacement interval [0,φ) for a given C₁, C₂, and distribution function F is thus reduced to a problem in analysis. Several examples appear in the literature wherein a parametric family is specified and values of φ are calculated for given costs and values of the parameters (see Barlow & Proschan, 1964, 1965). Glasser (1967) gives detailed graphs of costs and parameters for the truncated Normal, Gamma, and Weibull distributions.

Assume now that F is absolutely continuous. To find min R₁(t), set

    0 = (∂/∂t) R₁(t)|_{t=φ}
      => 0 = (C₁-C₂)f(φ) ∫_0^φ S(u)du - S(φ)[C₁F(φ) + C₂S(φ)].

This gives

    r(φ) ∫_0^φ S(u)du - F(φ) = C₂/(C₁-C₂),                           (2)

where f = F′ and r(·) = f(·)/S(·) is the failure or hazard rate. To ensure unique solutions to (2), it is common to restrict consideration to lifetime distributions having a strictly increasing failure rate. Further, if the failure rate increases to ∞, then the solution is finite. It was shown by Denardo and Fox (1967) that φ cannot occur where the failure rate is decreasing.
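Equation (2) pins down φ numerically once F is specified. As an illustration of the computation (not an example from the text), the sketch below assumes a hypothetical Weibull lifetime with shape 2 and scale 1 and the arbitrary costs C₁ = 5, C₂ = 1; it solves (2) by bisection and checks that the root minimizes R₁.

```python
import math

# Hypothetical example: Weibull lifetime with shape 2, scale 1, so that
# S(t) = exp(-t^2) and r(t) = 2t (strictly increasing to infinity).
# The costs C1 > C2 > 0 are arbitrary choices.
C1, C2 = 5.0, 1.0
S = lambda t: math.exp(-t * t)
F = lambda t: 1.0 - S(t)
r = lambda t: 2.0 * t

def int_S(t, m=2000):
    # trapezoidal approximation of the integral of S over [0, t]
    h = t / m
    return h * (0.5 * S(0.0) + sum(S(k * h) for k in range(1, m)) + 0.5 * S(t))

def g(t):
    # left side minus right side of the optimality equation (2)
    return r(t) * int_S(t) - F(t) - C2 / (C1 - C2)

# for an IFR distribution the left side of (2) is increasing in t,
# so bisection applies
lo, hi = 1e-6, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
phi = 0.5 * (lo + hi)

def R1(t):
    # long-run expected cost per unit time, equation (0.1)
    return (C1 * F(t) + C2 * S(t)) / int_S(t)

print(phi, R1(phi))  # the root of (2) should minimize R1
```

Substituting (2) back into (0.1) shows that R₁(φ) = (C₁-C₂)r(φ) at the optimum, which gives a further cross-check on the computed root.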
If the mean (μ) of F exists and is finite, we see that R₁(∞) = C₁/μ, and thus a measure of asymptotic relative efficiency of the ARP procedure is

    P(φ) = R₁(φ)/R₁(∞) = μ(C₁-C₂)r(φ)/C₁.

A lower bound on the ARP cost, R₁(t), is C₂/μ (this would occur if the unit were replaced an instant before it fails). This gives a lower bound for our measure of relative efficiency,

    P(φ) ≥ (C₂/μ)/R₁(∞) = C₂/C₁.

For this measure, the lower the value the better the performance of the policy (see Glasser, 1967). It was shown by Berg (1976) that the ARP is an optimal decision rule against a class of "reasonable" alternative maintenance policies. Thus, we confine our attention to rules of this type. Appendix C gives some variants of the age replacement policy.

§1.2 The classical model with an unknown distribution function

Consider the ARP as described in §1.1, but assume that the experimenter does not know the probability distribution underlying the failure of the units. The objective is still to determine the optimal replacement age, φ, in the most efficient manner possible. Let X₁,...,Xₙ be the i.i.d. times to failure of the first n units. From these observations, an estimate of φ can be constructed, φₙ = Tₙ(X₁,...,Xₙ). Let Ω be the set of all lifetime distribution functions with finite mean. Without restrictions on Ω there is little hope of determining finite sample properties of φₙ. Thus, we look at asymptotic properties of φₙ to determine the appropriate estimator. We assume there are sufficient conditions so that φ uniquely minimizes R₁(t) (e.g., the failure rate strictly increasing to ∞). One obvious choice of φₙ is that it be chosen so as to minimize Rₙ, the sample cost function defined by replacing F with Fₙ, the empirical distribution function. This estimator has been shown to be strongly consistent (cf., Bergman, 1979).
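A minimal sketch of this empirical-cost estimator follows; the Weibull data generator and the grid search are our own hypothetical stand-ins, not part of the text. Replacing F by the empirical d.f. Fₙ in (0.1) gives ∫_0^t Sₙ(u)du = n⁻¹ Σ min(Xᵢ,t), so the sample cost function has a simple closed form.

```python
import math, random

def sample_cost(xs, t, C1=5.0, C2=1.0):
    # R_n(t): equation (0.1) with F replaced by the empirical d.f. F_n;
    # the integral of S_n over [0, t] equals the sample mean of min(X_i, t)
    n = len(xs)
    Fn = sum(x < t for x in xs) / n
    denom = sum(min(x, t) for x in xs) / n
    return (C1 * Fn + C2 * (1.0 - Fn)) / denom

def phi_n(xs, C1=5.0, C2=1.0):
    # minimize the sample cost over a fixed grid of candidate ages
    grid = [0.01 * k for k in range(1, 301)]
    return min(grid, key=lambda t: sample_cost(xs, t, C1, C2))

random.seed(1)
# hypothetical Weibull(2, 1) failure times: S(t) = exp(-t^2)
xs = [math.sqrt(-math.log(1.0 - random.random())) for _ in range(4000)]
print(phi_n(xs))
```

Bergman's strong-consistency result says the minimizer converges to φ as the sample grows; no rate is claimed for this simple version.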
Using a slightly modified estimator, Arunkumar (1972) was able to get rates of convergence for his estimator and establish an asymptotic distribution. His work shall now be briefly outlined, since there turn out to be interesting connections with stochastic approximation's asymptotic theory. A key item in Arunkumar's procedure was his recognizing that the unmodified order statistics generate an empirical distribution function (EDF) that does not have the necessary asymptotic properties for getting convergence of φₙ to a nondegenerate distribution. By using a modified EDF, Arunkumar was able to establish these rates. As an example of a modified order statistic, let {Xᵢ} be an i.i.d. sample having a uniform density over (0,1) and {X_{i:n}} the corresponding order statistics. It is well known that the grid formed by {X_{i:n}}_{i=1}^n is dense with probability one on (0,1). For many applications, the grid formed by the order statistics is too narrow and we would like to have a "wider" one. Let W_{i:n} = X_{[i n^α]:n}, where α ∈ (0,1) and [·] is the greatest integer function. Again, it is well known that this grid becomes dense with probability one, and ∀ ε > 0,

    P{n^β(W_{i+1:n} - W_{i:n}) > ε} ≤ n^β E{W_{i+1:n} - W_{i:n}}/ε
                                    = n^β([(i+1)n^α] - [i n^α])/((n+1)ε)
                                    ≤ n^{α+β}/((n+1)ε) = o(1)

for β < 1-α. Thus, in our example, W_{i+1:n} - W_{i:n} = o_p(n^{-β}), uniformly in i.

Let {Xᵢ} be an i.i.d. sample with an absolutely continuous d.f. F. The failure rate is assumed to be continuous and strictly increasing to ∞. Two grids that are dense with probability one are established for Arunkumar's procedure. Let {w_{i,n}} be a possibly random grid on [0,∞) such that w_{i+1,n} - w_{i,n} = o_p(n^{-1/3}) and let {v_{i,n}} be a grid on [0,1] such that v_{i+1,n} - v_{i,n} = o_p(n^{-2/3}), both uniformly in i. Picking a point w*_{j,n} to be an arbitrary point from [w_{j,n}, w_{j+1,n}), an analog of the EDF (Fₙ) is defined as

    Fₙ*(x) = Fₙ(w*_{j,n})   for   w_{j,n} ≤ x < w_{j+1,n}.
Recall that the objective cost function to be minimized is R₁(t), which is given in (0.1). Now, define the transform

    Φ_F(y) = ∫_0^{F⁻¹(y)} (1-F(u))du

and, if y = F(t), we have

    R₁(t) = R₁(F⁻¹(y)) = {C₁y + C₂(1-y)}/Φ_F(y) = U_F(y).

To get the estimate of φ, we use the tₙ* ∈ {v_{i,n}} that minimizes U_{Fₙ*}(y) and take φₙ* = (Fₙ*)⁻¹(tₙ*). Not only does Arunkumar demonstrate the strong consistency of φₙ*, but he also obtains φₙ* - φ = o_p(n^{-1/3}), thus establishing a rate of convergence.

§1.3 The classical model viewed sequentially

The fixed sample procedure described by Arunkumar gives a method for finding an optimal φ under prescribed assumptions. It can be more desirable to introduce a procedure that may be used on an ongoing basis, constantly updating the estimate φₙ based on the new information from subsequent observations. In deciding whether or not to censor (truncate) an observation, there is a trade-off between minimizing costs and obtaining information about the tails of the distribution. Of course, as one gets more information about the tails, one should be able to construct a better estimator, or at least as good a one. We assume that the experimenter is not willing to make an assumption about the parametric form of the survival distribution, S(x) = 1-F(x), of the components, and thus we operate in a nonparametric context.

From the work of the previous section (Arunkumar), any procedure devised should provide estimates which converge in probability, and preferably almost surely. Second, this convergence should be fast enough so that the optimal asymptotic cost is achieved. Arunkumar's procedure does not use the estimator of φ to truncate the observations, and hence does not attempt to attain an optimal long-run cost. As a third criterion, we choose one procedure over another based on the rate of that convergence and, if the rates are the same, we use the asymptotic variance.
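Before turning to sequential schemes, the spacing bound behind Arunkumar's widened grid in §1.2 is easy to check by simulation. The sketch below is our own illustration, with the arbitrary choices n = 20000, α = 1/3 and β = 1/2; it builds W_{i:n} from a uniform sample and compares the largest spacing with n^{-β}.

```python
import random

random.seed(2)
n, alpha, beta = 20000, 1.0 / 3.0, 0.5          # beta < 1 - alpha
xs = sorted(random.random() for _ in range(n))  # uniform(0,1) order statistics

# widened grid W_{i:n} = X_{[i n^alpha]:n}: indices stepped by [n^alpha]
step = int(n ** alpha)
grid = xs[::step]
spacings = [b - a for a, b in zip(grid, grid[1:])]

# per the Markov-inequality bound, the spacings vanish faster than n^{-beta}
print(max(spacings), n ** (-beta))
```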
Bather (1977) introduces a scheme that meets the first two criteria. Suppose we have a sequence of i.i.d. lifetime random variables {Xᵢ} with a finite mean, and truncating random variables {ηᵢ}. The procedure depends on two sequences of constants, {bₙ} and {pₙ}, where

    1 = p₁ ≥ p₂ ≥ p₃ ≥ ...,   pₙ → 0   and   Σ_{n=1}^∞ pₙ = ∞.       (3)

Let {αᵢ} be an independent sequence of Bernoulli random variables with mean pᵢ and independent of {Xᵢ}. When αₙ = 1, we shall allow the observation to continue uncensored; otherwise we use the censoring time. Because of (3), we allow uncensored observations less and less often as n → ∞, but still infinitely many times. We use the sequence {bₙ} to get an approximation to φₙ, yet to be defined, as follows:

    ηₙ = max{b_m: b_m ≤ φₙ, m ≥ 0}   when αₙ = 0
    ηₙ = ∞                           when αₙ = 1.

Denote the observed outcomes as Zₙ = min(Xₙ, ηₙ). We use a reduced sample (RS) estimator Sₙ**(x) to estimate S(x), where

    Sₙ**(x) = Σ_{j=1}^n I(Zⱼ > x) / Σ_{j=1}^n I(ηⱼ > x).

Properties of the RS estimator have been studied where the censoring times ηₙ are i.i.d. random variables (cf., Breslow and Crowley, 1974). To complete the description of the sequential procedure, define an approximation to the cost function R₁(x) by

    Rₙ(x) = [C₁ - (C₁-C₂)Sₙ**(x)]/Φₙ(x),   where   Φₙ(x) = ∫_0^x Sₙ**(u)du.

It is assumed that φ uniquely minimizes R₁(x). The estimate φ_{n+1} is determined by minimizing Rₙ(x) with respect to x. Under the above assumptions, the following results have been shown by Bather:

    (a) Sₙ**(x) → S(x) a.s.
    (b) sup_x |Φₙ(x) - ∫_0^x S(u)du| → 0 a.s.
    (c) φₙ → φ a.s.
    (d) the long-run average cost of the scheme converges to R₁(φ) a.s.

Remarks concerning the Bather scheme:

(a) The introduction of {bₙ} is a technical device needed to get the uniform a.s. convergence of Φₙ(x), which, in turn, is essential for result (c).

(b) The use of the RS estimator Sₙ**(x) for the survival distribution does not use the information concerning observations censored by time x.
There are other estimators of the survival distribution (e.g., the Kaplan-Meier product limit estimator, the piecewise exponential, and the cumulative hazard; see Miller, 1981) which use more of the information and thus are thought to be better. When the censoring times are i.i.d. random variables independent of the lifetimes, this can be shown by comparing asymptotic variances.

(c) The use of the {αₙ} sequence of random variables to ensure the occasional untruncated observation is only one possible technique. Equivalently, the experimenter could specify a sequence of integers {r_k}, where r₁ < r₂ < r₃ < ..., such that r_k → ∞ as k → ∞ and when n ∈ {r_k} the observation is uncensored. Requiring that lim_{k→∞} r_k/k = ∞ is analogous to pₙ → 0. Censoring with Bernoulli random variables was adopted by Bather for mathematical convenience.

The important idea behind the paper of Bather is the introduction of a sequential scheme that an experimenter may employ to achieve the optimal asymptotic cost with only minimal knowledge of the d.f. In the theorem below, we give some simple conditions so that any sequential methodology meeting these assumptions will achieve the optimal asymptotic cost (see (0.2)). This opens the door for competing sequential methodologies. We then advocate stochastic approximation in the following chapters as a superior sequential scheme.

Theorem 1.1. Let {Xₙ} be a sequence of i.i.d. r.v.'s with d.f. F, having finite variance σ². Define Fₙ = σ(X₁,...,Xₙ). Suppose there exists a sequence of r.v.'s {φᵢ} such that φₙ is F_{n-1} measurable and φₙ → φ a.s., φ a continuity point of F. The sample cost of the first n items is

    Rₙ = Σ_{j=1}^n {C₁I(Xⱼ < φⱼ) + C₂I(Xⱼ ≥ φⱼ)}.

Let N(t) be the number of replacements by time t. Then

    R_{N(t)}/t → R₁(φ)   a.s.

Proof: Define Uₙ = Σ_{j=1}^n {I(Xⱼ < φⱼ) - F(φⱼ)}. By construction, {Uₙ,Fₙ} is a martingale and |U_{n+1} - Uₙ| ≤ 1.
Thus Σₙ n⁻² E_{Fₙ}(U_{n+1} - Uₙ)² < ∞ and, by Theorem 5 of Chow (1965, see also Stout, 1974, page 137), we have

    n⁻¹Uₙ → 0   a.s.                                                 (4)

Since φⱼ → φ a.s., we have n⁻¹ Σ_{j=1}^n F(φⱼ) → F(φ) a.s. by the continuity of F at φ. Combined with (4), this gives

    n⁻¹ Σ_{j=1}^n I(Xⱼ < φⱼ) → F(φ)   a.s.                           (5)

Similarly, define

    Vₙ = Σ_{j=1}^n {min(Xⱼ,φⱼ) - ∫_0^{φⱼ} S(u)du}.

Thus {Vₙ,Fₙ} is a martingale and |V_{n+1} - Vₙ| ≤ X_{n+1} + μ, where E Xᵢ = ∫_0^∞ S(u)du = μ. Thus

    Σₙ n⁻² E_{Fₙ}(V_{n+1} - Vₙ)² ≤ Σₙ n⁻² E(X_{n+1} + μ)²
                                 = (σ² + 4μ²) Σₙ n⁻² < ∞,

giving

    n⁻¹Vₙ → 0   a.s.                                                 (6)

Now, by the continuity of indefinite integrals, lim_n ∫_0^{φₙ} S(u)du = ∫_0^φ S(u)du a.s., and thus n⁻¹ Σ_{j=1}^n ∫_0^{φⱼ} S(u)du → ∫_0^φ S(u)du a.s. By (6), we have

    n⁻¹ Σ_{j=1}^n min(Xⱼ,φⱼ) → ∫_0^φ S(u)du   a.s.                   (7)

Now, let Lₙ = Σ_{j=1}^n min(Xⱼ,φⱼ), the sample time of the first n units. By construction, L_{N(t)} ≤ t < L_{N(t)+1}, giving

    [N(t)/(N(t)+1)]·[R_{N(t)}/N(t)]/[L_{N(t)+1}/(N(t)+1)]
        ≤ R_{N(t)}/t ≤ [R_{N(t)}/N(t)]/[L_{N(t)}/N(t)].              (8)

Now, by (5) and (7), the right hand side of (8) goes to R₁(φ) a.s. as t → ∞. Similarly, since N(t)/(N(t)+1) → 1, the left hand side of (8) goes to R₁(φ) a.s., and we have the result. □

CHAPTER 2

STOCHASTIC APPROXIMATION AND AN APPLICATION

§2.1 Introduction

This chapter introduces stochastic approximation (SA) to the reader and shows how it may be used in the age replacement policy problem. We begin, in §2.2, by attempting to motivate SA as a sequential estimation scheme. As a statistical tool, we then outline situations when it is appropriate to use SA. Given its use, we discuss broad issues that are important for its efficient implementation.

The following section moves from the broad overview to a more detailed discussion of SA by giving a short historical development. We begin with a description of the important early cases, the Robbins-Monro (R-M) and Kiefer-Wolfowitz (K-W) processes, and reasons for their importance. More or less chronologically, we then give highlights of the properties of these cases.
Searching for optimal properties, various modifications (such as transformations of the observations and adaptive procedures) and generalizations (such as more general algorithms) of the special cases are described. Also in this development, we mention some applications of SA, both to standard statistical problems and real-world situations. Remarks on proposed stopping rules are then made. Finally, we mention other areas of interest in SA not reviewed here.

In §2.4 we develop a simple procedure for constructing estimates of the optimal replacement time via SA. All assumptions and resulting properties are stated here. Further, we give a simple technique of transforming the observations to avoid constrained SA. In the following section we give the details of the proofs of the properties stated in §2.4. The useful theorems of Robbins-Siegmund (1971) and (a univariate version of) Fabian (1968ii) are stated. Further, many intermediate lemmas are proved in forms slightly more general than necessary for Chapter 2, as they will be immediately applicable without revision for the proofs of the Chapter 3 results.

§2.2 Stochastic approximation-motivation

Suppose that you, as a weapons advisor to a smaller, less technologically developed nation, have just purchased a large number of sophisticated bombs from a more advanced society. One of the pieces of information that the sellers neglected to give you (perhaps intentionally) was at what height a bomb should be dropped so that it will explode, say, 99 percent of the time. You know from your experience with other bombs that a sound bomb (not a dud) will explode sometimes at a certain height but not always. Factors such as prevailing winds, rain, even whether one is bombing an area containing forests versus a mountainous region come into play. You would like to find this 99th percentile point of heights for purposes such as fuel economy and radar evasion.
Keeping in mind the rather expensive nature of each bomb, you would like to have a scientific testing scheme to determine the point.

The above is a rather macabre example of how stochastic approximation might be utilized in a real-world setting (but there may be very little "practical" value in dropping bombs). Generally, there is an unknown function of the real line, R(x), and the objective is to get information about a single value. Typically, we wish to find the θ such that R(θ) = α, where α is fixed, or find θ such that R(θ) = sup_x R(x). This would be a standard problem in analysis except that it is presumed that there are errors in the measurement of R. In fact, we let

    Y_k = R(X_k) + δ_k,

where Y_k is the outcome of the experiment measured at X_k and {δᵢ} is a sequence of independent "error" (mean zero) random variables. The function R can be thought of as the regression of Y on X, i.e., E(Y|x). In the above example, we desired a procedure to find θ such that R(θ) = .99. Each Y_k was the outcome of dropping the bomb at height X_k (in this very simple example, either 0 or 1).

Stochastic approximation (SA) is an iterative procedure which determines a sequence of values that converges in some sense to the desired point θ. Thus, we use

    X_{n+1} = Xₙ + Gₙ(Yₙ),

where Gₙ is a known function of the observations at each stage. It can be thought of as a stochastic analog of well-known iterative procedures in analysis (e.g., the Newton-Raphson method). Indeed, much of the historical impetus for developing stochastic approximation comes from the desire to have iterative procedures where there are errors in the measurements. As an estimation technique, SA procedures have three main advantages:

(a) The procedure is nonparametric in form, i.e., no assumption concerning the distribution of the errors need be made.

(b) Experimental effort is not wasted in attempting to estimate the entire function R. We are merely interested in one special value it may take on.

(c) The procedure makes no assumption that the regression is linear in form, or indeed, of any other functional form. The procedure should be useful in difficult nonlinear regression problems. (Thus, SA procedures are nonparametric in a second sense as well as in (a) above.)

Given an initial starting value X₁, the function of the outcomes of the experiment Gₙ, and the outcomes Yₙ, we have identified {Xₙ}, a sequence of random variables. Note that at each stage of computing X_{n+1} = Xₙ + Gₙ(Yₙ), there is no reason to take only one observation. After specifying the problem by making assumptions about the form of Gₙ and the errors involved, there are 4 main areas of problems to be attacked in stochastic approximation:

(a) The question of the convergence of the sequence to θ, and the nature and rate of convergence.

(b) The asymptotic distribution of the sequence.

(c) The choice of the transformation Gₙ giving the smallest asymptotic variance.

(d) The optimal stopping time of the sequence.

For the earliest procedures proposed, Robbins-Monro (1951)
(c) The procedure makes no assumption that the regression is linear in form, or indeed, of any other functional form. The procedure should be useful in difficult nonlinear regression problems. (Thus, SA procedures are nonparametric in a second sense as well as (a) above.) Given an initial starting value Xl' the function of the outcomes of the experiment G , and the outcomes Y , we have n n identified {X }, a sequence of random variables. n at each stage of computing X +1 n = Note that X + G (Y ), there is no n n n reason to take only one observation. After specifying the problem by making assumptions about the form of G and the n errors involved, there are 4 main areas of problems to be attacked in stochastic approximation: (a) The question of the convergence of the sequence to e, and the nature and rate of convergence. (b) The asymptotic distribution of the sequence. (c) The choice of the transformation Gn giving the smallest asymptotic variance. (d) The optimal stopping time of the sequence. For the earliest procedures proposed, Robbins-Monro (1951) Page 24 and Kiefer-Wolfowitz (1952), largely been answered. the first three questions have The fourth is more open, although Sielken (1973), McLeish (1976) and stroup and Braun (1982) have provided some useful results. §2.3 Stochastic approximation-short historical development The first use of stochastic approximation appeared in 1951 by Robbins and Monro (R-M). finding a unique value 9 such that generality, we may take be measured with error. procedure ~=O), They were interested in R(9)=~ (without loss of R being a function that could They suggested the iterative Xn + l = Xn - an Yn , where {an} is a sequence of positive decreasing constants, and Y n is a conditionally (given xl, .•. ,X ) unbiased estimator of R(X n ). n Providing certain conditions on {an} and R hold, they showed that X is n a weakly consistent estimator of 9 (i.e., X -9=op(1». 
The following year, 1952, Kiefer and Wolfowitz (K-W) showed how these techniques could be applied to finding a maximum (assumed unique) of R(x). Their suggested iterative procedure is

    X_{n+1} = Xₙ + (aₙ/cₙ)(Y_{2n} - Y_{2n-1}),

where {aₙ} and {cₙ} are sequences of positive decreasing constants, and Y_{2n} and Y_{2n-1} are conditionally unbiased estimators of R(Xₙ + cₙ) and R(Xₙ - cₙ), respectively. Again, providing that certain conditions on {aₙ}, {cₙ} and R hold, K-W showed Xₙ - θ = o_p(1). The K-W result is important because it demonstrates how SA as introduced by R-M may be extended to find the zeroes of any derivative of a function (see Burkholder, 1956).

Having established these two basic procedures, various generalizations and developments of the convergence properties of {Xₙ} have been investigated over the years. Blum (1954i) established the strong consistency of {Xₙ} for both the R-M and K-W procedures using martingale techniques. Later that year, Blum (1954ii) demonstrated how SA may be applied in a multivariate context. Asymptotic normality of the estimates Xₙ, when suitably standardized, was first established by Chung (1954). He employed a method of moments technique and concurrently established orders of magnitude for the moments. Sacks (1958) showed asymptotic normality via characteristic functions and under weaker conditions. Fabian (1968ii) proved a very useful result for demonstrating asymptotic normality, stated in §2.5.

Proofs of consistency and other properties of the estimators have been made more elegant and obtained under weaker conditions as researchers have made greater use of probabilistic tools, especially martingale theory. Heyde (1974) emphasized this point and easily obtained a law of the iterated logarithm for the R-M and K-W procedures. Robbins and Siegmund (1971) proved a very useful inequality which will be stated and exploited later.
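The K-W recursion defined above can also be sketched in a few lines. This toy example is our own construction: the quadratic regression function R(x) = -(x - 2)², the Gaussian noise, and the sequences aₙ = 1/(4n), cₙ = n^{-1/6} are all arbitrary choices.

```python
import random

random.seed(4)

# Toy regression function with a unique maximum at theta = 2;
# each measurement is R(.) plus mean-zero noise
def Y(x):
    return -(x - 2.0) ** 2 + random.gauss(0.0, 0.5)

x = 0.0                                    # initial guess X_1
for n in range(1, 50001):
    a_n = 1.0 / (4.0 * n)                  # positive decreasing gains
    c_n = n ** (-1.0 / 6.0)                # spacing constants, c_n -> 0
    y_up, y_down = Y(x + c_n), Y(x - c_n)  # two observations per stage
    x = x + (a_n / c_n) * (y_up - y_down)  # K-W step

print(x)  # should settle near the maximizer 2
```

The difference quotient (Y_{2n} - Y_{2n-1})/(2cₙ) is a noisy estimate of R′(Xₙ), so the recursion is a noisy gradient ascent; shrinking cₙ reduces the bias of the difference quotient while the gains aₙ average out the noise.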
Kersting (1977ii) and Ruppert (1982) gave representations of the estimators as weighted averages of the errors for the R-M and K-W procedures, respectively. Whereas Kersting considered only the one-dimensional procedure with independent errors, Ruppert allowed for a multivariate procedure with dependent errors. These two papers subsume much of the earlier work on asymptotic normality and laws of the iterated logarithm.

Attempts have been made to alter the R-M and K-W processes to improve the quality of results. Transformation of the observations was first suggested in 1971. For the R-M case, i.e., where we wish to find θ such that R(θ) = α, assume Y_n - R(X_n) has a known density g, symmetric about zero. Then the observations can be transformed (via g'(·)/g(·)), so that the resulting estimator is asymptotically normal with mean 0 and variance equal to the Cramér-Rao lower bound. This result was proved independently by Anbar (1973) and Abdelhamid (1973). The K-W analog was also established by Abdelhamid. Fabian (1973) and Obremski (1976) gave modifications for an unknown g for the R-M and K-W cases, respectively.

Another modification allows m observations to be taken at each stage of the procedure, where m is a fixed positive integer. Using m > 1 is advantageous only for the K-W procedure. Questions of design enter and have been partially answered by Fabian (1968i). The choice of how large m should be is unanswered.

Venter (1967) proposed using random variables in lieu of the {a_n} sequence of the R-M procedure. This modification shall be termed an adaptive procedure, as it adapts the values of a_n closer to the optimal ones as the procedure progresses. His motivation was to provide a practical modification which achieves the best asymptotic variance. Conditions imposed by Venter were subsequently weakened by Fabian (1968ii).
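The transformation idea can be sketched for one concrete (hypothetical) case: double-exponential errors, for which -g'/g is proportional to the sign function, so the transformed step uses only the sign of each observation. The target and constants below are illustrative, not taken from the papers cited above.

```python
import random

def transformed_rm(noisy_r, x1, psi, a=1.0, n_steps=4000):
    """Transformed R-M sketch: each raw observation Y_n is passed through a
    transformation psi (ideally the score -g'/g of the known symmetric error
    density g) before the usual Robbins-Monro step."""
    x = x1
    for n in range(1, n_steps + 1):
        x = x - (a / n) * psi(noisy_r(x))
    return x

def laplace():
    # Double-exponential (Laplace) noise, for which the score is sign(y).
    return random.expovariate(1.0) * random.choice([-1.0, 1.0])

# Hypothetical example: R(x) = x - 2 with Laplace errors; root at theta = 2.
random.seed(2)
estimate = transformed_rm(lambda x: (x - 2.0) + laplace(),
                          x1=0.0, psi=lambda y: 1.0 if y > 0 else -1.0)
```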
Later, Fabian (1971) gave an adaptive procedure for the multivariate K-W situation to achieve the best asymptotic mean square error for the choice of the {a_n} sequence. All of these procedures require taking extra observations at various finite stages. Heuristically, this is considered undesirable since the information in these observations is not used to speed the process towards convergence. While asymptotically negligible, the effect of the observations could be important in finite samples. Anbar (1978) gave a procedure for the univariate R-M process which uses only one observation at a time. A least-squares technique is employed to recursively estimate the slope. The conditions imposed on the slope of the regression function are, however, rather restrictive.

These notions were formalized by Lai and Robbins in a series of papers. Calling Σ_{i=1}^n (X_i - θ)² the cost of the process at the nth stage, an adaptive univariate R-M procedure was given which minimizes this cost. It was demonstrated how the restrictive conditions imposed by Anbar can be removed. Lai and Robbins (1978, 1979 and 1981) gave a general theory for the univariate R-M process which subsumes much of the earlier work on this process. In particular, using techniques (involving functions of slow growth) borrowed from Gaposhkin and Krasulina (1974), they gave a generalized version of the almost sure representation of Kersting (1977ii). This representation yields (after verifying some rather messy technical details) several important large sample results, such as central limit theorems and laws of the iterated logarithm.

Aside from potential application to real-world problems, stochastic approximation has been useful as a new method applied to standard statistical problems. Several authors have investigated SA applied to non-linear regression problems, work which began with Albert and Gardner (1967) (cf. Has'minskii, 1977, Anbar, 1976i).
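A rough sketch of the adaptive, one-observation-per-stage idea follows; the recursive least-squares slope estimate, the truncation constants, and the toy model are illustrative assumptions in the spirit of the Venter/Anbar procedures, not their exact specifications.

```python
import random

def adaptive_rm(noisy_r, x1=0.0, n_steps=4000, b_lo=0.5, b_hi=10.0):
    """Adaptive R-M sketch: the slope of the regression function is estimated
    recursively by least squares from past (X_i, Y_i) pairs, and the gain
    a_n = 1/(n * b_n) uses the truncated slope estimate b_n."""
    x = x1
    sx = sy = sxx = sxy = 0.0
    for n in range(1, n_steps + 1):
        y = noisy_r(x)
        sx, sy, sxx, sxy = sx + x, sy + y, sxx + x * x, sxy + x * y
        den = sxx - sx * sx / n
        b = (sxy - sx * sy / n) / den if den > 1e-12 else 1.0
        b = min(max(b, b_lo), b_hi)   # truncate the slope estimate
        x = x - y / (n * b)           # one observation per stage
    return x

# Hypothetical example: R(x) = 3(x - 1) with N(0, 0.04) noise; root at 1.
random.seed(3)
estimate = adaptive_rm(lambda x: 3.0 * (x - 1.0) + random.gauss(0, 0.2))
```

The truncation of the slope estimate plays the same practical role as the truncations mentioned for Nevelson's procedure below: it keeps an early, badly estimated gain from destabilizing the recursion.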
Dupac (1977) and Lai and Robbins (1981) use SA in linear regression models having errors in the variables. Sakrison (1965) initiated research towards using SA to get asymptotically efficient recursive estimators. Assume that the density of the errors is known up to a finite number of parameters that we are trying to estimate. We then have available an unbiased estimator of the gradient of the Kullback-Leibler information number. Based on this observation, a sequence of estimators can be constructed that converges to the point of minimum of the Kullback-Leibler information. This sequence is not only consistent, but is efficient in the sense that when standardized it is asymptotically normally distributed with the smallest possible variance, the Cramér-Rao bound. Further work to weaken the conditions of Sakrison has been done by Has'minskii (1975) and Fabian (1978).

In other applications, Fritz (1973) used stochastic approximation to find the mode of a multivariate density. Dupac (1977) noted several interesting applications, such as finding the maximal eigenvalue of a symmetric matrix that can only be observed with error, and finding the unique solution of a linear system of equations that can only be observed with error (i.e., find x_0 such that Ax_0 = b, where Ax is observed with error). Nevelson (1975) applied stochastic approximation to M-estimation (minimization) problems, where we wish to find t_0 as a solution to the equation ∫ ψ(x,t) dF(x) = 0, where ψ is non-decreasing. Almost sure convergence and asymptotic normality of the recursive procedure are shown. An estimate of the derivative is used to get the same asymptotic variance as the standard M-estimate, and hence the procedure is adaptive. The procedure requires truncation above and below of the estimate of the function and of the derivative. Some sufficient conditions are given to remove the truncation of the estimate of the function. Conditions imposed by Nevelson requiring i.i.d.
observations were weakened by Holst (1980, 1982).

At times, mathematical techniques developed for use in SA schemes have been useful in investigating the properties of other recursive estimators. Revesz (1977) showed how a recursive estimator for a regression function has desirable asymptotic properties using techniques in the SA literature. More specifically, let {X_i, Y_i} be a sequence of i.i.d. random variables where we wish to estimate E(Y | X = x), equal to, say, r(x). Let A be some positive constant, a_n = A n^{-α} where 0 < α < 1, and K be a suitable kernel function. For r_0(x) = 0, let

r_{n+1}(x) = r_n(x) + K((x - X_n)/a_n)(Y_n - r_n(x))/(n a_n)

define the estimator of r(x). Revesz demonstrated the strong consistency and asymptotic normality (when standardized) of these estimators. Using different SA-type techniques, Isogai (1980) defines a similar recursive density estimator and demonstrates its asymptotic properties.

Examples in the literature of SA actually being employed are rather limited. There are only two known cases (to us). The more recent case is an application to the harvesting of Atlantic Menhaden (Ruppert et al., 1982). Earlier, Janac (1971) used SA in a Monte-Carlo simulation to find optimum parameters of an automobile suspension system. Fabian (1971) mentions some unpublished applications of SA schemes in chemical research. Textbook examples are more abundant; for example, several are provided in the monographs of Wasan (1969), Albert and Gardner (1967), and Wetherill (1966). Generally, SA is applied to questions that are framed in such a general manner that there is no real hope of adequate finite sample results. Naturally, asymptotic theory enters the picture, and a question the practical statistician asks is how many observations (or at what stage) are necessary to guarantee that the large sample results are approximately correct.
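The Revesz recursion above translates almost directly into code; the uniform kernel, the exponent α = 1/4, and the linear toy model are illustrative assumptions for the sketch.

```python
import random

def revesz_regression(data, x, A=1.0, alpha=0.25):
    """Revesz-type recursive regression estimator at a fixed point x:
    r_{n+1}(x) = r_n(x) + K((x - X_n)/a_n)(Y_n - r_n(x))/(n a_n),
    with a_n = A n^{-alpha}, r_0(x) = 0, and the uniform kernel
    K(u) = (1/2) 1{|u| <= 1}."""
    r = 0.0
    for n, (xn, yn) in enumerate(data, start=1):
        an = A * n ** (-alpha)
        k = 0.5 if abs((x - xn) / an) <= 1.0 else 0.0
        r = r + k * (yn - r) / (n * an)
    return r

# Hypothetical model: Y = 2X + 1 + noise with X uniform on (-1, 1);
# the estimate at x = 0 should approach r(0) = 1.
random.seed(4)
data = [(u, 2.0 * u + 1.0 + random.gauss(0, 0.1))
        for u in (random.uniform(-1, 1) for _ in range(20000))]
estimate = revesz_regression(data, x=0.0)
```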
For a univariate R-M procedure, Sielken (1973) gave a stopping time N_{d,α} for a fixed-width confidence interval that is asymptotically consistent (in the usual stopping time sense, i.e., lim_{d→0} P(|X_{N_{d,α}} - θ| ≤ d) = 1 - α). More generally, for a univariate R-M process with martingale difference errors, McLeish (1976) provided a functional central limit theorem using weak convergence arguments. An immediate corollary is a useful stopping rule for the process. We note here that Dvoretzky (1956) gave a finite sample result, and we give some details of those results in §6.2.

Many advances in the theory and application of SA have not been reviewed at this time; the literature is vast. Some authors, e.g., Dvoretzky (1956), Ljung (1978), Kushner and Clark (1978), take a broader view of stochastic approximation and liken it to a stochastic version of recursive numerical analysis. They prefer to analyze more general algorithms (of which R-M and K-W are special cases) which may be viewed as the sum of a deterministic convergent process and a driving stochastic component. From this viewpoint, their algorithms often exhibit the behavior of solutions to stochastic differential equations. Other areas of interest arise naturally from this viewpoint, e.g., stochastic approximation algorithms where the solutions are constrained to lie in some specified set (see Kushner and Clark, 1978). Another generalization which some researchers have addressed is the problem where the function itself changes with time. This was first formulated by Dupac (1965, 1966), who termed this problem dynamic SA, and it has since been addressed by Uosaki (1974) and Ruppert (1979).

§2.4 Stochastic approximation applied to the ARP

We now develop a sequential procedure to estimate the optimal replacement time in a cost efficient manner (see Theorem 1.1). In particular, we use SA as an estimation technique to find φ, the (assumed) unique, finite minimizer of the cost function R_1(·) (see (0.1)).
All assumptions are explicitly stated in this section, as well as the properties of the resulting estimators. The proofs of these properties are in §2.5.

Define M(t) = (C_1 - C_2) f(t) ∫_0^t S(u) du - S(t){C_1 F(t) + C_2 S(t)}. Now ∂/∂t R_1(t) = K_t M(t), where K_t is positive, and thus by assuming that R_1(t) is uniquely minimized at some finite point φ, we have that M(t)(t - φ) > 0 for each t ≠ φ. Thus, instead of looking for the minimum of R_1(t), we wish to find the zero of M(t). If there existed an unbiased estimator for M(t), we could use this estimator in conjunction with the R-M procedure and straightforwardly get "nice" estimates of φ. Unfortunately, such is not the case, and we must deal with the bias in the Kiefer-Wolfowitz fashion. In this section, observations shall be taken only in pairs, and we introduce the simplest practicable estimator.

Suppose φ_1 is a nonnegative random variable such that E φ_1² < ∞. Let {X_{1n}}_{n=1}^∞ and {X_{2n}}_{n=1}^∞ be i.i.d. sequences of r.v.'s with distribution function F, where {X_{1n}} is independent of {X_{2n}}. The procedure will use two sequences of nonnegative real constants, {a_n} and {c_n}. We define Z_{in} = min(X_{in}, φ_n + c_n), i = 1,2, n = 1,2,..., and let the succeeding estimates of the optimal censoring time, φ_n, be defined by the recursive formula φ_{n+1} = φ_n - a_n M_n(φ_n), where M_n(t) is our estimate of M(t), yet to be defined.

Before discussing our choice of M_n and the properties of the resulting estimator, it is convenient at this time to deal with the possibility of the procedure forcing φ_n < 0 for some finite n. If this is the case, then M_n(φ_n) is not properly defined. This can be handled in two ways. We define the truncation operation

[x]_a^b = a if x < a;  x if a ≤ x ≤ b;  b if x > b,

and use the recursive formula φ_{n+1} = [φ_n - a_n M_n(φ_n)]_a^b. For our case, we choose b = ∞ and a to be some small positive number but smaller than φ. One way to choose a is to presume there exists a known φ_0 such that φ_0 ≤ φ.
Then, if a = C_2 φ_0 / C_1 and t ≤ a, R_1(t) ≥ C_2/t ≥ C_1/φ_0 ≥ C_1/φ ≥ R_1(φ). For an application of the approach, see Albert and Gardner (1967, page 9).

Another method of dealing with this issue is to introduce a known, strictly increasing function g: R → [0,∞). We shall assume the first s+1 derivatives of g exist everywhere and are bounded, where s is a fixed positive integer (e.g., g(t) = log(1 + e^t)). With the assumptions on g, we now find the t (say φ*) minimizing R_1(g(t)), giving g(φ*) = φ. Since

sign{∂/∂t R_1(g(t))} = sign{g'(t) M(g(t))},

we may use as our procedure

(1)  φ*_{n+1} = φ*_n - a_n M_{g,n}(φ*_n),

where M_{g,n} is an estimator of g'(t) M(g(t)), yet to be defined. In many practical situations, φ will be sufficiently far from 0 (relative to the sequences {a_n} and {c_n}), so there is little chance that negative values of the approximations will be obtained, and thus we may use g(t) = t. We drop the star superscript of φ*, and note that, as a final step in our iteration, we take g(φ_n) as an estimate of the true optimal replacement time.

As the final step in specifying the SA procedure, we define an estimator of g'(t) M(g(t)), called M_{g,n}(t). For i = 1,2, let

Z_{in} = min(X_{in}, g(φ_n + c_n)),  F_{in}(t) = I{Z_{in} ≤ t},  S_{in}(t) = 1 - F_{in}(t),
fg_{1n}(t) = I{g(t - c_n) ≤ Z_{1n} < g(t + c_n)}/(2c_n),

and

(2)  M_{g,n}(t) = (C_1 - C_2) fg_{1n}(t) ∫_0^{g(t)} S_{2n}(u) du - g'(t) S_{1n}(g(t)){C_1 F_{2n}(g(t)) + C_2 S_{2n}(g(t))}.

We will be able to calculate the conditional expectation of M_{g,n} and quantify the resulting biases in a meaningful way. We use (F*g)^{(r)}(x) = ∂^r/∂t^r F(g(t))|_{t=x}. For convenience, a list of the most important assumptions is provided below:

A0. The distribution function F of the i.i.d. observations is absolutely continuous having density f, with F(0) = 0 and ∫ t² dF(t) < ∞.

A1. g is a known, strictly increasing, continuous function such that g: R → [0,∞) and the first s+1 derivatives exist and are bounded.

A2. For each x ≠ φ, (x - φ) M(g(x)) > 0.

A3. c_n → 0, Σ a_n = ∞, Σ a_n c_n < ∞, and Σ a_n²/c_n < ∞.
A4. (F*g)^{(2)}(x) exists for each x ∈ R, and there exist constants A, B > 0 such that |(F*g)^{(2)}(x)| ≤ A + B g(x) for all x.

A4'. (F*g)^{(1)}(x) and (F*g)^{(2)}(x) exist for each x ∈ R and are bounded.

A5. (F*g)^{(3)}(x) exists for each x ∈ R and is bounded.

A5'. (F*g)^{(1)}(x) and (F*g)^{(3)}(x) exist for each x ∈ R and are bounded.

A6. There exists p > 2 such that ∫ t^p dF(t) < ∞. Let q be defined by 2/p + 1/q = 1.

A7. For some A, C > 0 and γ ∈ (0,1) such that 1 - γ < 2Γ, we have a_n = A n^{-1} and c_n = C n^{-γ}, where Γ = A (g'(φ))² M'(g(φ)).

A0 is always assumed to hold true. We note here that the above assumptions are not independent. A3 is a weaker condition than A7. If A4' (A5') is true, A4 (A5) is satisfied. We use the weaker assumptions initially, and replace these with stronger assumptions as we strive for better results. We remark here that a typical SA assumption, stronger than A2, namely inf{|M(g(x))| : δ < |x - φ| < δ^{-1}} > 0 for every δ > 0, is not needed here due to the assumed continuity of M(·) and g(·).

The asymptotic variance of the φ_n will be proportional to Σ, where Σ = (C_1 - C_2)² (F*g)^{(1)}(φ) ∫_0^{g(φ)} u S(u) du. We now state some asymptotic properties of our procedure.

Theorem 2.1  Assume A1-A3 and A4 or A5. Then, for the procedure defined in (1) and (2), φ_n → φ a.s.

Theorem 2.2  Assume A1, A2, A6, A7, and either
(a) A4' holds and 1/3 < γ ≤ 1/q, or
(b) A5' holds and 1/5 < γ ≤ 1/q.
Then, n^{(1-γ)/2}(φ_n - φ) →_D N(0, A² C^{-1} Σ /(2Γ - 1 + γ)).
(c) Further, assume A5' holds, γ = 1/5, and (F*g)^{(3)}(t) exists and is continuous in a neighborhood of φ. Then,

n^{2/5}(φ_n - φ) →_D N(2T/(2Γ - 4/5), A² C^{-1} Σ /(2Γ - 4/5)),

where T = A C² (C_1 - C_2)(∫_0^{g(φ)} S(u) du)(F*g)^{(3)}(φ)/6.

§2.5 Proofs

We shall be working with martingale techniques. Let ℱ_n = σ(φ_1; Z_{ij}, i = 1,2, j = 1,...,n-1), and write E_n for conditional expectation given ℱ_n. Define U_n = φ_n - φ. The bias term is represented by Δ_n = E_n{M_{g,n}(φ_n) - M(g(φ_n)) g'(φ_n)} and the important part of the variance term by Σ_n = E_n M²_{g,n}(φ_n).
We work with only parts of M_{g,n}(t) at a time; thus define

M_{1n}(g(t)) = (C_1 - C_2){fg_{1n}(t) ∫_0^{g(t)} S_{2n}(u) du - g'(t) f(g(t)) ∫_0^{g(t)} S(u) du},
M_{2n}(g(t)) = g'(t){S(g(t))[C_1 F(g(t)) + C_2 S(g(t))] - S_{1n}(g(t))[C_1 F_{2n}(g(t)) + C_2 S_{2n}(g(t))]}.

This gives

(3)  M_{g,n}(t) - g'(t) M(g(t)) = M_{1n}(g(t)) + M_{2n}(g(t)).

A finite sample estimate of Σ is V_n², where V_n = c_n^{1/2}(M_{g,n}(φ_n) - g'(φ_n) M(g(φ_n)) - Δ_n). All relationships between random variables are meant to hold with probability one, unless specified otherwise. Finally, K_1, K_2, etc. will denote the appropriate constants to be used in our inequalities. Many of the lemmas will be given in a form more general than immediately required.

The proof of Theorem 2.1 depends on a result due to Robbins and Siegmund (1971), given here for convenience.

Theorem (Robbins-Siegmund)  Let ℱ_n be a nondecreasing sequence of sub-σ-fields of ℱ. Suppose that X_n, β_n, ξ_n and ζ_n are nonnegative ℱ_n-measurable random variables such that

E(X_{n+1} | ℱ_n) ≤ X_n(1 + β_n) + ξ_n - ζ_n  for n = 1,2,....

Then, lim_{n→∞} X_n exists and Σ ζ_n < ∞ on {Σ β_n < ∞, Σ ξ_n < ∞}.

Lemma 2.1  Assume A1 and A2 hold. Then there exist nonnegative constants K_1 and K_2 such that
(a) if A4 holds, then |Δ_n| ≤ c_n [K_1 + K_2 |φ_n - φ|];
(b) if A5 holds, then |Δ_n| ≤ c_n² K_1.

Proof: By the definition of Δ_n, we have

(4)  Δ_n = E_n{M_{g,n}(φ_n) - g'(φ_n) M(g(φ_n))} = (C_1 - C_2) ∫_0^{g(φ_n)} S(u) du · E_n{fg_{1n}(φ_n) - g'(φ_n) f(g(φ_n))}.

(5)  Since ∫_0^t S(u) du ≤ μ for all t, we have |Δ_n| ≤ K_3 |[F(g(φ_n + c_n)) - F(g(φ_n - c_n))]/(2c_n) - (F*g)^{(1)}(φ_n)|.

If A4 is true, then

[F(g(φ_n + c_n)) - F(g(φ_n - c_n))]/(2c_n) = (F*g)^{(1)}(φ_n) + c_n²/(4c_n)[(F*g)^{(2)}(θ_1) - (F*g)^{(2)}(θ_2)],

where |θ_i - φ_n| ≤ c_n, i = 1,2. Hence,

(6)  |Δ_n| ≤ K_3 c_n/4 |(F*g)^{(2)}(θ_1) - (F*g)^{(2)}(θ_2)|,

and, since |(F*g)^{(2)}(x)| ≤ K_4 + K_5 g(x),

(7)  |(F*g)^{(2)}(x)| ≤ K_0 + K_1 |x - φ|.

Since |θ_i - φ| ≤ |φ_n - φ| + c_n, i = 1,2, we have |Δ_n| ≤ (K_3 c_n/4) · 2(K_0 + K_1 |φ_n - φ|), and thus the result in (a) is true.
If A5 holds,

(8)  [F(g(φ_n + c_n)) - F(g(φ_n - c_n))]/(2c_n) = (F*g)^{(1)}(φ_n) + c_n³/(12c_n)[(F*g)^{(3)}(θ_1) + (F*g)^{(3)}(θ_2)],

where |θ_i - φ_n| ≤ c_n, i = 1,2. From (5) and (8), we have |Δ_n| ≤ K_3 c_n³/(12c_n). Thus, the result in (b) holds. □

Lemma 2.2  For some nonnegative constants K_1-K_6, if A1, A2, and A4 or A5 hold, then
(a) Σ_n ≤ K_1 + K_2 E_n fg²_{1n}(φ_n), and
(b) Σ_n ≤ (K_5 + K_6 U_n²)/c_n.

Proof:

M²_{g,n}(t) = [(C_1 - C_2) fg_{1n}(t) ∫_0^{g(t)} S_{2n}(u) du - g'(t) S_{1n}(g(t)){C_1 F_{2n}(g(t)) + C_2 S_{2n}(g(t))}]²
  ≤ (C_1 - C_2)² fg²_{1n}(t) Z²_{2n} + (g'(t))² C_1².

By conditional independence, we get

Σ_n = E_n M²_{g,n}(φ_n) ≤ (C_1 g'(φ_n))² + (C_1 - C_2)²(μ² + σ²) E_n fg²_{1n}(φ_n) ≤ K_1 + K_2 E_n fg²_{1n}(φ_n),

proving (a). Further,

E_n fg²_{1n}(φ_n) = 1/(2c_n) E_n fg_{1n}(φ_n) = 1/(4c_n²)[F(g(φ_n + c_n)) - F(g(φ_n - c_n))].

If A4 holds,

E_n fg²_{1n}(φ_n) = 1/(4c_n²){2c_n (F*g)^{(1)}(φ_n) + c_n²/2 [(F*g)^{(2)}(θ_1) - (F*g)^{(2)}(θ_2)]}.

Now,

(F*g)^{(1)}(φ_n) = (F*g)^{(1)}(φ) + (φ_n - φ)(F*g)^{(2)}(θ_3) ≤ (F*g)^{(1)}(φ) + |φ_n - φ|[K_1 + K_2 g(θ_3)] ≤ K_4 + |φ_n - φ|[K_5 + K_6 |φ_n - φ|] ≤ K_7 + K_8 U_n².

Thus, (b) holds. If A5 holds, from (a) and (8),

Σ_n ≤ K_1 + K_2 [1/(4c_n²){2c_n (F*g)^{(1)}(φ_n) + c_n³/6 [(F*g)^{(3)}(θ_1) + (F*g)^{(3)}(θ_2)]}].

Since

|(F*g)^{(1)}(φ_n)| = |(F*g)^{(1)}(φ) + (φ_n - φ)(F*g)^{(2)}(φ) + ((φ_n - φ)²/2)(F*g)^{(3)}(θ)| ≤ K_9 + K_{10} U_n²,

we get the result in (b) when A5 holds. □

Proof of Theorem 2.1: Note first that φ_n is ℱ_n-measurable. Now,

U²_{n+1} = (φ_{n+1} - φ)² = (φ_{n+1} - φ_n + φ_n - φ)² = (a_n M_{g,n}(φ_n))² + U_n² - 2a_n(φ_n - φ) M_{g,n}(φ_n).

Since

E_n(φ - φ_n) M_{g,n}(φ_n) = (φ - φ_n)[g'(φ_n) M(g(φ_n)) + Δ_n] ≤ |U_n Δ_n| - (φ_n - φ) g'(φ_n) M(g(φ_n)),

we get, by taking conditional expectations with respect to ℱ_n,

(9)  E_n U²_{n+1} ≤ U_n² + 2a_n |U_n Δ_n| + a_n² Σ_n - 2a_n(φ_n - φ) g'(φ_n) M(g(φ_n)).

By Lemma 2.1, if A4 holds, |U_n Δ_n| ≤ c_n(K_3 + K_4 U_n²); the same bound holds if A5 holds. Thus, by (9) and Lemma 2.2(b),

E_n U²_{n+1} ≤ U_n² + 2a_n c_n(K_3 + K_4 U_n²) + a_n²(K_5 + K_6 U_n²)/c_n - 2a_n g'(φ_n)(φ_n - φ) M(g(φ_n))
  = U_n²[1 + 2K_4 a_n c_n + K_6 a_n²/c_n] + 2K_3 a_n c_n + K_5 a_n²/c_n - 2a_n g'(φ_n)(φ_n - φ) M(g(φ_n)).

By Robbins-Siegmund and A3, we get φ_n → z a.s. for some finite r.v.
z and

Σ_{n=1}^∞ a_n g'(φ_n)(φ_n - φ) M(g(φ_n)) < ∞ a.s.

Since Σ a_n = ∞, we have the result. □

We now show the asymptotic normality of the estimators, suitably standardized, in a series of lemmas culminating with the application of Fabian's result. We begin with an easy result which quantifies the order of magnitude of the bias.

Lemma 2.3  Assume A1-A3 hold.
(a) If A4 holds, then Δ_n = O(c_n) a.s.
(b) If A5 holds, then Δ_n = O(c_n²) a.s.
(c) If A4 or A5 holds, and (F*g)^{(3)}(t) exists and is continuous in a neighborhood of φ, then

lim_{n→∞} Δ_n/c_n² = (C_1 - C_2)(∫_0^{g(φ)} S(u) du)(F*g)^{(3)}(φ)/6 a.s.

Proof: By (6) of Lemma 2.1, if A4 holds, we have |Δ_n| ≤ K_3 c_n/4 |(F*g)^{(2)}(θ_1) - (F*g)^{(2)}(θ_2)|, where |θ_i - φ_n| ≤ c_n, i = 1,2. By Theorem 2.1, θ_i → φ a.s., i = 1,2, and we get the result in (a). The result in (b) is true from Lemma 2.1(b). If the assumption in (c) holds, from (4) and (8), we get

Δ_n = (C_1 - C_2) ∫_0^{g(φ_n)} S(u) du · E_n{fg_{1n}(φ_n) - (F*g)^{(1)}(φ_n)} = (C_1 - C_2)(∫_0^{g(φ_n)} S(u) du) c_n³/(12c_n){(F*g)^{(3)}(θ_1) + (F*g)^{(3)}(θ_2)}.

We get the result from Theorem 2.1. □

We now give a univariate statement of the theorem due to Fabian (1968ii). The succeeding lemmas then check that the conditions of Fabian's theorem hold.

Theorem (Fabian)  Suppose ℱ_n is a nondecreasing sequence of sub-σ-fields of ℱ. Suppose U_n, V_n, T_n, Γ_n, and Φ_n are random variables such that Γ_n, Φ_{n-1}, and V_{n-1} are ℱ_n-measurable. Let α, β, T, Φ, Γ, and Σ be real constants with 0 < α ≤ 1, 0 ≤ β, and β_+ < 2Γ, where β_+ = β if α = 1 and β_+ = 0 if α < 1. Suppose Γ_n → Γ > 0, Φ_n → Φ, T_n → T or E|T_n - T| → 0, E_n V_n = 0, and E_n V_n² → Σ, and, with σ²_{j,r} = E[V_j² I{V_j² ≥ r j^α}], suppose either

lim_{j→∞} σ²_{j,r} = 0 for every r,  or  α = 1 and lim_{n→∞} n^{-1} Σ_{j=1}^n σ²_{j,r} = 0 for every r.

Suppose

U_{n+1} = U_n[1 - n^{-α} Γ_n] - n^{-(α+β)/2} Φ_n V_n + n^{-α-β/2} T_n.

Then,

n^{β/2} U_n →_D N(T/(Γ - β_+/2), Σ Φ²/(2Γ - β_+)).
The lemma below is stated in a more general form than immediately necessary, because we will use it when discussing more general density estimators.

Lemma 2.4  Let fg_n be an estimate of the density used in M_{g,n} that is conditionally (given ℱ_n) independent of Z_{2n}. Assume A1, A6, A4 or A5, and that (F*g)^{(1)}(x) is bounded. For the p in A6, let K_1, K_2, K_3 and t be nonnegative real constants with t ≤ p. Then,

E_n V_n^t ≤ c_n^{t/2}[K_1 + K_2 E_n |fg_n(φ_n)|^t + K_3 |E_n fg_n(φ_n)|^t].

Proof: Now, V_n = c_n^{1/2}(M_{1n}(g(φ_n)) + M_{2n}(g(φ_n)) - Δ_n), and

(i)  |M_{2n}(g(φ_n))| ≤ C_1 g'(φ_n) ≤ K_4,
(ii)  |Δ_n| = |(C_1 - C_2) ∫_0^{g(φ_n)} S(u) du · E_n{fg_n(φ_n) - (F*g)^{(1)}(φ_n)}| ≤ K_5 + K_6 |E_n fg_n(φ_n)|,
(iii)  |M_{1n}(g(φ_n))| = |(C_1 - C_2){fg_n(φ_n) ∫_0^{g(φ_n)} S_{2n}(u) du - (F*g)^{(1)}(φ_n) ∫_0^{g(φ_n)} S(u) du}| ≤ K_7 + K_8 Z_{2n} |fg_n(φ_n)|.

Recall that for any nonnegative constants a, b, c, d, (a + b + c)^d ≤ 3^d(a^d + b^d + c^d). Thus, we have

(V_n²/c_n)^{t/2} ≤ 3^t [|M_{1n}(g(φ_n))|^t + |M_{2n}(g(φ_n))|^t + |Δ_n|^t].

We get the result by the above inequalities and taking conditional expectations. □

Lemma 2.5  Assume A1-A3, and A4 or A5. Then E_n V_n = 0 and lim_{n→∞} E_n V_n² = Σ a.s.

Proof: Recall V_n = c_n^{1/2}(M_{g,n}(φ_n) - g'(φ_n) M(g(φ_n)) - Δ_n). The first part is obvious by the definition of Δ_n. Now,

M_{1n}(g(t)) = (C_1 - C_2)[fg_{1n}(t) ∫_0^{g(t)} S_{2n}(u) du - g'(t) f(g(t)) ∫_0^{g(t)} S(u) du].

Note that for v > u, S_{2n}(v) S_{2n}(u) = S_{2n}(v). Thus, E(∫_0^t S_{2n}(u) du)² ≤ E Z²_{2n} ≤ μ² + σ² < ∞, and

E(∫_0^t S_{2n}(u) du)² = E ∫_0^t ∫_0^t S_{2n}(v) S_{2n}(u) dv du
  = E{∫_0^t ∫_u^t S_{2n}(v) S_{2n}(u) dv du + ∫_0^t ∫_0^u S_{2n}(v) S_{2n}(u) dv du}
  = E{∫_0^t ∫_u^t S_{2n}(v) dv du + ∫_0^t u S_{2n}(u) du}
  = ∫_0^t ∫_u^t S(v) dv du + ∫_0^t u S(u) du = 2 ∫_0^t u S(u) du,

giving

(10)  H(t) = 2 ∫_0^t u S(u) du = E(∫_0^t S_{2n}(u) du)² < ∞.

Now,

(11)  c_n E_n fg_{1n}(φ_n) = [F(g(φ_n + c_n)) - F(g(φ_n - c_n))]/2 → 0 a.s.,

since c_n → 0 and φ_n → φ by Theorem 2.1. Similarly, c_n [g'(φ_n) f(g(φ_n)) ∫_0^{g(φ_n)} S(u) du]² → 0 a.s., giving

(12)  lim c_n E_n M²_{1n}(g(φ_n)) = lim c_n (C_1 - C_2)² H(g(φ_n)) E_n fg²_{1n}(φ_n),

and since

c_n E_n fg²_{1n}(φ_n) = [F(g(φ_n + c_n)) - F(g(φ_n - c_n))]/(4c_n) → [g'(φ) f(g(φ))]/2 a.s.,

we obtain lim_{n→∞} c_n E_n M²_{1n}(g(φ_n)) = (C_1 - C_2)² H(g(φ)) g'(φ) f(g(φ))/2 = Σ a.s. By Lemma 2.3, Δ_n = o(c_n), and from (11) we get

(13)  lim c_n E_n[M_{1n}(g(φ_n)) M_{2n}(g(φ_n))] = 0.

By definition, M_{g,n}(φ_n) - g'(φ_n) M(g(φ_n)) = M_{1n}(g(φ_n)) + M_{2n}(g(φ_n)), and since |M_{2n}(g(φ_n))| ≤ C_1 g'(φ_n), we have

lim E_n V_n² = lim c_n [E_n{M_{1n}(g(φ_n)) + M_{2n}(g(φ_n))}² - Δ_n²] = lim c_n E_n[M²_{1n}(g(φ_n)) + 2 M_{1n}(g(φ_n)) M_{2n}(g(φ_n))] = Σ

by (12) and (13). □

Lemma 2.6  Assume A1-A3, A6 and A4' or A5'. Then (a) E_n V_n² is bounded for each n, and (b) for any t ≤ p, lim_{n→∞} E(c_n^{1/2} V_n)^t = 0.

Proof: By Lemma 2.4,

E_n V_n^t ≤ c_n^{t/2}[K_1 + K_2 E_n |fg_n(φ_n)|^t + K_3 |E_n fg_n(φ_n)|^t].

Now, E_n fg_n(φ_n) = (F*g)^{(1)}(φ_n) + (1/(2c_n)) R*_n, where R*_n is a remainder term equal to (c_n²/2)[(F*g)^{(2)}(θ_1) - (F*g)^{(2)}(θ_2)] or (c_n³/6)[(F*g)^{(3)}(θ_1) - (F*g)^{(3)}(θ_2)] (where |θ_i - φ_n| ≤ c_n, i = 1,2), depending on whether A4' or A5' is valid. In either case, we have boundedness of the term E_n fg_n(φ_n). For t = 2, c_n E_n fg²_n(φ_n) = (1/2) E_n fg_n(φ_n), which is bounded, and thus (a) is true. For 0 ≤ t ≤ p,

c_n^t E_n fg_n^t(φ_n) = 2^{-t}[F(g(φ_n + c_n)) - F(g(φ_n - c_n))] = o(1),

which is bounded, and thus c_n^{t/2} E_n V_n^t is bounded. Result (b) follows immediately from E_n(c_n^{1/2} V_n)^t = o(1) and the bounded version of the Lebesgue Dominated Convergence Theorem. □

Lemma 2.7  Define, for r = 1,2,... and n = 1,2,...,

σ²_{n,r} = E[V_n² I{V_n² ≥ rn}].

Assume A6, A7, and E(c_n^{1/2} V_n)^p = o(1). For the q in A6, assume 0 < γ ≤ 1/q. Then lim_{n→∞} σ²_{n,r} = 0 for each r.

Proof: By Hölder's and Markov's inequalities, we get

σ²_{n,r} ≤ (E|V_n|^p)^{2/p} (P(V_n² ≥ rn))^{1/q} ≤ o(1)/[n^{p/(2q)} c_n^{1 + p/(2q)}] = o(1) n^{γ(1 + p/(2q)) - p/(2q)} = o(1),

since γ ≤ 1/q implies γ(1 + p/(2q)) ≤ p/(2q). □

Proof of Theorem 2.2: With Lemmas 2.5-2.7, this is an easy application of Fabian's (1968ii) result.
Now, from (1) and (2), with U_n = φ_n - φ, β = 1 - γ, and α = 1, the mean value theorem gives M(g(φ_n)) = (φ_n - φ) g'(θ_n) M'(g(θ_n)) with |θ_n - φ| ≤ |φ_n - φ|, and M_{g,n}(φ_n) = c_n^{-1/2} V_n + g'(φ_n) M(g(φ_n)) + Δ_n. Thus U_{n+1} = U_n - a_n[c_n^{-1/2} V_n + g'(φ_n) M(g(φ_n)) + Δ_n], which gives

(14)  U_{n+1} = U_n[1 - n^{-1} Γ_n] - n^{-1+γ/2} Φ_n V_n - n^{γ/2 - 3/2} T_n,

where Γ_n = A g'(φ_n) g'(θ_n) M'(g(θ_n)) → A(g'(φ))² M'(g(φ)) = Γ, Φ_n = A C^{-1/2} = Φ, and T_n = A n^{(1-γ)/2} Δ_n. By Lemma 2.5, E_n V_n² → Σ a.s. Now, if A4' holds, by Lemma 2.3,

|T_n| ≤ K n^{(1-γ)/2} O(n^{-γ}) = O(n^{(1-3γ)/2}) = o(1) for γ > 1/3.

If A5' holds, by Lemma 2.3,

|T_n| ≤ K n^{(1-γ)/2} O(n^{-2γ}) = O(n^{(1-5γ)/2}) = o(1) for γ > 1/5.

Thus, the bias T = 0, and the result is immediate from Lemmas 2.6 and 2.7 for parts (a) and (b). For part (c), by (14),

U_{n+1} = U_n[1 - n^{-1} Γ_n] - n^{-1+γ/2} Φ_n V_n - n^{γ/2 - 3/2} T_n,

where T_n = A n^{(1-γ)/2} Δ_n and γ = 1/5. Now, by Lemma 2.3,

lim_{n→∞} T_n = A C² lim_{n→∞} n^{-2γ} n^{(1-γ)/2} (Δ_n/c_n²) = A C² lim (Δ_n/c_n²) = A C²(C_1 - C_2)(∫_0^{g(φ)} S(u) du)(F*g)^{(3)}(φ)/6 = T a.s.

The result is immediate from Lemmas 2.5-2.7 and Fabian's Theorem. □

CHAPTER 3
DEVELOPING AN OPTIMAL SA ARP METHODOLOGY

§3.1 Reducing the order of bias

From the proof of Theorem 2.2, we note that the rate of convergence of φ_n to asymptotic normality could be made quicker if the order of magnitude of the bias term, Δ_n, were smaller. By taking several observations at each stage, Fabian (1967) showed how to modify the Kiefer-Wolfowitz procedure to reduce the order of magnitude of the bias term. We are able to achieve the same effect by taking advantage of the special nature of density estimators rather than taking additional observations at each stage. We now focus on a more general class of estimators that achieve a better rate of convergence.
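As a point of reference for the refinements developed in this chapter, the basic procedure of §2.4 can be sketched end to end in a short simulation. The sketch takes g(t) = t, uses the histogram density estimate fg_{1n}, and truncates the recursion to [a, b] as in §2.4; the Weibull lifetime model, the costs, the truncation bounds, and the tuning constants are hypothetical choices made only for the sketch.

```python
import random

def sa_arp(sample_failure, c1, c2, phi1, A=1.0, C=1.0, gamma=0.2,
           lo=0.05, hi=2.0, n_steps=20000):
    """SA estimate of the optimal age-replacement time (sketch, g(t) = t).
    At stage n, draw two independent lifetimes, censor them at phi_n + c_n,
    form the single-pair estimate M_n of the derivative signal M(phi_n),
    and iterate phi_{n+1} = [phi_n - a_n M_n] truncated to [lo, hi]."""
    phi = phi1
    for n in range(1, n_steps + 1):
        an, cn = A / n, C * n ** (-gamma)
        x1, x2 = sample_failure(), sample_failure()
        z1, z2 = min(x1, phi + cn), min(x2, phi + cn)
        f1 = (1.0 if phi - cn <= z1 < phi + cn else 0.0) / (2.0 * cn)  # fg_1n(phi)
        s1 = 1.0 if z1 > phi else 0.0       # S_1n(phi)
        f2 = 1.0 if z2 <= phi else 0.0      # F_2n(phi)
        int_s2 = min(z2, phi)               # integral of S_2n over [0, phi]
        m = (c1 - c2) * f1 * int_s2 - s1 * (c1 * f2 + c2 * (1.0 - f2))
        phi = min(max(phi - an * m, lo), hi)  # truncated recursion [.]_lo^hi
    return phi

# Hypothetical model: Weibull lifetimes with F(t) = 1 - exp(-t^2),
# failure cost c1 = 5, planned-replacement cost c2 = 1.
random.seed(5)
estimate = sa_arp(lambda: random.weibullvariate(1.0, 2.0), c1=5.0, c2=1.0, phi1=1.0)
```

For this model the cost-rate curve R_1 is available in closed form, so the SA output can be checked against a direct grid minimization of R_1.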
Recall, from (2.3), that

Δ_n = E_n{M_{g,n}(φ_n) - g'(φ_n) M(g(φ_n))} = E_n{M_{1n}(g(φ_n)) + M_{2n}(g(φ_n))}.

Since E_n M_{2n}(g(φ_n)) = 0, this gives (2.4), where

fg_{1n}(t) = I[g(t - c_n) < Z_{1n} ≤ g(t + c_n)]/(2c_n) = I[-1 < (g^{-1}(Z_{1n}) - t)/c_n ≤ 1]/(2c_n).

Let p and r be integers such that 0 ≤ p < r, and let f*g_n^{(p)}(x) be an estimator of (F*g)^{(p+1)}(x) using the nth observation. We use f*g_n(x) for f*g_n^{(0)}(x). We can achieve better convergence rates for the SA ARP estimators by constructing estimators f*g_n^{(p)}(x) which satisfy:

A8: (a) sup_x |E f*g_n^{(p)}(x) - (F*g)^{(p+1)}(x)| = o(n^{-(r-p)/(2r+1)}), and
    (b) for t > -1, sup_x E|f*g_n(x)|^{t+1} = O(n^{t/(2r+1)}).

A9: Let t > -1 and let {x_n} be a sequence of constants tending to x_0. Then, there exist constants κ_1 and κ_{2,t} such that
    (a) lim_{n→∞} n^{(r-p)/(2r+1)}[E f*g_n^{(p)}(x_n) - (F*g)^{(p+1)}(x_n)] = κ_1, and
    (b) lim_{n→∞} n^{-t/(2r+1)} E(f*g_n(x_n))^{t+1} = κ_{2,t}.

Recall that (F*g)^{(p+1)} is the pth derivative of the density function when g(t) = t. If g(t) = t, the number of derivatives of the density that will be assumed to exist and be bounded is r. In A10 we will give more precise conditions on the distribution and g(·) for a specific value of r. In this section we only use f*g_n, that is, p = 0. In Lemma 3.4 we will need the estimator for p = 1, and in Lemma 3.5 we will need a modified version with p = r.

We now describe one method of creating such estimators, using kernel functions. Lemmas will be given providing sufficient conditions for A8 and A9. We first present a small review of the method of kernel estimators of a density, primarily following Singh (1977). Let B_0 be the class of all Borel-measurable real-valued functions k(·) where k(·) is bounded and equals zero outside [0,1]. For integers r and p where 0 ≤ p < r, define

M = {k ∈ B_0 : (1/j!) ∫_0^1 y^j k(y) dy = 1 if j = p, and = 0 for j = 0,1,...,r-1, j ≠ p},
M* = {k ∈ M : k is continuous and of bounded variation}.

Consider X_1,...,X_n, an i.i.d. sample having density f. Let {c_n} be a sequence of positive constants tending to 0.
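To make the kernel machinery concrete, here is a short numerical sketch of kernel density estimation. For simplicity it uses the symmetric Epanechnikov kernel on [-1, 1] with p = 0 rather than a one-sided kernel from the class M above; the normal sample and the bandwidth are hypothetical choices for the sketch.

```python
import random

def kernel_density(xs, x, c):
    """Kernel density estimate f_hat(x) = (1/(n c)) sum_j k((X_j - x)/c),
    here with the Epanechnikov kernel k(u) = 0.75 (1 - u^2) on [-1, 1]."""
    def k(u):
        return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0
    n = len(xs)
    return sum(k((xj - x) / c) for xj in xs) / (n * c)

# Standard normal sample; the true density at 0 is 1/sqrt(2*pi) ~ 0.3989.
# The bandwidth c = n^{-1/5} mirrors the rate c_n = C n^{-1/(2r+1)} with r = 2.
random.seed(6)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
estimate = kernel_density(xs, 0.0, c=20000 ** -0.2)
```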
For a fixed k ∈ M, our kernel estimate of f^{(p)}, the pth derivative of f, is

f̂^{(p)}(x) = 1/(n c_n^{p+1}) Σ_{j=1}^n k((X_j - x)/c_n).

In practice, k(·) can be taken to be a polynomial of order r. We now state three important (to us) results in density estimation.

Theorem A  If k ∈ M and f^{(r)} is bounded, then
(a) sup_x (1/c_n^{r-p}) |E f̂^{(p)}(x) - f^{(p)}(x)| = O(1).
Further, if for some t > 1, ∫ |f^{(r)}(x)|^t dx < ∞, then
(b) sup_x (1/c_n^{r-p-1/t}) |E f̂^{(p)}(x) - f^{(p)}(x)| = O(1).

Theorem B  Let k ∈ M*. Then,
(a) sup_x n^{(r-p)/(r+1)} (f̂^{(p)}(x) - f^{(p)}(x))²/(log log n) = O(1) a.s.
Now, let k ∈ M* and c_n = C n^{-w/2}, where w = 1/(r + 1 - 1/t). If the assumptions of Theorem A hold, then
(b) sup_x n^{w(r-p-1/t)} (f̂^{(p)}(x) - f^{(p)}(x))²/(log log n) = O(1) a.s.

Theorem C  Let k ∈ M, let f and f^{(r)} be bounded, and let c_n = C n^{-1/(2r+1)}. Then the mean square error E(f̂^{(p)}(x) - f^{(p)}(x))² is of order n^{-2(r-p)/(2r+1)}.

In the stochastic approximation scenario, we are interested in the properties of a single-observation estimator of (F*g)^{(1)}(x). We give below our analog of the important part (a) of Theorem A. But first, we need to impose the following slightly stronger condition on the distribution function:

A10: For some integer r ≥ 1, assume that (F*g)^{(1)}(x) and (F*g)^{(r+1)}(x) exist for each x, are bounded on the entire real line, and are continuous in a neighborhood of φ.

Lemma 3.1  Let k ∈ M and, for some integers r and p, let 0 ≤ p < r. Suppose (F*g)^{(r+1)}(x) is bounded. Define

fg_n^{(p)}(x) = k[(g^{-1}(X_n) - x)/c_n]/c_n^{p+1}.

Then,
(a) sup_x |E fg_n^{(p)}(x) - (F*g)^{(p+1)}(x)| = O(c_n^{r-p}).
Further, if (F*g)^{(r+1)}(x) is continuous in a neighborhood of a fixed x_0, and {x_n} is a sequence of constants tending to x_0, then
(b) lim_{n→∞} c_n^{-(r-p)}[E fg_n^{(p)}(x_n) - (F*g)^{(p+1)}(x_n)] = (F*g)^{(r+1)}(x_0) ∫_0^1 (y^r/r!) k(y) dy.

Proof: By a change of variables, we have

E fg_n^{(p)}(x) = c_n^{-p-1} ∫ k[(g^{-1}(s) - x)/c_n] f(s) ds = c_n^{-p} ∫_0^1 k(y) g'(x + c_n y) f(g(x + c_n y)) dy = c_n^{-p} ∫_0^1 k(y) (F*g)^{(1)}(x + c_n y) dy.

A Taylor-series expansion gives

(F*g)^{(1)}(x + c_n y) = (F*g)^{(1)}(x) + (c_n y)(F*g)^{(2)}(x) + ... + ((c_n y)^{r-1}/(r-1)!)(F*g)^{(r)}(x) + R_n(y),

where R_n(y) = ((c_n y)^r/r!) ·
(F*g)^{(r+1)}(θ) for some θ such that |θ - x| ≤ c_n y. Thus |R_n(y)| ≤ K_1 (c_n y)^r/r! ≤ K_2 c_n^r. By the boundedness of k and its orthogonality to y^j, j = 0,...,r-1, j ≠ p, we have

c_n^p E fg_n^{(p)}(x) = c_n^p (F*g)^{(p+1)}(x) + ∫_0^1 k(y) R_n(y) dy,

which gives the result (a). For part (b),

c_n^{-(r-p)}[E fg_n^{(p)}(x_n) - (F*g)^{(p+1)}(x_n)] = c_n^{-r} ∫_0^1 k(y) R_n(y) dy = ∫_0^1 k(y)(y^r/r!)(F*g)^{(r+1)}(θ) dy → (F*g)^{(r+1)}(x_0) ∫_0^1 k(y) y^r/r! dy,

through an application of the bounded convergence theorem. □

We remark here that, with A10, the results of Lemma 3.1, (a) and (b), show that the kernel estimator meets assumptions A8(a) and A9(a), respectively. With A10, the results of Lemma 3.2, (a) and (b), will show that the kernel estimator meets assumptions A8(b) and A9(b), respectively. We note here that since we take only a single observation, one would not expect a result analogous to Theorem B to be true. An analog of Theorem C will be of consequence.

Lemma 3.2  Assume k ∈ M and that (F*g)^{(1)}(x) is bounded and continuous at x_0. Let {x_n} be a sequence of constants tending to x_0. If t > -1, then
(a) sup_x E|fg_n(x)|^{t+1} = O(c_n^{-t}), and
(b) lim_{n→∞} c_n^t E fg_n^{t+1}(x_n) = (F*g)^{(1)}(x_0) ∫_0^1 k^{t+1}(y) dy.

Proof: The proof is an easy application of the bounded convergence theorem. Since

E|fg_n(x)|^{t+1} = c_n^{-(t+1)} ∫ |k((g^{-1}(s) - x)/c_n)|^{t+1} f(s) ds = c_n^{-t} ∫_0^1 |k(y)|^{t+1} (F*g)^{(1)}(x + c_n y) dy,

the boundedness of (F*g)^{(1)} and k(·) gives result (a). Further, by that boundedness,

lim c_n^t E fg_n^{t+1}(x_n) = ∫_0^1 k^{t+1}(y) lim (F*g)^{(1)}(x_n + c_n y) dy. □

We now prove results about the SA ARP estimator using a general density estimator f*g_n^{(p)} which satisfies A8 and A9. As pointed out, kernel estimators are one type of such estimator. All quantities, e.g., M_{g,n}, Δ_n, etc., are defined as in Chapter 2, but using the new density estimator f*g_n in place of the histogram estimator fg_{1n}.

Theorem 3.1  (a) Assume A1-A3, A8 and A9. Then φ_n → φ a.s.
(b) Assume A1, A2 and A6-A9 and let γ = 1/(2r+1) ≥ 1/q. Then

n^{(1−γ)/2}(φ_n − φ) →_D N(μ_0, σ_0²),

where

μ_0 = 2 A (C_1−C_2) (∫_0^{g(φ)} S(u) du) τ_1 / (2Γ − 1 + γ),
σ_0² = 2 A² C^{−1} (C_1−C_2)² (∫_0^{g(φ)} u S(u) du) σ_{2,1} / (2Γ − 1 + γ),

Γ = A (g′(φ))² M′(g(φ)), and τ_1 and σ_{2,1} denote the limits appearing in A9(a) and A9(b).

Proof: Recall U_n = φ_n − φ. Now, by A8(a) (with p = 0), (2.4), and γ = 1/(2r+1), we get |β_n| ≤ O(n^{−γr}). Hence, by (2.9), Lemma 2.2(a) and A8(b) (with t = 1),

E U_{n+1}² = U_n² − 2 a_n (φ_n − φ) g′(φ_n) M(g(φ_n)) + a_n² [K_1 + K_2 O(n^γ)] + 2 a_n (1 + U_n²) O(n^{−γr})
  = −2 a_n (φ_n − φ) g′(φ_n) M(g(φ_n)) + U_n² [1 + O(a_n n^{−γr})] + a_n² O(n^γ) + O(a_n n^{−γr}).

Thus, by the Robbins-Siegmund result, A2 and A3, we get part (a).

By equation (2.14), we have

U_{n+1} = U_n [1 − n^{−1} Γ_n] − n^{−1+γ/2} Φ_n V_n − n^{γ/2−3/2} T_n,

where V_n = c_n^{1/2} (M_{g,n}(φ_n) − g′(φ_n) M(g(φ_n)) − β_n), Γ_n = A g′(φ_n) g′(θ_n) M′(g(θ_n)) → A (g′(φ))² M′(g(φ)) = Γ, Φ_n = A C^{−1/2}, and T_n = A n^{(1−γ)/2} β_n. By A9(a) (with p = 0) and part (a),

lim_n T_n = lim_n A n^{r/(2r+1)} (C_1−C_2) (∫_0^{g(φ_n)} S(u) du) [E_{F_n} f*g_n(φ_n) − (F*g)^{(1)}(φ_n)]
  = A (C_1−C_2) (∫_0^{g(φ)} S(u) du) τ_1 = T, say.

By equations (2.10), (2.12) and A9(b) (with t = 1),

lim_n c_n E_{F_n} M_{f,n}²(g(φ_n)) = (C_1−C_2)² lim_n H(g(φ_n)) c_n E_{F_n} fg_n²(φ_n)
  = 2 (C_1−C_2)² (∫_0^{g(φ)} u S(u) du) σ_{2,1} = S_0², say.

Thus, as in Lemma 2.5, we get E_{F_n} V_n² → S_0² a.s. Now, by A8(b) (with t = 0, 1) and Lemma 2.4, we get the boundedness of E_{F_n} V_n². The result of the theorem comes immediately from an application of Fabian's theorem, provided we show that

σ_{n,r}² = E[I(V_n² ≥ rn) V_n²] → 0 for each r.

By Lemma 2.7, we need only show for t ≤ p that E_{F_n}(c_n^{1/2} V_n)^t is bounded. By Lemma 2.4,

V_n^t ≤ c_n^{t/2} [K_1 + K_2 E_{F_n} |fg_n(φ_n)|^t + K_3 |E_{F_n} fg_n(φ_n)|^t].

Now, c_n fg_n(φ_n) and E_{F_n} fg_n(φ_n) are bounded. By Lemma 2.4 and A8(b) (with t = 0 and p−1), we have E_{F_n}(c_n^{1/2} V_n)^t = O(1). The bounded version of Lebesgue's Dominated Convergence Theorem then gives the required negligibility, and thus the theorem. □

Corollary 3.1
Define the kernel estimator

(1) fg_n^{(p)}(x) = k[(g^{−1}(Z_{1n}) − x)/c_n] / c_n^{p+1}.
Use fg_n(x) for fg_n^{(0)}(x) and let f*g_n(x) = fg_n(x). Then,
(a) if A1-A3 and A10 hold, then φ_n → φ a.s.
(b) Assume A1, A2, A6, A7 and A10 hold, and let γ = 1/(2r+1) ≥ 1/q. Then

n^{(1−γ)/2}(φ_n − φ) →_D N(μ_1, σ_1²),

where

μ_1 = 2 A C^r (C_1−C_2) (∫_0^{g(φ)} S(u) du) (F*g)^{(r+1)}(φ) (∫_0^1 (y^r/r!) k(y) dy) / [2Γ − 1 + γ],
σ_1² = 2 A² C^{−1} (C_1−C_2)² (∫_0^{g(φ)} u S(u) du) (F*g)^{(1)}(φ) (∫_0^1 k²(y) dy) / [2Γ − 1 + γ].

Proof: Immediate from Lemmas 3.1 and 3.2 and Theorem 3.1. □

§3.2 Reducing the asymptotic mean square error

In §3.1, we demonstrated how to define a sequential procedure,

(2) φ_{n+1} = φ_n − A n^{−1} M_{g,n}(φ_n),

that under certain reasonable conditions produces estimators that are strongly consistent and asymptotically normal. In this section, we investigate recommendations for the choice of the parameters A and C. To be explicit, only choices for the kernel estimators are presented.

From Corollary 3.1 of the previous section, we see that under certain conditions, n^{(1−γ)/2}(φ_n − φ) →_D X, where X is a normally distributed random variable with mean μ_1 and variance σ_1². One criterion for the selection of parameters, suggested by Abdelhamid (1973), is to choose A and C to minimize E X², the asymptotic mean square error. Elementary calculations show that

(3) A_opt = [(g′(φ))² M′(g(φ))]^{−1},

(4) C_opt = {2γ(r+1)/r (∫_0^1 k²(y) dy)}^{1/(2r+1)} {2 (C_1−C_2) (F*g)^{(r+1)}(φ) (∫_0^{g(φ)} S(u) du) (∫_0^1 y^r k(y)/r! dy)}^{−2/(2r+1)}.

We note that the asymptotic distributions of the processes presented in §2.4 and §3.1 depend on the choice of the transform function g(·). We now discuss how to choose the parameters A and C so that the asymptotic distribution of the resulting procedure is invariant to the choice of g(·).

We first re-introduce the star superscript notation used in (2.1). Let φ be the t that uniquely minimizes R_1(t), and let φ* be the t that uniquely minimizes R_1(g(t)).
Thus φ = g(φ*). Using the φ*_n estimators defined in (2.1) and (2.2) with kernel estimators defined in (3.1), under certain conditions we showed in Corollary 3.1 that n^{(1−γ)/2}(φ*_n − φ*) →_D N(μ_1, σ_1²). Since φ* is a continuity point of g(·), the "δ-method" tells us that

n^{(1−γ)/2}(g(φ*_n) − g(φ*)) →_D N(g′(φ*) μ_1, (g′(φ*))² σ_1²).

Now,

g′(φ*) μ_1 = K_1 A C^r (F*g)^{(r+1)}(φ*) / {2A (g′(φ*))² M′(φ) − 1 + γ},
(g′(φ*))² σ_1² = K_2 A² C^{−1} (g′(φ*))³ / {2A (g′(φ*))² M′(φ) − 1 + γ},

where K_1 and K_2 are constants not depending on g, built from ∫_0^1 (y^r/r!) k(y) dy and ∫_0^1 k²(y) dy, respectively. To make the denominators invariant with respect to g(·), we choose A = A_0 (g′(φ*))^{−2}, where A_0 is some positive constant. This gives

g′(φ*) μ_1 = K_3 C^r (g′(φ*))^{−1} (F*g)^{(r+1)}(φ*),
(g′(φ*))² σ_1² = K_4 C^{−1} (g′(φ*))^{−1},

where

K_3 = K_1 A_0 / {2 A_0 M′(φ) − 1 + γ} and K_4 = K_2 A_0² / {2 A_0 M′(φ) − 1 + γ}.

As with the first criterion, the best choice of A_0 is (M′(φ))^{−1}. Now, by using C = C_0 (g′(φ*))^{−1}, we make the asymptotic variance independent of the choice of g(·) and the asymptotic mean nearly so. More explicitly, with this choice of C,

g′(φ*) μ_1 = K_5 (F*g)^{(r+1)}(φ*) (g′(φ*))^{−(r+1)},
(g′(φ*))² σ_1² = K_6,

where K_5 = K_3 C_0^r and K_6 = K_4 C_0^{−1}. As before, the best choice of C_0 is C_opt g′(φ*), where C_opt is given in (4). Unfortunately, this choice of C_0 does depend on g(·).

Now suppose that g(·) is approximately a line in a neighborhood of φ*, or, more generally, require only that g^{(i)}(φ*) = 0 for i = 2, ..., r+1. Then

(F*g)^{(r+1)}(φ*) = (g′(φ*))^{r+1} f^{(r+1)}(g(φ*)),

and

g′(φ*) μ_1 = K_5 f^{(r+1)}(φ),

which does not depend on g(·), and C_opt = C_0 (g′(φ*))^{−1}. Thus, within this more restrictive class of transforms than defined by A1, we see that the asymptotic distribution is invariant with respect to the choice of the transform.

In Chapter 6 we investigate some finite sample properties of these optimal choices of A and C using simulation techniques. The choice of the kernel k(y) ∈ M to minimize E X² has not been solved yet.
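To make the basic recursion of this chapter concrete, the following sketch simulates procedure (2) for a simple age replacement model: the identity transform g(t) = t, a uniform kernel on [0,1] for the single-observation density estimate, and two independent failure-time draws per step supplying the unbiased pieces A_n and B_n. The Weibull failure law, the costs C_1 and C_2, and all tuning constants are hypothetical choices for illustration only, not values prescribed by the text.

```python
import math
import random

# Illustrative sketch only: the SA ARP recursion (2),
# phi_{n+1} = phi_n - (A/n) M_{g,n}(phi_n), specialized to g(t) = t and a
# uniform kernel on [0,1].  Failure law: Weibull(shape 2, scale 1), so
# F(t) = 1 - exp(-t^2).  Costs C1 (failure) > C2 (planned replacement).
C1, C2 = 5.0, 1.0

def weibull_draw(rng):
    # Inverse-CDF draw from F(t) = 1 - exp(-t^2)
    return math.sqrt(-math.log(1.0 - rng.random()))

def M_hat(t, x1, x2, c):
    """Single-pair estimate of M(t) = A(t) f(t) - B(t)."""
    A_n = (C1 - C2) * min(x2, t)          # unbiased for (C1-C2) int_0^t S(u) du
    fg_n = (1.0 if 0.0 <= x1 - t <= c else 0.0) / c   # uniform kernel, p = 0
    B_n = (1.0 if x1 > t else 0.0) * (C1 if x2 <= t else C2)
    return A_n * fg_n - B_n

def sa_arp(n_steps=20000, A=1.0, C=1.0, gamma=1.0 / 3.0, seed=1):
    rng = random.Random(seed)
    phi = 0.3                              # arbitrary starting guess
    for n in range(1, n_steps + 1):
        c_n = C * n ** (-gamma)            # bandwidth c_n = C n^{-gamma}
        x1, x2 = weibull_draw(rng), weibull_draw(rng)
        phi -= (A / n) * M_hat(phi, x1, x2, c_n)
        phi = min(max(phi, 0.05), 3.0)     # truncate to a known bracket
    return phi

def M_true(t):
    # Closed form of A(t) f(t) - B(t) for this Weibull; zero at the optimum.
    s = math.exp(-t * t)
    return s * (4.0 * math.sqrt(math.pi) * t * math.erf(t) - 5.0 + 4.0 * s)
```

For these costs, M_true changes sign near t = 0.51, so the iterates drift toward that root; the example is a sketch of the mechanics, not a claim about finite-sample accuracy.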
§3.3 An adaptive ARP procedure

As pointed out in §3.2, for a fixed distribution function F and known transform g and kernel k, there is a choice of A and C that minimizes the asymptotic MSE. Unfortunately, these best choices of A and C depend on knowledge of F and φ, which are generally unknown prior to conducting the experiment. In this section, A and C are replaced by estimators in an adaptive manner such that the procedure attains the optimal MSE without a priori knowledge of F and φ.

Let {X_{i,n}}, i = 1, 2, be two sequences of i.i.d. random variables that are mutually independent, each having d.f. F. {a*_n} and {c*_n} are sequences of random variables. Use ∨ and ∧ to denote max and min, respectively. For positive constants Z_1, Z_2, Z_3, Z_4, a, b, c and γ, define

(5) a_n = (Z_1 (log n)^{−1} ∨ a*_n) ∧ Z_2 n^a,
(6) c_n = (Z_3 n^{−b} ∨ c*_n) ∧ Z_4 n^c,

where a + rc < rγ, a + b + γ < 1/2, γ = 1/(2r+1) and b < rγ/(r+1). This type of truncation device has been used by other researchers in stochastic approximation (cf., Fabian, 1971, Theorem 2.4). These particular truncating functions are not unique and could be replaced by others which are more complicated (cf., Lai and Robbins, 1981, Theorem 3). For convenience we choose these simple, sufficient conditions over the more complex, yet only slightly weaker, conditions.

Let Z_{in} = X_{in} ∧ g(φ_n + c_n n^{−γ}) and define F_n = σ(φ_1; Z_{ij}, i = 1, 2, j = 1, ..., n−1). We assume a*_n and c*_n are F_n-measurable. Definitions for all quantities (F_n, S_n, fg_n, M_{g,n}, etc.) are as in §3.1 (using (1) for the density estimator), except now based on the new truncated observations {Z_{in}}, and with A and C replaced by a_n and c_n, respectively. Our adaptive procedure is

(7) φ_{n+1} = φ_n − a_n n^{−1} M_{g,n}(φ_n).

Some additional notation will be useful; define

(8) A_n(t) = (C_1−C_2) ∫_0^{g(t)} S_{2n}(u) du,
    A(t) = (C_1−C_2) ∫_0^{g(t)} S(u) du,
    B_n(t) = g′(t) S_{1n}(g(t)) [C_1 F_{2n}(g(t)) + C_2 S_{2n}(g(t))],
    B(t) = g′(t) S(g(t)) [C_1 F(g(t)) + C_2 S(g(t))].
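In code, the truncation device of (5) and (6) is just a clamp between a slowly decaying floor and a slowly growing ceiling, so that early erratic adaptive estimates cannot destabilize the recursion. The constants Z_1, ..., Z_4 and the exponents below are arbitrary placeholder values, not ones the text prescribes.

```python
import math

# Sketch of the truncation in (5)-(6): clamp a raw adaptive estimate
# between a decaying floor and a growing ceiling.  Valid for n >= 2
# (so that log n > 0).  All constants are placeholders.

def truncated_a(a_star, n, Z1=1.0, Z2=1.0, a=0.05):
    # (5): a_n = (Z1 (log n)^{-1} v a*_n) ^ Z2 n^a   (v = max, ^ = min)
    return min(max(Z1 / math.log(n), a_star), Z2 * n ** a)

def truncated_c(c_star, n, Z3=1.0, Z4=1.0, b=0.1, c=0.05):
    # (6): c_n = (Z3 n^{-b} v c*_n) ^ Z4 n^c
    return min(max(Z3 * n ** (-b), c_star), Z4 * n ** c)
```

An estimate already inside the band passes through unchanged; one below the floor or above the ceiling is replaced by the corresponding deterministic bound.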
From (1) and (2.2), we have M_{g,n}(t) = A_n(t) fg_n(t) − B_n(t). Recall that

(9) M_{g,n}(φ_n) = g′(φ_n) M(g(φ_n)) + (c_n n^{−γ})^{−1/2} V_n + β_n,

where

β_n = A(φ_n) [E_{F_n} fg_n(φ_n) − (F*g)^{(1)}(φ_n)],
V_n = (c_n n^{−γ})^{1/2} {fg_n(φ_n) A_n(φ_n) − E_{F_n}[fg_n(φ_n) A_n(φ_n)] − [B_n(φ_n) − B(φ_n)]}.

Even when using the modified procedure, we retain the conditional unbiasedness of parts of the estimator, i.e., E_{F_n} S_{in}(t) = E_{F_n} I(Z_{in} > t) = S(t) for t ≤ g(φ_n + c_n n^{−γ}), giving

E_{F_n} A_n(φ_n) = A(φ_n) and E_{F_n} B_n(φ_n) = B(φ_n).

Further, we have the following

Lemma 3.3
Assume A1-A3, A10, and (5)-(7). Then

(10) β_n = O((c_n n^{−γ})^r) ≤ O(n^{(c−γ)r}).

Proof: Via a Taylor-series expansion,

E_{F_n} fg_n(t) = (c_n n^{−γ})^{−1} E_{F_n} k[(g^{−1}(Z_{1n}) − t)/(c_n n^{−γ})]
  = ∫_0^1 k(y) (F*g)^{(1)}(t + c_n n^{−γ} y) dy
  = (F*g)^{(1)}(t) + (c_n n^{−γ})^r ∫_0^1 (y^r/r!) k(y) (F*g)^{(r+1)}(η) dy,

where |η − t| ≤ c_n n^{−γ}, which gives the result. □

We note from the proof of the lemma that E_{F_n} fg_n(φ_n) is bounded. We have now placed enough structure on the procedure to give the following:

Theorem 3.2
Assume A1-A3, A10 and (5)-(7). Then

(11) φ_n → φ a.s.

Proof: Using (9) in (7) gives

(12) φ_{n+1} = φ_n − a_n n^{−1} [g′(φ_n) M(g(φ_n)) + (c_n n^{−γ})^{−1/2} V_n + β_n].

Squaring and taking conditional expectations with respect to F_n gives

(13) E_{F_n}(φ_{n+1} − φ)² = (φ_n − φ)² − 2 n^{−1} a_n (φ_n − φ)[g′(φ_n) M(g(φ_n)) + β_n]
     + n^{−2} a_n² [(g′(φ_n) M(g(φ_n)) + β_n)² + c_n^{−1} n^γ E_{F_n} V_n²].

Note that by A10, g′(t) M(g(t)) is bounded. By the Robbins-Siegmund Theorem, sufficient for (11) are

(14) Σ n^{−1} a_n |β_n| < ∞ a.s.,
(15) Σ n^{−2} a_n² < ∞ a.s., and
(16) Σ n^{−2} a_n² c_n^{−1} n^γ E_{F_n} V_n² < ∞ a.s.

(15) is true by the definition of {a_n}. (14) is true by Lemma 3.3. From (8) and the remark following Lemma 3.3, we have that B, B_n, A_n and E_{F_n} fg_n(φ_n) are bounded, giving

E_{F_n}((c_n n^{−γ})^{−1/2} V_n)² ≤ K_1 + K_2 E_{F_n} A_n²(φ_n) E_{F_n}(fg_n(φ_n))²

for some constants K_1, K_2. Now, E_{F_n} A_n²(φ_n) is bounded and (fg_n(φ_n))² = O((c_n n^{−γ})^{−2}) ≤ O(n^{2(γ+b)}).
Since we require a + b + γ < 1/2, this is sufficient for (16), and hence the theorem. □

We now specify the sequences {a*_n} and {c*_n} so that

(17) a*_n → A_opt and c*_n → C_opt a.s.,

where A_opt and C_opt are given in (3) and (4). To estimate (F*g)^{(r+1)}(φ), an additional mild assumption on the density is needed.

A11. For some d > 0, (F*g)^{(r+1)}(x) = (F*g)^{(r+1)}(φ) + O(|x − φ|^d) for each x.

In the following two lemmas, we construct F_n-measurable sequences {a*_n} and {c*_n} that satisfy (17). Recall the definition of fg_n^{(1)} and fg_n in (1).

Lemma 3.4
Assume A1, A2, A6, A7, A10 and (5)-(7). Define

α_n = A_n(φ_n)[fg_n^{(1)}(φ_n) − g″(φ_n) fg_n(φ_n)] + g′(φ_n) fg_n(φ_n)[C_1 F_{2n}(g(φ_n)) + C_2 S_{2n}(g(φ_n))],

a*_{n+1} = [n^{−1} Σ_{j=1}^n α_j]^{−1}, and a_n as in (5). Then, if γ + b < 1/3, we have a_n → A_opt a.s.

Proof: We begin by noting that

E_{F_n} α_n = A(φ_n)[E_{F_n} fg_n^{(1)}(φ_n) − g″(φ_n) E_{F_n} fg_n(φ_n)] + g′(φ_n) E_{F_n} fg_n(φ_n)[C_1 F(g(φ_n)) + C_2 S(g(φ_n))]
  → A(φ)[(F*g)^{(2)}(φ) − g″(φ)(F*g)^{(1)}(φ)] + g′(φ)(F*g)^{(1)}(φ)[C_1 F(g(φ)) + C_2 S(g(φ))]
  = (g′(φ))² M′(g(φ)),

using Theorem 3.2 and the fact that E_{F_n} fg_n^{(p)}(φ_n) → (F*g)^{(p+1)}(φ) (by the same argument as in Lemma 3.1(b)). Hence n^{−1} Σ_{j=1}^n E_{F_j} α_j → A_opt^{−1}. Define the martingale W_n = Σ_{j=1}^{n−1} (α_j − E_{F_j} α_j). From the result of Chow (1965), sufficient for the proof of the lemma is

Σ n^{−2} E_{F_n}(W_{n+1} − W_n)² < ∞ a.s.

Now,

E_{F_n}(W_{n+1} − W_n)² = E_{F_n}(α_n − E_{F_n} α_n)² ≤ E_{F_n} α_n².

Since E_{F_n} A_n²(φ_n), g′ and g″ are bounded, there exist positive constants K_1, K_2 and K_3 such that

E_{F_n}(W_{n+1} − W_n)² ≤ K_1 + K_2 E_{F_n} fg_n²(φ_n) + K_3 E_{F_n}(fg_n^{(1)}(φ_n))².

Now,

(18) fg_n²(φ_n) = O((c_n n^{−γ})^{−2}) = O(n^{2(γ+b)}),

and thus Σ n^{−2} E_{F_n}(fg_n(φ_n))² < ∞ a.s. Further,

(c_n n^{−γ})³ E_{F_n}(fg_n^{(1)}(φ_n))² = (c_n n^{−γ})^{−1} ∫ k²[(g^{−1}(s) − φ_n)/(c_n n^{−γ})] f(s) ds
  = ∫_0^1 k²(y) (F*g)^{(1)}(φ_n + c_n n^{−γ} y) dy,

which is bounded. Thus

Σ n^{−2} E_{F_n}(fg_n^{(1)}(φ_n))² = Σ n^{−2} O(n^{3(γ+b)}) < ∞ a.s. □

To estimate C_opt, we see from (4) that there are three unknown quantities.
As a preliminary, to estimate the rth derivative of the transformed density, we need to introduce k_r, a bounded function equal to zero outside (0,1), such that

(1/r!) ∫_0^1 y^r k_r(y) dy = 1 and ∫_0^1 y^j k_r(y) dy = 0, j = 0, ..., r−1.

We define

fg_n^{(r)}(t) = k_r[(g^{−1}(Z_{1n}) − t)/(c_n n^{−γ})] / (c_n n^{−γ})^{r+1}

as a quantity to be used in our estimate.

Lemma 3.5
Assume A11 and that the conditions of Lemma 3.4 hold, with c_n as in (6). Then the estimators of the three unknown quantities converge:

(19) ∫_0^{g(φ_{n+1})} S_{n+1}(u) du → ∫_0^{g(φ)} S(u) du a.s.,
(20) (F*g)_{n+1}^{(r+1)} → (F*g)^{(r+1)}(φ) a.s., and
     Σ*_{n+1} → (F*g)^{(1)}(φ) ∫_0^1 k²(y) dy a.s.

(here (F*g)_{n+1}^{(r+1)} denotes the sample average n^{−1} Σ_{j≤n} fg_j^{(r)}(φ_j) and Σ*_{n+1} the corresponding estimate of the variance factor), which gives

(21) c*_{n+1} → C_opt a.s.

Proof: (19) is an easy application of Chow's Theorem and Theorem 3.2. To prove (20),

E_{F_n} fg_n^{(r)}(t) = (c_n n^{−γ})^{−r} ∫_0^1 k_r(y) (F*g)^{(1)}(t + c_n n^{−γ} y) dy
  = ∫_0^1 (y^r/r!) k_r(y) (F*g)^{(r+1)}(η) dy for some η such that |η − t| ≤ c_n n^{−γ}
  = (F*g)^{(r+1)}(t) + ∫_0^1 (y^r/r!) k_r(y) O(|η − t|^d) dy.

Thus

E_{F_n} fg_n^{(r)}(φ_n) − (F*g)^{(r+1)}(φ_n) = O((c_n n^{−γ})^d) = O(n^{d(c−γ)}) = o(1).

By Theorem 3.2 and Kronecker's Lemma, we have

n^{−1} Σ_j E_{F_j} fg_j^{(r)}(φ_j) → (F*g)^{(r+1)}(φ) a.s.

We will have proved (20) with Chow's Theorem if

(22) Σ n^{−t} E_{F_n} |fg_n^{(r)}(φ_n) − E_{F_n} fg_n^{(r)}(φ_n)|^t < ∞ a.s.

holds for some t such that 0 < t ≤ 2. If we can find such a t > 1, we need only show

(23) Σ n^{−t} E_{F_n} |fg_n^{(r)}(φ_n)|^t < ∞ a.s.

This is because we can apply the algebraic inequality (a+b)^t ≤ 2(a^t + b^t), valid for all nonnegative constants a and b and t ∈ (1,2). Now,

E_{F_n} |fg_n^{(r)}(φ_n)|^t = O((c_n n^{−γ})^{1−t(r+1)}) = O(n^{(−b−γ)(1−t(r+1))})

from (6). We will have shown (23) if we can find t ∈ (1,2) such that

(24) −t − (b+γ)(1 − t(r+1)) < −1.

First note that b < rγ/(r+1) iff 1 + b/{γ(γ+b)} < r+1. We can pick a z such that 1 + b/{γ(γ+b)} < z < r+1 and let t = {1 − z(b+γ)}/{1 − (r+1)(b+γ)}. Easy algebraic calculations show that the resulting t ∈ (1,2) and satisfies (24). Thus we have proved (20).

To show (21), we define a martingale difference
W_n = c_n n^{−γ} [S_n² − E_{F_n} S_n²], where S_n = fg_n(φ_n) A_n(φ_n). For the p in A6,

E_{F_n} |S_n|^t ≤ K_1 + K_2 E_{F_n} |fg_n(φ_n)|^t ≤ K_1 + K_3 (c_n n^{−γ})^{−(t−1)}.

By the theorem on martingale differences due to Chow (1965), it suffices for 0 < t ≤ 2 that Σ n^{−t} E_{F_n} |W_n|^t < ∞ a.s. Since (γ+b)(t−1) − t < −1 iff t > 1, we choose t = min(2, p/2), giving

(25) n^{−1} Σ_j c_j j^{−γ} [S_j² − E_{F_j} S_j²] → 0 a.s.

Thus, if we prove that

(26) lim_n c_n n^{−γ} E_{F_n} S_n² exists a.s.,

this and (25) will give (21) and thus the lemma. From c_n n^{−γ} = O(1), the boundedness of E_{F_n} fg_n(φ_n), and Lemma 3.2(b),

lim_n c_n n^{−γ} E_{F_n} S_n² = lim_n c_n n^{−γ} E_{F_n} [fg_n(φ_n) A_n(φ_n)]²
  = 2 (C_1−C_2)² (∫_0^{g(φ)} u S(u) du) (F*g)^{(1)}(φ) ∫_0^1 k²(y) dy a.s. □

Because of the wealth of information in the density estimator, we have succeeded in providing adaptive estimators for the parameters of our stochastic approximation algorithm. This is a much harder task for a general K-W situation. We now give the final theorem of this chapter, the culmination of our previous efforts.

Theorem 3.3
Assume A1, A2, A6, A7, A10, A11, (5)-(7) and (17). Define γ = 1/(2r+1) and assume γ + b < 1/3. Let

T = C_opt^r (C_1−C_2) (∫_0^{g(φ)} S(u) du) (F*g)^{(r+1)}(φ) (∫_0^1 (y^r/r!) k(y) dy),
μ_2 = 2 A_opt T / (1 + γ),
σ_2² = 2 A_opt² C_opt^{−1} (C_1−C_2)² (∫_0^{g(φ)} u S(u) du) (F*g)^{(1)}(φ) (∫_0^1 k²(y) dy) / (1 + γ).

Then, for the procedure described in (7),

n^{(1−γ)/2}(φ_n − φ) →_D N(μ_2, σ_2²),

where A_opt and C_opt are given in (3) and (4).

Proof: With the preceding lemmas, the proof is a straightforward application of Fabian's (1968ii) result. By a Taylor-series expansion,

g′(φ_n) M(g(φ_n)) = (φ_n − φ) [g″(θ) M(g(θ)) + (g′(θ))² M′(g(θ))]

for some θ such that |θ − φ| ≤ |φ_n − φ|. This, (7), (17) and Theorem 3.2 give

(28) φ_{n+1} − φ = (φ_n − φ)(1 − n^{−1} Γ_n) − n^{−1+γ/2} Φ_n V_n − n^{−3/2+γ/2} T_n,

where

Γ_n = a_n [g″(θ) M(g(θ)) + (g′(θ))² M′(g(θ))] → A_opt (g′(φ))² M′(g(φ)) = 1 = Γ,
Φ_n = a_n c_n^{−1/2} → A_opt C_opt^{−1/2} = Φ, and
T_n = a_n n^{(1−γ)/2} β_n → A_opt T.

Since E_{F_n} V_n = 0, the proof of the theorem will be complete upon showing

(29) E_{F_n} V_n² is bounded and converges a.s., and
(30) σ_{n,r}² = E[I(V_n² ≥ rn) V_n²] → 0 for each r.

By the boundedness of g′, A, B, B_n and E_{F_n} fg_n(φ_n), and since c_n n^{−γ} E_{F_n} S_n² converges by (26).
Further,

c_n n^{−γ} E_{F_n}(fg_n(φ_n))² = (c_n n^{−γ})^{−1} ∫ k²[(g^{−1}(s) − φ_n)/(c_n n^{−γ})] f(s) ds
  = ∫_0^1 k²(y) (F*g)^{(1)}(φ_n + c_n n^{−γ} y) dy

is bounded. This gives the boundedness of E_{F_n} V_n² and proves (29). To show (30), by Lemma 2.7, we need only show for t ≤ p that

(31) E_{F_n}((c_n n^{−γ})^{1/2} V_n)^t = O(1).

By Lemma 2.4,

E_{F_n} V_n^t ≤ c_n^{t/2} n^{−γt/2} [K_1 + K_2 E_{F_n} |fg_n(φ_n)|^t + K_3 |E_{F_n} fg_n(φ_n)|^t].

Now, c_n n^{−γ}, E_{F_n} fg_n(φ_n) and |c_n n^{−γ} fg_n(φ_n)|^t are bounded. Since E_{F_n}(c_n n^{−γ} V_n²)^t = O(1), the bounded version of Lebesgue's Dominated Convergence Theorem proves (31) and hence the result. □

CHAPTER 4
ANOTHER APPLICATION OF SA - ESTIMATING THE MODE

§4.1 Introduction

The asymptotic rate of convergence of the sequential procedures introduced in Chapter 2 and refined in Chapter 3 is largely dominated by the behavior of the density estimator. For this reason, and because density estimation is such an important problem in statistical estimation, in this chapter we focus on estimators such that the pth derivative of the density at the estimator converges (in some sense) to a specified value. These estimators are obtained in a sequential manner via stochastic approximation and are in some ways better than those currently existing in the literature.

More specifically, let f^{(p)} denote the pth derivative of the density function f, p = 0, 1, .... Fixing α ∈ R, the goal is to find x_0 ∈ R such that f^{(p)}(x_0) = α. Without loss of generality, we take α = 0. As motivation, note that if we impose some mild restrictions on f and let p = 1, our formulation is the problem of finding the mode of a distribution function. This is a difficult problem for unknown densities which cannot be assumed to be symmetric. We assume the range of the density is R and that the x_0 we seek is unique. In many cases, x_0 will not be unique over the whole real line, as in the case of finding points of inflection in a symmetric distribution.
In this case, we assume there exist known constants a and b with a < b (where a or b may take on infinite values) such that x_0 is the unique root in [a, b]. We may then apply a truncated procedure as introduced in §2.4. The procedure proposed will be shown to have the same rate of convergence as the best estimator among those known, such as the one introduced by Eddy (1980), which is described below. However, our procedure solves a broader class of problems. Further, using well-known techniques of stochastic approximation, we easily handle the multivariate version and introduce certain types of dependency structures into the observations.

According to the procedure proposed by Eddy, let {a_n} be a nonincreasing sequence of positive constants and k be a bounded function (k is called the kernel). A kernel estimator for the density f is

f_n(t) = (n a_n)^{−1} Σ_{i=1}^n k((X_i − t)/a_n),

where {X_i} is an i.i.d. sample from a distribution having density f. Define the functional M(f) = inf{t ∈ R: f(t) = sup_s f(s)}. Let θ = M(f) and θ_n = M(f_n), where, under suitable conditions on f, θ is the mode of f and θ_n is an estimator of the mode. For a class of kernel functions k and sequences {a_n}, Eddy demonstrated the asymptotic normality of θ_n − θ, suitably standardized. These rates are superior to earlier efforts (cf., Chernoff, 1964).

As Fritz (1973) argued, stochastic approximation is a natural procedure for estimating the mode. Especially in a multivariate situation, there could be computational difficulties in calculating an empirical density and using its maximum as an estimator of the mode. This is essentially the method used by Eddy in the univariate case. The multivariate case is examined by Samanta (1973), Konakov (1974) and Sager (1978), among others. Fritz used the kernel functions of Bhattacharya (1967) and showed that his m-dimensional estimators of the mode converged almost surely. This chapter uses more sophisticated estimators given by Singh (1977).
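The maximize-an-estimated-density approach just described can be sketched in a few lines: build the kernel density estimate f_n on a grid and take the smallest maximizer as M(f_n). The Gaussian sample, the triangular kernel, the bandwidth rule and the grid resolution below are hypothetical choices for illustration.

```python
import random

# Sketch of a grid-search mode estimator of the kind described above:
# estimate f with a kernel smoother, then maximize.  Ties are broken
# toward the smallest grid point, matching M(f) = inf{t: f(t) = sup f}.

def kde(ts, xs, a_n):
    # f_n(t) = (n a_n)^{-1} sum_i k((X_i - t)/a_n), triangular kernel
    k = lambda u: max(0.0, 1.0 - abs(u))
    n = len(xs)
    return [sum(k((x - t) / a_n) for x in xs) / (n * a_n) for t in ts]

def mode_estimate(xs, grid_pts=400):
    a_n = 1.06 * len(xs) ** (-0.2)      # rough bandwidth choice (unit scale)
    lo, hi = min(xs), max(xs)
    ts = [lo + (hi - lo) * i / (grid_pts - 1) for i in range(grid_pts)]
    fs = kde(ts, xs, a_n)
    best = max(range(grid_pts), key=lambda i: (fs[i], -i))
    return ts[best]

rng = random.Random(0)
xs = [rng.gauss(2.0, 0.5) for _ in range(2000)]
m = mode_estimate(xs)                   # should land near the true mode, 2.0
```

The computational burden of this grid search, which grows quickly with dimension, is exactly the motivation given in the text for preferring a stochastic approximation scheme in multivariate problems.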
Almost sure convergence and convergence in distribution (nondegenerate) of suitably standardized estimators are achieved here. Fritz's process has some additional interesting features, which we remark on in §4.4.

§4.2 Notation and Assumptions

Let B_0 be the class of all Borel-measurable real-valued functions k where k is bounded and equal to zero outside [0,1]. For fixed integers p and r where 0 ≤ p < r, define

K = {k ∈ B_0: (1/j!) ∫_0^1 y^j k(y) dy = 1 if j = p, and = 0 if j ≠ p, j = 0, 1, ..., r−1}.

We assume there exist sequences of positive decreasing constants {a_n} and {c_n}. Define the estimator of f^{(p)}, the pth derivative of f, by

(1) f_n^{(p)}(x) = k((Z_n − x)/c_n)/c_n^{p+1},

where {Z_n} is an i.i.d. sequence of random variables having density f. We note that the definition in (1) allows for negative estimates of the density. This flexibility considerably enhances the rate of convergence which we will be able to achieve (cf., Singh, 1977).

Take X_1 to be an arbitrary random variable with finite second moment and let F_n = σ(X_1, Z_1, ..., Z_{n−1}). Define the remainder of the sequence {X_n} by the recursive equation

(2) X_{n+1} = X_n − a_n f_n^{(p)}(X_n).

We show in the succeeding sections strong consistency of the sequence {X_n} for x_0 and calculate rates of convergence. We begin by listing the most important assumptions.

B1. For some integers p and r, where 0 ≤ p < r, we assume f(x) and f^{(r)}(x) exist for each x and are bounded on the entire real line.

B2. For each x ≠ x_0, (x − x_0) f^{(p)}(x) > 0.

B3. lim_n c_n = 0, Σ_1^∞ a_n = ∞, Σ_1^∞ a_n c_n^{r−p} < ∞ and Σ_1^∞ a_n² c_n^{−(2p+1)} < ∞.

B4. f^{(p+1)} is bounded on R; f, f^{(p+1)} and f^{(r)} are continuous at x_0.

B5. For some A, C > 0 and γ = 1/(2r+1), let a_n = A n^{−1} and c_n = C n^{−γ}. Assume that γ(r−p) < Γ_0, where Γ_0 = A f^{(p+1)}(x_0).

Remark: While B2 may appear strange (for p = 0), since densities are always nonnegative, recall that we are really interested in finding f(x_0) = α for some α.

Let T_0 = f^{(r)}(x_0) ∫_0^1 (y^r/r!) k(y) dy, which will turn out to be a factor in our asymptotic bias. The asymptotic variance of X_n will be proportional to Σ_0, where Σ_0 = f(x_0) ∫_0^1 k²(y) dy.

§4.3 Univariate Results and Remarks

Because the univariate results are most often used in practice and investigated in the literature, we give these results separately in this section.
1 000 yr/rt k(y)dy, which will turn out to be a factor in our asymptotic bias. The asymptotic variance of Xn will be proportional to ~o' where 1 ~o = f(x o ) ~o k 2 (y)dy. §4.3 Univariate Results and Remarks Because the univariate results are most often used in practice and investigated in the literature, we give these Page 79 results separately in this section. Their proofs will be direct corollaries of the multivariate version given in §4.4. Theorem 4.1 Assume BI-B3. Then for the procedure defined in (2), a.s. Theorem 4.2 Assume Bl, B2, B4 and BS. Then for the procedure defined in (2), we have n<r-p)/<2r+l) <Xn-x ) ~D N<1l , o-~) o 3 where 113 = A c-<r-p) To/<ro-Y<r-p» 0- 2 3 = A2 c-<2p+l) )" /<2 -0 r0 -2Y<r-p». For p=l, we achieve the same rates of convergence as Eddy but using weaker conditions. Eddy required the existence of a bounded <r+l)st derivative. In fact, if we assume Eddy's conditions we get better rates. These rates are optimal in the sense that they are the same as given by stone (1980) under similar conditions. The choice of the kernel is an open question. Although we have chosen [0,1] to be the interval of support for our class of kernel estimators, the bandwidth is actually controlled by our choice of c n = C n -Y • We have Page 80 specified Y to achieve a best rate of convergence, but we still have the flexibility of specifying C>O. One criteria that has been proposed is to minimize the asymptotic mean square error. By Theorem 4.2, we have that under certain conditions n(r-p)/(2r+l) (X -x ) ~D n 0 X, where X is a normally distributed random variable with mean A C-(r-p) T /( o A2 C-(2p+l) r -Y(r-p» 0 > /(2 -0 and variance r -2Y(r-p». ·0 and C to minimize E X2 The criteria of selecting A was suggested in a stochastic approximation scheme by Abdelhamid (1973). Elementary calculations show that A = r/(r-p) /f(p+l)(X ) o 2 C = {[(2p+l)(r+p+l)(1-Y) ~0]/[8(r-p) E X2 = A2 > -0 /{4C 2p + l r(r+p+l)y 2 }. 
Unfortunately, values for the optimal A and C are still \ given in terms of the unknown density f. In other SA problems, it has been suggested that A and C be replaced by random variables that converge almost surely to the best A and C, i.e., those that give the best asymptotic mean square error. See Fabian (1971) for a good review or Lai and Robbins (1979) for a more recent approach. Adaptive estimators of A and C would not be hard to construct using the procedures introduced in §3.3. Page 81 We conjecture here that several generalizations of the procedures should be immediately possible (beyond the multivariate extension of §4.4). (a) Chapter 4 uses only one type of kernel estimator. More general results are possible so ,that many types of density estimators would be applicable, as in §3.l. (b) Papers in the previous literature have made the assumption that {Zn} is an i.i.d. sequence. This (in)dependence assumption may not be necessary in using stochastic approximation estimators. The basis of the proof in §4.5will apply Fabian's (1968ii) result. Using results of Ruppert (1982), these type of dependencies may be further weakened. (c) In the situation where X n is a lifetime random variable, the estimator of §4.3 uses information of the lifetime of a unit only up to X +c. n n This truncated procedure could result in potential cost savings if, for example, the cost of the experiment were in some way related to how long a unit survives, e.g., clinical trials. Lai and Robbins (1979) have formalized this idea with respect to the R-M procedure and shown optimality properties. This should be possible for the sequential mode estimator as well. Page 82 §4.4 Multivariate Analogs In this section we reformulate the results of §4.2 in a multivariate setting. general setting. §4.5 proves these results in this more Let f be a probability density function defined on Rm (m-dimensional Euclidean space) and define Pl Pl P2 P2 Pm Pm Q(~) = (J f(~)/Jxl' J f(~)/Jx2 , ... 
, J f(~)/Jxm )', where O<p.<r, i=l, ... ,m for integers p.1 and r. The prime 1denotes transpose and xi represents the i th entry in the column vector x. We shall use an underscore to emphasize a m For a fixed ~=(~l' .. "~m)' e R , we wish to find vector. that -x0 e R m such that D(x )=~, where x is assumed unique. --0 -0 If Pl=' .. =Pm=O, this is the problem of finding the point at which the density takes on a certain value~. For Pl= ... =Pm=1 and -~=(O, ... ,O)', this is the problem of estimating the mode of a multivariate density. This is a straight-forward generalization that handles all the important problems. Other extensions should be possible, say, in finding levels of mixed partial derivatives of a density (cf., Singh, 1976). Let {Z.} be an i.i.d. sample having density f and denote -1 ~j=(Xl,j, ... ,xm,j)'. We define our kernel functions k i by requiring they be bounded, measurable functions from R to R equal to 0 outside [0,1] and such that page 83 I = L 1 1/ j! ~ 0 1 0 if j = p. )' r~ Pi1 1' f . 0 , ••• )= ,r- 1 . Our estimator of D(x) is (PI) (Pm) (3) D (x) = (f (x), .. .,f (x»' -n n n (p. ) where f 1 n (x) = -p.-m c 1 k. ( ( Z . -x. ) / c ) n 1 1,n 1 n The product is over the set {j: mean k.(y) with p.=O. 1 1 Tf k « z . -x . )/ c ). 0 ),n ) n l~j~m, i~j}. By ko(y) we without loss of generality, take ex= ( 0 , ••• , 0 ) , • Let p = max{p.} and 11.1 I denote the usual rna 1 dimensional Euclidean norm. The stochastic approximation algorithm that we use is (4) 2 where ~l is an arbitrary random vector such that E I I~ll 1 <00. Before giving results, it is convenient here to introduce some more notation. Let H denote the matrix of p.+l p. partial derivatives of D, Le., H = (J 1 f(x)/ Jx. 1 Jx.) ... - - 1 ) 1) We assume that H is positive definite in a neighborhood of mxm (the space of real mXm matrices) such Let Cn , E e R that C =diag(c P i+m/ 2 ) and E=diag(!(Pi=Pa». Recall that n n ~o· I(.) is the indicator function. 
Use I for an m×m identity matrix. Let F_n = σ(X_1, Z_1, ..., Z_{n−1}). A conditional bias term is Λ_n = E_{F_n}(D_n(X_n) − D(X_n)). The asymptotic variance is proportional to Σ_0 = (σ_{ij})_{ij}, where

σ_{jj} = f(x_0) (∫_0^1 k_0²(y) dy)^{m−1} (∫_0^1 k_j²(y) dy),

and

σ_{ij} = f(x_0) (∫_0^1 k_0²(y) dy)^{m−2} (∫_0^1 k_i(y) k_0(y) dy) (∫_0^1 k_j(y) k_0(y) dy), i ≠ j.

Define an orthogonal matrix P so that P′ H(x_0) P is diagonal, say equal to Λ. For i = 1, ..., m, let

t_i = (1/r!) [(∂^r f(x_0)/∂x_i^r) ∫_0^1 y^r k_i(y) dy + ∫_0^1 y^r k_0(y) dy {Σ_{j=1}^m (∂^r f(x_0)/∂x_j^r) I(i ≠ j)}].

As in §4.2, a list of important assumptions is provided.

C1. Let r be some integer, and (p_1, ..., p_m) an m-tuple of integers such that 0 ≤ p_i < r, i = 1, ..., m. We assume that f(x) and ∂^r f(x)/∂x_i^r, i = 1, ..., m, exist for each x, are bounded on R^m, and that ∂^{p_i} f(x)/∂x_i^{p_i}, i = 1, ..., m, exist for each x.

C2. inf{D(x)′(x − x_0): ε < ||x − x_0|| < ε^{−1}} > 0 for every ε > 0.

C3. c_n → 0, Σ_1^∞ a_n = ∞, Σ_1^∞ a_n c_n^{r−p_a} < ∞ and Σ_1^∞ a_n² c_n^{−(2p_a+m)} < ∞.

C4. H(x) exists for each x and is bounded. Let f, H, and ∂^r f(x)/∂x_i^r, i = 1, ..., m, be continuous at x_0.

C5. For some A, C > 0, let γ = 1/(2r+m), a_n = A n^{−1}, and c_n = C n^{−γ}. Assume γ(r−p_a) < min_i λ_i(Γ_0), where Γ_0 = A H(x_0) and λ_i(Γ_0) is the ith eigenvalue of Γ_0.

We now proceed to give the main results, the details of the proofs being relegated to the following section.

Theorem 4.3
Assume C1-C3. Then for the procedure defined in (4), X_n → x_0 a.s.

Theorem 4.4
Assume C1, C2, C4, C5. Then, for the procedure defined in (4),

n^{(r−p_a)/(2r+m)}(X_n − x_0) →_D N(μ_4, Σ_4),

where, with t = (t_1, ..., t_m)′,

μ_4 = A C^{r−p_a} [Γ_0 − I γ(r−p_a)]^{−1} E t,
Σ_4 = A² C^{−(2p_a+m)} P M P′, M_{ij} = (P′ E Σ_0 E P)_{ij} (Λ_{ii} + Λ_{jj} − 2γ(r−p_a))^{−1}.

We easily get a convergence result for a multivariate estimator of the mode.

Corollary 4.1
Assume C1, C2, C4, C5 and that p_1 = ... = p_m = 1.
Then the sequence {X_n} defined in (4) is a strongly consistent estimator of the mode x_0, and

n^{(r−1)/(2r+m)}(X_n − x_0) →_D N(μ_5, Σ_5),

where

μ_5 = A C^{r−1} [A H(x_0) − I γ(r−1)]^{−1} t,
Σ_5 = A² C^{−(m+2)} P M P′, M_{ij} = (P′ Σ_0 P)_{ij} (Λ_{ii} + Λ_{jj} − 2γ(r−1))^{−1}.

We remark here that Fritz (1973) allowed for multimodal densities and introduced a stochastic approximation procedure that converged almost surely to one of the local maxima. Whether allowing for more than one local maximum inhibits convergence to a nondegenerate distribution when suitably standardized is an open question.

§4.5 Proofs

We begin the proofs of Theorems 4.3 and 4.4 with a few preparatory lemmas. Let f^{(p_i)}(x) = ∂^{p_i} f(x)/∂x_i^{p_i}. Unless otherwise indicated, all relationships between random variables are assumed to hold with probability one.

Lemma 4.1
Assume C1 and that there exists a sequence of constants {x_n} tending to x_0. Then,

(a) sup_x |E f_n^{(p_i)}(x) − f^{(p_i)}(x)| = O(c_n^{r−p_i}), and
(b) lim_n c_n^{−(r−p_i)} {E f_n^{(p_i)}(x_n) − f^{(p_i)}(x_n)} = t_i.

Proof: Through an application of the transformation theorem, we have

E f_n^{(p_i)}(x) = c_n^{−p_i−m} ∫ k_i((s_i − x_i)/c_n) Π_{j≠i} k_0((s_j − x_j)/c_n) f(s) ds
  = c_n^{−p_i} ∫_{[0,1]^m} k_i(y_i) Π_{j≠i} k_0(y_j) f(x + c_n y) dy.

Via a Taylor series expansion of f(x + c_n y) through terms of order r, with remainder R*(η) evaluated at some η such that ||η − x|| ≤ c_n ||y||, there is only one term with the power y_1^0 y_2^0 ... y_{i−1}^0 y_i^{p_i} y_{i+1}^0 ... y_m^0, and this term has coefficient c_n^{p_i} (1/p_i!) ∂^{p_i} f(x)/∂x_i^{p_i}. Since k_i ∈ K,

E f_n^{(p_i)}(x) = c_n^{−p_i} ∫ k_i(y_i) Π_{j≠i} k_0(y_j) {(y_i^{p_i} c_n^{p_i}/p_i!) f^{(p_i)}(x) + R*(η)} dy = f^{(p_i)}(x) + R_n(η),

where R_n(η) = c_n^{−p_i} ∫ k_i(y_i) Π_{j≠i} k_0(y_j) R*(η) dy. Result (a) follows directly from the boundedness of the kernels and f^{(r)}. To prove (b), note that c_n^{−(r−p_i)} R_n(η*), where ||η* − x_0|| ≤ ||x_n − x_0|| + c_n ||y||, is a kernel-weighted average of rth-order partial derivatives of f; from a simple application of the bounded convergence theorem, we get the result. □
1 i . ) = J.. . J.j From a simple application of the bounded convergence theorem, we get the result.O Page 88 Lemma 4.2 Onder the assumptions of Lemma 4.1, for t>O there exists a constant M>O such that, (a) sUPx c t() m n Pi+ -m IB(f (p.) n l(~»tl< M (p.) ) l'lm c t( p.+m -m B(f n l(X -n »t = n n 1 f (x ). ~ k ~ (y .) -0 [O,llm 1 1 Proof: (b) IT ...t. k t Jr10 (y .) J dy. An easy application of the bounded convergence theorem, c t ( Pi +m ) -m n B(f (p . ) n 1 = c~m~ = ~ [O,ll (x» t ki«Si-Xi)/c n ) m ki(Yi) IT j IT kg(Yj) kg«Sj-Xj)/c n ) f(s) ds f(~+cnY) dye By the boundedness of f and k., result (a) is true. 1 Further, c~(Pi+m)-m B(f:Pi~~n»t limn = ~ k~(y.) IT.J 1 1 kot(y.) limn f(x +c y)dy J -n n which gives result (b).D Lemma 4.3 Under the conditions of Lemma 4.1, for iFj, and for some constant M>O, we have, Proof: For iFj, Page 89 k . ( (s . -x . ) / c ) k « s . -x. ) / c ) c -p i -p j - 2m ~ am 1 1 1 n 0 1 1 n n k . ( ( s . -x . ) / c ) k « s . -x . ) / c ) J J J n 0 J J n lTk ,.,/..1 , J . k 2 ( ( sk-xk) / c ) on f ( s ) ds k.(y.) k (y.) k.(y.) k (y.) 1 1 0 1 J J 0 J IT k~(Yk) f(~n+cnY) dye We get the result (a) by the boundedness of f and k .. 1 Result (b) is due to the bounded convergence theorem and the fact that -n x ~x . [] -0 Proof of Theorem 4.3: By a Taylor-series expansion, Let U = (X -x )'(X -x). n -n -0 -n-o Thus, Un + l = (!n+l - !n +!n - ~o)'(!n+l - !n +!n - ~o) = Un + By Lemma 4.2(a) a~ ~n(!n)'~n(!n) - 2a n ~n(!n)'(!n-~o)· (t=2), E P (f n (p. ) 2 l(X» n -n Thus, U + Ka 2 c-( 2P a+m ) - 2a /\' (X -x ) n n n n-n -n-o - 2a D(X )'(X -x ) n- -n -n-o 2P a+m ) < Un <l+2a n 11/\ II) + 2a n 11/\ II + Ka n2 c-( -n -n n -2a D(X )'(X -x ). n- -n -n-o From Lemma 4.l(a), we have I I~nl 1 = O(C~r-Pa». The result is immediate from C3 and the theorem due to Robbins and Siegmund.[] page 90 Proof of Theorem 4.4: We will show that the conditions of Fabian's (1968ii) theorem are met and use that result to give the conclusion of the theorem. 
Define

    (5)  V_n = c_n (D_n*(X_n) - D*(X_n) - Λ_n*).

Obviously, E_{F_n} V_n = 0. By Theorem 4.3 and Lemma 4.1(a), D(X_n) = o(1) and Λ_n = O(c_n^{r-p_a}) = o(1). Since c_n = o(1), the cross terms in E_{F_n} V_n V_n' are o(1), and by Lemmas 4.2(b) (t=2) and 4.3(b), c_n² E_{F_n} D_n*(X_n) D_n*(X_n)' tends to Σ_0, giving

    (6)  E_{F_n} V_n V_n' → Σ_0.

By Lemma 4.1(a), Λ_n is bounded; Lemma 4.2(a) (t=1) gives D_n*(X_n) bounded in conditional mean, and D(X_n) is bounded due to C1. E_{F_n} D_n*(X_n) D_n*(X_n)' is bounded by Lemma 4.2(a) (t=2). Together, there exists a constant M such that

    (7)  ‖E_{F_n} V_n V_n'‖ < M for each n.

Now, define σ²_{n,r} = E[ I(‖V_n‖² ≥ rn) ‖V_n‖² ]. For t > 0,

    ‖V_n‖^t = c_n^{mt/2} [ {D_n*(X_n) - D*(X_n) - Λ_n*}'{D_n*(X_n) - D*(X_n) - Λ_n*} ]^{t/2}
            ≤ K c_n^{mt/2} [ {D_n*(X_n)'D_n*(X_n)}^{t/2} + {2 D_n*(X_n)'(D*(X_n) + Λ_n*)}^{t/2} + {(D*(X_n) + Λ_n*)'(D*(X_n) + Λ_n*)}^{t/2} ]

for some positive constant K. The last term is bounded and o(1). By the Cauchy-Schwarz inequality, we need only show that c_n^{mt/2} E_{F_n} ‖D_n*(X_n)‖^t is bounded and O(1); but this is obvious from Lemma 4.2(a). Thus, by the bounded version of Lebesgue's Dominated Convergence Theorem, we get E[c_n^{mt/2} ‖V_n‖^t] = O(1) for any t > 0. For a fixed t > 0, define s by s^{-1} + t^{-1} = 1. By Hölder's and Markov's inequalities,

    σ²_{n,r} ≤ (E ‖V_n‖^{2t})^{1/t} [P(‖V_n‖² ≥ rn)]^{1/s}
             ≤ (O(1)/c_n^{mt})^{1/t} [E ‖V_n‖^{2t}/(rn)^t]^{1/s}
             = O(1) n^{γm} n^{-(1-γm)t/s} = O(1) n^{(γm-1)t+1}
             = o(1) if (γm-1)t + 1 ≤ 0, for each r.

We may choose t = 1 + m/(2r), and thus we get

    (8)  σ²_{n,r} → 0 as n → ∞, for each r.

Let U_n = X_n - x_0. From (4), we get

    U_{n+1} = U_n - a_n D_n*(X_n)
            = U_n - a_n (c_n^{-1} V_n + D(X_n) + Λ_n)
            = [I - n^{-1} Γ_n] U_n - n^{-1/2-γ(r-p_a)} Φ_n V_n - n^{-1-γ(r-p_a)} T_n,

where Γ_n = A H(θ_n) with ‖θ_n - x_0‖ ≤ ‖X_n - x_0‖, and Φ_n = A c_n^{-1} n^{γ(r-p_a)-1/2}. By C1 and Theorem 4.3, Γ_n = A H(θ_n) → A H(x_0) =: Γ; by Lemma 4.1(b), T_n → A C^{-(r-p_a)} T_0 =: T; and Φ_n → Φ = A C^{-(p_a+m/2)}. We have thus constructed a recursive algorithm that is in the form specified by Fabian (1968ii), with the sufficient properties (5)-(8). Our Theorem 4.4 is an immediate corollary of his result. □

CHAPTER 5

SA - SOME EXTENSIONS OF THE THEORY

§5.1 Adaptive K-W procedures

Adaptive stochastic approximation procedures have received attention recently in the literature in the Robbins-Monro case, cf. Anbar (1978), Lai and Robbins (1978, 1979 and 1981). As stated in §2.3, much of this work was motivated by the paper of Venter (1967). Some attempts have been made to efficiently adapt the Kiefer-Wolfowitz (K-W) procedure (cf. Fabian, 1971). In this section, conditions are given for achieving an optimal K-W procedure (optimal in a sense to be defined).

Consider the Robbins-Monro (R-M) procedure,

    (1)  X_{n+1} = X_n - A n^{-1} Y_n,

where Y_n is a conditionally (given X_1, ..., X_n) unbiased estimator of f(X_n), and A is a fixed, known constant. The recursive procedure described in (1) produces a sequence of estimators for finding the root θ of f. Under mild conditions, the following asymptotic properties hold:

    (2)  lim_{n→∞} X_n = θ  a.s.,
    (3)  n^{1/2} (X_n - θ) →_D N(0, A² σ²/(2A f^{(1)}(θ) - 1)),

where σ² = lim_n E([Y_n - f(X_n)]² | X_1, ..., X_n). The choice of A to minimize the asymptotic mean square error (MSE) is (f^{(1)}(θ))^{-1}. For an f(·) known up to a finite number of parameters, it was proposed by Albert and Gardner (1967) and Sakrison (1965), under different scenarios, to replace A by a sequence of estimators. Sufficient conditions were given so that the modified procedure retains properties (2) and (3). Venter (1967) proposed modifying the procedure in (1) in the case where f is unknown to get estimators of A. Under certain conditions, this modified algorithm was shown to have the asymptotic properties (2) and (3), with A = (f^{(1)}(θ))^{-1}.
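The basic recursion (1) is easy to simulate. The sketch below is illustrative only: the regression function, the noise level, and the starting point are hypothetical choices, not values from the text; the gain A = 1/f'(θ) is the MSE-minimizing choice just discussed.

```python
import random

def robbins_monro(f, A, x1, n_steps, noise_sd=0.1, rng=None):
    """Robbins-Monro recursion (1): X_{n+1} = X_n - (A/n) Y_n, where Y_n is
    a conditionally unbiased, noisy observation of f(X_n)."""
    rng = rng or random.Random(0)
    x = x1
    for n in range(1, n_steps + 1):
        y = f(x) + rng.gauss(0.0, noise_sd)   # unbiased observation of f(X_n)
        x -= A * y / n
    return x

# Illustrative regression function with root theta = 1.5; the gain
# A = 1/f'(theta) = 0.5 mimics the optimal choice discussed above.
root = robbins_monro(lambda x: 2.0 * (x - 1.5), A=0.5, x1=0.0, n_steps=20000)
```

With the asymptotically optimal gain, the variance in (3) reduces to σ²/(f^{(1)}(θ))², which is what Venter-type adaptive schemes try to attain without knowing f^{(1)}(θ).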
Fabian (1968i) weakened the conditions imposed by Venter. Alternative least squares estimators have been proposed by Anbar (1978) and Lai and Robbins (1978), whose properties have been further investigated by Lai and Robbins (1979 and 1981).

For the K-W procedure, define

    (4)  X_{n+1} = X_n - A n^{-1} Y_n,

where Y_n is a conditionally (given X_1, ..., X_n) unbiased observation of [f(X_n + c_n) - f(X_n - c_n)]/(2c_n). Let {c_n} be a sequence of fixed, known constants such that c_n = C n^{-γ}. Assume that θ is the unique maximum of f, and let

    σ_i² = lim_n E([Y_in - f(X_n - (-1)^i c_n)]² | X_1, ..., X_n),  i = 1, 2.

Then, under certain mild conditions,

    (5)  lim_{n→∞} X_n = θ  a.s.,
    (6)  n^{1/3} (X_n - θ) →_D N(μ_B, A² C^{-2} σ²/[2A f^{(2)}(θ) - 2/3]),

where γ = 1/6 and μ_B = A C² f^{(3)}(θ)/[6(2A f^{(2)}(θ) - 2/3)]. To minimize the mean square error of the asymptotic distribution, elementary calculations show that we should choose the parameters A and C such that

    (7)  A_opt = (f^{(2)}(θ))^{-1},  C_opt = [24 σ²/(f^{(3)}(θ))²]^{1/6}.

Unfortunately, values for f^{(2)}(θ), f^{(3)}(θ) and σ² are usually not known prior to conducting the experiment. For a more sophisticated estimator of f^{(1)}(X_n), and in the multivariate case, Fabian (1971) demonstrated how the algorithm in (4) may be modified so that A may be replaced with a strongly consistent estimator such that (5) and (6) continue to hold. His modifications start with {m_k}, a sequence of positive integers increasing to ∞. At each stage n = m_k, k = 1, 2, ..., additional observations are taken to estimate a_n* and c_n* (Fabian did not concern himself with estimating C_opt with c_n*; however, the extension is immediate). The {m_k} are chosen going to infinity as k → ∞, but at a slow rate compared to the rate of convergence of the process, so as to be asymptotically negligible. A major drawback of Fabian's approach is that it requires taking extra observations at various stages.
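The K-W recursion (4) can be sketched as follows. Everything here is an illustrative assumption: the text treats a unique maximum (so that A_opt in (7) carries the sign of f^{(2)}(θ)); the sketch instead uses a function with a unique minimum so the minus sign in (4) performs descent with a positive gain, and the noise level and constants are hypothetical.

```python
import random

def kiefer_wolfowitz(f, A, C, x1, n_steps, noise_sd=0.05, rng=None):
    """K-W recursion in the spirit of (4): X_{n+1} = X_n - (A/n) Y_n, where
    Y_n is a noisy two-sided difference quotient of f at X_n with span
    c_n = C n^(-1/6), the rate gamma = 1/6 used in (6)."""
    rng = rng or random.Random(1)
    x = x1
    for n in range(1, n_steps + 1):
        c = C * n ** (-1.0 / 6.0)
        y1 = f(x + c) + rng.gauss(0.0, noise_sd)   # noisy f(X_n + c_n)
        y2 = f(x - c) + rng.gauss(0.0, noise_sd)   # noisy f(X_n - c_n)
        x -= A * (y1 - y2) / (2.0 * c * n)
    return x

# Illustrative f with a unique minimum at theta = 2; A = 1/f''(theta) = 0.5
# mimics the first relation in (7).
est = kiefer_wolfowitz(lambda x: (x - 2.0) ** 2, A=0.5, C=1.0, x1=0.0,
                       n_steps=20000)
```

Note the n^{1/3} rate in (6): halving the error requires roughly eight times as many observations, which is why the choice of A and C matters so much in finite samples.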
In Chapter 3, an adaptive procedure was developed for the ARP problem without taking additional observations at various stages. The goal of this section is to give sufficient conditions so that (5) and (6) hold with the best possible choices of A and C, without taking additional observations. Along the lines of Lai and Robbins, this procedure will be called an adaptive Kiefer-Wolfowitz procedure.

We wish to find the (assumed) unique maximum of the function f, say θ. Let X_1 be an arbitrary random variable with finite second moment. We use sequences of random variables {X_n}, {c_n*}, {a_n*} and {Y_in}, i = 1, 2, where Y_1n is independent of Y_2n. Define F_n = σ(X_1, ..., X_{n-1}), and assume c_n* and a_n* are F_n-measurable. For some positive constants Z_1, Z_2, Z_3, Z_4, a, b, c and γ, let

    (8)  a_n = (Z_1 (log n)^{-1} ∨ a_n*) ∧ Z_2 n^a,
    (9)  c_n = (Z_3 n^{-b} ∨ c_n*) ∧ Z_4 n^c,

where a/2 + c < γ, a + b + γ < 1/2 and γ = 1/6. Further, let Y_n = (Y_1n - Y_2n)/(2 c_n n^{-γ}). The procedure is given by

    (10)  X_{n+1} = X_n - n^{-1} a_n Y_n.

We make the following assumptions:

D1.  E_{F_n} Y_1n = f(X_n + c_n n^{-γ}) and E_{F_n} Y_2n = f(X_n - c_n n^{-γ}).
D2.  For some positive constant C, E_{F_n}[Y_in - f(X_n - (-1)^i c_n n^{-γ})]² ≤ C a.s. and converges to σ², i = 1, 2.
D3.  The function f has two continuous derivatives and sup_x |f^{(2)}(x)| < ∞.
D4.  For each x ∈ R and some d > 0, f^{(3)}(x) = f^{(3)}(θ) + O(|x-θ|^d).
D5.  For some t > 2, E_{F_n}|Y_in - f(X_n - (-1)^i c_n n^{-γ})|^t < ∞ a.s., i = 1, 2.
D6.  For each ε > 0, inf{|f^{(1)}(x)| : |x-θ| > ε} > 0 and inf{f(θ) - f(x) : |x-θ| > ε} > 0.
D7.  There exist F_n-measurable sequences {a_n*} and {c_n*} such that a_n* → A_opt and c_n* → C_opt a.s.

Theorem 5.1  Assume D1-D6. For the procedure described in (10),

    (11)  lim_n X_n = θ  a.s.

Proof: By a Taylor-series expansion,

    E_{F_n} Y_n = [f(X_n + c_n n^{-γ}) - f(X_n - c_n n^{-γ})]/(2 c_n n^{-γ})
                = f^{(1)}(X_n) + (c_n n^{-γ})²/12 · [f^{(3)}(θ_1) + f^{(3)}(θ_2)],

where |θ_i - X_n| ≤ c_n n^{-γ}, i = 1, 2. Let B_n = (c_n n^{-γ})²/12 · [f^{(3)}(θ_1) + f^{(3)}(θ_2)] and e_n = Y_n - E_{F_n} Y_n, giving

    (12)  X_{n+1} - X_n = -n^{-1} a_n (f^{(1)}(X_n) + B_n + e_n).

Squaring and taking conditional expectations with respect to F_n gives

    (13)  E_{F_n}(X_{n+1} - θ)² = (X_n - θ)² - 2 n^{-1} a_n (X_n - θ)(f^{(1)}(X_n) + B_n)
                                   + n^{-2} a_n² [(f^{(1)}(X_n) + B_n)² + E_{F_n} e_n²].

By D4,

    |B_n| ≤ O(n^{2(c-γ)}) [2 f^{(3)}(θ) + |θ_1-θ|^d + |θ_2-θ|^d]
          ≤ O(n^{2(c-γ)}) [2 f^{(3)}(θ) + 2(c_n n^{-γ})^d + 2|X_n-θ|^d].

Since f^{(1)}(X_n) = f^{(1)}(θ) + (X_n - θ) f^{(2)}(η_n) with |f^{(2)}| ≤ K, we get

    |B_n f^{(1)}(X_n)| ≤ O(n^{2(c-γ)}) [O(1) + O((X_n - θ)²)]  and  (f^{(1)}(X_n))² ≤ O((X_n - θ)²).

Since E_{F_n} e_n² = O(n^{2(b+γ)}), (11) follows from (13) and the Robbins-Siegmund theorem. □

Thus, we have established convergence of the algorithm in (10) for a wide range of sequences {a_n*} and {c_n*}. Our goal is to specify procedures for calculating a_n* and c_n* such that

    (14)  a_n* → A_opt  and  c_n* → C_opt  a.s.,

and such that (5) and (6) hold. This turns out to be quite difficult for the general K-W procedure, so we mention it as an open problem. However, for more specific problems, such as the ARP and estimation of the mode, these estimators are easily calculated. This is due to the fact that both problems revolve around density estimation. For the purposes of the following theorem, we shall assume that there exist consistent sequences such that a_n* and c_n* are F_n-measurable (D7).

We remark here that upon completion of the nth trial, the observations Y_1n and Y_2n are available and could be used in the computation of a_n and c_n prior to calculating X_{n+1}. This, however, destroys the expedient martingale property. Lai and Robbins (1981) note this same problem for their adaptive R-M scenario, and with great effort give sufficient conditions for using this extra information. We merely note the dilemma and label it as an area for future research.
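The truncation device in (8)-(9) is simple to implement: the adaptive estimate is clamped between a slowly vanishing lower bound and a slowly growing upper bound, so that the bounds never bite once the estimate is consistent. The constants below are hypothetical.

```python
import math

def truncate_gain(a_star, n, Z1=0.1, Z2=10.0, a=0.05):
    """Truncation in the spirit of (8): clamp the adaptive gain estimate
    a*_n between Z1/log n (slowly vanishing) and Z2 n^a (slowly growing).
    Early wild estimates are tamed; eventually only a*_n matters."""
    lower = Z1 / math.log(max(n, 2))
    upper = Z2 * n ** a
    return min(max(a_star, lower), upper)
```

The companion rule (9) for c_n* has the same structure, with bounds Z_3 n^{-b} and Z_4 n^c; the side conditions a/2 + c < γ and a + b + γ < 1/2 keep the truncation asymptotically negligible.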
We now prove the optimality of the modified K-W procedure.

Theorem 5.2  Assume D1-D7. Then, for the procedure described in (10),

    (15)  n^{1/3} (X_n - θ) →_D N(μ_opt, A_opt² C_opt^{-2} σ²/[2 A_opt f^{(2)}(θ) - 2/3]),

where μ_opt = A_opt C_opt² f^{(3)}(θ)/[6(2 A_opt f^{(2)}(θ) - 2/3)], and A_opt and C_opt are given in (7).

Proof: We apply Fabian's (1968ii) classic result on the asymptotic normality of stochastic approximation processes. By a Taylor-series expansion, f^{(1)}(X_n) = (X_n - θ) f^{(2)}(θ_n) for some θ_n such that |θ_n - θ| ≤ |X_n - θ|. This, (12), and Theorem 5.1 give

    (16)  X_{n+1} - θ = (X_n - θ)(1 - n^{-1} Γ_n) + n^{-5/6} Φ_n V_n + n^{-1-1/3} T_n,

where Γ_n = a_n f^{(2)}(θ_n) → A_opt f^{(2)}(θ) = f^{(2)}(θ)/f^{(2)}(θ) = 1, Φ_n = a_n/(√2 c_n), V_n = √2 c_n n^{-1/6} e_n, T_n = n^{1/3} a_n B_n, and

    E_{F_n} T_n → C_opt² f^{(3)}(θ)/(6 f^{(2)}(θ)).

From D2, there exists a positive constant C such that E_{F_n} V_n² = O(1). Thus, we need only show

    (17)  E[ I(V_n² > rn) V_n² ] → 0.

By D5, E_{F_n}|e_n|^{2t} = O(c_n^{-2t} n^{-2γt}), giving E_{F_n} V_n^{2t} = O(1). Define s by s^{-1} + 2t^{-1} = 1 and use Hölder's inequality to get

    E[I(V_n² > rn) V_n²] ≤ (E_{F_n} V_n^{2t})^{2/t} (P_{F_n}(V_n² > rn))^{1/s}
                         ≤ O(1) (E_{F_n} V_n^{2t}/(rn)^{t/2})^{1/s} = O(1) n^{-t/(2s)} = o(1).

Since V_n² I(V_n² > rn) is dominated by V_n² and E V_n² < ∞, we apply Lebesgue's Dominated Convergence Theorem to get (17). □

§5.2 SA - Representation Theorem

We consider stochastic approximation processes of the form

    (18)  X_{n+1} = X_n - n^{-1} (f(X_n) + n^{-τ} B_n + n^{-τ} V_n),

where X_1 is an arbitrary random vector (m×1) with finite second moments, f is an unknown function from R^m to R^m, B_n and V_n are random vectors in R^m, and τ ∈ R with 0 ≤ τ ≤ 1/2. The algorithm in (18) is a special case of those studied by Kushner (1977) and Ljung (1978) for finding the (assumed) unique θ such that f(θ) = 0. One may think of the vector V_n as the random error at the nth stage, due to the fact that f(X_n) can only be measured with error. The B_n vector may be thought of as a bias term in the measurement of f(X_n).
For example, if m = 1 and F(x) = ∫_{-∞}^x f(t) dt can be measured with error, we may use [F(X_n+h_n) - F(X_n-h_n)]/(2h_n) as an approximation to f(X_n) for some small h_n > 0, giving B_n = f(X_n) - [F(X_n+h_n) - F(X_n-h_n)]/(2h_n). Many important functions can only be estimated with bias, e.g., estimators of a probability density function. To represent R-M processes, we take B_n = 0 for each n.

Motivation for studying a general algorithm has also been given by Ruppert (1982). Ruppert considers a very similar algorithm,

    (19)  X_{n+1} = X_n - n^{-1} (f(X_n) + n^{-2τ} β_n + n^{-τ} V_n),

where 0 ≤ τ ≤ 1/4. He then proves a representation theorem which we will use in our development of a random central limit theorem (CLT). After having argued that the representation we give is appropriate, methods popularized by Billingsley (1968) are used in §5.3 to prove a random CLT for the process described in (18). As an easy corollary to our random CLT, we get a sequentially determined bounded length confidence interval for the general stochastic approximation process described in (18). In §5.4, we show that these conditions are sufficient when applied to the ARP problem.

We choose to use a slight modification, given in (18), of the algorithm (19) considered by Ruppert. Our interest is in applying this algorithm to problems that have non-zero bias terms (B_n) but with rates of convergence (hopefully) greater than one third. To employ (19) directly (and Ruppert's main Theorem 3.1), we would need to include these biases in the term V_n, thus greatly clouding the natural interpretation of these vectors. By considering (18), however, Ruppert's proof goes through with only minute changes. The algorithm (18) is also in a form more closely related to the easily accessible algorithm given by Fabian (1968ii). Throughout the remainder of this chapter, we assume that all random variables are defined on a fixed probability space (Ω, 𝒜, P).
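The bias term B_n arising from the symmetric-difference approximation of f = F' above can be seen numerically. The CDF used below (standard logistic, whose density at 0 is 1/4) is an illustrative choice; the point is that the deterministic error behaves like (h²/6) f''(x), the prototype of the n^{-τ} B_n term in (18).

```python
import math

def density_via_cdf(F, x, h):
    """Symmetric-difference approximation of f(x) = F'(x); its error is a
    deterministic version of the bias term B_n discussed in the text."""
    return (F(x + h) - F(x - h)) / (2.0 * h)

# Standard logistic CDF; its density at 0 is exactly 1/4, and f''(0) = -1/8.
F = lambda t: 1.0 / (1.0 + math.exp(-t))
approx = density_via_cdf(F, 0.0, 0.1)
bias = approx - 0.25        # about (h**2 / 6) * f''(0) = -0.125 * h**2 / 6
```

Shrinking h kills the bias but inflates the variance once F is observed with error, which is exactly the trade-off the window sequences c_n and h_n manage throughout this work.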
When speaking of weak convergence, for simplicity we shall only consider functions defined on [0,∞) (i.e., m = 1). Thus, let D = D[0,∞) be the space of all functions defined on [0,∞) having left-hand limits and continuous from the right. Endow D with Stone's (1963) extension of Skorokhod's (1956) J1-topology and use ⇒_w to denote convergence in this topology. Also, use →_D and =_D to denote convergence and equivalence in distribution, respectively. For M ∈ R^{m×m}, let exp(M) = Σ_{i≥0} M^i/i!, t^M = exp[(log t) M], and let λ_i(M) be the ith eigenvalue of M. ‖·‖ denotes the m-dimensional Euclidean norm, and recall that [·] is the greatest integer function.

We state below the important assumptions and two preparatory lemmas.

E1.  Let ρ > 0, B ∈ R^m and B_n = B + O(n^{-ρ}).
E2.  Let δ > 0 and G ∈ R^{m×m}. We suppose that f(x) = G(x - θ) + O(‖x - θ‖^{1+δ}).
E3.  For each M ∈ R^{m×m} with min_i λ_i(M) > 1/2, suppose Σ_{n≥1} n^{-M} V_n converges a.s.
E4.  lim_{n→∞} X_n = θ  a.s.
E5.  There exists a standard Brownian Motion B on [0,∞) and an ε > 0 such that Σ_{k<t} V_k = σ B(t) + O(t^{1-ε}) a.s.

Lemma 5.1  Consider the process defined in (18) and assume E1-E4. Then, for κ = 1/2 - τ, there exists an ε > 0 such that

    (20)  n^κ (X_{n+1} - θ) = B/(G-κ) - n^{-(G-κ)} Σ_{k≤n} k^{G-κ-1/2} V_k + O(n^{-ε}).

Proof: For convenience, take θ = 0. We first show that

    (21)  n^b X_n = o(1)  for each b < κ.

Fix b < κ and define Y_n = (n-1)^b X_n. From (18), E2 and E4, Y_n satisfies a recursion of the form of equation 3.1 of Ruppert (1982, Lemma 3.1): the bias and error sums corresponding to (22)-(24) converge, using E3 and -1+b+τ < -1/2 for the error contribution. Following the arguments of that result, we get (21). The remainder of the argument follows Ruppert (1982, Theorem 3.1), replacing the appropriate exponents. □

Lemma 5.2 (Ruppert, 1982, Lemma 4.1)  For m = 1, assume a > -1/2 and E5. Then there exists a standard Brownian Motion B_a and an ε' > 0 such that

    (26)  Σ_{k<t} k^a V_k = σ B_a(t^{2a+1} (2a+1)^{-1}) + O(t^{a+1/2-ε'}).

If a < -1/2, then lim_{t→∞} Σ_{k<t} k^a V_k exists and is finite a.s.

Theorem 5.3  Suppose E1-E5 are satisfied for the process defined in (18), m = 1 and κ = 1/2 - τ. Let

    W_n(t) = [nt]^κ (X_{[nt]+1} - θ) - B/(G-κ),

and assume G > κ. Then, there exists a standard Brownian Motion process B defined on [0,∞) such that

    (27)  W_n(t) ⇒_w Z(t),  where  Z(t) = {2(G-κ)}^{-1/2} σ t^{-(G-κ)} B(t^{2(G-κ)}).

Proof: From Lemmas 5.1 and 5.2, with a = G - κ - 1/2,

    W_n(t) = -[nt]^{-(G-κ)} Σ_{k≤[nt]} k^{G-κ-1/2} V_k + O(n^{-ε})
           = -[nt]^{-(G-κ)} { σ B([nt]^{2(G-κ-1/2)+1} {2(G-κ-1/2)+1}^{-1}) + O([nt]^{(G-κ-1/2)+1/2-ε'}) } + O(n^{-ε}).

Let

    V_n(t) = -[nt]^{-(G-κ)} σ {2(G-κ)}^{-1/2} B([nt]^{2(G-κ)})
           =_D -[nt]^{-(G-κ)} n^{G-κ} σ {2(G-κ)}^{-1/2} B([nt]^{2(G-κ)} n^{-2(G-κ)})
           ⇒_w t^{-(G-κ)} σ {2(G-κ)}^{-1/2} B(t^{2(G-κ)}) = Z(t)

by the deterministic convergence of [nt]/n and the almost sure continuity of Brownian Motion. Since sup_{t≤T} |W_n(t) - V_n(t)| → 0 a.s., and hence in probability, we get W_n(t) ⇒_w Z(t) by Theorem 4.1 of Billingsley (1968), giving the result of the theorem. □

§5.3 SA - sequential fixed-width confidence interval

As an application of the results of §5.2, in this section we show the asymptotic normality of the stochastic process defined by (18) when indexed by a stopping time. An easy corollary gives a bounded length confidence interval. As argued in §5.2, the algorithm (18) includes both the R-M and K-W cases and allows for weaker dependency assumptions on the errors. This result is a considerable improvement over Sielken (1973), McLeish (1976), and Stroup and Braun (1982). These authors considered a univariate R-M process assuming the errors were martingale differences. Our approach is somewhat different. In §5.2, we demonstrated how the SA process may be strongly approximated by a Gaussian process.
In this section we demonstrate that the weak convergence properties are inherited by certain randomized versions. This is related to the approach advocated by Csorgo and Revesz (1981, Chapter 7). They dealt with the strong approximation of partial sums of weakly dependent random variables by a Gaussian process. They demonstrated that the distribution of the process, when suitably standardized, is unaffected by a random change in time. This result, together with the strong approximation, gives the weak convergence of the randomized partial sums. While starting with strong approximations, we choose to stay with the conventional weak convergence arguments. In the classical case of partial sums of i.i.d. random variables, both approaches achieve weak convergence of the randomized partial sums. Thus, we would not strengthen our results by resorting strictly to strong approximation arguments.

Given below is a result due to Chow and Robbins (1965) which is typical of the results we shall prove.

Theorem (Chow and Robbins)  Let {X_i} be an i.i.d. sequence from a population with unknown mean μ and known finite variance σ². Let z_α denote the (1-α)th quantile of the standard normal distribution. For each d > 0, let N_d = inf{n : n ≥ z²_{α/2} σ²/d²}. Then,

    (a)  lim_{d→0} d² N_d = z²_{α/2} σ²  a.s.,
    (b)  lim_{d→0} d² E N_d = z²_{α/2} σ²  (asymptotic efficiency),
    (c)  lim_{d→0} P(|X̄_{N_d} - μ| ≤ d) = 1-α  (asymptotic consistency).

The result (c) is not all one might hope for. A better result would be P(|X̄_{N_d} - μ| ≤ d) ≥ 1-α, that is, coverage holding for a fixed d. As this result is not available in the i.i.d. case, we shall not move in this direction for the SA ARP estimators. In the above example, we refer to I_{N_d} = [X̄_{N_d} - d, X̄_{N_d} + d] as an interval estimator for μ. The Chow-Robbins result is sometimes referred to as an "absolute accuracy" result. In some cases, one may be concerned with "proportional accuracy". For some p such that 0 < p < 1, define M_p = inf{n : n ≥ z²_{α/2} σ²/(X̄_n² p²)}. Then lim_{p→0} P(|X̄_{M_p} - μ|/μ ≤ p) = 1-α.
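The Chow-Robbins construction, in its simplest known-variance form, can be sketched directly; the normal data, σ = 1 and d = 0.1 below are illustrative choices (with unknown variance, the rule would plug in a running variance estimate instead).

```python
import math
import random

def chow_robbins_n(sigma, d, z=1.96):
    """Known-variance stopping rule: smallest n with n >= z^2 sigma^2 / d^2."""
    return math.ceil(z * z * sigma * sigma / (d * d))

def fixed_width_interval(rng, sigma, d, z=1.96):
    """Sample to N_d and return the 2d-width interval about the sample mean."""
    n = chow_robbins_n(sigma, d, z)
    xbar = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    return xbar - d, xbar + d

rng = random.Random(7)
lo, hi = fixed_width_interval(rng, sigma=1.0, d=0.1)   # z = 1.96: alpha near .05
```

As d shrinks, N_d grows like z²_{α/2} σ²/d², which is exactly properties (a) and (b) above; the stopping rules of this section replace n^{-1/2} by the SA rate n^{-κ} and σ by plug-in estimates.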
Results for a mixing of these two criteria are also available (Nadas, 1969). We shall only be concerned with absolute accuracy. See Sen (1981, Chapter 10) for a more in-depth description of sequential interval estimation and associated stopping times.

We fix the asymptotic coverage probability at 1-α. We will define a sequence of random intervals {I_n} and a stopping random variable N_d so that N_d is the first n such that length(I_n) ≤ 2d and lim_{d→0} P(θ ∈ I_{N_d}) = 1-α. Each of the following two assumptions gives conditions needed for a randomly indexed CLT.

E6.  Let

    (28)  N_n/m_n - N = o_p(1),

where N is a positive random variable, {m_n} are integers going to infinity and {N_n} is a sequence of random variables.

E7.  Let {B_n}, {σ_n} and {G_n} be sequences of random variables such that B_n - B = o(1), σ_n - σ = o(1) and G_n - G = o(1).

With E7 we can define an appropriate stopping rule and a 2d-width confidence interval. Let z_α be the (1-α)th quantile of the standard normal distribution. For d > 0, define

    (29)  N_d = inf{n ≥ 1 : d ≥ z_{α/2} σ_n n^{-κ} {2(G_n-κ)}^{-1/2}},
              = ∞ if no such n exists,

    (30)  I_n = [X_n - n^{-κ}(B_n/(G_n-κ) + z_{α/2} σ_n {2(G_n-κ)}^{-1/2}),
                 X_n - n^{-κ}(B_n/(G_n-κ) - z_{α/2} σ_n {2(G_n-κ)}^{-1/2})].

Remark: We may define the deterministic sequence {n_d}, where n_d = inf{n ≥ 1 : d ≥ z_{α/2} σ n^{-κ} {2(G-κ)}^{-1/2}}. It is immediate under the assumptions of Theorem 5.3 and E7 that lim_{d→0} P(θ ∈ I_{n_d}) = 1-α. Further, from Lemma 3.6 of McLeish (1976), we get lim_{d→0} N_d/n_d = 1 a.s. We also need,

E8.  Let F_a^b = σ(V_a, ..., V_b). Then, for each t ≥ 1, n ≥ 0, A ∈ F_1^t and B ∈ F_{t+n}^∞, |P(AB) - P(A)P(B)| ≤ φ(n), where φ(n) → 0 as n → ∞.

Theorem 5.4  Suppose E1-E6 and E8 are satisfied for the process defined in (18), m = 1 and κ = 1/2 - τ. Then, for the Gaussian process Z(·) defined in Theorem 5.3,

    (31)  [N_n t]^κ (X_{[N_n t]} - θ) - B/(G-κ) ⇒_w Z(t).

Corollary 5.4  Suppose E1-E5, E7 and E8 are satisfied for the process defined in (18), m = 1 and κ = 1/2 - τ. Then, for N_d defined in (29), we have

    lim_{d→0} P(θ ∈ I_{N_d}) = 1-α.
Proof of Corollary: By (29) and E7, we have

    d N_d^κ {2(G-κ)}^{1/2}/σ - z_{α/2} = o_p(1)  as d → 0,

and thus E6 is satisfied. Thus, from Theorem 5.4, with t = 1, we have

    (32)  N_d^κ (X_{N_d} - θ) →_D N(B/(G-κ), σ²/{2(G-κ)})  as d → 0.

The proof of the corollary is two easy steps from (32). □

The proof of Theorem 5.4 is merely an alteration of Billingsley's (1968) Theorem 17.2, given here for the reader's convenience.

Proof of Theorem 5.4: Define φ_n(t) = t N_n/m_n if N_n ≤ m_n, and equal to tN otherwise. By E6, we have φ_n(t) →_p tN = φ(t) for each t > 0. By (27), we have W_{m_n}(t) ⇒_w Z(t). Now, assume

    (33)  (W_{m_n}, φ_n) ⇒_w (Z, φ),

where Z and φ are independent processes. Then,

    W_{N_n}(t) = W_{m_n}(φ_n(t)) I(N_n ≤ m_n) + W_{N_n}(t) I(N_n > m_n) ⇒_w Z(φ(t)).

But, since Z and φ are independent,

    Z(φ(t)) = {2(G-κ)}^{-1/2} σ (tN)^{-(G-κ)} B((tN)^{2(G-κ)})
            =_D {2(G-κ)}^{-1/2} σ (tN)^{-(G-κ)} N^{G-κ} B(t^{2(G-κ)}) =_D Z(t).

Thus, we need only show (33). Fix T > 0. If we show weak convergence on D[0,T], by Stone (1963) this is sufficient for weak convergence on D[0,∞) (cf. Sen, 1981, Theorem 2.3.6, page 24). We now follow Billingsley (1968), and assume N is bounded with probability one. Redefining the sequence {m_n} if necessary, we have 0 < N ≤ K < ∞ a.s. Define

    W_n'(t) = -[nt]^{-(G-κ)} Σ_{ρ_n ≤ k ≤ [nt]} k^{G-κ-1/2} V_k,

where {ρ_n} is a sequence of integers going to infinity such that ρ_n/n → 0. By (27), W_{ρ_n}(t) = O_p(1), and thus

    |W_n(t) - W_n'(t)| = [nt]^{-(G-κ)} |Σ_{k<ρ_n} k^{G-κ-1/2} V_k| + O(n^{-ε})
                       = n^{-(G-κ)} ρ_n^{G-κ} O_p(1) = o_p(1),

and thus,

    (34)  sup_{t≤T} |W_n(t) - W_n'(t)| → 0 in probability.

Define B^m to be the Borel sets of R^m and H_0 the field consisting of the sets of the form {ω : (V_1(ω), V_2(ω), ..., V_m(ω)) ∈ B^m} for m ≥ 1. Let E ∈ H_0 and let A be a Z-continuity set of D. Since ρ_n → ∞, W_n' depends only on {V_k : k ≥ ρ_n}, and we have, for large n,

    P((W_n' ∈ A) ∩ E) → Z(A) P(E)

by (27), (34) and E8, where Z(A) = P(Z ∈ A). By Billingsley (1968), Theorem 4.5, we have (W_n', φ_n) ⇒_w (Z, φ) in the product topology, where φ is independent of Z.
This and (34) give

    (35)  (W_n, φ_n) ⇒_w (Z, φ),

thus proving (33) for N bounded. For N not bounded, we employ a simple truncation device and use Billingsley's Theorem 4.2. This completes the proof. □

§5.4 ARP - sequential fixed-width confidence interval

We now show that the results of §5.2 and §5.3 are sufficiently general to include the ARP problem. All quantities are defined as in §3.1, with the kernel functions used to estimate the density. Recall that A10 is sufficient for A8 and A9. Further, recall the algorithm given in (3.2),

    (36)  φ_{n+1} = φ_n - A n^{-1} M_{g,n}(φ_n)
                 = φ_n - A n^{-1} (g'(φ_n) M(g(φ_n)) + c_n^{-1/2} V_n + Δ_n).

We first show that assumptions E1-E5 hold under the assumptions of Chapters 2 and 3, letting

    (37)  τ = γ/2,  X_n - θ = φ_n - φ,  f(x) = A g'(x) M(g(x)),
          B_n = A n^{(1-γ)/2} Δ_n =: T_n,  B = μ_1,

and identifying the V_n of (18) with A n^τ c_n^{-1/2} V_n. To show E5 we employ a result due to Strassen (cf. Sen, 1981, Theorem 2.5.1, page 34).

Lemma 5.3 (Strassen)  Let {e_n, F_{n+1}} be a martingale difference sequence. Define Y_n = Σ_{k≤n} E_{F_k} e_k², S(Y_n) = Σ_{k≤n} e_k, and define S(t) by linear interpolation. Let h be a nonnegative, nondecreasing function on [0,∞) such that t^{-1} h(t) is nonincreasing. If

    (38)  Y_n → ∞  a.s., and
    (39)  Σ_n E_{F_n} {e_n² I(e_n² > h(Y_n))} (h(Y_n))^{-1} < ∞  a.s.,

then there exists a Brownian Motion process B on [0,∞) such that

    (40)  S(t) - B(t) = o((log t)(t h(t))^{1/4}).

Lemma 5.4  Assume A1, A2, A6, A7, A10, (37) and that p > 2 + 1/r for the process defined in (36). Then there exists a standard Brownian Motion process B on [0,∞) and an ε > 0 such that

    Σ_{k<t} V_k = {σ² (2G-1+γ)}^{1/2} B(t) + O(t^{1/2-ε}).

Proof: We first show that the assumptions of Lemma 5.3 hold. As in Lemma 2.5, {V_n, F_n} is a martingale difference sequence, and Y_n/n → σ² (2G-1+γ) a.s., where σ² is defined in Theorem 3.1. Thus, (38) is true. To show (39), we use the conditional version of Hölder's inequality.
For the p and q in A6, we have

    (h(Y_n))^{-1} E_{F_n}{V_n² I(V_n² > h(Y_n))}
        ≤ (h(Y_n))^{-1} (E_{F_n}|V_n|^p)^{2/p} [P_{F_n}(V_n² > h(Y_n))]^{1/q}
        ≤ (E_{F_n}|V_n|^p)^{2/p} [E_{F_n}|V_n|^p]^{1/q} [h(Y_n)]^{-1-p/(2q)}
        = E_{F_n}|V_n|^p (h(Y_n))^{-p/2}.

In the proof of Theorem 3.1, we showed that c_n^{p/2} E_{F_n}|V_n|^p is bounded a.s. Thus, there exists an M > 0 such that E_{F_n}|V_n|^p ≤ M n^{γp/2} a.s. Let ε > 0, take h(t) = t^{1-ε} and γ = 1/(2r+1). Then (39) converges a.s., since

    (Y_n^{1-ε})^{-p/2} n^{γp/2} = ((Y_n/n)^{1-ε} n^{1-γ-ε})^{-p/2} n^{γp... } = O(1) n^{-rpγ+εp/2}.

Since rpγ > 1 by assumption, we can choose ε sufficiently small so that Σ_n n^{-rpγ+εp/2} converges. Thus, by Lemma 5.3, for some standard Brownian Motion B_0,

    S(Y_n) - B_0(Y_n) = O(Y_n^{1/2-ε}),

by taking h(t) = t^{1-5ε}. Taking n large and using Y_n/n → σ² (2G-1+γ) gives

    B_0(Y_n) = B_0(n σ² (2G-1+γ)) + O(1) = {σ² (2G-1+γ)}^{1/2} B(n) + O(1),

where B(·) is also a standard Brownian Motion. Thus,

    Σ_{k<n} V_k = {σ² (2G-1+γ)}^{1/2} B(n) + O(n^{1/2-ε}).

We get the result for a general t (not necessarily an integer) since B(n+1) - B(n) = O_p(1). □

Lemma 5.5  Assume A1, A2, A6, A7, A10, (37) and that p > 2 + 1/r for the process defined in (36). Then E2-E4 are satisfied with G = A (g'(φ))² M'(g(φ)).

Proof: E4 is satisfied by Theorem 3.1(a). E2 is satisfied by assumptions A1 and A10 (take δ = 1). Lemmas 5.2 and 5.4 satisfy E3. □

Lemma 5.6  Assume A1, A2, A6, A7, A10, A11 for the process defined in (36). Then there exists a ρ > 0 such that n^ρ (B_n - B) = O(1).

Proof: From A1, we get g(x) = g(φ) + O(|x-φ|). From the proof of Theorem 3.1, we get

    (B - B_n)/{A C^r (C_1-C_2)}
        = g(φ) ∫_0^φ S(u)du (F*g)^{(r+1)}(φ) ∫_0^1 y^r/r! k(y)dy
          - g(φ_n) ∫_0^{φ_n} S(u)du · c_n^{-r} {E_{F_n} f_{gn}(φ_n) - (F*g)^{(1)}(φ_n)}

and, from Lemma 3.1(b),

        = g(φ) ∫_0^φ S(u)du (F*g)^{(r+1)}(φ) ∫_0^1 y^r/r! k(y)dy
          - g(φ_n) ∫_0^{φ_n} S(u)du {∫_0^1 y^r/r! k(y) (F*g)^{(r+1)}(θ_n) dy},

where |θ_n - φ_n| ≤ c_n y. Hence

    |B - B_n| ≤ |g(φ) ∫_0^φ S(u)du ∫_0^1 y^r/r! k(y) {(F*g)^{(r+1)}(θ_n) - (F*g)^{(r+1)}(φ)} dy| + O(|φ_n - φ|)
              = O(|θ_n - φ|^d) + O(|φ_n - φ|) = O(n^{-dγ}) + O(n^{-rγ}). □

Theorem 5.5  Assume A1, A2, A6, A7, A10, A11, (37), and p > 2 + 1/r for the process defined in (36). Let μ_1 and σ_1 be defined as in Corollary 3.1. Then there exists a Brownian Motion process B defined on [0,∞) such that

    [nt]^{(1-γ)/2} (φ_{[nt]+1} - φ) - μ_1 ⇒_w z*(t),

where z*(t) = σ_1 t^{-(2G-1+γ)/2} B(t^{2G-1+γ}).

Proof: An immediate application of Theorem 5.3 and Lemmas 5.4-5.6. □

Corollary 5.5  Define G_n, B_n and σ_n to be the F_n-measurable plug-in estimates of G, B and σ_1 constructed from the quantities of Lemmas 3.4 and 3.5, respectively. For each d > 0, define

    N_d = inf{n ≥ 1 : d ≥ z_{α/2} n^{-(1-γ)/2} σ_n},
        = ∞ if no such n exists.

Define the sequential 2d-width confidence interval,

    I_n = [φ_n - n^{-(1-γ)/2}(B_n/(2G_n-1+γ) + z_{α/2} σ_n),
           φ_n - n^{-(1-γ)/2}(B_n/(2G_n-1+γ) - z_{α/2} σ_n)].

Then, under the assumptions of Theorem 5.5,

    lim_{d→0} P(φ ∈ I_{N_d}) = 1-α.

Proof: A slight modification of Lemmas 3.4 and 3.5 gives G_n - G = o(1), σ_n - σ_1 = o(1) and B_n - B = o(1). This is sufficient for E7. We have constructed N_d as in (29), and thus the corollary is true as a special case of Corollary 5.4. □

CHAPTER 6

MONTE CARLO STUDIES

§6.1 Introduction

In this chapter we investigated the finite sample properties of the estimators and of the proposed procedure. The purpose is really two-fold. First, we wanted to verify the usefulness of the asymptotic approximations in finite samples. Second, we wanted to compare the performance of the estimators in finite samples when the parameters are chosen to have various values in the "neighborhood" of the asymptotically optimal values. Further, as discussed below, the performance of the estimators at finite stages is improved by parameters of the procedure that do not appear in the asymptotic theory. These parameters are introduced in §6.3. The following section, §6.2, describes an investigation of the ARP which was made prior to the Monte-Carlo experiment.
The goal of this investigation was to develop systems of computer programs that calculate various deterministic characteristics of the ARP problem (such as the optimal replacement time and the mean lifetime of a unit), given various input parameters (such as the costs and the lifetime distribution). A short description of the Monte-Carlo experiment is given in §6.3. In that section we specify the assumptions of the model and the basis for calculating the expected value of the estimators. The true values of the estimators come from §6.2. This leads to a short discussion of a criterion for comparison of these estimators. The candidate adopted here is a version of the mean square error (MSE). Based on the preceding sections, §6.4 gives a summary of the results of the experiment. Some remarks about the nature of the experiment and the accuracy of the results are made. We describe the tables included in the appendix and follow this description with remarks on the highlights of the data.

§6.2 Preliminary investigation

Given the probability distribution of the lifetimes and the costs associated with the types of failure, the actual calculation of the optimal replacement time φ is straightforward in principle, but requires careful numerical work. This section gives the details of that calculation and of the calculation of other parameters that depend upon φ. The work of this section is not new (cf. Glasser, 1967), but is needed in §6.3.

In all cases, we assumed that lifetimes have a Weibull distribution with input parameters alpha (shape) and lambda (scale). The Weibull is a standard lifetime distribution in survival analysis, having moments that can be calculated given alpha and lambda. Further, if alpha is greater than one, then the Weibull distribution has an increasing failure rate. Even when the transform function g is the identity (e.g., g(t) = t; see §2.4), α > 1 is sufficient to ensure that φ will take on a unique, finite value.
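The computation just described can be sketched as follows. The Weibull parameterization S(t) = exp(-(λt)^α), the cost values, and the first-order condition below are illustrative assumptions: the condition used is the standard age-replacement relation h(t)∫₀ᵗS(u)du - F(t) = C₂/(C₁-C₂) (the document's own M(·) is defined in (2.2) and is not reproduced here), and Simpson's rule stands in for the IMSL quadrature routine DCADRE.

```python
import math

def weibull_S(t, alpha, lam):
    """Weibull survival function S(t) = exp(-(lam*t)**alpha)."""
    return math.exp(-((lam * t) ** alpha))

def optimal_replacement_time(alpha, lam, c1, c2, t0=1.0, tol=1e-10):
    """Newton-Raphson on the standard age-replacement first-order condition
    M(t) = h(t) * integral_0^t S(u) du - F(t) - c2/(c1 - c2) = 0,
    where h is the Weibull hazard and the integral is done by Simpson's rule."""
    def hazard(t):
        return alpha * lam * (lam * t) ** (alpha - 1.0)
    def integral_S(t, m=400):
        h = t / m
        s = weibull_S(0.0, alpha, lam) + weibull_S(t, alpha, lam)
        for i in range(1, m):
            s += (4 if i % 2 else 2) * weibull_S(i * h, alpha, lam)
        return s * h / 3.0
    def M(t):
        return hazard(t) * integral_S(t) - (1.0 - weibull_S(t, alpha, lam)) - c2 / (c1 - c2)
    t = t0
    for _ in range(100):
        dM = (M(t + 1e-6) - M(t - 1e-6)) / 2e-6   # numerical derivative
        step = M(t) / dM
        t -= step
        if abs(step) < tol:
            break
    return t

# Illustrative inputs (not the study's settings): shape 2, scale rate 1,
# failure cost 5, planned-replacement cost 1.
t_star = optimal_replacement_time(2.0, 1.0, 5.0, 1.0)
```

For an increasing failure rate (α > 1), M is increasing in t, so Newton-Raphson from a sensible starting value converges in a handful of iterations, matching the behavior reported below.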
Other input parameters for this phase are the costs, C_1 and C_2, associated with failure and planned replacements, respectively. It can be seen from (0.1) that determining φ depends only on the ratio of the costs. Thus, we took C_2 = 1 in all cases.

All computing was done on an IBM 370/155 operating at the Computation Center of the University of North Carolina at Chapel Hill. The operating system is IBM OS/360 MVT, the programming language is FORTRAN and the compiler, version H. To calculate the definite integral ∫_0^x S(u)du and the variances of the Weibull distribution, the IMSL routine DCADRE was used. The IMSL routine GGWIB produced the random deviates. All calculations were done in double precision. DCADRE was also used with SAS/GRAPH to produce plots of the function R_1(·) that we wanted to minimize. These plots are given in Appendix A.

The most important output variable of the program was φ. This was calculated by using a Newton-Raphson routine to find the root of M(x) (see (2.2) for the definition of M(·)). Note that at each stage of the iteration, it was required to evaluate a definite integral via DCADRE. With a good initial guess (one), the algorithm converged very quickly (in three iterations). Having calculated φ, the calculation of other important variables was straightforward. These results are summarized in Table A1.

§6.3 Description of the simulation

The asymptotic theory of Chapters 2 and 3 deals with a wide variety of SA procedures for solving the sequential ARP problem. To make the Monte-Carlo study tractable, we first restricted ourselves to a certain subset of those procedures,
Thus, it was felt that not much would be gained by allowing the mean and variance to vary, so we fixed λ = 2 and α = 2.2. This gives a mean of 1.77125 and a standard deviation of .8499 for the lifetime of the units. The transformation function g allowed us to look at a broader class of distribution functions and to use unconstrained SA. Thus, in doing the Monte-Carlo work we felt compelled to take g other than the identity function. However, this transformation will not affect the speed of the convergence (see Theorem 2.2 and §3.2). To keep the amount of simulation feasible, all simulations used the function g(t) = log{1+exp(t)}. The kernel function used to estimate the density was the simplest one presented in Chapter 2, the indicator function. While it would be interesting to investigate the effects of different kernels (or other types of density estimators) with different parameters of the procedure, we limit ourselves here to investigating the effect on the procedure of different values of the parameters for a fixed kernel. Choice of a particular density estimator has been investigated by other researchers (cf. Wegman, 1972, for some interesting numerical comparisons of probability density estimators). Because of our interest in the parameters, and since we did not necessarily use the procedure with the best rate of convergence (we do not take advantage of the smoothness of the underlying d.f. with a better choice of the kernel), no Monte-Carlo work was done on the adaptive procedures of §3.3. Having mentioned all of these restrictions, the reader may wonder what variables were actually investigated. Recall the equations:

    φ_{n+1} = φ_n - a_n M_{g,n}(φ_n),
    a_n = A n^{-1},
    c_n = C n^{-.2},
    S_in(t) = I{Z_in > t},   i = 1, 2,
    f_g1n(t) = I{g(t-c_n) ≤ Z_1n ≤ g(t+c_n)}/(2c_n),
    M_{g,n}(t) = (C1-C2) f_g1n(t) ∫_0^{g(t)} S_2n(u)du - g'(t) S_1n(g(t)){C1 - (C1-C2) S_2n(g(t))}.

We certainly were interested in the effect on the procedure of different values of A, C, and φ_1, the starting value of the sequence.
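On average M_{g,n}(t) points away from the optimum, which is what drives the recursion. A small simulation makes this visible; it rests on two assumptions about the sampling scheme (our reading of the definitions above, not stated verbatim in the text): both units are observed until failure or planned replacement at g(t+c_n), and the density-estimate indicator is taken strict at the upper endpoint so that planned replacements are not counted as failures:

```python
import math, random

ALPHA, LAM, C1, C2 = 2.2, 2.0, 5.0, 1.0
g  = lambda t: math.log1p(math.exp(t))          # transform function
gp = lambda t: 1.0 / (1.0 + math.exp(-t))       # its derivative

def draw():                                      # Weibull lifetime by inversion
    return LAM * (-math.log(random.random())) ** (1.0 / ALPHA)

def m_gn(t, c):
    """One realization of M_{g,n}(t) from two fresh units."""
    hi = g(t + c)
    z1, z2 = min(draw(), hi), min(draw(), hi)
    fhat = (g(t - c) <= z1 < hi) / (2.0 * c)     # histogram density estimate
    s1, s2 = float(z1 > g(t)), float(z2 > g(t))
    # min(z2, g(t)) is the observed value of the integral of S_2n on [0, g(t)]
    return (C1 - C2) * fhat * min(z2, g(t)) - gp(t) * s1 * (C1 - (C1 - C2) * s2)

def mean_m(t, c=0.3, reps=20000):
    return sum(m_gn(t, c) for _ in range(reps)) / reps

random.seed(3)
below, above = mean_m(0.3), mean_m(0.8)          # φ ≈ .533 lies between these points
```

The averaged statistic is negative below φ and positive above it, so the step φ_{n+1} = φ_n - a_n M_{g,n}(φ_n) moves toward the optimum.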
Further, we found that the value of φ_n oscillated wildly in the early stages of the procedure (when n is small). Dvoretsky (1956), while investigating the R-M case, replaced a_n by A(n+k_A)^{-1}, with a positive k_A. For a very simple example, he showed how this form was optimal in a minimax sense (a finite sample result, uncommon in SA theory). The optimal choice of k_A is related to the variance of the units and the starting value of the procedure. In this Monte-Carlo study, we used a_n = A(n+k_A)^{-1} and c_n = C(n+k_C)^{-.2}, and investigated possible values of k_A and k_C. We note here that these replacements do not alter the asymptotic distribution (see Theorem 3.3). To choose the best possible A, C, φ_1, k_A and k_C, we used as our criterion the performance of the resulting estimator φ_n. For notational convenience, let φ_n be used for φ_n(A, C, φ_1, k_A, k_C). The estimators were judged based on their resulting biases and mean square errors, computed as follows. Each simulation (for a fixed A, C, φ_1, k_A and k_C) was based on 1000 independent Monte-Carlo trials. Denoting by φ_{i,n} the estimator of φ at the nth stage on the ith trial, we used

    φ̄_n = (.001) Σ_{i=1}^{1000} φ_{i,n}

as our estimator of the expected value of φ_n. Thus, for the bias at the nth stage, we used

    BIAS_n = φ̄_n - φ.

Similarly, for the mean square error, we used

    MSE_n = (.001) Σ_{i=1}^{1000} (φ_{i,n} - φ)².

The asymptotic theory (Theorem 2.2) indicates that MSE_n = O(n^{-.8}). However, on examining a standardized version of the mean square error, SMSE_n = n^{.8} MSE_n, we found this to be unstable. It turned out that the estimator is highly sensitive to the choice of k_A. Heuristically, in replacing A n^{-1} with A(n+k_A)^{-1}, the procedure believes it is at the (n+k_A)th stage when only n iterations have been performed. We thus used an adjusted standardized mean square error,

    ASMSE_n = (n+k_A)^{.8} MSE_n,

as our criterion for comparing different estimators.
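The shifted sequences and the ASMSE criterion can be exercised end to end in a small simulation. This is a sketch, not the thesis program: 200 trials rather than 1000, and the same sampling-scheme assumptions as before (two units per stage, both replaced at g(φ_n + c_n), strict inequality at the censoring point):

```python
import math, random

ALPHA, LAM, C1, C2 = 2.2, 2.0, 5.0, 1.0
A, CPAR, KA, KC = 2.3, 1.5, 50, 50          # parameter choices of Tables A1/B1
PHI = 0.53349                                # deterministic optimum from Table A1

g  = lambda t: math.log1p(math.exp(t))
gp = lambda t: 1.0 / (1.0 + math.exp(-t))

def draw():
    return LAM * (-math.log(random.random())) ** (1.0 / ALPHA)

def run(n_stages, phi1=1.0):
    phi = phi1
    for n in range(1, n_stages + 1):
        a = A / (n + KA)                     # a_n = A(n+k_A)^{-1}
        c = CPAR / (n + KC) ** 0.2           # c_n = C(n+k_C)^{-.2}
        hi = g(phi + c)
        z1, z2 = min(draw(), hi), min(draw(), hi)
        fhat = (g(phi - c) <= z1 < hi) / (2.0 * c)
        s1, s2 = float(z1 > g(phi)), float(z2 > g(phi))
        m = (C1 - C2) * fhat * min(z2, g(phi)) - gp(phi) * s1 * (C1 - (C1 - C2) * s2)
        phi -= a * m
    return phi

random.seed(4)
N_TRIALS, N_STAGES = 200, 250
trials = [run(N_STAGES) for _ in range(N_TRIALS)]
bias_n  = sum(trials) / N_TRIALS - PHI
mse_n   = sum((t - PHI) ** 2 for t in trials) / N_TRIALS
asmse_n = (N_STAGES + KA) ** 0.8 * mse_n     # adjusted standardized MSE
```

With these parameter values the bias and ASMSE land in the same general range as the Table B1 baseline row, though the exact figures depend on the sampling-scheme assumptions noted above.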
For values of n = 10, 50, 250, we found this criterion to be very satisfactory. While we were very interested in the rates of convergence of our estimators, from a practical standpoint the effect of the procedure on the actual cost is even more important. Theorem 1.1 guarantees that the optimal cost is obtained asymptotically, so we were interested in how useful this result is in finite samples. Denote by X_ijk our ith sample (i=1,2) at the jth stage (j=1,...,250) from the kth trial (k=1,...,1000). With φ_jk as before, define

    Z_ijk = min(X_ijk, g(φ_jk + c_j))   and   b_ijk = I(Z_ijk < g(φ_jk + c_j)).

For the kth trial, the actual sample cost per unit time at the nth stage is

    SC_nk = Σ_{j=1}^n {C1(b_1jk + b_2jk) + C2(2 - b_1jk - b_2jk)} / Σ_{j=1}^n (Z_1jk + Z_2jk).

The mean sample cost per unit time at the nth stage is

    MSC_n = (.001) Σ_{k=1}^{1000} SC_nk.

§6.4 Summary of results

We found the results of the Monte-Carlo experiment to be very satisfactory, especially given the highly variable nature of the estimators we used. This variability is due to the estimator of the density, which appears explicitly in M(·). The magnitude of the variability can be illustrated by the following example. Suppose that Corollary 2.2 is true not only asymptotically but also for finite n, i.e., that n^{2/5}(φ_n - φ) ~ N(BIAS, VAR), with VAR as computed in Table A1 (VAR = .80551). Ignoring the BIAS term for the moment, a 100(1-α)% confidence interval for φ of length 2d requires that n^{2/5} > (z_{α/2} VAR^{1/2})/d, where z_α is the (1-α)th quantile of the normal distribution. Assume we want a 95% confidence interval for φ with error of no more than .05. Using the optimal values of A and C and the values given in Table A1, we have n^{2/5} > (1.96(.80551)^{1/2})/(.05), so that n ≥ 7,342. Even in this simple case, a large sample size is required for moderately precise results.
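The sample-size bound above follows directly from the normal approximation, and the arithmetic can be checked in two lines (the inputs are the values quoted from Table A1):

```python
import math

z, var, d = 1.96, 0.80551, 0.05    # normal quantile, VAR from Table A1, half-width d
# solve n^{2/5} > z * sqrt(VAR) / d for the smallest integer n
n_req = math.ceil((z * math.sqrt(var) / d) ** 2.5)
```

This returns 7,342, the figure quoted in the text.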
We now describe the tables of Appendix B, where the raw output of the Monte-Carlo study can be found, and then give the highlights of the findings of that study. Each table gives the bias (BIAS_n), mean square error (MSE_n), adjusted standardized mean square error (ASMSE_n) and mean sample cost (MSC_n) for various values of A, C, φ_1, k_A and k_C. These quantities were defined in §6.3. We chose stages n = 10, 50, 250 to reflect what we considered to be small, moderate and large samples. Table B1 begins with the optimal values of A and C, as given in Table A1, and a relatively close starting value, φ_1 = 1. The performance for selected values of k_A and k_C is then presented. The best results seem to occur at k_A = k_C = 50 (although k_A = k_C = 25 is close), so we retained these values for some of the subsequent investigations. Tables B2, B3, and B4 display the results of using different choices of A, C and φ_1, respectively. We were also interested in the effect of starting very far from the optimal value on the best choice of k_A and k_C. Tables B5 and B6 display the results for high and low values of φ_1, respectively. In general, the data from the Monte-Carlo study are very satisfactory. In all of the intermediate range trials, the bias and mean cost decrease with the stage of the experiment. The adjusted standardized mean square error (ASMSE) is relatively constant, but does decrease slightly with n. This change is more drastic when k_A is small than when it is large, since n+k_A then changes proportionally more with n. This is due to the extra sampling variability at the earlier stages of the experiment. At n = 250, the ASMSE's are approximately 15% higher than the standardized mean square errors (SMSE). Adjusted in this way, the mean square errors appear to be in the same area as the theoretical MSE = .8161 given in Table A1 (recall that the SMSE's were adjusted to make comparisons between stages).
Also, for most intermediate range experiments, the algorithm initially homed in on the optimal value quickly and then moved more slowly as the stochastic portion dominated. The results of the study indicate that the performance of the algorithm was greatly enhanced by introducing the parameters k_A and k_C. From Table B1, we see that k_A is the dominant parameter. The introduction of k_C improved the behavior of the estimators slightly (in terms of the ASMSE of the estimators), but a dramatic improvement in the performance of φ_n was caused by the introduction of k_A. We also note that by using too large a k_A and k_C the rate of convergence of the algorithm is slowed down considerably. The algorithm varied with the parameters A and C as expected. Table B2 shows that too large a value of A caused large oscillations in the early stages, which calmed down in later stages when the asymptotics took over. In Table B3, we see that too large a value of C means that the bandwidth of the density estimator is too wide, even by stage 250. For intermediate ranges, the performance was relatively insensitive to different values of A and C, performing best near the theoretical optimal values. For small values of A, the procedure was noticeably worse. Recall that to achieve convergence in distribution we required A > (1-γ)/(2(g'(φ))² M'(g(φ))) = .5834. For small values of C, the performance was not as good, but we did not notice the dramatic shift as with A. In Table B4 we investigated how sensitive the procedure is to the starting value. Starting far away, either high or low, we notice the usual pattern of high ASMSE's at early stages which decrease as n increases. The magnitude of the ASMSE increases as φ_1 moves away from φ. Recall that we allowed starting from a negative φ_1 due to the g function. When we started very close to the optimal value, the ASMSE was small in the early stages and then increased to the level of the other experiments.
This is because of the very small bias terms in the earlier stages, the stochastic portion eventually becoming dominant. Tables B5 and B6 investigate the effect on k_A and k_C of starting very high (φ_1 = 2.5) and very low (φ_1 = -1) with respect to φ (= .533). When starting high, reducing k_A and k_C by one half gave the best results; the ASMSE's crept back up when k_A = k_C = 10. Conversely, for a low starting value, there was improvement for k_A = k_C = 25, but we did even better when we let k_A = k_C = 10. This is in line with Dvoretsky's (1956) result that the best choice of k_A depends not only on the variance of the observations but also on the starting value. Convergence to the vicinity of the optimal replacement time φ requires a large number of experiments relative to other Monte-Carlo trials. This is to be expected, as the function itself is flat around φ. While this may be a dismal prospect to the practitioner, there is a bright note. Recall that the long run cost for a failure replacement policy is C1/μ = 5/1.77125 = 2.823. In virtually every experiment in Tables B1-B6 we achieved a lower mean cost by the 10th stage (the exceptions being when k_A = 0 and φ_1 = 3.5). These results are especially significant since the best mean cost we could hope for is 1.90386 (see Table A1). Thus, we have achieved considerable cost reductions even for very small samples, an important practical consideration when deciding whether or not to use an SA age replacement policy.

TABLE A1 - SUMMARY OF DETERMINISTIC CALCULATIONS

ASSUMPTIONS
    Distribution function F is Weibull with α (location) = 2.2 and λ (scale) = 2.
    Failure cost C1 = 5 and replacement cost C2 = 1.
    Transform function g(t) = log{1+exp(t)}.
    SA algorithm parameters A = 2.3 and C = 1.5.
OUTPUT VARIABLES
    φ = .53349
    ∫_0^{g(φ)} S(u)du = .53760
    R1(g(φ)) = 1.90386
    (g'(φ))² M'(g(φ)) = .68559
    (F∘g)^(1)(φ) = .12124
    VAR = .80551
    MEAN = .10302
    MSE = VAR + MEAN² = .8161

[Figure: ARP cost function. Weibull model with location parameter 2.2 and scale parameter 2.0; cost plotted against the value of the age replacement time, 0.0 to 2.0.]

[Figure: ARP cost function transformed by the g function. Weibull model with location parameter 2.2 and scale parameter 2.0; cost plotted against the value of the age replacement time, 0.0 to 2.0.]

TABLE B1 - PERFORMANCE OF ESTIMATORS  (A = 2.3, C = 1.5, φ_1 = 1.0)

  k_A   k_C              Stage of Algorithm
                         10        50        250
   50    50   BIAS_n     .3544     .1284     .0113
              MSE_n      .2016     .1091     .0377
              ASMSE_n    5.335     4.344     3.617
              MSC_n      2.268     2.159     2.053

    0     0   BIAS_n    -1.205    -1.248    -1.230
              MSE_n      40.03     39.24     38.68
              ASMSE_n    252.6     897.3     3204.
              MSC_n      4.767     13.09     48.47

    0    50   BIAS_n    -1.109    -1.112    -1.069
              MSE_n      34.89     34.16     33.56
              ASMSE_n    220.1     781.1     2781.
              MSC_n      4.900     13.18     43.44

   50     0   BIAS_n     .3896     .1659     .0247
              MSE_n      .2030     .1082     .0374
              ASMSE_n    5.371     4.307     3.584
              MSC_n      2.472     2.261     2.091

   25    25   BIAS_n     .2836     .0486    -.0130
              MSE_n      .2661     .1385     .0413
              ASMSE_n    4.574     4.380     3.693
              MSC_n      2.291     2.158     2.049

 1000  1000   BIAS_n     .4576     .4254     .3118
              MSE_n      .2099     .1833     .1038
              ASMSE_n    53.15     47.88     31.17
              MSC_n      2.155     2.111     2.077

TABLE B2 - PERFORMANCE OF ESTIMATORS  (C = 1.5, φ_1 = 1.0, k_A = k_C = 50)

    A                    Stage of Algorithm
                         10        50        250
  2.3   BIAS_n           .3544     .1284     .0113
        MSE_n            .2016     .1091     .0377
        ASMSE_n          5.335     4.344     3.617
        MSC_n            2.268     2.159     2.053

   .5   BIAS_n           .4389     .3617     .2466
        MSE_n            .1974     .1418     .0721
        ASMSE_n          5.222     5.645     6.910
        MSC_n            2.285     2.208     2.124

  2.0   BIAS_n           .3658     .1584     .0289
        MSE_n            .1940     .1045     .0359
        ASMSE_n          5.134     4.162     3.440
        MSC_n            2.269     2.161     2.059

  3.0   BIAS_n           .3267     .0831    -.0128
        MSE_n            .2239     .1286     .0467
        ASMSE_n          5.922     5.121     4.480
        MSC_n            2.262     2.145     2.045

  5.0   BIAS_n           .2506    -.0358    -.0324
        MSE_n            .3323     .1775     .0752
        ASMSE_n          8.790     7.066     7.208
        MSC_n            2.255     2.128     2.038

 10.0   BIAS_n           .1180    -.0934    -.0501
        MSE_n            .7349     .5089     .1362
        ASMSE_n          19.44     20.26     13.06
        MSC_n            2.240     2.128     2.054

TABLE B3 - PERFORMANCE OF ESTIMATORS  (A = 2.3, φ_1 = 1.0, k_A = k_C = 50)

    C                    Stage of Algorithm
                         10        50        250
  1.5   BIAS_n           .3544     .1284     .0113
        MSE_n            .2016     .1091     .0377
        ASMSE_n          5.335     4.344     3.617
        MSC_n            2.268     2.159     2.053

   .5   BIAS_n           .3239     .0741    -.0477
        MSE_n            .3154     .2494     .0992
        ASMSE_n          8.343     9.928     9.505
        MSC_n            2.102     2.037     1.983

  1.0   BIAS_n           .3365     .1250    -.0057
        MSE_n            .2260     .1495     .0520
        ASMSE_n          5.979     5.953     4.986
        MSC_n            2.172     2.078     2.002

  2.0   BIAS_n           .3683     .1609     .0348
        MSE_n            .1959     .1046     .0357
        ASMSE_n          5.181     4.165     3.424
        MSC_n            2.367     2.253     2.126

  3.0   BIAS_n           .4138     .2577     .1036
        MSE_n            .2188     .1407     .0455
        ASMSE_n          5.788     5.600     4.364
        MSC_n            2.562     2.459     2.306

  5.0   BIAS_n           .5325     .6000     .5070
        MSE_n            .3253     .4521     .3189
        ASMSE_n          8.607     18.00     30.57
        MSC_n            2.798     2.747     2.673

TABLE B5 - PERFORMANCE OF ESTIMATORS  (A = 2.3, C = 1.5, φ_1 = 2.5)

  k_A   k_C              Stage of Algorithm
                         10        50        250
   50    50   BIAS_n     1.582     .7228     .1177
              MSE_n      2.640     .7360     .0649
              ASMSE_n    69.84     29.30     6.218
              MSC_n      2.745     2.522     2.210

   50    25   BIAS_n     1.585     .7246     .1196
              MSE_n      2.627     .7147     .0634
              ASMSE_n    69.50     28.45     6.078
              MSC_n      2.763     2.542     2.221

   25    50   BIAS_n     1.320     .4422     .0450
              MSE_n      2.137     .5244     .0511
              ASMSE_n    36.74     16.59     4.567
              MSC_n      2.694     2.419     2.148

   25    25   BIAS_n     1.320     .4489     .0446
              MSE_n      2.100     .5046     .0508
              ASMSE_n    36.10     15.96     4.540
              MSC_n      2.716     2.437     2.158

   10    50   BIAS_n     .9191     .2402     .0180
              MSE_n      1.871     .5454     .0925
              ASMSE_n    20.55     14.43     7.905
              MSC_n      2.593     2.314     2.105

   10    10   BIAS_n     .9539     .2250     .0192
              MSE_n      1.768     .4644     .0871
              ASMSE_n    19.43     12.29     7.449
              MSC_n      2.650     2.359     2.118

TABLE B6 - PERFORMANCE OF ESTIMATORS  (A = 2.3, C = 1.5, φ_1 = -1.0)

  k_A   k_C              Stage of Algorithm
                         10        50        250
   50    50   BIAS_n    -1.434    -1.132    -.4743
              MSE_n      2.058     1.289     .2473
              ASMSE_n    54.44     51.3      23.71
              MSC_n      2.256     2.136     1.970

   50    25   BIAS_n    -1.435    -1.136    -.4810
              MSE_n      2.061     1.298     .2524
              ASMSE_n    54.51     51.67     24.20
              MSC_n      2.184     2.098     1.963

   25    50   BIAS_n    -1.349    -.8867    -.2604
              MSE_n      1.824     .8129     .1020
              ASMSE_n    31.35     25.71     9.121
              MSC_n      2.229     2.049     1.947

   25    25   BIAS_n    -1.350    -.8921    -.2623
              MSE_n      1.828     .8208     .1034
              ASMSE_n    31.43     25.96     9.269
              MSC_n      2.162     2.019     1.943

   10    50   BIAS_n    -1.149    -.5442    -.1232
              MSE_n      1.354     .3901     .0586
              ASMSE_n    14.87     10.32     5.010
              MSC_n      2.167     1.988     1.960

   10    10   BIAS_n    -1.157    -.5542    -.1265
              MSE_n      1.366     .3916     .0585
              ASMSE_n    15.01     10.36     5.000
              MSC_n      2.042     1.953     1.959

APPENDIX C - SOME ALTERNATIVE MODELS

§1 Random replacement model

An early modification of the basic ARP model considers φ as a random variable rather than a fixed but unknown constant, making [0,φ) a "random replacement interval." While this model is more general, no savings are made and much simplicity is lost. As before, let X_1,...,X_n be a random sample having d.f. F, and now suppose that φ is an independent r.v. with d.f. G (assume G is left-continuous). Then Z_i = min(X_i, φ) has distribution function 1 - S(1-G), since

    P(Z ≥ t) = P(X ≥ t, φ ≥ t) = S(t)(1 - G(t)).

Thus,

    lim E[C(t)/t] = {C1 P(Z < φ) + C2 P(Z ≥ φ)}/E[Z]
                  = {C1 ∫_0^∞ F(u)dG(u) + C2 ∫_0^∞ G(u)dF(u)} / {∫_0^∞ (1-G(u))S(u)du}
                  = {∫_0^∞ P(u)dG(u)} / {∫_0^∞ Q(u)dG(u)},

where P(t) = C1 F(t) + C2 S(t) and Q(t) = ∫_0^t S(u)du. Since R1(t) = P(t)/Q(t), we assume there exist sufficient conditions such that R1(t) may be uniquely optimized. Let φ be such a minimization point. Then

    lim E[C(t)/t] = {∫_0^∞ P(u)dG(u)} / {∫_0^∞ Q(u)dG(u)} ≥ P(φ)/Q(φ),

the expected cost under nonrandom replacement. Thus, in this broad class of problems, we see there is little to be gained by considering a random replacement interval. See Barlow and Proschan (1965) for more details.

§2 Discounted ARP

An interesting and useful modification of the replacement problem is the introduction of the time value of costs into the problem. Consider the ordinary age replacement policy with planned (failure) replacement of a unit with associated cost C2 (C1).
Under the discount ARP model, the objective is to find a fixed but unknown φ such that the expected cost of the process is minimized, where all costs are discounted back to (a fixed but arbitrary) time zero. No work seems to exist in the literature for the random replacement model with a discount feature. Under the discount model, let d = log(1+r) be the discount factor (r the prevailing interest rate), t_i = Z_1 + ... + Z_i the time until the ith replacement, and the cost of the ith unit C(Z_i) (= C2 if planned, C1 otherwise). Thus the discounted cost of the process is a random variable Y, where

    Y = Σ_{i=1}^∞ C(Z_i) exp{-d t_i},

and we wish to find φ to minimize the expected value of Y. Computing E[Y], we get

    E[Y] = Σ_{i=1}^∞ E[C(Z_i) exp(-d Z_i)] (E[exp(-dZ)])^{i-1}
         = E[C(Z) exp(-dZ)] / {1 - E[exp(-dZ)]}.

Now,

    E[C(Z) exp(-dZ)] = C1 ∫_0^φ exp(-du)dF(u) + C2 exp(-dφ)S(φ),

and, via integration by parts, we get

    ∫_0^φ exp(-du) S(u)du = {1 - E[exp(-dZ)]}/d.

Hence,

    R2(φ) = {C1 ∫_0^φ exp(-du)dF(u) + C2 exp(-dφ)S(φ)} / {d ∫_0^φ exp(-du)S(u)du}.    (1)

In 1966, Fox gave sufficient conditions under which a finite unique optimal interval exists. His result is that, if the failure rate r(·) exists, is continuous and strictly increasing to ∞, then a unique, finite φ exists satisfying

    (C1-C2) r(φ)/d - C2 = R2(d, φ).    (2)

This was strengthened by Ran and Rosenlund through introducing maintenance costs into the model. Fox also showed that as the discount rate approaches zero, we return to the long-run expected cost per unit time ("classical") ARP model. Hence, with small prevailing interest rates, the classical ARP model is approximately the same as the discount ARP model. Explicitly denoting the dependence of R2(t) on d by R2(d,t), Fox showed that:

    (3)  for all t > 0,  lim_{d→0+} d R2(d,t) = R1(t);
    (4)  lim_{d→0+} d R2(d,φ_d) = R1(φ);

and, if r(·) is continuous and strictly increasing to ∞, then

    (5)  lim_{d→0+} φ_d = φ.

As with the usual ARP, it has been shown that for a fixed d the optimal replacement time φ_d cannot occur in an interval where the failure rate is decreasing (Denardo and Fox, 1967).

§3 Discounted ARP with explicit cost function

The introduction of a cost intensity or maintenance costs makes for a more realistic if less wieldy model. Scheaffer (1971) first introduced the notion of explicitly accounting for a cost related to the age of the unit. Three typical reasons cited for why there may be a cost associated with a unit are: (a) adjustments that need to be made, (b) the unit may perform less efficiently as it ages, and (c) replacement costs may increase due to depreciation or wear.
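The renewal identity of §2, E[Y] = E[C(Z)exp(-dZ)]/{1 - E[exp(-dZ)]}, can be checked by simulation. The replacement age PHI = 1.0 and discount factor D = .1 below are illustrative choices, not values from the text:

```python
import math, random

ALPHA, LAM = 2.2, 2.0        # Weibull lifetime parameters from Chapter 6
C1, C2 = 5.0, 1.0            # failure and planned replacement costs
PHI, D = 1.0, 0.1            # replacement age and discount factor (illustrative)

def draw():                  # Weibull deviate by inversion
    return LAM * (-math.log(random.random())) ** (1.0 / ALPHA)

def cycle():                 # one replacement cycle: (cost incurred, cycle length Z)
    x = draw()
    return (C1, x) if x < PHI else (C2, PHI)

random.seed(2)

# right-hand side: per-cycle expectations estimated from independent cycles
N = 100_000
num = den = 0.0
for _ in range(N):
    cost, z = cycle()
    num += cost * math.exp(-D * z)
    den += math.exp(-D * z)
ey_formula = (num / N) / (1.0 - den / N)

# left-hand side: direct simulation of Y = sum_i C(Z_i) exp(-d t_i)
def total_cost():
    t = y = 0.0
    while D * t < 20.0:      # exp(-20) is negligible, so truncate the series there
        cost, z = cycle()
        t += z
        y += cost * math.exp(-D * t)
    return y

ey_sim = sum(total_cost() for _ in range(10_000)) / 10_000
```

The two estimates agree to within Monte-Carlo error, which is the content of the geometric-series step in the derivation.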
As with the usual ARP, it has been shown that for a fixed d the optimal replacement time ~d' cannot occur in an interval where the failure rate is decreasing (Denardo and Fox, 1967). §3 Discounted ARP with explicit cost function The introduction of a cost intensity or maintenance costs make for a more realistic if less wieldy model. Scheaffer (1971) first introduced the notion of explicitly accounting for a cost related to the age of the unit. Three typical reasons cited why there may be a cost associated with a unit are: (a) the adjustments that need to be made, (b) the unit may perform less efficiently as it ages, and (c) replacement costs may increase due to depreciation or wear. Page 143 One could consider a policy for replacing gasoline engines whose cost of operation increases with time due to increased gas and oil consumption. Another example is rubber tires, whose salvage value will decrease as wear increases. Scheaffer assumed an increasing cost factor (increasing with the age of the unit) and sought to minimize the long-run cost per unit time, i.e., the classical ARP objective function. For explicit cost functions he showed how to derive the optimal time interval and also noted that, along the lines of Fox, a random replacement policy is superfluous. By properly choosing the cost factor, one may broaden the class of life distributions considered to include the (negative) exponential distribution. In his examples, Scheaffer considered the exponential life distribution whose ARP is a failure replacement policy. Cleroux and Hanscom (1974) gave a reasonable model where the costs were not necessarily increasing (arbitrary) and when the costs occurred at fixed, equal length times of a unit's life. This was generalized by Ran and Rosenlund (1976) who considered any continuous cost intensity function. Of course, a sequence of continuous cost functions may be used to approximate the discrete ones of Cleroux and Hanscom arbitrarily close. 
While Cleroux and Hanscom used the usual ARP objective function, Ran and Rosenlund minimized expected discounted costs. Letting g(t) be our continuous cost intensity function, we can define A(t) = ∫_0^t exp(-du)S(u)du, the expected discounted usage per unit, and Q(t) = ∫_0^t exp(-du)S(u)g(u)du, the expected cost due to the cost intensity. As before, the discounted cost can be represented via a random variable Y, and Ran and Rosenlund showed that

    E[Y] = R3(φ) = R2(φ) + Q(φ)/(d A(φ)),

where R2(φ) is defined in (1). Ran and Rosenlund extended a proof first given by Cleroux and Hanscom to give sufficient conditions under which a finite optimal interval exists. If there exists φ* < ∞ such that

    {(C1-C2)r(φ) + g(φ)}/d - C2 > min{R3(t): 0 < t ≤ φ*}  for all φ > φ*,

then φ_d ≤ φ*. The immediate corollary is that if either r(t) or g(t) goes to ∞ as t goes to ∞, then φ_d exists and is finite, which strengthens an earlier result of Fox. For finding minima (local or otherwise), the equation corresponding to (2) is

    {(C1-C2)r(t) + g(t)}/d - C2 = R3(t).    (6)

§4 Block replacement policies

Any review of age replacement policies, however sketchy, would be incomplete without mentioning its major competitor, block replacement policies. Under block replacement, a stochastically failing unit is replaced at failure with cost C1, or at times φ, 2φ, 3φ, ... with cost C2, where 0 < C2 < C1. To optimize this policy, again the choice of φ depends on C1, C2, and the failure distribution of the units. This policy has obvious intuitive appeal when several units are on test simultaneously, e.g., the lightbulb replacement problem. The choice of policy (or their many variants) depends on the physical situation, and guidelines for choosing have been discussed by many authors (cf. Barlow and Proschan, 1964, and Gertsbakh, 1977, pp. 99-103).

BIBLIOGRAPHY

Abdelhamid, S.N. (1973). Transformation of observations in stochastic approximation. Ann. Statist. 1, 1158-1174.

Albert, A.E. and Gardner, L.A.
(1967). Stochastic Approximation and Non-Linear Regression. M.I.T. Press, Cambridge, Mass.

Anbar, D. (1973). On optimal estimation methods using stochastic approximation procedure. Ann. Statist. 1, 1175-1184.

Anbar, D. (1976i). An application of a theorem of Robbins and Siegmund. Ann. Statist. 4, 1018-1021.

Anbar, D. (1976ii). An asymptotically optimal inspection policy. Naval Res. Logist. Quart. 23, 211-218.

Anbar, D. (1978). A stochastic Newton-Raphson method. J. Statist. Plan. Inf. 2, 153-163.

Aoki, M. (1977). Optimal Control and Systems Theory in Dynamic Economic Analysis. North-Holland Publishing Co., New York.

Arunkumar, S. (1972). Nonparametric age replacement policy. Sankhya Ser. A 34, 251-256.

Aven, T. (1982). Optimal replacement times - a general set-up. Statist. Res. Report 1, 1982. Univ. of Oslo, Norway.

Barlow, R.E. and Proschan, F. (1964). Comparison of replacement policies and renewal theory implications. Ann. Math. Statist. 35, 577-589.

Barlow, R.E. and Proschan, F. (1965). Mathematical Theory of Reliability. Wiley, New York.

Bather, J.A. (1977). On the sequential construction of an optimal age replacement policy. Bulletin of the International Statistical Institute 47, 253-266.

Beichelt, F. (1981). Replacement policies based on system age and maintenance cost limits. Math. Operationsforsch. Statist., Ser. Statistics 12, 621-627.

Berg, M. (1976). A proof of optimality for age replacement policies. Journal of Applied Probability 13, 751-759.

Bergman, B. (1979). On age replacement and the total time on test concept. Scandinavian Journal of Statistics 6, 161-168.

Bhattacharya, P.K. (1967). Estimation of a probability density function and its derivatives. Sankhya Ser. A 29, 373-382.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Blum, J.R. (1954i). Approximation methods which converge with probability one. Ann. Math. Statist. 25, 382-386.

Blum, J.R. (1954ii). Multidimensional stochastic approximation methods. Ann. Math.
Statist. 25, 737-744.

Breslow, N. and Crowley, J. (1974). A large sample study of the life table and product limit estimates under random censorship. Ann. Statist. 2, 437-453.

Burkholder, D.L. (1956). On a class of stochastic approximation processes. Ann. Math. Statist. 27, 1044-1059.

Chernoff, H. (1964). Estimation of the mode. Ann. Inst. Statist. Math. 16, 31-41.

Chung, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25, 463-483.

Chow, Y.S. (1965). Local convergence of martingales and the law of large numbers. Ann. Math. Statist. 36, 552-558.

Chow, Y.S., Robbins, H. and Siegmund, D. (1971). Great Expectations: The Theory of Optimal Stopping. Houghton-Mifflin Co., Boston.

Cleroux, R. and Hanscom, M. (1974). Replacement with adjustment and depreciation costs and interest charges. Technometrics 16, 235-239.

Csorgo, M. and Revesz, P. (1981). Strong Approximation in Probability and Statistics. Akademiai Kiado, Budapest.

Denardo, E.V. and Fox, B.L. (1967). Nonoptimality of planned replacement in intervals of decreasing failure rate. Operations Research 15, 358-359.

Dupac, V. (1965). A dynamic stochastic approximation method. Ann. Math. Statist. 36, 1695-1702.

Dupac, V. (1966). Stochastic approximation in the presence of a trend. Czech. Math. J. 16 (91), 454-461.
Fabian, V. (1968i). On the choice of design in stochastic approximation methods. Ann. Math. Statist. 39, 457-465. Fabian, V. (1968ii). On asymptoti9 normality in stochastic approximation. Ann. Math. Statist. 39, 13271332. Fabian, V. (1971). Stochastic approximation. Methods in Statistics, J.S. Rustagi(ed.), 439-460. Optimizing Fabian, V. (1973). Asymptotically efficient stochastic approximation; the R-M case. Ann. Statist. 1, 486-495 Fabian, V. (1978). On asymptotically efficient recursive estimation. Ann. Statist. 6, 854-866. Feller, W. (1971). An Introduction to Probability Theory and its Applications !. Wiley, New York. Fox, B. (1966). Age replacement with discounting. Operations Research 14, 533-537. Fritz, J. (1973). Stochastic approximation for finding local maxima of probability densities. Studia Sci. Math. Hungar. 8, 309-322. Gaposkin, V.F. and Krasulina, T.P. (1974). On the law of the iterated logarithm in stochastic approximation processes. Theor. Probab. ~ . 19, 844-850. Gertsbakh, I.B. (1977). North-Holland, New York. Models of Preventive Maintenance. • Page 149 Glasser, G.J. (1967). Technometrics 9, 83-91. The age replacement problem. Govindarajulu, z. (1975). Sequential Statistical Procedures. Academic Press, New York. . Has'minskii, R. (1975). Sequential estimation and recursive asymptotically optimal procedures of estimation and observation control. Proc. Prague ~ . Asymptotic Statist . 1, 157-178. Has'minskii, R. (1977). Stochastic approximation methods in non-linear regression models. Math. Operationsforsch. Statist., Sere Statistics 8, 95-106. Heyde, C. (1974). On martingale limit theory and strong convergence results for stochastic approximation procedures. Stoch. Proc. ~ . 2, 359-370. Holst, u. (1980). Convergence of a recursive stochastic algorithm with m-dependent observations. Scand. J. Statist. 7, 207-215. Holst, u. (1982). Convergence of a recursive stochastic algorithm with strongly regular observations. TFMS-3026, Dept. Math. 
Statist., u. of Lund, Sweden. • Isogai, E. (1980). Strong consistency and optimality of a sequential density estimator. Bull. Math. Statist. 19, 5569. -- -Janac, K. (1971). Adaptive stochastic approximations. Simulation 16, no. 2, 51-58. Kersting, G. (1977i). Some results in the asymptotic behaviour of the Robbins-Monro procedure. Bull. Int. Statist. Inst. 47, II 327-335. Kersting, G. (1977ii). Almost sure approximation of the Robbins-Monroe process by sums of independent random variables. Ann. Probability 5, 954-965. Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, 462-466. Konakov~ V.D. (1974). On the asymptotic normality of the mode of multidimensional distributions. Theor. Probab. ~. 19, 794-799. Page 150 Kushner, H.J. (1977). General convergence results for stochastic approximations via weak convergence theory. J. Math. Anal. and ~ . 61, 490-503. Kushner, H.J. and Clark, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-verlag, New York. Lai, T.L. and Robbins, H. (1978). Limit theorems for weighted sums and stochastic approximation processes. Proc. Nat1. Acad. Sci. USA 75, no. 3, 1068-1070. ---Lai, T.L. and Robbins, H. (1979). Adaptive designs and stochastic approximation. Ann. Statist. 7, 1196-1221. Lai, T.L. and Robbins, H. (1981). Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes. z. Wahrschein1ichkeitstheorie Verw. Geb. 56, 329-360. Ljung, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6, 680-696. McLeish, D.L. (1976). Functional and random central limit theorems for the Robbins-Monro process. ~.~. Prob. 13, 148-154. Miller, R.G. (1981>' Survival Analysis. Wiley, New York. Nadas, A. (1969). An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean. Ann. Math. Statist. 40, 667-671. Neve1son, M. (1975). 
On the properties of the recursive estimates for a functional of an unknown distribution function. Limit Theorems of Probability Theory, P. Revesz (ed.). Coll. Math. Soc. Janos Bolyai 11, 227-251.

Obremski, T.E. (1976). A Kiefer-Wolfowitz type stochastic approximation procedure. Ph.D. dissertation. Dept. Statist. and Probab., Mich. State Univ., East Lansing, Mich.

Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.

Ran, A. and Rosenlund, S.I. (1976). Age replacement with discounting for a continuous maintenance cost model. Technometrics 18, 459-465.

Revesz, P. (1977). How to apply the method of stochastic approximation in nonparametric estimation of a regression function. Math. Operationsforsch. Statist., Ser. Statistics 8, 119-126.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400-407.

Robbins, H. and Siegmund, D. (1971). A convergence theorem for nonnegative almost supermartingales and some applications. Optimizing Methods in Statistics (J.S. Rustagi, ed.), 233-257. Academic Press, N.Y.

Ruppert, D. (1979). A new dynamic stochastic approximation procedure. Ann. Statist. 6, 1179-1195.

Ruppert, D. (1982). Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Ann. Statist. 10, 178-187.

Ruppert, D., Reish, R.L., Deriso, R.B., and Carroll, R.J. (1982). Monte Carlo optimization by stochastic approximation (with application to harvesting of Atlantic menhaden). Institute of Statistics Mimeo Series #1500. Chapel Hill, North Carolina.

Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. Ann. Math. Statist. 29, 373-405.

Sager, T.W. (1978). Estimation of a multivariate mode. Ann. Statist. 6, 802-812.

Sakrison, D. (1965). Efficient recursive estimation; applications to estimating the parameters of a covariance function. Int. J. Engng. Sci. 3, 461-483.

Samanta, M. (1973).
Nonparametric estimation of the mode of a multivariate density. South African Statist. J. 7, 109-117.

Scheaffer, R.L. (1971). Optimum age replacement policies with an increasing cost factor. Technometrics 9, 83-91.

Sen, P.K. (1981). Sequential Nonparametrics: Invariance Principles and Statistical Inference. Wiley, New York.

Sielken, R.L. (1973). Stopping times for stochastic approximation procedures. Z. Wahrscheinlichkeitstheorie Verw. Geb. 26, 67-75.

Singh, R.S. (1976). Nonparametric estimation of mixed partial derivatives of a multivariate density. J. Multivariate Anal. 6, 111-112.

Singh, R.S. (1977). Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. Ann. Statist. 5, 394-399.

Skorohod, A.V. (1956). Limit theorems for stochastic processes. Theor. Probab. Appl. 1, 261-290.

Smith, W.L. (1955). Regenerative stochastic processes. Proc. Royal Soc., Series A 232, 6-31.

Stone, C.J. (1963). Weak convergence of stochastic processes defined on semi-infinite time intervals. Proc. Amer. Math. Soc. 14, 694-696.

Stone, C.J. (1980). Optimal rates for nonparametric estimators. Ann. Statist. 8, 1348-1360.

Stout, W.F. (1974). Almost Sure Convergence. Academic Press, New York.

Stroup, D.F. and Braun, I. (1982). On a stopping rule for stochastic approximation. Z. Wahrscheinlichkeitstheorie Verw. Geb. 60, 535-554.

Uosaki, K. (1974). Some generalizations of dynamic stochastic approximation processes. Ann. Statist. 2, 1042-1048.

Venter, J.H. (1967). An extension of the Robbins-Monro procedure. Ann. Math. Statist. 38, 181-190.

Wegman, E.J. (1972). Nonparametric probability density estimation: II. A comparison of density estimation methods. J. Statist. Comp. Sim. 1, 225-245.

Wetherill, G.B. (1966). Sequential Methods in Statistics. Methuen, London.

INDEX - NOTATION AND ASSUMPTIONS

Symbol                   short description                                                page where first introduced

F(·), S(·)               failure, survival distribution                                   1
C1, C2                   failure, planned replacement cost                                1
φ                        optimal replacement time                                         1
R1(φ)                    long-run expected cost per unit time                             1
N(t)                     number of replacements by time t                                 2
E, P                     expectation, probability operators                               4
F_n                      sigma-field of the past                                          4
σ(·)                     sigma-field generated by (·)                                     4
E_{F_n}                  conditional expectation                                          5
O, o, O_p, o_p           asymptotic relationships for orders of magnitude                 5
R^m, B^m                 m-dimensional Euclidean space and associated Borel sets          5
∈                        membership in a set                                              5
C(t)                     cost by time t                                                   5
N1(t), N2(t)             number of failure, planned replacements by time t                6
μ                        mean of distribution function F                                  7
I(·)                     indicator function                                               7
f(·)                     density function for distribution function F                     9
e                        asymptotic relative efficiency of the ARP                        9
[·]                      greatest integer function                                        11
M(t)                     function associated with cost                                    33
{a_n}, {c_n}             sequences used in the SA ARP procedure                           34
φ_n                      recursive estimator of φ                                         34
[·]^b_a                  truncation operator                                              34
g(·)                     transform function                                               35
φ*                       g^{-1}(φ)                                                        35
M_{g,n}(·)               estimator of M(g(·))                                             36
fg_n(·)                  histogram estimator of the density                               36
F_{1n}, S_{1n}           estimators of F(·), S(·)                                         36
(F*g)^(p)                pth derivative of F(g(·))                                        36
A, C                     constants associated with {a_n}, {c_n}                           37
γ                        constant which determines the rate of convergence,
                         γ = A (g'(φ))² M'(g(φ))                                          37
                         factor in the ARP asymptotic variance                            37
N(a, b)                  random variable that is distributed normally with mean a
                         and variance b                                                   38
B_n                      φ_n - φ, bias term                                               38
M_{g,n}, M_{1n}, M_{2n}  factors of M_{g,n} and M(g(·))                                   39
σ²                       variance of distribution function F                              39
v_n                      estimator of γ                                                   39
f*g_n, f*g_n^(p)         general estimators of (F*g)^(1) and (F*g)^(p+1)
                         satisfying A8 and A9                                             52
                         constants used in A9                                             52
                         class of kernel functions                                        52
                         class of orthogonal kernel functions                             53
                         class of orthogonal kernel functions that are continuous
                         and of bounded variation                                         53
                         SA kernel estimator of (F*g)^(p+1)                               55
μ0, σ0²                  asymptotic mean and variance in Theorem 3.1                      57
μ1, σ1²                  asymptotic mean and variance in Corollary 3.1                    59
A_opt, C_opt             optimal choice of A, C                                           60
∨, ∧                     max, min                                                         63
                         quantities used in M_{g,n}(·) and M(g(·))                        64
                         used to adaptively estimate A_opt                                67
                         used to estimate (F*g)^(r+1)                                     68
                         used to adaptively estimate C_opt                                69
μ2, σ2²                  asymptotic mean and variance in Theorem 3.3                      72
f^(p)                    pth derivative of a density f                                    74
f                        a density                                                        77
f_n^(p)                  SA kernel estimator for f^(p)                                    78
μ3, σ3²                  asymptotic mean and variance for SA univariate
                         density estimator                                                79
D(·)                     mixed partial derivative of f(·)                                 82
D_n(·)                   SA estimator of D(·)                                             83
R^{m×m}                  space of real m×m matrices                                       83
C_n                      ∈ R^{m×m}, = diag(c_n^{p_i + m/2})                               83
I                        m×m identity matrix                                              83
E(p_i)                   ∈ R^{m×m}, = diag(I(p_i = p))                                    83
H(·)                     matrix of partial derivatives                                    83
‖·‖                      m-dimensional Euclidean norm                                     83
μ4, Σ4                   asymptotic mean and covariance matrix of SA multivariate
                         density estimator                                                85
μ5, Σ5                   asymptotic mean and covariance matrix of SA multivariate
                         mode estimator                                                   85
                         ∂^{p_i} f(x) / ∂x_i^{p_i}                                        86
σ6²                      factor in SA K-W asymptotic variance                             94
μ6                       asymptotic mean of SA K-W procedure                              95
B_n                      vector of bias                                                   100
ε_n                      vector of errors                                                 100
D                        space of functions on [0,∞) having left hand limits
                         and continuous from the right                                    102
⇒_W                      weak convergence in Stone's topology of D                        102
B(·)                     Brownian motion on [0,∞)                                         103
z_α                      (1-α)th quantile of the standard normal distribution             107

Assumptions              page where first introduced

A0-A7                    36-37
A8-A9                    52
A10                      55
A11                      67
B1-B5                    78
C1-C5                    84
D1-D7                    96-97
E1-E5                    103
E6-E8                    108