A REPRESENTATION THEOREM FOR GENERALIZED
ROBBINS-MONRO PROCESSES AND APPLICATIONS

David Ruppert¹
Department of Statistics
University of North Carolina at Chapel Hill

Running title: Robbins-Monro Processes
Summary

Many generalizations of the Robbins-Monro process have been proposed for the purpose of recursive estimation. In this paper, it is shown that for a large class of such processes, the estimator can be represented as a sum of possibly dependent random variables, and therefore the asymptotic behavior of the estimate can be studied using limit theorems for sums. Two applications are considered: robust recursive estimation for autoregressive processes and recursive nonlinear regression.

¹Supported by National Science Foundation Grant MCS81-00748.
1. Introduction.

Stochastic approximation was introduced with Robbins and Monro's (1951) study of recursive estimation of the root, $\theta$, of $R(x) = 0$, when the function $R$ from $\mathbb{R}$ (the real line) to $\mathbb{R}$ is unknown, but when for each $x$ an unbiased estimator of $R(x)$ can be observed. Their procedure lets $\hat\theta_1$ be an initial estimate of $\theta$ and defines $\hat\theta_n$ recursively by

$\hat\theta_{n+1} = \hat\theta_n - a_n Y_n$,

where $a_n$ is a suitable positive constant and $Y_n$ is an unbiased estimate of $R(\hat\theta_n)$. Blum (1954) extended their work to the multidimensional case, that is, when $R$ maps $\mathbb{R}^p$ ($p$-dimensional Euclidean space) to $\mathbb{R}^p$.
A situation where the classical Robbins-Monro (RM) procedure is inadequate occurs when there is a different function (now mapping $\mathbb{R}^p$ to $\mathbb{R}$) at each stage of the recursion. Letting $R_n$ be the function at the $n$th stage, there may be multiple solutions to

(1.1)    $R_n(x) = 0$,

but it is assumed that there exists a unique $\theta$ solving (1.1) for all $n$. At the $n$th stage, we can, for any $x$ in $\mathbb{R}^p$, observe an unbiased estimate of $R_n(x)$. In this context, we will define a generalized Robbins-Monro (GRM) procedure to be a process satisfying

(1.2)    $\hat\theta_{n+1} = \hat\theta_n - a_n Y_n U_n$,

where $\hat\theta_1$ is an arbitrary initial estimate of $\theta$, $\{a_n\}$ is a sequence of positive constants, $Y_n$ is an unbiased estimate of $R_n(\hat\theta_n)$, and $U_n$ is a suitably chosen random vector in $\mathbb{R}^p$.
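As a purely illustrative aside, the following is a minimal Python sketch of the recursion (1.2); the names grm, observe, and direction, the toy scalar problem, and the gain parameterization are ours and are not part of the paper.

```python
import numpy as np

def grm(theta1, observe, direction, n_steps, a=1.0):
    """Minimal sketch of the generalized Robbins-Monro recursion (1.2).

    observe(n, theta)   -> Y_n, a noisy estimate of R_n(theta) (a scalar)
    direction(n, theta) -> U_n, the direction vector in R^p
    The gain is a_n = (a n)^{-1}, one admissible choice of positive constants.
    """
    theta = np.asarray(theta1, dtype=float)
    for n in range(1, n_steps + 1):
        theta = theta - (1.0 / (a * n)) * observe(n, theta) * direction(n, theta)
    return theta

# Toy illustration: R_n(x) = x - 2 observed with additive noise, U_n = 1.
rng = np.random.default_rng(0)
observe = lambda n, th: float(th[0] - 2.0 + rng.normal())
direction = lambda n, th: np.ones(1)
print(grm(np.zeros(1), observe, direction, 5000))   # should be near 2.0
```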
GRM processes are used by Albert and Gardner (1967) for recursive estimation of nonlinear regression parameters. They assume that one observes a process $\{Y_n: n = 1, 2, \ldots\}$ such that $EY_n = F_n(\theta)$, where $F_n$ is a known map of $\mathbb{R}^p$ to $\mathbb{R}$ and $\theta$ is an unknown parameter vector which is to be estimated. If $R_n(x) = F_n(x) - F_n(\theta)$, then $\theta$ solves (1.1) for all $n$.

Another example of a GRM process appears in Campbell (1979), who considers robust recursive estimation for autoregressive processes. Suppose $\theta = (\theta_1, \ldots, \theta_p)'$ is a vector parameter and the process $\{Y_n: n = 1, 2, \ldots\}$ satisfies

$Y_n = \sum_{k=1}^{p} Y_{n-k}\theta_k + u_n$,

where $\{u_n\}_{n=-\infty}^{\infty}$ are iid. Assume $\psi$, which maps $\mathbb{R}$ to $\mathbb{R}$, satisfies $E\psi(s + u_1) = 0$ if and only if $s = 0$. Then for $x = (x_1, \ldots, x_p)'$, let

$R_n(x) = \psi(Y_n - \sum_{k=1}^{p} Y_{n-k}x_k)$.

In this case, because the $Y$'s are random, for each $x$, $R_n(x)$ is a random variable. (Albert and Gardner's setup allows $R_n(x)$ to be either random or fixed; the situation is analogous to regression where the independent variables may be random or fixed.) Except in degenerate situations,

$P\{\text{there exists } x \neq 0 \text{ satisfying } R_n(x) = 0 \text{ for all } n\} = 0$.

A third example of a GRM process is given by Ruppert (1979, 1981a).
The classical Robbins-Monro process has been studied quite extensively, but little beyond consistency has been established for GRM procedures. Albert and Gardner (1967) investigate asymptotic distributions for $p = 1$ but state that "owing to its considerable complexity, the question of large-sample distribution for the vector estimates is not examined." Campbell (1979) does not treat asymptotic distributions, and Ruppert (1981a) proves asymptotic normality of $n^{\frac{1}{2}}(\hat\theta_n - \theta)$ (only in his special case) under restrictions which imply that for each fixed $x$, $\{R_n(x): n = 1, 2, \ldots\}$ are iid, that $\{(Y_n - R_n(\hat\theta_n)): n = 1, 2, \ldots\}$ are iid, and that $\{R_n(x): n = 1, 2, \ldots \text{ and } x \in \mathbb{R}^p\}$ is independent of $\{(Y_n - R_n(\hat\theta_n)): n = 1, 2, \ldots\}$.

The classical studies of stochastic approximation procedures -- for example, Chung (1954), Sacks (1958), and Fabian (1968) -- assume that the sequence $Y_n - R(\hat\theta_n)$ forms a martingale difference sequence, but recent authors, including Ljung (1978) and Kushner and Clark (1978), have felt that this assumption can be unduly restrictive in applications. Ljung (1978) proves consistency, and Kushner and Clark (1978) give a detailed treatment of consistency and asymptotic distributions for stochastic approximation methods, under assumptions considerably weaker than that the errors are martingale differences. Ruppert (1978) proves almost sure representations for multidimensional Robbins-Monro and Kiefer-Wolfowitz (1952) processes with weakly dependent errors. These representations can be used to establish almost sure behavior, i.e. laws of the iterated logarithm and invariance principles, as well as asymptotic distributions. Invariance principles are useful for the study of stopping rules; see Sielken (1973).

For GRM procedures, the results of Kushner and Clark (1978) -- in particular, their Theorems 2.4.1, 2.5.1, and 2.5.2 -- are sufficient to prove consistency under weak conditions on $(Y_n - R_n(\hat\theta_n))$, but they do not appear adequate to handle asymptotic distributions.

In this paper, the techniques of Ruppert (1978) are used to show that GRM processes can be almost surely approximated by a sequence of partial sums of random vectors, and therefore central limit theorems, invariance principles, laws of the iterated logarithm, and other results for such sums can also be established for GRM processes. The procedures of Campbell (1979) and Albert and Gardner (1967) are used as examples. In future work, these results will be applied to the procedure of Ruppert (1979, 1981a).
2. Notation and assumptions.

All random variables are defined on a probability space $(\Omega, F, P)$. All relations between random variables are meant to hold with probability 1. $(\mathbb{R}^k, B^k)$ is $k$-dimensional Euclidean space with the Borel $\sigma$-algebra. Let $X^{(i)}$ be the $i$th coordinate of the vector $X$, $A^{(ij)}$ be the $i,j$th entry of the matrix $A$, $A'$ be the transpose of $A$, $I_k$ be the $k \times k$ identity matrix, and $\|A\|$ be the Euclidean norm of the matrix $A$, i.e. $\|A\| = (\sum_i\sum_j (A^{(ij)})^2)^{\frac{1}{2}} = \operatorname{trace}(A'A)^{\frac{1}{2}}$. If $A$ is a square matrix, then $\lambda_*(A)$ and $\lambda^*(A)$ denote the minimum and maximum of the real parts of the eigenvalues of $A$, respectively.

Let $I(A)$ be the indicator of the set $A$. Throughout, $K$ will denote a generic positive constant.
For a square matrix $M$, $\exp(M)$ is given by the usual series definition:

$\exp(M) = \sum_{n=0}^{\infty} M^n (n!)^{-1}$,

and for $t > 0$,

$t^M = \exp((\log t)M)$.

Let $p$ be a fixed integer. For convenience, we will often abbreviate $I_p$ by $I$. Notice that $t^I = tI$, $t^M s^M = s^M t^M = (st)^M$, $(t^{-1})^M = t^{-M}$, $M t^M = t^M M$, and $(d/dt)t^M = M t^{M-I}$.
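As a small numerical check of this notation (assuming NumPy and SciPy are available; the helper name mat_power is ours), the identities $t^I = tI$ and $t^M s^M = (st)^M$ can be verified directly:

```python
import numpy as np
from scipy.linalg import expm

def mat_power(t, M):
    """t^M = exp((log t) M) for t > 0, as defined above."""
    return expm(np.log(t) * M)

M = np.array([[0.7, 0.2],
              [0.0, 0.9]])
s, t = 2.0, 3.0

# (st)^M = t^M s^M, and t^I = t I
assert np.allclose(mat_power(s * t, M), mat_power(t, M) @ mat_power(s, M))
assert np.allclose(mat_power(t, np.eye(2)), t * np.eye(2))
```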
We write $X_n = O(Y_n)$ or $X_n \ll Y_n$ if the random vectors $X_n$ and $Y_n$ satisfy $\|X_n\| \leq X\|Y_n\|$ for a (finite) random variable $X$.

We will need the following assumptions.
A1. $M_n(x, \omega)$ ($= M_n(x)$) is a $(\mathbb{R}^p \times \Omega, B^p \times F)$ to $(\mathbb{R}^p, B^p)$ measurable transformation, $A_n(\omega)$ ($= A_n$) is a random matrix, $v$ is a nonnegative Borel function on $\mathbb{R}^p$, $\theta$ is in $\mathbb{R}^p$, and $B_n$ is a nonnegative random variable such that

(2.1)    $v(x) = o(\|x - \theta\|)$ as $x \to \theta$

and

(2.2)    $\|M_n(x) - A_n(x - \theta)\| \leq B_n v(x)$

for all $x$ in $\mathbb{R}^p$ and all $n$.

A2. $\hat\theta_{n+1} = \hat\theta_n - n^{-1}(M_n(\hat\theta_n) + W_n)$, where $W_n$ is a random vector in $\mathbb{R}^p$.
A3. $\hat\theta_n \to \theta$.

A4. $A$ is a $p \times p$ matrix such that $\lambda_*(A) > \frac{1}{2}$, and $C_1$ is a constant such that

(2.3)    $\sum_{k=1}^{\infty} k^{-1}(A_k - A)$ converges

and

(2.4)    $\limsup_{m\to\infty}\, \sup_{n\geq m}\, \sum_{k=m}^{n} k^{-1}(\|A_k - A\| - C_1) \leq 0$.

A5. Suppose

(2.5)    $\sup_n E(\|A_n\| + B_n)^2 < \infty$,

(2.6)    $\sum_{k=1}^{\infty} k^{-\frac{1}{2}-\varepsilon} W_k$ converges for all $\varepsilon > 0$,

and, for a constant $C_2$,

(2.7)    $\limsup_{m\to\infty}\, \sup_{n\geq m}\, \sum_{k=m}^{n} k^{-1}(B_k - C_2) \leq 0$.

A6. There exists a constant $C_3$ such that for all matrices $Q_i$ with $\|Q_i\| \leq 1$, all real $a$, and all $n > m$,

(2.8)    $E\{\max \|\sum_{i=m}^{k} i^{a} Q_i(A_i - A)\|^2 : m \leq k \leq n\} \leq C_3 \sum_{i=m}^{n} i^{2a}$

and

(2.9)    $E\{\max \|\sum_{i=m}^{k} i^{-1} W_i\|^2 : m \leq k \leq n\} \leq C_3 \sum_{i=m}^{n} i^{-2}$.

A7. For some $\delta > 0$, $v(x) = O(\|x - \theta\|^{1+\delta})$ as $x \to \theta$.
REMARKS. The relationship between (1.2) and A2 is that $M_n(x) = a^{-1}R_n(x)U_n$ ($a$ is some positive number), $W_n = a^{-1}(Y_n - R_n(\hat\theta_n))U_n$, and $a_n = (an)^{-1}$. More general $a_n$ sequences could be treated, but they would not lead to better rates of convergence of $\hat\theta_n$ to $\theta$. McLeish (1975) has investigated the convergence of the series $\sum_{n=1}^{\infty} c_n X_n$, where the $c_n$ are constants and $\{X_n\}$ is a sequence of random variables satisfying certain mixing and moment conditions. As we will see in Section 4, his results can be used in the verification of A4 and A5 in some specific cases. Note that (2.4) is a weaker assumption than that $\sum_{k=1}^{\infty}(\|A_k - A\| - C_1)k^{-1}$ converges (and similarly for (2.7)). Also, McLeish's Theorem 1.6, which is a generalization of a martingale inequality due to Doob (1953), gives conditions which imply A6.
3. Main results.

LEMMA 3.1. Let $M$ be a $p \times p$ square matrix such that $a < \lambda_*(M) \leq \lambda^*(M) < b$ for real numbers $a$ and $b$. Then there exists a norm $\|\cdot\|_M$ on $\mathbb{R}^p$ such that

$e^{at}\|x\|_M \leq \|\exp(tM)x\|_M \leq e^{bt}\|x\|_M$

for all $t \geq 0$ and all $x$ in $\mathbb{R}^p$.

PROOF. The proof is given by Hirsch and Smale (1974, page 146).

LEMMA 3.2. Assume A1 to A5 hold. Then $n^{\varepsilon}(\hat\theta_n - \theta) \to 0$ for all $\varepsilon < \frac{1}{2}$.
PROOF. Without loss of generality, we take $\theta = 0$. Fix $\varepsilon < \frac{1}{2}$ and define $\phi_n = (n-1)^{\varepsilon}\hat\theta_n$. Then from (2.2) and A2, we obtain

(3.1)    $\phi_{n+1} = [I + n^{-1}\{\varepsilon I - A_n + u_n\}]\phi_n - n^{\varepsilon-1}W_n$,

where $u_n$ is a random matrix. Define $T_n = A_n - \varepsilon I - u_n$ and $D = A - \varepsilon I$. Suppose $n \geq m$. Then iterating (3.1) back to $m$ and using the definitions

$\Gamma_m^n = \sum_{k=m}^{n} k^{-1}(T_k - D)$,    $K_m^n = \sum_{k=m}^{n} k^{-1}$,    $W_m^n = \sum_{k=m}^{n} k^{\varepsilon-1}W_k$,
we see that

(3.2)    $\phi_{n+1} = \phi_m - [\Gamma_m^n\phi_m + DK_m^n\phi_m + D\sum_{k=m}^{n}k^{-1}(\phi_k - \phi_m) + \sum_{k=m}^{n}k^{-1}(T_k - D)(\phi_k - \phi_m) + W_m^n]$.

Choose $t_0$ and a norm $\|\cdot\|_*$ such that

$\|(I - Dt)x\|_* \leq (1 - \lambda_*(D)t/2)\|x\|_*$

for all $t$ in $[0, t_0]$. (This can be done by Lemma 3.1 and a simple calculation.)

For $x > 0$ and $y > 0$, define

(3.3)    $\rho(x,y) = (x - y(x+2))/(\|D\|_*(1+x) + y + C_1 x + xy)$

($C_1$ is given by A4) and

(3.4)    $r(x,y) = (\rho(x,y) - y)(\lambda_*(D)/2 - y - \|D\|_* x - xy - C_1 x) - (2y + xy)$.

Choose $\eta > 0$ and $L > 0$ such that

$t_0/2 > \rho(L,\eta) > \eta$    and    $r(L,\eta) > 0$.
Then using (2.3), (2.4), (2.6), and (2.7), choose $N > \max\{2/t_0, \eta^{-1}\}$ so large that if $n > m \geq N$, then

(3.5)    $\|\Gamma_m^n\|_* \leq \eta$,

(3.6)    $\sum_{k=m}^{n} k^{-1}(\|A_k - A\| - C_1) \leq \eta$,

and

(3.7)    $\|W_m^n\|_* \leq \eta$.

Define $n(1) = N$ and $n(i+1) = \inf\{k \geq n(i): K_{n(i)}^k \geq \rho(L,\eta)\}$ for $i = 1, 2, \ldots$. Note that

(3.8)    $K_{n(i)}^{n(i+1)} \leq t_0$.

Since $\rho(L,\eta) > \eta > N^{-1}$,

(3.9)    $n(i+1) > n(i)$ for all $i \geq 1$.

Let $\eta_i = \sup\{\|W_{n(i)}^k\|_* : n(i) \leq k \leq n(i+1)\}$. By (3.7), $\eta_i \leq \eta$ for $i \geq 1$, and by (2.6), $\eta_i \to 0$.
We now will prove that for each $i$ and each $k$ in $\{n(i), \ldots, n(i+1)\}$,

(3.10)    $\|\phi_k - \phi_{n(i)}\|_* \leq L\max\{\|\phi_{n(i)}\|_*, \eta_i\}$.

The proof will be by induction on $k$ for each fixed $i$. Suppose that (3.10) holds for all $\ell$ less than or equal to some $k < n(i+1)$. Then by (3.2), (3.5), (3.6), and (3.9), and the fact that $K_{n(i)}^k < \rho(L,\eta)$ for $k < n(i+1)$,

$\|\phi_{k+1} - \phi_{n(i)}\|_* \leq K_{n(i)}^k\|D\|_*\|\phi_{n(i)}\|_* + \eta(K_{n(i)}^k + 1)\|\phi_{n(i)}\|_* + ((C_1 + \eta + \|D\|_*)K_{n(i)}^k + \eta)\cdot\sup\{\|\phi_\ell - \phi_{n(i)}\|_* : n(i) \leq \ell \leq k\} + \eta_i$
$\leq \{\rho(L,\eta)[\|D\|_*(1+L) + \eta + C_1 L + \eta L] + \eta(L+2)\}\cdot\max\{\|\phi_{n(i)}\|_*, \eta_i\}$
$\leq L\max\{\|\phi_{n(i)}\|_*, \eta_i\}$,

which completes the proof of (3.10). It follows from (3.2), (3.5) to (3.8), and (3.10) that

$I\{\|\phi_{n(i)}\|_* > \eta_i\}\|\phi_{n(i+1)}\|_* \leq \|\phi_{n(i)}\|_*\{1 + 2\eta + \eta L - K_{n(i)}^{n(i+1)-1}[\lambda_*(D)/2 - \eta - \|D\|_* L - \eta L - C_1 L]\}$.

By the definition of $n(i+1)$,

$K_{n(i)}^{n(i+1)-1} \geq \rho(L,\eta) - (n(i+1))^{-1} \geq \rho(L,\eta) - \eta$,

and therefore

(3.11)    $I\{\|\phi_{n(i)}\|_* > \eta_i\}\|\phi_{n(i+1)}\|_* \leq (1 - r(L,\eta))\|\phi_{n(i)}\|_*$.

Now choose $\gamma_i > 0$ such that $\gamma_i > \eta_i$, $\gamma_i \to 0$, and $\sum\gamma_i = \infty$. Then by (3.10) and (3.11), we have

$\|\phi_{n(i+1)}\|_* \leq \max\{(L+1)\gamma_i,\ \|\phi_{n(i)}\|_* - r(L,\eta)\gamma_i\}$.

An application of Derman and Sacks's (1959) Lemma 1, with $a_i = 0$, $b_i = 0$, $\delta_i = \gamma_i(L+1)$, and $c_i = r(L,\eta)\gamma_i$, proves that $\|\phi_{n(i)}\|_* \to 0$, and then by (3.10) we can conclude that $\|\phi_n\|_* \to 0$.

THEOREM 3.1. Assume A1 to A7. Then there exists $\varepsilon > 0$ such that

$n^{\frac{1}{2}}(\hat\theta_{n+1} - \theta) = -n^{-\frac{1}{2}}\sum_{k=1}^{n}(k/n)^{A-I}W_k + O(n^{-\varepsilon})$.

PROOF. Again we take $\theta = 0$. By (2.2), A2, and A7,

$\hat\theta_{n+1} = \hat\theta_n - n^{-1}(A\hat\theta_n + K_n\hat\theta_n + \xi_n + W_n)$,

where $K_n = A_n - A$ and $\xi_n = O(B_n\|\hat\theta_n\|^{1+\delta})$. By A6, $E\|K_n\|^2 \leq C_3$ for all $n$, whence

$\sum_{n=1}^{\infty} E(n^{-\frac{1}{2}-\varepsilon}\|K_n\|)^2 < \infty$
for all $\varepsilon > 0$. Thus for all $\varepsilon > 0$,

$\sum_{n=1}^{\infty}(n^{-\frac{1}{2}-\varepsilon}\|K_n\|)^2 < \infty$.

(Here and elsewhere in the proof, we make use of the fact that for nonnegative random variables $X_n$, $\sum_1^{\infty}X_n$ converges if $\sum_1^{\infty}EX_n$ converges.) Then since by Lemma 3.2, $\hat\theta_n \ll n^{-\frac{1}{2}+\varepsilon}$ for all $\varepsilon > 0$,

$\sum_{n=1}^{\infty} n^{-\frac{1}{2}-\varepsilon}\|K_n\hat\theta_n\| < \infty$ for all $\varepsilon > 0$.

Also, we can prove in a similar fashion that

$\sum_{n=1}^{\infty} n^{-\frac{1}{2}}\|\xi_n\| < \infty$.

Therefore for each $\varepsilon > 0$,

$\sum_{n=1}^{\infty} n^{-\frac{1}{2}-\varepsilon}(K_n\hat\theta_n + \xi_n + W_n)$ converges

by (2.6). Now applying Theorem 3.1 of Ruppert (1978) with $T = 0$, $X_n = \hat\theta_n$, $f(x) = Ax$, $e_n = (K_n\hat\theta_n + \xi_n + W_n)$, and $\gamma = \frac{1}{2}$, we obtain

$n^{\frac{1}{2}}\hat\theta_{n+1} = -n^{-\frac{1}{2}}\sum_{k=1}^{n}(k/n)^{A-I}(K_k\hat\theta_k + \xi_k + W_k) + O(n^{-\varepsilon})$

for some $\varepsilon > 0$. The proof will be completed by showing that for some $\varepsilon > 0$,

(3.12)    $n^{-\frac{1}{2}}\sum_{k=1}^{n}(k/n)^{A-I}(K_k\hat\theta_k + \xi_k) = O(n^{-\varepsilon})$.

We will assume that $A - I$ has only one eigenvalue, $\lambda$. This involves no loss in generality since, by a change of basis, we can put $A - I$ in real canonical form (Hirsch and Smale (1974, page 130)), so that $A - I = \operatorname{diag}(M_1, \ldots, M_q)$, $q \leq p$, where the square matrices $M_1, \ldots, M_q$ each have exactly one eigenvalue. Then we can apply the proof $q$ times. By Lemma 3.1, there exist matrices $Q_k$ with $\sup_k\|Q_k\| < \infty$ such that (3.12) holds if

$n^{-\frac{1}{2}-\lambda+2\varepsilon/3}\sum_{k=1}^{n}k^{\lambda+\varepsilon/3}Q_k(K_k\hat\theta_k + \xi_k) = O(1)$,

so by Kronecker's lemma it is sufficient to find $\varepsilon > 0$ such that

(3.13)    $\sum_{k=1}^{\infty}k^{-\frac{1}{2}+\varepsilon}Q_k\xi_k$ converges

and

(3.14)    $\sum_{k=1}^{\infty}k^{-\frac{1}{2}+\varepsilon}Q_kK_k\hat\theta_k$ converges.
Now since $\|\xi_k\| \ll B_k\|\hat\theta_k\|^{1+\delta}$ and $\|\hat\theta_k\| \ll k^{-\frac{1}{2}+\varepsilon}$ for all $\varepsilon > 0$, (3.13) holds for some $\varepsilon > 0$. It remains to prove (3.14).

Now fix $a$, $\varepsilon$, and $\varepsilon'$ such that

$a > 3/2$,    $0 < \varepsilon < (a-1)/2$,    and    $0 < 2a\varepsilon' < \varepsilon$.

Define $\phi(n)$ to be the greatest integer less than or equal to $n^a$, $n(i) = \{j: \phi(i) \leq j \leq \phi(i+1) - 1\}$, and $\psi_k = Q_kK_k k^{-\frac{1}{2}+\varepsilon'}$. We will prove (3.14) by showing that

(3.15)    $\sum_{k=1}^{\infty}\|\sum_{j\in n(k)}\psi_j\hat\theta_j\| < \infty$.

Now define

$z_{1,i} = \sum_{j\in n(i)}\psi_j\sum_{\ell=\phi(i)}^{j-1}\ell^{-1}W_\ell$,    $z_{2,i} = \sum_{j\in n(i)}\psi_j\sum_{\ell=\phi(i)}^{j-1}\ell^{-1}M_\ell(\hat\theta_\ell)$,

and

$z_{3,i} = (\sum_{j\in n(i)}\psi_j)\hat\theta_{\phi(i)}$.

Then since $\hat\theta_j = \hat\theta_{\phi(i)} - \sum_{\ell=\phi(i)}^{j-1}\ell^{-1}(M_\ell(\hat\theta_\ell) + W_\ell)$ for $j > \phi(i)$, we will have proved (3.15) if we prove that $\sum_{i=1}^{\infty}\|z_{m,i}\| < \infty$ for $m = 1, 2$, and 3.
Now

$\sum_{i=1}^{\infty}E\|z_{1,i}\| < \infty$,

since by A6,

(3.16)    $E\|\sum_{j\in n(i)}\psi_j\|^2 \leq K i^{2a\varepsilon'-1}$,

$2a\varepsilon' < \varepsilon$, and $\varepsilon < a$ by choice of $\varepsilon$ and $a$. By (2.1), (2.2), and Lemma 3.2,

$\|M_\ell(\hat\theta_\ell)\| \ll (\|A_\ell\| + B_\ell)\ell^{-\frac{1}{2}+\varepsilon'}$.

Therefore,

(3.17)    $\sum_{i=1}^{\infty}\|z_{2,i}\| \ll \sum_{i=1}^{\infty}\|\sum_{j\in n(i)}\psi_j\|\,[\sum_{j\in n(i)}(\|A_j\| + B_j)]\,i^{a(-3/2+\varepsilon')}$.

By (2.5),

$E[\sum_{j\in n(i)}(\|A_j\| + B_j)]^2 \ll (\#n(i))^2 \ll i^{2(a-1)}$,

where $\#A$ is the cardinality of the set $A$. Therefore, using (3.16), the expectation of the $i$th summand on the RHS of (3.17) is bounded by

$Ki^{a(-3/2+\varepsilon')}\,i^{-\frac{1}{2}+a\varepsilon'}\,i^{a-1} = Ki^{-a/2-3/2+2a\varepsilon'}$.

Since $-a/2 - 3/2 + 2a\varepsilon' < -1$, the expectation of the RHS of (3.17) has a finite limit. Therefore, $\sum_{i=1}^{\infty}\|z_{2,i}\| < \infty$.

Since $\sup_i\|i^{(\frac{1}{2}-\varepsilon')a}\hat\theta_{\phi(i)}\| < \infty$, $\sum_{i=1}^{\infty}\|z_{3,i}\|$ converges if the following converges:

(3.18)    $\sum_{i=1}^{\infty}i^{-(\frac{1}{2}-\varepsilon')a}\|\sum_{j\in n(i)}\psi_j\|$.

Using (3.16) and $-(a+1)/2 + 2a\varepsilon' < -(a+1)/2 + \varepsilon < -1$, (3.18) converges.
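As an illustration of Theorem 3.1 (not part of the paper), the following Python sketch simulates the simplest case $M_n(x) = Ax$ with iid $W_n$ and compares $n^{1/2}\hat\theta_{n+1}$ with the partial-sum representation; the particular matrix, sample size, and code organization are our own assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
p, n = 2, 20000
A = np.array([[1.0, 0.3],
              [0.0, 0.8]])          # lambda_*(A) = 0.8 > 1/2, as required by A4
I = np.eye(p)

theta = np.ones(p)                  # theta-hat_1; the root theta is 0 here
W = rng.normal(size=(n, p))
for k in range(1, n + 1):
    theta = theta - (A @ theta + W[k - 1]) / k      # A2 with M_k(x) = A x

lhs = np.sqrt(n) * theta                            # n^{1/2} theta-hat_{n+1}
rhs = -sum(expm(np.log(k / n) * (A - I)) @ W[k - 1] for k in range(1, n + 1)) / np.sqrt(n)
print(np.linalg.norm(lhs - rhs), np.linalg.norm(lhs))   # the difference should be comparatively small
```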
4. Applications.

In this section, we elaborate upon two examples of generalized Robbins-Monro processes that were mentioned briefly in the introduction.

ROBUST RECURSIVE ESTIMATION FOR AUTOREGRESSIVE PROCESSES.

Let $\{Y_n\}$ be a strictly stationary autoregressive process such that

$Y_n = \sum_{k=1}^{p} Y_{n-k}\theta^{(k)} + u_n$,

where $p < \infty$ and $\{u_n\}$ is an iid sequence with distribution function $F$ and satisfying $Eu_n^4 < \infty$ and $Eu_n = 0$. There are, of course, many methods, e.g. least squares, of estimating $\theta = (\theta^{(1)}, \ldots, \theta^{(p)})'$ based on the sample $(Y_1, \ldots, Y_n)$. If the $Y_n$ are being observed rapidly and we need to update our estimate after receiving each new observation, then recursively defined estimators are very attractive. Campbell (1979) has defined a class of recursive estimators which are robust (that is, relatively insensitive to the tail behavior of the distribution of $u_n$ and to the presence of spurious observations), but she only examined the consistency of these procedures. We will look at a class of procedures including Campbell's, investigate its asymptotic distributions, and find the procedure which minimizes the asymptotic variance-covariance matrix.
Let $Z_n = (Y_{n-1}, \ldots, Y_{n-p})'$. Define $\hat\theta_n$ recursively by

(4.1)    $\hat\theta_{n+1} = \hat\theta_n - n^{-1}\psi(Z_n'\hat\theta_n - Y_n)g(Z_n)DZ_n$,

where $g$ and $\psi$ are maps from $\mathbb{R}^p$ to $\mathbb{R}$ and $\mathbb{R}$ to $\mathbb{R}$ respectively, and $D$ is a $p\times p$ matrix. For robustness, we require that

(4.2)    $\sup_{x\in\mathbb{R}}|\psi(x)| < \infty$    and    $\sup_{x\in\mathbb{R}^p}\|g(x)x\| < \infty$.

These boundedness assumptions will also simplify later proofs. Campbell uses only $D = I$, but she considers a more general gain sequence, $\{a_n\}$, in place of $n^{-1}$ in expression (4.1). We will demonstrate that $D = I$ is asymptotically optimal only under rather special circumstances. Moreover, as we will see, certain of our recursive estimators are asymptotically equivalent to their nonrecursive competitors, bounded influence regression estimates (Hampel (1978)), so there appears to be no advantage in using other $a_n$.
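Before verifying A1 to A7, here is a small Python sketch of the recursion (4.1) with Huber's $\psi$; the function names, the particular bounded $g$, and the toy AR(2) example are our own choices for illustration, not part of the paper.

```python
import numpy as np

def huber_psi(x, k=1.345):
    # Huber's psi(x) = min(k, |x|) sign x
    return np.clip(x, -k, k)

def robust_recursive_ar(y, p, D=None, g_bound=5.0):
    """Sketch of (4.1): theta_{n+1} = theta_n - n^{-1} psi(Z_n' theta_n - Y_n) g(Z_n) D Z_n.

    g(z) is taken here as min(1, g_bound/||z||), so that ||g(z) z|| <= g_bound (condition (4.2));
    D defaults to the identity, Campbell's choice."""
    D = np.eye(p) if D is None else D
    theta = np.zeros(p)
    for n in range(p, len(y)):
        z = y[n - p:n][::-1]                        # Z_n = (Y_{n-1}, ..., Y_{n-p})'
        g = min(1.0, g_bound / max(np.linalg.norm(z), 1e-12))
        theta = theta - huber_psi(z @ theta - y[n]) * g * (D @ z) / n
    return theta

# Toy usage: stationary AR(2) with heavy-tailed (t_5) noise
rng = np.random.default_rng(0)
theta_true = np.array([0.5, -0.3])
y = np.zeros(5000)
for n in range(2, 5000):
    y[n] = theta_true @ y[n - 2:n][::-1] + rng.standard_t(df=5)
print(robust_recursive_ar(y, p=2))                  # rough estimate of theta_true
```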
First, we will verify that A1 to A7 hold. Let

$F_n = \sigma(Y_k: k \leq n-1)$,

$M_n(x) = [\int_{-\infty}^{\infty}\psi(Z_n'(x-\theta) - u)\,dF(u)]\,g(Z_n)DZ_n = E^{F_n}\psi(Z_n'(x-\theta) - u_n)g(Z_n)DZ_n$,

and

$W_n = \psi(Z_n'(\hat\theta_n - \theta) - u_n)g(Z_n)DZ_n - M_n(\hat\theta_n)$.

Then (4.1) is equivalent to A2. We will also assume that:

B1. $\psi$ has a bounded Radon-Nikodym derivative $\dot\psi$.

B2. $\psi$ has a bounded second derivative $\ddot\psi$ except at a finite collection of points $\{a_1, \ldots, a_J\}$.

B4. $F$ has a density $f$ satisfying

$\int|f(x) - f(x+a)|\,dx \leq K|a|^{\beta}$ for all $a$ and some $\beta > 0$.

B5. The polynomial $x^p - \sum_{r=1}^{p}x^{p-r}\theta^{(r)}$ has all roots inside the unit circle.
Define $c = \int_{-\infty}^{\infty}\dot\psi(-u)\,dF(u)$, $S_n = g(Z_n)Z_nZ_n'$, $A_n = cDS_n$, $S = ES_n$, and $A = EA_n$.

B6. Assume $\lambda_*(A) > \frac{1}{2}$.

Many of the $\psi$ functions in the Princeton robustness study (Andrews (1972)), for example Huber's

$\psi(x) = \min(k, |x|)\,\operatorname{sign} x$

and Andrews'

$\psi(x) = \sin(x/k)$ if $|x| \leq \pi k$, $= 0$ otherwise,

satisfy B1 and B2, and they satisfy B3 if $F$ is symmetric about zero.
Since $\{W_n, F_{n+1}\}$ is a martingale difference sequence and is uniformly bounded in $L_2$, (2.6) and (2.9) hold by the martingale convergence theorem and Doob's inequality (Doob (1953, Chapter VII, Theorem 3.4)).

Theorem 3.2 of Pham and Tran (1980), B4, and the assumption that $Eu_1^4 < \infty$ imply that $Y_n$ is strong mixing with mixing coefficients $\alpha(n)$ which are $O(\rho^n)$ as $n \to \infty$ for some $\rho$ in $(0,1)$. Therefore, if $h$ is a Borel function on $\mathbb{R}^q$ and $V_k = h(Y_k, \ldots, Y_{k+q-1})$, then $\{V_n\}$ is also strong mixing, and its mixing coefficients are $O(\rho^{n/2})$.

Define

(4.3)    $B_n = K\|Z_n\|^2$.

Since $EY_n^4 < \infty$ and $g$ is bounded, $A_n$ (see B6) and $B_n$ are strictly stationary and have finite second moments, so (2.5) holds. Define $C_1 = E\|A_n - A\|$ and $C_2 = EB_n$. Then $\{A_n\}$, $\{\|A_n - A\| - C_1\}$, and $\{B_n - C_2\}$ are mixingales of size $-q$ for all $q > 0$. (See McLeish (1975) for a definition and discussion of mixingales.) Therefore, by Corollary 1.8 of McLeish, (2.3), (2.4), and (2.7) hold. Therefore A5 holds, and since $\lambda_*(A) > \frac{1}{2}$ is assumed, A4 holds. By Theorem 1.6 of McLeish, A6 holds.

By B1, B2, and B4,

$\|M_n(x) - A_n(x - \theta)\| = O(B_n\|x - \theta\|^2)$

as $x \to \theta$, so that A1 and A7 hold with $v(x) = \|x - \theta\|^2$ and $B_n$ given by (4.3). One can show that $\hat\theta_n \to \theta$, i.e. that A3 holds, using Theorem 2.4.1 of Kushner and Clark (1978) with $X_n = \hat\theta_n - \theta$, $a_n = n^{-1}$, $\xi_n = (Y_{n-1}, \ldots, Y_{n-p}, u_n)$, $\beta_n \equiv 0$, and $h(x,y) = Dg(y_1)y_1\psi(y_1'x - y_2)$, where $y' = (y_1', y_2)$, $y_1 \in \mathbb{R}^p$, $y_2 \in \mathbb{R}$, and $x \in \mathbb{R}^p$. Kushner and Clark's assumptions A2.4.2 and A2.4.3' can be verified using the mixing properties of $Y_n$ which we have established. Use $q(x) = K\|x\|$, $g_1(x,y) = 1$, and $g_2(\xi) = \|\xi\|$ in A2.4.3'. The assumption that $\hat\theta_n - \theta$ is bounded can be established using Theorem 1 of Robbins and Siegmund (1971).
We have now verified A1 to A7, so by Theorem 3.1,

$n^{\frac{1}{2}}(\hat\theta_{n+1} - \theta) = -n^{-\frac{1}{2}}\sum_{k=1}^{n}(k/n)^{A-I}W_k + O(n^{-\varepsilon})$

for some $\varepsilon > 0$. With this approximation, we could investigate the asymptotic behavior of $\hat\theta_n$ rather thoroughly. Here, however, we will restrict attention to asymptotic normality. Define

$\Psi_n = [E\psi^2(-u_1)][g(Z_n)^2 Z_nZ_n']$    and    $\Psi = E\Psi_n$.

Since $\hat\theta_n \to \theta$ and $W_n$ is bounded uniformly in $n$,

(4.4)    $E^{F_n}W_nW_n' - D\Psi_nD' \to 0$

by the bounded convergence theorem. Now $(\Psi_n - \Psi)$ is a mixingale of size $-q$ for all $q > 0$, so by Corollary 1.8 of McLeish (1975),

$\sum_{k=1}^{\infty}k^{-(\frac{1}{2}+\varepsilon)}Q_k(\Psi_k - \Psi)$ converges

for any $\varepsilon > 0$ and any sequence of uniformly bounded matrices $Q_k$. Thus by (4.4) and the same argument used to conclude (3.12) from (3.13) and (3.14), we see that

$n^{-1}\sum_{k=1}^{n}(k/n)^{A-I}\,E^{F_k}W_kW_k'\,(k/n)^{(A-I)'} \to \int_0^1 t^{cDS-I}\,D\Psi D'\,t^{(cDS-I)'}\,dt = V(c,D,S)$,

say. (The integral does not diverge at 0 since $\lambda_*(cDS - I) > -\frac{1}{2}$.) By the martingale central limit theorem of, for example, Brown (1971, Theorem 2),

$n^{\frac{1}{2}}(\hat\theta_n - \theta) \to_{\mathcal{D}} N(0, V(c,D,S))$.

(A functional central limit theorem can also be established by the same theorem of Brown.) The Lindeberg condition is easily established because $W_n$ is uniformly bounded.
If $\psi$ and $g$ are fixed so that $c$ and $S$ are fixed, then $D = (cS)^{-1}$ is optimal, that is,

$V(c,D,S) - V(c,(cS)^{-1},S)$

is p.s.d. This can be seen by noting that

$V(c,(cS)^{-1},S) = (cS)^{-1}\Psi(cS)^{-1}$,

so that the difference $V(c,D,S) - V(c,(cS)^{-1},S)$ can be written as an integral over $[0,1]$ whose integrand is p.s.d. Note that

(4.5)    $V(c,(cS)^{-1},S) = c^{-2}S^{-1}\Psi S^{-1} = \dfrac{E\psi^2(-u_1)}{(E\dot\psi(-u_1))^2}\,(Eg(Z_1)Z_1Z_1')^{-1}(Eg^2(Z_1)Z_1Z_1')(Eg(Z_1)Z_1Z_1')^{-1}$.
Also, if $D = (cS)^{-1}$, then

$n^{\frac{1}{2}}(\hat\theta_{n+1} - \theta) = -n^{-\frac{1}{2}}\sum_{k=1}^{n}W_k + O(n^{-\varepsilon}) = n^{-\frac{1}{2}}\sum_{k=1}^{n}(cS)^{-1}\psi(u_k)g(Z_k)Z_k + o_p(1)$,

so by Carroll and Ruppert (1981, Theorem 1), $\hat\theta_n$ is asymptotically equivalent to the bounded influence regression estimate, $\tilde\theta_n$, which solves

$0 = \sum_{i=1}^{n}\psi(Y_i - Z_i'\tilde\theta_n)g(Z_i)Z_i$.
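For readers who want to compute the optimal covariance (4.5) from data, the following is a sketch of a plug-in estimator; the helper name and the requirement that $\psi$, $\dot\psi$, and $g$ be supplied as vectorizable functions are our own implementation assumptions.

```python
import numpy as np

def plugin_covariance(resids, Z, psi, psi_dot, g):
    """Plug-in estimate of V(c, (cS)^{-1}, S) in (4.5):
    [E psi^2 / (E psi-dot)^2] (E g ZZ')^{-1} (E g^2 ZZ') (E g ZZ')^{-1}."""
    gz = np.array([g(z) for z in Z])
    S1 = np.mean([gi * np.outer(z, z) for gi, z in zip(gz, Z)], axis=0)       # estimates E g(Z) ZZ'
    S2 = np.mean([gi ** 2 * np.outer(z, z) for gi, z in zip(gz, Z)], axis=0)  # estimates E g^2(Z) ZZ'
    ratio = np.mean(psi(np.asarray(resids)) ** 2) / np.mean(psi_dot(np.asarray(resids))) ** 2
    S1_inv = np.linalg.inv(S1)
    return ratio * S1_inv @ S2 @ S1_inv
```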
Of course, $c$ and $S$ will generally not be known a priori. One can, however, estimate them by, e.g.,

$\hat c_n = n^{-1}\sum_{i=1}^{n}\dot\psi(Y_i - Z_i'\hat\theta_i)$    and    $\hat S_n = n^{-1}\sum_{i=1}^{n}g(Z_i)Z_iZ_i'$,

and then use $(\hat c_n\hat S_n)^{-1}$ for $D$ in (4.1). These estimates can be defined recursively. The asymptotic behavior of the resulting stochastic approximation process is still an open problem, but once consistency is established, the general results of Section 3 should be applicable.
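A minimal sketch of this adaptive variant follows, with $D$ replaced at each step by $(\hat c_n\hat S_n)^{-1}$; the ridge term and the vectorized $\psi$, $\dot\psi$ arguments are our own stabilizing and implementation choices, not part of the paper.

```python
import numpy as np

def adaptive_robust_ar(y, p, psi, psi_dot, g_bound=5.0, ridge=1e-3):
    """Recursion (4.1) with D = (c_n S_n)^{-1}, where c_n and S_n are the running
    averages defined above.  g(z) = min(1, g_bound/||z||) as in the earlier sketch."""
    theta = np.zeros(p)
    c_sum, S_sum = 0.0, np.zeros((p, p))
    for n in range(p, len(y)):
        z = y[n - p:n][::-1]
        g = min(1.0, g_bound / max(np.linalg.norm(z), 1e-12))
        resid = y[n] - z @ theta
        c_sum += psi_dot(resid)                     # running estimate of c (cf. c_n above)
        S_sum += g * np.outer(z, z)                 # running estimate of S (cf. S_n above)
        m = n - p + 1
        D = np.linalg.inv((c_sum / m) * (S_sum / m) + ridge * np.eye(p))
        theta = theta - psi(-resid) * g * (D @ z) / n   # update (4.1) with the estimated D
    return theta

# e.g., with Huber's psi and its a.e. derivative:
# adaptive_robust_ar(y, 2, psi=lambda r: np.clip(r, -1.345, 1.345),
#                    psi_dot=lambda r: float(abs(r) <= 1.345))
```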
If one is not concerned with robustness, then for any value of $\psi$, $g(x) \equiv 1$ is optimal, but of course our proof assumes that $xg(x)$ is bounded, so it would need to be modified for this choice of $g$. For this $g$ and its optimal $D$, i.e. $D = (c(EZ_1Z_1'))^{-1}$, we have $S = EZ_1Z_1'$, so that $n^{\frac{1}{2}}\hat\theta_n$ has asymptotic variance-covariance matrix

$\dfrac{E\psi^2(-u_1)}{(E\dot\psi(-u_1))^2}(EZ_1Z_1')^{-1}$.

To show that this $g$ is optimal, we must show that for any real-valued $g$,

(4.6)    $(Eg(Z_1)Z_1Z_1')^{-1}(Eg^2(Z_1)Z_1Z_1')(Eg(Z_1)Z_1Z_1')^{-1} - (EZ_1Z_1')^{-1}$

is p.s.d., but (4.6) is the variance-covariance matrix of

$[g(Z_1)(Eg(Z_1)Z_1Z_1')^{-1} - (EZ_1Z_1')^{-1}]Z_1$.

The use of $g(x) \equiv 1$ allows outlying $Z_n$ to have great influence on the estimation process, which could be disastrous if any of these $Z_n$ have actually been observed with gross error, or if in some other way the pair $(Y_n, Z_n)$ is an outlier. The optimal choice of $g$ subject to the bound

$\|xg(x)\| \leq B$ for all $x$,

where $B$ is a fixed constant, has been considered by Hampel (1978) in a (slightly) different context, bounded influence linear regression.

Regardless of the value of $g$, we see from (4.5) that $\psi$ should be chosen to minimize $E\psi^2(-u_1)/(E\dot\psi(-u_1))^2$. If $F$ has a density $f$ satisfying the typical regularity conditions of maximum likelihood estimation, then of course $\psi = -\dot f/f$ is optimal. If one believes that $f$ is only "close" to some known $f_0$, then one should choose $\psi$ robustly, and for this purpose the reader is referred to Huber (1964, 1981) for an introduction to robust estimation.
RECURSIVE NONLINEAR REGRESSION.

Now we consider the nonlinear regression model

$Y_i = F(v_i, r) + e_i$,

where $F$ is a known function from $\mathbb{R}^q \times \mathbb{R}^p$ to $\mathbb{R}$, $v_i$ is a known vector of covariates, $r$ is an unknown parameter vector, and $e_i$ is a random noise term. Usually $r$ is estimated using least squares, that is, by finding $\hat r$ which minimizes

$\sum_{i=1}^{n}(Y_i - F(v_i, \hat r))^2$.

However, just as with our last example, we may wish to use recursively defined estimators. We will study a particular recursive estimation sequence introduced by Albert and Gardner (1967), and we will show that its asymptotic distribution is the same as least squares, at least under certain regularity conditions. (Both the least squares estimator and Albert and Gardner's recursive estimator can be robustified as in the previous example.)

Define $h(v,w)$ to be the gradient $(\partial/\partial w)F(v,w)$, and $h_n(w) = h(v_n, w)$. Let $\hat H_0$ be a p.d. $p \times p$ matrix, let $\hat r_1$ be in $\mathbb{R}^p$, and define $\hat H_n$ and $\hat r_n$ by
$\hat H_n = n^{-1}(\hat H_0 + \sum_{i=1}^{n}h_i(\eta_i)h_i'(\eta_i))$,

where $\eta_i = \hat r_\ell$ for some $\ell \leq i$, and

(4.7)    $\hat r_{n+1} = \hat r_n + ((n+1)\hat H_{n+1})^{-1}h_{n+1}(\eta_{n+1})(Y_{n+1} - F(v_{n+1}, \hat r_n))$.

Notice that

$(n+1)\hat H_{n+1} = n\hat H_n + h_{n+1}(\eta_{n+1})h_{n+1}'(\eta_{n+1})$,

and if we define $B_n = n\hat H_n$, then

(4.8)    $B_{n+1}^{-1} = B_n^{-1} - \dfrac{B_n^{-1}h_{n+1}(\eta_{n+1})h_{n+1}'(\eta_{n+1})B_n^{-1}}{1 + h_{n+1}'(\eta_{n+1})B_n^{-1}h_{n+1}(\eta_{n+1})}$.

Consequently, $B_n^{-1} = n^{-1}\hat H_n^{-1}$ can be calculated recursively, and the inversion of $\hat H_n$ is not necessary at each stage of the recursion (Albert and Gardner (1967, page 111)). However, (4.8) is useful to our study of the asymptotic properties of $\hat r_n$ and $\hat H_n$.
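The following Python sketch combines (4.7) with the rank-one update (4.8); taking $\eta_i$ to be the current iterate, the default $\hat H_0 = I$, and the toy exponential-regression example are our own illustrative assumptions.

```python
import numpy as np

def albert_gardner(y, v, F, h, r1, H0=None):
    """Sketch of the recursion (4.7)-(4.8): recursive nonlinear least squares.

    F(v, w) : regression function, h(v, w) : its gradient in w.
    eta_i is taken to be the current iterate (one admissible choice of eta_i = r_l, l <= i)."""
    p = len(r1)
    r_hat = np.asarray(r1, dtype=float)
    B_inv = np.linalg.inv(np.eye(p) if H0 is None else H0)    # B_0^{-1} = H_0^{-1}
    for n in range(len(y)):
        g = h(v[n], r_hat)                                    # h_{n+1}(eta_{n+1})
        Bg = B_inv @ g
        B_inv = B_inv - np.outer(Bg, Bg) / (1.0 + g @ Bg)     # rank-one update (4.8)
        r_hat = r_hat + B_inv @ g * (y[n] - F(v[n], r_hat))   # update (4.7)
    return r_hat

# Toy usage: exponential regression F(v, r) = r0 * exp(-r1 * v)
rng = np.random.default_rng(2)
r_true = np.array([2.0, 0.5])
v = rng.uniform(0.0, 4.0, size=4000)
F = lambda vi, w: w[0] * np.exp(-w[1] * vi)
h = lambda vi, w: np.array([np.exp(-w[1] * vi), -w[0] * vi * np.exp(-w[1] * vi)])
y = np.array([F(vi, r_true) for vi in v]) + 0.1 * rng.normal(size=v.size)
print(albert_gardner(y, v, F, h, r1=np.array([1.0, 1.0])))    # rough estimate of r_true
```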
First, we list the assumptions we will use. These assumptions are not the weakest possible, since we intend only to illustrate the content of our general results. Let col be the function mapping the set of $p \times p$ matrices into $\mathbb{R}^{p^2}$ by stacking columns one below the other.

C1. The functions $D_1$, $D_2$, $D_3$, and $D_4$ map $\mathbb{R}^q$ into $\mathbb{R}^+$, $\mathbb{R}^{p\times p}$, $\mathbb{R}^{p\times p^2}$, and $\mathbb{R}^{p^2\times p}$, respectively. Let $H$ be a p.d. $p \times p$ matrix. The following hold for all $v$, $w$, and p.d. matrices $M$:

(4.9)    $\|[F(v,w) - F(v,r)]M^{-1}h(v,w) - H^{-1}h(v,w)h'(v,w)(w-r)\| \leq KD_1(v)(\|w-r\|^2 + \|M-H\|^2)$,

(4.10)    $\|M^{-1}h(v,w) - H^{-1}h(v,r) - D_2(v)(w-r) - D_3(v)\operatorname{col}(M-H)\| \leq KD_1(v)(\|w-r\|^2 + \|M-H\|^2)$,

(4.11)    $\|\operatorname{col}[h(v,w)h'(v,w) - h(v,r)h'(v,r)] - D_4(v)(w-r)\| \leq KD_1(v)\|w-r\|^2$.

C2. Let $\xi_n' = (v_n', e_n)$. Suppose $R > 2$ and $\{\xi_n\}$ is strong mixing with mixing numbers $\{\alpha_n\}$ which are of size $-R(R-2)^{-1}$. (McLeish (1975) defines size; here we mention McLeish's remark that $\{\alpha_n\}$ is of size $-q$ if $\{\alpha_n\}$ is monotonely decreasing and $\sum_{n=1}^{\infty}\alpha_n^{\kappa} < \infty$ for some $\kappa < 1/q$.) For all $n$, the marginal distribution of $\xi_n$ satisfies $E(e_n \mid v_n) = 0$ and

(4.12)    $Eh_n(r)h_n'(r) = H$ for all $n$.

Moreover,

$\sup_n E\{D_1^{2R}(v_n) + \|D_2(v_n)\|^{2R} + \|D_3(v_n)\|^{2R} + |e_n|^{2R} + \|h_n(r)\|^{2R} + \|D_4(v_n)\|^{R}\} < \infty$,

and there exists a $p^2 \times p$ matrix $\bar D_4$ such that $ED_4(v_n) = \bar D_4$ for all $n$.

C3. $\hat r_n \to r$ and $\hat H_n \to H$.
REMARKS. Equations (4.9) to (4.11) merely state that certain Taylor series expansions are valid. Assumption C3 can be verified using the results of Ruppert (1981b) to prove that $\hat r_n \to r$, and then using (4.12) and a continuity assumption on $h$ to show that $\hat H_n \to H$.

Using C2 and (2.3) of McLeish (1975) with $p = 2$, one sees that $\{D_1(v_n)\}$, $\{e_nD_2(v_n)\}$, $\{e_nD_3(v_n)\}$, $\{\|e_nD_2(v_n)\|\}$, $\{\|e_nD_3(v_n)\|\}$, $\{D_4(v_n)\}$, and $\{h_n(r)h_n'(r)\}$, when centered at expectations, are each mixingales of size $-\frac{1}{2}$. Moreover, $E(e_nD_2(v_n)) = E(e_nD_3(v_n)) = 0$. Therefore, the following series converge by Corollary 1.8 of McLeish:

$\sum_{1}^{\infty} n^{-1}e_nD_k(v_n)$ for $k = 2, 3$,

$\sum_{1}^{\infty} n^{-1}(D_1(v_n) - ED_1(v_n))$,

$\sum_{1}^{\infty} n^{-(\frac{1}{2}+\delta)}e_nh_n(r)$ for all $\delta > 0$,

and

$\sum_{1}^{\infty} n^{-(\frac{1}{2}+\delta)}(h_n(r)h_n'(r) - H)$ for all $\delta > 0$.
Define

$\hat A_n = \begin{bmatrix} H^{-1}h_n(r)h_n'(r) + e_nD_2(v_n) & e_nD_3(v_n) \\ -D_4(v_n) & I_{p^2}\end{bmatrix}$.

By C1 and C3,

$\begin{bmatrix}\hat r_{n+1} - r \\ \operatorname{col}(\hat H_{n+1} - H)\end{bmatrix} = (I_{p(p+1)} - n^{-1}\hat A_n)\begin{bmatrix}\hat r_n - r \\ \operatorname{col}(\hat H_n - H)\end{bmatrix} - n^{-1}\begin{bmatrix}e_nH^{-1}h_n(r) \\ \operatorname{col}[H - h_n(r)h_n'(r)]\end{bmatrix} + \nu_n$,

where $\|\nu_n\| = O(D_1(v_n)(\|\hat r_n - r\|^2 + \|\hat H_n - H\|^2))$. Therefore, assumptions A1 to A7 hold with

$\hat\theta_n = \begin{bmatrix}\hat r_n \\ \operatorname{col}\hat H_n\end{bmatrix}$,    $\theta = \begin{bmatrix}r \\ \operatorname{col}H\end{bmatrix}$,    $A = \begin{bmatrix}I_p & 0 \\ -\bar D_4 & I_{p^2}\end{bmatrix}$,

and $W_n = \begin{bmatrix}e_nH^{-1}h_n(r) \\ \operatorname{col}[H - h_n(r)h_n'(r)]\end{bmatrix}$. It is easy to see that all eigenvalues of $A$ are 1. Moreover, the $p \times p$ matrix in the top left-hand corner of $t^{A-I}$ is $I_p$. Therefore, we have shown that C1 to C3 imply the existence of an $\varepsilon > 0$ such that

$n^{\frac{1}{2}}(\hat r_{n+1} - r) = -n^{-\frac{1}{2}}\sum_{k=1}^{n}H^{-1}h_k(r)e_k + O(n^{-\varepsilon})$.
REFERENCES

ALBERT, A.E. and GARDNER, L.A., JR. (1967). Stochastic Approximation and Nonlinear Regression. The M.I.T. Press, Cambridge, Mass.

BLUM, JULIUS R. (1954). Multidimensional stochastic approximation methods. Ann. Math. Statist. 25 737-744.

BROWN, B.M. (1971). Martingale central limit theorems. Ann. Math. Statist. 42 59-66.

CAMPBELL, KATHERINE (1979). Stochastic approximation procedures for mixing stochastic processes. Ph.D. Dissertation, The University of New Mexico, Albuquerque, N.M.

CARROLL, RAYMOND J. and RUPPERT, DAVID (1980). Weak convergence of bounded influence estimates with applications. Institute of Statistics Mimeo Series #1285, University of North Carolina, Chapel Hill, N.C.

CHUNG, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25 463-483.

DERMAN, C. and SACKS, J. (1959). On Dvoretzky's stochastic approximation theorem. Ann. Math. Statist. 30 601-605.

DOOB, JOSEPH L. (1953). Stochastic Processes. John Wiley and Sons, New York.

FABIAN, VACLAV (1968). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39 1327-1332.

HAMPEL, FRANK R. (1978). Optimally bounding the gross-error-sensitivity and the influence of position in factor space. Proceedings of the American Statistical Association, Statistical Computing Section.

HIRSCH, MORRIS and SMALE, STEPHEN (1974). Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York.

HUBER, PETER J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.

HUBER, PETER J. (1981). Robust Statistics. John Wiley and Sons, New York.

KIEFER, JACK and WOLFOWITZ, JACOB (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23 462-466.

KUSHNER, H.J. and CLARK, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

LJUNG, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6 680-696.

MCLEISH, D.L. (1975). A maximal inequality and dependent strong laws. Ann. Probab. 3 829-839.

PHAM, TUAN D. and TRAN, LANH T. (1980). The strong mixing property of the autoregressive moving average time series model. Technical Report, Department of Mathematics, Indiana University.

ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22 400-407.

ROBBINS, HERBERT and SIEGMUND, DAVID (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics (J.S. Rustagi, ed.) 233-257. Academic Press, New York.

RUPPERT, DAVID (1978). Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Institute of Statistics Mimeo Series #1203, University of North Carolina, Chapel Hill, N.C. (To appear in Ann. Probab.)

RUPPERT, D. (1979). A new dynamic stochastic approximation procedure. Ann. Statist. 7 1179-1195.

RUPPERT, D. (1981a). Stochastic approximation of an implicitly defined function. To appear in Ann. Statist.

RUPPERT, DAVID (1981b). Convergence of recursive estimators with applications to nonlinear regression. Institute of Statistics Mimeo Series #1333, University of North Carolina, Chapel Hill, N.C.

SACKS, JEROME (1958). Asymptotic distributions of stochastic approximation procedures. Ann. Math. Statist. 29 373-405.

SIELKEN, ROBERT L., JR. (1973). Stopping times for stochastic approximation procedures. Z. Wahrscheinlichkeitstheorie und verw. Gebiete 26 67-75.