CONVERGENCE OF RECURSIVE ESTIMATORS
WITH APPLICATIONS TO NONLINEAR REGRESSION
by
David Ruppert
University of North Carolina

ABSTRACT
Strong convergence of a class of recursive estimators is proven
using a relationship between weak and uniform convergence of probability
measures.
Special attention is paid to recursive nonlinear regression,
where the independent variables and errors may each be dependent
sequences.
AMS 1970 Subject Classification: 62L20.
Key Words and Phrases: Stochastic approximation, recursive estimation, uniform convergence, nonlinear regression, dependent errors.
This research was supported by the National Science Foundation
through Grant MCS78-01240.
1. Introduction.
This paper is concerned with the strong
convergence of recursive estimators which are generalizations of the
Robbins-Monro (1951) stochastic approximation procedure.
Emphasis is
placed on the recursive nonlinear regression estimator developed by
Albert and Gardner (1967).
However, the main result, Theorem 3.1, is
sufficiently general that it should be applicable to other recursive
estimators.
For example, the author intends to use this result to
continue his study (Ruppert (1979, 1981)) of Robbins-Monro type procedures
where the root of the unknown regression function varies with time.
Albert and Gardner (1967) investigated nonlinear regression problems where, for $n = 1, 2, \ldots$, one observes $Y_n$ such that for a known function $F_n$, an unknown vector parameter $\theta$, and a mean-zero random variable $e_n$,

(1.1)    $Y_n = F_n(\theta) + e_n .$

They considered estimators of $\theta$ defined by the recursion

(1.2)    $\hat\theta_{n+1} = \hat\theta_n + a_n \big[ Y_n - F_n(\hat\theta_n) \big] ,$

where $a_n$ is a suitably chosen vector.
Although it is, of course, possible to use nonlinear least-squares methodology here, Albert and Gardner were interested in situations where the $Y_n$'s are observed sequentially, and one needs to rapidly update one's estimate as each new observation arrives. Besides its use for "on-line" estimation, this recursive nonlinear estimator may be useful when handling large data sets and models with large numbers of parameters. Then, because of its recursive nature, the calculation of the estimator has modest storage requirements.
In their study of "optimal" values of $a_n$ in (1.2), Albert and Gardner used a Taylor series linearization together with their calculation of the "optimal" $a_n$ in the linear case, and were led to the choice

(1.3)    $a_n = B_n\, b_n ,$

where $b_n = \dot F_n(\eta_n)$ ($\dot F_n$ is the gradient of $F_n$),

(1.4)    $B_n = \Big( B_0^{-1} + \sum_{j=1}^{n} b_j\, b_j' \Big)^{-1} ,$

and $\eta_n$ is either $\theta_0$, a guessed value of $\theta$, or $\hat\theta_m$ for some $m \le n$. Also, $B_n$ can be calculated recursively and without matrix inversions, except possibly for $B_0$; see their equation (7.45).
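For concreteness, here is a minimal sketch of one step of (1.2)-(1.4) in Python. It is an illustration under stated assumptions, not Albert and Gardner's own implementation: it takes $\eta_n = \hat\theta_n$ and updates $B_n$ by the Sherman-Morrison rank-one identity, which is the standard way to avoid the matrix inversion in (1.4); the function names are hypothetical.

```python
import numpy as np

def albert_gardner_step(theta, B, y, F, gradF):
    """One step of the recursion (1.2) with a_n chosen as in (1.3)-(1.4).

    theta : current estimate, with eta_n taken to be this estimate
    B     : current matrix B_{n-1}
    y     : new observation Y_n
    F     : callable, theta -> F_n(theta)
    gradF : callable, theta -> gradient of F_n at theta
    """
    b = gradF(theta)                       # b_n in (1.3)
    # Sherman-Morrison update: B_n = (B_{n-1}^{-1} + b_n b_n')^{-1},
    # computed without inverting any matrix, cf. their equation (7.45).
    Bb = B @ b
    B = B - np.outer(Bb, Bb) / (1.0 + b @ Bb)
    a = B @ b                              # a_n = B_n b_n, equation (1.3)
    theta = theta + a * (y - F(theta))     # recursion (1.2)
    return theta, B
```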
Albert and Gardner do not actually prove that the algorithm converges to $\theta$ for this value of $a_n$; instead they analyze a different algorithm. Let $P$ be a convex set and let $[\,\cdot\,]_P$ denote the operation of projection into $P$. Then, they find sufficient conditions for $\tilde\theta_n$ defined by

$\tilde\theta_{n+1} = \big[ \tilde\theta_n + a_n ( Y_n - F_n(\tilde\theta_n) ) \big]_P$

to converge when $a_n$ is given by (1.3). One of these conditions is troublesome: $P$ must lie within the ball of radius $R$ centered at $\theta$. The value of $R$ is not given but can be found by examining their proof. When $R$ is small, one will need good prior knowledge of $\theta$. Also, the need to project into $P$ complicates the algorithm, perhaps unnecessarily in most applications.
Theorem 2.4.1 of Kushner and Clark (1978) could be used to show that $\hat\theta_n$ converges, if one could show that $\sup_n \|\hat\theta_n\| < \infty$, but this condition seems difficult to establish.
In this paper, we will suppose that $F_n(\theta) = F(Z_n, \theta)$, where $F$ is a function from $\mathbb{R}^{q-1} \times \mathbb{R}^p$ to $\mathbb{R}$ and $Z_n$ is a known vector. However, we will not require any prior knowledge of $\theta$. Also, we will not require that $e_1, e_2, \ldots$ be independent (or even uncorrelated). Since recursive estimation is often used when the data form a time series, correlated errors should be considered.

The proof of Theorem 3.1 utilizes a result of Ranga Rao (1962) on the relationship between weak and uniform convergence of probability measures. We are able to conclude that certain weighted averages (averaged with respect to $n$) of functions of $Z_n$ and $\theta$ converge uniformly for $\theta$ in $\mathbb{R}^p$ (not just on compact subsets). This may be the first use of uniform convergence of measures in the recursive estimation literature, though, of course, it has been used elsewhere in statistical large sample theory.
2. Notation and assumptions.
All random variables are defined on a probability space $(\Omega, \mathcal{F}, P)$, and all relations between random variables are meant to hold with probability 1. Let $(\mathbb{R}^k, \mathcal{B}^k)$ be $k$-dimensional Euclidean space with the Borel $\sigma$-algebra. All functions which we consider between Euclidean spaces are assumed to be Borel measurable. Let a prime denote matrix transposition. For a real matrix $A$, $\|A\| = (\mathrm{Trace}\ A'A)^{1/2}$. We will need the following assumptions, which are discussed below.
A1. $h(\cdot, \cdot)$, $h_1(\cdot, \cdot)$, and $h_2(\cdot)$ are functions from $\mathbb{R}^p \times \mathbb{R}^q$ to $\mathbb{R}^p$, from $\mathbb{R}^p \times \mathbb{R}^q$ to $\mathbb{R}^{p \times r}$, and from $\mathbb{R}^p$ to $\mathbb{R}^r$, respectively, such that $h(x, \xi) = h_1(x, \xi)\, h_2(x)$ for all $x$ and $\xi$.
A2. For each $n$, $H_n$ is a $p \times p$ positive definite symmetric random matrix. For positive random variables $\underline\lambda \le \bar\lambda$, all eigenvalues of $H_n$ are between $\underline\lambda$ and $\bar\lambda$ for all $n$.
A3. Suppose $\mu$ is a probability measure on $(\mathbb{R}^q, \mathcal{B}^q)$ and $E_\mu$ denotes expectation with respect to $\mu$. Let $\xi_1, \xi_2, \ldots$ be a sequence of random vectors in $\mathbb{R}^q$.

A4. Suppose there exists a constant $c$ such that $E_\mu g = 0$ and $E_\mu g^2 = 1$ implies that

(2.1)    $E \max_{m \le \ell \le n} \Big( \sum_{i=m}^{\ell} c_i\, g(\xi_i) \Big)^2 \le c \sum_{i=m}^{n} c_i^2$

for all $n > m$ and constants $c_m, \ldots, c_n$.
A5. There exists a nonnegative continuous function $h_3$ on $\mathbb{R}^q$ such that (i) $E_\mu h_3^2 < \infty$ and (ii) $\|h_1(x, \xi)\| \le h_3(\xi)$ for all $x$ and $\xi$.

A6. $\{ h_1(x, \cdot) \}_{x \in \mathbb{R}^p}$ is an equicontinuous family on $\mathbb{R}^q$.
REMARKS. In our application to nonlinear regression, $\xi_i$ will be the vector formed from the error and the independent variables from the $i$th observation. We are assuming the $\xi_i$ are random, but of course by including degenerate distributions this allows the possibility that the independent variables are chosen by design. If $\xi_1, \xi_2, \ldots$ is a random sequence with a stationary marginal distribution $\mu$, then we can verify A4 in the case of independence using Kolmogorov's inequality, and for weak dependence by a theorem of McLeish (1975, Theorem 1.6).
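To spell out the independent case (a step not carried out in the text): if the $\xi_i$ are independent, each with distribution $\mu$, and $E_\mu g = 0$, $E_\mu g^2 = 1$, then $S_\ell = \sum_{i=m}^{\ell} c_i\, g(\xi_i)$, $m \le \ell \le n$, is a mean-zero martingale, and the $L^2$ maximal inequality (Kolmogorov's inequality in its Doob form) gives (2.1) with $c = 4$:

$$E \max_{m \le \ell \le n} S_\ell^2 \;\le\; 4\, E S_n^2 \;=\; 4 \sum_{i=m}^{n} c_i^2\, E_\mu g^2 \;=\; 4 \sum_{i=m}^{n} c_i^2 .$$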
In Lemma 4.1, A4 is verified when the errors are iid and the independent variables are nonrandom (have degenerate distributions) and are periodic; that is, for some $M$, the independent variables of the $n$th and $m$th observations are equal if $n = m$ modulo $M$. Thus, the independent variables are selected by a design which repeats a set of $M$ (not necessarily all distinct) values. Then $\mu$ is the product of the measure placing mass $M^{-1}$ on each of these values and the error distribution.
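As an illustration of this construction (a sketch, not from the paper; the standard normal error law is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu(design_points, n_draws):
    """Draw from mu = (mass 1/M on each design value) x (error law).

    design_points : array of shape (M, q-1), the M repeated design values
    Returns draws xi' = (z', e) of shape (n_draws, q).
    """
    a = np.asarray(design_points)
    j = rng.integers(0, len(a), size=n_draws)  # mass M^{-1} on each value
    e = rng.standard_normal(n_draws)           # iid errors (illustrative)
    return np.column_stack([a[j], e])
```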
The decomposition of $h$ into $h_1$ and $h_2$ allows some flexibility in the application of the result of Ranga Rao (1962) on uniform convergence.

NOTATION. Define $\bar h_1(x) = \int h_1(x, \xi)\, d\mu(\xi)$ and $\bar h(x) = \bar h_1(x)\, h_2(x) = \int h(x, \xi)\, d\mu(\xi)$.
A7. $h_2$ is bounded in a neighborhood of 0.

A8. For all $\varepsilon > 0$,

$\inf_{\|x\| > \varepsilon} \min\big\{ \|\bar h_1(x)\|, \|\bar h(x)\| \big\} > 0 .$

A9. Suppose that $\bar h$ is the gradient of $V$, (i)

$\inf_{\|x\| > \varepsilon} \big( V(x) - V(0) \big) > 0$ for all $\varepsilon > 0$,

and (ii) with $\ddot V$ the Hessian of $V$,

$\|\ddot V(x)\| \le M$ for some $M$ and all $x .$

REMARK. The assumption of a bounded Hessian is common in the literature of multidimensional stochastic approximation. See e.g. Fabian (1971).
A10. There exists a nonnegative function $h_4$ on $\mathbb{R}^q$ such that $E_\mu h_4^2 < \infty$ and, for all $\xi$, $x$, and $x'$,

$\|h(x, \xi) - h(x', \xi)\| \le h_4(\xi)\, \|x - x'\| .$
NOTATION.
Cl
> 2, define n(k) to be the integer part of
kCl , and define p = \n(k+1)-1 i-I
k
li=n(k)
•
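Numerically (an illustration only), $p_k = \sum_{i=n(k)}^{n(k+1)-1} i^{-1} \approx \alpha \log((k+1)/k) \approx \alpha/k$, which is the order used repeatedly below:

```python
import numpy as np

def blocks(alpha, k_max):
    """Return n(k) = integer part of k**alpha and the weights p_k."""
    n = [int(k ** alpha) for k in range(1, k_max + 2)]
    p = [np.sum(1.0 / np.arange(n[k - 1], n[k]))  # sum over the k-th block
         for k in range(1, k_max + 1)]
    return n, p

n, p = blocks(alpha=2.5, k_max=8)
# p[k-1] is close to 2.5/k for moderately large k.
```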
A11. $\sup_{n(k) \le \ell \le n(k+1)-1} \|H_\ell - H_{n(k)}\| = o(1)$ as $k \to \infty .$

3. General results.
LEMMA 3.1. Suppose A3 and A4 hold. For $\ell$ in $\{n(k), \ldots, n(k+1)-1\}$, define the random probability measure $\mu_{k,\ell}$ by

$\mu_{k,\ell}(A) = p_k^{-1} \Big\{ \sum_{i=n(k)}^{\ell} i^{-1}\, I(\xi_i \in A) + \sum_{i=\ell+1}^{n(k+1)-1} i^{-1}\, \mu(A) \Big\}$

for $A$ in $\mathcal{B}^q$. (Here $I(B)$ is the indicator function of the set $B$.) Then $\mu_{k,\ell}$ converges weakly to $\mu$ as $n(k) + \ell \to \infty$. (The indices $(k, \ell)$ can be ordered into a sequence according to the magnitude of $n(k) + \ell$.)
PROOF. Suppose $\int g^2\, d\mu < \infty$. Then by A4,

$E \max_{n(k) \le \ell \le n(k+1)-1} \Big( \int g\, d\mu_{k,\ell} - \int g\, d\mu \Big)^2 = O\Big( p_k^{-2} \sum_{i=n(k)}^{n(k+1)-1} i^{-2} \Big) = O(k^{1-\alpha}) .$

Thus, since $\alpha > 2$,

$\sum_{k=1}^{\infty} E \max_{n(k) \le \ell \le n(k+1)-1} \Big( \int g\, d\mu_{k,\ell} - \int g\, d\mu \Big)^2 < \infty ,$

whence

$\max_{n(k) \le \ell \le n(k+1)-1} \Big| \int g\, d\mu_{k,\ell} - \int g\, d\mu \Big| \to 0$ as $k \to \infty .$

By Theorem 6.6 of Parthasarathy (1967), there exists a sequence of bounded uniformly continuous functions $g_1, g_2, \ldots$ such that, for any measures $\{\nu_n\}_{n=1}^{\infty}$ and $\nu$ on $(\mathbb{R}^q, \mathcal{B}^q)$, we have $\nu_n \to \nu$ weakly if and only if $\int g_\ell\, d\nu_n \to \int g_\ell\, d\nu$ as $n \to \infty$ for each $\ell$. The lemma follows. □

LEMMA 3.2. As $k \to \infty$,

$\sup_{x \in \mathbb{R}^p}\ \max_{n(k) \le \ell \le n(k+1)-1} \Big\| p_k^{-1} \sum_{i=n(k)}^{\ell} i^{-1} \big( h_1(x, \xi_i) - \bar h_1(x) \big) \Big\| \to 0 .$

PROOF. The lemma follows from A4, A5, A6, Lemma 3.1, and Theorem 3.2 of Ranga Rao (1962). □
THEOREM 3.1. Under A1 to A11, $x_n \to 0$, where $x_n$ is defined by the recursion

$x_{n+1} = x_n - n^{-1} H_n\, h(x_n, \xi_n) .$

PROOF. For $n(k) + 1 \le \ell \le n(k+1) - 1$,
$x_{\ell+1} = x_{n(k)} - \sum_{i=n(k)}^{\ell} i^{-1} H_i\, \bar h(x_{n(k)}) - \sum_{i=n(k)}^{\ell} i^{-1} H_{n(k)} \big( h(x_{n(k)}, \xi_i) - \bar h(x_{n(k)}) \big)$

(3.1)    $\quad - \sum_{i=n(k)}^{\ell} i^{-1} (H_i - H_{n(k)}) \big( h(x_{n(k)}, \xi_i) - \bar h(x_{n(k)}) \big) - \sum_{i=n(k)}^{\ell} i^{-1} H_i \big( h(x_i, \xi_i) - h(x_{n(k)}, \xi_i) \big)$

$= x_{n(k)} - R_{k,\ell} - S_{k,\ell} - T_{k,\ell} - U_{k,\ell} ,$
say. By A2,

(3.2)    $\|R_{k,\ell}\| \le \bar\lambda \Big( \sum_{i=n(k)}^{\ell} i^{-1} \Big) \|\bar h(x_{n(k)})\| \le \bar\lambda\, p_k\, \|\bar h(x_{n(k)})\|$

and

(3.3)    $\bar h(x_{n(k)})'\, R_{k,\ell} \ge \underline\lambda \Big( \sum_{i=n(k)}^{\ell} i^{-1} \Big) \|\bar h(x_{n(k)})\|^2 .$

By A2 and Lemma 3.2,

(3.4)    $\max_{n(k) \le \ell \le n(k+1)-1} \|S_{k,\ell}\| = o\big( p_k\, \|h_2(x_{n(k)})\| \big) .$

By A11,

$\|T_{k,\ell}\| = o\Big( \sum_{i=n(k)}^{n(k+1)-1} i^{-1} \big( \|h(x_{n(k)}, \xi_i)\| + \|\bar h(x_{n(k)})\| \big) \Big) .$

Moreover, by A1, A4, and A5,

$\sum_{i=n(k)}^{n(k+1)-1} i^{-1} \|h(x_{n(k)}, \xi_i)\| \le \|h_2(x_{n(k)})\| \sum_{i=n(k)}^{n(k+1)-1} i^{-1} h_3(\xi_i) = O\big( p_k\, \|h_2(x_{n(k)})\| \big)$

and $\|\bar h(x_{n(k)})\| \le \|\bar h_1(x_{n(k)})\|\, \|h_2(x_{n(k)})\| = O\big( \|h_2(x_{n(k)})\| \big)$. Therefore,

(3.5)    $\|T_{k,\ell}\| = o\big( p_k\, \|h_2(x_{n(k)})\| \big) .$

Next, by A4 and A10,

(3.6)    $\|U_{k,\ell}\| = O\Big( p_k \max_{n(k) \le i \le \ell-1} \|x_i - x_{n(k)}\| \Big) .$

By (3.1), (3.2), and (3.4)-(3.6),

$\|x_{\ell+1} - x_{n(k)}\| \le \gamma_k\, p_k \Big( \|\bar h(x_{n(k)})\| + \|h_2(x_{n(k)})\| + \max_{n(k) \le i \le \ell} \|x_i - x_{n(k)}\| \Big) ,$

where $\gamma_k = O(1)$. Then by induction,

(3.7)    $\max_{n(k) \le \ell \le n(k+1)-1} \|x_\ell - x_{n(k)}\| \le \gamma_k\, p_k\, (1 - \gamma_k p_k)^{-1} \big( \|\bar h(x_{n(k)})\| + \|h_2(x_{n(k)})\| \big)$

for all $k$ so large that $\gamma_k p_k < 1$. By (3.6) and (3.7),

$\max_{n(k) \le \ell \le n(k+1)-1} \|U_{k,\ell}\| = O\Big( p_k^2 \big( \|\bar h(x_{n(k)})\| + \|h_2(x_{n(k)})\| \big) \Big) .$
Now (3.1), (3.4), (3.5), and (3.7) imply that

$x_{n(k+1)} = x_{n(k)} - R_{k, n(k+1)-1} + o\big( p_k ( \|\bar h(x_{n(k)})\| + \|h_2(x_{n(k)})\| ) \big) .$

By A5, $\bar h_1(x)$ is bounded, so by A9(ii) there exist $\varepsilon_k \to 0$ such that

$V(x_{n(k+1)}) \le V(x_{n(k)}) - \underline\lambda\, p_k\, \|\bar h(x_{n(k)})\|^2 + \varepsilon_k\, p_k\, \|h_2(x_{n(k)})\|^2 .$

Now choose $\beta_k \to 0$ such that $\varepsilon_k / \beta_k \to 0$ and $\sum_k p_k \beta_k = \infty$; then choose $\alpha_k \to 0$ such that $\|x\| > \alpha_k$ implies that $\|\bar h_1(x)\|^2 \ge \beta_k$ and $\|\bar h(x)\|^2 \ge \beta_k$; and finally find $\gamma_k \to 0$ such that $\|x\| > \alpha_k$ implies $V(x) \ge \gamma_k + V(0)$. This can be done since $p_k \ge c k^{-1}$, and by A7 and A8. Then, for $k$ sufficiently large, Lemma 1 of Derman and Sacks (1959) applies and implies that $V(x_{n(k)}) \to V(0)$; then $x_{n(k)} \to 0$ by A9(i), whence $x_n \to 0$ by (3.7). □
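To illustrate the conclusion of Theorem 3.1 (a toy sketch only; $h$, $H_n$, and the law of the $\xi_n$ are arbitrary choices, with $\bar h(x) = x$ and $V(x) = \|x\|^2/2$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance of x_{n+1} = x_n - n^{-1} H_n h(x_n, xi_n) with
# h(x, xi) = x - xi, xi_n iid N(0, I), H_n = I; then hbar(x) = x,
# V(x) = ||x||^2 / 2, and the theorem's conclusion is x_n -> 0.
p = 3
x = rng.normal(size=p)            # arbitrary x_1
for n in range(1, 200_000):
    xi = rng.normal(size=p)
    x -= (x - xi) / n             # one step of the recursion
print(np.linalg.norm(x))          # should be near 0
```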
4. Application to nonlinear regression.
In this section, we prove consistency of $\hat\theta_n$ under the following assumptions:

B1. Suppose (1.1) holds with $F_n(\theta) = F(Z_n, \theta)$, where $F$ is a known function on $\mathbb{R}^{q-1} \times \mathbb{R}^p$ and the "independent variable," $Z_n$, is a known element of $\mathbb{R}^{q-1}$.

B2. Suppose $b(Z, \theta)$ is the gradient of $F(Z, \theta)$ with respect to $\theta$ and $b_n(\theta) = b(Z_n, \theta)$. Let $H_n$ be a sequence of positive definite, symmetric matrices satisfying A2 and A11. Suppose (1.2) holds with $a_n = n^{-1} H_n b_n .$
B3. Let $\mu = \mu_1 \times \mu_2$, where $\mu_1$ and $\mu_2$ are probability measures on $\mathbb{R}^{q-1}$ and $\mathbb{R}$ respectively. Define $\xi_n' = (Z_n', e_n)$. Assume that $\mu$ and $\xi_1, \xi_2, \ldots$ satisfy A4.
NOTATION. Define

(4.1)    $h(z, e, x) = \big( F(z, x) - F(z, \theta) - e \big)\, b(z, x) ,$

$V(x) = \tfrac{1}{2} \int \big( F(z, x) - F(z, \theta) - e \big)^2\, d\mu(z, e) ,$

and

$\bar h(x) = \int h(z, e, x)\, d\mu(z, e) .$

B4. Assume that $\bar h(x)$ is the gradient of $V(x)$, i.e. that the RHS of (4.1) can be differentiated under the integral sign. Also assume that $h$, $V$, and $\mu$ satisfy A1 and A5 to A10, and that for all $x$, $\|b(z, x)\| \le h_5(z)$ for a nonnegative function $h_5$ such that

(4.2)    $E_{\mu_1} h_5^4 < \infty .$
REMARKS. It is desirable to know when A2 and A11 hold if $H_n = n B_n$ and $B_n$ is defined by (1.4). Define

$H(x) = \int b(z, x)\, b(z, x)'\, d\mu_1(z) .$

Suppose there exist positive constants $\underline\lambda$ and $\bar\lambda$ such that, for all $x$, all eigenvalues of $H(x)$ lie between $\underline\lambda$ and $\bar\lambda$. Suppose, also, that $\|b(z, x)\|^4 \le h_6(z)$, where $\int h_6(z)\, d\mu_1 < \infty$, that $\{ b(\cdot, x)\, b(\cdot, x)' : x \in \mathbb{R}^p \}$ is an equicontinuous family on $\mathbb{R}^{q-1}$, and that $\eta_\ell$ (as in equation (1.4)) is equal to $x_{n(k)}$ if $\ell$ is in $\{n(k), \ldots, n(k+1)-1\}$. Then, using a proof like that of Lemma 3.2, one can show that A2 and A11 hold. In the special case of linear regression, $b(z, x)$ does not depend upon $x$, so the value of $\eta_\ell$ is not relevant.
When $Z_n$ is a nondegenerate random vector, then B3 essentially implies that $Z_n$ and $e_n$ are independent for each $n$, a condition typically used in regression analysis. Also, A4 can often be verified using the results mentioned at the end of Section 2. If the $Z_n$ are degenerate random vectors, then the next lemma may be useful in the verification of A4. It covers the situation where the errors are iid and either (i) the independent variables are selected by repeating some finite design or (ii) the $Y_n$ form a time series with a periodic mean function.
LEMMA 4.1. Let $M$ be a positive integer and let $a_1, \ldots, a_M$ be elements of $\mathbb{R}^{q-1}$. Suppose $Z_n = a_j$ if $n = j$ modulo $M$. Let $\{e_n\}_{n=1}^{\infty}$ be iid random variables. Define the probability measure $\mu$ on $\mathbb{R}^q$ by

$\mu(A) = M^{-1} \sum_{j=1}^{M} P\big( (a_j', e_1)' \in A \big) ,$

i.e. $\mu$ is the product of the marginal distribution of $e_1$ with counting measure on $\{a_1, \ldots, a_M\}$ normalized to be a probability. Suppose $E g^2(a_j, e_1) < \infty$ for $j = 1, \ldots, M$, where $g$ is a function from $\mathbb{R}^q$ to $\mathbb{R}$. Then $\xi_n' = (Z_n', e_n)$ and $\mu$ satisfy (2.1).

PROOF. Let $I_{n,j}$ be 0 or 1 according to whether $n = j$ modulo $M$ or not. Then, for $\ell$ in $\{n(k), \ldots, n(k+1)-1\}$,

$\Big| \sum_{i=n(k)}^{\ell} i^{-1} \big( g(Z_i, e_i) - E_\mu g \big) \Big| \le \sum_{j=1}^{M} |R_{k,\ell,j}| + |S_{k,\ell}| ,$

where

$R_{k,\ell,j} = \sum_{i=n(k)}^{\ell} i^{-1} \big( g(a_j, e_i) - \bar g(a_j) \big) I_{i,j} ,$

with $\bar g(a_j) = E g(a_j, e_1)$, and

$S_{k,\ell} = \sum_{j=1}^{M} \sum_{i=n(k)}^{\ell} i^{-1} \big( \bar g(a_j) - E_\mu g \big) I_{i,j} .$

Then

$\max_{n(k) \le \ell \le n(k+1)-1} S_{k,\ell}^2 = O\Big( \sum_{i=n(k)}^{n(k+1)-1} i^{-2} \Big)$

by straightforward approximations, and

$E\Big( \max_{n(k) \le \ell \le n(k+1)-1} R_{k,\ell,j}^2 \Big) = O\Big( \sum_{i=n(k)}^{n(k+1)-1} i^{-2} \Big)$

by Kolmogorov's inequality. □
REMARK. The requirement that $\{e_n\}$ be iid could be weakened considerably, but we will not pursue this matter here.

THEOREM 4.2. Under B1 to B4, $\hat\theta_n \to \theta$.

PROOF. Without loss of generality, we may take $\theta = 0$. Then the theorem follows from Theorem 3.1. □
EXAMPLE (linear regression). Suppose $Y_i = Z_i' \theta + e_i$, and suppose $\xi_n' = (Z_n', e_n)$ satisfies B3. Define

$\Sigma = \int z z'\, d\mu_1(z) .$

Then let $r = 2$,

$V(x) = \tfrac{1}{2}\, (x - \theta)'\, \Sigma\, (x - \theta) ,$

$h(z, e, x) = \big( z'(x - \theta) - e \big) z ,$

$\bar h(x) = \Sigma\, (x - \theta) ,$

$h_1(z, e, x) = \big[ \min\{ \|x - \theta\|^{-1}, 1 \}\, z z' (x - \theta) ,\ -z e \big] ,$

$h_2(x) = \big[ \max( \|x - \theta\|, 1 ) ,\ 1 \big]' ,$

$h_3(z, e) = \big( \|z\|^4 + \|z\|^2 e^2 \big)^{1/2} ,$

and

$h_4(z, e) = \|z\|^2 .$

One may check that Theorem 4.2 applies here. □
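A simulation sketch of this example (illustrative only; the design and error laws are arbitrary choices, and $a_n = n^{-1} H_n b_n$ with $H_n = n B_n$ reduces to $a_n = B_n b_n$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Recursive linear regression: Y_n = Z_n' theta + e_n, estimated by
# (1.2) with b_n = Z_n and B_n maintained by rank-one updates as in (1.4).
p = 2
theta = np.array([1.0, -2.0])     # true parameter (simulation only)
est = np.zeros(p)                 # theta_hat_1, no prior knowledge of theta
B = np.eye(p)                     # B_0
for n in range(1, 50_000):
    z = rng.normal(size=p)        # independent variables Z_n
    y = z @ theta + rng.normal()  # observation Y_n
    Bz = B @ z
    B -= np.outer(Bz, Bz) / (1.0 + z @ Bz)    # B_n from B_{n-1}
    est += (B @ z) * (y - z @ est)            # recursion (1.2)
print(est)                        # close to (1, -2) for large n
```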
REFERENCES

ALBERT, A.E. and GARDNER, L.A., JR. (1967). Stochastic Approximation and Nonlinear Regression. The M.I.T. Press, Cambridge, Mass.

DERMAN, C. and SACKS, J. (1959). On Dvoretzky's stochastic approximation theorem. Ann. Math. Statist. 30 601-605.

FABIAN, V. (1971). Stochastic approximation. In Optimizing Methods in Statistics (J.S. Rustagi, ed.) 439-470. Academic Press, New York.

KUSHNER, H.J. and CLARK, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

MCLEISH, D.L. (1975). A maximal inequality and dependent strong laws. Ann. Prob. 3 829-839.

PARTHASARATHY, K.R. (1967). Probability Measures on Metric Spaces. Academic Press, New York.

RANGA RAO, R. (1962). Relations between weak and uniform convergence of measures with applications. Ann. Math. Statist. 33 659-680.

ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22 400-407.

RUPPERT, D. (1978). Stochastic approximation of an implicitly defined function. Institute of Statistics Mimeo Series #1164, University of North Carolina at Chapel Hill. (To appear in Ann. Statist.)

RUPPERT, D. (1979). A new dynamic stochastic approximation procedure. Ann. Statist. 7 1179-1195.