A NEW MULTIVARIATE ROBBINS-MONRO PROCEDURE

Running Title: Multivariate Robbins-Monro

David Ruppert (1)

AMS Subject Classification: 62L20

KEY WORDS AND PHRASES: Root finding, stochastic approximation.

(1) David Ruppert is Associate Professor, Department of Statistics, University
of North Carolina, Chapel Hill, NC 27514. This research was supported by
National Science Foundation Grant MCS-8100748.
ABSTRACT

Suppose that f is a function from R^k to R^k and for some θ, f(θ) = 0.
Initially f is unknown, but for any x in R^k we can observe a random vector
Y(x) with expectation f(x). The unknown θ can be estimated recursively by
Blum's (1954) multivariate version of the Robbins-Monro procedure. Blum's
procedure requires the rather restrictive assumption that the infimum of the
inner product (x-θ)^t f(x) over any compact set not containing θ be positive;
thus at each x, f(x) gives information about the direction towards θ. Blum's
recursion is X_{n+1} = X_n - a_n Y_n, where the conditional expectation of
Y_n given X_1, ..., X_n is f(X_n) and a_n > 0. Unlike Blum's method, the
procedure introduced in this paper does not attempt to move directly towards
θ when updating X_n to X_{n+1}. Rather, except for random fluctuations it
moves in a direction which decreases ||f||^2, and it may follow a very
circuitous route to θ. Consequently, it does not require that (x-θ)^t f(x)
have a constant signum. This new procedure is somewhat similar to the
multivariate Kiefer-Wolfowitz procedure applied to ||f||^2, but unlike the
latter it converges to θ at rate n^{-1/2}.
1. INTRODUCTION
This paper is concerned with a multivariate version of a problem first
studied by Robbins and Monro (1951). Suppose that f is an unknown function
from R^k to R^k, and that for any x in R^k we can observe a random vector
Y(x) with expectation f(x). Let a in R^k be known, and suppose there is a
unique θ such that f(θ) = a. The goal is to estimate θ.

For example, suppose that k = 2 and x gives the doses of two drugs that
affect blood chemistry. The concentrations of two chemicals in the blood are
measured after administration of the drugs, and f(x) gives the expected
concentrations as a function of the doses. If a gives the ideal
concentrations of the two blood components, then θ gives the correct doses.

By choosing the appropriate measurement scales, we can without loss of
generality assume that a = 0.
Blum's (1954) version of the Robbins-Monro (RM) process begins with an
initial estimate X_1 of θ. Given X_1, ..., X_n, one observes Y_n such that
E^{F_n} Y_n = f(X_n), where F_n is the σ-algebra generated by X_1, ..., X_n.
Then X_n is updated by the recursion

(1.1)    X_{n+1} = X_n - a_n Y_n,

where a_n is a suitably chosen positive sequence converging to 0. Convergence
of X_n to θ is proved under the assumption that for each ε > 0
(1.2)    inf{(x-θ)^t f(x) : ε ≤ ||x-θ|| ≤ ε^{-1}} > 0.

The importance of (1.2) can be easily seen as follows. From (1.1) and the
fact that E^{F_n} Y_n = f(X_n), we have

    E^{F_n} ||X_{n+1} - θ||^2 = ||X_n - θ||^2 - 2 a_n (X_n - θ)^t f(X_n)
                                + a_n^2 E^{F_n} ||Y_n||^2.

If we could ignore the term of order a_n^2, then ||X_n - θ||^2 would be a
positive supermartingale and would converge a.s. The term of order a_n^2 can
be handled using a theorem on "almost" positive supermartingales (Robbins and
Siegmund, 1971), which also can be used to show that

(1.3)    Σ_{n=1}^∞ a_n (X_n - θ)^t f(X_n)

converges a.s. The sequence {a_n} is chosen so that

(1.4)    Σ_{n=1}^∞ a_n = ∞,

and (1.2)-(1.4) imply that X_n → θ a.s.
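To make the recursion concrete, here is a minimal simulation sketch of
Blum's procedure (our illustration, not code from the paper; the regression
function, step-size constant, and noise level are assumptions chosen for the
example):

    import numpy as np

    rng = np.random.default_rng(0)

    def blum_rm(f, x1, n_steps, a=1.0, noise_sd=1.0):
        # Run (1.1) with a_n = a/n; y is an unbiased observation of f(X_n).
        x = np.asarray(x1, dtype=float)
        for n in range(1, n_steps + 1):
            y = f(x) + noise_sd * rng.standard_normal(x.shape)
            x = x - (a / n) * y
        return x

    # A case where (1.2) holds: f(x) = x - theta with theta = (1, 2).
    theta = np.array([1.0, 2.0])
    print(blum_rm(lambda x: x - theta, x1=[0.0, 0.0], n_steps=20000))  # ~ theta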
Authors using condition (1.2) or something quite similar include Sacks
(1958, assumption (A1*)), Schmetterer (1969, assumption (4.15)), and Walk
(1977, assumption (2a)). (Walk's paper concerns the RM process in general
Hilbert spaces.) Blum's original work uses assumptions stronger than (1.2).

Unfortunately, (1.2) is a rather restrictive assumption, implying that at
each x, f(x) "points away from θ". Clearly we can replace (1.2) by

(1.2')   inf{-(x-θ)^t f(x) : ε ≤ ||x-θ|| ≤ ε^{-1}} > 0;

this requires only that we change (1.1) to X_{n+1} = X_n + a_n Y_n. An
example of a function satisfying neither (1.2) nor (1.2') is
f(x_1, x_2) = (x_1^2, x_2^2)^t.
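A two-line check (our illustration) confirms that this f is a counterexample:
here (x-θ)^t f(x) = x_1^3 + x_2^3 takes both signs, so neither infimum
condition can hold.

    import numpy as np

    f = lambda x: np.array([x[0] ** 2, x[1] ** 2])   # theta = 0 is the unique root
    for x in (np.array([1.0, 1.0]), np.array([-1.0, -1.0])):
        print(x, x @ f(x))   # prints 2.0 and -2.0: opposite signs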
l
An alternative to the multivariate RM procedure would be to apply the
multivariate Kiefer-Wolfowitz (KW) procedure to minimize II f(x)
11
2
.
(For the
moment assume that the conditional variance of Y(x) is independent of x, so
F
2
. 2
that E nil y II
= II f(X )11 + constant. Otherwise, the KW procedure will
n
n
-3-
find the minimizer of
Ell
Y(x) II
2
.
not necessarily the solution to f
,
.
II
The KW procedure will tend to follow the negative gradient of
0.)
(x)
')
f(x)
II '-
Al though it may not move from X directly towards 8, except for random
n
fluctuations, it does move downhill.
Therefore, under mild conditions X will
n
converge to a local minimum of II f (x) 11
then X
n
-+
2
.
If 8 is the only local minimum,
O.
Unfortunately, the rate of convergence to θ of the KW method is slower than
the n^{-1/2} rate of the RM method, though modifications of the
Kiefer-Wolfowitz method can produce rates arbitrarily close to n^{-1/2} if f
has derivatives of sufficiently high order (Fabian, 1971).

In this paper, we propose a new multivariate RM process which in some ways
behaves as the Kiefer-Wolfowitz method applied to ||f||^2, but which
possesses the n^{-1/2} rate of convergence even when f has only two
derivatives.
Let D(x) be the k x k derivative of f(x). Then, the derivative of ||f(x)||^2
is 2 D^t(x) f(x), and the Hessian matrix of ||f(x)||^2 is

    H(x) = 2 [D^t(x) D(x) + Σ_{i=1}^k H^{(i)}(x) f_i(x)],

where f_i is the ith coordinate of f and H^{(i)} is the Hessian of f_i. Thus

    H(θ) = 2 D^t(θ) D(θ).

The recursive procedure that is introduced here is

    X_{n+1} = X_n - a n^{-1} B_n D_n^t f̄_n,

where a > 0, B_n is an estimate of [D^t(θ) D(θ)]^{-1}, D_n is an estimate of
D(X_n), and f̄_n is an estimate of f(X_n).

Blum's multivariate RM procedure uses one observation to construct f̄_n and
thereby to update X_n to X_{n+1}. Our procedure uses (2k)(m_n) observations
to construct D_n and [n^γ] observations to construct f̄_n, where [·] is the
greatest integer function, γ > 0, m_n → ∞, and m_n n^{-γ} → 0. We let
m_n → ∞ sufficiently fast (see below), so that the conditional variance of
D_n, given X_1, ..., X_n, converges to 0. We require that m_n n^{-γ} → 0 so
that among the totality of observations used in constructing both D_n and
f̄_n, the proportion used in estimating D_n converges to 0. These properties
insure full asymptotic efficiency (see section 5).
The estimator f̄_n is simply the mean of [n^γ] observations with conditional
expectation, given X_1, ..., X_n, equal to f(X_n). The ith column of D_n is
constructed as follows. Let e(i) be the ith column of the k x k identity
matrix. Let Ȳ(n,i,2) and Ȳ(n,i,1) each be the mean of m_n observations with
conditional expectation equal to f(X_n + c_n e(i)) and f(X_n - c_n e(i)),
respectively. Then, the ith column of D_n is

    D_n(i) = [Ȳ(n,i,2) - Ȳ(n,i,1)] / (2 c_n).

We choose m_n and c_n so that c_n → 0 and m_n^{-1} c_n^{-2} → 0. With this
choice of c_n and m_n, the bias of D_n as an estimate of D(X_n) converges to
0, and as mentioned above the conditional variance of D_n also converges to
0.

B_n is constructed as follows. Let η_n and η̄_n be positive sequences such
that η_n → 0, η̄_n ↑ ∞, and certain other conditions (see section 2) are
met. Let

    C_n = (n-1)^{-1} Σ_{i=1}^{n-1} D_i^t D_i.

Let B_n = C_n^{-1} if all eigenvalues of C_n^{-1} lie between η_n and η̄_n.
Otherwise let B_n be some symmetric matrix whose eigenvalues are all between
η_n and η̄_n.
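The following sketch assembles these pieces into one iteration of the
procedure (our reading of the construction above; `observe`, the eigenvalue
clipping used to pick the symmetric surrogate for B_n, and all variable names
are our own choices, not the paper's):

    import numpy as np

    def one_iteration(x, n, observe, k, a, gamma, c_n, m_n, eta_lo, eta_hi, DtD_sum):
        # B_n is built from D_1, ..., D_{n-1} only, so it is F_n-measurable.
        if n > 1:
            C_inv = np.linalg.inv(DtD_sum / (n - 1))     # C_n^{-1}
            w, V = np.linalg.eigh(C_inv)
            # Use C_n^{-1} itself when its eigenvalues lie in [eta_lo, eta_hi];
            # clipping them is one admissible symmetric surrogate otherwise.
            B = (V * np.clip(w, eta_lo, eta_hi)) @ V.T
        else:
            B = np.eye(k)                                 # arbitrary symmetric start

        # f-bar_n: mean of [n^gamma] observations taken at X_n.
        fbar = np.mean([observe(x) for _ in range(max(1, int(n ** gamma)))], axis=0)

        # D_n: central differences, each column averaging m_n draws per side.
        D = np.empty((k, k))
        for i in range(k):
            e = np.zeros(k); e[i] = 1.0
            y_plus = np.mean([observe(x + c_n * e) for _ in range(m_n)], axis=0)
            y_minus = np.mean([observe(x - c_n * e) for _ in range(m_n)], axis=0)
            D[:, i] = (y_plus - y_minus) / (2.0 * c_n)

        x_new = x - (a / n) * (B @ (D.T @ fbar))          # the update given above
        return x_new, DtD_sum + D.T @ D                   # D_n^t D_n enters C_{n+1}

In a full run, the caller would supply sequences c_n, m_n, η_n, η̄_n meeting
the rate conditions of section 2.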
The procedure of this paper bears some resemblance to the one-dimensional
Venter (1967) procedure. It was shown by Chung (1954) that the univariate
Robbins-Monro procedure is asymptotically optimal when a_n = 1/(n f'(θ)). Of
course, f'(θ) will typically be unknown. Venter introduced a consistent
estimate b_n of f'(θ) and showed that asymptotic optimality could be
achieved with a_n = 1/(n b_n).

Our procedure also estimates f', but at each X_n, not simply at θ. Moreover,
the Venter process and the original RM process are consistent under roughly
the same circumstances. Our procedure is an attempt to improve the
consistency properties of Blum's multivariate RM process. The matrix
sequence B_n does, however, play a role for our process which is analogous
to that of b_n in the Venter process.
It should be mentioned that Blum's version of the RM process may be
preferable to the one introduced here under certain circumstances, namely
when (1.2) is known to hold, or (1.2') is known to hold and D^t(x) f(x) = 0
for some x ≠ θ. However, when neither (1.2) nor (1.2') hold, our procedure,
but not necessarily Blum's, at least tends to move in a direction decreasing
||f(x)||^2.

2. NOTATION AND ASSUMPTIONS
Let R^k be k-dimensional Euclidean space. If A is a k x ℓ matrix, let A_{ij}
be the (i,j)th entry of A, and if ℓ = 1 let A_i = A_{i1}. Also, let A^t be
the transpose of A, let

    ||A||^2 = Σ_{i=1}^k Σ_{j=1}^ℓ (A_{ij})^2,

and let A^{-t} = (A^{-1})^t.

All random variables are defined on the same probability space, and all
relations between random variables are meant to hold with probability one.
If z_1, ..., z_n are random variables, then σ(z_1, ..., z_n) is the
σ-algebra that they generate. If F is a σ-algebra, then E^F X and Var^F X
are, respectively, the conditional mean and variance matrix of the random
vector X. If X is a random matrix, then Var^F X is the conditional variance
matrix of X arranged as a column vector. →^L denotes convergence in law. We
say that a_n ~ b_n if a_n / b_n → 1. The "O" and "o" notation has its usual
meaning.
The following assumptions on f will be needed.
F1. (i) f = (f_1, ..., f_k)^t is a twice differentiable function from R^k
to R^k.

    (ii) D(x) is the derivative of f, i.e., D_{ij}(x) = (∂/∂x_j) f_i(x).

    (iii) f(θ) = 0, and D(θ) is nonsingular.

    (iv) For all ε > 0,

        inf{||D^t(x) f(x)|| : ε ≤ ||x-θ|| ≤ ε^{-1}} > 0.

    (v) sup_x ||D(x)|| < ∞.

    (vi) Let H(x) be the Hessian of ||f(x)||^2, i.e.,
    H_{ij}(x) = (∂^2/∂x_i ∂x_j) ||f(x)||^2. Then sup_x ||H(x)|| < ∞.

F2. For all ε > 0,

        inf{||f(x)|| : ε ≤ ||x-θ|| ≤ ε^{-1}} > 0.
The modified Robbins-Monro algorithm will be described formally by the
following assumptions.

A1. (i) X_n is in R^k, B_n and D_n are k x k random matrices, and

        X_{n+1} = X_n - a n^{-1} B_n D_n^t f̄_n,

where a > 0.

    (ii) γ > 0, c_n → 0, m_n is an integer, m_n n^{-γ} → 0, m_n c_n^2 ↑ ∞,
η_n ↓ 0, η̄_n ↑ ∞, η̄_n^2 n^{-2} = o(η_n n^{-1}),

(2.1)    Σ_{n=1}^∞ n^{-1} η_n = ∞,

and

(2.2)    Σ_{n=1}^∞ n^{-1} η̄_n [c_n^2 + n^{-1} η̄_n (c_n^2 + c_n^{-2} m_n^{-1} + n^{-γ})] < ∞.

    (iii) B_n is F_n-measurable, E^{F_n} f̄_n = f(X_n),

        E^{F_n} D_n = D(X_n) + O(c_n^2),

        ||Var^{F_n} f̄_n|| = O(n^{-γ}),   ||Var^{F_n} D_n|| = O(m_n^{-1} c_n^{-2}),

and, given F_n, f̄_n and D_n are conditionally independent.

    (iv) B_n is symmetric and all eigenvalues of B_n are between η_n and η̄_n.
A2. (i) a > (1+γ)/2.

    (ii) If X_n → θ, then B_n → (D^t(θ) D(θ))^{-1} and n^γ Var^{F_n} f̄_n → S
for some matrix S, and for all r > 0

    lim_{n→∞} n^{-1} Σ_{i=1}^n E[ ||V_i||^2 1{||V_i||^2 ≥ r i} ] = 0,

where V_n = n^{γ/2} (D_n^t f̄_n - D̄_n^t f(X_n)) and D̄_n = E^{F_n} D_n.

Remarks on the assumptions. F1(vi) corresponds to the assumption of a
bounded Hessian, which has been used in the study of the Kiefer-Wolfowitz
algorithm, e.g., Fabian (1971, assumption (2.2)). Any substantially weaker
condition would probably require that the step sizes, a n^{-1} B_n D_n^t f̄_n,
in A1(i) be modified to prevent increasingly large oscillations. Otherwise,
X_n might have no finite limit points.
Assumptions F1(iv) and F2 are also similar to conditions that would be
needed if the KW process were applied to ||f(x)||^2; see Fabian (1971,
assumption (2.2), equation (1)). They imply that θ is the only local minimum
of ||f(x)||^2.

Given the method described in the introduction of constructing B_n, D_n, and
f̄_n, assumptions A1(iii), A1(iv) and A2(ii) are natural. To have
E^{F_n} D_n = D(X_n) + O(c_n^2), it is sufficient that the second derivative
of f be uniformly bounded. Simple conditions sufficient for A2(ii) can be
found using standard martingale techniques.
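For concreteness, one choice of sequences meeting all of the conditions in
A1(ii) (our example; the paper itself leaves the sequences general) is

    γ = 1/2,   c_n = n^{-1/8},   m_n = [n^{3/8}] + 1,
    η_n = 1/log(n+1),   η̄_n = log(n+1).

Then m_n n^{-γ} = O(n^{-1/8}) → 0, m_n c_n^2 ~ n^{1/8} ↑ ∞,
Σ n^{-1} η_n = Σ 1/(n log(n+1)) = ∞, η̄_n^2 n^{-2} = o(η_n n^{-1}), and the
summand in (2.2) is O(n^{-5/4} log n), so (2.2) holds. A2(i) then requires
a > 3/4, and corollary 3.3 below gives the optimal choice a = 1 + γ = 3/2.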
3. THEOREMS

Theorem 3.1. Assume F1 and A1. Then f(X_n) → 0. If F2 also holds, then
X_n → θ.

Theorem 3.2. Assume F1, F2, A1, and A2. Then

    n^{(1+γ)/2} (X_n - θ) →^L N(0, a^2 (2a-1-γ)^{-1} D^{-1}(θ) S D^{-t}(θ)).

Corollary 3.3. Under F1, F2, A1, and A2, the choice a = (1+γ) is
asymptotically optimal. With this choice,

    N_n^{1/2} (X_n - θ) →^L N(0, D^{-1}(θ) S D^{-t}(θ)),

where N_n is the total number of observations needed to construct X_n.

4. PROOFS
Proof of theorem 3.1. Define D̄_n = E^{F_n} D_n, d_n = D_n - D̄_n, and
ε_n = f̄_n - f(X_n). By F1(ii), A1(i), and A1(iii) there is an s_n in (0,1)
such that

(4.1)    ||f(X_{n+1})||^2 = ||f(X_n)||^2 - 2 a n^{-1} f^t(X_n) D(X_n) B_n D_n^t f̄_n
             + 2^{-1} a^2 n^{-2} (B_n D_n^t f̄_n)^t H(X_n + s_n (X_{n+1} - X_n)) (B_n D_n^t f̄_n).

By F1(v) and A1(iii),

(4.2)    |f^t(X_n) [D̄_n - D(X_n)] B_n D^t(X_n) f(X_n)| = O(c_n^2 η̄_n) ||f(X_n)||^2

and

(4.3)    f^t(X_n) D(X_n) B_n D^t(X_n) f(X_n) ≥ η_n ||D^t(X_n) f(X_n)||^2.

By A1(iii) and F1(v),

(4.4)    E^{F_n} ||D_n^t f̄_n||^2 = ||D̄_n^t f(X_n)||^2 + f^t(X_n) E^{F_n}(d_n d_n^t) f(X_n)
             + E^{F_n} [ε_n^t (D̄_n D̄_n^t + d_n d_n^t) ε_n]
         = O(||D^t(X_n) f(X_n)||^2 + (1 + ||f(X_n)||^2)(c_n^2 + c_n^{-2} m_n^{-1} + n^{-γ})).

By (2.2), (4.1) to (4.4), and A1(iv),

    E^{F_n} ||f(X_{n+1})||^2 ≤ ||f(X_n)||^2 (1 + μ_n)
        - 2 a n^{-1} η_n (1 + o(1)) ||D^t(X_n) f(X_n)||^2 + ν_n,

where Σ μ_n < ∞ and Σ ν_n < ∞. Therefore, by theorem 1 of Robbins and
Siegmund (1971), lim_{n→∞} ||f(X_n)||^2 exists and is finite, and

    Σ_{n=1}^∞ n^{-1} η_n ||D^t(X_n) f(X_n)||^2 < ∞.

With (2.1), this forces lim inf_{n→∞} ||D^t(X_n) f(X_n)|| = 0. Then by
F1(iv), ||f(X_n)||^2 → 0, and therefore X_n → θ if F2 holds.

Proof of theorem 3.2. By theorem 3.1 and A2(ii), B_n → (D^t(θ) D(θ))^{-1}
and D(X_n) → D(θ).
For each η > 0,

    f^t(X_n) D(X_n) B_n D̄_n^t f(X_n) ≥ (1-η) ||f(X_n)||^2

for all sufficiently large n. Therefore, by (4.1) to (4.4), for each η > 0,

    E^{F_n} ||f(X_{n+1})||^2 ≤ ||f(X_n)||^2 [1 - 2 a (1-η) n^{-1} + O(c_n^2 n^{-1} + n^{-2})]
        + O(n^{-2-γ})

for all n sufficiently large. Note that
(n+1)^{1+ε} = n^{1+ε} + (1+ε) n^ε + O(n^{ε-1}). For each 0 < ε < γ, for each
η > 0, and for all large n,

    (n+1)^{1+ε} E^{F_n} ||f(X_{n+1})||^2 ≤ (1 - [2 a (1-η) - (1+ε)] n^{-1}) n^{1+ε} ||f(X_n)||^2
        + O(n^{-1-(γ-ε)}).

Therefore, by A2(i) and another application of theorem 1 of Robbins and
Siegmund (1971), lim_{n→∞} n^{1+ε} ||f(X_n)||^2 exists and is finite for all
ε < γ, whence

(4.5)    ||f(X_n)||^2 = O(n^{-1-ε}) for all ε < γ.
We can now apply theorem 2.2 of Fabian (1968) with U_n = X_n - θ, Γ_n
defined by Γ_n (X_n - θ) = a B_n D̄_n^t f(X_n), Γ = aI, Φ_n = a B_n,
Φ = a D^{-1}(θ) D^{-t}(θ), V_n = n^{γ/2} (D_n^t f̄_n - D̄_n^t f(X_n)),
T_n = T = 0, P = I, Λ = aI, β = 1 + γ, and Σ = lim_{n→∞} E^{F_n}(V_n V_n^t).
Note that Λ_{ii} + Λ_{jj} - β = 2a - 1 - γ for all i and j. We need to
calculate Σ more explicitly.

If V_1, V_2, W_1, and W_2 are random variables possessing finite second
moments such that (V_1, V_2) is independent of (W_1, W_2), then

    Cov(V_1 W_1, V_2 W_2) = Cov(V_1, V_2) Cov(W_1, W_2)
        + E(V_1) E(V_2) Cov(W_1, W_2) + Cov(V_1, V_2) E(W_1) E(W_2).

Applying this fact coordinatewise to D_n^t f̄_n and using A1(ii) and
A1(iii), one can show that

    n^γ Var^{F_n}(D_n^t f̄_n) = D̄_n^t (n^γ Var^{F_n} f̄_n) D̄_n + o(1).

Then by A2(ii) and (4.5),

    Σ = lim_{n→∞} n^γ Var^{F_n}(D_n^t f̄_n) = D^t(θ) S D(θ),

and theorem 3.2 follows from Fabian's theorem.
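As a quick numerical sanity check of the product-covariance identity used
above (our addition; the means and covariances below are arbitrary choices),
one can verify it by simulation:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000
    V = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], size=N)
    W = rng.multivariate_normal([3.0, -1.0], [[1.0, -0.3], [-0.3, 0.5]], size=N)

    lhs = np.cov(V[:, 0] * W[:, 0], V[:, 1] * W[:, 1])[0, 1]
    rhs = 0.5 * (-0.3) + (1.0 * 2.0) * (-0.3) + 0.5 * (3.0 * -1.0)
    print(lhs, rhs)   # both approximately -2.25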
Proof of corollary 3.3. Since (d/da)[a^2/(2a-1-γ)] = 2a(a-1-γ)/(2a-1-γ)^2,
it is trivial to show that a^2/(2a-1-γ) is minimized in a, subject to the
constraint a > (1+γ)/2, by a = (1+γ). The corollary then follows because

    N_n = Σ_{i=1}^n {[i^γ] + 2k m_i} ~ Σ_{i=1}^n [i^γ] ~ n^{1+γ}/(1+γ).
5. ASYMPTOTIC EFFICIENCY

We will not treat the subject of efficiency in great detail, but we will
study a simple example. Suppose D is nonsingular and known, f(x) = D(x - θ),
and Var Y(x) = S for all x. If Y(x) is normally distributed, then

    x - D^{-1} Y(x) ~ N(θ, D^{-1} S D^{-t}).

Thus, if Z_1, ..., Z_{N_n} is any sequence of random variables, then the
maximum likelihood estimate of θ based on Y(Z_1), ..., Y(Z_{N_n}) is

    θ̂ = N_n^{-1} Σ_{i=1}^{N_n} [Z_i - D^{-1} Y(Z_i)].

Also θ̂ ~ N(θ, N_n^{-1} D^{-1} S D^{-t}). Therefore, θ̂ and our estimator,
X_n, have the same asymptotic distributions, even though our estimator has
been devised for a much more general problem.
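A short simulation (our illustration; D, S, θ, and the design points are
arbitrary choices) confirms the distribution claimed for the maximum
likelihood estimate:

    import numpy as np

    rng = np.random.default_rng(2)
    D = np.array([[2.0, 0.5], [0.0, 1.0]])
    S = np.array([[1.0, 0.2], [0.2, 0.5]])
    theta = np.array([1.0, -1.0])
    N = 100_000

    Z = rng.standard_normal((N, 2))                  # arbitrary design points Z_i
    eps = rng.multivariate_normal(np.zeros(2), S, size=N)
    Y = (Z - theta) @ D.T + eps                      # Y(Z_i) = D(Z_i - theta) + noise
    est = Z - Y @ np.linalg.inv(D).T                 # rows ~ N(theta, D^{-1} S D^{-t})
    print(est.mean(axis=0))                          # ~ theta
    print(np.cov(est, rowvar=False))                 # ~ D^{-1} S D^{-t}:
    print(np.linalg.inv(D) @ S @ np.linalg.inv(D).T)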
When deriving the asymptotic distribution of X_n, it is crucial that
Var^{F_n} D_n → 0, so that

    Var^{F_n}(D_n^t f̄_n) ≈ D^t(θ) (Var^{F_n} f̄_n) D(θ).

Then, because Var^{F_n}(D_n^t f̄_n) is determined by Var^{F_n} f̄_n, full
efficiency is obtained by having m_n n^{-γ} → 0, so that the ratio of the
number of observations used to construct D_n to the number used to construct
f̄_n converges to 0.
REFERENCES

1. Blum, J.R. (1954). Multidimensional stochastic approximation methods.
   Ann. Math. Statist. 25, 737-744.

2. Chung, K.L. (1954). On a stochastic approximation method. Ann. Math.
   Statist. 25, 463-483.

3. Fabian, V. (1968). On asymptotic normality in stochastic approximation.
   Ann. Math. Statist. 39, 1327-1332.

4. Fabian, V. (1971). Stochastic approximation. In Optimizing Methods in
   Statistics (J.S. Rustagi, ed.) 439-470. Academic Press: New York.

5. Robbins, H. and Monro, S. (1951). A stochastic approximation method.
   Ann. Math. Statist. 22, 400-407.

6. Robbins, H. and Siegmund, D. (1971). A convergence theorem for
   nonnegative almost supermartingales and some applications. In Optimizing
   Methods in Statistics (J.S. Rustagi, ed.) 233-257. Academic Press: New
   York.

7. Sacks, J. (1958). Asymptotic distributions of stochastic approximation
   procedures. Ann. Math. Statist. 29, 373-405.

8. Schmetterer, L. (1969). Multidimensional stochastic approximation. In
   Multivariate Analysis II (P.R. Krishnaiah, ed.). Academic Press: New
   York.

9. Venter, J. (1967). An extension of the Robbins-Monro procedure. Ann.
   Math. Statist. 38, 181-190.

10. Walk, H. (1977). An invariance principle for the Robbins-Monro process
    in a Hilbert space. Z. Wahrscheinlichkeitstheorie verw. Gebiete 39,
    135-150.