Adaptive Nonmonotonic Methods With Averaging of Subgradients
N.D. Chepurnoj
IIASA Working Paper WP-87-062
July 1987

Chepurnoj, N.D. (1987) Adaptive Nonmonotonic Methods With Averaging of Subgradients. IIASA Working Paper WP-87-062. IIASA, Laxenburg, Austria. http://pure.iiasa.ac.at/2990/
ADAPTIVE NONMONOTONIC METHODS
WITH AVERAGING OF SUBGRADIENTS

N.D. Chepurnoj

July 1987
WP-87-62

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.

INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS
A-2361 Laxenburg, Austria
FOREWORD

The numerical methods of nondifferentiable optimization are used for solving decision analysis problems in economics, engineering, the environment and agriculture. This paper is devoted to adaptive nonmonotonic methods with averaging of subgradients. A unified approach is suggested for the construction of new deterministic subgradient methods, their stochastic finite-difference analogs, and a posteriori estimates of the accuracy of the solution.
Alexander B. Kurzhanski
Chairman
System and Decision Sciences Program
CONTENTS

1. Overview of Results in Nonmonotonic Subgradient Methods
2. Subgradient Methods with Program-Adaptive Step-Size Regulation
3. Methods with Averaging of Subgradients and Program-Adaptive Successive Step-Size Regulation
4. Stochastic Finite-Difference Analogs to Adaptive Nonmonotonic Methods with Averaging of Subgradients
5. A Posteriori Estimates of Accuracy of Solution to Adaptive Subgradient Methods and Their Stochastic Finite-Difference Analogs
References
ADAPTIVE NONMONOTONIC METHODS
WITH AVERAGING OF SUBGRADIENTS

N.D. Chepurnoj

1. OVERVIEW OF RESULTS IN NONMONOTONIC SUBGRADIENT METHODS
Among the existing numerical methods for solving nondifferentiable optimization problems, the nonmonotonic subgradient methods hold an important position. The pioneering work by N.Z. Shor [26] gave impetus to their explosive progress. In 1962, he suggested an iterative process for minimizing a convex piecewise-linear function, named afterwards the generalized gradient descent (GGD):

x^{s+1} = x^s − r_s g^s,  s = 0, 1, 2, ...,   (1.1)

where g^s ∈ ∂f(x^s), ∂f(x^s) is the set of subgradients of the function f(x) at the point x^s, and r_s ≥ 0 is a step size.
For differentiable functions this method agrees very closely with the well-known gradient method. The fundamental difference between them is that the motion direction (−g^s) in (1.1) is, as a rule, not a descent direction.
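To make the scheme concrete, the following is a minimal sketch of iteration (1.1) in Python; the piecewise-linear objective, its data, and the step rule r_s = 1/(s+1) are illustrative assumptions rather than part of the original text.

import numpy as np

# Illustrative piecewise-linear objective f(x) = max_i (<a_i, x> + b_i);
# a subgradient at x is any row a_i attaining the maximum.
A = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
b = np.array([0.0, 1.0, 0.5])

def f(x):
    return np.max(A @ x + b)

def subgrad(x):
    return A[np.argmax(A @ x + b)]

x = np.array([5.0, -3.0])        # arbitrary initial point x^0
for s in range(1000):
    r_s = 1.0 / (s + 1)          # program step-size: r_s -> 0, sum r_s = infinity
    x = x - r_s * subgrad(x)     # GGD step (1.1); in general not a descent step
print(f(x), x)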
At the first attempts to substantiate theoretically the convergence of procedures of type (1.1), researchers immediately faced two difficulties. For one thing, the objective function lacked the property of differentiability. For another, method (1.1) was not monotonic. These combined features rendered the known convergence theorems for gradient procedures inapplicable. New theoretical approaches therefore became a must.

One more "misfortune" came on top of the others: numerical computations demonstrated that GGD has a low convergence rate. Initially, great hopes were pinned on the step-size selection strategy as a way of overcoming the crisis.
By the early 1970s the difficulties caused by the formal substantiation of convergence of nonmonotonic subgradient procedures had been overcome, and different approaches to step-size regulation had been offered [6, 7, 8, 19, 20, 26]. However, the computations continued to prove the poor convergence of GGD in practice. It can be said that the first stage in GGD evolution was over by 1976. Thereupon the numerical methods of nondifferentiable optimization developed in three directions: methods with space dilation, monotonic methods, and adaptive nonmonotonic methods.
Let us dwell on each of these approaches.

In an effort to enhance the efficiency of GGD, N.Z. Shor elaborated methods employing the operation of space dilation in the direction of a subgradient and in the direction of the difference between two successive subgradients. Literally the next few years were prolific in papers [27, 28, 29] investigating the space dilation operation in nondifferentiable function minimization problems. A high rate of convergence of the suggested methods was corroborated theoretically. Computational practice attested convincingly to the advantage of the algorithms with space dilation, especially the r-algorithm [29], as an alternative to GGD, provided the dimension of the space does not exceed 200 to 300. However, if the dimension is large, first, a considerable amount of computation is spent on the transformation of the space dilation matrix and, second, extra computer memory is required.
The monotonic methods became another essential direction. Even though the first papers on the monotonic methods appeared back in 1968 (V.F. Dem'janov [30]), their progress reached its peak in the early 1970s. Two classes of these algorithms should be distinguished here: the ε-steepest descent [5, 30] and the ε-subgradient algorithms [31-34]. We shall not examine them in detail but note that the monotonic methods offered a higher rate of convergence than GGD. Just as with the methods using space dilation, the large dimension of the problems to be solved remained the Achilles' heel of the monotonic algorithms.
Thus, the nonmonotonic subgradient methods have come into particular importance in the solution of large-scale nondifferentiable optimization problems. The nonmonotonic procedures have another important field of application, apart from the large-scale problems: the problems in which the subgradient cannot be precisely evaluated at a point. The latter encompass problems of identification, learning, and pattern recognition [1, 21]. The minimized function is there a mathematical expectation whose distribution law is unknown. Errors in subgradient calculation may stem from computation errors and many other real processes. Ju.M. Ermol'ev and Z.V. Nekrylova [9] were the first to investigate such procedures. Stochastic programming problems have increasingly drawn attention to the nonmonotonic subgradient methods.
However, as pointed out earlier, GGD, though widely used, resistant to errors in subgradient computations, and saving of memory capacity, still had a poor rate of convergence. Of great importance therefore was the construction of nonmonotonic methods that, on the one hand, retain all the advantages of GGD and, on the other, possess a high rate of convergence. It is this requirement that has led to the elaboration of the adaptive nonmonotonic procedures.

An analysis revealed that the Markov nature of GGD is the chief cause of its slow convergence. It is quite obvious that the fullest knowledge of the progress of the computations is indispensable for selecting the direction and regulating the step-size.
Several ideas provided the basis for the development of adaptive nonmonotonic methods. The major concept behind all techniques for selecting the direction and regulating the step-size was the use of information about the fulfillment of the necessary extremum conditions for the minimized function. It is implemented in the methods with averaging of the subgradients. In the most general case, by the operation of averaging is meant a procedure of "taking" the convex hull of an arbitrary finite number of vectors.

The operation of averaging in numerical methods was first applied by Ja.Z. Tsypkin [22] and Ju.M. Ermol'ev [11]. The paper by A.M. Gupal and L.G. Bazhenov [3], also dealing with the use of the operation of averaging of stochastic estimates of the generalized gradients, appeared in 1972.
However, all the above papers considered the program regulation of the step-size, i.e., a sequence {r_s} independent of the computations was selected such that

r_s ≥ 0,  r_s → 0,  Σ_{s=0}^{∞} r_s = ∞.
The next natural stage in the evolution of this concept was the construction of adaptive step-size regulation using the operation of averaging of the preceding subgradients. In 1974, E.A. Nurminskij and A.A. Zhelikovskij [18] suggested a successive program-adaptive regulation of the step-size for the quasi-gradient method of minimization of weakly convex functions.
The crux of this regulation consists in the following. Let an iterative sequence be constructed according to the rule

x^{s+1} = x^s − r_0 g^s,

where g^s ∈ ∂f(x^s) is a quasi-gradient of the function f(x) at the point x^s and r_0 is a constant step-size. Assume that there exist x̄ ∈ E^n and numerical parameters ε > 0, δ > 0 such that ||x^s − x̄|| ≤ δ for any s = 0, 1, 2, .... Let us suppose also that a convex combination of subgradients

e^{s_0} ∈ conv {g^i}_{i=0}^{s_0}

exists such that ||e^{s_0}|| ≤ ε. Then the point x^{s_0} is sufficiently close to the set X* = argmin f(x) according to the necessary extremum conditions. In this case the step-size has to be reduced and the procedure repeated with the new step-size value r_1, starting at the obtained point x^{s_0}.

The numerical realization of the described algorithm requires a specific rule for constructing the vectors e^s. In [18] the vector e^s is constructed by the rule

e^s = Proj (0 | conv {g^k}_{k=j}^{s}),

that is, all quasi-gradients are included into the convex hull starting from the most recent instant j of the step-size change. Numerical computations bore out the expediency of such regulation. However, a grave disadvantage was inherent in it: the great laboriousness of each iteration. Considering that the approach as a whole holds promise, averaging schemes had to be developed for efficient use in selecting the direction and regulating the step-size.
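To illustrate the laboriousness of this rule, here is a sketch of the projection of the origin onto the convex hull of the stored quasi-gradients; the Frank-Wolfe scheme and its parameters are assumptions introduced for the example, not the method of [18].

import numpy as np

def min_norm_in_hull(G, iters=500):
    """Approximate Proj(0 | conv{g^j, ..., g^s}) via Frank-Wolfe applied to
    min ||G'w||^2 over the simplex; G holds one stored quasi-gradient per row."""
    m = G.shape[0]
    w = np.full(m, 1.0 / m)              # start from the uniform combination
    for t in range(iters):
        grad = 2.0 * G @ (G.T @ w)       # gradient of ||G'w||^2 w.r.t. the weights
        v = np.zeros(m)
        v[np.argmin(grad)] = 1.0         # simplex vertex minimizing the linearization
        w += (2.0 / (t + 2.0)) * (v - w) # standard Frank-Wolfe step
    return G.T @ w                       # approximation of the vector e^s

Every step-size test then costs an auxiliary optimization over all stored quasi-gradients, which is exactly the laboriousness the averaging schemes below avoid.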
This paper treats such averaging schemes. They serve as a foundation for new nonmonotonic subgradient methods, for the description of their stochastic finite-difference analogs, and for a posteriori estimates of solution accuracy. Prior to discussing the results, let us make some general assumptions. Presume that the minimization problem

f(x) → min,  x ∈ E^n,   (*)

is being solved on the entire space, where E^n is an n-dimensional Euclidean space. The function f(x) will everywhere be thought of as a proper convex function, dom f = E^n, with the level sets {x : f(x) ≤ C} bounded for any constant C. The set of solutions of problem (*) is the set X* = Argmin {f(x) : x ∈ E^n}.
2. SUBGRADIENT METHODS WITH PROGRAM-ADAPTIVE STEP-SIZE REGULATION

The concept of adaptive successive step-size regulation has already been set forth. In [23] a way of determining the instant of the step-size variation was suggested. Central to it was the simplest scheme of averaging of the preceding subgradients. This method is easy to implement and effects a saving in computer memory capacity. Compared to the program regulation, the adaptive regulation improves the convergence of the subgradient methods.
Description of Algorithm 1

Let x^0 be an arbitrary initial point, b > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).

Step 1. Construct x^{s+1} = x^s − r_k g^s, g^s ∈ ∂f(x^s).

Step 2. If f(x^{s+1}) > f(x^0) + b, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3. Define e^{s+1} = e^s + (s − j + 2)^{−1}(g^{s+1} − e^s).

Step 4. If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 5. Set k = k + 1, j = s + 1, s = s + 1 and go to Step 1.
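A compact sketch of Algorithm 1 in Python may be useful; the rules r_k = ε_k = 1/(k+1) are illustrative assumptions, and the selection in Step 2 is interpreted as returning to the best point found so far (one admissible choice of a point in {x : f(x) ≤ f(x^0)}).

import numpy as np

def algorithm_1(f, subgrad, x0, b=1.0, max_iter=5000):
    """Sketch of Algorithm 1: GGD whose step r_k is reduced only when the mean e
    of the subgradients collected since the last change satisfies ||e|| <= eps_k."""
    eps = lambda k: 1.0 / (k + 1)         # eps_k > 0, eps_k -> 0 (assumed rule)
    r = lambda k: 1.0 / (k + 1)           # r_k > 0, r_k -> 0     (assumed rule)
    x = np.asarray(x0, dtype=float)
    f0, x_best = f(x), x.copy()
    k, j = 0, 0
    g = subgrad(x)
    e = g.copy()
    for s in range(max_iter):
        x = x - r(k) * g                  # Step 1
        if f(x) > f0 + b:                 # Step 2: fell out of the level set
            x = x_best.copy()             # return to a point with f(x) <= f(x^0)
            k, j = k + 1, s + 1           # Step 5
            g = subgrad(x)
            e = g.copy()
            continue
        if f(x) < f(x_best):
            x_best = x.copy()
        g = subgrad(x)
        e = e + (g - e) / (s - j + 2)     # Step 3: average since instant j
        if np.linalg.norm(e) <= eps(k):   # negation of Step 4: necessary
            k, j = k + 1, s + 1           # conditions nearly met;
            e = g.copy()                  # Step 5: reduce the step-size
    return x_best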
THEOREM 1.1. Assume that the problem (*) is solved by Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
PROOF. Denote the instants of step-size variation by s_m. Let us prove that the step-size r_k varies an infinite number of times. Suppose it is not so, i.e., the step-size does not vary starting from an instant s_m and is equal to r_m. Then the points x^s for s ≥ s_m belong to the set {x : f(x) ≤ f(x^0) + b} and are related by

x^{s+1} = x^s − r_m g^s.

Considering that the step-size does not vary, ||e^s|| > ε_m > 0 for s ≥ s_m. Since e^s is the average of the subgradients collected after the last change, x^{s+1} = x^j − r_m (s − j + 1) e^s, whence ||x^{s+1} − x^j|| > r_m (s − j + 1) ε_m. Passing to the limit as s → ∞ in this inequality, we obtain a contradiction with the boundedness of the set {x : f(x) ≤ f(x^0) + b}.

The further proof of Theorem 1.1 amounts to checking the general conditions of algorithm convergence derived by E.A. Nurminskij [17].
NURMINSKIJ THEOREM. Let the sequence {x^s} and the set of solutions X* be such that the following conditions are satisfied:

D1. For any subsequence {x^{s_k}} such that x^{s_k} → x', the subsequence {x^{s_k+1}} converges to the same point x'.

D2. There exists a closed bounded set S such that the sequence {x^s} belongs to S from some index on.

D3. For any subsequence {x^{n_k}} such that x^{n_k} → x' ∉ X*, there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and any k

m_k = inf {m : m > n_k, ||x^m − x^{n_k}|| > ε} < ∞.

D4. A continuous function W(x) exists such that for an arbitrary subsequence {x^{n_k}} with x^{n_k} → x' ∉ X* and for the subsequence {x^{m_k}} corresponding to it by condition D3, for an arbitrary 0 < ε ≤ ε_0

lim sup_{k→∞} W(x^{m_k}) < lim_{k→∞} W(x^{n_k}).

D5. The function W(x) of condition D4 assumes no more than a countable number of values on the set X*.

Then all limit points of the sequence {x^s} belong to X*.
Select the function f(x) as the function W(x). Conditions D1 and D5 are satisfied in view of the algorithm structure and the earlier assumptions. The remaining conditions will be verified by the following scheme. We will prove that conditions D3 and D4 hold when the points in question are interior points of the set S = {x : f(x) ≤ f(x^0) + b}. It is therewith obvious that

max {W(x) : f(x) ≤ f(x^0)} < inf {W(x) : x ∉ S}.

Then the sequence {x^s} falls outside the set S only a finite number of times. Consequently, condition D2 is satisfied, and this automatically entails the validity of D3 and D4.
So, let a subsequence {x^{n_p}} exist such that x^{n_p} → x' ∉ X*. Assume at this stage of the proof that x' ∈ int S. We will prove that there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and an arbitrary p

inf {m : m > n_p, ||x^m − x^{n_p}|| > ε} < ∞.   (2.1)

Now suppose condition (2.1) is not satisfied, that is, for some ε > 0 there exists n_p such that ||x^s − x^{n_p}|| ≤ ε for all s > n_p; then ||x^s − x'|| ≤ 2ε for sufficiently large n_p and s > n_p. By the supposition, 0 ∉ ∂f(x'). By virtue of the closedness, convexity and upper semicontinuity of the many-valued mapping ∂f(x), there exists ε > 0 such that 0 ∉ conv G_{4ε}(x'), where conv {·} is the convex hull and G_{4ε}(x') is the set

G_{4ε}(x') = ∪ {∂f(x) : x ∈ U_{4ε}(x')},  U_{4ε}(x') = {x : ||x − x'|| ≤ 4ε}.

It is easily seen that ε > 0 can always be selected in such a way that U_{4ε}(x') ⊂ int S. Obviously,

ρ = min {||z|| : z ∈ conv G_{4ε}(x')} > 0.

Since ε_k → 0, there exists an integer K such that ε_k ≤ ρ/2 for k ≥ K. Take n_p ≥ K. Then it is readily seen that for s ≥ n_p the step-size r_k can vary no more than once within the set U_{4ε}(x'). Examine the sequence {x^s} separately on the intervals n_p ≤ s < s_p*, where

s_p* = min {s_m : s_m ≥ n_p}.

When n_p ≤ s < s_p*, the points x^s are related as follows:

x^{s+1} = x^s − r_l g^s,

where the index l is determined by s_p*. Let us consider the scalar products (x^{n_p} − x^s, g^s). Since x^s ∈ U_{4ε}(x') and g^s ∈ conv G_{4ε}(x') for s ≥ n_p, it is possible to prove that there exists an index N_1 ≥ n_p such that

(x^{N_1} − x^{N_1+1}, g^{N_1+1}) ≥ 2γ,  γ = ρε/2.

We next consider the scalar products

d_s = (x^{N_1+1} − x^s, g^s),  s ≥ N_1 + 1.

An index N_2 exists such that (x^{N_2} − x^{N_2+1}, g^{N_2+1}) ≥ 2γ and d_{N_2+1} ≥ r_l (N_2 − N_1) γ. In a similar way we can prove the existence of indices N_t (t ≥ 3) with the same properties. It is easy to prove that N_{t+1} − N_t ≤ N̄ < ∞, t = 1, 2, .... Let N_{t_0} be the maximal of the indices N_t that does not exceed s_p*. Then

f(x^{s_p*}) − f(x^{n_p}) ≤ −r_l (s_p* − N_{t_0}) γ + ε_p',   (2.2)

where ε_p' → 0 as p → ∞; since s_p* − N_{t_0} ≤ N̄, the last term on the right-hand side of the inequality approaches zero as p → ∞.

It is not difficult to notice that the reasoning which underlies the derivation of inequality (2.2) may be repeated without changes for the intervals s ≥ s_p* to get

f(x^m) − f(x^{s_p*}) ≤ −r_{l+1} (m − s_p*) γ + ε_p''.   (2.3)

Adding (2.2) to (2.3), we obtain

f(x^m) − f(x^{n_p}) ≤ −r_{l+1} (m − s_p*) γ + ε̄_p.   (2.4)

Passing to the limit as m → ∞ in inequality (2.4), we are led to a contradiction with the boundedness of the continuous function f on the closed bounded set U_{4ε}(x'). Consequently, condition (2.1) is proved.
Let

m_p = inf {m : m > n_p, ||x^m − x^{n_p}|| > ε}.

By construction x^{m_p} ∉ U_ε(x^{n_p}), but x^{m_p} ∈ U_{4ε}(x') for sufficiently large p. All the reasoning involved in the derivation of inequality (2.4) remains valid for the instant m_p, that is, we have

f(x^{m_p}) − f(x^{n_p}) ≤ −r_{l+1} (m_p − s_p*) γ + ε̄_p.

Passing to the limit as p → ∞, we get

lim sup_{p→∞} W(x^{m_p}) < lim_{p→∞} W(x^{n_p}).

The further proof of this theorem follows from the Nurminskij theorem.
To fix more precisely the instant when the iteration process gets into the neighborhood of the solution, we can employ the following modification of Algorithm 1, provided the computer capacity allows.

Let x^0 be an arbitrary initial point, d > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0; let k_1, k_2, ..., k_m be positive bounded integer constants. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).
Step 1. Construct x^{s+1} = x^s − r_k g^s, g^s ∈ ∂f(x^s).

Step 2. If f(x^{s+1}) > f(x^0) + d, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3. Define

e_0^{s+1} = e_0^s + (s − j + 2)^{−1}(g^{s+1} − e_0^s),
e_i^{s+1} = P_i(g^{s+1}, g^s, ..., g^{s−k_i}),  i = 1, ..., m.

Each of the notations P_i(·, ..., ·) designates an arbitrary convex combination of a finite number of the indicated preceding subgradients. Find

μ^{s+1} = min {||e_i^{s+1}|| : 0 ≤ i ≤ m}.

Step 4. If μ^{s+1} > ε_k, then s = s + 1 and go to Step 1.

Step 5. Set k = k + 1, j = s + 1, s = s + 1, e^s = g^s and go to Step 1.
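The extra work in Steps 3 and 4 reduces to computing several convex combinations and taking the smallest norm; a sketch with uniform weights as one admissible choice of the combinations P_i (an assumption of the example):

import numpy as np

def combination_test(grads_since_j, windows, eps_k):
    """mu^{s+1} = min ||e_i^{s+1}||: e_0 averages all subgradients since the last
    step-size change, e_i (i >= 1) the last k_i + 1 of them (uniform weights)."""
    candidates = [grads_since_j] + [grads_since_j[-(k + 1):] for k in windows]
    mu = min(np.linalg.norm(np.mean(c, axis=0)) for c in candidates)
    return mu <= eps_k            # True: necessary conditions nearly met, reduce step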
THEOREM 2.1. Suppose that the problem (*) is solved by the modified Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
3. METHODS WITH AVERAGING OF SUBGRADIENTS AND PROGRAM-ADAPTIVE SUCCESSIVE STEP-SIZE REGULATION
As noted in a number of works [2, 3, 12, 16], it is expedient to average the subgradients calculated at the previous iterations so that the subgradient methods become more regular. For instance, when "ravine"-type functions are minimized, the averaged direction points the way along the bottom of the "ravine". It will be demonstrated in Section 5 that the operation of averaging enables the improvement of a posteriori estimates of the solution accuracy along with the upgrading of the regularity of the described methods. Methods with averaging of subgradients and successive program-adaptive regulation of the step-size are set forth in this section. The results obtained here stem from [24].
Description of Algorithm 2

Let x^0 be an arbitrary initial approximation; b̄ > 0 be a constant; {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = v^0 = g^0 ∈ ∂f(x^0).

Step 1. Construct x^{s+1} = x^s − r_k v^s.

Step 2. If f(x^{s+1}) > f(x^0) + b̄, then go to Step 7.

Step 3. Define v^{s+1} according to the schemes a) or b).

Step 4. Construct e^{s+1} = e^s + (s − j + 2)^{−1}(v^{s+1} − e^s).

Step 5. If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 6. Set k = k + 1, j = s + 1, s = s + 1, e^s = v^s and go to Step 1.

Step 7. Select x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, j = s + 1, s = s + 1, k = k + 1 and go to Step 1.
In constructing the direction v^s, the following schemes of subgradient averaging are dealt with.

a) The "moving" average. Let K + 1 be an integer. Then

v^s = Σ_{i=max(0, s−K)}^{s} λ_{i,s} g^i,  λ_{i,s} ≥ 0,  Σ_i λ_{i,s} = 1,

where g^i ∈ ∂f(x^i).

b) The "weighted" average. Let M + 1 be an integer. Then

v^s = g^s + λ_s (v^{s−1} − g^s),

where 0 ≤ λ_s ≤ 1 for s ≢ 0 (mod M) and 0 ≤ λ_s ≤ δ̄ < 1 for s ≡ 0 (mod M).
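In code the two schemes reduce to a windowed mean and a simple recursion; the uniform weights in a) and the particular constants in b) are illustrative assumptions:

import numpy as np

def moving_average(grads, K):
    """Scheme a) with the admissible choice of equal weights: average the
    last K + 1 subgradients (fewer at the start of the run)."""
    window = grads[-(K + 1):]
    return np.mean(window, axis=0)

def weighted_average(v_prev, g, s, M, lam=0.9, delta=0.5):
    """Scheme b): v^s = g^s + lambda_s (v^{s-1} - g^s), with the damped factor
    lambda_s <= delta < 1 whenever s = 0 (mod M)."""
    lam_s = delta if s % M == 0 else lam
    return g + lam_s * (v_prev - g)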
THEOREM 3.1. Assume that the problem (*) is solved by Algorithm 2. Then all limit points of the sequence {x^s} belong to the set X*.
4. STOCHASTIC FINITE-DIFFERENCE ANALOGS TO ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS
It should be emphasized that the practical value of the subgradient-type methods essentially depends upon the existence of their finite-difference analogs. The finite-difference methods are of great importance primarily in situations where subgradient computation programs are unavailable, which generally occurs in the solution of large-scale problems. The construction of finite-difference methods in nonsmooth optimization gave rise to two approaches: the deterministic and the stochastic one. Each of them has its own advantages and disadvantages. The stochastic approach is favored here.

One of the advantages of the introduced averaging operation is the fact that the construction of stochastic analogs to the subgradient methods presents no special problems. The offered methods are close to those with smoothing [4] which, in their turn, are closest to the schemes of stochastic quasi-gradient methods [12]. Research into stochastic quasi-gradient methods with successive step-size regulation is quite a new and underdeveloped field. Ju.M. Ermol'ev spurred the first investigations in this direction. His and Ju.M. Kaniovskij's results [13] are undoubtedly of theoretical interest. However, the implementation of the methods described in [14] creates complications, as there is no rule to regulate variations in the step-size.
Let us first dwell on the functions f(x, i) of the form

f(x, i) = (2α_i)^{−n} ∫_{x_1−α_i}^{x_1+α_i} ... ∫_{x_n−α_i}^{x_n+α_i} f(y_1, ..., y_n) dy_1 ... dy_n,

where α_i > 0. Properties of the functions f(x, i) have been studied by A.M. Gupal [4], proceeding from the assumption that f(x) satisfies a local Lipschitz condition.

THEOREM 4.1. If f(x) is a proper convex function, dom f = E^n, then f(x, i) is also a proper convex function, dom f(x, i) = E^n, for any α_i > 0.

THEOREM 4.2. The sequence of functions f(x, i) converges uniformly to f(x) as α_i → 0 in any bounded domain X.
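Theorem 4.2 is easy to observe numerically: f(x, i) is the mean of f over the cube with half-side α_i centered at x, which can be estimated by Monte Carlo. The test function and sample size below are assumptions of the sketch.

import numpy as np

def smoothed_f(f, x, alpha, samples=20000, rng=None):
    """Monte Carlo estimate of f(x, i): the average of f over the cube
    [x_k - alpha, x_k + alpha], k = 1, ..., n."""
    rng = rng or np.random.default_rng(0)
    y = x + rng.uniform(-alpha, alpha, size=(samples, x.size))
    return float(np.mean([f(yi) for yi in y]))

f = lambda z: np.abs(z).sum()        # nonsmooth convex test function
x = np.array([0.3, -0.2])
print(smoothed_f(f, x, 0.5), smoothed_f(f, x, 0.01), f(x))  # approaches f(x)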
Now we shall go on to the description of stochastic finite-difference analogs to the algorithms with successive program-adaptive regulation of the step-size and with averaging of the direction.

Description of Algorithm 3

Let x^0 be an arbitrary initial approximation, b > 0 be a constant, {r_i}, {ε_i}, {α_i}, {ρ_i} be number sequences. Put s = 0, i = 0, j = 0.

Step 1. Compute

ĝ_k^s = (2α_i)^{−1} [f(x̃_1^s, ..., x̃_{k−1}^s, x_k^s + α_i, x̃_{k+1}^s, ..., x̃_n^s) − f(x̃_1^s, ..., x̃_{k−1}^s, x_k^s − α_i, x̃_{k+1}^s, ..., x̃_n^s)],  k = 1, ..., n,

where the x̃_k^s, k = 1, ..., n, are independent random values distributed uniformly on the intervals [x_k^s − α_i, x_k^s + α_i], α_i > 0.

Step 2. Construct e^s in compliance with the schemes a) and b), where the subgradients are replaced by their stochastic estimates ĝ^s.

Step 3. Find x^{s+1} = x^s − r_i e^s.

Step 4. If f(x^{s+1}) > f(x^0) + b, then go to Step 9.

Step 5. Define ē^{s+1} = ē^s + (s − j + 1)^{−1}(e^s − ē^s).

Step 6. If s − j < ρ_i, then s = s + 1 and go to Step 1.

Step 7. If ||ē^{s+1}|| > ε_i, then s = s + 1 and go to Step 1.

Step 8. Put i = i + 1, j = s + 1, s = s + 1 and go to Step 1.

Step 9. Select x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, j = s + 1, i = i + 1, s = s + 1 and go to Step 1.
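Step 1 is the only place where the function itself is touched; a sketch of the stochastic finite-difference estimate, with the randomization as described above:

import numpy as np

def stochastic_fd_estimate(f, x, alpha, rng=None):
    """Stochastic finite-difference estimate of Step 1: the k-th component is a
    central difference in x_k taken at a point whose other coordinates are drawn
    uniformly from [x_m - alpha, x_m + alpha]."""
    rng = rng or np.random.default_rng()
    n = x.size
    g = np.empty(n)
    for k in range(n):
        y = x + rng.uniform(-alpha, alpha, size=n)   # random smoothing point
        yp, ym = y.copy(), y.copy()
        yp[k], ym[k] = x[k] + alpha, x[k] - alpha    # perturb only coordinate k
        g[k] = (f(yp) - f(ym)) / (2.0 * alpha)
    return g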
THEOREM 4.3. Let the problem (*) be solved by Algorithm 3 and the number sequences {r_i}, {ε_i}, {α_i}, {ρ_i} satisfy the conditions formulated in [25]. Then for almost all ω the sequence f(x^s(ω)) converges and all limit points of the sequence {x^s(ω)} belong to the set of solutions X*. Theorem 4.3 is proved in detail in [25].
5. A POSTERIORI ESTIMATES OF ACCURACY OF SOLUTION TO ADAPTIVE SUBGRADIENT METHODS AND THEIR STOCHASTIC FINITE-DIFFERENCE ANALOGS

In the numerical solution of extremum problems of nondifferentiable optimization, strong emphasis is placed on checking the accuracy of the obtained solution. Given the solution accuracy estimates, first, a very efficient stopping rule for the algorithm can be formulated and, second, the obtained estimates can form the basis for justified conclusions with respect to the strategy of selecting the algorithm parameters. Using a rather simple procedure, a posteriori estimates of the solution accuracy for the introduced adaptive algorithms are constructed here. The estimates provide a means for strictly evaluating the efficiency of the use of the averaging operation.
Thus, assume that the convex function minimization problem (*) is being solved. Suppose the set X* contains only one point x*. To solve the problem, consider Algorithm 1. A spin-off from the proof of Theorem 1.1 is that the sequence {x^s} falls outside the set {x : f(x) ≤ f(x^0) + b} only a finite number of times. Therefore, s̄ ≥ 0 exists such that the points x^s remain in this set for s ≥ s̄. Then the step size will vary only if the condition ||e^{s+1}|| ≤ ε_k is satisfied, where

e^{s+1} = (s − j + 2)^{−1} Σ_{i=j}^{s+1} g^i.

Without loss of generality we will assume that the first instant of the change from the step r_0 to r_1 occurred just because the condition

||e^{s_0}|| = ||(s_0 + 1)^{−1} Σ_{i=0}^{s_0} g^i|| ≤ ε_0

is satisfied.
From the convexity of the function f(x) it is inferred that

f(x*) ≥ f(x^i) + (g^i, x* − x^i),  i = 0, 1, ..., s_0.   (5.1)

Summation of the inequalities (5.1) over i = 0, 1, ..., s_0 yields

min {f(x^i) : 0 ≤ i ≤ s_0} − f(x*) ≤ (s_0 + 1)^{−1} Σ_{i=0}^{s_0} (g^i, x^i − x^0) + (e^{s_0}, x^0 − x*).

Denote the expression (s_0 + 1)^{−1} Σ_{i=0}^{s_0} (g^i, x^i − x^0) by Δ_{s_0}. We have the obvious inequalities

min {f(x^i) : 0 ≤ i ≤ s_0} − f(x*) ≤ Δ_{s_0} + ε_0 ||x^0 − x*||,

where for s_0 < s ≤ s_1 the points x^s are related by x^{s+1} = x^s − r_1 g^s. For these values of s it is possible to derive the analogous estimate for the point

x̄^{s_1} ∈ {x : f(x) ≤ min [f(x^{s_0+1}), ..., f(x^{s_1})]}.

Thus, for s_k + 1 ≤ s ≤ s_{k+1} we have

min {f(x^i) : s_k < i ≤ s_{k+1}} − f(x*) ≤ Δ_k + ε_k ||x^{s_k+1} − x*||,

where

x̄^{s_{k+1}} ∈ {x : f(x) ≤ min [f(x^{s_k+1}), ..., f(x^{s_{k+1}})]}.

It is easily proved that Δ_k → 0.
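The quantities entering the estimate are all observable during the run except ||x^0 − x*||; a sketch, assuming a known a priori bound R on that distance:

import numpy as np

def a_posteriori_bound(xs, gs, eps_k, R):
    """Section 5 estimate for the record value: min_i f(x^i) - f(x*) is bounded
    by eps_k * R + Delta, with Delta = (s0+1)^{-1} sum_i (g^i, x^i - x^0) and
    R an assumed bound on ||x^0 - x*||."""
    x0 = xs[0]
    delta = np.mean([np.dot(g, x - x0) for x, g in zip(xs, gs)])
    return eps_k * R + delta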
THEOREM 5.1. Assume that the problem (*) is solved by Algorithm 1. Then the above inequalities hold for those instants s_k at which the step-size varies because the condition ||e^{s_k}|| ≤ ε_k is satisfied.
REMARK. It follows from Theorem 5.1 that the same estimate holds both for the subsequence of "records" {x̄^{s_k}} and for the Cesàro (averaged) subsequence {x̂^{s_k}}.
Let the problem (*) be solved by Algorithm 2, where the operation of averaging of the preceding subgradients is used. Denote the instants of changes in the step-size by s_i, i = 0, 1, 2, .... Suppose the first instant of the change from r_0 to r_1 takes place because the inequality ||e^{s_0}|| ≤ ε_0 holds. Examine the scheme of averaging by the "moving" average. We have, by the convexity of f(x),

f(x*) ≥ f(x^i) + (g^i, x* − x^i);

multiplying these inequalities by λ_{i,s} and summing, we obtain an estimate in terms of v^s. Designate the expression Σ_i λ_{i,s} (g^i, x^i − x^0) by Δ̄_s. Then for s ≤ K we have

min {f(x^i) : 0 ≤ i ≤ s} − f(x*) ≤ Δ̄_s + ε_0 ||x^0 − x*||.

For s > K we shall have the same estimate with the summation taken over i = s − K, ..., s. Thus, from this formula the following recommendation can be offered with respect to the selection of the parameters λ_{i,s}:

min {Σ_{i=0}^{s} λ_{i,s} (g^i, x^i − x^0) : λ_{i,s} ≥ 0, Σ_{i=0}^{s} λ_{i,s} = 1}.

The subgradient averaging thereby allows improving the a posteriori estimates of the solution accuracy. This formally substantiates that it is of advantage to introduce and study the operation of subgradient averaging. For an arbitrary instant of step-size variation s_i > K we can easily obtain the estimate

min {f(x^m) : s_i − K ≤ m ≤ s_i} − f(x*) ≤ Δ̄_{s_i} + ε_i ||x^{s_i−K} − x*||.   (5.9)
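Since the quantity in the above recommendation is linear in the weights, its minimum over the simplex is attained at a vertex; a sketch of this formal minimizer, assuming the inner products are stored during the run:

import numpy as np

def recommended_lambda(gs, xs):
    """Minimize sum_i lambda_i (g^i, x^i - x^0) over the simplex: the objective
    is linear in lambda, so all weight goes to the smallest inner product."""
    c = np.array([np.dot(g, x - xs[0]) for g, x in zip(gs, xs)])
    lam = np.zeros(len(c))
    lam[np.argmin(c)] = 1.0
    return lam

In practice the λ_{i,s} must also keep v^s a workable direction, so this vertex solution only indicates which subgradients deserve the larger weights.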
THEOREM 5.2. Let the problem (*) be solved by Algorithm 2 with the use of averaging scheme a). Then for the instants s_i for which ||e^{s_i}|| ≤ ε_i, inequality (5.9) holds. The scheme of averaging by the "weighted" average b) is treated in a similar way.
The a posteriori estimates of the solution accuracy attained for the adaptive subgradient methods can be extended to their stochastic finite-difference analogs with a minimum of alterations. The way of obtaining them is illustrated with Algorithm 3. We will use the notations introduced in Section 4. When proving Theorem 4.3, it is possible to demonstrate that the step-size r_i varies an infinite number of times. As Algorithm 3 converges with probability one, for almost all ω it is possible to indicate s̄(ω) such that f(x^s(ω)) ≤ f(x^0) + b for s ≥ s̄(ω). Therefore, for s ≥ s̄(ω) the step-size r_i varies because the condition

||ē^{s_i}|| ≤ ε_i

holds, where s_i ≥ ρ_i + j, ē^{s_i} = ē^{s_i−1} + (s_i − j)^{−1}(e^{s_i} − ē^{s_i−1}), the sequences {ε_i} and {ρ_i} comply with the properties formulated in Theorem 4.3, and j is determined by s_i.

Consider the event that the averaged estimate ē^{s_i} deviates from the corresponding averaged subgradient of the smoothed function f(x, i), where s_t is the instant of step-size change that precedes s_i. There exists a constant 0 < C < ∞ such that this deviation is small with probability greater than 1 − Cδ_i. Then for the instant s_i the corresponding a posteriori inequality holds with the same probability.
Theorem 5.3 is readily formulated and proved: assume that the problem (*) is solved by Algorithm 3. Then for almost all ω it is possible to isolate a subsequence of points {x^{s_i}(ω)} for which, with probability greater than 1 − Cδ_i, inequalities analogous to (5.9) hold, with f(x*) replaced by

f*_{i−1} = min {f(x, i−1) : x ∈ E^n}

and x* by x*_{i−1} ∈ Argmin {f(x, i−1) : x ∈ E^n}.
REFERENCES

1. Ajzerman, M.A., E.M. Braverman and L.I. Rozonoer: Potential Functions Method in Machine Learning Theory. M.: Nauka, 1970, p. 384.
2. Glushkova, O.V. and A.M. Gupal: About Nonmonotonic Methods of Nonsmooth Function Minimization with Averaging of Subgradients. Kibernetika, 1980, No. 6, pp. 128-129.
3. Gupal, A.M. and L.G. Bazhenov: Stochastic Analog to Conjugate Gradient Method. Kibernetika, 1972, No. 1, pp. 125-126.
4. Gupal, A.M.: Stochastic Methods of Solution of Nonsmooth Extremum Problems. Kiev: Naukova dumka, 1979, p. 152.
5. Dem'janov, V.F. and V.N. Malozemov: Introduction to Minimax. M.: Nauka, 1972, p. 368.
6. Eremin, I.I.: The Relaxation Method of Solving Systems of Inequalities with Convex Functions on the Left Side. Dokl. AN SSSR, 1965, Vol. 160, No. 5, pp. 994-996.
7. Ermol'ev, Ju.M.: Methods of Solution of Nonlinear Extremum Problems. Kibernetika, 1966, No. 4, pp. 1-17.
8. Ermol'ev, Ju.M. and N.Z. Shor: On Minimization of Nondifferentiable Functions. Kibernetika, 1967, No. 1, pp. 101-102.
9. Ermol'ev, Ju.M. and Z.V. Nekrylova: Some Methods of Stochastic Optimization. Kibernetika, 1966, No. 6, pp. 96-98.
10. Ermol'ev, Ju.M.: On the Method of Generalized Stochastic Gradients and Stochastic Quasi-Fejer Sequences. Kibernetika, 1969, No. 2, pp. 73-83.
11. Ermol'ev, Ju.M.: On One General Problem of Stochastic Programming. Kibernetika, 1971, No. 3, pp. 47-50.
12. Ermol'ev, Ju.M.: Stochastic Programming Methods. M.: Nauka, 1976, p. 240.
13. Ermol'ev, Ju.M. and Ju.M. Kaniovskij: Asymptotic Properties of Some Stochastic Programming Methods with Constant Step-Size. Zhurn. Vych. Mat. i Mat. Fiziki, 1979, Vol. 19, No. 2, pp. 356-366.
14. Kaniovskij, Ju.M., P.S. Knopov and Z.V. Nekrylova: Limit Theorems for Stochastic Programming. Kiev: Naukova dumka, 1980, p. 156.
15. Loève, M.: Probability Theory. M.: Izd-vo inostr. lit., 1967, p. 720.
16. Norkin, V.N.: Method of Nondifferentiable Function Minimization with Averaging of Generalized Gradients. Kibernetika, 1980, No. 6, pp. 86-89, 102.
17. Nurminskij, E.A.: Convergence Conditions for Nonlinear Programming Algorithms. Kibernetika, 1973, No. 1, pp. 122-125.
18. Nurminskij, E.A. and A.A. Zhelikovskij: Investigation of One Regulation of Step in Quasi-Gradient Method for Minimizing Weakly Convex Functions. Kibernetika, 1974, No. 6, pp. 101-105.
19. Poljak, B.T.: Generalized Method of Solving Extremum Problems. Dokl. AN SSSR, 1967, Vol. 174, No. 1, pp. 33-36.
20. Poljak, B.T.: Minimization of Nonsmooth Functionals. Zhurn. Vych. Mat. i Mat. Fiziki, 1969, Vol. 9, No. 3, pp. 509-521.
21. Tsypkin, Ja.Z.: Adaptation and Learning in Automatic Systems. M.: Nauka, 1968.
22. Tsypkin, Ja.Z.: Generalized Learning Algorithms. Avtomatika i telemekhanika, 1970, No. 1, pp. 97-103.
23. Chepurnoj, N.D.: One Successive Step-Size Regulation for Quasi-Gradient Method of Weakly Convex Function Minimization. In: Issledovanie Operacij i ASU. Kiev: Vyshcha shkola, 1981, No. 19, pp. 13-15.
24. Chepurnoj, N.D.: Averaged Quasi-Gradient Method with Successive Step-Size Regulation to Minimize Weakly Convex Functions. Kibernetika, 1981, No. 6, pp. 131-132.
25. Chepurnoj, N.D.: One Successive Step-Size Regulation in Stochastic Method of Nonsmooth Function Minimization. Kibernetika, 1982, No. 4, pp. 127-129.
26. Shor, N.Z.: Application of Gradient Descent Method for Solution of Network Transportation Problem. In: Materialy nauchnogo seminara po prikladnym voprosam kibernetiki i issledovanija operacij. Nauchnyj sovet po kibernetike IK AN USSR, Kiev, 1962, vypusk 1, pp. 9-17.
27. Shor, N.Z.: Investigation of Space Dilation Operation in Convex Function Minimization Problems. Kibernetika, 1970, No. 1, pp. 6-12.
28. Shor, N.Z. and N.G. Zhurbenko: Minimization Method Using Space Dilation in the Direction of the Difference of Two Successive Gradients. Kibernetika, 1971, No. 3, pp. 51-59.
29. Shor, N.Z.: Nondifferentiable Function Minimization Methods and Their Applications. Kiev: Naukova dumka, 1979, p. 200.
30. Dem'janov, V.F.: Algorithms for Some Minimax Problems. Journal of Computer and Systems Science, 1968, 2, No. 4, pp. 342-380.
31. Lemarechal, C.: An Algorithm for Minimizing Convex Functions. In: Information Processing '74 (ed. Rosenfeld), North-Holland, Amsterdam, 1974, pp. 552-556.
32. Lemarechal, C.: Nondifferentiable Optimization: Subgradient and Epsilon Subgradient Methods. Lecture Notes in Economics and Mathematical Systems (ed. W. Oettli), 117, Springer, Berlin, 1975, pp. 191-199.
33. Bertsekas, D.P. and S.K. Mitter: A Descent Numerical Method for Optimization Problems with Nondifferentiable Cost Functions. SIAM Journal on Control, 1973, 11, No. 4, pp. 637-652.
34. Wolfe, P.: A Method of Conjugate Subgradients for Minimizing Nondifferentiable Functions. In: Nondifferentiable Optimization (eds. M.L. Balinski, P. Wolfe), Mathematical Programming Study 3, North-Holland, Amsterdam, 1975, pp. 145-173.