UDC 004.4
Algorithms of Inertial Mirror Descent
in Convex Problems of Stochastic Optimization
A. V. Nazin∗
∗ V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences,
65 Profsoyuznaya str., Moscow 117997, Russia
Abstract. The goal is to modify the known mirror descent (MD) method in convex optimization, which was proposed by A.S. Nemirovsky and D.B. Yudin in 1979 and generalizes the standard gradient method. First, the paper illustrates the idea of the new, so-called inertial MD method on the example of a deterministic optimization problem in continuous time. In particular, in the Euclidean case the heavy ball method of B.T. Polyak is recovered. It is noted that the new method does not use additional averaging of points. Then a discrete algorithm of inertial MD is described, and a theorem giving an upper bound on the error in the objective function is proved.
Keywords: stochastic optimization problem, convex optimization, mirror descent, heavy ball method, inertial MD method.
1. Introduction
Many problems of an applied nature can formally be reduced to the minimization problem f(x) → min_{x∈X}, where the a priori unknown function f : X → R is convex and the set X is a convex compact in R^N; see, e.g., [1], [2], where both problem statements and optimization methods are described. In such problems, in order to sequentially estimate the minimum point x∗ ∈ Argmin_{x∈X} f(x), it is assumed that at each time t = 1, 2, . . . one can obtain a subgradient gt = gt(xt−1) ∈ ∂f(xt−1), or its stochastic version ut(xt−1) = gt + ξt, at the current point xt−1 ∈ X; here ∂f(x) denotes the subdifferential of f at the point x, and ξt represents a disturbance of the subgradient.¹ The foregoing assumes that the minimized function is known only up to its membership in a given class F of convex functions (possibly under additional smoothness properties); in addition, it is assumed that at each time t ≥ 1 one can query the oracle at the current input point xt−1 and obtain a stochastic subgradient ut(xt−1) as the output.
In [3] it is shown that in convex problems, with the "correct" choice of the MD method parameters, the latter is an efficient method in the sense that for each t > 1 the upper and lower bounds on the error (in the objective function)

f(x̂t) − min_{x∈X} f(x)     (1)

coincide up to an absolute constant; here x̂t is the "final" estimate of the minimum point by time t, based on the previous observations of the subgradients at the obtained points xk, k = 0, 1, . . . , t − 1.

¹ We mean the concept of a first-order oracle in the optimization problem under consideration (either a deterministic problem, when ξt ≡ 0, or a stochastic one, with E{ξt} ≡ 0) [3].
Often the arithmetic mean of the preceding points is taken as the estimate x̂t:

x̂t = (1/t) Σ_{k=0}^{t−1} xk .
We note that the fundamentally new feature in the structure of the MD method, in comparison with the classical methods of gradient type, is the (explicit or implicit) presence of two spaces: the primal space E ⊃ X with an initial norm ‖·‖ and the conjugate space E∗ with the dual norm ‖·‖∗; for details, see [7], section 3.1. In the particular, "Euclidean" case E = E∗, when both norms are Euclidean and the set X = R^N is the whole primal space, the MD method turns into the subgradient method

xt = xt−1 − γt ut(xt−1),   t = 1, 2, . . .
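In the Euclidean case the subgradient method above is immediate to implement. The following sketch applies it with a stochastic oracle to a hypothetical quadratic objective; the target point, noise level, and the step sizes γt = 1/√t are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 3.0])      # hypothetical minimum point x*

def stochastic_subgradient(x):
    # noisy gradient of f(x) = 0.5 * ||x - target||^2, i.e. u_t(x) = g_t + xi_t
    return (x - target) + 0.1 * rng.standard_normal(x.shape)

x = np.zeros(3)
for t in range(1, 2001):
    gamma = 1.0 / np.sqrt(t)             # illustrative step-size schedule
    x = x - gamma * stochastic_subgradient(x)

print(np.round(x, 1))                    # x approaches target
```

With X = R^N no projection is needed; for a compact X each update would be followed by a projection onto X.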
Recall that introducing an additional inertial term into the gradient method can improve the convergence properties of the algorithm. This refers to the heavy ball method proposed by B.T. Polyak in 1964 [10] (see also [11]). Hence it is reasonable to generalize the MDM by adding an appropriate inertial term. Sections 3 and 4 are devoted to the realization and study of this idea.
The paper has the following structure. In section 2, the convex stochastic optimization problem and the goal of the investigation are formulated. Section 3 then outlines the idea of the inertial mirror descent method in continuous time and carries out a preliminary investigation, determining the corresponding "moment of inertia force" and showing the scheme for obtaining the upper bound on the error (1). Further, in section 4, a discrete algorithm of inertial mirror descent (IMD) is described and the main results are formulated. Finally, the conclusion, acknowledgments, and the list of references are given.
2. Stochastic optimization problem
Consider the well-known minimization problem

f(x) ≜ E Q(x, Z) → min_{x∈X} ,     (2)

where the loss function Q : X × Z → R+ contains a random variable Z with unknown distribution on a space Z, E denotes mathematical expectation, the set X ⊂ R^N is a given convex compact in the N-dimensional space, and the random function Q(·, Z) : X → R+ is convex a.s. on X.
Let an i.i.d. sample (Z1, . . . , Zt−1) be given, where all the Zi have the same distribution on Z as Z. Introduce the notation for the stochastic subgradients

uk(x) = ∇x Q(x, Zk) ,   k = 1, 2, . . . ,     (3)

so that E uk(x) ∈ ∂f(x) for all x ∈ X. The goal is to construct and study novel recursive MD algorithms meant for the minimization (2) and using the stochastic subgradients ut(xt−1) from (3) at the current points x = xt−1 ∈ X, t = 1, 2, . . . .
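To make the oracle model (3) concrete, the following sketch uses a hypothetical least-squares loss Q(x, Z) = (⟨a, x⟩ − y)² with Z = (a, y); the data distribution and the query point x = 0 are illustrative assumptions. Averaging many oracle answers recovers the true gradient of f = E Q:

```python
import numpy as np

rng = np.random.default_rng(1)
x_true = np.array([1.0, 0.0, -1.0])      # hypothetical regression vector

def sample_Z():
    # Z = (a, y): standard normal features, noisy linear response
    a = rng.standard_normal(3)
    return a, a @ x_true + 0.1 * rng.standard_normal()

def stochastic_subgradient(x, Z):
    a, y = Z
    return 2.0 * (a @ x - y) * a         # u(x) = grad_x Q(x, Z), cf. (3)

x = np.zeros(3)
g_hat = np.mean([stochastic_subgradient(x, sample_Z()) for _ in range(20000)],
                axis=0)
print(np.round(g_hat, 1))                # approximates grad f(0) = -2 * x_true
```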
3. The idea of the method of inertial mirror descent
In this section, let f : R^N → R be a convex, continuously differentiable function having a unique minimum point x∗ ∈ Argmin f(x) and the minimal value f∗ = f(x∗). Consider the continuous-time algorithm which extends the MDM:

ζ̇(t) = −∇f(x(t)),   t ≥ 0,   ζ(0) = 0,     (4)
µt ẋ(t) + x(t) = ∇W(ζ(t)),   x(0) = ∇W(ζ(0)) .     (5)

The functional parameter in (5) is a convex, continuously differentiable function W : R^N → R+ having the conjugate function

V(x) = sup_{ζ∈R^N} {⟨ζ, x⟩ − W(ζ)}.     (6)
Let W (0) = 0, V (0) = 0, and ∇W (0) = 0 for simplicity.
Remark. With the parameter µt ≡ 0 in (5), algorithm (4)–(5) represents the MDM (in continuous time) [3]; in particular, the identity map ∇W(ζ) ≡ ζ and µt ≡ 0 lead to the standard gradient method

ẋ(t) = −∇f(x(t)),   t ≥ 0.

With µt ≡ µ > 0 and ∇W(ζ) ≡ ζ, algorithm (4)–(5) leads to the continuous method of heavy ball (MHB) [11]

µẍ(t) + ẋ(t) = −∇f(x(t)),   t ≥ 0.
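A standard discretization of the heavy ball dynamics is the two-term recursion x_{t+1} = x_t − γ∇f(x_t) + β(x_t − x_{t−1}). The sketch below compares it with plain gradient descent on an ill-conditioned quadratic; the matrix, step size γ, and momentum β are illustrative assumptions:

```python
import numpy as np

# f(x) = 0.5 * x^T diag(d) x, condition number 100
d = np.array([1.0, 100.0])
grad = lambda x: d * x

x_gd = x_hb = x_prev = np.array([1.0, 1.0])
gamma, beta = 0.018, 0.8                  # assumed step size and momentum
for _ in range(200):
    x_gd = x_gd - gamma * grad(x_gd)      # plain gradient descent
    # heavy ball: gradient step plus inertial term beta * (x_t - x_{t-1})
    x_hb, x_prev = x_hb - gamma * grad(x_hb) + beta * (x_hb - x_prev), x_hb

print(np.linalg.norm(x_gd), np.linalg.norm(x_hb))   # inertia wins by many orders
```

For a quadratic with extreme curvatures µ and L, the classical tuning is γ = 4/(√L + √µ)² and β = ((√L − √µ)/(√L + √µ))² [11].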
Further, we assume that the parameter µt ≥ 0 is differentiable, and we call method (4)–(5) the Method of Inertial Mirror Descent (MIDM).
Assume that a solution {x(t), t ≥ 0} to the system of equations (4)–(5) exists. Consider the function W∗(ζ) = W(ζ) − ⟨ζ, x∗⟩, ζ ∈ R^N, as a candidate Lyapunov function. Its derivative along a trajectory of system (4)–(5) is

d/dt W∗(ζ(t)) = ⟨ζ̇, ∇W − x∗⟩ = −⟨∇f(x), µt ẋ + x − x∗⟩ ≤ f(x∗) − f(x(t)) − µt (d/dt)[f(x(t)) − f∗],     (7)

where the last inequality results from the convexity of f(·). Now, integrating over the interval [0, t] with W∗(0) = 0, we obtain

∫₀ᵗ [f(x(s)) − f∗] ds ≤ −W∗(ζ(t)) − µs [f(x(s)) − f∗] |₀ᵗ + ∫₀ᵗ [f(x(s)) − f∗] µ̇s ds,

where the two last terms in the RHS are obtained by integration by parts. Taking (6) into account, which gives −W∗(ζ) ≤ V(x∗), we continue:

(1 − sup_{s∈[0,t]} µ̇s) ∫₀ᵗ [f(x(s)) − f∗] ds ≤ V(x∗) − µt [f(x(t)) − f∗] + µ0 [f(x(0)) − f∗].     (8)

Therefore, it is reasonable to introduce the following constraints on the parameter µt ≥ 0: µ0 = 0 and µ̇t ≤ 1 for all t > 0, leading to the inequality f(x(t)) − f∗ ≤ V(x∗)/µt. Maximizing µt under the constraints above, we get µt = t, t ≥ 0.
The related (continuous) IMD algorithm

ζ̇(t) = −∇f(x(t)),   t ≥ 0,   ζ(0) = 0,     (9)
t ẋ(t) + x(t) = ∇W(ζ(t)),     (10)

thus enjoys the upper bound f(x(t)) − f∗ ≤ V(x∗) t⁻¹, ∀t > 0.

4. Algorithm IMD. Main results
Now return to the stochastic optimization problem of section 2. Let ‖·‖ be a norm in the primal space E = R^N and ‖·‖∗ the related norm in the dual space E∗ = R^N; the set X ⊂ E is a convex compact.

Assumption (L). The convex function V : X → R+ is such that its β-conjugate Wβ is continuously differentiable on E∗ and satisfies the Lipschitz condition

‖∇Wβ(ζ) − ∇Wβ(ζ̃)‖ ≤ (αβ)⁻¹ ‖ζ − ζ̃‖∗ ,   ∀ ζ, ζ̃ ∈ E∗, β > 0,

where α is a positive constant independent of β.
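Assumption (L) can be checked numerically for concrete proxy-functions. A standard choice, assumed here for illustration and not spelled out in the text, is the entropy V(x) = Σᵢ xᵢ ln(N xᵢ) on the probability simplex with the ℓ1 primal norm and ℓ∞ dual norm, for which α = 1 and −∇Wβ(ζ) = softmax(−ζ/β):

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# -grad W_beta for the entropic proxy-function on the simplex (assumed choice)
grad_W = lambda z, beta: -softmax(-z / beta)

beta, worst = 2.0, 0.0
for _ in range(1000):
    z1, z2 = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.abs(grad_W(z1, beta) - grad_W(z2, beta)).sum()   # l1 norm
    rhs = np.abs(z1 - z2).max() / beta                        # (alpha*beta)^-1 * l_inf norm
    worst = max(worst, lhs / rhs)
print(worst)   # stays below 1, consistent with Assumption (L) for alpha = 1
```

The bound holds because the entropy is 1-strongly convex with respect to the ℓ1 norm on the simplex (Pinsker's inequality).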
Consider now discrete time t ∈ Z+. Write a discrete version of the IMD algorithm (4)–(5), using the stochastic subgradients (3) instead of the gradients ∇f(·):

τt = τt−1 + γt ,   t ≥ 1,   τ0 = 0,     (11)
ζt = ζt−1 + γt ut(xt−1),   ζ0 = 0,     (12)
τt (xt − xt−1)/γt+1 + xt = −∇Wβt(ζt),   x0 = −∇Wβ0(ζ0).     (13)

Here the function Wβ is defined by the proxy-function V : X → R+ via a Legendre–Fenchel type transform [7], Wβ(ζ) = sup_{x∈X} {−ζᵀx − βV(x)}, ζ ∈ E∗.
Remark. Equation (13) may be written as

xt = τt/(τt + γt+1) · xt−1 − γt+1/(τt + γt+1) · ∇Wβt(ζt).

Since the vectors [−∇Wβt(ζt)] ∈ X for each t ≥ 0, this representation of xt as a convex combination shows that xt ∈ X by induction.

Further, let the sequences (γi)i≥1 and (βi)i≥1 be of the form

γi ≡ 1 ,   βi = β0 √(i + 1) ,   i = 1, 2, . . . ,   β0 > 0.     (14)

Then the system of equations (11)–(13) leads to the IMD algorithm (cf. [12]):
ζt = ζt−1 + ut(xt−1),   ζ0 = 0,   x0 = −∇Wβ0(ζ0),     (15)
xt = xt−1 − (1/(t + 1)) (xt−1 + ∇Wβt(ζt)) ,   t ≥ 1.     (16)
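The discrete IMD recursion just displayed is easy to run once −∇Wβ is available in closed form. The sketch below assumes the entropic proxy-function on the probability simplex (so −∇Wβ(ζ) = softmax(−ζ/β)) and a hypothetical linear objective f(x) = ⟨c, x⟩ observed through noisy subgradients; all numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

c = np.array([0.3, 0.1, 0.5, 0.2])       # hypothetical costs; minimum at index 1

def u(x):
    # stochastic subgradient of f(x) = <c, x> on the simplex
    return c + 0.1 * rng.standard_normal(c.shape)

beta0 = 1.0
zeta = np.zeros(len(c))
x = softmax(-zeta / beta0)               # x0, here the uniform point
for t in range(1, 3001):
    zeta = zeta + u(x)                   # dual accumulation of subgradients
    beta_t = beta0 * np.sqrt(t + 1)      # schedule beta_i = beta0 * sqrt(i + 1)
    prox_point = softmax(-zeta / beta_t) # equals -grad W_{beta_t}(zeta_t)
    x = x + (prox_point - x) / (t + 1)   # primal convex-combination update

print(np.round(x, 2))   # mass concentrates on the cheapest coordinate
```

The last line is the convex-combination form of the xt-update, which also makes it evident that xt stays in the simplex without any extra averaging or projection.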
Theorem 1. Let X be a convex closed set in R^N, let the loss function Q(·, ·) satisfy the conditions of section 2 and, moreover, sup_{x∈X} E‖∇x Q(x, Z)‖∗² ≤ L²_{X,Q}, where the constant L_{X,Q} ∈ (0, ∞). Let V be a proxy-function on X with the parameter α > 0 from Assumption (L), and let there exist a minimum point x∗ ∈ Argmin_{x∈X} f(x). Then, for any t ≥ 1, the estimate xt defined by algorithm (15), (16) with the stochastic subgradients (3) and the sequence (βi)i≥1 from (14) with arbitrary β0 > 0 satisfies the inequality

E f(xt) − min_{x∈X} f(x) ≤ ( β0 V(x∗) + L²_{X,Q}/(αβ0) ) · √(t + 2)/(t + 1).

If a constant V̄ is such that max_{x∈X} V(x) ≤ V̄, and β0 = L_{X,Q} (α V̄)^{−1/2}, then

E f(xt) − min_{x∈X} f(x) ≤ 2 L_{X,Q} α^{−1/2} V̄^{1/2} · √(t + 2)/(t + 1).
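The choice of β0 in the last statement of Theorem 1 is exactly the minimizer of the coefficient β0 V̄ + L²/(αβ0) in the first bound, which can be verified numerically (L, α, V̄ below are arbitrary positive test values):

```python
import numpy as np

L, alpha, Vbar = 3.0, 0.5, 2.0                  # arbitrary positive test values
phi = lambda b: b * Vbar + L**2 / (alpha * b)   # coefficient of the first bound

b_star = L / np.sqrt(alpha * Vbar)              # beta0 from Theorem 1
grid = np.linspace(0.1, 20.0, 200001)
b_grid = grid[np.argmin(phi(grid))]             # brute-force minimizer on a grid
print(b_star, b_grid, phi(b_star))              # phi(b_star) = 2*L*sqrt(Vbar/alpha)
```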
5. Conclusion
We considered the well-known convex problem of stochastic optimization, with the goal of constructing and investigating novel recursive algorithms of mirror descent type which generalize both the heavy ball method and the MDM. It turned out that the new method does not require additional averaging of the points input to the oracle, while it ensures the same upper bound on the objective function as the previous, efficient MD method (on the class of problems considered) [3], [7]. Further research on other classes of objective functions and other requirements on the oracle seems interesting.
Acknowledgments
The work was partially supported by the Russian Science Foundation, grant No. 16-11-10015.
References
1. Boyd S., Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press, 2004.
2. Nesterov Yu. Introductory Lectures on Convex Optimization. Boston: Kluwer, 2004.
3. Nemirovskii A.S., Yudin D.B. Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley, 1983.
4. Ben-Tal A., Margalit T., Nemirovski A. The ordered subsets mirror descent optimization method with applications to tomography // SIAM J. Optim. 2001. V. 12. No. 1. P. 79–108.
5. Beck A., Teboulle M. Mirror descent and nonlinear projected subgradient methods for convex optimization // Oper. Res. Lett. 2003. V. 31. No. 3. P. 167–175.
6. Nesterov Yu. Primal-dual subgradient methods for convex problems // Mathematical Programming. 2007. DOI: 10.1007/s10107-007-0149-x.
7. Juditsky A.B., Nazin A.V., Tsybakov A.B., Vayatis N. Recursive aggregation of estimators by the mirror descent algorithm with averaging // Problems of Information Transmission. 2005. V. 41. No. 4. P. 368–384.
8. Nemirovski A., Juditsky A., Lan G., Shapiro A. Robust stochastic approximation approach to stochastic programming // SIAM J. Optim. 2009. V. 19. No. 4. P. 1574–1609.
9. Rockafellar R.T., Wets R.J.B. Variational Analysis. New York: Springer, 1998.
10. Polyak B.T. Some methods of speeding up the convergence of iteration methods // Zh. Vychisl. Mat. Mat. Fiz. 1964. V. 4. No. 5. P. 791–803.
11. Polyak B.T. Introduction to Optimization. New York: Optimization Software Inc., 1987.
12. Nesterov Yu., Shikhman V. Quasi-monotone subgradient methods for nonsmooth convex minimization // J. Optim. Theory Appl. 2015. V. 165. P. 917–940. DOI: 10.1007/s10957-014-0677-5.