Scaled First–Order Methods for a Class of Large–Scale
Constrained Least Squares Problems
Vanna Lisa Coli 1,a), Valeria Ruggiero 2,b) and Luca Zanni 1,c)
1 University of Modena and Reggio Emilia, Via Campi 213/b, Modena, Italy
2 University of Ferrara, Via Saragat 1, Ferrara, Italy
a) Corresponding author: [email protected]
b) [email protected]
c) [email protected]
Abstract. Typical applications in signal and image processing involve the numerical solution of large–scale linear least squares
problems with simple constraints, related to an m × n nonnegative matrix A, with m ≪ n. When the size of A is such that the matrix
is not available in memory and only the operators of the matrix-vector products involving A and A^T can be computed, forward–
backward methods combined with suitable accelerating techniques are very effective; in particular, the gradient projection methods
can be improved by suitable step–length rules or by an extrapolation/inertial step. In this work, we propose a further acceleration
technique for both schemes, based on the use of variable metrics tailored for the considered problems. The numerical effectiveness
of the proposed approach is evaluated on randomly generated test problems and real data arising from a problem of fibre orientation
estimation in diffusion MRI.
INTRODUCTION
Many inverse problems arising in signal and imaging restoration can be reduced to the numerical solution of a class
of large–scale constrained least squares problems, stated as
\min_{x\in\Omega} f(x) \equiv \tfrac{1}{2}\,\|Ax - b\|_2^2 \qquad (1)
where A is an m × n matrix with nonnegative entries, b ∈ R^m, x ∈ R^n and Ω is a nonempty, convex and closed subset
of R^n, such that the projection P_Ω(z) of a vector z ∈ R^n onto Ω can be performed by means of inexpensive O(n)
algorithms. Significant examples of such feasible regions are bound or box constraints, or their combination with a
single linear equality/inequality constraint. In this work, we are interested in the case in which the size n of the problem
is very large, m ≪ n, and only the operators of the matrix-vector products involving A and A^T are available. These
assumptions occur, for example, in many variational formulations of large–scale imaging problems, for which popular
state of the art approaches based on the availability of the matrix A in memory are not practicable. Conversely, general
forward–backward schemes exploiting only the objective gradient and the projection onto the feasible region appear
very promising. Indeed, the convergence rate of these methods can be improved by means of different strategies that
impact on the key components of the iterate, such as the step–length selection rules [1] or the extrapolation/inertial
steps [2, 3]. Starting from very recent advances on these acceleration ideas, we propose to introduce variable metrics
[4] tailored for the problem (1), that combine well with state of the art step–length selection rules and extrapolation
techniques.
STATE OF THE ART FIRST-ORDER METHODS
Let iΩ (x) denote the indicator function of the set Ω. Problem (1) can be formulated as follows:
\min_{x} F(x) \equiv f(x) + i_\Omega(x). \qquad (2)
Here, F(x) is a convex function given by the sum of a differentiable term f (x) with L-Lipschitz continuous gradient
and a nondifferentiable one, iΩ (x); large–size problems of this kind can be efficiently solved by forward–backward
methods [5, 6], whose general iteration for (2) is given by
x^{(k+1)} = x^{(k)} + \lambda_k \big( P_\Omega\big(x^{(k)} - \alpha_k \nabla f(x^{(k)})\big) - x^{(k)} \big).
Suitable choices of the positive parameters α_k and λ_k allow one to control the convergence and the effectiveness of these schemes.
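For concreteness, a minimal NumPy sketch of this iteration for problem (1) with Ω = {x ≥ 0} is reported below; the choice λ_k = 1 and the constant step–length α = 1/‖A^T A‖_2 are illustrative defaults, not the acceleration rules discussed in what follows.

import numpy as np

def forward_backward(A, b, x0, n_iter=500):
    # Basic forward-backward iteration for min_{x >= 0} 0.5*||Ax - b||^2,
    # with lambda_k = 1 and constant alpha = 1/L, L = ||A^T A||_2 (illustrative).
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x = np.maximum(np.asarray(x0, dtype=float), 0.0)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)               # forward (gradient) step
        x = np.maximum(x - alpha * grad, 0.0)  # backward step: projection onto x >= 0
    return x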
In the recent literature, we can find two different approaches aiming at improving the convergence speed of forward–
backward schemes. The first is the standard formulation of a Gradient Projection (GP) method [7], while the second
is a variant that combines an extrapolation step with the basic forward–backward iteration, yielding a multistep algorithm known as the heavy ball or inertial method [8], denoted in the following as GP Ex. These schemes clearly highlight
the strategies needed for an efficient implementation of the basic gradient projection idea: GP
exploits an adaptive updating rule for the step–length parameter αk and a line-search parameter λk for controlling the
descent of the objective function, while GP Ex uses the constant step–length αk = α but performs the projection step
starting from a vector obtained by an extrapolation step. One of the most effective rules for updating the step–length
parameter in gradient–type methods is the strategy proposed by Fletcher in [9]. This step–length selection, thought
for the case of unconstrained optimization problems, is aimed at approximating the inverse of the eigenvalues of the
Hessian matrix, by exploiting only the gradients computed in a prefixed small number of consecutive iterations. In
[1], a generalization of Fletcher's step–length selection to the case of gradient–type methods for constrained
problems is introduced and numerical evidence confirms that GP equipped with this step–length selection exhibits
a remarkable performance gain with respect to the popular Barzilai–Borwein (BB) updating rule [10]. Taking into
account that the objective function of (1) is convex and its gradient is L–Lipschitz continuous, with L = ‖A^T A‖_2, the
convergence of the sequence {x^{(k)}} to a solution x* is ensured and f(x^{(k)}) − f(x*) = O(1/k) [11, 12].
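As a point of reference, the BB rule mentioned above admits a very short implementation (a minimal sketch; the safeguarding interval is an assumption, and Fletcher's limited–memory rule of [1, 9] is considerably more elaborate and is not reproduced here).

import numpy as np

def bb1_steplength(x, x_old, g, g_old, alpha_min=1e-10, alpha_max=1e10):
    # First Barzilai-Borwein step-length alpha = (s^T s)/(s^T y), safeguarded
    # in [alpha_min, alpha_max]; the safeguard interval is an illustrative choice.
    s = x - x_old          # difference of consecutive iterates
    y = g - g_old          # difference of consecutive gradients
    sy = float(s @ y)
    if sy <= 0.0:          # curvature not positive: fall back to the upper bound
        return alpha_max
    return float(min(max((s @ s) / sy, alpha_min), alpha_max))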
The acceleration strategy at the basis of the GP Ex method takes advantage of an extrapolation step from x^{(k)} along
the direction (x^{(k)} − x^{(k−1)}), with a proper choice of the step–size along this direction, providing for (1) the following
scheme:
y^{(k)} = x^{(k)} + \beta_k (x^{(k)} - x^{(k-1)}),
x^{(k+1)} = P_\Omega\big(y^{(k)} - \alpha \nabla f(y^{(k)})\big).
In [2, 8], the convergence of the general scheme is investigated, by showing that for α ≤ 1/L and a suitable sequence of
positive parameters {β_k} (with lim_{k→∞} β_k = 1) one has F(x^{(k)}) − F(x*) = O(1/k²). Recently, under additional assumptions
on the sequences {β_k}, Chambolle and Dossal in [13] proved the convergence of the iterates {x^{(k)}} to a solution x*; in
particular, the choice β_0 = 0, β_k = (k−1)/(k+a), k ≥ 1, with a ≥ 2, preserves the convergence rate O(1/k²) of {F(x^{(k)})} and for
a > 2 enables the convergence of the iterates to a solution of the problem.
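A minimal sketch of the GP Ex scheme for (1) with Ω = {x ≥ 0} and the Chambolle–Dossal parameters could read as follows; the value a = 3 and the constant step α = 1/L are illustrative choices.

import numpy as np

def gp_ex(A, b, x0, a=3.0, n_iter=500):
    # Gradient projection with extrapolation for min_{x >= 0} 0.5*||Ax - b||^2,
    # with beta_k = (k - 1)/(k + a) (so beta_1 = 0) and constant alpha = 1/L.
    L = np.linalg.norm(A.T @ A, 2)
    alpha = 1.0 / L
    x_old = np.maximum(np.asarray(x0, dtype=float), 0.0)
    x = x_old.copy()
    for k in range(1, n_iter + 1):
        beta = (k - 1.0) / (k + a)
        y = x + beta * (x - x_old)                            # extrapolation step
        x_old = x
        x = np.maximum(y - alpha * (A.T @ (A @ y - b)), 0.0)  # projected gradient step
    return x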
The above GP and GP Ex approaches have been successfully applied to many challenging problems and several ideas
for achieving further performance improvements have been recently investigated. In the next section, we introduce a
generalization of these methods based on the use of scaled gradient directions tailored for problem (1).
NEW SCALED FIRST-ORDER METHODS
The introduction of scaled gradient directions in GP and GP Ex leads to the schemes described in Algorithm SGP
[14, 12] and Scaled GP Ex [4], respectively. In both algorithms, D_ρ denotes the set of positive definite matrices D
with eigenvalues τ_j such that 0 < 1/ρ ≤ τ_j ≤ ρ, j = 1, . . . , n, and P_{Ω,D}(v) = argmin_{u∈Ω} (u − v)^T D (u − v).
The convergence properties previously described for GP and GP Ex have been recently proved also for their scaled
versions, under suitable assumptions on the sequence of matrices {Di } that induce a variable metric at any iteration
(see [12, 4] for details). These assumptions are satisfied when, for any i, the eigenvalues τ_j^{(i)} of D_i are such that
0 < 1/ρ_i ≤ τ_j^{(i)} ≤ ρ_i, j = 1, . . . , n, and ρ_i² = 1 + θ_i, with Σ_{i=0}^∞ θ_i < ∞. For the convergence of the iterates of Scaled
GP Ex, it is also required that {θ_i} = O(1/i^p), p > 2.
In the applications, to avoid significant computational costs, the matrices D_i are chosen as diagonal matrices, with
τ_j^{(i)} ≡ (D_i)_{j,j}. Furthermore, following a widely used technique for defining scaling directions in gradient methods for
nonnegative least squares problems [15, 16, 17], we equip the SGP and the Scaled GP Ex algorithms with the following
scaling strategy:
(D_i)_{j,j} = \max\left\{ \frac{1}{\rho_i},\ \min\left\{ \rho_i,\ \frac{(x^{(i)})_j}{(A^T A\, x^{(i)})_j} \right\} \right\}, \quad j = 1,\dots,n, \qquad \rho_i = \sqrt{1 + \frac{\gamma}{i^p}},
with γ > 0 and p > 2.
TABLE 1. Algorithm SGP (Scaled Gradient Projection Method).
Initialize: x^{(0)} ∈ Ω, δ, σ ∈ (0, 1), 0 < α_min ≤ α_max, α_0 ∈ [α_min, α_max], D_0 ∈ D_{ρ_0};
for i = 0, 1, . . .
    d^{(i)} = P_{Ω,D_i^{-1}}( x^{(i)} − α_i D_i ∇f(x^{(i)}) ) − x^{(i)};        (scaled gradient projection step)
    λ_i = 1;
    while f(x^{(i)} + λ_i d^{(i)}) > f(x^{(i)}) + σ λ_i ∇f(x^{(i)})^T d^{(i)}        (backtracking step)
        λ_i = δ λ_i;
    end
    x^{(i+1)} = x^{(i)} + λ_i d^{(i)};
    define the step–length α_{i+1} ∈ [α_min, α_max];        (step–length updating rule)
    define the diagonal scaling matrix D_{i+1} ∈ D_{ρ_{i+1}};        (scaling updating rule)
end
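A compact NumPy sketch of Algorithm SGP for problem (1) with Ω = {x ≥ 0} is given below; since the scaling matrices are diagonal and positive, the scaled projection onto the nonnegative orthant reduces to a componentwise maximum with zero. The BB-like step–length update and all numerical defaults are illustrative placeholders for the rules left generic in Table 1.

import numpy as np

def sgp(A, b, x0, n_iter=500, delta=0.5, sigma=1e-4,
        alpha_min=1e-10, alpha_max=1e10, gamma=1e13, p=2.1):
    # Scaled gradient projection (Table 1) for min_{x >= 0} 0.5*||Ax - b||^2.
    # With Omega = {x >= 0} and diagonal D_i, the scaled projection is a
    # componentwise maximum with zero. BB1-like step-length rule (illustrative).
    f = lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2
    ATb = A.T @ b
    x = np.maximum(np.asarray(x0, dtype=float), 0.0)
    alpha = 1.0
    d_scale = np.ones_like(x)                                # diagonal of D_0
    grad = A.T @ (A @ x) - ATb
    for i in range(n_iter):
        d = np.maximum(x - alpha * d_scale * grad, 0.0) - x  # scaled GP direction
        lam = 1.0
        while f(x + lam * d) > f(x) + sigma * lam * (grad @ d):  # backtracking
            lam *= delta
        x_new = x + lam * d
        grad_new = A.T @ (A @ x_new) - ATb
        s, y = x_new - x, grad_new - grad
        sy = float(s @ y)
        alpha = min(max((s @ s) / sy, alpha_min), alpha_max) if sy > 0 else alpha_max
        rho = np.sqrt(1.0 + gamma / (i + 1.0) ** p)          # bound rho_{i+1} -> 1
        ata_x = grad_new + ATb                               # A^T A x, from the gradient
        d_scale = np.clip(x_new / np.maximum(ata_x, 1e-16), 1.0 / rho, rho)
        x, grad = x_new, grad_new
    return x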
TABLE 2. Algorithm Scaled GP Ex (Scaled Gradient Projection Method with Extrapolation).
Initialize: x^{(0)} ∈ R^n, y^{(0)} = x^{(0)}, 0 < δ < 1, a ≥ 2, α_0 > 0, D_0 ∈ D_{ρ_0};
for i = 0, 1, . . .
    x^{(i+1)} = P_{Ω,D_i^{-1}}( y^{(i)} − α_i D_i ∇f(y^{(i)}) );        (scaled gradient proj. step)
    while f(x^{(i+1)}) > f(y^{(i)}) + ∇f(y^{(i)})^T (x^{(i+1)} − y^{(i)}) + (1/(2α_i)) ‖x^{(i+1)} − y^{(i)}‖²_{D_i^{-1}}        (backtracking step)
        α_i = δ α_i;   x^{(i+1)} = P_{Ω,D_i^{-1}}( y^{(i)} − α_i D_i ∇f(y^{(i)}) );
    end
    β_{i+1} = i/(i + 1 + a);   α_{i+1} = α_i;
    y^{(i+1)} = x^{(i+1)} + β_{i+1} ( x^{(i+1)} − x^{(i)} );        (extrapolation step)
    define the diagonal scaling matrix D_{i+1} ∈ D_{ρ_{i+1}};        (scaling updating rule)
end
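Analogously, a minimal sketch of Algorithm Scaled GP Ex for Ω = {x ≥ 0} might be organized as follows; the backtracking test is written in the D_i^{-1} norm as in Table 2, and the parameter values are illustrative.

import numpy as np

def scaled_gp_ex(A, b, x0, n_iter=500, delta=0.5, a=3.0,
                 alpha0=1.0, gamma=1e13, p=2.1):
    # Scaled gradient projection with extrapolation (Table 2) for
    # min_{x >= 0} 0.5*||Ax - b||^2, with diagonal scaling matrices.
    f = lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2
    ATb = A.T @ b
    x_old = np.maximum(np.asarray(x0, dtype=float), 0.0)
    y = x_old.copy()
    x = x_old.copy()
    alpha = alpha0
    d_scale = np.ones_like(x_old)                        # diagonal of D_0
    for i in range(n_iter):
        g = A.T @ (A @ y) - ATb                          # gradient at y
        x = np.maximum(y - alpha * d_scale * g, 0.0)     # scaled gradient proj. step
        # backtracking: sufficient-decrease test with the D_i^{-1} norm of Table 2
        while f(x) > f(y) + g @ (x - y) + ((x - y) ** 2 / d_scale).sum() / (2.0 * alpha):
            alpha *= delta
            x = np.maximum(y - alpha * d_scale * g, 0.0)
        beta = i / (i + 1.0 + a)                         # beta_{i+1} = i/(i + 1 + a)
        y = x + beta * (x - x_old)                       # extrapolation step
        rho = np.sqrt(1.0 + gamma / (i + 1.0) ** p)
        ata_x = A.T @ (A @ x)                            # could also be taken from a gradient at x
        d_scale = np.clip(x / np.maximum(ata_x, 1e-16), 1.0 / rho, rho)
        x_old = x
    return x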
This updating rule satisfies the convergence conditions and does not add a significant computational
cost, since the vector A^T A x^{(i)} is already available from the gradient. The algorithms are reported in Tables 1 and 2.
In the numerical experiments of the next section, we set γ = 10^{13} and p = 2.1.
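The following fragment illustrates this point: given the gradient ∇f(x^{(i)}) = A^T A x^{(i)} − A^T b and the precomputed vector A^T b, the diagonal of D_{i+1} requires only O(n) elementwise operations and no additional matrix–vector product (the small safeguard on the denominator and the i+1 shift in ρ_i are implementation details, not part of the rule above).

import numpy as np

def scaling_diagonal(x, grad, ATb, i, gamma=1e13, p=2.1, tiny=1e-16):
    # Diagonal of D_{i+1}: since grad = A^T A x - A^T b, the vector A^T A x
    # is recovered as grad + ATb without any extra matrix-vector product.
    ata_x = grad + ATb
    rho = np.sqrt(1.0 + gamma / (i + 1.0) ** p)   # rho tends to 1 as i grows
    return np.clip(x / np.maximum(ata_x, tiny), 1.0 / rho, rho)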
NUMERICAL EXPERIMENTS
A first set of test problems of the form (1) with Ω = {x ∈ Rn | x ≥ 0} is randomly generated in such a way that a
solution x∗ ∈ Ω with k < m nonzero entries exists.
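The exact generation procedure is not detailed here; one simple construction consistent with this setting (used only as an assumption for illustration) is the following.

import numpy as np

def random_test_problem(m=50, n=500, k=27, seed=0):
    # Random nonnegative A (m x n) and a nonnegative solution x* with k nonzero
    # entries; b = A x* yields a consistent problem (assumed recipe, the paper
    # only states that such a solution exists).
    rng = np.random.default_rng(seed)
    A = rng.random((m, n))                           # nonnegative entries in [0, 1)
    x_star = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x_star[support] = rng.random(k)                  # k nonzero nonnegative entries
    b = A @ x_star
    return A, b, x_star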
TABLE 3. Random test problems. x̄ denotes the approximation provided by the algorithms.

Not scaled method     ‖Ax̄ − b‖₂ − ‖Ax* − b‖₂    Iterations     Seconds
GP Ex                 8.847e-04 (2.81e-04)        653 (271)      0.078 (0.025)
GP                    1.434e-03 (4.21e-04)        1176 (599)     0.213 (0.106)

Scaled method         ‖Ax̄ − b‖₂ − ‖Ax* − b‖₂    Iterations     Seconds
Scaled GP Ex          8.779e-04 (3.29e-04)        334 (124)      0.137 (0.019)
SGP                   9.437e-04 (2.60e-04)        132 (37)       0.078 (0.010)
In Tab. 3, the average results (standard deviation in brackets) over 50 test problems with m = 50, n = 500 and k = 27
confirm the effectiveness of the scaled versions of GP and GP Ex.
The second set of experiments arises from a problem of fibre orientation estimation in diffusion MRI (dMRI) [18].
Following a spherical deconvolution framework, the problem of recovering the fibre orientation distribution (FOD)
function [19] in each voxel of the white matter of the brain can be expressed in the linear form b = Φx + η,
where x ∈ Rn is the vector of the FOD coefficients, b ∈ Rm is the vector of the dMRI measurements in the voxel, Φ is
the matrix modeling the convolution operator and η is the acquisition noise. The deconvolution problem is intrinsically
ill–posed and regularization schemes are generally used, based on the assumption that the FOD to be reconstructed in
each voxel is sparse. Unlike widely used methods performing the reconstruction voxel by voxel, the recent
approach proposed in [18] recovers the fibre configuration of all voxels of interest simultaneously.
[Figure 1: panels (a) and (b); log–log plots of Err(t) versus Time (seconds) for GP_Ex, GP, Scaled GP_Ex and SGP.]
FIGURE 1. Behaviour of the methods on two test problems from the dMRI application. Err(t) := ‖Φ̃X(t) − B‖₂ − ‖Φ̃X* − B‖₂, where
X(t) is the approximation of the solution after t seconds and X* is a ground–truth, obtained by executing GP Ex with high accuracy.
The approach aims at taking into account both voxelwise sparsity and the spatial coherence of the fibre orientation between neighbouring voxels. This goal
is achieved by means of an iterative reweighted ℓ_1–minimization scheme that solves a sequence of constrained least
squares problems of the form:
\min_{X\in\Omega} \ \tfrac{1}{2}\,\|\tilde{\Phi} X - B\|_2^2, \qquad \Omega = \Big\{ X \ge 0, \ \textstyle\sum_{i=1}^{N} w_i X_i \le K \Big\}, \qquad (3)
where Φ̃ is an M×N block diagonal matrix with M < N, the vectors X ∈ RN and B ∈ R M are obtained by concatenating
column–wise the FOD and the signal of each voxel, K is the estimated maximum number of fibres to be detected in
the brain volume and the weights w ∈ RN are used to promote at each iteration the spatially structured FOD sparsity.
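For completeness, we note that the Euclidean projection onto this feasible set can be computed by a simple bisection on the Lagrange multiplier of the linear constraint, at O(N) cost per bisection step; the sketch below assumes strictly positive weights and is not part of the formulation in [18]. The projection in the scaled metric induced by a diagonal D can be handled analogously, replacing w by Dw in the thresholding step.

import numpy as np

def project_onto_omega(v, w, K, tol=1e-10, max_iter=100):
    # Euclidean projection of v onto {x >= 0, sum_i w_i x_i <= K}, with w_i > 0.
    # KKT conditions give x(lmb) = max(v - lmb * w, 0); the multiplier lmb >= 0
    # is located by bisection on the nonincreasing function w^T x(lmb).
    x = np.maximum(v, 0.0)
    if w @ x <= K:                       # linear constraint inactive
        return x
    lo, hi = 0.0, float(np.max(v / w))   # at lmb = hi, x(lmb) = 0
    for _ in range(max_iter):
        lmb = 0.5 * (lo + hi)
        x = np.maximum(v - lmb * w, 0.0)
        gap = float(w @ x) - K
        if abs(gap) <= tol:
            break
        if gap > 0.0:                    # constraint still violated: increase lmb
            lo = lmb
        else:
            hi = lmb
    return x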
Our experiment consists in solving two problems of the form (3), derived by applying the reweighted ℓ_1–minimization
scheme to the dataset available at https://github.com/basp-group/co-dmri; in this case, M = 19200, N =
257280 and K = 3840. Figure 1 (a) and Fig. 1 (b) show the behaviour of the considered first–order methods on the
two problems.
The results in Tab. 3 and Fig. 1 show that better performance is obtained when applying the scaled versions of both
the considered algorithms. In particular, the time reduction for the fibre orientation estimation problems emphasizes
the usefulness of the proposed scaling strategy in solving large–scale dMRI problems.
ACKNOWLEDGMENTS
This research was supported by INDAM-GNCS2016.
REFERENCES
[1] F. Porta, M. Prato, and L. Zanni, J. Sci. Comput. 65, 895–919 (2015).
[2] A. Beck and M. Teboulle, SIAM J. Imaging Sci. 2, 183–202 (2009).
[3] D. Lorenz and T. Pock, J. Math. Imaging Vis. 51, 311–325 (2015).
[4] S. Bonettini, F. Porta, and V. Ruggiero, SIAM J. Sci. Comput. (2016).
[5] P. Combettes and J.-C. Pesquet, Fixed-point algorithms for inverse problems in science and engineering (Springer, New York NY, 2011), pp. 185–212.
[6] P. Combettes and V. R. Wajs, Multiscale Modeling & Simulation 4, 1168–1200 (2005).
[7] E. G. Birgin, J. Martinez, and M. Raydan, SIAM J. Optim. 10, 1196–1211 (2000).
[8] D. P. Bertsekas, Convex Optimization Theory, Suppl. Ch. 6 on Convex Optim. Alg. (Athena Scientific, 2009).
[9] R. Fletcher, Math. Program. 135, 413–436 (2012).
[10] J. Barzilai and J. Borwein, IMA J. Numer. Anal. 8, 141–148 (1988).
[11] A. Iusem, Comput. Appl. Math. 22, 37–52 (2003).
[12] S. Bonettini and M. Prato, Inverse Prob. 31, 1196–1211 (2015).
[13] A. Chambolle and C. Dossal, J. Optim. Theory Appl. 166, 968–982 (2015).
[14] S. Bonettini, R. Zanella, and L. Zanni, Inverse Prob. 25, p. 015002 (23pp) (2009).
[15] M. E. Daube-Witherspoon and G. Muehllehner, IEEE Trans. Med. Imaging MI-5, 61–66 (1986).
[16] H. Lantéri, M. Roche, and C. Aime, Inverse Prob. 18, 1397–1419 (2002).
[17] F. Benvenuto, R. Zanella, L. Zanni, and M. Bertero, Inverse Prob. 26, p. 025004 (18pp) (2010).
[18] A. Auría, A. Daducci, J.-P. Thiran, and Y. Wiaux, Neuroimage 115, 245–255 (2015).
[19] B. Jian and B. Vemuri, IEEE Trans. Med. Imaging 26, 1464–1471 (2007).