
A Family of SQA Methods for Non-Smooth
Non-Strongly Convex Minimization with Provable
Convergence Guarantees
Man-Chung Yue · Zirui Zhou ·
Anthony Man-Cho So
Abstract We propose a family of sequential quadratic approximation (SQA)
methods, the inexact regularized proximal Newton (IRPN) method, for minimizing
the sum of a smooth and a non-smooth convex function. Our proposed algorithm
features strong convergence guarantees even when applied to problems with
degenerate solutions, while allowing the inner minimization to be solved inexactly. We prove that IRPN converges globally and that, under the so-called
Luo-Tseng error bound (EB) condition, the sequence generated by IRPN converges
linearly, superlinearly, or even quadratically to an optimal solution, depending on the choice of a parameter of the algorithm. Prior to this work, such
an EB condition had been used extensively to prove the linear convergence rates
of various first-order methods. To the best of our knowledge, this is the first
EB-based analysis to establish superlinear convergence of SQA-type methods
for non-smooth convex minimization. As a consequence of our result, IRPN
is capable of solving regularized regression or classification problems in
the high-dimensional setting with provable convergence guarantees. We compare two empirically efficient algorithms, the newGLMNET [29] and adaptively restarted FISTA [17], with our proposed IRPN by applying them to the
ℓ1-regularized logistic regression problem. Experimental results show the competitiveness of our proposed algorithm.
Man-Chung Yue
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong, Shatin, N. T., Hong Kong
E-mail: [email protected]
Zirui Zhou
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong, Shatin, N. T., Hong Kong
E-mail: [email protected]
Anthony Man-Cho So
Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong, Shatin, N. T., Hong Kong
E-mail: [email protected]
Keywords Sequential Quadratic Approximation · Proximal Newton ·
Error Bound · Superlinear Convergence · Non-Strongly Convex · Composite
Minimization
1 Introduction
A wide range of tasks in machine learning and statistics can be formulated as
a non-smooth convex minimization of the following form1:

    min_{x ∈ R^n} F(x) := f(x) + g(x),    (1)
where f : Rn → (−∞, ∞] is convex and twice continuously differentiable on
an open subset of Rn containing dom(g) and g : Rn → (−∞, ∞] is a proper,
convex and closed function, but possibly non-smooth. A popular choice for
solving problem (1) is the sequential quadratic approximation (SQA) method
(also called the proximal Newton method). Here is a blueprint for a generic
SQA: at iteration k, one computes the minimizer x̂ (or an approximation of
it) of a quadratic model qk (x) of the objective function F (x) at xk , where
    qk(x) := f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T Hk (x − xk) + g(x),    (2)
and Hk is a positive definite matrix approximating the Hessian matrix ∇2 f (xk )
of f at xk . A step size αk is obtained by performing a backtracking line search
along the direction dk := x̂ − xk , and then xk+1 := xk + αk dk is returned
as the next iterate. Since in most cases the minimizer x̂ of (2) (whether exact
or approximate) does not admit a closed-form expression, an iterative
algorithm, which we shall refer to as the inner solver, is invoked. Throughout
the paper, we will also refer to the task of minimizing the quadratic model (2)
as the inner problem or inner minimization and to its solution as the inner
solution.
There are three important ingredients that, more or less, determine an
SQA method: the approximate Hessian Hk , the inner solver for minimizing
qk , and the stopping criterion of the inner solver to control the inexactness of
the approximate inner solution x̂ to the inner problem. Indeed, many existing
SQA methods and their variants that are tailored for special instances of (1)
can be obtained by specifying the aforementioned ingredients. Friedman et
al. [9] developed the GLMNET algorithm for solving the ℓ1-regularized logistic
regression, where Hk is set to be the exact Hessian ∇2 f (xk ) and coordinate
minimization is used as the inner solver. Yuan et al. [29] improved GLMNET
by replacing Hk with ∇2 f (xk ) + νI for some constant ν > 0 and adding a
heuristic adaptive stopping strategy for inner minimization. This algorithm,
called newGLMNET, is now the workhorse of the well-known LIBLINEAR
package [7] for large scale linear classification. Hsieh et al. [10] proposed the
QUIC algorithm for solving sparse inverse covariance matrix estimation, which
1 Some authors refer to this as convex composite minimization.
makes use of a quasi-Newton model for forming the quadratic approximation
(2). Similar strategies have also been employed in Olsen et al. [18], where
the inner problems are solved by the fast iterative soft-shrinkage algorithm
(FISTA) [1]. Other SQA variants can be found, e.g., in [23, 2, 30].
Although the numerical advantages of the above algorithms have been well
documented, their results on global convergence and convergence rate require
that the inner problems be solved exactly. This requirement is rarely satisfied in practice. To address the inexactness of the inner problem, Lee et
al. [11] and Byrd et al. [3] recently proposed several families of SQA methods
along with stopping criteria for inner problems that yield global convergence
and asymptotic superlinear rate. One crucial assumption, without which the
analyses break down or the algorithms are not well-defined2 , that underpins
their theoretical results is the strong convexity of f . Unfortunately, such an
assumption does not hold in many applications of interest. For example, the
objective function in the ℓ1-regularized logistic regression is
    F(x) = (1/m) Σ_{i=1}^m log(1 + exp(−bi · ai^T x)) + µ‖x‖1,

where the smooth part f(x) = (1/m) Σ_{i=1}^m log(1 + exp(−bi · ai^T x)) is strongly
convex if and only if the data matrix A := [a1, · · · , am] ∈ R^{n×m} has full
row rank, which is not a guaranteed property in most applications and is even
impossible in the high-dimensional setting, i.e., n ≫ m. In view of the above
discussion, the theories of existing SQA methods are rather incomplete, and
we are motivated to develop an algorithm for solving problem (1) that does
not require an exact inner solver or the strong convexity of f, while still
possessing desirable convergence properties such as global convergence and an
asymptotic superlinear rate.
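As a concrete illustration, the ℓ1-regularized logistic regression objective above can be evaluated as follows. This is an illustrative sketch only; the function name and NumPy-based interface are our own, not part of the paper.

```python
import numpy as np

def logistic_l1_objective(x, A, b, mu):
    """F(x) = (1/m) * sum_i log(1 + exp(-b_i * a_i^T x)) + mu * ||x||_1.

    Rows of A are the data vectors a_i; b holds labels in {-1, +1}.
    """
    z = -b * (A @ x)                      # z_i = -b_i * a_i^T x
    smooth = np.logaddexp(0.0, z).mean()  # stable log(1 + exp(z_i)), averaged
    return smooth + mu * np.abs(x).sum()
```

A quick sanity check: at x = 0 the smooth part equals log 2 regardless of the data.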
In this paper, we propose a new family of SQA methods, the inexact regularized
proximal Newton (IRPN) method, that is capable of solving non-strongly convex
instances of problem (1) without an exact inner solver. At iteration k, the
proposed IRPN takes Hk to be the regularized Hessian Hk =
∇2f(xk) + µk I, where µk = c · r(xk)^ρ for some constants c > 0, ρ ∈ [0, 1], and
r(·) is a residual function measuring the proximity of xk to the optimal set (see
Section 2.2 for details). The inner minimization is solved up to an adaptive inexactness condition that also depends on the residual r(xk) and the parameter
ρ. It is worth noting that the idea of regularization is not new for second-order
methods. Indeed, Li et al. [12] investigated the regularized Newton method
for solving smooth convex minimization with degenerate solutions. The well-known
Levenberg-Marquardt (LM) method for solving nonlinear equations is
essentially a regularized Gauss-Newton method [16, 27]. The algorithm that is
closest in spirit to ours is the newGLMNET [29], which aims at solving a
minimization of the same form (1) and also uses a regularized Hessian, albeit
with a regularization parameter that remains constant throughout the entire
execution, i.e., µk = ν for all k. Hence it may be seen as a special case (ρ = 0)
of IRPN. However, the constant ν is chosen empirically and a heuristic stopping
rule is adopted for their inner solver.

2 For [11], Hk = ∇2f(xk). If strong convexity is missing, the quadratic model (2) is
not strongly convex and can have multiple minimizers. Hence the next iterate xk+1 is not
well-defined.
Another motivation of this research comes from the seminal paper [15] of
Luo and Tseng, where a notion of error bound (EB) condition was introduced.
This EB condition is a provably weaker assumption than the strong convexity
assumption and has been proved to hold for a wide range of instances of (1)
(see Theorem 1). Since its introduction, the Luo-Tseng EB has played an important
role in the convergence rate analysis of various first-order methods. Specifically, it has been utilized to prove the linear convergence of projected gradient
descent, proximal gradient descent, coordinate descent and block coordinate
descent for solving problem (1) without assuming strong convexity, see [15, 26,
25, 32]. Following the great success of EB-based analysis of first-order methods,
it is natural to ask whether it is possible to develop an EB-based analysis of
second-order methods for solving problem (1). Such an analysis is highly desirable, as it would advance our understanding of the strengths and limitations of
SQA methods and more accurately capture the interplay between algorithmic
properties and the geometrical aspects of problem (1). Some partial answers
have been obtained in this direction. Li et al. [12] proposed a regularized Newton method along with inexactness criteria for solving smooth non-strongly
convex minimization and proved its global convergence and local quadratic
convergence rate based on an EB assumption. Unfortunately, their analysis
relies heavily on the eigenvalue perturbation analysis of the Hessian matrix
of the smooth objective function and hence cannot be extended to second-order algorithms for solving the non-smooth minimization (1). Tseng and Yun [26]
proposed the (block) coordinate gradient descent (CGD) method for solving
problem (1) with non-convex f and non-smooth convex g that has separable
structure. CGD includes the SQA method as a special case. Although their
analysis is also based on the Luo-Tseng EB, CGD requires the sequence of
approximate Hessians Hk to be uniformly bounded below, i.e., λmin(Hk) ≥ λ
for all k and some constant λ > 0, and the asymptotic convergence rate is only
linear. Another important contribution along this direction is the paper [8] by
Fischer, where he studied an abstract framework for solving a class of generalized equations and proved its asymptotic superlinear convergence rate. Besides
an error bound-type assumption and an inexactness assumption, the result
is based also on the strong assumption that the distance between the
current and next iterates is linearly bounded by the distance from the current iterate to the set of solutions, i.e., ‖xk+1 − xk‖ = O(dist(xk, X)), where
X is the set of solutions of the generalized equation. However, for a concrete
algorithm, establishing such an estimate is usually a nontrivial task. Moreover,
the iterative framework in [8] does not involve regularization, and hence it is
not obvious how our algorithm can be put into their abstract framework (see
equations (3) and (4) of [8]).
Our contributions can be summarized as follows. First, we propose a family
of SQA methods, IRPN, for solving non-strongly convex minimization (1). The
proposed algorithm is parametrized by a parameter ρ ∈ [0, 1] and comes with a
stopping criterion for inner problems, which allows us to solve the inner problems inexactly. Second, we establish the global convergence and asymptotic
convergence rate of IRPN. More precisely, we prove that each accumulation
point of the sequence of iterates generated by IRPN is optimal to (1). Furthermore, under the Luo-Tseng EB assumption, the sequence converges to an
optimal solution, and the convergence rate is R-linear if ρ = 0, Q-superlinear
if ρ ∈ (0, 1), and Q-quadratic if ρ = 1. In particular, our result on the convergence rate is proved based on the Luo-Tseng EB condition. This is also our
third contribution: we develop a novel EB-based analysis of SQA methods
for solving problem (1) that is able to yield superlinear convergence.
The paper is organized as follows. In Section 2, we prepare the assumptions and basic results needed in the paper. We also discuss the Luo-Tseng
EB condition for (1) and some related regularity conditions. In Section 3, we
detail the algorithm IRPN, including the regularized Hessian and inexactness
conditions for inner minimization. In Section 4, we establish the global convergence and the local convergence rate of IRPN. Numerical studies are provided
in Section 5. Finally, we conclude our paper in Section 6.
Notation. We denote the optimal value of and the set of optimal solutions
to problem (1) by F∗ and X, respectively, and let ‖·‖ be the Euclidean norm.
Given a set C ⊆ R^n, we denote by dist(x, C) the distance from x to C; i.e.,
dist(x, C) = inf{‖x − u‖ | u ∈ C}. For a closed convex function h, we denote
by ∂h the subdifferential of h. Furthermore, if h is smooth, we let ∇h and ∇2h
be the gradient and Hessian of h, respectively.
2 Preliminaries
Throughout, we consider problem (1), where the function f is twice continuously differentiable on an open subset of Rn containing dom(g). To avoid
ill-posed problems, we assume that the optimal value F ∗ > −∞ and the set
of optimal solutions X is non-empty. Furthermore, we make the following assumptions on the derivatives of f .
Assumption 1. The function f in problem (1) satisfies the following:
(a) The gradient ∇f is Lipschitz continuous on an open set U containing dom(g);
i.e., there exists a constant L1 > 0 such that

    ‖∇f(y) − ∇f(z)‖ ≤ L1‖y − z‖    ∀ y, z ∈ U.    (3)

(b) The Hessian ∇2f is Lipschitz continuous on an open set U containing dom(g);
i.e., there exists a constant L2 > 0 such that

    ‖∇2f(y) − ∇2f(z)‖ ≤ L2‖y − z‖    ∀ y, z ∈ U.    (4)
The above assumptions are standard in the analysis of Newton-type methods.
As we will see in Sections 4 and 5, Assumption 1(a) is crucial to the analysis
of global convergence, and Assumption 1(b) is utilized in analyzing the local
convergence rate. A direct consequence of Assumption 1 is that the operator
norm of the Hessian ∇2 f is bounded.
Lemma 1 Under Assumption 1, for any x ∈ dom(g), it follows that
λmax(∇2f(x)) ≤ L1.

Proof. Let x be an arbitrary vector in dom(g). By Assumption 1, f is twice
continuously differentiable at x. Hence, by the definition of ∇2f, we have, for
any vector v ∈ R^n,

    ∇2f(x)v = lim_{h→0} (∇f(x + hv) − ∇f(x)) / h.

By taking norms on both sides, we have

    ‖∇2f(x)v‖ = lim_{h→0} ‖∇f(x + hv) − ∇f(x)‖ / h ≤ lim_{h→0} L1 h‖v‖ / h = L1‖v‖,

where the inequality is due to Assumption 1(a). Hence, by the definition of
λmax, we have λmax(∇2f(x)) ≤ L1. ⊓⊔
2.1 Optimality and Residual Functions
We now introduce the optimality conditions of (1) and the announced error
bound condition. Given a convex function h, the so-called proximal operator
of h is defined as follows:

    prox_h(v) := argmin_{u ∈ R^n} { (1/2)‖u − v‖^2 + h(u) }.    (5)

The proximal operator is the building block of many first-order methods for
solving (1), such as the proximal gradient method and its accelerated versions
and (block) coordinate gradient descent methods, and it can be viewed as a
generalization of the projection operator. When the function h is the indicator
function of a closed convex set C, the proximal operator of h reduces to the
projection operator onto C. Moreover, the proximal operators of many non-smooth
functions, such as the ℓ1-norm, the group-Lasso regularizer, the elastic net
regularizer, and the nuclear norm, have closed-form representations. For example,
let h be the scaled ℓ1-norm function τ‖x‖1. Then the proximal operator of h is
the soft-thresholding operator:

    prox_h(v) = sign(v) ⊙ max{|v| − τ, 0},

where sign, max, and | · | are entrywise operations and ⊙ denotes the entrywise
product. We now recall some useful properties of the proximal operator
that will be used later. One well-known property of the proximal operator is
that it characterizes the optimal solutions of (1) as its fixed points. For its
proof, we refer readers to Lemma 2.4 of [4].
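In code, the soft-thresholding operator described above is a one-liner. The following sketch (the naming is our own) implements the entrywise formula:

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1 at v: entrywise sign(v_i) * max(|v_i| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

For example, applying it with τ = 1 to (3, −0.5, 1) zeroes out every entry of magnitude at most 1 and shrinks the remaining entries toward zero by 1.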
Lemma 2 A vector x ∈ dom(g) is an optimal solution of (1) if and only if
for any τ > 0, the following fixed-point equation holds:

    x = prox_{τg}(x − τ∇f(x)).    (6)

Let R : dom(g) → R^n be the map given by

    R(x) := x − prox_g(x − ∇f(x)).    (7)

Lemma 2 suggests that r(x) = ‖R(x)‖ is a residual function for problem (1),
in the sense that r(x) ≥ 0 for all x ∈ dom(g) and r(x) = 0 if and only if
x ∈ X. In addition, the following proposition shows that both R(x) and r(x)
are Lipschitz continuous on dom(g) if Assumption 1(a) is satisfied.
Proposition 1. Suppose that Assumption 1(a) is satisfied. Then, for any
y, z ∈ dom(g), we have

    |r(y) − r(z)| ≤ ‖R(y) − R(z)‖ ≤ (L1 + 2)‖y − z‖.

Proof. The first inequality is a direct consequence of the triangle inequality.
We now prove the second one. It follows from the definition of R that

    ‖R(y) − R(z)‖ = ‖y − prox_g(y − ∇f(y)) − z + prox_g(z − ∇f(z))‖
                  ≤ ‖y − z‖ + ‖prox_g(y − ∇f(y)) − prox_g(z − ∇f(z))‖
                  ≤ 2‖y − z‖ + ‖∇f(y) − ∇f(z)‖
                  ≤ (L1 + 2)‖y − z‖,

where the second inequality follows from the firm non-expansiveness of the
proximal operator (see Proposition 3.5 of [4]) and the last one is by Assumption
1(a). ⊓⊔
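The residual map (7) is cheap to evaluate once the proximal operator of g is available. A minimal sketch, with our own interface (grad_f and prox_g are user-supplied callables, not part of the paper):

```python
import numpy as np

def residual(x, grad_f, prox_g):
    """r(x) = ||R(x)|| with R(x) = x - prox_g(x - grad_f(x)), as in (7)."""
    return np.linalg.norm(x - prox_g(x - grad_f(x)))
```

By Lemma 2, this returns 0 exactly at the optimal solutions; e.g., for f(x) = (1/2)‖x − c‖^2 and g ≡ 0 it vanishes only at x = c.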
2.1.1 Inner Minimization
Recall that at the k-th iteration of an SQA method, the goal is to find the
minimizer of (2). Letting fk(x) denote the sum of the first three terms, the inner
minimization problem reads

    min_{x ∈ R^n} qk(x) := fk(x) + g(x),    (8)

which is also a convex minimization of the form (1) with the smooth part
being quadratic. Therefore, both the optimality condition and the residual
function studied above for problem (1) can be adapted to the inner minimization (8). The following corollary is immediately implied by Lemma 2 after
noting the fact that ∇fk(x) = ∇f(xk) + Hk(x − xk).

Corollary 1. A vector x ∈ dom(g) is a minimizer of (8) if and only if for
any τ > 0, the following fixed-point equation holds:

    x = prox_{τg}(x − τ∇fk(x)) = prox_{τg}((I − τHk)x − τ(∇f(xk) − Hk xk)).
Similar to the map R for problem (1), let Rk : dom(g) → R^n be the map given
by

    Rk(x) := x − prox_g(x − ∇fk(x)) = x − prox_g((I − Hk)x − (∇f(xk) − Hk xk)).    (9)

Parallel to the role of r(x) for the outer problem (1), rk(x) = ‖Rk(x)‖ is a
natural residual function for the inner minimization (8). Moreover, following
the lines of the proof of Proposition 1, we can easily show that both Rk
and rk are Lipschitz continuous.

Corollary 2. For any y, z ∈ R^n, we have3

    |rk(y) − rk(z)| ≤ ‖Rk(y) − Rk(z)‖ ≤ (λmax(Hk) + 2)‖y − z‖.
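The inner residual rk admits the same kind of implementation, with ∇fk(x) = ∇f(xk) + Hk(x − xk) substituted for the gradient. The sketch below uses our own naming and assumes prox_g is the (unit) proximal operator of g:

```python
import numpy as np

def inner_residual(x, xk, grad_f_xk, H, prox_g):
    """r_k(x) = ||x - prox_g(x - grad f_k(x))||, where
    grad f_k(x) = grad f(xk) + H (x - xk), as in (9)."""
    grad_fk = grad_f_xk + H @ (x - xk)
    return np.linalg.norm(x - prox_g(x - grad_fk))
```

When g ≡ 0, the inner minimizer is the regularized Newton point xk − H⁻¹∇f(xk), at which the inner residual vanishes.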
2.2 Error Bound Condition
A prevailing assumption in existing analyses of the global convergence and
local convergence rates of SQA methods for solving problem (1) is that the
smooth function f is strongly convex [11, 3]. However, such an assumption
is difficult to verify and is invalid in many applications (see the discussion in
Section 1). Instead of assuming strong convexity, our analysis of IRPN is based
on the following local error bound (EB) condition.
Assumption 2. (Luo-Tseng EB Condition) Let r(x) be the residual function defined in Section 2.1. For any ζ ≥ F∗, there exist scalars κ > 0 and ε > 0
such that

    dist(x, X) ≤ κ · r(x)    whenever F(x) ≤ ζ and r(x) ≤ ε.    (10)
Error bounds have long been an important topic and permeate all aspects
of mathematical programming [20, 6]. The Luo-Tseng EB (10) for problem (1)
was studied and popularized in a series of papers by Luo and Tseng [13–15]4.
In particular, many useful subclasses of problem (1) have been verified to
satisfy the Luo-Tseng EB, mainly those with polyhedral g. Recently, the
validity of the Luo-Tseng EB has been extended to problems with non-polyhedral
g [25] and even to problems with non-polyhedral optimal solution sets [31]. We
briefly summarize these results in the following theorem. We refer the readers
to the recent manuscript of Zhou and So [31] for a more detailed review of the
developments of the Luo-Tseng EB.
Theorem 1. For problem (1), Assumption 2 (Luo-Tseng EB condition) holds
in any of the following scenarios:
(S1) ([19, Theorem 3.1]) f is strongly convex, ∇f is Lipschitz continuous, and
g is any closed convex function.
3 Compared with Proposition 1, Assumption 1(a) is not required for Corollary 2.
4 A similar notion of EB had been studied before by Pang [19] for linearly constrained
variational inequalities.
(S2) ([14, Theorem 2.1]) f takes the form f(x) = h(Ax) + ⟨c, x⟩, where A ∈
R^{m×n}, c ∈ R^n are arbitrarily given and h : R^m → (−∞, +∞) is a continuously differentiable function with the property that on any compact
subset C of R^m, h is strongly convex and ∇h is Lipschitz continuous on
C, and g has a polyhedral epigraph.
(S3) ([25, Theorem 2]) f takes the form f(x) = h(Ax), where A ∈ R^{m×n} and
h : R^m → (−∞, +∞) are as in scenario (S2), and g is the grouped LASSO
regularizer; i.e., g(x) = Σ_{J∈J} ωJ ‖xJ‖2, where J is a non-overlapping
partition of the index set {1, . . . , n}, xJ ∈ R^{|J|} is the sub-vector obtained
by restricting x ∈ R^n to the entries in J ∈ J, and ωJ ≥ 0 is a given
parameter.
(S4) ([31, Proposition 12]) f takes the form f(X) = h(A(X)) for all X ∈
R^{n×p}, where A : R^{n×p} → R^m is a linear mapping, h : R^m → (−∞, +∞)
is as in scenario (S2), g is the nuclear norm regularizer, i.e., g(X) equals
the sum of singular values of X, and there exists an X∗ ∈ X such that
the following strict complementarity-type condition holds:

    0 ∈ ∇f(X∗) + ri(∂g(X∗)),

where ri denotes the relative interior.
Note that in scenarios (S2)-(S4), the assumptions on h are the same and can
readily be shown to be satisfied by h(y) = (1/2)‖y − b‖^2, which corresponds to
least-squares regression, and by h(y) = Σ_{i=1}^m log(1 + e^{−bi yi}) with b ∈ {−1, 1}^m,
which corresponds to logistic regression. The assumptions are also satisfied by
the loss functions that arise in the maximum likelihood estimation (MLE) for
conditional random field problems [30] and in MLE under Poisson noise [22]. It
follows from Theorem 1 that many problems of the form (1), including ℓ1-regularized
logistic regression, satisfy the Luo-Tseng EB condition, even when
they are not strongly convex.
In Section 4, the Luo-Tseng EB condition (10) is harnessed to establish the
local superlinear convergence rate of our inexact regularized proximal Newton
(IRPN) method, thus further demonstrating its versatility in convergence analysis. Prior to this work, the Luo-Tseng EB condition (10) was used to prove the
local linear convergence of a number of first-order methods for solving problem (1), such as the projected gradient method, proximal gradient method,
coordinate minimization, and coordinate gradient descent method; see [15, 25,
26] and the references therein. Besides first-order methods, such an EB condition
has also been used to prove the superlinear convergence of primal-dual interior-point
path-following methods [24]. However, these methods are quite different in
nature from the SQA methods considered in this paper.
2.3 Other Regularity Conditions
Besides the Luo-Tseng EB condition (10), there are other regularity conditions that are utilized for establishing the superlinear convergence rate of SQA
methods for problem (1).
Yen et al. [28] and Zhong et al. [30] introduced the constant nullspace strong
convexity (CNSC) condition for a smooth function. They showed that, under
the CNSC condition on f and some other regularity conditions on the non-smooth
function g, the proximal Newton method [11] converges quadratically
and the proximal quasi-Newton method [11] converges superlinearly. Nonetheless,
the claimed convergence results are misleading. Let xk be the iterates generated
by their algorithms and zk be the projection of xk onto a subspace that is
associated with the CNSC assumption. Their results imply that zk converges
to a point z∗ superlinearly or quadratically. However, this cannot guarantee
that {xk}k≥0 converges to the set of optimal solutions, and the rate of convergence of {xk}k≥0 is unclear. Here we also remark that neither the inexactness of
the inner minimization nor the global convergence was addressed in [28]
and [30].
Dontchev and Rockafellar [5] developed a framework for solving generalized
equations. When specialized to the optimality condition 0 ∈ ∇f(x) + ∂g(x) of
problem (1), their framework coincides with the family of SQA methods, and
the corresponding convergence results require the set-valued mapping ∇f + ∂g
to be either metrically regular or strongly metrically sub-regular [5]. Unfortunately, both of these regularity conditions are provably more restrictive
than the EB condition (10) and are hardly satisfied by any instances of (1)
of interest. For example, consider the following two-dimensional ℓ1-norm regularized linear regression:

    min_{x1,x2} (1/2)|x1 + x2 − 2|^2 + |x1| + |x2|.

The above problem satisfies the EB condition (10) as it belongs to scenario
(S2) in Theorem 1. On the other hand, it can be verified that the set-valued
mapping associated with this problem is neither metrically regular nor strongly
metrically sub-regular.
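The non-uniqueness of solutions in this example can be checked numerically: every point of the segment {x ≥ 0 : x1 + x2 = 1} has zero residual r(x), so the solution set is not a singleton. A small sketch (our own code, specializing the residual (7) to this problem):

```python
import numpy as np

def residual_2d(x):
    """r(x) = ||x - prox_g(x - grad f(x))|| for
    f(x) = 0.5*(x1 + x2 - 2)^2 and g(x) = |x1| + |x2|."""
    grad = (x.sum() - 2.0) * np.ones(2)                    # gradient of f
    v = x - grad
    prox = np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)   # prox of the l1 norm
    return np.linalg.norm(x - prox)
```

For instance, residual_2d vanishes at (1, 0), (0.5, 0.5), and (0, 1) alike, while it is strictly positive at non-optimal points such as (1, 1).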
3 The Inexact Regularized Proximal Newton Method
We now describe in detail the inexact regularized proximal Newton (IRPN)
method. The algorithm takes as input an initial iterate x0, a tolerance ε0 > 0
that controls the precision, constants θ ∈ (0, 1/2), ζ ∈ (θ, 1/2), and η ∈ (0, 1) that
are parameters of the inexactness conditions and the line search, and constants c >
0 and ρ ∈ [0, 1] that are used in forming the approximation matrices Hk. As we
will see in Section 4, the choice of the constant ρ is crucial for the local convergence
rate of IRPN.
At iterate xk, we first construct the following quadratic approximation of
F at xk:

    qk(x) := f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T Hk (x − xk) + g(x),

where Hk = ∇2f(xk) + µk I with µk = c · r(xk)^ρ, and r(·) is the residual
function defined in Section 2.1. Since f is convex and µk > 0, the matrix Hk is
positive definite for all k. Hence, the quadratic model qk(x) is strictly convex
and thus has a unique minimizer. We next find an approximate minimizer x̂
of the quadratic model qk(x) such that the following inexactness conditions
hold:

    rk(x̂) ≤ η · min{r(xk), r(xk)^{1+ρ}}  and  qk(x̂) − qk(xk) ≤ ζ(ℓk(x̂) − ℓk(xk)),    (11)

where rk(·) is the residual function for the inner minimization defined in Section 2.1.1 and ℓk is the first-order approximation of F at xk:

    ℓk(x) := f(xk) + ∇f(xk)^T (x − xk) + g(x).

Since the exact minimizer of qk has no closed form in most cases, an iterative
algorithm, such as coordinate minimization, the coordinate gradient descent
method, or the accelerated proximal gradient method, is typically called to find
an approximate minimizer x̂ that satisfies the conditions in (11). After performing
a backtracking line search along the direction dk := x̂ − xk, we obtain
a step size αk > 0 that guarantees a sufficient decrease of the objective value.
The algorithm then proceeds to the next iterate by setting xk+1 := xk + αk dk.
Finally, we terminate the algorithm when r(xk) falls below the prescribed
tolerance ε0. We summarize the details of the IRPN method in Algorithm 1.
Algorithm 1 Inexact Regularized Proximal Newton (IRPN) Method
1: Input: Starting point x0 and constants ε0 > 0, θ ∈ (0, 1/2), ζ ∈ (θ, 1/2), η ∈ (0, 1), c > 0,
   ρ ∈ [0, 1] and β ∈ (0, 1).
2: for k = 0, 1, 2, . . . do
3:   Compute the value of the residual r(xk).
4:   If r(xk) ≤ ε0, terminate the algorithm and return xk.
5:   Form the quadratic model qk(x) with Hk = ∇2f(xk) + µk I and µk = c · r(xk)^ρ.
6:   Call an inner solver and find an approximate minimizer x̂ of qk such that the following
     inexactness conditions are satisfied:

         rk(x̂) ≤ η · min{r(xk), r(xk)^{1+ρ}}  and  qk(x̂) − qk(xk) ≤ ζ(ℓk(x̂) − ℓk(xk)).    (11)

7:   Let the search direction dk := x̂ − xk. Find the smallest positive integer i such that

         F(xk) − F(xk + β^i dk) ≥ θ(ℓk(xk) − ℓk(xk + β^i dk)).    (12)

8:   Let the step size αk = β^i and set xk+1 = xk + αk dk.
9: end for
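The following is a simplified sketch of the outer loop of Algorithm 1 in NumPy, using proximal gradient as the inner solver. The interface is our own (prox_g(v, t) denotes the proximal operator of t·g); it checks only the first inexactness condition in (11) and starts the line search from a unit trial step rather than β^i, so it is an illustration rather than a faithful implementation:

```python
import numpy as np

def irpn(x0, f, grad_f, hess_f, g, prox_g, c=1.0, rho=0.5, eta=0.5,
         theta=0.25, beta=0.5, eps0=1e-6, max_iter=100):
    """Simplified sketch of the IRPN outer loop (Algorithm 1)."""
    x = x0.astype(float).copy()
    res = lambda z: np.linalg.norm(z - prox_g(z - grad_f(z), 1.0))  # r(z)
    for _ in range(max_iter):
        rk = res(x)
        if rk <= eps0:                       # step 4: stopping test
            break
        mu = c * rk ** rho                   # step 5: regularized Hessian
        H = hess_f(x) + mu * np.eye(x.size)
        grad_q = lambda y: grad_f(x) + H @ (y - x)
        # step 6: proximal-gradient inner solver, run until the first
        # inexactness test in (11) holds
        tol = eta * min(rk, rk ** (1.0 + rho))
        t = 1.0 / np.linalg.norm(H, 2)
        y = x.copy()
        while np.linalg.norm(y - prox_g(y - grad_q(y), 1.0)) > tol:
            y = prox_g(y - t * grad_q(y), t)
        d = y - x                            # step 7: search direction
        ell = lambda z: f(x) + grad_f(x) @ (z - x) + g(z)
        F = lambda z: f(z) + g(z)
        alpha = 1.0                          # backtracking on condition (12)
        while F(x) - F(x + alpha * d) < theta * (ell(x) - ell(x + alpha * d)):
            alpha *= beta
        x = x + alpha * d                    # step 8
    return x
```

On a toy ℓ1-regularized least-squares problem min (1/2)‖x − c‖^2 + 0.1‖x‖1, whose solution is the soft-thresholded vector, this sketch recovers the optimum.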
The inexactness conditions (11) are similar to those proposed in [3]. However,
the analysis in [3] cannot be applied to study the convergence behavior of
IRPN. First, the global convergence result therein requires Hk ⪰ λI for some λ > 0
and all k. Second, the local convergence rate therein requires ∇2f to be positive
definite at the limit point x∗ of the sequence {xk}k≥0. Since we do not
assume that the smooth function f is strongly convex, neither of these two
assumptions holds in our case.
One advantage of our proposed IRPN lies in its flexibility. Indeed, for SQA
methods using a regularized Hessian, it is typical to let the regularization parameter µk be of the same order as the residual function r(xk) [12, 21], i.e.,
µk = c · r(xk). In contrast, we let the order of µk be adjustable according to
the parameter ρ. Therefore, IRPN is in fact a tunable family of algorithms
parametrized by ρ ∈ [0, 1]. As we will see in Section 4 and Section 5, the parameter
ρ plays a dominant role in the asymptotic convergence rate of IRPN. Moreover,
IRPN allows one to choose the inner solver. Since each inner problem is
a well-structured minimization problem whose objective function is the sum of a
strongly convex quadratic function and the non-smooth convex function g,
diligent users can speed up the inner minimization by exploiting the
structure of ∇2f and g to design specialized inner solvers.
We first address the well-definedness of IRPN. To prove this,
we need to argue that the inexactness conditions (11) are always satisfiable and that
the line search procedure terminates within finitely many steps. We first present
the following lemma, which ensures that at each iteration, the inexactness
conditions (11) are satisfied as long as x̂ is close enough to the exact solution
of the inner minimization. The proof is essentially identical to that of Lemma 4.5
in [3], but we present it here for completeness.

Lemma 3 The inexactness conditions (11) are satisfied by any vector in R^n
that is sufficiently close to the exact minimizer x̄ of qk(x).
Proof. Suppose we are at iteration k with iterate xk. In addition, suppose that
xk is not optimal for problem (1), for otherwise the algorithm would have terminated. The first condition in (11) is readily verified by noting that rk(x̄) = 0
and rk(·) is continuous. It remains to show that the second condition in (11)
will eventually be satisfied. Since qk(x) = ℓk(x) + (1/2)(x − xk)^T Hk (x − xk) and x̄
is the minimizer of qk, we have Hk(xk − x̄) ∈ ∂ℓk(x̄). This leads to the property
that xk ≠ x̄, because otherwise we would have 0 ∈ ∂ℓk(xk), which implies that xk is
an optimal solution of (1). Moreover, by the convexity of ℓk, we have

    ℓk(xk) ≥ ℓk(x̄) + (x̄ − xk)^T Hk (x̄ − xk).    (13)

Since Hk = ∇2f(xk) + µk I for some µk > 0, Hk is positive definite. This,
together with the fact that xk ≠ x̄, implies that ℓk(xk) > ℓk(x̄). Furthermore,
we have

    qk(x̄) − qk(xk) = ℓk(x̄) + (1/2)(x̄ − xk)^T Hk (x̄ − xk) − ℓk(xk)
                   ≤ (1/2)(ℓk(x̄) − ℓk(xk))
                   < ζ(ℓk(x̄) − ℓk(xk)),

where the first inequality is due to (13) and the last one is due to the facts that
ζ ∈ (θ, 1/2) and ℓk(xk) > ℓk(x̄). Therefore, by the continuity of qk and ℓk,
the second condition in (11) is satisfied as long as x̂ is close enough to x̄. ⊓⊔
We next show that the backtracking line search is well defined.
Lemma 4 Suppose that Assumption 1(a) is satisfied. Then, for any iteration k of Algorithm 1, there exists a positive integer i such that the descent condition in (12) is satisfied. Moreover, the step size αk obtained from the line search strategy satisfies

αk ≥ β(1 − θ)µk / ((1 − ζ)L1).
Proof. By definition, we have

qk(x̂) − qk(xk) = ℓk(x̂) − ℓk(xk) + (1/2)(x̂ − xk)T Hk(x̂ − xk).

Since the approximate minimizer x̂ satisfies the inexactness conditions (11), the above implies

ℓk(xk) − ℓk(x̂) ≥ (1/(2(1 − ζ))) (x̂ − xk)T Hk(x̂ − xk) ≥ (µk/(2(1 − ζ))) ‖x̂ − xk‖².    (14)

Due to the convexity of ℓk(x), for any constant α ∈ [0, 1], we have

ℓk(xk) − ℓk(xk + αdk) ≥ α(ℓk(xk) − ℓk(xk + dk)).

Combining the above two inequalities and noting that x̂ = xk + dk, we obtain

ℓk(xk) − ℓk(xk + αdk) ≥ (αµk/(2(1 − ζ))) ‖dk‖².    (15)

Moreover, since ∇f is Lipschitz continuous by Assumption 1(a), it follows that

f(xk + αdk) − f(xk) ≤ α∇f(xk)T dk + (α²L1/2) ‖dk‖²,    (16)

which leads to, by the definition of ℓk,

F(xk) − F(xk + αdk) ≥ ℓk(xk) − ℓk(xk + αdk) − (α²L1/2) ‖dk‖².

It then follows from the above inequality and (15) that

F(xk) − F(xk + αdk) − θ(ℓk(xk) − ℓk(xk + αdk))
  ≥ (1 − θ)(ℓk(xk) − ℓk(xk + αdk)) − (α²L1/2) ‖dk‖²
  ≥ ((1 − θ)αµk/(2(1 − ζ))) ‖dk‖² − (α²L1/2) ‖dk‖²
  = (α/2) (((1 − θ)/(1 − ζ)) µk − αL1) ‖dk‖².

Hence, as long as α satisfies α ≤ (1 − θ)µk/((1 − ζ)L1), the descent condition (12) is satisfied. Since the backtracking line search multiplies the step size by β ∈ (0, 1) after each trial, the line search strategy outputs an αk that satisfies αk ≥ β(1 − θ)µk/((1 − ζ)L1). □
Combining Lemma 3 and Lemma 4, we conclude that the method IRPN is
well-defined.
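The backtracking rule analyzed in Lemma 4 can be sketched as follows. This is a minimal illustration; the callable interface and default constants are our assumptions, and the acceptance test mirrors the descent condition (12):

```python
def backtracking_line_search(F, ell_k, xk, dk, theta=0.25, beta=0.5, max_trials=50):
    """Return the largest alpha in {1, beta, beta^2, ...} satisfying condition (12):
        F(xk) - F(xk + alpha*dk) >= theta * (ell_k(xk) - ell_k(xk + alpha*dk))."""
    alpha = 1.0
    for _ in range(max_trials):
        if F(xk) - F(xk + alpha * dk) >= theta * (ell_k(xk) - ell_k(xk + alpha * dk)):
            return alpha
        alpha *= beta  # shrink the trial step by beta after each failed test
    return alpha
```

By Lemma 4, this loop terminates with αk ≥ β(1 − θ)µk/((1 − ζ)L1), so `max_trials` is only a safeguard.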
14
Man-Chung Yue et al.
4 Convergence Analysis
Throughout this section, we denote by {xk}k≥0 the sequence of iterates generated by Algorithm 1. We first present a theorem showing the global convergence of Algorithm 1.

Theorem 2 (Global Convergence). Suppose that Assumption 1(a) holds. Then, we have

lim_{k→∞} r(xk) = 0.    (17)

In other words, each accumulation point of the sequence {xk}k≥0 is an optimal solution of (1).

Proof. From the inequality (15), we obtain

ℓk(xk) − ℓk(xk + αk dk) ≥ (αk µk/(2(1 − ζ))) ‖dk‖² ≥ 0.    (18)

Hence, due to the descent condition (12) and the assumption that inf F > −∞, we have

lim_{k→∞} ℓk(xk) − ℓk(xk + αk dk) = 0.

Again, using (15), the above implies limk→∞ αk µk ‖dk‖² = 0. Moreover, recall that the inexactness condition (11) implies rk(x̂) ≤ η · r(xk). It then follows that

(1 − η)r(xk) ≤ r(xk) − rk(x̂) = rk(xk) − rk(x̂) ≤ (λmax(Hk) + 2)‖dk‖ ≤ (L1 + µk + 2)‖dk‖,

where the first equality follows from rk(xk) = r(xk), the second inequality is by Corollary 2, and the last inequality is due to Lemma 1. Then, combining Lemma 4, limk→∞ αk µk ‖dk‖² = 0, and (1 − η)r(xk) ≤ (L1 + µk + 2)‖dk‖, we have

lim_{k→∞} (µk² / (µk + L1 + 2)²) r(xk)² = 0.

This, together with the fact that µk = c · r(xk)^ρ, implies that limk→∞ r(xk) = 0. □
Having established the global convergence of the IRPN method, we now study its asymptotic rate of convergence. We show that, remarkably, the IRPN method can attain superlinear and quadratic rates of convergence even without strong convexity (see Theorem 3). Since we do not assume f to be strongly convex, arguments that rely on the Hessian of f being positive definite at an optimal solution x∗, or on the uniqueness of the optimal solution, break down. Key to the proof of Theorem 3 is the EB condition (10). Our techniques are different from those in [11] and [3], and we believe they will find further applications in the convergence rate analysis of other second-order methods in the absence of strong convexity.
Theorem 3 (Local Convergence Rate). Suppose that Assumptions 1 and 2 hold. Then, for sufficiently large k, we have

(i) dist(xk+1, X) ≤ γ dist(xk, X) for some γ ∈ (0, 1), if we take ρ = 0, c ≤ κ/4, and η ≤ 1/(2(L1 + 3)²);
(ii) dist(xk+1, X) ≤ O(dist(xk, X)^{1+ρ}), if we take ρ ∈ (0, 1);
(iii) dist(xk+1, X) ≤ O(dist(xk, X)²), if we take ρ = 1, c ≥ κL2, and η ≤ κ²L2/(2(L1 + 1)).
Using Theorem 3, we can further establish that the sequence of iterates of IRPN converges to a point in the set of optimal solutions.

Corollary 3 (Sequence Convergence). Suppose that Assumptions 1 and 2 hold. Then, for all three cases in Theorem 3, the sequence {xk}k≥0 converges to some x∗ ∈ X. Moreover,

(i) {xk}k≥0 converges R-linearly to x∗, if ρ = 0, c ≤ κ/4, and η ≤ 1/(2(L1 + 2)²);
(ii) {xk}k≥0 converges Q-superlinearly to x∗ with order 1 + ρ, if ρ ∈ (0, 1);
(iii) {xk}k≥0 converges Q-quadratically to x∗, if ρ = 1, c ≥ κL2, and η ≤ κ²L2/(2(L1 + 1)).
We remark that although Theorem 3 and Corollary 3 show that a larger ρ yields a faster convergence rate with respect to the iteration counter k, this does not imply that a larger ρ leads to a better overall time complexity. The reason is that a larger ρ results in a stricter inexactness condition, so each iteration takes longer. Hence, despite the faster convergence rate, the best choice of ρ ∈ [0, 1] depends on the particular problem and the solver used for the inner minimization. Nonetheless, the above results provide a complete characterization of the convergence rate of IRPN in terms of ρ and highlight the flexibility of IRPN. Developing an empirical or analytical way of tuning ρ is a topic worth pursuing.
The rest of this section is devoted to proving Theorem 3 and Corollary 3.
In Section 4.1, we prepare some technical lemmas. In Section 4.2 and Section
4.3, we prove Theorem 3 and Corollary 3, respectively.
4.1 Technical Lemmas
Since the following lemmas are all proved under the conditions of Theorem 3,
we suppress the required assumptions in their statements.
Lemma 5 Let x̂ex be the exact solution of the k-th subproblem, i.e., rk(x̂ex) = 0, and let x̄k be the projection of xk onto the optimal set X. Then

‖x̂ex − x̄k‖ ≤ (1/µk) ‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖.
Proof. If x̂ex = x̄k, the inequality holds trivially. Assuming x̂ex ≠ x̄k, it follows from the definitions that 0 ∈ ∇f(x̄k) + ∂g(x̄k) and 0 ∈ ∇f(xk) + Hk(x̂ex − xk) + ∂g(x̂ex). Since g is convex, its subdifferential is monotone. Hence we have

0 ≤ ⟨∇f(xk) − ∇f(x̄k) + Hk(x̂ex − xk), x̄k − x̂ex⟩
  = ⟨∇f(xk) − ∇f(x̄k) − Hk(xk − x̄k), x̄k − x̂ex⟩ − ⟨Hk(x̄k − x̂ex), x̄k − x̂ex⟩
  ≤ ‖∇f(xk) − ∇f(x̄k) − Hk(xk − x̄k)‖ ‖x̄k − x̂ex‖ − µk ‖x̄k − x̂ex‖²,

which yields the desired inequality. □

We remark that the proof of Lemma 5 uses no special structure of Hk other than its positive definiteness. Therefore, the lemma also applies to any other positive definite approximate Hessian Hk, with µk replaced by the minimum eigenvalue λmin(Hk).
Lemma 6 It holds for all k that

‖x̂ex − xk‖ ≤ (L2/(2µk)) dist(xk, X)² + 2 dist(xk, X).

Proof. Using the triangle inequality and the definition of µk,

‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖
  ≤ ‖∇f(x̄k) − ∇f(xk) − ∇²f(xk)(x̄k − xk)‖ + µk ‖x̄k − xk‖
  ≤ (L2/2) ‖x̄k − xk‖² + µk ‖x̄k − xk‖,    (19)

where the second inequality follows from Assumption 1(b). Then, by Lemma 5 and inequality (19),

‖x̂ex − xk‖ ≤ ‖x̂ex − x̄k‖ + dist(xk, X)
  ≤ (1/µk) ‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖ + dist(xk, X)
  ≤ (L2/(2µk)) ‖x̄k − xk‖² + ‖x̄k − xk‖ + dist(xk, X)
  = (L2/(2µk)) dist(xk, X)² + 2 dist(xk, X). □
Lemma 7 It holds for all k that

‖x̂ − x̂ex‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ}.

Proof. The inequality holds trivially if x̂ = x̂ex, so we assume x̂ ≠ x̂ex. Recall that Rk(x̂) = x̂ − proxg(x̂ − ∇fk(x̂)). Then

Rk(x̂) − ∇fk(x̂) ∈ ∂g(x̂ − Rk(x̂)),
Rk(x̂) + ∇fk(x̂ − Rk(x̂)) − ∇fk(x̂) ∈ ∂qk(x̂ − Rk(x̂)),
(I − Hk)Rk(x̂) ∈ ∂qk(x̂ − Rk(x̂)).

On the other hand, we have 0 ∈ ∂qk(x̂ex). Since qk = fk + g is µk-strongly convex, ∂qk is µk-strongly monotone and thus

⟨(I − Hk)Rk(x̂), x̂ − Rk(x̂) − x̂ex⟩ ≥ µk ‖x̂ − Rk(x̂) − x̂ex‖².

Then, using the Cauchy-Schwarz inequality,

‖x̂ − Rk(x̂) − x̂ex‖ ≤ (1/µk) ‖(I − Hk)Rk(x̂)‖
  ≤ (1/µk) ‖∇²f(xk) − (1 − µk)I‖op · rk(x̂)
  ≤ (η(L1 + |1 − µk|)/µk) r(xk)^{1+ρ}
  ≤ (η(L1 + |1 − µk|)/c) r(xk),

where the third inequality is due to Lemma 1 and the inexactness condition on rk(x̂), and the last inequality is due to the setting of µk. Hence,

‖x̂ − x̂ex‖ ≤ ‖x̂ − Rk(x̂) − x̂ex‖ + ‖Rk(x̂)‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ}. □
We next present a lemma showing that unit step sizes are eventually accepted, i.e., αk = 1, so that the direction dk is taken without backtracking.

Lemma 8 Suppose that Assumptions 1 and 2 hold. Then there exists a positive integer k0 such that αk = 1 for all k ≥ k0

(i) if ρ ∈ [0, 1), or
(ii) if ρ = 1 and c, η satisfy 2ηL2(L1 + 2) + κ²L2² + 2cκL2 ≤ 6c².

In particular, the inequality in (ii) is satisfied if we take c ≥ κL2 and η ≤ κ²L2/(2(L1 + 1)).
Proof. First note that

f(x̂) − f(xk) = (x̂ − xk)T ∫₀¹ ∇f(xk + t(x̂ − xk)) dt
  = (x̂ − xk)T ∫₀¹ (∇f(xk + t(x̂ − xk)) − ∇f(xk)) dt + (x̂ − xk)T ∇f(xk)
  = (x̂ − xk)T ∫₀¹ ∫₀¹ t(∇²f(xk + st(x̂ − xk)) − ∇²f(xk)) ds dt (x̂ − xk)
    + (x̂ − xk)T ∫₀¹ t∇²f(xk) dt (x̂ − xk) + ∇f(xk)T(x̂ − xk)
  ≤ (1/2)(x̂ − xk)T ∇²f(xk)(x̂ − xk) + ∇f(xk)T(x̂ − xk) + (L2/6) ‖x̂ − xk‖³.

Therefore,

F(x̂) − F(xk) + qk(xk) − qk(x̂)
  = f(x̂) − f(xk) − ∇f(xk)T(x̂ − xk) − (1/2)(x̂ − xk)T ∇²f(xk)(x̂ − xk) − (µk/2) ‖x̂ − xk‖²    (20)
  ≤ (L2/6) ‖x̂ − xk‖³ − (µk/2) ‖x̂ − xk‖².

Hence, using inequalities (20), (11), and (14), we have

F(x̂) − F(xk) = F(x̂) − F(xk) − qk(x̂) + qk(xk) + qk(x̂) − qk(xk)
  ≤ ζ(ℓk(x̂) − ℓk(xk)) + (L2/6) ‖dk‖³ − (µk/2) ‖dk‖²
  = θ(ℓk(x̂) − ℓk(xk)) + (ζ − θ)(ℓk(x̂) − ℓk(xk)) + (L2/6) ‖dk‖³ − (µk/2) ‖dk‖²
  ≤ θ(ℓk(x̂) − ℓk(xk)) − (1 + (ζ − θ)/(1 − ζ)) (µk/2) ‖dk‖² + (L2/6) ‖dk‖³
  ≤ θ(ℓk(x̂) − ℓk(xk))

if

‖dk‖ = ‖x̂ − xk‖ ≤ (1 + (ζ − θ)/(1 − ζ)) · 3c r(xk)^ρ / L2.    (21)

By Lemma 6 and Lemma 7, for sufficiently large k,

‖x̂ − xk‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (L2/(2c r(xk)^ρ)) dist(xk, X)² + 2 dist(xk, X)
  ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (κ²L2/(2c)) r(xk)^{2−ρ} + 2κ r(xk).    (22)

Hence a sufficient condition for inequality (21) is

(η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (κ²L2/(2c)) r(xk)^{2−ρ} + 2κ r(xk) ≤ (1 + (ζ − θ)/(1 − ζ)) · 3c r(xk)^ρ / L2.    (23)

If ρ ∈ [0, 1), then since r(xk) → 0, (23) holds for sufficiently large k. Now consider the case ρ = 1. Since ζ > θ and µk ∈ (0, 1) for sufficiently large k, a sufficient condition for (23) is

η(L1 + 1)/c + κ²L2/(2c) + 2κ ≤ 3c/L2,

which in particular holds if we take η ≤ κ²L2/(2(L1 + 1)) and c ≥ κL2. □
4.2 Proof of Theorem 3
By Lemma 8, for all three cases in Theorem 3 the step size αk = 1 for k sufficiently large. Hence, for sufficiently large k, the approximate inner solution x̂ is exactly the next iterate xk+1. As a result, all the properties derived in Section 4.1 still hold with x̂ replaced by xk+1 for k sufficiently large. From Proposition 1, we have

‖xk+1 − xk‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (L2/(2c r(xk)^ρ)) dist(xk, X)² + 2 dist(xk, X)
  ≤ (η(L1 + |1 − µk|)(L1 + 2)/c + η(L1 + 2) r(xk)^ρ + (L2/(2c r(xk)^ρ)) dist(xk, X) + 2) dist(xk, X).    (24)

By the error bound assumption, dist(x, X) ≤ κ r(x) for sufficiently small r(x). Since r(xk) → 0 as k → ∞, we have that for sufficiently large k,

‖xk+1 − xk‖ = O(dist(xk, X)).

Using the same arguments as in Lemma 6, it can be shown that

‖proxg(xk+1 − ∇f(xk+1)) − proxg(xk+1 − ∇fk(xk+1))‖ ≤ ‖∇f(xk+1) − ∇f(xk) − Hk(xk+1 − xk)‖.    (25)

Finally, we have that for sufficiently large k,

dist(xk+1, X) ≤ κ r(xk+1)
  ≤ κ ‖R(xk+1) − Rk(xk+1)‖ + κ ‖Rk(xk+1)‖
  ≤ κ ‖proxg(xk+1 − ∇f(xk+1)) − proxg(xk+1 − ∇fk(xk+1))‖ + κη r(xk)^{1+ρ}
  ≤ κ ‖∇f(xk+1) − ∇f(xk) − Hk(xk+1 − xk)‖ + κη r(xk)^{1+ρ}
  ≤ (κL2/2) ‖xk+1 − xk‖² + cκ r(xk)^ρ ‖xk+1 − xk‖ + κη r(xk)^{1+ρ}
  ≤ (κL2/2) ‖xk+1 − xk‖² + cκ(L1 + 2)^ρ dist(xk, X)^ρ ‖xk+1 − xk‖ + κη(L1 + 2)^{1+ρ} dist(xk, X)^{1+ρ}
  ≤ O(dist(xk, X)^{1+ρ}),    (26)

where the first inequality follows from the error bound assumption, the fourth from (25), the sixth from Proposition 1, and the seventh from (24).

If ρ = 0, inequality (26) does not necessarily imply linear convergence unless the constant in the big-O notation is strictly less than 1. Inspecting the second last line of (26) and the second line of (24), a sufficient condition is

κη(L1 + 2) + cκ (η(L1 + |1 − c|)(L1 + 2)/c + η(L1 + 2) + 2) < 1,

which in particular holds if we take η ≤ 1/(2(L1 + 3)²) and c ≤ κ/4. □
4.3 Proof of Corollary 3
By (17) and the error bound assumption, we have limk→∞ dist(xk, X) = 0. This implies that for all three cases in Theorem 3, there exists an integer K1 > 0 such that dist(xk+1, X) ≤ γ · dist(xk, X) for all k ≥ K1, where γ ∈ (0, 1) is the constant in case (i) of Theorem 3. In addition, by (24), there exist σ > 0 and an integer K2 > 0 such that ‖xk+1 − xk‖ ≤ σ · dist(xk, X) for all k ≥ K2. Furthermore, since limk→∞ dist(xk, X) = 0, given any ε > 0, there exists an integer K3 > 0 such that dist(xk, X) ≤ ε(1 − γ)/σ for all k ≥ K3. Now taking K = max{K1, K2, K3}, we have

‖xk1 − xk0‖ ≤ Σ_{i=k0}^{k1−1} ‖xi+1 − xi‖ ≤ σ Σ_{i=k0}^{k1−1} dist(xi, X) ≤ σ Σ_{i=0}^{k1−k0−1} γ^i · dist(xk0, X)
  ≤ σ Σ_{i=0}^{∞} γ^i · dist(xk0, X) ≤ (σ/(1 − γ)) · dist(xk0, X) ≤ ε,

for all k1 ≥ k0 ≥ K. Hence, {xk}k≥0 is a Cauchy sequence and converges to some x∗. This, together with limk→∞ dist(xk, X) = 0 and the fact that X is closed, implies that x∗ ∈ X.

It then suffices to prove the convergence rates of {xk}k≥0 in cases (i), (ii), and (iii). From the above analysis, it is readily seen that

‖xk+i − xk+1‖ ≤ (σ/(1 − γ)) · dist(xk+1, X)

for all integers i ≥ 1 and k ≥ K. Upon taking the limit as i → ∞, we have

‖xk+1 − x∗‖ ≤ (σ/(1 − γ)) · dist(xk+1, X).    (27)

In case (i), Theorem 3 implies that dist(xk+1, X) converges R-linearly to 0. Hence, by (27), {xk}k≥0 converges R-linearly to x∗. In case (ii), using Theorem 3 and (27), we obtain

‖xk+1 − x∗‖ = O(dist(xk, X)^{1+ρ}) ≤ O(‖xk − x∗‖^{1+ρ}),

which implies that {xk}k≥0 converges Q-superlinearly to x∗ with order 1 + ρ. Following the same arguments, we obtain the conclusion in case (iii). □
5 Numerical Experiments
We now explore the numerical performance of the proposed IRPN algorithm. Specifically, we apply it to ℓ1-regularized logistic regression, which is widely used for linear classification tasks and is also a standard benchmark problem for testing the efficiency of an algorithm for solving (1). The optimization problem of ℓ1-regularized logistic regression takes the following form:

min_{x∈Rn} F(x) = (1/m) Σ_{i=1}^{m} log(1 + exp(−bi · aiT x)) + µ‖x‖1,    (28)

where (ai, bi) ∈ Rn × {−1, 1}, i = 1, . . . , m, are the data points. We employ the data set RCV1, which contains 20242 samples and 47236 features and is downloaded from the LIBSVM datasets repository. Since the number of features is larger than the number of samples, the objective function F(x) in (28) associated with this classification task is not strongly convex. However, by Theorem 1, the error bound condition (10) holds. Hence, both Assumption 1 and Assumption 2 hold for problem (28), and our convergence analysis of IRPN applies when solving problem (28).
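To make the experimental setup concrete, the objective in (28) and the residual r(x) = ‖x − prox_{µ‖·‖1}(x − ∇f(x))‖ used as the stopping criterion can be computed as follows. This is a plain NumPy sketch; the function names are our own:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_value_grad(x, A, b, mu):
    """F(x) = (1/m) * sum_i log(1 + exp(-b_i * a_i^T x)) + mu * ||x||_1,
    returned together with the gradient of its smooth part f."""
    m = A.shape[0]
    z = -b * (A @ x)                      # z_i = -b_i * a_i^T x
    val = np.mean(np.logaddexp(0.0, z)) + mu * np.sum(np.abs(x))
    sigma = 1.0 / (1.0 + np.exp(-z))      # sigma_i = exp(z_i) / (1 + exp(z_i))
    grad_f = -(A.T @ (b * sigma)) / m
    return val, grad_f

def residual(x, A, b, mu):
    """r(x) = ||x - prox_{mu||.||_1}(x - grad f(x))||; r(x) = 0 iff x is optimal."""
    _, g = logistic_value_grad(x, A, b, mu)
    return np.linalg.norm(x - soft_threshold(x - g, mu))
```

Note that `np.logaddexp` evaluates log(1 + exp(z)) in a numerically stable way, which matters for well-classified samples with large |z|.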
We compare our algorithm IRPN with two popular and efficient methods for solving (28). The first is the newGLMNET algorithm, which is also an SQA method and is the workhorse of the well-known LIBLINEAR package [29]. The second is the fast iterative shrinkage-thresholding algorithm with an adaptive restart scheme [17] (adaFISTA), which is a first-order method. Although both IRPN and newGLMNET are instantiations of SQA methods, there are
Tol.     Metric                 adaFISTA   newGLMNET   IRPN ρ=0   IRPN ρ=0.5   IRPN ρ=1
10−4     Outer iterations       264        21          14         9            8
         Inner iterations       –          52          28         51           54
         Training time (sec)    4.19       4.10        3.56       6.10         4.53
10−6     Outer iterations       450        28          20         10           9
         Inner iterations       –          85          70         92           123
         Training time (sec)    7.14       7.65        6.02       7.90         11.34
10−8     Outer iterations       1978       38          23         11           9
         Inner iterations       –          116         90         151          146
         Training time (sec)    31.52      10.56       9.32       13.44        12.10

Table 1: Comparison of the number of iterations and time consumed. Algorithms are terminated when the residual r(x) falls below the tolerances 10−4, 10−6, and 10−8.
two main differences: (i) newGLMNET regularizes the Hessian by adding a constant multiple of the identity matrix, while IRPN employs an adaptive regularization (see Step 5 of Algorithm 1); (ii) newGLMNET uses heuristic stopping schemes for the inner minimization, while IRPN employs the adaptive inexactness conditions (11), which lead to provable convergence guarantees. To highlight these two differences, we deactivate other heuristic tricks in the implementation that can be applied to general SQA methods for acceleration, such as the shrinking strategy. Moreover, for both methods, a randomized coordinate descent algorithm is called to solve the inner minimization.

We see from Table 1 that IRPN with ρ = 0 is the best among the compared algorithms, in the sense that it requires the least training time to reach any of the given tolerances. Note that when ρ = 0, IRPN differs from newGLMNET only in the stopping conditions for the inner minimization. Hence, Table 1 suggests that the adaptive inexactness conditions (11) are not only theoretically sound but also practically better than the heuristic stopping scheme employed by newGLMNET.
We then study the convergence rate of the sequence {xk}k≥0 generated by the IRPN method. Thanks to the error bound condition (10), the speed at which xk converges to the optimal set X is asymptotically of the order of r(xk). Hence, it suffices to investigate the rate at which r(xk) converges to 0. It can be observed from Figure 1(a) that r(xk) converges linearly if ρ = 0 and superlinearly if ρ = 0.5 or ρ = 1. This corroborates our results in Theorem 3. Nonetheless, a faster convergence rate of r(xk) does not tell the whole story, since a larger ρ leads to stricter inexactness conditions (11) and hence more iterations for each inner minimization. For the overall performance, we refer to Figure 1(b), which plots the decrease of (the logarithm of) r(xk) against training time. The figure shows that the training times for the three values of ρ are almost the same.
[Figure 1: two semi-log plots of r(x) for ρ = 0, ρ = 0.5, and ρ = 1, decreasing from about 10² to 10−10; panel (a) plots r(x) against the number of iterations (0 to 30), panel (b) against training time (0 to 14 sec).]

Fig. 1: Convergence performance of the IRPN algorithm.
6 Conclusions
In this paper, we propose a family of SQA methods, the inexact regularized proximal Newton (IRPN) method, for minimizing a sum of smooth and non-smooth convex functions. Compared with prior works on SQA, IRPN has much more satisfying theoretical guarantees: it requires neither an exact inner solver nor the strong convexity of the smooth part. We prove that the IRPN method converges globally to the set of optimal solutions and that its asymptotic rate of convergence is linear, superlinear, or even quadratic, depending on the choice of ρ. The key to our analysis is the Luo-Tseng error bound condition. Although this error bound has played a fundamental role in establishing the linear convergence of various first-order methods, to the best of our knowledge, this is the first time such a condition has been utilized to prove superlinear convergence of SQA methods for this class of minimization problems. We compare our proposed IRPN with two popular and efficient algorithms, namely newGLMNET [29] and adaptively restarted FISTA [17], by applying them to the ℓ1-regularized logistic regression problem. Experiment results indicate that IRPN achieves the same accuracies in less training time and fewer iterations than the other two algorithms. This shows that the IRPN method not only has a promising theory but also superior numerical performance.
References
1. A. Beck and M. Teboulle. A Fast Iterative Shrinkage–Thresholding Algorithm for Linear
Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
2. S. Becker and J. Fadili. A Quasi–Newton Proximal Splitting Method. In Advances in
Neural Information Processing Systems, pages 2618–2626, 2012.
3. R. H. Byrd, J. Nocedal, and F. Oztoprak. An Inexact Successive Quadratic Approximation Method for L–1 Regularized Optimization. Accepted for publication in Mathematical Programming, Series B, 2015.
4. P. L. Combettes and V. R. Wajs. Signal Recovery by Proximal Forward–Backward
Splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
5. A. L. Dontchev and R. T. Rockafellar. Implicit Functions and Solution Mappings.
Springer Monographs in Mathematics. Springer Science+Business Media, LLC, New
York, 2009.
6. F. Facchinei and J.-S. Pang. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007.
7. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A
Library for Large Linear Classification. The Journal of Machine Learning Research,
9:1871–1874, 2008.
8. A. Fischer. Local Behavior of an Iterative Framework for Generalized Equations with
Nonisolated Solutions. Mathematical Programming, 94(1):91–124, 2002.
9. J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, et al. Pathwise Coordinate Optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
10. C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse Inverse Covariance
Matrix Estimation using Quadratic Approximation. In Advances in Neural Information
Processing Systems, pages 2330–2338, 2011.
11. J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton–Type Methods for Minimizing
Composite Functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.
12. D.-H. Li, M. Fukushima, L. Qi, and N. Yamashita. Regularized Newton Methods for
Convex Minimization Problems with Singular Solutions. Computational Optimization
and Applications, 28(2):131–147, 2004.
13. Z.-Q. Luo and P. Tseng. Error Bound and Convergence Analysis of Matrix Splitting Algorithms for the Affine Variational Inequality Problem. SIAM Journal on Optimization,
2(1):43–54, 1992.
14. Z.-Q. Luo and P. Tseng. On the Linear Convergence of Descent Methods for Convex Essentially Smooth Minimization. SIAM Journal on Control and Optimization,
30(2):408–425, 1992.
15. Z.-Q. Luo and P. Tseng. Error Bounds and Convergence Analysis of Feasible Descent
Methods: A General Approach. Annals of Operations Research, 46(1):157–178, 1993.
16. J. J. Moré. The Levenberg-Marquardt Algorithm: Implementation and Theory. In
Numerical analysis, pages 105–116. Springer, 1978.
17. B. O’Donoghue and E. Candès. Adaptive Restart for Accelerated Gradient Schemes.
Foundations of computational mathematics, 15(3):715–732, 2015.
18. F. Oztoprak, J. Nocedal, S. Rennie, and P. A. Olsen. Newton–Like Methods for Sparse
Inverse Covariance Estimation. In Advances in Neural Information Processing Systems,
pages 755–763, 2012.
19. J.-S. Pang. A Posteriori Error Bounds for the Linearly–Constrained Variational Inequality Problem. Mathematics of Operations Research, 12(3):474–484, 1987.
20. J.-S. Pang. Error Bounds in Mathematical Programming. Mathematical Programming,
79(1-3):299–332, 1997.
21. H. Qi and D. Sun. A Quadratically Convergent Newton Method for Computing
the Nearest Correlation Matrix. SIAM journal on matrix analysis and applications,
28(2):360–385, 2006.
22. S. Sardy, A. Antoniadis, and P. Tseng. Automatic Smoothing with Wavelets for a Wide
Class of Distributions. Journal of Computational and Graphical Statistics, 13(2):399–
421, 2004.
23. M. W. Schmidt, E. Berg, M. P. Friedlander, and K. P. Murphy. Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm. In International Conference on Artificial Intelligence and Statistics, 2009.
24. P. Tseng. Error Bounds and Superlinear Convergence Analysis of some Newton–Type
Methods in Optimization. In Nonlinear Optimization and Related Topics, pages 445–
462. Springer, 2000.
25. P. Tseng. Approximation Accuracy, Gradient Methods, and Error Bound for Structured
Convex Optimization. Mathematical Programming, Series B, 125(2):263–295, 2010.
26. P. Tseng and S. Yun. A Coordinate Gradient Descent Method for Nonsmooth Separable
Minimization. Mathematical Programming, Series B, 117(1–2):387–423, 2009.
27. N. Yamashita and M. Fukushima. On the Rate of Convergence of the LevenbergMarquardt Method. In Topics in numerical analysis, pages 239–249. Springer, 2001.
28. I. E.-H. Yen, C.-J. Hsieh, P. K. Ravikumar, and I. S. Dhillon. Constant Nullspace
Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional
Settings. In Advances in Neural Information Processing Systems, pages 1008–1016,
2014.
29. G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An Improved GLMNET for L1–Regularized
Logistic Regression. The Journal of Machine Learning Research, 13(1):1999–2030, 2012.
30. K. Zhong, I. E.-H. Yen, I. S. Dhillon, and P. K. Ravikumar. Proximal Quasi–Newton
for Computationally Intensive `1 –Regularized M –Estimators. In Advances in Neural
Information Processing Systems, pages 2375–2383, 2014.
31. Z. Zhou and A. M.-C. So. A Unified Approach to Error Bounds for Structured Convex
Optimization Problems. arXiv preprint arXiv:1512.03518, 2015.
32. Z. Zhou, Q. Zhang, and A. M.-C. So. `1,p –Norm Regularization: Error Bounds and
Convergence Rate Analysis of First–Order Methods. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 1501–1510, 2015.