A Family of SQA Methods for Non-Smooth Non-Strongly Convex Minimization with Provable Convergence Guarantees

Man-Chung Yue · Zirui Zhou · Anthony Man-Cho So

Abstract We propose a family of sequential quadratic approximation (SQA) methods, the inexact regularized proximal Newton (IRPN) method, for minimizing the sum of a smooth and a non-smooth convex function. The proposed algorithm features strong convergence guarantees even when applied to problems with degenerate solutions, while allowing the inner minimization to be solved inexactly. We prove that IRPN converges globally and that, under the so-called Luo-Tseng error bound (EB) condition, the sequence generated by IRPN converges linearly, superlinearly, or even quadratically to an optimal solution, depending on the choice of a parameter of the algorithm. Prior to this work, the EB condition had been used extensively to prove linear convergence rates of various first-order methods. To the best of our knowledge, ours is the first EB-based analysis that establishes superlinear convergence of SQA-type methods for non-smooth convex minimization. As a consequence of our results, IRPN can solve regularized regression and classification problems in the high-dimensional setting with provable convergence guarantees. We compare two empirically efficient algorithms, newGLMNET [29] and adaptively restarted FISTA [17], with the proposed IRPN by applying them to the ℓ1-regularized logistic regression problem. Experimental results show the competitiveness of our algorithm.

Man-Chung Yue, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected]

Zirui Zhou, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected]

Anthony Man-Cho So, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: [email protected]

Keywords Sequential Quadratic Approximation · Proximal Newton · Error Bound · Superlinear Convergence · Non-Strongly Convex · Composite Minimization

1 Introduction

A wide range of tasks in machine learning and statistics can be formulated as a non-smooth convex minimization problem of the following form (sometimes referred to as convex composite minimization):

    min_{x ∈ R^n} F(x) := f(x) + g(x),    (1)

where f : R^n → (−∞, ∞] is convex and twice continuously differentiable on an open subset of R^n containing dom(g), and g : R^n → (−∞, ∞] is a proper, closed, convex, but possibly non-smooth function. A popular choice for solving problem (1) is the successive quadratic approximation (SQA) method (also called the proximal Newton method). Here is a blueprint for a generic SQA method: at iteration k, one computes the minimizer x̂ (or an approximation of it) of a quadratic model q_k(x) of the objective function F(x) at x^k, where

    q_k(x) := f(x^k) + ∇f(x^k)^T (x − x^k) + (1/2)(x − x^k)^T H_k (x − x^k) + g(x),    (2)

and H_k is a positive definite matrix approximating the Hessian ∇²f(x^k) of f at x^k. A step size α_k is then obtained by performing a backtracking line search along the direction d^k := x̂ − x^k, and x^{k+1} := x^k + α_k d^k is returned as the next iterate. Since in most cases the minimizer x̂ (whether exact or approximate) of (2) does not admit a closed-form expression, an iterative algorithm, which we shall refer to as the inner solver, is invoked. Throughout the paper, we will also refer to the task of minimizing the quadratic model (2) as the inner problem or inner minimization, and to its solution as the inner solution.

There are three important ingredients that, more or less, determine an SQA method: the approximate Hessian H_k, the inner solver for minimizing q_k, and the stopping criterion of the inner solver, which controls the inexactness of the approximate inner solution x̂. Indeed, many existing SQA methods and their variants tailored to special instances of (1) can be obtained by specifying these ingredients. Friedman et al. [9] developed the GLMNET algorithm for solving ℓ1-regularized logistic regression, where H_k is set to the exact Hessian ∇²f(x^k) and coordinate minimization is used as the inner solver. Yuan et al. [29] improved GLMNET by replacing H_k with ∇²f(x^k) + νI for some constant ν > 0 and adding a heuristic adaptive stopping strategy for the inner minimization. This algorithm, called newGLMNET, is now the workhorse of the well-known LIBLINEAR package [7] for large-scale linear classification. Hsieh et al. [10] proposed the QUIC algorithm for sparse inverse covariance matrix estimation, which makes use of a quasi-Newton model for forming the quadratic approximation (2). Similar strategies have also been employed in Olsen et al. [18], where the inner problems are solved by the fast iterative shrinkage-thresholding algorithm (FISTA) [1]. Other SQA variants can be found, e.g., in [23, 2, 30]. Although the numerical advantages of the above algorithms are well documented, their results on global convergence and convergence rate require the inner problems to be solved exactly, a requirement that is rarely satisfied in practice. To address the inexactness of the inner problem, Lee et al. [11] and Byrd et al. [3] recently proposed several families of SQA methods, along with stopping criteria for the inner problems, that enjoy global convergence and an asymptotic superlinear rate.
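To make the generic SQA blueprint (2) concrete, here is a minimal, self-contained sketch of SQA iterations on a one-dimensional toy instance of (1) with f(x) = (1/2)(x − 3)² and g(x) = |x|. It is purely illustrative and not any of the cited implementations: with H_k taken to be the exact (here constant) Hessian, the inner problem happens to admit a closed-form solution via soft-thresholding, so no iterative inner solver is needed.

```python
def soft_threshold(v, tau):
    # proximal operator of tau*|.| in one dimension
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def sqa_step(x, grad_f, hess_f, mu):
    # Minimizing the scalar model q_k(y) = f(x) + f'(x)(y - x)
    # + (H/2)(y - x)^2 + mu*|y| is a scaled soft-thresholding step.
    H = hess_f(x)
    return soft_threshold(x - grad_f(x) / H, mu / H)

# toy instance: f(x) = 0.5*(x - 3)^2, g(x) = |x|; the minimizer is x = 2
grad_f = lambda x: x - 3.0
hess_f = lambda x: 1.0
x = 10.0
for _ in range(3):
    x = sqa_step(x, grad_f, hess_f, mu=1.0)
print(x)  # prints 2.0
```

Because f is quadratic here, a single step already lands on the minimizer; for general f, the blueprint repeats the model-and-minimize step together with a backtracking line search, as described above.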
One crucial assumption underpinning their theoretical results, without which the analyses break down or the algorithms are not even well-defined, is the strong convexity of f. (For instance, in [11] one takes H_k = ∇²f(x^k); if strong convexity is missing, the quadratic model (2) is not strongly convex and can have multiple minimizers, so the next iterate x^{k+1} is not well-defined.) Unfortunately, such an assumption fails to hold in many applications of interest. For example, the objective function in ℓ1-regularized logistic regression is

    F(x) = (1/m) ∑_{i=1}^m log(1 + exp(−b_i · a_i^T x)) + µ‖x‖₁,

where the smooth part f(x) = (1/m) ∑_{i=1}^m log(1 + exp(−b_i · a_i^T x)) is strongly convex if and only if the data matrix A := [a_1, …, a_m] ∈ R^{n×m} has full row rank. This property is not guaranteed in most applications and is impossible in the high-dimensional setting, i.e., when n ≫ m. In view of the above discussion, the theory of existing SQA methods is rather incomplete, and we are motivated to develop an algorithm for solving problem (1) that requires neither an exact inner solver nor the strong convexity of f, while possessing the desired convergence properties, such as global convergence and an asymptotic superlinear rate.

In this paper, we propose a new family of SQA methods, the inexact regularized proximal Newton (IRPN) method, capable of solving non-strongly convex instances of problem (1) without an exact inner solver. At iteration k, IRPN takes H_k to be the regularized Hessian H_k = ∇²f(x^k) + µ_k I, where µ_k = c · r(x^k)^ρ for some constants c > 0 and ρ ∈ [0, 1], and r(·) is a residual function measuring the proximity of x^k to the optimal set (see Section 2.2 for details). The inner minimization is solved up to an adaptive inexactness condition that also depends on the residual r(x^k) and the parameter ρ. It is worth noting that the idea of regularization is not new for second-order methods. Indeed, Li et al. [12] investigated the regularized Newton method for solving smooth convex minimization with degenerate solutions. The well-known Levenberg-Marquardt (LM) method for solving nonlinear equations is essentially a regularized Gauss-Newton method [16, 27]. The algorithm closest in spirit to ours is newGLMNET [29], which targets minimization problems of the same form (1) and also uses a regularized Hessian, although its regularization parameter remains constant throughout the entire execution, i.e., µ_k = ν for all k. Hence it may be seen as a special case (ρ = 0) of IRPN. However, the constant ν is chosen empirically, and a heuristic stopping rule is adopted for the inner solver.

Another motivation for this research comes from the seminal paper [15] of Luo and Tseng, where a notion of error bound (EB) condition was introduced. This EB condition is a provably weaker assumption than strong convexity and has been shown to hold for a wide range of instances of (1) (see Theorem 1). Since its introduction, the Luo-Tseng EB has played an important role in the convergence rate analysis of various first-order methods. Specifically, it has been utilized to prove the linear convergence of projected gradient descent, proximal gradient descent, coordinate descent, and block coordinate descent for solving problem (1) without assuming strong convexity; see [15, 26, 25, 32]. Following the great success of EB-based analyses of first-order methods, it is natural to ask whether an EB-based analysis of second-order methods for solving problem (1) can be developed. Such an analysis is highly desirable, as it would advance our understanding of the strengths and limitations of SQA methods and more accurately capture the interplay between algorithmic properties and the geometry of problem (1). Some partial answers have been obtained in this direction.
Li et al. [12] proposed a regularized Newton method, along with inexactness criteria, for solving smooth non-strongly convex minimization and proved its global convergence and local quadratic convergence rate under an EB assumption. Unfortunately, their analysis relies heavily on an eigenvalue perturbation analysis of the Hessian of the smooth objective function and hence cannot be extended to second-order algorithms for the non-smooth minimization (1). Tseng and Yun [26] proposed the (block) coordinate gradient descent (CGD) method for solving problem (1) with non-convex f and non-smooth convex g with separable structure. CGD includes the SQA method as a special case. Although their analysis is also based on the Luo-Tseng EB, CGD requires the sequence of approximate Hessians H_k to be uniformly bounded below, i.e., λ_min(H_k) ≥ λ for all k for some constant λ > 0, and the asymptotic convergence rate obtained is only linear. Another important contribution in this direction is the paper [8] by Fischer, who studied an abstract framework for solving a class of generalized equations and proved its asymptotic superlinear convergence rate. Besides an error bound-type assumption and an inexactness requirement, the result rests on the strong assumption that the distance between the current and next iterates is linearly bounded by the distance from the current iterate to the solution set, i.e., ‖x^{k+1} − x^k‖ = O(dist(x^k, X)), where X is the set of solutions of the generalized equation. For a concrete algorithm, however, establishing such an estimate is usually a nontrivial task. Moreover, the iterative framework in [8] does not involve regularization, and hence it is not obvious how our algorithm can be cast into their abstract framework (see equations (3) and (4) of [8]).

Our contributions can be summarized as follows. First, we propose a family of SQA methods, IRPN, for solving the non-strongly convex minimization (1).
The proposed algorithm is parametrized by ρ ∈ [0, 1] and comes with a stopping criterion for the inner problems, which allows us to solve the inner problems inexactly. Second, we establish the global convergence and asymptotic convergence rate of IRPN. More precisely, we prove that each accumulation point of the sequence of iterates generated by IRPN is optimal for (1). Furthermore, under the Luo-Tseng EB assumption, the sequence converges to an optimal solution, and the convergence rate is R-linear if ρ = 0, Q-superlinear if ρ ∈ (0, 1), and Q-quadratic if ρ = 1. In particular, our convergence rate results are proved under the Luo-Tseng EB condition. This is also our third contribution: we develop a novel EB-based analysis of SQA methods for solving problem (1) that is able to yield superlinear convergence.

The paper is organized as follows. In Section 2, we collect the assumptions and basic results needed in the paper. We also discuss the Luo-Tseng EB condition for (1) and some related regularity conditions. In Section 3, we detail the algorithm IRPN, including the regularized Hessian and the inexactness conditions for the inner minimization. In Section 4, we establish the global convergence and the local convergence rate of IRPN. Numerical studies are provided in Section 5. Finally, we conclude the paper in Section 6.

Notation. We denote the optimal value of and the set of optimal solutions to problem (1) by F* and X, respectively, and let ‖·‖ be the Euclidean norm. Given a set C ⊆ R^n, we denote by dist(x, C) the distance from x to C, i.e., dist(x, C) = inf{‖x − u‖ | u ∈ C}. For a closed convex function h, we denote by ∂h the subdifferential of h. Furthermore, if h is smooth, we let ∇h and ∇²h be the gradient and Hessian of h, respectively.

2 Preliminaries

Throughout, we consider problem (1), where the function f is twice continuously differentiable on an open subset of R^n containing dom(g).
To avoid ill-posed problems, we assume that the optimal value F* > −∞ and that the set of optimal solutions X is non-empty. Furthermore, we make the following assumptions on the derivatives of f.

Assumption 1. The function f in problem (1) satisfies the following:

(a) The gradient is Lipschitz continuous on an open set U containing dom(g); i.e., there exists a constant L1 > 0 such that

    ‖∇f(y) − ∇f(z)‖ ≤ L1 ‖y − z‖  for all y, z ∈ U.    (3)

(b) The Hessian is Lipschitz continuous on an open set U containing dom(g); i.e., there exists a constant L2 > 0 such that

    ‖∇²f(y) − ∇²f(z)‖ ≤ L2 ‖y − z‖  for all y, z ∈ U.    (4)

The above assumptions are standard in the analysis of Newton-type methods. As we will see in Sections 4 and 5, Assumption 1(a) is crucial to the analysis of global convergence, and Assumption 1(b) is utilized in analyzing the local convergence rate. A direct consequence of Assumption 1 is that the operator norm of the Hessian ∇²f is bounded.

Lemma 1. Under Assumption 1, for any x ∈ dom(g), we have λ_max(∇²f(x)) ≤ L1.

Proof. Let x be an arbitrary vector in dom(g). By Assumption 1, f is twice continuously differentiable at x. Hence, by the definition of ∇²f, for any vector v ∈ R^n,

    ∇²f(x) v = lim_{h→0} (∇f(x + hv) − ∇f(x)) / h.

Taking norms on both sides, we have

    ‖∇²f(x) v‖ = lim_{h→0} ‖∇f(x + hv) − ∇f(x)‖ / h ≤ lim_{h→0} L1 h ‖v‖ / h = L1 ‖v‖,

where the inequality is due to Assumption 1(a). Hence, by the definition of λ_max, we have λ_max(∇²f(x)) ≤ L1. ∎

2.1 Optimality and Residual Functions

We now introduce the optimality conditions of (1) and the announced error bound condition. Given a convex function h, the so-called proximal operator of h is defined as

    prox_h(v) := argmin_{u ∈ R^n} { (1/2) ‖u − v‖² + h(u) }.    (5)

The proximal operator is the building block of many first-order methods for solving (1), such as the proximal gradient method and its accelerated versions and (block) coordinate gradient descent methods, and can be viewed as a generalization of the projection operator. When the function h is the indicator function of a closed convex set C, the proximal operator of h reduces to the projection operator onto C. Moreover, the proximal operators of many non-smooth functions, such as the ℓ1-norm, the group-Lasso regularizer, the elastic net regularizer, and the nuclear norm, have closed-form representations. For example, let h be the scaled ℓ1-norm function τ‖x‖₁. Then the proximal operator of h is the soft-thresholding operator:

    prox_h(v) = sign(v) ⊙ max{|v| − τ, 0},

where sign, max, and |·| are entrywise operations and ⊙ denotes the entrywise product.

We now recall some useful properties of the proximal operator that will be used later. One well-known property is that it characterizes the optimal solutions of (1) as its fixed points. For its proof, we refer readers to Lemma 2.4 of [4].

Lemma 2. A vector x ∈ dom(g) is an optimal solution of (1) if and only if for any τ > 0, the following fixed-point equation holds:

    x = prox_{τg}(x − τ∇f(x)).    (6)

Let R : dom(g) → R^n be the map given by

    R(x) := x − prox_g(x − ∇f(x)).    (7)

Lemma 2 suggests that r(x) = ‖R(x)‖ is a residual function for problem (1), in the sense that r(x) ≥ 0 for all x ∈ dom(g) and r(x) = 0 if and only if x ∈ X. In addition, the following proposition shows that both R(x) and r(x) are Lipschitz continuous on dom(g) if Assumption 1(a) is satisfied.

Proposition 1. Suppose that Assumption 1(a) is satisfied. Then, for any y, z ∈ dom(g), we have

    |r(y) − r(z)| ≤ ‖R(y) − R(z)‖ ≤ (L1 + 2) ‖y − z‖.

Proof. The first inequality is a direct consequence of the triangle inequality. We now prove the second one.
It follows from the definition of R that

    ‖R(y) − R(z)‖ = ‖y − prox_g(y − ∇f(y)) − z + prox_g(z − ∇f(z))‖
                  ≤ ‖y − z‖ + ‖prox_g(y − ∇f(y)) − prox_g(z − ∇f(z))‖
                  ≤ 2‖y − z‖ + ‖∇f(y) − ∇f(z)‖
                  ≤ (L1 + 2)‖y − z‖,

where the second inequality follows from the firm non-expansiveness of the proximal operator (see Proposition 3.5 of [4]) and the last one is by Assumption 1(a). ∎

2.1.1 Inner Minimization

Recall that at the k-th iteration of an SQA method, the goal is to find the minimizer of (2). Letting f_k(x) denote the sum of the first three terms of (2), the inner minimization problem reads

    min_{x ∈ R^n} q_k(x) := f_k(x) + g(x),    (8)

which is itself a convex minimization problem of the form (1) with a quadratic smooth part. Therefore, both the optimality condition and the residual function studied above for problem (1) can be adapted to the inner minimization (8). The following corollary is immediate from Lemma 2, after noting that ∇f_k(x) = ∇f(x^k) + H_k(x − x^k).

Corollary 1. A vector x ∈ dom(g) is a minimizer of (8) if and only if for any τ > 0, the following fixed-point equation holds:

    x = prox_{τg}(x − τ∇f_k(x)) = prox_{τg}((I − τH_k)x − τ(∇f(x^k) − H_k x^k)).

Similar to the map R for problem (1), let R_k : dom(g) → R^n be the map given by

    R_k(x) := x − prox_g(x − ∇f_k(x)) = x − prox_g((I − H_k)x − (∇f(x^k) − H_k x^k)).    (9)

Parallel to the role of r(x) for the outer problem (1), r_k(x) = ‖R_k(x)‖ is a natural residual function for the inner minimization (8). Moreover, following the lines of the proof of Proposition 1, we can easily show that both R_k and r_k are Lipschitz continuous. (In contrast to Proposition 1, Assumption 1(a) is not required here.)

Corollary 2. For any y, z ∈ R^n, we have

    |r_k(y) − r_k(z)| ≤ ‖R_k(y) − R_k(z)‖ ≤ (λ_max(H_k) + 2) ‖y − z‖.

2.2 Error Bound Condition

A prevailing assumption in existing analyses of the global convergence and local convergence rates of SQA methods for solving problem (1) is that the smooth function f is strongly convex [11, 3]. However, such an assumption is difficult to verify and fails in many applications (see the discussion in Section 1). Instead of assuming strong convexity, our analysis of IRPN is based on the following local error bound (EB) condition.

Assumption 2 (Luo-Tseng EB Condition). Let r(x) be the residual function defined in Section 2.1. For any ζ ≥ F*, there exist scalars κ > 0 and ε > 0 such that

    dist(x, X) ≤ κ · r(x)  whenever F(x) ≤ ζ and r(x) ≤ ε.    (10)

Error bounds have long been an important topic and permeate all aspects of mathematical programming [20, 6]. The Luo-Tseng EB (10) for problem (1) was studied and popularized in a series of papers by Luo and Tseng [13-15]. (A similar notion of EB had been studied earlier by Pang [19] for linearly constrained variational inequalities.) In particular, many useful subclasses of problem (1), mainly those with polyhedral g, have been verified to satisfy the Luo-Tseng EB. Recently, the validity of the Luo-Tseng EB has been extended to problems with non-polyhedral g [25] and even to problems with a non-polyhedral optimal solution set [31]. We briefly summarize these results in the following theorem. We refer the reader to the recent manuscript of Zhou and So [31] for a more detailed review of the development of the Luo-Tseng EB.

Theorem 1. For problem (1), Assumption 2 (the Luo-Tseng EB condition) holds in any of the following scenarios:

(S1) ([19, Theorem 3.1]) f is strongly convex, ∇f is Lipschitz continuous, and g is any closed convex function.

(S2) ([14, Theorem 2.1]) f takes the form f(x) = h(Ax) + ⟨c, x⟩, where A ∈ R^{m×n} and c ∈ R^n are arbitrarily given, h : R^m → (−∞, +∞) is a continuously differentiable function with the property that on any compact subset C of R^m, h is strongly convex and ∇h is Lipschitz continuous on C, and g has a polyhedral epigraph.
(S3) ([25, Theorem 2]) f takes the form f(x) = h(Ax), where A ∈ R^{m×n} and h : R^m → (−∞, +∞) are as in scenario (S2), and g is the group-LASSO regularizer; i.e., g(x) = ∑_{J ∈ J} ω_J ‖x_J‖₂, where J is a non-overlapping partition of the index set {1, …, n}, x_J ∈ R^{|J|} is the sub-vector obtained by restricting x ∈ R^n to the entries in J ∈ J, and ω_J ≥ 0 is a given parameter.

(S4) ([31, Proposition 12]) f takes the form f(X) = h(A(X)) for all X ∈ R^{n×p}, where A : R^{n×p} → R^m is a linear mapping, h : R^m → (−∞, +∞) is as in scenario (S2), g is the nuclear norm regularizer, i.e., g(X) equals the sum of the singular values of X, and there exists an X* ∈ X such that the following strict complementarity-type condition holds: 0 ∈ ∇f(X*) + ri(∂g(X*)), where ri denotes the relative interior.

Note that in scenarios (S2)-(S4), the assumptions on h are the same and can readily be shown to be satisfied by h(y) = (1/2)‖y − b‖², which corresponds to least-squares regression, and by h(y) = ∑_{i=1}^m log(1 + e^{−b_i y_i}) with b ∈ {−1, 1}^m, which corresponds to logistic regression. The assumptions are also satisfied by the loss functions that arise in maximum likelihood estimation (MLE) for conditional random field problems [30] and MLE under Poisson noise [22]. It follows from Theorem 1 that many problems of the form (1), including ℓ1-regularized logistic regression, satisfy the Luo-Tseng EB condition even when they are not strongly convex. In Section 4, the Luo-Tseng EB condition (10) is harnessed to establish the local superlinear convergence rate of our inexact regularized proximal Newton (IRPN) method, thus further demonstrating its versatility in convergence analysis.
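To make the quantities in Assumption 2 concrete, the following self-contained sketch (illustrative only; the instance and test points are our own choices, not taken from the paper's experiments) evaluates the residual r(x) = ‖x − prox_g(x − ∇f(x))‖ of Section 2.1 on a non-strongly convex instance covered by scenario (S2): f(x) = (1/2)(x1 + x2 − 2)² and g(x) = ‖x‖₁, whose optimal set is the whole segment {x : x1 + x2 = 1, x ≥ 0}.

```python
import math

def soft_threshold(v, tau):
    # entrywise proximal operator of tau*||.||_1
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def residual(x):
    # r(x) = ||x - prox_g(x - grad f(x))|| for
    # f(x) = 0.5*(x1 + x2 - 2)^2 and g(x) = |x1| + |x2|
    s = x[0] + x[1] - 2.0
    grad = [s, s]
    p = soft_threshold([xi - gi for xi, gi in zip(x, grad)], 1.0)
    return math.hypot(x[0] - p[0], x[1] - p[1])

# r vanishes on the (non-singleton) optimal set and is positive elsewhere
print(residual([1.0, 0.0]), residual([0.25, 0.75]))  # prints 0.0 0.0
print(residual([2.0, 2.0]) > 0.0)                    # prints True
```

Since f here is not strongly convex (its Hessian is rank-one) and the optimal set is not a singleton, strong-convexity-based analyses do not apply, yet by scenario (S2) the Luo-Tseng EB condition holds for this instance.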
Prior to this work, the Luo-Tseng EB condition (10) had been used to prove the local linear convergence of a number of first-order methods for solving problem (1), such as the projected gradient method, the proximal gradient method, coordinate minimization, and the coordinate gradient descent method; see [15, 25, 26] and the references therein. Besides first-order methods, the EB condition has also been used to prove the superlinear convergence of primal-dual interior-point path-following methods [24]. However, these methods are quite different in nature from the SQA methods considered in this paper.

2.3 Other Regularity Conditions

Besides the Luo-Tseng EB condition (10), there are other regularity conditions that have been utilized to establish superlinear convergence rates of SQA methods for problem (1). Yen et al. [28] and Zhong et al. [30] introduced the constant nullspace strong convexity (CNSC) condition for a smooth function. They showed that, under the CNSC condition on f and some other regularity conditions on the non-smooth function g, the proximal Newton method [11] converges quadratically and the proximal quasi-Newton method [11] converges superlinearly. The claimed convergence results are nonetheless misleading. Let x^k be the iterates generated by their algorithms and z^k be the projection of x^k onto a subspace associated with the CNSC assumption. Their results imply that z^k converges to a point z* superlinearly or quadratically. However, this does not guarantee that {x^k}_{k≥0} converges to the set of optimal solutions, and the rate of convergence of {x^k}_{k≥0} remains unclear. We also remark that neither the inexactness of the inner minimization nor global convergence was addressed in [28] and [30]. Dontchev and Rockafellar [5] developed a framework for solving generalized equations.
When specialized to the optimality condition 0 ∈ ∇f(x) + ∂g(x) of problem (1), their framework coincides with the family of SQA methods, and the corresponding convergence results require the set-valued mapping ∇f + ∂g to be either metrically regular or strongly metrically sub-regular [5]. Unfortunately, both of these regularity conditions are provably more restrictive than the EB condition (10) and are hardly satisfied by any instances of (1) of interest. For example, consider the following two-dimensional ℓ1-norm regularized linear regression problem:

    min_{x1, x2} (1/2)|x1 + x2 − 2|² + |x1| + |x2|.

The above problem satisfies the EB condition (10), as it belongs to scenario (S2) in Theorem 1. On the other hand, it can be verified that the set-valued mapping associated with this problem is neither metrically regular nor strongly metrically sub-regular.

3 The Inexact Regularized Proximal Newton Method

We now describe in detail the inexact regularized proximal Newton (IRPN) method. The algorithm takes as input an initial iterate x^0, a constant ε0 > 0 that controls the precision, constants θ ∈ (0, 1/2), ζ ∈ (θ, 1/2), η ∈ (0, 1) that are parameters of the inexactness conditions and line search, and constants c > 0, ρ ∈ [0, 1] that are used in forming the approximation matrices H_k. As we will see in Section 4, the choice of the constant ρ is crucial for the local convergence rate of IRPN. At iterate x^k, we first construct the following quadratic approximation of F at x^k:

    q_k(x) := f(x^k) + ∇f(x^k)^T (x − x^k) + (1/2)(x − x^k)^T H_k (x − x^k) + g(x),

where H_k = ∇²f(x^k) + µ_k I with µ_k = c · r(x^k)^ρ, and r(·) is the residual function defined in Section 2.1. Since f is convex and µ_k > 0, the matrix H_k is positive definite for all k. Hence, the quadratic model q_k(x) is strictly convex and thus has a unique minimizer.
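The construction of the regularized Hessian H_k = ∇²f(x^k) + µ_k I with µ_k = c · r(x^k)^ρ can be sketched as follows. This is an illustrative fragment: the 2×2 matrix (the rank-one Hessian of f(x) = (1/2)(x1 + x2 − 2)²) and the values r = 0.04, c = 1, ρ = 1/2 are hypothetical choices, not taken from the paper.

```python
def regularized_hessian(hess, r, c, rho):
    # H_k = hess + mu_k * I with mu_k = c * r**rho; since mu_k > 0
    # whenever r > 0 (i.e., the iterate is not optimal), H_k is
    # positive definite even if hess is merely positive semidefinite
    mu = c * r ** rho
    n = len(hess)
    return [[hess[i][j] + (mu if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# rank-one (degenerate) Hessian of f(x) = 0.5*(x1 + x2 - 2)^2
H = regularized_hessian([[1.0, 1.0], [1.0, 1.0]], r=0.04, c=1.0, rho=0.5)
print(H)  # mu_k = 0.04**0.5 = 0.2, so the diagonal becomes 1.2
```

The eigenvalues move from {2, 0} to {2 + µ_k, µ_k}, so the inner problem (8) becomes strongly convex with modulus µ_k; as x^k approaches the optimal set, r(x^k) → 0 and the regularization vanishes at a rate governed by ρ.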
We next find an approximate minimizer x̂ of the quadratic model q_k(x) such that the following inexactness conditions hold:

    r_k(x̂) ≤ η · min{r(x^k), r(x^k)^{1+ρ}}  and  q_k(x̂) − q_k(x^k) ≤ ζ(ℓ_k(x̂) − ℓ_k(x^k)),    (11)

where r_k(·) is the residual function for the inner minimization defined in Section 2.1.1 and ℓ_k is the first-order approximation of F at x^k:

    ℓ_k(x) := f(x^k) + ∇f(x^k)^T (x − x^k) + g(x).

Since the exact minimizer of q_k has no closed form in most cases, an iterative algorithm, such as coordinate minimization, the coordinate gradient descent method, or the accelerated proximal gradient method, is typically called to find an approximate minimizer x̂ satisfying the conditions in (11). After performing a backtracking line search along the direction d^k := x̂ − x^k, we obtain a step size α_k > 0 that guarantees a sufficient decrease of the objective value. The algorithm then proceeds to the next iterate by setting x^{k+1} := x^k + α_k d^k. Finally, we terminate the algorithm when r(x^k) falls below the prescribed tolerance ε0. We summarize the details of the IRPN method in Algorithm 1.

Algorithm 1 Inexact Regularized Proximal Newton (IRPN) Method
1: Input: Starting point x^0 and constants ε0 > 0, θ ∈ (0, 1/2), ζ ∈ (θ, 1/2), η ∈ (0, 1), c > 0, ρ ∈ [0, 1], and β ∈ (0, 1).
2: for k = 0, 1, 2, … do
3:   Compute the value of the residual r(x^k).
4:   If r(x^k) ≤ ε0, terminate the algorithm and return x^k.
5:   Form the quadratic model q_k(x) with H_k = ∇²f(x^k) + µ_k I and µ_k = c · r(x^k)^ρ.
6:   Call an inner solver to find an approximate minimizer x̂ of q_k such that the inexactness conditions (11) are satisfied:
       r_k(x̂) ≤ η · min{r(x^k), r(x^k)^{1+ρ}}  and  q_k(x̂) − q_k(x^k) ≤ ζ(ℓ_k(x̂) − ℓ_k(x^k)).
7:   Let the search direction d^k := x̂ − x^k. Find the smallest non-negative integer i such that
       F(x^k) − F(x^k + β^i d^k) ≥ θ(ℓ_k(x^k) − ℓ_k(x^k + β^i d^k)).    (12)
8:   Let the step size α_k = β^i and set x^{k+1} = x^k + α_k d^k.
9: end for

The inexactness conditions (11) are similar to those proposed in [3]. However, the analysis in [3] cannot be applied to study the convergence behavior of IRPN. First, the global convergence result therein requires H_k ⪰ λI for some λ > 0 and all k. Second, the local convergence rate therein requires ∇²f to be positive definite at the limit point x* of the sequence {x^k}_{k≥0}. Since we do not assume that the smooth function f is strongly convex, neither of these two assumptions holds in our setting.

One advantage of the proposed IRPN lies in its flexibility. Indeed, for SQA methods using a regularized Hessian, it is typical to let the regularization parameter µ_k be of the same order as the residual r(x^k) [12, 21], i.e., µ_k = c · r(x^k). In contrast, we let the order of µ_k be adjustable through the parameter ρ. Therefore, IRPN is in fact a tunable family of algorithms parametrized by ρ ∈ [0, 1]. As we will see in Sections 4 and 5, the parameter ρ plays a dominant role in the asymptotic convergence rate of IRPN. Furthermore, IRPN allows one to choose the inner solver. Since each inner problem is a well-structured minimization problem whose objective is the sum of a strongly convex quadratic function and the non-smooth convex function g, diligent users can speed up the inner minimization by exploiting the structure of ∇²f and g to design special inner solvers.

The first issue that concerns us is the well-definedness of IRPN. To establish it, we need to argue that the inexactness conditions (11) are always achievable and that the line search procedure terminates within finitely many steps. We first present the following lemma, which ensures that at each iteration, the inexactness conditions (11) are satisfied as long as x̂ is close enough to the exact solution of the inner minimization. The proof is essentially identical to that of Lemma 4.5 in [3], but we include it for completeness.
Lemma 3 The inexactness conditions (11) are satisfied for any vectors in Rn that are sufficiently close to the exact minimizer x̄ of qk (x). Proof. Suppose we are at iteration k with the iterates xk . In addition, suppose xk is not optimal for problem (1) because otherwise the algorithm has terminated. The first condition in (11) is readily verified by noting that rk (x̄) = 0 and rk (·) is continuous. It remains to show that the second condition in (11) will eventually be satisfied. Since qk (x) = `k (x) + 21 (x − xk )T Hk (x − xk ) and x̄ is the minimizer of qk , we have Hk (xk − x̄) ∈ ∂`k (x̄). This leads to the property that xk 6= x̄, because otherwise we have 0 ∈ ∂`k (xk ), which implies that xk is an optimal solution of (1). Moreover, by the convexity of `k , we have `k (xk ) ≥ `k (x̄) + (x̄ − xk )T Hk (x̄ − xk ). (13) Since Hk = ∇2 f (xk ) + µk I for some µk > 0, Hk is positive definite. This, together with the fact that xk 6= x̄, implies that `k (xk ) > `k (x̄). Furthermore, we have 1 qk (x̄) − qk (xk ) = `k (x̄) + (x̄ − xk )T Hk (x̄ − xk ) − `k (xk ) 2 1 ≤ (`k (x̄) − `k (xk )) 2 < ζ(`k (x̄) − `k (xk )) where the first inequality is due to (13) and the last one is due to the facts ζ ∈ (θ, 1/2) and `k (xk ) > `k (x̄). Therefore, due to the continuity of qk and `k , the second condition in (11) is satisfied as long as x is close enough to x̄. u t We next show that the backtracking line search is well defined. Quadratically Convergent SQA... 13 Lemma 4 Suppose that Assumption 1(a) is satisfied. Then, for any iteration k of Algorithm 1, there exists a positive integer i such that the descent condition in (12) is satisfied. Moreover, the step size αk obtained from the line search strategy satisfies β(1 − θ)µk . αk ≥ (1 − ζ)L1 Proof. By definition, we have 1 qk (x̂) − qk (xk ) = `k (x̂) − `k (xk ) + (x̂ − xk )T Hk (x̂ − xk ). 
Since the approximate minimizer x̂ satisfies the inexactness conditions (11), the above implies

ℓk(xk) − ℓk(x̂) ≥ (1/(2(1 − ζ))) (x̂ − xk)T Hk (x̂ − xk) ≥ (µk/(2(1 − ζ))) ‖x̂ − xk‖².    (14)

By the convexity of ℓk(x), for any constant α ∈ [0, 1], we have

ℓk(xk) − ℓk(xk + αdk) ≥ α(ℓk(xk) − ℓk(xk + dk)).

Combining the above two inequalities and noting that x̂ = xk + dk, we obtain

ℓk(xk) − ℓk(xk + αdk) ≥ (αµk/(2(1 − ζ))) ‖dk‖².    (15)

Moreover, since ∇f is Lipschitz continuous by Assumption 1(a), it follows that

f(xk + αdk) − f(xk) ≤ α∇f(xk)T dk + (α²L1/2) ‖dk‖²,    (16)

which, by the definition of ℓk, leads to

F(xk) − F(xk + αdk) ≥ ℓk(xk) − ℓk(xk + αdk) − (α²L1/2) ‖dk‖².

It then follows from the above inequality and (15) that

F(xk) − F(xk + αdk) − θ(ℓk(xk) − ℓk(xk + αdk))
  ≥ (1 − θ)(ℓk(xk) − ℓk(xk + αdk)) − (α²L1/2) ‖dk‖²
  ≥ ((1 − θ)αµk/(2(1 − ζ))) ‖dk‖² − (α²L1/2) ‖dk‖²
  = (α/2) (((1 − θ)/(1 − ζ)) µk − αL1) ‖dk‖².

Hence, as long as α satisfies α < (1 − θ)µk/((1 − ζ)L1), the descent condition (12) is satisfied. Since the backtracking line search multiplies the step size by β ∈ (0, 1) after each trial, the line search strategy outputs an αk satisfying αk ≥ β(1 − θ)µk/((1 − ζ)L1). ⊓⊔

Combining Lemma 3 and Lemma 4, we conclude that the method IRPN is well-defined.

4 Convergence Analysis

Throughout this section, we denote by {xk}k≥0 the sequence of iterates generated by Algorithm 1. We first present a theorem establishing the global convergence of Algorithm 1.

Theorem 2 (Global Convergence). Suppose that Assumption 1(a) holds. Then we have

lim_{k→∞} r(xk) = 0.    (17)

In other words, every accumulation point of the sequence {xk}k≥0 is an optimal solution of (1).

Proof. From inequality (15), we obtain

ℓk(xk) − ℓk(xk + αdk) ≥ (αµk/(2(1 − ζ))) ‖dk‖² ≥ 0.    (18)

Hence, by the descent condition (12) and the assumption that inf F > −∞, we have

lim_{k→∞} (ℓk(xk) − ℓk(xk + αk dk)) = 0.
Again using (15), the above implies lim_{k→∞} αk µk ‖dk‖² = 0. Moreover, recall that the inexactness condition (11) implies rk(x̂) ≤ η · r(xk). It then follows that

(1 − η) r(xk) ≤ r(xk) − rk(x̂) = rk(xk) − rk(x̂) ≤ (λmax(Hk) + 2) ‖dk‖ ≤ (L1 + µk + 2) ‖dk‖,

where the first equality follows from rk(xk) = r(xk), the second inequality is by Corollary 2, and the last inequality is due to Lemma 1. Then, combining Lemma 4, lim_{k→∞} αk µk ‖dk‖² = 0 and (1 − η) r(xk) ≤ (L1 + µk + 2) ‖dk‖, we have

lim_{k→∞} µk² r(xk)² / (µk + L1 + 2)² = 0.

This, together with the fact that µk = c · r(xk)^ρ, implies that lim_{k→∞} r(xk) = 0. ⊓⊔

Having established the global convergence of the IRPN method, we now study its asymptotic rate of convergence. We show that, remarkably, the IRPN method can attain superlinear and even quadratic rates of convergence without strong convexity (see Theorem 3). Since we do not assume f to be strongly convex, techniques based on the positive definiteness of the Hessian of f at an optimal solution x* or on the uniqueness of the optimal solution break down. Key to the proof of Theorem 3 is the EB condition (10). Our techniques are different from those in [11] and [3], and we believe that they will find further applications in the convergence rate analysis of other second-order methods in the absence of strong convexity.

Theorem 3 (Local Convergence Rate). Suppose that Assumptions 1 and 2 hold. Then, for sufficiently large k, we have

(i) dist(xk+1, X) ≤ γ dist(xk, X) for some γ ∈ (0, 1), if we take ρ = 0, c ≤ κ/4, and η ≤ 1/(2(L1 + 3)²);
(ii) dist(xk+1, X) ≤ O(dist(xk, X)^{1+ρ}), if we take ρ ∈ (0, 1);
(iii) dist(xk+1, X) ≤ O(dist(xk, X)²), if we take ρ = 1, c ≥ κL2, and η ≤ κ²L2/(2(L1 + 1)).

Using Theorem 3, we can further establish that the sequence of iterates of IRPN converges to a point in the set of optimal solutions.

Corollary 3 (Sequence Convergence). Suppose that Assumptions 1 and 2 hold.
Then, in each of the three cases of Theorem 3, the sequence {xk}k≥0 converges to some x* ∈ X. Moreover,

(i) {xk}k≥0 converges R-linearly to x*, if ρ = 0, c ≤ κ/4, and η ≤ 1/(2(L1 + 2)²);
(ii) {xk}k≥0 converges Q-superlinearly to x* with order 1 + ρ, if ρ ∈ (0, 1);
(iii) {xk}k≥0 converges Q-quadratically to x*, if ρ = 1, c ≥ κL2, and η ≤ κ²L2/(2(L1 + 1)).

We remark that although Theorem 3 and Corollary 3 show that a larger ρ results in a faster convergence rate with respect to the iteration counter k, this does not imply that a larger ρ leads to a better overall time complexity. The reason is that a larger ρ yields a stricter inexactness condition, so each iteration takes longer. Hence, despite the faster convergence rate, the best choice of ρ ∈ [0, 1] depends on the particular problem and on the solver used for the inner minimization. Nonetheless, the above results provide a complete characterization of the convergence rate of IRPN in terms of ρ and underscore the flexibility of IRPN. Developing an empirical or analytical way of tuning ρ is a topic worth pursuing.

The rest of this section is devoted to proving Theorem 3 and Corollary 3. In Section 4.1, we prepare some technical lemmas. In Section 4.2 and Section 4.3, we prove Theorem 3 and Corollary 3, respectively.

4.1 Technical Lemmas

Since the following lemmas are all proved under the conditions of Theorem 3, we suppress the required assumptions in their statements.

Lemma 5 Let x̂ex be the exact solution of the k-th subproblem, i.e., rk(x̂ex) = 0, and let x̄k be the projection of xk onto the optimal set X. Then

‖x̂ex − x̄k‖ ≤ (1/µk) ‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖.

Proof. If x̂ex = x̄k, the inequality holds trivially. Assuming x̂ex ≠ x̄k, it follows from the definitions that 0 ∈ ∇f(x̄k) + ∂g(x̄k) and 0 ∈ ∇f(xk) + Hk(x̂ex − xk) + ∂g(x̂ex). Since g is convex, its subdifferential is monotone.
Hence we have

0 ≤ ⟨∇f(xk) − ∇f(x̄k) + Hk(x̂ex − xk), x̄k − x̂ex⟩
  = ⟨∇f(xk) − ∇f(x̄k) − Hk(xk − x̄k), x̄k − x̂ex⟩ − ⟨Hk(x̄k − x̂ex), x̄k − x̂ex⟩
  ≤ ‖∇f(xk) − ∇f(x̄k) − Hk(xk − x̄k)‖ ‖x̄k − x̂ex‖ − µk ‖x̄k − x̂ex‖²,

which yields the desired inequality. ⊓⊔

We remark that the proof of Lemma 5 uses no special structure of Hk beyond positive definiteness. The lemma therefore also applies to other positive definite approximate Hessians Hk, with µk replaced by the minimum eigenvalue λmin(Hk).

Lemma 6 It holds for all k that

‖x̂ex − xk‖ ≤ (L2/(2µk)) dist(xk, X)² + 2 dist(xk, X).

Proof. By the triangle inequality and the definition of µk,

‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖
  ≤ ‖∇f(x̄k) − ∇f(xk) − ∇²f(xk)(x̄k − xk)‖ + µk ‖x̄k − xk‖
  ≤ (L2/2) ‖x̄k − xk‖² + µk ‖x̄k − xk‖,    (19)

where the second inequality follows from Assumption 1(b). Then, by Lemma 5 and inequality (19),

‖x̂ex − xk‖ ≤ ‖x̂ex − x̄k‖ + dist(xk, X)
  ≤ (1/µk) ‖∇f(x̄k) − ∇f(xk) − Hk(x̄k − xk)‖ + dist(xk, X)
  ≤ (L2/(2µk)) ‖x̄k − xk‖² + ‖x̄k − xk‖ + dist(xk, X)
  = (L2/(2µk)) dist(xk, X)² + 2 dist(xk, X). ⊓⊔

Lemma 7 It holds for all k that

‖x̂ − x̂ex‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ}.

Proof. The inequality holds trivially if x̂ = x̂ex, so we assume x̂ ≠ x̂ex. Recall that Rk(x̂) = x̂ − prox_g(x̂ − ∇fk(x̂)). Then

Rk(x̂) − ∇fk(x̂) ∈ ∂g(x̂ − Rk(x̂)),
Rk(x̂) + ∇fk(x̂ − Rk(x̂)) − ∇fk(x̂) ∈ ∂qk(x̂ − Rk(x̂)),
(I − Hk) Rk(x̂) ∈ ∂qk(x̂ − Rk(x̂)).

On the other hand, we have 0 ∈ ∂qk(x̂ex). Since qk = fk + g is µk-strongly convex, ∂qk is µk-strongly monotone and thus

⟨(I − Hk) Rk(x̂), x̂ − Rk(x̂) − x̂ex⟩ ≥ µk ‖x̂ − Rk(x̂) − x̂ex‖².
Then, using the Cauchy-Schwarz inequality,

‖x̂ − Rk(x̂) − x̂ex‖ ≤ (1/µk) ‖(I − Hk) Rk(x̂)‖
  ≤ (1/µk) ‖∇²f(xk) − (1 − µk)I‖_op · rk(x̂)
  ≤ (η(L1 + |1 − µk|)/µk) r(xk)^{1+ρ}
  ≤ (η(L1 + |1 − µk|)/c) r(xk),

where the third inequality is due to Lemma 1 and the inexactness condition on rk(x̂), and the last inequality is due to the setting of µk. Hence,

‖x̂ − x̂ex‖ ≤ ‖x̂ − Rk(x̂) − x̂ex‖ + ‖Rk(x̂)‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ}. ⊓⊔

We next present a lemma showing that eventually a unit step size is taken, i.e., αk = 1, and the direction dk is used directly without any line search.

Lemma 8 Suppose that Assumptions 1 and 2 hold. Then there exists a positive integer k0 such that αk = 1 for all k ≥ k0 (i) if ρ ∈ [0, 1), or (ii) if ρ = 1 and c, η satisfy

2ηL2(L1 + 2) + κ²L2² + 2cκL2 ≤ 6c².

In particular, the inequality in (ii) is satisfied if we take c ≥ κL2 and η ≤ κ²L2/(2(L1 + 1)).

Proof. First note that

f(x̂) − f(xk) = (x̂ − xk)T ∫₀¹ ∇f(xk + t(x̂ − xk)) dt
  = (x̂ − xk)T ∫₀¹ [∇f(xk + t(x̂ − xk)) − ∇f(xk)] dt + (x̂ − xk)T ∇f(xk)
  = (x̂ − xk)T ∫₀¹ t ∫₀¹ [∇²f(xk + st(x̂ − xk)) − ∇²f(xk)] ds dt (x̂ − xk)
    + (x̂ − xk)T ∫₀¹ t ∇²f(xk) dt (x̂ − xk) + ∇f(xk)T (x̂ − xk)
  ≤ (1/2)(x̂ − xk)T ∇²f(xk)(x̂ − xk) + ∇f(xk)T (x̂ − xk) + (L2/6) ‖x̂ − xk‖³.

Therefore,

F(x̂) − F(xk) + qk(xk) − qk(x̂)
  = f(x̂) − f(xk) − ∇f(xk)T (x̂ − xk) − (1/2)(x̂ − xk)T ∇²f(xk)(x̂ − xk) − (µk/2) ‖x̂ − xk‖²    (20)
  ≤ (L2/6) ‖x̂ − xk‖³ − (µk/2) ‖x̂ − xk‖².

Hence, using inequalities (20), (11) and (14), we have

F(x̂) − F(xk) = F(x̂) − F(xk) − qk(x̂) + qk(xk) + qk(x̂) − qk(xk)
  ≤ ζ(ℓk(x̂) − ℓk(xk)) + (L2/6) ‖dk‖³ − (µk/2) ‖dk‖²
  = θ(ℓk(x̂) − ℓk(xk)) + (ζ − θ)(ℓk(x̂) − ℓk(xk)) + (L2/6) ‖dk‖³ − (µk/2) ‖dk‖²
  ≤ θ(ℓk(x̂) − ℓk(xk)) − (1 + (ζ − θ)/(1 − ζ)) (µk/2) ‖dk‖² + (L2/6) ‖dk‖³
  ≤ θ(ℓk(x̂) − ℓk(xk))

if

‖dk‖ = ‖x̂ − xk‖ ≤ (1 + (ζ − θ)/(1 − ζ)) · 3c r(xk)^ρ / L2.    (21)
By Lemma 6 and Lemma 7, for sufficiently large k,

‖x̂ − xk‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (L2/(2c r(xk)^ρ)) dist(xk, X)² + 2 dist(xk, X)    (22)
  ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (κ²L2/(2c)) r(xk)^{2−ρ} + 2κ r(xk).

Hence a sufficient condition for inequality (21) is

(η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (κ²L2/(2c)) r(xk)^{2−ρ} + 2κ r(xk) ≤ (1 + (ζ − θ)/(1 − ζ)) · 3c r(xk)^ρ / L2.    (23)

If ρ ∈ [0, 1), then since r(xk) → 0, (23) holds for sufficiently large k. Now consider the case ρ = 1. Since ζ > θ and µk ∈ (0, 1) for sufficiently large k, a sufficient condition for (23) is

η(L1 + 1)/c + κ²L2/(2c) + 2κ ≤ 3c/L2,

which in particular holds if we take η ≤ κ²L2/(2(L1 + 1)) and c ≥ κL2. ⊓⊔

4.2 Proof of Theorem 3

By Lemma 8, in each of the three cases of Theorem 3, the step size satisfies αk = 1 for all sufficiently large k. Hence, for sufficiently large k, the approximate inner solution x̂ is exactly the next iterate xk+1. As a result, all the properties derived in Section 4.1 remain valid with x̂ replaced by xk+1 for k sufficiently large. From Proposition 1, we have

‖xk+1 − xk‖ ≤ (η(L1 + |1 − µk|)/c) r(xk) + η r(xk)^{1+ρ} + (L2/(2c r(xk)^ρ)) dist(xk, X)² + 2 dist(xk, X)
  ≤ [ η(L1 + |1 − µk|)(L1 + 2)/c + η(L1 + 2) r(xk)^ρ + (L2/(2c r(xk)^ρ)) dist(xk, X) + 2 ] dist(xk, X).    (24)

By the error bound assumption, dist(x, X) ≤ κ r(x) whenever r(x) is sufficiently small. Since r(xk) → 0 as k → ∞, we have, for sufficiently large k,

‖xk+1 − xk‖ = O(dist(xk, X)).

Using the same arguments as in Lemma 6, it can be shown that

‖prox_g(xk+1 − ∇f(xk+1)) − prox_g(xk+1 − ∇fk(xk+1))‖ ≤ ‖∇f(xk+1) − ∇f(xk) − Hk(xk+1 − xk)‖.    (25)
Finally, we have, for sufficiently large k,

dist(xk+1, X) ≤ κ r(xk+1)
  ≤ κ ‖R(xk+1) − Rk(xk+1)‖ + κ ‖Rk(xk+1)‖
  ≤ κ ‖prox_g(xk+1 − ∇f(xk+1)) − prox_g(xk+1 − ∇fk(xk+1))‖ + κη r(xk)^{1+ρ}
  ≤ κ ‖∇f(xk+1) − ∇f(xk) − Hk(xk+1 − xk)‖ + κη r(xk)^{1+ρ}
  ≤ (κL2/2) ‖xk+1 − xk‖² + cκ r(xk)^ρ ‖xk+1 − xk‖ + κη r(xk)^{1+ρ}
  ≤ (κL2/2) ‖xk+1 − xk‖² + cκ(L1 + 2)^ρ dist(xk, X)^ρ ‖xk+1 − xk‖ + κη(L1 + 2)^{1+ρ} dist(xk, X)^{1+ρ}
  ≤ O(dist(xk, X)^{1+ρ}),    (26)

where the first inequality follows from the error bound assumption, the fourth from (25), the sixth from Proposition 1, and the seventh from (24). If ρ = 0, inequality (26) does not necessarily imply linear convergence unless the constant hidden in the big-O notation is strictly less than 1. Inspecting the second-to-last line of (26) and the second line of (24), a sufficient condition is

κη(L1 + 2) + cκ [ η(L1 + |1 − c|)(L1 + 2)/c + η(L1 + 2) + 2 ] < 1,

which in particular holds if we take η ≤ 1/(2(L1 + 3)²) and c ≤ κ/4. ⊓⊔

4.3 Proof of Corollary 3

By (17) and the error bound assumption, we have lim_{k→∞} dist(xk, X) = 0. This implies that in each of the three cases of Theorem 3, there exists an integer K1 > 0 such that dist(xk+1, X) ≤ γ · dist(xk, X) for all k ≥ K1, where γ ∈ (0, 1) is the constant in case (i) of Theorem 3. In addition, by (24), there exist σ > 0 and an integer K2 > 0 such that ‖xk+1 − xk‖ ≤ σ · dist(xk, X) for all k ≥ K2. Furthermore, since lim_{k→∞} dist(xk, X) = 0, for any given ε > 0 there exists an integer K3 > 0 such that dist(xk, X) ≤ ε(1 − γ)/σ for all k ≥ K3. Now, taking K = max{K1, K2, K3}, we have, for all k1 ≥ k0 ≥ K,

‖xk1 − xk0‖ ≤ Σ_{i=k0}^{k1−1} ‖xi+1 − xi‖ ≤ σ Σ_{i=k0}^{k1−1} dist(xi, X) ≤ σ Σ_{i=0}^{k1−k0−1} γ^i · dist(xk0, X)
  ≤ σ Σ_{i=0}^{∞} γ^i · dist(xk0, X) ≤ (σ/(1 − γ)) dist(xk0, X) ≤ ε.

Hence {xk}k≥0 is a Cauchy sequence and converges to some x*. This, together with lim_{k→∞} dist(xk, X) = 0 and the fact that X is closed, implies that x* ∈ X.
It then remains to prove the convergence rates of {xk}k≥0 in cases (i), (ii) and (iii). From the above analysis, we readily have

‖xk+i − xk+1‖ ≤ (σ/(1 − γ)) dist(xk+1, X)

for all integers i ≥ 1 and k ≥ K. Taking the limit as i → ∞ yields

‖xk+1 − x*‖ ≤ (σ/(1 − γ)) dist(xk+1, X).    (27)

In case (i), Theorem 3 implies that dist(xk+1, X) converges R-linearly to 0. Hence, by (27), {xk}k≥0 converges R-linearly to x*. In case (ii), using Theorem 3 and (27), we obtain

‖xk+1 − x*‖ = O(dist(xk, X)^{1+ρ}) ≤ O(‖xk − x*‖^{1+ρ}),

which implies that {xk}k≥0 converges Q-superlinearly to x* with order 1 + ρ. The same argument gives the conclusion in case (iii). ⊓⊔

5 Numerical Experiments

We now explore the numerical performance of the proposed IRPN algorithm. Specifically, we apply it to ℓ1-regularized logistic regression, which is widely used for linear classification tasks and is also a standard benchmark for testing the efficiency of algorithms for solving (1). The optimization problem of ℓ1-regularized logistic regression takes the form

min_{x∈Rn} F(x) = (1/m) Σ_{i=1}^{m} log(1 + exp(−bi · aiᵀx)) + µ‖x‖1,    (28)

where (ai, bi) ∈ Rn × {−1, 1}, i = 1, . . . , m, are the data points. We employ the data set RCV1, which contains 20242 samples and 47236 features and is downloaded from the LIBSVM datasets repository. Since the number of features exceeds the number of data points, the objective function F(x) in (28) associated with this classification task is not strongly convex. However, by Theorem 1, the error bound condition (10) holds. Hence, both Assumption 1 and Assumption 2 hold for problem (28), and our convergence analysis of IRPN applies. We compare IRPN with two well-established methods for solving (28). The first is the newGLMNET algorithm, which is also an SQA method and is the workhorse of the well-known LIBLINEAR package [29].
The second method for comparison is the fast iterative shrinkage-thresholding algorithm with an adaptive restart scheme [17] (adaFISTA), which is a first-order method. Although both IRPN and newGLMNET are instantiations of SQA methods, there are two main differences: (i) newGLMNET regularizes the Hessian by adding a constant multiple of the identity matrix, while IRPN employs an adaptive regularization (see Step 5 of Algorithm 1); (ii) newGLMNET uses heuristic stopping schemes for the inner minimization, while IRPN employs the adaptive inexactness conditions (11), which lead to provable convergence guarantees. To highlight these two differences, we deactivate other heuristic tricks applicable to general SQA methods for acceleration, such as the shrinking strategy. Moreover, for both methods, a randomized coordinate descent algorithm is called to solve the inner minimization.

                       adaFISTA  newGLMNET  IRPN ρ=0  IRPN ρ=0.5  IRPN ρ=1
Tol. 10^-4
  Outer iterations          264         21        14           9         8
  Inner iterations            –         52        28          51        54
  Training time (sec)      4.19       4.10      3.56        6.10      4.53
Tol. 10^-6
  Outer iterations          450         28        20          10         9
  Inner iterations            –         85        70          92       123
  Training time (sec)      7.14       7.65      6.02        7.90     11.34
Tol. 10^-8
  Outer iterations         1978         38        23          11         9
  Inner iterations            –        116        90         151       146
  Training time (sec)     31.52      10.56      9.32       13.44     12.10

Table 1: Comparison of the number of iterations and the time consumed. Algorithms are terminated when the residual r(x) falls below the tolerances 10^-4, 10^-6 and 10^-8.

We see from Table 1 that IRPN with ρ = 0 performs best among the compared algorithms, in the sense that it requires the least training time to reach each of the given tolerances. Note that when ρ = 0, IRPN differs from newGLMNET only in the stopping conditions for the inner minimization. Hence, Table 1 suggests that the adaptive inexactness conditions (11) are not only theoretically sound but also practically better than the heuristic stopping scheme employed by newGLMNET.
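For concreteness, the residual r(x) = ‖x − prox_g(x − ∇f(x))‖ used as the stopping criterion in Table 1 can be computed for problem (28) as in the following sketch (illustrative Python on synthetic data; the variable names and the regularization weight are assumptions of this sketch, not the RCV1 experimental setup):

```python
import numpy as np

# Sketch of the termination criterion used in Table 1: the residual
# r(x) = ||x - prox_g(x - grad f(x))|| for the l1-regularized logistic
# regression problem (28), on small synthetic placeholder data.

rng = np.random.default_rng(1)
m, n = 50, 10
A = rng.standard_normal((m, n))        # rows are the a_i^T
b = rng.choice([-1.0, 1.0], size=m)    # labels b_i
mu = 0.01                              # placeholder regularization weight

def grad_f(x):
    # gradient of (1/m) * sum_i log(1 + exp(-b_i * a_i^T x))
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))   # equals sigmoid(-b_i * a_i^T x)
    return A.T @ (-b * s) / m

def prox_g(x):
    # proximal map of mu*||.||_1 with unit step: soft-thresholding at mu
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def residual(x):
    return np.linalg.norm(x - prox_g(x - grad_f(x)))

print("residual at x = 0:", residual(np.zeros(n)))
```

The soft-thresholding step is what makes the residual exactly zero at any solution of (28): components of the gradient smaller than µ in magnitude are absorbed by the subdifferential of µ‖·‖1.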
We next study the convergence rate of the sequence {xk}k≥0 generated by the IRPN method. Thanks to the error bound condition (10), the speed at which xk converges to the optimal set X is asymptotically of the order of r(xk). Hence, it suffices to investigate the rate at which r(xk) converges to 0. It can be observed from Figure 1(a) that r(xk) converges linearly if ρ = 0 and superlinearly if ρ = 0.5 or ρ = 1. This corroborates our results in Theorem 3. Nonetheless, a faster convergence rate of r(xk) does not tell the whole story, since a larger ρ leads to stricter inexactness conditions (11) and hence more iterations for each inner minimization. For the overall performance, we refer to Figure 1(b), which plots the decrease of log r(xk) against training time. The figure shows that the training times for the three values of ρ are almost the same.

[Figure 1: Convergence performance of the IRPN algorithm. (a) log r(x) versus the number of iterations for ρ = 0, 0.5, 1; (b) log r(x) versus training time (sec) for ρ = 0, 0.5, 1.]

6 Conclusions

In this paper, we propose a family of SQA methods, the inexact regularized proximal Newton (IRPN) method, for minimizing a sum of smooth and non-smooth convex functions. Compared with prior works on SQA, IRPN enjoys a much more satisfying theory: it requires neither an exact inner solver nor strong convexity of the smooth part. We also prove that the IRPN method converges globally to the set of optimal solutions and that the asymptotic rate of convergence is linear, superlinear or even quadratic, depending on the choice of ρ. The key to our analysis is the Luo-Tseng error bound condition.
Although this error bound has played a fundamental role in establishing the linear convergence of various first-order methods, to the best of our knowledge, this is the first time that such a condition is utilized to prove superlinear convergence of SQA methods for this class of minimization problems. We compare our proposed IRPN with two popular and efficient algorithms, the newGLMNET [29] and adaptively restarted FISTA [17], by applying them to the ℓ1-regularized logistic regression problem. Experiment results indicate that IRPN achieves the same accuracies in shorter training time and with fewer iterations than the other two algorithms. This shows that the IRPN method not only has a promising theory but also delivers superior numerical performance.

References

1. A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
2. S. Becker and J. Fadili. A Quasi-Newton Proximal Splitting Method. In Advances in Neural Information Processing Systems, pages 2618–2626, 2012.
3. R. H. Byrd, J. Nocedal, and F. Oztoprak. An Inexact Successive Quadratic Approximation Method for L-1 Regularized Optimization. Accepted for publication in Mathematical Programming, Series B, 2015.
4. P. L. Combettes and V. R. Wajs. Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
5. A. L. Dontchev and R. T. Rockafellar. Implicit Functions and Solution Mappings. Springer Monographs in Mathematics. Springer Science+Business Media, LLC, New York, 2009.
6. F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.
7. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
8. A. Fischer.
Local Behavior of an Iterative Framework for Generalized Equations with Nonisolated Solutions. Mathematical Programming, 94(1):91–124, 2002.
9. J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, et al. Pathwise Coordinate Optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
10. C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. In Advances in Neural Information Processing Systems, pages 2330–2338, 2011.
11. J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-Type Methods for Minimizing Composite Functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.
12. D.-H. Li, M. Fukushima, L. Qi, and N. Yamashita. Regularized Newton Methods for Convex Minimization Problems with Singular Solutions. Computational Optimization and Applications, 28(2):131–147, 2004.
13. Z.-Q. Luo and P. Tseng. Error Bound and Convergence Analysis of Matrix Splitting Algorithms for the Affine Variational Inequality Problem. SIAM Journal on Optimization, 2(1):43–54, 1992.
14. Z.-Q. Luo and P. Tseng. On the Linear Convergence of Descent Methods for Convex Essentially Smooth Minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.
15. Z.-Q. Luo and P. Tseng. Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach. Annals of Operations Research, 46(1):157–178, 1993.
16. J. J. Moré. The Levenberg-Marquardt Algorithm: Implementation and Theory. In Numerical Analysis, pages 105–116. Springer, 1978.
17. B. O'Donoghue and E. Candès. Adaptive Restart for Accelerated Gradient Schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
18. F. Oztoprak, J. Nocedal, S. Rennie, and P. A. Olsen. Newton-Like Methods for Sparse Inverse Covariance Estimation. In Advances in Neural Information Processing Systems, pages 755–763, 2012.
19. J.-S. Pang. A Posteriori Error Bounds for the Linearly-Constrained Variational Inequality Problem.
Mathematics of Operations Research, 12(3):474–484, 1987.
20. J.-S. Pang. Error Bounds in Mathematical Programming. Mathematical Programming, 79(1-3):299–332, 1997.
21. H. Qi and D. Sun. A Quadratically Convergent Newton Method for Computing the Nearest Correlation Matrix. SIAM Journal on Matrix Analysis and Applications, 28(2):360–385, 2006.
22. S. Sardy, A. Antoniadis, and P. Tseng. Automatic Smoothing with Wavelets for a Wide Class of Distributions. Journal of Computational and Graphical Statistics, 13(2):399–421, 2004.
23. M. W. Schmidt, E. Berg, M. P. Friedlander, and K. P. Murphy. Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm. In International Conference on Artificial Intelligence and Statistics, 2009.
24. P. Tseng. Error Bounds and Superlinear Convergence Analysis of Some Newton-Type Methods in Optimization. In Nonlinear Optimization and Related Topics, pages 445–462. Springer, 2000.
25. P. Tseng. Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization. Mathematical Programming, Series B, 125(2):263–295, 2010.
26. P. Tseng and S. Yun. A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization. Mathematical Programming, Series B, 117(1–2):387–423, 2009.
27. N. Yamashita and M. Fukushima. On the Rate of Convergence of the Levenberg-Marquardt Method. In Topics in Numerical Analysis, pages 239–249. Springer, 2001.
28. I. E.-H. Yen, C.-J. Hsieh, P. K. Ravikumar, and I. S. Dhillon. Constant Nullspace Strong Convexity and Fast Convergence of Proximal Methods under High-Dimensional Settings. In Advances in Neural Information Processing Systems, pages 1008–1016, 2014.
29. G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An Improved GLMNET for L1-Regularized Logistic Regression. The Journal of Machine Learning Research, 13(1):1999–2030, 2012.
30. K. Zhong, I. E.-H. Yen, I. S. Dhillon, and P. K. Ravikumar.
Proximal Quasi-Newton for Computationally Intensive ℓ1-Regularized M-Estimators. In Advances in Neural Information Processing Systems, pages 2375–2383, 2014.
31. Z. Zhou and A. M.-C. So. A Unified Approach to Error Bounds for Structured Convex Optimization Problems. arXiv preprint arXiv:1512.03518, 2015.
32. Z. Zhou, Q. Zhang, and A. M.-C. So. ℓ1,p-Norm Regularization: Error Bounds and Convergence Rate Analysis of First-Order Methods. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 1501–1510, 2015.