Error Bounds for Structured
Convex Programming:
Theory and Applications
ZHOU, Zirui
A Thesis Submitted in Partial Fulfilment
of the Requirements for the Degree of
Doctor of Philosophy
in
Systems Engineering and Engineering Management
The Chinese University of Hong Kong
September 2015
Thesis Assessment Committee
Professor LI Duan (Chair)
Professor Anthony Man-Cho So (Thesis Supervisor)
Professor CAI Xiaoqiang (Committee Member)
Professor SUN Defeng (External Examiner)
Abstract of thesis entitled:
Error Bounds for Structured Convex Programming:
Theory and Applications
Submitted by ZHOU, Zirui
for the degree of Doctor of Philosophy
at The Chinese University of Hong Kong in September 2015
To cope with the rapidly growing size of datasets, recent research on numerical algorithms for convex optimization has focused chiefly on first-order methods, such as the proximal gradient method and its accelerated variant, and the coordinate gradient descent method. In the convergence analysis of these iterative methods, the so-called local error bound (which can be viewed as a relaxation of strong convexity) has been shown to be the key to establishing their linear rate of convergence, especially in the treatment of degenerate problem instances.
Motivated by this, we explore the validity of such an error bound for a class of structured convex programs, which emerges in a wide variety of application domains. We show that under our setting, the local error bound is equivalent to the upper Lipschitz continuity of a certain set-valued mapping. Based on this fact, it is proved that the local error bound holds if both of the following conditions are satisfied: (a) the subdifferential of the nonsmooth function is metrically subregular at the optimal solution set; (b) two specifically defined convex sets are linearly regular. Therefore, by verifying conditions (a) and (b), we develop a new analysis approach towards local error bounds.
We then apply our technical developments to study the local error bound for two problem instances: ℓ1,p-regularization and nuclear norm regularization. For ℓ1,p-regularization, we show that the local error bound holds when p ∈ [1, 2] and when p = ∞; by contrast, it fails in general when p ∈ (2, ∞). For nuclear norm regularization, we prove that the local error bound holds if a constraint qualification is satisfied. An explicit example is constructed to demonstrate that this constraint qualification condition is reasonable. Numerical experiments are also presented to corroborate our theoretical results.
Thesis title: Error Bounds for Structured Convex Programming: Theory and Applications
Student: ZHOU, Zirui (周子銳)
Degree: Doctor of Philosophy
University: The Chinese University of Hong Kong
Date: September 2015

To cope with ever-growing datasets, recent research on numerical optimization algorithms has largely concentrated on first-order methods, such as the proximal gradient method and its accelerated variant, as well as coordinate descent and coordinate gradient descent methods. In the convergence-rate analysis of these algorithms, the property known as the error bound plays a key role in establishing linear rates of convergence, and its importance is particularly evident when dealing with degenerate problems.

Motivated by this, our work studies whether the error bound holds for a class of structured optimization problems. We find that, for this class of problems, the error bound is equivalent to the upper Lipschitz continuity of a certain set-valued mapping. Using this equivalence, we prove that the error bound holds whenever the following two conditions are satisfied simultaneously: (1) the subdifferential of the nonsmooth function is metrically subregular; (2) two specially defined convex sets are linearly regular.

We then apply our theoretical analysis to two common optimization problems: ℓ1,p-norm regularization and nuclear norm regularization. For ℓ1,p-norm regularization, we find that the error bound holds when p ∈ [1, 2] and when p = ∞, whereas it does not necessarily hold when p ∈ (2, ∞). For nuclear norm regularization, we prove that the error bound holds whenever a constraint qualification condition is satisfied; we also give a concrete counterexample to show the necessity of this constraint qualification condition. Numerical experiments support our theoretical findings.
Acknowledgement
The research work contained in this thesis was carried out during the years 2011-2015 at the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, when I was pursuing my PhD degree. I had a great time during my stay at CUHK, and I would like to express my sincere thanks to those who guided and supported me all along the way.
First and foremost, I owe my deepest gratitude to my supervisor, Professor Anthony Man-Cho So. I am grateful to him for introducing me to this intriguing continent of optimization, and for professionally guiding, supporting, and encouraging me all the way. I still remember the hard days when I was facing a dilemma in one of my research projects. It was he who gave me the courage and patience to finally find a path through the darkness. I am greatly indebted to him.
My deepest thanks go to my committee members, Professor Duan Li and
Professor Xiaoqiang Cai, for sharing their time and experience with me and for giving
valuable comments on my PhD proposal, on my PhD candidacy oral presentation,
and on this dissertation.
It is fair to say that without the endless support and encouragement of many friends, this thesis would not have been possible. I would like to express my heartfelt
gratitude to them all. Special thanks go to Kuncheng Wang, Ke Hou, Senshan
Ji and all the other team members of our research group.
I would like to express my deepest gratitude to my parents, for their unconditional love and support all through my life, and for instilling in me a passion for learning. Last but not least, I am also greatly indebted to my wife, Xuanjin Cheng, who has always supported me with her deep love and inspired me with her pure heart.
This work is dedicated to my family.
Contents

Abstract
Abstract (in Chinese)
Acknowledgement

1 Introduction
  1.1 Error Bounds: An Overview
  1.2 A Local Error Bound for Structured Convex Programming
  1.3 Literature Review
  1.4 Contributions
  1.5 Notations and Thesis Organization

2 Preliminaries
  2.1 Elements of Convex Analysis
    2.1.1 Convex Sets
    2.1.2 Convex Functions
    2.1.3 Norm Functions
    2.1.4 Proximity Operator
  2.2 Set-Valued Analysis: A Review
    2.2.1 Set-Valued Mappings
    2.2.2 Upper Semicontinuity and Closedness
    2.2.3 Upper Lipschitz Continuity, Calmness and Metric Subregularity
    2.2.4 Linear Regularity of Convex Sets

3 A New Analysis Framework
  3.1 Assumptions
  3.2 Properties on Optimality and the Residual Map
  3.3 A Sufficient and Necessary Condition for Local Error Bounds
  3.4 Bounded Linear Regularity and Metric Subregularity

4 Error Bounds for ℓ1,p-Regularization
  4.1 Introduction
  4.2 Subdifferential of ℓp-Norm
    4.2.1 ℓp-Norm with p ∈ (1, ∞)
    4.2.2 ℓp-Norm with p = 1 and p = ∞
  4.3 Main Results
  4.4 Conclusions

5 Nuclear Norm Regularization
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Eigenvalues and Eigenvectors
    5.2.2 Subdifferential of Nuclear Norm on S^n
  5.3 Main Results
    5.3.1 Bounded Linear Regularity
    5.3.2 Metric Subregularity of ∂P
    5.3.3 Error Bounds for Nuclear Norm Regularization
  5.4 Conclusions

6 Convergence Analysis of PGM
  6.1 Motivations
  6.2 Error Bound Based Convergence Analysis
  6.3 What If Error Bound Fails?
    6.3.1 ℓ1,p-Regularization
    6.3.2 Nuclear Norm Regularization
  6.4 Conclusions

Bibliography

List of Figures
  6.1 The PG method for solving problem (4.17) with p ∈ [1, 2] and p = ∞.
  6.2 The PG method for solving problem (4.17) with p ∈ (2, ∞).
  6.3 The PG method for solving problem (5.28).
Chapter 1
Introduction
1.1 Error Bounds: An Overview
Given a subset S of Rn , an error bound is an inequality that gives an upper
bound on the distance from any vector x belonging to a test set U ⊆ Rn to the
set S in terms of some residual function r(x), which is a function from Rn to
R+ and satisfies r(x) = 0 if and only if x ∈ S. Mathematically, an error bound
requires constants κ, γ > 0 such that

    dist(x, S) ≤ κ r(x)^γ,   ∀x ∈ U.                                    (1.1)
The error bound is said to be global if U = Rn and is said to be local otherwise.
In addition, it is called an error bound of Lipschitzian type if γ = 1 and of
Hölderian type if γ ≠ 1.
Error bounds have rich and diverse applications in a number of interesting
research domains. In the community of mathematical programming, the study
of error bounds was initially motivated by the practical implementation of iterative methods for solving optimization problems. For most of these methods, an
inevitable issue is to determine when to terminate the iterative procedure, which
usually takes the form r(x^k) ≤ ε for some algorithm-related residual function r(x). Hence, error bounds of the form

    dist(x, X) ≤ κ r(x)^γ   for all x satisfying r(x) ≤ ε,

where X is the optimal solution set, are utilized to justify such stopping rules, as they indicate that if x^k satisfies r(x^k) ≤ ε, then it is indeed close to the optimal solution set X. Furthermore, error bounds have been widely recognized as key to
the convergence analysis of various numerical methods for solving optimization
problems, especially in treating degenerate problem instances. This includes,
among others, gradient-based methods [37, 20, 70], Newton-type methods [2, 71,
77], as well as interior point methods [73, 47, 64]. For a comprehensive survey
of the applications of error bounds in mathematical programming, we refer the
readers to [50].
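To make the abstract inequality (1.1) concrete, the following minimal Python sketch illustrates a Lipschitzian error bound in the simplest affine setting S = {x : Ax = b} with residual r(x) = ‖Ax − b‖, for which dist(x, S) ≤ r(x)/σ_min(A) holds globally when A has full row rank. The matrix A, the vector b, and the constant chosen below are arbitrary illustrative data, not taken from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))          # full row rank with probability 1
b = rng.standard_normal(m)
kappa = 1.0 / np.linalg.svd(A, compute_uv=False)[-1]   # 1 / sigma_min(A)

for _ in range(5):
    x = rng.standard_normal(n)
    # distance to S: the least-norm correction is the pseudoinverse step
    dist = np.linalg.norm(np.linalg.pinv(A) @ (A @ x - b))
    r = np.linalg.norm(A @ x - b)
    assert dist <= kappa * r + 1e-10
    print(f"dist(x, S) = {dist:.4f},  kappa * r(x) = {kappa * r:.4f}")
```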
1.2 A Local Error Bound for Structured Convex Programming
Optimization problems that emerge in application domains such as image processing, machine learning, compressed sensing, and sensor network localization often take the form

    min_{x∈E} F(x) := f(x) + P(x),                                      (1.2)
where E is a finite-dimensional real linear space endowed with the norm k · k,
both of the functions f and P are lower-semicontinuous as well as convex, and f
is additionally continuously differentiable. For instance, in the convex relaxation
of the compressed sensing problem, one aims at solving the following ℓ1-norm regularized minimization problem [14, 60]:

    min_{x ∈ R^n}  (1/2)‖Ax − b‖₂² + τ‖x‖₁,                             (1.3)

which corresponds to problem (1.2) with E = R^n, f(x) = (1/2)‖Ax − b‖₂², and P(x) = τ‖x‖₁. Another notable instance of (1.2) arises in the field of matrix rank minimization, in which a popular approach entails solving the following optimization problem [12, 13]:

    min_{X ∈ R^{n×p}}  (1/2)‖A(X) − b‖₂² + τ‖X‖∗,                       (1.4)

where A is a linear operator and ‖X‖∗ denotes the nuclear norm of X, namely, the sum of all the singular values of X. This corresponds to problem (1.2) with E = R^{n×p}, f(X) = (1/2)‖A(X) − b‖₂², and P(X) = τ‖X‖∗. Besides the function P
of (1.2) being a regularizer that is commonly used in modern computation, such as the ℓ1-norm or the nuclear norm, it can also be the indicator function of a closed convex set. Hence, the general constrained convex minimization problem

    min  f(x)
    s.t. x ∈ C,                                                         (1.5)

where C is a closed convex set, is equivalent to (1.2) upon letting P be the indicator function of C, i.e.,

    P(x) = 0 if x ∈ C,  and  P(x) = +∞ otherwise.
In this thesis, we consider a specific local error bound of Lipschitzian type
for optimization problem (1.2) with the residual function defined as below. For
any lower-semicontinuous convex function h : E → R, let proxh : E → E be the
Moreau–Yosida proximity operator, which is given by

    prox_h(x) := argmin_{z∈E} { h(z) + (1/2)‖x − z‖² }.
From the first-order optimality condition for (1.2) and the definition of the proximity operator, it is immediate that a solution x ∈ E is optimal for (1.2) if and only if it satisfies the following fixed-point equation:

    x = prox_P(x − ∇f(x)).

This motivates the residual map R : E → E of (1.2), defined by

    R(x) := x − prox_P(x − ∇f(x)),

and the residual function r : E → R₊, defined to be the norm of the residual map, namely,

    r(x) := ‖R(x)‖ = ‖x − prox_P(x − ∇f(x))‖.                           (1.6)
Formally, the local error bound of our interest is stated as follows. Let v ∗ and X
be the optimal value and optimal solution set of (1.2), respectively.
Local Error Bound: For any ζ ≥ v*, there exist κ > 0 and ε > 0 such that

    dist(x, X) ≤ κ‖R(x)‖   whenever F(x) ≤ ζ and ‖R(x)‖ ≤ ε.            (1.7)
The error bound (1.7) is also called a projection-type error bound [39, 70], as it was originally considered in the case where P is the indicator function of a closed convex set C, in which case the residual function (1.6) can be verified to reduce to

    r(x) = ‖x − [x − ∇f(x)]⁺_C‖,

where [·]⁺_C denotes the projection operator onto C.
The local error bound defined above has been shown to be the key to establishing (asymptotic) linear rates of convergence for a number of first-order methods for solving problem (1.2), including the Goldstein–Levitin–Polyak gradient projection method, the Martinet–Rockafellar proximal point method, the coordinate descent method, the matrix splitting method, and the extragradient method; see [39] for a unified treatment of all these methods. Such results have recently been extended to the proximal gradient method [72], the coordinate gradient descent method [74], and the alternating direction method of multipliers [25].
Prior to the error-bound-based convergence analysis, theoretical results on the linear convergence of the above-mentioned methods typically required the objective function of (1.2) to be strongly convex, which, however, fails in most application instances. For example, consider the ℓ1-regularized linear regression problem (1.3): unless the columns of the input matrix A are linearly independent (which is impossible in the high-dimensional setting, i.e., A ∈ R^{m×n} with n ≫ m), the objective function of (1.3) is not strongly convex. By assuming the local error bound (1.7), the convergence analysis no longer requires strong convexity and is thus promising for explaining the linear rate of convergence observed for a large class of problems arising in real applications. In view of this, the following question arises naturally:

Question: Under what conditions on the functions f and P does the optimization problem (1.2) satisfy the local error bound (1.7)?
1.3 Literature Review
Research towards answering the above question began in the late 1980s. Pang proved in [49] that the local error bound (1.7) holds in the following scenario:

(S1). f is strongly convex and ∇f is Lipschitz continuous, and P is any lower-semicontinuous convex function.
Without strong convexity of f, Luo and Tseng made some pioneering contributions in a series of works [37, 38, 40]. They considered the case where the nonsmooth function P is the indicator function of a closed convex set (which corresponds to the constrained convex program (1.5)). In particular, they proved that the local error bound holds in any of the following scenarios:
(S2). f is quadratic (not necessarily convex), P is the indicator function of a
polyhedral set.
(S3). E = R^n, f(x) = h(Ax) + ⟨c, x⟩ for all x ∈ R^n, where A ∈ R^{m×n}, c ∈ R^n, and h is strongly convex and ∇h is Lipschitz continuous on any compact subset of R^m. P is the indicator function of a polyhedral set.

(S4). E = R^n, f(x) = max_{y∈Y} {⟨y, Ax⟩ − h(y)} + ⟨c, x⟩ for all x ∈ R^n, where A ∈ R^{m×n}, c ∈ R^n, and Y is a polyhedral set in R^m. In addition, h is strongly convex and ∇h is Lipschitz continuous on any compact subset of R^m. P is the indicator function of a polyhedral set.
Specifically, (S2) holds by Theorem 2.3 of [37]; (S3) holds by Theorem 2.1 of [38];
(S4) holds by Theorem 4.1 of [40]. In addition, in [74], Tseng improved the
results in scenarios (S2)–(S4) by allowing P to have a polyhedral epigraph (clearly, this includes the indicator function of a polyhedral set as a special case). Nevertheless, for non-strongly convex f, all the above results share the limitation that the nonsmooth function P is required to be polyhedral in nature.
Recently, Tseng [72] made a breakthrough in the treatment of a non-strongly convex smooth function f together with a non-polyhedral nonsmooth function P. He proved (see Theorem 2 of [72]) that the local error bound also holds in the following scenario:

(S5). E = R^n, f(x) = h(Ax) for all x ∈ R^n, where A ∈ R^{m×n}, and h is strongly convex and ∇h is Lipschitz continuous on any compact subset of R^m. P is the group-lasso regularizer, i.e., P(x) = Σ_{J∈𝒥} ω_J ‖x_J‖₂, where 𝒥 is a non-overlapping partition of the index set {1, . . . , n}.
In view of all the existing results on the local error bound (1.7), namely, scenarios (S1) to (S5), several points are worth noting. Firstly, the result in (S1) implies that assuming the local error bound is weaker than assuming strong convexity in the convergence analysis. Secondly, in the non-strongly convex case, although additional structure on f is required, the admissible structures cover a number of smooth convex functions arising in applications. Indeed, both the least-squares loss in linear regression and the logistic loss in logistic regression are special cases of the structure f(x) = h(Ax), where h is strongly convex and ∇h is Lipschitz continuous on any compact set. Finally, in the non-strongly convex case, the extension to non-polyhedral P by Tseng [72] is still limited, as it only covers one special instance, namely, the non-overlapping group-lasso regularizer. In fact, a large number of nonsmooth functions that emerge in various applications are neither of polyhedral epigraph nor of the non-overlapping group-lasso type, such as the nuclear norm regularizer and mixed-norm regularizers (see [27, 6] for a survey of commonly used regularizers in applications). Moreover, the proof of Theorem 2 in [72] exploits the special structure of the non-overlapping group-lasso regularizer, and the analysis approach is not easily generalizable to other instances.
1.4 Contributions
As stated above, the bottleneck in existing research on the local error bound (1.7) lies in the restrictions on the nonsmooth part. This gives rise to our goal in this thesis, namely, to study the local error bound for a non-strongly convex function f together with a general nonsmooth convex function P. Specifically, we focus on f being non-strongly convex but structured as

    f(x) = h(Ax),                                                       (1.8)

where h satisfies the same assumptions as in scenarios (S3) and (S5), and we explore the conditions on P under which the local error bound holds.
Towards that end, we first prove a necessary and sufficient condition for the local error bound in terms of the upper Lipschitz continuity (ULC) of a certain set-valued mapping. This equivalent characterization grasps the essence of the local error bound and provides an alternative approach to its analysis, namely, to explore the ULC property of the set-valued mapping. After that, by invoking the linear regularity of collections of convex sets, we show that the set-valued mapping has the ULC property if both of the following hold: (a) problem (1.2) satisfies a constraint qualification; (b) the subdifferential of the nonsmooth function P is metrically subregular at the optimal solution set. Therefore, given any instance of P, the local error bound can be established by verifying conditions (a) and (b).
We then apply our analysis approach to two instances of the nonsmooth function P. The first one is the ℓ1,p-norm regularizer with 1 ≤ p ≤ ∞, which is defined as

    P(x) = Σ_{J∈𝒥} ω_J ‖x_J‖_p,   ∀x ∈ R^n,

where 𝒥 is a non-overlapping partition of the index set {1, . . . , n} and ‖·‖_p is the vector p-norm, i.e., for any x ∈ R^n,

    ‖x‖_p = ( Σ_{i=1}^n |x_i|^p )^{1/p}.
The ℓ1,p-norm regularizer is a generalization of the group-lasso regularizer in (S5) (which corresponds to p = 2) and is also called the generalized group-lasso regularizer. Such regularizers have been widely used to induce group sparsity or structured sparsity in application domains such as sparse regression [21] and multiple kernel learning [69, 30]. By verifying conditions (a) and (b) for the ℓ1,p-regularizer, we show that the local error bound (1.7) holds when p ∈ [1, 2] as well as when p = ∞. In particular, this result includes scenario (S5) as the special case p = 2. Moreover, compared with the proof for scenario (S5) in [72], our approach is much cleaner and more straightforward. In contrast with the cases p ∈ [1, 2] and p = ∞, the local error bound can fail when p ∈ (2, ∞). This is due to the failure of metric subregularity of the subdifferential of P at certain points, i.e., condition (b) fails in general. Furthermore, we construct a counterexample to demonstrate the failure of the local error bound when p ∈ (2, ∞).
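As a concrete illustration of the regularizer just introduced, the following Python sketch evaluates the ℓ1,p-norm and, for the special case p = 2 (the group-lasso regularizer of (S5)), its proximity operator, which is given by the familiar blockwise soft-thresholding formula. The groups, the weights ω_J, and the test vector below are arbitrary illustrative choices.

```python
import numpy as np

def l1p_norm(x, groups, weights, p):
    """P(x) = sum over groups J of omega_J * ||x_J||_p."""
    return sum(w * np.linalg.norm(x[J], ord=p) for J, w in zip(groups, weights))

def prox_group_l2(v, groups, weights):
    """Proximity operator of sum_J omega_J * ||x_J||_2 (block soft-thresholding)."""
    z = v.copy()
    for J, w in zip(groups, weights):
        nrm = np.linalg.norm(v[J])
        z[J] = 0.0 if nrm <= w else (1.0 - w / nrm) * v[J]
    return z

groups = [np.arange(0, 3), np.arange(3, 6)]   # a non-overlapping partition of {0,...,5}
weights = [1.0, 0.5]                          # the omega_J
v = np.array([2.0, -1.0, 0.5, 0.1, 0.2, -0.1])
print("P(v) with p = 2:", l1p_norm(v, groups, weights, p=2))
print("prox output:    ", prox_group_l2(v, groups, weights))
```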
The second instance of the nonsmooth function P considered in this thesis is the nuclear norm regularizer, namely,

    P(X) = ‖X‖∗,   ∀X ∈ R^{m×n}.

We show that although the subdifferential of the nuclear norm is metrically subregular at any point in its domain (which implies that condition (b) always holds), the constraint qualification in condition (a) fails in general. In addition, we also construct an explicit counterexample to demonstrate the failure of the local error bound in this instance.
1.5 Notations and Thesis Organization
Throughout the thesis, we will use the following notation. Let E be a finite-dimensional Euclidean space endowed with the norm ‖·‖. For any subset C ⊆ E and any element x ∈ E, we denote by dist(x, C) the distance from x to the set C, namely,

    dist(x, C) := inf_{z∈C} ‖x − z‖.

Moreover, we follow the convention that if C is empty, then dist(x, C) = +∞. We denote by sgn(·) the sign function, i.e., for any a ∈ R,

    sgn(a) = 1 if a > 0,   sgn(a) = 0 if a = 0,   sgn(a) = −1 if a < 0.
The rest of this thesis is organized as follows. In Chapter 2, we provide some preliminary results and tools from convex analysis and set-valued analysis, both of which form the foundation of the subsequent analysis. In Chapter 3, based on the notion of upper Lipschitz continuity of set-valued mappings, we establish a sufficient and necessary condition for the local error bound and hence develop a new analysis approach for it. In the next two chapters, we apply the technical results developed in Chapter 3 to analyze the local error bound for ℓ1,p-regularization (Chapter 4) and nuclear norm regularization (Chapter 5). Finally, in Chapter 6, we apply the local error bound to analyze the convergence rate of the proximal gradient method.
End of chapter.
Chapter 2
Preliminaries
Summary
Some technical tools and results that will be used for subsequent
analysis are collected in this chapter. Readers who are familiar
with these fundamental materials are encouraged to skip ahead to
Chapter 3.
2.1 Elements of Convex Analysis

In this section, we review some definitions and technical results in convex analysis. Let E be a finite-dimensional Euclidean space endowed with the inner product ⟨·, ·⟩ and its induced norm ‖·‖, that is, ‖x‖ := ⟨x, x⟩^{1/2} for any x ∈ E.

2.1.1 Convex Sets
Definition 2.1.1 Let C be a subset of E. We say that
1. C is affine if αx + (1 − α)y ∈ C whenever x, y ∈ C and α ∈ R;
2. C is convex if αx + (1 − α)y ∈ C whenever x, y ∈ C and α ∈ [0, 1].
Directly implied by the definition, we have the following property on convex
(affine) sets.
Proposition 2.1.1 The intersection of an arbitrary collection of convex (affine)
sets is convex (affine).
The notion of convex hull (affine hull) of a set is defined as below.
Definition 2.1.2 The convex hull (affine hull) of a set C ⊆ E, denoted by
conv(C) (aff(C)), is the smallest convex (affine) set that contains C.
We next introduce some topological concepts of convex sets. Let B denote the
closed unit ball centred at the origin in E, i.e.,
B := {x ∈ E | ‖x‖ ≤ 1}.

Hence, the ball centred at x₀ ∈ E with radius ρ > 0 can be represented as B(x₀, ρ) := x₀ + ρB. The interior and closure of a convex set C ⊆ E are defined as

    int(C) := {x ∈ C | x + εB ⊆ C for some ε > 0};
    cl(C) := ∩_{ε>0} (C + εB).

The boundary of C is then given by bd(C) := cl(C) \ int(C). In addition, the concept of relative interior plays an important role in our analysis. Given any convex set C ⊆ E, the relative interior of C is defined as

    ri(C) := {x ∈ C | (x + εB) ∩ aff(C) ⊆ C for some ε > 0}.
2.1.2 Convex Functions
We now turn to the concepts and properties regarding convex functions.
Definition 2.1.3 Let f : E → R ∪ {+∞} be an extended-valued function. We
say that f is convex if
f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y)
for all x, y ∈ E and α ∈ [0, 1].
Definition 2.1.4 Let f : E → R ∪ {+∞} be an extended-valued function and C
be a non-empty convex subset of E. We say that f is strictly convex on C if
    f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)

for all distinct x, y ∈ C and α ∈ (0, 1). In addition, we say that f is strongly convex on C with parameter σ > 0 if

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (σ/2) α(1 − α) ‖x − y‖²

for all x, y ∈ C and α ∈ [0, 1].
Throughout the thesis, we denote the gradient of a differentiable function f by ∇f. The following lemma provides equivalent characterizations of strong convexity when the function is differentiable. We refer the reader to [48] for the proofs.
Lemma 2.1.1 Suppose f : E → R is a differentiable function with gradient ∇f, and C is a non-empty convex subset of E. Then the following statements are equivalent:

(a) f is strongly convex on C with parameter σ > 0.

(b) For any x, y ∈ C, there holds

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ σ‖x − y‖².

(c) For any x, y ∈ C, there holds

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)‖x − y‖².
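For a quick numerical sanity check of characterizations (b) and (c), the following sketch evaluates both inequalities for the quadratic f(y) = (1/2)‖y − b‖², which is strongly convex with parameter σ = 1; the vector b and the test points are arbitrary illustrative data.

```python
import numpy as np

rng = np.random.default_rng(2)
b = rng.standard_normal(4)
f = lambda y: 0.5 * np.dot(y - b, y - b)       # strongly convex with sigma = 1
grad_f = lambda y: y - b
sigma = 1.0

x, y = rng.standard_normal(4), rng.standard_normal(4)
# characterization (b): <grad f(x) - grad f(y), x - y> >= sigma * ||x - y||^2
check_b = np.dot(grad_f(x) - grad_f(y), x - y) >= sigma * np.dot(x - y, x - y) - 1e-12
# characterization (c): f(y) >= f(x) + <grad f(x), y - x> + (sigma/2)*||x - y||^2
check_c = f(y) >= f(x) + np.dot(grad_f(x), y - x) + 0.5 * sigma * np.dot(x - y, x - y) - 1e-12
print(check_b, check_c)                        # both True (with equality for this quadratic)
```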
For a convex function f, the (effective) domain of f, denoted by dom(f), is defined as

    dom(f) := {x ∈ E | f(x) < +∞}.

In addition, we say a convex function f is proper if its domain dom(f) is non-empty and f(x) > −∞ for all x ∈ E. The epigraph of a convex function f,
denoted by epi(f ), is defined as
epi(f ) := {(x, t) ∈ E × R | t ≥ f (x)} .
We say a function f is closed if epi(f ) is a closed subset of E × R. The following
result characterizes the relationship between the epigraph and convex functions,
which can be proved directly by the definitions.
Proposition 2.1.2 A function f : E → R ∪ {+∞} is convex (as a function) if
and only if epi(f ) is convex (as a subset of E × R).
For any ζ ∈ R, the ζ-level set of f , denoted by L(ζ), is a subset of E defined as
L(ζ) := {x ∈ E | f (x) ≤ ζ}.
The following result is clear from the definitions.
Proposition 2.1.3 Suppose f : E → R ∪ {+∞} is a closed proper convex function. Then for any ζ ≥ min f, the ζ-level set L(ζ) is convex as well as closed. In particular, the set of global minimizers of f, if non-empty, is convex as well as closed.
Moreover, we have the following result regarding the boundedness of level sets,
of which the proof can be found in [55, Corollary 8.7.1].
Lemma 2.1.2 Suppose f : E → R ∪ {+∞} is a closed proper convex function.
If the level set L(ζ) is non-empty and bounded for one ζ, it is bounded for every
ζ.
We then recall the concepts regarding the differentiability and subdifferentiability of convex functions.
Proposition 2.1.4 Suppose f : E → R is differentiable on an open set Ω ⊆ E and let C ⊆ Ω be convex. Then f is convex on C if and only if

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩,   ∀x, y ∈ C.
For non-differentiable functions, the notions of subgradient and subdifferential are crucial in convex analysis.

Definition 2.1.5 Let f : E → R ∪ {+∞} be a function and suppose x ∈ dom(f). A vector s ∈ E is called a subgradient of f at x if

    f(y) ≥ f(x) + ⟨s, y − x⟩,   ∀y ∈ E.
In addition, the set of all the subgradients of f at x is called the subdifferential
of f at x, denoted as ∂f (x).
Now, let

    f′(x; d) = lim_{t→0⁺} [f(x + td) − f(x)] / t

be the directional derivative of f at x ∈ E in the direction d ∈ E \ {0}. The result below shows that subgradients of a convex function always exist in the interior of its domain, and it characterizes the directional derivatives of convex functions. We omit the proof and refer the interested reader to [58, Section 2.5].

Lemma 2.1.3 Let f : E → R ∪ {+∞} be a convex function and let x ∈ int dom(f). Then, ∂f(x) is non-empty, compact, and convex. In addition, for any d ∈ E, there holds f′(x; d) = max_{s∈∂f(x)} ⟨s, d⟩.
It is natural to expect that for a convex function f, when f is differentiable at x, the gradient of f coincides with its unique subgradient. More precisely, we have
the following.
Lemma 2.1.4 Let f : E → R ∪ {+∞} be a proper convex function. Then, f
is differentiable at x ∈ dom(f ) if and only if ∂f (x) is a singleton, which is the
gradient of f at x, i.e., ∂f (x) = {∇f (x)}.
Furthermore, the subdifferential can be utilized to characterize the optimality
condition of convex minimization.
Lemma 2.1.5 Let f : E → R ∪ {+∞} be a proper convex function. Then
x ∈ dom(f ) is a global minimizer of f if and only if 0 ∈ ∂f (x).
2.1.3 Norm Functions
In what follows, we review some of the properties of a class of convex functions
that is of particular interest in this thesis, that is, the norm functions.
Definition 2.1.6 Given a finite-dimensional Euclidean space E, we say a function P : E → R is a norm function on E if for all x, y ∈ E and α ∈ R, it satisfies
the following three conditions:
(a) P (x) = 0 if and only if x = 0.
(b) P (x + y) ≤ P (x) + P (y) (triangle inequality).
(c) P (αx) = |α|P (x) (absolute homogeneity).
Some facts are directly implied by the definition. Firstly, P(x) ≥ 0 for any x ∈ E. Indeed, by absolute homogeneity, P(x) = P(−x), and by the triangle inequality, P(0) ≤ P(x) + P(−x) = 2P(x). Combining these with condition (a), which gives P(0) = 0, we obtain P(x) ≥ 0. Secondly, any norm function is convex, as implied by the triangle inequality together with absolute homogeneity.
In the rest of this section, we let ‖·‖ denote an arbitrary norm function on E and let ⟨·, ·⟩ be the inner product on E. Given a norm function ‖·‖ on E, the dual norm, denoted by ‖·‖_d, is the function from E to R defined by

    ‖x‖_d = max_y {⟨x, y⟩ | ‖y‖ ≤ 1}.
It can be easily verified from the definition that k · kd is also a norm function on
E.
In the vector space R^n, a well-known class of norm functions is the vector p-norm (also called the ℓp-norm), denoted by ‖·‖_p, with p ∈ [1, ∞]. Given any vector x ∈ R^n, the vector p-norm takes the value

    ‖x‖_p = ( Σ_{i=1}^n |x_i|^p )^{1/p}   if 1 ≤ p < ∞;
    ‖x‖_p = max{|x_i| : i = 1, . . . , n}   if p = ∞.

The dual norm of the vector p-norm is the vector q-norm, where q is the Hölder conjugate of p, i.e., 1/p + 1/q = 1 with the convention 1/∞ = 0. Moreover, they satisfy Hölder's inequality, namely, for any x, y ∈ R^n and any p ∈ [1, ∞] with Hölder conjugate q, there holds

    x^T y ≤ ‖x‖_p · ‖y‖_q,

where equality holds if and only if there exists a constant c > 0 such that y_i = c · sgn(x_i)|x_i|^{p/q} for all i = 1, . . . , n.
In the matrix space R^{m×n}, a generalization of the vector p-norm is the so-called Schatten p-norm, denoted by ‖·‖_{S_p}, with p ∈ [1, ∞]. Given a matrix X ∈ R^{m×n} (without loss of generality, we assume m ≤ n), suppose the singular values of X are σ₁, . . . , σ_m. Then the Schatten p-norm takes the value

    ‖X‖_{S_p} = ( Σ_{i=1}^m σ_i^p )^{1/p}   if 1 ≤ p < ∞;
    ‖X‖_{S_p} = max{σ_i : i = 1, . . . , m}   if p = ∞.

Note that no absolute value is involved here, as the singular values of a matrix are non-negative. Similarly, the dual norm of the Schatten p-norm is the Schatten q-norm, where q is the Hölder conjugate of p as before. Furthermore, the Schatten p-norm can also be expressed in terms of eigenvalues on the space S^n of symmetric matrices. Given a matrix X ∈ S^n with eigenvalues λ₁, . . . , λ_n, the Schatten p-norm of X is

    ‖X‖_{S_p} = ( Σ_{i=1}^n |λ_i|^p )^{1/p}   if 1 ≤ p < ∞;
    ‖X‖_{S_p} = max{|λ_i| : i = 1, . . . , n}   if p = ∞.
As stated above, norm functions are convex. In the following, we present
a result regarding the subdifferential properties of this special class of convex
functions. We omit the proof and refer the interested readers to [58, Section 2.5].
Lemma 2.1.6 Suppose k · k is a norm function on E and k · kd is the dual norm
of k · k. Then for any x ∈ E, the subdifferential of k · k at x can be expressed as
follows:

    ∂‖x‖ = {g ∈ E | ‖g‖_d ≤ 1, ⟨g, x⟩ = ‖x‖}.                           (2.1)
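Specializing (2.1) to the ℓ1-norm, whose dual norm is the ℓ∞-norm, yields a simple membership test for the subdifferential. The following sketch implements this test; the vectors and the helper name in_subdiff_l1 are illustrative choices.

```python
import numpy as np

def in_subdiff_l1(g, x, tol=1e-10):
    """Test g in d||x||_1 via (2.1): ||g||_inf <= 1 and <g, x> = ||x||_1."""
    return (np.max(np.abs(g)) <= 1 + tol
            and np.dot(g, x) >= np.sum(np.abs(x)) - tol)

x = np.array([1.5, 0.0, -2.0])
g_good = np.array([1.0, 0.3, -1.0])   # sign(x_i) on the support, |g_i| <= 1 elsewhere
g_bad = np.array([1.0, 1.2, -1.0])    # violates ||g||_inf <= 1
print(in_subdiff_l1(g_good, x), in_subdiff_l1(g_bad, x))   # True False
```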
2.1.4 Proximity Operator
Recall that the residual function of our interest is r(x) = ‖R(x)‖ (see (1.6)), where R : E → E is the residual map defined by

    R(x) := x − prox_P(x − ∇f(x)).

Therefore, the properties of the proximity operator prox_P are crucial to our analysis. In what follows, we summarize several technical results regarding the proximity operator.
Proposition 2.1.5 Suppose h : E → R ∪ {+∞} is a proper closed convex function. Then for any v ∈ E, prox_h(v) exists and is unique.

Proof The proximity operator is defined by

    prox_h(v) := argmin_{z∈E} { h(z) + (1/2)‖z − v‖² }.

Since h is a closed convex function that is not identically +∞, the function inside the braces is proper, closed, and strongly convex. Hence the minimizer exists and is unique. □
The proximity operator often admits a closed-form expression in applications. We list several instances below.

Proposition 2.1.6 Suppose h : E → R ∪ {+∞} is a proper closed convex function.

(a) If h is the indicator function of a closed convex set C ⊆ E, then

    prox_h(v) = [v]⁺_C,

where [·]⁺_C is the projection operator onto C.

(b) If h is the ℓ1-norm function, i.e., h : R^n → R and h(x) = Σ_{i=1}^n |x_i|, then for all i = 1, . . . , n,

    (prox_h(v))_i = v_i − 1 if v_i ≥ 1;   0 if |v_i| ≤ 1;   v_i + 1 if v_i ≤ −1.

Proof We give a proof of part (a) here. In this case, the function h is of the form

    h(x) = 0 if x ∈ C,  and  h(x) = +∞ otherwise.

Hence, the minimization of h(z) + (1/2)‖z − v‖² is equivalent to

    min (1/2)‖z − v‖²   s.t. z ∈ C.

This is exactly the definition of the projection operator. For the proof of part (b), please refer to [6, Section 1.3.3]. □
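The closed form in part (b) can be checked numerically against a direct one-dimensional minimization of h(z) + (1/2)(z − v)², as in the following sketch (the test values and the grid resolution are arbitrary illustrative choices).

```python
import numpy as np

def prox_l1_scalar(v):
    """Closed form of Proposition 2.1.6(b) in one dimension."""
    if v >= 1.0:
        return v - 1.0
    if v <= -1.0:
        return v + 1.0
    return 0.0

zs = np.linspace(-5, 5, 200001)                # fine grid for a brute-force check
for v in [-2.3, -0.4, 0.0, 0.7, 3.1]:
    objective = np.abs(zs) + 0.5 * (zs - v) ** 2
    z_grid = zs[np.argmin(objective)]
    assert abs(z_grid - prox_l1_scalar(v)) < 1e-3
print("closed-form prox matches brute-force minimization")
```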
Another important property of the proximity operator is its nonexpansiveness.

Proposition 2.1.7 Suppose h : E → R ∪ {+∞} is a proper closed convex function. Then, the proximity operator of h is firmly nonexpansive, namely,

    ⟨prox_h(u) − prox_h(v), u − v⟩ ≥ ‖prox_h(u) − prox_h(v)‖².

Proof By the definition of the proximity operator and Lemma 2.1.5, we have

    ũ = prox_h(u) ⟺ u − ũ ∈ ∂h(ũ),
    ṽ = prox_h(v) ⟺ v − ṽ ∈ ∂h(ṽ).                                      (2.2)

Since h is convex, it follows from Definition 2.1.5 of subgradients that

    ⟨s₁ − s₂, x₁ − x₂⟩ ≥ 0,   ∀s₁ ∈ ∂h(x₁), s₂ ∈ ∂h(x₂).

Applying this at the points ũ and ṽ and using (2.2) gives

    ⟨u − ũ − v + ṽ, ũ − ṽ⟩ ≥ 0.

Rearranging, we obtain the required result. □
The above proposition directly implies the following corollary.

Corollary 2.1.1 Suppose h : E → R ∪ {+∞} is a proper closed convex function. Then, the proximity operator of h is Lipschitz continuous with constant 1, namely,

    ‖prox_h(u) − prox_h(v)‖ ≤ ‖u − v‖.

Proof The result follows immediately by combining Proposition 2.1.7 with the Cauchy–Schwarz inequality. □
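Both Proposition 2.1.7 and Corollary 2.1.1 can be observed numerically; the following sketch checks them for h = ‖·‖₁ (whose prox is soft-thresholding) on random pairs of points, all of which are illustrative data.

```python
import numpy as np

prox_l1 = lambda v: np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)

rng = np.random.default_rng(3)
for _ in range(1000):
    u, v = rng.standard_normal(6), rng.standard_normal(6)
    d = prox_l1(u) - prox_l1(v)
    assert np.dot(d, u - v) >= np.dot(d, d) - 1e-12            # firm nonexpansiveness
    assert np.linalg.norm(d) <= np.linalg.norm(u - v) + 1e-12  # 1-Lipschitz
print("firm nonexpansiveness verified on random samples")
```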
2.2 Set-Valued Analysis: A Review

We introduce some basic concepts of set-valued analysis that will be used in the sequel. Throughout this section, we suppose X and Y are two Banach spaces. With a slight abuse of notation, we denote by ‖·‖ the norms and by B the closed unit balls in both X and Y.
2.2.1 Set-Valued Mappings

We say a mapping Γ is a set-valued mapping (or a multifunction) from X to Y (denoted by Γ : X ⇒ Y) if it assigns a subset of Y to each vector x ∈ X. The graph of Γ, denoted by Gr(Γ), is a subset of X × Y defined by

    Gr(Γ) := {(x, y) ∈ X × Y | y ∈ Γ(x)}.
The domain of Γ is defined by dom(Γ) := {x ∈ X | Γ(x) ≠ ∅}, and the inverse mapping of Γ, denoted by Γ⁻¹, is the set-valued mapping from Y to X defined by

    Γ⁻¹(y) = {x ∈ X | y ∈ Γ(x)}.                                        (2.3)

For example, suppose A ∈ R^{m×n}. For any b ∈ R^m, if we let Γ(b) be the solution set of the linear system Ax = b, i.e., Γ(b) = {x ∈ R^n | Ax = b}, then Γ is a set-valued mapping from R^m to R^n. Another notable example is that for any lower-semicontinuous convex function P, the subdifferential of P (see Definition 2.1.5) is also a set-valued mapping.
2.2.2 Upper Semicontinuity and Closedness

As for single-valued functions, continuity properties are also defined for set-valued mappings. In this thesis, only upper semicontinuity will be involved, which is defined as follows.

Definition 2.2.1 A set-valued mapping Γ : X ⇒ Y is upper semicontinuous (usc) at x̄ ∈ dom(Γ) if for any ε > 0, there exists a constant δ > 0 such that

    Γ(x) ⊆ Γ(x̄) + εB,   ∀x ∈ x̄ + δB.
The following result is an unsurprising consequence of the above definition.

Proposition 2.2.1 The mapping Γ is usc at x̄ if and only if for every sequence {x^k} ⊆ dom(Γ) converging to x̄ and every sequence {y^k} with y^k ∈ Γ(x^k), we have lim_{k→∞} dist(y^k, Γ(x̄)) = 0.
Definition 2.2.2 A set-valued mapping Γ : X ⇒ Y is closed at x̄ if for any sequences {x^k} ⊆ X and {y^k} ⊆ Y, there holds

    x^k → x̄, y^k → ȳ, y^k ∈ Γ(x^k)  ⟹  ȳ ∈ Γ(x̄).

Equivalently, Γ is closed if its graph Gr(Γ) is a closed subset of X × Y.
The following result reveals an important property of the subdifferential of a closed convex function; the proof can be found in [55, Theorem 24.4].
Lemma 2.2.1 Let f : E → R ∪ {+∞} be a proper closed convex function. Then
∂f is closed (as a set-valued mapping).
2.2.3 Upper Lipschitz Continuity, Calmness and Metric Subregularity

We next introduce several regularity notions for set-valued mappings. Let us start with the so-called upper Lipschitz continuity, which was initially proposed by Robinson in [54].

Definition 2.2.3 A set-valued mapping Γ : X ⇒ Y is called locally upper Lipschitz continuous (briefly, locally-ULC) at x̄ if there exist constants κ, δ > 0 such that

    Γ(x) ⊆ Γ(x̄) + κ‖x − x̄‖B,   ∀x ∈ x̄ + δB.                            (2.4)
A famous result on the locally-ULC property is stated below, of which the proof
can be found in [54, Proposition 1] as well as [18, Theorem 3D.1]. A set-valued
mapping S : Rn ⇒ Rm is said to be a polyhedral multifunction if Gr(S) is the
union of finitely many polyhedral convex sets.
Lemma 2.2.2 Suppose S : Rn ⇒ Rm is a polyhedral multifunction. Then it is
locally-ULC at any point in Rn .
We then introduce the notions of calmness and metric subregularity.

Definition 2.2.4 A set-valued mapping Γ : X ⇒ Y is calm at x̄ for ȳ, where ȳ ∈ Γ(x̄), if there exist constants κ, ε, δ > 0 such that, with U = x̄ + δB and V = ȳ + εB,

    Γ(x) ∩ V ⊆ Γ(x̄) + κ‖x − x̄‖B,   ∀x ∈ U.                             (2.5)

Moreover, Γ is metrically subregular at x̄ for ȳ, where ȳ ∈ Γ(x̄), if there exist constants κ, ε, δ > 0 such that, with U = x̄ + δB and V = ȳ + εB,

    dist(x, Γ⁻¹(ȳ)) ≤ κ · dist(ȳ, Γ(x) ∩ V),   ∀x ∈ U.                  (2.6)
The following result states the relationship between metric subregularity and calmness of a set-valued mapping Γ; its proof can be found in [17, Theorem 3.2].
Lemma 2.2.3 For a multifunction Γ : X ⇒ Y, let (x̄, ȳ) ∈ Gr(Γ). Then Γ is
calm at x̄ for ȳ if and only if its inverse Γ−1 is metrically subregular at ȳ for x̄.
From the definitions, it is easy to see that calmness is a localized version of the locally-ULC property. Hence, the following result regarding the relationship between local upper Lipschitz continuity, calmness, and metric subregularity is immediate.

Lemma 2.2.4 Suppose a set-valued mapping Γ : X ⇒ Y is locally-ULC at x̄. Then, for any ȳ ∈ Γ(x̄), Γ is calm at x̄ for ȳ and Γ⁻¹ is metrically subregular at ȳ for x̄.

Proof If Γ is locally-ULC at x̄, there exist constants κ, δ > 0 such that

    Γ(x) ⊆ Γ(x̄) + κ‖x − x̄‖B,   ∀x ∈ x̄ + δB.

Hence, for any ȳ ∈ Γ(x̄) and any ε > 0, we have, with V = ȳ + εB,

    Γ(x) ∩ V ⊆ Γ(x) ⊆ Γ(x̄) + κ‖x − x̄‖B,   ∀x ∈ x̄ + δB,

which implies calmness by definition. The second part of the result follows from Lemma 2.2.3. □

2.2.4 Linear Regularity of Convex Sets
In what follows, we introduce the concepts regarding the regularity property of
convex sets.
Definition 2.2.5 Suppose C₁, . . . , C_N are closed convex subsets of E satisfying ∩_{i=1}^N C_i ≠ ∅. We say the collection {C₁, . . . , C_N} is linearly regular if there exists a constant κ > 0 such that

    dist(x, ∩_{i=1}^N C_i) ≤ κ Σ_{i=1}^N dist(x, C_i),   ∀x ∈ E.

Moreover, we say it is boundedly linearly regular if for every bounded subset B of E, there exists a constant κ > 0 (depending on B) such that

    dist(x, ∩_{i=1}^N C_i) ≤ κ Σ_{i=1}^N dist(x, C_i),   ∀x ∈ B.
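As an illustration of the definition, the following sketch empirically estimates the constant κ for the pair C₁ = unit ball and C₂ = {x : x₁ ≥ 0} over a bounded sampling region; since C₂ is polyhedral and ri(C₁) ∩ C₂ ≠ ∅, Lemma 2.2.5 below guarantees that this collection is boundedly linearly regular. The projection onto C₁ ∩ C₂ here is computed by projecting onto the halfspace first and then onto the ball, which is valid because the ball is centred at the apex of the halfspace viewed as a cone; the set choice, sample size, and helper names are all illustrative.

```python
import numpy as np

def proj_ball(x):
    """Projection onto C1 = {x : ||x|| <= 1}."""
    return x / max(1.0, np.linalg.norm(x))

def proj_half(x):
    """Projection onto C2 = {x : x_1 >= 0}."""
    y = x.copy()
    y[0] = max(y[0], 0.0)
    return y

def proj_intersection(x):
    # cone-then-ball composition is exact for a ball centred at the cone's apex
    return proj_ball(proj_half(x))

rng = np.random.default_rng(4)
ratios = []
for _ in range(10000):
    x = 3.0 * rng.standard_normal(3)           # sample a bounded test region
    d_int = np.linalg.norm(x - proj_intersection(x))
    d_sum = np.linalg.norm(x - proj_ball(x)) + np.linalg.norm(x - proj_half(x))
    if d_sum > 1e-9:
        ratios.append(d_int / d_sum)
print("empirical kappa over the sampled bounded region:", max(ratios))
```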
From the definition, it is obvious that linear regularity implies bounded linear regularity. The above concepts were proposed in [8] and have been shown to be essential in proving the linear convergence of alternating projection methods for solving feasibility problems. The following result can be found in [9].

Lemma 2.2.5 Let C₁, . . . , C_N be closed convex subsets of a finite-dimensional Euclidean space E. Suppose there exists r ∈ {0, . . . , N} such that C_{r+1}, . . . , C_N are polyhedral. Then the collection {C₁, . . . , C_N} is boundedly linearly regular if it satisfies the following condition:

    ∩_{i=1}^r ri(C_i) ∩ ∩_{i=r+1}^N C_i ≠ ∅.                            (2.7)

End of chapter.
Chapter 3
Towards Local Error Bounds: A New Analysis Framework
Summary
In this chapter, we consider the local error bound (1.7) with f being non-strongly convex but structured, and we explore conditions on P under which the local error bound holds. We provide an equivalent characterization of the local error bound in terms of the locally-ULC property of a certain set-valued mapping (Theorem 3.3.1). Furthermore, by invoking results on the linear regularity of collections of convex sets, we show that this set-valued mapping has the ULC property whenever two conditions, one concerning bounded linear regularity and one concerning metric subregularity of ∂P, are verified.
3.1 Assumptions

Let us start by presenting all the assumptions required throughout this chapter. Suppose E and T are two finite-dimensional Euclidean spaces. With a slight abuse of notation, we denote by ⟨·, ·⟩ the inner products of both E and T, and by ‖·‖ the norms induced by these inner products. In addition, let B denote the origin-centred closed unit ball in each of E and T.
We consider the local error bound (1.7), where f and P are both proper closed convex functions mapping E to R ∪ {+∞}. In addition, the function f takes the structure

    f(x) = h(A(x)),                                                     (3.1)

where A : E → T is a linear operator and the convex function h : T → R ∪ {+∞} satisfies the following assumption:

Assumption 1 (a) The effective domain of h is open as well as non-empty.

(b) The function h is continuously differentiable on dom(h). Moreover, for any compact subset C ⊆ dom(h), h is strongly convex and ∇h is Lipschitz continuous on C.
Let X denote the optimal solution set of (1.2). We also assume the following:
Assumption 2 (a) The optimal solution set X is non-empty.
(b) The optimal solution set X is bounded.
Assumptions 1 and 2 are satisfied in a wide range of applications. Indeed, for the smooth function f, Assumption 1 is satisfied by linear regression, in which the function h takes the form

    h(y) = (1/2)‖y − b‖²,

as well as by logistic regression, in which h is of the form

    h(y) = Σ_{i=1}^m log(1 + e^{y_i}) − ⟨b, y⟩,

with m being the dimension of T and input parameter b ∈ {0, 1}^m. It is also satisfied by the negative log-likelihood under Poisson noise [59], where h takes the form

    h(y) = − Σ_{i=1}^m log(y_i) + ⟨b, y⟩,

with b ≥ 0. Furthermore, under Assumption 1 on f and Assumption 2(a), Assumption 2(b) can additionally be verified if P is any norm function (this includes the ℓ1-regularizer, the group-lasso regularizer, the nuclear norm regularizer, mixed-norm regularizers, etc.) or if P is the indicator function of a compact convex set. Please refer to Sections 4.1 and 5.1 for detailed discussion.
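The role of compactness in Assumption 1(b) can be seen concretely for the logistic loss: its Hessian is diagonal with entries s(y_i)(1 − s(y_i)), where s is the sigmoid, so the strong convexity parameter is positive on any compact set but degrades to zero as the set grows. The following sketch (radii chosen arbitrarily for illustration) computes the smallest Hessian entry over intervals of increasing size.

```python
import numpy as np

sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))
hess_diag = lambda y: sigmoid(y) * (1.0 - sigmoid(y))   # Hessian diagonal of the logistic h

for radius in [1.0, 5.0, 20.0]:
    ys = np.linspace(-radius, radius, 2001)             # a compact interval per coordinate
    print(f"radius {radius:5.1f}: min Hessian entry = {hess_diag(ys).min():.3e}")
```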
3.2 Properties on Optimality and the Residual Map
The following technical results regarding the optimal solution set X are direct consequences of Assumptions 1 and 2.

Proposition 3.2.1 For any ζ ≥ v*, where v* is the optimal value of (1.2), the ζ-level set of F, i.e., L(ζ) = {x ∈ E | F(x) ≤ ζ}, is non-empty, convex, and compact. In particular, the optimal solution set X is also non-empty, convex, and compact.

Proof Since both f and P are closed convex functions, it follows from Proposition 2.1.3 that L(ζ) is closed and convex for any ζ ≥ v*. Moreover, L(ζ) is non-empty, as X ⊆ L(ζ) and X is non-empty by Assumption 2. Hence, it suffices to show the boundedness of L(ζ). By Assumption 2, the optimal solution set X is bounded, and it is precisely the v*-level set, since X = L(v*). The result then follows immediately from Lemma 2.1.2. □
Proposition 3.2.2 Suppose that f takes the form (3.1) and Assumptions 1 and 2(a) are satisfied. Then, there exist vectors ȳ ∈ T and ḡ ∈ E, with ḡ = A*∇h(ȳ), such that

    A(x) = ȳ,   ∇f(x) = ḡ,   for all x ∈ X.                             (3.2)

In addition, the optimal solution set X has the following characterization:

    X = {x ∈ E | A(x) = ȳ, −ḡ ∈ ∂P(x)}.                                 (3.3)
Proof We first prove the invariance property (3.2). For arbitrary x₁, x₂ ∈ X, let y₁ = A(x₁) and y₂ = A(x₂), and suppose that y₁ ≠ y₂. The line segment between y₁ and y₂ is a compact convex subset of dom(h), and h is thus strongly convex on this set by Assumption 1(b). Let the strong convexity parameter be σ > 0. By Definition 2.1.4 with α = 1/2, there holds

    h((y₁ + y₂)/2) ≤ (1/2)h(y₁) + (1/2)h(y₂) − (σ/8)‖y₁ − y₂‖².

Due to (3.1), the above is equivalent to

    f((x₁ + x₂)/2) ≤ (1/2)f(x₁) + (1/2)f(x₂) − (σ/8)‖y₁ − y₂‖².

Moreover, by the convexity of P,

    P((x₁ + x₂)/2) ≤ (1/2)P(x₁) + (1/2)P(x₂).

Adding the above two inequalities and using x₁, x₂ ∈ X, we have

    F((x₁ + x₂)/2) ≤ v* − (σ/8)‖y₁ − y₂‖² < v*,

which contradicts the optimality of v*. Hence y₁ = y₂, i.e., the value of A(x) is invariant over the set X; we denote it by ȳ. Due to (3.1) and the assumption that h is differentiable on dom(h), the gradient of f has the expression ∇f(x) = A*∇h(A(x)). Thus, by letting ḡ = A*∇h(ȳ), we have

    A(x) = ȳ,   ∇f(x) = ḡ,   ∀x ∈ X.
We next prove the characterization (3.3). For any optimal point x, by the invariance property (3.2), we have A(x) = ȳ and ∇f(x) = ḡ. In addition, since both f and P are convex, x is optimal if and only if it satisfies the first-order optimality condition, i.e.,

    0 ∈ ∇f(x) + ∂P(x).                                                  (3.4)

Hence, we have −ḡ ∈ ∂P(x). On the other hand, for any x ∈ E satisfying A(x) = ȳ and −ḡ ∈ ∂P(x), the relationships ḡ = A*∇h(ȳ) and ∇f(x) = A*∇h(A(x)) yield 0 ∈ ∇f(x) + ∂P(x), which implies x ∈ X by Lemma 2.1.5. Hence, x ∈ X if and only if A(x) = ȳ and −ḡ ∈ ∂P(x). □
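The invariance (3.2) can be observed numerically. In the following sketch, an assumed lasso instance of the form (1.3) is built with two identical columns in A, so that its optimal solution is not unique; redistributing weight between the duplicated coordinates produces a different (near-)optimal point with the same objective value and, as (3.2) predicts, the same A x (and hence the same gradient). All data and parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 10))
A[:, 1] = A[:, 0]                              # duplicate column => non-unique optima
b = 2.0 * A[:, 0] + 0.05 * rng.standard_normal(30)
tau, step = 0.5, 1.0 / np.linalg.norm(A, 2) ** 2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(10)
for _ in range(20000):                         # proximal gradient, run to high accuracy
    x = soft(x - step * A.T @ (A @ x - b), step * tau)

x2 = x.copy()                                  # redistribute mass between the two
s = x[0] + x[1]                                # identical columns (same sign kept)
x2[0], x2[1] = 0.25 * s, 0.75 * s

F = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + tau * np.sum(np.abs(z))
print("objective values:", F(x), F(x2))                    # equal up to rounding
print("||A x - A x2|| = ", np.linalg.norm(A @ x - A @ x2)) # A x invariant, cf. (3.2)
```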
Furthermore, the gradient of f is Lipschitz continuous around the optimal solution set. This result is stated precisely below. Given any ρ > 0, let us denote by N(ρ) the neighbourhood of X defined by

    N(ρ) := {x ∈ E | dist(x, X) ≤ ρ}.                                   (3.5)

Proposition 3.2.3 Suppose that f takes the form (3.1) and Assumptions 1 and 2 are satisfied. Then, for any ρ > 0, there exist constants L_A, L_f > 0, both of which depend on ρ, such that

    ‖∇f(x) − ḡ‖ ≤ L_A ‖A(x) − ȳ‖ ≤ L_f · dist(x, X),   ∀x ∈ N(ρ),

where the vectors ȳ, ḡ are given in Proposition 3.2.2.

Proof By Assumption 2, the optimal solution set X is a compact subset of E, and so is its neighbourhood N(ρ). Hence, upon applying the linear operator A, the set C(ρ) := {A(x) | dist(x, X) ≤ ρ} is a compact subset of T and ȳ ∈ C(ρ). Thus, by Assumption 1(b), ∇h is Lipschitz continuous on C(ρ); denote its Lipschitz constant by L_h. Then for any x satisfying dist(x, X) ≤ ρ, we have

    ‖∇f(x) − ḡ‖ = ‖A*∇h(A(x)) − A*∇h(ȳ)‖
                ≤ ‖A*‖ · L_h · ‖A(x) − ȳ‖
                = ‖A*‖ · L_h · ‖A(x) − A(x̄)‖
                ≤ ‖A*‖ · L_h · ‖A‖ · dist(x, X),                        (3.6)

where the first equality follows from (3.1) and x̄ is the projection of x onto the optimal solution set X. Here ‖A‖ and ‖A*‖ denote the operator norms of A and its adjoint A*, both of which are finite. Therefore, by letting L_A = ‖A*‖ L_h and L_f = L_A ‖A‖, we obtain the required result. □
We now turn to the properties of the residual function of our interest. Recall that the residual function is r(x) = ‖R(x)‖, where R : E → E is the residual map defined by

    R(x) := x − prox_P(x − ∇f(x)).

Proposition 3.2.4 Under the same setting as Proposition 3.2.3, the residual function r(x) is continuous on dom(F).

Proof For any x, x̃ ∈ dom(F), we have

    |r(x) − r(x̃)| = | ‖R(x)‖ − ‖R(x̃)‖ |
                  ≤ ‖x − prox_P(x − ∇f(x)) − x̃ + prox_P(x̃ − ∇f(x̃))‖
                  ≤ ‖prox_P(x − ∇f(x)) − prox_P(x̃ − ∇f(x̃))‖ + ‖x − x̃‖
                  ≤ ‖∇f(x) − ∇f(x̃)‖ + 2‖x − x̃‖,                        (3.7)

where the triangle inequality is used repeatedly and the last inequality invokes the nonexpansiveness of the proximity operator (see Corollary 2.1.1). By Assumption 1(b), ∇h is continuous on dom(h), and ∇f is thus continuous on dom(F). This, together with (3.7), implies the continuity of r(x). □
Besides its continuity on dom(F), the residual function r(x) enjoys the following properties when x is close to the optimal solution set X.

Proposition 3.2.5 Under the same setting as Proposition 3.2.3, for any ρ > 0, there exists a constant L_r > 0, which depends on ρ, such that

    r(x) ≤ L_r · dist(x, X),   ∀x ∈ N(ρ).

In addition, for any ρ > 0 and any ζ > v*, there exists a constant ε > 0, which depends on ρ and ζ, such that

    x ∈ N(ρ) whenever F(x) ≤ ζ and r(x) ≤ ε.

Proof For any point of X, the residual vanishes. Hence, letting x̄ be the projection of x onto X, so that r(x̄) = 0, and applying (3.7) to the pair (x, x̄), we have

    r(x) = |r(x) − r(x̄)|
         ≤ ‖∇f(x) − ∇f(x̄)‖ + 2‖x − x̄‖
         = ‖∇f(x) − ḡ‖ + 2‖x − x̄‖
         ≤ (L_f + 2) · dist(x, X),                                      (3.8)

where the equality uses Proposition 3.2.2 and the last inequality is due to Proposition 3.2.3. Letting L_r = L_f + 2 completes the proof of the first part.

We prove the second part of Proposition 3.2.5 by contradiction. Suppose the statement is false. Then there exist constants ρ̄ > 0 and ζ̄ > v*, as well as a sequence {x^k} satisfying F(x^k) ≤ ζ̄ for all k and lim_{k→∞} r(x^k) = 0, while dist(x^k, X) > ρ̄ for all k. By Proposition 3.2.1, the level set L(ζ̄) is compact. Hence, by passing to a subsequence if necessary, we may assume lim_{k→∞} x^k = x^∞. By the continuity of r(x) (see Proposition 3.2.4), we have r(x^∞) = lim_{k→∞} r(x^k) = 0, which implies x^∞ ∈ X. This contradicts dist(x^k, X) > ρ̄ for all k. Therefore, we obtain the required result. □
We end this section by remarking that the boundedness of the level sets is crucial to the second part of Proposition 3.2.5. In fact, for certain convex functions with unbounded level sets, r(x^k) → 0 does not imply x^k → X. To illustrate the importance of boundedness, consider the function φ : R² → R ∪ {+∞} defined by

    φ(x, y) = y e^{x/y}   if y > 0;
    φ(x, y) = 0           if x ≤ 0, y = 0;
    φ(x, y) = +∞          otherwise.                                    (3.9)

Obviously, φ is a closed convex function with minimum value 0. The level sets of φ are unbounded, and the optimal solution set of min φ is X = {(x, y) ∈ R² | x ≤ 0, y = 0}, which is also unbounded. A residual function for the optimization problem min φ is r(x, y) := dist(0, ∂φ(x, y)). We claim that for any ζ ≥ 0, there are no positive constants ε, κ such that

    dist((x, y), X) ≤ κ · r(x, y) whenever φ(x, y) ≤ ζ and r(x, y) ≤ ε.

Indeed, if (x̄, ȳ) ∉ X satisfies φ(x̄, ȳ) ≤ ζ, then necessarily ȳ > 0, and for any t ≥ 0 we have φ(x̄ − t, ȳ) ≤ ζ. Moreover, it is easy to verify that r(x̄ − t, ȳ) → 0 as t → +∞. However, for any t > 0,

    dist((x̄ − t, ȳ), X) ≥ ȳ,

which is bounded away from 0. Therefore, the claim is true.
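The behaviour just described can be reproduced numerically. For y > 0, the function φ in (3.9) is differentiable with ∇φ(x, y) = (e^{x/y}, e^{x/y}(1 − x/y)), so along the ray (x̄ − t, ȳ) the residual decays to zero while the distance to X stays at least ȳ. The starting point and the values of t in the following sketch are arbitrary illustrative choices.

```python
import numpy as np

def grad_phi(x, y):
    # for y > 0, phi(x, y) = y * exp(x / y) is differentiable
    e = np.exp(x / y)
    return np.array([e, e * (1.0 - x / y)])

def dist_to_X(x, y):
    # X = {(x, y) : x <= 0, y = 0}
    return np.hypot(max(x, 0.0), y)

x0, y0 = 1.0, 1.0
for t in [0.0, 5.0, 20.0, 80.0]:
    x = x0 - t
    print(f"t = {t:5.1f}   r = {np.linalg.norm(grad_phi(x, y0)):.2e}"
          f"   dist = {dist_to_X(x, y0):.2f}")
```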
3.3 A Sufficient and Necessary Condition for Local Error Bounds

Recall that the local error bound of our interest (1.7) involves the test set U = {x ∈ E | F(x) ≤ ζ, r(x) ≤ ε} for some ζ and ε. In what follows, we alter this test set and define a new local error bound with the same residual function as in (1.7).

Local Error Bound with Neighbourhood Test Set: There exist κ > 0 and ρ > 0 such that

    dist(x, X) ≤ κ‖R(x)‖ for all x ∈ N(ρ).                              (3.10)
Although their test sets differ, the error bounds (1.7) and (3.10) are closely related. In particular, under our assumptions, we have the following result, which
can be derived easily using Proposition 3.2.5.
Proposition 3.3.1 Suppose f is of the form (3.1) and Assumptions 1 and 2 are
satisfied. Then, if the error bound (3.10) holds, the error bound (1.7) holds as
well.
In what follows, we turn to studying the validity of the error bound (3.10). Let Σ : T × E ⇒ E be the set-valued mapping defined as follows: for any (t, e) ∈ T × E,

    Σ(t, e) := {z ∈ E | A(z) = t, −e ∈ ∂P(z)}.                          (3.11)

By Proposition 3.2.2, it is immediate that if f is of the form (3.1) and Assumptions 1 and 2 are satisfied, then the optimal solution set X equals the value of the multifunction Σ at the point (ȳ, ḡ), i.e., X = Σ(ȳ, ḡ). Hence, for any ρ > 0, the neighbourhood N(ρ) of X is also a neighbourhood of Σ(ȳ, ḡ).

In view of this, let us define a set-valued mapping Σ_ρ : T × E ⇒ E as follows:

    Σ_ρ(t, e) := Σ(t, e) ∩ N(ρ),   ∀(t, e) ∈ T × E.                     (3.12)
As N (ρ) is a neighbourhood of Σ(ȳ, ḡ) for any ρ > 0, it is obvious that Σρ (ȳ, ḡ) =
Σ(ȳ, ḡ) for any ρ > 0. In addition, in the terminology of Rockafellar and Wets,
Σ_ρ is called a truncated set-valued mapping of Σ (see [57, Theorem 9.33]). Furthermore, the following result provides a sufficient and necessary condition for the error bound (3.10) in terms of the upper Lipschitz continuity of Σ_ρ.
Theorem 3.3.1 Suppose f is of the form (3.1) and Assumptions 1 and 2 are
satisfied. Then the local error bound (3.10) holds if and only if there exists a
constant ρ > 0 such that the set-valued mapping Σρ is locally-ULC at (ȳ, ḡ),
where the vectors ȳ and ḡ are given in Proposition 3.2.2.
Proof We first prove the necessity. Suppose the local error bound (3.10) holds.
Due to the fact that X = Σ(ȳ, ḡ) and the expression of the residual map of (1.2), we have

    dist(x, Σ(ȳ, ḡ)) ≤ κ‖prox_P(x − ∇f(x)) − x‖ for all x ∈ N(ρ).       (3.13)
For any (y, g) ∈ T × E, we claim that

    x = prox_P(x − g),   ∀x ∈ Σ(y, g).                                  (3.14)

Indeed, if x ∈ Σ(y, g), then by the definition of Σ we have −g ∈ ∂P(x). Let z = prox_P(x − g). By the definition of the proximity operator, z is characterized by

    0 ∈ z − (x − g) + ∂P(z),

and this relation is clearly satisfied by z = x, since −g ∈ ∂P(x). This, together with the uniqueness of the proximity operator (Proposition 2.1.5), certifies our claim. Hence, for any (y, g) ∈ T × E and any x ∈ Σ(y, g) ∩ N(ρ), we have

    dist(x, Σ(ȳ, ḡ)) ≤ κ‖prox_P(x − ∇f(x)) − x‖
                     = κ‖prox_P(x − ∇f(x)) − prox_P(x − g)‖
                     ≤ κ‖∇f(x) − g‖
                     ≤ κ(‖∇f(x) − ḡ‖ + ‖ḡ − g‖)
                     ≤ κ(L_A ‖A(x) − ȳ‖ + ‖ḡ − g‖)
                     ≤ κ(L_A + 1)(‖y − ȳ‖ + ‖g − ḡ‖),                   (3.15)

where the equality is due to (3.14), the second inequality is by the nonexpansiveness of the proximity operator, the fourth inequality is due to Proposition 3.2.3, and the last inequality uses A(x) = y. Therefore, recalling the definition of Σ_ρ, the relationship (3.15) implies

    Σ_ρ(y, g) ⊆ Σ(ȳ, ḡ) + κ(L_A + 1)‖(y, g) − (ȳ, ḡ)‖B

for all (y, g) ∈ T × E. In addition, as Σ_ρ(ȳ, ḡ) = Σ(ȳ, ḡ), we have

    Σ_ρ(y, g) ⊆ Σ_ρ(ȳ, ḡ) + κ(L_A + 1)‖(y, g) − (ȳ, ḡ)‖B

for all (y, g) ∈ T × E, which implies that Σ_ρ is locally-ULC at (ȳ, ḡ).
We next prove the sufficiency. Suppose there exists a constant ρ₀ > 0 such that the set-valued mapping Σ_{ρ₀} is locally-ULC at (ȳ, ḡ). Then there exist constants κ₀, δ > 0 such that

    Σ_{ρ₀}(t, e) ⊆ Σ_{ρ₀}(ȳ, ḡ) + κ₀‖(t, e) − (ȳ, ḡ)‖B                  (3.16)

whenever ‖(t, e) − (ȳ, ḡ)‖ ≤ δ. By the definition of Σ_{ρ₀} and the fact that Σ_{ρ₀}(ȳ, ḡ) = Σ(ȳ, ḡ), (3.16) implies

    Σ(t, e) ∩ N(ρ₀) ⊆ Σ(ȳ, ḡ) + κ₀‖(t, e) − (ȳ, ḡ)‖B                    (3.17)

whenever ‖(t, e) − (ȳ, ḡ)‖ ≤ δ. Consider the following pair of functions:
whenever k(t, e) − (ȳ, ḡ)k ≤ δ. Consider the following pair of functions:
y(x) = A(x + R(x)),
g(x) = ∇f (x) + R(x).
Clearly, we have y(x) = ȳ and g(x) = ḡ for all x ∈ X . Moreover, for any x ∈ E,
k(y(x), g(x)) − (ȳ, ḡ)k
=kA(x) + A(R(x)) − ȳk + k∇f (x) + R(x) − ḡk
(3.18)
≤kA(x) − ȳk + k∇f (x) − ḡk + (kAk + 1)r(x).
Due to Propositions 3.2.3 and 3.2.5, there exist constants LA , Lf , Lr > 0, all of
which depends on ρ0 , such that for all x ∈ N (ρ0 ),
k∇f (x) − ḡk ≤ LA · kA(x) − ȳk ≤ Lf · dist(x, X ),
(3.19)
r(x) ≤ Lr · dist(x, X ).
In view of (3.18) and (3.19), there exists a constant κ1 > 0 such that
k(y(x), g(x)) − (ȳ, ḡ)k ≤ κ1 · dist(x, X ) ∀x ∈ N (ρ0 ).
ρ0
}. From its definition, it is clear that N (ρ) ⊆ N (ρ0 ) (since
Let ρ = min{ κδ1 , 1+L
R
ρ ≤ ρ0 ) and κ1 ρ ≤ δ. Hence, for all x ∈ N (ρ), (3.19) implies that
k(y(x), g(x)) − (ȳ, ḡ)k ≤ κ1 · dist(x, X ) ≤ κ1 ρ ≤ δ.
Combining this with (3.17), we obtain that for any x ∈ N (ρ), it must satisfy
Σ(y(x), g(x)) ∩ N (ρ0 ) ⊆ Σ(ȳ, ḡ) + κ0 k(y(x), g(x)) − (ȳ, ḡ)kB
(3.20)
Since R(x) = x−proxP (x−∇f (x)), by the definition of proximity operator (1.2),
it satisfies
0 ∈ R(x) + ∇f (x) + ∂P (x + R(x)).
(3.21)
This, together with the definitions of y(x) and g(x), implies that for any x ∈ E,
x + R(x) ∈ Σ(y(x), g(x)).
(3.22)
In addition, for any x ∈ N (ρ), we have
dist(x + R(x), X ) ≤ dist(x, X ) + kR(x)k
≤ (1 + Lr ) · dist(x, X )
(3.23)
≤ (1 + Lr )ρ
≤ ρ0 ,
where the first inequality is by triangle inequality, the second inequality is due
to (3.19) and the last inequality is by the definition of ρ. Combining the results (3.22) and (3.23), we obtain
x + R(x) ∈ Σ(y(x), g(x)) ∩ N (ρ0 ),
∀x ∈ N (ρ).
(3.24)
Hence, due to (3.20) and the fact X = Σ(ȳ, ḡ), we have
dist(x + R(x), X ) ≤ κ0 k(y(x), g(x)) − (ȳ, ḡ)k,
∀x ∈ N (ρ).
(3.25)
This leads to the fact that for any x ∈ N (ρ), there holds
dist(x, X ) ≤ dist(x + R(x), X ) + kR(x)k
≤ κ0 k(y(x), g(x)) − (ȳ, ḡ)k + kR(x)k
≤ κ0 ((LA + 1)kA(x) − ȳk + (kAk + 1)kR(x)k)
+ kR(x)k
36
(3.26)
CHAPTER 3. A NEW ANALYSIS FRAMEWORK
37
where the first inequality is by triangle inequality, the second inequality is due
to (3.25) and the last one is by (3.18) and (3.19). Therefore, there exists a
constant κ1 > 0 such that
dist(x, X ) ≤ κ1 (kA(x) − ȳk + kR(x)k) ,
∀x ∈ N (ρ).
Using the inequality that for any a, b ∈ R, (a + b)2 ≤ 2(a2 + b2 ), we have
dist(x, X )2 ≤ 2κ21 (kA(x) − ȳk2 + kR(x)k2 ),
∀x ∈ N (ρ).
(3.27)
Since X is a compact subset of E, the set {A(x) | x ∈ N (ρ)} is a compact subset
of T . Hence, by Assumption 1, h is strongly convex on this set. Let the strong
convexity factor be σ. Then for any x ∈ N (ρ),
σkA(x) − ȳk2 ≤ h∇h(A(x)) − ∇h(ȳ), A(x) − ȳi
= hA∗ ∇h(A(x)) − A∗ ∇h(ȳ), x − x̄i
(3.28)
= h∇f (x) − ḡ, x − x̄i,
where x̄ is the projection of x onto X . Since P (x) is convex, for any two vectors
v1 , v2 satisfying v1 ∈ ∂P (x1 ) and v2 ∈ ∂P (x2 ), there must hold
hv1 − v2 , x1 − x2 i ≥ 0.
(3.29)
Due to (3.21) and the optimality of x̄, we have −∇f (x) − R(x) ∈ ∂P (x + R(x))
and −ḡ ∈ ∂P (x̄). Hence, by substituting v1 = −∇f (x) − R(x), x1 = x + R(x)
and v2 = −ḡ, x2 = x̄ into (3.29) and upon rearranging, we obtain
h∇f (x) − ḡ, x − x̄i + kR(x)k2 ≤ hḡ − ∇f (x) + x̄ − x, R(x)i.
(3.30)
Noting that kR(x)k2 ≥ 0 and using (3.19), the inequality (3.30) leads to
h∇f (x) − ḡ, x − x̄i ≤ (Lf + 1) · dist(x, X ) · kR(x)k,
∀x ∈ N (ρ).
In view of this, (3.28) and (3.27), there exists a constant κ2 > 0 such that
dist(x, X )2 ≤ κ2 (dist(x, X ) · kR(x)k + kR(x)k2 ),
∀x ∈ N (ρ).
Solving this quadratic inequality, we obtain a constant κ such that
dist(x, X ) ≤ κkR(x)k,
∀x ∈ N (ρ).
(3.31)
u
t
Therefore, the proof is completed.
Equipped with Proposition 3.3.1 and Theorem 3.3.1, we are able to turn
the study of the local error bound (1.7) into the exploration of the locally-ULC
property of the set-valued mapping Σρ . Compared with previous studies on
the local error bound, such analysis approach enjoys the following advantages.
Firstly, previous studies on the local error bound relies heavily on the closed
form of r(x) as well as its special structure (see [72, 79]). However, in some
instances of (1.2) with wide applications, there is no closed-form of r(x), such as
`1,p -norm regularization with p ∈ (1, 2) or p ∈ (2, ∞]. In contrast, the set-valued
mapping Σρ is only relevant with the subdifferential of P , which is accessible for
almost all the functions in applications. Secondly, for non-polyhedral P , previous
proofs of the local error bound (also see [72, 79]) are based on very complicated
contradictory arguments, from which are not clear what are the primary causes of
such results. On the contrary, Theorem 3.3.1 grasp the essence of the local error
bound, that is, the local error bound is the consequence of the regularity property
jointly determined by the linear operator A along with the subdifferential of P .
After we apply Theorem 3.3.1 to several instances of (1.2) in Chapter 4 and
Chapter 5, the advantages of our analysis approach will be readily seen.
3.4
Bounded Linear Regularity and Metric Subregularity
In view of Theorem 3.3.1, it is clear that given any instance of optimization
problem (1.2) that satisfies Assumptions 1 and 2, we can study its local error
bound by exploring the locally-ULC property of the set-valued mapping Σρ .
However, this analysis approach is still not handy enough as to study the locallyULC property of Σρ is not a easy task in general. In this section, we simplify
38
CHAPTER 3. A NEW ANALYSIS FRAMEWORK
39
this approach by providing sufficient conditions for Σrho to be locally-ULC that
are much easier to testify.
Let us start with presenting a well-known result regarding the error bound
of linear systems, which is originally due to Hoffman [24]. Readers are also
encouraged to refer to [53, 42] for the proof. For any vector v ∈ Rn , we let
[v]+ ∈ Rn with each component ([v]+ )i = max{vi , 0}.
Lemma 3.4.1 Consider the solution set of a linear system:
S := {z ∈ Rn | Az ≤ a, Bz = b},
where A, B (a, b) are arbitrary matrices (vectors) of consistent dimensions. Then,
there exists a constant c > 0, which only depends on A and B, such that
dist(x, S) ≤ c(k(Ax − a)+ k + kBx − bk),
∀x ∈ Rn .
Theorem 3.4.1 Suppose the function f in (1.2) is of form (3.1) and Assumptions 1 and 2 are satisfied. Let us define the two sets Cf and CP as below:
Cf := {z ∈ E | A(z) = ȳ },
CP := {z ∈ E | −ḡ ∈ ∂P (z)},
where the vectors ȳ, ḡ are given in Proposition 3.2.2. Then the local error bound (1.7)
holds if the following two conditions are both satisfied:
(C1). The collection {Cf , CP } is bounded linear regular.
(C2). For any optimal solution x̄ ∈ X , ∂P (as a set-valued mapping) is metrically
subregular at x̄ for −ḡ.
Proof In view of Proposition 3.3.1 and Theorem 3.3.1, it suffices to prove that
under conditions (C1) and (C2), the truncated set-valued mapping Σρ is locallyULC for some ρ > 0. By the definitions of Cf and CP , it is clear that X =
Σ(ȳ, −ḡ) = Cf ∩ CP . Due to condition (C1) and the definition of bounded linear
regularity (see Definition 2.2.5), for any ρ1 > 0, there exists a constant κ1 > 0
such that
dist(x, Σ(ȳ, ḡ)) = dist(x, Cf ∩ CP )
(3.32)
≤ κ1 (dist(x, Cf ) + dist(x, CP ))
for all x ∈ N (ρ1 ) (recall the definition of N in (3.5)). Notice that the set Cf is
defined by a linear system. Hence, by Lemma 3.4.1, there exists a constant c > 0
such that
dist(x, Cf ) ≤ ckA(x) − ȳk,
∀x ∈ E.
(3.33)
Moreover, we make the following claim.
Claim: Under condition (C2), there exist constants κ2 , ρ2 , δ > 0 such that, with
V = −ḡ + δB,
dist(x, CP ) ≤ κ2 · dist(−ḡ, ∂P (x) ∩ V ),
∀x ∈ N (ρ2 ).
(3.34)
For the sake of clarity, we proceed our proof by assuming that the claim is true.
In view of (3.32), (3.33) and (3.34), there exists a constant κ > 0 such that
dist(x, Σ(ȳ, ḡ) ≤ κ(kA(x) − ȳk + dist(−ḡ, ∂P (x) ∩ V )),
∀x ∈ N (ρ),
where ρ = min{ρ1 , ρ2 }. This implies the following
dist(x, Σ(ȳ, ḡ)) ≤ κ(ky − ȳk + kg − ḡk),
(3.35)
for all x ∈ Σρ (y, g) with k(y, g) − (ȳ, ḡ)k ≤ δ, where we utilize the facts that
A(x) = y and −g ∈ ∂P (x) for any x ∈ Σρ (y, g), and also the fact −g ∈ V due to
k(y, g)−(ȳ, ḡ)k ≤ δ. Moreover, (3.35) is equivalent with the following statements
Σρ (y, g) ⊆ Σ(ȳ, ḡ) + κ(ky − ȳk + kg − ḡk)B,
whenever k(y, g) − (ȳ, ḡ)k ≤ δ. This, together with the fact Σρ (ȳ, ḡ) = Σ(ȳ, ḡ),
proves that Σρ is locally-ULC at (ȳ, ḡ).
40
CHAPTER 3. A NEW ANALYSIS FRAMEWORK
41
It remains to prove the above claim. By condition (C2), ∂P is metrically
subregular at any x̄ ∈ X for −ḡ. From the definition of metric subregularity (see
Definition 2.2.4), we have for any x̄ ∈ X , there exists a constant κ(x̄) > 0 along
with an open neighbourhood Ux̄ of x̄ and an open neighbourhood Vx̄ of −ḡ such
that
dist(x, ∂P −1 (−ḡ)) ≤ κ(x̄) · dist(−ḡ, ∂P (x) ∩ Vx̄ ),
∀x ∈ Ux̄ ,
(3.36)
where ∂P −1 is the inverse of the set-valued mapping ∂P . Since X is a compact set
by Assumption 2, by the Heine-Borel theorem, there exists N points x̄1 , . . . , x̄N ∈
X such that
X ⊆
N
[
Ux̄i .
i=1
Additionally, as X is compact and each Ux̄i is open, there exists a constant ρ2 > 0
such that
N (ρ2 ) = X + ρ2 B ⊆
N
[
Ux̄i .
i=1
Similarly, there exists a constant δ > 0 such that
V := −ḡ + δB ⊆
N
\
Vx̄i .
i=1
Hence, by letting κ2 = max{κ(x̄1 ), . . . , κ(x̄N )}, (3.36) implies that
dist(x, ∂P −1 (−ḡ)) ≤ κ2 · dist(−ḡ, ∂P (x) ∩ V ),
∀x ∈ N (ρ2 ).
Since ∂P −1 (−ḡ) = {x ∈ E | −ḡ ∈ ∂P (x)} = CP , the above relationship is
equivalent with (3.34) and the claim is thus proved.
u
t
We end this chapter by making several remarks on the conditions (C1) and
(C2):
• Both the conditions (C1) and (C2) are not difficult to verify. In fact, there
are abundant works in the literature studying the bounded linear regularity
of convex sets (see [8, 9, 10, 33] and references therein) and the metric
subregularity of subdifferential mappings (see [26, 4, 5, 19] and references
there in).
• The condition (C1) can be viewed as a constraint qualification condition
and it will be shown in Chapter 5 that, condition (C1) is closely related
to the strict complementarity of nonlinear programming. The condition
(C2) is a local growth condition of the nonsmooth function P . It is closely
related to the local strong convexity of convex functions (see [19]).
• In the literature of mathematical programming, the condition (C1) is typically required for obtaining the linear convergence of alternating projection
methods, such as [7, 8, 36]; and the condition (C2) is typically required
for the linear convergence of proximal point method and its variants, such
as [56, 80, 34]. Recall that our projection-type local error bound (1.7) is
utilized to establish linear convergence of projection gradient method, proximal gradient method, etc.. Hence, our result (Theorem 3.4.1) shows the
linear rate of convergence these first-order methods may not be attainable
if either (C1) or (C2) is absent. Furthermore, as we will demonstrate in
Chapter 6, the failure of either (C1) or (C2) will cause the failure of linear
convergence of these algorithms.
2 End of chapter.
42
Chapter 4
Error Bounds for `1,p-Norm
Regularization
Summary
In this chapter, we apply the theoretical tools and results developed
in Chapter 3 to explore the local error bound for optimization problems with `1,p -norm regularization with p ∈ [1, ∞]. We will show
that for this class of problems, the local error bound holds when
p ∈ [1, 2] and p = ∞, while it fails when p ∈ (2, ∞) for certain
instances.
4.1
Introduction
Consider the following convex optimization problem,
min f (x) + P (x),
x∈Rn
43
(4.1)
where f is a continuously differentiable convex function, P is the `1,p -regularizer
with p ∈ [1, ∞] (also called generalized group-lasso regularizer) defined as
P (x) =
X
ωJ kxJ kp ,
∀x ∈ Rn .
(4.2)
J∈J
In the above formula, J is a non-overlapping partition of the coordinate index
set {1, 2, . . . , n} and ωJ > 0 are given constants. In addition, k · kp is the vector
p-norm in the usual sense, i.e., for any vector x ∈ Rn ,
(Pn |x |p ) p1 , if 1 ≤ p < ∞;
i
i=1
kxkp =
maxi {|xi |},
if p = ∞.
Obviously, the `1,p -regularizer is a norm function and is thus convex.
Optimization problem (4.1)-(4.2) is of broad applications in machine learning,
statistics, computational biology, and signal processing, etc., see [66, 78, 75, 32,
21, 69, 44, 30, 43] and references therein. The `1,p -regularizer in these applications
are utilized for inducing the structured sparsity of the parameters. In addition,
the `1 -regularization [66] and group-lasso regularization [78] are special instances
of problem (4.1)-(4.2), which correspond to p = 1 and p = 2, respectively.
In this chapter, we explore the local error bound (1.7) for problem (4.1)-(4.2)
under Assumptions 1 and 2(a). In other words, we assume the smooth function
f takes the form
f (x) = h(Ax),
∀x ∈ Rn
(4.3)
for some matrix A ∈ Rm×n along with a convex function h satisfying Assumption 1, and the optimal solution set X of (4.1)-(4.2) is non-empty. As discussed
in Section 3.1, for both `1,p -norm regularized linear regression and `1,p -norm regularized logistic regression, Assumption 1 can be justified. The reason we do not
impose Assumption 2(b) (boundedness of X ) is that it is automatically satisfied
by problem (4.1)-(4.2) if Assumptions 1 and 2(a) are satisfied. In fact, due to
Proposition 3.2.2, there exists a vector ȳ such that Ax = ȳ for all x ∈ X . Hence,
44
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
45
the optimal solution set can be expressed as
X = {x ∈ E | P (x) = v ∗ − h(ȳ)}.
Since now P (x) is the `1,p -regularizer and is thus a norm function, the boundedness of X is readily seen.
We will utilize the technical results developed in Chapter 3 to study the local
error bound for problem (4.1)-(4.3). In particular, we will testify whether the two
conditions in Theorem 3.4.1 are satisfied for this class of optimization problem.
4.2
Subdifferential of `p -Norm
In this section, we investigate the properties of the subdifferential of `p -norm.
4.2.1
`p -Norm with p ∈ (1, ∞)
Let us start with the scenario p ∈ (1, ∞).
Proposition 4.2.1 Suppose p ∈ (1, ∞) and q is the Hölder conjugate of p, i.e.,
1/p + 1/q = 1. Then, for any x ∈ Rn , the subdifferential of vector p-norm at x
has the following expression:
{s ∈ Rn | ksk ≤ 1}, if x = 0,
q
∂kxkp =
{c(x)},
if x 6= 0,
(4.4)
where c : Rn → Rn is a vector-valued function defined as
!− 1q
n
X
· (sgn(x1 )|x1 |p−1 , . . . , sgn(xn )|xn |p−1 ).
c(x) =
|xi |p
i=1
Moreover, for any x 6= 0, kc(x)kq = 1.
Proof The proof is straightforward. Firstly, note that the dual norm of k · kp is
k · kq . Hence, from Lemma 2.1.6, it is clear that
∂k0kp = {s ∈ Rn | kskq ≤ 1}.
In addition, for p ∈ (1, ∞), the vector p-norm is differentiable at anywhere but
0. So for x 6= 0, routine calculations on the gradients will result in the expression
of c(x). In addition, when x 6= 0, we have
!− 1q
n
X
kc(x)kq =
·
|xi |p
i=1
n
X
! 1q
|xi |(p−1)q
.
i=1
Since p and q are Hölder conjugates, from 1/p + 1/q = 1, we have p = (p − 1)q
and thus kc(x)kq = 1.
u
t
For any g ∈ Rn , let S(g) be a subset of Rn defined as
S(e) := {x ∈ Rn | e ∈ ∂kxkp }.
(4.5)
We now derive the explicit expression and properties of the set S(e) with p ∈
(1, ∞) for any e ∈ Rn .
Proposition 4.2.2 Suppose p ∈ (1, ∞). Then, for any e ∈ Rn , it follows that
if kekq > 1;
∅
S(e) =
{x | x = a · v(e), a ≥ 0}
{0}
if kekq = 1;
(4.6)
if kekq < 1,
where the vector-valued function v : Rn → Rn is defined by
q
q
p
p
v(e) := sgn(e1 )|e1 | , . . . , sgn(en )|en | .
(4.7)
In addition, for any e ∈ Rn , S(e), if not empty, is a polyhedral set.
Proof Suppose e ∈ ∂kxkp for some (x, e) ∈ Rn × Rn . Then, by Lemma 2.1.6, it
follows that
kekq ≤ 1,
(4.8a)
eT x ≥ kxkp .
(4.8b)
Hence, from (4.8a), if kekq > 1, then S(e) = ∅. From (4.8b), by using Hölder’s
inequality, we have
kxkp ≤ eT x ≤ kekq · kxkp .
46
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
47
Hence, if kekq < 1, we must have kxkp = 0, which is equivalent with x = 0;
and the equality must hold when kekq = 1. Recall that Hölder’s inequality takes
equality only when there exist a constant c > 0 such that xi = c · sgn(ei )|ei |q/p ,
which implies the existence of a > 0 such that x = a · v(e). Hence, we have
proved (4.6). Since for any e ∈ Rn such that S(e) is non-empty, S(e) is either a
half line with direction v(e) or the singleton {0}, we have S(e) is a polyhedral
set for any e ∈ Rn .
u
t
The vector-valued function v defined in (4.7) plays an important role in analysis, as it is key factor in describing how the set S(e) changes as e varies. Moreover,
for p ∈ (1, 2], the function v enjoys the following locally Lipschitz property.
Proposition 4.2.3 Suppose v : Rn → Rn is the vector-valued function defined
in (4.7). If p ∈ (1, 2], then, given any ē ∈ Rn and any constant > 0, there
exists a constant L > 0, depending on and ē, such that
kv(e) − v(ē)k2 ≤ Lke − ēk2
whenever ke − ēk2 ≤ .
(4.9)
Proof Let us define a function ṽ : R → R as
q
ṽ(a) := sgn(a)|a| p ,
∀a ∈ Rn .
In view of (4.7), it follows that v(e) = (ṽ(e1 ), . . . , ṽ(en )). Since p ∈ (1, 2], the
constant q ∈ [2, ∞) and thus q/p ≥ 1. Hence, it is easy to verify that ṽ satisfies
that given any ā ∈ R and any constant ˜ > 0, there exists a constant L̃ > 0 such
that
|ṽ(a) − ṽ(ā)| ≤ L̃|a − ā| whenever |a − ā| ≤ ˜.
Therefore, each component function of v is locally Lipschitz and so as v, which
proves (4.9).
u
t
Equipped with all the above technical results, we now explore the regularity
property of the subdifferetial of vector p-norm (as a set-valued mapping). In
the following theorem, we show that when p ∈ (1, 2], the metric subregularity of
∂k · kp is “always” satisfied.
Theorem 4.2.1 Suppose p ∈ (1, 2]. Then, for any (x, e) ∈ Gr(∂k · kp ), ∂k · kp
is metrically subregular at x for e.
Proof By definition of metric subregularity and (4.5), it suffices to prove that
for any (x̄, ē) ∈ Gr(∂k · kp ) satisfying ē ∈ ∂kx̄kp , there exist constants κ, , δ > 0
such that, with U = x̄ + δB and V = ē + B,
dist(x, S(ē)) ≤ κ · dist(ē, ∂kxkp ∩ V ),
∀x ∈ U.
(4.10)
Moreover, in order to verify (4.10), we only need to consider x ∈ U satisfying
e ∈ ∂kxkp for some e ∈ V , since otherwise ∂kxkp ∩ V = ∅ and the right-hand-side
of (4.10) is +∞, which implies that (4.10) holds trivially.
Note that if ē ∈ ∂kx̄kp for some x̄, by Proposition 4.2.2, kēkq ≤ 1, where q
is the Hölder conjugate of p. Then we consider the following two cases of ē: (a)
kēkq < 1 and (b) kēkq = 1.
(a). In this case, if (x̄, ē) ∈ Rn × Rn satisfies ē ∈ ∂kx̄kp , then, by Proposition 4.2.2, x̄ = 0. In other words, S(ē) = {0}. Now we let κ, δ be arbitrary
positive numbers and > 0 be small enough such that kekq < 1 for all
e ∈ V (such must exist since the set {e ∈ Rn | kekq < 1} is open).
With these scalars, we claim that (4.10) is satisfied. To substantiate the
claim, let us fix any x ∈ U satisfying e ∈ ∂kxkp for some e ∈ V . Since
V only contains e with kekq < 1, by Proposition 4.2.2, we have x = 0.
Hence, dist(x, S(ē)) = 0 and thus (4.10) holds trivially. In summary, for
any (x̄, ē) ∈ Gr(∂k · kp ) with kēk < 1, we have shown the existence of
κ, , δ > 0 such that (4.10) holds.
(b). In this case, by Proposition 4.2.2, the set S(ē) has the expression
{x | x = a · v(ē), a ≥ 0},
48
(4.11)
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
49
where v is the vector-valued function defined in (4.7). Now we fix arbitrary
, δ > 0, and show the existence of positive constant κ such that (4.10)
holds. Let us consider arbitrary fixed x ∈ U , which satisfies e ∈ ∂kxkp for
some e ∈ V . We divide our analysis into two cases: (b1) x = 0 and (b2)
x 6= 0.
(b1). Since x = 0, then by the expression of S(ē) in (4.11), we have x ∈ S(ē)
and thus
dist(x, S(ē)) = 0.
Hence, (4.10) holds for any κ > 0.
(b2). In this case, in summary we have x 6= 0 and e ∈ ∂kxkp for some e ∈ V .
Then, by Proposition 4.2.2, it follows that kekq = 1, ∂kxkp = {e} and
there exists a scalar α ≥ 0 such that
x = α · v(e),
where v is the vector-valued function defined in (4.7). Since S(ē) is of
the expression (4.11), the vector x̄ := α · v(ē) ∈ S(ē). Hence, we have
dist(x, S(ē)) ≤ kx − x̄k2 = αkv(e) − v(ē)k2
(4.12)
In the scenario p ∈ (1, 2], note that we have proved the locally Lipschitz
property of v in Proposition 4.2.3. Hence, there exists a constant
L > 0, depending on , such that
kv(e) − v(ē)k2 ≤ Lke − ēk2 ,
∀e ∈ V.
This, together with (4.12), implies that
dist(x, S(ē)) ≤ αLke − ēk2 .
Moreover, since ∂kxkp = {e}, the set ∂kxkp ∩V = {e} and dist(ē, ∂kxkp ∩
V ) = ke − ēk2 . It then follows that
dist(x, S(ē)) ≤ αL · dist(ē, ∂kxkp ∩ V ).
Note that in the above relationship, x is arbitrary in U but fixed, so the
constants α and L may depend on the choice of x. Thus to prove (4.10),
it remains to show for any x ∈ U , α, L are bounded above. In fact, L
has no dependence on the choice of x since it is determined by ē and
. For the constant α, recall that it is utilized to measure the “length”
of x, i.e., x = α · v(e). Since U is a bounded subset (a neighbourhood
of x̄) and kv(e)k2 is obviously bounded away from 0 for all e satisfying
kekq = 1, there exists a constant ᾱ such that for all x ∈ U , the constant
α is bounded above by ᾱ. Therefore, by letting κ = ᾱL, we prove the
existence of κ, , δ in (b2).
In summary, in both case (a) and case (b), there exist constants κ, , δ > 0 such
that (4.10) holds and the theorem is thus proved.
u
t
However, when p ∈ (2, ∞), the metric subregularity of ∂k · kp fails at certain
points. For example, consider the vector p-norm in R2 with p = 3, and we
investigate the metric subregularity of ∂k · k3 at x̄ = (1, 0) and ē = (1, 0).
Obviously, ē = ∂kx̄k3 (in fact, ∂kx̄k3 = {ē}). Let us consider the sequence {xk }
of R2 defined as
xk1
1
1 3
,
= 1−
k
xk2
13
1
.
=
k
Obviously, xk → x̄ as k → ∞. Moreover, xk1 converges to 1 at the rate Θ k1
1
and xk2 converges to 0 at the rate Θ k1 3 . Hence, xk converges to x̄ at the rate
1
Θ k1 3 (the slower one of the above two rates). On the other hand, let us define
ek ∈ R2 as
ek1
=
1
1−
k
23
,
ek2
23
1
=
.
k
It can be easily verified that ∂kxk k3 = {ek }. In addition, using similar arguments
2
to the convergence rate of xk , the convergence rate of ek to ē is Θ k1 3 . Since
1 2
1 3
1 3
=
o
, one can never find a constant κ > 0 such that
k
k
dist(xk , x̄) ≤ κ · dist(ē, ek ).
50
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
51
This implies that ∂k · k3 is not metrically subregular at (1, 0) for (1, 0).
4.2.2
`p -Norm with p = 1 and p = ∞
We now turn to study the subdifferential of the vector p-norm with p = 1 and
p = ∞. These two cases are special mainly due to the following fact. We omit
the proof since it is readily seen.
Fact 4.2.1 Suppose p = 1 or p = ∞. Then, for any r ≥ 0, the r-level set of
`p -norm is a polyhedral set (as a subset of Rn ). In addition, the epigraph of
`p -norm is a polyhedral set (as a subset of Rn × R).
Equipped with the above fact, we now derive the property of S(e) for p = 1 or
p = ∞.
Proposition 4.2.4 Suppse p = 1 or p = ∞, and for any e ∈ Rn , the set S(e)
is defined (4.5). Then, for any e ∈ Rn , S(e), if not empty, is a polyhedral set.
Proof Recall the definition of S(e),
S(e) = {x ∈ Rn | e ∈ ∂kxkp }.
By Lemma 2.1.6, we can express the set S(e) as
S(e) = {x ∈ Rn | kekq ≤ 1, eT x ≥ kxkp }.
Hence, the set S(e), if not empty, can be expressed as
S(e) = {x ∈ Rn | eT x ≥ t, kxkp ≤ t}.
Since both the sets {x ∈ Rn | eT x ≥ t} and {x ∈ Rn | kxkp ≤ t} are polyhedral,
S(e) is a polyhedral set. Therefore, we obtain the required result.
u
t
We next prove the following theorem, which regards the metric subregularity
of the subdifferential mapping of any polyhedral convex function.
Theorem 4.2.2 Suppose P is a convex function with polyhedral epigraph. Then,
for any (x, e) ∈ Gr(∂P ), ∂P is metrically subregular at x for ēe.
Proof We first show that the graph of ∂P is a finite union of polyhedral sets.
Towards that end, let us represent the epigraph of P as
epi(P ) = {(z, w) ∈ Rn × R | Cz z + Cw w ≤ d},
where Cw , d ∈ Rl , Cz ∈ Rl × Rn . We next make the following claim:
Claim. For any (x, e) ∈ Rn ×Rn , (x, e) ∈ Gr(∂P ) if and only if there exists s ∈ R
such that (x, s) is the optimal solution of the following linear programming,
min −eT z + w
(4.13)
s.t. Cz z + Cw w ≤ d.
For the clarity of the proof, we first proceed by assuming that the claim is true.
Since (4.13) is a linear programming, the so-called KKT condition is sufficient
and necessary, i.e., (x, s) is the optimal solution of (4.13) if and only if there
exists a vector λ ∈ Rl such that the triplet (x, s, λ) satisfies
CzT λ − e = 0,
Cz x + Cw s ≤ d,
1 + CwT λ = 0,
λ ≥ 0,
(4.14)
λT (Cz x + Cw s − d) = 0.
In view of the claim and the KKT condition (4.14), it follows that
Gr(∂P ) = (x, e) ∈ Rn × Rn |(x, s, λ) ∈ K(e) for some s ∈ R, λ ∈ Rl , (4.15)
where the set-valued mapping K : Rn ⇒ Rn × R × Rl is defined by
T
C
λ
−
e
=
0,
z
T
1
+
C
λ
=
0,
w
n
l
K(e) = (x, s, λ) ∈ R × R × R λ ≥ 0, .
C
x
+
C
s
≤
d,
z
w
T
λ (C x + C s − d) = 0.
z
52
w
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
53
Suppose (x, s, λ) ∈ K(e), then by the complementarity slackness, for all i =
1, . . . , l, there holds λi = 0 or (Cz x + Cw s − d)i = 0. Let us denote 1 as the
vector in Rl with all entries one, and V0,1 denote the set of vectors in Rl of which
the entries are either 1 or 0. Hence, we can express K(e) as
T
Cz λ − e = 0,
T
1 + Cw λ = 0,
λ
≥
0,
for some v ∈ V0,1 .
K(e) = (x, s, λ)
Cz x + Cw s ≤ d,
T
v
(C
x
+
C
s
−
d)
=
0,
z
w
T
(1 − v) λ = 0,
From the above expression, it is clear that by fixing any v ∈ V0,1 , (x, s, λ, e)
belongs to a polyhedral set. Since the cardinality of V0,1 is 2l , the graph of
the set-valued mapping K is a finite union of polyhedral sets. Moreover, (4.15)
implies that Gr(∂P ) is the projection of Gr(K) on Rn ×Rn . Hence, Gr(∂P ) is also
a finite union of polyhedral sets and ∂P is thus a polyhedral multifunction. In
addition, since ∂P and ∂P −1 share the identical graph, ∂P −1 is also a polyhedral
multifunction. Therefore, by Lemma 2.2.2, ∂P −1 is locally-ULC at any point in
its domain. Furthermore, by invoking Lemma 2.2.4, we obtain the required result
in Theorem 4.2.2.
It remains to prove the claim. If e ∈ ∂P (x), by definition,
P (z) ≥ P (x) + eT (z − x),
∀z ∈ dom(P ).
Upon rearranging,
P (x) − eT x ≤ P (z) − eT z ≤ w − eT z,
∀(z, w) ∈ epi(P ).
This implies that (x, P (x)) is an optimal solution of (4.13). On the other hand,
if (x, s) is an optimal solution, then s = P (x) because otherwise (x, P (x)) is a
feasible solution of (4.13) with lower objective value. Hence,
P (x) − eT x ≤ P (z) − eT z,
∀z ∈ domP.
By the definition of subgradient, there holds e ∈ ∂P (x). Therefore, the claim is
u
t
true.
In view of Fact 4.2.1 and Theorem 4.2.2, it is immediate to have the following
result.
Corollary 4.2.1 Suppose p = 1 or p = ∞. Then, for any (x, e) ∈ Gr(∂k · kp ),
∂k · kp is metrically subregular at x for e.
We end this section by summarizing the results obtain above into the following
corollaries.
Corollary 4.2.2 Suppose p ∈ [1, 2] or p = ∞. Then, for any e ∈ Rn , the set
S(e), if not empty, is a polyhedral set.
Proof Combine the results in Proposition 4.2.2 and Proposition 4.2.4.
u
t
Corollary 4.2.3 Suppose p ∈ [1, 2] or p = ∞. Then, for any (x, e) ∈ Gr(∂k·kp ),
∂k · kp is metrically subregular at x for e.
Proof Combine the results in Theorem 4.2.1 and Corollary 4.2.1.
4.3
u
t
Main Results
In this section, we will prove the main results in this chapter: the local error
bound for problem (4.1)-(4.3) always holds when p ∈ [1, 2] or p = ∞, while when
p ∈ (2, ∞), such error bound can fail for certain instances of (4.1)-(4.3).
Recall that the `1,p -regularizer has the form
X
P (x) =
ωJ kxJ kp ,
J∈J
54
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
55
where for each J ∈ J , kxJ kp is the vector p-norm of xJ in RnJ and J is a
non-overlapping partition. Hence, it follows that for any x ∈ Rn ,
∂P (x) =
Y
ωJ ∂kxJ kp ,
(4.16)
J∈J
where
Q
is the Cartesian product of sets (note that for each J, ∂kxJ kp is a subset
of RnJ ). In view of this fact, we have the following result.
Proposition 4.3.1 Suppose p ∈ [1, 2] or p = ∞. Then, for any (x, e) ∈ Gr(∂P ),
the set SP (e) := {x ∈ Rn | e ∈ ∂P (x)} is a polyhedral set and ∂P is metrically
subregular at x for e.
Proof By (4.16), for any e ∈ Rn , we have eJ ∈ ωJ ∂kxJ kp for all J ∈ J . Let us
denote SJ (eJ ) := {x ∈ RnJ | eJ ∈ ωJ kxkp }. Hence, in view of the definition of
SP (e), we have
SP (e) =
Y
SJ (eJ ).
J∈J
By Corollary 4.2.2, for p ∈ [1, 2] or p = ∞, SJ (eJ ) are polyhedral sets for all J.
Hence, SP (e) is also a polyhedral set.
By Corollary 4.2.3, it follows that for any (xJ , eJ ) ∈ RnJ × RnJ with eJ ∈
∂kxJ kp , ∂k · kp is metrically subregular at xJ for eJ . In view of this and the
Cartesian production (4.16), the second part of Proposition 4.3.1 is readily seen
if the following fact is true.
Fact 4.3.1 Let Γi : Rni ⇒ Rni , i = 1, . . . , n be n set-valued mappings. Suppose
Q
for each i, Γi is metrically subregular at xi ∈ Xi for yi ∈ Yi , then Γ := ni=1 Γi
is metrically subregular at (x1 , . . . , xn ) for (y1 , . . . , yn ).
We now give a proof to this fact. Since for each i, Γi is metrically subregular at
xi for yi , by definition, there exist constants κi , δi , i such that, with Ui = xi +δi B
and Vi = yi + i B,
dist(x̃i , Γ−1
i (yi )) ≤ κi · dist(yi , Γi (x̃i ) ∩ Vi ),
∀x̃i ∈ Ui .
Hence, for any x̃i ∈ Ui , it follows that
dist((x̃1 , . . . , x̃n ), Γ−1 (y1 , . . . , yn ))2
= dist((x̃1 , . . . , x̃n ),
n
Y
2
Γ−1
i (yi ))
i=1
=
n
X
2
dist(x̃i , Γ−1
i (yi ))
i=1
2
≤κ ·
n
X
dist(yi , Γi (x̃i ) ∩ Vi )2
i=1
2
= κ · dist((y1 , . . . , yn ),
n
Y
i=1
Γi (x̃i ) ∩
n
Y
Vi )2
i=1
= κ2 · dist((y1 , . . . , yn ), Γ(x̃1 , . . . , x̃n ) ∩ V )2
Q
where κ = max{κi | i = 1, . . . , n} and V = ni=1 Vi . Therefore, by taking
the square root on both sides of the above relationship, we obtain that for any
Q
(x̃1 , . . . , x̃n ) ∈ U := ni=1 Ui , which is a neighbourhood of (x1 , . . . , xn ), it follows
that
dist((x̃1 , . . . , x̃n ), Γ−1 (y1 , . . . , yn )) ≤ κ · dist((y1 , . . . , yn ), Γ(x̃1 , . . . , x̃n ) ∩ V ).
By definition, Γ is metrically subregular at (x1 , . . . , xn ) for (y1 , . . . , yn ) and the
u
t
fact is true.
Equipped with Proposition 4.3.1, we are now ready to present the main result
of this chapter.
Theorem 4.3.1 For `1,p -regularized optimization (4.1), suppose p ∈ [1, 2] or
p = ∞, f is of form (4.3) and Assumptions 1 and 2(b) are satisfied. Then, the
local error bound holds.
Proof As we stated before, for `1,p -regularized optimization, if f takes form (4.3)
and Assumptions 1 and 2(b) are satisfied, then Assumption 2(a) is also satisfied.
Hence, all the assumptions in Theorem 3.4.1 are satisfied. Equipped with Theorem 3.4.1, it now suffices to prove that for any ȳ ∈ Rm and any −ḡ ∈ Rn ,
56
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
57
(C1) the collection {Cf , Cg } are bounded linear regular;
(C2) for any x ∈ Rn satisfying −ḡ ∈ ∂P (x), ∂P is metrically subregular at x for
−ḡ.
Since Cf is obviously a polyhedral set, and Cg is also polyhedral by Proposition 4.3.1, part (C1) is immediate followed by Lemma 2.2.5. In addition, part
(C2) is true due to Proposition 4.3.1. Therefore, the local error bound holds by
u
t
Theorem 3.4.1.
We remark that the result in Theorem 4.2.1 includes the result in [72] as a
special case, which only corresponds to the case p = 2. In what follows, we will
construct a problem instance of (4.1)-(4.3) to demonstrate that the local error
bound can fail when p ∈ (2, ∞).
Example (Error bound fails when p ∈ (2, ∞)). Consider the following problem:
min2
x∈R
1
kAx − bk2 + kxkp ,
2
(4.17)
where A = [1, 0], b = 2. It is obvious that this problem satisfies Assumptions 1
and 2. In addition, the optimal value and optimal solution set of (4.17) can be
calculated explicitly.
Proposition 4.3.2 Consider problem (4.17) with p ∈ (2, ∞). The optimal value
is v ∗ = 3/2 and the optimal solution set is given by X = {(1, 0)}.
Proof For simplicity and consistency, let f (x) = 12 kAx − bk2 and P (x) = kxkp ,
where p ∈ (2, ∞). We first show that x̄ = (1, 0) is an optimal solution to
problem (4.17). Indeed, we have
∇f (x̄) = (−1, 0),
∂P (x̄) = (1, 0).
Thus, 0 ∈ ∇f (x̄) + ∂P (x̄), which implies the optimality of x̄. Next, we show
that x̄ = (1, 0) is the only optimal solution to problem (4.17), i.e., X = {x̄}. Let
x̃ ∈ X be arbitrary. By Proposition 3.2.2, Ax is invariant over X . Thus, we have
Ax̃ = Ax̄ = 1,
which implies that x̃1 = 1. Moreover, since ∇f (x) is also invariant over X , we
have ∇f (x̃) = ∇f (x̄) = (−1, 0). Now, the optimality of x̃ yields (1, 0) ∈ ∂P (x̃).
This, together with Proposition 4.2.2, implies that x̃ is a non-negative multiple
of (1, 0). Since x̃1 = 1, we conclude that x̃ = (1, 0) = x̄, as desired. Finally, we
have v ∗ = f (x̄) + P (x̄) = 3/2.
u
t
Now, let {δk }k≥0 be a sequence converging to zero; i.e., δk = o(1). For simplicity,
we assume that δk > 0 for all k ≥ 0. Consider the sequence {xk }k≥0 with
1
xk1
1
q
:= 2 − (1 − δk ) ,
xk2
:=
2 − (1 − δk ) q
(1 − δk )
1
p
1
1
· δkp + δkq ,
where q is the Hölder conjugate of p. Since δk → 0, the sequence xk converges
to X . Our goal now is to show that kR(xk )k = o(d(xk , X )) when p ∈ (2, ∞).
To begin, observe that xk1 converges to 1 at the rate Θ(δk ) and xk2 converges
1/p
1/p
to 0 at the rate Θ(δk ) (note that when p ≥ 1, δk = O(δk )). Thus, we have
1/p
d(xk , X ) = Θ(δk ).
Next, we need to compute R(xk ). This is done in the following lemma.
1/q
Lemma 4.3.1 For the sequence {xk }k≥0 defined above, we have R(xk ) = (0, −δk ).
Proof By definition of R(xk ), we have
0 ∈ ∇f (xk ) + R(xk ) + ∂P xk + R(xk ) .
Adding xk to both sides and rearranging, we get
xk − ∇f (xk ) ∈ xk + R(xk ) + ∂P xk + R(xk ) ,
(4.18)
which is a relationship of the form u ∈ (I + ∂P )(z). Since ∂P is a maximal
monotone operator (see, e.g., [46]), a result of Minty [45] states that given any
58
CHAPTER 4. ERROR BOUNDS FOR `1,P -REGULARIZATION
59
u ∈ Rn , there exists a unique vector z = z(u) ∈ Rn such that u ∈ (I + ∂P )(z(u)).
1/q
Thus, it remains to show that R(xk ) = (0, −δk ) satisfies (4.18).
To begin, we use the definition of xk and the fact that ∇f (x) = (x1 − 2, 0) to
compute
1
xk − ∇f (xk ) = (2, xk2 ) =
2,
2 − (1 − δk ) q
1
(1 − δk ) p
1
p
1
q
!
· δk + δk
.
1/q
Now, let z k = xk + (0, −δk ). Then,
1
1
q
zk =
2 − (1 − δk ) ,
1
=
2 − (1 − δk ) q
1
(1 − δk ) p
2 − (1 − δk ) q
1
(1 − δk ) p
1
p
!
· δk
1
1
p
(1 − δk ) p , δk .
Using Proposition 4.2.1, it can be verified that for p ∈ (2, ∞),
p−1
1
p−1
1
p
q
k
p
q
∂P (z ) = (1 − δk ) , δk
= (1 − δk ) , δk .
It follows that
1
z k + ∂P (z k ) =
2,
2 − (1 − δk ) q
(1 − δk )
1
p
1
p
1
q
!
· δk + δk
= xk − ∇f (xk ).
(4.19)
1/q
Upon comparing (4.18) and (4.19), we conclude that R(xk ) = (0, −δk ), as
u
t
desired.
Since 1/p < 1/q when p ∈ (2, ∞), we have
1/q
δk
=
1/p
o(δk ).
It follows from
Lemma 4.3.1 that when p ∈ (2, ∞),
1
1
q
k
kR(x )k = Θ δk = o δkp = o d xk , X ,
which shows that the EB condition fails for problem (4.17).
4.4
Conclusions
We end this chapter by a brief conclusion. In this chapter, we explore the local
error bound for `1,p -regularized optimization and prove that for structured function f (see (4.3)), under Assumptions 1 and 2(a) (in this setting, Assumption
2(b) is automatically implied), the local error bound always hold when p ∈ [1, 2]
and p = ∞, while it fails for some instances of `1,p -regularized optimization
when p ∈ (2, ∞). Hence, we completely characterize the local error bound for
`1,p -regularization with p ∈ [1, ∞]. Our result extends the pioneering work by
Tseng [72], which corresponds to the special case of p = 2. By contrast, our
analysis approach is novel and grasp the essence of the local error bound. Furthermore, it is worth noting the reason why error bound can fail when p ∈ (2, ∞).
That is because the subdifferential mapping ∂P does not satisfy the metric subregularity at certain points. Hence, if the optimal solution set of a problem
instance contains certain points, the local error bound can fail.
2 End of chapter.
60
Chapter 5
Error Bounds for Nuclear Norm
Regularization
Summary
In this chapter, we apply the theoretical tools and results developed in Chapter 3 to explore the local error bound for optimization
problems with nuclear norm regularization. We will show that for
this class of problems, the local error bound holds if a strict complementarity condition is satisfied.
5.1
Introduction
The problem of finding a low–rank matrix that (approximately) satisfies a given
set of conditions has recently generated a lot of interest in many communities.
Indeed, such a problem arises in a wide variety of applications, including approximation algorithms [62], automatic control [22], matrix classification [68], matrix
completion [23], multi–label classification [1], multi–task learning [3], network
localization [28], subspace learning [76], and trace regression [31], just to name
61
a few. Due to the combinatorial nature of the rank function, the task of recovering a matrix with the desired rank and properties is generally intractable. To
circumvent this, a popular approach is to use the trace norm1 (also known as the
nuclear norm) as a surrogate for the rank function. Such an approach is quite
natural, as the trace norm is the tightest convex lower bound of the rank function over the set of matrices with spectral norm at most one [52]. In the above
mentioned application domains, the trace norm is typically used as a regularizer
in the minimization of certain convex loss function. This gives rise to convex
optimization problems of the form
min F (X) := f (X) + P (X),
X∈Rm×n
(5.1)
where f is a continuously differentiable convex function and P : S n → R is the
nuclear norm regularizer defined as
P (X) = τ
n
X
σi (X) ∀X ∈ Sn .
(5.2)
i=1
In the above formula, τ > 0 is the regularization parameter, σi (X)’s are the
singular values of X. Note that the nuclear norm regularizer P (x) is a norm
function and P (x) is thus convex.
In this chapter, we will explore the local error bound (1.7) for nuclear norm
regularized optimization. In particular, we will focus on the problems defined on
the symmetric matrices space S n as follows:
min F (X) := f (X) + P (X),
X∈Sn
(5.3)
where f : S n → R is a continuously differentiable convex function on S n and
P : S n → R is the nuclear norm regularizer on S n defined as, with λi (X)’s being
the eigenvalues of X,
P (X) = τ
n
X
|λi (X)| ∀X ∈ Sn .
i=1
1
Recall that the trace norm of a matrix is defined as the sum of its singular values.
62
(5.4)
CHAPTER 5. NUCLEAR NORM REGULARIZATION
63
We remark that problem (5.3)-(5.4) is a symmetric version of nuclear norm regularized optimization (5.1)-(5.2), i.e., the problem is defined on symmetric matrices space Sn rather than general matrices space Rm×n . This allows us to
manipulate the eigenvalues and eigenvectors rather than singular values and singular vectors of a matrix, which can simplify the proofs while remain the key
ideas of the derivation. Extensions of the results developed in this chapter to
general matrices spaces Rm×n will be left for future work.
Throughout this chapter, we will assume that the smooth function f of (5.3)
takes the form
f (X) = h(A(X)),
∀X ∈ Sn
(5.5)
for some linear operator A : Sn → Rl along with a convex function h satisfying
Assumption 1. In addition, we assume that the optimal solution set X of (5.3)(5.4) is non-empty, i.e.Assumption 2(a). As discussed in Section 3.1, for h being
either the least square loss function or the logistic loss function, Assumption 1
can be justified. Moreover, using similar arguments as in Section 4.1, Assumption 2(b) (boundedness of X ) is automatically satisfied by problem (5.3)-(5.4)
if Assumptions 1 and 2(a) are satisfied. Hence, we do not need to additionally
impose Assumption 2(b) throughout the analysis in this chapter.
The rest of this chapter will be organized as follows. In Section 5.2, we
will provide some preliminary results on eigenvalues and eigenvectors, as well
as the subdifferential of nuclear norm on the symmetric matrices space Sn . In
Section 5.3, we prove that the subdifferential of the nuclear norm regularizer P
is metrically subregular at any point in its graph. In addition, by utilizing the
results developed in Chapter 3, we show that the local error bound of nuclear
norm regularized optimization holds if a constraint qualification is satisfied. In
addition, an example is constructed to demonstrate the necessity of the strict
complementarity condition.
For the sake of clarity, let us end this section by introducing some notations
that will be used throughout this chapter. We will let In be the n × n identity
matrix. We denote S n as the space of n × n real symmetric matrices, S+n as
the cone of n-dimensional positive semidefinite matrices and S+n as the cone of
n-dimensional negative semidefinite matrices. We denote On as the set of all
n × n unitary matrices, i.e., On = {U ∈ Rn×n | U T U = In . For any matrix
X ∈ S n , let kXk∗ , kXkF , kXkop , Tr(X) be the nuclear norm, Frobenius norm,
operator norm, trace of X, respectively. For any two matrices X, Y ∈ S n , let
the inner product hX, Y i := Tr(XY ).
5.2
Preliminaries
In this section, we will derive some preliminary properties of eigenvalues and
eigenvectors as well as properties of the subdifferential of nuclear norm function,
which are needed for the coming analysis.
5.2.1
Eigenvalues and Eigenvectors
Given any X̄ ∈ S n , let λi (X), i = 1, . . . , n, be the eigenvalues of X̄ with nonascending order, i.e., λ1 (X̄) ≥ . . . ≥ λn (X̄). In addition, we define Λ(X̄) :=
Diag(λ1 (X̄), . . . , λn (X̄)), namely, Λ(X̄) is an n×n diagonal matrix with diagonal
being (λ1 (X̄), . . . , λn (X̄)). Hence, the eigenvalue decomposition of X̄ can be
represented as
X̄ = Ū Λ(X̄)Ū T ,
for some U ∈ On .
Since the unitary matrix Ū in the above formula is not necessarily unique, we
define the set O(X̄) as follows:
O(X̄) := Ū ∈ On | Ū T X̄ Ū = Λ(X̄) .
In other words, O(X̄) contains all the unitary matrices that can provide the
eigenvalue decomposition of X̄ with eigenvalues in non-ascending order. Though
64
CHAPTER 5. NUCLEAR NORM REGULARIZATION
65
Λ1 (X̄), . . . , Λn (X̄) are all the eigenvalues of X̄, some of them can be of multiplicity more than one. This motivates us to define ν̄1 ≥ . . . ≥ ν̄r be all the distinct
eigenvalues of X̄. Obviously it follows that r ≤ n. Let us define
αk := {i | λi (X̄) = ν̄k },
k = 1, . . . , r.
Hence, for each k, αk is a subset of the index set {1, . . . , n}. For any matrix
∆ ∈ S n , we denote ∆ij as the entry of ∆ located on the i-th row and j-th
column; we denote ∆αi αj as the sub-matrix of ∆ with the entries located on the
index set αi (for rows) and αj (for columns).
The following result regarding the eigenvectors of X̄ is obvious and can be
obtained by simple calculations.
Lemma 5.2.1 Suppose Ū , V̄ ∈ O(X̄) are distinct. Let us denote Q = Ū T V̄ .
Then, for all k, l = 1, . . . , r, it follows that
Q
k 6= l;
αk αl = 0,
T
Qα α QT
αk αk = Qαk αk Qαk αk = I|αk | , k = l.
k k
(5.6)
An immediate implication of the above result is that if all the eigenvalues of X̄
are distinct, the set O(X̄) is a singleton by ignoring “±” signs.
For any matrix A ∈ S n , λ is an eigenvalue of A if and only if it satisfies
det(A − λIn ) = 0,
(5.7)
where det(·) is the determinant of a matrix. By definition of determinant, eigenvalues are the roots of a polynomial of degree n with coefficients determined by
A. Recall that the roots of a polynomial depend continuously on the coefficients
of the polynomial, and the coefficients of (5.7) are continuous functions of A.
These give rise to the following continuity result of eigenvalues:
Lemma 5.2.2 Suppose that X̄ ∈ S n has eigenvalues λ1 (X̄) ≥ . . . ≥ λn (X̄). For
any H ∈ S n , let λ1 (X̄ + H) ≥ . . . ≥ λn (X̄ + H) be the eigenvalues of the matrix
X̄ + H. Then, for any > 0, there exists a constant δ > 0 such that
max |λi (X̄) − λi (X̄ + H)| ≤ whenever kHkF ≤ δ.
i
Furthermore, the following result characterizes the analytic properties of eigenvectors, of which the proof can be found in [16] and [65, Lemma 4.12].
Lemma 5.2.3 For any H ∈ S n , let P ∈ O(X̄ + H). Then, there exist constants
κ, δ > 0 such that
dist(P, O(X̄)) ≤ κkHkF
whenever kHkF ≤ δ.
Lemma 5.2.3 is equivalent with the statement that O(·), as a set-valued mapping
from S n to On , is locally upper Lipschtiz continuous at any point X̄ ∈ S (see [15,
Lemma 3.1]).
5.2.2
Subdifferential of Nuclear Norm on S n
Recall that on the symmetric matrices space S n , the nuclear norm of a matrix
X ∈ S n is defined as
kXk∗ =
n
X
|λi (X)|,
i=1
where λi (X)’s are the eigenvalues of X. As a convex function on S n , we now
derive the subdifferential of k · k∗ .
Lemma 5.2.4 Suppose X ∈ Sn admits
Λ
+
X =U 0
0
the following spectral decomposition,
0 0
T
0 0 U
0 Λ−
where Λ+ , Λ− are diagonal matrices of dimension n1 , n3 , respectively, and satisfies Λ+ ∈ Sn+1 , Λ− ∈ Sn−3 . Let n2 := n − n1 − n3 . Then, we have
I
0
0
n1
T n2
∂kXk∗ = U 0 Z
U
Z
∈
S
,
kZk
≤
1
.
0
op
0 0 −In3
66
(5.8)
CHAPTER 5. NUCLEAR NORM REGULARIZATION
67
Proof Due to Lemma 2.1.6, for any X ∈ S., it follows that
∂kXk∗ = {T ∈ Sn | kT k ≤ 1, hT, Xi = kXk∗ }.
(5.9)
We will then prove Lemma 5.2.4 by showing the equivalence between (5.9)
and (5.8). Firstly, let Z̄ ∈ S n be any matrix in the set given by (5.9). Then,
there exists a matrix Z ∈ S n2 with kZkop ≤ 1, such that
I
0
0
n1
T
Z̄ = U 0 Z
0 U .
0 0 −In3
Since kZkop ≤ 1, we have kZ̄kop ≤ 1. In addition,
hZ̄, Xi = Tr(Λ+ ) − Tr(Λ− ) =
n
X
|λi (X)| = kXk∗ .
i=1
Hence, we have verified that Z̄ is also in the set given by (5.8). On the other
hand, let T ∈ Sn be in the set (5.9). Suppose T has the following structure,
T
T
T
11 12 13
T = U T21 T22 T23 U T .
T31 T32 T33
Then by hT, Xi = kXk∗ and kXk∗ = kΛ1 k∗ + kΛ2 k∗ , we have
kΛ1 k∗ + kΛ2 k∗ = hT, Xi = Tr(T11 Λ1 ) + Tr(T33 Λ2 )
≤ kT11 k · kΛ1 k∗ + kT33 k · kΛ2 k∗
≤ kΛ1 k∗ + kΛ2 k∗ ,
where the first inequality is by Cauchy-Schwartz inequality and the second inequality is because kT k ≤ 1 leads to kT11 k ≤ 1 and kT33 k ≤ 1. The above implies
that the two inequality should take equality, i.e.,
Tr(T11 Λ1 ) = kΛ1 k∗ , kT11 k = 1,
and Tr(T33 Λ2 ) = kΛ2 k∗ , kT33 k = 1.
i
, i = 1, . . . , n1 be the diagonal entries of T11 . Since kT11 k ≤ 1, we have
Let T11
i
T11
≤ 1, i = 1, . . . , n1 . Also, since Λ1 0, we have Λi1 > 0, i = 1, . . . , n1 and
P 1 i
kΛ1 k∗ = ni=1
Λ1 . So we obtain
n1
X
Λi1
=
i=1
n1
X
i
T11
Λi1
≤
i=1
n1
X
Λi1 ,
(5.10)
i=1
i
which implies that T11
= 1, i = 1, . . . , n1 . This together with kT11 k ≤ 1 gives us
T11 = In1 . Similarly, we will obtain T33 = −In2 . Moreover, since kT k ≤ 1, we
get T12 = T21 = T13 = T31 = T23 = T32 = 0 and kT22 k ≤ 1. This shows that T is
u
t
also in the set (5.8) and completes the proof.
For any Ē ∈ Sn , let us define the set C(Ē) as
C(Ē) := (∂P )−1 (Ē) = {Z ∈ Sn | Ē ∈ ∂kZk∗ }.
From Lemma 5.2.4, it is immediate that for any Ē ∈ Sn , the set C(Ē) is nonempty if and only if kĒkop ≤ 1. Suppose kĒkop ≤ 1 and Ē admits the spectral
decomposition as follows,
Ē1 0 0
Ē = Ū 0 Ē2 0 Ū T
0 0 Ē3
(5.11)
where Ū ∈ O(Ē) and Ē1 , Ē2 , Ē3 are diagonal matrices satisfying
Ē1 = In̄1 ,
−In̄2 ≺ Ē2 ≺ In̄2 ,
Ē3 = −In̄3 ,
n̄1 + n̄2 + n̄3 = n.
(5.12)
Note that for each block Ēi , i = 1, 2, 3, the eigenvalues of Ēi are distinct from
the other two. Hence, by Lemma 5.2.1, if Ũ ∈ O(Ē), there must exist unitary
matrices Pn̄1 ∈ On̄1 , Pn̄2 ∈ On̄2 , Pn̄3 ∈ On̄3 such that
P
0
0
n̄1
Ũ = Ū 0 Pn̄2 0
0
0 Pn̄3 .
Moreover, by Lemma 5.2.3, we have the following result.
68
(5.13)
CHAPTER 5. NUCLEAR NORM REGULARIZATION
69
Proposition 5.2.1 There exist constants κ, δ > 0 such that for any U ∈ O(E)
with E ∈ Ē + δB, there exist unitary matrices Pn̄1 ∈ On̄1 , Pn̄2 ∈ On̄2 , Pn̄3 ∈ On̄3 ,
all of which depend on E and U , satisfying
P
0
0
n̄1
U − Ū 0 Pn̄2 0 ≤ κkE − ĒkF .
0
0 Pn̄3 F
Proof Due to Lemma 5.2.3, there exist constants κ, δ > 0 such that for any
U ∈ O(E),
dist(U, O(Ē)) ≤ κkE − ĒkF
whenever kE − ĒkF ≤ δ.
Since O(Ē) is a compact set, there exist a unitary matrix Ũ ∈ O(Ē) such
that kU − Ũ k = dist(U, O(Ē)). Then the required result is readily seen by the
expression of Ũ in (5.13).
u
t
According to the spectral structure of Ē in (5.11), for any matrix Z ∈ Sn , we
can divide Z into blocks as follows:
Z11 Z12 Z13
Z = Z21 Z22 Z23
Z31 Z32 Z33
(5.14)
where Z11 ∈ Sn̄1 , Z22 ∈ Sn̄2 , Z33 ∈ Sn̄3 and all the other blocks have coherent
dimensions. We denote I as the index set of the blocks, i.e., I = {(i, j) |
1 ≤ i, j ≤ 3} and I c as the index set of blocks but Z11 and Z33 , i.e., I c =
I \ {(1, 1), (3, 3)}. Equipped with these notations, we are ready to present the
following result.
Lemma 5.2.5 Suppose kĒkop ≤ 1 and Ē admits the spectral decomposition as
in (5.11). Let C̃(Ē) be the set defined as
C̃(Ē) = Z ∈ Sn | Z11 ∈ Sn̄+1 , Z33 ∈ Sn̄−3 , Zij = 0, for all (i, j) ∈ I c .
(5.15)
Let Ū be any unitary matrix in O(Ē). Then, we have
n
o
C(Ē) = Ū Z Ū T | Z ∈ C̃(Ē) .
(5.16)
Proof We first show that the expression (5.16) is well-defined, i.e., for any
n
o
T
Ū ∈ O(Ē), the set Ū Z Ū | Z ∈ C̃(Ē) is invariant. Indeed, if Ũ ∈ O(Ē)
while Ū 6= Ũ , there exist unitary matrices Pn̄1 ∈ On̄1 , Pn̄2 ∈ On̄2 , Pn̄3 ∈ On̄3 such
that (5.13) holds. As the matrices in C̃(Ē) is also of the same block structure
with (5.11), for any Z ∈ C̃(Ē), we have Ū Z Ū T = Ũ Z Ũ T . Hence, (5.16) is
well-defined.
It is obvious that if X = Ū Z Ū T for some Z ∈ C̃(Ē), X must belong to
C(Ē). For the other direction, suppose X ∈ C(Ē) admits the following spectral
decomposition,
Λ
0
0
+
X = U 0 0 0 UT ,
0 0 Λ−
where Λ+ , Λ− are diagonal matrices of dimensions n1 , n3 , respectively, and Λ+ 0, Λ− ≺ 0. Let us denote n2 := n − n1 − n3 . By Lemma 5.2.4, we have
I
0
0
n1
T
n2
U
T
∈
S
.
,
kT
k
≤
1
∂kXk∗ = U 0 T
0
op
0 0 −In3
(5.17)
Since X ∈ C(Ē), Ē ∈ ∂kXk∗ . So there exists a symmetric matrix T̄ ∈ S n2
satisfying kT̄ kop ≤ 1 and
I
0
0
Ē
0 0
n1
1
T
T
U 0 T̄
0 U = Ē = Ū 0 Ē2 0 Ū .
0 0 −In3
0 0 Ē3
(5.18)
It is worth noting here that since ni is not necessarily equal to n̄i , i = 1, 2, 3,
the block structures of the left hand side and right hand side of (5.18) are not
70
CHAPTER 5. NUCLEAR NORM REGULARIZATION
71
necessarily the same. Suppose T̄ admits the following spectral decomposition,
T̄ = P̄n2 ΛT̄ P̄nT2 ,
P̄n2 ∈ On2 .
Then equation (5.18) implies
0 In1 0
0
Ē1 0 0
In1 0
0 Ē2 0 =Ū T U 0 P̄n2
0
0 0 ΛT̄
0
0 −In3
0 0 Ē3
0
0 −In3
I
0
0
n1
T
· 0 P̄nT2
0 U Ū .
0
0 −In3
Since the eigenvalues of a matrix are unique,
I
Ē
0 0
n1
1
0 Ē2 0 = 0
0
0 0 Ē3
(5.19)
we obtain from (5.19) that
0
0
ΛT̄
0 .
0 −In3
Since Ē1 , Ē2 , Ē3 are diagonal matrices with structure (5.23) and kΛT̄ kop ≤ 1,
the above relationship implies that n1 ≤ n̄1 , n2 ≥ n̄2 , n3 ≤ n̄3 . Moreover, since
Ē1 , Ē2 , Ē3 has distinct eigenvalues, by (5.19) and Lemma 5.2.1, there exists orthogonal matrices Pn̄1 ∈ On̄1 , Pn̄2 ∈ On̄2 , Pn̄3 ∈ On̄3 such that
I
0
0
P
0
0
n1
n̄1
Ū T U 0 P̄n2
0 = 0 Pn̄2 0 .
0
0 −In3
0
0 Pn̄3
(5.20)
So we have
Λ1 0 0
Ū T X Ū = Ū T U 0 0 0 U T Ū
0 0 Λ2
0 In1 0
0 Λ1 0 0
Pn̄1 0
= 0 Pn̄2 0 0 P̄nT2
0 0 0 0
0
0 Pn̄3
0
0 −In3
0 0 Λ2
T
0 Pn̄1 0
0
In1 0
· 0 P̄n2
0 0 Pn̄T2 0
T
0
0 −In3
0
0 Pn̄3
T
0 Λ1 0 0 Pn̄1 0
0
Pn̄1 0
= 0 Pn̄2 0 0 0 0 0 Pn̄T2 0 ,
0
0 Pn̄3
0 0 Λ2
0
0 Pn̄T3
where the first equality is by the spectral decomposition of X and the second
equality is due to the identity (5.20). Since Λ+ ∈ Sn+1 , Λ− ∈ Sn−3 and n1 ≤ n̄1 ,
n3 ≤ n̄3 , the above relationship leads to
Z
0
0
11
T
Ū X Ū = 0 0 0 ,
0 0 Z33
Hence, the relationship (5.16) is proved.
5.3
Z11 ∈ Sn̄+1 , Z33 ∈ Sn̄−3 .
u
t
Main Results
Equipped with all the preliminaries required, we are now ready to present the
main results for local error bound. Similar to Chapter 4, we will utilize Theorem 3.4.1 developed in Chapter 3 and testify whether the two conditions given in
that theorem holds or not for nuclear norm regularized optimization (5.3)-(5.4).
72
CHAPTER 5. NUCLEAR NORM REGULARIZATION
5.3.1
73
Bounded Linear Regularity
Let us first recall the notations in Theorem 3.4.1. For nuclear-norm regularization
(5.3)-(5.4), the two sets Cf and CP are defined by
Cf := {Z ∈ Sn | A(Z) = ȳ},
CP := {Z ∈ Sn | −ḡ ∈ ∂P (Z)},
for some ȳ ∈ Rm and ḡ ∈ S n , where P (Z) = τ kZk∗ for some given parameter
τ > 0. As indicated by Lemma 5.2.5, the set CP corresponds to a linear matrix
inequality (LMI) and thus is not necessarily a polyhedral. Hence, by Lemma
2.2.5, the constraint qualification
ri(CP ) ∩ Cf 6= ∅
is necessary for the bounded linear regularity of the collection {Cf , CP }. Moreover, we have the following result.
Proposition 5.3.1 Suppose in the nuclear norm regularized problem (5.3)-(5.4),
the function f is of the form (5.5) and Assumptions 1 and 2(a) are satisfied. If
there exists X ∗ ∈ X such that
0 ∈ ∇f (X ∗ ) + ri(∂P (X ∗ )),
(5.21)
the condition (C1) in Theorem 3.4.1 is satisfied.
Proof By Proposition 3.2.2, there exists (ȳ, ḡ) such that
A(X) = ȳ,
∇f (X) = ḡ,
for all X ∈ X .
Hence, if X ∗ ∈ X satisfies (5.21), we have −ḡ ∈ ri(∂P (X ∗ )) and thus k−ḡkop ≤ τ .
Suppose −ḡ admits the following spectral decomposition
Ḡ1 0 0
−ḡ = Ū 0 Ḡ2 0 Ū T
0 0 Ḡ3
where Ḡ1 , Ḡ2 , Ḡ3 are diagonal matrices satisfying
Ḡ1 = τ In̄1 ,
−τ In̄2 ≺ Ḡ2 ≺ τ In̄2 ,
Ḡ3 = −τ In̄3 ,
n̄1 + n̄2 + n̄3 = n.
Since −ḡ ∈ ri(∂P (X ∗ )), by Lemma 5.2.5, it can be verified that
3
1
, Zij = 0, (i, j) ∈ I c .
, Z33 ∈ Sn̄−−
X ∗ ∈ Ū T Z Ū | Z11 ∈ Sn̄++
In view of this, by using Lemma 5.2.5 again, it follows that
X ∗ ∈ ri((∂P )−1 (−ḡ)) = ri(CP ).
Hence, we have X ∗ ∈ Cf as well as X ∗ ∈ ri(CP ). By Lemma 2.2.5, the collection
{Cf , CP } is bounded linear regular and thus the condition (C1) in Theorem 3.4.1
u
t
is satisfied.
5.3.2
Metric Subregularity of ∂P
Recall that the second condition (C2) in Theorem 3.4.1 is the metric subregularity
of ∂P . In the following theorem, we show that when P is the nuclear norm
regularizer, ∂P is “always” metrically subregular.
Theorem 5.3.1 For any (X̄, Ē) ∈ Gr(∂P ), i.e., Ē ∈ ∂P (X̄), the multifunction
∂P is metrically subregular at X̄ for Ē.
Proof Without loss of generality, let us assume P (X) = kXk∗ , namely, the
regularization parameter τ = 1. Since (X̄, Ē) ∈ Gr(∂P ), kĒkop ≤ 1. Suppose
the spectral decomposition of Ē is given by (5.11). Then by the continuity of
eigenvalues (see Lemma 5.2.2), there exists a constant δ1 > 0 such that for any
E ∈ Ē +δ1 B, kEkop > 1 (in which case (∂P )−1 (E) = ∅) or E admits the following
eigenvalue decomposition:
E
0 0
1
E = U 0 E2 0 U T ,
0 0 E3
74
(5.22)
CHAPTER 5. NUCLEAR NORM REGULARIZATION
75
where E1 , E2 , E3 are diagonal matrices satisfying
E1 = In1 , −In2 ≺ E2 ≺ In2 , E3 = −In3 ,
n1 +n2 +n3 = n and n1 ≤ n̄1 , n3 ≤ n̄3 .
(5.23)
Hence, by Lemma 5.2.5, for any X ∈ C(E) = {Z ∈ Sn | E ∈ ∂P (Ē)}, we have
X = U ZU T for some Z satisfying
Z11 0 0
Z = 0 0 0 ,
0 0 Z33
Z11 ∈ Sn+1 , Z33 ∈ Sn−2 .
(5.24)
Upon using Proposition 5.2.1, there exists constants δ2 , γ > 0 such that for any
E ∈ Ē + δ2 B, there exists unitary matrices Pn̄1 ∈ On̄1 , Pn̄2 ∈ On̄2 , Pn̄3 ∈ On̄3 , all
depending on E, satisfying
0
Pn̄1 0
U − Ū 0 Pn̄2 0 ≤ γkE − ĒkF .
0
0 Pn̄3 (5.25)
F
Since n1 ≤ n̄1 and n3 ≤ n̄3 , by the expression (5.16), it is easy to verify that the
matrix X ∗ defined by
T
P
P
0
0
0
0
n̄1
n̄1
X ∗ := Ū 0 Pn̄2 0 Z 0 Pn̄T2 0 Ū T
T
0
0 Pn̄3
0
0 Pn̄3
satisfies X ∗ ∈ C(Ē), where Z is given by (5.24). Therefore, by letting δ =
min{δ1 , δ2 }, for any X ∈ (∂P )−1 (E) with kE − ĒkF ≤ δ, we have
d(X, C(Ē)) ≤ kX − X ∗ kF
T
0 Pn̄1 0
0 Pn̄1 0
T
T
T
= U ZU − Ū 0 Pn̄2 0 Z 0 Pn̄2 0 Ū
0
0 Pn̄3
0
0 Pn̄T3
F
(5.26)
Pn̄1 0
0
≤ 2kZkF · U − Ū 0 Pn̄2 0
0
0 Pn̄3 F
≤ 2γkZkF · kE − ĒkF ,
where the second inequality is due to the fact that for any two unitary matrix
U1 and U2 , there holds
kU1 ZU1T −U2 ZU2T kF = kU1 ZU1T −U1 ZU2T +U1 ZU2T −U2 ZU2T kF ≤ 2kZ(U1 −U2 )T kF ,
and the last inequality is according to (5.25). Moreover, if X ∈ X̄ + B for some
> 0, we have kZkF = kXkF ≤ (1 + )kX̄kF . Thus, (5.26) leads to
d(X, C(Ē)) ≤ 2γ(1 + )kX̄kF · kE − ĒkF ,
whenever X ∈ X̄ + B and X ∈ (∂P )−1 (E) with kE − ĒkF ≤ δ. This implies
the following,
d(X, (∂P )−1 (Ē)) ≤ κd(Ē, ∂P (E) ∩ V ) whenever X ∈ U,
where κ = 2γ(1 + ), U = X̄ + B, V = Ē + B. By definition, ∂P is metrically
u
t
subregular at X̄ for Ē.
5.3.3
Error Bounds for Nuclear Norm Regularization
Equipped with Theorem 5.3.1 and Proposition 5.3.1, we can obtain the main
result in this section.
76
CHAPTER 5. NUCLEAR NORM REGULARIZATION
77
Theorem 5.3.2 Suppose in the nuclear norm regularized problem (5.3)-(5.4),
the function f is of the form (5.5) and Assumptions 1 and 2(a) are satisfied. If
there exists X ∗ ∈ X such that
0 ∈ ∇f (X ∗ ) + ri(∂P (X ∗ )),
(5.27)
Then the local error bound for (5.3)-(5.4) holds.
Proof The proof is straightforward. By Theorem 3.4.1, it is sufficient to verify
the conditions (C1) and (C2). As (C2) is guaranteed by Theorem 5.3.1 and (C1)
is satisfied due to (6.13) and Proposition 5.3.1, the local error bound for (5.3)u
t
(5.4) holds.
In the rest of this section, we will demonstrate that the error bound for (5.3)
may fail if no additional conditions like constraint qualification (6.13) are assumed. In particular, we will construct an example where the local error bound
for (5.3)-(5.4) fails due to the lack of condition (6.13). Let us consider the following problem,
min f (X) + kXk∗ .
X∈S2
(5.28)
Here we let f(X) = h(A(X)), where A : S^2 → R^2 is a linear operator defined by
$$
A(X) = (X_{11}, X_{22})^T.
$$
This gives us the adjoint operator of A, denoted by A^* : R^2 → S^2, as follows:
$$
A^*(y) = \begin{pmatrix} y_1 & 0 \\ 0 & y_2 \end{pmatrix}.
$$
In addition, h : R^2 → R is the following function:
$$
h(y) = \frac{1}{2}\left\| B^{1/2} y - B^{-1/2} d \right\|_2^2, \quad \text{with } B = \begin{pmatrix} 3/2 & -2 \\ -2 & 3 \end{pmatrix} \text{ and } d = \begin{pmatrix} 5/2 \\ -1 \end{pmatrix}.
$$
It is easy to check that B ≻ 0, so we can define the invertible matrix B^{1/2} ≻ 0,
and thus h(·) is well-defined and strongly convex. Moreover, we have
$$
\nabla h(y) = By - d.
$$
So, by the definition of A and A^*, we have for any matrix X ∈ S^2,
$$
\nabla f(X) = A^* \nabla h(A(X)) = A^*\!\left( \begin{pmatrix} 3/2 & -2 \\ -2 & 3 \end{pmatrix} \begin{pmatrix} X_{11} \\ X_{22} \end{pmatrix} - \begin{pmatrix} 5/2 \\ -1 \end{pmatrix} \right) = \begin{pmatrix} \tfrac{3}{2}X_{11} - 2X_{22} - \tfrac{5}{2} & 0 \\ 0 & -2X_{11} + 3X_{22} + 1 \end{pmatrix}. \qquad (5.29)
$$
Now let us look at the following matrix:
$$
\bar{X} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.
$$
By using the expression (5.29), we obtain
$$
\nabla f(\bar{X}) = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}.
$$
Also, it is easy to check that
$$
\partial \|\bar{X}\|_* = \{ Z \in S^2 \mid Z_{11} = 1, \; Z_{12} = Z_{21} = 0, \; Z_{22} \in [-1, 1] \}.
$$
So we obtain
$$
-\nabla f(\bar{X}) \in \partial \|\bar{X}\|_*,
$$
which implies that X̄ is an optimal point of (5.28), i.e., X̄ ∈ X. Moreover, by
the strong convexity of h(·), we know that A(X) is invariant over the optimal
solution set X. In other words, there exists a vector ȳ ∈ R^2 such that for all
X ∈ X, we have A(X) = ȳ. Since X̄ has been proved to be in X, we must have
ȳ = (1, 0)^T.
Now suppose X̃ is also an optimal point, i.e., X̃ ∈ X. Then A(X̃) = ȳ, which
implies that X̃_{11} = 1 and X̃_{22} = 0.
In addition, we have
$$
\nabla f(\tilde{X}) = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}.
$$
By the optimality condition, we must have
$$
-\nabla f(\tilde{X}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \in \partial \left\| \begin{pmatrix} 1 & \tilde{X}_{12} \\ \tilde{X}_{12} & 0 \end{pmatrix} \right\|_*. \qquad (5.30)
$$
It is easy to check that if
$$
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \in \partial \|Z\|_*
$$
for some matrix Z, then we must have Z ⪰ 0. This fact together with (5.30)
gives us X̃_{12} = X̃_{21} = 0.
Then we have X̃ = X̄. All the above shows that the optimal solution set X is
a singleton and
$$
X = \left\{ \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \right\}.
$$
Now consider a sequence {X^k} convergent to the optimal set X. If the error bound
condition holds, then there exists a constant κ > 0 such that for sufficiently large k,
$$
\frac{\mathrm{dist}(X^k, X)}{\|R(X^k)\|} \le \kappa. \qquad (5.31)
$$
Let X^k be the following sequence, with {δ_k} = o(1), i.e., a sequence convergent
to 0:
$$
X^k = \begin{pmatrix} 1 + 2\delta_k^2 & \delta_k \\ \delta_k & \delta_k^2 \end{pmatrix}.
$$
Obviously, X^k converges to X and
$$
\mathrm{dist}(X^k, X) = \Omega(\delta_k). \qquad (5.32)
$$
Denote G^k = ∇f(X^k); by (5.29), we have
$$
G^k = \begin{pmatrix} -1 + \delta_k^2 & 0 \\ 0 & -1 - \delta_k^2 \end{pmatrix}.
$$
We know that the residual map R(X) is defined by
$$
R(X) = S_\tau(X - \nabla f(X)) - X,
$$
where τ is the regularization parameter, and in our example τ = 1. So by the form
of X^k and G^k, we have
$$
R(X^k) = S_1(X^k - G^k) - X^k = S_1\!\left( \begin{pmatrix} 2 + \delta_k^2 & \delta_k \\ \delta_k & 1 + 2\delta_k^2 \end{pmatrix} \right) - \begin{pmatrix} 1 + 2\delta_k^2 & \delta_k \\ \delta_k & \delta_k^2 \end{pmatrix}. \qquad (5.33)
$$
It is easy to check that
$$
\begin{pmatrix} 2 + \delta_k^2 & \delta_k \\ \delta_k & 1 + 2\delta_k^2 \end{pmatrix} \succeq I_2.
$$
Also, note that for any matrix X ⪰ I, the matrix shrinkage operator with
factor one has the following property:
$$
S_1(X) = X - I \quad \text{whenever } X \succeq I.
$$
Applying this to (5.33), we obtain
$$
R(X^k) = S_1\!\left( \begin{pmatrix} 2 + \delta_k^2 & \delta_k \\ \delta_k & 1 + 2\delta_k^2 \end{pmatrix} \right) - \begin{pmatrix} 1 + 2\delta_k^2 & \delta_k \\ \delta_k & \delta_k^2 \end{pmatrix} = \begin{pmatrix} -\delta_k^2 & 0 \\ 0 & \delta_k^2 \end{pmatrix}.
$$
Thus we have ‖R(X^k)‖ = Θ(δ_k^2). Compared with (5.32), we see that ‖R(X^k)‖ =
o(dist(X^k, X)), which implies that (5.31) fails to hold. Note also that −∇f(X̄) = I_2
lies on the relative boundary of ∂‖X̄‖_* (its (2,2) entry equals 1), so the constraint
qualification (5.27) indeed fails for this example.
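The computation above can be replicated numerically. The snippet below is a rough sketch with helper names of our own choosing; S_1 is implemented by eigenvalue soft-thresholding of symmetric matrices, which agrees with the matrix shrinkage operator used above. The printed ratios dist(X^k, X)/‖R(X^k)‖_F grow like 1/δ_k, in agreement with the failure of (5.31).
\begin{verbatim}
import numpy as np

X_BAR = np.array([[1.0, 0.0], [0.0, 0.0]])   # the unique optimal solution of (5.28)

def grad_f(X):
    """Gradient (5.29) of f for the 2x2 example."""
    return np.diag([1.5 * X[0, 0] - 2.0 * X[1, 1] - 2.5,
                    -2.0 * X[0, 0] + 3.0 * X[1, 1] + 1.0])

def shrink(X, t=1.0):
    """S_t: soft-thresholding of the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(X)
    return V @ np.diag(np.sign(w) * np.maximum(np.abs(w) - t, 0.0)) @ V.T

def residual(X):
    return shrink(X - grad_f(X)) - X

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    Xk = np.array([[1.0 + 2.0 * delta ** 2, delta], [delta, delta ** 2]])
    ratio = np.linalg.norm(Xk - X_BAR) / np.linalg.norm(residual(Xk))
    print(f"delta = {delta:.0e}:  dist/||R|| = {ratio:.1e}")   # roughly 1/delta
\end{verbatim}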
5.4 Conclusions
We end this chapter by drawing a brief conclusion. In this chapter, we explore
the local error bound for nuclear norm regularized optimization and prove that
for structured function f (see (4.3)), under Assumptions 1 and 2(a) (in this
setting, Assumption 2(b) is automatically implied), the local error bound holds
if a constraint qualification is satisfied; i.e., there exists a matrix X̄ ∈ X such
that
0 ∈ ∇f (X̄) + ri(∂P (X̄)).
The reason for imposing the above condition is to ensure condition (C1)
in Theorem 3.4.1 (the bounded linear regularity of the collection {C_f, C_P}).
In addition, an explicit example is constructed to demonstrate the necessity of
the above constraint qualification condition. This, together with the
results in Section 4.3, implies that the failure of either condition (C1) or (C2)
in Theorem 3.4.1 can lead to the failure of the local error bound. Furthermore,
as a by-product of our error bound result, we prove that the subdifferential of
the nuclear norm is metrically subregular at any point in its graph. Such a result
is new even in the set-valued analysis community and may be of further interest
in future work.
□ End of chapter.
Chapter 6
Application: Convergence Analysis of Proximal Gradient Method
Summary
In this chapter, we explore the convergence rate of the proximal gradient
method when applied to solve ℓ1,p-regularized and nuclear norm regularized
optimization problems. It will be proved that if the local error bound (1.7)
holds, the proximal gradient method can attain a linear rate of convergence.
By contrast, numerical experiments suggest that when the local error bound
fails, linear convergence is in general not achievable for the proximal gradient
method.
6.1 Motivations
Recall that the nonsmooth convex optimization problem (1.2) takes the form
$$
\min_{x \in E} \; F(x) := f(x) + P(x),
$$
where both f and P are closed convex functions and f is additionally continuously
differentiable. Currently, a popular numerical method for solving the above nonsmooth
convex optimization problem is the so-called proximal gradient method (PGM).
Recall that x ∈ X (where X is the optimal solution set of (1.2)) if and only if it
satisfies the following fixed-point equation:
$$
x = \mathrm{prox}_P(x - \nabla f(x)).
$$
This naturally leads to the iteration formula of PGM:
$$
y^{k+1} = x^k - \alpha_k \nabla f(x^k), \qquad x^{k+1} = \mathrm{prox}_{\alpha_k P}(y^{k+1}), \qquad (6.1)
$$
where αk > 0 is the step size in iteration k, for all k = 0, 1, . . .; see [11, 29,
67, 41, 51]. The attractiveness of PGM lies not only in its strong theoretical
convergence guarantees but also in its scalability to large data sets. In fact, for
a number of nonsmooth functions P, the proximity operator of P, prox_P, can be
computed efficiently or even has a closed-form solution (see [11, 51]). Hence, the
computational complexity of each iteration (6.1) of PGM is relatively low. In
view of this, one is more concerned with the following question regarding the
computational cost of PGM: how many iterations are needed for PGM to output
an approximate solution that is close enough to the optimal solution set? Or
more precisely, what is the convergence rate of PGM?
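For concreteness, the iteration (6.1) can be sketched in a few lines of Python. The code below is only an illustrative sketch, not the implementation used in our experiments; the quadratic loss, the ℓ1 regularizer with its soft-thresholding prox, and the random problem data are all assumptions made for the example.
\begin{verbatim}
import numpy as np

def prox_l1(y, t):
    """Proximity operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gradient(grad_f, prox_P, x0, step, num_iters=500):
    """PGM iteration (6.1): y = x - a*grad_f(x), then x = prox_{a P}(y)."""
    x = x0.copy()
    for _ in range(num_iters):
        y = x - step * grad_f(x)
        x = prox_P(y, step)
    return x

# Illustrative instance: f(x) = 0.5*||Ax - b||^2 and P(x) = ||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))    # wide A, so f is not strongly convex
b = rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
x_hat = proximal_gradient(lambda x: A.T @ (A @ x - b), prox_l1,
                          np.zeros(50), step=1.0 / L)
\end{verbatim}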
For problem (1.2) with f being convex and continuously differentiable and
∇f being Lipschitz continuous, the standard PGM (6.1) will achieve an additive
error of O(1/k) in the optimal value after k iterations. Moreover, this error can
be reduced to O(1/k 2 ) using acceleration techniques; see, e.g., [67]. The sublinear
O(1/k 2 ) convergence rate is known to be optimal if f is simply given by a first–
order oracle [48]. On the other hand, if f is strongly convex, then the convergence
rate can be improved to O(ck ) for some c ∈ (0, 1) (i.e., a linear convergence
rate) [61]. However, in practice, the instances of f that arise in applications are
often highly structured and hence not just given by an oracle, but they are not
necessarily strongly convex either. For instance, in matrix completion, a commonly
used loss function is the square loss f(·) = ‖A(·) − b‖_2^2/2, where A : R^{m×n} → R^p
is a linear measurement operator and b ∈ R^p is a given set of observations.
Clearly, f is not strongly convex when A has a non-trivial
nullspace (or equivalently, when A is not injective). In view of this, it is natural
to ask whether linear convergence of the PGM can be established for a larger class
of loss functions. As we will see in the next section, the local error bound (1.7)
is the key towards that end.
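As a quick sanity check on the last claim, the following hedged snippet (with illustrative shapes and random data of our own choosing) builds a non-injective measurement operator and confirms that the Hessian of the square loss has a zero eigenvalue, so the loss is not strongly convex.
\begin{verbatim}
import numpy as np

# Square loss f(x) = 0.5*||A x - b||^2 with fewer measurements than unknowns.
rng = np.random.default_rng(1)
p, n = 30, 100
A = rng.standard_normal((p, n))       # non-injective: rank(A) <= p < n

hessian = A.T @ A                     # the Hessian of f is constant and equals A^T A
eig_min = np.linalg.eigvalsh(hessian).min()
print(f"smallest eigenvalue of A^T A: {eig_min:.2e}")   # numerically zero
\end{verbatim}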
6.2 Error Bound Based Convergence Analysis
In this section, we will prove that, if the local error bound (1.7) for problem (1.2)
holds, then the proximal gradient methods for solving (1.2) attain a linear rate
of convergence. Recall that a sequence of vectors {w^k}_{k≥0} is said to converge
Q-linearly (resp. R-linearly) to a vector w^∞ if there exist an index K ≥ 0 and
a constant ρ ∈ (0, 1) such that ‖w^{k+1} − w^∞‖_2/‖w^k − w^∞‖_2 ≤ ρ for all k ≥ K
(resp. if there exist constants γ > 0 and ρ ∈ (0, 1) such that ‖w^k − w^∞‖_2 ≤ γ·ρ^k
for all k ≥ 0). The following result was proved in [79]; see also [39].
Theorem 6.2.1 (Linear Convergence of the Proximal Gradient Method)
Suppose that in problem (1.2), f is continuously differentiable and ∇f is Lipschitz
continuous with Lipschitz constant L_f. In addition, suppose that the local
error bound (1.7) holds for (1.2). Let {x^k} be the sequence generated by PGM
(6.1) with step sizes satisfying
$$
0 < \alpha < \alpha_k < \bar{\alpha} < 1/L_f, \qquad \text{for } k = 0, 1, 2, \ldots.
$$
Then, the sequence {x^k}_{k≥0} converges R-linearly to an element of the optimal
solution set X, and the associated sequence of objective values {F(x^k)}_{k≥0} converges
Q-linearly to the optimal value v^*.
Proof Under the given setting, we claim that there exist scalars κ_1, κ_2, κ_3 > 0,
which depend on α, ᾱ, and L_f, such that
$$
F(x^k) - F(x^{k+1}) \ge \kappa_1 \|x^k - x^{k+1}\|_F^2, \qquad (6.2)
$$
$$
F(x^{k+1}) - v^* \le \kappa_2 \left( (\mathrm{dist}(x^k, X))^2 + \|x^{k+1} - x^k\|_F^2 \right), \qquad (6.3)
$$
$$
\|R(x^k)\|_F \le \kappa_3 \|x^k - x^{k+1}\|_F. \qquad (6.4)
$$
Let us proceed with the proof by assuming the claim is true. Since {F(x^k)}_{k≥0} is a
monotonically decreasing sequence by (6.2) and F(x^k) ≥ v^* for all k ≥ 0, we
conclude, again by (6.2), that x^k − x^{k+1} → 0. This, together with (6.4), implies
that R(x^k) → 0. Thus, by (6.2), (6.3) and the local error bound (1.7), there
exist an index K ≥ 0 and a constant κ_4 > 0 such that for all k ≥ K,
$$
F(x^{k+1}) - v^* \le \kappa_4 \|x^k - x^{k+1}\|_F^2 \le \frac{\kappa_4}{\kappa_1}\left( F(x^k) - F(x^{k+1}) \right).
$$
It follows that
$$
F(x^{k+1}) - v^* \le \frac{\kappa_4}{\kappa_1 + \kappa_4}\left( F(x^k) - v^* \right), \qquad (6.5)
$$
which establishes the Q-linear convergence of {F(x^k)}_{k≥0} to v^*. Using (6.2) and
(6.5), we can show that {‖x^{k+1} − x^k‖_F^2}_{k≥0} converges R-linearly to 0, which,
together with (6.4), implies that {x^k}_{k≥0} converges R-linearly to a point in X.
Hence it remains to prove the claim. By the PGM iteration (6.1), we have
$$
\alpha_k P(x^{k+1}) + \frac{1}{2}\|x^k - \alpha_k \nabla f(x^k) - x^{k+1}\|_F^2 \le \alpha_k P(x^k) + \frac{1}{2}\alpha_k^2 \|\nabla f(x^k)\|_F^2,
$$
which implies that
$$
P(x^{k+1}) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|_F^2 \le P(x^k). \qquad (6.6)
$$
Since ∇f is Lipschitz continuous with parameter L_f > 0, we have
$$
f(x^{k+1}) - f(x^k) \le \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L_f}{2}\|x^{k+1} - x^k\|_F^2; \qquad (6.7)
$$
see, e.g., [35]. It follows from (6.6) and (6.7) that
$$
\begin{aligned}
F(x^{k+1}) - F(x^k) &\le \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L_f}{2}\|x^{k+1} - x^k\|_F^2 + P(x^{k+1}) - P(x^k) \\
&\le \frac{L_f}{2}\|x^{k+1} - x^k\|_F^2 - \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|_F^2 \\
&\le -\frac{1}{2}\left( \frac{1}{\bar{\alpha}} - L_f \right)\|x^{k+1} - x^k\|_F^2.
\end{aligned}
$$
By taking κ_1 = ((1/ᾱ) − L_f)/2, we obtain (6.2).
To establish (6.3), let x̄^k be the projection of x^k onto X. Using again the
definition of PGM (6.1), we have
$$
\alpha_k P(x^{k+1}) + \frac{1}{2}\|x^k - \alpha_k \nabla f(x^k) - x^{k+1}\|_F^2 \le \alpha_k P(\bar{x}^k) + \frac{1}{2}\|x^k - \alpha_k \nabla f(x^k) - \bar{x}^k\|_F^2.
$$
This implies that
$$
P(x^{k+1}) - P(\bar{x}^k) + \langle \nabla f(x^k), x^{k+1} - \bar{x}^k \rangle \le \frac{1}{2\alpha_k}\|\bar{x}^k - x^k\|_F^2 \le \frac{1}{2\alpha}(\mathrm{dist}(x^k, X))^2. \qquad (6.8)
$$
By the Mean Value Theorem, there exists an x̂^k ∈ [x̄^k, x^{k+1}] such that
$$
f(x^{k+1}) - f(\bar{x}^k) = \langle \nabla f(\hat{x}^k), x^{k+1} - \bar{x}^k \rangle. \qquad (6.9)
$$
Hence, we compute
$$
\begin{aligned}
F(x^{k+1}) - F(\bar{x}^k) &= f(x^{k+1}) + P(x^{k+1}) - f(\bar{x}^k) - P(\bar{x}^k) \\
&= \langle \nabla f(\hat{x}^k), x^{k+1} - \bar{x}^k \rangle + P(x^{k+1}) - P(\bar{x}^k) \qquad (6.10) \\
&= \langle \nabla f(x^k), x^{k+1} - \bar{x}^k \rangle + \langle \nabla f(\hat{x}^k) - \nabla f(x^k), x^{k+1} - \bar{x}^k \rangle + P(x^{k+1}) - P(\bar{x}^k) \\
&\le \langle \nabla f(x^k), x^{k+1} - \bar{x}^k \rangle + L_f\|x^{k+1} - \bar{x}^k\|_F^2 + P(x^{k+1}) - P(\bar{x}^k) \qquad (6.11) \\
&\le \frac{1}{2\alpha}(\mathrm{dist}(x^k, X))^2 + 2L_f\left( \|x^{k+1} - x^k\|_F^2 + \|x^k - \bar{x}^k\|_F^2 \right) \qquad (6.12) \\
&\le \left( 2L_f + \frac{1}{2\alpha} \right)\left( (\mathrm{dist}(x^k, X))^2 + \|x^{k+1} - x^k\|_F^2 \right),
\end{aligned}
$$
where (6.10) follows from (6.9); (6.11) follows from the Lipschitz continuity of
∇f and the fact that x̂^k ∈ [x̄^k, x^{k+1}]; and (6.12) follows from (6.8) and the inequality
$$
\|x^{k+1} - \bar{x}^k\|_F^2 \le \left( \|x^{k+1} - x^k\|_F + \|x^k - \bar{x}^k\|_F \right)^2 \le 2\left( \|x^{k+1} - x^k\|_F^2 + \|x^k - \bar{x}^k\|_F^2 \right).
$$
Upon noting that F(x̄^k) = v^* and that ‖x^k − x̄^k‖_F = dist(x^k, X), we obtain (6.3)
with κ_2 = 2L_f + 1/(2α).
Next, using the fact that for any Y, Z ∈ R^{m×n}, the map
$$
\eta \mapsto \frac{1}{\eta}\|\mathrm{prox}_{\eta P}(Y - \eta Z) - Y\|_F
$$
is decreasing in η > 0, and that the map
$$
\eta \mapsto \|\mathrm{prox}_{\eta P}(Y - \eta Z) - Y\|_F
$$
is increasing in η > 0 [63], we have
$$
\|x^{k+1} - x^k\|_F = \|\mathrm{prox}_{\alpha_k P}(x^k - \alpha_k \nabla f(x^k)) - x^k\|_F \ge \min\{1, \alpha_k\} \cdot \|\mathrm{prox}_P(x^k - \nabla f(x^k)) - x^k\|_F \ge \min\{1, \alpha\} \cdot \|R(x^k)\|_F.
$$
This establishes (6.4) with κ_3 = 1/min{1, α}. □
In view of Theorem 6.2.1 and the local error bounds we obtained in Chapter 4
and Chapter 5, the following results are immediate.
Corollary 6.2.1 For the ℓ1,p-regularized optimization problem (4.1), suppose p ∈ [1, 2] or
p = ∞, f is of the form (4.3), and Assumptions 1 and 2(b) are satisfied. Then,
by choosing the step size α_k to be the constant 1/L (where L is the Lipschitz
constant of ∇f), the proximal gradient method for solving (4.1) attains a linear
rate of convergence.
Corollary 6.2.2 Suppose that in the nuclear norm regularized problem (5.3)-(5.4),
the function f is of the form (5.5) and Assumptions 1 and 2(a) are satisfied. If
there exists X^* ∈ X such that
$$
0 \in \nabla f(X^*) + \mathrm{ri}(\partial P(X^*)), \qquad (6.13)
$$
then, by choosing the step size α_k to be the constant 1/L (where L is the Lipschitz
constant of ∇f), the proximal gradient method for solving (5.3)-(5.4) attains a
linear rate of convergence.
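In the nuclear norm case, the proximity operator required in each iteration of (6.1) is singular value soft-thresholding. The sketch below is a minimal, hedged illustration of that operator and of a single PGM step; it assumes P(X) = τ‖X‖_* and a user-supplied gradient routine, and the function names are our own.
\begin{verbatim}
import numpy as np

def svt(Y, t):
    """Singular value thresholding: the prox of t*||.||_* at Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def pgm_step(X, grad_f, alpha, tau):
    """One iteration of (6.1) for nuclear norm regularization."""
    return svt(X - alpha * grad_f(X), alpha * tau)
\end{verbatim}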
6.3 What If the Error Bound Fails?
As shown in the previous section, the local error bound (1.7) is the key to
establishing the linear convergence rate of PGM without assuming strong convexity.
This motivates the following question: What if the local error bound fails? Will
linear convergence still be attainable for PGM? In this section, we will use
numerical experiments to answer this question. In particular, we will test the
convergence rate of PGM on the two examples we constructed in Section 4.3 and
Section 5.3. Recall that in both of these examples, the local error bound fails.
6.3.1 ℓ1,p-Regularization
Recall the example we constructed in Section 4.3, i.e., problem (4.17). In spite
of its small size, problem (4.17) is of particular interest for convergence-rate
experiments for the following reasons. First, it belongs to the class of
ℓ1,p-regularized problems that satisfy Assumptions 1 and 2. Second, the local
error bound holds for (4.17) when p ∈ [1, 2] and p = ∞, while it fails when
p ∈ (2, ∞). Third, its optimal value v^* and optimal solution set X are known in
advance (Proposition 4.3.2), so that we can trace the curves log(f(x^k) − v^*) and
log(dist(x^k, X)) precisely.
We implement PGM to solve (4.17) with p = 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4, ∞.
The step size is chosen to be the constant α_k ≡ 0.5, which can be verified to satisfy
the conditions stated in Corollary 6.2.1. For p = 1, 1.25, 1.5, 1.75, 2, ∞, in which cases
the local error bound holds, the convergence performance of both the objective
value and the iterates is presented in Figure 6.1; for p = 2.5, 3, 4, in which cases
the local error bound fails, the convergence performance of both the objective
value and the iterates is presented in Figure 6.2. It is readily seen that when
p ∈ [1, 2] or p = ∞, {f(x^k)}_{k≥0} ({x^k}_{k≥0}) converges linearly to v^* (X) (Figures
6.1(a) and 6.1(b)). By contrast, when p ∈ (2, ∞), the objective value converges
at a sublinear rate (Figures 6.2(a) and 6.2(b)). Our experiments suggest that if
the local error bound fails, the PGM for solving the ℓ1,p-regularized problem cannot
achieve a linear rate of convergence.
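The linear-versus-sublinear distinction visible in Figures 6.1 and 6.2 can also be quantified by fitting the slope of log(f(x^k) − v^*) against k over the tail of a run. The snippet below is a small utility of our own for doing so; the array of objective gaps is assumed to have been logged by the solver.
\begin{verbatim}
import numpy as np

def empirical_rate(obj_gaps, tail=0.5):
    """Fit log(f(x^k) - v*) ~ a + b*k on the last `tail` fraction of iterations.

    A clearly negative slope b corresponds to a linear rate of about exp(b)
    per iteration; a slope near 0 indicates sublinear convergence.
    """
    gaps = np.asarray(obj_gaps, dtype=float)
    gaps = gaps[gaps > 0]                     # keep strictly positive gaps only
    k0 = int(len(gaps) * (1.0 - tail))
    ks = np.arange(k0, len(gaps))
    slope, _ = np.polyfit(ks, np.log(gaps[k0:]), 1)
    return np.exp(slope)                      # estimated per-iteration contraction factor

# Example: a geometric sequence of gaps recovers a factor close to 0.8.
print(empirical_rate(0.8 ** np.arange(60)))
\end{verbatim}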
6.3.2 Nuclear Norm Regularization
We implement PGM to solve (5.28) with step sizes α = 0.05, 0.075, 0.1, 0.25.
The optimal value v^* and optimal solution set X are known in advance (see
the discussion in Section 5.3), so that we can trace the curves log(f(x^k) − v^*)
and log(dist(x^k, X)) precisely. The convergence performance of both the objective
value and the iterates is presented in Figure 6.3. It is readily seen that
{f(x^k)}_{k≥0} ({x^k}_{k≥0}) converges to v^* (X) at a sublinear rate (Figures 6.3(a)
and 6.3(b)). Our experiments suggest that if the local error bound fails, the
PGM for solving the nuclear norm regularized problem cannot achieve a linear rate
of convergence.
6.4 Conclusions
In this chapter, we prove that under the local error bound (1.7), the proximal
gradient method for solving the nonsmooth convex optimization problem (1.2) can
achieve a linear rate of convergence. We remark that the applications of error
bounds in numerical optimization extend far beyond the convergence analysis of
PGM. In fact, for other first-order algorithms such as the coordinate gradient
descent method [74], the gradient projection method, and the coordinate descent
method [39], the local error bound is also the key to establishing their linear
convergence rates. Furthermore, in this chapter, we also demonstrate by numerical
experiments that if the local error bound fails for an optimization problem, the
proximal gradient method for solving this problem cannot achieve linear convergence.
□ End of chapter.
[Figure 6.1, two panels: (a) convergence performance of the objective value, log(f(x^k) − v^*) versus iterations; (b) convergence performance of the iterates, log(d(x^k, X)) versus iterations; curves for p = 1, 1.25, 1.5, 1.75, 2, ∞.]
Figure 6.1: The PG method for solving problem (4.17) with p ∈ [1, 2] and p = ∞.
[Figure 6.2, two panels: (a) convergence performance of the objective value, log(f(x^k) − v^*) versus iterations; (b) convergence performance of the iterates, log(d(x^k, X)) versus iterations; curves for p = 2.5, 3, 4.]
Figure 6.2: The PG method for solving problem (4.17) with p ∈ (2, ∞).
[Figure 6.3, two panels: (a) convergence performance of the objective value; (b) convergence performance of the iterates, log(d(x^k, X)) versus iterations; curves for step sizes α = 0.05, 0.075, 0.1, 0.25.]
Figure 6.3: The PG method for solving problem (5.28).
Bibliography
[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering Shared Structures in Multiclass Classification. In Proceedings of the 24th International
Conference on Machine Learning (ICML 2007), pages 17–24, 2007.
[2] A. Fischer. On the local superlinear convergence of a Newton-type method
for LCP under weak conditions. Optimization Methods and Software, 6(2):83–
107, 1995.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex Multi–Task Feature
Learning. Machine Learning, 73(3):243–272, 2008.
[4] F. A. Artacho and M. H. Geoffroy. Characterization of metric regularity of
subdifferentials. Journal of Convex Analysis, 15(2):365, 2008.
[5] F. J. A. Artacho and M. H. Geoffroy. Metric subregularity of the convex
subdifferential in Banach spaces. arXiv preprint arXiv:1303.3654, 2013.
[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with
sparsity-inducing penalties. Foundations and Trends in Machine Learning,
4(1):1–106, 2012.
[7] H. Bauschke and J. M. Borwein. On the convergence of von Neumann's
alternating projection algorithm for two sets. Set-Valued Analysis, 1(2):185–
212, 1993.
[8] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving
convex feasibility problems. SIAM review, 38(3):367–426, 1996.
[9] H. H. Bauschke, J. M. Borwein, and W. Li. Strong conical hull intersection property, bounded linear regularity, Jameson's property (G), and error
bounds in convex optimization. Mathematical Programming, 86(1):135–160,
1999.
[10] H. H. Bauschke, J. M. Borwein, and P. Tseng. Bounded linear regularity,
strong chip, and chip are distinct properties. Journal of Convex Analysis,
7(2):395–412, 2000.
[11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm
for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–
202, 2009.
[12] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm
for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982,
2010.
[13] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
[14] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by
basis pursuit. SIAM journal on scientific computing, 20(1):33–61, 1998.
[15] X. Chen, H. Qi, and P. Tseng. Analysis of nonsmooth symmetric-matrixvalued functions with applications to semidefinite complementarity problems. SIAM Journal on Optimization, 13(4):960–985, 2003.
[16] X. Chen and P. Tseng.
Non-interior continuation methods for solv-
ing semidefinite complementarity problems. Mathematical Programming,
95(3):431–474, 2003.
[17] A. Dontchev and R. Rockafellar. Regularity and conditioning of solution
mappings in variational analysis. Set-Valued Analysis, 12(1-2):79–109, 2004.
[18] A. L. Dontchev and R. T. Rockafellar. Implicit functions and solution mappings. Springer Monogr. Math., 2009.
[19] D. Drusvyatskiy and A. S. Lewis. Tilt stability, uniform quadratic growth,
and strong metric regularity of the subdifferential. SIAM Journal on Optimization, 23(1):256–267, 2013.
[20] J. C. Dunn. Global and asymptotic convergence rate estimates for a class of
projected gradient processes. SIAM Journal on Control and Optimization,
19(3):368–400, 1981.
[21] Y. C. Eldar, P. Kuppinger, and H. Bolcskei. Block-sparse signals: Uncertainty relations and efficient recovery. Signal Processing, IEEE Transactions
on, 58(6):3042–3054, 2010.
[22] M. Fazel, H. Hindi, and S. P. Boyd. A Rank Minimization Heuristic with
Application to Minimum Order System Approximation. In Proceedings of
the 2001 American Control Conference, pages 4734–4739, 2001.
[23] D. Gross. Recovering Low–Rank Matrices from Few Coefficients in Any
Basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
[24] A. J. Hoffman. On approximate solutions of systems of linear inequalities.
Journal of Research of the National Bureau of Standards, 49(4):263–265,
1952.
[25] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.
[26] A. D. Ioffe. Metric regularity and subdifferential calculus. Russian Mathematical Surveys, 55(3):501, 2000.
[27] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection
with sparsity-inducing norms. The Journal of Machine Learning Research,
12:2777–2824, 2011.
[28] S. Ji, K.-F. Sze, Z. Zhou, A. M.-C. So, and Y. Ye. Beyond Convex Relaxation: A Polynomial–Time Non–Convex Optimization Approach to Network
Localization. In Proceedings of the 32nd IEEE International Conference on
Computer Communications (INFOCOM 2013), pages 2499–2507, 2013.
[29] S. Ji and J. Ye. An Accelerated Gradient Method for Trace Norm Minimization. In Proceedings of the 26th Annual International Conference on
Machine Learning (ICML 2009), pages 457–464, 2009.
[30] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel
learning. The Journal of Machine Learning Research, 12:953–997, 2011.
[31] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear–Norm Penalization and Optimal Rates for Noisy Low–Rank Matrix Completion. The
Annals of Statistics, 39(5):2302–2329, 2011.
[32] M. Kowalski. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303–324, 2009.
[33] A. Y. Kruger. About regularity of collections of sets. Set-Valued Analysis,
14(2):187–206, 2006.
[34] D. Leventhal. Metric subregularity and the proximal point method. Journal
of Mathematical Analysis and Applications, 360(2):681–688, 2009.
[35] E. S. Levitin and B. T. Polyak. Constrained Minimization Methods. USSR
Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
[36] A. S. Lewis, D. R. Luke, and J. Malick. Local linear convergence for alternating and averaged nonconvex projections. Foundations of Computational
Mathematics, 9(4):485–513, 2009.
[37] Z.-Q. Luo and P. Tseng. Error bound and convergence analysis of matrix
splitting algorithms for the affine variational inequality problem. SIAM
Journal on Optimization, 2(1):43–54, 1992.
[38] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods
for convex essentially smooth minimization. SIAM Journal on Control and
Optimization, 30(2):408–425, 1992.
[39] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research,
46(1):157–178, 1993.
[40] Z.-q. Luo and P. Tseng. On the convergence rate of dual ascent methods
for linearly constrained convex minimization. Mathematics of Operations
Research, 18(4):846–867, 1993.
[41] S. Ma, D. Goldfarb, and L. Chen. Fixed Point and Bregman Iterative
Methods for Matrix Rank Minimization. Mathematical Programming, Series
A, 128(1–2):321–353, 2011.
[42] O. L. Mangasarian and T.-H. Shiau. Lipschitz continuity of solutions of
linear inequalities, programs and complementarity problems. SIAM Journal
on Control and Optimization, 25(3):583–595, 1987.
[43] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic
regression. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 70(1):53–71, 2008.
[44] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. In Journal of Machine Learning Research, pages 1099–1125, 2005.
[45] G. J. Minty. Monotone (nonlinear) operators in Hilbert space. Duke Mathematical Journal, 29(3):341–346, 1962.
[46] G. J. Minty. On the monotonicity of the gradient of a convex function.
Pacific Journal of Mathematics, 14(1):243–247, 1964.
[47] R. D. Monteiro and S. J. Wright. Local convergence of interior-point algorithms for degenerate monotone LCP. Computational Optimization and
Applications, 3(2):131–155, 1994.
[48] Y. Nesterov. Introductory lectures on convex optimization:
A basic course, volume 87. Springer, 2004.
[49] J.-S. Pang. A posteriori error bounds for the linearly-constrained variational
inequality problem. Mathematics of Operations Research, 12(3):474–484,
1987.
[50] J.-S. Pang. Error bounds in mathematical programming. Mathematical
Programming, 79(1-3):299–332, 1997.
[51] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in
Optimization, 1(3):123–231, 2013.
[52] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed Minimum–Rank Solutions
of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review,
52(3):471–501, 2010.
[53] S. M. Robinson. Bounds for error in the solution set of a perturbed linear
program. Linear Algebra and its applications, 6:69–81, 1973.
[54] S. M. Robinson. Some continuity properties of polyhedral multifunctions.
1981.
[55] R. T. Rockafellar. Convex analysis. Number 28. Princeton university press,
1970.
[56] R. T. Rockafellar. Monotone operators and the proximal point algorithm.
SIAM journal on control and optimization, 14(5):877–898, 1976.
[57] R. T. Rockafellar and R. Wets. Variational Analysis, volume 317 of
Grundlehren der mathematischen Wissenschaften. Springer, 1998.
[58] A. P. Ruszczyński. Nonlinear optimization, volume 13. Princeton university
press, 2006.
[59] S. Sardy, A. Antoniadis, and P. Tseng. Automatic smoothing with wavelets
for a wide class of distributions. Journal of computational and graphical
statistics, 13(2):399–421, 2004.
[60] S. Sardy, A. G. Bruce, and P. Tseng. Block coordinate relaxation methods
for nonparametric wavelet denoising. Journal of computational and graphical
statistics, 9(2):361–379, 2000.
[61] M. Schmidt, N. Le Roux, and F. Bach. Convergence Rates of Inexact
Proximal–Gradient Methods for Convex Optimization. In J. Shawe-Taylor,
R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 24: Proceedings of the
2011 Conference, pages 1458–1466, 2011.
[62] A. M.-C. So, Y. Ye, and J. Zhang. A Unified Theorem on SDP Rank
Reduction. Mathematics of Operations Research, 33(4):910–920, 2008.
[63] S. Sra. Scalable nonconvex inexact proximal splitting. In Advances in Neural
Information Processing Systems, pages 530–538, 2012.
[64] J. F. Sturm. Superlinear convergence of an algorithm for monotone linear
complementarity problems, when no strictly complementary solution exists.
Mathematics of Operations Research, 24(1):72–94, 1999.
[65] D. Sun and J. Sun. Semismooth matrix-valued functions. Mathematics of
Operations Research, 27(1):150–169, 2002.
[66] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society: Series B (Methodological), 58(1):267–288,
1996.
[67] K.-C. Toh and S. Yun. An Accelerated Proximal Gradient Algorithm for
Nuclear Norm Regularized Linear Least Squares Problems. Pacific Journal
of Optimization, 6(3):615–640, 2010.
[68] R. Tomioka and K. Aihara. Classifying Matrices with a Spectral Regularization. In Proceedings of the 24th International Conference on Machine
Learning (ICML 2007), pages 895–902, 2007.
[69] R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in mkl. arXiv
preprint arXiv:1001.2615, 2010.
[70] P. Tseng. On linear convergence of iterative methods for the variational
inequality problem. Journal of Computational and Applied Mathematics,
60(1):237–252, 1995.
[71] P. Tseng.
Error bounds and superlinear convergence analysis of some
newton-type methods in optimization. In Nonlinear Optimization and Related Topics, pages 445–462. Springer, 2000.
[72] P. Tseng. Approximation accuracy, gradient methods, and error bound for
structured convex optimization. Mathematical Programming, 125(2):263–
295, 2010.
[73] P. Tseng and Z.-Q. Luo. On the convergence of the affine-scaling algorithm.
Mathematical Programming, 56(1-3):301–319, 1992.
[74] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–
423, 2009.
[75] J. E. Vogt and V. Roth. A complete analysis of the `1,p group-lasso. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[76] M. White, Y. Yu, X. Zhang, and D. Schuurmans. Convex Multi–View Subspace Learning. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, editors, Advances in Neural Information Processing
Systems 25: Proceedings of the 2012 Conference, pages 1682–1690, 2012.
[77] N. Yamashita and M. Fukushima.
On the rate of convergence of the
levenberg-marquardt method. In Topics in numerical analysis, pages 239–
249. Springer, 2001.
[78] M. Yuan and Y. Lin. Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[79] H. Zhang, J. Jiang, and Z.-Q. Luo. On the linear convergence of a proximal
gradient method for a class of nonsmooth convex minimization problems.
Journal of the Operations Research Society of China, 1(2):163–186, 2013.
[80] R. Zhang and J. Treiman. Upper-Lipschitz multifunctions and inverse subdifferentials. Nonlinear Analysis: Theory, Methods & Applications, 24(2):273–
286, 1995.