MATH 301: Advanced Topics in Convex Optimization
Winter 2015
Lecture 25 — March 6
Lecturer: Emmanuel Candes
Scribe: Hamid Javadi
Warning: These notes may contain factual and/or typographic errors. Some portions of lecture may have been omitted.
25.1 Overview
In this lecture we discuss the alternating direction method of multipliers (ADMM). We begin by introducing the algorithm and then give a few examples for which it is very efficient.
25.2 Alternating direction method of multipliers (ADMM)
Consider a problem of the form

    minimize    f(x) + h(y)
    subject to  Ax + By = b.                                        (25.1)
Problem (25.1) has a separable objective. However, the optimization variables are coupled
via the constraint. The augmented Lagrangian with penalty parameter µ for this problem is
    L_µ(x, y, λ) = f(x) + h(y) + λ^T(Ax + By − b) + (µ/2)‖Ax + By − b‖².        (25.2)
It can be seen that the quadratic penalty in (25.2) destroys the separability of the Lagrangian. Therefore, the augmented Lagrangian cannot be minimized by minimizing separately over the variables x and y. ADMM addresses this by replacing the joint minimization over (x, y) with alternating minimization. Applying this change to the augmented Lagrangian method gives the ADMM algorithm, shown in Algorithm 1. Here are some useful facts about this algorithm.
1. It can be shown that the value of f(x_k) + h(y_k) in Algorithm 1 converges to the optimal value of problem (25.1). See [BPC+11].
2. The splitting in ADMM can allow the optimization to be performed in parallel: when f or h is itself a sum over blocks of variables, the corresponding x- or y-update decomposes into independent subproblems that can be solved in parallel.
3. The algorithm is extremely efficient when the x- and y-updates in each step have low computational cost. In many problems that come up in practice there are closed-form expressions for these updates, which makes ADMM very useful in practice (a generic sketch of the iteration is given after this list).
Algorithm 1 Alternating direction method of multipliers (ADMM)
  y_0 ← ỹ, λ_0 ← λ̃, k ← 1   // initialize
  µ ← µ̃ > 0
  while convergence criterion is not satisfied do
      x_k ← arg min_x L_µ(x, y_{k−1}, λ_{k−1})
      y_k ← arg min_y L_µ(x_k, y, λ_{k−1})
      λ_k ← λ_{k−1} + µ(Ax_k + By_k − b)
      k ← k + 1
  end while
4. Although the algorithm converges when there are two blocks of variables x, y, convergence is not guaranteed with three or more blocks. In fact, there are examples in the literature showing that the direct extension of ADMM to three or more blocks can diverge; see [CHYY14]. Nevertheless, in practice ADMM is often used for problems with more than two blocks and on many occasions it performs well.
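To make Algorithm 1 concrete, here is a minimal generic sketch in Python/NumPy. It is not taken from the lecture: the function and argument names (argmin_x, argmin_y, the tolerance-based stopping rule) are our own, and the two subproblem solvers are assumed to be supplied by the user.

    import numpy as np

    def admm(argmin_x, argmin_y, A, B, b, mu, y0, lam0, max_iter=500, tol=1e-8):
        """Generic two-block ADMM loop following Algorithm 1.

        argmin_x(y, lam) and argmin_y(x, lam) minimize the augmented Lagrangian
        L_mu over x and over y, respectively, with the other arguments fixed.
        """
        y, lam = y0, lam0
        x = None
        for _ in range(max_iter):
            x = argmin_x(y, lam)          # x-update
            y = argmin_y(x, lam)          # y-update
            r = A @ x + B @ y - b         # constraint residual Ax + By - b
            lam = lam + mu * r            # multiplier (dual) update
            if np.linalg.norm(r) <= tol:  # simple convergence criterion
                break
        return x, y, lam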
25.3 Examples

In this section we explain the application of the ADMM algorithm to two problems which are extremely useful in practice.
25.3.1 Robust PCA
Consider the problem

    minimize    ‖L‖_* + γ‖S‖_1
    subject to  L + S = M.                                        (25.3)
In (25.3), ‖·‖_1 is the entrywise ℓ1-norm; in other words, ‖S‖_1 = Σ_{i,j} |S_{ij}|. This problem is known as robust PCA.¹ One drawback of classical PCA is that it is very sensitive to possible outliers and corrupted data. Robust PCA tackles this problem by regularizing PCA with an ℓ1 penalty. As we will see, ADMM is extremely useful for solving robust PCA. For some penalty parameter 1/τ > 0, the augmented Lagrangian for (25.3) is
    L_{1/τ}(L, S, y) = ‖L‖_* + γ‖S‖_1 + (1/τ)⟨y, M − L − S⟩ + (1/(2τ))‖M − L − S‖_F².
¹ Robust PCA is useful in practical applications where one wishes to recover an approximately low-rank matrix from a subset of entries, a fraction of which may have been corrupted.
Now we derive the update rules of the ADMM algorithm for this problem. Using Algorithm 1, the updates are

    L_k = arg min_L L_{1/τ}(L, S_{k−1}, y_{k−1})
        = arg min_L ‖L‖_* + γ‖S_{k−1}‖_1 + (1/τ)⟨y_{k−1}, M − L − S_{k−1}⟩ + (1/(2τ))‖M − L − S_{k−1}‖_F²
        = arg min_L ‖L‖_* + (1/(2τ))‖M − S_{k−1} + y_{k−1} − L‖_F².
Using this, it can be shown that

    arg min_L L_{1/τ}(L, S_{k−1}, y_{k−1}) = SVT_τ(M − S_{k−1} + y_{k−1}),

where

    SVT_τ(X) = U D_τ(Σ) V^T,    D_τ(Σ) = diag({(σ_i − τ)_+}),

and

    X = U Σ V^T,    Σ = diag({σ_i})

is the singular value decomposition of X (see Theorem 2.1 in [CCS10]).
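As a concrete illustration, here is a minimal NumPy sketch of the singular value thresholding operator SVT_τ. The function name svt is ours, and the sketch assumes a dense matrix small enough for a full SVD.

    import numpy as np

    def svt(X, tau):
        """Singular value thresholding: soft-threshold the singular values of X by tau."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_thr = np.maximum(s - tau, 0.0)   # (sigma_i - tau)_+
        return (U * s_thr) @ Vt            # U * diag(s_thr) * V^T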
We also have

    S_k = arg min_S L_{1/τ}(L_k, S, y_{k−1})
        = arg min_S ‖L_k‖_* + γ‖S‖_1 + (1/τ)⟨y_{k−1}, M − L_k − S⟩ + (1/(2τ))‖M − L_k − S‖_F²
        = arg min_S ‖S‖_1 + (1/(2γτ))‖M − L_k + y_{k−1} − S‖_F².
As we saw in previous lectures, entrywise soft thresholding solves the above problem. In other words,

    S_k = ET_{γτ}(M − L_k + y_{k−1}),

where

    (ET_{γτ}(X))_{ij} =  X_{ij} − γτ   if X_{ij} ≥ γτ,
                         0             if |X_{ij}| ≤ γτ,                 (25.4)
                         X_{ij} + γτ   if X_{ij} ≤ −γτ.
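The entrywise soft-thresholding operator has a one-line NumPy implementation; the function name soft_threshold is our own.

    import numpy as np

    def soft_threshold(X, t):
        """Entrywise soft thresholding: shrink each entry of X toward zero by t, as in (25.4)."""
        return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)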
Also, since y/τ plays the role of the multiplier λ in Algorithm 1 (with penalty µ = 1/τ), the dual update step for the algorithm is

    y_k = y_{k−1} + (M − L_k − S_k).
The ADMM steps for solving the robust PCA problem are summarized in Algorithm 2. Note that the updates for S and y can be done very efficiently, with extremely low computational cost. Updating L requires computing the singular value decomposition of a matrix, which can be costly. However, in practice, iterative methods can be used to efficiently calculate only the largest singular values and their corresponding singular vectors, and this works quite well in practice.
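For instance, one could replace the full SVD inside SVT_τ by a truncated SVD that computes only the top k singular triplets. The sketch below uses scipy.sparse.linalg.svds for this purpose; the rank cap k is a tuning choice of ours and is not specified in the lecture.

    import numpy as np
    from scipy.sparse.linalg import svds

    def svt_truncated(X, tau, k):
        """Approximate SVT_tau using only the k largest singular triplets of X."""
        k = min(k, min(X.shape) - 1)       # svds requires k < min(X.shape)
        U, s, Vt = svds(X, k=k)            # top-k singular values/vectors
        s_thr = np.maximum(s - tau, 0.0)
        return (U * s_thr) @ Vt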
Algorithm 2 ADMM for solving the robust PCA problem
  S_0 ← S̃, y_0 ← ỹ, k ← 1   // initialize
  τ ← τ̃ > 0
  while convergence criterion is not satisfied do
      L_k ← SVT_τ(M − S_{k−1} + y_{k−1})
      S_k ← ET_{γτ}(M − L_k + y_{k−1})
      y_k ← y_{k−1} + (M − L_k − S_k)
      k ← k + 1
  end while
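Putting the pieces together, the following is a minimal NumPy sketch of Algorithm 2, reusing the svt and soft_threshold helpers sketched above. The zero initialization, iteration cap, and relative stopping rule are our own choices, not part of the lecture.

    import numpy as np

    def robust_pca_admm(M, gamma, tau, max_iter=200, tol=1e-7):
        """ADMM for: minimize ||L||_* + gamma*||S||_1  subject to  L + S = M  (Algorithm 2)."""
        S = np.zeros(M.shape)
        y = np.zeros(M.shape)
        for _ in range(max_iter):
            L = svt(M - S + y, tau)                     # L-update: singular value thresholding
            S = soft_threshold(M - L + y, gamma * tau)  # S-update: entrywise soft thresholding
            r = M - L - S                               # constraint residual
            y = y + r                                   # dual update
            if np.linalg.norm(r) <= tol * max(np.linalg.norm(M), 1.0):
                break
        return L, S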
25.3.2 Graphical Lasso
Our final example is the problem known as graphical Lasso. Consider the following problem.
    minimize    − log det X + Tr(XC) + ρ‖X‖_1
    subject to  X ≻ 0.                                        (25.5)
In (25.5), ‖·‖_1 is the entrywise ℓ1-norm. This problem arises in the estimation of sparse undirected graphical models. C is the empirical covariance matrix of the observed data, and the goal is to estimate a covariance matrix with sparse inverse for the observed data (for more information about graphical Lasso see [FHT08]). In order to apply ADMM we rewrite (25.5) as

    minimize    − log det X + Tr(XC) + I_{X ≻ 0}(X) + ρ‖Y‖_1
    subject to  X = Y.                                        (25.6)
The augmented Lagrangian with penalty parameter µ for (25.6) is
    L_µ(X, Y, Z) = − log det X + Tr(XC) + I_{X ≻ 0}(X) + ρ‖Y‖_1 + µ⟨Z, X − Y⟩ + (µ/2)‖X − Y‖_F².
Based on this, we derive the update rules for the ADMM algorithm. We have
    X_k = arg min_X L_µ(X, Y_{k−1}, Z_{k−1})
        = arg min_{X ≻ 0} { − log det X + ⟨X, C⟩ + ρ‖Y_{k−1}‖_1 + µ⟨Z_{k−1}, X − Y_{k−1}⟩ + (µ/2)‖X − Y_{k−1}‖_F² }
        = arg min_{X ≻ 0} { − log det X + (µ/2)‖X + Z_{k−1} − Y_{k−1} + (1/µ)C‖_F² }.
Thus, letting

    C̃_{k−1} = −Z_{k−1} + Y_{k−1} − (1/µ)C,                                (25.7)

we have

    ∇_X L_µ(X, Y_{k−1}, Z_{k−1}) = −X^{−1} + µX − µC̃_{k−1},
where C̃_{k−1} is as in (25.7). It is clear that if ∇_X L_µ(X^*, Y_{k−1}, Z_{k−1}) = 0 for some X^* ≻ 0, then X^* = arg min_X L_µ(X, Y_{k−1}, Z_{k−1}). Now if

    C̃_{k−1} = U Λ U^T,    Λ = diag({λ_i})

is the eigenvalue decomposition of C̃_{k−1}, then it is easy to see that for

    X^* = F_µ(C̃_{k−1}) = U F_µ(Λ) U^T,    F_µ(Λ) = diag({ (λ_i + √(λ_i² + 4/µ)) / 2 }),        (25.8)
we have X^* ≻ 0 and −(X^*)^{−1} + µX^* − µC̃_{k−1} = 0. Therefore the ADMM update is

    X_k = F_µ(Y_{k−1} − Z_{k−1} − (1/µ)C),

where F_µ(·) is given in (25.8).
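As a small numerical sketch of this X-update (the function name F_mu is ours), the eigenvalue formula (25.8) can be implemented directly with a symmetric eigendecomposition:

    import numpy as np

    def F_mu(C_tilde, mu):
        """X-update (25.8): the X > 0 solving -X^{-1} + mu*X - mu*C_tilde = 0."""
        lam, U = np.linalg.eigh(C_tilde)                 # C_tilde = U diag(lam) U^T
        x = (lam + np.sqrt(lam**2 + 4.0 / mu)) / 2.0     # positive root, entrywise
        return (U * x) @ U.T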
We also have

    Y_k = arg min_Y L_µ(X_k, Y, Z_{k−1})
        = arg min_Y { − log det X_k + ⟨X_k, C⟩ + ρ‖Y‖_1 + µ⟨Z_{k−1}, X_k − Y⟩ + (µ/2)‖X_k − Y‖_F² }
        = arg min_Y { (µ/2)‖Y − (X_k + Z_{k−1})‖_F² + ρ‖Y‖_1 }
        = ET_{ρ/µ}(X_k + Z_{k−1}),
where ET_{ρ/µ}(·) is given as in (25.4). Since µZ plays the role of the multiplier λ in Algorithm 1, the dual update is Z_k = Z_{k−1} + (X_k − Y_k). Finally, the ADMM steps for graphical Lasso are summarized in Algorithm 3.
Algorithm 3 ADMM for solving the graphical Lasso problem
  Y_0 ← Ỹ, Z_0 ← Z̃, k ← 1   // initialize
  µ ← µ̃ > 0
  while convergence criterion is not satisfied do
      X_k ← F_µ(Y_{k−1} − Z_{k−1} − (1/µ)C)
      Y_k ← ET_{ρ/µ}(X_k + Z_{k−1})
      Z_k ← Z_{k−1} + (X_k − Y_k)
      k ← k + 1
  end while
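A rough NumPy sketch of Algorithm 3 follows, reusing the F_mu and soft_threshold helpers sketched above (soft_threshold with threshold ρ/µ plays the role of ET_{ρ/µ}). The identity initialization, iteration cap, and stopping rule are our own choices.

    import numpy as np

    def graphical_lasso_admm(C, rho, mu, max_iter=200, tol=1e-7):
        """ADMM for: minimize -log det X + Tr(XC) + rho*||X||_1 over X > 0  (Algorithm 3)."""
        p = C.shape[0]
        Y = np.eye(p)
        Z = np.zeros((p, p))
        for _ in range(max_iter):
            X = F_mu(Y - Z - C / mu, mu)            # X-update via (25.8)
            Y = soft_threshold(X + Z, rho / mu)     # Y-update: entrywise soft thresholding
            Z = Z + (X - Y)                         # dual update
            if np.linalg.norm(X - Y) <= tol * max(np.linalg.norm(X), 1.0):
                break
        return X, Y                                 # X is positive definite; Y is the sparse copy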
Bibliography
[BPC+11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 3 (2011), no. 1, 1–122.
[CCS10] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization 20 (2010), no. 4, 1956–1982.
[CHYY14] Caihua Chen, Bingsheng He, Yinyu Ye, and Xiaoming Yuan, The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent, Mathematical Programming (2014), 1–23.
[FHT08] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008), no. 3, 432–441.