Advanced Topics in Machine Learning (Part II)

4. Sparsity Methods
Massimiliano Pontil
1
Today’s Plan
• Sparsity in linear regression
• Formulation as a convex program – Lasso
• Group Lasso
• Matrix estimation problems (Collaborative Filtering, Multi-task Learning, Inverse Covariance, Sparse Coding, etc.)
• Structured Sparsity
• Dictionary Learning / Sparse Coding
• Nonlinear extension
2
L1-regularization
Least absolute shrinkage and selection operator (LASSO):
$$\min_{\|w\|_1 \le \alpha} \ \tfrac{1}{2}\|y - Xw\|_2^2, \qquad \text{where } \|w\|_1 = \sum_{j=1}^d |w_j|$$

• equivalent problem: $\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$

• can be rewritten as a QP:

$$\min_{w^+, w^- \ge 0} \ \tfrac{1}{2}\|y - X(w^+ - w^-)\|_2^2 + \lambda\, e^\top(w^+ + w^-)$$
3
L1-norm regularization encourages sparsity
Consider the case X = I:

$$\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|w - y\|_2^2 + \lambda\|w\|_1$$

Lemma: Let $H_\lambda(t) = (|t| - \lambda)_+ \mathrm{sgn}(t)$, $t \in \mathbb{R}$. The solution $\hat{w}$ is given by $\hat{w}_i = H_\lambda(y_i)$, $i = 1, \dots, d$.

Proof: First note that the problem decouples: $\hat{w}_i = \arg\min_{w_i} \left\{\tfrac{1}{2}(w_i - y_i)^2 + \lambda|w_i|\right\}$. By symmetry $\hat{w}_i y_i \ge 0$, thus w.l.o.g. we can assume $y_i \ge 0$. Now, if $\hat{w}_i > 0$ the objective function is differentiable and setting the derivative to zero gives $\hat{w}_i = y_i - \lambda$. Since the minimum is unique we conclude that $\hat{w}_i = (y_i - \lambda)_+$.
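For concreteness, here is a minimal NumPy sketch of the soft-thresholding operator $H_\lambda$ (the function name and the example vector are my own illustration, not part of the slides):

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding: H_lambda(t) = (|t| - lambda)_+ * sgn(t)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# With X = I, the Lasso solution shrinks each coordinate of y towards zero
# by lambda and sets small coordinates exactly to zero.
y = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(y, lam=1.0))   # approximately [ 2.  -0.   0.2 -1. ]
```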
4
Minimal norm interpolation
If the linear system of equations Xw = y admits a solution, then as λ → 0 the L1-regularization problem reduces to:

$$\min \{\|w\|_1 : Xw = y\} \qquad \text{(MNI)}$$

which is a linear program (exercise)

• the solution is in general not unique
• suppose that $y = Xw^*$; under which conditions is $w^*$ also the unique solution to (MNI)?
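The LP reformulation in the exercise can be sketched with `scipy.optimize.linprog` by splitting $w = w^+ - w^-$ with $w^+, w^- \ge 0$ (a minimal illustration under these assumptions, not a robust solver; names and data are mine):

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolation(X, y):
    """min ||w||_1 s.t. Xw = y, as an LP: write w = w+ - w- with w+, w- >= 0,
    so that ||w||_1 = e^T(w+ + w-) and the constraint becomes X(w+ - w-) = y."""
    m, d = X.shape
    c = np.ones(2 * d)                # objective e^T(w+ + w-)
    A_eq = np.hstack([X, -X])         # equality constraint X(w+ - w-) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))
    return res.x[:d] - res.x[d:]

# Underdetermined system: the LP tends to pick a sparse interpolant.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 20))
w_star = np.zeros(20); w_star[[2, 7]] = [1.0, -2.0]
print(np.round(min_l1_interpolation(X, X @ w_star), 3))
```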
5
Restricted isometry property
Without further assumptions there is no hope that ŵ = w∗
The following conditions are sufficient:

• Sparsity: $\mathrm{card}\{j : w_j^* \neq 0\} \le s$, with $s \ll d$

• X satisfies the restricted isometry property (RIP): there is a $\delta_s \in (0, 1)$ such that, for every $w \in \mathbb{R}^d$ with $\mathrm{card}\{j : w_j \neq 0\} \le s$, it holds that

$$(1 - \delta_s)\|w\|_2^2 \le \|Xw\|_2^2 \le (1 + \delta_s)\|w\|_2^2$$
6
Optimality conditions
Directional derivative of a function $f: \mathbb{R}^d \to \mathbb{R}$ at $w$ in the direction $d$:

$$D_+ f(w; d) := \lim_{\epsilon \to 0^+} \frac{f(w + \epsilon d) - f(w)}{\epsilon}$$

• when f is convex, the limit is always well defined and finite

Theorem 1: $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} f(w)$ iff $D_+ f(\hat{w}; d) \ge 0$ for all $d \in \mathbb{R}^d$

• if f is differentiable at w then $D_+ f(w; d) = d^\top \nabla f(w)$ and Theorem 1 says that $\hat{w}$ is a solution iff $\nabla f(\hat{w}) = 0$
7
Optimality conditions (cont.)
If f is convex, its subdifferential at w is defined as

$$\partial f(w) = \{u : f(v) \ge f(w) + u^\top(v - w), \ \forall v \in \mathbb{R}^d\}$$

• a set-valued function!
• always a closed convex set
• the elements of $\partial f(w)$ are called the subgradients of f at w
• intuition: $u \in \partial f(w)$ if the affine function $f(w) + u^\top(v - w)$ is a global underestimator of f

Theorem 2: $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} f(w)$ iff $0 \in \partial f(\hat{w})$

(easy to prove)
8
Optimality conditions (cont.)
Theorem 2: $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} f(w)$ iff $0 \in \partial f(\hat{w})$

• if f is differentiable then $\partial f(w) = \{\nabla f(w)\}$ and Theorem 2 says that $\hat{w}$ is a solution iff $\nabla f(\hat{w}) = 0$

Some properties of gradients are still true for subgradients, e.g.:

• $\partial(af)(w) = a\,\partial f(w)$, for all $a \ge 0$
• if f and g are convex then $\partial(f + g)(w) = \partial f(w) + \partial g(w)$
9
Optimality conditions for Lasso
$$\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$$

• by Theorem 2 and the properties of subgradients, w is an optimal solution iff

$$X^\top(y - Xw) \in \lambda\,\partial\|w\|_1$$

• to compute $\partial\|w\|_1$ use the sum rule and the subgradient of the absolute value: $\partial|t| = \{\mathrm{sgn}(t)\}$ if $t \neq 0$ and $\partial|t| = \{u : |u| \le 1\}$ if $t = 0$

Case X = I: $\hat{w}$ is a solution iff, for every $i = 1, \dots, d$, $y_i - \hat{w}_i = \lambda\,\mathrm{sgn}(\hat{w}_i)$ if $\hat{w}_i \neq 0$ and $|y_i - \hat{w}_i| \le \lambda$ otherwise (verify that these formulae yield the soft-thresholding solution on page 4)
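A quick numerical check of these conditions for the case X = I (a small sketch; the helper names and tolerance are my own):

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def is_lasso_optimal(y, w, lam, tol=1e-10):
    """Check the subgradient conditions for min 1/2||w - y||^2 + lam*||w||_1:
    y_i - w_i = lam*sgn(w_i) where w_i != 0, and |y_i - w_i| <= lam where w_i = 0."""
    r = y - w
    nz = w != 0
    return (np.allclose(r[nz], lam * np.sign(w[nz]), atol=tol)
            and np.all(np.abs(r[~nz]) <= lam + tol))

y = np.array([3.0, -0.5, 1.2, -2.0])
print(is_lasso_optimal(y, soft_threshold(y, lam=1.0), lam=1.0))   # True
```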
10
General learning method
In general we will consider optimization problems of the form

$$\min_{w \in \mathbb{R}^d} F(w), \qquad \text{where } F(w) = f(w) + g(w)$$

Often f will be a data term, $f(w) = \sum_{i=1}^m E(w^\top x_i, y_i)$, and g a convex penalty function (not necessarily smooth, e.g. the L1-norm)

Next week we will discuss a general and efficient method to solve the above problem under the assumption that f has some smoothness property and g is “simple”, in the sense that the following problem is easy to solve:

$$\min_{w} \ \tfrac{1}{2}\|w - y\|^2 + g(w)$$
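As a preview, here is a minimal proximal-gradient (ISTA-style) sketch for the Lasso case $g(w) = \lambda\|w\|_1$, with a fixed step size 1/L; the names, data, and stopping rule are my own simplifications, not necessarily the algorithm presented next week:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient for F(w) = 1/2||y - Xw||^2 + lam*||w||_1.
    Each iteration: a gradient step on the smooth term f, followed by the
    prox of lam*||.||_1, which is exactly soft-thresholding (page 4)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - X.T @ (X @ w - y) / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
w_true = np.zeros(100); w_true[:5] = 1.0
w_hat = ista(X, X @ w_true, lam=0.1)
print("nonzeros:", np.count_nonzero(np.abs(w_hat) > 1e-6))
```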
11
Group Lasso
Enforce sparsity across a-priori known groups of variables:

$$\min_{w \in \mathbb{R}^d} \ f(w) + \lambda \sum_{\ell=1}^N \|w_{|J_\ell}\|_2$$

where $J_1, \dots, J_N$ are prescribed subsets of $\{1, \dots, d\}$

• In the original formulation (Yuan and Lin, 2006) the groups form a partition of the index set $\{1, \dots, d\}$
• Overlapping groups (Zhao et al., 2009; Jenatton et al., 2010): hierarchical structures such as DAGs

Example: $J_1 = \{1, 2, \dots, d\},\ J_2 = \{2, 3, \dots, d\},\ \dots,\ J_d = \{d\}$
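For non-overlapping groups, the proximity operator of the Group Lasso penalty treats each group as a block; a minimal sketch (the group indices and values below are illustrative):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Prox of lam * sum_l ||w_{J_l}||_2 for a partition {J_l}: each block
    is scaled by (1 - lam/||w_{J_l}||_2)_+, so whole groups are zeroed out."""
    w = w.copy()
    for J in groups:
        norm = np.linalg.norm(w[J])
        w[J] = 0.0 if norm <= lam else (1.0 - lam / norm) * w[J]
    return w

w = np.array([0.3, -0.2, 2.0, 1.5, 0.1])
groups = [[0, 1], [2, 3], [4]]              # a partition of the indices
print(group_soft_threshold(w, groups, lam=0.5))   # first and last groups vanish
```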
12
Multi-task learning
• Learning multiple linear regression or binary classification tasks simultaneously

• Formulate as a matrix estimation problem ($W = [w_1, \dots, w_T]$):

$$\min_{W \in \mathbb{R}^{d \times T}} \ \sum_{t=1}^T \sum_{i=1}^m E(w_t^\top x_{ti}, y_{ti}) + \lambda g(W)$$

• Relationships between tasks modeled via sparsity constraints on W

• Few common important variables (special case of Group Lasso):

$$g(W) = \sum_{j=1}^d \|w^j\|_2$$

(where $w^j$ denotes the j-th row of W)
13
Structured Sparsity
• The above regularizer favors matrices with many zero rows (few features shared by the tasks):

$$g(W) = \sum_{j=1}^d \sqrt{\sum_{t=1}^T w_{tj}^2}$$
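A small numerical illustration of this row-wise penalty versus the entrywise L1 norm (the example matrix is my own):

```python
import numpy as np

def l21_norm(W):
    """g(W) = sum_j ||w^j||_2, the sum of the Euclidean norms of the rows."""
    return np.linalg.norm(W, axis=1).sum()

rng = np.random.default_rng(0)
W = np.zeros((10, 5))                        # d = 10 variables, T = 5 tasks
W[[1, 4, 7]] = rng.standard_normal((3, 5))   # only three rows are nonzero

print("nonzero rows:", int(np.count_nonzero(np.linalg.norm(W, axis=1))))
print("L2,1 norm:", round(l21_norm(W), 3))
print("entrywise L1 norm:", round(np.abs(W).sum(), 3))
```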
14
Structured Sparsity (cont.)

[Figure: example matrices W (green = 0, blue = 1) compared by the number of nonzero rows and by the entrywise L1 norm $\sum_{tj} |w_{tj}|$; the L2,1 regularizer prefers the matrices with fewer nonzero rows.]
15
Estimation of a low rank matrix
$$\min_{W \in \mathbb{R}^{d \times T}} \left\{ \sum_{i=1}^m (y_i - \langle W, X_i \rangle)^2 : \mathrm{rank}(W) \le k \right\}$$

• Multi-task learning: choose $X_i = x_i e_{c_i}^\top$, hence $\langle W, X_i \rangle = w_{c_i}^\top x_i$
• Collaborative filtering: choose $X_i = e_{r_i} e_{c_i}^\top$, hence $\langle W, X_i \rangle = W_{r_i c_i}$, where $r_i \in \{1, \dots, d\}$ and $c_i \in \{1, \dots, T\}$ (row / column indices)

Relax the rank with the trace (or nuclear) norm: $\|W\|_* = \sum_{i=1}^{\min(d,T)} \sigma_i(W)$
16
Trace norm regularization
$$\min_{W \in \mathbb{R}^{d \times T}} \ \sum_{i=1}^m (y_i - \langle W, X_i \rangle)^2 + \lambda\|W\|_*$$

• complete data case:

$$\min_{W \in \mathbb{R}^{d \times T}} \ \tfrac{1}{2}\|Y - W\|_{\mathrm{Fr}}^2 + \lambda\|W\|_*$$

• if $Y = U\,\mathrm{diag}(\sigma)\,V^\top$ then the solution is (recall $H_\lambda$ from page 4):

$$\hat{W} = U\,\mathrm{diag}(H_\lambda(\sigma))\,V^\top$$

Proof uses von Neumann's Theorem: $\mathrm{tr}(Y^\top W) \le \sigma(Y)^\top \sigma(W)$, and equality holds iff Y and W have the same ordered system of singular vectors
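A minimal sketch of this singular-value soft-thresholding step for the complete-data case (assuming the 1/2 factor in the objective above; names and data are illustrative):

```python
import numpy as np

def svt(Y, lam):
    """Solution of min_W 1/2||Y - W||_Fr^2 + lam*||W||_* :
    soft-threshold the singular values of Y and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))   # rank-3 matrix
Y += 0.1 * rng.standard_normal(Y.shape)                         # plus noise
W_hat = svt(Y, lam=1.0)
print("rank of W_hat:", np.linalg.matrix_rank(W_hat, tol=1e-8))
```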
17
Sparse Inverse Covariance Selection
Let $x_1, \dots, x_m \sim p$, where $p(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}}\, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$

Maximum likelihood estimate for the covariance:

$$\hat{\Sigma} = \arg\max_{\Sigma \succ 0} \prod_{i=1}^m p(x_i) = \arg\max_{\Sigma \succ 0} \sum_{i=1}^m \log p(x_i) = \arg\max_{\Sigma \succ 0} \left\{ -\log\det(\Sigma) - \langle S, \Sigma^{-1} \rangle \right\}$$

where $S = \frac{1}{m} \sum_{i=1}^m (x_i - \mu)(x_i - \mu)^\top$

• The solution is $\hat{\Sigma} = S$ (show it as an exercise)
18
Sparse Inverse Covariance Selection (cont.)
Inverse covariance provides information about the relationship between variables: $\Sigma^{-1}_{ij} = 0$ iff $x^i$ and $x^j$ are conditionally independent (given the remaining variables)

$$\hat{W} = \arg\max_{W \succ 0} \{\log\det(W) - \langle S, W \rangle\} = \arg\min_{W \succ 0} \{\langle S, W \rangle - \log\det(W)\}$$

If we expect many pairs of variables to be conditionally independent we could solve the problem

$$\min \{\langle S, W \rangle - \log\det(W) : W \succ 0, \ \mathrm{card}\{(i,j) : |W_{ij}| > 0\} \le k\}$$

which can be relaxed to the convex program

$$\min \{\langle S, W \rangle - \log\det(W) : W \succ 0, \ \|W\|_1 \le k\}$$
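In practice, solvers usually implement the penalized (Lagrangian) form of this relaxation; a hedged sketch using scikit-learn's `GraphicalLasso` (assuming scikit-learn is available; the synthetic data and the value of `alpha` are my own):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Sample from a Gaussian whose precision matrix is sparse (a chain graph).
rng = np.random.default_rng(0)
d = 5
precision = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(precision), size=2000)

# Penalized maximum likelihood with an L1 penalty on the precision matrix W.
model = GraphicalLasso(alpha=0.05).fit(X)
print(np.round(model.precision_, 2))   # off-chain entries are (near) zero
```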
19
Dictionary Learning / Sparse Coding
Given $x_1, \dots, x_m \sim p$, find a $d \times k$ matrix W which minimizes the average reconstruction error:

$$\sum_{i=1}^m \min_{z \in \mathcal{Z}} \|x_i - Wz\|_2^2$$

Can be seen as a constrained matrix factorization problem:

$$\min \left\{ \|X - WZ\|_F^2 : W \in \mathcal{W}, \ Z \in \mathcal{Z} \right\}$$

where $X = [x_1, \dots, x_m]$ and $\mathcal{W} \subseteq \mathbb{R}^{d \times k}$, $\mathcal{Z} \subseteq \mathbb{R}^{k \times m}$

Interpretation: the columns of W are basis vectors (which could be linearly dependent) and the columns of Z are the codes / coefficients used to reconstruct the inputs as linear combinations of the basis vectors
20
Examples
• PCA: $\mathcal{W} = \mathbb{R}^{d \times k}$, $\mathcal{Z} = \mathbb{R}^{k \times m}$
• k-means clustering: $\mathcal{W} = \mathbb{R}^{d \times k}$, $\mathcal{Z} = \{Z : z_i \in \{e_1, \dots, e_k\}\}$
• Nonnegative matrix factorization: $\min_{W, Z \ge 0} \|X - WZ\|_F^2$
• Sparse coding: $\mathcal{W} = \mathbb{R}^{d \times k}$, $\mathcal{Z} = \{Z : \|z_i\|_0 \le s\}$; can be relaxed to the problem $\min_{W,Z} \|X - WZ\|_{\mathrm{Fr}}^2 + \lambda\|Z\|_1$
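A minimal alternating-minimization sketch for the relaxed sparse coding problem (updating the codes Z by a few ISTA steps and the dictionary W by least squares are my own illustrative choices; practical implementations also normalize the columns of W):

```python
import numpy as np

def soft_threshold(V, lam):
    return np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)

def sparse_coding(X, k, lam, n_outer=20, n_inner=50):
    """Alternate between the codes Z (Lasso-type step, here plain ISTA) and
    the dictionary W (least squares) for min_{W,Z} ||X - WZ||_F^2 + lam*||Z||_1."""
    d, m = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, k))
    Z = np.zeros((k, m))
    for _ in range(n_outer):
        # Z-step: ISTA on 1/2||X - WZ||_F^2 + (lam/2)||Z||_1, fixed step 1/L
        L = np.linalg.norm(W, 2) ** 2
        for _ in range(n_inner):
            Z = soft_threshold(Z - W.T @ (W @ Z - X) / L, lam / (2 * L))
        # W-step: unconstrained least squares, W = X Z^+ (pseudo-inverse)
        W = X @ np.linalg.pinv(Z)
    return W, Z

X = np.random.default_rng(1).standard_normal((20, 100))
W, Z = sparse_coding(X, k=10, lam=0.5)
print("fraction of nonzero codes:", np.mean(np.abs(Z) > 1e-8))
```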
21
Nonlinear extension
The methods we have seen so far can be extended to an RKHS setting; for example, the Lasso extends to the problem

$$\min \ \sum_{i=1}^m E\!\left(\sum_{\ell=1}^N f_\ell(x_i),\ y_i\right) + \lambda \sum_{\ell=1}^N \|f_\ell\|_{K_\ell} \qquad (*)$$

• the minimum is over functions $f_1, \dots, f_N$, with $f_\ell \in \mathcal{H}_{K_\ell}$ and $K_1, \dots, K_N$ some prescribed kernels

• feature space formulation (recall $K_\ell(x, t) = \langle \phi_\ell(x), \phi_\ell(t) \rangle$):

$$\min \ \sum_{i=1}^m E\!\left(\sum_{\ell=1}^N w_\ell^\top \phi_\ell(x_i),\ y_i\right) + \lambda \sum_{\ell=1}^N \|w_\ell\|_2$$
22
Connection to Group Lasso
Two important “parametric” versions of the above formulation:

• Lasso: choose $f_j(x) = w_j x_j$, $K_j(x, t) = x_j t_j$:

$$\sum_{i=1}^m E(w^\top x_i, y_i) + \gamma \sum_{j=1}^d |w_j|$$

• Group Lasso: choose $f_\ell(x) = \sum_{j \in J_\ell} w_j x_j$, $K_\ell(x, t) = \langle x_{|J_\ell}, t_{|J_\ell} \rangle$, where $\{J_\ell\}_{\ell=1}^N$ is a partition of the index set $\{1, \dots, d\}$:

$$\sum_{i=1}^m E(w^\top x_i, y_i) + \gamma \sum_{\ell=1}^N \|w_{|J_\ell}\|_2$$
23
Representer theorem
Two reformulations of (*) as a finite-dimensional optimization problem:

• Using the representer theorem:

$$\min \ \sum_{i=1}^m E\!\left(\sum_{\ell=1}^N \sum_{j=1}^m K_\ell(x_i, x_j)\,\alpha_{\ell,j},\ y_i\right) + \lambda \sum_{\ell=1}^N \sqrt{\alpha_\ell^\top K_\ell \alpha_\ell}$$

• Using the formula $\sum_\ell |t_\ell| = \inf_{z > 0} \frac{1}{2} \sum_\ell \left( \frac{t_\ell^2}{z_\ell} + z_\ell \right)$, rewrite the problem as

$$\inf_{z > 0} \ \min \ \sum_{i=1}^m E(f(x_i), y_i) + \frac{\lambda}{2}\,\|f\|^2_{\sum_\ell z_\ell K_\ell} + \frac{\lambda}{2} \sum_\ell z_\ell$$
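The variational formula in the second bullet is easy to verify (a one-line check, not on the slide): for fixed $t \neq 0$, the function $z \mapsto \tfrac{t^2}{z} + z$ is minimized over $z > 0$ at $z = |t|$, where it equals $2|t|$; hence

$$\inf_{z > 0} \ \tfrac{1}{2}\left(\tfrac{t^2}{z} + z\right) = |t|$$

(for $t = 0$ the infimum $0$ is attained in the limit $z \to 0^+$), and summing over $\ell$ gives the formula.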
24
Some references
• L1-regularization / L1-MNI:
– P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector.
Annals of Statistics, 37:1705–1732, 2009.
– E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes
Rendus de l'Académie des Sciences, Paris, Série I, 346:589–592, 2008.
– E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate
measurements. Comm. Pure Appl. Math., 59:1207–1223, 2006.
– R. Tibshirani. Regression Shrinkage and Selection via the Lasso, J. Royal Statistical Society B,
58(1):267–288, 1996.
• Group Lasso:
– M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of
the Royal Statistical Society, Series B, 68(1):49–67, 2006.
– P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite
absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.
– R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing
norms. arXiv:0904.3523v2, 2009.
25
• Multi-task learning:
– A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning,
73(3):243–272, 2008.
– G. Obozinski, B. Taskar, and M.I. Jordan. Joint covariate selection and joint subspace selection for
multiple classification problems. Statistics and Computing, 20(2):1–22, 2010.
• Low rank matrix estimation:
– V. Koltchinskii, A.B. Tsybakov, K. Lounici. Nuclear norm penalization and optimal rates for noisy
low rank matrix completion. arXiv:1011.6256, 2011.
– N. Srebro, J.D.M. Rennie, T.S. Jaakkola. Maximum-Margin Matrix Factorization. Advances in
Neural Information Processing Systems 17, pages 1329–1336, 2005.
– E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. of Comput.
Math., 9:717–772, 2009.
• Nonlinear Group Lasso / Multiple kernel learning:
– A. Argyriou, C. A. Micchelli and M. Pontil. Learning convex combinations of continuously
parameterized basic kernels. COLT 2005
– F. R. Bach, G. R. G. Lanckriet and M. I. Jordan. Multiple kernel learning, conic duality, and the
SMO algorithm. ICML 2004
26
– G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui and M. I. Jordan. Learning the kernel
matrix with semidefinite programming. JMLR 2004
– C.A. Micchelli and M. Pontil. Learning the kernel function via regularization. JMLR 2005
– A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 2008.
• Sparse Coding:
– B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a
sparse code for natural images. Nature, 381(6583):607–609, 1996.
– D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural
Information Processing Systems, 13, pages 556–562, 2001.
27