A Generic Coordinate Descent Algorithm for
Inverse Covariance Estimation
Junhui Wang
Department of Mathematics, Statistics,
and Computer Science
University of Illinois at Chicago
December, 2011
Outline
Brief review of inverse covariance estimation
Idea of the generic coordinate descent algorithm
Estimation of the inverse covariance matrix
Simulation and real examples
Extension to graph clustering
Summary
Inverse covariance matrix
Assume X = (X1, . . . , Xp)ᵀ ∼ Np(0, Σ0); the goal is to estimate the precision matrix Ω0 = Σ0⁻¹.
Connection with Gaussian graphical models (Edwards, 2000): the dependence structure among the Xj ’s can be fully determined by Ω0, since
    (Ω0)jk = 0 ⟺ Xj ⊥⊥ Xk | other variables.
Can improve learning performance (Rothman et al., 2008).
Existing estimation methods
Assume X(1), . . . , X(n) are i.i.d. from Np(0, Σ0).
The negative log-likelihood is, after dropping some constants,
    l(Ω) = tr(ΩS) − log |Ω|,
where S = n⁻¹ ∑_{i=1}^n (X(i) − X̄)(X(i) − X̄)ᵀ is the sample covariance matrix.
l(Ω) is strictly convex in Ω and its first derivative is
    l′(Ω) = S − Ω⁻¹.
If S is p.d., the MLE of Ω is S⁻¹.
Existing estimation methods
S may be only positive semi-definite, so S⁻¹ need not exist; and even when it does, S⁻¹ is dense and suboptimal when Ω0 is assumed to be sparse.
Regularized log-likelihood function,
    min_{Ω p.d.} lλ(Ω) = tr(ΩS) − log |Ω| + λ‖Ω⁻‖₁,
where Ω⁻ = Ω − Ω⁺ with Ω⁺ = diag(Ω), and ‖·‖₁ is the componentwise L1-norm.
Some existing estimation methods:
Graphical Lasso (glasso; Friedman et al., 2007)
Cholesky decomposition (SPICE; Rothman et al., 2008)
A generic coordinate descent (CD) algorithm
Let s(Ω) be any strictly convex function of Ω,
    min_{Ω p.d.} s(Ω).
The key of the CD algorithm is to update the current Ω̂_t by one diagonal entry or two symmetric off-diagonal entries,
    Ω̂_{t+1} = Ω̂_t − v_t W_t,
where W_t is the CD direction and v_t is the step size.
The CD algorithm assuring p.d.
Let D_t = s′(Ω̂_t) and (a, b) = argmax_{j,k} |(D_t)_jk|, and then set (W_t)_ab = (W_t)_ba = (D_t)_ab, and (W_t)_jk = 0 otherwise.
Set v_t as follows.
Theorem
Given that Ω̂_t is p.d., Ω̂_{t+1} = Ω̂_t − v_t W_t is p.d. if and only if det(Ω̂_{t+1}) > 0. In addition, det(Ω̂_{t+1}) > 0 when
    v_t < v_t* = [ −(D_t)_ab (Ω̂_t⁻¹)_ab + |(D_t)_ab| √( (Ω̂_t⁻¹)_aa (Ω̂_t⁻¹)_bb ) ] / [ (D_t)_ab² ∆_t ],   if a ≠ b;
    v_t < v_t* = 1 / [ |(D_t)_ab| (Ω̂_t⁻¹)_aa ],   if a = b,
where ∆_t = (Ω̂_t⁻¹)_aa (Ω̂_t⁻¹)_bb − (Ω̂_t⁻¹)_ab².
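To make this concrete, below is a minimal numpy sketch (my own helper names, not code from the slides) that picks the coordinate (a, b), forms the CD direction W_t, and evaluates the step-size bound v_t* from the theorem above.

import numpy as np

def cd_direction_and_bound(D, Omega_hat):
    """Pick the CD coordinate (a, b), build W_t, and compute the bound v_t*.

    D is the gradient s'(Omega_hat); Omega_hat is the current p.d. iterate.
    """
    p = D.shape[0]
    # (a, b) maximizes |(D_t)_jk| over all entries (ties broken arbitrarily).
    a, b = np.unravel_index(np.argmax(np.abs(D)), D.shape)
    d = D[a, b]

    # W_t is zero except for the chosen diagonal entry or symmetric pair.
    W = np.zeros((p, p))
    W[a, b] = W[b, a] = d

    Omega_inv = np.linalg.inv(Omega_hat)
    if a == b:
        v_star = 1.0 / (abs(d) * Omega_inv[a, a])
    else:
        delta = Omega_inv[a, a] * Omega_inv[b, b] - Omega_inv[a, b] ** 2
        v_star = (-d * Omega_inv[a, b]
                  + abs(d) * np.sqrt(Omega_inv[a, a] * Omega_inv[b, b])) / (d ** 2 * delta)
    return (a, b), W, v_star

Per the theorem, any step size strictly below v_star keeps the next iterate positive definite.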
Precision matrix estimation using lλ(Ω)
Set s(Ω) = lλ(Ω).
Initialize Ω̂_0 = (diag(S))⁻¹.
D_t = lλ′(Ω̂_t) = S − Ω̂_t⁻¹ + λ sign(Ω̂_t⁻), and W_t is defined as in the last slide.
v_t = α · argmin_{v ≤ v_t*} lλ(Ω̂_t − v W_t) with 0 < α < 1.
Some remarks:
Requires a line search to find v_t;
Needs to re-run the iteration for different λ’s.
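For concreteness, a small sketch of the penalized gradient above (my own helper; treating sign(0) = 0 for the off-diagonal subgradient is an assumption, not something stated on the slide):

import numpy as np

def penalized_gradient(S, Omega_hat, lam):
    """D_t = S - Omega_hat^{-1} + lam * sign(off-diagonal part of Omega_hat)."""
    D = S - np.linalg.inv(Omega_hat)
    sign_off = np.sign(Omega_hat)
    np.fill_diagonal(sign_off, 0.0)   # the L1 penalty excludes the diagonal
    return D + lam * sign_off

The step size would then come from a line search over 0 < v ≤ v_t*, scaled by α.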
Precision matrix estimation using l(Ω)
Set s(Ω) = l(Ω).
Initialize Ω̂_0 = (diag(S))⁻¹.
D_t = l′(Ω̂_t) = S − Ω̂_t⁻¹, and W_t is defined as in the last slide.
v_t = α · argmin_v l(Ω̂_t − v W_t) with 0 < α ≤ 1, which has an analytic solution:
    if a = b, v_t = ( S_aa (Ω̂_t⁻¹)_aa )⁻¹;
    if a ≠ b and S_ab = 0, v_t = ∆_t⁻¹;
    if a ≠ b and S_ab ≠ 0,
        v_t = [ −(∆_t + 2 (Ω̂_t⁻¹)_ab S_ab) + √( ∆_t² + 4 S_ab² ∆_t + 4 S_ab² (Ω̂_t⁻¹)_ab² ) ] / [ 2 ∆_t (D_t)_ab S_ab ].
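A minimal numpy sketch of one such unpenalized CD update, using the analytic step size above (my own function and variable names; the shrinkage factor α is passed as an argument):

import numpy as np

def cd_update_unpenalized(S, Omega_hat, alpha=0.9):
    """One CD step for l(Omega) = tr(Omega S) - log|Omega| with the analytic step size."""
    Omega_inv = np.linalg.inv(Omega_hat)
    D = S - Omega_inv                                   # gradient l'(Omega_hat)
    a, b = np.unravel_index(np.argmax(np.abs(D)), D.shape)
    d = D[a, b]

    if a == b:
        v = 1.0 / (S[a, a] * Omega_inv[a, a])
    else:
        delta = Omega_inv[a, a] * Omega_inv[b, b] - Omega_inv[a, b] ** 2
        if S[a, b] == 0:
            v = 1.0 / delta
        else:
            disc = (delta ** 2
                    + 4 * S[a, b] ** 2 * delta
                    + 4 * S[a, b] ** 2 * Omega_inv[a, b] ** 2)
            v = (-(delta + 2 * Omega_inv[a, b] * S[a, b]) + np.sqrt(disc)) / (
                2 * delta * d * S[a, b])

    Omega_next = Omega_hat.copy()
    Omega_next[a, b] -= alpha * v * d
    if a != b:
        Omega_next[b, a] -= alpha * v * d
    return Omega_next

Starting from Ω̂_0 = (diag(S))⁻¹ (np.diag(1.0 / np.diag(S))) and iterating this step traces out the solution path discussed on the next slide.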
Some remarks
v_t < v_t*, and thus Ω̂_{t+1} is always p.d.
The algorithm generates a p.d. solution path of Ω̂ that starts from the diagonal initial value and gradually converges to a dense matrix.
A sparse Ω̂ can be obtained by stopping the iteration early. Any model selection criterion can be used as the stopping rule, such as
    AIC(Ω) = l(Ω) + (2/n) · df(Ω),
    BIC(Ω) = l(Ω) + (log(n)/n) · df(Ω),
where df(Ω) = #{(j, k) : j < k, Ω_jk ≠ 0}.
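A small sketch of these criteria (my own helpers; l(Ω) is evaluated directly from its definition):

import numpy as np

def neg_loglik(Omega, S):
    """l(Omega) = tr(Omega S) - log|Omega|."""
    sign, logdet = np.linalg.slogdet(Omega)
    return np.trace(Omega @ S) - logdet

def df(Omega):
    """Number of nonzero off-diagonal entries with j < k."""
    return np.count_nonzero(np.triu(Omega, k=1))

def aic(Omega, S, n):
    return neg_loglik(Omega, S) + 2.0 / n * df(Omega)

def bic(Omega, S, n):
    return neg_loglik(Omega, S) + np.log(n) / n * df(Omega)

The iteration would be stopped at the step whose iterate minimizes AIC or BIC along the path.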
Simulation setup
Four covariance structures are considered:
Model 1 (AR(1)): (Σ0)_jk = ρ^|j−k| with ρ = 0.5, and Ω0 = Σ0⁻¹;
Model 2 (AR(3)): (Ω0)_jk = I(|j − k| = 0) + 0.5 I(|j − k| = 1) + 0.2 I(|j − k| = 2) + 0.1 I(|j − k| = 3);
Models 3 & 4 (randomly generated matrix): (Ω0)_jk ∼ 0.5 · Bern(γ) when j ≠ k, with γ = 0.1 for Model 3 and γ = 0.5 for Model 4, and the (Ω0)_jj ’s are set so that the smallest eigenvalue of Ω0 is 0.1.
Sample size n = 80, and dimension p = 25, 50, or 100.
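For instance, Model 1 data could be generated as follows (a sketch; the seed and sampling routine are my own choices, not specified on the slide):

import numpy as np

def ar1_covariance(p, rho=0.5):
    """Model 1: (Sigma_0)_jk = rho^|j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
p, n = 25, 80
Sigma0 = ar1_covariance(p)
Omega0 = np.linalg.inv(Sigma0)
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)    # n x p Gaussian sample
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                                           # sample covariance matrix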
Various performance measures
Kullback-Leibler loss:
    KL(Ω0, Ω̂_M) = tr(Σ0 Ω̂_M) − log |Σ0 Ω̂_M| − p;
Frobenius norm:
    F(Ω0, Ω̂_M) = ‖Ω0 − Ω̂_M‖_F;
Variable selection loss (Ravikumar et al., 2011):
    VS(Ω0, Ω̂_M) = (p(p − 1))⁻¹ ∑_{j≠k} I( sign((Ω0)_jk) ≠ sign((Ω̂_M)_jk) ).
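These losses could be computed as follows (a sketch with my own function names):

import numpy as np

def kl_loss(Sigma0, Omega_hat):
    M = Sigma0 @ Omega_hat
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - Sigma0.shape[0]

def frobenius_loss(Omega0, Omega_hat):
    return np.linalg.norm(Omega0 - Omega_hat, ord="fro")

def vs_loss(Omega0, Omega_hat):
    p = Omega0.shape[0]
    mismatch = np.sign(Omega0) != np.sign(Omega_hat)
    np.fill_diagonal(mismatch, False)           # only off-diagonal entries count
    return mismatch.sum() / (p * (p - 1))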
Simulation: Kullback-Leibler loss
[Figure: Kullback-Leibler loss under Models 1–4, comparing glasso, CD_AIC, CD_BIC, vCD_AIC, and vCD_BIC at p = 25, 50, and 100.]
Simulation: Frobenius norm
[Figure: Frobenius-norm loss under Models 1–4, comparing glasso, CD_AIC, CD_BIC, vCD_AIC, and vCD_BIC at p = 25, 50, and 100.]
Simulation: Variable selection loss
[Figure: variable selection loss under Models 1–4 at p = 25, 50, and 100.]
Colon tumor classification
62 colon adenocarcinoma tissue samples are gathered, where
40 are tumor tissues and 22 are non-tumor tissues.
2,000 gene expression profiles are available for each tissue.
42 tissues are randomly selected for training and the
remaining 20 tissues are for testing.
p = 25, 50 or 100 most significant genes are selected for
illustration.
LDA is employed for classification, with estimated precision
matrices.
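A sketch of the LDA rule with a plug-in precision matrix (my own implementation; using class proportions as priors is an assumption, the slide only states that LDA is used with the estimated precision matrices):

import numpy as np

def lda_predict(X_test, X_train, y_train, Omega_hat):
    """LDA with a plug-in precision matrix Omega_hat in place of the inverse covariance."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        prior = Xc.shape[0] / X_train.shape[0]
        # Linear discriminant: delta_c(x) = x' Omega mu - 0.5 mu' Omega mu + log(prior)
        scores.append(X_test @ Omega_hat @ mu - 0.5 * mu @ Omega_hat @ mu + np.log(prior))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]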
Colon tumor classification: misclassification error
[Figure: test misclassification errors for p = 25, 50, and 100.]
Extension to graph clustering
Graph clustering aims to group the vertices (variables) into clusters so that the vertices in the same cluster are well connected (similar) and the vertices between clusters are not.
Assume the variables follow a p-variate Gaussian distribution; then the graph clustering problem becomes solving
    min_Ω l(Ω)   subject to Ω is p.d. and block diagonal.
Each block corresponds to a cluster.
Graph clustering via CD algorithm
At step t, suppose Ω̂_t = diag{K_1, . . . , K_m}.
(i) Let D_t = S − Ω̂_t⁻¹, then
    (a, b) = argmax_{j,k} ‖(D_t)_jk‖₁ / ( dim(K_j) dim(K_k) ),
where (D_t)_jk is the (j, k)-th block of D_t.
(ii) Set (W_t)_ab = (W_t)_ba = (D_t)_ab, and (W_t)_jk = 0 otherwise.
(iii) Set v_t = α · φ_max(K_a⁻¹ D_t K_b⁻¹ D_tᵀ)^(−1/2), with φ_max(·) being the largest eigenvalue, and Ω̂_{t+1} = Ω̂_t − v_t W_t.
(iv) Reorganize Ω̂_{t+1} by combining K_a and K_b.
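A minimal sketch of the block-selection step (i), with the current blocks represented as lists of variable indices (my own encoding; ‖·‖₁ is read as the componentwise L1-norm of the block, and only pairs j < k are scanned since D_t is symmetric):

import numpy as np

def select_block_pair(D, blocks):
    """Step (i): pick the block pair maximizing ||(D_t)_jk||_1 / (dim K_j * dim K_k).

    `blocks` is a list of index arrays, one per current diagonal block K_1, ..., K_m.
    """
    best, best_score = None, -np.inf
    m = len(blocks)
    for j in range(m):
        for k in range(j + 1, m):
            block = D[np.ix_(blocks[j], blocks[k])]
            score = np.abs(block).sum() / (len(blocks[j]) * len(blocks[k]))
            if score > best_score:
                best, best_score = (j, k), score
    return best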
Some remarks:
The algorithm always converges and all Ω̂_t ’s are p.d. and block diagonal.
Similar to agglomerative hierarchical clustering, with a number of differences.
The formulation for finding (a, b) is analogous to the group average, and the single linkage and the complete linkage can be defined accordingly.
The number of blocks (clusters) can be pre-specified or selected via the model selection criteria.
Illustrative examples: heatmaps
[Figure: heatmaps.]
Illustrative examples: clustering error
Clustering error (Wang, 2010) is defined as
    err(ψ̂) = #( {ψ0(x_j) = ψ0(x_k)} ∆ {ψ̂(x_j) = ψ̂(x_k)} ) / ( p(p − 1) ),
where ψ0 and ψ̂ are the true and the estimated clustering mappings, and ∆ is the symmetric set difference.
Averaged clustering errors over 50 replications:

            α = .2       α = .5       hclust
Model 1     5.10(.093)   5.59(.109)   19.98(2.695)
Model 2     7.00(.181)   6.24(.141)   26.47(6.477)
Model 3     7.84(.223)   6.34(.127)   22.53(4.896)
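A sketch of this error with clusterings encoded as integer label vectors (my own representation; the slide defines it through the clustering mappings ψ0 and ψ̂):

import numpy as np

def clustering_error(labels_true, labels_est):
    """Fraction of ordered pairs (j, k), j != k, on which the two clusterings disagree
    about whether x_j and x_k belong to the same cluster."""
    labels_true = np.asarray(labels_true)
    labels_est = np.asarray(labels_est)
    p = labels_true.size
    same_true = labels_true[:, None] == labels_true[None, :]
    same_est = labels_est[:, None] == labels_est[None, :]
    disagree = same_true != same_est            # pairwise symmetric set difference
    np.fill_diagonal(disagree, False)
    return disagree.sum() / (p * (p - 1))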
Some take-home messages
A generic coordinate descent algorithm is introduced to
optimize any strictly convex function with respect to positive
definite matrices.
The algorithm is successfully applied to precision matrix
estimation and graph clustering.
The algorithm can be extended to other problems such as
covariance matrix estimation and metric learning.
Thank you!