Semidefinite Programming
OR 6327 Spring 2012
Lecture 27
May 1, 2012
Scribe: Daniel Fleischman

1 The Sparse Covariance Selection Problem
Suppose we have $n$ random variables and we want to detect their dependence on each other, as well as estimate their mean and covariance parameters.

Note that we may (wrongly) detect a dependence. Suppose variables $X_1$ and $X_2$ are (conditionally) independent of each other, but $X_1$ and $X_3$ are dependent, $X_3$ and $X_4$ are dependent, and $X_4$ and $X_2$ are dependent. Then $X_1$ and $X_2$ will typically appear correlated through the chain $X_1$–$X_3$–$X_4$–$X_2$, so what we really want to detect is conditional independence. (Henceforth we'll use lower-case letters for random variables to avoid confusion with matrices.)
We will assume that the variables are multivariate normal, parametrized by mean $\mu$ and covariance matrix $\Sigma$ ($\Sigma$ pd). Then the density is $f(x) = c_1 \exp\left( -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right)$ for some normalizing constant $c_1$.
Let 𝑦 = (π‘₯𝑖 ; π‘₯𝑗 ) consist of two of the variables and let 𝐾 = {1, 2, . . . , 𝑛} βˆ– {𝑖, 𝑗}. Then the
density of 𝑦 given π‘₯𝐾 = π‘₯¯πΎ is:
}
{
1
𝑇 βˆ’1
𝑔(𝑦) = 𝑐2 exp βˆ’ (𝑦 βˆ’ 𝜈) Ξ£{𝑖,𝑗},{𝑖,𝑗} (𝑦 βˆ’ 𝜈)
2
for some 𝜈 and 𝑐2 . Hence, π‘₯𝑖 and π‘₯𝑗 are independent given π‘₯𝐾 = π‘₯¯πΎ if and only if (Ξ£βˆ’1 )𝑖𝑗 = 0.
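To make the chain example above concrete, here is a small numerical illustration (not from the lecture; the specific matrix entries are my own choices): we build a precision matrix $X = \Sigma^{-1}$ whose only off-diagonal nonzeros correspond to the chain edges $x_1$–$x_3$, $x_3$–$x_4$, $x_4$–$x_2$, and observe that $x_1$ and $x_2$ are conditionally independent (zero entry in $\Sigma^{-1}$) yet marginally correlated (nonzero entry in $\Sigma$).

```python
import numpy as np

# Precision matrix X = Sigma^{-1} for variables (x1, x2, x3, x4):
# off-diagonal nonzeros only on the chain edges x1-x3, x3-x4, x4-x2.
# The values 2.0 and 0.5 are arbitrary choices that keep X positive definite.
X = np.array([[2.0, 0.0, 0.5, 0.0],
              [0.0, 2.0, 0.0, 0.5],
              [0.5, 0.0, 2.0, 0.5],
              [0.0, 0.5, 0.5, 2.0]])
Sigma = np.linalg.inv(X)

print(np.isclose(X[0, 1], 0.0))      # True:  (Sigma^{-1})_{12} = 0, conditional independence
print(np.isclose(Sigma[0, 1], 0.0))  # False: Sigma_{12} != 0, x1 and x2 look correlated
```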
We sample $N$ times to get $x_1, \dots, x_N \in \mathbb{R}^n$ and compute the sample mean $\bar{x} = (x_1 + \dots + x_N)/N$ and the sample covariance matrix $\bar{\Sigma} = ((x_1 - \bar{x})(x_1 - \bar{x})^T + \dots + (x_N - \bar{x})(x_N - \bar{x})^T)/N$, and use maximum likelihood. The likelihood function given the data is:
βˆ’π‘/2
β„Ž(πœ‡, Ξ£) := 𝑐3 det(Ξ£)
𝑁
∏
}
{
1
𝑇 βˆ’1
exp βˆ’ (π‘₯𝑗 βˆ’ πœ‡) Ξ£ (π‘₯𝑗 βˆ’ πœ‡)
2
𝑗=1
for yet another normalizing constant $c_3$. For any pd $\Sigma$ this is maximized over $\mu$ by $\mu = \bar{x}$. So we want to maximize:
βˆ’π‘/2
β„ŽΜ„(Ξ£) := β„Ž(¯
π‘₯, Ξ£) = 𝑐3 det(Ξ£)
{
}
{
}
1 βˆ’1
𝑁
𝑁 βˆ’1
exp βˆ’ Ξ£ βˆ™ Ξ£Μ„ = exp 𝑐4 βˆ’ lndet Ξ£ βˆ’ Ξ£ βˆ™ Ξ£Μ„ .
2
2
2
Since the exponential function is monotone, we want to maximize its argument. This is not concave in $\Sigma$, so we change variables: let $X = \Sigma^{-1}$. Then we want to minimize $-\mathrm{lndet}\, X + \bar{\Sigma} \bullet X$, which is convex.
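As a quick sanity check (standard, though not spelled out in the notes): without any sparsity penalty, setting the gradient of this objective to zero gives
\[
-X^{-1} + \bar{\Sigma} = 0 \quad \Longleftrightarrow \quad X = \bar{\Sigma}^{-1},
\]
so when $\bar{\Sigma}$ is nonsingular we simply recover the usual maximum likelihood estimate $\Sigma = \bar{\Sigma}$.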
Parametrizing our optimization by the inverse covariance has another advantage: this is the matrix we expect (or hope) to be sparse. We will penalize the presence of nonzeros in $X$ by adding an "$\ell_1$" term, $|||X|||_1 := \|\mathrm{vec}(X)\|_1$. So, in the end, we want to solve:
\[
\min_{X \succ 0}\ \underbrace{-\mathrm{lndet}\, X + \bar{\Sigma} \bullet X}_{\text{smooth}} \;+\; \underbrace{\rho\, |||X|||_1}_{\text{nice nonsmooth convex}}.
\]

We will use variable splitting and the alternating direction augmented Lagrangian method.

2 The Method
Reformulate as:
\[
\begin{aligned}
\min\ & -\mathrm{lndet}\, X + \bar{\Sigma} \bullet X + \rho\, |||W|||_1 \\
\text{s.t.}\ & X - W = 0, \quad X \succ 0,
\end{aligned}
\]
and consider the augmented Lagrangian:
𝐿(𝑋, π‘Š, π‘Œ ) := βˆ’lndet 𝑋 + Ξ£Μ„ βˆ™ 𝑋 + πœŒβˆ£βˆ£βˆ£π‘Š ∣∣∣1 βˆ’ π‘Œ βˆ™ (𝑋 βˆ’ π‘Š ) +
𝛽
βˆ£βˆ£π‘‹ βˆ’ π‘Š ∣∣2𝐹
2
for 𝛽 > 0 fixed.
The idea is to start with $X_0 = W_0 = I$ and $Y_0 = 0$ and in each step do $(X_{k+1}, W_{k+1}) = \arg\min_{X,W} L(X, W, Y_k)$, and $Y_{k+1} := Y_k - \beta(X_{k+1} - W_{k+1})$.

But it is more efficient to minimize separately over $X$ and $W$, to get:
βˆ™ Start with $X_0 = W_0 = I$ and $Y_0 = 0$.
βˆ™ In iteration $k$ do:
  – $X_{k+1} = \arg\min_X L(X, W_k, Y_k)$,
  – $W_{k+1} = \arg\min_W L(X_{k+1}, W, Y_k)$,
  – $Y_{k+1} = Y_k - \beta(X_{k+1} - W_{k+1})$.
The two optimization steps are given by the following two propositions. First note that:
\[
\begin{aligned}
L(X, W, Y) &= -\mathrm{lndet}\, X - (Y - \bar{\Sigma}) \bullet X + \frac{\beta}{2} \|X - W\|_F^2 + g(W, Y) \\
&= -\mathrm{lndet}\, X + \frac{\beta}{2} \left\| X - \left( W + \tfrac{1}{\beta}(Y - \bar{\Sigma}) \right) \right\|_F^2 + g'(W, Y),
\end{aligned}
\]
where $g(W, Y)$ and $g'(W, Y)$ are functions independent of $X$.
Proposition 1 Let $B := W_k + \frac{1}{\beta}(Y_k - \bar{\Sigma})$ have eigenvalue decomposition $Q \Lambda Q^T$. Then the minimizer of $-\mathrm{lndet}\, X + \frac{\beta}{2} \|X - B\|_F^2$ is $X = Q M Q^T$, where $M = \mathrm{Diag}(\mu)$ and $\mu_j = \frac{1}{2}\left( \lambda_j + \sqrt{\lambda_j^2 + 4\beta^{-1}} \right)$ for all $j$.
Proof: The objective function is convex and smooth, with derivative $-X^{-1} + \beta(X - B)$, so it is minimized when $\beta X - X^{-1} = \beta B$.

Let $M = Q^T X Q$; then by premultiplying by $Q^T$ and postmultiplying by $Q$ we get $\beta M - M^{-1} = \beta \Lambda$. This is solved by a diagonal matrix $M = \mathrm{Diag}(\mu)$ with $\beta \mu_j - \mu_j^{-1} = \beta \lambda_j$ for all $j$; taking the positive roots (so that $X \succ 0$) gives the formula above. β–‘
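The formula in Proposition 1 is easy to check numerically. Here is a short NumPy sketch (the random test matrix and variable names are my own, purely for illustration) that builds $X = Q M Q^T$ from a symmetric $B$ and verifies the optimality condition $\beta X - X^{-1} = \beta B$ from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5, 2.0

A = rng.standard_normal((n, n))
B = (A + A.T) / 2                                # a random symmetric B
lam, Q = np.linalg.eigh(B)                       # B = Q diag(lam) Q^T
mu = 0.5 * (lam + np.sqrt(lam**2 + 4.0 / beta))  # mu_j from Proposition 1
X = (Q * mu) @ Q.T                               # X = Q Diag(mu) Q^T, positive definite

# Optimality condition from the proof: beta*X - X^{-1} = beta*B
print(np.allclose(beta * X - np.linalg.inv(X), beta * B))  # True
```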
Now, for the minimization problem over $W$, note that:
\[
\begin{aligned}
L(X, W, Y) &= -Y \bullet (X - W) + \rho\, |||W|||_1 + \frac{\beta}{2} \|X - W\|_F^2 + q(X, Y) \\
&= \rho\, |||W|||_1 + \frac{\beta}{2} \left\| W - \left( X - \tfrac{1}{\beta} Y \right) \right\|_F^2 + q'(X, Y)
\end{aligned}
\]
for $q(X, Y)$, $q'(X, Y)$ independent of $W$.
So we have:
Proposition 2 Let $C = \beta X_{k+1} - Y_k$. Then the minimizer of $\rho\, |||W|||_1 + \frac{\beta}{2} \|W - \frac{1}{\beta} C\|_F^2$ is $W = \frac{1}{\beta}\left( C - P_{B_\infty^\rho}(C) \right)$, where $B_\infty^\rho = \{ Z \in \mathbb{M}^n : |z_{ij}| \le \rho\ \forall i, j \}$ and $P_{B_\infty^\rho}$ is the projection onto that set.
Proof: Note that $C - P_{B_\infty^\rho}(C) = D$, where $d_{ij}$ is $0$ if $-\rho \le c_{ij} \le \rho$, is $c_{ij} + \rho$ if $c_{ij} < -\rho$, and is $c_{ij} - \rho$ if $c_{ij} > \rho$.
The objective function separates completely over each entry of $W$ and $C$, so we want:
\[
\min_{w_{ij}}\ \rho\, |w_{ij}| + \frac{\beta}{2} \left( w_{ij} - \frac{1}{\beta} c_{ij} \right)^2.
\]
This function, call it $f$, is convex, and its subdifferential is:
\[
\partial f(w_{ij}) = \begin{cases}
\{ \rho + (\beta w_{ij} - c_{ij}) \} & \text{if } w_{ij} > 0, \\
\{ -\rho + (\beta w_{ij} - c_{ij}) \} & \text{if } w_{ij} < 0, \\
[-\rho, \rho] + \{ \beta w_{ij} - c_{ij} \} & \text{if } w_{ij} = 0.
\end{cases}
\]
In all cases we see that the $W$ given in the statement of the proposition has entries $w_{ij}$ at which $0$ is a subgradient, so it is the minimizer. β–‘
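Putting the two propositions together, here is a minimal NumPy sketch of the whole alternating direction scheme. It only illustrates the updates derived above; the function names, the default $\beta$, and the stopping rule are my own choices and not part of the lecture.

```python
import numpy as np

def update_X(W, Y, Sigma_bar, beta):
    """X-update (Proposition 1): minimize -lndet X + (beta/2)||X - B||_F^2
    with B = W + (1/beta)(Y - Sigma_bar), via an eigendecomposition of B."""
    B = W + (Y - Sigma_bar) / beta
    lam, Q = np.linalg.eigh(B)
    mu = 0.5 * (lam + np.sqrt(lam**2 + 4.0 / beta))
    return (Q * mu) @ Q.T

def update_W(X, Y, rho, beta):
    """W-update (Proposition 2): minimize rho*|||W|||_1 + (beta/2)||W - C/beta||_F^2
    with C = beta*X - Y; the minimizer is (C - P(C))/beta, P = projection onto B_inf^rho."""
    C = beta * X - Y
    return (C - np.clip(C, -rho, rho)) / beta

def sparse_inverse_covariance(Sigma_bar, rho, beta=1.0, iters=500, tol=1e-6):
    """Alternating direction augmented Lagrangian loop from the bulleted scheme above."""
    n = Sigma_bar.shape[0]
    X, W, Y = np.eye(n), np.eye(n), np.zeros((n, n))
    for _ in range(iters):
        X = update_X(W, Y, Sigma_bar, beta)
        W = update_W(X, Y, rho, beta)
        Y = Y - beta * (X - W)
        if np.linalg.norm(X - W) <= tol * max(1.0, np.linalg.norm(X)):
            break
    return X, W  # W carries the exact zeros; X is positive definite
```

Note that the $W$-update is just entrywise soft-thresholding of $C/\beta$ at level $\rho/\beta$, which is exactly where the sparsity in the estimated inverse covariance comes from.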