
Structural pursuit over multiple undirected graphs∗

Yunzhang Zhu1, Xiaotong Shen1 and Wei Pan2
Summary
Gaussian graphical models are useful for analyzing and visualizing conditional dependence relationships between interacting units. Motivated by network analysis under different experimental conditions, such as gene networks for disparate cancer subtypes, we model structural changes over multiple networks with possible heterogeneity. In particular, we estimate multiple precision matrices describing dependencies among interacting units through maximum penalized likelihood. Of particular interest are homogeneous groups of similar entries across these matrices and zero entries within them, referred to as clustering and sparseness structures, respectively. A non-convex method is proposed to seek a sparse representation for each matrix and to identify clusters of the entries across the matrices. Computationally, we develop an efficient method based on difference convex programming, the augmented Lagrangian method and the block-wise coordinate descent method, which is scalable to hundreds of graphs with thousands of nodes through a simple necessary and sufficient partition rule that divides nodes into smaller disjoint subproblems, excluding zero-coefficient nodes, for arbitrary graphs with convex relaxation. Theoretically, a finite-sample error bound is derived for the proposed method to reconstruct the clustering and sparseness structures. This leads to consistent reconstruction of these two structures simultaneously, permitting the number of unknown parameters to be exponential in the sample size, and yielding the optimal performance of the oracle estimator as if the true structures were given a priori. Simulation studies suggest that the method enjoys the benefit of pursuing these two disparate kinds of structures, and compares favorably against its convex counterpart in the accuracy of structure pursuit and parameter estimation.
Key Words: Simultaneous pursuit of sparseness and clustering, multiple networks, nonconvex, prediction, signaling network inference.
1 Introduction
Graphical models are widely used to describe relationships among interacting units. Major
components of the models are nodes that represent random variables, and edges encoding
conditional dependencies between the nodes. Of great current interest is the identification
of certain lower-dimensional structures for undirected graphs. The central topic of this
∗ 1School of Statistics, 2Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455. This research was supported in part by the National Science Foundation Grant DMS-1207771 and National Institutes of Health Grants R01GM081535, HL65462 and R01HL105397. The authors thank the editors, the associate editor and the reviewers for helpful comments and suggestions.
paper is maximum penalized likelihood estimation of multiple Gaussian graphical models for
simultaneously pursuing two disparate kinds of structures: sparseness and clustering.
In the literature on Gaussian graphical models, the current research effort has concentrated on reconstruction of a single sparse graph. Methods to exploit matrix sparsity include
[1, 5, 11, 13, 14, 15, 25], among others. For multiple Gaussian graphical models, existing
approaches mainly focus on either exploring temporal smoothing structure [7, 26] or encouraging common sparsity across the networks [6, 10]. In this paper, we focus on pursuing both
clustering and sparseness structures over multiple graphs, including temporal clustering as
a special case while allowing for abrupt changes of structures over graphs. For multiple
graphs without a temporal ordering, our method makes it possible to identify element-wise
heterogeneity among undirected graphs. This is motivated by heterogeneous gene regulatory
networks corresponding to disparate cancer subtypes [18, 21]. In such a situation, the overall
associations among genes remain similar for each network, whereas specific pathways and
certain critical nodes (genes) may be differentiated under disparate conditions.
For multiple Gaussian graphical models, estimation is challenging due to the enormous number of candidate graphs, of order $2^{Lp^2}$, where p is the total number of nodes and L is the total number of
graphs. To battle the curse of dimensionality, we explore two dissimilar types of structures
simultaneously: (1) sparseness within each graph and (2) element-wise clustering across
graphs. The benefit of this exploration is three-fold. First, it goes beyond sparseness pursuit
alone for each graph, which is usually inadequate given a large number of unknown parameters relative to the sample size, as demonstrated in four numerical examples in Section 5.
Second, borrowing information across graphs enables us to detect the changes of sparseness
and clustering structures over the multiple graphs. Third, pursuit of these two structures at
the same time is suited for our problem, which seeks both similarities and differences among
the multiple graphs.
To this end, we propose a regularized/constrained maximum likelihood method for simultaneous pursuit of sparseness and clustering structures. Computationally, we develop
a strategy to convert the optimization involving matrices to a sequence of much simpler
quadratic problems. Most critically, we derive a necessary and sufficient partition rule to
partition the nodes into disjoint subproblems excluding zero-coefficient nodes for multiple
arbitrary graphs with convex relaxation, where the rule is applied before computation is performed. Such a rule has been used in [12] for convex estimation of a single matrix and in [24]
for multiple matrices with the temporal structure, but has not been available for multiple
arbitrary graphs, to our knowledge. This makes efficient computation possible for multiple
large graphical models, which otherwise is rather difficult if not impossible. Theoretically,
we develop a novel theory for the proposed method and show that it is able to reconstruct
the oracle estimator as if the true sparseness and element-wise clustering structures were
given a priori, which leads to consistent reconstruction of the two types of structures. This
occurs roughly when the total size of the L matrices, $p^2 L$, is of order $\exp(An)$, where p is the dimension
of the matrices and A is related to the Hessian matrices of the negative log-determinant of
the true precision matrices and to the resolution level for simultaneous pursuit of sparseness
and element-wise clustering, c.f., Corollary 2. Moreover, we quantify the improvement due
to structural pursuit beyond that of sparsity.
The rest of this article is organized as follows. Section 2 introduces the proposed method.
Section 3 is devoted to estimation of partial correlations across multiple graphical models,
and develops computational tools for efficient computation. Section 4 presents a theory
concerning the accuracy of structural pursuit and parameter estimation, followed by some
numerical examples in Section 5 and an application to signaling network inference in Section
6. Section 7 discusses various issues in modeling. Finally, the appendix contains proofs.
2 Proposed method
Consider the $L$-sample problem with the $l$-th sample $X_1^{(l)}, \cdots, X_{n_l}^{(l)}$ from $N(\mu_l, \Sigma_l)$; $l = 1, \cdots, L$. We estimate $\Omega = (\Omega_1, \cdots, \Omega_L)$, where $\Omega_l = \Sigma_l^{-1}$ is the $p \times p$ inverse covariance matrix, assumed positive definite and denoted by $\Omega_l \succ 0$, $\mu_l$ and $\Sigma_l$ are the corresponding mean vector and covariance matrix, and the sample size is $n = \sum_{l=1}^{L} n_l$.

For maximum likelihood estimation, the profile likelihood for $\Omega$, after $\mu_1, \cdots, \mu_L$ are maximized out, is proportional to
\[
\sum_{l=1}^{L} n_l \big( \log \det(\Omega_l) - \mathrm{tr}(S_l \Omega_l) \big), \tag{1}
\]
where $\bar{X}_l = n_l^{-1} \sum_{i=1}^{n_l} X_i^{(l)}$ and $S_l = n_l^{-1} \sum_{i=1}^{n_l} (X_i^{(l)} - \bar{X}_l)(X_i^{(l)} - \bar{X}_l)^T$ are the corresponding sample mean and covariance matrix, and $\det$ and $\mathrm{tr}$ denote the determinant and trace. In (1), the number of unknown parameters in $\Omega$ can greatly exceed the sample size $n$.
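As a concrete illustration of (1), the following minimal sketch evaluates the profile log-likelihood from raw data, assuming each group $l$ is stored as an $n_l \times p$ array; the function and variable names are ours and not part of any released software.

```python
import numpy as np

def profile_loglik(data, omegas):
    """sum_l n_l [log det(Omega_l) - tr(S_l Omega_l)], up to additive constants."""
    total = 0.0
    for X, omega in zip(data, omegas):
        n_l = X.shape[0]
        S_l = np.cov(X, rowvar=False, bias=True)   # MLE sample covariance S_l
        sign, logdet = np.linalg.slogdet(omega)
        total += n_l * (logdet - np.trace(S_l @ omega))
    return total
```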
2.1 General penalized multiple precision matrices estimation
To avoid non-identifiability in (1) and encourage low-dimensional structures, we propose a regularized maximum likelihood approach through penalty functions $J_{jk}(\cdot)$:
\[
\underset{\Omega \succ 0}{\mathrm{maximize}} \; S(\Omega) = \sum_{l=1}^{L} n_l \big( \log \det(\Omega_l) - \mathrm{tr}(S_l \Omega_l) \big) - \sum_{j \neq k} J_{jk}(\omega_{jk1}, \cdots, \omega_{jkL}), \tag{2}
\]
where $\Omega = (\Omega_1, \cdots, \Omega_L)$ and only the off-diagonals $\{\omega_{jkl}\}$ of $\Omega_l$ are regularized. Note that $J_{jk}(\cdot)$ could be any function that penalizes the $jk$-th entries across the $\Omega_l$'s. This encompasses many existing penalty-based approaches for multiple Gaussian graphical models [6, 22] as special cases.
In general, the maximization problem (2) involving L matrices is computationally difficult. To meet the computational challenges, we develop a general block-wise coordinate
descent strategy to reduce (2) to an iterative procedure involving much easier subproblems.
Before proceeding, we introduce some notation. Let the $j$th row (or column) of $\Omega_l$ be $\omega_{jl}$, let $\omega_{-jl} = (\omega_{j1l}, \ldots, \omega_{j(j-1)l}, \omega_{j(j+1)l}, \ldots, \omega_{jpl})$ be the $(p-1)$-dimensional vector excluding the $j$th component of $\omega_{jl}$, let $\Omega_{-jl}$ be the sub-matrix of $\Omega_l$ without its $j$th row and column, and let $\Omega_{-jl}^{-1}$ be the inverse of $\Omega_{-jl}$.
Our proposed method maximizes (2) by sweeping each row (or column) of $\Omega_l$ across $l = 1, \cdots, L$. Using the property that $\det(\Omega_l) = \det(\Omega_{-jl})(\omega_{jjl} - \omega_{-jl}^T \Omega_{-jl}^{-1} \omega_{-jl})$, with $T$ indicating the transpose, we rewrite (2), after ignoring constant terms, as a function of each row (or column) $(\omega_{j1}, \cdots, \omega_{jL})$ across $l$; $j = 1, \cdots, p$:
\[
\sum_{l=1}^{L} n_l \big( \log(\omega_{jjl} - \omega_{-jl}^T \Omega_{-jl}^{-1} \omega_{-jl}) - s_{jjl}\omega_{jjl} - 2 s_{-jl}^T \omega_{-jl} \big) - \sum_{k \neq j} J_{jk}(\omega_{jk1}, \cdots, \omega_{jkL}). \tag{3}
\]
First, for each fixed row (or column) of $\Omega$ across $l = 1, \cdots, L$, we maximize (3) over the diagonals $(\omega_{jj1}, \cdots, \omega_{jjL})$ given the corresponding off-diagonals $(\omega_{-j1}, \cdots, \omega_{-jL})$. Setting the partial derivatives of (3) with respect to the diagonals to zero yields the profile maximizer of (2):
\[
\hat\omega_{jjl} = 1/s_{jjl} + \omega_{-jl}^T \Omega_{-jl}^{-1} \omega_{-jl}, \quad l = 1, \cdots, L. \tag{4}
\]
Second, substituting (4) into (3) yields the negative profile likelihood of (2) for $(\omega_{-j1}, \cdots, \omega_{-jL})$:
\[
\sum_{l=1}^{L} n_l \big( s_{jjl}\, \omega_{-jl}^T \Omega_{-jl}^{-1} \omega_{-jl} + 2 s_{-jl}^T \omega_{-jl} \big) + \sum_{k \neq j} J_{jk}(\omega_{jk1}, \cdots, \omega_{jkL}). \tag{5}
\]
Third, the aforementioned process is repeated for each row (or column) of $\Omega$ until a certain stopping criterion is satisfied. By Theorem 1, profiling is equivalent to the original problem for separable convex penalty functions, summarized as follows.

Theorem 1 Iteratively minimizing (5) over the off-diagonals $(\omega_{-j1}, \cdots, \omega_{-jL})$ and updating the diagonals $\omega_{jjl}$ by (4); $j = 1, \cdots, p$, $l = 1, \cdots, L$, converges to a local maximizer of (2). Moreover, if the $J_{jk}(\cdot)$ are convex, it converges to a global maximizer.

Theorem 1 reduces (2) to iteratively solving (5), which is quadratic in its argument. On this ground we design efficient methods for solving (2) with a specific choice of $J_{jk}(\cdot)$ next.
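To make the sweep concrete, the sketch below evaluates the profile update (4) for one row index $j$ of a single matrix, inverting $\Omega_{-jl}$ directly for clarity; the paper's algorithm instead reuses the cheaper rank-one update derived in Section 3. Names are illustrative.

```python
import numpy as np

def update_diagonal(omega_l, S_l, j):
    """Profile maximizer (4) of omega_{jjl}, given the current off-diagonals of row j."""
    keep = [i for i in range(omega_l.shape[0]) if i != j]
    omega_mj = omega_l[keep, j]                                  # omega_{-jl}
    omega_sub_inv = np.linalg.inv(omega_l[np.ix_(keep, keep)])   # Omega_{-jl}^{-1}
    return 1.0 / S_l[j, j] + omega_mj @ omega_sub_inv @ omega_mj
```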
2.2 Pursuit of sparseness and clustering structures
A zero element in $\Omega_l$ corresponds to conditional independence between two components of $X^{(l)}$ given its other components [9]. Thus, within each precision matrix $\Omega_l$, estimating its elements reconstructs its graph structure, where a zero element of $\Omega_l$ corresponds to no edge between the two nodes, encoding conditional independence. In addition, the nodes connecting many other nodes, called network hubs, are identified. On the other hand, over multiple precision matrices, estimating the element-wise clustering structure can reveal the changes of the sparseness and clustering structures.
To detect clustering structures, consider element-wise clustering of the entries of $\Omega_1, \cdots, \Omega_L$ based on possible prior knowledge. The prior knowledge is specified loosely in an undirected graph $U$ with each node corresponding to a triplet $(j, k, l)$; $1 \le j < k \le p$, $1 \le l \le L$. That is, an edge between nodes $(j, k, l)$ and $(j, k, l')$ means that the $(j,k)$th entry of $\Omega_l$ and the $(j,k)$th entry of $\Omega_{l'}$ tend to be similar a priori and thus can be pushed to share the same value. Specifically, let $E_{jk}$ denote a set of edges between two distinct nodes $(j, k, l) \neq (j, k, l')$ of $U$, where $(l, l') \in E_{jk}$ indicates a connection between the two nodes $(j, k, l)$ and $(j, k, l')$. To identify homogeneous subgroups of the off-diagonals $\{\omega_{jkl}\}$ of $\Omega_l$ across $l = 1, \cdots, L$ over $U$, including the group of zero-elements, we propose a non-convex penalty of the form
\[
J_{jk}(\omega_{jk1}, \cdots, \omega_{jkL}) = \lambda_1 \sum_{l=1}^{L} J_\tau(|\omega_{jkl}|) + \lambda_2 \sum_{(l,l') \in E_{jk}} J_\tau(|\omega_{jkl} - \omega_{jkl'}|) \tag{6}
\]
to regularize (2), where $\lambda_1$ and $\lambda_2$ are nonnegative tuning parameters controlling the degrees of sparseness and clustering, and $J_\tau(z) = \min(|z|, \tau)$ is the truncated $L_1$-penalty of [17], called TLP in what follows, which, after rescaling by $\frac{1}{\tau}$, approximates the $L_0$-function as the tuning parameter $\tau > 0$ tends to $0^+$.
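For illustration, a minimal sketch of the TLP and of the combined penalty (6) for a single $(j,k)$ entry follows; the argument `edges` plays the role of $E_{jk}$ and is an assumption of this sketch.

```python
import numpy as np

def tlp(z, tau):
    """Truncated L1 penalty J_tau(z) = min(|z|, tau)."""
    return np.minimum(np.abs(z), tau)

def penalty_jk(omega_jk, edges, lam1, lam2, tau):
    """Penalty (6) for one (j,k) entry across L matrices; omega_jk is a length-L array."""
    sparsity = lam1 * tlp(omega_jk, tau).sum()
    clustering = lam2 * sum(tlp(omega_jk[l] - omega_jk[lp], tau) for (l, lp) in edges)
    return sparsity + clustering
```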
Note that our approach is applicable to a variety of applications by specifying the graph
U. For time varying graphs, our method can be used to detect the change of clustering
structure, where Ejk is a serial graph as in the fused Lasso [19], and a serial temporal relation
is defined only for elements in adjacent matrices. One key difference between our method and the smoothing methods [26, 7] is that ours accommodates abrupt changes of structure across networks. For multiple graphs without a serial ordering, the proposed method makes it possible to identify element-wise heterogeneity among undirected graphs, such as gene regulatory networks corresponding to disparate cancer subtypes [18, 21]. Heterogeneity of this type can be dealt with by specifying a complete graph for each $E_{jk}$.
3 Computation
This section proposes a relaxation method to treat non-convex penalties in (6). For large-scale problems, a partition rule may be useful, which breaks large matrices into many small
ones to process separately. A novel necessary and sufficient partition rule is derived for our
non-convex penalization method as well as its convex counterpart, generalizing the results
for single precision matrix estimation [12, 22].
3.1 Non-convex optimization
For the non-convex minimization (2) with (6), we develop a relaxation method by solving a
sequence of convex problems. This method integrates difference convex (DC) programming
with block-wise coordinate descent method based on the foregoing strategy.
For DC programming, we first decompose the cost function $-S(\Omega)$ in (2) into a difference of two convex functions, $-S(\Omega) = S_1(\Omega) - S_2(\Omega)$, with
\[
S_1(\Omega) = -\sum_{l=1}^{L} n_l\big(\log\det(\Omega_l) - \mathrm{tr}(S_l\Omega_l)\big) + \lambda_1\sum_{(j,k,l): j\neq k}|\omega_{jkl}| + \lambda_2\sum_{1\le j\neq k\le p}\sum_{(l,l')\in E_{jk}}|\omega_{jkl} - \omega_{jkl'}|,
\]
\[
S_2(\Omega) = \lambda_1\sum_{l=1}^{L}\sum_{j\neq k}\max(|\omega_{jkl}| - \tau, 0) + \lambda_2\sum_{1\le j\neq k\le p}\sum_{(l,l')\in E_{jk}}\max(|\omega_{jkl} - \omega_{jkl'}| - \tau, 0), \tag{7}
\]
where a DC decomposition $J_\tau(|z|) = |z| - \max(|z| - \tau, 0)$ is used. Then the trailing convex function $S_2(\Omega)$ is iteratively approximated by its minorization, say at iteration $m$,
\[
\lambda_1 \sum_{l=1}^{L} \sum_{j \neq k} I(|\hat\omega_{jkl}^{(m)}| \le \tau)\,|\omega_{jkl}| + \lambda_2 \sum_{1 \le j \neq k \le p} \sum_{(l,l') \in E_{jk}} I(|\hat\omega_{jkl}^{(m)} - \hat\omega_{jkl'}^{(m)}| \le \tau)\,|\omega_{jkl} - \omega_{jkl'}|.
\]
This is obtained through the minorization $|z^{(m)}| + \zeta(|z^{(m)}|)(|z| - |z^{(m)}|)$ of $\max(|z| - \tau, 0)$ at $|\hat\omega_{jkl}^{(m)}|$, which is the solution at iteration $m-1$, where $\zeta(|z^{(m)}|)$ is the gradient of $\max(|z| - \tau, 0)$ at $|z^{(m)}|$; see [17] for more discussion about minorization of this type. At iteration $m$, the cost function to minimize is
\[
- \sum_{l=1}^{L} n_l \big( \log\det(\Omega_l) - \mathrm{tr}(S_l \Omega_l) \big) + \lambda_1 \sum_{(j,k,l) \in E^{(m)}} |\omega_{jkl}| + \lambda_2 \sum_{\{(j,k,l),(j,k,l')\} \in F^{(m)}} |\omega_{jkl} - \omega_{jkl'}|, \tag{8}
\]
subject to $\Omega_l \succ 0$; $l = 1, \cdots, L$, where $E^{(m)} = \{(j,k,l) : |\hat\omega_{jkl}^{(m)}| \le \tau,\, j \neq k\}$ and $F^{(m)} = \{\{(j,k,l),(j,k,l')\} : (l,l') \in E_{jk},\, |\hat\omega_{jkl}^{(m)} - \hat\omega_{jkl'}^{(m)}| \le \tau\}$.
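The index sets $E^{(m)}$ and $F^{(m)}$ driving (8) can be formed directly from the current estimates, as in the sketch below; it assumes the precision matrices are stacked in an array of shape $(L, p, p)$ and that the same edge set $E_{jk}$ is used for every $(j,k)$.

```python
import numpy as np

def dc_index_sets(omega_hat, E_jk, tau):
    """Build E^(m) and F^(m) from the current estimates omega_hat (shape L x p x p)."""
    L, p, _ = omega_hat.shape
    E_m, F_m = [], []
    for j in range(p):
        for k in range(p):
            if j == k:
                continue
            for l in range(L):
                if abs(omega_hat[l, j, k]) <= tau:
                    E_m.append((j, k, l))                    # entry still penalized by lambda_1
            for (l, lp) in E_jk:
                if abs(omega_hat[l, j, k] - omega_hat[lp, j, k]) <= tau:
                    F_m.append(((j, k, l), (j, k, lp)))      # pair still fused by lambda_2
    return E_m, F_m
```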
To solve (8), we apply Theorem 1 to iteratively minimize
\[
\sum_{l=1}^{L} n_l \big( s_{jjl}\, \omega_{-jl}^T \Omega_{-jl}^{-1} \omega_{-jl} + 2 s_{-jl}^T \omega_{-jl} \big) + \lambda_1 \sum_{(j,k,l) \in E^{(m)}} |\omega_{jkl}| + \lambda_2 \sum_{\{(j,k,l),(j,k,l')\} \in F^{(m)}} |\omega_{jkl} - \omega_{jkl'}|, \tag{9}
\]
and update the diagonal elements using (4). This quadratic problem can then be efficiently solved using augmented Lagrangian methods as in [27]. Unlike the coordinate descent method, which updates one component at a time, we update one component of $\zeta$ and two components $(\omega_{jkl}, \omega_{jjl})$ of $\Omega_l$ at the same time.
In (9), computation of $\Omega_{-jl}^{-1}$ by directly inverting $\Omega_{-jl}$ has a complexity of $O(p^3)$ operations for each $(j, l)$. For efficient computation, we utilize the special property of our sweeping operator: the $(p-1)^2$ elements of $\Omega_l$ are unchanged except for the one row and one column being swept, in addition to the rank-one property of the updating formula. In (9), we derive an analytic formula, through block-wise inversion and the Neumann formula for a square matrix, to compute $(\Omega_{-jl})^{-1}$ from $(\omega_{jjl}, \omega_{-jl}, \Omega_l^{-1})$ and $\Omega_l^{-1}$ from $(\omega_{jjl}, \omega_{-jl}, (\Omega_{-jl})^{-1})$ for each $(j, l)$. That is,
\[
(\Omega_{-jl})^{-1} = (\Omega_l^{-1})_{-j} - \frac{(\Omega_l^{-1})_j (\Omega_l^{-1})_j^T}{(\Omega_l^{-1})_{jj}}, \tag{10}
\]
\[
\Omega_l^{-1} = \begin{pmatrix} (\Omega_{-jl})^{-1} + b\,aa^T & -ba \\ -ba^T & b \end{pmatrix}, \quad a = (\Omega_{-jl})^{-1}\omega_{-jl}, \; b = (\omega_{jjl} - a^T \omega_{-jl})^{-1}. \tag{11}
\]
This amounts to $O(p^2)$ operations.
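The downdate (10) is easy to verify numerically: removing the $j$th row and column and inverting directly agrees with the rank-one correction of $\Omega_l^{-1}$, as in the following check (illustrative names only).

```python
import numpy as np

rng = np.random.default_rng(0)
p, j = 6, 2
A = rng.standard_normal((p, p))
omega = A @ A.T + p * np.eye(p)          # a positive definite "precision" matrix
omega_inv = np.linalg.inv(omega)

keep = [i for i in range(p) if i != j]
direct = np.linalg.inv(omega[np.ix_(keep, keep)])           # invert Omega_{-j} directly
col_j = omega_inv[keep, j]                                  # (Omega^{-1})_j without entry j
update = omega_inv[np.ix_(keep, keep)] - np.outer(col_j, col_j) / omega_inv[j, j]
assert np.allclose(direct, update)                          # formula (10) holds
```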
The foregoing discussion leads to our DC block-wise coordinate descent algorithm through sweeping operations over the $p(p-1)$ off-diagonals of $(\Omega_1, \cdots, \Omega_L)$, with each operation involving the $L$ corresponding off-diagonals.

Algorithm 1:

Step 1. (Initialization) Set $\hat\Omega_l^{(0)} = I$; $l = 1, \cdots, L$, $E^{(0)} = \{(j,k,l) : 1 \le j \neq k \le p, 1 \le l \le L\}$, $F^{(0)} = E$, $m = 0$, and the precision tolerance $\epsilon = 10^{-5}$ for Step 2.

Step 2. (Iteration) At the current iteration $m$, initialize $\Omega = \hat\Omega^{(m)}$. Then solve (8) by applying the block-wise coordinate descent algorithm to update $\Omega$, yielding $\hat\Omega^{(m+1)}$, and set $E^{(m+1)} = \{(j,k,l) : |\hat\omega_{jkl}^{(m+1)}| \le \tau, j \neq k\}$ and $F^{(m+1)} = \{\{(j,k,l),(j,k,l')\} : (l,l') \in E_{jk}, |\hat\omega_{jkl}^{(m+1)} - \hat\omega_{jkl'}^{(m+1)}| \le \tau\}$. Specifically,

a) For each row (column) index $j = 1, \cdots, p$, compute $\Omega_{-jl}^{-1}$ using (10); $l = 1, \cdots, L$. Solve (9) to obtain $\hat\omega_{-jl}^{(m)}$, and then compute $\hat\omega_{jjl}^{(m)}$ through (4); $l = 1, \cdots, L$. Update $\Omega_l^{(m)}$ with its $j$th row replaced by $(\hat\omega_{jjl}^{(m)}, \hat\omega_{-jl}^{(m)})$ and its $j$th column by symmetry. Finally, update $(\Omega_l^{(m)})^{-1}$ using (11). Go to the next index $j+1$ until all rows of $\Omega_l^{(m)}$ have been swept.

b) Repeat a) until the decrement of the objective function is less than $\epsilon$. After convergence, update $\Omega$ to yield $\hat\Omega^{(m+1)} = \Omega$ based on a).

Step 3. (Stopping criterion) Terminate when $E^{(m+1)} = E^{(m)}$ and $F^{(m+1)} = F^{(m)}$; otherwise, repeat Step 2 with $m = m + 1$.
The overall complexity of Algorithm 1 is of order $O(p^3 L^2)$. The actual computational time of our algorithm depends highly on the values of $\lambda_1$, $\lambda_2$ and the number of iterations. In Example 1, it takes about 30 seconds for one simulation run with $(p, L) = (200, 4)$ over a grid of 100 tuning values on an 8-core computer with an Intel(R) Core(TM) i7-3770 processor and 16GB of RAM.
3.2 Partition rule for large-scale problems
This section establishes a necessary and sufficient partition rule for our non-convex penalization method and its convex counterpart using the sample covariances, permitting fast computation for large-scale problems: the rule partitions the nodes into disjoint subsets, excluding the zero-coefficient subset, and the proposed method is then applied to each nonzero subset. Such a result exists only for a single matrix or a special case of multiple matrices, c.f., [12, 22].
In what follows, we only consider the case where the $E_{jk} = E$ are identical. Given this graph $G = (V, E)$, with $V = \{1, \cdots, L\}$ and $E = E_{jk}$, $1 \le j < k \le p$, denoting the node and edge sets, we write $l \sim l'$ if $(l, l') \in E$, i.e., the two nodes are connected. First consider the convex grouping penalty over $G$, followed by a general case, where the penalized log-likelihood is
\[
\sum_{l=1}^{L}\Big[ n_l\big( -\log\det(\Omega_l) + \mathrm{tr}(\Omega_l S_l)\big) + \lambda_1 \|\Omega_{l,\mathrm{off}}\|_1 \Big] + \lambda_2 \sum_{l \sim l'} \|\Omega_{l,\mathrm{off}} - \Omega_{l',\mathrm{off}}\|_1, \tag{12}
\]
where $\Omega_{l,\mathrm{off}}$ denotes the off-diagonal elements of $\Omega_l$ and $S_l = (s_{jkl})_{1 \le j,k \le p}$ are the sample covariance matrices, $l = 1, \cdots, L$.
The next theorem derives a necessary and sufficient condition for the $jk$th elements of the $\hat\Omega_l$ to satisfy $\hat\omega_{jkl} = 0$ across $l = 1, \cdots, L$, for $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, where $(\hat\Omega_1, \cdots, \hat\Omega_L)$ is the minimizer of (12) and $\mathcal{J} \subset \{1, \cdots, p\}$ is any subset. This partitions the node set into disjoint subsets of connected nodes, with no connections between these subsets.

Theorem 2 (Partition rule for (12)) $\hat\omega_{jkl} = 0$ for all $j \in \mathcal{J}$, $k \in \mathcal{J}^c$ and $l = 1, \cdots, L$, if and only if $(s_{jk1}, \cdots, s_{jkL}) \in S$ for all $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, where
\[
S = \Big\{ s = (s_1, \cdots, s_L) : \Big| \sum_{l \in I} n_l s_l \Big| \le \lambda_1 |I| + \lambda_2 d(I, I^c), \ \forall I \subseteq V \Big\},
\]
with $d(I, I^c) = \sum_{l \in I, l' \in I^c} I(l \sim l')$ denoting the number of edges between the nodes in $I$ and the remaining nodes in $I^c$.
Similar results hold for the proposed non-convex regularized estimators.

Theorem 3 (Partition rule for non-convex regularization) Denote by $\hat\Omega^{dc}$ the solution obtained from Algorithm 1 for (2). Similarly, given any $\mathcal{J}$, $\hat\omega_{jkl}^{dc} = 0$ for all $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $l = 1, \cdots, L$, if and only if $(s_{jk1}, \cdots, s_{jkL}) \in S$, where $S = \big\{ s = (s_1, \cdots, s_L) : |\sum_{l \in I} n_l s_l| \le \lambda_1|I| + \lambda_2 d(I, I^c), \ \forall I \subseteq V \big\}$.
Corollary 1 simplifies the expression of $S$ for specific graphs.

Corollary 1 In the cases of the fused graph and the complete graph, we have, respectively,
\[
S = \Big\{ s : \Big|\sum_{i=1}^{l} n_i s_i\Big| \le l\lambda_1 + \lambda_2, \ \Big|\sum_{i=L-l+1}^{L} n_i s_i\Big| \le l\lambda_1 + \lambda_2, \ l = 1, \cdots, L-1; \ \Big|\sum_{i=l_1+1}^{l_2} n_i s_i\Big| \le (l_2 - l_1)\lambda_1 + 2\lambda_2, \ 1 \le l_1 < l_2 < L; \ \Big|\sum_{i=1}^{L} n_i s_i\Big| \le L\lambda_1 \Big\};
\]
\[
S = \Big\{ s : \Big|\sum_{i=1}^{l} n_{k_i} s_{k_i}\Big| \le l\lambda_1 + l(L-l)\lambda_2, \ \Big|\sum_{i=L-l+1}^{L} n_{k_i} s_{k_i}\Big| \le l\lambda_1 + l(L-l)\lambda_2, \ l = 1, \cdots, L \Big\},
\]
where $(k_1, \cdots, k_L)$ is a permutation of $(1, \cdots, L)$ such that $s_{k_1} \ge \cdots \ge s_{k_L}$.
The partition rule is useful for efficient computation, as it may reduce computation cost
substantially. It can be used in several ways. First, the rule partitions nodes into disjoint connected subsets through the sample covariances sjkl ’s. This breaks the original large problem
into smaller subproblems, owing to this necessary and sufficient rule. Second, Algorithm 1
can be applied to each subproblem independently, permitting parallel computation.
Algorithm 2 integrates the partition rule in Theorem 3 with Algorithm 1 to make the
proposed method applicable to large-scale problems.
Algorithm 2 (A partition version of Algorithm 1):

Step 1. (Screening) Compute the sample covariance matrices $S_l$; $l = 1, \cdots, L$. Construct a $p \times p$ symmetric matrix $T = (t_{jk})_{1 \le j,k \le p}$, with $t_{jk} = 0$ if $(s_{jk1}, \cdots, s_{jkL}) \in S$ and $t_{jk} = 1$ otherwise. Treating $T$ as the adjacency matrix of an undirected graph, compute its maximal connected components to form a partition of the nodes $\{\mathcal{J}_1, \cdots, \mathcal{J}_q\}$ using a breadth-first or depth-first search algorithm, c.f., [3].

Step 2. (Subproblems) For $i = 1, \cdots, q$, solve (2) for the subproblem consisting of the nodes in $\mathcal{J}_i$ by applying Algorithm 1, obtaining the solution $\hat\Omega^{(i)} = (\hat\Omega_1^{(i)}, \cdots, \hat\Omega_L^{(i)})$.

Step 3. (Combining results) The final solution is $\hat\Omega_l = \mathrm{Diag}(\hat\Omega_l^{(1)}, \cdots, \hat\Omega_l^{(q)})$; $l = 1, \cdots, L$.
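As an illustration of Step 1, the sketch below screens the entry-wise sample covariances with the complete-graph form of $S$ from Corollary 1 and then extracts connected components; the helper names are ours, and scipy's connected_components stands in for an explicit breadth-first or depth-first search.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def in_S_complete(s, n, lam1, lam2):
    """Check (s_1, ..., s_L) against Corollary 1's bound for the complete graph."""
    L = len(s)
    order = np.argsort(-np.asarray(s))            # s_{k_1} >= ... >= s_{k_L}
    ns = np.asarray(n)[order] * np.asarray(s)[order]
    for l in range(1, L + 1):
        bound = l * lam1 + l * (L - l) * lam2
        if abs(ns[:l].sum()) > bound or abs(ns[L - l:].sum()) > bound:
            return False
    return True

def screen(S_list, n, lam1, lam2):
    """Build the adjacency matrix T and return the node partition J_1, ..., J_q."""
    p = S_list[0].shape[0]
    T = np.zeros((p, p), dtype=int)
    for j in range(p):
        for k in range(j + 1, p):
            s_jk = [S_l[j, k] for S_l in S_list]
            if not in_S_complete(s_jk, n, lam1, lam2):
                T[j, k] = T[k, j] = 1
    q, labels = connected_components(csr_matrix(T), directed=False)
    return [np.where(labels == i)[0] for i in range(q)]
```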
4 Theory
This section investigates theoretical aspects of the proposed method. First we develop a general theory on maximum penalized likelihood estimation involving two types of $L_0$-constraints for pursuit of sparseness and clustering. Then we specialize the theory to estimation of multiple precision matrices in Section 4.3. Now consider a constrained $L_0$-version of (2):
\[
\max_{\theta = (\beta, \eta)} L(\theta), \quad \text{subject to } \sum_{j=1}^{d} I(|\beta_j| \neq 0) \le C_1, \ \sum_{(jj') \in E} I(|\beta_j - \beta_{j'}| \neq 0) \le C_2, \tag{13}
\]
as well as its computational surrogate
\[
\max_{\theta = (\beta, \eta)} L(\theta), \quad \text{subject to } \sum_{j=1}^{d} J_\tau(|\beta_j|) \le C_1, \ \sum_{(jj') \in E} J_\tau(|\beta_j - \beta_{j'}|) \le C_2, \tag{14}
\]
where $\theta = (\beta, \eta)$ with $\beta \in \mathbb{R}^d$ and $\eta$ representing the off-diagonals and the diagonals of $\Omega$, respectively, and $(C_1, C_2, \tau)$ are three non-negative tuning parameters. Note that Algorithm 1 yields a local minimizer of (14), relaxing it by solving a sequence of convex problems.
In what follows, we will prove that global minimizers of (13) and (14) reconstruct the
ideal oracle estimator as if the true sparseness and clustering structures of the precision
matrices were known in advance. As a result of the reconstruction, key properties of the
oracle estimator are simultaneously achieved by the proposed method.
4.1 The oracle estimator and consistent graph
To define the oracle estimator, let $G(\beta)$ denote a partition of $I \equiv \{1, \cdots, d\}$ by the parameter $\beta$, i.e., $G(\beta) = (I_0(\beta), \cdots, I_{K(\beta)}(\beta))$, with $I_0(\beta) = I \setminus A(\beta)$ and $I_k(\beta)$ satisfying $\beta_j = \beta_{j'}$ for $j, j' \in I_k(\beta)$; $k = 1, \cdots, K(\beta)$, where $K(\beta)$ is the number of nonzero clusters and $A(\beta) \equiv \{i : \beta_i \neq 0\}$ is the support of $\beta$. Let $G^0 = G(\beta^0)$ be the true partition induced by $\beta^0$, with $\theta^0 = (\beta^0, \eta^0)$ the true parameter value and $\beta^0 \in \mathbb{R}^d$.

Definition 1 (Oracle estimator) Given $G^0$, the oracle estimator is defined as $\hat\theta^o = (\hat\beta^o, \hat\eta^o) = \mathrm{argmax}_{\theta = (\beta,\eta): G(\beta) = G^0} L(\theta)$, the corresponding maximum likelihood estimator.
In (13) and (14), the edge set $E$ of $U$ is important for clustering. In order for simultaneous pursuit of sparseness and clustering structures to be possible, we may need $U$ to be consistent with the clustering structure of the true precision matrices. In other words, a consistent graph is a minimal requirement for reconstruction of the oracle estimator: there must exist a path connecting any nodes within the same true cluster.

Definition 2 (Consistent graph $U$) An undirected graph $U = (I, E)$ is consistent with the true clustering $G^0 = \{I_0^0, \cdots, I_{K_0}^0\}$ if the subgraph restricted to the nodes in $I_j^0$ is connected; $j = 1, \cdots, K_0$.
4.2 Non-asymptotic probability error bounds
This section derives a non-asymptotic probability error bound for simultaneous sparseness
and clustering pursuit, based on which we prove that (13) and (14) reconstruct the oracle
estimator. This implies consistent identification of the sparseness and clustering structures
of multiple graphical models, under one simple assumption, called the degree-of-separation
condition.
Before proceeding, we introduce some notation. Given a graph $U = (I, E)$, let $S = \{\theta = (\beta, \eta) : |A(\beta)| \le d_0,\, C(\beta, E) \le c_0,\, G(\beta) \neq G(\beta^0)\}$ be a constrained set with $C(\beta, E) = \sum_{(jj') \in E} I(|\beta_j - \beta_{j'}| \neq 0)$, where $d_0 = |A_0|$ with $A_0 = A(\beta^0)$ as defined above. Given a partition $G$, let $S_G = \{\theta \in S : G(\beta) = G\}$. Given an index set $A \subseteq I$, let $S_A = \{\theta \in S : A(\beta) = A\}$. Let $S_i = \cup_{A : |A_0 \setminus A| = i} S_A$ and $S_i^\star = \max_{A : |A_0 \setminus A| = i} \#\{G(\beta) : \theta = (\beta, \eta) \in S_A\}$; $i = 0, \cdots, d_0$, and let $S^\star = \exp\big( \max_{0 \le i \le d_0} \frac{\log S_i^\star}{\max(i, 1)} \big)$. Roughly, $S^\star$ quantifies the complexity of the space of candidate precision matrices, scaled by the number of nonzero entries.

The degree-of-separation condition will be used to ensure consistent reconstruction of the oracle estimator: for some constant $c_1 > 0$,
\[
C_{\min}(\theta^0) \ge c_1 \frac{\log d + \log S^\star}{n}, \tag{15}
\]
where $C_{\min}(\theta^0) \equiv \inf_{\theta = (\beta, \eta) \in S} \frac{-\log(1 - h^2(\theta, \theta^0))}{\max(|A_0 \setminus A(\beta)|, 1)}$, with $|\cdot|$ and $\setminus$ denoting the size of a set and the set difference, respectively, and $h(\theta, \theta^0) = \big( \frac{1}{2}\int (g^{1/2}(\theta, y) - g^{1/2}(\theta^0, y))^2 d\mu(y) \big)^{1/2}$ is the Hellinger distance for densities with respect to a dominating measure $\mu$.
We now define the bracketing Hellinger metric entropy of a space $F$, denoted $H(\cdot, F)$, which is the logarithm of the cardinality of the smallest $u$-bracketing of $F$. That is, for a bracket covering $S(u, m) = \{f_1^l, f_1^u, \cdots, f_m^l, f_m^u\} \subset L_2$ satisfying $\max_{1 \le j \le m} \|f_j^u - f_j^l\|_2 \le u$ and, for any $f \in F$, there exists a $j$ such that $f_j^l \le f \le f_j^u$ a.e. $P$, $H(u, F)$ is $\log(\min\{m : S(u, m)\})$, where $\|f\|_2^2 = \int f^2(z)\, d\mu$. For more discussion about metric entropy of this type, see [8].
Assumption A: (Complexity of the parameter space) For some constant $c_0 > 0$ and any $0 < t < \varepsilon \le 1$, $H(t, B_G) \le c_0 \max\big((\log p)^2, 1\big) |A| \log(2\varepsilon/t)$, where $B_G = F_G \cap \{h(\theta, \theta^0) \le 2\varepsilon\}$ is a local parameter space, and $F_G = \{g^{1/2}(\theta, y) : \theta = (\beta, \eta),\ G(\beta) = G\}$ is a collection of square-root densities indexed by any subset $G \in \{G(\beta) : \theta = (\beta, \eta) \in S\}$.
Next we present our non-asymptotic probability error bounds for reconstruction of the oracle estimator $\hat\theta^o$ by global minimizers of (13) and (14) in terms of $C_{\min}(\theta^0)$, $n$, $d$ and $d_0$, where $d_0$ and $d$ can depend on $n$. Consistency is established for reconstruction of $\hat\theta^o$ as well as for structure recovery. Note that $\hat\theta^o$ is asymptotically optimal, hence its optimality translates to the global minimizers of (13) and (14).
Theorem 4 (Global minimizer of (13)) Under Assumption A, if $U$ is consistent with $G^0$, then for a global minimizer $\hat\theta^{l_0}$ of (13) with estimated grouping $\hat G^{l_0} = G(\hat\beta^{l_0})$ at $(C_1, C_2) = (d_0, c_0)$ with $c_0 = C(\beta^0, E)$,
\[
P\big(\hat G^{l_0} \neq G^0\big) = P\big(\hat\theta^{l_0} \neq \hat\theta^o\big) \le \exp\big( -c_2 n C_{\min}(\theta^0) + 2\log d + \log S^\star \big). \tag{16}
\]
Under (15), $P(\hat G^{l_0} = G^0) = P(\hat\theta^{l_0} = \hat\theta^o) \to 1$ as $n, d \to \infty$.
For the constrained truncated $L_1$-likelihood, one additional condition, Assumption B, is necessary. We require the Hellinger distance to be smooth so that the approximation of the truncated $L_1$-function to the $L_0$-function becomes adequate by tuning $\tau$.

Assumption B: For some constants $d_1$-$d_3 > 0$,
\[
-\log(1 - h^2(\theta, \theta^0)) \ge -d_1 \log(1 - h^2(\theta^\tau, \theta^0)) - d_3 \tau^{d_2} d, \tag{17}
\]
where $\theta^\tau = (\beta^\tau, \eta)$ with $\beta^\tau = (\beta_1^\tau, \cdots, \beta_p^\tau)$ and $\beta_j^\tau = \frac{\sum_{j' \in I_k} \beta_{j'}}{|I_k|}$ for $j \in I_k(\beta)$; $k = 0, 1, \cdots, K(\beta)$.
Theorem 5 (Global minimizer of (14)) Assume that Assumption A, with $F_G$ replaced by $F_G^\tau = \{g^{1/2}(\theta, y) : \theta = (\beta, \eta),\ G^\tau(\beta) = G\}$, and Assumption B are met. If $U$ is consistent with $G^0$, then for a global minimizer $\hat\theta^g$ of (14) with estimated grouping $\hat G^g = G(\hat\beta^g)$ at $(C_1, C_2) = (d_0, c_0)$ with $c_0 = C(\beta^0, E)$ and $\tau \le \big( \frac{(d_1 - c_3) C_{\min}(\theta^0)}{d_3 d} \big)^{1/d_2}$,
\[
P\big(\hat G^g \neq G^0\big) = P\big(\hat\theta^g \neq \hat\theta^o\big) \le \exp\big( -c_3 n C_{\min}(\theta^0) + 2\log d + \log S^\star \big). \tag{18}
\]
Under (15), $P(\hat G^g = G^0) = P(\hat\theta^g = \hat\theta^o) \to 1$ as $n, d \to \infty$.
4.3 An illustrative example
We now apply the general theory in Theorems 4 and 5 to the estimation of multiple precision matrices, in which the true precision matrices within each cluster are the same, with $g_0 \equiv \sum_{l=1}^{L-1} \sum_{j>k} I(\omega_{jkl}^0 \neq \omega_{jk(l+1)}^0)$ the number of break points among these clusters. In this case, a serial graph $U$ is considered for clustering.

Denote by $p$ and $L_0$ the dimension of the precision matrices and the number of distinct clusters, respectively. Let $H_l = \frac{\partial^2 (-\log\det(\Omega_l))}{\partial^2 \Omega}\big|_{\Omega_l = \Omega_l^0}$ be the $p^2 \times p^2$ Hessian matrix of $-\log\det(\Omega_l)$, whose $(jk, j'k')$ element is $\mathrm{tr}(\Sigma_l \Delta_{jk} \Sigma_l \Delta_{j'k'})$, c.f., [2]. Define
\[
\eta_{\min} = \min\Big( \min_{(j,k,l): \omega_{jkl}^0 \neq 0} |\omega_{jkl}^0|, \ \frac{1}{\sqrt{2}} \min_{(j,k,l): \omega_{jkl}^0 \neq \omega_{jk(l+1)}^0} |\omega_{jkl}^0 - \omega_{jk(l+1)}^0| \Big)
\]
to be the resolution level for simultaneous sparseness and clustering pursuit.

An application of Theorems 4 and 5, with $\beta$ and $\eta$ being the off-diagonals and diagonals of $\Omega$, leads to the following result.
Corollary 2 (Multiple precision matrices with a serial graph) When $U$ is a serial graph, all the results in Theorems 4 and 5 for simultaneous pursuit of sparseness and clustering hold under two simple conditions:
\[
C_{\min}(\theta^0) \ge c_4 \min_{1 \le l \le L} c_{\min}(H_l)\, \eta_{\min}^2, \quad \text{and} \quad \log S^\star \le 2 g_0 \max(\log(d_0/g_0), 1), \tag{19}
\]
for some constant $c_4 > 0$. In particular, if
\[
\min_{1 \le l \le L} c_{\min}(H_l)\, \eta_{\min}^2 \ge c_0 \frac{\log\big(Lp(p-1)/2\big) - g_0 \max\big(\log(d_0/g_0), 1\big)}{n} \tag{20}
\]
holds for some constant $c_0 > 0$, then $P(\hat\Omega^{l_0} \neq \hat\Omega^o)$ and $P(\hat\Omega^g \neq \hat\Omega^o) \to 0$ as $n, d \to +\infty$.

Corollary 2 suggests that the amount of reconstruction improvement would be of order $1/L$ if the $L$ precision matrices are identical. In general, the amount of improvement of joint estimation over separate estimation is $L/\log(L)$ when $g_0$ is small, i.e., $g_0 \max(\log(d_0/g_0), 1) \lesssim \log(Lp(p-1)/2)$, by contrasting the sufficient condition in (20) with that for a separate estimation approach in [17], where $\lesssim$ denotes inequality ignoring constant terms. Here $g_0$ describes the similarity among the $L$ precision matrices, with a small value corresponding to a high degree of similarity shared among the precision matrices.
5 Numerical examples
This section studies operational characteristics of the proposed method via simulation in
sparse and nonsparse situations with different types of graphs in both low- and high-dimensional
settings. In each simulated example, we compare our method against its convex counterpart
for seeking the sparseness structure for each graphical model and identifying the grouping
structure among multiple graphical models, and contrast the method against its counterpart seeking the sparseness structure alone. In addition, we also compare against a kernel
smoothing method for time-varying networks [7, 26] in Examples 1-3, whenever appropriate. The smoothing method defines a weighted average over the sample covariance matrices at the time points as
\[
\tilde S_l(h) = \frac{\sum_{l'=1}^{L} w_{ll'}(h)\, S_{l'}}{\sum_{l'=1}^{L} w_{ll'}(h)}, \quad w_{ll'}(h) = K(h^{-1}|l - l'|); \ l = 1, \cdots, L,
\]
where $K(x) = (1 - |x|) I(|x| < 1)$ is a triangular kernel, $h$ is a bandwidth, and $l = 1, \cdots, L$ indexes the clusters. Then within each cluster $l$, the precision matrix estimate $\hat\Omega_l(h, \lambda)$ is obtained by solving
\[
\hat\Omega_l(h, \lambda) = \underset{\Omega_l \succ 0}{\mathrm{argmin}} \Big( -\log\det(\Omega_l) + \mathrm{tr}\big(\Omega_l \tilde S_l(h)\big) + \lambda \sum_{j < j'} |\omega_{jj'l}| \Big), \quad l = 1, \cdots, L, \tag{21}
\]
using the glasso algorithm [5], and the final estimate is obtained by tuning over a grid of $(h, \lambda)$. Two performance metrics are used to measure the accuracy of parameter estimation as well as that of correct identification of the sparseness and grouping structures.
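A minimal sketch of this baseline follows, with sklearn's graphical_lasso used as a stand-in for the glasso algorithm of [5]; the bandwidth handling and function names are assumptions of the sketch.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def smoothed_covariances(S_list, h):
    """Triangular-kernel weighted averages of the sample covariance matrices."""
    L = len(S_list)
    S_tilde = []
    for l in range(L):
        w = np.array([max(1 - abs(l - lp) / h, 0.0) for lp in range(L)])  # K(|l-l'|/h)
        S_tilde.append(sum(wi * Si for wi, Si in zip(w, S_list)) / w.sum())
    return S_tilde

def kernel_smoothing_estimate(S_list, h, lam):
    """Precision matrix estimates (21) at bandwidth h and penalty lam."""
    return [graphical_lasso(S, alpha=lam)[1] for S in smoothed_covariances(S_list, h)]
```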
In Examples 1-3, temporal clustering pursuit is performed over $\Omega_1, \cdots, \Omega_L$ through a serial graph $E = \{\{(j,k,l),(j',k',l')\} : j = j', k = k', |l - l'| = 1\}$; that is, only adjacent matrices may possibly be clustered. In Example 4, general clustering pursuit is conducted through a complete graph $E = \{\{(j,k,l),(j',k',l')\} : j = j', k = k', l < l'\}$.
For the accuracy of parameter estimation, the average entropy loss (EL) and average quadratic loss (QL) are considered, defined as
\[
EL = \frac{1}{L}\sum_{l=1}^{L}\Big(\mathrm{tr}\big(\Omega_l^{-1}\hat\Omega_l\big) - \log\det\big(\Omega_l^{-1}\hat\Omega_l\big)\Big), \qquad
QL = \frac{1}{L}\sum_{l=1}^{L}\mathrm{tr}\Big(\big(\Omega_l^{-1}\hat\Omega_l - I\big)^2\Big).
\]
For the accuracy of identification, the average false positive (FPV) and false negative (FNV) rates for sparseness pursuit, as well as the corresponding rates (FPG) and (FNG) for grouping, are used:
\[
FPV = \frac{1}{L}\sum_{l=1}^{L}\frac{\sum_{1\le j<j'\le p} I(\omega_{jj'l}=0,\ \hat\omega_{jj'l}\neq 0)}{\sum_{1\le j<j'\le p} I(\omega_{jj'l}=0)}, \qquad
FNV = \frac{1}{L}\sum_{l=1}^{L}\frac{\sum_{1\le j<j'\le p} I(\omega_{jj'l}\neq 0,\ \hat\omega_{jj'l}=0)}{\sum_{1\le j<j'\le p} I(\omega_{jj'l}\neq 0)}\, I\big(\Omega_{l,\mathrm{off}}\neq 0\big),
\]
\[
FPG = \frac{1}{|E|}\sum_{l\sim l'}\frac{\sum_{1\le j<j'\le p} I(\omega_{jj'l}=\omega_{jj'l'},\ \hat\omega_{jj'l}\neq\hat\omega_{jj'l'})}{\sum_{1\le j<j'\le p} I(\omega_{jj'l}=\omega_{jj'l'})}, \qquad
FNG = \frac{1}{|E|}\sum_{l\sim l'}\frac{\sum_{1\le j<j'\le p} I(\omega_{jj'l}\neq\omega_{jj'l'},\ \hat\omega_{jj'l}=\hat\omega_{jj'l'})}{\sum_{1\le j<j'\le p} I(\omega_{jj'l}\neq\omega_{jj'l'})}\, I\big(\Omega_{l,\mathrm{off}}\neq\Omega_{l',\mathrm{off}}\big),
\]
where $\Omega_{l,\mathrm{off}}$ denotes the off-diagonal elements of $\Omega_l$. Note that FPV and FPG, as well as FNV and FNG, are not directly comparable due to normalization with and without the zero-group, respectively.
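For reference, a small sketch computing EL and QL as defined above is given below; `omegas_true` and `omegas_hat` are illustrative names for lists of $p \times p$ precision matrices.

```python
import numpy as np

def entropy_and_quadratic_loss(omegas_true, omegas_hat):
    """Average entropy loss (EL) and quadratic loss (QL) over the L matrices."""
    L = len(omegas_true)
    el = ql = 0.0
    for omega0, omega_hat in zip(omegas_true, omegas_hat):
        M = np.linalg.solve(omega0, omega_hat)          # Omega_l^{-1} * hat(Omega)_l
        el += np.trace(M) - np.linalg.slogdet(M)[1]
        I_p = np.eye(M.shape[0])
        ql += np.trace((M - I_p) @ (M - I_p))
    return el / L, ql / L
```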
For tuning, we minimize a prediction criterion with respect to the tuning parameter(s) on an independent test set with the same sample size as the training set. The prediction criterion is $CV(\lambda) = \frac{1}{L}\sum_{l=1}^{L}\big( -\log\det(\hat\Omega_l(\lambda)) + \mathrm{tr}(S_l^{tune}\hat\Omega_l(\lambda)) \big)$, where $S_l^{tune}$ is the sample covariance matrix for the tuning data; $l = 1, \cdots, L$. The estimated tuning parameter is then $\lambda^\star = \mathrm{argmin}_\lambda CV(\lambda)$, which is used in the estimated precision matrices. Here minimization of $CV(\lambda)$ is performed through a simple grid search over the domain of the tuning parameter(s).
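The grid search can be sketched as follows, where `fit` stands for any routine that returns the estimated precision matrices at a given tuning value; both names are assumptions of this sketch.

```python
import numpy as np

def cv_criterion(omegas_hat, S_tune_list):
    """Held-out criterion CV(lambda) averaged over the L groups."""
    vals = [-np.linalg.slogdet(om)[1] + np.trace(S @ om)
            for om, S in zip(omegas_hat, S_tune_list)]
    return float(np.mean(vals))

def tune(fit, lambda_grid, S_tune_list):
    """Return the grid value minimizing CV(lambda)."""
    scores = {lam: cv_criterion(fit(lam), S_tune_list) for lam in lambda_grid}
    return min(scores, key=scores.get)
```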
All simulations are performed based on 100 simulation replications. Three different types of networks are considered. Specifically, Example 1 concerns a chain network with small $p$ and $L$ but large $n$, where each $\Omega_l$ is relatively sparse and a temporal change occurs at two different $l$ values. Example 2 deals with a nearest neighbor network for each $\Omega_l$ and the same temporal structure as in Example 1. Examples 3 and 4 study exponentially decaying networks with nonsparse precision matrices in high- and low-dimensional situations with large and small $L$, respectively. Algorithm 1 is applied in Examples 1-3 and Algorithm 2 in Example 4.
Example 1: Chain networks. This example estimates tridiagonal precision matrices as in [4]. Specifically, $\Omega_l^{-1} = \Sigma_l$ is AR(1)-structured with its $ij$-element being $\sigma_{ijl} = \exp(-|s_{il} - s_{jl}|/2)$, where $s_{1l} < s_{2l} < \cdots < s_{pl}$ are randomly chosen with $s_{il} - s_{(i-1)l} \sim \mathrm{Unif}(0.5, 1)$; $i = 2, \cdots, p$, $l = 1, \cdots, L$. The following situations are considered: (I) $(n, p, L) = (120, 30, 4)$ and $(n, p, L) = (120, 200, 4)$, with $\Omega_1 = \Omega_2$, $\Omega_3 = \Omega_4$; (II) $(n, p, L) = (120, 20, 30)$ and $(n, p, L) = (120, 10, 90)$, with $\Omega_1 = \cdots = \Omega_{L/3}$, $\Omega_{1+L/3} = \cdots = \Omega_{2L/3}$, $\Omega_{1+2L/3} = \cdots = \Omega_L$. In this way, we study the proposed method's performance as a function of the number of graphs and the number of nodes.
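A sketch of this data-generating design, assuming scenario (I) with $L = 4$ and illustrative names, is given below; the inverse of the AR(1)-type covariance is indeed tridiagonal.

```python
import numpy as np

def chain_network(p, rng):
    s = np.cumsum(rng.uniform(0.5, 1.0, size=p))            # s_1 < ... < s_p with Unif(0.5,1) gaps
    sigma = np.exp(-np.abs(s[:, None] - s[None, :]) / 2)     # AR(1)-type covariance Sigma_l
    return np.linalg.inv(sigma)                              # precision matrix (tridiagonal)

rng = np.random.default_rng(1)
base = [chain_network(30, rng) for _ in range(2)]
omega_true = [base[0], base[0], base[1], base[1]]            # Omega_1 = Omega_2, Omega_3 = Omega_4
```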
Example 2: Nearest neighbor networks. This example concerns networks described in [11]. In particular, we generate $p$ points randomly on a unit square and compute the $k$ nearest neighbors of each point based on the Euclidean distance. In the case of $k = 3$, three points are connected to each point. For each "edge" in the graph, the corresponding off-diagonal in a precision matrix is sampled independently according to the uniform distribution over $[-1, -0.5] \cup [0.5, 1]$, and the $i$th diagonal is set to be the sum of the absolute values of the $i$th row off-diagonals. Given the previous cluster, the matrices in the current cluster are obtained by randomly adding or deleting a small fraction of nonzero elements in the matrices from the previous cluster. Finally, each row of a precision matrix is divided by the square root of the product of the corresponding diagonals ($\omega_{ij} \leftarrow \omega_{ij}/\sqrt{\omega_{ii}\omega_{jj}}$) so that the diagonals of the final precision matrices are one. The following scenarios are considered: (I) $(n, p, L) = (300, 30, 4)$ and $(n, p, L) = (300, 200, 4)$, where $\Omega_1 = \Omega_2$, $\Omega_3 = \Omega_4$, and (II) $(n, p, L) = (300, 20, 30)$ and $(n, p, L) = (300, 10, 90)$, where $\Omega_1 = \cdots = \Omega_{L/3}$, $\Omega_{1+L/3} = \cdots = \Omega_{2L/3}$, $\Omega_{1+2L/3} = \cdots = \Omega_L$. In (I), the first cluster of matrices $(\Omega_1, \Omega_2)$ is generated using the above mechanism, with the second cluster of matrices $(\Omega_3, \Omega_4)$ obtained by deleting one edge for each node in the network. In (II), the generating mechanism remains the same except that the third cluster of matrices $(\Omega_{1+2L/3}, \cdots, \Omega_L)$ is generated by adding an edge for each node to its previous adjacent network.
Example 3: Exponentially decaying networks. This example examines a nonsparse situation in which all elements of the precision matrices are nonzero and decay exponentially with respect to their distances from the corresponding diagonals. In particular, the $(i,j)$th entry of the $l$th precision matrix is $\omega_{ijl} = \exp(-a_l |i - j|)$ with $a_l$ sampled uniformly over $[1, 2]$. In this case, it is sensible to report the results for parameter estimation as opposed to identification of nonzeros. As in Examples 1 and 2, several scenarios are considered: (I) $(p, L) = (30, 4)$, $(p, L) = (200, 4)$, and the sample size $n = 120$ or $300$, with $\Omega_1 = \Omega_2$, $\Omega_3 = \Omega_4$; and (II) $(p, L) = (20, 30)$, $(p, L) = (10, 90)$, and the sample size $n = 120$ or $300$, with $\Omega_1 = \cdots = \Omega_{L/3}$, $\Omega_{1+L/3} = \cdots = \Omega_{2L/3}$, $\Omega_{1+2L/3} = \cdots = \Omega_L$.
Example 4: Large precision matrices. This example utilizes the partition rule to treat large-scale simulations. We examine two cases, $(n, p, L) = (120, 1000, 4)$ and $(n, p, L) = (500, 2000, 4)$, with $\Omega_1 = \Omega_2$ and $\Omega_3 = \Omega_4$, where four precision matrices of size $1000 \times 1000$ and $2000 \times 2000$ are considered for pairwise clustering and $U$ is the complete graph. Here each precision matrix is set to be a block-diagonal matrix, $\Omega_l = \mathrm{Diag}(\Omega_{l1}, \cdots, \Omega_{lq})$; $l = 1, \cdots, L$, where $\Omega_{1j} = \Omega_{2j}$, $\Omega_{3j} = \Omega_{4j}$ are $20 \times 20$ matrices generated in the same fashion as in Example 1. Finally, the complete graph is used as opposed to the fused graph. Overall, the complexity is much higher than in the previous examples.
Tables 1-4 and Figure 1 about here
As suggested by Tables 1-4, the proposed method performs well against its competitors
in parameter estimation and correct identification of the sparseness and grouping structures
across all the situations. With regard to accuracy of identification of the sparseness and
clustering structures, the proposed method has the smallest false positives in terms of F P V
and FPG, yielding sharper parameter estimation than the competitors. This indicates that shrinkage towards common elements is advantageous for parameter estimation in both low- and high-dimensional situations. Note that the largest improvement occurs in the most difficult situation, Example 4.
Compared with pursuit of sparseness alone (TLP), the amount of improvement of our method is from 143% to 244% and from 118% to 236% in terms of the EL and QL when n = 120, and from 80.5% to 228% and from 96.5% to 240% in terms of the EL and QL when n = 300,
as indicated in Table 3. This comparison suggests that exploring the sparseness structure
alone is inadequate for multiple graphical models. Pursuit of two types of structures appears
advantageous in terms of performance, especially for large matrices.
Compared with its convex counterpart "our-con", our method leads to improvements of between 19.9% and 106% in the EL and between 4.2% and 75.5% in the QL when n = 120, and improvements of between 18.3% and 120% in the EL and between 90.3% and 151% in the QL when n = 300; see Table 3. This is expected because more accurate identification of structures tends to yield better parameter estimation.
In contrast to the smoothing method [7, 26] for time-varying network analysis, across all
cases except one low-dimensional case of L = 4 and p = 30 in Table 3, our method yields a
54.5% improvement and a 20.3% improvement in terms of the EL and QL when n = 120, and
a 51.9% improvement and a 25.8% improvement in terms of the EL and QL when n = 300
when L is not too small, c.f. Tables 1-4.
To understand how the proposed method performs relative to (n, p, L), we examine Table 3 and Figure 1 in further detail. Overall, the proposed method performs better as n and L increase and worse as p increases. Interestingly, as suggested by Figure 1, the method performs better as L increases, which is consistent with our theoretical analysis.
In summary, the proposed method achieves the desired objective of pursuing both sparseness and clustering structures simultaneously to battle the curse of dimensionality in high-dimensional situations.
6 Signaling network inference
This section applies the proposed method to the multivariate single cell flow cytometry data
in [16] to infer a signaling network or pathway; a consensus version of the network with eleven
proteins is described in Figure 3. In this study, a multiparameter flow cytometry recorded
the quantitative amounts of the eleven proteins in a single cell as an observation. To infer
the network, experimental perturbations on various aspects of the network were imposed
before the amounts of the eleven proteins were measured under each condition. The idea
was that, if a chemical was applied to stimulate or inhibit the activity of a protein, then both
the abundance of the protein and those of its downstream proteins in the network would be
expected to increase or decrease, while those of non-related proteins would barely change.
There were ten types of experimental perturbations on different targets: 1) activating a target (CD3) in the upstream of the network so that the whole network was expected to be perturbed; 2) activating a target (CD28) in the upstream of the network; 3) activating a target (ICAM2) in the upstream of the network; 4) activating PKC; 5) activating PKA; 6) inhibiting PKC; 7) inhibiting Akt; 8) inhibiting PIP2; 9) inhibiting Mek; 10) inhibiting a target (PI3K) in the upstream of the network. In [16], data were collected under nine experimental conditions and then used to infer a directed network; each of the nine experimental conditions was either a single type of perturbation or a combination of two or three types of perturbations. Interestingly, data were also collected under another five conditions, each of which was a combination of two of the previous nine conditions. Hence, the data offered an opportunity to infer the two networks under the two sets of conditions: since the two sets of conditions largely overlapped, we would expect the two networks to be largely similar to each other; on the other hand, due to the difference between the two sets of conditions, some deviations between the two networks were also anticipated. There were n1 = 7466 and n2 = 4206 observations under the two sets of conditions, respectively.
We apply the proposed method to the normalized data under the two sets of the conditions
respectively. Due to the expected similarities between the two networks, we consider grouping
to encourage common structure defined by connecting edges between the two networks. The
tuning parameters are estimated by a three-fold cross-validation. The reconstructed two
undirected networks are now displayed in Figure 2, with 9 and 8 estimated (undirected)
links for the two groups of conditions, being a subset of the 20 (directed) links in the gold
standard signaling network as displayed in Figure 3, which is a consensus network that has
been verified biologically, c.f. [16]. The reconstructed undirected graphs miss some edges
as compared to the gold standard network, for instance, the links from protein “PKC” to
“Raf” and “Mek”. The three edges missed by [16], “PIP3” to “Akt”, “Plcg” to “PKC”, and
“PIP2” to “PKC”, are also missed by our method, possibly reflecting lack of information in
the data due to no direct interventions imposed on “PIP3” and ”Plcg”.
Overall, the proposed method appears to work well in that the network inferred from the
first set of conditions recovers one more dependence relationship than that from the second
set of conditions, which is expected given that the second set of interventional conditions is
less specific than the first one.
Here we analyze the data by contrasting the network constructed under the nine conditions with n1 = 7466 against that under the five conditions with n2 = 4206. Of particular
interest is the detection of network structural changes between the two sets of conditions.
Figures 2 and 3 about here
7 Discussion
This article proposes a novel method to pursue two disparate types of structures—sparseness
and clustering for multiple Gaussian graphical models. The proposed method is equipped
with an efficient algorithm for large graphs, which is integrated with a partition rule to
break down a large problem into many separate small problems to solve. For data analysis,
we have considered signaling network inference in a low-dimensional situation. Worthy of
note is that the proposed method can be equally applied to high-dimensional data, such as
reconstructing and comparing gene regulatory networks across four subtypes of glioblastoma
multiforme based on gene expression data [21].
To make the proposed method useful in practice, inferential tools need to be further
developed. A Monte Carlo method may be considered given the level of complexity of
the underlying problems. Moreover, the general approach developed here can be expanded
to other types of graphical models, for instance, dynamic network models or time-varying
graphical models [7]. This enables us to build time dependency into a model through, for
example, a Markov property. Further investigation is necessary.
8 Appendix
Proof of Theorem 1: The equivalence follows directly from Theorem 4.1 in [20]. □
Next we present two lemmas to be used in the proof of Theorem 2.
Lemma 1 For any $x_0, x_1, \cdots, x_m \in \mathbb{R}^L$, (22) and (23) are equivalent:
\[
\exists\, |b_1|, \cdots, |b_m| \le 1 \ \text{such that} \ x_0 + b_1 x_1 + \cdots + b_m x_m = 0, \tag{22}
\]
\[
\text{for all } c \in \mathbb{R}^L, \quad |c^T x_0| \le |c^T x_1| + \cdots + |c^T x_m|. \tag{23}
\]
Proof: If (22) holds, then for any $c \in \mathbb{R}^L$, $|c^T x_0| = |b_1 c^T x_1 + \cdots + b_m c^T x_m| \le |b_1||c^T x_1| + \cdots + |b_m||c^T x_m|$, which is no greater than $|c^T x_1| + \cdots + |c^T x_m|$, implying (23). For the converse, assume that for any $c \in \mathbb{R}^L$, $|c^T x_0| \le |c^T x_1| + \cdots + |c^T x_m|$. Consider the following convex minimization:
\[
\min_{\{b_1, \cdots, b_m\}} \sum_{i=1}^{m} B(b_i) \quad \text{subject to} \quad x_0 + b_1 x_1 + \cdots + b_m x_m = 0, \tag{24}
\]
where $B(x)$ is an indicator function with $B(x) = 0$ when $|x| \le 1$ and $B(x) = +\infty$ otherwise. First, we need to show that the constraint set in (24) is nonempty. Suppose that it is empty. Let $c_0 = (I - P_{(x_1, \cdots, x_m)}) x_0$, where $P_{(x_1, \cdots, x_m)}$ is the projection matrix onto the linear space spanned by $x_1, \cdots, x_m$. Since the constraint set is empty, $x_0$ does not lie in the linear space spanned by $x_1, \cdots, x_m$, and hence $\|c_0\|_2 > 0$. Therefore $|c_0^T x_0| = \|c_0\|_2^2 > 0 = |c_0^T x_1| + \cdots + |c_0^T x_m|$, contradicting the assumption that $|c^T x_0| \le |c^T x_1| + \cdots + |c^T x_m|$. Hence the constraint set of (24) is nonempty, and we denote its optimal value by $p^\star$. Next we convert (24) to its dual by introducing a dual variable $\nu \in \mathbb{R}^L$ for the equality constraints in (24) through Lagrange multipliers:
\[
\max_{\nu \in \mathbb{R}^L} \ \nu^T x_0 - |\nu^T x_1| - \cdots - |\nu^T x_m|. \tag{25}
\]
By the assumption that $|c^T x_0| \le |c^T x_1| + \cdots + |c^T x_m|$ for any $c$, the maximum $d^\star$ of (25) must satisfy $d^\star \le 0$. Hence $d^\star = 0$ because it is attained by $\nu = 0$. Moreover, Slater's condition holds because the constraint set of (24) is nonempty. By the strong duality principle, the duality gap is zero, and hence $p^\star = d^\star = 0$. Consequently, a minimizer $(b_1, \cdots, b_m)$ of (24) exists with $|b_1| \le 1, \cdots, |b_m| \le 1$, satisfying the constraint $x_0 + b_1 x_1 + \cdots + b_m x_m = 0$. This implies (22). This completes the proof. □
Lemma 2 For $s = (s_1, \cdots, s_L)$ and a connected graph $G = (V, E)$, the existence of $|g_l| \le 1$, $|g_{ll'}| \le 1$, $g_{ll'} = -g_{l'l}$; $1 \le l, l' \le L$, such that
\[
n_l s_l + \lambda_1 g_l + \lambda_2 \sum_{l' \sim l} g_{ll'} = 0, \quad l = 1, \cdots, L, \tag{26}
\]
is equivalent to $|\sum_{l \in I} n_l s_l| \le \lambda_1 |I| + \lambda_2 d(I, I^c)$ for any $I \subseteq V$, with $d(I, I^c) = \sum_{l \in I, l' \in I^c} I(l \sim l')$.

Proof: First, if (26) holds for some $|g_l| \le 1$, $|g_{ll'}| \le 1$, $g_{ll'} = -g_{l'l}$; $1 \le l, l' \le L$, then
\[
\Big| \sum_{l \in I} n_l s_l \Big| = \Big| \lambda_1 \sum_{l \in I} g_l + \lambda_2 \sum_{l \in I}\sum_{l' \sim l} g_{ll'} \Big| = \Big| \lambda_1 \sum_{l \in I} g_l + \lambda_2 \sum_{l \in I}\sum_{l' \in I^c} I(l \sim l')\, g_{ll'} \Big|,
\]
where the terms $g_{ll'}$ with both $l, l' \in I$ cancel by antisymmetry; this is no greater than $\lambda_1 |I| + \lambda_2 d(I, I^c)$ for any $I \subseteq V$. Conversely, by Lemma 1, it suffices to show that for any $c \in \mathbb{R}^L$,
\[
\Big| \sum_{l=1}^{L} c_l n_l s_l \Big| \le \lambda_1 \sum_{l=1}^{L} |c_l| + \lambda_2 \sum_{l \sim l'} |c_l - c_{l'}|, \tag{27}
\]
provided that $|\sum_{l \in I} n_l s_l| \le \lambda_1 |I| + \lambda_2 d(I, I^c)$ for any $I \subseteq V$. To this end, for any permutation $(k_1, \cdots, k_L) \in \sigma(1, \cdots, L)$ and $l = 1, \cdots, L$, define the convex region $C_l^{k_1 \cdots k_L} = \{c = (c_1, \cdots, c_L) : c_{k_1} \ge \cdots \ge c_{k_l} \ge 0 \ge \cdots \ge c_{k_L}\}$, where $\sigma(1, \cdots, L)$ denotes the set of all permutations of $(1, \cdots, L)$. It is easy to see that $\cup_{l=1}^{L} \cup_{(k_1, \cdots, k_L) \in \sigma(1, \cdots, L)} C_l^{k_1 \cdots k_L} = \mathbb{R}^L$. Then consider the function $g(c) = |\sum_{l=1}^{L} c_l n_l s_l| - \lambda_1 \sum_{l=1}^{L} |c_l| - \lambda_2 \sum_{l \sim l'} |c_l - c_{l'}|$. Note that $g(c)$ is convex over each region $C_l^{k_1 \cdots k_L}$. By the maximum principle, its maximum (over each region) is attained at the extreme points of $C_l^{k_1 \cdots k_L}$. It is easy to show that the extreme points must be of the form $c = (t\mathbf{1}_I, \mathbf{0}_{I^c})$ for some $I \subseteq V$ and $t \neq 0$; that is, the nonzero components must be equal to each other. Hence $g(c)$ evaluated at the extreme points of $C_l^{k_1 \cdots k_L}$ reduces to $|\sum_{l \in I} n_l s_l| - \lambda_1|I| - \lambda_2 d(I, I^c)$ for some $I \subseteq V$, which, by assumption, is always nonpositive. This completes the proof. □
Proof of Theorem 2: We shall use the KKT condition of (12), or local optimality, which is of the form
\[
-n_l \hat\Omega_l^{-1} + n_l S_l + \lambda_1 \partial\|\hat\Omega_l\|_{1,\mathrm{off}} + \lambda_2 \sum_{l': l \sim l'} \partial\|\hat\Omega_l - \hat\Omega_{l'}\|_{1,\mathrm{off}} = 0, \quad l = 1, \cdots, L, \tag{28}
\]
where $\hat\Omega_l^{-1}$ is the inverse of $\hat\Omega_l$ and $\partial\|\cdot\|_1$ denotes the subgradient of the $\ell_1$ function. If $\hat\omega_{jkl} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$, then $(\hat\Omega_l^{-1})_{jk} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. By Lemma 2, we must have $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$. Conversely, if $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, then again by Lemma 2 the KKT condition (28) holds at $\hat\omega_{jkl} = 0$; $l = 1, \cdots, L$, for the $jk$th components for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$. Hence $\hat\omega_{jkl} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. This completes the proof. □
Proof of Theorem 3: Let $(\hat\Omega_1^{(m)}, \cdots, \hat\Omega_L^{(m)})$ be the DC solution at iteration $m$. If the diagonal matrix is used for initialization as in Algorithm 1, then an application of Theorem 2 to $(\hat\Omega_1^{(1)}, \cdots, \hat\Omega_L^{(1)})$ yields that if $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, then $\hat\omega_{jkl}^{(1)} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. Next, we prove by induction that if $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, then $\hat\omega_{jkl}^{(m)} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$, holds for any $m \ge 1$. Suppose that $\hat\omega_{jkl}^{(m-1)} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$, holds for some $m \ge 2$. Then at DC iteration $m$, $|\hat\omega_{jkl}^{(m-1)}| = 0 \le \tau$ and $|\hat\omega_{jkl}^{(m-1)} - \hat\omega_{jkl'}^{(m-1)}| = 0 \le \tau$. This, together with Theorem 2, again implies that $\hat\omega_{jkl}^{(m)} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. Using the finite convergence of the DC algorithm, c.f., Theorem 1, we conclude that if $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, then $\hat\omega_{jkl}^{dc} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. Conversely, if for some $\mathcal{J}$, $\hat\omega_{jkl}^{dc} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$, consider the next DC iteration: we have $\hat\omega_{jkl}^{(m^\star+1)} = 0$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$, $1 \le l \le L$. Using the same argument as above with the converse part of Theorem 2, we obtain that $(s_{jk1}, \cdots, s_{jkL}) \in S$ for any $j \in \mathcal{J}$, $k \in \mathcal{J}^c$. This completes the proof. □
Proof of Corollary 1: For the fused graph, take $I = \{1, \cdots, l\}$, $\{L-l+1, \cdots, L\}$, $\{l_1+1, \cdots, l_2\}$ and $\{1, \cdots, L\}$; if $|\sum_{l \in I} n_l s_l| \le \lambda_1|I| + \lambda_2 d(I, I^c)$ for all such $I$, then $|\sum_{i=1}^{l} n_i s_i| \le l\lambda_1 + \lambda_2$, $|\sum_{i=L-l+1}^{L} n_i s_i| \le l\lambda_1 + \lambda_2$, $|\sum_{i=l_1+1}^{l_2} n_i s_i| \le (l_2-l_1)\lambda_1 + 2\lambda_2$, and $|\sum_{i=1}^{L} n_i s_i| \le L\lambda_1$. Conversely, if $I = \{1, \cdots, L\}$, then $|\sum_{i \in I} n_i s_i| \le \lambda_1|I| + \lambda_2 d(I, I^c) = L\lambda_1$. Next, assume that $I \neq \{1, \cdots, L\}$ and write $I = \cup_{k=1}^{q}\{i_k, i_k+1, \cdots, i_k+l_k\}$ with $i_1 \le i_1+l_1 < i_2 < i_2+l_2 < \cdots < i_q < i_q+l_q$. Then
\[
\Big|\sum_{i \in I} n_i s_i\Big| \le \sum_{k=1}^{q}\Big|\sum_{i=i_k}^{i_k+l_k} n_i s_i\Big| \le \lambda_1\sum_{k=1}^{q}(l_k+1) + \big(2(q-1) + I(i_1 \neq 1) + I(i_q+l_q \neq L)\big)\lambda_2 = |I|\lambda_1 + d(I, I^c)\lambda_2.
\]
In the case of the complete graph, take $I = \{k_1, \cdots, k_l\}$ and $\{k_{L-l+1}, \cdots, k_L\}$; given $s_{k_1} \le \cdots \le s_{k_L}$, we then have $|\sum_{i=1}^{l} n_{k_i} s_{k_i}| \le l\lambda_1 + l(L-l)\lambda_2$ and $|\sum_{i=L-l+1}^{L} n_{k_i} s_{k_i}| \le l\lambda_1 + l(L-l)\lambda_2$. Conversely, for any $I$, $|\sum_{i \in I} n_i s_i| \le \max\big(|\sum_{i=1}^{|I|} n_{k_i} s_{k_i}|, |\sum_{i=L-|I|+1}^{L} n_{k_i} s_{k_i}|\big) \le |I|\lambda_1 + |I|(L-|I|)\lambda_2$. This completes the proof. □
Proof of Theorem 4: The proof uses a large deviation probability inequality of [23] to treat one-sided log-likelihood ratios with constraints. This enables us to obtain sharp results without a moment condition on both tails of the log-likelihood ratios.

Recall that $S = \{\theta = (\beta, \eta) : |A(\beta)| \le d_0,\, C(\beta, E) \le c_0,\, G(\beta) \neq G(\beta^0)\}$ and $S_A = \{\theta \in S : A(\beta) = A\}$. Let the class of candidate subsets be $\mathcal{A} \equiv \{A \neq A_0 : |A| \le d_0\}$ for sparseness pursuit. Note that any $A \subset \{1, \cdots, d\}$ can be partitioned into $(A \setminus A_0) \cup (A \cap A_0)$. Then we partition $S$ accordingly as $S = \cup_{i=0}^{d_0} \cup_{A \in B_i} S_A$, where $B_i = \mathcal{A} \cap \{A : |A_0 \setminus A| = i\}$, with $|B_i| = \binom{d_0}{d_0-i}\sum_{j=0}^{i}\binom{d-d_0}{j}$; $i = 0, \cdots, d_0$. Moreover, $S_A = \cup_{G \in \{G(\beta): \theta = (\beta,\eta) \in S_A\}} S_G$, where $S_G = \{\theta = (\beta, \eta) \in S : G(\beta) = G\}$. So $S = \cup_{i=0}^{d_0} \cup_{A \in B_i} \cup_{G \in \{G(\beta): \theta = (\beta,\eta) \in S_A\}} S_G$.

To bound the error probability, note that if $\hat G^{l_0} = G^0$ then $\hat\theta^{l_0} = \hat\theta^o$, by Definition 1. Conversely, if $\hat\theta^{l_0} = \hat\theta^o$, or $\hat\beta^{l_0} = \hat\beta^o$, then $\hat G^{l_0} = G^0$. Thus $\{\hat G^{l_0} = G^0\} = \{\hat\theta^{l_0} = \hat\theta^o\}$. So $\{\hat\theta^{l_0} \neq \hat\theta^o\} \subseteq \{L(\hat\theta^{l_0}) - L(\hat\theta^o) \ge 0\} \subseteq \{L(\hat\theta^{l_0}) - L(\theta^0) \ge 0\}$. This, together with $\{\hat\theta^{l_0} \neq \hat\theta^o\} \subseteq \{\hat\theta^{l_0} \in S\}$, implies that $\{\hat\theta^{l_0} \neq \hat\theta^o\} \subseteq \{L(\hat\theta^{l_0}) - L(\theta^0) \ge 0\} \cap \{\hat\theta^{l_0} \in S\}$. Consequently, $I \equiv P(\hat\theta^{l_0} \neq \hat\theta^o) \le P(L(\hat\theta^{l_0}) - L(\theta^0) \ge 0; \hat\theta^{l_0} \in S)$ is upper bounded by
\[
\sum_{i=0}^{d_0}\sum_{A \in B_i}\sum_{G \in \{G(\beta): \theta \in S_A\}} P^\star\Big( \sup_{\theta \in S_G} L(\theta) - L(\theta^0) \ge 0 \Big)
\le \sum_{i=0}^{d_0}\sum_{A \in B_i}\sum_{G \in \{G(\beta): \theta \in S_A\}} P^\star\Big( \sup_{-\log(1-h^2(\theta,\theta^0)) \ge \max(i,1) C_{\min}(\theta^0),\ \theta \in S_G} L(\theta) - L(\theta^0) \ge 0 \Big),
\]
where $P^\star$ is the outer measure and the last two inequalities use the fact that $S = \cup_{i=0}^{d_0} \cup_{A \in B_i} \cup_{G \in \{G(\beta): \theta = (\beta,\eta) \in S_A\}} S_G$ and $S_G \subseteq \{\theta : \max(|A_0 \setminus A|, 1) C_{\min}(\theta^0) \le -\log(1 - h^2(\theta, \theta^0))\}$ for $G \in \{G(\beta) : \theta = (\beta, \eta) \in S_A\}$.

For $I$, we apply Theorem 1 of [23] to bound each term. Toward this end, we verify their entropy condition (3.1) for the local entropy over $S_G$ for $G \in \{G(\beta) : \theta = (\beta, \eta) \in S_A\}$, $A \in B_i$ and $i = 0, \cdots, d_0$. Under Assumption A, $\varepsilon = \varepsilon_{n,p_0,p} = (2c_0)^{1/2} c_4^{-1} \log(2^{1/2}/c_3)\, \log p\, (p_0/n)^{1/2}$ satisfies the condition there with respect to $\varepsilon > 0$, that is,
\[
\sup_{\{0 \le |A| \le p_0\}} \int_{2^{-8}\varepsilon^2}^{2^{1/2}\varepsilon} H^{1/2}(t/c_3, B_A)\, dt \le p_0^{1/2}\, 2^{1/2}\varepsilon \log(2/2^{1/2}c_3) \le c_4 n^{1/2}\varepsilon^2 \tag{29}
\]
for some constants $c_3 > 0$ and $c_4$, say $c_3 = 10$ and $c_4 = \frac{(2/3)^{5/2}}{512}$. By Assumption A, $C_{\min}(\theta^0) \ge \varepsilon_{n,p_0,p}^2$ implies (29), provided that $d_0 \ge (2c_0)^{1/2} c_4^{-1} \log(2^{1/2}/c_3)$.

Now, let $S_i^\star = \max_{A \in B_i} \#\{G : I_0^c = A\}$ and $\log(S^\star) = \max_{1 \le i \le p_0} \log(S_i^\star)/i$. Using the inequalities for binomial coefficients $\sum_{j=0}^{i}\binom{d-d_0}{j} \le (d-d_0)^i$ and $\binom{d_0}{i} \le d_0^i$, so that $|B_i| = \binom{d_0}{d_0-i}\sum_{j=0}^{i}\binom{d-d_0}{j} \le \big(d_0(d-d_0)\big)^i \le (d^2/4)^i$, we have, by Theorem 1 of [23], that for a constant $c_2 > 0$, say $c_2 = \frac{4}{27}\cdot\frac{1}{1926}$,
\[
I \le \sum_{i=0}^{d_0} |B_i| S_i^\star \exp\big( -c_2 n i\, C_{\min}(\theta^0) \big) \le \sum_{i=0}^{d_0} \Big(\frac{d^2}{4}\Big)^i S_i^\star \exp\big( -c_2 n i\, C_{\min}(\theta^0) \big) \le \exp\big( -c_2 n C_{\min}(\theta^0) + 2\log d + \log(S^\star) \big).
\]
This completes the proof. □
Proof of Theorem 5: The proof is similar to that of Theorem 4 with some minor modifications. Given $\tau$ and $\beta\in\mathbb{R}^d$, a partition $G^\tau(\beta) = (I_0(\beta),\dots,I_{K(\beta)}(\beta))$ associated with $\beta$ is defined to satisfy the following: (i) $\max_{j\in I_0(\beta)}|\beta_j|\le\tau$; (ii) $|\beta_{j_1}-\beta_{j_2}|\le\tau$ for any $j_1, j_2$ in different groups. Let $A^\tau(\beta) = I\setminus I_0(\beta)$.
The rest of the proof is basically the same as that of Theorem 2, with the modification that $G(\beta)$ and $A(\beta)$ are replaced by $G^\tau(\beta)$ and $A^\tau(\beta)$, respectively. Here, $S = \{\theta=(\beta,\eta): \sum_{j=1}^{p} J_\tau(|\beta_j|)\le d_0,\ \sum_{(jj')\in E} J_\tau(|\beta_j-\beta_{j'}|)\le c_0,\ G^\tau(\beta)\ne G(\beta^0)\}$, $C^\tau(\beta,E) = \sum_{(jj')\in E} I(|\beta_j-\beta_{j'}|\ne 0)$, $S_A = \{\theta\in S: A^\tau(\beta)=A\}$ and $S_G = \{\theta=(\beta,\eta)\in S: G^\tau(\beta)=G\}$.
Next, we show that $\hat\theta^{g} = \hat\theta^{o}$ if and only if $\hat G^{g} = G^0$, where $\hat G^{g} \equiv G^\tau(\hat\beta^{g})$. Now $d_1 \equiv |I\setminus \hat I_0^\tau| = d_0$. By (14), $\tau^{-1}\sum_{j\in \hat I_0^\tau} |\hat\beta^{g}_j| + d_1 \le d_0$, which, together with $d_0 = d_1$, yields that $\hat\beta^{g}_j = 0$ for $j\in \hat I_0^\tau$. In addition, the second constraint of (14) implies $\sum_{i=1}^{K}\sum_{jj'\in \hat I_i,\,(jj')\in E} |\hat\beta^{g}_j - \hat\beta^{g}_{j'}|/\tau \le 0$, yielding that $\hat\beta^{g}_{j_1} = \hat\beta^{g}_{j_2}$ for any $j_1, j_2\in \hat I_i$ with $(j_1,j_2)\in E$, $i=1,\dots,K$. By graph consistency of $U$, $U$ is connected over $\hat I_i$, implying that $\hat\beta^{g}_{j_1} = \hat\beta^{g}_{j_2}$ for any $j_1, j_2\in \hat I_i$, $i=1,\dots,K$. This further implies that $\hat\beta^{g} = \hat\beta^{o}$ and $\hat\theta^{g} = \hat\theta^{o}$, meaning that $\{\hat G^{g} = G^0\}\subseteq\{\hat\theta^{g} = \hat\theta^{o}\}$. On the other hand, it is obvious that if $\hat\theta^{g} = \hat\theta^{o}$ then $\hat G^{g} = G^0$. Hence, $\{\hat G^{g} = G^0\} = \{\hat\theta^{g} = \hat\theta^{o}\}$, from which we conclude that $\{\hat\theta^{g}\ne\hat\theta^{o}\}\subseteq\{\hat\theta^{g}\in S\}$. This, together with $\{\hat\theta^{g}\ne\hat\theta^{o}\}\subseteq\{L(\hat\theta^{g})-L(\hat\theta^{o})\ge 0\}\subseteq\{L(\hat\theta^{g})-L(\theta^0)\ge 0\}$, implies that $P(\hat\theta^{g}\ne\hat\theta^{o})\le P(L(\hat\theta^{g})-L(\theta^0)\ge 0;\ \hat\theta^{g}\in S)$ is bounded by
$$\sum_{i=0}^{d_0}\sum_{A\in B_i}\sum_{G\in\{G^\tau(\beta):\theta=(\beta,\eta)\in S_A\}} P^*\Big(\sup_{\theta\in S_G}\, L(\theta)-L(\theta^0)\ge 0\Big) \le \sum_{i=0}^{d_0}\sum_{A\in B_i}\sum_{G\in\{G^\tau(\beta):\theta\in S_A\}} P^*\Big(\sup_{\{-\log(1-h^2(\theta,\theta^0))\ge d_1\max(i,1)C_{\min}(\theta^0)-d_3\tau^{d_2}d,\ \theta\in S_G\}} L(\theta)-L(\theta^0)\ge 0\Big),$$
where the last step uses the fact that $\{\theta\in S_G\}\subseteq\{-\log(1-h^2(\theta^\tau,\theta^0))\ge\max(i,1)C_{\min}(\theta^0)\}\subseteq\{-\log(1-h^2(\theta,\theta^0))\ge d_1\max(i,1)C_{\min}(\theta^0)-d_3\tau^{d_2}d\}$ under Assumption B. Then, for some constant $c_3>0$, $P(\hat\theta^{g}\ne\hat\theta^{o})$ is upper bounded by
$$\sum_{i=0}^{d_0}|B_i|\,S_i^*\exp\big(-c_3\,n\max(i,1)\,C_{\min}(\theta^0)\big)\le\sum_{i=0}^{d_0}\Big(\frac{d^2}{4}\Big)^i S_i^*\exp\big(-c_3\,n\max(i,1)\,C_{\min}(\theta^0)\big)\le\exp\big(-c_3 nC_{\min}(\theta^0)+2\log d+\log(S^*)\big),$$
provided that $\tau\le\big((d_1-c_3)C_{\min}(\theta^0)/(d_3 d)\big)^{1/d_2}$. This completes the proof. $\Box$
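The opening step of the preceding proof — exact zeros on $\hat I^\tau_0$ being forced once $d_1 = d_0$ entries already exceed $\tau$ — is elementary constraint accounting and can be illustrated as follows. The sketch assumes the truncated penalty takes the form $J_\tau(x)=\min(|x|/\tau,1)$ and uses our own variable names; it illustrates the accounting only, not the estimator itself.

```python
def J(x, tau):
    """Assumed truncated penalty: min(|x|/tau, 1)."""
    return min(abs(x) / tau, 1.0)

tau, d0 = 0.1, 3
# d1 = 3 entries exceed tau; the remaining entries are exactly zero
beta = [0.8, -0.5, 0.3, 0.0, 0.0, 0.0]
d1 = sum(abs(b) > tau for b in beta)
# sum_j J(beta_j) = d1 + (1/tau) * (sum of the small entries' magnitudes),
# so the constraint sum_j J(beta_j) <= d0 together with d1 = d0 forces the
# small entries to vanish exactly.
assert d1 == d0 and sum(J(b, tau) for b in beta) <= d0

# perturbing any small entry away from zero violates the constraint
beta_bad = [0.8, -0.5, 0.3, 0.02, 0.0, 0.0]
assert sum(J(b, tau) for b in beta_bad) > d0
print("constraint accounting consistent with the proof's first step")
```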
Proof of Corollary 2: First we derive an upper bound on $S^*$. Let $p_{0l}$ be the number of nonzero elements of the precision matrix in the $l$th cluster, $l=1,\dots,L$, and let $p_0 = d_0/L = (p_{01}+\cdots+p_{0L})/L$ be the average number of nonzero elements. For any $\theta\in S_A$ with $|A^0\setminus A| = i$, let $Q = \{(j,k): j>k,\ \exists\, l,\ \omega_{jkl}\ne 0\}$ and $|Q| = q_0$. Let $a_{jk} = \#\{l: \omega_{jkl}\ne 0\}$ for $(j,k)\in Q$. Note that $q_0\ge p_0$ and $\sum_{(j,k)\in Q} a_{jk}\le d_0$ since $|A|\le d_0$. By the definition of $S_i^*$, we have
$$S_i^* \le \sum_{\sum_{(j,k)\in Q} r_{jk}\le g_0}\ \prod_{(j,k)\in Q}\binom{a_{jk}-1}{r_{jk}} = \sum_{g=0}^{g_0}\binom{\sum_{(j,k)\in Q}a_{jk}-q_0}{g} \le \sum_{g=0}^{g_0}\binom{d_0-p_0}{g} \le (g_0+1)\Big(\frac{e(d_0-p_0)}{g_0}\Big)^{g_0}. \qquad (30)$$
This, together with $\log(1+g_0)\le g_0$, implies $\log S^*\le 2g_0\max(\log(d_0/g_0),1)$. To lower bound $C_{\min}(\theta^0)$, we proceed similarly to the proof of Proposition 2 in [17]. Specifically, note that $h^2(\theta,\theta^0) = 1-\prod_{l=1}^{L}\big(1-h^2(\Omega_l,\Omega_l^0)\big)$. Thus,
$$-\log(1-h^2(\theta,\theta^0)) = \sum_{l=1}^{L} -\log\big(1-h^2(\Omega_l,\Omega_l^0)\big). \qquad (31)$$
An application of Proposition 2 of [17] yields that each term in (31) is lower bounded by $c_\star c_{\min}(H_l)\|\Omega_l-\Omega_l^0\|_2^2$. Therefore, $-\log(1-h^2(\theta,\theta^0))\ge c_\star\min_{1\le l\le L}c_{\min}(H_l)\sum_{l=1}^{L}\|\Omega_l-\Omega_l^0\|_2^2$. Now if $A^0\setminus A\ne\emptyset$, we have $\sum_{l=1}^{L}\|\Omega_l-\Omega_l^0\|_2^2\ge|A^0\setminus A|\min_{(j,k,l):\,\omega^0_{jkl}\ne 0}(\omega^0_{jkl})^2$. If $A^0\setminus A=\emptyset$, then by the definition of $S$, there must exist $(j,k,l)$ such that $\omega_{jkl}=\omega_{jk(l+1)}$ and $\omega^0_{jkl}\ne\omega^0_{jk(l+1)}$. Here
$$\sum_{l=1}^{L}\|\Omega_l-\Omega_l^0\|_2^2\ge(\omega_{jkl}-\omega^0_{jkl})^2+(\omega_{jk(l+1)}-\omega^0_{jk(l+1)})^2\ge\frac{1}{2}\big(\omega^0_{jkl}-\omega^0_{jk(l+1)}\big)^2\ge\frac{1}{2}\min_{(j,k,l):\,\omega^0_{jkl}\ne\omega^0_{jk(l+1)}}\big(\omega^0_{jkl}-\omega^0_{jk(l+1)}\big)^2.$$
A combination of the two cases yields that
$$-\log(1-h^2(\theta,\theta^0))/\max(|A^0\setminus A|,1)\ge c_\star\min_{1\le l\le L}c_{\min}(H_l)\,\eta_{\min}^2,$$
which, after taking the infimum over $S$, leads to $C_{\min}(\theta^0)\ge c_\star\min_{1\le l\le L}c_{\min}(H_l)\,\eta_{\min}^2$. This, together with the upper bound on $\log S^*$ in Theorems 4 and 5, gives a sufficient condition for simultaneous pursuit of sparseness and clustering: $\min_{1\le l\le L}c_{\min}(H_l)\,\eta_{\min}^2\ge c_0\,\frac{\log(p^2L)+g_0\max(\log(d_0/g_0),1)}{n}$ for some $c_0>0$. Moreover, under this condition, $P(\hat\Omega^{\ell_0}\ne\hat\Omega^{o})\to 0$ and $P(\hat\Omega^{g}\ne\hat\Omega^{o})\to 0$ as $n,d\to+\infty$. This completes the proof. $\Box$
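The additivity in (31) rests on the Hellinger affinity $1-h^2$ of independent blocks being the product of the per-block affinities. The sketch below spot-checks this for zero-mean Gaussians via the closed-form affinity $\rho(\Sigma_1,\Sigma_2)=|\Sigma_1|^{1/4}|\Sigma_2|^{1/4}/|(\Sigma_1+\Sigma_2)/2|^{1/2}$; it takes $h^2$ to be one minus the affinity, which is the convention the identity above requires, and all helper names are ours.

```python
import numpy as np
from scipy.linalg import block_diag

def affinity(S1, S2):
    """Hellinger affinity (Bhattacharyya coefficient) of N(0, S1) and N(0, S2)."""
    M = (S1 + S2) / 2.0
    return (np.linalg.det(S1) * np.linalg.det(S2)) ** 0.25 / np.linalg.det(M) ** 0.5

rng = np.random.default_rng(0)
L, p = 3, 4
# random precision matrices Omega_l and Omega_l^0; covariances are their inverses
covs, covs0 = [], []
for _ in range(L):
    A, B = rng.standard_normal((p, p)), rng.standard_normal((p, p))
    covs.append(np.linalg.inv(A @ A.T + p * np.eye(p)))
    covs0.append(np.linalg.inv(B @ B.T + p * np.eye(p)))

# joint model over the L independent blocks = block-diagonal covariance
rho_joint = affinity(block_diag(*covs), block_diag(*covs0))
rho_blocks = np.prod([affinity(S, S0) for S, S0 in zip(covs, covs0)])
assert np.isclose(rho_joint, rho_blocks)
# hence -log(1 - h^2(theta, theta^0)) = sum_l -log(1 - h^2(Omega_l, Omega_l^0))
assert np.isclose(-np.log(rho_joint), sum(-np.log(affinity(S, S0))
                                          for S, S0 in zip(covs, covs0)))
print("affinity factorizes across independent blocks:", rho_joint)
```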
References
[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
[2] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] T. H. Cormen. Introduction to Algorithms. The MIT Press, 2001.
[4] J. Fan, Y. Feng, and Y. Wu. Network exploration via the adaptive lasso and SCAD penalties. The Annals of Applied Statistics, 3(2):521–541, 2009.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[6] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1, 2011.
[7] M. Kolar and E. P. Xing. On time varying undirected graphs. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
[8] A. N. Kolmogorov and V. M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
[9] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
[10] B. Li, H. Chun, and H. Zhao. Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association, 107:152–167, 2012.
[11] H. Li and J. Gui. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics, 7(2):302, 2006.
[12] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:781–794, 2012.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[14] G. V. Rocha, P. Zhao, and B. Yu. A path following algorithm for sparse pseudo-likelihood inverse covariance estimation (SPLICE). arXiv preprint arXiv:0807.3734, 2008.
[15] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[16] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523, 2005.
[17] X. Shen, W. Pan, and Y. Zhu. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107:223–232, 2012.
[18] T. Shimamura, S. Imoto, R. Yamaguchi, M. Nagasaki, and S. Miyano. Inferring dynamic gene networks under varying conditions for transcriptomic network comparison. Bioinformatics, 26(8):1064–1072, 2010.
[19] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67:91–108, 2005.
[20] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[21] R. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D. Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov, et al. An integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR and NF1. Cancer Cell, 17(1):98, 2010.
[22] D. M. Witten, J. H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20:892–900, 2011.
[23] W. H. Wong and X. Shen. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics, pages 339–362, 1995.
[24] S. Yang, Z. Lu, X. Shen, P. Wonka, and J. Ye. Fused multiple graphical lasso. arXiv preprint arXiv:1209.2139v2, 2012.
[25] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19, 2007.
[26] S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning, 80(2):295–319, 2010.
[27] Y. Zhu, X. Shen, and W. Pan. Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108:713–725, 2013.
Table 1: Average entropy loss (EL), average quadratic loss (QL), average false positives for sparseness pursuit (FPV), average false negatives for sparseness pursuit (FNV), average false positives for grouping (FPG), and average false negatives for grouping (FNG), with standard deviations in parentheses, based on 100 simulations, for estimating precision matrices in Example 1 with n = 120. Here “Smooth”, “Lasso”, “TLP”, “Our-con” and “Ours” denote, respectively, estimation of individual matrices with the kernel smoothing method of [26], estimation with the $L_1$ sparseness penalty, estimation with the non-convex TLP penalty, the convex counterpart of our method with the $L_1$ penalty for sparseness and clustering, and our non-convex estimate obtained by solving (2) with penalty (6). The best performer is bold-faced.

(p, L)     Method    EL            QL            FPV          FNV          FPG          FNG
(30, 4)    Smooth    0.570(.005)   2.617(.266)   .312(.016)   .000(.000)   .392(.020)   .000(.000)
           Lasso     1.547(.074)   5.416(.393)   .200(.009)   .000(.000)   .377(.015)   .000(.000)
           TLP       0.746(.084)   4.688(.617)   .043(.006)   .001(.002)   .108(.008)   .000(.000)
           Our-con   1.288(.064)   3.700(.270)   .129(.016)   .000(.000)   .045(.020)   .251(.032)
           Ours      0.525(.055)   3.494(.418)   .040(.009)   .000(.000)   .009(.007)   .267(.043)
(200, 4)   Smooth    7.118(.173)   22.45(2.61)   .087(.017)   .000(.000)   .106(.018)   .000(.000)
           Lasso     36.48(.426)   69.21(1.97)   .013(.001)   .000(.000)   .027(.001)   .000(.000)
           TLP       5.305(.351)   33.67(2.22)   .004(.000)   .005(.003)   .014(.000)   .000(.000)
           Our-con   36.28(.422)   66.53(1.87)   .010(.001)   .000(.000)   .012(.001)   .122(.021)
           Ours      3.500(.164)   23.71(1.33)   .003(.000)   .000(.000)   .001(.000)   .280(.007)
(20, 30)   Smooth    1.122(.023)   1.983(.056)   .131(.006)   .000(.000)   .223(.006)   .000(.000)
           Lasso     1.685(.028)   3.770(.113)   .152(.005)   .000(.000)   .314(.008)   .000(.000)
           TLP       0.507(.023)   3.081(.180)   .077(.004)   .000(.000)   .198(.005)   .000(.000)
           Our-con   1.593(.028)   3.256(.097)   .136(.007)   .000(.000)   .055(.003)   .024(.004)
           Ours      0.236(.015)   1.812(.130)   .068(.016)   .000(.000)   .020(.002)   .032(.004)
(10, 90)   Smooth    0.339(.007)   0.603(.017)   .271(.014)   .000(.000)   .420(.012)   .000(.000)
           Lasso     0.575(.010)   1.439(.038)   .273(.007)   .000(.000)   .541(.009)   .000(.000)
           TLP       0.250(.009)   1.404(.061)   .196(.006)   .000(.000)   .434(.008)   .000(.000)
           Our-con   0.519(.009)   1.190(.032)   .284(.018)   .000(.000)   .071(.005)   .008(.002)
           Ours      0.100(.005)   0.748(.043)   .017(.020)   .000(.000)   .028(.004)   .012(.002)
Table 2: Average entropy loss (EL), average quadratic loss (QL), average false positives for sparseness pursuit (FPV), average false negatives for sparseness pursuit (FNV), average false positives for grouping (FPG), and average false negatives for grouping (FNG), with standard deviations in parentheses, based on 100 simulations, for estimating precision matrices in Example 2 with n = 300. Here “Smooth”, “Lasso”, “TLP”, “Our-con” and “Ours” denote, respectively, estimation of individual matrices with the kernel smoothing method of [26], estimation with the $L_1$ sparseness penalty, estimation with the non-convex TLP penalty, the convex counterpart of our method with the $L_1$ penalty for sparseness and clustering, and our non-convex estimate obtained by solving (2) with penalty (6). The best performer is bold-faced.

(p, L)     Method    EL            QL            FPV          FNV          FPG          FNG
(30, 4)    Smooth    0.418(.025)   1.081(.072)   .387(.054)   .009(.008)   .543(.060)   .002(.002)
           Lasso     0.732(.041)   1.840(.122)   .229(.011)   .061(.013)   .466(.016)   .006(.005)
           TLP       0.772(.055)   2.290(.192)   .038(.005)   .184(.020)   .162(.010)   .037(.014)
           Our-con   0.591(.039)   1.373(.104)   .081(.019)   .033(.013)   .056(.035)   .146(.021)
           Ours      0.359(.036)   1.012(.111)   .044(.009)   .031(.015)   .004(.002)   .233(.012)
(200, 4)   Smooth    3.198(.069)   7.823(.187)   .129(.010)   .030(.006)   .161(.010)   .013(.002)
           Lasso     6.902(.140)   17.91(.435)   .049(.001)   .234(.010)   .110(.002)   .039(.004)
           TLP       8.350(.215)   23.71(.710)   .003(.001)   .493(.015)   .017(.001)   .116(.010)
           Our-con   6.151(.148)   15.58(.439)   .014(.005)   .198(.013)   .030(.010)   .142(.046)
           Ours      2.977(.135)   8.108(.399)   .001(.000)   .191(.010)   .001(.000)   .284(.012)
(20, 30)   Smooth    0.409(.011)   0.839(.006)   .063(.007)   .024(.004)   .290(.007)   .005(.001)
           Lasso     0.470(.011)   1.157(.034)   .290(.006)   .034(.004)   .611(.008)   .001(.001)
           TLP       0.491(.017)   1.421(.057)   .081(.004)   .118(.007)   .341(.006)   .004(.001)
           Our-con   0.303(.010)   0.682(.025)   .071(.013)   .008(.002)   .044(.014)   .023(.002)
           Ours      0.111(.006)   0.317(.019)   .012(.005)   .006(.003)   .011(.002)   .032(.003)
(10, 90)   Smooth    0.123(.003)   0.248(.007)   .132(.013)   .008(.002)   .518(.008)   .001(.000)
           Lasso     0.170(.003)   0.419(.010)   .405(.010)   .007(.002)   .798(.008)   .000(.000)
           TLP       0.155(.005)   0.439(.014)   .175(.007)   .020(.003)   .609(.007)   .000(.000)
           Our-con   0.099(.008)   0.230(.007)   .135(.024)   .000(.000)   .043(.015)   .001(.001)
           Ours      0.040(.002)   0.117(.005)   .015(.014)   .000(.000)   .009(.001)   .001(.001)
Table 3: Average entropy loss (EL) and average quadratic loss (QL), with standard deviations in parentheses, based on 100 simulations, for estimating multiple precision matrices in Example 3. Here “Smooth”, “Lasso”, “TLP”, “Our-con” and “Ours” denote, respectively, estimation of individual matrices with the kernel smoothing method of [26], estimation with the $L_1$ sparseness penalty, estimation with the non-convex TLP penalty, the convex counterpart of our method with the $L_1$ penalty for sparseness and clustering, and our non-convex estimate obtained by solving (2) with penalty (6). The best performer is bold-faced.

                     n = 120                       n = 300
(p, L)     Method    EL            QL              EL            QL
(30, 4)    Smooth    0.468(.042)   0.941(.097)     0.231(.056)   0.476(.034)
           Lasso     1.158(.062)   2.534(.175)     0.736(.036)   1.434(.086)
           TLP       1.546(.100)   3.625(.262)     0.575(.045)   1.301(.121)
           Our-con   0.897(.066)   1.823(.166)     0.699(.038)   1.317(.085)
           Ours      0.501(.063)   1.143(.160)     0.247(.017)   0.524(.042)
(200, 4)   Smooth    6.882(.220)   13.26(.535)     2.578(.066)   4.843(.130)
           Lasso     10.37(.173)   21.92(.498)     5.449(.094)   10.84(.211)
           TLP       12.34(.202)   25.87(.560)     5.523(.153)   12.08(.365)
           Our-con   6.091(.199)   12.34(.484)     4.625(.098)   8.625(.209)
           Ours      5.079(.265)   11.84(.658)     1.682(.038)   3.551(.096)
(20, 30)   Smooth    0.490(.021)   0.878(.042)     0.278(.008)   0.492(.015)
           Lasso     0.786(.020)   1.670(.052)     0.564(.012)   1.066(.027)
           TLP       0.987(.036)   2.454(.107)     0.355(.014)   0.819(.035)
           Our-con   0.653(.023)   1.281(.055)     0.528(.012)   0.959(.027)
           Ours      0.317(.013)   0.730(.036)     0.183(.005)   0.391(.014)
(10, 90)   Smooth    0.230(.008)   0.409(.017)     0.115(.003)   0.203(.005)
           Lasso     0.318(.008)   0.694(.022)     0.205(.004)   0.398(.008)
           TLP       0.402(.012)   1.010(.043)     0.148(.004)   0.346(.011)
           Our-con   0.240(.008)   0.487(.019)     0.180(.004)   0.335(.007)
           Ours      0.158(.005)   0.369(.014)     0.082(.002)   0.176(.005)
Table 4: Average entropy loss (EL), average quadratic loss (QL), average false positives for sparseness pursuit (FPV), average false negatives for sparseness pursuit (FNV), average false positives for grouping (FPG), and average false negatives for grouping (FNG), with standard deviations in parentheses, based on 100 simulations, for estimating precision matrices in Example 4. Here “Lasso”, “TLP”, “Our-con” and “Ours” denote, respectively, estimation of individual matrices with the $L_1$ sparseness penalty, estimation with the non-convex TLP penalty, the convex counterpart of our method with the $L_1$ penalty for sparseness and clustering, and our non-convex estimate obtained by solving (2) with penalty (6). The best performer is bold-faced.

(p, L)      Method    EL            QL             FPV            FNV            FPG            FNG
(1000, 4)   Lasso     378.5(2.09)   829.4(13.3)    .0005(.0000)   .0270(.0030)   .0020(.0000)   .0002(.0003)
            TLP       36.05(1.78)   201.9(8.81)    .0004(.0000)   .0270(.0030)   .0020(.0000)   .0002(.0003)
            Our-con   377.3(2.07)   805.2(12.9)    .0004(.0000)   .0110(.0020)   .0017(.0000)   .0497(.0040)
            Ours      26.8(1.35)    160.9(7.266)   .0003(.0000)   .0130(.0020)   .0017(.0000)   .0267(.0027)
(2000, 4)   Lasso     225.6(.413)   358.1(1.13)    .0009(.0000)   .0000(.0000)   .0018(.0000)   .0000(.0000)
            TLP       9.160(.083)   54.17(.654)    .0007(.0000)   .0000(.0000)   .0015(.0000)   .0000(.0000)
            Our-con   225.6(.413)   358.1(1.12)    .0009(.0000)   .0000(.0000)   .0018(.0000)   .0000(.0000)
            Ours      8.617(.081)   51.79(.657)    .0006(.0000)   .0000(.0000)   .0005(.0000)   .1750(.0020)
Figure 1: Average entropy and quadratic losses of the proposed method over different p and L values, over 100 simulation replications in Example 1. [Two panels: entropy loss (left) and quadratic loss (right), each plotted against the number of graphs L (30 to 90), with separate curves for p = 10, 15 and 20.]
"PKA"
"Jnk"
"Jnk"
"Mek"
"P38"
"PIP2"
"Mek"
"P38"
"PIP2"
"AKt"
"Erk"
"PIP3"
"Plcg"
"PKA"
"Plcg"
"AKt"
"Erk"
"Raf"
"Raf"
"PIP3"
"PKC"
"PKC"
(a) Under the nine conditions.
(b) Under the five conditions.
Figure 2: Reconstructed networks for simultaneous pursuit of clustering and sparsity.
Figure 3: Signaling network reproduced from Figure 3(A) of [16]; the black dashed lines represent links that were missed by the methods in [16].