Complete Dictionary Recovery over the Sphere I: Overview

IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
1
Complete Dictionary Recovery over the Sphere
I: Overview and the Geometric Picture
Ju Sun, Student Member, IEEE, Qing Qu, Student Member, IEEE, and John Wright, Member, IEEE
arXiv:1511.03607v2 [cs.IT] 30 Nov 2015
Abstract
We consider the problem of recovering a complete (i.e., square and invertible) matrix A0 , from Y ∈ Rn×p with
Y = A0 X0 , provided X0 is sufficiently sparse. This recovery problem is central to the theoretical understanding
of dictionary learning, which seeks a sparse representation for a collection of input signals, and finds numerous
applications in modern signal processing and machine learning. We give the first efficient algorithm that provably
recovers A0 when X0 has O (n) nonzeros per column, under suitable probability model for X0 . In contrast, prior
results based on efficient algorithms provide recovery guarantees when X0 has only O(n1−δ ) nonzeros per column
for any constant δ ∈ (0, 1).
Our algorithmic pipeline centers around solving a certain nonconvex optimization problem with a spherical
constraint. In this paper, we provide a geometric characterization of the high-dimensional objective landscape.
In particular, we show that the problem is highly structured: with high probability there are no “spurious” local
minimizers and all saddle points are second-order. This distinctive structure makes the problem amenable to efficient
algorithms. In a companion paper [3], we design a second-order trust-region algorithm over the sphere that provably
converges to a local minimizer with an arbitrary initialization, despite the presence of saddle points.
Index Terms
Dictionary learning, Nonconvex optimization, Spherical constraint, Escaping saddle points, Trust-region method,
Manifold optimization, Function landscape, Second-order geometry, Inverse problem, Structured signal, Nonlinear
approximation
I. I NTRODUCTION
.
.
Given p signal samples from
i.e., Y = [y1 , . . . , yp ], is it possible to construct a dictionary A = [a1 , . . . , am ]
with m much smaller than p, such that Y ≈ AX and the coefficient matrix X has as few nonzeros as possible? In
other words, this model dictionary learning (DL) problem seeks a concise representation for a collection of input
signals. Concise signal representations play a central role in compression, and also prove useful for many other
important tasks, such as signal acquisition, denoising, and classification.
Traditionally, concise signal representations have relied heavily on explicit analytic bases constructed in nonlinear
approximation and harmonic analysis. This constructive approach has proved highly successfully; the numerous
theoretical advances in these fields (see, e.g., [4]–[8] for summary of relevant results) provide ever more powerful
representations, ranging from the classic Fourier to modern multidimensional, multidirectional, multiresolution bases,
including wavelets, curvelets, ridgelets, and so on. However, two challenges confront practitioners in adapting these
results to new domains: which function class best describes signals at hand, and consequently which representation
is most appropriate. These challenges are coupled, as function classes with known “good” analytic bases are rare. 1
Around 1996, neuroscientists Olshausen and Field discovered that sparse coding, the principle of encoding a
signal with few atoms from a learned dictionary, reproduces important properties of the receptive fields of the simple
cells that perform early visual processing [10], [11]. The discovery has spurred a flurry of algorithmic developments
and successful applications for DL in the past two decades, spanning classical image processing, visual recognition,
compressive signal acquisition, and also recent deep architectures for signal classification (see, e.g., [12], [13] for
review of this development).
Rn ,
JS, QQ, and JW are all with Electrical Engineering, Columbia University, New York, NY 10025, USA. Email: {js4038, qq2105,
jw2966}@columbia.edu. An extended abstract of the current work has been published in [1]. Proofs of some secondary results are contained
in the combined technical report [2].
Manuscript received xxx; revised xxx.
1
As Donoho et al [9] put it, “...in effect, uncovering the optimal codebook structure of naturally occurring data involves more challenging
empirical questions than any that have ever been solved in empirical work in the mathematical sciences.”
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
2
The learning approach is particularly relevant to modern signal processing and machine learning, which deal
with data of huge volume and great variety (e.g., images, audios, graphs, texts, genome sequences, time series, etc).
The proliferation of problems and data seems to preclude analytically deriving optimal representations for each
new class of data in a timely manner. On the other hand, as datasets grow, learning dictionaries directly from data
looks increasingly attractive and promising. When armed with sufficiently many data samples of one signal class,
by solving the model DL problem, one would expect to obtain a dictionary that allows sparse representation for the
whole class. This hope has been borne out in a number of successful examples [12], [13] and theories [14]–[17].
A. Theoretical and Algorithmic Challenges
In contrast to the above empirical successes, the theoretical study of dictionary learning is still developing. For
applications in which dictionary learning is to be applied in a “hands-free” manner, it is desirable to have efficient
algorithms which are guaranteed to perform correctly, when the input data admit a sparse model. There have been
several important recent results in this direction, which we will review in Section I-E, after our sketching main
results. Nevertheless, obtaining algorithms that provably succeed under broad and realistic conditions remains an
important research challenge.
To understand where the difficulties arise, we can consider a model formulation, in which we attempt to obtain
the dictionary A and coefficients X which best trade-off sparsity and fidelity to the observed data:
1
minimizeA∈Rn×m ,X∈Rm×p λ kXk1 + kAX − Y k2F , subject to A ∈ A.
(I.1)
2
P
.
Here, kXk1 = i,j |Xij | promotes sparsity of the coefficients, λ ≥ 0 trades off the level of coefficient sparsity and
quality of approximation, and A imposes desired structures on the dictionary.
This formulation is nonconvex: the admissible set A is typically nonconvex (e.g., orthogonal group, matrices with
normalized columns)2 , while the most daunting
nonconvexity comes from the bilinear mapping: (A, X) 7→ AX .
−1
∗
Because (A, X) and AΠΣ, Σ Π X result in the same objective value for the conceptual formulation (I.1),
where Π is any permutation matrix, and Σ any diagonal matrix with diagonal entries in {±1}, and (·)∗ denotes
matrix transpose. Thus, we should expect the problem to have combinatorially many global minimizers. Because
there are multiple isolated global minimizers, the problem does not appear to be amenable to convex relaxation (see
similar discussions in, e.g., [20] and [21]).3 This contrasts sharply with problems in sparse recovery and compressed
sensing, in which simple convex relaxations are often provably effective [24]–[33]. Is there any hope to obtain
global solutions to the DL problem?
B. An Intriguing Numerical Experiment with Real Images
We provide empirical evidence in support of a positive answer to the above question. Specifically, we learn
orthogonal bases (orthobases) for real images patches. Orthobases are of interest because typical hand-designed
dictionaries such as discrete cosine (DCT) and wavelet bases are orthogonal, and orthobases seem competitive in
performance for applications such as image denoising, as compared to overcomplete dictionaries [34]4 .
We divide a given greyscale image into 8 × 8 non-overlapping patches, which are converted into 64-dimensional
vectors and stacked column-wise into a data matrix Y . Specializing (I.1) to this setting, we obtain the optimization
problem:
1
minimizeA∈Rn×n ,X∈Rn×p λ kXk1 + kAX − Y k2F , subject to A ∈ On .
(I.2)
2
2
For example, in nonlinear approximation and harmonic analysis, orthonormal basis or (tight-)frames are preferred; to fix the scale ambiguity
discussed in the text, a common practice is to require that A to be column-normalized. There is no obvious reason to believe that convexifying
these constraint sets would
leave the optima unchanged. For example, the convex hull of the orthogonal group On is the operator norm ball
X ∈ Rn×n : kXk ≤ 1 . If there are no effective symmetry breaking constraints, any convex objective function tends to have minimizers
inside the ball, which obviously will not be orthogonal matrices. Other ideas such as lifting may not play together with the objective function,
nor yield tight relaxations (see, e.g., [18], [19]).
3
Semidefinite programming (SDP) lifting may be one useful general strategy to convexify bilinear inverse problems, see, e.g., [22], [23].
However, for problems with general nonlinear constraints, it is unclear whether the lifting always yield tight relaxation, consider, e.g., [18],
[19] again.
4
See Section I-C for more detailed discussions of this point. [35] also gave motivations and algorithms for learning (union of) orthobases
as dictionaries.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
3
9
7
8
6
6
7
5
5
6
4
5
4
4
3
3
3
2
2
2
1
0
1
1
0
10
20
30
40
50
60
70
80
90
100
0
0
10
20
30
40
50
60
70
80
90
100
0
0
10
20
30
40
50
60
70
80
90
100
Fig. 1: Alternating direction method for (I.2) on uncompressed real images seems to always produce the same
solution! Top: Each image is 512 × 512 in resolution and encoded in the uncompressed pgm format (uncompressed
images to prevent possible bias towards standard bases used for compression, such as DCT or wavelet bases).
Each image is evenly divided into 8 × 8 non-overlapping image patches (4096 in total), and these patches are all
vectorized and then stacked as columns of the data matrix Y . Bottom: Given each Y , we solve (I.2) 100 times
with independent and randomized (uniform over the orthogonal group) initialization A0 . The plots show the values
of kA∗∞ Y k1 across the independent repetitions. They are virtually the same and the relative differences are less
than 10−3 !
To derive a concrete algorithm for (I.2), one can deploy the alternating direction method (ADM)5 , i.e., alternately
minimizing the objective function with respect to (w.r.t.) one variable while fixing the other. The iteration sequence
actually takes very simple form: for k = 1, 2, 3, . . . ,
Xk = Sλ A∗k−1 Y ,
Ak = U V ∗ for U DV ∗ = SVD (Y Xk∗ )
.
where Sλ [·] denotes the well-known soft-thresholding operator acting elementwise on matrices, i.e., Sλ [x] =
sign (x) max (|x| − λ, 0) for any scalar x.
Fig. 1 shows what we obtained using the simple ADM algorithm, with independent and randomized initializations:
The algorithm seems to always produce the same optimal value, regardless of the initialization.
This observation implies the heuristic ADM algorithm may always converge to one global minimizer! 6 Equally
surprising is that the phenomenon has been observed on real images7 . One may imagine only random data typically
have “favorable” structures; in fact, almost all existing theories for DL pertain only to random data [40]–[45].
C. Dictionary Recovery and Our Results
In this paper, we take a step towards explaining the surprising effectiveness of nonconvex optimization heuristics
for DL. We focus on the dictionary recovery (DR) setting: given a data matrix Y generated as Y = A0 X0 , where
5
This method is also called alternating minimization or (block) coordinate descent method. see, e.g., [36], [37] for classic results and [38],
[39] for several interesting recent developments.
6
Technically, the converge to global solutions is surprising because even convergence of ADM to critical points is atypical, see, e.g., [38],
[39] and references therein. Section III includes more detailed discussions on this point.
7
Actually the same phenomenon is also observed for simulated data when the coefficient matrix obeys the Bernoulli-Gaussian model, which
is defined later. The result on real images supports that previously claimed empirical successes over two decades may be non-incidental.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
4
A0 ∈ A ⊆ Rn×m and X0 ∈ Rm×p is “reasonably sparse”, try to recover A0 and X0 . Here recovery means to
return any pair A0 ΠΣ, Σ−1 Π∗ X0 , where Π is a permutation matrix and Σ is a nonsingular diagonal matrix,
i.e., recovering up to sign, scale, and permutation.
To define a reasonably simple and structured problem, we make the following assumptions:
• The target dictionary A0 is complete, i.e., square and invertible (m = n). In particular, this class includes
orthogonal dictionaries. Admittedly overcomplete dictionaries tend to be more powerful for modeling and
to allow sparser representations. Nevertheless, most classic hand-designed dictionaries in common use are
orthogonal. Orthobases are competitive in performance for certain tasks such as image denoising [34], and
admit faster algorithms for learning and encoding. 8
• The coefficient matrix X0 follows the Bernoulli-Gaussian (BG) model with rate θ : [X0 ]ij = Ωij Vij , with
Ωij ∼ Ber (θ) and Vij ∼ N (0, 1), where all the different random variables are jointly independent. We write
compactly X0 ∼i.i.d. BG (θ).
In this paper, we provide a nonconvex formulation for the DR problem, and characterize the geometric structure of
the formulation that allows development of efficient algorithms for optimization. In the companion paper [3], we
derive an efficient algorithm taking advantage of the structure, and describe a complete algorithmic pipeline for
efficient recovery. Together, we prove the following result:
Theorem I.1 (Informal statement of our results, a detailed version included in the companion paper [3]). For any
θ ∈ (0, 1/3), given Y = A0 X0 with A0 a complete dictionary and X0 ∼i.i.d. BG (θ), there is a polynomial time
algorithm that recovers (up to sign, scale, and permutation) A0 and X0 with high probability (at least 1 − O(p−6 ))
whenever p ≥ p? (n, 1/θ, κ (A0 ) , 1/µ) for a fixed polynomial p? (·), where κ (A0 ) is the condition number of A0
and µ is a parameter that can be set as cn−5/4 for a fixed positive numerical constant c.
Obviously, even if X0 is known, one needs p ≥ n to make the identification problem well posed. Under our
particular probabilistic model, a simple coupon collection argument implies that one needs p ≥ Ω 1θ log n to ensure
all atoms in A0 are observed with high probability (w.h.p.). To ensure that an efficient algorithm exists may demand
more. Our result implies when p is polynomial in n, 1/θ and κ(A0 ), recovery with efficient algorithm is possible.
The parameter θ controls the sparsity level of X0 . Intuitively, the recovery problem is easy for small θ and
becomes harder for large θ.9 It is perhaps surprising that an efficient algorithm can succeed up to constant θ, i.e.,
linear sparsity in X0 . Compared to the case when A0 is known, there is only at most a constant gap in the sparsity
level one can deal with.
For DL, our result gives the first efficient algorithm that provably recovers complete A0 and X0 when X0 has
O(n) nonzeros per column under appropriate probability model. Section I-E provides detailed comparison of our
result with other recent recovery results for complete and overcomplete dictionaries.
D. Main Ingredients and Innovations
In this section we describe three main ingredients that we use to obtain the stated result.
1) A Nonconvex Formulation: Since Y = A0 X0 and A0 is complete, row (Y ) = row (X0 ) (row (·) denotes
the row space of a matrix) and hence rows of X0 are sparse vectors in the known (linear) subspace row (Y ).
We can use this fact to first recover the rows of X0 , and subsequently recover A0 by solving a system of linear
equations. In fact, for X0 ∼i.i.d. BG (θ), rows of X0 are the n sparsest vectors (directions) in row (Y ) w.h.p.
whenever p ≥ Ω (n log n) [40]. Thus one might try to recover rows of X0 by solving
minimize kq ∗ Y k0 subject to q 6= 0.
(I.3)
The objective is discontinuous, and the domain is an open set. In particular, the homogeneous constraint is
nonconventional and tricky to deal with. Since the recovery is up to scale, one can remove the homogeneity by fixing
8
Empirically, there is no systematic evidence supporting that overcomplete dictionaries are strictly necessary for good performance in all
published applications (though [11] argues for the necessity from neuroscience perspective). Some of the ideas and tools developed here for
complete dictionaries may also apply to certain classes of structured overcomplete dictionaries, such as tight frames. See Section III for
relevant discussion.
9
Indeed, when θ is small enough such that columns of X0 are predominately 1-sparse, one directly observes scaled versions of the atoms
(i.e., columns of X0 ); when X0 is fully dense corresponding to θ = 1, recovery is never possible as one can easily find another complete
A00 and fully dense X00 such that Y = A00 X00 with A00 not equivalent to A0 .
Slightly%more%formally%…!
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
5
Lemma:%Suppose))))))))))))))))))),)and)))))))))))))))))).)Then…)2
✁✂✞
✁✂☎✵
Negative%curvature
✁✂☎✝
✁✂☎✆
Nonzero%gradient
✁✂☎☎
Strongly%convex
✁✂☎
✁✂ ✵
✁✂✄
✁✂✄
✁
✁
✲✁✂✄
✲✁✂✄
✲
✲
…)and)so)…)every%local%optimizer%of%%%%%%%%%%%%%%%%%is%a%target%point.%+
Sn−1
Fig. 2: Why is dictionary learning over
tractable? Assume the target dictionary A0 is orthogonal. Left:
Large sample objective function EX0 [f (q)]. The only local minimizers are the columns of A0 and their negatives.
Center: The same function, visualized as a height above the plane a⊥
1 (a1 is the first column of A0 , and is also a
global minimizer). Right: Around a1 , the function exhibits a small region of positive curvature, a region of large
gradient, and finally a region in which the direction away from a1 is a direction of negative curvature.
the scale of q . Known relaxations [40], [46] fix the scale by setting kq ∗ Y k∞ = 1, where k·k∞ is the elementwise
`∞ norm. The optimization problem reduces to a sequence of convex programs, which recover (A0 , X0 ) for very
√
√
sparse X0 , but provably break down when columns of X0 has more than O ( n) nonzeros, or θ ≥ Ω (1/ n).
Inspired by our previous image experiment, we work with a nonconvex alternative10 :
p
. 1X
hµ (q ∗ ybk ) , subject to kqk = 1,
minimize f (q; Yb ) =
p
(I.4)
k=1
where Yb ∈ Rn×p is a proxy for Y (i.e., after appropriate processing), k indexes columns of Yb , and k·k is the usual
`2 norm for vectors. Here hµ (·) is chosen to be a convex smooth approximation to |·|, namely,
exp (z/µ) + exp (−z/µ)
= µ log cosh(z/µ),
(I.5)
hµ (z) = µ log
2
which is infinitely differentiable and µ controls the smoothing level.11 The spherical constraint is nonconvex. Hence,
a-priori, it is unclear whether (I.4) admits efficient algorithms that attain global optima. Surprisingly, simple descent
algorithms for (I.4) exhibit very striking behavior: on many practical numerical examples12 , they appear to produce
global solutions. Our next section will uncover interesting geometrical structures underlying the phenomenon.
2) A Glimpse into High-dimensional Function Landscape: For the moment, suppose A0 is orthogonal, and
take Yb = Y = A0 X0 in (I.4). Fig. 2 (left) plots EX0 [f (q; Y )] over q ∈ S2 (n = 3). Remarkably, EX0 [f (q; Y )]
has no spurious local minimizers. In fact, every local minimizer qb produces a row of X0 : qb∗ Y = e∗i X0 for a
certain i ∈ {1, . . . , n}.
To better illustrate the point, we take the particular case A0 = I and project the upper hemisphere above the
⊥
⊥
equatorial plane e⊥
3 onto e3 . The projection is bijective and we equivalently define a reparameterization g : e3 7→ R
of f . Fig. 2 (center) plots the graph of g . Obviously the only local minimizers are 0, ±e1 , ±e2 , and they are also
global minimizers. Moreover, the apparent nonconvex landscape has interesting structures around 0: when moving
away from 0, one sees successively a strongly convex region, a nonzero gradient region, and a region where at each
point one can always find a direction of negative curvature, as shown schematically in Fig. 2 (right). This geometry
implies that at any nonoptimal point, there is always at least one direction of descent. Thus, any algorithm that can
take advantage of the descent directions will likely converge to one global minimizer, irrespective of initialization.
Two challenges stand out when implementing this idea. For geometry, one has to show similar structure exists for
general complete A0 , in high dimensions (n ≥ 3), when the number of observations p is finite (vs. the expectation
10
A similar formulation has been proposed in [47] in the context of blind source separation; see also [48].
In fact, there is nothing special about this choice and we believe that any valid smooth (twice continuously differentiable) approximation
to |·| would work and yield qualitatively similar results. We also have some preliminary results showing the latter geometric picture remains
the same for certain nonsmooth functions, such as a modified version of the Huber function, though the analysis involves handling a different
set of technical subtleties. The algorithm also needs additional modifications.
12
... not restricted to the model we assume here for A0 and X0 .
11
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
6
in the experiment). For algorithms, we need to be able to take advantage of this structure without knowing A0
ahead of time. In Section I-D3, we describe a Riemannian trust region method which addresses the latter challenge.
a) Geometry for orthogonal A0 .: In this case, we take Yb = Y = A0 X0 . Since f (q; A0 X0 ) = f (A∗0 q; X0 ),
the landscape of f (q; A0 X0 ) is simply a rotated version of that of f (q; X0 ), i.e., when A0 = I . Hence we will
focus on the case when A0 = I . Among the 2n symmetric sections of Sn−1 centered around the signed basis
vectors ±e1 , . . . , ±en , we work with the symmetric section around en as an example. The result will carry over to
all sections with the same argument; together this provides a complete characterization of the function f (q; X0 )
over Sn−1 .
We again invoke the projection trick described above, this time onto the equatorial plane e⊥
n . This can be formally
captured by the reparameterization mapping:
q
2
(I.6)
q (w) = w, 1 − kwk , w ∈ Bn−1 ,
n−1 and Bn−1 is the unit ball in Rn−1 . We first study the composition
where w is the new variable in e⊥
n ∩B
.
g (w; X0 ) = f (q (w) ; X0 ) over the set
q
.
4n−1
Γ = w : kwk <
.
(I.7)
4n
It can be verified the section we chose to work with is contained in this set13 .
Our analysis characterizes the properties of g (w; X0 ) by studying three quantities
w∗ ∇2 g (w; X0 ) w
kwk2
respectively over three consecutive regions moving away from the origin, corresponding to the three regions in
Fig. 2 (right). In particular, through typical expectation-concentration style argument, we show that there exists a
positive constant c such that
∇2 g (w; X0 ) ,
w∗ ∇g (w; X0 )
,
kwk
w∗ ∇2 g (w; X0 ) w
≤ −cθ
(I.8)
kwk2
over the respective regions w.h.p., confirming our low-dimensional observations described above. In particular, the
favorable structure we observed for n = 3 persists in high dimensions, w.h.p., even when p is large yet finite, for
the case A0 is orthogonal. Moreover, the local minimizer of g (w; X0 ) over Γ is very close to 0, within a distance
of O (µ).
b) Geometry for complete A0 .: For general complete dictionaries A0 , we hope that the function f retains the
nice geometric structure discussed above. We can ensure this by “preconditioning” Y such that the output looks as
if being generated from a certain orthogonal matrix, possibly plus a small perturbation. We can then argue that the
perturbation does not significantly affect qualitative properties of the landscape of the objective function. Write
−1/2
1
Y = pθ
YY∗
Y.
(I.9)
∇2 g (w; X0 ) 1
cθI,
µ
w∗ ∇g (w; X0 )
≥ cθ,
kwk
Note that for X0 ∼i.i.d. BG (θ), E [X0 X0∗ ] / (pθ) = I . Thus, one expects
roughly like A0 A∗0 and hence Y to behave like
(A0 A∗0 )−1/2 A0 X0 = U V ∗ X0
1
pθ Y
Y∗ =
1
∗ ∗
pθ A0 X0 X0 A0
to behave
(I.10)
where we write the SVD of A0 as A0 = U ΣV ∗ . It is easy to see U V ∗ is an orthogonal matrix. Hence the
preconditioning scheme we have introduced is technically sound.
Our analysis shows that Y can be written as
Y = U V ∗ X0 + ΞX0 ,
(I.11)
where Ξ is a matrix with small magnitude. Simple perturbation argument shows that the constant c in (I.8) is at
most shrunk to c/2 for all w when p is sufficiently large. Thus, the qualitative aspects of the geometry have not
been changed by the perturbation.
Indeed, if hq, en i ≥ |hq, ei i| for any i 6= n, 1 − kwk2 = qn2 ≥ 1/n, implying kwk2 ≤ n−1
< 4n−1
. The reason we have defined an
n
4n
open set instead of a closed (compact) one is to avoid potential trivial local minimizers located on the boundary.
13
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
7
3) A Second-order Algorithm on Manifold: Riemannian Trust-Region Method: We do not know A0 ahead of
time, so our algorithm needs to take advantage of the structure described above without knowledge of A0 . Intuitively,
this seems possible as the descent direction in the w space appears to also be a local descent direction for f over
the sphere. Another issue is that although the optimization problem has no spurious local minimizers, it does have
many saddle points with indefinite Hessian, which We call second-order saddle points or ridable saddles 14 (Fig. 2).
We can use second-order information to guarantee to escape such saddle points. In the companion paper [3], We
derive an algorithm based on the Riemannian trust region method (TRM) [51], [52] over the sphere for this purpose.
There are other algorithmic possibilities; see, e.g., [50].
We provide here only the basic intuition why a local minimizer can be retrieved by the second-order trust-region
method. For a function f : Rn → R and an unconstrained optimization problem
min f (x) ,
x∈Rn
typical (second-order) TRM proceeds by successively forming a second-order approximation to f at the current
iterate,
.
fb(δ; x(k−1) ) = f (x(k−1) ) + ∇∗ f (x(k−1) )δ + 12 δ ∗ Q(x(k−1) )δ,
(I.12)
where Q(x(k−1) ) is a proxy for the Hessian matrix ∇2 f (x(k−1) ), which encodes the second-order geometry. The
next movement direction is determined by seeking a minimum of fb(δ; x(k−1) ) over a small region, normally a norm
ball kδkp ≤ ∆, called the trust region, inducing the well studied trust-region subproblem that can efficiently solved:
.
δ (k) = arg min fb(δ; x(k−1) ),
(I.13)
δ∈Rn ,kδkp ≤∆
where ∆ is called the trust-region radius that controls how far the movement can be made. If we take Q(x(k−1) ) =
∇2 f (x(k−1) ) for all k , then whenever the gradient is nonvanishing or the Hessian is indefinite, we expect to decrease
the objective function by a concrete amount provided kδk is sufficiently small. Since the domain is compact, the
iterate sequence ultimately moves into the strongly convex region, where the trust-region algorithm behaves like
a typical Newton algorithm. All these are generalized to our objective over the sphere and made rigorous in the
companion paper [3].
E. Prior Arts and Connections
It is far too ambitious to include here a comprehensive review of the exciting developments of DL algorithms
and applications after the pioneer work [10]. We refer the reader to Chapter 12 - 15 of the book [12] and the survey
paper [13] for summaries of relevant developments in image analysis and visual recognition. In the following, we
focus on reviewing recent developments on the theoretical side of dictionary learning, and draw connections to
problems and techniques that are relevant to the current work.
a) Theoretical Dictionary Learning: The theoretical study of DL in the recovery setting started only very
recently. [53] was the first to provide an algorithmic procedure to correctly extract the generating dictionary.
The algorithm requires exponentially many samples and has exponential running time; see also [54]. Subsequent
work [20], [21], [55]–[57] studied when the target dictionary is a local optimizer of natural recovery criteria. These
meticulous analyses show that polynomially many samples are sufficient to ensure local correctness under natural
assumptions. However, these results do not imply that one can design efficient algorithms to obtain the desired local
optimizer and hence the dictionary.
[40] initiated the on-going research effort to provide efficient algorithms that globally solve DR. They showed
that one can recover a complete dictionary A0 √from Y = A0 X0 by solving a certain sequence of linear programs,
when X0 is a sparse random matrix with O( n) nonzeros per column. [41], [43] and [42], [45] gave efficient
algorithms that provably recover overcomplete (m ≥ n) and incoherent dictionaries, based on a combination√of
e n)
{clustering or spectral initialization} and local refinement. These algorithms again succeed when X0 has O(
15
nonzeros per column. Recent work [58] provided the first polynomial-time algorithm that provably recovers
14
15
See [49] and [50].
e suppresses some logarithm factors.
The O
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
8
most “nice” overcomplete dictionaries when X0 has O(n1−δ ) nonzeros per column for any constant δ ∈ (0, 1).
However, the proposed algorithm runs in super-polynomial (quasipolynomial) time when the sparsity level goes up
to O(n). Similarly, [44] also proposed a super-polynomial time algorithm that guarantees recovery with (almost)
O (n) nonzeros per column. By comparison, we give the first polynomial-time algorithm that provably recovers
complete dictionary A0 when X0 has O (n) nonzeros per column.
Aside from efficient recovery, other theoretical work on DL includes results on identifiability [53], [54], [59],
generalization bounds [14]–[17], and noise stability [60].
b) Finding Sparse Vectors in a Linear Subspace: We have followed [40] and cast the core problem as finding
the sparsest vectors in a given linear subspace, which is also of independent interest. Under a planted sparse model16 ,
[46] showed
that solving a sequence of linear programs similar to [40] can recover sparse vectors with sparsity up
√
to O (p/ n), sublinear in the vector dimension. [48] improved the recovery limit to O (p) by solving a nonconvex
sphere-constrained problem similar to (I.4)17 via an ADM algorithm. The idea of seeking rows of X0 sequentially by
solving the above core problem sees precursors in [47] for blind source separation, and [61] for matrix sparsification.
[47] also proposed a nonconvex optimization similar to (I.4) here and that employed in [48].
c) Nonconvex Optimization Problems: For other nonconvex optimization problems of recovery of structured
signals18 , including low-rank matrix completion/recovery [70]–[79], phase retreival [80]–[83], tensor recovery [84]–
[87], mixed regression [88], [89], structured element pursuit [48], and recovery of simultaneously structured
signals [89], numerical linear algebra and optimization [90], [91], the initialization plus local refinement strategy
adopted in theoretical DL [41]–[45] is also crucial: nearness to the target solution enables exploiting the local
property of the optimizing objective to ensure that the local refinement succeeds.19 By comparison, we provide a
complete characterization of the global geometry, which admits efficient algorithms without any special initialization.
d) Independent Component Analysis (ICA) and Other Matrix Factorization Problems: DL can also be considered
in the general framework of matrix factorization problems, which encompass the classic principal component analysis
(PCA), ICA, and clustering, and more recent problems such as nonnegative matrix factorization (NMF), multi-layer
neural nets (deep learning architectures). Most of these problems are NP-hard. Identifying tractable cases of practical
interest and providing provable efficient algorithms are subject of on-going research endeavors; see, e.g., recent
progresses on NMF [95], and learning deep neural nets [96]–[99].
ICA factors a data matrix Y as Y = AX such that A is square and rows of X achieve maximal statistical
independence [100], [101]. In theoretical study of the recovery problem, it is often assumed that rows of X0
are (weakly) independent (see, e.g., [102]–[104]). Our i.i.d. probability model on X0 implies rows of X0 are
independent, aligning our problem perfectly with the ICA problem. More interestingly, the log cosh objective we
analyze here was proposed as a general-purpose contrast function in ICA that has not been thoroughly analyzed [105],
and algorithm and analysis with another popular contrast function, the fourth-order cumulants, indeed overlap with
ours considerably [103], [104]20 . While this interesting connection potentially helps port our analysis to ICA, it is a
fundamental question to ask what is playing the vital role for DR, sparsity or independence.
Fig. 3 helps shed some light in this direction, where we again plot the asymptotic objective landscape with
the natural reparameterization as in Section I-D2. From the left and central panels, it is evident that even without
independence, X0 with sparse columns induces the familiar geometric structures we saw in Fig. 2; such structures
are broken when the sparsity level becomes large. We believe all our later analyses can be generalized to the
correlated cases we experimented with. On the other hand, from the right panel21 , it seems that with independence,
the function landscape undergoes a transition as sparsity level grows - target solution goes from minimizers of the
objective to the maximizers of the objective. Without adequate knowledge of the true sparsity, it is unclear whether
16
... where one sparse vector embedded in an otherwise random subspace.
The only difference is that they chose to work with the Huber function as a proxy of the k·k1 function.
18
This is a body of recent work studying nonconvex recovery up to statistical precision, including, e.g., [62]–[69].
19
The powerful framework [38], [39] to establish local convergence of ADM algorithms to critical points applies to DL/DR also, see, e.g.,
[92]–[94]. However, these results do not guarantee to produce global optima.
20
Nevertheless, the objective functions are apparently different. Moreover, we have provided a complete geometric characterization of the
objective, in contrast to [103], [104]. We believe the geometric characterization could not only provide insight to the algorithm, but also help
improve the algorithm in terms of stability and also finding all components.
21
We have not showed the results on the BG model here, as it seems the structure persists even when θ approaches 1. We suspect the
“phase transition” of the landscape occurs at different points for different distributions and Gaussian is the outlying case where the transition
occurs at 1.
17
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
9
(a) Correlated Gaussian, θ = 0.1
(b) Correlated Uniform, θ = 0.1
(c) Independent Uniform, θ = 0.1
(d) Correlated Gaussian, θ = 0.9
(e) Correlated Uniform, θ = 0.9
(f) Independent Uniform, θ = 1
Fig. 3: Asymptotic function landscapes when rows of X0 are not independent. W.l.o.g., we again assume
A0 = I . In (a) and (d), X0 = Ω V , with Ω ∼i.i.d. Ber(θ) and columns of X0 i.i.d. Gaussian vectors √
obeying
vi ∼ N (0, Σ2 ) for symmetric Σ with 1’s on the diagonal and i.i.d. off-diagonal entries distributed as N (0, 2/20).
Similarly, in (b) and (e), X0 = Ω W , with Ω ∼i.i.d. Ber(θ) and columns of X0 i.i.d. vectors generated as
wi = Σui with ui ∼i.i.d. Uniform[−0.5, 0.5]. For comparison, in (c) and (f), X0 = Ω W with Ω ∼i.i.d. Ber(θ)
and W ∼i.i.d. Uniform[−0.5, 0.5]. Here denote the elementwise product, and the objective function is still based
on the log cosh function as in (I.4).
one would like to minimize or maximize the objective.22 This suggests that sparsity, instead of independence, makes
our current algorithm for DR work.
e) Nonconvex Problems with Similar Geometric Structure: Besides ICA discussed above, it turns out that
a handful of other practical problems arising in signal processing and machine learning induce the “no spurious
minimizers, all saddles are second-order” structure under natural setting, including the eigenvalue problem, generalized
phase retrieval [106], tensor decomposition [50], linear neural nets learning [107]. [49] gave a review of these
problems, and discussed how the methodology developed in this and the companion paper [3] can be generalized to
solve those problems.
F. Notations, and Reproducible Research
We use bold capital and small letters such as X and x to denote matrices and vectors, respectively. Small letters
are reserved for scalars. Several specific mathematical objects we will frequently work with: Ok for the orthogonal
.
group of order k , Sn−1 for the unit sphere in Rn , Bn for the unit ball in Rn , and [m] = {1, . . . , m} for positive
∗
integers m. We use (·) for matrix transposition, causing no confusion as we will work entirely on the real field.
We use superscript to index rows of a matrix, such as xi for the i-th row of the matrix X , and subscript to index
columns, suchas xj . All vectors are defaulted to column vectors. So the i-th row of X as a row vector will be
∗
written as xi . For norms, k·k is the usual `2 norm for a vector and the operator norm (i.e., `2 → `2 ) for a
matrix; all other norms will be indexed by subscript, for example the Frobenius norm k·kF for matrices and the
element-wise max-norm k·k∞ . We use x ∼ L to mean that the random variable x is distributed according to the
law L. Let N denote the Gaussian law. Then x ∼ N (0, I) means that x is a standard Gaussian vector. Similarly,
we use x ∼i.i.d. L to mean elements of x are independently and identically distributed according to the law L. So
22
For solving the ICA problem, this suggests the log cosh contrast function, that works well empirically [105], may not work for all
distributions (rotation-invariant Gaussian excluded of course), at least when one does not process the data (say perform certain whitening or
scaling).
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
10
the fact x ∼ N (0, I) is equivalent to that x ∼i.i.d. N (0, 1). One particular distribution of interest for this paper is
the Bernoulli-Gaussian with rate θ: Z ∼ B · G, with G ∼ N (0, 1) and B ∼ Ber (θ). We also write this compactly
as Z ∼ BG (θ). We frequently use indexed C and c for numerical constants when stating and proving technical
results. The scopes of such constants are local unless otherwise noted. We use standard notations for most other
cases, with exceptions clarified locally.
The codes to reproduce all the figures and experimental results are available online:
https://github.com/sunju/dl focm .
II. T HE H IGH - DIMENSIONAL F UNCTION L ANDSCAPE
To characterize the function landscape of f (q; X0 ) over Sn−1 , we mostly work with the function
p
1X
.
hµ (q (w)∗ (x0 )k ) ,
g (w) = f (q (w) ; X0 ) =
p
(II.1)
k=1
induced by the reparametrization
q (w) = (w,
q
1 − kwk2 ),
In particular, we focus our attention to the smaller set
(
Γ=
w : kwk <
r
w ∈ Bn−1 .
4n − 1
4n
(II.2)
)
,
(II.3)
because q (Γ) contains all points q ∈ Sn−1 with n ∈ arg maxi∈±[n] q ∗ ei and we can similarly characterize other
√
parts of f on Sn−1 using projection onto other equatorial planes. Note that over Γ, qn = (1 − kwk2 )1/2 ≥ 1/(2 n).
A. Main Geometric Theorems
Theorem II.1 (High-dimensional landscape - orthogonal dictionary). Suppose A0 = I and hence Y = A0 X0 = X0 .
There exist positive constants c? and C , such that for any θ ∈ (0, 1/2) and µ < min ca θn−1 , cb n−5/4 , whenever
n
p ≥ µ2Cθ2 n3 log µθ
, the following hold simultaneously with high probability:
µ
kwk ≤ √ ,
4 2
∗
w ∇g(w; X0 )
µ
1
√ ≤ kwk ≤ √
≥ c? θ
∀ w s.t.
kwk
4 2
20 5
r
∗
2
w ∇ g(w; X0 )w
4n − 1
1
√ ≤ kwk ≤
,
≤ −c? θ
∀ w s.t.
2
4n
20 5
kwk
q
o
n
.
and the function g(w; X0 ) has exactly one local minimizer w? over the open set Γ = w : kwk < 4n−1
,
4n
satisfies
s
(
)
cc µ n log p µ
kw? − 0k ≤ min
,
.
θ
p
16
∇2 g(w; X0 ) c? θ
I
µ
∀w
s.t.
(II.4)
(II.5)
(II.6)
which
(II.7)
−10 + θ(np)−7 +
In particular, with this choice of p, the
probability the claim fails to hold is at most 4np
2
2
2
exp (−0.3θnp) + cd exp −ce pµ θ /n . Here ca through ce are all positive numerical constants.
Here q (0) = en , which exactly recovers the last row of X0 , xn0 . Though the unique local minimizer w? may
not be 0, it is very near to 0. Hence the resulting q (w? ) produces a close approximation to xn0 . Note that q (Γ)
(strictly) contains all points q ∈ Sn−1 such that n = arg maxi∈±[n] q ∗ ei . We can characterize the graph of the
⊥
function f (q; X0 ) in the vicinity of other signed basis vector ±ei simply by changing the plane e⊥
n to ei . Doing
this 2n times (and multiplying the failure probability in Theorem II.1 by 2n), we obtain a characterization of
f (q; X0 ) over the entirety of Sn−1 .23 The result is captured by the next corollary.
23
In fact, it is possible to pull the very detailed geometry captured in (II.4) through (II.6) back to the sphere (i.e., the q space) also; analysis
of the Riemannian trust-region algorithm later does part of these. We will stick to this simple global version here.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
11
constant C , such that
Corollary II.2. Suppose A0 = I and
hence Y = A
0 X0 = X0 . There exist positive
n
for any θ ∈ (0, 1/2) and µ < min ca θn−1 , cb n−5/4 , whenever p ≥ µ2Cθ2 n3 log µθ
, with probability at least
1 − 8n2 p−10 − θ(np)−7 − exp (−0.3θnp) − cc exp −cd pµ2 θ2 /n2 , the function f (q; X0 ) has exactly 2n local
minimizers over the sphere Sn−1 . In particular, there is a bijective map between these minimizers and signed basis
vectors {±ei }i , such that the corresponding local minimizer q? and b ∈ {±ei }i satisfy
s
)
(
√
cc µ n log p µ
.
(II.8)
,
kq? − bk ≤ 2 min
θ
p
16
Here ca to cd are numerical constants (possibly different from that in the above theorem).
Proof: By Theorem II.1, over q (Γ), q (w? ) is the unique local minimizer. Suppose not. Then there exist
q 0 ∈ q (Γ) with q 0 6= q (w? ) and ε > 0, such that f (q 0 ; X0 ) ≤ f (q; X0 ) for all q ∈ q (Γ) satisfying kq 0 − qk < ε.
√
Since the mapping w 7→ q (w) is 2√ n-Lipschitz (Lemma IV.7), g (w (q 0 ) ; X0 ) ≤ g (w (q) ; X0 ) for all w ∈ Γ
satisfying kw (q 0 ) − w (q)k < ε/ (2 n), implying w (q 0 ) is a local minimizer different from w? , a contradiction.
Let kw? − 0k = η . Straightforward calculation shows
p
p
kq (w? ) − en k2 = (1 − 1 − η 2 )2 + η 2 = 2 − 2 1 − η 2 ≤ 2η 2 .
Repeating the argument 2n times in the vicinity of other signed basis vectors ±ei gives 2n local minimizers of f .
Indeed, the 2n symmetric sections cover the sphere with certain overlaps, and a simple calculation shows that no
such local minimizer lies in the overlapped regions (due to nearness to a signed basis vector). There is no extra
local minimizers, as any extra local minimizer must be contained in at least one of the 2n symmetric sections,
making two different local minimizers in one section, contradicting the uniqueness result we obtained above.
Though the 2n isolated local minimizers may have different objective values, they are equally good in the sense
any of them helps produce a close approximation to a certain row of X0 . As discussed in Section I-D2, for cases
A0 is an orthobasis other than I , the landscape of f (q; Y ) is simply a rotated version of the one we characterized
above.
Theorem II.3 (High-dimensional landscape - complete dictionary). Suppose A0 is complete with its condition
number κ (A0 ). There exist positive constants c? (particularly, the same constant
n 4 as 5in
o Theorem II.1) and C ,such
0 )n
C
n n
−1
−5/4
that for any θ ∈ (0, 1/2) and µ < min ca θn , cb n
, when p ≥ c2 θ max µ4 , µ2 κ8 (A0 ) log4 κ(A
and
µθ
?
. √
Y = pθ (Y Y ∗ )−1/2 Y , U ΣV ∗ = SVD (A0 ), the following hold simultaneously with high probability:
c? θ
I
2µ
w∗ ∇g(w; V U ∗ Y )
1
≥ c? θ
kwk
2
µ
kwk ≤ √ ,
(II.9)
4 2
µ
1
√ ≤ kwk ≤ √
∀ w s.t.
(II.10)
4 2
20 5
r
w∗ ∇2 g(w; V U ∗ Y )w
1
1
4n − 1
√ ≤ kwk ≤
∀ w s.t.
,
(II.11)
≤ − c? θ
2
2
4n
20 5
kwk
q
n
o
.
and the function g(w; V U ∗ Y ) has exactly one local minimizer w? over the open set Γ = w : kwk < 4n−1
,
4n
which satisfies
kw? − 0k ≤ µ/7.
(II.12)
∇2 g(w; V U ∗ Y ) ∀w
s.t.
In particular, with this choice of p, the probability
the claim fails to hold is at most 4np−10 + θ(np)−7 +
−8
2
2
2
exp (−0.3θnp) + p + cd exp −ce pµ θ /n . Here ca through ce are all positive numerical constants.
Corollary II.4. Suppose A0 is complete with its condition number κ (A0 ). There exist positive constants
c? (particu
−1
−5/4 ,
larly, the same constant
as
in
Theorem
II.1)
and
C
,
such
that
for
any
θ
∈
(0,
1/2)
and
µ
<
min
c
θn
,
c
n
a
b
n 4 5o
. √
0 )n
when p ≥ cC2 θ max nµ4 , nµ2 κ8 (A0 ) log4 κ(A
and Y = pθ (Y Y ∗ )−1/2 Y , U ΣV ∗ = SVD (A0 ), with
µθ
?
probability at least 1 − 8n2 p−10 − θ(np)−7 − exp (−0.3θnp) − p−8 − cd exp −ce pµ2 θ2 /n2 , the function
f q; V U ∗ Y has exactly 2n local minimizers over the sphere Sn−1 . In particular, there is a bijective map
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
12
between these minimizers and signed basis vectors {±ei }i , such that the corresponding local minimizer q? and
b ∈ {±ei }i satisfy
√
kq? − bk ≤ 2µ/7.
(II.13)
Here ca through cd are numerical constants (possibly different from that in the above theorem).
We omit the proof to Corollary II.4 as it is almost identical to that of corollary II.2.
From the above theorems, it is clear that for any saddle point in the w space, the Hessian has at least one negative
eigenvalue with an associated eigenvector w/kwk. Now the question is whether all saddle points of f on Sn−1 are
“second-order”, since as alluded to in Section I-D3, we need to perform actual optimization in the q space. This is
indeed true, but we will only argue informally in the companion paper [3]. The arguments need to be put in the
language of Riemannian geometry, and we can switch back and forth between q and w spaces in our algorithm
analysis without stating this fact.
B. Useful Technical Lemmas and Proof Ideas for Orthogonal Dictionaries
The proof of Theorem II.1 is conceptually straightforward: one shows that EX0 [g (w; X0 )] has the claimed
properties, and then proves that each of the quantities of interest concentrates uniformly about its expectation. The
detailed calculations are nontrivial.
Note that
EX0 [g (q; X0 )] = Ex∼i.i.d. BG(θ) [hµ (q (w)∗ x)] .
The next three propositions show that in the expected function landscape, we see successively strongly convex
region, nonzero gradient region, and directional negative curvature region when moving away from zero, as depicted
in Fig. 2 and sketched in Section I-D2.
q
Proposition II.5. There exists a positive constant c, such that for every θ ∈ (0, 1/2) and any Rh ∈ 0, 4n−1
, if
4n
q
2 −1
µ ≤ c min θRh n , Rh n−5/4 , it holds for every w satisfying Rh ≤ kwk ≤ 4n−1
4n that
w∗ ∇2w E [hµ (q ∗ (w) x)] w
2
kwk
θ
≤− √ .
2 2π
Proof: See Page 15 under Section IV-A1.
Proposition II.6.√For every θ ∈ (0, 1/2) and
√ every µ ≤ 9/50, it holds for every w satisfying rg ≤ kwk ≤ Rg ,
where rg = µ/(6 2) and Rg = (1 − θ)/(10 5), that
w∗ ∇w E [hµ (q ∗ (w) x)]
θ
≥ √ .
kwk
20 2π
Proof: See Page 19 under Section IV-A2.
√
√
Proposition II.7. For every θ ∈ (0, 1/2), and every µ ≤ 1/(20 n), it holds for every w satisfying kwk ≤ µ/(4 2)
that
θ
I.
∇2w E[hµ (q ∗ (w) x)] √
25 2πµ
Proof: See Page 21 under Section IV-A3.
To prove that the above hold qualitatively for finite p, i.e., the function g (w; X0 ), we will need first prove that for
a fixed w each of the quantity of interest concentrate about their expectation w.h.p., and the function is nice enough
(Lipschitz) such that we can extend the results to all w via a discretization argument. The next three propositions
provide the desired pointwise concentration results.
√
Proposition II.8. Suppose 0 < µ ≤ 1/ n. For every w ∈ Γ, it holds that for any t > 0,
"
"
#
#
w∗ ∇2 g(w; X )w
w∗ ∇2 g(w; X0 )w pµ2 t2
0
P −E
.
≥ t ≤ 4 exp −
512n2 + 32nµt
kwk2
kwk2
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
13
Proof: See Page 24 under Section IV-A4.
Proposition II.9. For every w ∈ Γ, it holds that for any t > 0,
∗
∗
w ∇g(w; X0 )
w ∇g(w; X0 ) pt2
P −E
≥ t ≤ 2 exp − 8n + 4√nt .
kwk
kwk
Proof: See Page 25 under Section IV-A4.
√
Proposition II.10. Suppose 0 < µ ≤ 1/ n. For every w ∈ Γ ∩ {w : kwk ≤ 1/4}, it holds that for any t > 0,
2
2
pµ2 t2
P ∇ g(w; X0 ) − E ∇ g(w; X0 ) ≥ t ≤ 4n exp −
.
512n2 + 32µnt
Proof: See Page 25 under Section IV-A4.
The next three propositions provide the desired Lipschitz results.
Proposition II.11 (Hessian Lipschitz). Fix any rS
w∗ ∇2 g(w; X0 )w/ kwk2 is LS -Lipschitz with
LS ≤
∈
(0, 1). Over the set Γ ∩ {w : kwk ≥ rS },
16n3
8n3/2
48n5/2
3
2
kX
k
+
+
kX0 k2∞ + 96n5/2 kX0 k∞ .
kX
k
0
0
∞
∞
µ2
µrS
µ
Proof: See Page 29 under Section IV-A5.
Proposition II.12 (Gradient Lipschitz). Fix any rg ∈ (0, 1). Over the set Γ ∩ {w : kwk ≥ rg }, w∗ ∇g(w; X0 )/ kwk
is Lg -Lipschitz with
√
2 n kX0 k∞
4n2
Lg ≤
kX0 k2∞ .
+ 8n3/2 kX0 k∞ +
rg
µ
Proof: See Page 30 under Section IV-A5.
Proposition II.13 (Lipschitz for Hessian around zero). Fix any rN ∈ (0, 1/2). Over the set Γ ∩ {w : kwk ≤ rN },
∇2 g(w; X0 ) is LN -Lipschitz with
√ √
4n2
4n
8 2 n
3
2
LN ≤
kX0 k∞ +
kX0 k∞ +
kX0 k2∞ + 8 kX0 k∞ .
µ2
µ
µ
Proof: See Page 30 under Section IV-A5.
Integrating the above pieces, Section IV-B provides a complete proof of Theorem II.1.
C. Extending to Complete Dictionaries
As hinted in Section I-D2, instead of proving things from scratch, we build on the results we have obtained for
orthogonal dictionaries. In particular, we will work with the preconditioned data matrix
. p
Y = pθ(Y Y ∗ )−1/2 Y
(II.14)
and show that the function landscape f q; Y looks qualitatively like that of orthogonal dictionaries (up to a global
rotation), provided that p is large enough.
The next lemma shows Y can be treated as being generated from an orthobasis with the same BG coefficients,
plus small noise.
Lemma II.14. For any θ ∈ (0, 1/2), suppose A0 is complete with condition number κ (A0 ) and X0 ∼i.i.d. BG (θ).
Provided p ≥ Cκ4 (A0 ) θn2 log(nθκ (A0 )), one can write Y as defined in (II.14) as
Y = U V ∗ X0 + ΞX0 ,
20κ4 (A)
for a certain Ξ obeying kΞk ≤
and C is a positive numerical constant.
q
Proof: See Page 33 under Section IV-C.
θn log p
,
p
with probability at least 1 − p−8 . Here U ΣV ∗ = SVD (A0 ),
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
14
Notice that U V ∗ above is orthogonal, and that landscape of f (q; Y ) is simply a rotated version of that of
e 0 ) assuming
f (q; V U ∗ Y ), or using the notation in the above lemma, that of f (q; X0 +V U ∗ ΞX0 ) = f (q; X0 + ΞX
.
∗
e
Ξ = V U Ξ. So similar to the orthogonal case, it is enough to consider this “canonical” case, and its “canonical”
reparametrization:
p
X
e 0 =1
e (x0 ) .
g w; X0 + ΞX
hµ q ∗ (w) (x0 )k + q ∗ (w) Ξ
k
p
k=1
e 0
The following lemma provides quantitative comparison between the gradient and Hessian of g w; X0 + ΞX
and that of g (w; X0 ).
Lemma II.15. There exist positive constants Ca and Cb , such that for all w ∈ Γ,
n
e
e
∇w g(w; X0 + ΞX0 ) − ∇w g (w; X0 ) ≤ Ca log (np) kΞk,
µ
3/2 2 n
n
2
2
e
e
,
log3/2 (np) kΞk
∇w g(w; X0 + ΞX0 ) − ∇w g (w; X0 ) ≤ Cb max
2
µ
µ
with probability at least 1 − θ (np)−7 − exp (−0.3θnp).
Proof: See Page 34 under Section IV-C.
e = kΞk is then small enough
Combining the above two lemmas, it is easy to see when p is large enough, kΞk
(Lemma II.14), and hence the changes to the gradient and Hessian caused by the perturbation are small. This gives
the results presented in Theorem II.3; see Section IV-C for the detailed proof. In particular, for the p chosen in
Theorem II.3, it holds that
3/2 2 −1
n
n
3/2
e
kΞk ≤ cc? θ max
,
log (np)
(II.15)
µ2 µ
for a certain constant c which can be made arbitrarily small by making the constant C in p large.
III. D ISCUSSION
The dependency of p on n and other paramters could be suboptimal due to several factors: (1) The `1 proxy.
The derivative and Hessians of the log cosh function we adopted entail the tanh function, which is not amenable
to effective approximation and affects the sample complexity; (2) Space of geometric characterization. It seems
working directly on the sphere (i.e., in the q space) could simplify and possibly improve certain parts of the analysis;
(3) Dealing with the complete case. Treating the complete case directly, rather than using (pessimistic) bounds to
treat it as a perturbation of the orthogonal case, is very likely to improve the sample complexity. Particularly, general
linear transforms may change the space significantly, such that preconditioning and comparing to the orthogonal
transforms may not be the most efficient way to proceed.
It is possible to extend the current analysis to other dictionary settings. Our geometric structures (and algorithms)
allow plug-and-play noise analysis. Nevertheless, we believe a more stable way of dealing with noise is to directly
extract the whole dictionary, i.e., to consider geometry and optimization (and perturbation) over the orthogonal
group. This will require additional nontrivial technical work, but likely feasible thanks to the relatively complete
knowledge of the orthogonal group [52], [108]. A substantial leap forward would be to extend the methodology
to recovery of structured overcomplete dictionaries, such as tight frames. Though there is no natural elimination
of one variable, one can consider the marginalization of the objective function wrt the coefficients and work with
hidden functions. 24 For the coefficient model, as we alluded to in Section I-E, our analysis and results likely can
be carried through to coefficients with statistical dependence and physical constraints.
The connection to ICA we discussed in Section I-E suggests our geometric characterization and algorithms can
be modified for the ICA problem. This likely will provide new theoretical insights and computational schemes to
ICA. In the surge of theoretical understanding of nonconvex heuristics [41]–[45], [48], [70]–[75], [80], [81], [84],
24
This recent work [45] on overcomplete DR has used a similar idea. The marginalization taken there is near to the global optimum of one
variable, where the function is well-behaved. Studying the global properties of the marginalization may introduce additional challenges.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
15
[85], [88], [89], [89], the initialization plus local refinement strategy mostly differs from practice, whereby random
initializations seem to work well, and the analytic techniques developed in that line are mostly fragmented and
highly specialized. The analytic and algorithmic framework we developed here holds promise to provide a coherent
account of these problems, see [49]. In particular, we have intentionally separated the geometric characterization
and algorithm development and analysis in hope of making both parts modular. It is interesting to see how far we
can streamline the geometric characterization. Moreover, the separation allows development of more provable and
practical algorithms, say in the direction of [50].
IV. P ROOFS OF T ECHNICAL R ESULTS
In this section, we provide complete proofs for technical results stated in Section II. Before that, let us introduce
some notations and common results that will be used later throughout this section. Since we deal with BG random
variables and random vectors, it is often convenient to write such vector explicitly as x = [Ω1 v1 , . . . , Ωn vn ] = Ω v ,
where Ω1 , . . . , Ωn are i.i.d. Bernoulli random variables and v1 , . . . , vn are i.i.d. standard normal. For a particular
realization of such random vector, we will denote the support as I ⊂ [n]. Due to the particular coordinate map in use,
.
.
.
we will often refer to subset J = I \{n} and the random vectors x = [Ω1 v1 , . . . , Ωn−1 vn−1 ] and v = [v1 , . . . , vn−1 ]
in Rn−1 . By Lemma A.1, it is not hard to see that
∗
q (w) x
xn
∇w hµ (q ∗ (w) x) = tanh
x−
w ,
(IV.1)
µ
qn (w)
∗
∗
1
xn
xn
2 q (w) x
2
∗
1 − tanh
w
w
∇w hµ (q (w) x) =
x−
x−
µ
µ
qn (w)
q (w)
∗
n
q (w) x
1
1
− xn tanh
I+ 3
ww∗ .
(IV.2)
µ
qn (w)
qn (w)
A. Proofs for Section II-B
1) Proof of Proposition II.5: The proof involves some delicate analysis, particularly polynomial approximation of
2
1
the function f (t) = (1+t)
2 over t ∈ [0, 1]. This is naturally induced by the 1 − tanh (·) function. The next lemma
characterizes one polynomial approximation of f (t).
1
Lemma IV.1. Consider f (t) = (1+t)
2 for t ∈ [0, 1]. For every T > 1, there is a sequence b0 , b1 , . . . , with
P
k
kbk`1 = T < ∞, such that the polynomial p(t) = ∞
k=0 bk t satisfies
1
√ ,
2 T
1
kf − pkL∞ [0,1] ≤ √ ,
T
√
In particular, one can choose bk = (−1)k (k + 1)β k with β = 1 − 1/ T < 1 such that
kf − pkL1 [0,1] ≤
∞
X
1
=
(−1)k (k + 1)β k tk .
p (t) =
(1 + βt)2
k=0
Moreover, such sequence satisfies 0 <
P∞
bk
k=0 (1+k)3
<
|bk |
k=0 (1+k)3
P∞
< 2.
Proof: See page 38 under Section B.
2
Lemma IV.2. Let X ∼ N 0, σX
and Y ∼ N 0, σY2 . We have
E
2
1 − tanh
X +Y
µ
X
2
1X+Y >0 ≤
2 σ2
2 σ2
2 µ3
µσX
µ3 σX
σX
1
3
2
2
Y
Y
√
√
+
+
3µ
+
4σ
.
X
2 + σ 2 3/2
2π σ 2 + σ 2 3/2
4 2π σ 2 + σ 2 5/2
σ
X
Y
X
Y
X
Y
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
16
2 x+y
Proof: For x + y > 0, let z = exp −2 x+y
∈
[0,
1]
,
then
1
−
tanh
=
µ
µ
1
(1+βz)2
4z
.
(1+z)2
Fix any T > 1 to be
√
with β = 1 − 1/ T to upper bound
determined later, by Lemma IV.1, we choose the polynomial pβ (z) =
1
f (z) = (1+z)
2 . So we have
2
2 X +Y
X 1X+Y >0 = 4E Zf (Z) X 2 1X+Y >0
E 1 − tanh
µ
≤ 4E Zpβ (Z) X 2 1X+Y >0
∞ n
h
io
X
=4
bk E Z k+1 X 2 1X+Y >0 ,
k=0
where bk = (−1)k (k + 1)β k , and the exchange of infinite summation and expectation above is justified in view that
∞
∞
∞
h
i X
X
X
2
|bk | E Z k+1 X 2 1X+Y >0 ≤
|bk | E X 2 1X+Y >0 ≤ σX
|bk | < ∞
k=0
k=0
k=0
and the dominated convergence theorem (see, e.g., theorem 2.24 and 2.25 of [109]). By Lemma B.1, we have
∞ n
h
io
X
bk E Z k+1 X 2 1X+Y >0
k=0
=
∞
X
"
k
2
σX
(−β) (k + 1)
k=0
4 (k + 1)2 4
σX
+
µ2

!
exp
! q
2 (k + 1)2 2
2
c 2 (k + 1)
2 + σ2
σ
+
σ
Φ
σ
X
Y
X
Y
µ2
µ
σ4
2 (k + 1)

qX
√
µ
2
2
2π σX + σY


∞
2
4
2
3
X
σX µ
µσX
σX µ
1
q
−
−
≤√
(−β)k (k + 1) 

3/2
3
2 + σ2
2 + σ 2 3/2
2
2
2π k=0
8
(k
+
1)
σ
2
(k
+
1)
σ
2 (k + 1) σX + σY
X
Y
X
Y
!
∞
µ5
3 X k
4 (k + 1)2 4
2
σ
+√
β (k + 1) σX
+
5/2 ,
X
µ2
2π
32 (k + 1)5 σ 2 + σ 2
−
k=0
X
Y
where we have applied Type I upper and lower bounds for Φc (·) to even k and odd k respectively and rearrange
the terms to obtain the last line. Using the following estimates (see Lemma IV.1)
∞
X
k=0
1
(−β) =
,
1+β
k
∞
X
bk
≥ 0,
(k + 1)3
k=0
∞
X
∞
X |bk |
|bk |
≤
≤ 2,
(k + 1)5
(k + 1)3
k=0
k=0
we obtain
∞ n
h
io
X
bk E Z k+1 X 2 1X+Y >0 ≤
k=0
2 σ2
2 µ3
µσX
σX
1
3
1
2
Y
√
√
+
3µ2 + 4σX
.
3/2
5/2
1 + β 16 2π σ 2 + σ 2
2 2π σ 2 + σ 2
X
Y
X
Y
and choosing T = µ−4 , we obtain the desired result.
2
Lemma IV.3. Let X ∼ N 0, σX
and Y ∼ N 0, σY2 . We have
X +Y
E tanh
X ≥
µ
2
2 µ2
2 µ4
2σX
4µ2 σ 2
2σX
3σX
−√ q X
−√
−
√
.
√ q 2
2 + σ 2 3/2
2 + σ 2 5/2
2 + σ2
2π
σ
2
2π
σ
2π σX + σY2
2π σX
X
Y
X
Y
Y
Noticing
1
1+β
<
1
2
+
1
√
2 T
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
17
Proof: By Lemma B.1, we know
2
σX
X +Y
2 X +Y
X =
E 1 − tanh
E tanh
µ
µ
µ
1
Similar to the proof of the above lemma, for x + y > 0, let z = exp −2 x+y
and f (z) = (1+z)
2 . Fixing any
µ
2 x+y
4z
T > 1, we will use 4zpβ (z) = (1+βz)
= 4zf (z) function from above, where
2 to approximate the 1 − tanh
µ
√
again β = 1 − 1/ T . So we obtain
X +Y
E 1 − tanh2
= 8E [f (Z) Z 1X+Y >0 ]
µ
= 8E [pβ (Z) Z 1X+Y >0 ] − 8E [(pβ (Z) − f (Z)) Z 1X+Y >0 ] .
Now for the first term, we have
E [pβ (Z) Z 1X+Y >0 ] =
∞
X
h
i
bk E Z k+1 1X+Y >0 ,
k=0
P
justified as k=0 |bk | E Z k+1 1X+Y >0 ≤ ∞
k=0 |bk | < ∞ making the dominated convergence theorem (see, e.g.,
theorem 2.24 and 2.25 of [109]) applicable. To proceed, from Lemma B.1, we obtain
∞
h
i
X
bk E Z k+1 1X+Y >0
P∞
k=0
∞
X
q
2
2
2
2
2
c
2
2
=
(−β) (k + 1) exp
(k + 1) σX + σY
Φ
(k + 1) σX + σY
µ2
µ
k=0


∞
3
1 X
µ
µ
q
≥√
−
(−β)k (k + 1) 

3
2 + σ 2 3/2
2 + σ2
2π k=0
8
(k
+
1)
σ
2 (k + 1) σX
X
Y
Y
k
∞
µ5
3 X k
β (k + 1)
−√
,
2 + σ 2 5/2
2π k=0
32 (k + 1)5 σX
Y
where we have applied Type I upper and lower bounds for Φc (·) to odd k and even k respectively and rearrange
the terms to obtain the last line. Using the following estimates (see Lemma IV.1)
∞
∞
∞
∞
X
X
X
X
1
bk
|bk |
|bk |
(−β)k =
, 0≤
≤
≤
≤ 2,
3
5
1+β
(k + 1)
(k + 1)
(k + 1)3
k=0
k=0
k=0
k=0
we obtain
∞
X
h
i
bk E Z k+1 1X+Y >0 ≥
k=0
1
µ3
3µ5
µ
q
−
−
√
√
.
√
2 + σ 2 3/2
2 + σ 2 5/2
2 + σ2 1 + β
4
2π
σ
16
2π
σ
2 2π σX
X
Y
X
Y
Y
To proceed, by Lemma B.1 and Lemma IV.1, we have
µ
q
E [(pβ (Z) − f (Z)) Z 1X+Y >0 ] ≤ kp − f kL∞ [0,1] E [Z 1X+Y >0 ] ≤ √
,
2 + σ2
2 2πT σX
Y
where we have also used Type I upper bound for Φc (·). Combining the above estimates, we get
X +Y
E tanh
X ≥
µ
2
2 µ2
2 µ4
4σX
2σX
3σX
1
1
√
q
−
−√
−
√
.
√
2 + σ 2 3/2
2 + σ 2 5/2
1+β
2 + σ2
T
2π
σ
2
2π
σ
2π σX
X
Y
X
Y
Y
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
18
1
> 12 and taking T = µ−4 , we obtain the claimed result.
Noticing 1+β
Proof: (of Proposition II.5) For any i ∈ [n − 1], we have
Z 1Z Z 1Z ∂
1
∗
hµ (q (w) x) µ (dx) dwi ≤
|xi | + |xn |
µ (dx) dwi < ∞.
qn (w)
0
x ∂wi
0
x
h
i
∂
∂
∗ (w) x)] = E
∗ (w) x) . Moreover for any j ∈ [n − 1],
E
[h
(q
h
(q
Hence by Lemma A.3 we obtain ∂w
µ
µ
∂wi
i
1Z
Z
0
∂2
∗
∂wj ∂wi hµ (q (w) x) µ (dx) dwj ≤
x
Z 1Z 1
1
|xn |
|xn |
1
|xi | +
|xj | +
+ |xn |
+
µ (dx) dwi < ∞.
qn (w)
qn (w)
qn (w) qn3 (w)
0
x µ
Invoking Lemma A.3 again we obtain
∂
∂2
∂
∂2
∗
∗
∗
E [hµ (q (w) x)] =
E
hµ (q (w) x) = E
hµ (q (w) x) .
∂wj ∂wi
∂wj
∂wi
∂wj ∂wi
The above holds for any pair of i, j ∈ [n − 1], so it follows that
∇2w E [hµ (q ∗ (w) x)] = E ∇2w hµ (q ∗ (w) x) .
Hence it is easy to see that
w∗ ∇2w E [hµ (q ∗ (w) x)] w
"
∗
∗
2 #
1
q (w) x
q
(w)
x
x
xn
n
2
2
2
∗
= E 1 − tanh
kwk
− E tanh
kwk .
w x−
µ
µ
qn (w)
µ
qn3 (w)
Now the first term is
"
∗
2 #
1
q
(w)
x
x
n
E 1 − tanh2
w∗ x −
kwk2
µ
µ
qn (w)
∗ 2 (1 − θ)
2 w x
∗ 2
=
E 1 − tanh
(w x) 1w∗ x>0
µ
µ
∗
4θ kwk2
2 w x + qn (w) xn
∗
−
E 1 − tanh
(w x) (qn (w) xn ) 1w∗ x+qn (w)xn >0
µ qn2 (w)
µ
∗
2
2θ
2 wJ v + qn (w) vn
∗
∗
+ EJ Ev
1 − tanh
wJ v 1wJ v+qn (w)vn >0
µ
µ
∗
2θ kwk4
2
2 wJ v + qn (w) vn
∗
+
E
E
1
−
tanh
(q
(w)
v
)
1
n
n
J v
wJ v+qn (w)vn >0
µ qn4 (w)
µ
8 (1 − θ)
w∗ x
≤
E exp −2
(w∗ x)2 1w∗ x>0
µ
µ
8θ kwk2
2
2
∗
∗
+
E exp − (w x + qn (w) xn ) (w x + qn (w) xn ) 1w∗ x+qn (w)xn >0
µ qn2 (w)
µ
2θ
2 X +Y
2
+ EJ EX,Y
1 − tanh
Y 1X+Y >0
µ
µ
2θ kwk4
2 X +Y
2
+
EJ EX,Y
1 − tanh
X 1X+Y >0 ,
µ qn4 (w)
µ
.
.
where conditioned on each support set J , we let X = qn (w) vn ∼ N 0, qn2 (w) and Y = wJ∗ v ∼ N 0, kwJ k2 .
Noticing the fact t 7→ exp (−2t/µ) t2 for t > 0 is maximized at t = µ with maximum value exp (−2) µ2 , and in
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
19
view of the estimate in Lemma IV.2, we obtain
"
2 #
∗
x
q
(w)
x
1
n
E 1 − tanh2
w∗ x −
kwk2
µ
µ
qn (w)
!
kwk2
≤ 8 exp (−2) 1 − θ + 2
θ µ
qn (w)
"
#
1 µ kwJ k2 qn2 (w) µ3 kwJ k2 qn2 (w)
3 kwJ k2 µ3 2
2θ
+
+ √
3µ + 4 kwJ k2
+ EJ √
µ
2π
4 2π kqI k5
kqI k3
kqI k3
#
"
2θ kwk4
3 qn2 (w) µ3
1 µ kwJ k2 qn2 (w) µ3 kwJ k2 qn2 (w)
+
+
+ √
3µ2 + 4qn2 (w)
EJ √
µ qn4 (w)
2π
4 2π kqI k5
kqI k3
kqI k3
"
#
kwJ k2
2θ
11
3
3
1
1
2
≤√
EJ
+ µ 2+ 2
+ 2θµ 1 + √
+√
,
+
20
qn (w)
2πqn2 (w)
2πqn (w) qn3 (w)
2πqn5 (w)
kqI k3
where we have used µ < qn (w) ≤ kqI k and kwJ k ≤ kqI k and kwk ≤ 1 and θ ∈ (0, 1/2) to simplify the
intermediate quantities to obtain the last line. Similarly for the second term, we obtain
∗
q (w) x
xn
2
E tanh
kwk
µ
qn3 (w)
∗
wJ v + qn (w) vn
kwk2 θ
= 4
EJ Ev tanh
xn qn (w)
qn (w)
µ
"
#
kwk2 θ
2qn2 (w)
4µ2 qn2 (w)
2qn2 (w) µ2
3qn2 (w) µ4
≥ 4
EJ √
−√
−√
− √
qn (w)
2π kqI k
2π kqI k
2π kqI k3 2 2π kqI k5
"
#
r
kwk2
2 θ
4θµ2
1
1
EJ
−√
+
.
≥
π qn2 (w)
kqI k
2π qn3 (w) qn5 (w)
Collecting the above estimates, we obtain
w∗ ∇2w E [hµ (q ∗ (w) x)] w


r
2
2
2 (w)
2
kwk
kw
k
+
q
J
n
kwJ k
2 θ

≤
EJ 
−
3
3
2
π qn (w)
kqI k
kqI k
11
1
3
2
5
2
+ µ 2+ 2
+ 2θµ 1 + √
+
+√
20
qn (w)
2πqn (w) qn3 (w)
2πqn5 (w)
"
#
r
kwJ c k2
2
11
11 µ
6
2
≤ −
θE
+
µ
+
+
2θµ
1
+
π
10
20 qn (w)
qn5 (w)
kqI k3
"
#
r
1
11 µ
6
2
11
2
≤ −
θ (1 − θ) kwk2 E
µ
+
+
2θµ
1
+
,
+
π
10
20 qn2 (w)
qn5 (w)
kqI k3
(IV.3)
where to obtain the last line we have invoked the association inequality in Lemma A.2, as both kwJ c k2 and
1/ kqI k3 both coordinatewise nonincreasing w.r.t. the index set. Substituting the upper bound for
q µ into (IV.3) and
noting Rh ≤ kwk and also noting the fact qn (w) ≥ 2√1 n (implied by the assumption kwk ≤ 4n−1
4n ), we obtain
the claimed result.
2) Proof of Proposition II.6:
Proof: By similar consideration as proof of the above proposition, the following
is justified:
∇w E [hµ (q ∗ (w) x)] = E [∇w hµ (q ∗ (w) x)] .
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
20
Now consider
w∗ ∇E [hµ (q ∗ (w) x)] = ∇E [w∗ hµ (q ∗ (w) x)]
∗
∗
kwk2
q (w) x
q (w) x
∗
(w x̄) −
E tanh
xn .
= E tanh
µ
qn
µ
(IV.4)
For (IV.4), we next provide a lower bound for the first expectation and an upper bound for the second expectation.
For the first, we have
∗
q (w) x
∗
E tanh
(w x)
µ
∗
∗ wJ v + qn (w) vn
wJ v
∗
∗
= θEJ Ev tanh
+ (1 − θ)EJ Ev tanh
wJ v
wJ v
µ
µ
Y
X +Y
Y
+ (1 − θ)EJ EY tanh
Y ,
= θEJ EX,Y tanh
µ
µ
.
.
where X = qn (w) vn ∼ N 0, qn2 (w) and Y = wJ∗ v ∼ N 0, kwJ k2 . Now by Lemma A.2 we obtain
X +Y
X +Y
Y ≥ E tanh
E [Y ] = 0,
E tanh
µ
µ
as tanh X+Y
and X are both coordinatewise nondecreasing function of X and Y . Using the tanh (z) ≥
µ
(1 − exp (−2z)) /2 lower bound for z > 0 and integral results in Lemma B.1, we obtain
Y
Y
E tanh
Y = 2E tanh
Y 1Y >0
µ
µ
2Y
≥ E 1 − exp −
Y 1Y >0
µ
2 2σY2
2σY
c 2σY
=
exp
Φ
µ
µ2
µ

s
2
2
σ
2σ
σY 
≥ √ Y  1 + Y2 −
µ
µ
µ 2π
s

2
2
2σ
kwk
kwk 
≥ √Y  1 +
−
,
2
µ
µ
µ 2π
c
where at the second last inequality we have used
√ Type III lower bound for Gaussian upper tail Φ (·) (Lemma A.4),
and at the last we have used the fact that t 7→ 1 + t2 − t is a monotonic decreasing function over t > 0 and that
σY = kwJ k ≤ kwk . Collecting the above estimates, we have

s

∗
2
2
2 kwJ k 
kwk2 kwk 
q (w) x
E tanh
(w∗ x) ≥ (1 − θ) EJ  √
1+
−
µ
µ2
µ
µ 2π
"
#
2 kwJ k2 µ
√
≥ (1 − θ) EJ
µ 2π 10 kwk
≥
θ (1 − θ) kwk
√
,
5 2π
where at the second line we have used the assumption that kwk ≥
1
for x ≥ 6√
.
2
(IV.5)
µ
√
6 2
and also the fact that
√
1 + x2 ≥ x +
1
10x
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
21
For the second expectation of (IV.4), we have
r
∗
∗
q
(w)
x
q (w) x
|vn | ≤ θ 2 ,
xn ≤ θE tanh
E tanh
µ
µ
π
(IV.6)
as tanh (·) is bounded by one in magnitude. Plugging the results of (IV.5) and (IV.6) into (IV.4) and noticing that
qn (w)2 + kwk2 = 1 we obtain


θ kwk  1 − θ
2 kwk  θ (1 − θ) kwk
√
w∗ ∇E [hµ (q ∗ (w) x)] ≥ √
−q
≥
,
5
2
2π
10
2π
1 − kwk
where we have invoked the assumption that kwk ≤
1√
10 5
2kwk
(1 − θ) to provide the upper bound √
1−kwk2
≤
1
10
(1 − θ).
We then choose the particular ranges as stated for µ and θ to ensure rg < Rg , completing the proof.
3) Proof of Proposition II.7:
Proof: By consideration similar to proof of Proposition II.5, we can exchange
the hessian and expectation, i.e.,
∇2w E [hµ (q ∗ (w) x)] = E ∇2w hµ (q ∗ (w) x) .
We are interested in the expected Hessian matrix
∗
∗ 1
xn
xn
2
∗
2 q (w) x
∇w E [hµ (q (w) x)] = E 1 − tanh
x−
w
x−
w
µ
µ
qn (w)
qn (w)
∗
q (w) x
xn
xn
∗
− E tanh
I+ 3
ww
µ
qn (w)
qn (w)
µ
.
in the region that 0 ≤ kwk ≤ 4√
2
When w = 0, by Lemma B.1, we have
E ∇2w hµ (q ∗ (w) x) w=0
1
xn
2 xn
∗
= E 1 − tanh
x x − E tanh
xn I
µ
µ
µ
θ(1 − θ)
θ2
θ
2 vn
2 vn
=
I + Evn 1 − tanh
I − Evn 1 − tanh
I
µ
µ
µ
µ
µ
θ(1 − θ)
qn (w) vn
=
Evn tanh2
I.
µ
µ
Simple calculation based on Lemma B.1 shows
2
2
2
2 vn
c
Evn tanh
≥ 2 1 − 4 exp
Φ
≥2 1− √ µ .
µ
µ2
µ
2π
Invoking the assumptions µ ≤
1√
20 n
≤ 1/20 and θ < 1/2, we obtain
2
θ
(1
−
θ)
4
θ
1
∗
I.
E ∇w hµ (q (w) x) w=0 2− √ µ I 1− √
µ
µ
2π
10 2π
µ
When 0 < kwk ≤ 4√
, we aim to derive a semidefinite lower bound for
2
E ∇2w hµ (q ∗ (w) x)
∗
∗
1
1
q (w) x
2 q (w) x
∗
= E 1 − tanh
xx − 2
E tanh
qn (w) xn I
µ
µ
q (w)
µ
∗
n
q (w) x
1
E 1 − tanh2
− 2
qn (w) xn (wx∗ + xw∗ )
µqn (w)
µ
∗
∗
1
1
q (w) x
2 q (w) x
2
+ 4
E 1 − tanh
(qn (w) xn ) − E tanh
qn (w) xn ww∗ . (IV.7)
qn (w) µ
µ
µ
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
22
We will first provide bounds for the last two lines and then tackle the first which is slightly more tricky. For the
second line, we have
∗
1
q
(w)
x
∗
∗
2
E 1 − tanh
qn (w) xn (wx + xw ) 2
µqn (w)
µ
∗
2
E 1 − tanh2 q (w) x
≤ 2
qn (w) xn x̄ w∗ µqn (w)
µ
∗
2
E 1 − tanh2 q (w) x
≤ 2
qn (w) xn x kwk
µqn (w)
µ
2
θ2 E [|vn |] E [kvk] kwk
≤
µqn (w)
√
4θ2 √
θ 4θ n kwk
θ 1
q
,
≤
n kwk ≤
≤
πµqn (w)
µ
µ 40π
π 1 − kwk2
2 q ∗ (w)x where from the third to the fourth line we have used 1 − tanh
≤ 1, Jensen’s inequality for the k·k
µ
µ
function, and independence of xn and x, and to obtain the last bound we have invoked the kwk ≤ 4√
,µ≤
2
1
and θ < 2 assumptions. For the third line in (IV.7), by Lemma A.1 and Lemma B.1,
∗
∗
1
q (w) x
2
E 1 − tanh2 q (w) x
(qn (w) xn ) − E tanh
qn xn µ
µ
µ
∗
θ
wJ v + qn (w) vn
1 − tanh2
(qn (w) vn )2
= EJ Ev
µ
µ
∗
wJ v + qn (w) vn
−θEJ Ev tanh
qn (w) vn µ
∗
θ
2 wJ v + qn (ws) vn
2
2
= EJ Ev
1 − tanh
(qn (w) vn ) + qn (w)
µ
µ
8θ
2
∗
2
2
≤ EJ Ev exp − wJ v + qn (w) vn
(qn (w) vn ) + qn (w) 1wJ∗ v+qn (w)vn >0
µ
µ


2 (w)
8θ
q
(w)
n
 ≤ 8θq
√n
≤ √ EJ  q
.
2
2π
2π
2
q (w) + kw k
n
1√
,
20 n
J
Thus, we have
∗ ∗ q x
2 q x
2
1 − tanh
(qn xn ) − E tanh
qn xn ww∗
µ
µ
!
8θ
θ 64n3/2 µ kwk2
θ
1
2
√ kwk I −
√
√ I,
I−
−
3
3
µ
µ
qn (w) 2π
qn (w) 2π
4000 2π
1
4
qn (w)
1
E
µ
µ
where we have again used kwk ≤ 4√
, µ ≤ 201√n , and qn (w) ≥ 2√1 n assumptions to simplify the final bound.
2
To derive a lower bound for the first line of (IV.7), we lower bound the first term and upper bound the second.
The latter is easy: using Lemma A.1 and Lemma B.1,
∗
1
q (w) x
E tanh
qn (w) xn
qn2 (w)
µ
∗
θ
2 wJ v + qn (w) vn
= EJ Ev 1 − tanh
µ
µ
∗
wJ v + qn (w) vn
8θ
≤ EJ Ev exp −2
1wJ∗ v+qn (w)vn >0
µ
µ
√
4θ
θ 8 nµ
θ 2
√
√ ,
≤√
≤
≤
µ 2π
µ 5 2π
2πqn (w)
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
23
where we have again used assumptions that qn (w) ≥ 2√1 n and µ ≤ 201√n to simplify the last bound. To lower
bound the first term, first note that
∗
∗ 1
q (w) x
1−θ
w x
E 1 − tanh2
x x∗ Ex 1 − tanh2
x x∗ .
µ
µ
µ
µ
We set out to lower bound the expectation as
∗ 2 w x
∗
Ex 1 − tanh
x x θβI
µ
for some scalar β > 0. Suppose w has k ∈ [n − 1] nonzeros, w.l.o.g., further assume the first k elements of w are
these nonzeros. It is easy to see the expectation above has a block diagonal structure diag (Σ; αθIn−1−k ), where
∗ w x
.
α = Ex 1 − tanh2
.
µ
So in order to derive the θβI lower bound as desired, it is sufficient to show Σ θβI and β ≤ α for some
e ∈ Rk be the subvector of nonzero elements,
0 < β < 1, i.e., letting w
∗ e x
e
2 w
∗
ex
e θβI,
Ex∼
1 − tanh
x
e i.i.d. BG(θ)
µ
which is equivalent to that for all z ∈ Rk such that kzk = 1,
∗ e x
e
∗ 2
2 w
e z) ≥ θβ.
(x
Ex∼
1 − tanh
e i.i.d. BG(θ)
µ
It is then sufficient to show that for any nontrivial support set S ⊂ [k] and any vector z ∈ Rk such that supp (z) = S
with kzk = 1,
∗ eS v
e
2 w
∗ 2
e z) ≥ β.
Eve∼i.i.d. N (0,1) 1 − tanh
(v
µ
To see the implication, suppose the latter claimed holds, then for any z with unit norm,
∗ e x
e
∗ 2
2 w
e z)
(x
Ex∼
1 − tanh
e i.i.d. BG(θ)
µ
∗ k
X
X
eS v
e
2
k−s
s
2 w
∗
e zS )
=
θ (1 − θ)
Eve∼i.i.d. N (0,1) 1 − tanh
(v
µ
[k]
s=1
S∈( s )
k
h
i
X
X
≥
θs (1 − θ)k−s
β kzS k2 = β ES kzS k2 = θβ.
s=1
S∈([k]
s )
Now for any fixed support set S ⊂ [k], z = Pwe S z + (I − Pwe S ) z . So we have
∗ eS v
e
∗ 2
2 w
e
Eve∼i.i.d. N (0,1) 1 − tanh
(v z)
µ
∗ ∗ eS v
e
eS v
e
2
2
2 w
∗
2 w
∗
e Pwe S z) + Eve 1 − tanh
e (I − Pwe S ) z)
= Eve 1 − tanh
(v
(v
µ
µ
∗ ∗ h
i
eS∗ z)2
eS v
e
eS v
e
(w
2
2
2 w
∗
2 w
∗
e
e
e
=
+
E
1
−
tanh
)
z)
E
1
−
tanh
(
v
w
)
E
(
v
(I
−
P
S
e
e
e
e
v
v
v
w
S
µ
µ
kwS k4
e ∗ z)2
e∗ v
e
e∗ v
e
(w
2w
2w
e∗ w
eS )2 1ve∗ we S >0 + 2Eve exp − S
(v
1weS∗ ve>0 k(I − PweS ) zk2 .
≥ 2 S 4 Eve exp − S
µ
µ
kwS k
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
24
Using expectation result from Lemma B.1, and applying Type III lower bound for Gaussian tails, we obtain
∗ eS v
e
w
e∗ z)2
(v
Eve∼i.i.d. N (0,1) 1 − tanh2
µ
s

2
eS∗ z)2
eS k
eS k 
4 (w
4 kw
2 kw
1 
√
√
−
≥
4+
−
µ2
µ
eS k
2π
µ 2π kw
1
3√
≥√
2 ,
2−
4
2π
µ
e∗ z)2 ≤ kv
e∗ k2 and invoked the assumption kwk ≤ 4√
where we have used Cauchy-Schwarz to obtain (v
to simplify
2
the last bound. On the other hand, we similarly obtain
p
4kwk2 /µ2 + 4 − 2kwk/µ
1√
2
1
2
2−
2 .
α = EJ EZ∼N (0,kwJ k2 ) [1 − tanh (Z/µ)] ≥ √
≥√
2
2
2π
2π
√ So we can take β = √12π 2 − 34 2 < 1.
Putting together the above estimates for the case w 6= 0, we obtain
√
2
2π
θ
3√
1
2
1 θ
∗
E ∇w hµ (q (w) x) √
1−
2−
−
−
I √
I.
8
40π
4000 5
µ 2π
25 2π µ
Hence for all w, we can take the 25√1 2π µθ as the lower bound, completing the proof.
4) Proof of Pointwise Concentration Results: To avoid clutter of notations, in this subsection we write X to
mean X0 ; similarly xk for (x0 )k , the k -th column of X0 . The function g (w) means g (w; X0 ). We first establish
a useful comparison lemma between random i.i.d. Bernoulli random vectors random i.i.d. normal random vectors.
Lemma IV.4. Suppose z, z 0 ∈ Rn are independent and obey z ∼i.i.d. BG (θ) and z 0 ∼i.i.d. N (0, 1). Then, for any
fixed vector v ∈ Rn , it holds that
m E [|v ∗ z|m ] ≤ E v ∗ z 0 = EZ∼N (0,kvk2 ) [|Z|m ] ,
m E [kzkm ] ≤ E z 0 ,
for all integers m ≥ 1.
Proof: See page 38 under Section B.
Now, we are ready to prove Proposition II.8 to Proposition II.10 as follows.
Proof: (of Proposition II.8) Let
Yk =
then
w∗ ∇2 g(w)w
kwk2
Yk
=
1
p
1
=
µ
Pp
For each Yk (k ∈ [p]), from (IV.2), we know that
k=1 Yk .
1
w∗ ∇2 hµ (q(w)∗ xk ) w,
kwk2
2
1 − tanh
q(w)∗ xk
µ
xk (n) kwk
w ∗ xk
−
kwk
qn (w)
2
− tanh
q(w)∗ xk
µ
Writing Yk = Wk + Vk , where
∗
xk (n) kwk 2
q(w)∗ xk
w xk
1 − tanh
−
,
µ
kwk
qn (w)
q(w)∗ xk xk (n)
Vk = − tanh
.
µ
qn3 (w)
1
Wk =
µ
2
xk (n)
.
qn3 (w)
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
25
Then by similar argument as in proof to Proposition II.9, we have for all integers m ≥ 2 that
"
2m #
h
i
∗x
x
(n)
kwk
1
1
w
k
k
2m
2 (w))
−
≤
E
|Z|
E [|Wk |m ] ≤ m E Z∼N
(0,1/q
n
µ
kwk
qn (w) µm
m
1
m! 4n
≤ m (2m − 1)!!(4n)m ≤
,
µ
2
µ
√ 3m
√ m
1
m!
E [|Vk |m ] ≤ 3m
(m − 1)!! ≤
E [|vk (n)|m ] ≤ 2 n
8n n ,
qn (w)
2
2 = 16n2 /µ2 ≥ E W 2 ,
where we have again used the assumption that qn (w) ≥ 2√1 n to simplify the result. Taking σW
k
P
P
√
RW = 4n/µ and σV2 = 64n3 ≥ E Vk2 , RV = 8n n, and considering SW = p1 pk=1 Wk and SV = p1 pk=1 Vk ,
then by Lemma A.8, we obtain
t
pµ2 t2
P |SW − E [SW ]| ≥
≤ 2 exp −
,
2
128n2 + 16nµt
pt2
t
√
P |SV − E [SV ]| ≥
≤ 2 exp −
.
2
512n3 + 32n nt
Combining the above results, we obtain
#
" p
1 X
Xk − E [Xk ] ≥ t = P [|SW − E [SW ] + SV − E [SV ]| ≥ t]
P p
k=1
t
t
≤ P |SW − E [SW ]| ≥
+ P |SV − E [SV ]| ≥
2
2
2
2
pt2
pµ t
√
+ 2 exp −
≤ 2 exp −
128n2 + 16nµt
512n3 + 32n nt
pµ2 t2
≤ 4 exp −
,
512n2 + 32nµt
provided that µ ≤ √1n , as desired.
Proof: (of Proposition II.9) Let
Xk =
w∗ ∇g(w)
kwk2
1
p
w∗
∇hµ (q(w)∗ xk ) ,
kwk2
Pp
For each Xk , k ∈ [p], from (IV.1), we know that
∗
kwk2 xk (n) w∗ xk
kwk2 xk (n) w xk
q(w)∗ xk
−
|Xk | = tanh
≤ kwk − qn (w) ,
µ
kwk
qn (w)
∗
kwkxk (n)
kwk ∗
xk
w
as the magnitude of tanh (·) is bounded by one. Because w
−
=
,
−
xk and xk ∼i.i.d.
kwk2
qn (w)
kwk
qn (w)
BG (θ), invoking Lemma IV.4, we obtain for every integer m ≥ 2 that
then
=
k=1 Xk .
E [|Xk |m ] ≤ EZ∼N (0,1/qn2 (w)) [|Z|m ] ≤
√ m−2
1
m!
(4n) 2 n
,
m (m − 1)!! ≤
qn (w)
2
where the Gaussian moment can be looked up in Lemma A.5 and we used the factthat (m − 1)!!
≤ m!/2 and the
√
assumption that qn (w) ≥ 2√1 n to get the result. Thus, by taking σ 2 = 4n ≥ E Xk2 and R = 2 n, and we obtain
the claimed result by invoking Lemma A.8.
P
Proof: (of Proposition II.10) Let Zk = ∇2w hµ (q(w)∗ xk ), then ∇2w g (w) = p1 pk=1 Zk . From (IV.2), we
know that
Zk = Wk + Vk
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
26
where
Wk
Vk
q(w)∗ xk
xk (n) w
xk (n) w ∗
1 − tanh
xk −
xk −
µ
qn (w)
qn (w)
q(w)∗ xk
xk (n)
xk (n) ww∗
= − tanh
.
I+
µ
qn (w)
qn3 (w)
1
=
µ
2
For Wk , we have
"
2m−2 ∗ #
1
x
(n)
w
x
(n)
w
x
(n)
w
k
k
k
0 E [Wkm ] m E xk −
xk −
xk − qn (w) µ
qn (w)
qn (w)
"
2m #
x
(n)
w
1
k
xk −
mE I
µ
qn (w) "
!m #
x2k (n) kwk2
2m
2
kxk k +
I
mE
µ
qn2 (w)
i
2m h
2m
m E kxk k2m I m EZ∼χ2 (n) [Z m ] I,
µ
µ
where we have used the fact that kwk2 /qn2 (w) = kwk2 /(1 − kwk2 ) ≤ 1 for kwk2 ≤ 14 and Lemma IV.4 to obtain
the last line. By Lemma A.6, we obtain
m
m!
m! 4n m
2
(2n)m I =
I.
0 E [Wkm ] µ
2
2
µ
2
. 1 Pp
16n2
2
Taking RW = 4n
k=1 Wk , by Lemma A.9, we obtain
µ and σW = µ2 ≥ E Wk , and letting SW = p
pµ2 t2
t
≤ 2n exp −
.
P kSW − E [SW ]k ≥
2
128n2 + 16µnt
Similarly, for Vk , we have
!m
kwk2
1
+
E [|xk (n)|m ] I
qn (w) qn3 (w)
√ m
8n n (m − 1)!!I
√ m
m!
8n n I,
2
1
where we have used the fact qn (w) ≥ 2√n to simplify the result. Similar argument also shows −E [Vkm ] √ m
√
. P
m! (8n n) I/2. Taking RV = 8n n and σV2 = 64n3 , and letting SV = p1 pk=1 Vk , again by Lemma A.9, we
obtain
t
pt2
√
P kSV − E [SV ]k ≥
≤ 2n exp −
.
2
512n3 + 32n nt
E [Vkm ]
Combining the above results, we obtain
#
" p
1 X
P Zk − E [Zk ] ≥ t = P [kSW − E [SW ] + SV − E [SV ]k ≥ t]
p
k=1
t
t
≤ P kSW − E [SW ]k ≥
+ P kSV − E [SV ]k ≥
2
2
2
2
pµ t
pt2
√
≤ 2n exp −
+
2n
exp
−
128n2 + 16µnt
512n3 + 32n nt
2
2
pµ t
≤ 4n exp −
,
512n2 + 32µnt
where we have simplified the final result based on the fact that µ ≤
√1 .
n
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
27
5) Proof of Lipschitz Results: To avoid clutter of notations, in this subsection we write X to mean X0 ; similarly
xk for (x0 )k , the k -th column of X0 . The function g (w) means g (w; X0 ). We need the following lemmas to
prove the Lipschitz results.
Lemma IV.5. Suppose that ϕ1 : U → V is an L-Lipschitz map from a normed space U to a normed space V , and
that ϕ2 : V → W is an L0 -Lipschitz map from V to a normed space W . Then the composition ϕ2 ◦ ϕ1 : U → W is
LL0 -Lipschitz.
Lemma IV.6. Fix any D ⊆ Rn−1 . Let g1 , g2 : D → R, and assume that g1 is L1 -Lipschitz, and g2 is L2 -Lipschitz,
and that g1 and g2 are bounded over D, i.e., |g1 (x)| ≤ M1 and |g2 (x)| ≤ M2 for all x ∈ D with some constants
M1 > 0 and M2 > 0. Then the function h(x) = g1 (x)g2 (x) is L-Lipschitz, with
L = M1 L2 + M2 L1 .
Lemma IV.7. For every w, w0 ∈ Γ, and every fixed x, we have
√
2 n
∗
0 ∗
kxk w − w0 ,
ḣµ (q(w) x) − ḣµ q(w ) x ≤
µ
√
n
4
kxk w − w0 .
ḧµ (q(w)∗ x) − ḧµ q(w0 )∗ x ≤
2
µ
Proof: We have
q
0
q
− w0 k
2
2
0
0
qn (w) − qn w = 1 − kwk − 1 − kw k = q kw + w k kw
q
1 − kwk2 + 1 − kw0 k2
max (kwk , kw0 k) w − w 0 .
≤
min (qn (w) , qn (w0 ))
Hence it holds that

2
, kw0 k2

max kwk
q (w) − q w0 2 = w − w0 2 + qn (w) − qn w0 2 ≤ 1 +
 w − w0 2
2
2
0
min (qn (w) , qn (w ))
1
w − w0 2 ≤ 4n w − w0 2 ,
2
2
0
min (qn (w) , qn (w ))
where we have used the fact qn (w) ≥ 2√1 n to get the final result. Hence the mapping w 7→ q(w) is
over Γ. Moreover it is easy to see q 7→ q ∗ x is kxk2 -Lipschitz. By Lemma A.1 and the composition
=
√
2 n-Lipschitz
rule in Lemma
IV.5, we obtain the desired claims.
Lemma IV.8. For any fixed x, consider the function
xn
. w∗ x
tx (w) =
−
kwk
kwk
qn (w)
defined over w ∈ Γ. Then, for all w, w0 in Γ such that kwk ≥ r and kw0 k ≥ r for some constant r ∈ (0, 1), it
holds that
tx (w) − tx (w0 ) ≤ 2 kxk + 4n3/2 kxk
w − w 0 ,
∞
r
√
|tx (w)| ≤ 2 n kxk ,
2
√
kxk
2
0
3/2
tx (w) − tx (w ) ≤ 8 n kxk
+ 4n kxk∞ w − w0 ,
r
2
2
tx (w) ≤ 4n kxk .
Proof: First of all, we have
∗
kwk
w
|tx (w)| =
,−
x ≤ kxk
kwk qn (w)
kwk2
1+ 2
qn (w)
!1/2
=
√
kxk
≤ 2 n kxk ,
|qn (w)|
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
28
where we have used the assumption that qn (w) ≥ 2√1 n to simplify the final result. The claim about t2x (w) follows
immediately. Now
∗ 0k 0
w
kwk
kw
w
0
tx (w) − tx (w ) ≤ kwk − kw0 k x + |xn | qn (w) − qn (w0 ) .
Moreover we have
0
0
0
0
w
w
w0 ∗ w0 ≤ kxk kw − w k kw k + kw k |kwk − kw k|
x
−
≤
kxk
−
kwk kw0 k kwk kw0 k
kwk kw0 k
2 kxk w − w0 ,
≤
r
√
where we have used the assumption that kwk ≥ r to simplify the result. Noticing that t 7→ t/ 1 − t2 is continuous
over [a, b] and differentiable over (a, b) for any 0 < a < b < 1, by mean value theorem,
kwk
kw0 k 1
0
3/2
0
3/2 w − w ≤ 8n w − w ,
qn (w) − qn (w0 ) ≤ sup w∈Γ
1 − kwk2
where we have again used the assumption that qn (w) ≥ 2√1 n to simplify the last result. Collecting the above
estimates, we obtain
w − w0 ,
tx (w) − tx (w0 ) ≤ 2 kxk + 8n3/2 kxk
∞
r
as desired. For the last one, we have
2
tx (w) − t2x (w0 ) = tx (w) − tx (w0 ) tx (w) + tx (w0 )
≤ 2 sup |tx (s)| tx (w) − tx (w0 ) ,
s∈Γ
leading to the claimed result once we substitute estimates of the involved quantities.
Lemma IV.9. For any fixed x, consider the function
xn
xn
Φx (w) =
I+ 3
ww∗
qn (w)
qn (w)
defined over w ∈ Γ. Then, for all w, w0 ∈ Γ such that kwk < r and kw0 k < r with some constant r ∈ 0, 12 , it
holds that
kΦx (w)k ≤ 2 kxk∞ ,
Φx (w) − Φx (w0 ) ≤ 4 kxk w − w0 .
∞
Proof: Simple calculation shows
kΦx (w)k ≤ kxk∞
kwk2
1
+ 3
qn (w) qn (w)
For the second one, we have
Φx (w) − Φx (w0 ) ≤ kxk ∞
≤ kxk∞
!
=
kxk∞
kxk∞
≤
≤ 2 kxk∞ .
3
qn (w)
(1 − r2 )3/2
1
1
1
1
∗
0
0 ∗
I+ 3
ww −
I − 3 0 w (w ) 0
qn (w)
qn (w)
qn (w )
qn (w )
!
2
0 k2 1
kwk
kw
1
qn (w) − qn (w0 ) + q 3 (w) − q 3 (w0 ) .
n
n
Now
1
max (kwk , kw0 k) 1 |qn (w) − qn (w0 )|
4 0
0
qn (w) − qn (w0 ) = qn (w) qn (w0 ) ≤ min (q 3 (w) , q 3 (w0 )) w − w ≤ 3√3 w − w ,
n
n
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
29
where we have applied the estimate for |qn (w) − qn (w0 )| as established in Lemma IV.7 and also used kwk ≤ 1/2
3/2
and kw0 k ≤ 1/2 to simplify the above result. Further noticing t 7→ t2 / 1 − t2
is differentiable over t ∈ (0, 1),
we apply the mean value theorem and obtain
kwk2
0 k2 ksk3 + 2 ksk kw
w − w0 ≤ √4 w − w0 .
− 3
≤
sup
3
0
5/2
qn (w) qn (w ) s∈Γ,ksk≤r< 1
3
2
1 − ksk2
Combining the above estimates gives the claimed result.
Lemma IV.10. For any fixed x, consider the function
ζx (w) = x −
xn
w
qn (w)
defined over w ∈ Γ. Then, for all w, w0 ∈ Γ such that kwk ≤ r and kw0 k ≤ r for some constant r ∈ 0, 12 , it
holds that
kζx (w)ζx (w)∗ k ≤ 2n kxk2∞ ,
√ √
ζx (w)ζx (w)∗ − ζx (w0 )ζx (w0 )∗ ≤ 8 2 n kxk2 w − w0 .
∞
Proof: We have kwk2 /qn2 (w) ≤ 1/3 when kwk ≤ r < 1/2, hence it holds that
kζx (w)ζx (w)∗ k ≤ kζx (w)k2 ≤ 2 kxk2 + 2x2n
kwk
≤ 2n kxk2∞ .
2
qn (w)
For the second, we first estimate
w
w0
w0 w
ζ(w) − ζ(w0 ) = xn
−
≤ kxk∞ −
qn (w) qn (w0 ) qn (w) qn (w0 ) 0 1
1 1 0
≤ kxk∞
w−w + w −
qn (w)
qn (w) qn (w0 ) kw0 k
1
w − w0 ≤ kxk∞
+
q (w) min {qn3 (w), qn3 (w0 )}
n
2
4
w − w0 ≤ 4 kxk w − w0 .
≤ kxk∞ √ + √
∞
3 3 3
Thus, we have
ζx (w)ζx (w)∗ − ζx (w0 )ζx (w0 )∗ ≤ kζ(w)k ζ(w) − ζ(w0 ) + ζ(w) − ζ(w0 ) ζ(w0 )
√ √
≤ 8 2 n kxk2 w − w0 ,
∞
as desired.
Now, we are ready to prove all the Lipschitz propositions.
Proof: (of Proposition II.11) Let
Fk (w) = ḧµ (q(w)∗ xk ) t2xk (w) + ḣµ (q(w)∗ xk )
xk (n)
.
qn3 (w)
1
1 Pp
∗ 2
∗
∗
Then, kwk
2 w ∇ g(w)w =
k=1 Fk (w). Noticing that ḧµ (q(w) xk ) is bounded by 1/µ and ḣµ (q(w) xk ) is
p
bounded by 1, both in magnitude. Applying Lemma IV.6, Lemma IV.7 and Lemma IV.8, we can see Fk (w) is
LkS -Lipschitz with
√
kxk k
1 √
2 4 n
k
3/2
LS = 4n kxk k
kxk k + 8 n kxk k
+ 4n kxk k∞
µ2
µ
rS
√
√
2 n
3
+ (2 n)3 kxk k∞
kxk k +
sup
√ 2n−1 (1 − a2 )5/2 kxk k∞
µ
rS <a<
2n
√
3/2
2
16n
8 n
48n
=
kxk k3 +
kxk k2 +
kxk k kxk k∞ + 96n5/2 kxk k∞ .
2
µ
µrS
µ
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
Thus,
1
∗ 2
kwk2 w ∇ g(w)w
30
is LS -Lipschitz with
p
1X k
48n5/2
16n3
8n3/2
LS ≤
kXk2∞ +
LS ≤ 2 kXk3∞ +
kXk2∞ + 96n5/2 kXk∞ ,
p
µ
µrS
µ
k=1
as desired.
Proof: (of Proposition II.12) We have
∗
p 1X
w
w0∗
0 ∗
0 ∗
0 ∇g(w)
−
∇g(w
)
≤
ḣ
(q(w)
x
)
t
(w)
−
ḣ
q(w
)
x
t
w
µ
µ
k xk
k xk
p
kwk
kw0 k
k=1
where ḣµ (t) = tanh(t/µ) is bounded by one in magnitude, and txk (w) and tx0k (w) is defined as in Lemma IV.8.
By Lemma IV.6, Lemma IV.7 and Lemma IV.8, we know that ḣµ (q(w)∗ xk ) txk (w) is Lk -Lipschitz with constant
Lk =
2 kxk k
4n
+ 8n3/2 kxk k∞ +
kxk k2 .
rg
µ
Therefore, we have
∗
p 1X
w
2 kxk k
w∗
4n
2 0 3/2
0
∇g(w)
−
∇g(w
)
≤
+
8n
kx
k
+
kx
k
w
−
w
k
k
∞
p
kwk
kwk
rg
µ
k=1
√
2 n
4n2
≤
kXk∞ + 8n3/2 kXk∞ +
kXk2∞ w − w0 ,
rg
µ
as desired.
Proof: (of Proposition II.13) Let
Fk (w) = ḧµ (q(w)∗ xk )ζk (w)ζk (w)∗ − ḣµ (q(w)∗ xk ) Φk (w)
(n)
(n)
1 Pp
∗
2
with ζk (w) = xk − qxnk(w)
w and Φk (w) = qxnk(w)
I + qxnn,k
k=1 Fk (w). Using Lemma
(w) ww . Then, ∇ g(w) = p
IV.6, Lemma IV.7, Lemma IV.9 and Lemma IV.10, and the facts that ḧµ (t) is bounded by 1/µ and that ḧµ (t) is
bounded by 1 in magnitude, we can see Fk (w) is LkN -Lipschitz continuous with
√
√
√ √
1
2 n
2 n
2
2
k
kxk k × 2 kxk k∞
LN = × 8 2 n kxk k∞ + 2 kxk k × 2n kxk k∞ + 4 kxk k∞ +
µ
µ
µ
√ √
√
4n3/2
4 n
8 2 n
2
≤
kxk k kxk k∞ +
kxk k kxk k∞ +
kxk k2∞ + 4 kxk k∞ .
µ2
µ
µ
Thus, we have
√ √
p
4n2
4n
8 2 n
1X k
3
2
LN ≤ 2 kXk∞ +
kXk∞ +
kXk2∞ + 8 kXk∞ ,
LN ≤
p
µ
µ
µ
k=1
as desired.
B. Proofs of Theorem II.1
To avoid clutter of notations, in this subsection we write X to mean X0 ; similarly xk for (x0 )k , the k -th column
of X0 . The function g (w) means g (w; X0 ). Before proving Theorem II.1, we record one useful lemma.
Lemma IV.11.
1), consider
the random matrix X ∈ Rn1 ×n2 with X ∼i.i.d. BG (θ). Define the
n For any θ ∈ (0,
o
p
.
event E∞ = 1 ≤ kXk∞ ≤ 4 log (np) . It holds that
c
P [E∞
] ≤ θ (np)−7 + exp (−0.3θnp) .
Proof: See page 39 under Section B.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
For convenience, we define three regions for the range of w:
µ
.
.
,
R2 = w
R1 = w kwk ≤ √
4 2
( )
r
1
4n − 1
.
R3 = w √ ≤ kwk ≤
.
20 5
4n
31
µ
√ ≤ kwk ≤ 1√
,
4 2
20 5
Proof: (of Theorem II.1) We will focus on deriving the qualitative result and hence be sloppy about constants.
All indexed capital C or small c are numerical constants.
a) Strong convexity in region R1 .: Proposition II.7 shows that for any w ∈ R1 , E ∇2 g(w) cµ1 θ I . For any
√ √ n
ε ∈ (0, µ/ 4 2 ), R1 has an ε-net N1 of size at most (3µ/ 4 2ε ) . On E∞ , ∇2 g is
. C2 n 2
L1 =
log3/2 (np)
µ2
c1 θ
3µL1 ,
Lipschitz by Proposition II.13. Set ε =
so
#N1 ≤ exp 2n log
Let E1 denote the event
C3 n log(np)
θ
E1 =
On E1 ∩ E∞ ,
.
c1 θ
max ∇2 g(w) − E ∇2 g(w) ≤
w∈N1
3µ
.
2c1 θ
sup √ ∇2 g(w) − E ∇2 g(w) ≤
,
3µ
kwk≤µ/(4 2)
and so on E1 ∩ E∞ , (II.4) holds for any constant c? ≤ c1 /3. Setting t = c1 θ/3µ in Proposition II.10, we obtain that
for any fixed w,
2
2
c1 θ
c4 pθ2
P ∇ g(w) − E ∇ g(w) ≥
≤ 4n exp − 2
.
3µ
n
Taking a union bound, we obtain that
P [E1c ]
c4 pθ2
≤ 4n exp − 2 + C5 n log(n) + C5 n log log(p) .
n
b) Large gradient in region R2 .: Similarly, for the gradient quantity, for w ∈ R2 , Proposition II.6 shows that
∗
w ∇g(w)
E
≥ c6 θ.
kwk
Moreover, on E∞ ,
w∗ ∇g(w)
kwk
is
. C7 n 2
L2 =
log(np)
µ
Lipschitz by Proposition II.12. For any ε <
so
1√
,
20 5
the set R2 has an ε-net N2 of size at most
C8 n2 log(np)
#N2 ≤ exp n log
.
θµ
Let E2 denote the event
E2 =
On E2 ∩ E∞ ,
3√
20ε 5
n
. Set ε =
c6 θ
3L2 ,
∗
∗
w ∇g(w)
w
∇g(w)
c
θ
6
≤
max −E
.
w∈N2 kwk
kwk
3
∗
∗
w ∇g(w)
w
∇g(w)
≤ 2c6 θ ,
sup −E
kwk
kwk
3
w∈R2
(IV.8)
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
32
and so on E2 ∩ E∞ , (II.5) holds for any constant c? ≤ c6 /3. Setting t = c6 θ/3 in Proposition II.9, we obtain that
for any fixed w ∈ R2 ,
∗
∗
2
w ∇g(w)
w
∇g(w)
c
pθ
9
P ,
−E
≤ 2 exp −
kwk
kwk n
and so
P [E2c ]
c9 pθ2
≤ 2 exp −
+ n log
n
C8 n2 log(np)
θµ
.
(IV.9)
c) Existence of negative curvature direction in R3 .: Finally, for any w ∈ R3 , Proposition II.5 shows that
#
"
w∗ ∇2 g(w)w
≤ −c9 θ.
E
kwk2
On E∞ ,
w∗ ∇2 g(w)w
kwk2
is
C10 n3
log3/2 (np)
µ2
q
n
Lipschitz by Proposition II.11. As above, for any ε ≤ 4n−1
4n , R3 has an ε-net N3 of size at most (3/ε) . Set
ε = c9 θ/3L3 . Then
!!
C11 n3 log3/2 (np)
.
#N3 ≤ exp n log
θµ2
L3 =
Let E3 denote the event
"
#
)
w∗ ∇2 g(w)w
w∗ ∇2 g(w)w c9 θ
max −E
≤
w∈N3 3
kwk2
kwk2
(
E3 =
On E3 ∩ E∞ ,
"
#
w∗ ∇2 g(w)w
∗ ∇2 g(w)w w
2c9 θ
sup ,
−E
≤
2
2
3
kwk
kwk
w∈R3 and (II.6) holds with any constant c? < c9 /3. Setting t = c9 θ/3 in Proposition II.8 and taking a union bound, we
obtain
!!
2 θ2
3 log3/2 (np)
c
pµ
C
n
12
11
P [E3c ] ≤ 4 exp −
+ n log
.
n2
θµ2
d) The unique local minimizer located near 0.: Let
Eg be the event
√ that the bounds (II.4)-(II.6) hold. On
Eg , the function g is cµ? θ -strongly convex over R1 = w kwk ≤ µ/ 4 2 . This implies that f has at most one
local minimum on R1 . It also implies that for any w ∈ R1 ,
g(w) ≥ g(0) + h∇g(0), wi +
cθ
c? θ
kwk2 ≥ g(0) − kwk k∇g(0)k +
kwk2 .
2µ
2µ
So, if g(w) ≤ g(0), we necessarily have
kwk ≤
2µ
k∇g(0)k .
c? θ
Suppose that
c? θ
.
(IV.10)
32
Then g(w) ≤ g(0) implies that kwk ≤ µ/16. By Wierstrass’s theorem, g(w) has at least one minimizer w? over
the compact set S = {w | kwk ≤ µ/10}. By the above reasoning, kw? k ≤ µ/16, and hence w? does not lie on
the boundary of S . This implies that w? is a local minimizer of g . Moreover, as above,
k∇g(0)k ≤
kw? k ≤
2µ
k∇g(0)k .
c? θ
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
33
We now use the vector Bernstein inequality to show that with our choice of p, (IV.10) is satisifed with high
probability. Notice that
p
1X
∇g(0) =
ḣµ (xi (n))xi ,
p
i=1
and ḣµ is bounded by one in magnitude, so for any integer m ≥ 2,
m i
h
E ḣµ (xi (n))xi ≤ E [kxi km ] ≤ EZ∼χ(n) [Z m ] ≤ m!nm/2 ,
where we have applied the moment estimate for the
χ (n) distribution shown in Lemma A.7. Applying the vector
√
Bernstein inequality in Corollary A.10 with R = n and σ 2 = 2n, we obtain
pt2
√
P [k∇g(0)k ≥ t] ≤ 2(n + 1) exp −
4n + 2 nt
for all t > 0. Using this inequality, it is not difficult to show that there exist constants C13 , C14 > 0 such that when
p ≥ C13 n log n, with probability at least 1 − 4np−10 ,
s
n log p
.
(IV.11)
k∇g(0)k ≤ C3
p
When logp p ≥
the proof.
C14 n
θ2 ,
for appropriately large C14 , (IV.11) implies (IV.10). Summing up failure probabilities completes
C. Proofs for Section II-C and Theorem II.3
Proof: (of Lemma II.14) By the generative model,
−1/2
−1/2
1
1
∗
∗ ∗
Y =
YY
Y =
A0 X0 X0 A0
A0 X0 .
pθ
pθ
−1/2
1
A0 with (A0 A∗0 )−1/2 A0 = U V ∗ . By
Since E [X0 X0∗ / (pθ)] = I , we will compare pθ
A0 X0 X0∗ A∗0
Lemma B.2, we have
−1/2
1
A0 − (A0 A∗0 )−1/2 A0 A0 X0 X0∗ A∗0
pθ
−1/2
1
≤ kA0 k A0 X0 X0∗ A∗0
− (A0 A∗0 )−1/2 pθ
1
2 kA0 k3 1
4
∗
∗
≤ kA0 k 4
X0 X0 − I = 2κ (A0 ) X0 X0 − I pθ
σmin (A0 ) pθ
provided
2 (A )
1
σmin
1
1
0
∗
∗
kA0 k X0 X0 − I ≤
⇐⇒ X0 X0 − I ≤ 2
.
pθ
2
pθ
2κ (A0 )
q
1
p
On the other hand, by Lemma B.3, when p ≥ C1 n2 log n for some large constant C1 , pθ
X0 X0∗ − I ≤ 10 θn log
p
2
with probability at least 1 − p−8 . Thus, when p ≥ C2 κ4 (A0 ) θn2 log(nθκ (A0 )),
s
−1/2
1
θn log p
∗ ∗
∗ −1/2
4
A0 X0 X0 A0
A0 − (A0 A0 )
A0 ≤ 20κ (A0 )
,
pθ
p
as desired.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
34
Proof: (of Lemma II.15) To avoid clutter in notation, we write X to mean X0 , and xk to mean (x0 )k
e .
e
in
let Y = X0 + ΞX0 . Note the Jacobian matrix for the mapping q (w) is ∇w q (w) =
this proof.
q We also
I, −w/ 1 − kwk2 . Hence for any vector z ∈ Rn and all w ∈ Γ,
k∇w q (w) zk ≤
√
n − 1 kzk∞ + q
kwk
1 − kwk2
√
kzk∞ ≤ 3 n kzk∞ .
Now we have
∇w g w; Ye − ∇w g (w; X)
p
p
1 X 1X
∗
∗
∗
e
e
ḣµ q (w) xk + q (w) Ξxk ∇w q (w) xk + Ξxk −
ḣµ (q (w) xk ) ∇w q (w) xk = p
p
k=1
k=1
p
1 X
e k ∇w q (w) xk + Ξx
e k − xk ḣµ q ∗ (w) xk + q ∗ (w) Ξx
≤ p
k=1
p
1 X h i
∗
∗
∗
e k − ḣµ (q (w) xk ) ∇w q (w) xk ḣµ q (w) xk + q (w) Ξx
+
p
k=1
e
2
≤ Ξ
max ḣµ (t) 3n kXk∞ + Lḣµ 3n kXk∞ ,
t
e
where Lḣµ denotes the Lipschitz constant for ḣµ (·). Similarly, suppose Ξ
≤
1
2n ,
and also notice that
∗ √ 3/2
I
kwk2
ww
1
1
≤
+
+
qn (w) q 3 (w) qn (w) q 3 (w) = q 3 (w) ≤ 2 2n ,
n
n
n
we obtain that
2
∇w g w; Ye − ∇2w g (w; X)
p
1 X h
i
≤ ḧ (q ∗ (w) yek ) ∇w q (w) yek yek∗ (∇w q (w))∗ − ḧ (q ∗ (w) xk ) ∇w q (w) xk x∗k (∇w q (w))∗ p
k=1
p 1 X
∗
∗
I
ww
I
ww
ḣ (q ∗ (w) yek )
+ 3
yek (n) − ḣ (q ∗ (w) xk )
+ 3
xk (n) +
p
qn (w)
qn
qn (w)
qn
k=1
2
2 e
2 e
3 e
3/2
2
3/2
+
10n
kXk
≤ 45
Ξ
+
max
ḧ
(t)
18n
kXk
Ξ
L
n
kXk
µ
∞
∞
∞ Ξ
2 ḧµ
t
√
√
e
e
2
+ 3 2Lḣµ n2 Ξ
kXk∞ + max ḣ (t) 2 2n2 Ξ
kXk∞ ,
t
where Lḧµ denotes the Lipschitz constant for ḧµ (·). Since
max ḣµ (t) ≤ 1,
t
max ḧµ (t) ≤
t
1
,
µ
Lhµ ≤ 1,
Lḣµ ≤
1
,
µ
Lḧµ ≤
2
,
µ2
p
and by Lemma IV.11, kXk∞ ≤ 4 log (np) with probability at least 1 − θ (np)−7 − exp (−0.3θnp), we obtain
n
e
e
∇w g w; Y − ∇w g (w; X) ≤ C1 log (np) Ξ
,
µ
3/2 n2
n
2
e
2
3/2
,
log
(np)
∇w g w; Ye − ∇w g (w; X) ≤ C2 max
Ξ
µ2 µ
for numerical constants C1 , C2 > 0.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
35
Proof: (of Theorem II.3) Assume the constant c? as defined in Theorem II.1. By Lemma II.14, when
4 5
n n
C1
8
4 κ (A0 ) n
,
p ≥ 2 max
κ (A0 ) log
,
c? θ
µ4 µ2
µθ
the magnitude of the perturbation is bounded as
−1
3/2 2 n
n
e
3/2
,
log (np)
,
Ξ ≤ C2 c? θ max
µ2 µ
where C2 can be made arbitrarily small by making C1 large. Combining this result with Lemma II.15, we obtain
that for all w ∈ Γ,
c θ
?
e 0 − ∇w g (w; X)
∇w g w; X0 + ΞX
≤
2
c θ
2
2
e 0 − ∇w g (w; X) ≤ ? ,
∇w g w; X0 + ΞX
2
with probability at least 1 − p−8 − θ (np)−7 − exp (−0.3θnp). In view of (II.11) in Theorem II.1, we have
∗ g w; X + ΞX
e
e 0 w
w
w∗ g w; X0 + ΞX
∗
0
0 w
w g (w; X0 ) w
w∗ g (w; X0 ) w
=
+
−
kwk2
kwk2
kwk2
kwk2
1
e 0 − ∇2w g (w; X)
≤ −c? θ + ∇2w g w; X0 + ΞX
≤ − c? θ.
2
By similar arguments, we obtain (II.9) through (II.11) in Theorem II.3.
To show the unique
local minimizer
over Γ is near 0, we note that (recall the last part of proof of Theorem II.1
e 0 being c? θ strongly convex near 0 implies that
in Section IV-B) g w; X0 + ΞX
2µ
4µ e
kw? k ≤
∇g 0; X0 + ΞX0 .
c? θ
The above perturbation analysis implies there exists C3 > 0 such that when
4 5
C3
n n
8
4 κ (A0 ) n
p ≥ 2 max
,
κ (A0 ) log
,
c? θ
µ4 µ2
µθ
it holds that
c? θ
e 0 − ∇w g (0; X)
,
∇w g 0; X0 + ΞX
≤
400
which in turn implies
kw? k ≤
4µ c? θ
µ
µ
µ
4µ
k∇g (0; X0 )k +
≤ +
< ,
c? θ
c? θ 400
8 100
7
where we have recall the result that c2µ
k∇g (0; X0 )k ≤ µ/16 from proof of Theorem II.1. A simple union bound
?θ
with careful bookkeeping gives the success probability.
A PPENDIX A
T ECHNICAL T OOLS AND BASIC FACTS U SED IN P ROOFS
In this section, we summarize some basic calculations that are useful throughout, and also record major technical
tools we use in proofs.
Lemma A.1 (Derivates and Lipschitz Properties of hµ (z)). For the sparsity surrogate
hµ (z) = µ log cosh (z/µ) ,
the first two derivatives are
ḣµ (z) = tanh(z/µ), ḧµ (z) = 1 − tanh2 (z/µ) /µ.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
36
Also, for any z > 0, we have
(1 − exp(−2z/µ)/2 ≤ tanh(z/µ) ≤ 1 − exp(−2z/µ),
exp(−2z/µ) ≤ 1 − tanh2 (z/µ) ≤ 4 exp(−2z/µ).
Moreover, for any z, z 0 ∈ R, we have
|ḣµ (z) − ḣµ (z 0 )| ≤ |z − z 0 |/µ, |ḧµ (z) − ḧµ (z 0 )| ≤ 2|z − z 0 |/µ2 .
Lemma A.2 (Harris’ Inequality, [110], see also Theorem 2.15 of [111]). Let X1 , . . . , Xn be independent, real-valued
random variables and f, g : Rn 7→ R be nonincreasing (nondecreasing) w.r.t. any one variable while fixing the
others. Define a random vector X = (X1 , · · · , Xn ) ∈ Rn , then we have
E [f (X) g (X)] ≥ E [f (X)] E [g (X)] .
Similarly, if f is nondecreasing (nonincreasing) and g is nonincreasing (nondecreasing) coordinatewise in the above
sense, we have
E [f (X) g (X)] ≤ E [f (X)] E [g (X)] .
(x,s)
Lemma A.3 (Differentiation under the Integral Sign). Consider a function F : Rn × R 7→ R such that ∂F∂s
is
n and some t > 0. For any probability
well defined and measurable over U × (0, t0 ) for some open subset
U
⊂
R
0
R t R (x,s) measure µ on Rn and any t ∈ (0, t0 ) such that 0 U ∂F∂s
µ (dx) ds < ∞, it holds that
Z
Z
∂F (x, t)
d
∂F (x, t)
d
F (x, t) µ (dx) =
µ (dx) , or Ex [F (x, t) 1U ] = Ex
1U .
dt U
∂t
dt
∂t
U
Proof: See proof of Lemma A.4 in the technical report [2].
Lemma A.4 (Gaussian Tail Estimates). Let X ∼ N (0, 1) and Φ (x) be CDF of X . For any x ≥ 0, we have the
.
following estimates for Φc (x) = 1 − Φ (x):
1 exp −x2 /2
1
3 exp −x2 /2
1
1
c
√
√
−
−
+
≤ Φ (x) ≤
, (Type I)
x x3
x x3 x5
2π
2π
x exp −x2 /2
1 exp −x2 /2
c
√
√
≤
Φ
(x)
≤
, (Type II)
x2 + 1
x
2π 2π
√
exp −x2 /2
p
x2 + 4 − x exp −x2 /2
c
2
√
√
2+x −x
≤ Φ (x) ≤
(Type III).
2
2π
2π
Proof: See proof of Lemma A.5 in the technical report [2].
Lemma A.5 (Moments of the Gaussian RV). If X ∼ N 0, σ 2 , then it holds for all integer p ≥ 1 that
E [|X|p ] ≤ σ p (p − 1)!!.
Lemma A.6 (Moments of the χ2 RV). If X ∼ χ2 (n), then it holds for all integer p ≥ 1 that
p
p
Y
+ n/2)
=
(n + 2k − 2) ≤ p! (2n)p /2.
Γ (n/2)
p Γ (p
E [X ] = 2
k=1
Lemma A.7 (Moments of the χ RV). If X ∼ χ (n), then it holds for all integer p ≥ 1 that
E [X p ] = 2p/2
Γ (p/2 + n/2)
≤ p!np/2 .
Γ (n/2)
Lemma A.8 (Moment-Control Bernstein’s Inequality for Scalar RVs, Theorem 2.10 of [112]). Let X1 , . . . , Xp be
i.i.d. real-valued random variables. Suppose that there exist some positive numbers R and σ 2 such that
E [|Xk |m ] ≤ m!σ 2 Rm−2 /2, for all integers m ≥ 2.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
.
Let S =
1
p
Pp
k=1 Xk ,
37
then for all t > 0, it holds that
pt2
P [|S − E [S]| ≥ t] ≤ 2 exp − 2
2σ + 2Rt
.
Lemma A.9 (Moment-Control Bernstein’s Inequality for Matrix RVs, Theorem 6.2 of [113]). Let X1 , . . . , Xp ∈ Rd×d
be i.i.d. random, symmetric matrices. Suppose there exist some positive number R and σ 2 such that
E [Xkm ] m!σ 2 Rm−2 /2 · I and − E [Xkm ] m!σ 2 Rm−2 /2 · I , for all integers m ≥ 2.
. P
Let S = p1 pk=1 Xk , then for all t > 0, it holds that
pt2
P [kS − E [S]k ≥ t] ≤ 2d exp − 2
.
2σ + 2Rt
Proof: See proof of Lemma A.10 in the technical report [2].
Corollary A.10 (Moment-Control Bernstein’s Inequality for Vector RVs). Let x1 , . . . , xp ∈ Rd be i.i.d. random
vectors. Suppose there exist some positive number R and σ 2 such that
E [kxk km ] ≤ m!σ 2 Rm−2 /2,
Let s =
1
p
Pp
k=1 xk ,
for all integers m ≥ 2.
then for any t > 0, it holds that
pt2
P [ks − E [s]k ≥ t] ≤ 2(d + 1) exp − 2
2σ + 2Rt
.
Proof: See proof of Lemma A.11 in the technical report [2].
Lemma A.11 (Integral Form of Taylor’s Theorem). Let f (x) : Rn 7→ R be a twice continuously differentiable
function, then for any direction y ∈ Rn , we have
Z 1
f (x + ty) = f (x) + t
h∇f (x + sty), yi ds,
0
Z 1
2
f (x + ty) = f (x) + t h∇f (x), yi + t
(1 − s) ∇2 f (x + sty)y, y ds.
0
A PPENDIX B
AUXILLARY R ESULTS FOR P ROOFS
2
2
c
Lemma
R ∞ B.1. Let2 X ∼ N (0, σX ) and Y ∼ N (0, σY ) be independent random variables and Φ (t) =
1
√
exp −x /2 dx be the complementary cumulative distribution function of the standard normal. For
2π t
any a > 0, we have
σX
E [X 1X>0 ] = √ ,
(B.1)
2π
2 2 a σX
σX
2
Φc (aσX ) ,
(B.2)
E [exp (−aX) X 1X>0 ] = √ − aσX
exp
2
2π
2 2 a σX
E [exp (−aX) 1X>0 ] = exp
Φc (aσX ) ,
(B.3)
2
2 2
q
a σX + a2 σY2
2
2
2 + σ2
E exp (−a(X + Y )) X 2 1X+Y >0 = σX
1 + a2 σX
exp
Φc a σX
Y
2
aσ 4
−√ q X
,
(B.4)
2 + σ2
2π σX
Y
2 2
q
a σX + a2 σY2
2 2 2
c
2
2
E [exp (−a(X + Y )) XY 1X+Y >0 ] = a σX σY exp
Φ a σX + σY
2
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
38
2 σ2
aσX
q Y
,
2 + σ2
2π σX
Y
2
2
E [tanh (aX) X] = aσX E 1 − tanh (aX) ,
2
E [tanh (a(X + Y )) X] = aσX
E 1 − tanh2 (a(X + Y )) .
−√
(B.5)
(B.6)
(B.7)
Proof: Equalities (B.1), (B.2), (B.3), (B.4) and (B.5) can be obtained by direct integrations. Equalities (B.6)
and (B.7) can be derived using integration by part.
P∞
1
k
k k
Proof: (of Lemma IV.1) Indeed (1+βt)
2 =
k=0 (−1) (k + 1)β t , as
∞
X
k
k k
(−1) (k + 1)β t =
k=0
∞
X
k
(−βt) +
k=0
∞
X
1
−βt
1
+
=
.
2
1 + βt (1 + βt)
(1 + βt)2
k(−βt)k =
k=0
The magnitude of the coefficient vector is
kbk`1 =
∞
X
k
β (1 + k) =
k=0
Observing that
1
(1+βt)2
>
k
β +
k=0
1
(1+t)2
Z
kp − f kL1 [0,1] =
∞
X
∞
X
1
β
1
+
=
= T.
2
1 − β (1 − β)
(1 − β)2
kβ k =
k=0
for t ∈ [0, 1] when 0 < β < 1, we obtain
1
1
Z
|p(t) − f (t)| dt =
0
0
1
1
1−β
1
−
dt =
≤ √ .
2
2
(1 + βt)
(1 + t)
2(1 + β)
2 T
Moreover, we have
t(1 − β) (2 + t(1 + β))
1
≤1−β = √ .
2
2
(1 + t) (1 + βt)
t∈[0,1]
T
kf − pkL∞ [0,1] = max p(t) − f (t) = max
t∈[0,1]
Finally, notice that
∞
X
k=0
∞
∞ X
X
(−β)k
β 2i+1
bk
β 2i
=
=
−
(1 + k)3
(1 + k)2
(1 + 2i)2 (2i + 2)2
i=0
k=0
=
∞
X
β 2i
i=0
(2i + 2)2 − β(2i + 1)2
> 0,
(2i + 2)2 (2i + 1)2
where at the second equality we have grouped consecutive even-odd pair of summands. In addition, we have
n
X
k=0
n
n
n
k=0
k=0
k=1
X |bk |
X βk
X
1
1
bk
≤
=
≤
1
+
=2−
,
(1 + k)3
(1 + k)3
(1 + k)2
(1 + k)k
n+1
which converges to 2 when n → ∞, completing the proof.
Proof: (of Lemma IV.4) The first inequality is obviously true for v = 0. When v 6= 0, we have
∗
m
E [|v z| ] =
n
X
θ` (1 − θ)n−`
`=0
≤
n
X
`=0
X
EZ∼N (0,kvJ k2 ) [|Z|m ]
J ∈([n]
` )
θ` (1 − θ)n−`
X
EZ∼N (0,kvk2 ) [|Z|m ]
J ∈([n]
` )
m
= EZ∼N (0,kvk2 ) [|Z| ]
n
X
`=0
= EZ∼N (0,kvk2 ) [|Z|m ] ,
`
n−`
θ (1 − θ)
n
`
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
39
where the second line relies on the fact kvJ k ≤ kvk and that for a fixed order, central moment of Gaussian is
monotonically increasing w.r.t. its variance. Similarly, to see the second inequality,
E [kzkm ] =
n
X
θ` (1 − θ)n−`
`=0
X
m E zJ0 J ∈([n]
` )
n
0 m X
m n−` n
`
≤E z
θ (1 − θ)
= E z 0 ,
`
`=0
as desired.
Proof: (of Lemma IV.11) Consider one component of X , i.e., Xij = Bij Vij for i ∈ [n] and j ∈ [p], where
Bij ∼ Ber (θ)) and Vij ∼ N (0, 1). We have
i
h
i
h
p
p
P |Xij | > 4 log (np) ≤ θP |Vij | > 4 log(np) ≤ θ exp (−8 log(np)) = θ(np)−8 .
And also
P [|Xij | < 1] = 1 − θ + θP [|Vij | < 1] ≤ 1 − 0.3θ.
Applying a union bound as
h
i
p
P kXk∞ ≤ 1 or kXk∞ ≥ 4 log (np) ≤ (1 − 0.3θ)np + npθ (np)−8 ≤ exp (−0.3θnp) + θ (np)−7 ,
we complete the proof.
Lemma B.2. Suppose A 0. Then for any symmetric perturbation matrix ∆ with k∆k ≤
σmin (A)
,
2
2 kAk1/2 k∆k
−1/2
.
− A−1/2 ≤
(A + ∆)
2 (A)
σmin
it holds that
(B.8)
Proof: See proof of Lemma B.2 in the technical report [2].
Lemma B.3. There exists a positive constant C such that for any θ ∈ (0, 1/2) and n2 > Cn21 log n1 , the random
matrix X ∈ Rn1 ×n2 with X ∼i.i.d. BG (θ) obeys
r
1
θn1 log n2
∗
(B.9)
n2 θ XX − I ≤ 10
n2
with probability at least 1 − n−8
2 .
1
Proof: Observe that E θ xk x∗k = I for any column xk of X and so n12 θ XX ∗ can be considered as a normalize
sum of independent random matrices. Moreover, for any integer m ≥ 2,
m i
1
1 h
E
xk x∗k
= m E kxk k2m−2 xk x∗k .
θ
θ
h
i
h
i
h
i
Now E kxk k2m−2 xk x∗k is a diagonal matrix (as E kxk k2 xk (i) xk (j) = −E kxk k2 xk (i) xk (j) for any i 6= j
h
i
h
i
by symmetry of the distribution) in the form E kxk k2m−2 xk x∗k = E kxk2m−2 x(1)2 I for x ∼i.i.d. BG (θ) with
x ∈ Rn1 . Let t2 (x) = kxk2 − x(1)2 . Then if m = 2,
h
i
E kxk2 x(1)2 = E x(1)4 + E t2 (x) E x(1)2
2
= E x(1)4 + (n1 − 1) E x(1)2
= 3θ + (n1 − 1) θ2 ≤ 3n1 θ,
where for the last simplification we use the assumption θ ≤ 1/2. For m ≥ 3,
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
h
2m−2
E kxk
2
x(1)
i
40
m−1
X
h
i m−1
i h
i
X m − 1 h
m−1
2k
2m−2k
=
E t (x) x(1)
=
E t2k (x) E x(1)2m−2k
k
k
k=0
k=0
m−1
h i
h
i
X m − 1
≤
EZ∼χ2 (n1 −1) Z k θEW ∼N (0,1) W 2m−2k
k
k=0
m−1
X m − 1 k!
(2n1 − 2)k (2m − 2k)!!
≤θ
2
k
k=0
m−1
X m − 1 m m!
≤ θ2
(n1 − 1)k
2
k
k=0
m! m−1 m−1
n
2
,
≤
2 1
where we have used the moment estimates for Gaussian and χ2 random variables from Lemma A.5 and Lemma A.6,
and also θ ≤ 1/2. Taking σ 2 = 3n1 θ and R = 2n1 , and invoking the matrix Bernstein in Lemma A.9, we obtain
#
"
p
1 X
n2 t2
∗
xk xk − I > t ≤ exp −
+ 2 log n1
(B.10)
E pθ
6n1 θ + 4n1 t
k=1
p
for any t ≥ 0. Taking t = 10 θn1 log (n2 ) /n2 gives the claimed result.
ACKNOWLEDGMENT
We thank Dr. Boaz Barak for pointing out an inaccurate comment made on overcomplete dictionary learning
using SOS. We thank Cun Mu and Henry Kuo of Columbia University for discussions related to this project. JS
thanks the Wei Family Private Foundation for their generous support. This work was partially supported by grants
ONR N00014-13-1-0492, NSF 1343282, NSF CCF 1527809, NSF IIS 1546411, and funding from the Moore and
Sloan Foundations.
R EFERENCES
[1] J. Sun, Q. Qu, and J. Wright, “Complete dictionary recovery using nonconvex optimization,” in Proceedings of the 32nd International
Conference on Machine Learning (ICML-15), 2015, pp. 2351–2360.
[2] ——, “Complete dictionary recovery over the sphere,” arXiv preprint arXiv:1504.06785, 2015.
[3] ——, “Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method,” arXiv preprint arXiv:1511.04777.
[4] R. A. DeVore, “Nonlinear approximation,” Acta numerica, vol. 7, pp. 51–150, 1998.
[5] V. N. Temlyakov, “Nonlinear methods of approximation,” Foundations of Computational Mathematics, vol. 3, no. 1, pp. 33–107, 2003.
[6] R. A. DeVore, “Nonlinear approximation and its applications,” in Multiscale, Nonlinear and Adaptive Approximation. Springer, 2009,
pp. 169–201.
[7] E. J. Candès, “New ties between computational harmonic analysis and approximation theory,” Approximation Theory X, pp. 87–153,
2002.
[8] J. Ma and G. Plonka, “A review of curvelets and recent applications,” IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 118–133,
2010.
[9] D. L. Donoho, M. Vetterli, R. A. DeVore, and I. Daubechies, “Data compression and harmonic analysis,” Information Theory, IEEE
Transactions on, vol. 44, no. 6, pp. 2435–2476, 1998.
[10] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,”
Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[11] ——, “Sparse coding with an overcomplete basis set: A strategy employed by v1?” Vision research, vol. 37, no. 23, pp. 3311–3325,
1997.
[12] M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer, 2010.
[13] J. Mairal, F. Bach, and J. Ponce, “Sparse modeling for image and vision processing,” Foundations and Trends in Computer Graphics
and Vision, vol. 8, no. 2-3, pp. 85–283, 2014. [Online]. Available: http://dx.doi.org/10.1561/0600000058
[14] A. Maurer and M. Pontil, “K-dimensional coding schemes in hilbert spaces,” Information Theory, IEEE Transactions on, vol. 56,
no. 11, pp. 5839–5846, 2010.
[15] D. Vainsencher, S. Mannor, and A. M. Bruckstein, “The sample complexity of dictionary learning,” Journal of Machine Learning
Research, vol. 12, no. 23, pp. 3259–3281, Nov. 2011.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
41
[16] N. Mehta and A. G. Gray, “Sparsity-based generalization bounds for predictive sparse coding,” Proceedings of the 30th International
Conference on Machine Learning (ICML-13), vol. 28, no. 1, pp. 36–44, 2013.
[17] R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert, “Sample complexity of dictionary learning and other matrix
factorizations,” arXiv preprint arXiv:1312.3790, 2013.
[18] A. S. Bandeira, C. Kennedy, and A. Singer, “Approximating the little grothendieck problem over the orthogonal and unitary groups,”
arXiv preprint arXiv:1308.5207, 2013.
[19] J. Briët and O. Regev, “Tight hardness of the non-commutative grothendieck problem,” arXiv preprint arXiv:1412.4413, 2014.
[20] R. Gribonval and K. Schnass, “Dictionary identification - sparse matrix-factorization via `1 -minimization,” IEEE Transactions on
Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.
[21] Q. Geng and J. Wright, “On the local correctness of `1 -minimization for dictionary learning,” Submitted to IEEE Transactions on
Information Theory, 2011, preprint: http://www.columbia.edu/∼jw2966.
[22] A. Ahmed, B. Recht, and J. Romberg, “Blind deconvolution using convex programming,” Information Theory, IEEE Transactions on,
vol. 60, no. 3, pp. 1711–1732, 2014.
[23] S. Choudhary and U. Mitra, “Identifiability scaling laws in bilinear inverse problems,” arXiv preprint arXiv:1402.2637, 2014.
[24] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional geometry, with implications for modern data
analysis and signal processing,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences,
vol. 367, no. 1906, pp. 4273–4293, 2009.
[25] S. Oymak and B. Hassibi, “New null space results and recovery thresholds for matrix rank minimization,” arXiv preprint arXiv:1011.6326,
2010.
[26] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11,
2011.
[27] D. L. Donoho, M. Gavish, and A. Montanari, “The phase transition of matrix recovery from gaussian measurements matches the
minimax mse of matrix denoising,” Proceedings of the National Academy of Sciences, vol. 110, no. 21, pp. 8405–8410, 2013.
[28] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex demixing, with applications,” Foundations of Computational
Mathematics, vol. 14, no. 3, pp. 503–567, 2014.
[29] C. Mu, B. Huang, J. Wright, and D. Goldfarb, “Square deal: Lower bounds and improved relaxations for tensor recovery,” arXiv
preprint arXiv:1307.5870, 2013.
[30] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Foundations of
Computational mathematics, vol. 12, no. 6, pp. 805–849, 2012.
[31] E. J. Candes, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery from magnitude measurements via convex
programming,” Communications on Pure and Applied Mathematics, vol. 66, no. 8, pp. 1241–1274, 2013.
[32] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp, “Living on the edge: Phase transitions in convex programs with random data,”
Information and Inference, p. iau005, 2014.
[33] E. J. Candès, “Mathematics of sparsity (and few other things),” in Proceedings of the International Congress of Mathematicians, Seoul,
South Korea, 2014.
[34] C. Bao, J.-F. Cai, and H. Ji, “Fast sparsity-based orthogonal dictionary learning for image restoration,” in Computer Vision (ICCV),
2013 IEEE International Conference on. IEEE, 2013, pp. 3384–3391.
[35] S. Lesage, R. Gribonval, F. Bimbot, and L. Benaroya, “Learning unions of orthonormal bases with thresholded singular value
decomposition,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5. IEEE, 2005,
pp. v–293.
[36] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice hall Englewood Cliffs, NJ,
1989, vol. 23.
[37] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of optimization theory and
applications, vol. 109, no. 3, pp. 475–494, 2001.
[38] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, “Proximal alternating minimization and projection methods for nonconvex problems:
an approach based on the kurdyka-lojasiewicz inequality,” Mathematics of Operations Research, vol. 35, no. 2, pp. 438–457, 2010.
[39] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,”
Mathematical Programming, vol. 146, no. 1-2, pp. 459–494, 2014.
[40] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,” in Proceedings of the 25th Annual Conference
on Learning Theory, 2012.
[41] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, “Learning sparsely used overcomplete dictionaries via alternating
minimization,” arXiv preprint arXiv:1310.7991, 2013.
[42] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcomplete dictionaries,” arXiv preprint arXiv:1308.6273,
2013.
[43] A. Agarwal, A. Anandkumar, and P. Netrapalli, “Exact recovery of sparsely used overcomplete dictionaries,” arXiv preprint
arXiv:1309.1952, 2013.
[44] S. Arora, A. Bhaskara, R. Ge, and T. Ma, “More algorithms for provable dictionary learning,” arXiv preprint arXiv:1401.0579, 2014.
[45] S. Arora, R. Ge, T. Ma, and A. Moitra, “Simple, efficient, and neural algorithms for sparse coding,” arXiv preprint arXiv:1503.00778,
2015.
[46] L. Demanet and P. Hand, “Scaling law for recovering the sparsest element in a subspace,” Information and Inference, vol. 3, no. 4, pp.
295–309, 2014.
[47] M. Zibulevsky and B. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural computation,
vol. 13, no. 4, pp. 863–882, 2001.
[48] Q. Qu, J. Sun, and J. Wright, “Finding a sparse vector in a subspace: Linear sparsity using alternating directions,” in Advances in
Neural Information Processing Systems, 2014, pp. 3401–3409.
[49] J. Sun, Q. Qu, and J. Wright, “When are nonconvex problems not scary?” arXiv preprint arXiv:1510.06096, 2015.
IEEE TRANSACTION ON INFORMATION THEORY, VOL. XX, NO. XX, XXXX 2015
42
[50] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points—online stochastic gradient for tensor decomposition,” arXiv
preprint arXiv:1503.02101, 2015.
[51] P.-A. Absil, C. G. Baker, and K. A. Gallivan, “Trust-region methods on riemannian manifolds,” Foundations of Computational
Mathematics, vol. 7, no. 3, pp. 303–330, 2007.
[52] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[53] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,”
Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.
[54] C. Hillar and F. T. Sommer, “When can dictionary learning uniquely recover sparse data from subsamples?” arXiv preprint
arXiv:1106.3616, 2011.
[55] K. Schnass, “Local identification of overcomplete dictionaries,” arXiv preprint arXiv:1401.6354, 2014.
[56] ——, “On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD,” Applied and Computational
Harmonic Analysis, vol. 37, no. 3, pp. 464–491, 2014.
[57] ——, “Convergence radius and sample complexity of ITKM algorithms for dictionary learning,” arXiv preprint arXiv:1503.07027, 2015.
[58] B. Barak, J. A. Kelner, and D. Steurer, “Dictionary learning and tensor decomposition via the sum-of-squares method,” arXiv preprint
arXiv:1407.1543, 2014.
[59] S. Wu and B. Yu, “Local identifiability of ℓ1-minimization dictionary learning: a sufficient and almost necessary condition,” arXiv
preprint arXiv:1505.04363, 2015.
[60] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: dictionary learning with noise and outliers,” arXiv preprint arXiv:1407.5155,
2014.
[61] L.-A. Gottlieb and T. Neylon, “Matrix sparsification and the sparse null space problem,” in Approximation, Randomization, and
Combinatorial Optimization. Algorithms and Techniques. Springer, 2010, pp. 205–218.
[62] P.-L. Loh and M. J. Wainwright, “High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity,”
in Advances in Neural Information Processing Systems, 2011, pp. 2726–2734.
[63] ——, “Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” in Advances in Neural
Information Processing Systems, 2013, pp. 476–484.
[64] Z. Wang, H. Lu, and H. Liu, “Nonconvex statistical optimization: minimax-optimal sparse PCA in polynomial time,” arXiv preprint
arXiv:1408.5352, 2014.
[65] S. Balakrishnan, M. J. Wainwright, and B. Yu, “Statistical guarantees for the EM algorithm: From population to sample-based analysis,”
arXiv preprint arXiv:1408.2156, 2014.
[66] Z. Wang, Q. Gu, Y. Ning, and H. Liu, “High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic
normality,” arXiv preprint arXiv:1412.8729, 2014.
[67] P.-L. Loh and M. J. Wainwright, “Support recovery without incoherence: A case for nonconvex regularization,” arXiv preprint
arXiv:1412.5632, 2014.
[68] P.-L. Loh, “Statistical consistency and asymptotic normality for high-dimensional robust M-estimators,” arXiv preprint arXiv:1501.00312,
2015.
[69] W. Sun, J. Lu, H. Liu, and G. Cheng, “Provable sparse tensor decomposition,” arXiv preprint arXiv:1502.01425, 2015.
[70] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” Information Theory, IEEE Transactions on, vol. 56,
no. 6, pp. 2980–2998, 2010.
[71] P. Jain, P. Netrapalli, and S. Sanghavi, “Low-rank matrix completion using alternating minimization,” in Proceedings of the forty-fifth
annual ACM symposium on Theory of Computing. ACM, 2013, pp. 665–674.
[72] M. Hardt, “Understanding alternating minimization for matrix completion,” in Foundations of Computer Science (FOCS), 2014 IEEE
55th Annual Symposium on. IEEE, 2014, pp. 651–660.
[73] M. Hardt and M. Wootters, “Fast matrix completion without the condition number,” in Proceedings of The 27th Conference on Learning
Theory, 2014, pp. 638–678.
[74] P. Netrapalli, U. Niranjan, S. Sanghavi, A. Anandkumar, and P. Jain, “Non-convex robust PCA,” in Advances in Neural Information
Processing Systems, 2014, pp. 1107–1115.
[75] P. Jain and P. Netrapalli, “Fast exact matrix completion with finite samples,” arXiv preprint arXiv:1411.1087, 2014.
[76] R. Sun and Z.-Q. Luo, “Guaranteed matrix completion via non-convex factorization,” arXiv preprint arXiv:1411.8003, 2014.
[77] Q. Zheng and J. Lafferty, “A convergent gradient descent algorithm for rank minimization and semidefinite programming from random
linear measurements,” arXiv preprint arXiv:1506.06081, 2015.
[78] S. Tu, R. Boczar, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” arXiv preprint
arXiv:1507.03566, 2015.
[79] Y. Chen and M. J. Wainwright, “Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees,”
arXiv preprint arXiv:1509.03025, 2015.
[80] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alternating minimization,” in Advances in Neural Information Processing
Systems, 2013, pp. 2796–2804.
[81] E. Candès, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtinger flow: Theory and algorithms,” Information Theory, IEEE
Transactions on, vol. 61, no. 4, pp. 1985–2007, April 2015.
[82] Y. Chen and E. J. Candes, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” arXiv preprint
arXiv:1505.05114, 2015.
[83] C. D. White, R. Ward, and S. Sanghavi, “The local convexity of solving quadratic equations,” arXiv preprint arXiv:1506.07868, 2015.
[84] P. Jain and S. Oh, “Provable tensor factorization with missing data,” in Advances in Neural Information Processing Systems, 2014, pp.
1431–1439.
[85] A. Anandkumar, R. Ge, and M. Janzamin, “Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates,” arXiv
preprint arXiv:1402.5180, 2014.
[86] ——, “Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models,” arXiv preprint
arXiv:1411.1488, 2014.
[87] A. Anandkumar, P. Jain, Y. Shi, and U. Niranjan, “Tensor vs matrix methods: Robust tensor decomposition under block sparse
perturbations,” arXiv preprint arXiv:1510.04747, 2015.
[88] X. Yi, C. Caramanis, and S. Sanghavi, “Alternating minimization for mixed linear regression,” arXiv preprint arXiv:1310.3745, 2013.
[89] K. Lee, Y. Wu, and Y. Bresler, “Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization,” arXiv
preprint arXiv:1312.0525, 2013.
[90] P. Jain, C. Jin, S. M. Kakade, and P. Netrapalli, “Computing matrix squareroot via non convex local search,” arXiv preprint
arXiv:1507.05854, 2015.
[91] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi, “Dropping convexity for faster semi-definite optimization,” arXiv preprint
arXiv:1509.03917, 2015.
[92] C. Bao, H. Ji, Y. Quan, and Z. Shen, “ℓ0 norm based dictionary learning by proximal methods with global convergence,” in Computer
Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 3858–3865.
[93] C. Bao, Y. Quan, and H. Ji, “A convergent incoherent dictionary learning algorithm for sparse coding,” in Computer Vision–ECCV
2014. Springer, 2014, pp. 302–316.
[94] C. Bao, H. Ji, and Z. Shen, “Convergence analysis for iterative data-driven tight frame construction scheme,” Applied and Computational
Harmonic Analysis, 2014.
[95] S. Arora, R. Ge, R. Kannan, and A. Moitra, “Computing a nonnegative matrix factorization–provably,” in Proceedings of the forty-fourth
annual ACM symposium on Theory of computing. ACM, 2012, pp. 145–162.
[96] S. Arora, A. Bhaskara, R. Ge, and T. Ma, “Provable bounds for learning some deep representations,” arXiv preprint arXiv:1310.6343,
2013.
[97] H. Sedghi and A. Anandkumar, “Provable methods for training neural networks with sparse connectivity,” arXiv preprint arXiv:1412.2693,
2014.
[98] B. Neyshabur and R. Panigrahy, “Sparse matrix factorization,” arXiv preprint arXiv:1311.3315, 2013.
[99] R. Livni, S. Shalev-Shwartz, and O. Shamir, “On the computational efficiency of training neural networks,” in Advances in Neural
Information Processing Systems, 2014, pp. 855–863.
[100] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4, pp. 411–430,
2000.
[101] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.
[102] P. Comon, “Independent component analysis, a new concept?” Signal processing, vol. 36, no. 3, pp. 287–314, 1994.
[103] A. Frieze, M. Jerrum, and R. Kannan, “Learning linear transformations,” in Proceedings of the 37th Annual Symposium on Foundations
of Computer Science. IEEE Computer Society, 1996, pp. 359–368.
[104] S. Arora, R. Ge, A. Moitra, and S. Sachdeva, “Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and
autoencoders,” in Advances in Neural Information Processing Systems, 2012, pp. 2375–2383.
[105] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Transactions on Neural Networks,
vol. 10, no. 3, pp. 626–634, 1999.
[106] J. Sun, Q. Qu, and J. Wright, “A geometric analysis of phase retrieval,” In preparation, 2015.
[107] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural
networks, vol. 2, no. 1, pp. 53–58, 1989.
[108] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM Journal on Matrix
Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[109] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd ed. John Wiley & Sons, 1999.
[110] T. E. Harris, “A lower bound for the critical probability in a certain percolation process,” in Mathematical Proceedings of the Cambridge
Philosophical Society, vol. 56, no. 1. Cambridge University Press, 1960, pp. 13–20.
[111] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University
Press, 2013.
[112] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing. Springer, 2013.
[113] J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of Computational Mathematics, vol. 12, no. 4, pp.
389–434, 2012.