
Nonconvex Low Rank Matrix Factorization via
Inexact First Order Oracle
Tuo Zhao∗
Zhaoran Wang†
Han Liu‡
Abstract
We study the low rank matrix factorization problem via nonconvex optimization. Compared with the convex relaxation approach, nonconvex optimization exhibits superior empirical
performance for large scale low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. To bridge this gap, we exploit the notion of inexact first order oracle,
which naturally appears in low rank matrix factorization problems such as matrix sensing and
completion. Particularly, our analysis shows that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, can be treated as solving
two sequences of convex optimization problems using the inexact first order oracle. Thus we can
show that these algorithms converge geometrically to the global optima and recover the true low
rank matrices under suitable conditions. Numerical results are provided to support our theory.
1 Introduction
Let M ∗ ∈ Rm×n be a rank k matrix with k much smaller than m and n. Our goal is to estimate M ∗
based on partial observations of its entries. For example, matrix completion is based on a subsample
of M ∗ ’s entries, while matrix sensing is based on linear measurements hAi , M ∗ i, where i ∈ {1, . . . , d}
with d much smaller than mn and Ai is the sensing matrix. In the past decade, significant progress
has been made on the recovery of low rank matrix [Candès and Recht, 2009, Candès and Tao, 2010,
Candes and Plan, 2010, Recht et al., 2010, Lee and Bresler, 2010, Keshavan et al., 2010a,b, Jain
et al., 2010, Cai et al., 2010, Recht, 2011, Gross, 2011, Chandrasekaran et al., 2011, Hsu et al.,
2011, Rohde and Tsybakov, 2011, Koltchinskii et al., 2011, Negahban and Wainwright, 2011, Chen
et al., 2011, Xiang et al., 2012, Negahban and Wainwright, 2012, Agarwal et al., 2012, Recht and
Ré, 2013, Chen, 2013, Chen et al., 2013a,b, Jain et al., 2013, Jain and Netrapalli, 2014, Hardt,
2014, Hardt et al., 2014, Hardt and Wootters, 2014, Sun and Luo, 2014, Hastie et al., 2014, Cai
and Zhang, 2015, Yan et al., 2015, Zhu et al., 2015, Wang et al., 2015]. Among these works, most
∗ Tuo Zhao is affiliated with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA and the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 USA; e-mail: [email protected].
† Zhaoran Wang is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 USA; e-mail: [email protected].
‡ Han Liu is affiliated with the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544 USA; e-mail: [email protected].
are based upon convex relaxation with nuclear norm constraint or regularization. Nevertheless,
solving these convex optimization problems can be computationally prohibitive in high dimensional
regimes with large m and n [Hsieh and Olsen, 2014]. A computationally more efficient alternative
is nonconvex optimization. In particular, we reparameterize the m × n matrix variable M in the
optimization problem as U V > with U ∈ Rm×k and V ∈ Rn×k , and optimize over U and V . Such a
reparametrization automatically enforces the low rank structure and leads to low computational cost
per iteration. Due to this reason, the nonconvex approach is widely used in large scale applications
such as recommendation systems or collaborative filtering [Koren, 2009, Koren et al., 2009].
Despite the superior empirical performance of the nonconvex approach, the understanding of
its theoretical guarantees is rather limited in comparison with the convex relaxation approach.
The classical nonconvex optimization theory can only show its sublinear convergence to local optima, but many empirical results have corroborated its exceptional computational performance
and convergence to global optima. Only recently has there been theoretical analysis of the
block coordinate descent-type nonconvex optimization algorithm, which is known as alternating
minimization [Jain et al., 2013, Hardt, 2014, Hardt et al., 2014, Hardt and Wootters, 2014]. In particular, the existing results show that, provided a proper initialization, the alternating minimization
algorithm attains a linear rate of convergence to a global optimum U ∗ ∈ Rm×k and V ∗ ∈ Rn×k ,
which satisfy M ∗ = U ∗ V ∗ > . Meanwhile, Keshavan et al. [2010a,b] establish the convergence of the
gradient-type methods, and Sun and Luo [2014] further establish the convergence of a broad class of
nonconvex optimization algorithms including both gradient-type and block coordinate descent-type
methods. However, Keshavan et al. [2010a,b], Sun and Luo [2014] only establish the asymptotic
convergence for an infinite number of iterations, rather than the explicit rate of convergence. Besides these works, Lee and Bresler [2010], Jain et al. [2010], Jain and Netrapalli [2014] consider
projected gradient-type methods, which optimize over the matrix variable M ∈ Rm×n rather than
U ∈ Rm×k and V ∈ Rn×k . These methods involve calculating the top k singular vectors of an m × n
matrix at each iteration. For k much smaller than m and n, they incur much higher computational
cost per iteration than the aforementioned methods that optimize over U and V . All these works,
except Sun and Luo [2014], focus on specific algorithms, while Sun and Luo [2014] do not establish
the explicit optimization rate of convergence.
In this paper, we propose a new theory for analyzing a broad class of nonconvex optimization
algorithms for low rank matrix estimation. The core of our theory is the notion of inexact first
order oracle. Based on the inexact first order oracle, we establish sufficient conditions under which
the iteration sequences converge geometrically to the global optima. For both matrix sensing and
completion, a direct consequence of our theory is that a broad family of nonconvex optimization
algorithms, including gradient descent, block coordinate gradient descent, and block coordinate
minimization, attain linear rates of convergence to the true low rank matrices U ∗ and V ∗ . In
particular, our proposed theory covers alternating minimization as a special case and recovers the
results of Jain et al. [2013], Hardt [2014], Hardt et al. [2014], Hardt and Wootters [2014] under
suitable conditions. Meanwhile, our approach covers gradient-type methods, which are also widely
used in practice [Takács et al., 2007, Paterek, 2007, Koren et al., 2009, Gemulla et al., 2011, Recht
and Ré, 2013, Zhuang et al., 2013]. To the best of our knowledge, our analysis is the first one
that establishes exact recovery guarantees and geometric rates of convergence for a broad family of
nonconvex matrix sensing and completion algorithms.
2
To achieve maximum generality, our unified analysis significantly differs from previous works.
In detail, Jain et al. [2013], Hardt [2014], Hardt et al. [2014], Hardt and Wootters [2014] view
alternating minimization as an approximate power method. However, their point of view relies
on the closed form solution of each iteration of alternating minimization, which makes it difficult
to generalize to other algorithms, e.g., gradient-type methods. Meanwhile, Sun and Luo [2014]
take a geometric point of view. In detail, they show that the global optimum of the optimization
problem is the unique stationary point within its neighborhood and thus a broad class of algorithms
succeed. However, such geometric analysis of the objective function does not characterize the
convergence rate of specific algorithms towards the stationary point. Unlike existing results, we
analyze nonconvex optimization algorithms as approximate convex counterparts. For example,
our analysis views alternating minimization on a nonconvex objective function as an approximate
block coordinate minimization on some convex objective function. We use the key quantity, the
inexact first order oracle, to characterize such a perturbation effect, which results from the local
nonconvexity at intermediate solutions. This eventually allows us to establish explicit rate of
convergence in an analogous way as existing convex optimization analysis.
Our proposed inexact first order oracle is closely related to a series of previous works on inexact or
approximate gradient descent algorithms: Güler [1992], Luo and Tseng [1993], Nedić and Bertsekas
[2001], d’Aspremont [2008], Baes [2009], Friedlander and Schmidt [2012], Devolder et al. [2014].
Different from these existing results, which focus on convex minimization, we show that the inexact first
order oracle can also sharply capture the evolution of generic optimization algorithms even in
the presence of nonconvexity. More recently, Candes et al. [2014], Balakrishnan et al. [2014], Arora
et al. [2015] respectively analyze the Wirtinger Flow algorithm for phase retrieval, the expectation
maximization (EM) Algorithm for latent variable models, and the gradient descent algorithm for
sparse coding based on a similar idea to ours. Though their analysis exploits similar nonconvex
structures, they work on completely different problems, and the delivered technical results are also
fundamentally different.
A conference version of this paper was presented at the Annual Conference on Neural Information Processing Systems 2015 [Zhao et al., 2015]. While our conference version was under review,
similar work was released on arXiv.org by Zheng and Lafferty [2015], Bhojanapalli et al. [2015], Tu
et al. [2015], Chen and Wainwright [2015]. These works focus on symmetric positive semidefinite
low rank matrix factorization problems. In contrast, our proposed methodologies and theory do
not require the symmetry and positive semidefiniteness, and therefore can be applied to rectangular
low rank matrix factorization problems.
The rest of this paper is organized as follows. In §2, we review the matrix sensing problems,
and then introduce a general class of nonconvex optimization algorithms. In §3, we present the
convergence analysis of the algorithms. In §4, we lay out the proof. In §5, we extend the proposed methodology and theory to the matrix completion problems. In §6, we provide numerical
experiments and draw the conclusion.
Notation: For $v = (v_1, \ldots, v_d)^\top \in \mathbb{R}^d$, we define the vector $\ell_q$ norm as $\|v\|_q^q = \sum_j v_j^q$. We define $e_i$ as an indicator vector, where the $i$-th entry is one, and all other entries are zero. For a matrix $A \in \mathbb{R}^{m \times n}$, we use $A_{*j} = (A_{1j}, \ldots, A_{mj})^\top$ to denote the $j$-th column of $A$, and $A_{i*} = (A_{i1}, \ldots, A_{in})^\top$ to denote the $i$-th row of $A$. Let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ be the largest and smallest nonzero singular values of $A$. We define the following matrix norms: $\|A\|_F^2 = \sum_j \|A_{*j}\|_2^2$ and $\|A\|_2 = \sigma_{\max}(A)$. Moreover, we define $\|A\|_*$ to be the sum of all singular values of $A$, and $A^\dagger$ to be the Moore-Penrose pseudoinverse of $A$. Given another matrix $B \in \mathbb{R}^{m \times n}$, we define the inner product as $\langle A, B\rangle = \sum_{i,j} A_{ij} B_{ij}$. For a bivariate function $f(u, v)$, we define $\nabla_u f(u, v)$ to be the gradient with respect to $u$. Moreover, we use the common notations $\Omega(\cdot)$, $O(\cdot)$, and $o(\cdot)$ to characterize the asymptotics of two real sequences.
2 Matrix Sensing
We start with the matrix sensing problem. Let M ∗ ∈ Rm×n be the unknown low rank matrix of
interest. We have d sensing matrices Ai ∈ Rm×n with i ∈ {1, . . . , d}. Our goal is to estimate M ∗
based on $b_i = \langle A_i, M^*\rangle$ in the high dimensional regime with $d$ much smaller than $mn$. Under such
a regime, a common assumption is $\mathrm{rank}(M^*) = k \ll \min\{d, m, n\}$. Existing approaches generally
recover M ∗ by solving the following convex optimization problem
$$\min_{M \in \mathbb{R}^{m \times n}} \|M\|_* \quad \text{subject to} \quad b = \mathcal{A}(M), \tag{1}$$
where $b = [b_1, \ldots, b_d]^\top \in \mathbb{R}^d$, and $\mathcal{A}(\cdot): \mathbb{R}^{m \times n} \to \mathbb{R}^d$ is a linear operator defined as
$$\mathcal{A}(M) = [\langle A_1, M\rangle, \ldots, \langle A_d, M\rangle]^\top \in \mathbb{R}^d. \tag{2}$$
Existing convex optimization algorithms for solving (1) are computationally inefficient, since they
incur high per-iteration computational cost and only attain sublinear rates of convergence to the
global optimum [Jain et al., 2013, Hsieh and Olsen, 2014]. Therefore in large scale settings, we
usually consider the following nonconvex optimization problem instead
min
U ∈Rm×k ,V
∈Rn×k
1
F(U, V ), where F(U, V ) = kb − A(U V > )k22 .
2
(3)
The reparametrization of M = U V > , though making the problem in (3) nonconvex, significantly
improves the computational efficiency. Existing literature [Takács et al., 2007, Paterek, 2007, Koren, 2009, Koren et al., 2009, Gemulla et al., 2011, Recht and Ré, 2013, Zhuang et al., 2013] has established convincing evidence that (3) can be effectively solved by a broad variety
of gradient-based nonconvex optimization algorithms, including gradient descent, alternating exact
minimization (i.e., alternating least squares or block coordinate minimization), as well as alternating
gradient descent (i.e., block coordinate gradient descent), as illustrated in Algorithm 1.
It is worth noting that the QR decomposition and the rank $k$ singular value decomposition in Algorithm 1 can be computed efficiently. In particular, the QR decomposition requires $O(k^2 \max\{m, n\})$ operations, while the rank $k$ singular value decomposition requires $O(kmn)$ operations. In fact, the QR decomposition is not necessary for particular update schemes;
e.g., Jain et al. [2013] prove that the alternating exact minimization update schemes with or without
the QR decomposition are equivalent.
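The role of the QR step is only to renormalize one factor while keeping the product $UV^\top$ unchanged: the triangular factor is absorbed into the other block. A minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 500, 400, 5                    # illustrative sizes
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))
M = U @ V.T

# Renormalization step from Algorithm 1: QR-factor the updated block and
# absorb the triangular factor into the other block, so U V^T is unchanged.
V_bar, R_V = np.linalg.qr(V)             # V = V_bar R_V, V_bar orthonormal
U_new = U @ R_V.T                        # U V^T = U_new V_bar^T

assert np.allclose(U_new @ V_bar.T, M)   # product preserved
assert np.allclose(V_bar.T @ V_bar, np.eye(k))
```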
3 Convergence Analysis
We analyze the convergence of the algorithms illustrated in §2. Before we present the main results,
we first introduce a unified analytical framework based on a key quantity named the approximate
Algorithm 1 A family of nonconvex optimization algorithms for matrix sensing. Here $(\bar{U}, D, \bar{V}) \leftarrow \mathrm{KSVD}(M)$ denotes the rank $k$ singular value decomposition of $M$: $D$ is a diagonal matrix containing the top $k$ singular values of $M$ in decreasing order, and $\bar{U}$ and $\bar{V}$ contain the corresponding top $k$ left and right singular vectors of $M$. $(\bar{V}, R_V) \leftarrow \mathrm{QR}(V)$ denotes the QR decomposition, where $\bar{V}$ is the corresponding orthonormal matrix and $R_V$ is the corresponding upper triangular matrix.

Input: $\{b_i\}_{i=1}^d$, $\{A_i\}_{i=1}^d$
Parameter: Step size $\eta$, total number of iterations $T$
$(\bar{U}^{(0)}, D^{(0)}, \bar{V}^{(0)}) \leftarrow \mathrm{KSVD}\big(\sum_{i=1}^d b_i A_i\big)$, $V^{(0)} \leftarrow \bar{V}^{(0)} D^{(0)}$, $U^{(0)} \leftarrow \bar{U}^{(0)} D^{(0)}$
For $t = 0, \ldots, T - 1$
  Updating $V$:
    Alternating Exact Minimization: $V^{(t+0.5)} \leftarrow \mathop{\mathrm{argmin}}_V \mathcal{F}(\bar{U}^{(t)}, V)$;
      $(\bar{V}^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$
    Alternating Gradient Descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V \mathcal{F}(\bar{U}^{(t)}, V^{(t)})$;
      $(\bar{V}^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t)} \leftarrow \bar{U}^{(t)} R_V^{(t+0.5)\top}$
    Gradient Descent: $V^{(t+0.5)} \leftarrow V^{(t)} - \eta \nabla_V \mathcal{F}(U^{(t)}, V^{(t)})$;
      $(\bar{V}^{(t+1)}, R_V^{(t+0.5)}) \leftarrow \mathrm{QR}(V^{(t+0.5)})$, $U^{(t+1)} \leftarrow U^{(t)} R_V^{(t+0.5)\top}$
  Updating $U$:
    Alternating Exact Minimization: $U^{(t+0.5)} \leftarrow \mathop{\mathrm{argmin}}_U \mathcal{F}(U, \bar{V}^{(t+1)})$;
      $(\bar{U}^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$
    Alternating Gradient Descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U \mathcal{F}(U^{(t)}, \bar{V}^{(t+1)})$;
      $(\bar{U}^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow \bar{V}^{(t+1)} R_U^{(t+0.5)\top}$
    Gradient Descent: $U^{(t+0.5)} \leftarrow U^{(t)} - \eta \nabla_U \mathcal{F}(U^{(t)}, V^{(t)})$;
      $(\bar{U}^{(t+1)}, R_U^{(t+0.5)}) \leftarrow \mathrm{QR}(U^{(t+0.5)})$, $V^{(t+1)} \leftarrow V^{(t+0.5)} R_U^{(t+0.5)\top}$
End for
Output: $M^{(T)} \leftarrow U^{(T-0.5)} \bar{V}^{(T)\top}$ (for gradient descent we use $U^{(T)} V^{(T)\top}$)
first order oracle. Such a unified framework equips our theory with the maximum generality.
Without loss of generality, we assume m ≤ n throughout the rest of this paper.
3.1 Main Idea
We first provide an intuitive explanation for the success of nonconvex optimization algorithms,
which forms the basis of our later analysis of the main results in §4. Recall that (3) can be written
as a special instance of the following optimization problem,
$$\min_{U \in \mathbb{R}^{m \times k},\, V \in \mathbb{R}^{n \times k}} f(U, V). \tag{4}$$
A key observation is that, given fixed U , f (U, ·) is strongly convex and smooth in V under suitable
conditions, and the same also holds for U given fixed V correspondingly. For the convenience of
discussion, we summarize this observation in the following technical condition, which will be later
verified for matrix sensing and completion under suitable conditions.
Condition 1 (Strong Biconvexity and Bismoothness). There exist universal constants µ+ > 0 and
µ− > 0 such that
$$\frac{\mu_-}{2}\|U' - U\|_F^2 \le f(U', V) - f(U, V) - \langle U' - U, \nabla_U f(U, V)\rangle \le \frac{\mu_+}{2}\|U' - U\|_F^2 \quad \text{for all } U, U',$$
$$\frac{\mu_-}{2}\|V' - V\|_F^2 \le f(U, V') - f(U, V) - \langle V' - V, \nabla_V f(U, V)\rangle \le \frac{\mu_+}{2}\|V' - V\|_F^2 \quad \text{for all } V, V'.$$
3.1.1 Ideal First Order Oracle
To ease presentation, we assume that U ∗ and V ∗ are the unique global minimizers to the generic
optimization problem in (4). Assuming that U ∗ is given, we can obtain V ∗ by
$$V^* = \mathop{\mathrm{argmin}}_{V \in \mathbb{R}^{n \times k}} f(U^*, V). \tag{5}$$
Condition 1 implies the objective function in (5) is strongly convex and smooth. Hence, we can
choose any gradient-based algorithm to obtain V ∗ . For example, we can directly solve for V ∗ in
$$\nabla_V f(U^*, V) = 0, \tag{6}$$
or iteratively solve for V ∗ using gradient descent, i.e.,
$$V^{(t)} = V^{(t-1)} - \eta \nabla_V f(U^*, V^{(t-1)}), \tag{7}$$
where η is a step size. Taking gradient descent as an example, we can invoke classical convex
optimization results [Nesterov, 2004] to prove that
$$\|V^{(t)} - V^*\|_F \le \kappa \|V^{(t-1)} - V^*\|_F \quad \text{for all } t = 1, 2, \ldots,$$
where $\kappa \in (0, 1)$ only depends on $\mu_+$ and $\mu_-$ in Condition 1. For notational simplicity, we call
∇V f (U ∗ , V (t−1) ) the ideal first order oracle, since we do not know U ∗ in practice.
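With the ideal oracle available, the update (7) is plain gradient descent on a strongly convex least squares problem, so the error indeed contracts geometrically. A small simulation (the Gaussian sensing model, the problem sizes, and the hand-picked step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, d = 20, 25, 3, 400                      # illustrative sizes
Ustar = np.linalg.qr(rng.standard_normal((m, k)))[0]
Vstar = rng.standard_normal((n, k))
A = rng.standard_normal((d, m, n)) / np.sqrt(d)
b = np.einsum('dij,ij->d', A, Ustar @ Vstar.T)

def grad_V(U, V):
    """Gradient of f(U, V) = 0.5 ||b - A(U V^T)||_2^2 with respect to V."""
    r = np.einsum('dij,ij->d', A, U @ V.T) - b
    return np.einsum('d,dij,ik->jk', r, A, U)    # sum_i r_i A_i^T U

# Gradient descent with the *ideal* first order oracle, as in (7):
V, eta = np.zeros((n, k)), 0.5                   # hand-picked step size
errs = [np.linalg.norm(V - Vstar)]
for _ in range(100):
    V = V - eta * grad_V(Ustar, V)
    errs.append(np.linalg.norm(V - Vstar))
print(errs[-1] / errs[0])  # geometric decay of ||V^(t) - V*||_F
```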
3.1.2 Inexact First Order Oracle
Though the ideal first order oracle is not accessible in practice, it provides us insights to analyze
nonconvex optimization algorithms. Taking gradient descent as an example, at the t-th iteration,
we take a gradient descent step over V based on ∇V f (U, V (t−1) ). Now we can treat ∇V f (U, V (t−1) )
as an approximation of ∇V f (U ∗ , V (t−1) ), where the approximation error comes from approximating
U ∗ by U . Then the relationship between ∇V f (U ∗ , V (t−1) ) and ∇V f (U, V (t−1) ) is similar to that
between gradient and approximate gradient in existing literature on convex optimization. For
simplicity, we call ∇V f (U, V (t−1) ) the inexact first order oracle.
To characterize the difference between $\nabla_V f(U^*, V^{(t-1)})$ and $\nabla_V f(U, V^{(t-1)})$, we define the approximation error of the inexact first order oracle as
$$E(V, V', U) = \|\nabla_V f(U^*, V') - \nabla_V f(U, V')\|_F, \tag{8}$$
where $V'$ is the current decision variable for evaluating the gradient. In the above example, it holds that $V' = V^{(t-1)}$. Later we will illustrate that $E(V, V', U)$ is critical to our analysis. In
the above example of alternating gradient descent, we will prove later that for V (t) = V (t−1) −
η∇V f (U, V (t−1) ), we have
$$\|V^{(t)} - V^*\|_F \le \kappa \|V^{(t-1)} - V^*\|_F + \frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U). \tag{9}$$
In other words, E(V (t) , V (t−1) , U ) captures the perturbation effect by employing the inexact first
order oracle $\nabla_V f(U, V^{(t-1)})$ instead of the ideal first order oracle $\nabla_V f(U^*, V^{(t-1)})$. For $V^{(t)} = \mathop{\mathrm{argmin}}_V f(U, V)$, we will prove that
$$\|V^{(t)} - V^*\|_F \le \frac{1}{\mu_-} E(V^{(t)}, V^{(t)}, U). \tag{10}$$
According to the update schemes shown in Algorithms 1 and 2, for alternating exact minimization,
we set U = U (t) in (10), while for gradient descent or alternating gradient descent, we set U = U (t−1)
or U = U (t) in (9) respectively. Due to symmetry, similar results also hold for kU (t) − U ∗ kF .
To establish the geometric rate of convergence towards the global minima $U^*$ and $V^*$, it remains to establish upper bounds for the approximation error of the inexact first order oracle. Taking gradient descent as an example, we will prove that given an appropriate initial solution, we have
$$\frac{2}{\mu_+} E(V^{(t)}, V^{(t-1)}, U^{(t-1)}) \le \alpha \|U^{(t-1)} - U^*\|_F \tag{11}$$
for some α ∈ (0, 1 − κ). Combining with (9) (where we take U = U (t−1) ), (11) further implies
$$\|V^{(t)} - V^*\|_F \le \kappa \|V^{(t-1)} - V^*\|_F + \alpha \|U^{(t-1)} - U^*\|_F. \tag{12}$$
Correspondingly, similar results hold for kU (t) − U ∗ kF , i.e.,
$$\|U^{(t)} - U^*\|_F \le \kappa \|U^{(t-1)} - U^*\|_F + \alpha \|V^{(t-1)} - V^*\|_F. \tag{13}$$
Combining (12) and (13) we then establish the contraction
$$\max\{\|V^{(t)} - V^*\|_F, \|U^{(t)} - U^*\|_F\} \le (\alpha + \kappa) \cdot \max\{\|V^{(t-1)} - V^*\|_F, \|U^{(t-1)} - U^*\|_F\},$$
which further implies the geometric convergence, since α ∈ (0, 1−κ). Respectively, we can establish
similar results for alternating exact minimization and alternating gradient descent. Based upon
such a unified analysis, we now present the main results.
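The coupled recursion (12)-(13) can be checked numerically; with any illustrative constants satisfying $\alpha \in (0, 1 - \kappa)$, the worst-case error contracts by a factor of $\alpha + \kappa$ per iteration:

```python
# Illustrative constants only: any kappa, alpha with alpha in (0, 1 - kappa) work.
kappa, alpha = 0.5, 0.3
eu, ev = 1.0, 1.0                       # normalized initial errors
for t in range(5):
    # simultaneous application of (12) and (13)
    eu, ev = kappa * eu + alpha * ev, kappa * ev + alpha * eu
    assert max(eu, ev) <= (kappa + alpha) ** (t + 1) + 1e-12
print(max(eu, ev))                      # approximately 0.8**5 = 0.32768
```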
3.2 Main Results
Before presenting the main results, we first introduce an assumption known as the restricted isometry property (RIP). Recall that $k$ is the rank of the target low rank matrix $M^*$.
Assumption 1 (Restricted Isometry Property). The linear operator A(·) : Rm×n → Rd defined in (2)
satisfies 2k-RIP with parameter δ2k ∈ (0, 1), i.e., for all ∆ ∈ Rm×n such that rank(∆) ≤ 2k, it
holds that
$$(1 - \delta_{2k})\|\Delta\|_F^2 \le \|\mathcal{A}(\Delta)\|_2^2 \le (1 + \delta_{2k})\|\Delta\|_F^2.$$
Several random matrix ensembles satisfy 2k-RIP for a sufficiently large $d$ with high probability. For example, suppose that each entry of $A_i$ is independently drawn from a sub-Gaussian distribution; then $\mathcal{A}(\cdot)$ satisfies 2k-RIP with parameter $\delta_{2k}$ with high probability for $d = \Omega(\delta_{2k}^{-2} kn \log n)$.
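This concentration can be probed empirically: for Gaussian $A_i$ with entries of variance $1/d$, the ratio $\|\mathcal{A}(\Delta)\|_2^2 / \|\Delta\|_F^2$ stays close to 1 over random rank-$2k$ directions $\Delta$ (a Monte Carlo sketch with illustrative sizes, not a certificate of RIP, which quantifies over all low rank $\Delta$):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k, d = 20, 20, 2, 1500                     # illustrative sizes
A = rng.standard_normal((d, m, n)) / np.sqrt(d)  # entries N(0, 1/d)

ratios = []
for _ in range(200):
    # random rank-2k direction Delta
    Delta = rng.standard_normal((m, 2 * k)) @ rng.standard_normal((2 * k, n))
    ADelta = np.einsum('dij,ij->d', A, Delta)    # A(Delta) in R^d
    ratios.append((ADelta @ ADelta) / np.sum(Delta * Delta))
print(min(ratios), max(ratios))  # both concentrate around 1
```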
The following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in Algorithm 1.
Theorem 1. Assume there exists a sufficiently small constant C1 such that A(·) satisfies 2k-RIP
with δ2k ≤ C1 /k, and the largest and smallest nonzero singular values of M ∗ are constants, which
do not scale with $(d, m, n, k)$. For any pre-specified precision $\epsilon$, there exists an $\eta$ and universal constants $C_2$ and $C_3$ such that for all $T \ge C_2 \log(C_3/\epsilon)$, we have $\|M^{(T)} - M^*\|_F \le \epsilon$.
The proof of Theorem 1 is provided in §4.2, §4.3, and §4.4. Theorem 1 implies that all
three nonconvex optimization algorithms converge geometrically to the global optimum. Moreover,
assuming that each entry of Ai is independently drawn from a sub-Gaussian distribution with
mean zero and variance proxy one, our result further suggests that, to achieve exact low rank
matrix recovery, our algorithm requires the number of measurements d to satisfy
$$d = \Omega(k^3 n \log n), \tag{14}$$
since we assume that δ2k ≤ C1 /k. This sample complexity result matches the state-of-the-art result
for nonconvex optimization methods, which is established by Jain et al. [2013]. In comparison with
their result, which only covers the alternating exact minimization algorithm, our result holds for
a broader variety of nonconvex optimization algorithms.
Note that the sample complexity in (14) depends on a polynomial of σmax (M ∗ )/σmin (M ∗ ),
which is treated as a constant in our paper. If we allow σmax (M ∗ )/σmin (M ∗ ) to increase, we can
plug the nonconvex optimization algorithms into the multi-stage framework proposed by Jain et al.
[2013]. Following similar lines to the proof of Theorem 1, we can derive a new sample complexity,
which is independent of σmax (M ∗ )/σmin (M ∗ ). See more details in Jain et al. [2013].
4 Proof of Main Results
We sketch the proof of Theorem 1. The proofs of all related lemmas are provided in the appendix. For notational simplicity, let $\sigma_1 = \sigma_{\max}(M^*)$ and $\sigma_k = \sigma_{\min}(M^*)$. Recall that the nonconvex optimization algorithms are symmetric about the updates of $U$ and $V$. Hence, the following lemmas for the update of $V$ also hold for the update of $U$. We omit some statements for conciseness. Theorem 2 can be proved in a similar manner, and its proof is provided in Appendix F.
Before presenting the proof, we first introduce the following lemma, which verifies Condition 1.
Lemma 1. Suppose that $\mathcal{A}(\cdot)$ satisfies 2k-RIP with parameter $\delta_{2k}$. Given an arbitrary orthonormal matrix $\bar{U} \in \mathbb{R}^{m \times k}$, for any $V, V' \in \mathbb{R}^{n \times k}$, we have
$$\frac{1 + \delta_{2k}}{2}\|V' - V\|_F^2 \ge \mathcal{F}(\bar{U}, V') - \mathcal{F}(\bar{U}, V) - \langle \nabla_V \mathcal{F}(\bar{U}, V), V' - V\rangle \ge \frac{1 - \delta_{2k}}{2}\|V' - V\|_F^2.$$
The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 implies that F(U , ·) is strongly
convex and smooth in V given a fixed orthonormal matrix U , as specified in Condition 1. Equipped
with Lemma 1, we now lay out the proof for each update scheme in Algorithm 1.
4.1 Rotation Issue
Given a factorization of $M^* = \bar{U}^* V^{*\top}$, we can equivalently represent it as $M^* = \bar{U}^*_{\rm new} V^{*\top}_{\rm new}$, where
$$\bar{U}^*_{\rm new} = \bar{U}^* O_{\rm new} \quad \text{and} \quad V^*_{\rm new} = V^* O_{\rm new}$$
for an arbitrary unitary matrix $O_{\rm new} \in \mathbb{R}^{k \times k}$. This implies that directly calculating $\|\bar{U} - \bar{U}^*\|_F$ is not desirable, and the algorithm may converge to an arbitrary factorization of $M^*$.
To address this issue, existing analysis usually chooses subspace distances to evaluate the difference between the subspaces spanned by the columns of $\bar{U}$ and $\bar{U}^*$, because these subspaces are invariant to rotations [Jain et al., 2013]. For example, let $\bar{U}_\perp \in \mathbb{R}^{m \times (m-k)}$ denote the orthonormal complement to $\bar{U}$; we can choose the subspace distance as $\|\bar{U}_\perp^\top \bar{U}^*\|_F$. For any $O_{\rm new} \in \mathbb{R}^{k \times k}$ such that $O_{\rm new}^\top O_{\rm new} = I_k$, we have
$$\|\bar{U}_\perp^\top \bar{U}^*_{\rm new}\|_F = \|\bar{U}_\perp^\top \bar{U}^* O_{\rm new}\|_F = \|\bar{U}_\perp^\top \bar{U}^*\|_F.$$
In this paper, we consider a different subspace distance defined as
$$\min_{O^\top O = I_k} \|\bar{U} - \bar{U}^* O\|_F. \tag{15}$$
We can verify that (15) is also invariant to rotation. The next lemma shows that (15) is equivalent to $\|\bar{U}_\perp^\top \bar{U}^*\|_F$.
Lemma 2. Given two orthonormal matrices $\bar{U} \in \mathbb{R}^{m \times k}$ and $\bar{U}^* \in \mathbb{R}^{m \times k}$, we have
$$\|\bar{U}_\perp^\top \bar{U}^*\|_F \le \min_{O^\top O = I_k} \|\bar{U} - \bar{U}^* O\|_F \le \sqrt{2}\,\|\bar{U}_\perp^\top \bar{U}^*\|_F.$$
The proof of Lemma 2 is provided in Stewart et al. [1990] and is therefore omitted. Equipped with
Lemma 2, our convergence analysis guarantees that there always exists a factorization of M ∗
satisfying the desired computational properties for each iteration (See Lemma 5, Corollaries 1 and
2). Similarly, the above argument can also be generalized to gradient descent and alternating
gradient descent algorithms.
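The subspace distance (15) is an orthogonal Procrustes problem: writing the SVD $\bar{U}^{*\top}\bar{U} = W\Sigma Z^\top$, the minimizing rotation is $O = WZ^\top$ (a standard fact, assumed here rather than proved). The sketch below also checks the rotation invariance and the sandwich of Lemma 2:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 50, 4                                    # illustrative sizes
U_bar = np.linalg.qr(rng.standard_normal((m, k)))[0]
U_star = np.linalg.qr(rng.standard_normal((m, k)))[0]

def subspace_dist(Ub, Us):
    """min over orthogonal O of ||Ub - Us O||_F (orthogonal Procrustes)."""
    W, _, Zt = np.linalg.svd(Us.T @ Ub)
    return np.linalg.norm(Ub - Us @ (W @ Zt))

dist = subspace_dist(U_bar, U_star)

# (15) is invariant to rotating the factorization:
Q = np.linalg.qr(rng.standard_normal((k, k)))[0]
assert np.isclose(dist, subspace_dist(U_bar, U_star @ Q))

# Lemma 2 sandwich, with ||Ub_perp^T Us||_F = ||(I - Ub Ub^T) Us||_F:
d_perp = np.linalg.norm(U_star - U_bar @ (U_bar.T @ U_star))
assert d_perp <= dist + 1e-10 and dist <= np.sqrt(2) * d_perp + 1e-10
```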
4.2 Proof of Theorem 1 (Alternating Exact Minimization)
Proof. Throughout the proof for alternating exact minimization, we define a constant $\xi \in (1, \infty)$ to simplify the notation. Moreover, we assume that at the $t$-th iteration, there exists a matrix factorization of $M^*$,
$$M^* = \bar{U}^{*(t)} V^{*(t)\top},$$
where $\bar{U}^{*(t)} \in \mathbb{R}^{m \times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as
$$E(V^{(t+0.5)}, V^{(t+0.5)}, \bar{U}^{(t)}) = \|\nabla_V \mathcal{F}(\bar{U}^{*(t)}, V^{(t+0.5)}) - \nabla_V \mathcal{F}(\bar{U}^{(t)}, V^{(t+0.5)})\|_F.$$
The following lemma establishes an upper bound for the approximation error of the inexact first order oracle under suitable conditions.
Lemma 3. Suppose that $\delta_{2k}$ and $\bar{U}^{(t)}$ satisfy
$$\delta_{2k} \le \frac{(1 - \delta_{2k})^2 \sigma_k}{12\xi k(1 + \delta_{2k})\sigma_1} \quad \text{and} \quad \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}. \tag{16}$$
Then we have
$$E(V^{(t+0.5)}, V^{(t+0.5)}, \bar{U}^{(t)}) \le \frac{(1 - \delta_{2k})\sigma_k}{2\xi}\|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.$$
The proof of Lemma 3 is provided in Appendix A.2. Lemma 3 shows that the approximation error of the inexact first order oracle for updating $V$ diminishes with the estimation error of $\bar{U}^{(t)}$, when $\bar{U}^{(t)}$ is sufficiently close to $\bar{U}^{*(t)}$. The following lemma quantifies the progress of an exact minimization step using the inexact first order oracle.
Lemma 4. We have
$$\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{1}{1 - \delta_{2k}} E(V^{(t+0.5)}, V^{(t+0.5)}, \bar{U}^{(t)}).$$
The proof of Lemma 4 is provided in Appendix A.3. Lemma 4 illustrates that the estimation error of $V^{(t+0.5)}$ diminishes with the approximation error of the inexact first order oracle. The following lemma characterizes the effect of the renormalization step using the QR decomposition, i.e., the relationship between $V^{(t+0.5)}$ and $\bar{V}^{(t+1)}$ in terms of the estimation error.
Lemma 5. Suppose that $V^{(t+0.5)}$ satisfies
$$\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{4}. \tag{17}$$
Then there exists a factorization of $M^* = U^{*(t+1)}\bar{V}^{*(t+1)\top}$ such that $\bar{V}^{*(t+1)} \in \mathbb{R}^{n \times k}$ is an orthonormal matrix, and satisfies
$$\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{2}{\sigma_k}\|V^{(t+0.5)} - V^{*(t)}\|_F.$$
The proof of Lemma 5 is provided in Appendix A.4. The next lemma quantifies the accuracy of the initialization $\bar{U}^{(0)}$.
Lemma 6. Suppose that $\delta_{2k}$ satisfies
$$\delta_{2k} \le \frac{(1 - \delta_{2k})^2 \sigma_k^4}{192\xi^2 k(1 + \delta_{2k})^2 \sigma_1^4}. \tag{18}$$
Then there exists a factorization of $M^* = \bar{U}^{*(0)} V^{*(0)\top}$ such that $\bar{U}^{*(0)} \in \mathbb{R}^{m \times k}$ is an orthonormal matrix, and satisfies
$$\|\bar{U}^{(0)} - \bar{U}^{*(0)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}.$$
The proof of Lemma 6 is provided in Appendix A.5. Lemma 6 implies that the initial solution $\bar{U}^{(0)}$ attains a sufficiently small estimation error.
Combining Lemmas 3, 4, and 5, we obtain the following corollary for a complete iteration of
updating V .
Corollary 1. Suppose that $\delta_{2k}$ and $\bar{U}^{(t)}$ satisfy
$$\delta_{2k} \le \frac{(1 - \delta_{2k})^2 \sigma_k^4}{192\xi^2 k(1 + \delta_{2k})^2 \sigma_1^4} \quad \text{and} \quad \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}. \tag{19}$$
We then have
$$\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}.$$
Moreover, we also have
$$\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{1}{\xi}\|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \quad \text{and} \quad \|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{2\xi}\|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.$$
The proof of Corollary 1 is provided in Appendix A.6. Since the alternating exact minimization
algorithm updates U and V in a symmetric manner, we can establish similar results for a complete
iteration of updating U in the next corollary.
Corollary 2. Suppose that $\delta_{2k}$ and $\bar{V}^{(t+1)}$ satisfy
$$\delta_{2k} \le \frac{(1 - \delta_{2k})^2 \sigma_k^4}{192\xi^2 k(1 + \delta_{2k})^2 \sigma_1^4} \quad \text{and} \quad \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}. \tag{20}$$
Then there exists a factorization of $M^* = \bar{U}^{*(t+1)} V^{*(t+1)\top}$ such that $\bar{U}^{*(t+1)}$ is an orthonormal matrix, and satisfies
$$\|\bar{U}^{(t+1)} - \bar{U}^{*(t+1)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi(1 + \delta_{2k})\sigma_1}.$$
Moreover, we also have
$$\|\bar{U}^{(t+1)} - \bar{U}^{*(t+1)}\|_F \le \frac{1}{\xi}\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \quad \text{and} \quad \|U^{(t+0.5)} - U^{*(t+1)}\|_F \le \frac{\sigma_k}{2\xi}\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F.$$
The proof of Corollary 2 follows directly from Appendix A.6, and is therefore omitted.
We then proceed with the proof of Theorem 1 for alternating exact minimization. Lemma 6 ensures that (19) of Corollary 1 holds for $\bar{U}^{(0)}$. Then Corollary 1 ensures that (20) of Corollary 2 holds for $\bar{V}^{(1)}$. By induction, Corollaries 1 and 2 can be applied recursively for all $T$ iterations. Thus we obtain
$$\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \frac{1}{\xi}\|\bar{U}^{(T-1)} - \bar{U}^{*(T-1)}\|_F \le \frac{1}{\xi^2}\|\bar{V}^{(T-1)} - \bar{V}^{*(T-1)}\|_F \le \cdots \le \frac{1}{\xi^{2T-1}}\|\bar{U}^{(0)} - \bar{U}^{*(0)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi^{2T}(1 + \delta_{2k})\sigma_1}, \tag{21}$$
where the last inequality comes from Lemma 6. Therefore, for a pre-specified accuracy $\epsilon$, we need at most
$$T = \frac{1}{2}\log\left(\frac{(1 - \delta_{2k})\sigma_k}{2\epsilon(1 + \delta_{2k})\sigma_1}\right)\log^{-1}\xi \tag{22}$$
iterations such that
$$\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k}{4\xi^{2T}(1 + \delta_{2k})\sigma_1} \le \frac{\epsilon}{2}. \tag{23}$$
Moreover, Corollary 2 implies
$$\|U^{(T-0.5)} - U^{*(T)}\|_F \le \frac{\sigma_k}{2\xi}\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1 + \delta_{2k})\sigma_1},$$
where the last inequality comes from (21). Therefore, we need at most
$$T = \frac{1}{2}\log\left(\frac{(1 - \delta_{2k})\sigma_k^2}{4\xi\epsilon(1 + \delta_{2k})}\right)\log^{-1}\xi \tag{24}$$
iterations such that
$$\|U^{(T-0.5)} - U^{*(T)}\|_F \le \frac{(1 - \delta_{2k})\sigma_k^2}{8\xi^{2T+1}(1 + \delta_{2k})\sigma_1} \le \frac{\epsilon}{2\sigma_1}. \tag{25}$$
Then combining (23) and (25), we obtain
$$\begin{aligned}
\|M^{(T)} - M^*\|_F &= \|U^{(T-0.5)}\bar{V}^{(T)\top} - U^{*(T)}\bar{V}^{*(T)\top}\|_F \\
&= \|U^{(T-0.5)}\bar{V}^{(T)\top} - U^{*(T)}\bar{V}^{(T)\top} + U^{*(T)}\bar{V}^{(T)\top} - U^{*(T)}\bar{V}^{*(T)\top}\|_F \\
&\le \|\bar{V}^{(T)}\|_2\|U^{(T-0.5)} - U^{*(T)}\|_F + \|U^{*(T)}\|_2\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \epsilon,
\end{aligned} \tag{26}$$
where the last inequality comes from $\|\bar{V}^{(T)}\|_2 = 1$ (since $\bar{V}^{(T)}$ is orthonormal) and $\|U^{*(T)}\|_2 = \|M^*\|_2 = \sigma_1$ (since $U^{*(T)}\bar{V}^{*(T)\top} = M^*$ and $\bar{V}^{*(T)}$ is orthonormal). Thus combining (22) and (24) with (26), we complete the proof.
4.3 Proof of Theorem 1 (Alternating Gradient Descent)
Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant $\xi$. Moreover, we assume that at the $t$-th iteration, there exists a matrix factorization of $M^*$,
$$M^* = \bar{U}^{*(t)} V^{*(t)\top},$$
where $\bar{U}^{*(t)} \in \mathbb{R}^{m \times k}$ is an orthonormal matrix. We define the approximation error of the inexact first order oracle as
$$E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}) = \|\nabla_V \mathcal{F}(\bar{U}^{(t)}, V^{(t)}) - \nabla_V \mathcal{F}(\bar{U}^{*(t)}, V^{(t)})\|_F.$$
The first lemma is parallel to Lemma 3 for alternating exact minimization.
Lemma 7. Suppose that $\delta_{2k}$, $\bar{U}^{(t)}$, and $V^{(t)}$ satisfy
$$\delta_{2k} \le \frac{(1 - \delta_{2k})\sigma_k}{24\xi k\sigma_1}, \quad \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad \text{and} \quad \|V^{(t)} - V^{*(t)}\|_F \le \frac{\sigma_1\sqrt{k}}{2}. \tag{27}$$
Then we have
$$E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}) \le \frac{(1 + \delta_{2k})\sigma_k}{\xi}\|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.$$
The proof of Lemma 7 is provided in Appendix B.1. Lemma 7 illustrates that the approximation error of the inexact first order oracle diminishes with the estimation error of $\bar{U}^{(t)}$, when $\bar{U}^{(t)}$ and $V^{(t)}$ are sufficiently close to $\bar{U}^{*(t)}$ and $V^{*(t)}$.
Lemma 8. Suppose that the step size parameter η satisfies

η = 1/(1 + δ2k).   (28)

Then we have

‖V^(t+0.5) − V^*‖_F ≤ √δ2k ‖V^(t) − V^*‖_F + (2/(1 + δ2k)) E(V^(t+0.5), V^(t), U^(t)).
The proof of Lemma 8 is in Appendix B.2. Lemma 8 characterizes the progress of a gradient
descent step with a pre-specified fixed step size. A more practical option is adaptively selecting η
using the backtracking line search procedure, and similar results can be guaranteed. See Nesterov
[2004] for details. The following lemma characterizes the effect of the renormalization step using
QR decomposition.
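The fixed-step update of Lemma 8, and the backtracking line search mentioned as a practical alternative, can be sketched as follows. This is an illustrative numpy fragment for a generic smooth objective f; the Armijo sufficient-decrease constant 0.5 and the shrinkage factor 0.5 are conventional assumptions, not the paper's.

```python
import numpy as np

def grad_step_fixed(V, grad, delta_2k):
    """One gradient step on V with the fixed step size eta = 1 / (1 + delta_2k) of (28)."""
    eta = 1.0 / (1.0 + delta_2k)
    return V - eta * grad

def grad_step_backtracking(f, V, grad, eta0=1.0, shrink=0.5, max_iter=50):
    """Backtracking (Armijo) line search: shrink eta until sufficient decrease holds."""
    g2 = np.sum(grad * grad)
    eta = eta0
    for _ in range(max_iter):
        if f(V - eta * grad) <= f(V) - 0.5 * eta * g2:
            break
        eta *= shrink
    return V - eta * grad
```

Both variants decrease a strongly smooth objective; the backtracking version adapts to the local curvature when δ2k is unknown.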
Lemma 9. Suppose that V^(t+0.5) satisfies

‖V^(t+0.5) − V^*(t)‖_F ≤ σ_k/4.   (29)

Then there exists a factorization of M^* = U^*(t+1) V^*(t+1)⊤ such that V^*(t+1) ∈ R^(n×k) is an orthonormal matrix, and

‖V^(t+1) − V^*(t+1)‖_F ≤ (2/σ_k) ‖V^(t+0.5) − V^*(t)‖_F,
‖U^(t) − U^*(t+1)‖_F ≤ (3σ_1/σ_k) ‖V^(t+0.5) − V^*(t)‖_F + σ_1 ‖U^(t) − U^*(t)‖_F.
The proof of Lemma 9 is provided in Appendix B.3. The next lemma quantifies the accuracy of the initial solutions.

Lemma 10. Suppose that δ2k satisfies

δ2k ≤ σ_k⁶/(192ξ²kσ_1⁶).   (30)

Then we have

‖U^(0) − U^*(0)‖_F ≤ σ_k²/(4ξσ_1²) and ‖V^(0) − V^*(0)‖_F ≤ σ_k²/(2ξσ_1) ≤ σ_1√k/2.

The proof of Lemma 10 is in Appendix B.4. Lemma 10 indicates that the initial solutions U^(0) and V^(0) attain sufficiently small estimation errors.

Combining Lemmas 7, 8, and 9, we obtain the following corollary for a complete iteration of updating V.
Corollary 3. Suppose that δ2k, U^(t), and V^(t) satisfy

δ2k ≤ σ_k⁶/(192ξ²kσ_1⁶), ‖U^(t) − U^*(t)‖_F ≤ σ_k²/(4ξσ_1²), and ‖V^(t) − V^*(t)‖_F ≤ σ_k²/(2ξσ_1).   (31)

We then have

‖V^(t+1) − V^*(t+1)‖_F ≤ σ_k²/(4ξσ_1²) and ‖U^(t) − U^*(t+1)‖_F ≤ σ_k²/(2ξσ_1).

Moreover, we have

‖V^(t+0.5) − V^*(t)‖_F ≤ √δ2k ‖V^(t) − V^*(t)‖_F + (2σ_k/ξ) ‖U^(t) − U^*(t)‖_F,   (32)
‖V^(t+1) − V^*(t+1)‖_F ≤ (2√δ2k/σ_k) ‖V^(t) − V^*(t)‖_F + (4/ξ) ‖U^(t) − U^*(t)‖_F,   (33)
‖U^(t) − U^*(t+1)‖_F ≤ (3σ_1√δ2k/σ_k) ‖V^(t) − V^*(t)‖_F + (6/ξ + 1)σ_1 ‖U^(t) − U^*(t)‖_F.   (34)
The proof of Corollary 3 is provided in Appendix B.5. Since the alternating gradient descent
algorithm updates U and V in a symmetric manner, we can establish similar results for a complete
iteration of updating U in the next corollary.
Corollary 4. Suppose that δ2k, V^(t+1), and U^(t) satisfy

δ2k ≤ σ_k⁶/(192ξ²kσ_1⁶), ‖V^(t+1) − V^*(t+1)‖_F ≤ σ_k²/(4ξσ_1²), and ‖U^(t) − U^*(t+1)‖_F ≤ σ_k²/(2ξσ_1).   (35)

We then have

‖U^(t+1) − U^*(t+2)‖_F ≤ σ_k²/(4ξσ_1²) and ‖V^(t+1) − V^*(t+2)‖_F ≤ σ_k²/(2ξσ_1).

Moreover, we have

‖U^(t+0.5) − U^*(t+1)‖_F ≤ √δ2k ‖U^(t) − U^*(t+1)‖_F + (2σ_k/ξ) ‖V^(t+1) − V^*(t+1)‖_F,   (36)
‖U^(t+1) − U^*(t+2)‖_F ≤ (2√δ2k/σ_k) ‖U^(t) − U^*(t+1)‖_F + (4/ξ) ‖V^(t+1) − V^*(t+1)‖_F,   (37)
‖V^(t+1) − V^*(t+2)‖_F ≤ (3σ_1√δ2k/σ_k) ‖U^(t) − U^*(t+1)‖_F + (6/ξ + 1)σ_1 ‖V^(t+1) − V^*(t+1)‖_F.   (38)
The proof of Corollary 4 directly follows Appendix B.5, and is therefore omitted.
Now we proceed with the proof of Theorem 1 for alternating gradient descent. Recall that Lemma 10 ensures that (31) of Corollary 3 holds for U^(0) and V^(0). Then Corollary 3 ensures that (35) of Corollary 4 holds for U^(0) and V^(1). By induction, Corollaries 3 and 4 can be applied recursively for all T iterations. For notational simplicity, we write (32)-(38) as
‖V^(t+0.5) − V^*(t)‖_F ≤ α_1 ‖V^(t) − V^*(t)‖_F + γ_1 σ_1 ‖U^(t) − U^*(t)‖_F,   (39)
‖V^(t+1) − V^*(t+1)‖_F ≤ α_2 ‖V^(t) − V^*(t)‖_F + γ_2 σ_1 ‖U^(t) − U^*(t)‖_F,   (40)
‖U^(t+0.5) − U^*(t+1)‖_F ≤ α_3 ‖U^(t) − U^*(t+1)‖_F + γ_3 σ_1 ‖V^(t+1) − V^*(t+1)‖_F,   (41)
σ_1 ‖U^(t+1) − U^*(t+1)‖_F ≤ α_4 ‖U^(t) − U^*(t+1)‖_F + γ_4 σ_1 · σ_1 ‖V^(t+1) − V^*(t+1)‖_F,   (42)
‖U^(t+1) − U^*(t+2)‖_F ≤ α_5 ‖V^(t+1) − V^*(t+2)‖_F + γ_5 σ_1 ‖U^(t+1) − U^*(t+1)‖_F,   (43)
‖V^(t+1) − V^*(t+2)‖_F ≤ α_6 ‖U^(t) − U^*(t+1)‖_F + γ_6 σ_1 ‖V^(t+1) − V^*(t+1)‖_F.   (44)

Note that we have γ_5, γ_6 ∈ (1, 2), but α_1,...,α_6, γ_1,..., and γ_4 can be sufficiently small as long as ξ is sufficiently large. We then have

‖U^(t+1) − U^*(t+2)‖_F
≤(i) α_5 ‖V^(t+1) − V^*(t+2)‖_F + γ_5 σ_1 ‖U^(t+1) − U^*(t+1)‖_F
≤(ii) α_5 α_6 ‖U^(t) − U^*(t+1)‖_F + α_5 γ_6 σ_1 ‖V^(t+1) − V^*(t+1)‖_F + γ_5 σ_1 ‖U^(t+1) − U^*(t+1)‖_F
≤(iii) (α_5 α_6 + γ_5 α_4) ‖U^(t) − U^*(t+1)‖_F + (γ_5 γ_4 σ_1 + α_5 γ_6) σ_1 ‖V^(t+1) − V^*(t+1)‖_F
≤(iv) (α_5 α_6 + γ_5 α_4) ‖U^(t) − U^*(t+1)‖_F + (γ_5 γ_4 σ_1 + α_5 γ_6) α_2 ‖V^(t) − V^*(t)‖_F + (γ_5 γ_4 σ_1 + α_5 γ_6) γ_2 σ_1 ‖U^(t) − U^*(t)‖_F,   (45)

where (i) comes from (43), (ii) comes from (44), (iii) comes from (42), and (iv) comes from (40). Similarly, we can obtain

‖V^(t+1) − V^*(t+2)‖_F ≤ α_6 ‖U^(t) − U^*(t+1)‖_F + γ_6 α_2 ‖V^(t) − V^*(t)‖_F + γ_6 γ_2 σ_1 ‖U^(t) − U^*(t)‖_F,   (46)
σ_1 ‖U^(t+1) − U^*(t+1)‖_F ≤ α_4 ‖U^(t) − U^*(t+1)‖_F + γ_4 α_2 ‖V^(t) − V^*(t)‖_F + γ_4 γ_2 σ_1 ‖U^(t) − U^*(t)‖_F,   (47)
‖U^(t+0.5) − U^*(t+1)‖_F ≤ α_3 ‖U^(t) − U^*(t+1)‖_F + γ_3 α_2 ‖V^(t) − V^*(t)‖_F + γ_3 γ_2 σ_1 ‖U^(t) − U^*(t)‖_F.   (48)

For simplicity, we define

φ_V^(t+1) = ‖V^(t+1) − V^*(t+1)‖_F, φ_V^(t+0.5) = ‖V^(t+0.5) − V^*(t)‖_F, φ̄_V^(t+1) = σ_1 ‖V^(t+1) − V^*(t+1)‖_F,
φ_U^(t+1) = ‖U^(t+1) − U^*(t+2)‖_F, φ_U^(t+0.5) = ‖U^(t+0.5) − U^*(t+1)‖_F, φ̄_U^(t+1) = σ_1 ‖U^(t+1) − U^*(t+1)‖_F.

Then combining (39), (40) with (45)-(48), we obtain

max{ φ_V^(t+1), φ_V^(t+0.5), φ̄_V^(t+1), φ_U^(t+1), φ_U^(t+0.5), φ̄_U^(t+1) } ≤ β max{ φ_V^(t), φ_U^(t), φ̄_U^(t) },   (49)

where β is a contraction coefficient defined as

β = max{α_5 α_6 + γ_5 α_4, α_6, α_4, α_3} + max{α_1, α_2, (γ_5 γ_4 σ_1 + α_5 γ_6), γ_6 α_2, γ_4 α_2, γ_3 α_2} + max{γ_1, γ_2, (γ_5 γ_4 σ_1 + α_5 γ_6)γ_2, γ_6 γ_2, γ_4 γ_2, γ_3 γ_2}.

Then we can choose ξ as a sufficiently large constant such that β < 1. By recursively applying (49) for t = 0, ..., T, we obtain

max{ φ_V^(T), φ_V^(T−0.5), φ̄_V^(T), φ_U^(T), φ_U^(T−0.5), φ̄_U^(T) } ≤ β max{ φ_V^(T−1), φ_U^(T−1), φ̄_U^(T−1) }
≤ β² max{ φ_V^(T−2), φ_U^(T−2), φ̄_U^(T−2) } ≤ ··· ≤ β^T max{ φ_V^(0), φ_U^(0), φ̄_U^(0) }.
By Corollary 3, we obtain

‖U^(0) − U^*(1)‖_F ≤ (3σ_1√δ2k/σ_k) ‖V^(0) − V^*(0)‖_F + (6/ξ + 1)σ_1 ‖U^(0) − U^*(0)‖_F
≤(i) (3σ_1/σ_k) · (σ_k³/(12ξσ_1³)) · (σ_k²/(2ξσ_1)) + (6/ξ + 1)σ_1 · (σ_k²/(4ξσ_1²))
=(ii) σ_k⁴/(8ξ²σ_1³) + 3σ_k²/(2ξ²σ_1) + σ_k²/(4ξσ_1) ≤(iii) σ_k²/(2ξσ_1),   (50)

where (i) and (ii) come from Lemma 10, and (iii) comes from the definition of ξ and σ_1 ≥ σ_k. Combining (50) with Lemma 10, we have

φ_V^(0), φ_U^(0), φ̄_U^(0) ≤ max{ σ_k²/(2ξσ_1), σ_k²/(4ξσ_1²) }.
Then we need at most

T = log( max{ σ_k²/(ξσ_1), σ_k²/(2ξσ_1²), σ_k²/ξ, σ_k²/(2ξσ_1) } · (1/ε) ) · log^(−1)(β^(−1))

iterations such that

‖V^(T) − V^*‖_F ≤ β^T max{ σ_k²/(2ξσ_1), σ_k²/(4ξσ_1²) } ≤ ε/2 and ‖U^(T) − U^*‖_F ≤ β^T max{ σ_k²/(2ξσ_1), σ_k²/(4ξσ_1²) } ≤ ε/(2σ_1).

We then follow similar lines to (26) in §4.2, and show ‖M^(T) − M^*‖_F ≤ ε, which completes the proof.
4.4 Proof of Theorem 1 (Gradient Descent)
Proof. The convergence analysis of the gradient descent algorithm is similar to that of the alternating gradient descent algorithm. The only difference is that, for updating U, the gradient descent algorithm employs V = V^(t) instead of V = V^(t+1) to calculate the gradient at U = U^(t). Everything else directly follows §4.3, and is therefore omitted.
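The alternating gradient descent scheme analyzed above can be sketched end-to-end for matrix sensing. The fragment below is a toy illustration, not the paper's exact procedure: gradients are normalized by d so the sensing operator is nearly isometric, the step size and iteration count are ad hoc, and the SVD of the back-projection Σ_i b_i A_i / d stands in for the SVP-based initialization.

```python
import numpy as np

def alt_grad_descent_sensing(A, b, k, T=300, eta=0.3):
    """Toy alternating gradient descent for matrix sensing.
    A: (d, m, n) sensing matrices, b: (d,) measurements b_i = <A_i, M*>."""
    d = len(A)
    Y = np.einsum("d,dmn->mn", b, A) / d            # back-projection of the measurements
    U, s, Vt = np.linalg.svd(Y)
    U = U[:, :k]                                     # orthonormal initial U
    V = (np.diag(s[:k]) @ Vt[:k]).T                  # n x k initial V
    for _ in range(T):
        r = np.einsum("dmn,mn->d", A, U @ V.T) - b   # residuals <A_i, U V^T> - b_i
        GV = np.einsum("d,dmn->mn", r, A).T @ U / d  # gradient with respect to V
        V, RV = np.linalg.qr(V - eta * GV)           # step on V, then QR renormalization
        U = U @ RV.T                                 # keep the product U V^T unchanged
        r = np.einsum("dmn,mn->d", A, U @ V.T) - b
        GU = np.einsum("d,dmn->mn", r, A) @ V / d    # symmetric step on U
        U, RU = np.linalg.qr(U - eta * GU)
        V = V @ RU.T
    return U @ V.T
```

The QR steps mirror the renormalization in the algorithm: each factorization (Q, R) replaces one factor by its orthonormal part while multiplying the other factor by R^⊤, so the product U V^⊤ is preserved exactly.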
5 Extensions to Matrix Completion
We then extend our methodology and theory to matrix completion problems. Let M^* ∈ R^(m×n) be the unknown low rank matrix of interest. We observe a subset of the entries of M^*, namely W ⊆ {1,...,m} × {1,...,n}. We assume that W is drawn uniformly at random, i.e., M^*_ij is observed independently with probability ρ̄ ∈ (0, 1]. To exactly recover M^*, a common assumption is the incoherence of M^*, which will be specified later. A popular approach for recovering M^* is to solve the following convex optimization problem

min_{M ∈ R^(m×n)} ‖M‖_*  subject to  P_W(M^*) = P_W(M),   (51)

where P_W : R^(m×n) → R^(m×n) is an operator defined as

[P_W(M)]_ij = M_ij if (i, j) ∈ W, and [P_W(M)]_ij = 0 otherwise.
Similar to matrix sensing, existing algorithms for solving (51) are computationally inefficient. Hence, in practice we usually consider the following nonconvex optimization problem

min_{U ∈ R^(m×k), V ∈ R^(n×k)} F_W(U, V), where F_W(U, V) = ½ ‖P_W(M^*) − P_W(UV^⊤)‖_F².   (52)

Similar to matrix sensing, (52) can also be efficiently solved by the gradient-based algorithms illustrated in Algorithm 2. For the convenience of the later convergence analysis, we partition the observation set W into 2T + 1 subsets W_0,...,W_2T by Algorithm 4. In practice, however, we do not need the partition scheme, i.e., we simply set W_0 = ··· = W_2T = W.
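The objective F_W in (52) and its partial gradients have a simple closed form once the observation set is encoded as a 0/1 mask. The following numpy sketch is illustrative; the function name and the mask encoding are assumptions of this fragment, not the paper's notation.

```python
import numpy as np

def completion_loss_grad(M_obs, mask, U, V):
    """F_W(U, V) = 0.5 * || P_W(M*) - P_W(U V^T) ||_F^2 and its gradients.
    M_obs holds observed entries (zeros elsewhere); mask is the 0/1 indicator of W."""
    R = mask * (U @ V.T) - M_obs          # P_W(U V^T) - P_W(M*)
    loss = 0.5 * np.sum(R ** 2)
    return loss, R @ V, R.T @ U           # loss, dF/dU, dF/dV
```

Since the residual R is already supported on W, the gradients reduce to the plain products R V and R^⊤ U.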
Before we present the convergence analysis, we first introduce an assumption known as the incoherence property.

Assumption 2 (Incoherence Property). The target rank k matrix M^* is incoherent with parameter µ, i.e., given the rank k singular value decomposition M^* = U^* Σ^* V^*⊤, we have

max_i ‖U^*_{i*}‖_2 ≤ µ√(k/m) and max_j ‖V^*_{j*}‖_2 ≤ µ√(k/n).
Roughly speaking, the incoherence assumption guarantees that each entry of M^* contains a similar amount of information, which makes it feasible to complete M^* when its entries are missing uniformly at random. The following theorem establishes the iteration complexity and the estimation error in Frobenius norm.
Theorem 2. Suppose that there exists a universal constant C_4 such that ρ̄ satisfies

ρ̄ ≥ C_4 µ²k³ log n log(1/ε) / m,   (53)

where ε is the pre-specified precision. Then there exist an η and universal constants C_5 and C_6 such that for any T ≥ C_5 log(C_6/ε), we have ‖M^(T) − M^*‖_F ≤ ε with high probability.
Algorithm 2 A family of nonconvex optimization algorithms for matrix completion. The incoherence factorization algorithm IF(·) is illustrated in Algorithm 3, and the partition algorithm Partition(·), which is proposed by Hardt and Wootters [2014], is provided in Algorithm 4 of Appendix C for the sake of completeness. The initialization procedures INT_U(·) and INT_V(·) are provided in Algorithm 5 and Algorithm 6 of Appendix D for the sake of completeness. Here F_W(·) is defined in (52).
Input: P_W(M^*)
Parameter: Step size η, total number of iterations T
({W_t}_{t=0}^{2T}, ρ̃) ← Partition(W), P_{W_0}(M̃) ← P_{W_0}(M^*), M̃_ij ← 0 for all (i, j) ∉ W_0
(U^(0), V^(0)) ← INT_U(M̃), (V^(0), U^(0)) ← INT_V(M̃)
For t = 0, ..., T − 1
  Updating V:
    Alternating exact minimization: V^(t+0.5) ← argmin_V F_{W_{2t+1}}(U^(t), V); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5))
    Alternating gradient descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t) ← U^(t) R_V^(t+0.5)⊤
    Gradient descent: V^(t+0.5) ← V^(t) − η ∇_V F_{W_{2t+1}}(U^(t), V^(t)); (V^(t+1), R_V^(t+0.5)) ← IF(V^(t+0.5)), U^(t+1) ← U^(t) R_V^(t+0.5)⊤
  Updating U:
    Alternating exact minimization: U^(t+0.5) ← argmin_U F_{W_{2t+2}}(U, V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5))
    Alternating gradient descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t+1)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t+1) R_U^(t+0.5)⊤
    Gradient descent: U^(t+0.5) ← U^(t) − η ∇_U F_{W_{2t+2}}(U^(t), V^(t)); (U^(t+1), R_U^(t+0.5)) ← IF(U^(t+0.5)), V^(t+1) ← V^(t) R_U^(t+0.5)⊤
End for
Output: M^(T) ← U^(T−0.5) V^(T)⊤ (for gradient descent we use U^(T) V^(T)⊤)
Algorithm 3 The incoherence factorization algorithm for matrix completion. It guarantees that
the solutions satisfy the incoherence condition throughout all iterations.
Input: W^in
Parameter: Incoherence parameter µ
r ← number of rows of W^in
(W^in, R_W^in) ← QR(W^in)
W̃ ← argmin_W ‖W − W^in‖_F² subject to max_j ‖W_{j*}‖_2 ≤ µ√(k/r)
(W^out, R_W^tmp) ← QR(W̃)
R_W^out ← W^out⊤ W^in
Output: W^out, R_W^out
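Algorithm 3 can be sketched in a few lines of numpy. Two details in this fragment are assumptions of the sketch rather than the paper's specification: the constrained projection onto the row-norm ball is implemented by the standard closed-form row clipping, and a small floor guards against zero rows.

```python
import numpy as np

def incoherence_factorization(W_in, mu, k):
    """Sketch of IF(.): orthonormalize, clip row norms to mu * sqrt(k/r),
    re-orthonormalize, and return (W_out, R_out) with W_out orthonormal."""
    r = W_in.shape[0]
    Q, _ = np.linalg.qr(W_in)
    cap = mu * np.sqrt(k / r)
    norms = np.linalg.norm(Q, axis=1, keepdims=True)
    W_tilde = Q * np.minimum(1.0, cap / np.maximum(norms, 1e-12))  # row-norm clipping
    W_out, _ = np.linalg.qr(W_tilde)
    R_out = W_out.T @ W_in
    return W_out, R_out
```

When µ is large enough that no row is clipped, W_out spans the same column space as W^in, so W_out R_out reproduces W^in exactly.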
The proof of Theorem 2 is provided in §F.1, §F.2, and §F.3. Theorem 2 implies that all three nonconvex optimization algorithms converge to the global optimum at a geometric rate. Furthermore, our results indicate that completing the true low rank matrix M^* up to accuracy ε requires the entry observation probability ρ̄ to satisfy

ρ̄ = Ω(µ²k³ log n log(1/ε)/m).   (54)

This matches the result established by Hardt [2014], which is the state-of-the-art result for alternating minimization; moreover, our analysis covers three nonconvex optimization algorithms. In fact, the sample complexity in (54) depends on a polynomial of σ_max(M^*)/σ_min(M^*), which is a constant since in this paper we assume that σ_max(M^*) and σ_min(M^*) are constants. If we allow σ_max(M^*)/σ_min(M^*) to increase, we can replace the QR decomposition in Algorithm 3 with the smooth QR decomposition proposed by Hardt and Wootters [2014] and achieve a log(σ_max(M^*)/σ_min(M^*)) dependency on the condition number with a more involved proof. See Hardt and Wootters [2014] for more details. However, in this paper, our primary focus is on the dependency on k, n, and m, rather than optimizing the dependency on the condition number.
6 Numerical Experiments
We present numerical experiments to support our theoretical analysis. We first consider a matrix sensing problem with m = 30, n = 40, and k = 5. We vary d from 300 to 900. The entries of the A_i's are independently sampled from N(0, 1). We then generate M = Ũ Ṽ^⊤, where Ũ ∈ R^(m×k) and Ṽ ∈ R^(n×k) are two matrices with all entries independently sampled from N(0, 1/k). We then generate d measurements by b_i = ⟨A_i, M⟩ for i = 1,...,d. Figure 1 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for d = 600 and d = 900. Both algorithms fail for d = 300, because d = 300 is below the minimum sample complexity required for exact matrix recovery.
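The synthetic sensing setup described above can be reproduced with a short generator; the default sizes mirror the experiment, and the function name is an assumption of this sketch.

```python
import numpy as np

def make_sensing_instance(m=30, n=40, k=5, d=600, seed=0):
    """Gaussian sensing matrices A_i and M = U~ V~^T with N(0, 1/k) entries,
    plus measurements b_i = <A_i, M>, as in the matrix sensing experiment."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d, m, n))
    U_tilde = rng.normal(scale=np.sqrt(1.0 / k), size=(m, k))
    V_tilde = rng.normal(scale=np.sqrt(1.0 / k), size=(n, k))
    M = U_tilde @ V_tilde.T
    b = np.einsum("dmn,mn->d", A, M)   # b_i = <A_i, M>
    return A, b, M
```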
We then consider a matrix completion problem with m = 1000, n = 50, and k = 5. We vary ρ̄ from 0.025 to 0.1. We generate M = Ũ Ṽ^⊤, where Ũ ∈ R^(m×k) and Ṽ ∈ R^(n×k) are two matrices with all entries independently sampled from N(0, 1/k). The observation set is generated uniformly at random with probability ρ̄. Figure 2 illustrates the empirical performance of the alternating exact minimization and alternating gradient descent algorithms for a single realization. The step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure. We see that both algorithms attain a linear rate of convergence for ρ̄ = 0.05 and ρ̄ = 0.1. Both algorithms fail for ρ̄ = 0.025, because the entry observation probability is below the minimum sample complexity required for exact matrix recovery.
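The completion experiment's observation model is likewise easy to reproduce: each entry is revealed independently with probability ρ̄. This generator is an illustrative sketch with assumed names.

```python
import numpy as np

def make_completion_instance(m=1000, n=50, k=5, rho=0.05, seed=0):
    """Rank-k matrix M = U~ V~^T with an entrywise Bernoulli(rho) observation mask,
    as in the matrix completion experiment."""
    rng = np.random.default_rng(seed)
    U_tilde = rng.normal(scale=np.sqrt(1.0 / k), size=(m, k))
    V_tilde = rng.normal(scale=np.sqrt(1.0 / k), size=(n, k))
    M = U_tilde @ V_tilde.T
    mask = (rng.random((m, n)) < rho).astype(float)   # indicator of the observed set W
    return M, mask, mask * M
```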
7 Conclusion
In this paper, we propose a generic analysis for characterizing the convergence properties of nonconvex optimization algorithms. By exploiting the inexact first order oracle, we prove that a broad class of nonconvex optimization algorithms converge geometrically to the global optimum and exactly recover the true low rank matrices under suitable conditions.

[Figure 1 appears here: two log-scale plots of estimation error versus number of iterations, with legends d = 300, d = 600, d = 900; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm.]

Figure 1: Two illustrative examples for matrix sensing. The vertical axis corresponds to the estimation error ‖M^(t) − M‖_F. The horizontal axis corresponds to the number of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain a linear rate of convergence for d = 600 and d = 900. But both algorithms fail for d = 300, because the sample size is not large enough to guarantee proper initial solutions.
A Lemmas for Theorem 1 (Alternating Exact Minimization)

A.1 Proof of Lemma 1
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. Then we define two nk × nk block matrices S^(t) and G^(t), whose (p, q)-th n × n blocks, for 1 ≤ p, q ≤ k, are

S^(t)_pq = Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^(t)⊤_{*q} A_i and G^(t)_pq = Σ_{i=1}^d A_i^⊤ U^*_{*p} U^*⊤_{*q} A_i.

Note that S^(t) and G^(t) are essentially the partial Hessian matrices ∇²_V F(U^(t), V) and ∇²_V F(U^*, V) for a vectorized V, i.e., vec(V) ∈ R^(nk). Before we proceed with the main proof, we first introduce the following lemma.
[Figure 2 appears here: two log-scale plots of estimation error versus number of iterations, with legends ρ̄ = 0.025, ρ̄ = 0.05, ρ̄ = 0.1; panel (a) shows the alternating exact minimization algorithm and panel (b) the alternating gradient descent algorithm.]

Figure 2: Two illustrative examples for matrix completion. The vertical axis corresponds to the estimation error ‖M^(t) − M‖_F. The horizontal axis corresponds to the number of iterations. Both the alternating exact minimization and alternating gradient descent algorithms attain a linear rate of convergence for ρ̄ = 0.05 and ρ̄ = 0.1. But both algorithms fail for ρ̄ = 0.025, because the entry observation probability is not large enough to guarantee proper initial solutions.
Lemma 11. Suppose that A(·) satisfies 2k-RIP with parameter δ2k. We then have

1 + δ2k ≥ σ_max(S^(t)) ≥ σ_min(S^(t)) ≥ 1 − δ2k.

The proof of Lemma 11 is provided in Appendix A.7. Note that Lemma 11 is also applicable to G^(t), since G^(t) shares the same structure as S^(t).

We then proceed with the proof of Lemma 1. Given a fixed U, F(U, V) is a quadratic function of V. Therefore we have

F(U, V') = F(U, V) + ⟨∇_V F(U, V), V' − V⟩ + ½⟨vec(V') − vec(V), ∇²_V F(U, V)(vec(V') − vec(V))⟩,

which further implies

F(U, V') − F(U, V) − ⟨∇_V F(U, V), V' − V⟩ ≤ (σ_max(∇²_V F(U, V))/2) ‖V' − V‖_F²,
F(U, V') − F(U, V) − ⟨∇_V F(U, V), V' − V⟩ ≥ (σ_min(∇²_V F(U, V))/2) ‖V' − V‖_F².

Then we can verify that ∇²_V F(U, V) also shares the same structure as S^(t). Thus, applying Lemma 11 to the above two inequalities, we complete the proof.
A.2 Proof of Lemma 3
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. We define two nk × nk block matrices J^(t) and K^(t), whose (p, q)-th n × n blocks, for 1 ≤ p, q ≤ k, are

J^(t)_pq = Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^*⊤_{*q} A_i and K^(t)_pq = (U^(t)⊤_{*p} U^*_{*q}) I_n.

Before we proceed with the main proof, we first introduce the following lemmas.

Lemma 12. Suppose that A(·) satisfies 2k-RIP with parameter δ2k. We then have

‖S^(t)K^(t) − J^(t)‖_2 ≤ 3δ2k√k ‖U^(t) − U^*‖_F.

The proof of Lemma 12 is provided in Appendix A.8. Note that Lemma 12 is also applicable to G^(t)K^(t) − J^(t), since G^(t) and S^(t) share the same structure.

Lemma 13. Given F ∈ R^(k×k), we define an nk × nk matrix F̄ whose (p, q)-th n × n block is F_pq I_n for 1 ≤ p, q ≤ k. For any V ∈ R^(n×k), let v = vec(V) ∈ R^(nk); then we have ‖F̄v‖_2 = ‖FV^⊤‖_F.

Proof. By linear algebra, the p-th block of F̄v equals Σ_{ℓ=1}^k F_pℓ V_{*ℓ}, which is the p-th column of V F^⊤. Since ‖V F^⊤‖_F = ‖FV^⊤‖_F, this completes the proof.

We then proceed with the proof of Lemma 3. Since b_i = tr(V^*⊤ A_i^⊤ U^*), we rewrite F(U^(t), V) as

F(U^(t), V) = ½ Σ_{i=1}^d ( tr(V^⊤ A_i^⊤ U^(t)) − b_i )² = ½ Σ_{i=1}^d ( Σ_{j=1}^k V_{*j}^⊤ A_i^⊤ U^(t)_{*j} − Σ_{j=1}^k V^*⊤_{*j} A_i^⊤ U^*_{*j} )².

For notational simplicity, we define v = vec(V). Since V^(t+0.5) minimizes F(U^(t), V), we have

vec(∇_V F(U^(t), V^(t+0.5))) = S^(t) v^(t+0.5) − J^(t) v^* = 0.

Solving the above system of equations, we obtain

v^(t+0.5) = (S^(t))^(−1) J^(t) v^*.   (55)

Meanwhile, we have

vec(∇_V F(U^*, V^(t+0.5))) = G^(t) v^(t+0.5) − G^(t) v^* = G^(t) ((S^(t))^(−1) J^(t) − I_nk) v^*,   (56)

where the second equality comes from (55). By the triangle inequality, we have

‖((S^(t))^(−1) J^(t) − I_nk) v^*‖_2 ≤ ‖(K^(t) − I_nk) v^*‖_2 + ‖(S^(t))^(−1) (J^(t) − S^(t)K^(t)) v^*‖_2
≤ ‖(U^(t)⊤U^* − I_k) V^*⊤‖_F + ‖(S^(t))^(−1)‖_2 ‖(J^(t) − S^(t)K^(t)) v^*‖_2
≤ ‖U^(t)⊤U^* − I_k‖_F ‖V^*‖_2 + ‖(S^(t))^(−1)‖_2 ‖(J^(t) − S^(t)K^(t)) v^*‖_2,   (57)

where the second inequality comes from Lemma 13. Plugging (57) into (56), we have

‖vec(∇_V F(U^*, V^(t+0.5)))‖_2 ≤ ‖G^(t)‖_2 ‖((S^(t))^(−1) J^(t) − I_nk) v^*‖_2
≤(i) (1 + δ2k)( σ_1 ‖U^(t)⊤U^* − I_k‖_F + ‖(S^(t))^(−1)‖_2 ‖S^(t)K^(t) − J^(t)‖_2 σ_1√k )
≤(ii) (1 + δ2k)σ_1( ‖(U^(t) − U^*)^⊤(U^(t) − U^*)‖_F + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F )
≤(iii) (1 + δ2k)σ_1( ‖U^(t) − U^*‖_F² + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F )
≤(iv) ((1 − δ2k)σ_k/(2ξ)) ‖U^(t) − U^*‖_F,

where (i) comes from Lemma 11 and ‖V^*‖_2 = ‖M^*‖_2 = σ_1 and ‖V^*‖_F = ‖v^*‖_2 ≤ σ_1√k, (ii) comes from Lemmas 11 and 12, (iii) comes from the Cauchy-Schwarz inequality, and (iv) comes from (16). Since ∇_V F(U^(t), V^(t+0.5)) = 0, we further obtain

E(V^(t+0.5), V^(t+0.5), U^(t)) ≤ ((1 − δ2k)σ_k/(2ξ)) ‖U^(t) − U^*‖_F,

which completes the proof.
A.3 Proof of Lemma 4
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. By the strong convexity of F(U^*, ·), we have

F(U^*, V^*) − ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F² ≥ F(U^*, V^(t+0.5)) + ⟨∇_V F(U^*, V^(t+0.5)), V^* − V^(t+0.5)⟩.   (58)

By the strong convexity of F(U^*, ·) again, we have

F(U^*, V^(t+0.5)) ≥ F(U^*, V^*) + ⟨∇_V F(U^*, V^*), V^(t+0.5) − V^*⟩ + ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F²
≥ F(U^*, V^*) + ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F²,   (59)

where the last inequality comes from the optimality condition of V^* = argmin_V F(U^*, V), i.e.,

⟨∇_V F(U^*, V^*), V^(t+0.5) − V^*⟩ ≥ 0.

Meanwhile, since V^(t+0.5) minimizes F(U^(t), ·), we have the optimality condition

⟨∇_V F(U^(t), V^(t+0.5)), V^* − V^(t+0.5)⟩ ≥ 0,

which further implies

⟨∇_V F(U^*, V^(t+0.5)), V^* − V^(t+0.5)⟩ ≥ ⟨∇_V F(U^*, V^(t+0.5)) − ∇_V F(U^(t), V^(t+0.5)), V^* − V^(t+0.5)⟩.   (60)

Combining (58) and (59) with (60), we obtain

‖V^(t+0.5) − V^*‖_F ≤ (1/(1 − δ2k)) E(V^(t+0.5), V^(t+0.5), U^(t)),

which completes the proof.
A.4 Proof of Lemma 5
Proof. Before we proceed with the proof, we first introduce the following lemma.

Lemma 14. Suppose that A^* ∈ R^(n×k) is a rank k matrix. Let E ∈ R^(n×k) satisfy ‖E‖_2 ‖A^*†‖_2 < 1. Then given a QR decomposition (A^* + E) = QR, there exists a factorization A^* = Q^*O^* such that Q^* ∈ R^(n×k) is an orthonormal matrix, and it satisfies

‖Q − Q^*‖_F ≤ √2 ‖A^*†‖_2 ‖E‖_F / (1 − ‖E‖_2 ‖A^*†‖_2).

The proof of Lemma 14 is provided in Stewart et al. [1990], therefore omitted.
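Lemma 14 can be checked numerically. In this illustrative sketch we take Q^* to be the orthonormal factor of A^* aligned to Q by an orthogonal Procrustes rotation; this choice is an assumption for the check (the lemma only asserts that some admissible Q^* satisfies the bound, and the aligned one is the closest).

```python
import numpy as np

def qr_perturbation_gap(A_star, E):
    """Compare ||Q - Q*||_F with the Lemma 14 bound
    sqrt(2) ||A*^dagger||_2 ||E||_F / (1 - ||E||_2 ||A*^dagger||_2)."""
    Q, _ = np.linalg.qr(A_star + E)
    Qs, _ = np.linalg.qr(A_star)
    # best orthogonal alignment of Qs to Q (orthogonal Procrustes)
    Uo, _, Vto = np.linalg.svd(Qs.T @ Q)
    Q_star = Qs @ (Uo @ Vto)
    pinv2 = 1.0 / np.linalg.svd(A_star, compute_uv=False)[-1]  # ||A*^dagger||_2
    bound = np.sqrt(2) * pinv2 * np.linalg.norm(E, "fro") / (
        1.0 - np.linalg.norm(E, 2) * pinv2)
    return np.linalg.norm(Q - Q_star, "fro"), bound
```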
We then proceed with the proof of Lemma 5. We take A^* = V^*(t) and E = V^(t+0.5) − V^*(t) in Lemma 14. We can verify that

‖V^(t+0.5) − V^*(t)‖_2 ‖V^*(t)†‖_2 ≤ ‖V^(t+0.5) − V^*(t)‖_F / σ_k ≤ 1/4.

Then there exists a factorization V^*(t) = V^*(t+1) O^* such that V^*(t+1) is an orthonormal matrix, and it satisfies

‖V^(t+1) − V^*(t+1)‖_F ≤ 2 ‖V^*(t)†‖_2 ‖V^(t+0.5) − V^*(t)‖_F ≤ (2/σ_k) ‖V^(t+0.5) − V^*(t)‖_F.
A.5 Proof of Lemma 6
Proof. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 15. Let b = A(M^*) + ε, where M^* is a rank-k matrix and A is a linear measurement operator that satisfies 2k-RIP with constant δ2k < 1/3. Let X^(t+1) be the (t+1)-th iterate of SVP. Then we have

‖A(X^(t+1)) − b‖_2² ≤ ‖A(M^*) − b‖_2² + 2δ2k ‖A(X^(t)) − b‖_2².

The proof of Lemma 15 is provided in Jain et al. [2010], therefore omitted. We then explain the implication of Lemma 15. Jain et al. [2010] show that X^(t+1) is obtained by taking a projected gradient iteration over X^(t) using step size 1/(1 + δ2k). Then taking X^(0) = 0, we have

X^(1) = (1/(1 + δ2k)) U^(0) Σ^(0) V^(0)⊤.

Then Lemma 15 implies

‖A( U^(0) Σ^(0) V^(0)⊤/(1 + δ2k) − M^* )‖_2² ≤ 4δ2k ‖A(M^*)‖_2².   (61)

Since A(·) satisfies 2k-RIP, (61) further implies

‖U^(0) Σ^(0) V^(0)⊤/(1 + δ2k) − M^*‖_F² ≤ 4δ2k(1 + 3δ2k) ‖M^*‖_F².

We then project each column of M^* into the subspace spanned by {U^(0)_{*i}}_{i=1}^k, and obtain

‖U^(0) U^(0)⊤ M^* − M^*‖_F² ≤ 6δ2k ‖M^*‖_F².

Let U^(0)_⊥ denote the orthonormal complement of U^(0), i.e.,

U^(0)⊤_⊥ U^(0)_⊥ = I_{m−k} and U^(0)⊤_⊥ U^(0) = 0.

Then given a compact singular value decomposition M^* = Ũ^* D̃^* Ṽ^*⊤, we have

(6δ2k kσ_1²)/σ_k² ≥ ‖(U^(0) U^(0)⊤ − I_m) Ũ^*‖_F² = ‖U^(0)⊤_⊥ Ũ^*‖_F².

Thus Lemma 2 guarantees that for O^* = argmin_{O⊤O=I_k} ‖U^(0) − Ũ^*O‖_F, we have

‖U^(0) − Ũ^*O^*‖_F ≤ √2 ‖U^(0)⊤_⊥ Ũ^*‖_F ≤ (2√(3δ2k k) σ_1)/σ_k.

We define U^*(0) = Ũ^*O^*. Then combining the above inequality with (18), we have

‖U^(0) − U^*(0)‖_F ≤ (1 − δ2k)σ_k / (4ξ(1 + δ2k)σ_1).   (62)

Meanwhile, we define V^*(0) = Ṽ^* D̃^* O^*. Then we have U^*(0) V^*(0)⊤ = Ũ^* O^* O^*⊤ D̃^* Ṽ^*⊤ = M^*.
A.6 Proof of Corollary 1
Proof. Since (19) ensures that (16) of Lemma 3 holds, we have

‖V^(t+0.5) − V^*(t)‖_F ≤(i) (1/(1 − δ2k)) E(V^(t+0.5), V^(t+0.5), U^(t)) ≤ (1/(1 − δ2k)) · ((1 − δ2k)σ_k/(2ξ)) ‖U^(t) − U^*(t)‖_F
≤(ii) (σ_k/(2ξ)) · (1 − δ2k)σ_k / (4ξ(1 + δ2k)σ_1) = (1 − δ2k)σ_k² / (8ξ²(1 + δ2k)σ_1) ≤(iii) σ_k/4,   (63)

where (i) comes from Lemma 4, (ii) comes from (19), and (iii) comes from the definition of ξ and σ_k ≤ σ_1. Since (63) ensures that (17) of Lemma 5 holds for V^(t+0.5), we obtain

‖V^(t+1) − V^*(t+1)‖_F ≤ (2/σ_k) ‖V^(t+0.5) − V^*(t)‖_F ≤(i) (1/ξ) ‖U^(t) − U^*(t)‖_F ≤(ii) (1 − δ2k)σ_k / (4ξ(1 + δ2k)σ_1),   (64)

where (i) comes from (63), and (ii) comes from the definition of ξ and (19).
A.7 Proof of Lemma 11
Proof. We consider an arbitrary W ∈ R^(n×k) such that ‖W‖_F = 1. Let w = vec(W). Then we have

w^⊤ S^(t) w = Σ_{p,q=1}^k W^⊤_{*p} S^(t)_pq W_{*q} = Σ_{p,q=1}^k W^⊤_{*p} ( Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^(t)⊤_{*q} A_i ) W_{*q}
= Σ_{i=1}^d ( Σ_{p=1}^k W^⊤_{*p} A_i^⊤ U^(t)_{*p} )( Σ_{q=1}^k U^(t)⊤_{*q} A_i W_{*q} ) = Σ_{i=1}^d tr(W^⊤ A_i^⊤ U^(t))² = ‖A(U^(t) W^⊤)‖_2².

Since A(·) satisfies 2k-RIP, we have

‖A(U^(t) W^⊤)‖_2² ≥ (1 − δ2k) ‖U^(t) W^⊤‖_F² = (1 − δ2k) ‖W‖_F² = 1 − δ2k,
‖A(U^(t) W^⊤)‖_2² ≤ (1 + δ2k) ‖U^(t) W^⊤‖_F² = (1 + δ2k) ‖W‖_F² = 1 + δ2k.

Since W is arbitrary, we have

σ_min(S^(t)) = min_{‖w‖_2=1} w^⊤ S^(t) w ≥ 1 − δ2k and σ_max(S^(t)) = max_{‖w‖_2=1} w^⊤ S^(t) w ≤ 1 + δ2k.
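The identity w^⊤ S^(t) w = ‖A(U^(t) W^⊤)‖_2² used above can be verified numerically by assembling S^(t) blockwise. The following sketch is illustrative (small random sizes, assumed function name), with vec(W) taken column-major to match the block layout.

```python
import numpy as np

def build_S(A, U):
    """Assemble the nk x nk block matrix S whose (p, q)-th n x n block is
    S_pq = sum_i A_i^T U_p U_q^T A_i, with U_p the p-th column of U."""
    d, m, n = A.shape
    k = U.shape[1]
    S = np.zeros((n * k, n * k))
    for Ai in A:
        B = Ai.T @ U                       # B[:, p] = A_i^T U_p
        for p in range(k):
            for q in range(k):
                S[p * n:(p + 1) * n, q * n:(q + 1) * n] += np.outer(B[:, p], B[:, q])
    return S
```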
A.8 Proof of Lemma 12
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 16. Suppose A(·) satisfies 2k-RIP. For any U, U' ∈ R^(m×k) and V, V' ∈ R^(n×k), we have

|⟨A(UV^⊤), A(U'V'^⊤)⟩ − ⟨U^⊤U', V^⊤V'⟩| ≤ 3δ2k ‖UV^⊤‖_F · ‖U'V'^⊤‖_F.

The proof of Lemma 16 is provided in Jain et al. [2013], and hence omitted.

We now proceed with the proof of Lemma 12. We consider arbitrary W, Z ∈ R^(n×k) such that ‖W‖_F = ‖Z‖_F = 1. Let w = vec(W) and z = vec(Z). Then we have

w^⊤(S^(t)K^(t) − J^(t))z = Σ_{p,q=1}^k W^⊤_{*p} [S^(t)K^(t) − J^(t)]_pq Z_{*q}.

We consider the decomposition

[S^(t)K^(t) − J^(t)]_pq = Σ_{ℓ=1}^k S^(t)_pℓ K^(t)_ℓq − J^(t)_pq
= Σ_{ℓ=1}^k ( Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^(t)⊤_{*ℓ} A_i ) (U^(t)⊤_{*ℓ} U^*_{*q}) − Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^*⊤_{*q} A_i
= Σ_{i=1}^d A_i^⊤ U^(t)_{*p} U^*⊤_{*q} (U^(t)U^(t)⊤ − I_m) A_i,

which further implies

w^⊤(S^(t)K^(t) − J^(t))z = Σ_{i=1}^d tr(W^⊤ A_i^⊤ U^(t)) tr( Z^⊤ A_i^⊤ (U^(t)U^(t)⊤ − I_m) U^* ) = ⟨A(U^(t)W^⊤), A((U^(t)U^(t)⊤ − I_m)U^*Z^⊤)⟩.   (65)

Since A(·) satisfies 2k-RIP, by Lemma 16 we obtain

w^⊤(S^(t)K^(t) − J^(t))z ≤ ⟨U^(t)⊤(U^(t)U^(t)⊤ − I_m)U^*, W^⊤Z⟩ + 3δ2k ‖U^(t)W^⊤‖_F ‖(U^(t)U^(t)⊤ − I_m)U^*Z^⊤‖_F
≤ 3δ2k ‖(U^(t)U^(t)⊤ − I_m)U^*‖_F,   (66)

where the last inequality comes from U^(t)⊤(U^(t)U^(t)⊤ − I_m) = 0, ‖U^(t)W^⊤‖_F = ‖W‖_F = 1, and ‖(U^(t)U^(t)⊤ − I_m)U^*Z^⊤‖_F ≤ ‖(U^(t)U^(t)⊤ − I_m)U^*‖_F ‖Z‖_F. Let U^(t)_⊥ ∈ R^(m×(m−k)) denote the orthonormal complement of U^(t) such that U^(t)⊤ U^(t)_⊥ = 0 and U^(t)⊤_⊥ U^(t)_⊥ = I_{m−k}. Then we have

I_m − U^(t)U^(t)⊤ = U^(t)_⊥ U^(t)⊤_⊥,

which implies

‖(U^(t)U^(t)⊤ − I_m)U^*‖_F = ‖U^(t)_⊥ U^(t)⊤_⊥ U^*‖_F ≤ ‖U^(t)⊤_⊥ U^*‖_F = ‖U^(t)⊤_⊥ U^(t) − U^(t)⊤_⊥ U^*‖_F ≤ ‖U^(t) − U^*‖_F.   (67)

Combining (66) with (67), we obtain

w^⊤(S^(t)K^(t) − J^(t))z ≤ 3δ2k ‖U^(t) − U^*‖_F ≤ 3δ2k√k ‖U^(t) − U^*‖_F.   (68)

Since W and Z are arbitrary, (68) implies

‖S^(t)K^(t) − J^(t)‖_2 = max_{‖w‖_2=1, ‖z‖_2=1} w^⊤(S^(t)K^(t) − J^(t))z ≤ 3δ2k√k ‖U^(t) − U^*‖_F,

which completes the proof.
B Lemmas for Theorem 1 (Alternating Gradient Descent)

B.1 Proof of Lemma 7
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. We have

vec(∇_V F(U^(t), V^(t))) = S^(t) v^(t) − J^(t) v^* and vec(∇_V F(U^*, V^(t))) = G^(t) v^(t) − G^(t) v^*.

Therefore, we further obtain

‖∇_V F(U^(t), V^(t)) − ∇_V F(U^*, V^(t))‖_F
= ‖(S^(t) − J^(t))(v^(t) − v^*) + (S^(t) − J^(t)) v^* + (J^(t) − G^(t))(v^(t) − v^*)‖_2
≤ ‖(S^(t) − J^(t))(v^(t) − v^*)‖_2 + ‖(S^(t) − J^(t)) v^*‖_2 + ‖(J^(t) − G^(t))(v^(t) − v^*)‖_2
≤ ‖S^(t)‖_2 ‖((S^(t))^(−1) J^(t) − I_nk)(v^(t) − v^*)‖_2 + ‖S^(t)‖_2 ‖((S^(t))^(−1) J^(t) − I_nk) v^*‖_2
+ ‖G^(t)‖_2 ‖((G^(t))^(−1) J^(t) − I_nk)(v^(t) − v^*)‖_2.   (69)

Recall that Lemma 12 is also applicable to G^(t)K^(t) − J^(t). Since we have

‖V^(t) − V^*‖_2 ≤ ‖V^(t) − V^*‖_F = ‖v^(t) − v^*‖_2 ≤ σ_1,

following similar lines to Appendix A.2, we can show

‖((S^(t))^(−1) J^(t) − I_nk) v^*‖_2 ≤ σ_1 ‖U^(t) − U^*‖_F² + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F,
‖((G^(t))^(−1) J^(t) − I_nk)(v^(t) − v^*)‖_2 ≤ σ_1 ‖U^(t) − U^*‖_F² + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F,
‖((S^(t))^(−1) J^(t) − I_nk)(v^(t) − v^*)‖_2 ≤ σ_1 ‖U^(t) − U^*‖_F² + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F.

Combining the above three inequalities with (69), we have

‖∇_V F(U^(t), V^(t)) − ∇_V F(U^*, V^(t))‖_F ≤ 2(1 + δ2k)( σ_1 ‖U^(t) − U^*‖_F² + (3δ2k k/(1 − δ2k)) ‖U^(t) − U^*‖_F ).   (70)

Since U^(t), δ2k, and ξ satisfy (27), (70) further implies

E(V^(t+0.5), V^(t), U^(t)) = ‖∇_V F(U^(t), V^(t)) − ∇_V F(U^*, V^(t))‖_F ≤ ((1 + δ2k)σ_k/ξ) ‖U^(t) − U^*‖_F,

which completes the proof.
B.2 Proof of Lemma 8
Proof. For notational convenience, we omit the index t in U^*(t) and V^*(t), and denote them by U^* and V^* respectively. By the strong convexity of F(U^*, ·), we have

F(U^*, V^*) − ((1 − δ2k)/2) ‖V^(t) − V^*‖_F² ≥ F(U^*, V^(t)) + ⟨∇_V F(U^*, V^(t)), V^* − V^(t)⟩
= F(U^*, V^(t)) + ⟨∇_V F(U^*, V^(t)), V^(t+0.5) − V^(t)⟩ + ⟨∇_V F(U^*, V^(t)), V^* − V^(t+0.5)⟩.   (71)

Meanwhile, we define

Q(V; U^*, V^(t)) = F(U^*, V^(t)) + ⟨∇_V F(U^*, V^(t)), V − V^(t)⟩ + (1/(2η)) ‖V − V^(t)‖_F².

Since η satisfies (28) and F(U^*, V) is strongly smooth in V for a fixed orthonormal U^*, we have

Q(V; U^*, V^(t)) ≥ F(U^*, V).

Combining the above two inequalities, we obtain

F(U^*, V^(t)) + ⟨∇_V F(U^*, V^(t)), V^(t+0.5) − V^(t)⟩ = Q(V^(t+0.5); U^*, V^(t)) − (1/(2η)) ‖V^(t+0.5) − V^(t)‖_F²
≥ F(U^*, V^(t+0.5)) − (1/(2η)) ‖V^(t+0.5) − V^(t)‖_F².   (72)

Moreover, by the strong convexity of F(U^*, ·) again, we have

F(U^*, V^(t+0.5)) ≥ F(U^*, V^*) + ⟨∇_V F(U^*, V^*), V^(t+0.5) − V^*⟩ + ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F²
≥ F(U^*, V^*) + ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F²,   (73)

where the second inequality comes from the optimality condition of V^* = argmin_V F(U^*, V), i.e.,

⟨∇_V F(U^*, V^*), V^(t+0.5) − V^*⟩ ≥ 0.

Combining (71) and (72) with (73), we obtain

F(U^*, V^(t)) + ⟨∇_V F(U^*, V^(t)), V^(t+0.5) − V^(t)⟩
≥ F(U^*, V^*) + ((1 − δ2k)/2) ‖V^(t+0.5) − V^*‖_F² − (1/(2η)) ‖V^(t+0.5) − V^(t)‖_F².   (74)

On the other hand, since V^(t+0.5) minimizes Q(V; U^(t), V^(t)), we have

0 ≤ ⟨∇Q(V^(t+0.5); U^(t), V^(t)), V^* − V^(t+0.5)⟩
= ⟨∇_V F(U^(t), V^(t)), V^* − V^(t+0.5)⟩ + (1 + δ2k) ⟨V^(t+0.5) − V^(t), V^* − V^(t+0.5)⟩.   (75)

Meanwhile, we have

⟨∇_V F(U^*, V^(t)), V^* − V^(t+0.5)⟩ ≥ ⟨∇_V F(U^(t), V^(t)), V^* − V^(t+0.5)⟩ − E(V^(t+0.5), V^(t), U^(t)) ‖V^* − V^(t+0.5)‖_2
≥ (1 + δ2k) ⟨V^(t) − V^(t+0.5), V^* − V^(t+0.5)⟩ − E(V^(t+0.5), V^(t), U^(t)) ‖V^* − V^(t+0.5)‖_2
= (1 + δ2k) ⟨V^(t) − V^(t+0.5), V^* − V^(t)⟩ + (1/(2η)) ‖V^(t) − V^(t+0.5)‖_F² − E(V^(t+0.5), V^(t), U^(t)) ‖V^* − V^(t+0.5)‖_2.   (76)

Combining (75) with (76), we obtain

2 ⟨V^(t) − V^(t+0.5), V^* − V^(t)⟩ ≤ −η(1 − δ2k) ‖V^(t) − V^*‖_F² − η(1 − δ2k) ‖V^(t+0.5) − V^*‖_F²
− ‖V^(t+0.5) − V^(t)‖_F² + E(V^(t+0.5), V^(t), U^(t)) ‖V^* − V^(t+0.5)‖_2.   (77)

Therefore, combining (74) with (77), we obtain

‖V^(t+0.5) − V^*‖_F² = ‖V^(t+0.5) − V^(t) + V^(t) − V^*‖_F²
= ‖V^(t+0.5) − V^(t)‖_F² + ‖V^(t) − V^*‖_F² + 2 ⟨V^(t+0.5) − V^(t), V^(t) − V^*⟩
≤ 2η ‖V^(t) − V^*‖_F² − η(1 − δ2k) ‖V^(t+0.5) − V^*‖_F² + E(V^(t+0.5), V^(t), U^(t)) ‖V^* − V^(t+0.5)‖_2.

Rearranging the above inequality, we obtain

‖V^(t+0.5) − V^*‖_F ≤ √δ2k ‖V^(t) − V^*‖_F + (2/(1 + δ2k)) E(V^(t+0.5), V^(t), U^(t)),

which completes the proof.
B.3 Proof of Lemma 9

Proof. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 17. For any matrices U, \tilde{U} \in R^{m \times k} and V, \tilde{V} \in R^{n \times k}, we have

\|U V^\top - \tilde{U} \tilde{V}^\top\|_F \le \|U\|_2 \|V - \tilde{V}\|_F + \|\tilde{V}\|_2 \|U - \tilde{U}\|_F.

Proof. By the triangle inequality, we have

\|U V^\top - \tilde{U} \tilde{V}^\top\|_F = \|U V^\top - U \tilde{V}^\top + U \tilde{V}^\top - \tilde{U} \tilde{V}^\top\|_F
\le \|U V^\top - U \tilde{V}^\top\|_F + \|U \tilde{V}^\top - \tilde{U} \tilde{V}^\top\|_F \le \|U\|_2 \|V - \tilde{V}\|_F + \|\tilde{V}\|_2 \|U - \tilde{U}\|_F.   (78)
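As a quick numerical sanity check (not part of the argument), the perturbation bound of Lemma 17 can be verified on random matrices; the sizes and seed below are arbitrary.

```python
import numpy as np

# Check Lemma 17: ||U V^T - U~ V~^T||_F <= ||U||_2 ||V - V~||_F + ||V~||_2 ||U - U~||_F
# for arbitrary (not necessarily orthonormal) factor matrices.
rng = np.random.default_rng(0)
m, n, k = 30, 20, 4
U, U_t = rng.standard_normal((m, k)), rng.standard_normal((m, k))
V, V_t = rng.standard_normal((n, k)), rng.standard_normal((n, k))
lhs = np.linalg.norm(U @ V.T - U_t @ V_t.T, 'fro')
rhs = (np.linalg.norm(U, 2) * np.linalg.norm(V - V_t, 'fro')
       + np.linalg.norm(V_t, 2) * np.linalg.norm(U - U_t, 'fro'))
```

Since the bound holds for all matrices, the check passes for any seed.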
We then proceed with the proof of Lemma 9. By Lemma 17, we have

\|R_V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F = \|\bar{V}^{(t+0.5)\top} V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F
\le \|\bar{V}^{(t+0.5)}\|_2 \|V^{(t+0.5)} - V^{*(t)}\|_F + \|V^{*(t)}\|_2 \|\bar{V}^{(t+0.5)} - \bar{V}^{*(t+1)}\|_F
\le \|V^{(t+0.5)} - V^{*(t)}\|_F + \frac{2\sigma_1}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F,   (79)

where the last inequality comes from Lemma 5. Moreover, we define U^{*(t+1)} = U^{*(t)} (\bar{V}^{*(t+1)\top} V^{*(t)})^\top. Then we can verify

U^{*(t+1)} \bar{V}^{*(t+1)\top} = U^{*(t)} V^{*(t)\top} \bar{V}^{*(t+1)} \bar{V}^{*(t+1)\top} = M^* \bar{V}^{*(t+1)} \bar{V}^{*(t+1)\top} = M^*,

where the last equality holds, since \bar{V}^{*(t+1)} \bar{V}^{*(t+1)\top} is exactly the projection matrix for the row space of M^*. Thus by Lemma 17, we have

\|U^{(t+1)} - U^{*(t+1)}\|_F = \|U^{(t)} R_V^{(t+0.5)\top} - U^{*(t)} (\bar{V}^{*(t+1)\top} V^{*(t)})^\top\|_F
\le \|U^{(t)}\|_2 \|R_V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F + \|\bar{V}^{*(t+1)\top} V^{*(t)}\|_2 \|U^{(t)} - U^{*(t)}\|_F
\le \Big(1 + \frac{2\sigma_1}{\sigma_k}\Big) \|V^{(t+0.5)} - V^{*(t)}\|_F + \sigma_1 \|U^{(t)} - U^{*(t)}\|_F,

where the last inequality comes from (79), \|\bar{V}^{*(t+1)}\|_2 = 1, \|U^{(t)}\|_2 = 1, and \|V^{*(t)}\|_2 = \sigma_1.
Proof of Lemma 10
Proof. Following similar lines to Appendix A.5, we have
kU
(0)
−U
∗(0)
kF ≤
σk2
.
4ξσ12
(80)
In Appendix A.5, we have already shown
(0) (0) (0)>
p
U Σ V
∗
∗
≤
2
−
M
δ2k (1 + 3δ2k )kΣ kF .
1 + δ2k
(81)
F
Then by Lemma 17 we have
(0) (0)
(0)> (0) (0) (0)>
U
U Σ
U Σ V
∗(0)>
∗(0) ∗
−U
M 1 + δ2k − V
=
1 + δ2k
F
F
(0) (0) (0)>
U Σ V
(0)
(0)
∗(0)
∗
≤ kU k2 − M ∗
kF
+ kM k2 kU − U
1 + δ2k
F
p
σ2
≤ 2 δ2k k(1 + 3δ2k )σ1 + k 1 ,
4ξσ1
31
(82)
(0)
where the last inequality comes from (80), (81), kM ∗ k2 = σ1 , and kU k2 = 1. By triangle
inequality, we further have
(0) (0)
U Σ
(0) (0)
∗(0)
∗(0) ∗(0)
kU Σ − V
kF ≤ (1 + δ2k )
−V
kF
+ δ2k kV
1 + δ2k
F
p
(i)
√
σ2 ≤(1 + δ2k ) 2 δ2k k(1 + 3δ2k )σ1 + k + δ2k σ1 k
4ξσ1
(iii) 2
3
2
3
(ii)
σ
σk
σk
σk
σ1 ≤
,
≤
+ 3k 2 +
3
2
3
2ξσ
9σ1 ξ 3σ1 ξ
192ξ σ1
1
√
where (i) comes from (82) and kV ∗(0) kF = kM ∗ kF ≤ σ1 k, (ii) comes from (30), and (iii) comes
from the definition of ξ and σ1 ≥ σk .
B.5 Proof of Corollary 3

Proof. Since (31) ensures that (27) of Lemma 7 holds, we have

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \sqrt{\delta_{2k}} \|V^{(t)} - V^{*(t)}\|_F + \frac{2}{1+\delta_{2k}} E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)})
\overset{(i)}{\le} \sqrt{\delta_{2k}} \|V^{(t)} - V^{*(t)}\|_F + \frac{2}{1+\delta_{2k}} \cdot \frac{(1+\delta_{2k})\sigma_k}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\overset{(ii)}{\le} \frac{\sigma_k^2}{12\xi\sigma_1^2} \|V^{(t)} - V^{*(t)}\|_F + \frac{2\sigma_k}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\overset{(iii)}{\le} \frac{\sigma_k^2}{12\xi\sigma_1^2} \cdot \frac{\sigma_k^2}{2\xi\sigma_1} + \frac{2\sigma_k}{\xi} \cdot \frac{\sigma_k^2}{4\xi\sigma_1^2}
\overset{(iv)}{\le} \frac{13\sigma_k^3}{24\xi^2\sigma_1^2} \overset{(v)}{\le} \frac{\sigma_k}{4},   (83)

where (i) comes from Lemma 8, (ii) and (iii) come from (31), and (iv) and (v) come from the definition of \xi and \sigma_k \le \sigma_1. Since (83) ensures that (17) of Lemma 5 holds, we obtain

\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{2}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F \overset{(i)}{\le} \frac{2\sqrt{\delta_{2k}}}{\sigma_k} \|V^{(t)} - V^{*(t)}\|_F + \frac{4}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\overset{(ii)}{\le} \frac{\sigma_k^4}{12\xi^2\sigma_1^4} + \frac{\sigma_k^2}{\xi^2\sigma_1^2} \overset{(iii)}{\le} \frac{\sigma_k^2}{4\xi\sigma_1^2},   (84)

where (i) and (ii) come from (83), and (iii) comes from the definition of \xi and \sigma_1 > \sigma_k. Moreover, since (83) ensures that (29) of Lemma 9 holds, we have

\|U^{(t)} - U^{*(t+1)}\|_F \le \frac{3\sigma_1}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F + \sigma_1 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\overset{(i)}{\le} \frac{3\sigma_1\sqrt{\delta_{2k}}}{\sigma_k} \|V^{(t)} - V^{*(t)}\|_F + \Big(\frac{6}{\xi} + 1\Big) \sigma_1 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\overset{(ii)}{\le} \frac{3\sigma_1}{\sigma_k} \cdot \frac{\sigma_k^3}{12\xi\sigma_1^3} \cdot \frac{\sigma_k^2}{2\xi\sigma_1} + \Big(\frac{6}{\xi} + 1\Big) \sigma_1 \cdot \frac{\sigma_k^2}{4\xi\sigma_1^2}
\overset{(iii)}{\le} \frac{\sigma_k^2}{2\xi\sigma_1},

where (i) comes from (83), (ii) comes from (31), and (iii) comes from the definition of \xi and \sigma_1 \ge \sigma_k.
C Partition Algorithm for Matrix Completion

Algorithm 4 The observation set partition algorithm for matrix completion. It guarantees the independence among all 2T + 1 output observation sets.

Input: W, \bar{\rho}
\tilde{\rho} = 1 - (1 - \bar{\rho})^{1/(2T+1)}
For t = 0, \ldots, 2T
  \tilde{\rho}_t = \frac{(mn)! \, \bar{\rho}^{t+1} (1-\bar{\rho})^{mn-t-1}}{\bar{\rho} \, (mn-t-1)! \, (t+1)!}
End for
W_0 = \emptyset, \ldots, W_{2T} = \emptyset
For every (i, j) \in W
  Sample t from \{0, \ldots, 2T\} with probabilities \{\tilde{\rho}_0, \ldots, \tilde{\rho}_{2T}\}
  Sample (w/o replacement) a set B such that |B| = t from \{0, \ldots, 2T\} with equal probability
  Add (i, j) to W_\ell for all \ell \in B
End for
Output: \{W_t\}_{t=0}^{2T}, \tilde{\rho}
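A minimal Python sketch of Algorithm 4, assuming the observation set W is given as a collection of (i, j) index pairs; the function name, the seeded generator, and the use of unnormalized sampling weights are our own illustrative choices.

```python
import math
import random

def partition_observations(W, rho_bar, T, m, n, seed=0):
    """Sketch of Algorithm 4: split one observation set W into 2T + 1 sets
    that behave like independent samples with per-entry probability rho_tilde."""
    rng = random.Random(seed)
    mn, num_sets = m * n, 2 * T + 1
    rho_tilde = 1.0 - (1.0 - rho_bar) ** (1.0 / num_sets)
    # Assignment weights rho_tilde_t from the algorithm; note that
    # math.comb(mn, t + 1) = (mn)! / ((mn - t - 1)! (t + 1)!), and that
    # rng.choices normalizes the weights internally.
    weights = [math.comb(mn, t + 1) * rho_bar ** (t + 1)
               * (1.0 - rho_bar) ** (mn - t - 1) / rho_bar
               for t in range(num_sets)]
    sets = [set() for _ in range(num_sets)]
    for entry in W:
        t = rng.choices(range(num_sets), weights=weights, k=1)[0]
        for ell in rng.sample(range(num_sets), t):  # |B| = t, w/o replacement
            sets[ell].add(entry)
    return sets, rho_tilde
```

Each observed entry is assigned to a random subset of the output sets, so every output set is a subsample of W with the smaller rate \tilde{\rho} < \bar{\rho}.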
D Initialization Procedures for Matrix Completion

Algorithm 5 The initialization procedure INT_U(\cdot) for matrix completion. It guarantees that the initial solutions satisfy the incoherence condition throughout all iterations.

Input: \tilde{M}
Parameter: Incoherence parameter \mu
(\tilde{U}, \tilde{D}, \tilde{V}) \leftarrow KSVD(\tilde{M})
\tilde{U}^{tmp} \leftarrow argmin_U \|U - \tilde{U}\|_F^2 subject to max_i \|U_{i*}\|_2 \le \mu\sqrt{k/m}
(U^{out}, R_U) \leftarrow QR(\tilde{U}^{tmp})
\tilde{V}^{tmp} \leftarrow argmin_V \|V - \tilde{V}\|_F^2 subject to max_j \|V_{j*}\|_2 \le \mu\sqrt{k/n}
(\bar{V}^{out}, R_V) \leftarrow QR(\tilde{V}^{tmp})
V^{out} = \bar{V}^{out} (U^{out\top} \tilde{M} \bar{V}^{out})^\top
Output: U^{out}, V^{out}
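A rough Python sketch of the INT_U procedure (Algorithm 5), assuming the Frobenius-norm projection onto the row-norm constraint is implemented by rescaling overlong rows (the constraint is row-separable, so this is exact); the function names are our own.

```python
import numpy as np

def clip_row_norms(X, bound):
    # Exact Frobenius projection onto {X : max_i ||X_{i*}||_2 <= bound}:
    # rows are independent in the constraint, so overlong rows are rescaled.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, bound / np.maximum(norms, 1e-12))

def init_U(M_tilde, k, mu):
    m, n = M_tilde.shape
    U, s, Vt = np.linalg.svd(M_tilde, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k].T                       # rank-k SVD factors
    U_out, _ = np.linalg.qr(clip_row_norms(U_k, mu * np.sqrt(k / m)))
    V_bar, _ = np.linalg.qr(clip_row_norms(V_k, mu * np.sqrt(k / n)))
    V_out = V_bar @ (U_out.T @ M_tilde @ V_bar).T       # rescaled V factor
    return U_out, V_out
```

The QR step after clipping restores orthonormal columns while (approximately) preserving the clipped row norms, which is the mechanism the incoherence analysis relies on.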
E Lemmas for Theorem 2 (Alternating Exact Minimization)

E.1 Proof of Lemma 21

Proof. For notational convenience, we omit the index t in \bar{U}^{*(t)} and V^{*(t)}, and denote them by \bar{U}^* and V^* respectively. Before we proceed with the main proof, we first introduce the following lemma.
Algorithm 6 The initialization procedure INT_V(\cdot) for matrix completion. It guarantees that the initial solutions satisfy the incoherence condition throughout all iterations.

Input: \tilde{M}
Parameter: Incoherence parameter \mu
(\tilde{U}, \tilde{D}, \tilde{V}) \leftarrow KSVD(\tilde{M})
\tilde{V}^{tmp} \leftarrow argmin_V \|V - \tilde{V}\|_F^2 subject to max_j \|V_{j*}\|_2 \le \mu\sqrt{k/n}
(V^{out}, R_V) \leftarrow QR(\tilde{V}^{tmp})
\tilde{U}^{tmp} \leftarrow argmin_U \|U - \tilde{U}\|_F^2 subject to max_i \|U_{i*}\|_2 \le \mu\sqrt{k/m}
(\bar{U}^{out}, R_U) \leftarrow QR(\tilde{U}^{tmp})
U^{out} = \bar{U}^{out} (\bar{U}^{out\top} \tilde{M} V^{out})
Output: V^{out}, U^{out}
Lemma 18. Suppose that \tilde{\rho} satisfies (93). For any z \in R^m and w \in R^n such that \sum_i z_i = 0, and any t \in \{0, \ldots, 2T\}, there exists a universal constant C such that

\sum_{(i,j) \in W_t} z_i w_j \le C m^{1/4} n^{1/4} \tilde{\rho}^{1/2} \|z\|_2 \|w\|_2

with probability at least 1 - n^{-3}.

The proof of Lemma 18 is provided in Keshavan et al. [2010a], and is therefore omitted.

We then proceed with the proof of Lemma 21. For j = 1, \ldots, n, we define S^{(j,t)}, J^{(j,t)}, and K^{(j,t)} as

S^{(j,t)} = \frac{1}{\tilde{\rho}} \sum_{i:(i,j) \in W_{2t+1}} \bar{U}_{i*}^{(t)} \bar{U}_{i*}^{(t)\top}, \quad J^{(j,t)} = \frac{1}{\tilde{\rho}} \sum_{i:(i,j) \in W_{2t+1}} \bar{U}_{i*}^{(t)} \bar{U}_{i*}^{*\top}, \quad and \quad K^{(j,t)} = \bar{U}^{(t)\top} \bar{U}^*.

We then consider an arbitrary W \in R^{n \times k} such that \|W\|_F = 1, and let w = vec(W). Then we have

\max_j \sigma_{max}(S^{(j,t)}) \ge w^\top S^{(t)} w = \sum_{j=1}^{n} W_{j*}^\top S^{(j,t)} W_{j*} \ge \min_j \sigma_{min}(S^{(j,t)}).   (85)

Since W_{2t+1} is drawn uniformly at random, we can use mn independent Bernoulli random variables \delta_{ij} to describe W_{2t+1}, i.e., \delta_{ij} \sim_{i.i.d.} Bernoulli(\tilde{\rho}) with \delta_{ij} = 1 denoting (i,j) \in W_{2t+1} and \delta_{ij} = 0 denoting (i,j) \notin W_{2t+1}. We then consider an arbitrary z \in R^k with \|z\|_2 = 1, and define

Y = z^\top S^{(j,t)} z = \frac{1}{\tilde{\rho}} \sum_{i:(i,j) \in W_{2t+1}} (z^\top \bar{U}_{i*}^{(t)})^2 = \frac{1}{\tilde{\rho}} \sum_i \delta_{ij} (z^\top \bar{U}_{i*}^{(t)})^2.

Then we have

EY = z^\top \bar{U}^{(t)\top} \bar{U}^{(t)} z = 1 \quad and \quad \frac{1}{\tilde{\rho}} \sum_i (z^\top \bar{U}_{i*}^{(t)})^4 \le \frac{4\mu^2 k}{m\tilde{\rho}} \sum_i (z^\top \bar{U}_{i*}^{(t)})^2 = \frac{4\mu^2 k}{m\tilde{\rho}},

where the last inequality holds, since \bar{U}^{(t)} is incoherent with parameter 2\mu. Similarly, we can show

\max_i \frac{1}{\tilde{\rho}} (z^\top \bar{U}_{i*}^{(t)})^2 \le \frac{4\mu^2 k}{m\tilde{\rho}}.

Thus by Bernstein's inequality, we obtain

P(|Y - EY| \ge \delta_{2k}) \le \exp\Big( - \frac{3\delta_{2k}^2}{6 + 2\delta_{2k}} \cdot \frac{m\tilde{\rho}}{4\mu^2 k} \Big).

Since \tilde{\rho} and \delta_{2k} satisfy (53) and (95), for a sufficiently large C_7 we have

P(|Y - EY| \ge \delta_{2k}) \le \frac{1}{n^3},

which implies that for any z and j, we have

P\big( 1 + \delta_{2k} \ge z^\top S^{(j,t)} z \ge 1 - \delta_{2k} \big) \ge 1 - n^{-3}.   (86)

Combining (86) with (85), we complete the proof.
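The concentration step above can be illustrated empirically: for an orthonormal matrix with small row norms, the rescaled Gram matrix of a Bernoulli-sampled subset of rows stays close to the identity. The sizes, sampling rate, and tolerance below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, rho = 2000, 3, 0.3
U, _ = np.linalg.qr(rng.standard_normal((m, k)))   # orthonormal columns, U^T U = I_k
delta = rng.random(m) < rho                        # i.i.d. Bernoulli(rho) row mask
S = (U[delta].T @ U[delta]) / rho                  # (1/rho) * sum over sampled rows of U_i U_i^T
deviation = np.linalg.norm(S - np.eye(k), 2)       # spectral deviation from E[S] = I_k
```

A Gaussian matrix followed by QR has row norms of order \sqrt{k/m}, i.e., it is incoherent with a small constant, which is why the deviation is tiny at these sizes.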
E.2 Proof of Lemma 22

Proof. For notational convenience, we omit the index t in \bar{U}^{*(t)} and V^{*(t)}, and denote them by \bar{U}^* and V^* respectively. Let H^{(j,t)} = S^{(j,t)} K^{(j,t)} - J^{(j,t)}. We have

H^{(j,t)} = \frac{1}{\tilde{\rho}} \sum_{i:(i,j) \in W_{2t+1}} \big( \bar{U}_{i*}^{(t)} \bar{U}_{i*}^{(t)\top} \bar{U}^{(t)\top} \bar{U}^* - \bar{U}_{i*}^{(t)} \bar{U}_{i*}^{*\top} \big) = \frac{1}{\tilde{\rho}} \sum_{i:(i,j) \in W_{2t+1}} H^{(i,j,t)}.

We consider an arbitrary Z \in R^{n \times k} such that \|Z\|_F = 1. Let z = vec(Z) and v = vec(V^*). Since

\sum_i H^{(i,j,t)} = \bar{U}^{(t)\top} \bar{U}^{(t)} \bar{U}^{(t)\top} \bar{U}^* - \bar{U}^{(t)\top} \bar{U}^* = 0,

then by Lemma 18, we have

z^\top (S^{(t)} K^{(t)} - J^{(t)}) v = \sum_j Z_{j*}^\top (S^{(j,t)} K^{(j,t)} - J^{(j,t)}) V_{j*}^* \le \frac{1}{\tilde{\rho}} \sum_{p,q} \sqrt{\sum_j Z_{jp}^2 (V_{jq}^*)^2} \sqrt{\sum_i [H^{(i,j,t)}]_{pq}^2}.

Meanwhile, we have

\sum_i [H^{(i,j,t)}]_{pq}^2 = \sum_i (\bar{U}_{ip}^{(t)})^2 \big( \bar{U}_{i*}^{(t)\top} \bar{U}^{(t)\top} \bar{U}_{*q}^* - \bar{U}_{iq}^* \big)^2 \le \max_i (\bar{U}_{ip}^{(t)})^2 \sum_i \big( \bar{U}_{i*}^{(t)\top} \bar{U}^{(t)\top} \bar{U}_{*q}^* - \bar{U}_{iq}^* \big)^2
= \max_i (\bar{U}_{ip}^{(t)})^2 \big( 1 - \|\bar{U}^{(t)\top} \bar{U}_{*q}^*\|_2^2 \big) \le \max_i \|\bar{U}_{i*}^{(t)}\|_2^2 \big( 1 - (\bar{U}_{*q}^{(t)\top} \bar{U}_{*q}^*)^2 \big)
\overset{(i)}{\le} \frac{4\mu^2 k}{m} \big( 1 - \bar{U}_{*q}^{(t)\top} \bar{U}_{*q}^* \big) \big( 1 + \bar{U}_{*q}^{(t)\top} \bar{U}_{*q}^* \big)
\overset{(ii)}{\le} \frac{4\mu^2 k}{m} \|\bar{U}_{*q}^{(t)} - \bar{U}_{*q}^*\|_2^2 \le \frac{4\mu^2 k}{m} \|\bar{U}^{(t)} - \bar{U}^*\|_F^2,

where (i) comes from the incoherence of \bar{U}^{(t)}, and (ii) comes from \bar{U}_{*q}^{(t)\top} \bar{U}_{*q}^* \le \|\bar{U}_{*q}^{(t)}\|_2 \|\bar{U}_{*q}^*\|_2 \le 1. Combining the above inequalities, by the incoherence of V^* and Bernstein's inequality, we have

z^\top (S^{(t)} K^{(t)} - J^{(t)}) v \le \sum_{p,q} \frac{4\sigma_1 \mu^2 k}{m\tilde{\rho}} \|\bar{U}^{(t)} - \bar{U}^*\|_F \|Z_{*p}\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^*\|_F

with probability at least 1 - n^{-3}, where the last inequality comes from the incoherence of V^*, \sum_p \|Z_{*p}\|_2 \le \sqrt{k}, and a sufficiently large \tilde{\rho}. Since Z is arbitrary, we have

P\big( \|(S^{(t)} K^{(t)} - J^{(t)}) v\|_2 \le 3\delta_{2k} k \sigma_1 \|\bar{U}^{(t)} - \bar{U}^*\|_F \big) \ge 1 - n^{-3},

which completes the proof.
E.3 Proof of Lemma 26

Proof. Recall that W^{in} = V^{(t+0.5)} and \bar{V}^{(t+1)} = W^{out} in Algorithm 3. Since V^{(t+0.5)} satisfies (97), then by Lemma 5 there exists a factorization of M^* = U^{*(t+0.5)} \bar{V}^{*(t+0.5)\top} such that \bar{V}^{*(t+0.5)} is an orthonormal matrix, and satisfies

\|W^{in} - \bar{V}^{*(t+0.5)}\|_F \le \frac{2}{\sigma_k} \|W^{in} - V^{*(t)}\|_F \le \frac{2}{\sigma_k} \cdot \frac{\sigma_k}{8} = \frac{1}{4}.   (87)

Since the Frobenius norm projection is contractive, we have

\|\tilde{W} - \bar{V}^{*(t+0.5)}\|_F \le \|W^{in} - \bar{V}^{*(t+0.5)}\|_F \le \frac{1}{4}.   (88)

Since \bar{V}^{*(t+0.5)} is an orthonormal matrix, by Lemma 14, we have

\|W^{out} - \bar{V}^{*(t+1)}\|_F \le \frac{\sqrt{2} \, \|\bar{V}^{*(t+0.5)\dagger}\|_2 \|\tilde{W} - \bar{V}^{*(t+0.5)}\|_F}{1 - \|\tilde{W} - \bar{V}^{*(t+0.5)}\|_F \|\bar{V}^{*(t+0.5)\dagger}\|_2} \le 2 \|\tilde{W} - \bar{V}^{*(t+0.5)}\|_F \le \frac{1}{2},   (89)

where \bar{V}^{*(t+1)} = \bar{V}^{*(t+0.5)} O for some unitary matrix O \in R^{k \times k}, and the last inequality comes from (88). Moreover, since \bar{V}^{*(t+1)} is an orthonormal matrix, we have

\sigma_{min}(\tilde{W}) \ge \sigma_{min}(\bar{V}^{*(t+1)}) - \|\tilde{W} - \bar{V}^{*(t+1)}\|_F \ge 1 - \frac{1}{2} = \frac{1}{2},

where the last inequality comes from (89). Since W^{out} = \tilde{W} (R_{\tilde{W}})^{-1}, we have

\|W^{out}_{i*}\|_2 = \|W^{out\top} e_i\|_2 = \|(R_{\tilde{W}})^{-\top} \tilde{W}^\top e_i\|_2 \le \sigma_{min}^{-1}(\tilde{W}) \cdot \mu\sqrt{\frac{k}{n}} \le 2\mu\sqrt{\frac{k}{n}}.
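The re-orthonormalization step analyzed above (a small perturbation of an orthonormal matrix stays close to it, up to a rotation, after QR) can be checked numerically. The noise level matches the small-perturbation regime of (88)-(89), the factor 2 mirrors the bound there, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
V_star, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal target
W = V_star + 0.01 * rng.standard_normal((n, k))        # small perturbation
W_out, R = np.linalg.qr(W)                             # re-orthonormalized iterate
# Align W_out with V_star by the optimal k x k rotation (Procrustes via SVD).
A, _, Bt = np.linalg.svd(V_star.T @ W_out)
err_in = np.linalg.norm(W - V_star, 'fro')
err_out = np.linalg.norm(W_out - V_star @ (A @ Bt), 'fro')
```

The Procrustes alignment is needed because QR only recovers the column space; the orthonormal basis itself is determined up to a rotation.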
E.4 Proof of Lemma 27

Proof. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 19. Suppose that \tilde{\rho} satisfies (93). Recall that \tilde{U}, \tilde{\Sigma}, and \tilde{V} are defined in Algorithm 2. There exists a universal constant C such that

\|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_2 \le C \sqrt{\frac{k}{\tilde{\rho}\sqrt{mn}}} \, \sigma_1

with high probability.

The proof of Lemma 19 is provided in Keshavan et al. [2010a], and is therefore omitted.

We then proceed with the proof of Lemma 27. Since both \tilde{U} \tilde{\Sigma} \tilde{V}^\top and M^* are rank-k matrices, \tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^* has rank at most 2k. Thus by Lemma 19, we have

\|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_F^2 \le 2k \|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_2^2 \le \frac{2Ck^2 \sigma_1^2}{\tilde{\rho}\sqrt{mn}} \le \frac{\sigma_k^6 (1-\delta_{2k})^2}{1024(1+\delta_{2k})^2 \sigma_1^4 \xi^2}   (90)

with high probability, where the last inequality comes from (93) with

C_7 \ge \frac{2048(1+\delta_{2k})^2 \sigma_1^6 \xi^2}{\mu^2 \sigma_k^6 (1-\delta_{2k})^2}.

Suppose that M^* has a rank-k singular value decomposition M^* = \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top}. Then we have

\|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_F^2 = \|\tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} - \tilde{U} \tilde{\Sigma} \tilde{V}^\top\|_F^2
= \|\tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} - \tilde{U} \tilde{U}^\top \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} + \tilde{U} \tilde{U}^\top \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} - \tilde{U} \tilde{\Sigma} \tilde{V}^\top\|_F^2
= \|(I_m - \tilde{U} \tilde{U}^\top) \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} + \tilde{U} (\tilde{U}^\top \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top} - \tilde{\Sigma} \tilde{V}^\top)\|_F^2
\ge \|(I_m - \tilde{U} \tilde{U}^\top) \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top}\|_F^2.

Let \tilde{U}_\perp \in R^{m \times (m-k)} denote the orthogonal complement of \tilde{U}. Then we have

\|(I_m - \tilde{U} \tilde{U}^\top) \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top}\|_F^2 = \|\tilde{U}_\perp \tilde{U}_\perp^\top \tilde{U}^* \tilde{D}^* \tilde{V}^{*\top}\|_F^2 = \|\tilde{U}_\perp^\top \tilde{U}^* \tilde{D}^*\|_F^2 \ge \sigma_k^2 \|\tilde{U}_\perp^\top \tilde{U}^*\|_F^2.

Thus Lemma 2 guarantees that for \tilde{O} = argmin_{O^\top O = I_k} \|\tilde{U} - \tilde{U}^* O\|_F, we have

\|\tilde{U} - \tilde{U}^* \tilde{O}\|_F \le \sqrt{2} \, \|\tilde{U}_\perp^\top \tilde{U}^*\|_F \le \frac{\sqrt{2}}{\sigma_k} \|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_F.

We define \tilde{U}^{*tmp} = \tilde{U}^* \tilde{O}. Then combining the above inequality with (90), we have

\|\tilde{U} - \tilde{U}^{*tmp}\|_F \le \frac{2}{\sigma_k} \|\tilde{U} \tilde{\Sigma} \tilde{V}^\top - M^*\|_F \le \frac{\sigma_k^2 (1-\delta_{2k})}{16(1+\delta_{2k})\sigma_1^2 \xi}.

Since the Frobenius norm projection is contractive, we have

\|\tilde{U}^{tmp} - \tilde{U}^{*tmp}\|_F \le \|\tilde{U} - \tilde{U}^{*tmp}\|_F \le \frac{\sigma_k^2 (1-\delta_{2k})}{16(1+\delta_{2k})\sigma_1^2 \xi} \le \frac{1}{16},   (91)

where the last inequality comes from the definition of \xi and \sigma_1 \ge \sigma_k. Since \tilde{U}^{*tmp} is an orthonormal matrix, by Lemma 14, we have

\|U^{out} - \bar{U}^{*(0)}\|_F \le \frac{\sqrt{2} \, \|\tilde{U}^{*tmp\dagger}\|_2 \|\tilde{U}^{tmp} - \tilde{U}^{*tmp}\|_F}{1 - \|\tilde{U}^{tmp} - \tilde{U}^{*tmp}\|_F \|\tilde{U}^{*tmp\dagger}\|_2}
\le 2 \|\tilde{U}^{tmp} - \tilde{U}^{*tmp}\|_F \le \frac{\sigma_k^2 (1-\delta_{2k})}{8(1+\delta_{2k})\sigma_1^2 \xi} \le \frac{1}{8},   (92)

where \bar{U}^{*(0)} = \tilde{U}^{*tmp} \tilde{O}^{tmp} for some unitary matrix \tilde{O}^{tmp} \in R^{k \times k} such that \tilde{O}^{tmp} \tilde{O}^{tmp\top} = I_k, and the last inequality comes from (91). Moreover, since \bar{U}^{*(0)} is an orthonormal matrix, we have

\sigma_{min}(\tilde{U}^{tmp}) \ge \sigma_{min}(\bar{U}^{*(0)}) - \|\tilde{U}^{tmp} - \bar{U}^{*(0)}\|_F \ge 1 - \frac{1}{8} = \frac{7}{8},

where the last inequality comes from (91). Since U^{out} = \tilde{U}^{tmp} (R_U^{out})^{-1}, we have

\|U^{out}_{i*}\|_2 = \|U^{out\top} e_i\|_2 = \|(R_U^{out})^{-\top} \tilde{U}^{tmp\top} e_i\|_2 \le \sigma_{min}^{-1}(\tilde{U}^{tmp}) \cdot \mu\sqrt{\frac{k}{m}} \le \frac{8\mu}{7}\sqrt{\frac{k}{m}}.

Moreover, we define V^{*(0)} = M^{*\top} \bar{U}^{*(0)}. Then we have \bar{U}^{*(0)} V^{*(0)\top} = \bar{U}^{*(0)} \bar{U}^{*(0)\top} M^* = M^*, where the last equality comes from the fact that \bar{U}^{*(0)} \bar{U}^{*(0)\top} is exactly the projection matrix for the column space of M^*.
E.5 Proof of Corollary 5

Proof. Since E_U^{(t)} implies that E_{U,1}^{(t)}, \ldots, E_{U,4}^{(t)} hold with probability at least 1 - 4n^{-3}, combining Lemmas 24 and 25, we obtain

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{2\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \overset{(i)}{\le} \frac{\sigma_k}{2\xi} \cdot \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1} = \frac{\sigma_k^2 (1-\delta_{2k})}{8(1+\delta_{2k})\sigma_1 \xi^2} \overset{(ii)}{\le} \frac{\sigma_k}{8}

with probability at least 1 - 4n^{-3}, where (i) comes from the definition of E_U^{(t)}, and (ii) comes from the definition of \xi and \sigma_1 \ge \sigma_k. Therefore Lemma 26 implies that \bar{V}^{(t+1)} is incoherent with parameter 2\mu, and

\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{4}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{2}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1}

with probability at least 1 - 4n^{-3}, where the last inequality comes from the definition of \xi and E_U^{(t)}.
F Proof of Theorem 2

We present the technical proof for matrix completion. Before we proceed with the main proof, we first introduce the following lemma.

Lemma 20 (Hardt and Wootters [2014]). Suppose that the entry observation probability \bar{\rho} of W satisfies (53). Then the output sets \{W_t\}_{t=0}^{2T} of Algorithm 4 are equivalent to 2T + 1 observation sets, which are independently generated with the entry observation probability

\tilde{\rho} \ge \frac{C_7 \mu^2 k^3 \log n}{m}   (93)

for some constant C_7.

See Hardt and Wootters [2014] for the proof of Lemma 20. Lemma 20 ensures the independence among all observation sets generated by Algorithm 4. To make the convergence analysis for matrix completion comparable to that for matrix sensing, we rescale both the objective function F_W and the step size \eta by the entry observation probability \tilde{\rho} of each individual set, which is also output by Algorithm 4. In particular, we define

\tilde{F}_W(U, V) = \frac{1}{2\tilde{\rho}} \|P_W(U V^\top) - P_W(M^*)\|_F^2 \quad and \quad \tilde{\eta} = \tilde{\rho}\eta.   (94)
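To make (94) concrete, here is a small Python sketch of the rescaled objective and its partial gradient with respect to V, with the projection P_W implemented as an entrywise 0/1 mask; the function name and arguments are our own.

```python
import numpy as np

def objective_and_grad_V(U, V, M_star, mask, rho):
    # F_W(U, V) = (1 / (2 rho)) * || P_W(U V^T) - P_W(M*) ||_F^2,
    # where P_W zeroes out unobserved entries (mask is 0/1).
    residual = mask * (U @ V.T - M_star)   # P_W(U V^T - M*)
    value = np.sum(residual ** 2) / (2.0 * rho)
    grad_V = residual.T @ U / rho          # chain rule; mask**2 = mask
    return value, grad_V
```

A finite-difference check on a random instance confirms the gradient formula.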
For notational simplicity, we assume that at the t-th iteration, there exists a matrix factorization of M^* as

M^* = \bar{U}^{*(t)} V^{*(t)\top},

where \bar{U}^{*(t)} \in R^{m \times k} is an orthonormal matrix. Then we define several nk \times nk matrices, each consisting of k \times k blocks of diagonal n \times n matrices:

S^{(t)} = \big( S^{(t)}_{pq} \big)_{1 \le p,q \le k} \quad with \quad S^{(t)}_{pq} = diag\Big( \frac{1}{\tilde{\rho}} \sum_{i:(i,1) \in W_{2t+1}} \bar{U}^{(t)}_{ip} \bar{U}^{(t)}_{iq}, \; \ldots, \; \frac{1}{\tilde{\rho}} \sum_{i:(i,n) \in W_{2t+1}} \bar{U}^{(t)}_{ip} \bar{U}^{(t)}_{iq} \Big),

G^{(t)} = \big( G^{(t)}_{pq} \big)_{1 \le p,q \le k} \quad with \quad G^{(t)}_{pq} = diag\Big( \frac{1}{\tilde{\rho}} \sum_{i:(i,1) \in W_{2t+1}} \bar{U}^{*(t)}_{ip} \bar{U}^{*(t)}_{iq}, \; \ldots, \; \frac{1}{\tilde{\rho}} \sum_{i:(i,n) \in W_{2t+1}} \bar{U}^{*(t)}_{ip} \bar{U}^{*(t)}_{iq} \Big),

J^{(t)} = \big( J^{(t)}_{pq} \big)_{1 \le p,q \le k} \quad with \quad J^{(t)}_{pq} = diag\Big( \frac{1}{\tilde{\rho}} \sum_{i:(i,1) \in W_{2t+1}} \bar{U}^{(t)}_{ip} \bar{U}^{*(t)}_{iq}, \; \ldots, \; \frac{1}{\tilde{\rho}} \sum_{i:(i,n) \in W_{2t+1}} \bar{U}^{(t)}_{ip} \bar{U}^{*(t)}_{iq} \Big),

K^{(t)} = \big( K^{(t)}_{pq} \big)_{1 \le p,q \le k} \quad with \quad K^{(t)}_{pq} = \big( \bar{U}^{(t)\top}_{*p} \bar{U}^{*(t)}_{*q} \big) I_n.

Note that S^{(t)} and G^{(t)} are the partial Hessian matrices \nabla_V^2 \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V) and \nabla_V^2 \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V) with respect to the vectorized V, i.e., vec(V).
F.1 Proof of Theorem 2 (Alternating Exact Minimization)

Proof. Throughout the proof for alternating exact minimization, we define a constant \xi \in (2, \infty) to simplify the notation. We define the approximation error of the inexact first order oracle as

E(V^{(t+0.5)}, V^{(t+0.5)}, \bar{U}^{(t)}) = \|\nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V^{(t+0.5)}) - \nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V^{(t+0.5)})\|_F.

To simplify our later analysis, we first introduce the following event:

E_U^{(t)} = \Big\{ \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1} \quad and \quad \max_i \|\bar{U}^{(t)}_{i*}\|_2 \le 2\mu\sqrt{\frac{k}{m}} \Big\}.

We then present two important consequences of E_U^{(t)}.

Lemma 21. Suppose that E_U^{(t)} holds, and \tilde{\rho} satisfies (93). Then we have

P\big( 1 + \delta_{2k} \ge \sigma_{max}(S^{(t)}) \ge \sigma_{min}(S^{(t)}) \ge 1 - \delta_{2k} \big) \ge 1 - n^{-3},

where \delta_{2k} is some constant satisfying

\delta_{2k} \le \frac{\sigma_k^6}{192\xi^2 k \sigma_1^6}.   (95)

The proof of Lemma 21 is provided in Appendix E.1. Lemma 21 is also applicable to G^{(t)}, since G^{(t)} shares the same structure as S^{(t)}, and \bar{U}^{*(t)} is incoherent with parameter \mu.

Lemma 22. Suppose that E_U^{(t)} holds, and \tilde{\rho} satisfies (93). Then for any V that is incoherent with parameter 3\sigma_1\mu, we have

P\big( \|(S^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V)\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \big) \ge 1 - n^{-3},

where \delta_{2k} is defined in (95).

The proof of Lemma 22 is provided in Appendix E.2. Note that Lemma 22 is also applicable to \|(G^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V)\|_2, since G^{(t)} shares the same structure as S^{(t)}, and \bar{U}^{*(t)} is incoherent with parameter \mu.

We then introduce another two events:

E_{U,1}^{(t)} = \{1 + \delta_{2k} \ge \sigma_{max}(S^{(t)}) \ge \sigma_{min}(S^{(t)}) \ge 1 - \delta_{2k}\},
E_{U,2}^{(t)} = \{1 + \delta_{2k} \ge \sigma_{max}(G^{(t)}) \ge \sigma_{min}(G^{(t)}) \ge 1 - \delta_{2k}\},

where \delta_{2k} is defined in E_U^{(t)}. By Lemma 21, we can verify that E_U^{(t)} implies E_{U,1}^{(t)} and E_{U,2}^{(t)} with probability at least 1 - 2n^{-3}. The next lemma shows that E_{U,1}^{(t)} and E_{U,2}^{(t)} imply the strong convexity and smoothness of \tilde{F}_{W_{2t+1}}(U, V) in V at U = \bar{U}^{(t)} and U = \bar{U}^{*(t)}.
Lemma 23. Suppose that E_{U,1}^{(t)} and E_{U,2}^{(t)} hold. Then for any V, V' \in R^{n \times k}, we have

\frac{1+\delta_{2k}}{2} \|V' - V\|_F^2 \ge \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V') - \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V) - \langle \nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V), V' - V \rangle \ge \frac{1-\delta_{2k}}{2} \|V' - V\|_F^2,

\frac{1+\delta_{2k}}{2} \|V' - V\|_F^2 \ge \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V') - \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V) - \langle \nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V), V' - V \rangle \ge \frac{1-\delta_{2k}}{2} \|V' - V\|_F^2.

Since S^{(t)} and G^{(t)} are essentially the partial Hessian matrices \nabla_V^2 \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V) and \nabla_V^2 \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V), the proof of Lemma 23 directly follows Appendix A.1, and is therefore omitted.

We then introduce another two events:

E_{U,3}^{(t)} = \{\|(S^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},
E_{U,4}^{(t)} = \{\|(G^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},

where \delta_{2k} is defined in E_U^{(t)}. We can verify that E_U^{(t)} implies E_{U,3}^{(t)} and E_{U,4}^{(t)} with probability at least 1 - 2n^{-3} by showing the incoherence of V^{*(t)}. More specifically, let V^{*(t)} = \bar{V}^{*(t)} R_V^{*(t)} denote the QR decomposition of V^{*(t)}. We have

\|V^{*(t)}_{j*}\|_2 = \|R_V^{*(t)\top} \bar{V}^{*(t)\top} e_j\|_2 \le \|R_V^{*(t)}\|_2 \|\bar{V}^{*(t)\top} e_j\|_2 \le \sigma_1 \|\bar{V}^{*(t)}_{j*}\|_2 \le \sigma_1\mu\sqrt{\frac{k}{n}}.   (96)

Then Lemma 22 is applicable to E_{U,3}^{(t)} and E_{U,4}^{(t)}.
We then introduce the following key lemmas, which will be used in the main proof.

Lemma 24. Suppose that E_U^{(t)}, E_{U,1}^{(t)}, \ldots, E_{U,4}^{(t)} hold. We then have

E(V^{(t+0.5)}, V^{(t+0.5)}, \bar{U}^{(t)}) \le \frac{(1-\delta_{2k})\sigma_k}{2\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.

Lemma 24 shows that the approximation error of the inexact first order oracle for updating V diminishes with the estimation error of \bar{U}^{(t)}, when \bar{U}^{(t)} is sufficiently close to \bar{U}^{*(t)}. It is analogous to Lemma 3 in the analysis of matrix sensing, and its proof directly follows Appendix A.2, and is therefore omitted.

Lemma 25. Suppose that E_U^{(t)}, E_{U,1}^{(t)}, \ldots, E_{U,4}^{(t)} hold. We then have

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{1}{1-\delta_{2k}} E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}).

Lemma 25 illustrates that the estimation error of V^{(t+0.5)} diminishes with the approximation error of the inexact first order oracle. It is analogous to Lemma 4 in the analysis of matrix sensing, and its proof directly follows Appendix A.3, and is therefore omitted.

Lemma 26. Suppose that V^{(t+0.5)} satisfies

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{8}.   (97)

Then there exists a factorization of M^* = U^{*(t+1)} \bar{V}^{*(t+1)\top} such that \bar{V}^{*(t+1)} \in R^{n \times k} is an orthonormal matrix, and satisfies

\max_j \|\bar{V}^{(t+1)}_{j*}\|_2 \le 2\mu\sqrt{\frac{k}{n}} \quad and \quad \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{4}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F.

The proof of Lemma 26 is provided in Appendix E.3. Lemma 26 ensures that the incoherence factorization enforces \bar{V}^{(t+1)} to be incoherent with parameter 2\mu.

Lemma 27. Suppose that \tilde{\rho} satisfies (93). Then E_U^{(0)} holds with high probability.

The proof of Lemma 27 is provided in Appendix E.4. Lemma 27 shows that the initial solution \bar{U}^{(0)} is incoherent with parameter 2\mu, while achieving a sufficiently small estimation error with high probability. It is analogous to Lemma 6 for matrix sensing.

Combining Lemmas 24, 25, and 26, we obtain the next corollary for a complete iteration of updating V.

Corollary 5. Suppose that E_U^{(t)} holds. Then

E_V^{(t)} = \Big\{ \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{(1-\delta_{2k})\sigma_k}{4\xi(1+\delta_{2k})\sigma_1} \quad and \quad \max_j \|\bar{V}^{(t+1)}_{j*}\|_2 \le 2\mu\sqrt{\frac{k}{n}} \Big\}

holds with probability at least 1 - 4n^{-3}. Moreover, we have

\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{2}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \quad and \quad \|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{2\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F

with probability at least 1 - 4n^{-3}.

The proof of Corollary 5 is provided in Appendix E.5. Since the alternating exact minimization algorithm updates U and V in a symmetric manner, we can establish similar results for a complete iteration of updating U in the next corollary.

Corollary 6. Suppose that E_V^{(t)} holds. Then E_U^{(t+1)} holds with probability at least 1 - 4n^{-3}. Moreover, we have

\|\bar{U}^{(t+1)} - \bar{U}^{*(t+1)}\|_F \le \frac{2}{\xi} \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \quad and \quad \|U^{(t+0.5)} - U^{*(t+1)}\|_F \le \frac{\sigma_k}{2\xi} \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F

with probability at least 1 - 4n^{-3}.

The proof of Corollary 6 directly follows Appendix E.5, and is therefore omitted.
We proceed with the proof of Theorem 2 conditioning on E_U^{(0)}. Similar to §4.2, we can recursively apply Corollaries 5 and 6, and show that \{E_U^{(t)}\}_{t=1}^T and \{E_V^{(t)}\}_{t=0}^T simultaneously hold with probability at least 1 - 8Tn^{-3}. Then conditioning on all \{E_U^{(t)}\}_{t=0}^T and \{E_V^{(t)}\}_{t=0}^T, we have

\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \frac{2}{\xi} \|\bar{U}^{(T-1)} - \bar{U}^{*(T-1)}\|_F \le \Big(\frac{2}{\xi}\Big)^2 \|\bar{V}^{(T-1)} - \bar{V}^{*(T-1)}\|_F
\le \cdots \le \Big(\frac{2}{\xi}\Big)^{2T-1} \|\bar{U}^{(0)} - \bar{U}^{*(0)}\|_F \le \Big(\frac{2}{\xi}\Big)^{2T} \frac{(1-\delta_{2k})\sigma_k}{8(1+\delta_{2k})\sigma_1},   (98)

where the last inequality comes from the definition of E_U^{(0)}. Thus, for a target precision \epsilon, we only need

T = \Big\lceil \frac{1}{2} \log^{-1}\Big(\frac{\xi}{2}\Big) \cdot \log\Big( \frac{(1-\delta_{2k})\sigma_k}{4(1+\delta_{2k})\sigma_1 \epsilon} \Big) \Big\rceil

iterations such that

\|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \Big(\frac{2}{\xi}\Big)^{2T} \frac{(1-\delta_{2k})\sigma_k}{8(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2}.   (99)

Meanwhile, by (99) and Corollary 6, we have

\|U^{(T-0.5)} - U^{*(T)}\|_F \le \frac{\sigma_k}{2\xi} \|\bar{V}^{(T)} - \bar{V}^{*(T)}\|_F \le \Big(\frac{2}{\xi}\Big)^{2T} \frac{(1-\delta_{2k})\sigma_k^2}{16\xi(1+\delta_{2k})\sigma_1},

where the last inequality comes from (98). Thus we only need

T = \Big\lceil \frac{1}{2} \log^{-1}\Big(\frac{\xi}{2}\Big) \cdot \log\Big( \frac{(1-\delta_{2k})\sigma_k^2}{8\xi(1+\delta_{2k})\epsilon} \Big) \Big\rceil

iterations such that

\|U^{(T-0.5)} - U^{*(T)}\|_F \le \Big(\frac{2}{\xi}\Big)^{2T} \frac{(1-\delta_{2k})\sigma_k^2}{16\xi(1+\delta_{2k})\sigma_1} \le \frac{\epsilon}{2\sigma_1}.   (100)

We then combine (99) and (100) by following similar lines to §4.2, and show

\|M^{(T)} - M^*\|_F \le \epsilon.   (101)

The above analysis only depends on E_U^{(0)}. Because Lemma 27 guarantees that E_U^{(0)} holds with high probability, given T \ll n^3, (101) also holds with high probability.
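The iteration count can be sanity-checked numerically: with contraction factor 2/\xi per half-iteration, the T given above (for target precision \epsilon) drives the bound in (98) below \epsilon/2. The parameter values below are illustrative only.

```python
import math

xi, delta, sigma1, sigmak, eps = 4.0, 0.1, 2.0, 1.0, 1e-6
A = (1 - delta) * sigmak / (8 * (1 + delta) * sigma1)      # initial bound in (98)
# T = ceil( (1/2) * log^{-1}(xi/2) * log((1-delta)*sigmak / (4*(1+delta)*sigma1*eps)) )
T = math.ceil(0.5 * math.log((1 - delta) * sigmak / (4 * (1 + delta) * sigma1 * eps))
              / math.log(xi / 2))
final_bound = (2.0 / xi) ** (2 * T) * A                    # value of (98) after T sweeps
```

The choice of T is exactly the smallest integer for which (\xi/2)^{2T} exceeds 2A/\epsilon, so the final bound always lands at or below \epsilon/2.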
F.2 Proof of Theorem 2 (Alternating Gradient Descent)

Proof. Throughout the proof for alternating gradient descent, we define a sufficiently large constant \xi. Moreover, we assume that at the t-th iteration, there exists a matrix factorization of M^* as

M^* = \bar{U}^{*(t)} V^{*(t)\top},

where \bar{U}^{*(t)} \in R^{m \times k} is an orthonormal matrix. We define the approximation error of the inexact first order oracle as

E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}) = \|\nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{(t)}, V^{(t)}) - \nabla_V \tilde{F}_{W_{2t+1}}(\bar{U}^{*(t)}, V^{(t)})\|_F.

To simplify our later analysis, we introduce the following event:

E_U^{(t)} = \Big\{ \max_i \|\bar{U}^{(t)}_{i*}\|_2 \le 2\mu\sqrt{\frac{k}{m}}, \quad \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad \max_j \|V^{(t)}_{j*}\|_2 \le 2\sigma_1\mu\sqrt{\frac{k}{n}}, \quad and \quad \|V^{(t)} - V^{*(t)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1} \Big\}.

As has been shown in §F.1, E_U^{(t)} implies the following four events with probability at least 1 - 4n^{-3}:

E_{U,1}^{(t)} = \{1 + \delta_{2k} \ge \sigma_{max}(S^{(t)}) \ge \sigma_{min}(S^{(t)}) \ge 1 - \delta_{2k}\},
E_{U,2}^{(t)} = \{1 + \delta_{2k} \ge \sigma_{max}(G^{(t)}) \ge \sigma_{min}(G^{(t)}) \ge 1 - \delta_{2k}\},
E_{U,3}^{(t)} = \{\|(S^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},
E_{U,4}^{(t)} = \{\|(G^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},

where \delta_{2k} is defined in (95). In §F.1, we also show that E_{U,1}^{(t)} and E_{U,2}^{(t)} imply the strong convexity and smoothness of \tilde{F}_{W_{2t+1}}(U, V) in V at U = \bar{U}^{(t)} and U = \bar{U}^{*(t)}.
Moreover, we introduce the following two events:

E_{U,5}^{(t)} = \{\|(S^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{(t)} - V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},
E_{U,6}^{(t)} = \{\|(G^{(t)} K^{(t)} - J^{(t)}) \cdot vec(V^{(t)} - V^{*(t)})\|_2 \le 3k\sigma_1\delta_{2k} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F\},

where \delta_{2k} is defined in (95). We can verify that E_U^{(t)} implies E_{U,5}^{(t)} and E_{U,6}^{(t)} with probability at least 1 - 2n^{-3} by showing the incoherence of V^{(t)} - V^{*(t)}. More specifically, we have

\max_j \|V^{(t)}_{j*} - V^{*(t)}_{j*}\|_2 \le \max_j \|V^{(t)}_{j*}\|_2 + \max_j \|V^{*(t)}_{j*}\|_2 \le 3\sigma_1\mu\sqrt{\frac{k}{n}},

where the last inequality follows from the definition of E_U^{(t)} and the incoherence of V^{*(t)} as shown in (96). Then Lemma 22 is applicable to E_{U,5}^{(t)} and E_{U,6}^{(t)}.
We then introduce the following key lemmas, which will be used in the main proof.

Lemma 28. Suppose that E_U^{(t)}, E_{U,1}^{(t)}, \ldots, E_{U,6}^{(t)} hold. Then we have

E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}) \le \frac{(1+\delta_{2k})\sigma_k}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.

Lemma 28 shows that the approximation error of the inexact first order oracle for updating V diminishes with the estimation error of \bar{U}^{(t)}, when \bar{U}^{(t)} is sufficiently close to \bar{U}^{*(t)}. It is analogous to Lemma 7 in the analysis of matrix sensing, and its proof directly follows Appendix B.1, and is therefore omitted.

Lemma 29. Suppose that E_U^{(t)}, E_{U,1}^{(t)}, \ldots, E_{U,6}^{(t)} hold. Meanwhile, the rescaled step size parameter \tilde{\eta} satisfies

\tilde{\eta} = \frac{1}{1+\delta_{2k}}.

Then we have

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \sqrt{\delta_{2k}} \|V^{(t)} - V^{*(t)}\|_F + \frac{2}{1+\delta_{2k}} E(V^{(t+0.5)}, V^{(t)}, \bar{U}^{(t)}).

Lemma 29 illustrates that the estimation error of V^{(t+0.5)} diminishes with the approximation error of the inexact first order oracle. It is analogous to Lemma 8 in the analysis of matrix sensing. Its proof directly follows Appendix B.2, and is therefore omitted.
Lemma 30. Suppose that V^{(t+0.5)} satisfies

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \frac{\sigma_k}{8}.

We then have

\max_j \|\bar{V}^{(t+1)}_{j*}\|_2 \le 2\mu\sqrt{\frac{2k}{n}} \quad and \quad \max_i \|U^{(t+1)}_{i*}\|_2 \le 2\sigma_1\mu\sqrt{\frac{2k}{m}}.

Moreover, there exists a factorization of M^* = U^{*(t+1)} \bar{V}^{*(t+1)\top} such that \bar{V}^{*(t+1)} is an orthonormal matrix, and

\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{4}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F,
\|U^{(t)} - U^{*(t+1)}\|_F \le \frac{5\sigma_1}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F + \sigma_1 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F.

The proof of Lemma 30 is provided in Appendix G.1. Lemma 30 guarantees that the incoherence factorization enforces \bar{V}^{(t+1)} and U^{(t+1)} to be incoherent with parameters 2\mu and 2\sigma_1\mu respectively. The next lemma characterizes the estimation error of the initial solutions.

Lemma 31. Suppose that \tilde{\rho} satisfies (93). Then E_U^{(0)} holds with high probability.

The proof of Lemma 31 is provided in Appendix G.2. Lemma 31 ensures that the initial solutions U^{(0)} and V^{(0)} are incoherent with parameters 2\mu and 2\sigma_1\mu respectively, while achieving sufficiently small estimation errors with high probability. It is analogous to Lemma 10 in the analysis of matrix sensing.
Combining Lemmas 28, 29, and 30, we obtain the following corollary for a complete iteration of updating V.

Corollary 7. Suppose that E_U^{(t)} holds. Then

E_V^{(t)} = \Big\{ \max_j \|\bar{V}^{(t+1)}_{j*}\|_2 \le 2\mu\sqrt{\frac{k}{n}}, \quad \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{\sigma_k^2}{4\xi\sigma_1^2}, \quad \max_i \|U^{(t)}_{i*}\|_2 \le 2\sigma_1\mu\sqrt{\frac{k}{m}}, \quad and \quad \|U^{(t)} - U^{*(t+1)}\|_F \le \frac{\sigma_k^2}{2\xi\sigma_1} \Big\}

holds with probability at least 1 - 6n^{-3}. Moreover, we have

\|V^{(t+0.5)} - V^{*(t)}\|_F \le \sqrt{\delta_{2k}} \|V^{(t)} - V^{*(t)}\|_F + \frac{2\sigma_k}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F,   (102)
\|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F \le \frac{4\sqrt{\delta_{2k}}}{\sigma_k} \|V^{(t)} - V^{*(t)}\|_F + \frac{8}{\xi} \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F,   (103)
\|U^{(t)} - U^{*(t+1)}\|_F \le \frac{5\sigma_1\sqrt{\delta_{2k}}}{\sigma_k} \|V^{(t)} - V^{*(t)}\|_F + \Big(\frac{10}{\xi} + 1\Big)\sigma_1 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F,   (104)

with probability at least 1 - 6n^{-3}.

The proof of Corollary 7 is provided in Appendix G.3. Since the algorithm updates U and V in a symmetric manner, we can establish similar results for a complete iteration of updating U in the next corollary.

Corollary 8. Suppose that E_V^{(t)} holds. Then E_U^{(t+1)} holds with probability at least 1 - 6n^{-3}. Moreover, we have

\|U^{(t+0.5)} - U^{*(t+1)}\|_F \le \sqrt{\delta_{2k}} \|U^{(t)} - U^{*(t+1)}\|_F + \frac{2\sigma_k}{\xi} \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F,   (105)
\|\bar{U}^{(t+1)} - \bar{U}^{*(t+1)}\|_F \le \frac{4\sqrt{\delta_{2k}}}{\sigma_k} \|U^{(t)} - U^{*(t+1)}\|_F + \frac{8}{\xi} \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F,   (106)
\|V^{(t+1)} - V^{*(t+1)}\|_F \le \frac{5\sigma_1\sqrt{\delta_{2k}}}{\sigma_k} \|U^{(t)} - U^{*(t+1)}\|_F + \Big(\frac{10}{\xi} + 1\Big)\sigma_1 \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F,   (107)

with probability at least 1 - 6n^{-3}.
The proof of Corollary 8 directly follows Appendix G.3, and is therefore omitted.

We then proceed with the proof of Theorem 2 conditioning on E_U^{(0)}. Similar to §4.3, we can recursively apply Corollaries 7 and 8, and show that \{E_U^{(t)}\}_{t=1}^T and \{E_V^{(t)}\}_{t=0}^T simultaneously hold with probability at least 1 - 12Tn^{-3}. For simplicity, we define

\phi_{V^{(t+1)}} = \|V^{(t+1)} - V^{*(t+1)}\|_F, \quad \phi_{V^{(t+0.5)}} = \|V^{(t+0.5)} - V^{*(t)}\|_F, \quad \phi_{\bar{V}^{(t+1)}} = \sigma_1 \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F,
\phi_{U^{(t+1)}} = \|U^{(t+1)} - U^{*(t+2)}\|_F, \quad \phi_{U^{(t+0.5)}} = \|U^{(t+0.5)} - U^{*(t+1)}\|_F, \quad \phi_{\bar{U}^{(t+1)}} = \sigma_1 \|\bar{U}^{(t+1)} - \bar{U}^{*(t+1)}\|_F.

We then follow similar lines to §4.3 and §F.1, and show that \|M^{(T)} - M^*\|_F \le \epsilon with high probability.
F.3 Proof of Theorem 2 (Gradient Descent)

Proof. The convergence analysis of the gradient descent algorithm is similar to that of alternating gradient descent. The only difference is that, for updating U, gradient descent uses V = V^{(t)} instead of V = V^{(t+1)} to calculate the gradient at U = U^{(t)}. Everything else directly follows Appendix F.2, and is therefore omitted.
G Lemmas for Theorem 2 (Alternating Gradient Descent)

G.1 Proof of Lemma 30

Proof. Recall that W^{in} = V^{(t+0.5)} and \bar{V}^{(t+1)} = W^{out} in Algorithm 3. By Lemma 26, we can show

\|W^{out} - \bar{V}^{*(t+1)}\|_F \le \frac{4}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F.   (108)

By Lemma 17, we have

\|R_V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F = \|\bar{V}^{(t+1)\top} V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F
\le \|\bar{V}^{(t+1)}\|_2 \|V^{(t+0.5)} - V^{*(t)}\|_F + \|V^{*(t)}\|_2 \|\bar{V}^{(t+1)} - \bar{V}^{*(t+1)}\|_F
\le \|V^{(t+0.5)} - V^{*(t)}\|_F + \frac{4\sigma_1}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F,   (109)

where the last inequality comes from (108), \|\bar{V}^{(t+1)}\|_2 = 1, and \|V^{*(t)}\|_2 = \sigma_1. Moreover, we define U^{*(t+1)} = U^{*(t)} (\bar{V}^{*(t+1)\top} V^{*(t)})^\top. Then we can verify

U^{*(t+1)} \bar{V}^{*(t+1)\top} = U^{*(t)} V^{*(t)\top} \bar{V}^{*(t+1)} \bar{V}^{*(t+1)\top} = M^*,

where the last equality holds since \bar{V}^{*(t+1)} \bar{V}^{*(t+1)\top} is exactly the projection matrix for the row space of M^*. Thus we further have

\|U^{(t)} - U^{*(t+1)}\|_F = \|\bar{U}^{(t)} (\bar{V}^{(t+1)\top} V^{(t+0.5)})^\top - U^{*(t)} (\bar{V}^{*(t+1)\top} V^{*(t)})^\top\|_F
\le \|\bar{U}^{(t)}\|_2 \|R_V^{(t+0.5)} - \bar{V}^{*(t+1)\top} V^{*(t)}\|_F + \|\bar{V}^{*(t+1)\top} V^{*(t)}\|_2 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F
\le \frac{5\sigma_1}{\sigma_k} \|V^{(t+0.5)} - V^{*(t)}\|_F + \sigma_1 \|\bar{U}^{(t)} - \bar{U}^{*(t)}\|_F,

where the last inequality comes from (109), \|\bar{U}^{(t)}\|_2 = 1, \|\bar{V}^{*(t+1)\top} V^{*(t)}\|_2 \le \sigma_1, and \sigma_1 \ge \sigma_k.

G.2 Proof of Lemma 31
Proof. Following similar lines to Appendix E.4, we can obtain
r
σ2
8µ k
(0)
(0)
∗(0)
max kU i∗ k2 ≤
, kU − U
kF ≤ k 2 ,
i
7
m
8ξσ1
r
σ2
8µ k
(0)
(0)
∗(0)
max kV j∗ k2 ≤
kF ≤ k 2 .
, kV − V
j
7
n
8ξσ1
(110)
(111)
Then by Lemma 17, we have
kU
(0)> f
M −U
∗(0)>
M ∗ kF ≤ kU
≤
(0)
f − M ∗ kF + kM ∗ k2 kU (0) − U ∗(0) kF
k2 kM
σk2
5σk2
σ13
+
≤
.
32ξσ1
32ξσk2 8ξσ1
47
(112)
By Lemma 17 again, we have
kV
(0)> f>
M U
(0)
≤ kV
≤
−V
(0)
∗(0)>
M ∗> U
f> U
k2 kM
(0)
∗(0)
kF
− M ∗> U
∗(0)
kF + kU
∗(0)>
M ∗ k2 kV
(0)
−V
∗(0)
kF
5σk2
σ2
9σk2
+ k ≤
,
32ξσ1 8ξσ1
32ξσ1
(113)
where the last inequality comes from (111) and (112), and kM ∗ k2 = σ1 . By Lemma 17 again, we
have
$$\|V^{(0)} - V^{*(0)}\|_F \le \|V^{(0)}V^{(0)\top}\widetilde{M}^\top U^{(0)} - V^{*(0)}V^{*(0)\top}M^{*\top}U^{*(0)}\|_F$$
$$\le \|V^{(0)}\|_2\|V^{(0)\top}\widetilde{M}^\top U^{(0)} - V^{*(0)\top}M^{*\top}U^{*(0)}\|_F + \|U^{*(0)\top}M^*V^{*(0)}\|_2\|V^{(0)} - V^{*(0)}\|_F$$
$$\le \frac{9\sigma_k^2}{32\xi\sigma_1} + \frac{\sigma_k^2}{8\xi\sigma_1} \le \frac{13\sigma_k^2}{32\xi\sigma_1},\tag{114}$$
where the last two inequalities come from (111), (113), $\|U^{*(0)\top}M^*V^{*(0)}\|_2 \le \sigma_1$, the definition of $\xi$, and $\sigma_1 \ge \sigma_k$. Moreover, by the incoherence of $V^{*(0)}$, we have
$$\|V^{(0)}_{j*}\|_2 \le \|V^{(0)\top}e_j\|_2 \le \|\widetilde{M}^\top U^{(0)} - V^{*(0)}V^{*(0)\top}M^{*\top}U^{*(0)}\|_F + \frac{6}{5}\sqrt{\frac{\mu k}{n}}\,\|V^{*(0)\top}M^{*\top}U^{*(0)}\|_2$$
$$\le \Big(1 + \frac{9\sigma_k^2}{32\xi\sigma_1^2}\Big)\cdot\frac{6\sigma_1}{5}\sqrt{\frac{\mu k}{n}} \le \frac{41\sigma_1}{28}\sqrt{\frac{\mu k}{n}},$$
where the last two inequalities come from (111), (114), the definition of $\xi$, and $\sigma_1 \ge \sigma_k$.
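The incoherence parameter $\mu$ appearing in (110)–(111) controls the row norms of the factors: a matrix $V \in \mathbb{R}^{n \times k}$ with orthonormal columns is $\mu$-incoherent if $\max_j \|V_{j*}\|_2 \le \sqrt{\mu k/n}$. A small numerical check of the range $\mu$ can take (an illustrative sketch; the paper's exact definition of $\mu$ may carry different constants):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3

# Orthonormal factor of a random Gaussian matrix; such factors are
# incoherent (small mu) with high probability.
V, _ = np.linalg.qr(rng.standard_normal((n, k)))

row_norms = np.linalg.norm(V, axis=1)
mu = n * row_norms.max() ** 2 / k   # smallest mu for which V is mu-incoherent

assert np.allclose(V.T @ V, np.eye(k))
assert row_norms.max() <= np.sqrt(mu * k / n) + 1e-12
assert 1 - 1e-9 <= mu <= n / k      # mu always lies in [1, n/k]
```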
G.3 Proof of Corollary 7

Proof. Since $E_U^{(t)}$ implies that $E_{U,1}^{(t)}, \ldots, E_{U,6}^{(t)}$ hold with probability $1 - 6n^{-3}$, combining Lemmas 28 and 29 we obtain
$$\|V^{(t+0.5)} - V^{*(t)}\|_F \le \delta_{2k}\|V^{(t)} - V^{*(t)}\|_F + \frac{2}{1+\delta_{2k}}\sqrt{E\big(V^{(t+0.5)}, V^{(t)}, U^{(t)}\big)}$$
$$\le \delta_{2k}\|V^{(t)} - V^{*(t)}\|_F + \frac{2\sigma_k}{\xi}\|U^{(t)} - U^{*(t)}\|_F$$
$$\le \frac{\sigma_k^3}{12\xi\sigma_1^3}\cdot\frac{\sigma_k^2}{2\xi\sigma_1} + \frac{2\sigma_k}{\xi}\cdot\frac{\sigma_k^2}{4\xi\sigma_1^2} = \frac{\sigma_k^5}{24\xi^2\sigma_1^4} + \frac{\sigma_k^3}{2\xi^2\sigma_1^2} \le \frac{\sigma_k}{8}$$
with probability $1 - 6n^{-3}$, where the last inequality comes from the definition of $\xi$ and $\sigma_1 \ge \sigma_k$.
Thus by Lemma 30, we have
$$\|V^{(t+1)} - V^{*(t+1)}\|_F \le \frac{4\sqrt{\delta_{2k}}}{\sigma_k}\|V^{(t)} - V^{*(t)}\|_F + \frac{8}{\xi}\|U^{(t)} - U^{*(t)}\|_F$$
$$\le \frac{4}{\sigma_k}\cdot\frac{\sigma_k^5}{24\xi^2\sigma_1^4} + \frac{8}{\xi}\cdot\frac{\sigma_k^2}{4\xi\sigma_1^2} = \frac{\sigma_k^4}{6\xi^2\sigma_1^4} + \frac{2\sigma_k^2}{\xi^2\sigma_1^2} \le \frac{\sigma_k^2}{4\xi\sigma_1^2},$$
with probability $1 - 6n^{-3}$, where the last inequality comes from the definition of $\xi$ and $\sigma_1 \ge \sigma_k$.
Moreover, by Lemma 30 again, we have
$$\|U^{(t)} - U^{*(t+1)}\|_F \le \frac{5\sigma_1\sqrt{\delta_{2k}}}{\sigma_k}\|V^{(t)} - V^{*(t)}\|_F + \Big(\frac{10}{\xi} + 1\Big)\sigma_1\|U^{(t)} - U^{*(t)}\|_F$$
$$\le \frac{5\sigma_1}{\sigma_k}\cdot\frac{\sigma_k^3}{12\xi\sigma_1^3}\cdot\frac{\sigma_k^2}{2\xi\sigma_1} + \Big(\frac{10}{\xi} + 1\Big)\sigma_1\cdot\frac{\sigma_k^2}{4\xi\sigma_1^2} \le \frac{5\sigma_k^4}{24\xi^2\sigma_1^3} + \frac{\sigma_k^2}{3\xi\sigma_1} \le \frac{\sigma_k^2}{2\xi\sigma_1}$$
with probability $1 - 6n^{-3}$, where the last inequality comes from the definition of $\xi$ and $\sigma_1 \ge \sigma_k$.
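Corollary 7 gives geometric contraction of the factor errors toward the global optimum. The behavior is easy to see on a toy instance: below we run plain alternating least squares on a fully observed rank-$k$ matrix (an illustrative sketch only; the paper's Algorithm 3 additionally uses QR retractions and, for completion, works from partial observations):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 30, 20, 3

# Ground truth M* = U* V*^T and a perturbed initialization of U.
Mstar = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U = np.linalg.svd(Mstar, full_matrices=False)[0][:, :k]
U = U + 0.3 * rng.standard_normal((m, k))

errs = []
for _ in range(5):
    # V-step then U-step: each update is a least squares problem
    # because M* is fully observed here.
    V = np.linalg.lstsq(U, Mstar, rcond=None)[0].T    # n x k
    U = np.linalg.lstsq(V, Mstar.T, rcond=None)[0].T  # m x k
    errs.append(np.linalg.norm(U @ V.T - Mstar, "fro"))

# With exact rank-k data the reconstruction error collapses to
# numerical zero within a few sweeps.
assert errs[-1] <= 1e-8 * np.linalg.norm(Mstar, "fro")
```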
References
Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Noisy matrix decomposition via
convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197,
2012.
Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms
for sparse coding. arXiv preprint arXiv:1503.00778, 2015.
Michel Baes. Estimate sequence methods: extensions and approximations. Institute for Operations
Research, ETH, Zürich, Switzerland, 2009.
Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM
algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.
Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster
semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.
Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for
matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
T Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections. The Annals of
Statistics, 43(1):102–138, 2015.
Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow:
Theory and algorithms. arXiv preprint arXiv:1407.1065, 2014.
Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98
(6):925–936, 2010.
Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.
Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix
completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity
incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for
robust multi-task learning. In Proceedings of the 17th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 42–50. ACM, 2011.
Yudong Chen. Incoherence-optimal matrix completion. arXiv preprint arXiv:1310.0154, 2013.
Yudong Chen and Martin J. Wainwright. Fast low-rank estimation by projected gradient descent:
General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion. arXiv preprint arXiv:1306.2979, 2013a.
Yudong Chen, Ali Jalali, Sujay Sanghavi, and Constantine Caramanis. Low-rank matrix recovery
from errors and erasures. IEEE Transactions on Information Theory, 59(7):4324–4337, 2013b.
Alexandre d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex
optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting.
SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization
with distributed stochastic gradient descent. In ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 69–77, 2011.
David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions
on Information Theory, 57(3):1548–1566, 2011.
Osman Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
Moritz Hardt. Understanding alternating minimization for matrix completion. In Symposium on
Foundations of Computer Science, pages 651–660, 2014.
Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. arXiv
preprint arXiv:1407.4070, 2014.
Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits for
matrix completion. arXiv preprint arXiv:1402.2331, 2014.
Trevor Hastie, Rahul Mazumder, Jason Lee, and Reza Zadeh. Matrix completion and low-rank SVD
via fast alternating least squares. arXiv preprint arXiv:1410.2596, 2014.
Cho-Jui Hsieh and Peder Olsen. Nuclear norm minimization via active subspace selection. In
International Conference on Machine Learning, pages 575–583, 2014.
Daniel Hsu, Sham M Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221–7234, 2011.
Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. arXiv
preprint arXiv:1411.1087, 2014.
Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular
value projection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.
Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing, pages 665–674, 2013.
Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few
entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010a.
Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy
entries. Journal of Machine Learning Research, 11:2057–2078, 2010b.
Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and
optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329,
2011.
Yehuda Koren. The BellKor solution to the Netflix Grand Prize. Netflix Prize Documentation, 81,
2009.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender
systems. Computer, 42(8):30–37, 2009.
Kiryung Lee and Yoram Bresler. ADMiRA: Atomic decomposition for minimum rank approximation.
IEEE Transactions on Information Theory, 56(9):4402–4416, 2010.
Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods:
a general approach. Annals of Operations Research, 46(1):157–178, 1993.
Angelia Nedić and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In
Stochastic optimization: algorithms and applications, pages 223–264. Springer, 2001.
Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise
and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix
completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1):1665–
1697, 2012.
Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer,
2004.
Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering.
In Proceedings of KDD Cup and workshop, volume 2007, pages 5–8, 2007.
Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research,
12:3413–3430, 2011.
Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix
completion. Mathematical Programming Computation, 5(2):201–226, 2013.
Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of
linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
Angelika Rohde and Alexandre B Tsybakov. Estimation of high-dimensional low-rank matrices.
The Annals of Statistics, 39(2):887–930, 2011.
Gilbert W Stewart, Ji-guang Sun, and Harcourt Brace Jovanovich. Matrix perturbation theory,
volume 175. Academic press New York, 1990.
Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. arXiv
preprint arXiv:1411.8003, 2014.
Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. Major components of the
gravity recommendation system. ACM SIGKDD Explorations Newsletter, 9(2):80–83, 2007.
Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear
matrix equations via procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, Hasan Davulcu, and Jieping Ye. Orthogonal
rank-one matrix pursuit for low rank matrix completion. SIAM Journal on Scientific Computing,
37(1):A488–A514, 2015.
Shuo Xiang, Yunzhang Zhu, Xiaotong Shen, and Jieping Ye. Optimal exact least squares rank
minimization. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 480–488. ACM, 2012.
Qi Yan, Jieping Ye, and Xiaotong Shen. Simultaneous pursuit of sparseness and rank structures
for matrix decomposition. Journal of Machine Learning Research, 16:47–75, 2015.
Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix
factorization. Advances in Neural Information Processing Systems, 2015. To appear.
Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. arXiv preprint
arXiv:1506.06081, 2015.
Yunzhang Zhu, Xiaotong Shen, and Changqing Ye. Personalized prediction and sparsity pursuit in
latent factor models. Journal of the American Statistical Association, 2015. To appear.
Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. A fast parallel SGD for matrix
factorization in shared memory systems. In ACM Conference on Recommender Systems, pages
249–256, 2013.