Accelerated Stochastic Gradient Descent

Praneeth Netrapalli
MSR India
Presented at OSL workshop, Les Houches, France.
Joint work with
Prateek Jain, Sham M. Kakade, Rahul Kidambi and Aaron Sidford
Linear regression
min ๐‘“ ๐‘ฅ = ๐ด๐‘ฅ โˆ’ ๐‘
๐‘ฅ
2
2
๐‘ฅ โˆˆ โ„๐‘‘ , ๐ด โˆˆ โ„๐‘›×๐‘‘ , ๐‘ โˆˆ โ„๐‘›
• Basic problem that arises pervasively in applications
• Deeply studied in the literature
Gradient descent for linear regression
x_{t+1} = x_t − δ · A^⊤(Ax_t − b)    [gradient step]
• Convergence rate: O( κ log((f(x_0) − f*)/ε) )
• f* = min_x f(x);  ε = target suboptimality
• Condition number: κ = σ_max(A^⊤A) / σ_min(A^⊤A)
Question: Is it possible to do better?
Gradient descent:  x_{t+1} = x_t − δ · ∇f(x_t)
• Hope: GD does not reuse past gradients
• Answer: Yes!
  ➢ Conjugate gradient (Hestenes and Stiefel 1952)
  ➢ Heavy ball method (Polyak 1964)
  ➢ Accelerated gradient descent (Nemirovsky and Yudin 1977, Nesterov 1983)
Accelerated gradient descent (AGD)
๐‘ฅ๐‘ก+1 = ๐‘ฆ๐‘ก โˆ’ ๐›ฟ๐›ป๐‘“ ๐‘ฆ๐‘ก
๐‘ฆ๐‘ก+1 = ๐‘ฅ๐‘ก+1 + ๐›พ ๐‘ฅ๐‘ก+1 โˆ’ ๐‘ฅ๐‘ก
๐‘“ ๐‘ฅ0 โˆ’๐‘“โˆ—
๐œ… log
๐œ–
โ€ข Convergence rate: ๐‘‚
โ€ข ๐‘“ โˆ— = min ๐‘“ ๐‘ฅ ;
๐‘ฅ
๐œ– = Target suboptimality
โ€ข Condition number: ๐œ… =
๐œŽ๐‘š๐‘Ž๐‘ฅ ๐ดโŠค ๐ด
๐œŽ๐‘š๐‘–๐‘› ๐ดโŠค ๐ด
Accelerated gradient descent (AGD)
• Convergence rate: O( √κ log((f(x_0) − f*)/ε) ), compared to O( κ log((f(x_0) − f*)/ε) ) for GD
• f* = min_x f(x);  ε = target suboptimality
• Condition number: κ = σ_max(A^⊤A) / σ_min(A^⊤A)
Source: http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html
Stochastic approximation (Robbins and Monro 1951)
• Distribution 𝒟 on ℝ^d × ℝ
• f(x) ≜ E_{(a,b)∼𝒟} [ (a^⊤x − b)^2 ]
• Equivalent to ||Ax − b||_2^2 where A has infinitely many rows
• Observe n pairs (a_1, b_1), …, (a_n, b_n)
• Interested in the entire distribution 𝒟 rather than the data points, as in ML
• Fit a linear model to the distribution
• Cannot compute exact gradients
Stochastic gradient descent (SGD) (Robbins and Monro 1951)
• x_{t+1} = x_t − δ · ∇̂f(x_t), where E[∇̂f(x_t)] = ∇f(x_t)
• Return (1/n) Σ_i x_i   (Polyak and Juditsky 1992)
• Is gradient descent in expectation
• For linear regression, SGD:  x_{t+1} = x_t − δ · (a_t^⊤ x_t − b_t) a_t
• Streaming algorithm: extremely efficient and widely used in practice
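A minimal streaming sketch of this update with iterate averaging (illustrative only; `stream` is assumed to be any iterator yielding fresh (a, b) pairs from 𝒟, and the constant step size δ is left to the caller, a value of order 1/max||a||_2^2 being typical).

```python
import numpy as np

def sgd_linear_regression(stream, d, delta, n):
    """One pass of SGD over n samples (a_t, b_t), returning the averaged iterate."""
    x = np.zeros(d)
    x_avg = np.zeros(d)
    for t in range(n):
        a_t, b_t = next(stream)                  # one fresh sample from D
        x = x - delta * (a_t @ x - b_t) * a_t    # stochastic gradient step
        x_avg += (x - x_avg) / (t + 1)           # running Polyak-Juditsky average
    return x_avg
```

For example, `stream` could be built from synthetic data, e.g. an iterator over pairs `(a, a @ x_star + noise)`.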
Best possible rate
• Consider b = a^⊤x* + noise, with noise ∼ 𝒩(0, σ²)
• Recall: f(x) ≜ E_{(a,b)∼𝒟} [ (a^⊤x − b)^2 ]
• x̂ ≜ argmin_x Σ_{i=1}^n (a_i^⊤ x − b_i)^2
• E[f(x̂)] − f(x*) = (1 + o(1)) σ²d/n   (van der Vaart, 2000)
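A quick Monte Carlo check of this σ²d/n rate in the simplest well-specified setting (a ∼ 𝒩(0, I), so the excess risk is just ||x̂ − x*||_2^2); the dimensions and noise level below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, trials = 20, 2000, 0.5, 200
x_star = rng.standard_normal(d)

excess = []
for _ in range(trials):
    A = rng.standard_normal((n, d))               # rows a_i ~ N(0, I), so H = I
    b = A @ x_star + sigma * rng.standard_normal(n)
    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]  # empirical risk minimizer
    excess.append(np.sum((x_hat - x_star) ** 2))  # f(x_hat) - f(x*) = ||x_hat - x*||_H^2

print(np.mean(excess), sigma ** 2 * d / n)        # the two numbers should be close
```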
Best possible rate
• In general: x* ≜ argmin_x f(x)
• E[ (a^⊤x* − b)^2 a a^⊤ ] ≼ σ² E[a a^⊤]   (defines the noise level σ)
• x̂ ≜ argmin_x Σ_{i=1}^n (a_i^⊤ x − b_i)^2
• E[f(x̂)] − f(x*) ≤ (1 + o(1)) σ²d/n   (van der Vaart, 2000)
• Equivalently, suboptimality ε is reached with n ≤ (1 + o(1)) σ²d/ε samples
Convergence rate of SGD
• Convergence rate: Õ( κ log((f(x_0) − f*)/ε) + σ²d/ε )   (Jain et al. 2016)
• f* = min_x f(x);  ε = target suboptimality
• Condition number: κ ≜ max ||a||_2^2 / σ_min(E[a a^⊤])
• Noise level: E[ (a^⊤x* − b)^2 a a^⊤ ] ≼ σ² E[a a^⊤]
Recap
• GD (deterministic case): O( κ log((f(x_0) − f*)/ε) )
• AGD (deterministic case): O( √κ log((f(x_0) − f*)/ε) )
• SGD (stochastic approximation): Õ( κ log((f(x_0) − f*)/ε) + σ²d/ε )
• Accelerated SGD (stochastic approximation): unknown
Question: Is accelerating SGD possible?
Is this really important?
• Extremely important in practice
  ➢ As we saw, acceleration can give orders-of-magnitude improvements
  ➢ Neural network training uses Nesterov's AGD as well as Adam, but there is no theoretical understanding
  ➢ Jain et al. 2016 show that acceleration leads to more parallelizability
• Existing results show AGD is not robust to deterministic noise (d'Aspremont 2008, Devolder et al. 2014) but is robust to random additive noise (Ghadimi and Lan 2010, Dieuleveut et al. 2016)
• Stochastic approximation falls between the above two cases
• Key issue: it mixes optimization and statistics (i.e., # iterations = # samples)
Is acceleration possible?
โ€ข ๐‘ = ๐‘ŽโŠค ๐‘ฅ โˆ—
โ‡’
Noise level: ๐œŽ 2 = 0
โ€ข SGD convergence rate: ๐‘‚เทจ
โ€ข Accelerated rate: ๐‘‚เทจ
๐‘“ ๐‘ฅ0 โˆ’๐‘“โˆ—
๐œ… log
๐œ–
๐‘“ ๐‘ฅ0 โˆ’๐‘“โˆ—
๐œ… log
๐œ–
?
Example I: Discrete distribution
โ€ข๐‘Ž=
0
โ‹ฎ
0
1
0
โ‹ฎ
0
with probability ๐‘๐‘–
โ€ข In this case, ๐œ… โ‰œ
โ€ข Is ๐‘‚เทจ
max ๐‘Ž 22
๐œŽ๐‘š๐‘–๐‘› ๐”ผ ๐‘Ž๐‘ŽโŠค
๐‘“ ๐‘ฅ0 โˆ’๐‘“โˆ—
๐œ… log
๐œ–
=
1
๐‘min
possible?
โ€ข Or, halve the error using ๐‘‚เทจ
๐œ… samples?
Example I: Discrete distribution
โ€ข๐‘Ž=
0
โ‹ฎ
0
1
0
โ‹ฎ
0
with probability ๐‘๐‘– ;
๐œ…โ‰œ
1
๐‘min
โ€ข Fewer than ๐œ… samples โ‡’ do not observe ๐‘min direction
โŠค
ฯƒ
โ‡’ ๐‘– ๐‘Ž๐‘– ๐‘Ž๐‘– not invertible
โ€ข Cannot do better than ๐‘‚ ๐œ…
โ€ข Acceleration not possible
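The obstruction is easy to see numerically; in the sketch below (illustrative, with an arbitrary p_min = 10⁻³), drawing far fewer than κ = 1/p_min samples almost never hits the rare coordinate, so the empirical second-moment matrix stays rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
p = np.array([1e-3] + [(1 - 1e-3) / (d - 1)] * (d - 1))   # p_min = 1e-3, so kappa = 1000
n = 100                                                    # far fewer than kappa samples

idx = rng.choice(d, size=n, p=p)                 # draw a = e_i with probability p_i
A = np.eye(d)[idx]                               # rows are the sampled basis vectors
S = A.T @ A                                      # sum_i a_i a_i^T
print("rare coordinate observed:", bool((idx == 0).any()))   # typically False here
print("rank of sum a_i a_i^T:", np.linalg.matrix_rank(S), "out of", d)
```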
Example II: Gaussian
โ€ข ๐‘Ž โˆผ ๐’ฉ 0, ๐ป , ๐ป is a PSD matrix
โ€ข In this case, ๐œ… โˆผ
Tr ๐ป
๐œŽmin ๐ป
โ‰ฅ๐‘‘
โ€ข However, after ๐‘‚ ๐‘‘ samples:
1
ฯƒ
๐‘› ๐‘–
๐‘Ž๐‘– ๐‘Ž๐‘–โŠค โˆผ ๐ป
โ€ข Possible to solve ๐‘Ž๐‘–โŠค ๐‘ฅ โˆ— = ๐‘๐‘– after O ๐‘‘ samples
โ€ข Acceleration might be possible
Discrete vs Gaussian
[Plots contrasting the discrete distribution (left) and the Gaussian distribution (right)]
Key issue: matrix spectral concentration
• Recall: a_i ∼ 𝒟.  Let H ≜ E[a_i a_i^⊤].
• For x̂ ≜ argmin_x Σ_{i=1}^n (a_i^⊤ x − b_i)^2 to be good, we need
      (1 − δ) H ≼ (1/n) Σ_{i=1}^n a_i a_i^⊤ ≼ (1 + δ) H
• How many samples are required for spectral concentration?
Separating optimization and statistics
• Matrix variance (Tropp 2012): E[ ||a||_2^2 a a^⊤ ]     (recall H ≜ E[a a^⊤])
• Statistical condition number: κ̃ ≜ || E[ ||H^{−1/2} a||_2^2 (H^{−1/2} a)(H^{−1/2} a)^⊤ ] ||_2
• Matrix Bernstein theorem (Tropp 2015): if n > O(κ̃), then
      (1 − δ) H ≼ (1/n) Σ_{i=1}^n a_i a_i^⊤ ≼ (1 + δ) H
Is acceleration possible?
โ€ข ๐‘‚ ๐œ…ว samples sufficient
โ€ข Recall SGD convergence rate: ๐‘‚เทจ
โ€ข Always ๐œ…ว โ‰ค ๐œ….
๐‘“ ๐‘ฅ0 โˆ’๐‘“โˆ—
๐œ… log
๐œ–
Acceleration might be possible if ๐œ…ว โ‰ช ๐œ…
โ€ข Discrete case: ๐œ…ว =
1
๐‘min
= ๐œ…;
Gaussian case: ๐œ…ว = ๐‘‚ ๐‘‘ โ‰ค ๐œ…
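A small numerical illustration of this comparison (the distributions and dimensions below are arbitrary choices, not from the talk): for the discrete example, κ and κ̃ coincide at 1/p_min, while for a Gaussian with an ill-conditioned covariance, κ ≈ Tr(H)/σ_min(H) is large but κ̃ concentrates near d + 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mc = 10, 200_000

# Example I: discrete, a = e_i with probability p_i.
p = np.array([0.5] + [0.5 / (d - 1)] * (d - 1))
print("discrete: kappa = kappa_tilde =", 1 / p.min())      # both equal 1/p_min: no gap

# Example II: Gaussian, a ~ N(0, H) with an ill-conditioned H.
H = np.diag(np.logspace(0, -3, d))                 # eigenvalues from 1 down to 1e-3
kappa = np.trace(H) / np.linalg.eigvalsh(H).min()  # kappa ~ Tr(H) / sigma_min(H)
# kappa_tilde = || E[ ||H^{-1/2}a||^2 (H^{-1/2}a)(H^{-1/2}a)^T ] ||_2 and H^{-1/2}a ~ N(0, I),
# so estimate the expectation by Monte Carlo over w ~ N(0, I).
w = rng.standard_normal((n_mc, d))
M = (w * (w ** 2).sum(axis=1, keepdims=True)).T @ w / n_mc
kappa_tilde = np.linalg.eigvalsh(M).max()          # concentrates near d + 2
print("gaussian: kappa =", kappa, " kappa_tilde =", kappa_tilde)
```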
Result
• Convergence rate of ASGD: Õ( √(κ κ̃) log((f(x_0) − f*)/ε) + σ²d/ε )
• Compared to SGD: Õ( κ log((f(x_0) − f*)/ε) + σ²d/ε )
• Improvement, since κ̃ ≤ κ implies √(κ κ̃) ≤ κ
• Conjecture: lower bound Ω( √(κ κ̃) log((f(x_0) − f*)/ε) )   (inspired by Woodworth and Srebro 2016)
Key takeaway
• Acceleration possible!
• Gain depends on the statistical condition number
Simulations – No noise
[Plots: discrete distribution (left), Gaussian distribution (right)]
Simulations – With noise
[Plots: discrete distribution (left), Gaussian distribution (right)]
High level challenges
• Several versions of accelerated algorithms are known, e.g., conjugate gradient 1952, heavy ball 1964, momentum methods 1983, accelerated coordinate descent 2012, linear coupling 2014
• Many of them are equivalent in the deterministic setting but not in the stochastic setting
• Many different analyses exist even for momentum methods: Nesterov's analysis 1983, coordinate descent 2012, ODE analysis 2013, linear coupling 2014
Algorithm
Nesterov 2012:
Parameters: α, β, γ, δ
1. v_0 = x_0
2. y_{t−1} = α x_{t−1} + (1 − α) v_{t−1}
3. x_t = y_{t−1} − δ ∇f(y_{t−1})
4. z_{t−1} = β y_{t−1} + (1 − β) v_{t−1}
5. v_t = z_{t−1} − γ ∇f(y_{t−1})

Our algorithm (same updates, with the stochastic gradient ∇̂_t f in place of ∇f):
Parameters: α, β, γ, δ
1. v_0 = x_0
2. y_{t−1} = α x_{t−1} + (1 − α) v_{t−1}
3. x_t = y_{t−1} − δ ∇̂_t f(y_{t−1})
4. z_{t−1} = β y_{t−1} + (1 − β) v_{t−1}
5. v_t = z_{t−1} − γ ∇̂_t f(y_{t−1})
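A direct NumPy transcription of this five-line recursion for streaming least squares, with ∇̂_t f(y) = (a_t^⊤ y − b_t) a_t (a sketch only: the slide does not give parameter settings, and the paper ties α, β, γ, δ to κ and κ̃, so they are left as inputs here).

```python
import numpy as np

def asgd(stream, d, alpha, beta, gamma, delta, n, x0=None):
    """Accelerated SGD for streaming least squares (one fresh sample per step)."""
    x = np.zeros(d) if x0 is None else np.array(x0, dtype=float)
    v = x.copy()                              # 1. v_0 = x_0
    for _ in range(n):
        a, b = next(stream)
        y = alpha * x + (1.0 - alpha) * v     # 2. coupling of x and v
        g = (a @ y - b) * a                   #    stochastic gradient at y
        x = y - delta * g                     # 3. short (gradient) step
        z = beta * y + (1.0 - beta) * v       # 4. second coupling
        v = z - gamma * g                     # 5. long step on the v sequence
    return x
```

Setting α = β = 1 and γ = δ collapses the recursion back to plain SGD, which is a convenient sanity check.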
Proof overview
• Recall our guarantee: Õ( √(κ κ̃) log((f(x_0) − f*)/ε) + σ²d/ε )
• The first term depends on the initial error; the second is the statistical error
• Different analyses for the two terms
• For the first term, analyze assuming σ = 0
• For the second term, analyze assuming x_0 = x*
Part I: Potential function
• Iterates x_t, v_t of ASGD.  H ≜ E[a a^⊤].
• Existing analyses use the potential function
      ||x_t − x*||_H^2 + σ_min(H) · ||v_t − x*||_2^2
• We use instead
      ||x_t − x*||_2^2 + σ_min(H) · ||v_t − x*||_{H^{−1}}^2
• We show
      ||x_t − x*||_2^2 + σ_min(H) · ||v_t − x*||_{H^{−1}}^2
      ≤ (1 − 1/√(κ κ̃)) · ( ||x_{t−1} − x*||_2^2 + σ_min(H) · ||v_{t−1} − x*||_{H^{−1}}^2 )
Part II: Stochastic process analysis
๐‘ฅ๐‘ก+1 โˆ’ ๐‘ฅ โˆ—
๐‘ฅ๐‘ก โˆ’ ๐‘ฅ โˆ—
โˆ— =๐ถ
โˆ— + noise
๐‘ฆ๐‘ก+1 โˆ’ ๐‘ฅ
๐‘ฆ๐‘ก โˆ’ ๐‘ฅ
โˆ— โŠค
โˆ—
๐‘ฅ๐‘ก โˆ’ ๐‘ฅ ๐‘ฅ๐‘ก โˆ’ ๐‘ฅ
Let ๐œƒ๐‘ก โ‰œ ๐”ผ ๐‘ฆ โˆ’ ๐‘ฅ โˆ— ๐‘ฆ โˆ’ ๐‘ฅ โˆ—
๐‘ก
๐‘ก
๐œƒ๐‘ก+1 = ๐”…๐œƒ๐‘ก + noise โ‹… noiseโŠค
๐œƒ๐‘› โ†’ เท ๐”…๐‘– noise โ‹… noiseโŠค
๐‘–
= ๐•€โˆ’๐”…
โˆ’1
noise โ‹… noiseโŠค
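Spelling out the geometric-series step behind the last line (my notation: Σ for E[noise · noise^⊤]; convergence uses the fact, noted on the next slide, that the eigenvalues of 𝔅 lie below 1):

```latex
\theta_{t+1} = \mathfrak{B}\,\theta_t + \Sigma
\;\;\Longrightarrow\;\;
\theta_n = \mathfrak{B}^n \theta_0 + \sum_{i=0}^{n-1} \mathfrak{B}^i \Sigma
\;\xrightarrow{\;n\to\infty\;}\;
\sum_{i=0}^{\infty} \mathfrak{B}^i \Sigma = (\mathbb{I}-\mathfrak{B})^{-1}\Sigma .
```

With the Part II convention x_0 = x*, we have θ_0 = 0, so the 𝔅^n θ_0 term vanishes identically.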
Part II: Stochastic process analysis
• Need to understand (I − 𝔅)^{−1} (noise · noise^⊤)
• 𝔅 has singular values > 1, but fortunately eigenvalues < 1
• Solve the 1-dimensional version of (I − 𝔅)^{−1} (noise · noise^⊤) via explicit computations
• Combine the 1-dimensional bounds with (statistical) condition number bounds
• (I − 𝔅)^{−1} (noise · noise^⊤) ≼ κ̃ H^{−1} + δ · I
Recap
• GD (deterministic case): O( κ log((f(x_0) − f*)/ε) )
• AGD (deterministic case): O( √κ log((f(x_0) − f*)/ε) )
• SGD (stochastic approximation): Õ( κ log((f(x_0) − f*)/ε) + σ²d/ε )
• ASGD (stochastic approximation): Õ( √(κ κ̃) log((f(x_0) − f*)/ε) + σ²d/ε )
• Acceleration possible – gain depends on the statistical condition number
• Techniques: new potential function, stochastic process analysis
• Conjecture: our result is tight
Streaming optimization for ML
• Streaming algorithms are very powerful for ML applications
• SGD and its variants are widely used in practice
• Classical stochastic approximation focuses on asymptotic rates
• Tools from optimization help obtain strong finite-sample guarantees
• They also have implications for parallelization
Some examples
• Linear regression
  ➢ Finite-sample guarantees: Moulines and Bach 2011, Defossez and Bach 2015
  ➢ Parallelization: Jain et al. 2016
  ➢ Acceleration: this talk
• Smooth convex functions
  ➢ Finite-sample guarantees: Bach and Moulines 2013
• PCA: Oja's algorithm
  ➢ Rank-1: Balsubramani et al. 2013, Jain et al. 2016
  ➢ Higher rank: Allen-Zhu and Li 2016
Open problems
• Linear regression: parameter-free algorithms, e.g., conjugate gradient
• General convex functions: acceleration, parallelization?
• Non-convex functions: streaming algorithms, acceleration, parallelization?
• PCA: tight finite-sample guarantees?
• Quasi-Newton methods
Thank you!
Questions?
[email protected]