
Uniformly-optimal Stochastic Dual Averaging Methods with Double Projections

Presenter: Xi Chen (Machine Learning Department, Carnegie Mellon University)
Collaborators: Qihang Lin, Javier Pena (Tepper School of Business, Carnegie Mellon University)
Stochastic Composite Optimization

 • Regularized Risk Minimization:

     min_x φ(x) = f(x) + h(x)

   - f(x): expected convex loss
   - h(x): convex (possibly non-smooth) regularization

 • Expected Convex Loss:

     f(x) = E_ξ[F(x, ξ)]

 • A broad class of loss functions:
   - μ: strong convexity parameter of f(x)
   - μ = 0: f(x) non-strongly convex
   - μ > 0: f(x) strongly convex
Sparse Linear Regression

 • Sparse Linear Model: response = design matrix × sparse coefficient vector + noise

 • Stochastic Composite Optimization
   - Expected loss: f(x) = E_{(a,b)~P}[ (1/2)(⟨a, x⟩ − b)² ]
   - L1 regularization: h(x) = λ‖x‖₁, which enforces a sparse solution [Tibshirani 96]

 • Challenges in optimizing the expected loss:
   - We do not know P; we can only sample from P
   - Evaluating f(x) involves integration over the high-dimensional distribution P
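A minimal sketch of the setup above, using a hypothetical sparse linear model (all names, dimensions, and noise levels here are illustrative): the expected loss f(x) = E[F(x, ξ)] has no closed form when P is unknown, but it can be estimated by sampling from P.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sparsity = 100, 5

# Ground-truth sparse coefficient vector: only the first 5 entries are nonzero.
x_star = np.zeros(d)
x_star[:sparsity] = rng.standard_normal(sparsity)

def sample_data_point():
    """Draw xi = (a, b) from the (unknown) distribution P: b = <a, x*> + noise."""
    a = rng.standard_normal(d)
    b = a @ x_star + 0.1 * rng.standard_normal()
    return a, b

def loss(x, a, b):
    """Per-sample squared loss F(x, xi)."""
    return 0.5 * (a @ x - b) ** 2

# We cannot evaluate f(x) = E[F(x, xi)] exactly, but a Monte-Carlo
# average over sampled data points gives an estimate.
x = np.zeros(d)
est = np.mean([loss(x, *sample_data_point()) for _ in range(2000)])
print(f"Monte-Carlo estimate of f(0): {est:.3f}")
```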
Stochastic Optimization

 • What information can we utilize?

   Stochastic Oracle: each query returns a stochastic subgradient G(x, ξ)
   - Randomly draw a data point ξ ~ P (as in online learning)
   - Compute G(x, ξ) = ∇_x F(x, ξ), so that E_ξ[G(x, ξ)] = f'(x)

 • Stochastic subgradient: an empirical estimator of f'(x) = E[F'(x, ξ)]

 • Optimization with stochastic subgradients:
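The oracle above can be sketched as follows. For the squared loss with a ~ N(0, I), the true gradient has the closed form f'(x) = E[a aᵀ](x − x*) = x − x*, which lets us check empirically that the stochastic subgradient is unbiased (the model and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
x_star = rng.standard_normal(d)   # hypothetical true model
x = np.zeros(d)                   # query point

def stochastic_subgradient(x):
    """One oracle call: draw xi = (a, b) ~ P, return G(x, xi) = grad_x F(x, xi).

    For the squared loss F(x, xi) = (1/2)(<a, x> - b)^2 this is a * (<a, x> - b).
    """
    a = rng.standard_normal(d)
    b = a @ x_star + 0.1 * rng.standard_normal()
    return a * (a @ x - b)

# With a ~ N(0, I): f'(x) = E[G(x, xi)] = x - x*, so the empirical mean
# of many oracle calls should approach x - x*.
avg_G = np.mean([stochastic_subgradient(x) for _ in range(20000)], axis=0)
err = np.linalg.norm(avg_G - (x - x_star)) / np.linalg.norm(x - x_star)
print(f"relative error of E[G] vs f'(x): {err:.3f}")
```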
Regularized Dual Averaging

 • Distance generating function: a strongly convex function ω(x)
 • Bregman distance function: V(x, y) = ω(y) − ω(x) − ⟨∇ω(x), y − x⟩
   - Quadratic growth assumption on V(x, y)

 • Regularized Dual Averaging (RDA) Algorithm [Nesterov 09; Xiao 10]

   Projection step (proximal mapping):

     x_{t+1} = argmin_x { ⟨ḡ_t, x⟩ + h(x) + (β_t / t) ω(x) },

   where ḡ_t averages all past stochastic subgradients; the proximal mapping yields a sparse x_{t+1}.
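For h(x) = λ‖x‖₁ and ω(x) = ‖x‖²/2, the RDA projection step has a closed-form soft-thresholding solution (cf. Xiao 10), which is why the iterates x_{t+1} are sparse. A minimal sketch, assuming the standard choice β_t = γ√t:

```python
import numpy as np

def rda_l1_step(g_bar, t, lam, gamma):
    """RDA proximal step for h(x) = lam * ||x||_1 with omega(x) = ||x||^2 / 2.

    Solves x_{t+1} = argmin_x { <g_bar, x> + lam*||x||_1 + (beta_t/t)*||x||^2/2 }
    with beta_t = gamma * sqrt(t).  The solution is an entrywise
    soft-thresholding of the averaged subgradient g_bar, hence sparse.
    """
    beta_t = gamma * np.sqrt(t)
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(t / beta_t) * shrunk

# Entries of g_bar with |g_bar_i| <= lam are truncated to exactly zero.
g_bar = np.array([0.05, -0.5, 0.2, -0.05, 1.0])
x_next = rda_l1_step(g_bar, t=10, lam=0.1, gamma=1.0)
print(x_next)  # first and fourth coordinates are exactly 0
```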
Properties of RDA

 • Expected convergence rate:
   - Convex functions: optimal rate
   - Strongly convex functions: optimal rate
 • Drawback: the final solution is an average of the iterates and hence non-sparse
 • Optimal Regularized Dual Averaging (ORDA) with Double Projections
   - ν_i = (i + 1) / 2
   - Weight for G(y_i, ξ_i): O(i / t²)
   - Advantages: (1) more flexibility; (2) sparse solutions
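A quick numeric check of the weights above: if the contribution of G(y_i, ξ_i) is proportional to ν_i = (i + 1)/2 (an assumption made here purely for illustration), its normalized weight is of order i/t², so later subgradients count more but no single one dominates.

```python
# With nu_i = (i + 1) / 2, the total weight over t iterations is ~ t^2 / 4,
# so the normalized weight of the i-th subgradient decays like O(i / t^2).
t = 1000
nu = [(i + 1) / 2 for i in range(1, t + 1)]
total = sum(nu)
weights = [v / total for v in nu]

i = 600
print(weights[i - 1], i / t**2)  # same order of magnitude
```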
Convergence Rate

 • Convergence rate of ORDA:

     O( L/N² + (M + σ)/√N )

   - L: Lipschitz constant of the gradient of the smooth part of f(x)
   - M: Lipschitz constant of the non-smooth part of f(x)
   - σ: bound on the randomness (variance) of the stochastic subgradient

 • Uniformly-optimal convergence rate: the single bound above matches the optimal rate in every regime:
   - Deterministic f(x) (σ = 0):
     - Smooth (L > 0, M = 0): optimal rate O(L/N²) [Algorithm 3, Tseng 08]
     - Non-smooth (M > 0, L = 0): optimal rate O(M/√N)
   - Stochastic f(x) (σ > 0): optimal rate O(σ/√N)
Strongly Convex

 • Strongly convex f(x) (μ > 0):
   - The same ORDA algorithm applies with modified parameters
   - Unified framework for the convex and strongly convex cases

 • Convergence Rate
   - Strongly convex f(x): ORDA achieves the optimal rate (cf. AC-SA [Lan et al. 10])
   - Non-strongly convex f(x): setting μ = 0 recovers the uniformly optimal rate for general convex f(x)
   - RDA: suboptimal for strongly convex f(x)
Uniformly Optimal Rate for Strongly Convex

 • Uniformly Optimal Rate

 • Multi-stage Algorithm (Homotopy) [Juditsky and Nesterov 10; Hazan and Kale 11; Lan et al. 11]
   - At each stage 1 ≤ k ≤ K, run ORDA for N_k iterations, using the solution x_{k−1} from the previous stage as the starting point
   - N_k = 2·N_{k−1}
   - N = N_1 + ⋯ + N_K is the total number of iterations
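The doubling schedule N_k = 2·N_{k−1} can be sketched as below; `orda_stage` stands in for one ORDA run and `multistage_schedule` is a hypothetical helper (the paper's actual stage lengths depend on problem constants such as the strong-convexity parameter):

```python
def multistage_schedule(N, K):
    """Per-stage budgets [N_1, ..., N_K] with N_k = 2 * N_{k-1} and sum <= N."""
    n1 = N // (2 ** K - 1)          # since N_1 * (2^K - 1) <= N
    return [n1 * 2 ** k for k in range(K)]

def run_multistage(orda_stage, x0, N, K):
    """Stage k restarts ORDA for N_k iterations from the previous solution x_{k-1}."""
    x = x0
    for n_k in multistage_schedule(N, K):
        x = orda_stage(x, n_k)      # one warm-started ORDA run
    return x

# The schedule doubles each stage and uses the full budget when N = N_1 * (2^K - 1).
print(multistage_schedule(1500, 4))   # [100, 200, 400, 800]

# Toy stand-in for ORDA, just to show the warm-starting control flow.
x_final = run_multistage(lambda x, n: x / (n + 1), x0=64.0, N=15, K=3)
```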
Experiment: Synthetic Data

 • Elastic-net regularization: h(x) = λ‖x‖₁ + (ρ/2)‖x‖₂² (combining L1 and squared L2 penalties)
Experiment: Real Data

 • Binary classification with sparse logistic regression
Distributed Implementation

 • Distributed implementation following [Dekel et al. 11]
 • Mini-batch strategy
 • Each query of the stochastic oracle is implemented in a distributed manner, with a subsequent averaging based on a vector-sum operation
 • Asymptotic linear speedup
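A sketch of the mini-batch oracle under the same illustrative least-squares model as before. The worker count, batch size, and the thread-based parallelism are assumptions for demonstration; [Dekel et al. 11] combine results across actual distributed nodes with a vector-sum operation, emulated here by averaging.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
d, workers, batch_per_worker = 10, 4, 64
x_star = rng.standard_normal(d)   # hypothetical true model
x = np.zeros(d)                   # current iterate

def local_minibatch_gradient(x, seed):
    """One worker: average batch_per_worker stochastic subgradients locally."""
    local = np.random.default_rng(seed)   # each worker has its own stream
    g = np.zeros(d)
    for _ in range(batch_per_worker):
        a = local.standard_normal(d)
        b = a @ x_star + 0.1 * local.standard_normal()
        g += a * (a @ x - b)
    return g / batch_per_worker

# Mini-batch oracle: workers compute local averages in parallel, then a
# vector-sum / average combines them into one low-variance stochastic
# gradient -- a single "query" of the distributed oracle.
with ThreadPoolExecutor(max_workers=workers) as pool:
    grads = list(pool.map(lambda s: local_minibatch_gradient(x, s), range(workers)))
G = np.mean(grads, axis=0)
print(np.linalg.norm(G - (x - x_star)))  # variance shrinks with total batch size
```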
Conclusion

 • Stochastic optimization is a powerful tool for large-scale machine learning and online learning
 • Regularized Dual Averaging: utilizes all historical stochastic subgradients
 • Optimal Regularized Dual Averaging: optimal rates for both convex and strongly convex loss functions
 • Multi-stage Extension: uniformly optimal rate for strongly convex losses
 • Easily parallelizable
Variance and High Probability Bounds

 • A bound on E[φ(x_{N+1}) − φ(x*)] only implies that φ(x_{N+1}) converges to φ(x*) on average
 • If we are only allowed to run the algorithm once, is the solution reliable?
 • Key: show convergence of the variance, Var[φ(x_{N+1}) − φ(x*)] → 0
 • Requires a slightly stronger assumption (a bound on the fourth moment of G(x, ξ) − f'(x))
 • High probability bounds:
Comparison to Existing Methods

Method                  | Uniformly-Opt | Optimal Rate           | Uniformly-Opt          | Final x | Bregman
                        | (Convex f(x)) | (Strongly Convex f(x)) | (Strongly Convex f(x)) |         | Divergence
------------------------+---------------+------------------------+------------------------+---------+-----------
FOBOS (Duchi et al. 09) | NO            | NO                     | NO                     | Prox    | NO
COMID (Duchi et al. 09) | NO            | NO                     | NO                     | Prox    | YES
SAGE (Hu et al. 09)     | YES           | YES                    | NO                     | Prox    | NO
AC-SA (Lan et al. 10)   | YES           | YES                    | NO                     | Avg     | YES
M-AC-SA (Lan et al. 10) | NA            | YES                    | YES                    | Avg     | YES
RDA (Xiao 10)           | NO            | NO                     | NO                     | Avg     | YES
AC-RDA (Xiao 10)        | YES           | NO                     | NO                     | Avg     | YES
ORDA                    | YES           | YES                    | NO                     | Prox    | YES
M-ORDA                  | NA            | YES                    | YES                    | Prox    | YES

Final x = Prox: the last proximal-mapping (projection) solution is returned; Final x = Avg: an average of the iterates is returned.
The non-dual-averaging methods project with only the current stochastic subgradient, while the RDA-type methods use dual averaging over all historical subgradients.