
Uniformly-optimal Stochastic Dual Averaging Methods with Double Projections

Presenter: Xi Chen (Machine Learning Department, Carnegie Mellon University)
Collaborators: Qihang Lin, Javier Pena (Tepper School of Business, Carnegie Mellon University)
Stochastic Composite Optimization

 • Regularized Risk Minimization:

     min_x φ(x) = f(x) + h(x)

   - f(x): expected convex loss
   - h(x): convex (possibly non-smooth) regularization

 • Expected Convex Loss:

     f(x) = E_ξ[F(x, ξ)]

 • A broad class of loss functions:
   - μ: strong convexity parameter of f(x)
   - μ = 0: f(x) non-strongly convex
   - μ > 0: f(x) strongly convex
Sparse Linear Regression

 • Sparse Linear Model: response = design matrix × sparse coefficient vector + noise

 • Stochastic Composite Optimization
   - Expected loss: f(x) = E_{(a,b)~P}[ (1/2)(⟨a, x⟩ − b)² ]
   - L1 regularization: h(x) = λ‖x‖₁, which enforces a sparse solution [Tibshirani 96]

 • Challenges in optimizing the expected loss:
   - We do not know P; we can only sample from P
   - Evaluating f(x) involves integration over the high-dimensional distribution P
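A minimal sketch of the setup above, using a hypothetical sparse linear model (all names, dimensions, and noise levels here are illustrative): the expected loss f(x) = E[F(x, ξ)] has no closed form when P is unknown, but it can be estimated by sampling from P.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sparsity = 100, 5

# Ground-truth sparse coefficient vector: only the first 5 entries are nonzero.
x_star = np.zeros(d)
x_star[:sparsity] = rng.standard_normal(sparsity)

def sample_data_point():
    """Draw xi = (a, b) from the (unknown) distribution P: b = <a, x*> + noise."""
    a = rng.standard_normal(d)
    b = a @ x_star + 0.1 * rng.standard_normal()
    return a, b

def loss(x, a, b):
    """Per-sample squared loss F(x, xi)."""
    return 0.5 * (a @ x - b) ** 2

# We cannot evaluate f(x) = E[F(x, xi)] exactly, but a Monte-Carlo
# average over sampled data points gives an estimate.
x = np.zeros(d)
est = np.mean([loss(x, *sample_data_point()) for _ in range(2000)])
print(f"Monte-Carlo estimate of f(0): {est:.3f}")
```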
Stochastic Optimization

 • What information can we utilize?

   Stochastic Oracle: each query returns a stochastic subgradient G(x, ξ)
   - Randomly draw a data point ξ ~ P (as in online learning)
   - Compute G(x, ξ) = ∇_x F(x, ξ), so that E_ξ[G(x, ξ)] = f'(x)

 • Stochastic subgradient: an empirical estimator of f'(x) = E[F'(x, ξ)]

 • Optimization with stochastic subgradients:
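The oracle above can be sketched as follows. For the squared loss with a ~ N(0, I), the true gradient has the closed form f'(x) = E[a aᵀ](x − x*) = x − x*, which lets us check empirically that the stochastic subgradient is unbiased (the model and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
x_star = rng.standard_normal(d)   # hypothetical true model
x = np.zeros(d)                   # query point

def stochastic_subgradient(x):
    """One oracle call: draw xi = (a, b) ~ P, return G(x, xi) = grad_x F(x, xi).

    For the squared loss F(x, xi) = (1/2)(<a, x> - b)^2 this is a * (<a, x> - b).
    """
    a = rng.standard_normal(d)
    b = a @ x_star + 0.1 * rng.standard_normal()
    return a * (a @ x - b)

# With a ~ N(0, I): f'(x) = E[G(x, xi)] = x - x*, so the empirical mean
# of many oracle calls should approach x - x*.
avg_G = np.mean([stochastic_subgradient(x) for _ in range(20000)], axis=0)
err = np.linalg.norm(avg_G - (x - x_star)) / np.linalg.norm(x - x_star)
print(f"relative error of E[G] vs f'(x): {err:.3f}")
```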
Regularized Dual Averaging

 • Distance generating function: a strongly convex function ω(x)
 • Bregman distance function: V(x, y) = ω(y) − ω(x) − ⟨∇ω(x), y − x⟩
   - Quadratic growth assumption on V(x, y)

 • Regularized Dual Averaging (RDA) Algorithm [Nesterov 09; Xiao 10]

   Projection step (proximal mapping):

     x_{t+1} = argmin_x { ⟨ḡ_t, x⟩ + h(x) + (β_t / t) ω(x) },

   where ḡ_t averages all past stochastic subgradients; the proximal mapping yields a sparse x_{t+1}.
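For h(x) = λ‖x‖₁ and ω(x) = ‖x‖²/2, the RDA projection step has a closed-form soft-thresholding solution (cf. Xiao 10), which is why the iterates x_{t+1} are sparse. A minimal sketch, assuming the standard choice β_t = γ√t:

```python
import numpy as np

def rda_l1_step(g_bar, t, lam, gamma):
    """RDA proximal step for h(x) = lam * ||x||_1 with omega(x) = ||x||^2 / 2.

    Solves x_{t+1} = argmin_x { <g_bar, x> + lam*||x||_1 + (beta_t/t)*||x||^2/2 }
    with beta_t = gamma * sqrt(t).  The solution is an entrywise
    soft-thresholding of the averaged subgradient g_bar, hence sparse.
    """
    beta_t = gamma * np.sqrt(t)
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(t / beta_t) * shrunk

# Entries of g_bar with |g_bar_i| <= lam are truncated to exactly zero.
g_bar = np.array([0.05, -0.5, 0.2, -0.05, 1.0])
x_next = rda_l1_step(g_bar, t=10, lam=0.1, gamma=1.0)
print(x_next)  # first and fourth coordinates are exactly 0
```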
Properties of RDA

 • Expected convergence rate:
   - Convex functions: optimal rate
   - Strongly convex functions: optimal rate
 • Drawback: the final solution is an average of the iterates and hence non-sparse
 • Optimal Regularized Dual Averaging (ORDA) with Double Projections
   - ν_i = (i + 1) / 2
   - Weight for G(y_i, ξ_i): O(i / t²)
   - Advantages: (1) more flexibility; (2) sparse solutions
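A quick numeric check of the weights above: if the contribution of G(y_i, ξ_i) is proportional to ν_i = (i + 1)/2 (an assumption made here purely for illustration), its normalized weight is of order i/t², so later subgradients count more but no single one dominates.

```python
# With nu_i = (i + 1) / 2, the total weight over t iterations is ~ t^2 / 4,
# so the normalized weight of the i-th subgradient decays like O(i / t^2).
t = 1000
nu = [(i + 1) / 2 for i in range(1, t + 1)]
total = sum(nu)
weights = [v / total for v in nu]

i = 600
print(weights[i - 1], i / t**2)  # same order of magnitude
```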
Convergence Rate

 • Convergence rate of ORDA:

     O( L/N² + (M + σ)/√N )

   - L: Lipschitz constant of the gradient of the smooth part of f(x)
   - M: Lipschitz constant of the non-smooth part of f(x)
   - σ: bound on the randomness (variance) of the stochastic subgradient

 • Uniformly-optimal convergence rate: the single bound above matches the optimal rate in every regime:
   - Deterministic f(x) (σ = 0):
     - Smooth (L > 0, M = 0): optimal rate O(L/N²) [Algorithm 3, Tseng 08]
     - Non-smooth (M > 0, L = 0): optimal rate O(M/√N)
   - Stochastic f(x) (σ > 0): optimal rate O(σ/√N)
Strongly Convex

 • Strongly convex f(x) (μ > 0):
   - The same ORDA algorithm applies with modified parameters
   - Unified framework for the convex and strongly convex cases

 • Convergence Rate
   - Strongly convex f(x): ORDA achieves the optimal rate (cf. AC-SA [Lan et al. 10])
   - Non-strongly convex f(x): setting μ = 0 recovers the uniformly optimal rate for general convex f(x)
   - RDA: suboptimal for strongly convex f(x)
Uniformly Optimal Rate for Strongly Convex

 • Uniformly Optimal Rate

 • Multi-stage Algorithm (Homotopy) [Juditsky and Nesterov 10; Hazan and Kale 11; Lan et al. 11]
   - At each stage 1 ≤ k ≤ K, run ORDA for N_k iterations, using the solution x_{k−1} from the previous stage as the starting point
   - N_k = 2·N_{k−1}
   - N = N_1 + ⋯ + N_K is the total number of iterations
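The doubling schedule N_k = 2·N_{k−1} can be sketched as below; `orda_stage` stands in for one ORDA run and `multistage_schedule` is a hypothetical helper (the paper's actual stage lengths depend on problem constants such as the strong-convexity parameter):

```python
def multistage_schedule(N, K):
    """Per-stage budgets [N_1, ..., N_K] with N_k = 2 * N_{k-1} and sum <= N."""
    n1 = N // (2 ** K - 1)          # since N_1 * (2^K - 1) <= N
    return [n1 * 2 ** k for k in range(K)]

def run_multistage(orda_stage, x0, N, K):
    """Stage k restarts ORDA for N_k iterations from the previous solution x_{k-1}."""
    x = x0
    for n_k in multistage_schedule(N, K):
        x = orda_stage(x, n_k)      # one warm-started ORDA run
    return x

# The schedule doubles each stage and uses the full budget when N = N_1 * (2^K - 1).
print(multistage_schedule(1500, 4))   # [100, 200, 400, 800]

# Toy stand-in for ORDA, just to show the warm-starting control flow.
x_final = run_multistage(lambda x, n: x / (n + 1), x0=64.0, N=15, K=3)
```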
Experiment: Synthetic Data

 • Elastic-net regularization: h(x) = λ‖x‖₁ + (ρ/2)‖x‖₂² (combining L1 and squared L2 penalties)
Experiment: Real Data

 • Binary classification with sparse logistic regression
Distributed Implementation

 • Distributed implementation following [Dekel et al. 11]
 • Mini-batch strategy
 • Each query of the stochastic oracle is implemented in a distributed manner, with a subsequent averaging based on a vector-sum operation
 • Asymptotic linear speedup
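A sketch of the mini-batch oracle under the same illustrative least-squares model as before. The worker count, batch size, and the thread-based parallelism are assumptions for demonstration; [Dekel et al. 11] combine results across actual distributed nodes with a vector-sum operation, emulated here by averaging.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
d, workers, batch_per_worker = 10, 4, 64
x_star = rng.standard_normal(d)   # hypothetical true model
x = np.zeros(d)                   # current iterate

def local_minibatch_gradient(x, seed):
    """One worker: average batch_per_worker stochastic subgradients locally."""
    local = np.random.default_rng(seed)   # each worker has its own stream
    g = np.zeros(d)
    for _ in range(batch_per_worker):
        a = local.standard_normal(d)
        b = a @ x_star + 0.1 * local.standard_normal()
        g += a * (a @ x - b)
    return g / batch_per_worker

# Mini-batch oracle: workers compute local averages in parallel, then a
# vector-sum / average combines them into one low-variance stochastic
# gradient -- a single "query" of the distributed oracle.
with ThreadPoolExecutor(max_workers=workers) as pool:
    grads = list(pool.map(lambda s: local_minibatch_gradient(x, s), range(workers)))
G = np.mean(grads, axis=0)
print(np.linalg.norm(G - (x - x_star)))  # variance shrinks with total batch size
```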
Conclusion

 • Stochastic optimization is a powerful tool for large-scale machine learning and online learning
 • Regularized Dual Averaging: utilizes all historical stochastic subgradients
 • Optimal Regularized Dual Averaging: optimal rates for both convex and strongly convex loss functions
 • Multi-stage Extension: uniformly optimal rate for strongly convex losses
 • Easily parallelizable
Variance and High Probability Bounds

 • A bound on E[φ(x_{N+1}) − φ(x*)] only implies that φ(x_{N+1}) converges to φ(x*) on average
 • If we are only allowed to run the algorithm once, is the solution reliable?
 • Key: show convergence of the variance, Var[φ(x_{N+1}) − φ(x*)] → 0
 • Requires a slightly stronger assumption (a bound on the fourth moment of G(x, ξ) − f'(x))
 • High probability bounds:
Comparison to Existing Methods

Method                  | Uniformly-Opt | Optimal Rate           | Uniformly-Opt          | Final x | Bregman
                        | (Convex f(x)) | (Strongly Convex f(x)) | (Strongly Convex f(x)) |         | Divergence
------------------------+---------------+------------------------+------------------------+---------+-----------
FOBOS (Duchi et al. 09) | NO            | NO                     | NO                     | Prox    | NO
COMID (Duchi et al. 09) | NO            | NO                     | NO                     | Prox    | YES
SAGE (Hu et al. 09)     | YES           | YES                    | NO                     | Prox    | NO
AC-SA (Lan et al. 10)   | YES           | YES                    | NO                     | Avg     | YES
M-AC-SA (Lan et al. 10) | NA            | YES                    | YES                    | Avg     | YES
RDA (Xiao 10)           | NO            | NO                     | NO                     | Avg     | YES
AC-RDA (Xiao 10)        | YES           | NO                     | NO                     | Avg     | YES
ORDA                    | YES           | YES                    | NO                     | Prox    | YES
M-ORDA                  | NA            | YES                    | YES                    | Prox    | YES

Final x = Prox: the last proximal-mapping (projection) solution is returned; Final x = Avg: an average of the iterates is returned.
The non-dual-averaging methods project with only the current stochastic subgradient, while the RDA-type methods use dual averaging over all historical subgradients.