Uniformly-Optimal Stochastic Dual Averaging Methods with Double Projections

Presenter: Xi Chen (Machine Learning Department, Carnegie Mellon University)
Collaborators: Qihang Lin, Javier Peña (Tepper School of Business, Carnegie Mellon University)

Stochastic Composite Optimization
• Regularized risk minimization: min_x φ(x) := f(x) + Ψ(x), where f(x) is an expected convex loss and Ψ(x) is a convex (possibly non-smooth) regularizer.
• Expected convex loss: f(x) = E_ξ[F(x, ξ)].
• This covers a broad class of loss functions; μ denotes the strong-convexity parameter of f(x):
  – μ = 0: f(x) is non-strongly convex;
  – μ > 0: f(x) is strongly convex.

Sparse Linear Regression
• Sparse linear model: y = Xw + noise, with a sparse weight vector w (illustration omitted).
• As stochastic composite optimization: an expected loss f(x) plus L1 regularization, which enforces a sparse solution [Tibshirani 96].
• Challenges in optimizing the expected loss:
  – we do not know the distribution P; we can only sample from it;
  – the expectation is an integral over a high-dimensional P.

Stochastic Optimization
• What information can we utilize? A stochastic oracle: each query returns a stochastic subgradient G(x, ξ).
• The stochastic subgradient is an unbiased empirical estimator of f′(x) = E[F′(x, ξ)]:
  – randomly draw a data point ξ ~ P (as in online learning);
  – compute G(x, ξ) = ∇_x F(x, ξ), so that E_ξ[G(x, ξ)] = f′(x).
• The optimization then proceeds using only these stochastic subgradients.

Regularized Dual Averaging
• Distance-generating function h(x) (strongly convex), the Bregman distance it generates, V(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩, and a quadratic-growth assumption on V.
• The Regularized Dual Averaging (RDA) algorithm [Nesterov 09; Xiao 10] keeps the running average ḡ_t of all past stochastic subgradients and, at each iteration, solves the projection step (proximal mapping)
    x_{t+1} = argmin_x { ⟨ḡ_t, x⟩ + Ψ(x) + (β_t/t) h(x) },
  which yields a sparse x_{t+1} when Ψ is the L1 norm (a minimal code sketch is given at the end of the deck).

Properties of RDA
• Expected convergence rate: the optimal rate for convex functions and the optimal rate for strongly convex functions; the returned averaged solution, however, is non-sparse.
• Optimal Regularized Dual Averaging (ORDA) with double projections, using the weights θ_t = (t+1)/2, offers:
  (1) more flexibility;
  (2) sparse solutions: the weight on each stochastic subgradient G(y_t, ξ_t) is O(1/t²).

Convergence Rate
• Decompose f(x) into a smooth part (gradient Lipschitz constant L) and a non-smooth part (subgradients bounded by M); σ measures the randomness in the stochastic subgradient.
• Deterministic f(x) (σ = 0):
  – smooth (L > 0, M = 0): optimal rate O(L/N²) [Algorithm 3, Tseng 08];
  – non-smooth (M > 0, L = 0): optimal rate O(M/√N).
• Stochastic f(x) (σ > 0): optimal rate O(σ/√N).
• Uniformly-optimal convergence rate: O(L/N² + (M + σ)/√N), i.e., the optimal rate in every regime without knowing in advance which regime holds.

Strongly Convex Case
• Strongly convex f(x) is handled within the same unified framework.
• Convergence rate: for non-strongly convex f(x), ORDA attains the uniformly optimal rate of [Lan et al. 10], which RDA does not.

Uniformly Optimal Rate for Strongly Convex
• Goal: a uniformly optimal rate when f(x) is strongly convex.
• Multi-stage algorithm (homotopy) [Juditsky and Nesterov 10; Hazan and Kale 11; Lan et al. 11], sketched in code at the end of the deck:
  – at each stage 1 ≤ k ≤ K, run ORDA for N_k iterations, using the solution x_{k−1} from the previous stage as the starting point;
  – N_k = 2N_{k−1};
  – the total budget is N = N_1 + ⋯ + N_K.

Experiment: Synthetic Data
• Elastic-net regularization (result plots omitted).

Experiment: Real Data
• Binary classification with sparse logistic regression (result plots omitted).

Distributed Implementation
• Distributed implementation following [Dekel et al. 11].
• Mini-batch strategy: each query of the stochastic oracle is implemented in a distributed manner, with a subsequent averaging based on a vector-sum (a minimal sketch follows below).
• Asymptotically linear speedup.
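As a concrete illustration of the mini-batch strategy, here is a minimal serial Python sketch; `single_oracle` is an assumed interface for one stochastic-subgradient query, and in a real deployment in the style of [Dekel et al. 11] the loop body would run on separate workers, with the average formed by a distributed vector-sum (e.g., an all-reduce):

```python
import numpy as np

def minibatch_oracle(single_oracle, x, batch_size):
    """Average `batch_size` independent stochastic subgradients G(x, xi).

    Serial stand-in for the distributed mini-batch oracle: each worker
    would compute one G(x, xi_i) in parallel, and a vector-sum
    (all-reduce) would form the average.
    """
    grads = [single_oracle(x) for _ in range(batch_size)]  # parallel in practice
    return np.mean(grads, axis=0)
```

Averaging b independent unbiased subgradients keeps the estimator unbiased while shrinking its variance by a factor of roughly 1/b, which is the source of the asymptotically linear speedup claimed on the slide.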
Conclusion
• Stochastic optimization is a powerful tool for large-scale machine learning and online learning.
• Regularized Dual Averaging utilizes all historical stochastic subgradients.
• Optimal Regularized Dual Averaging attains the optimal rate for both convex and strongly convex losses.
• The multi-stage extension attains the uniformly optimal rate for strongly convex losses.
• All of these methods are easily parallelizable.

Variance and High-Probability Bounds
• A bound on E[φ(x_{N+1}) − φ(x*)] only implies that φ(x_{N+1}) converges to φ(x*) on average.
• If we are only allowed to run the algorithm once, is the solution reliable?
• Key: show that Var[φ(x_{N+1}) − φ(x*)] → 0.
• This requires a slightly stronger assumption: a bounded fourth moment of G(x, ξ) − f′(x).
• High-probability bounds then follow.

Comparison to Existing Methods

| Method                  | Convex f(x): Uniformly-Opt | Strongly convex f(x): Optimal Rate | Strongly convex f(x): Uniformly-Opt | Final x | Bregman Divergence |
|-------------------------|----------------------------|------------------------------------|-------------------------------------|---------|--------------------|
| FOBOS (Duchi et al. 09) | NO                         | NO                                 | NO                                  | Prox    | NO                 |
| COMID (Duchi et al. 09) | NO                         | NO                                 | NO                                  | Prox    | YES                |
| SAGE (Hu et al. 09)     | YES                        | YES                                | NO                                  | Prox    | NO                 |
| AC-SA (Lan et al. 10)   | YES                        | YES                                | NO                                  | Avg     | YES                |
| M-AC-SA (Lan et al. 10) | N/A                        | YES                                | YES                                 | Avg     | YES                |
| RDA (Xiao 10)           | NO                         | NO                                 | NO                                  | Avg     | YES                |
| AC-RDA (Xiao 10)        | YES                        | NO                                 | NO                                  | Avg     | YES                |
| ORDA                    | YES                        | YES                                | NO                                  | Prox    | YES                |
| M-ORDA                  | N/A                        | YES                                | YES                                 | Prox    | YES                |

Notes:
• "Final x": Prox = the returned solution is the last proximal-mapping iterate (can be sparse); Avg = an average of iterates (generally non-sparse).
• FOBOS, COMID, SAGE, and AC-SA perform the projection with only the current stochastic subgradient; RDA, AC-RDA, ORDA, and M-ORDA are dual averaging methods, using all historical subgradients.
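To make the dual-averaging projection step from the RDA slide concrete, the following is a minimal sketch of L1-regularized RDA with the Euclidean distance-generating function h(x) = ½‖x‖²; the names `oracle`, `lam`, and `gamma` are our own illustrative choices, not notation from the talk:

```python
import numpy as np

def l1_rda(oracle, dim, T, lam=0.1, gamma=1.0):
    """Minimal sketch of l1-Regularized Dual Averaging (after Xiao, 10).

    oracle(x) returns a stochastic subgradient G(x, xi) of f at x;
    `lam` is the l1 weight and beta_t = gamma * sqrt(t) scales the
    prox term.
    """
    x = np.zeros(dim)
    g_avg = np.zeros(dim)          # running average of all past subgradients
    for t in range(1, T + 1):
        g = oracle(x)
        g_avg += (g - g_avg) / t   # g_avg = (1/t) * (g_1 + ... + g_t)
        beta = gamma * np.sqrt(t)
        # Closed-form projection step (soft-thresholding): coordinates of
        # g_avg smaller than lam in magnitude give exact zeros in x_{t+1}.
        x = -(t / beta) * np.sign(g_avg) * np.maximum(np.abs(g_avg) - lam, 0.0)
    return x
```

Because the update soft-thresholds the averaged subgradient, coordinates with a small average gradient are set exactly to zero, which is how the projection step produces sparse iterates.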
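Finally, the multi-stage (homotopy) scheme for strongly convex losses reduces to a short restart loop; `run_stage` is an assumed stand-in for one ORDA run, not code from the paper:

```python
def multi_stage(run_stage, x0, K, N1):
    """Sketch of the multi-stage (homotopy) scheme for strongly convex f.

    run_stage(x_start, N) is assumed to run ORDA for N iterations from
    x_start and return its final solution.  K stages with doubling stage
    lengths give a total budget of N = N_1 + ... + N_K iterations.
    """
    x, n = x0, N1
    for _ in range(K):
        x = run_stage(x, n)  # warm-start from the previous stage's solution
        n *= 2               # N_{k+1} = 2 * N_k, as on the multi-stage slide
    return x
```

Warm-starting each stage from the previous solution and doubling the stage length is exactly the restart pattern cited on the multi-stage slide [Juditsky and Nesterov 10; Hazan and Kale 11; Lan et al. 11].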