Optimal Value Function Methods in Numerical Optimization
Level Set Methods

James V. Burke, Mathematics, University of Washington ([email protected])
Joint work with Aravkin (UW), Drusvyatskiy (UW), Friedlander (UBC/Davis), Roy (UW)
The Hong Kong Polytechnic University Applied Mathematics Colloquium, February 4, 2016

Motivation: Optimization in Large-Scale Inference

• A range of large-scale data science applications can be modeled using optimization:
  - Inverse problems (medical and seismic imaging)
  - High-dimensional inference (compressive sensing, LASSO, quantile regression)
  - Machine learning (classification, matrix completion, robust PCA, time series)
• These applications are often solved using side information:
  - Sparsity or low rank of the solution
  - Constraints (topography, non-negativity)
  - Regularization (priors, total variation, "dirty" data)
• We need efficient large-scale solvers for nonsmooth programs.

The Prototypical Problem

Sparse Data Fitting: Find sparse x with Ax ≈ b.

Example: Model Selection
y = a^T x, where y ∈ R is an observation and a ∈ R^n is a vector of covariates. Suppose y is a disease classifier and a is micro-array data (n ≥ 10^4). Given data {(y_i, a_i)}_{i=1}^m, find x so that y_i ≈ a_i^T x.

Since m << n, one can "always" find x such that y_i = a_i^T x, i = 1, ..., m. Such an x gives little insight into the role of the covariates a in determining the observations y. We prefer the most parsimonious subset of covariates that can be used to explain the observations; that is, we prefer the sparsest model among the 2^n possible models. Such models are used to further our knowledge of disease mechanisms and to develop efficient disease assays.

There are numerous other applications:
• system identification
• image segmentation
• compressed sensing
• grouped sparsity for remote sensor location
• ...

Convex approaches: ‖x‖_1 as a sparsity surrogate (Candès–Tao–Donoho)

  BPDN:                  min_x ‖x‖_1   s.t.  ½‖Ax − b‖_2^2 ≤ σ
  LASSO:                 min_x ½‖Ax − b‖_2^2   s.t.  ‖x‖_1 ≤ τ
  Lagrangian (Penalty):  min_x ½‖Ax − b‖_2^2 + λ‖x‖_1

• BPDN: often most natural and transparent (physical considerations guide σ).
• Lagrangian: ubiquitous in practice ("no constraints").
All three are (essentially) equivalent computationally! This is the basis for SPGL1 (van den Berg–Friedlander '08).
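The Lagrangian form is attractive computationally because, for a fixed λ, it can be attacked with simple first-order methods. Below is a minimal proximal-gradient (ISTA) sketch for the penalty formulation on synthetic data; it is only an illustration (it is not the SPGL1 algorithm mentioned above), and the sizes, penalty parameter, and names are illustrative (the dimensions echo the spike-train example later in the talk).

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t*||.||_1: componentwise shrinkage."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, n_iter=1000):
    """Proximal gradient on (1/2)||Ax - b||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
m, n, k = 120, 512, 20
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true + 0.01 * rng.standard_normal(m)

x_hat = ista(A, b, lam=0.05)
print("nonzeros in recovered x:", int(np.sum(np.abs(x_hat) > 1e-6)))
```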
Optimal Value or Level Set Framework

Problem class: solve
  P(σ):  min_{x∈X} φ(x)   s.t.  ρ(Ax − b) ≤ σ.

Strategy: consider the "flipped" problem
  Q(τ):  v(τ) := min_{x∈X} ρ(Ax − b)   s.t.  φ(x) ≤ τ.

Then opt-val(P(σ)) is the minimal root of the equation v(τ) = σ.
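To see the mechanism in the simplest possible setting (the numbers here are illustrative, not from the talk), take n = 1, A = 1, b = 2, ρ(r) = ½r^2, and φ(x) = |x|, a one-dimensional BPDN instance. Then
  v(τ) = min{ ½(x − 2)^2 : |x| ≤ τ } = ½(2 − τ)^2 for 0 ≤ τ ≤ 2, and v(τ) = 0 for τ ≥ 2.
For 0 < σ < 2, the minimal root of v(τ) = σ is τ* = 2 − √(2σ), and this is exactly opt-val(P(σ)): the feasible set of P(σ) is [2 − √(2σ), 2 + √(2σ)], so the smallest feasible |x| is 2 − √(2σ).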
Queen Dido's Problem

The intuition behind the proposed framework has a distinguished history, appearing even in antiquity. Perhaps the earliest instance is Queen Dido's problem and the fabled origins of Carthage. In short, the problem is to find the maximum area that can be enclosed by an arc of fixed length and a given line. The converse problem is to find an arc of least length that traps a fixed area between a line and the arc. Although these two problems reverse the objective and the constraint, the solution in each case is a semi-circle.

Other historical examples abound. More recently, these observations provide the basis for Markowitz mean-variance portfolio theory.

The Role of Convexity

Convex Sets
Let C ⊂ R^n. We say that C is convex if (1 − λ)x + λy ∈ C whenever x, y ∈ C and 0 ≤ λ ≤ 1.

Convex Functions
Let f : R^n → R̄ := R ∪ {+∞}. We say that f is convex if the set epi(f) := { (x, µ) : f(x) ≤ µ } is a convex set. Equivalently, the graph of f lies below its secants:
  f((1 − λ)x_1 + λx_2) ≤ (1 − λ)f(x_1) + λf(x_2)  for all x_1, x_2 and 0 ≤ λ ≤ 1.

Convex Functions

Convex indicator functions
Let C ⊂ R^n. Then the function δ_C(x) := 0 if x ∈ C, +∞ if x ∉ C, is a convex function (when C is convex).

Addition
Non-negative linear combinations of convex functions are convex: if f_i is convex and λ_i ≥ 0, i = 1, ..., k, then f(x) := Σ_{i=1}^k λ_i f_i(x) is convex.

Infimal projection
If f : R^n × R^m → R̄ is convex, then so is v(x) := inf_y f(x, y), since epi(v) = { (x, µ) : ∃ y ∈ R^m s.t. f(x, y) ≤ µ }.

Convexity of v

When X, ρ, and φ are convex, the optimal value function v is a non-increasing convex function by infimal projection:
  v(τ) := min_{x∈X} ρ(Ax − b)  s.t.  φ(x) ≤ τ
        = min_x ρ(Ax − b) + δ_{epi(φ)}(x, τ) + δ_X(x).

Newton and Secant Methods

For f convex and non-increasing, solve f(τ) = 0.
[Figure: a non-increasing convex curve crossing zero.]
Problem: f is often not differentiable. Use the convex subdifferential
  ∂f(x) := { z : f(y) ≥ f(x) + z^T(y − x) for all y ∈ R^n }.

Superlinear Convergence

Let τ* := inf{ τ : f(τ) ≤ 0 } and g* := inf{ g : g ∈ ∂f(τ*) } < 0 (non-degeneracy).
Initialization: τ_{−1} < τ_0 < τ*.

Newton:  τ_{k+1} := τ_k if f(τ_k) = 0;  otherwise τ_{k+1} := τ_k − f(τ_k)/g_k for some g_k ∈ ∂f(τ_k).

Secant:  τ_{k+1} := τ_k if f(τ_k) = 0;  otherwise τ_{k+1} := τ_k − f(τ_k)(τ_k − τ_{k−1}) / (f(τ_k) − f(τ_{k−1})).

If either sequence terminates finitely at some τ_k, then τ_k = τ*; otherwise
  |τ* − τ_{k+1}| ≤ (1 − g*/γ_k) |τ* − τ_k|,  k = 1, 2, ...,
where γ_k = g_k (Newton) and γ_k ∈ ∂f(τ_{k−1}) (secant). In either case γ_k ↑ g* and τ_k ↑ τ* globally q-superlinearly.
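A minimal sketch of the two recursions above, applied to f_1(τ) = (τ − 1)^2 − 10 on its decreasing branch (the same test function used in the robustness experiment later in the talk); the starting points and iteration counts are illustrative.

```python
import numpy as np

def f1(tau):
    return (tau - 1.0) ** 2 - 10.0

def df1(tau):
    return 2.0 * (tau - 1.0)            # subgradient (here f1 is smooth)

tau_star = 1.0 - np.sqrt(10.0)          # root approached from the left

# Newton recursion from the slide: tau_{k+1} = tau_k - f(tau_k)/g_k
tau = -10.0
for _ in range(8):
    tau = tau - f1(tau) / df1(tau)
print("Newton :", tau, " error:", abs(tau - tau_star))

# Secant recursion: tau_{k+1} = tau_k - f(tau_k)*(tau_k - tau_{k-1})/(f(tau_k) - f(tau_{k-1}))
t_prev, t = -12.0, -10.0
for _ in range(20):
    if f1(t) == f1(t_prev):             # already converged to machine precision
        break
    t_new = t - f1(t) * (t - t_prev) / (f1(t) - f1(t_prev))
    t_prev, t = t, t_new
print("Secant :", t, " error:", abs(t - tau_star))
```

Both sequences increase monotonically to τ* = 1 − √10, as the convergence result above predicts.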
Inexact Root Finding

• Problem: find a root of the inexactly known convex function v(·) − σ.
• Bisection is one approach, but:
  - nonmonotone iterates (bad for warm starts)
  - at best linear convergence (even with perfect information)
• Solution: a modified secant method and approximate Newton methods (these work with upper and lower bounds u and l on v at each iterate).

[Figures: successive iterations of the inexact secant and inexact Newton methods.]

Inexact Root Finding: Convergence

[Figure: convergence bounds for the inexact iterations.]
Key observation: the constant C = C(τ_0) is independent of v′(τ*).

Minorants from Duality

[Figure: affine minorants of v obtained from dual feasible points.]

Robustness: 1 ≤ u/l ≤ α, where α ∈ [1, 2) and ε = 10^{-2}

[Figure: inexact secant (top row) and Newton (bottom row) for f_1(τ) = (τ − 1)^2 − 10 (first two columns) and f_2(τ) = τ^2 (last column). Below each panel, α is the oracle accuracy and k is the number of iterations needed to converge, i.e., to reach f_i(τ_k) ≤ ε = 10^{-2}. Secant: k = 13 (α = 1.3), k = 770 (α = 1.99), k = 18 (α = 1.3); Newton: k = 9 (α = 1.3), k = 15 (α = 1.99), k = 10 (α = 1.3).]

Sensor Network Localization (SNL)

Given a weighted graph G = (V, E, d), find a realization p_1, ..., p_n ∈ R^2 with d_ij = ‖p_i − p_j‖_2 for all ij ∈ E.

SDP relaxation (Weinberger et al. '04, Biswas et al. '06):
  max tr(X)   s.t.  ‖P_E K(X) − d‖_2^2 ≤ σ,  Xe = 0,  X ⪰ 0,
where [K(X)]_{ij} = X_ii + X_jj − 2X_ij.

Intuition: X = PP^T, and then tr(X) = (1/(n+1)) Σ_{i,j=1}^n ‖p_i − p_j‖^2, with p_i the i-th row of P.

Flipped problem:
  min ‖P_E K(X) − d‖_2^2   s.t.  tr X = τ,  Xe = 0,  X ⪰ 0.

• Perfectly adapted to the Frank-Wolfe method.
Key point: the failure of Slater's condition (always the case here) is irrelevant.
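Since the flipped problem minimizes a smooth misfit over a trace-constrained PSD set, each Frank-Wolfe step only needs an extreme eigenvector of the gradient. The sketch below runs on a toy instance; for simplicity it drops the centering constraint Xe = 0 and keeps only tr X = τ, X PSD, so it illustrates the mechanics rather than the exact SNL subproblem. All names, sizes, and the trace budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 30, 50.0                                   # problem size and trace budget (illustrative)

pts = rng.standard_normal((n, 2))                   # ground-truth planar points
mask = np.triu(rng.random((n, n)) < 0.3, 1)
mask = mask | mask.T                                # symmetric edge set E
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # observed squared distances

def K(X):
    d = np.diag(X)
    return d[:, None] + d[None, :] - 2.0 * X

def grad(X):
    # gradient of f(X) = 0.5 * || mask * (K(X) - D) ||_F^2;
    # the adjoint of K maps a symmetric R to 2*(Diag(R @ 1) - R)
    R = mask * (K(X) - D)
    return 2.0 * (np.diag(R.sum(axis=1)) - R)

X = np.zeros((n, n))
for k in range(200):
    G = grad(X)
    _, V = np.linalg.eigh(G)                        # eigenvalues in ascending order
    v = V[:, 0]                                     # eigenvector of the smallest eigenvalue
    S = tau * np.outer(v, v)                        # linear oracle over {S PSD, tr S = tau}
    X = X + (2.0 / (k + 2.0)) * (S - X)             # standard Frank-Wolfe step size

R = mask * (K(X) - D)
print("misfit 0.5*||P_E K(X) - d||^2 :", 0.5 * np.sum(R ** 2))
```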
Approximate Newton

[Figures: Pareto curves and the approximate Newton iterates for σ = 0.25 and σ = 0.]

Max-trace

[Figures: localizations from the max-trace and min-trace formulations (panels Newton_max, Newton_min) before and after position refinement.]

Observations

• Simple strategy for optimizing over complex domains
• Rigorous convergence guarantees
• Insensitivity to ill-conditioning
• Many applications:
  - Sensor network localization (Drusvyatskiy-Krislock-Voronin-Wolkowicz '15)
  - Sparse/robust estimation and Kalman smoothing (Aravkin-B-Pillonetto '13)
  - Large-scale SDP and LP (cf. Renegar '14)
  - Chromosome reconstruction (Aravkin-Becker-Drusvyatskiy-Lozano '15)
  - Phase retrieval (Aravkin-B-Drusvyatskiy-Friedlander-Roy '16)
  - Generalized linear models (Aravkin-B-Drusvyatskiy-Friedlander-Roy '16)
  - ...

Conjugate Functions and Duality

Convex indicator
For any convex set C, the convex indicator function of C is δ(x | C) := 0 if x ∈ C, +∞ if x ∉ C.

Support functionals
For any set C, the support functional of C is δ*(x | C) := sup_{z∈C} ⟨x, z⟩.

Convex conjugates
For any convex function g(x), the convex conjugate is given by
  g*(y) := δ*((y, −1) | epi(g)) = sup_x [⟨x, y⟩ − g(x)].

Conjugates and the Subdifferential

  g*(y) = sup_x [⟨x, y⟩ − g(x)].

The Bi-Conjugate Theorem
If epi(g) is closed and dom(g) ≠ ∅, then (g*)* = g.

The Young-Fenchel Inequality
g(x) + g*(z) ≥ ⟨z, x⟩ for all x, z ∈ R^n, with equality if and only if z ∈ ∂g(x) and x ∈ ∂g*(z). In particular, ∂g(x) = argmax_z [⟨z, x⟩ − g*(z)].

Maximal Monotone Operator
If epi(g) is closed and dom(g) ≠ ∅, then ∂g is a maximal monotone operator with (∂g)^{-1} = ∂g*.

Note: the lsc hull of g is cl g := g**.
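Two classical examples, included here only for orientation, make this machinery concrete. For g(x) = ½‖x‖_2^2 one has g*(y) = ½‖y‖_2^2 and ∂g(x) = {x}, so Young-Fenchel holds with equality exactly when z = x. For g(x) = ‖x‖_1 the conjugate is the indicator of the unit ∞-norm ball, g*(z) = δ(z | {z : ‖z‖_∞ ≤ 1}), and equality in Young-Fenchel, ⟨z, x⟩ = ‖x‖_1 with ‖z‖_∞ ≤ 1, characterizes the subdifferential: z ∈ ∂‖x‖_1 iff z_i = sign(x_i) whenever x_i ≠ 0 and |z_i| ≤ 1 otherwise.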
The Perspective Function

  epi(g^π) := cl cone(epi(g)) = cl( ∪_{λ>0} λ epi(g) ),

  g^π(z, λ) := λ g(λ^{-1} z) if λ > 0;  g^∞(z) if λ = 0;  +∞ if λ < 0,

where g^∞ is the horizon function of g:  g^∞(z) := sup_{x ∈ dom g} [g(x + z) − g(x)].

Support Functions for epi(g) and lev_g(τ)

Let g : R^n → R̄ be closed, proper, and convex. Then
  δ*((y, µ) | epi(g)) = (g*)^π(y, −µ)   and   δ*(y | [g ≤ τ]) = cl inf_{µ≥0} [τµ + (g*)^π(y, µ)],
where epi(g) := {(x, µ) : g(x) ≤ µ},  [g ≤ τ] := {x : g(x) ≤ τ},  and δ*(z | C) := sup_{w∈C} ⟨z, w⟩.

Perturbation Framework and Duality (Rockafellar 1970)

The perturbation function:  f(x, b, τ) := ρ(b − Ax) + δ((x, τ) | epi(φ)).
Its conjugate:  f*(y, u, µ) = (φ*)^π(y + A^T u, −µ) + ρ*(u).

The primal problem P(b, τ), by infimal projection in x:  v(b, τ) := min_x f(x, b, τ).

The dual problem D(b, τ):
  v̂(b, τ) := sup_{u,µ} ⟨b, u⟩ + τµ − f*(0, u, µ)
           = sup_u ⟨b, u⟩ − ρ*(u) − δ*(A^T u | [φ ≤ τ])   (reduced dual).

The subdifferential: if (b, τ) ∈ int(dom v), then v(b, τ) = v̂(b, τ) and
  ∅ ≠ ∂v(b, τ) = argmax_{u,µ} D(b, τ).

Piecewise Linear-Quadratic (PLQ) Penalties

  φ(x) := sup_{u∈U} [⟨x, u⟩ − ½ u^T B u],
where U ⊂ R^n is nonempty, closed, and convex with 0 ∈ U (not necessarily polyhedral), and B ∈ R^{n×n} is symmetric positive semi-definite.

Examples:
1. Support functionals: B = 0
2. Gauge functionals: γ(· | U°) = δ*(· | U)
3. Norms: ‖·‖ = γ(· | 𝔹), with 𝔹 the closed unit ball
4. Least-squares: U = R^n, B = I
5. Huber: U = [−ε, ε]^n, B = I

PLQ Densities: Gauss, Laplace, Huber, Vapnik

  Gauss:          V(x) = ½x^2
  Laplace (ℓ_1):  V(x) = |x|
  Huber:          V(x) = −Kx − ½K^2 for x < −K;  ½x^2 for −K ≤ x ≤ K;  Kx − ½K^2 for K < x
  Vapnik:         V(x) = −x − ε for x < −ε;  0 for −ε ≤ x ≤ ε;  x − ε for ε ≤ x

Computing v′ for PLQ Penalties φ

  φ(x) := sup_{u∈U} [⟨x, u⟩ − ½ u^T B u],
  P(b, τ):  v(b, τ) := min ρ(b − Ax)  s.t.  φ(x) ≤ τ.

Then
  ∂v(b, τ) = { (u, −µ) : ∃ x s.t. 0 ∈ −A^T ∂ρ(b − Ax) + µ ∂φ(x) and u ∈ ∂ρ(b − Ax) },
where  µ = max{ γ(A^T u | U),  √(u^T A B A^T u) / √(2τ) }.

A Few Special Cases

  v(τ) := min ½‖b − Ax‖_2^2  s.t.  φ(x) ≤ τ,
with optimal solution x̄ and optimal residual r̄ := Ax̄ − b.

1. Support functionals: φ(x) = δ*(x | U), 0 ∈ U  ⟹  v′(τ) = −δ*(A^T r̄ | U°) = −γ(A^T r̄ | U)
2. Gauge functionals: φ(x) = γ(x | U), 0 ∈ U  ⟹  v′(τ) = −γ(A^T r̄ | U°) = −δ*(A^T r̄ | U)
3. Norms: φ(x) = ‖x‖  ⟹  v′(τ) = −‖A^T r̄‖_* (the dual norm)
4. Huber: φ(x) = sup_{u∈[−ε,ε]^n} [⟨x, u⟩ − ½ u^T u]  ⟹  v′(τ) = −max{ ‖A^T r̄‖_∞/ε,  ‖A^T r̄‖_2/√(2τ) }
5. Vapnik: φ(x) = ‖(x − ε)_+‖_1 + ‖(−x − ε)_+‖_1  ⟹  v′(τ) = −(‖A^T r̄‖_∞ + ‖A^T r̄‖_2)
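A quick numerical sanity check of case 3 for φ = ‖·‖_1 (so the dual norm is ‖·‖_∞): the sketch below evaluates v(τ) by accelerated projected gradient over the ℓ_1-ball and compares a finite-difference slope with −‖A^T r̄‖_∞. It is illustrative only; the solver, step counts, and sizes are arbitrary choices, and the two printed numbers should agree only approximately.

```python
import numpy as np

def project_l1(x, tau):
    """Euclidean projection onto the l1-ball { z : ||z||_1 <= tau }."""
    if np.abs(x).sum() <= tau:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    j = np.arange(1, x.size + 1)
    rho = np.nonzero(u - (css - tau) / j > 0)[0][-1] + 1
    theta = (css[rho - 1] - tau) / rho
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lasso_value(A, b, tau, n_iter=5000):
    """v(tau) = min 0.5*||Ax - b||_2^2 s.t. ||x||_1 <= tau, via accelerated projected gradient."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
    for _ in range(n_iter):
        x_new = project_l1(y - A.T @ (A @ y - b) / L, tau)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return 0.5 * np.sum((A @ x - b) ** 2), x

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
tau, h = 1.0, 1e-3

v_tau, x_bar = lasso_value(A, b, tau)
r_bar = A @ x_bar - b
v_plus, _ = lasso_value(A, b, tau + h)
v_minus, _ = lasso_value(A, b, tau - h)
print("finite-difference slope:", (v_plus - v_minus) / (2.0 * h))
print("-||A^T r_bar||_inf     :", -np.max(np.abs(A.T @ r_bar)))
```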
Basis Pursuit with Outliers

  BP_σ:  min ‖x‖_1  s.t.  ρ(b − Ax) ≤ σ.

Standard least-squares: ρ(z) = ‖z‖_2 or ρ(z) = ‖z‖_2^2.

Quantile Huber:
  ρ_{κ,τ}(r) = τ|r| − κτ^2/2              if r < −τκ,
               r^2/(2κ)                   if r ∈ [−κτ, (1 − τ)κ],
               (1 − τ)|r| − κ(1 − τ)^2/2  if r > (1 − τ)κ.
This reduces to the standard Huber function when τ = 0.5.

[Figure: the Huber penalty.]

Sparse and Robust Formulation: Signal Recovery

  HBP_σ:  min ‖x‖_1  s.t.  ρ(b − Ax) ≤ σ.

Problem specification:
  x: 20-sparse spike train in R^512
  b: measurements in R^120
  A: measurement matrix satisfying RIP
  ρ: Huber function
  σ: error level set at 0.01
  5 outliers

Results: in the presence of outliers, the robust (Huber) formulation recovers the spike train, while the standard least-squares formulation does not.
[Figures: recovered signals and residuals for the least-squares and Huber formulations, against the truth.]

Adding non-negativity, the same experiment with
  HBP_σ:  min_{0≤x} ‖x‖_1  s.t.  ρ(b − Ax) ≤ σ
and x a 20-sparse spike train in R^512_+ gives the same conclusion: the robust formulation recovers the spike train, the standard formulation does not.
[Figures: recovered signals and residuals for the non-negative variant.]
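A small sketch of the quantile Huber penalty defined above, just to make the piecewise formula concrete; the parameter values and the spot checks are illustrative.

```python
import numpy as np

def quantile_huber(r, kappa, tau):
    """Quantile Huber penalty rho_{kappa,tau}(r), following the piecewise definition above."""
    r = np.asarray(r, dtype=float)
    out = np.empty_like(r)
    left = r < -tau * kappa
    right = r > (1.0 - tau) * kappa
    mid = ~(left | right)
    out[left] = tau * np.abs(r[left]) - kappa * tau ** 2 / 2.0
    out[mid] = r[mid] ** 2 / (2.0 * kappa)
    out[right] = (1.0 - tau) * np.abs(r[right]) - kappa * (1.0 - tau) ** 2 / 2.0
    return out

r = np.linspace(-3.0, 3.0, 7)
print(quantile_huber(r, kappa=1.0, tau=0.3))   # asymmetric case (tau != 0.5)
print(quantile_huber(r, kappa=1.0, tau=0.5))   # tau = 0.5: the symmetric, standard Huber case
```

The three branches match at the breakpoints r = −τκ and r = (1 − τ)κ (both sides evaluate to κτ^2/2 and κ(1 − τ)^2/2, respectively), so the penalty is continuous.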
References

• "Probing the Pareto frontier for basis pursuit solutions," van den Berg - Friedlander, SIAM J. Sci. Comput. 31 (2008), 890-912.
• "Sparse optimization with least-squares constraints," van den Berg - Friedlander, SIOPT 21 (2011), 1201-1229.
• "Variational properties of value functions," Aravkin - B - Friedlander, SIOPT 23 (2013), 1689-1717.
• "Level-set methods for convex optimization," Aravkin - B - Drusvyatskiy - Friedlander - Roy, preprint, 2016.