Optimal Value Function Methods in Numerical Optimization:
Level Set Methods

James V. Burke
Mathematics, University of Washington ([email protected])

Joint work with Aravkin (UW), Drusvyatskiy (UW), Friedlander (UBC/Davis), Roy (UW)

The Hong Kong Polytechnic University, Applied Mathematics Colloquium
February 4, 2016
Motivation
Optimization in Large-Scale Inference

• A range of large-scale data-science applications can be modeled using optimization:
  - inverse problems (medical and seismic imaging)
  - high-dimensional inference (compressive sensing, LASSO, quantile regression)
  - machine learning (classification, matrix completion, robust PCA, time series)
• These applications are often solved using side information:
  - sparsity or low rank of the solution
  - constraints (topography, non-negativity)
  - regularization (priors, total variation, “dirty” data)
• We need efficient large-scale solvers for nonsmooth programs.
The Prototypical Problem

Sparse Data Fitting: Find sparse x with Ax ≈ b

Example: Model Selection

  y = aᵀx,

where y ∈ ℝ is an observation and a ∈ ℝⁿ is a vector of covariates.
Suppose y is a disease classifier and a is micro-array data (n ≥ 10⁴).
Given data (yᵢ, aᵢ), i = 1, …, m, find x so that yᵢ ≈ aᵢᵀx.

Since m ≪ n, one can “always” find x such that yᵢ = aᵢᵀx, i = 1, …, m.

Such an x gives little insight into the role of the covariates a in determining the
observations y. We prefer the most parsimonious subset of covariates that can be used
to explain the observations; that is, we prefer the sparsest model among the 2ⁿ
possible models. Such models are used to further our knowledge of disease mechanisms
and to develop efficient disease assays.
The Prototypical Problem
Sparse Data Fitting:
Find sparse x with Ax ≈ b
There are numerous other applications:
• system identification
• image segmentation
• compressed sensing
• grouped sparsity for remote sensor location
• ...
The Prototypical Problem

Sparse Data Fitting: Find sparse x with Ax ≈ b

Convex approaches: ‖x‖₁ as a sparsity surrogate (Candès–Tao–Donoho)

  BPDN:                  minₓ ‖x‖₁             s.t. ½‖Ax − b‖₂² ≤ σ
  LASSO:                 minₓ ½‖Ax − b‖₂²      s.t. ‖x‖₁ ≤ τ
  Lagrangian (Penalty):  minₓ ½‖Ax − b‖₂² + λ‖x‖₁

• BPDN: often the most natural and transparent formulation
  (physical considerations guide the choice of σ).
• Lagrangian: ubiquitous in practice (“no constraints”).

All three are (essentially) equivalent computationally!
This equivalence is the basis for SPGL1 (van den Berg–Friedlander ’08).
Optimal Value or Level Set Framework

Problem class: Solve

  P(σ):  minₓ∈X φ(x)  s.t.  ρ(Ax − b) ≤ σ

Strategy: Consider the “flipped” problem

  Q(τ):  v(τ) := minₓ∈X ρ(Ax − b)  s.t.  φ(x) ≤ τ

Then opt-val(P(σ)) is the minimal root of the equation v(τ) = σ.
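As a concrete illustration of the framework, here is a minimal Python sketch of the outer loop. It assumes a user-supplied routine solve_flipped(tau) (a placeholder name) that returns v(τ) by solving Q(τ) with any suitable solver, and it locates the root of v(τ) − σ with a secant update; the Newton/secant details and their convergence are discussed on the following slides.

```python
def level_set_solve(solve_flipped, sigma, tau_prev, tau0, tol=1e-8, max_iter=50):
    """Sketch of the level-set outer loop: find the minimal root of v(tau) = sigma.

    solve_flipped(tau) is assumed to return v(tau), the optimal value of the
    flipped problem Q(tau).  Starting from tau_prev < tau0 below the root, the
    secant iterates increase monotonically toward the solution of P(sigma).
    """
    f_prev = solve_flipped(tau_prev) - sigma
    tau, f_cur = tau0, solve_flipped(tau0) - sigma
    for _ in range(max_iter):
        if abs(f_cur) <= tol or f_cur == f_prev:
            break
        # Secant step on the convex, non-increasing function f(tau) = v(tau) - sigma.
        tau_next = tau - (tau - tau_prev) / (f_cur - f_prev) * f_cur
        tau_prev, f_prev = tau, f_cur
        tau, f_cur = tau_next, solve_flipped(tau_next) - sigma
    return tau
```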
Queen Dido’s Problem
The intuition behind the proposed framework has a
distinguished history, appearing even in antiquity. Perhaps the
earliest instance is Queen Dido’s problem and the fabled origins
of Carthage.
In short, the problem is to find the maximum area that can be
enclosed by an arc of fixed length and a given line. The
converse problem is to find an arc of least length that traps a
fixed area between a line and the arc. Although these two
problems reverse the objective and the constraint, the solution
in each case is a semi-circle.
Other historical examples abound. More recently, these
observations provide the basis for the Markowitz
Mean-Variance Portfolio Theory.
The Role of Convexity

Convex Sets
Let C ⊂ ℝⁿ. We say that C is convex if
  (1 − λ)x + λy ∈ C whenever x, y ∈ C and 0 ≤ λ ≤ 1.

Convex Functions
Let f : ℝⁿ → ℝ̄ := ℝ ∪ {+∞}. We say that f is convex if the set
  epi(f) := { (x, µ) : f(x) ≤ µ }
is a convex set.

[Figure: the chord joining (x₁, f(x₁)) and (x₂, f(x₂)) lies above the graph,
illustrating f((1 − λ)x₁ + λx₂) ≤ (1 − λ)f(x₁) + λf(x₂).]
Convex Functions

Convex indicator functions
Let C ⊂ ℝⁿ. Then the function
  δ_C(x) := 0 if x ∈ C,  +∞ if x ∉ C,
is a convex function.

Addition
Non-negative linear combinations of convex functions are convex:
if f₁, …, f_k are convex and λᵢ ≥ 0, i = 1, …, k, then f(x) := Σᵢ₌₁ᵏ λᵢ fᵢ(x) is convex.

Infimal Projection
If f : ℝⁿ × ℝᵐ → ℝ̄ is convex, then so is
  v(x) := inf_y f(x, y),
since
  epi(v) = { (x, µ) : ∃ y ∈ ℝᵐ s.t. f(x, y) ≤ µ }.
Convexity of v

When X, ρ, and φ are convex, the optimal value function v is a non-increasing convex
function, by infimal projection:

  v(τ) := minₓ∈X ρ(Ax − b)  s.t.  φ(x) ≤ τ
        = minₓ  ρ(Ax − b) + δ_epi(φ)(x, τ) + δ_X(x).
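Filling in the two standard facts behind this statement (not spelled out on the slide): convexity follows because epi(v) is a linear image of a convex epigraph, and monotonicity follows because the feasible region grows with τ.

```latex
F(x,\tau) := \rho(Ax-b) + \delta_{\operatorname{epi}\phi}(x,\tau) + \delta_X(x)
  \ \text{is jointly convex, and}\quad
\operatorname{epi} v = \{(\tau,\mu) : \exists\, x \ \text{with}\ F(x,\tau) \le \mu\}
  \ \text{is the image of}\ \operatorname{epi} F \ \text{under}\ (x,\tau,\mu) \mapsto (\tau,\mu);
\qquad
\tau_1 \le \tau_2 \ \Rightarrow\ [\phi \le \tau_1] \subseteq [\phi \le \tau_2]
  \ \Rightarrow\ v(\tau_1) \ge v(\tau_2).
```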
Newton and Secant Methods

For f convex and non-increasing, solve f(τ) = 0.

[Figure: a non-increasing convex function f crossing zero; not reproduced.]

Problem: f is often not differentiable.

Use the convex subdifferential
  ∂f(x) := { z : f(y) ≥ f(x) + zᵀ(y − x) for all y ∈ ℝⁿ }.
Superlinear Convergence

  τ* := inf{ τ : f(τ) ≤ 0 }   and   g* := inf{ g : g ∈ ∂f(τ*) } < 0   (non-degeneracy)

Initialization: τ₋₁ < τ₀ < τ*.

Newton:
  τ_{k+1} := τ_k                      if f(τ_k) = 0,
  τ_{k+1} := τ_k − f(τ_k)/g_k         otherwise, for some g_k ∈ ∂f(τ_k).

Secant:
  τ_{k+1} := τ_k                                                      if f(τ_k) = 0,
  τ_{k+1} := τ_k − [(τ_k − τ_{k−1}) / (f(τ_k) − f(τ_{k−1}))] f(τ_k)   otherwise.

If either sequence terminates finitely at some τ_k, then τ_k = τ*; otherwise,
  |τ* − τ_{k+1}| ≤ (1 − g*/γ_k) |τ* − τ_k|,   k = 1, 2, …,
where γ_k = g_k (Newton) and γ_k ∈ ∂f(τ_{k−1}) (secant). In either case, γ_k ↑ g*
and τ_k ↑ τ* globally q-superlinearly.
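A minimal Python sketch of the Newton variant above; oracle(tau) is a hypothetical callback returning the pair (f(τ), g) with g ∈ ∂f(τ) (in the level-set setting, f(τ) = v(τ) − σ and g comes from a dual solution of Q(τ), as discussed later).

```python
def newton_root(oracle, tau0, tol=1e-10, max_iter=100):
    """Newton's method for f(tau) = 0, with f convex and non-increasing.

    oracle(tau) returns (f(tau), g) with g a subgradient of f at tau (g < 0).
    Started at tau0 < tau*, the iterates increase monotonically to the minimal
    root tau*, q-superlinearly under the non-degeneracy condition above.
    """
    tau = tau0
    for _ in range(max_iter):
        f_val, g = oracle(tau)
        if abs(f_val) <= tol:
            break                     # finite termination: tau is (numerically) a root
        tau = tau - f_val / g         # f_val > 0 and g < 0, so tau increases
    return tau
```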
Inexact Root Finding

• Problem: find a root of the inexactly known convex function v(·) − σ.
• Bisection is one approach, but
  • its iterates are nonmonotone (bad for warm starts), and
  • it converges at best linearly (even with perfect information).
• Solution: modified secant and approximate Newton methods.
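The slides that follow illustrate these methods graphically. As a sketch of the general idea only (not necessarily the exact scheme in the cited work): each approximate solve of Q(τ) supplies, via a dual feasible point, an affine minorant of v, and the next iterate is obtained by a Newton-type step on that minorant.

```python
def newton_step_on_minorant(tau, ell, g, sigma):
    """One Newton-type update for v(tau) = sigma using an affine minorant of v.

    A dual feasible point for Q(tau) gives a minorant  m(t) = ell + g * (t - tau)
    with  m(t) <= v(t)  and slope  g < 0.  Stepping to the root of  m(t) = sigma
    yields the next iterate; accuracy conditions relating the lower estimate to
    an upper bound on v(tau) (the ratio u/l <= alpha on a later slide) are what
    the convergence theory relies on.
    """
    return tau + (sigma - ell) / g
```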
Inexact Root Finding: Secant

[Figure sequence illustrating the inexact secant iterations; not reproduced.]

Inexact Root Finding: Newton

[Figure sequence illustrating the inexact Newton iterations; not reproduced.]
Inexact Root Finding: Convergence

[Figures illustrating the convergence analysis; not reproduced.]

Key observation: the constant C = C(τ₀) is independent of v′(τ*).
Minorants from Duality

[Figures illustrating affine minorants of v obtained from dual feasible points; not reproduced.]
Robustness: 1 ≤ u/l ≤ α, where α ∈ [1, 2) and ε = 10⁻²

[Figure: inexact secant (top row) and inexact Newton (bottom row) iterations for
f₁(τ) = (τ − 1)² − 10 (first two columns) and f₂(τ) = τ² (last column). For each
panel, α is the oracle accuracy and k is the number of iterations needed to
converge, i.e., to reach fᵢ(τ_k) ≤ ε = 10⁻². Panels: (a) k = 13, α = 1.3;
(b) k = 770, α = 1.99; (c) k = 18, α = 1.3; (d) k = 9, α = 1.3; (e) k = 15, α = 1.99;
(f) k = 10, α = 1.3.]
Sensor Network Localization (SNL)

[Figure: a planar network of sensors with edges between nearby sensors.]

Given a weighted graph G = (V, E, d), find a realization p₁, …, pₙ ∈ ℝ² with
  d_ij = ‖pᵢ − pⱼ‖₂  for all ij ∈ E.
Sensor Network Localization (SNL)

SDP relaxation (Weinberger et al. ’04, Biswas et al. ’06):

  max  tr(X)
  s.t. ‖P_E K(X) − d‖₂² ≤ σ,   Xe = 0,   X ⪰ 0,

where [K(X)]_ij = X_ii + X_jj − 2X_ij.

Intuition: X = PPᵀ, and then (using Xe = 0)
  tr(X) = (1/(2n)) Σᵢ,ⱼ₌₁ⁿ ‖pᵢ − pⱼ‖₂²,
with pᵢ the ith row of P.

Flipped problem:

  min  ‖P_E K(X) − d‖₂²
  s.t. tr(X) = τ,   Xe = 0,   X ⪰ 0.

• Perfectly adapted to the Frank–Wolfe method.

Key point: the failure of Slater's condition (always the case here) is irrelevant.
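To make the Frank–Wolfe remark concrete, here is a minimal sketch for the flipped problem. The callback grad_f (a placeholder name) is assumed to return the gradient of the smooth objective ‖P_E K(X) − d‖₂²; the linear minimization oracle over {X ⪰ 0, tr X = τ, Xe = 0} is a rank-one extreme point built from a minimal eigenvector restricted to the subspace orthogonal to e.

```python
import numpy as np
from scipy.linalg import null_space

def frank_wolfe_snl(grad_f, tau, n, n_iters=200):
    """Frank-Wolfe sketch for  min f(X)  s.t.  tr(X) = tau, Xe = 0, X PSD.

    grad_f(X) is assumed to return the gradient of the smooth convex objective
    f(X) = ||P_E K(X) - d||_2^2.  Each linear subproblem is solved exactly by a
    rank-one matrix tau * v v^T, with v a unit minimal eigenvector of the gradient
    restricted to the subspace orthogonal to the all-ones vector e.
    """
    V = null_space(np.ones((1, n)))          # orthonormal basis of {x : e^T x = 0}
    X = tau * np.outer(V[:, 0], V[:, 0])     # a feasible starting point
    for k in range(n_iters):
        G = grad_f(X)
        w, U = np.linalg.eigh(V.T @ G @ V)   # eigenvalues in ascending order
        v = V @ U[:, 0]                      # minimal eigenvector, orthogonal to e
        S = tau * np.outer(v, v)             # extreme point minimizing <G, S>
        gamma = 2.0 / (k + 2.0)              # standard Frank-Wolfe step size
        X = (1.0 - gamma) * X + gamma * S
    return X
```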
Approximate Newton

[Figure: Pareto curves v(τ) for the SNL instance with σ = 0.25 (left) and σ = 0 (right);
not reproduced.]
Max-trace

[Figure: recovered sensor positions. Top row: Newton_max solution and refined
positions; bottom row: Newton_min solution and refined positions. Not reproduced.]
Observations
• Simple strategy for optimizing over complex domains
• Rigorous convergence guarantees
• Insensitivity to ill-conditioning
• Many applications
• Sensor Network Localization
(Drusvyatskiy-Krislock-Voronin-Wolkowicz ’15)
• Sparse/Robust Estimation and Kalman Smoothing
(Aravkin-B-Pillonetto ’13)
• Large scale SDP and LP (cf. Renegar ’14)
• Chromosome reconstruction
(Aravkin-Becker-Drusvyatskiy-Lozano ’15)
• Phase retrieval (Aravkin-B-Drusvyatskiy-Friedlander-Roy
’16)
• Generalized linear models
(Aravkin-B-Drusvyatskiy-Friedlander-Roy ’16)
• ...
Conjugate Functions and Duality

Convex Indicator
For any convex set C, the convex indicator function of C is
  δ(x | C) := 0 if x ∈ C,  +∞ if x ∉ C.

Support Functionals
For any set C, the support functional of C is
  δ*(x | C) := sup_{z∈C} ⟨x, z⟩.

Convex Conjugates
For any convex function g(x), the convex conjugate is given by
  g*(y) := δ*((y, −1) | epi(g)) = sup_x [⟨x, y⟩ − g(x)].
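For concreteness, two conjugates that recur in the sparse data-fitting examples (standard computations, not taken from the slides); here 𝔹_∞ is the closed unit ball of the ∞-norm:

```latex
\|\cdot\|_1^{*}(y) = \sup_{x}\bigl[\langle x,y\rangle - \|x\|_1\bigr]
 = \delta\bigl(y \,\big|\, \mathbb{B}_\infty\bigr)
 = \begin{cases} 0, & \|y\|_\infty \le 1,\\ +\infty, & \text{otherwise,}\end{cases}
\qquad
\Bigl(\tfrac12\|\cdot\|_2^2\Bigr)^{*}(y) = \tfrac12\|y\|_2^2 .
```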
Conjugates and the Subdifferential

  g*(y) = sup_x [⟨x, y⟩ − g(x)].

The Bi-Conjugate Theorem
If epi(g) is closed and dom(g) ≠ ∅, then (g*)* = g.

The Young–Fenchel Inequality
g(x) + g*(z) ≥ ⟨z, x⟩ for all x, z ∈ ℝⁿ, with equality if and only if
z ∈ ∂g(x) and x ∈ ∂g*(z).
In particular, ∂g(x) = argmax_z [⟨z, x⟩ − g*(z)].

Maximal Monotone Operator
If epi(g) is closed and dom(g) ≠ ∅, then ∂g is a maximal monotone operator with
(∂g)⁻¹ = ∂g*.

Note: the lsc hull of g is cl g := g**.
The Perspective Function

  epi(g^π) := cl cone(epi(g)) = cl( ⋃_{λ>0} λ · epi(g) )

  g^π(z, λ) :=  λ g(λ⁻¹ z)   if λ > 0,
                g^∞(z)       if λ = 0,
                +∞           if λ < 0,

where g^∞ is the horizon function of g:
  g^∞(z) := sup_{x ∈ dom g} [ g(x + z) − g(x) ].
Support Functions for epi(g) and lev_g(τ)

Let g : ℝⁿ → ℝ̄ be closed, proper, and convex. Then
  δ*((y, µ) | epi(g)) = (g*)^π(y, −µ)
and
  δ*(y | [g ≤ τ]) = cl inf_{µ≥0} [ τµ + (g*)^π(y, µ) ],
where
  epi(g) := { (x, µ) : g(x) ≤ µ },
  [g ≤ τ] := { x : g(x) ≤ τ },
  δ*(z | C) := sup_{w∈C} ⟨z, w⟩.
Perturbation Framework and Duality (Rockafellar 1970)

The perturbation function:
  f(x, b, τ) := ρ(b − Ax) + δ((x, τ) | epi(φ)).

Its conjugate:
  f*(y, u, µ) = (φ*)^π(y + Aᵀu, −µ) + ρ*(u).

The Primal Problem (infimal projection in x):
  P(b, τ):  v(b, τ) := minₓ f(x, b, τ).

The Dual Problem:
  D(b, τ):  v̂(b, τ) := sup_{u,µ} ⟨b, u⟩ + τµ − f*(0, u, µ)
          (reduced dual) = sup_u ⟨b, u⟩ − ρ*(u) − δ*(Aᵀu | [φ ≤ τ]).

The Subdifferential: If (b, τ) ∈ int(dom v), then v(b, τ) = v̂(b, τ) and
  ∅ ≠ ∂v(b, τ) = argmax_{u,µ} D(b, τ).
Piecewise Linear-Quadratic (PLQ) Penalties

  φ(x) := sup_{u∈U} [ ⟨x, u⟩ − ½ uᵀBu ],

where U ⊂ ℝⁿ is nonempty, closed, and convex with 0 ∈ U (not necessarily polyhedral),
and B ∈ ℝⁿˣⁿ is symmetric positive semi-definite.

Examples:
1. Support functionals: B = 0
2. Gauge functionals: γ(· | U°) = δ*(· | U)
3. Norms: 𝔹 = closed unit ball, ‖·‖ = γ(· | 𝔹)
4. Least-squares: U = ℝⁿ, B = I
5. Huber: U = [−ε, ε]ⁿ, B = I
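A small sketch of how such a penalty evaluates in practice, using Example 5 (the function name and interface are mine; the per-coordinate closed form follows directly from the sup definition):

```python
import numpy as np

def plq_penalty_huber(x, eps):
    """Evaluate phi(x) = sup_{u in U} [<x,u> - 0.5*u^T B u] for the Huber case
    U = [-eps, eps]^n, B = I (Example 5 above).

    The supremum separates over coordinates, and each scalar problem has the
    closed form
        sup_{|u| <= eps} [t*u - u**2/2] = t**2/2             if |t| <= eps,
                                          eps*|t| - eps**2/2  otherwise,
    i.e., the Huber function with threshold eps.
    """
    x = np.asarray(x, dtype=float)
    small = np.abs(x) <= eps
    return float(np.sum(np.where(small, 0.5 * x**2,
                                 eps * np.abs(x) - 0.5 * eps**2)))
```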
PLQ Densities: Gauss, Laplace, Huber, Vapnik

[Figure: graphs of four scalar PLQ penalties.]

  Gauss:        V(x) = ½x²
  Laplace (ℓ₁): V(x) = |x|
  Huber:        V(x) = ½x²   for −K ≤ x ≤ K;   V(x) = K|x| − ½K²  for |x| > K
  Vapnik:       V(x) = 0     for −ε ≤ x ≤ ε;   V(x) = |x| − ε     for |x| > ε
Computing v′ for PLQ Penalties φ

  φ(x) := sup_{u∈U} [ ⟨x, u⟩ − ½ uᵀBu ]

  P(b, τ):  v(b, τ) := min ρ(b − Ax)  s.t.  φ(x) ≤ τ

Then
  ∂v(b, τ) = { (u, −µ) :  ∃ x s.t. 0 ∈ −Aᵀ∂ρ(b − Ax) + µ ∂φ(x)  and
                          µ = max{ γ(Aᵀu | U), √(uᵀABAᵀu) / √(2τ) } }.
A Few Special Cases

  v(τ) := min ½‖b − Ax‖₂²  s.t.  φ(x) ≤ τ

Optimal solution x; optimal residual r := Ax − b.

1. Support functionals: φ(x) = δ*(x | U), 0 ∈ U  ⟹  v′(τ) = −δ*(Aᵀr | U°) = −γ(Aᵀr | U)
2. Gauge functionals:   φ(x) = γ(x | U), 0 ∈ U   ⟹  v′(τ) = −γ(Aᵀr | U°) = −δ*(Aᵀr | U)
3. Norms:  φ(x) = ‖x‖  ⟹  v′(τ) = −‖Aᵀr‖*  (the dual norm)
4. Huber:  φ(x) = sup_{u∈[−ε,ε]ⁿ} [⟨x, u⟩ − ½uᵀu]  ⟹  v′(τ) = −max{ ‖Aᵀr‖_∞, ‖Aᵀr‖₂/√(2τ) }
5. Vapnik: φ(x) = ‖(x − ε)₊‖₁ + ‖(−x − ε)₊‖₁  ⟹  v′(τ) = −( ‖Aᵀr‖_∞ + ‖Aᵀr‖₂ )
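A one-line illustration of how these slopes feed the (approximate) Newton update, for the norm case φ = ‖·‖₁ (special case 3 with the ℓ_∞ dual norm); the function name is mine, and x is an (approximate) solution of the flipped problem at the current τ.

```python
import numpy as np

def slope_l1(A, b, x):
    """Slope v'(tau) of  v(tau) = min 0.5*||Ax - b||_2^2  s.t.  ||x||_1 <= tau,
    evaluated at a minimizer x of the flipped problem (special case 3 above):
        v'(tau) = -||A^T r||_inf,   r = A x - b,
    since the dual of the l1-norm is the l-infinity norm.
    """
    r = A @ x - b
    return -np.linalg.norm(A.T @ r, ord=np.inf)
```

Paired with the Newton update from the Superlinear Convergence slide, this slope drives the root find v(τ) = σ for the sparse data-fitting problem.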
Basis Pursuit with Outliers

  BP_σ:  min ‖x‖₁  s.t.  ρ(b − Ax) ≤ σ

Standard least-squares: ρ(z) = ‖z‖₂ or ρ(z) = ‖z‖₂².

Quantile Huber:
  ρ_{κ,τ}(r) =  τ|r| − κτ²/2              if r < −τκ,
                r²/(2κ)                   if r ∈ [−κτ, (1 − τ)κ],
                (1 − τ)|r| − κ(1 − τ)²/2  if r > (1 − τ)κ.

Standard Huber when τ = 0.5.
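A direct transcription of the quantile Huber into code (a sketch; the function name and vectorized interface are mine):

```python
import numpy as np

def quantile_huber(r, kappa, tau):
    """Quantile Huber penalty rho_{kappa,tau}(r), applied elementwise.

    Three-piece definition as above; the pieces agree at the breakpoints
    r = -tau*kappa and r = (1 - tau)*kappa, and tau = 0.5 gives a symmetric
    (rescaled) Huber function.
    """
    r = np.asarray(r, dtype=float)
    left  = tau * np.abs(r) - kappa * tau**2 / 2.0                    # r < -tau*kappa
    mid   = r**2 / (2.0 * kappa)                                      # middle interval
    right = (1.0 - tau) * np.abs(r) - kappa * (1.0 - tau)**2 / 2.0    # r > (1-tau)*kappa
    return np.where(r < -tau * kappa, left,
                    np.where(r > (1.0 - tau) * kappa, right, mid))
```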
Huber

[Figure: the Huber penalty; not reproduced.]
Sparse and Robust Formulation
Signal Recovery

  HBP_σ:  min ‖x‖₁  s.t.  ρ(b − Ax) ≤ σ

Problem specification:
  x: 20-sparse spike train in ℝ⁵¹²
  b: measurements in ℝ¹²⁰
  A: measurement matrix satisfying RIP
  ρ: Huber function
  σ: error level set at 0.01
  5 outliers

Results: in the presence of outliers, the robust (Huber) formulation recovers the
spike train, while the standard least-squares formulation does not.

[Figures: recovered signals (truth, LS, Huber) and residuals; not reproduced.]
Sparse and Robust Formulation
Signal Recovery with Non-Negativity

  HBP_σ:  min_{0 ≤ x} ‖x‖₁  s.t.  ρ(b − Ax) ≤ σ

Problem specification:
  x: 20-sparse non-negative spike train in ℝ⁵¹²
  b: measurements in ℝ¹²⁰
  A: measurement matrix satisfying RIP
  ρ: Huber function
  σ: error level set at 0.01
  5 outliers

Results: in the presence of outliers, the robust formulation recovers the spike
train, while the standard formulation does not.

[Figures: recovered signals (truth, LS, Huber) and residuals; not reproduced.]
References

• “Probing the Pareto frontier for basis pursuit solutions,”
  van den Berg and Friedlander, SIAM J. Sci. Comput. 31 (2008), 890–912.
• “Sparse optimization with least-squares constraints,”
  van den Berg and Friedlander, SIAM J. Optim. 21 (2011), 1201–1229.
• “Variational properties of value functions,”
  Aravkin, Burke, and Friedlander, SIAM J. Optim. 23 (2013), 1689–1717.
• “Level-set methods for convex optimization,”
  Aravkin, Burke, Drusvyatskiy, Friedlander, and Roy, preprint, 2016.