Estimators based on non-convex programs:
Statistical and computational guarantees
Martin Wainwright
UC Berkeley
Statistics and EECS
Based on joint work with:
Po-Ling Loh (UC Berkeley)
May 2013
Introduction
Prediction/regression problems arise throughout statistics:
  vector of predictors/covariates x ∈ Rp
  response variable y ∈ R
In the "big data" setting:
  ◮ both sample size n and ambient dimension p are large
  ◮ many problems have p ≫ n
Regularization is essential:
  θ̂ ∈ arg min_θ { Ln(θ; x_1^n, y_1^n) + Rλ(θ) },
where Ln(θ; x_1^n, y_1^n) = (1/n) Σ_{i=1}^n L(θ; x_i, y_i) is the empirical loss and Rλ is the regularizer.
For non-convex problems: "Mind the gap!"
  any global optimum is "statistically good"...
  but efficient algorithms only find local optima
Question
How to close this undesirable gap between statistics and computation?
Vignette A: Regression with non-convex penalties
Set-up: Observe (y_i, x_i) pairs for i = 1, 2, . . . , n, where
  y_i ∼ Q( · | ⟨θ∗, x_i⟩),
and θ∗ ∈ Rp has "low-dimensional structure".
[Diagram: response y ∈ Rn drawn from Q given Xθ∗, with design matrix X ∈ Rn×p and sparse θ∗ supported on S, zero on Sᶜ.]
Estimator: Rλ-regularized likelihood
  θ̂ ∈ arg min_θ { −(1/n) Σ_{i=1}^n log Q(y_i | ⟨x_i, θ⟩) + Rλ(θ) }.
Vignette A: Regression with non-convex penalties
[Diagram: y ∈ Rn drawn from Q given Xθ∗, design matrix X ∈ Rn×p, sparse θ∗ supported on S, zero on Sᶜ.]
Example: Logistic regression for binary responses y_i ∈ {0, 1}:
  θ̂ ∈ arg min_θ { (1/n) Σ_{i=1}^n [ log(1 + e^{⟨x_i, θ⟩}) − y_i ⟨x_i, θ⟩ ] + Rλ(θ) }.
Many non-convex penalties are possible:
  capped ℓ1-penalty
  SCAD penalty (Fan & Li, 2001)
  MCP penalty (Zhang, 2006)
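The penalized logistic objective above translates directly into code. A minimal sketch, assuming a generic coordinate-separable penalty function `penalty` (e.g., one of the non-convex penalties listed above); this is an illustration, not the authors' implementation.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Negative log-likelihood (1/n) * sum[ log(1 + exp(<x_i, theta>)) - y_i * <x_i, theta> ]."""
    z = X @ theta
    # np.logaddexp(0, z) = log(1 + exp(z)), computed stably
    return np.mean(np.logaddexp(0.0, z) - y * z)

def penalized_objective(theta, X, y, penalty, lam):
    """Regularized likelihood: logistic loss plus a coordinate-separable penalty R_lambda."""
    return logistic_loss(theta, X, y) + np.sum(penalty(np.abs(theta), lam))
```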
Convex and non-convex regularizers
[Figure: univariate regularizers Rλ(t) plotted for t ∈ [−1, 1]: the ℓ1 penalty, the MCP, and the capped ℓ1 penalty.]
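For concreteness, the penalties in the figure can be written out as simple functions. A sketch using the standard parameterizations of MCP and SCAD; the cap level c in `capped_l1` is an illustrative choice, not a value from the slides.

```python
import numpy as np

def mcp(t, lam, b=1.5):
    """MCP penalty: lam*|t| - t^2/(2b) for |t| <= b*lam, else b*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

def scad(t, lam, a=3.7):
    """SCAD penalty (a > 2): linear near zero, quadratic transition, then constant."""
    t = np.abs(t)
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * (a + 1) * lam**2))

def capped_l1(t, lam, c=0.5):
    """Capped ell_1 penalty: lam * min(|t|, c); the cap c is a free parameter here."""
    return lam * np.minimum(np.abs(t), c)
```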
Statistical error versus optimization error
Algorithm generates a sequence of iterates {θᵗ}_{t=0}^∞ to solve
  θ̂ ∈ arg min_θ { (1/n) Σ_{i=1}^n L(θ; x_i, y_i) + Rλ(θ) }.
Global minimizer of the population risk:
  θ∗ := arg min_θ E_{X,Y}[ L(θ; X, Y) ] = arg min_θ L̄(θ).
Goal of statistician
  Provide bounds on the statistical error: ‖θᵗ − θ∗‖ or ‖θ̂ − θ∗‖.
Goal of optimization-theorist
  Provide bounds on the optimization error: ‖θᵗ − θ̂‖.
Logistic regression with non-convex regularizer
[Figure: log error plot for logistic regression with SCAD (a = 3.7); log ‖βᵗ − β∗‖₂ versus iteration count (0 to 2000), showing both optimization error and statistical error.]
What phenomena need to be explained?
Empirical observation #1:
From a statistical perspective, all local optima are essentially as good as a global optimum.
Some past work:
  for least-squares loss, certain local optima are good (Zhang & Zhang, 2012)
  if initialized at a Lasso solution with ℓ∞-guarantees, a local algorithm has good behavior (Fan et al., 2012)
Empirical observation #2:
First-order methods converge as fast as possible, up to statistical precision.
Vignette B: Error-in-variables regression
Begin with high-dimensional sparse regression:
  y = Xθ∗ + ε.
[Diagram: y ∈ Rn, design matrix X ∈ Rn×p, sparse θ∗ supported on S, zero on Sᶜ, noise ε.]
Observe y ∈ Rn and Z ∈ Rn×p, where
  Z = F(X, W),   with Z, X, W ∈ Rn×p.
Here W ∈ Rn×p is a stochastic perturbation.
Example: Missing data
Missing data:
  Z = X ⊙ W   (entrywise product), with Z, X, W ∈ Rn×p.
Here W ∈ Rn×p is a multiplicative perturbation (e.g., Wij ∼ Ber(1 − α), so each entry of X is observed with probability 1 − α).
[Diagram: underlying regression y = Xθ∗ + ε with sparse θ∗ supported on S.]
Example: Additive perturbations
Additive noise in covariates:
  Z = X + W,   with Z, X, W ∈ Rn×p.
Here W ∈ Rn×p is an additive perturbation (e.g., Wij ∼ N(0, σ²)).
[Diagram: underlying regression y = Xθ∗ + ε with sparse θ∗ supported on S.]
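For the additive-noise model, a natural corrected pair (Γ̂, γ̂) in the spirit of the general family of estimators introduced below (and of Loh & W., 2012) subtracts the noise covariance from the Gram matrix. A minimal sketch, under the assumptions that cov(W) = Σw is known, and that W has mean zero and is independent of (X, ε):

```python
import numpy as np

def corrected_moments_additive(Z, y, Sigma_w):
    """Unbiased surrogates for cov(x) and cov(x, y) when Z = X + W with known cov(W) = Sigma_w.

    E[Z^T Z / n] = cov(x) + Sigma_w, so subtracting Sigma_w debiases the Gram matrix;
    E[Z^T y / n] = cov(x, y) since W is independent of (X, eps).
    """
    n = Z.shape[0]
    Gamma_hat = Z.T @ Z / n - Sigma_w   # may be indefinite when p > n: the source of non-convexity
    gamma_hat = Z.T @ y / n
    return Gamma_hat, gamma_hat
```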
A second look at regularized least-squares
Equivalent formulation:
  θ̂ ∈ arg min_{θ∈Rp} { (1/2) θᵀ (XᵀX/n) θ − ⟨θ, Xᵀy/n⟩ + Rλ(θ) }.
Population view: unbiased estimators
  cov(x₁) = E[XᵀX/n],   and   cov(x₁, y₁) = E[Xᵀy/n].
Corrected estimators
A general family of estimators
  θ̂ ∈ arg min_{θ∈Rp} { (1/2) θᵀ Γ̂ θ − θᵀ γ̂ + λn ‖θ‖₁ },
where (Γ̂, γ̂) are unbiased estimators of cov(x₁) and cov(x₁, y₁).
Example: Estimator for missing data
observe corrupted version Z ∈ Rn×p:
  Zij = Xij with probability 1 − α,   Zij = ⋆ with probability α.
Natural unbiased estimates: set ⋆ ≡ 0 and Ẑ := Z/(1 − α):
  Γ̂ = ẐᵀẐ/n − α diag(ẐᵀẐ/n),   and   γ̂ = Ẑᵀy/n.
solve the (doubly non-convex) optimization problem:
  θ̂ ∈ arg min_{θ∈Ω} { (1/2) θᵀ Γ̂ θ − ⟨γ̂, θ⟩ + Rλ(θ) }.   (Loh & W., 2012)
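The corrected moments on this slide translate directly into code. A minimal sketch, assuming missing entries are encoded as NaN and the missingness probability α is known:

```python
import numpy as np

def corrected_moments_missing(Z_obs, y, alpha):
    """Unbiased surrogates (Gamma_hat, gamma_hat) for missing data, as on the slide.

    Z_obs has NaN where entries are missing (each entry observed with probability 1 - alpha).
    """
    n = Z_obs.shape[0]
    Z = np.nan_to_num(Z_obs, nan=0.0)             # set missing entries (*) to 0
    Z_hat = Z / (1.0 - alpha)                     # rescale so E[Z_hat] = X entrywise
    M = Z_hat.T @ Z_hat / n
    Gamma_hat = M - alpha * np.diag(np.diag(M))   # correct the inflated diagonal
    gamma_hat = Z_hat.T @ y / n
    return Gamma_hat, gamma_hat
```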
Non-convex quadratic and non-convex regularizer
[Figure: log error plot for corrected linear regression with MCP (b = 1.5); log ‖βᵗ − β∗‖₂ versus iteration count (0 to 1000), showing both optimization error and statistical error.]
Remainder of talk
1. Why are all local optima "statistically good"?
  ◮ Restricted strong convexity
  ◮ A general theorem
  ◮ Various examples
2. Why do first-order gradient methods converge quickly?
  ◮ Composite gradient methods
  ◮ Statistical versus optimization error
  ◮ Fast convergence for non-convex problems
Geometry of a non-convex quadratic loss
[Figure: surface plot of a non-convex quadratic loss over [−1, 1]², exhibiting directions of both positive and negative curvature.]
Loss function has directions of both positive and negative curvature.
Negative directions must be forbidden by the regularizer.
Restricted strong convexity
Here defined with respect to the ℓ1-norm:
Definition
The loss function Ln satisfies RSC with parameters (αj, τj), j = 1, 2, if
  ⟨∇Ln(θ∗ + ∆) − ∇Ln(θ∗), ∆⟩  ≥  α1 ‖∆‖₂² − τ1 (log p / n) ‖∆‖₁²      if ‖∆‖₂ ≤ 1,
  ⟨∇Ln(θ∗ + ∆) − ∇Ln(θ∗), ∆⟩  ≥  α2 ‖∆‖₂ − τ2 √(log p / n) ‖∆‖₁       if ‖∆‖₂ > 1.
(The left-hand side is a measure of curvature.)
holds with τ1 = τ2 = 0 for any function that is locally strongly convex around θ∗
holds for a variety of loss functions (convex and non-convex):
  ◮ ordinary least-squares (Raskutti, W. & Yu, 2010)
  ◮ likelihoods for generalized linear models (Negahban et al., 2012)
  ◮ certain non-convex quadratic functions (Loh & W., 2012)
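For the least-squares loss Ln(θ) = ‖y − Xθ‖₂²/(2n), the curvature term reduces to ⟨∇Ln(θ∗ + ∆) − ∇Ln(θ∗), ∆⟩ = ‖X∆‖₂²/n, so the RSC inequality can be probed numerically. A small sketch for a Gaussian design; the constants α and τ below are illustrative assumptions, not the values from the theory.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))          # random Gaussian design

def curvature(delta):
    """RSC left-hand side for least squares: <grad Ln(t*+d) - grad Ln(t*), d> = ||X d||^2 / n."""
    return np.linalg.norm(X @ delta) ** 2 / n

# Compare against the lower bound alpha*||d||_2^2 - tau*(log p / n)*||d||_1^2
# for a few sparse directions; alpha and tau here are illustrative constants.
alpha, tau = 0.25, 1.0
for s in (5, 10, 20):
    delta = np.zeros(p)
    delta[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
    lhs = curvature(delta)
    rhs = alpha * np.sum(delta**2) - tau * (np.log(p) / n) * np.sum(np.abs(delta))**2
    print(f"s={s:2d}  curvature={lhs:.3f}  RSC lower bound={rhs:.3f}")
```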
Well-behaved regularizers
Properties defined at the univariate level, Rλ : R → [0, ∞]:
  Satisfies Rλ(0) = 0, and is symmetric around zero (Rλ(t) = Rλ(−t)).
  Non-decreasing and subadditive: Rλ(s + t) ≤ Rλ(s) + Rλ(t).
  The function t ↦ Rλ(t)/t is non-increasing for t > 0.
  Differentiable for all t ≠ 0, subdifferentiable at t = 0, with subgradients bounded in absolute value by λL.
  For some µ > 0, the function R̃λ(t) = Rλ(t) + µt² is convex.
Includes (among others):
  rescaled ℓ1 penalty: Rλ(t) = λ|t|
  MCP and SCAD penalties (Fan et al., 2001; Zhang, 2006)
  does not include the capped ℓ1-penalty
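The last property is easy to check numerically for a given penalty. A small sketch for the MCP in the standard form used earlier, with the choice µ = 1/(2b) (an assumption consistent with that parameterization, under which Rλ(t) + µt² equals λ|t| on |t| ≤ bλ):

```python
import numpy as np

# Numerically check convexity of R_tilde(t) = R_lambda(t) + mu*t^2 for the MCP,
# using mu = 1/(2b), a choice consistent with the standard MCP form.
lam, b = 0.5, 1.5
mu = 1.0 / (2.0 * b)

def mcp(t, lam=lam, b=b):
    t = np.abs(t)
    return np.where(t <= b * lam, lam * t - t**2 / (2 * b), 0.5 * b * lam**2)

t = np.linspace(-3, 3, 2001)
r_tilde = mcp(t) + mu * t**2
# Convexity of a univariate function on a grid: second differences should be >= 0 (up to rounding).
second_diff = np.diff(r_tilde, 2)
print("min second difference:", second_diff.min())   # expected to be >= -1e-12
```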
Main statistical guarantee
regularized M-estimator
  θ̂ ∈ arg min_{‖θ‖₁ ≤ M} { Ln(θ) + Rλ(θ) }.
loss function satisfies (α, τ) RSC, and regularizer is regular (with parameters (µ, L))
local optimum θ̂ defined by the conditions
  ⟨∇Ln(θ̂) + ∇Rλ(θ̂), θ − θ̂⟩ ≥ 0   for all feasible θ.
Theorem (Loh & W., 2013)
Suppose M is chosen such that θ∗ is feasible, and λ satisfies the bounds
  max{ ‖∇Ln(θ∗)‖∞, α2 √(log p / n) } ≤ λ ≤ α2 / (6LM).
Then any local optimum θ̂ satisfies the bound
  ‖θ̂ − θ∗‖₂ ≤ 6λ√s / (4(α − µ)),   where s = ‖θ∗‖₀.
Geometry of local/global optima
[Figure: all local/global optima (θ̂, θ̃, θ′, . . .) lie within a ball of radius ǫstat around the target θ∗.]
Consequence:
All { local, global } optima are within distance ǫstat of the target θ∗.
With λ = c √(log p / n), the statistical error scales as
  ǫstat ≍ √(s log p / n),
which is minimax optimal.
Empirical results (unrescaled)
[Figure: mean squared error versus raw sample size n (200 to 1200), for p ∈ {128, 256, 512, 1024}.]
Empirical results (rescaled)
[Figure: mean squared error versus rescaled sample size (1 to 10), for p ∈ {128, 256, 512, 1024}.]
Comparisons between different penalties
[Figure: comparing penalties for corrected linear regression; ℓ2-norm error versus n/(k log p) (0 to 50), for p ∈ {64, 128, 256}.]
First-order algorithms and fast convergence
Thus far...
  have shown that all local optima are "statistically good"
  how to obtain a local optimum quickly?
Composite gradient descent for regularized objectives (Nesterov, 2007):
  min_{θ∈Ω} { f(θ) + g(θ) },
where f is differentiable, and g is convex and sub-differentiable.
Simple updates:
  θ^{t+1} = arg min_{θ∈Ω} { (1/(2αᵗ)) ‖θ − (θᵗ − αᵗ ∇f(θᵗ))‖₂² + g(θ) }.
Not directly applicable with f = Ln and g = Rλ (since Rλ can be non-convex).
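For a convex g, the update above is a proximal step with a closed form in many cases. A minimal sketch of one composite gradient update when g(θ) = λ‖θ‖₁ and Ω = Rp, so the proximal map is soft-thresholding; this specific choice of g is an illustrative assumption, not the regularizer from the slides.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal map of thresh * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def composite_gradient_step(theta, grad_f, step, lam):
    """One update: theta <- prox_{step*lam*||.||_1}( theta - step * grad f(theta) )."""
    return soft_threshold(theta - step * grad_f(theta), step * lam)
```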
Composite gradient on a convenient splitting
Define modified loss functions and regularizers:
  L̃n(θ) := Ln(θ) − µ‖θ‖₂²   (non-convex),   and   R̃λ(θ) := Rλ(θ) + µ‖θ‖₂²   (convex).
Apply composite gradient descent (Nesterov, 2007) to the objective
  min_{θ∈Ω} { L̃n(θ) + R̃λ(θ) }.
converges to a local optimum θ̂:
  ⟨∇L̃n(θ̂) + ∇R̃λ(θ̂), θ − θ̂⟩ ≥ 0   for all feasible θ.
will show that convergence is geometrically fast with constant stepsize
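Putting the pieces together for the corrected quadratic loss Ln(θ) = (1/2)θᵀΓ̂θ − ⟨γ̂, θ⟩ from Vignette B, with the ℓ1 penalty standing in for Rλ so that the proximal step stays in closed form. This is a sketch of the overall scheme, not the authors' code; the stepsize, iteration count, and the omission of the ℓ1-ball constraint ‖θ‖₁ ≤ M are illustrative simplifications.

```python
import numpy as np

def composite_gradient_corrected(Gamma_hat, gamma_hat, lam, mu, step=0.01, iters=1000):
    """Composite gradient descent on the split objective:
       L_tilde(theta) = 0.5*theta'Gamma_hat theta - gamma_hat'theta - mu*||theta||_2^2,
       R_tilde(theta) = lam*||theta||_1 + mu*||theta||_2^2  (convex),
    with the ell_1 penalty as a stand-in for a general R_lambda."""
    p = gamma_hat.shape[0]
    theta = np.zeros(p)
    for _ in range(iters):
        grad = Gamma_hat @ theta - gamma_hat - 2.0 * mu * theta   # gradient of L_tilde
        z = theta - step * grad
        # Proximal step for R_tilde: soft-threshold, then shrink by 1/(1 + 2*step*mu)
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0) / (1.0 + 2.0 * step * mu)
    return theta
```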
Theoretical guarantees on computational error
implement Nesterov's composite method with constant stepsize on the (L̃n, R̃λ) split
fixed global optimum θ̂ defines the statistical error ǫ²stat = ‖θ̂ − θ∗‖².
population minimizer θ∗ is s-sparse
loss function satisfies (α, τ) RSC and smoothness conditions, and regularizer is (µ, L)-good
Theorem (Loh & W., 2013)
If n ≳ s log p, there is a contraction factor κ ∈ (0, 1) such that for any δ ≥ ǫstat, we have
  ‖θᵗ − θ̂‖₂² ≤ (2/(α − µ)) ( δ² + 128 τ (s log p / n) ǫ²stat )   for all t ≥ T(δ) iterations,
where T(δ) ≍ log(1/δ) / log(1/κ).
Geometry of result
[Figure: optimization errors ∆̂⁰, ∆̂¹, . . . , ∆̂ᵗ contract toward θ̂, down to a ball of radius ǫ around it; θ∗ − θ̂ marks the statistical error.]
Optimization error ∆̂ᵗ := θᵗ − θ̂ decreases geometrically up to the statistical tolerance:
  ‖θ^{t+1} − θ̂‖² ≤ κᵗ ‖θ⁰ − θ̂‖² + o( ‖θ∗ − θ̂‖² )   for all t = 0, 1, 2, . . .,
where ‖θ∗ − θ̂‖² is the statistical error ǫ²stat.
Non-convex linear regression with SCAD
[Figure: log error plot for corrected linear regression with SCAD (a = 2.5); log ‖βᵗ − β∗‖₂ versus iteration count (0 to 1200), showing both optimization error and statistical error.]
Non-convex linear regression with SCAD
[Figure: log error plot for corrected linear regression with SCAD (a = 3.7); log ‖βᵗ − β∗‖₂ versus iteration count (0 to 1200), showing both optimization error and statistical error.]
Summary
M-estimators based on non-convex programs arise frequently
under suitable regularity conditions, we showed that:
  ◮ all local optima are "well-behaved" from the statistical point of view
  ◮ simple first-order methods converge as fast as possible
many open questions
  ◮ similar guarantees for more general problems?
  ◮ geometry of non-convex problems in statistics?
Papers and pre-prints:
Loh & W. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Pre-print, arXiv:1305.2436.
Loh & W. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40:1637–1664.
Negahban et al. (2012). A unified framework for high-dimensional analysis of M-estimators. Statistical Science, 27(4):538–557.