Nonparametric Convex Regression
Bodhisattva Sen
Department of Statistics
Columbia University, New York
[email protected]
University of North Carolina, Chapel Hill
4 April, 2014
Supported by NSF grant DMS-1150435 (CAREER)
Outline
1. Nonparametric convex regression: Introduction
2. Testing against a linear model
3. Theoretical results: Risk bounds
What is convex regression?
Regression model of the form
Yi = φ0(Xi) + εi,    for i = 1, . . . , n,
where
φ0 : Rd → R, d ≥ 1, is an unknown convex function
Xi is an X-valued vector (X ⊂ Rd is a closed convex set)
εi is a random variable with E(εi | Xi) = 0 and Var(εi) < ∞
Goal: Estimate φ0 nonparametrically, assuming only that φ0 is convex
What is a convex function?
f : Rd → R is convex iff
for all X1 , X2 ∈ Rd and t ∈ [0, 1],
f (tX1 + (1 − t)X2 ) ≤ t f (X1 ) + (1 − t) f (X2 ).
Arises frequently in economics, operations research, and financial
engineering; e.g., production functions are known to be concave and increasing
Approximating a function (objective function, constraint) for a
convex optimization problem
Convexity is probably the simplest shape constraint that has a
natural generalization to multiple dimensions.
Two further applications of the method
φ0 is affine (i.e., φ0(X) = α + β⊤X) iff φ0 is both convex and concave
Test the goodness-of-fit of a linear model (Sen and Meyer (2013))
Any smooth function on a compact domain can be represented as
the difference of two convex functions
Estimate any smooth function (Work in progress)
Convex regression when d = 1
Model
Yi = φ0(xi) + εi,    for i = 1, 2, . . . , n
φ0 : R → R is an unknown convex (or concave) function
x1 < x2 < · · · < xn are known constants; ε1, . . . , εn are mean-zero errors.
Want to estimate φ0 nonparametrically, given φ0 is convex/concave
Motivating data examples (scatter plots): Belgian firms data and an age/income data set; the variables on the axes include log(income), log(output), age and labour.
The estimation procedure: least squares
Model
Yi = φ0(Xi) + εi,    for i = 1, . . . , n;    Xi ∈ Rd, d ≥ 1
We estimate φ0 via the method of least squares
φ̂ = argmin_{ψ : Rd → R convex} Σ_{i=1}^n {Yi − ψ(Xi)}².
θ̂ = (φ̂(x1 ), . . . , φ̂(xn )) is unique (Why? next slide ...)
Long history: Hildreth (1954) ...
Fully automated; does not require the choice of tuning parameter(s)
Sequence model: Yi = φ0(Xi) + εi, i = 1, 2, . . . , n, is equivalent to
Y = θ0 + ε,    where Y = (Y1, . . . , Yn)⊤, θ0 = (φ0(X1), . . . , φ0(Xn))⊤, ε = (ε1, . . . , εn)⊤
min_{ψ convex} Σ_{i=1}^n {Yi − ψ(Xi)}²  ≡  min_{θ ∈ K} Σ_{i=1}^n (Yi − θi)²
K = {θ ∈ Rn : ∃ ψ convex s.t. ψ(Xi ) = θi , ∀i = 1, . . . , n}
θ̂ is the projection of Y onto K, i.e.,
θ̂ := Π(Y|K) = argmin_{θ ∈ K} ‖Y − θ‖²
K is a closed convex cone in Rn (T is a convex cone iff
α1 θ 1 + α2 θ 2 ∈ T , ∀α1 , α2 > 0 & θ 1 , θ 2 ∈ T )
The projection theorem gives the existence and uniqueness of θ̂.
Computation when d = 1
K has a straightforward characterization when d = 1, i.e.,
K = { θ ∈ Rn : (θ2 − θ1)/(x2 − x1) ≤ (θ3 − θ2)/(x3 − x2) ≤ · · · ≤ (θn − θn−1)/(xn − xn−1) }
Quadratic program with (n − 2 ) linear constraints
φ̂ ends up being a piece-wise linear function constructed by linearly
interpolating the pairs (x1 , θ̂1 ), . . . , (xn , θ̂n )
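For concreteness, here is a minimal computational sketch of this d = 1 quadratic program. It is not the authors' code: it assumes the Python package cvxpy (any QP solver, e.g. MOSEK, would do equally well), and the function name is ours.

```python
import numpy as np
import cvxpy as cp

def convex_lse_1d(x, y):
    """Convex least squares fit at design points x_1 < ... < x_n.

    Returns theta_hat = (phi_hat(x_1), ..., phi_hat(x_n)); phi_hat itself is the
    piecewise linear interpolant of the pairs (x_i, theta_hat_i).
    """
    theta = cp.Variable(len(x))
    # The cone K: successive slopes (theta_{i+1} - theta_i)/(x_{i+1} - x_i)
    # must be nondecreasing -- (n - 2) linear constraints.
    slopes = cp.multiply(1.0 / np.diff(x), cp.diff(theta))
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)),
                      [slopes[1:] >= slopes[:-1]])
    prob.solve()
    return theta.value
```

Evaluating the fit at new points t in [x1, xn] then amounts to linear interpolation, e.g. np.interp(t, x, theta_hat).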
What happens when d > 1?
If f is differentiable, then f is convex iff
f(x) + ∇f(x)⊤(y − x) ≤ f(y),
for all x, y ∈ Rd
where ∇f (x) is the gradient of f at x.
The first-order Taylor approximation is a global under-estimator of f .
In fact, for every convex function f and every x ∈ Rd, there exists ξx ∈ Rd such that
f(x) + ξx⊤(y − x) ≤ f(y),    for all y ∈ Rd.
ξx is called a sub-gradient of f at x.
Sub-differentials and sub-gradients
The sub-differential of f at x ∈ Rd is given by
∂f(x) := {ξ ∈ Rd : f(x) + ξ⊤(y − x) ≤ f(y) ∀ y ∈ Rd}.
The elements of the sub-differential are called sub-gradients.
Example: If f (x) = kxk, ∂f (0) = {ξ ∈ Rd : kξk ≤ 1}.
f is differentiable at x iff ∂f (x) = {∇f (x)}.
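A one-line check of the example above (a standard argument, added here for completeness):

```latex
% For f(x) = \|x\|:  \xi \in \partial f(0)  iff  \xi^\top y \le \|y\| for all y.
% Cauchy--Schwarz gives the inequality whenever \|\xi\| \le 1, while taking
% y = \xi shows it fails when \|\xi\| > 1.  Hence
\[
  \partial f(0) \;=\; \{\xi \in \mathbb{R}^d : \|\xi\| \le 1\}.
\]
```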
Computation of θ̂
The vector θ̂ can now be computed by solving the following
quadratic program with linear constraints:
min_{θ ∈ Rn; ξ1, . . . , ξn ∈ Rd}  ‖Y − θ‖²
subject to  θi + ξi⊤(Xj − Xi) ≤ θj,    ∀ i, j = 1, . . . , n
This problem has a unique solution in θ but not in the ξj ’s (i.e., for
any two solutions (ξ1 , . . . , ξn , θ) and (τ1 , . . . , τn , ζ) we have θ = ζ)
Quadratic program with n(n − 1) linear constraints and n(d + 1)
variables.
Can use MOSEK, cvx or any off-the-shelf solver for n ∼ 200
Scalable algorithm? (work in progress; can work with n ∼ 5000)
Estimator at a point X: φ̂(X) = max_{i=1,...,n} { θ̂i + (X − Xi)⊤ ξ̂i }
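Again as a hedged sketch (assuming cvxpy; the function names are ours, not from the slides), the QP above and the max-of-affine evaluation rule can be written as:

```python
import numpy as np
import cvxpy as cp

def convex_lse(X, y):
    """Multivariate convex LSE.  X: (n, d) design matrix, y: (n,) responses.
    Returns fitted values theta_hat (n,) and sub-gradients xi_hat (n, d)."""
    n, d = X.shape
    theta = cp.Variable(n)
    xi = cp.Variable((n, d))
    # Sub-gradient constraints: theta_i + xi_i^T (X_j - X_i) <= theta_j for all i, j.
    constraints = [theta[i] + (X - X[i]) @ xi[i] <= theta for i in range(n)]
    cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints).solve()
    return theta.value, xi.value

def phi_hat(x_new, theta_hat, xi_hat, X):
    """phi_hat(x) = max_i { theta_hat_i + (x - X_i)^T xi_hat_i }."""
    return np.max(theta_hat + np.sum(xi_hat * (x_new - X), axis=1))
```

Solved with an off-the-shelf QP solver this handles n of a few hundred, in line with the remark above.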
Figure: The scatter plot and LSE: (a) φ0(X) = ‖X‖² (left panel); (b) φ0(X) = −X1 + X2 (right panel). For both cases, n = 256 and ε ∼ N(0, 1).
φ̂ is a polyhedral convex function.
Belgian labour firms example (MATLAB figures)
Consistency
The LSE is consistent in fixed and stochastic design settings.
Consistency theorem (Seijo and Sen (2011))
Under appropriate conditions we have:
(i) P( sup_{X ∈ C} |φ̂(X) − φ0(X)| → 0 for any compact set C ⊂ X◦ ) = 1.
(ii) Denoting by B the unit ball (w.r.t. the Euclidean norm), we have
P( ∂φ̂(X) ⊂ ∂φ0(X) + εB almost always ) = 1,    ∀ ε > 0, ∀ X ∈ X◦.
(iii) If φ0 is differentiable on X◦, then
P( sup_{X ∈ C} sup_{ξ ∈ ∂φ̂(X)} ‖ξ − ∇φ0(X)‖ → 0 for any compact set C ⊂ X◦ ) = 1.
Testing against a linear model
Let Y := (Y1 , Y2 , . . . , Yn ) ∈ Rn and
Y ∼ N (θ 0 , σ 2 In ),
σ2 > 0
Problem: Test H0 : θ 0 ∈ S where
S := {θ ∈ Rn : θ = Xβ, β ∈ Rk , for some k ≥ 1 }
and X is the design matrix.
Important problem with a long history
Most methods fit the parametric and a nonparametric model to
estimate the regression function
Compare the two fits to come up with a test statistic
Drawbacks of most methods
Involves the choice of delicate tuning parameters (e.g., bandwidths)
Calibration (i.e., setting the level) of the test is usually done by resampling
the test statistic (bootstrap), and is thus only approximate.
We develop a likelihood ratio test for H0 using ideas from shape
restricted regression.
Fully automated, no tuning parameters required
Test has exact level
Power of the test goes to 1, under (almost) all alternatives
Testing against linear regression
Yi = φ0(Xi) + εi,    i = 1, 2, . . . , n.
φ0 : Rd → R, d ≥ 1, is an unknown smooth function
Design points X1, X2, . . . , Xn in Rd; ε1, . . . , εn are i.i.d. N(0, σ²)
Test H0 : φ0 is affine, i.e., φ0(X) = a + b⊤X, for some unknown a ∈ R, b ∈ Rd
Thus, Y ∼ N(θ0, σ²In) where θ0 := (φ0(X1), . . . , φ0(Xn))⊤ ∈ Rn
Test H0 : θ0 ∈ S where S := {θ ∈ Rn : θ = Xβ, β ∈ Rd+1}, and
X = [e | x] where x = (X1, . . . , Xn)⊤ ∈ Rn×d, e = (1, . . . , 1)⊤ ∈ Rn.
Idea
A function is affine iff it is both convex and concave.
Let I = {θ ∈ Rn : ∃ ψ convex s.t. ψ(Xi ) = θi , ∀i = 1, . . . , n}
Let D = −I = {θ ∈ Rn : ∃ ψ concave s.t. ψ(Xi ) = θi }
I, D are cones that contain S
If H0 is true, both Π(Y|I) and Π(Y|D) must be close to Π(Y|S).
Figure: panels with φ0(x) = 4 − 6x(1 − x) and φ0(x) = sin(3πx): constant (dotted blue), convex (dashed red) and concave (solid black) fits to a scatter plot generated from Y = φ0(x) + ε with independent standard normal errors.
Method
Model:
Y ∼ N (θ 0 , σ 2 In )
Consider testing
H0 : θ 0 ∈ S
versus
H1 : θ 0 ∈ I ∪ D\S
S is a linear subspace.
I ⊂ Rn is a closed convex cone containing S; D = −I.
I ∩D =S
The log-likelihood function (up to a constant) is
ℓ(θ, σ²) = − ‖Y − θ‖²/(2σ²) − (n/2) log σ².
The likelihood ratio statistic is
Λ = 2 [ max_{θ ∈ I∪D, σ² > 0} ℓ(θ, σ²) − max_{θ ∈ S, σ² > 0} ℓ(θ, σ²) ]
An equivalent test is to reject H0 if
T(Y) := max{ ‖θ̂S − θ̂I‖², ‖θ̂S − θ̂D‖² } / ‖Y − θ̂S‖²    is large,
where θ̂I = Π(Y|I), θ̂D = Π(Y|D) and θ̂S = Π(Y|S).
Questions
How do we find the critical value of the test?
Power against all alternatives?
How to find the critical value of the test?
Result: The distribution of T is invariant to translations in S.
Lemma (Sen and Meyer (2013))
For any s ∈ S and z ∈ Rn ,
T (z + s) = T (z).
As a consequence, for any θ0 ∈ S,
T(Y) =d T(Z),
where Y ∼ N(θ0, σ²In) and Z ∼ N(0, In).
Proof
For z ∈ Rn and s ∈ S,
Π(z + s|I) = Π(z|I) + s,
and
Π(z + s|S) = Π(z|S) + s.
Note that Y = σZ + θ0, where Z ∼ N(0, In). The shift by s cancels in the differences θ̂S − θ̂I, θ̂S − θ̂D and Y − θ̂S, so T(z + s) = T(z); since the projections onto the cones I, D and the subspace S are also positively homogeneous, T(σZ + θ0) = T(Z).
The test statistic T is pivotal under H0
Suppose T (Z) ∼ F . Then, we reject H0 if
T (Y) > F −1 (1 − α),
where α ∈ (0, 1) is the desired level of the test
The distribution F can be approximated, up to any desired precision,
by Monte Carlo simulations
The test procedure has exact level α
The test is also unbiased
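A sketch of the Monte Carlo calibration in the d = 1 setting (this reuses the hypothetical convex_lse_1d helper from the earlier sketch; it is an illustration, not the authors' implementation):

```python
import numpy as np

def T_stat(y, x):
    """Test statistic T(Y) for d = 1, built from the three projections."""
    A = np.column_stack([np.ones(len(x)), x])             # design X = [e | x]
    theta_S = A @ np.linalg.lstsq(A, y, rcond=None)[0]    # projection onto S
    theta_I = convex_lse_1d(x, y)                         # projection onto I (convex cone)
    theta_D = -convex_lse_1d(x, -y)                       # projection onto D = -I
    num = max(np.sum((theta_S - theta_I) ** 2), np.sum((theta_S - theta_D) ** 2))
    return num / np.sum((y - theta_S) ** 2)

def critical_value(x, alpha=0.05, n_mc=1000, seed=0):
    """Empirical (1 - alpha)-quantile of T(Z) with Z ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    sims = [T_stat(rng.standard_normal(len(x)), x) for _ in range(n_mc)]
    return np.quantile(sims, 1 - alpha)

# Reject H0 (phi_0 affine) at level alpha when T_stat(y, x) > critical_value(x, alpha).
```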
Theorem (Sen and Meyer (2013))
Under mild conditions, the power of the test converges to 1.
Example: Testing against simple linear regression
Yi = φ0(xi) + εi;    0 ≤ x1 < · · · < xn ≤ 1;    ε1, . . . , εn i.i.d. N(0, σ²)
φ0 : [0, 1] → R is an unknown smooth function
Test H0 : φ0 is an affine function
Recall T(Y) = max{ (1/n)‖θ̂S − θ̂I‖², (1/n)‖θ̂S − θ̂D‖² } / ( (1/n)‖Y − θ̂S‖² )
Under H0, T = Op((log n)^{5/4}/n).
Otherwise, T →p c for some c > 0.
Thus, Pφ0(T > F⁻¹(1 − α)) → 1 if φ0 ∉ H0
Convex regression in 1-dimension
φ0 : [0, 1] → R is an unknown convex function.
Observe
Yi = φ0(xi) + εi,    for i = 1, 2, . . . , n.
x1 < x2 < · · · < xn are fixed uniform grid points in [0, 1].
ε1, . . . , εn are i.i.d. N(0, σ²), σ² > 0.
Want to study the LSE φ̂ls of φ0 , i.e.,
φ̂ls = argmin_{ψ convex} Σ_{i=1}^n {Yi − ψ(xi)}².
θ̂ = (φ̂ls (x1 ), . . . , φ̂ls (xn )) is unique.
Theoretical results: Convex LSE when d = 1
Known results
Asymptotic Behavior on Interior Compacta:
Consistency (Hanson and Pledger (1976))
Rate of convergence (Dümbgen et al. (2004))
Asymptotic Behavior at a Point:
Rate of convergence (Mammen (1991))
Limiting distribution (Groeneboom et al. (2001))
Not much is known about the global behavior of the LSE.
Natural global loss function:
ℓ²(ψ, φ) = (1/n) Σ_{i=1}^n {ψ(xi) − φ(xi)}².
Worst case risk upper bound
Let θ 0 = (φ0 (x1 ), . . . , φ0 (xn )), θ̂ = (φ̂ls (x1 ), . . . , φ̂ls (xn ))
For every φ0, the risk
Eφ0 ℓ²(φ̂ls, φ0) = (1/n) Eθ0 ‖θ̂ − θ0‖²
is bounded from above by n^{−4/5}, up to logarithmic factors in n.
Risk upper bound (Guntuboyina and Sen (2013b))
We have
Eφ0 ℓ²(φ̂ls, φ0) ≲ n^{−4/5} log n
No smoothness assumptions on φ0 necessary. Only convexity needed.
Finite sample result, holds for every n.
Question: Can Eφ0 ℓ²(φ̂ls, φ0) be much smaller than n^{−4/5}?
(1) NO: If φ0″(x) exists and is bounded from above and below by
positive constants on a subinterval of (0, 1).
Then, in a very strong sense, no estimator (not just the LSE) can
have risk smaller than n^{−4/5}.
The local minimax risk at φ0:
Rn(φ0) := inf_{φ̂} sup_{φ ∈ N(φ0)} Eφ ℓ²(φ̂, φ),
where N(φ0) := { φ convex : ‖φ − φ0‖∞ ≤ n^{−2/5} }.
Local minimax lower bound (Guntuboyina and Sen (2013b))
Rn(φ0) ≳ n^{−4/5}
Question: Can Eφ0 ℓ²(φ̂ls, φ0) be much smaller than n^{−4/5}?
(2) YES: φ0 is a piecewise affine convex function.
Adaptation: Suppose φ0 has k affine pieces; then
Eφ0 ℓ²(φ̂ls, φ0) ≲ (k/n) (log n)^{5/4}.
Parametric rate up to logarithmic factors.
Surprising, as the LSE minimizes the LS criterion over all convex functions.
Figure: simulated examples with φ0(x) = |x| and φ0(x) = x².
Key proof ideas - Risk upper bounds
Convex LSE is an Empirical Risk Minimization (ERM) procedure
(Birgé and Massart (1993), Van de Geer, Van der Vaart and
Wellner, Massart, Koltchinskii).
Risk behavior of φ̂ls is determined by
E[ sup_{φ convex : ℓ²(φ, φ0) ≤ r²} Σ_{i=1}^n εi {φ(xi) − φ0(xi)} ]
Local balls of convex functions, defined as S(φ0, r) := { φ convex : ℓ²(φ, φ0) ≤ r² }
Need good upper bounds on expected maxima of Gaussian processes.
Using Dudley’s theorem, we can bound the above if we can compute
the metric entropy of these local balls
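For reference, a standard form of Dudley's entropy bound (recalled here from the general theory, with C a universal constant; it is not quoted from the slides):

```latex
% Dudley's entropy bound: for a centered Gaussian process (X_t)_{t \in T}
% with canonical metric d and diameter D,
\[
  \mathbb{E}\Big[\sup_{t \in T} X_t\Big]
  \;\le\; C \int_0^{D} \sqrt{\log N(\varepsilon, T, d)}\, d\varepsilon .
\]
% Applied with T = S(\phi_0, r), bounds on \log N(\varepsilon, S(\phi_0, r), \ell)
% control the localized empirical process in the previous display.
```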
Key proof ideas - Risk upper bounds
Figure: an ε-cover of a set; the covering number is the size of the minimum ε-cover and the metric entropy is the logarithm of the covering number.
Recall: For a subset F of a metric space (X, ρ), the ε-covering number N(ε, F, ρ) is defined as the smallest number of balls of radius ε whose union contains F.
Covering numbers capture the size of the underlying set F.
Metric entropy: the logarithm of the covering number, log N(ε, F, ρ).
We need good bounds on the metric entropy log N(ε, S(φ0, r), ℓ).
Metric Entropy
Bronšteı̆n (1976) showed that the metric entropy of uniformly
Lipschitz and uniformly bounded convex functions on [0, 1]d under
the supremum metric grows as ε^{−d/2}.
But functions in these local balls are neither uniformly bounded nor
uniformly Lipschitz.
We modify Bronshtein’s result through a series of peeling arguments
to obtain the metric entropy of these balls (Dryanov (2009) and
Guntuboyina and Sen (2013a)).
Ongoing research and future work
Determine the global and local (point-wise) rates of convergence
and the asymptotic distribution of the LSE in convex regression.
Determine the optimal rates of convergence in these problems.
How do we estimate and compute other multi-dimensional shape
restricted functions, e.g., multi-dimensional monotone regression,
quasi convex functions, etc.?
Risk bounds for a general polyhedral cone (connection to
non-negative least squares)?
Investigate additive models with unknown monotone/convex
functions.
Study single index models with unknown monotone/convex links.
Collaborators
Emilio Seijo (former Ph.D. student; start-up)
Mary Meyer (Professor, Colorado State University)
Aditya Guntuboyina (Assistant Professor, UC Berkeley)
and others ...
References
Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., and Silverman, E. (1955). An empirical distribution function for sampling with
incomplete information. Ann. Math. Statist., 26:641–647.
Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1-2):113–150.
Bronšteı̆n, E. M. (1976). ε-entropy of convex sets and functions. Sibirsk. Mat. Ž., 17(3):508–514, 715.
Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist., 26:607–616.
Dryanov, D. (2009). Kolmogorov entropy for classes of convex functions. Constructive Approximation, 30:137–153.
Dümbgen, L., Freitag, S., and Jongbloed, G. (2004). Consistency of concave regression with an application to current-status data. Math.
Methods Statist., 13(1):69–81.
Groeneboom, P., Jongbloed, G., and Wellner, J. A. (2001). Estimation of a convex function: characterizations and asymptotic theory.
Ann. Statist., 29(6):1653–1698.
Guntuboyina, A. and Sen, B. (2013a). Covering numbers for convex functions. IEEE Trans. Inf. Th., 59(4):1957–1965.
Guntuboyina, A. and Sen, B. (2013b). Global risk bounds and adaptation in univariate convex regression. arXiv preprint arXiv:1305.1648.
Hanson, D. L. and Pledger, G. (1976). Consistency in concave regression. Ann. Statist., 4(6):1038–1050.
Hildreth, C. (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc., 49:598–619.
Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist., 19(2):741–759.
Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order restricted statistical inference. Wiley Series in Probability and
Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons Ltd., Chichester.
Seijo, E. and Sen, B. (2011). Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist.,
39(3):1633–1657.
Sen, B. and Meyer, M. (2013). Testing against a parametric regression function using ideas from shape restricted estimation. arXiv
preprint arXiv:1311.6849.
Thank you very much!
Questions?