Nonparametric Convex Regression¹

Bodhisattva Sen
Department of Statistics, Columbia University, New York
[email protected]

University of North Carolina, Chapel Hill
4 April, 2014

¹ Supported by NSF grant DMS-1150435 (CAREER)

Outline
1 Nonparametric convex regression: Introduction
2 Testing against a linear model
3 Theoretical results: Risk bounds

1 Nonparametric convex regression: Introduction

What is convex regression?
- Regression model of the form
    Y_i = φ0(X_i) + ε_i,  for i = 1, …, n,
  where
  - φ0 : R^d → R, d ≥ 1, is an unknown convex function,
  - X_i is an X-valued vector (X ⊂ R^d is a closed convex set),
  - ε_i is a random variable with E(ε_i | X_i) = 0 and Var(ε_i) < ∞.
- Goal: estimate φ0 nonparametrically, assuming only convexity.

What is a convex function?
- f : R^d → R is convex iff for all X_1, X_2 ∈ R^d and t ∈ [0, 1],
    f(t X_1 + (1 − t) X_2) ≤ t f(X_1) + (1 − t) f(X_2).
- Arises frequently in economics, operations research, and financial engineering; e.g., production functions are known to be concave and increasing.
- Approximating a function (objective function, constraint) for a convex optimization problem.
- Convexity is probably the simplest shape constraint that has a natural generalization to multiple dimensions.

Two further applications of the method
- φ0 is affine (i.e., φ0(X) = α + βᵀX) iff φ0 is both convex and concave.
  - Test the goodness-of-fit of a linear model (Sen and Meyer (2013)).
- Any smooth function on a compact domain can be represented as the difference of two convex functions.
  - Estimate any smooth function (work in progress).

Convex regression when d = 1
- Model: Y_i = φ0(x_i) + ε_i, for i = 1, 2, …, n.
- φ0 : R → R is an unknown convex (or concave) function.
- x_1 < x_2 < ⋯ < x_n are known constants; ε_1, …, ε_n are mean-0 errors.
- Want to estimate φ0 nonparametrically, given that φ0 is convex/concave.

[Figures: two motivating scatterplots — the age/income data (age vs. log(income)) and the Belgian firms data (labour vs. log(output)).]

The estimation procedure: least squares
- Model: Y_i = φ0(X_i) + ε_i for i = 1, …, n; X_i ∈ R^d; d ≥ 1.
- We estimate φ0 via the method of least squares:
    φ̂ = argmin_{ψ : R^d → R convex} Σ_{i=1}^n {Y_i − ψ(X_i)}².
- θ̂ = (φ̂(X_1), …, φ̂(X_n)) is unique (why? next slide …).
- Long history: Hildreth (1954) …
- Fully automated; does not require the choice of tuning parameter(s).

Sequence model
- Y_i = φ0(X_i) + ε_i, i = 1, 2, …, n, is equivalent to Y = θ0 + ε, where
    Y = (Y_1, …, Y_n)ᵀ,  θ0 = (φ0(X_1), …, φ0(X_n))ᵀ,  ε = (ε_1, …, ε_n)ᵀ.
- min_{ψ convex} Σ_{i=1}^n {Y_i − ψ(X_i)}² ≡ min_{θ ∈ K} Σ_{i=1}^n (Y_i − θ_i)², where
    K = {θ ∈ R^n : ∃ ψ convex s.t. ψ(X_i) = θ_i, ∀ i = 1, …, n}.
- θ̂ is the projection of Y onto K, i.e.,
    θ̂ := Π(Y|K) = argmin_{θ ∈ K} ‖Y − θ‖².
- K is a closed convex cone in R^n (T is a convex cone iff α_1 θ_1 + α_2 θ_2 ∈ T for all α_1, α_2 > 0 and θ_1, θ_2 ∈ T).
- The projection theorem gives the existence and uniqueness of θ̂.

Computation when d = 1
- K has a straightforward characterization when d = 1:
    K = {θ ∈ R^n : (θ_2 − θ_1)/(x_2 − x_1) ≤ (θ_3 − θ_2)/(x_3 − x_2) ≤ ⋯ ≤ (θ_n − θ_{n−1})/(x_n − x_{n−1})}.
- Quadratic program with (n − 2) linear constraints (a solver sketch follows below).
- φ̂ ends up being a piecewise linear function constructed by linearly interpolating the pairs (x_1, θ̂_1), …, (x_n, θ̂_n).
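Not part of the talk: a minimal sketch of this (n − 2)-constraint quadratic program in Python with cvxpy (the package choice and the function name convex_lse_1d are my own; the slides only mention MOSEK/cvx as possible solvers).

```python
# Sketch: d = 1 convex least squares as a QP over the cone K, i.e.
# successive slopes of the fitted values must be nondecreasing.
import numpy as np
import cvxpy as cp

def convex_lse_1d(x, y):
    """Return theta_hat = (phi_hat(x_1), ..., phi_hat(x_n)).

    phi_hat itself is the piecewise-linear interpolant of the pairs
    (x_i, theta_hat_i).  Assumes x is sorted with distinct entries.
    """
    theta = cp.Variable(len(x))
    # slope_i = (theta_{i+1} - theta_i) / (x_{i+1} - x_i), i = 1, ..., n-1
    slopes = cp.multiply(1.0 / np.diff(x), cp.diff(theta))
    # convexity <=> nondecreasing slopes: the (n - 2) linear constraints
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)),
                      [cp.diff(slopes) >= 0])
    prob.solve()
    return theta.value

# Toy usage: convex truth phi_0(x) = (x - 1/2)^2 with Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = (x - 0.5) ** 2 + rng.normal(scale=0.1, size=x.size)
theta_hat = convex_lse_1d(x, y)
```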
What happens when d > 1?
- If f is differentiable, then f is convex iff
    f(x) + ∇f(x)ᵀ(y − x) ≤ f(y),  for all x, y ∈ R^d,
  where ∇f(x) is the gradient of f at x.
- The first-order Taylor approximation is a global under-estimator of f.
- In fact, for every convex function f and every x ∈ R^d there exists ξ_x ∈ R^d such that
    f(x) + ξ_xᵀ(y − x) ≤ f(y),  for all y ∈ R^d.
  ξ_x is called a sub-gradient at x.

Sub-differentials and sub-gradients
- The sub-differential of f at x ∈ R^d is given by
    ∂f(x) := {ξ ∈ R^d : f(x) + ξᵀ(y − x) ≤ f(y) ∀ y ∈ R^d}.
- The elements of the sub-differential are called sub-gradients.
- Example: if f(x) = ‖x‖, then ∂f(0) = {ξ ∈ R^d : ‖ξ‖ ≤ 1}.
- f is differentiable at x iff ∂f(x) = {∇f(x)}.

Computation of θ̂
- The vector θ̂ can now be computed by solving the following quadratic program with linear constraints:
    min ‖Y − θ‖²
    subject to θ_i + ξ_iᵀ(X_j − X_i) ≤ θ_j,  ∀ i, j = 1, …, n,
    over ξ_1, …, ξ_n ∈ R^d, θ ∈ R^n.
- This problem has a unique solution in θ but not in the ξ_i's (i.e., for any two solutions (ξ_1, …, ξ_n, θ) and (τ_1, …, τ_n, ζ) we have θ = ζ).
- Quadratic program with n(n − 1) linear constraints and n(d + 1) variables.
- Can use MOSEK, cvx or any off-the-shelf solver for n ∼ 200.
- Scalable algorithm? (work in progress; can handle n ∼ 5000)
- Estimator at a point X:
    φ̂(X) = max_{i=1,…,n} {θ̂_i + (X − X_i)ᵀ ξ̂_i}.
- φ̂ is a polyhedral convex function. (A solver sketch follows below.)
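Again a sketch of my own, not the authors' code: the subgradient-constrained QP above in cvxpy, together with the max-of-affine evaluation rule; as the slide notes, generic solvers handle n up to a few hundred.

```python
# Sketch: multivariate convex LSE via the n(n-1)-constraint QP,
# plus evaluation of the fitted function at a new point.
import numpy as np
import cvxpy as cp

def convex_lse(X, y):
    """X: (n, d) design matrix, y: (n,) responses.

    Returns (theta_hat, xi_hat); theta_hat is unique, while the
    subgradients xi_hat need not be.
    """
    n, d = X.shape
    theta = cp.Variable(n)
    xi = cp.Variable((n, d))  # one candidate subgradient per design point
    constraints = [theta[i] + (X[j] - X[i]) @ xi[i] <= theta[j]
                   for i in range(n) for j in range(n) if j != i]
    cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints).solve()
    return theta.value, xi.value

def phi_hat(x, X, theta_hat, xi_hat):
    """phi_hat(x) = max_i { theta_hat_i + (x - X_i)' xi_hat_i }."""
    return np.max(theta_hat + np.sum((x - X) * xi_hat, axis=1))
```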
[Figure: scatter plots and the LSE for (a) φ0(X) = ‖X‖² (left panel) and (b) φ0(X) = −X_1 + X_2 (right panel); in both cases n = 256 and ε ∼ N(0, 1).]

Belgium labour firms example (MATLAB figures)

Consistency
- The LSE is consistent in fixed and stochastic design settings.
- Consistency theorem (Seijo and Sen (2011)). Under appropriate conditions we have:
  (i) P(sup_{X ∈ 𝒳} |φ̂(X) − φ0(X)| → 0 for any compact set 𝒳 ⊂ X°) = 1.
  (ii) Denoting by B the unit ball (w.r.t. the Euclidean norm), we have
       P(∂φ̂(X) ⊂ ∂φ0(X) + εB almost always) = 1,  ∀ ε > 0, ∀ X ∈ X°.
  (iii) If φ0 is differentiable on X°, then
       P(sup_{X ∈ 𝒳} sup_{ξ ∈ ∂φ̂(X)} ‖ξ − ∇φ0(X)‖ → 0 for any compact set 𝒳 ⊂ X°) = 1.

2 Testing against a linear model

Testing against a linear model
- Let Y := (Y_1, Y_2, …, Y_n) ∈ R^n and Y ∼ N(θ0, σ²I_n), σ² > 0.
- Problem: test H0 : θ0 ∈ S, where
    S := {θ ∈ R^n : θ = Xβ, β ∈ R^k, for some k ≥ 1}
  and X is the design matrix.
- Important problem with a long history.
- Most methods fit the parametric and a nonparametric model to estimate the regression function, and then compare the two fits to come up with a test statistic.

Drawbacks of most methods
- They involve the choice of delicate tuning parameters (e.g., bandwidths).
- Calibration (i.e., the level) of the test is usually done by resampling the test statistic (bootstrap), and is thus only approximate.

We develop a likelihood ratio test for H0 using ideas from shape restricted regression.
- Fully automated; no tuning parameters required.
- The test has exact level.
- The power of the test goes to 1 under (almost) all alternatives.

Testing against linear regression
- Y_i = φ0(X_i) + ε_i, i = 1, 2, …, n.
- φ0 : R^d → R, d ≥ 1, is an unknown smooth function.
- Design points X_1, X_2, …, X_n in R^d; ε_1, …, ε_n are i.i.d. N(0, σ²).
- Test H0: φ0 is affine, i.e., φ0(X) = a + bᵀX, for some unknown a ∈ R, b ∈ R^d.
- Thus Y ∼ N(θ0, σ²I_n) where θ0 := (φ0(X_1), …, φ0(X_n))ᵀ ∈ R^n.
- Test H0 : θ0 ∈ S where S := {θ ∈ R^n : θ = Xβ, β ∈ R^{d+1}}, and X = [e | x] with x = (X_1, …, X_n)ᵀ ∈ R^{n×d} and e = (1, …, 1)ᵀ ∈ R^n.

Idea
- A function is affine iff it is both convex and concave.
- Let I = {θ ∈ R^n : ∃ ψ convex s.t. ψ(X_i) = θ_i, ∀ i = 1, …, n}.
- Let D = −I = {θ ∈ R^n : ∃ ψ concave s.t. ψ(X_i) = θ_i}.
- I, D are cones that contain S.
- If H0 is true, both Π(Y|I) and Π(Y|D) must be close to Π(Y|S).

[Figure: constant (dotted blue), convex (dashed red) and concave (solid black) fits to scatterplots generated from Y = φ0(x) + ε with independent standard normal errors; left panel φ0(x) = 4 − 6x(1 − x), right panel φ0(x) = sin(3πx).]

Method
- Model: Y ∼ N(θ0, σ²I_n).
- Consider testing H0 : θ0 ∈ S versus H1 : θ0 ∈ (I ∪ D) \ S.
- S is a linear subspace; I ⊂ R^n is a closed convex cone containing S; D = −I; I ∩ D = S.
- The log-likelihood function (up to a constant) is
    ℓ(θ, σ²) = −‖Y − θ‖²/(2σ²) − (n/2) log σ².
- The likelihood ratio statistic is
    Λ = 2 [ max_{θ ∈ I ∪ D, σ² > 0} ℓ(θ, σ²) − max_{θ ∈ S, σ² > 0} ℓ(θ, σ²) ].
- An equivalent test is to reject H0 if
    T(Y) := max{‖θ̂_S − θ̂_I‖², ‖θ̂_S − θ̂_D‖²} / ‖Y − θ̂_S‖²
  is large, where θ̂_I = Π(Y|I), θ̂_D = Π(Y|D) and θ̂_S = Π(Y|S).

Questions
- How do we find the critical value of the test?
- Does the test have power against all alternatives?

How to find the critical value of the test?
- Result: the distribution of T is invariant to translations in S.
- Lemma (Sen and Meyer (2013)). For any s ∈ S and z ∈ R^n, T(z + s) = T(z). As a consequence, for any θ0 ∈ S,
    T(Y) =_d T(Z),
  where Y ∼ N(θ0, σ²I_n) and Z ∼ N(0, I_n).
- Proof: for z ∈ R^n and s ∈ S, Π(z + s|I) = Π(z|I) + s and Π(z + s|S) = Π(z|S) + s. Note that Y = σZ + θ0, where Z ∼ N(0, I_n).
- The test statistic T is pivotal under H0.
- Suppose T(Z) ∼ F. Then we reject H0 if T(Y) > F⁻¹(1 − α), where α ∈ (0, 1) is the desired level of the test.
- The distribution F can be approximated, up to any desired precision, by Monte Carlo simulations (a calibration sketch follows at the end of this section).
- The test procedure has exact level α. The test is also unbiased.
- Theorem (Sen and Meyer (2013)). Under mild conditions, the power of the test converges to 1.

Example: Testing against simple linear regression
- Y_i = φ0(x_i) + ε_i; 0 ≤ x_1 < ⋯ < x_n ≤ 1; ε_1, …, ε_n i.i.d. N(0, σ²).
- φ0 : [0, 1] → R is an unknown smooth function.
- Test H0: φ0 is an affine function.
- Recall
    T(Y) = max{(1/n)‖θ̂_S − θ̂_I‖², (1/n)‖θ̂_S − θ̂_D‖²} / ((1/n)‖Y − θ̂_S‖²).
- Under H0, T = O_p((log n)^{5/4}/n). Otherwise, T →_p c for some c > 0.
- Thus P_{φ0}(T > F⁻¹(1 − α)) → 1 if φ0 ∉ H0.
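Since T is pivotal under H0, its null distribution F can be simulated to any precision. Below is a sketch of this calibration step, reusing the hypothetical convex_lse solver above; Π(·|D) is obtained by fitting −Y with the convex solver and negating, and P_S is the usual hat matrix of the design [e | x].

```python
# Sketch: Monte Carlo calibration of the pivotal statistic T under H0.
import numpy as np

def test_statistic(y, X, P_S, fit_convex):
    """T(y) = max(||th_S - th_I||^2, ||th_S - th_D||^2) / ||y - th_S||^2."""
    th_S = P_S @ y                     # projection onto the linear model S
    th_I, _ = fit_convex(X, y)         # projection onto the convex cone I
    th_D = -fit_convex(X, -y)[0]       # D = -I: concave fit = -(convex fit of -y)
    num = max(np.sum((th_S - th_I) ** 2), np.sum((th_S - th_D) ** 2))
    return num / np.sum((y - th_S) ** 2)

def critical_value(X, P_S, fit_convex, alpha=0.05, B=1000, seed=0):
    """Approximate F^{-1}(1 - alpha) using simulated T(Z), Z ~ N(0, I_n)."""
    rng = np.random.default_rng(seed)
    sims = [test_statistic(rng.standard_normal(X.shape[0]), X, P_S, fit_convex)
            for _ in range(B)]
    return np.quantile(sims, 1.0 - alpha)
```

Rejecting H0 when T(Y) exceeds the simulated quantile gives, by the invariance lemma above, a test of exact level α up to Monte Carlo error.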
3 Theoretical results: Risk bounds

Convex regression in 1-dimension
- φ0 : [0, 1] → R is an unknown convex function.
- Observe Y_i = φ0(x_i) + ε_i, for i = 1, 2, …, n.
- x_1 < x_2 < ⋯ < x_n are fixed uniform grid points in [0, 1].
- ε_1, …, ε_n are i.i.d. N(0, σ²), σ² > 0.
- Want to study the LSE φ̂_ls of φ0, i.e.,
    φ̂_ls = argmin_{ψ convex} Σ_{i=1}^n {Y_i − ψ(x_i)}².
- θ̂ = (φ̂_ls(x_1), …, φ̂_ls(x_n)) is unique.

Theoretical results: Convex LSE when d = 1
- Known results:
  - Asymptotic behavior on interior compacta: consistency (Hanson and Pledger (1976)); rate of convergence (Dümbgen et al. (2004)).
  - Asymptotic behavior at a point: rate of convergence (Mammen (1991)); limiting distribution (Groeneboom et al. (2001)).
- Not much is known about the global behavior of the LSE.
- Natural global loss function:
    ℓ²(ψ, φ) = (1/n) Σ_{i=1}^n {ψ(x_i) − φ(x_i)}².

Worst case risk upper bound
- Let θ0 = (φ0(x_1), …, φ0(x_n)) and θ̂ = (φ̂_ls(x_1), …, φ̂_ls(x_n)).
- For every φ0, the risk
    E_{φ0} ℓ²(φ̂_ls, φ0) = (1/n) E_{θ0} ‖θ̂ − θ0‖²
  is bounded from above by n^{−4/5} up to logarithmic factors in n.
- Risk upper bound (Guntuboyina and Sen (2013b)): we have
    E_{φ0} ℓ²(φ̂_ls, φ0) ≲ n^{−4/5} log n.
- No smoothness assumptions on φ0 are necessary; only convexity is needed.
- Finite sample result; holds for every n.

Question: Can E_{φ0} ℓ²(φ̂_ls, φ0) be much smaller than n^{−4/5}? (1)
- NO: if φ0'' exists and is bounded from above and below by positive constants on a subinterval of (0, 1), then, in a very strong sense, no estimator (not just the LSE) can have risk smaller than n^{−4/5}.
- The local minimax risk at φ0:
    R_n(φ0) := inf_{φ̂} sup_{φ ∈ N(φ0)} E_φ ℓ²(φ, φ̂),
  where N(φ0) := {φ convex : ‖φ − φ0‖_∞ ≤ n^{−2/5}}.
- Local minimax lower bound (Guntuboyina and Sen (2013b)):
    R_n(φ0) ≳ n^{−4/5}.

Question: Can E_{φ0} ℓ²(φ̂_ls, φ0) be much smaller than n^{−4/5}? (2)
- YES: when φ0 is a piecewise affine convex function.
- Adaptation: suppose φ0 has k affine pieces; then
    E_{φ0} ℓ²(φ̂_ls, φ0) ≲ k (log n)^{5/4} / n.
- Parametric rate up to logarithmic factors.
- Surprising, as the LSE minimizes the LS criterion over all convex functions.

[Figure: LSE fits for φ0(x) = x² (left panel) and φ0(x) = |x| (right panel).]

Key proof ideas - Risk upper bounds
- The convex LSE is an empirical risk minimization (ERM) procedure (Birgé and Massart (1993); Van de Geer; Van der Vaart and Wellner; Massart; Koltchinskii).
- The risk behavior of φ̂_ls is determined by
    E [ sup_{φ convex : ℓ²(φ, φ0) ≤ r²} Σ_{i=1}^n ε_i {φ(x_i) − φ0(x_i)} ],
  a supremum over local balls of convex functions, defined as
    S(φ0, r) := {φ convex : ℓ²(φ, φ0) ≤ r²}.
- Need good upper bounds on expected maxima of Gaussian processes. Using Dudley's theorem (recalled below), we can bound the above if we can compute the metric entropy of these local balls.
- Recall: for a subset F of a metric space (𝒳, ρ), the ε-covering number N(ε, F, ρ) is defined as the smallest number of balls of radius ε whose union contains F. Covering numbers capture the size of the underlying set F.
- Metric entropy: the logarithm of the covering number, log N(ε, F, ρ). It can be read as the number of bits needed to identify an element of F up to precision ε. (Covering and proper covering numbers are related by N(ε, F) ≤ N_proper(ε, F) ≤ N(ε/2, F).)
- We need good bounds on the metric entropy log N(ε, S(φ0, r), ℓ).

[Figure: an ε-cover of a set; the covering number is the size of a minimal ε-cover, and the entropy is the logarithm of the covering number.]

Metric entropy
- Bronšteĭn (1976) showed that the metric entropy of uniformly Lipschitz and uniformly bounded convex functions on [0, 1]^d under the supremum metric grows as ε^{−d/2}.
- But functions in these local balls are neither uniformly bounded nor uniformly Lipschitz.
- We modify Bronshtein's result through a series of peeling arguments to obtain the metric entropy of these balls (Dryanov (2009); Guntuboyina and Sen (2013a)).
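For context (standard material, not spelled out on the slides), here is Dudley's entropy bound and a heuristic gloss, mine, of how the n^{−4/5} rate falls out of the entropy calculation:

```latex
% Dudley's entropy bound: for a centered Gaussian process (X_t)_{t \in T}
% with canonical metric d,
\[
  \mathbb{E}\Bigl[\sup_{t \in T} X_t\Bigr]
  \;\lesssim\; \int_0^{\operatorname{diam}(T)}
  \sqrt{\log N(\varepsilon, T, d)}\, d\varepsilon .
\]
% Heuristic: with a Bronshtein-type entropy bound
% \log N(\varepsilon, S(\phi_0, r), \ell) \lesssim \varepsilon^{-1/2} (d = 1),
% the integral is of order
% \int_0^r \varepsilon^{-1/4}\, d\varepsilon \asymp r^{3/4},
% so the expected supremum over S(\phi_0, r) is of order \sqrt{n}\, r^{3/4}.
% Balancing this against the quadratic term n r^2 gives
% n r^2 \asymp \sqrt{n}\, r^{3/4}, i.e. r \asymp n^{-2/5},
% recovering the n^{-4/5} risk rate up to logarithmic factors.
```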
Ongoing research and future work
- Determine the global and local (point-wise) rates of convergence and the asymptotic distribution of the LSE in convex regression.
- Determine the optimal rates of convergence in these problems.
- How do we estimate and compute other multi-dimensional shape restricted functions, e.g., multi-dimensional monotone regression, quasi-convex functions, etc.?
- Risk bounds for a general polyhedral cone (connection to non-negative least squares)?
- Investigate additive models with unknown monotone/convex functions.
- Study single index models with unknown monotone/convex links.

Collaborators
- Emilio Seijo, former Ph.D. student, now at a start-up.
- Mary Meyer, Professor, Colorado State University.
- Aditya Guntuboyina, Assistant Professor, UC Berkeley.
- and others …

References
- Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Statist., 26:641-647.
- Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1-2):113-150.
- Bronšteĭn, E. M. (1976). ε-entropy of convex sets and functions. Sibirsk. Mat. Ž., 17(3):508-514, 715.
- Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist., 26:607-616.
- Dryanov, D. (2009). Kolmogorov entropy for classes of convex functions. Constr. Approx., 30:137-153.
- Dümbgen, L., Freitag, S., and Jongbloed, G. (2004). Consistency of concave regression with an application to current-status data. Math. Methods Statist., 13(1):69-81.
- Groeneboom, P., Jongbloed, G., and Wellner, J. A. (2001). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist., 29(6):1653-1698.
- Guntuboyina, A. and Sen, B. (2013a). Covering numbers for convex functions. IEEE Trans. Inf. Theory, 59(4):1957-1965.
- Guntuboyina, A. and Sen, B. (2013b). Global risk bounds and adaptation in univariate convex regression. arXiv preprint arXiv:1305.1648.
- Hanson, D. L. and Pledger, G. (1976). Consistency in concave regression. Ann. Statist., 4(6):1038-1050.
- Hildreth, C. (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc., 49:598-619.
- Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist., 19(2):741-759.
- Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order Restricted Statistical Inference. John Wiley & Sons, Chichester.
- Seijo, E. and Sen, B. (2011). Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist., 39(3):1633-1657.
- Sen, B. and Meyer, M. (2013). Testing against a parametric regression function using ideas from shape restricted estimation. arXiv preprint arXiv:1311.6849.

Thank you very much! Questions?