Statistics meets Optimization: Fast randomized algorithms for large data sets Martin Wainwright UC Berkeley Statistics and EECS Joint work with: Martin Wainwright (UC Berkeley) Mert Pilanci UC Berkeley & Stanford University February 2017 1 / 24 What is the “big data” phenomenon? ◦ Every day: 2.5 billion gigabytes of data created ◦ Last two years: creation of 90% of the world’s data (source: IBM) How can algorithms be scaled? Massive data sets require fast algorithms but with rigorous guarantees. How can algorithms be scaled? Massive data sets require fast algorithms but with rigorous guarantees. Randomized projection is a general purpose tool: ◦ Choose a random subspace of “low” dimension m. ◦ Project data into subspace, and solve reduced dimension problem. Lower dimensional space Random projection High−dimensional space How can algorithms be scaled? Randomized projection is a general purpose tool: ◦ Choose a random subspace of “low” dimension m. ◦ Project data into subspace, and solve reduced dimension problem. Lower dimensional space Random projection High−dimensional space Widely studied and used: ◦ Johnson & Lindenstrauss (1984): in Banach/Hilbert space geometry ◦ various surveys and books: Vempala, 2004; Mahoney et al., 2011 Cormode et al., 2012. Randomized sketching for optimization DATA OPTIMIZER Randomized sketching for optimization DATA OPTIMIZER Randomized sketching for optimization DATA OPTIMIZER cost all data parameter Randomized sketching for optimization DATA OPTIMIZER cost sample all data parameter Randomized sketching for optimization DATA OPTIMIZER cost all data parameter Randomized sketching for optimization DATA OPTIMIZER 𝑤1 𝑤2 𝑤3 𝑤4 𝑤5 cost all data 𝑤6 𝑤7 𝑤8 parameter Randomized sketching for optimization DATA OPTIMIZER 𝑤1 𝑤2 𝑤3 𝑤4 𝑤5 cost all data 𝑤6 𝑤7 𝑤8 combined parameter Randomized projection for constrained least-squares ◦ Given data matrix A ∈ Rn×d , and response vector y ∈ Rn ◦ Least-squares over convex constraint set C ⊆ Rd : xLS = arg min kAx − yk22 x∈C | {z } f (Ax) Randomized projection for constrained least-squares ◦ Given data matrix A ∈ Rn×d , and response vector y ∈ Rn ◦ Least-squares over convex constraint set C ⊆ Rd : xLS = arg min kAx − yk22 x∈C | {z } f (Ax) d n A Randomized projection for constrained least-squares ◦ Given data matrix A ∈ Rn×d , and response vector y ∈ Rn ◦ Least-squares over convex constraint set C ⊆ Rd : xLS = arg min kAx − yk22 x∈C | {z } f (Ax) ◦ Randomized approximation: (Sarlos, 2006) x b = arg min kS(Ax − y)k22 x∈C ◦ Random projection matrix S ∈ Rm×n m d d = m SA S n A A general approximation-theoretic bound The randomized solution x b ∈ C provides δ-accurate cost approximation if f (AxLS ) ≤ f (Ab x) ≤ (1 + δ) f (AxLS ). A general approximation-theoretic bound The randomized solution x b ∈ C provides δ-accurate cost approximation if f (AxLS ) ≤ f (Ab x) ≤ (1 + δ) f (AxLS ). Theorem (Pilanci & W, 2015) For a broad class of random projection matrices, a sketch dimension m% effrank(A; C) δ yields δ-accurate cost approximation with exp. high probability. A general approximation-theoretic bound The randomized solution x b ∈ C provides δ-accurate cost approximation if f (AxLS ) ≤ f (Ab x) ≤ (1 + δ) f (AxLS ). Theorem (Pilanci & W, 2015) For a broad class of random projection matrices, a sketch dimension m% effrank(A; C) δ yields δ-accurate cost approximation with exp. high probability. ◦ past work on unconstrained case C = Rd : effective rank equivalent to rank(A) (Sarlos, 2006; Mahoney et al. 2011) ◦ effective rank can be much smaller than standard rank Favorable dependence on optimum x C ∆ K x LS Tangent cone K at xLS Set of feasible directions at the optimum xLS K = ∆ ∈ Rd | ∆ = t (x − xLS ) for some x ∈ C. . LS Unfavorable dependence on optimum x C K ∆ x LS Tangent cone K at xLS Set of feasible directions at the optimum xLS K = ∆ ∈ Rd | ∆ = t (x − xLS ) for some x ∈ C. . LS But what about solution approximation? x∗ = arg min kAx − yk22 x∈C x b ∈ arg min kS(Ax − y)k22 and x∈C original cost 𝜖 𝒙∗ approximate cost 𝑥ො But what about solution approximation? x∗ = arg min kAx − yk22 x∈C x b ∈ arg min kS(Ax − y)k22 and x∈C original cost 𝜖 𝒙∗ ? approximate cost 𝑥ො Failure of standard random projection ◦ Noisy observation model: y = Ax∗ + w where w ∼ N (0, σ 2 In ). Failure of standard random projection ◦ Noisy observation model: 1 y = Ax∗ + w where w ∼ N (0, σ 2 In ). Mean−squared error vs. number of samples MSE 0.1 0.01 LS 0.001 2 10 Sketch ◦ Least-squares accuracy: 10 3 number of samples 10 4 EkxLS − x∗ k2A = σ 2 rank(A) n Overcoming this barrier? Overcoming this barrier? Sequential scheme.... 1 Mean−squared error vs. number of samples MSE 0.1 0.01 0.001 2 10 LS Iterative Sketch Naive Sketch 3 10 number of samples 4 10 Iterative projection scheme yields accurate tracking of original least-squares solution. Application to Netflix data Martin Wainwright (UC Berkeley) February 2017 11 / 24 Netflix data set ◦ 2 million × 17000 matrix A of ratings (users × movies) ◦ Predict the ratings of a particular movie ◦ Least-squares regression with `2 regularization min kAx − yk22 + λkxk22 x ◦ Partition into test and training sets, solve for all values of λ ∈ {1, 2, ..., 100}. Fitting the full regularization path Test Error 8.25 Test Error Matlab Cholesky solver Conjugate Gradient Iterative Sketch 8.2 8.15 8.1 0.5 1 2 3 X: 99 Y: 1677 1692 1500 874.2 1000 500 0 2.5 Computation time with respect to regularization parameter index 2000 Computation Time 1.5 Regularization parameter : 100 values 0 0 10 1000 computation time (seconds) 20 total 30 40 50 60 70 Regularization parameter : 100 values X: 100 Y: 101.5 2000 80 90 100 Fitting the full regularization path Test Error 8.25 Test Error Matlab Cholesky solver Conjugate Gradient Iterative Sketch 8.2 8.15 8.1 0.5 1 2 2.5 3 Computation time with respect to regularization parameter index 2000 Computation Time 1.5 Regularization parameter : 100 values 1692 X: 99 Y: 1677 1500 874.2 1000 0 X: 100 Y: 101.5 101.5 500 0 0 10 20 30 40 100050 60 70 Regularization parameter : 100 values total computation time (seconds) 80 200090 100 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method O(nd) 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 2 O(nd ) 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method affine invariant 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method O(nd) 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Gradient Descent vs Newton’s Method 0 1 2 3 4 5 6 7 time (hours) 8 9 10 11 12 Exact and approximate forms of Newton’s method Minimize g(x) = f (Ax) over convex set C ⊆ Rd : xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Exact and approximate forms of Newton’s method Minimize g(x) = f (Ax) over convex set C ⊆ Rd : xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Ordinary Newton steps: xt+1 = arg min x∈C n1 2 o (x − xt )T ∇2 g(xt )(x − xt )k22 + h∇g(xt ), x − xt i , where ∇2 g(xt ) is Hessian at xt . Exact and approximate forms of Newton’s method Minimize g(x) = f (Ax) over convex set C ⊆ Rd : xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Ordinary Newton steps: xt+1 = arg min x∈C n1 2 o (x − xt )T ∇2 g(xt )(x − xt )k22 + h∇g(xt ), x − xt i , where ∇2 g(xt ) is Hessian at xt . Approximate Newton steps: ◦ various types of quasi-Newton updates: Nocedal & Wright book: Chap. 6 ◦ BFGS method; SR1 method etc. ◦ stochastic gradient + stochastic quasi-Newton (e.g., Byrd, Hansen, Nocedal & Singer 2014) Iterative sketching for general convex functions xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Iterative sketching for general convex functions xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Ordinary Newton steps: xt+1 = arg min x∈C n1 2 o k∇2 g(xt )1/2 (x − xt )k22 + h∇g(xt ), x − xt i , where ∇2 g(xt )1/2 is matrix square root Hessian at xt . Cost per step: O(nd2 ) in unconstrained case. Iterative sketching for general convex functions xopt = arg min g(x), x∈C where g : Rd → R is twice-differentiable. Ordinary Newton steps: xt+1 = arg min x∈C n1 2 o k∇2 g(xt )1/2 (x − xt )k22 + h∇g(xt ), x − xt i , where ∇2 g(xt )1/2 is matrix square root Hessian at xt . Cost per step: O(nd2 ) in unconstrained case. Sketched Newton steps: Using random sketch matrix S t : x̃t+1 = arg min x∈C n1 2 o kS t ∇2 g(xt )1/2 (x − x̃t )k22 + h∇g(x̃t ), x − x̃t i . e Cost per step: O(nd) in unconstrained case. Convergence of Newton sketch Run algorithm with sketch dimension m d on a self-concordant function g(x) = f (Ax), and data matrix A ∈ Rn×d with n d. Convergence of Newton sketch Run algorithm with sketch dimension m d on a self-concordant function g(x) = f (Ax), and data matrix A ∈ Rn×d with n d. Theorem (Pilanci & W, 2015) With probability at least 1 − c0 e−c1 m , number of iterations required for accuracy is less than c2 log(1/) where (c0 , c1 , c2 ) are universal (problem-independent) constants. Convergence of Newton sketch Run algorithm with sketch dimension m d on a self-concordant function g(x) = f (Ax), and data matrix A ∈ Rn×d with n d. Theorem (Pilanci & W, 2015) With probability at least 1 − c0 e−c1 m , number of iterations required for accuracy is less than c2 log(1/) where (c0 , c1 , c2 ) are universal (problem-independent) constants. Dependence on sample size n, dimension d; conditioning κ; and tolerance Algorithm Gradient Descent Acc. gradient Descent Newton’s Method Newton Sketch Computational cost O(κ √n d log(1/)) O( κ nd log(1/)) O(nd2 log log(1/)) e O(nd log(1/)) Convergence of Newton sketch Run algorithm with sketch dimension m d on a self-concordant function g(x) = f (Ax), and data matrix A ∈ Rn×d with n d. Theorem (Pilanci & W, 2015) With probability at least 1 − c0 e−c1 m , number of iterations required for accuracy is less than c2 log(1/) where (c0 , c1 , c2 ) are universal (problem-independent) constants. Dependence on sample size n, dimension d; conditioning κ; and tolerance Algorithm Gradient Descent Acc. gradient Descent Newton’s Method Newton Sketch Computational cost O(κ √n d log(1/)) O( κ nd log(1/)) O(nd2 log log(1/)) e O(nd log(1/)) Note: Dependence on condition number κ unavoidable among 1st-order methods (Nesterov, 2004) Logistic regression: uncorrelated features Optimality vs. time 5 Optimality gap 10 0 10 Newton −5 Grad. desc 10 Newton Sketch −10 10 0 1 2 3 4 5 Wall clock time 6 7 Sample size n = 500, 000 with d = 5, 000 features 8 Logistic regression: correlated features Optimality vs. time 5 Optimality gap 10 0 10 Newton Grad. desc −5 10 Newton Sketch −10 10 0 1 2 3 4 5 Wall clock time 6 7 Sample size n = 500, 000 with d = 5, 000 features 8 Consequences for linear programming ◦ LP in standard form: min cT x Ax≤b Martin Wainwright (UC Berkeley) where A ∈ Rn×d , b ∈ Rn and c ∈ Rd . February 2017 18 / 24 Consequences for linear programming ◦ LP in standard form: min cT x Ax≤b where A ∈ Rn×d , b ∈ Rn and c ∈ Rd . ◦ interior point methods for LP solving: based on unconstrained sequence n n o X xµ := arg min µcT x − log(bi − aTi x) . x∈Rd Martin Wainwright (UC Berkeley) i=1 February 2017 18 / 24 Consequences for linear programming ◦ LP in standard form: min cT x Ax≤b where A ∈ Rn×d , b ∈ Rn and c ∈ Rd . ◦ interior point methods for LP solving: based on unconstrained sequence n n o X xµ := arg min µcT x − log(bi − aTi x) . x∈Rd i=1 ◦ as parameter µ → +∞, the path xµ approaches an optimal solution x∗ from the interior Martin Wainwright (UC Berkeley) February 2017 18 / 24 Standard central path min cT x Ax≤b Exact Newton µc x − T n X i=1 c log(bi − aiT x) Newton sketch follows central path min cT x Ax≤b Exact Newton Newton Sketch µc x − T n X i=1 c log(bi − aiT x) Linear Programs Consequence: An LP with n constraints and d variables can be solved in ≈ O(nd) time when n d. Martin Wainwright (UC Berkeley) February 2017 20 / 24 Performance compared to CPLEX Sample size n = 10, 000 Dimensions d = 1, 2, ..., 500 Computation Time (Seconds) Random ensembles of linear programs 100 CPLEX Newton Sketch 80 60 40 20 0 0 50 100 150 200 d 250 300 350 400 450 CODE: eecs.berkeley.edu/~mert/LP.zip Martin Wainwright (UC Berkeley) February 2017 21 / 24 Summary ◦ high-dimensional data: challenges and opportunities ◦ optimization at large scales: • Need fast methods... • But approximate answers are OK • randomized algorithms (with strong control) are useful ◦ this talk: ◦ the power of random projection ◦ information-theoretic analysis reveals deficiency of classical sketch ◦ Newton sketch: a fast and randomized Newton-type method Summary ◦ high-dimensional data: challenges and opportunities ◦ optimization at large scales: • Need fast methods... • But approximate answers are OK • randomized algorithms (with strong control) are useful ◦ this talk: ◦ the power of random projection ◦ information-theoretic analysis reveals deficiency of classical sketch ◦ Newton sketch: a fast and randomized Newton-type method Papers/pre-prints: ◦ Pilanci & W. (2015): Randomized sketches of convex programs with sharp guarantees, IEEE Transactions on Information Theory ◦ Pilanci & W. (2016a): Iterative Hessian Sketch: Fast and accurate solution approximation for constrained least-squares, Journal of Machine Learning Research ◦ Pilanci & W. (2016b): Newton Sketch: A linear-time optimization algorithm with linear-quadratic convergence. To appear in SIAM Journal of Optimization. Gaussian width of transformed tangent cone Gaussian width of set AK ∩ S n−1 = {A∆ | ∆ ∈ K, kA∆k2 = 1} h W(AK) := E where g ∼ N (0, In×n ). sup z∈AK∩S n−1 hg, zi i Gaussian width of transformed tangent cone Gaussian width of set AK ∩ S n−1 = {A∆ | ∆ ∈ K, kA∆k2 = 1} h W(AK) := E sup z∈AK∩S n−1 hg, zi i where g ∼ N (0, In×n ). Gaussian widths used in many areas: ◦ Banach space theory: Pisier, 1986, Gordon 1988 ◦ Empirical process theory: Ledoux & Talagrand, 1991, Bartlett et al., 2002 ◦ Geometric analysis, compressed sensing: Mendelson, Pajor & Tomczak-Jaegermann, 2007 Fast Johnson-Lindenstrauss sketch Step 1: Choose some fixed orthonormal matrix H ∈ Rn×n . Example: Hadamard matrices 1 1 1 H2 = √ H2t = H2 ⊗ H2 ⊗ · · · ⊗ H2 {z } | 2 1 −1 Kronecker product t times (E.g., Ailon & Liberty, 2010) Fast Johnson-Lindenstrauss sketch Step 1: Choose some fixed orthonormal matrix H ∈ Rn×n . Example: Hadamard matrices 1 1 1 H2 = √ H2t = H2 ⊗ H2 ⊗ · · · ⊗ H2 {z } | 2 1 −1 Kronecker product t times = Sy e H D y Step 2: (A) Multiply data vector y with a diagonal matrix of random signs {−1, +1} e ∈ Rm×n (B) Choose m rows of H to form sub-sampled matrix H e Dy. (C) Requires O(n log m) time to compute sketched vector Sy = H (E.g., Ailon & Liberty, 2010)
© Copyright 2025 Paperzz