Statistics meets Optimization:
Fast randomized algorithms for large data sets
Martin Wainwright
UC Berkeley
Statistics and EECS
Joint work with:
Mert Pilanci (UC Berkeley & Stanford University)

February 2017
What is the “big data” phenomenon?
◦ Every day: 2.5 billion gigabytes of data created
◦ Last two years: 90% of the world's data was created
(source: IBM)
How can algorithms be scaled?
Massive data sets require fast algorithms but with rigorous guarantees.
Randomized projection is a general purpose tool:
◦ Choose a random subspace of “low” dimension m.
◦ Project data into subspace, and solve reduced dimension problem.
[Figure: a random projection maps data from a high-dimensional space into a lower-dimensional space.]
Widely studied and used:
◦ Johnson & Lindenstrauss (1984): in Banach/Hilbert space geometry
◦ various surveys and books: Vempala, 2004; Mahoney et al., 2011; Cormode et al., 2012
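As a quick numerical illustration of the random-projection idea (a minimal sketch; the dimensions and the Gaussian projection are assumptions, not tied to any particular result above), pairwise distances approximately survive projection to a much lower dimension:

```python
# Johnson-Lindenstrauss in action: project 100 points from dimension 10,000
# down to dimension 200 and compare a pairwise distance before and after.
import numpy as np

rng = np.random.default_rng(0)
n_points, dim, m = 100, 10_000, 200

X = rng.standard_normal((n_points, dim))        # points in high dimension
P = rng.standard_normal((dim, m)) / np.sqrt(m)  # random projection to R^m
Y = X @ P

i, j = 0, 1
d_high = np.linalg.norm(X[i] - X[j])
d_low = np.linalg.norm(Y[i] - Y[j])
print(d_low / d_high)                           # close to 1 with high probability
```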
Randomized sketching for optimization

[Schematic: an OPTIMIZER repeatedly sends a parameter to the DATA and receives a cost in return. The cost may be evaluated over all data, over a random sample, or via several randomized sketches w_1, ..., w_8 whose answers are then combined.]
Randomized projection for constrained least-squares
◦ Given data matrix A ∈ R^{n×d}, and response vector y ∈ R^n
◦ Least-squares over convex constraint set C ⊆ R^d:

    x_{LS} = \arg\min_{x \in C} \underbrace{\|Ax - y\|_2^2}_{f(Ax)}
◦ Randomized approximation (Sarlos, 2006):

    \hat{x} = \arg\min_{x \in C} \|S(Ax - y)\|_2^2

◦ Random projection matrix S ∈ R^{m×n}, with m ≪ n, so that SA ∈ R^{m×d} is a much smaller matrix.

[Schematic: the m×n sketch matrix S applied to the n×d data matrix A yields the m×d sketched matrix SA.]
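The sketch-and-solve recipe above can be tried in a few lines. Here is a minimal unconstrained example (C = R^d); the synthetic data, the Gaussian sketch matrix, and the choice of sketch dimension are assumptions made for illustration:

```python
# Classical "sketch-and-solve" for unconstrained least squares:
# solve min_x ||S(Ax - y)||_2^2 with a Gaussian sketch S in R^{m x n}.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 50, 500          # sample size, dimension, sketch dimension

A = rng.standard_normal((n, d))    # data matrix
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Exact least-squares solution (for comparison)
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gaussian sketch; scaling by 1/sqrt(m) gives E[S^T S] = I_n
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

# Cost approximation: f(A x_sketch) should be within (1 + delta) of f(A x_ls)
f = lambda x: np.linalg.norm(A @ x - y) ** 2
print(f(x_sketch) / f(x_ls))       # typically close to 1 when m is well above d
```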
A general approximation-theoretic bound

The randomized solution \hat{x} ∈ C provides a δ-accurate cost approximation if

    f(Ax_{LS}) \le f(A\hat{x}) \le (1 + \delta)\, f(Ax_{LS}).

Theorem (Pilanci & W, 2015)
For a broad class of random projection matrices, a sketch dimension

    m \gtrsim \frac{\mathrm{effrank}(A; C)}{\delta}

yields δ-accurate cost approximation with exponentially high probability.

◦ past work on the unconstrained case C = R^d: effective rank equivalent to rank(A) (Sarlos, 2006; Mahoney et al., 2011)
◦ effective rank can be much smaller than the standard rank
Favorable dependence on optimum x_{LS}

[Figure: constraint set C, optimum x_{LS}, and a narrow tangent cone K of feasible directions ∆.]

Tangent cone K at x_{LS}: the set of feasible directions at the optimum x_{LS},

    K = \{ \Delta \in R^d \mid \Delta = t (x - x_{LS}) \text{ for some } t \ge 0 \text{ and } x \in C \}.
Unfavorable dependence on optimum x_{LS}

[Figure: constraint set C, optimum x_{LS}, and a wide tangent cone K of feasible directions ∆.]

Tangent cone K at x_{LS}: the set of feasible directions at the optimum x_{LS},

    K = \{ \Delta \in R^d \mid \Delta = t (x - x_{LS}) \text{ for some } t \ge 0 \text{ and } x \in C \}.
But what about solution approximation?

    x^* = \arg\min_{x \in C} \|Ax - y\|_2^2   and   \hat{x} \in \arg\min_{x \in C} \|S(Ax - y)\|_2^2

[Figure: the original cost with minimizer x^*, the approximate (sketched) cost with minimizer \hat{x}, and the question of whether the two minimizers lie within ε of each other.]
Failure of standard random projection
◦ Noisy observation model:
    y = Ax^* + w, where w \sim N(0, \sigma^2 I_n).
[Plot: mean-squared error versus number of samples for the least-squares solution (LS) and the naive sketch.]

◦ Least-squares accuracy:

    \mathbb{E}\|x_{LS} - x^*\|_A^2 = \frac{\sigma^2 \, \mathrm{rank}(A)}{n}
Overcoming this barrier? Sequential scheme....

[Plot: mean-squared error versus number of samples for LS, the iterative sketch, and the naive sketch.]
Iterative projection scheme yields accurate tracking of original least-squares
solution.
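A minimal, unconstrained version of the iterative (Hessian) sketch is sketched below; the synthetic data, sketch dimension, and iteration count are assumptions, and the constrained case would replace the closed-form update with a small constrained quadratic program:

```python
# Iterative Hessian sketch for least squares: at each iteration, draw a fresh
# sketch, approximate the Hessian A^T A by (SA)^T (SA), and take a Newton-like
# step using the exact gradient.
import numpy as np

def iterative_hessian_sketch(A, y, m, n_iter=10, rng=None):
    """Approximate argmin_x ||Ax - y||_2^2 using sketched Hessians."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        S = rng.standard_normal((m, n)) / np.sqrt(m)   # fresh sketch each iteration
        SA = S @ A                                     # m x d (a fast sketch would make this cheaper)
        grad = A.T @ (y - A @ x)                       # exact gradient term
        delta = np.linalg.solve(SA.T @ SA, grad)       # solve (A^T S^T S A) delta = grad
        x = x + delta
    return x

# Usage: error to the exact least-squares solution decays geometrically
rng = np.random.default_rng(1)
n, d, m = 20_000, 100, 800
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
x_ihs = iterative_hessian_sketch(A, y, m, n_iter=10)
print(np.linalg.norm(x_ihs - x_ls) / np.linalg.norm(x_ls))
```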
Application to Netflix data
Netflix data set
◦ 2 million × 17,000 matrix A of ratings (users × movies)
◦ Predict the ratings of a particular movie
◦ Least-squares regression with ℓ_2 regularization:

    \min_x \|Ax - y\|_2^2 + \lambda \|x\|_2^2

◦ Partition into test and training sets, and solve for all values of λ ∈ {1, 2, ..., 100}.
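For reference, a minimal direct solver for this regularization path looks as follows; the synthetic data stands in for the Netflix matrix, and the grid of λ values matches the setup above:

```python
# Ridge regression over a grid of regularization parameters: form the Gram
# matrix once, then re-solve a d x d linear system for each lambda.
import numpy as np

rng = np.random.default_rng(2)
n, d = 5_000, 200
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

gram = A.T @ A                      # d x d, computed once and reused
Aty = A.T @ y
for lam in range(1, 101):           # lambda in {1, 2, ..., 100}
    x_lam = np.linalg.solve(gram + lam * np.eye(d), Aty)
    # ... evaluate test error for this value of lambda ...
```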
Fitting the full regularization path

[Figure: test error and cumulative computation time across the 100 values of the regularization parameter, comparing the Matlab Cholesky solver, conjugate gradient, and the iterative sketch; total computation time is roughly 1692 s (Cholesky), 874 s (conjugate gradient), and 102 s (iterative sketch).]
Gradient Descent vs Newton's Method

[Animated figure, time axis 0 to 12 hours: gradient descent takes many cheap iterations, each costing O(nd), while Newton's method takes few expensive iterations, each costing O(nd^2), and is affine invariant.]
Exact and approximate forms of Newton's method

Minimize g(x) = f(Ax) over a convex set C ⊆ R^d:

    x_{opt} = \arg\min_{x \in C} g(x),

where g : R^d → R is twice-differentiable.

Ordinary Newton steps:

    x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} (x - x^t)^T \nabla^2 g(x^t) (x - x^t) + \langle \nabla g(x^t), x - x^t \rangle \Big\},

where \nabla^2 g(x^t) is the Hessian at x^t.

Approximate Newton steps:
◦ various types of quasi-Newton updates: Nocedal & Wright book, Chap. 6
◦ BFGS method, SR1 method, etc.
◦ stochastic gradient + stochastic quasi-Newton (e.g., Byrd, Hansen, Nocedal & Singer, 2014)
Iterative sketching for general convex functions

    x_{opt} = \arg\min_{x \in C} g(x),

where g : R^d → R is twice-differentiable.

Ordinary Newton steps:

    x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|\nabla^2 g(x^t)^{1/2} (x - x^t)\|_2^2 + \langle \nabla g(x^t), x - x^t \rangle \Big\},

where \nabla^2 g(x^t)^{1/2} is a matrix square root of the Hessian at x^t.
Cost per step: O(nd^2) in the unconstrained case.

Sketched Newton steps, using a random sketch matrix S^t:

    \tilde{x}^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|S^t \nabla^2 g(\tilde{x}^t)^{1/2} (x - \tilde{x}^t)\|_2^2 + \langle \nabla g(\tilde{x}^t), x - \tilde{x}^t \rangle \Big\}.

Cost per step: \tilde{O}(nd) in the unconstrained case.
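As a concrete instance, here is a minimal unconstrained Newton-sketch loop for logistic regression. The fixed unit step (in place of a line search or step-size rule), the small ridge term for numerical stability, and the dense Gaussian sketch (a fast Hadamard-type sketch would be needed to actually reach the Õ(nd) per-step cost) are all simplifying assumptions of this sketch:

```python
# Newton sketch for logistic regression: the Hessian A^T D A has square root
# D^{1/2} A, which we sketch with S before solving the small d x d system.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_sketch_logistic(A, b, m, n_iter=20, rng=None):
    """Minimize sum_i [log(1 + exp(a_i^T x)) - b_i a_i^T x] with sketched Hessians."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(A @ x)
        grad = A.T @ (p - b)                          # exact gradient, O(nd)
        w = np.sqrt(p * (1.0 - p))                    # Hessian^{1/2} = diag(w) A
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        B = S @ (w[:, None] * A)                      # m x d sketched square root
        step = np.linalg.solve(B.T @ B + 1e-8 * np.eye(d), grad)
        x = x - step                                  # (a line search would go here)
    return x

# Usage on synthetic data with labels in {0, 1}
rng = np.random.default_rng(3)
n, d, m = 50_000, 50, 500
A = rng.standard_normal((n, d))
b = (sigmoid(A @ rng.standard_normal(d)) > rng.random(n)).astype(float)
x_hat = newton_sketch_logistic(A, b, m)
```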
Convergence of Newton sketch

Run the algorithm with sketch dimension m ≳ d on a self-concordant function g(x) = f(Ax), with data matrix A ∈ R^{n×d} and n ≫ d.

Theorem (Pilanci & W, 2015)
With probability at least 1 − c_0 e^{−c_1 m}, the number of iterations required for ε-accuracy is less than

    c_2 \log(1/\epsilon),

where (c_0, c_1, c_2) are universal (problem-independent) constants.

Dependence on sample size n, dimension d, conditioning κ, and tolerance ε:

    Algorithm                Computational cost
    Gradient descent         O(κ n d log(1/ε))
    Acc. gradient descent    O(√κ n d log(1/ε))
    Newton's method          O(n d^2 log log(1/ε))
    Newton sketch            Õ(n d log(1/ε))

Note: dependence on the condition number κ is unavoidable among first-order methods (Nesterov, 2004).
Logistic regression: uncorrelated features

[Plot: optimality gap versus wall-clock time for Newton, gradient descent, and Newton sketch.]

Sample size n = 500,000 with d = 5,000 features.
Logistic regression: correlated features

[Plot: optimality gap versus wall-clock time for Newton, gradient descent, and Newton sketch.]

Sample size n = 500,000 with d = 5,000 features.
Consequences for linear programming

◦ LP in standard form:

    \min_{Ax \le b} c^T x,   where A ∈ R^{n×d}, b ∈ R^n and c ∈ R^d.

◦ Interior point methods for LP solving are based on the unconstrained sequence

    x_\mu := \arg\min_{x \in R^d} \Big\{ \mu c^T x - \sum_{i=1}^n \log(b_i - a_i^T x) \Big\}.

◦ As the parameter μ → +∞, the path x_μ approaches an optimal solution x^* from the interior.
Standard central path

[Figure: feasible polytope for \min_{Ax \le b} c^T x; exact Newton iterates on the barrier objective \mu c^T x - \sum_{i=1}^n \log(b_i - a_i^T x) trace out the central path.]
Newton sketch follows central path

[Figure: the same polytope and barrier objective; the Newton sketch iterates closely follow the exact Newton central path.]
Linear Programs
Consequence: An LP with n constraints and d variables can be solved in
≈ O(nd) time when n ≫ d.
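A minimal sketch of a single sketched Newton step on the log-barrier objective above is shown below; the step-size control, the schedule for increasing μ, and feasibility handling are omitted, and the Gaussian sketch and dimensions are assumptions:

```python
# One sketched Newton step for g_mu(x) = mu * c^T x - sum_i log(b_i - a_i^T x):
# the barrier Hessian is A^T diag(1/slack^2) A, with square root diag(1/slack) A.
import numpy as np

def sketched_barrier_step(A, b, c, x, mu, m, rng=None):
    rng = np.random.default_rng(rng)
    slack = b - A @ x                        # must stay strictly positive
    grad = mu * c + A.T @ (1.0 / slack)      # gradient of the barrier objective
    H_half = A / slack[:, None]              # Hessian^{1/2} = diag(1/slack) A
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    B = S @ H_half                           # m x d sketched square root
    return x - np.linalg.solve(B.T @ B, grad)

# Usage at a random strictly feasible point
rng = np.random.default_rng(6)
n, d = 2_000, 50
A = rng.standard_normal((n, d)); c = rng.standard_normal(d)
x0 = rng.standard_normal(d); b = A @ x0 + rng.uniform(0.1, 1.0, size=n)
x1 = sketched_barrier_step(A, b, c, x0, mu=10.0, m=400)
```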
Performance compared to CPLEX

Random ensembles of linear programs with sample size n = 10,000 and dimensions d = 1, 2, ..., 500.

[Plot: computation time (seconds) versus dimension d for CPLEX and the Newton sketch.]

CODE: eecs.berkeley.edu/~mert/LP.zip
Summary
◦ high-dimensional data: challenges and opportunities
◦ optimization at large scales:
• Need fast methods...
• But approximate answers are OK
• randomized algorithms (with strong control) are useful
◦ this talk:
• the power of random projection
• information-theoretic analysis reveals deficiency of classical sketch
• Newton sketch: a fast and randomized Newton-type method
Papers/pre-prints:
◦ Pilanci & W. (2015): Randomized sketches of convex programs with
sharp guarantees, IEEE Transactions on Information Theory
◦ Pilanci & W. (2016a): Iterative Hessian Sketch: Fast and accurate
solution approximation for constrained least-squares, Journal of Machine
Learning Research
◦ Pilanci & W. (2016b): Newton Sketch: A linear-time optimization algorithm with linear-quadratic convergence. To appear in SIAM Journal on Optimization.
Gaussian width of transformed tangent cone

Gaussian width of the set AK ∩ S^{n−1} = {A∆ | ∆ ∈ K, ||A∆||_2 = 1}:

    W(AK) := \mathbb{E}\Big[ \sup_{z \in AK \cap S^{n-1}} \langle g, z \rangle \Big],   where g ∼ N(0, I_{n×n}).

Gaussian widths used in many areas:
◦ Banach space theory: Pisier, 1986; Gordon, 1988
◦ Empirical process theory: Ledoux & Talagrand, 1991; Bartlett et al., 2002
◦ Geometric analysis, compressed sensing: Mendelson, Pajor & Tomczak-Jaegermann, 2007
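As a sanity check on the definition, the width can be estimated by Monte Carlo. In the unconstrained case K = R^d, the set AK is the range of A, the supremum equals the norm of the projection of g onto that range, and the width concentrates near √d; the dimensions below are assumptions for illustration:

```python
# Monte Carlo estimate of the Gaussian width of range(A) intersected with the
# unit sphere: sup_{z in range(A), ||z||=1} <g, z> = ||Q^T g|| for an
# orthonormal basis Q of range(A), with expectation roughly sqrt(rank(A)).
import numpy as np

rng = np.random.default_rng(4)
n, d = 2_000, 40
A = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(A)                       # orthonormal basis for range(A)

G = rng.standard_normal((200, n))            # 200 Gaussian draws
width_est = np.mean(np.linalg.norm(G @ Q, axis=1))
print(width_est, np.sqrt(d))                 # the two should be close
```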
Fast Johnson-Lindenstrauss sketch

Step 1: Choose some fixed orthonormal matrix H ∈ R^{n×n}.
Example: Hadamard matrices

    H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},   H_{2^t} = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{\text{Kronecker product, } t \text{ times}}

Step 2:
(A) Multiply the data vector y by a diagonal matrix D of random signs {−1, +1}.
(B) Choose m rows of H to form the sub-sampled matrix H̃ ∈ R^{m×n}.
(C) Requires O(n log m) time to compute the sketched vector Sy = H̃ D y.

(E.g., Ailon & Liberty, 2010)
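A minimal implementation of this sketch is given below. The fast Walsh-Hadamard routine is a standard O(n log n) transform; the √(n/m) scaling, subsampling without replacement, and the restriction of n to a power of two are conventions assumed here so that E[SᵀS] = I:

```python
# Subsampled randomized Hadamard sketch: Sy = sqrt(n/m) * (m random rows of H) D y.
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform of a length-2^t vector (returns a new array)."""
    v = v.copy()
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v / np.sqrt(n)          # orthonormal scaling: H H^T = I

def srht_sketch(y, m, rng=None):
    """Compute the sketched vector S y for a length-2^t input y."""
    rng = np.random.default_rng(rng)
    n = len(y)
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal matrix D of random signs
    z = fwht(signs * y)                            # H D y in O(n log n) time
    rows = rng.choice(n, size=m, replace=False)    # keep m random rows of H
    return np.sqrt(n / m) * z[rows]

# Usage: n must be a power of two for this simple implementation
y = np.random.default_rng(5).standard_normal(1024)
sy = srht_sketch(y, m=64)
print(np.linalg.norm(sy), np.linalg.norm(y))       # norms are comparable
```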