ECS289: Scalable Machine Learning
Cho-Jui Hsieh
UC Davis
Sept 29, 2015
Outline
Linear regression
Ridge regression and Lasso
Time complexity (closed form solution)
Iterative Solvers
Regression
Input: training data x_1, x_2, . . . , x_n ∈ R^d and corresponding outputs y_1, y_2, . . . , y_n ∈ R
Training: compute a function f such that f(x_i) ≈ y_i for all i
Prediction: given a testing sample x̃, predict the output as f(x̃)
Examples:
Income, number of children ⇒ Consumer spending
Processes, memory ⇒ Power consumption
Financial reports ⇒ Risk
Atmospheric conditions ⇒ Precipitation
Linear Regression
Assume f(·) is a linear function parameterized by w ∈ R^d:
f(x) = w^T x
Training: compute the model w ∈ R^d such that w^T x_i ≈ y_i for all i
Prediction: given a testing sample x̃, the prediction value is w^T x̃
How to find w?
w^* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)^2
Linear Regression: probability interpretation (*)
Assume the data is generated from the probability model:
y_i = w^T x_i + ε_i,   ε_i ∼ N(0, 1)
Maximum likelihood estimator:
w^* = argmax_w log P(y_1, . . . , y_n | x_1, . . . , x_n, w)
    = argmax_w Σ_{i=1}^n log P(y_i | x_i, w)
    = argmax_w Σ_{i=1}^n log( (1/√(2π)) e^{−(w^T x_i − y_i)^2 / 2} )
    = argmax_w Σ_{i=1}^n −(1/2)(w^T x_i − y_i)^2 + constant
    = argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2
Linear Regression: written as a matrix form
Linear regression: w^* = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T x_i − y_i)^2
Matrix form: let X ∈ R^{n×d} be the matrix where the i-th row is x_i, and let
y = [y_1, . . . , y_n]^T; then linear regression can be written as
w^* = argmin_{w ∈ R^d} ‖Xw − y‖_2^2
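To make the matrix form concrete, here is a minimal NumPy sketch (the data and variable names are illustrative, not from the slides) that builds X and y and computes the least-squares solution:

import numpy as np

# Toy data: n = 100 samples, d = 3 features, generated from a known w (illustrative).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Least-squares solution of argmin_w ||Xw - y||_2^2.
w_star, residual, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w_star)                           # close to w_true
print(np.linalg.norm(X @ w_star - y))   # small residual norm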
Data Structure
Dense Matrix vs Sparse Matrix
Any matrix X ∈ Rm×n can be dense or sparse
Dense Matrix: most entries in X are nonzero (mn space)
Sparse Matrix: only a few entries in X are nonzero (O(nnz) space)
Dense Matrix Operations
Let A ∈ R^{m×n}, B ∈ R^{m×n}, s ∈ R
A + B, sA, A^T : mn operations
Let A ∈ R^{m×n}, b ∈ R^{n×1}
Ab: mn operations
Dense Matrix Operations
Matrix-matrix multiplication: let A ∈ R^{m×k}, B ∈ R^{k×n}; what is the
time complexity of computing AB?
Dense Matrix Operations
Assume A, B ∈ R^{n×n}; what is the time complexity of computing AB?
Naive implementation: O(n^3)
Theoretical best: O(n^{2.xxx}) (but slower than the naive implementation in practice)
Best way: using BLAS (Basic Linear Algebra Subprograms)
Dense Matrix Operations
BLAS matrix product: O(mnk) for computing AB, where A ∈ R^{m×k}, B ∈ R^{k×n}
Compute matrix product block by block to minimize cache miss rate
Can be called from C, Fortran; can be used in MATLAB, R, Python, . . .
Three levels of BLAS operations (a short SciPy sketch follows below):
1. Level 1: Vector operations, e.g., y = αx + y
2. Level 2: Matrix-Vector operations, e.g., y = αAx + βy
3. Level 3: Matrix-Matrix operations, e.g., y = αAB + βC
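As an aside, the three BLAS levels can be called directly from Python through SciPy's low-level wrappers (a minimal sketch, assuming SciPy is installed; the results match the plain NumPy expressions):

import numpy as np
from scipy.linalg.blas import daxpy, dgemv, dgemm

rng = np.random.default_rng(0)
A = np.asfortranarray(rng.normal(size=(4, 3)))   # BLAS prefers column-major storage
B = np.asfortranarray(rng.normal(size=(3, 5)))
x = rng.normal(size=3)

y1 = daxpy(x, np.zeros(3), a=2.0)   # Level 1: y = 2*x + 0
y2 = dgemv(1.0, A, x)               # Level 2: y = 1.0 * A @ x
C  = dgemm(1.0, A, B)               # Level 3: C = 1.0 * A @ B
assert np.allclose(C, A @ B)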
BLAS (*)
Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), . . .
CSC: three arrays for storing an m × n matrix with nnz nonzeros
1. val (nnz real numbers): the values of the nonzero elements
2. row_ind (nnz integers): the row indices corresponding to the values
3. col_ptr (n + 1 integers): the list of val indices where each column starts
Sparse Matrix Operations
Widely-used formats: Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), . . .
CSR: three arrays for storing an m × n matrix with nnz nonzeros (a SciPy sketch follows below)
1. val (nnz real numbers): the values of the nonzero elements
2. col_ind (nnz integers): the column indices corresponding to the values
3. row_ptr (m + 1 integers): the list of val indices where each row starts
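To make the two layouts concrete, here is a minimal sketch using scipy.sparse (assuming SciPy is available); SciPy names the three arrays data, indices, and indptr:

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

# A small 3 x 4 matrix with 5 nonzeros.
A = np.array([[1.0, 0.0, 0.0, 2.0],
              [0.0, 0.0, 3.0, 0.0],
              [4.0, 0.0, 0.0, 5.0]])

A_csc = csc_matrix(A)
print(A_csc.data)     # val:     values of the nonzeros, column by column
print(A_csc.indices)  # row_ind: row index of each stored value
print(A_csc.indptr)   # col_ptr: where each of the 4 columns starts (length n + 1 = 5)

A_csr = csr_matrix(A)
print(A_csr.data)     # val:     values of the nonzeros, row by row
print(A_csr.indices)  # col_ind: column index of each stored value
print(A_csr.indptr)   # row_ptr: where each of the 3 rows starts (length m + 1 = 4)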
Sparse Matrix Operations
If A ∈ R^{m×n} (sparse), B ∈ R^{m×n} (sparse or dense), s ∈ R
A + B, sA, A^T : O(nnz) operations
If A ∈ R^{m×n} (sparse), b ∈ R^{n×1}
Ab: O(nnz) operations
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (dense):
AB: O(nnz(A) · n) operations (use sparse BLAS)
If A ∈ R^{m×k} (sparse), B ∈ R^{k×n} (sparse):
AB: O(nnz(A) · nnz(B)/k) operations on average, O(nnz(A) · n) in the worst case
The resulting matrix will be much denser (a SciPy sketch follows below)
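A minimal sketch of these operations with scipy.sparse (assuming SciPy; the sizes and densities are illustrative):

import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)
A = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)  # ~5000 nonzeros
b = rng.normal(size=500)

Ab = A @ b                  # sparse matrix-vector product: O(nnz(A)) work
B = sparse_random(500, 800, density=0.01, format="csr", random_state=1)
C = A @ B                   # sparse-sparse product
print(A.nnz, B.nnz, C.nnz)  # C.nnz is typically much larger than A.nnz or B.nnz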
Closed Form Solution
Solving Linear Regression
Minimize the sum of squared error J(w):
J(w) = (1/2) ‖Xw − y‖^2
     = (1/2) (Xw − y)^T (Xw − y)
     = (1/2) w^T X^T X w − y^T X w + (1/2) y^T y
Derivative: ∂J(w)/∂w = X^T X w − X^T y
Setting the derivative equal to zero gives the normal equation
X^T X w^* = X^T y
Therefore, w^* = (X^T X)^{-1} X^T y
(but X^T X may not be invertible . . . )
Solving Linear Regression
Normal equation: X^T X w^* = X^T y
If X^T X is invertible (typically when # samples > # features):
w^* = (X^T X)^{-1} X^T y
If X^T X is low-rank (typically when # features > # samples):
there are infinitely many solutions (why?)
Least-norm solution: w^* = X^† y, where X^† is the pseudo-inverse
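A minimal NumPy sketch of both cases (shapes are illustrative; np.linalg.pinv computes the pseudo-inverse, which yields the least-norm solution in the rank-deficient case):

import numpy as np

rng = np.random.default_rng(0)

# Case 1: more samples than features, X^T X invertible.
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equation X^T X w = X^T y

# Case 2: more features than samples, X^T X singular -> infinitely many minimizers.
X2 = rng.normal(size=(5, 100))
y2 = rng.normal(size=5)
w_ln = np.linalg.pinv(X2) @ y2          # least-norm solution w* = X^dagger y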
Regularized Linear Regression
Overfitting
Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting
Avoid Overfitting by Controlling Model Complexity (*)
In the following, we derive a bound for generalization (prediction) error
Training & testing data are generated from the same distribution
(x_i, y_i) ∼ D
Generalization error (expected prediction error): R(f) := E_{(x,y)∼D}[(f(x) − y)^2]
Best model (minimizing prediction error): f^* := argmin_{f ∈ F} R(f)
Empirical error (error on the training data): R̂(f) := (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2
Our estimator: f̂ := argmin_{f ∈ F} R̂(f)
F: a class of functions (e.g., all linear functions {f : f(x) = w^T x})
Avoid Overfitting by Controlling Model Complexity (*)
Generalization error of f̂:
R(f̂) = R(f^*) + (R̂(f̂) − R̂(f^*)) + (R(f̂) − R̂(f̂)) + (R̂(f^*) − R(f^*))
     ≤ R(f^*) + 0 + 2 max_{f ∈ F} |R̂(f) − R(f)|
Overfitting is due to a large max_{f ∈ F} |R̂(f) − R(f)|
How to make this term smaller?
1. Have more samples (larger n)
2. Make F a smaller set (regularization)
Avoid Overfitting by Controlling Model Complexity (*)
R(f̂)  (the generalization error of the estimator)  ≤  R(f^*) + 2 max_{f ∈ F} |R̂(f) − R(f)|
Control F to be a subset of linear functions:
1. Ridge regression: F = {f(x) = w^T x | ‖w‖_2 ≤ K}, where K is a constant
2. Lasso: F = {f(x) = w^T x | ‖w‖_1 ≤ K}, where K is a constant
If K is large ⇒ overfitting (max_{f ∈ F} |R̂(f) − R(f)| is large)
If K is small ⇒ underfitting (R(f^*) is large because F does not cover the best solution)
Adding a constraint is equivalent to adding a regularization term:
argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2   s.t. ‖w‖_2 ≤ K
is equivalent to the following problem with some λ:
argmin_w Σ_{i=1}^n (w^T x_i − y_i)^2 + λ‖w‖_2^2
Regularized Linear Regression
Regularized Linear Regression:
argmin_w ‖Xw − y‖^2 + R(w)
R(w): regularization
Ridge Regression (ℓ2 regularization):
argmin_w ‖Xw − y‖^2 + λ‖w‖^2
Lasso (ℓ1 regularization):
argmin_w ‖Xw − y‖^2 + λ‖w‖_1
Note that ‖w‖_1 = Σ_{i=1}^d |w_i|
Regularization
Lasso: the solution is sparse, but there is no closed-form solution (a scikit-learn sketch follows below)
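To see the sparsity effect empirically, here is a minimal sketch using scikit-learn (assuming it is installed; note that scikit-learn's objectives scale the squared-loss term differently from the slides, so alpha plays the role of λ only up to a constant):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, 2.0, 4.0]        # only a few truly relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.sum(np.abs(ridge.coef_) > 1e-6))      # ridge: essentially all d coefficients nonzero
print(np.sum(np.abs(lasso.coef_) > 1e-6))      # lasso: typically only a handful are nonzero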
Ridge Regression
Ridge regression: argmin_{w ∈ R^d} J(w), where
J(w) = (1/2) ‖Xw − y‖^2 + (λ/2) ‖w‖^2
Closed form solution: the optimal solution w^* satisfies ∇J(w^*) = 0:
X^T X w^* − X^T y + λw^* = 0
(X^T X + λI) w^* = X^T y
Optimal solution: w^* = (X^T X + λI)^{-1} X^T y
Inverse always exists because X^T X + λI is positive definite
What's the computational complexity?
Time Complexity (Closed Form Solution)
Time Complexity (closed form solution of ridge regression)
Goal: compute (X^T X + λI)^{-1} (X^T y)
Step 1: compute A = X^T X + λI and b = X^T y
What's the time complexity? O(nd^2 + nd) if X is dense
Step 2: several options, each with O(d^3) time complexity
1. Compute A^{-1}, and then compute w^* = A^{-1} b (calling the underlying LAPACK routine)
2. Directly solve the linear system Aw = b (Cholesky factorization)
Which one is better? (see the sketch below)
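A minimal NumPy/SciPy sketch comparing the two options (illustrative; in practice, solving the system directly, e.g. via a Cholesky factorization, is preferred over forming A^{-1} explicitly for accuracy and speed):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 1000, 200, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

A = X.T @ X + lam * np.eye(d)     # Step 1: O(n d^2)
b = X.T @ y                       # Step 1: O(n d)

w1 = np.linalg.inv(A) @ b         # Option 1: explicit inverse, O(d^3)
w2 = np.linalg.solve(A, b)        # Option 2: solve the linear system directly, O(d^3)
w3 = cho_solve(cho_factor(A), b)  # Option 2 via Cholesky (A is positive definite)
assert np.allclose(w1, w2) and np.allclose(w2, w3)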
Time Complexity (closed form solution)
Time Complexity for step 2: O(d^3)
More on Linear System Solvers
Can be done by a linear system solver: compute w such that Aw = y
Different libraries can be used:
Default LAPACK
Intel Math Kernel Library (Intel MKL)
AMD Core Math Library (ACML)
Parallel numerical linear algebra packages are available today:
ScaLAPACK (SCAlable Linear Algebra PACKage): linear algebra routines
for distributed memory machines
PLAPACK (Parallel Linear Algebra PACKage): dense linear algebra
routines
PLASMA (Parallel Linear Algebra Software for Multicore Architectures):
combine state-of-the-art solutions in parallel algorithms and scheduling
for optimized solutions of linear systems, eigenvalue problems, . . .
However, solving Aw = y when A ∈ R^{d×d} still takes O(d^3) time
Time Complexity
When X is dense:
Closed form solution requires O(nd^2 + d^3) operations if X is dense
Efficient if d is very small
Runs forever when d > 100, 000
Typical case for big data applications:
X ∈ R^{n×d} is sparse, with large n and large d
How can we solve the problem?
Iterative Solver (Coordinate Descent)
Optimization: iterative solver
Iterative solver: generate a sequence of (improving) approximate
solutions for a problem.
w^0 (initial point) → w^1 → w^2 → w^3 → · · ·
Convergence of the iterative algorithm
Convergence properties:
Will the algorithm converge to a point?
Will it converge to optimal solution(s)?
What’s the convergence rate (how fast will the sequence converge)?
We will only briefly introduce these convergence results in this class; instead, we will focus on:
computational complexity per iteration
Coordinate Descent
Solve the optimization problem: argmin_w J(w)
Update one variable at a time
Obtains a model with reasonable performance after only a few iterations
Randomized coordinate descent: pick a random coordinate to update
at each iteration
Coordinate Descent
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: solution w
1: t = 0
2: while not converged do
3:   Randomly choose a variable w_j
4:   Compute the optimal update on w_j by solving δ^* = argmin_δ J(w + δ e_j) − J(w)
5:   Update w_j: w_j ← w_j + δ^*
6:   t ← t + 1
7: end while
Coordinate Descent
Q: What is the exact CD rule for Ridge Regression?
δ^* = − ( Σ_{i=1}^n X_{ij} (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_{ij}^2 )
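A minimal dense NumPy sketch of this update rule (illustrative; this is the naive version that recomputes the residual for every update, and the later slides show how to make each update cheaper):

import numpy as np

def ridge_cd(X, y, lam=1.0, n_passes=50, seed=0):
    """Randomized CD for J(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes * d):                     # d coordinate updates per "pass"
        j = rng.integers(d)
        grad_j = X[:, j] @ (X @ w - y) + lam * w[j]   # partial derivative w.r.t. w_j
        hess_j = lam + X[:, j] @ X[:, j]              # second derivative w.r.t. w_j
        w[j] -= grad_j / hess_j                       # delta* = -grad_j / hess_j
    return w

# Usage: the CD iterate should approach the closed-form ridge solution.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)); y = rng.normal(size=200)
w_cd = ridge_cd(X, y, lam=1.0)
w_cf = np.linalg.solve(X.T @ X + 1.0 * np.eye(10), X.T @ y)
print(np.max(np.abs(w_cd - w_cf)))   # should be small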
Coordinate Descent (convergence)
If J(w) is strongly convex, randomized coordinate descent converges to the
global optimum with a global linear convergence rate.
For ridge regression, J(w) is strongly convex because
∇^2 J(w) = X^T X + λI, which is positive definite (its eigenvalues are at least λ > 0)
We will show more details in the next class
Time Complexity
For updating w_j, we need to compute
δ^* = − ( Σ_{i=1}^n X_{ij} (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_{ij}^2 )
What's the computational complexity?
Time Complexity
For updating w_j, we need to compute
δ^* = − ( Σ_{i=1}^n X_{ij} (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_{ij}^2 )
Assume X ∈ R^{n×d} is a sparse matrix; column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
A naive approach: for each coordinate update (w_j),
1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
2. Compute h_j := Σ_{i=1}^n X_{ij}^2 : O(n_j) operations
3. Compute δ^* = − ( Σ_{i=1}^n r_i X_{ij} + λ w_j ) / ( λ + h_j ): O(n_j) operations
4. w_j ← w_j + δ^*
Time Complexity
For updating w_j, we need to compute
δ^* = − ( Σ_{i=1}^n X_{ij} (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_{ij}^2 )
Assume X ∈ R^{n×d} is a sparse matrix; column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
Precompute h_j := Σ_{i=1}^n X_{ij}^2 for all j = 1, . . . , d: O(nnz(X)) operations
For each coordinate update (w_j):
1. Compute r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
2. Compute δ^* = − ( Σ_{i=1}^n r_i X_{ij} + λ w_j ) / ( λ + h_j ): O(n_j) operations
3. w_j ← w_j + δ^*
Time Complexity
For updating w_j, we need to compute
δ^* = − ( Σ_{i=1}^n X_{ij} (x_i^T w − y_i) + λ w_j ) / ( λ + Σ_{i=1}^n X_{ij}^2 )
Assume X ∈ R^{n×d} is a sparse matrix; column j of X has n_j nonzero entries (Σ_j n_j = nnz(X))
Precompute h_j := Σ_{i=1}^n X_{ij}^2 for all j = 1, . . . , d: O(nnz(X)) operations
Precompute (and maintain) r_i := x_i^T w − y_i for all i: O(nnz(X)) operations
For each coordinate update (w_j):
1. Compute δ^* = − ( Σ_{i=1}^n r_i X_{ij} + λ w_j ) / ( λ + h_j ): O(n_j) operations
2. For all i = 1, 2, . . . , n, update r_i ← r_i + δ^* X_{ij}: O(n_j) operations
3. w_j ← w_j + δ^*
(a sparse-matrix sketch follows below)
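Here is a minimal sketch of this residual-maintenance scheme on a CSC matrix (assuming SciPy; column slices of CSC give exactly the n_j nonzeros of column j, so each update costs O(n_j)):

import numpy as np
from scipy.sparse import random as sparse_random

def ridge_cd_sparse(X_csc, y, lam=1.0, n_passes=50, seed=0):
    """CD for 0.5*||Xw - y||^2 + 0.5*lam*||w||^2 with residual maintenance."""
    rng = np.random.default_rng(seed)
    n, d = X_csc.shape
    w = np.zeros(d)
    h = np.asarray(X_csc.multiply(X_csc).sum(axis=0)).ravel()  # h_j = sum_i X_ij^2, once: O(nnz)
    r = -y.copy()                                              # r_i = x_i^T w - y_i with w = 0
    for _ in range(n_passes * d):
        j = rng.integers(d)
        lo, hi = X_csc.indptr[j], X_csc.indptr[j + 1]          # nonzeros of column j
        rows, vals = X_csc.indices[lo:hi], X_csc.data[lo:hi]
        delta = -(vals @ r[rows] + lam * w[j]) / (lam + h[j])  # O(n_j)
        w[j] += delta
        r[rows] += delta * vals                                # maintain residuals: O(n_j)
    return w

# Usage on a random sparse problem (illustrative sizes).
X = sparse_random(500, 100, density=0.05, format="csc", random_state=0)
y = np.random.default_rng(1).normal(size=500)
w = ridge_cd_sparse(X, y, lam=1.0)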
Time Complexity
On average, O(n̄) operations per coordinate update, where n̄ = nnz(X)/d
T outer iterations, each outer iteration doing d coordinate updates:
T × O(nnz(X)) operations
Closed form solution: O(nnz(X) · d + d^3) operations
Can be easily extended to solve the Lasso problem (see the sketch below)
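The slides do not spell out the Lasso update; for reference, the standard coordinate-wise minimizer of (1/2)‖Xw − y‖^2 + λ‖w‖_1 is a soft-thresholding step. A dense sketch under that formulation (not taken from the slides) is:

import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam=0.1, n_passes=50, seed=0):
    """Randomized CD for 0.5*||Xw - y||^2 + lam*||w||_1 (dense sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    h = (X ** 2).sum(axis=0)                   # h_j = sum_i X_ij^2
    r = X @ w - y                              # residuals, maintained incrementally
    for _ in range(n_passes * d):
        j = rng.integers(d)
        z = h[j] * w[j] - X[:, j] @ r          # = sum_i X_ij (y_i - x_i^T w + X_ij w_j)
        w_new = soft_threshold(z, lam) / h[j]  # exact minimizer over w_j
        r += (w_new - w[j]) * X[:, j]          # keep residuals consistent
        w[j] = w_new
    return w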
Other Iterative Solvers
Other iterative solvers for ridge regression:
Gradient Descent
Stochastic Gradient Descent
......
Coming up
Next class: intro to optimization
Questions?