
Preconditioning in Expectation
Richard Peng
MIT
Joint with Michael Cohen (MIT), Rasmus Kyng (Yale),
Jakub Pachocki (CMU), and Anup Rao (Yale)
CMU theory seminar, April 5, 2014
RANDOM SAMPLING
• Collection of many objects
• Pick a small subset of them
GOALS OF SAMPLING
• Estimate quantities
• Approximate higher
dimensional objects
• Use in algorithms
SAMPLE TO APPROXIMATE
• ε-nets / cuttings
• Sketches
• Graphs
• Gradients
This talk: matrices
NUMERICAL LINEAR ALGEBRA
• Linear system in n x n matrix
• Inverse is dense
• [Concus-Golub-O'Leary `76]:
incomplete Cholesky, drop entries
HOW TO ANALYZE?
• Show sample is good
• Concentration bounds
• Scalar: [Bernstein `24], [Chernoff `52]
• Matrices: [AW `02], [RV `07], [Tropp `12]
THIS TALK
• Directly show algorithm
using samples runs well
• Better bounds
• Simpler analysis
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
HOW TO DROP ENTRIES?
• Entry-based representation is hard
• Group entries together
• Symmetric with positive entries ⇔ adjacency matrix of a graph
SAMPLE WITH GUARANTEES
• Sample edges in graphs
• Goal: preserve size of all cuts
• [BK`96] graph sparsification
• Generalization of expanders
DROPPING ENTRIES/EDGES
• L: graph Laplacian
• For a 0-1 vector x: |x|_L^2 = size of the cut between the 0s and the 1s
Unit weight case: |x|_L^2 = Σ_uv (x_u – x_v)^2
Matrix norm: |x|_P^2 = x^T P x
DECOMPOSING A MATRIX
L = Σ_uv L_uv, where the piece for edge uv contributes (x_u – x_v)^2 and, restricted to coordinates u, v, is
[  1  -1 ]
[ -1   1 ]
• Sample based on positive representations
• P = Σ_i P_i, with each P_i P.S.D.
• Graphs: one P_i per edge
P.S.D.: multi-variate version of positive
(a minimal code sketch of this decomposition follows below)
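As an aside (not part of the original slides), here is a minimal NumPy sketch of the decomposition on a made-up 4-vertex graph: the Laplacian is assembled as a sum of per-edge P.S.D. pieces, and x^T L x counts the edges cut by a 0-1 vector x.

```python
import numpy as np

# Small made-up example graph on 4 vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4

def edge_laplacian(u, v, n):
    """P.S.D. piece for edge (u, v): the 2x2 block [[1, -1], [-1, 1]] placed at u, v."""
    P_uv = np.zeros((n, n))
    P_uv[u, u] = P_uv[v, v] = 1.0
    P_uv[u, v] = P_uv[v, u] = -1.0
    return P_uv

# L = sum of one P.S.D. piece per edge.
L = sum(edge_laplacian(u, v, n) for u, v in edges)

# For a 0-1 vector x, x^T L x equals the number of edges cut by the 0/1 partition.
x = np.array([1.0, 1.0, 0.0, 0.0])
cut = sum((x[u] - x[v]) ** 2 for u, v in edges)
print(x @ L @ x, cut)   # both print 3.0 for this graph
```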
MATRIX CHERNOFF BOUNDS
P = Σ_i P_i, with each P_i P.S.D.
Can sample Q with O(n log n · ε^{-2}) rescaled P_i's s.t. P ≼ Q ≼ (1 + ε)P
≼ : Loewner's partial ordering; A ≼ B ⇔ B – A is positive semidefinite
CAN WE DO BETTER?
• Yes, [BSS `12]: O(n ε^{-2}) is possible
• Iterative, cubic time construction
• [BDM `11]: extends to general matrices
DIRECT APPLICATION
Find Q very close to P
Solve problem on Q
Return answer
For ε accuracy, need P ≼ Q ≼ (1 + ε)P
Size of Q depends inversely on ε
ε^{-1} is the best we can hope for
USE INSIDE ITERATIVE METHODS
Find Q somewhat similar to P
Solve problem on P
using Q as a guide
• [AB `11]: crude samples
give good answers
• [LMP `12]: extensions to
row sampling
ALGORITHMIC VIEW
• Crude approximations are ok
• But need to be efficient
• Can we use [BSS `12]?
SPEED UP [BSS `12]
• Expander graphs, and more
• ‘i.i.d. sampling’ variant related
to the Kadison-Singer problem
MOTIVATION
• One dimensional sampling:
• moment estimation,
• pseudorandom generators
• Rarely need w.h.p.
• Dimensions should be disjoint
MOTIVATION
• Randomized coordinate descent
for electrical flows [KOSZ`13,LS`13]
• ACDM from [LS `13] improves
various numerical routines
RANDOMIZED COORDINATE DESCENT
• Related to stochastic optimization
• Known analyses when Q = Pj
• [KOSZ`13][LS`13] can be viewed
as ways of changing bases
OUR RESULT
For numerical routines, a random Q gives
the same performance as [BSS `12], in expectation
IMPLICATIONS
• Similar bounds to ACDM from [LS `13]
• Recursive Chebyshev iteration
([KMP`11]) runs faster
• Laplacian solvers in ≈ m log^{1/2} n time
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ITERATIVE METHODS
Find Q s.t. P ≼ Q ≼ 10P
Use Q as guide to solve
problem on P
• [Gauss, 1823] Gauss-Seidel iteration
• [Jacobi, 1845] Jacobi iteration
• [Hestenes-Stiefel `52] conjugate gradient
[RICHARDSON `1910]
x(t + 1) = x(t) + (b – Px(t))
• Fixed point: b – Px(t) = 0
• Each step: one matrix-vector multiplication
ITERATIVE METHODS
• Multiplication is easier than division,
especially for matrices
• Use verifier to solve problem
1D CASE
Know: 1/2 ≤ p ≤ 1  ⇒  1 ≤ 1/p ≤ 2
• 1 is a ‘good’ estimate
• Bad when p is far from 1
• Estimate of error: 1 - p
ITERATIVE METHODS
• 1 + (1 – p) = 2 – p is more accurate
• Two terms of Taylor expansion
• Can take more terms
ITERATIVE METHODS
1/p = 1 + (1 – p) + (1 – p)^2 + (1 – p)^3 + …
Generalizes to the matrix setting:
P^{-1} = I + (I – P) + (I – P)^2 + …
(a small sketch of truncating this series follows below)
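A tiny illustrative check (not from the slides) of the truncated series, first for a scalar p in [1/2, 1] and then for a matrix with eigenvalues in [1/2, 1]:

```python
import numpy as np

# Scalar: 1/p = 1 + (1-p) + (1-p)^2 + ...
p = 0.7
approx = sum((1 - p) ** k for k in range(10))
print(approx, 1 / p)   # ~1.42856 vs 1.42857...

# Matrix: P^{-1} = I + (I-P) + (I-P)^2 + ..., valid here since 1/2 I <= P <= I.
P = np.diag([0.6, 0.8, 1.0])
I = np.eye(3)
S = sum(np.linalg.matrix_power(I - P, k) for k in range(10))
print(np.max(np.abs(S - np.linalg.inv(P))))   # small truncation error
```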
[RICHARDSON `1910]
x(0) = I b
x(1) = (I + (I – P)) b
x(2) = (I + (I – P)(I + (I – P))) b
…
x(t + 1) = b + (I – P) x(t)
• Error of x(t) = (I – P)^t b
• Geometric decrease if P is close to I
(see the Richardson sketch below)
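A hedged sketch of plain Richardson iteration (the test matrix is a random construction of mine, not from the talk); the eigenvalues of P are placed in (1/2, 1] so the error (I – P)^t b decays geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
P = A @ A.T
P = 0.5 * np.eye(n) + 0.5 * P / np.linalg.norm(P, 2)   # eigenvalues in (1/2, 1]
b = rng.standard_normal(n)
x_star = np.linalg.solve(P, b)

x = np.zeros(n)
for t in range(60):
    x = b + (np.eye(n) - P) @ x      # x(t+1) = b + (I - P) x(t)
print(np.linalg.norm(x - x_star))    # error shrinks by at least 1/2 per step
```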
OPTIMIZATION VIEW
Residue: r(t) = x(t) – P^{-1}b
Error: |r(t)|_2^2
• Quadratic potential function
• Goal: walk down to the bottom
• Direction given by gradient
DESCENT STEPS
(figure: descent steps x(t) → x(t+1) on the quadratic)
• Step may overshoot
• Need smooth function
MEASURE OF SMOOTHNESS
x(t + 1) = b + (I – P) x(t)
Note: b = P P^{-1} b
r(t + 1) = (I – P) r(t)
|r(t + 1)|_2 ≤ |I – P|_2 |r(t)|_2
MEASURE OF SMOOTHNESS
• |I – P|_2: smoothness of |r(t)|_2^2
• Distance between P and I
• Related to eigenvalues of P
1/2 I ≼ P ≼ I  ⇒  |I – P|_2 ≤ 1/2
MORE GENERAL
• Convex functions
• Smoothness / strong convexity
This talk: only quadratics
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ILL-POSED PROBLEMS
Example: P = diag(.8, .1)
• Smoothness of the directions differs
• Progress limited by steeper parts
PRECONDITIONING
• Solve similar problem Q
• Transfer steps across
PRECONDITIONED RICHARDSON
• Optimal step down the energy function of Q given by Q^{-1}
• Equivalent to solving Q^{-1}Px = Q^{-1}b
PRECONDITIONED RICHARDSON
x(t + 1) = Q^{-1}b + (I – Q^{-1}P) x(t)
Residue: r(t + 1) = (I – Q^{-1}P) r(t)
|r(t + 1)|_P = |(I – Q^{-1}P) r(t)|_P
(a minimal sketch of this iteration follows below)
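A minimal sketch of preconditioned Richardson on an illustrative instance of my own (P is a random SPD matrix, Q its diagonal as a crude preconditioner, and the step size is chosen from the extreme eigenvalues of Q^{-1/2}PQ^{-1/2}):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, 2 * n))
P = A @ A.T / (2 * n) + 0.1 * np.eye(n)   # SPD test matrix
Q = np.diag(np.diag(P))                   # crude (Jacobi) preconditioner
b = rng.standard_normal(n)
x_star = np.linalg.solve(P, b)

# Step size from the relative eigenvalues of P w.r.t. Q (Q is diagonal here).
d = np.sqrt(np.diag(P))
lam = np.linalg.eigvalsh(P / np.outer(d, d))   # eigenvalues of Q^{-1/2} P Q^{-1/2}
c = 2.0 / (lam[0] + lam[-1])

x = np.zeros(n)
Qinv = np.linalg.inv(Q)
for t in range(200):
    x = x + c * Qinv @ (b - P @ x)       # x(t+1) = x(t) + c Q^{-1}(b - P x(t))
print(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))   # small relative error
```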
CONVERGENCE
Improvements depend on |I – P^{1/2}Q^{-1}P^{1/2}|_2
• If P ≼ Q ≼ 10P, error halves in O(1) iterations
• How to find a good Q?
MATRIX CHERNOFF
P = Σ_i P_i
Q = Σ_i s_i P_i, where s has small support
• Take O(n log n) (rescaled) P_i's with probability ~ trace(P_i P^{-1})
• Matrix Chernoff ([AW `02], [RV `07]): w.h.p. P ≼ Q ≼ 2P
Note: Σ_i trace(P_i P^{-1}) = n
(a leverage-score sampling sketch follows below)
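An illustrative leverage-score sampling sketch (my own random instance and constants, not code from the talk): P = Σ_i a_i a_iᵀ, pieces are sampled with probability proportional to trace(P_i P^{-1}) = a_iᵀ P^{-1} a_i and rescaled, and the relative eigenvalues of the result are checked against P.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, eps = 20, 5000, 0.5
A = rng.standard_normal((m, n))
P = A.T @ A                                   # P = sum_i a_i a_i^T

Pinv = np.linalg.inv(P)
tau = np.einsum('ij,jk,ik->i', A, Pinv, A)    # tau_i = trace(P_i P^{-1}), sums to n
probs = tau / tau.sum()

k = int(8 * n * np.log(n) / eps ** 2)         # O(n log n / eps^2) samples
idx = rng.choice(m, size=k, p=probs)
weights = 1.0 / (k * probs[idx])              # rescale so E[Q] = P
Q = (A[idx].T * weights) @ A[idx]

# Relative eigenvalues of Q w.r.t. P should lie roughly in [1 - eps, 1 + eps].
Linv = np.linalg.inv(np.linalg.cholesky(P))   # P = L L^T
w = np.linalg.eigvalsh(Linv @ Q @ Linv.T)
print(w.min(), w.max())
```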
WHY THESE PROBABILITIES?
Example: P = diag(.8, .1)
• trace(P_i P^{-1}): matrix 'dot product'
• If P is diagonal: trace(P_i P^{-1}) = 1 for every i, so all entries are needed
Overhead of concentration: union bound on dimensions
IS CHERNOFF NECESSARY?
(figure: a diagonal matrix split into its diagonal entries)
• P: diagonal matrix
• Missing one entry: unbounded approximation factor
BETTER CONVERGENCE?
• [Kaczmarz `37]: random projections
onto small subspaces can work
• Better (expected) behavior than
what matrix concentration gives!
HOW?
(figure: Q₁ differs from P)
• Will still progress in good directions
• Can have (finite) badness if they are orthogonal to the goal
QUANTIFY DEGENERACIES
(figure: diagonal P together with a smaller diagonal D ≼ P)
• Have some D ≼ P 'for free'
• D = λ_min(P)·I (min eigenvalue)
• D = tree when P is a graph
• D = crude approximation / rank certificate
REMOVING DEGENERACIES
• 'Padding' to remove degeneracy
• If D ≼ P and 0.5P ≼ Q ≼ P, then 0.5P ≼ D + Q ≼ 2P
ROLE OF D
• Implicit in proofs of matrix Chernoff, as well as [BSS `12]
• Splitting of P in numerical analysis
• D and P can be very different
MATRIX CHERNOFF
• Let D ≼ 0.1P, t = trace(PD^{-1})
• Take O(t log n) samples with probability ~ trace(P_i D^{-1})
• Q ← D + (rescaled) samples
• W.h.p. P ≼ Q ≼ 2P
WEAKER REQUIREMENT
Q only needs to do well in
some directions, on average
EXPECTED CONVERGENCE
• Let t = trace(PD^{-1})
• Take rand[t, 2t] samples, w.p. ~ trace(P_i D^{-1})
• Add (rescaled) results to D to form Q
There exists a constant c s.t. for any r,
E[ |(I – c Q^{-1}P) r|_P ] ≤ 0.99 |r|_P
(an illustrative numerical sketch follows below)
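A hedged numerical sketch of this statement (the test instance, the rescaling, and the step constant c = 0.1 are my own illustrative choices, not taken from the talk): draw Q = D + rescaled samples many times, apply one step of (I – cQ^{-1}P), and average the P-norm contraction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 15, 600
A = rng.standard_normal((m, n))
P = A.T @ A                                            # P = sum_i P_i, P_i = a_i a_i^T
D = 0.05 * np.min(np.linalg.eigvalsh(P)) * np.eye(n)   # a crude D with D ≼ 0.1 P

scores = np.einsum('ij,jk,ik->i', A, np.linalg.inv(D), A)   # trace(P_i D^{-1})
t = int(np.ceil(scores.sum()))                              # t = trace(P D^{-1})
probs = scores / scores.sum()

def p_norm(r):
    return np.sqrt(r @ P @ r)

r = rng.standard_normal(n)
c = 0.1                                    # illustrative constant step
ratios = []
for _ in range(200):
    k = rng.integers(t, 2 * t + 1)         # rand[t, 2t] samples
    idx = rng.choice(m, size=k, p=probs)
    weights = 1.0 / (k * probs[idx])       # rescale so the samples sum to ~P
    Q = D + (A[idx].T * weights) @ A[idx]
    r_new = r - c * np.linalg.solve(Q, P @ r)   # (I - c Q^{-1} P) r
    ratios.append(p_norm(r_new) / p_norm(r))
print(np.mean(ratios))                     # average contraction factor, below 1
```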
OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
ASIDE
Matrix Chernoff:
• f(Q) = exp(P^{-1/2}(P – Q)P^{-1/2})
• Show decrease in relative eigenvalues
Iterative methods:
• f(x) = |x – P^{-1}b|_P
• Show decrease in distance to solution
Goal: combine these analyses
SIMPLIFYING ASSUMPTIONS
(figure: normalize the original P₀, D₀ to P, D)
• P = I (by normalization)
• tr(P_i D^{-1}) = 0.1, 'unit weight'
• Expected value of picking a P_i at random: (1/t)·I
DECREASE
Step: r’ = (I – Q-1P)r
= (I – Q-1)r
New error: |r’|P = |(I – Q-1 )r|2
Expand:
| r ' | - | r | = -2 | r |
2
2
2
2
2
Q-1
+|r |
2
Q-2
DECREASE:
Want: –2 |r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2 ≤ –0.1 |r|_2^2
• I ≼ Q ≼ 1.1 I would imply:
  • 0.9 I ≼ Q^{-1}
  • Q^{-2} ≼ I
• But also Q^{-3} ≼ I, etc.
• Don't need the 3rd moment
RELAXATIONS
E_Q[ –2 |r|_{Q^{-1}}^2 + |r|_{Q^{-2}}^2 ]
• Only need Q^{-1} and Q^{-2}
• By linearity, it suffices to:
  • Lower bound E_Q[Q^{-1}]
  • Upper bound E_Q[Q^{-2}]
TECHNICAL RESULT
Assumptions: Σ_i P_i = I, trace(P_i D^{-1}) = 0.1
• Let t = trace(D^{-1})
• Take rand[t, 2t] uniform samples
• Add (rescaled) results to D to form Q
• 0.9 I ≼ E[Q^{-1}]
• E[Q^{-2}] ≼ O(1)·I
E[Q^{-1}]
• 0.5 I ≼ E[Q^{-1}] follows from the matrix arithmetic-harmonic mean inequality ([ST `94])
• Need: an upper bound on E[Q^{-2}]
E[Q^{-2}] ≼ O(1)·I ?
(figure: Q_j^{-1} and Q_j^{-2} as j runs from 0 to 2t)
• Q^{-2} is the gradient of Q^{-1}
• More careful tracking of Q^{-1} gives info on Q^{-2} as well!
TRACKING Q-1
• Q: start from D, add [t,2t]
random (rescaled) Pis.
• Track inverse of Q under
rank-1 perturbations
Sherman-Morrison formula (for a rank-one update M):
(A + M)^{-1} = A^{-1} – A^{-1} M A^{-1} / (1 + tr(A^{-1}M))
(a quick numerical check follows below)
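A quick numerical check (illustrative, with a random A and u of my choosing) of this rank-one form, where M = uuᵀ and tr(A^{-1}M) = uᵀA^{-1}u:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)            # SPD, hence invertible
u = rng.standard_normal(n)
M = np.outer(u, u)                 # rank-one P.S.D. update

Ainv = np.linalg.inv(A)
sm = Ainv - (Ainv @ M @ Ainv) / (1 + np.trace(Ainv @ M))
print(np.max(np.abs(sm - np.linalg.inv(A + M))))   # ~1e-15, numerical noise
```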
BOUNDING Q^{-1}: DENOMINATOR
Current matrix: Q_j, sample: R
Q_{j+1}^{-1} = Q_j^{-1} – Q_j^{-1} R Q_j^{-1} / (1 + tr(Q_j^{-1}R))
• D ≼ Q_j ⇒ Q_j^{-1} ≼ D^{-1}
• tr(Q_j^{-1}R) ≤ tr(D^{-1}R) ≤ 0.1 for any sample R, so
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – 0.9 Q_j^{-1} E[R] Q_j^{-1}
BOUNDING Q^{-1}: NUMERATOR
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – 0.9 Q_j^{-1} E[R] Q_j^{-1}
• R: a random rescaled P_i
• Assumption: E[R] = P = I in aggregate, i.e. each of the ~t rescaled samples has expectation (1/t)·I
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
AGGREGATION
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
D = Q_0 → Q_1 → Q_2 → …
• Q_j is also random
• Need to aggregate the choices of R into a bound on E[Q_j^{-1}]
HARMONIC SUMS
HrmSum(X, a) = 1/(1/X + 1/a)
• Use harmonic sums of matrices
• Matrix functionals
• Similar to the Stieltjes transform in [BSS `12]
• Proxy for the −2nd power
• Well behaved under expectation:
E_X[HrmSum(X, a)] ≤ HrmSum(E[X], a)
(a small scalar check follows below)
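A small scalar sanity check (my own illustration; the talk uses the matrix analogue): HrmSum(·, a) is concave on positive inputs, so by Jensen's inequality its average is at most its value at the average.

```python
import numpy as np

rng = np.random.default_rng(5)
a = 2.0
X = rng.uniform(0.5, 4.0, size=100_000)   # random positive inputs

def hrm_sum(x, a):
    return 1.0 / (1.0 / x + 1.0 / a)

print(hrm_sum(X, a).mean(), hrm_sum(X.mean(), a))   # first value <= second
```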
HARMONIC SUM
E_R[Q_{j+1}^{-1}] ≼ Q_j^{-1} – (0.9/t) Q_j^{-2}
–(1/t) |u|_{M^{-2}}^2 ≤ – (|u|_{M^{-1}}^2)^2 / (|u|_{M^{-1}}^2 + t)
E[ |u|_{Q_{j+1}^{-1}}^2 ] ≤ 0.1 |u|_{Q_j^{-1}}^2 + 0.9·HrmSum(|u|_{Q_j^{-1}}^2, t)
Initial condition + telescoping sum gives E[Q_t^{-1}] ≼ O(1)·I
E[Q^{-2}] ≼ O(1)·I
(figure: Q_j^{-1} and Q_j^{-2} as j runs from 0 to 2t)
• Q^{-2} is the gradient of Q^{-1}:
(0.9/t) Q_j^{-2} ≼ Q_j^{-1} – E_R[Q_{j+1}^{-1}]
• (0.9/t) Σ_{j=t}^{2t-1} E[Q_j^{-2}] ≼ E[Q_t^{-1}] – E[Q_{2t}^{-1}]
• A random j from [t, 2t] is good!
SUMMARY
Un-normalize:
• 0.5 P ≼ E[P Q^{-1} P]
• E[P Q^{-1} P Q^{-1} P] ≼ 5 P
One step of preconditioned Richardson:
E[ |(I – 0.1 Q^{-1}P) r|_P^2 ]
= |r|_P^2 – 0.2 E[ |r|_{PQ^{-1}P}^2 ] + 0.01 E[ |r|_{PQ^{-1}PQ^{-1}P}^2 ]
≤ |r|_P^2 – 0.1 |r|_P^2 + 0.05 |r|_P^2
= 0.95 |r|_P^2
MORE GENERAL
• Works for some convex functions
• Sherman-Morrison replaced by
inequality, primal/dual
FUTURE WORK
• Expected convergence of
• Chebyshev iteration?
• Conjugate gradient?
• Same bound without D
(using pseudo-inverse)?
• Small error settings
• Stochastic optimization?
• More moments?
THANK YOU!
Questions?