
Recent Progress On Sampling Problem
Yin Tat Lee (MSR/UW), Santosh Vempala (Gatech)
My Dream
Tell the complexity of a convex problem by looking at the formula.
Example
Minimum Cost Flow Problem:
β–ͺ This is a linear program; each row has two non-zeros.
It can be solved in Õ(π‘šβˆšπ‘›) time. [LS14] (Previous: Õ(π‘šβˆšπ‘š) for a graph with π‘š edges and 𝑛 vertices.)
My Dream
Tell the complexity of a convex problem by looking at the formula.
Example
Submodular Minimization: minimize a set function 𝑓 over subsets of {1, β‹―, 𝑛},
where 𝑓 satisfies diminishing returns, i.e.
𝑓(𝑆 βˆͺ {𝑒}) βˆ’ 𝑓(𝑆) ≀ 𝑓(𝑇 βˆͺ {𝑒}) βˆ’ 𝑓(𝑇) for all 𝑇 βŠ‚ 𝑆, 𝑒 βˆ‰ 𝑆.
Fundamental in combinatorial optimization; worth β‰₯ 2 Fulkerson prizes.
β–ͺ 𝑓 can be extended to a convex function on [0,1]^𝑛 (the Lovász extension).
β–ͺ A subgradient of 𝑓 can be computed in 𝑛² time.
It can be solved in Õ(𝑛³) time. [LSW15] (Previous: Õ(𝑛⁡).)
Algorithmic Convex Geometry
To describe a formula, we need some operations.
Given a convex set 𝐾, we have the following operations:
β–ͺ Membership(x): Check if π‘₯ ∈ 𝐾.
β–ͺ Separation(x): Assert π‘₯ ∈ 𝐾, or find a hyperplane separating π‘₯ and 𝐾.
β–ͺ Width(c): Compute min_{π‘₯∈𝐾} 𝑐ᡀπ‘₯.
β–ͺ Optimize(c): Compute argmin_{π‘₯∈𝐾} 𝑐ᡀπ‘₯.
β–ͺ Sample(g): Sample according to 𝑔(π‘₯)·1_𝐾. (Assume 𝑔 is logconcave.)
β–ͺ Integrate(g): Compute ∫_𝐾 𝑔(π‘₯) 𝑑π‘₯. (Assume 𝑔 is logconcave.)
Theorem: They are all equivalent by polynomial time algorithms.
One of the Major Sources of Polynomial Time Algorithms!
Algorithmic Convex Geometry
Traditionally viewed as impractical.
Now, we have an efficient version of the ellipsoid method.
Why those operations?
For any convex 𝑓, define the conjugate 𝑓*(𝑐) = max_π‘₯ (𝑐ᡀπ‘₯ βˆ’ 𝑓(π‘₯)), and 𝑙_𝐾 = ∞·1_{𝐾ᢜ}.
Convex Optimization:
β–ͺ Membership: 𝑙_𝐾(π‘₯)
β–ͺ Width: 𝑙_𝐾*(𝑐)
β–ͺ Separation: βˆ‚π‘™_𝐾(π‘₯)
β–ͺ Optimization: βˆ‚π‘™_𝐾*(𝑐)
β–ͺ Integration: ∫_𝐾 𝑔(π‘₯) 𝑑π‘₯
β–ͺ Sample: ~𝑒^{βˆ’π‘”}·1_𝐾
Progress: We are getting the tight polynomial equivalence between the first four.
Problem: Sampling
Input: a convex set 𝐾.
Output: sample a point from the uniform distribution on K.
Generalized Problem:
Input: a logconcave distribution 𝑓
Output: sample a point according to 𝑓.
Why? Useful for optimization, integration/counting, learning, rounding.
ο‚– Best way to minimize a convex function with a noisy value oracle.
ο‚– Only known way to compute the volume of a convex set.
Non-trivial application: Convex Bandit
Game: For each round 𝑑 = 1, 2, β‹―, 𝑇:
β–ͺ The adversary selects a convex loss function ℓ_𝑑.
β–ͺ The player chooses (possibly randomly) π‘₯_𝑑 from the unit ball in 𝑛 dimensions, based on past observations.
β–ͺ The player receives the loss/observation ℓ_𝑑(π‘₯_𝑑) ∈ [0,1].
β–ͺ Nothing else about ℓ_𝑑 is revealed!
(Joint work with Sébastien Bubeck and Ronen Eldan.)
Measure performance by regret: 𝑅_𝑇 = Ξ£_𝑑 ℓ_𝑑(π‘₯_𝑑) βˆ’ min_π‘₯ Ξ£_𝑑 ℓ_𝑑(π‘₯).
There is a good fixed action, but
β–ͺ We only learn one point each iteration!
β–ͺ The adversary can give confusing information!
The gold standard is getting Õ(βˆšπ‘‡).
Namely, 𝑛^1000·βˆšπ‘‡ is better than 𝑛·𝑇^{2/3}.
Non-trivial application: Convex Bandit
After a decade of research, we have
𝑅_𝑇 = Õ(𝑛^{10.5})·βˆšπ‘‡.
(The first algorithm with both polynomial running time and βˆšπ‘‡ regret.)
How to Input the set
Oracle Setting:
ο‚– A membership oracle: answers YES/NO to β€œπ‘₯ ∈ 𝐾”.
ο‚– A ball π‘₯ + π‘Ÿπ΅ such that π‘₯ + π‘Ÿπ΅ βŠ‚ 𝐾 βŠ‚ π‘₯ + poly(𝑛)·π‘Ÿπ΅.
Explicit Setting:
ο‚– Given explicitly, such as polytopes, spectrahedra, …
ο‚– In this talk, we focus on the polytope {𝐴π‘₯ β‰₯ 𝑏}. (π‘š = # constraints)
Outline
ο‚– Oracle Setting:
ο‚– Introduce the ball walk
ο‚– KLS conjecture and its related conjectures
ο‚– Main Result
ο‚– Explicit Setting: (original promised talk)
ο‚– Introduce the geodesic walk
ο‚– Bound the # of iteration
ο‚– Bound the cost per iteration
Sampling Problem
Input: a convex set 𝐾 with a membership oracle
Output: sample a point from the uniform distribution on K.
Conjectured Lower Bound: 𝑛2 .
Generalized Problem: Given a logconcave distribution 𝑝, sample π‘₯ from 𝑝.
Conjectured Optimal Algorithm: Ball Walk
At π‘₯, pick random 𝑦 from π‘₯ + 𝛿𝐡𝑛 ,
if 𝑦 is in 𝐾, go to 𝑦.
otherwise, sample again
This walk may get trapped on one side if the set is not convex.
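The ball walk is a few lines of code once a membership oracle is given. A minimal sketch (the function names and parameters here are ours, not from the talk):

```python
import numpy as np

def ball_walk(membership, x0, delta, steps, seed=0):
    """Ball walk: at x, propose y uniform in the ball x + delta*B_n;
    move to y only if the membership oracle says y is in K."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for _ in range(steps):
        d = rng.standard_normal(n)
        # scale to a uniform point in the ball of radius delta
        d *= delta * rng.random() ** (1.0 / n) / np.linalg.norm(d)
        y = x + d
        if membership(y):        # otherwise stay put this round
            x = y
    return x

# Toy example: K = the unit ball in 10 dimensions
in_unit_ball = lambda y: np.linalg.norm(y) <= 1.0
x = ball_walk(in_unit_ball, np.zeros(10), delta=0.2, steps=2000)
```

With 𝛿 on the order of 1/βˆšπ‘› for an isotropic body, this is the walk analyzed on the following slides.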
Isoperimetric constant
For any set 𝐾, we define the isoperimetric constant πœ™_𝐾 by
πœ™_𝐾 = min_𝑆 Area(βˆ‚π‘†) / min(vol(𝑆), vol(𝑆ᢜ)).
(πœ™ large: hard to cut the set; πœ™ small: easy to cut the set.)
Theorem: Given a random point in 𝐾, we can generate another in
𝑂(𝑛/(𝛿² πœ™_𝐾²) · log(1/πœ€))
iterations of Ball Walk, where 𝛿 is the step size.
β–ͺ The larger πœ™_𝐾 or 𝛿, the better the mixing.
β–ͺ 𝛿 cannot be too large; otherwise, the failure probability is ~1.
Isoperimetric constant of Convex Set
Note that πœ™_𝐾 is not affine invariant and can be arbitrarily small.
(Example: a 1 × 𝐿 box has πœ™_𝐾 = 1/𝐿.)
However, you can renormalize 𝐾 such that Cov(𝐾) = 𝐼.
Definition: 𝐾 is isotropic if it has mean 0 and Cov(𝐾) = 𝐼.
Theorem: If 𝛿 < 0.001/βˆšπ‘›, the ball walk stays inside the set with constant probability.
Theorem: Given a random point in isotropic 𝐾, we can generate another in 𝑂(𝑛²/πœ™_𝐾² · log(1/πœ€)) iterations.
To make the body isotropic, we can sample the body to compute the covariance.
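The renormalization step can be sketched directly: estimate the covariance from (approximate) samples and whiten. A toy sketch with our own naming, assuming samples from 𝐾 are available:

```python
import numpy as np

def make_isotropic(samples):
    """Given (approximate) samples from K, return the affine map
    T(x) = L^{-1}(x - mean) that puts the empirical distribution in
    (near-)isotropic position: mean 0, covariance I."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    L = np.linalg.cholesky(cov)          # cov = L L^T
    return lambda x: np.linalg.solve(L, (x - mu).T).T

# Example: a stretched box [0, 10] x [0, 1] becomes near-isotropic
rng = np.random.default_rng(0)
pts = rng.random((5000, 2)) * np.array([10.0, 1.0])
T = make_isotropic(pts)
new_cov = np.cov(T(pts), rowvar=False)   # ~ identity
```

Since the map whitens with the empirical covariance itself, the transformed empirical covariance is the identity up to floating-point error.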
KLS Conjecture
Kannan-Lovász-Simonovits Conjecture:
For any isotropic convex 𝐾, πœ™πΎ = Ξ©(1).
If this is true, Ball Walk takes 𝑂(𝑛²) iterations for isotropic 𝐾.
(This matches the believed information-theoretic lower bound.)
To get the β€œtight” reduction from membership to sampling,
it suffices to prove the KLS conjecture.
KLS conjecture and its related conjectures
Slicing Conjecture:
Any unit-volume convex set 𝐾 has a slice with volume Ξ©(1).
Thin-Shell Conjecture:
For isotropic convex 𝐾, 𝔼(β€–π‘₯β€– βˆ’ βˆšπ‘›)² = 𝑂(1).
Generalized Lévy concentration:
For logconcave 𝑝 and 1-Lipschitz 𝑓 with 𝔼𝑓 = 0,
β„™(|𝑓(π‘₯) βˆ’ 𝔼𝑓| > 𝑑) = exp(βˆ’Ξ©(𝑑)).
Essentially, it is asking whether all convex sets look like ellipsoids.
Main Result
What if we cut the body by spheres only? Define the thin-shell constant
Οƒ_𝐾 ≝ √(𝑛 / Var(‖𝑋‖²)) β‰₯ πœ™_𝐾.
β–ͺ [Lovász-Simonovits 93] πœ™ = Ξ©(1)·𝑛^{βˆ’1/2}.
β–ͺ [Klartag 2006] Οƒ = Ξ©(1)·𝑛^{βˆ’1/2} log^{1/2} 𝑛.
β–ͺ [Fleury, Guédon, Paouris 2006] Οƒ = Ξ©(1)·𝑛^{βˆ’1/2} log^{1/6} 𝑛 · logβ»Β² log 𝑛.
β–ͺ [Klartag 2006] Οƒ = Ξ©(1)·𝑛^{βˆ’0.4}.
β–ͺ [Fleury 2010] Οƒ = Ξ©(1)·𝑛^{βˆ’0.375}.
β–ͺ [Guédon, Milman 2010] Οƒ = Ξ©(1)·𝑛^{βˆ’0.333}.
β–ͺ [Eldan 2012] πœ™ = Ξ©(1)·σ = Ξ©(1)·𝑛^{βˆ’0.333}.
β–ͺ [Lee Vempala 2016] πœ™ = Ξ©(1)·𝑛^{βˆ’0.25}.
In particular, we have 𝑂(𝑛^{2.5}) mixing for the ball walk.
Do you know a better way to bound the mixing time of the ball walk?
Outline
ο‚– Oracle Setting:
ο‚– Introduce the ball walk
ο‚– KLS conjecture and its related conjectures
ο‚– Main Result
ο‚– Explicit Setting:
ο‚– Introduce the geodesic walk
ο‚– Bound the # of iteration
ο‚– Bound the cost per iteration
Problem: Sampling
Input: a polytope with π‘š constraints and 𝑛 variables.
Output: sample a point from the uniform distribution on K.
{𝐴π‘₯ β‰₯ 𝑏}
Iterations Time/Iter
of matrix
Polytopes KN09
Dikin walk
π‘šπ‘›
π‘šπ‘›1.38 Cost
inversion
LV16
Ball walk
𝑛2.5
π‘šπ‘›
LV16
Geodesic walk
π‘šπ‘›0.75
π‘šπ‘›1.38
First sub-quadratic algorithm.
How does nature mix particles?
Brownian Motion.
It works for sampling on ℝ𝑛 .
However, convex set has boundary .
Option 1) Reflect it when you hit the boundary.
However, it needs tiny steps for discretization.
Option 2) Remove the boundary by blowing up.
However, this requires an explicit polytope.
Blowing Up?
(Figure: the uniform distribution on [0,1] on the original polytope becomes, after blowing up, a non-uniform distribution on the real line.)
The distortion makes the hard constraint become β€œsoft”.
Enter Riemannian manifolds
ο‚– 𝑛-dimensional manifold M is an 𝑛-dimensional surface.
ο‚– Each point 𝑝 has a tangent space 𝑇_𝑝𝑀 of dimension 𝑛, the local linear approximation of 𝑀 at 𝑝; tangents of curves in 𝑀 lie in 𝑇_𝑝𝑀.
ο‚– The inner product in 𝑇_𝑝𝑀 depends on 𝑝: ⟨𝑒, π‘£βŸ©_𝑝.
Informally, you can think of it as assigning a unit ball to every point.
Enter Riemannian manifolds
ο‚– Each point 𝑝 has a linear tangent space 𝑇𝑝 𝑀.
ο‚– The inner product in 𝑇_𝑝𝑀 depends on 𝑝: ⟨𝑒, π‘£βŸ©_𝑝.
ο‚– The length of a curve 𝑐: [0,1] β†’ 𝑀 is
𝐿(𝑐) = ∫₀¹ ‖𝑐′(𝑑)β€–_{𝑐(𝑑)} 𝑑𝑑.
ο‚– The distance 𝑑(π‘₯, 𝑦) is the infimum of the lengths of all paths in 𝑀 between π‘₯ and 𝑦.
β€œGeneralized” Ball Walk
At π‘₯, pick a random 𝑦 from 𝐷_π‘₯, where 𝐷_π‘₯ = {𝑦 : 𝑑(π‘₯, 𝑦) ≀ 1}.
Hessian manifold
Hessian manifold: a subset of ℝⁿ with
inner product defined by ⟨𝑒, π‘£βŸ©_𝑝 = 𝑒ᡀ 𝛻²πœ™(𝑝) 𝑣.
For the polytope {π‘Ž_𝑖ᡀπ‘₯ β‰₯ 𝑏_𝑖 βˆ€π‘–}, we use the log barrier function
πœ™(π‘₯) = Ξ£_{𝑖=1}^{π‘š} log(1/𝑠_𝑖(π‘₯)).
ο‚– 𝑠_𝑖(π‘₯) = π‘Ž_𝑖ᡀπ‘₯ βˆ’ 𝑏_𝑖 is the distance from π‘₯ to constraint 𝑖.
ο‚– ⟨·,·⟩_𝑝 blows up when π‘₯ is close to the boundary.
ο‚– Our walk is slower when it is close to the boundary.
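For the log barrier, the local inner product is explicit: 𝛻²πœ™(π‘₯) = Ξ£_𝑖 π‘Ž_𝑖 π‘Ž_𝑖ᡀ/𝑠_𝑖(π‘₯)² = 𝐴ᡀ𝑆(π‘₯)⁻²𝐴. A small sketch (the function name is ours):

```python
import numpy as np

def log_barrier_hessian(A, b, x):
    """Hessian of phi(x) = sum_i log(1 / s_i(x)) for the polytope Ax >= b:
    grad^2 phi(x) = sum_i a_i a_i^T / s_i(x)^2 = A^T S^{-2} A."""
    s = A @ x - b                    # slacks s_i(x); positive in the interior
    return A.T @ (A / s[:, None] ** 2)

# Example: the interval [0, 1] written as {x >= 0, -x >= -1}
A = np.array([[1.0], [-1.0]])
b = np.array([0.0, -1.0])
H = log_barrier_hessian(A, b, np.array([0.5]))   # 1/0.25 + 1/0.25 = 8
```

As π‘₯ approaches either endpoint, one slack goes to 0 and the local norm βˆšπ‘’α΅€π»π‘’ blows up, which is exactly the slowdown near the boundary mentioned above.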
Suggested algorithm
At π‘₯, pick a random 𝑦 from 𝐷_π‘₯,
where 𝐷_π‘₯ = {𝑦 : 𝑑(π‘₯, 𝑦) ≀ 1} is induced by the log barrier (𝐷_π‘₯ is called the Dikin ellipsoid).
Doesn’t work!
(Figure: the walk on the original polytope corresponds to a random walk on the real line in the Hessian manifold.)
It converges to the boundary, since the volume of the β€œboundary” is +∞.
Getting Uniform Distribution
Lemma: If 𝑝(π‘₯ β†’ 𝑦) = 𝑝(𝑦 β†’ π‘₯), then the stationary distribution is uniform.
To make a Markov chain 𝑝 symmetric, we use
𝑝̂(π‘₯ β†’ 𝑦) = min(𝑝(π‘₯ β†’ 𝑦), 𝑝(𝑦 β†’ π‘₯)) if π‘₯ β‰  𝑦 (with the remaining probability, stay at π‘₯).
To implement it, we sample 𝑦 according to 𝑝(π‘₯ β†’ 𝑦):
if 𝑝(π‘₯ β†’ 𝑦) < 𝑝(𝑦 β†’ π‘₯), go to 𝑦;
else, go to 𝑦 with probability 𝑝(𝑦 β†’ π‘₯)/𝑝(π‘₯ β†’ 𝑦);
stay at π‘₯ otherwise.
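This acceptance rule is a one-liner. A toy sketch in our own naming, with a deliberately asymmetric state-dependent Gaussian proposal on (0,1), to show the filter restoring the uniform stationary distribution (the example is ours, not from the talk):

```python
import math
import numpy as np

def metropolis_step(x, propose, density, rng):
    """One step of the symmetrized chain: draw y ~ p(x -> .), accept with
    probability min(1, p(y -> x) / p(x -> y)); otherwise stay at x."""
    y = propose(x, rng)
    num, den = density(y, x), density(x, y)
    # rng.random() * den < num  is exactly  accept w.p. min(1, num/den)
    return y if den > 0 and rng.random() * den < num else x

sigma = lambda x: 0.05 + 0.2 * x        # state-dependent step: NOT symmetric
def propose(x, rng):
    return x + sigma(x) * rng.standard_normal()
def density(x, y):                      # p(x -> y); zero outside (0, 1)
    if not (0 < x < 1 and 0 < y < 1):
        return 0.0
    s = sigma(x)
    return math.exp(-((y - x) / s) ** 2 / 2) / s

rng = np.random.default_rng(0)
xs, x = [], 0.5
for _ in range(20000):
    x = metropolis_step(x, propose, density, rng)
    xs.append(x)
mean = sum(xs) / len(xs)                # should be near 1/2 (uniform target)
```

Without the filter, the drift of the state-dependent proposal would bias the chain; with it, the empirical mean settles near 1/2.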
Dikin Walk
At π‘₯, pick a random 𝑦 from 𝐷_π‘₯;
if π‘₯ βˆ‰ 𝐷_𝑦, reject 𝑦;
else, accept 𝑦 with probability min(1, vol(𝐷_π‘₯)/vol(𝐷_𝑦)).
[Copied from KN09]
[KN09] proved it takes 𝑂(π‘šπ‘›) steps.
Better than the previous best 𝑂(𝑛^{2.5}) for the oracle setting.
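Putting the pieces together, a Dikin-walk sketch for {𝐴π‘₯ β‰₯ 𝑏}. This is a simplified variant (a Gaussian step in the Dikin norm rather than uniform in the ellipsoid, and our own step-size convention), not KN09's exact algorithm; the volume ratio uses vol(𝐷_π‘₯) ∝ det(𝛻²πœ™(π‘₯))^{βˆ’1/2}:

```python
import numpy as np

def dikin_walk(A, b, x, steps, r=0.5, seed=0):
    """Dikin-walk sketch for {Ax >= b}: propose y ~ N(x, (r^2/n) H(x)^{-1}),
    reject if x is not in D_y, else accept w.p. min(1, vol D_x / vol D_y)."""
    rng = np.random.default_rng(seed)
    n = x.size
    H = lambda z: A.T @ (A / ((A @ z - b)[:, None] ** 2))   # log-barrier Hessian
    in_ellipsoid = lambda z, c: (z - c) @ H(c) @ (z - c) <= 1.0
    for _ in range(steps):
        Hx = H(x)
        # sample with covariance (r^2/n) * Hx^{-1}: u = L^{-T} w, Hx = L L^T
        w = rng.standard_normal(n)
        y = x + (r / np.sqrt(n)) * np.linalg.solve(np.linalg.cholesky(Hx).T, w)
        if np.any(A @ y - b <= 0) or not in_ellipsoid(x, y):
            continue                                         # reject
        # vol D_x / vol D_y = sqrt(det H(y) / det H(x))
        ratio = np.sqrt(np.linalg.det(H(y)) / np.linalg.det(Hx))
        if rng.random() < min(1.0, ratio):
            x = y
    return x

# Example: sample inside the square [0, 1]^2
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([0.0, 0.0, -1.0, -1.0])
x = dikin_walk(A, b, np.array([0.5, 0.5]), steps=3000)
```

The chain never needs a membership oracle: the slacks 𝐴𝑦 βˆ’ 𝑏 certify interiority directly, which is the point of the explicit setting.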
Dikin Walk and its Limitation
(At π‘₯, pick a random 𝑦 from 𝐷_π‘₯; if π‘₯ βˆ‰ 𝐷_𝑦, reject 𝑦; else, accept 𝑦 with probability min(1, vol(𝐷_π‘₯)/vol(𝐷_𝑦)).)
β–ͺ The Dikin ellipsoid is fully contained in 𝐾.
β–ͺ Idea: Pick the next step 𝑦 from a blown-up Dikin ellipsoid.
β–ͺ We can afford to blow up by ~√(𝑛/log π‘š); WHP 𝑦 ∈ 𝐾.
β–ͺ In high dimension, the volume of 𝐷_π‘₯ is not that smooth (worst case: [0,1]ⁿ).
β–ͺ Any larger step makes the success probability exponentially small!
β–ͺ [0,1]ⁿ is the worst case for ball walk, hit-and-run, and Dikin walk.
Going back to Brownian Motion
(At π‘₯, pick a random 𝑦 from 𝐷_π‘₯; if π‘₯ βˆ‰ 𝐷_𝑦, reject 𝑦; else, accept 𝑦 with probability min(1, vol(𝐷_π‘₯)/vol(𝐷_𝑦)).)
The walk is not symmetric in the β€œspace”: it has a tendency of going to the center.
(Figure: original polytope and the corresponding Hessian manifold.)
Taking the step size to 0, the Dikin walk becomes a stochastic differential equation
𝑑π‘₯_𝑑 = πœ‡(π‘₯_𝑑)𝑑𝑑 + Οƒ(π‘₯_𝑑)π‘‘π‘Š_𝑑,
where Οƒ(π‘₯_𝑑) = (πœ™β€³(π‘₯_𝑑))^{βˆ’1/2} and πœ‡(π‘₯_𝑑) is the drift towards the center.
What is the drift? Fokker-Planck equation
The probability distribution of the SDE
𝑑π‘₯_𝑑 = πœ‡(π‘₯_𝑑)𝑑𝑑 + Οƒ(π‘₯_𝑑)π‘‘π‘Š_𝑑
is given by
βˆ‚π‘/βˆ‚π‘‘ (π‘₯, 𝑑) = βˆ’βˆ‚/βˆ‚π‘₯ [πœ‡(π‘₯)𝑝(π‘₯, 𝑑)] + (1/2)·βˆ‚Β²/βˆ‚π‘₯² [σ²(π‘₯)𝑝(π‘₯, 𝑑)].
To make the stationary distribution constant, we need
βˆ’βˆ‚/βˆ‚π‘₯ πœ‡(π‘₯) + (1/2)·βˆ‚Β²/βˆ‚π‘₯² σ²(π‘₯) = 0.
Hence, we have πœ‡(π‘₯) = Οƒ(π‘₯)Οƒβ€²(π‘₯).
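The drift πœ‡ = ΟƒΟƒβ€² can be sanity-checked symbolically: with constant density 𝑝 ≑ 1, the stationary Fokker-Planck residual vanishes identically for any smooth Οƒ:

```python
import sympy as sp

x = sp.Symbol('x')
sigma = sp.Function('sigma')(x)          # generic diffusion coefficient
mu = sigma * sp.diff(sigma, x)           # candidate drift mu = sigma * sigma'

# Stationary Fokker-Planck with constant density p(x) = 1:
# -d/dx [mu * p] + (1/2) d^2/dx^2 [sigma^2 * p]
residual = -sp.diff(mu, x) + sp.Rational(1, 2) * sp.diff(sigma**2, x, 2)
print(sp.simplify(residual))             # 0
```

Indeed, 𝑑/𝑑π‘₯(ΟƒΟƒβ€²) = Οƒβ€²Β² + ΟƒΟƒβ€³ and (1/2)(σ²)β€³ = Οƒβ€²Β² + ΟƒΟƒβ€³, so the two terms cancel exactly.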
A New Walk
A new walk:
π‘₯𝑑+β„Ž = π‘₯𝑑 + β„Ž β‹… πœ‡ π‘₯𝑑 + 𝜎 π‘₯𝑑 π‘Š
with π‘Š~𝑁(0, β„ŽπΌ).
It doesn’t make sense.
Exponential map
ο‚– Exponential map exp𝑝 : 𝑇𝑝 𝑀 β†’ 𝑀 is defined as
ο‚– exp_𝑝(𝑣) = 𝛾_𝑣(1), where
ο‚– 𝛾_𝑣 is the unique geodesic (shortest path) from 𝑝 with initial velocity 𝑣.
Geodesic Walk
A new walk:
π‘₯𝑑+β„Ž = expπ‘₯𝑑 (β„Ž/2 β‹… πœ‡ π‘₯𝑑 + 𝜎 π‘₯𝑑 π‘Š)
with π‘Š~𝑁(0, β„ŽπΌ).
Anyway to avoid
using filter?
However, this walk has discretization error.
So, we do a metropolis filter after.
Since our walk is complicated, the filter is super complicated.
Outline
ο‚– Oracle Setting:
ο‚– Introduce the ball walk
ο‚– KLS conjecture and its related conjectures
ο‚– Main Result
ο‚– Explicit Setting: (original promised talk)
ο‚– Introduce the geodesic walk
ο‚– Bound the # of iteration
ο‚– Bound the cost per iteration
Geodesic Walk
A new walk:
π‘₯𝑑+β„Ž = expπ‘₯𝑑 (β„Ž/2 β‹… πœ‡ π‘₯𝑑 + π‘Š)
with π‘Š~𝑁(0, β„ŽπΌ).
Geodesic is better than β€œstraight line”:
1) It extends infinitely.
2) It gives a massive cancellation.
Key Lemma 1: Provable Long Geodesic
A straight line is defined only for a finite time (until it hits the boundary); a geodesic is defined for all time.
Thm [LV16]: For the manifold induced by the log barrier,
a random geodesic 𝛾 starting from π‘₯ satisfies
|π‘Ž_𝑖ᡀ𝛾′(𝑑)| ≀ 𝑂(𝑛^{βˆ’1/4})·(π‘Ž_𝑖ᡀπ‘₯ βˆ’ 𝑏_𝑖) for 0 ≀ 𝑑 ≀ 𝑂(𝑛^{1/4}).
Namely, the geodesic is well behaved for a long time.
Remark:
If the central path in IPMs had this, we would have an π‘š^{5/4}-time algorithm for MaxFlow!
Key Lemma 2: Massive Cancellation
Consider an SDE on the 1-dimensional real line (NOT a manifold):
𝑑π‘₯_𝑑 = πœ‡(π‘₯_𝑑)𝑑𝑑 + Οƒ(π‘₯_𝑑)π‘‘π‘Š_𝑑.
How good is the β€œEuler method”, namely π‘₯β‚€ + β„Žπœ‡(π‘₯β‚€) + βˆšβ„Ž·Οƒ(π‘₯β‚€)π‘Š?
By β€œTaylor” expansion, we have
π‘₯_β„Ž = π‘₯β‚€ + β„Žπœ‡(π‘₯β‚€) + βˆšβ„Ž·Οƒ(π‘₯β‚€)π‘Š + (β„Ž/2)·σ′(π‘₯β‚€)Οƒ(π‘₯β‚€)(π‘ŠΒ² βˆ’ 1) + 𝑂(β„Ž^{1.5}).
If Οƒβ€²(π‘₯β‚€) β‰  0, the error is 𝑂(β„Ž).
If Οƒβ€²(π‘₯β‚€) = 0, the error is 𝑂(β„Ž^{1.5}).
For the geodesic walk, Οƒβ€²(π‘₯β‚€) = 0 (Christoffel symbols vanish in normal coordinates).
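The π‘ŠΒ² βˆ’ 1 correction is exactly what the Milstein scheme adds to Euler. A quick numerical illustration on a toy SDE of our choosing (geometric Brownian motion, where Οƒ(π‘₯) = 𝑠·π‘₯ has Οƒβ€² β‰  0), comparing strong errors against the exact solution driven by the same Brownian path:

```python
import numpy as np

def strong_errors(mu=0.05, s=0.4, T=1.0, nsteps=64, npaths=4000, seed=0):
    """Compare Euler and Milstein on dx = mu*x dt + s*x dW (geometric BM).
    sigma(x) = s*x has sigma' = s != 0, so the correction term matters.
    Returns mean absolute error at time T against the exact solution."""
    rng = np.random.default_rng(seed)
    h = T / nsteps
    xe = np.ones(npaths)        # Euler paths
    xm = np.ones(npaths)        # Milstein paths
    W = np.zeros(npaths)        # accumulated Brownian motion
    for _ in range(nsteps):
        dW = np.sqrt(h) * rng.standard_normal(npaths)
        xe += mu * xe * h + s * xe * dW
        # extra term: (1/2) * sigma(x) * sigma'(x) * (dW^2 - h)
        xm += mu * xm * h + s * xm * dW + 0.5 * (s * xm) * s * (dW**2 - h)
        W += dW
    exact = np.exp((mu - s**2 / 2) * T + s * W)   # same Brownian path
    return np.mean(np.abs(xe - exact)), np.mean(np.abs(xm - exact))

e_euler, e_milstein = strong_errors()
```

Under these settings the Milstein error is an order of magnitude below Euler's, matching the 𝑂(βˆšβ„Ž) vs 𝑂(β„Ž) strong rates.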
Convergence Theorem
Thm [LV16]: For the log barrier, the geodesic walk mixes in 𝑂(π‘šπ‘›^{0.75}) steps.
Thm [LV16]: For the log barrier on [0,1]ⁿ, it mixes in 𝑂(𝑛^{1/3}) steps. 
The best bound for ball walk, hit-and-run, and Dikin walk is 𝑂(𝑛²) steps for [0,1]ⁿ.
Our walk is similar to the Milstein method.
(Are higher-order methods for SDEs used in MCMC?)
Outline
ο‚– Oracle Setting:
ο‚– Introduce the ball walk
ο‚– KLS conjecture and its related conjectures
ο‚– Main Result
ο‚– Explicit Setting: (original promised talk)
ο‚– Introduce the geodesic walk
ο‚– Bound the # of iteration
ο‚– Bound the cost per iteration
How to implement the algorithm
(Can we simply do a Taylor expansion? In high dimension, it may take 𝑛ᡏ time to compute the π‘˜-th derivatives.)
In the tangent plane at π‘₯:
1. Pick 𝑀 ∼ 𝑁_π‘₯(0, 𝐼), i.e., a standard Gaussian in β€–Β·β€–_π‘₯.
2. Compute 𝑦 = exp_π‘₯((β„Ž/2)·πœ‡(π‘₯) + βˆšβ„Ž·π‘€).
3. Accept with probability min(1, 𝑝(𝑦 β†’ π‘₯)/𝑝(π‘₯ β†’ 𝑦)).
How to compute the geodesic and the rejection probability?
We need high accuracy for the rejection probability  due to β€œdirectedness”.
The geodesic is given by the geodesic equation; the probability is given by the Jacobi field.
Collocation Method for ODE
A weakly polynomial time algorithm for some ODEs
Consider the ODE 𝑦′ = 𝑓(𝑑, 𝑦(𝑑)) with 𝑦(0) = 𝑦₀.
Given a degree 𝑑 polynomial π‘ž and distinct points 𝑑₁, 𝑑₂, β‹―, 𝑑_𝑑,
let 𝑇(π‘ž) be the unique degree 𝑑 polynomial 𝑝 s.t.
𝑝′(𝑑) = 𝑓(𝑑, π‘ž(𝑑)) on 𝑑 = 𝑑₁, 𝑑₂, β‹―, 𝑑_𝑑, and
𝑝(0) = π‘ž(0).
Lem [LV16]: 𝑇 is well defined. If the 𝑑_𝑖 are Chebyshev points on [0,1], then
Lip(𝑇) = 𝑂(Lip(𝑓)).
Thm [LV16]: If Lip(𝑓) ≀ 0.001, we can find a fixed point of 𝑇 efficiently.
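A minimal sketch of this fixed-point iteration (implementation choices here are ours: polynomials represented by values at Chebyshev nodes, interpolation via polyfit, integration via polyint; not the paper's exact procedure):

```python
import numpy as np

def collocation_fixed_point(f, y0, d=8, iters=40):
    """Solve y' = f(t, y), y(0) = y0 on [0, 1] by collocation:
    iterate q -> p, where p' matches f(t, q(t)) at the Chebyshev
    points and p(0) = y0, until the map reaches its fixed point."""
    # Chebyshev points mapped from [-1, 1] to [0, 1]
    t = (1 - np.cos((2 * np.arange(1, d + 1) - 1) * np.pi / (2 * d))) / 2
    q = np.full(d, float(y0))               # iterate: values at the nodes
    for _ in range(iters):
        r = np.polyfit(t, f(t, q), d - 1)   # interpolate the slopes
        p = np.polyint(r, k=y0)             # integrate; enforces p(0) = y0
        q = np.polyval(p, t)
    return np.poly1d(p)

# Example: y' = -y/2, y(0) = 1, whose solution at t = 1 is exp(-1/2)
p = collocation_fixed_point(lambda t, y: -y / 2, 1.0)
err = abs(p(1.0) - np.exp(-0.5))            # tiny
```

For a smooth, small-Lipschitz 𝑓 as in the theorem, the iteration is a contraction, so a constant number of cheap polynomial operations replaces a fine time-stepping scheme.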
Collocation Method for ODE
A weakly polynomial time algorithm for some ODEs
Consider the ODE 𝑦 β€² = 𝑓 𝑑, 𝑦(𝑑) with 𝑦 0 = 𝑦0 .
Thm [LV16]: Suppose that
ο‚– Lip(𝑓) ≀ 0.001, and
ο‚– there is a degree 𝑑 polynomial 𝑝 such that ‖𝑝′ βˆ’ 𝑓‖ ≀ πœ€.
Then, we can find a 𝑦̄ such that ‖𝑦̄ βˆ’ 𝑦‖ = 𝑂(πœ€) in time
𝑂(𝑑 log²(𝑑/πœ€)) with 𝑂(𝑑 log(𝑑/πœ€)) evaluations of 𝑓.
Remark: No need to compute 𝑓′!
In general, the runtime is 𝑂(𝑛𝑑·Lip^{𝑂(1)}(𝑓)) instead.
How can I bound the 170-th derivatives?
For a function of 1 variable, we can estimate the π‘˜-th derivatives easily.
Idea: reduce estimating derivatives of general functions to 1 variable.
In general, we write 𝐹 ≀_π‘₯ 𝑓 if
β€–Dᡏ𝐹(π‘₯)β€– ≀ 𝑓^{(π‘˜)}(0) for all π‘˜.
Calculus rule: if 𝐹 ≀_π‘₯ 𝑓 and 𝐺 ≀_{𝐹(π‘₯)} 𝑔, then 𝐺 ∘ 𝐹 ≀_π‘₯ 𝑔 ∘ (𝑓 βˆ’ 𝑓(0)).
Implementation Theorem
Using the trick before, we show the geodesic can be approximated by an 𝑂(1)-degree polynomial.
Hence, the collocation method finds it in 𝑂(1) steps.
Thm [LV16]: If β„Ž ≀ 𝑛^{βˆ’1/2}, 1 step of the geodesic walk can be implemented
in matrix multiplication time.
For the hypercube, β„Ž ≀ 𝑂(1) suffices.
Questions
ο‚– We have no background in numerical ODE/SDE and Riemannian geometry,
so the running time should be easily improvable.
ο‚– How to avoid the filtering step?
ο‚– Is there a way to tell whether a walk has mixed or not?
(I.e., even if we cannot prove KLS, the algorithm can still stop early.)
ο‚– Are higher-order methods for SDEs useful in MCMC?
ο‚– Any other suggestions/heuristics for sampling on convex sets?