
Simplicity and Inference
or
Some Case Studies From Statistics and
Machine Learning
or
Why We Need a Theory of Oversmoothing
CFE Workshop, Carnegie Mellon, June 2012
Larry Wasserman
Dept of Statistics and Machine Learning Department
Carnegie Mellon University
Conclusion
Statistics needs a rigorous theory of oversmoothing (underfitting).
There are hints:
G. Terrell (the oversmoothing principle)
D. Donoho (one-sided inference)
L. Davies (simplest model consistent with the data).
But, as I’ll show, we need a more general way to do this.
Plan
1. Regression
2. Graphical Models
3. Density Estimation (simplicity versus L2)
4. Topological data analysis
Preview of the Examples
1. (High-Dimensional) Regression: Observe (X1, Y1), . . . , (Xn, Yn).
Observe new X.
Predict new Y .
Here, Y ∈ R and X ∈ R^d with d > n.
2. (High-Dimensional) Undirected Graphs.
X ∼ P. G = G(P) = (V, E). V = {1, . . . , d}. E = edges.
No edge between j and k means Xj ⊥⊥ Xk | rest.
Preview of the Examples
3. Density Estimation.
Y1, . . . , Yn ∼ P and P has density p. Estimator:
\[
\hat p_h(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left( \frac{\|y - Y_i\|}{h} \right).
\]
Need to choose h.
4. Topological Data Analysis.
X1, . . . , Xn ∼ G.
G is supported on a manifold M .
Observe Yi = Xi + εi.
Want to recover the homology of M .
Regression
Best predictor is
m(x) = E(Y |X = x).
Assume only iid and bounded random variables.
There is no uniformly consistent (distribution free) estimator of
m.
How about the best linear predictor? Excess risk:
\[
\mathcal{E}(\hat\beta) = E(Y - \hat\beta^T X)^2 - \inf_{\beta} E(Y - \beta^T X)^2 .
\]
But, as n → ∞ and d = d(n) → ∞,
\[
\inf_{\hat\beta}\, \sup_{P}\, \mathcal{E}(\hat\beta) \to \infty .
\]
Simplicity: Best Sparse Linear Predictor
Let
\[
B_k = \{ \beta : \|\beta\|_0 \le k \}
\]
where ||β||0 = #{j : βj ≠ 0}. Small ||β||0 = simplicity.
Good news: If β̂ is the best subset estimator then
\[
E(Y - \hat\beta^T X)^2 - \inf_{\beta \in B_k} E(Y - \beta^T X)^2 \to 0 .
\]
Bad news: Computing β̂ is NP-hard.
Convex Relaxation: The Lasso
Let
\[
B_L = \{ \beta : \|\beta\|_1 \le L \}
\]
where ||β||1 = Σj |βj|. Note that L controls sparsity (simplicity).
Oracle: β∗ minimizes R(β) over BL.
Lasso: β̂ minimizes (1/n) Σi (Yi − β^T Xi)^2 over BL.
In this case:
\[
\sup_{P} P\big( R(\hat\beta) > R(\beta_*) + \epsilon \big) \le \exp(-c\, n\, \epsilon^2).
\]
But how to choose L?
Choosing L (estimating the simplicity)
Usually, we minimize a risk estimator R̂(L) (such as cross-validation).
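For concreteness, a minimal sketch of that standard approach (not code from the talk): the penalized form of the lasso with the penalty chosen by cross-validation, via scikit-learn's LassoCV. The simulated sizes mirror the example below (d = 80, n = 40, true model size 5).

```python
# Sketch: choosing the lasso penalty by cross-validation (penalized form;
# larger penalty = smaller ||beta||_1 = simpler model).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d, k = 40, 80, 5                        # d > n, true model size 5, as in the simulation below
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:k] = 2.0
Y = X @ beta_true + rng.normal(size=n)

fit = LassoCV(cv=5).fit(X, Y)              # penalty chosen to minimize the estimated (CV) risk
selected = np.flatnonzero(fit.coef_)
print("CV-selected model size:", selected.size)   # typically larger than k: overfitting
```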
It is known that this overfits.
Theorem (Meinshausen and Bühlmann; Wasserman and Roeder):
\[
P^n\big( \mathrm{support}(\beta_*) \subset \mathrm{support}(\hat\beta) \big) \to 1
\]
where β∗ minimizes the true prediction loss and β̂ is the cross-validated lasso.
But if we try to correct by moving to a simpler model, we risk
huge losses since the risk function is asymmetric.
Here is a simulation: true model size is 5. (d = 80, n = 40).
[Figure: estimated risk versus number of variables for the simulation.]
Corrected Risk Estimation
In other words:
simplicity ≠ accurate prediction
High predictive accuracy requires that we overfit.
What if we want to force more simplicity? Can we correct the
overfitting without incurring a disaster?
Safe simplicity:
\[
Z(\Lambda) = \sup_{\ell \ge \Lambda} \frac{ |\hat R(\Lambda) - \hat R(\ell)| }{ s(\Lambda, \ell) } .
\]
(This is Lepski's nonparametric method, adapted to the lasso.)
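One way to read this formula (my interpretation, not code from the talk): R̂ is a risk estimate on a grid of penalty values, indexed so that a larger index means a larger (less penalized) model, s(Λ, ℓ) is some standard error for the risk difference, and we keep the simplest index whose Z stays below a threshold. A sketch, with R_hat, se, and the threshold 2 all as placeholder assumptions:

```python
# Sketch of the "safe simplicity" statistic; R_hat and se are placeholder inputs.
import numpy as np

def safe_simplicity_Z(R_hat, se):
    """Z[Lam] = max over l >= Lam of |R_hat[Lam] - R_hat[l]| / se[Lam, l]."""
    m = len(R_hat)
    Z = np.zeros(m)
    for lam in range(m):
        ratios = [abs(R_hat[lam] - R_hat[l]) / se[lam, l]
                  for l in range(lam + 1, m)]
        Z[lam] = max(ratios) if ratios else 0.0
    return Z

# Pick the simplest (most penalized) index whose Z is below a threshold, say 2:
# chosen = min(lam for lam in range(len(R_hat)) if Z[lam] <= 2)
```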
[Figure: cross-validation curve and the statistic z plotted against the number of variables (nv).]
Screen and Clean
Even better: Screen and Clean (Wasserman and Roeder, Annals
2009).
Split data into three parts:
Part 1: Fit lasso
Part 2: Variable selection by cross-validation
Part 3: Least squares on surviving variables followed by ordinary
hypothesis testing.
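A rough sketch of the three stages under simplifying assumptions (an illustration, not the exact procedure of the paper; the penalty grid, the equal three-way split, and the Bonferroni cleaning level are arbitrary choices):

```python
# Sketch of the screen-and-clean idea: split, screen with the lasso, clean with t-tests.
import numpy as np
from sklearn.linear_model import Lasso
import scipy.stats as st

def screen_and_clean(X, Y, alphas=np.logspace(-2, 1, 20), level=0.05):
    n = len(Y)
    i1, i2, i3 = np.array_split(np.random.permutation(n), 3)   # three data splits

    # Part 1: fit the lasso path on the first split.
    fits = [Lasso(alpha=a, max_iter=5000).fit(X[i1], Y[i1]) for a in alphas]

    # Part 2: variable selection: keep the fit with smallest error on the second split.
    errs = [np.mean((Y[i2] - f.predict(X[i2])) ** 2) for f in fits]
    keep = np.flatnonzero(fits[int(np.argmin(errs))].coef_)     # screened variables

    # Part 3: least squares on the survivors, ordinary t-tests on the third split.
    Xk = np.column_stack([np.ones(len(i3)), X[i3][:, keep]])
    beta, *_ = np.linalg.lstsq(Xk, Y[i3], rcond=None)
    resid = Y[i3] - Xk @ beta
    df = len(i3) - Xk.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
    pvals = 2 * st.t.sf(np.abs(beta / se), df)
    return keep[pvals[1:] < level / max(len(keep), 1)]          # Bonferroni-style cleaning
```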
[Figure: test statistics T for each of the d candidate variables after screening and cleaning.]
Screen and Clean
However, this is getting complicated (and inefficient).
What happens when linearity is false, when there are high correlations, etc.?
Is there anything simpler?
Graphs
X = (X1, . . . , Xd).
G = (V, E).
V = {1, . . . , d}.
E = edges.
(j, k) ∉ E means that Xj ⊥⊥ Xk | rest.
The chain graph X − Z − Y means X ⊥⊥ Y | Z.
Observe: X^(1), . . . , X^(n) ∼ P. Infer G.
Graphs
Common approach: assume X^(1), . . . , X^(n) ∼ N(µ, Σ).
Find µ̂, Σ̂ to maximize
\[
\text{loglikelihood}(\mu, \Sigma) - \lambda \sum_{j \ne k} |\Omega_{jk}|
\]
where Ω = Σ^{-1}.
Omit an edge if Ω̂_{jk} = 0.
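A minimal sketch of this approach using scikit-learn's GraphicalLassoCV, where λ is chosen by cross-validation (precisely the step questioned below):

```python
# Sketch of the Gaussian graphical lasso with lambda chosen by cross-validation.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                 # n = 200 samples of a d = 10 vector

model = GraphicalLassoCV().fit(X)
Omega = model.precision_                        # estimate of Sigma^{-1}
edges = [(j, k) for j in range(10) for k in range(j + 1, 10)
         if abs(Omega[j, k]) > 1e-8]            # keep edge (j, k) iff Omega_jk != 0
print("estimated edges:", edges)
```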
Same problems as lasso: no good way to choose λ. In addition,
non-Gaussianity seems to lead to overfitting.
The latter can be alleviated using forests.
Forests
A forest is a graph with no cycles. In this case
\[
p(x) = \prod_{j=1}^{d} p_j(x_j) \prod_{(j,k) \in E} \frac{p_{jk}(x_j, x_k)}{p_j(x_j)\, p_k(x_k)} .
\]
The densities can be estimated nonparametrically. The edge set
can be estimated by the Chow-Liu algorithm based on nonparametric estimates of mutual information I(Xj , Xk ).
Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
(JMLR 2010)
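A sketch of the Chow-Liu step (a simplified stand-in, not the paper's estimator: mutual information is crudely estimated by discretizing each variable, and scipy's spanning-tree routine is run on the negated weights to get a maximum-weight structure):

```python
# Chow-Liu-style forest: maximum-weight spanning structure on pairwise mutual information.
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X, bins=10):
    n, d = X.shape
    disc = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins))
                     for j in range(d)], axis=1)          # crude discretization
    MI = np.zeros((d, d))
    for j in range(d):
        for k in range(j + 1, d):
            MI[j, k] = mutual_info_score(disc[:, j], disc[:, k])
    # Maximum-weight spanning tree = minimum spanning tree on negated weights.
    tree = minimum_spanning_tree(-MI).toarray()
    return [(j, k) for j in range(d) for k in range(d) if tree[j, k] != 0]
```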
Gene Microarray:

[Figure: estimated graphs for the gene microarray data, glasso versus the nonparametric forest.]
Synthetic Example:

[Figure: the true graph, the nonparametric forest estimate, and the best-fit glasso.]
What We Really Want
Cannot estimate the truth. There is no universally consistent,
distribution free test of
H0 : X ⊥⊥ Y | Z.
We are better off asking: What is the simplest graphical model
consistent with the data?
What We Really Want
Precedents for this are Davies, Terrell, Donoho, etc. Here is Davies' idea (simplified).
Observe (X1, Y1), . . . , (Xn, Yn). For any function m we can write
Yi = m(Xi) + εi = signal + noise
where εi = Yi − m(Xi). He finds the “simplest” function m such
that ε1, . . . , εn look like “noise.”
Davies, Kovac and Meise (2009) and Davies and Kovac (2001).
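A crude illustration of the idea (this is not Davies' multiresolution method; the polynomial family and the normality test on residuals are stand-ins): grow the flexibility of the fit until the residuals pass a looks-like-noise check, and keep the simplest fit that passes.

```python
# Crude illustration: simplest polynomial fit whose residuals look like noise.
import numpy as np
from numpy.polynomial import polynomial as P
from scipy.stats import kstest

def simplest_fit(X, Y, max_degree=15, alpha=0.05):
    for deg in range(max_degree + 1):                 # simplest model first
        coef = P.polyfit(X, Y, deg)
        resid = Y - P.polyval(X, coef)
        z = (resid - resid.mean()) / resid.std(ddof=1)
        if kstest(z, "norm").pvalue > alpha:          # residuals plausibly noise?
            return deg, coef
    return max_degree, coef
```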
How do we do this for graphical models?
Density Estimation
Seemingly an old, solved problem.
Y1, . . . , Yn ∼ P where Yi ∈ Rd and P might have a density p.
Kernel estimator:
\[
\hat p_h(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, K\!\left( \frac{\|y - Y_i\|}{h} \right).
\]
Here, K is a kernel and h > 0 is a bandwidth.
How do we choose h? Usually, we minimize an estimate of
\[
R(h) = E \int \big( \hat p_h(x) - p(x) \big)^2 \, dx .
\]
But this is the wrong loss function ...
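For reference, a minimal sketch of this standard L2 choice: least-squares cross-validation for a one-dimensional Gaussian kernel (the bandwidth grid is arbitrary).

```python
# Least-squares cross-validation: minimize an estimate of the L2 risk over h.
import numpy as np
from scipy.stats import norm

def lscv_score(h, Y):
    """Estimate of the L2 risk of p_hat_h (up to a term not depending on h)."""
    n = len(Y)
    D = Y[:, None] - Y[None, :]                                  # all pairwise differences
    term1 = norm.pdf(D, scale=np.sqrt(2) * h).sum() / n**2       # integral of p_hat_h^2
    K = norm.pdf(D, scale=h)
    loo = (K.sum(axis=1) - norm.pdf(0, scale=h)) / (n - 1)       # leave-one-out p_hat at each Y_i
    return term1 - 2 * loo.mean()

Y = np.random.default_rng(2).normal(size=200)
hs = np.linspace(0.05, 2.0, 50)
h_cv = hs[np.argmin([lscv_score(h, Y) for h in hs])]
print("L2-cross-validated bandwidth:", h_cv)
```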
[Figure: three densities p0, p1 and p2 with L2(p0, p1) = L2(p0, p2).]
[Figure: the true density and kernel estimates with h = .01, h = .4, and h = 4.]
Density Estimation
More generally p might have: smooth parts, singularities, near
singularities (mass concentrated near manifolds) etc.
In principle we can use Lepski’s method: choose a local bandwidth
\[
\hat h(x) = \sup\big\{ h : |\hat p_h(x) - \hat p_t(x)| \le \psi(t, h) \ \text{for all } t < h \big\}.
\]
Lepski and Spokoiny 1997, Lepski, Mammen and Spokoiny 1997.
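A hedged sketch of one variant of this rule (the threshold ψ below is an ad hoc plug-in depending only on t, not the choice from these papers):

```python
# Lepski-style local bandwidth: largest h on a grid consistent with all smaller t.
import numpy as np
from scipy.stats import norm

def p_hat(x, Y, h):
    """One-dimensional Gaussian-kernel density estimate at x."""
    return norm.pdf((x - Y) / h).mean() / h

def lepski_bandwidth(x, Y, hs, c=1.0):
    n = len(Y)
    hs = np.sort(np.asarray(hs))
    psi = lambda t: c * np.sqrt(np.log(n) / (n * t))   # assumed deviation bound
    best = hs[0]
    for i, h in enumerate(hs):
        if all(abs(p_hat(x, Y, h) - p_hat(x, Y, t)) <= psi(t) for t in hs[:i]):
            best = h                                    # h is still consistent with all t < h
    return best
```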
It leads to this ...
[Figure: the local bandwidth ĥ(x) and the resulting density estimate p̂(x) on the example.]
Oversmoothing
We really just want a principled way to oversmooth. Terrell and
Scott (1985) and Terrell (1990) suggest the following: choose
the largest amount of smoothing compatible with the scale of
the density.
The asymptotically optimal bandwidth (with d = 1) is
\[
h = \left( \frac{\int K^2(x)\, dx}{n\, \sigma_K^4\, I(p)} \right)^{1/5}
\]
where I = ∫ (p'')². Now: minimize I = I(p) subject to:
\[
T(P) = T(\hat P_n)
\]
where T(·) is the variance.
Oversmoothing
Solution:
\[
h = \frac{1.47\, s}{n^{1/5}} \left( \int K^2 \right)^{1/5} .
\]
Good idea, but:
- it is still based on L2 loss
- it is based on an asymptotic expression for optimal h.
We need a finite sample version with a more appropriate loss
function.
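For concreteness, a small sketch of the rule above for a Gaussian kernel; the only added assumption is that ∫K² = 1/(2√π) for the standard normal kernel.

```python
# Oversmoothed bandwidth h = 1.47 * s * (integral of K^2)^(1/5) / n^(1/5).
import numpy as np

def oversmoothed_bandwidth(Y):
    n, s = len(Y), np.std(Y, ddof=1)
    RK = 1.0 / (2.0 * np.sqrt(np.pi))      # integral of K^2 for the standard normal kernel
    return 1.47 * s * RK ** 0.2 / n ** 0.2

Y = np.random.default_rng(3).normal(size=200)
print("oversmoothed h:", oversmoothed_bandwidth(Y))
```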
Here it is on our example:
Oversmoothing

[Figure: the oversmoothed kernel density estimate on the example.]
Summary so far ...
Our current methods select models that are too complex.
Are there simple methods for choosing simple models?
Now, a more exotic application ...
Topological Data Analysis
X1, . . . , Xn ∼ G where G is supported on a manifold M .
Here Xi ∈ R^D but dimension(M) = d < D.
Observe Yi = Xi + εi.
Goal: infer the homology of M .
Homology: clusters, holes, tunnels, etc.
One Cluster
[Figure: a point cloud forming a single cluster.]
One Cluster + One Hole
[Figure: a point cloud forming a single cluster with a hole.]
Inference
The Niyogi, Smale, Weinberger (2008) estimator:
1. Estimate density. (h)
2. Throw away low density points. (t)
3. Form a Čech complex. (ε)
4. Apply an algorithm from computational geometry.

Usually, the results are summarized as a function of ε in a barcode plot (or a persistence diagram).
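A toy sketch of the first steps (only the 0-th homology, i.e. the number of connected components, with an ε-neighborhood graph standing in for the Čech complex; h is whatever scipy's gaussian_kde picks, and t and ε are arbitrary, which is exactly the tuning-parameter problem raised below):

```python
# Toy version: density estimate, threshold, and 0-th homology (number of clusters).
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def estimated_betti0(Y, t=0.10, eps=0.5):
    dens = gaussian_kde(Y.T)(Y.T)                      # step 1: density at each data point
    high = Y[dens > np.quantile(dens, t)]              # step 2: drop the low-density fraction t
    D = squareform(pdist(high))                        # step 3: eps-neighborhood graph
    ncc, _ = connected_components(csr_matrix(D <= eps), directed=False)
    return ncc                                          # step 4: number of connected components
```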
Example: from Horak, Maletic and Rajkovic (2008)
The Usual Heuristic
Usually, one assumes that small barcodes are topological noise.
This is really a statistical problem with many tuning parameters.
Currently, there are no methods for choosing the tuning parameters.
What we want: the simplest topology consistent with the data.
Working on this with Sivaraman Balakrishnan, Aarti Singh, Alessandro Rinaldo and Don Sheehy.
Summary
We still don’t know how to choose simple models.
Summary
THE END