Leverage score sampling for large-scale least-squares
Garvesh Raskutti
joint with Michael Mahoney
Massive Datasets Transition Workshop, SAMSI
May 20, 2013
Outline
Two themes: leverage score sampling and randomized algorithms.
Problem statement.
Random and deterministic leverage-score sampling.
Results.
Future directions.
Data reduction for large-scale least-squares
Ordinary least-squares estimator for data (X, Y), X ∈ R^{n×p}, Y ∈ R^n (rank(X) = p):
\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^p} \| Y - X\beta \|_2^2 .
Often both n and p are extremely large, making it computationally intensive to perform least-squares.
One possible solution: reduce the data, then perform least-squares. For S̃ ∈ R^{r×n} with r ≪ n, form the reduced data (S̃X, S̃Y) and the estimator
\beta_{\tilde S} \in \arg\min_{\beta \in \mathbb{R}^p} \| \tilde S Y - \tilde S X \beta \|_2^2 .
Need to show that βS̃ is a "good" approximation to βOLS.
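A minimal sketch of the reduce-then-solve idea in plain NumPy. Here S̃ is represented implicitly by an index set of r sampled rows; the uniform row choice, the toy data, and all variable names are assumptions for illustration only, and the leverage-based choices of S̃ come later.

```python
import numpy as np

def reduced_least_squares(X, Y, rows):
    """Solve least-squares on the reduced data (S~X, S~Y) = (X[rows], Y[rows])."""
    beta, *_ = np.linalg.lstsq(X[rows], Y[rows], rcond=None)
    return beta

rng = np.random.default_rng(0)
n, p, r = 1000, 50, 200
X = rng.standard_normal((n, p))                       # toy design (assumption)
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)      # full-data OLS estimator
rows = rng.choice(n, size=r, replace=False)           # one simple choice of S~ (uniform rows)
beta_sub = reduced_least_squares(X, Y, rows)          # reduced-data estimator
```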
Choosing S̃: leverage-score sampling
Singular value decomposition for X:
X = \underbrace{U}_{n \times p} \, \underbrace{\Sigma}_{p \times p} \, \underbrace{V^T}_{p \times p},
where U^T U = I_p.
Leverage scores: \ell_i = \| U_{(i)} \|_2^2 for 1 ≤ i ≤ n.
Leverage-score sampling is equivalent to row-norm sampling of U.
For prediction error metrics, the only dependence on X is through U.
Note: \sum_{i=1}^{n} \ell_i = p, and 0 ≤ ℓ_i ≤ 1.
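A short sketch of how the leverage scores can be computed from a thin SVD of X (assuming X has full column rank p); the checks mirror the two facts noted above.

```python
import numpy as np

def leverage_scores(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)   # U is n x p with U^T U = I_p
    return np.sum(U**2, axis=1)                       # l_i = ||U_(i)||_2^2 (squared row norms)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))                   # toy design (assumption)
ell = leverage_scores(X)
assert np.isclose(ell.sum(), X.shape[1])              # sum of leverage scores equals p
assert np.all((ell >= 0) & (ell <= 1 + 1e-12))        # each score lies in [0, 1]
```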
Algorithmic setup (Drineas et al. 2006)
Assume Y ∈ R^n is arbitrary and fixed (no distribution assumed), with X known and fixed.
Criterion: prove the following relative-error/approximation guarantee, for any Y:
\| Y - X\beta_{\tilde S} \|_2^2 \le (1 + \kappa) \, \| Y - X\beta_{OLS} \|_2^2 .
Drineas et al. (2006) provide approximation guarantees when S̃ is based on random leverage-score sub-sampling.
Importantly, the bounds are derived for arbitrary Y.
Statistical setup
Assume Y is generated by the following linear model:
Y = X\beta + \epsilon,
where β ∈ R^p and ε ∈ R^n with E[εε^T] = I_{n×n}.
Criteria: two possible relative-error/approximation guarantees:
\mathbb{E}[\| Y - X\beta_{\tilde S} \|_2^2] \le (1 + \kappa) \, \mathbb{E}[\| Y - X\beta_{OLS} \|_2^2], \quad (1)
\mathbb{E}[\| X(\beta_{\tilde S} - \beta) \|_2^2] \le C \, \mathbb{E}[\| X(\beta_{OLS} - \beta) \|_2^2]. \quad (2)
Goal: determine good sub-sampling schemes and prove approximation guarantees for (2).
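A minimal Monte Carlo sketch for estimating the two ratios in (1) and (2) for a fixed sub-sampling rule. The function `efficiency_ratios` and its interface are assumptions introduced here for illustration, not part of the slides.

```python
import numpy as np

def efficiency_ratios(X, beta, select_rows, n_rep=200, seed=0):
    """Monte Carlo estimates of the ratios appearing in (1) and (2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    res_sub = res_ols = pred_sub = pred_ols = 0.0
    for _ in range(n_rep):
        Y = X @ beta + rng.standard_normal(n)             # eps ~ N(0, I_n)
        b_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
        rows = select_rows(rng)                           # user-supplied sub-sampling rule
        b_sub, *_ = np.linalg.lstsq(X[rows], Y[rows], rcond=None)
        res_sub += np.sum((Y - X @ b_sub) ** 2)           # residual error of reduced estimator
        res_ols += np.sum((Y - X @ b_ols) ** 2)           # residual error of full OLS
        pred_sub += np.sum((X @ (b_sub - beta)) ** 2)     # prediction error of reduced estimator
        pred_ols += np.sum((X @ (b_ols - beta)) ** 2)     # prediction error of full OLS
    return res_sub / res_ols, pred_sub / pred_ols         # ratios for criteria (1) and (2)
```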
Sampling for statistical setup
Let s̃ ⊂ {1, 2, ..., n} with |s̃| = r denote the subset of rows of X that are selected. The sub-sampling operator S̃ satisfies S̃_{ki} = 1 if s̃(k) = i and 0 otherwise, with no repeated samples.
Let (ℓ_i)_{i=1}^n be the leverage scores of U, and let (λ_j)_{j=1}^p denote the eigenvalues of U^T S̃^T S̃ U.
Theorem
If rank(S̃X) = p,
\mathbb{E}[\| X(\beta_{\tilde S} - \beta) \|_2^2 \mid \tilde s]
  = p \cdot \underbrace{p \Big( \sum_{k \in \tilde s} \ell_k \Big)^{-1}}_{\mathrm{Levs}}
      \cdot \underbrace{\frac{1}{p^2} \sum_{j,m} \frac{\lambda_j}{\lambda_m}}_{\mathrm{Eigs}} .
Levs ≥ 1 is the inverse of the leverage-score mass captured. Eigs ≥ 1 is a measure of how non-uniform the eigenvalues of U^T S̃^T S̃ U are.
The estimator is "optimal" by the Gauss-Markov theorem.
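A sketch computing the two factors in the theorem for a chosen subset s̃, assuming S̃ is the 0/1 row-selection operator (so U^T S̃^T S̃ U = U[s̃]^T U[s̃]) and rank(S̃X) = p; the function name and interface are illustrative assumptions.

```python
import numpy as np

def levs_eigs(X, rows):
    """Compute the Levs and Eigs factors of the theorem for a subset s~ = rows."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    p = U.shape[1]
    ell = np.sum(U**2, axis=1)                      # leverage scores l_i
    lam = np.linalg.eigvalsh(U[rows].T @ U[rows])   # eigenvalues lambda_j of U^T S~^T S~ U
    levs = p / ell[rows].sum()                      # p * (sum_{k in s~} l_k)^{-1}
    eigs = lam.sum() * (1.0 / lam).sum() / p**2     # (1/p^2) * sum_{j,m} lambda_j / lambda_m
    return levs, eigs                               # both >= 1; requires rank(S~X) = p
```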
Motivation for leverage-score sampling
Ideally, we would choose s̃ to minimize the product Levs · Eigs.
Unfortunately, it is computationally intensive to minimize Levs · Eigs.
To minimize Levs, deterministically pick the largest leverage scores (Broadbent et al. 2010). "Hope" that Eigs is well-controlled.
One can construct examples in which Eigs is arbitrarily bad.
Compromise: sub-sample randomly with probability proportional to the leverage scores (Drineas et al. 2006).
Randomization tends to break down worst-case examples. Theoretical
guarantees can be provided.
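Sketches of the two subset-selection rules just described: deterministic top-r leverage scores, and random sampling with probabilities proportional to the leverage scores with re-samples removed. Tie-breaking and the exact with-replacement draw are assumptions made here for concreteness.

```python
import numpy as np

def deterministic_rows(ell, r):
    """Deterministic rule: keep the r rows with the largest leverage scores."""
    return np.argsort(ell)[::-1][:r]

def random_rows(ell, r, rng):
    """Random rule: draw r rows with probabilities l_i / p, then remove re-samples."""
    draws = rng.choice(len(ell), size=r, replace=True, p=ell / ell.sum())
    return np.unique(draws)
```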
Empirical setup
Setup following Ma, Mahoney and Yu (unpublished).
n = 1000, p = 50, r varied. Repeated 100 times and averaged.
Each row of X generated randomly from a multivariate t distribution with [Σ]_{ij} = 2 × 0.5^{|i−j|} and ν degrees of freedom, ν = 1, 2, 10.
Different ν correspond to different degrees of uniformity of the leverage scores.
Y = X β + ǫ, where ǫ ∼ N (0, In ).
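A sketch of one way to generate this design. The construction of multivariate t rows as Gaussian divided by sqrt(χ²_ν / ν) is a standard recipe assumed here, and the choice of β (the slide does not specify it) is an assumption.

```python
import numpy as np

def make_data(n=1000, p=50, nu=2, seed=0):
    """Rows of X ~ multivariate t_nu(0, Sigma) with Sigma_ij = 2 * 0.5^|i-j|; Y = X beta + eps."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 2.0 * 0.5 ** np.abs(np.subtract.outer(idx, idx))   # Sigma_ij = 2 * 0.5^|i-j|
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, p)) @ L.T                      # rows ~ N(0, Sigma)
    W = rng.chisquare(nu, size=n) / nu                         # independent chi^2_nu / nu factors
    X = Z / np.sqrt(W)[:, None]                                # Gaussian / sqrt(chi^2_nu/nu) gives t_nu rows
    beta = rng.standard_normal(p)                              # beta not specified on the slide (assumption)
    Y = X @ beta + rng.standard_normal(n)                      # eps ~ N(0, I_n)
    return X, Y, beta
```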
Leverage score distributions
[Figure: leverage scores by index (left) and cumulative leverage mass by index (right) for ν = 10, ν = 2, ν = 1, and a uniform reference.]
Empirical results: r large
[Figure: normalized MSE versus number of samples drawn (r) for deterministic and random sampling; panels (a) ν = 10, (b) ν = 2, (c) ν = 1.]
Figure: Comparison of MSE (1 for OLS estimator) for deterministic and random sampling for large r.
Empirical results: r small
(a) ν = 10
r     Determ.   Random
70    28.0      53.7
80    20.3      36.3
90    15.2      27.6
100   12.7      21.0
200   4.72      7.39

(b) ν = 2
r     Determ.   Random
70    11.1      86.3
80    5.64      53.1
90    3.56      29.8
100   3.62      22.8
200   1.98      5.32

(c) ν = 1
r     Determ.   Random
80    1.45      5.1 × 10^6
90    1.38      2.0 × 10^5
100   1.31      2.2 × 10^5
200   1.10      629.4
300   1.06      2.75

Table: Comparison of MSE for deterministic and random sampling for small r.
Theoretical result for random sampling
Assume that U has a k-uniform leverage-score distribution; that is, there are k non-zero leverage scores, each equal to ℓ_i = p/k, with the remaining scores 0 (p ≤ k ≤ n).
The smaller k is, the more non-uniform the leverage scores.
Theorem (Random leverage-score sub-sampling)
Let S̃ denote the standard sub-sampling operator with re-samples removed, and let 10p log p ≤ r ≤ 10k log k. Then
\frac{pk}{2r} \le \mathbb{E}[\| X(\beta_{\tilde S} - \beta) \|_2^2] \le 10p \, \frac{k \log k}{r}
with probability greater than 1 − c_1 exp(−c_2 r).
No such guarantees exist for deterministic sampling.
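A small sketch of a design whose leverage-score distribution is k-uniform. This particular construction (stack k/p copies of I_p and pad with zero rows, so the k non-zero rows each have leverage exactly p/k) is one possible example assumed here for illustration.

```python
import numpy as np

def k_uniform_design(n, p, k):
    assert k % p == 0 and p <= k <= n                # this construction needs k to be a multiple of p
    X = np.zeros((n, p))
    X[:k] = np.tile(np.eye(p), (k // p, 1))          # first k rows: k/p stacked copies of I_p
    return X

X = k_uniform_design(n=1000, p=50, k=200)
U, _, _ = np.linalg.svd(X, full_matrices=False)
ell = np.sum(U**2, axis=1)
assert np.allclose(ell[:200], 50 / 200)              # l_i = p/k on the support
assert np.allclose(ell[200:], 0.0)                   # remaining leverage scores are 0
```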
Conclusions and future directions
Conclusions:
Simulations suggest deterministic sub-sampling outperforms random sub-sampling for the examples considered.
Theoretical guarantees for random but not deterministic sub-sampling.
Future directions:
Provide theoretical results for deterministic sub-sampling, e.g., considering random or "smoothed" matrices X.
Consider Bayesian/shrinkage estimators after sub-sampling.
General framework to unify the algorithmic and statistical perspectives.