On conditional moments of high-dimensional random vectors given lower-dimensional projections
Hannes Leeb (University of Vienna)
(with Ivana Milovic and Lukas Steinberger)
Statistical Inference for Large Scale Data
Simon Fraser University
April 20, 2015
SPARSE MODELS

Among a large collection of potentially important factors or explanatory variables, a sparse (working) model uses only a small subset. (Economics, finance, genomics, proteomics, social sciences, etc.)

Why do you fit sparse models?

- Because I assume, or know, that the true model is sparse.
- Because I do not have enough data to fit the full model.
SPARSE MODELS: WITH AND WITHOUT SPARSITY

For illustration, consider, as the full model, the homoskedastic linear model

    Y = Xθ + U,

where X is n × d with n ≪ d. Moreover, assume that E[U] = 0, that E[UU'] = σ² I_n, and that U is independent of X.

Partition X = (X1 : X2), where X1 is n × p with p < n, partition θ' = (θ1', θ2') conformably, and consider the (sparse) submodel where Y is regressed on X1.

The submodel is correct if θ2 = 0. Then we have

    E[Xθ | X1] = X1 θ1   and   Var[Xθ | X1] = 0.

The submodel is 'correct' if E[Xθ | X1] is linear in X1 and Var[Xθ | X1] is a multiple of the identity.
SPARSE MODELS: WITH AND WITHOUT SPARSITY

Recall that the data-generating model is

    Y = Xθ + U.

If the working model is correct, then

    Y = X1 θ1 + U.

If the working model is 'correct', then

    Y = X1 β + V,

for some β ∈ R^p, and with E[V | X1] = 0 and Var[V | X1] = ς² I_n.

If the working model is 'correct' but not correct, we have ς² > σ², because

    V = U + (Xθ − E[Xθ | X1])

and hence

    ς² = σ² + Var[Xθ | X1].
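As a sanity check (not from the slides), the decomposition ς² = σ² + Var[Xθ | X1] can be reproduced numerically in the Gaussian case, where the submodel is automatically 'correct'. The sketch below uses arbitrary dimensions, an arbitrary θ, and numpy, and simply compares the residual variance of the fitted submodel with σ² + ‖θ2‖².

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, p, sigma = 100_000, 20, 3, 1.0               # illustrative sizes, not from the talk
    theta = rng.standard_normal(d)

    # Independent standard normal regressors: E[x | x1] is linear and Var[x | x1]
    # is constant, so the submodel using only the first p regressors is 'correct'.
    X = rng.standard_normal((n, d))
    Y = X @ theta + sigma * rng.standard_normal(n)

    X1 = X[:, :p]
    beta_hat = np.linalg.lstsq(X1, Y, rcond=None)[0]   # fit the sparse working model
    resid = Y - X1 @ beta_hat

    # Here Var[Xθ | X1] equals ||θ2||² per observation.
    print("empirical varsigma^2:  ", resid.var())
    print("sigma^2 + ||theta_2||^2:", sigma**2 + np.sum(theta[p:] ** 2))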
'CORRECT' SUBMODELS

Consider one observation from the full model, i.e.,

    y = x'θ + u,

and consider the submodel using only x1 (where x' = (x1', x2')).

The submodel is 'correct' if θ, x, and x1 satisfy
- E[θ'x | x1] is linear in x1 and
- Var[θ'x | x1] is constant in x1.

The submodel is 'correct' irrespective of θ if x and x1 satisfy
- E[x | x1] is linear in x1 and
- Var[x | x1] is constant in x1.

Under the latter condition, standard methods can be used to perform inference based on the submodel.

The latter condition is restrictive: The Gaussian distribution is, in a sense, characterized by linear conditional means and constant conditional variances (Eaton 1986, Bryc 1995).
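For jointly Gaussian (x1, x2), the latter condition holds exactly, via the usual conditioning formulas E[x2 | x1] = Σ21 Σ11⁻¹ x1 and Var[x2 | x1] = Σ22 − Σ21 Σ11⁻¹ Σ12. A minimal numerical sketch (illustration only, with an arbitrary covariance matrix) spells this out:

    import numpy as np

    rng = np.random.default_rng(1)
    d, p = 5, 2                                    # small, arbitrary dimensions

    # Arbitrary positive definite covariance matrix for x = (x1, x2).
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T + d * np.eye(d)
    S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
    S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

    # Gaussian conditioning: E[x2 | x1] is linear in x1, Var[x2 | x1] does not depend on x1.
    coef = S21 @ np.linalg.inv(S11)
    cond_var = S22 - S21 @ np.linalg.solve(S11, S12)

    x1 = rng.standard_normal(p)
    print("E[x2 | x1] =", coef @ x1)               # linear function of x1
    print("Var[x2 | x1] =\n", cond_var)            # constant matrix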
INFORMAL SUMMARY: MOST SUBMODELS ARE 'CORRECT'

Recall that a submodel with regressors x1 is 'correct' if E[x | x1] is linear and Var[x | x1] is constant in x1.

We show: For a large class of non-Gaussian distributions, most conditional means are approximately linear and most conditional variances are approximately constant.

Our approximation errors go to zero provided that the dimension of the conditioning vector (submodel) is small relative to the dimension of the overall vector (full model). (Large-dimension asymptotics.)

Note: On a technical level, our results should be compared to those of Hall and Li (1993) and also to those of Diaconis and Freedman (1984).

Note: Conceptually, our approach is similar to that of Berk, Brown, Buja, Zhang and Zhao (2013), Genovese and Wasserman (2008), or Leeb (2009).
QUANTITIES OF INTEREST

For each dimension d, consider a random d-vector Z that has a Lebesgue density and that is standardized so that EZ = 0 and EZZ' = I_d. We also impose a technical condition (c) that will be introduced and discussed later.

Our results are asymptotic as d → ∞. We allow for p → ∞ subject to p ≪ d.

Consider a d × p matrix B with orthonormal columns. Conditional on B'Z, the mean of Z is linear and the variance of Z is constant in the conditioning variables, if both

    ‖ E[Z | B'Z = x] − Bx ‖ = 0   and
    ‖ E[ZZ' | B'Z = x] − (I_d + B(xx' − I_p)B') ‖ = 0

hold for each x ∈ R^p. The standardizations of Z and B are inconsequential.
TWO PRELIMINARY RESULTS

Consider the conditions

    ‖ E[Z | B'Z = x] − Bx ‖ = 0   and
    ‖ E[ZZ' | B'Z = x] − (I_d + B(xx' − I_p)B') ‖ = 0

for each x ∈ R^p.

Proposition 1 (Hall & Li 1993):
Set p = 1 and impose condition (c). For each fixed x ∈ R and each t > 0, we have

    ν_{d,1}{ B ∈ R^d : ‖ E[Z | B'Z = x] − Bx ‖ > t }  →  0   as d → ∞,

where ν_{d,1} denotes the uniform distribution on the unit sphere in R^d.

Proposition 2 (L, 2013):
Set p = 1 and impose condition (c). For each fixed x ∈ R and each t > 0, we have

    ν_{d,1}{ B ∈ R^d : ‖ E[ZZ' | B'Z = x] − (I_d + B(x² − 1)B') ‖ > t }  →  0   as d → ∞.

If the last two propositions apply, then the left-hand sides in the preceding display are small, for most B's but only for fixed x ∈ R, provided only that d is large. To show that this also applies for most x's, we now replace x by B'Z, and we now also allow for p > 1.
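As a rough numerical illustration of Proposition 1 (not part of the talk), one can estimate E[Z | B'Z = x] by a crude local average around B'Z = x, for a randomly drawn direction B and i.i.d. non-Gaussian components. The component distribution, the sample size, and the window width below are arbitrary choices, and part of the reported error is Monte Carlo noise.

    import numpy as np

    rng = np.random.default_rng(2)
    d, n, x, h = 100, 100_000, 1.0, 0.1            # dimension, Monte Carlo size, point x, window width

    B = rng.standard_normal(d)
    B /= np.linalg.norm(B)                         # random direction, uniform on the unit sphere

    # Z with i.i.d. standardized (mean 0, variance 1) exponential components: non-Gaussian.
    Z = rng.exponential(size=(n, d)) - 1.0

    proj = Z @ B
    window = np.abs(proj - x) < h                  # crude local average around B'Z = x
    cond_mean = Z[window].mean(axis=0)             # estimate of E[Z | B'Z = x]

    print("samples in window:", window.sum())
    print("estimated || E[Z | B'Z = x] - Bx ||:", np.linalg.norm(cond_mean - x * B))
    print("for scale, || Bx || =", np.linalg.norm(x * B))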
MAIN RESULT, QUALITATIVE VERSION

Write V_{d,p} for the collection of all d × p matrices B with orthonormal columns (Stiefel manifold), and denote the corresponding Haar measure by ν_{d,p}.

Theorem 1 (L, 2013; Steinberger & L, 2014):
For each fixed d, consider a random d-vector Z that has a Lebesgue density and that is standardized so that EZ = 0 and EZZ' = I_d. Suppose that condition (c) is satisfied, and suppose that p = o(log d). Then there are Borel sets G_{d,p} ⊆ V_{d,p} so that ν_{d,p}(G_{d,p}) → 1 as d → ∞, and so that

    sup_{B ∈ G_{d,p}} P( ‖ E[Z | B'Z] − BB'Z ‖ > t )                                  (1)

and

    sup_{B ∈ G_{d,p}} P( ‖ E[ZZ' | B'Z] − (I_d + B(B'ZZ'B − I_p)B') ‖ > t )           (2)

converge to zero as d → ∞ for each t > 0.
MAIN RESULT, QUANTITATIVE VERSION

Theorem 2 (Steinberger & L, 2014):
For fixed d, consider a random d-vector Z that has a Lebesgue density and that is standardized such that EZ = 0 and EZZ' = I_d. Suppose that condition (c) is satisfied. Then, for each p < d, there is a Borel set G_{d,p} ⊆ V_{d,p} so that both (1) and (2) are bounded by

    (2γ / t) d^{ −(ξ̃/2)(1 − (2γ/(5ξ̃)) p/log d) }

for each t > 0, and such that

    ν_{d,p}(G^c_{d,p})  ≤  2κ d^{ −(ξ̃/2)(1 − (2γ/ξ̃) p/log d) }.

Here, ξ̃ is given by ξ̃ = min{ξ, ε/2 + 1/4, 1/2}/5, and ε, ξ, γ, and κ are constants derived from condition (c) that do not depend on p or d.
CONDITION (c) IMPOSED IN THEOREMS 1 & 2

Condition (c) – the simple version:
The components of Z are independent with bounded marginal moments and bounded marginal densities of sufficiently high order (where the bounds do not depend on d).
CONDITION (c) IMPOSED IN THEOREMS 1 & 2

A more general version of condition (c) relies on a set of finite-sample moment bounds involving the Gram matrix

    S_k = ( Z_i' Z_j / d )_{i,j=1}^{k}

for Z_i, i = 1, . . . , k, being i.i.d. copies of Z.
CONDITION (c) IMPOSED IN THEOREMS 1 & 2

- (b1) For fixed k ∈ N, there are constants ε ∈ [0, 1/2], α ≥ 1, β > 0, and ξ ∈ (0, 1/2], such that the following holds true:
  - (a) E ‖ √d (S_k − I_k) ‖^{2k+1+ε} ≤ α.
  - (b) For any monomial G = G(S_k − I_k) in the elements of S_k − I_k whose degree g satisfies g ≤ 2k, we have |d^{g/2} EG − 1| ≤ β/d^ξ if G consists only of quadratic terms in elements above the diagonal, and |d^{g/2} EG| ≤ β/d^ξ if G contains a linear term.
  - (c) Consider two monomials G = G(S_k − I_k) and H = H(S_k − I_k) of degree g and h, respectively, in the elements of S_k − I_k. If G is given by Z_1'Z_2 Z_2'Z_3 · · · Z_{g−1}'Z_g Z_g'Z_1 / d^g, if H depends at least on those Z_i's with i ≤ g, and if 2 ≤ h < g ≤ k, then |d^g E[GH]| ≤ β/d^ξ.
- (b2) For fixed k ∈ N, there is a constant D ≥ 1, such that the following holds true: If R is an orthogonal d × d matrix, then a marginal density of the first d − k + 1 components of RZ is bounded by (d choose k−1)^{1/2} D^{d−k+1}.
CONDITION (c) IMPOSED IN THEOREMS 1 & 2

Condition (c) – the general version:
For each d and p under consideration, suppose that (b1) and (b2) hold with k = 4, such that the constants ε, ξ, α, β, and D in these bounds do not depend on d or p.

The general version of condition (c) allows for dependent components and unbounded marginal moments.
CONDITION (c) IMPOSED IN THEOREMS 1 & 2

Remarks:
- The bound in (b2) appears to be qualitatively different from (b1), in that it does not directly impose restrictions on moments of the standardized Gram matrix S_k − I_k. However, (b2) is used only to bound the p-th moment of det(S_l)^{−4(k+1)} for l = 1, . . . , k.
- (b1).(a) and (b2) guarantee that the mean of certain functions of Z, and of i.i.d. copies of Z, is bounded. And (b1).(b–c) require that certain moments of Z are close to what they would be in the Gaussian case.
- From a statistical perspective, we note that the moments discussed here can be estimated from a sample of independent copies of Z. Indeed, population means like E ‖S_k − I_k‖^{2k+1+ε}, EG, EGH, or E[ det(S_l)^{−4p(k+1)} ] as above are readily estimated by appropriate sample means. In this sense, we rely on bounds on quantities that can be estimated from data.
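A minimal sketch of this last point (illustration only, not from the slides): draw many independent batches of k = 4 copies of Z, form the Gram matrix S_k for each batch, and average the quantities of interest. The component distribution, ε, p, and the number of batches below are arbitrary, and only l = k is shown for the determinant moment.

    import numpy as np

    rng = np.random.default_rng(3)
    d, k, p, eps, n_batches = 300, 4, 2, 0.5, 2000     # illustrative choices

    norm_moments, det_moments = [], []
    for _ in range(n_batches):
        Zs = rng.standard_normal((k, d))               # k i.i.d. copies of Z (Gaussian, for simplicity)
        S = Zs @ Zs.T / d                              # Gram matrix S_k = (Z_i' Z_j / d)
        norm_moments.append(np.linalg.norm(np.sqrt(d) * (S - np.eye(k)), 2) ** (2 * k + 1 + eps))
        det_moments.append(np.linalg.det(S) ** (-4 * p * (k + 1)))

    print("sample mean of ||sqrt(d)(S_k - I_k)||^(2k+1+eps):", np.mean(norm_moments))
    print("sample mean of det(S_k)^(-4p(k+1)):", np.mean(det_moments))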
SOME INTUITION

For each d, let Z be a d-vector with i.i.d. components (from the same distribution), each with mean zero and variance 1.

By the central limit theorem, we have

    α'Z  →_w  N(0, 1)

for unit-vectors α of the form α' = (1/√d, . . . , 1/√d) and as d → ∞.

Diaconis and Freedman (1984) show that the above holds for 'most' unit-vectors α:

    ν_{d,1}{ α : d( L(α'Z), N(0, 1) ) > t }  →  0

as d → ∞ for some metric d(·, ·) that 'implies weak convergence'.
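A quick numerical check of this phenomenon (illustration only, not from the slides): project i.i.d. non-Gaussian coordinates onto a few random unit vectors and compare the projection with N(0, 1) via the Kolmogorov–Smirnov statistic. The coordinate distribution and the sizes below are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    d, n = 500, 20_000                              # dimension and Monte Carlo sample size

    # Z has i.i.d. standardized uniform coordinates: clearly non-Gaussian marginals.
    Z = (rng.uniform(size=(n, d)) - 0.5) * np.sqrt(12.0)

    for _ in range(3):                              # a few random unit directions alpha
        alpha = rng.standard_normal(d)
        alpha /= np.linalg.norm(alpha)
        ks = stats.kstest(Z @ alpha, "norm")        # distance between L(alpha'Z) and N(0, 1)
        print("KS statistic:", round(ks.statistic, 4))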
SOME INTUITION

The result of Diaconis and Freedman (1984) also holds for most pairs of unit-vectors α and β. In other words,

    L(α'Z, β'Z)  ≈  N( (0, 0)' , ( 1    α'β
                                   α'β  1   ) )

for most pairs (α, β) and as d → ∞.

This makes our result plausible. But convergence of distributions typically does not entail convergence of moments or convergence of conditional moments.

Idea 1: For most pairs of unit-vectors (α, β) and as d → ∞, we have

    E[α'Z | β'Z]  ≈  α'β β'Z   and   Var[α'Z | β'Z]  ≈  1 − (α'β)².

But α'β ≈ 0 for most pairs (α, β) and as d → ∞.

Idea 2: For most unit-vectors β and as d → ∞, we have

    E[Z | β'Z]  ≈  ββ'Z   and   Var[Z | β'Z]  ≈  I_d − ββ'.

Idea 3: For most matrices B from the Stiefel manifold V_{d,p} and as d → ∞, we have

    E[Z | B'Z]  ≈  BB'Z   and   Var[Z | B'Z]  ≈  I_d − BB',

provided that p = o(log(d)). Moreover, we provide explicit error bounds that hold for fixed p and d.
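To make Idea 3 concrete (again an illustration, not material from the talk), one can estimate E[Z | B'Z] at one data point by averaging Z over its nearest neighbours in the projected space, for a random B with p = 2 orthonormal columns. All sizes and the component distribution below are arbitrary, and part of the reported error is estimation noise.

    import numpy as np

    rng = np.random.default_rng(5)
    d, p, n, k_nn = 200, 2, 50_000, 2000           # illustrative sizes

    # Random element of the Stiefel manifold V_{d,p} via QR of a Gaussian matrix.
    B, _ = np.linalg.qr(rng.standard_normal((d, p)))

    # Z with i.i.d. standardized exponential coordinates (non-Gaussian).
    Z = rng.exponential(size=(n, d)) - 1.0
    proj = Z @ B                                   # B'Z for each sample

    x0 = proj[0]                                   # condition near the projection of one sample point
    idx = np.argsort(np.sum((proj - x0) ** 2, axis=1))[:k_nn]
    cond_mean = Z[idx].mean(axis=0)                # local estimate of E[Z | B'Z = x0]

    print("estimated || E[Z | B'Z = x0] - B B'Z_0 ||:", np.linalg.norm(cond_mean - B @ x0))
    print("for scale, || B B'Z_0 || =", np.linalg.norm(B @ x0))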
OUTLOOK

- Approximately valid prediction and inference when fitting simple linear submodels to complex data-generating processes; joint with Lukas Steinberger.
- Fast convergence rates for the error bounds in Theorem 2, with applications to model selection and regularization; joint with Ivana Milovic.
REFERENCES

- Diaconis, P., Freedman, D.: Asymptotics of graphical projection pursuit. Ann. Statist. 12, 793–815 (1984).
- Hall, P., Li, K.C.: On almost linearity of low dimensional projections from high dimensional data. Ann. Statist. 21, 867–889 (1993).
- Leeb, H.: On the conditional distribution of low-dimensional projections from high-dimensional data. Ann. Statist. 41, 464–483 (2013).
- Steinberger, L., Leeb, H.: On conditional moments of high-dimensional random vectors given lower-dimensional projections. Revision for Bernoulli in preparation (2014); arXiv:1405.2183.