Theoretical Analysis of PCA for Heteroscedastic Data

David Hong, Laura Balzano, Jeffrey A. Fessler
Department of EECS, University of Michigan, Ann Arbor, Michigan 48105
I. INTRODUCTION
Principal Component Analysis (PCA) is a method for estimating a subspace from noisy samples. It is useful in a variety of applications [1], [2], [3], ranging from dimensionality reduction to anomaly detection and the visualization of high-dimensional data. Effective use of PCA requires a rigorous understanding of its performance. This work provides an analysis of PCA for samples with heteroscedastic noise, that is, samples that have nonuniform noise variances. In particular, we provide a simple asymptotic prediction of the recovery of a low-dimensional subspace from noisy heteroscedastic samples. The prediction enables: (a) easy and efficient calculation of the asymptotic performance, (b) reasoning about the asymptotic performance under heteroscedasticity (such as outliers), and (c) a deeper understanding of why PCA performs best when the noise is homoscedastic (all samples share the same noise level).
II. MODEL FOR HETEROSCEDASTIC DATA
We model n heteroscedastic samples y1, …, yn ∈ R^d drawn from a k-dimensional subspace as

\[
y_i = \tilde{U}\,\Theta\,z_i + \eta_i\,\varepsilon_i, \tag{1}
\]

where
• Θ ∈ R_+^{k×k} is a diagonal matrix of amplitudes θ1, …, θk,
• ηi ∈ R_+ are the noise standard deviations,
• Ũ ∈ R^{d×k} is a matrix representation of the subspace basis,
• zi ∼ N(0, Ik) are iid coefficient vectors,
• εi ∼ N(0, Id) are iid noise vectors,
and the L noise levels σ1, …, σL occur in proportions p1, …, pL (i.e., p1 of the samples have ηi = σ1, p2 have ηi = σ2, and so on).

Remark 1: The following results also hold without modification for distributions on zi and εi satisfying a certain set of technical conditions (described in [4]). We present the Gaussian case here for simplicity.
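To make the model concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) that draws samples from (1) in the Figure 1b setting and forms the PCA estimate Û from the top left singular vectors; the helper name sample_data and the particular proportions are assumptions for the example.

```python
import numpy as np

def sample_data(n, d, thetas, sigmas, props, rng):
    """Draw n samples y_i = U Theta z_i + eta_i eps_i from model (1)."""
    k = len(thetas)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # subspace basis U~ in R^{d x k}
    counts = np.round(np.asarray(props) * n).astype(int)
    counts[-1] = n - counts[:-1].sum()                 # noise level l for a proportion p_l of samples
    eta = np.repeat(np.asarray(sigmas, dtype=float), counts)  # eta_i for each sample
    Z = rng.standard_normal((k, n))                    # z_i ~ iid N(0, I_k)
    E = rng.standard_normal((d, n))                    # eps_i ~ iid N(0, I_d)
    Y = U @ (np.diag(thetas) @ Z) + eta * E            # eta_i scales column i of the noise
    return Y, U

rng = np.random.default_rng(0)
Y, U = sample_data(n=10_000, d=1_000, thetas=[1.0],
                   sigmas=[1.8, 0.2], props=[0.2, 0.8], rng=rng)
U_hat = np.linalg.svd(Y, full_matrices=False)[0][:, :1]  # PCA: top-k left singular vectors (k = 1)
print(np.linalg.norm(U_hat.T @ U) ** 2)                  # recovery (1/k)||U_hat^T U~||_F^2
```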
III. MAIN RESULT
The following theorem provides our main result: a simple expression for the asymptotic recovery (1/k)‖ÛᵀŨ‖²_F of the subspace Ũ.

Theorem 1: For a fixed samples-to-dimension ratio c > 1, the PCA estimate Û is such that

\[
\frac{1}{k}\bigl\|\hat{U}^{T}\tilde{U}\bigr\|_{F}^{2}
\;\xrightarrow{\text{a.s.}}\;
\frac{1}{k}\sum_{i=1}^{k}\frac{A(\beta_i)}{\beta_i\,B_i'(\beta_i)}
\tag{2}
\]

as n, d → ∞ with n/d = c, provided A(β1), …, A(βk) > 0, where

\[
A(x) := 1 - c\sum_{\ell=1}^{L}\frac{p_\ell\,\sigma_\ell^4}{(x-\sigma_\ell^2)^2},
\qquad
B_i(x) := 1 - c\,\theta_i^2\sum_{\ell=1}^{L}\frac{p_\ell}{x-\sigma_\ell^2},
\]

and βi is the largest real root of Bi.

Figure 1 shows simulations displaying these results for L = 2. Figure 2 illustrates the behavior of the largest real root of Bi in relation to the noise levels.
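The prediction (2) is also easy to evaluate numerically: each βi is the largest real root of Bi, and Bi increases from −∞ to 1 on (maxℓ σℓ², ∞), so a bracketed root-finder applies. The sketch below is our own illustrative implementation (the helper name asymptotic_recovery is an assumption), using the convention of Figures 3 and 4 of reporting zero when A(βi) ≤ 0.

```python
import numpy as np
from scipy.optimize import brentq

def asymptotic_recovery(c, thetas, sigmas2, props):
    """Evaluate prediction (2): (1/k) sum_i A(beta_i) / (beta_i B_i'(beta_i))."""
    sigmas2 = np.asarray(sigmas2, dtype=float)   # noise variances sigma_l^2
    props = np.asarray(props, dtype=float)       # proportions p_l (summing to 1)
    A = lambda x: 1 - c * np.sum(props * sigmas2**2 / (x - sigmas2) ** 2)
    total = 0.0
    for theta in np.atleast_1d(thetas):
        B = lambda x: 1 - c * theta**2 * np.sum(props / (x - sigmas2))
        dB = lambda x: c * theta**2 * np.sum(props / (x - sigmas2) ** 2)
        # B_i increases from -inf to 1 on (max_l sigma_l^2, inf), and
        # B_i(x) > 0 once x > max_l sigma_l^2 + c theta^2, so the largest
        # real root beta_i is bracketed between these two endpoints.
        lo = sigmas2.max()
        beta = brentq(B, lo + 1e-9 * (1 + lo), lo + c * theta**2 + 1.0)
        # Theorem 1 requires A(beta_i) > 0; as in Figs. 3-4, report 0 otherwise.
        total += max(0.0, A(beta)) / (beta * dB(beta))
    return total / np.atleast_1d(thetas).size

# Example: the Figure 1 setting with 80% low-noise samples.
print(asymptotic_recovery(c=10, thetas=[1.0],
                          sigmas2=[1.8**2, 0.2**2], props=[0.2, 0.8]))
```

Sweeping the inputs of this function over a grid reproduces predictions like those plotted in Figures 3 and 4.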
IV. DEPENDENCE ON PARAMETERS

The main result (Theorem 1) enables us to reason about how PCA depends on the various parameters.

Dependence on sample-to-dimension ratio and average noise variance. Figure 3 shows the asymptotic recovery as we sweep over c > 1 and σ̄² > 0, where the subspace is one-dimensional with amplitude θ1 = 1 and there are two noise levels, σ1² = σ̄²/(2p1) and σ2² = σ̄²/(2p2), with p = (0.8, 0.2). As expected, the recovery generally improves with increasing c (more samples per dimension) and decreasing σ̄² (less noise on average). It also has the same general shape as in the homoscedastic case (analyzed as an example in [5]), but more samples are generally needed to achieve the same recovery.

Dependence on sample proportions. The red curves in Figure 1 show the asymptotic recovery as p2 ∈ (0, 1) varies with p1 = 1 − p2, c = 10, θ1 = 1, and σ = (1.8, 0.2). As expected, recovery generally improves with increasing p2; having more low-noise samples is better. A short numerical version of this sweep follows below.
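The proportion sweep takes only a few lines given the prediction; a sketch, assuming the asymptotic_recovery helper from the Section III sketch is in scope:

```python
import numpy as np
# Assumes asymptotic_recovery from the Section III sketch is in scope.

# Proportion sweep from Figure 1: c = 10, theta_1 = 1, sigma = (1.8, 0.2).
for p2 in np.linspace(0.1, 0.9, 9):
    r = asymptotic_recovery(c=10, thetas=[1.0],
                            sigmas2=[1.8**2, 0.2**2], props=[1 - p2, p2])
    print(f"p2 = {p2:.1f}  predicted recovery = {r:.3f}")
```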
Dependence on noise variances. Figure 4 shows the asymptotic recovery as σ1² and σ2² vary with c = 10, p = (0.7, 0.3), and θ1 = 1. As expected, the recovery generally improves with decreasing noise variances. Interestingly, for a fixed average noise variance (i.e., along the dotted line in the plot), the best recovery occurs when σ1² = σ2². It turns out this is true in general, a fact we state next.
V. AN UPPER BOUND ON PERFORMANCE

The following theorem provides an upper bound on the asymptotic recovery of the subspace Ũ. The bound is a function of the average noise variance and is attained when the noise is homoscedastic.

Theorem 2: If A(βi) ≥ 0 for all i, the asymptotic recovery is bounded as

\[
\frac{1}{k}\sum_{i=1}^{k}\frac{A(\beta_i)}{\beta_i\,B_i'(\beta_i)}
\;\le\;
\frac{1}{k}\sum_{i=1}^{k}\max\!\left(0,\;\frac{c-\bar{\sigma}^4/\theta_i^4}{c+\bar{\sigma}^2/\theta_i^2}\right),
\]

where σ̄² := p1σ1² + ⋯ + pLσL² is the average noise variance. The bound is attained when σ1² = ⋯ = σL² (i.e., the noise is homoscedastic).

Remark 2: It follows that the asymptotic recovery for a fixed average noise variance is maximized when the noise is homoscedastic. Thus, using the average noise variance to predict performance gives an optimistic proxy; actual recovery will be worse than such a prediction for heteroscedastic data. Our expression (2) is more accurate.
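As a numerical check on Theorem 2, one can compare the heteroscedastic prediction against the bound and confirm that a homoscedastic model with the same average noise variance attains it; a sketch, again assuming the asymptotic_recovery helper from Section III (upper_bound is our own helper name):

```python
import numpy as np
# Assumes asymptotic_recovery from the Section III sketch is in scope.

def upper_bound(c, thetas, s2_bar):
    """Right-hand side of Theorem 2 given the average noise variance s2_bar."""
    thetas = np.atleast_1d(np.asarray(thetas, dtype=float))
    terms = (c - s2_bar**2 / thetas**4) / (c + s2_bar / thetas**2)
    return float(np.maximum(0.0, terms).mean())

c, thetas = 10, [1.0]
sigmas2, props = np.array([1.8**2, 0.2**2]), np.array([0.7, 0.3])
s2_bar = float(props @ sigmas2)                         # average noise variance

het = asymptotic_recovery(c, thetas, sigmas2, props)
hom = asymptotic_recovery(c, thetas, [s2_bar], [1.0])
print(het <= upper_bound(c, thetas, s2_bar))            # bound holds for heteroscedastic noise
print(np.isclose(hom, upper_bound(c, thetas, s2_bar)))  # attained in the homoscedastic case
```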
VI. CONCLUSION
This work provides a simple expression for the asymptotic recovery of a subspace from heteroscedastic samples by PCA. We use the result to gain insight into the performance of PCA as a function of the model parameters, and we derive an upper bound showing that, for a fixed average noise variance, performance is best when the noise is homoscedastic. There are many avenues for future work, including further study of the algebraic structure of the expressions in the prediction, consideration of a weighted version of PCA, and analysis of the non-asymptotic recovery.
Fig. 1: Numerical simulation results for c = 10, θ1 = 1, σ = (1.8, 0.2), where p2 is swept from 0 to 1 with p1 = 1 − p2. Simulation mean (blue curve) and interquartile interval (light blue ribbon) are shown with the asymptotic prediction (red curve). (a) Results for d = 10², n = 10³ from 10000 trials. (b) Results for d = 10³, n = 10⁴ from 1000 trials. Going from Figure 1a to Figure 1b, we increase the problem size while keeping the same model parameters; the simulation mean moves closer to the asymptotic prediction and the interquartile range shrinks, indicating concentration around the asymptotic prediction.

Fig. 2: Location of the largest real root βi of Bi(x) for σ1² = 2.04, σ2² = 0.874, c = 10, p = (0.7, 0.3), and θi = 1.

Fig. 3: Asymptotic recovery as a function of the average noise variance σ̄² and sample-to-dimension ratio c with θ1 = 1, σ1² = σ̄²/(2p1), σ2² = σ̄²/(2p2), and p = (0.8, 0.2). Contours are overlaid in black and the region where A(βi) ≤ 0 is shown as zero.

Fig. 4: Asymptotic recovery as a function of the noise variances σ1² and σ2² with c = 10, p = (0.7, 0.3), and θ1 = 1; that is, 70% of samples have noise variance σ1² and 30% have noise variance σ2². Contours are overlaid in black and the region where A(βi) ≤ 0 is shown as zero. The dotted line indicates points with the same average noise variance. As observed in Section IV, the best performance along this line occurs when σ1² = σ2², and this is true in general, as discussed in Section V.

REFERENCES
[1] B. A. Ardekani, J. Kershaw, K. Kashikura, and I. Kanno, "Activation detection in functional MRI using subspace modeling and maximum likelihood estimation," IEEE Transactions on Medical Imaging, vol. 18, no. 2, pp. 101–114, 1999.
[2] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. SIGCOMM '04, 2004, pp. 219–230.
[3] N. Sharma and K. Saroha, "A novel dimensionality reduction method for cancer dataset using PCA and feature ranking," in 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2015, pp. 2261–2264.
[4] D. Hong, L. Balzano, and J. A. Fessler, "Towards a theoretical analysis of PCA for heteroscedastic data," in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
[5] F. Benaych-Georges and R. R. Nadakuditi, "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, vol. 111, pp. 120–135, 2012.