Statistics 784: Multivariate Analysis (NC State University)

Principal Components

- Often, a multivariate response has relatively few dominant modes of variation.
- E.g., the length, width, and height of turtle shells: overall size is a dominant mode of variation, making all three responses larger or smaller simultaneously.
- Principal components analysis (PCA) explores the modes of variation suggested by a variance-covariance matrix.

- Identifying the modes of variation can lead to:
  - new insights and interpretation;
  - reduction in dimensionality and collinearity.
- PCA is often used to prepare data for further analysis, e.g. in regression analysis (Principal Components Regression).

- Basic idea: find the linear combinations of the responses that carry most of the variance.
- More precisely: given X with E(X) = 0 and Cov(X) = Σ, find a1 to maximize Var(a1′X), subject to a1′a1 = 1.
- Notes:
  - a1 must be constrained, otherwise the variance could be made arbitrarily large just by making a1 large;
  - why this particular constraint? Because it gives the problem a convenient solution.

- Solution: Var(a1′X) = a1′Σa1, which is maximized at a1 = e1, the first eigenvector of Σ.
- The maximized value is λ1, the associated eigenvalue.
- Often the elements of e1 are all positive and similar in magnitude, giving a mode in which all responses vary together:
  - often not interesting in itself;
  - but a useful summary or composite variable.

- What about other modes of variation? Find a2 to maximize Var(a2′X), subject to:
  - a2′a2 = 1;
  - Cov(a1′X, a2′X) = 0.
- Solution: a2 = e2, the second eigenvector of Σ, and the maximized value is λ2, the associated eigenvalue.
- Similarly for modes 3, 4, ..., p.

Alternative Approach: Data Compression

- An observer sees X but communicates only Y = a′X.
- The receiver uses X* = bY to approximate X.
- The error is X − X* = (I − ba′)X.
- The error covariance matrix is Cov(X − X*) = (I − ba′)Σ(I − ab′).
- The total error variance is trace[Cov(X − X*)] = trace(Σ) − a′Σb − b′Σa + (b′b)(a′Σa).

- For a given b, the total error variance is minimized by a = b/(b′b), and the minimum for that b is trace(Σ) − b′Σb/(b′b).
- The best b is e1, and then the best a is also e1.
- The best (i.e., minimum) total error variance is trace(Σ) − λ1 = λ2 + λ3 + ... + λp.

- Higher dimensions: suppose the observer can communicate Y = (Y1, Y2)′ = A′X, and the receiver uses X* = BY to approximate X.
- Similar math: the best choice is A = B = (e1, e2), with total error variance trace(Σ) − λ1 − λ2 = λ3 + λ4 + ... + λp.
- Similarly for k channels, k < p: the proportion of the total variance "explained" (i.e., communicated) is (λ1 + λ2 + ... + λk)/(λ1 + λ2 + ... + λp).
- Both characterizations are easy to check numerically; see the two sketches below.
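A minimal R sketch of the variance-maximization characterization, using simulated data (the data and all object names are illustrative, not from the course):

set.seed(1)
n <- 500; p <- 4
X <- matrix(rnorm(n * p), n, p) %*% matrix(runif(p * p, -1, 1), p, p)  # correlated responses

S   <- cov(X)
eig <- eigen(S, symmetric = TRUE)  # eigenvalues lambda_1 >= ... >= lambda_p and eigenvectors
pc  <- prcomp(X)                   # sample principal components

# The component variances are exactly the eigenvalues of S
all.equal(unname(pc$sdev^2), eig$values)

# No unit vector beats e_1: try many random directions a with a'a = 1
a <- matrix(rnorm(p * 1000), p)               # 1000 random directions
a <- sweep(a, 2, sqrt(colSums(a^2)), "/")     # normalize each column
max(colSums(a * (S %*% a)))                   # every a'Sa stays below ...
eig$values[1]                                 # ... lambda_1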
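Continuing the same sketch, the compression identities: reconstructing X from its first k components leaves total error variance λk+1 + ... + λp, and the proportion of variance explained is the corresponding eigenvalue ratio.

k    <- 2
Xdev <- scale(X, center = TRUE, scale = FALSE)  # deviations from the column means
A    <- eig$vectors[, 1:k]                      # A = B = (e_1, ..., e_k)
Xhat <- Xdev %*% A %*% t(A)                     # the receiver's approximation

# Total error variance equals the sum of the discarded eigenvalues
sum(diag(cov(Xdev - Xhat)))
sum(eig$values[-(1:k)])

# Proportion of the total variance "explained" by the first k components
sum(eig$values[1:k]) / sum(eig$values)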
Scaling in PCA

- Recall: PCA is the solution to a data compression problem, where "error" is quantified by total error variance.
- Question: is "total variance" an appropriate criterion?
- Variables measured in different units must be scaled.
- Variables in the same units but with very different variances are usually scaled as well.

- The simplest scaling: divide each variable by its standard deviation, so that the covariances become correlations.
- In other words, use the eigen structure of the correlation matrix R rather than of the covariance matrix Σ.

PCA for some Special Cases

- Diagonal matrix: if Σ = diag(σ11, σ22, ..., σpp), then the principal components are just the original variables.

- Compound symmetry: if Σ has σ² on the diagonal and ρσ² everywhere off it, i.e. Σ = σ²[(1 − ρ)I + ρ11′], then, for ρ > 0:
  - λ1 = σ²{1 + (p − 1)ρ}, with e1 = p^(−1/2)(1, 1, ..., 1)′;
  - λk = σ²(1 − ρ) for k > 1;
  - e2, e3, ..., ep are an arbitrary orthonormal basis for the rest of R^p.
- If ρ < 0 the order is reversed, but note that positive semidefiniteness requires 1 + (p − 1)ρ ≥ 0, i.e. ρ ≥ −1/(p − 1).

- Time series (1st-order autoregression): Σ has (i, j) entry σ²φ^|i−j|, so the diagonal is σ², the first off-diagonals are φσ², and the corner entries are φ^(p−1)σ².
- There is no closed form for the eigen structure, but for large p the eigenvectors look like sines and cosines.
- The two sketches below check the compound-symmetry and autoregressive cases numerically.
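First, the compound-symmetry eigen structure, with illustrative parameter values:

p <- 5; sigma2 <- 2; rho <- 0.4
Sig <- sigma2 * ((1 - rho) * diag(p) + rho * matrix(1, p, p))

eig <- eigen(Sig, symmetric = TRUE)
eig$values
sigma2 * (1 + (p - 1) * rho)   # = lambda_1
sigma2 * (1 - rho)             # = lambda_2 = ... = lambda_p
eig$vectors[, 1]               # proportional to (1, ..., 1)' / sqrt(p)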
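Second, the autoregressive case: plotting the leading eigenvectors for a moderately large p shows the sine/cosine-like oscillation. This is a qualitative check, not a proof:

p   <- 50; phi <- 0.7
Sig <- phi^abs(outer(1:p, 1:p, "-"))   # (i, j) entry = phi^|i - j|, taking sigma^2 = 1

V <- eigen(Sig, symmetric = TRUE)$vectors
# The leading eigenvectors oscillate with increasing frequency, like sinusoids
matplot(V[, 1:4], type = "l", lty = 1, xlab = "index", ylab = "eigenvector entry")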
Sample PCA

- Sample PCA is essentially the eigen analysis of S (or of R): S êk = λ̂k êk, with component scores ŷk = Xdev êk, where Xdev = X − (1/n)11′X = (I − (1/n)11′)X is the matrix of deviations from the column means.

- Note that S = (1/(n − 1)) Xdev′ Xdev = [(n − 1)^(−1/2) Xdev]′ [(n − 1)^(−1/2) Xdev].
- Singular value decomposition: write (n − 1)^(−1/2) Xdev = UDV′, where U and V have orthonormal columns and D is diagonal (but need not be square).
- The diagonal entries of D are the square roots of the largest p eigenvalues of both (n − 1)^(−1) Xdev′ Xdev = S and (n − 1)^(−1) Xdev Xdev′.
- The columns of V are the eigenvectors of Xdev′ Xdev.

- Also Xdev V = [ŷ1, ŷ2, ..., ŷp] = √(n − 1) UD, so the singular value decomposition of (n − 1)^(−1/2) Xdev provides all the details of the sample principal components:
  - the coefficients, V;
  - the component values (scores), UD.
- Similarly, if X* is Xdev with its columns normalized to unit sum of squares, then R = X*′X*, and the singular value decomposition of X* gives the PCA of R.
- The sketch below verifies these relations numerically.
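A short R check of the SVD relations, again with simulated data and illustrative names:

set.seed(2)
n <- 100; p <- 5
X    <- matrix(rnorm(n * p), n, p)
Xdev <- scale(X, center = TRUE, scale = FALSE)

sv <- svd(Xdev / sqrt(n - 1))   # (n - 1)^(-1/2) Xdev = U D V'
pc <- prcomp(X)

# The singular values are the square roots of the eigenvalues of S
all.equal(sv$d^2, unname(pc$sdev^2))

# The columns of V are the component coefficients (up to sign)
all.equal(abs(sv$v), abs(unname(pc$rotation)))

# Scores: Xdev V = sqrt(n - 1) U D
all.equal(Xdev %*% sv$v, sqrt(n - 1) * sv$u %*% diag(sv$d))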
- Example: 5 stocks.
  - DU PONT E I DE NEM (NYSE: DD), a former Dow Industrials stock;
  - HONEYWELL INTL INC (NYSE: HON), a former Dow Industrials stock;
  - EXXON MOBIL CP (NYSE: XOM), a Dow Industrials stock;
  - CHEVRON CORP (NYSE: CVX), a Dow Industrials stock;
  - DOW CHEMICAL (NYSE: DOW), a former Dow stock.

- See the SAS proc princomp program and output.
- R code and graphs for an updated set of stocks: DD, HON, and XOM, plus MSFT (Microsoft) and WMT (Walmart):

# stocks() is a course-provided function returning the matrix of stock returns
stocksPCAcor <- prcomp(stocks(), scale. = TRUE)  # PCA of the correlation matrix
print(stocksPCAcor)
plot(stocksPCAcor)                               # scree plot
biplot(stocksPCAcor)

stocksPCAcov <- prcomp(stocks())                 # PCA of the covariance matrix
print(stocksPCAcov)
plot(stocksPCAcov)
biplot(stocksPCAcov)

[Figure: scree plot of the component variances for stocksPCAcor.]
[Figure: biplot for the standardized stock returns (stocksPCAcor).]
[Figure: scree plot of the component variances for stocksPCAcov.]
[Figure: biplot for the unstandardized stock returns (stocksPCAcov).]

How Many Components?

- How many components are important?
  - There is no definitive answer within the PCA framework.
  - The factor analysis model allows maximum likelihood estimation, and hence hypothesis testing.
- For insight, use all components that have a substantive interpretation.
- For other uses, such as regression, an objective rule is needed.

- One approach: the "scree" plot (a plot of the eigenvalues); look for an "elbow". This is very subjective.
- A simple rule of thumb: retain all eigenvalues larger than the average.
  - Note: for the correlation matrix the average eigenvalue is always 1, so the rule becomes: retain all eigenvalues > 1.
- Overland and Preisendorfer (1982) proposed a rule ("Rule N") based on comparing the observed eigenvalues with the distribution of the eigenvalues when the data are iid N(0, σ²).

Large Samples

- Suppose that the rows of the data matrix X are a random sample of size n from Np(µ, Σ).
- Assume that Σ has distinct eigenvalues λ1 > λ2 > ... > λp > 0.
- Then, approximately for large n, √n (λ̂ − λ) ∼ Np(0, 2 diag(λ1², ..., λp²)).

- In other words, λ̂i and λ̂k are approximately independent for i ≠ k, and λ̂i ∼ N(λi, 2λi²/n) approximately.
- Note: if nλ̂/λ ∼ χ²_n, then similarly, approximately, λ̂ ∼ N(λ, 2λ²/n).
- So we could also state that, approximately, nλ̂i/λi ∼ χ²_n.
- Simulations suggest that the chi-squared form is the better approximation for small n, provided the eigenvalues are well separated.
- Asymptotics suggest that the degrees of freedom for λ̂i could be taken as n − i + 1 instead of n.
- Also, √n (êi − ei) is approximately Np(0, Ei), where Ei = λi Σ_{k ≠ i} [λk/(λk − λi)²] ek ek′.
- (The first sketch at the end of these notes checks the eigenvalue approximation by simulation.)

- Test for equal correlations: H0: R = (1 − ρ)I + ρ11′, i.e. all off-diagonal correlations equal a common value ρ, against the alternative of a general (unstructured) Σ or R.
- H0 is equivalent to the equality of all eigenvalues of R but one.
- The likelihood ratio test (and Lawley's test) give a large-sample χ² statistic with (p + 1)(p − 2)/2 degrees of freedom; see the second sketch at the end of these notes.

- Other related tests of interest:
  - Compound symmetry is a variance-components structure; it is stronger, requiring equal correlations and equal variances, and is equivalent to the equality of all eigenvalues of Σ but one.
  - Sphericity is the necessary and sufficient condition for univariate repeated-measures analysis to be valid; it is a weaker condition, σi,j = (σi,i + σj,j)/2 − τ², with no simple characterization in terms of the eigen structure.
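Two numerical sketches close these notes. First, a simulation check of the approximation λ̂1 ∼ N(λ1, 2λ1²/n); the settings (n, the eigenvalues, the number of replicates) are illustrative:

set.seed(3)
n <- 200
lambda <- c(8, 4, 2, 1)   # distinct, well-separated population eigenvalues
p <- length(lambda)

lam1.hat <- replicate(5000, {
  X <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(lambda))  # Cov(X) = diag(lambda)
  eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values[1]
})

mean(lam1.hat)        # close to lambda_1 = 8 (slightly biased upward)
var(lam1.hat)         # close to ...
2 * lambda[1]^2 / n   # ... 2 * lambda_1^2 / n = 0.64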
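Second, a sketch of Lawley's equal-correlation test. The statistic follows the form given in standard multivariate texts (e.g. Johnson and Wichern); treat it as an unverified illustration rather than course-certified code:

# Lawley's test of H0: all off-diagonal correlations are equal
lawley.test <- function(X) {
  n <- nrow(X); p <- ncol(X)
  R     <- cor(X)
  off   <- R[lower.tri(R)]                # the p(p - 1)/2 distinct correlations
  rbar  <- mean(off)                      # overall mean correlation
  rbark <- (colSums(R) - 1) / (p - 1)     # mean correlation involving variable k
  gam   <- (p - 1)^2 * (1 - (1 - rbar)^2) / (p - (p - 2) * (1 - rbar)^2)
  stat  <- (n - 1) / (1 - rbar)^2 *
           (sum((off - rbar)^2) - gam * sum((rbark - rbar)^2))
  df    <- (p + 1) * (p - 2) / 2
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Data generated under H0 (equicorrelation, rho = 0.5): expect a large p-value
set.seed(4)
Sig <- 0.5 * diag(6) + 0.5
X   <- matrix(rnorm(300 * 6), 300, 6) %*% chol(Sig)
lawley.test(X)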