I. Sample Geometry and Random Sampling

A. The Geometry of the Sample

Our sample data in matrix form looks like this:

$$\underset{n \times p}{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}$$

where the rows $\mathbf{x}_1', \mathbf{x}_2', \ldots, \mathbf{x}_n'$ are the separate multivariate observations.

Just as the point at which the population means of all p variables lie is the centroid of the population, the point at which the sample means of all p variables lie is the centroid of the sample. For a sample with two variables and three observations,

$$\underset{3 \times 2}{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}$$

each row $(x_{j1}, x_{j2})$ plots as a point in the p = 2 dimensional variable or 'row' space (so called because rows are treated as vector coordinates), and the centroid of the sample is the point $(\bar{x}_{\cdot 1}, \bar{x}_{\cdot 2})$.

[Figure: the three rows plotted as points in p = 2 row space, with the centroid of the sample $(\bar{x}_{\cdot 1}, \bar{x}_{\cdot 2})$ marked.]

These same data plotted in item or 'column' space (so called because columns are treated as vector coordinates) would look like two points, $(x_{11}, x_{21}, x_{31})$ and $(x_{12}, x_{22}, x_{32})$, in n = 3 dimensions, with centroid $(\bar{x}_{1 \cdot}, \bar{x}_{2 \cdot}, \bar{x}_{3 \cdot})$.

[Figure: the two columns plotted as points in n = 3 column space, with the centroid of the sample marked.]

Suppose we have the following data:

$$X = \begin{bmatrix} 5 & 3 \\ -3 & 11 \end{bmatrix}$$

In row space we have a p = 2 dimensional scatter diagram of the points $(x_{11}, x_{12}) = (5, 3)$ and $(x_{21}, x_{22}) = (-3, 11)$, with the obvious centroid (1, 7). For the same data, in column space we have an n = 2 dimensional plot of the points $(x_{11}, x_{21}) = (5, -3)$ and $(x_{12}, x_{22}) = (3, 11)$, with the obvious centroid (4, 4).

Suppose we have the following data:

$$X = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 7 & 1 \\ -6 & 1 & 8 \end{bmatrix}$$

In row space we have a p = 3 dimensional scatter diagram of the points (2, 4, 6), (1, 7, 1), and (-6, 1, 8), with centroid (-1, 4, 5). For the same data, in column space we have an n = 3 dimensional scatter diagram of the points (2, 1, -6), (4, 7, 1), and (6, 1, 8), with centroid (4, 3, 1).

The column space reveals an interesting geometric interpretation of the centroid. Suppose we plot the n x 1 vector of ones, $\mathbf{1}$; in n = 3 dimensions this is the point (1, 1, 1). This vector obviously forms equal angles with each of the n coordinate axes, which means normalization of this vector yields the unit vector

$$\frac{1}{\sqrt{n}}\mathbf{1}$$

Now consider some vector $\mathbf{y}_i$ whose coordinates are the n sample values of a random variable $X_i$; in n = 3 dimensions this is the point $(x_{1i}, x_{2i}, x_{3i})$. The projection of $\mathbf{y}_i$ on the unit vector $\frac{1}{\sqrt{n}}\mathbf{1}$ is given by

$$\mathbf{y}_i'\left(\frac{1}{\sqrt{n}}\mathbf{1}\right)\frac{1}{\sqrt{n}}\mathbf{1} = \frac{\sum_{j=1}^{n} x_{ji}}{n}\,\mathbf{1} = \bar{x}_i\,\mathbf{1}$$

The sample mean $\bar{x}_i$ corresponds to the multiple of $\mathbf{1}$ necessary to generate the projection of $\mathbf{y}_i$ onto the line determined by $\mathbf{1}$!

Using the Pythagorean Theorem, we can show that the vector drawn perpendicularly from the projection $\bar{x}_i\mathbf{1}$ to $\mathbf{y}_i$ is $\mathbf{y}_i - \bar{x}_i\mathbf{1}$. This is often referred to as the deviation (or mean corrected) vector and is given by

$$\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1} = \begin{bmatrix} x_{1i} - \bar{x}_i \\ x_{2i} - \bar{x}_i \\ \vdots \\ x_{ni} - \bar{x}_i \end{bmatrix}$$

[Figure: decomposition of $\mathbf{y}_i$ in n = 3 dimensions into its projection $\bar{x}_i\mathbf{1}$ along $\mathbf{1}$ and the perpendicular deviation vector $\mathbf{y}_i - \bar{x}_i\mathbf{1}$.]

Example: Consider our previous matrix of three observations in three-space:

$$X = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 7 & 1 \\ -6 & 1 & 8 \end{bmatrix}$$

These data have a mean vector of

$$\bar{\mathbf{x}} = \begin{bmatrix} -1 \\ 4 \\ 5 \end{bmatrix}$$

i.e., $\bar{x}_1 = -1.0$, $\bar{x}_2 = 4.0$, and $\bar{x}_3 = 5.0$. So we have

$$\bar{x}_1\mathbf{1} = -1.0\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}, \quad \bar{x}_2\mathbf{1} = 4.0\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \\ 4 \end{bmatrix}, \quad \bar{x}_3\mathbf{1} = 5.0\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix}$$

Consequently

$$\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1} = \begin{bmatrix} 2 \\ 1 \\ -6 \end{bmatrix} - \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ -5 \end{bmatrix}, \quad \mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1} = \begin{bmatrix} 4 \\ 7 \\ 1 \end{bmatrix} - \begin{bmatrix} 4 \\ 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ -3 \end{bmatrix}, \quad \mathbf{d}_3 = \mathbf{y}_3 - \bar{x}_3\mathbf{1} = \begin{bmatrix} 6 \\ 1 \\ 8 \end{bmatrix} - \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 \\ -4 \\ 3 \end{bmatrix}$$

Note here that $\bar{x}_i\mathbf{1} \perp \mathbf{d}_i$, $i = 1, \ldots, p$.
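These projection and deviation calculations are easy to verify numerically. The following is a minimal NumPy sketch (not part of the original notes; the variable names are our own) that reproduces the mean vector and deviation vectors for the example above and confirms that each deviation vector is perpendicular to $\mathbf{1}$:

```python
import numpy as np

# The example data matrix: rows are observations, columns are variables.
X = np.array([[ 2, 4, 6],
              [ 1, 7, 1],
              [-6, 1, 8]], dtype=float)
n, p = X.shape
one = np.ones(n)

# Sample mean of each variable: the multiple of the 1 vector that gives
# the projection of each column y_i onto the line spanned by 1.
xbar = X.T @ one / n                # [-1.  4.  5.]

# Deviation (mean-corrected) vectors d_i = y_i - xbar_i * 1, as columns of D.
D = X - np.outer(one, xbar)
print(D.T)                          # rows: d1 = [3 2 -5], d2 = [0 3 -3], d3 = [1 -4 3]

# Each deviation vector is perpendicular to the 1 vector.
print(D.T @ one)                    # [0. 0. 0.]
```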
So the decomposition is

$$\mathbf{y}_1 = \begin{bmatrix} 2 \\ 1 \\ -6 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix} + \begin{bmatrix} 3 \\ 2 \\ -5 \end{bmatrix}, \quad \mathbf{y}_2 = \begin{bmatrix} 4 \\ 7 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 4 \\ 4 \end{bmatrix} + \begin{bmatrix} 0 \\ 3 \\ -3 \end{bmatrix}, \quad \mathbf{y}_3 = \begin{bmatrix} 6 \\ 1 \\ 8 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix} + \begin{bmatrix} 1 \\ -4 \\ 3 \end{bmatrix}$$

We are particularly interested in the deviation vectors

$$\mathbf{d}_1 = \begin{bmatrix} 3 \\ 2 \\ -5 \end{bmatrix}, \quad \mathbf{d}_2 = \begin{bmatrix} 0 \\ 3 \\ -3 \end{bmatrix}, \quad \mathbf{d}_3 = \begin{bmatrix} 1 \\ -4 \\ 3 \end{bmatrix}$$

which we can plot translated to the origin, without change in their lengths or directions.

[Figure: the three deviation (residual) vectors $\mathbf{d}_1$, $\mathbf{d}_2$, $\mathbf{d}_3$ plotted from the origin in n = 3 dimensions.]

Now consider the squared lengths of the deviation vectors:

$$L^2_{\mathbf{d}_i} = \mathbf{d}_i'\mathbf{d}_i = \sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)^2$$

i.e., the squared length of a deviation vector is the sum of the squared deviations. Recalling that the sample variance is

$$s_i^2 = \frac{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)^2}{n}$$

we can see that the squared length of a variable's deviation vector is proportional to that variable's variance (and so the length is proportional to the standard deviation)!

Now consider any two deviation vectors. Their dot product is

$$\mathbf{d}_i'\mathbf{d}_k = \sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right)$$

which is simply a sum of crossproducts. Now let $\theta_{ik}$ denote the angle between these two deviation vectors. Recall that

$$\cos\left(\theta_{xy}\right) = \frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}}L_{\mathbf{y}}} = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\sqrt{\mathbf{y}'\mathbf{y}}}$$

By substitution we have that

$$\mathbf{d}_i'\mathbf{d}_k = L_{\mathbf{d}_i}L_{\mathbf{d}_k}\cos\left(\theta_{ik}\right)$$

Another substitution, based on the expressions for $L^2_{\mathbf{d}_i}$ and $\mathbf{d}_i'\mathbf{d}_k$ above, yields

$$\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right) = \sqrt{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)^2}\sqrt{\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2}\,\cos\left(\theta_{ik}\right)$$

A little algebra gives us

$$\cos\left(\theta_{ik}\right) = \frac{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_k\right)}{\sqrt{\sum_{j=1}^{n}\left(x_{ji} - \bar{x}_i\right)^2}\sqrt{\sum_{j=1}^{n}\left(x_{jk} - \bar{x}_k\right)^2}} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = r_{ik}$$

i.e., the cosine of the angle between two deviation vectors is the sample correlation of the corresponding variables.

Example: Consider our previous matrix of three observations in three-space:

$$X = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 7 & 1 \\ -6 & 1 & 8 \end{bmatrix}$$

which resulted in deviation vectors

$$\mathbf{d}_1 = \begin{bmatrix} 3 \\ 2 \\ -5 \end{bmatrix}, \quad \mathbf{d}_2 = \begin{bmatrix} 0 \\ 3 \\ -3 \end{bmatrix}, \quad \mathbf{d}_3 = \begin{bmatrix} 1 \\ -4 \\ 3 \end{bmatrix}$$

Let's use these results to find the sample covariance and correlation matrices. We have:

$$\mathbf{d}_1'\mathbf{d}_1 = 38 = 3s_{11} \;\Rightarrow\; s_{11} = \frac{38}{3}, \quad \mathbf{d}_2'\mathbf{d}_2 = 18 = 3s_{22} \;\Rightarrow\; s_{22} = \frac{18}{3}, \quad \mathbf{d}_3'\mathbf{d}_3 = 26 = 3s_{33} \;\Rightarrow\; s_{33} = \frac{26}{3}$$

and:

$$\mathbf{d}_1'\mathbf{d}_2 = 21 = 3s_{12} \;\Rightarrow\; s_{12} = \frac{21}{3}, \quad \mathbf{d}_1'\mathbf{d}_3 = -20 = 3s_{13} \;\Rightarrow\; s_{13} = -\frac{20}{3}, \quad \mathbf{d}_2'\mathbf{d}_3 = -21 = 3s_{23} \;\Rightarrow\; s_{23} = -\frac{21}{3}$$

so:

$$r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{21/3}{\sqrt{38/3}\sqrt{18/3}} = 0.803, \quad r_{13} = \frac{s_{13}}{\sqrt{s_{11}}\sqrt{s_{33}}} = \frac{-20/3}{\sqrt{38/3}\sqrt{26/3}} = -0.636, \quad r_{23} = \frac{s_{23}}{\sqrt{s_{22}}\sqrt{s_{33}}} = \frac{-21/3}{\sqrt{18/3}\sqrt{26/3}} = -0.971$$

which gives us

$$S_n = \begin{bmatrix} 38/3 & 21/3 & -20/3 \\ 21/3 & 18/3 & -21/3 \\ -20/3 & -21/3 & 26/3 \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} 1.000 & 0.803 & -0.636 \\ 0.803 & 1.000 & -0.971 \\ -0.636 & -0.971 & 1.000 \end{bmatrix}$$
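These covariance and correlation computations can be checked the same way. Here is a short NumPy sketch (our own illustration, using the divisor-n convention of $S_n$ adopted above) that rebuilds $S_n$ and $R$ directly from the deviation vectors:

```python
import numpy as np

X = np.array([[ 2, 4, 6],
              [ 1, 7, 1],
              [-6, 1, 8]], dtype=float)
n = X.shape[0]
D = X - X.mean(axis=0)         # columns of D are the deviation vectors d1, d2, d3

# S_n with divisor n: the (i, k) entry is d_i'd_k / n.
S_n = D.T @ D / n
print(S_n)                     # [[38/3, 21/3, -20/3], [21/3, 18/3, -21/3], [-20/3, -21/3, 26/3]]

# Correlations are the cosines of the angles between deviation vectors.
lengths = np.sqrt(np.diag(S_n))
R = S_n / np.outer(lengths, lengths)
print(R.round(3))              # [[1., .803, -.636], [.803, 1., -.971], [-.636, -.971, 1.]]
```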
B. Random Samples and the Expected Values of $\bar{\mathbf{X}}$ and $S_n$

Suppose we intend to collect n sets of measurements (or observations) on p variables. Before they are observed, we can consider each of the n x p values to be a random variable $X_{jk}$. This leads us to interpret each set of measurements $\mathbf{X}_j$ on the p variables as a random vector, i.e.,

$$\underset{n \times p}{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \\ \vdots \\ \mathbf{X}_n' \end{bmatrix}$$

where the rows are the separate multivariate observations. These concepts are used to define a random sample.

Random Sample – if the row vectors $\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_n'$ represent independent observations from a common joint probability distribution $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_p)$, then $\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_n'$ are said to form a random sample from $f(\mathbf{x})$.

This means the observations have a joint density function of

$$\prod_{j=1}^{n} f\left(\mathbf{x}_j\right)$$

where $f(\mathbf{x}_j) = f(x_{j1}, x_{j2}, \ldots, x_{jp})$ is the density function for the jth row vector.

Keep in mind two thoughts with regard to random samples:

- The measurements of the p variables in a single trial $\mathbf{X}_j' = (X_{j1}, X_{j2}, \ldots, X_{jp})$ will usually be correlated. The measurements from different trials, however, must be independent for inference to be valid.

- Independence of the measurements from different trials is often violated when the data have a serial component.

Note that $\bar{\mathbf{X}}$ and $S_n$ have certain properties no matter what the underlying joint distribution of the random variables is. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a joint distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. Then:

- $\bar{\mathbf{X}}$ is an unbiased estimate of $\boldsymbol{\mu}$, i.e., $E(\bar{\mathbf{X}}) = \boldsymbol{\mu}$, and has covariance matrix

$$\mathrm{Cov}\left(\bar{\mathbf{X}}\right) = \frac{1}{n}\Sigma$$

- The sample covariance matrix $S_n$ has expected value

$$E\left(S_n\right) = \frac{n-1}{n}\Sigma = \Sigma - \underbrace{\frac{1}{n}\Sigma}_{\text{bias}}$$

i.e., $S_n$ is a biased estimator of the covariance matrix $\Sigma$, but

$$E\left(\frac{n}{n-1}S_n\right) = \frac{n}{n-1}E\left(S_n\right) = \Sigma$$

This means we can write an unbiased sample variance-covariance matrix $S$ as

$$S = \frac{n}{n-1}S_n = \frac{1}{n-1}\sum_{j=1}^{n}\left(\mathbf{X}_j - \bar{\mathbf{X}}\right)\left(\mathbf{X}_j - \bar{\mathbf{X}}\right)'$$

whose (i,k)th element is

$$s_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_{ji} - \bar{X}_i\right)\left(X_{jk} - \bar{X}_k\right)$$

Example: Consider our previous matrix of three observations in three-space:

$$X = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 7 & 1 \\ -6 & 1 & 8 \end{bmatrix}$$

The unbiased estimate $S$ is

$$S = \frac{n}{n-1}S_n = \frac{3}{3-1}\begin{bmatrix} 38/3 & 21/3 & -20/3 \\ 21/3 & 18/3 & -21/3 \\ -20/3 & -21/3 & 26/3 \end{bmatrix} = \begin{bmatrix} 38/2 & 21/2 & -20/2 \\ 21/2 & 18/2 & -21/2 \\ -20/2 & -21/2 & 26/2 \end{bmatrix}$$

Notice that this does not change the sample correlation matrix R!

$$R = \begin{bmatrix} 1.000 & 0.803 & -0.636 \\ 0.803 & 1.000 & -0.971 \\ -0.636 & -0.971 & 1.000 \end{bmatrix}$$

Why?

C. Generalizing Variance over p Dimensions

For a given variance-covariance matrix

$$\underset{p \times p}{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix}, \quad s_{ik} = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_{ji} - \bar{X}_i\right)\left(X_{jk} - \bar{X}_k\right)$$

the Generalized Sample Variance is $|S|$.

Example: Consider our previous matrix of three observations in three-space:

$$X = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 7 & 1 \\ -6 & 1 & 8 \end{bmatrix}$$

whose unbiased variance-covariance matrix we found to be

$$S = \begin{bmatrix} 19 & 21/2 & -10 \\ 21/2 & 9 & -21/2 \\ -10 & -21/2 & 13 \end{bmatrix}$$

Expanding along the first row, the Generalized Sample Variance is

$$|S| = 19\begin{vmatrix} 9 & -21/2 \\ -21/2 & 13 \end{vmatrix} - \frac{21}{2}\begin{vmatrix} 21/2 & -21/2 \\ -10 & 13 \end{vmatrix} + (-10)\begin{vmatrix} 21/2 & 9 \\ -10 & -21/2 \end{vmatrix}$$

$$= 19(6.75) - 10.5(31.5) + (-10)(-20.25) = 128.25 - 330.75 + 202.50 = 0$$

The generalized sample variance of this example is zero! This is no accident: each deviation vector is perpendicular to $\mathbf{1}$, so the three deviation vectors lie in an $(n - 1) = 2$ dimensional subspace and must be linearly dependent. (The general conditions under which $|S| = 0$ are discussed below.)

Of course, some of the information regarding the variances and covariances is lost when summarizing multiple dimensions with a single number.

Consider the geometry of $|S|$ in two dimensions: we will generate two deviation vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ with angle $\theta$ between them.

[Figure: deviation vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ with angle $\theta$ between them; base $L_{\mathbf{d}_2}$, height $L_{\mathbf{d}_1}\sin(\theta)$.]

The resulting parallelogram has area $\left(L_{\mathbf{d}_1}\sin(\theta)\right)L_{\mathbf{d}_2}$. Because $\sin^2(\theta) + \cos^2(\theta) = 1$, we can rewrite this area as

$$\left(L_{\mathbf{d}_1}\sin(\theta)\right)L_{\mathbf{d}_2} = L_{\mathbf{d}_1}L_{\mathbf{d}_2}\sqrt{1 - \cos^2(\theta)}$$

Earlier we showed that

$$L_{\mathbf{d}_1} = \sqrt{\sum_{j=1}^{n}\left(X_{j1} - \bar{X}_1\right)^2} = \sqrt{(n-1)\,s_{11}}, \quad L_{\mathbf{d}_2} = \sqrt{\sum_{j=1}^{n}\left(X_{j2} - \bar{X}_2\right)^2} = \sqrt{(n-1)\,s_{22}}$$

and $\cos(\theta) = r_{12}$. So by substitution

$$\text{area} = L_{\mathbf{d}_1}L_{\mathbf{d}_2}\sqrt{1 - \cos^2(\theta)} = \sqrt{(n-1)\,s_{11}}\sqrt{(n-1)\,s_{22}}\sqrt{1 - r_{12}^2} = (n-1)\sqrt{s_{11}s_{22}\left(1 - r_{12}^2\right)}$$

and we know that

$$|S| = \begin{vmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{vmatrix} = \begin{vmatrix} s_{11} & \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} \\ \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} & s_{22} \end{vmatrix} = s_{11}s_{22} - s_{11}s_{22}r_{12}^2 = s_{11}s_{22}\left(1 - r_{12}^2\right)$$

So

$$|S| = \frac{\text{area}^2}{(n-1)^2}$$

More generally, we can establish the Generalized Sample Variance to be

$$|S| = \frac{\text{volume}^2}{(n-1)^p}$$

which simply means that the generalized sample variance (for a fixed set of data) is proportional to the squared volume generated by its p deviation vectors.
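A brief numerical check of these results, again as a NumPy sketch of our own: it confirms $S = \frac{n}{n-1}S_n$ and computes the generalized sample variance for the running example (the code comments also hint at why $R$ is unchanged and why $|S| = 0$ here):

```python
import numpy as np

X = np.array([[ 2, 4, 6],
              [ 1, 7, 1],
              [-6, 1, 8]], dtype=float)
n = X.shape[0]
D = X - X.mean(axis=0)

S_n = D.T @ D / n          # biased covariance matrix (divisor n)
S   = D.T @ D / (n - 1)    # unbiased covariance matrix (divisor n - 1)
print(np.allclose(S, n / (n - 1) * S_n))   # True: S = (n/(n-1)) S_n

# R is unchanged because every s_ik is rescaled by the same factor n/(n-1),
# which cancels in r_ik = s_ik / sqrt(s_ii * s_kk).

# Generalized sample variance |S|: zero here, since the three deviation
# vectors are all orthogonal to 1 and so span at most n - 1 = 2 dimensions.
print(np.linalg.det(S))    # ~0 (up to floating-point rounding)
```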
Note that:

- the generalized sample variance increases as any deviation vector increases in length (i.e., as the corresponding variable increases in variation), and

- the generalized sample variance increases as the angle between any two deviation vectors approaches 90° (i.e., as the correlation of the corresponding variables approaches zero).

Here we see how the generalized sample variance changes as the length of deviation vector $\mathbf{d}_2$ changes (the variation of the corresponding variable changes):

[Figure: two panels in n = 3 dimensions; deviation vector $\mathbf{d}_2$ increases in length to $c\mathbf{d}_2$, $c > 1$ (i.e., the variance of $x_2$ increases), stretching the volume spanned by $\mathbf{d}_1$, $\mathbf{d}_2$, $\mathbf{d}_3$.]

Here we see the generalized sample variance decrease as the directions of deviation vectors $\mathbf{d}_2$ and $\mathbf{d}_3$ become more similar (the correlation of $x_2$ and $x_3$ increases):

[Figure: two panels in n = 3 dimensions; left, $\theta_{23} = 90°$, i.e., $\mathbf{d}_2$ and $\mathbf{d}_3$ are orthogonal ($x_2$ and $x_3$ are uncorrelated); right, $0° < \theta_{23} < 90°$, i.e., $\mathbf{d}_2$ and $\mathbf{d}_3$ point in similar directions ($x_2$ and $x_3$ are positively correlated), shrinking the spanned volume.]

This suggests an important result: the generalized sample variance is zero when and only when at least one deviation vector lies in the span of the other deviation vectors, i.e., when

- one deviation vector is a linear combination of the other deviation vectors,
- one variable is perfectly correlated with a linear combination of the other variables,
- the rank of the mean-corrected data matrix is less than p, the number of columns, or
- the determinant of the variance-covariance matrix is zero.

(These conditions are equivalent; see the numerical sketch at the end of this subsection.)

These results also suggest simple conditions for determining whether $S$ is of full rank:

- If $n \le p$, then $|S| = 0$ for every sample.

- Let the p x 1 vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, where $\mathbf{x}_j'$ is the jth row of the data matrix $X$, be realizations of the independent random vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. If the linear combination $\mathbf{a}'\mathbf{X}_j$ has positive variance for each constant vector $\mathbf{a} \ne \mathbf{0}$ and $p < n$, then $S$ is of full rank and $|S| > 0$; if $\mathbf{a}'\mathbf{X}_j$ is a constant for all j (for some $\mathbf{a} \ne \mathbf{0}$), then $|S| = 0$.

The Generalized Sample Variance also has a geometric interpretation in the p-dimensional scatter plot representation of the data in row space. Consider the statistical distance of each point in row space from the sample centroid

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

with $S^{-1}$ substituted for the matrix $A$ of the general quadratic-form distance. Under these circumstances, the coordinates $\mathbf{x}$ that lie a constant distance c from the centroid must satisfy

$$\left(\mathbf{x} - \bar{\mathbf{x}}\right)'S^{-1}\left(\mathbf{x} - \bar{\mathbf{x}}\right) = c^2$$

A little integral calculus can be used to show that the volume of this ellipsoid is

$$\text{volume}\left\{\mathbf{x} : \left(\mathbf{x} - \bar{\mathbf{x}}\right)'S^{-1}\left(\mathbf{x} - \bar{\mathbf{x}}\right) \le c^2\right\} = k_p\left|S\right|^{1/2}c^{\,p}, \quad \text{where } k_p = \frac{2\pi^{p/2}}{p\,\Gamma\!\left(p/2\right)}$$

Thus, the squared volume of the ellipsoid is equal to the product of some constant and the generalized sample variance.
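To illustrate the rank conditions numerically, here is a NumPy sketch of our own; the synthetic data and the particular coefficients (x3 = 2·x1 + x2, the scale factor c = 3) are purely illustrative choices, not from the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# A variable that is a linear combination of the others makes |S| = 0:
# the deviation vectors are linearly dependent, so S is rank deficient.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + x2                      # perfectly determined by x1 and x2
X = np.column_stack([x1, x2, x3])
S = np.cov(X, rowvar=False)             # unbiased S (divisor n - 1)
print(np.linalg.det(S))                 # ~0 (floating-point noise)

# Increasing one variable's variation increases |S|: scaling a column by c
# scales its deviation vector by c, and here multiplies |S| by c^2 = 9.
Y = np.column_stack([x1, x2, rng.normal(size=n)])
S_Y = np.cov(Y, rowvar=False)
Y2 = Y.copy()
Y2[:, 2] *= 3.0                         # c = 3 stretches d3
print(np.linalg.det(S_Y), np.linalg.det(np.cov(Y2, rowvar=False)))  # second is 9x first
```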
Example: Here we have three data sets, each with n = 21 observations, centroid (3, 3), and generalized variance $|S| = 9.0$:

[Figure: three scatter plots of the 21 observations $(x_1, x_2)$ in Data Sets A, B, and C; the individual data values are omitted here.]

Data Set A:

$$S = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}, \quad r = 0.80, \quad \lambda_1 = 9,\ \lambda_2 = 1, \quad \mathbf{e}_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix},\ \mathbf{e}_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$$

Data Set B:

$$S = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}, \quad r = 0.00, \quad \lambda_1 = \lambda_2 = 3, \quad \mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\ \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

Data Set C:

$$S = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix}, \quad r = -0.80, \quad \lambda_1 = 9,\ \lambda_2 = 1, \quad \mathbf{e}_1 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix},\ \mathbf{e}_2 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$$

Other measures of Generalized Variance have been suggested based on:

- the variance-covariance matrix of the standardized variables, i.e., $|R|$, which ignores differences in the variances of the individual variables, and

- the total sample variance, i.e.,

$$s_{11} + s_{22} + \cdots + s_{pp} = \sum_{i=1}^{p} s_{ii}$$

which ignores the pairwise correlations between variables.

D. Matrix Operations for Calculating Sample Means, Covariances, and Correlations

For a given data matrix

$$\underset{n \times p}{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_p \end{bmatrix}$$

we have that the sample mean vector is

$$\bar{\mathbf{x}} = \begin{bmatrix} \mathbf{y}_1'\mathbf{1}/n \\ \mathbf{y}_2'\mathbf{1}/n \\ \vdots \\ \mathbf{y}_p'\mathbf{1}/n \end{bmatrix} = \frac{1}{n}X'\mathbf{1}$$

We can also create an n x p matrix of means

$$\frac{1}{n}\mathbf{1}\mathbf{1}'X = \begin{bmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \vdots & \vdots & & \vdots \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{bmatrix}$$

If we subtract this result from data matrix X we have

$$X - \frac{1}{n}\mathbf{1}\mathbf{1}'X = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix}$$

which is an n x p matrix of deviations!

Now the matrix $(n-1)S$ of sums of squares and crossproducts is

$$(n-1)S = \left(X - \frac{1}{n}\mathbf{1}\mathbf{1}'X\right)'\left(X - \frac{1}{n}\mathbf{1}\mathbf{1}'X\right) = X'\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X$$

(the last step follows because the centering matrix $I - \frac{1}{n}\mathbf{1}\mathbf{1}'$ is symmetric and idempotent). So the unbiased sample variance-covariance matrix $S$ is

$$S = \frac{1}{n-1}X'\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X$$

Similarly, the common biased sample variance-covariance matrix $S_n$ is

$$S_n = \frac{1}{n}X'\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X$$

If we substitute zeros for the off-diagonal elements of the variance-covariance matrix $S$ and take the square root of each element of the resulting matrix, we get the standard deviation matrix

$$D^{1/2} = \begin{bmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{bmatrix}$$

whose inverse is

$$D^{-1/2} = \begin{bmatrix} 1/\sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & 1/\sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/\sqrt{s_{pp}} \end{bmatrix}$$

Now since

$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} \dfrac{s_{11}}{\sqrt{s_{11}}\sqrt{s_{11}}} & \dfrac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \cdots & \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} \\ \dfrac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \dfrac{s_{22}}{\sqrt{s_{22}}\sqrt{s_{22}}} & \cdots & \dfrac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} \\ \vdots & \vdots & & \vdots \\ \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} & \dfrac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} & \cdots & \dfrac{s_{pp}}{\sqrt{s_{pp}}\sqrt{s_{pp}}} \end{bmatrix}$$

we have

$$R = D^{-1/2}SD^{-1/2}$$

which can be manipulated to show that the sample variance-covariance matrix $S$ is a function of the sample correlation matrix $R$:

$$S = D^{1/2}RD^{1/2}$$
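These matrix formulas translate directly into code. Below is a minimal NumPy sketch (our own) that computes $\bar{\mathbf{x}}$, $S$, $D^{1/2}$, and $R$ for the running example using exactly these matrix operations:

```python
import numpy as np

X = np.array([[ 2, 4, 6],
              [ 1, 7, 1],
              [-6, 1, 8]], dtype=float)
n, p = X.shape
one = np.ones((n, 1))

xbar = X.T @ one / n                          # sample mean vector (1/n) X'1
C = np.eye(n) - one @ one.T / n               # centering matrix I - (1/n)11'
S = X.T @ C @ X / (n - 1)                     # unbiased variance-covariance matrix
D_half = np.diag(np.sqrt(np.diag(S)))         # standard deviation matrix D^(1/2)
D_half_inv = np.diag(1 / np.sqrt(np.diag(S)))
R = D_half_inv @ S @ D_half_inv               # R = D^(-1/2) S D^(-1/2)
print(np.allclose(S, D_half @ R @ D_half))    # True: S = D^(1/2) R D^(1/2)
```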
E. Sample Values of Linear Combinations of Variables

For some linear combination of the p variables

$$\mathbf{c}'\mathbf{X} = \sum_{i=1}^{p} c_i X_i$$

whose observed value on the jth trial is

$$\mathbf{c}'\mathbf{x}_j = \sum_{i=1}^{p} c_i x_{ji}, \quad j = 1, \ldots, n$$

the n derived observations have

$$\text{sample mean} = \mathbf{c}'\bar{\mathbf{x}}, \quad \text{sample variance} = \mathbf{c}'S\mathbf{c}$$

If we have a second linear combination of these p variables

$$\mathbf{b}'\mathbf{X} = \sum_{i=1}^{p} b_i X_i$$

whose observed value on the jth trial is

$$\mathbf{b}'\mathbf{x}_j = \sum_{i=1}^{p} b_i x_{ji}, \quad j = 1, \ldots, n$$

then the two linear combinations have

$$\text{sample covariance} = \mathbf{b}'S\mathbf{c} = \mathbf{c}'S\mathbf{b}$$

If we have a q x p matrix $A$ whose kth row contains the coefficients of the kth linear combination of these p variables, then these q linear combinations have

$$\text{sample mean vector} = A\bar{\mathbf{x}}, \quad \text{sample variance-covariance matrix} = ASA'$$
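As a final check, here is a short NumPy sketch (our own; the coefficient vectors c and A are arbitrary illustrative choices) verifying the sample mean and variance results for linear combinations on the running example:

```python
import numpy as np

X = np.array([[ 2, 4, 6],
              [ 1, 7, 1],
              [-6, 1, 8]], dtype=float)
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                        # unbiased S

c = np.array([1.0, -1.0, 2.0])                     # illustrative coefficients
z = X @ c                                          # observed values c'x_j

# Mean and variance of the derived observations match c'xbar and c'Sc.
print(np.isclose(z.mean(), c @ xbar))              # True
print(np.isclose(z.var(ddof=1), c @ S @ c))        # True

# q x p matrix A: each row holds the coefficients of one linear combination.
A = np.array([[1.0, -1.0, 2.0],
              [0.5,  0.5, 0.0]])
Z = X @ A.T                                        # derived observations
print(np.allclose(Z.mean(axis=0), A @ xbar))                 # True: A xbar
print(np.allclose(np.cov(Z, rowvar=False), A @ S @ A.T))     # True: A S A'
```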