Test #2 Answers
STAT 873
Fall 2013

Complete the problems below. Make sure to fully explain all answers and show your work to receive full credit!

1) (16 total points) A representative sample of 200 patients who were admitted to an intensive care unit at a hospital was taken. The following information was collected on each patient:

Variable  Description
ID        Identification number
STA       Vital status: 0 = lived, 1 = died
AGE       Age in years
GENDER    0 = male, 1 = female
CPR       CPR administered: 0 = no, 1 = yes
SYS       Systolic blood pressure
HRA       Heart rate beats per minute
TYP       Type of admission: 0 = elective, 1 = emergency
LOC       Level of consciousness: 0 = no coma or deep stupor, 1 = deep stupor, 2 = coma (ordinal variable)

The data is in the test2.txt file, which is available on the graded materials web page of the course website. Below is an example of how I read in the data:

> set1<-read.table(file = "C:\\chris\\test2.txt", header = TRUE)
> head(set1)
  ID STA AGE GENDER CPR SYS HRA TYP LOC
1  8   0  27      1   0 142  88   1   0
2 12   0  59      0   0 112  80   1   0
3 14   0  77      0   0 100  70   0   0
4 28   0  54      0   0 142 103   1   0
5 32   0  87      1   0 110 154   1   0
6 38   0  69      0   0 110 132   1   0

The ultimate goal for this data set is to gain a better understanding of the data and how vital status is related to the other variables. Complete the following:

a) (8 points) Specifically describe an efficient way that AGE and HRA could be plotted for different levels of CPR, TYP, and LOC. While actually constructing a plot may be helpful, it is definitely not needed to answer the question (I did not when creating the answer key). An answer of a “star plot” will not receive credit because it is discussed in the next part.

There are a number of ways to construct a plot. In particular, a scatter plot can be plotted for AGE and HRA. The plotting point color can correspond to CPR level, and the plotting point symbol can correspond to TYP. LOC can be added to the plot in a number of ways.
First, a trellis plot could be constructed so that there are three panels of AGE vs. HRA with the plotting point color and symbol corresponding to CPR and TYP. Each panel would correspond to one value of LOC. Alternatively, the size of the plotting point in a single AGE vs. HRA scatter plot could correspond to the level of LOC. Note that symbols() could not be used for this type of plot.

A parallel coordinate plot would not work well here due to the few levels of CPR, TYP, and LOC. It would be difficult to see which observation would correspond to each line.

b) (8 points) Construct a star plot in R using all of the variables except for ID. What are TWO separate overall findings about this data that can be obtained from examining this plot? There are more than two, but I will only grade the first two that you list! A simple finding such as “Observation #1 is a female” will not be given credit. Please be very specific in your discussion because I am evaluating if you understand how to interpret a star plot. To help you out, please note that the data has been sorted by the STA variable. This can be helpful to determine what conditions are associated with those who lived or died.

> stars(x = set1[,-1], draw.segments = TRUE, key.loc = c(-1,12))

[Star plot of all observations; the key shows segments for GENDER, AGE, CPR, STA, SYS, LOC, HRA, and TYP.]

There are many answers to this problem:
i) Larger LOC values are associated with STA = 1 (died). This can be seen by noticing that there are many more large gray areas for the STA = 1 stars (large black ray) than for the STA = 0 stars (no black ray). Overall, there are only 2 non-zero LOC values in the STA = 0 observations.
ii) Almost all of the STA = 1 (died) patients were TYP = 1 (emergency) admissions rather than TYP = 0 (elective) admissions.
iii) There does not appear to be much of a difference between the gender levels among the STA = 0 and STA = 1 groups.
iv) CPR was not given often because there are few blue rays, especially for the elective admissions.

To verify some of my findings from the plot, I used aggregate() to find mean values for each variable conditioned on STA = 0 or 1.

> aggregate(x = set1[,-1], by = list(set1$STA), FUN = mean)
  Group.1 STA    AGE GENDER    CPR      SYS     HRA     TYP   LOC
1       0   0 55.650  0.375 0.0375 135.6438  98.500 0.68125 0.025
2       1   1 65.125  0.400 0.1750 118.8250 100.625 0.95000 0.525

2) (37 total points) Continuing to use the data set in problem 1, complete the following regarding principal component analysis. Make sure to exclude ID and STA and use the correlation matrix. This exclusion of ID and STA can be accomplished simply by using only set1[,-c(1:2)] in your analysis.

a) (6 points) Why is the correlation matrix better to use here than the covariance matrix?

The variables are measured on different scales and have very different variances (for example, the standard deviation of SYS is about 33 while that of CPR is about 0.25). With the covariance matrix, the variables with the largest variances would play a much larger role in the PCA than the others.

b) (10 points) State and interpret the first two principal components.
> pca.cor<-princomp(x = set1[,-c(1:2)], cor = TRUE, scores = FALSE)
> summary(pca.cor, loadings = TRUE, cutoff = 0.0)
Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.2828271 1.1032624 1.0523541
Proportion of Variance 0.2350922 0.1738840 0.1582070
Cumulative Proportion  0.2350922 0.4089762 0.5671832
                          Comp.4    Comp.5     Comp.6
Standard deviation     0.9846019 0.9199364 0.82369148
Proportion of Variance 0.1384916 0.1208976 0.09692395
Cumulative Proportion  0.7056748 0.8265723 0.92349630
                          Comp.7
Standard deviation     0.7317964
Proportion of Variance 0.0765037
Cumulative Proportion  1.0000000

Loadings:
       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
AGE    -0.027  0.668 -0.200  0.471  0.268  0.359 -0.302
GENDER  0.219  0.278 -0.544 -0.522  0.433 -0.301  0.168
CPR     0.550  0.216  0.017 -0.044 -0.497 -0.361 -0.521
SYS    -0.370  0.232 -0.342 -0.395 -0.622  0.382  0.056
HRA     0.212 -0.250 -0.616  0.555 -0.256 -0.092  0.362
TYP     0.456 -0.439 -0.175 -0.196  0.158  0.649 -0.289
LOC     0.508  0.349  0.370 -0.046 -0.123  0.273  0.625

\[ \hat{y}_{r1} = -0.027\,\text{AGE} + 0.219\,\text{GENDER} + 0.550\,\text{CPR} - 0.370\,\text{SYS} + 0.212\,\text{HRA} + 0.456\,\text{TYP} + 0.508\,\text{LOC} \]
\[ \hat{y}_{r2} = 0.668\,\text{AGE} + 0.278\,\text{GENDER} + 0.216\,\text{CPR} + 0.232\,\text{SYS} - 0.250\,\text{HRA} - 0.439\,\text{TYP} + 0.349\,\text{LOC} \]
where AGE, GENDER, …, LOC are standardized.

PC #1: This is a contrast between GENDER, CPR, HRA, TYP, LOC vs. SYS. People with small values of PC #1 will generally be in better shape than ones with large values. For example, CPR = 0, TYP = 0, LOC = 0 will cause small values.

PC #2: This is a contrast between AGE, GENDER, CPR, SYS, LOC vs. HRA and TYP. I do not know of a more in-depth interpretation than this.

c) (10 points) Suppose a new patient arrived in the intensive care unit with the following variable values:

AGE GENDER CPR SYS HRA TYP LOC
 32      0   0 110  60   0   0

Find the first principal component value for this patient. Provide a possible reason why an intensive care unit may want to find these values for any patient. Please remember to show your work!
PC #1: \[ \hat{y}_{\text{new},1} = -0.027(32 - 57.545)/20.0546 + \cdots + 0.508(0 - 0.125)/0.4587 = -1.234368 \]

This would be useful in order to see how this patient compares to all of the others in our sample. For example, -1.2344 is in a region where no one in the sample died. Perhaps patients can be given care sooner or later depending on the PC score value.

> new.obs<-data.frame(AGE = 32, GENDER = 0, CPR = 0, SYS = 110, HRA = 60, TYP = 0, LOC = 0)
> apply(X = set1[,-c(1:2)], MARGIN = 2, FUN = sd)
       AGE     GENDER        CPR        SYS        HRA
20.0546483  0.4866045  0.2471445 32.9520986 26.8296202
       TYP        LOC
 0.4424407  0.4587234
> colMeans(set1[,-c(1:2)])
    AGE  GENDER     CPR     SYS     HRA     TYP     LOC
 57.545   0.380   0.065 132.280  98.925   0.735   0.125
> pca.cor$scale<-apply(X = set1[,-c(1:2)], MARGIN = 2, FUN = sd)
> predict(pca.cor, newdata = new.obs)
        Comp.1     Comp.2   Comp.3    Comp.4     Comp.5
[1,] -1.234264 -0.2842628 1.989497 -0.380299 0.01410576
       Comp.6    Comp.7
[1,] -1.40413 0.1370736

d) (6 points) How many principal components are needed? Justify your answer.

At least three because the number of eigenvalues greater than 1 is three. More PCs may be needed because three only explains 56.72% of the variation in the data. With 5 PCs, greater than 80% of the variation is explained.

I accepted 3, 4, or 5 as long as one of them was picked and there was correct justification.

e) (5 points) What is the variance of PC #1?

1.6456, which is the largest eigenvalue (the square of the Comp.1 standard deviation, 1.2828271^2).

3) (21 total points) Continuing to use the data set in problem 1, complete the following regarding factor analysis. Make sure to exclude ID and STA and use the correlation matrix. This exclusion of ID and STA can be accomplished simply by using only set1[,-c(1:2)] in your analysis.

a) (7 points) How many common factors should be used? Justify your answer.
> mod.fit1<-factanal(x = set1[,-c(1:2)], factors = 1, rotation = "none")
> data.frame(stat = mod.fit1$STATISTIC, pvalue = mod.fit1$PVAL)
              stat      pvalue
objective 33.72142 0.002264714
> mod.fit2<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "none")
> data.frame(stat = mod.fit2$STATISTIC, pvalue = mod.fit2$PVAL)
              stat    pvalue
objective 12.28279 0.1390253
> mod.fit3<-factanal(x = set1[,-c(1:2)], factors = 3, rotation = "none")
> data.frame(stat = mod.fit3$STATISTIC, pvalue = mod.fit3$PVAL)
              stat    pvalue
objective 5.090927 0.1652589
> mod.fit4<-factanal(x = set1[,-c(1:2)], factors = 4, rotation = "none")
Error in factanal(x = set1[, -c(1:2)], factors = 4, rotation = "none") :
  4 factors are too many for 7 variables

> #Compare estimates of correlation matrix
> resid2<-mod.fit2$correlation - (mod.fit2$loadings[,]%*%t(mod.fit2$loadings[,]) + diag(mod.fit2$uniqueness))
> round(resid2, 4)
           AGE  GENDER     CPR     SYS     HRA     TYP    LOC
AGE     0.0000  0.1148 -0.0007  0.0250  0.0853 -0.0020  0e+00
GENDER  0.1148  0.0000  0.0935  0.0790  0.0211  0.0017 -2e-04
CPR    -0.0007  0.0935  0.0000 -0.0251  0.1259 -0.0032  0e+00
SYS     0.0250  0.0790 -0.0251  0.0000 -0.0374  0.0006  0e+00
HRA     0.0853  0.0211  0.1259 -0.0374  0.0000 -0.0006 -4e-04
TYP    -0.0020  0.0017 -0.0032  0.0006 -0.0006  0.0000  0e+00
LOC     0.0000 -0.0002  0.0000  0.0000 -0.0004  0.0000  0e+00
> abs(resid2)>0.1
         AGE GENDER   CPR   SYS   HRA   TYP   LOC
AGE    FALSE   TRUE FALSE FALSE FALSE FALSE FALSE
GENDER  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE
CPR    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE
SYS    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
HRA    FALSE  FALSE  TRUE FALSE FALSE FALSE FALSE
TYP    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
LOC    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
> sum(abs(resid2)>0.1)
[1] 4
> colMeans(abs(resid2))
         AGE       GENDER          CPR          SYS          HRA          TYP
3.255360e-02 4.432697e-02 3.548839e-02 2.387437e-02 3.865842e-02 1.146683e-03
         LOC
9.079648e-05

> resid3<-mod.fit3$correlation - (mod.fit3$loadings[,]%*%t(mod.fit3$loadings[,]) + diag(mod.fit3$uniqueness))
> resid3
           AGE  GENDER     CPR     SYS     HRA     TYP    LOC
AGE     0.0000  0.0174 -0.0423 -0.0249  0.0205 -0.0034  1e-04
GENDER  0.0174  0.0000  0.0287  0.0726 -0.0641  0.0290 -1e-04
CPR    -0.0423  0.0287  0.0000 -0.0101  0.0430 -0.0279  1e-04
SYS    -0.0249  0.0726 -0.0101  0.0000 -0.0227 -0.0073  0e+00
HRA     0.0205 -0.0641  0.0430 -0.0227  0.0000  0.0001 -1e-04
TYP    -0.0034  0.0290 -0.0279 -0.0073  0.0001  0.0000  1e-04
LOC     0.0001 -0.0001  0.0001  0.0000 -0.0001  0.0001  0e+00
> abs(resid3)>0.1
         AGE GENDER   CPR   SYS   HRA   TYP   LOC
AGE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
GENDER FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
CPR    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
SYS    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
HRA    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
TYP    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
LOC    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
> sum(abs(resid3)>0.1)
[1] 0
> colMeans(abs(resid3))
         AGE       GENDER          CPR          SYS          HRA          TYP
0.0155049426 0.0302558987 0.0217252655 0.0196703138 0.0215082853 0.0096741223
         LOC
0.0000586778

One common factor is likely not enough due to its low p-value for the LRT. Four factors are too many: with p = 7 variables and m = 4 factors, the LRT degrees of freedom, [(p - m)^2 - (p + m)]/2 = -1, would be negative. Both two and three factors have non-significant LRT results, so those numbers of common factors may be sufficient. I will choose three common factors because the residuals are smaller in absolute value.

b) (6 points) Suppose two factors are used (this does not mean your answer to part a) should be 2!). What is the final estimated factor analysis model after the varimax rotation?
> mod.fit2v<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "varimax")
> print(x = mod.fit2v, cutoff = 0.0)

Call:
factanal(x = set1[, -c(1:2)], factors = 2, rotation = "varimax")

Uniquenesses:
   AGE GENDER    CPR    SYS    HRA    TYP    LOC
 0.947  0.983  0.848  0.939  0.956  0.073  0.005

Loadings:
       Factor1 Factor2
AGE     0.036  -0.226
GENDER  0.086   0.096
CPR     0.389   0.009
SYS    -0.219  -0.112
HRA    -0.019   0.209
TYP     0.394   0.878
LOC     0.960  -0.272

               Factor1 Factor2
SS loadings      1.285   0.962
Proportion Var   0.184   0.137
Cumulative Var   0.184   0.321

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 12.28 on 8 degrees of freedom.
The p-value is 0.139

\[
\begin{bmatrix} z_{\text{AGE}} \\ z_{\text{GENDER}} \\ z_{\text{CPR}} \\ z_{\text{SYS}} \\ z_{\text{HRA}} \\ z_{\text{TYP}} \\ z_{\text{LOC}} \end{bmatrix}
=
\begin{bmatrix} 0.036 & -0.226 \\ 0.086 & 0.096 \\ 0.389 & 0.009 \\ -0.219 & -0.112 \\ -0.019 & 0.209 \\ 0.394 & 0.878 \\ 0.960 & -0.272 \end{bmatrix}
\begin{bmatrix} f_1 \\ f_2 \end{bmatrix}
+
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \eta_7 \end{bmatrix}
\]

c) (8 points) Suppose two factors are used. What is the estimated correlation between AGE and factor #1? What does this say about what factor #1 represents? Note that there may be more than one answer to this question. Please remember that $\text{Corr}(z_j, f_k) = \lambda_{jk}$.

Using the unrotated common factors, the estimated correlation is 0.0954. Using the rotated common factors, the estimated correlation is 0.036. In both cases, the correlation is very small and positive. Thus, there does not appear to be much of a relationship between AGE and common factor #1. Furthermore, this means that factor #1 does not represent AGE much.

Output before rotation:

> mod.fit2<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "none")
> print(x = mod.fit2, cutoff = 0.0)

Call:
factanal(x = set1[, -c(1:2)], factors = 2, rotation = "none")

Uniquenesses:
   AGE GENDER    CPR    SYS    HRA    TYP    LOC
 0.947  0.983  0.848  0.939  0.956  0.073  0.005

Loadings:
       Factor1 Factor2
AGE     0.094  -0.209
GENDER  0.058   0.115
CPR     0.373   0.111
SYS    -0.182  -0.166
HRA    -0.073   0.196
TYP     0.149   0.951
LOC     0.997  -0.010

               Factor1 Factor2
SS loadings      1.207   1.040
Proportion Var   0.172   0.149
Cumulative Var   0.172   0.321

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 12.28 on 8 degrees of freedom.
The p-value is 0.139

4) (26 total points) Answer the following questions.

a) (10 points) What is a parallel coordinate plot and when is it useful? Make sure to explain how observations are represented on the plot. Pictures of the plot will not be accepted!

For each variable, a vertical line is plotted and individual variable values are plotted upon it. The minimum variable value is at the bottom and the maximum variable value is at the top for each variable. Lines connecting variable values for each observation are drawn across the vertical variable lines. These plots are useful for viewing multivariate data because there is no limit to the number of variables that could be represented. Trends can be found by following the lines for observations. For example, perhaps some observations that have large x1 values tend to also have large x2 values. This would be demonstrated by having lines drawn between x1 and x2 toward the top of the plot connecting observation values.

b) (6 points) In general, why should one wait to interpret the common factors until the rotation is done in factor analysis?

By rotating the factors, we are more likely to see loadings closer to -1, 0, or 1. When this happens, the common factors are easier to interpret. For example, loadings on a common factor that are close to 0 indicate that the common factor does not represent much of the corresponding original variable. The reverse is true when loadings are close to 1 or -1. Please remember that $\text{Corr}(z_j, f_k) = \lambda_{jk}$.

c) (10 points) With respect to factor analysis, discuss what the “nonuniqueness of the common factors” means. Also, provide a mathematical explanation of its cause.

The nonuniqueness of the common factors means the common factors can be interpreted multiple ways due to the factor loadings possibly changing. As was shown in class, for any orthogonal matrix $T$ (so that $TT' = I$),
\[ x = \Lambda f + \eta = \Lambda T T' f + \eta = (\Lambda T)(T' f) + \eta = \Lambda^* f^* + \eta \]
where $\Lambda^* = \Lambda T$ and $f^* = T' f$. Thus, there is a new factor loading matrix, $\Lambda^*$, which provides the $\lambda_{jk}$'s used to interpret the common factors.

5) (3 points, extra credit) Where were trellis plots developed?

AT&T Bell Labs (see 9-30-13 video at 23:00)
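The nonuniqueness argument in 4c can also be checked numerically. The sketch below is not part of the original answer key. It takes the varimax loadings reported in 3b as an illustrative loading matrix and multiplies them by a rotation matrix built from an arbitrarily chosen angle (the angle and the object names Lambda, T.mat, and Lambda.star are assumptions made for illustration). The individual loadings change, but the product of the loading matrix with its transpose, and therefore the fitted correlation matrix, does not.

```r
# Loadings from the varimax output in 3b, used only as an illustrative Lambda
Lambda <- matrix(c( 0.036, -0.226,
                    0.086,  0.096,
                    0.389,  0.009,
                   -0.219, -0.112,
                   -0.019,  0.209,
                    0.394,  0.878,
                    0.960, -0.272),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("AGE", "GENDER", "CPR", "SYS", "HRA", "TYP", "LOC"),
                                 c("Factor1", "Factor2")))

# An arbitrary 2 x 2 orthogonal (rotation) matrix; the angle is a hypothetical choice
theta <- pi / 6
T.mat <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), nrow = 2)

# New loading matrix: Lambda* = Lambda T
Lambda.star <- Lambda %*% T.mat

# Largest departure of T T' from the identity (should be numerically zero)
orth.check <- max(abs(T.mat %*% t(T.mat) - diag(2)))

# Largest difference between Lambda* Lambda*' and Lambda Lambda' (should be numerically zero)
max.diff <- max(abs(Lambda.star %*% t(Lambda.star) - Lambda %*% t(Lambda)))

print(round(Lambda.star, 3))
print(orth.check)
print(max.diff)
```

Because the loadings in Lambda.star differ from those in Lambda while reproducing the same correlation structure, either set could be used to interpret the common factors, which is the nonuniqueness described in 4c.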