Test #2 Answers
STAT 873
Fall 2013
Complete the problems below. Make sure to fully explain all answers and show your work to receive
full credit!
1) (16 total points) A representative sample of 200 patients who were admitted to an intensive care
unit at a hospital was taken. The following information was collected on each patient:
Variable  Description
ID        Identification number
STA       Vital status: 0 = lived, 1 = died
AGE       Age in years
GENDER    0 = male, 1 = female
CPR       CPR administered: 0 = no, 1 = yes
SYS       Systolic blood pressure
HRA       Heart rate beats per minute
TYP       Type of admission: 0 = elective, 1 = emergency
LOC       Level of consciousness: 0 = no coma or deep stupor, 1 = deep stupor, 2 = coma
          (ordinal variable)
The data is in the test2.txt file which is available on the graded materials web page of the course
website. Below is an example of how I read in the data:
> set1<-read.table(file = "C:\\chris\\test2.txt", header = TRUE)
> head(set1)
  ID STA AGE GENDER CPR SYS HRA TYP LOC
1  8   0  27      1   0 142  88   1   0
2 12   0  59      0   0 112  80   1   0
3 14   0  77      0   0 100  70   0   0
4 28   0  54      0   0 142 103   1   0
5 32   0  87      1   0 110 154   1   0
6 38   0  69      0   0 110 132   1   0
The ultimate goal for this data set is to gain a better understanding of the data and how vital status
is related to the other variables. Complete the following:
a) (8 points) Specifically describe an efficient way that AGE and HRA could be plotted for
different levels of CPR, TYP, and LOC. While actually constructing a plot may be helpful, it is
definitely not needed to answer the question (I did not when creating the answer key). An
answer of a “star plot” will not receive credit because it is discussed in the next part.
There are a number of ways to construct a plot. In particular, a scatter plot can be plotted for
AGE and HRA. The plotting point color can correspond to CPR level, and the plotting point
symbol can correspond to TYP. LOC can be added to the plot in a number of ways. First, a
trellis plot could be constructed so that there are three panels of AGE vs. HRA with the plotting
point corresponding to CPR and TYP. Each panel would correspond to one value of LOC.
Alternatively, the size of the plotting point in a single AGE vs. HRA scatter plot could
correspond to the level of LOC. Note that symbols() could not be used for this type of plot.
A parallel coordinate plot would not work well here due to the few levels of CPR, TYP, and
LOC. It would be difficult to see which observation would correspond to each line.
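The trellis-style layout described above could be sketched in base R roughly as follows. This is only an illustration: the data frame `demo` is simulated stand-in data (hypothetical, not the test2.txt data), and the color and symbol choices are assumptions rather than part of the answer key.

```r
# Simulated stand-in for set1 (hypothetical values; NOT the test2.txt data)
set.seed(873)
n <- 60
demo <- data.frame(
  AGE = round(runif(n, min = 16, max = 92)),
  HRA = round(runif(n, min = 40, max = 190)),
  CPR = rbinom(n, size = 1, prob = 0.2),
  TYP = rbinom(n, size = 1, prob = 0.7),
  LOC = sample(0:2, size = n, replace = TRUE)
)

# Color encodes CPR (0 = black, 1 = red); symbol encodes TYP (0 = circle, 1 = triangle)
pt.col <- ifelse(demo$CPR == 1, "red", "black")
pt.pch <- ifelse(demo$TYP == 1, 17, 1)

# One panel per LOC level, with common axis limits so the panels are comparable
loc.levels <- sort(unique(demo$LOC))
old.par <- par(mfrow = c(1, length(loc.levels)))
for (k in loc.levels) {
  keep <- demo$LOC == k
  plot(x = demo$AGE[keep], y = demo$HRA[keep], col = pt.col[keep], pch = pt.pch[keep],
       xlim = range(demo$AGE), ylim = range(demo$HRA),
       xlab = "AGE", ylab = "HRA", main = paste("LOC =", k))
}
# Legend is drawn in the last panel only
legend("topright", legend = c("CPR = 0", "CPR = 1", "TYP = 0", "TYP = 1"),
       col = c("black", "red", "black", "black"), pch = c(15, 15, 1, 17), bty = "n")
par(old.par)
```

For the single-panel alternative, the `cex` argument of `plot()` could be set from LOC (for example, `cex = 1 + demo$LOC`) so that point size carries the third variable.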
b) (8 points) Construct a star plot in R using all of the variables except for ID. What are TWO
separate overall findings about this data that can be obtained from examining the plot?
There are more than two, but I will only grade the first two that you list! A simple finding such
as “Observation #1 is a female” will not be given credit. Please be very specific in your
discussion because I am evaluating if you understand how to interpret a star plot. To help you
out, please note that the data has been sorted by the STA variable. This can be helpful to
determine what conditions are associated with those who lived or died.
> stars(x = set1[,-1], draw.segments = TRUE, key.loc = c(-1,12))
[Star plot of all observations. Key segments: GENDER, AGE, CPR, STA, SYS, LOC, HRA, TYP]
There are many answers to this problem:
i) Larger LOC values are associated with STA=1 (died). This can be seen by noticing that
there are many more large gray areas for the STA = 1 stars (large black ray) than the STA
= 0 (no black ray) stars. Overall, there are only 2 non-zero LOC values in the STA=0
observations.
ii) Almost all of the STA = 1 (died) observations were the result of a TYP = 1 (emergency)
admission instead of a TYP = 0 (elective) admission.
iii) There does not appear to be much of a difference between the gender levels among the
STA = 0 and STA = 1 groups.
iv) CPR was rarely given, as shown by the few blue rays, especially for the elective
admissions.
To verify some of my findings from the plot, I used aggregate() to find mean values for each
variable conditioned on STA = 0 or 1.
> aggregate(x = set1[,-1], by = list(set1$STA), FUN = mean)
  Group.1 STA    AGE GENDER    CPR      SYS     HRA     TYP   LOC
1       0   0 55.650  0.375 0.0375 135.6438  98.500 0.68125 0.025
2       1   1 65.125  0.400 0.1750 118.8250 100.625 0.95000 0.525
2) (37 total points) Continuing to use the data set from problem 1, complete
the following regarding principal component analysis. Make sure to exclude ID and STA and
use the correlation matrix. This exclusion of ID and STA can be accomplished simply by using
only set1[,-c(1:2)] in your analysis.
a) (6 points) Why is the correlation matrix better to use here than the covariance matrix?
The variables are in multiple scales and have different variances, which can cause some
variables to play a larger role in the PCA than others.
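A small simulation can illustrate the point. The sketch below uses made-up data (hypothetical; the variable names only mimic SYS and CPR) to show that a covariance-based PCA is dominated by the high-variance variable, while a correlation-based PCA is the eigen-decomposition of the correlation matrix.

```r
# Simulated data on very different scales (hypothetical; not the test2.txt data)
set.seed(873)
n <- 200
x <- data.frame(
  SYS = rnorm(n, mean = 130, sd = 33),   # large variance
  CPR = rbinom(n, size = 1, prob = 0.1)  # small variance
)

pca.cov <- princomp(x = x, cor = FALSE)
pca.cor <- princomp(x = x, cor = TRUE)

# With the covariance matrix, PC #1 is essentially just the high-variance variable
abs(pca.cov$loadings[, 1])

# With the correlation matrix, the PC variances are the eigenvalues of cor(x),
# so every variable enters on the same (standardized) scale
eig <- eigen(cor(x))$values
all.equal(unname(pca.cor$sdev^2), eig)
```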
b) (10 points) State and interpret the first two principal components.
> pca.cor<-princomp(x = set1[,-c(1:2)], cor = TRUE, scores = FALSE)
> summary(pca.cor, loadings = TRUE, cutoff = 0.0)
Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.2828271 1.1032624 1.0523541
Proportion of Variance 0.2350922 0.1738840 0.1582070
Cumulative Proportion  0.2350922 0.4089762 0.5671832
                          Comp.4    Comp.5     Comp.6
Standard deviation     0.9846019 0.9199364 0.82369148
Proportion of Variance 0.1384916 0.1208976 0.09692395
Cumulative Proportion  0.7056748 0.8265723 0.92349630
                          Comp.7
Standard deviation     0.7317964
Proportion of Variance 0.0765037
Cumulative Proportion  1.0000000

Loadings:
       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
AGE    -0.027  0.668 -0.200  0.471  0.268  0.359 -0.302
GENDER  0.219  0.278 -0.544 -0.522  0.433 -0.301  0.168
CPR     0.550  0.216  0.017 -0.044 -0.497 -0.361 -0.521
SYS    -0.370  0.232 -0.342 -0.395 -0.622  0.382  0.056
HRA     0.212 -0.250 -0.616  0.555 -0.256 -0.092  0.362
TYP     0.456 -0.439 -0.175 -0.196  0.158  0.649 -0.289
LOC     0.508  0.349  0.370 -0.046 -0.123  0.273  0.625
ŷr1 = -0.027AGE + 0.219GENDER + 0.550CPR – 0.370SYS + 0.212HRA + 0.456TYP +
0.508LOC
3
ŷr 2 = 0.668AGE + 0.278GENDER + 0.216CPR + 0.232SYS – 0.250HRA – 0.439TYP +
0.349LOC
where the AGE, GENDER, …, LOC are standardized.
PC #1: This is a contrast between GENDER, CPR, HRA, TYP, LOC vs. SYS. People with
small values of PC #1 will generally be in better shape than ones with large values. For
example CPR = 0, TYP = 0, LOC = 0 will cause small values.
PC #2: This is a contrast between AGE, GENDER, CPR, SYS, LOC vs. HRA and TYP. I do
not know of a more in-depth interpretation than this.
c) (10 points) Suppose a new patient arrived in the intensive care unit with the following variable
values:
AGE  GENDER  CPR  SYS  HRA  TYP  LOC
32   0       0    110  60   0    0
Find the first principal component value for this patient. Provide a possible reason why an
intensive care unit may want to find these values for any patient. Please remember to show
your work!
PC #1 (ŷnew,1) = -0.027(32 - 57.545)/20.0546 + ... + 0.508(0 - 0.125)/0.4587 = -1.234368
This would be useful in order to see how this patient compares with all of the others in our
sample. For example, -1.2344 is in a region where no one in the sample died. Perhaps
patients can be given care sooner or later depending on the PC score value.
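The hand calculation can be written out in full as a short sketch. The loadings, means, and standard deviations below are copied from the output shown in this answer; the object names are only illustrative. Because the loadings are rounded to three decimals, the result matches the predict() value only to about three decimal places.

```r
# PC #1 loadings (rounded to 3 decimals in the printed output)
loadings.pc1 <- c(AGE = -0.027, GENDER = 0.219, CPR = 0.550, SYS = -0.370,
                  HRA = 0.212, TYP = 0.456, LOC = 0.508)

# Sample means and standard deviations of the seven variables
x.bar <- c(AGE = 57.545, GENDER = 0.380, CPR = 0.065, SYS = 132.280,
           HRA = 98.925, TYP = 0.735, LOC = 0.125)
x.sd  <- c(AGE = 20.0546483, GENDER = 0.4866045, CPR = 0.2471445, SYS = 32.9520986,
           HRA = 26.8296202, TYP = 0.4424407, LOC = 0.4587234)

# New patient
x.new <- c(AGE = 32, GENDER = 0, CPR = 0, SYS = 110, HRA = 60, TYP = 0, LOC = 0)

# Standardize, then take the linear combination given by the loadings
z.new <- (x.new - x.bar) / x.sd
score.pc1 <- sum(loadings.pc1 * z.new)
score.pc1  # approximately -1.23
```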
> new.obs<-data.frame(AGE = 32, GENDER = 0, CPR = 0, SYS = 110, HRA = 60, TYP = 0,
    LOC = 0)
> apply(X = set1[,-c(1:2)], MARGIN = 2, FUN = sd)
       AGE     GENDER        CPR        SYS        HRA        TYP        LOC
20.0546483  0.4866045  0.2471445 32.9520986 26.8296202  0.4424407  0.4587234
> colMeans(set1[,-c(1:2)])
    AGE  GENDER     CPR     SYS     HRA     TYP     LOC
 57.545   0.380   0.065 132.280  98.925   0.735   0.125
> pca.cor$scale<-apply(X = set1[,-c(1:2)], MARGIN = 2, FUN = sd)
> predict(pca.cor, newdata = new.obs)
        Comp.1     Comp.2   Comp.3    Comp.4     Comp.5   Comp.6    Comp.7
[1,] -1.234264 -0.2842628 1.989497 -0.380299 0.01410576 -1.40413 0.1370736
d) (6 points) How many principal components are needed? Justify your answer.
At least three because the number of eigenvalues greater than 1 is three. More PCs may be
needed because three only explains 56.72% of the variation in the data. With 5 PCs, greater
than 80% of the variation is explained.
I accepted 3, 4, or 5 as long as one of them was picked and there was correct justification.
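Both criteria can be checked directly from the standard deviations in the summary(pca.cor) output of part b). The sketch below copies those values by hand; the object names are illustrative only.

```r
# Standard deviations of the 7 PCs, copied from the summary(pca.cor) output
pc.sdev <- c(1.2828271, 1.1032624, 1.0523541, 0.9846019, 0.9199364, 0.82369148, 0.7317964)

# Eigenvalues of the correlation matrix are the PC variances
eigenvalues <- pc.sdev^2
round(eigenvalues, 4)

# Criterion 1: number of eigenvalues greater than 1
n.gt.one <- sum(eigenvalues > 1)

# Criterion 2: smallest number of PCs explaining more than 80% of the variation
cum.prop <- cumsum(eigenvalues) / sum(eigenvalues)
first.80 <- which(cum.prop > 0.80)[1]

c(n.gt.one = n.gt.one, first.80 = first.80)
```

The first element of `eigenvalues` is also the variance of PC #1 asked for in part e).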
e) (5 points) What is the variance of PC #1?
1.6456, which is the largest eigenvalue
3) (21 total points) Continuing to use the data set from problem 1, complete
the following regarding factor analysis. Make sure to exclude ID and STA and use the
correlation matrix. This exclusion of ID and STA can be accomplished simply by using only
set1[,-c(1:2)] in your analysis.
a) (7 points) How many common factors should be used? Justify your answer.
> mod.fit1<-factanal(x = set1[,-c(1:2)], factors = 1, rotation = "none")
> data.frame(stat = mod.fit1$STATISTIC, pvalue = mod.fit1$PVAL)
              stat      pvalue
objective 33.72142 0.002264714
> mod.fit2<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "none")
> data.frame(stat = mod.fit2$STATISTIC, pvalue = mod.fit2$PVAL)
              stat    pvalue
objective 12.28279 0.1390253
> mod.fit3<-factanal(x = set1[,-c(1:2)], factors = 3, rotation = "none")
> data.frame(stat = mod.fit3$STATISTIC, pvalue = mod.fit3$PVAL)
              stat    pvalue
objective 5.090927 0.1652589
> mod.fit4<-factanal(x = set1[,-c(1:2)], factors = 4, rotation = "none")
Error in factanal(x = set1[, -c(1:2)], factors = 4, rotation = "none") :
4 factors are too many for 7 variables
> #Compare estimates of correlation matrix
> resid2<-mod.fit2$correlation - (mod.fit2$loadings[,]%*%t(mod.fit2$loadings[,]) +
    diag(mod.fit2$uniqueness))
> round(resid2, 4)
           AGE  GENDER     CPR     SYS     HRA     TYP    LOC
AGE     0.0000  0.1148 -0.0007  0.0250  0.0853 -0.0020  0e+00
GENDER  0.1148  0.0000  0.0935  0.0790  0.0211  0.0017 -2e-04
CPR    -0.0007  0.0935  0.0000 -0.0251  0.1259 -0.0032  0e+00
SYS     0.0250  0.0790 -0.0251  0.0000 -0.0374  0.0006  0e+00
HRA     0.0853  0.0211  0.1259 -0.0374  0.0000 -0.0006 -4e-04
TYP    -0.0020  0.0017 -0.0032  0.0006 -0.0006  0.0000  0e+00
LOC     0.0000 -0.0002  0.0000  0.0000 -0.0004  0.0000  0e+00
> abs(resid2)>0.1
         AGE GENDER   CPR   SYS   HRA   TYP   LOC
AGE    FALSE   TRUE FALSE FALSE FALSE FALSE FALSE
GENDER  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE
CPR    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE
SYS    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
HRA    FALSE  FALSE  TRUE FALSE FALSE FALSE FALSE
TYP    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
LOC    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
> sum(abs(resid2)>0.1)
[1] 4
> colMeans(abs(resid2))
         AGE       GENDER          CPR          SYS          HRA          TYP
3.255360e-02 4.432697e-02 3.548839e-02 2.387437e-02 3.865842e-02 1.146683e-03
         LOC
9.079648e-05
> resid3<-mod.fit3$correlation - (mod.fit3$loadings[,]%*%t(mod.fit3$loadings[,]) +
    diag(mod.fit3$uniqueness))
> resid3
           AGE  GENDER     CPR     SYS     HRA     TYP    LOC
AGE     0.0000  0.0174 -0.0423 -0.0249  0.0205 -0.0034  1e-04
GENDER  0.0174  0.0000  0.0287  0.0726 -0.0641  0.0290 -1e-04
CPR    -0.0423  0.0287  0.0000 -0.0101  0.0430 -0.0279  1e-04
SYS    -0.0249  0.0726 -0.0101  0.0000 -0.0227 -0.0073  0e+00
HRA     0.0205 -0.0641  0.0430 -0.0227  0.0000  0.0001 -1e-04
TYP    -0.0034  0.0290 -0.0279 -0.0073  0.0001  0.0000  1e-04
LOC     0.0001 -0.0001  0.0001  0.0000 -0.0001  0.0001  0e+00
> abs(resid3)>0.1
         AGE GENDER   CPR   SYS   HRA   TYP   LOC
AGE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
GENDER FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
CPR    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
SYS    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
HRA    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
TYP    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
LOC    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
> sum(abs(resid3)>0.1)
[1] 0
> colMeans(abs(resid3))
         AGE       GENDER          CPR          SYS          HRA          TYP
0.0155049426 0.0302558987 0.0217252655 0.0196703138 0.0215082853 0.0096741223
         LOC
0.0000586778
One common factor is likely not enough due to its low p-value for the LRT. Four factors are too
many (the degrees of freedom for the LRT would be negative). Both two and three factors have
non-significant LRT results, so those numbers of common factors may be sufficient. I will
choose three common factors because its residuals are smaller in absolute value.
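The degrees-of-freedom argument can be checked numerically. The sketch below uses the standard formula for the LRT of an m-factor model with p variables, df = [(p - m)^2 - (p + m)]/2; the helper name `fa.df` is hypothetical, and the test statistics are copied from the output above.

```r
# Degrees of freedom for the LRT that m common factors are sufficient (p variables)
fa.df <- function(p, m) ((p - m)^2 - (p + m)) / 2

df.all <- fa.df(p = 7, m = 1:4)
df.all  # 14 8 3 -1; the negative value is why factanal() refuses 4 factors

# LRT statistics copied from the factanal() output for 1, 2, and 3 factors
lrt.stat <- c(33.72142, 12.28279, 5.090927)
p.values <- pchisq(q = lrt.stat, df = df.all[1:3], lower.tail = FALSE)
round(p.values, 4)
```

The 8 degrees of freedom for m = 2 agree with the "12.28 on 8 degrees of freedom" line printed by factanal() for the two-factor model.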
b) (6 points) Suppose two factors are used (this does not mean your answer to part a)
should be 2!). What is the final estimated factor analysis model
after the varimax rotation?
> mod.fit2v<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "varimax")
> print(x = mod.fit2v, cutoff = 0.0)
Call:
factanal(x = set1[, -c(1:2)], factors = 2, rotation = "varimax")
Uniquenesses:
   AGE GENDER    CPR    SYS    HRA    TYP    LOC
 0.947  0.983  0.848  0.939  0.956  0.073  0.005

Loadings:
       Factor1 Factor2
AGE     0.036  -0.226
GENDER  0.086   0.096
CPR     0.389   0.009
SYS    -0.219  -0.112
HRA    -0.019   0.209
TYP     0.394   0.878
LOC     0.960  -0.272

               Factor1 Factor2
SS loadings      1.285   0.962
Proportion Var   0.184   0.137
Cumulative Var   0.184   0.321
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 12.28 on 8 degrees of freedom.
The p-value is 0.139

$$
\begin{bmatrix} z_{AGE} \\ z_{GENDER} \\ z_{CPR} \\ z_{SYS} \\ z_{HRA} \\ z_{TYP} \\ z_{LOC} \end{bmatrix}
=
\begin{bmatrix}
 0.036 & -0.226 \\
 0.086 &  0.096 \\
 0.389 &  0.009 \\
-0.219 & -0.112 \\
-0.019 &  0.209 \\
 0.394 &  0.878 \\
 0.960 & -0.272
\end{bmatrix}
\begin{bmatrix} f_1 \\ f_2 \end{bmatrix}
+
\begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \eta_7 \end{bmatrix}
$$
c) (8 points) Suppose two factors are used. What is the estimated correlation between AGE and
factor #1? What does this say about what factor #1 represents? Note that there may be more
than one answer to this question.
Please remember that $\text{Corr}(z_j, f_k) = \lambda_{jk}$. Using the unrotated common factors, the estimated
correlation is 0.0954. Using the rotated common factors, the estimated correlation is 0.036. In
both cases, the correlation is very small and positive. Thus, there does not appear to be much
of a relationship between AGE and common factor #1. Furthermore, this means that factor #1
represents very little of AGE.
Output before rotation:
> mod.fit2<-factanal(x = set1[,-c(1:2)], factors = 2, rotation = "none")
> print(x = mod.fit2, cutoff = 0.0)
Call:
factanal(x = set1[, -c(1:2)], factors = 2, rotation = "none")
Uniquenesses:
   AGE GENDER    CPR    SYS    HRA    TYP    LOC
 0.947  0.983  0.848  0.939  0.956  0.073  0.005

Loadings:
       Factor1 Factor2
AGE     0.094  -0.209
GENDER  0.058   0.115
CPR     0.373   0.111
SYS    -0.182  -0.166
HRA    -0.073   0.196
TYP     0.149   0.951
LOC     0.997  -0.010

               Factor1 Factor2
SS loadings      1.207   1.040
Proportion Var   0.172   0.149
Cumulative Var   0.172   0.321
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 12.28 on 8 degrees of freedom.
The p-value is 0.139
4) (26 total points) Answer the following questions.
a) (10 points) What is a parallel coordinate plot and when is it useful? Make sure to explain how
observations are represented on the plot. Pictures of the plot will not be accepted!
For each variable, a vertical line is plotted and individual variable values are plotted upon it.
The minimum variable value is at the bottom and the maximum variable value is at the top for
each variable. Lines connecting variable values for each observation are drawn across the
vertical variable lines.
These plots are useful for viewing multivariate data because there is no limit to the number of
variables that could be represented. Trends can be found by following the lines for
observations. For example, perhaps some observations that have large x1 values tend to also
have large x2 values. This would be demonstrated by having lines drawn between x1 and x2
toward the top of the plot connecting observation values.
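The construction described above can be sketched in base R. The data here are simulated (hypothetical variables x1, x2, x3), and this hand-rolled version is only meant to show the mechanics of rescaling each variable and connecting each observation's values with a line.

```r
# Simulated data on three different scales (hypothetical example)
set.seed(873)
demo <- data.frame(x1 = rnorm(20), x2 = rnorm(20, mean = 50, sd = 10), x3 = runif(20))

# Rescale each variable so its minimum is at the bottom (0) and its maximum at the top (1)
rescale <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(demo, rescale))

# matplot() draws one line per column, so transpose to get one line per observation
matplot(x = 1:ncol(scaled), y = t(as.matrix(scaled)), type = "l", lty = 1, col = "black",
        xaxt = "n", xlab = "", ylab = "Rescaled value")
axis(side = 1, at = 1:ncol(scaled), labels = names(scaled))

# Vertical line for each variable
abline(v = 1:ncol(scaled), col = "gray")
```

The parcoord() function in the MASS package produces this type of plot directly.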
b) (6 points) In general, why should one wait to interpret the common factors until the rotation is
done in factor analysis?
By rotating the factors, we are more likely to see loadings closer to -1, 0, or 1. When this
happens, the common factors are easier to interpret. For example, loadings on a common
factor that are close to 0 indicate that the common factor does not represent much of the
corresponding original variable. The reverse is true when loadings are close to 1 or -1. Please
remember that $\text{Corr}(z_j, f_k) = \lambda_{jk}$.
c) (10 points) With respect to factor analysis, discuss what the “nonuniqueness of the common
factors” means. Also, provide a mathematical explanation of its cause.
The nonuniqueness of the common factors means the common factors can be interpreted in
multiple ways because the factor loadings can change. As was shown in class, for any
orthogonal matrix $T$ (so that $TT' = I$),
$$
x = \Lambda f + \eta = \Lambda T T' f + \eta = (\Lambda T)(T' f) + \eta = \Lambda^* f^* + \eta
$$
where $\Lambda^* = \Lambda T$ and $f^* = T'f$. Thus, there is a new factor loading matrix, $\Lambda^*$, which
provides the $\lambda^*_{jk}$'s used to interpret the common factors.
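A numerical illustration, using the unrotated two-factor loadings from the factanal() output in problem 3: multiplying by an orthogonal matrix changes every loading but leaves the common part of the fitted correlation matrix and the communalities unchanged. The rotation angle `theta` is an arbitrary, hypothetical choice.

```r
# Unrotated loadings for the two-factor model, copied from the output in problem 3
var.names <- c("AGE", "GENDER", "CPR", "SYS", "HRA", "TYP", "LOC")
lambda <- matrix(c( 0.094, -0.209,
                    0.058,  0.115,
                    0.373,  0.111,
                   -0.182, -0.166,
                   -0.073,  0.196,
                    0.149,  0.951,
                    0.997, -0.010),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(var.names, c("Factor1", "Factor2")))

# An arbitrary orthogonal matrix: rotation by angle theta (hypothetical choice)
theta <- pi / 6
T.mat <- matrix(c(cos(theta), -sin(theta),
                  sin(theta),  cos(theta)), ncol = 2, byrow = TRUE)

# New loading matrix
lambda.star <- lambda %*% T.mat

# T is orthogonal, so the common part of the model is unchanged
all.equal(T.mat %*% t(T.mat), diag(2))
all.equal(lambda %*% t(lambda), lambda.star %*% t(lambda.star))

# The individual loadings used for interpretation are different, however
round(lambda.star, 3)
```

The communalities, `rowSums(lambda^2)` and `rowSums(lambda.star^2)`, are also identical, which is why the uniquenesses printed before and after the varimax rotation match.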
5) (3 points, extra credit) Where were trellis plots developed?
AT&T Bell labs (see 9-30-13 video at 23:00)