University of Toronto Scarborough
Department of Computer and Mathematical Sciences
STAD29 / STA 1007 (K. Butler), Final Exam
April 20, 2015

Aids allowed:
- My lecture overheads (slides)
- The R “book”
- Any notes that you have taken in this course
- Your marked assignments
- Last year’s final exam and my solutions to it
- Non-programmable, non-communicating calculator

Before you begin, complete the signature sheet, but sign it only when the invigilator collects it. The signature sheet shows that you were present at the exam.

This exam has 18 numbered pages of questions. Please check to see that you have all the pages. In addition, you should have an additional booklet of output to refer to during the exam. Contact an invigilator if you do not have this.

Answer each question in the space provided (under the question). If you need more space, use the backs of the pages, but be sure to draw the marker’s attention to where the rest of the answer may be found.

The maximum marks available for each part of each question are shown next to the question part. In addition, the total marks available for each page are shown at the bottom of the page, and in the table on the next page.

The University of Toronto’s Code of Behaviour on Academic Matters applies to all University of Toronto Scarborough students. The Code prohibits all forms of academic dishonesty including, but not limited to, cheating, plagiarism, and the use of unauthorized aids. Students violating the Code may be subject to penalties up to and including suspension or expulsion from the University.

Last name:
First name:
Student number:

For marker’s use only:

Page     Points   Score
1        8
2        9
3        9
4        5
5        8
6        6
7        7
8        8
9        8
10       7
11       9
12       8
13       5
14       5
15       6
16       4
17       6
18       5
Total:   123

STAD29 Final Exam Page 1 of 18

1. It is thought that boxes of brand-name raisins contain more raisins than the generic brand.
Fourteen 15g boxes of brand-name (Sunmaid) and generic raisins were randomly sampled, and the number of raisins in each counted. The data are shown in Figure 1 in the booklet of code and output.

(a) (2 marks) Side-by-side boxplots are shown in Figure 2. What do they suggest about the truth of the opening sentence of this question?

My answer: The median number of raisins for the Sunmaid brand is higher than for the generic brand, so the boxplots suggest that the opening sentence of the question could be true.

(b) (1 mark) Figure 3 shows the results of a test. Why was alternative="l" (that’s a lowercase letter L) used?

My answer: To test the alternative that the mean of the first group (generic) was less than the mean of the second group (sunmaid), as opposed to the default two-sided “not equal” alternative. (The one-sided test is what the opening sentence of the question says that we want to do.)

(c) (3 marks) What do you conclude from the test in Figure 3, in terms of raisins and brands?

My answer: The P-value is less than 0.05, so we can conclude that the Sunmaid brand has a higher mean number of raisins per box than the generic brand. (The null hypothesis is that the means are the same, and this is rejected in favour of the one-sided alternative that the Sunmaid mean is bigger.)

(d) (2 marks) Do you see anything in Figure 2 to make you doubt your conclusions of (c)? Explain briefly.

My answer: There are no outliers, and both distributions look more or less symmetric. So I am happy with the results of the t-test.

Exam continues. . . This page: of possible 8 points.

2. A large number of people received large doses of radiation following a nuclear accident. The radiation received by each person was measured to the nearest 100 rems. We will treat the variable rems in our data set as an exact number (even though, for example, 50 rems means only “between 1 and 99 rems”). Some of the people lived, and some of them died.
We are interested in whether the amount of radiation affects the chances of a person dying. The data and a logistic regression are shown in Figure 4 in the booklet of code and output. The data show, for each number of rems, the number of people who survived and the number who died.

(a) (2 marks) Explain briefly why logistic regression is a suitable method to use in this situation.

My answer: The response variable is a binary yes/no: a person either survives or not. Such a response variable can be handled by logistic regression, to see how the probability of the event (death) depends on the explanatory variable(s), rems in this case.

(b) (2 marks) In the model radiation.1, am I modelling the probability of death or the probability of survival? How do you know?

My answer: The probability of death, because that’s the first column of the response variable matrix. (Other clue: looking at the data, the proportion of people who died is increasing as rems increases, and the slope of rems is positive, so we must be modelling the probability of death.)

(c) (2 marks) Why is the strong significance of the slope coefficient of rems not a surprise? Explain briefly.

My answer: Looking at the data, almost all the people survive at low rems and almost all of them die at high rems. So it makes sense that the radiation (rems) has a clear and large impact on the probability of survival.

(d) (3 marks) Use the information in Figure 5 to estimate the probability that someone suffering radiation of 400 rems will survive. Show me your thought process, not just an answer.

My answer: The variable new.rems starts at 0 and goes up in steps of 100, so 400 is the fifth value (0, 100, 200, 300, 400). Of the two lines of predictions, the first one is of log-odds (not helpful), while the second one is probabilities. Since new.rems was used to make the predictions (as the second thing in predict), we need the fifth one of these also, 0.169.
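The two lines of predictions mentioned above are linked by the logit transformation, and the position of 400 in new.rems follows from its start and step. The sketch below is in Python rather than R, purely as a worked illustration; the helper names logit and inv_logit are made up here, and the only numbers used are the ones quoted in the answer.

```python
import math

def logit(p):
    """Log-odds corresponding to a probability p."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Probability corresponding to a log-odds value x."""
    return 1 / (1 + math.exp(-x))

# new.rems starts at 0 and goes up in steps of 100 (from the answer above).
start, step = 0, 100
position = (400 - start) // step + 1   # 1-based position, the way R counts

# Fifth predicted probability, as read off Figure 5 in the answer above.
p = 0.169
log_odds = logit(p)                    # what the log-odds line would show
back = inv_logit(log_odds)             # converting back recovers p

print(position, round(log_odds, 3), round(back, 3))
```

So a prediction printed on the log-odds scale can always be converted to a probability by hand, but it is easier to read the probability line directly.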
But this is the probability of a person dying (as you see, the predictions go up with increasing rems), so the probability of a person surviving at this radiation is 1 − 0.169 = 0.831.

If you thought we were predicting the probability of survival earlier, I’m prepared to take this into account when marking this part, but I can’t give you full marks here because you have made the question easier to answer.

3. An experiment was designed to assess the effect of management training on the decision-making abilities of supervisors in a large corporation. Sixteen supervisors were selected, and eight were randomly chosen to receive management training. Four trained and four untrained supervisors were randomly selected to handle a situation in which a standard problem arose. The other eight supervisors were presented with an emergency situation in which standard procedures could not be used. The response variable was a management behaviour rating for each supervisor.

(a) (2 marks) On the plot of Figure 6, are the lines approximately parallel? What does that mean, in terms of training and situation type?

My answer: I’d say they are more or less parallel, which means that there is no interaction: the effect of training on rating is more or less the same in the two situations. If you want to say that they are not parallel, go ahead, but make sure to say that then, the effect of training is different in the two situations (or something equivalent).

(b) (3 marks) Look at Figure 7. Would you say that (i) training makes a difference to rating, (ii) the kind of situation makes a difference to rating, (iii) your answers to (i) and (ii) are different for the different levels of the other factor?

My answer: Training has a big effect on rating: compare “yes” and “no” within emergency and within standard, and the ratings are all clearly higher when training is “yes”.
The kind of situation makes only a small difference to rating: for example, when training is Yes, the ratings in a standard situation are a little higher than in an emergency situation, and the same when training is No. The difference between standard and emergency situations is a tiny bit bigger when the supervisor has undergone training than when not, but it’s up to you whether you think this is worth commenting on. (This is the same issue as for interaction.)

(c) (4 marks) Which of the two analyses in Figure 8 do you prefer? Explain briefly. How is this consistent with your answers to each of the previous two parts (or “how is it inconsistent”, if it is that)?

My answer: In Figure 8, I looked at the first analysis and saw that the interaction was not significant. So I removed it, and saw this was the second analysis, which I prefer. The nonsignificant interaction was consistent with the more-or-less parallel lines on the interaction plot, and also with the boxplots: for example, the rating was a little higher in a standard situation than an emergency one, both for supervisors that had the training and supervisors that did not. If you thought, looking at the interaction plot, that the interaction was going to be significant, its non-significance should have been a surprise to you (and you needed to comment on this).

(d) (2 marks) I did not run Tukey’s method here. Would that have been useful? Explain briefly why or why not.

My answer: There are only two training groups, yes or no, and only two types of situation, Emergency and Standard. So if either of the main-effects F-tests were significant, that would have demonstrated that mean ratings were different for the two different groups in each case. No further analysis (like Tukey) would be necessary. Tukey would have been useful if there had been, say, three types of situation.
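The argument above turns on how many pairwise comparisons a factor allows. As a small illustration in Python (the helper name n_pairwise is made up for this sketch), a factor with k levels has k(k − 1)/2 pairs of levels to compare:

```python
from math import comb

def n_pairwise(k):
    """Number of distinct pairs among k group means (hypothetical helper)."""
    return comb(k, 2)

# Two levels (training yes/no, or Emergency/Standard): a single comparison,
# which is exactly the one the main-effect F-test already makes.
print(n_pairwise(2))

# Three types of situation would give three pairs, so a follow-up such as
# Tukey's method would be needed to say which pairs differ.
print(n_pairwise(3))
```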
(This is not the usual answer to this kind of question, which is that the need for Tukey depends on the significance or not of the F-test(s).)

(e) (3 marks) What are your conclusions about the effectiveness of the training, and the effect of standard or emergency conditions on the ratings (and possibly the effect of those in combination)? Explain briefly, using α = 0.05.

My answer: The P-value for trained is very small (8.6 × 10⁻⁸), so training definitely has an effect on rating. The P-value for situation is not less than 0.05, so we cannot conclude that this has any effect on rating. We don’t need to worry about the combination of training and situation, because we previously said that the interaction was not significant (meaning that trained and situation have consistent effects regardless of the level of the other factor).

4. The data set shown in Figure 9 shows a number of patients who are on a waiting list for a liver transplant. The event of interest is “received a liver transplant”, labelled ltx. A patient can drop off the waiting list for other reasons: they die, they withdraw from the transplant list, or they move away (or cannot be followed up for other reasons). The potential explanatory variables are:

• age (years)
• sex (M or F)
• blood group, A, B, AB or O (called abo)
• year when patient entered waiting list

The variable futime is the number of days from when the patient enters the waiting list to when they leave it (for whatever reason).

Blood group is important because the liver of a donor with blood group O can be used by a patient of any blood type, whereas (for example) a patient with blood type B cannot accept a liver from a donor of blood type A or AB. Thus type O patients on the waiting list are at a disadvantage because there is more competition for type O donor livers.

Some analysis is shown in Figure 10.
The first two lines (actually, one long line of code) are converting the years into one of three periods. The third line is showing how a few of the conversions work out.

(a) (2 marks) Explain precisely what the Surv line in Figure 10 is doing. What does “censored” mean in this context?

My answer: This is creating a response variable for a survival model. The response variable is followup time, with the event of interest being “received a liver transplant”. A patient who leaves the waiting list for any other reason will be treated as “censored”.

(b) (2 marks) Consider these three patients, whose followup time in each case is 100 days: (i) the patient dies, (ii) the patient receives a liver transplant, (iii) the patient moves out of the study area. If you were to print the y values for each of these patients, how would they be displayed?

My answer: Since the event of interest is “received a liver transplant”, anything else is censored. Thus only the second patient will be displayed as the number 100; the other two will be displayed as 100+. Thus: (i) 100+, (ii) 100, (iii) 100+.

(c) (2 marks) In the model fit in Figure 10, does it look as if age has an effect on the response variable? Explain briefly.

My answer: The P-value of 0.4317 is not small, so it appears to have no effect.

(d) (2 marks) Does it look as if blood group has an effect on y (that is, are any of the blood groups different from the others in terms of their slope coefficients)? Explain briefly.

My answer: Yes. Blood group A is the baseline, and blood group O is definitely different from A (P-value 1.1 × 10⁻¹⁴).

(e) (3 marks) I removed age and sex from the model, obtaining the results shown in Figure 11. Can you tell from this output which blood group typically has patients who have to wait longest for liver transplants? Explain briefly.
My answer: A positive slope coefficient means a “greater hazard” of the event happening, and a negative one means a “lesser hazard”. The event in this case is getting a liver transplant, which is (unusually for this kind of work) a good thing. So if the hazard is greater, a patient is more likely to get a liver transplant soon. The patients who have to wait longest are the ones with the most negative coefficient, that is the ones in blood group O. This is what I suggested in the question preamble, but the trick is to explain why you can see from the output that this group is worst off. (Full marks requires the use of the word “hazard” or something equivalent to it.)

(f) (2 marks) Figure 12 contains predictions based on the model in Figure 11. Describe what the data frame transplant.new contains.

My answer: All combinations of the 3 time periods and the 4 blood groups.

(g) (1 mark) Which function in Figure 12 actually does the predictions? (No explanation needed.)

My answer: survfit. That’s it.

(h) (3 marks) Figure 13 shows a plot of the survival curves for the groups defined by time period and blood group. Line types 1, 2 and 4 are respectively solid, dashed and dot-then-dash. The solid lines are all bunched up at the bottom of the plot, and are hard to see. In which time period do patients typically have to wait longest for a liver transplant? Explain briefly. (How do you know which is the longest wait and which is the shortest?)

My answer: The survival curves show the probability of “surviving” until each time point for each group. In this case, “surviving” means that the patient is still waiting for a liver transplant, so the most top-right survival curve is the one where the patients in that group are typically waiting longest.
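To see why a higher curve means a longer wait, here is a small Python sketch with made-up waiting times. It uses the Kaplan-Meier estimator as a simple stand-in; it is not the Cox-model-based survfit prediction behind Figure 13, and the function name and all of the data are hypothetical.

```python
def kaplan_meier(times, got_transplant):
    """Return [(t, S(t))] at each distinct event time (Kaplan-Meier)."""
    event_times = sorted({t for t, e in zip(times, got_transplant) if e})
    surv = 1.0
    curve = []
    for t in event_times:
        # Patients still on the waiting list just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Patients who receive a transplant exactly at time t.
        events = sum(1 for u, e in zip(times, got_transplant) if u == t and e)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical waiting times in days; True = received a transplant,
# False = censored (left the waiting list for another reason).
short_wait = kaplan_meier([50, 100, 150, 200], [True, True, True, True])
long_wait = kaplan_meier([100, 300, 500, 700], [True, False, True, False])

print(short_wait)
print(long_wait)
```

In this toy example the second group’s curve is above the first group’s at every time, because its transplants happen later. It also ends above zero rather than at zero, because the longest time in that group is censored.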
(This is backwards from the usual thing of top-right being best, because survival is actually bad, the way survival is being defined here. I wanted to check that you could think this through clearly.) The conclusion here ought to be the same as in part (e): the means of making the conclusion is different, but the conclusion ought to be the same. The most top-right curves are all the dash-dot ones, which correspond to the 1998–1999 time period. So this is the time period where patients have to wait longest for a liver transplant. (You can see that the next worst is 95–97, so that things have been getting worse over time. My data source explains that the procedures for deciding who gets a liver transplant have been getting more sophisticated over time, but “the overall liver shortage remains acute”.)

(i) (2 marks) Look again at Figure 13. Other things being equal, patients in which blood group tend to receive liver transplants soonest? Explain briefly.

My answer: The patients in the green group, that is blood group AB. Compare the curves for the same time period, for example the dot-dash ones that are 1998–99, since they are easiest to see. Of these, the green ones are the lowest, so they receive liver transplants the soonest.

There is another clue here, which is in the preamble of the question. I said that “type O patients are at a disadvantage” because type O livers can be used in a patient of any blood type. That is, we can expect type O patients to have to wait longest for a liver transplant, and indeed the blue type O “survival curve” is at the top for any time period. If you thought that type O patients got liver transplants quickest, this is an invitation to reconsider that thought.

(j) (2 marks) It looks as if some of the survival curves in Figure 13 become flat, rather than heading towards zero. What does it mean for a survival curve to be flat at, say, 0.4, for ever beyond a certain time, rather than descending to zero?
Answer in the context of this problem.

My answer: Typically a survival curve will head towards zero, which would mean that there is a zero probability of surviving infinitely long (or for a very long time). But a survival curve that stays flat for ever means that someone in that group might survive for ever (they would never observe the event). In the context of this problem, if the survival curve is flat at 0.4 beyond, say, 2000 days, there is a probability 0.4 that a patient in that group will never receive a liver transplant. Sad but true. “The liver shortage remains acute”, indeed.

5. The flea beetle Chaetocnema has three species, called concinna, heikertingeri and heptapotamica. Is it possible to distinguish beetles from these three species by means of body measurements? In a study, measurements were taken of Width (the maximal width of the aedeagus in the forepart) and Angle (the front angle of the aedeagus). The species was also noted, abbreviated Con, Hei and Hep (respectively). Figure 14 in the booklet of code and output shows some of the data. (The aedeagus is “a reproductive organ of male insects”.)

(a) (3 marks) Figure 15 shows a MANOVA for these data. What do you conclude from this output?

My answer: The null hypothesis is that all the Species have the same mean on both variables (and all combinations of them). This null is resoundingly rejected. Thus at least two of the species have means that differ on some combination of the two variables, Width and Angle.

(b) (2 marks) In an attempt to find out why we concluded what we did in (a), a discriminant analysis was run. This is shown in Figure 16. Why are there two LDs? Explain briefly.

My answer: The number of LDs is the smaller of the number of groups minus 1 and the number of variables. There are 3 groups and two measured variables, and the smaller of 3 − 1 and 2 is 2.

(c) (3 marks) Would large or small values of each of the two variables (consider them separately) make LD1 small? What would make LD2 small?
My answer: LD1 will be small if Angle is small and Width is large (the latter because its coefficient in LD1 is negative). LD2 will be small if both measurements are large, since both coefficients are negative.

(d) (3 marks) Looking again at Figure 16, do you think flea beetles of the species heikertingeri would have a large or small score on LD1? A large or small score on LD2? Explain briefly. If on either of these it is not clear what will happen, explain why.

My answer: These beetles have the smallest mean width and the largest mean angle. So, in the light of the previous part, this is exactly the opposite of what will make LD1 small: that is, we’d expect LD1 to be large. As for LD2, this will be small if both measurements are large. One of them is, and one of them is not. So LD2 might be small and it might be large, and at this level of analysis it is not clear whether LD2 is small or large.

If you look at the plot in Figure 17, this species is actually large on LD2 as well. You can “cheat” and look at the plot first, and then try to rationalize why Hei is large on LD2. You can argue that for these beetles, the Width is a lot smaller than for the others, so this is the dominant variable, and therefore LD2 will be large. I can go with that. (Note, however, that the angle measurements are smaller and less variable, so this is not quite as clear-cut as you might guess.)

(e) (2 marks) Look at the plot in Figure 17. Do the species appear to be distinguishable using the Width and Angle measurements? Explain briefly.

My answer: The groups appear to be almost completely distinct on the plot: Hep top left, Hei top right, Con at the bottom.

(f) (3 marks) Look at the posterior probabilities in Figure 18. Find a flea beetle that is misclassified (that is, its predicted and actual species are different). Which number row is it in the data frame?
What is it about its measurements that makes it hard to classify correctly?

My answer: I can only find one, row #17. I found this by scanning down the Con column for the Cons until I found a posterior probability much less than 1. Row 17 is only 0.188. (Row 16 is the only other “off” one.) Comparing it with the other Con ones, its width of 134 is the lowest of them all and its angle is on the high side. Or, compare with the means in Figure 16: its width of 134 is much less than the mean width for the Cons of 146, and its angle is 15, higher than the mean angle of 14. Either way, LD1 will be large, so it looks as if this observation is the rightmost Con, the one almost over to the two Heis. (Curiously, it is that one Con that is misclassified, not the two Heis at the bottom right of Figure 17.)

6. The bark of some tropical trees seems to offer protection from termites. Experimenters investigated the effects of these tree resins on termites. The resin was dissolved in a solvent and placed on filter paper in two different doses (5mg and 10mg). For each dosage level, eight dishes are set up with 25 termites in each dish. (There are thus 16 dishes altogether, 8 at each dose level.) The termites are fed the dosed filter paper and a daily count is made of the number of termites surviving (this is the response variable). Fifteen days were observed, but no observations were made on days 3 and 9 because they were Sundays. We will ignore the fact that two days were missing entirely. Some of the data is shown in Figure 19 of the booklet of code and output.

(a) (2 marks) Why is a two-way analysis of variance not suitable for analyzing these data?

My answer: The same dish is measured several times (once on each day). If a two-way ANOVA were to be suitable, a separate dish would have to be used on each day.
(Or, measurements from the same dish would be expected to be correlated over time, and the analysis would have to deal with that, which regular 2-way ANOVA does not.)

(b) (1 mark) Look at the code and output in Figure 20. Why did I have to use as.matrix in the first line?

My answer: termites[,3:15] is a data frame, not a matrix, but lm needs a matrix below, so we have to turn response into a matrix.

(c) (2 marks) In Figure 20, is any effect of dose the same for all days? How can you tell? Explain briefly.

My answer: If the effect of dose were different on different days, there would be a significant interaction. But there is not, so the effect of dose must be consistent over the days.

(d) (2 marks) In Figure 20, what do you conclude from the other tests in the table (the relevant ones you haven’t commented on yet)? Explain briefly, in the context of the experimental situation.

My answer: We already commented on the interaction, so interest is in the main effects: of dose (significant) and days (also significant). That is, one of the doses has a consistently higher number of termites surviving (over all days), and there is a consistent difference over days in the number of termites surviving, regardless of the level of dose. (There isn’t any information here about which doses or days have a higher number of termites than others. That comes later.) If you didn’t comment on the interaction above, here is the place to do so.

The test for the intercept says that the overall mean is not zero, hardly an earth-shattering finding. I said “relevant” in the question to encourage you not to waste time mentioning this. The main effects are what is important.

(e) (2 marks) Figure 21 in the booklet of code and output shows the code required to produce an interaction plot. In the third line of code, I use a function gather.
Describe what gather does in the context of this problem.

My answer: This collects the columns day1 through day15 together into one column (since they are all numbers of surviving termites). The one column is called termite, with the number of days in days (in the form day9). I needed to do this, since that’s what the interaction plot needs in order to work.

(f) (3 marks) What do you conclude from the interaction plot in Figure 21 about (i) any interaction and (ii) the nature of any main effects of dose and days?

My answer: The lines on the interaction plot are not that parallel, so I would have expected a significant interaction. As for main effects: the number of termites goes down as the number of days increases, and the number of termites for dose 5 is higher than the number for dose 10, at any number of days.

(g) (2 marks) Are the conclusions about interaction from the interaction plot and from the repeated-measures analysis of variance in Figure 20 consistent, or not? Explain briefly.

My answer: The apparent non-parallelism on the interaction plot suggests a significant interaction, whereas in Figure 20, the interaction is nowhere near significant. This seems contradictory. (That’s as much as I needed you to say.) As to why: well, the interaction plot doesn’t show variability, of which there might be quite a lot. So I would trust the ANOVA.

If you thought those lines on the interaction plot were “roughly parallel”, then you would expect to see no interaction in the ANOVA, and all would be consistent. (I don’t mind what you think about parallelism as long as you think something and you make a comment that follows from what you said.)

(h) (2 marks) Based on the information you’ve seen so far, which of the two different doses of resin seems to be more effective at providing protection from termites? Explain briefly.

My answer: Having fewer termites surviving corresponds to a better protection from termites.
According to the interaction plot, the mean number of termites is lower at the higher dose of 10, for all days. So if we want to get rid of termites, this analysis says that we should use the dose of 10.

Note that the ANOVA in Figure 20 only tells you that the dose makes a difference to the number of termites, but it doesn’t tell you which dose is better. For that, you need a table of means (which we don’t have here) or the interaction plot, as in Figure 21.

7. The data shown in Figure 22 show protein consumption in 25 European countries for 9 food groups. We are interested in whether the countries tend to group together (whether there are countries whose inhabitants tend to eat similar amounts of protein in similar food groups). These data date from 1983. Note that the country names are the row names of the data frame (rather than being in a column). This matters for the code later.

(a) (2 marks) A hierarchical cluster analysis, using Ward’s method, is shown in Figure 23. Explain briefly what the first line of code does, and why it was necessary here.

My answer: The variables are on different scales (for example, Milk is typically large and Starch is always small), so that the variables need to be standardized, which is what scale does.

(b) (2 marks) Why did I use dist on the second line of code, rather than as.dist? Explain briefly.

My answer: Here, we have measurements on variables, so we have to create distances, which is what dist does. as.dist would be used if we already had distances and we just needed R to treat them properly.

(c) (4 marks) A dendrogram is shown at the bottom of Figure 23. Suppose we want to divide the countries up into five clusters. Which countries would be in each cluster?

My answer: Chop the tree at a height of about 6 (or maybe 7) and read off the countries that are grouped together:

1. Finland, Norway, Denmark, Sweden
2. Hungary, USSR, Poland, Czechoslovakia, East Germany
3. Switzerland, Austria, Netherlands, Ireland, Belgium, West Germany, France, UK
4. Albania, Bulgaria, Romania, Yugoslavia
5. Greece, Italy, Portugal, Spain

(d) (2 marks) Would you say that the countries tend to cluster geographically, or not? Explain briefly. Use the map in Figure 24 to help you decide, if you need to. (The map comes from the same era as the data. Some of the countries have changed names since then.)

My answer: I’d say that there is a very clear geographic distinction between the clusters: briefly, Scandinavia (north), east (Soviet bloc countries), western Europe, the Balkans (southeast) and Mediterranean. The countries in the last cluster aren’t geographically close, but they are known to have a similar diet.

(e) (1 mark) In Figure 25, I obtain a K-means clustering of the countries with 5 clusters. I used 5 clusters because 5 clusters seemed reasonable from the hierarchical cluster analysis. What technique would I have used if I had not known that 5 clusters was suitable? (No explanation needed.)

My answer: A scree plot. That’s all I need. If you had a scree plot, you would look for an elbow and take that many clusters, not one less as for principal components/factor analysis. But you don’t need to say that.

(f) (2 marks) Look at the clusters obtained in Figure 25. Describe the differences, if any, from the clusters obtained in Figure 23. You may find the table at the bottom of Figure 25 helpful.

My answer: The clustering is exactly the same, though the clusters are now numbered differently (an immaterial difference). I can tell because the table at the bottom has only one non-zero in each row and column. For example, all the countries in cluster 1 under hierarchical clustering are in cluster 5 under K-means.
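The “one non-zero in each row and column” check can be sketched in a few lines of Python. The label vectors below are hypothetical: only the pairing of hierarchical cluster 1 with K-means cluster 5 is taken from the answer above, and the function name same_partition is made up for this sketch.

```python
from collections import Counter

# Hypothetical cluster labels for eight items; only the 1 -> 5 pairing
# comes from the text, the other pairings are made up.
hier = [1, 1, 2, 2, 3, 3, 4, 5]
kmeans = [5, 5, 3, 3, 1, 1, 2, 4]

# Cross-tabulation: how many items fall in each (hierarchical, K-means) cell.
table = Counter(zip(hier, kmeans))

def same_partition(a, b):
    """True if the two labelings differ only in how clusters are numbered."""
    pairs = set(zip(a, b))
    # One non-zero cell per row and per column means each label in a is
    # paired with exactly one label in b, and vice versa.
    return len(pairs) == len(set(a)) == len(set(b))

for (h, k), n in sorted(table.items()):
    print(f"hierarchical {h} -> K-means {k}: {n} items")
print(same_partition(hier, kmeans))
```

If even one item moved to a different cluster, some row or column of the table would have two non-zero cells, and the check would fail.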
You can check this by seeing which countries are now in which cluster for K-means: they all have the same cluster-mates as they had before.

8. Let’s use the same data as for the cluster-analysis question, and this time try to produce a map of the countries. This is done in Figure 26.

(a) (1 mark) What kind of input does cmdscale require?

My answer: Distances, as here, or dissimilarities. The clue is the use of protein.d, which came out of Figure 23 as the output of dist. “Output from dist” doesn’t quite do it, because I want you to show that you know what kind of thing comes out of dist and goes into cmdscale.

(b) (2 marks) How might a multi-dimensional scaling map differ in general from an actual map? Explain briefly how Figure 26 differs from Figure 24, if you can. (If you can’t find a concise way of saying how the maps differ, say so.)

My answer: An MDS map might be rotated or flipped over from a real map. Since we said earlier that there seemed to be some kind of a geographical clustering, it makes sense to try to find a correspondence between our MDS map and the truth. On the MDS map, the four Balkan countries (bottom right) are in the right place, but the four Mediterranean countries are right and top when they should be bottom and left, and the countries bottom and left are in western and eastern Europe, but mostly in the north. This correspondence would be explained by a flip (or reflection) along a line from top left to bottom right. This would exchange the Mediterranean countries with the northern ones, while leaving the Balkan countries alone.

(c) (2 marks) Give an example of a way in which the results of the K-means cluster analysis of Question 7 (Figure 25) do not correspond with the map.
My answer: This means finding some countries in the same cluster that are far apart on the map, or alternatively finding some countries in different clusters that are close together on the map. Either is good. For the first, note that the four countries in the Mediterranean cluster are all at top right of the map, but quite spread out (Italy and Portugal especially). For the second, note that Sweden is all mixed up with the western European countries (UK and Germany especially). In fact, all the Scandinavian and western European countries seem intermingled, even though they are in different clusters. I’m prepared to accept anything that says “these countries are close on one but not the other”, if it is supported by the results.

(d) (2 marks) Some more output is in Figure 27. What do you conclude from it? Explain briefly.

My answer: This is comparing a 2-dimensional map to a 3-dimensional one. I think the quality of fit of the 3-dimensional map is clearly better than the 2-dimensional one (75% vs. 63%), but any intelligent comment will do. (It may be, for example, that the Scandinavian countries differ from the others on the 3rd dimension, which is why they appeared on a (2-dimensional) map in with the western European ones.)

(e) (2 marks) A non-metric multidimensional scaling map is shown in Figure 28. What is its stress value? How good of a job is the map doing of reproducing the distances?

My answer: It is about 14 (percent). According to the scale in the notes, this is squarely in the middle of “fair”. So that’s how good of a job the map is doing. (Again, I wonder whether a 3-dimensional solution would work better than this 2-dimensional one.)

(f) (2 marks) Is Figure 28 broadly similar to, or different from, the map in Figure 26? If different, explain how it is different.

My answer: I’d say the maps are pretty much telling the same story.
(They are even in the same orientation, though they didn’t have to be for this.) The four Balkan countries are off by themselves, the four Mediterranean countries form a loose cluster, and everything else is in kind of a blur together (that happens to be at the bottom left). If you want to find differences (e.g. Finland is in the corner in one and not the other) go ahead, but I don’t think you’ll find anything of substance.

9. A psychologist ran 13 psychological tests on 231 people. The tests were part of the Eysenck Personality Inventory (EPI below), the Big 5 inventory, and some other things. (The Big 5 theory holds that a person’s personality is driven by their position on the 5 dimensions shown below.) The tests were as follows:

epiE       EPI extraversion
epiS       EPI sociability
epiImp     EPI impulsivity
epilie     EPI lie scale
epiNeur    EPI neuroticism
bfagree    Big 5 agreeableness
bfcom      Big 5 conscientiousness
bfext      Big 5 extraversion
bfneur     Big 5 neuroticism
bfopen     Big 5 openness
bdi        Beck Depression Index
traitanx   Trait anxiety
stateanx   State anxiety

“Impulsivity” is the tendency for a person to act without thinking; a “neurotic” person is one who suffers from a mental disorder such as anxiety or depression (the way they perceive the world is not how it is). The lie scale does indeed measure how likely a person is to tell lies. “State anxiety” is anxiety caused by a situation, and “trait anxiety” is anxiety that is part of a person’s character. An inventory is a list, in this case a list of statements that the person being tested says how much they agree with.

I want to use a factor analysis to determine whether scores on certain of these inventories tend to go together. Some of the data is shown in Figure 29 of the booklet of code and output.

(a) (2 marks) Why did I run a principal components analysis in Figure 30, when I said I wanted to use a factor analysis? Explain briefly.
My answer: The factanal command requires me to supply a number of factors. For that, I need to consult a scree plot, and to do that I need to run a principal components analysis. (“To get a scree plot” is not a complete answer: I’d like you to fill in more of the story than that.)

(b) (2 marks) From the scree plot in Figure 30, how many factors do you think you should use for the factor analysis? Explain briefly.

My answer: There is a big elbow at 4, so we should take one less factor than this, namely 3. Said differently, the 3rd factor is on the mountain, but the fourth one appears to be on the scree. You could also say that there is an elbow at the fifth one, and therefore you should take four factors. This would be saying that factor 4 is part of the mountain rather than the scree. Your decision about that one could go either way (and I’m happy with both). Another way to assess this is to say that there are three eigenvalues (labelled Variances on this picture) clearly bigger than 1. This also points to three factors for the factor analysis. The fourth eigenvalue is just less than 1, but on the borderline for inclusion. The quality of your reasoning, rather than the precise answer, is what counts.

(c) (1 mark) I did a factor analysis with three factors. (This may or may not be the number you found in part (b).) The results are shown in Figure 31. How much of the variability in the data is explained by three factors?

My answer: 58.6%, in the Cumulative Var row of the table near the bottom of the output.

(d) (3 marks) Factor 1 seems to consist of which variables? Explain briefly. What do they seem to have in common?

My answer: Look for the far-from-zero loadings: epiNeur, bfneur (both neuroticism), depression index and the two anxiety variables at the end of the list.
These are all aspects of neuroticism (as I defined it above), since depression and anxiety are part of neuroticism. (In case you care, factor 2 is the Extraversion, Sociability and Impulsiveness parts of the Eysenck scale, so could be described as “outgoing”, and factor 3 is the four parts of the Big 5 scale apart from neuroticism, which was captured in factor 1.)

(e) (2 marks) Find a variable with a high uniqueness, and explain by looking at the factor loadings why its uniqueness is high.

My answer: The highest (by a long way) is epilie (the score on the Eysenck lie inventory), with a uniqueness of 0.824. A high uniqueness should occur because the variable doesn’t load heavily on any of the factors. For epilie, the loadings are all under 0.3 in size (−0.291, −0.274, 0.128). So this makes sense. The other variables all seem to have a high loading on at least one of the factors, so their uniquenesses are not nearly so high. This means that a person’s tendency to lie appears unrelated to any other personality traits that they have.

(f) (2 marks) Would we be justified in looking at more factors than the three we did? Explain briefly, using the output in Figure 31.

My answer: I’m looking for the hypothesis test at the bottom, where the P-value is very small and therefore we reject three factors (in favour of more being necessary). If you want, you can say that explaining 58.6% of the variability is not enough, and we need more factors to explain more. (There is a danger, though, that it could be like the personality example we did in class, where the later factors all explain a tiny amount of variability and you don’t get the total amount high unless you take a bunch of them.)

(g) (2 marks) Look at the biplot in Figure 32. Find individual 96. Would you say that this is a neurotic person? Would you say that they are an extravert? Explain briefly.
My answer: Individual 96 is over on the right. This person is off the end of the arrows for bfneur and epiNeur (which point right), so this is a highly neurotic person. The variable that is most closely related to extraversion is epiE, which points upwards; this individual is about a third of the way up the arrow, so is more extraverted than average, but not very extraverted.

(h) (1 mark) Which number individual is not at all neurotic, but about average in terms of extraversion and sociability? (No explanation needed.)

My answer: The two neurotic scales point to the right, so a not-at-all neurotic person would be on the left. The extraversion and sociability scales point up, so an average individual on these scales would be about halfway up. So I think the individual that best fits these is #9, on the left middle. I’m willing to take 86 or 118 as well (or 85), but I don’t want you to stray too far from “on the left, halfway up”.

End of Exam
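As a check on the answer to Question 9(e): when a factor analysis is run on standardized variables with uncorrelated factors, a variable’s uniqueness is 1 minus the sum of its squared loadings (its communality). The sketch below assumes the rounded loadings quoted in that answer, and the variable names are my own; it should recover the reported uniqueness of 0.824 to within rounding.

```r
# Loadings for epilie on factors 1 to 3, copied from the answer to Question 9(e).
# These are rounded values, so the result is only approximate.
loadings_epilie <- c(-0.291, -0.274, 0.128)

# Communality: the part of the variable's variance shared with the factors.
communality <- sum(loadings_epilie^2)

# Uniqueness: what is left over once the factors are accounted for.
uniqueness <- 1 - communality

print(round(communality, 3))
print(round(uniqueness, 3))
```

Small loadings on every factor give a small communality, which is why the uniqueness comes out so close to 1.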