STAD29 / STA 1007 assignment 10
Not to be handed in (for your own learning)

If this assignment were to be handed in, your allowable sources would be only:
• the instructor
• my textbook
• your class notes
• Google Maps (for one of the questions)

Table of points for each question:
Question:  1   2   3  Total
Points:   19  12  30     61

1. The file http://www.utsc.utoronto.ca/~butler/d29/wisconsin.txt contains the road distances (in miles) between 12 cities in Wisconsin and neighbouring states. We are going to try to reproduce a map of the area using multidimensional scaling.

(a) (3 marks) Read in the data and create a dist object, bearing in mind that the data in the file are already distances. Display your dist object. (Note that the default behaviour of read.table will store the city names as the row and column names of your data frame, rather than in the first column, so you will have to do something different from, and easier than, what we did in class.)

Solution:

wisc=read.table("wisconsin.txt",header=T)
d=as.dist(wisc)
d

##               Appleton Beloit Fort.Atkinson Madison Marshfield Milwaukee
## Beloit             130
## Fort.Atkinson       98     33
## Madison            102     50            36
## Marshfield         103    185           164     138
## Milwaukee          100     73            54      77        184
## Monroe             149     33            58      47        170       107
## Superior           315    377           359     330        219       394
## Wausau              91    186           166     139         45       181
## Dubuque            196     94           119      95        186       168
## St.Paul            257    304           287     258        161       322
## Chicago            186     97           113     146        276        93
##               Monroe Superior Wausau Dubuque St.Paul
## Superior         362
## Wausau           186      223
## Dubuque           61      351    215
## St.Paul          289      162    175     274
## Chicago          130      467    275     184     395

(b) (2 marks) Obtain a vector containing the city names. (This, too, is not as in class.)

Solution: Use the names of the data frame that you read in from the file.
cities=names(wisc)
cities

##  [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
##  [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
##  [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"

Or this also works, since the names are on the rows too:

row.names(wisc)

##  [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
##  [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
##  [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"

(c) (2 marks) Run a (metric) multidimensional scaling on the data, to obtain a two-dimensional representation of the cities. (You don't need to look at the results yet.)

Solution:

wisc.1=cmdscale(d)

(d) (3 marks) Plot the results of the multidimensional scaling, labelling the cities with their names. Use your judgement to decide where to place the city names, and how to make sure the whole city names are shown on the map.

Solution: plot with text. Here's my first go, with the city names on the right:

plot(wisc.1)
text(wisc.1,cities,pos=4)

[Scatterplot: the 12 cities plotted at their MDS coordinates, labelled to the right; Dubuque's label runs off the right-hand edge.]

Having the city names on the right appears good, but one of them went off the right-hand side. Also, the axis labels look weird, but I can't think of anything better to replace them with than x and y:

plot(wisc.1,xlab="x",ylab="y",xlim=c(-200,350))
text(wisc.1,cities,pos=4)

[Scatterplot: the same map with the x-axis extended to c(-200,350), so all the city names now fit.]

Your map may come out different from mine, but subject to the usual stuff about rotation and reflection it should be equivalent to mine.

(e) (2 marks) Are cities close together on the map also close together in real life? Give an example or two.

Solution: On the map, the trio of cities Madison, Beloit and Fort Atkinson are closest together. How far apart are they actually?
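One way to answer this without going back to the file is to compute straight-line distances from the MDS coordinates themselves and compare them with the road distances. A small Python sketch (numpy assumed; the 2-D coordinates below are made-up stand-ins for wisc.1, not the real values):

```python
import numpy as np

# Hypothetical 2-D MDS coordinates for three cities (stand-ins for wisc.1)
coords = np.array([[ 10.,  -5.],
                   [ 14.,  -2.],
                   [ 40., -60.]])

# Euclidean distances between every pair of map points
diff = coords[:, None, :] - coords[None, :, :]
mapdist = np.sqrt((diff ** 2).sum(axis=2))
print(round(mapdist[0, 1], 1))  # the two nearby cities: 5.0
```

Cities whose map distance is small should also have a small road distance, up to the distortion the two-dimensional solution introduces.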
Well, you can go back to the original file (or display of what I called d) and find them, or you can do this:

cities

##  [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
##  [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
##  [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"

Cities 2, 3 and 4, so:

wisc[2:4,2:4]

##               Beloit Fort.Atkinson Madison
## Beloit             0            33      50
## Fort.Atkinson     33             0      36
## Madison           50            36       0

These are all less than 50 miles apart. There are some others: Monroe and Madison are 47 miles apart, Wausau and Marshfield are 45 miles apart, but these appear further apart on the map.

This doesn't work:

d[2:4,2:4]

## Error: incorrect number of dimensions

because d, despite appearances, is actually one-dimensional (it just has a "print method" that displays it as a matrix).

I don't much care which cities you look at. Finding some cities that are reasonably close on the map and doing some kind of critical assessment of their actual distances apart is all I want.

(f) (2 marks) Obtain a Google (or other) map of the area containing these twelve cities. (Include your map in your answer.)

Solution: Since I like to show off, let me show you how you can do this in R, using the package ggmap. (Of course, you can just open the appropriate map in your browser and copy-paste it.):

library(ggmap)
wisc.map=get_map(location="Milwaukee, WI",zoom=6)

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Milwaukee,+WI&zoom=6&siz
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Milwaukee,+W
## Google Maps API Terms of Service : http://developers.google.com/maps/terms

ggmap(wisc.map)

[Map: Google map of Wisconsin and surroundings, longitude about -95 to -85, latitude about 40 to 47.5, centred on Milwaukee.]

I centred this map around Milwaukee (a guess), which is not quite where the centre should be, since Milwaukee is in southeastern Wisconsin. The zoom option is how zoomed-in the map should be (a higher number is more zoomed-in).
Likewise, 6 was a guess, and it seems that I need to zoom in a bit more. The other way of specifying a location, instead of the name or lat-long of the centre of the map, is to specify the corners of the map in degrees of latitude and longitude. We have to give four numbers: lower-left longitude and latitude, upper-right longitude and latitude. (Degrees west are negative, as you see on the lon scale of the above map.) This comes out as west, south, east and north limits of the map.

Where are the 12 points we want to put on the map? We can get their latitudes and longitudes, which is called "geocoding", and a function geocode is included in ggmap. First add the state names to the cities, to make sure Google Maps looks up the right ones. All of them are in Wisconsin, except for the last two: St. Paul is in Minnesota and Chicago is in Illinois:

states=rep("WI",12)
states[11]="MN"
states[12]="IL"
cst=paste(cities,states)
cst

##  [1] "Appleton WI"      "Beloit WI"        "Fort.Atkinson WI"
##  [4] "Madison WI"       "Marshfield WI"    "Milwaukee WI"
##  [7] "Monroe WI"        "Superior WI"      "Wausau WI"
## [10] "Dubuque WI"       "St.Paul MN"       "Chicago IL"

And then look them up:

ll=geocode(cst)
cbind(cst,ll)

##                 cst    lon   lat
## 1       Appleton WI -88.42 44.26
## 2         Beloit WI -89.03 42.51
## 3  Fort.Atkinson WI -88.84 42.93
## 4        Madison WI -89.40 43.07
## 5     Marshfield WI -90.17 44.67
## 6      Milwaukee WI -87.91 43.04
## 7         Monroe WI -89.64 42.60
## 8       Superior WI -92.10 46.72
## 9         Wausau WI -89.63 44.96
## 10      Dubuque WI  -90.66 42.50
## 11      St.Paul MN  -93.09 44.95
## 12      Chicago IL  -87.63 41.88

What are the extreme corners of these?

range(ll$lon)

## [1] -93.09 -87.63

range(ll$lat)

## [1] 41.88 46.72

(range in R produces the two extreme values, not the difference between the highest and lowest, which is what you might think of as a "range".)
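The two range calls together give the corners that get_map wants. As a cross-check, here is a small Python sketch (numpy assumed) that computes the west/south/east/north box from a few of the longitude-latitude pairs above:

```python
import numpy as np

# A few of the geocoded (lon, lat) pairs from above
lonlat = np.array([[-88.42, 44.26],   # Appleton
                   [-92.10, 46.72],   # Superior
                   [-93.09, 44.95],   # St. Paul
                   [-87.63, 41.88]])  # Chicago

# range() in R returns (min, max); the bounding box is just both ranges,
# ordered as west, south, east, north, which is what get_map expects
west, south = lonlat.min(axis=0)
east, north = lonlat.max(axis=0)
print([west, south, east, north])  # [-93.09, 41.88, -87.63, 46.72]
```

The column-wise min and max here play exactly the role of the two range calls in the R code.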
We don't get exactly the corners we ask for, since the map always comes out in the same proportions (we could ask for a long skinny map, but we'll always get a rectangular one that fills the page), and also Google Maps converts the corners into a centre and zoom. I had to tinker with the numbers below, since on my first attempt the map zoomed in too much. I also asked for a "roadmap" to maximize the number of places marked on there. So:

library(ggmap)
wisc.map.2=get_map(location=c(-94,41.8,-87,46.8),maptype="roadmap")

## Warning: bounding box given to google - spatial extent only approximate.

ggmap(wisc.map.2)

[Map: road map covering longitude about -94 to -87 and latitude about 42 to 46.5.]

This came out about right. Now we need to mark our 12 cities on the map. This is a ggplot map, so the right syntax is as below.

ggmap(wisc.map.2)+geom_point(aes(x=lon,y=lat),data=ll)

[Map: the same road map with the 12 cities marked as points.]

We just squeezed all our cities onto the map. The city southwest of Wausau is Marshfield, the one between Madison and Milwaukee is Fort Atkinson, and the two east of Dubuque along the southern border of Wisconsin are Monroe and Beloit. The one way up at the top is Superior.

After that long diversion, we come to:

(g) (3 marks) Discuss how the map that came out of the multidimensional scaling corresponds to the actual (Google) map.

Solution: Let's pick a few places from the actual map, and make a table of where they are on the actual map and the cmdscale map:

Place     Real           Cmdscale
Superior  northwest      central east
St. Paul  central west   southeast
Dubuque   central south  central south
Chicago   southeast      central west
Appleton  central east   central north

This is a bit tricky. Dubuque is the only one in the right place, and the others that were west have become east and vice versa. So I think there is a flipping across a line going through Dubuque.
That seems to be the most important thing; if you imagine the other points being flipped across a line going north-south through Dubuque, they all end up in about the right place. There might be a little rotation as well, but I'll call that close enough. (For you, any comment along the lines of "flipped around this line" or "rotated about this much" that seems to describe what has happened is fine.)

This one calls for a Procrustes rotation, which I can do since I have longitudes and latitudes for my 12 cities. It goes like the one in class. ll is actually a data frame, so we need to convert it into a matrix. wisc.1 contained the coordinates for our two-dimensional solution:

library(shapes)

## Loading required package: scatterplot3d
## Loading required package: rgl
## Loading required package: MASS

wisc.pro=procOPA(as.matrix(ll),wisc.1,reflect=T)

I discuss reflect=T below. Now we need to plot these. I could have made a blank plot and then added wisc.pro$Ahat to it with points, which would have made the structure a little clearer. Anyhow. Note that Ahat and Bhat have two columns, so they will be used as respectively the x and y coordinates of the plot. I also extended the x-coordinate range so that Chicago's name would fit.

plot(wisc.pro$Ahat,col="red",xlim=c(-3.5,3))
text(wisc.pro$Ahat,cities,col="red",pos=4)
points(wisc.pro$Bhat,col="blue")
text(wisc.pro$Bhat,cities,col="blue",pos=4)

[Scatterplot: the actual city locations (red) overlaid on the Procrustes-matched MDS solution (blue); the red and blue copies of each city name sit close together.]

This is pretty good. The red and blue names match well, except perhaps for St. Paul. What does it think is the right transformation to get from MDS coordinates to real ones?
wisc.pro$R

##         [,1]   [,2]
## [1,] -0.6823 0.7311
## [2,]  0.7311 0.6823

This time, the diagonal elements differ in sign (indicating that a reflection was performed to make the points line up). Indeed, I allowed procOPA to use a reflection if it needed to, by means of reflect=T above. According to Wikipedia (http://en.wikipedia.org/wiki/Coordinate_rotations_and_reflections), our R corresponds to a reflection about a line making angle θ with the x-axis, where cos 2θ = −0.6823 and sin 2θ = 0.7311. This means that 2θ = 133 degrees, or θ = 66.5 degrees. That is to say, we draw a line that makes an angle of this much with the x-axis of our cmdscale map, and then reflect the cities in that. For example, a line through Dubuque and Marshfield on the cmdscale map makes about the right angle; reflect all the cities in that.

(h) (2 marks) Calculate something that demonstrates that a one-dimensional map of the cities is a much worse representation than the two-dimensional one that we made before.

Solution: Run again with eig=T and take a look at GOF (uppercase):

cmdscale(d,2,eig=T)$GOF

## [1] 0.9129 0.9316

cmdscale(d,1,eig=T)$GOF

## [1] 0.7917 0.8079

The goodness-of-fit of the two-dimensional solution is pretty good, but that of the one-dimensional solution (which arranges all the cities along a line) is pretty awful. How awful? Let's find out. I should have saved it from just above. For the plot, ones is a string of ones, as many as there are cities.

ones=rep(1,12)
v=cmdscale(d,1,eig=T)
plot(ones,v$points)
text(ones,v$points,cities,pos=4)

[Plot: the cities strung out along a vertical line, from Superior and St. Paul at the top down through Marshfield, Wausau, Appleton, Madison, Dubuque, Fort Atkinson, Monroe, Beloit and Milwaukee to Chicago at the bottom.]

The cities get mapped onto a line that goes northwest (top) to southeast (bottom). This is not completely terrible, since there aren't really any cities in the northeast of the state, but it is pretty awful.
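For the curious: what cmdscale computes, including a GOF value like the ones just used, can be sketched in a few lines of Python (numpy assumed). This is an illustration of the classical MDS algorithm, not R's implementation; the GOF shown is the second of the two ratios cmdscale reports, the one using only non-negative eigenvalues:

```python
import numpy as np

def classical_mds(D, k):
    """Classical MDS: double-centre the squared distances, then use the
    top-k eigenvectors; GOF is the eigenvalue share those k capture."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    X = vecs[:, :k] * np.sqrt(np.maximum(vals[:k], 0))
    gof = vals[:k].sum() / np.maximum(vals, 0).sum()
    return X, gof

# Four corners of a unit square: 2 dimensions are perfect, 1 is not
pts = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
_, gof2 = classical_mds(D, 2)
_, gof1 = classical_mds(D, 1)
print(round(gof2, 3), round(gof1, 3))  # 1.0 0.5
```

The Wisconsin numbers behave the same way: road distances are nearly planar, so two dimensions capture most of the eigenvalue total and one dimension noticeably less.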
2. In the previous assignment, you did a cluster analysis of ten brands of beer, as rated by 32 students. This time, we will do a non-metric multidimensional scaling of those same brands of beer. The data are in http://www.utsc.utoronto.ca/~butler/d29/beer.txt.

(a) (2 marks) Noting that we want to assess distances between brands of beer, read in the data and do whatever you need to do to work out distances between the beers. Show your result.

Solution: This is really a copy of last time. We need to transpose the data frame to get the beers in rows (dist works on distances between rows), then feed everything but the student IDs into dist:

beer=read.table("beer.txt",header=T)
d=dist(t(beer[,-1]))
d

##          AnchorS  Bass Becks Corona GordonB Guinness Heineken PetesW
## Bass       15.20
## Becks      16.09 13.64
## Corona     20.02 17.83 17.55
## GordonB    13.96 11.58 14.42  13.34
## Guinness   14.93 13.49 16.85  20.59   14.76
## Heineken   20.66 15.10 13.78  14.90   14.07    18.55
## PetesW     11.79 14.00 16.37  17.72   11.58    14.28    19.49
## SamAdams   14.63 11.62 14.73  14.93   10.91    15.91    14.53  14.46
## SierraN    12.61 15.10 17.94  16.97   11.75    13.34    19.08  13.42
##          SamAdams
## SierraN     12.12

This way also works, where you transpose everything and then take off the first row of the transposed result:

m=t(beer)
dist(m[-1,])

##          AnchorS  Bass Becks Corona GordonB Guinness Heineken PetesW
## Bass       15.20
## Becks      16.09 13.64
## Corona     20.02 17.83 17.55
## GordonB    13.96 11.58 14.42  13.34
## Guinness   14.93 13.49 16.85  20.59   14.76
## Heineken   20.66 15.10 13.78  14.90   14.07    18.55
## PetesW     11.79 14.00 16.37  17.72   11.58    14.28    19.49
## SamAdams   14.63 11.62 14.73  14.93   10.91    15.91    14.53  14.46
## SierraN    12.61 15.10 17.94  16.97   11.75    13.34    19.08  13.42
##          SamAdams
## SierraN     12.12

(b) (2 marks) Obtain a non-metric multidimensional scaling of the beers.
(Comment coming up in a moment.)

Solution:

library(MASS)
beer.1=isoMDS(d)

## initial  value 13.344792
## iter   5 value 10.855662
## iter  10 value 10.391446
## final  value 10.321949
## converged

(c) (2 marks) Obtain the stress value of the map, and comment on it.

Solution:

beer.1$stress

## [1] 10.32

The stress is around 10%, on the boundary between "good" and "fair". It seems as if the map should be more or less worth using. (Insert your own hand-waving language here.)

(d) (2 marks) Obtain a map of the beers, labelled with the names of the beers. On your plot, plot the actual locations with circles, and label the circles with the name of the beer, on the right. Make sure the scales go far enough so that you can see the names of the beers.

Solution: This is slightly different from class, where I plotted the languages actually at their locations. But here, the beer names are longer, so we should plot the points and label them. The first "name" is "student", which we need to get rid of:

names=names(beer[,-1])
names

## [1] "AnchorS"  "Bass"     "Becks"    "Corona"   "GordonB"  "Guinness"
## [7] "Heineken" "PetesW"   "SamAdams" "SierraN"

plot(beer.1$points,xlim=c(-10,10))
text(beer.1$points,names,pos=4)

[Scatterplot: the ten beers at their non-metric MDS coordinates, labelled to the right; Corona at the top, Guinness and Becks at the bottom, Sam Adams and Gordon Biersch in the middle.]

(e) (2 marks) Find a pair of beers close together on your map. Are they similar in terms of student ratings? Explain briefly.

Solution: I think Sam Adams and Gordon Biersch, right in the middle of the map.
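As an aside on the stress value from part (c): Kruskal's stress-1 compares the configuration's distances with fitted dissimilarities. Here is a simplified Python sketch (numpy assumed); the real isoMDS compares against monotone-regression fits, whereas this sketch plugs in the raw dissimilarities directly:

```python
import numpy as np

def stress1(X, Delta):
    """Simplified Kruskal stress-1: configuration distances of X versus
    target dissimilarities Delta (isoMDS uses monotone fits instead)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(X.shape[0], k=1)      # each pair counted once
    num = ((D[iu] - Delta[iu]) ** 2).sum()
    den = (D[iu] ** 2).sum()
    return np.sqrt(num / den)

# A configuration that reproduces its dissimilarities exactly has stress 0
X = np.array([[0., 0.], [3., 0.], [0., 4.]])
diff = X[:, None, :] - X[None, :, :]
Delta = np.sqrt((diff ** 2).sum(axis=2))
print(stress1(X, Delta))  # 0.0
```

The 10.32 for the beers says the two-dimensional picture cannot quite honour all the pairwise dissimilarities, but misses by a tolerable amount.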
We can pull them out by name:

cbind(beer$SamAdams,beer$GordonB)

##       [,1] [,2]
##  [1,]    9    7
##  [2,]    7    8
##  [3,]    7    6
##  [4,]    8    5
##  [5,]    6    6
##  [6,]    4    7
##  [7,]    5    6
##  [8,]    5    5
##  [9,]    3    4
## [10,]    4    6
## [11,]    7    7
## [12,]    7    9
## [13,]    7    6
## [14,]    6    6
## [15,]    8    5
## [16,]    5    6
## [17,]    7    6
## [18,]    8    2
## [19,]    6    7
## [20,]    6    5
## [21,]    5    5
## [22,]    8    7
## [23,]    4    5
## [24,]    7    7
## [25,]    4    7
## [26,]    6    3
## [27,]    7    6
## [28,]    6    6
## [29,]    7    5
## [30,]    7    7
## [31,]    6    9
## [32,]    5    6

These are, with a few exceptions (the most glaring being the 18th student), within a couple of points of each other. So I would say they are similar.

Another way to show this is to make a scatterplot of them, and draw on it the line where the ratings are the same:

plot(beer$SamAdams,beer$GordonB)
abline(a=0,b=1)

[Scatterplot: Gordon Biersch ratings against Sam Adams ratings, with the line y = x drawn through them; most points lie close to the line.]

These are basically closer to the line than not. (Note the trick for drawing a line with given intercept and slope: abline, with a= for the intercept and b= for the slope. The line y = x has intercept 0 and slope 1.)

(f) (2 marks) In our cluster analysis, we found that Anchor Steam, Pete's Wicked Ale, Guinness and Sierra Nevada were all in the same cluster. Would you expect them to be close together on your map? Are they? Explain briefly.

Solution: If they are in the same cluster, we would expect them to "cluster" together on the map. Except that they don't, really. These are the four beers over on the right of our map. They are kind of in the same general neighbourhood, but not really what you would call close together. (This is a judgement call, again.) In fact, none of the beers, with the exception of Sam Adams and Gordon Biersch in the centre, are really very close to any of the others.

3. The data in http://www.utoronto.ca/~butler/d29/weather_2014.csv is of the weather in a certain location: daily weather records for 2014.
The variables are:
• day of the year (1 through 365)
• day of the month
• number of the month
• season
• low temperature (for the day)
• high temperature
• average temperature
• time of the low temperature
• time of the high temperature
• rainfall (mm)
• average wind speed
• wind gust (highest wind speed)
• time of the wind gust
• wind direction

(a) (4 marks) Read in the data, and create a data frame containing only the temperature variables, the rainfall and the wind speed variables (the ones that are actual numbers, not times or text). Display the first few lines of your data frame.

Solution: I like dplyr (as you know). Or you could just pick out the columns by number:

weather.0=read.csv("weather_2014.csv",header=T)
library(dplyr)
weather.0 %>% select(c(l.temp:ave.temp,rain:gust.wind)) -> weather
head(weather)

##   l.temp h.temp ave.temp rain ave.wind gust.wind
## 1   12.7   14.0     13.4 32.0     11.4      53.1
## 2   11.3   14.7     13.5 64.8      5.6      41.8
## 3   12.6   14.7     13.6 12.7      4.3      38.6
## 4    7.7   13.9     11.3 20.1     10.3      66.0
## 5    8.8   14.6     13.0  9.4     11.6      51.5
## 6   11.8   14.4     13.1 38.9      9.9      57.9

(b) (2 marks) Find five-number summaries for each column by running summary on all the columns of the data frame (at once, if you can: remember apply?)

Solution: This:

apply(weather,2,summary)

##         l.temp h.temp ave.temp  rain ave.wind gust.wind
## Min.       3.1    9.8      7.3  0.00     0.00       3.2
## 1st Qu.    9.1   14.4     12.0  0.00     2.30      22.5
## Median    12.9   19.1     15.8  0.30     3.50      29.0
## Mean      12.7   19.2     15.7  5.84     4.04      31.1
## 3rd Qu.   16.3   23.3     19.3  5.30     5.20      38.6
## Max.      22.6   31.5     26.6 74.90    16.60      86.9

(c) (2 marks) Run a principal components analysis (on the correlation matrix).

Solution:

weather.1=princomp(weather,cor=T)

(d) (2 marks) Obtain a summary of your principal components analysis. How many components do you think are worth investigating?
Solution:

summary(weather.1)

## Importance of components:
##                        Comp.1 Comp.2  Comp.3  Comp.4  Comp.5   Comp.6
## Standard deviation     1.7831 1.4138 0.74407 0.38585 0.33553 0.081141
## Proportion of Variance 0.5299 0.3332 0.09227 0.02481 0.01876 0.001097
## Cumulative Proportion  0.5299 0.8631 0.95533 0.98014 0.99890 1.000000

The issue is to see where the standard deviations are getting small (after the second component, or perhaps the third one) and to see where the cumulative proportion of variance explained is acceptably high (again, after the second one, 86%, or the third, 95%).

(e) (3 marks) Make a scree plot (the nice one, if you can). Does this support your conclusion from the previous part?

Solution: I copied and pasted the code from class, changing some names:

plot(weather.1$sdev^2,type="b",ylab="Eigenvalue")

[Scree plot: eigenvalues about 3.2, 2.0, 0.55, 0.15, 0.11 and 0.007 against component number, joined by lines, with clear elbows at components 3 and 4.]

I see elbows at 3 and at 4. Remember you want to be on the mountain for these, not on the scree, so this suggests 2 or 3 components, which is exactly what we got from looking at the standard deviations and cumulative variance explained. The eigenvalue-greater-than-1 thing says 2 components, rather than 3.

The other way also works, though it is harder to read:

plot(weather.1)

[Bar plot: the same six variances shown as bars, one per component.]

(f) (4 marks) Obtain the component loadings. How do the first three components depend on the original variables? (That is, what kind of values for the original variables would make the component scores large or small?)
Solution:

weather.1$loadings

## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## l.temp    -0.465 -0.348         0.542  0.470 -0.379
## h.temp    -0.510 -0.231        -0.576 -0.381 -0.458
## ave.temp  -0.502 -0.311                       0.804
## rain       0.296 -0.397  0.853        -0.163
## ave.wind   0.253 -0.560 -0.463  0.357 -0.529
## gust.wind  0.347 -0.507 -0.230 -0.492  0.572
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
## Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000

1. This component loads mainly (and negatively) on the temperature variables, so when temperature is high, component 1 is low, and vice versa. You could also say that it loads positively on the other variables, in which case component 1 is low if the temperature variables are high and the rain and wind variables are low.

2. This one loads most heavily on wind: when wind is high, component 2 is low. Again, you can make the judgement call that the other variables also feature in component 2, so that when everything is large, component 2 is small, and vice versa.

3. This one is a bit clearer. The blank loadings are close to 0, and can be ignored. The main thing in component 3 is rain: when rainfall is large, component 3 is large.

(g) (2 marks) Obtain the principal component scores for the first 20 days of the year, and display them alongside the other variables in your data frame.

Solution: Same idea as in class. I think it can be done in dplyr too, though I couldn't see how.
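As background for what the scores in the next display are: the scores princomp attaches are just the standardized data multiplied by the loadings matrix. A Python sketch of that relationship (numpy assumed; like princomp with cor=T, it standardizes with divisor n):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))              # toy data: 100 days, 3 variables

# Standardize each column (np.std uses divisor n, matching princomp)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Loadings are the eigenvectors of the correlation matrix
R = np.corrcoef(X, rowvar=False)
vals, loadings = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]                 # biggest eigenvalue first
loadings = loadings[:, order]

scores = Z @ loadings                          # one score per day per component
print(np.allclose(np.corrcoef(scores, rowvar=False),
                  np.eye(3), atol=1e-8))       # the scores are uncorrelated
```

This is why a day's component-1 score can be read straight off its (standardized) variable values weighted by the component-1 loadings.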
v=data.frame(weather,scores=weather.1$scores)
head(v,n=20)

##    l.temp h.temp ave.temp rain ave.wind gust.wind scores.Comp.1
## 1    12.7   14.0     13.4 32.0     11.4      53.1        2.8428
## 2    11.3   14.7     13.5 64.8      5.6      41.8        2.7871
## 3    12.6   14.7     13.6 12.7      4.3      38.6        1.1108
## 4     7.7   13.9     11.3 20.1     10.3      66.0        3.6214
## 5     8.8   14.6     13.0  9.4     11.6      51.5        2.6742
## 6    11.8   14.4     13.1 38.9      9.9      57.9        3.0950
## 7    11.4   14.8     13.5  2.0      6.6      38.6        1.2201
## 8    12.4   15.6     14.1  1.5      5.9      33.8        0.7337
## 9     9.2   18.4     12.9  0.0      0.2      16.1       -0.2099
## 10    8.3   14.8     11.0  0.0      1.4      24.1        0.8245
## 11    5.8   14.8      9.5  0.3      1.1      16.1        1.0101
## 12    9.4   15.2     12.1 10.7      4.7      41.8        1.6750
## 13    7.3   12.9     10.2 15.7      3.1      35.4        2.1164
## 14   11.4   13.9     12.8  8.1      4.7      38.6        1.3414
## 15    9.4   13.1     12.0 29.0      5.9      43.5        2.5216
## 16    9.0   12.2     10.8  6.9      5.4      49.9        2.3833
## 17    7.7   11.4      9.3 25.4      7.2      41.8        3.1816
## 18    7.5   10.9      9.0 17.0      1.4      24.1        1.9458
## 19    6.4   11.4      8.7  2.5      3.3      32.2        2.1336
## 20    6.9   12.2      9.2  2.8      1.1      17.7        1.2877
##    scores.Comp.2 scores.Comp.3 scores.Comp.4 scores.Comp.5 scores.Comp.6
## 1      -3.127252     -0.004023       0.83603     -0.484456     -0.048587
## 2      -2.309272      3.643532       0.28733     -0.430800      0.060407
## 3      -0.254811      0.262561       0.26047      0.548246      0.004579
## 4      -2.473400     -0.992383      -0.50801      0.006405      0.048003
## 5      -2.028949     -1.683837       0.30950     -0.778030      0.188260
## 6      -3.141807      0.663632       0.27422     -0.144678     -0.046651
## 7      -0.327849     -0.957932       0.40619      0.051709      0.071742
## 8      -0.101693     -0.742727       0.53706      0.026055      0.025823
## 9       2.258912      0.534338      -0.26401     -0.141205     -0.109018
## 10      2.003563      0.108545      -0.11724      0.160826     -0.067062
## 11      2.724926      0.293739      -0.12458     -0.445328     -0.118472
## 12     -0.070977     -0.080289      -0.26913      0.256326     -0.027600
## 13      0.822099      0.653121      -0.21117      0.162944      0.023416
## 14     -0.002489     -0.168938       0.25187      0.451020      0.035518
## 15     -0.928698      0.955188       0.11778     -0.015696      0.117555
## 16     -0.200012     -0.669237      -0.22024      0.733585      0.033594
## 17     -0.626325      0.431170       0.37695     -0.389321     -0.085735
## 18      1.821269      1.281010       0.28346      0.133515     -0.029614
## 19      1.601731     -0.292889      -0.01086      0.156265     -0.033883
## 20      2.618355      0.448194       0.24838     -0.085814     -0.044551

(h) (2 marks) Find a day that scores high on component 1, and explain briefly why it came out high (by looking at the measured variables).

Solution: Day 4 has the highest component 1 score (3.62), with day 17 close behind. These are among the cooler days, especially for the daytime high. Also, there is a largish amount of rain. (The days with more rain were warmer.)

(i) (2 marks) Find a day that scores low on component 2, and explain briefly why it came out low.

Solution: Day 6 (or day 1). These are days when the wind speed (average or gust) is on the high side.

(j) (2 marks) Find a day that scores high on component 3, and explain briefly why it came out high.

Solution: Day 2. Component 3 was mainly rain, so it is not surprising that the rainfall is the highest on this day.

(k) (2 marks) Make a biplot of these data, labelling the days by the day count (from 1 to 365). You may have to get the day count from the original data frame that you read in from the file.
Solution:

biplot(weather.1,xlabs=weather.0$day.count)

[Biplot: all 365 days plotted by their first two component scores and labelled by day count, with arrows for the six variables; the temperature arrows point one way and the rain and wind arrows another, with day 37 out at the pointy end of the rain and wind arrows.]

(l) (3 marks) Looking at your biplot, what do you think was remarkable about the weather on day 37? Day 211? Confirm your guesses by looking at the appropriate rows of your data frame (and comparing with your summary from earlier).

Solution: Day 37 is at the bottom right of the plot, at the pointy end of the arrows for rain, wind gust and average wind.
This suggests a rainy, windy day:

weather[37,]

##    l.temp h.temp ave.temp rain ave.wind gust.wind
## 37    9.3   15.3     12.5 43.4     16.6        74

Those are high numbers for both rain and wind (the highest for average wind and above the third quartile otherwise), but the temperatures are unremarkable.

Day 211 is towards the pointy end of the arrows for temperature, so this is a hot day:

weather[211,]

##     l.temp h.temp ave.temp rain ave.wind gust.wind
## 211   22.6   31.5     26.6    0      4.5      33.8

This is actually the hottest day of the entire year: day 211 is highest on all three temperatures, while the wind speeds are right around average (and no rain is not completely surprising).
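Finding remarkable days like these amounts to taking an argmax or argmin over the columns of the score matrix. A small Python sketch (numpy assumed; toy scores, not the real ones):

```python
import numpy as np

# Toy score matrix: rows are days, columns are components 1 and 2
scores = np.array([[ 2.8, -3.1],
                   [ 2.8, -2.3],
                   [ 3.6, -2.5],
                   [-0.2,  2.3]])

# Day (0-based row) scoring highest on component 1, lowest on component 2
hi_comp1 = int(np.argmax(scores[:, 0]))
lo_comp2 = int(np.argmin(scores[:, 1]))
print(hi_comp1, lo_comp2)  # 2 0
```

Having identified the extreme day, you then look up its row of the original variables, as was done above for days 37 and 211, to see what made it extreme.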