STAD29 / STA 1007 assignment 10
Not to be handed in (for your own learning)
If this assignment were to be handed in, your allowable sources would be only:
• the instructor
• my textbook
• your class notes
• Google Maps (for one of the questions)
Table of points for each question:

Question:    1    2    3   Total
Points:     19   12   30      61
1. The file http://www.utsc.utoronto.ca/~butler/d29/wisconsin.txt contains the road distances (in
miles) between 12 cities in Wisconsin and neighbouring states. We are going to try to reproduce a map
of the area using multidimensional scaling.
(a) (3 marks) Read in the data and create a dist object, bearing in mind that the data in the file
are already distances. Display your dist object. (Note that the default behaviour of read.table
will store the city names as the row and column names of your data frame, rather than in the first
column, so you will have to do something easier than but inconsistent with what we did in class.)
Solution:
wisc=read.table("wisconsin.txt",header=T)
d=as.dist(wisc)
d
##               Appleton Beloit Fort.Atkinson Madison Marshfield Milwaukee
## Beloit             130
## Fort.Atkinson       98     33
## Madison            102     50            36
## Marshfield         103    185           164     138
## Milwaukee          100     73            54      77        184
## Monroe             149     33            58      47        170       107
## Superior           315    377           359     330        219       394
## Wausau              91    186           166     139         45       181
## Dubuque            196     94           119      95        186       168
## St.Paul            257    304           287     258        161       322
## Chicago            186     97           113     146        276        93
##               Monroe Superior Wausau Dubuque St.Paul
## Beloit
## Fort.Atkinson
## Madison
## Marshfield
## Milwaukee
## Monroe
## Superior         362
## Wausau           186      223
## Dubuque           61      351    215
## St.Paul          289      162    175     274
## Chicago          130      467    275     184     395
(b) (2 marks) Obtain a vector containing the city names. (This, too, is not as in class.)
Solution: Use the names of the data frame that you read in from the file.
cities=names(wisc)
cities
## [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
## [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
## [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"
Or this also works, since the names are on the rows too:
row.names(wisc)
## [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
## [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
## [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"
(c) (2 marks) Run a (metric) multidimensional scaling on the data, to obtain a two-dimensional representation of the cities. (You don’t need to look at the results yet.)
Solution:
wisc.1=cmdscale(d)
(d) (3 marks) Plot the results of the multidimensional scaling, labelling the cities with their names.
Use your judgement to decide where to place the city names, and how to make sure the whole city
names are shown on the map.
Solution: plot with text. Here’s my first go, with the city names on the right:
plot(wisc.1)
text(wisc.1,cities,pos=4)
[Plot: the 12 cities at their MDS coordinates (wisc.1[,1] against wisc.1[,2]), each labelled on the right.]
Putting the city names on the right works well, but one of them ran off the right-hand side.
Also, the axis labels look odd, but I can't think of anything better to replace them with than
x and y:
plot(wisc.1,xlab="x",ylab="y",xlim=c(-200,350))
text(wisc.1,cities,pos=4)
[Plot: the same map with axes labelled x and y, and the x-axis extended to c(-200, 350) so that all the city names fit.]
Your map may come out different from mine, but subject to the usual stuff about rotation and
reflection it should be equivalent to mine.
(e) (2 marks) Are cities close together on the map also close together in real life? Give an example or
two.
Solution: On the map, the trio of cities Madison, Beloit and Fort Atkinson are closest together.
How far apart are they actually? Well, you can go back to the original file (or display of what
I called d) and find them, or you can do this:
cities

## [1] "Appleton"      "Beloit"        "Fort.Atkinson" "Madison"
## [5] "Marshfield"    "Milwaukee"     "Monroe"        "Superior"
## [9] "Wausau"        "Dubuque"       "St.Paul"       "Chicago"

Cities 2, 3 and 4, so:
wisc[2:4,2:4]
##               Beloit Fort.Atkinson Madison
## Beloit             0            33      50
## Fort.Atkinson     33             0      36
## Madison           50            36       0
These are all less than 50 miles apart. There are some others: Monroe and Madison are 47
miles apart, Wausau and Marshfield are 45 miles apart, but these appear further apart on the
map.
This doesn’t work:
d[2:4,2:4]
## Error: incorrect number of dimensions
because d, despite appearances, is actually one-dimensional (it just has a “print method” that
displays it as a matrix).
I don’t much care which cities you look at. Finding some cities that are reasonably close on the
map and doing some kind of critical assessment of their actual distances apart is all I want.
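If you do want matrix-style indexing, one workaround (a sketch) is to convert the dist
object to a full matrix first:

```r
# as.matrix turns a dist object into a full symmetric matrix,
# which can then be indexed by rows and columns as usual
as.matrix(d)[2:4, 2:4]
```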
(f) (2 marks) Obtain a Google (or other) map of the area containing these twelve cities. (Include your
map in your answer.)
Solution: Since I like to show off, let me show you how you can do this in R, using the package
ggmap. (Of course, you can just open the appropriate map in your browser and copy-paste it.)
library(ggmap)
wisc.map=get_map(location="Milwaukee, WI",zoom=6)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Milwaukee,+WI&zoom=6&siz
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Milwaukee,+W
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
ggmap(wisc.map)
[Map: Google map of the region centred on Milwaukee, WI at zoom 6, covering roughly longitudes -95 to -85 and latitudes 40 to 47.5.]
I centred this map around Milwaukee (a guess), which is not quite where the centre should
be, since Milwaukee is in southeastern Wisconsin. The zoom option is how zoomed-in the map
should be (a higher number is more zoomed-in). Likewise, 6 was a guess, and it seems that I
need to zoom in a bit more.
The other way of specifying a location, instead of the name or lat-long of the centre of the map,
is to specify the corners of the map in degrees of latitude or longitude. We have to give four
numbers: lower left longitude and latitude, upper right longitude and latitude. (Degrees west
are negative, as you see on the lon scale of the above map.) This comes out as west, south,
east and north limits of the map.
Where are the 12 points we want to put on the map? We can get their latitudes and longitudes,
which is called “geocoding”, and a function geocode is included in ggmap.
First add the state names to the cities, to make sure Google Maps looks up the right ones. All
of them are in Wisconsin, except for the last two: St. Paul is in Minnesota and Chicago is in
Illinois:
states=rep("WI",12)
states[11]="MN"
states[12]="IL"
cst=paste(cities,states)
cst
##  [1] "Appleton WI"      "Beloit WI"        "Fort.Atkinson WI"
##  [4] "Madison WI"       "Marshfield WI"    "Milwaukee WI"
##  [7] "Monroe WI"        "Superior WI"      "Wausau WI"
## [10] "Dubuque WI"       "St.Paul MN"       "Chicago IL"
And then look them up:
ll=geocode(cst)
cbind(cst,ll)
##                 cst    lon   lat
## 1       Appleton WI -88.42 44.26
## 2         Beloit WI -89.03 42.51
## 3  Fort.Atkinson WI -88.84 42.93
## 4        Madison WI -89.40 43.07
## 5     Marshfield WI -90.17 44.67
## 6      Milwaukee WI -87.91 43.04
## 7         Monroe WI -89.64 42.60
## 8       Superior WI -92.10 46.72
## 9         Wausau WI -89.63 44.96
## 10       Dubuque WI -90.66 42.50
## 11       St.Paul MN -93.09 44.95
## 12       Chicago IL -87.63 41.88
What are the extreme corners of these?
range(ll$lon)
## [1] -93.09 -87.63
range(ll$lat)
## [1] 41.88 46.72
(range in R produces the two extreme values, not the difference between the highest and lowest,
which is what you might think of as a “range”.)
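(If you do want the difference between the two extremes, one way, as a sketch, is to wrap
range in diff:)

```r
# difference between the largest and smallest longitude:
# the east-west extent of the cities in degrees
diff(range(ll$lon))
```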
We don’t get exactly the corners we ask for, since the map always comes out in the same
proportions (we could ask for a long skinny map, but we’ll always get a rectangular one that
fills the page), and also Google Maps converts the corners into a centre and zoom. I had to
tinker with the numbers below, since on my first attempt the map zoomed in too much.
I also asked for a “roadmap” to maximize the number of places marked on there.
So:
library(ggmap)
wisc.map.2=get_map(location=c(-94,41.8,-87,46.8),maptype="roadmap")
## Warning: bounding box given to google - spatial extent only approximate.
ggmap(wisc.map.2)
[Map: the requested region, roughly longitudes -94 to -87 and latitudes 42 to 46.8, covering Wisconsin and its neighbours.]
This came out about right.
Now we need to mark our 12 cities on the map. This is a ggplot map, so the right syntax is
as below.
ggmap(wisc.map.2)+geom_point(aes(x=lon,y=lat),data=ll)
[Map: the same region with the 12 cities marked as points.]
We just squeezed all our cities onto the map. The city southwest of Wausau is Marshfield, the
one between Madison and Milwaukee is Fort Atkinson, and the two east of Dubuque along the
southern border of Wisconsin are Monroe and Beloit. The one way up at the top is Superior.
After that long diversion, we come to:
(g) (3 marks) Discuss how the map that came out of the multidimensional scaling corresponds to the
actual (Google) map.
Solution: Let’s pick a few places from the actual map, and make a table of where they are on
the actual map and the cmdscale map:
Place      Real            Cmdscale
Superior   northwest       central east
St. Paul   central west    southeast
Dubuque    central south   central south
Chicago    southeast       central west
Appleton   central east    central north
This is a bit tricky. Dubuque is the only one in the right place, and the others that were west
have become east and vice versa. So I think there is a flipping across a line going through
Dubuque. That seems to be the most important thing; if you imagine the other points being
flipped across a line going north-south through Dubuque, they all end up in about the right
place. There might be a little rotation as well, but I’ll call that close enough.
(For you, any comment along the lines of “flipped around this line” or “rotated about this
much” that seems to describe what has happened, is fine.)
This one calls for a Procrustes rotation, which I can do since I have longitudes and latitudes
for my 12 cities. It goes like the one in class. ll is actually a data frame, so we need to convert
it into a matrix. wisc.1 contained the coordinates for our two-dimensional solution:
library(shapes)
## Loading required package: scatterplot3d
## Loading required package: rgl
## Loading required package: MASS
wisc.pro=procOPA(as.matrix(ll),wisc.1,reflect=T)
I discuss reflect=T below.
Now, we need to plot these. I could have made a blank plot and then added wisc.pro$Ahat
to it with points, which would have made the structure a little clearer. Anyhow. Note that
Ahat and Bhat have two columns, so they will be used as respectively the x and y coordinates
of the plot. I also extended the x-coordinate range so that Chicago’s name would fit.
plot(wisc.pro$Ahat,col="red",xlim=c(-3.5,3))
text(wisc.pro$Ahat,cities,col="red",pos=4)
points(wisc.pro$Bhat,col="blue")
text(wisc.pro$Bhat,cities,col="blue",pos=4)
[Plot: the Procrustes-matched configurations, with actual city locations in red and the rotated MDS solution in blue, both labelled with city names.]
This is pretty good. The red and blue names match well, except perhaps for St. Paul. What
does it think is the right transformation to get from MDS coordinates to real ones?
wisc.pro$R
##         [,1]   [,2]
## [1,] -0.6823 0.7311
## [2,]  0.7311 0.6823
This time, the diagonal elements differ in sign (indicating that a reflection was performed to
make the points line up). Indeed, I allowed procOPA to use a reflection if it needed to by means
of reflect=T above.
According to Wikipedia (http://en.wikipedia.org/wiki/Coordinate_rotations_and_reflections),
our R corresponds to a reflection about a line making angle θ with the x-axis, where cos 2θ =
−0.6823 and sin 2θ = 0.7311. This means that 2θ = 133 degrees, or θ = 66.5 degrees. That is
to say, we draw a line that makes an angle of this much with the x-axis of our cmdscale map,
and then reflect the cities in that. For example, a line through Dubuque and Marshfield on the
cmdscale map makes about the right angle, and then reflect all the cities in that.
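The angle arithmetic can be checked with a small R sketch, using the entries of the matrix
shown above:

```r
# cos(2*theta) = -0.6823 and sin(2*theta) = 0.7311, so recover
# 2*theta with atan2 and convert from radians to degrees
two.theta <- atan2(0.7311, -0.6823) * 180 / pi
two.theta     # about 133 degrees
two.theta / 2 # about 66.5 degrees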
(h) (2 marks) Calculate something that demonstrates that a one-dimensional map of the cities is a
much worse representation than the two-dimensional one that we made before.
Solution: Run again with eig=T and take a look at GOF (uppercase):
cmdscale(d,2,eig=T)$GOF
## [1] 0.9129 0.9316
cmdscale(d,1,eig=T)$GOF
## [1] 0.7917 0.8079
The goodness-of-fit of the two-dimensional solution is pretty good, but that of the one-dimensional
solution (which arranges all the cities along a line) is pretty awful.
How awful? Let's find out. I should have saved the one-dimensional solution from just above.
For the plot, ones is a vector of ones, as many as there are cities.
ones=rep(1,12)
v=cmdscale(d,1,eig=T)
plot(ones,v$points)
text(ones,v$points,cities,pos=4)
[Plot: the one-dimensional solution, with the cities strung out vertically from Superior and St. Paul at the top to Milwaukee and Chicago at the bottom.]
The cities get mapped onto a line that goes northwest (top) to southeast (bottom). This is not
completely terrible, since there aren’t really any cities in the northeast of the state, but it is
pretty awful.
2. In the previous assignment, you did a cluster analysis of ten brands of beer, as rated by 32 students.
This time, we will do a non-metric multidimensional scaling of those same brands of beer. The data are
in http://www.utsc.utoronto.ca/~butler/d29/beer.txt.
(a) (2 marks) Noting that we want to assess distances between brands of beer, read in the data and
do whatever you need to do to work out distances between the beers. Show your result.
Solution: This is really a copy of last time. We need to transpose the data frame to get the
beers in rows (dist works on distances between rows), then feed everything but the student
IDs into dist:
beer=read.table("beer.txt",header=T)
d=dist(t(beer[,-1]))
d
##          AnchorS  Bass Becks Corona GordonB Guinness Heineken PetesW
## Bass       15.20
## Becks      16.09 13.64
## Corona     20.02 17.83 17.55
## GordonB    13.96 11.58 14.42  13.34
## Guinness   14.93 13.49 16.85  20.59   14.76
## Heineken   20.66 15.10 13.78  14.90   14.07    18.55
## PetesW     11.79 14.00 16.37  17.72   11.58    14.28    19.49
## SamAdams   14.63 11.62 14.73  14.93   10.91    15.91    14.53  14.46
## SierraN    12.61 15.10 17.94  16.97   11.75    13.34    19.08  13.42
##          SamAdams
## Bass
## Becks
## Corona
## GordonB
## Guinness
## Heineken
## PetesW
## SamAdams
## SierraN     12.12
This way also works, where you transpose everything and then take off the first row of the
transposed result:
m=t(beer)
dist(m[-1,])
##          AnchorS  Bass Becks Corona GordonB Guinness Heineken PetesW
## Bass       15.20
## Becks      16.09 13.64
## Corona     20.02 17.83 17.55
## GordonB    13.96 11.58 14.42  13.34
## Guinness   14.93 13.49 16.85  20.59   14.76
## Heineken   20.66 15.10 13.78  14.90   14.07    18.55
## PetesW     11.79 14.00 16.37  17.72   11.58    14.28    19.49
## SamAdams   14.63 11.62 14.73  14.93   10.91    15.91    14.53  14.46
## SierraN    12.61 15.10 17.94  16.97   11.75    13.34    19.08  13.42
##          SamAdams
## Bass
## Becks
## Corona
## GordonB
## Guinness
## Heineken
## PetesW
## SamAdams
## SierraN     12.12
(b) (2 marks) Obtain a non-metric multidimensional scaling of the beers. (Comment coming up in a
moment.)
Solution:
library(MASS)
beer.1=isoMDS(d)
## initial value 13.344792
## iter   5 value 10.855662
## iter 10 value 10.391446
## final value 10.321949
## converged
(c) (2 marks) Obtain the stress value of the map, and comment on it.
Solution:
beer.1$stress
## [1] 10.32
The stress is around 10%, on the boundary between “good” and “fair”. It seems as if the map
should be more or less worth using. (Insert your own hand-waving language here.)
(d) (2 marks) Obtain a map of the beers, labelled with the names of the beers. On your plot, plot the
actual locations with circles, and label the circles with the name of the beer, on the right. Make
sure the scales go far enough so that you can see the names of the beers.
Solution: This is slightly different from class, where I plotted the languages actually at their
locations. But here, the beer names are longer, so we should plot the points and label them.
The first “name” is “student”, which we need to get rid of:
names=names(beer[,-1])
names
##  [1] "AnchorS"  "Bass"     "Becks"    "Corona"   "GordonB"  "Guinness"
##  [7] "Heineken" "PetesW"   "SamAdams" "SierraN"

plot(beer.1$points,xlim=c(-10,10))
text(beer.1$points,names,pos=4)
[Plot: the ten beers at their MDS coordinates, labelled on the right, with Sam Adams and Gordon Biersch together in the middle of the map.]
(e) (2 marks) Find a pair of beers close together on your map. Are they similar in terms of student
ratings? Explain briefly.
Solution: I think Sam Adams and Gordon Biersch, right in the middle of the map. We can
pull them out by name:
cbind(beer$SamAdams,beer$GordonB)
##       [,1] [,2]
## [1,]     9    7
## [2,]     7    8
## [3,]     7    6
## [4,]     8    5
## [5,]     6    6
## [6,]     4    7
## [7,]     5    6
## [8,]     5    5
## [9,]     3    4
## [10,]    4    6
## [11,]    7    7
## [12,]    7    9
## [13,]    7    6
## [14,]    6    6
## [15,]    8    5
## [16,]    5    6
## [17,]    7    6
## [18,]    8    2
## [19,]    6    7
## [20,]    6    5
## [21,]    5    5
## [22,]    8    7
## [23,]    4    5
## [24,]    7    7
## [25,]    4    7
## [26,]    6    3
## [27,]    7    6
## [28,]    6    6
## [29,]    7    5
## [30,]    7    7
## [31,]    6    9
## [32,]    5    6
These are, with a few exceptions (the most glaring being the 18th student), within a couple of
points of each other. So I would say they are similar. Another way to show this is to make a
scatterplot of them, and draw on it the line where the ratings are the same:
plot(beer$SamAdams,beer$GordonB)
abline(a=0,b=1)
[Scatterplot of beer$GordonB against beer$SamAdams, with the line y = x drawn on it.]
These are basically closer to the line than not. (Note the trick for drawing a line with given
intercept and slope: abline, with a= for the intercept and b= for the slope. The line y = x has
intercept 0 and slope 1.)
(f) (2 marks) In our cluster analysis, we found that Anchor Steam, Pete’s Wicked Ale, Guinness and
Sierra Nevada were all in the same cluster. Would you expect them to be close together on your
map? Are they? Explain briefly.
Solution: If they are in the same cluster, we would expect them to “cluster” together on the
map. Except that they don’t, really.
These are the four beers over on the right of our map. They are kind of in the same general
neighbourhood, but not really what you would call close together. (This is a judgement call,
again.) In fact, none of the beers, with the exception of Sam Adams and Gordon Biersch in
the centre, are really very close to any of the others.
3. The data in http://www.utoronto.ca/~butler/d29/weather_2014.csv are daily weather records for
2014 from a certain location. The variables are:
• day of the year (1 through 365)
• day of the month
• number of the month
• season
• low temperature (for the day)
• high temperature
• average temperature
• time of the low temperature
• time of the high temperature
• rainfall (mm)
• average wind speed
• wind gust (highest wind speed)
• time of the wind gust
• wind direction
(a) (4 marks) Read in the data, and create a data frame containing only the temperature variables,
the rainfall and the wind speed variables (the ones that are actual numbers, not times or text).
Display the first few lines of your data frame.
Solution: I like dplyr (as you know). Or you could just pick out the columns by number:
weather.0=read.csv("weather_2014.csv",header=T)
library(dplyr)
weather.0 %>% select(c(l.temp:ave.temp,rain:gust.wind)) -> weather
head(weather)
##   l.temp h.temp ave.temp rain ave.wind gust.wind
## 1   12.7   14.0     13.4 32.0     11.4      53.1
## 2   11.3   14.7     13.5 64.8      5.6      41.8
## 3   12.6   14.7     13.6 12.7      4.3      38.6
## 4    7.7   13.9     11.3 20.1     10.3      66.0
## 5    8.8   14.6     13.0  9.4     11.6      51.5
## 6   11.8   14.4     13.1 38.9      9.9      57.9
(b) (2 marks) Find five-number summaries for each column by running summary on all the columns of
the data frame (at once, if you can: remember apply?)
Solution: This:
apply(weather,2,summary)
##         l.temp h.temp ave.temp  rain ave.wind gust.wind
## Min.       3.1    9.8      7.3  0.00     0.00       3.2
## 1st Qu.    9.1   14.4     12.0  0.00     2.30      22.5
## Median    12.9   19.1     15.8  0.30     3.50      29.0
## Mean      12.7   19.2     15.7  5.84     4.04      31.1
## 3rd Qu.   16.3   23.3     19.3  5.30     5.20      38.6
## Max.      22.6   31.5     26.6 74.90    16.60      86.9
(c) (2 marks) Run a principal components analysis (on the correlation matrix).
Solution:
weather.1=princomp(weather,cor=T)
(d) (2 marks) Obtain a summary of your principal components analysis. How many components do you
think are worth investigating?
Solution:
summary(weather.1)
## Importance of components:
##                        Comp.1 Comp.2  Comp.3  Comp.4  Comp.5   Comp.6
## Standard deviation     1.7831 1.4138 0.74407 0.38585 0.33553 0.081141
## Proportion of Variance 0.5299 0.3332 0.09227 0.02481 0.01876 0.001097
## Cumulative Proportion  0.5299 0.8631 0.95533 0.98014 0.99890 1.000000
The issue is to see where the standard deviations are getting small (after the second component,
or perhaps the third one) and to see where the cumulative proportion of variance explained is
acceptably high (again, after the second one, 86%, or the third, 95%).
(e) (3 marks) Make a scree plot (the nice one, if you can). Does this support your conclusion from the
previous part?
Solution: I copied and pasted the code from class, changing some names:
plot(weather.1$sdev^2,type="b",ylab="Eigenvalue")
[Scree plot: the six eigenvalues plotted against component number, dropping steeply after the second component.]
I see elbows at 3 and at 4. Remember you want to be on the mountain for these, not on the
scree, so this suggests 2 or 3 components, which is exactly what we got from looking at the
standard deviations and cumulative variance explained.
The eigenvalue-greater-than-1 thing says 2 components, rather than 3.
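You can count these directly: the eigenvalues are the squared standard deviations, so, as a
check (a sketch):

```r
# how many eigenvalues exceed 1?
sum(weather.1$sdev^2 > 1)  # 2
```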
The other way also works, though it is harder to read:
plot(weather.1)
[Plot: barplot of the variances of Comp.1 through Comp.6, from plot(weather.1).]
(f) (4 marks) Obtain the component loadings. How do the first three components depend on the
original variables? (That is, what kind of values for the original variables would make the component
scores large or small?)
Solution:
weather.1$loadings
##
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## l.temp    -0.465 -0.348         0.542  0.470 -0.379
## h.temp    -0.510 -0.231        -0.576 -0.381 -0.458
## ave.temp  -0.502 -0.311                       0.804
## rain       0.296 -0.397  0.853        -0.163
## ave.wind   0.253 -0.560 -0.463  0.357 -0.529
## gust.wind  0.347 -0.507 -0.230 -0.492  0.572
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.167  0.167  0.167  0.167  0.167  0.167
## Cumulative Var  0.167  0.333  0.500  0.667  0.833  1.000
1. This component loads mainly (and negatively) on the temperature variables, so when
temperature is high, component 1 is low and vice versa. You could also say that it loads
positively on the other variables, in which case component 1 is low if the temperature
variables are high and the rain and wind variables are low.
2. This one loads most heavily on wind: when wind is high, component 2 is low. Again, you
can make the judgement call that the other variables also feature in component 2, so that
when everything is large, component 2 is small and vice versa.
3. This one is a bit clearer. The blank loadings are close to 0, and can be ignored. The main
thing in component 3 is rain: when rainfall is large, component 3 is large.
(g) (2 marks) Obtain the principal component scores for the first 20 days of the year, and display them
alongside the other variables in your data frame.
Solution:
Same idea as in class. I think it can be done in dplyr too, though I couldn’t see how.
v=data.frame(weather,scores=weather.1$scores)
head(v,n=20)
##    l.temp h.temp ave.temp rain ave.wind gust.wind scores.Comp.1
## 1    12.7   14.0     13.4 32.0     11.4      53.1        2.8428
## 2    11.3   14.7     13.5 64.8      5.6      41.8        2.7871
## 3    12.6   14.7     13.6 12.7      4.3      38.6        1.1108
## 4     7.7   13.9     11.3 20.1     10.3      66.0        3.6214
## 5     8.8   14.6     13.0  9.4     11.6      51.5        2.6742
## 6    11.8   14.4     13.1 38.9      9.9      57.9        3.0950
## 7    11.4   14.8     13.5  2.0      6.6      38.6        1.2201
## 8    12.4   15.6     14.1  1.5      5.9      33.8        0.7337
## 9     9.2   18.4     12.9  0.0      0.2      16.1       -0.2099
## 10    8.3   14.8     11.0  0.0      1.4      24.1        0.8245
## 11    5.8   14.8      9.5  0.3      1.1      16.1        1.0101
## 12    9.4   15.2     12.1 10.7      4.7      41.8        1.6750
## 13    7.3   12.9     10.2 15.7      3.1      35.4        2.1164
## 14   11.4   13.9     12.8  8.1      4.7      38.6        1.3414
## 15    9.4   13.1     12.0 29.0      5.9      43.5        2.5216
## 16    9.0   12.2     10.8  6.9      5.4      49.9        2.3833
## 17    7.7   11.4      9.3 25.4      7.2      41.8        3.1816
## 18    7.5   10.9      9.0 17.0      1.4      24.1        1.9458
## 19    6.4   11.4      8.7  2.5      3.3      32.2        2.1336
## 20    6.9   12.2      9.2  2.8      1.1      17.7        1.2877
##    scores.Comp.2 scores.Comp.3 scores.Comp.4 scores.Comp.5 scores.Comp.6
## 1      -3.127252     -0.004023       0.83603     -0.484456     -0.048587
## 2      -2.309272      3.643532       0.28733     -0.430800      0.060407
## 3      -0.254811      0.262561       0.26047      0.548246      0.004579
## 4      -2.473400     -0.992383      -0.50801      0.006405      0.048003
## 5      -2.028949     -1.683837       0.30950     -0.778030      0.188260
## 6      -3.141807      0.663632       0.27422     -0.144678     -0.046651
## 7      -0.327849     -0.957932       0.40619      0.051709      0.071742
## 8      -0.101693     -0.742727       0.53706      0.026055      0.025823
## 9       2.258912      0.534338      -0.26401     -0.141205     -0.109018
## 10      2.003563      0.108545      -0.11724      0.160826     -0.067062
## 11      2.724926      0.293739      -0.12458     -0.445328     -0.118472
## 12     -0.070977     -0.080289      -0.26913      0.256326     -0.027600
## 13      0.822099      0.653121      -0.21117      0.162944      0.023416
## 14     -0.002489     -0.168938       0.25187      0.451020      0.035518
## 15     -0.928698      0.955188       0.11778     -0.015696      0.117555
## 16     -0.200012     -0.669237      -0.22024      0.733585      0.033594
## 17     -0.626325      0.431170       0.37695     -0.389321     -0.085735
## 18      1.821269      1.281010       0.28346      0.133515     -0.029614
## 19      1.601731     -0.292889      -0.01086      0.156265     -0.033883
## 20      2.618355      0.448194       0.24838     -0.085814     -0.044551
(h) (2 marks) Find a day that scores high on component 1, and explain briefly why it came out high
(by looking at the measured variables).
Solution: Day 17 has one of the highest component 1 scores (of the first 20 days shown, only
day 4 scores higher). This is one of the cooler days, especially for the daytime high. Also,
there is a largish amount of rain. (The days with more rain were warmer.)
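Rather than scanning the score column by eye, you can ask R to rank the days for you; a
sketch, using the data frame v made above:

```r
# rows of v (days) with the largest component-1 scores, largest first
head(order(v$scores.Comp.1, decreasing = TRUE))
```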
(i) (2 marks) Find a day that scores low on component 2, and explain briefly why it came out low.
Solution: Day 6 (or day 1). These are days when the wind speed (average or gust) is on the
high side.
(j) (2 marks) Find a day that scores high on component 3, and explain briefly why it came out high.
Solution: Day 2. Component 3 was mainly rain, so it is not surprising that the rainfall is the
highest on this day.
(k) (2 marks) Make a biplot of these data, labelling the days by the day count (from 1 to 365). You
may have to get the day count from the original data frame that you read in from the file.
Solution:
biplot(weather.1,xlabs=weather.0$day.count)
[Biplot: the 365 days plotted on the first two components, labelled by day count, with arrows for the six variables.]
(l) (3 marks) Looking at your biplot, what do you think was remarkable about the weather on day
37? Day 211? Confirm your guesses by looking at the appropriate rows of your data frame (and
comparing with your summary from earlier).
Solution: Day 37 is at the bottom right of the plot, at the pointy end of the arrows for rain,
wind gust and average wind. This suggests a rainy, windy day:
weather[37,]
##    l.temp h.temp ave.temp rain ave.wind gust.wind
## 37    9.3   15.3     12.5 43.4     16.6        74
Those are high numbers for both rain and wind (the highest for average wind and above the
third quartile otherwise), but the temperatures are unremarkable.
Day 211 is towards the pointy end of the arrows for temperature, so this is a hot day:
weather[211,]
##     l.temp h.temp ave.temp rain ave.wind gust.wind
## 211   22.6   31.5     26.6    0      4.5      33.8
This is actually the hottest day of the entire year: day 211 is highest on all three temperatures,
while the wind speeds are right around average (and no rain is not completely surprising).