Principal Components:
A Conceptual Introduction
Simon Mason
International Research Institute for Climate Prediction
The Earth Institute of Columbia University
Linking Science to Society
What makes a good soccer team?
Everybody(?) has their favourite soccer team. But which is
the best team, and how can we determine that it is the
best?
We usually justify our choice of best team by describing it in
rather vague ways such as “good at scoring goals”,
“excellent defensive line”, “fair players”.
We need some quantifiable metrics rather than vague
descriptions.
Linking Science to Sport!
Soccer-Playing Metrics
Metrics can be defined for measuring the quality of a soccer
team objectively.
Each metric could be measured over a season or a number
of seasons.
1. Frequency of home wins (home wins).
2. Frequency of home losses (home losses).
3. Frequency of home goals scored (home for).
4. Frequency of home goals ceded (home against).
5. Frequency of away wins (away wins).
6. Frequency of away losses (away losses).
7. Frequency of away goals scored (away for).
8. Frequency of away goals ceded (away against).
9. Number of bookings (bookings).
10. Average attendance (attendance).
English Premiership Teams 2003/04
1. Arsenal
2. Aston Villa
3. Birmingham
4. Blackburn Rovers
5. Bolton Wanderers
6. Charlton Athletic
7. Chelsea
8. Everton
9. Fulham
10. Leeds United
11. Leicester City
12. Liverpool
13. Manchester City
14. Manchester United
15. Middlesbrough
16. Newcastle United
17. Portsmouth
18. Southampton
19. Tottenham Hotspur
20. Wolverhampton Wanderers
Team               Home  Home    Home  Home     Away  Away    Away  Away     Bookings  Attendance
                   wins  losses  for   against  wins  losses  for   against
Arsenal            15    0       40    14       11    0       33    12       58        38079
Aston Villa        9     4       24    19       6     8       24    25       58        36622
Birmingham         8     6       26    24       4     6       17    24       55        29074
Blackburn          5     10      25    31       7     8       26    28       67        24376
Bolton             6     5       24    21       8     8       24    35       66        26795
Charlton           7     6       29    29       7     7       22    22       42        26293
Chelsea            12    3       34    13       12    4       33    17       51        41234
Everton            8     6       27    20       1     11      18    37       59        38837
Fulham             9     6       29    21       5     8       23    25       68        16342
Leeds              5     7       25    31       3     14      15    48       81        36666
Leicester          3     6       19    28       3     11      29    37       73        30983
Liverpool          10    5       29    15       6     5       26    22       49        42677
Manchester City    5     5       31    24       4     10      24    30       53        46834
Manchester United  12    3       37    15       11    6       27    20       49        67641
Middlesbrough      8     7       25    23       5     9       19    29       58        30398
Newcastle          11    3       33    14       2     5       19    26       53        51440
Portsmouth         10    5       35    19       2     12      12    35       68        20108
Southampton        8     5       24    17       4     10      20    28       59        31717
Tottenham          9     6       33    27       4     13      14    30       63        34876
Wolves             7     7       23    35       0     12      15    42       70        28874
The Premiership Metric
In the Premiership the teams are ranked according to the
number of games they win and draw, and then by goal
difference if there are ties.
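\[
\begin{aligned}
\text{score} ={} & 3.0 \times (\text{home wins} + \text{away wins}) \\
 & + 1.0 \times (\text{home draws} + \text{away draws}) \\
 & + c \times (\text{goals for} - \text{goals against}) \\
 & + 0.0 \times (\text{bookings} + \text{attendance})
\end{aligned}
\]
where \(0.0 < c \ll 1.0\).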
I.e., a weighted sum of the metrics is used to rank the
teams.
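As a rough illustration, the weighted sum can be sketched in Python. The function name `premiership_score`, the tiebreak weight `c = 0.001`, and the derivation of draws as 19 games minus wins and losses are assumptions made for this sketch; they are not taken from the slides.

```python
def premiership_score(home_wins, home_draws, away_wins, away_draws,
                      goals_for, goals_against, c=0.001):
    """Weighted sum of the metrics: 3 per win, 1 per draw, and a small
    weight c on goal difference so that it only breaks ties."""
    points = 3.0 * (home_wins + away_wins) + 1.0 * (home_draws + away_draws)
    tiebreak = c * (goals_for - goals_against)
    # Bookings and attendance carry a weight of 0.0, so they are left out.
    return points + tiebreak


# Assumption: 20 teams, so each team plays 19 home and 19 away games.
GAMES_PER_VENUE = 19

# Arsenal 2003/04: 15 home wins, 0 home losses, 11 away wins, 0 away losses.
home_draws = GAMES_PER_VENUE - 15 - 0
away_draws = GAMES_PER_VENUE - 11 - 0
arsenal = premiership_score(15, home_draws, 11, away_draws,
                            goals_for=40 + 33, goals_against=14 + 12)

print(int(arsenal))  # whole points: 90
```

Any value of `c` small enough that the goal-difference term stays below one point would serve the same purpose.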
Team               Home  Home    Home  Home     Away  Away    Away  Away     Bookings  Attendance  Points
                   wins  losses  for   against  wins  losses  for   against
Arsenal            15    0       40    14       11    0       33    12       58        38079       90
Chelsea            12    3       34    13       12    4       33    17       51        41234       79
Manchester United  12    3       37    15       11    6       27    20       49        67641       75
Liverpool          10    5       29    15       6     5       26    22       49        42677       60
Newcastle          11    3       33    14       2     5       19    26       53        51440       56
Aston Villa        9     4       24    19       6     8       24    25       58        36622       56
Charlton           7     6       29    29       7     7       22    22       42        26293       53
Bolton             6     5       24    21       8     8       24    35       66        26795       53
Fulham             9     6       29    21       5     8       23    25       68        16342       52
Birmingham         8     6       26    24       4     6       17    24       55        29074       50
Middlesbrough      8     7       25    23       5     9       19    29       58        30398       48
Southampton        8     5       24    17       4     10      20    28       59        31717       47
Portsmouth         10    5       35    19       2     12      12    35       68        20108       45
Tottenham          9     6       33    27       4     13      14    30       63        34876       45
Blackburn          5     10      25    31       7     8       26    28       67        24376       44
Manchester City    5     5       31    24       4     10      24    30       53        46834       41
Everton            8     6       27    20       1     11      18    37       59        38837       39
Leicester          3     6       19    28       3     11      29    37       73        30983       33
Leeds              5     7       25    31       3     14      15    48       81        36666       33
Wolves             7     7       23    35       0     12      15    42       70        28874       33
A General Metric
A good team should score highly on all the metrics (note
that losses, against and bookings can be measured so that
high scores indicate good play by multiplying these scores
by -1).
If we can combine the original metrics into one new metric
that captures as much of the information in the ten metrics
as possible, we will have a new general metric that we can
use as an overall measure of the quality of a soccer team.
Variance
The differences between the teams on the various metrics
provide the information we can use to distinguish good from
bad teams.
On some metrics (e.g., attendance) the differences are large,
but on others (e.g., home losses) most teams score about
the same. The variance of each metric tells us how much
information that metric provides to distinguish the teams.
The total information available to distinguish the teams is the
sum of the variances of each metric.
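A minimal sketch of this idea, using the home wins and home losses columns from the team table (alphabetical order) and the sample variance from Python's standard library. Only two of the ten metrics are included, so `partial_total` here is a partial sum for illustration, not the full total.

```python
from statistics import variance

# Home wins and home losses for the 20 teams, in alphabetical order.
home_wins = [15, 9, 8, 5, 6, 7, 12, 8, 9, 5, 3, 10, 5, 12, 8, 11, 10, 8, 9, 7]
home_losses = [0, 4, 6, 10, 5, 6, 3, 6, 6, 7, 6, 5, 5, 3, 7, 3, 5, 5, 6, 7]

# Sample variance (divides by n - 1).
var_wins = variance(home_wins)      # about 8.2
var_losses = variance(home_losses)  # about 4.2

# The information available from these two metrics is the sum of their variances.
partial_total = var_wins + var_losses
print(round(var_wins, 1), round(var_losses, 1), round(partial_total, 1))
```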
Metric        Variance      Standardized variance
Home wins     8.2           1.00
Home losses   4.2           1.00
Away wins     11.0          1.00
Away losses   11.8          1.00
Home for      29.0          1.00
Home against  42.4          1.00
Away for      36.1          1.00
Away against  74.1          1.00
Bookings      90.3          1.00
Attendance    134200466.4   1.00
Total         134200773.7   10.00
Since virtually all of the total variance is contributed by
attendance, teams need to perform well on this metric.
Alternatively, the metrics could be standardized to give
them equal weight.
Standardize?
If we want to give each metric the same weight, we should
standardize the data first. Otherwise, a team that performs
poorly on a metric with high variance is likely to score badly
overall: it will be difficult to make up the large deficit from
metrics on which teams tend to score similarly.
The variance of the standardized metrics is 1.0. Therefore
the total standardized variance will be 10.0 (the number of
metrics).
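Standardization can be sketched as follows, again with the home wins and home losses columns (alphabetical order, Arsenal first). The helper name `standardize` and its `higher_is_better` flag are assumptions of this sketch; the flag applies the multiplication by -1 described earlier for losses, goals against and bookings.

```python
from statistics import mean, stdev, variance


def standardize(values, higher_is_better=True):
    """Subtract the mean and divide by the sample standard deviation.
    For metrics where low values are good, flip the sign."""
    m, s = mean(values), stdev(values)
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - m) / s for v in values]


home_wins = [15, 9, 8, 5, 6, 7, 12, 8, 9, 5, 3, 10, 5, 12, 8, 11, 10, 8, 9, 7]
home_losses = [0, 4, 6, 10, 5, 6, 3, 6, 6, 7, 6, 5, 5, 3, 7, 3, 5, 5, 6, 7]

z_wins = standardize(home_wins)
z_losses = standardize(home_losses, higher_is_better=False)

print(round(z_wins[0], 2), round(z_losses[0], 2))  # Arsenal: 2.32 2.56
print(round(variance(z_wins), 2))                  # 1.0 after standardizing
```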
Team               Home     Home     Home     Home     Away     Away     Away     Away     Bookings  Attendance  Average
                   wins     losses   for      against  wins     losses   for      against
Arsenal            2.32     2.56     2.12     1.23     1.73     2.43     1.83     1.93     0.21      0.27        1.66
Chelsea            1.27     1.10     1.00     1.38     2.03     1.27     1.83     1.35     0.95      0.54        1.27
Manchester United  1.27     1.10     1.56     1.07     1.73     0.68     0.83     1.00     1.16      2.82        1.32
Liverpool          0.57     0.12     0.07     1.07     0.23     0.97     0.67     0.77     1.16      0.66        0.63
Newcastle          0.92     1.10     0.82     1.23     -0.98    0.97     -0.50    0.30     0.74      1.42        0.60
Aston Villa        0.23     0.61     -0.85    0.46     0.23     0.10     0.33     0.42     0.21      0.14        0.19
Charlton           -0.47    -0.37    0.07     -1.07    0.53     0.39     0.00     0.77     1.89      -0.75       0.10
Bolton             -0.82    0.12     -0.85    0.15     0.83     0.10     0.33     -0.74    -0.63     -0.71       -0.22
Fulham             0.23     -0.37    0.07     0.15     -0.08    0.10     0.17     0.42     -0.84     -1.61       -0.18
Birmingham         -0.12    -0.37    -0.48    -0.31    -0.38    0.68     -0.83    0.53     0.53      -0.51       -0.13
Middlesbrough      -0.12    -0.85    -0.67    -0.15    -0.08    -0.19    -0.50    -0.05    0.21      -0.40       -0.28
Southampton        -0.12    0.12     -0.85    0.77     -0.38    -0.48    -0.33    0.07     0.11      -0.28       -0.14
Portsmouth         0.57     0.12     1.19     0.46     -0.98    -1.06    -1.66    -0.74    -0.84     -1.28       -0.42
Tottenham          0.23     -0.37    0.82     -0.77    -0.38    -1.35    -1.33    -0.16    -0.32     -0.01       -0.36
Blackburn          -1.17    -2.32    -0.67    -1.38    0.53     0.10     0.67     0.07     -0.74     -0.92       -0.58
Manchester City    -1.17    0.12     0.45     -0.31    -0.38    -0.48    0.33     -0.16    0.74      1.02        0.02
Everton            -0.12    -0.37    -0.30    0.31     -1.28    -0.77    -0.67    -0.98    0.11      0.33        -0.37
Leicester          -1.86    -0.37    -1.78    -0.92    -0.68    -0.77    1.16     -0.98    -1.37     -0.35       -0.79
Leeds              -1.17    -0.85    -0.67    -1.38    -0.68    -1.64    -1.16    -2.25    -2.21     0.14        -1.19
Wolves             -0.47    -0.85    -1.04    -2.00    -1.58    -1.06    -1.16    -1.56    -1.05     -0.53       -1.13
The Average
The simplest combined score is the average of the scores (or
standardized scores) on each metric:
\[
\text{average} = 0.1 \times \text{home wins} + \dots + 0.1 \times \text{attendance}
\]
But information is lost: the variance of the average scores is
only about 0.59, compared to the total variance of 10.0.
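The equal-weight average and its variance can be checked with a short sketch. The ten values in `arsenal_z` are Arsenal's standardized scores, and `team_averages` holds the per-team averages from the final column of the standardized table; the variable names are this sketch's own.

```python
from statistics import variance

# Arsenal's ten standardized scores (home wins ... attendance).
arsenal_z = [2.32, 2.56, 2.12, 1.23, 1.73, 2.43, 1.83, 1.93, 0.21, 0.27]

# Equal weights of 0.1 on each metric.
arsenal_average = sum(0.1 * z for z in arsenal_z)
print(round(arsenal_average, 2))  # 1.66

# Per-team averages for all 20 teams, in the order of the standardized table.
team_averages = [1.66, 1.27, 1.32, 0.63, 0.60, 0.19, 0.10, -0.22, -0.18, -0.13,
                 -0.28, -0.14, -0.42, -0.36, -0.58, 0.02, -0.37, -0.79, -1.19, -1.13]

# Variance of the combined score, to compare with the total of 10.0.
spread = variance(team_averages)
print(round(spread, 2))  # 0.59
```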
Also, the simple average is not very informative: if we
ask why a team is good, the only way to answer is to
refer to all ten metrics, which is inefficient for two
reasons:
1. there are too many metrics to which to refer;
2. some of the metrics are very similar, so if we know that
a team scored well on one metric we can assume that it
probably scored well on a similar metric …
Correlations Between the Metrics
Some of the metrics seem to measure similar characteristics.
For example, home for and away for both relate to the
team’s goal-scoring achievements.
Correlations between the metrics can be used to tell us
whether the metrics are measuring similar aspects of the
quality of a soccer team.
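As a sketch of how such a correlation is obtained, the Pearson correlation between home for and away for can be computed directly from the raw goal counts (teams in alphabetical order). The helper `pearson` is written out by hand here so the example needs only the standard library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


home_for = [40, 24, 26, 25, 24, 29, 34, 27, 29, 25,
            19, 29, 31, 37, 25, 33, 35, 24, 33, 23]
away_for = [33, 24, 17, 26, 24, 22, 33, 18, 23, 15,
            29, 26, 24, 27, 19, 19, 12, 20, 14, 15]

r = pearson(home_for, away_for)
print(round(r, 2))  # 0.2
```

In this season the two goal-scoring metrics turn out to be only weakly correlated, which is why the full table of correlations is worth inspecting rather than assuming similarity from the metric names.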
              HW    HL    HF    HA    AW    AL    AF    AA    Bk    Att
Home wins     1.00  0.78  0.81  0.76  0.49  0.67  0.26  0.71  0.47  0.38
Home losses   0.78  1.00  0.67  0.78  0.48  0.64  0.45  0.59  0.43  0.52
Home for      0.81  0.67  1.00  0.58  0.47  0.50  0.20  0.60  0.44  0.44
Home against  0.76  0.78  0.58  1.00  0.47  0.64  0.42  0.65  0.51  0.46
Away wins     0.49  0.48  0.47  0.47  1.00  0.71  0.78  0.74  0.44  0.30
Away losses   0.67  0.64  0.50  0.64  0.71  1.00  0.71  0.88  0.62  0.30
Away for      0.26  0.45  0.20  0.42  0.78  0.71  1.00  0.64  0.33  0.29
Away against  0.71  0.59  0.60  0.65  0.74  0.88  0.64  1.00  0.75  0.28
Bookings      0.47  0.43  0.44  0.51  0.44  0.62  0.33  0.75  1.00  0.47
Attendance    0.38  0.52  0.44  0.46  0.30  0.30  0.29  0.28  0.47  1.00
(Columns are in the same order as the rows: HW = home wins, HL = home losses,
HF = home for, HA = home against, AW = away wins, AL = away losses,
AF = away for, AA = away against, Bk = bookings, Att = attendance.)
Sum of diagonals = 10.
Independent Metrics
Positive correlations between the metrics show that they
are measuring similar aspects of the quality of a soccer
team.
We would like to combine the metrics somehow so that
common aspects are measured on a single metric, and
each combination measures a different aspect of the quality
of a soccer team (i.e., the correlations between these new
metrics are zero). Each new metric should have high variance
so that teams can be distinguished effectively.
Objectives:
1. the new metrics are uncorrelated;
2. each metric in turn summarizes as much information as
possible (its variance is maximized);
3. there is no loss of information.
New metrics that meet these objectives are called principal
components.
Principal Components
Principal components are weighted sums of the original
metrics. Weighted sums are like weighted averages, except
that the weights do not have to add up to 1.0. Instead, with
principal components the squares of the weights add up to
1.0 [1]. The weights are known as eigenvectors, and are
frequently referred to as loadings.
The weighted sums are the scores on the new metrics. The
new metrics are called principal components.
[1] A few authors draw the following distinction: for EOFs the sum of the
squared weights is 1; for principal components the sum is equal to the
length of the eigenvalue.
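One way to see where such weights come from is a power iteration on the correlation matrix of the metrics, sketched below with the standard library only. This is a swapped-in textbook technique for illustration; the slides do not say how the weights were actually computed, and the function name `first_component` is hypothetical. The matrix is the two-decimal correlation table from this presentation.

```python
from math import sqrt

# Correlations between the ten metrics (rounded to two decimals).
R = [
    [1.00, 0.78, 0.81, 0.76, 0.49, 0.67, 0.26, 0.71, 0.47, 0.38],
    [0.78, 1.00, 0.67, 0.78, 0.48, 0.64, 0.45, 0.59, 0.43, 0.52],
    [0.81, 0.67, 1.00, 0.58, 0.47, 0.50, 0.20, 0.60, 0.44, 0.44],
    [0.76, 0.78, 0.58, 1.00, 0.47, 0.64, 0.42, 0.65, 0.51, 0.46],
    [0.49, 0.48, 0.47, 0.47, 1.00, 0.71, 0.78, 0.74, 0.44, 0.30],
    [0.67, 0.64, 0.50, 0.64, 0.71, 1.00, 0.71, 0.88, 0.62, 0.30],
    [0.26, 0.45, 0.20, 0.42, 0.78, 0.71, 1.00, 0.64, 0.33, 0.29],
    [0.71, 0.59, 0.60, 0.65, 0.74, 0.88, 0.64, 1.00, 0.75, 0.28],
    [0.47, 0.43, 0.44, 0.51, 0.44, 0.62, 0.33, 0.75, 1.00, 0.47],
    [0.38, 0.52, 0.44, 0.46, 0.30, 0.30, 0.29, 0.28, 0.47, 1.00],
]


def multiply(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_component(matrix, iterations=200):
    """Power iteration: returns the unit-length weight vector with the
    largest variance, and that variance (the eigenvalue)."""
    n = len(matrix)
    weights = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        product = multiply(matrix, weights)
        norm = sqrt(sum(p * p for p in product))
        weights = [p / norm for p in product]
    # Rayleigh quotient w' R w gives the variance of the weighted sum.
    eigenvalue = sum(w * p for w, p in zip(weights, multiply(matrix, weights)))
    return weights, eigenvalue


weights, eigenvalue = first_component(R)
print(round(sum(w * w for w in weights), 6))  # squares of the weights add up to 1.0
print(round(eigenvalue, 2))
print([round(w, 3) for w in weights])
```

Because the input matrix is rounded, the printed variance and weights can only be expected to approximate the values used elsewhere in this presentation.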
Covariances Between the Principal Components
       PC 1   PC 2   PC 3   PC 4   PC 5   PC 6   PC 7   PC 8   PC 9   PC 10
PC 1   6.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
PC 2   0.00   1.32   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
PC 3   0.00   0.00   0.82   0.00   0.00   0.00   0.00   0.00   0.00   0.00
PC 4   0.00   0.00   0.00   0.71   0.00   0.00   0.00   0.00   0.00   0.00
PC 5   0.00   0.00   0.00   0.00   0.49   0.00   0.00   0.00   0.00   0.00
PC 6   0.00   0.00   0.00   0.00   0.00   0.22   0.00   0.00   0.00   0.00
PC 7   0.00   0.00   0.00   0.00   0.00   0.00   0.17   0.00   0.00   0.00
PC 8   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.13   0.00   0.00
PC 9   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.08   0.00
PC 10  0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.05
Sum of diagonals = 10
Eigenvalues
The variances of the principal components are called
eigenvalues.
The total variance explained by all the principal components
is the same as that of the original standardized metrics (10.0), and
so no information is lost. But most of the total variance is
explained by only a few components: the first component alone has a
variance of 6.01, compared with 0.59 for the average of the
standardized scores.
Principal components with variances > 1.0 have more
information than any of the original standardized metrics.
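The share of the total variance carried by each component follows directly from the eigenvalues. A small sketch using the ten variances from the covariance table (variable names are this sketch's own):

```python
# Variances (eigenvalues) of the ten principal components.
eigenvalues = [6.01, 1.32, 0.82, 0.71, 0.49, 0.22, 0.17, 0.13, 0.08, 0.05]

total = sum(eigenvalues)                       # 10.0, the number of metrics
explained = [ev / total for ev in eigenvalues]

# Running total of the explained share.
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

# Components that carry more information than a single standardized metric.
above_one = [ev for ev in eigenvalues if ev > 1.0]

print(round(explained[0], 3))   # 0.601: PC 1 explains about 60% of the variance
print(round(cumulative[1], 3))  # 0.733: PC 1 and PC 2 together
print(len(above_one))           # 2
```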
Soccer Team Principal Component 1
We can obtain a score for a team by calculating the
weighted sum of its scores on the 10 original metrics:
\[
\text{PC 1}_{\text{Arsenal}} = 0.342 \times \text{home wins} + \dots + 0.221 \times \text{attendance}
\]
We can get a score for each team …
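The scoring step itself is just a weighted sum, as in the sketch below. Only two of the ten weights are given in this presentation, so the sketch deliberately does not try to reproduce them: `equal_weights` is a hypothetical stand-in (each weight set to 1/sqrt(10) so that the squares add up to 1.0), and the resulting number is not Arsenal's actual score on the first principal component.

```python
from math import sqrt


def component_score(weights, z_scores):
    """Score on a component: the weighted sum of the standardized metrics."""
    if len(weights) != len(z_scores):
        raise ValueError("need one weight per metric")
    return sum(w * z for w, z in zip(weights, z_scores))


# Arsenal's ten standardized scores (home wins ... attendance).
arsenal_z = [2.32, 2.56, 2.12, 1.23, 1.73, 2.43, 1.83, 1.93, 0.21, 0.27]

# Hypothetical weights for illustration only, NOT the loadings of PC 1.
equal_weights = [1.0 / sqrt(10)] * 10

score = component_score(equal_weights, arsenal_z)
print(round(score, 2))  # 5.26 with these placeholder weights
```

Replacing `equal_weights` with the full set of loadings for a component would give that component's score for the team.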
Soccer Team Principal Component 1
The score tells us whether the team outperforms its
opponents while playing fairly and drawing large crowds.
Soccer Team Principal Component 2
The score tells us whether the team plays better at home or
away.