Challenging nostalgia and performance metrics in baseball

Challenging nostalgia and performance metrics in
baseball
Daniel Eck
September 15, 2016
Abstract
In this writeup, we show that the great old time players are overrated relative to
the great players of recent time and we show that performance metrics used to
compare players have substantial era biases. In showing that the old time players
are overrated, no individual statistics or era adjusted metrics are used. Instead,
we provide substantial evidence that the composition of the eras themselves are
drastically different. In particular, there were significantly fewer eligible MLB
players available in the older eras of baseball. As a consequence, we argue that
ESPN’s greatest MLB players of all time list includes too many old time players
in their top 25 and many performance metrics fail to adequately compare players
across eras.
1
Introduction
When one looks at the raw or advanced baseball statistics, one is often blown away
by the accomplishments of old time baseball players. The greatest players from the
old eras of major league baseball produced mind-boggling numbers. As examples,
see Babe Ruth’s batting average and pitching numbers, Honus Wagner’s 1900 season, Tris Speaker’s 1916 season, Walter Johnson’s 1913 season, Ty Cobb’s 1911
season, Lou Gehrig’s 1931 season, Rogers Hornsby’s 1925 season, and many others. The accomplishments achieved by players during these seasons are far beyond
what recent players produce and it seems, at first glance, that players from the old
eras were vastly superior to the players that play in more modern eras. But, is this
true? Are the old time ball players actually superior?
In this writeup, we investigate whether or not the old time baseball players
are overrated and whether or not the recent baseball players are underrated. We
define old time baseball players to be those that played in the MLB in the 1950
season or before. This year is chosen because it coincides with the decennial United
1
States Census and is close to the year in which baseball became integrated. Recent
baseball players are defined to be those that played in the MLB in the 1980 season
or after.
In the initial stages of this analysis, we do not compare baseball players via
their statistical accomplishments or the awards they earned over their careers. Such
measures often have era biases that are confounded with actual performance. The
only statistics that are of interest are when a player played and the eligible MLB
population at that time. The population data (in Section 2) is constructed and used
to formally test (in Section 4) whether or not too many old time players are represented in ESPN’s top 25 list (ESPN, 2015). We do discover that there are far too
many old time players included in ESPN’s top 25 list. The extent of our conclusions
conflicts with many conventional baseball evaluation metrics as well as a sophisticated era-adjustment detrending metric (Petersen, Penner, and Stanley, 2011). The
reference Petersen, Penner, and Stanley (2011) will henceforth be called PPS due to
repeated use. A detailed comparison and critique of the methodologies underlying
the evaluation metrics and our initial analysis is then given.
2
Data
The eligible MLB population is not over any particular time period. As a proxy, we
say that the eligible MLB population is the decennial count of males aged 20-24.
This demographic data for the United States eligible MLB population is obtained
from the US Census (Census, 2016). The MLB started in 1876 so our data collection begins with the 1880 Census. In 1947, the integration of the MLB began.
African American and Hispanic demographic data is added to our dataset starting
in 1950. Prior to 1950, the US black and Hispanic populations are excluded from
the tallies since baseball was not yet integrated.
African American and Hispanic citizens were not the only group discriminated
against by pre (and post) integration baseball. Players from Latin American countries were also discriminated against, and citizens from those countries are added
to the eligible MLB player pool after 1950. The population data of Latin American countries is obtained from Brea (2003). It should be noted that Brazil and
Argentina has been excluded from Latin America in this study since the Brazilians
and Argentinians do not largely play baseball and their populations are large compared with Latin American countries where baseball is considered to be a dominant
sport. In the 1990s the MLB and minors saw an influx of Asian baseball players
from Japan, Korea, and Taiwan. The populations from these countries are added to
the eligible MLB player talent pool for those years. The foreign Census data is not
as precise so 5% of the total population data is added to our datasets corresponding
to the estimate of the proportion of these counties’ populations which are males
aged 20-24.
The cumulative proportion means that at each era, the population of the previous eras is also included. The population dataset is seen below. As an example
2
of how to interpret this dataset, consider the year 1930. There were an 4.8 million
males aged 18-24 in 1930. The proportion of the eligible MLB population throughout all eras that was available to play baseball in 1930 or before is 0.113.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
3
year
1880
1890
1900
1910
1920
1930
1940
1950
1960
1970
1980
1990
2000
2010
pop
2.200
2.700
3.200
4.100
4.000
4.800
5.100
5.000
11.515
16.140
21.270
32.385
35.100
39.100
pop prop
0.012
0.026
0.043
0.065
0.087
0.113
0.140
0.167
0.228
0.315
0.429
0.602
0.790
1.000
The Greats
The population data is the essential part of our argument that the great old time
baseball players are overrated. At a quick glance of this population dataset, one can
see how small the proportion of the pre-1950 eligible MLB population actually is.
We now list the top 25 MLB players, along with the years they played, according
to ESPN. This list is displayed in Table 1 (ESPN, 2015). Using our definition of
old time and recent players, we see that Table 1 has 5 old time players and 2 recent
players in the top 10 and this list has 11 old time players and 7 recent players in the
top 25. When the eligible MLB population is considered, it appears that old time
players are over represented in this top 25 list. In the next Section, we formally test
whether or not this is so.
4
Testing
We now test whether or not there are too many old time players or too few recent
players included in the top 10 and top 25 lists. It should be noted that these two tests
are not independent of one another. If there are too many old time players included,
then it is likely that there are too few recent players included simply because there
were too many old time players included. One should pick a perspective of interest.
3
rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
name
Babe Ruth
Willie Mays
Barry Bonds
Ted Williams
Hank Aaron
Ty Cobb
Roger Clemens
Stan Musial
Mickey Mantle
Honus Wagner
Lou Gehrig
Walter Johnson
Greg Maddux
Rickey Henderson
Rogers Hornsby
Mike Schmidt
Cy Young
Joe Morgan
Joe Dimaggio
Frank Robinson
Randy Johnson
Tom Seaver
Alex Rodriguez
Tris Speaker
Steve Carlton
years played
1914-1935
1951-1973
1986-2007
1939-1960
1954-1976
1905-1928
1984-2007
1941-1963
1951-1968
1897-1917
1923-1939
1907-1927
1986-2008
1979-2003
1915-1937
1972-1989
1890-1911
1963-1984
1936-1951
1956-1976
1988-2009
1967-1986
1994-present
1907-1928
1965-1988
Table 1: ESPN’s list of the top 25 greatest baseball players to ever play in the MLB.
However, we will present tests from both perspectives. In order to conduct these
statistical tests, we make two assumptions:
• First, we assume that innate talent is uniformly distributed over the eligible
MLB population over the different eras.
• Second, we assume that the outside competition to the MLB available by
other sports leagues is offset by the increased salary incentives received by
more recent players.
We now have enough information and assumptions to conduct our tests. We start
out testing whether or not ESPN has included too many old time players in their
list. The test is of the form:
NH : The ESPN list in Table 1 is correct.
AH : There are too many old time players included in the ESPN list.
4
Here, NH and AH are short for null hypothesis and alternative hypothesis respectively. This testing format means that we assume that the ESPN list of the top 10
or top 25 greatest players is correct and we require strong evidence against this
assumption to conclude otherwise.
We now test whether or not 5 old time players in the top 10 is too many. Based
on the population statistics and our two assumptions, the chances that an MLB
eligible person from 1950 or before belongs in the top 10 is 0.167. The chances
of observing at least 5 such people in the top 10 is calculated using the binomial
distribution with n = 10 and p = 0.167. This chance is found to be 0.015. This
says that the chances of randomly distributed talent producing 5 or more players in
the top 10 is 2%. This is a small chance and is lower than the typical 5% threshold.
In other words, we would reject our null hypothesis and conclude that the ESPN
list included too many old time players in their top 10. This calculation is explained
in more detail in Appendix A.
We now perform a similar test with respect to the top 25 MLB players of all
time. There are 11 players included in Table 1 that played before the integration
of baseball. We find that the probability of there being 11 or more players in the
top 25 on the basis of random chance to be 0.001. This is significant evidence in
favor of there being too many old time players included in the top 25 list of MLB
players.
We now investigate whether or not there are too few recent players included in
ESPN’s list in Table 1. These tests are of the form:
NH : The ESPN list in Table 1 is correct.
AH : There are too few recent players included in the ESPN list.
It should be noted that these tests are confounded with the tests already conducted.
There are two players included in the top 10 that played after 1980. The probability
that a player that played after 1980 is included in the top 10 based on population
size and random chance is p = 0.571. The probability that at most 2 players from
after 1980 would reach the top 10 is equal to 0.02. This chance is small enough to
conclude that ESPN has included too few recent players in their top 10. There are
a total of 7 players included in the top 25 that played after 1980. We find that the
probability of there being at most 7 players from after 1980 as dictated by random
chance is 0.003. Again, this chance is small enough to conclude that ESPN has
included too few recent players in their top 25.
Remark 1. The first test conducted included Ted Williams and Stan Musial
in the pre-integration era. However, these two played a significant portion of their
careers, with success, in the era after integration. If we redefine old time players
to include those who played before 1960, then we arrive at a different conclusion.
The probability that 5 or more players from before 1960 would appear on a top 10
list subject to random chance is found to be 0.055. This chance is small but would
not be deemed to be statistically small when the 5% rule is adopted. Therefore, it
would be safe to conclude that there is not enough evidence to reject the hypothesis
5
that the ESPN top 10 list is correct when old time players are defined to be those
that played before 1960.
Remark 2. The redefinition of old time players to those that played before
1960 does not change the conclusion that ESPN’s top 25 list includes too many old
time players. The probability that at least 11 players that played before 1960 would
appear in a top 25 list is found to be 0.015.
5
Comparison of Methods
Our evaluation metric model free analysis of great players conflicts with the top 25
list in Table 1, conventional evaluation metrics such as Wins Above Replacement
(WAR), and a more sophisticated era-adjusted detrending metric of PPS. The underlying principles of these competing metrics are now compared with out methodology. We start with WAR.
5.1
Critique of WAR
Philosophically speaking, WAR is a one-number summary of the added value of a
player over a “replacement” level player. It is a statistic but it is not thought about
as a statistic in the way statisticians think about statistics. For instance, the replacement level player, which is central to the interpretation of WAR, is not a fixed and
known entity but rather a random outcome. In each year, the perceived value of
the replacement player changes based on the individuals comprising the statistical
outcome of that year’s baseball season. Therefore, the most central component to
the calculation of WAR is both random and is subject to a strong era-effect. The
ESPN top 25 list used a variation of WAR to compare players (Szymborski, 2015).
Another disadvantage of WAR is that it is more of a philosophical concept than
a calculable statistic. There are many different ways to calculate WAR (BaseballReference, 2016; Fangraphs, 2016) and these differing calculations assign different
values to the same performance. The arbitrariness of WAR as an actual statistical
tool coupled with its inherent randomness makes it a blunt tool for comparing performances across eras. As an example of this, consider the defensive comparison
between Ozzie Smith’s 1983 season and Rogers Hornsby’s 1917 season provided
in Table 2. Clearly Ozzie Smith had the superior defensive season but Rogers
Hornsby’s defensive WAR is substantially higher. In all, we see that it does not
make sense to use WAR as a way to compare players across eras.
Rogers Hornsby
Ozzie Smith
pos
SS
SS
Fld%
0.939
0.975
Chances
847
844
Errors
52
21
RF/9
5.62
5.51
dWAR
3.5
2.3
Table 2: Comparison of Rogers Hornsby’s 1917 defensive season and Ozzie
Smith’s 1983 defensive season.
6
5.2
Critique of detrending
We now critique the principles underlying the era-adjusted detrending of PPS. As
described by PPS, they detrend player statistics by normalizing achievements to
seasonal averages, which accounts for changes in relative player ability resulting
from both exogenous and endogenous factors, such as talent dilution from expansion, equipment and training improvements, as well as performance enhancing
drug usage. This is stated in their abstract. As an argument in favor of old time
players, talent dilution from expansion is terribly misunderstood and is fallacious
at best. If anything, the talent pool was more diluted in the earlier eras of baseball
because of a small eligible population size and the exclusion of entire populations
of people on racial grounds. See Table 3 for the specifics. As an argument in favor
of old time players, equipment and training improvements is not without fault because the same improvements are available to every competitor as well. PPS also
fails to account for the effect of modern day filming, which did not exist in the
older eras, as a way to gain insight on opponents’ strengths and weakness in their
training improvements assumption.
years
1900
1930
1950
1980
2000
2010
eligible pop.
3.2
4.8
5
21.27
35.1
39.1
number of teams
8
16
16
26
30
30
roster size
15
25
25
25
25
25
eligible pop. per team
0.02667
0.012
0.0125
0.03272
0.0468
0.05213
Table 3: Relative talent dilution when considering the eligible MLB population
sizes at select time periods. Population totals are in millions.
The principle assumptions about baseball used as justification for PPS detrending are questionable. The mathematics of PPS detrending are also questionable in
the context of baseball. We now dive into the mathematical formulation of the PPS
detrending method. PPS first calculates the prowess Pi (t) of an individual player i
as
P i(t) = xi (t)/yi (t)
(1)
where xi (t is an individual’s total number of successes out of his/her total number of opportunities yi (t) in a given year t. To compute the league-wide average
prowess, PPS then computes the weighted average for season t over all players
P
xi (t) X
hP (t)i = Pi
=
wi (t)Pi (t)
(2)
i yi (t)
i
where
yi (t)
wi (t) = P
.
i yi (t)
7
(3)
0
The
P index i runs over all players with at least y opportunities during year t, and
i yi is the total number of opportunities of all N (t) players during year t. A
cutoff y 0 = 100 is employed to eliminate statistical fluctuations that arise from
players with very short seasons. The PPS detrended metric for the accomplishment
of player i in year t is then given as,
xD
i (t) = xi (t)
P
hP (t)i
(4)
where P is the average of hP (t)i over the entire period,
2009
1 X
P =
hP (t)i.
110
t=1900
The PPS detrended metric xD
i (t) has its merits. It does adjust individual prowess
towards the historical average prowess. It is also dimensionless in a physical
sense with respect to the measured accomplishment. However, this metric is selffulfilling; it is specifically constructed in alignment with the opinion that prowesses
hP (t)i are simple deviations about P that are best addressed by shrinking era effects towards P .
Instead of the detrended metric (4), one could just as easily construct the detrended metric
hP (t)i
0
xD
.
(5)
i (t) = xi (t)
P
which also maintains the merits of dimensionlessness and detrending. However,
the metric (5) reaches the exact opposite conclusions as the metric (4). The for0
mulation of xD
i (t) rewards prowess from strong eras and punishes prowess from
weak eras. The only difference between the formulations of the two detrending
metrics is the placement of hP (t)i relative to P . The mathematical formulation of
PPS detrending is an artifact of the assumptions made. Detrending using (5) maintains all of the merits of PPS detrending but reaches opposite conclusions with the
justification of having the opposite opinion as the authors of PPS.
5.3
Critique of our testing procedure
Our testing procedure is valid under two assumptions made. Our first assumption
states that innate talent is uniformly distributed over the entire MLB eligible population over all eras. This assumption is on talent alone and is not concerned with
any era effects that are reflected in players’ statistical achievements. It is valid
from a genetic perspective where we think that genes associated with baseball talent remain within the MLB eligible population over time. However, the second
assumption made is quite strong and it is a result of our data collection methods.
This assumption states that the competition to baseball from other sports in more
recent eras is offset by increased salary incentives that were not available in older
8
eras. From 1880 to 1900, this assumption may hold. However, many consider the
time period from the 1920s to the 1960s to be an extended golden era of baseball
where competition from other sports was not a serious threat. Baseball players
from the 1920s to the 1960s did not make as much money as the modern players,
after adjusting for inflation, but the average players in many of these seasons made
very competitive salaries. If indeed such a golden era of baseball existed, then the
MLB eligible population actually competing for MLB positions would be higher
in these eras than in others. Our data collection method has not accounted for this
effect.
The existence of the golden era places our second assumption in questionable
territory. However, we can account for the golden era using various weighting
schemes. The idea is to weight each time period’s MLB eligible population with
respect to that year’s general population’s overall interest in baseball. The time
periods comprising the golden era receive higher weights than the other time periods. MLB interest over the time periods is given by Gallup polling data. The
Gallup data is incomplete so values are extrapolated. The extrapolation employed
treats the golden era as 1880-1950 and the unobserved weights over this time period are simply given the highest weights in the golden era. The new dataset includes the weighted MLB eligible populations for every ten year period from 1880
to 2010 and the cumulative proportion at each time period is calculated as before. We then perform the same hypothesis testing procedure with respected to the
weighted dataset as seen in Section 4. This procedure is replicated over three different weighting schemes which are explained in Appendix B. The p-values from
the hypothesis tests with respect to the weighted datasets are seen in Table 4. We
also reinvestigate talent dilution with respect to the three weighted datasets.
too many old in top 10
too many old in top 25
too few recent in top 10
too few recent in top 25
original data
0.0155
0.0012
0.0198
0.0031
w1
0.0407
0.0085
0.0314
0.0076
w2
0.0792
0.0305
0.0459
0.0156
w3
0.1379
0.0841
0.1863
0.1824
Table 4: Checking the robustness of our second assumption. Our second assumption is robust to the first two weighting schemes. It is not robust to the extremely
conservative third weighting scheme.
5.4
Comparison of critiques
All investigated methods have their shortcomings. However, our testing method
holds up well when we reweight our original data in ways that reflect relative disbelief in our second assumption. PPS do not offer any insight on how their method
fairs when their underlying assumptions are called into question. We do show that
their mathematics will support the exact opposite conclusion if one adopts assump9
years
1900
1930
1950
1980
2000
2010
original data
0.02667
0.012
0.0125
0.03272
0.0468
0.05213
w1
0.016
0.0072
0.005
0.01309
0.01872
0.02085
w2
0.02133
0.0096
0.005
0.01309
0.01872
0.02085
w3
0.00667
0.0048
0.00475
0.00524
0.00608
0.00626
Table 5: Relative talent dilution when considering the weighted eligible MLB population sizes at select time periods. Numbers are in millions of people per team.
tions that are in complete conflict with their assumptions. As seen in Section 5.1,
WAR does not offer a reasonable comparison of MLB players across eras because
its derivation involves a strong era effect.
6
Discussion
The players from the old eras of baseball receive significant attention and praise
as a result of their statistical achievements. On the other hand, more recent players are often tossed aside because of their more modest statistical achievements
or suspected steroid use. We argue that, with assumptions on the distribution of
talent, that the old time players are overrated and the recent players are underrated.
Our statistical tests provide overwhelming evidence in favor of this argument. We
conclude that the superior statistical accomplishments achieved by the old time
baseball players are a reflection of our inability to properly compare talent across
eras. It is highly unlikely that athletes from such a scarcely populated era of available baseball talent could represent a top 10 and top 25 list so abundantly. It is also
highly unlikely that athletes from such an abundantly populated era of available
baseball talent could represent a top 10 and top 25 list so scarcely.
Appendix A
We revisit the calculation discussed in the Testing section. The chance that an
eligible MLB player whose career was largely before 1950 appears in the top 10
occurs with probability p = 0.167. We use the binomial distribution to calculate
the probability of at least 5 eligible MLB players whose career was largely before
1950 appear in the top 10. More formally, we are interested in calculating P (X ≥
5) where X is the number of eligible MLB players whose career was largely before
1950 appearing in the top 10. So we have
P (X ≥ 5) = P (X = 5) + P (X = 6) + · · · + P (X = 10).
10
The binomial distribution is used to calculate P (X = 5), P (X = 6), ..., and
P (X = 10) where
10 x
P (X = x) =
p (1 − p)10−x
x
for x = 5, 6, 7, 8, 9, 10. For example, when x = 6 we have
10 6
P (X = 6) =
p (1 − p)4 .
6
Specifying x = 6 is equivalent to 6 eligible MLB players whose career was largely
before 1950 appearing in the top 10. We
do not care about the order that
players
10
10
appear in the top 10. The number 6 corrects for this. The number 6 is the
number of possible orderings of 6 “successes” and 4 “failures” that can occur. In
our context a success is an eligible MLB player whose career was largely before
1950 and a failure is an eligible MLB player whose career was largely after 1950.
Any particular ordering of the top 10 with 6 eligible MLB players whose career
was largely before 1950 appearing in the list occurs with probability p6 (1 − p)4 .
Therefore the probability that 6 eligible MLB players whose career was largely
before 1950 appear in the top 10 is given by
10 6
P (X = 6) =
p (1 − p)4 .
6
R statistical software was used as a calculator throughout the Testing section. Similar calculations to this one are made to calculate p-values in the Testing section.
Appendix B
The weights discussed in Section 5.3 are given in the table below. These weights
reflect Gallup polling data (Moore and Carroll, 2002; Carlson, 2001; Gallup, 2016)
on the subject of fan interest in baseball throughout the years. Polling data does not
exist for all time periods. Specifically, there are no records before 1940. The first
two weighting schemes reflect overall fan interest in baseball. It is apparent from
the Gallup dataset that the proportion of Americans sampled who are baseball fans
has been around 40% from 1940 on. The first weighting scheme (w1) places a 0.60
weight on fan interest with the assumption that more Americans from those time
periods favored, and played, baseball due to lack of competition from other sports.
The second weighting scheme (w2) places even stronger weights on pre-1940 fan
interest. The third weighting scheme (w3) reflects the proportion of Americans
sampled that list baseball as their favorite sport. Again, we place a higher weight
for pre-1940 fan interest.
All three weighting schemes serve as an educated proxy for the eligible MLB
population thought to strive towards a career in professional baseball. The weighting schemes w2 and w3 are especially conservative. We do not expect fan interest
11
in pre-1940s baseball to be as high as our weighting schemes due to lack of media
exposure. Furthermore, we do not expect the proportion of the eligible pre-1940s
MLB population thought to strive towards a career in professional baseball due
to relatively low MLB salaries in these time periods. The appropriateness of w3
is particularly questionable since individuals play baseball even if it is not the individual’s favorite sport. It should be noted that these weights are obtained from
survey data obtained in the United States only and are applied, at equal value, to the
other countries. Therefore our weighting schemes are biased but rich survey data
on the topic of fan interest in baseball is unavailable from from the other countries.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
year
1880
1890
1900
1910
1920
1930
1940
1950
1960
1970
1980
1990
2000
2010
w1
0.60
0.60
0.60
0.60
0.60
0.60
0.40
0.40
0.40
0.40
0.40
0.40
0.40
0.40
12
w2
0.80
0.80
0.80
0.80
0.80
0.80
0.40
0.40
0.40
0.40
0.40
0.40
0.40
0.40
w3
0.25
0.25
0.25
0.40
0.40
0.40
0.35
0.38
0.34
0.28
0.16
0.16
0.13
0.12
References
Baseball-Reference.com. (2016). WAR Explained.
http://www.baseball-reference.com/about/war_
explained.shtml
Brea, J. A. (2003). Population Dynamics in Latin America.
Population Bulletin, 58, no. 1. http://www.igwg.org/Source/58.
1PopulDynamicsLatinAmer.pdf
Carlson, D. K. (2001). Although Fewer Americans Watch Baseball, Half Say
They are Fans.
http://www.gallup.com/poll/4624/
Although-Fewer-Americans-Watch-Baseball-Half-Say-They-Fans.
aspx?g_source=baseball%20interest&g_medium=search&g_
campaign=tiles
Census (2016). https://www.census.gov/
ESPN. (2015). ESPN’s Hall of 100.
http://espn.go.com/mlb/feature/video/_/id/8652210/
espn-hall-100-ranking-all-greatest-mlb-players
Fangraphs. (2016). What is WAR?
http://www.fangraphs.com/library/misc/war/
Gallup. (2016). Sports.
http://www.gallup.com/poll/4735/sports.aspx
Moore, D. W. and Carroll, J. (2002). Baseball Fan Numbers Steady, But Decline
May Be Pending.
http://www.gallup.com/poll/6745/
Baseball-Fan-Numbers-Steady-Decline-May-Pending.
aspx?g_source=baseball%20interest&g_medium=search&g_
campaign=tiles
Petersen, A. M., Penner, O., Stanley, H. E. (2011). Methods for detrending success
metrics to account for inflationary and deflationary factors. Eur. Phys. J. B., 79,
67–78.
Szymborski, D. (2015). Hall of 100 methodology. http://espn.go.com/
mlb/story/_/id/12127337/hall-100-methodology-mlb
13