Analysis of Baseball Team Construction - Skyline

Skyline - The Big Sky Undergraduate Journal
Volume 1 | Issue 1
Article 2
2013
Analysis of Baseball Team Construction
Kevin Ferris
Montana State University
Part of the Sports Studies Commons, and the Statistics and Probability Commons
Recommended Citation
Ferris, Kevin (2013) "Analysis of Baseball Team Construction," Skyline - The Big Sky Undergraduate Journal: Vol. 1, Iss. 1, Article 2.
Available at: http://skyline.bigskyconf.com/journal/vol1/iss1/2
This Research Article is brought to you for free and open access by Skyline - The Big Sky Undergraduate Journal. It has been accepted for inclusion in
Skyline - The Big Sky Undergraduate Journal by an authorized editor of Skyline - The Big Sky Undergraduate Journal.
Acknowledgments
This work was supported by funding from the Undergraduate Scholars Program at Montana State University. I
would also like to thank Jim Robison-Cox for all the help he provided with the project. Further ideas and
influence came from Steve Cherry and Mark Greenwood, and I greatly appreciate their contributions. Finally,
I would like to thank Gregg Ferris, Lik Ming Aw, Blaine Ferris, and Else Trygstad-Burke for their help in
preparing this paper.
This research article is available in Skyline - The Big Sky Undergraduate Journal: http://skyline.bigskyconf.com/journal/vol1/iss1/2
Ferris: Analysis of Baseball Team
Abstract
To succeed in the competitive environment of Major League Baseball,
teams must assemble the best possible collection of players. While doing so
primarily involves acquiring talented players, teams must also account for how
these players fit together. In this paper, I used the WAR metric to explore how
the distribution of player talent affects team performance. Specifically, an analysis
was conducted to determine the effect that altering the moments of this
distribution has on team performance. I found that increasing the standard
deviation of a team's position players tends to negatively impact the team, and
that teams with higher standard deviation and skewness for pitchers also tend to
perform slightly worse. These results suggest that teams could benefit by
accounting for the spread of player talent.
Published by Skyline - The Big Sky Undergraduate Journal, 2013
Skyline - The Big Sky Undergraduate Journal, Vol. 1 [2013], Iss. 1, Art. 2
Introduction
“But for 2013, the question [of how successful the Red Sox will be] centers on the
star power [they] have retained.” -- Gordon Edes, ESPN Boston [3]
“We're probably a ball club where people will say, ‘They're missing the
superstar.’ Well, I don't really care. I think this team will challenge one another. I
think they'll play hard. We're going to be that pest that never goes away.” -- Kevin
Towers, Arizona Diamondbacks General Manager [6]
Beginning with Bill James's Baseball Abstracts in the mid-1980s and
popularized by Michael Lewis's Moneyball, the rise of “advanced statistics” has
brought a plethora of new metrics to the sports landscape. From hockey to tennis,
teams and fans now analyze their sports differently than they did 10 years ago.
The style of the statistical metrics differs wildly from sport to sport. Since
baseball primarily involves a faceoff between the batter and the pitcher,
determining the “winner” of each plate appearance is a relatively straightforward
matter. A substantial amount of work has been done on evaluating the impact a
single player has on his team by determining the number of plate appearances he
“won.”
Conversely, in basketball and hockey, how a group of individuals plays
together is more heavily scrutinized. In these sports, one notable statistic is
plus/minus, which measures how many more points a group of five players scored
than the opposing team while they were playing together. This
approach, which focuses on analyzing a team rather than an individual, arises
because these sports feature a group of players acting together to achieve a
common goal. Reducing their collective performance to an individual level is
extremely difficult, so more emphasis has been placed on team-oriented analysis
in these sports.
Here, I attempt to explore baseball performance through team-oriented
analysis. Specifically, I set out to analyze how different types of players work
together by answering the question, “Which is better for a team: to have two
average players or one All-Star-caliber player and one subpar player?” It could be
that there is no difference between the two scenarios, in which case teams should
simply try to acquire as much total talent as possible. Alternatively, the
effect of having one good player could outweigh the effect of a bad player or vice
versa. If that is the case, then teams will have to account for the team dynamic
when trying to improve.
Data
To explore the effect that the composition of player talent can have on a
baseball team’s performance, this analysis needed a reliable estimator of player
talent in a given year. This is made difficult because player talent manifests itself
in different ways. Some players excel on the pitching mound while others
contribute on the field or at the plate. Ideally, this study could account for
fielding, hitting, and pitching ability when analyzing distributions of player talent.
Traditionally, hitters have been evaluated by Batting Average, RBIs, and
Home Runs, pitchers by Wins, ERA, and Strikeouts, and fielding performance by
Errors and Fielding Percentage.¹ These statistics have been the subject of much
research, which has found that they do a poor job of evaluating player
performance. The batting and fielding statistics do not capture everything that a
batter or fielder does, while the pitching statistics are influenced by many factors
other than the pitcher [7]. Other player evaluation metrics were therefore needed.
¹ If the reader is unfamiliar with any of the statistics discussed in this section, I would recommend
checking the Baseball Statistics Wikipedia page.
There are multiple outcomes that a hitter can have in each plate
appearance. He could strike out (his worst scenario), hit a home run (his best
scenario), or do something in between. Since each outcome of a plate
appearance either helps or hurts a team, a useful hitting statistic would apply a
weight to each outcome, with each weight reflecting how much that outcome helps
or hurts the team. A player’s hitting value would then be the sum of his weighted
outcomes throughout the year. The Weighted On-Base Average (wOBA) statistic
attempts to do just that, expressing the total as a rate per plate appearance. Its
weights have been derived by calculating how many runs each outcome is worth on
average [13]. Because it has the desired properties, wOBA was used as a measure of
the value of a player’s batting performance over the course of a baseball season.
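As a rough sketch of how such a weighted rate can be computed, the Python below follows the commonly published wOBA form: weighted counts of unintentional walks, hit-by-pitches, and each hit type, divided by at-bats plus unintentional walks, sacrifice flies, and hit-by-pitches. The weights in `ILLUSTRATIVE_WEIGHTS` are assumed, approximate values for illustration only (the real weights are re-derived every season), and the season line is hypothetical.

```python
# Illustrative wOBA weights: approximate magnitudes only. The real weights
# are re-estimated each season, so treat these values as assumptions.
ILLUSTRATIVE_WEIGHTS = {
    "uBB": 0.69,  # unintentional walk
    "HBP": 0.72,  # hit by pitch
    "1B": 0.89,
    "2B": 1.27,
    "3B": 1.62,
    "HR": 2.10,
}


def woba(stats, weights=ILLUSTRATIVE_WEIGHTS):
    """Weighted on-base average from a player's season counting stats."""
    counts = dict(stats)
    # Intentional walks are excluded from the numerator and denominator.
    counts["uBB"] = stats["BB"] - stats.get("IBB", 0)
    numerator = sum(w * counts.get(event, 0) for event, w in weights.items())
    denominator = (stats["AB"] + stats["BB"] - stats.get("IBB", 0)
                   + stats.get("SF", 0) + stats.get("HBP", 0))
    return numerator / denominator


# Hypothetical season line, not taken from the paper's data.
season = {"AB": 500, "BB": 60, "IBB": 5, "HBP": 5, "SF": 5,
          "1B": 100, "2B": 30, "3B": 3, "HR": 25}
print(round(woba(season), 3))  # 0.4
```

Each outcome contributes its weight times its count, which is the "sum of weighted outcomes" described above, scaled to a per-plate-appearance rate.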
Fielding performance is the most difficult aspect to measure. Presently,
the best way to measure a player’s fielding performance is to begin by comparing
his performance to the performance of an “average” fielder. Each fielder is given
credit if he makes a play that an “average” fielder would fail to make, and
loses credit if he fails to make a play that an “average” fielder would make. A
player’s total fielding value for a year is the sum of all these credits and
deductions. Fielders can also be credited or debited in a similar manner for how
often they throw out or otherwise influence base runners. Ultimate Zone Rating
(UZR), Defensive Runs Saved (DRS), and Total Zone Rating (TZR) are three
statistics that model fielder performance in this manner [11, 14]. For this
analysis, they were used to incorporate a player’s fielding ability during a season.
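A toy version of the credit-and-deduction idea above, in Python. It assumes each fielding chance comes with a hypothetical probability `p_avg` that an average fielder would convert it into an out; real systems such as UZR and DRS derive such probabilities from detailed batted-ball data and convert the result into runs, which this sketch does not attempt.

```python
# Simplified "plays above average" sketch in the spirit of zone-based
# fielding metrics. The probabilities are hypothetical inputs, not UZR/DRS.

def plays_above_average(chances):
    """chances: iterable of (p_avg, made) pairs, where p_avg is the assumed
    probability that an average fielder converts the chance into an out."""
    total = 0.0
    for p_avg, made in chances:
        if made:
            total += 1 - p_avg  # credit for a play an average fielder often misses
        else:
            total -= p_avg      # deduction for missing a play an average fielder often makes
    return total


# A routine play made, a difficult play made, and a fairly easy play missed.
chances = [(0.95, True), (0.20, True), (0.80, False)]
print(round(plays_above_average(chances), 2))  # 0.05
```

The difficult catch earns most of the credit (0.80), and the missed easy play takes almost exactly that much back, which is the asymmetry the comparison to an "average" fielder is meant to capture.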
How to properly evaluate pitchers has been the subject of much scrutiny.
Most studies have found that pitchers have relatively little control over what
happens when a ball is put in play [7]. These results suggest that pitching
performance should be analyzed primarily based on outcomes over which a
pitcher has control – walks, home runs, and strikeouts. The metrics built on
these studies apply a weight to each of these outcomes. These weights are then
used to calculate how many runs that pitcher “should” have allowed while
“ignoring” other outcomes.² Fielding Independent Pitching (FIP) is a metric
built on this principle. Other metrics, however, are still calculated from the
number of runs a pitcher actually allowed, which is then adjusted for the run
scoring environment and the defense of the pitcher’s team. An example of this
type of metric is adjusted Runs Allowed (xRA). Despite the differences in
calculation, the two approaches generally give similar results.
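FIP's widely published formula weights home runs, walks plus hit-by-pitches, and strikeouts, then adds a league constant so that the result reads on the ERA scale. A minimal sketch, assuming `constant=3.10` as a placeholder (the actual constant changes every season) and innings supplied as a true decimal (200⅓ innings as 200.333, not the box-score notation 200.1):

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching: only HR, BB, HBP, and K enter.

    `constant` is an assumed placeholder; in practice it is set each season
    so that league-average FIP matches league-average ERA.
    `ip` must be decimal innings (e.g. 200.333 for 200 1/3 innings).
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical starter: 20 HR, 50 BB, 5 HBP, 180 K in 200 innings.
print(round(fip(hr=20, bb=50, hbp=5, k=180, ip=200.0), 3))  # 3.425
```

Balls in play never enter the calculation, which is exactly the "fielding independent" property described above.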
These metrics give a very good idea of a player’s contribution in each
facet of a baseball game. Since the purpose of this paper was to look at a player’s
overall talent level, the next step was to combine these statistics into one metric
which can summarize a player’s total contribution to his team. This is exactly
what the Wins Above Replacement (WAR) statistic does. It takes a player’s
hitting, fielding, and pitching performance, puts them on a common scale, and then
sums them.³ In doing so, WAR attempts to answer the question, “If [a] player got
injured and his team had to replace him with a minor leaguer, how many fewer
games would his team win?” [12] WAR would therefore appear to be a very good
evaluation of a player’s overall contribution to his team during a season.
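The common-scale-then-sum idea can be sketched as follows. This is a heavily simplified stand-in, not how fWAR or rWAR is actually computed: it assumes every component has already been converted to runs above average, adds a hypothetical `replacement_runs` term for the gap between an average and a replacement-level player, and divides by an assumed 10 runs per win.

```python
# Heavily simplified WAR sketch. RUNS_PER_WIN and replacement_runs are
# illustrative assumptions, not the constants used by fWAR or rWAR.
RUNS_PER_WIN = 10.0


def simple_war(component_runs, replacement_runs, runs_per_win=RUNS_PER_WIN):
    """Sum components already on a common runs-above-average scale, add the
    assumed replacement-level gap, and convert runs to wins."""
    return (sum(component_runs) + replacement_runs) / runs_per_win


# Hypothetical position player: +15 batting runs, +5 fielding runs.
print(simple_war([15.0, 5.0], replacement_runs=20.0))  # 4.0
```

With zero runs above average and 20 replacement runs, the sketch returns 2.0, in line with the average WAR of about 2 cited from [12].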
It is important to note that WAR is not a perfect statistic; its fielding
component in particular is an imperfect representation of a player’s value.
However, WAR has been shown to have a very strong relationship with wins [1].
Using linear regression, it was found that on average a one-WAR increase is
associated with almost exactly one more win per season [2]. These results suggest
that WAR is a fairly good representation of a player’s contributions. This study
therefore used WAR as a measure of how much a player helped his team over the
course of a season.
² The metrics don’t ignore the other outcomes entirely, but they are almost entirely driven by
walks, strikeouts, and home runs.
³ WAR also considers base running, but base running is a minor part of the game and does not
contribute very much to WAR. As a result, I chose not to discuss it here.
Because there are multiple metrics used to calculate player performance,
there are many different ways that a player’s WAR could be calculated.
However, two versions, fWAR (FanGraphs) and rWAR (Baseball-Reference), are the
most widely used. fWAR uses wOBA, UZR, and FIP, while rWAR uses wOBA, DRS, and
xRA.⁴ To avoid any problems with combining fWAR and rWAR, this paper uses two
separate models: one for fWAR and one for rWAR. However, the metrics are very
similar (the correlation between fWAR and rWAR for position players is 0.92), so
the two models should yield similar results.
The data were collected in June 2012. The data used for this analysis
begin in 1974 (the first year fWAR was available for pitchers) and end in 2011.
The three strike-shortened seasons of 1981, 1994, and 1995 were omitted. The
data consist of both fWAR and rWAR values for each player over this time
period.
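The season filter described above can be written down directly. The `keep_row` helper and its `year` field are hypothetical, assuming player-season records stored as dictionaries:

```python
# Seasons analyzed: 1974 through 2011, minus the strike-shortened seasons.
FIRST_SEASON, LAST_SEASON = 1974, 2011
STRIKE_SHORTENED = {1981, 1994, 1995}

seasons = [
    year
    for year in range(FIRST_SEASON, LAST_SEASON + 1)
    if year not in STRIKE_SHORTENED
]


def keep_row(row):
    """Hypothetical filter for a player-season record with a 'year' key."""
    year = row["year"]
    return FIRST_SEASON <= year <= LAST_SEASON and year not in STRIKE_SHORTENED


print(len(seasons), seasons[0], seasons[-1])  # 35 1974 2011
```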
Methodology
To analyze the theoretical difference between a team acquiring two
average players and a team acquiring one good and one bad player, the moments
of each team’s distribution of player WAR were analyzed. Since the WAR values for
two average players should be much closer together than the WAR values for a good
and a bad player, the standard deviation of the two average players should be
smaller. This concept extends beyond the simple case of two players to the entire
team: teams built around average players should have smaller standard deviations
than teams built around extreme players. Furthermore, a WAR above 4 is considered
good, while an average WAR is close to 2 [12]; because a few good players stretch
the right tail of a team’s distribution, a team of more average players will also
tend to have lower skewness values. This
analysis proceeds by checking whether the standard deviation and skewness of a
team’s distribution of player talent influence that team’s outcome.
⁴ Prior to 2002, UZR and DRS are not available, so fWAR and rWAR use statistics
which are similar to TZR.
Figure 1: Batter WAR Density Plot (panels: Distributions for 10 Worst Teams; Distributions for 10 Best Teams; x-axis: Player WAR; y-axis: Density)
Figure 1 contrasts the distribution of position player WAR for the 10 most
successful teams of the last 30 years with the distribution of the 10 least
successful teams. It is clear that the successful teams have distributions that are
more spread out; however, they also have fewer players with a WAR close to 0.
Players with 0 WAR are near replacement level, and provide barely any value to
the team. It could be that successful teams are successful simply because they
avoid having too many replacement-level players. To rule out this explanation,
the model of team success must control for a team’s overall talent level.
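To make the two summary measures concrete, the sketch below compares two hypothetical five-player groups with the same total WAR. It assumes the sample standard deviation and the plain moment-based skewness g1 = m3 / m2^1.5; the estimator actually used in the paper is not stated here, and variants with small-sample corrections give slightly different values.

```python
import statistics


def moment_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments).
    Assumed estimator; corrected variants differ slightly."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def team_moments(war_values):
    """Sum, sample SD (n - 1 denominator), and skewness of player WAR."""
    return {
        "sum": sum(war_values),
        "sd": statistics.stdev(war_values),
        "skew": moment_skewness(war_values),
    }


# Two hypothetical five-player groups with the same total WAR (16).
balanced = [2, 3, 3, 4, 4]
star_heavy = [2, 2, 2, 2, 8]

for name, team in [("balanced", balanced), ("star_heavy", star_heavy)]:
    m = team_moments(team)
    print(name, m["sum"], round(m["sd"], 2), round(m["skew"], 2))
```

This prints `balanced 16 0.84 -0.34` and `star_heavy 16 2.68 1.5`: identical sums, so a model that controls for the sum can still distinguish the two groups through the standard deviation and skewness terms.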
The model this paper uses therefore begins by accounting for the sum of
the WAR for the players on each team (an estimate of that team’s overall talent
level). It then examines the effect of changing a team’s standard deviation and
skewness. The model takes the following form:
While WAR attempts to put batters and pitchers on the same scale, there is
some evidence that 1 WAR for batters does not have the same effect as 1 WAR
for pitchers. As a result, the model is updated so the values for pitchers and
position players are evaluated separately. rWAR separates pitchers and hitters
while fWAR goes further and separates starting and relief pitchers. Since WAR
for relief pitchers (relievers) could differ from WAR for starting pitchers
(starters), the model using fWAR looks like:
The model using rWAR looks similar, but starter and reliever WAR are merged
into a single pitcher term. Finally, several control variables were considered:
league, year, the team’s wins in the previous year, and run scoring environment.
Year and run scoring environment did not provide any useful information and were
removed from the model. The team’s wins in the previous year were used in both
the fWAR and rWAR models, while league was only needed in the rWAR model.
Results and Discussion
Figure 2: Batter Effects Plot (Effects of Batter Standard Deviation for different WAR; panels for Sum = -6, 6, 17, 29, 41, and 53; x-axis: Batter SD; y-axis: Wins)
See Table 1 at the end of this paper for a full summary of the regression results.
Skewness dropped out of the model for batters and relievers (the p-value from
testing the final model against a model with a skewness interaction was 0.9071).
However, in the rWAR model, skewness was found to be important for pitchers while
standard deviation was not. Both were important for starting pitchers in the fWAR
model. Since the results were similar and the fWAR model has a slightly lower AIC,
the subsequent discussion uses the fWAR model only.
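For reference, the fitted fWAR model can be written out from the fWAR column of Table 1. The notation is an assumed shorthand: B, S, and R stand for position players, starting pitchers, and relief pitchers, and the subscripts sum, sd, and skew correspond to the Bsum, Bsd, fSSum, fSsd, fSSkew, fRsum, and fRsd terms. Table 1 does not report a coefficient for the previous-season wins control described in the Methodology, so that term is left out of this expression.

```latex
\widehat{\mathrm{Wins}} = 39.470
  + 1.069\,B_{\mathrm{sum}} + 3.575\,B_{\mathrm{sd}} - 0.134\,B_{\mathrm{sum}}B_{\mathrm{sd}} \\
  \quad + 0.583\,S_{\mathrm{sum}} + 2.128\,S_{\mathrm{sd}} - 2.075\,S_{\mathrm{skew}} + 0.093\,S_{\mathrm{sum}}S_{\mathrm{skew}} \\
  \quad + 1.532\,R_{\mathrm{sum}} + 8.086\,R_{\mathrm{sd}} - 1.101\,R_{\mathrm{sum}}R_{\mathrm{sd}}
```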
A plot of the relationship between batter standard deviation and wins is
presented in Figure 2. Each panel in the plot corresponds to a different sum of a
team’s batter fWAR.
The slope of the estimated regression line changes substantially across the
panels. It appears that teams with poor batters (i.e., teams with a low sum) are
generally helped by increasing the standard deviation of the team’s position
players. For teams with talented batters, the relationship reverses, and it is
generally harmful to increase the standard deviation. The visualizations are
supported by the regression results: the estimated coefficient for the standard
deviation term is 3.575, while the coefficient of the interaction term is estimated
to be -0.134. Because the interaction term is multiplied by the batter fWAR sum,
its negative effect is small for weak teams and is outweighed by the positive
standard deviation term. For strong teams, however, the negative interaction
effect outweighs the positive effect.
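The crossover can be located directly from the two batter coefficients in Table 1. A small Python sketch, assuming every other term in the model is held fixed, that computes the estimated slope of wins with respect to batter WAR standard deviation (3.575 - 0.134 × batter fWAR sum) and the sum at which it changes sign:

```python
# Batter coefficients from the fWAR column of Table 1.
B_SD = 3.575          # Bsd
B_SUM_X_SD = -0.134   # Bsum:Bsd


def batter_sd_slope(batter_war_sum):
    """Estimated change in wins per one-unit increase in batter WAR SD at a
    given batter fWAR sum, holding all other model terms fixed."""
    return B_SD + B_SUM_X_SD * batter_war_sum


# Sum at which the estimated slope is zero.
breakeven = -B_SD / B_SUM_X_SD
print(round(breakeven, 1))  # 26.7

for s in (-6, 6, 17, 29, 41, 53):  # panel sums shown in Figure 2
    print(s, round(batter_sd_slope(s), 2))
```

With these point estimates the slope is zero at a batter fWAR sum of about 26.7, which falls between the Sum = 17 and Sum = 29 panels of Figure 2, consistent with the pattern described above.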
For starting pitchers, skewness is most important in the fWAR model. In
Figure 3, it can be seen that starting pitcher skewness is negatively related to team
performance regardless of the level of pitching talent. These results suggest that
increasing the skewness of a team’s starting pitchers by one, while holding the
sum constant, is associated with a decrease of between 0.274 and 3.877 wins on
average. High skewness values usually mean that a team has one or two elite
starters. In baseball, starting pitchers pitch approximately once every five days.
These results might suggest that a team is better off acquiring two average
starters – who combine to pitch two out of every five days – than one elite starter.
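The quoted range appears to correspond to an approximate 95% interval for the skewness coefficient in Table 1, -2.075 ± 1.96 × 0.919; reading it that way is an inference, not something the paper states. Because the model also includes a starter sum × skewness interaction, that coefficient alone describes the slope at a starter fWAR sum of zero, so the sketch below also evaluates the point-estimate slope at the panel sums of Figure 3.

```python
# fWAR-model estimates for starting pitchers, copied from Table 1.
SKEW_COEF = -2.075    # fSSkew
SKEW_SE = 0.919       # its standard error
INTERACTION = 0.093   # fSSum:fSSkew (SE 0.068, not significant)

# Approximate 95% Wald interval, assuming a normal critical value.
Z = 1.96
low = SKEW_COEF - Z * SKEW_SE
high = SKEW_COEF + Z * SKEW_SE
print(round(low, 3), round(high, 3))  # -3.876 -0.274


def starter_skew_slope(starter_war_sum):
    """Point-estimate change in wins per one-unit increase in starter
    skewness at a given starter fWAR sum, other terms held fixed."""
    return SKEW_COEF + INTERACTION * starter_war_sum


for s in (-0.2, 5, 10, 16, 21, 26):  # panel sums shown in Figure 3
    print(s, round(starter_skew_slope(s), 2))
```

The interval endpoints match the 0.274 and 3.877 quoted above up to rounding.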
Figure 3: Starter Effects Plot (Effects of Starting Pitcher Skewness; panels for Sum = -0.2, 5, 10, 16, 21, and 26; x-axis: Starter Skewness; y-axis: Wins)
The results of the fWAR regression provide strong evidence that reliever
interactions are important. Figure 4 shows the effects of changing reliever
standard deviation for teams with different reliever WAR. Here, teams with the same
reliever WAR but different standard deviations can have fairly different records.
Figure 4: Reliever Effects Plot (Effects of Relief Pitcher SD; panels for Sum = -4, -1, 2, 4, 7, and 10; x-axis: Reliever SD; y-axis: Wins)
Teams with a fairly weak bullpen and a low standard deviation appear to
perform much worse than teams with a weak bullpen and high standard deviation.
A team with a weak bullpen but a high standard deviation would probably have
many poor relievers and only one or two strong relievers. Perhaps such teams
benefited by using their good relievers in close games and the weak ones in
blowouts. This would also explain why the effect of reliever standard deviation
decreases as the bullpen improves: the benefit of having one elite reliever for
close games is less powerful for a good bullpen, because in that case the next
best option is still an above-average reliever.
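The bullpen pattern can be quantified in the same way from the reliever coefficients in Table 1. This sketch assumes all other terms are held fixed and computes the point-estimate change in wins when reliever WAR standard deviation moves from 0.5 to 1.5 (the range on the Figure 4 axis) at each panel's reliever sum:

```python
# Reliever coefficients from the fWAR column of Table 1.
R_SD = 8.086          # fRsd
R_SUM_X_SD = -1.101   # fRsum:fRsd


def reliever_sd_gain(reliever_war_sum, sd_from=0.5, sd_to=1.5):
    """Point-estimate change in wins from moving reliever WAR SD from
    sd_from to sd_to at a fixed reliever fWAR sum, other terms held fixed."""
    return (R_SD + R_SUM_X_SD * reliever_war_sum) * (sd_to - sd_from)


for s in (-4, -1, 2, 4, 7, 10):  # panel sums shown in Figure 4
    print(s, round(reliever_sd_gain(s), 1))
```

Under these estimates the gain is about 12.5 wins at a reliever sum of -4 and about -2.9 at a sum of 10, with the sign changing near a sum of 7.3 (8.086 / 1.101); these are model point estimates, not predictions for any particular roster change.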
Conclusion
This paper set out to answer the question of whether it is more beneficial
for a team to acquire two average players or one good player and one bad player.
The evidence found in this paper suggests that there is a difference between the
approaches. Interestingly, the more beneficial option depends on the level of
team talent. Teams with good position players might make a different choice
than teams with poor position players. This suggests that team performance in
baseball is far more complicated than just the overall talent of a team’s players.
In the previous discussion of the effects of starter skewness, it was noted
that increasing starter skewness is usually harmful to team performance. Based
on this result, when teams try to improve their pitching staffs they should also try
to minimize the increase in skewness. However, increasing talent level while
holding skewness constant is challenging (the correlation between the two
variables is estimated to be 0.59). Therefore, teams face a trade-off when
considering how to improve their starting pitching staffs: they must decide
whether the benefit of adding a single strong starter is worth the skewness
cost.
As a final note, this study is not meant to draw firm conclusions about the
nature of baseball teams. Rather, it is an exploration into the mechanics of team
performance. It found a strong relationship between the distribution of player
talent and team performance. However, this relationship is not necessarily
causal: a team with poor relievers should not expect immediate improvement simply
by increasing the standard deviation of its relievers. This study suggests that
teams may be able to improve by accounting for their distribution of player talent.
References
[1] Dave Cameron. WAR: It works. FanGraphs, 2009.
[2] Glenn DuPaul. What is WAR good for? Hardball Times, 2012.
[3] Gordon Edes. Enough Sox star power? ESPN Boston, 2013.
[4] J. Fox and J. Hong. Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. Journal of Statistical Software, 2009.
[5] Marek Hlavac. stargazer: LaTeX code for well-formatted regression and summary statistics tables. Harvard University, Cambridge, USA, 2013. R package version 3.0.1.
[6] Tyler Kepner. Boom-and-bust Diamondbacks try for a steady approach. New York Times, 2013.
[7] J. Keri and Baseball Prospectus. Baseball Between the Numbers: Why Everything You Know about the Game Is Wrong. Basic Books, 2007.
[8] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2012. R package version 1.6-1.
[9] Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core Team. nlme: Linear and Nonlinear Mixed Effects Models, 2013. R package version 3.1-109.
[10] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
[11] Steve Slowinski. UZR. FanGraphs, 2013.
[12] Steve Slowinski. What is WAR? FanGraphs, 2013.
[13] Steve Slowinski. wOBA. FanGraphs, 2013.
[14] Baseball Reference Staff. Position player WAR calculations and details. Baseball Reference, 2013.
[15] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009.
[16] Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40(1):1–29, 2011.
[17] Yihui Xie. knitr: A general-purpose package for dynamic report generation in R, 2013. R package version 1.1.
Table 1: Regression Results (standard errors in parentheses; – = term not in model)
Term             fWAR Model           rWAR Model
Bsum             1.069*** (0.069)     1.125*** (0.056)
Bsd              3.575*** (1.252)     2.265** (0.935)
Bsum:Bsd         -0.134*** (0.041)    -0.087** (0.036)
fSSum            0.583*** (0.090)     –
fSsd             2.128*** (0.588)     –
fSSkew           -2.075** (0.919)     –
fSSum:fSSkew     0.093 (0.068)        –
fRsum            1.532*** (0.191)     –
fRsd             8.086*** (1.616)     –
fRsum:fRsd       -1.101*** (0.296)    –
rPsum            –                    0.966*** (0.023)
rPskew           –                    -0.812*** (0.215)
Constant         39.470*** (1.913)    48.122*** (1.217)
Key: B = position players; fS = starting pitchers and fR = relief pitchers (fWAR model); rP = all pitchers (rWAR model); sum, sd, and skew denote the sum, standard deviation, and skewness of player WAR; a colon denotes an interaction.
Note: *p<0.1; **p<0.05; ***p<0.01