Journal of Quantitative Analysis in Sports

Volume 5, Issue 2
2009
Article 4
2008 Northern California Symposium on Statistics and Operations Research in Sports
Chasing DiMaggio: Streaks in Simulated
Seasons Using Non-Constant At-Bats
David M. Rockoff∗ and Philip A. Yates†
∗Iowa State University, [email protected]
†California State Polytechnic University - Pomona, [email protected]
Copyright © 2009 The Berkeley Electronic Press. All rights reserved.
Abstract
On March 30, 2008, Samuel Arbesman and Steven Strogatz had their article “A Journey to
Baseball’s Alternate Universe” published in The New York Times. They simulated baseball’s entire
history 10,000 times to ask how likely it was for anyone in baseball history to achieve a streak that
is at least as long as Joe DiMaggio’s hitting streak of 56 in 1941. Arbesman and Strogatz treated
a player’s at-bats per game as a constant across all games in a season, which greatly overestimates
the probability of long streaks. The simulations in this paper treated at-bats in a game as a random
variable. For each player in each season, the number of at-bats for each simulated game was bootstrapped. The number of hits for player i in season j in game k is a binomial random variable with
the number of trials equal to the number of at-bats the player gets in game k and the probability of success equal to that player’s batting average for that season. The result of using
non-constant at-bats in the simulation was a decrease in the percentage of the baseball histories
to see a hitting streak of at least 56 games from 42% (Arbesman and Strogatz) to approximately
2.5%.
KEYWORDS: hitting streaks, simulations
∗The information used here was obtained free of charge and is copyrighted by Retrosheet.
Interested parties may contact Retrosheet at 20 Sunset Road, Newark, DE 19711 or at
www.retrosheet.org. Thanks to Cliff Blau for DiMaggio’s 1941 game-by-game data.
Rockoff and Yates: Chasing DiMaggio
1 Background
The summer of 1941 was one of the most compelling times in baseball history. That
year, Ted Williams of the Boston Red Sox posted a batting average of .406 with
37 home runs, 120 runs batted in, an on-base percentage of .553, and a slugging
average of .735. He would be the last man to hit over .400 in a season.
His league- and park-adjusted OPS (OPS+) was 235, the fourth-highest mark of the twentieth
century. This means that his OPS was 135% better than that of an average player in the
American League. All of these accomplishments earned Williams second place
in the American League balloting for Most Valuable Player. The 1941 American
League MVP award went to New York Yankee Joe DiMaggio. DiMaggio hit .357
with 30 home runs, 125 runs batted in, an on-base percentage of .440, a slugging
average of .643, and an OPS+ of 184. However, DiMaggio had something that
Williams did not: he hit safely in 56 straight games, a record that many say will
never be approached.
Due to the statistical anomaly that is DiMaggio’s hitting streak, there has been
much discussion and analysis about this feat. Gould (1989) wrote that “DiMaggio’s streak is the most extraordinary thing that ever happened in American sports.”
Short and Wasserman (1989) quantified the probability of a streak similar to DiMaggio’s with a conservative estimate of 3 at-bats per game. Berry (1991) compared
the probabilities of DiMaggio’s 56-game hitting streak and Williams’ batting .406
and found the two to be fairly comparable. Albright (1993) discussed ways of identifying wide-scale streakiness in terms of hitting over a four-year period. Warrack
(1995) calculated the probability of DiMaggio having a 56-game hitting streak. Albert and Williamson (2001) simulated from a Bayesian model to measure a player’s
streakiness for hitting streaks and for home run streaks. Freiman (2002) attempted
to find the probability of a 56-game hitting streak by treating these streaks as being
not independent. Albert (2008) used an exchangeable model to estimate hitting abilities of players, understand those players’ streakiness, and to identify some players
who are streakier than others. Arbesman and Strogatz (2008) ran simulations of
baseball seasons to estimate the probability of long hitting streaks, which resulted
in a 56 (or more) game hitting streak in 42% of simulated baseball histories. They
treated a player’s at-bats per game as constant across all games in a season.
This article will improve upon the results of Arbesman and Strogatz (2008).
When they treated a player’s at-bats per game as constant across all games in a
season, they greatly overestimated the probability of long streaks. Varying the at-bats per game will not only better mimic the natural flow of a player’s season, but
also will reduce the estimated probability of a long hitting streak.
Published by The Berkeley Electronic Press, 2009
Journal of Quantitative Analysis in Sports, Vol. 5 [2009], Iss. 2, Art. 4
2 Model
To illustrate how constant at-bats can overestimate the likelihood of long hitting
streaks, consider how Warrack (1995) estimates $p$, the probability of a player getting
at least one hit in a game. Let $p_i$ be the probability that a batter with
batting average $A$ comes to bat $i$ times during a game. Then $p$ can be estimated by
$$\hat{p} = \sum_i p_i \left(1 - (1 - A)^i\right).$$
The function $1 - (1 - A)^i$ is concave in $i$. Let $B = \sum_i i\,p_i$ be the average number of at-bats per game. One
can also estimate $p$ by
$$\hat{p} = 1 - (1 - A)^B.$$
The first estimate of $p$ is the expected value of
the probability of a player getting at least one hit. The second estimate of $p$ is the
probability of a player getting at least one hit over their average number of at-bats
per game. By Jensen’s Inequality, the second $\hat{p}$ is greater than or equal to the first
$\hat{p}$. This means that using constant at-bats will overestimate the likelihood of long
hitting streaks.
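
The comparison between the two estimates can be checked numerically. The following is a minimal Python sketch, not the authors' code (they worked in R); the at-bat distribution `AB_DIST` and the function names are hypothetical and chosen only for illustration.

```python
# Compare the two estimates of p (probability of at least one hit in a game).
# AB_DIST is a hypothetical at-bat distribution, not taken from the paper.

def p_hit_expected(avg, ab_dist):
    """First estimate: sum over i of p_i * (1 - (1 - A)^i)."""
    return sum(prob * (1 - (1 - avg) ** ab) for ab, prob in ab_dist.items())

def p_hit_at_mean(avg, ab_dist):
    """Second estimate: 1 - (1 - A)^B, with B the mean at-bats per game."""
    mean_ab = sum(ab * prob for ab, prob in ab_dist.items())
    return 1 - (1 - avg) ** mean_ab

AB_DIST = {2: 0.1, 3: 0.2, 4: 0.4, 5: 0.2, 6: 0.1}  # assumed; mean is 4 at-bats
first = p_hit_expected(0.300, AB_DIST)
second = p_hit_at_mean(0.300, AB_DIST)

print(f"variable at-bats estimate: {first:.4f}")
print(f"constant at-bats estimate: {second:.4f}")
# The gap widens once the per-game probability is raised to the 56th power.
print(f"56 straight games: {first ** 56:.2e} vs {second ** 56:.2e}")
```

Under these assumed inputs the constant at-bat estimate is the larger of the two, as Jensen's Inequality requires for a concave function.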
A brief example will be used to support this idea. Suppose a player’s batting
average is .300, and over two games, the player has 8 total at-bats. The probability
of getting a hit in both games is
$$\hat{p} = \left[1 - (1 - .300)^i\right]\left[1 - (1 - .300)^j\right],$$
where i is the number of at-bats in the first game, j is the number of at-bats in the
second game, and i + j = 8.
Table 1 summarizes the probabilities in this example. The highest probability
is the case where the player averaged 4 at-bats over the two games. Once these
at-bats are allowed to vary, the probability of getting at least one hit in both games
(a two-game hitting streak) decreases. Games with few at-bats hurt the player’s chances
of getting a hit in a game. Extended over long stretches, say a 162-game season,
this effect is compounded. Figure 1 further illustrates the need to vary at-bats when
analyzing hitting streaks. Over time, the average number of at-bats per game has
decreased.
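
The probabilities in the two-game example follow directly from the formula above. As a small Python sketch (the function name `p_hit_both_games` is hypothetical), every split of the 8 at-bats can be enumerated:

```python
# Probability that a .300 hitter gets a hit in both of two games,
# for every split of 8 total at-bats between the games.

def p_hit_both_games(avg, ab_game1, ab_game2):
    """Product of the per-game probabilities of at least one hit."""
    return (1 - (1 - avg) ** ab_game1) * (1 - (1 - avg) ** ab_game2)

rows = [(i, 8 - i, round(p_hit_both_games(0.300, i, 8 - i), 3))
        for i in range(1, 8)]

for ab1, ab2, prob in rows:
    print(f"{ab1}  {ab2}  {prob:.3f}")
```

The even 4 and 4 split gives the largest value, and the probability falls off symmetrically as the split becomes more lopsided.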
2.1 The Data
Game data were obtained from Retrosheet. For each season in their database, the
data include detailed information on every plate appearance in the
major leagues, including a unique game identifier, the batter, and whether the appearance
resulted in an at-bat. This allows one to determine the number of at-bats of each
player in each game for each season. Retrosheet has game-by-game data for all
of major league baseball from 1954 to 2007, as well as for the National League in
1911, 1921, 1922, and 1953.
http://www.bepress.com/jqas/vol5/iss2/4
DOI: 10.2202/1559-0410.1167

Game 1 AB   Game 2 AB   Probability of Hit in Both Games (p̂)
    1           7           0.275
    2           6           0.450
    3           5           0.547
    4           4           0.577
    5           3           0.547
    6           2           0.450
    7           1           0.275
Table 1: Probability of a Player with a .300 Batting Average Getting a Hit in Two
Games with Eight Total At-bats

Figure 1: Boxplots of At-bats per Game Over All Baseball Seasons
2.2 The Model
For hitter $i$ in season $j$, the batting average is denoted as $p_{ij}$. For example, the
2000 and 2001 versions of a player like Barry Bonds will be treated as two different
players. If the $i$th hitter played in $k$ games in season $j$, then
$$AB_{ij} = (AB_{ij1}, AB_{ij2}, \cdots, AB_{ijk})$$
are the numbers of at-bats for each game over the course of the season. Assuming
that at-bats over the course of a single game are independent of each other, the
number of hits player $i$ in season $j$ gets in game $k$, denoted $H_{ijk}$, has a binomial
distribution with $n = AB_{ijk}$ and $p = p_{ij}$. This can be written as
$$H_{ijk} \sim \mathrm{Bin}(AB_{ijk}, p_{ij}).$$
R version 2.7.2 (2008) was used to run the simulations. In these simulations,
each player in each season had their own empirical distribution of at-bats over the $k$ games played
in that season. These at-bats were sampled with replacement to create a “simulated”
season’s worth of at-bats. To use notation similar to Efron and Tibshirani (1993),
the “simulated” season’s worth of at-bats is
$$AB^{*}_{ij} = (AB^{*}_{ij1}, AB^{*}_{ij2}, \cdots, AB^{*}_{ijk}).$$
If $m$ seasons are simulated, then for player $i$ and season $j$,
$$AB^{*1}_{ij}, AB^{*2}_{ij}, \cdots, AB^{*m}_{ij}$$
represent that player’s “at-bats” in the simulations. The number of hits a player gets
in each game in the $m$th simulated season is
$$H^{*m}_{ij} \sim \mathrm{Bin}(AB^{*m}_{ij}, p_{ij}).$$
After randomly generating base hits from this binomial distribution, a hitting streak
was considered to be any run of consecutive games in $H^{*m}_{ij}$ with hit counts greater than zero. The simulations kept track of each player’s maximum hitting streak in any given simulated
season.
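
The procedure described above can be sketched in a few lines. This is an illustrative Python version rather than the authors' R code; the game log `AB_LOG`, the batting average, and all function names are assumptions made for the example.

```python
import random

def longest_streak(hits_per_game):
    """Length of the longest run of consecutive games with at least one hit."""
    best = current = 0
    for hits in hits_per_game:
        current = current + 1 if hits > 0 else 0
        best = max(best, current)
    return best

def simulate_season(ab_log, avg, rng):
    """Bootstrap the at-bats, then draw binomial hits for each simulated game."""
    resampled = rng.choices(ab_log, k=len(ab_log))  # sample with replacement
    # Each at-bat is an independent Bernoulli trial, so the sum is Bin(ab, avg).
    return [sum(rng.random() < avg for _ in range(ab)) for ab in resampled]

def max_streaks(ab_log, avg, n_seasons, seed=0):
    """Longest hitting streak in each of n_seasons simulated seasons."""
    rng = random.Random(seed)
    return [longest_streak(simulate_season(ab_log, avg, rng))
            for _ in range(n_seasons)]

# Hypothetical 150-game log of at-bats per game (not Retrosheet data).
AB_LOG = [4, 5, 3, 4, 4, 2, 5, 4, 1, 4] * 15
streaks = max_streaks(AB_LOG, avg=0.350, n_seasons=200)

print("max:", max(streaks), "median:", sorted(streaks)[len(streaks) // 2])
```

Fixing the seed makes a run repeatable, which is convenient when comparing modelling choices on the same player-season.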
                                 Simulated Season
Player            Year   40+   50+   56+   Min    Q1    Q2    Q3   Max
Felipe Alou       1966     4     3     2     9    14    16    21    75
Julio Franco      1991     3     1     1     8    14    16    20    74
Alex Rodriguez    1996    10     3     2     9    15    18    22    72
Rogers Hornsby    1921    21     3     1     6    14    17    22    71
Ichiro Suzuki     2004    34     8     5    11    18    22    27    69
Jimmy Rollins     2007     2     1     1     7    13    15    18    64
Rogers Hornsby    1922    23     4     2     7    15    18    23    63
Ralph Garr        1974    13     3     2     9    15    18    22    63
Kirby Puckett     1986     6     2     1     8    14    17    21    62
Ichiro Suzuki     2007    14     2     1    10    16    19    23    61
Rod Carew         1977    12     4     2     7    16    19    23    60
Bobby Murcer      1973     1     1     1     5    11    13    15    60
Luis Castillo     2002     2     1     1     7    12    14    17    58
Wade Boggs        1985    10     2     1     9    16    19    23    57
Larry Walker      1997     7     1     1     8    14    17    20    57
Doug Glanville    1999     4     1     1     7    13    16    20    57
Pete Rose         1975     1     1     1     8    13    15    18    57
Tim Raines        1984     1     1     1     7    11    14    16    57
Magglio Ordonez   2007    10     2     1     8    14    17    20    56
Nellie Fox        1955     1     1     1     7    12    15    18    56
Don Demeter       1962     1     1     1     6    10    12    14    56
Table 2: Hitting Streaks of At Least 56 Games in 1000 Simulated Baseball Seasons
                            Actual Season
Player            Year    AB    AVG    HR   RBI    OBP
Felipe Alou       1966   666   .327    31    74   .361
Julio Franco      1991   589   .341    15    78   .408
Alex Rodriguez    1996   601   .358    36   123   .414
Rogers Hornsby    1921   592   .397    21   126   .458
Ichiro Suzuki     2004   704   .372     8    60   .414
Jimmy Rollins     2007   716   .296    30    94   .344
Rogers Hornsby    1922   623   .401    42   152   .459
Ralph Garr        1974   606   .353    11    54   .383
Kirby Puckett     1986   680   .327    31    96   .366
Ichiro Suzuki     2007   678   .351     6    68   .396
Rod Carew         1977   616   .388    14   100   .449
Bobby Murcer      1973   616   .304    22    95   .357
Luis Castillo     2002   606   .305     2    39   .364
Wade Boggs        1985   653   .368     8    78   .450
Larry Walker      1997   568   .366    49   130   .452
Doug Glanville    1999   628   .325    11    73   .376
Pete Rose         1975   662   .317     7    74   .406
Tim Raines        1984   622   .309     8    60   .393
Magglio Ordonez   2007   595   .363    28   139   .434
Nellie Fox        1955   636   .311     6    59   .364
Don Demeter       1962   550   .307    29   107   .359
Longest Streak (only 19 of the 21 entries survive, listed in source order; row alignment is uncertain): 16, 15, 20, 21, 14, 33, 14, 16, 25, 15, 12, 35, 28, 16, 9, 14, 14, 15, 14
Table 3: Actual Statistics for the Players with Simulated Hitting Streaks of At Least
56 Games
Method        Max   40+   50+   56+
Constant AB    75    57     8     2
Variable AB    57    41     2     1
Table 4: 10,000 Simulations of DiMaggio’s 1941 Season
3 Simulation & Results
Using the data obtained from Retrosheet’s game logs, 1000 baseball “histories”
were run in R. Technically, these are half-histories because very little of the data
is for games prior to 1953. Since we have 58 seasons’ worth of data, and each
season was simulated 1000 times, there are 58,000 simulated seasons. DiMaggio’s
record 56-game hitting streak was matched or exceeded in 30 of them, or 0.0517%.
Table 2 lists the players who achieved this feat in the simulations; Table 3 lists those
players’ actual statistics.
Felipe Alou had the longest streak, 75 games in the simulated 1966 season.
Ichiro Suzuki is the player appearing most often among these record streaks. His
simulated 2004 seasons included 5 record-breakers, meaning that he had a 1-in-200 chance of breaking the record that year. He also broke the record once in the
simulated 2007 season.
Twenty-five of the simulated half-histories, or 2.5%, contained at least one
streak of at least 56 games, meaning that there was a 2.5% chance that there would actually
have been such a streak in the 58 seasons studied.
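
The rates quoted in this section are simple ratios of the reported counts. As a quick arithmetic check in Python (the constant names are hypothetical; the counts are those reported in the text), 30 streak seasons out of 58,000 works out to roughly 0.05% per simulated season:

```python
# Reported counts from the simulations described in the text.
N_HISTORIES = 1000        # simulated half-histories
N_SEASONS = 58            # seasons of game-by-game data per half-history
STREAK_SEASONS = 30       # simulated seasons with a streak of 56+ games
STREAK_HISTORIES = 25     # half-histories with at least one such streak
ICHIRO_2004_STREAKS = 5   # 56+ streaks in Ichiro Suzuki's simulated 2004 seasons

per_season_pct = 100 * STREAK_SEASONS / (N_HISTORIES * N_SEASONS)
per_history_pct = 100 * STREAK_HISTORIES / N_HISTORIES
ichiro_one_in = N_HISTORIES / ICHIRO_2004_STREAKS

print(f"per simulated season:  {per_season_pct:.4f}%")
print(f"per half-history:      {per_history_pct:.1f}%")
print(f"Ichiro 2004:           1 in {ichiro_one_in:.0f}")
```

The per-history rate is lower than 30 in 1000 because several half-histories contained more than one qualifying streak.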
As an explicit comparison of the variable at-bat model with the constant at-bat model, 10,000 simulations of Joe DiMaggio’s 1941 season were run for each
model. Table 4 shows for each model the maximum streak, the number of seasons
with a streak of at least 40 games, the number of seasons with a streak of at least 50
games, and the number of seasons with a streak of at least 56 games. The constant
at-bat model resulted in greater numbers all around, confirming that long streaks
are rarer under the variable at-bat model.
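
A comparison of this kind can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code: the game log `AB_LOG` is hypothetical (DiMaggio's actual 1941 log is not reproduced here), the constant model is assumed to use the season's mean at-bats in every game, and each game is reduced to a single Bernoulli draw with probability 1 - (1 - A)^n of at least one hit.

```python
import random

def longest_run(successes):
    """Longest run of consecutive True values."""
    best = current = 0
    for hit_game in successes:
        current = current + 1 if hit_game else 0
        best = max(best, current)
    return best

def compare_models(ab_log, avg, n_sims, seed=0):
    """Longest streaks under constant and bootstrapped (variable) at-bats."""
    rng = random.Random(seed)
    n_games = len(ab_log)
    mean_ab = sum(ab_log) / n_games
    p_const = 1 - (1 - avg) ** mean_ab  # same hit probability in every game
    constant, variable = [], []
    for _ in range(n_sims):
        constant.append(longest_run(rng.random() < p_const
                                    for _ in range(n_games)))
        resampled = rng.choices(ab_log, k=n_games)  # sample with replacement
        variable.append(longest_run(rng.random() < 1 - (1 - avg) ** ab
                                    for ab in resampled))
    return constant, variable

# Hypothetical log averaging 4 at-bats per game, with .300 assumed as the average.
AB_LOG = [1, 2, 3, 4, 4, 4, 5, 5, 6, 6] * 15
constant, variable = compare_models(AB_LOG, avg=0.300, n_sims=1000)

print("mean longest streak, constant AB:", sum(constant) / len(constant))
print("mean longest streak, variable AB:", sum(variable) / len(variable))
```

With a spread-out at-bat log like this one, the constant model should tend to produce longer streaks on average, in line with the direction of the results in Table 4.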
4 Summary & Conclusions
Ideally, simulations would be run on the entire history of baseball. However, for the
time being, Retrosheet game-by-game data goes back essentially to 1953. Thus the
results presented here are not directly comparable to those obtained by Arbesman
and Strogatz, who frame their simulations mainly in terms of baseball histories.
It may be possible to use known (1953 and after) game-by-game at-bat distributions to model unknown (before 1953) at-bat distributions based solely on average at-bats per game. It was hoped that players’ game-by-game at-bats would follow a “nice” distribution, such as Poisson with mean parameter equal to a player’s
average at-bats per game during the season, represented in statistical terms as
$$AB_{ijk} \sim \mathrm{Poi}\left(\frac{AB_{ij}}{G_{ij}}\right),$$
where $AB_{ij}/G_{ij}$ is the player’s season at-bats divided by games played.
Thus far, all such models examined have proven to be poor fits to the actual data.
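
One simple way to examine such a fit is to set the empirical frequencies of at-bats per game beside the Poisson probabilities with the same mean. The Python sketch below uses a hypothetical game log (`AB_LOG`) rather than Retrosheet data, and the function names are assumptions; it is meant only to show the kind of check involved, not the checks the authors ran.

```python
import math
from collections import Counter

def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k events."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def compare_to_poisson(ab_log):
    """Return the mean, variance, and (k, empirical, Poisson) rows for a log."""
    n = len(ab_log)
    mean = sum(ab_log) / n
    variance = sum((ab - mean) ** 2 for ab in ab_log) / n
    counts = Counter(ab_log)
    rows = [(k, counts.get(k, 0) / n, poisson_pmf(k, mean))
            for k in range(max(ab_log) + 1)]
    return mean, variance, rows

# Hypothetical log: 30 games with 3 at-bats, 80 with 4, 40 with 5.
AB_LOG = [3] * 30 + [4] * 80 + [5] * 40
mean, variance, rows = compare_to_poisson(AB_LOG)

print(f"mean {mean:.3f}, variance {variance:.3f}")
for k, empirical, poisson in rows:
    print(f"{k} at-bats: empirical {empirical:.3f}, Poisson {poisson:.3f}")
```

A Poisson variable has variance equal to its mean, so a log that clusters tightly around four at-bats, like this assumed one, is far less dispersed than the Poisson model implies. Whether that is the reason for the poor fits reported here is not stated in the text.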
A somewhat more complex method to model a player’s unknown at-bat distribution would be to use a player with a known at-bat distribution and the same season
average at-bats per game. As a simplistic example, if players with 4.0 at-bats per
game in a season tend to have 3 at-bats in one-third of their games, 4 at-bats in
one-third of their games, and 5 at-bats in the remaining one-third of their games,
that same distribution could be applied to pre-1953 players who had 4.0 at-bats per
game.
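
The simplistic example above can be written as a lookup from average at-bats per game to a donor distribution. In this Python sketch the table `DONOR_DISTRIBUTIONS` and the function `sample_at_bats` are hypothetical names, and the single entry simply mirrors the one-third, one-third, one-third example in the text.

```python
import random

# Hypothetical donor table: average at-bats per game -> {at-bats: probability}.
DONOR_DISTRIBUTIONS = {
    4.0: {3: 1 / 3, 4: 1 / 3, 5: 1 / 3},
}

def sample_at_bats(avg_ab_per_game, n_games, rng):
    """Draw a season of per-game at-bats from the matching donor distribution."""
    dist = DONOR_DISTRIBUTIONS[avg_ab_per_game]
    values = list(dist)
    weights = list(dist.values())
    return rng.choices(values, weights=weights, k=n_games)

rng = random.Random(1941)
season = sample_at_bats(4.0, n_games=154, rng=rng)
print(season[:10], "mean:", sum(season) / len(season))
```

In practice the table would need an entry, or some interpolation rule, for every average observed among pre-1953 players, which this sketch does not attempt.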
Another way in which the simulations may be modified is by treating a player’s
batting average as a random variable that changes from game to game or at-bat
to at-bat, rather than remaining constant throughout the season. Some promising
candidates are the beta and the normal distributions, with mean equal to the player’s
actual season batting average.
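
As an illustration of the beta option, a per-game batting average can be drawn from a beta distribution whose mean is the season average. In the Python sketch below the `concentration` parameter (which controls how tightly the draws cluster around the mean) and the function names are assumptions; the text does not specify how the spread would be chosen.

```python
import random

def draw_game_average(season_avg, concentration, rng):
    """Draw a per-game batting average from a beta with mean season_avg."""
    alpha = season_avg * concentration
    beta = (1 - season_avg) * concentration
    return rng.betavariate(alpha, beta)  # mean is alpha / (alpha + beta)

def simulate_game_hits(at_bats, season_avg, concentration, rng):
    """Hits in one game when the game's batting average is itself random."""
    game_avg = draw_game_average(season_avg, concentration, rng)
    return sum(rng.random() < game_avg for _ in range(at_bats))

rng = random.Random(1941)
# .357 is DiMaggio's 1941 average; concentration=50 is an arbitrary choice.
draws = [draw_game_average(0.357, 50, rng) for _ in range(5000)]
mean_draw = sum(draws) / len(draws)

print(f"mean of simulated game averages: {mean_draw:.3f}")
print("hits in a 4 at-bat game:", simulate_game_hits(4, 0.357, 50, rng))
```

A beta draw stays between 0 and 1 automatically, whereas a normal draw would need truncation to remain a valid probability.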
Lastly, some limitations of the research discussed in this article ought to be
taken into account. For one, these simulations did not take into account certain
real-life baseball decisions that may factor into a player’s at-bats during a hitting
streak. For instance, a player in the midst of a lengthy streak is unlikely to be
pulled by the manager in the middle of the game if he has not yet gotten a hit in the
game. Similarly, such a player may be less likely to “settle” for a base-on-balls if a
streak is on the line. In the simulations, if a player has a sizeable streak going and
is randomly assigned a two at-bat game, that’s his tough simulated luck.
Furthermore, since the simulations only capture each player’s longest streak in
each simulated season, they do not account for the remote possibility of multiple
long streaks by a player in a given simulated season. Similarly, the results do not
capture multi-season streaks such as the one Jimmy Rollins had (in real life) at the
end of 2005 and the beginning of 2006.
References
Albert, J. (2008). Streaky Hitting in Baseball. Journal of Quantitative Analysis in
Sports, 4(1), Article 3.
Albert, J., & Williamson, P. (2001). Using Model/Data Simulations to Detect
Streakiness. The American Statistician, 55(1), 41–50.
Albright, S. C. (1993). A Statistical Analysis of Hitting Streaks in Baseball. Journal
of the American Statistical Association, 88(424), 1175–1183.
Arbesman, S., & Strogatz, S. (2008 March 30). A Journey to Baseball’s Alternate Universe. The New York Times. (Retrieved: 2008 March 31;
http://www.nytimes.com/2008/03/30/opinion/30strogatz.html)
Berry, S. (1991). The Summer of ’41: A Probability Analysis of DiMaggio’s Streak
and Williams’ Average of .406. Chance, 4(4), 8–11.
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton:
Chapman & Hall/CRC.
Freiman, M. (2002). 56-Game Hitting Streaks Revisited. The Baseball Research
Journal, 31, 11–15.
Gould, S. J. (1989). The Streak of Streaks. Chance, 2(2), 10–16.
R Development Core Team. (2008). R: A Language and Environment for Statistical Computing. Vienna, Austria. (ISBN 3-900051-07-0; http://www.R-project.org)
Short, T., & Wasserman, L. (1989). Should We Be Surprised by the Streak of
Streaks? Chance, 2(2), 13.
Warrack, G. (1995). The Great Streak. Chance, 8(3), 41–43, 60.