Journal of Quantitative Analysis in Sports
Volume 5, Issue 2, 2009, Article 4
2008 Northern California Symposium on Statistics and Operations Research in Sports

Chasing DiMaggio: Streaks in Simulated Seasons Using Non-Constant At-Bats∗

David M. Rockoff, Iowa State University
Philip A. Yates, California State Polytechnic University - Pomona

Copyright © 2009 The Berkeley Electronic Press. All rights reserved.

Abstract

On March 30, 2008, Samuel Arbesman and Steven Strogatz had their article “A Journey to Baseball’s Alternate Universe” published in The New York Times. They simulated baseball’s entire history 10,000 times to ask how likely it was for anyone in baseball history to achieve a streak that is at least as long as Joe DiMaggio’s hitting streak of 56 in 1941. Arbesman and Strogatz treated a player’s at-bats per game as a constant across all games in a season, which greatly overestimates the probability of long streaks. The simulations in this paper treated at-bats in a game as a random variable. For each player in each season, the number of at-bats for each simulated game was bootstrapped. The number of hits for player i in season j in game k is a binomial random variable with the number of trials being equal to the number of at-bats the player gets in game k and the probability of success being equal to that player’s batting average for that season. The result of using non-constant at-bats in the simulation was a decrease in the percentage of the baseball histories to see a hitting streak of at least 56 games from 42% (Arbesman and Strogatz) to approximately 2.5%.

KEYWORDS: hitting streaks, simulations

∗ The information used here was obtained free of charge and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Road, Newark, DE 19711 or at www.retrosheet.org.
Thanks to Cliff Blau for DiMaggio’s 1941 game-by-game data.

1 Background

The summer of 1941 was one of the most compelling times in baseball history. That year, Ted Williams of the Boston Red Sox posted a batting average of .406 with 37 home runs, 120 runs batted in, an on-base percentage of .553, and a slugging average of .735. He would be the last man to hit over .400 in a season. His league- and park-adjusted OPS (OPS+) was 235, the fourth highest mark of the twentieth century. This means that his OPS was 135% better than that of an average player in the American League. All of these accomplishments earned Williams second place in the American League balloting for Most Valuable Player.

The 1941 American League MVP award went to New York Yankee Joe DiMaggio. DiMaggio hit .357 with 30 home runs, 125 runs batted in, an on-base percentage of .440, a slugging average of .643, and an OPS+ of 184. However, DiMaggio had something that Williams did not: he hit safely in 56 straight games, a record that many say will never be approached.

Because DiMaggio’s hitting streak is such a statistical anomaly, there has been much discussion and analysis of this feat. Gould (1989) wrote that “DiMaggio’s streak is the most extraordinary thing that ever happened in American sports.” Short and Wasserman (1989) quantified the probability of a streak similar to DiMaggio’s with a conservative estimate of 3 at-bats per game. Berry (1991) compared the probabilities of DiMaggio’s 56-game hitting streak and Williams’ batting .406 and found the two to be fairly comparable. Albright (1993) discussed ways of identifying wide-scale streakiness in hitting over a four-year period. Warrack (1995) calculated the probability of DiMaggio having a 56-game hitting streak. Albert and Williamson (2001) simulated from a Bayesian model to measure a player’s streakiness for hitting streaks and for home run streaks.
Freiman (2002) attempted to find the probability of a 56-game hitting streak by treating these streaks as not independent. Albert (2008) used an exchangeable model to estimate the hitting abilities of players, to understand those players’ streakiness, and to identify players who are streakier than others. Arbesman and Strogatz (2008) ran simulations of baseball seasons to estimate the probability of long hitting streaks; a hitting streak of 56 or more games occurred in 42% of their simulated baseball histories. They treated a player’s at-bats per game as constant across all games in a season.

This article improves upon the results of Arbesman and Strogatz (2008). By treating a player’s at-bats per game as constant across all games in a season, they greatly overestimated the probability of long streaks. Varying the at-bats per game not only better mimics the natural flow of a player’s season, but also reduces the estimated probability of a long hitting streak.

2 Model

To see how constant at-bats can overestimate the likelihood of long hitting streaks, consider how Warrack (1995) estimates the probability $p$ of a player getting at least one hit in a game. Let $p_i$ be the probability that a batter with batting average $A$ comes to bat $i$ times during a game. Then $p$ can be estimated by

$$\hat{p} = \sum_i p_i \left(1 - (1 - A)^i\right).$$

The function $1 - (1 - A)^i$ is concave in $i$. Let $B = \sum_i i\,p_i$ be the average number of at-bats per game. One can also estimate $p$ by

$$\hat{p} = 1 - (1 - A)^B.$$

The first estimate of $p$ is the expected value of the probability of a player getting at least one hit. The second estimate of $p$ is the probability of a player getting at least one hit over their average number of at-bats per game. By Jensen’s Inequality, the second $\hat{p}$ is greater than or equal to the first $\hat{p}$.
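As a quick numerical check of the Jensen's Inequality argument, the following minimal Python sketch computes both estimates for one at-bat distribution. The distribution `at_bat_probs` and the helper names are illustrative assumptions; they are not taken from Warrack (1995) or from the data used in this article.

```python
# Hypothetical per-game at-bat distribution: P(i at-bats) for i = 2..5.
# These probabilities are made up for illustration only.
at_bat_probs = {2: 0.05, 3: 0.25, 4: 0.50, 5: 0.20}
A = 0.300  # batting average


def hit_prob(avg, at_bats):
    """Probability of at least one hit in `at_bats` independent at-bats."""
    return 1.0 - (1.0 - avg) ** at_bats


# First estimate: expected per-game hit probability over the at-bat distribution.
p_variable = sum(prob * hit_prob(A, i) for i, prob in at_bat_probs.items())

# Second estimate: hit probability evaluated at the mean number of at-bats B.
B = sum(i * prob for i, prob in at_bat_probs.items())
p_constant = hit_prob(A, B)

print(f"B = {B:.2f}")
print(f"variable at-bats estimate: {p_variable:.4f}")
print(f"constant at-bats estimate: {p_constant:.4f}")
```

For any at-bat distribution that is not concentrated on a single value, the constant at-bat estimate comes out strictly larger than the variable at-bat estimate.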
Consequently, using constant at-bats will overestimate the likelihood of long hitting streaks.

A brief example supports this idea. Suppose a player’s batting average is .300 and, over two games, the player has 8 total at-bats. The probability of getting a hit in both games is

$$\hat{p} = \left[1 - (1 - .300)^i\right]\left[1 - (1 - .300)^j\right],$$

where $i$ is the number of at-bats in the first game, $j$ is the number of at-bats in the second game, and $i + j = 8$. Table 1 summarizes the probabilities in this example. The highest probability occurs when the player has 4 at-bats in each of the two games. Once the at-bats are allowed to vary, the probability of getting at least one hit in both games (a two-game hitting streak) decreases. Low at-bat games hurt the player’s chances of getting a hit in a game. Extended over long stretches, say a 162-game season, this effect is compounded.

Probability of Hit in Both Games
Game 1 AB   Game 2 AB   p̂
    1           7       0.275
    2           6       0.450
    3           5       0.547
    4           4       0.577
    5           3       0.547
    6           2       0.450
    7           1       0.275
Table 1: Probability of a Player with a .300 Batting Average Getting a Hit in Two Games with Eight Total At-Bats

Figure 1 further illustrates the need to vary at-bats when analyzing hitting streaks. Over time, the average number of at-bats per game has decreased.

Figure 1: Boxplots of At-bats per Game Over All Baseball Seasons

http://www.bepress.com/jqas/vol5/iss2/4 DOI: 10.2202/1559-0410.1167

2.1 The Data

Game data was obtained from Retrosheet. For each season in their database, this data includes extensive information on every plate appearance in the major leagues, including a unique game identifier, the batter, and whether the appearance resulted in an at-bat. This allows one to determine the number of at-bats of each player in each game for each season.
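The tallying step just described can be sketched as follows. This is a minimal Python illustration, assuming each plate appearance has already been reduced to a (game identifier, batter, at-bat flag) record; the record layout, identifiers, and sample values are hypothetical and do not reflect Retrosheet's actual file format.

```python
from collections import Counter

# Hypothetical plate-appearance records: (game_id, batter_id, is_at_bat).
# A walk, for example, is a plate appearance that is not an at-bat.
plate_appearances = [
    ("G001", "batter_a", True),
    ("G001", "batter_a", False),  # e.g. a walk
    ("G001", "batter_a", True),
    ("G001", "batter_b", True),
    ("G002", "batter_a", True),
    ("G002", "batter_a", True),
    ("G002", "batter_a", True),
    ("G002", "batter_b", False),
]


def at_bats_per_game(records):
    """Count at-bats for every (batter, game) pair, keeping 0-at-bat games."""
    counts = Counter()
    for game_id, batter_id, is_at_bat in records:
        # Adding 0 still creates the key, so games without an at-bat are kept.
        counts[(batter_id, game_id)] += int(is_at_bat)
    return counts


counts = at_bats_per_game(plate_appearances)
for (batter_id, game_id), n in sorted(counts.items()):
    print(batter_id, game_id, n)
```

Grouping the resulting counts by batter and season gives the per-game at-bat list for each player-season.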
Retrosheet has game-by-game data for all of major league baseball from 1954 to 2007, as well as for the National League in 1911, 1921, 1922, and 1953.

2.2 The Model

For hitter $i$ in season $j$, the batting average is denoted $p_{ij}$. Each player-season is treated separately; for example, the 2000 and 2001 versions of a player like Barry Bonds are treated as two different players. If the $i$th hitter played in $k$ games in season $j$, then $\mathbf{AB}_{ij} = (AB_{ij1}, AB_{ij2}, \ldots, AB_{ijk})$ are the numbers of at-bats in each game over the course of the season. Assuming that at-bats over the course of a single game are independent of each other, the number of hits player $i$ gets in game $k$ of season $j$, denoted $H_{ijk}$, has a binomial distribution with $n = AB_{ijk}$ and $p = p_{ij}$. This can be written as

$$H_{ijk} \sim \mathrm{Bin}(AB_{ijk}, p_{ij}).$$

R version 2.7.2 (2008) was used to run the simulations. In these simulations, each player in each season had their own empirical distribution of at-bats over the $k$ games played in that season. These at-bats were sampled with replacement to create a “simulated” season’s worth of at-bats. Using notation similar to Efron and Tibshirani (1993), the “simulated” season’s worth of at-bats is $\mathbf{AB}^{*}_{ij} = (AB^{*}_{ij1}, AB^{*}_{ij2}, \ldots, AB^{*}_{ijk})$. If $m$ seasons are simulated, then for player $i$ and season $j$, $\mathbf{AB}^{*1}_{ij}, \mathbf{AB}^{*2}_{ij}, \ldots, \mathbf{AB}^{*m}_{ij}$ represent that player’s “at-bats” in the simulations. The number of hits a player gets in each game in the $m$th simulated season is

$$\mathbf{H}^{*m}_{ij} \sim \mathrm{Bin}(\mathbf{AB}^{*m}_{ij}, p_{ij}).$$

After randomly generating base hits from this binomial distribution, a hitting streak was considered to be any run of consecutive entries of $\mathbf{H}^{*m}_{ij}$ that are greater than zero. The simulations kept track of each player’s maximum hitting streak in any given simulated season.
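The procedure described above can be sketched in a few lines. The original simulations were run in R; the version below is a minimal Python illustration under stated assumptions, not the authors' code. The at-bat log `season_at_bats`, the .330 batting average, the seed, and all function names are hypothetical.

```python
import random


def simulate_season(at_bats, avg, rng):
    """Bootstrap one season of at-bats and draw binomial hits for each game."""
    # Resample the observed per-game at-bats with replacement (same season length).
    boot_at_bats = rng.choices(at_bats, k=len(at_bats))
    # Hits in a game ~ Binomial(at-bats, avg), drawn as a sum of Bernoulli trials.
    return [sum(rng.random() < avg for _ in range(ab)) for ab in boot_at_bats]


def longest_streak(hits):
    """Length of the longest run of consecutive games with at least one hit."""
    best = current = 0
    for h in hits:
        current = current + 1 if h > 0 else 0
        best = max(best, current)
    return best


def max_streaks(at_bats, avg, n_seasons, seed=0):
    """Longest hitting streak in each of `n_seasons` simulated seasons."""
    rng = random.Random(seed)
    return [longest_streak(simulate_season(at_bats, avg, rng))
            for _ in range(n_seasons)]


# Hypothetical 150-game at-bat log for a .330 hitter (not real data).
season_at_bats = [4] * 80 + [3] * 35 + [5] * 25 + [2] * 7 + [1] * 3
streaks = max_streaks(season_at_bats, avg=0.330, n_seasons=1000)
print("longest simulated streak:", max(streaks))
print("seasons with a 40+ game streak:", sum(s >= 40 for s in streaks))
```

Running `max_streaks` once per player-season and recording the maximum, the quartiles, and the counts of streaks of at least 40, 50, and 56 games gives summaries of the kind reported in the results.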
                          Simulated Season
Player             Year   Min   Q1   Q2   Q3   Max   40+   50+   56+
Felipe Alou        1966     9   14   16   21    75     4     3     2
Julio Franco       1991     8   14   16   20    74     3     1     1
Alex Rodriguez     1996     9   15   18   22    72    10     3     2
Rogers Hornsby     1921     6   14   17   22    71    21     3     1
Ichiro Suzuki      2004    11   18   22   27    69    34     8     5
Jimmy Rollins      2007     7   13   15   18    64     2     1     1
Rogers Hornsby     1922     7   15   18   23    63    23     4     2
Ralph Garr         1974     9   15   18   22    63    13     3     2
Kirby Puckett      1986     8   14   17   21    62     6     2     1
Ichiro Suzuki      2007    10   16   19   23    61    14     2     1
Rod Carew          1977     7   16   19   23    60    12     4     2
Bobby Murcer       1973     5   11   13   15    60     1     1     1
Luis Castillo      2002     7   12   14   17    58     2     1     1
Wade Boggs         1985     9   16   19   23    57    10     2     1
Larry Walker       1997     8   14   17   20    57     7     1     1
Doug Glanville     1999     7   13   16   20    57     4     1     1
Pete Rose          1975     8   13   15   18    57     1     1     1
Tim Raines         1984     7   11   14   16    57     1     1     1
Magglio Ordonez    2007     8   14   17   20    56    10     2     1
Nellie Fox         1955     7   12   15   18    56     1     1     1
Don Demeter        1962     6   10   12   14    56     1     1     1
Table 2: Hitting Streaks of At Least 56 Games in 1000 Simulated Baseball Seasons

                          Actual Season
Player             Year    AB    AVG    HR   RBI    OBP
Felipe Alou        1966   666   .327    31    74   .361
Julio Franco       1991   589   .341    15    78   .408
Alex Rodriguez     1996   601   .358    36   123   .414
Rogers Hornsby     1921   592   .397    21   126   .458
Ichiro Suzuki      2004   704   .372     8    60   .414
Jimmy Rollins      2007   716   .296    30    94   .344
Rogers Hornsby     1922   623   .401    42   152   .459
Ralph Garr         1974   606   .353    11    54   .383
Kirby Puckett      1986   680   .327    31    96   .366
Ichiro Suzuki      2007   678   .351     6    68   .396
Rod Carew          1977   616   .388    14   100   .449
Bobby Murcer       1973   616   .304    22    95   .357
Luis Castillo      2002   606   .305     2    39   .364
Wade Boggs         1985   653   .368     8    78   .450
Larry Walker       1997   568   .366    49   130   .452
Doug Glanville     1999   628   .325    11    73   .376
Pete Rose          1975   662   .317     7    74   .406
Tim Raines         1984   622   .309     8    60   .393
Magglio Ordonez    2007   595   .363    28   139   .434
Nellie Fox         1955   636   .311     6    59   .364
Don Demeter        1962   550   .307    29   107   .359
Longest Streak (only 19 of the 21 entries are present, so they are listed in their original order rather than aligned to rows): 16, 15, 20, 21, 14, 33, 14, 16, 25, 15, 12, 35, 28, 16, 9, 14, 14, 15, 14
Table 3: Actual Statistics for
the Players with Simulated Hitting Streaks of At Least 56 Games

3 Simulation & Results

Using the data obtained from Retrosheet’s game logs, 1000 baseball “histories” were run in R. Technically, these are half-histories because very little of the data is for games prior to 1953. Since we have 58 seasons’ worth of data, and each season was simulated 1000 times, there are 58,000 simulated seasons. DiMaggio’s record 56-game hitting streak was matched or exceeded in 30 of them, or 0.0517%. Table 2 lists the players who achieved this feat in the simulations; Table 3 lists those players’ actual statistics.

Felipe Alou had the longest streak, 75 games in a simulated 1966 season. Ichiro Suzuki is the player appearing most often among these record streaks. His 1000 simulated 2004 seasons included 5 streaks of at least 56 games, meaning that he had a 1-in-200 chance of matching or breaking the record that year. He also did so once in the simulated 2007 seasons.

Twenty-five of the 1000 simulated half-histories, or 2.5%, contained at least one streak of at least 56 games, meaning that there was a 2.5% chance that such a streak would actually have occurred in these 58 seasons.

As an explicit comparison of the variable at-bat model with the constant at-bat model, 10,000 simulations of Joe DiMaggio’s 1941 season were run for each model. Table 4 shows for each model the maximum streak, the number of seasons with a streak of at least 40 games, the number of seasons with a streak of at least 50 games, and the number of seasons with a streak of at least 56 games. The constant at-bat model resulted in greater numbers all around, confirming that long streaks are rarer under the variable at-bat model.

Method         Max   40+   50+   56+
Constant AB     75    57     8     2
Variable AB     57    41     2     1
Table 4: 10,000 Simulations of DiMaggio’s 1941 Season
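Because bootstrapped at-bats are drawn independently for each game, "at least one hit" is an independent trial with the same probability in every simulated game, so the chance of a long streak in a single season can also be computed exactly instead of simulated. The Python sketch below does this for both models. It is an illustrative cross-check, not the method used in this article: the at-bat mix `at_bat_mix` is hypothetical, the 139-game season length is an assumed value, and the constant model here uses the unrounded mean at-bats, which is an assumption about how the constant would be chosen. Only the .357 average comes from the text.

```python
def hit_prob(avg, at_bats):
    """Probability of at least one hit in `at_bats` independent at-bats."""
    return 1.0 - (1.0 - avg) ** at_bats


def prob_streak(p, n_games, length):
    """Exact P(some run of >= `length` hit games in `n_games` iid games)."""
    # state[r] = probability the current run is r games and no long streak yet.
    state = [1.0] + [0.0] * (length - 1)
    reached = 0.0
    for _ in range(n_games):
        new_state = [0.0] * length
        for run, mass in enumerate(state):
            new_state[0] += mass * (1.0 - p)      # hitless game resets the run
            if run + 1 == length:
                reached += mass * p               # target streak length reached
            else:
                new_state[run + 1] += mass * p    # run grows by one game
        state = new_state
    return reached


A = 0.357        # DiMaggio's 1941 batting average (from the text)
N_GAMES = 139    # assumed season length for this illustration
# Hypothetical share of games with 2, 3, 4, and 5 at-bats (not real data).
at_bat_mix = {2: 0.05, 3: 0.25, 4: 0.50, 5: 0.20}

mean_at_bats = sum(ab * share for ab, share in at_bat_mix.items())
p_constant = hit_prob(A, mean_at_bats)
p_variable = sum(share * hit_prob(A, ab) for ab, share in at_bat_mix.items())

for label, p in (("constant", p_constant), ("variable", p_variable)):
    print(f"{label} at-bats: P(hit in a game) = {p:.4f}, "
          f"P(56+ streak) = {prob_streak(p, N_GAMES, 56):.2e}")
```

A small gap in the per-game hit probability is raised to roughly the 56th power in the streak probability, which is why the two models can differ substantially for long streaks.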
4 Summary & Conclusions

Ideally, simulations would be run on the entire history of baseball. However, for the time being, Retrosheet game-by-game data goes back essentially to 1953. Thus the results presented here are not directly comparable to those obtained by Arbesman and Strogatz, who frame their simulations mainly in terms of baseball histories.

It may be possible to use known (1953 and after) game-by-game at-bat distributions to model unknown (before 1953) at-bat distributions based solely on average at-bats per game. It was hoped that players’ game-by-game at-bats would follow a “nice” distribution, such as a Poisson with mean parameter equal to a player’s average at-bats per game during the season, represented in statistical terms as

$$AB_{ijk} \sim \mathrm{Poi}\!\left(\frac{AB_{ij}}{G_{ij}}\right),$$

where $AB_{ij}$ is the player’s total at-bats and $G_{ij}$ is the number of games played in the season. Thus far, all such models examined have proven to be poor fits to the actual data.

A somewhat more complex method to model a player’s unknown at-bat distribution would be to borrow the distribution of a player with a known at-bat distribution and the same season average at-bats per game. As a simplistic example, if players with 4.0 at-bats per game in a season tend to have 3 at-bats in one-third of their games, 4 at-bats in one-third of their games, and 5 at-bats in the remaining one-third of their games, that same distribution could be applied to pre-1953 players who had 4.0 at-bats per game.

Another way in which the simulations may be modified is by treating a player’s batting average as a random variable that changes from game to game or at-bat to at-bat, rather than remaining constant throughout the season. Some promising candidates are the beta and the normal distributions, with mean equal to the player’s actual season batting average.

Lastly, some limitations of the research discussed in this article ought to be taken into account.
For one, these simulations did not take into account certain real-life baseball decisions that may factor into a player’s at-bats during a hitting streak. For instance, a player in the midst of a lengthy streak is unlikely to be pulled by the manager in the middle of the game if he has not yet gotten a hit in the game. Similarly, such a player may be less likely to “settle” for a base-on-balls if a streak is on the line. In the simulations, if a player has a sizeable streak going and is randomly assigned a two-at-bat game, that’s his tough simulated luck.

Furthermore, since the simulations only capture each player’s longest streak in each simulated season, they do not account for the remote possibility of multiple long streaks by a player in a given simulated season. Similarly, the results do not capture multi-season streaks such as the one Jimmy Rollins had (in real life) at the end of 2005 and the beginning of 2006.

References

Albert, J. (2008). Streaky Hitting in Baseball. Journal of Quantitative Analysis in Sports, 4(1), Article 3.
Albert, J., & Williamson, P. (2001). Using Model/Data Simulations to Detect Streakiness. The American Statistician, 55(1), 41–50.
Albright, S. C. (1993). A Statistical Analysis of Hitting Streaks in Baseball. Journal of the American Statistical Association, 88(424), 1175–1183.
Arbesman, S., & Strogatz, S. (2008, March 30). A Journey to Baseball’s Alternate Universe. The New York Times. (Retrieved: 2008 March 31; http://www.nytimes.com/2008/03/30/opinion/30strogatz.html)
Berry, S. (1991). The Summer of ’41: A Probability Analysis of DiMaggio’s Streak and Williams’ Average of .406. Chance, 4(4), 8–11.
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton: Chapman & Hall/CRC.
Freiman, M. (2002). 56-Game Hitting Streaks Revisited. The Baseball Research Journal, 31, 11–15.
Gould, S. J.
(1989). The Streak of Streaks. Chance, 2(2), 10–16.
R Development Core Team. (2008). R: A Language and Environment for Statistical Computing. Vienna, Austria. (ISBN 3-900051-07-0; http://www.R-project.org)
Short, T., & Wasserman, L. (1989). Should We Be Surprised by the Streak of Streaks? Chance, 2(2), 13.
Warrack, G. (1995). The Great Streak. Chance, 8(3), 41–43, 60.