
Joe DiMaggio Done It Again … and Again and Again and Again?

Authors: David Rockoff and Philip Yates
CHANCE, VOL. 24, NO. 1, 2011
One of the most-celebrated feats
in the history of American
sports is baseball player Joe
DiMaggio’s 56-game hitting streak dur-
ing the 1941 season. In major league
baseball, a 30% success rate by a hitter
is considered good, and a batter will
typically have three to five attempts in a
game; Joltin’ Joe hit safely in 56 consecu-
tive games, a record that has rarely been
approached since. One might wonder
just how amazing this accomplishment
is, or how surprised people should be
that there has been such a long streak
in the history of the sport. Many have
attempted to quantify this using statisti-
cal methods.
A basic probability approach goes
something like this: If a player has a
.300 batting average (i.e., gets a hit in
30% of his at-bats) and has four at-bats
each game, his probability of getting at
least one hit in a given game is $1 - (1 - .3)^4$,
or .76. Thus, the probability of
this player getting a hit in every game
during a given 56-game stretch is $.76^{56}$,
or .00000021 (assuming all at-bats and
all games are independent).
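The arithmetic above can be checked directly. The following is a minimal sketch under the same independence assumption; the function names are illustrative and do not come from the original analysis.

```python
def prob_hit_in_game(batting_average, at_bats):
    """Probability of at least one hit in a game of independent at-bats."""
    return 1.0 - (1.0 - batting_average) ** at_bats


def prob_streak_in_fixed_stretch(batting_average, at_bats, n_games):
    """Probability of a hit in every game of one fixed stretch of games."""
    return prob_hit_in_game(batting_average, at_bats) ** n_games


# A .300 hitter with four at-bats in every game, over one given 56-game stretch.
p_game = prob_hit_in_game(0.300, 4)
p_streak = prob_streak_in_fixed_stretch(0.300, 4, 56)
print(f"P(at least one hit in a game) = {p_game:.4f}")
print(f"P(hit in all 56 given games)  = {p_streak:.2e}")
```

The per-game value rounds to the .76 quoted in the text, and the 56-game value is on the order of two in ten million, whether or not the per-game probability is rounded first.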
But we are really interested in
the probability of there ever being a
56-game hitting streak by any player,
not just the probability of one particular
player achieving it over a given 56-game
stretch. A direct probability approach
is then difficult, mainly because the
56-game stretches within a player’s
season are not independent of one another;
performance in games 1–56 of a season shares
much information with performance in
games 2–57.
Fortunately, advances in computer
technology and in the availability of
baseball data have made a simulation
approach feasible.
Joe DiMaggio done it again!
Joe DiMaggio done it again!
Clackin’ that bat, gone with the wind!
Joe DiMaggio’s done it again!
– “Joe DiMaggio Done It Again,” Woody Guthrie, 1949
The inspiration for our research came
from a New York Times article on March
30, 2008, titled “A Journey to Baseball’s
Alternate Universe.” Samuel Arbesman
and Steven Strogatz ran simulations
of baseball seasons to estimate
the probability of long hitting streaks,
using data for each batter during each
season in major league history. They
treated a player’s at-bats per game as
constant across all games in a season;
simulated 10,000 baseball histories; and
tabulated which player held the longest
streak, when he did it, and how long his
streak was.
There was a hitting streak of at least
56 games in 42% of these simulated
histories, meaning we should not be all
that amazed that there has been one in
our actual observed history.
Don M. Chance, in the CHANCE
article “What Are the Odds? Another
Look at DiMaggio’s Streak,” used a
modified calculation of hitting oppor-
tunities to study the likelihood of long
hitting streaks, arguing that noninten-
tional bases on balls and sacrifice flies
are opportunities for a hit and should
be included in any calculation. This
increases the number of opportunities
in a game, but decreases the probability
of success in a single opportunity. The
net effect is a decrease in the probability
of a hit in most games and, thus, of a
lengthy hitting streak.
We disagree with the notion that a
base on balls is a missed opportunity
for a hit. A base on balls is usually quite
beneficial for the team on the receiving
end. A good hitter, unless he has a very
long streak going already, is unlikely to
eschew the walk in favor of swinging
at bad pitches in a dire bid to get a
hit. That is presumably why a base on
balls is not counted against a player’s
batting average.
Constant vs. Variable At-Bats
We noted that assigning the same num-
ber of at-bats for each game greatly
overestimates the probability of long
streaks. This is due to Jensen’s inequal-
ity. The probability of a two-game hit-
ting streak is much lower if the player’s
at-bats are, say, two and then six than
if his at-bats are four and then four. A
simple example can be seen in Figure 1.
The probability of a two-game hitting
streak is much lower if the player has
fewer at-bats in each game. Going down
a northwest-southeast diagonal of this
lattice structure graphically illustrates
Jensen’s inequality. Thus, the constant
at-bat assumption overestimates the
likelihood of long hitting streaks. The
need to vary at-bats is also due to the
fact that at-bats per game have decreased
and become more varied over time. Figure 2
illustrates this phenomenon.
Figure 1. Contour plot of the probability of a player with a .300 batting average getting a hit in each of two games
Figure 2. Density estimates of at-bats per game by decade
Note: The unit of analysis is a player-season.
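The two-game comparison described in this section can be reproduced numerically. The sketch below assumes independent at-bats and a fixed .300 batting average; the particular splits of eight at-bats are chosen only for illustration. The relevance of Jensen’s inequality is that the logarithm of the per-game hit probability is concave in the number of at-bats, so spreading a fixed total unevenly across games lowers the probability of hitting in every game.

```python
def prob_hit_in_game(batting_average, at_bats):
    """Probability of at least one hit in a game of independent at-bats."""
    return 1.0 - (1.0 - batting_average) ** at_bats


def prob_two_game_streak(batting_average, at_bats_game1, at_bats_game2):
    """Probability of at least one hit in each of two independent games."""
    return (prob_hit_in_game(batting_average, at_bats_game1)
            * prob_hit_in_game(batting_average, at_bats_game2))


# Every split below totals eight at-bats over the two games.
splits = [(4, 4), (3, 5), (2, 6), (1, 7)]
streak_probs = {split: prob_two_game_streak(0.300, *split) for split in splits}
for split, prob in streak_probs.items():
    print(split, round(prob, 4))
```

With these inputs the even split of four and four gives the highest probability, and the probability falls as the split becomes more lopsided, which is the pattern along the diagonal of Figure 1.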
The simulations run in “Chasing
DiMaggio: Streaks in Simulated Sea-
sons Using Non-Constant At-Bats,”
published in the Journal of Quantitative
Analysis in Sports in 2009, varied at-bats
using Retrosheet game data for all of
major league baseball from 1954–2007,
as well as for the National League in
1911, 1921, 1922, and 1953. Since the
publication of that paper, Retrosheet
has added game data from both the
American League and National League
for the 1920–1929 seasons.
It should be noted that these simula-
tions are not true simulations of a game,
but simulations of a player’s at-bats in
a game over all the games played in
a season. Unfortunately, due to the
unavailability of some game-by-game
data, some or all of the careers of some
of the best hitters in baseball (e.g., Willie
Keeler, Ted Williams, Joe DiMaggio) are
not included in this analysis.
The following is a brief overview of
the simulations with varying at-bats.
For each hitter in each season, the bat-
ting average is fixed, using that player’s
actual batting average for that season.
This allows us to treat the 1990 and
2001 versions of a player such as Barry
Bonds as two players. We assumed at-
bats over the course of a single game are
independent. This means the number of
hits a player gets in a given game has a
binomial distribution with the number
of trials equal to the at-bats in that game
and the probability of success equal to
the batting average for the season. Using
the game-by-game data from Retrosheet,
each player in each season had their at-
bats sampled with replacement from
their actual at-bat distribution to create
a simulated season’s worth of games.
This was done 1,000 times to create
1,000 “simulated” baseball histories. A
hitting streak was considered to be any
run of consecutive games in each of which
the player had at least one hit. In
the results section, this method will be
denoted as Binom.
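The Binom procedure described above can be sketched for a single player-season as follows. This is an illustration under stated assumptions, not the authors’ code: the function names and the example at-bat list are hypothetical (the list is not Retrosheet data), and a game with zero at-bats is simply treated as a hitless game that ends a streak.

```python
import random


def simulate_season(at_bats_by_game, batting_average, rng):
    """Return simulated hits per game for one player-season."""
    n_games = len(at_bats_by_game)
    # Bootstrap step: resample the per-game at-bats with replacement.
    sampled_at_bats = rng.choices(at_bats_by_game, k=n_games)
    # Hits in a game are Binomial(at-bats, batting average), drawn here
    # as a sum of independent Bernoulli trials.
    return [sum(rng.random() < batting_average for _ in range(ab))
            for ab in sampled_at_bats]


def longest_streak(hits_by_game):
    """Length of the longest run of consecutive games with at least one hit."""
    best = current = 0
    for hits in hits_by_game:
        current = current + 1 if hits > 0 else 0
        best = max(best, current)
    return best


rng = random.Random(2011)
example_at_bats = [4, 5, 3, 4, 0, 4, 1, 5, 4, 3]  # hypothetical game log
season = simulate_season(example_at_bats, 0.300, rng)
print(season, longest_streak(season))
```

Repeating `simulate_season` for every player-season and recording the longest streak would give one simulated history; repeating that whole pass many times gives the collection of histories the text describes.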
Table 1—Top 20 Maximum Hitting Streaks in 1,000 Simulated Baseball Histories
Table 2—Hitting Streaks in 18,607,000 Simulated Player–Seasons
Varying Batting Average
While the number of at-bats varied each
game, the batting average remained
constant. The question remained: How
should we vary batting average in this
simulation study? In “Chasing DiMaggio:
Streaks in Simulated Seasons Using Non-
Constant At-Bats,” why did the authors
feel the need to vary at-bats during the
simulations? Over the course of a baseball
season, a player is not going to have the
same number of at-bats in every game.
There may be days when the player is
starting and others when he may be com-
ing off the bench to pinch-hit. Varying
the at-bats is an attempt to mimic this
phenomenon in the simulations.
How do we attempt to vary batting
average in our simulations to resemble
the course of a season? We attempted
this in three ways. The first method
we used was the simplest approach to
varying batting average. We treated a
player’s chance at a hit in a given at-bat
as a beta random variable, taking the
player’s actual number of hits (successes)
in a season as the first shape parameter
and the player’s actual number of outs
(failures) in a season as the second shape
parameter. The mean of this random
variable would be the batter’s batting
average in a season. In the results section,
this method will be denoted as Beta.
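A small sketch of the Beta idea follows, using the standard library’s `random.betavariate`. The text does not say how often the hit probability is redrawn, so drawing it once per game is an assumption made here for illustration; the season totals and function names are also hypothetical.

```python
import random


def draw_hit_probability(season_hits, season_at_bats, rng):
    """Draw a hit probability from Beta(hits, outs) for a player-season."""
    season_outs = season_at_bats - season_hits
    # betavariate requires both shape parameters to be positive.
    return rng.betavariate(season_hits, season_outs)


def simulate_game_beta(season_hits, season_at_bats, game_at_bats, rng):
    """Assumed scheme: one Beta draw per game, then independent at-bats."""
    p = draw_hit_probability(season_hits, season_at_bats, rng)
    return sum(rng.random() < p for _ in range(game_at_bats))


rng = random.Random(1941)
season_hits, season_at_bats = 180, 600  # hypothetical .300 hitter
draws = [draw_hit_probability(season_hits, season_at_bats, rng)
         for _ in range(20000)]
mean_draw = sum(draws) / len(draws)
print(round(mean_draw, 3), season_hits / season_at_bats)
```

The average of the draws should sit close to the season batting average, which matches the statement that the mean of the Beta variable is the batter’s average; individual games, however, are played with probabilities scattered around that mean.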
The second method of varying batting
average treats hit probability in a
given game as correlated with performance
in “neighboring” games, which
may better mimic so-called hot- and
cold-hand effects. For a brief description,
let’s look at how the method works
for a neighborhood of 15 games.
In the simulations, we start with a
fixed batting average for each game
and run a simulated season, just like the
Binom method. Hit probabilities are
then updated by incorporating information
from neighboring games into the
simulated season. For instance, in game
50 of a simulated season, the probability
of a base hit is reflected by the player’s
performance in games 35 through 65.
Using the new hit probabilities, the
simulations generate a new array of
hits. This process is repeated for each
game in a player’s season and denoted
as Binom-15.
The third method of varying batting
average is the same as Binom-15, except
it uses a 30-game neighborhood; that is,
the probability of a base hit in game 50
should be reflected by the player’s
performance in games 20 through 80. This
method is denoted as Binom-30.
Results
Table 1 lists the top 20 performances
in the simulations. It should be noted
that these are the peak streaks for
a player. For example, using the
Binom-15 method, the 1922 version
of George Sisler had a maximum hitting
streak of 95 games. His second-highest
streak was 83 games. That
streak is not included on the list since
his highest was 95.
Another way to look at the results
is to summarize the hitting streaks
over each simulated player-season;
yet another is to summarize over
each of the individual 1,000 baseball
histories. Table 2 compares our
four methods, showing how many
simulated player-seasons contained a
hitting streak of at least 40 games, at
least 50 games, and at least 56 games
(DiMaggio’s record).
Table 3 shows the same breakdown,
but over entire simulated histories.
The Binom-15 method yields the most
long streaks—561 individual player-seasons
contained a hitting streak of
56 games or longer, and a whopping
450 of the 1,000 histories featured
such a streak. This means that, if
we assume batting ability fluctuates
smoothly over the course of a month,
we should not be all that surprised
that there has been a 56-game hitting
streak in real life. It also shows that
results change significantly depending
on our assumptions.
Table 3—Hitting Streaks in 1,000 Simulated Baseball Histories
How did Joltin’ Joe do in our simulations
using all four methods, plus the
constant at-bat method? We ran 10,000
simulations of DiMaggio’s 1941 season,
using his actual game-by-game at-bats.
DiMaggio’s actual game-by-game data
were obtained from Cliff Blau, a member
of the Society for American Baseball
Research. Table 4 shows he reached
his record only a few times with each
method and had more long streaks
with the two Binom methods. The
first two rows also reinforce that when
batting average is treated as constant,
the assumption of constant at-bats
makes a long streak more likely.
Table 4—Hitting Streaks for 1941 Joe DiMaggio in 10,000 Simulations
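The Binom-15 and Binom-30 methods are described only in outline, so the exact update rule is not stated. One possible reading is sketched below: a first pass simulates the season with a fixed batting average, each game’s hit probability is then replaced by the simulated batting average over the surrounding window of games, and hits are drawn again. The windowed-average rule, the inclusion of the game itself in its window, the fallback for windows with no at-bats, and all names here are assumptions for illustration.

```python
import random


def binomial_hits(at_bats, p, rng):
    """Hits in one game as a sum of independent Bernoulli at-bats."""
    return sum(rng.random() < p for _ in range(at_bats))


def neighborhood_probabilities(hits, at_bats, width, fallback):
    """Windowed batting average around each game (assumed update rule)."""
    n_games = len(hits)
    probs = []
    for k in range(n_games):
        # Games k-width through k+width, clipped at the ends of the season.
        lo, hi = max(0, k - width), min(n_games, k + width + 1)
        window_at_bats = sum(at_bats[lo:hi])
        window_hits = sum(hits[lo:hi])
        probs.append(window_hits / window_at_bats if window_at_bats else fallback)
    return probs


def simulate_neighborhood_season(at_bats, batting_average, width, rng):
    """Two-pass season: fixed average first, then windowed averages."""
    first_pass = [binomial_hits(ab, batting_average, rng) for ab in at_bats]
    probs = neighborhood_probabilities(first_pass, at_bats, width, batting_average)
    return [binomial_hits(ab, p, rng) for ab, p in zip(at_bats, probs)]


rng = random.Random(56)
example_at_bats = [4, 3, 5, 4, 4, 2, 4, 5, 3, 4] * 15  # hypothetical 150 games
season = simulate_neighborhood_season(example_at_bats, 0.300, 15, rng)
print(sum(season), "hits in", sum(example_at_bats), "at-bats")
```

With `width=15`, the 50th game draws on games 35 through 65, as in the text’s example. Under this reading, a stretch that happens to go well in the first pass raises the hit probability for every game near it, which is one way a neighborhood scheme could produce longer streaks than a fixed average.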
Multiseason Streaks
From the end of the 2005 season through
the beginning of the 2006 season,
Jimmy Rollins of the Philadelphia
Phillies had a hitting streak of 38
games. His streak would not have
been captured by the methods previ-
ously discussed, although it officially
counts as a streak. Chance looked at
the odds of long hitting streaks using
a player’s career data. His analysis
used only 125 players—the top 100
in career batting average, plus all
other players who had real-life hit-
ting streaks of at least 30 games.
We ran additional simulations on
these good hitters to estimate the
frequency of long hitting streaks that
span two seasons. We started with
the same list of players for our analy-
sis using the game-by-game at-bats
for their entire career found in the
Retrosheet data. Retrosheet contains
the complete careers for only 24 of
those players and partial data for an
additional 62. In these simulations,
multiseason long streaks occurred
approximately one-tenth as often as
single-season long streaks, meaning
that to truly compare the likelihood
of witnessing a long streak, we should
increase the number of long streaks in
Tables 2 and 3 by 10%.
Concluding Thoughts
Why is there such a difference
between the simulations when bat-
ting average is fixed and batting aver-
age varies under the “neighbor”
method? If a batter was successful in
his first 15 or 30 games, his batting
average (probability of success of a
hit) is going to be higher for the next
game. A player who starts a simulated
season “hot” is going to have longer
streaks using this method. The
30-game neighborhood method had
a streak at least as long as DiMaggio’s 1941
streak in 34.3% of our simulated baseball
histories. The 15-game neighborhood
method had a streak at least as long as
DiMaggio’s 1941 streak in 45%
of our simulated baseball histories.
Over a stretch of 15 games, it might
be more common to see a player have
a higher batting average than over a
stretch of 30 games. This may have
produced longer hitting streaks in the
simulations. If we are to believe the
results of these attempts at simulating
baseball histories, it is not surprising
that someone did have a 56-game hit-
ting streak during the course of Major
League Baseball’s history. The surprise
comes into play when we pick out a
certain player, such as Joltin’ Joe or
Sisler, to have the specific streak.
Further Reading
Arbesman, S., and S. Strogatz. 2008.
A journey to baseball’s alternate universe.
The New York Times. March 30.
Berry, S. 1991. The summer of ’41: A
probability analysis of DiMaggio’s
streak and Williams’ average of
.406. CHANCE 4(4):8–11.
Chance, D. M. 2009. What are the
odds? Another look at DiMaggio’s
streak. CHANCE 22(2):33–42.
Gould, S. J. 1989. The streak of
streaks. CHANCE 2(2):10–16.
McCotter, T. 2008. Hitting streaks
don’t obey your rules: Evidence
that hitting streaks aren’t just by-
product of random variation. Base-
ball Research Journal 37:62–70.
Rockoff, D. M., and P. A. Yates. 2009.
Chasing DiMaggio: Streaks in
simulated seasons using non-con-
stant at-bats. Journal of Quantitative
Analysis in Sports 5(2), Article 4.
www.bepress.com/jqas/vol5/iss2/4.
Short, T., and L. Wasserman. 1989.
Should we be surprised by the
streak of streaks? CHANCE
2(2):13.
Warrack, G. 1995. The great streak.
CHANCE 8(3):41–43, 60.
Joe DiMaggio with bat ready at first day’s workout on March 6, 1946, in Bradenton, Florida,
after returning from Panama. DiMaggio had been out since 1942 for U.S. Army service.
AP Photo/Preston Stroup