
Ask Dr. STATS
JEROME P. KEATING
Division of Mathematics and Statistics
University of Texas at San Antonio
501 West Durango Boulevard
San Antonio, TX 78207
DAVID W. SCOTT1
Department of Statistics, MS-138
Rice University
6100 Main Street
Houston, TX 77005-1892
A Primer on Density Estimation
for the Great Home Run Race of ’98
Questions about graphing data are frequently
asked throughout courses in Statistics. The most
basic questions deal with “how to” form density
estimates. Our students often inquire about how
the smooth density estimates are constructed in
certain articles and want to know how to produce
such smooth graphs. In this article, we review a
fundamental approach in density estimation and
illustrate the procedure on the lengths of home
runs hit by Sammy Sosa and Mark McGwire in the
Great Home Run Race of ’98.
■ The Data:
In the “Great Home Run Race of ’98,” Sammy
Sosa of the Chicago Cubs and Mark McGwire of
the St. Louis Cardinals battled throughout the last
two months of the season for the title of baseball’s
greatest single season home run hitter. As the sun
sets on this magnificent season, we analyze and
review their quest, which was Roger Maris’ record
of 61 home runs set in 1961. His record retains an
asterisk, because Maris hit his 61 home runs in a
162-game season, whereas Babe Ruth hit 60 home
runs in a 154-game season. After the first 154
games of the 1961 season, Maris had 58 home
runs. Ruth was a larger than life figure, whose
magnetism and charisma lent more weight to his
records. His premature death, no doubt,
contributed all the more to the lore that followed
the name of Ruth throughout baseball history.
¹ Research supported in part by NSF grant DMS 96-26187.
These contributing factors had more to do with the
well-known asterisk than the length of the season.
As the 1998 season dawned, McGwire started
faster. By the end of April, McGwire had 11 home
runs, whereas Sosa had 6; by the end of May,
McGwire had 27 to Sosa's 13. In June, Sosa went
into overdrive, hitting 20 home runs, a monthly
record, and cutting McGwire's lead to only 4 home
runs by month's end. Sosa hit one more than
McGwire in July, and on August 19, against St.
Louis in Chicago’s Wrigley Field, Sosa went ahead
of McGwire by hitting his 48th home run in the
fifth inning. However, his lead was short-lived (58
minutes) as McGwire tied him with a home run in
the 8th inning of the same game and reclaimed the
lead with a solo home run two innings later.
Sammy Sosa tied Mark McGwire again at 55 on
August 31, at 62 on September 13, at 63 on
September 16 and at 65 on September 23.
On the last Friday of the season, September
25th, for only the second time in the season, Sosa
took the lead with a 462-foot home run in the
Astrodome. However, this lead lasted but 45
minutes as McGwire struck back with his 66th
home run in the bottom of the fourth inning in St.
Louis. Both players surpassed the mark of 61 home
runs in the first 154 games of the season removing
the need for any asterisks on their records. Just as
McGwire started strongly, he finished the same
way. McGwire’s surge of five home runs on the final
weekend of the season propelled him to a
magnificent 70-home run season.
It would be myopic to concentrate solely on
McGwire's season, for in doing so we miss Sammy
Sosa's magnificent season within a season. While
Mark McGwire's 1998 home runs are accentuated
by blasts of heroic distances, Sammy Sosa's 1998
home run rate is an overlooked topic. It is an
understatement to say that Sosa is a streak hitter.
From May 25 (the Cubs' 50th game) through
September 13 (the Cubs' 150th game), Sammy Sosa
hit an incredible 53 home runs in only
101 consecutive games. In fact, Sosa hit 60 home
runs in just 131 consecutive games, dating from
May 3, 1998 through September 25, 1998.
We can always use sample descriptive statistics to
compare the hitters, but these values, while quite
informative, are limited. We can see from Table 1
that McGwire's home run lengths have larger mean
and median values than Sosa's. However, there is
much more to this comparison, as we shall see.

Table 1. Comparison of home run lengths (feet): McGwire vs. Sosa.

Statistic     McGwire   Sosa
Mean           423.75   407.48
Median         423      410
Mode           419      430
Std. Dev.       46.41    38.14
Kurtosis        -0.37    -0.53
Skewness         0.38     0.30
Range          204      160
Minimum        341      340
Maximum        545      500
Count           70       66
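Table 1 is straightforward to reproduce. Here is a minimal sketch in Python (assuming `mcgwire` and `sosa` hold the distances tabulated in Appendix Table A.1; tie-breaking for the mode and the bias corrections used for skewness and kurtosis can shift those entries slightly):

```python
from collections import Counter
import numpy as np
from scipy import stats

def describe(distances):
    """Summary statistics in the spirit of Table 1."""
    x = np.asarray(distances, dtype=float)
    return {
        "Mean": x.mean(),
        "Median": np.median(x),
        "Mode": Counter(x.tolist()).most_common(1)[0][0],  # first value among ties
        "Std. Dev.": x.std(ddof=1),     # sample standard deviation
        "Kurtosis": stats.kurtosis(x),  # excess kurtosis; convention-dependent
        "Skewness": stats.skew(x),
        "Range": x.max() - x.min(),
        "Minimum": x.min(),
        "Maximum": x.max(),
        "Count": len(x),
    }

# print(describe(mcgwire)); print(describe(sosa))
```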
Can we use statistical procedures to better
describe their home run lengths? By an examination
of the length of their home runs, can we determine
any similarities or differences between their home
run swings?
■ Density Estimation
We draw frequency histograms from our very
first courses in Statistics. These histograms are
usually crude, have cumbersome special rules, and
require some subjective input. The two most
critical features in density estimation are:
1. the choice of the interval or bin width, and
2. the starting point for the frequency histogram.
The first critical feature, the interval width of the
histogram, has been the subject of discussion since at
least Sturges' rule (1926). For bin width selection
procedures, Scott (1992) provides an historical
account that will make students familiar with the
long history behind this basic, but unresolved,
issue in density estimation. Wand (1997) states the
statistician's dilemma as follows:

The most important parameter of a histogram is
the bin width because it controls the tradeoff
between presenting a picture with too much detail
("under-smoothing") or too little detail
("over-smoothing") with respect to the true distribution.
Statisticians differ in their choice of the
interval width, h, but these differences are often
driven by the practitioner's motive in his or her
cross-examination of these data. The width that we
have chosen reflects our interest in the detection of
modes, sometimes referred to colloquially as
"bump-hunting."
Wand (1997) provides a modification to a
well-known procedure of Scott (1979); the
modification satisfies an asymptotic optimality
condition.
condition. The choice of the width, h, of the
interval that we use to graph the home run lengths
of McGwire and Sosa is one that is quite popular
among exploratory data analysts and is given by:
$$ h = \frac{3.49\,\hat{s}}{n^{1/3}}, \qquad \hat{s} = \min\!\left( S, \frac{\mathrm{IQR}}{1.349} \right), $$

where S is the sample standard deviation of these data, n is the size of the sample, and IQR is the inter-quartile range of these data. This expression for h is a modification of the original from Scott (1979), which suggested $\hat{s} = S$, and is based on normal-scale bin-width selection. The normalizing constant (c = 1.349) guarantees that IQR/c is an asymptotically unbiased estimator of σ whenever the underlying data are normally distributed, i.e.,

$$ E\!\left[ \frac{\mathrm{IQR}}{1.349} \right] = \sigma. $$
The practical importance of this method is
that these data determine the width of the interval
using traditional measures of scatter and the
sample size.
We shall consider two smoother alternatives to
the histogram in this note. The simplest is the
frequency polygon, which connects the midpoints
of the histogram with line segments. Scott (1985a)
showed that the histogram from which a frequency
polygon is constructed can and should have a
wider bin width, which for normal data is given
by
$$ h = \frac{2.15\,\hat{s}}{n^{1/5}}. $$
A self-contained primer on the
derivation of these formulae is given in
the appendix at the end of this article.
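Both data-based rules are easy to compute. Here is a minimal sketch in Python (NumPy only; the function names are ours, and NumPy's default percentile interpolation is assumed to match the quartile convention used in the article):

```python
import numpy as np

def robust_sigma(x):
    """sigma-hat = min(S, IQR/1.349): the smaller of the sample standard
    deviation and the normalized inter-quartile range."""
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)                      # sample standard deviation S
    q1, q3 = np.percentile(x, [25, 75])    # first and third quartiles
    return min(s, (q3 - q1) / 1.349)

def histogram_width(x):
    """Normal-scale histogram bin width: h = 3.49 * sigma-hat / n^(1/3)."""
    return 3.49 * robust_sigma(x) / len(x) ** (1 / 3)

def frequency_polygon_width(x):
    """Wider bin width for the frequency polygon: h = 2.15 * sigma-hat / n^(1/5)."""
    return 2.15 * robust_sigma(x) / len(x) ** (1 / 5)
```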
■ The Five-Step Process:
In this technique, follow the five steps to draw relative frequency polygons:
i. Choose a starting point, denoted by t0, which is significantly less than the minimum sample value (for example, let t0 = 0).
ii. Separate these data into intervals, called classes, determined by h. Let bi denote the beginning point of the ith class and ei the corresponding class endpoint (i.e., b0 = t0, e0 = t0 + h, b1 = t0 + h, e1 = t0 + 2h, and so on).
iii. In each interval, count the number of occurrences, known as the frequency of the class, and denote it by fi.
iv. Plot the class frequency, fi, against the midpoint, mi, of each class for i = 1, …, k, where k is the number of classes.
v. Connect adjacent pairs of points to form the sides of the frequency polygon.
An interval along the x-axis will form one side of the polygon. To convert a frequency polygon into a density estimator, note that the area under the polygon can be calculated using the trapezoidal rule. Let the area of the ith trapezoid be denoted by Ai. Renumber the sequence so that we start at the smallest midpoint, m1, for which the corresponding frequency, f1, is non-zero, and end at (mk, fk). Hence, the total area of the polygon becomes

$$ A = \sum_{i=1}^{k+1} A_i = \sum_{i=1}^{k+1} \frac{f_{i-1} + f_i}{2}\,h = \frac{h}{2} \sum_{i=1}^{k+1} \left( f_{i-1} + f_i \right) = nh, $$

where f0 = 0 and fk+1 = 0. Thus, to convert the frequency polygon into a formal density estimator, divide the frequencies by nh, so that the area of the polygon will total 1. Because we will compare density estimators for hitters whose numbers of home runs differ, it is important to use the density estimators (the relative frequency polygons) as opposed to the un-normalized frequency polygons.
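The five steps translate almost directly into code. A minimal sketch in Python (the function name and the zero-padding at the ends are our conventions, not the authors'): it carries out steps i–v and divides the frequencies by nh, so the returned points trace the relative frequency polygon.

```python
import numpy as np

def relative_frequency_polygon(x, h, t0=0.0):
    """Steps i-v: bin the data from starting point t0 with width h, then pair
    each class midpoint with its frequency divided by n*h (unit total area)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # ii. class edges t0, t0 + h, t0 + 2h, ... covering all of the data
    k = int(np.ceil((x.max() - t0) / h))
    edges = t0 + h * np.arange(k + 1)
    # iii. class frequencies f_i
    freqs, _ = np.histogram(x, bins=edges)
    # iv. class midpoints m_i, with heights f_i / (n*h)
    mids = edges[:-1] + h / 2
    heights = freqs / (n * h)
    # v. add a zero-frequency class at each end so the polygon closes on the axis
    mids = np.concatenate([[mids[0] - h], mids, [mids[-1] + h]])
    heights = np.concatenate([[0.0], heights, [0.0]])
    return mids, heights  # connect successive points with line segments to plot
```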
The frequency polygon is influenced by the choice of starting at t0 and moving over a prescribed length, h, to each successive endpoint. While much attention in the statistical literature is given to the choice of h, an equally perplexing problem is that the starting point may be subjectively chosen. Scott's (1985b) approach allows these data to dictate the choice of t0 as well, by averaging frequency polygons with different starting points. To obtain another perspective of the density function, start at another point, t0', where t0' = t0 − h/2. This choice shifts the bins over one-half the interval width, and the five-step process outlined above is repeated. Since the frequency polygons based on the first endpoints and the shifted endpoints differ, we want to combine these frequency polygons to obtain a new density estimator, which is a better estimator of the unknown density function.

If we assume that each frequency polygon provides important information, a clearer picture appears by averaging the polygons. This is the fundamental addition provided by Scott (1985b) to his earlier work. He averages the graphs to obtain an averaged shifted histogram (ASH). The averaged graph produces a frequency polygon, which has twice the number of sides. The outcome of this averaging process is that we have a graph which is smoother than its progenitors. In the combined graph, Figure 1, the two frequency polygons with offset starting points are averaged to produce a smoother frequency polygon. The composite features can be seen in the combined graph, but the number of modes is still unclear. Technically speaking, we are using averaged shifted relative frequency polygons (ASRFP).

Figure 1. Combining two frequency polygons.

■ A Two-Stage (Five-Step) Process
Consider again Figure 1, which contains two frequency polygons of Mark McGwire's home runs. The first frequency polygon is based on starting at zero and moving over a class interval, 29.329, to form successive endpoints. The second frequency polygon
is formed as the first, but with a different starting point, namely, h/2. The first frequency polygon is bimodal, whereas the second is unimodal. Both are skewed to the right. The problem here is not whether the interval width is too large or too small, but rather where the polygon starts.
Averaging m shifted frequency polygons results in an essentially smooth graph and thereby eliminates the starting point as a nuisance parameter, which we mentioned earlier as the second critical feature in density estimation using either the ordinary histogram or the frequency polygon.
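In code, the averaging is a short extension of the sketch above: shift the origin m times by h/m, evaluate each relative frequency polygon on a common grid, and average (m and the grid size are our assumed knobs; m = 2 reproduces the combination of Figure 1, and m = 4 corresponds to the four equi-spaced starting points used later for Figure 6):

```python
import numpy as np

def averaged_shifted_polygon(x, h, t0=0.0, m=4, grid_points=512):
    """Average m relative frequency polygons whose starting points are offset
    by h/m, interpolating each onto a shared grid before averaging."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min() - 2 * h, x.max() + 2 * h, grid_points)
    total = np.zeros_like(grid)
    for j in range(m):
        mids, heights = relative_frequency_polygon(x, h, t0 - j * h / m)
        total += np.interp(grid, mids, heights)  # piecewise linear, like the polygon
    return grid, total / m
```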
■ Applying Scott’s Five-Step
Approach to Home Run Data
Let's first illustrate the need for an artful choice of the interval width, h. We apply Scott's two-stage method to the lengths of home runs hit by Mark McGwire. Since our goal is to compare the home run lengths of these two sluggers, we combine their data so as to form a common width, h. The associated standard deviation is S = 43.217 with n = 136. The third and first quartiles of their combined home run lengths are Q3 = 438.5 and Q1 = 379.25, respectively. Since IQR/1.349 = 43.921 > S = 43.217, we take ŝ = S, and the interval width becomes h = 29.329, with h/2 = 14.665.
To illustrate Wand's tradeoff, we shall use three different interval widths: a narrow width at 50% of h, h1 = 14.665; the optimal width, h2 = 29.329; and a wide width at 150% of h, h3 = 43.994.
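Plugging the combined sample into the bin-width sketch given earlier reproduces these values, up to rounding (here `mcgwire` and `sosa` are again assumed to hold the distances from Appendix Table A.1):

```python
combined = list(mcgwire) + list(sosa)   # n = 136 combined home run lengths
h = histogram_width(combined)           # about 29.33 feet
h1, h2, h3 = 0.5 * h, h, 1.5 * h        # narrow, optimal, and wide widths
```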
In this process, we apply the five-step procedure beginning at some fixed point, t0 = 300, and reapply the method starting at the staggered point, t0 + hi/2. These three bin width selections produce markedly different frequency polygons. Notice that Figure 2, with the narrow width, has multiple modes and a narrower range space for McGwire's home run lengths. This figure depicts the classic problem of under-smoothing, which Wand addressed earlier.
In Figure 3, with the widest width, h3, the polygon has one mode and a much wider range space for McGwire's home run lengths. This figure typifies the phenomenon known as over-smoothing. Notice that Figure 4 contains the frequency polygon whose interval width neither excessively under-smooths nor over-smooths the data.

Figure 2. McGwire's ASH — narrow width.
Figure 3. McGwire's ASH — wide width.
Figure 4. McGwire's ASH — width optimal.
■ The Smoothed Graph
In this section, we return to our primary goal of comparing the density estimators for the home runs of Sammy Sosa and Mark McGwire. Actually, we came by the following observations quite innocently. We gave these data to a class and asked them to graph the density function for the combined home run data for McGwire and Sosa. In Figure 5, notice that the estimated density function for their combined home run lengths is distinctly bimodal. With this observation, we wanted to see how the density functions for the two players fared individually. For this reason, we used a common interval width to simplify the comparison. Also, the combined figure is a weighted average of the Sosa and McGwire figures.

Figure 5. Combined ASH — width optimal. Sosa and McGwire.

In Figure 6, we smoothed the graphs more by increasing the number of starting points and averaging the different frequency polygons associated with each starting point. The smooth graph in Figure 6 is found by choosing four equi-spaced starting points within the initial interval, forming four frequency polygons, and averaging the four functions. The striking pattern in the smoothed frequency polygon is the presence of two coincidental modes for these stars. These sluggers have two primary distances, modes, about which their home run lengths are most concentrated. The first mode's distance is around 380 feet and the second is around 430 feet. The first mode is obviously one that represents home runs hit to either left or right field, because a 380-footer would not be a home run if it were hit to straight-away center field. In fact, Mark McGwire hit only three opposite-field home runs all year, whereas Sammy Sosa hit 16. Notice that in Figure 3, in reference to the quote from Wand, we would completely mask (over-smooth) this mode by choosing h too large. We also misrepresent the value of the mode. (Observe that 410 feet is a mode in Figure 3 but an anti-mode in Figure 4!)

Figure 6. Comparison of ASHs — Sosa vs. McGwire.

However, the 430-foot mode will clear the outfield wall in just about any direction in any National League park. Were the pitches associated with this primary mode fastballs? Were the pitches hit for the secondary mode off-speed pitches, such as curve balls and change-ups? Notice that the mode of 430 feet was primary for both players, whereas the mode of 380 feet was secondary. The higher frequencies attained by Sosa at each mode indicate that he more typically hit home runs of these two lengths. Hence, the sluggers are quite similar.

The long right tail in the frequency polygon for McGwire indicates that he hit some extremely long home runs. However, the number of such "blasts" was not sufficient to produce a third mode
for the lengths of his home runs. McGwire hit massive home runs of 545, 527, 511, 509, and 501 feet. In reference to the quote from Wand, if an excessively small interval width (under-smoothing) is chosen, one can artificially create a mode in the 500-foot range, as evidenced in Figure 2. For most players, hitting five 500-foot home runs would be more than a career event, but McGwire has now done this in consecutive seasons.

This method has a complication in that it can be insensitive to natural bounds in these data. This problem is much like those facing statisticians who work on edge detection in image reconstruction. In these data, a natural boundary exists in that the shortest possible home run is at least 300 feet, due to major league standards. However, this problem can be addressed by using medians as opposed to means.
■ References
Scott, D. W. (1979), “On Optimal and Data-Based
Histograms,” Biometrika, 66, 605–610.
——— (1985a), "Frequency Polygons," Journal of the
American Statistical Association, 80, 348–354.
——— (1985b), “Averaged Shifted Histograms,”
Annals of Statistics, 13, 1024–1040.
——— (1992), Multivariate Density Estimation, New
York: John Wiley & Sons.
Sturges, H. A. (1926), “The Choice of a Class
Interval,” Journal of the American Statistical
Association, 21, 65–66.
Wand, M. P. (1997), "Data-Based Choice of
Histogram Bin Width," The American
Statistician, 51, 59–64.
■ Appendix: A Primer on the Theory of Histograms
How does one arrive at the optimal interval
widths given earlier in the density
estimation section? The remarkable ability
of the histogram and related non-parametric
estimators is to display features in any unknown
density. The theoretical calibration required to
achieve good performance uses only simple tools
from statistics and calculus. We illustrate this with
a new approach. As noted earlier, the normalized form of the histogram is

$$ \hat{f}_h(x) = \frac{n_k}{nh} \quad \text{for } x \in B_k = (kh, (k+1)h), $$
where nk is the number of sample observations
that fall in interval Bk, and h is the width of the
interval. Intuitively, in order for a histogram to be
consistent, the bin width must become smaller as
the sample size increases. We will see precisely
how this must happen.
For a given choice of bin width, h, and a true
density, f(x), the total estimation error of a
histogram is computed in three steps: the mean
squared error is computed for each x; the total (or
integrated) mean squared error is computed over
each bin; and the bin-by-bin results are accumulated.
Furthermore, since

$$ \mathrm{MSE}[\hat{f}(x)] = \mathrm{Var}[\hat{f}(x)] + \left( E[\hat{f}(x)] - f(x) \right)^2, $$

we can accumulate the total variance and squared-bias portions separately.
The computation is simplified by noting that the number of occurrences, $n_k \sim \mathrm{Binomial}(n, p_k)$, where

$$ p_k = \int_{B_k} f(x)\,dx $$

is the probability that $X_i \in B_k$. Then the variance becomes

$$ \mathrm{Var}[\hat{f}_h(x)] = \frac{n p_k (1 - p_k)}{(nh)^2} = \frac{p_k}{nh^2} - \frac{p_k^2}{nh^2}. $$
Since each bin width equals h, and the variance is constant within each bin, the total integrated variance (IV) equals

$$ \mathrm{IV} = \sum_{k=-\infty}^{\infty} \int_{B_k} \left( \frac{p_k}{nh^2} - \frac{p_k^2}{nh^2} \right) dx = \sum_{k=-\infty}^{\infty} \frac{p_k - p_k^2}{nh} = \frac{1}{nh} - \frac{1}{n} \sum_{k=-\infty}^{\infty} \frac{p_k^2}{h}, $$

since

$$ \sum_{k=-\infty}^{\infty} p_k = \int_{-\infty}^{\infty} f(x)\,dx = 1. $$
One can verify that the resultant value of IV is
nonnegative. Moreover, by the Mean Value
Theorem, note that
$$ p_k = \int_{B_k} f(x)\,dx = h\,f(c_k) $$

exactly for some point $c_k \in B_k$. Thus, by standard Riemannian integral approximations,

$$ -\frac{1}{n} \sum_k \frac{p_k^2}{h} = -\frac{1}{n} \sum_k f(c_k)^2 \cdot h \;\longrightarrow\; -\frac{1}{n} \int_{-\infty}^{\infty} f(x)^2\,dx \quad \text{as } h \to 0. $$
This term is of lower order than the leading term,
1/nh. Note that both terms will vanish if we ensure
that any choice of h = h(n) satisfies h(n) → 0 and n·h(n) → ∞ as n → ∞.
The bias calculation is somewhat more involved. Clearly,

$$ E[\hat{f}_h(x)] = \frac{n p_k}{nh} = \frac{p_k}{h}. $$
Table A.1. Home run distances (feet): McGwire vs. Sosa.

Number McGwire Sosa    Number McGwire Sosa    Number McGwire Sosa
 1      364    371      26     388    380      51     385    380
 2      368    350      27     423    380      52     477    438
 3      364    430      28     409    366      53     393    414
 4      419    420      29     356    500      54     509    482
 5      424    430      30     409    380      55     501    364
 6      347    434      31     438    390      56     450    363
 7      462    370      32     437    400      57     472    374
 8      419    420      33     449    364      58     497    417
 9      437    440      34     433    432      59     458    464
10      419    410      35     461    428      60     381    430
11      371    420      36     431    440      61     430    480
12      362    460      37     472    365      62     341    480
13      358    400      38     485    420      63     385    434
14      527    430      39     405    347      64     417    344
15      381    410      40     415    438      65     423    410
16      545    370      41     511    390      66     375    462
17      478    370      42     425    375      67     403     —
18      440    410      43     458    374      68     435     —
19      471    380      44     452    400      69     377     —
20      451    340      45     408    361      70     370     —
21      425    410      46     374    480
22      366    420      47     464    360
23      477    410      48     398    368
24      397    415      49     409    430
25      433    430      50     369    440
Rather than relying on the Mean Value Theorem and Taylor expansions, let us instead consider a specific density, f(x) = ak + mk x for x ∈ Bk; that is, the true density is a piecewise continuous linear function defined on the same histogram mesh. We can compute the bin probability exactly, as well as the bias and the integrated squared bias. First,

$$ p_k = \int_{kh}^{(k+1)h} (a_k + m_k x)\,dx = a_k h + m_k h^2 \left( k + \tfrac{1}{2} \right). $$

Next, the integrated squared bias for the kth bin, ISBk, is equal to

$$ \mathrm{ISB}_k = \int_{B_k} \mathrm{Bias}[\hat{f}_h(x)]^2\,dx = \int_{kh}^{(k+1)h} \left[ \left( a_k + m_k h \left( k + \tfrac{1}{2} \right) \right) - (a_k + m_k x) \right]^2 dx = \tfrac{1}{12} h^3 m_k^2. $$
Since the density is piecewise linear, f′(x) = mk, so that

$$ \int_{B_k} f'(x)^2\,dx = h\,m_k^2. $$

Therefore,

$$ \mathrm{ISB} = \sum_{k=-\infty}^{\infty} \mathrm{ISB}_k = \tfrac{1}{12} h^2 \sum_k h\,m_k^2 = \tfrac{1}{12} h^2 \sum_k \int_{B_k} f'(x)^2\,dx = \tfrac{1}{12} h^2 \int_{-\infty}^{\infty} f'(x)^2\,dx. $$

The total mean integrated squared error (MISE) is

$$ \mathrm{MISE} = \mathrm{IV} + \mathrm{ISB} = \frac{1}{nh} - \frac{1}{n} \int_{-\infty}^{\infty} f(x)^2\,dx + \frac{1}{12} h^2 \int_{-\infty}^{\infty} f'(x)^2\,dx. $$

Any continuously differentiable density can be well approximated by a piecewise linear density, and our formula for the MISE holds for all such densities (not exactly, but with an error that vanishes as n → ∞). In particular, for a normal density with standard deviation σ,

$$ \int_{-\infty}^{\infty} f'(x)^2\,dx = \frac{1}{4 \sqrt{\pi}\,\sigma^3}. $$
The goal is to find the bin width that minimizes the MISE; so, by direct methods of the calculus, we can differentiate the MISE with respect to h and show that the best bin width is $h^* = 3.49\,\sigma n^{-1/3}$, as claimed before. Similar analyses result in optimal smoothing parameters for the frequency polygon (given earlier) and for the averaged shifted histogram (Scott, 1985b). The smoothing parameter for the ASH is about 20% larger than for the frequency polygon, but in this paper we do not draw too close a distinction.
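For completeness, here is the differentiation step that the text leaves to the reader, written as a short sketch (we drop the term $-\tfrac{1}{n}\int f(x)^2\,dx$ because it does not involve h):

$$ \frac{d}{dh}\,\mathrm{MISE}(h) = -\frac{1}{nh^2} + \frac{h}{6} \int_{-\infty}^{\infty} f'(x)^2\,dx = 0 \quad \Longrightarrow \quad h^* = \left( \frac{6}{n \int_{-\infty}^{\infty} f'(x)^2\,dx} \right)^{1/3}. $$

Substituting the normal-density value $\int f'(x)^2\,dx = 1/(4\sqrt{\pi}\,\sigma^3)$ gives $h^* = (24\sqrt{\pi})^{1/3}\,\sigma n^{-1/3} \approx 3.49\,\sigma n^{-1/3}$, as claimed.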