1994 Salaries of Major League Baseball Players

An Example of the Central Limit Theorem in Action
Using the 1994 Salaries of Major League Baseball Players
The central limit theorem says that for large n, the sampling distribution of the sample mean
$\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$ for any population with finite standard
deviation $\sigma$. As the sample size increases, the distribution of $\bar{X}$ becomes closer to
a normal distribution, regardless of what the population distribution may be.
To illustrate this, let’s look at a real set of data: the 1994 salaries of major league baseball
players. A histogram of the salaries is below.
Figure 4: 1994 Salaries of Major League Baseball Players
[Histogram. x-axis: Salary (in millions), 0 to 6; y-axis: Number of Players, 0 to 200.
Salaries include signing bonuses. Data source: USA Today, April 5, 1994.]
Notice that this histogram is very strongly skewed to the right. There were a total of
747 players, and more than a third of them had salaries below $200,000 (the first bar on the left
side of the histogram). For this example, we will treat the collection of all 747 players as our
population and choose random samples of the players. In order to make this resemble an infinite
population, we will allow repetition in our samples – that is, the same player can be chosen more
than once in a single sample.
Suppose that we take a random sample of 10 of these 747 baseball players and find the
mean salary of these 10 players. I did this and got a mean of $645,750. Then I did it a second
time, picking another 10 players at random and found a mean of $1,197,900. I repeated this
procedure 100 times. Each time 10 players were chosen at random and the mean salary for those
10 was recorded. There was quite a bit of variation in the results (the means ranged from
$166,550 to $2,309,333), but each time we are getting an estimate of the mean salary for all 747
players (which was $1,183,714).
I also repeated this procedure with samples of other sizes. I found 100 samples of size 2,
100 samples of size 5, and 100 samples of size 40. For each sample, I recorded the mean salary.
On the next page there are four histograms of the mean salaries from these samples, one for each
of the sample sizes.
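The resampling procedure described above can be sketched in a few lines of Python. This is only an illustration: the actual salary data are not reproduced here, so the `population` list below is a hypothetical right-skewed stand-in (747 lognormal values), and the name `sample_means` and the seed are assumptions made for the sketch.

```python
import random
import statistics

random.seed(1994)  # fixed seed so the run is repeatable

# Hypothetical stand-in population: 747 right-skewed "salaries" in dollars.
# These are NOT the real 1994 figures, only a lognormal imitation of their shape.
population = [random.lognormvariate(13.3, 1.2) for _ in range(747)]

def sample_means(pop, n, reps=100):
    """Return `reps` means, each from a random sample of size n drawn with replacement."""
    return [statistics.fmean(random.choices(pop, k=n)) for _ in range(reps)]

# One list of 100 sample means for each sample size used in the text.
means_by_size = {n: sample_means(population, n) for n in (2, 5, 10, 40)}

print(f"population mean: {statistics.fmean(population):,.0f}")
for n, means in means_by_size.items():
    print(f"n={n:>2}: min={min(means):,.0f}  max={max(means):,.0f}  "
          f"mean of means={statistics.fmean(means):,.0f}")
```

`random.choices` draws with replacement, which matches the choice above to let the same player appear more than once in a sample.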
Important features of these histograms:
• As the sample size increases, the variability of the means from the different samples
decreases. This reflects the fact that the variance of $\bar{X}$ is given by $\sigma^2 / n$,
which decreases as n increases. This indicates that with larger sample sizes, the mean
from your sample is more reliably close to the mean of the entire population.
• With a sample size of two (that is, choose two players at random and average their
salaries), the distribution of the means is still very strongly skewed to the right.
However, as sample size increases, the histograms become much closer to
symmetric, single-peaked, bell-shaped curves, like the normal distribution!
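The shrinking variability can be checked numerically. The sketch below compares the observed standard deviation of many sample means with the value $\sigma/\sqrt{n}$ predicted by the theory. The population is again a hypothetical skewed stand-in (exponential this time), not the real salary data, and the helper name `sd_of_sample_means` and the choice of 5000 repetitions are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical right-skewed population (not the real salary data).
population = [random.expovariate(1 / 1_000_000) for _ in range(747)]
sigma = statistics.pstdev(population)  # population standard deviation

def sd_of_sample_means(pop, n, reps=5000):
    """Standard deviation of `reps` sample means, sampling with replacement."""
    means = [statistics.fmean(random.choices(pop, k=n)) for _ in range(reps)]
    return statistics.stdev(means)

for n in (2, 5, 10, 40):
    observed = sd_of_sample_means(population, n)
    predicted = sigma / math.sqrt(n)
    print(f"n={n:>2}: observed SD={observed:,.0f}  sigma/sqrt(n)={predicted:,.0f}")
```

The two columns should agree closely but not exactly, since the observed value is itself an estimate from a finite number of samples.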
Finally, on the back of the sheet, there are normal probability plots for the means of the samples of
different sizes. These graphs plot the data on the x-axis (in this case, the different values of $\bar{X}$)
against their percentiles on the y-axis. The y-axis is scaled based on the normal distribution. If the
data follow very closely the pattern of a normal distribution, then the plot will follow almost
perfectly along a straight line.
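To make the construction of a normal probability plot concrete, here is a minimal sketch that computes the plotted points by hand: each sorted data value is paired with the standard normal quantile of its percentile. The plotting position (i - 0.5)/k is one common convention and is an assumption here, as are the function names; statistical packages may use a slightly different formula. The `straightness` helper is an added illustration that summarizes how close the points are to a line with a correlation coefficient.

```python
import random
import statistics
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each sorted value with the standard normal quantile of its plotting position."""
    xs = sorted(data)
    k = len(xs)
    # Plotting position (i - 0.5) / k is one common convention; others exist.
    zs = [NormalDist().inv_cdf((i - 0.5) / k) for i in range(1, k + 1)]
    return xs, zs

def straightness(data):
    """Pearson correlation between sorted data and normal quantiles (1.0 = a straight line)."""
    xs, zs = normal_plot_points(data)
    mx, mz = statistics.fmean(xs), statistics.fmean(zs)
    cov = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_z = sum((z - mz) ** 2 for z in zs)
    return cov / (var_x * var_z) ** 0.5

random.seed(5)
# Hypothetical skewed stand-in population, not the real salary data.
population = [random.lognormvariate(13.3, 1.2) for _ in range(747)]
for n in (2, 5, 10, 40):
    means = [statistics.fmean(random.choices(population, k=n)) for _ in range(100)]
    print(f"n={n:>2}: straightness = {straightness(means):.3f}")
```

Values nearer 1 correspond to plots that hug a straight line more closely. With only 100 sample means per size, the numbers vary from run to run, so they are a rough guide rather than a formal test of normality.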
The normal probability plots indicate that:
• The means of samples of size 2 very clearly do not follow the normal distribution. (This was
also evident from the histogram, since it was strongly skewed to the right.)
• The distribution of the means of samples of size 5 is also not close to normal. It is somewhat
skewed to the right, though not as much as for samples of size 2, and it has two high outliers.
• The distribution of the means of samples of size 10 is only very slightly skewed to the right, and
is very close to symmetric. The normal probability plot indicates that it is fairly close to normal.
• By the time we get to samples of size 40, the distribution of the means is almost perfectly
symmetric, and the normal probability plot indicates that it is very close to normal.
This illustrates the fact that as the sample size increases, the distribution of $\bar{X}$
becomes closer to a normal distribution, regardless of what the population distribution may be.
An implication of this is that when n is fairly large and we need to do probability computations, we
can use the normal distribution.
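As a sketch of such a probability computation, the snippet below approximates the chance that the mean salary of a random sample of 40 players exceeds $1,500,000. The population mean is the one stated earlier ($1,183,714). The population standard deviation is not given in the text, so the value of `sigma` used here is purely an assumed figure for illustration, and the threshold is likewise arbitrary.

```python
import math
from statistics import NormalDist

mu = 1_183_714      # population mean salary, from the text
sigma = 1_400_000   # ASSUMED population standard deviation (not given in the text)
n = 40              # sample size

# Under the central limit theorem, the sample mean is approximately
# normal with mean mu and standard deviation sigma / sqrt(n).
sampling_dist = NormalDist(mu, sigma / math.sqrt(n))

threshold = 1_500_000  # arbitrary cutoff chosen for the example
p_above = 1 - sampling_dist.cdf(threshold)
print(f"P(sample mean > {threshold:,}) is approximately {p_above:.3f}")
```

The same approach would be less trustworthy for n = 2 or n = 5, where the distribution of the sample means is still visibly skewed.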
Approximate Sampling Distributions for the Average 1994 Salaries
of Random Samples of Major League Baseball Players
[Figure: four histograms, one each for Samples of Size 2, Samples of Size 5, Samples of Size 10,
and Samples of Size 40. x-axis: Average Salary in Sample, 0 to 4000000; y-axis: Percent of Samples.]
Normal Probability Plots for N = 2, N = 5, N = 10, and N = 40
[Figure: four normal probability plots of the sample means. x-axis: Data; y-axis: Percent
(normal probability scale, 1 to 99). Each plot reports the mean and standard deviation of its
sample means. Reported means: 1109190, 1088984, 1193613, 1282994. Reported standard deviations:
950180, 473962, 714721, 247334.]