
is of course 7. If the middle two countries had 7 and 9 gold medals, then we would
have selected 8 as the median, according to the convention.
SECTION 1.3 EXERCISES
1. In a national survey, men were asked why
they exercise. Here are the reasons they gave:
Reason           Percentage responding
Health           51
Stress relief    25
Weight loss      20
Other             4
Draw a circle graph of these data.
2. Toss a coin 30 times and record the numbers of
heads and tails. Display your result as a circle
graph. (Hint: To draw a circle graph, you will
need to change your results into percentages.)
3. Draw a line graph for the high temperatures
in Exercise 6 of Section 1.2. Use the graph to
answer the following questions:
a. What are the highest and lowest temperatures?
b. What is the range of the data?
c. What do the center and spread of the data
tell you about the weather on this day?
4. Draw a line graph for the weights of rats
used in agricultural research in Exercise 2
of Section 1.2. Use the graph to answer the
following questions:
a. What are the highest and lowest weights?
b. What is the range of the data?
c. Find the median, or middle number, of the
data.
For additional exercises, see page 714.
1.4 FREQUENCY HISTOGRAMS
In statistics, a bar graph is called a frequency histogram, or just histogram.
The horizontal axis of such a graph tells the scale of the observations
involved, and the vertical axis shows how many observations fall in various
intervals. That is, it gives the numbers of data points, or frequencies, of the
observations in various intervals. Thus, when statisticians say “frequency,”
they simply mean the number of data points. Since it is the frequency
of the data occurring in various locations over the range of the data that
determines the overall shape of a data set, the concept of a frequency
histogram is very useful and important. The histogram is the most common
way to graphically display data. It loses some information (the exact value
of each data point, that is, the leaf), but it is extremely informative about
centering, spread, and especially shape. These are the three aspects of data
we usually care most about.
Our first example of a histogram is the one presented in Figure 1.3.
The amount of tar in a cigarette (measured in milligrams) was measured
for 24 brands of cigarettes. The graph shows the number (frequency)
of those 24 cigarette brands that contain given amounts of tar. For example, the leftmost bar represents the number of cigarette brands that
contain from 0 to less than 2.5 milligrams of tar per cigarette. (This range
can be expressed as 0 to 2.5⁻, the minus sign indicating that the value 2.5
is not included.) That bar reaches to a height of 1, meaning that one brand
has a tar content in that range. We read the other bars, corresponding to tar
ranges 2.5 to 5⁻, 5 to 7.5⁻, and so on, in a similar manner. This convention
is needed so that we know which interval to put 5.0 into, for example.
[Figure 1.3: Histogram of tar content of selected brands of cigarettes. Horizontal axis: Amount of tar (mg), 0 to 17.5 in steps of 2.5. Vertical axis: Number of cigarette brands, 0 to 10.]
How did we construct this histogram? The first step was to take the
“raw” data, or original 24 pieces of data, and construct a frequency table.
Here are the raw data:
 1.1    8.9   14.7
 4.0    9.0   15.0
 4.5   11.6   15.0
 4.9   12.1   15.3
 7.2   12.5   15.8
 7.5   12.9   16.2
 7.9   13.3   16.8
 8.3   14.1   17.3
From these data and using intervals of width 2.5, we obtain the needed
frequency table, shown in Table 1.11. Then we simply convert this frequency
table into Figure 1.3 in the obvious way.
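The counting step can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name frequency_table and its parameters are made up for this example, and each interval is treated as including its left endpoint and excluding its right endpoint, following the 2.5⁻ convention described above.

```python
import math

# Raw tar contents (mg) of the 24 brands, copied from the text.
tar = [1.1, 8.9, 14.7, 4.0, 9.0, 15.0, 4.5, 11.6, 15.0, 4.9, 12.1, 15.3,
       7.2, 12.5, 15.8, 7.5, 12.9, 16.2, 7.9, 13.3, 16.8, 8.3, 14.1, 17.3]


def frequency_table(data, width, start=0.0):
    """Count data points in intervals [lo, lo + width): lo included, hi excluded.

    Hypothetical helper for illustration; assumes every value is >= start.
    """
    n_bins = math.floor((max(data) - start) / width) + 1
    counts = [0] * n_bins
    for x in data:
        # A value equal to a boundary (such as 7.5) goes into the upper interval.
        counts[math.floor((x - start) / width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


table = frequency_table(tar, 2.5)
for lo, hi, count in table:
    print(f"{lo:4.1f} to {hi:4.1f}-   {count}")
print("Total", sum(count for _, _, count in table))
```

Calling frequency_table(tar, 5.0) instead gives the counts for the broader intervals 0 to 5⁻, 5 to 10⁻, and so on.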
Table 1.11 Frequency Table of Tar (mg) per Cigarette
Interval       Frequency (f)
0–2.5⁻          1
2.5–5.0⁻        3
5.0–7.5⁻        1
7.5–10.0⁻       5
10.0–12.5⁻      2
12.5–15.0⁻      5
15.0–17.5⁻      7
Total          24
What does the histogram of Figure 1.3 tell us about the data, that is,
about tar content? We see that although some of the 24 brands
have relatively low amounts of tar, more of the brands are near the high
end of the range of tar contents measured. Further, we see that none of
the brands has a tar content greater than 17.5 mg. Thus the graph shows
the frequencies of the data falling in various intervals, and this shows us the
overall shape of the data.
We might ask why there is no value exceeding 17.5 mg. Does that value
represent the natural tar content of tobacco, and does the shape of the graph
(high to the right, low to the left) arise from the fact that some manufacturers
remove tar from their tobacco? To answer such questions, we must leave
statistics; we might ask a chemist, for example.
What if, instead of using the ranges (or intervals) 0 to 2.5⁻, 2.5 to 5⁻,
and so on, we had used the broader ranges 0 to 5⁻, 5 to 10⁻, and so on? Of
course we could construct a new frequency table, starting from the original
data, as we did above. But we can now more easily work directly from
Figure 1.3. Looking at our first group, it is clear that the number of cigarette
brands whose cigarettes contain between 0 and 5.0 mg of tar is 4, the sum of
the first two bars (1 + 3). Similarly, the number of brands having a tar
content from 5 to 10 mg is 6 (1 + 5). We continue
in this manner and create the histogram in Figure 1.4.
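Combining bars amounts to adding the frequencies of adjacent intervals. A minimal sketch in Python, assuming the seven frequencies of Table 1.11 and a made-up helper name, combine_bars:

```python
# Frequencies of the seven width-2.5 intervals, as listed in Table 1.11.
fine_counts = [1, 3, 1, 5, 2, 5, 7]


def combine_bars(counts, group=2):
    """Add up each run of `group` adjacent bars.

    Hypothetical helper for illustration. If the number of bars is not a
    multiple of `group`, the final (shorter) run is summed on its own.
    """
    return [sum(counts[i:i + group]) for i in range(0, len(counts), group)]


wide_counts = combine_bars(fine_counts)
print(wide_counts)  # [4, 6, 7, 7]
```

The last wide bar (15 to 20⁻) comes from a single narrow bar, because no brand has a tar content of 17.5 mg or more.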
Have we lost or gained anything by graphing the data this way? We see
that the rectangles still increase in height as we move from low tar content
to high tar content, so our new graph still communicates the fact that more
of the 24 brands have higher tar content than lower. But the heights increase
more gradually than in Figure 1.3. And the two rightmost rectangles in
Figure 1.4 are of the same height, meaning that just as many of the brands
have a tar content between 10 and 15 mg as between 15 mg and 20 mg.
Judging from Figure 1.4 alone, we would not be able to say whether, within
the range from 10 mg to 20 mg, more of the cigarette brands had tar content
closer to 20 than to 10. All this suggests that by consolidating the bars of
Figure 1.3—that is, by combining the intervals—we have lost information
that we would prefer to keep. For these data we would do better to leave
the graph as it is in Figure 1.3.
[Figure 1.4: Histogram derived from that of Figure 1.3 by combining bars. Horizontal axis: Amount of tar (mg), 0 to 20. Vertical axis: Number of brands, 0 to 10.]
It is also possible to graph a greater number of bars than is useful. For
example, is the histogram of Figure 1.5 very useful in displaying the general
shape of the data set it was constructed from? If we choose wider bars, we
obtain the histogram of Figure 1.6. Thus we discover that the shape of the
histogram is rather flat over the range of the data, a very useful piece of
information. Clearly, Figure 1.5 hid from us this important fact about the
approximate flatness in the shape of the data.
[Figure 1.5: A histogram with too many bars. Horizontal axis: 0 to 80. Vertical axis: 0 to 10.]
[Figure 1.6: Histogram derived from that of Figure 1.5 by combining bars. Horizontal axis: 0 to 80. Vertical axis: 0 to 10.]
As a general rule of thumb, 5 to 15 bars are usually appropriate. Too
few flatten out important local aspects of the general shape of the data, as
we saw in Figure 1.4. Too many lead to a wide and accidental variation of
bar heights that hides the general shape of the data, as we saw in Figure 1.5.
Clearly, the more data points there are, the more bars are appropriate. For
example, 200 data points could allow 15 bars, whereas 30 data points might
be best displayed with 5 to 6 bars. This decision is a judgment call on the
part of the user; rigid rules about this are not appropriate.
Histograms can be readily made from stem-and-leaf tables, which are
basically just a special kind of histogram. Let’s look at the stem-and-leaf table
of annual earnings to the nearest $1000 in Table 1.12. By simply enclosing
each row (set of leaves) in a bar, we have a respectable histogram. See
Figure 1.7a. We can also count how many leaves there are for each stem
and redraw our histogram showing the frequencies (how many leaves there
are) along one axis of the graph. See Figure 1.7b. It is more common to
use vertical bars instead of horizontal bars in bar graphs. In that case, our
histogram of the annual earnings looks like the one in Figure 1.8. Note that
the horizontal axis scale (salaries in $1000s) has been included, too; we put
in the scale of the horizontal axis so that it is clear which income each bar
represents, with the first interval from 10 to 20⁻, and so on. Thus, as stated
above, a stem-and-leaf plot is really a special case of a histogram.
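The step from stem-and-leaf table to histogram is just a count of the leaves on each stem. The following Python sketch illustrates this with the leaves of Table 1.12; the variable names and the text-only bar display are choices made for this example, not part of the original figures.

```python
# Table 1.12: each stem (tens digit) with its leaves (units digit),
# annual earnings in thousands of dollars.
stem_leaf = {
    1: [6, 7],
    2: [9, 9, 9, 2, 7, 7, 8, 3, 5],
    3: [4, 5, 1, 2, 6, 6, 3, 9],
    4: [6, 9, 7, 7, 7, 5, 9, 8, 6, 3, 0, 3, 4, 8, 3],
    5: [3, 5, 3, 7, 7, 9],
    6: [3, 2, 2, 4, 4, 0, 0],
    7: [7, 4],
}

# The frequency for each stem is the number of leaves in its row.
frequencies = {stem: len(leaves) for stem, leaves in stem_leaf.items()}

# Draw one horizontal bar per stem; the stem s covers the interval
# from 10*s up to, but not including, 10*(s + 1).
for stem, f in frequencies.items():
    lo, hi = 10 * stem, 10 * (stem + 1)
    print(f"{lo:2d} to {hi:2d}-  {'#' * f}  ({f})")
```

Each printed row has the same length as the corresponding row of leaves, which is why the stem-and-leaf table already looks like a histogram turned on its side.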
Using bars of unequal widths is occasionally useful. For example, we
may use wider bars in regions where the data are more sparse. Exercise 7(b)
and (c) provides such an example.
Table 1.12 Stem-and-Leaf Plot of Annual Earnings (Thousands of Dollars)
Stem   Leaf
1      6,7
2      9,9,9,2,7,7,8,3,5
3      4,5,1,2,6,6,3,9
4      6,9,7,7,7,5,9,8,6,3,0,3,4,8,3
5      3,5,3,7,7,9
6      3,2,2,4,4,0,0
7      7,4
Key: “1 6” stands for $16,000.
[Figure 1.7: Histogram derived from stem-and-leaf plot of Table 1.12. Panel (a): each row of leaves enclosed in a horizontal bar. Panel (b): the same horizontal bars drawn against a frequency axis from 0 to 15.]
[Figure 1.8: Histogram of Figure 1.7 put in usual position. Horizontal axis: Annual salary ($1000s), 0 to 80. Vertical axis: frequency, 0 to 16.]
SECTION 1.4 EXERCISES
1. Draw a histogram of the data given in the
stem-and-leaf plot below. The data (from Exercise 4 of Section 1.2) are the scores of 20
students on a statistics test.
Stem   Leaf
7      2,3,4,5,9
8      0,2,3,4,4,5,7,8
9      0,1,2,3,5,6,6
Key: “8 2” stands for 82.
2. Draw the stem-and-leaf plot in Exercise 1 with
six stems instead of three. Draw a histogram
of the data. Does the addition of more stems
give a clearer picture of the data?
3. Draw a histogram of the data in Exercise 2
of Section 1.2 using five bars: 180–190⁻, 190–200⁻,
and so on. Now draw a separate histogram with
each bar split in half: 180–185⁻,
185–190⁻, and so on. Which histogram gives
you a clearer picture of the data? Note: 190⁻
means that the bar excludes 190, and so on.
4. Toss a coin 30 times and keep track of the numbers of heads and tails. Construct a histogram
to show your results. (Hint: Your histogram
will have only two bars.) If you perform the
experiment again, would you expect to get the
same results? Why or why not?
5. Fingerprints can be classified according to
the number of ridges between “loops” in the
patterns. The number of ridges is called the
ridge count for a particular person. Suppose
we have the following ridge counts for the
fingerprints of 20 people:
189   207   186   205   198
181   201   189   213   192
205   185   192   207   215
210   188   194   213   220
Draw a stem-and-leaf plot of these data. Then
draw a histogram.
6. Draw a histogram of the high temperatures
from Exercise 6 of Section 1.2, choosing the
bars so that you get the “best” picture of the
data. Explain your choice of bars.
For additional exercises, see page 715.
1.5 RELATIVE FREQUENCY HISTOGRAMS AND POLYGONS
As we have seen, the histogram is a powerful graphical tool for showing
the general shape, or distribution, of a data set. Consider the frequency
table of 30 exam scores given by Table 1.13. We first note that we can
rescale these frequencies relative to the total number of data points, 30,
by dividing each frequency by 30. Doing so converts them to proportions,
also called relative frequencies. In this textbook such proportions are
thought of as probabilities, and in fact they are often called experimental
probabilities and are denoted by P. We will study them in detail in Chapter 5.
Table 1.13 Frequency Table of Exam Scores
Score    Frequency (f)
75        1
76        0
77        2
78        3
79        4
80       11
81        5
82        3
83        1
Total    30
Table 1.14 Proportions of Persons Obtaining Various Exam Scores, Used for Figure 1.9
Score     f     P
75        1    0.03
76        0    0
77        2    0.07
78        3    0.10
79        4    0.13
80       11    0.37
81        5    0.17
82        3    0.10
83        1    0.03
Total    30    1.00
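The P column of Table 1.14 comes from dividing each frequency by the total of 30 and rounding to two decimal places. A short Python sketch of that arithmetic, with variable names chosen only for this illustration:

```python
# Scores and frequencies from Table 1.13.
scores = [75, 76, 77, 78, 79, 80, 81, 82, 83]
freqs = [1, 0, 2, 3, 4, 11, 5, 3, 1]

total = sum(freqs)  # number of data points (30 exam scores)

# Relative frequency (proportion) = frequency / total number of data points.
props = [f / total for f in freqs]

for score, f, p in zip(scores, freqs, props):
    print(f"{score}   {f:2d}   {p:.2f}")
print(f"Total {total}   {sum(props):.2f}")
```

Because the proportions are rounded for display, a column of rounded values will not always add up to exactly 1.00, although it happens to do so here.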