The Stata Journal

The Stata Journal
Editor
H. Joseph Newton
Department of Statistics
Texas A & M University
College Station, Texas 77843
979-845-3142; FAX 979-845-3144
[email protected]
Editor
Nicholas J. Cox
Department of Geography
Durham University
South Road
Durham City DH1 3LE UK
[email protected]
Associate Editors
Christopher F. Baum
Boston College
Rino Bellocco
Karolinska Institutet, Sweden and
Univ. degli Studi di Milano-Bicocca, Italy
A. Colin Cameron
University of California–Davis
David Clayton
Cambridge Inst. for Medical Research
Mario A. Cleves
Univ. of Arkansas for Medical Sciences
William D. Dupont
Vanderbilt University
Charles Franklin
University of Wisconsin–Madison
Joanne M. Garrett
University of North Carolina
Allan Gregory
Queen’s University
James Hardin
University of South Carolina
Ben Jann
ETH Zürich, Switzerland
Stephen Jenkins
University of Essex
Ulrich Kohler
WZB, Berlin
J. Scott Long
Indiana University
Thomas Lumley
University of Washington–Seattle
Roger Newson
Imperial College, London
Marcello Pagano
Harvard School of Public Health
Sophia Rabe-Hesketh
University of California–Berkeley
J. Patrick Royston
MRC Clinical Trials Unit, London
Philip Ryan
University of Adelaide
Mark E. Schaffer
Heriot-Watt University, Edinburgh
Jeroen Weesie
Utrecht University
Nicholas J. G. Winter
University of Virginia
Jeffrey Wooldridge
Michigan State University
Stata Press Production Manager
Stata Press Copy Editor
Lisa Gilmore
Gabe Waggoner
Jens Lauritsen
Odense University Hospital
Stanley Lemeshow
Ohio State University
Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
c by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) are copyright help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible web
sites, fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata and Mata are
registered trademarks of StataCorp LP.
The Stata Journal (2007)
7, Number 3, pp. 413–433
Speaking Stata: Turning over a new leaf
Nicholas J. Cox
Department of Geography
Durham University
Durham City, UK
[email protected]
Abstract. Stem-and-leaf displays have been widely taught since John W. Tukey
publicized them energetically in the 1970s. They remain useful for many distributions of small or modest size, especially for showing fine structure such as digit
preference. Stata’s implementation stem produces typed text displays and has
some inevitable limitations, especially for comparison of two or more displays.
One can re-create stem-and-leaf displays with a few basic Stata commands as
scatterplots of stem variable versus position on line with leaves shown as marker
labels. Comparison of displays then becomes easy and natural using scatter,
by(). Back-to-back presentation of paired displays is also possible. I discuss variants on standard stem-and-leaf displays in which each distinct value is a stem,
each distinct value is its own leaf, or axes are swapped. The problem shows how
one can, with a few lines of Stata, often produce standard graph forms from first
principles, allowing in turn new variants. I also present a new program, stemplot,
as a convenience tool.
Keywords: gr0028, stemplot, stem-and-leaf, graphics, distributions, digit preference
1
Introduction
I was attracted to stem-and-leaf displays when I first learned about them in the mid1970s from John Tukey’s writings. They were a fixture in my introductory courses from
the start of my teaching career. But over the last decade or so, I have come to use
them less. As usually implemented, they are less flexible than dot or strip displays.
This column is a story of how one can produce stem-and-leaf displays by using Stata’s
graphics, thus increasing their versatility. Within this story, other questions will emerge:
How far can you get using a few basic commands? When is it better to wrap up those
commands within a program?
2
Stem-and-leaf displays
Just in case you have missed out on stem-and-leaf displays, let us run through the basic
idea and its implementation in Stata as stem.
The name stem-and-leaf is attributable to Tukey (1972, 1977). Their widespread
use is undoubtedly the result of his energetic and enthusiastic advocacy. The basic idea
appears in other places, including Dudley (1946, 22), as reproduced by Emerson and
c 2007 StataCorp LP
gr0028
414
Speaking Stata
Hoaglin (1983, 19), and Japanese railway timetables, as reproduced by Tufte (1990,
45–46). Among many textbook accounts are those of Griffiths, Stirling, and Weldon
(1998); Wild and Seber (2000); and Moore and McCabe (2006).
Let us look at some examples.
. sysuse auto
(1978 Automobile Data)
. stem length
Stem-and-leaf plot for length (Length (in.))
14*
15*
16*
17*
18*
19*
20*
21*
22*
23*
279
45567
133455589
00002234457999
02469
23356788889
00001113446667
224788
00112
03
Each value (e.g., 142) is split into so-called stem (e.g., 14) and leaf (e.g., 2) parts.
Here each distinct stem is used just once, but there is no rule about that. Another
example shows how stems are repeated on two or more lines:
. stem mpg
Stem-and-leaf plot for mpg (Mileage (mpg))
1t
1f
1s
1.
2*
2t
2f
2s
2.
3*
3t
3f
3s
3.
4*
22
44444455
66667777
88888888899999999
00011111
22222333
444455555
666
8889
001
455
1
Here, at least for English speakers, accidents of language make t, f, and s easy
labels for two or three, four or five, and six or seven, respectively. In general, a stem
may be used most easily one, two, or five times with decimal representations.
2.1
A note on units
In most countries in the world, the units used in examples in this column are not
standard, so I will explain. The car lengths in the previous example are measured in
inches. An inch is defined to equal 25.4 mm. Twelve inches = 1 foot, 5,280 feet = 1 mile,
and 1 U.S. gallon is about 3.785 liters. Hence 1 mile per gallon (mpg) is about 0.425
N. J. Cox
415
km per liter or 2.352 liter per km. Twenty miles per gallon, the median in the auto
dataset, is about 8.50 km per liter. Pennycuick (1988) and Darton and Clark (1994)
are two of many useful references on units of measurement.
2.2
Advantages
The display is a graphical condensation of the list of ordered values 142, 147, 149, . . . ,
230, 233. Once you realize that, its key features are immediate, which is much of the
attraction:
• A low-technology method. You can do stem-and-leaf displays by hand, using any
convenient piece of paper. (Paper with large squares keeps things tidy.) Evidently,
that consideration weighs less and less in an era when many readers take laptops
with them wherever they go. But it still has weight, especially as a way of getting
children and other learners to look at distributions in small, simple datasets. Similarly, one can easily type or print a stem-and-leaf display. That was an advantage
to many in the mid-1970s, but again it counts for little 30 years later. Indeed,
handling typed displays in some fixed-width font may now be less familiar and
more awkward for you than handling graphic images.
• Keeps most or all information. Each value is split into stem and leaf parts. For
length, there is no more detail, so the stem-and-leaf display with leaves that are
single digits loses no information about magnitudes. Whenever that is not true,
there are various possibilities, including rounding the data and using leaves that
are two or more digits.
• Shows distributions in histogram style, but including fine structure. The ordering
and grouping yield a histogram on its side, which is, presumably, why stem-andleafs are usually shown with magnitude increasing down the page. A rotation of
90◦ counterclockwise yields a histogram. Demonstrating this effect physically by
rotating an acetate transparency in the classroom is, unfortunately, now regarded
as technologically incorrect. The stem-and-leaf display, however, retains more
detail (as seen, often all the detail) of individual data points. That quality is often
useful, especially for inspecting outliers and other extremes and any granularity in
the distribution, such as may arise from digit preference (Preece 1981). A display
need not appear in published reports for it to be useful in exploratory checking of
your data.
• Allows calculation of measures on the basis of ordered values. The ordering allows
fairly easy hand calculation of summary measures on the basis of order statistics
(median, quartiles, and so forth). Tukey’s low-technology definitions of what later
were called letter values (Hoaglin 1983) involved nothing more complicated than
splitting the difference between adjacent order statistics. (Doing so makes the
definitions trickier to extend to variables associated with weights.) This detail still
deserves brief emphasis in introductory texts and courses but is of little importance
for researchers able to calculate such measures directly with software such as Stata.
416
Speaking Stata
2.3
Limitations
The stem ([R] stem) command was first introduced in Stata 3.0 in 1992. It does what
it does well and has smart defaults based on the distribution of each variable. Its main
features include handles for tuning the display and the possibility of repeating displays
for distinct groups with a by() option.
Given these advantages, why are stem-and-leaf displays not used more? The limitations are just about as obvious:
• Problems with large datasets. Stem-and-leaf displays work well for datasets of
small or modest size but otherwise may easily be awkward, if not impracticable.
The length of the longest line, the number of lines, and the legibility of leaves can
quickly become problematic, and there are few easy trade-offs. Using some kind of
dot plot (in the sense of [R] dotplot, not [G] graph dot) is then more practical:
we must give up on the idea of a leaf and use some small point symbol instead.
Alternatively, many people would turn to histograms, kernel density estimates,
quantile plots, or other representations.
• Are final digits interesting? The last digit(s) may often not be at all interesting or
informative, especially when data were produced by some automated measurement
procedure. (Whenever people produce numbers, there is a chance of fine structure,
perhaps trivial, as a result of digit preferences or other quirks.) If the detail of
leaves is noise to the user, then the simplicity of dot plots is again preferable.
• Comparison can be awkward. Comparison of two or more stem-and-leaf displays
can be difficult or inefficient unless they are displayed side by side with a common
magnitude scale. The action of stem, almost inevitably, is to emit displays in
sequence, so only reediting in text or word processing software could achieve this.
Aligning displays, line length, and legibility of leaves could again prove problematic.
3
Stem-and-leaf displays as scatterplots
These limitations are serious, but they do not rule out the usefulness of stem-andleaf displays in all circumstances. Moreover, some are largely consequences of using
printed text to show stem-and-leaf displays as, in essence, tables rather than graphs.
For example, keeping track of various small tables as bundles of characters and trying
to emit them properly aligned would be a small nightmare, but getting small graphs
aligned as you wish is easy, because you just tell Stata to do it.
A fresh look at the problem grows from the realization that stem-and-leaf displays
are essentially a kind of scatterplot. The ingredients include the following:
1. The variable in question, or a rounded variant of it, gives the y coordinate. For
that, we may need to learn a little about how to round.
N. J. Cox
417
2. The position of each leaf on each line gives the x coordinate. We will have to
construct this. A natural start is just to ensure that the first, second, and later
leaves on each line are plotted against 1, 2, and so forth. (Only for one purpose
will we need to do anything more complicated.)
3. A leaf must be shown at each point (x, y), rather than a point symbol. For that,
we may need to learn a little about how to extract the needed digits. Leaves
will be shown as marker labels, and marker symbols will be suppressed. If you
are not familiar with marker labels, check out [G] marker label options or the
equivalent online help.
In following this basic idea, we need not commit ourselves to producing exact replicas
of typed or printed displays. Using graphics rather than text may well mean that
different design choices are easier, more attractive, or more informative.
3.1
From first principles
The official dotplot ([R] dotplot) command yields displays similar to stem-and-leaf
displays, but showing point symbols rather than leaves. stripplot on SSC is a loosely
similar user-written command. So, we could try tricking either into showing leaves. A
more direct approach is to try building up a command for ourselves from first principles.
This strategy strengthens understanding and often shows that a few simple commands
will give us the result we desire, thus avoiding the writing of a program.
My habit is to start with small toy examples and get something simple working
quickly, not to play with the dataset I really want to work with (which is likely to be
much larger), or to sketch an elaborate design in advance. (This is not necessarily good
advice for large, complicated programming projects.) In that spirit, consider mpg in
the auto dataset, the second stem-and-leaf example above. The values are all integers,
ranging from 12 to 41. The simplest choice, at first sight, is to use the second digits as
leaves. One way to do that is to extract the second digit as the remainder on division
by 10:
. gen leaf = mod(mpg, 10)
Another way is to extract it as the last character of the string equivalent of mpg:
. gen leaf = substr(string(mpg), -1, 1)
Here -1 as the second argument indicates the last character in a string. Naturally,
. gen leaf = substr(string(mpg), 2, 1)
would produce the same result with mpg, but one needs only a little foresight to see that
this is less general, applying only to two-digit integers. scatter will happily use either
numeric or string variables as marker labels, so the choice is for us to make on other
grounds.
418
Speaking Stata
We need x and y coordinates. We need an x coordinate more, so let’s get that first,
and just use each distinct value as its own stem:
. by mpg, sort: gen x = _n
Under by:, the observation number n is evaluated separately within distinct groups
defined by its argument. Within groups of distinct values of mpg, it thus takes on the
values 1, 2, and so forth, exactly as desired. If you want a more detailed tutorial on
by:, see Cox (2002).
Figure 1 shows our initial stem-and-leaf plot:
. scatter mpg x, mla(leaf) mlabpos(0) ms(none)
5
4
5
1
0
9
8
0
8
8
6
5
4
3
2
1
0
9
8
7
6
5
4
6
5
4
3
2
1
0
9
8
7
6
5
4
6
5
4
3
2
1
0
9
8
7
6
4
2
2
5
4
5
2
1
2
1
9
8
7
6
9
8
9
8
4
4
4
9
8
9
8
8
10
20
Mileage (mpg)
30
40
1
0
2
4
6
8
10
x
Figure 1: Stem-and-leaf plot for miles per gallon with scatter. Each distinct value
serves as a stem. Final digits serve as leaves.
When we suppress marker symbols and show marker labels instead, using mlabpos(0)
to center labels on where symbols would have been is often best.
The result is not especially pretty. More importantly, we are most of the way to
where we want to be after just three lines. My first thoughts on this graph are in terms
of simple cosmetic improvements. I do not want the x-axis title. I do not want the
x-axis labels (although others might want a frequency scale on that axis). I do want
horizontal y-axis labels. To match stem-and-leaf conventions, I do not want them to be
ticked. I do not want the associated grid lines (although this preference also is a matter
of taste). With a few minor changes, we produce figure 2.
419
N. J. Cox
. scatter mpg x, mla(leaf) mlabpos(0) ms(none) yla(, ang(h) noticks nogrid)
> xla(none) xti("")
1
40
Mileage (mpg)
5
4
5
30
1
0
9
8
8
8
20
6
5
4
3
2
1
0
9
8
7
6
5
4
6
5
4
3
2
1
0
9
8
7
6
5
4
6
5
4
3
2
1
0
9
8
7
6
4
2
2
0
5
4
5
2
1
2
1
9
8
7
6
9
8
9
8
4
4
4
9
8
9
8
8
10
Figure 2: Same as figure 1 but with cosmetic improvements to axis labels and titles
If you are working through this sequence with your copy of Stata, you might be
using a graph scheme that sets up a default of yla(, ang(h)), but making that choice
explicit will do no harm.
My next thoughts are that I have more space to play with than I expected. Figure 3
suggests a variant in which each value is its own leaf:
. scatter mpg x, mla(mpg) mlabpos(0) ms(none) yla(, ang(h) noticks nogrid)
xla(none) xti("")
(Continued on next page)
420
Speaking Stata
41
40
Mileage (mpg)
35
34
35
30
31
30
29
28
28
28
20
26
25
24
23
22
21
20
19
18
17
16
15
14
26
25
24
23
22
21
20
19
18
17
16
15
14
26
25
24
23
22
21
20
19
18
17
16
14
12
12
30
25
24
25
22
21
22
21
19
18
17
16
19
18
19
18
14
14
14
19
18
19
18
18
10
Figure 3: Same as figure 2, but now each value serves as its own leaf.
Now the y-axis labels are redundant, and we can take them off. If we do that, we
will probably want to keep a gap in the same space, which is shown in figure 4.
. scatter mpg x, mla(mpg) mlabpos(0) ms(none) xla(none) xti("") yla(none)
> ysc(titlegap(4))
41
Mileage (mpg)
35
34
31
30
29
28
35
30
28
28
26
25
24
23
22
21
20
19
18
17
16
15
14
26
25
24
23
22
21
20
19
18
17
16
15
14
26
25
24
23
22
21
20
19
18
17
16
12
12
14
25
24
25
22
21
22
21
19
18
17
16
19
18
19
18
14
14
14
19
18
19
18
18
Figure 4: Same as figure 3, but now y-axis labels have been removed.
N. J. Cox
421
Our result is no longer a conventional stem-and-leaf display, but no matter: it is showing
the same information in readable form.
Using distinct values as stems will work well only if there are enough ties, that is,
not too many distinct values. (Some people say unique values, but for others that term
means those values that occur just once.) You can see the number of distinct values by
following a tabulate with display r(r), because the number of rows in the table is
saved immediately after a tabulate. mpg has 21 distinct values.
If you try this calculation with the first example above, length, the resulting graph
(not shown here) is too messy. However, we can reuse most of the code. The main
change is the need to recalculate the x coordinate.
. by length, sort: replace x = _n
. scatter length x, mla(length) mlabpos(0) ms(none) xla(none) xti("")
> yla(none) ysc(titlegap(4))
There are 47 distinct values of length. In general, performing tabulate beforehand
on a potential stem variable will give a foretaste of the number of stems needed and
their lengths. My own rule is that more than about 30 stems is usually too messy,
unless you are happy to produce a tall graph and can convince those in power over you
(advisors, supervisors, reviewers, editors) that it is a good idea.
3.2
Rounded stems
We now need to focus on how to produce rounded stems. stem’s choices for length,
which varies between 142 and 233 inches, are 14(0), . . . , 23(0), which are simple and
appear good. Your first thought may be to extract the stem as the first two characters
of the string equivalent of length, as, say, substr(string(length), 1, 2), but this
choice would pose two problems.
First, we need a numeric variable for graphing. That problem is easily soluble:
convert the string to numeric by wrapping the expression with real() to give
real(substr(string(length), 1, 2)).
Second, a deeper problem is that extracting the first few digits will give some wrong
answers for any variable with different numbers of digits. (Consider the results if values
varied between 142 and 2,330 instead.) Approaching the problem directly as a numeric
calculation is better in general. Thus the stems for length are obtainable by dividing
by 10 and rounding down:
. gen stem = int(length/10)
The floor() function would do as well as the int() function so long as values are
not negative. However, if values are ever negative, floor(), which always rounds down,
gives the wrong answer. int(), which always rounds towards zero, is preferable. The
stem for −142 should be −14, not −15.
422
Speaking Stata
This rule would work well enough to re-create stem’s rendering of length, but we
need a more general solution for whenever the width (in stem’s terminology) is not 10,
or even any other exact power of 10. Suppose that we wanted twice as many stems than
is given by 10. A rule that was
. gen stem = int(length/5)
would yield a stem of 28 for values 140–144, 29 for values 145–149, and so forth. But we
would not want those stems to be shown as axis labels. A better command for graph
purposes would be
. gen stem = width * int(varname / width)
So, we should start with
. gen stem = 10 * int(length/10)
However, there is a better approach that will save us work in the long run. Using
clonevar ([D] clonevar) first,
. clonevar stem = length
. replace stem = 10 * int(length/10)
has a pleasant result. The effect of clonevar is that stem is born an exact copy of
length, notably with the same variable label. That is a useful detail for graphing.
Naturally, following clonevar with replace does not affect the variable label.
Let us put together in one place the code for a better stem-and-leaf plot for length,
starting from nothing. We will need to calculate the x coordinate by using the variable
stem—ensuring that the leaves are in the right order—and the leaf by using length
itself. The code will produce figure 5.
.
.
.
.
.
.
.
>
sysuse auto, clear
clonevar stem = length
replace stem = 10 * int(length/10)
sort stem length
by stem: gen x = _n
gen leaf = substr(string(length), -1, 1)
scatter stem x, mla(leaf) mlabpos(0) ms(none) yla(, ang(h) noticks nogrid)
xla(none) xti("")
423
N. J. Cox
240
Length (in.)
220
200
180
160
140
0
3
0
0
1
1
2
2
2
4
7
8
8
0
0
0
0
1
1
1
3
4
4
6
2
3
3
5
6
7
8
8
8
8
9
0
2
4
6
9
0
0
0
0
2
2
3
4
4
5
7
1
3
3
4
5
5
5
8
9
4
5
5
6
7
2
7
9
6
6
7
9
9
9
Figure 5: Stem-and-leaf plot for car lengths. Stems are based on intervals of width
10 in. Leaves are final digits.
From here, there are various ways to proceed. One is to make explicit all stems (and
follow the convention of showing them truncated) by adding
yla(140 "14" 150 "15" 160 "16" 170 "17" 180 "18" 190 "19" 200 "20" 210
"21" 220 "22" 230 "23")
Another is to note that there is plenty of white space to spare. So we could use the
values as their own leaves and suppress the y-axis labels:
. scatter stem x, mla(length) mlabpos(0) ms(none) yla(none) xla(none) xti("")
> ysc(titlegap(4))
3.3
Comparisons
One of the limitations of the stem command is that with a by() option, stem-and-leaf
displays for different groups can be produced only in vertical sequence. Being able to
compare displays that are aligned horizontally would be better, suggesting the use of
scatter with a by() option. The only questions are what difference this might make
to preprocessing and what extra details we might need to think about graphically.
Let us return to mpg and examine its variation with groups of foreign or rep78.
For simplicity, values will be their own leaves. For clarity, we will start again from the
top to show the full sequence of commands, which will produce figures 6 and 7.
424
Speaking Stata
.
.
.
>
sysuse auto, clear
by foreign mpg, sort: gen x = _n
scatter mpg x, by(foreign, compact) xla(none) xti("") ms(none) mla(mpg)
mlabpos(0) yla(none)
Domestic
Foreign
41
35
35
Mileage (mpg)
34
30
29
28
26
25
24
31
30
28
28
26
24
22
21
20
19
18
17
16
15
14
22
21
20
19
18
17
16
15
14
12
12
26
25
24
23
24
22
21
20
19
18
22
22
19
18
19
18
16
16
14
14
19
18
19
18
25
25
23
23
21
21
18
17
18
17
25
19
14
14
Graphs by Car type
Figure 6: Stem-and-leaf plots of miles per gallon for domestic and foreign cars. Each
value serves as its own leaf.
. by rep78 mpg, sort: gen X = _n
. scatter mpg X, by(rep78, compact row(1)) xla(none) xti("") ms(none)
> mla(mpg) mlabpos(0) yla(none)
1
2
3
4
5
41
Mileage (mpg)
35 35
34
30
29
28
28 28
26
25
24
22
18
25 25 25
24
23 23
22
21 21
24 24
18 18
17
16
14
23
22
21
20
19
18
17
16
15
14
31
30
22
21 21
20 20
19 19 19 19 19 19
18
18 18
16
14
25
18 18
17 17
16
15
14 14
12 12
Graphs by Repair Record 1978
Figure 7: Stem-and-leaf plots of miles per gallon by repair record. Each value serves as
its own leaf.
N. J. Cox
425
The calculation of the x coordinate needs to be done separately by the grouping
variable. Suboptions of by() may come into play. compact and row(1) are useful.
However, beware that total does not do what you might hope, because it necessarily
superimposes different groups.
3.4
Back to back
When two stem-and-leaf plots are compared, some people like to plot them back to
back. The result loosely resembles the pyramidal displays often used to compare age
structure of populations for males and females. We could discuss whether the result is
more attractive or more informative, but let us focus on how it would be done. Look
again at figure 6. On the right-hand side the plot for foreign cars is as we would wish
for a back-to-back plot. What we need to change is the display on the left-hand side
for domestic cars. Each line should be reversed and placed against the right-hand side.
foreign is 0 for domestic cars (and 1 for foreign cars). Then the command
. by foreign mpg, sort: gen bbx = cond(foreign == 0, -_n, _n)
assigns negative x coordinates, −1 down, on the left-hand side and positive coordinates,
1 and up, on the right-hand side. (If you want a tutorial on cond(), see Kantor and Cox
[2005].) We need to ensure that the largest value of the x coordinate appears in the
left-hand panel and the smallest value in the right-hand panel. That way, the data
points in the left-hand display nearly touch the right-hand edge of their display and
vice versa. Some analysis shows that this goal is achieved by a translation (shunt, in
plainer English) of the left-hand panel coordinates to the right:
. summarize bbx, meanonly
. replace bbx = bbx + max(-r(min), r(max)) + 1 if foreign == 0
Let us look at that again. By construction, the x coordinates are negative on the lefthand side and positive on the right-hand side. Thus r(min), which holds the minimum
value of the x coordinates bbx after a summarize, is always negative. Its negation
-r(min) is the length of the longest stem in the left-hand panel. r(max) is the length
of the longest stem in the right-hand panel. Our translation is of whichever is larger,
plus 1, to the right.
If you care about the detailed logic of that translation, focus on some examples.
(Otherwise, skip to the next paragraph.) Suppose that the longest stems have length 4
on the left-hand side and length 3 on the right-hand side. These will have x coordinates
−4, −3, −2, −1 and 1, 2, 3, respectively. For the displays to touch as desired, −4 must
be translated to a number at least 1 and −1 must be translated to a number at least
3. The gentlest solution is the mapping from −4, −3, −2, −1 to 1, 2, 3, 4, namely,
adding 5, the length of the longer stem plus 1. If the stems were reversed, namely, with
x coordinates −3, −2, −1 and 1, 2, 3, 4, the same solution would apply. If we had two
stems of equal length, say, −3, −2, −1 and 1, 2, 3, the same rule would apply: the
longer stem (either one) has length 3, and adding 4 maps −3, −2, −1 to 1, 2, 3. We
do not see any of these values, because the x-axis labels are all suppressed, but we do
need to do this for the graph to appear as we would wish.
426
Speaking Stata
Once the translation is done, we have all we want, as shown in figure 8:
. scatter mpg bbx, by(foreign, compact) xla(none) xti("") ms(none) mla(mpg)
> mlabpos(0) yla(none)
Domestic
Foreign
41
35
35
Mileage (mpg)
34
28
26
24
19
19
18
19
18
22
22
19
18
19
18
22
21
20
19
18
16
16
14
14
14
31
30 30
29
28 28
22
21
20
19
18
17
16
15
14
26 26
25 25
24 24
23
22
21 21
20
19
18 18
17 17
16
15
14 14
12
12
24
25
25
23
23
25
21
18
17
Graphs by Car type
Figure 8: Back-to-back stem-and-leaf plots of miles per gallon for domestic and foreign
cars
In typed back-to-back stem-and-leaf displays, stems are shown on a spine between
the two parts. Having the stems as axis labels, whether shown or suppressed, is more
straightforward when we do it graphically. A second set of labels could be shown on
the right-hand axis if desired.
3.5
Other possibilities
By no means have we exhausted the possibilities for simple variations on the basic
theme. Here are some more.
• Reversing y-axis scale. So far, as you will have noticed, we have followed the usual
graph convention in which response magnitude, shown on the y axis, increases
upward. The convention with text stem-and-leaf displays is the opposite. The
graph option ysc(reverse) will produce that.
• Different leaf choices. Although we have considered in detail how to produce
leaves from some or all of the digits of the variable whose distribution is being
shown, doing so is not compulsory. Something different could be used for the
leaf, including values coding for some categorical variable. The leaf is whatever is
specified as marker label with the mlabel() option.
427
N. J. Cox
• Swapping axes. We can easily swap the y and x axes to produce a display in which
magnitude is horizontal and leaves are stacked vertically.
• Changing leaf size. Being able to tweak the marker label size with the mlabsize()
option is useful.
• Aspect ratio. We could also change the aspect ratio to use more or less horizontal
space, often simply by tuning xsize().
4
stemplot
One of my main aims in these columns is to emphasize how much you can do with a few
basic commands to get where you want to be. Here being able to create or re-create basic
graphs and to devise your own variations on them gives flexibility born from freedom.
You do not depend on whether someone has written a program to do what you want
to do or on whether you can find such a program if it exists. Nevertheless, producing
even simple results can require mastery of fiddly little details: the manipulation needed
to produce back-to-back displays is one example. Many users may not relish the minor
acrobatics that may be needed.
Therefore, as an alternative, I publish a stemplot command with this column, encapsulating the main ideas so far, and then some more.
4.1
Syntax
, by(byvar , byopts ) back digits(#)
format(format) width(#) horizontal graph options
stemplot varname
if
in
stemplot produces stem-and-leaf plots and similar plots for numeric variable varname by using Stata’s graphics. For stem-and-leaf plots as typed text displays, see
[R] stem or help stem.
4.2
Options
by(byvar , byopts ) specifies that displays are to be shown separately by groups defined by byvar. See [G] by option or help by option. The suboption total is
legal but will not work as may be desired, because it superimposes, not juxtaposes,
individual displays.
back specifies that a two-group display specified using by() be shown back to back.
digits(#) specifies the number of digits to be shown as leaves. The default is
digits(1). To be more precise, the rule for leaves is to show
substr(string(varname, "format"), -digits, digits)
428
Speaking Stata
or
string(varname, "format")
whenever that is empty. Here format is by default the display format of varname,
as shown by describe and set by format. See also the format() option described
below.
format(format) specifies a format other than the display format of varname to be
shown in determining leaves. See also the digits() option described above. This
option is seldom specified.
width(#) specifies the width of intervals defining distinct lines.
horizontal specifies a horizontal display in which the magnitude axis is horizontal and
leaves are stacked vertically. back is not allowed with horizontal.
graph options are any of the options documented in [G] graph twoway scatter. For
example, mlabel() overrides the program’s choice of leaf, and mlabsize() controls
the size of leaves.
4.3
Comments
How does the scope of stemplot differ from that of combinations of individual commands, as discussed previously? Here we note the most important features.
1. stemplot preserves sort order, as is standard programming practice for any program not intended to change that order.
2. There are more checks on user input.
3. stemplot allows specifying if and in.
4. Various options allow tuning of the display.
4.4
A last look at those cars
Suppose that we wanted to show some information about the make of cars. The full
name within the variable make is too long to fit comfortably on a stem-and-leaf display,
but we can compromise on the first word:
. gen M = word(make, 1)
Some experimentation shows that we should sort within stems, use a different choice
of mlabpos(), and stretch the x-axis scale so that the rightmost marker label fits:
. sort mpg M
. stemplot mpg, mla(M) mlabpos(3) xsc(r(. 10))
To save space, I do not show the result here, but you can experiment for yourself.
429
N. J. Cox
5
Choral finale
For a final example, we leave the world of automobiles and turn to choral music.
Chambers, Cleveland, Kleiner, and Tukey (1983, 350) listed the heights of singers in
the New York Choral Society in 1979. Cleveland (1993) plotted the dataset in various
ways; it is downloadable from http://www.stat.purdue.edu/˜wsc/visualizing.tables.txt.
Here we focus first on the broad subdivision between soprano, alto, tenor, and bass
parts, ignoring the distinction between first (higher pitch) and second (lower pitch)
groups for each part.
A general relationship between singer part and height is no surprise. What a stemand-leaf plot can show is the fine structure of the distributions, based on the singers’
own statements. Here the choice of units is important, because of their psychological
overtones. Although the data are reported in inches, people in countries like the United
States commonly think of their heights in feet and inches, not just in inches. An adult
height of 6 feet (72 inches) is moderately tall by most standards (although clearly not,
for example, that of basketball players). Similarly, an adult height of 5 feet (60 inches)
is rather short by most standards, or petite if you prefer. Fully metric readers might
note that 6 feet is about 1.83 meters and 5 feet about 1.52 meters.
Ordinary experience shows that although many adults are happy with their heights,
a few would prefer to be taller and a few, shorter. Also, some people may be selfconscious about being distinct from whatever they regard as a norm. Let us bear all
this in mind while examining these data.
Using stemplot is convenient here to produce figure 9.
. stemplot height, by(spart, row(1) compact noiytic
> note(units are inches, place(e))) yla(60 "5’" 66 "5’ 6’’" 72 "6’")
> yaxis(1 2) yla(60(6)72, axis(2) ang(h))
(Continued on next page)
430
Speaking Stata
Soprano
Alto
Tenor
Bass
66
55555555
4
333
22
2222222222222
11111111
1111111
000000
0000
000000000000
99
99999999
9999
88
88
888888
88888888
7777
777777
77
77
666666666
6666666666666
6666
666
55555555555555555555
555555555
5
44444
44444444
44
333333
3333333
22222222222
222
1111
1111
0000
0
6’
2
height
0
5’ 6’’
5’
44444
33
72
66
60
units are inches
Figure 9: New York Choral Society singers’ heights, 1979. Note not only the general
relationship between singer part and height but also intriguing hints of fine structure.
spart is a numeric variable with value labels, which we need here to show soprano, alto,
tenor, and bass in their natural order. A string variable would sort alphabetically from
alto to tenor, which would not be a good choice. Two y axes permit two sets of labels,
one in feet and inches and one in inches, making the comparison easier, even though all
readers can presumably divide small numbers by 12.
The plot does indeed show intriguing, although contradictory, fine structure. No
singers admit to being less than 60 inches (5 feet) tall. The modes for sopranos at 55
inches and for altos at 56 inches may be genuinely different, or they may reflect some
personal preferences. Why do no altos report themselves as 71 inches? Is there some
wishful thinking among some of the basses who claim 72 inches (6 feet)? There are
hints of other features that may owe more to psychology than to anatomy.
We can also bring the split into higher and lower groups for each part (first and
second sopranos, and so forth). One way to explore that is to use that variable as a
new leaf (figure 10).
. stemplot height, by(spart, row(1) noiytic
> note(1 = high; 2 = low, place(e))) mla(group)
> yla(60 "5’" 66 "5’ 6’’" 72 "6’") yaxis(1 2) yla(60(6)72, axis(2) ang(h))
431
N. J. Cox
Soprano
Alto
Tenor
Bass
12
11221221
1
12222
21
111
11
2221221212111
22221221
1211111
222122
2211
112112221111
21
22221222
1211
12
11
112122
11111122
2212
112121
11
22
211112121
2112121212211
2111
121
11111122111112112112
121221212
1
21222
12121222
11
112212
1112221
21222111211
111
1212
1111
2221
1
6’
1
height
2
5’ 6’’
5’
72
66
60
1 = high; 2 = low
Figure 10: New York Choral Society singers’ heights, 1979, by singer part and groups
with relatively high or low pitch
Naturally, any fine structure identified in this way should be tested by independent
evidence. We need to be clear that we are not seizing on quirks produced by sampling
variation. As Harrell (2001, ix) aptly remarks, “Using the data to guide the data
analysis is almost as dangerous as not doing so.” The opportunity for measuring these
singers again is long past, but there may be similar (preferably larger) datasets that
could provide some checks. Statistical and scientific caution aside, the stem-and-leaf
plot is one of the graphical tools to try when fine structure within distributions is of
interest.
6
Conclusion
Stem-and-leaf displays are one of the staples of exploratory statistical graphics. However, producing them as text displays, as with Stata’s stem command, has inevitable
limitations, especially for comparison of distributions across groups. Revisiting stemand-leaf as full-blown graphs requires no more than the idea of scatterplots with marker
labels. A few steps lead not only to acceptable stem-and-leaf displays but also to variants on them. Comparison of groups is then easy by using the standard by() option of
twoway scatter.
This column also illustrates a useful Stata strategy of using basic commands and toy
examples and moving step by step toward what you want. Technique is thus strengthened. Serendipitous discoveries of unforeseen possibilities are likely. Although the
project followed through to produce a new stemplot command, the greater lesson is
the flexibility and power of the language available to you.
432
7
Speaking Stata
References
Chambers, J. M., W. S. Cleveland, B. Kleiner, and P. A. Tukey. 1983. Graphical
Methods for Data Analysis. Belmont, CA: Wadsworth.
Cleveland, W. S. 1993. Visualizing Data. Summit, NJ: Hobart Press.
Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.
Darton, M., and J. Clark. 1994. The Dent Dictionary of Measurement. London: Dent.
Dudley, J. W. 1946. Examination of Industrial Measurements. New York: McGraw–Hill.
Emerson, J. D., and D. C. Hoaglin. 1983. Stem-and-leaf displays. In Understanding
Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W.
Tukey, 7–32. New York: Wiley.
Griffiths, D., W. D. Stirling, and K. L. Weldon. 1998. Understanding Data: Principles
and Practice of Statistics. Brisbane: Wiley.
Harrell, F. E., Jr. 2001. Regression Modeling Strategies, With Applications to Linear
Models, Logistic Regression, and Survival Analysis. New York: Springer.
Hoaglin, D. C. 1983. Letter values: A set of selected order statistics. In Understanding
Robust and Exploratory Data Analysis, ed. D. C. Hoaglin, F. Mosteller, and J. W.
Tukey, 33–57. New York: Wiley.
Kantor, D., and N. J. Cox. 2005. Depending on conditions: A tutorial on the cond()
function. Stata Journal 5: 413–420.
Moore, D. S., and G. P. McCabe. 2006. Introduction to the Practice of Statistics. New
York: Freeman.
Pennycuick, C. J. 1988. Conversion Factors: SI Units and Many Others. Chicago:
University of Chicago Press.
Preece, D. A. 1981. Distribution of final digits in data. Statistician 30: 31–60.
Tufte, E. R. 1990. Envisioning Information. Cheshire, CT: Graphics Press.
Tukey, J. W. 1972. Some graphic and semigraphic displays. In Statistical Papers in
Honor of George W. Snedecor, ed. T. A. Bancroft and S. A. Brown, 293–316. Ames,
IA: Iowa State University Press.
———. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.
Wild, C. J., and G. A. F. Seber. 2000. Chance Encounters: A First Course in Data
Analysis and Inference. New York: Wiley.
N. J. Cox
433
About the author
Nicholas Cox is a statistically minded geographer at Durham University. He contributes talks,
postings, FAQs, and programs to the Stata user community. He has also coauthored 15 commands in official Stata. He wrote several inserts in the Stata Technical Bulletin and is an editor
of the Stata Journal.