Statistics 22000 - Winter 2004
A Short Introduction to STATA 8
1. A data example (IPS Data Appendix 5)
Each March, the Bureau of Labor Statistics carries out an Annual Demographic Supplement
to its monthly Current Population Survey. The data file contains data about 55899 people
from the March 2000 survey of the Bureau of Labor Statistics. Only people aged 25 to 65
who have worked but whose main work experience is not agriculture are included.
The data set consists of an index number and the five variables Age, Education, Sex,
Total income, and Job class:
ID:            index number
Age:           age in years
Education:     1=no high school, 2=some high school, 3=high school diploma,
               4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex:           1=male, 2=female
Total income:  income from all sources
Job class:     5=private sector, 6=government, 7=self employed
The data file is available on the textbook CD or can be downloaded from the course web
page at http://www.stat.uchicago.edu/~eichler/stat22000/Data/individuals.txt.
2. Reading data into STATA
After starting STATA, we first need to read in the data, which can be done by the infile
command:
. infile ID AGE EDUC SEX EARN JOB using individuals.txt
The variables are specified in the order they appear in the data file. To drop any current
data from memory before reading in the new data, one can add the option clear. STATA
can also read data directly from a web page:
. infile ID AGE EDUC SEX EARN JOB using
> http://www.stat.uchicago.edu/~eichler/stat22000/Data/individuals.txt
For data files which have been created by a spreadsheet program (e.g. the files on the
IPS CD), one can use the insheet command:
. insheet using individuals.txt
Entries have to be separated by commas or tabs; the first line can contain the variable names.
As an alternative to the infile or insheet command, one can use the interactive
user interface of STATA 8 to import the data. To do so, choose the menu item
Import in the File menu and follow through a couple of dialog boxes. At the end, STATA
will show the corresponding command in the Results window.
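Whichever way the data are read in, it is worth checking the result before going on. The following is a minimal sketch of such a check. It does not use individuals.txt: the two rows and the file name toy.txt are hypothetical and serve only to make the example self-contained.

```stata
* Hypothetical two-row data set, written to toy.txt for illustration only
clear
input ID AGE EDUC SEX EARN JOB
1 25 3 1 20000 5
2 40 5 2 35000 6
end
outfile using toy.txt, replace

* Read the file back in the same way as individuals.txt and inspect it
infile ID AGE EDUC SEX EARN JOB using toy.txt, clear
describe
list in 1/2
count
```

Here describe shows the variable names and storage types, list prints the first rows, and count reports the number of observations, which should match the number of lines in the file.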
At the beginning of your STATA session it is also a good idea to issue the log command,
which lets you keep a full record of the session. A log is a file containing what
you type and STATA’s output. You can use this file to copy relevant parts of your input and
STATA’s output into your homework document. The command
. log using mylog.log, replace
writes all input and output to the file mylog.log. The option replace specifies that if the file
already exists, its content is to be overwritten.
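A log stays open until it is closed. The following short sketch shows one possible way to bracket a session. The capture prefix is used here so that the first line does not stop with an error when no log is open; the display line is only a stand-in for real work.

```stata
* Close any log left open earlier (capture suppresses the error if none is open)
capture log close
log using mylog.log, replace
* Everything typed from here on, and its output, is recorded in mylog.log
display "this line is recorded in the log"
log close
```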
3. Numerical summaries
A simple numerical summary of the data can be generated by the command summarize:
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          ID |     55899       27950     16136.8          1      55899
         AGE |     55899    41.94712    10.23888         25         65
        EDUC |     55899    3.832376    1.223973          1          6
         SEX |     55899    1.477379    .4994925          1          2
        EARN |     55899    37864.61    36158.03     -24998     425510
-------------+--------------------------------------------------------
         JOB |     55899     5.36811    .6573005          5          7
To see a more detailed list of summary statistics, we use the option detail:
. summarize EARN, detail
                            EARN
-------------------------------------------------------------
      Percentiles      Smallest
 1%          608         -24998
 5%         5000          -9999
10%         9064          -9999       Obs               55899
25%        17000          -9999       Sum of Wgt.       55899

50%        29717                      Mean           37864.61
                        Largest       Std. Dev.      36158.03
75%        46505         419304
90%        72000         419351       Variance       1.31e+09
95%        96638         424770       Skewness       3.403862
99%       229533         425510       Kurtosis       20.50475
The command tabstat provides a more flexible alternative to summarize, as it lets us
specify which summary statistics are included. For example, to produce a five-number
summary for the variable EARN, we type
. tabstat EARN, stats(min p25 median p75 max)
    variable |       min       p25       p50       p75       max
-------------+--------------------------------------------------
        EARN |    -24998     17000     29717     46505    425510
----------------------------------------------------------------
For more information about the available statistics for the option stats we refer to the
online help of STATA.
Many commands in STATA can be restricted to a subset of the data by adding an if
condition. For example, to obtain a five-number summary of the total income of all men
and of all women in the sample, we type
. tabstat EARN if SEX==1, stats(min p25 median p75 max)
    variable |       min       p25       p50       p75       max
-------------+--------------------------------------------------
        EARN |    -24998     22146     36000     55853    425510
----------------------------------------------------------------
. tabstat EARN if SEX==2, stats(min p25 median p75 max)
    variable |       min       p25       p50       p75       max
-------------+--------------------------------------------------
        EARN |     -9999     13004     23012     36200    385068
----------------------------------------------------------------
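The two separate calls above can also be combined into one by using the by() option of tabstat, which prints one row of statistics per group. The sketch below uses six hypothetical income values rather than the survey data, so that it can be run on its own.

```stata
* Hypothetical incomes for three men (SEX==1) and three women (SEX==2)
clear
input SEX EARN
1 22000
1 36000
1 56000
2 13000
2 23000
2 36000
end
* One five-number summary per level of SEX, plus a row for the whole sample
tabstat EARN, by(SEX) stats(min p25 median p75 max)
```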
4. Generating and changing variables
The commands generate and replace allow us to create new variables or change the values
of existing variables. Suppose, for example, that we want to use a different grouping of
education level: people with no high school diploma, with a high school diploma, with a
Bachelor’s degree, and with an advanced degree. The following sequence of commands
generates the corresponding categorical variable with four levels:
. generate EDUC4=1
. replace EDUC4=2 if EDUC==3 | EDUC==4
. replace EDUC4=3 if EDUC==5
. replace EDUC4=4 if EDUC==6
. label values EDUC4 E4Names
. label define E4Names 1 "no HS diploma" 2 "HS diploma" 3 "Bachelor’s" 4 "Advanced"
. table EDUC4
--------------------------
        EDUC4 |      Freq.
--------------+-----------
no HS diploma |      5,847
   HS diploma |     33,460
   Bachelor’s |     10,991
     Advanced |      5,601
--------------------------
Here, the commands label define and label values are used to associate labels with
the levels of the categorical variable. The command label define specifies what labels go
with what number while the label values command specifies to which variable these labels
apply.
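When regrouping a variable it is easy to misassign a level, so a cross-check is useful. The following sketch repeats the generate and replace sequence on a hypothetical six-row data set (one row per education level) and compares it with the recode command. The variable name EDUC4b is made up for this example, and the generate() option of recode is assumed to be available in your STATA version.

```stata
* One hypothetical observation for each of the six education levels
clear
input EDUC
1
2
3
4
5
6
end
* Grouping as in the text
generate EDUC4=1
replace EDUC4=2 if EDUC==3 | EDUC==4
replace EDUC4=3 if EDUC==5
replace EDUC4=4 if EDUC==6
* The same grouping in one step (EDUC4b is a hypothetical name)
recode EDUC (1 2 = 1) (3 4 = 2) (5 = 3) (6 = 4), generate(EDUC4b)
* Cross-tabulate old against new levels and confirm both versions agree
tabulate EDUC EDUC4
assert EDUC4 == EDUC4b
```

The cross-tabulation should show each original level falling into exactly one new group.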
Next, we want to change the total income to units of a thousand dollars. This can be
accomplished by the command:
. replace EARN=round(EARN/1000)
STATA has a large number of functions, such as the natural logarithm ln(). A complete list
can be obtained via the online help of STATA.
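Note that replace overwrites the original values. As a sketch of an alternative, the example below keeps EARN unchanged and stores the transformed values in new variables. The names EARNK and LOGEARN are hypothetical; the four income values are the minimum, 1% percentile, median, and maximum from the summary in Section 3. Since ln() is not defined for non-positive values, the logarithm is only computed for positive incomes and is missing otherwise.

```stata
clear
input EARN
-24998
608
29717
425510
end
* Income in thousands of dollars, stored in a new variable
generate EARNK = round(EARN/1000)
* Natural logarithm, left missing for non-positive incomes
generate LOGEARN = ln(EARN) if EARN > 0
list
count if missing(LOGEARN)
```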
5. Bar graphs and pie charts
As our first plots we want to produce a bar graph and a pie chart for the variable EDUC. As
a first step we issue the command
. set scheme s1mono
which switches the graphic display to a simpler grey level presentation. Bar graphs can be
obtained by the command graph bar.
. set scheme s1mono
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "some college"
> 5 "Bachelor’s" 6 "postgrad"
. generate COUNT=1
. graph bar (sum) COUNT, over (EDUC) ytitle(Number of people)
> lintensity(255)
Figure 1: Bar chart for the variable EDUC (two panels: number of people and percentage
of people at each education level)
The graph bar command plots the sum of the variable COUNT over the levels of the variable
EDUC. Since the variable COUNT always takes the value 1, the heights of the bars are the
counts over the levels of EDUC. Similarly, the following two commands produce a bar graph
with height given in percentages:
. generate PERC=100/_N
. graph bar (sum) PERC, over (EDUC) ytitle(Percentage of people)
> lintensity(255)
Here the built-in variable _N yields the total number of observations in the data set, so
each individual contributes 100/_N percentage points to its bar. The option
lintensity(255) specifies that bars are outlined by black lines.
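The bar heights can be checked numerically without drawing a graph. The following sketch, on a hypothetical four-row data set, sums PERC within each level of EDUC using tabstat and compares the result with the Percent column of tabulate.

```stata
* Four hypothetical observations: one at level 1, two at level 3, one at level 4
clear
input EDUC
1
3
3
4
end
generate PERC=100/_N
* Sum of PERC within each level is the height of the corresponding bar
tabstat PERC, by(EDUC) stats(sum)
* The Percent column should show the same values
tabulate EDUC
```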
A more sophisticated bar graph is the following, which compares the level of education of
men and women in the sample:
. label values SEX Sex
. label define Sex 1 "Male" 2 "Female"
. graph bar (sum) COUNT, over(SEX) over(EDUC) asyvars ytitle(Number of people)
> lintensity(255)
Figure 2: Level of education for men and women aged 25 to 65
Figure 3: Pie chart for variable EDUC (slices: no HS, some HS, HS diploma, some college,
Bachelor’s, postgrad)
Pie charts can be obtained by the command graph pie:
. graph pie, over(EDUC) line(lc(black)) legend(position(2) row(6))
Here, the option legend controls the placement and the format of the legend while line(lc(black))
defines the color of the lines used to outline the slices.
6. Stemplots and histograms
Stem-and-leaf plots can be obtained by the command stem. We illustrate the use of stem
by producing a stemplot for the variable EARN. Since stem-and-leaf plots are only suitable
for small data sets, we draw a random sample of 100 observations (the other observations
are dropped from memory). Further, we convert EARN to units of a thousand dollars in
order to have single-digit leaves:
. sample 100, count
. replace EARN=round(EARN/1000)
. stem EARN
Stem-and-leaf plot for EARN
0* | 012334555577889999
1* | 001222333333555556677899
2* | 22334455555567899
3* | 0122235677889
4* | 022557
5* | 000023467
6* | 0
7* | 33
8* | 034
9* | 0
10* | 000
11* |
12* | 24
13* |
14* |
15* |
16* |
17* | 5
Note that I have used a fixed-width font such as Courier to draw the stem-and-leaf plot.
A proportionally spaced font (such as Times Roman) does not allocate the same width to
every character.
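Because sample drops all observations that are not selected, the full data set has to be read in again afterwards. One possible way around this is to wrap the sampling in preserve and restore, as in the sketch below. The data are hypothetical (1000 generated values between 0 and 100, standing in for income in thousands of dollars), and the seed value is arbitrary; setting a seed makes the random sample reproducible.

```stata
* Hypothetical data: 1000 deterministic values between 0 and 100
clear
set obs 1000
generate EARN = mod(_n*37, 101)
* Fix the random number seed so the same sample is drawn on every run
set seed 22000
* Keep a copy of the full data, work on a sample, then get the full data back
preserve
sample 100, count
stem EARN
restore
count
```

After restore, count should again report 1000 observations.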
Figure 4: Histogram for the variable EARN (x-axis: Total income ($ thousands),
y-axis: Density)
The stem command has an option called lines that controls the number of lines in the
display. The possible values are: 1, 2, 5, and 10.
. stem EARN if EARN<=100, lines(2)
Stem-and-leaf plot for EARN
0* | 012334
0. | 555577889999
1* | 001222333333
1. | 555556677899
2* | 223344
2. | 55555567899
3* | 012223
3. | 5677889
4* | 022
4. | 557
5* | 0000234
5. | 67
6* | 0
6. |
7* | 33
7. |
8* | 034
8. |
9* | 0
9. |
10* | 000
The histogram for the variable EARN is obtained by the command histogram:
. histogram EARN, start(0) width(10) xlabel(0(20)200)
> xtitle("Total income ($ thousands)") xtick(0(10)200)
Here the options start and width define the offset and the width of the bins while the
options xlabel, xtick, and xtitle are used to specify the labels, ticks, and title of the
x-axis.
7. Boxplots
One of the most common applications of boxplots is the comparison of the distribution of
one variable over the categories of one or more other variables. As an example, suppose we
are interested in how the distribution of income depends on gender and the level of
education. This involves the comparison of twelve distributions (two levels of the variable
SEX and six levels of the variable EDUC).
Figure 5: Boxplots comparing the distributions of income for men and women with six
different levels of education.
Boxplots are produced by the command graph box:
. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear
. replace EARN=EARN/1000
. label values SEX Sex
. label define Sex 1 "Male" 2 "Female"
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "some college"
> 5 "Bachelor’s" 6 "postgrad"
. graph box EARN, over(SEX) over(EDUC) xsize(8) ysize(3) lintensity(255)
> ytitle(Income ($ thousands))
Here the option over specifies that the distribution of the target variable EARN is to be
compared over the levels of the two categorical variables SEX and EDUC. The options xsize
and ysize are used to define width and height of the plot.
8. Normal quantile plots
The data file hotdogs.txt contains the calorie and the sodium content of three types of hot
dogs. The calorie content can be compared by a box plot:
. infile Type Calories Sodium using hotdogs.txt, clear
. label values Type TypeName
. label define TypeName 1 "Beef" 2 "Meat" 3 "Poultry"
. graph box Calories, over(Type) lintensity(255)
To find out whether the calorie content is approximately normally distributed, we use
normal quantile plots, which can be produced by the command qnorm:
. qnorm Calories if Type==1, saving(cal1) ytitle(Calorie content of beef hot dogs)
. qnorm Calories if Type==2, saving(cal2) ytitle(Calorie content of meat hot dogs)
. qnorm Calories if Type==3, saving(cal3) ytitle(Calorie content of poultry hot dogs)
. graph combine cal1.gph cal2.gph cal3.gph, rows(1) xsize(8) ysize(3)
Figure 6: Boxplots comparing the calorie content of three types of hot dogs.
Figure 7: Normal quantile plots for the calorie content of three types of hot dogs.
(Three panels, one each for beef, meat, and poultry hot dogs; x-axis: Inverse Normal)
The option saving(filename) saves the graph in the file filename.gph. The saved plots can
later be combined using the graph combine command.
The normal quantile plots indicate that the calorie content of the three types of hot dogs
might not be normally distributed since the values are less concentrated in the middle part
of the distribution than expected for a normal distribution.
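Since the three qnorm commands differ only in the value of Type, they can also be written as a loop. The following is a sketch using forvalues. The calorie values are hypothetical and are not the contents of hotdogs.txt; the replace suboption of saving() is added so that the example can be run repeatedly.

```stata
* Hypothetical calorie values, four per type, for illustration only
clear
input Type Calories
1 150
1 160
1 175
1 190
2 140
2 155
2 170
2 185
3 90
3 105
3 120
3 140
end
* One normal quantile plot per type, saved as cal1.gph, cal2.gph, cal3.gph
forvalues t = 1/3 {
    qnorm Calories if Type==`t', saving(cal`t', replace)
}
graph combine cal1.gph cal2.gph cal3.gph, rows(1) xsize(8) ysize(3)
```

The loop trades the individual y-axis titles of the original commands for brevity; they would have to be added separately if needed.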
9. Converting STATA graphs to EPS format
To save the current graph as an EPS file (which can be included in other documents) we
need to use the following command:
. translate @Graph new.eps, translator(Graph2eps)
where new.eps is the name of the EPS file. If the file new.eps already exists, we need
to add the option replace:
. translate @Graph new.eps, translator(Graph2eps) replace
Under Windows, you might also be able to use copy-paste for including plots in your document.
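As a possible alternative to translate, the command graph export writes the current graph to a file whose format is determined by the file extension. The sketch below assumes a STATA version in which graph export is available, and it uses three hypothetical data points only to have a graph to export.

```stata
* Three hypothetical points, just to produce a graph
clear
input x y
1 2
2 4
3 5
end
scatter y x
* The .eps extension selects EPS output; replace overwrites an existing file
graph export new.eps, replace
```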