University of California, Santa Cruz
Department of Economics
ECON 294A (Fall 2014)- Stata Lab
Instructor: Manuel Barron¹
Data Management 3
1 Announcements
1.1 Please Ask Questions
If you don’t understand something during lab, please ask me. If you have a question, it is
very likely that other students also have that question. It may be that I didn’t explain it
correctly, or that I myself made a mistake. I learn a lot from questions (for instance, that
Stata has an “undo” command²).
1.2 A Note on Replicability
Please make sure your do-files are 100% replicable. This means that you need to include all
the commands that you used to get your results. For instance, if you open a dataset using
the drop-down menus, make sure to copy and paste the code into the do-file. The idea is that
if anybody who has access to the same data as you runs the do-file they will automatically
get the same results as you.
To make sure your do-file is replicable, save it, close Stata, re-open Stata, and run it. If
you didn’t get the same results (or if you got an error message), your do-file is not replicable.
In principle, you could also send your do-files (and relevant datasets) to a classmate and
ask them to run them on their computers. However, since you would have to send them the datasets and
there are some compatibility issues between Apple and PC (for instance, \ vs. / when writing
directory paths), it is enough that it works on your computer.
It is difficult to provide an exact recipe for replicability, but it is useful to start with “clear
all”, so Stata removes everything it may have in memory already. Then, set the working
directory. Make sure to copy and paste from the output window any relevant command that
may have appeared if you used the drop-down menus. Also, make sure you do not overwrite
the original dataset. If you do, then you won’t be able to replicate the analysis.
¹ Comments welcome. Contact me at mbarron4 [at] ucsc
² ... and that it is pretty useless.
do-file
clear all
cd "/Users/Manue/Econ294A/data/"
use week3.dta, clear
[...Stata commands...]
save week3_clean.dta, replace
2 labels
Labels tell us what variables mean, so they are very useful. Working with them is also pretty
straightforward. The main reason I have postponed the discussion about labels until now is
because the first two lab sessions already had a lot of material.
2.1 Variable labels
Variable labels are descriptions of the variables. For instance, we may have a variable called
“income”, a measure of income. We may want to attach a label, e.g., “income”. We may also
want to describe that a bit further. For instance, we may want to describe if it is household
or individual income, its frequency (monthly, yearly, weekly), and the units of measurement.
Here are some examples of labels we could attach to the “income” variable:
do-file
label variable income "income"
label variable income "household income"
label variable income "household income, thousand $ per year"
label variable income "annual household income, thousand $"
The first type of label is pretty much useless. If the name of the variable is income, giving
it the label of “income” provides no information at all. The second option is a bit better,
because it distinguishes between household and individual income, but it says nothing about
the frequency or units of measurement. Units of measurement are especially important with
monetary values because sometimes they are measured in logs.
The third line gives us a pretty informative label. It says it’s household income, measured
in thousands of dollars per year (i.e., a value of 100 means that the household receives an
annual income of $100,000). The fourth line gives us the same information with different
phrasing. It’s just a matter of taste between those last two.
As with comments in code, it is easy to get carried away. You don’t want to use a label
that is too long. In general terms, I would say a label like the following is “too long”:
do-file
label variable income "income of every working-age household member, measured in US Dollars per year"
Such detailed information belongs in a document usually called the “codebook”.
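To check which label a variable currently carries, describe prints it next to the variable name. This is a minimal sketch that assumes Stata's built-in auto.dta example dataset instead of the income data discussed above; the label text is only illustrative.

```stata
* load the example dataset that ships with Stata
sysuse auto, clear

* attach a label that says what is measured and in which units
* (the wording of the label is an illustrative assumption)
label variable price "price of the car, US dollars"

* describe shows the variable label in its last column
describe price
```

Running label variable again on the same variable simply replaces the previous label, so it is safe to refine a label later in the do-file.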
2.2 Value labels
Value labels are labels that describe what different values of a variable mean. For instance,
in the auto.dta dataset, the variable “foreign” takes the value of 1 if the car is foreign and
0 if it is domestic.
To see the values, we can use the command tabulate with the option nolabel:
command window
. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. tabulate foreign, nolabel

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
command window
. gen domestic = 1 if foreign==0
(22 missing values generated)
. replace domestic = 0 if foreign==1
(22 real changes made)
. tab domestic

   domestic |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         22       29.73       29.73
          1 |         52       70.27      100.00
------------+-----------------------------------
      Total |         74      100.00

. label define dom 0 "foreign" 1 "domestic"
. label values domestic dom
. tab domestic

   domestic |      Freq.     Percent        Cum.
------------+-----------------------------------
    foreign |         22       29.73       29.73
   domestic |         52       70.27      100.00
------------+-----------------------------------
      Total |         74      100.00
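The same steps can be collected in a do-file. The sketch below is one possible way to write them, again assuming the built-in auto.dta dataset; it builds the indicator in a single line and then uses label list to inspect the mapping stored in the value label.

```stata
sysuse auto, clear

* indicator equal to 1 for domestic cars and 0 for foreign cars
gen domestic = (foreign == 0) if !missing(foreign)

* define the value label once, then attach it to the variable
label define dom 0 "foreign" 1 "domestic"
label values domestic dom

* show the mapping stored in the value label
label list dom

* the underlying numeric codes are unchanged
tabulate domestic, nolabel
```

Because a value label is defined separately from the variable, the same label (here dom) can be attached to several variables that share the same coding.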
3 wildcards
Stata has two wildcards: the asterisk “*” and the question mark “?”. While the question
mark replaces exactly one character, the asterisk is more flexible: you can use it to replace
one character, many characters, or no characters. This will become clearer with a couple of
examples.
3.1 the question mark - replace one character
The question mark “?” replaces exactly one character.
command window
. sysuse auto, clear
. sum pric?

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
This will summarize all the variables that start with “pric” and have exactly one additional character after the “c”. In the auto dataset, this is only the “price” variable.
command window
. gen rep88=rep78 if foreign==1
. replace rep88=0 if foreign==0
. gen rep79=rep78 if foreign==0
. replace rep79=0 if foreign==1
. sum rep?8

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       rep78 |        69    3.405797    .9899323          1          5
       rep88 |        73    1.232877      1.9897          0          5
We generate two variables, rep88 and rep79. When we issue the command “sum rep?8”, Stata
looks for all variables that start with rep, have one more character (any character), and finish
with an 8. The variables that meet said condition in this dataset are rep78 and rep88. (Why
does rep79 fail to meet the condition?)
3.2 the asterisk - replace any number of characters
The asterisk replaces any number of characters (including no characters). For instance, we
can ask Stata to summarize all variables that start with rep:
command window
. sum rep*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       rep78 |        69    3.405797    .9899323          1          5
       rep88 |        73    1.232877      1.9897          0          5
       rep79 |        70    2.071429    1.572604          0          5
Say we want to see the summary statistics for all the variables in the auto.dta dataset
that start with price.
command window
. sum price*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906

. sum price?
variable price? not found
r(111);
We can also ask Stata to summarize all variables that end in 8:
command window
. sum *8

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       rep78 |        69    3.405797    .9899323          1          5
       rep88 |        73    1.232877      1.9897          0          5
In addition, we may ask Stata to summarize all variables that start with rep and end
with 78. Note that the asterisk can replace the “empty space” (but the question mark can’t
do that - it replaces exactly one character).
command window
. sum rep*78

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       rep78 |        69    3.405797    .9899323          1          5

. sum rep?78
variable rep?78 not found
r(111);
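Wildcards are not specific to summarize; they work wherever Stata expects a list of existing variables. A short sketch, assuming the auto.dta dataset and the rep88 variable built as in the examples above:

```stata
sysuse auto, clear
gen rep88 = rep78 if foreign == 1
replace rep88 = 0 if foreign == 0

* describe every variable that starts with rep, has one more
* character, and ends in 8
describe rep?8

* list make and all variables starting with rep for the first three cars
list make rep* in 1/3
```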
What does this command do?
command window
. sum rep*7?
4 recode
recode changes the code of a variable. Say we have a variable called “married” that takes
the value of 1 for people who are married and the value of 2 for people who are not married.
We may want to transform it to an indicator variable that takes the value of 1 for people
who are married and the value of 0 for people who are not married. To do so, we use the
recode command. Let’s check out the help file.
command window
. recode married 2=0
So the syntax is like this: first, we type the command name, then the variable we wish
to recode. After that we type the original value, one equal sign, and the new value. Why do
we type one equal sign and not two?
Let’s say we want to split the sample between married people and the rest. First, we need
to inspect which values mean which marital status:
married = 1
single = 2
widow(er) = 3
Don’t know = 99
We can use recode in either of the following equivalent ways:
do-file
recode married 2=0 3=0 99=.
recode married 2/3=0 99=.
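By default recode overwrites the variable, which makes mistakes hard to check. One possible alternative is the generate() option, which stores the result in a new variable and leaves the original untouched. The sketch below uses a tiny made-up dataset; the variable names married_ind and the label name yesno are hypothetical.

```stata
* toy data with the coding described above (values are made up)
clear
input id married
1 1
2 2
3 3
4 99
end

* recode into a new variable, keeping the original for comparison
recode married (2/3 = 0) (99 = .), generate(married_ind)

* label the new indicator so tables are readable
label define yesno 0 "not married" 1 "married"
label values married_ind yesno

* cross-tabulate old against new codes to verify the recode
tabulate married married_ind, missing
```

The cross-tabulation at the end is a quick check that every original code ended up where intended, including the 99 that became missing.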
5 encode
Imagine you are working with a dataset that stores a categorical variable, such as the respondents’
state or country of origin, as strings (i.e. as words). For operational reasons, we may want to create a numerical code for this
variable. encode allows us to encode string variables.
do-file
use household.dta, clear
encode state, gen(scode)
One of the nice features of encode is that if your original variable already has a label it
copies that label so you don’t have to do it. In addition, it attaches value labels corresponding
to each code, so if you browse through your data you will still see the strings (not just the
numeric codes associated with those strings).
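To see what encode actually stores, it helps to look at the value label it creates. A minimal sketch on a made-up dataset (household.dta is not available here, so the four observations below are an assumption for illustration):

```stata
* toy data: a string variable with repeated categories
clear
input str10 state
"Oregon"
"California"
"Nevada"
"California"
end

* create a numeric variable with value labels taken from the strings
encode state, generate(scode)

* the value label is named after the new variable;
* codes are assigned in alphabetical order of the strings
label list scode

* show the numeric codes instead of the labels
tabulate scode, nolabel
```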
6 Missing Values
As we saw earlier, Stata has two types of missing values. For numeric variables, missing
values are represented by a period “.”. Stata treats the period as larger than any number (in effect, as infinity), so a condition such as age>=18 is also true when age is missing.
For string variables, missing values are represented by “” (an empty string, without blank space).
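Because a numeric missing value compares as larger than any number, conditions with > or >= silently include missing observations unless we exclude them. A small sketch on made-up data illustrates the difference:

```stata
* toy data: three people, one with missing age
clear
input age
15
40
.
end

* counts 2: the 40-year-old and the observation with missing age
count if age >= 18

* counts 1: missing values are excluded explicitly
count if age >= 18 & !missing(age)
```

Writing age != . has the same effect as !missing(age) for the plain missing value; missing() also covers the extended missing codes .a through .z.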
7 egen
egen is an extension to the generate command. It allows us to generate a variable that
contains, say, the average earnings for all individuals in the sample.
do-file
egen earnings = rowtotal(income1 income2)
egen mean_earnings = mean(earnings)
In the same fashion, we can calculate the maximum, minimum, mean absolute deviation,
median, mode, rank, standard deviation, and many other values. The functions that perform
calculations over the whole row (i.e. different sources of income for each individual) start
with “row”, which makes them easy to spot in the help file.
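The difference between row functions and whole-sample functions is easiest to see on a small example. The sketch below uses made-up values for income1 and income2; earnings_gen is a hypothetical name added only for comparison.

```stata
* toy data: two income sources, one of them missing for person 2
clear
input id income1 income2
1 10 5
2 20 .
3 30 15
end

* row function: adds across variables within each observation,
* treating missing values as zero
egen earnings = rowtotal(income1 income2)

* plain generate: the sum is missing whenever any component is missing
gen earnings_gen = income1 + income2

* whole-sample function: the same mean is stored in every observation
egen mean_earnings = mean(earnings)

list
```

Whether treating a missing income source as zero is appropriate depends on the data, so it is worth comparing the two totals before choosing one.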
This command becomes way more useful when performed over subgroups of our dataset.
The bysort command allows us to do exactly that.
8 commands by groups of observations: by and bysort
On some occasions we may want to calculate values for groups of observations. For instance,
we have data on households and we want to know the number of adults in the household.
do-file
gen adult=1 if age>=18 & age!=.
replace adult=0 if age<18
sort hhid
by hhid: egen hhadults = sum(adult)
The by varlist : prefix repeats a command for each group of observations for which the
variables in varlist are the same. The data must first be sorted by varlist. This can be done
by using the sort command, which orders the observations in ascending order according to
the variable(s) given in the command.
A cleaner way of doing this is using the bysort prefix (abbreviated bys). For example,
suppose we want to create for each individual a variable that equals the number of adults in
a household. Let’s say we define “adult” as being 18 years old or older.
do-file
drop hhadults
bys hhid: egen hhadults = sum(adult)
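The following self-contained sketch runs the same idea on a made-up household roster. It uses total(), the documented name of the egen function that sum() refers to, and adds a household size variable using _N, the number of observations in each by-group.

```stata
* toy roster: household 1 has three members, household 2 has two
clear
input hhid age
1 45
1 12
1 19
2 30
2 5
end

* adult indicator, left missing when age is missing
gen adult = (age >= 18) if !missing(age)

* number of adults in each household, repeated on every member's row
bysort hhid: egen hhadults = total(adult)

* household size: _N is the number of observations in the by-group
bysort hhid: gen hhsize = _N

list, sepby(hhid)
```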
9 collapse
The collapse command collapses a dataset to one observation per group, where the groups are defined by one or more variables. It lets us calculate statistics of the relevant variables, such as the mean, number of observations, median,
percentiles, standard error of the mean, standard deviation, sum, maximum, minimum, and inter-quartile range, among others.
collapse (sum) adults=adult (mean) pcincome=earnings, by(hhid)
This command calculates the number of adults and the per capita income per household.
It keeps those two variables and the hhid variable in memory, and drops everything else.
collapse (sum) hhsize=adult hhincome=earnings ///
(mean) pcincome=earnings (first) hhsize2=hhadults, by(hhid)
This command creates four variables: hhsize, hhincome, pcincome, and hhsize2. It keeps those
variables, plus the hhid variable, and removes everything else from memory.
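Since collapse replaces the data in memory, it is often wrapped in preserve and restore so the individual-level data come back afterwards. A minimal sketch on made-up data (the values are assumptions for illustration):

```stata
* toy individual-level data
clear
input hhid adult earnings
1 1 100
1 0 0
1 1 50
2 1 80
2 0 0
end

* keep a copy of the individual-level data
preserve

* one row per household: number of adults, total and mean earnings
collapse (sum) adults=adult hhincome=earnings (mean) pcincome=earnings, by(hhid)
list

* bring back the individual-level data
restore
```

If the collapsed data are needed later, they can be saved under a new file name before restore, which also keeps the original dataset from being overwritten.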
10 reshape
reshape is a Stata command that allows us to change the structure of the data. It converts
the data from “long” to “wide” form and vice versa.
Long Form
 id     year    income
“i”     “j”     “stub”
  1     2013      4.1
  1     2014      4.5
  2     2013      3.3
  2     2014      3.0
The wide form is given by:
Wide Form
 id     income2013    income2014
“i”     “stub 1”      “stub 2”
  1        4.1           4.5
  2        3.3           3.0
To go from long to wide:
reshape wide income, i(id) j(year)
where year is an existing variable. To go from wide to long:
reshape long income, i(id) j(year)
where year is a new variable.
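The two tables above can be reproduced with a short round trip. This sketch types in the long-form data shown earlier and converts it to wide form and back:

```stata
* the long-form table from above
clear
input id year income
1 2013 4.1
1 2014 4.5
2 2013 3.3
2 2014 3.0
end

* long to wide: year exists and its values become suffixes of income
reshape wide income, i(id) j(year)
list

* wide to long: year is created from the suffixes 2013 and 2014
reshape long income, i(id) j(year)
list
```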
11 Exercise - Graphs
Using collapse and the graph commands we’ve seen so far, graph
1. Mean income (and standard errors of the mean) by years of schooling
2. Mean income (and 95% confidence intervals) by gender