University of California, Santa Cruz
Department of Economics
ECON 294A (Fall 2014) - Stata Lab
Instructor: Manuel Barron [1]

Data Management 3

1 Announcements

1.1 Please Ask Questions

If you don’t understand something during lab, please ask me. If you have a question, it is very likely that other students have the same question. It may be that I didn’t explain something correctly, or that I myself made a mistake. I learn a lot from questions (for instance, that Stata has an “undo” command [2]).

1.2 A Note on Replicability

Please make sure your do-files are 100% replicable. This means that you need to include all the commands that you used to get your results. For instance, if you open a dataset using the drop-down menus, make sure to copy and paste the resulting code into the do-file. The idea is that anybody who has access to the same data as you and runs the do-file will automatically get the same results as you.

To make sure your do-file is replicable, save it, close Stata, re-open Stata, and run it. If you don’t get the same results (or if you get an error message), your do-file is not replicable. In principle, you could also send your do-files (and the relevant datasets) to a classmate and ask them to run them on their computer. However, since you would have to send them the datasets and there are some compatibility issues between Apple and PC (for instance, \ vs. / when writing directory paths), it is enough that it works on your computer.

It is difficult to provide an exact recipe for replicability, but it is useful to start with “clear all”, so that Stata removes everything it may already have in memory. Then, set the working directory. Make sure to copy and paste from the output window any relevant command that may have appeared if you used the drop-down menus. Also, make sure you do not overwrite the original dataset. If you do, you won’t be able to replicate the analysis.

[1] Comments welcome. Contact me at mbarron4 [at] ucsc
[2] ... and that it is pretty useless.

do-file
    clear all
    cd "/Users/Manue/Econ294A/data/"
    use week3.dta, clear
    [...Stata commands...]
    save week3_clean.dta, replace

2 labels

Labels tell us what variables mean, so they are very useful. Working with them is also pretty straightforward. The main reason I have postponed the discussion about labels until now is that the first two lab sessions already had a lot of material.

2.1 Variable labels

Variable labels are descriptions of the variables. For instance, we may have a variable called “income”, a measure of income. We may want to attach a label, e.g., “income”. We may also want to describe it a bit further. For instance, we may want to say whether it is household or individual income, its frequency (monthly, yearly, weekly), and the units of measurement. Here are some examples of labels we could attach to the “income” variable:

do-file
    label variable income "income"
    label variable income "household income"
    label variable income "household income, thousand $ per year"
    label variable income "annual household income, thousand $"

The first type of label is pretty much useless. If the name of the variable is income, giving it the label “income” provides no information at all. The second option is a bit better, because it distinguishes between household and individual income, but it says nothing about the frequency or units of measurement. Units of measurement are especially important with monetary values because sometimes they are measured in logs. The third line gives us a pretty informative label. It says it’s household income, measured in thousands of dollars per year (i.e., a value of 100 means that the household receives an annual income of $100,000). The fourth line gives us the same information with different phrasing. It’s just a matter of taste between those last two.

As with comments in code, it is easy to get carried away. You don’t want to use a label that is too long. As a rough guide, I would call the following label “too long”:

do-file
    label variable income "income of every working-age household member, measured in US Dollars per year"

Such detailed information belongs in a document usually called the “codebook”.

2.2 Value labels

Value labels are labels that describe what the different values of a variable mean. For instance, in the auto.dta dataset, the variable “foreign” takes the value of 1 if the car is foreign and 0 if it is domestic. To see the underlying values, we can use the command tabulate with the option nolabel:

command window
    . tabulate foreign

       Car type |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       Domestic |         52       70.27       70.27
        Foreign |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00

    . tabulate foreign, nolabel

       Car type |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         52       70.27       70.27
              1 |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00

We can also create our own value labels. Here we generate an indicator for domestic cars, define a value label called dom, and attach it to the new variable:

command window
    . gen domestic = 1 if foreign==0
    (22 missing values generated)

    . replace domestic = 0 if foreign==1
    (22 real changes made)

    . tab domestic

       domestic |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         22       29.73       29.73
              1 |         52       70.27      100.00
    ------------+-----------------------------------
          Total |         74      100.00

    . label define dom 0 "foreign" 1 "domestic"

    . label values domestic dom

    . tab domestic

       domestic |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        foreign |         22       29.73       29.73
       domestic |         52       70.27      100.00
    ------------+-----------------------------------
          Total |         74      100.00

3 wildcards

Stata has two wildcards: the asterisk “*” and the question mark “?”. While the question mark replaces exactly one character, the asterisk is more flexible: you can use it to replace one character, many characters, or no characters. This will become clearer with a couple of examples.

3.1 the question mark - replace one character

The question mark “?” replaces exactly one character.

command window
    . sysuse auto, clear

    . sum pric?

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           price |         74    6165.257     2949.496      3291      15906

This summarizes all the variables that start with “pric” and have exactly one additional character after the “c”. In the auto dataset, this is only the “price” variable.

command window
    . gen rep88=rep78 if foreign==1

    . replace rep88=0 if foreign==0

    . gen rep79=rep78 if foreign==0

    . replace rep79=0 if foreign==1

    . sum rep?8

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           rep78 |         69    3.405797     .9899323         1          5
           rep88 |         73    1.232877       1.9897         0          5

We generate two variables, rep88 and rep79. When we issue the command “sum rep?8”, Stata looks for all variables that start with rep, have one more character (any character), and finish with an 8. The variables that meet this condition in this dataset are rep78 and rep88. (Why does rep79 fail to meet the condition?)

3.2 the asterisk - replace any number of characters

The asterisk replaces any number of characters (including no characters). For instance, we can ask Stata to summarize all variables that start with rep:

command window
    . sum rep*

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           rep78 |         69    3.405797     .9899323         1          5
           rep88 |         73    1.232877       1.9897         0          5
           rep79 |         70    2.071429     1.572604         0          5

Say we want to see the summary statistics for all the variables in the auto.dta dataset that start with price.

command window
    . sum price*

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           price |         74    6165.257     2949.496      3291      15906

    . sum price?
    variable price? not found
    r(111);

The second command fails because no variable name consists of “price” followed by exactly one more character.

We can also ask Stata to summarize all variables that end in 8:

command window
    . sum *8

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           rep78 |         69    3.405797     .9899323         1          5
           rep88 |         73    1.232877       1.9897         0          5

In addition, we may ask Stata to summarize all variables that start with rep and end with 78. Note that the asterisk can replace the “empty space” (but the question mark can’t do that: it replaces exactly one character).

command window
    . sum rep*78

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           rep78 |         69    3.405797     .9899323         1          5

    . sum rep?78
    variable rep?78 not found
    r(111);

What does this command do?

command window
    . sum rep*7?

4 recode

recode changes the values (codes) of a variable. Say we have a variable called “married” that takes the value of 1 for people who are married and the value of 2 for people who are not married. We may want to transform it into an indicator variable that takes the value of 1 for people who are married and the value of 0 for people who are not married. To do so, we use the recode command. Let’s check out the help file (help recode).

command window
    . recode married 2=0

So the syntax is like this: first, we type the command name, then the variable we wish to recode. After that we type the original value, one equal sign, and the new value. Why do we type one equal sign and not two?

Let’s say we want to split the sample between married people and the rest. First, we need to inspect which values mean which marital status:

    married    = 1
    single     = 2
    widow(er)  = 3
    Don’t know = 99

We can use recode in either of the following ways:

do-file
    * option 1: list each original value
    recode married 2=0 3=0 99=.
    * option 2 (equivalent): give a range of original values
    recode married 2/3=0 99=.

5 encode

Imagine you are working with a dataset that has the respondents’ country of origin as strings (i.e., as words). For operational reasons, we may want to create a numerical code for this variable. encode allows us to encode string variables.

do-file
    use household.dta, clear
    encode state, gen(scode)

One of the nice features of encode is that if your original variable already has a variable label, it copies that label so you don’t have to do it. In addition, it attaches value labels corresponding to each code, so if you browse through your data you will still see the strings (not just the numeric codes associated with them).

6 Missing Values

As we saw earlier, Stata has two types of missing values. For numeric variables, missing values are represented by a period “.”. Stata treats the period as larger than any number (in effect, as infinity), so a condition such as age>=18 is also true when age is missing. For string variables, missing values are represented by an empty string “” (with no blank space between the quotes).

7 egen

egen is an extension to the generate command. It allows us to generate a variable that contains, say, the average earnings for all individuals in the sample.

do-file
    egen earnings = rowtotal(income1 income2)
    egen mean_earnings = mean(earnings)

In the same fashion, we can calculate the maximum, minimum, mean absolute deviation, median, mode, rank, standard deviation, and many other values. The functions that perform calculations over the whole row (i.e., across different sources of income for each individual) start with “row”, which makes them easy to spot in the help file.

This command becomes way more useful when performed over subgroups of our dataset. The bysort prefix allows us to do exactly that.

8 commands by groups of observations: by and bysort

On some occasions we may want to calculate values for groups of observations. For instance, we have data on households and we want to know the number of adults in each household.

do-file
    * adult = 1 if 18 or older; the age!=. condition keeps missing ages out
    gen adult=1 if age>=18 & age!=.
    replace adult=0 if age<18
    sort hhid
    * sum() is the older name of egen's total() function
    by hhid: egen hhadults = sum(adult)

The by varlist: prefix repeats a command for each group of observations for which the variables in varlist are the same. The data must first be sorted by varlist. This can be done with the sort command, which orders the observations in ascending order according to the variable(s) given in the command. A cleaner way of doing this is to use the bysort prefix (abbreviated bys), which sorts and applies by in one step. For example, suppose we want to create for each individual a variable that equals the number of adults in their household. Let’s say we define “adult” as being 18 years old or older.

do-file
    drop hhadults
    bys hhid: egen hhadults = sum(adult)

9 collapse

The collapse command collapses the dataset to one observation per group, where groups are defined by one or more variables. It lets us calculate statistics of the relevant variables in the dataset, such as the mean, number of observations, median, percentiles, standard error of the mean, standard deviation, sum, maximum, minimum, and inter-quartile range, among others.

    collapse (sum) adults=adult (mean) pcincome=earnings, by(hhid)

This command calculates the number of adults and the per capita income per household. It keeps those two variables and the hhid variable in memory, and drops everything else.

    collapse (sum) hhsize=adult hhincome=earnings ///
        (mean) pcincome=earnings (first) hhsize2=hhadults, by(hhid)

This command creates four variables: hhsize, hhincome, pcincome, and hhsize2. It keeps those variables, plus the hhid variable, and removes everything else from memory. Because collapse replaces the dataset in memory, the two commands above are alternatives: each one must be run on the original, uncollapsed data.

10 reshape

reshape is a Stata command that allows us to change the structure of the data. It converts the data from “long” to “wide” form and vice versa.

Long Form

    id     year    income
    “i”    “j”     “stub”
    1      2013    4.1
    1      2014    4.5
    2      2013    3.3
    2      2014    3.0

Wide Form

    id     income2013    income2014
    “i”    “stub 1”      “stub 2”
    1      4.1           4.5
    2      3.3           3.0

To go from long to wide:

    reshape wide income, i(id) j(year)

where year is an existing variable. To go from wide to long:

    reshape long income, i(id) j(year)

where year is a new variable.

11 Exercise - Graphs

Using collapse and the graph commands we’ve seen thus far, graph:

1. Mean income (and standard errors of the mean) by years of schooling
2. Mean income (and 95% confidence intervals) by gender
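
One possible way to approach graphs of this kind is sketched below. The lab dataset is not shown in these notes, so the sketch uses auto.dta as a stand-in: price plays the role of income and rep78 the role of the grouping variable. The names mean_price, se_price, lo, and hi are made up for illustration, and the 1.96 multiplier assumes a normal approximation for the 95% interval. Treat it as a starting point under those assumptions, not as the expected answer.

```stata
* Sketch only: auto.dta stands in for the lab dataset (an assumption).
sysuse auto, clear

* Drop cars with no repair record so they do not form a group of their own
drop if missing(rep78)

* One observation per group: the mean and the standard error of the mean
collapse (mean) mean_price=price (semean) se_price=price, by(rep78)

* Approximate 95% confidence interval (normal approximation)
gen lo = mean_price - 1.96*se_price
gen hi = mean_price + 1.96*se_price

* Group means as points, intervals as capped spikes
twoway (rcap hi lo rep78) (scatter mean_price rep78), ///
    ytitle("Mean price") xtitle("Repair record 1978") legend(off)
```

To plot standard errors instead of confidence intervals, the same graph can be drawn with bounds of one standard error on each side of the mean.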