Analysis of Census and Survey Data for Social Science Research
University of Cape Coast Stata Workshop, July 11-20, 2011
Graphs and weights
Summary:
This module teaches you how to use Stata to produce graphs, including histograms, pie charts
and bar diagrams. We will also learn why most datasets include sample weights, and when
and how to use them.
Open Stata and load the 2000 Ghana census file
Graphs
Stata can produce a wide variety of graphs specific to our research needs. At the beginning
of this course we mentioned that Stata has a comprehensive menu system through which
commands can be selected. We have, however, only concentrated on using Stata at the
command line level. This is because in most cases using the command line is far quicker and
easier than using the menu system. However, graphing syntax can be rather long and
somewhat complicated, especially if you are unfamiliar with it. It is therefore easier to learn
how to generate graphs using the drop-down menu. After generating a graph through the
menu system, Stata will automatically display the necessary code in the results window.
Looking at this code is an excellent way of becoming familiar with the correct syntax and
once you have gained some confidence you can switch from using the graphing menu to
entering syntax at the command line level! We will explain how to use the drop-down menu
but also subsequently present the correct code so that the graphs are easily reproducible.
If you click on the “Graphics” tab in the Stata menu, you can see that Stata can generate a
large number of different graphs. We will learn how to create and interpret three of the most
widely used kinds of graphs -- histograms, pie graphs, and bar charts. We will learn about
another type of graph, scatter plots, at a later point.
Histogram
Probably the easiest graph to create is a histogram. A histogram is a graphical tool that tells
us the fraction (or percent) of observations of a given variable that fall within different
ranges. Histograms are used for continuous variables. If we have a continuous variable, such
as income or age, a histogram will tell us what percentage or fraction of the sample falls into
each income or age ‘bin’ or ‘group’. We will present the basics of graphing histograms in
Stata by working through the example below.
To draw a histogram, click on the Graphics drop-down menu and then on the “histogram”
option. A window will open up on your screen. Take time to look at this window. You
should notice a number of tabs running across the top with different titles. These tabs allow
us to manipulate our graph in various ways and we will go through the most important ones
as we progress through this module. For now let’s focus on the tab that is open, the ‘main’
tab. This tab is where we enter the most important information, that is the variable we are
interested in examining. There are a few other specifications we also have to choose on this
tab but we are lucky that Stata does most of the hard work for us.
The main tab
To graph the histogram we simply need to do the following on the main tab:
· Use the drop-down list of variables to choose a variable, or just type the variable name in
the box for the variable. Let’s use the variable age.
· Make sure that “Data are continuous” is selected (since age is a continuous variable).
· Under Y-axis select “Percent” (this is the most intuitive and easily interpretable measure to
have on the Y-axis).
· Click OK.
Stata then produces the histogram, which shows up in a new window. Also notice that the
code for this histogram appears in the results window and in the command window. This
makes it easy to re-use the command to make modifications.
The first line of code shows us what we would have entered at the command line level to
generate this graph. It’s not too complicated, but as soon as we start specifying options, the
command-line code will become more complicated. Nonetheless, try to get a feel for it as it is
important to be able to use both the drop down menus and the command line for graphing.
What does the histogram tell us about the age distribution in Ghana?
This graph is a good start, but Stata allows for many more specifications. For example,
instead of using "percent", we could have specified density, frequency, or fraction. Each
would produce a similar looking histogram but each would have a different y-axis. For now,
let's continue with percent and focus on improving the graph in other ways.
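As a sketch of what the menu writes for this first histogram (assuming the 2000 Ghana census file is in memory and age is the variable chosen on the main tab), the command should look like the first line below. The other lines show how the y-axis choice changes a single option; density is what Stata uses when no y-axis option is given.

```stata
* Assumes the 2000 Ghana census file is already loaded
histogram age, percent

* The same histogram with a different y-axis
histogram age, frequency
histogram age, fraction
histogram age, density
```

Each command replaces the graph currently on screen, so run them one at a time and compare the y-axis scales.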
The titles tab
Graph titles are an important part of presentation that we should not overlook. Titles make
graphs not only more understandable but also give them a more professional appearance.
Furthermore, they allow us to include important additional information such as the source of
the data used. Let’s reopen the histogram menu and add titles to our graph. As you may
notice, Stata remembers all the information you put in the last time. Click on the title tab, and
enter the following information in the appropriate lines:
Title: Age distribution in Ghana (2000)
Subtitle: (Unweighted)
Note: Source: IPUMS 1% subsample of Ghanaian Census 2000
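For reference, the three entries above correspond to the title(), subtitle() and note() options. The following is a sketch of the equivalent command, written for a do file (the /// at the end of a line continues the command on the next line; in the command window everything must be typed on one line). The exact order in which the menu writes the options may differ.

```stata
* Assumes the 2000 Ghana census file is already loaded
histogram age, percent ///
    title("Age distribution in Ghana (2000)") ///
    subtitle("(Unweighted)") ///
    note("Source: IPUMS 1% subsample of Ghanaian Census 2000")
```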
The x and y axis tabs
We may, in our endeavour to make the graph more informative, also want to format the axes.
Stata allows for a wide range of alterations. We can change the scale, title, thickness, or
colour of major and minor ticks/labels. We will concentrate only on the basic axis formatting
options and leave further experimentation up to you.
Before we begin, it will be useful to clarify what we mean when we refer to major and minor
ticks/labels. Very simply, “ticks” are the small lines that come off the axes. They normally
indicate the scale of the graph by dividing up the space between major ticks into sections of
equal lengths. Major ticks have numerical labels while minor ticks do not. For example, on
the x-axis of the graph above, we only have major ticks; each representing a 20 year increase
in age. Stata will generally use its own default formula to calculate the interval between
“ticks” but we can often improve upon this to make our graph more easily readable and
interpretable. Now let’s begin by turning our attention to the x-axis first:
Click on the x axis tab and add a new x axis title (“Age”).
Then click on “Major tick/label properties”. A new window will open. In this window, we can
view the different rules with
which we can specify the spacing of the labels on the x axis. You can experiment with the
different options, but we prefer to change the selection from “default” to “Range/Delta”.
This option requires us to specify a minimum, a maximum and delta, where delta is the
interval between major ticks on the x-axis (i.e. the increments). From our previous graph, we
know that age does not exceed 100, so we set this as the maximum. We set the minimum to 0,
and we try a delta of 5. Once we have done this we click “Accept”. You could do a similar
thing with minor ticks.
Now click on the y axis tab and on “Minor tick/label properties”. Click on “Range/Delta” and
enter 0 as minimum, 10 as maximum and a delta of 1. Then click Accept.
Click ok.
Notice that the code for creating this histogram with all the specified options is now much
longer and more complicated than the code for our simple histogram. Can you see what each
term in this code is doing?
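To help with reading the longer command, here is a sketch of what it should contain, one option per line. We assume the Range/Delta rule for major ticks is written as xlabel(0(5)100), meaning "from 0 to 100 in steps of 5", and that the minor tick rule for the y axis is written as ymtick(0(1)10). The command in your results window may list the options in a different order.

```stata
* Assumes the 2000 Ghana census file is already loaded (do-file layout)
histogram age, percent ///
    xtitle("Age") ///
    xlabel(0(5)100) ///
    ymtick(0(1)10) ///
    title("Age distribution in Ghana (2000)") ///
    subtitle("(Unweighted)") ///
    note("Source: IPUMS 1% subsample of Ghanaian Census 2000")
```

Removing one option at a time and redrawing the graph is a quick way to see what each term does.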
The if tab
We might be interested in the age distribution not only for Ghanaians as a whole, but for men
and women separately. To restrict the sample to females, click on the if tab and enter the
restriction sex==2.
The by tab
Often, it is also nice to be directly able to compare whether two distributions look similar. Do
men and women have similar age patterns or are there important differences? We can answer
this question by clicking on the by tab.
Click on the box “Draw sub graphs for unique values of variables” and then enter the variable
of interest in the space below this (in our case sex). Then go back to the if tab and get rid of the
restriction sex==2. Also get rid of the major tick property changes that we made under the x
axis tab to make the graph less cluttered. Now click ok. Your graph should now show two
histograms, one for men and one for women.
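At the command line, the if tab and the by tab correspond to an if restriction and a by() option. A minimal sketch, assuming the census file is in memory and sex is coded 2 for females as in the restriction above:

```stata
* Histogram of age for females only (the "if" tab)
histogram age if sex==2, percent

* One histogram per value of sex, side by side (the "by" tab)
histogram age, percent by(sex)
```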
Try making a histogram for the number of children ever born for women aged 45-49. Use the
“width of bin” option, setting it to 1. Beware missing value codes!
Pie charts
Like histograms, pie graphs are graphical tools that show the reader the
distribution of a particular variable. In contrast to histograms, however, pie graphs are used
for categorical variables only.
We will start our explanation of pie charts with an easy example: suppose we want to know
whether there are an equal number of men and women in our dataset. One way we could find
out whether this is the case is by producing a pie chart for gender. (Note: If we want to know
whether there are an equal number of men and women in Ghana, we should really either use
the full census or use weights. We will learn more about weights later today).
We create the pie chart as follows:
Click on the graphics drop down menu tab and select the pie chart option (like we did for the
histogram section). Once again a window will appear, but this time the window will be for
graphing pie charts.
Note that this window has very similar tabs to the histogram window, so you will be familiar
with how to use these tabs already. The window will open on the “main” tab. Make sure the
“Graph by categories” option is selected. Enter the name of the variable of interest under
category variable. Let’s use the variable sex. Then click ok.
A new window with the pie chart will pop up.
The relevant syntax for this graph is:
graph pie, over(sex)
What does this graph tell us?
The “if/in” tab
Now suppose we want to dig a bit deeper and investigate the gender distribution of newborns
and infants in our dataset for Ghana. To do this we need to restrict the sample to observations
whose age is either 0 or 1.
In the pie chart window, click on the “if/in” tab. Type in our restriction:
age==0 | age==1
Could we have written the age restriction in a different way?
Click ok. What does this new graph tell us?
Notice that the code for this new pie chart is now
graph pie if age==0 | age==1, over(sex)
Now, let’s look at a categorical variable with more than two categories. Pie charts can be
extremely helpful for exploring patterns when there are a number of categories.
Suppose we’re interested in the distribution of cooking fuel across households. One
difference from variables like age or gender is that this is a household variable. When you
make tables or graphs with household-level information you need to take that into account. If
you don’t control for the number of people in the household you will count large households
more often than smaller ones. This creates bias and your results will not reflect the true
distribution. So we only want to use one observation per household for these variables.
How do we do this? The egen command is very helpful here. If you type
help egen
Stata gives you a list of potential functions that you can use with egen. Scroll down until you
get to the function “tag”. This is exactly what we want: This will “tag” one observation for
each distinct group defined by the variables we choose. In our case we want Stata to choose
one person per household. Since the cooking fuel question is a household question and the
answer is therefore identical for all household members, it doesn’t matter which household
member we choose.
Type:
egen hh_tag=tag(serial)
since serial is a household identifier code that will be the same for all members of a
household, but unique across households.
Now let’s see whether the command worked properly. Type
sort serial
order serial hh_tag
You should see that only one person per household has a value of hh_tag equal to 1 if you
browse the data, whereas all other household members are assigned a 0 for this variable.
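If you want to see what tag() does without browsing the full census, here is a small self-contained sketch on made-up data. The six rows and the pernum variable are invented for illustration; only the variable name serial mirrors the census file. Note that clear removes the data in memory, so save your work or reload the census file afterwards.

```stata
* Toy data: three households (serial 1, 2, 3) with 3, 1 and 2 members
clear
input serial pernum
1 1
1 2
1 3
2 1
3 1
3 2
end

* Tag exactly one observation in each household
egen hh_tag = tag(serial)

* Show the result household by household
list, sepby(serial)

* Counting the tagged observations counts households
count if hh_tag == 1
```

The count should equal the number of households (three here). The same count on the census file tells you how many households are in the sample.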
We can now proceed with our pie graph of cooking fuels. Open the pie graph window.
Remember to reset all the information that we entered previously. Enter fuelck as the new
variable of interest. In the “if/in” tab now specify the restriction
hh_tag==1
Then click on the “by” tab. We may not only be interested in the distribution of different
cooking fuels, but also in how this distribution differs between rural and urban areas.
Therefore, check the box “Draw subgraphs for unique values of variables”, enter the variable
urban in the appropriate line and click ok.
You should now see two pie charts next to each other, showing the distribution of cooking
fuels in rural and urban areas. What is the most popular cooking fuel in rural and urban areas,
respectively?
Tip:
Since pie charts only represent fractions of the total population in rural and urban areas, they
do not tell you differences in the use of cooking fuels in absolute numbers. If you wanted to
know whether more households use kerosene in urban than in rural areas, for example, you
should use tabulate to check this!
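As a sketch, the pie charts above and the check suggested in the tip could be typed as follows, assuming the census file is in memory and hh_tag has been created as described earlier:

```stata
* One pie chart of cooking fuel per value of urban, one observation per household
graph pie if hh_tag==1, over(fuelck) by(urban)

* Absolute (unweighted) numbers of households by cooking fuel and urban status
tabulate fuelck urban if hh_tag==1
```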
Let’s make this graph look a bit nicer. First, you may want to reduce the number of
categories: we are not interested in values of the variable that are labelled as NIU (not in
universe) or unknown/missing. Set these values to missing (remember that when you are working
on your research projects, it may be helpful to create a new variable that keeps the original
values in case you want to go back to them at some point in the future). To do this, type
numlabel, add
tab fuelck
Do you remember what the numlabel command does?
Then type
replace fuelck=. if fuelck==0 | fuelck==99
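If you want to follow the advice about keeping the original values, one option is to make a copy before running the replace command. This is only a sketch: the name fuelck_orig is our own invention, and the copy must be made before fuelck is changed.

```stata
* Copy fuelck, including its value labels, into a new variable (hypothetical name)
clonevar fuelck_orig = fuelck

* Now set the NIU and unknown codes to missing in the working variable
replace fuelck = . if fuelck==0 | fuelck==99

* Compare the original and the cleaned version
tab fuelck_orig fuelck, missing
```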
Next, go back to the pie chart window and enter a title in the titles tab. Then click on the
slices tab.
You will see that the window is divided into two sections: Slice Properties and Labels. In
each of these sections you can either customize all the slices together or each individual slice
separately. For simplicity’s sake, we choose to customize all slices or labels at once.
Let’s start with the slice properties:
- make sure that “Customize all slice properties” is selected;
- Click on “Slice Properties (all)”.
A new window will appear. Here we simply:
- select the “Explode slice” option;
- Click on accept.
We could also add some labels to our graph. To do that:
- make sure the “Customize all slice labels” option is selected;
- click on the “Label Properties” button.
Once again, a new window will appear:
- change the label type to percent ;
- change the colour option to white;
- click “Accept”.
Click ok and look at the new graph and notice the effects of our formatting. The “explode”
option has made our slices look as if they have exploded out of the pie chart! This can be
particularly effective if we just make one slice that we want to emphasize “explode” out. As
you can see, the labels didn’t work so well in our case since we have many categories with
very small percentages. So you would either have to do without labels, change their size and
orientation in the slice menu options to make sure that they are not on top of each other, or
collapse some of the small categories into one by using the “replace” command to have fewer
categories. We will do some reformatting in our next example.
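For reference, here is a sketch of how the slice settings above translate into options, assuming the same restriction and by variable as in the cooking fuel pie charts. The title is left out because its wording is up to you.

```stata
* Do-file layout: /// continues the command on the next line
graph pie if hh_tag==1, over(fuelck) ///
    pie(_all, explode) ///
    plabel(_all percent, color(white)) ///
    by(urban)
```

To explode only one slice, replace _all in the pie() option with the number of that slice, for example pie(2, explode).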
Bar graphs
Bar graphs can be a useful alternative to pie charts when you have a variable with
many categories. Bar charts can be used not only for categorical variables but also for
continuous variables, because each bar can show a summary statistic (such as the mean)
of a continuous variable within each category.
Suppose we want to visualize the mean number of persons in a household in every region by
using a bar chart.
You will find a bar graph option on the graphics drop down menu. Clicking on it will call up
the bar graph window. As usual, we start by focusing on the “Main” tab.
As you see, the “Main” tab looks rather different in comparison to the histogram and pie
chart windows. Under “Type of data” the bar graph gives us the option of each bar reflecting
the actual data or a specific summary statistic. In most situations you will want to keep the
summary statistics option selected:
- make sure the “graph by creating summary statistics” option is selected.
Next we enter our variable of interest in the first row of the “Statistics to Plot” section:
- enter the persons variable in the first row under the variable heading.
You may have noticed that we can also choose the summary statistic which the bar chart will
use. In our first example we will stick to using the mean (but there are also a range of other
options that you can play around with).
Remember that “persons” is a household-level variable. Therefore, go to the if/in tab and
enter the restriction hh_tag==1 as before.
Now click on the “categories” tab. This is where we specify our grouping variable, in our
case region.
- click on the “Categories” tab;
- select “Group 1”;
- enter the regngh variable as the grouping variable.
- Click ok
Clearly this graph is in need of some formatting. Below we change the angle of the group 1
labels so that they do not overlap.
- In the “categories tab” click on Properties (for group one);
- In the sub-menu change the angle of labels to 45 degrees;
- Click accept.
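As a sketch, the bar graph before and after the label change corresponds to the following commands, assuming the census file is in memory and hh_tag has been created as described earlier:

```stata
* Mean household size by region, one observation per household
graph bar (mean) persons if hh_tag==1, over(regngh)

* The same graph with the region labels drawn at a 45 degree angle
graph bar (mean) persons if hh_tag==1, over(regngh, label(angle(45)))
```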
Suppose we also want to see differences between rural and urban areas in terms of the
average number of people per household in each region.
- Click on the “categories” tab
- Tick group 2 and enter urban as the grouping variable
- Click ok
It is often useful to add numerical labels to the bars (it is also useful to take note of the
numerous bar formatting options available in this tab). To do this:
- Click on the “bars” tab
- Under bar labels select “Label with bar height”
To make the graph less cluttered, let’s get rid of the group 2 grouping variable under the
categories tab. Then click ok.
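The last two steps can be sketched at the command line as follows. We assume that Group 1 and Group 2 in the menu become the first and second over() options, and that “Label with bar height” is written as blabel(bar).

```stata
* Two grouping variables: regions (Group 1) within rural/urban (Group 2)
graph bar (mean) persons if hh_tag==1, ///
    over(regngh, label(angle(45))) over(urban)

* Back to one grouping variable, with the mean printed on each bar
graph bar (mean) persons if hh_tag==1, ///
    over(regngh, label(angle(45))) blabel(bar)
```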
Sample weights
Most survey and census data include sample weights. In household surveys these often reflect
issues in sample design, such as an oversample of certain subgroups. A census is supposed to
include everyone in the population, so the weights come about for a different reason. They
are typically based on ex-post enumeration surveys that indicate that certain groups are
underrepresented in the census. The weights adjust the census to be more representative of
what is thought to be the true population. In our case, we are using a random subsample of
the actual census, so we can use weights to make the analysis for our sample representative of
the whole population. You have already seen weights being used once: on Day 1 we used the
display command to calculate how many people lived in Ghana in 2000.
Each observation in the data can be thought of as representing some larger group of people in
the total population. How large the group is depends on various factors such as where that
person lives and the person’s age, ethnicity and gender. All of this information is summarized
in a person's weight. To make your results representative of the population, you tell Stata to
use the weight provided in the data set. Stata uses this weight to weigh some observations
more heavily than others. Weights are typically defined at both the individual and household
level. Whether you use household or individual weights depends on whether your analysis is
at the household or individual level. If you are looking at the percentage of households with
electricity, for example, you would use household weights. If you are looking at the
distribution of education for individuals you would use individual weights. IPUMS uses the
same weight variables in all of the data sets.
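To see what “weighing some observations more heavily” means in practice, here is a small self-contained sketch on made-up data (three people with invented ages and weights, not taken from the census). The weight goes in square brackets after the variable name. Note that clear removes the data in memory, so reload the census file afterwards.

```stata
* Toy data: each row is one sampled person; wt is how many people the row represents
clear
input age wt
10 10
20 10
60 30
end

* Unweighted mean: (10 + 20 + 60)/3 = 30
summarize age

* Weighted mean: (10*10 + 20*10 + 60*30)/(10 + 10 + 30) = 42
summarize age [w=wt]
```

The 60-year-old stands for three times as many people as each of the others, so the weighted mean is pulled towards 60.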
Find the weight variables in the data set:
lookfor weight
There are four weight variables: wtper, wtper_new, wthh, wthh_new. The wtper and wthh
variables are the person weight and household weight created by IPUMS. The wtper_new
and wthh_new variables are new person and household weights that we created to account for
the fact that we drew a subsample from the IPUMS sample. This 2000 file is a 10% sample of
the IPUMS sample, so we multiplied the IPUMS weights by 10. Look at the summary
statistics for these weights:
sum wt*
We want you to use the wtper_new and wthh_new weights, with the choice depending
on whether your analysis is at the individual or household level. The syntax for producing
weighted results is the same in most Stata commands: you specify the weight variable inside
square brackets after the variable list and any if restriction, but before the comma that
introduces options. Below are some examples:
sum age [w=wtper_new], detail
tab age [w=wtper_new]
Since the above tab of age includes everyone, the weighted total at the bottom should be
approximately the total population of Ghana in 2000.
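One way to check this total directly is to add up the person weights, since each weight is the number of people an observation represents. This is a sketch that assumes the census file is in memory. It may not be exactly what we typed on Day 1.

```stata
* Sum of the person weights = estimated total population
quietly summarize wtper_new
display r(sum)
```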
tab age [w=wtper_new], sum(hrswrk1)
sum persons if relate==1 [w=wthh_new]
Let’s look at how much difference the weights make in various calculations:
sum age
sum age [w=wtper_new], detail
Is there a difference between the weighted and unweighted distributions for age? Why?
Now let’s see how things look for another country. Open the file
ipums_tanzania2002.5%.dta, which is a 5% subsample of the original IPUMS file for
Tanzania (which was itself a 10% subsample of the complete census).
Repeat the summary statistics for age from above. Is there a difference between weighted and
unweighted distributions for age? Why?
You should get in the habit of using weights for all of your results, even though in most
cases (as here) the results should be similar with and without weights.
Let’s go back to our Ghana census dataset for the next section.
Population pyramids
Population pyramids are a very effective tool that demographers use to look at age structure.
We can use the Ghana 2000 census data to look at the age structure of the Ghanaian
population. It will be simplest to work with five year age groups. Here is a convenient way
to create a five year age group variable:
gen age5=5*int(age/5)
label var age5 "Five year age group"
tab age age5
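To see how the formula works, note that int() drops the decimal part, so age/5 is rounded down to a whole number before being multiplied by 5 again. A quick check with a few example ages of our own choosing:

```stata
* 37/5 = 7.4, int(7.4) = 7, 5*7 = 35: age 37 falls in the 35-39 group
display 5*int(37/5)

* 4/5 = 0.8, int(0.8) = 0: age 4 falls in the 0-4 group
display 5*int(4/5)

* 90/5 = 18 exactly: age 90 starts the 90-94 group
display 5*int(90/5)
```

Each value of age5 is therefore the lowest age in its five year group.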
Stata is very good at making age pyramids using horizontal twoway bar graphs. There is
even a nice example in the Stata help tutorial, which you can find if you type:
search pyramid
This will take you to the instructions for the twoway_bar command. The following steps
walk through a modified version of the same approach:
Use egen to generate the sum of the sample weights for males and females in each five year
age group and for the total population. Then create a variable that gives the proportion of the
population in each five year age group/gender cell.
egen pop5=sum(wtper_new), by(age5 sex)
egen totpop=sum(wtper_new)
gen pop5m=100*pop5/totpop if sex==1
gen pop5f=-100*pop5/totpop if sex==2
bysort age5 sex: gen count=_n
Note that the pop5f variable is given negative values, a trick that makes the bar graph look
like a pyramid. The last command just creates a simple way to limit the graph to one
observation per age/gender cell, making the graph faster to generate and print. We now have
what we need for a population pyramid. Here is a simple version:
twoway bar pop5m pop5f age5 if count==1 & age5<100, horizontal
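Before trusting the picture, you may want a quick sanity check. The sketch below assumes that sex takes only the values 1 and 2 and that the variables were created exactly as above. The male percentages and the (sign-reversed) female percentages should then add up to about 100. The scalar names are our own.

```stata
* Share of males: sum of pop5m over one observation per age/gender cell
quietly summarize pop5m if count==1
scalar male_share = r(sum)

* Share of females: pop5f is negative, so reverse the sign
quietly summarize pop5f if count==1
scalar female_share = -r(sum)

* Should be close to 100 (small differences come from rounding)
display male_share + female_share
```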
Here is a more complicated version that makes the graph look much better (this is in the do
file agepyramid.do, where /// at the end of a line continues the command on the next line):
twoway bar pop5m pop5f age5 if count==1 & age5<100, horizontal ///
    xtitle("Female          Male" "Percent of Population") xline(0) barwidth(4 4) ///
    xlabel(-7.5 "7.5" -5 "5" -2.5 "2.5" 0 2.5 5 7.5) legend(off) ylabel(0 10 to 95) ///
    title("Age Pyramid, Ghana 2000")
To insert this into a paper it is nice to export it to a file in Windows MetaFile (.wmf) or TIFF
(.tif) format.
graph export agepyr_gh2000.wmf, replace
graph export agepyr_gh2000.tif, replace
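Windows MetaFile export is only available in Stata for Windows. If you are working on another system, or want a format that most programs accept, PNG is a reasonable alternative. A sketch using the same file name stem:

```stata
* Export the graph currently on screen as a PNG file
graph export agepyr_gh2000.png, replace
```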
The do file agepyramid.do shows how to generate age pyramids for people with and without
electricity separately. The same approach could be used to make age pyramids for different
years, for native born versus immigrants, etc.