Thursday, June 30: Basic Graphical Displays

Lecture 3: Distributions
Regression III:
Advanced Methods
William G. Jacoby
Michigan State University
Goals of the lecture
• Examine data in graphical form
– Graphs for looking at univariate distributions
• Histograms
• Density estimation
• Quantile comparison plots
• Boxplots
– Plots exploring relationships
• Parallel boxplots
• Scatterplots (including scatterplot matrices and
dynamic three-dimensional scatterplots)
• Bivariate & three-dimensional density estimation
• Conditioning plots
Histograms
• Dissects the range of the data into bins of equal width
along the horizontal axis
• Vertical axis represents the frequency counts (or
percents, proportions)—Bars represent the counts
• Fewer bins give a smoother histogram but less detail
about the distribution
• Trade-off between smoothness and detail: We want to
preserve as much detail as possible but we do not want
the graph to be too rough (difficult to discern shape)
Stem-and-leaf displays
• Alternative form of histogram that uses the numerical
data themselves to form the bars of the histogram
• Data are broken into two parts: a “stem” and a “leaf”
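A minimal sketch of how a stem-and-leaf display can be built, assuming non-negative values, with the last digit as the leaf and the leading digits as the stem. The function name `stem_and_leaf` and the scores are hypothetical illustrations, not data from the lecture.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf display as strings.

    Each (non-negative) value is rounded to the nearest leaf_unit; the
    last digit becomes the leaf and the leading digits become the stem.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(round(v / leaf_unit))
        stem, leaf = divmod(scaled, 10)
        rows[stem].append(str(leaf))
    lines = []
    # Keep empty stems so that gaps in the distribution stay visible.
    for stem in range(min(rows), max(rows) + 1):
        lines.append(f"{stem:>3} | {''.join(rows.get(stem, []))}")
    return lines

# Hypothetical illustrative scores (not data from the lecture).
scores = [12, 15, 21, 23, 23, 27, 30, 34, 41, 58]
for line in stem_and_leaf(scores):
    print(line)
```

Each row is one "bar" of the histogram, but the digits themselves form the bar, so the original values can still be read off the display.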
Choosing the number of bins
• Simple rule of thumb for small datasets (approx. 100
or less) is:
• For larger samples, the car package for R implements
the formula recommended by Freedman and Diaconis
(1981) in the n.bins function:
[Figure: Histogram of income; x-axis: income, y-axis: Frequency]
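The two bin-count rules can be sketched as follows. Both formulas are assumptions here, since neither is written out in the text: the small-sample rule is taken to be about 2√n bins, and the Freedman-Diaconis count is taken to be ceiling(n^(1/3) × range / (2 × hinge spread)). The function names and the income values are hypothetical, and quartile definitions differ slightly across software.

```python
import math
import statistics

def bins_rule_of_thumb(n):
    """Assumed small-sample rule: about 2 * sqrt(n) bins."""
    return math.floor(2 * math.sqrt(n))

def bins_freedman_diaconis(x):
    """Assumed Freedman-Diaconis count: ceil(n^(1/3) * range / (2 * IQR)).

    Requires a non-zero inter-quartile range.
    """
    n = len(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return math.ceil(n ** (1 / 3) * (max(x) - min(x)) / (2 * iqr))

# Hypothetical incomes: 1000, 2000, ..., 25000 (not the lecture data).
incomes = list(range(1000, 26000, 1000))
print(bins_rule_of_thumb(len(incomes)))   # rule-of-thumb count
print(bins_freedman_diaconis(incomes))    # Freedman-Diaconis count
```

The two rules can give quite different answers for the same sample, so it is worth trying both and looking at the resulting histograms.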
Nonparametric Density Estimation
• Although histograms can be very useful for examining
the distribution of a variable, the diagram can differ
dramatically depending on the number of bins employed
• This problem can be overcome (partially) using
nonparametric density estimation
– Density estimation is an attempt to estimate the
probability density function of a variable based on
the sample, but less formally it can be thought of as
a way of averaging and smoothing the histogram
– Since a density function encloses an area of 1, we
must first rescale the histogram so that the total
area of the bars equals 1. In other words, we examine
the proportion of cases in each bin of the histogram
rather than the frequency counts
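The rescaling step can be sketched as below: each bar height becomes count / (n × bin width), so the bar areas sum to 1. The function name `histogram_density` and the values are hypothetical.

```python
def histogram_density(x, n_bins):
    """Return (bin_edges, densities) for equal-width bins.

    density = count / (n * bin_width), so the bar areas sum to 1.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in x:
        # The maximum value is placed in the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    n = len(x)
    densities = [c / (n * width) for c in counts]
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, densities

# Hypothetical values (not the lecture data).
values = [2, 3, 3, 4, 5, 5, 5, 6, 8, 10]
edges, dens = histogram_density(values, 4)
area = sum(d * (edges[k + 1] - edges[k]) for k, d in enumerate(dens))
print(dens, area)
```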
Kernel Density Estimation
• Essentially a sophisticated form of locally weighted
averaging of the distribution
• Use a weight function (kernel) that ensures the enclosed
area of the curve equals 1
– Probability density functions (such as the standard
normal density function) are good choices because
they are smooth and symmetric
• The kernel density estimate is calculated as follows:
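A minimal sketch, assuming the standard form of the estimator, p̂(x) = (1 / (n·h)) Σ K((x − Xi) / h), with the standard normal density as the kernel K and h as the bandwidth. The function names and the sample are hypothetical.

```python
import math

def gaussian_kernel(z):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def kde(x0, data, h):
    """Kernel density estimate at x0: (1 / (n * h)) * sum K((x0 - Xi) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x0 - xi) / h) for xi in data) / (n * h)

# Hypothetical sample (not the lecture data).
sample = [1.0, 2.0, 2.5, 4.0, 7.0]
for x0 in (0.0, 2.0, 4.0, 6.0):
    print(x0, round(kde(x0, sample, h=1.0), 4))
```

Because each kernel encloses an area of 1 and the sum is divided by n, the estimated curve also encloses an area of 1.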
Selecting the Bandwidth
• Unlike histograms, we no longer set the number of bins;
instead we must select the bandwidth h. We can do this
visually, but statistical theory provides some help:
• The population standard deviation σ is unknown, so we
replace it with an adaptive estimator of spread (the
sample standard deviation S can be inflated if the
underlying density isn’t normal):
– Hinge spread is the inter-quartile range; 1.349 is the
hinge spread of the standard normal distribution.
• The formula for the bandwidth is then:
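A minimal sketch of the bandwidth calculation, assuming the common rule of thumb h = 0.9 · A · n^(−1/5) with the adaptive spread A = min(S, hinge spread / 1.349). Both formulas are assumptions, as neither is written out in the text; the function name and the data are hypothetical.

```python
import statistics

def rule_of_thumb_bandwidth(x):
    """Assumed rule: h = 0.9 * A * n^(-1/5), A = min(S, IQR / 1.349)."""
    n = len(x)
    s = statistics.stdev(x)  # sample standard deviation S
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    a = min(s, (q3 - q1) / 1.349)  # adaptive estimator of spread
    return 0.9 * a * n ** (-1 / 5)

# Hypothetical data with one extreme value that inflates S.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
h = rule_of_thumb_bandwidth(data)
print(round(h, 3))
```

With the extreme value present, S is far larger than the scaled hinge spread, so the hinge spread determines A and the bandwidth stays small.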
An example of Kernel Density
Estimation
• As the bandwidth increases, the density curve
becomes smoother
– Ideally we want a smooth curve like the black line
in the figure (bw=1087)
• If the underlying density is substantially nonnormal,
the bandwidth formula produces a window width 2h
that is too wide (i.e., the curve is oversmoothed),
but it is good as a starting point
[Figure: Histogram with Density Estimation; density curves for bw=1087 and bw=400; x-axis: income, y-axis: Density]
R script for previous graph
Density Estimates with confidence
envelopes
• The sm package for R allows you to plot variability
bands that are a width of two standard errors
• These bands can be especially useful for assessing
modality
• More details are in Bowman and Azzalini (1997:
Chapter 2)
[Figure: density estimate with variability bands; x-axis: income, y-axis: Probability density function]
Density Estimates with a normal
reference band
• The sm package also allows you to fit a normal
reference band which indicates the likely position of
the density estimate when the data are normal (blue
area)
• Again, more details are in Bowman and Azzalini (1997:
Chapter 2)
[Figure: density estimate with normal reference band; x-axis: income, y-axis: Probability density function]
Quantile Comparison Plots (1)
• Quantile comparison plots are most useful for
examining the tails of the distribution
• Unlike histograms and density functions, they do not
require arbitrary bins and thus preserve the continuous
nature of the data. They do so by comparing the sample
distribution to a theoretical cumulative distribution
function (CDF)
• We could substitute the empirical cumulative distribution
function (ECDF), which gives the proportion of data that
fall below each x value as x moves from left to right, for
the CDF
• Because the ECDF is typically rough (a “stair-step”
function that rises a height of 1/n at each
observation), however, the quantile comparison plot
does not construct it directly
Quantile Comparison Plots (2)
• Order the cases from lowest to highest X(1), X(2),…,X(n)
• Calculate the cumulative proportion before each X(i):
• zi values that correspond to the
cumulative probability Pi are found
from the inverse of the CDF:
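The steps can be sketched as follows, assuming the common definition Pi = (i − 1/2) / n for the cumulative proportion and the standard normal as the comparison distribution, so that zi is the inverse normal CDF evaluated at Pi. The function name and the data are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(x):
    """Return (z_i, X_(i)) pairs for a normal quantile comparison plot.

    Assumes P_i = (i - 1/2) / n and z_i = inverse normal CDF of P_i.
    """
    xs = sorted(x)                      # order statistics X_(1), ..., X_(n)
    n = len(xs)
    std_normal = NormalDist()           # mean 0, standard deviation 1
    points = []
    for i, xi in enumerate(xs, start=1):
        p_i = (i - 0.5) / n             # cumulative proportion below X_(i)
        z_i = std_normal.inv_cdf(p_i)   # theoretical quantile
        points.append((z_i, xi))
    return points

# Hypothetical data (not the lecture data).
points = normal_qq_points([3, 1, 2])
for z_i, x_i in points:
    print(round(z_i, 3), x_i)
```

Subtracting 1/2 keeps every Pi strictly between 0 and 1, so the inverse CDF is finite for the smallest and largest observations.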
Quantile Comparison Plots (3)
• Plot the zi values on the horizontal axis and the X(i) on
the vertical axis. If X is normal, then X(i) ≈ μ + σzi. In
other words, the plot should be approximately linear
• Draw a line that connects the hinges (the quartiles)
• A 95% confidence envelope is constructed as follows:
• A positive skew is indicated by points above the line
on both ends; a negative skew is indicated by points
below the line on both ends
• Heavy-tailed distributions are indicated by points above
the confidence envelope for high values and points
below for low values.
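A minimal sketch of the envelope, under stated assumptions: the reference line passes through the quartiles, the standard error of an order statistic is SE(X(i)) = (σ̂ / p(zi)) · sqrt(Pi(1 − Pi) / n) with p the standard normal density, and the envelope is the fitted value ± 2 SE. The envelope formula is not written out in the text, so this form is an assumption; the function name and the data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def qq_envelope(x, multiplier=2.0):
    """Return (z_i, X_(i), lower, upper) rows for a normal QQ plot."""
    std_normal = NormalDist()
    xs = sorted(x)
    n = len(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    z_q1, z_q3 = std_normal.inv_cdf(0.25), std_normal.inv_cdf(0.75)
    slope = (q3 - q1) / (z_q3 - z_q1)   # hinge spread / 1.349, estimates sigma
    intercept = q1 - slope * z_q1       # line passes through both quartiles
    rows = []
    for i, xi in enumerate(xs, start=1):
        p_i = (i - 0.5) / n
        z_i = std_normal.inv_cdf(p_i)
        fitted = intercept + slope * z_i
        se = (slope / std_normal.pdf(z_i)) * math.sqrt(p_i * (1 - p_i) / n)
        rows.append((z_i, xi, fitted - multiplier * se, fitted + multiplier * se))
    return rows

# Hypothetical data with one very large value in the upper tail.
rows = qq_envelope([1, 2, 3, 4, 5, 6, 7, 8, 100])
for z_i, x_i, lower, upper in rows:
    flag = "outside" if not (lower <= x_i <= upper) else ""
    print(round(z_i, 2), x_i, round(lower, 2), round(upper, 2), flag)
```

The envelope widens in the tails because the normal density p(zi) is small there, which is why moderate departures at the extremes are less surprising than the same departures near the center.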
Quantile Comparison Plots (4)
• QQ-plots are easily implemented using the car
package.
• Here we see a positive skew since points lie above the
line on both tails (a negative skew is indicated by
points below the line on both tails)
• As we can see, this plot is useful for examining the
tails of a distribution
• It tells us nothing about the mode, however
[Figure: normal quantile comparison plot; x-axis: norm quantiles, y-axis: income; labeled points: GENERAL.MANAGERS, PHYSICIANS, LAWYERS, OSTEOPATHS.CHIROPRACTORS, VETERINARIANS, PILOTS]
Boxplots
• Display the center, spread, and outliers of a
distribution
– Vertical axis represents the range of the variable
– A box is drawn around the hinge spread
– A line is drawn at the median
– Outliers more than 1.5 hinge spreads past the hinges
are marked individually
– ‘Whiskers’ connect the box to the most extreme
nonoutlying observation
[Figure: Boxplot of Income; y-axis: Income; labeled outliers: GENERAL.MANAGERS, PHYSICIANS, LAWYERS, OSTEOPATHS.CHIROPRACTORS, VETERINARIANS]
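The quantities drawn in a boxplot can be sketched as follows, assuming quartiles stand in for the hinges (the two definitions can differ slightly in small samples). The function name and the data are hypothetical.

```python
import statistics

def boxplot_stats(x):
    """Compute the pieces of a boxplot: hinges, median, whiskers, outliers."""
    xs = sorted(x)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    spread = q3 - q1                          # hinge spread
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    inside = [v for v in xs if lower_fence <= v <= upper_fence]
    return {
        "hinges": (q1, q3),
        "median": median,
        # Whiskers end at the most extreme non-outlying observations.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in xs if v < lower_fence or v > upper_fence],
    }

# Hypothetical data with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```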
Side-by-side Boxplots:
Helpful for comparing many distributions
[Figure: Boxplots of Income for different Occupation Types; categories: bc, prof, wc]
Why use graphs to examine
relationships?
• Anscombe’s (1973) contrived data show the importance
of using graphs in data analysis rather than simply
looking at numerical outputs
• Four very different relationships with exactly the same
correlation coefficient, standard deviation of the
residuals, and coefficients and standard errors
• The linear regression line adequately summarizes only
graph (a)
[Figure: four scatterplots of Y against X with fitted regression lines; panels: (a) Accurate summary, (b) Distorts curvilinear rel., (c) Drawn to outlier, (d) "Chases" outlier]
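The point can be checked numerically with a short sketch. The values below are the commonly reproduced Anscombe (1973) quartet; the helper name `summarize` is hypothetical, and the code computes only the intercept, slope, and correlation rather than the full set of statistics.

```python
import math

# Anscombe's quartet; the first three data sets share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def summarize(x, y):
    """Return (intercept, slope, correlation) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / math.sqrt(sxx * syy)

results = [summarize(x, y) for x, y in
           [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]
for label, (a, b, r) in zip("abcd", results):
    print(label, round(a, 2), round(b, 2), round(r, 2))
```

The numerical summaries agree to two decimal places across all four data sets, so only the plots reveal how different the relationships are.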
Scatterplots
• One of the most used of all statistical graphs—
summarizes the relationship between two
quantitative variables
– Including nonparametric smooths and linear
regression lines can aid visualization of the trend
– A useful technique when one of the variables does
not take on many different values, or when the
sample is so large that the data are over-plotted
and there are few “empty spaces” on the graph, is
to jitter the data (i.e., add a random component
to each value)
• We can jitter one or both of the variables in a
scatterplot
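Jittering can be sketched as below, assuming uniform noise within ± `amount` of each value (other noise distributions work as well). The function name, the noise width, and the ages are hypothetical.

```python
import random

def jitter(values, amount, rng=None):
    """Add uniform random noise in [-amount, amount] to each value."""
    rng = rng or random.Random()
    return [v + rng.uniform(-amount, amount) for v in values]

# Hypothetical ages with many ties, which would over-plot in a scatterplot.
ages = [20, 20, 20, 35, 35, 60]
jittered_ages = jitter(ages, 0.4, random.Random(1))  # seeded for repeatability
print(jittered_ages)
```

The noise should be small relative to the spacing between distinct values, so that tied points separate visually without distorting the overall pattern.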
Jittering scatterplots
[Figure: two scatterplots, "No jitter" and "jittered"; x-axis: Age, y-axis: Conservative attitudes]
Identifying categorical variables
• Distinguishing between categories of a categorical
control variable in scatterplots can help show
important patterns we might otherwise miss
• Different slopes for non-democracies and democracies
suggest an interaction between democracy and income
inequality in their effects on attitudes
[Figure: scatterplot with points marked Democratic / Non-Democratic; x-axis: Gini coefficient, y-axis: Attitudes towards inequality (mean)]
Bivariate Density Estimates
• The kernel smoothing method used for histograms can
be easily extended to the joint distribution of two
random continuous variables
• The bivariate density function takes the following form:
• Where K is the kernel function and h1 and h2 are the
joint smoothing parameters
• For univariate densities, probabilities are associated with
area under the density curve. For a bivariate density
curve, probabilities are associated with volume under
the density, where the total volume equals one
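A minimal sketch of a bivariate estimate, assuming a product-kernel form: f̂(x1, x2) = (1 / (n·h1·h2)) Σ K((x1 − Xi1) / h1) · K((x2 − Xi2) / h2), with a standard normal kernel K. This form is an assumption, since the formula is not written out in the text; the function names and the data pairs are hypothetical.

```python
import math

def gaussian_kernel(z):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def bivariate_kde(x1, x2, data, h1, h2):
    """Product-kernel estimate of the joint density at (x1, x2)."""
    n = len(data)
    total = sum(
        gaussian_kernel((x1 - a) / h1) * gaussian_kernel((x2 - b) / h2)
        for a, b in data
    )
    return total / (n * h1 * h2)

# Hypothetical (x1, x2) pairs (not the lecture data).
pairs = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]
height = bivariate_kde(1.0, 1.0, pairs, h1=1.0, h2=1.0)
print(round(height, 4))
```

Evaluating the function over a grid of (x1, x2) values gives the surface that perspective, image, and contour plots display.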
Types of Bivariate Density Plots
• Perspective plots: the
joint distribution is shown
in a 3D plot—height is
used to show level of
density
• Image plots: different
intensities of colour or
shading denote density
levels
• Contour plots or slice
plots: lines trace paths of
constant levels of density
(similar to the depiction
of elevation in a
geographical contour
map)
R script for Bivariate Density Plots
Three Dimensional Density Estimates
• The three-dimensional density estimate also extends
simply from the bivariate case:
• Where K is the kernel function and h1, h2, and h3 are
the joint smoothing parameters
• In these plots contours represent closed surfaces
• Like the other density estimates, these are helpful for
assessing clustering of the data
Some examples of three-dimensional
density estimates
[Figures: three-dimensional density estimates; axis variables: secpay, prestige, income, education, gini, gdp]
Scatterplot matrix
• Plots individual scatterplots for all possible
bivariate relationships at one time
• Can be enhanced by adding density estimates for each
variable on the diagonal
• Note: Only marginal relationships are depicted (i.e.,
no control for other variables)
[Figure: scatterplot matrix of gini, secpay, and gdp]
Conditioning plots:
An example from the CES
[Figure: conditioning plot of jitter(lascale) against age, given men and education]
Next Topic:
– Transformations for univariate and bivariate data