Lecture 3: Distributions
Regression III: Advanced Methods
William G. Jacoby
Michigan State University

Goals of the lecture
• Examine data in graphical form
  – Graphs for looking at univariate distributions
    • Histograms
    • Density estimation
    • Quantile comparison plots
    • Boxplots
  – Plots exploring relationships
    • Parallel boxplots
    • Scatterplots (including scatterplot matrices and dynamic three-dimensional scatterplots)
    • Bivariate and three-dimensional density estimation
    • Conditioning plots

Histograms
• Dissect the range of the data into bins of equal width along the horizontal axis
• The vertical axis represents the frequency counts (or percents, proportions); the bars represent the counts
• Fewer bins give a smoother histogram but less detail about the distribution
• Trade-off between smoothness and detail: we want to preserve as much detail as possible, but we do not want the graph to be so rough that its shape is difficult to discern

Stem-and-leaf displays
• Alternative form of histogram that uses the numerical data themselves to form the bars of the histogram
• Data are broken into two parts: a "stem" and a "leaf"

Choosing the number of bins
• Simple rule of thumb for small datasets (approx. 100 or less) is: [equation]
• For larger samples, the car package for R implements the formula recommended by Freedman and Diaconis (1981) in its n.bins function: [equation]
[Figure: Histogram of income. Horizontal axis: income; vertical axis: Frequency]

Nonparametric Density Estimation
• Although histograms can be very useful for examining the distribution of a variable, the diagram can differ dramatically depending on the number of bins employed
• This problem can be overcome (partially) using nonparametric density estimation
  – Density estimation is an attempt to estimate the probability density function of a variable based on the sample, but less formally it can be thought of as a way of averaging and smoothing the histogram
  – Since a density function encloses an area of 1, we first must rescale the histogram so that the total area under the smoothed line (the area within the bins) equals 1. In other words, we examine the proportion of cases at specific points in the histogram rather than the frequency counts

Kernel Density Estimation
• Essentially a sophisticated form of locally weighted averaging of the distribution
• Use a weight function (kernel) that ensures the enclosed area of the curve equals 1
  – Probability density functions (such as the standard normal density function) are good choices because they are smooth and symmetric
• The kernel density estimate is calculated as follows: [equation]

Selecting the Bandwidth
• Unlike histograms, we no longer set the number of bins; instead we must select the bandwidth h. We can do this visually, but statistical theory provides some help: [equation]
• The population standard deviation σ is unknown, so we replace it with an adaptive estimator of spread (the sample standard deviation S can be inflated if the underlying density isn't normal): [equation]
  – Hinge spread is the inter-quartile range; 1.349 is the hinge spread of the standard normal distribution
• The formula for the bandwidth is then: [equation]

An example of Kernel Density Estimation
• As the bandwidth increases, the density curve becomes smoother
  – Ideally we want a smooth curve like the black line in the figure (bw=1087)
• If the underlying density is substantially nonnormal, the formula produces a window width 2h that is too wide (i.e., the curve is oversmoothed), but it is good as a starting point
[Figure: Histogram with Density Estimation. Horizontal axis: income; vertical axis: Density. Curves: bw=1087, bw=400]

R script for previous graph
[R script]

• The sm package for R allows you to plot variability bands that are two standard errors wide
• These bands can be especially useful for assessing modality
• More details are in Bowman and Azzalini (1997: Chapter 2)
[Figure: Density Estimates with confidence envelopes. Horizontal axis: income; vertical axis: Probability density function]

• The sm package also allows you to fit a normal reference band, which indicates the likely position of the density estimate when the data are normal (blue area)
• Again, more details are in Bowman and Azzalini (1997: Chapter 2)
[Figure: Density Estimates with a normal reference band. Horizontal axis: income; vertical axis: Probability density function]

Quantile Comparison Plots (1)
• Quantile comparison plots are most useful for examining the tails of the distribution
• Unlike histograms and density functions, they do not require arbitrary bins and thus preserve the continuous nature of the data. They do so by comparing the sample distribution to a theoretical cumulative distribution function (CDF)
• We could substitute the empirical cumulative distribution function (ECDF), which gives the proportion of data that fall below each x value as x moves from left to right, for the CDF
• Because the ECDF is typically rough (a "stair-step" function that rises by 1/n at each observation), however, the quantile comparison plot does not construct it directly

Quantile Comparison Plots (2)
• Order the cases from lowest to highest: X(1), X(2),…,X(n)
• Calculate the cumulative proportion before each X(i): [equation]
• The zi values that correspond to the cumulative probability Pi are found from the inverse of the CDF: [equation]

Quantile Comparison Plots (3)
• Plot the zi values on the horizontal axis and the X(i) on the vertical axis. If X is normal, the X(i) are approximately a linear function of the zi, so the plot should be approximately linear
• Draw a line that connects the hinges (the quartiles)
• A 95% confidence envelope is constructed as follows: [equation]
• A positive skew is indicated by points above the line on both ends; a negative skew is indicated by points below the line on both ends
• Heavy-tailed distributions are indicated by points above the confidence envelope for high values and points below it for low values

Quantile Comparison Plots (4)
• QQ-plots are easily implemented using the car package
• Here we see a positive skew, since points lie above the line on both tails (a negative skew is indicated by points below the line on both tails)
• As we can see, this plot is useful for examining the tails of a distribution
• It tells us nothing about the mode, however
[Figure: quantile comparison plot of income. Horizontal axis: norm quantiles; vertical axis: income. Labelled points: GENERAL.MANAGERS, PHYSICIAN, LAWYERS, OSTEOPATHS.CHIROPRACTORS, VETERINARIANS, PILOTS]

Boxplots
• Display the center, spread, and outliers of a distribution
  – Vertical axis represents the range of the variable
  – A box is drawn around the hinge spread
  – A line is drawn at the median
  – Outliers more than 1.5 hinge spreads past the hinges are marked individually
  – "Whiskers" connect the box to the most extreme nonoutlying observation on each side
[Figure: Boxplot of Income. Vertical axis: Income. Labelled outliers: LAWYERS, OSTEOPATHS.CHIROPRACTORS, VETERINARIANS, GENERAL.MANAGE, PHYSICIANS]

Side-by-side Boxplots: Helpful for comparing many distributions
[Figure: Boxplots of Income for different Occupation Types. Categories: bc, prof, wc]

Why use graphs to examine relationships?
• Anscombe's (1973) contrived data show the importance of using graphs in data analysis rather than simply looking at numerical outputs
• Four very different relationships with exactly the same correlation coefficient, standard deviation of the residuals, and coefficients and standard errors
• The linear regression line adequately summarizes only graph (a)
[Figure: four panels of Y against X. (a) Accurate summary; (b) Distorts curvilinear rel.; (c) Drawn to outlier; (d) "Chases" outlier]

Scatterplots
• One of the most used of all statistical graphs; summarizes the relationship between two quantitative variables
  – Including nonparametric smooths and linear regression lines can aid visualization of the trend
  – A useful technique when one of the variables does not take on many different values, or when the sample is so large that the data are over-plotted and there are few "empty spaces" on the graph, is to jitter the data (i.e., add a random component to each value)
• We can jitter one or both of the variables in a scatterplot

Jittering scatterplots
[Figure: two panels, "No jitter" and "jittered". Horizontal axis: Age; vertical axis: Conservative attitudes]

Identifying categorical variables
• Distinguishing between categories of a categorical control variable in scatterplots can help show important patterns we might otherwise miss
• Different slopes for nondemocracies and democracies suggest an interaction between democracy and income inequality in their effects on attitudes
[Figure: horizontal axis: Gini coefficient; vertical axis: Attitudes towards inequality (mean). Legend: Democratic, Non-Democratic]

Bivariate Density Estimates
• The kernel smoothing method used for histograms can be easily extended to the joint distribution of two continuous random variables
• The bivariate density function takes the following form: [equation]
• where K is the kernel function and h1 and h2 are the joint smoothing parameters
• For univariate densities, probabilities are associated with area under the density curve. For a bivariate density, probabilities are associated with volume under the density surface, where the total volume equals one

Types of Bivariate Density Plots
• Perspective plots: the joint distribution is shown in a 3D plot; height is used to show the level of density
• Image plots: different intensities of colour or shading denote density levels
• Contour plots or slice plots: lines trace paths of constant levels of density (similar to the depiction of elevation in a geographical contour map)

R script for Bivariate Density Plots
[R script]

Three Dimensional Density Estimates
• The three dimensional density estimate also extends simply from the bivariate case: [equation]
• where K is the kernel function and h1, h2, and h3 are the joint smoothing parameters
• In these plots contours represent closed surfaces
• Like the other density estimates, these are helpful for assessing clustering of the data

Some examples of three dimensional density estimates
[Figure: three dimensional density estimates. Axis variables: secpay, prestige, income, education, gini, gdp]

Scatterplot matrix
• Plots individual scatterplots for all possible bivariate relationships at one time
• Can be enhanced by adding density estimates for each variable on the diagonal
• Note: Only marginal relationships are depicted (i.e., no control for other variables)
[Figure: scatterplot matrix of gini, secpay, and gdp]

Conditioning plots: An example from the CES
[Figure: conditioning plot of jitter(lascale) against age, given men and given education]

Next Topic:
– Transformations for univariate and bivariate data
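The "Choosing the number of bins" slide refers to the Freedman and Diaconis (1981) recommendation. A minimal sketch of that idea in plain Python follows, assuming the usual statement of the rule (bin width = 2 × inter-quartile range × n^(-1/3)) and an "inclusive" quartile definition. The function name is hypothetical, and the result is not guaranteed to match what the car package's n.bins function returns.

```python
import math
import statistics


def freedman_diaconis_bins(xs):
    """Number of equal-width bins suggested by the Freedman-Diaconis rule.

    Assumption: bin width = 2 * IQR * n^(-1/3); quartiles use Python's
    'inclusive' method, which may differ slightly from other software.
    """
    n = len(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    width = 2 * (q3 - q1) * n ** (-1 / 3)
    # Divide the data range by the bin width and round up to a whole bin.
    return math.ceil((max(xs) - min(xs)) / width)


# Hypothetical data: the integers 1..100 standing in for a real variable.
example = [float(v) for v in range(1, 101)]
print(freedman_diaconis_bins(example))
```

Because the bin width shrinks only with the cube root of n, the suggested number of bins grows slowly as the sample gets larger.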
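The "Kernel Density Estimation" and "Selecting the Bandwidth" slides can be illustrated with a small sketch. It assumes a standard normal kernel, the textbook estimate p(x) = (1/(nh)) Σ K((x - X_i)/h), and a rule-of-thumb bandwidth h = 0.9 · A · n^(-1/5) with A = min(S, hinge spread / 1.349). The constant 0.9 and the quartile method are assumptions, not values taken from the slides, and all function names are hypothetical.

```python
import math
import statistics


def adaptive_spread(xs):
    """A = min(S, hinge spread / 1.349), a spread estimate robust to non-normality."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(statistics.stdev(xs), (q3 - q1) / 1.349)


def rule_of_thumb_bandwidth(xs):
    """Assumed rule of thumb: h = 0.9 * A * n^(-1/5)."""
    return 0.9 * adaptive_spread(xs) * len(xs) ** (-1 / 5)


def gaussian_kde(xs, h):
    """Return p_hat, the kernel density estimate with a standard normal kernel."""
    n = len(xs)

    def p_hat(x):
        # Sum the kernel weights of every observation, scaled by bandwidth h.
        total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)
        return total / (n * h * math.sqrt(2 * math.pi))

    return p_hat


# Hypothetical sample; a real analysis would use the income variable.
sample = [1.0, 2.0, 2.5, 3.0, 4.0, 7.0]
h = rule_of_thumb_bandwidth(sample)
density = gaussian_kde(sample, h)
print(h, density(2.5))
```

Calling gaussian_kde with a larger h flattens and smooths the curve, which is the behaviour the bandwidth example slide describes.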
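The steps on the "Quantile Comparison Plots (2)" and "(3)" slides can be sketched as follows for a normal comparison distribution. The plotting position P_i = (i - 1/2)/n is an assumption (a common convention), and the function name is hypothetical. The sketch only computes the plotted coordinates; it does not draw the plot or the confidence envelope.

```python
from statistics import NormalDist


def normal_quantile_pairs(xs):
    """Return (z_i, X_(i)) pairs for a normal quantile comparison plot."""
    ordered = sorted(xs)          # X_(1) <= X_(2) <= ... <= X_(n)
    n = len(ordered)
    std_normal = NormalDist()     # standard normal: mean 0, sd 1
    pairs = []
    for i, x in enumerate(ordered, start=1):
        p_i = (i - 0.5) / n                 # assumed cumulative proportion
        z_i = std_normal.inv_cdf(p_i)       # inverse of the normal CDF
        pairs.append((z_i, x))
    return pairs


# Hypothetical data; z_i goes on the horizontal axis, X_(i) on the vertical.
for z, x in normal_quantile_pairs([3.0, 1.0, 2.0, 5.0, 4.0]):
    print(round(z, 3), x)
```

If the data are roughly normal, the printed pairs fall close to a straight line; systematic departures at the ends point to skewness or heavy tails.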
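The boxplot construction listed on the "Boxplots" slide can be sketched in the same way. The sketch assumes that the hinges can be approximated by quartiles computed with Python's "inclusive" method (Tukey's hinges can differ slightly in small samples). The function name and the returned keys are hypothetical.

```python
import statistics


def boxplot_summary(xs):
    """Median, hinges, whisker ends, and outliers for a single boxplot."""
    ordered = sorted(xs)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    spread = q3 - q1                      # hinge spread (inter-quartile range)
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    # Observations inside the fences determine where the whiskers end.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "median": median,
        "hinges": (q1, q3),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Hypothetical data with one extreme value.
print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Running the same function once per group gives the numbers behind a side-by-side boxplot.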