STATISTICS 1403
PROBABILITY AND STATISTICS FOR THE BIOSCIENCES
PROBLEM SET: GAUSSIAN AND LAPLACE FENCES
Abstract. What qualifies a sample observation as an outlier? Explore two
common distributions to get an idea.
1. Introduction
An outlier is defined to be an observation that is “extremely far” from the center
of its distribution. In exploratory data analysis, we often use the box-and-whisker
plot with fences to identify outliers (observations outside the fences). We define
the lower and upper fences based on the quartiles and interquartile range as
L = Q1 − 1.5 IQR
U = Q3 + 1.5 IQR
IQR = Q3 − Q1
Notice there is no probability statement attached to the appearance of an outlier
in this definition. This exercise will find such statements.
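As a rough illustration of the fence definitions above, the following sketch computes L and U for a small sample and lists the observations outside them. The function names and the data are made up for this example, and the "inclusive" quartile method is an assumption: statistical packages interpolate quartiles in slightly different ways, so fences computed by other software may differ a little.

```python
from statistics import quantiles


def fences(sample):
    """Return (L, U), the lower and upper fences of a sample."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    # The "inclusive" interpolation method is an assumption; other
    # conventions give slightly different quartiles.
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(sample):
    """Return the observations that fall outside the fences."""
    lower, upper = fences(sample)
    return [x for x in sample if x < lower or x > upper]


# Made-up data, for illustration only.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(fences(data))    # (-2.0, 14.0)
print(outliers(data))  # [30]
```

Here Q1 = 4 and Q3 = 8, so IQR = 4 and the fences sit at −2 and 14; only the observation 30 is flagged.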
2. The Gaussian Distribution
Calculate the probability that an observation from a sample of normally
distributed observations is identified as an outlier:
(a) Find the z-scores for the first and third quartiles, Q1 and Q3 (the 25% and
75% points).
(b) Find the number of standard deviation units making up the interquartile
range, IQR.
(c) Find the z-scores which correspond to the lower and upper fences, L and
U.
(d) For normally distributed data, what is the probability that an observation
x is an outlier, i.e., x < L or x > U? That is, find p = Pr(X < L) + Pr(X > U).
(e) Using a binomial model, estimate the sample size n such that the probability
of seeing at least one outlier, Pr(N > 0), is at least 90%. This number is
the smallest integer n satisfying the inequality
Pr(N > 0) = 1 − Pr(N = 0) = 1 − (1 − p)^n ≥ 0.9
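The inequality in part (e) can be solved for n by taking logarithms. The sketch below is one way to do that; the function name `min_sample_size` is hypothetical, and the value p = 0.05 is a placeholder chosen for illustration, not the answer to part (d).

```python
import math


def min_sample_size(p, target=0.9):
    """Smallest integer n with 1 - (1 - p)**n >= target, for 0 < p < 1."""
    # Rearranging gives (1 - p)**n <= 1 - target.
    # Taking logs: n * ln(1 - p) <= ln(1 - target).
    # ln(1 - p) is negative, so dividing by it flips the inequality:
    # n >= ln(1 - target) / ln(1 - p).
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# p = 0.05 is a placeholder value, not the answer to part (d).
n = min_sample_size(0.05)
print(n)  # 45
```

With the placeholder p = 0.05, n = 45 is the first sample size for which 1 − 0.95^n reaches 0.9; n = 44 falls just short.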
3. Laplace Distribution
The Laplace distribution is a symmetric model for absolute deviations from the
mean, µ. Its CDF and inverse CDF (or quantile function) are
F(x) = Pr(X < x) = (1/2) e^((x−µ)/b),        x < µ
F(x) = Pr(X < x) = 1 − (1/2) e^(−(x−µ)/b),   x ≥ µ
and
F^(−1)(p) = µ − b sign(p − 0.5) ln(1 − 2|p − 0.5|)
where b is a scale factor.
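The Laplace CDF and quantile function can be written out directly, which may help when checking hand calculations. This is a minimal sketch: the function names are hypothetical, and the defaults µ = 0 and b = 1 are chosen to give standard units.

```python
import math


def laplace_cdf(x, mu=0.0, b=1.0):
    """F(x) = Pr(X < x) for a Laplace distribution with location mu, scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def laplace_quantile(p, mu=0.0, b=1.0):
    """Inverse CDF, valid for 0 < p < 1."""
    sign = (p > 0.5) - (p < 0.5)  # -1, 0, or +1
    return mu - b * sign * math.log(1.0 - 2.0 * abs(p - 0.5))


# Round-trip check at an arbitrary probability: the CDF of the
# quantile should return the probability we started with.
p = 0.9
x = laplace_quantile(p)
print(x, laplace_cdf(x))
```

The round trip should recover p up to floating-point error, which is a quick way to confirm that the two functions are inverses of each other.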
As before, calculate the probability that an observation from a sample of
Laplace distributed observations is identified as an outlier. Set µ = 0 and
b = 1 to simplify calculations so all scores are in standard units.
(a) Find the scores for the first and third quartiles, Q1 and Q3 (the 25% and
75% points).
(b) Find the number of standard units making up the interquartile range, IQR.
(c) Find the scores which correspond to the lower and upper fences, L and U .
(d) For Laplace distributed data, what is the probability that an observation x
is an outlier, i.e., x < L or x > U?
(e) Using a binomial model, estimate the sample size n such that the probability
of seeing at least one outlier, Pr(N > 0), is at least 90%. (Use the same
inequality as used for the Gaussian distribution.)
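Steps (a) through (d) follow the same pattern for any continuous distribution with a known CDF and quantile function. The sketch below captures that pattern in a hypothetical helper, `fence_outlier_probability`, and demonstrates it on the Uniform(0, 1) distribution. The uniform distribution is not part of this problem set; it is used here only because its answer is easy to verify by hand.

```python
def fence_outlier_probability(cdf, quantile):
    """Pr(X < L) + Pr(X > U) for a continuous distribution."""
    q1, q3 = quantile(0.25), quantile(0.75)   # step (a): quartiles
    iqr = q3 - q1                             # step (b): interquartile range
    lower = q1 - 1.5 * iqr                    # step (c): fences
    upper = q3 + 1.5 * iqr
    return cdf(lower) + (1.0 - cdf(upper))    # step (d): both tails


# Illustration with Uniform(0, 1), which is not part of this problem set.
def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)


def uniform_quantile(p):
    return p


print(fence_outlier_probability(uniform_cdf, uniform_quantile))  # 0.0
```

For Uniform(0, 1) the fences are −0.5 and 1.5, both outside the range of the data, so the outlier probability is zero. Distributions with heavier tails give a positive probability.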
4. Comparison
(a) Draw parallel box-and-whisker plots, labeled with standard scores, for the
Gaussian and Laplace distributions. (Since you have no data, there will be
no outliers.)
(b) Find the excess kurtosis for the Gaussian and Laplace distributions in
any standard statistical reference (cite it in your write-up).
(c) Compare the outlier probabilities for the two distributions. Based on this
comparison, what do you think the kurtosis statistic means?
5. Your Analysis
Your analysis should be handwritten. Include a brief problem statement (in your
own words, do NOT copy these instructions), and a brief narrative explaining your
calculations and findings. Submit your write-up via Blackboard as a scanned or
saved .PDF or Word .DOC file.