Statistics and Probability

Virtual University of Pakistan
Lecture No. 12
Statistics and Probability
by
Miss Saleha Naghmi Habibullah
IN THE LAST LECTURE,
YOU LEARNT
•Mean Deviation
•Standard Deviation and Variance
•Coefficient of variation
TOPICS FOR TODAY
•Chebychev’s Inequality
•The Empirical Rule
•The Five-Number Summary
Question-1:
‘How many measurements are within 1 standard
deviation of the mean?’
Question-2:
‘How many measurements are within 2 standard
deviations?’
and so on.
For any specific data set, we can answer these questions by
counting the number of measurements in each of the intervals.
However, if we are interested in obtaining a general
answer to these questions, the problem is a bit more difficult.
Two sets of answers are available. The first, which applies
to any set of data, is derived from a theorem proved by the
Russian mathematician P.L. Chebychev (1821-1894). The second, which
applies to mound-shaped, symmetric distributions of
data, is based upon empirical evidence that has
accumulated over the years.
And this set of answers is valid and applicable
even if our distribution is slightly skewed.
Let us begin with Chebychev’s theorem.
Chebychev’s Rule applies to any data set,
regardless of the shape of the frequency distribution of
the data.
“For any number k greater than 1, at least 1 – 1/k²
of the data-values fall within k standard
deviations of the mean, i.e., within the interval
(X – kS, X + kS).”
This means that:
a) At least 1 – 1/2² = 3/4 of the data-values will fall
within 2 standard deviations of the mean, i.e.
within the interval (X – 2S, X + 2S).
b) At least 1 – 1/3² = 8/9 of the data-values will fall
within 3 standard deviations of the mean, i.e.
within the interval (X – 3S, X + 3S).
Because Chebychev’s theorem requires k to be
greater than 1, it provides no useful
information on the fraction of measurements
that fall within 1 standard deviation of the
mean, i.e. within the interval (X – S, X + S).
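As a small illustration of the rule just stated, the lower bound 1 – 1/k² can be evaluated for a few values of k. This is only a minimal Python sketch; the function name chebyshev_bound is our own label and does not come from the lecture.

```python
def chebyshev_bound(k: float) -> float:
    """Minimum fraction of data-values lying within k standard deviations of the mean."""
    if k <= 1:
        # For k = 1 the bound is 0, and for k < 1 it is negative: no information.
        raise ValueError("Chebychev's rule is informative only for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_bound(k):.4f} of the data-values")
```

For k = 2 and k = 3 this gives the fractions 3/4 and 8/9 quoted above.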
Next, let us consider the Empirical Rule
mentioned above.
This is a rule of thumb that applies to data
sets with frequency distributions that are mound-shaped
and symmetric, as follows:
[Figure: mound-shaped, symmetric frequency curve; horizontal axis: Measurements; vertical axis: Relative Frequency]
According to this empirical rule:
a) Approximately 68% of the measurements will
fall within 1 standard deviation of the mean, i.e.
within the interval (X – S,X + S)
b) Approximately 95% of the measurements will
fall within 2 standard deviations of the mean, i.e.
within the interval (X – 2S,X + 2S).
c) Approximately 100% (practically all) of the
measurements will fall within 3 standard
deviations of the mean, i.e. within the interval
(X – 3S,X + 3S).
EXAMPLE
The 50 companies’ percentages of revenues
spent on R&D (i.e. Research and Development) are:
13.5   7.2   9.7  11.3   8.0   9.5   7.1   7.5   5.6   7.4
 8.2   9.0   7.2  10.1  10.5   6.5   9.9   5.9   8.0   7.8
 8.4   8.2   6.6   8.5   7.9   8.1  13.2  11.1  11.7   6.5
 6.9   9.2   8.8   7.1   6.9   7.5   6.9   5.2   7.7   6.5
10.5   9.6  10.6   9.4   6.8  13.5   7.7   8.2   6.0   9.5
Calculate the proportions of these measurements
that lie within the intervals X ± S, X ± 2S, and X ± 3S,
and compare the results with the theoretical values.
The mean and standard deviation of these data come
out to be 8.49 and 1.98, respectively.
Hence
(X – S,X + S)
= (8.49 – 1.98, 8.49 + 1.98)
= (6.51, 10.47)
A check of the measurements reveals that 34 of
the 50 measurements, or 68%, fall between 6.51 and
10.47. Similarly, the interval
(X – 2S,X + 2S)
= (8.49 – 3.96, 8.49 + 3.96)
= (4.53, 12.45)
contains 47 of the 50 measurements, i.e. 94% of the
data-values.
Finally, the 3-standard deviation interval
around X, i.e. (X – 3S,X + 3S)
= (8.49 – 5.94, 8.49 + 5.94)
= (2.55, 14.43)
contains all, or 100%, of the measurements.
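The counts in this example can be checked with a short Python sketch. It assumes that S is the sample standard deviation (divisor n – 1), which is the version that reproduces the value 1.98 quoted above; the names rd_percentages and count_within are our own.

```python
from statistics import mean, stdev

# The 50 R&D percentages from the example, in the order listed.
rd_percentages = [
    13.5, 7.2, 9.7, 11.3, 8.0, 9.5, 7.1, 7.5, 5.6, 7.4,
    8.2, 9.0, 7.2, 10.1, 10.5, 6.5, 9.9, 5.9, 8.0, 7.8,
    8.4, 8.2, 6.6, 8.5, 7.9, 8.1, 13.2, 11.1, 11.7, 6.5,
    6.9, 9.2, 8.8, 7.1, 6.9, 7.5, 6.9, 5.2, 7.7, 6.5,
    10.5, 9.6, 10.6, 9.4, 6.8, 13.5, 7.7, 8.2, 6.0, 9.5,
]

x_bar = mean(rd_percentages)
s = stdev(rd_percentages)  # assumption: sample standard deviation (divisor n - 1)


def count_within(data, centre, spread, k):
    """Count the values lying strictly inside (centre - k*spread, centre + k*spread)."""
    lower, upper = centre - k * spread, centre + k * spread
    return sum(lower < x < upper for x in data)


counts = {k: count_within(rd_percentages, x_bar, s, k) for k in (1, 2, 3)}

n = len(rd_percentages)
print(f"mean = {x_bar:.2f}, standard deviation = {s:.2f}")
for k, c in counts.items():
    print(f"within {k} standard deviation(s): {c} of {n} ({100 * c / n:.0f}%)")
```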
In spite of the fact that the distribution of
these data is skewed to the right, the percentages
of data-values falling within 1, 2, and 3 standard
deviations of the mean are remarkably close to the
theoretical values (68%, 95%, and 100%) given by
the Empirical Rule.
The fact of the matter is that, unless the
distribution is extremely skewed, the mound-shaped
approximations will be reasonably accurate. Of
course, no matter what the shape of the distribution,
Chebychev’s Rule assures that at least 75% and at
least 89% (8/9) of the measurements will lie within 2
and 3 standard deviations of the mean, respectively.
In this example, 94% of the values lie
inside the interval X ± 2S, and this percentage is indeed
greater than 75%.
Similarly, 100% of the values lie inside
the interval X ± 3S, and this percentage is indeed
greater than 89%.
In the last lecture, we noted that when all the
values in a set of data are located near their mean,
they exhibit a small amount of variation or
dispersion. And those sets of data in which some
values are located far from their mean have a large
amount of dispersion. Expressing these
relationships in terms of the standard deviation,
which measures dispersion, we can say that when
the values of a set of data are concentrated near
their mean, the standard deviation is small.
And when the values of a set of data are
scattered widely about the mean, the standard
deviation is large.
In exactly the same way, if the standard
deviation computed from a set of data is large,
the values from which it is computed are
dispersed widely about their mean. A useful
rule that illustrates the relationship between
dispersion and standard deviation is given by
Chebychev’s theorem, named after the Russian
mathematician P.L. Chebychev (1821-1894).
This theorem enables us to calculate for
any set of data the minimum proportion of
values that can be expected to lie within a
specified number of standard deviations of the
mean.
The theorem tells us that at least 75% of the values in a
set of data can be expected to fall within two standard
deviations of the mean, at least 89% (8/9) within three
standard deviations of the mean, and at least 94%
(15/16) within four standard deviations of the mean.
CHEBYCHEV’S THEOREM
Given a set of n observations x₁, x₂, x₃, …, xₙ
on the variable X, the probability is at least
(1 – 1/k²) that X will take on a value within k
standard deviations of the mean of the set of
observations (where k > 1).
Suppose that a set of data has a mean of 150 and a
standard deviation of 25. Putting k = 2 in
Chebychev’s theorem, at least
1 – 1/2² = 0.75, i.e. 75%, of the data-values will lie
within two standard deviations of the mean. Since the
standard deviation is 25, we have 2(25) = 50, and so at least
75% of the data-values will lie between 150
– 50 = 100 and 150 + 50 = 200.
Consequently, we can say that we can expect at
least 75% of the values to be between 100 and 200.
By similar calculations we find that we can expect at
least 89% to be between 75 and 225, and at least 96% to
be between 25 and 275.
(The last statement has been made by putting k = 5 in the
formula 1 – 1/k², which gives 1 – 1/25 = 0.96.)
Suppose that another set of data has the same mean as
before, i.e. 150, but a standard deviation of 10.
Applying Chebychev’s theorem, for this set of data
we can expect at least 75% of the values to be between 130
and 170, at least 89% to be between 120 and 180, and at
least 96% to be between 100 and 200. The above results are
summarized in the following table:
PERCENTAGE OF DATA    FOR DATA-SET NO. 1        FOR DATA-SET NO. 2
At least 75%          Lies between 100 & 200    Lies between 130 & 170
At least 89%          Lies between 75 & 225     Lies between 120 & 180
At least 96%          Lies between 25 & 275     Lies between 100 & 200
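The entries of the table above can be reproduced with a few lines of Python. This is a sketch only; the helper name chebyshev_interval and the dictionary data_sets are our own, and k = 2, 3 and 5 are the values used in the text.

```python
def chebyshev_interval(mean, sd, k):
    """Return (bound, lower, upper): at least `bound` of the data lie between lower and upper."""
    bound = 1 - 1 / k**2
    return bound, mean - k * sd, mean + k * sd


# (mean, standard deviation) of the two data-sets discussed in the text.
data_sets = {"Data-set No. 1": (150, 25), "Data-set No. 2": (150, 10)}

for name, (m, sd) in data_sets.items():
    for k in (2, 3, 5):
        bound, lower, upper = chebyshev_interval(m, sd, k)
        print(f"{name}: at least {bound:.0%} lies between {lower} and {upper}")
```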
Thus the intervals computed for the latter set of data are all
narrower than those for the former.
For two symmetric, hump-shaped distributions having the
same mean, this point is depicted in the following diagram:
[Figure: THE SYMMETRIC CURVE. Two symmetric, hump-shaped frequency curves (f against X) centred at 150, with 100, 130, 150, 170 and 200 marked on the X-axis]
Therefore, we see that for a set of data with a small
standard deviation, a larger proportion of the values
will be concentrated near the mean than for a set of
data with a large standard deviation. A limitation of
Chebychev’s theorem is that it gives no
information at all about the probability of observing a
value within one standard deviation of the mean, since
1 – 1/k² = 0 when k = 1. Also, it should be noted that
Chebychev’s theorem provides only a weak, conservative
lower bound for our variable of interest.
For many random variables, the probability of
observing a value within 2 standard deviations of the
mean is far greater than 1 – 1/2² = 0.75.
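To see how conservative the bound can be, the sketch below compares it with the corresponding probabilities for a normally distributed variable. The normal distribution is our own choice of illustration (the lecture does not name a specific distribution here); NormalDist comes from Python's standard statistics module.

```python
from statistics import NormalDist

# Assumption for illustration only: a standard normal variable (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0, sigma=1)


def normal_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (2, 3):
    bound = 1 - 1 / k**2
    print(f"k = {k}: Chebychev bound {bound:.4f}, normal probability {normal_within(k):.4f}")
```

Under this assumption the probability for k = 2 is about 0.95, well above the guaranteed minimum of 0.75.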
In this way, Chebychev’s theorem and the
Empirical Rule play an important role in
understanding the nature and importance of
the standard deviation as a measure of
dispersion.
FIVE-NUMBER SUMMARY
A five-number summary consists of
X0 (the smallest value), Q1, the median, Q3, and Xm (the largest value).
It gives us quite a good idea about the
shape of the distribution.
If the data were perfectly symmetrical, the
following would be true:
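1. The distance from Q1 to the median would be equal
to the distance from the median to Q3.
[Figure: symmetric frequency curve (f against X) with Q1, the median and Q3 marked on the X-axis]
2. The distance from X0 to Q1 would be equal
to the distance from Q3 to Xm.
[Figure: symmetric frequency curve (f against X) with X0, Q1, Q3 and Xm marked on the X-axis]
3. The median, the mid-quartile range, and the
midrange would all be equal.
All these measures would also be equal to the
arithmetic mean of the data.
[Figure: symmetric frequency curve (f against X) with mean = median = midrange = mid-quartile range at the centre]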
On the other hand, for nonsymmetrical distributions, the following
would be true:
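1. In right-skewed distributions, the
distance from Q3 to Xm greatly exceeds
the distance from X0 to Q1.
[Figure: THE POSITIVELY SKEWED CURVE (f against X) with X0, Q1, Q3 and Xm marked on the X-axis]
2. In right-skewed distributions,
median < mid-quartile range < midrange.
[Figure: positively skewed curve (f against X) with the median, mid-quartile range and midrange marked on the X-axis]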
Similarly, in left-skewed distributions, the distance
from X0 to Q1 greatly exceeds the distance from Q3
to Xm.
Also, in left-skewed distributions, midrange < mid-quartile range < median.
EXAMPLE:
Suppose that a study is being conducted
regarding the annual costs incurred by students
attending public versus private colleges and
universities in the United States of America.
In particular, suppose, for exploratory purposes, our
sample consists of 10 universities whose athletic
programs are members of the ‘Big Ten’ Conference.
The annual costs incurred for tuition fees,
room, and board at 10 schools belonging to the Big Ten
Conference are given as follows:
Name of University               Annual Costs (in $000)
Indiana University               15.6
Michigan State University        17.0
Ohio State University            15.2
Pennsylvania State University    16.4
Purdue University                15.2
University of Illinois           15.4
University of Iowa               13.0
University of Michigan           23.1
University of Minnesota          14.3
University of Wisconsin          14.9
If we wish to state the five-number summary for
these data, the first step will be to arrange our data-set in ascending order:
Ordered Array:
X0 = 13.0 14.3 14.9 15.2 15.2 15.4 15.6 16.4 17.0 Xm = 23.1
MEDIAN AND QUARTILES FOR THIS DATA-SET:
1) The median for this data comes out to be 15.30
thousand dollars.
2) The first quartile comes out to be 14.90
thousand dollars, and
3) The third quartile comes out to be 16.40
thousand dollars.
The Five-Number Summary:
X0      Q1      Median   Q3      Xm
13.0    14.9    15.3     16.4    23.1
If we apply the rules that I conveyed to you a
short while ago, it is clear that the annual cost data for
our sample are right-skewed.
We come to this conclusion for two
reasons:
1. The distance from Q3 to Xm (i.e., 6.7) greatly
exceeds the distance from X0 to Q1 (i.e., 1.9).
2. If we compare the median (which is 15.3), the
mid-quartile range (which is 15.65), and the
midrange (which is 18.05), we observe that
median < mid-quartile range < midrange.
Both these points clearly indicate that our
distribution is positively skewed.
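The two checks used in this example can be sketched in Python as follows. The lecture does not state which quartile convention it uses; this sketch assumes the quartiles are the medians of the lower and upper halves of the ordered data, which reproduces the values 14.9 and 16.4 given above. The mid-quartile range and midrange are taken as (Q1 + Q3)/2 and (X0 + Xm)/2, consistent with the figures 15.65 and 18.05. The function name five_number_summary is our own.

```python
from statistics import median

# Annual costs (in $000) for the 10 universities in the example.
annual_costs = [15.6, 17.0, 15.2, 16.4, 15.2, 15.4, 13.0, 23.1, 14.3, 14.9]


def five_number_summary(data):
    """Return (X0, Q1, median, Q3, Xm); quartiles are medians of the two halves (an assumption)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]  # for odd n the middle value is left out of both halves
    return ordered[0], median(lower_half), median(ordered), median(upper_half), ordered[-1]


x0, q1, med, q3, xm = five_number_summary(annual_costs)
mid_quartile = (q1 + q3) / 2
midrange = (x0 + xm) / 2

# Crude versions of the two checks: a simple ">" stands in for "greatly exceeds".
upper_tail_longer = (xm - q3) > (q1 - x0)
ordering_holds = med < mid_quartile < midrange
right_skewed = upper_tail_longer and ordering_holds

print(f"five-number summary: {x0}, {q1}, {med:.2f}, {q3}, {xm}")
print(f"mid-quartile range = {mid_quartile:.2f}, midrange = {midrange:.2f}")
print(f"right-skewed by both checks: {right_skewed}")
```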
The gist of the above discussion is that the
five-number summary is a simple yet effective
way of determining the shape of our frequency
distribution --- without actually drawing the
graph of the frequency distribution.
IN TODAY’S LECTURE,
YOU LEARNT
•Chebychev’s Inequality
•The Empirical Rule
•The Five-Number Summary
IN THE NEXT LECTURE,
YOU WILL LEARN
•Box and Whisker Plot
•Measures of Skewness and Kurtosis
•Moments