descriptive statistics

DESCRIPTIVE STATISTICS
Lectures delivered by
Prof. K. K. Achary
YENEPOYA RESEARCH CENTRE
Background assumed: data, data types, scales of measurement, data
collection, organization and presentation (visualization or visual summary).
In this module we study the summarization of data through descriptive
measures, such as measures of central tendency and measures of variability
(dispersion or spread), the contexts and data types suited to each
measure, and the computation and interpretation of the measures.
• A descriptive measure of a set of values
(observations/data set) is a representative value
(generally a single number) which summarizes the
characteristic properties/features of the data. The
common features observed are:
• concentration around a central value
• scatter (dispersion/spread) around a central value
• asymmetry in the data distribution
• peakedness of the data distribution
• The ‘central value’ and ‘spread’ are numerical
summaries of the data. The center of a data set is
commonly called the average. There are many ways to
describe the average value of a distribution. The most
appropriate measures of center and spread depend on
the shape of the distribution.
• Once these three characteristics of the distribution
(center, spread and shape) are known, we can analyze the
data for interesting features, including unusual data
values, called outliers.
Consider the following data sets.
• Age at first child birth:
22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Blood cholesterol:
145,145,143,189,180,196,145,198,205,209,227,221,292,302,253,233,283
• Day temperature:
27,26,27,28,29,29,29,30,30,30,29,30,31,31,31,32,31,32,32,35
• What are your general comments?
• The descriptive measures can be computed for
the whole mass of data related to a study (if
available?), i.e. the population, or for only a
representative part of the population, i.e. a sample
(a random sample, to be specific).
• If we compute a measure from a random sample,
it is termed a "statistic" (descriptive statistic).
• What is it called if we compute it for a
population? (It is called a parameter.)
• Measures of central tendency (averages):
• Some of the measures of central tendency are:
• Mean (arithmetic mean, or average value as
commonly used)
• Median
• Mode
• We also have positional averages such as quartiles,
deciles and percentiles.
• Arithmetic mean:
• Definition: The arithmetic mean of a sample (or
population) is the "average" of all
observations in the sample/population.
• It is given by
• Arithmetic mean = (Total of all values in the
sample/population) / (Number of values in the
sample/population).
• Let there be n values in the sample, recorded as:
• x1, x2, x3, …, xn (you can also use y1, y2, y3, …, yn)
• Sum or total of the n observations = (x1 + x2 + x3 + … + xn) = ∑xi
• Mean (arithmetic mean/average): $\bar{x} = \frac{\sum x_i}{n}$
• This formula is used to compute the mean of
ungrouped data. (What is ungrouped data? What
is grouped data?)
• Computation steps:
• Add all observations (sum up).
• Find the number of observations in the sample data set (n).
• Divide the total by n. This value is the mean.
• Example:
• Consider the age data.
• 22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Here n = 15.
• The sum (total) is 360.
• Hence mean = 360/15 = 24.
• The mean age at first child birth is 24 years.
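The three computation steps can be sketched in Python as follows. This is a minimal illustration, not part of the lecture; the function name `arithmetic_mean` is made up for this example, and the data are the age values from the slide.

```python
# Ages at first child birth, copied from the example above.
ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]


def arithmetic_mean(values):
    """Mean of ungrouped data: total of the values divided by their count."""
    total = sum(values)   # step 1: add all observations
    n = len(values)       # step 2: number of observations
    return total / n      # step 3: divide the total by n


print(sum(ages), len(ages), arithmetic_mean(ages))  # 360 15 24.0
```

The same function can be applied to the blood cholesterol values to check the exercise.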
• Compute the mean level of blood cholesterol.
• Computation of mean in grouped data:
• Grouped data may be in the form of a discrete
frequency distribution or a continuous
frequency distribution.
• Example of a discrete frequency table
(distribution):

Score level (xi)    No. of students (fi)
0                   3
1                   12
2                   23
3                   46
4                   18
5                   6
• The mean is computed in the same way as
before.
• Total of all observations =
0×3 + 1×12 + 2×23 + 3×46 + 4×18 + 5×6 = 298
• No. of observations = 108
• Mean = 298/108 = 2.759259 ≈ 2.76
• For the above case, a computational formula can
be developed as follows.
• Let the distinct values be x1, x2, x3, …, xk, where
xi repeats fi times. Then the total of all
observations is x1f1 + x2f2 + … + xkfk. The
total number of observations is f1 + f2 + … + fk = n.
• Hence the formula for the mean is: mean = (∑xifi)/n
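As a quick check of the formula (∑xifi)/n, here is a small Python sketch using the score table above. The function name `grouped_mean` is an illustrative choice, not something from the lecture.

```python
# Score levels and number of students, from the discrete frequency table above.
scores = [0, 1, 2, 3, 4, 5]
freqs = [3, 12, 23, 46, 18, 6]


def grouped_mean(values, frequencies):
    """Mean of a discrete frequency table: sum(x_i * f_i) / sum(f_i)."""
    n = sum(frequencies)                                     # total observations
    total = sum(x * f for x, f in zip(values, frequencies))  # x1*f1 + ... + xk*fk
    return total / n


print(sum(freqs))                              # 108
print(round(grouped_mean(scores, freqs), 2))   # 2.76
```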
• Computation of mean from a continuous
frequency table (distribution):
• Recollect the method of constructing a frequency
table when observations are recorded on a
continuous variable. Let the midpoint of the ith
class interval be denoted by xi and the
corresponding class frequency by fi. Then the
mean can be computed using the above formula,
taking the midpoint values as the xi's and
f1 + f2 + … + fk = n, where k is the number of classes.
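The midpoint method can be sketched as below. The three class intervals and their frequencies are made-up illustrative values (they do not come from the lecture), and `mean_from_classes` is a hypothetical helper name.

```python
# Hypothetical continuous frequency table: ((lower, upper), frequency).
# These numbers are for illustration only.
classes = [((0, 10), 4), ((10, 20), 10), ((20, 30), 6)]


def mean_from_classes(table):
    """Estimate the mean by treating every value as sitting at its class midpoint."""
    n = sum(f for _, f in table)
    total = sum(((lower + upper) / 2) * f for (lower, upper), f in table)
    return total / n


# Midpoints are 5, 15, 25, so the total is 5*4 + 15*10 + 25*6 = 320 and n = 20.
print(mean_from_classes(classes))  # 16.0
```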
Properties of the mean
• For a given set of data the mean is unique, and it is
computed using all the observations.
• Computation is easy. It can be computed for
interval and ratio scale data.
• Extreme values influence the mean, and it may be
completely distorted by such values.
• Ex: consider the age of mothers data
• 22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Mean age is 24 years.
• Suppose one observation is 38 instead (say the last
one).
• The revised total is 360 - 24 + 38 = 374.
• Mean is 374/15 = 24.9.
• Suppose 48 and 45 are two observations
replacing 21 and 23. Then the total becomes
360 - 21 - 23 + 48 + 45 = 409 and the mean age
becomes 409/15 = 27.3.
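The effect of extreme values on the mean can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are chosen for this example.

```python
ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]


def mean(values):
    return sum(values) / len(values)


# Case 1: the last observation (24) is recorded as 38.
one_extreme = ages[:-1] + [38]

# Case 2: 48 and 45 replace one occurrence each of 21 and 23.
two_extremes = list(ages)
two_extremes[two_extremes.index(21)] = 48
two_extremes[two_extremes.index(23)] = 45

print(mean(ages))                      # 24.0
print(round(mean(one_extreme), 1))     # 24.9
print(round(mean(two_extremes), 1))    # 27.3
```

Changing only one or two of the fifteen values shifts the mean noticeably, which is the distortion described above.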
MEDIAN
• The median of a given data set is the value which
divides the ordered set of values into two equal
parts, such that the number of values greater than or
equal to the median is equal to the number
of values less than or equal to the median.
• In a sample data set having n values, the median
is the ((n+1)/2)th value when the observations have
been arranged in increasing order.
• Note that when n is odd the median is the
middlemost value.
• If n = 11 then the median is the sixth observation.
• If n = 12 then the median is the value at the 6.5th
position, i.e. the average of the 6th and 7th observations.
• Ex: age data
22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Arrange in increasing order:
• 21,22,23,23,23,24,24,24,24,24,25,25,26,26,26
• n = 15, hence the position of the median is (15+1)/2 = 8,
and the 8th observation is 24.
• The median age at first childbirth is 24 years.
• Ex: 34,45,67,87,39,93,56,48 (ages of persons)
• n = 8 and the median is the 4.5th value.
• Ordered sample: 34,39,45,48,56,67,87,93
• The median is the average of the 4th and 5th values.
• Median = (48+56)/2 = 104/2 = 52 years
• Compute the mean age for the sample.
• Mean age is 58.63 years.
• The mean is larger than the median here due to the
influence of the large values.
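The (n+1)/2 position rule can be sketched in Python as below, covering both the odd and the even case. The function name `median_by_position` is illustrative; the sketch assumes a non-empty list.

```python
import math


def median_by_position(values):
    """Median as the value at 1-based position (n + 1) / 2 of the ordered data.

    When the position ends in .5 (n even), the two neighbouring
    observations are averaged. Assumes at least one value.
    """
    ordered = sorted(values)
    n = len(ordered)
    pos = (n + 1) / 2                       # e.g. 8.0 for n = 15, 4.5 for n = 8
    lower = ordered[math.floor(pos) - 1]    # observation just at/below the position
    upper = ordered[math.ceil(pos) - 1]     # observation just at/above the position
    return (lower + upper) / 2


ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]
persons = [34, 45, 67, 87, 39, 93, 56, 48]

print(median_by_position(ages))     # 24.0
print(median_by_position(persons))  # 52.0
```

For comparison, Python's standard library function `statistics.median` gives the same results for these two samples.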
• Computation of median in grouped data:
– For a discrete frequency table, use the definition and
identify the value corresponding to the cumulative
frequency (n+1)/2.
• Ex:

Score level (xi)    (fi)    cum. freq.
0                   3       3
1                   12      15
2                   23      38
3                   46      84
4                   18      102
5                   6       108

n = 108 and (n+1)/2 = 54.5
Hence the median is the 54.5th value in the ordered
sample.
Observe that the 54th and 55th observations both
have the value 3.
Hence the median score is 3.
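The cumulative-frequency lookup can be sketched as follows. The helper names `value_at_position` and `median_discrete` are made up for this illustration; the data are the score table above.

```python
from itertools import accumulate

scores = [0, 1, 2, 3, 4, 5]
freqs = [3, 12, 23, 46, 18, 6]


def value_at_position(values, frequencies, k):
    """Value of the k-th ordered observation (1-based) in a frequency table."""
    for x, cum in zip(values, accumulate(frequencies)):
        if cum >= k:          # first value whose cumulative frequency reaches k
            return x
    raise ValueError("position is beyond the number of observations")


def median_discrete(values, frequencies):
    """Median of a discrete frequency table via the (n + 1) / 2 position."""
    n = sum(frequencies)
    low_pos = (n + 1) // 2    # 54 when n = 108
    high_pos = (n + 2) // 2   # 55 when n = 108 (equals low_pos when n is odd)
    low = value_at_position(values, frequencies, low_pos)
    high = value_at_position(values, frequencies, high_pos)
    return (low + high) / 2


print(list(accumulate(freqs)))         # [3, 15, 38, 84, 102, 108]
print(median_discrete(scores, freqs))  # 3.0
```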
Try more examples from books.
Computation of median from grouped
data (continuous freq. table)
• First compute the true class limits (if required)
and the cumulative frequency of each class.
• Identify the first class interval for which the
cumulative frequency is greater than or equal
to n/2. This class is the median class.
• Use the following formula to compute the
median:
$\text{Median} = L + \frac{i}{f}\left(\frac{n}{2} - c\right)$
Here the notations have the following meaning:
• L = lower limit of the median class
• i = width of the median class
• f = frequency of the median class
• n = total no. of observations (sample size)
• c = cum. freq. of the class preceding the median class
Example

Class interval    Freq.    Cum. freq.
5 - 10            3        3
10 - 15           8        11
15 - 20           16       27
20 - 25           12       39
25 - 30           9        48
30 - 35           6        54
35 - 40           6        60
• Here n = 60, so n/2 = 30.
• The first class whose cumulative frequency reaches 30
is 20 – 25 (cum. freq. 39). Hence the median class is
20 – 25, with L = 20 and i = 5.
• f = 12
• c = 27
• So median = 20 + (5/12)(30 - 27) = 21.25
• 50% of the observations fall below 21.25 and 50%
of the observations are greater than or equal to the
median value.
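The median-class search and the interpolation formula L + (i/f)(n/2 - c) can be sketched in Python as below, using the class table from this example. The function name `grouped_median` and the (lower, upper, frequency) layout are choices made for this illustration.

```python
# (lower limit, upper limit, frequency) for each class, from the example table.
classes = [
    (5, 10, 3), (10, 15, 8), (15, 20, 16), (20, 25, 12),
    (25, 30, 9), (30, 35, 6), (35, 40, 6),
]


def grouped_median(table):
    """Median of a continuous frequency table: L + (i / f) * (n / 2 - c)."""
    n = sum(f for _, _, f in table)
    if n == 0:
        raise ValueError("the table has no observations")
    half = n / 2
    c = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in table:
        if c + f >= half:                 # first class reaching n/2: median class
            width = upper - lower         # i
            return lower + width * (half - c) / f
        c += f
    raise ValueError("median class not found")


print(grouped_median(classes))  # 21.25
```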
• HW: Compute mean for the above data and
compare with median. What are your
observations?
• Properties of median:
• It is unique. It is computed from the ordered
data. It is a positional average. It is not
influenced by extreme values. Even if the extreme
class intervals are open ended, the median can
still be computed.
MODE
• Mode is the most commonly occurring value
in a data set.
• For ungrouped data the mode can be obtained by
finding the value which appears/repeats the maximum
number of times.
• Ex: Consider the data set:
• 3,4,3,5,5,6,4,3,5,5,7,8,8,5,8,6,9,5
• Here 5 repeats 6 times. Hence mode = 5.
• Note: The mode is not unique. In a data set more
than one value might repeat the maximum number of times.
• Ex: Consider the data set:
• 3,4,3,5,5,6,4,3,5,5,3,8,8,5,3,6,3,5
• Here 3 and 5 each repeat 6 times. Hence both 3
and 5 are modal values.
• Question: If in a data set all values repeat
an equal number of times, what is the mode?
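Counting repetitions can be sketched with `collections.Counter` from the Python standard library. The function name `modes` is illustrative; it returns every value that attains the maximum count, so ties are reported rather than hidden.

```python
from collections import Counter


def modes(values):
    """Return all values that occur the maximum number of times, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


data1 = [3, 4, 3, 5, 5, 6, 4, 3, 5, 5, 7, 8, 8, 5, 8, 6, 9, 5]
data2 = [3, 4, 3, 5, 5, 6, 4, 3, 5, 5, 3, 8, 8, 5, 3, 6, 3, 5]

print(modes(data1))  # [5]
print(modes(data2))  # [3, 5]
```

Trying the function on a data set where every value repeats equally often is one way to start thinking about the question above.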
Grouped discrete frequency distribution
• In a grouped frequency table, the mode is identified
as the value with the largest frequency.

Score level (xi)    (fi)
0                   3
1                   12
2                   23
3                   46
4                   18
5                   6

Mode = 3
Computing mode in continuous frequency
distribution
• Mode is the value that has the highest
frequency in a data set.
• For grouped data, the modal class is the class with
the highest frequency.
• To find the mode for grouped data, use the
following formula:
$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
Here
• L = lower limit/boundary of the modal class
• i = width of the modal class
• ∆1 = difference between the frequency of the
modal class and the frequency of the class before
the modal class
• ∆2 = difference between the frequency of the modal
class and the frequency of the class after the modal
class
• The above formula can be written in the following form:
$\text{Mode} = L + \frac{(f_1 - f_0)}{(2f_1 - f_0 - f_2)}\, i$
L = lower limit/boundary of modal class
i = width of modal class
f0 = freq. of class preceding the modal class
f1 = freq. of modal class
f2 = freq. of class next to modal class
Note: compare with the previous formula, using
∆1 = f1 - f0, ∆2 = f1 - f2
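The grouped-mode formula can be sketched in Python as below. The lecture gives no worked example for it, so this sketch reuses the class table from the grouped median example purely as an illustration. The function name `grouped_mode` is made up, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch.

```python
# (lower limit, upper limit, frequency), reused from the grouped median example.
classes = [
    (5, 10, 3), (10, 15, 8), (15, 20, 16), (20, 25, 12),
    (25, 30, 9), (30, 35, 6), (35, 40, 6),
]


def grouped_mode(table):
    """Mode of a continuous frequency table: L + (d1 / (d1 + d2)) * i."""
    freqs = [f for _, _, f in table]
    m = freqs.index(max(freqs))          # first class with the highest frequency
    lower, upper, f1 = table[m]
    # Assumption: a missing neighbour (first or last class) counts as frequency 0.
    f0 = freqs[m - 1] if m > 0 else 0
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = f1 - f0                         # delta 1
    d2 = f1 - f2                         # delta 2
    return lower + d1 / (d1 + d2) * (upper - lower)


# Modal class is 15 - 20 (f1 = 16, f0 = 8, f2 = 12), so d1 = 8 and d2 = 4.
print(round(grouped_mode(classes), 2))  # 18.33
```

The sketch fails with a division by zero when the modal class and both neighbours have equal frequencies (d1 + d2 = 0); such tables need separate handling.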
• Summary & comments
Descriptive statistics are summary figures or measures that
summarise a data set using a single number. The measures of
central tendency provide a summary statistic giving an idea
about the concentration of the data around a central value.
Mean, median and mode are the three important measures of
central tendency.
Use the mean for data on an interval or ratio scale and the median
for data on an ordinal scale. The mode is valid in all cases, but it
is an ill-defined measure. While the mean is used in parametric
inference, the median is an important measure used in
non-parametric inference.
• The median can be computed graphically from the
cumulative frequency curve (ogive).
• Some remarks:
• While computing mean, median and mode from grouped
frequency tables, we are not using the original sample
values (this information is lost when we group the data).
• To compute the mean we assume that the values in each class
are concentrated at the midpoint. Hence we are getting an
estimate of the mean.
• To compute the median and mode, we use an interpolation
technique to estimate the measures.
• Hence, in grouped data we estimate the measures of central
tendency.
Positional Averages
• Positional averages are measures computed from ordered data.
These are defined with reference to the position of the measure.
The important positional measures are:
• Quartiles: These are measures which divide the ordered data into
four equal parts. The first quartile, Q1, is the value which divides
the ordered data set in such a way that 25% of the observations fall
below it and the remaining observations lie above it.
In ungrouped data, Q1 is the value at the position $\frac{n+1}{4}$.
• The second quartile is the median.
• The third quartile is the value which divides the ordered data set
such that 75% of the observations fall below it and the remaining
observations lie above it. It is denoted by Q3.
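A possible sketch of the position rule for quartiles in ungrouped data is given below. Only the Q1 position (n+1)/4 is stated above. Taking 2(n+1)/4 and 3(n+1)/4 as the positions of Q2 and Q3, and interpolating linearly when a position is fractional, are assumptions of this sketch, and the formulae discussed in class may differ. The helper names are made up.

```python
def value_at(ordered, pos):
    """Value at a 1-based, possibly fractional, position of ordered data."""
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring observations.
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


def quartiles(values):
    """Return (Q1, Q2, Q3) using positions k * (n + 1) / 4 for k = 1, 2, 3."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations for these positions")
    return tuple(value_at(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]
print(quartiles(ages))  # (23, 24, 25)
```

For the age data n = 15, so the positions are 4, 8 and 12, and the second quartile agrees with the median of 24 found earlier.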
• Computation of Quartiles:
– Explained in the class with an example for
ungrouped and grouped data.
– Note down the formulae for computing
quartiles from a continuous frequency
distribution (as discussed in the class).
– Solve more examples for practice.
Percentiles
• Self study
• Prepare a study report on percentiles:
definition, computation, usage and
interpretation.