descriptive statistics

DESCRIPTIVE STATISTICS
Lectures delivered by
Prof.K.K.Achary
YENEPOYA RESEARCH CENTRE
• A descriptive measure of a set of values (observations/data
set) is a representative value (generally a single number)
which summarizes the characteristic properties/features of
the data. The common features observed are:
– concentration of data around a central value
– scatter (dispersion/spread) around a central value
– asymmetry in the data distribution
– peakedness of the data distribution
• The ‘central value’ and ‘spread’ are numerical
summaries of the data.
• The center of a data set is commonly called the
average. There are many ways to describe the average
value of a distribution.
• The most appropriate measures of center and spread
depend on the shape of the distribution and scale of
measurement.
• Once the characteristics of the distribution are known,
we can analyze the data for interesting
features, including unusual data values, called outliers.
Consider the following data sets.
• Age at first child birth:
22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Blood cholesterol:
145,145,143,189,180,196,145,198,205,209,227,2
221,292,302, 253,233,283
• Day temperature:
27,26,27,28,29,29,29,30,30,30,29,30,31,31,31,32
,31,32,32,35
• What are your general comments?
• The descriptive measures can be computed for the whole
mass of data (if available), i.e. the population related to a
study, or only for a representative part of the population,
i.e. a sample (random or non-random).
• If we compute a measure from a random sample, it is
termed a ‘statistic’ (descriptive statistic).
• The measures describing the characteristics of a population
are called parameters. Using the sample data we estimate
the parameters. The functions of sample observations which
estimate the parameters are called ‘estimators’ (they are statistics).
• The numerical value of the estimator is called an ‘estimate’.
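The distinction between a parameter and an estimate can be illustrated with a minimal sketch. It assumes, purely for illustration, that the 15 ages listed earlier form the entire population; the sample size of 5 and the random seed are likewise arbitrary choices, not part of the lecture.

```python
import random

# ASSUMPTION: the 15 ages are treated as the whole population for illustration.
population = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]

# Parameter: the mean of the population.
mu = sum(population) / len(population)

# Draw a simple random sample without replacement (size and seed are arbitrary).
rng = random.Random(0)
sample = rng.sample(population, 5)

# Estimator: the sample mean. Its numerical value here is the estimate.
x_bar = sum(sample) / len(sample)

print("population mean (parameter):", mu)
print("sample mean (estimate):", x_bar)
```

A different seed gives a different sample and therefore a different estimate, while the parameter stays fixed.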
Measures of Central Tendency (Averages)
• Some of the measures of central tendency are:
– Mean (arithmetic mean, or average value as commonly used)
– Median
– Mode
• We also have positional averages such as quartiles,
deciles and percentiles.
• Arithmetic mean:
• Definition: The arithmetic mean of a sample (or
population) is the “average” of all
observations in the sample/population.
• It is given by
• Arithmetic mean = (Total of all values in the
sample/population) / (Number of values in the
sample/population).
• Let there be n values in the sample, recorded as:
• x1, x2, x3, …, xn (you can also use y1, y2, y3, …, yn)
• Sum or total of the n observations = x1 + x2 + x3 + … + xn = ∑xi
• $$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• This formula is used to compute the mean of
ungrouped data. (What is ungrouped data? What
is grouped data?)
• Computation steps:
– Add all observations (sum up).
– Find the number of observations in the sample data set (n).
– Divide the total by n. This value is the mean.
Example:
• Consider the age data:
22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• Here n = 15.
• The sum (total) is 360.
• Hence mean = 360/15 = 24.
• The mean age at first child birth is 24 years.
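The three computation steps can be traced in a few lines of Python. This is only a sketch of the arithmetic in the example; the variable names are illustrative.

```python
# Age at first child birth (years), copied from the example.
ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]

total = sum(ages)      # step 1: add all observations
n = len(ages)          # step 2: count the observations
mean_age = total / n   # step 3: divide the total by n

print(total, n, mean_age)  # 360 15 24.0
```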
• Compute the mean level of blood cholesterol.
• Computation of mean in grouped data:
• Grouped data may be in the form of a discrete
frequency distribution or continuous
frequency distribution.
• Let us first consider a discrete frequency
distribution
Example of a discrete frequency table (distribution)

Score level (xi)    No. of students (fi)
0                   3
1                   12
2                   23
3                   46
4                   18
5                   6
Total               108
• The mean is computed the same way as before.
• Total of all observations =
0×3 + 1×12 + 2×23 + 3×46 + 4×18 + 5×6 = 298
• No. of observations = 108
• Mean = 298/108 = 2.759259 ≈ 2.76
• For the above case, a computational formula can be
developed as follows.
• Let the distinct values be x1, x2, x3, …, xk, where xi
repeats fi times.
• Then the total of all observations is given by
x1f1 + x2f2 + … + xkfk. The total number of observations is
f1 + f2 + … + fk = n.
• Hence the formula for the mean is:
$$\text{Mean} = \frac{\sum_{i=1}^{k} x_i f_i}{n}$$
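As a check, the frequency-weighted formula can be applied to the score table from the example. The sketch below uses only the values in that table; the variable names are illustrative.

```python
# Discrete frequency distribution from the example table.
scores = [0, 1, 2, 3, 4, 5]      # distinct values x_i
freqs = [3, 12, 23, 46, 18, 6]   # frequencies f_i

total = sum(x * f for x, f in zip(scores, freqs))  # sum of x_i * f_i
n = sum(freqs)                                     # f_1 + ... + f_k
mean_score = total / n

print(total, n, round(mean_score, 2))  # 298 108 2.76
```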
• Computation of mean from a continuous
frequency table (distribution):
• Recollect the method of constructing a frequency
table when observations are recorded on a
continuous variable.
• Let the midpoint of the ith class interval be
denoted as xi and the corresponding class
frequency be fi.
• Then the mean can be computed using the above
formula by taking the midpoint values as the xi's, with
f1 + f2 + … + fk = n, where k is the number of classes.
$$\text{Mean} = \frac{\sum_{i=1}^{k} x_i f_i}{n}$$
Remember that we have lost information on the
individual values. We assume that in every
interval the midpoint is the representative
value. Hence we get only an approximate
value (estimate) of the mean; it is not the
actual/true value.
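The size of this approximation can be seen by grouping data whose individual values are known. The sketch below groups the day temperature data listed earlier into class intervals of width 3; these class intervals are an assumption made for illustration and are not given in the lecture.

```python
# Day temperature data from the earlier slide.
temps = [27, 26, 27, 28, 29, 29, 29, 30, 30, 30,
         29, 30, 31, 31, 31, 32, 31, 32, 32, 35]

# ASSUMPTION: class intervals with inclusive limits, chosen for illustration.
classes = [(26, 28), (29, 31), (32, 34), (35, 37)]

midpoints = [(lo + hi) / 2 for lo, hi in classes]                   # x_i
freqs = [sum(lo <= t <= hi for t in temps) for lo, hi in classes]   # f_i

n = sum(freqs)
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
true_mean = sum(temps) / len(temps)

print(freqs)         # [4, 12, 3, 1]
print(grouped_mean)  # 30.15
print(true_mean)     # 29.95
```

With these assumed classes the grouped mean differs from the mean of the raw values by 0.2; a different choice of class intervals would give a different approximation.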
Properties of the mean
• For a given set of data the mean is unique, and it is
computed using all the observations.
• Computation is easy. It can be computed for
interval and ratio scale data.
• Extreme values influence the mean, and it may be
completely distorted by such values.
• Ex: consider the ‘age of mothers’ data:
22,24,24,23,21,25,26,23,24,25,26,26,23,24,24
• The mean age is 24 years.
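• Suppose one observation is 39 instead (say the last
one, 24).
• The revised total is 360 − 24 + 39 = 375.
• The mean is 375/15 = 25 years.
• Suppose instead that, in the original data, 18 and 16
replace the observations 21 and 23.
• The total is then 360 − 21 − 23 + 18 + 16 = 350, and the
mean age becomes 350/15 ≈ 23.3 years.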
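The two modified data sets can be recomputed directly to see how much a few extreme values move the mean. This is a sketch of the arithmetic above; the helper `mean` and the variable names are illustrative.

```python
def mean(values):
    """Arithmetic mean: total of the values divided by their count."""
    return sum(values) / len(values)

ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]

# Case 1: the last observation (24) is replaced by 39.
with_high = ages[:-1] + [39]

# Case 2: in the original data, 21 is replaced by 18 and one 23 by 16.
with_low = list(ages)
with_low[with_low.index(21)] = 18
with_low[with_low.index(23)] = 16

print(mean(ages))                # 24.0
print(mean(with_high))           # 25.0
print(round(mean(with_low), 1))  # 23.3
```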
An important note
• The mean value obtained may not be one of
the values in the data set
• If the data set contains many extreme values
(outliers), then the mean is not an appropriate
average.