descriptive statistics

DESCRIPTIVE STATISTICS-1
Lecture - 5
Prof.K.K.Achary
YENEPOYA RESEARCH CENTRE
 Any finite or infinite collection of individuals, not necessarily animate, subject to a statistical study is called a population (also called a universe).
 Example: population of students (quite general)
 Population of students in YU – specified to an institution
 Population of tobacco-using students in Mangalore
 Populations are usually described in terms of their observational units, their extent (coverage) and time.
 Careful definition of the population is
essential in designing a research study
 In a research study, it may not be possible to
collect information/data from the entire
study population, because it may be too
large
 Then we go for a sample from the study
population
 A sample is a subset of individuals/subjects from the study population, generally a representative part of the population.
 In a statistical study, we consider random or unbiased samples.
 There are different methods of selecting
random samples from a given population
 The type of random sampling and the size of
the sample are very important in a research
study and they should be predetermined
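As a rough illustration of one such method, the sketch below draws a simple random sample without replacement using Python's standard `random.sample`. The roster of 200 ID numbers, the sample size of 20 and the seed are hypothetical values chosen only for illustration; they do not come from the lecture.

```python
import random

# Hypothetical study population: ID numbers 1..200 (made up for illustration).
population_ids = list(range(1, 201))

# Sample size fixed in advance (assumed value).
sample_size = 20

# A seeded generator makes the draw reproducible; the seed is arbitrary.
rng = random.Random(2024)

# Simple random sampling without replacement: every subset of size 20
# is equally likely to be selected.
sample_ids = rng.sample(population_ids, sample_size)

print(sorted(sample_ids))
```

Other designs (stratified, systematic, cluster) would need different selection logic; this sketch covers only the simplest case.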
 The information that is measured/observed from every study subject is a variable.
 The collection of all such measurements from all subjects in the sample (population) is called sample data (population data).
 Example: sample: tobacco-using students in YU
 Study variables: age, gender, tobacco use
 A descriptive measure of a set of values (observations/data set) is a representative value (generally a single number) which summarizes the characteristic properties/features of the data. The common features observed are:
 concentration of data around a central value
 scatter (dispersion/spread) around a central value
 asymmetry in data distribution
 peakedness of data distribution
 The ‘central value’ and ‘spread/dispersion’ are numerical summaries of the data.
 The center of a data set is commonly called the
average. There are many ways to describe the
average value of a distribution.
 The most appropriate measures of center and
spread depend on the shape of the distribution
and scale of measurement.
 Once the characteristics of the distribution are
known, we can analyze the data for interesting
features, including unusual data values, called
outliers.
 Consider the following data sets.
 Age at first child birth: 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24
 Blood cholesterol: 145, 145, 143, 189, 180, 196, 145, 198, 205, 209, 227, 222, 292, 302, 253, 233, 283
 Day temperature: 27, 26, 27, 28, 29, 29, 29, 30, 30, 30, 29, 30, 31, 31, 31, 32, 31, 32, 32, 35
 What are your general comments?
 The descriptive measures can be computed for the whole mass of data related to a study, i.e. the population (if available), or for only a representative part of the population, i.e. a sample (random or non-random).
 If we compute a measure from a random sample, it is termed a “statistic” (descriptive statistic).
 The measures describing the characteristics of a population are called parameters. Using the sample data we estimate the parameters. The functions of sample observations which estimate the parameters are called ‘estimators’ (or ‘statistics’).
 The numerical value of the estimator is called an ‘estimate’.
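The distinction between parameter, estimator and estimate can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the population of 1000 ages is synthetic, and the seed and the sample size of 50 are arbitrary. The function name `sample_mean` is also just a label chosen for this sketch.

```python
import random

def sample_mean(values):
    """Estimator: a function of the sample observations."""
    return sum(values) / len(values)

# Hypothetical population of 1000 ages (synthetic, for illustration only).
rng = random.Random(7)  # arbitrary seed for reproducibility
population = [rng.randint(18, 40) for _ in range(1000)]

# Parameter: the mean of the whole population (rarely known in practice).
population_mean = sum(population) / len(population)

# Draw a simple random sample of size 50 and compute the estimate.
sample = rng.sample(population, 50)
estimate = sample_mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"estimate  (sample mean):     {estimate:.2f}")
```

The estimate will generally be close to, but not equal to, the parameter, and a different random sample would give a different estimate.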
Measures of Central Tendency (Averages)
 Central tendency is the tendency of the data/observations to cluster around a central value. This central value is also called an ‘average’.
 Some of the measures of central tendency are:
 Mean (arithmetic mean, or average value as commonly used)
 Median
 Mode
 We also have positional averages such as quartiles, deciles and percentiles.
 Arithmetic mean:
 The arithmetic mean of a sample (or population) is the “average/mean” of all observations in the sample/population.
 It is given by
 Arithmetic mean = (Total of all observations in the sample/population) / (Number of observations in the sample/population)
 Let there be n values in the sample, recorded as x1, x2, x3, …, xn (you can also use y1, y2, y3, …, yn).
 Sum or total of the n observations = (x1 + x2 + x3 + … + xn) = ∑xi
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
 This formula is used to compute the mean of ungrouped data. (What is ungrouped data? What is grouped data?)
 Computation steps:
• Add all observations (sum up).
• Find the number of observations in the sample data set (n).
• Divide the total by n. This value is the mean.
 Example: Consider the age data.
 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24
 Here n = 15.
 Now, the sum (total) is 360.
 Hence mean = 360/15 = 24.
 The mean age at first child birth is 24 years.
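The three computation steps can be traced on the age data with a short Python sketch. The variable names are chosen here for illustration only.

```python
# Age at first child birth (years), as listed in the example.
ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]

total = sum(ages)      # Step 1: add all observations
n = len(ages)          # Step 2: count the observations
mean_age = total / n   # Step 3: divide the total by n

print(n, total, mean_age)  # 15 360 24.0
```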
 Compute the mean level of blood cholesterol.
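One way to check a hand calculation for this exercise is Python's standard `statistics.mean`, shown below as a sketch. The 17 values are the blood cholesterol data listed earlier, with the value broken across lines on the slide read as 227.

```python
import statistics

# Blood cholesterol values from the data sets listed earlier.
cholesterol = [145, 145, 143, 189, 180, 196, 145, 198, 205,
               209, 227, 222, 292, 302, 253, 233, 283]

n = len(cholesterol)            # number of observations
total = sum(cholesterol)        # total of all observations
mean_chol = statistics.mean(cholesterol)

print(n, total, round(mean_chol, 2))  # 17 3567 209.82
```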
 Computation of mean in grouped data:
 Grouped data may be in the form of a
discrete frequency distribution or
continuous frequency distribution.
 Let us first consider a discrete frequency
distribution
 Score level (xi) and number of students (fi):
Score level (xi) | 0 | 1 | 2 | 3 | 4 | 5 | Total
No. of students (fi) | 3 | 12 | 23 | 46 | 18 | 6 | 108
 The mean is computed the same way as before.
 Total of all observations = 0×3 + 1×12 + 2×23 + 3×46 + 4×18 + 5×6 = 298
 No. of observations = 108
 Mean = 298/108 = 2.759259 ≈ 2.76
 For the above case, a computational formula can be developed as follows.
 Let the distinct values be x1, x2, x3, …, xk, where xi repeats fi times.
 Then the total of all observations is given by x1f1 + x2f2 + … + xkfk. The total number of observations is f1 + f2 + … + fk = n.
 Hence the formula for the mean is:
$$\text{Mean} = \frac{\sum_{i=1}^{k} x_i f_i}{n}$$
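As a quick check of the frequency-weighted formula, the sketch below applies it to the score data (score levels 0 to 5 with their student counts). The variable names are illustrative.

```python
# Score levels (x_i) and number of students (f_i) from the score table.
scores = [0, 1, 2, 3, 4, 5]
frequencies = [3, 12, 23, 46, 18, 6]

# n is the total number of observations, i.e. the sum of the frequencies.
n = sum(frequencies)

# Total of all observations: each value multiplied by how often it occurs.
total = sum(x * f for x, f in zip(scores, frequencies))

mean_score = total / n

print(n, total, round(mean_score, 2))  # 108 298 2.76
```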
 Computation of mean from a continuous frequency table (distribution):
 Recollect the method of constructing a frequency table when observations are recorded on a continuous variable.
 Let the midpoint of the ith class interval be denoted as xi and the corresponding class frequency be fi.
 Then the mean can be computed using the above formula by taking the midpoint values for the xi’s, with f1 + f2 + … + fk = n, where k is the number of classes.
$$\text{Mean} = \frac{\sum_{i=1}^{k} x_i f_i}{n}$$
 Remember that we have lost information on the individual values. We assume that in every interval the midpoint is the representative value. Hence we get only an approximate value (estimate) of the mean; it is not the actual/true value.
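The size of this approximation can be seen by grouping data whose individual values are known. The sketch below groups the blood cholesterol values listed earlier into five classes and compares the midpoint-based mean with the mean of the raw values. The class intervals (width 40, starting at 140, lower limit included and upper limit excluded) are an assumption made for this illustration and are not taken from the lecture.

```python
# Blood cholesterol values from the data sets listed earlier.
cholesterol = [145, 145, 143, 189, 180, 196, 145, 198, 205,
               209, 227, 222, 292, 302, 253, 233, 283]

# Assumed class intervals [lower, upper); chosen for illustration only.
classes = [(140, 180), (180, 220), (220, 260), (260, 300), (300, 340)]

# Class frequency f_i: number of observations falling in each interval.
frequencies = [sum(1 for v in cholesterol if lo <= v < hi)
               for lo, hi in classes]

# Class midpoint x_i stands in for every value in its interval.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

n = sum(frequencies)
grouped_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
true_mean = sum(cholesterol) / len(cholesterol)

print(frequencies)                                   # [4, 6, 4, 2, 1]
print(round(grouped_mean, 2), round(true_mean, 2))   # 216.47 209.82
```

With these assumed classes the grouped mean differs from the raw-data mean by several units; a different choice of class intervals would give a different approximation.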
 For a given set of data, the mean is unique and it is computed using all the observations.
 Computation is easy.
 It can be computed for interval and ratio scale data.
 Extreme values influence the mean and it may be completely distorted by such values.
 Ex: consider the ‘age of mothers’ data: 22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24
 The mean age is 24 years.
 Suppose two observations are 39 (say, the last two).
 The revised total is 360 - 24 - 24 + 39 + 39 = 390.
 The mean is 390/15 = 26 years.
 Suppose 18 and 16 are two observations replacing 21 and 23.
 Then the revised total is 360 - 21 - 23 + 18 + 16 = 350, and the mean age becomes 350/15 ≈ 23.3 years.
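The effect of these replacements can be reproduced with a small Python sketch. The helper `mean` and the list names `high` and `low` are introduced here purely for illustration.

```python
ages = [22, 24, 24, 23, 21, 25, 26, 23, 24, 25, 26, 26, 23, 24, 24]

def mean(values):
    """Arithmetic mean: total of the values divided by their count."""
    return sum(values) / len(values)

# Case 1: the last two observations (24, 24) are replaced by 39 and 39.
high = ages[:-2] + [39, 39]

# Case 2: 18 and 16 replace the observations 21 and 23.
low = list(ages)
low[low.index(21)] = 18
low[low.index(23)] = 16   # replaces the first occurrence of 23

print(mean(ages))            # 24.0
print(mean(high))            # 26.0
print(round(mean(low), 1))   # 23.3
```

Changing only two of the fifteen values moves the mean by two years in the first case, which is the distortion the slide describes.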
An important note
 The mean value obtained may not be one of the values in the data set.
 If the data set contains many extreme values (outliers), then the mean is not an appropriate measure of central tendency/average.