MEASURES OF DISPERSION
Dispersion, variability, and spread: all three terms mean the same thing.
Lectures delivered by
Prof. K. K. Achary
Yenepoya Research Centre
• Consider the following three data sets giving the systolic BP of three groups of individuals (hypothetical data).
• Group 1: 120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120
• Group 2: 120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130
• Group 3: 110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110
• The mean BP of the three groups is the same.
• Mean = 123.33 for each group.
• Let us plot the observations in relation to the
mean value as dot plots.
• Group 1:
 x
 x         x
 x         x
 x         x
 x   x     x     x   x
________*______________
120 122   125   128 130
      mean (123.33)
• Quickly draw the dot plots for the other two
groups.
• Can you see the variations?
• Which group shows more variation?
• Comments?
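The dot plots asked for above can be sketched in plain text. The snippet below is only an illustration: the function name `dot_plot` and the one-row-per-value layout are our own choices, not part of the lecture. It prints the mean of each group, then one row per distinct BP value with one `x` per observation.

```python
from collections import Counter
from statistics import mean

# Hypothetical systolic BP readings copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}


def dot_plot(values):
    """Return a text dot plot: one row per distinct value, one 'x' per observation."""
    counts = Counter(values)
    rows = [f"{v:>4} | " + "x" * counts[v] for v in sorted(counts)]
    return "\n".join(rows)


for name, values in groups.items():
    print(f"{name} (mean = {mean(values):.2f})")
    print(dot_plot(values))
    print()
```

Reading the three plots side by side, the rows for Group 1 cluster within a few units of the mean, while those for Group 3 stretch much further in both directions.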
What is dispersion?
• Dispersion refers to the spread or variability of data. In a data set, if all values are the same, there is no dispersion (zero dispersion).
• Usually, the dispersion is measured around a central value.
• A measure of dispersion conveys information regarding the amount of variability present in a data set.
Some measures of dispersion
• Range: It is defined as the difference between the largest and the smallest observations in a data set.
$R = x_L - x_S$
• where $x_L$ and $x_S$ are the largest and smallest values.
• Range depends on the two extreme values of the data set. The larger the range, the greater the spread/variability in the data set.
• It gives an idea about the overall variability/range of variation in the data set.
• For small samples (n ≤ 10), we can use it as a measure of variability.
• It is a simple measure, but it does not depend on all observations. It is useful in quality control techniques.
• Example:
• Consider the BP data sets of three groups of people.
• For group 1, R = 130 - 120 = 10
• For group 2, R = 145 - 110 = 35
• For group 3, R = 155 - 105 = 50
• Group 3 is the most dispersed data set.
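The range calculation above is a one-liner in most languages. A minimal Python sketch, with the dictionary name `groups` chosen by us for illustration:

```python
# Hypothetical systolic BP readings copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}

# Range R = largest value minus smallest value.
ranges = {name: max(values) - min(values) for name, values in groups.items()}

for name, r in ranges.items():
    print(f"{name}: R = {r}")
```

Note that only two observations per group enter the result, which is why the range ignores how the remaining values are distributed.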
Inter-quartile Range
• The first quartile of a data set is defined as the value which has 25% of the observations falling below it, when the observations are arranged in increasing order. In ungrouped data, the first quartile is computed using the formula:
$Q_1 = \left(\frac{n+1}{4}\right)$th value
• Similarly, the third quartile of a data set is defined as the value which has 75% of the observations falling below it, in an ordered data set. For ungrouped data, it is given by
$Q_3 = \left(\frac{3(n+1)}{4}\right)$th value
• For a grouped frequency distribution, write down the formula for the two quartiles.
• Inter-quartile range is defined as the difference between Q3 and Q1.
• $\mathrm{IQR} = Q_3 - Q_1$
• IQR provides the range for the central 50% of observations in a data set, since 25% of the observations lie below Q1 and 25% lie above Q3. Q1 and Q3 provide the limits within which the central 50% of values fall.
• IQR/2 is called the semi-interquartile range.
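The positional formulas for Q1 and Q3 can be sketched in code as follows. The slides do not say what to do when (n+1)/4 is not a whole number, so linear interpolation between the two neighbouring ordered values is our assumption here; the helper names `value_at`, `quartiles` and `iqr` are also ours.

```python
def value_at(sorted_values, position):
    """Value at a 1-based position, interpolating linearly between neighbours."""
    n = len(sorted_values)
    position = min(max(position, 1), n)  # keep the position inside 1..n
    k = int(position)                    # whole part of the position
    frac = position - k                  # fractional part of the position
    if k == n:
        return float(sorted_values[-1])
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + frac * (upper - lower)


def quartiles(values):
    """Q1 and Q3 as the (n+1)/4-th and 3(n+1)/4-th ordered values."""
    xs = sorted(values)
    n = len(xs)
    return value_at(xs, (n + 1) / 4), value_at(xs, 3 * (n + 1) / 4)


def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical systolic BP readings copied from the slides.
group2 = [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130]
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]

for name, data in [("Group 2", group2), ("Group 3", group3)]:
    q1, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}, semi-IQR = {(q3 - q1) / 2}")
```

With n = 12 the positions are 3.25 and 9.75, so each quartile falls a quarter or three quarters of the way between two ordered observations. Other software may use a different interpolation convention and give slightly different quartiles.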
MEAN DEVIATION
• Mean deviation is defined as the average of absolute deviations measured from a central value. We define mean deviation about the median as follows:
• $\text{M.D.(median)} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{m}|$, where $\bar{m}$ is the median.
Some observations
• Absolute deviation is the numerical value of the deviation of an observation from a central value, say, the mean, median, or mode. Only the numerical value is considered and the sign is ignored.
• Ex: 3 - 8 = -5 is the actual deviation/difference; the absolute difference is 5.
• 16 - 8 = 8; the absolute difference is 8.
• An important note: The sum of actual deviations of observations from their mean is zero.
• Mean deviation about the median is the smallest, that is, no other central value gives a smaller mean deviation.
• Why? Do you have an intuitive answer?
• Since the median is the middlemost value, half of the observations lie on either side of it. Moving the reference point away from the median increases the distance to at least as many observations as it decreases, so the total absolute deviation cannot get smaller.
• M.D. about median is commonly considered.
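To see the minimising property numerically, the sketch below computes the mean deviation of the Group 3 BP data about its median and about its mean. The function name `mean_deviation` is ours, and Group 3 is used only as a convenient example.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)


# Hypothetical systolic BP readings for Group 3, copied from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]

md_median = mean_deviation(group3, median(group3))
md_mean = mean_deviation(group3, mean(group3))

print(f"median = {median(group3)}, M.D. about median = {md_median:.3f}")
print(f"mean   = {mean(group3):.2f}, M.D. about mean   = {md_mean:.3f}")
```

Trying other central values in place of the median (for example 110 or 125) is a quick way to check that none of them gives a smaller result.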
Variance and standard deviation
• Instead of taking absolute deviations, we can
consider average of squared differences of
observations from a central value as a
measure of dispersion.
• Variance is such a measure.
• For a given population, variance is defined as the average of squared differences of observations from their arithmetic mean.
• Let $x_1, x_2, x_3, \ldots, x_N$ be the N observations and $\mu$ be the population mean. Then the sum of squared deviations of the observations is:
• $(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_N-\mu)^2 = \sum_{i=1}^{N}(x_i-\mu)^2$
• Variance $= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
• This is usually denoted as $\sigma^2$.
• The positive square root of variance is called standard deviation.
• For a sample of n observations, the sample variance is given by
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
Here the denominator is (n-1) because the deviations from the mean sum to zero, so only (n-1) of them are free to vary; hence the degrees of freedom is (n-1). Also, with this denominator the sample variance is an unbiased estimator of the population variance.
• Since variance is computed from squared deviations, its unit of measurement is the square of the original unit. Therefore, we consider the positive square root of the variance (or sample variance) and call it the 'standard deviation'.
• The population standard deviation is σ and the sample standard deviation is s.
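The difference between the N and (n-1) denominators can be checked with Python's standard `statistics` module, where `pvariance`/`pstdev` divide by N and `variance`/`stdev` divide by n-1. Treating Group 1 first as a whole population and then as a sample is purely for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical systolic BP readings for Group 1, copied from the slides.
group1 = [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120]

# Sum of squared deviations from the mean, computed by hand.
x_bar = mean(group1)
ss = sum((x - x_bar) ** 2 for x in group1)

pop_var = pvariance(group1)   # ss / N
samp_var = variance(group1)   # ss / (n - 1)

print(f"sum of squared deviations = {ss:.4f}")
print(f"population variance = {pop_var:.4f}, SD = {pstdev(group1):.4f}")
print(f"sample variance     = {samp_var:.4f}, SD = {stdev(group1):.4f}")
```

The sample variance is always the larger of the two, by the factor n/(n-1), and the gap shrinks as n grows.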
Illustrative examples
• Compute the IQR, the M.D. about the median, and the standard deviation for the BP data of all three groups.
Computation of variance in grouped freq. table
$s^2 = \frac{1}{n-1}\sum_{i} f_i (x_i - \bar{x})^2$
• In this formula, n = total frequency, i.e. $\sum f_i$
• $f_i$ = frequency of the ith value/interval
• $x_i$ = ith value of the variable/midpoint of the ith class interval
• $\bar{x}$ = mean of the sample data
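As a sketch of the grouped formula, the Group 1 BP data can be collapsed into a frequency table of distinct values and the variance computed from the table alone. Building the table with `collections.Counter` is our choice for illustration; for class intervals, the keys would be the class midpoints instead.

```python
from collections import Counter

# Hypothetical systolic BP readings for Group 1, copied from the slides.
group1 = [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120]

# Frequency table: distinct value x_i -> frequency f_i.
freq = Counter(group1)

n = sum(freq.values())                               # total frequency
x_bar = sum(x * f for x, f in freq.items()) / n      # mean from the table
s2 = sum(f * (x - x_bar) ** 2 for x, f in freq.items()) / (n - 1)

for x in sorted(freq):
    print(f"x = {x}, f = {freq[x]}")
print(f"n = {n}, mean = {x_bar:.2f}, s^2 = {s2:.4f}")
```

Because the table here lists exact values rather than class midpoints, the result agrees with the ungrouped sample variance; with class intervals it would only approximate it.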
Formula for direct computation
Ungrouped data:
$s^2 = \frac{1}{n-1}\sum x_i^2 - \frac{n}{n-1}(\bar{x})^2 = \frac{\sum x_i^2}{n-1} - \frac{(\sum x_i)^2}{n(n-1)}$
• Grouped data:
$s^2 = \frac{1}{n-1}\sum (x_i^2 f_i) - \frac{n}{n-1}(\bar{x})^2$
An example
• Consider the BP data for group 3
• Solve this using the direct computation
formula
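One possible way to check your hand calculation is the short sketch below, which applies the ungrouped direct-computation formula to Group 3. It needs only the count, the sum, and the sum of squares; the variable names are ours.

```python
# Hypothetical systolic BP readings for Group 3, copied from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]

n = len(group3)
sum_x = sum(group3)                    # sum of the observations
sum_x2 = sum(x * x for x in group3)    # sum of the squared observations

# s^2 = sum(x^2)/(n-1) - (sum(x))^2 / (n(n-1))
s2 = sum_x2 / (n - 1) - sum_x ** 2 / (n * (n - 1))
s = s2 ** 0.5

print(f"n = {n}, sum x = {sum_x}, sum x^2 = {sum_x2}")
print(f"s^2 = {s2:.3f}, s = {s:.3f}")
```

The formula subtracts two large, nearly equal numbers, so on a calculator it is worth carrying several decimal places through the intermediate steps.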
Properties of standard deviation
• SD is the preferred measure of variability. It measures the typical distance of an observation from the mean of the sample.
• If all values are the same, then SD = 0.
• SD is heavily influenced by outliers/extreme values.
• For comparing variation in two or more samples, the coefficient of variation is a better measure.
Coefficient of variation
• Variance (or standard deviation) is an absolute measure of dispersion. In order to compare dispersion in sample data measured in different units of measurement, we define relative measures of dispersion.
• Coefficient of variation (C.V.) is one such measure. It is defined as the ratio of standard deviation to mean.
• For a population, C.V. = $\sigma / \mu$
• In the case of sample data, the sample coefficient of variation is defined as the ratio of the sample standard deviation to the sample mean.
• $cv = s / \bar{x}$
• CV measures the extent of variability in
relation to the mean. It can also be
interpreted as the variation per unit mean.
• CV is relevant only for data measured on a scale with a real (true) zero. If the mean is close to zero or negative, CV is not meaningful.
• In practice, it is used as a percentage value by multiplying the ratio by 100.
• CV is positive valued and it can exceed one.
• What is the meaning of CV = 0?
• It can be used to compare the variability of two or more samples. A small value of CV is interpreted as less variability in the data set.
• "Less variability" implies "greater consistency".
• Useful when two data sets are given in two different units of measurement, like height measured in inches and centimetres.
• Example:
• Compute the CV for the BP data of the three groups of individuals.
• Group 1: sample mean = 123.33, sample variance = 12.2424, CV = 0.028
• Group 2: sample mean = 123.33, sample variance = 115.152, CV = 0.087
• Group 3: sample mean = 123.33, sample variance = 287.879, CV = 0.138
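The CV calculation for the three BP groups can be sketched with the standard `statistics` module, where `stdev` uses the (n-1) denominator. The dictionary names are ours.

```python
from statistics import mean, stdev, variance

# Hypothetical systolic BP readings copied from the slides.
groups = {
    "Group 1": [120, 125, 125, 120, 120, 125, 128, 120, 122, 130, 125, 120],
    "Group 2": [120, 120, 130, 110, 125, 110, 110, 145, 120, 125, 135, 130],
    "Group 3": [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110],
}

# Sample CV = sample standard deviation / sample mean.
cv = {name: stdev(values) / mean(values) for name, values in groups.items()}

for name, values in groups.items():
    print(f"{name}: mean = {mean(values):.2f}, "
          f"variance = {variance(values):.4f}, "
          f"CV = {cv[name]:.3f} ({100 * cv[name]:.1f}%)")
```

Since all three groups share the same mean, ranking them by CV gives the same order as ranking them by SD; the two measures can disagree only when the means differ.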
• In the analysis of experimental data where measurements are made at different concentration levels, if SD increases in proportion to concentration, then CV is preferred to SD.
• Adding CVs, or averaging the CVs of several samples, and using the result as a measure of variation is not correct.
• Can you do it with SD?
Some remarks
• Consider this example:
• Sample 1: mean = 80, SD = 12
• Sample 2: mean = 0.5, SD = 0.2
• CV1 = 0.15 or 15%, CV2 = 0.4 or 40%
• If you compare the variability in the two samples using SD, then sample 1 has more spread.
• But CV tells us that sample 2 has more variability.
• Variability is a property observed in relation to a central value in the data set.
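The two-sample comparison above can be reproduced from the summary figures alone. In this sketch the dictionary layout and variable names are ours; only the means and SDs come from the slide.

```python
# Summary statistics from the slide: name -> (mean, SD).
samples = {"Sample 1": (80, 12), "Sample 2": (0.5, 0.2)}

# CV = SD / mean for each sample.
cv = {name: sd / m for name, (m, sd) in samples.items()}

# Which sample looks more variable under each measure?
larger_sd = max(samples, key=lambda name: samples[name][1])
larger_cv = max(cv, key=cv.get)

for name, (m, sd) in samples.items():
    print(f"{name}: mean = {m}, SD = {sd}, CV = {100 * cv[name]:.0f}%")
print(f"Larger SD: {larger_sd}; larger CV: {larger_cv}")
```

The two measures point to different samples because the SD is judged on an absolute scale, while the CV judges it against the size of the mean.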
Some new measures:
• Since CV is not meaningful when the mean is negative, it is suggested to use the absolute value of the mean. The ratio is called the relative standard deviation. (Not a popular one!)
• As a nonparametric counterpart, the ratio of the interquartile range to the median is suggested, i.e. (Q3 - Q1)/median.
• Again, not a popular measure!
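Both measures are easy to sketch in code. In the example below the temperature readings are invented purely to give a negative mean, and the function names `relative_sd` and `quartile_ratio` are ours. Quartiles come from `statistics.quantiles` (Python 3.8+), whose default "exclusive" method places them at the (n+1)/4-th and 3(n+1)/4-th ordered values with linear interpolation.

```python
from statistics import mean, median, quantiles, stdev


def relative_sd(values):
    """Sample SD divided by the absolute value of the sample mean."""
    return stdev(values) / abs(mean(values))


def quartile_ratio(values):
    """(Q3 - Q1) / median, using the default 'exclusive' quantile method."""
    q1, _, q3 = quantiles(values, n=4)
    return (q3 - q1) / median(values)


# Invented temperatures (deg C) with a negative mean, for illustration only.
temps = [-12, -8, -10, -15, -5]

# Hypothetical systolic BP readings for Group 3, copied from the slides.
group3 = [110, 110, 115, 105, 130, 140, 155, 140, 140, 105, 120, 110]

print(f"temps: mean = {mean(temps)}, relative SD = {relative_sd(temps):.3f}")
print(f"Group 3: (Q3 - Q1) / median = {quartile_ratio(group3):.3f}")
```

The quartile-based ratio depends only on the ordered middle of the data, so a single extreme reading affects it far less than it affects the SD-based ratio.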