Lecture 2

Measures and indexes of dispersion
Data Science 101 Team
Measures of spread/dispersion/variability
I
I
I
A measure of center needs to be complemented by a measure
of spread around this center.
The definition of averages that we explore naturally lend
themselves to measures of variability
Variance: average square distance from the mean
V(x1 , . . . , xn ) =
I
n
1X
(xi − x̄ )2
n i=1
Standard Deviation:
v
u
n
u1 X
t
(xi − x̄ )2
n
I
i=1
Note that R actually divides by n − 1 rather than n. This is
because when x1 , . . . , xn are a sample from a larger population
of possible values, dividing by n − 1 one has a “better”
estimator for the population quantity.
A note: data with frequencies
I
Often data is summarized so that we have counts of
occurrences of the same values: we have a set v1 , . . . , vm of
possible values, with their frequencies fi
v1 v2 · · ·
f1 f2 · · ·
vm
fm
- Calculating averages and standard deviations has to adapt to this
different set-up
1
v̄ = Pm
m
X
1
Variance = Pm
m
X
vi fi
i=1 fi i=1
i=1 fi
(vi − v̄ )2 fi
i=1
A note: the maximal variance of x1 , . . . , xn
I
I
Generally speaking, the variance of a dataset can be arbitrarily
large
Let’s consider some restrictions that make the statement
meaningful
I
xi ≥ 0 ∀i
I
fix the total sum of values
n
X
xi = nx̄
i=1
n
X
(xi − x̄ )2 =
i=1
n
X
(xi2 + x̄ 2 − 2xi x̄ )
i=1
=
=
n
X
i=1
n
X
xi2
+
n
X
x̄ − 2x̄
i=1
=
i=1
n
X
i=1
xi2 + nx̄ 2 − 2x̄ (nx̄ )
i=1
n
X
2
xi2 − nx̄ 2
xi
A note: the maximal variance of x1 , . . . , xn
n
X
So, V(x1 , . . . , xn ) = (
xi2 − nx̄ 2 )/n. Now,
i=1
n
X
n
X
xi2 − nx̄ 2 ≤ (
i=1
xi )2 − nx̄ 2 = n2 x̄ 2 − nx̄ 2 = x̄ 2 n(n − 1)
i=1
Which means that
V(x1 , . . . , xn ) ≤ x̄ 2 (n − 1)
I
Can we imagine a set of values of x1 , . . . , xn for which the
variance is actually equal to this max?
Index of concentration
The opposite of spread-out is “concentrated.”
Let’s consider variables like the one we just talked about, that is
with only positive values. One such variable might be the income
of households in a nation.
It is interesting to study how “concentrated” or not such income is.
One can imagine that the total income of a nation is the total
amount of a resource that one could distribute.
Income inequality in the media
Income inequality in politics
How can we measure “income inequality”?
I
Let’s think we have a population with n individuals, each with
income x1 , . . . xn .
I
nx̄ is the total income in the population (with x̄ =
I
What would be the values of x1 , . . . xn in the case of maximal
“income equality”?
I
What would be the values of x1 , . . . xn in the case of maximal
“income inequality”?
I
How are we going to judge cases in the middle?
I
Any known measure?
I
Any measure we can come up with given what we already
know?
Pn
i=1 xi /n)
A graphical display for income distribution
We take the values x1 , . . . xn and order them
x(1) ≤ x(2) ≤ · · · ≤ x(n)
For simplicity, we are going to drop the parentheses from the index
notation, and just remember that x1 is the smallest index.
We now calculate two quantities:
i
Fi =
n
I
I
Pi
xj
j=1 xj
j=1
Qi = Pn
F1 is the fraction of the population that correspond to the
bottom earner; F2 is the fraction of the population that
correspond to the two bottom earners etc.
Q1 is the fraction of the national income earned by the bottom
earner; Q2 is the fraction of the national income earned by the
two bottom earners etc.
A graphical display for income distribution
I
Let’s think about the relation between Fi and Qi in the case of
perfect income equality
I
In general, Qi ≤ Fi . To see this, let’s look at their definition
P
and multiply by j=1n xj and divide by i
Qi
j=1 xj
Pn
j=1 xj
≤ Fi
Pi
≤
Pi
j=1 xj
i
i
n
Pn
≤
j=1 xj
n
and remember that the xi are increasing.
A graphical display for income distribution
Income values = 1,2,3,10,15,15,30,50
1.00
●
0.75
Q
●
0.50
●
0.25
●
●
0.00
●
0.00
●
●
0.25
●
0.50
F
0.75
1.00
A graphical display for income distribution
Perfect equality
Maximal inequality
1.00
1.00
●
●
●
0.75
0.75
●
0.50
Q
Q
●
●
0.50
●
0.25
0.25
●
●
0.00
0.00
●
0.00
0.25
0.50
0.75
1.00
●
●
0.00
F
How could we use this to construct an Index?
●
0.25
●
●
0.50
F
●
●
0.75
●
1.00
An idea for the index
1.00
●
0.75
Q
●
0.50
●
0.25
●
●
0.00
●
0.00
●
●
0.25
●
0.50
F
0.75
1.00
From area to index
I
Index varies between 0 and 1
I
Area in between curves = 1/2- area under bottom curve
I
Area under bottom curve: sum of areas of trapezoids
A=
I
n
(Fi − Fi−1 )(Qi + Qi−1 )
1 X
−
2 i=1
2
Gini’s index= G =
n
X
A
=1−
(Fi − Fi−1 )(Qi + Qi−1 )
1/2
i=1
How do things change if we have repetition?
I
data in the form
x1 ≤ x2 ≤ · · · ≤ xk
n1
with
I
P
j
nk
nj = n
Define
Pi
Fi =
I
···
n2
j=1 nj
n
Pi
j=1 nj xj
Qi = Pk
Everything else stays the same.
j=1 nj xj
Revisiting the Income data
Race
White
●
Hispanic
●
Black
Asian
●
●
All
●
0.47
0.48
Gini Index
0.49
0.50
Gini index for other data
Gini index for other data
Something to note
We can calculate the following summary of “mutual variability”
∆=
k X
k
X
|xi − xj |
i=1 j=1
And one can show that
G=
∆
2x̄
ni nj
n n