Measures and indexes of dispersion Data Science 101 Team Measures of spread/dispersion/variability I I I A measure of center needs to be complemented by a measure of spread around this center. The definition of averages that we explore naturally lend themselves to measures of variability Variance: average square distance from the mean V(x1 , . . . , xn ) = I n 1X (xi − x̄ )2 n i=1 Standard Deviation: v u n u1 X t (xi − x̄ )2 n I i=1 Note that R actually divides by n − 1 rather than n. This is because when x1 , . . . , xn are a sample from a larger population of possible values, dividing by n − 1 one has a “better” estimator for the population quantity. A note: data with frequencies I Often data is summarized so that we have counts of occurrences of the same values: we have a set v1 , . . . , vm of possible values, with their frequencies fi v1 v2 · · · f1 f2 · · · vm fm - Calculating averages and standard deviations has to adapt to this different set-up 1 v̄ = Pm m X 1 Variance = Pm m X vi fi i=1 fi i=1 i=1 fi (vi − v̄ )2 fi i=1 A note: the maximal variance of x1 , . . . , xn I I Generally speaking, the variance of a dataset can be arbitrarily large Let’s consider some restrictions that make the statement meaningful I xi ≥ 0 ∀i I fix the total sum of values n X xi = nx̄ i=1 n X (xi − x̄ )2 = i=1 n X (xi2 + x̄ 2 − 2xi x̄ ) i=1 = = n X i=1 n X xi2 + n X x̄ − 2x̄ i=1 = i=1 n X i=1 xi2 + nx̄ 2 − 2x̄ (nx̄ ) i=1 n X 2 xi2 − nx̄ 2 xi A note: the maximal variance of x1 , . . . , xn n X So, V(x1 , . . . , xn ) = ( xi2 − nx̄ 2 )/n. Now, i=1 n X n X xi2 − nx̄ 2 ≤ ( i=1 xi )2 − nx̄ 2 = n2 x̄ 2 − nx̄ 2 = x̄ 2 n(n − 1) i=1 Which means that V(x1 , . . . , xn ) ≤ x̄ 2 (n − 1) I Can we imagine a set of values of x1 , . . . , xn for which the variance is actually equal to this max? Index of concentration The opposite of spread-out is “concentrated.” Let’s consider variables like the one we just talked about, that is with only positive values. One such variable might be the income of households in a nation. It is interesting to study how “concentrated” or not such income is. One can imagine that the total income of a nation is the total amount of a resource that one could distribute. Income inequality in the media Income inequality in politics How can we measure “income inequality”? I Let’s think we have a population with n individuals, each with income x1 , . . . xn . I nx̄ is the total income in the population (with x̄ = I What would be the values of x1 , . . . xn in the case of maximal “income equality”? I What would be the values of x1 , . . . xn in the case of maximal “income inequality”? I How are we going to judge cases in the middle? I Any known measure? I Any measure we can come up with given what we already know? Pn i=1 xi /n) A graphical display for income distribution We take the values x1 , . . . xn and order them x(1) ≤ x(2) ≤ · · · ≤ x(n) For simplicity, we are going to drop the parentheses from the index notation, and just remember that x1 is the smallest index. We now calculate two quantities: i Fi = n I I Pi xj j=1 xj j=1 Qi = Pn F1 is the fraction of the population that correspond to the bottom earner; F2 is the fraction of the population that correspond to the two bottom earners etc. Q1 is the fraction of the national income earned by the bottom earner; Q2 is the fraction of the national income earned by the two bottom earners etc. A graphical display for income distribution I Let’s think about the relation between Fi and Qi in the case of perfect income equality I In general, Qi ≤ Fi . To see this, let’s look at their definition P and multiply by j=1n xj and divide by i Qi j=1 xj Pn j=1 xj ≤ Fi Pi ≤ Pi j=1 xj i i n Pn ≤ j=1 xj n and remember that the xi are increasing. A graphical display for income distribution Income values = 1,2,3,10,15,15,30,50 1.00 ● 0.75 Q ● 0.50 ● 0.25 ● ● 0.00 ● 0.00 ● ● 0.25 ● 0.50 F 0.75 1.00 A graphical display for income distribution Perfect equality Maximal inequality 1.00 1.00 ● ● ● 0.75 0.75 ● 0.50 Q Q ● ● 0.50 ● 0.25 0.25 ● ● 0.00 0.00 ● 0.00 0.25 0.50 0.75 1.00 ● ● 0.00 F How could we use this to construct an Index? ● 0.25 ● ● 0.50 F ● ● 0.75 ● 1.00 An idea for the index 1.00 ● 0.75 Q ● 0.50 ● 0.25 ● ● 0.00 ● 0.00 ● ● 0.25 ● 0.50 F 0.75 1.00 From area to index I Index varies between 0 and 1 I Area in between curves = 1/2- area under bottom curve I Area under bottom curve: sum of areas of trapezoids A= I n (Fi − Fi−1 )(Qi + Qi−1 ) 1 X − 2 i=1 2 Gini’s index= G = n X A =1− (Fi − Fi−1 )(Qi + Qi−1 ) 1/2 i=1 How do things change if we have repetition? I data in the form x1 ≤ x2 ≤ · · · ≤ xk n1 with I P j nk nj = n Define Pi Fi = I ··· n2 j=1 nj n Pi j=1 nj xj Qi = Pk Everything else stays the same. j=1 nj xj Revisiting the Income data Race White ● Hispanic ● Black Asian ● ● All ● 0.47 0.48 Gini Index 0.49 0.50 Gini index for other data Gini index for other data Something to note We can calculate the following summary of “mutual variability” ∆= k X k X |xi − xj | i=1 j=1 And one can show that G= ∆ 2x̄ ni nj n n
© Copyright 2026 Paperzz