Empirical distribution function and Percentiles

Q1. Given Data $= \{-1, 0, 1\}$. What proportion of the data do not exceed $y = \frac{1}{2}$?
Ans. There are 3 observations in the data, out of which 2 do not exceed $\frac{1}{2}$ (we compare
each data value to $\frac{1}{2}$ and count). Therefore the required proportion is equal to $\frac{2}{3}$.
Let us now generalize this question.
Q2. Given the data $\{x_1, \dots, x_n\}$, what proportion of the data is less than or equal to $y$?
Ans. We have to compute the number of observations in the data that are less than or equal to
$y$. To get this number we compare each $x_i$ with $y$, $i = 1, \dots, n$. If $x_i$ does not exceed $y$, we
count one, and zero otherwise. Hence the number of $x_i$'s, $i = 1, \dots, n$, not exceeding $y$ equals
$$\sum_{i=1}^{n} I(x_i \le y),$$
where $I(x_i \le y) = 1$ if $x_i \le y$ and zero otherwise.
Therefore the proportion of the data not exceeding $y$ equals
$$\frac{1}{n} \sum_{i=1}^{n} I(x_i \le y).$$
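The formula above can be sketched directly in code. The following is a minimal illustration, not part of the original notes; the function name `ecdf` is an assumption chosen for this example. It counts the indicator $I(x_i \le y)$ over the data and divides by $n$.

```python
def ecdf(data, y):
    """Proportion of the observations in `data` that do not exceed `y`."""
    n = len(data)
    # Each comparison x <= y plays the role of the indicator I(x_i <= y).
    count = sum(1 for x in data if x <= y)
    return count / n


# The data from Q1: two of the three values do not exceed 1/2.
print(ecdf([-1, 0, 1], 0.5))
```

Calling `ecdf([-1, 0, 1], 0.5)` reproduces the count from Q1: two of the three observations do not exceed $\frac{1}{2}$.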
Let us denote this proportion by $F_n(y)$. Given the data $\{x_1, \dots, x_n\}$, we have a function as
follows:
$$F_n(y) = \begin{cases}
0, & y < x_{(1)} \\
\frac{1}{n}, & x_{(1)} \le y < x_{(2)} \\
\frac{2}{n}, & x_{(2)} \le y < x_{(3)} \\
\;\vdots & \\
\frac{n-1}{n}, & x_{(n-1)} \le y < x_{(n)} \\
1, & x_{(n)} \le y
\end{cases}$$
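Note: $x_{(1)}, \dots, x_{(n)}$ are the sorted data, $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. That is, $x_{(1)}$ denotes the
minimum, $x_{(n)}$ is the maximum, and $x_{(i)}$ is the $i$-th smallest observation. When the data values
are distinct, exactly $i - 1$ of the observations are less than $x_{(i)}$. With repeated values, the
empty intervals in the definition of $F_n$ drop out and $F_n$ jumps by more than $\frac{1}{n}$ at a repeated value.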
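Once the data are sorted, $F_n(y)$ can be read off from the position of $y$ among the order statistics. The sketch below is an illustration added here, not part of the original notes; the name `ecdf_sorted` and the sample data are assumptions. It uses `bisect_right` from Python's standard library, which returns the number of entries of a sorted list that are less than or equal to $y$.

```python
from bisect import bisect_right


def ecdf_sorted(order_stats, y):
    """Evaluate F_n(y) given the sorted data x_(1) <= ... <= x_(n)."""
    # bisect_right gives the insertion point to the right of any entries
    # equal to y, i.e. the number of observations that do not exceed y.
    return bisect_right(order_stats, y) / len(order_stats)


data = [3.0, 1.0, 4.0, 2.0]      # hypothetical sample with distinct values
order_stats = sorted(data)       # x_(1), ..., x_(n)

for y in [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 10.0]:
    print(y, ecdf_sorted(order_stats, y))
```

For this sample the printed values step through 0, 1/4, 1/2, 3/4 and 1, staying constant between consecutive order statistics.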
The function $F_n$ is called the empirical distribution function.
Exercise: Plot the function $F_n$ and state its properties.
Q3. Given the data $\{x_1, \dots, x_n\}$ and $0 < p < 1$, can you find a number $y$ such that
$100p$ percent of the data do not exceed $y$?
Ans. Recall that $F_n(y)$ is the proportion of the data not exceeding $y$.
Therefore, to answer the above question, it is natural to solve the equation $F_n(y) = p$.
However, from the definition of $F_n(y)$ it is important to realize that there may not be any $y$
for which $F_n(y) = p$. (Why is that so? $F_n(y)$ can only be equal to one of the $n + 1$
numbers $\{0, \frac{1}{n}, \frac{2}{n}, \dots, \frac{n-1}{n}, 1\}$, and $p$ may not be equal to any of these numbers.)
Moreover, if such a $y$ exists, it may not be unique. E.g. for $n = 4$ distinct values and $p = 0.5$,
$F_4(y) = 0.5$ for every $y$ with $x_{(2)} \le y < x_{(3)}$.
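To make the point about attainable values concrete, the following sketch lists every value that $F_n$ takes for a small sample and checks whether a given $p$ is among them. It is an illustration only; the helper name `attained_values` and the sample are assumptions, and exact fractions are used to avoid floating-point comparisons.

```python
from fractions import Fraction


def attained_values(data):
    """Sorted list of all values that F_n takes for the given data."""
    n = len(data)
    values = {Fraction(0)}  # F_n(y) = 0 for y below the minimum
    for y in data:
        # F_n only changes at data points, so evaluating there is enough.
        values.add(Fraction(sum(1 for x in data if x <= y), n))
    return sorted(values)


levels = attained_values([1, 2, 3, 4])  # hypothetical sample, n = 4
print(levels)
print(Fraction(3, 10) in levels)  # is p = 0.3 attained?
print(Fraction(1, 2) in levels)   # is p = 0.5 attained?
```

For this sample $p = 0.3$ is not one of the attained levels, so $F_4(y) = 0.3$ has no solution, while $p = 0.5$ is attained on a whole interval of $y$ values.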
However, our purpose is served if we can get a $y$ such that
1. $F_n(y) \ge p$, and
2. for any number $z < y$, $F_n(z) < p$.
Since $F_n$ is a non-negative, monotonically non-decreasing function that equals 0 below $x_{(1)}$ and
reaches 1 at $x_{(n)}$, the set $\{x : F_n(x) \ge p\}$ is non-empty and bounded below. Therefore we can define
$$y = \inf\{x : F_n(x) \ge p\}.$$
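Note: Such a $y$ satisfies 1 and 2. (Why? For 1, $F_n$ is a right-continuous step function, so the
infimum is attained and $F_n(y) \ge p$. For 2, if $z < y$ and $F_n(z) \ge p$, then $z$ belongs to the set
$\{x : F_n(x) \ge p\}$, so $y$ is not even a lower bound of that set. Therefore $z < y \implies F_n(z) < p$.)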
The $y$ in the above definition is the $100p$-th percentile.
Percentile: Given the data $\{x_1, \dots, x_n\}$ and $0 < p < 1$, the $100p$-th
percentile is denoted by $Q_p$, and is defined as
$$Q_p = \inf\{x : F_n(x) \ge p\}.$$
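The definition can be turned into a short search over the sorted data: since $F_n$ only increases at data points, the infimum is the first order statistic at which the proportion reaches $p$. The sketch below is an illustration under that observation, not a prescribed implementation; the function name `percentile` is an assumption, and $p$ is compared as a floating-point number, so values of $p$ that are not exactly representable may need exact fractions instead.

```python
def percentile(data, p):
    """Return Q_p = inf{x : F_n(x) >= p} for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    xs = sorted(data)
    n = len(xs)
    for k, x in enumerate(xs, start=1):
        # At the k-th smallest value at least k observations do not exceed x,
        # so F_n(x) >= k / n. The first k with k / n >= p gives the infimum.
        if k / n >= p:
            return x
    return xs[-1]  # not reached for p < 1, kept as a safeguard


print(percentile([1, 2, 3, 4], 0.5))   # F_4 first reaches 0.5 at x_(2)
print(percentile([1, 2, 3, 4], 0.3))   # 0.3 is not attained; F_4 jumps past it at x_(2)
print(percentile([-1, 0, 1], 0.5))
```

Note that statistical software often uses interpolation-based conventions for percentiles, so other tools may return values that differ from this definition.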
The percentiles $Q_p$, $p = 0.01, 0.02, \dots, 0.99$, divide the sorted data into 100 roughly equal parts.
Quartiles: We can divide the data into four parts using the 25th, 50th and 75th
percentiles, viz. $Q_p$, $p = 0.25, 0.50, 0.75$. These percentiles are called quartiles,
denoted by $Q_1, Q_2, Q_3$.
Therefore about 25 percent of the data do not exceed $Q_1$, about 50 percent do not
exceed $Q_2$, and about 75 percent do not exceed $Q_3$. These percentages are exact when $n$ is a
multiple of 4 and the values are distinct; in general they are lower bounds, since $F_n(Q_p) \ge p$.
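For the quartiles, the infimum in the definition of $Q_p$ can equivalently be written as the order statistic $x_{(\lceil np \rceil)}$, because $k = \lceil np \rceil$ is the smallest $k$ with $\frac{k}{n} \ge p$. This reformulation and the sketch below are additions for illustration, not part of the original notes; the function name `quartiles` and the sample data are assumptions.

```python
from fractions import Fraction
from math import ceil


def quartiles(data):
    """Return (Q_1, Q_2, Q_3) using Q_p = x_(ceil(n * p))."""
    xs = sorted(data)
    n = len(xs)
    # Exact fractions k/4 avoid rounding issues in ceil(n * p).
    # List indices are 0-based, hence the "- 1".
    return tuple(xs[ceil(n * Fraction(k, 4)) - 1] for k in (1, 2, 3))


data = [7, 1, 5, 3, 8, 2, 6, 4]  # hypothetical sample, n = 8
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```

With $n = 8$ distinct values, the three quartiles are the 2nd, 4th and 6th smallest observations, so exactly 25, 50 and 75 percent of the data do not exceed them.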
Ex1. What percent of the data are between $Q_1$ and $Q_3$?
Ex2. What percent of the data are between $Q_2$ and $Q_3$?
Ex3. What is the relation between $Q_2$ and the median?