Selection problem

CS 360: Data Structures and Algorithms
Divide-and-Conquer (part 3)
Selection problem:
Given array A[1…n] and value k where 1 ≤ k ≤ n, find and
return the kth smallest element in A
• If k=1 ⇒ minimum element
• If k=n ⇒ maximum element
• If k=(1+n)/2 ⇒ median element (round down when n is even)
Algorithm 1 for selection problem:
sort A into ascending order;
return A[k];
Algorithm 1 takes θ(n lg n) time using merge sort or heap sort
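Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name select_by_sorting is hypothetical, and Python's built-in sorted() is used in place of merge sort or heap sort (it also runs in O(n lg n) time in the worst case). Python lists are 0-based, so the kth smallest element is at index k-1.

```python
def select_by_sorting(a, k):
    """Return the k-th smallest element of a, where 1 <= k <= len(a)."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= n")
    # sorted() returns a new list, so the caller's list is left unchanged.
    return sorted(a)[k - 1]


if __name__ == "__main__":
    data = [7, 2, 9, 4, 5]
    print(select_by_sorting(data, 1))  # minimum
    print(select_by_sorting(data, 3))  # median
    print(select_by_sorting(data, 5))  # maximum
```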
Can we develop a faster algorithm?
Note: without sorting, we can determine the minimum or
maximum element in θ(n) time. [How?]
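One possible answer to the question above is a single left-to-right scan that remembers the smallest and largest values seen so far. The sketch below assumes a non-empty list; the function name min_and_max is hypothetical.

```python
def min_and_max(a):
    """Return (minimum, maximum) of a non-empty list using one pass."""
    if not a:
        raise ValueError("list must be non-empty")
    lo = hi = a[0]
    for x in a[1:]:
        if x < lo:
            lo = x      # new smallest value so far
        elif x > hi:
            hi = x      # new largest value so far
    return lo, hi


if __name__ == "__main__":
    print(min_and_max([7, 2, 9, 4, 5]))
```

Each element is examined once, so the scan does a constant amount of work per element.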
Goal: solve the selection problem for arbitrary k in θ(n) time
Idea: divide-and-conquer, use pivot similar to quick sort, but
only make recursive call on one of the two subarrays, because
we don’t need to completely sort the entire array
Algorithm 2 for selection problem:
Choose “pivot element” that is hopefully near the median of A
Recall these strategies for choosing the pivot in quick sort:
pivot = A[low]
pivot = A[high]
pivot = A[(low+high)/2]
pivot = A[random (low, high)]
pivot = median (A[random (low, high)], A[random (low, high)],
A[random (low, high)])
pivot = median (A[low], A[(low+high)/2], A[high])
Select (A, k) {
pivot = …; // choose any of the above strategies
create three empty lists: L, E, G;
for each x in A
if (x<pivot) add x to L;
else if (x==pivot) add x to E;
else /* (x>pivot) */ add x to G;
if (k <= L.size)
return Select (L, k);
else if (k <= L.size + E.size)
return pivot;
else return Select (G, k – L.size – E.size);
}
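The pseudocode above could be written in Python roughly as follows. This sketch assumes the random pivot strategy, pivot = A[random (low, high)], and uses the hypothetical names select, less, equal, and greater for Select, L, E, and G. Any of the other pivot strategies could be substituted on the marked line.

```python
import random


def select(a, k):
    """Return the k-th smallest element of a (1 <= k <= len(a))."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= n")
    pivot = random.choice(a)  # pivot strategy: random element of a

    # Three-way partition into the lists L, E, G from the pseudocode.
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]

    if k <= len(less):
        return select(less, k)
    if k <= len(less) + len(equal):
        return pivot
    return select(greater, k - len(less) - len(equal))


if __name__ == "__main__":
    data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
    print(select(data, 5))  # median of 1..9
```

Because the pivot is always an element of a, the list equal is never empty, so each recursive call receives a strictly shorter list and the recursion terminates.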
Analysis of Algorithm 2 for selection problem:
Best case: θ(n), if pivot happens to be the kth smallest element
Worst case: θ(n²), if pivot is always near the minimum or
maximum value
Average case: θ(n), averaging over good splits and bad splits
weighted by their probabilities
Recurrence for worst case:
T(n) = T(n–1) + θ(n) or T(n) = T(n–2) + θ(n)
Recurrence for average case (assuming no duplicates):
T(n) = (1/n) Σ1≤k≤n [ ((k−1)/n) T(k–1) + ((n−k)/n) T(n–k) ] + θ(n)
Does not conform to the master recurrence theorem, so it’s
difficult to solve
How can we achieve worst-case θ(n) time for selection?
Note: if we were lucky enough to always pick the median
element as the pivot, then
T(n) = T(n/2) + θ(n) ⇒ T(n) = θ(n)
So we want a new strategy for choosing a pivot that’s always
close to the median
Algorithm 3 for selection problem:
Same as algorithm 2, except for new pivot strategy:
Choose an odd number g (later we’ll see g=5 is best)
Partition the n elements into groups of size g each
(So the number of groups = n/g)
Find the median of each group
(Note: we can sort each group in θ(g²) = θ(1) time, because g is a constant)
Let M = list of all these group medians, so size of M is n/g
Find the median of M by calling Algorithm 3 recursively
(Note: we recurse because sorting M would take more than θ(n) time)
Let pivot = the median of M = Select (M, (1 + n/g)/2)
(So pivot is the median-of-medians)
Next continue the same as in Algorithm 2:
create three empty lists: L, E, G;
for each x in A
if (x<pivot) add x to L;
else if (x==pivot) add x to E;
else /* (x>pivot) */ add x to G;
if (k <= L.size)
return Select (L, k);
else if (k <= L.size + E.size)
return pivot;
else return Select (G, k – L.size – E.size);
Stop the recursion when n is below some threshold (such as
n < 3g or n < g²), and solve using Algorithm 1 or Algorithm 2
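A minimal Python sketch of Algorithm 3 is shown below. It makes several assumptions that the slides leave open: the function name mom_select is hypothetical, the threshold is n < 3g with Algorithm 1 (sorting) as the base case, the last group may hold fewer than g elements when g does not divide n, and the lower median is taken for groups of even size.

```python
def mom_select(a, k, g=5):
    """Return the k-th smallest element of a using a median-of-medians pivot."""
    n = len(a)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n < 3 * g:
        return sorted(a)[k - 1]  # small input: fall back to Algorithm 1

    # Partition into groups of size g (the last group may be shorter).
    groups = [a[i:i + g] for i in range(0, n, g)]
    # Median of each group; sorting a constant-size group costs O(1).
    medians = [sorted(grp)[(len(grp) - 1) // 2] for grp in groups]
    # Recursive call: the pivot is the median of the group medians.
    pivot = mom_select(medians, (1 + len(medians)) // 2, g)

    # From here on, the same as Algorithm 2.
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if k <= len(less):
        return mom_select(less, k, g)
    if k <= len(less) + len(equal):
        return pivot
    return mom_select(greater, k - len(less) - len(equal), g)


if __name__ == "__main__":
    data = list(range(100, 0, -1))  # 100, 99, ..., 1
    print(mom_select(data, 50))
```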
Example: n=25, let g=5
A 1 14 11 15 13 23 17 4 19 6 0 10 8 3 2 9 21 12 22 16 24 18 5 20 7
To find the median of A, call Select (A, (1+25)/2) = Select (A, 13)
Groups of 5: [1 14 11 15 13] [23 17 4 19 6] [0 10 8 3 2] [9 21 12 22 16] [24 18 5 20 7]
M = [13, 17, 3, 16, 18]
pivot = Select (M, (1+25/5)/2) = Select ([13,17,3,16,18], 3) = 16
L = [1,14,11,15,13,4,6,0,10,8,3,2,9,12,5,7]
E = [16]
G = [23,17,19,21,22,24,18,20]
L.size = 16
E.size = 1
G.size = 8
k=13 ⇒ k <= L.size ⇒ Select (L, 13) = 12
Next suppose we call Select (A, 21) using same array A
Almost everything proceeds exactly as above
k=21 ⇒ k > L.size + E.size ⇒ Select (G, 21–16–1)
⇒ Select (G, 4) = 20
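The numbers in this example can be reproduced with a few lines of Python. This is only a check of the worked example, not the full algorithm: because M, L, and G are small, sorting stands in for the recursive Select calls.

```python
A = [1, 14, 11, 15, 13, 23, 17, 4, 19, 6, 0, 10, 8, 3, 2,
     9, 21, 12, 22, 16, 24, 18, 5, 20, 7]
g = 5

# Split A into consecutive groups of g elements and take each group's median.
groups = [A[i:i + g] for i in range(0, len(A), g)]
M = [sorted(grp)[g // 2] for grp in groups]

# Median of M; sorted() stands in for Select(M, (1 + n/g)/2) here.
pivot = sorted(M)[(1 + len(M)) // 2 - 1]

L = [x for x in A if x < pivot]
E = [x for x in A if x == pivot]
G = [x for x in A if x > pivot]

print(M)                       # [13, 17, 3, 16, 18]
print(pivot)                   # 16
print(len(L), len(E), len(G))  # 16 1 8
print(sorted(L)[13 - 1])       # Select(L, 13) -> 12
print(sorted(G)[4 - 1])        # Select(G, 4)  -> 20
```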
Analysis of Algorithm 3 for selection problem:
Two recursive calls
• pivot = Select (M, (1 + n/g)/2)
• only one of Select (L, k) or Select (G, k – L.size – E.size)
T(n) = T(M.size) + T(max(L.size, G.size)) + θ(n)
Recall M.size = n/g
What is upper bound for L.size and G.size?
Note:
pivot is the median of M
So half of the n/g elements in M must be ≤ pivot
Half of the n/g groups have medians ≤ pivot
Each of these groups has at least g/2 elements ≤ pivot
Altogether, at least (1/2)(n/g)(g/2) = n/4 elements ≤ pivot
All these n/4 elements are in L and E (so they’re not in G)
Therefore G.size ≤ 3n/4
Analogously we can show that L.size ≤ 3n/4
so max(L.size, G.size) ≤ 3n/4
Intuition: Select (A, n/4) ≤ pivot ≤ Select (A, 3n/4),
so pivot is closer to median than it is to min or max elements
T(n) = T(n/g) + T(3n/4) + θ(n)
Does not conform to the master recurrence theorem, but we
can solve it easily by another approach
T(n) = T(n/g) + T(3n/4) + cn, for some constant c > 0
Guess that T(n) = dn, for some other constant d > 0
dn = d(n/g) + d(3n/4) + cn
d = d/g + 3d/4 + c
d ( 1/4 – 1/g ) = c
Note: d > 0 requires 1/4 – 1/g > 0, that is g > 4; because g
must be odd, choose group size g=5
d ( 1/4 – 1/5 ) = c
d ( 1/20 ) = c
d = 20c
So T(n) = dn = 20cn = θ(n)
Algorithm 3 is a worst-case θ(n)-time algorithm
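The bound T(n) ≤ 20cn can also be checked numerically. The sketch below makes assumptions beyond the slides: it takes c = 1, rounds n/5 and 3n/4 down to integers, and uses T(0) = 0 as the base case. The names T, G_SIZE, and C are hypothetical.

```python
from functools import lru_cache

G_SIZE = 5  # group size g
C = 1       # constant c in the cn term (assumed value)


@lru_cache(maxsize=None)
def T(n):
    """Evaluate T(n) = T(n/g) + T(3n/4) + cn with floors and T(0) = 0."""
    if n == 0:
        return 0
    return T(n // G_SIZE) + T(3 * n // 4) + C * n


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**4, 10**5, 10**6):
        # The ratio T(n)/n should stay at or below d = 20c.
        print(n, T(n), T(n) / n)
```

Under these assumptions the ratio T(n)/n never exceeds 20, which agrees with d = 20c.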