Divide-and-conquer: Median Statistics
Course 2017
The divide-and-conquer strategy:
1. break the problem into smaller subproblems,
2. recursively solve each subproblem,
3. appropriately combine their answers.
Julius Caesar (1st century BC): "Divide et impera" ("divide and rule").
Known examples:
• Binary search
• Mergesort
• Quicksort
• Strassen matrix multiplication
(Mergesort is due to J. von Neumann, 1903-57.)
Merge sort
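The figure accompanying this slide is not reproduced here. As a minimal illustration of the three steps above (a sketch added for this write-up, not the slides' own code), mergesort in Python:

def merge_sort(A):
    # 1. Break the problem into two subproblems of half the size.
    if len(A) <= 1:
        return A                      # base case: already sorted
    mid = len(A) // 2
    # 2. Recursively solve each subproblem.
    left, right = merge_sort(A[:mid]), merge_sort(A[mid:])
    # 3. Combine: merge the two sorted halves in linear time.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

Its recurrence is T(n) = 2T(n/2) + O(n), i.e. T(n) = O(n lg n).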
Recurrences in Divide and Conquer
T(n) = 3T(n/2) + O(n)
The algorithm under analysis divides an input of size n into 3 subproblems, each of size n/2, at a cost (of dividing and joining the solutions) of O(n).
[Recursion tree for T(n) = 3T(n/2) + O(n): the root has size n, its 3 children have size n/2, the 9 grandchildren have size n/4, and so on down to subproblems of size 1.]
T(n) = 3T(n/2) + O(n).
[Recursion tree annotated with the cost per level: the tree has depth lg n; level k = 0 costs n, level k = 1 costs (3/2)n, level k = 2 costs (9/4)n, then (27/8)n, and so on down to level k = lg n, which costs 3^{lg n}.]
T(n) = 3T(n/2) + O(n)
At depth k of the tree there are 3^k subproblems, each of size n/2^k. For each of those subproblems we need O(n/2^k) (splitting time + combination time).
Therefore the cost at depth k is:
3^k × O(n/2^k) = (3/2)^k × O(n),
with maximum depth k = lg n. Summing the cost over all levels:
Θ(n) × (1 + 3/2 + (3/2)^2 + (3/2)^3 + ··· + (3/2)^{lg n}).
Therefore T(n) = Σ_{k=0}^{lg n} O(n) (3/2)^k.
We have a geometric series of ratio 3/2, starting at O(n) (k = 0) and ending at O(3^{lg n}) = O(n^{lg 3}) (k = lg n). As the series is increasing, T(n) is dominated by the last term:
T(n) = O(n × n^{lg(3/2)}) = O(n^{1.58}).
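For completeness, the last step can be made explicit by summing the geometric series (a short derivation added here; lg denotes the base-2 logarithm):

\sum_{k=0}^{\lg n} \left(\tfrac{3}{2}\right)^{k}
  = \frac{(3/2)^{\lg n + 1} - 1}{3/2 - 1}
  = \Theta\!\left(\left(\tfrac{3}{2}\right)^{\lg n}\right)
  = \Theta\!\left(n^{\lg 3 - 1}\right),
\qquad\text{so}\qquad
T(n) = \Theta\!\left(n \cdot n^{\lg 3 - 1}\right) = \Theta\!\left(n^{\lg 3}\right) \approx \Theta\!\left(n^{1.58}\right).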
Another example: T(n) = T(n/4) + T(n/2) + n²
Series of costs per level:
(1 + ((1/4)² + (1/2)²) + ((1/16)² + 2(1/8)² + (1/4)²) + ··· ) n² = (1 + 5/16 + 25/256 + ··· ) n²
This is a decreasing geometric series (ratio 5/16), dominated by its first term n².
Therefore T(n) = O(n²).
General setting: Basic Theorem
T(n) = aT(n/b) + f(n),
where n: size of the problem,
n/b: size of the subproblems,
f(n): cost of dividing the problem and combining the solutions.
Theorem
Let a ≥ 1, b > 1, d ≥ 0 be constants. The recurrence T(n) = aT(n/b) + O(n^d) has asymptotic solution:
1. T(n) = O(n^d), if d > log_b a,
2. T(n) = O(n^d lg n), if d = log_b a,
3. T(n) = O(n^{log_b a}), if d < log_b a.
[Recursion tree for T(n) = aT(n/b) + f(n): the tree has depth log_b n; level k has a^k subproblems of size n/b^k, contributing total cost a^k · f(n/b^k) at that level; there are n^{log_b a} leaves, each solved at cost O(1).]
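As a small illustration (not part of the slides), a Python sketch that applies the theorem to a recurrence described by the constants a, b, d:

import math

def master_theorem(a, b, d):
    # Asymptotic solution of T(n) = a*T(n/b) + O(n^d), for a >= 1, b > 1, d >= 0.
    crit = math.log(a, b)                 # the critical exponent log_b(a)
    if math.isclose(d, crit):
        return f"O(n^{d} lg n)"           # case 2: d = log_b a
    if d > crit:
        return f"O(n^{d})"                # case 1: d > log_b a
    return f"O(n^{round(crit, 3)})"       # case 3: d < log_b a

print(master_theorem(3, 2, 1))   # T(n) = 3T(n/2) + O(n)  ->  O(n^1.585)
print(master_theorem(1, 2, 0))   # binary search           ->  O(n^0 lg n) = O(lg n)
print(master_theorem(2, 2, 1))   # mergesort               ->  O(n^1 lg n)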
Recall 1: Binary Search
Input: a key x and a sorted array A.
Problem: decide whether x is in A.
1. Divide: check the middle element y of A. If x = y, we are done.
2. Conquer: recursively, if x < y then search the left subarray of A, otherwise search the right subarray.
3. Combine: trivial. If we hit x, stop and answer YES; otherwise answer NO.
T(n) = 1 · T(n/2) + Θ(1) ⇒ T(n) = Θ(lg n).
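A minimal Python sketch of this recursion (added here, not taken from the slides):

def binary_search(A, x, lo=0, hi=None):
    # Decide whether the key x occurs in the sorted array A[lo..hi].
    if hi is None:
        hi = len(A) - 1
    if lo > hi:                                   # empty subarray: answer NO
        return False
    mid = (lo + hi) // 2                          # divide: middle element y = A[mid]
    if A[mid] == x:                               # we hit x: answer YES
        return True
    if x < A[mid]:                                # conquer: search the left subarray
        return binary_search(A, x, lo, mid - 1)
    return binary_search(A, x, mid + 1, hi)       # conquer: search the right subarray

For example, binary_search([1, 3, 5, 8], 5) returns True.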
Function Ran-Partition
Ran-Partition (A[p, . . . , q])
  choose a pivot position r = rand(p, q) uniformly at random (u.a.r.)
  interchange A[p] and A[r]
  compare the pivot with all the other elements of A[p, . . . , q]
  set the pivot in its place, with the elements ≤ pivot to its left and the remaining ones to its right
  return the final position of the pivot
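One concrete way to realise this pseudocode in Python (a sketch; the in-place Lomuto-style sweep is an implementation choice, not fixed by the slide):

import random

def ran_partition(A, p, q):
    # Partition A[p..q] around a pivot chosen uniformly at random.
    r = random.randint(p, q)          # choose the pivot position u.a.r.
    A[p], A[r] = A[r], A[p]           # interchange A[p] and A[r]
    pivot = A[p]
    i = p                             # invariant: A[p+1..i] holds elements <= pivot
    for j in range(p + 1, q + 1):     # compare the pivot with every other element
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[p], A[i] = A[i], A[p]           # set the pivot in its final place
    return i                          # final position of the pivot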
Random-Quicksort
Use Ran-Partition and consider the following randomized divide-and-conquer algorithm, on input A[1, . . . , n]:
Ran-Quicksort (A[p, . . . , q])
if p < q then
  r = Ran-Partition (A[p, . . . , q])
  Ran-Quicksort (A[p, . . . , r − 1])
  Ran-Quicksort (A[r + 1, . . . , q])
else
  return A[p]
end if
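Using the ran_partition sketch above, the randomized quicksort itself is short (again a sketch, not the slides' literal code):

def ran_quicksort(A, p=0, q=None):
    # Sort A[p..q] in place.
    if q is None:
        q = len(A) - 1
    if p < q:
        r = ran_partition(A, p, q)    # pivot ends up at position r
        ran_quicksort(A, p, r - 1)    # sort the elements <= pivot
        ran_quicksort(A, r + 1, q)    # sort the elements > pivot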
Example
A={1,3,5,6,8,10,12,14,15,16,17,18,20,22,23}
[Figure: the tree of pivots produced by the recursive calls of Ran-Partition on this input; e.g. 8 is the first pivot chosen, 3 and 16 are the pivots of the two resulting subarrays, and so on down to singletons.]
Expected complexity of Ran-Quicksort
• The expected running time T(n) of Ran-Quicksort is dominated by the number of comparisons.
• Every call to Ran-Partition on A[p, . . . , q] has cost Θ(1) + Θ(number of comparisons), and the number of comparisons in that call is |q − p|.
• If we can count the number of comparisons, we can bound the total running time of Ran-Quicksort.
• Let X be the number of comparisons made in all calls to Ran-Partition during Ran-Quicksort.
• X is a random variable, as it depends on the random choices made by Ran-Partition.
Expected complexity of Ran-Quicksort
• Note: in the first call to Ran-Partition, the pivot A[r] is compared with all the other n − 1 elements.
• Key observation: any two keys are compared only when one of them is the current pivot, and they are compared at most once.
[Figure: e.g. once 16 is chosen as a pivot, the keys {10, 12, 14, 15} and the keys {20, 22, 23} are never compared with each other.]
For simplicity, assume all keys are distinct. For any input A[i, . . . , j] to Ran-Quicksort, 1 ≤ i < j ≤ n, let Z_{i,j} be the ordered set of keys {z_i, z_{i+1}, . . . , z_j} (with z_i the smallest).
• Note |Z_{i,j}| = j − i + 1.
• Therefore each key of Z_{i,j} is chosen u.a.r. as a pivot with probability 1/|Z_{i,j}| = 1/(j − i + 1).
Define the indicator random variable:
X_{i,j} = 1 if z_i is compared to z_j, and X_{i,j} = 0 otherwise.
Then X = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_{i,j}
(this is true because we never compare a pair more than once).
By linearity of expectation,
E[X] = E[ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_{i,j} ] = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} E[X_{i,j}].
As E[X_{i,j}] = 0 · Pr[X_{i,j} = 0] + 1 · Pr[X_{i,j} = 1],
∴ E[X_{i,j}] = Pr[X_{i,j} = 1] = Pr[z_i is compared to z_j].
End of the proof and main theorem
E[X] = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Pr[z_i is compared to z_j].
As z_i and z_j are compared iff one of them is the first key of Z_{i,j} chosen as a pivot,
Pr[X_{i,j} = 1] = Pr[z_i is the first pivot of Z_{i,j}] + Pr[z_j is the first pivot of Z_{i,j}].
Because pivots are chosen u.a.r. in Z_{i,j}:
Pr[z_i is the first pivot of Z_{i,j}] = Pr[z_j is the first pivot of Z_{i,j}] = 1/(j − i + 1).
Therefore:
E[X] = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 2/(j − i + 1).
Expanding the inner sums:
E[X] = 2 · Σ_{i=1}^{n−1} (1/2 + 1/3 + ··· + 1/(n − i + 1))
     < 2 · Σ_{i=1}^{n} (1/2 + 1/3 + ··· + 1/n)
     < 2 · Σ_{i=1}^{n} H_n = 2 · n · H_n = O(n lg n),
where H_n = 1 + 1/2 + ··· + 1/n is the n-th harmonic number.
In fact, E[X] = 2n ln n + Θ(n).
Theorem
The expected complexity of Ran-Quicksort is E[T_n] = O(n lg n).
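As an optional empirical check (not part of the slides), the sketch below counts the comparisons made by randomized quicksort and compares the average with 2 n ln n; the two agree up to the Θ(n) term mentioned above:

import math, random

def quicksort_comparisons(A):
    # Count the key comparisons made by randomized quicksort on the list A.
    count = 0
    def qs(lo, hi):
        nonlocal count
        if lo >= hi:
            return
        r = random.randint(lo, hi)          # random pivot, as in Ran-Partition
        A[lo], A[r] = A[r], A[lo]
        pivot, i = A[lo], lo
        for j in range(lo + 1, hi + 1):
            count += 1                      # one comparison per non-pivot key
            if A[j] <= pivot:
                i += 1
                A[i], A[j] = A[j], A[i]
        A[lo], A[i] = A[i], A[lo]
        qs(lo, i - 1)
        qs(i + 1, hi)
    qs(0, len(A) - 1)
    return count

n, trials = 1000, 100
avg = sum(quicksort_comparisons(random.sample(range(10 * n), n))
          for _ in range(trials)) / trials
print(avg, 2 * n * math.log(n))             # same leading term 2 n ln n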
Selection and order statistics
Problem: given a list A of n unordered distinct keys and an i ∈ Z, 1 ≤ i ≤ n, select the i-th smallest element x ∈ A, i.e. the element that is larger than exactly i − 1 other elements of A.
Notice that:
1. i = 1 ⇒ the MINIMUM element,
2. i = n ⇒ the MAXIMUM element,
3. i = ⌊(n + 1)/2⌋ ⇒ the MEDIAN,
4. i = ⌊0.9 · n⌋ ⇒ a general order statistic (the 0.9-quantile).
Obvious solution: sort A (O(n lg n)) and read off A[i].
Can we do it in linear time?
Yes: selection is easier than sorting.
Quick-Select
Given an unordered A[1, . . . , n], return the i-th smallest element.
Quick-Select (A[p, . . . , q], i)
• r = Ran-Partition (A[p, . . . , q]), to find the position of the pivot
• if i = r, return A[r]
• if i < r, Quick-Select (A[p, . . . , r − 1], i)
• else Quick-Select (A[r + 1, . . . , q], i)
[Figure: searching for i = 2 in an example array A of 8 keys; Ran-Partition(1, 8) places the pivot at position 3, and since 2 < 3 the search continues in the left subarray A[1, . . . , 2].]
Quick-Select Algorithm
Quickselect (A[p, . . . , q], i)
if p = q then
  return A[p]
else
  r = Ran-Partition (A[p, . . . , q])
  k = r − p + 1
  if i = k then
    return A[r]
  else if i < k then
    return Quickselect (A[p, . . . , r − 1], i)
  else
    return Quickselect (A[r + 1, . . . , q], i − k)
  end if
end if
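A Python sketch of Quickselect, reusing the ran_partition sketch from above (an illustration; the rank i is 1-indexed):

def quickselect(A, i, p=0, q=None):
    # Return the i-th smallest key of A[p..q].
    if q is None:
        q = len(A) - 1
    if p == q:
        return A[p]
    r = ran_partition(A, p, q)          # position of the pivot
    k = r - p + 1                       # rank of the pivot within A[p..q]
    if i == k:
        return A[r]
    if i < k:
        return quickselect(A, i, p, r - 1)
    return quickselect(A, i - k, r + 1, q)

For instance, quickselect([9, 1, 7, 3], 2) returns 3.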
Analysis.
• Lucky case: at each recursive call the search space shrinks to at most 9/10 of its size. Then T(n) ≤ T(9n/10) + Θ(n) = Θ(n).
• Unlucky case: T(n) = T(n − 1) + Θ(n) = Θ(n²). In this case it is worse than sorting!
Theorem
Given A[1, . . . , n] and i, the expected number of steps for Quick-Select to find the i-th smallest element of A is O(n).
Worst case of Quick-Select: O(n²).
Deterministic linear selection.
Idea: generate deterministically a good split element x.
• Divide the n elements into ⌊n/5⌋ groups of 5 elements each (plus possibly one group with fewer than 5 elements).
• Sort each group to find its median, say x_i. (Sorting a group takes a constant number of comparisons, i.e. Θ(1); in total Θ(1) · ⌊n/5⌋ = Θ(n).)
• Use Select recursively to find the median x of the medians {x_i}, 1 ≤ i ≤ ⌈n/5⌉.
• Use the deterministic Partition (as in quicksort) to re-arrange the groups corresponding to the medians {x_i} around x, in time linear in the number of medians.
• At least 3 · ½⌊n/5⌋ ≈ 3n/10 of the elements are ≤ x, and at least 3 · ½⌊n/5⌋ ≈ 3n/10 of the elements are ≥ x.
[Figure: the ⌊n/5⌋ groups drawn as sorted columns of 5, with the median-of-medians x marked.]
The deterministic algorithm
Select (A, i)
1.- Divide the n elements into ⌊n/5⌋ groups of 5.
2.- Find the median of each group by insertion sort, taking the middle element.
3.- Use Select recursively to find the median x of the ⌊n/5⌋ medians.
4.- Use Partition to place x and its group; let k = rank of x.
5.- if i = k then
      return x
    else if i < k then
      use Select recursively to find the i-th smallest in the left part
    else
      use Select recursively to find the (i − k)-th smallest in the right part
    end if
Notice that steps 4 and 5 are the same as in Quickselect.
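A Python sketch of the deterministic Select (added here for illustration; it works on copies rather than in place, and the base-case cut-off at 50 matches the recurrence in the analysis below):

def select(A, i):
    # Return the i-th smallest element of A (1-indexed), assuming distinct keys.
    if len(A) <= 50:                                   # base case: just sort
        return sorted(A)[i - 1]
    # Steps 1-2: groups of 5, median of each group found by sorting it.
    groups = [A[j:j + 5] for j in range(0, len(A), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Step 3: median x of the medians, found recursively.
    x = select(medians, (len(medians) + 1) // 2)
    # Step 4: partition around x; k is the rank of x.
    left = [y for y in A if y < x]
    right = [y for y in A if y > x]
    k = len(left) + 1
    # Step 5: same as in Quickselect.
    if i == k:
        return x
    if i < k:
        return select(left, i)
    return select(right, i - k)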
Example
Get the median (the ⌊n/2⌋-th element) of the following input:
3 13 9 4 5 | 1 15 12 10 2 | 6 14 8 11 17
Each group of 5 is sorted: {3, 4, 5, 9, 13}, {1, 2, 10, 12, 15}, {6, 8, 11, 14, 17}. The group medians are 5, 10, 11, so the median of the medians is x = 10.
PARTITION around 10:
3 4 5 9 1 2 6 8  10  13 12 15 11 14 17
To get the 7th element (the median), note that the rank of 10 is 9 > 7, so Select is called recursively on the left part of this instance.
Worst-case analysis.
• At least 3n/10 of the elements are ≥ x, and at least 3n/10 of the elements are < x.
• Therefore, in the worst case, step 5 calls Select recursively on at most n − 3n/10 = 7n/10 elements.
• Steps 1, 2 and 4 take O(n) time. Step 3 takes time T(n/5) and step 5 takes time ≤ T(7n/10).
Therefore, we have
T(n) = Θ(1)                               if n ≤ 50,
T(n) = T(n/5) + T(7n/10) + Θ(n)           if n > 50.
Therefore, T(n) = Θ(n).
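The last step deserves a line of justification (added here, not on the slide): since 1/5 + 7/10 = 9/10 < 1, the linear bound follows by substitution. Assume T(m) ≤ c·m for all m < n (with n > 50) and that the Θ(n) term is at most c'·n; then

T(n) \;\le\; \frac{c\,n}{5} + \frac{7\,c\,n}{10} + c'\,n
     \;=\; \frac{9\,c}{10}\,n + c'\,n
     \;\le\; c\,n \qquad \text{whenever } c \ge 10\,c'.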
Notice: if we make groups of 7, the number of elements ≥ x is at least 2n/7, which yields T(n) ≤ T(n/7) + T(5n/7) + O(n), with solution T(n) = O(n).
However, if we make groups of 3, then T(n) ≤ T(n/3) + T(2n/3) + O(n), which has solution T(n) = O(n ln n): since n/3 + 2n/3 = n, every level of the recursion costs Θ(n) and there are Θ(lg n) levels.
Conclusions
• From a randomized algorithm we removed the randomization to get a fast deterministic algorithm for selection.
• From the practical point of view, the deterministic algorithm is slow: use Quickselect.