Disjoint Set or Union-Find ADT

[ Section 4.2 ]
The Disjoint Set ADT:
Objects: A collection of nonempty disjoint sets S = {S1, S2, ..., Sk},
i.e., each Si is a nonempty set that has no element in common
with any other Sj. In mathematical notation this is:
Si ∩ Sj = ∅, ∀ i ≠ j
Each set is identified by a unique element called its representative.
Operations:
• MAKE-SET(x): Given an element x that does not already
belong to one of the sets, create a new set {x} that contains only x (and so has x as its representative).
• FIND-SET(x): Given an element x, return the representative of the set that contains x.
• UNION(x,y):
– Given two distinct elements x and y
– let Sx be the set that contains x and Sy be the set that
contains y.
– UNION(x,y) is an operation that results in a new set
consisting of Sx ∪ Sy .
Two things need to be updated to maintain a valid collection
of sets:
1. Remove Sx and Sy from the collection (since all the sets must be disjoint).
2. Pick a representative for the new set.
Note: If both x and y belong to the same set already (i.e.,
Sx = Sy ), nothing is done by this operation.
Applications
• Maintaining the set of connected components of a graph.
• Maintaining lists of duplicate copies of webpages.
• Constructing a minimum spanning tree for a graph (Kruskal).
Finding connected components:
For all v in V do
  MAKE-SET(v)
For all (u,v) in E do
  UNION(u,v)
Q: How can we test whether u and v are connected?
FIND-SET(u) = FIND-SET(v).
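The loop above can be sketched in Python as follows. This is only a minimal illustration: the dictionary `rep`, the function name `connected_components` and the small example graph are assumptions made here, not part of the notes, and the UNION used is the simplest possible relabelling rather than an efficient one.

```python
def connected_components(vertices, edges):
    """Return a dict mapping each vertex to the representative of its component."""
    rep = {v: v for v in vertices}       # MAKE-SET(v) for all v in V

    def union(u, v):
        ru, rv = rep[u], rep[v]          # FIND-SET(u), FIND-SET(v)
        if ru == rv:
            return                       # already in the same set
        for w in rep:                    # relabel every member of v's set
            if rep[w] == rv:
                rep[w] = ru

    for u, v in edges:                   # UNION(u, v) for all (u, v) in E
        union(u, v)
    return rep


# Hypothetical graph: a-b-c form one component, d-e another, f is isolated.
rep = connected_components("abcdef", [("a", "b"), ("b", "c"), ("d", "e")])
print(rep["a"] == rep["c"])   # connected iff FIND-SET(u) = FIND-SET(v)
print(rep["a"] == rep["d"])
```

Two vertices are in the same component exactly when their entries in `rep` are equal.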
Kruskal’s Algorithm for MSTs
Intuition: Grow an MST A by repeatedly adding the “lightest”
edge from E that does not create a cycle.
Example
[Figure: example weighted graph with edge weights 2, 4, 5, 6, 8, 9, 10.]
Order of edges in priority queue: 2,4,5,6,8,9,10
KRUSKAL-MST(G=(V,E), w:E->Z)
  A := {};
  insert the edges into a priority queue Q, keyed by weight;
  for each vertex v in V, MAKE-SET(v);
  while (Q not empty)
    e := EXTRACT-MIN(Q);   \\ e = (u,v)
    if FIND-SET(u) =/= FIND-SET(v) then
      UNION(u,v);
      A := A U {e};
    end if
  end while
  return A;
END KRUSKAL-MST
Q: Which line checks that adding edge e to A does not create a
cycle?
FIND-SET(u) ≠ FIND-SET(v).
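A possible Python rendering of KRUSKAL-MST is sketched below. It assumes edges are given as (weight, u, v) tuples, uses the standard `heapq` module as the priority queue, and uses a deliberately naive parent-pointer disjoint set. The function name `kruskal_mst` and the example graph are hypothetical.

```python
import heapq


def kruskal_mst(vertices, weighted_edges):
    """weighted_edges: iterable of (weight, u, v). Returns the list of MST edges."""
    parent = {v: v for v in vertices}         # MAKE-SET(v) for each vertex

    def find_set(x):
        while parent[x] != x:                 # walk up to the representative
            x = parent[x]
        return x

    queue = list(weighted_edges)
    heapq.heapify(queue)                      # priority queue keyed on weight
    mst = []
    while queue:
        w, u, v = heapq.heappop(queue)        # EXTRACT-MIN
        ru, rv = find_set(u), find_set(v)
        if ru != rv:                          # adding (u, v) does not create a cycle
            parent[rv] = ru                   # UNION(u, v)
            mst.append((w, u, v))
    return mst


# Hypothetical 4-vertex graph, not the one in the example figure.
edges = [(1, "a", "b"), (2, "b", "c"), (3, "a", "c"), (4, "c", "d"), (5, "b", "d")]
mst = kruskal_mst("abcd", edges)
print(mst)
```

The edge of weight 3 is skipped because a and c already have the same representative when it is extracted.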
Data Structures for Disjoint Sets
1. Arrays
2. Linked lists
3. Trees
Arrays
• One position for each element.
• Each position stores the element and an index to the set
representative. For example, the collection of sets
{{A}, {B, E}, {C, F, G}, {D}}
could be represented using the following array:
index           1  2  3  4  5  6  7
element         A  B  C  D  E  F  G
representative  1  2  6  4  2  6  6
Operations
• MAKE-SET(x): store x in the next available location, with its own index as its index value. This takes time O(1).
• FIND-SET(x): the index value indicates the representative
of the set containing x, so FIND-SET(x) takes time O(1).
• UNION(x,y): Let X be the index value stored with x and
Y be the index value stored with y. If X ≠ Y, then
go through every element in the array and replace every index value equal to one of them (say Y) with the other (X).
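Example: UNION(A,E) results in:
index           1  2  3  4  5  6  7
element         A  B  C  D  E  F  G
representative  r  r  6  4  r  6  6
where r is the single index value now shared by A, B and E (1 if every 2 is replaced by 1, or 2 if every 1 is replaced by 2).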
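The array representation might be coded as below. This is a sketch under stated assumptions: Python lists are 0-indexed, so every index is one less than in the tables above; the dictionary `pos` is a hypothetical helper for locating an element's position; and UNION(x,y) is assumed to overwrite y's index value with x's.

```python
class ArrayDisjointSet:
    def __init__(self):
        self.elements = []   # elements[i] is the element stored at position i
        self.rep = []        # rep[i] is the position of that element's representative
        self.pos = {}        # element -> position (hypothetical lookup helper)

    def make_set(self, x):
        """Store x in the next free position as its own representative: O(1)."""
        self.pos[x] = len(self.elements)
        self.elements.append(x)
        self.rep.append(self.pos[x])

    def find_set(self, x):
        """Read the stored index value: O(1)."""
        return self.elements[self.rep[self.pos[x]]]

    def union(self, x, y):
        """Scan the whole array, replacing index value Y by X: O(n)."""
        X, Y = self.rep[self.pos[x]], self.rep[self.pos[y]]
        if X == Y:
            return
        for i in range(len(self.rep)):
            if self.rep[i] == Y:
                self.rep[i] = X


ds = ArrayDisjointSet()
for e in "ABCDEFG":
    ds.make_set(e)
ds.union("B", "E")           # {B, E} with representative B
ds.union("F", "C")           # {C, F} with representative F
ds.union("F", "G")           # {C, F, G} with representative F
before = list(ds.rep)        # 0-based version of the first table
ds.union("A", "E")           # merges {A} with {B, E}
print(before, ds.rep)
```

Adding 1 to each entry of `before` gives the representative row 1 2 6 4 2 6 6 from the first table.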
Worst-case sequence complexity for m operations:
Upper bound:
• In terms of n, the number of elements, each operation takes
time at most O(n).
• How big can n get? O(m), since every element is created by a MAKE-SET.
• Therefore, any sequence of m operations takes time O(m^2).
Lower bound:
• What is the worst possible sequence of m operations? Do
m/2 MAKE-SETs followed by m/2 − 1 UNIONs.
– Each UNION will take time Ω(m/2) (the number of elements),
– so the sequence will take time Ω((m/2) ∗ (m/2 − 1)) = Ω(m^2).
Therefore the worst-case sequence complexity is Θ(m^2).
Circularly-linked List
Represent each set by a circularly-linked list. Let listx represent
the list containing x and assume the head of the list is the representative element.
• MAKE-SET(x): make a new set by creating a new linked list with element x.
Complexity: O(1)
• FIND-SET(x): in the worst case, we need to traverse every link in the list to reach the head.
Complexity: Ω(length of list)
• UNION(x,y): Append listy to the end of listx and update the pointers that join the two lists.
Complexity: O(1)
Worst-case sequence complexity for m operations:
Upper bound
The number of elements in the structure at any point in any sequence of m operations is ≤ m. The complexity of each operation in a sequence is O(m), so the total time is O(m^2).
Lower Bound:
What is the worst possible sequence of m operations?
• Perform m/4 MAKE-SETs with different elements.
• Then do m/4 − 1 UNIONs, constructing one list with m/4 elements.
• Follow with m/2 FIND-SETs on the second element in
the list, so that each one requires time Ω(m/4).
• Total time is Ω((m/2) ∗ (m/4)) = Ω(m^2).
Problem?
FIND-SET has very bad complexity.
How can we improve this?
Linked-list with front pointer.
Represent each set by a linked list, with
• the first element in the list being its representative
• each element in the list has a pointer back to the head and
a pointer to the next element.
• Let listx represent the list containing x.
Operations
• MAKE-SET(x): create a new linked list with element x
Complexity:
O(1)
• FIND-SET(x): Follow x’s pointer back to the head.
Complexity: O(1)
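• UNION(x,y): Append listy to the end of listx and update the back pointers of the elements of listy.
Complexity: Θ(length of listy)
Why?
Since we can find the head of listy and the tail of listx in constant time, the only non-constant work is updating the back pointer of every element of listy.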
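The three operations could look like this in Python. It is a sketch only: the `Node` class and its field names are assumptions, and a `tail` field kept on the head node stands in for the constant-time access to the tail of listx mentioned above.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.head = self      # back pointer to the representative
        self.next = None      # next element in the list
        self.tail = self      # last element; only kept up to date on the head node


def make_set(x):
    return Node(x)            # O(1): a one-element list


def find_set(node):
    return node.head          # O(1): follow the back pointer


def union(x, y):
    hx, hy = x.head, y.head
    if hx is hy:
        return hx             # same set already, nothing to do
    hx.tail.next = hy         # append list_y after the tail of list_x: O(1)
    hx.tail = hy.tail
    node = hy
    while node is not None:   # Θ(length of list_y) back-pointer updates
        node.head = hx
        node = node.next
    return hx


a, b, c = make_set("a"), make_set("b"), make_set("c")
union(a, b)                   # list a -> b, representative a
union(c, a)                   # list c -> a -> b, two back pointers updated
print(find_set(b).value)
```

The second UNION touches both elements of the list headed by a, which is exactly the cost that the analysis charges to listy.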
Worst-case sequence complexity for m operations:
Upper Bound
Same as before.
Lower Bound
• Perform m/2 + 1 MAKE-SETs with different elements.
• Then do m/2 − 1 UNIONs, each one appending the existing, longer and longer list to a new one-element list, so that the i-th UNION updates about i back pointers.
• Total time is Ω(m^2).
Problem?
UNION is still very inefficient when the appended list is long.
Q: How can we fix this?
Linked list with extra pointer to front and “union-by-weight”
⇒ Now we keep track of the number of elements in each list.
Q: Are MAKE-SET and FIND-SET affected?
No.
Q: What about UNION?
We always append the smaller list to the larger one (so we have fewer pointers to update).
This is called “union-by-weight”. The “weight” of a set is simply
its size.
Worst-case sequence complexity for m operations:
Upper Bound
• Let n be the number of MAKE-SET operations in the sequence (so there are never more than n elements in total).
• For some arbitrary element x, how many times can x's back
pointer be updated? Consider when this happens:
This happens only when listx is UNIONed with a set that is no smaller than listx.
• So each time x's back pointer is updated, the resulting set
must have size at least twice |listx|.
• Since no set ever has more than n elements, x's back pointer is
updated at most log(n) times.
• This is true for every element x. Therefore, the total number
of pointer updates during the entire sequence of operations
is O(n log n).
• Since the time for other operations is still O(1), and there
are m operations in total, the total time for the entire sequence is O(m + n log n).
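The counting argument can be checked on a small instance with the sketch below. For brevity it models each linked list as a Python list stored under its representative, which is an assumption made here (`extend` costs time proportional to the smaller set, so the bound is unaffected). The class name and the `pointer_updates` counter are hypothetical.

```python
import math


class WeightedListSets:
    def __init__(self):
        self.head = {}             # element -> representative (the back pointer)
        self.members = {}          # representative -> elements of its list
        self.pointer_updates = 0   # total back-pointer updates so far

    def make_set(self, x):
        self.head[x] = x
        self.members[x] = [x]

    def find_set(self, x):
        return self.head[x]

    def union(self, x, y):
        hx, hy = self.head[x], self.head[y]
        if hx == hy:
            return
        if len(self.members[hx]) < len(self.members[hy]):
            hx, hy = hy, hx                    # hx now names the larger set
        for z in self.members[hy]:             # only the smaller set is relabelled
            self.head[z] = hx
            self.pointer_updates += 1
        self.members[hx].extend(self.members.pop(hy))


n = 64
sets = WeightedListSets()
for i in range(n):
    sets.make_set(i)
step = 1
while step < n:                                # merge equal-sized sets pairwise
    for i in range(0, n, 2 * step):
        sets.union(i, i + step)
    step *= 2
print(sets.pointer_updates, n * math.log2(n))
```

Merging equal-sized sets is the expensive case for union-by-weight: every round relabels n/2 elements, and there are log n rounds, so the count stays within n log n.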
Trees
⇒ Represent each set by a tree, where each element points to
its parent and the root points back to itself.
The representative of a set is the root.
Note that the trees are not necessarily binary trees:
• MAKE-SET(x): just create a new tree with root x.
Complexity: O(1)
• FIND-SET(x): simply follow "parent" pointers back to the root of x's tree.
Complexity: O(depth of x)
• UNION(x,y): just make the root of one of the trees point to the root of the other.
[Figure: the trees rooted at root_x and root_y, containing x and y, before and after UNION(x,y).]
Complexity: Θ(max{height(treex), height(treey)})
Worst-case sequence complexity for m operations:
Lower Bound
• Just like for the linked list with back pointers but no size information.
• I.e., we can create a tree that is just one long chain with m/4 elements.
How can we create this tree? Using a combination of MAKE-SET
and UNION operations:
for i = 1 to m/4 do MAKE-SET(xi)
for i = 1 to m/4 - 1 do UNION(x(i+1), xi)
[Figure: UNION(x(m/4), x(m/4−1)) places the chain x(m/4−1), ..., x1 of m/4 − 1 nodes below x(m/4).]
Creating this tree takes m/4 MAKE-SET operations and m/4 − 1 UNION operations.
• Now FIND-SET(x1) takes time Ω(m), since x1 is at the bottom of a chain of m/4 nodes.
• If we perform m/2 FIND-SET operations on x1, we get a sequence whose total time is Ω(m^2).
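The chain construction can be reproduced with a small parent-pointer forest. This is a sketch under one assumption the notes leave open: UNION(x,y) here always puts the root of y's tree under the root of x's tree. The `depth` helper and the constant `k` (standing in for m/4) are hypothetical.

```python
class Forest:
    def __init__(self):
        self.parent = {}

    def make_set(self, x):
        self.parent[x] = x                 # the root points back to itself

    def find_set(self, x):
        while self.parent[x] != x:         # follow parent pointers to the root
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx != ry:
            self.parent[ry] = rx           # assumed convention: y's root goes under x's root

    def depth(self, x):
        """Number of parent pointers between x and its root (hypothetical helper)."""
        d = 0
        while self.parent[x] != x:
            x = self.parent[x]
            d += 1
        return d


k = 8                                      # plays the role of m/4
forest = Forest()
for i in range(1, k + 1):
    forest.make_set(i)
for i in range(1, k):
    forest.union(i + 1, i)                 # UNION(x(i+1), xi)
print(forest.depth(1))                     # x1 is at the bottom of the chain
```

With this convention every UNION adds one node on top of the chain, so x1 ends up k − 1 links away from the root.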
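Q: How do we know there is not a sequence of operations that
takes longer than Θ(m^2)?
No tree ever has more than m nodes, so each operation takes time O(m) and any sequence of m operations takes time O(m^2).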
Q: How can we improve the trees data structure representation
of disjoint sets?
Add “union-by-weight”.
⇒ keep track of the weight (i.e., size) of each tree and always append the smaller tree to the larger one when performing UNION.
• The complexities of MAKE-SET and UNION are still O(1)
and O(max(height(treex), height(treey))), respectively.
• When one tree is appended to another, what is the weight
of the new tree?
The sum of the two weights.
• What is the complexity of FIND-SET?
Suppose during a sequence of m operations, there are n
MAKE-SET operations. Then the maximum height of any tree is O(log n).
Q: How would we prove this?
(The proof is by induction on the height h of the trees.)
Q: What does this tell us about the running time of any
individual FIND-SET operation?
O(log n)
So the total time for the entire sequence is O(m log n).
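A small experiment illustrates the O(log n) height bound for trees with union-by-weight. The class below is a sketch with hypothetical names (`WeightedForest`, `size`, `depth`), and the merge order is chosen here to be the one that makes the trees as tall as the bound allows.

```python
import math


class WeightedForest:
    def __init__(self):
        self.parent = {}
        self.size = {}                     # weight of the tree, stored at its root

    def make_set(self, x):
        self.parent[x] = x
        self.size[x] = 1

    def find_set(self, x):
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx                # rx is now the root of the heavier tree
        self.parent[ry] = rx               # lighter tree goes under the heavier one
        self.size[rx] += self.size[ry]     # new weight is the sum of the two weights

    def depth(self, x):
        d = 0
        while self.parent[x] != x:
            x = self.parent[x]
            d += 1
        return d


n = 16
wf = WeightedForest()
for i in range(n):
    wf.make_set(i)
step = 1
while step < n:                            # always merge two trees of equal weight
    for i in range(0, n, 2 * step):
        wf.union(i, i + step)
    step *= 2
height = max(wf.depth(i) for i in range(n))
print(height, math.log2(n))
```

Merging equal-weight trees is the only way the height can grow, and each such merge doubles the weight, which is the idea behind the induction proof.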
Q: Can we do better?
Add path compression
When performing FIND-SET(x),
• keep track of the nodes visited on the path from x to the
root of the tree by using a stack or queue
• once the root is found, update the parent pointers of each node on the path to point directly to the root
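A minimal sketch of FIND-SET with path compression follows, assuming the forest is stored as a dictionary `parent` in which the root maps to itself (a representation chosen here for brevity). A Python list plays the role of the stack.

```python
def find_set(parent, x):
    path = []                      # nodes visited on the way up (used as a stack)
    while parent[x] != x:
        path.append(x)
        x = parent[x]
    for node in path:              # x is now the root
        parent[node] = x           # point every visited node directly at the root
    return x


# Hypothetical chain 1 -> 2 -> 3 -> 4, with 4 as the root.
parent = {1: 2, 2: 3, 3: 4, 4: 4}
root = find_set(parent, 1)
print(root, parent)
```

After the call, nodes 1, 2 and 3 all point straight at 4, so a later FIND-SET on any of them follows a single pointer.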
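Q: How does this affect the complexity of the FIND-SET operation?
It roughly doubles the cost the first time (one pass to find the root, one pass to update the pointers), and makes later FIND-SETs on the nodes of that path constant time while the root of the tree stays the same.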
Q: What is the complexity of a UNION(x,y) operation?
It depends on whether FIND-SET has already been called on one or both of x and y.
Q: Does the improvement in complexity of UNION and subsequent FIND-SET operations outweigh the increase in cost of
the initial FIND-SET?
Q: How might we answer this?
Do an amortized analysis; we'll see this topic next.
Consider a sequence of operations including
• n MAKE-SET ops,
• at most n − 1 UNIONs and
• f FIND-SET ops.
The worst-case running time of the whole sequence is:
Θ( (f log n) / log(1 + f/n) )   if f ≥ n
Θ(n + f log n)                   if f < n
Q: Can we do better?
Add “union-by-rank” and path compression.
Q: What measure of trees matters the most?
With trees, the measure that matters the most for the running time is the height of the tree.
Therefore it makes more sense to relate heuristics to the height
of a tree rather than the overall weight in the UNION operation.
• Define the rank of a tree to be an upper bound on the height
of the tree.
• Note that the rank may not be equal to the height of the tree.
• We'll store the rank of a tree at its root.
Operations
• MAKE-SET(x): Same as before, and set rank(x) = 0.
• UNION(x,y): We know rank(treey) and rank(treex).
Which root of treex and treey becomes the new root?
The root with the higher rank becomes the new root.
What is the rank of the new tree?
The same as the larger rank, unless the two roots have the same rank; in that case pick either one as the new root and increase its rank by 1.
• FIND-SET: Nothing changes; use path compression. This does
not affect rank.
This is the state-of-the-art disjoint set implementation.
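Putting union-by-rank and path compression together gives something like the following sketch. The class and attribute names are assumptions, and on a rank tie the root of x's tree is arbitrarily chosen as the new root.

```python
class DisjointSet:
    def __init__(self):
        self.parent = {}
        self.rank = {}                         # upper bound on height, stored per root

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find_set(self, x):
        root = x
        while self.parent[root] != root:       # first pass: locate the root
            root = self.parent[root]
        while self.parent[x] != root:          # second pass: path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root                            # ranks are left untouched

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return rx
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx                    # rx is the root of higher (or equal) rank
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1                 # tie: the new root's rank grows by 1
        return rx


ds = DisjointSet()
for i in range(8):
    ds.make_set(i)
ds.union(0, 1)
ds.union(2, 3)
ds.union(0, 2)                                 # two rank-1 trees: new root gets rank 2
print(ds.find_set(3), ds.rank[0], ds.parent[3])
```

After FIND-SET(3) the node 3 points directly at the root 0, while rank(0) stays 2 even though the tree is now shallower, which is why rank is only an upper bound on the height.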
Q: How good is the worst-case sequence complexity?
It is possible to prove that the worst-case time for a sequence of
m operations, where there are n MAKE-SETs, is O(m log* n).
Q: What is log*?
It is the number of times that you need to apply log to n until the answer is at most 1.
Example:
n = 40 ⇒ 5 < log 40 < 6
       ⇒ 2 < log log 40 < 3
       ⇒ 1 < log log log 40 < 2
       ⇒ 0 < log log log log 40 < 1
So log* 40 = 4.
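The iterated logarithm is easy to compute directly. The sketch below assumes base-2 logarithms and stops once the value is at most 1; the function name `log_star` is chosen here for illustration.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


print(log_star(40), log_star(65536))
```

For n = 40 the loop runs four times, matching the chain of inequalities in the example. The function grows so slowly that even 65536 needs only four applications.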
Back to Kruskal’s Algorithm
KRUSKAL-MST(G=(V,E), w:E->Z)
  A := {};
  insert the edges into a priority queue Q, keyed by weight;
  for each vertex v in V, MAKE-SET(v);
  while (Q not empty)
    e := EXTRACT-MIN(Q);   \\ e = (u,v)
    if FIND-SET(u) =/= FIND-SET(v) then
      UNION(u,v);
      A := A U {e};
    end if
  end while
  return A;
END KRUSKAL-MST
Q: If graph G has n vertices and is connected, then how many
edges m does G have?
m ≥ n − 1
Q: Inserting the edges into a priority queue and extracting the
min for each edge takes how long?
O(m log m)
Suppose that we implement the disjoint set ADT using linked lists
with union-by-weight.
(Remember, linked lists have a pointer back to the representative element.)
How many MAKE-SETs do we do?
n
Complexity?
O(n)
Q: How many FIND-SETs do we do?
At most 2m: one for each endpoint of each edge extracted from the queue.
Complexity?
O(m)
Q: How many UNIONs do we do?
At most n − 1, since each UNION merges two sets (and certainly at most m).
Complexity?
O(n log n) in total
So the worst-case complexity of Kruskal's algorithm is O(m log m + n +
m + n log n).
Since m ≥ n − 1 for a connected graph, the bottleneck is the sorting (priority queue) step. Therefore the
complexity is O(m log m).