Parallel Computing

Algorithms complexity
Parallel computing
Yair Toaff
Gil Ben Artzi
Orly Margalit
027481498
025010679
037616638
Parallel computing - MST
The problem:
Given a connected graph G = (V, E) with edge weights (n vertices, m edges),
we need to find a spanning tree
with the minimum total weight.
Parallel computing - MST
Kruskal algorithm
• Sort the graph's edges by weight.
• In each step add the edge with the
minimal weight that doesn’t close a
cycle.
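The two steps above can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the function name `kruskal`, the `(weight, u, v)` edge format and the union-find helper used to detect cycles are all assumptions made for illustration.

```python
def kruskal(n, edges):
    """Return (total_weight, tree_edges) for vertices 0..n-1.

    edges is a list of (weight, u, v) tuples (an assumed input format).
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, tree = 0, []
    for w, u, v in sorted(edges):      # sort the edges by weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # the edge does not close a cycle
            parent[ru] = rv
            total += w
            tree.append((u, v, w))
    return total, tree


edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
total, tree = kruskal(4, edges)
print(total, tree)
```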
Parallel computing - MST
Complexity
Single processor:
Sorting – O(m log m) = O(n^2 log n)
Each step takes O(1) and there are O(n^2) steps
Total – O(n^2 log n)
Parallel computing - MST
O(m) processors:
Sorting O(log^2 m)
Each step O(1), but the O(m) = O(n^2) steps still run one after another
Total O(n^2)
Parallel computing - MST
Prim algorithm
• Randomly choose a vertex for tree
initialization.
• In every step choose the edge with
minimal weight from a vertex in the
tree to a vertex not in the tree.
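A minimal Python sketch of these steps, assuming the graph is connected and given as a weight matrix with `math.inf` for missing edges. The name `prim`, the `best` array and the fixed `start` parameter (instead of a random start vertex) are illustrative assumptions.

```python
import math


def prim(weights, start=0):
    """weights[u][v] is the edge weight, or math.inf when there is no edge."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge from the tree to each vertex
    best[start] = 0
    total = 0
    for _ in range(n):
        # Choose the vertex outside the tree with the cheapest connecting edge.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        total += best[u]
        # Each vertex refreshes its own entry; with one processor per vertex
        # these updates are independent of each other.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
    return total


INF = math.inf
W = [[INF, 1, 4, INF],
     [1, INF, 3, 2],
     [4, 3, INF, 5],
     [INF, 2, 5, INF]]
print(prim(W))
```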
Parallel computing - MST
Complexity
Single processor:
Finding the edge in step i takes O(n * i)
Total n + 2n + … + n^2 = O(n^3)
Parallel computing - MST
O(n) processors:
There is a processor for each vertex so
every step takes O(n)
Total O(n^2)
Parallel computing - MST
O(m) processors
In each step there are more processors than edges, so
finding the minimum takes O(log n)
Total O(n log n)
Parallel computing - MST
O(m^2) processors
In each step finding the minimum takes O(1)
Total O(n)
Parallel computing - MST
Sollin algorithm
• Treat every vertex as a tree
• In each step randomly choose a tree and
find the edge with the minimal weight from
a vertex in the tree to a vertex not in the tree
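A hedged sketch of the tree-merging idea above (usually known as Borůvka's or Sollin's algorithm). It lets every tree pick its lightest outgoing edge in the same round, which is the form the parallel analysis relies on, rather than picking one random tree at a time. The name `sollin_mst`, the `(weight, u, v)` edge format and the tie-breaking by whole-tuple comparison are assumptions.

```python
def sollin_mst(n, edges):
    """edges: list of distinct (weight, u, v) tuples; returns the tree weight."""
    comp = list(range(n))        # comp[v] = label of the tree containing v
    total, trees = 0, n
    while trees > 1:
        cheapest = {}            # tree label -> lightest edge leaving that tree
        for e in edges:
            _, u, v = e
            cu, cv = comp[u], comp[v]
            if cu == cv:
                continue
            for c in (cu, cv):
                # Comparing whole tuples gives a strict order even with ties.
                if c not in cheapest or e < cheapest[c]:
                    cheapest[c] = e
        if not cheapest:
            break                # the graph is disconnected
        for w, u, v in set(cheapest.values()):
            cu, cv = comp[u], comp[v]
            if cu != cv:         # skip if the two trees were merged already
                comp = [cv if c == cu else c for c in comp]
                total += w
                trees -= 1
    return total


print(sollin_mst(4, [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]))
```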
Parallel computing - MST
Complexity:
Single processor
Same as Kruskal algorithm
Parallel computing - MST
O(n) processors:
There is a processor for every vertex so finding the
minimum takes O( n )
In each step only half of the trees remain so there are
O ( log n ) steps
Total O( n log n)
Parallel computing - MST
O(n^2) processors:
There are n processors for every vertex
so finding the minimum takes O(log n)
Total O(log^2 n)
Parallel computing - MST
O(n^3) processors:
There are n^2 processors for every vertex
so finding the minimum takes O(1)
Total O(log n)
Merge Sort
MS(p, q, c) - p, q are indexes, c is the array
If (p < q)
{
MS(p, (p+q)/2, c)
MS((p+q)/2 + 1, q, c)
merge(p, (p+q)/2, q, c)
}
Merge Sort
Single processor
In every step the merge takes O(n), there are
O(log n) steps.
Total O( n log n )
Merge Sort
O(n) processors:
In every step the merge is done in parallel
time(MS(n)) = time(MS(n/2)) + time(merge(n))
By using the regular merge, which takes O(n) for n elements, we get
O(1 + 2 + 4 + … + n) = O(2^(log n + 1) - 1) = O(n)
Merge Sort
Parallel merge
The problem: given 2 sorted arrays A,B
with size n/2 we need to merge them
efficiently while keeping them sorted
Merge Sort
Let us define 2 sub arrays:
ODD A = [a1 , a3 , a5 …]
EVEN A = [a0 , a2 , a4 …]
Merge Sort
And 2 functions:
Combine( A , B ) = [ a0 , b0 , a1 , b1 , … ]
Sort-combined(A) – for each pair a(2i), a(2i+1): if
they are in the right order do nothing, otherwise
swap them
Merge Sort
Parallel merge ( A , B )
{
C = parallel merge ( ODD A , EVEN B )
D = parallel merge ( ODD B , EVEN A )
L = combine ( C , D )
Return (sort-combined ( L ) )
}
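The recursion above can be tried out with the following Python sketch. It assumes both inputs are sorted and have the same power-of-two length, and it handles single-element inputs directly with one compare-exchange; the function names simply mirror the slide's names and are not taken from any library.

```python
def combine(c, d):
    """Interleave two equal-length lists: [c0, d0, c1, d1, ...]."""
    return [x for pair in zip(c, d) for x in pair]


def sort_combined(seq):
    """Compare-exchange every pair (seq[2i], seq[2i+1]); pairs are independent."""
    out = list(seq)
    for i in range(0, len(out) - 1, 2):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def parallel_merge(a, b):
    """Merge two sorted lists of equal power-of-two length (assumption)."""
    n = len(a)
    assert n == len(b) and n > 0 and n & (n - 1) == 0
    if n == 1:
        return sort_combined([a[0], b[0]])
    c = parallel_merge(a[1::2], b[0::2])   # ODD A with EVEN B
    d = parallel_merge(b[1::2], a[0::2])   # ODD B with EVEN A
    return sort_combined(combine(c, d))


print(parallel_merge([1, 4, 6, 9], [2, 3, 7, 8]))
```

The two recursive calls are independent of each other, and so are the compare-exchange operations inside `sort_combined`, which is what allows each level of the recursion to run in O(1) parallel time.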
Merge Sort
Complexity:
Time ( parallel merge ( n ) ) =
Time ( parallel merge ( n/2) ) + O(1)
= O(log n)
Merge Sort
What is left is to prove the algorithm.
Theorem (the 0-1 principle): if a compare-exchange algorithm sorts every array of
0s and 1s, it will sort every array.
Merge Sort
Let us mark the number of '1's in A as 1a
and in B as 1b
The number of '1's in ODD A is ⌈1a/2⌉
The number of '1's in EVEN A is ⌊1a/2⌋
Merge Sort
As a result, the difference between the
number of '1's in C and in D is 0 or 1.
⇒ Array L will be sorted except maybe at one
point where the '0's and '1's meet
⇒ sort-combined will do 1 swap at most.
Merge Sort
Complexity of merge sort using parallel
merge:
log 1 + log 2 + log 4 + log 8 + … + log n =
0 + 1 + 2 + 3 + … + log n = O(log^2 n)
Sum
• Input : Array of n elements of type integer.
• Output : Sum of the elements.
• One processor - O(n) operations.
• Two processors - Still O(n) operations.
Sum
• What could we do if we have O(n) processors ?
• Parallel algorithm
– For each phase, until we have only one element
• Each processor adds two elements together
• We now have n/2 new elements
• Complexity
– We have done more operations, so what have we
gained?
– Since in each phase we are left with only half of the
elements, we can view it as a binary tree where each
level holds the current elements; the overall depth
is O(log n) levels. Each level in the tree takes O(1), for a total of
O(log n) time.
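The phases can be simulated sequentially to see the tree depth. In this sketch the name `tree_sum` and the returned phase counter are illustrative assumptions; a real parallel run would execute all additions of one phase at the same time.

```python
def tree_sum(values):
    """Sum by repeated pairwise addition; returns (total, number_of_phases)."""
    level = list(values)
    phases = 0
    while len(level) > 1:
        # All additions in one phase are independent, so with one processor
        # per pair a phase takes O(1) parallel time.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired element is carried over
        level = nxt
        phases += 1
    return (level[0] if level else 0), phases


print(tree_sum([3, 1, 4, 1, 5, 9, 2, 6]))
```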
Max1 – Max2
• Input : Array of n elements of type integer.
• Output : The first and the second maximum
elements in the array
• One processor, 2n operations.
• Two processors, each insertion takes 3
operations (comparison with each of the other
elements that are candidates), 2n/3
operations
Max1 – Max2
• Parallel algorithm - recursive solution
– Divide the elements into 2 groups (G1, G2).
– Find MAX for each group (LocalM1, LocalM2)
– If LocalM1 > LocalM2
• Create a new group G3 := (LocalM2 + G1)
• MAX2 must be in G3, since in G2 there is no
element that is bigger than LocalM2
Max1 – Max2
• Example
– End of recursion
M1[10] * M1[7] * M1[1] * M1[3] * M1[100] * M1[8] * M1[55] * M1[6]
– Up one phase
M1[10],M2[7] * M1[3],M2[1] * M1[100],M2[8] * M1[55],M2[6]
– Up one phase
M1[10],M2[7,3] * M1[100],M2[8,55]
– The result
M1[100] * M2 [10,8,55]
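The example above can be reproduced with a small knockout-tournament sketch. The name `max1_max2` and the representation of each group as a `(maximum, candidates)` pair are assumptions; the input needs at least two values.

```python
def max1_max2(values):
    """Return (max1, max2) using a knockout tournament over the values."""
    # Each entry is (current maximum, values that lost directly to it).
    level = [(v, []) for v in values]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (m1, l1), (m2, l2) = level[i], level[i + 1]
            # The winner keeps its own candidates and adds the loser's maximum.
            nxt.append((m1, l1 + [m2]) if m1 > m2 else (m2, l2 + [m1]))
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired group moves up unchanged
        level = nxt
    best, candidates = level[0]
    return best, max(candidates)


print(max1_max2([10, 7, 1, 3, 100, 8, 55, 6]))
```

Only the values that lost directly to the overall maximum remain candidates for Max2, which is why the candidate list has about log n entries.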
Max1 – Max2
• Complexity
– 1 processor
• n operations comparing all elements in the tree for Max1, log n
operations comparing elements for Max2, total (n + log n)
– O(n) processors
• We could find Max1 and rerun the algorithm to find Max2,
each in log n, total of 2 log n.
• However, we can use the previous algorithm and add G3 in
parallel, and we get log n for finding Max1, log log n for finding
Max2
Max & Min groups
• Input : 2 groups ( G1,G2) of sorted elements
• Output : 2 groups (G1`,G2`), where in one
group all elements are bigger than all the
elements in the other group
• One processor - Insert all elements into 2
stacks, always compare the stack heads, the
minimum is inserted into the Min group.
• Complexity - O(n) operations
Max & Min groups
• There is a major subtlety in the previous algorithm when
trying to apply it to parallel computing: each element
must be compared repeatedly until we find an element that is
higher than itself.
• We would like a method that compares each element with the
others as little as possible; the best is only one
comparison per element.
• Any member of the min group is necessarily smaller than
at least half of the elements.
• If we could establish this, we could classify the element into
the right group immediately
• Any suggestions?
Max & Min groups
• Parallel algorithm
– Insert all elements from G1 into list L1 in a reverse
order , and all elements of G2 into list L2 in regular
order
– Element j in L1 is bigger than n-j-1 elements of its list (counting from j = 0)
– Element j in L2 is bigger than j elements of its list
– So, by comparing element i in both lists we get
• If L1[i] > L2[i], L1[i] is bigger than n-i-1 elements in L1 and
i+1 elements (including L2[i]) in L2, a total of n elements. L2[i]
is smaller than n-i-1 elements of L2 and i+1 elements of
L1 (including L1[i]), a total of n elements.
• And vice versa
– We can now insert the elements immediately into their
groups
Max & Min groups
• Example
– Groups
• G1 = 7,10,100,101
• G2 = 1,11,18,99
– Lists
• L1 = 101,100,10,7
• L2 = 1, 11,18, 99
– Comparing : (101,1),(100,11),(10,18),(7,99)
– Result : G1’= 101,100,18,99 ,G2’ = 1,11,10,7
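A short Python sketch of the comparison step, reproducing the example above. The name `split_max_min` is an assumption, and both groups are assumed to be sorted in ascending order and to have the same size.

```python
def split_max_min(g1, g2):
    """g1, g2: ascending lists of equal length; returns (max_group, min_group)."""
    assert len(g1) == len(g2)
    l1 = g1[::-1]          # G1 in reverse order
    l2 = list(g2)          # G2 in regular order
    # Position i is handled by its own processor: one comparison per pair.
    max_group = [max(x, y) for x, y in zip(l1, l2)]
    min_group = [min(x, y) for x, y in zip(l1, l2)]
    return max_group, min_group


print(split_max_min([7, 10, 100, 101], [1, 11, 18, 99]))
```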
Max & Min groups
• Complexity
– We compare element i of each list
– Each element takes part in only one comparison
– O(n) processors, O(1) time!
– Can we do better for one processor now?
Signed elements
• Input : Array of elements, some of them signed
• Output : 2 arrays of elements, one containing the signed elements, the
other the unsigned ones, keeping the order between the elements
• One processor
– Make one pass, drop each element into the correct
array
– O(n) operations
• Since we need to maintain the order between the elements,
we must know for each element how many elements
should come before it
• How could we improve the algorithm by adding more
processors?
Signed elements array
• Parallel algorithm
– Create another array (A2) of elements, where in
each location of a signed element insert 1 and
in each location of unsigned elements insert 0
– Now we can run the parallel prefix algorithm
to obtain each element's position in the
destination array
– We can do the same for the unsigned elements
Signed elements array
• Example
– Input : [x1, x2, x3`, x4, x5`, x6, x7`, x8`, x9]
– A2 : [0, 0, 1, 0, 1, 0, 1, 1, 0]
– Prefix : [0, 0, 1, 1, 2, 2, 3, 4, 4]
– Result : x3` → 1, x5` → 2, x7` → 3, x8` → 4
• Complexity
– O(n) processors, O(log n) time!
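The following sketch simulates the prefix step round by round and then scatters the elements to their positions. The doubling scheme used in `parallel_prefix` (often called a Hillis-Steele scan) is one possible parallel prefix algorithm and is an assumption here, as are the function names and the `is_signed` callback.

```python
def parallel_prefix(values):
    """Inclusive prefix sums in about log2(n) rounds."""
    prefix = list(values)
    step = 1
    while step < len(prefix):
        # Every position reads the previous round's array, so all updates
        # within a round are independent of each other.
        prefix = [prefix[i] + (prefix[i - step] if i >= step else 0)
                  for i in range(len(prefix))]
        step *= 2
    return prefix


def split_signed(items, is_signed):
    """Stable split of items into (signed, unsigned) using prefix sums."""
    flags = [1 if is_signed(x) else 0 for x in items]
    pos_s = parallel_prefix(flags)
    pos_u = parallel_prefix([1 - f for f in flags])
    signed = [None] * (pos_s[-1] if items else 0)
    unsigned = [None] * (pos_u[-1] if items else 0)
    for i, x in enumerate(items):          # one independent write per element
        if flags[i]:
            signed[pos_s[i] - 1] = x
        else:
            unsigned[pos_u[i] - 1] = x
    return signed, unsigned


items = ["x1", "x2", "x3`", "x4", "x5`", "x6", "x7`", "x8`", "x9"]
print(split_signed(items, lambda s: s.endswith("`")))
```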
Scheduling
• Input : Array of jobs, containing the execution time of each
job and the deadline for finishing it.
• Output : Is there a schedule satisfying all the
deadlines?
• Parallel algorithm
– Sort the jobs by deadline
– Compute the prefix sums of the execution times
– A schedule exists iff PrefixExecTime(i) <= DeadLine[i] for every i
• Complexity with O(n) processors
– O(log^2 n) to sort, O(log n) to do prefix, O(1) to compare
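A sequential sketch of the feasibility test. It assumes a single machine, all jobs available at time 0, and that finishing exactly at the deadline is acceptable; `feasible` and the `(exec_time, deadline)` tuple format are illustrative names. `itertools.accumulate` stands in for the parallel prefix step.

```python
from itertools import accumulate


def feasible(jobs):
    """jobs: list of (exec_time, deadline) pairs; True if all deadlines can be met."""
    by_deadline = sorted(jobs, key=lambda job: job[1])     # sort by deadline
    finish = accumulate(t for t, _ in by_deadline)         # prefix of exec times
    # Each comparison is independent, so it is O(1) with one processor per job.
    return all(f <= d for f, (_, d) in zip(finish, by_deadline))


print(feasible([(3, 6), (2, 2), (1, 3)]))
print(feasible([(2, 2), (2, 3)]))
```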
CAG - Clique
• Input : CAG
• Output : the maximum clique
• Reminder
– Clique : a set of vertices in which there is an edge
between every two vertices
– CAG : Circular Arc Graph, a graph where each vertex
is an arc on a circle. There is an edge between two vertices iff
their arcs share a common segment of the
circle
CAG – Clique
• Examples
– Clique [V1,V2,V3]
– CAG
[Figure: a graph on the vertices v1-v4 and a circular arc graph on the arcs v1-v4]
CAG - Clique
• Parallel algorithm
– Loop through element list twice
• If Element == start of a vertex,
BoundriesArray[i] = +1;
• If Element == end of a vertex, and we have already passed
the start of this vertex, BoundriesArray[i] = -1;
– PrefixArray := Prefix ( BoundriesArray)
– MaxClique := Max ( PrefixArray)
CAG - Clique
• Example , CAG from previous slide
– BoundriesArray
[ (v1,+),(v2,+),(v1,-),(v4,+),(v3,-),(v4,-),(v2,+),(v1,+ ),(v3,+ )(v2,-),(v1,-)]
– PrefixArray
[1,2,1,2,1,0,1,2,3,2,1]
– MaxClique is 3 !
• Note : We need to loop twice through the list
of vertices, since we consider only the end of a vertex whose
start we have already passed.
CAG – Clique
• Complexity
– One processor, O(n)
– O(n) processors, log n + log n
– O(n^2) processors, log n + O(1)
Exclusive Read & Exclusive Write
• EREW
• The simplest computer
• Only one processor can read/write to a
certain memory block at a time
Concurrent Read &
Exclusive Write
• CREW
• Only one processor can write to a certain
memory block at a time.
• Multiple processors can simultaneously
read from a common memory block.
Exclusive Read &
Concurrent Write
• ERCW
• Only one processor can read a certain
memory block at a time.
• Multiple processors can simultaneously
write to a common memory block.
Concurrent Read &
Concurrent Write
• CRCW
• The most powerful computer
• Very complex memory control
• Multiple processors can simultaneously
read/write to a common memory block
Concurrent Write
Problem:
• Multiple processors writing different values
to a common memory block → every
processor overwrites the previous
processor's value.
[Figure: Processor 1, Processor 2 and Processor 3 all writing to one Memory Block]
Concurrent Write
Solution1:
• Restrict Write – only one unique value can be
written to the memory block (all the processors write the same value).
[Figure: Processor 1, Processor 2 and Processor 3 each write the value 1; the memory block holds 1]
Concurrent Write
Solution2:
• Combine Write – a separate value is stored
for every distinct processor in the shared
memory block.
[Figure: Processor 1 writes 1, Processor 2 writes 2 and Processor 3 writes 4; the memory block holds 1,2,4]
Restrict Write
A good example of Restrict Write is a
Boolean problem.
[Figure: X1, X2 and X3 all writing to a single Result cell]
Restrict Write
X1 ∨ X2 ∨ X3 → Result
Initial value: Result = 0
Only the value one is written to Result
result = 0;
For i = 1 to n doip (do in parallel)
{
if (Xi == 1)
then result = 1;
}
Max Value - O(n^2) Processors
Reminder:
One processor : O(n) operations.
O(n) processors : O(log n) operations.
O(n^2) processors : ?
We can represent the comparisons between the numbers
as a matrix. If x1 < x2 then coordinate (1,2) gets a
value of one, else it gets a value of zero.
Max Value - O(n^2) Processors
      X1     X2     X3     Result
X1    (1,1)  (1,2)  (1,3)  Row1
X2    (2,1)  (2,2)  (2,3)  Row2
X3    (3,1)  (3,2)  (3,3)  Row3
• A processor is allocated for each cell in the matrix.
• All the processors with “value = 1” write
simultaneously to the result cell in their row.
Max Value - O(n^2) Processors
      3   6   4   Result
3     0   1   1   1
6     0   0   0   0
4     0   1   0   1
Max Value
Total operations with O(n^2) processors : O(1)
– Generating the Matrix : O(1) operations
(one processor per cell)
– Generating the result column : O(1) operations
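A sequential simulation of the comparison-matrix idea. Each loop iteration stands for one of the n^2 processors; since every writer stores the same value 1, the concurrent writes do not conflict. Reading the maximum off the row whose result stays 0 is an interpretation of the table, and `max_by_matrix` is an illustrative name.

```python
def max_by_matrix(xs):
    """Find a maximum of xs with an n x n comparison matrix."""
    n = len(xs)
    # Cell (i, j) is 1 when xs[i] < xs[j]; one processor per cell.
    less = [[1 if xs[i] < xs[j] else 0 for j in range(n)] for i in range(n)]
    result = [0] * n
    for i in range(n):
        for j in range(n):
            if less[i][j]:
                result[i] = 1      # all writers to result[i] store the value 1
    # A row whose result is still 0 lost no comparison, so it holds a maximum.
    return next(xs[i] for i in range(n) if result[i] == 0)


print(max_by_matrix([3, 6, 4]))
```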
Sort - O(n^2) Processors
Reminder:
One processor : O(n log n) operations.
O(n) processors : O(log^2 n) operations (merge sort)
O(n^2) processors : ?
• As before, we generate a comparison matrix.
• The result cells will receive the sum of the current row.
Each row has O(n) processors, therefore the sum operation
takes O(log n) operations.
• The result column gives the index of each element in the sorted array
in descending order.
Sort - O(n^2) Processors
      3   6   4   Result
3     0   1   1   2
6     0   0   0   0
4     0   1   0   1
Total operations with O(n^2) processors : O(log n)
– Generating the Matrix : O(1) operations
(one processor per cell)
– Generating the result column : O(log n) operations
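The row sums can be used directly as target positions. This sketch assumes all elements are distinct (equal elements would receive the same index); `rank_sort_descending` is an illustrative name, and the inner `sum` stands in for the O(log n) parallel row sum.

```python
def rank_sort_descending(xs):
    """Sort distinct values in descending order using comparison-matrix row sums."""
    n = len(xs)
    # rank[i] = sum of row i = number of elements larger than xs[i].
    rank = [sum(1 for j in range(n) if xs[i] < xs[j]) for i in range(n)]
    out = [None] * n
    for i in range(n):
        out[rank[i]] = xs[i]       # each element is written to its own slot
    return out


print(rank_sort_descending([3, 6, 4]))
```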
Multiplication Of Matrix
• Matrices that can be multiplied must
obey the dimension rule : (n rows, m columns) * (m rows, k columns)
| a11  a12 |   | b11  b12 |   | a11b11 + a12b21   a11b12 + a12b22 |
| a21  a22 | * | b21  b22 | = | a21b11 + a22b21   a21b12 + a22b22 |
Multiplication Of Matrix
Input: Two matrices of size n*n (Mnn)
Output: One matrix Mnn
Total operations with one processor : O(n^3)
• n^2 cells
• Sum of each cell with O(n) variables and one
processor, O(n) operations
Multiplication Of Matrix
Total operations with O(n) processors : O(n^2)
• Processor per cell in a column.
• n columns
• Sum of each cell with O(n) variables and one
processor, O(n) operations
→ O(n) per sum * n columns = O(n^2)
Multiplication Of Matrix
Total operations with O(n^2) processors : O(n)
• n^2 cells
• Processor per cell
• Sum of each cell with O(n) variables and one processor,
O(n) operations
→ O(n) per sum * 1 cell = O(n)
→ Each cell is summed simultaneously
Multiplication Of Matrix
Total operations with O(n^3) processors : O(log n)
• n^2 cells
• O(n) processors per cell
• Sum of each cell with O(n) variables and O(n) processors,
O(log n) operations
→ O(log n) per sum * 1 cell = O(log n)
→ Each cell is summed simultaneously
Multiplication Of Boolean Matrix
Total operations with O(n^3) processors : O(1)
• n^2 cells
• O(n) processors per cell
• Sum (logical OR) of each cell with O(n) variables and O(n) processors,
O(1) operations using concurrent write
→ O(1) per sum * 1 cell = O(1)
→ Each cell is summed simultaneously
Shortest Path Between Vertexes
[Figure: graph on V1, V2, V3, V4 with the arcs V1-V2, V2-V3, V3-V4, V4-V1, each of length 1]
Problem:
• Finding whether a path exists between 2 vertices
• Finding the shortest path between 2
vertices
Shortest Path Between Vertexes
• Represent the graph as a matrix Ann.
• If an arc exists between vertex X1 and X2, then coordinates
(1,2) & (2,1) get a value of one, otherwise zero.
• Matrix Ann - all the pairs of vertices that are one arc away
from each other.
      V1  V2  V3  V4
V1    0   1   0   1
V2    1   0   1   0
V3    0   1   0   1
V4    1   0   1   0
Shortest Path Between Vertexes
• Matrix Ann^2 - all the routes of exactly two arcs between
two vertices.
• Ann + Ann^2 = all routes of one and two arcs.
      V1  V2  V3  V4
V1    2   0   2   0
V2    0   2   0   2
V3    2   0   2   0
V4    0   2   0   2
Shortest Path Between Vertexes
• Ann + Ann^2 + Ann^3 + … + Ann^n = B - all routes of 1 to n arcs.
• Any zero value in matrix B means that no path exists between
the two vertices.
      V1  V2  V3  V4
V1    2   1   2   1
V2    1   2   1   2
V3    2   1   2   1
V4    1   2   1   2
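For the path-existence question only, the sum of matrix powers can be computed with Boolean arithmetic (AND instead of multiply, OR instead of add). The sketch below is a plain sequential version under that assumption; `bool_mult` and `reachability` are illustrative names, and the test graph is the four-vertex cycle from the slides.

```python
def bool_mult(a, b):
    """Boolean product of two n x n 0/1 matrices."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def reachability(adj):
    """reach[i][j] = 1 iff some route of 1..n arcs leads from i to j."""
    n = len(adj)
    power = [row[:] for row in adj]        # current power of the matrix
    reach = [row[:] for row in adj]        # OR of all powers so far
    for _ in range(n - 1):
        power = bool_mult(power, adj)
        reach = [[reach[i][j] | power[i][j] for j in range(n)] for i in range(n)]
    return reach


cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]
print(reachability(cycle))
```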
Shortest Path Between Vertexes
Total operations with 1 processor : O(n^4)
• Building of Matrix Ann : O(n) operations
• Multiplication of matrix : O(n^3) operations
• Creation of Ann, Ann^2, Ann^3, …, Ann^n : O(n^4) operations
• Sum of the Matrices : O(n^3) operations
Shortest Path Between Vertexes
Total operations with O(n) processors : O(n^3)
• Building of Matrix Ann : O(1) operations
• Multiplication of matrix : O(n^2) operations
• Creation of Ann, Ann^2, Ann^3, …, Ann^n : O(n^3) operations
• Sum of the Matrices : O(n^2) operations (n cells * n columns)
Shortest Path Between Vertexes
Total operations with O(n^2) processors : O(n^2)
• Building of Matrix Ann : O(1) operations
• Multiplication of matrix : O(n) operations
• Creation of Ann, Ann^2, Ann^3, …, Ann^n : O(n^2) operations
• Sum of the Matrices : O(n) operations (processor per
cell)
Shortest Path Between Vertexes
Total operations with O(n^3) processors : O(n log n)
• Building of Matrix Ann : O(1) operations
• Multiplication of matrix : O(log n) operations
• Creation of Ann, Ann^2, Ann^3, …, Ann^n : O(n log n) operations
• Sum of the Matrices : O(log n) operations (O(n)
processors per cell)
Shortest Path Between Vertexes
Total operations with O(n^4) processors : O(log^2 n)
• Building of Matrix Ann : O(1) operations
• Multiplication of matrix : O(log n) operations with O(n^3)
processors
• Creation of Ann, Ann^2, Ann^3, …, Ann^n : O(log^2 n) operations
(prefix algorithm)
• Sum of the Matrices : O(log n) operations
• Boolean Output (link exists: True or False) : O(log n) operations