PARALLEL AND DISTRIBUTED
COMPUTATION
(Slides by L. Pagli)
• MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY
[Figure: processors P1, P2, ..., Pn attached to an INTERCONNECTION NETWORK]
• CONNECTION MACHINE (THINKING MACHINES CORP.): 64,000 processors
• INTERNET: connects all the computers of the world
Cuba 1
THREE TYPES OF MULTIPROCESSING FRAMEWORKS, CLOSELY RELATED
• CONCURRENT: MULTIPROCESSING ACTIVITIES TAKE PLACE IN A SINGLE MACHINE (POSSIBLY USING
SEVERAL PROCESSORS), SHARING MEMORY AND TASKS.
• PARALLEL
    • PRAM
    • Bounded-degree network and VLSI
• DISTRIBUTED
TECHNICAL ASPECTS
• PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A
LARGE EXTENT AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM
BETWEEN THEM.
• DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT, COMMUNICATION IS LESS
FREQUENT AND LESS SYNCHRONOUS, AND THE COOPERATION IS LIMITED.
PURPOSES
• PARALLEL COMPUTERS COOPERATE TO SOLVE MORE EFFICIENTLY (POSSIBLY)
DIFFICULT PROBLEMS
• DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES.
SOMETIMES COMMUNICATION WITH OTHER COMPUTERS IS NEEDED (E.G. DISTRIBUTED
DATABASE OPERATIONS).
PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE
DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY
WHEN IT IS NECESSARY
FOR PARALLEL SYSTEMS
WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL
FOR DISTRIBUTED SYSTEMS
WE ARE INTERESTED IN SOLVING IN PARALLEL
PARTICULAR PROBLEMS ONLY. TYPICAL EXAMPLES ARE:
• COMMUNICATION SERVICES
    ROUTING
    BROADCASTING
• MAINTENANCE OF CONTROL STRUCTURE
    * SPANNING TREE CONSTRUCTION
    TOPOLOGY UPDATE
    * LEADER ELECTION
• RESOURCE CONTROL ACTIVITIES
    LOAD BALANCING
    MANAGING GLOBAL DIRECTORIES
    * MUTUAL EXCLUSION
PARALLEL ALGORITHMS
• WHICH MODEL OF COMPUTATION IS BEST TO USE?
• HOW MUCH TIME DO WE EXPECT TO SAVE BY USING A PARALLEL ALGORITHM?
• HOW DO WE CONSTRUCT EFFICIENT ALGORITHMS?
MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED
• IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?
• ARE THERE PROBLEMS THAT ADMIT NO EFFICIENT PARALLEL SOLUTION,
THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?
UNDECIDABLE PROBLEMS REMAIN UNDECIDABLE!
PRAM MODEL
• Joseph JáJá
An Introduction to Parallel Algorithms. Addison-Wesley Pub. Comp., 1992
• Karp R.M., Ramachandran V.
A survey of parallel algorithms for shared-memory machines. In J. van Leeuwen,
Ed., Handbook of Theoretical Comp. Science
• Ian Parberry
Parallel Complexity Theory. Research Notes in Theoretical Computer Science.
John Wiley & Sons, 1987
TO FOCUS ON ALGORITHMIC ISSUES INDEPENDENTLY OF PHYSICAL LOCATIONS
[Figure: processors P1, P2, ..., Pi, ..., Pn connected to a Common Memory with cells 1, 2, 3, ..., m]
PRAM: n RAM processors numbered from 1 to n and
connected to a common memory of m cells
ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal
computation and write another memory cell.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time!
Pi writes the message in cell x at time t
Pj reads the message in cell x at time t+1
ASSUMPTIONS
• Shared-memory: The array A is stored in the global memory and can be accessed by any
processor.
• Synchronous mode of operation: In each unit of time, each processor is allowed to execute
an instruction or to stay idle.
There are several variations regarding the handling of simultaneous access to the same
memory location.
EREW-PRAM (exclusive read exclusive write)
CREW-PRAM (concurrent read exclusive write)
CRCW-PRAM (concurrent read concurrent write), with a policy to resolve concurrent writes:
Common, Priority, Arbitrary
The three models do not differ substantially in their computational power!
If each processor can execute its own local program we have a
MIMD (multiple instruction multiple data) model; otherwise we have a
SIMD (single instruction multiple data) model.
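The three concurrent-write policies of the CRCW-PRAM can be illustrated with a small sketch. The function name and the (processor, cell, value) encoding are hypothetical, and two conventions are assumptions: under Priority the lowest-numbered processor wins, and under Arbitrary the first listed attempt is taken as the arbitrary winner.

```python
from typing import Dict, List, Tuple


def resolve_concurrent_writes(
    writes: List[Tuple[int, int, int]], policy: str
) -> Dict[int, int]:
    """Resolve the writes issued in one PRAM time unit.

    writes: (processor_id, cell, value) triples issued simultaneously.
    policy: 'common', 'priority' or 'arbitrary'.
    Returns a mapping cell -> value actually stored.
    """
    by_cell: Dict[int, List[Tuple[int, int]]] = {}
    for pid, cell, value in writes:
        by_cell.setdefault(cell, []).append((pid, value))

    stored: Dict[int, int] = {}
    for cell, attempts in by_cell.items():
        if policy == "common":
            # Common: concurrent writes are legal only if all values agree.
            values = {value for _, value in attempts}
            if len(values) != 1:
                raise ValueError(f"conflicting values written to cell {cell}")
            stored[cell] = values.pop()
        elif policy == "priority":
            # Priority: assumed here to mean the lowest processor index wins.
            stored[cell] = min(attempts, key=lambda a: a[0])[1]
        elif policy == "arbitrary":
            # Arbitrary: any attempt may win; this sketch takes the first one.
            stored[cell] = attempts[0][1]
        else:
            raise ValueError(f"unknown policy: {policy}")
    return stored


writes = [(3, 0, 7), (1, 0, 9), (2, 5, 4)]
by_priority = resolve_concurrent_writes(writes, "priority")
by_arbitrary = resolve_concurrent_writes(writes, "arbitrary")
```

An algorithm written for the Arbitrary policy must be correct whichever attempt wins, so the fixed choice made here is only one admissible outcome.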
From Bertossi, Chap. 27
• Summation, n log n
• Summation, n
[R. Grossi]
Important parameters of the efficiency of a parallel algorithm
Tp(n) (or Tp): parallel time
p(n) (or p): number of processors
Cn = Tp p: cost of the parallel algorithm
LOWER BOUND of the parallel computation
Let A be a problem and Ts be the complexity of the optimal sequential
(or the best known) algorithm for A. We have:
Tp >= Ts / p
The parallel algorithm can be converted into a sequential algorithm that
runs in O(Cn) time: the single processor simulates the p processors in p
steps for each of the Tp parallel steps.
If the parallel time were less than Ts / p, we could derive a sequential
algorithm better than the optimal one!!
Parallel algorithm (p=2, Tp=3, C=6)
time    processor P1    processor P2
1       op1             op4
2       op2             op5
3       op3
can be simulated by a single processor in a number of steps (time) <= 6
Sequential algorithm (Ts=5)
time    1     2     3     4     5
        op1   op4   op2   op5   op3
Tp >= Ts/p
Ts/Tp: speed up of the parallel algorithm
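The simulation argument can be replayed on this example. The sketch below is only an illustration: the list-of-rows encoding of the schedule and the function name are assumptions, with None marking an idle processor.

```python
# One row per parallel time unit, one column per processor (None = idle).
schedule = [
    ["op1", "op4"],
    ["op2", "op5"],
    ["op3", None],
]


def simulate_sequentially(rows):
    """Run the parallel schedule on one processor, one operation per step."""
    order = []
    for row in rows:
        for op in row:
            if op is not None:
                order.append(op)
    return order


p = len(schedule[0])          # number of processors
Tp = len(schedule)            # parallel time
cost = p * Tp                 # C = Tp * p, an upper bound on the simulation time
order = simulate_sequentially(schedule)
Ts = len(order)               # steps actually used by the single processor
speedup = Ts / Tp
```

The single processor needs 5 steps, within the bound C = 6, and the lower bound Tp >= Ts/p holds as 3 >= 2.5.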
MAXIMUM on the PRAM
Input: an array A of n = 2^k elements in the shared memory of a PRAM with n/2 processors
Output: the maximum element, stored in location S.
Algorithm MAX
begin
for h := 1 to log n do
for all i where 1 <= i <= n/2^h do in parallel A[i] := max {A[2i-1], A[2i]}
S := A[1]
end
[Figure: binary tree of comparisons for n = 8. Processors P1, P2, P3, P4 work on the bottom level, P1 and P2 on the next level, and P1 on the top level, whose result is stored in S]
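A sequential simulation of algorithm MAX may help to see the log n rounds. This is a sketch under stated assumptions: the function name is hypothetical, the array is kept 1-based to match the slide, and each round performs all reads before any write to mimic the synchronous PRAM step.

```python
def pram_max(values):
    """Simulate algorithm MAX; returns (maximum, number of parallel rounds)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    A = [None] + list(values)      # A[1..n], index 0 unused
    rounds = 0
    active = n // 2                # processors active in the current round
    while active >= 1:
        # Synchronous step: every processor reads its two cells first...
        results = [max(A[2 * i - 1], A[2 * i]) for i in range(1, active + 1)]
        # ...and only then do all processors write.
        for i, value in enumerate(results, start=1):
            A[i] = value
        rounds += 1
        active //= 2
    return A[1], rounds


maximum, rounds = pram_max([3, 9, 1, 7, 5, 2, 8, 4])
```

For n = 8 the simulation uses 3 = log n rounds, with 4, 2 and 1 active processors respectively.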
From the previous lower bound and the sequential computation (Ts = O(n)):
C = Tp p >= Ts, so a cost of O(n) would be optimal
From algorithm MAX
C = Tp p = O(n log n): not optimal
Better algorithm:
• divide the n elements into k = n/log n subsets of log n elements each
[Figure: processors P1, P2, P3, ..., Pk, each computing the local maximum m1, m2, m3, ..., mk of its own subset]
• each processor computes the maximum mi of its subset
with the sequential algorithm in time O(log n)
• algorithm MAX is executed among the local maxima, in time
O(log (n/log n)) = O(log n - log log n) = O(log n)
Overall time:
Tp = O(log n)
with p = n/log n, and
C = Tp p = O(n): optimal
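The two phases of the better algorithm can be sketched as follows. Names are hypothetical, the block size is taken as ceil(log2 n), and the pairwise rounds are written for any number of local maxima (not only powers of two), which is an assumption beyond the slide.

```python
from math import ceil, log2


def tree_max(values):
    """Pairwise-maximum rounds, as in algorithm MAX; returns (max, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        level = [max(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


def optimal_max(values):
    """Returns (maximum, processors used, parallel time in comparison steps)."""
    n = len(values)
    if n == 0:
        raise ValueError("empty input")
    block = max(1, ceil(log2(n))) if n > 1 else 1
    blocks = [values[i:i + block] for i in range(0, n, block)]
    p = len(blocks)                        # about n / log n processors
    # Phase 1: each processor scans its own block sequentially.
    local_maxima = [max(b) for b in blocks]
    local_time = block - 1                 # comparisons per processor
    # Phase 2: algorithm MAX among the local maxima.
    result, rounds = tree_max(local_maxima)
    return result, p, local_time + rounds


result, p, time_steps = optimal_max(list(range(16)))
```

For n = 16 this uses 4 processors and 5 steps, a cost of 20, compared with 8 processors and 4 steps (cost 32) for plain algorithm MAX.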
PERFORMANCE OF PARALLEL ALGORITHMS
Four ways of measuring the performance of a parallel algorithm:
1. P(n) processors and Tp(n) time.
2. C(n) = P(n)Tp(n) cost and Tp(n) time.
The number of processors depends on the size n of the problem.
The second relation can be generalized to any number p < P(n) of processors:
each of the Tp(n) parallel steps can be simulated by the p processors
in O(P(n)/p) substeps; this simulation takes a total of O(Tp(n)P(n)/p) time.
3. O(Tp(n)P(n)/p) time for any number p < P(n) of processors.
If the number of processors p is larger than P(n), we can clearly achieve the
running time Tp(n) by using P(n) processors only.
Relation 3 can be further generalized.
4. O(C(n)/p + Tp(n)) time for any number p of processors.
In conclusion, in the design of a PRAM algorithm, we can assume as many processors
as we need and use the proper relation to analyze it.
PERFORMANCE OF ALGORITHM MAX
1. P(n) = n/2 processors and Tp(n) = O(log n) time.
2. C(n) = P(n)Tp(n) = O(n log n) cost and Tp(n) = O(log n) time.
Assume p = log n processors:
3. O(Tp(n)P(n)/p) = O(log n * n / log n) = O(n) time.
Therefore
4. O(n log n / p + log n) time. If p >= n, O(log n) time; otherwise O(n log n / p) time.
Work W(n) of a parallel algorithm: total number of operations used.
Work of algorithm MAX:
W(n) = SUM_{j=1..log n} (n/2^j) + 1 = O(n)
W(n) < C(n)
W(n) measures the total number of operations and has nothing to do with
the number of processors available, while C(n) measures the cost of the alg.
relative to the number p of processors available.
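The gap between work and cost for algorithm MAX can be checked numerically. The sketch below is only an illustration with a hypothetical function name; it counts one operation per comparison plus the final assignment, and takes the cost as exactly (n/2) log n.

```python
def max_work_and_cost(n):
    """Work W(n) and cost C(n) of algorithm MAX for n a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    k = n.bit_length() - 1                           # k = log2(n)
    # n/2^j comparisons in round j, plus the final assignment to S.
    work = sum(n // 2 ** j for j in range(1, k + 1)) + 1
    # n/2 processors kept busy (or idle) for log n rounds.
    cost = (n // 2) * k
    return work, cost


work, cost = max_work_and_cost(1024)
```

For n = 1024 the work is 1024 operations while the cost is 5120. The strict inequality W(n) < C(n) holds from n = 8 onwards; for n = 2 and n = 4 the exact counts used here give W(n) >= C(n).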
Work-time presentation of a parallel algorithm:
any number of parallel operations at each time unit is allowed
BRENT PRINCIPLE:
given a parallel algorithm that runs in time T(n) and requires W(n) work,
we can adapt this algorithm to run on a p-processor PRAM in time
Tp(n) <= floor(W(n)/p) + T(n)
Let Wi(n) be the number of operations of time unit i, 1 <= i <= T(n). Simulate
each set of Wi(n) operations in ceil(Wi(n)/p) parallel steps of the p processors,
for each 1 <= i <= T(n).
The p-processor PRAM algorithm takes SUM_i ceil(Wi(n)/p) <= SUM_i (floor(Wi(n)/p) + 1)
<= floor(W(n)/p) + T(n) steps.
The Brent Principle assumes that the scheduling of the operations to the processors is
always a trivial task. This is not always true. It is easy if we use C(n) in place of W(n).
[Figure: algorithm A1 performs the operations 1, ..., 36 in 6 time units]
algorithm A1: T1 = 6, W(n) = 36
Wi = 6, 7, 3, 8, 5, 7 (operations in time units t1, ..., t6)
A1 can be simulated by A2 with 3 processors in time T2(n) <= 36/3 + 6 = 18
[Figure: algorithm A2 schedules the same 36 operations on 3 processors, one time unit of A1 after the other]
T2 = 14
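The schedule of A2 can be reproduced with a short sketch of the Brent simulation. The function name is hypothetical, and operations are simply numbered consecutively within each time unit, which assumes that the operations of one time unit are independent of each other.

```python
def brent_schedule(work_per_step, p):
    """Schedule Wi operations per time unit on p processors.

    Returns a list of rows; each row holds the operation numbers executed
    in one step of the p-processor algorithm (at most p per row).
    """
    rows = []
    next_op = 1
    for w in work_per_step:
        ops = list(range(next_op, next_op + w))
        next_op += w
        # ceil(w / p) steps are needed for the w operations of this time unit.
        for start in range(0, w, p):
            rows.append(ops[start:start + p])
    return rows


Wi = [6, 7, 3, 8, 5, 7]
rows = brent_schedule(Wi, p=3)
T2 = len(rows)
bound = sum(Wi) // 3 + len(Wi)
```

The schedule takes 2 + 3 + 1 + 3 + 2 + 3 = 14 steps, below the bound 36/3 + 6 = 18.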
From Bertossi, Chap. 27
Basic techniques:
• Prefix sums
• Non-optimal sorting
• List ranking with pointer jumping
• Euler tour
[R. Grossi]
PARALLEL DIVIDE AND CONQUER
• Partition the input into several subsets of almost equal size
• Solve recursively the subproblem defined by each subset
• Combine the solutions of the subproblems into a solution
to the overall problem
CONVEX HULL
sequential algorithm: O(n log n)
[Figure: convex hull with vertices v1, ..., v7, split by the points p and q into the UPPER HULL and the LOWER HULL]
• Sort the points by x-coordinate in increasing order: Tp = O(log n)
• Compute the UPPER HULL
w.l.o.g., let n = 2^k and
x(v1) < x(v2) < ... < x(vn)
• Divide the points into two subsets
S1 = (v1, v2, ..., vn/2)
S2 = (vn/2+1, ..., vn)
Suppose the UPPER HULLs of S1 and S2 are already computed.
Compute ab: the upper common tangent
[Figure: upper hulls q1, q2, ... of S1 and q'1, q'2, ... of S2, joined by the upper common tangent ab]
• The UPPER HULL of S is formed by
q1, ..., qi = a, q'j = b, ..., q's
Algorithm UPPER HULL (Sketch)
1. If n <= 4, use a brute force method to determine UH(S).
2. Let S1 = (v1, v2, ..., vn/2), S2 = (vn/2+1, ..., vn); recursively compute UH(S1) and UH(S2)
in parallel
(Tp(n/2) time and 2W(n/2) operations)
3. Find the Upper Common Tangent between UH(S1) and UH(S2) and deduce UH(S)
(O(log n) sequential time, O(n) operations)
Tp(n) = Tp(n/2) + O(log n) = O(log^2 n)
W(n) = 2W(n/2) + O(n) = O(n log n)
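The divide and conquer structure can be sketched sequentially. Note the swapped-in technique: instead of locating the upper common tangent by binary search, this sketch merges the two hulls with a linear monotone-chain scan, which gives the same UH(S) with O(n) operations per merge but not in O(log n) time. Function names are hypothetical, and the points are assumed to be sorted by strictly increasing x.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (negative for a right turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def scan_upper(points):
    """Upper hull of points sorted by x, via a left-to-right scan."""
    hull = []
    for pt in points:
        # Drop the last vertex while it lies on or below the chord to pt.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], pt) >= 0:
            hull.pop()
        hull.append(pt)
    return hull


def upper_hull(points):
    """Divide and conquer upper hull; the two halves are independent."""
    n = len(points)
    if n <= 4:
        return scan_upper(points)          # brute-force base case
    mid = n // 2
    left = upper_hull(points[:mid])        # UH(S1)
    right = upper_hull(points[mid:])       # UH(S2)
    # UH(S) is the upper hull of UH(S1) followed by UH(S2).
    return scan_upper(left + right)


pts = [(0, 0), (1, 3), (2, 1), (3, 4), (4, 0), (5, 2), (6, 1), (7, 0)]
hull = upper_hull(pts)
```

On a PRAM the two recursive calls would run in parallel, so only the merge step contributes to the depth of the recursion.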
Intractable problems remain intractable in parallel
For an intractable (NP-hard) problem, the only known solutions require exponential time:
Ts = a b^n
p = n^c (polynomial in the size of the input)
From the lower bound:
Tp >= a b^n / n^c > a (b/2)^n
for large values of n:
still exponential
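The step behind the last inequality can be written out as follows. It only uses the fact that a polynomial is eventually smaller than an exponential; the symbol β is introduced here for illustration and does not appear in the slides.

```latex
T_p \;\ge\; \frac{T_s}{p} \;=\; \frac{a\,b^{n}}{n^{c}},
\qquad
n^{c} < 2^{n} \ \text{for all sufficiently large } n
\;\Longrightarrow\;
\frac{a\,b^{n}}{n^{c}} \;>\; \frac{a\,b^{n}}{2^{n}} \;=\; a\left(\frac{b}{2}\right)^{n}.
```

The same argument with any base 1 < β < b in place of 2 gives the bound a (b/β)^n, so the parallel time grows exponentially for every b > 1.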
We consider only the class P, and in particular the class NC ⊆ P.
NC is the class of all (decision) problems that can be solved in
polylog parallel time (i.e. Tp is of order O(log^k n)), with a polynomial number
of processors.
NC contains problems that can be
efficiently solved in parallel
PARALLEL                          SEQUENTIAL
Class NC: Efficient Algorithm     Class P: Efficient Algorithm
NC = P ?                          P = NP ?
• There are problems belonging to P for which NO EFFICIENT PARALLEL
algorithm is known.
• There is no proof that such an algorithm does not exist.
P-complete Problems               NP-complete Problems
Monotone Circuit Value (MCV)      Satisfiability (SAT)
Goldschlager's Th. (1977)         Cook's Th. (1971)
[Figure: every problem P1, P2, P3, ..., Ph in P reduces to MCV; every problem NP1, NP2, NP3, ..., NPk in NP reduces to SAT]
MONOTONE CIRCUIT VALUE PROBLEM (MCVP)
[Figure: circuit with inputs a, b, c, d, e, f, g and single output z]
z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
Determine the value of the single output of a Boolean circuit consisting of two-input
AND and OR gates and a set of inputs and their complements.
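The circuit for z can be evaluated gate by gate, which is the O(n) sequential computation that places MCVP in P. The sketch below is only an illustration: the gate names g1, ..., g5 and the dictionary encoding are hypothetical, and the gates are assumed to be listed in topological order.

```python
def evaluate_circuit(gates, inputs, output):
    """Evaluate a monotone circuit of two-input AND/OR gates.

    gates: dict name -> (op, left, right), listed in topological order.
    inputs: dict name -> bool for the input lines.
    """
    values = dict(inputs)
    for name, (op, left, right) in gates.items():
        if op == "AND":
            values[name] = values[left] and values[right]
        elif op == "OR":
            values[name] = values[left] or values[right]
        else:
            raise ValueError(f"unsupported gate: {op}")
    return values[output]


# z = (((a AND b) OR c) AND d) AND ((e AND f) OR g)
gates = {
    "g1": ("AND", "a", "b"),
    "g2": ("OR", "g1", "c"),
    "g3": ("AND", "g2", "d"),
    "g4": ("AND", "e", "f"),
    "g5": ("OR", "g4", "g"),
    "z": ("AND", "g3", "g5"),
}
assignment = dict(a=True, b=False, c=True, d=True, e=False, f=True, g=True)
z = evaluate_circuit(gates, assignment, "z")
```

Each gate is visited once, but a gate cannot be evaluated before its inputs are known; the depth of this dependency chain is what makes the problem hard to parallelize.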
DEPTH FIRST SEARCH
[Figure: directed graph on the vertices a, ..., g; arcs are numbered according to their order of appearance on the adjacency list]
vertex:       a  b  c  d  e  f  g
dfs numbers:  1  2  5  3  4  6  7
MAX FLOW
[Figure: a network with source s and sink t whose arcs are labelled with capacities, and the same network carrying a flow of value f = 6]
A directed graph N (network) with two distinguished vertices: source s and sink t;
each arc is labelled with its capacity (positive integer).
A flow f is a function such that
1. 0 <= f(e) <= c(e), for all arcs e (capacity constraint)
2. for every node other than s and t, the sum of the flow on all incoming arcs is equal to
the sum of the flow on all outgoing arcs (conservation constraint)
The value of the FLOW is given by the sum of the flow on the outgoing arcs of s (equal to
the sum of the flow on all incoming arcs of t).
Find the maximum possible value of the flow.
Sequential Algorithm: O(n^3)
No efficient parallel solution is known
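The two constraints and the value of a flow can be checked mechanically. The network below is a small hypothetical example, not the one in the figure, and the sketch assumes that no arc enters the source s.

```python
def flow_value(capacity, flow, s, t):
    """Validate a flow and return its value.

    capacity, flow: dicts mapping arcs (u, v) to integers.
    Raises ValueError if a constraint is violated.
    """
    # 1. Capacity constraint on every arc.
    for arc, c in capacity.items():
        if not 0 <= flow.get(arc, 0) <= c:
            raise ValueError(f"capacity violated on arc {arc}")
    # 2. Conservation constraint on every node other than s and t.
    nodes = {node for arc in capacity for node in arc}
    for v in nodes - {s, t}:
        inflow = sum(flow.get((u, w), 0) for (u, w) in capacity if w == v)
        outflow = sum(flow.get((u, w), 0) for (u, w) in capacity if u == v)
        if inflow != outflow:
            raise ValueError(f"conservation violated at node {v}")
    # Value: total flow leaving the source.
    return sum(flow.get((u, w), 0) for (u, w) in capacity if u == s)


# Hypothetical network: arcs with their capacities.
capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 2, ("b", "t"): 3}
flow = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
        ("a", "t"): 2, ("b", "t"): 3}
value = flow_value(capacity, flow, "s", "t")
```

In this example the flow saturates both arcs leaving s, so its value 5 is also the maximum.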
Parallel Decision Problems
Reducibility Notion: Let A1 and A2 be decision problems. A1 is NC-reducible to A2 if
there exists an NC-algorithm that transforms an arbitrary input u1 of A1 into an input u2
of A2, such that A1 answers yes for u1 if and only if A2 answers yes for u2.
A2 is at least as difficult as A1
A problem A in P is P-complete if every problem in the class P is NC-reducible to A
If A is P-complete:
A ∈ NC iff P = NC
If A is NP-complete:
A ∈ P iff P = NP
The hope of finding an efficient parallel algorithm is very low
To show that a problem A is P-complete:
- A ∈ P
- MCVP is NC-reducible to A
MCVP
input: an acyclic network of AND and OR gates (each with two inputs) and an assignment
of constant values 0, 1 to each input line
output: the value of the single output line
Sketch of GOLDSCHLAGER'S theorem
An arbitrary problem A ∈ P can be formulated as an MCVP problem.
• MCVP ∈ P because z can be computed in O(n) sequential time.
• If A ∈ P, then A is accepted by a deterministic TM in time T(n), polynomial in the
input size n.
[Figure: a TM receiving an input of size n and producing output 0 or 1]
Q = {q1, ..., qs}: set of states
Σ = {a1, ..., am}: tape alphabet
δ : Q x Σ -> Q x Σ x {L, R}: transition function
The corresponding Boolean circuit is defined by the following Boolean functions:
1. H(i, t) = 1 if the head is on cell i at time t, 0 <= t <= T(n), 1 <= i <= T(n).
2. C(i, j, t) = 1 if cell i contains the symbol aj at time t, 0 <= t <= T(n), 1 <= i <= T(n),
1 <= j <= m.
3. S(k, t) = 1 if the state of the TM is qk at time t, 1 <= k <= s, 0 <= t <= T(n).
Each step of the Turing machine can be described by one level of the circuit computing
H(i, t), C(i, j, t) and S(k, t).
EX: TM with Q = {q1, q2, q3}, Σ = {0, 1} (a1 = 0, a2 = 1)
Transition table (entries: symbol written, next state, head move):
        0         1
q1      -         1 q2 R
q2      0 q3 R    1 q2 L
q3      0 q3 L    1 q3 R
[Figure: tape cells i-1, i, i+1 around the head; the initial tape starts with 1 1]
t = 0
H(i, 0) = 1 if i = 1, 0 otherwise
C(i, j, 0) = 1 if cell i initially contains aj (e.g. i = 1, j = 2), 0 otherwise
S(k, 0) = 1 if k = 1, 0 if k != 1
t > 0
H(i, t) = (H(i-1, t-1) AND "right shift") OR (H(i+1, t-1) AND "left shift")
"left shift" = (S(2, t-1) AND C(i+1, 2, t-1)) OR (S(3, t-1) AND C(i+1, 1, t-1))
Analogously compute C(i, j, t) and S(k, t). The circuit value is given by C(1, *, T(n));
the circuit itself can be constructed in O(log n) time with a quadratic number of processors.
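The level-by-level construction can be tried out on the example machine. The sketch below is an illustration under several assumptions: the transition table is the one of the example with R = +1 and L = -1, symbols are indexed by their value 0/1 rather than by j, the initial tape 1 0 0 0 is hypothetical, and `not H[i]` is used as a shortcut where a monotone circuit would carry the complement on a separate wire.

```python
# (state, symbol read) -> (symbol written, next state, move); (q1, 0) is undefined.
DELTA = {
    (1, 1): (1, 2, +1),
    (2, 0): (0, 3, +1),
    (2, 1): (1, 2, -1),
    (3, 0): (0, 3, -1),
    (3, 1): (1, 3, +1),
}
STATES = (1, 2, 3)
SYMBOLS = (0, 1)


def initial_level(tape):
    """Level t = 0: head on cell 1, state q1, cells hold the given tape."""
    cells = range(1, len(tape) + 1)
    H = {i: i == 1 for i in cells}
    C = {(i, a): tape[i - 1] == a for i in cells for a in SYMBOLS}
    S = {k: k == 1 for k in STATES}
    return H, C, S


def next_level(H, C, S):
    """Compute H, C, S at time t from their values at time t-1."""
    cells = sorted(H)

    def head_arrives_from(src, move):
        # H(src, t-1) AND ("right shift" or "left shift" evaluated at src).
        if src not in H:
            return False
        return H[src] and any(
            S[q] and C[(src, a)]
            for (q, a), (_, _, m) in DELTA.items() if m == move)

    new_H = {i: head_arrives_from(i - 1, +1) or head_arrives_from(i + 1, -1)
             for i in cells}
    new_C = {}
    for i in cells:
        for b in SYMBOLS:
            written = H[i] and any(
                S[q] and C[(i, a)]
                for (q, a), (w, _, _) in DELTA.items() if w == b)
            # A cell without the head keeps its symbol.
            new_C[(i, b)] = written or (not H[i] and C[(i, b)])
    new_S = {k: any(H[i] and S[q] and C[(i, a)]
                    for i in cells
                    for (q, a), (_, nxt, _) in DELTA.items() if nxt == k)
             for k in STATES}
    return new_H, new_C, new_S


def run(tape, steps):
    level = initial_level(tape)
    for _ in range(steps):
        level = next_level(*level)
    return level


H, C, S = run([1, 0, 0, 0], steps=5)
head_cells = [i for i in H if H[i]]
current_states = [k for k in S if S[k]]
```

After five levels the head is on cell 2 in state q3, which matches a direct step-by-step run of the machine on this tape.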
THE PRAM IS A THEORETICAL (INFEASIBLE) MODEL
• The interconnection network between processors and memory would require
a very large amount of area.
• The message routing on the interconnection network would require time
proportional to the network size (i.e. the assumption of a constant access time
to the memory is not realistic).
WHY IS THE PRAM A REFERENCE MODEL?
• Algorithm designers can forget the communication problems and focus their
attention on the parallel computation only.
• There exist algorithms simulating any PRAM algorithm on bounded-degree
networks.
E.g. a PRAM algorithm requiring time T(n) can be simulated on a mesh of trees
in time T(n) log^2 n / log log n, that is, each step can be simulated with a slow-down
of log^2 n / log log n.
• Instead of designing ad hoc algorithms for bounded-degree networks, design more
general algorithms for the PRAM model and simulate them on a feasible network.
• For the PRAM model there exists a well developed body of techniques
and methods to handle different classes of computational problems.
• The discussion on parallel models of computation is still HOT
The current trend:
COARSE-GRAINED MODELS (BSP, LogP)
• The degree of parallelism allowed is independent of the number
of processors.
• The computation is divided into supersteps, each of which includes
    • local computation
    • communication phase
    • synchronization phase
The study is still at the beginning!
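The three phases of a superstep can be illustrated with a deterministic, single-threaded sketch that sums an array on p virtual processors. Everything here is a hypothetical illustration (function name, round-robin data distribution, collecting on processor 0), not the API of an actual BSP library.

```python
def bsp_sum(data, p):
    """Sum `data` with p virtual processors in two supersteps.

    Returns (total, number of supersteps).
    """
    chunks = [data[i::p] for i in range(p)]      # round-robin distribution
    inboxes = [[] for _ in range(p)]

    # Superstep 1
    # local computation: each processor sums its own chunk
    partial = [sum(chunk) for chunk in chunks]
    # communication phase: every processor sends its partial sum to processor 0
    outgoing = [(0, partial[i]) for i in range(p)]
    # synchronization phase: messages become visible only after the barrier
    for destination, value in outgoing:
        inboxes[destination].append(value)

    # Superstep 2
    # local computation on processor 0: combine the received partial sums
    total = sum(inboxes[0])
    return total, 2


total, supersteps = bsp_sum(list(range(10)), p=3)
```

The number of supersteps does not depend on p here; changing p only changes how much local computation each processor performs.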