Parallel and Multicore Computing Theory
Vijaya Ramachandran
Department of Computer Science
University of Texas at Austin
THE MULTICORE ERA
• Multicores have arrived and the multicore era represents a paradigm shift in
general-purpose computing.
• Algorithms research needs to address the multitude of challenges that come
with this shift to the multicore era.
THE PAST: THE VON NEUMANN ERA
An algorithm in the von Neumann model assumes a single processor that
executes unit-cost steps with unit-cost access to data in memory.
• A very simple abstract model
• Has been very successful for the past several decades
• Has facilitated development of good portable code whose performance by
and large matched the theoretical analysis:
– Sorting: Quick-sort, Merge-sort, Heap-sort
– Graph algorithms: minimum spanning tree, shortest paths, maximum flow
THE PRESENT INTO THE FUTURE: MULTICORE ERA
• p cores, each with private cache of size M
• An arbitrarily large global shared memory
• Data organized in blocks of size B.
DESIGN AND ANALYSIS OF EFFICIENT MULTICORE ALGORITHMS
• Determine a suitable abstract high level model for multicore algorithms.
• Develop suitable cost measures.
• Develop algorithms that are independent of machine parameters.
Such algorithms will result in portable code that runs efficiently across
different multicore environments.
• Design algorithmic techniques that give rise to efficient multicore algorithms
for important computational problems.
OUTLINE OF TALK
A tour through theory results in parallel algorithms, with a few stops along the
way for algorithms, theorems and proofs.
• PRAM: Model, algorithms and complexity.
• Communication costs: Bulk-synchronous computing; Routing.
• Cache-efficiency; Multicore algorithms; Multicore scheduling.
PRAM MODEL
ADDING n ELEMENTS

SEQUENTIAL-ADD(A[1..n]; s)
Input. An array A[1..n] of n numbers.
Output. The sum (+) of the elements in array A in s, where + is an associative operation.

s := 0
for 1 ≤ k ≤ n do s := s + A[k] end for

SEQUENTIAL-ADD is an optimal (linear-time) sequential algorithm.
PRAM ALGORITHM TO ADD n ELEMENTS

PAR-ADD(A[1..n]; s)
Input. An array A[1..n] of n numbers.
Output. The sum (+) of the elements in array A in s, + an associative operation.
The algorithm uses an auxiliary array B. For convenience we assume n = 2^k for some integer k ≥ 0.

if n = 1 then s := A[1] else
  pfor 1 ≤ j ≤ n/2 do B[j] := A[2j − 1] + A[2j] end pfor
  PAR-ADD(B[1..n/2]; s)
end if
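To make the pfor concrete, here is a minimal C++ sketch of PAR-ADD (par_add is our function name; an illustration, not the PRAM itself). It assumes n is a power of two, as on the slide, and uses std::async to launch the n/2 pairwise additions of each level; the get() loop acts as the implicit barrier at the end of the pfor.

#include <future>
#include <utility>
#include <vector>

// Sketch: one level of the PAR-ADD tree per loop iteration.
long par_add(std::vector<long> a) {          // assumes a.size() = 2^k, k >= 0
    while (a.size() > 1) {
        std::vector<long> b(a.size() / 2);
        std::vector<std::future<void>> level;
        for (std::size_t j = 0; j < b.size(); ++j)
            level.push_back(std::async(std::launch::async,
                [&a, &b, j] { b[j] = a[2 * j] + a[2 * j + 1]; }));  // B[j] := A[2j-1] + A[2j]
        for (auto& t : level) t.get();       // barrier: end pfor
        a = std::move(b);                    // recurse on B[1..n/2]
    }
    return a[0];                             // s := A[1]
}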
[Figure: the PAR-ADD computation tree.]
Number of parallel steps T∞(n):
T∞(n) = 1 + T∞(n/2), with T∞(1) = 1.
So, T∞(n) = log n.
• Readily achieved using n/2 PRAM processors,
• but this dedicates Θ(n log n) work to the computation.
• T∞(n) = O(log n) is also achieved with n/log n processors, dedicating only O(n) work to the computation (asymptotically optimal work).
BRENT'S SCHEDULING PRINCIPLE

Brent's (or WT) Scheduling Principle
Consider a computation that can be performed in T∞ parallel steps, with w_i operations being performed in step i, 1 ≤ i ≤ T∞.
Let W = ∑_{i=1}^{T∞} w_i (hence, W is the work performed by the computation).
This computation can be executed on p processors in time ⌊W/p⌋ + T∞ (assuming processor allocation does not incur additional overhead).
(In PAR-ADD we had w_i = n/2^i and T∞ = log n.)

Observation 1:
• For PAR-ADD, we have W = ∑_i n/2^i = O(n).
• Set p = n/log n. Then, with p = n/log n processors, PAR-ADD runs in time
⌊W/p⌋ + T∞ = O(n/(n/log n) + log n) = O(log n).

Observation 2: Brent's Principle tells us that:
• If we design a work-optimal algorithm for some T∞, then it can be mapped onto any number of processors p ≤ W/T∞, and maintain optimal work and speed-up.
PROOF OF BRENT'S THEOREM

Proof: In each parallel step, distribute the operations uniformly across the p processors.
So, in the i-th step, each processor performs ≤ ⌈w_i/p⌉ operations.
Hence, the time T taken to perform the computation on the p processors satisfies

T ≤ ∑_{i=1}^{T∞} ⌈w_i/p⌉ ≤ ∑_{i=1}^{T∞} (⌊w_i/p⌋ + 1) = (∑_{i=1}^{T∞} ⌊w_i/p⌋) + T∞ ≤ ⌊W/p⌋ + T∞
OPTIMALITY OF PAR-ADD

PAR-ADD(A[1..n]; s)
if n = 1 then s := A[1] else
  pfor 1 ≤ j ≤ n/2 do B[j] := A[2j − 1] + A[2j] end pfor
  PAR-ADD(B[1..n/2]; s)
end if

• Work W(n) = O(n) is optimal.
• T(n, p) = O(n/p + log n) is optimal when n/p = Ω(log n).
• How about T∞(n) = log n: can we do better?
BOUNDED FAN-IN CIRCUIT MODEL

PAR-ADD(A[1..n]; s)
if n = 1 then s := A[1] else
  pfor 1 ≤ j ≤ n/2 do B[j] := A[2j − 1] + A[2j] end pfor
  PAR-ADD(B[1..n/2]; s)
end if

• PAR-ADD can be computed ‘in hardware’ using + gates.
• The depth of this circuit gives T∞(n).
• T∞(n) = Ω(log n) follows easily from a ‘fan-in’ argument.
• T∞(n) = Ω(log n) also holds for the EREW and CREW PRAM, regardless of the number of processors available, but the proof is more challenging.
CRCW PRAM
• COMMON CRCW PRAM: In a concurrent write to a location, all writes
should write the same value.
• ARBITRARY CRCW PRAM: In a concurrent write, the value written by one
of the participating writes will be written.
• PRIORITY CRCW PRAM: Processors have predetermined priorities, and in
a concurrent write, the value written by the highest priority processor is
written.
PRIORITY > ARBITRARY > COMMON
A > B means Model A can achieve the same bounds as Model B (or better).
• Many practical parallel machines support ARBITRARY CRCW.
PAR-ADD ON CRCW PRAM

T∞ depends on the nature of the associative operation +:
• + = OR: Compute in constant time on COMMON CRCW PRAM with n processors. Work-time optimal.
• + = ⊕ (exclusive or): T∞ = Ω(log n / log log n) with any polynomial number of processors, even on a PRIORITY CRCW PRAM.
• + = max: Can compute in constant time on COMMON CRCW PRAM with n^2 processors.
How about computing the max with linear work?
MAXIMUM ON COMMON CRCW PRAM

MAX1(A[1..n], m)
{Auxiliary array M[1..n] is initialized to all zeros.}
1. pfor all (i, j), 1 ≤ i < j ≤ n, do if A[i] < A[j] then M[i] := 1 else M[j] := 1 end pfor
2. pfor 1 ≤ i ≤ n do if M[i] = 0 then m := A[i] end pfor

After Step 1, M[i] = 0 iff A[i] contains the maximum, so the algorithm is correct if all elements are distinct.
• Step 1 can be performed in constant time with C(n, 2) processors.
• Step 2 can be performed in constant time with n processors and has no concurrent writes.
A constant-time algorithm using Θ(n^2) processors.
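A toy shared-memory rendering of MAX1 (a sketch; max1 is our name, not from the slides): every concurrent write in Step 1 stores the same value 1, which is exactly the COMMON CRCW rule, so atomic flags make the races benign. Each (i, j) pair is independent and could run on its own processor; here the pair loop is written sequentially for portability.

#include <atomic>
#include <vector>

long max1(const std::vector<long>& A) {      // assumes distinct elements
    std::size_t n = A.size();
    std::vector<std::atomic<int>> M(n);      // auxiliary array M[1..n]
    for (auto& x : M) x.store(0);            // initialized to all zeros
    for (std::size_t i = 0; i < n; ++i)      // Step 1: every pair i < j;
        for (std::size_t j = i + 1; j < n; ++j)   // each pair is independent
            if (A[i] < A[j]) M[i].store(1); else M[j].store(1);
    for (std::size_t i = 0; i < n; ++i)      // Step 2: the unmarked element wins
        if (M[i].load() == 0) return A[i];
    return A[0];                             // not reached for distinct inputs
}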
PROCESSOR-EFFICIENT CRCW ALGORITHM FOR MAXIMUM

MAX2(A[1..n], p)
{Assume that p ≥ 2n}
0. if n = 1 then max := A[1] else
1. Let q = ⌊p/n⌋ and set r = max{2, q}.
Divide the array A into g = ⌈n/r⌉ groups of at most r elements each.
Compute the maximum element in each group using algorithm MAX1.
Place the computed maximums in array B[1..g].
2. max := MAX2(B[1..g], p)

• Number of comparisons made in Step 1: g · C(r, 2) ≤ (n/r) · r^2 ≤ n · (p/n) = p.
So, Step 1 takes constant time c (using MAX1) in each iteration.
• The number of parallel steps T(n, p) is O(number of recursive calls).
NUMBER OF PARALLEL STEPS IN MAX2

• The number of groups in the first iteration is n/(p/n) = n^2/p. Hence n_1, the number of elements at the end of the first iteration, is n_1 = n^2/p = p · (n/p)^2.
• The number of processors is the same in all iterations.
• Hence, for a suitable constant c,
T(n, p) = c + T(p · (n/p)^2, p).
• Also, n_i, the number of elements at the end of the i-th iteration, is
n_i = p · (n_{i−1}/p)^2 = p · (n/p)^{2^i}   (established by induction on i).
T(n, p) = c + T(p · (n/p)^2, p)
        = 2c + T(p · (n/p)^{2^2}, p)   (iterating once)
        = · · ·
        = i·c + T(p · (n/p)^{2^i}, p)

The base case occurs when the size of the array becomes 1:
p · (n/p)^{2^i} = 1, so (p/n)^{2^i} = p.
Taking logs, we need: 2^i = log p / log(p/n).
Taking logs once again, we get: i = log log p − log log(p/n).
Since p = 2n, the number of iterations is i = O(log log n).
LOWER BOUND: T∞(n) = Ω(log log n) FOR n CRCW PROCESSORS

• Consider any n-processor CRCW PRAM algorithm A that computes the maximum of n elements.
Claim. A requires Ω(log log n) parallel steps on a (PRIORITY) CRCW PRAM.
• We’ll consider an evolving graph G = (V, E), where
– there is a vertex for every element in the input (so |V| = n), and
– an edge between two vertices if they have been directly compared at this point in the execution of algorithm A.
CRCW LOWER BOUND FOR MAXIMUM
• Initially, G = (V, E) is the empty graph on n vertices.
• At the end of the first iteration, there are p = n edges placed in G,
corresponding to the p comparisons made by algorithm A.
Call this graph G1 = (V1 , E1 ).
• Adversary: Picks a (large) independent set I in G1 , and for each v ∈ I it
ensures that v is given a value larger than the elements to which it is
compared.
This is possible for all v ∈ I since I is an independent set.
Observe: Any element of I can be the maximum element, based on the
outcomes fixed by the adversary!
So, after the first step, we are left with solving the Maximum problem on an input
of size ≥ |I|.
CRCW LOWER BOUND FOR MAXIMUM

What can we say about the size of I, a large independent set in G1?
Turán's Theorem. A graph with n vertices and m edges contains an independent set of size ≥ n^2/(2m + n).
In our case, n = m = p in the first step, so |I| ≥ n/3.
So in the second iteration we need to solve the maximum problem on an input of size n_1 ≥ n/3.
We can apply this argument repeatedly.
Turán's Theorem. A graph with n vertices and m edges contains an independent set of size ≥ n^2/(2m + n).

Apply the same argument repeatedly (recall m = p = n in every parallel step):
• After the i-th parallel step of the algorithm, there is a set of n_{i+1} ≥ n_i^2/(2n + n_i) elements
– that have not been compared with one another in any of the steps, and
– further, all of them are larger than any of the remaining elements.
• To solve the maximum problem, the algorithm needs at least i iterations, where i is the minimum value such that n_{i+1} ≤ 1.
• Since n_{i+1} ≥ n_i^2/(3n) and the initial value is n_1 = n, we obtain: n_i ≥ n/3^{2^{i−1}}.
Conclude: The minimum i such that n_i ≤ 1 is Ω(log log n).
A SAMPLING OF EFFICIENT PRAM ALGORITHMS
• List Ranking. Determine the distance to the end of a linked list for each
element in an n element linked list.
• Sorting.
• Undirected Graph Connectivity.
• (Open) Ear Decomposition versus depth-first search.
• Matrix Multiplication and Transitive Closure.
• Reachability in a Directed Graph and the Transitive Closure Bottleneck.
NC
• NC is the class of computational problems that have a parallel algorithm
which runs in polylog time with a polynomial number of processors.
This definition is independent of the exact PRAM model used.
• Note that NC ⊆ P.
• It is an open question whether P = NC, similar to the P = NP problem.
• Analogous to NP-completeness, we have P-completeness that gives
evidence that a problem is unlikely to be in NC.
P-COMPLETENESS

NC-reducibility
Let L1 and L2 be two languages. We say that L1 is NC-reducible to L2 (denoted by L1 ≤_NC L2) if there is an NC algorithm that converts any given input I1 for L1 into an input I2 for L2 such that I1 ∈ L1 if and only if I2 ∈ L2.

Definition. A language L is P-complete if the following two conditions are satisfied:
L ∈ P, and ∀L′ ∈ P, L′ ≤_NC L.

Instead of working with NC reductions we can work with space-bounded reductions, using the fact that Logspace ⊆ NC.
CIRCUIT VALUE PROBLEM (CVP)

1. The input to CVP is a circuit c with Boolean values supplied at its input nodes. This CVP input is supplied in the following format:
c = <g1, g2, ..., gn>, where for each i, 1 ≤ i ≤ n,
gi is an input node with its supplied value,
or gi = gj ∨ gk, with j, k < i,
or gi = gj ∧ gk, with j, k < i,
or gi = ¬gj, with j < i.
2. The output of CVP is the value at the output gate gn.

Theorem. CVP is P-complete.
SOME OTHER P-COMPLETE PROBLEMS
• Monotone circuit-value problem.
• (Lexicographic-first) depth-first search.
• Maximum flow problem.
(If the capacities are polynomially bounded in the input size, then this
problem is in Randomized NC and is not known to be P-complete.)
BACK TO PRAM MODEL

The PRAM abstracts away some important features of real machines:
• PRAM ignores latency.
• PRAM ignores bandwidth issues.
• PRAM ignores global synchronization cost.
• Equating NC with ‘feasible parallel computations’ does not appear to be very reasonable.
(Less reasonable than equating P with ‘feasible sequential computations’.)
The PRAM can be considered a lowest common denominator for synchronous parallel computing.
• Hence designing algorithms on the PRAM can be viewed as a first step towards designing parallel algorithms for a ‘real’ parallel machine.
FIGURES OF INTERCONNECTION NETWORKS (FROM MATHWORLD)
[Figures: interconnection network topologies, from MathWorld.]
BULK-SYNCHRONOUS PARALLEL MODEL (BSP)

BSP abstracts the salient features of many fixed interconnection networks with a small number of parameters.
• p processors, each of which has its own local memory.
• Processors communicate by sending point-to-point messages (or packets) through a fixed (but unknown) interconnection network:
– The latency l gives a bound on the time needed for the first packet from a source processor to reach its destination. The value of l is a function of the diameter of the network and the routing algorithm.
– The bandwidth (or gap) parameter g specifies the delay needed between sending successive messages from a source processor.
Example. If k packets are sent from processor u to processor v, then the last packet is received at v at time no later than T = l + (k − 1)·g.
ROUTING IN INTERCONNECTION NETWORKS
• Example. If k packets are sent from processor u to processor v then the last
packet is received at v at time no later than
T = l + (k − 1)g
• If u is adjacent to v in the network graph, then l = O(1).
• If only one pair of processors sent packets to each other, then g = O(1).
• In general, each processor in the interconnection network may need data
from some other arbitrary processor, so l = Ω(diameter), and g may not
remain O(1) due to congestion in the network.
• A routing algorithm sends data between processors according to their
specified need through suitable paths in the interconnection network.
• Need to develop provably efficient routing algorithms with small l and g.
RANDOMIZED ROUTING ON HYPERCUBIC NETWORKS

Hypercube on N = 2^n Vertices.
• We have N nodes/processors, 0 ≤ i ≤ N − 1.
• Node i is represented as an n-bit number i = (i_0, i_1, ..., i_{n−1}), where i_0 is the least significant bit. Hence, i = ∑_{j=0}^{n−1} i_j · 2^j.
• Permutation Routing. Each processor i has a packet ν_i which it needs to send to a destination processor d(i), and the d(i) form a permutation of the processor IDs.
• A routing algorithm is a strategy for sending each ν_i from node i to node d(i) along a suitable path in the hypercube determined by it.
• Each node can send one packet along each edge incident on it in each step.
OBLIVIOUS ROUTING ALGORITHMS

• An oblivious routing algorithm is one that chooses a path for each ν_i that depends only on i and d(i) and not on any information pertaining to the other packets.
• An algorithm that routes based on traffic from other nodes is called an adaptive routing algorithm; adaptive algorithms can give better routing bounds since they use more information, but an oblivious algorithm is simpler.

Fact. Any deterministic oblivious routing algorithm for the hypercube will take Ω(√(N/log N)) steps in the worst case.
BIT-FIXING ROUTING ALGORITHM

To send a packet from i = (i_0, i_1, ..., i_{n−1}) to j = (j_0, j_1, ..., j_{n−1}):
• The bit-fixing routing algorithm compares the bit representations of i and j in order from the least significant to the most significant bit.
• If k is the least significant unexamined bit position where i_k and j_k differ, then the packet is currently at node (j_0, ..., j_{k−1}, i_k, ..., i_{n−1}), and it is sent to j′ = (j_0, j_1, ..., j_{k−1}, j_k, i_{k+1}, ..., i_{n−1}) along the hypercube edge that flips bit k.
• In other words, the packet is routed from node i to node j by correcting the bits of i one by one.
• Routes any packet from source to destination in ≤ n = log N steps, if there is no congestion.
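The bit-correcting walk is easy to state in code. Below is a sketch (bit_fixing_path is our hypothetical helper name) that returns the sequence of nodes a packet visits from node i to node j on an N = 2^n node hypercube; each XOR of one bit is one hypercube edge, so the path has at most n = log N hops.

#include <vector>

std::vector<unsigned> bit_fixing_path(unsigned i, unsigned j, unsigned n) {
    std::vector<unsigned> path{i};           // assumes n < 32
    unsigned cur = i;
    for (unsigned k = 0; k < n; ++k)         // least significant bit first
        if (((cur >> k) & 1u) != ((j >> k) & 1u)) {
            cur ^= (1u << k);                // correct bit k: traverse one edge
            path.push_back(cur);
        }
    return path;
}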
TWO-PHASE RANDOMIZED ROUTING

Each node i has a packet that needs to be routed to node d(i).
• In parallel for 0 ≤ i ≤ N − 1: node i picks a random destination σ(i) ∈ {0, 1, 2, ..., N − 1}.
Comment: The {σ(i)} need not form a permutation.
• Phase I: The bit-fixing routing algorithm routes each ν_i from i to σ(i).
• Phase II: The bit-fixing algorithm routes each ν_i from σ(i) to d(i).
After Phase II is executed, every packet ν_i has been routed to its destination d(i).

Theorem. The two-phase randomized routing algorithm routes every packet to its destination in O(log N) time, with probability greater than 1 − O(1/N).
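In code, the two-phase algorithm is one random draw plus two bit-fixing routes. The sketch below (two_phase_path is our name) builds the full path of a single packet on top of the bit_fixing_path helper above; routing all N packets this way, one per node, realizes Phases I and II.

#include <random>
#include <vector>

std::vector<unsigned> two_phase_path(unsigned i, unsigned d, unsigned n,
                                     std::mt19937& rng) {
    std::uniform_int_distribution<unsigned> pick(0, (1u << n) - 1);  // assumes n < 32
    unsigned sigma = pick(rng);                     // random intermediate σ(i)
    auto p1 = bit_fixing_path(i, sigma, n);         // Phase I: i -> σ(i)
    auto p2 = bit_fixing_path(sigma, d, n);         // Phase II: σ(i) -> d(i)
    p1.insert(p1.end(), p2.begin() + 1, p2.end());  // splice, dropping duplicate σ(i)
    return p1;
}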
RANDOMIZED ROUTING AND h-RELATIONS

• In general, a processor may need to send or receive more than one packet in a routing step.
• An h-relation is a routing step in which every processor sends or receives at most h packets. An h-relation can be viewed as a sequence of h permutations, where some destinations may be empty.

Theorem. Any h-relation can be routed on an N-node hypercube in O(log N + h) steps w.h.p. in N using the randomized routing algorithm.
BACK TO BSP

A BSP algorithm proceeds in supersteps.
• In each superstep, each processor can send or receive messages and perform several steps of local computation.
• Computation in the current superstep is based on messages received from other processors in earlier supersteps.
• At the end of a superstep, there is a global synchronization across all processors.
• The cost of a superstep is C = W + g·h + L, where
W = the maximum number of local operations performed by any processor,
h = the maximum number of messages sent or received by any processor in this superstep (so the BSP routes an h-relation in this superstep),
L = the cost to perform global synchronization at the end of the superstep (L ≥ l).
The cost of a BSP algorithm is the sum of the costs of all of its supersteps. This is the parallel time.
BSP COST MODEL FOR THE HYPERCUBE NETWORK

• We defined the cost of a BSP superstep as C = W + g·h + L, where
W = the maximum number of local operations performed by any processor,
h = the maximum number of messages sent or received by any processor in this superstep (so the BSP routes an h-relation in this superstep),
L = the cost to perform global synchronization at the end of the superstep (L ≥ l).
• On the N-node hypercube:
– Global synchronization can be performed in O(log N) time.
– If we use randomized two-phase routing on the hypercube, then g = O(1) and L = O(log N) w.h.p.
DESIGNING EFFICIENT BSP ALGORITHMS

• We can use the BSP cost metric to design efficient algorithms directly for the BSP model.
• Another Approach. Come up with a work-preserving emulation of PRAM on BSP:
– Give a simulation of an arbitrary p′-processor PRAM step on a p-processor BSP so that t · p = O(p′), where t is the cost of the simulating BSP superstep. (The ‘big Oh’ will hide L and g terms.)
t = O(p′/p) is called the slowdown of the emulation.
– We will give such an emulation of EREW PRAM on BSP with p′ = Θ(p log p) (so the slowdown is Θ(log p)).
– For the p-processor hypercube using randomized routing, this emulation is work-preserving, even after incorporating the Θ(log p) cost for L.
EREW PRAM ON BSP: RANDOMIZED WORK-PRESERVING EMULATION

Work-preserving emulation of a p′-processor EREW PRAM step on a p-processor (‘R-R hypercube’) BSP with O(log p) slowdown w.h.p. in p.
• Hash the global memory of the EREW PRAM into the BSP local memories.
• Distribute the p′ PRAM processors evenly across the p BSP processors, so that each BSP processor is assigned at most ⌈p′/p⌉ PRAM processors.
• Each superstep of the BSP corresponds to emulating one step of the EREW PRAM. In each superstep, each BSP processor executes the current step of each PRAM processor assigned to it. So,
– each BSP processor executes w ≤ ⌈p′/p⌉ local steps, so W = O(p′/p);
– each BSP processor requests ≤ ⌈p′/p⌉ memory values from other processors, so we are halfway to obtaining h = O(p′/p).
Recall that the cost of this BSP superstep on the hypercube is (w.h.p. in p):
W + g·h + L = O((p′/p) + h + log p) = O(log p + h), since p′ = Θ(p log p).
EREW PRAM ON BSP

To achieve O(p′/p) slowdown, we need each superstep to route an h-relation for h = O(p′/p).
• A potential problem: There may be too many requests for memory values sent to a single processor.
• Solution. Use a good hash function h : M −→ {0, 1, ..., p − 1} which hashes the entire global memory M of the EREW PRAM into the local memories of the BSP processors.
• Here, for simplicity, assume that every element m ∈ M is equally likely to be hashed into any i ∈ {0, 1, ..., p − 1}, independent of other memory locations.
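A sketch of the hashing step (home_processor and splitmix64 are our names): a global PRAM address m is sent to BSP processor h(m). splitmix64 is a standard 64-bit mixing function; we are assuming, as the analysis idealizes, that a ‘good’ hash behaves close to uniformly and independently across addresses.

#include <cstdint>

// Standard splitmix64 mixer: spreads addresses near-uniformly over 64 bits.
uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// The BSP processor holding PRAM memory location m, for p processors.
uint64_t home_processor(uint64_t m, uint64_t p) {
    return splitmix64(m) % p;
}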
ANALYSIS OF THE EMULATION ALGORITHM

We assume that the EREW PRAM has p′ processors and the BSP has p processors, with p′ ≥ c · p log p for a constant c.
Consider an arbitrary step of the p′-processor EREW PRAM.
• The total number of memory requests generated in this step is at most p′; let these be to PRAM locations m_1, m_2, ..., m_{p′}.
• Consider an arbitrary BSP processor Q.
• For 1 ≤ i ≤ p′, let X_i be the 0-1 random variable given by
X_i = 1 if m_i is hashed onto the local memory of processor Q, and X_i = 0 otherwise.
• Let X = ∑_{i=1}^{p′} X_i. Then, X is the number of memory requests sent to processor Q in this step.
ANALYSIS OF EMULATION ALGORITHM

• X = ∑_{i=1}^{p′} X_i is the number of memory requests sent to processor Q in this step.
• E[X] = E[∑_{i=1}^{p′} X_i] = ∑_{i=1}^{p′} E[X_i] = p′/p.
• Since X is the sum of mutually independent 0-1 random variables, we can apply a Chernoff bound to X:
Pr(X ≥ (1 + β)·E[X]) ≤ 1/e^{β^2·E[X]/3}.
Using β = 1/2 and simplifying, we get, for c = 24:
Pr(X ≥ (3/2)·(p′/p)) ≤ 1/e^{2 log p} = 1/p^2.
EMULATION ALGORITHM ANALYSIS

We saw that the probability that the number of memory requests sent to BSP processor Q in this BSP superstep is greater than (3/2)·(p′/p) is at most 1/p^2.
• Extend the analysis to all p BSP processors using the union bound:
Pr(some BSP processor has ≥ (3/2)·(p′/p) of the m_i) ≤ p · (1/p^2) = 1/p.
EREW PRAM ON BSP: SUMMARY

We have established:
Theorem. Let M′ be a p′-processor EREW PRAM and M a p-processor BSP, and let p′ ≥ c · p · log p for a suitable constant c (c = 24 suffices).
Let the global shared memory of the EREW PRAM be hashed into the p memory modules of the BSP using a ‘good’ hash function.
Let the p′ EREW PRAM processors be distributed evenly in an arbitrary manner across the p BSP processors, so that each BSP processor is given at most ⌈p′/p⌉ of the EREW PRAM processors.
Then, each EREW PRAM step can be executed in a BSP superstep in which the h-relation to be routed satisfies h = O(p′/p) w.h.p. in p, and the local computation at each BSP processor takes O(p′/p) time.
SUMMARY

• Hypercubic networks with randomized routing give rise to L = O(log p) and g = O(1) w.h.p. in p for the BSP model.
• The randomized emulation of EREW PRAM on BSP gives a work-preserving emulation of EREW PRAM algorithms on hypercubic networks, with only a small O(log p) reduction in maximum parallelism.
• So we can make use of the extensive results known for efficient PRAM algorithms to obtain efficient BSP algorithms, and hence efficient algorithms for more realistic parallel machines.
• Caveat: Hashing the memory locations destroys locality of memory accesses, and results in large constant-factor increases in the bounds.
THE PRESENT INTO THE FUTURE: MULTICORE ERA
• p cores, each with private cache of size M
• An arbitrarily large global shared memory
• Data organized in blocks of size B.
CACHE MISSES AND FALSE SHARING

Cache Miss. A cache miss occurs in a computation if the data item being read is not in cache. This results in a delay while the block that contains the data item is read into cache (evicting a block already present in cache; we assume an optimal cache replacement policy).
Cache misses can occur in both sequential and parallel executions.

False Sharing. False sharing occurs if the same block of data is accessed by two or more processors in a parallel environment, and at least one of these processors writes into a location in the block.
[Figure: two cores P1 and P2 writing into the same block.]
• Each of P1 and P2 could incur the cost of B/2 cache misses as the block ping-pongs between their caches in order to serve their write requests.
• If cores write an unbounded number of times to this block, there is no a priori bound on the worst-case delay at a core due to an fs (false-sharing) miss.
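The block ping-pong is easy to reproduce. Below is a classic false-sharing demo (a sketch; the names and iteration count are ours): two threads increment two counters that either share one 64-byte block or are padded onto separate blocks. On typical multicores the padded version runs several times faster, even though the threads never touch the same variable.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

struct SharedBlock {                          // a and b land in one cache block
    std::atomic<long> a{0}, b{0};
};
struct PaddedBlock {                          // a and b forced onto separate blocks
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long run_ms() {
    Counters c;
    auto work = [](std::atomic<long>& x) {
        for (long i = 0; i < 50000000L; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
    t1.join(); t2.join();
    return (long)std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("counters in one block : %ld ms\n", run_ms<SharedBlock>());
    std::printf("counters padded apart : %ld ms\n", run_ms<PaddedBlock>());
}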
DESIGN AND ANALYSIS OF EFFICIENT MULTICORE ALGORITHMS

• Determine a suitable abstract high-level model for multicore algorithms: multithreaded algorithms.
• Develop suitable cost measures: work- and cache-efficiency (including reduced false-sharing costs) with good parallelism.
• Develop algorithms that are independent of machine parameters: resource-oblivious algorithms. Such algorithms will result in portable code that runs efficiently across different multicore environments.
• Design algorithmic techniques that give rise to efficient multicore algorithms for important computational problems.
MULTITHREADED COMPUTATIONS

M-Sum(A[1..n], s)    % Returns s = ∑_{i=1}^{n} A[i]
if n = 1 then return s := A[1] end if
fork(M-Sum(A[1..n/2], s1); M-Sum(A[n/2 + 1..n], s2))
join: return s = s1 + s2

• Sequential execution computes recursively in a dfs traversal of this computation tree.
• Forked tasks can run in parallel.
• Runs on p ≥ 1 cores in O(n/p + log p) parallel steps by forking log p times to generate p parallel tasks.

M-Sum is an example of a Balanced Parallel (BP) computation.
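A C++ sketch of M-Sum's fork-join structure (m_sum is our name): std::async plays the role of fork and future::get the role of join. A real work-stealing runtime does not pay one OS thread per fork; we also coarsen the base case, a standard practical tweak, so the sketch stays cheap to run.

#include <cstddef>
#include <future>
#include <numeric>

long m_sum(const long* a, std::size_t n) {
    if (n <= 1024)                                   // coarsened base case
        return std::accumulate(a, a + n, 0L);        // (slide's base case is n = 1)
    auto s1 = std::async(std::launch::async,         // fork: first half
                         m_sum, a, n / 2);
    long s2 = m_sum(a + n / 2, n - n / 2);           // second half runs here
    return s1.get() + s2;                            // join: s = s1 + s2
}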
[Figure: the M-Sum computation dag.]
Depth-n-MM(X, Y, Z, n)    % Adds the n × n matrix product X · Y into Z
if n = 1 then return Z ← Z + X · Y end if
fork(
  Depth-n-MM(X11, Y11, Z11, n/2);
  Depth-n-MM(X11, Y12, Z12, n/2);
  Depth-n-MM(X21, Y11, Z21, n/2);
  Depth-n-MM(X21, Y12, Z22, n/2) )
join
fork(
  Depth-n-MM(X12, Y21, Z11, n/2);
  Depth-n-MM(X12, Y22, Z12, n/2);
  Depth-n-MM(X22, Y21, Z21, n/2);
  Depth-n-MM(X22, Y22, Z22, n/2) )
join

Depth-n-MM is an example of a Hierarchical Balanced Parallel (HBP) computation.
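A C++ rendering of Depth-n-MM (a sketch; depth_n_mm and the coarsened base case are our choices). It assumes n is a power of two and row-major storage with leading dimension ld. Each fork group launches four quadrant products that write disjoint quadrants of Z, so they may run in parallel; the join between the groups is required because both groups update the same quadrants.

#include <future>

void depth_n_mm(const double* X, const double* Y, double* Z, int n, int ld) {
    if (n <= 64) {                            // coarsened base (slide: n = 1)
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    Z[i * ld + j] += X[i * ld + k] * Y[k * ld + j];
        return;
    }
    int h = n / 2;
    auto q  = [ld, h](const double* A, int i, int j) { return A + i * h * ld + j * h; };
    auto qm = [ld, h](double* A, int i, int j) { return A + i * h * ld + j * h; };
    {   // first fork: Z_ij += X_i1 * Y_1j (four independent quadrant products)
        auto t1 = std::async(std::launch::async, depth_n_mm, q(X,0,0), q(Y,0,0), qm(Z,0,0), h, ld);
        auto t2 = std::async(std::launch::async, depth_n_mm, q(X,0,0), q(Y,0,1), qm(Z,0,1), h, ld);
        auto t3 = std::async(std::launch::async, depth_n_mm, q(X,1,0), q(Y,0,0), qm(Z,1,0), h, ld);
        depth_n_mm(q(X,1,0), q(Y,0,1), qm(Z,1,1), h, ld);
        t1.get(); t2.get(); t3.get();         // join
    }
    {   // second fork: Z_ij += X_i2 * Y_2j
        auto t1 = std::async(std::launch::async, depth_n_mm, q(X,0,1), q(Y,1,0), qm(Z,0,0), h, ld);
        auto t2 = std::async(std::launch::async, depth_n_mm, q(X,0,1), q(Y,1,1), qm(Z,0,1), h, ld);
        auto t3 = std::async(std::launch::async, depth_n_mm, q(X,1,1), q(Y,1,0), qm(Z,1,0), h, ld);
        depth_n_mm(q(X,1,1), q(Y,1,1), qm(Z,1,1), h, ld);
        t1.get(); t2.get(); t3.get();         // join
    }
}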
COMPUTATION DAG FOR DEPTH-N-MM
[Figure: the computation dag of Depth-n-MM.]
ANALYSIS OF DEPTH-N-MM

• Number of operations W(n) = Θ(n^3).
• T∞(n) = 2 · T∞(n/2) with T∞(1) = 1. Hence, T∞(n) = n.
This is fairly good parallelism since the work performed is Θ(n^3), though it is not NC.
• Number of cache misses I(n) (sequential execution):
I(n) = O(n^2/B + n) if 3n^2 ≤ M, and I(n) = 8·I(n/2) + O(1) otherwise.
Hence the I/O complexity is I(n) = O(n^3/(B·√M)).
• Contrast with standard matrix multiplication, where the number of cache misses is Θ(n^3) if both X and Y are stored in the standard row-major order and the matrices are very large.
MULTITHREADED COMPUTATIONS
• Many programming languages support multithreading.
• Current run-time environments have run-time schedulers that schedule
available parallel tasks on idle cores.
• Multithreaded computations can be scheduled by most run-time schedulers
since a thread generates a parallel task in its task queue at each fork in the
computation.
WORK-STEALING

M-Sum(A[1..n], s)    % Returns s = ∑_{i=1}^{n} A[i]
if n = 1 then return s := A[1] end if
fork(M-Sum(A[1..n/2], s1); M-Sum(A[n/2 + 1..n], s2))
join: return s = s1 + s2

• Computation starts in the first core C.
• At each fork, the second forked task is placed on C's task queue T.
• Computation continues at C (in sequential order), with tasks popped from the tail of T as needed.
• The task at the head of T is available to be stolen by other cores that are idle.
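A minimal lock-based task queue matching this description (a sketch; production work-stealing schedulers use carefully engineered non-blocking deques, e.g., Arora-Blumofe-Plaxton): the owning core pushes and pops at the tail, while idle cores steal from the head.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

class TaskQueue {
    std::deque<std::function<void()>> q;
    std::mutex mtx;
public:
    void push(std::function<void()> t) {              // owner, at each fork
        std::lock_guard<std::mutex> g(mtx);
        q.push_back(std::move(t));
    }
    std::optional<std::function<void()>> pop() {      // owner pops from the tail
        std::lock_guard<std::mutex> g(mtx);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.back()); q.pop_back();
        return t;
    }
    std::optional<std::function<void()>> steal() {    // idle core steals the head
        std::lock_guard<std::mutex> g(mtx);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.front()); q.pop_front();
        return t;
    }
};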
WORK-STEALING SCHEDULERS

• Work-stealing is a well-known scheduling method, with various heuristics used for the stealing protocol.
• Randomized work-stealing (RWS):
– When a processor becomes idle and needs to steal a task, it picks a processor uniformly at random and steals the task at the head of its task queue.
– The steal is unsuccessful if the chosen task queue is empty, or if another processor is attempting to steal from the same queue; in this latter case one of the attempting processors succeeds.
– A stealing processor repeatedly attempts to steal until it achieves a successful steal.
ANALYSIS OF RANDOMIZED WORK-STEALING

Question: During the execution of a multithreaded computation under RWS, how many steal attempts occur?

Theorem. For a series-parallel dynamic computation dag D:
• The number of steal attempts (expected, and w.h.p.) is O(p · T∞), assuming unit-cost operations.
• Assume a cost of b for a cache miss, unit cost for local operations, a cost s ≥ b for a successful steal, that an unsuccessful steal takes no more time than a successful steal, and an upper bound of b · Γ on the worst-case cost of any occurrence of false sharing. Then the number of steals is
S = O(p · T∞ · (1 + (b/s) · Γ)).
This bound holds both in expectation and w.h.p. in n.
EXAMPLES OF SERIES-PARALLEL DAGS

• The dag we saw earlier for M-Sum (a fork-join tree) is a series-parallel dag.
• Depth-n-MM: If we expand out the nodes for the recursive computations in its computation dag, we get a series-parallel dag.
THEOREMS

Theorem 1. Let D be a series-parallel computation dag with start node t, and let T∞ be the maximum length of any path descending D. Then, when scheduled using RWS, with high probability in n, the number of successful steals is
S = O(p · [T∞ + (b/s) · T∞ · Γ(D, B)]).    (1)
The time spent on all steal attempts is O(s · S).

Theorem 2. For a block-resilient HBP computation,
S = O(p · [T∞ + (b/s) · l(D, B)]), where l = o(T∞ · B).
PROOF OF THEOREM 1: NOTATION

• Computation dag D with start node t. For each u ∈ D, h(u) is the height of u in dag D.
• At each node u:
– an operation on in-cache data takes 1 time unit,
– the cost of a cache miss is b time units,
– the cost of a false-sharing miss is at most b · Γ time units.
So, the computation at node u takes O(b · Γ) time.
• Steals:
– the cost of a successful steal of a task is s ≥ b time units,
– the cost of an unsuccessful steal is at most s time units.
POTENTIAL FUNCTION ϕ FOR PROOF OF THEOREM 1

Let D be the (dynamic) computation dag with start node t.
• Each node is given a cost of b · Γ.
• If the node is a fork or join node, it is given cost b · Γ + 2s.
• The cost of a path in D is the sum of the costs of the nodes on the path.

For each node u in dag D:
• h(u) is the height of u in D (the length of a longest descending path).
• We define: c(u) = (b·Γ/s) · (h(u) + 1) + 2 · h(u).
So, c(u) is (1/s) times the maximum cost of a path starting at u.
For the root t of dag D, c(t) = (b·Γ/s) · T∞ + 2 · (T∞ − 1).
POTENTIAL FUNCTION ϕ (CONTINUED)

We first define ϕ(u) for each vertex u ∈ D:
• For each task τu on a task queue, ϕ(u) = 2^{1+c(u)}.
• For each task τu currently being executed by a processor, ϕ(u) = 2^{c(u)−(x/s)}, where x is the time for which τu has been executing.
• For all other vertices u, ϕ(u) = 0.
The potential is ϕ = ∑_u ϕ(u).
Note:
• The initial potential is 2^{c(t)};
• the potential at termination is 2^0 = 1.
PHASES

Divide the computation into two types of phases:
• Steal Phase. At least half the potential ϕ is associated with vertices u whose associated tasks τu are on task queues. Lasts until 2p attempted steals complete, whether successful or not.
• Computation Phase. At least half the potential ϕ is associated with vertices u whose associated tasks τu are being executed by processors. Lasts b time units.
ANALYZING A STEAL PHASE

Claim. In a steal phase, with probability at least 1/16, ϕ reduces to at most 15/16 of its starting value.

Proof.
• In any non-empty task queue, the heights of successive tasks decrease by an additive factor of at least 2. Hence, the task at the head of a task queue has at least 1/3 of the total potential of the tasks in that queue.
• Let τu be at the head of a task queue.
Pr(τu is not stolen in this phase) = (1 − 1/p)^{2p} ≤ 1/e^2 ≤ 1/4.
Hence τu is stolen with probability at least 3/4.
• If τu is stolen, the potential ϕ(u) decreases by a factor of 2.
Hence, the expected value of ϕ is reduced to at most
(2/3)·ϕ + (1/4)·(ϕ/3) + (3/4)·(1/2)·(ϕ/3) = (21/24)·ϕ.
The Claim follows using Markov's inequality.
ANALYZING A COMPUTATION PHASE

Claim. In a computation phase, ϕ reduces to at most (1 − b/(8s)) of its starting value.

Proof.
Suppose a processor C is executing the task τu for vertex u. Three types of computation outcomes are possible:
• (a) C completes the execution of τu and continues either executing a task taken from the task queue, or (if the task queue is empty) attempting to steal. In either case the potential reduces by at least ϕ(u).
• (b) C executes τu and performs a fork, creating tasks τv and τw, placing τw on its task queue.
• (c) C executes τu without performing a fork.
ANALYSIS OF COMPUTATION PHASE (CONT.)

• (b) C executes τu and performs a fork, creating tasks τv and τw, placing τw on its task queue.
Since we added an additional cost of 2s to forking nodes, we have ϕ(w) ≤ ϕ(u)/2 and ϕ(v) ≤ ϕ(u)/4. Hence ϕ(u) reduces to at most 3/4 of its start value.
• (c) C executes τu without performing a fork.
Then, ϕ(u) reduces to at most ϕ(u)·2^{−b/s} ≤ ϕ(u)·(1 − b/(2s)).
CONCLUDING ANALYSIS OF COMPUTATION PHASE

Combining Cases (a), (b) and (c), ϕ(u) reduces to at most
(1 − min{1/4, b/(2s)}) · ϕ(u) ≤ (1 − b/(4s)) · ϕ(u).
Hence, the overall potential ϕ reduces to at most
ϕ/2 + (1 − b/(4s)) · ϕ/2 = (1 − b/(8s)) · ϕ.
BOUNDING THE NUMBER OF STEALS

Let x be the number of successful steal phases, y the number of unsuccessful steal phases, and z the number of computation phases.
• y = O(x) w.h.p. by a Chernoff bound, since each steal phase succeeds with constant probability.
• Since a computation phase reduces ϕ by the factor (1 − b/(8s)) and a successful steal phase reduces ϕ by the factor 15/16, we have: x + (b/(8s))·z = O(c(t)).
(Recall that c(t) = O((b/s) · Γ · T∞ + T∞).)
• The number of successful steals across all computation phases is O(p·b·z/s).
• Hence, the total number of successful steals is
S = O(p·x + p·b·z/s) = O(p·(x + (b/s)·z)) = O(p·c(t)) = O(p · T∞ · (1 + (b/s)·Γ)).
• The time spent on steals is
O(s·2p·(x + y) + z·b·p) = O(p·s·(x + (b/s)·z)) = O(s · S).
WRAP-UP

• We would like to keep the number of steals small, since the computation will be as cache-efficient as the sequential computation in between steals.
• The bound for RWS, O(p · T∞ · (1 + (b/s)·Γ)), will be small if
– T∞ is small (i.e., the algorithm is highly parallel), and
– Γ is small. In general Γ can be unbounded, but for certain classes of algorithms (such as block-resilient HBP), Γ = O(B), which is reasonably small.
BLOCK-RESILIENT HBP ALGORITHMS

[Table: for each algorithm, its cache-friendliness function f(r), block-sharing function L(r), parallel depth T∞, and sequential cache-miss bound Q(n, M, B). Rows — KNOWN: Scans (MA, PS); Matrix Transposition (in BI); Strassen's MM (in BI); RM to BI; Direct BI to RM. MODIFIED: Connected Components∗; Depth-n-MM; BI-RM (gap RM); FFT; List Ranking. NEW: BI-RM for FFT∗; Sort (SPMS). Entries recoverable from the original include: Scans with f(r) = L(r) = 1, T∞ = O(log n), Q = O(n/B); Depth-n-MM with f(r) = L(r) = 1, T∞ = O(n), Q = O(n^3/(B·√M)); Strassen's MM with Q = O(n^λ/(B · M^{λ/2−1})); and Q = O((n/B) log_M n) for FFT and SPMS sort.]

f(r) is the ‘cache-friendliness’ function [Cole-R12a], and L(r) is the ‘block-sharing’ function [Cole-R12b].
MA is Matrix Addition and PS is Prefix Sums. RM is Row Major and BI is Bit Interleaved.
Input size is n^2 for matrix computations, and n otherwise. All algorithms, except those marked with ∗, match their standard sequential work bound.
λ = log_2 7 in Strassen's algorithm.
[Table: bounds on the number of steals S under RWS when fs misses can occur, and on cache misses and the cost of fs misses as a function of the number of steals (for any scheduler); O(·) omitted on every term. Rows: Scans, MT; RM to BI; MM, Strassen; Depth-n-MM; I-GEP; BI to RM for MM and FFT; LCS; FFT, sort; List Ranking. Columns: RWS expected number of steals S with fs misses [Cole-R12c]; cache misses with S steals [Cole-R12a]; fs misses [Cole-R12b]. Entries recoverable from the original include: S = p·(log n + (b/s)·B) for Scans, MT and for RM to BI; cache misses Q + S·B and fs-miss cost S·B for several rows; and C_sort = O(Q + S·B + (n log n)/(B·log[(n log n)/S])), with cache misses Q + C_sort · log n for List Ranking.]
Note that for MT, n should be replaced by n^2.
[Table: results for a simple centralized scheduler S_C; O(·) omitted on every term. Rows: Scans (PS, MT); Depth-n-MM; MM, Strassen; BI-RM (gap RM); Direct BI to RM; RM to BI; BI-RM for FFT; FFT, SPMS Sort. Columns: L(r); fs misses with S parallel tasks; the value of S for scheduler S_C; cache misses with S parallel tasks. Entries recoverable from the original include: fs-miss cost B·S for most rows; S = p for Scans, S = p^{3/2} for Depth-n-MM, and S = p log p for MM and Strassen; and cache misses Q + S·B for several rows.]
Note that for MT, n should be replaced by n^2.
SUMMARY

• There is a rich theory of algorithms for parallel computation.
• Parallelism is here to stay, and will soon become all-pervasive in computing.
• Many challenging research problems remain:
– Multicore algorithmic techniques, multicore scheduling, fault tolerance, energy efficiency.
– PRAM: maximum matching, the transitive closure bottleneck.
– Parallel complexity: the Parallel Computation Thesis and communication cost; complexity classes for efficient speed-ups (CC and CC-complete problems).
POINTERS TO LITERATURE

• PRAM: Vast literature!
– The book on parallel algorithms by Joseph JaJa is a good reference.
– The chapter on parallel algorithms by Karp and Ramachandran in the Handbook of Theoretical Computer Science, Volume 1, gives a quick overview.
• BSP and Randomized Routing: Papers by Valiant.
• Multicore:
– Multithreaded: Papers by our group at UT-Austin, Cole-Ramachandran, Blelloch-Gibbons et al., Frigo-Strumpen, and Leiserson's group at MIT.
– Bulk-synchronous multicore: Valiant, Arge et al., Lopez-Ortiz et al.
• RWS Scheduling Bounds: Blumofe-Leiserson, Acar-Blelloch-Blumofe, Cole-Ramachandran.