
Introduction to Parallel Algorithms
Cilk+
Dynamic Multithreading
- Also known as the fork-join model
- Shared memory, multicore
- Cormen et al., 3rd edition, Chapter 27
Nested Parallelism
- Spawn a subroutine, carry on with other work.
- Similar to fork in POSIX.
- The multithreaded model is based on Cilk+, available at
  svn://gcc.gnu.org/svn/gcc/branches/cilkplus
- Programmer specifies possible parallelism
- Runtime system takes care of mapping to OS threads
- Cilk+ contains several more features than our model, e.g.
  parallel vector and array operations.
- Similar primitives are available in java.util.concurrent
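As an illustration of the spawn/sync idea (a Python sketch, not part of the course code; names like `square` are made up): submitting to an executor plays the role of spawn, and asking for the result plays the role of sync.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor() as pool:
    child = pool.submit(square, 6)   # "spawn": start the child, keep going
    mine = square(5)                 # other work in the current strand
    total = child.result() + mine    # "sync": wait for the spawned child

print(total)  # 61
```

As in the Cilk+ model, the runtime (here, the thread pool) decides how the submitted work maps onto OS threads.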
Parallel Loop
- Iterations of a for loop can execute in parallel.
- Like OpenMP
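A parallel loop can be sketched the same way (illustrative Python, with an assumed helper `double`); the iterations must be independent for this to be safe.

```python
from concurrent.futures import ThreadPoolExecutor

def double(i):
    return 2 * i

with ThreadPoolExecutor() as pool:
    # like: parallel for i = 0 to 4
    results = list(pool.map(double, range(5)))

print(results)  # [0, 2, 4, 6, 8]
```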
Writing parallel (pseudo)-code

Keywords
- parallel: run the loop (potentially) concurrently
- spawn: run the procedure (potentially) concurrently
- sync: wait for all spawned children to complete

Serialization
- remove the keywords
- serialized (correct) parallel code is correct serial code
- adding parallel keywords to correct serial code might make it incorrect:
  - missing sync
  - loop iterations not independent

Fibonacci Example

function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function

- Code in Java, Clojure and Racket available from
  http://www.cs.unb.ca/~bremner/teaching/cs3383/examples
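Serializing the pseudocode above (deleting spawn and sync) leaves ordinary, correct serial code. A Python sketch of that serialization (an illustration, not the linked Java/Clojure/Racket examples):

```python
def fib(n):
    if n <= 1:
        return n
    x = fib(n - 1)  # was: x = spawn Fib(n - 1)
    y = fib(n - 2)
    # was: sync -- nothing to wait for in serial code
    return x + y

print([fib(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```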
Computation DAG

Strands: sequences of instructions containing no parallel, spawn, return from
spawn, or sync.

function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function

- nodes: strands
- down edges: spawn
- up edges: return
- horizontal edges: sequential
- critical path: longest path in the DAG
Figure clrs27_2 in text

Work and Speedup
- T1: work, the sequential time.
- Tp: time on p processors.
- Work Law: Tp ≥ T1/p, so speedup := T1/Tp ≤ p

Parallelism
- T∞: span, time given unlimited processors; the weighted length of the
  critical path ≡ a lower bound on the time.
We could idle processors:
    Tp ≥ T∞    (1)
Best possible speedup:
    parallelism = T1/T∞ ≥ T1/Tp = speedup
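Plugging in illustrative numbers (T1 = 17 and T∞ = 8 are the example values from the figure; p = 4 is an assumed machine size), the two lower bounds and the speedup cap work out as:

```python
# Work-law and span lower bounds on Tp; T1 and Tinf are the example's
# values, p = 4 is an assumed processor count.
T1, Tinf, p = 17, 8, 4
tp_lower = max(T1 / p, Tinf)  # Tp >= T1/p (work law) and Tp >= Tinf (span)
speedup_cap = T1 / Tinf       # speedup T1/Tp can never exceed the parallelism
print(tp_lower, speedup_cap)  # 8 2.125
```

Here the span bound (8) dominates the work-law bound (17/4 = 4.25): adding processors beyond the parallelism cannot help.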
Span and Parallelism Example

Assume strands are unit cost.
- T1 = 17
- T∞ = 8
- Parallelism = 17/8 = 2.125 for this input size.
Figure clrs27_2 in text
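The counts T1 = 17 and T∞ = 8 come from counting the figure's strands. Under a simplified cost model (an assumption: one unit per call rather than per strand, so the constants differ from the figure's), work and span can be computed recursively; work of the two branches adds, while the span of the parallel branches takes the max.

```python
def work(n):                     # T1: work of both branches adds
    if n <= 1:
        return 1
    return work(n - 1) + work(n - 2) + 1

def span(n):                     # T_inf: parallel branches take the max
    if n <= 1:
        return 1
    return max(span(n - 1), span(n - 2)) + 1

print(work(4), span(4), work(4) / span(4))  # 9 4 2.25
```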
Composing span and work

For subcomputations A and B composed in series (A + B) or in parallel (A ∥ B):
- series or parallel: T1 = T1(A) + T1(B)
- series: T∞(A + B) = T∞(A) + T∞(B)
- parallel: T∞(A ∥ B) = max(T∞(A), T∞(B))

Work of Parallel Fibonacci

Write T(n) for T1 on input n.
    T(n) = T(n − 1) + T(n − 2) + Θ(1)
Let φ ≈ 1.62 be the solution to φ² = φ + 1. We can show by induction that
T(n) ∈ Θ(φ^n). Assume
    T(n) ≤ aφ^n − b    (IH)
Substituting the inductive hypothesis:
    T(n) ≤ a(φ^(n−1) + φ^(n−2)) − 2b + Θ(1)
         = a((φ + 1)/φ²)φ^n − b + (Θ(1) − b)
         ≤ a((φ + 1)/φ²)φ^n − b        (choose b large enough)
         = aφ^n − b
(Ω() is left as an exercise)

Span and Parallelism of Fibonacci

    T∞(n) = max(T∞(n − 1), T∞(n − 2)) + Θ(1)
          = T∞(n − 1) + Θ(1)
Transforming to a sum, we get T∞ ∈ Θ(n). Hence
    parallelism = T1(n)/T∞(n) = Θ(φ^n/n)
So an inefficient way to compute Fibonacci, but very parallel.
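The two asymptotic claims can be sanity-checked numerically. A sketch that iterates the recurrences with each Θ(1) term replaced by 1 (an assumed cost normalization): the ratio T1(n)/φ^n settles to a constant, while T∞(n) grows linearly.

```python
# Iterate T1(n) = T1(n-1) + T1(n-2) + 1 and Tinf(n) = Tinf(n-1) + 1,
# base cases T1 = Tinf = 1 for n <= 1 (unit-cost assumption).
phi = (1 + 5 ** 0.5) / 2
work, span = [1, 1], [1, 1]
for n in range(2, 31):
    work.append(work[-1] + work[-2] + 1)
    span.append(span[-1] + 1)

print(work[30], span[30])            # work is exponential, span is linear
print(round(work[30] / phi ** 30, 3))  # ratio to phi^n has stabilized
```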