Parallel Loops · Parallel Matrix-Vector product · Divide and Conquer

Parallel Loops

    parallel for i = 1 to n do
        statement...
        statement...
    end for

  - Run n copies of the loop body in parallel, each with a local setting of i.
  - Effectively an n-way spawn.
  - Can be implemented with spawn and sync.
  - Span: T∞(n) = Θ(log n) + max_i T∞(iteration i)

Parallel Matrix-Vector product

To compute y = Ax, we can simply compute in parallel

    y_i = Σ_{j=1}^{n} a_{ij} x_j
    function Mat-Vec(A, x, y)
        Let n = rows(A)
        parallel for i = 1 to n do
            RowMult(A, x, y, i)
        end for
    end function

    function RowMult(A, x, y, i)
        y_i = 0
        for j = 1 to n do
            y_i = y_i + a_{ij} x_j
        end for
    end function

  Work: T1(n) ∈ Θ(n²)   (serialization)

  Span: T∞(n) = Θ(log n) + Θ(n)
                (parallel for)  (RowMult)
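The Mat-Vec/RowMult pair translates nearly line for line into Python; the sketch below (0-based indices, plain lists, illustrative function names) parallelizes over rows with a thread pool standing in for the parallel for:

```python
from concurrent.futures import ThreadPoolExecutor

def row_mult(A, x, y, i):
    # Serial inner loop: y_i = sum_j a_ij * x_j (0-based here).
    y[i] = sum(a_ij * x_j for a_ij, x_j in zip(A[i], x))

def mat_vec(A, x):
    n = len(A)  # n = rows(A)
    y = [0] * n
    with ThreadPoolExecutor() as pool:
        # 'parallel for i = 1 to n': one task per row. Each task writes
        # a distinct y[i], so no synchronization on y is needed.
        list(pool.map(lambda i: row_mult(A, x, y, i), range(n)))
    return y

print(mat_vec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

Because every iteration touches a different y[i], the loop is safely parallel without locks; that independence is what the pseudocode's local setting of i relies on.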
Divide and Conquer Matrix-Vector product
    function MVDC(A, x, y, f, t)
        if f == t then
            RowMult(A, x, y, f)
        else
            m = ⌊(f + t)/2⌋
            spawn MVDC(A, x, y, f, m)
            MVDC(A, x, y, m + 1, t)
            sync
        end if
    end function

  Figure clrs27_4 in text.

  - The spawn tree is a binary tree with Θ(n) leaves (one per row)
    and Θ(n) interior nodes.
  - T1(n) = Θ(n²)
  - T∞(n) = Θ(log n)   (binary tree)

Scheduling

Scheduling Problem

  Abstractly     Mapping threads to processors.
  Pragmatically  Mapping logical threads to a thread pool.

Ideal Scheduler

  On-Line      No advance knowledge of when threads will spawn or complete.
  Distributed  No central controller.

  To simplify analysis, we relax the second condition, but fully
  distributed versions exist.
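A minimal Python rendering of MVDC, using one raw thread per spawn (a real runtime would map spawns onto a work-stealing thread pool instead; indices here are 0-based):

```python
import threading

def mvdc(A, x, y, f, t):
    """Divide-and-conquer matrix-vector product over rows f..t (inclusive)."""
    if f == t:
        # Leaf: RowMult on the single row f.
        y[f] = sum(a_ij * x_j for a_ij, x_j in zip(A[f], x))
    else:
        m = (f + t) // 2
        # 'spawn' the left half in a new thread; recurse on the right
        # half in the current thread.
        left = threading.Thread(target=mvdc, args=(A, x, y, f, m))
        left.start()
        mvdc(A, x, y, m + 1, t)
        left.join()  # 'sync'

A = [[1, 0], [0, 1]]
x = [7, 8]
y = [0, 0]
mvdc(A, x, y, 0, len(A) - 1)
print(y)  # [7, 8]
```

The recursion mirrors the binary spawn tree: Θ(n) leaves (one RowMult per row) and Θ(log n) depth of spawns.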
A greedy centralized scheduler

  - Maintain a ready queue of strands ready to run.

  Scheduling Step
    Complete Step    If ≥ p (# processors) strands are ready, assign p
                     strands to processors.
    Incomplete Step  Otherwise, assign all waiting strands to processors.

  - To simplify analysis, split any non-unit strands into a chain of
    unit strands. Therefore, after one time step, we schedule again.

Optimal and Approximate Scheduling

  Recall

    Tp ≥ T1/p   (work law)
    Tp ≥ T∞     (span law)

  Therefore

    Tp ≥ max(T1/p, T∞) = opt

  With the greedy algorithm we can achieve

    Tp ≤ T1/p + T∞ ≤ 2 max(T1/p, T∞) = 2 × opt

Counting Complete Steps

  - Let k be the number of complete steps.
  - At each complete step we do p units of work.
  - Every unit of work corresponds to one step of the serialization,
    so kp ≤ T1.
  - Therefore k ≤ T1/p.
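The chain of inequalities can be sanity-checked numerically; this small sketch (the function name is illustrative) computes the lower bound opt and the greedy upper bound, and asserts the 2-approximation:

```python
def greedy_bounds(T1, Tinf, p):
    """Return (opt, greedy): opt = max(T1/p, Tinf) is the lower bound
    on any schedule; greedy = T1/p + Tinf bounds the greedy schedule."""
    opt = max(T1 / p, Tinf)
    greedy = T1 / p + Tinf
    # Each of the two summands is <= opt, hence the 2-approximation.
    assert greedy <= 2 * opt
    return opt, greedy

# Example: an n x n matrix-vector product with T1 = n^2, Tinf ~ n + log n.
opt, greedy = greedy_bounds(T1=10**6, Tinf=1010, p=64)
print(opt, greedy)  # 15625.0 16635.0
```

When T1/p dominates T∞ (high parallelism), the greedy bound is close to opt; when T∞ dominates, greedy is still within a factor of 2.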
Counting Incomplete Steps
  - Let G be the computation DAG of remaining strands at the start of
    an incomplete step.
  - The ready queue of strands is exactly the set of sources
    (in-degree 0) in G.
  - An incomplete step runs all sources in G.
  - Every longest path in G starts at a source (otherwise we could
    extend it).
  - After an incomplete step, the length of the longest path shrinks by 1.
  - There can be at most T∞ such steps.
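The complete/incomplete-step argument can be exercised on a small DAG. This simulation (illustrative; it assumes unit strands and represents the DAG as a successor map) counts both kinds of steps:

```python
from collections import defaultdict

def greedy_schedule(succ, p):
    """Simulate greedy scheduling of unit strands on p processors.

    succ maps each strand to the list of strands it enables.
    Returns (total_steps, complete_steps, incomplete_steps).
    """
    indeg = defaultdict(int)
    nodes = set(succ)
    for u, vs in succ.items():
        for v in vs:
            indeg[v] += 1
            nodes.add(v)
    ready = [u for u in nodes if indeg[u] == 0]  # sources of G
    steps = complete = incomplete = 0
    while ready:
        run, ready = ready[:p], ready[p:]
        if len(run) == p:
            complete += 1    # >= p strands were ready; run exactly p
        else:
            incomplete += 1  # run all remaining sources
        steps += 1
        for u in run:
            for v in succ.get(u, ()):
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps, complete, incomplete

# Diamond DAG: a -> {b, c} -> d, run on p = 2 processors.
dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
print(greedy_schedule(dag, p=2))  # (3, 1, 2)
```

For the diamond, T1 = 4 and T∞ = 3, and the 3 simulated steps respect the bound T1/p + T∞ = 5; the 2 incomplete steps respect the bound of at most T∞ incomplete steps.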
Parallel Slackness

    slackness := parallelism / p = T1 / (p × T∞)

    speedup = T1 / Tp ≤ T1 / T∞ = p × slackness

  - If slackness < 1, then speedup < p.
  - If slackness ≥ 1, linear speedup is achievable for the given
    number of processors.

Slackness and Scheduling

  Theorem
    For sufficiently large slackness, the greedy scheduler approaches
    time T1/p.

  Suppose

    T1 / (p × T∞) ≥ c        (1)

  Then

    T∞ ≤ T1 / (cp)

  Recall that with the greedy scheduler,

    Tp ≤ T1/p + T∞

  Substituting (1), we have

    Tp ≤ (T1/p)(1 + 1/c)
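A quick numeric check of the theorem, under assumed values of T1, T∞, p, and c (chosen here for illustration):

```python
def greedy_time_bound(T1, Tinf, p):
    """Greedy scheduler's upper bound T1/p + Tinf on schedule length."""
    return T1 / p + Tinf

T1, Tinf, p = 10**6, 100, 100
slackness = T1 / (p * Tinf)   # = 100
c = 50                        # any c with slackness >= c satisfies (1)

# The theorem's conclusion: Tp <= (T1/p)(1 + 1/c).
bound = (1 + 1 / c) * (T1 / p)
print(greedy_time_bound(T1, Tinf, p), bound)  # 10100.0 10200.0
```

As c grows, the factor (1 + 1/c) tends to 1, so the greedy schedule approaches the ideal time T1/p, which is the content of the theorem.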