Multithreaded Programming in Cilk
LECTURE 1
Charles E. Leiserson
Adapted to current Cilk syntax by Shirley Moore
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cilk
A C language for programming
dynamic multithreaded applications
on shared-memory multiprocessors.
Example applications:
● virus shell assembly
● graphics rendering
● n-body simulation
● heuristic search
● dense and sparse matrix computations
● friction-stir welding simulation
● artificial evolution
© 2006 Charles E. Leiserson — Multithreaded Programming in Cilk, LECTURE 1 — July 13, 2006
Shared-Memory Multiprocessor
[Figure: shared-memory multiprocessor — processors (P) with caches ($) connected by a network to memory and I/O]
In particular, over the next decade,
chip multiprocessors (CMP’s) will be
an increasingly important platform!
Cilk Is Simple
• Cilk extends the C language with just a handful
of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance
guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk’s provably good runtime system automatically manages low-level aspects of parallel
execution, including protocols, load balancing,
and scheduling.
• Cilk supports speculative parallelism.
Minicourse Outline
● LECTURE 1
Basic Cilk programming: Cilk keywords,
performance measures, scheduling.
● LECTURE 2
Analysis of Cilk algorithms: matrix
multiplication, sorting, tableau construction.
● LABORATORY
Programming matrix multiplication in Cilk
— Dr. Bradley C. Kuszmaul
● LECTURE 3
Advanced Cilk programming: inlets, abort,
speculation, data synchronization, & more.
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Fibonacci

C elision:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Cilk code:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}
Cilk is a faithful extension of C. A Cilk program’s
serial elision is always a legal implementation of
Cilk semantics. Cilk provides no new data types.
Basic Cilk Keywords
int fib (int n) {
if (n<2) return (n);
else {
int x,y;
x = cilk_spawn fib(n-1);
y = cilk_spawn fib(n-2);
cilk_sync;
return (x+y);
}
}
cilk_spawn: the named child Cilk procedure can execute in parallel with the parent caller.

cilk_sync: control cannot pass this point until all spawned children have returned.
Dynamic Multithreading

Example: fib(4)

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}

“Processor oblivious”: the computation dag unfolds dynamically.

[Figure: the spawn tree of fib(4) — fib(4) spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) and fib(0)]
Multithreaded Computation
[Figure: a multithreaded computation dag, from initial thread to final thread, with spawn edges, continue edges, and return edges]
Cactus Stack
Cilk supports C’s rule for pointers: A pointer to stack
space can be passed from parent to child, but not from
child to parent. (Cilk also supports malloc.)
[Figure: views of the stack — as the spawn tree A → {B, C}, C → {D, E} unfolds, each procedure sees its own prefix of the stack: A; A B; A C; A C D; A C E]
Cilk’s cactus stack supports several views in parallel.
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Algorithmic Complexity Measures

TP = execution time on P processors

T1 = work
T∞ = span*

*Also called critical-path length or computational depth.
Speedup
Definition: T1/TP = speedup on P processors.
Parallelism

Definition: T1/T∞ = parallelism — the average amount of work per step along the span.
Example: fib(4)

[Figure: the fib(4) dag, with the span highlighted through threads 1–8]

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T1 = 17
Span: T∞ = 8
Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T1 = 17
Span: T∞ = 8
Parallelism: T1/T∞ = 2.125

Using many more than 2 processors makes little sense.
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Parallelizing Vector Addition
C
void vadd (real *A, real *B, int n){
int i; for (i=0; i<n; i++) A[i]+=B[i];
}
Parallelizing Vector Addition

C:

void vadd (real *A, real *B, int n){
  int i; for (i=0; i<n; i++) A[i]+=B[i];
}

C (recursive):

void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}
Parallelization strategy:
1. Convert loops to recursion.
Parallelizing Vector Addition

Cilk:

void vadd (real *A, real *B, int n){
  if (n<=BASE) {
    int i; for (i=0; i<n; i++) A[i]+=B[i];
  } else {
    cilk_spawn vadd (A, B, n/2);
    cilk_spawn vadd (A+n/2, B+n/2, n-n/2);
    cilk_sync;
  }
}
Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.
Side benefit: divide-and-conquer is generally good for caches!
Vector Addition
void vadd (real *A, real *B, int n){
if (n<=BASE) {
int i; for (i=0; i<n; i++) A[i]+=B[i];
} else {
cilk_spawn vadd (A, B, n/2);
cilk_spawn vadd (A+n/2, B+n/2, n-n/2);
cilk_sync;
}
}
Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T1 = Θ(n)
Span: T∞ = Θ(lg n)
Parallelism: T1/T∞ = Θ(n/lg n)
Another Parallelization

C:

void vadd1 (real *A, real *B, int n){
  int i; for (i=0; i<n; i++) A[i]+=B[i];
}
void vadd (real *A, real *B, int n){
  int j; for (j=0; j<n; j+=BASE) {
    vadd1(A+j, B+j, min(BASE, n-j));
  }
}

Cilk:

void vadd1 (real *A, real *B, int n){
  int i; for (i=0; i<n; i++) A[i]+=B[i];
}
void vadd (real *A, real *B, int n){
  int j; for (j=0; j<n; j+=BASE) {
    cilk_spawn vadd1(A+j, B+j, min(BASE, n-j));
  }
  cilk_sync;
}
Analysis

[Figure: a serial chain of spawns, one per BASE-size block of the vectors]

To add two vectors of length n, where BASE = Θ(1):

Work: T1 = Θ(n)
Span: T∞ = Θ(n)
Parallelism: T1/T∞ = Θ(1)
Optimal Choice of BASE

To add two vectors of length n using an optimal choice of BASE to maximize parallelism:

Work: T1 = Θ(n)
Span: T∞ = Θ(BASE + n/BASE)
Choosing BASE = √n gives T∞ = Θ(√n).
Parallelism: T1/T∞ = Θ(√n)
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Scheduling
• Cilk allows the
programmer to express
potential parallelism in
an application.
• The Cilk scheduler
maps Cilk threads onto
processors dynamically
at runtime.
• Since on-line schedulers
are complicated, we’ll
illustrate the ideas with
an off-line scheduler.
[Figure: shared-memory multiprocessor — processors with caches connected by a network to memory and I/O]
Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A thread is ready
if all its predecessors have
executed.
A step is complete if at least P threads are ready: the scheduler runs any P of them. A step is incomplete if fewer than P threads are ready: it runs all of them.

[Figure: the dag scheduled greedily with P = 3]
Greedy-Scheduling Theorem

Theorem [Graham ’68 & Brent ’75].
Any greedy scheduler achieves
TP ≤ T1/P + T∞.

Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■
Optimality of Greedy

Corollary. Any greedy scheduler achieves within a factor of 2 of optimal: any schedule needs TP ≥ max(T1/P, T∞), while greedy achieves TP ≤ T1/P + T∞ ≤ 2·max(T1/P, T∞).
Linear Speedup

Definition. The quantity (T1/T∞)/P is called the parallel slackness.

If the slackness is large (P ≪ T1/T∞), then T∞ ≪ T1/P, and the greedy bound gives TP ≈ T1/P — near-perfect linear speedup.
Cilk Performance

Empirically, Cilk’s work-stealing scheduler achieves TP ≈ T1/P + T∞, so it delivers near-perfect linear speedup whenever P ≪ T1/T∞.
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Cilk Chess Programs
● Socrates placed 3rd in the 1994 International
Computer Chess Championship running on NCSA’s
512-node Connection Machine CM5.
● Socrates 2.0 took 2nd place in the 1995 World
Computer Chess Championship running on Sandia
National Labs’ 1824-node Intel Paragon.
● Cilkchess placed 1st in the 1996 Dutch Open
running on a 12-processor Sun Enterprise 5000. It
placed 2nd in 1997 and 1998 running on Boston
University’s 64-processor SGI Origin 2000.
● Cilkchess tied for 3rd in the 1999 WCCC running
on NASA’s 256-node SGI Origin 2000.
Socrates Normalized Speedup

[Figure: log-log plot of normalized speedup (T1/TP)/(T1/T∞) versus normalized machine size P/(T1/T∞). The measured speedups track the curve TP = T1/P + T∞, which lies between the perfect-speedup bound TP = T1/P and the span bound TP = T∞.]
Developing Socrates
• For the competition, Socrates was to run
on a 512-processor Connection Machine
Model CM5 supercomputer at the University
of Illinois.
• The developers had easy access to a similar
32-processor CM5 at MIT.
• One of the developers proposed a change to
the program that produced a speedup of over
20% on the MIT machine.
• After a back-of-the-envelope calculation, the
proposed “improvement” was rejected!
Socrates Speedup Paradox

Model: TP ≈ T1/P + T∞

Original program: T32 = 65 seconds
  T1 = 2048 seconds, T∞ = 1 second
  T32 = 2048/32 + 1 = 65 seconds
  T512 = 2048/512 + 1 = 5 seconds

Proposed program: T32 = 40 seconds
  T1 = 1024 seconds, T∞ = 8 seconds
  T32 = 1024/32 + 8 = 40 seconds
  T512 = 1024/512 + 8 = 10 seconds
Lesson
Work and span can
predict performance on
large machines better
than running times on
small machines can.
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Cilk’s Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack: a spawn pushes a new thread onto the bottom, and a return pops from the bottom.

When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

[Figure: animation over several slides — four processors spawning, returning, and stealing across their deques]
Performance of Work-Stealing
Theorem: Cilk’s work-stealing scheduler
achieves an expected running time of
TP ≤ T1/P + O(T∞)
on P processors.
Pseudoproof. A processor is either working or
stealing. The total time all processors spend
working is T1. Each steal has a 1/P chance of
reducing the span by 1. Thus, the expected cost
of all steals is O(PT∞). Since there are P
processors, the expected time is
(T1 + O(PT∞))/P = T1/P + O(T∞) . ■
Space Bounds

Theorem. Let S1 be the stack space required by a serial execution of a Cilk program. Then the space required by a P-processor execution is at most SP ≤ P·S1.

Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant procedure frame with no extant descendants has a processor working on it. ■

[Figure: P = 3 processors, each working on a leaf of the spawn tree; each holds at most S1 stack space]
Linguistic Implications
Code like the following executes properly
without any risk of blowing out memory:
for (i=1; i<1000000000; i++) {
cilk_spawn foo(i);
}
cilk_sync;
MORAL
Better to steal parents than children!
LECTURE 1
• Basic Cilk Programming
• Performance Measures
• Parallelizing Vector Addition
• Scheduling Theory
• A Chess Lesson
• Cilk’s Scheduler
• Conclusion
Key Ideas

• Cilk is simple: spawn, sync
• Recursion, recursion, recursion, …
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span