Building Composable Parallel
Software with Liquid Threads
Heidi Pan*, Benjamin Hindman+, Krste Asanovic+
*MIT, +UC Berkeley
Microsoft Numerical Library Incubation Team Visit
UC Berkeley, April 29, 2008
Today’s Parallel Programs are Fragile
[Figure: an Integer Programming branch-and-bound (B&B) app spawning tasks onto the Task Parallel Library (TPL) Runtime, which multiplexes them onto OS kernel threads KT0-KT5 running on cores P0-P5.]
Parallel programs usually need to be aware of hardware resources to achieve good performance:
- Don't incur the overhead of thread creation if there are no resources to run in parallel.
- Run related tasks on the same core to preserve locality.
Today's programs don't have direct control over resources; they just hope that the OS will do the right thing:
- Create one kernel thread per core.
- Manually multiplex work onto kernel threads to control locality & task prioritization.
Even if the OS tries to bind each thread
to a particular core, it’s still not enough!
Today’s Parallel Codes are Not Composable
[Figure: the B&B app spawns tasks onto the TPL Runtime while its Solver calls a Math Lib (MKL) routine whose parallel for runs on the OpenMP Runtime; both runtimes independently map work onto the OS and cores P0-P5.]
The system is oversubscribed!
Today's typical solution: use the sequential version of libraries within a parallel app!
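To make the oversubscription concrete, here is a minimal C sketch of the same pattern using pthreads and OpenMP instead of TPL and MKL (the thread counts and calls are illustrative, not taken from the slides): one app-level thread per core calls into a library that opens its own OpenMP parallel region, so roughly ncores x ncores threads end up competing for the machine.

    /* Hypothetical sketch: nested parallelism oversubscribing the machine.
     * Build with: cc -fopenmp -pthread oversub.c */
    #include <pthread.h>
    #include <omp.h>
    #include <stdio.h>

    static void *app_task(void *arg) {
        /* "Library" call: its OpenMP runtime spawns one thread per core,
         * unaware that the caller already occupies the cores. */
        #pragma omp parallel
        printf("task %ld: omp thread %d of %d\n",
               (long)arg, omp_get_thread_num(), omp_get_num_threads());
        return NULL;
    }

    int main(void) {
        enum { NTASKS = 4 };             /* app creates one thread per core... */
        pthread_t t[NTASKS];
        for (long i = 0; i < NTASKS; i++)
            pthread_create(&t[i], NULL, app_task, (void *)i);
        for (int i = 0; i < NTASKS; i++)
            pthread_join(t[i], NULL);
        return 0;                        /* ...so up to NTASKS * ncores threads ran */
    }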
Global Scheduler is Not the Right Solution
[Figure: the B&B app and its Solver issue parallel constructs (spawn, parallel for, ...) to a Generic Global Scheduler (in user space or the OS).]
- Difficult to design a one-size-fits-all scheduler that efficiently provides enough expressiveness and performance for a wide range of codes. (How do you design a dynamic load-balancing scheduler that preserves locality for both divide-and-conquer and linear algebra algorithms?)
- Difficult to convince all SW vendors and programmers to comply with the same programming model.
- Difficult to optimize critical sections of code without interfering with or changing the global scheduler.
Cooperative Hierarchical Scheduling
[Figure: the B&B app's TPL Scheduler (Parent) manages an OpenMP Scheduler (Child) created by the Solver's math library.]
Goals:
- Distributed Scheduling: customizable, scalable, extensible schedulers that make localized, code-specific scheduling decisions.
- Hierarchical Scheduling: a parent decides the relative priority of its children.
- Cooperative Scheduling: schedulers cooperate with each other to achieve globally optimal performance for the app.
Cooperative Hierarchical Scheduling
- Distributed Scheduling: at any point in time, each scheduler has full control over a subset of the kernel threads allotted to the application, on which it schedules its code.
- Hierarchical Scheduling: a scheduler decides how many of its kernel threads to give to each child scheduler, and when these threads are given.
- Cooperative Scheduling: a scheduler decides when to relinquish its kernel threads, instead of being preempted by its parent scheduler.
[Figure: the TPL scheduler dividing the application's kernel threads among several OpenMP child schedulers.]
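As a concrete illustration of the hierarchical decision, here is a small C sketch (hypothetical names, not any real TPL/OpenMP/Lithe interface) of a parent splitting its current kernel-thread allotment among registered children in proportion to their requests.

    /* Hypothetical sketch: a parent decides how many kernel threads to give
     * each child scheduler.  Illustrative only. */
    typedef struct child_sched {
        int requested;                         /* threads this child asked for  */
        void (*enter)(struct child_sched *);   /* called on each granted thread */
    } child_sched_t;

    /* Split `avail` kernel threads among `n` children in proportion to their
     * outstanding requests (never granting more than a child asked for). */
    void compute_grants(const child_sched_t *children, int n,
                        int avail, int out_grant[]) {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += children[i].requested;
        for (int i = 0; i < n; i++) {
            int share = total ? avail * children[i].requested / total : 0;
            out_grant[i] = share < children[i].requested
                         ? share : children[i].requested;
        }
    }

In a real runtime, each granted kernel thread would then call the corresponding child's enter itself, and a cooperative child would later return those threads by yielding rather than being preempted.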
Standardizing Inter-Scheduler Interface
[Figure: the B&B app's TPL Scheduler (Parent) and the Solver's OpenMP Scheduler (Child) communicating through a standard interface.]
A standardized inter-scheduler resource management interface is needed to achieve cooperative hierarchical scheduling.
We need to extend the sequential ABI to support the transfer of resources!
Updating the ABI for the Parallel World
Functional ABI (identical to a sequential call): the call transfers the thread to the callee, which has full control of the register & stack resources to schedule its instructions, and cooperatively relinquishes the thread upon return.
Resource Mgmt ABI: the callee registers with its caller to ask for more resources; the caller enters the callee on the additional threads that it decides to grant; the callee cooperatively yields those threads back before unregistering and returning.
[Figure: timelines of solve(A) running under the TPL Scheduler with an OpenMP child across threads T0-T5 on cores P0-P5; the sequential call is simply call ... ret, while the parallel call is call, reg, enter, ..., yield, unreg, ret.]
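A minimal C sketch of what the resource-management half of such an ABI might look like from the callee's side follows; the names and signatures are illustrative only and are not the actual Lithe ABI.

    /* Hypothetical inter-scheduler resource-management interface (sketch). */
    typedef struct scheduler scheduler_t;

    /* reg/unreg: a parallel callee attaches itself as a child of the
     * caller's scheduler, and detaches again before returning. */
    void sched_register(scheduler_t *child);
    void sched_unregister(void);

    /* request: the registered child asks its parent for up to n additional
     * kernel threads; the parent decides how many (if any) to grant. */
    void sched_request(int nthreads);

    /* enter: invoked on each extra kernel thread the parent grants, putting
     * that thread under the child scheduler's control. */
    void sched_enter(scheduler_t *child);

    /* yield: the child cooperatively hands the current kernel thread back
     * to its parent instead of being preempted. */
    void sched_yield_thread(void);

    /* A parallel solve() would then follow the timeline on this slide:
     * call -> sched_register -> sched_request -> (parent grants extra
     * threads via sched_enter) -> compute -> sched_yield_thread on the
     * granted threads -> sched_unregister -> ret. */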
The Case for a Resource Mgmt ABI
By making resources a first-class citizen, we enable:
- Composability: code can call any library function without worrying about inadvertently oversubscribing the system's resources.
- Scalability: code can be written without knowing the context in which it will be called, encouraging abstraction, reuse, and independence.
- Heterogeneity: an application can incorporate parallel libraries that are implemented in different languages and/or linked with different runtimes.
- Transparency: a library function looks the same to its caller, regardless of whether its implementation is sequential or parallel.
TPL Example: Managing Child Schedulers
[Figure: per-thread work queues for TPL threads T0-T2; T0 calls solve(A), whose OpenMP scheduler registers as a child; idle threads steal either spawned subtrees or the child's enter task.]
1) Push continuations at spawn points onto the work queue.
2) Upon child registration, push the child's enter task to recruit more threads.
3) The child keeps track of its own parallelism (it is not pushed onto the parent's queue).
A thread that steals a spawned subtree computes it; a thread that steals an enter task effectively grants itself to the child. See the sketch below.
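Here is a small, self-contained C sketch of step 2: when a child scheduler registers, the parent pushes enter tasks onto its work queue, so any idle worker that steals one hands its thread to the child. The queue here is a trivial stand-in, and all names are illustrative rather than TPL's or Lithe's actual interfaces.

    typedef struct child_sched {
        void (*enter)(struct child_sched *self);  /* run child on this thread */
    } child_sched_t;

    typedef struct {
        void (*run)(void *arg);
        void *arg;
    } task_t;

    /* Trivial stand-in for a per-thread work-stealing deque. */
    static task_t queue[256];
    static int queue_top;
    static void queue_push(task_t t) { queue[queue_top++] = t; }

    /* Wrapper task: whoever steals it effectively grants its thread. */
    static void enter_child(void *arg) {
        child_sched_t *child = arg;
        child->enter(child);
    }

    /* Called by the parent when a child scheduler registers and asks for
     * more threads: one enter task is queued per requested thread. */
    void on_child_register(child_sched_t *child, int nthreads) {
        for (int i = 0; i < nthreads; i++)
            queue_push((task_t){ .run = enter_child, .arg = child });
    }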
MVMult Ex: Managing Variable # of Threads
[Figure: matrix-vector multiply as an OpenMP parallel for; the matrix is split into blocks fetched from a shared "next task" counter, and the timeline shows call, reg, enter, enter, yield, yield, unreg, ret as threads join and leave.]
Partition the work into tasks, each operating on an optimal cache block size. Instead of statically mapping all tasks onto a fixed number of threads (SPMD), tasks are dynamically fetched by the currently available threads (and load balanced). There is no loss of locality if there is no reuse of data between tasks. Additional synchronization may be needed to impose an ordering on noncommutative floating-point operations.
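The dynamic fetching described above can be sketched with a shared atomic task counter; this is an illustration in C11, not the actual OpenMP runtime code, and the sizes are arbitrary.

    #include <stdatomic.h>

    #define N     1024
    #define BLOCK   64                    /* rows per task (cache block size) */

    static double A[N][N], x[N], y[N];
    static atomic_int next_task;          /* shared "next task" counter */

    /* Body run by however many threads the scheduler currently grants;
     * each thread repeatedly claims the next row block until none remain. */
    void mvmult_worker(void) {
        for (;;) {
            int row0 = atomic_fetch_add(&next_task, 1) * BLOCK;
            if (row0 >= N)
                return;                   /* no tasks left for this thread */
            for (int i = row0; i < row0 + BLOCK; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    sum += A[i][j] * x[j];
                y[i] = sum;
            }
        }
    }

Because each task owns a contiguous block of rows, adding or removing threads mid-computation changes only which thread fetches the remaining blocks, which is what lets the number of threads vary.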
Liquid Threads Model
[Figure: over time (call, enter, enter, yield, yield, ret), cores P0-P3 flow between the caller's and the callee's schedulers.]
Thread resources flow dynamically & flexibly between different modules.
More robust parallel codes that adapt to different/changing environments.
Lithe: Liquid Thread Environment
The Lithe ABI: call and ret (functional), plus enter, yield, and request (cooperative resource management).
Lithe is not a (high-level) programming model; it is a low-level ABI for expert programmers (compiler/tool/standard library developers) to control resources & map parallel codes.
Lithe can be deployed incrementally because it supports sequential library function calls & provides some basic cooperative schedulers.
Lithe also supports management of other resources, such as memory and bandwidth.
Lithe also supports (uncooperative) revocation of resources by the OS.
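As an illustration of what a low-level ABI for expert programmers can mean in practice, a custom scheduler might plug into the system by implementing a small table of callbacks like the C sketch below; the struct and its fields are hypothetical, not Lithe's actual interface.

    /* Hypothetical callback table a custom scheduler implements so that
     * kernel threads can be pushed into it (enter), handed back (yield),
     * and requested by its children (request).  Illustrative only. */
    typedef struct sched_callbacks {
        /* A kernel thread granted to this scheduler arrives here; the
         * scheduler maps it onto whatever work it wants to run next. */
        void (*enter)(struct sched_callbacks *self);

        /* A child hands a kernel thread back; this scheduler may reuse it
         * or yield it further up the hierarchy. */
        void (*yield)(struct sched_callbacks *self,
                      struct sched_callbacks *child);

        /* A registered child asks this scheduler for more kernel threads. */
        void (*request)(struct sched_callbacks *self,
                        struct sched_callbacks *child, int nthreads);

        void *state;   /* scheduler-specific data: queues, counters, ... */
    } sched_callbacks_t;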
Lithe’s Interaction with the OS
[Figure: today the OS time-multiplexes Apps 1-3 onto cores P0-P3; a manycore OS could instead space-multiplex (spatially partition) the cores among the apps.]
Up till now, we’ve implicitly assumed that we’re the only app running,
but the OS is usually time-multiplexing multiple apps onto the machine.
We believe that a manycore OS should partition the machine spatially &
give each app direct control over resources (cores instead of kthreads).
The OS may want to dynamically change the resource allocation between
the apps depending on the current workload.
Lithe-compliant schedulers are robust and can easily absorb additional
threads given by the OS & yield threads voluntarily to the OS.
Lithe-compliant schedulers can also dynamically check for contexts from threads preempted by the OS and schedule them on the remaining threads.
Lithe-compliant schedulers don’t use spinlocks (deadlock avoidance).
Status: In Early Stage of Development
[Figure: the Slither prompt adding/killing threads in a simulated partition running Fibonacci on Vthread (a work-stealing scheduler).]
Slither simulates a variable-sized partition: we simulate hard threads using pthreads and partitions using processes.
The user can dynamically add/kill threads in the Vthread partition through the Slither prompt, and Vthread will adapt.
Summary
Lithe defines a new parallel ABI that:
supports cooperative hierarchical scheduling.
enables a liquid threads model in which thread resources
flow dynamically & flexibly between different modules.
provides the foundation to build composable & robust
parallel software.
This work is funded in part by [sponsor logos].