The Landscape of Parallelism –
Parallel Programming Models of Today and Tomorrow
Meeting C++ 2015
Michael Wong (IBM)
[email protected]; http://wongmichael.com
http://isocpp.org/wiki/faq/wg21:michael-wong
IBM and Canadian C++ Standards Committee HoD
OpenMP CEO
Chair of WG21 SG5 Transactional Memory, SG14 Games/Low Latency
Director, Vice President of ISOCPP.org
Vice Chair, Standards Council of Canada Programming Languages
Acknowledgement and Disclaimer
Numerous people internal and external to the original OpenMP group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk.
I even lifted this acknowledgement and disclaimer from some of them.
But I claim all credit for errors and stupid mistakes. These are mine, all mine!
Legal Disclaimer
This work represents the view of the author and does not necessarily represent the view of IBM.
IBM, PowerPC and the IBM logo are trademarks or registered trademarks of IBM or its subsidiaries in the United States and other countries.
Other company, product, and service names may be trademarks or service marks of others.
Welcome to Parallel Programming Models of Today and Tomorrow
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
Why is GPU important now?
• Or is it a flash in the pan?
• The race to exascale computing: 10^18 FLOPS
[Chart: Top500 contenders; vertical scale in GFLOPS]
Internet of Things
• All forms of accelerators: DSP, GPU, APU, GPGPU
• Networked heterogeneous consumer devices
  – Kitchen appliances, drones, signal processors, medical imaging, auto, telecom, automation – not just graphics engines
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
What now?
• The new C++11 Std is
  – 1353 pages, compared to 817 pages in C++03
• The new C++14 Std is
  – 1373 pages (N3937), vs. the free N3972
• The new C11 is
  – 701 pages, compared to 550 pages in C99
• OpenMP 3.1 is
  – 160 pages and growing
• OpenMP 4.0 is
  – 320 pages
• OpenMP 4.5 is
  – 359 pages
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
Standardize directive-based multi-language high-level parallelism that is performant, productive and portable.

Heterogeneous device model
• OpenMP 4.0 supports accelerators/coprocessors
• Device model:
  – one host
  – multiple accelerators/coprocessors of the same kind
Data mapping: shared or distributed memory

Shared memory
[Diagram: processors X and Y share one memory; each holds a cached copy of A]
• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may alter the value of the original variable.

Distributed memory
[Diagram: processor X (memory X) and accelerator Y (memory Y) each hold their own copy of A]
OpenMP 4.0 Device Constructs
• Execute code on a target device
  – omp target [clause[[,] clause],…]
      structured-block
  – omp declare target
      [function-definitions-or-declarations]
• Map variables to a target device
  – map ([map-type:] list)  // map clause
      map-type := alloc | tofrom | to | from
  – omp target data [clause[[,] clause],…]
      structured-block
  – omp target update [clause[[,] clause],…]
  – omp declare target
      [variable-definitions-or-declarations]
• Workshare for acceleration
  – omp teams [clause[[,] clause],…]
      structured-block
  – omp distribute [clause[[,] clause],…]
      for-loops
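A minimal sketch of how these constructs compose (my own illustration, not from the original slides; the function and array names are hypothetical, and the array sections assume C pointers with a known length n):

// Sketch: keep a and b resident on the device across two kernels,
// refreshing the host copy of b in between with target update.
void process(float* a, float* b, int n)
{
    #pragma omp target data map(to: a[0:n]) map(tofrom: b[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            b[i] += a[i];

        #pragma omp target update from(b[0:n]) // copy b back mid-region

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            b[i] *= 2.0f;
    }
}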
SAXPY: Serial (host)
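The listing on this slide was an image and did not survive extraction; a typical serial SAXPY kernel (a reconstruction, not the slide's exact code) looks like this:

// Serial SAXPY on the host: z = a*x + y
void saxpy(int n, float a, const float* x, const float* y, float* z)
{
    for (int i = 0; i < n; ++i)
        z[i] = a * x[i] + y[i];
}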
SAXPY: Coprocessor/Accelerator
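Again the original listing is not preserved; a typical OpenMP 4.0 offload of the same kernel, using the target, teams and distribute constructs from the previous slide (a reconstruction):

// SAXPY offloaded with OpenMP 4.0 device constructs
void saxpy(int n, float a, const float* x, const float* y, float* z)
{
    #pragma omp target map(to: x[0:n], y[0:n]) map(from: z[0:n])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; ++i)
        z[i] = a * x[i] + y[i];
}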
Future plans
• Clang 3.8 (~Feb 2016): trunk switches to the clang OpenMP library; upstream OpenMP 4.0 with focus on accelerator delivery; start code drops for OpenMP 4.5
• Clang 3.9 (~Aug 2016): complete OpenMP 4.0 and continue to add OpenMP 4.5 functionality
• Clang 4.0 (~Feb 2017): clang/llvm becomes reference compiler; follow OpenMP ratification with collaborated contribution?
The future
[Timeline, 2013–2017:
– OpenMP 4.0 ratified/released 11/12/2013
– C++14 ratified mid-2014; implemented in Clang 3.5 (released 8/31/2014); C++14 released 12/31/2014
– Clang 3.6 released 2/28/2015; Clang 3.7 released 8/31/2015
– OpenMP 4.5 ratified/released 11/12/2015
– Clang 3.8 released 2/29/2016; Clang 3.9 release ~8/31/2016
– Clang 4.0 release ~2/28/2017 – becomes OpenMP reference compiler and tracks OpenMP closely? C++17 implemented in Clang 4.0?
– C++17 ratified/released 2017? OpenMP 5.0 release?]
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
A tale of two cities
Will the two galaxies ever join?
C++ support for Accelerators
•Memory allocation
•Templates
•Exceptions
•Polymorphism
•Current Technical Specifications
–Concepts, Parallelism, Concurrency, TM
Programming GPU/Accelerators
• OpenGL
• DirectX
• CUDA
• OpenCL
• OpenMP
• OpenACC
• C++ AMP
• HPX
• HSA
• SYCL
• Vulkan
• A preview of the C++ WG21 Accelerator model: SG1/SG14 TS2 (SC15 LLVM HPC talk)
• C++11, 14, 17
C++1Y (1Y = 17 or 22) Concurrency Plan

Parallelism
• Parallel STL Algorithms: data-based parallelism (Vector, SIMD, ...)
• Task-based parallelism (Cilk, OpenMP, fork-join)
• MapReduce
• Pipelines

Concurrency
• Future extensions (then, wait_any, wait_all)
• Executors
• Resumable functions, await (with futures)
• Counters
• Queues
• Concurrent vector
• Unordered associative containers
• Latches and barriers
• upgrade_lock
• Atomic smart pointers
Status after Oct Kona C++ Meeting

Filesystem TS (standard filesystem interface) – Published!
Library Fundamentals TS I (optional, any, string_view and more) – Published!
Library Fundamentals TS II (source code information capture and various utilities) – Voted out for balloting by national standards bodies
Concepts (“Lite”) TS (constrained templates) – Published!
Parallelism TS I (parallel versions of STL algorithms) – Published!
Parallelism TS II (TBD; exploring task blocks, progress guarantees, SIMD) – Under active development
Transactional Memory TS – Published!
Concurrency TS I (improvements to future, latches and barriers, atomic smart pointers) – Voted out for publication!
Concurrency TS II (TBD; exploring executors, synchronic types, atomic views, concurrent data structures) – Under active development
Networking TS (sockets library based on Boost.ASIO) – Design review completed; wording review of the spec in progress
Ranges TS (range-based algorithms and views) – Design review completed; wording review of the spec in progress
Numerics TS (various numerical facilities) – Beginning to take shape
Array Extensions TS (stack arrays whose size is not known at compile time) – Direction given at last meeting; waiting for proposals
Reflection (code introspection and (later) reification mechanisms) – Still in the design stage, no ETA
Graphics (2D drawing API) – Waiting on proposal author to produce updated standard wording
Modules (a component system to supersede the textual header file inclusion model) – Microsoft and Clang continuing to iterate on their implementations and converge on a design; the feature will target a TS, not C++17
Coroutines (resumable functions) – At least two competing designs; one of them may make C++17
Contracts (preconditions, postconditions, etc.) – In early design stage
The parallel and concurrency planets of C++
[Diagram: the TS “planets” orbiting C++ – SG5 Transactional Memory TS, SG1 Parallelism/Concurrency TSes, SG14 Low Latency, and others]
SG5: Transactional Language Constructs for C++ TS

2015: SG5 TM Language in a nutshell
1 construct for transactions:
  compound statements
2 keywords for different types of transaction:
  atomic_noexcept | atomic_commit | atomic_cancel { <compound-statement> }
  synchronized { <compound-statement> }
1 function/function-pointer keyword:
  transaction_safe
  transaction_safe_dynamic
  – must be a keyword because it conveys necessary semantics on the type
1 function/function-pointer attribute:
  [[transaction_unsafe]]
  – provides static checking and performance hints, so it can be an attribute
  [[optimized_for_synchronized]]
  – provides a speculative version for synchronized blocks for the common case when no unsafe functions are called
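A short sketch of how these annotations read in source (TM TS syntax as described above; illustrative only, since no shipping compiler implements it yet, and the function names are hypothetical):

int counter = 0;

// transaction_safe: may be called inside atomic transactions
int bump() transaction_safe
{
    return ++counter;
}

// [[transaction_unsafe]]: statically rejected inside atomic blocks,
// though still callable from synchronized blocks
[[transaction_unsafe]] void log_line(const char* msg);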
Transaction statement
• 2 forms: atomic and synchronized
  atomic_noexcept { x++; }
  atomic_commit   { x++; }
  atomic_cancel   { x++; }
  synchronized    { x++; }
Atomic & synchronized transactions

atomic_cancel {
  x++;
  if (cond)
    throw 1;
}
Atomic transactions appear to execute atomically, can be cancelled, and unsafe statements are prohibited inside them.

synchronized {
  x++;
  print(x);
}
Synchronized blocks cannot be cancelled, but there are no other restrictions on their content.

All transactions appear to execute in serial order. Racy programs have undefined behavior.
I/O Without Transactions
void foo()
{
  cout << "Hello Concurrent Programming World!" << endl;
}
// Thread 1
foo();
// Thread 2
foo();

Possible outputs – the two calls can interleave arbitrarily:
Hello Concurrent Programming World!
Hello Concurrent Programming World!
HelloConcurrent Hello Concurrent Programming World! Programming World!
...Hello Concurrent Programming Hell World!...
(and other fun [and appropriate] variations)
I/O With Transactions
void foo()
{
  cout << "Hello Concurrent Programming World!" << endl;
}
// Thread 1
atomic_noexcept
{
  foo();
}
// Thread 2
atomic_noexcept
{
  foo();
}

Hello Hello ... Hello
Three Hello's? There are only two calls?
(An atomic transaction may be speculatively executed and re-run, replaying the I/O – one reason unsafe operations like I/O are prohibited inside atomic blocks.)
I/O and Irrevocable Actions: Take Two
void foo()
{
  cout << "Hello Concurrent Programming World!" << endl;
}
// Thread 1
synchronized
{
  foo();
}
// Thread 2
synchronized
{
  foo();
}

Hello Concurrent Programming World!
Hello Concurrent Programming World!
(only possible answer)
C++1Y (1Y = 17 or 22) Concurrency Plan

Parallelism
• Parallel Algorithms: data-based parallelism (Vector, SIMD, ...)
• Task-based parallelism (Cilk, OpenMP, fork-join)
• MapReduce
• Pipelines

Concurrency
• Future extensions (then, wait_any, wait_all)
• Executors
• Resumable functions, await (with futures)
• Counters
• Queues
• Concurrent vector
• Unordered associative containers
• Latches and barriers
• upgrade_lock
• Atomic smart pointers
Concurrency and parallelism: they’re not the same thing!

Concurrency
• Why: express component interactions for effective program structure
• How: interacting threads that can wait on events or each other

Parallelism
• Why: exploit hardware efficiently to scale performance
• How: independent tasks that can run simultaneously

A program can have both.
From Pablo Halpern
Sports analogy
[Photos illustrating concurrency vs. parallelism. Credits: André Zehetbauer (CC) BY-SA 2.0; JJ Harrison (CC) BY-SA 3.0. From Pablo Halpern]
SG1: C++ Std Parallelism TS
• Jared Hoberock, Nvidia (thanks for the slides)
• Jaydeep Marathe, Nvidia
• Michael Garland, Nvidia
• Olivier Giroux, Nvidia
• Vinod Grover, Nvidia
• Artur Laksberg, Microsoft
• Herb Sutter, Microsoft
• Arch Robison, Intel
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4407.html
Project History
● Multi-vendor collaboration: NVIDIA, Microsoft, Intel
● Based on:
  ○ Thrust (NVIDIA) – targets multicore and GPU
  ○ PPL (Microsoft)
  ○ TBB (Intel)
● Pre-existing experience:
  ○ Thrust: github.com/thrust
  ○ Boost: github.com/boostorg/compute
  ○ Bolt: github.com/HSA-Libraries/Bolt
  ○ libstdc++: gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
  ○ C++ AMP: ampalgorithms.codeplex.com
  ○ PPL: msdn.microsoft.com/en-us/library/dd492418.aspx
  ○ TBB: www.threadingbuildingblocks.org
Implementations
● Microsoft: parallelstl.codeplex.com
● HPX: stellar-group.github.io/hpx
● Codeplay: github.com/KhronosGroup/SyclParallelSTL
● HSA: www.hsafoundation.com/hsa-for-mathscience
● Lutz: github.com/t-lutz/ParallelSTL
● NVIDIA: github.com/n3554/n3554
Motivation
● Parallelize existing algorithms library
● Target the broadest range of current and future
architectures
● Concurrency never mandatory
● Vendor extensible
Execution Policies Published 2015
using namespace std::experimental::parallelism;
std::vector<int> vec = ...
// previous standard sequential sort
std::sort(vec.begin(), vec.end());
// explicitly sequential sort
std::sort(std::seq, vec.begin(), vec.end());
// permitting parallel execution
std::sort(std::par, vec.begin(), vec.end());
// permitting vectorization as well
std::sort(std::par_vec, vec.begin(), vec.end());
Execution Policies
// sort with dynamically-selected execution
size_t threshold = …
std::execution_policy exec = std::seq;
if(vec.size() > threshold)
{
exec = std::par;
}
std::sort(exec, vec.begin(), vec.end());
Execution Policies
std::sort(vectorize_in_this_thread, vec.begin(), vec.end());
std::sort(submit_to_my_thread_pool, vec.begin(), vec.end());
std::sort(execute_on_that_gpu, vec.begin(), vec.end());
std::sort(offload_to_my_fpga, vec.begin(), vec.end());
std::sort(send_to_the_cloud, vec.begin(), vec.end());
Implementation-Defined
(Non-Standard)
What is an Execution Policy?
● Promise that a particular kind of reordering will
preserve meaning of program
● Request that the implementation use some sort
of execution strategy
● What sorts of reordering are allowable?
Sequential Execution
std::algo(std::seq, begin, end, Func);
algo invokes user-provided function objects in sequential order in the calling thread.
[Diagram: the calling thread runs each Func() invocation in order]
Parallelizable Execution
std::algo(std::par, begin, end, Func);
algo is permitted to invoke user-provided function objects unsequenced if invoked in different threads, or to invoke them in indeterminate order if executed on one thread.
[Diagram: the calling thread and worker threads T1, T2 each run Func() invocations]
Parallelizable+Vectorizable Execution
std::algo(std::par_vec, begin, end, Func);
std::algo(std::vec, begin, end);
algo is permitted to invoke user-defined function objects in
unordered fashion in unspecified threads, or invoke them
unsequenced if executed on one thread.
It is the caller’s responsibility to ensure correctness, for example that the invocations of function objects do not attempt to synchronize with each other.
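For instance, a lambda that takes a lock is fine under std::par but has undefined behavior under the vectorized policies (a minimal sketch of my own, using the TS names above; the parallel overloads come from the TS's experimental headers):

#include <algorithm> // plus the Parallelism TS header for the parallel overloads
#include <mutex>
#include <vector>

std::mutex m;
int sum = 0;

void accumulate(std::vector<int>& v)
{
    // Allowed under std::par: invocations within one thread are sequenced,
    // so the lock cannot deadlock against another invocation.
    // Under std::par_vec/std::vec the same lambda is undefined behavior,
    // because unsequenced invocations in one thread may interleave the
    // lock() calls and self-deadlock.
    std::for_each(std::par, v.begin(), v.end(), [&](int x) {
        std::lock_guard<std::mutex> lock(m);
        sum += x;
    });
}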
Difference between std::par and std::par_vec
● Function object invocation is unordered when in different threads
● When executed in the same thread:
  ○ std::par => unspecified, sequenced invocations
  ○ std::par_vec => unsequenced invocations
New Algorithms and the rest
● std::for_each
  ○ returns an iterator instead of a functor
  ○ avoids discarding information when implementing higher-level algorithms
● std::for_each_n
  ○ implements std::generate_n, std::copy_n, etc.
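A quick illustration of those two points (my own sketch using the TS spellings; the function name is hypothetical) before the full list of parallelized algorithms below:

#include <vector>
// plus the Parallelism TS header providing the parallel overloads

void double_first_hundred(std::vector<int>& v)
{
    // for_each_n processes exactly 100 elements and, like the new
    // for_each, returns the iterator past the last element visited...
    auto it = std::for_each_n(std::par, v.begin(), 100,
                              [](int& x) { x *= 2; });
    // ...so a follow-on algorithm can pick up at 'it' without
    // recomputing the position.
}

The parallelized algorithms: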
adjacent_difference
adjacent_find
all_of
any_of
copy
copy_if
copy_n
count
count_if
equal
exclusive_scan
fill
fill_n
find
find_end
find_first_of
find_if
find_if_not
for_each
for_each_n
generate
generate_n
includes
inclusive_scan
inner_product
inplace_merge
is_heap
is_heap_until
is_partitioned
is_sorted
is_sorted_until
lexicographical_compare
max_element
merge
min_element
minmax_element
mismatch
move
none_of
nth_element
partial_sort
partial_sort_copy
partition
partition_copy
reduce
remove
remove_copy
remove_copy_if
remove_if
replace
replace_copy
replace_copy_if
replace_if
reverse
reverse_copy
rotate
rotate_copy
search
search_n
set_difference
set_intersection
set_symmetric_difference
set_union
sort
stable_partition
stable_sort
swap_ranges
transform
transform_exclusive_scan
transform_inclusive_scan
transform_reduce
uninitialized_copy
uninitialized_copy_n
uninitialized_fill
uninitialized_fill_n
unique
unique_copy
No Parallelization
binary_search
equal_range
lower_bound
upper_bound
is_permutation
push_heap
pop_heap
make_heap
sort_heap
is_heap
next_permutation
prev_permutation
copy_backward
move_backward
random_shuffle
shuffle
accumulate
partial_sum
iota
is_heap_until
Future Parallelism TS for C++2020
•Control structures focus on bulk work creation
•Execution policies expose non-sequential execution
•Executors control how/where work is created
•Execution agents are the unit of work
•Futures convey dependencies
•Efficient implementation across a variety of
platforms
SG1: C++ Extensions for Concurrency
•Produced by the Concurrency Study Group (SG1)
with input from LEWG, LWG
•A separate document; not part of the ISO C++ Standard
•Goal: Eventual Inclusion into ISO C++ Standard
•Available online: http://wg21.link/n4538
•Work in progress:
https://github.com/cplusplus/concurrency_ts
What’s in it?
•Improvements to std::future
•Latches and Barriers
•Atomic smart pointers
•Artur Laksberg, Microsoft (thanks for the slides)
•Niklas Gustafsson, Microsoft
•Herb Sutter, Microsoft
•Sana Mithani, Microsoft
•http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4399.html
Asynchronous Calls
• Building blocks:
  – std::async: request asynchronous execution of a function
  – Future: a token representing the function’s result
• Unlike raw use of std::thread objects:
  – Allows values or exceptions to be returned, just like “normal” function calls
[Three slides: Asynchronous Computing in C++ by Hartmut Kaiser – content not preserved]
Standard Concurrency Interfaces
• std::async<> and std::future<>: concurrency as with sequential processing
  – one location calls a concurrent task, and dealing with the outcome is as simple as with local sub-functions
• std::thread: low-level approach
  – one location calls a concurrent task and has to provide low-level techniques to handle the outcome
• std::promise<> and std::future<>: simplify processing the outcome
  – one location calls a concurrent task, but dealing with the outcome is simplified
• packaged_task<>: helper to separate task definition from call
  – one location defines a task and provides a handle for the outcome
  – another location decides when to call the task and with which arguments
  – the call need not happen in another thread
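A small self-contained example of that last point – a minimal sketch where one place defines the task and gets the handle, and another decides when and on which thread to call it:

#include <future>
#include <iostream>
#include <thread>

int main()
{
    // define the task and obtain a handle for its outcome
    std::packaged_task<int(int)> task([](int x) { return x * x; });
    std::future<int> result = task.get_future();

    // a different location decides when (and where) to call it;
    // the call could equally happen on this thread
    std::thread t(std::move(task), 7);

    std::cout << result.get() << '\n'; // prints 49
    t.join();
}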
std::future Refresher
• std::future<T> – a proxy for an eventual value of type T
• std::promise<T> – a one-way channel to set the future
[Diagram: promise → shared state<T> → future]
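In plain C++11, the diagram above corresponds to:

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future(); // both ends of the shared state

    std::thread producer([&p] { p.set_value(42); }); // the one-way channel
    std::cout << f.get() << '\n'; // blocks until the value arrives
    producer.join();
}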
Long-Running Work
// Launch a task:
future<int> work = async([] { return CountLinesInFile(…); });
// Collect the results:
cout << work.get();
Composition of Long-Running Work
// Launch a task:
future<int> work1 = async([] { return CountLinesInFile(…); });
// Launch another task:
future<int> work2 = async([] { return CountLinesInFile(…); });
// Collect the results:
cout << work1.get() + work2.get();
Continuation
// Launch a task:
future<int> work = CountLinesInFileAsync(...);
work.then([] (future<int> f) {
// Get the result
cout << f.get();
});
Binding Multiple Continuations
future<T1> t = async([]()
{
return func1();
}).then ([](future<T1> n)
{
return func2(n.get());
}).then ([](future<T2> n)
{
return func3(n.get());
}).then ...
Advanced Composition
f.then(A).then(B);
f.then(A);
f.then(B);
auto f = when_all(a, b);
auto f = when_any(a, b);
Join
vector<future<string>> futures = get_futures();
auto futureResult =
  when_all(begin(futures), end(futures))
    .then([](future<vector<future<string>>> results) {
      for (auto& s : results.get()) // will not block
      {
        cout << s.get(); // will not block
      }
    });
Choice
vector<future<string>> futures = get_futures();
auto futureResult =
  when_any(begin(futures), end(futures))
    .then([](future<vector<future<string>>> results) {
      for (auto& s : results.get()) // will not block
      {
        if (s.is_ready())
          cout << s.get(); // will not block
      }
    });
make_ready_future
future<int> compute(int x) {
  if (x < 0) return make_ready_future<int>(-1);
  if (x == 0) return make_ready_future<int>(0);
  future<int> f1 = async([x] { return do_work(x); });
  return f1;
}
is_ready
future<int> f1 = async([] { return possibly_long_computation(); });
if (!f1.is_ready()) {
  // if not ready, attach a continuation and avoid a blocking wait
  f1.then([](future<int> f2) {
    int v = f2.get();
    process_value(v);
  });
}
// if ready, no need to add a continuation; process the value right away
else {
  int v = f1.get();
  process_value(v);
}
unwrap
future<future<int>> outer_future = async([]{
future<int> inner_future = async([] {
return 1;
});
return inner_future;
});
future<int> inner_future = outer_future.unwrap();
inner_future.then([](future<int> f) {
do_work(f.get());
});
Other things C++ needs: types of parallelism
• Most common types:
  – Data: coming in the SIMD proposal
  – Task: coming in executors and task blocks
  – Embarrassingly parallel: async and threads
  – Dataflow: Concurrency TS (.then)
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
The Birth of Study Group 14
Towards Improving C++ for Games/Financial &
Low Latency
Among the top users of C++!
http://blog.jetbrains.com/clion/2015/07/infographics-cpp-facts-before-clion/
The Breaking Wave: N4456
CppCon 2014
C++ committee panel leads to
impromptu game developer meeting.
Google Group created.
Discussions have outstanding industry
participation.
N4456 authored and published!
Formation of SG14
N4456 presented at Spring 2015
Standards Committee Meeting in
Lenexa.
Very well received!
Formation of Study Group 14:
Game Dev & Low Latency
Chair: Michael Wong (IBM)
Two SG14 meetings planned:
● CppCon 2015 (this Wednesday)
● GDC 2016, hosted by SONY
https://isocpp.org/std/the-committee
Improving the communication/feedback/review cycle
[Diagram: The Industry ↔ SG14 (Game Dev & Low Latency) ↔ Standard C++ Committee]
• Standard C++ Committee members come to CppCon/GDC (hosted by SONY)
• Industry members come to CppCon/GDC
• Meetings are opportunities to present papers or discuss existing proposals
• SG14-approved papers are presented by C++ Committee members at Standards meetings for voting
• Feedback goes back through SG14 to industry for revision
• Rinse and repeat
The industry-name linkage brings in lots of people
• The first industry-named SG, building connections with:
  – Games
  – Financial/Trading
  – Banking
  – Simulation
  – +HPC/Big Data Analysis?
Audience of SG14 goals and scope: not just games!
Shared Common Interest
Better support in C++ for:
• Interactive simulation: simulation and training software
• Real-time graphics: video games
• Low-latency computation: finance/trading
• Constrained resources: embedded systems
• HPC/Big Data analytics
Where We Are
Google Groups
https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/s
g14
GitHub
https://github.com/WG21-SG14/SG14
Created by Guy Davidson
SG14 is interested in following these proposals:
• GPU/Accelerator support
• Executors
  – 3 ways: low-latency, parallel loops, server task dispatch
• Atomic views
• Coroutines
• noexcept library additions
  – Use std::error_code for signaling errors
• Early SIMD in C++ investigation
  – There are existing SIMD papers suggesting e.g. “Vec<T,N>” and “for simd (;;)”
• Array View
• Node-based allocators
• String conversions
• hot set
• vector and matrix
• Exception and RTTI costs
• Ring or circular buffers
• flat_map
• Intrusive containers
• Allocator interface
• Radix sort
• Spatial and geometric algorithms
• Imprecise but faster alternatives for math algorithms
• Cache-friendly hash table
• Contiguous containers
• Stack containers
• Fixed-point numbers
• plf::colony and plf::stack
Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel Programming models of Today from OpenMP
• Parallel Programming model of Today from C++11/14/17
• SG14
• Parallel Programming of Tomorrow that joins C++ and OpenMP
C++ Standard GPU/Accelerators
• Attended by both national labs and commercial/consumer interests
• A glimpse into the future
• No design as yet, but several competing design candidates
• Offers the best chance of a model that works across both domains for C++ (only)
Grand Unification?
C++ Std+ proposals already have
many features for accelerators
•Asynchronous tasks (C++11 futures plus
C++17 then, when*, is_ready,…)
•Parallel Algorithms
•Executors
•Multi-dim arrays, Layouts
Candidates for a C++ Std Accelerator Model
• C++ AMP
  – The restrict keyword is a mistake
  – GPU hardware is removing traditional hurdles
  – Modern GPU instruction sets can handle nearly full C++
  – Memory systems are evolving towards a single heap
Better candidates
• Goal: use standard C++ to express all intra-node parallelism
1. Agency extends the Parallelism TS
2. HCC
3. SYCL extends the Parallelism TS
4. HPX extends the Parallelism and Concurrency TSes
1. Agency based
• https://github.com/jaredhoberock/agency/wiki/Quick-Start-Guide
Three executor proposals
• Network based: ASIO from the Networking TS, based on Boost::asio – Chris Kohlhoff
• Task based: the original Google proposal from 2013 that was pulled from the Concurrency TS – Chris Mysen, Matt Austern, Jeffrey Yasskin (Google)
• Agency based: extends the Parallelism TS – Jared Hoberock, Jeff Garland, Olivier Giroux
Executors and Future concerns
•Policies describe what work is created, not
where/how
•Application developers care about where
•Library developers care about how
–Parallelism TS v1 -> non-standard means
–Closed set of policies
–Work creation should be standardized
Many forms of work creation
•Different agents: threads, tasks, …
•For loops
•Fibers
•Vector loops
•GPUs
Agency Proposal
•Uniform API
•Composition with execution policies
•Advertised agent semantics
•Bulk agent creation
•Standard executor types
•Convenient control structures
Parallelism TS V2 proposal
using namespace std::experimental::parallelism::v2;
std::vector<int> data = ...
// NEW: permitting parallel execution on a user-defined executor
my_executor my_exec = ...
sort(par.on(my_exec), data.begin(), data.end());
// NEW: permitting parallel execution in the current thread only
sort(this_thread::par, data.begin(), data.end());
// NEW: permitting vector execution in the current thread only
sort(this_thread::vec, data.begin(), data.end());
How to write for_each today
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag,
sequential_execution_policy&,
InputIterator first, InputIterator last, Function f)
{
for(; first != last; ++first)
{
f(*first);
}
return first;
}
How to write for_each parallelized
today
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag, parallel_execution_policy&,
InputIterator first, InputIterator last, Function f)
{
auto n = last - first;
#pragma omp parallel for
for(size_t i = 0; i != n; ++i)
{
f(first[i]);
}
return first + n;
}
How to write for_each vectorized
today
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag, parallel_vector_execution_policy&,
InputIterator first, InputIterator last, Function f)
{
auto n = last - first;
#pragma omp parallel for simd
for(size_t i = 0; i != n; ++i)
{
f(first[i]);
}
return first + n;
}
What’s wrong with these?
•Poor separation of concerns
–Algorithm semantics conflated with work creation
–Redundant code
•Combinatorial problem
–Execution policies
–Algorithm implementation complexity
Future for_each
template<class ExecutionPolicy, class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag, ExecutionPolicy& exec,
InputIterator first, InputIterator last, Function f)
{
auto n = last - first;
using executor_type = typename ExecutionPolicy::executor_type;
executor_traits<executor_type>::execute(exec.executor(), [=](auto idx)
{
f(first[idx]);
},
n);
return first + n;
}
Agency Proposal
•Uniform API: executor_traits
•Execution Categories
•Standard Executors
•Execution Policy Interoperation
•this_thread-specific policies
•Convenient control structures
Uniform API: executor_traits
Generic interface for manipulating all types of
executors
template<class Executor>
struct executor_traits
{
using executor_type = Executor;
using execution_category = ...
using index_type = ...
using shape_type = ...
template<class T>
using future = ...
// functions follow
...
};
template<class Executor>
struct executor_traits
{
  ...
  template<class Function>
  static future<result_of_t<Function()>>
  async_execute(executor_type& ex, Function f);

  template<class Function>
  static future<void>
  async_execute(executor_type& ex, Function f, shape_type shape);

  template<class Function>
  static result_of_t<Function()>
  execute(executor_type& ex, Function f);

  template<class Function>
  static void
  execute(executor_type& ex, Function f, shape_type shape);
};
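To make the traits interface concrete, here is a hypothetical minimal executor that executor_traits could forward to. The member names mirror the slide above, but the type itself and its behavior are my own illustration, not part of the proposal:

#include <cstddef>
#include <future>

struct inline_executor
{
    // singleton form: run f in the calling thread, return a ready future
    template<class Function>
    std::future<void> async_execute(Function f)
    {
        f();
        std::promise<void> p;
        p.set_value();
        return p.get_future();
    }

    // bulk form: run f(idx) for idx in [0, shape), sequentially
    template<class Function>
    void execute(Function f, std::size_t shape)
    {
        for (std::size_t idx = 0; idx != shape; ++idx)
            f(idx);
    }
};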
async_execute()
•Request work asynchronously
•Two forms: singleton & bulk
–Either one may implement the other
–Cost of task launch may be relatively expensive
•SIMD
•GPUs
•Clusters
•Returns my_executor::future
Singleton async_execute()
my_executor exec;
// create a single invocation
auto f = executor_traits<my_executor>::async_execute(exec,
  []
  {
    std::cout << "Hello from agent" << std::endl;
  });
f.wait();
Bulk async_execute()
my_executor exec;
std::mutex mut;
// create a group of invocations
auto f = executor_traits<my_executor>::async_execute(exec, [&mut](auto idx)
  {
    mut.lock();
    std::cout << "Hello from agent " << idx << std::endl;
    mut.unlock();
  },
  5);
f.wait();
execute(): Synchronizing Execution
my_executor exec;
std::mutex mut;
executor_traits<my_executor>::execute(exec, [&mut](auto idx)
  {
    mut.lock();
    std::cout << "Hello from agent " << idx << std::endl;
    mut.unlock();
  },
  5);
• Some executors may not be naturally asynchronous; execute() captures these cases
SAXPY Agent using Parallelism TS2
Non-hierarchical
void saxpy(size_t n, float a, const float* x, const float* y, float* z)
{
using namespace agency;
bulk_invoke(par(n), [=](parallel_agent &self)
{
int i = self.index();
z[i] = a * x[i] + y[i];
});
}
// clang -std=c++11 -I. -lstdc++ -pthread hierarchical_saxpy.cpp
#include <agency/execution_policy.hpp>
#include <vector>
#include <cassert>
#include <iostream>
#include <algorithm>

int main() {
  using namespace agency;
  size_t n = 1 << 16;
  std::vector<float> x(n, 1), y(n, 2), z(n);
  float a = 13.;
  size_t inner_size = 256;
  size_t outer_size = (n + (inner_size - 1)) / inner_size;
  auto policy = con(outer_size, seq(inner_size));
  bulk_invoke(policy, [&](concurrent_group<sequential_agent>& self) {
    // compute a flat, global index
    size_t i = self.outer().index() * inner_size + self.inner().index();
    if (i < n)
    { // do the math
      z[i] = a * x[i] + y[i];
    }
  });
  float expected = a * 1. + 2.;
  assert(std::all_of(z.begin(), z.end(), [=](float x) { return expected == x; }));
  std::cout << "OK" << std::endl;
  return 0;
}
2. Heterogeneous Compute Compiler (HCC)
• Boltzmann Initiative
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0069r0.pdf
Goal of C++ Std Accelerators
•All intra-node parallelism can be expressed
with standard C++
–No need for directives or other language/library
extensions
•Portable to a variety of accelerator and CPU
architectures
Identifying parallel regions
• Do we need a special keyword to identify accelerator code?
  – Want to limit the “viral” effect of the keyword rippling through the call hierarchy
• HCC has a mode to compile all code for both CPU and GPU targets
  – No keyword required
  – Slow compile time
  – Deferred compilation architectures could improve compilation performance
• Automatic identification of accelerator code
  – Detect calls that launch accelerator work and traverse the call chain to upgrade “child” functions
  – Works if child function code is visible to the compiler (similar restriction as C++ templates)
  – Still misses some cases (virtual functions, linked functions)
• Keyword not required, but still (currently) useful
  – HCC uses “__attribute__((hc))” (or “[[hc]]”) to identify accelerator code
  – Handles tricky cases missed by automatic detection
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
Heterogeneous Code Generation
• Heterogeneous devices often have different capabilities
  – HCC CPU/GPU targets have different instruction sets and architecture
  – Instruction capability differences are disappearing
    • Many GPUs now support full C++
    • Indirect branches, misaligned data accesses, goto, stacks
  – Still, differences may exist
    • Due to software, resource, or other gaps
• Handling of feature gaps
  – Compilation error
  – Transparent fallback to CPU
  – Both useful
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
Asynchronous Execution
• GPU architecture basics
  – Memory-based queues used to schedule and execute commands
  – Commands include data copies, parallel execution “kernels”, dependencies, configuration
  – Hardware-based dependency resolution
  – Efficiently wait for dependencies, signal completion – all without host intervention
• hc::completion_future
  – Based on C++ std::future
  – Returned by commands
  – HCC extends “then” to schedule device-side commands (no host intervention)
  – HCC implementation identifies accelerator commands via specialization and leverages GPU HW
  – e.g. copy(…).then(for_each(…)).then(copy(…));
  – when_all, when_any (N4501)
    • Combine futures, returning another future, in a single function
    • Can leverage dependency-resolution hardware
    • HCC uses is_ready()
• HCC supports timestamps in the future (start, stop)
  – Handy for gaining insight into the performance of the asynchronous task
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
Parallel STL
• Significant addition to the C++ Standard
  – Algorithm-level abstraction
  – Higher-level expression of programmer intent
  – Enables architecture-specific optimizations
    • Example: use of GPU hardware for efficient execution of hierarchical parallelism
  – Execution policy provides a “parallel” assertion
    • Enables parallel execution on GPU, multi-core CPU, SIMD
  – Executors could provide a mechanism for the programmer to control parallel execution
    • Examples: control “where”, grid/group dimensions, unroll optimizations, and more
• Would like asynchronous versions of the Parallel Algorithms
  – i.e. return a future
  – Perhaps these already exist in Parallelism TS v2? (auto return type, executors, HPX examples)
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
GPU Memory Architecture Evolution
[Diagram from “A C++ Compiler For Heterogeneous Computing” by Ben Sander – content not preserved]
Multidimensional Arrays
• Multi-dimensional arrays are commonly used on GPUs
  – Image processing, HPC, others
  – Opportunity to improve the C++ standard
• Polymorphic views are attractive
  – Separate data allocation from data access pattern
  – Improve code portability to different architectures
  – Possible tiling optimizations for “HBM” cache
• Would like to use some Parallel Algorithms with multi-dimensional data
  – Iterator operator++ iterates across all dimensions and touches all elements
  – Multi-dim iteration is apparently a complex topic…
• Bounds checking
  – Useful, but performance is critical
  – Bounds checking must be free or gated by a user-specified switch
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
Hierarchical Parallelism
• Many systems, including GPUs, are designed hierarchically
  – GPUs provide “groups” of “threads”
  – Communication between threads in the same group (“near”) can be 10X faster than “far”!
  – Modern CPUs also have clusters of locality: threads (SMT), caches, memory controllers
• GPU group hardware features:
  1. Barrier synchronization for threads in the same group
  2. High-speed cache memory shared by threads in the group
  3. Relaxed atomic operations
From “A C++ Compiler For Heterogeneous Computing” by Ben Sander
Agency Execution Agents
• You create groups of execution agents and have them execute tasks
• An execution agent is a stand-in for some underlying physical thing – maybe a thread, a SIMD lane, or a process – which actually executes the task of interest
• Programming algorithms in terms of execution agents instead of something really specific – like threads – makes programs portable across different kinds of hardware platforms
3. SYCL
•https://www.khronos.org/sycl
Overview of SYCL
4. HPX
•http://stellar.cct.lsu.edu/pubs/ParallelizingSTL.pdf
HPX and Stellar
•STE||AR is about shaping a scalable future
with a new approach to parallel computation
•Most notable ongoing project by STE||AR is
HPX: A general purpose C++ runtime system
for parallel and distributed applications of any
scale
HPX
• HPX enables programmers to write fully asynchronous code using hundreds of millions of threads
• First open-source implementation of the ParalleX execution model, addressing:
  – Starvation
  – Latencies
  – Overhead
Types of executors in HPX
•standard executors
–parallel, sequential
•this thread executors
–static queue, static priority queue
•thread pool executors, and thread pool os executors
–local queue, local priority queue
–static queue, static priority queue
•service executors
–io pool, parcel pool, timer pool, main pool
•distribution policy executor
Extending with Execution Policies
• The .with()/.on() syntax extends the parallel algorithms:
auto par_auto = par.with(auto_chunk_size());
// equivalent to par
auto par_static = par.with(static_chunk_size());
auto my_policy = par.on(my_exec).with(my_chunk_size);
auto my_task_policy = my_policy(task);
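A usage sketch of my own (header and namespace spellings approximate for the HPX of this era; static_chunk_size mirrors the slide, the function is hypothetical):

#include <hpx/include/parallel_for_each.hpp> // header name approximate
#include <vector>

void scale(std::vector<float>& v)
{
    using namespace hpx::parallel;

    // hand out fixed chunks of 4096 iterations per task
    auto policy = par.with(static_chunk_size(4096));

    for_each(policy, v.begin(), v.end(), [](float& x) { x *= 2.0f; });
}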
HPX: for_each_n
template<typename ExPolicy, typename F>
static typename detail::algorithm_result<ExPolicy, Iter>::type
parallel(ExPolicy const& policy, Iter first, std::size_t count, F&& f)
{
  if (count != 0)
  {
    return util::foreach_n_partitioner<ExPolicy>::call(policy, first, count,
      [f](Iter part_begin, std::size_t part_size)
      {
        util::loop_n(part_begin, part_size, [&f](Iter const& curr)
        {
          f(*curr);
        });
      });
  }
  return detail::algorithm_result<ExPolicy, Iter>::get(std::move(first));
}
What is still needed from C++
•Async Parallel Algorithms
•Scoped atomics
•Futures – timestamps?
•Lightweight threads and solutions for TLS
The Future Programming Model
• Executors that can dispatch massively parallel work
• Marries futures, concurrency and parallelism
• An accelerator programming model that enables both the HPC and consumer domains for C++ is coming
• Supports C++-specific constructs and integrates closely with the C++ Std
Food for thought and Q/A
• C11/C++14 Standards
  – C++: http://www.open-std.org/jtc1/sc22/wg21/prot/14882fdis/n3937.pdf
  – C++ (post-C++14 free version): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
  – C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
• Participate and give feedback to the compiler teams
  – What features/libraries interest you or your customers?
  – What problem/annoyance would you like the Std to resolve?
  – Is Special Math important to you?
  – Do you expect 0x features to be used quickly by your customers?
• Talk to me at my blog:
  – http://www.ibm.com/software/rational/cafe/blogs/cpp-standard
My blogs and email address
• ISOCPP.org Director, VP: http://isocpp.org/wiki/faq/wg21#michael-wong
• OpenMP CEO: http://openmp.org/wp/about-openmp/
• My blogs: http://ibm.co/pCvPHR
• C++11 status: http://tinyurl.com/43y8xgf
• Boost test results: http://www.ibm.com/support/docview.wss?rs=2239&context=SSJT9L&uid=swg27006911
• C/C++ Compilers Feature Request Page: http://www.ibm.com/developerworks/rfe/?PROD_ID=700
• Chair of WG21 SG5 Transactional Memory: https://groups.google.com/a/isocpp.org/forum/?hl=en&fromgroups#!forum/tm
• Chair of WG21 SG14 Games Dev/Low Latency: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/sg14