The Landscape of Parallelism – Parallel Programming Models of Today and Tomorrow
Meeting C++ 2015
Michael Wong (IBM)
[email protected]; http://wongmichael.com
http://isocpp.org/wiki/faq/wg21#michael-wong
IBM and Canadian C++ Standard Committee HoD
OpenMP CEO
Chair of WG21 SG5 Transactional Memory, SG14 Games/Low Latency
Director, Vice President of ISOCPP.org
Vice Chair, Standards Council of Canada Programming Languages

Acknowledgement and Disclaimer
Numerous people internal and external to the original OpenMP group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback to form part of this talk. I even lifted this acknowledgement and disclaimer from some of them. But I claim all credit for errors and stupid mistakes. These are mine, all mine!

Legal Disclaimer
This work represents the view of the author and does not necessarily represent the view of IBM. IBM, PowerPC and the IBM logo are trademarks or registered trademarks of IBM or its subsidiaries in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.

Welcome to the Parallel Programming Models of today and tomorrow

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

Why is the GPU important now?
• Or is it a flash in the pan?
• The race to exascale computing: 10^18 flops
[Figure: Top500 contenders; vertical scale is in GFLOPS]

Internet of Things
• All forms of accelerators: DSP, GPU, APU, GPGPU
• Networked heterogeneous consumer devices
– kitchen appliances, drones, signal processors, medical imaging, auto, telecom, automation, not just graphics engines

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

What now?
• The new C++11 Std is 1353 pages, compared to 817 pages for C++03
• The new C++14 Std is 1373 pages (N3937), vs. the free N3972
• The new C11 is 701 pages, compared to 550 pages for C99
• OpenMP 3.1 is 160 pages, and growing:
– OpenMP 4.0 is 320 pages
– OpenMP 4.5 is 359 pages

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

OpenMP: standardize directive-based, multi-language, high-level parallelism that is performant, productive and portable

Heterogeneous device model
• OpenMP 4.0 supports accelerators/coprocessors
• Device model:
– one host
– multiple accelerators/coprocessors of the same kind

Data mapping: shared or distributed memory
[Figure: in the shared-memory case, processors X and Y share one memory and each cache a variable A; in the distributed-memory case, processor X and accelerator Y each have their own memory holding a copy of A]
• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may alter the value of the original variable.
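To make those mapping rules concrete, here is a minimal SAXPY-style sketch using the OpenMP 4.0 combined target construct (the device constructs and the SAXPY example follow on the next slides). The array-section syntax and the to/tofrom map types are standard OpenMP 4.0; the function itself is illustrative only. On a shared-memory device the maps may be no-ops; on a distributed-memory device they copy the data to and from the accelerator:

void saxpy(int n, float a, const float* x, float* y)
{
    // x is only read on the device; y is read and written back to the host
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}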
OpenMP 4.0 Device Constructs
• Execute code on a target device
– omp target [clause[[,] clause],…] structured-block
– omp declare target [function-definitions-or-declarations]
• Map variables to a target device
– map([map-type:] list) // map clause
– map-type := alloc | tofrom | to | from
– omp target data [clause[[,] clause],…] structured-block
– omp target update [clause[[,] clause],…]
– omp declare target [variable-definitions-or-declarations]
• Workshare for acceleration
– omp teams [clause[[,] clause],…] structured-block
– omp distribute [clause[[,] clause],…] for-loops

[Slides: SAXPY: Serial (host), two slides; SAXPY: Coprocessor/Accelerator, two slides (code walkthroughs)]

Future plans
• Clang 3.8 (~Feb 2016): trunk switches to the clang OpenMP library; upstream OpenMP 4.0 with a focus on accelerator delivery; start code drops for OpenMP 4.5
• Clang 3.9 (~Aug 2016): complete OpenMP 4.0 and continue to add OpenMP 4.5 functionality
• Clang 4.0 (~Feb 2017): clang/llvm becomes the reference compiler; follow OpenMP ratification with collaborated contribution?

The future
• C++14 implemented in Clang 3.5; Clang 4.0 becomes the OpenMP reference compiler and tracks OpenMP closely?
[Timeline, 2013 to 2017: OpenMP 4.0 ratified/released 11/12/2013; Clang 3.5 released 8/31/2014; C++14 ratified 9/3/2014 and released 12/31/2014; Clang 3.6 released 2/28/2015; Clang 3.7 released 8/31/2015; OpenMP 4.5 ratified/released 11/12/2015; Clang 3.8 release 2/29/2016; Clang 3.9 release 8/31/2016?; Clang 4.0 release 2/28/2017?, with C++17 implemented in Clang 4.0?; C++17 ratification 5/31/2017? and release 12/31/2017?; OpenMP 5.0 ratified/released 11/12/2017?]

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

A tale of two cities: will the two galaxies ever join?

C++ support for Accelerators
• Memory allocation
• Templates
• Exceptions
• Polymorphism
• Current Technical Specifications
– Concepts, Parallelism, Concurrency, TM

Programming GPUs/Accelerators
• OpenGL • DirectX • CUDA • OpenCL • OpenMP • OpenACC • C++ AMP • HPX • HSA • SYCL • Vulkan
• A preview of the C++ WG21 accelerator model: SG1/SG14 TS2 (SC15 LLVM HPC talk)

C++1Y (1Y = 17 or 22) Concurrency Plan
Parallelism:
• Parallel STL algorithms: data-based parallelism (vector, SIMD, ...)
• Task-based parallelism (Cilk, OpenMP, fork-join)
• MapReduce
• Pipelines
Concurrency:
• Future extensions (then, wait_any, wait_all)
• Executors
• Resumable functions, await (with futures)
• Counters, queues, concurrent vector, unordered associative containers
• Latches and barriers, upgrade_lock, atomic smart pointers

Status after the Oct Kona C++ meeting
Project | What's in it? | Status
Filesystem TS | Standard filesystem interface | Published!
Library Fundamentals TS I | optional, any, string_view and more | Published!
Library Fundamentals TS II | source code information capture and various utilities | Voted out for balloting by national standards bodies
Concepts ("Lite") TS | Constrained templates | Published!
Parallelism TS I | Parallel versions of STL algorithms | Published!
Parallelism TS II | TBD; exploring task blocks, progress guarantees, SIMD | Under active development
Transactional Memory TS | Transactional memory | Published!
Concurrency TS I | Improvements to future, latches and barriers, atomic smart pointers | Voted out for publication!
Concurrency TS II | TBD; exploring executors, synchronic types, atomic views, concurrent data structures | Under active development
Networking TS | Sockets library based on Boost.ASIO | Design review completed; wording review of the spec in progress
Ranges TS | Range-based algorithms and views | Design review completed; wording review of the spec in progress
Numerics TS | Various numerical facilities | Beginning to take shape
Array Extensions TS | Stack arrays whose size is not known at compile time | Direction given at last meeting; waiting for proposals
Reflection | Code introspection and (later) reification mechanisms | Still in the design stage, no ETA
Graphics | 2D drawing API | Waiting on proposal author to produce updated standard wording
Modules | A component system to supersede the textual header-file inclusion model | Microsoft and Clang continuing to iterate on their implementations and converge on a design; the feature will target a TS, not C++17
Coroutines | Resumable functions | At least two competing designs; one of them may make C++17
Contracts | Preconditions, postconditions, etc. | In early design stage

[Figure: the parallel and concurrency planets of C++ – SG5 Transactional Memory TS, SG1 Parallelism/Concurrency TSs, SG14 Low Latency]

SG5: Transactional Language Constructs for C++ TS
SG5 TM language in a nutshell (as of 2015):
• 1 construct for transactions: compound statements
• 2 keyword families for different types of transaction:
– atomic_noexcept | atomic_commit | atomic_cancel { <compound-statement> }
– synchronized { <compound-statement> }
• 1 function/function-pointer keyword: transaction_safe, transaction_safe_dynamic
– must be a keyword because it conveys necessary semantics on the type
• 1 function/function-pointer attribute: [[transaction_unsafe]]
– provides static checking and performance hints, so it can be an attribute
• [[optimized_for_synchronized]]
– provides a speculative version of a function for synchronized blocks, for the common case when no unsafe functions are called

Transaction statement
• 2 forms:
– atomic: atomic_noexcept {x++;} | atomic_commit {x++;} | atomic_cancel {x++;}
– synchronized: synchronized {x++;}

Atomic & synchronized transactions
atomic_cancel { x++; if (cond) throw 1; }
• appear to execute atomically
• can be cancelled
• unsafe statements prohibited
synchronized { x++; print(x); }
• cannot be cancelled
• no other restrictions on content
All transactions appear to execute in serial order; racy programs have undefined behavior.

I/O Without Transactions
void foo() { cout << "Hello Concurrent Programming World!" << endl; }
// Thread 1
foo();
// Thread 2
foo();
Possible output:
Hello Concurrent Programming World!
HelloConcurrent Hello Concurrent Concurrent Hello Programming World! Programming Programming World! World!
...Hello Concurrent Programming Hell World!... (and other fun [and appropriate] variations)

I/O With Transactions
void foo() { cout << "Hello Concurrent Programming World!" << endl; }
// Thread 1
atomic_noexcept { foo(); }
// Thread 2
atomic_noexcept { foo(); }
Output:
Hello Hello ... Hello
Three Hello's? There are only two calls? (An atomic transaction may be speculatively executed and rolled back for retry, but irrevocable actions like I/O cannot be undone; this is why unsafe statements are prohibited inside atomic transactions.)

I/O and Irrevocable Actions: Take Two
void foo() { cout << "Hello Concurrent Programming World!" << endl; }
// Thread 1
synchronized { foo(); }
// Thread 2
synchronized { foo(); }
Output:
Hello Concurrent Programming World!
Hello Concurrent Programming World!
(only possible answer)
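Pulling the SG5 constructs together, here is a minimal hedged sketch in TM TS (N4514) style; the account example is hypothetical, and only experimental compiler branches implemented this syntax at the time:

#include <iostream>

int balance = 0;

// transaction_safe: contains no unsafe operations (no I/O, no inline asm),
// so it may be called from inside an atomic transaction.
void deposit(int amount) transaction_safe
{
    balance += amount;
}

void transfer(int amount)
{
    // atomic_cancel: if the exception escapes, the transaction rolls back.
    atomic_cancel {
        deposit(amount);
        if (balance < 0) throw 1;   // cancels the transaction
    }
}

void audit()
{
    // synchronized: runs as if under one global lock; I/O is allowed here.
    synchronized {
        std::cout << "balance = " << balance << std::endl;
    }
}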
C++1Y (1Y = 17 or 22) Concurrency Plan
Parallelism:
• Parallel algorithms: data-based parallelism (vector, SIMD, ...)
• Task-based parallelism (Cilk, OpenMP, fork-join)
• MapReduce
• Pipelines
Concurrency:
• Future extensions (then, wait_any, wait_all)
• Executors
• Resumable functions, await (with futures)
• Counters, queues, concurrent vector, unordered associative containers
• Latches and barriers, upgrade_lock, atomic smart pointers

Concurrency and parallelism: they're not the same thing!
Concurrency
• Why: express component interactions for effective program structure
• How: interacting threads that can wait on events or each other
Parallelism
• Why: exploit hardware efficiently to scale performance
• How: independent tasks that can run simultaneously
A program can have both. (From Pablo Halpern)

Sports analogy
[Photos: concurrency vs. parallelism. Credits: André Zehetbauer (CC) BY-SA 2.0; JJ Harrison (CC) BY-SA 3.0. From Pablo Halpern]

SG1: C++ Std Parallelism TS
• Jared Hoberock, NVIDIA (thanks for the slides)
• Jaydeep Marathe, NVIDIA
• Michael Garland, NVIDIA
• Olivier Giroux, NVIDIA
• Vinod Grover, NVIDIA
• Artur Laksberg, Microsoft
• Herb Sutter, Microsoft
• Arch Robison, Intel
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4407.html

Project History
• Multi-vendor collaboration: NVIDIA, Microsoft, Intel
• Based on pre-existing experience:
– Thrust (NVIDIA): targets multicore, GPU
– PPL (Microsoft)
– TBB (Intel)
• Thrust: github.com/thrust
• Boost.Compute: github.com/boostorg/compute
• Bolt: github.com/HSA-Libraries/Bolt
• libstdc++ parallel mode: gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
• C++ AMP algorithms: ampalgorithms.codeplex.com
• PPL: msdn.microsoft.com/en-us/library/dd492418.aspx
• TBB: www.threadingbuildingblocks.org

Implementations
• Microsoft: parallelstl.codeplex.com
• HPX: stellar-group.github.io/hpx
• Codeplay: github.com/KhronosGroup/SyclParallelSTL
• HSA: www.hsafoundation.com/hsa-for-math-science
• Lutz: github.com/t-lutz/ParallelSTL
• NVIDIA: github.com/n3554/n3554

Motivation
• Parallelize the existing algorithms library
• Target the broadest range of current and future architectures
• Concurrency never mandatory
• Vendor extensible

Execution Policies (published 2015)
using namespace std::experimental::parallelism;
std::vector<int> vec = ...
// previous standard sequential sort
std::sort(vec.begin(), vec.end());
// explicitly sequential sort
std::sort(std::seq, vec.begin(), vec.end());
// permitting parallel execution
std::sort(std::par, vec.begin(), vec.end());
// permitting vectorization as well
std::sort(std::par_vec, vec.begin(), vec.end());

Execution Policies
// sort with dynamically-selected execution
size_t threshold = ...
std::execution_policy exec = std::seq;
if (vec.size() > threshold) {
  exec = std::par;
}
std::sort(exec, vec.begin(), vec.end());

Execution Policies: implementation-defined (non-standard)
std::sort(vectorize_in_this_thread, vec.begin(), vec.end());
std::sort(submit_to_my_thread_pool, vec.begin(), vec.end());
std::sort(execute_on_that_gpu, vec.begin(), vec.end());
std::sort(offload_to_my_fpga, vec.begin(), vec.end());
std::sort(send_to_the_cloud, vec.begin(), vec.end());

What is an Execution Policy?
• A promise that a particular kind of reordering will preserve the meaning of the program
• A request that the implementation use some sort of execution strategy
• What sorts of reordering are allowable?
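The next slides spell out the allowable reorderings per policy. As a practical taste, here is a hedged sketch (header and the std::experimental::parallel namespace as in the TS drafts; spellings varied across implementations) of the key user-visible consequence: invocations under std::par may take a lock, because within one thread they stay sequenced, while under std::par_vec they may be interleaved (unsequenced) within a single thread, so the same lock can deadlock against itself:

#include <experimental/algorithm>  // Parallelism TS header, where available
#include <mutex>
#include <vector>

std::mutex m;
std::vector<int> v(1000, 1);
long sum = 0;

void accumulate_locked()
{
    namespace expar = std::experimental::parallel;

    // Fine under par: invocations within one thread are sequenced, so the
    // lock is always released before the next invocation acquires it.
    expar::for_each(expar::par, v.begin(), v.end(), [](int x) {
        std::lock_guard<std::mutex> g(m);
        sum += x;
    });

    // Undefined under par_vec: invocations may be unsequenced within one
    // thread, so a second lock() may run before the first unlock().
    // expar::for_each(expar::par_vec, v.begin(), v.end(), [](int x) {
    //     std::lock_guard<std::mutex> g(m);
    //     sum += x;
    // });
}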
Sequential Execution
std::algo(std::seq, begin, end, Func);
algo invokes user-provided function objects in sequential order in the calling thread.
[Diagram: the calling thread runs Func(), Func(), Func() in order]

Parallelizable Execution
std::algo(std::par, begin, end, Func);
algo is permitted to invoke user-provided function objects unsequenced if invoked in different threads, or to invoke them in indeterminate order if executed in one thread.
[Diagram: the calling thread plus worker threads T1 and T2 each run Func()]

Parallelizable + Vectorizable Execution
std::algo(std::par_vec, begin, end, Func);
std::algo(std::vec, begin, end);
algo is permitted to invoke user-provided function objects in unordered fashion in unspecified threads, or to invoke them unsequenced if executed in one thread. It is the caller's responsibility to ensure correctness, for example that the invocations do not attempt to synchronize with each other.

Difference between std::par and std::par_vec
• Function-object invocation is unordered when in different threads
• When executed in the same thread:
– std::par => unspecified, sequenced invocations
– std::par_vec => unsequenced invocations

New Algorithms, and the rest
• std::for_each
– returns an iterator instead of a functor
– avoids discarding information when implementing higher-level algorithms
• std::for_each_n
– implements std::generate_n, std::copy_n, etc.

Parallelized algorithms:
adjacent_difference adjacent_find all_of any_of copy copy_if copy_n count count_if equal exclusive_scan fill fill_n find find_end find_first_of find_if find_if_not for_each for_each_n generate generate_n includes inclusive_scan inner_product inplace_merge is_heap is_heap_until is_partitioned is_sorted is_sorted_until lexicographical_compare max_element merge min_element minmax_element mismatch move none_of nth_element partial_sort partial_sort_copy partition partition_copy reduce remove remove_copy remove_copy_if remove_if replace replace_copy replace_copy_if replace_if reverse reverse_copy rotate rotate_copy search search_n set_difference set_intersection set_symmetric_difference set_union sort stable_partition stable_sort swap_ranges transform transform_exclusive_scan transform_inclusive_scan transform_reduce uninitialized_copy uninitialized_copy_n uninitialized_fill uninitialized_fill_n unique unique_copy

No parallelization:
binary_search equal_range lower_bound upper_bound is_permutation push_heap pop_heap make_heap sort_heap is_heap next_permutation prev_permutation copy_backward move_backward random_shuffle shuffle accumulate partial_sum iota is_heap_until

Future Parallelism TS for C++2020
• Control structures focus on bulk work creation
• Execution policies expose non-sequential execution
• Executors control how/where work is created
• Execution agents are the unit of work
• Futures convey dependencies
• Efficient implementation across a variety of platforms

SG1: C++ Extensions for Concurrency
• Produced by the Concurrency Study Group (SG1) with input from LEWG, LWG
• A separate document; not part of the ISO C++ Standard
• Goal: eventual inclusion in the ISO C++ Standard
• Available online: http://wg21.link/n4538
• Work in progress: https://github.com/cplusplus/concurrency_ts

What's in it?
• Improvements to std::future
• Latches and barriers
• Atomic smart pointers
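As a minimal sketch of the latch facility in that list, using the interface from N4538 (std::experimental::latch, header <experimental/latch>): few standard libraries shipped it at the time, so treat the names as the TS spells them rather than as portable code.

#include <experimental/latch>   // Concurrency TS, where available
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    constexpr int workers = 4;
    std::experimental::latch done(workers);   // counts down to zero

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back([&done] {
            // ... do some work ...
            done.count_down();                // signal completion
        });

    done.wait();                              // blocks until the count hits zero
    std::cout << "all workers finished\n";
    for (auto& t : pool) t.join();
}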
Improvements to std::future
• Artur Laksberg, Microsoft (thanks for the slides)
• Niklas Gustafsson, Microsoft
• Herb Sutter, Microsoft
• Sana Mithani, Microsoft
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4399.html

Asynchronous Calls
• Building blocks:
– std::async: request asynchronous execution of a function
– Future: a token representing the function's result
• Unlike raw use of std::thread objects:
– allows values or exceptions to be returned, just like "normal" function calls

[Three slides: Asynchronous Computing in C++, by Hartmut Kaiser]

Standard Concurrency Interfaces
• std::async<> and std::future<>: concurrency as with sequential processing
– one location calls a concurrent task, and dealing with the outcome is as simple as with local sub-functions
• std::thread: low-level approach
– one location calls a concurrent task and has to provide low-level techniques to handle the outcome
• std::promise<> and std::future<>: simplify processing the outcome
– one location calls a concurrent task, but dealing with the outcome is simplified
• packaged_task<>: helper to separate task definition from call
– one location defines a task and provides a handle for the outcome
– another location decides when to call the task and with which arguments
– the call need not necessarily happen in another thread

std::future Refresher
• std::future<T>: a proxy for an eventual value of type T
• std::promise<T>: a one-way channel to set the future
[Diagram: promise -> shared state<T> -> future]

Long-Running Work
// Launch a task:
future<int> work = async([] { return CountLinesInFile(…); });
// Collect the results:
cout << work.get();

Composition of Long-Running Work
// Launch a task:
future<int> work1 = async([] { return CountLinesInFile(…); });
// Launch another task:
future<int> work2 = async([] { return CountLinesInFile(…); });
// Collect the results:
cout << work1.get() + work2.get();

Continuation
// Launch a task:
future<int> work = CountLinesInFileAsync(...);
work.then([](future<int> f) {
  // Get the result
  cout << f.get();
});

Binding Multiple Continuations
future<T1> t = async([]() { return func1(); })
  .then([](future<T1> n) { return func2(n.get()); })
  .then([](future<T2> n) { return func3(n.get()); })
  .then ...
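Pulling those fragments together before looking at richer compositions, here is a hedged, self-contained sketch against the Concurrency TS (N4538) std::experimental::future; header availability and a .then-capable async vary by implementation, and CountLinesInFile stands in for any long-running function:

#include <experimental/future>   // Concurrency TS, where available
#include <iostream>
#include <thread>

int CountLinesInFile() { return 42; }   // stand-in for long-running work

int main()
{
    // The TS future supports .then(); std::future does not. One way to
    // obtain a TS future is from an experimental promise.
    std::experimental::promise<int> p;
    std::experimental::future<int> work = p.get_future();

    std::thread producer([&p] { p.set_value(CountLinesInFile()); });

    // Attach a continuation that runs once 'work' becomes ready.
    auto done = work.then([](std::experimental::future<int> f) {
        std::cout << "lines: " << f.get() << '\n';
    });

    done.wait();        // wait for the continuation itself to finish
    producer.join();
}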
Advanced Composition
f.then(A).then(B);
f.then(A); f.then(B);
auto f = when_all(a, b);
auto f = when_any(a, b);

Join
vector<future<string>> futures = get_futures();
auto futureResult = when_all(begin(futures), end(futures))
  .then([](future<vector<future<string>>> results) {
    for (auto& s : results.get()) // will not block
    {
      cout << s.get(); // will not block
    }
  });

Choice
vector<future<string>> futures = get_futures();
auto futureResult = when_any(begin(futures), end(futures))
  .then([](future<vector<future<string>>> results) {
    for (auto& s : results.get()) // will not block
    {
      if (s.ready()) cout << s.get(); // will not block
    }
  });

make_ready_future
future<int> compute(int x) {
  if (x < 0) return make_ready_future<int>(-1);
  if (x == 0) return make_ready_future<int>(0);
  future<int> f1 = async([x]() { return do_work(x); });
  return f1;
}

is_ready
future<int> f1 = async([]() { return possibly_long_computation(); });
if (!f1.is_ready()) {
  // if not ready, attach a continuation and avoid a blocking wait
  f1.then([](future<int> f2) {
    int v = f2.get();
    process_value(v);
  });
}
// if ready, there is no need to add a continuation; process the value right away
else {
  int v = f1.get();
  process_value(v);
}

unwrap
future<future<int>> outer_future = async([] {
  future<int> inner_future = async([] { return 1; });
  return inner_future;
});
future<int> inner_future = outer_future.unwrap();
inner_future.then([](future<int> f) { do_work(f.get()); });

Other things C++ needs: types of parallelism
• Most common types:
– Data: coming in the SIMD proposal
– Task: coming in executors and task blocks
– Embarrassingly parallel: async and threads
– Dataflow: Concurrency TS (.then)

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

The Birth of Study Group 14
Towards improving C++ for games, financial, and low-latency users, among the top users of C++!
http://blog.jetbrains.com/clion/2015/07/infographics-cpp-facts-before-clion/

The Breaking Wave: N4456
• A CppCon 2014 C++ committee panel leads to an impromptu game-developer meeting
• Google Group created; discussions have outstanding industry participation
• N4456 authored and published!

Formation of SG14
• N4456 presented at the Spring 2015 Standards Committee meeting in Lenexa; very well received!
• Formation of Study Group 14: Game Dev & Low Latency
• Chair: Michael Wong (IBM)
• Two SG14 meetings planned:
– CppCon 2015 (this Wednesday)
– GDC 2016, hosted by SONY
https://isocpp.org/std/the-committee

Improving the communication/feedback/review cycle
• Standard C++ Committee members come to CppCon/GDC (hosted by SONY)
• Industry members come to CppCon/GDC
• Meetings are opportunities to present papers or discuss existing proposals
• SG14-approved papers are presented by C++ Committee members at Standard meetings for voting
• Feedback goes back through SG14 to industry for revision
• Rinse and repeat

The industry-name linkage brings in lots of people
• The first industry-named SG, gaining connections with:
– Games
– Financial/Trading
– Banking
– Simulation
– +HPC/Big Data analysis?

Audience of SG14
Goals and scope: not just games!
Shared Common Interest
Better support in C++ for:
• Interactive simulation: simulation and training software
• Real-time graphics: video games
• Low-latency computation: finance/trading
• Constrained resources: embedded systems
• HPC/big-data analytics

Where We Are
• Google Groups: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/sg14
• GitHub: https://github.com/WG21-SG14/SG14 (created by Guy Davidson)

SG14 is interested in following these proposals:
• GPU/Accelerator support
• Executors
– 3 ways: low latency, parallel loops, server task dispatch
• Atomic views
• Coroutines
• noexcept library additions
• Use std::error_code for signaling errors
• Early SIMD-in-C++ investigation
– there are existing SIMD papers suggesting e.g. "Vec<T,N>" and "for simd (;;)"
• Array view
• Node-based allocators
• String conversions
• hot set
• vector and matrix
• Exception and RTTI costs
• Ring or circular buffers
• flat_map
• Intrusive containers
• Allocator interface
• Radix sort
• Spatial and geometric algorithms
• Imprecise but faster alternatives for math algorithms
• Cache-friendly hash table
• Contiguous containers
• Stack containers
• Fixed-point numbers
• plf::colony and plf::stack

Agenda
• Why the rush to Massive Parallelism?
• What Now?
• Parallel programming models of Today from OpenMP
• Parallel programming model of Today from C++11/14/17
• SG14
• Parallel programming of Tomorrow that joins C++ and OpenMP

C++ Standard GPU/Accelerators
• Attended by both national labs and commercial/consumer vendors
• A glimpse into the future
• No design as yet, but several competing design candidates
• Offers the best chance of a model that works across both domains for C++ (only)

Grand Unification?
C++ Std + current proposals already have many features for accelerators:
• Asynchronous tasks (C++11 futures plus C++17 then, when_*, is_ready, …)
• Parallel algorithms
• Executors
• Multi-dimensional arrays, layouts

Candidates for a C++ Std accelerator model
• C++ AMP
– the restrict keyword is a mistake
– GPU hardware is removing traditional hurdles
– modern GPU instruction sets can handle nearly full C++
– memory systems are evolving towards a single heap

Better candidates
Goal: use standard C++ to express all intra-node parallelism
1. Agency extends the Parallelism TS
2. HCC
3. SYCL extends the Parallelism TS
4. HPX extends the Parallelism and Concurrency TSs

1. Agency based
• https://github.com/jaredhoberock/agency/wiki/Quick-Start-Guide

Three executor proposals
• Network based: ASIO from Networking, based on Boost::asio (Chris Kohlhoff)
• Task based: the original Google proposal from 2013 that was pulled from the Concurrency TS (Chris Mysen, Matt Austern, Jeffrey Yasskin, Google)
• Agency based: extends the Parallelism TS (Jared Hoberock, Jeff Garland, Olivier Giroux)

Executors and Future concerns
• Policies describe what work is created, not where/how
• Application developers care about where
• Library developers care about how
– Parallelism TS v1 -> non-standard means
– closed set of policies
– work creation should be standardized

Many forms of work creation
• Different agents: threads, tasks, …
• For loops
• Fibers
• Vector loops
• GPUs

Agency Proposal
• Uniform API
• Composition with execution policies
• Advertised agent semantics
• Bulk agent creation
• Standard executor types
• Convenient control structures

Parallelism TS V2 proposal
using namespace std::experimental::parallelism::v2;
std::vector<int> data = ...
// NEW: permitting parallel execution on a user-defined executor
my_executor my_exec = ...
sort(par.on(my_exec), data.begin(), data.end());
// NEW: permitting parallel execution in the current thread only
sort(this_thread::par, data.begin(), data.end());
// NEW: permitting vector execution in the current thread only
sort(this_thread::vec, data.begin(), data.end());

How to write for_each today (sequential)
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag,
                       sequential_execution_policy&,
                       InputIterator first, InputIterator last, Function f) {
  for (; first != last; ++first) {
    f(*first);
  }
  return first;
}

How to write for_each parallelized today
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag,
                       parallel_execution_policy&,
                       InputIterator first, InputIterator last, Function f) {
  auto n = last - first;
  #pragma omp parallel for
  for (size_t i = 0; i != n; ++i) {
    f(first[i]);
  }
  return first + n;
}

How to write for_each vectorized today
template<class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag,
                       parallel_vector_execution_policy&,
                       InputIterator first, InputIterator last, Function f) {
  auto n = last - first;
  #pragma omp parallel for simd
  for (size_t i = 0; i != n; ++i) {
    f(first[i]);
  }
  return first + n;
}

What's wrong with these?
• Poor separation of concerns
– algorithm semantics conflated with work creation
– redundant code
• Combinatorial problem
– execution policies × algorithm implementation complexity

Future for_each
template<class ExecutionPolicy, class InputIterator, class Function>
InputIterator for_each(random_access_iterator_tag,
                       ExecutionPolicy& policy,
                       InputIterator first, InputIterator last, Function f) {
  auto n = last - first;
  using executor_type = typename ExecutionPolicy::executor_type;
  executor_traits<executor_type>::execute(policy.executor(),
    [=](auto idx) { f(first[idx]); },
    n);
  return first + n;
}

Agency Proposal
• Uniform API: executor_traits
• Execution categories
• Standard executors
• Execution policy interoperation
• this_thread-specific policies
• Convenient control structures

Uniform API: executor_traits
A generic interface for manipulating all types of executors:
template<class Executor>
struct executor_traits {
  using executor_type = Executor;
  using execution_category = ...
  using index_type = ...
  using shape_type = ...
  template<class T> using future = ...
  // functions follow ...
};

template<class Executor>
struct executor_traits {
  ...
  template<class Function>
  static future<result_of_t<Function()>>
  async_execute(executor_type& ex, Function f);

  template<class Function>
  static future<void>
  async_execute(executor_type& ex, Function f, shape_type shape);

  template<class Function>
  static result_of_t<Function()>
  execute(executor_type& ex, Function f);

  template<class Function>
  static void
  execute(executor_type& ex, Function f, shape_type shape);
};

async_execute()
• Request work asynchronously
• Two forms: singleton & bulk
– either one may implement the other
– the cost of task launch may be relatively expensive: SIMD, GPUs, clusters
• Returns my_executor::future

Singleton async_execute()
my_executor exec;
// create a single invocation
auto f = executor_traits<my_executor>::async_execute(exec, [] {
  std::cout << "Hello from agent" << std::endl;
});
f.wait();

Bulk async_execute()
my_executor exec;
std::mutex mut;
// create a group of invocations
auto f = executor_traits<my_executor>::async_execute(exec, [&mut](auto idx) {
  mut.lock();
  std::cout << "Hello from agent " << idx << std::endl;
  mut.unlock();
}, 5);
f.wait();

execute(): Synchronizing Execution
my_executor exec;
std::mutex mut;
executor_traits<my_executor>::execute(exec, [&mut](auto idx) {
  mut.lock();
  std::cout << "Hello from agent " << idx << std::endl;
  mut.unlock();
}, 5);
• Some executors may not be naturally asynchronous; execute() captures these cases

SAXPY agent using Parallelism TS2: non-hierarchical
void saxpy(size_t n, float a, const float* x, const float* y, float* z) {
  using namespace agency;
  bulk_invoke(par(n), [=](parallel_agent& self) {
    int i = self.index();
    z[i] = a * x[i] + y[i];
  });
}

SAXPY agent using Parallelism TS2: hierarchical
// clang -std=c++11 -I. -lstdc++ -pthread hierarchical_saxpy.cpp
#include <agency/execution_policy.hpp>
#include <vector>
#include <cassert>
#include <algorithm>
#include <iostream>
int main() {
  using namespace agency;
  size_t n = 1 << 16;
  std::vector<float> x(n, 1), y(n, 2), z(n);
  float a = 13.;
  size_t inner_size = 256;
  size_t outer_size = (n + (inner_size - 1)) / inner_size;
  auto policy = con(outer_size, seq(inner_size));
  bulk_invoke(policy, [&](concurrent_group<sequential_agent>& self) {
    // compute a flat, global index
    size_t i = self.outer().index() * inner_size + self.inner().index();
    if (i < n) {
      // do the math
      z[i] = a * x[i] + y[i];
    }
  });
  float expected = a * 1. + 2.;
  assert(std::all_of(z.begin(), z.end(), [=](float v) { return expected == v; }));
  std::cout << "OK" << std::endl;
  return 0;
}

2. Heterogeneous Compute Compiler (HCC)
• Boltzmann Initiative
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0069r0.pdf

Goal of C++ Std accelerators
• All intra-node parallelism can be expressed with standard C++
– no need for directives or other language/library extensions
• Portable to a variety of accelerator and CPU architectures

Identifying parallel regions
• Do we need a special keyword to identify accelerator code?
– Want to limit the "viral" effect of a keyword rippling through the call hierarchy
• HCC has a mode to compile all code for both CPU and GPU targets
– no keyword required
– slow compile time
– deferred-compilation architectures could improve compilation performance
• Automatic identification of accelerator code
– detect calls that launch accelerator work and traverse the call chain to upgrade "child" functions
– works if the child function's code is visible to the compiler (a similar restriction as for C++ templates)
– still misses some cases (virtual functions, linked functions)
• A keyword is not required, but still (currently) useful
– HCC uses __attribute__((hc)) or [[hc]] to identify accelerator code
– handles tricky cases missed by automatic detection
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)

Heterogeneous Code Generation
• Heterogeneous devices often have different capabilities
– HCC CPU/GPU targets have different instruction sets and architecture
– instruction-capability differences are disappearing
• Many GPUs now support full C++: indirect branches, misaligned data accesses, goto, stacks
• Still, differences may exist, due to software, resource, or other gaps
• Handling of feature gaps:
– compilation error
– transparent fallback to the CPU
– both are useful
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)

Asynchronous Execution
• GPU architecture basics
– memory-based queues are used to schedule and execute commands
– commands include data copies, parallel execution "kernels", dependencies, configuration
– hardware-based dependency resolution
– efficiently wait for dependencies and signal completion, all without host intervention
• hc::completion_future
– based on C++ std::future
– returned by commands
– HCC extends "then" to schedule device-side commands (no host intervention)
– the HCC implementation identifies accelerator commands via specialization and leverages GPU hardware
– copy(…).then(for_each(…)).then(copy(…));
– when_all, when_any (N4501): combine futures and return another future, in a single function; can leverage dependency-resolution hardware
– HCC uses is_ready()
• HCC supports timestamps in the future (start, stop)
– handy for gaining insight into the performance of asynchronous tasks
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)

Parallel STL
• A significant addition to the C++ Standard
– algorithm-level abstraction; higher-level expression of programmer intent
– enables architecture-specific optimizations, for example use of GPU hardware for efficient execution of hierarchical parallelism
– the execution policy provides a "parallel" assertion, enabling parallel execution on GPUs, multi-core CPUs, SIMD
– executors could provide a mechanism for the programmer to control parallel execution: control "where", grid/group dimensions, unroll optimizations, and more
• Would like asynchronous versions of the parallel algorithms, i.e. ones that return a future
– perhaps these already exist in Parallelism TS v2? (auto return type, executors, HPX examples)
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)
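That asynchronous-algorithm wish is easy to sketch today by composing std::async with a sort; this is a hedged illustration of the desired shape only (HPX already offers it natively via a task policy, and a real version would accept an execution policy or executor), not a proposed API:

#include <algorithm>
#include <future>
#include <vector>

// A hypothetical wrapper: run a (here: plain sequential) sort on another
// thread and hand back a future, approximating what an asynchronous
// parallel algorithm returning a future would feel like to the caller.
template <class RandomIt>
std::future<void> sort_async(RandomIt first, RandomIt last)
{
    return std::async(std::launch::async, [first, last] {
        std::sort(first, last);   // a real version would use par/executors
    });
}

int main()
{
    std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
    auto done = sort_async(v.begin(), v.end());
    // ... overlap other work here ...
    done.get();                   // join with the sort
}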
GPU Memory Architecture Evolution
[Figure from "A C++ Compiler for Heterogeneous Computing" by Ben Sander]

Multidimensional Arrays
• Multi-dimensional arrays are commonly used on GPUs
– image processing, HPC, others
– an opportunity to improve the C++ standard
• Polymorphic views are attractive
– separate data allocation from the data access pattern
– improve code portability to different architectures
– possible tiling optimizations for "HBM" cache
• Would like to use some parallel algorithms with multi-dimensional data
– iterator operator++ iterates across all dimensions and touches all elements
– multi-dimensional iteration is apparently a complex topic…
• Bounds checking
– useful, but performance is critical
– bounds checking must be free, or gated by a user-specified switch
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)

Hierarchical Parallelism
• Many systems, including GPUs, are designed hierarchically
– GPUs provide "groups" of "threads"
– communication between threads in the same group ("near") can be 10x faster than "far"!
– modern CPUs also have clusters of locality: threads (SMT), caches, memory controllers
• GPU group hardware features:
1. Barrier synchronization for threads in the same group
2. High-speed cache memory shared by threads in the group
3. Relaxed atomic operations
(From "A C++ Compiler for Heterogeneous Computing" by Ben Sander)

Agency: Execution Agents
• You create groups of execution agents and have them execute tasks
• An execution agent is a stand-in for some underlying physical thing, maybe a thread, a SIMD lane, or a process, which actually executes our task of interest
• Programming algorithms in terms of execution agents instead of something really specific, like threads, makes our programs portable across different kinds of hardware platforms

3. SYCL
• https://www.khronos.org/sycl

Overview of SYCL
[Figure: SYCL overview]
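The overview diagram did not carry over; as a flavor of the model, here is a minimal SAXPY sketch in SYCL 1.2 style (queue, buffers, accessors, a parallel_for kernel); the header and namespace spellings follow the Khronos specification of the time and may differ in later SYCL revisions:

#include <CL/sycl.hpp>
#include <vector>

void saxpy(float a, std::vector<float>& x, std::vector<float>& y)
{
    using namespace cl::sycl;
    queue q;                                  // default device selection
    {
        // Buffers manage host<->device data movement implicitly.
        buffer<float, 1> bx(x.data(), range<1>(x.size()));
        buffer<float, 1> by(y.data(), range<1>(y.size()));
        q.submit([&](handler& cgh) {
            auto ax = bx.get_access<access::mode::read>(cgh);
            auto ay = by.get_access<access::mode::read_write>(cgh);
            cgh.parallel_for<class saxpy_kernel>(
                range<1>(x.size()),
                [=](id<1> i) { ay[i] = a * ax[i] + ay[i]; });
        });
    }   // buffer destruction synchronizes the results back into y
}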
4. HPX
• http://stellar.cct.lsu.edu/pubs/ParallelizingSTL.pdf

HPX and STE||AR
• STE||AR is about shaping a scalable future with a new approach to parallel computation
• The most notable ongoing project by STE||AR is HPX: a general-purpose C++ runtime system for parallel and distributed applications of any scale

HPX
• HPX enables programmers to write fully asynchronous code using hundreds of millions of threads
• The first open-source implementation of the ParalleX execution model, addressing:
– starvation
– latencies
– overhead

Types of executors in HPX
• standard executors: parallel, sequential
• this-thread executors: static queue, static priority queue
• thread-pool executors and thread-pool OS executors: local queue, local priority queue, static queue, static priority queue
• service executors: io pool, parcel pool, timer pool, main pool
• distribution-policy executor

Extending with Execution Policies
The .with syntax extends the parallel algorithms:
auto par_auto = par.with(auto_chunk_size()); // equivalent to par
auto par_static = par.with(static_chunk_size());
auto my_policy = par.on(my_exec).with(my_chunk_size);
auto my_task_policy = my_policy(task);

HPX: for_each_n
template <typename ExPolicy, typename F>
static typename detail::algorithm_result<ExPolicy, Iter>::type
parallel(ExPolicy const& policy, Iter first, std::size_t count, F&& f)
{
    if (count != 0)
    {
        return util::foreach_n_partitioner<ExPolicy>::call(policy, first, count,
            [f](Iter part_begin, std::size_t part_size)
            {
                util::loop_n(part_begin, part_size, [&f](Iter const& curr)
                {
                    f(*curr);
                });
            });
    }
    return detail::algorithm_result<ExPolicy, Iter>::get(std::move(first));
}

What is still needed from C++
• Asynchronous parallel algorithms
• Scoped atomics
• Futures: timestamps?
• Lightweight threads and solutions for TLS

The future programming model
• Executors that can dispatch massively parallel work
• Marries futures, concurrency and parallelism
• An accelerator programming model that enables both the HPC and consumer domains for C++ is coming
• Supports C++-specific constructs and integrates closely with the C++ Std

Food for thought and Q/A
• C11/C++14 Standards
– C++: http://www.open-std.org/jtc1/sc22/wg21/prot/14882fdis/n3937.pdf
– C++ (post-C++14 free version): http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
– C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
• Participate and send feedback to your compiler vendor
– What features/libraries interest you or your customers?
– What problem/annoyance would you like the Std to resolve?
– Is Special Math important to you?
– Do you expect 0x features to be used quickly by your customers?
• Talk to me at my blog:
– http://www.ibm.com/software/rational/cafe/blogs/cpp-standard

My blogs and email address
• ISOCPP.org Director, VP: http://isocpp.org/wiki/faq/wg21#michael-wong
• OpenMP CEO: http://openmp.org/wp/about-openmp/
• My blogs: http://ibm.co/pCvPHR
• C++11 status: http://tinyurl.com/43y8xgf
• Boost test results: http://www.ibm.com/support/docview.wss?rs=2239&context=SSJT9L&uid=swg27006911
• C/C++ Compilers Feature Request Page: http://www.ibm.com/developerworks/rfe/?PROD_ID=700
• Chair of WG21 SG5 Transactional Memory: https://groups.google.com/a/isocpp.org/forum/?hl=en&fromgroups#!forum/tm
• Chair of WG21 SG14 Games Dev/Low Latency: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!forum/sg14