Parallel STL in today's SYCL
Ruymán Reyes
[email protected]
Codeplay Research
15th November, 2016

Outline
1. Parallelism TS
2. The SYCL Parallel STL
3. Heterogeneous Execution with Parallel STL
4. Conclusions and Future Work

The presenter
Ruymán Reyes, PhD
- Background in HPC, programming models and compilers
  → Worked on HPC scientific codes (ScaLAPACK, GROMACS, CP2K)
  → Created the first open-source OpenACC implementation
- Contributor to the SYCL specification
- Lead of ComputeCpp (Codeplay's SYCL implementation)
- Coordinating the work on SYCL Parallel STL

Codeplay Software
We build software development tools for SoCs.
- Software company based in Edinburgh
- 42 developers with a range of backgrounds and skill sets
  → Games industry, AI, compilers, HPC, robotics
  → Various levels of expertise (graduates to PhD)
- Customers work in all areas of industry
  → Smartphones
  → Self-driving cars
  → Game consoles
- Our technology is probably in your pocket!

1. Parallelism TS

Parallel STL: Democratizing Parallelism in C++
- Various libraries have offered an STL-like interface for parallel algorithms
  → Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms
- In 2012, two separate parallelism proposals were made to the C++ standard:
  → NVIDIA (N3408), based on Thrust (a CUDA-based C++ library)
  → Microsoft and Intel (N3429), based on Intel TBB and PPL/C++AMP
- A joint proposal (N3554) followed, at the suggestion of SG1
  → Many working drafts: N3554, N3850, N3960, N4071, N4409
- The final proposal, P0024R2, was accepted for C++17 at the Jacksonville meeting
- The latest status is tracked in the C++ draft on GitHub

Existing implementations
Following the evolution of the document:
- Microsoft: http://parallelstl.codeplex.com
- HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
- HSA: http://www.hsafoundation.com/hsa-for-math-science
- Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
- NVIDIA: http://github.com/n3554/n3554
- Codeplay: http://github.com/KhronosGroup/SyclParallelSTL

What is the Parallelism TS adding?
A set of execution policies and a collection of parallel algorithms:
- The execution policies
- Paragraphs explaining the conditions for parallel algorithms
- New parallel algorithms
- The exception_list class

Sorting with the STL
A sequential sort:

    std::vector<int> data = {8, 9, 1, 4};
    std::sort(std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

A parallel sort:

    std::vector<int> data = {8, 9, 1, 4};
    std::sort(std::par, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

- par is an execution policy object
- The sort will be executed in parallel using an implementation-defined method

The Execution Policy
Standard policy classes, defined in the execution namespace:
- class sequenced_policy:
  → Execution is never parallelized
- class parallel_policy:
  → Can use the calling thread, but may spawn others (e.g. std::thread)
  → Invocations do not interleave on a single thread
- class parallel_unsequenced_policy:
  → Can use the calling thread or others (e.g. std::thread)
  → Multiple invocations may be interleaved on a single thread

Global objects:

    constexpr sequenced_policy sequenced;
    constexpr parallel_policy par;
    constexpr parallel_unsequenced_policy par_unseq;

The Execution Policy
Choosing different parallel implementations:

    // May execute in parallel
    std::sort(std::par, std::begin(data), std::end(data));
    // May be parallelized and vectorized
    std::sort(std::par_unseq, std::begin(data), std::end(data));
    // Will not be parallelized
    std::sort(std::sequenced, std::begin(data), std::end(data));

Propagating the policy to the end user:

    template <typename Policy, typename Iterator>
    void library_function(Policy p, Iterator begin, Iterator end) {
      std::sort(p, begin, end);
      std::for_each(p, begin, end,
                    [&](typename Iterator::value_type& e) { e++; });
      std::for_each(std::sequenced, begin, end, non_parallel_operation);
    }

Implementations can define their own execution policies.

Dealing with exceptions
- Different execution threads may abort with different exceptions
- Parallel STL algorithms may throw an exception_list
- Note that an uncaught exception under the parallel_unsequenced_policy will cause std::terminate to be called

    class exception_list : public exception {
    public:
      typedef unspecified iterator;
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;
      virtual const char* what() const noexcept;
    };

Parallel Algorithms
- Overloads of the STL algorithms that take an execution policy
- Not all STL algorithms are suitable for parallel execution!

Introducing new algorithms to the STL
For each:

    template <class ExecutionPolicy, class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec, InputIterator first,
                  InputIterator last, Function f);

    template <class ExecutionPolicy, class InputIterator, class Size,
              class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec, InputIterator first,
                             Size n, Function f);

    template <class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n, Function f);

- for_each: applies f to the elements in [first, last)
- for_each_n: applies f to the elements in [first, first + n)

Introducing new algorithms to the STL
Numerical parallel algorithms:

    template <class InputIterator>
    typename iterator_traits<InputIterator>::value_type
    reduce(InputIterator first, InputIterator last);

    template <class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);

    template <class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

- As opposed to accumulate, binary_op is applied in an unspecified order.
Introducing new algorithms to the STL
Other algorithms introduced:
- Exclusive/inclusive scan (prefix sum)
- Transform reduce
- Transform exclusive/inclusive scan
Basic building blocks for constructing other algorithms and applications!

2. The SYCL Parallel STL

The SYCL Parallel STL
Spec & examples:
- Enables the C++17 Parallel STL to run on any SYCL-supported device
  → Any OpenCL platform with SPIR support
- Improves the productivity of C++ developers who care about performance
- Integrates nicely with existing SYCL codebases
- Completely open source
  → https://github.com/KhronosGroup/SyclParallelSTL
SYCL Parallel STL introduces two execution policies.

Sorting with the STL
A sequential sort:

    std::vector<int> data = {8, 9, 1, 4};
    std::sort(std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

Sorting on the GPU!

    std::vector<int> data = {8, 9, 1, 4};
    std::sort(sycl_policy, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

- sycl_policy is an execution policy
- data is a standard std::vector
- Technically, this will use the device returned by default_selector

The SYCL Policy

    template <typename KernelName = DefaultKernelName>
    class sycl_execution_policy {
    public:
      using kernelName = KernelName;
      sycl_execution_policy() = default;
      sycl_execution_policy(cl::sycl::queue q);
      cl::sycl::queue get_queue() const;
    };

- Indicates that the algorithm will be executed on a SYCL device
- Can optionally take a queue
  → Re-use the device selection
  → Asynchronous data copy-back
  → ...
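For orientation, here is a minimal end-to-end sketch of what driving these algorithms with a SYCL policy can look like. The header paths (<sycl/execution_policy>, <experimental/algorithm>) and the std::experimental::parallel namespace are assumptions based on the SyclParallelSTL repository linked above as it was at the time of this talk; check the repository for the current layout.

    // Minimal usage sketch for the SYCL Parallel STL. Header and namespace
    // names are assumptions taken from the SyclParallelSTL project, not
    // from the C++ standard itself.
    #include <iostream>
    #include <vector>

    #include <experimental/algorithm>
    #include <sycl/execution_policy>

    using namespace std::experimental::parallel;

    int main() {
      std::vector<int> v = {8, 9, 1, 4};

      // Default-constructed policy: uses the device picked by default_selector.
      sycl::sycl_execution_policy<> sycl_policy;
      sort(sycl_policy, v.begin(), v.end());

      // A named policy for a lambda kernel (see the following slides on
      // kernel names and queues).
      sycl::sycl_execution_policy<class AddOne> named_policy;
      transform(named_policy, v.begin(), v.end(), v.begin(),
                [](int x) { return x + 1; });

      for (int x : v) { std::cout << x << " "; }
      std::cout << std::endl;
      return 0;
    }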
Why the KernelName template?
How are the algorithms implemented?

    auto f = [vectorSize, &bufI, &bufO, op](cl::sycl::handler& h) mutable {
      ...
      auto aI = bufI.template get_access<access::mode::read>(h);
      auto aO = bufO.template get_access<access::mode::write>(h);
      h.parallel_for</* The Kernel Name */>(r,
          [aI, aO, op](cl::sycl::id<1> id) {
            aO[id.get(0)] = op(aI[id.get(0)]);
          });
    };

Two separate calls can generate different kernels!

    transform(par, v.begin(), v.end(), [=](int& val) { val++; });
    transform(par, v.begin(), v.end(), [=](int& val) { val--; });

Using named policies and queues

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;
    std::vector<int> v = ...;
    // Transform
    default_selector ds;
    {
      queue q(ds);
      sort(sycl_execution_policy<>(q), std::begin(v), std::end(v));
      sycl_execution_policy<class myName> sepn1(q);
      transform(sepn1, std::begin(v), std::end(v), std::begin(v),
                [=](int i) { return i + 1; });
    }

- A kernel name is only required for lambdas, not for functors
- The device selection and the queue are re-used
- Data is copied in and out in each call!

Avoiding data copies using buffers

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;
    std::vector<int> v = ...;
    default_selector h;
    {
      buffer<int> b(std::begin(v), std::end(v));
      b.set_final_data(v.data());
      {
        queue q(h);
        sort(sycl_execution_policy<>(q), begin(b), end(b));
        sycl_execution_policy<class transform1> sepn1(q);
        transform(sepn1, begin(b), end(b), begin(b),
                  [](int num) { return num + 1; });
      }
    }

- The buffer is constructed from an STL container
- Data is copied back to the container when the buffer is destroyed
  → Note the additional copy from the vector to the buffer and vice versa

Using device-only data

    using namespace experimental::parallel::sycl;
    default_selector h;
    {
      buffer<int, 1> b(range<1>(size));
      b.set_final_data(v.data());
      {
        cl::sycl::queue q(h);
        {
          auto hostAcc =
              b.get_access<mode::read_write, target::host_buffer>();
          for (auto& elem : hostAcc) { elem = read_data_from_file(...); }
        }
        sort(sycl_execution_policy<>(q), begin(b), end(b));
        transform(sycl_policy, begin(b), end(b), begin(b),
                  std::negate<int>());
      }
    }

- Data is initialized on the host using a host accessor
- Once the host accessor is released, the data is available on the device

3. Heterogeneous Execution with Parallel STL

Heterogeneous Execution Policy
Distribute the workload of Parallel STL algorithms:
- Execution on two devices at the same time
- Designed for integrated CPU/GPU platforms
- The user decides the percentage of work assigned to the GPU/CPU
- The policy distributes the workload accordingly

HiPEAC internship:
- Research work funded via a collaboration grant with HiPEAC
- PhD student from the University of Malaga (A. Vilches)

Heterogeneous Execution Example

    cl::sycl::queue q;
    amd_cpu_selector cpu_sel;
    cl::sycl::queue q2(cpu_sel);
    sycl::sycl_heterogeneous_execution_policy<class TransformAlgorithm1>
        snp(q, q2, ratio);
    auto mytransform = [&]() {
      float pi = 3.14;
      std::experimental::parallel::transform(snp,
          std::begin(v1), std::end(v1), std::begin(v2), std::begin(res),
          [=](float a, float b) { return pi * a + b; });
    };
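The ratio passed to the heterogeneous policy above is a plain runtime value describing the CPU/GPU split; the implementation sketch a couple of slides below divides the input range at a "crosspoint" index derived from it. Purely as a hypothetical illustration (these helpers are not part of SyclParallelSTL), the derivation could look like this:

    // Hypothetical helpers, not part of the library: derive the GPU share of
    // the work and the index ("crosspoint") at which the range is split
    // between the two queues.
    #include <cstddef>

    // Trivial size-based heuristic: keep small inputs on the CPU.
    float choose_gpu_ratio(std::size_t elements) {
      return (elements < 4096) ? 0.0f : 0.75f;
    }

    // Translate the ratio into the element index where the range is divided.
    std::size_t crosspoint_from_ratio(std::size_t elements, float gpu_ratio) {
      return static_cast<std::size_t>(elements * gpu_ratio);
    }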
Manual distribution of the work
[figure]

Heterogeneous Execution Policy in Use
[figure]

Heterogeneous Execution
A trivial implementation:

    ...
    {
      auto buf1_q1 = sycl::helpers::make_const_buffer(first1, first1 + crosspoint);
      auto buf2_q1 = sycl::helpers::make_const_buffer(first2, first2 + crosspoint);
      auto res_q1  = sycl::helpers::make_buffer(result, result + crosspoint);
      auto buf1_q2 = sycl::helpers::make_const_buffer(first1 + crosspoint, last1);
      auto buf2_q2 = sycl::helpers::make_const_buffer(first2 + crosspoint, first2 + elements);
      auto res_q2  = sycl::helpers::make_buffer(result + crosspoint, result + elements);
      impl::transform(named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op);
      impl::transform(named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op);
    }
    ...

Heterogeneous load balancing
Dynamic decision of the heterogeneous balance:
- The offloading percentage is a runtime value
- Developers can create runtime evaluation functions
  → Depending on the workload
  → Depending on the platform
  → Depending on user input

4. Conclusions and Future Work

Conclusions
Parallel STL:
- Enables developers to quickly exploit parallel architectures
- SYCL makes implementing these algorithms for heterogeneous platforms trivial
  → Just write single-source C++
  → Will work on any OpenCL + SPIR platform!

Heterogeneous execution:
- Plenty of heterogeneous platforms out there
- They are complex to work with!
- SYCL allows developers to focus on algorithms and on distributing the work
- The heterogeneous policy can be optimized and customized per platform
  → The runtime can get platform information and use custom balancing
  → Users can extend the policy with specific balancing decisions

@codeplaysoft
[email protected]
codeplay.com