
Parallel STL in today’s SYCL
Ruymán Reyes
[email protected]
Codeplay Research
15th November, 2016
Outline
1 Parallelism TS
2 The SYCL parallel STL
3 Heterogeneous Execution with Parallel STL
4 Conclusions and Future Work
The presenter
Ruyman Reyes, PhD
▶ Background in HPC, programming models and compilers
→ Worked on HPC scientific codes (ScaLAPACK, GROMACS, CP2K)
→ Created the first open-source OpenACC implementation
▶ Contributor to the SYCL specification
▶ Lead of ComputeCpp (Codeplay's SYCL implementation)
▶ Coordinating the work on SYCL Parallel STL
Codeplay Software
We build software development tools for SoCs
▶ Software company based in Edinburgh
▶ 42 developers
▶ Different backgrounds and skill sets
→ Games industry, AI, compilers, HPC, robotics
→ Various levels of expertise (graduates to PhD)
▶ Customers work in all areas of industry
→ Smartphones
→ Self-driving cars
→ Game consoles
Our technology is probably in your pocket!
Parallel STL: Democratizing Parallelism in C++
▶ Various libraries have offered STL-like interfaces for parallel algorithms
→ Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms
▶ In 2012, two separate parallelism proposals were made to the C++ standard:
→ NVIDIA (N3408), based on Thrust (a CUDA-based C++ library)
→ Microsoft and Intel (N3429), based on Intel TBB and PPL/C++AMP
▶ A joint proposal (N3554) was then made, as suggested by SG1
→ Many working drafts followed: N3554, N3850, N3960, N4071, N4409
▶ The final proposal, P0024R2, was accepted for C++17 at the Jacksonville meeting
▶ The latest status is in the C++ draft on GitHub
Existing implementations
These implementations follow the evolution of the document
▶ Microsoft: http://parallelstl.codeplex.com
▶ HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
▶ HSA: http://www.hsafoundation.com/hsa-for-math-science
▶ Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
▶ NVIDIA: http://github.com/n3554/n3554
▶ Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
What does the Parallelism TS add?
A set of execution policies and a collection of parallel algorithms
▶ The execution policies
▶ Wording that specifies the conditions under which the algorithms may execute in parallel
▶ New parallel algorithms
▶ The exception_list class
Sorting with the STL
A sequential sort
std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}
Sorting with the STL
A parallel sort
std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::execution::par, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}
▶ par is an execution policy object
▶ The sort will be executed in parallel, using an implementation-defined method
The Execution Policy
Standard policy classes
Defined in the execution namespace
▶ class sequenced_policy:
→ Never runs in parallel
▶ class parallel_policy:
→ May use the calling thread, and may spawn other threads (e.g. std::thread)
→ Invocations do not interleave on a single thread
▶ class parallel_unsequenced_policy:
→ May use the calling thread or other threads (e.g. std::thread)
→ Multiple invocations may be interleaved on a single thread (e.g. vectorized)
Global objects
constexpr sequenced_policy            seq;
constexpr parallel_policy             par;
constexpr parallel_unsequenced_policy par_unseq;
The Execution Policy
Choosing different parallel implementations
// May execute in parallel
std::sort(std::execution::par, std::begin(data), std::end(data));
// May be parallelized and vectorized
std::sort(std::execution::par_unseq, std::begin(data), std::end(data));
// Will not be parallelized
std::sort(std::execution::seq, std::begin(data), std::end(data));
Propagating the policy to the end user
template <typename Policy, typename Iterator>
void library_function(Policy p, Iterator begin, Iterator end) {
  std::sort(p, begin, end);
  std::for_each(p, begin, end, [&](typename Iterator::value_type& e) { e++; });
  std::for_each(std::execution::seq, begin, end, non_parallel_operation);
}
Implementations can define their own Execution Policies
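Not on the original slide: a minimal sketch of what a library-defined policy can look like, in the spirit of the SYCL policy shown later. The names my_lib, gpu_execution_policy and my_lib::sort are hypothetical; the idea is that a library ships its own policy tag plus algorithm overloads that dispatch on it.

#include <algorithm>
#include <vector>

namespace my_lib {
  // Tag type selecting this library's backend; plays the same role as the std policies.
  class gpu_execution_policy {};
  constexpr gpu_execution_policy gpu{};   // global policy object, like par

  // Library-provided overload that dispatches on the policy tag.
  template <class RandomIt>
  void sort(gpu_execution_policy, RandomIt first, RandomIt last) {
    // A real backend would offload here; this sketch falls back to std::sort.
    std::sort(first, last);
  }
} // namespace my_lib

int main() {
  std::vector<int> data = { 8, 9, 1, 4 };
  my_lib::sort(my_lib::gpu, data.begin(), data.end());
}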
Dealing with exceptions
▶ Different execution threads may abort with different exceptions
▶ Parallel STL algorithms may therefore throw an exception_list
▶ Note that an uncaught exception under the parallel_unsequenced_policy will call std::terminate
class exception_list : public exception {
public:
  typedef unspecified iterator;
  size_t size() const noexcept;
  iterator begin() const noexcept;
  iterator end() const noexcept;
  virtual const char* what() const noexcept;
};
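Not on the original slide: a minimal sketch of consuming that interface, assuming the Parallelism TS namespace std::experimental::parallel, where exception_list owns a sequence of exception_ptr objects. Note that the final C++17 wording instead calls std::terminate when an exception escapes an element access function.

#include <exception>
#include <experimental/algorithm>        // headers as named in the Parallelism TS
#include <experimental/execution_policy>
#include <iostream>
#include <vector>

namespace pstl = std::experimental::parallel;

void sort_with_diagnostics(std::vector<int>& data) {
  try {
    pstl::sort(pstl::par, std::begin(data), std::end(data));
  } catch (const pstl::exception_list& el) {
    std::cerr << el.size() << " exception(s) collected:\n";
    for (std::exception_ptr ep : el) {    // iterate the stored exception_ptr objects
      try { std::rethrow_exception(ep); }
      catch (const std::exception& e) { std::cerr << "  " << e.what() << '\n'; }
    }
  }
}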
Parallel Algorithms
▶ Overloads of the STL algorithms that take an execution policy as their first argument
▶ Not all STL algorithms are suitable for parallel execution!
Introducing new algorithms to the STL
For Each
template <class ExecutionPolicy,
          class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
              InputIterator first, InputIterator last,
              Function f);
template <class ExecutionPolicy,
          class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy&& exec,
                         InputIterator first, Size n,
                         Function f);
template <class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
                         Function f);
▶ for_each: applies f to the elements in [first, last)
▶ for_each_n: applies f to the elements in [first, first + n)
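Not on the original slide: a minimal usage sketch, spelled with the C++17 names from <execution>, assuming a standard library that ships the parallel algorithms.

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<int> v(1000, 1);
  // Apply f to every element, possibly in parallel.
  std::for_each(std::execution::par, v.begin(), v.end(), [](int& x) { x *= 2; });
  // Apply f to the first 100 elements only.
  std::for_each_n(std::execution::par, v.begin(), 100, [](int& x) { x += 1; });
}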
Introducing new algorithms to the STL
Numerical parallel algorithms
template <class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);
template <class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);
template <class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
         BinaryOperation binary_op);
▶ As opposed to accumulate, binary_op is applied in an unspecified order (see the sketch below)
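Not on the original slide: a minimal usage sketch with an initial value and a binary operation, using the C++17 names. Because the operation is applied in an unspecified order and grouping, it should be associative and commutative (integer addition is; floating-point addition only approximately).

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
  std::vector<int> v = { 8, 9, 1, 4 };
  // Partial sums may be formed in any order and grouping.
  int sum = std::reduce(std::execution::par, v.begin(), v.end(), 0, std::plus<int>());
  return sum == 22 ? 0 : 1;
}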
Introducing new algorithms to the STL
Other algorithms introduced
▶ Exclusive/inclusive scan (prefix sum)
▶ Transform reduce
▶ Transform exclusive/inclusive scan
Basic building blocks for constructing other algorithms and applications, as in the sketch below!
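Not on the original slide: a minimal sketch of using one of these building blocks, a dot product written with transform_reduce and the C++17 names.

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> a = { 1.f, 2.f, 3.f };
  std::vector<float> b = { 4.f, 5.f, 6.f };
  // Multiply element-wise, then reduce the products in an unspecified order.
  float dot = std::transform_reduce(std::execution::par,
                                    a.begin(), a.end(), b.begin(), 0.0f,
                                    std::plus<float>(), std::multiplies<float>());
  return dot == 32.0f ? 0 : 1;
}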
The SYCL Parallel STL
Spec & Examples
▶ Enables the C++17 Parallel STL to run on any SYCL-supported device
→ Any OpenCL platform with SPIR support
▶ Improves the productivity of C++ developers who care about performance
▶ Integrates nicely with existing SYCL codebases
▶ Completely open source
→ https://github.com/KhronosGroup/SyclParallelSTL
SYCL Parallel STL introduces two execution policies
Sorting with the STL
A sequential sort
std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}
Sorting with the STL
Sorting on the GPU!
std::vector<int> data = { 8, 9, 1, 4 };
std::sort(sycl_policy, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}
▶ sycl_policy is an execution policy (a plausible declaration is sketched below)
▶ data is a standard std::vector
▶ By default it will use the device returned by default_selector
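The slide does not show where sycl_policy comes from; following the class shown on the next slide, a plausible declaration would be the one below (the kernel name SortKernel is illustrative, not from the talk).

// A default-constructed policy targets the device chosen by default_selector;
// the later slides show how to hand it an explicit queue instead.
sycl_execution_policy<class SortKernel> sycl_policy;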
The SYCL Policy
template <typename KernelName = DefaultKernelName>
class sycl_execution_policy {
public:
  using kernelName = KernelName;
  sycl_execution_policy() = default;
  sycl_execution_policy(cl::sycl::queue q);
  cl::sycl::queue get_queue() const;
};
▶ Indicates that the algorithm will be executed on a SYCL device
▶ Can optionally take a queue
→ Re-use the device selection
→ Asynchronous data copy-back
→ ...
Why the KernelName template?
How are algorithms implemented?
auto f = [vectorSize, &bufI, &bufO, op](cl::sycl::handler& h) mutable {
  ...
  auto aI = bufI.template get_access<access::mode::read>(h);
  auto aO = bufO.template get_access<access::mode::write>(h);
  h.parallel_for</* The Kernel Name */>(r,
      [aI, aO, op](cl::sycl::id<1> id) {
        aO[id.get(0)] = op(aI[id.get(0)]);
      });
};
Two separate calls can generate different kernels!
transform(par, v.begin(), v.end(), v.begin(), [=](int val) { return val + 1; });
transform(par, v.begin(), v.end(), v.begin(), [=](int val) { return val - 1; });
Using named policies and queues
using namespace cl::sycl;
using namespace experimental::parallel::sycl;
std::vector<int> v = ...;
default_selector ds;
{
  queue q(ds);
  sort(sycl_execution_policy<>(q), std::begin(v), std::end(v));
  sycl_execution_policy<class myName> sepn1(q);
  transform(sepn1, std::begin(v), std::end(v),
            std::begin(v), [=](int i) { return i + 1; });
}
▶ A kernel name is only required for lambdas, not for functors
▶ The device selection and the queue are re-used
▶ Data is copied in and out on each call!
Avoiding data-copies using buffers
using namespace cl::sycl;
using namespace experimental::parallel::sycl;
std::vector<int> v = ...;
default_selector h;
{
  buffer<int> b(std::begin(v), std::end(v));
  b.set_final_data(v.data());
  {
    queue q(h);
    sort(sycl_execution_policy<>(q), begin(b), end(b));
    sycl_execution_policy<class transform1> sepn1(q);
    transform(sepn1, begin(b), end(b), begin(b),
              [](int num) { return num + 1; });
  }
}
▶ The buffer is constructed from an STL container
▶ Data is copied back to the container when the buffer is destroyed
→ Note the additional copy from the vector to the buffer and vice versa
Using device-only data
using namespace experimental::parallel::sycl;
default_selector h;
{
  buffer<int, 1> b(range<1>(size));
  b.set_final_data(v.data());
  {
    cl::sycl::queue q(h);
    {
      auto hostAcc = b.get_access<access::mode::read_write,
                                  access::target::host_buffer>();
      for (auto& i : hostAcc) {
        i = read_data_from_file(...);
      }
    }
    sort(sycl_execution_policy<>(q), begin(b), end(b));
    sycl_execution_policy<class negate1> sycl_policy(q);
    transform(sycl_policy, begin(b), end(b), begin(b),
              std::negate<int>());
  }
}
▶ Data is initialized on the host using a host accessor
▶ Once the host accessor is released, the data resides on the device
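Not on the original slide: after the algorithms have run, the results can also be inspected on the host through another host accessor, without waiting for the buffer to be destroyed. A minimal sketch, continuing the code above:

{
  // Creating a host accessor synchronises and makes the data visible on the host.
  auto hostAcc = b.get_access<access::mode::read, access::target::host_buffer>();
  for (size_t i = 0; i < size; ++i) {
    std::cout << hostAcc[i] << "\n";
  }
}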
Heterogeneous Execution Policy
Distributes the workload of Parallel STL algorithms
▶ Execution on two devices at the same time
▶ Designed for integrated CPU/GPU platforms
▶ The user decides the percentage of work assigned to the GPU and the CPU
▶ The policy distributes the workload accordingly
HiPEAC internship
▶ Research work funded via a collaboration grant with HiPEAC
▶ PhD student from the University of Malaga (A. Vilches)
Heterogeneous Execution Example
cl::sycl::queue q;
amd_cpu_selector cpu_sel;
cl::sycl::queue q2(cpu_sel);
sycl::sycl_heterogeneous_execution_policy<class TransformAlgorithm1> snp(
    q, q2, ratio);
auto mytransform = [&]() {
  float pi = 3.14;
  std::experimental::parallel::transform(
      snp, std::begin(v1), std::end(v1), std::begin(v2), std::begin(res),
      [=](float a, float b) { return pi * a + b; });
};
Manual distribution of the work
Heterogeneous Execution Policy in Use
Heterogeneous Execution Trivial Implementation
...
{
  auto buf1_q1 =
      sycl::helpers::make_const_buffer(first1, first1 + crosspoint);
  auto buf2_q1 =
      sycl::helpers::make_const_buffer(first2, first2 + crosspoint);
  auto res_q1 = sycl::helpers::make_buffer(result, result + crosspoint);
  auto buf1_q2 =
      sycl::helpers::make_const_buffer(first1 + crosspoint, last1);
  auto buf2_q2 = sycl::helpers::make_const_buffer(first2 + crosspoint,
                                                  first2 + elements);
  auto res_q2 =
      sycl::helpers::make_buffer(result + crosspoint, result + elements);
  impl::transform(named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op);
  impl::transform(named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op);
}
...
Heterogeneous load balancing
Dynamic decisions for heterogeneous balancing
▶ The offloading percentage is a runtime value
▶ Developers can create runtime evaluation functions (see the sketch below)
→ Depending on the workload
→ Depending on the platform
→ Depending on user input
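Not on the original slides: a minimal sketch of such a runtime decision. The function choose_gpu_ratio and its thresholds are hypothetical, not part of the library.

#include <cstddef>

float choose_gpu_ratio(std::size_t problem_size) {
  // Small workloads are not worth offloading; large ones go mostly to the GPU.
  if (problem_size < 10000)   return 0.0f;  // keep everything on the CPU
  if (problem_size < 1000000) return 0.5f;  // split the work evenly
  return 0.8f;                              // 80% of the iterations on the GPU
}
// The result is then passed as the ratio when constructing the heterogeneous
// policy, as in the earlier transform example.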
Conclusions
Parallel STL
▶ Enables developers to quickly exploit parallel architectures
▶ SYCL makes implementing these algorithms for heterogeneous platforms trivial
→ Just write single-source C++
→ It will work on any OpenCL + SPIR platform!
Heterogeneous Execution
▶ There are plenty of heterogeneous platforms out there
▶ They are complex to work with!
▶ SYCL allows developers to focus on algorithms and distribute the work
▶ The heterogeneous policy can be optimized and customized per platform
→ The runtime can get platform information and use custom balancing
→ Users can extend the policy with specific balancing decisions
@codeplaysoft
[email protected]
codeplay.com