Stencil Operator

GSL+SEL
Stencil computations made cool
Mauro Bianco
Ugo Varetto
CSCS
Motivation
• Many computations are represented by
iterating over a data set to apply functions to
its elements
• Developers do not have a uniform
abstraction to express these computations
– Re-inventing the wheel every time
– Developing, debugging, maintaining many codes
that look similar but are not
– Several performance issues are shared among all
these versions
Stencil Computation (For Regular Grids)
• Given a regular D-dimensional grid
• Compute a function in all pertinent elements
which depends on elements at fixed offsets
– Fixed w.r.t. grid size
• Pertinent elements are those for which the
offsets are well defined
• The iteration order is a parameter of the
computation
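As a minimal sketch of the definition above (not GSL code), the following applies a 5-point average to the pertinent elements of a 2D grid. The function names `five_point_average` and `core_size` and the row-major layout are assumptions made only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration, not GSL code: apply a 5-point average to every
// pertinent element of an n x m grid stored row-major in a 1D vector.
// The offsets are fixed (+1/-1 in each direction), so the pertinent (core)
// elements are exactly those with 1 <= i < n-1 and 1 <= j < m-1.
std::vector<double> five_point_average(const std::vector<double>& in, int n, int m) {
    std::vector<double> out(in);  // non-core elements are copied unchanged
    for (int i = 1; i < n - 1; ++i) {
        for (int j = 1; j < m - 1; ++j) {
            out[i * m + j] = (in[i * m + j]
                              + in[(i - 1) * m + j] + in[(i + 1) * m + j]
                              + in[i * m + j - 1] + in[i * m + j + 1]) / 5.0;
        }
    }
    return out;
}

// Number of core elements induced by a halo of width h on every side.
int core_size(int n, int m, int h) {
    return (n > 2 * h && m > 2 * h) ? (n - 2 * h) * (m - 2 * h) : 0;
}
```

Because the result is written to a separate output array, the order in which core elements are visited does not matter here; an in-place update would make the iteration order part of the computation.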
Applying function
Induced Core Space
Definitions
• Function is called Stencil Operator
• Stencil operator is applied to a core element
• Stencil/Shape: the minimum enclosing
polyhedron containing the cells around the
core which are accessed by the operator
• Shape minus the core is the set of halo cells
• Stencil operator reads from core and halo cells
(read set) and writes into the core (write set)
Think parallel
• For loops are not in the jargon of GSL/SEL
• The order of application of the stencil
operator is a partial order
– Partial orders are DAGs = not easy
• But our DAGs typically have structure
– do_all: no order specified
– do_i, do_j, do_k: increment i, j, or k
– do_diamond: ensure (i-1,j-1,k-1) are computed
– Etc.
• We call them iteration spaces
From Halos to Ghosts
• Given a shape and its halo we can derive
implementations for different architectures
– Sequential: identify the best iteration matching
• requirements (do_all, do_diamond, …) and
• layout (e.g., exploiting locality)
– Parallel: passing from halo to ghost cells is simple
• MPI implementation with domain decomposition and
ghost cell exchange (PRAM to MPI)
– GPUs
• Using aggressive multithreading of graphics cards
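The step from halo to ghost cells can be sketched without MPI. The code below is an illustration under assumptions, not part of GSL: a 3-point sum stencil on a 1D array is computed once sequentially and once on two subdomains that each carry one ghost cell per side and exchange them before computing. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: out[i] = in[i-1] + in[i] + in[i+1] on the core 1..n-2.
std::vector<int> stencil_sequential(const std::vector<int>& in) {
    std::vector<int> out(in);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
    return out;
}

// Decomposed version: two subdomains, each stored as [ghost | owned... | ghost].
// The halo width (1) of the stencil becomes the ghost width of each subdomain.
std::vector<int> stencil_decomposed(const std::vector<int>& in) {
    const std::size_t n = in.size(), half = n / 2;
    std::vector<int> left(half + 2, 0), right(n - half + 2, 0);
    for (std::size_t i = 0; i < half; ++i) left[i + 1] = in[i];
    for (std::size_t i = half; i < n; ++i) right[i - half + 1] = in[i];

    // Ghost exchange (this is what an MPI send/receive pair would do).
    left[half + 1] = right[1];  // first owned element of the right subdomain
    right[0] = left[half];      // last owned element of the left subdomain

    std::vector<int> out(in);
    for (std::size_t i = 1; i < half; ++i)  // skip the global boundary i == 0
        out[i] = left[i] + left[i + 1] + left[i + 2];
    for (std::size_t i = half; i + 1 < n; ++i) {  // skip the boundary i == n-1
        if (i == 0) continue;
        const std::size_t l = i - half + 1;  // local index of global element i
        out[i] = right[l - 1] + right[l] + right[l + 1];
    }
    return out;
}
```

Both versions produce the same result, which is the property a ghost cell exchange has to preserve.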
GSL Concepts
• Generic Programming Approach
– Decoupling of algorithms, data, operators
• Our classification:
– Storage: Area of memory with data
– Grid: Representation of storage as D-dims grid
– Stencil operator: Function to apply to grid elements
– Shape: Area accessed by the operator
– Iteration Space: Partial order required by algorithm
Storage
• Storage class abstracts a 1D contiguous
address space
• T* is also accepted, with T being an arbitrary
data type
• Special traits to be used for specialization and
customization
– Special instructions at beginning/end of loops
– Special care at beginning/end of computation
Stencils/Shapes
• Template class that specifies the extension of the halo
– E.g., a 3D shape specifies
• hiu: number of cells in which halo extends on elements preceding the
core (indices less than core)
• hid: number of cells in which halo extends on elements following the
core (indices greater than core)
• hju: …
• Constructor that takes a pointer to a grid and coordinates
– Set the core pointer
• Methods to access elements around the core in the halo
region (for reading)
– value_type const & operator()(int, int, int) const;
• Methods to access the core element for read/write
• Methods for modifying the core pointer (moving the shape)
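A minimal 2D sketch of such a shape class follows. It is loosely modelled on the bullets above; `toy_shape2d`, `move_to`, and the explicit row stride are assumptions for illustration and are not GSL's actual interface.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D shape. HIU/HID/HJU/HJD are the halo extents before and
// after the core along i and j, in the spirit of hiu/hid/hju described above.
template <typename T, int HIU, int HID, int HJU, int HJD>
class toy_shape2d {
    T* core_;         // pointer to the core element
    int row_stride_;  // distance in elements between consecutive rows
public:
    // Set the core pointer from a base pointer and coordinates.
    toy_shape2d(T* base, int row_stride, int i, int j)
        : core_(base + i * row_stride + j), row_stride_(row_stride) {}

    // Read-only access to the core or halo cell at a fixed offset.
    T const& operator()(int di, int dj) const {
        assert(-HIU <= di && di <= HID && -HJU <= dj && dj <= HJD);
        return core_[di * row_stride_ + dj];
    }

    // Read/write access to the core element.
    T& operator()() { return *core_; }
    T const& operator()() const { return *core_; }

    // Move the shape so that its core is element (i, j) of the same storage.
    void move_to(T* base, int i, int j) { core_ = base + i * row_stride_ + j; }
};
```

Making the offset accessor return a const reference is what enforces "read from the halo, write only into the core" at compile time.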
Stateful Stencils
• Additionally provide methods to obtain
– Index of the core element in the Grid
– Global index of the core in case the Grid is a
subgrid of a bigger grid
• Additional flexibility at the cost of
– Memory usage
– Performance (actual tests tend to confirm the
impact is visible only if index methods are used)
Grids=Storage+Shape
• GSL::Grid3D<stencil_3x3x3, double*, GSL::ijk>
• The shape/stencil is specified without
template arguments
– Ease of specification
• Second argument is the storage type
• Then comes the layout argument
– Specifies how data would be traversed by a
minimal stride loop
– ijk means the loop would be a ‘for i, for j, for k’
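One way to read the layout argument is as a rule for turning (i, j, k) into a linear offset into the storage. The sketch below assumes that `ijk` makes k the unit-stride index (innermost loop); the tag types and `linear_index` are hypothetical names, not GSL's.

```cpp
#include <cassert>

// Hypothetical layout tags for illustration only.
struct layout_ijk {};  // 'for i, for j, for k': k is the unit-stride index
struct layout_kji {};  // 'for k, for j, for i': i is the unit-stride index

// Linear offset of element (i, j, k) in an nx x ny x nz grid.
inline int linear_index(layout_ijk, int i, int j, int k, int nx, int ny, int nz) {
    (void)nx;  // the outermost extent does not enter the offset
    return (i * ny + j) * nz + k;
}

inline int linear_index(layout_kji, int i, int j, int k, int nx, int ny, int nz) {
    (void)nz;  // the outermost extent does not enter the offset
    return (k * ny + j) * nx + i;
}
```

With such a mapping, a loop nest that follows the layout order touches consecutive memory locations in its innermost loop.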
Subgrids and Regions
• Regions are tuples specifying initial corners
and sizes of a subgrid in a grid.
• Given a region
– Obtain a subgrid
• grid.subgrid(region)
• Result type is the same as grid
– Obtain a re-shaped subgrid
• grid.reshape<newstencil>(region)
• Result type is grid type with a different shape
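A region can be sketched as an (origin, size) tuple and a subgrid as a view that shares the parent's storage. `toy_region2d` and `toy_grid2d` below are hypothetical types for illustration; GSL's real `subgrid` and `reshape` signatures may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical region: initial corner (i0, j0) and sizes (ni, nj).
struct toy_region2d { int i0, j0, ni, nj; };

// Hypothetical non-owning 2D grid view over row-major storage.
struct toy_grid2d {
    double* base;    // first element of the view
    int row_stride;  // row stride of the underlying storage
    int ni, nj;      // extents of the view

    double& at(int i, int j) const {
        assert(0 <= i && i < ni && 0 <= j && j < nj);
        return base[i * row_stride + j];
    }

    // The subgrid has the same type as the grid and aliases the same memory.
    toy_grid2d subgrid(const toy_region2d& r) const {
        assert(r.i0 >= 0 && r.j0 >= 0 && r.i0 + r.ni <= ni && r.j0 + r.nj <= nj);
        toy_grid2d s = { base + r.i0 * row_stride + r.j0, row_stride, r.ni, r.nj };
        return s;
    }
};
```

Writes through the subgrid are visible in the parent grid, since only the origin and extents change.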
Stencil Operators
• Function objects with additional traits to
specify useful characteristics
struct stencil_operator_eq_copy {
    typedef bool result_type;
    template <typename St1, typename St2>
    bool operator()(St1 const &u, St2 &v) const {
        bool res = (u() == v()); // Equal?
        v()=u(); // Copy
        return res;
    }
};
Note that ‘u’ cannot be written while ‘v’ can
Iteration Spaces
• Iteration spaces specify requirements!
• Available in GSL (in decreasing parallelism)
– do_all: Visit all the (core) elements
– do_reduce: Visit all elements and compute a
reduction on values returned by operator
– do_i_inc, do_j_inc, do_k_inc: Ensure (i,j,k) is
processed only after (i-1,j,k), (i,j-1,k), or (i,j,k-1),
respectively, has been processed
– do_i_dec, do_j_dec, do_k_dec: analogous
– do_diamond: Ensure (i-1, j-1, k-1) is processed before
(i,j,k)
• Diamond is much less parallel than do_all
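The difference between these requirements can be illustrated with 1D analogues. The functions below are hypothetical and are not the GSL implementation; `toy_do_all` deliberately visits elements backwards to stress that it promises no order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// do_all analogue: no ordering is promised, so any visiting order is legal.
template <typename Op>
void toy_do_all(std::vector<int>& g, Op op) {
    for (std::size_t i = g.size(); i-- > 0;) op(g, i);  // backwards on purpose
}

// do_i_inc analogue: element i is visited only after element i-1.
template <typename Op>
void toy_do_i_inc(std::vector<int>& g, Op op) {
    for (std::size_t i = 0; i < g.size(); ++i) op(g, i);
}

// do_reduce analogue: visit all elements and combine the returned values.
template <typename Op, typename Red>
bool toy_do_reduce(std::vector<int>& g, Op op, Red red, bool init) {
    bool acc = init;
    for (std::size_t i = 0; i < g.size(); ++i) acc = red(acc, op(g, i));
    return acc;
}

// An operator with a loop-carried dependency on element i-1 (prefix sum).
struct prefix_sum_op {
    void operator()(std::vector<int>& g, std::size_t i) const {
        if (i > 0) g[i] += g[i - 1];
    }
};

// An operator without dependencies, usable in a reduction.
struct is_positive_op {
    bool operator()(std::vector<int>& g, std::size_t i) const { return g[i] > 0; }
};
```

Running `prefix_sum_op` under `toy_do_i_inc` yields the prefix sums, while running it under `toy_do_all` gives a different result: the operator's requirement, not the loop, determines which iteration space is valid.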
Iteration Spaces Implementation
• General First – Specialize Later
– Basic implementation without much care for performance
template <typename Grid, typename Operator>
void ARCH::do_all(Grid const &g, Operator const &op) {
    for(int i = 0; i < g.nx(); ++i)
        for(int j = 0; j < g.mx(); ++j)
            for(int k = 0; k < g.lx(); ++k) {
                typename Grid::stencil_type s(&g, i, j, k);
                op(s);
            }
}
– Specializations can be provided for specific cases and/or
applications*
*Code can look ugly (for several reasons: reduce redundancy, improve performance,…), but this is internal code,
not seen at top level, which is provided by the library developers, or by an advanced user after the basic code is up.
#define MACRO_IMPL(z, n, _) \
template <BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), typename _Grid), typename _stencil_operator> \
struct sequential::_DO_(all, n)<BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), _Grid), _stencil_operator,\
    typename boost::enable_if< \
        GSL::same_major_with_base<typename GSL::_3D_major TYPE_CHK(BOOST_PP_INC(n))> \
    >::type > \
    : GSL::nary_loop<void, BOOST_PP_INC(n)> \
{ \
    TYPE_INST(BOOST_PP_INC(n)) \
    typedef typename boost::remove_reference<_stencil_operator>::type stencil_op; \
    typedef typename Grid0::major_type major_type; \
    \
    void operator()( BOOST_PP_ENUM_BINARY_PARAMS_Z(z, BOOST_PP_INC(n), Grid, const &grid), stencil_op const &sten_op)\
    { \
        assert(_impl::check_grids3D(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), grid) )); \
        boost::tuple<int, int, int> bounds(grid0.nx(), grid0.mx(), grid0.lx()); \
        int i, j, k; \
        boost::tuple<int&, int&, int&> indices(i, j, k); \
        int & i1 = boost::get<major_type::_3D_outer_dimension>(indices); \
        int & i2 = boost::get<major_type::_3D_middle_dimension>(indices); \
        int & i3 = boost::get<major_type::_3D_inner_dimension>(indices); \
        const int N1 = boost::get<major_type::_3D_outer_dimension>(bounds); \
        const int N2 = boost::get<major_type::_3D_middle_dimension>(bounds); \
        const int N3 = boost::get<major_type::_3D_inner_dimension>(bounds); \
        i3 = 0; \
        int NN = N3; \
        for (i1 = 0; i1 < N1; ++i1) { \
            for (i2 = 0; i2 < N2; ++i2) { \
                STEN_INST(BOOST_PP_INC(n)) \
                for (int ii=0; ii < NN; ++ii) { \
                    sten_op(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n),stencil));\
                    STEN_INC(BOOST_PP_INC(n)) \
                } \
            } \
        } \
    } \
};
BOOST_PP_REPEAT(GSL_MAX_GRIDS, MACRO_IMPL, nil)
An example: Averaging in 3D
• Step 1: storage and init
struct init_f {
    template <typename St>
    void operator()(St &u) const {
        int i, j;
        get_index(i,j);
        if ( (i%2) && (j%2) )
            u()=1.0;
        else
            u()=0.0;
    }
};
double* storage = new double[n * m];
double* storage_bef = new double[n * m];
// The rest of the template arguments are defaulted to double* and ij
Grid2D<stencil_1x1_stateful> grid1x1(storage, n, m);
Grid2D<stencil_1x1_stateful> grid1x1_bef(storage_bef, n, m);
// We use do_all since there are no loop-carried dependencies.
// The difference (sequential vs. openmp) is only for illustration
do_all<sequential>( grid1x1, init_f() );
do_all<openmp>( grid1x1_bef, init_f() );
An example: Averaging in 3D
• Preparing data
Grid2D< stencil_3x3 > grid(storage, n, m);
region2D region(1,1,n-1,m-1);
Grid2D< stencil_1x1 > grid_bef = grid1x1_bef.reshape<stencil_1x1>(region);
• We need region since otherwise the core spaces of grid and grid_bef are not the same. An assert is raised if they differ
• Reshape is needed since we change the type of stencil from stateful to stateless
• A possibility, which is implemented in 2D, is to intersect the core spaces automatically, at the cost of little overhead, but also of increased abstraction, which may, imho, be difficult for a user to track down. This may be a better trade-off between abstraction and actual implementation
An example: Averaging in 3D
struct stencil_operator_3x3_avg
    : unary_op { // Used for fusing operations
    template <typename Stencil>
    void operator()(Stencil &u) const {
        u() = 1.0/9.0 *
            (u()+u(-1,-1)+u(-1, 0)+u(-1, 1)
                +u(0,-1) +u(0, 1)
                +u(1,-1) +u(1, 0) +u(1, 1));
    }
};
bool res=true;
// One can advocate for incorporating the outer loop in a construct, too
do {
    res = do_reduce<cuda>
        ( grid
        , grid_bef
        // fused_operator is simply a struct that executes the two operators sequentially
        , fused_operator( stencil_operator_3x3_avg()
                        // A variation of the operator we saw before
                        , stencil_op_eq_copy(EPSI))
        , std::logical_and<bool>());
} while (!res);
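A fusing struct of this kind could look roughly like the sketch below. It works on plain `double` references rather than stencils, and `toy_fused`, `toy_fuse`, `toy_halve`, and `toy_eq_copy` are hypothetical names; GSL's `fused_operator` is not shown in the slides and may be implemented differently.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical fused operator: applies 'first' to u, then 'second' to (u, v),
// and returns the result of 'second' so it can feed a reduction.
template <typename Op1, typename Op2>
struct toy_fused {
    typedef bool result_type;
    Op1 first;
    Op2 second;
    toy_fused(Op1 a, Op2 b) : first(a), second(b) {}
    bool operator()(double& u, double& v) const {
        first(u);
        return second(u, v);
    }
};

// Helper that deduces the operator types.
template <typename Op1, typename Op2>
toy_fused<Op1, Op2> toy_fuse(Op1 a, Op2 b) { return toy_fused<Op1, Op2>(a, b); }

// Stand-in for the averaging operator: any unary update of the core works.
struct toy_halve {
    void operator()(double& u) const { u *= 0.5; }
};

// Tolerance-based variant of compare-then-copy.
struct toy_eq_copy {
    double eps;
    explicit toy_eq_copy(double e) : eps(e) {}
    bool operator()(double const& u, double& v) const {
        bool res = std::fabs(u - v) <= eps;  // equal within tolerance?
        v = u;                               // copy
        return res;
    }
};
```

Fusing the update and the convergence check means each element is visited once per sweep instead of twice.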
Tips and Tricks
• GSL tries to reduce the amount of platform
specific considerations
• The choice to fuse operators and loops should
be guided by considerations of functionality,
but it has a lot of repercussions on performance
• To overcome this problem we can automate
loop fusion by using SEL
SEL: Stencil Embedded Language
• A prototype of a DSEL for combining stencil
computations
do_all(grd3x3x3, avg) + do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and)
• Meaning: “perform a do_all averaging on a
3x3x3 grid followed by a reduction to check
correctness”
• Since the same grid appears in both loops, the
DSEL fuses the computation into
do_reduce(grd3x3x3, grd_1x1x1_bf, fuse(avg, equ), l_and)
From Imperative to Declarative
• When using SEL the user adopts a declarative
approach
– Specification + Information
eval(do_all(grd3x3x3, avg)
+ do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and)
, context)
– All arguments are placeholders
• Needed to postpone execution – lazy evaluation
• Allow symbolic analysis of programs
• Downside: expressions tend to grow in size
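The idea of placeholders plus a context can be sketched with a toy embedded language. This is not SEL: `toy_expr`, `toy_do_all`, `toy_context`, and `toy_eval` are hypothetical, and a real DSEL would encode the expression in types rather than in a runtime vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An expression only records which operator placeholders (indices) to run;
// nothing executes until toy_eval is called.
struct toy_expr {
    std::vector<int> op_indices;  // placeholders, in program order
};

// Build a one-step expression from an operator placeholder.
inline toy_expr toy_do_all(int op_index) {
    toy_expr e;
    e.op_indices.push_back(op_index);
    return e;
}

// Sequencing: 'a + b' means "a followed by b".
inline toy_expr operator+(toy_expr a, const toy_expr& b) {
    a.op_indices.insert(a.op_indices.end(), b.op_indices.begin(), b.op_indices.end());
    return a;
}

typedef void (*toy_op)(std::vector<int>&);

// The context binds placeholders to real operators and real data.
struct toy_context {
    std::vector<toy_op> ops;
    std::vector<int>* data;
};

// Evaluation walks the recorded program and only now touches the data.
inline void toy_eval(const toy_expr& e, const toy_context& ctx) {
    for (std::size_t s = 0; s < e.op_indices.size(); ++s)
        ctx.ops[e.op_indices[s]](*ctx.data);
}

inline void toy_add_one(std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += 1;
}

inline void toy_double(std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= 2;
}
```

Because the whole program exists as a value before it runs, an evaluator is free to inspect it first, for example to detect that two steps use the same grid and could be fused.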
Lunch Bill
• Need to associate placeholders to real data
// Define grids as before
Grid3D< stencil_3x3x3 > grid3x3x3(storage, n, m, l);
Grid3D< stencil_1x1x1 > grid1x1x1_bef(storage_bef, n, m, l);
// Map grids and operators (types and values) to indices (in this case of vectors)
typedef fvector<Grid3D<stencil_3x3x3>, Grid3D<stencil_1x1x1> > GVEC;
typedef fvector<operator_avg, operator_eq, std::logical_and<bool> > OVEC;
GVEC Gvec(grid3x3x3, grid1x1x1_bef);
OVEC Ovec(operator_avg(), operator_eq(1.0e-4), std::logical_and<bool>());
// Giving the execution engine information about these maps (and about implementation to use)
SEL_context<cuda,GVEC,OVEC> context(Gvec, Ovec);
// Associating to indices some mnemonic symbols (placeholders) to be used in expressions
MAKE_GRID(0, grd3x3x3);
MAKE_GRID(1, grd_1x1x1_bf);
MAKE_OPER(0, avg);
MAKE_OPER(1, equ);
MAKE_OPER(2, l_and);
template <int I>
struct _grid {};
template <int I>
struct _oper {};
• There is boilerplate code that can be reduced (e.g., by using macros), but with some potential drawbacks.
GSL vs SEL
[Figure: performance comparison of GSL BASE, SEL BASE, and SEL FUSION]
• Default inlining can get worse from beginning to end
• Aggressive inlining does not guarantee the best performance
• SEL allows loop fusion with no penalty
DSEL Considerations
• SEL is for loop constructs
– Useful to analyze macrostructures
– Not drastic change w.r.t. GSL for syntax
• More semantics can be available (e.g., loop fusion)
• DSEL for stencil operators
– More than merely syntax embellishment
– More semantics is the golden rule
• Perform transformations the user may not be aware of
• Enable auto-tuning at expression level
Sensitivity to expression writing
• Given an expression like
u(0,0,0) = 1./7.0 * (u(0,0,0)+u(1,0,0) +u(0,1,0) +u(0,0,1)
+u(-1,0,0)+u(0,-1,0)+u(0,0,-1))
• Can be written in (at least) 5040 different ways
– Does it matter which way?
– If yes, how much?
• In this case a factor 2!
[Figure: time (ms) per implementation (sorted by time), 400x400x400 3D grid of doubles]
In this case it was easy (after analyzing the permutations):
each big step corresponds to moving u(0,0,-1) to the right!
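The figure 5040 is 7!, the number of orderings of the seven operands of the sum (different parenthesizations would add more, hence "at least"). A small check, with `count_orderings` as a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Count the distinct orderings of 'terms' operands by enumerating all
// permutations of their indices.
long count_orderings(int terms) {
    std::vector<int> idx(terms);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., terms-1 (sorted)
    long count = 0;
    do {
        ++count;
    } while (std::next_permutation(idx.begin(), idx.end()));
    return count;
}
```

The same enumeration could drive an auto-tuner that generates and times each ordering of the operands.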
A little more formally
• Given the set A={+,-,~} we can define a
descriptor (a,b,c) where a,b,c belong to A
• A storage has a Fastest Iteration Order (FIO)
– The iteration order that guarantees fastest scan of
elements
An apparent redundancy
• Specifying stencil operators defines stencil
access, but not for multi grid operators