Tile Reduction: the first step towards tile aware parallelization in OpenMP

Ge Gan
Department of Electrical and Computer Engineering
University of Delaware
Overview

• Background
• Motivation
• A new idea: Tile Reduction
• Experimental Results
• Conclusion
• Related Work
• Future Work
Tile and Tiling
• A tile is a natural representation of the array-based data objects heavily used in scientific algorithms.
• Tiling improves data locality.
• Tiling can increase parallelism and reduce synchronization in parallel programs.
• It is an effective compiler optimization technique.
• Essentially, it is a program design paradigm.
• Supported in many parallel programming languages: ZPL, CAF, HTA, etc.
OpenMP
• OpenMP is the de facto standard for shared-memory parallel programming.
• Provides a simple and flexible interface for developing portable and scalable parallel applications.
• Supports incremental parallelization.
• Maintains sequential consistency.
• OpenMP is "tile oblivious": no directive or clause can be used to annotate a data tile and carry such information to the compiler.
A Motivating Example
Parallelizing: the traditional way (1)
Parallelizing: the traditional way (2)

• Can only leverage the traditional scalar reduction in OpenMP
• Parallelism is trivial
• Data locality is not good
• Not natural and intuitive
The Expected Parallelization
• View the innermost two loops as a macro operation performed on the 2x2 data tiles
• Aggregate the data tiles in parallel
• More parallelism
• Better data locality
Tile Reduction Interface
Terms
• Reduction tile: the data tile under reduction
• Tile descriptor: the "multi-dimensional array" in the reduction clause's list
• Reduction kernel loops: the loops involved in performing "one" recursive calculation
• Tile name: the variable name leading the tile descriptor
• Dimension descriptor: the tuples that follow the tile name in the tile descriptor
A Use Case: Tiled Matrix Multiplication

Tile reduction applied to the tiled matrix multiplication code
Code Generation (1)
• Step 1: distribute the iterations of the parallelized loop among the threads
• Step 2: allocate memory for the private copy of the tile used in the local recursive calculation
• Step 3: perform the local recursive calculation specified by the reduction kernel loops
• Step 4: update the global copy of the reduction tile
Code Generation (2)
Experimental Results (1)
2D Histogram Reduction
Experimental Results (2)
Matrix-Matrix Multiplication
Experimental Results (3)
Matrix-Vector Multiplication
Conclusions
• As one of the building blocks of the tile-aware parallelization theory, tile reduction brings more opportunities to parallelize dense matrix applications.
• For some benchmarks, tile reduction is a more natural and intuitive way to reason about the best parallelization decision.
• For some benchmarks, tile reduction not only improves data locality but also exposes more parallelism.
• The interface is user friendly.
• Code generation is as simple as that for scalar reduction in the current OpenMP.
• Runtime overhead is trivial.
Related Work

Parallel reduction is supported in:

• C**: Viswanathan, G., Larus, J.R.: User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison (Jan 1996)
• SAC: Scholz, S.B.: On defining application-specific high-level array operations by means of shape invariant programming facilities. In: APL '98: Proceedings of the APL98 Conference on Array Processing Language, New York, NY, USA, ACM (1998) 32–38
• ZPL: Deitz, S.J., Chamberlain, B.L., Snyder, L.: High-level language support for user-defined reductions. J. Supercomput. 23(1) (2002) 23–37
• UPC: UPC Consortium: UPC Collective Operations Specifications V1.0. A publication of the UPC Consortium (2003)
• MPI: MPI Forum: MPI: A message-passing interface standard (version 1.0). Technical report (May 1994). URL http://www.mcs.anl.gov/mpi/mpi-report.ps
• Kambadur, P., Gregor, D., Lumsdaine, A.: OpenMP extensions for generic libraries. In: Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism, IWOMP'08, International Workshop on OpenMP. Volume 5004/2008, Springer Berlin/Heidelberg (2008) 123–133
Future Work

• Design and develop OpenMP pragma directives that can be used to help the compiler generate efficient data movement code for parallel applications running on many-core platforms with highly non-uniform memory systems, like the Cyclops-64 processor