Tile Reduction: The First Step Towards Tile-Aware Parallelization in OpenMP
Ge Gan
Department of Electrical and Computer Engineering, University of Delaware

Overview
• Background
• Motivation
• A New Idea: Tile Reduction
• Experimental Results
• Conclusion
• Related Work
• Future Work

Tile and Tiling
• A tile is a natural representation of the array-based data objects that are heavily used in scientific algorithms.
• Tiling improves data locality.
• Tiling can increase parallelism and reduce synchronization in parallel programs.
• It is an effective compiler optimization technique.
• Essentially, it is a program design paradigm.
• It is supported in many parallel programming languages: ZPL, CAF, HTA, etc.

OpenMP
• OpenMP is the de facto standard for shared-memory parallel programming.
• It provides a simple and flexible interface for developing portable and scalable parallel applications.
• It supports incremental parallelization.
• It maintains sequential consistency.
• OpenMP is "tile oblivious": no directive or clause can be used to annotate a data tile and carry such information to the compiler.

A Motivating Example

Parallelizing: The Traditional Way (1)

Parallelizing: The Traditional Way (2)
• Can only leverage the traditional scalar reduction in OpenMP.
• Parallelism is trivial.
• Data locality is poor.
• Not natural or intuitive.

The Expected Parallelization
• View the innermost two loops as one macro operation performed on the 2x2 data tiles.
• Aggregate the data tiles in parallel.
• More parallelism.
• Better data locality.

Tile Reduction Interface

Terms
• Reduction tile: the data tile under reduction.
• Tile descriptor: the "multi-dimensional array" in the list construct.
• Reduction kernel loops: the loops involved in performing "one" recursive calculation.
• Tile name: the variable name leading the tile descriptor.
• Dimension descriptor: the tuples that follow the tile name in the tile descriptor.

A Use Case: Tiled Matrix Multiplication

Tile Reduction Applied to the Tiled Matrix
Multiplication Code

Code Generation (1)
• Step 1: Distribute the iterations of the parallelized loop among the threads.
• Step 2: Allocate memory for the private copy of the tile used in the local recursive calculation.
• Step 3: Perform the local recursive calculation, which is specified by the reduction kernel loops.
• Step 4: Update the global copy of the reduction tile.

Code Generation (2)

Experimental Results (1): 2D Histogram Reduction

Experimental Results (2): Matrix-Matrix Multiplication

Experimental Results (3): Matrix-Vector Multiplication

Conclusions
• As one of the building blocks of the tile-aware parallelization theory, tile reduction brings more opportunities to parallelize dense matrix applications.
• For some benchmarks, tile reduction is a more natural and intuitive way to reason about the best parallelization decision.
• For some benchmarks, tile reduction not only improves data locality but also exposes more parallelism.
• The interface is user-friendly.
• Code generation is as simple as that for scalar reduction in current OpenMP.
• Runtime overhead is trivial.

Similar Works
Parallel reduction is supported in:
• C**: Viswanathan, G., Larus, J.R.: User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison (Jan 1996).
• SAC: Scholz, S.B.: On defining application-specific high-level array operations by means of shape invariant programming facilities. In: APL '98: Proceedings of the APL98 Conference on Array Processing Language, New York, NY, USA, ACM (1998) 32–38.
• ZPL: Deitz, S.J., Chamberlain, B.L., Snyder, L.: High-level language support for user-defined reductions. J. Supercomput. 23(1) (2002) 23–37.
• UPC: UPC Consortium: UPC Collective Operations Specifications V1.0. A publication of the UPC Consortium (2003).
• MPI: Message Passing Interface Forum: MPI: A Message-Passing Interface Standard (Version 1.0). Technical report (May 1994). http://www.mcs.anl.gov/mpi/mpi-report.ps
• OpenMP: Kambadur, P., Gregor, D., Lumsdaine, A.: OpenMP extensions for generic libraries. In: OpenMP in a New Era of Parallelism, IWOMP '08: International Workshop on OpenMP. Lecture Notes in Computer Science, Vol. 5004, Springer Berlin/Heidelberg (2008) 123–133.

Future Works
• Design and develop OpenMP pragma directives that can help the compiler generate efficient data-movement code for parallel applications running on many-core platforms with highly non-uniform memory systems, such as the Cyclops-64 processor.
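The four code-generation steps listed earlier (distribute iterations, allocate a private tile, run the reduction kernel loops locally, merge into the global tile) can be sketched by hand as a small C fragment. This is our illustrative reconstruction, not code from the talk: the names (tile_reduce, global, tiles) and the sizes T and NT are assumptions, and the clause shown in the comment reflects the proposed tile-reduction syntax, which is not standard OpenMP.

```c
#include <string.h>

#define T  2   /* reduction tile is 2x2, as in the motivating example */
#define NT 8   /* number of tiles to aggregate (illustrative choice)  */

/* Hand-written equivalent of what the compiler would emit for the
 * proposed clause (sketched syntax, not standard OpenMP):
 *   #pragma omp parallel for reduction(+ : global[0:T][0:T])
 */
static void tile_reduce(double global[T][T], double tiles[NT][T][T])
{
    #pragma omp parallel
    {
        /* Step 2: private copy of the reduction tile, initialized to
         * the identity of the + operator. */
        double priv[T][T];
        memset(priv, 0, sizeof priv);

        /* Step 1: distribute the iterations of the parallelized loop
         * among the threads. */
        #pragma omp for nowait
        for (int k = 0; k < NT; k++)
            /* Step 3: local recursive calculation, i.e. the
             * reduction kernel loops applied to the private tile. */
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++)
                    priv[i][j] += tiles[k][i][j];

        /* Step 4: update the global copy of the reduction tile. */
        #pragma omp critical
        for (int i = 0; i < T; i++)
            for (int j = 0; j < T; j++)
                global[i][j] += priv[i][j];
    }
}
```

The fragment also degrades gracefully: a compiler that ignores the OpenMP pragmas produces a correct sequential reduction. Later OpenMP versions (4.5 and up) standardized reductions over C array sections, e.g. reduction(+ : a[0:n]), which covers much of this use case in a similar spirit.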