
Automatic Differentiation: Introduction
• Automatic differentiation (AD) is a technology for
transforming a subprogram that computes some function
into a subprogram that computes the derivatives of that
function
• Derivatives used in optimization, nonlinear solvers,
sensitivity analysis, uncertainty quantification
• Forward mode of AD is efficient for problems with few
independent variables or Jacobian-vector products
• Reverse mode of AD is efficient for problems with few
dependent variables or transposed Jacobian-vector (Jᵀv) products
• Efficiency of generated code depends on sophistication of
underlying compiler analysis and combinatorial algorithms
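To make the forward-mode bullet concrete, here is a minimal C++ sketch of forward-mode AD using dual numbers and operator overloading. The function f and the seeding are illustrative only; the tools discussed below (ADIFOR, ADIC, OpenAD/F) work by source transformation rather than operator overloading, but they propagate derivatives by the same chain rule.

// Minimal forward-mode AD sketch using dual numbers (operator overloading).
// Source-transformation tools generate derivative code instead, but the
// propagation rule illustrated here is the same.
#include <cmath>
#include <cstdio>

struct Dual {
    double v;  // value
    double d;  // derivative with respect to the chosen independent variable
};

Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// f(x, y) = x*y + sin(x); a made-up example function.
Dual f(Dual x, Dual y) { return x * y + sin(x); }

int main() {
    // Seed dx/dx = 1, dy/dx = 0 to get the partial derivative df/dx at (2, 3).
    Dual x{2.0, 1.0}, y{3.0, 0.0};
    Dual r = f(x, y);
    std::printf("f = %g, df/dx = %g\n", r.v, r.d);   // df/dx = y + cos(x)
    return 0;
}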
AD: Current Capabilities
• Fortran 77: ADIFOR 2.0/3.0
– Robust, mature tool with excellent language coverage
– Excellent compiler analysis
– Efficient forward mode (small number of independents)
– Adequate reverse mode (small number of dependents)
• C/C++: ADIC 2.0
– Semi-mature tool with full C language coverage
– Sophisticated differentiation algorithms
– Efficient forward mode
• Fortran 90: OpenAD/F
– New tool with partial language coverage
– Sophisticated differentiation algorithms
– Accurate and novel compiler analysis
– Innovative templating mechanism
– Efficient forward and reverse modes
AD: Application Highlight
Sensitivity of flow through the Drake Passage to bottom topography, using the MIT shallow water model
                          Runtime (m:s)    Ratio     Memory
Simulation alone                2:20         1.0        —
Basic adjoint                 143:37        61.6      6.87M
Improved checkpointing        141:20        60.6     21.44M
Add compiler analysis          21:51         9.4      3.17M
Finite differences           23 days      14,400        —
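The checkpointing rows reflect the standard trade between adjoint memory and recomputation. Below is a toy C++ sketch of that idea for a scalar time-stepping loop: keep every k-th state during the forward sweep and recompute intermediate states during the reverse sweep. The step function, step count, and checkpoint spacing are made up for illustration; OpenAD's actual checkpointing schedules are far more sophisticated.

// Schematic sketch of the memory/recomputation trade-off behind checkpointing
// in reverse-mode AD, on a toy scalar time-stepping loop.
#include <cmath>
#include <cstdio>
#include <map>

const double dt = 0.01;

// One forward time step of a toy scalar model, and its adjoint (chain rule).
double step(double s)                   { return s + dt * std::sin(s); }
double adjoint_step(double s, double a) { return (1.0 + dt * std::cos(s)) * a; }

// Reverse sweep with checkpoints every k steps: returns d(s_N)/d(s_0).
double reverse_sweep(double s0, int nsteps, int k) {
    std::map<int, double> checkpoints;           // sparse storage of states
    double s = s0;
    for (int i = 0; i < nsteps; ++i) {           // forward sweep
        if (i % k == 0) checkpoints[i] = s;
        s = step(s);
    }
    double a = 1.0;                              // seed: d(s_N)/d(s_N) = 1
    for (int i = nsteps - 1; i >= 0; --i) {      // reverse sweep
        int c = (i / k) * k;                     // nearest earlier checkpoint
        double si = checkpoints[c];
        for (int j = c; j < i; ++j) si = step(si);  // recompute state before step i
        a = adjoint_step(si, a);
    }
    return a;
}

int main() {
    std::printf("d s_N / d s_0 = %g\n", reverse_sweep(1.0, 1000, 50));
    return 0;
}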
AD: Future Capabilities
• C/C++: ADIC 2.x
– Enhanced support for C++ (basic templating, operator
overloading)
• Fortran 90: OpenAD/F
– Improved language coverage (user-defined types, pointers, etc.)
• Both tools
– New differentiation algorithms
– New checkpointing mechanisms
– Advanced compiler analysis
– Efficient forward and reverse modes
– Integration with CSCAPES coloring algorithms
– Ease of use through integration with PETSc and Zoltan toolkits
Load Balancing: Introduction
Goals:
• Provide software and algorithms for load balancing
(partitioning) that can easily be used by parallel
applications.
• Load balancing: distribute work evenly among processors
while minimizing communication cost. Reduces parallel
run time.
• Static load balancing (often called “partitioning”)
– Application computation and communication patterns do not
change
– Partition and distribute data once
• Dynamic load balancing
– In dynamic or adaptive applications, computation and
communication change over time.
– Load balancing should be invoked at certain intervals.
– Try to reduce data migration (application data to move)
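As a concrete reading of "distribute work evenly," the sketch below computes the usual imbalance metric, maximum processor load divided by average load, for a hypothetical assignment of weighted objects to parts. A balancer tries to drive this ratio toward 1.0 while also limiting communication and (in the dynamic case) data migration.

// Minimal sketch of the load-imbalance metric a partitioner tries to drive
// toward 1.0. Object weights and the part assignment are made-up inputs.
#include <algorithm>
#include <cstdio>
#include <vector>

double imbalance(const std::vector<double>& weight,  // work per object
                 const std::vector<int>& part,       // owning processor per object
                 int nparts) {
    std::vector<double> load(nparts, 0.0);
    for (size_t i = 0; i < weight.size(); ++i) load[part[i]] += weight[i];
    double total = 0.0, maxload = 0.0;
    for (double l : load) { total += l; maxload = std::max(maxload, l); }
    return maxload / (total / nparts);               // 1.0 = perfectly balanced
}

int main() {
    std::vector<double> w = {1, 2, 1, 4, 1, 1};
    std::vector<int>    p = {0, 0, 1, 1, 0, 1};      // 6 objects on 2 processors
    std::printf("imbalance = %g\n", imbalance(w, p, 2));  // (1+4+1)/(10/2) = 1.2
    return 0;
}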
Load Balancing: Current Capabilities
• Zoltan: Software toolkit for parallel data management and
load balancing
– Available at http://www.cs.sandia.gov/Zoltan
• Collection of many load-balancing methods
– Geometric: RCB (sketched below), space-filling curves
– Graph and hypergraph partitioning
• Data-structure neutral interface
– Call-back functions
– Single, common interface for many methods
• Allows applications to “plug and play”
• Portable, parallel code (MPI)
– Used in many DOE and Sandia applications
– Can run on thousands of processors
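As an illustration of the geometric methods listed above, here is a rough serial sketch of recursive coordinate bisection (RCB): split the points at the median of the longest coordinate direction and recurse on each half. Zoltan's RCB is parallel, weight-aware, and far more general; this is only the core idea, on made-up 2D points and a power-of-two part count.

// Rough serial sketch of recursive coordinate bisection (RCB).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Point { double x[2]; int part; };

void rcb(std::vector<Point*>& pts, int part_lo, int part_hi) {
    if (part_lo == part_hi) {                          // one part left: assign all
        for (Point* p : pts) p->part = part_lo;
        return;
    }
    // Choose the coordinate with the largest extent as the cutting direction.
    double lo[2] = {1e300, 1e300}, hi[2] = {-1e300, -1e300};
    for (Point* p : pts)
        for (int d = 0; d < 2; ++d) {
            lo[d] = std::min(lo[d], p->x[d]);
            hi[d] = std::max(hi[d], p->x[d]);
        }
    int dim = (hi[0] - lo[0] >= hi[1] - lo[1]) ? 0 : 1;

    // Split at the median point along that direction.
    size_t mid = pts.size() / 2;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [dim](Point* a, Point* b) { return a->x[dim] < b->x[dim]; });
    std::vector<Point*> left(pts.begin(), pts.begin() + mid),
                        right(pts.begin() + mid, pts.end());
    int part_mid = (part_lo + part_hi) / 2;
    rcb(left, part_lo, part_mid);                      // recurse on each half
    rcb(right, part_mid + 1, part_hi);
}

int main() {
    std::vector<Point> mesh = {{{0, 0}}, {{1, 0}}, {{2, 1}}, {{3, 1}},
                               {{0, 2}}, {{1, 3}}, {{2, 3}}, {{3, 2}}};
    std::vector<Point*> ptrs;
    for (Point& p : mesh) ptrs.push_back(&p);
    rcb(ptrs, 0, 3);                                   // partition into 4 parts
    for (const Point& p : mesh)
        std::printf("(%g,%g) -> part %d\n", p.x[0], p.x[1], p.part);
    return 0;
}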
Load Balancing: Applications
• Large variety of applications, requirements, data structures.
[Figure: example applications include parallel electronics networks, particle methods, adaptive mesh refinement, cell modeling, multiphysics simulations, crash simulations, and linear solvers & preconditioners (Ax = b).]
Load Balancing: Future Capabilities
• Scalable hypergraph partitioning
– Hypergraphs accurately model communication volume
– We aim to improve scalability to thousands of processors
• 2d matrix partitioning
– Reduce communication compared to standard 1d distribution
• Multiconstraint partitioning
– Multi-physics simulation
• Partitioning with complex objectives
– E.g., simultaneously balance computation and memory
• Parallel sparse matrix ordering (nested dissection)
Reordering Transformations: Introduction
• Irregular memory access patterns make
performance sensitive to data and iteration orders
• Run-time reordering transformations schedule
data accesses and iterations to maximize
performance
• Preliminary work on reordering heuristics shows
that hypergraph models outperform graph models
• Full sparse tiling: new inspector/executor strategy
that exploits inter-iteration locality
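The inspector/executor idea behind run-time reordering can be shown in a few lines of C++: the inspector examines the index array that drives an irregular loop and builds a new iteration order, and the executor then runs the loop in that order. The sort-by-accessed-location rule and the toy reduction below are placeholders, not the hypergraph-based heuristics developed here.

// Minimal sketch of the inspector/executor pattern for run-time iteration
// reordering on an irregular, index-array-driven loop.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> data(8, 1.0);
    std::vector<int> idx = {7, 2, 7, 0, 5, 2, 0, 5, 1, 6};   // irregular access pattern

    // Inspector: choose a new order for the iterations of the loop below
    // (here simply sorted by accessed location, a crude locality heuristic).
    std::vector<int> order(idx.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return idx[a] < idx[b]; });

    // Executor: run the original computation in the reordered schedule, so
    // consecutive iterations touch nearby elements of data. Reordering is
    // legal here because the reduction is order-independent (up to roundoff).
    double sum = 0.0;
    for (int i : order) sum += 2.0 * data[idx[i]];
    std::printf("sum = %g\n", sum);
    return 0;
}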
RT: Current Capabilities
• Open source package implementing several data and
iteration reordering heuristics: Data_N_Comp_Reorder
• Data reordering heuristics
– Breadth first search (graph-based)
– Consecutive packing
– Partitioning (graph-based)
– Breadth first search (hypergraph-based)
– Consecutive packing (hypergraph-based)
– Partitioning (hypergraph-based)
• Iteration reordering heuristics
– Breadth first search (hypergraph-based)
– Lexicographical sorting and various approximations
– Consecutive packing (hypergraph-based)
– Partitioning (hypergraph-based)
• Full sparse tiling implementation for model problems
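For concreteness, the sketch below shows consecutive-packing (CPACK)-style data reordering, one of the heuristics listed above: renumber data elements in the order the loop first touches them, then rewrite the index array. The toy arrays are made up; the package above applies this and the other heuristics to real index sets.

// Sketch of consecutive packing (CPACK): data elements are renumbered in
// first-touch order so temporally close accesses become spatially close.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data = {10, 11, 12, 13, 14, 15, 16, 17};
    std::vector<int> idx = {7, 2, 7, 0, 5, 2, 0, 5};      // access pattern of some loop

    // Inspector: assign new locations in first-touch order.
    const int n = data.size();
    std::vector<int> new_pos(n, -1);
    int next = 0;
    for (int i : idx)
        if (new_pos[i] == -1) new_pos[i] = next++;
    for (int i = 0; i < n; ++i)                            // untouched elements go last
        if (new_pos[i] == -1) new_pos[i] = next++;

    // Apply the permutation to the data and rewrite the index array.
    std::vector<double> packed(n);
    for (int i = 0; i < n; ++i) packed[new_pos[i]] = data[i];
    for (int& i : idx) i = new_pos[i];

    for (int i : idx) std::printf("%g ", packed[i]);       // same values, better locality
    std::printf("\n");
    return 0;
}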
RT: Application Highlight
• Reordering for a mesh-quality improvement code (FeasNewt – T. Munson)
• Hypergraph-BFS data reordering coupled with Cpack iteration reordering offers the best performance
• Reordering leads to performance within 90% of the memory-bandwidth limit for sparse matvec
[Chart: performance of the Hessian, Gradient, and Matmul kernels for the original and reordered codes, shown against peak and the memory-bandwidth limit.]
RT: Future Capabilities
• New hypergraph-based runtime reordering
transformations
• Comparison between hypergraph-based and
bipartite graph-based runtime reordering
transformations
• Hypergraph partitioners for load balancing
modified to work well for reordering
transformations
• Hierarchical full sparse tiling for hierarchical
parallel systems
Graph Coloring and Matching: Introduction
• Graph coloring deals with
partitioning a set of binary-related
objects into few groups of
“independent” objects
• Sparsity exploitation in
computation of Jacobians and
Hessians leads to a variety of
graph coloring problems. Sources
of problem variations:
– Unsymmetric vs symmetric matrix
– Direct vs substitution method
– Uni- vs bi-directional partitioning

                        1d partition          2d partition
Jacobian, Direct        Distance-2 coloring   Star bicoloring
Jacobian, Subst         NA                    Acyclic bicoloring
Hessian, Direct         Star coloring         NA
Hessian, Subst          Acyclic coloring      NA
• Matching deals with finding a “large” set of independent edges in a graph
• Variant matching problems occur in load-balancing, process scheduling,
linear solvers, preconditioners, etc.
• Orthogonal sources of variation in matching problems:
– Bipartite vs general graphs
– Cardinality vs weighted problems
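Tying the coloring definition at the top of this slide to code: below is a minimal C++ sketch of greedy distance-1 coloring on a small made-up graph, where each color class is an independent set. The distance-2, star, and acyclic variants discussed on the next slide follow the same visit-and-assign pattern with stricter conflict rules; the graph and natural vertex order here are illustrative only.

// Minimal sketch of greedy distance-1 graph coloring: give each vertex the
// smallest color not used by its already-colored neighbors.
#include <cstdio>
#include <vector>

std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    const int n = adj.size();
    std::vector<int> color(n, -1);
    std::vector<int> forbidden(n, -1);        // forbidden[c] == v: color c used next to v
    for (int v = 0; v < n; ++v) {
        for (int u : adj[v])
            if (color[u] >= 0) forbidden[color[u]] = v;
        int c = 0;
        while (forbidden[c] == v) ++c;        // smallest color not forbidden at v
        color[v] = c;
    }
    return color;
}

int main() {
    // A 5-cycle: 0-1-2-3-4-0 (needs 3 colors).
    std::vector<std::vector<int>> adj = {{1, 4}, {0, 2}, {1, 3}, {2, 4}, {3, 0}};
    std::vector<int> color = greedy_color(adj);
    for (size_t v = 0; v < color.size(); ++v)
        std::printf("vertex %zu -> color %d\n", v, color[v]);
    return 0;
}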
GCM: Current Capabilities
• Coloring
Serial:
– Developed novel greedy algorithms for the distance-1, distance-2, star, and acyclic coloring problems. A package implementing these algorithms and the corresponding ordering-routine variants is available.
Parallel:
– Developed a scheme for parallelizing greedy coloring algorithms on distributed-memory computers. MPI implementations of distance-1 and distance-2 coloring are available via Zoltan.
• Matching
– Algorithms that compute optimal solutions to matching problems run in polynomial time, but are slow in practice and difficult to parallelize.
– High quality approximate solutions can be computed in (near)
linear time. Approximation techniques make parallelization easier.
– Developed fast approximation algorithms for several matching
problems.
– Efficient implementations of exact matching algorithms available.
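As an example of the near-linear-time approximation idea mentioned above, the sketch below implements the classic greedy 1/2-approximation for weighted matching: consider edges in decreasing weight and keep an edge whenever both endpoints are still free. This is a textbook baseline, not the specific algorithms developed here, and the 4-cycle graph is made up.

// Classic greedy 1/2-approximation for maximum-weight matching.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Edge { int u, v; double w; };

std::vector<Edge> greedy_matching(std::vector<Edge> edges, int nverts) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w > b.w; });
    std::vector<bool> matched(nverts, false);
    std::vector<Edge> matching;
    for (const Edge& e : edges)
        if (!matched[e.u] && !matched[e.v]) {     // both endpoints still free
            matched[e.u] = matched[e.v] = true;
            matching.push_back(e);
        }
    return matching;
}

int main() {
    std::vector<Edge> edges = {{0, 1, 2.0}, {1, 2, 3.0}, {2, 3, 2.5}, {3, 0, 1.0}};
    for (const Edge& e : greedy_matching(edges, 4))
        std::printf("matched (%d,%d) weight %g\n", e.u, e.v, e.w);
    return 0;
}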
GCM: Application Highlights
• Coloring
– Automatic differentiation (sparse Jacobians and Hessians)
– Parallel computation (discovery of concurrency, data migration)
– Frequency allocation
– Register allocation in compilers, etc.
• Matching
– Numerical preprocessing in sparse linear systems:
• permute a matrix such that its diagonal or block diagonal are heavy.
– Block triangular decomposition in sparse linear systems:
• decompose a system of equations into smaller sets of systems.
– Graph partitioning:
• guide the coarsening phase of multilevel graph partitioning methods.
GCM: Future Capabilities
• Develop and implement star and acyclic
bicoloring algorithms for Jacobian computation
• Develop parallel algorithms that scale to
thousands of processors for the various coloring
problems (distance-1, distance-2, star, acyclic)
• Integrate coloring software with automatic
differentiation tools
• Develop petascale parallel matching algorithms
based on approximation techniques