On-line Automated Performance
Diagnosis on Thousands of Processors
Philip C. Roth
Future Technologies Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Paradyn Research Group
Computer Sciences Department
University of Wisconsin-Madison
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
High Performance Computing Today
Large parallel computing resources
Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
Clusters (LANL Lightning, LLNL Thunder)
Grid
Large, complex applications
ASCI Blue Mountain job sizes (2001)
512 cpus: 17.8%
1024 cpus: 34.9%
2048 cpus: 19.9%
Small fraction of peak performance is the rule
Achieving Good Performance
Need to know what and where to tune
Diagnosis and tuning tools are critical for realizing potential of
large-scale systems
On-line automated tools are especially desirable
Manual tuning is difficult
Finding interesting data in large data volume
Understanding application, OS, hardware interactions
Automated tools require minimal user involvement; expertise is
built into the tool
On-line automated tools can adapt dynamically
Dynamic control over data volume
Useful results from a single run
But: tools that work well in small-scale environments often don’t scale
Barriers to Large-Scale Performance Diagnosis
• Managing performance data volume
• Communicating efficiently between distributed tool
components
• Making scalable presentation of data and analysis results
[Figure: tool front end; tool daemons d0, d1, d2, d3, … dP-1; application processes a0, a1, a2, a3, … aP-1]
Our Approach for Addressing These
Scalability Barriers
MRNet: multicast/reduction infrastructure
for scalable tools
Distributed Performance Consultant: strategy
for efficiently finding performance
bottlenecks in large-scale applications
Sub-Graph Folding Algorithm: algorithm for
effectively presenting bottleneck diagnosis
results for large-scale applications
Outline
Performance Consultant
MRNet
Distributed Performance Consultant
Sub-Graph Folding Algorithm
Evaluation
Summary
Performance Consultant
Automated performance diagnosis
Search for application performance problems
Start with global, general experiments (e.g., test
CPUbound across all processes)
Collect performance data using dynamic instrumentation
Collect only the data desired
Remove the instrumentation when no longer needed
Make decisions about truth of each experiment
Refine search: create more specific experiments based on
“true” experiments (those whose data is above a user-configurable threshold)
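The search loop described on this slide can be sketched as follows. This is a minimal illustration, not Paradyn's implementation: the `REFINEMENTS` table, the `search` function, the `SAMPLE` values, and the 0.2 threshold are all assumptions made for the example, and `measure` stands in for inserting dynamic instrumentation and collecting data.

```python
from collections import deque

# Hypothetical refinement table: each experiment maps to the more specific
# experiments that would be created if it tests true. Names echo the slides.
REFINEMENTS = {
    "CPUbound @ all": ["CPUbound @ main", "CPUbound @ c001", "CPUbound @ c002"],
    "CPUbound @ main": ["CPUbound @ Do_row", "CPUbound @ Do_col"],
}

def search(root, measure, threshold):
    """Breadth-first refinement that expands only experiments testing true."""
    true_experiments = []
    pending = deque([root])
    while pending:
        experiment = pending.popleft()
        value = measure(experiment)   # stand-in for instrumenting and sampling
        if value > threshold:         # "true": data is above the threshold
            true_experiments.append(experiment)
            pending.extend(REFINEMENTS.get(experiment, []))
        # A false experiment is not refined; a real tool would remove
        # its instrumentation at this point.
    return true_experiments

# Made-up measurements for illustration only.
SAMPLE = {
    "CPUbound @ all": 0.9, "CPUbound @ main": 0.8, "CPUbound @ c001": 0.1,
    "CPUbound @ c002": 0.3, "CPUbound @ Do_row": 0.7, "CPUbound @ Do_col": 0.05,
}

result = search("CPUbound @ all", SAMPLE.get, threshold=0.2)
```

Only experiments that test true are refined, so the amount of instrumentation active at any time follows the bottlenecks actually found rather than the size of the program.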
Performance Consultant
[Figure: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, c128.cs.wisc.edu; processes myapp367, myapp4287, myapp27549]
Performance Consultant
[Figure: search history graph. Labels: CPUbound; main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
Performance Consultant
[Figure: search history graph. Labels: cham.cs.wisc.edu; CPUbound; main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
MRNet: Multicast/Reduction Overlay Network
Parallel tool infrastructure providing:
Scalable multicast
Scalable data synchronization and transformation
Network of processes between tool front-end and
back-ends
Useful for parallelizing and distributing tool activities
Reduce latency
Reduce computation and communication load at tool front-end
Joint work with Dorian Arnold (University of
Wisconsin-Madison)
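To illustrate why a tree of intermediate processes reduces front-end load, here is a small simulation of a level-by-level reduction. It does not use the MRNet API; `tree_reduce`, the fan-out of 16, and the per-daemon sample values are assumptions chosen for the example.

```python
def tree_reduce(leaf_values, fanout, reduce_filter):
    """Reduce leaf values level by level, as a tree of internal processes would.

    Each list comprehension pass models one level of the tree: every
    internal process applies the filter to at most `fanout` inputs.
    """
    level = list(leaf_values)
    depth = 0
    while len(level) > 1:
        level = [reduce_filter(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth

# Made-up per-daemon CPU samples from 1024 tool daemons.
daemon_cpu = [float(i % 4) for i in range(1024)]

total, depth = tree_reduce(daemon_cpu, fanout=16, reduce_filter=sum)
```

With 1024 daemons and a fan-out of 16, no process in this sketch handles more than 16 inputs, whereas a front end connected directly to every daemon would handle all 1024.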
Typical Parallel Tool Organization
[Figure: tool front end; tool daemons d0, d1, d2, d3, … dP-1; application processes a0, a1, a2, a3, … aP-1]
MRNet-based Parallel Tool Organization
[Figure: tool front end; multicast/reduction network of internal processes with filters; tool daemons d0, d1, d2, d3, … dP-1; application processes a0, a1, a2, a3, … aP-1]
Performance Consultant: Scalability Barriers
MRNet can alleviate the scalability problem for
global performance data (e.g., CPU utilization
across all processes)
But the front-end still processes local
performance data (e.g., utilization of process
5247 on host mcr398.llnl.gov)
Distributed Performance Consultant
[Figure: search history graph. Labels: cham.cs.wisc.edu; CPUbound; main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
Distributed Performance Consultant:
Variants
Natural steps from traditional centralized approach
(CA)
Partially Distributed Approach (PDA)
Distributed local searches, centralized global search
Requires complex instrumentation management
Truly Distributed Approach (TDA)
Distributed local searches only
Insight into global behavior from combining local search
results (e.g., using Sub-Graph Folding Algorithm)
Simpler tool design than PDA
Distributed Performance Consultant: PDA
[Figure: search history graph. Labels: cham.cs.wisc.edu; CPUbound; main, Do_row, Do_col, Do_mult; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
Distributed Performance Consultant: TDA
[Figure: search history graph. Labels: cham.cs.wisc.edu; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
Distributed Performance Consultant: TDA
[Figure: search history graph. Labels: cham.cs.wisc.edu; Sub-Graph Folding Algorithm; hosts c001.cs.wisc.edu, c002.cs.wisc.edu, … c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}, each with its own sub-graph of main, Do_row, Do_col, Do_mult, …]
Search History Graph Example
[Figure: search history graph. Labels: CPUbound; main; hosts c33.cs.wisc.edu, c34.cs.wisc.edu; processes myapp{1272}, myapp{1273}, myapp{7624}, myapp{7625}, each with its own main node and a sub-graph of function nodes labeled A to E]
Search History Graphs
Search History Graph is effective for
presenting search-based performance
diagnosis results…
…but it does not scale to a large number of
processes because it shows one sub-graph
per process
Sub-Graph Folding Algorithm
Combines host-specific sub-graphs into
composite sub-graphs
Each composite sub-graph represents a
behavioral category among application
processes
Dynamic clustering of processes by qualitative
behavior
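As a rough illustration of clustering processes by qualitative behavior, the sketch below groups processes whose search sub-graphs are structurally identical. The slide does not give the algorithm's actual folding rule, so this grouping criterion is an assumption, as are the `fold` function and the sample paths.

```python
from collections import defaultdict

def fold(subgraphs):
    """Group processes whose search sub-graphs have identical structure.

    `subgraphs` maps a process name to the set of paths (tuples of node
    names) that tested true for that process.
    """
    categories = defaultdict(list)
    for process, paths in subgraphs.items():
        categories[frozenset(paths)].append(process)
    return categories

# Hypothetical per-process results; the paths are invented for the example.
subgraphs = {
    "c33:myapp{1272}": {("main", "A"), ("main", "A", "B")},
    "c33:myapp{1273}": {("main", "A"), ("main", "A", "B")},
    "c34:myapp{7624}": {("main", "C")},
    "c34:myapp{7625}": {("main", "A"), ("main", "A", "B")},
}

categories = fold(subgraphs)
category_sizes = sorted(len(members) for members in categories.values())
```

Here four per-process sub-graphs collapse into two composite sub-graphs, one covering three processes and one covering a single outlier, so the display grows with the number of distinct behaviors rather than the number of processes.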
SGFA: Example
[Figure: folded search history graph. Labels: CPUbound; main; hosts c33.cs.wisc.edu, c34.cs.wisc.edu folded into c*.cs.wisc.edu; processes myapp{1272}, myapp{1273}, myapp{7624}, myapp{7625} folded into myapp{*}; per-process and composite sub-graphs of function nodes labeled A to E]
SGFA: Implementation
Custom MRNet filter
Filter in each MRNet process keeps folded
graph of search results from all reachable
daemons
Updates periodically sent upstream
By induction, filter in front-end holds entire
folded graph
Optimization for unchanged graphs
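The filter behavior listed above can be sketched in plain Python. This is not MRNet's filter interface: `FoldingFilter`, its methods, and the path-to-count graph representation are hypothetical, and treating the "optimization for unchanged graphs" as skipping an unchanged update is one plausible reading of the slide.

```python
class FoldingFilter:
    """Hypothetical filter state held by one process in the tree."""

    def __init__(self):
        self.by_child = {}      # child id -> that child's latest folded graph
        self.last_sent = None   # graph most recently sent upstream

    def receive(self, child, graph):
        """Record the latest folded graph (path -> process count) from a child."""
        self.by_child[child] = graph

    def folded(self):
        """Merge the latest graphs from all children into one folded graph."""
        merged = {}
        for graph in self.by_child.values():
            for path, count in graph.items():
                merged[path] = merged.get(path, 0) + count
        return merged

    def flush(self):
        """Return the graph to send upstream, or None if nothing changed."""
        current = self.folded()
        if current == self.last_sent:
            return None
        self.last_sent = current
        return current

# Two internal processes feed a front-end filter; d0..d2 are daemons.
left, right, front_end = FoldingFilter(), FoldingFilter(), FoldingFilter()
left.receive("d0", {("main", "Do_row"): 1})
left.receive("d1", {("main", "Do_row"): 1, ("main", "Do_col"): 1})
right.receive("d2", {("main", "Do_col"): 1})
front_end.receive("left", left.flush())
front_end.receive("right", right.flush())
whole = front_end.flush()
repeat = left.flush()   # nothing has changed since the previous flush
```

Because each filter holds the folded graph for everything below it, the front-end filter ends up with the folded graph for all daemons without ever receiving a per-daemon update directly.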
DPC + SGFA: Evaluation
Modified Paradyn to perform bottleneck searches
using CA, PDA, or TDA approach
Modified instrumentation cost tracking to support
PDA
Track global and per-process instrumentation costs separately
Simple fixed-partition policy for scheduling global and local
instrumentation
Implemented Sub-Graph Folding Algorithm as custom
MRNet filter to support TDA (used by all)
Instrumented front-end, daemons, and MRNet
internal processes to collect CPU and I/O load
information
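One way to picture separate cost tracking with a fixed partition is the sketch below. The slides do not describe the policy in detail, so the `CostTracker` class, its two caps, and the abstract integer cost units are assumptions made for illustration.

```python
from collections import defaultdict

class CostTracker:
    """Hypothetical fixed partition between global and per-process cost."""

    def __init__(self, global_cap, local_cap):
        self.global_cap = global_cap         # budget for global experiments
        self.local_cap = local_cap           # budget for each process's local experiments
        self.global_used = 0
        self.local_used = defaultdict(int)   # process name -> cost in use

    def start_global(self, cost):
        """Admit a global experiment only if the global partition has room."""
        if self.global_used + cost > self.global_cap:
            return False
        self.global_used += cost
        return True

    def start_local(self, process, cost):
        """Admit a local experiment only if that process's partition has room."""
        if self.local_used[process] + cost > self.local_cap:
            return False
        self.local_used[process] += cost
        return True

tracker = CostTracker(global_cap=50, local_cap=50)
first_global = tracker.start_global(40)               # fits in the global partition
second_global = tracker.start_global(20)              # would exceed the global cap
local_ok = tracker.start_local("myapp{367}", 20)      # local budget is unaffected
```

With a fixed partition, a busy global search cannot starve the local searches, and the reverse also holds, at the price of leaving one partition idle when only the other has work.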
DPC + SGFA: Evaluation
su3_rmd
QCD pure lattice gauge theory code
C, MPI
Weak-scaling study
LLNL MCR cluster
1152 nodes (1048 compute nodes)
Two 2.4 GHz Intel Xeons per node
4 GB memory per node
Quadrics Elan3 interconnect (fat tree)
Lustre parallel file system
DPC + SGFA: Evaluation
PDA and TDA: bottleneck searches with up
to 1024 processes so far, limited by
partition size
CA: scalability limit at fewer than 64
processes
Similar qualitative results from all approaches
DPC: Evaluation
[Figures: nine slides of evaluation charts; chart content not preserved in this text]
SGFA: Evaluation
[Figure: one slide of evaluation charts; chart content not preserved in this text]
Summary
Tool scalability is critical for effective use
of large-scale computing resources
On-line automated performance tools are
especially important at large scale
Our approach:
MRNet
Distributed Performance Consultant (TDA) plus
Sub-Graph Folding Algorithm
References
P.C. Roth, D.C. Arnold, and B.P. Miller, “MRNet: a
Software-Based Multicast/Reduction Network for
Scalable Tools,” SC 2003, Phoenix, Arizona,
November 2003
P.C. Roth and B.P. Miller, “The Distributed
Performance Consultant and the Sub-Graph Folding
Algorithm: On-line Automated Performance Diagnosis
on Thousands of Processes,” in submission
Publications available from http://www.paradyn.org
MRNet software available from
http://www.paradyn.org/mrnet