SuperComputing 2001 PAPI Tutorial

Performance Tuning Using
Hardware Counter Data
Philip Mucci
[email protected]
Shirley Moore
[email protected]
Nils Smeds
[email protected]
SC 2001
November 12, 2001
Denver, Colorado
Outline
• Issues in application performance
tuning – 30 minutes
• General design of PAPI – 15 minutes
• PAPI high-level interface – 15 minutes
• PAPI low-level interface – 15 minutes
• Counter overflow interrupts and
statistical profiling – 30 minutes
(advanced)
• Tools that use PAPI – 30 minutes
• Code examples – 30 minutes
Issues in Application Performance Tuning
HPC Architecture
• RISC or super-scalar architecture
– Pipelined functional units
– Multiple functional units in the CPU
– Speculative execution
– Several levels of cache memory
– Cache lines shared between CPUs
POWER3 Processing Units (Model 260)
[Figure: block diagram. Two floating point units (FPU1, FPU2), three fixed point units (FXU1, FXU2, FXU3), two load/store units (LS1, LS2), and a branch/dispatch unit with a 2048-entry branch history table and a 256-entry branch target cache. A 32 KB, 128-way instruction cache and a 64 KB, 128-way data cache, each with its own memory management unit. The bus interface unit (L2 control, clock) connects to the 1-16 MB L2 cache at 32 bytes @ 200 MHz = 6.4 GB/s and to the 6XX bus at 16 bytes @ 100 MHz = 1.6 GB/s.]
Itanium™ Processor Block Diagram
[Figure: block diagram. Branch prediction; L1 instruction cache with fetch/pre-fetch engine and ITLB; decoupling buffer (8 bundles); IA-32 decode and control; issue ports (B B B, M M I I, F F); branch units with branch & predicate registers; 128 integer registers feeding the integer and MM units; dual-port L1 data cache with DTLB; ALAT; scoreboard, predicate, NaTs, exceptions; register stack engine / re-mapping; 128 FP registers feeding the floating point units and SIMD FMAC; L2 cache, L3 cache, and bus controller.]
Hardware Counters
• Small set of registers that count
events, which are occurrences of
specific signals related to the
processor’s function
• Monitoring these events facilitates
correlation between the structure
of the source/object code and the
efficiency of the mapping of that
code to the underlying
architecture.
Pipelined Functional Units
• The circuitry on a chip that
performs a given operation is
called a functional unit.
• Most integer and floating point
units are pipelined
– Each stage of a pipelined unit works
simultaneously on a different set of
operands
– After the initial startup latency, the goal
is to generate one result every clock cycle
Super-scalar Processors
• Processors that have multiple
functional units are called superscalar.
• Examples:
– IBM Power 3
• 2 floating point units (multiply-add)
• 3 fixed point units
• 2 load/store units
• 1 branch/dispatch unit
Super-scalar Processors (cont.)
– MIPS R12K
• 2 floating point units (1 multiply-add, 1
add)
• 2 integer units
• 2 load/store units
– Alpha EV67
• Instruction fetch/issue/retire unit
• Integer execution unit (2 IU clusters)
• Floating point execution unit (2 FPUs)
Super-scalar Processors (cont.)
– Intel Itanium
• EPIC (Explicitly Parallel Instruction
Computing) design
• 4 integer units
• 4 multimedia units
• 2 load/store units
• 3 branch units
• 2 extended precision floating point units
• 2 single precision floating point units
Out of Order Execution
• CPU dynamically executes instructions
as their operands become available, out
of order if necessary
– Any result generated out of order is
temporary until all previous instructions
have successfully completed.
– Queues are used to select which
instructions to issue dynamically to the
execution units.
– Relevant hardware counter metrics:
instructions issued, instructions completed
Speculative Execution
• The CPU attempts to predict which
way a branch will go and continues
executing instructions
speculatively along that path.
– If the prediction is wrong, instructions
executed down the incorrect path
must be canceled.
– On many processors, hardware
counters keep counts of branch
prediction hits and misses.
Instruction Counts and
Functional Unit Status
• Relevant hardware counter data
– Total cycles
– Total instructions
– Floating point operations
– Load/store instructions
– Cycles functional units are idle
– Cycles stalled
• waiting for memory access
• waiting for resource
– Conditional branch instructions
• executed
• mispredicted
Cache and Memory Hierarchy
• Registers: On-chip circuitry used to
hold operands and results of
calculations
• L1 (primary) data cache: Small on-chip
cache used to hold data about to be
operated on
• L2 (secondary) cache: Larger (on- or
off-chip) cache used to hold data and
instructions retrieved from local
memory.
• Some systems have L3 and even L4
caches.
Cache and Memory Hierarchy
(cont.)
• Local memory: Memory on the
same node as the processor
• Remote memory: Memory on
another node but accessible over
an interconnect network.
• Each level of the memory
hierarchy introduces approximately
an order of magnitude more
latency than the previous level.
Cache Structure
• Memory on a node is organized as an
array of cache lines which are typically
4 or 8 words long. When a data item is
fetched from a higher level cache or
from local memory, an entire cache line
is fetched.
• Caches can be either
– direct mapped or
– N-way set associative
• A cache miss occurs when the program
refers to a data item that is not present
in the cache.
Cache Contention
• When two or more CPUs alternately and
repeatedly update the same cache line
– memory contention
• when two or more CPUs update the same
variable
• correcting it involves an algorithm change
– false sharing
• when CPUs update distinct variables that occupy
the same cache line
• correcting it involves modification of data
structure layout
Cache Contention (cont.)
• Relevant hardware counter metrics
– Cache misses and hit ratios
– Cache line invalidations
TLB and Virtual Memory
• Memory is divided into pages.
• The operating system translates the
virtual page addresses used by a
program into physical addresses used
by the hardware.
– The most recently used addresses are
cached in the translation lookaside buffer
(TLB).
– When the program refers to a virtual
address that is not in the TLB, a TLB miss
occurs.
• Relevant hardware counter metric: TLB
misses
Memory Latencies
• CPU register: 0 cycles
• L1 cache hit: 2-3 cycles
• L1 cache miss satisfied by L2 cache hit:
8-12 cycles
• L2 cache miss satisfied from main
memory, no TLB miss: 75-250 cycles
• TLB miss requiring only reload of the
TLB: ~2000 cycles
• TLB miss requiring reload of virtual
page – page fault: hundreds of millions
of cycles
Steps of Optimization
• Optimize compiler switches
• Integrate libraries
• Profile
• Optimize blocks of code that
dominate execution time by using
hardware counter data to
determine why the bottlenecks
exist
• Always examine correctness at
every stage!
General Design of PAPI
Goals
• Solid foundation for cross platform
performance analysis tools
• Free tool developers from reimplementing counter access
• Standardization between vendors,
academics and users
• Encourage vendors to provide hardware
and OS support for counter access
• Reference implementations for a
number of HPC architectures
• Well documented and easy to use
Overview of PAPI
• Performance Application Programming
Interface
• The purpose of the PAPI project is to
design, standardize and implement a
portable and efficient API to access the
hardware performance monitor
counters found on most modern
microprocessors.
• Parallel Tools Consortium project
http://www.ptools.org/
PAPI Counter Interfaces
• PAPI provides three interfaces to the
underlying counter hardware:
1. The low level interface manages hardware
events in user defined groups called
EventSets.
2. The high level interface simply provides the
ability to start, stop and read the counters
for a specified list of events.
3. Graphical tools to visualize information.
PAPI Implementation
[Figure: PAPI implementation layers. Tools such as the Java monitor GUI sit on the portable layer (the PAPI high-level API on top of the PAPI low-level API), which rests on the machine-specific layer: the PAPI machine-dependent substrate, a kernel extension, the operating system, and the hardware performance counters.]
PAPI Preset Events
• Proposed standard set of events
deemed most relevant for
application performance tuning
• Defined in papiStdEventDefs.h
• Mapped to native events on a
given platform
– Run tests/avail to see list of PAPI
preset events available on a platform
PAPI Release
• Platforms
– Linux/x86, Windows 2000
• Requires patch to Linux kernel, driver for
Windows
– Linux/IA-64
– Sun Solaris/Ultra 2.8
– IBM AIX/Power
• Contact IBM for pmtoolkit
– SGI IRIX/MIPS
– Compaq Tru64/Alpha Ev6 & Ev67
• Requires OS device driver from Compaq
– Cray T3E/Unicos
PAPI Release (cont.)
• C and Fortran bindings and Matlab
wrappers
• To download software:
http://icl.cs.utk.edu/projects/papi/
PAPI High-level Interface
High-level Interface
• Meant for application programmers
wanting coarse-grained
measurements
• Not thread safe
• Calls the lower level API
• Allows only PAPI preset events
• Easier to use and requires less setup
(additional code) than the low-level interface
High-level API
C interface             Fortran interface
PAPI_start_counters     PAPIF_start_counters
PAPI_read_counters      PAPIF_read_counters
PAPI_stop_counters      PAPIF_stop_counters
PAPI_accum_counters     PAPIF_accum_counters
PAPI_num_counters       PAPIF_num_counters
PAPI_flops              PAPIF_flops
Setting up the High-level
Interface
• int PAPI_num_counters(void)
– Initializes PAPI (if needed)
– Returns number of hardware counters
• int PAPI_start_counters(int *events, int len)
– Initializes PAPI (if needed)
– Sets up an event set with the given counters
– Starts counting in the event set
• int PAPI_library_init(int version)
– Low-level routine implicitly called by above
Controlling the Counters
• PAPI_stop_counters(long_long *vals, int alen)
– Stop counters and put counter values in array
• PAPI_accum_counters(long_long *vals, int
alen)
– Accumulate counters into array and reset
• PAPI_read_counters(long_long *vals, int alen)
– Copy counter values into array and reset counters
• PAPI_flops(float *rtime, float *ptime,
long_long *flpins, float *mflops)
– Wall-clock time, process time, FP instructions
since start, and Mflop/s since last call
PAPI_flops
• int PAPI_flops(float *real_time, float
*proc_time, long_long *flpins, float *mflops)
– Only two calls needed, PAPI_flops before and after
the code you want to monitor
– real_time is the wall-clock time between the two calls
– proc_time is the “virtual” time or time the process
was actually executing between the two calls (not as
fine grained as real_time but better for longer
measurements)
– flpins is the total floating point instructions executed
between the two calls
– mflops is the Mflop/s rating between the two calls
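A minimal complete sketch of the two-call pattern (do_work() here is a stand-in for the code being measured):

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

/* Placeholder for the code being measured */
static void do_work(void)
{
    volatile double s = 0.0;
    int i;
    for (i = 0; i < 1000000; i++)
        s += i * 0.5;
}

int main(void)
{
    float real_time, proc_time, mflops;
    long_long flpins;

    /* First call sets up and starts the counters */
    if (PAPI_flops(&real_time, &proc_time, &flpins, &mflops) != PAPI_OK)
        exit(1);

    do_work();

    /* Second call reports values accumulated since the first call */
    if (PAPI_flops(&real_time, &proc_time, &flpins, &mflops) != PAPI_OK)
        exit(1);

    printf("real %f s  proc %f s  %lld FP ins  %f Mflop/s\n",
           real_time, proc_time, (long long)flpins, mflops);
    return 0;
}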
PAPI High-level Example
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
long_long values[NUM_EVENTS];
int retval;
/* Start the counters */
PAPI_start_counters(Events, NUM_EVENTS);
/* The code we want to monitor */
do_work();
/* Stop the counters and store the results in values */
retval = PAPI_stop_counters(values, NUM_EVENTS);
Return codes
Name              Description
PAPI_OK           No error
PAPI_EINVAL       Invalid argument
PAPI_ENOMEM       Insufficient memory
PAPI_ESYS         A system/C library call failed; check the errno variable
PAPI_ESBSTR       Substrate returned an error, e.g. an unimplemented feature
PAPI_ECLOST       Access to the counters was lost or interrupted
PAPI_EBUG         Internal error
PAPI_ENOEVNT      Hardware event does not exist
PAPI_ECNFLCT      Hardware event exists, but resources are exhausted
PAPI_ENOTRUN      Event or event set is not currently counting
PAPI_EISRUN       Event or event set is currently running
PAPI_ENOEVST      No event set available
PAPI_ENOTPRESET   Argument is not a preset event
PAPI_ENOCNTR      Hardware does not support performance counters
PAPI_EMISC        Any other error occurred
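Most PAPI calls return one of these codes, so a typical guard looks like the sketch below, continuing the earlier high-level example (PAPI_strerror(), which maps a code to its description string, is assumed here):

retval = PAPI_start_counters(Events, NUM_EVENTS);
if (retval != PAPI_OK) {
    /* Report the symbolic error message and give up */
    fprintf(stderr, "PAPI error %d: %s\n", retval, PAPI_strerror(retval));
    exit(1);
}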
PAPI Low-level Interface
Low-level Interface
• Increased efficiency and
functionality over the high level
PAPI interface
• About 40 functions
• Obtain information about the
executable and the hardware
• Thread-safe
• Fully programmable
• Callbacks on counter overflow
Low-level Functionality
• Library initialization
PAPI_library_init, PAPI_thread_init,
PAPI_shutdown
• Timing functions
PAPI_get_real_usec,
PAPI_get_virt_usec
PAPI_get_real_cyc, PAPI_get_virt_cyc
• Inquiry functions
• Management functions
• Simple lock
PAPI_lock/PAPI_unlock
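For example, the microsecond timers can bracket a region directly (a minimal sketch; do_work() stands in for the region being timed):

long_long t0, t1;
t0 = PAPI_get_real_usec();
do_work();                 /* region being timed */
t1 = PAPI_get_real_usec();
printf("Wall clock time: %lld usec\n", (long long)(t1 - t0));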
Event sets
• The event set contains key information
– What low-level hardware counters to use
– Most recently read counter values
– The state of the event set (running/not
running)
– Option settings (e.g., domain, granularity,
overflow, profiling)
• Event sets can overlap if they map to the
same hardware counter set-up.
– Allows inclusive/exclusive measurements
Event set Operations
• Event set management
PAPI_create_eventset,
PAPI_add_event[s], PAPI_rem_event[s],
PAPI_destroy_eventset
• Event set control
PAPI_start, PAPI_stop, PAPI_read,
PAPI_accum
• Event set inquiry
PAPI_query_event, PAPI_list_events,...
Simple Example
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;
long_long values[NUM_EVENTS];
/* Initialize the Library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add Flops and total cycles to the eventset */
retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);
/* The code we want to monitor */
do_work();
/*Stop counters and store results in values */
retval = PAPI_stop(EventSet,values);
Overlapping Counters
/* One event set for inclusive counts of the whole region, and a second
   one reset/accumulated around do_flops() to isolate its contribution */
retval = PAPI_start(InclEventSet);
retval = PAPI_start(OthersEventSet);
...
retval = PAPI_reset(OthersEventSet);
do_flops(NUM_FLOPS); /* Function call */
retval = PAPI_accum(OthersEventSet,Othersvalues);
...
retval = PAPI_stop(InclEventSet,Inclvalues);
/* Inclusive count, and the count exclusive of do_flops() */
printf("Counts: %12lld %12lld\n",
Inclvalues[0],
Inclvalues[0]-Othersvalues[0]);
Counter Domains
• int PAPI_set_domain(int domain);
– PAPI_DOM_USER     User context counted
– PAPI_DOM_KERNEL   Kernel/OS context counted
– PAPI_DOM_OTHER    Exception/transient mode counted
– PAPI_DOM_ALL      All of the above contexts counted
– PAPI_DOM_MIN      The smallest available context
– PAPI_DOM_MAX      The largest available context
• Not all domains are available on all platforms
(OS dependent)
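For example, to count kernel as well as user activity where the OS permits it (a sketch):

/* Count in all contexts; fails where the OS lacks support */
retval = PAPI_set_domain(PAPI_DOM_ALL);
if (retval != PAPI_OK)
    fprintf(stderr, "PAPI_DOM_ALL not supported on this platform\n");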
Counter Granularity
• int PAPI_set_granularity(int granul);
– PAPI_GRN_THR       count each individual thread
– PAPI_GRN_PROC      count each individual process
– PAPI_GRN_PROCG     count each process group
– PAPI_GRN_SYS       count on the current CPU
– PAPI_GRN_SYS_CPU   count on every CPU in the system
– PAPI_GRN_MIN       (= PAPI_GRN_THR)
– PAPI_GRN_MAX       (= PAPI_GRN_SYS_CPU)
• Requires OS support
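For example, restricting the counts to the calling thread (a sketch):

/* Count only the current thread; requires OS support */
retval = PAPI_set_granularity(PAPI_GRN_THR);
if (retval != PAPI_OK)
    fprintf(stderr, "Thread granularity not supported here\n");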
Using PAPI with Threads
• After PAPI_library_init need to register
unique thread identifier function
• For Pthreads
retval=PAPI_thread_init(pthread_self, 0);
• OpenMP
retval=PAPI_thread_init(omp_get_thread_num, 0);
• Each thread responsible for creation,
start, stop and read of its own counters
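A Pthreads sketch of this pattern; do_work() stands in for the thread's workload, and error checking is omitted for brevity:

#include <pthread.h>
#include "papi.h"

void do_work(void); /* placeholder workload */

void *thread_fn(void *arg)
{
    /* Each thread creates, starts, stops, and reads its own event set */
    int EventSet = PAPI_NULL;
    int Events[1] = {PAPI_TOT_CYC};
    long_long values[1];
    PAPI_create_eventset(&EventSet);
    PAPI_add_events(&EventSet, Events, 1);
    PAPI_start(EventSet);
    do_work();
    PAPI_stop(EventSet, values);
    return NULL;
}

int main(void)
{
    pthread_t t;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_thread_init(pthread_self, 0);  /* register the ID function first */
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    return 0;
}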
Using PAPI with Multiplexing
• Multiplexing allows simultaneous use
of more counters than are supported
by the hardware.
• PAPI_multiplex_init()
– should be called after PAPI_library_init()
to initialize multiplexing
• PAPI_set_multiplex( int *EventSet );
– Used after the eventset is created to turn
on multiplexing for that eventset
• Then use PAPI like normal
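The call sequence is short (a sketch using the low-level event-set routines):

int EventSet = PAPI_NULL;
int retval;

retval = PAPI_library_init(PAPI_VER_CURRENT);
retval = PAPI_multiplex_init();           /* enable multiplexing support */
retval = PAPI_create_eventset(&EventSet);
retval = PAPI_set_multiplex(&EventSet);   /* multiplex this event set */
/* Now add more events than there are physical counters,
   then PAPI_start/PAPI_stop as usual */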
Issues with Multiplexing
• Some platforms support multiplexing in
hardware; on those that don't, PAPI
implements multiplexing in software.
• The more events you multiplex, the more
likely it is that the sampled counts
misrepresent the true counts.
Multiplex Code Examples
From the PAPI source distribution:
tests/multiplex1.c
tests/multiplex1_pthreads.c
Native Events
• An event countable by the CPU can
be counted even if there is no
matching preset PAPI event
• Same interface as when setting up
a preset event, but a CPU-specific
bit pattern is used instead of the
PAPI event definition
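A sketch, continuing the earlier low-level example; the bit pattern here is hypothetical and must be taken from the CPU's performance-monitor documentation:

#define NATIVE_EVENT 0x40000005   /* hypothetical CPU-specific encoding */

retval = PAPI_add_event(&EventSet, NATIVE_EVENT);
if (retval != PAPI_OK)
    fprintf(stderr, "Native event not countable on this CPU\n");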
Native Event Examples
From the PAPI source
distribution:
tests/native.c
ftests/native.F
Counter Overflow Interrupts
and
Statistical Profiling
Callbacks on Counter Overflow
• PAPI provides the ability to call
user-defined handlers when a
specified event exceeds a specified
threshold.
• For systems that do not support
counter overflow at the OS level,
PAPI sets up a high resolution
interval timer and installs a timer
interrupt handler.
PAPI_overflow
• int PAPI_overflow(int EventSet, int
EventCode, int threshold, int flags,
PAPI_overflow_handler_t handler)
• Sets up an EventSet such that
when it is PAPI_start()’d, it begins
to register overflows
• The EventSet may contain multiple
events, but only one may be an
overflow trigger.
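A sketch of the setup; the handler signature shown is an assumption, so check PAPI_overflow_handler_t in papi.h on your platform:

static int total_overflows = 0;

/* Called each time the trigger event passes the threshold;
   'address' is the program counter where the overflow hit */
void handler(int EventSet, void *address, long_long overflow_vector,
             void *context)
{
    total_overflows++;
}

/* Deliver a callback every 1000000 occurrences of PAPI_TOT_CYC */
retval = PAPI_overflow(EventSet, PAPI_TOT_CYC, 1000000, 0, handler);
retval = PAPI_start(EventSet);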
Overflow Code Examples
From the PAPI source distribution:
tests/overflow.c
tests/overflow_pthreads.c
Statistical Profiling
• PAPI provides support for
execution profiling based on any
counter event.
• PAPI_profil() creates a histogram
of overflow counts for a specified
region of the application code.
PAPI_profil
int PAPI_profil(unsigned short *buf,
unsigned int bufsiz, unsigned long offset,
unsigned scale, int EventSet, int
EventCode, int threshold, int flags)
•buf – buffer of bufsiz bytes in which the
histogram counts are stored
•offset – start address of the region to be profiled
•scale – contraction factor that indicates how
much smaller the histogram buffer is than the
region to be profiled
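A sketch of a profiling run; the buffer size and threshold are arbitrary choices here, and the region is assumed to start at the function do_work():

unsigned short buf[8192];   /* histogram buckets */
memset(buf, 0, sizeof(buf));
/* A scale of 65536 maps the region onto the buffer roughly 1:1;
   smaller values contract it (see the PAPI documentation) */
retval = PAPI_profil(buf, sizeof(buf), (unsigned long)do_work, 65536,
                     EventSet, PAPI_TOT_CYC, 100000, 0);
retval = PAPI_start(EventSet);
do_work();
retval = PAPI_stop(EventSet, values);
/* buf[] now holds overflow counts per address bucket */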
Profiling Code Examples
From the PAPI source distribution:
tests/profile.c
tests/sprofile.c
tests/profile_pthreads.c
Tools that use PAPI
Perfometer
• Application is instrumented with PAPI
– call perfometer()
– call mark_perfometer(Color)
• The application is started. At the call to
perfometer, a signal handler and a timer are
set up to collect and send the information to
a Java applet containing the graphical view.
• Sections of code that are of interest can be
designated with specific colors
– using a call to set_perfometer('color')
• Real-time display or trace file
Perfometer Display
[Screenshot: Perfometer display, annotated with machine info, Flop/s rate, Flop/s min/max, and process & real time.]
Perfometer Parallel Interface
[screenshot]
Third-party Tools
that use PAPI
• DEEP/PAPI (Pacific Sierra)
http://www.psrv.com/deep_papi_top.html
• TAU (Allen Malony, U of Oregon)
http://www.cs.uoregon.edu/research/paracomp/tau/
• SvPablo (Dan Reed, U of Illinois)
http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
• Cactus (Ed Seidel, Max Planck/U of Illinois)
http://www.aei-potsdam.mpg.de
• Vprof (Curtis Janssen, Sandia Livermore
Lab) http://aros.ca.sandia.gov/~cljanss/perf/vprof/
• Cluster Tools (Al Geist, ORNL)
• DynaProf (Phil Mucci, UTK)
http://www.cs.utk.edu/~mucci/dynaprof/
DEEP/PAPI
[screenshot]
SvPablo
[screenshot]
TAU
[screenshot]
vprof
[screenshot]
Code Examples
• Parallelising a particle-particle simulator
• Parallelising a frequency-domain MHD simulator
Particle-particle simulator
• Particles fall in a well
• Particle interactions are computed only for
particles in the neighbourhood
• Occasionally the neighbourhood list is
recomputed
• 1000 particles
• Neighbour list length 10000
• ~6000-7000 interactions
[Figure: particles in a well, with the neighbourhood of one particle marked]
Algorithm used

For each particle i
  For each neighbor j
    Compute distance ij
    Compute inter-particle force
    Update force on particles i & j
Compute accelerations and updated positions

• Force vector is sum-updated in a "random"
access pattern
• Little cache reuse
• Inhibits SMP parallelization
Reversed neighborlist
• Introduce a force interaction vector
• Introduce a reverse neighbour list

For each particle i
  For each neighbor j
    Compute distance ij
    Compute inter-particle force
Update force on particles i
Update force on particles j
Compute accelerations and update positions

• Inter-particle force is written linearly, but
read randomly in the j-loop
• Force vector is updated linearly (see the
C sketch below)
Final performance
[Figure: wall clock time per time step for (1) naive load balancing and (2) neighbour balancing]
Explanation
• User reports serial program 3 times
faster
– Several contributing factors:
• Compiler optimisations
• Compiler inlining
• Better cache utilization
• Without the linear traversal of writes there is
no speed-up (not shown in the previous graph)
• Is the scaling problem on the SGI a cache
issue? The whole problem fits nicely into one
8 MB L2 cache
Frequency domain MHD
• The code makes frequent 3D FFT transformations
• Electric and magnetic fields are double complex,
128-bit precision
• Array dimensions are (3,N,N,N), N=64
• Array size: 12 MB per field
• In between calls, matrices are set up in loops:
  M(:,j,k,l) = A(:,j,k,l) × B(:,j,k,l) + C(:,j,k,l)
• Parallel FFTs are available
• Parallel matrix set-up is straightforward
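The set-up loops behave like a STREAM triad; a C sketch of one update (array names follow the slide, indices reversed for C layout, and the element type is simplified to double although the real fields are double complex):

/* M = A*B + C over (3,N,N,N): three loads and one store per
   multiply-add, so the loop is memory-bandwidth bound */
for (l = 0; l < N; l++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            for (c = 0; c < 3; c++)
                M[l][k][j][c] = A[l][k][j][c] * B[l][k][j][c]
                              + C[l][k][j][c];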
Expected behaviour
• Code is expected to be memory bound
outside FFTs due to array sizes and
number of floating point operations vs.
memory accesses
• Going parallel on a bus gives no gain - or
does it?
• Speed-up should be obtainable on CCNUMA
• Code should run well on vector systems
with good FFTs and enough memory
ports
Observed behaviour
[Figure: observed speed-up; annotations mark an overloaded system and serial DXML FFTs]
Obtained speed up vs. streams
• The IBM bus is "switched" - this can explain
why speed-up was obtained
• Speed-up on the SGI platform could be better

STREAM bandwidths (10^6 bytes/sec), from
http://www.cs.virginia.edu/stream/standard/Bandwidth.html

Machine                   ncpus     COPY    SCALE      ADD    TRIAD
IBM_SP-PWR3smp-High-222       8   2954.1     2821     3889   3872.2
IBM_SP-PWR3smp-High-222       4   1603.5   1535.9   2218.3   2187.5
IBM_SP-PWR3smp-High-222       2    820.7    800.4   1182.1   1165.4
IBM_SP-PWR3smp-High-222       1    413      421.2    587.4    614.2
SGI_Origin2000-195            8   1355      1450     1413     1675
SGI_Origin2000-195            6   1066      1075     1262     1396
SGI_Origin2000-195            4    666       727      792      874
SGI_Origin2000-195            2    351       365      392      413
SGI_Origin2000-195            1    296       300      315      317
PAPI measurements — IBM
• Critical code section:

  main loop
    ...
    call nlin(...)
    ...
  end main loop

  subroutine nlin
    Compute arrays
    Call FFTs
    Compute arrays
    Call FFTs
    Compute arrays
  end subroutine nlin

• Overlapping counters used

Inclusive results   Seconds
Main loop            169.90
FFTs                 109.66
nlin                 131.13

Exclusive results   Seconds
PAPI main loop        38.76
PAPI FFTs            109.66
PAPI nlin             21.47
Raw results
Exclusive results   PAPI_L1_LDM   PAPI_L1_STM   PAPI_L2_LDM   PAPI_L2_STM
PAPI main loop        234670773       9165559             0           675
PAPI FFTs             159252099      16727100             0             0
PAPI nlin             152516696      19557038             0             0

                    PAPI_LD_INS   PAPI_ST_INS
PAPI main loop        894209461     513179931
PAPI FFTs            6169588863    5057277064
PAPI nlin             661376717     475098610

                    PAPI_FP_INS   PAPI_PRF_DM   Seconds
PAPI main loop        334859261       5042052      38.8
PAPI FFTs            7967695530         10715     109.7
PAPI nlin            1097492151      24339739      21.5
Deduced results
Exclusive results   LD_INS/FP_INS   ST_INS/FP_INS   LD_INS/L1_LDM
PAPI main loop               2.67            1.53            3.81
PAPI FFTs                    0.77            0.63           38.74
PAPI nlin                    0.60            0.43            4.34

                    L1 LD B/W   L1 ST B/W   PREF B/W   (MiBi/sec)
PAPI main loop         184.76        7.22       3.97
PAPI FFTs               44.32        4.65       0.00
PAPI nlin              216.78       27.80      34.60

                    M(FP_INS)/s
PAPI main loop             8.64
PAPI FFTs                 72.66
PAPI nlin                 51.12

• Still not at memory peak in nlin
• Good cache reuse in the FFTs
• No cache reuse in the main loop (cache line
length is 32 bytes)
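The bandwidth rows follow directly from the raw counts, the 32-byte L1 cache line, and the exclusive times. For example, for the main loop:

  L1 LD B/W = PAPI_L1_LDM × 32 bytes / time
            = 234670773 × 32 / 38.76 s ≈ 193.7 × 10^6 B/s ≈ 184.8 MiBi/sec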
For More Information
• http://icl.cs.utk.edu/projects/papi/
– Software and documentation
– Reference materials
– Papers and presentations
– Third-party tools
– Mailing lists