Performance Tuning Using Hardware Counter Data
Philip Mucci [email protected]
Shirley Moore [email protected]
Nils Smeds [email protected]
SC 2001, November 12, 2001, Denver, Colorado

Outline
• Issues in application performance tuning – 30 minutes
• General design of PAPI – 15 minutes
• PAPI high-level interface – 15 minutes
• PAPI low-level interface – 15 minutes
• Counter overflow interrupts and statistical profiling – 30 minutes (advanced)
• Tools that use PAPI – 30 minutes
• Code examples – 30 minutes

Issues in Application Performance Tuning

HPC Architecture
• RISC or super-scalar architecture
  – Pipelined functional units
  – Multiple functional units in the CPU
  – Speculative execution
  – Several levels of cache memory
  – Cache lines shared between CPUs

[Block diagram: POWER3 processing units (Model 260) – two floating point units (FPU1, FPU2), three fixed point units (FXU1-FXU3), two load/store units (LS1, LS2), branch/dispatch unit with a 2048-entry branch history table and a 256-entry branch target cache, 32 KB 128-way instruction cache and 64 KB 128-way data cache with memory management units, bus interface unit (L2 control, clock) with a 32-byte path to the 1-16 MB L2 cache at 200 MHz = 6.4 GB/s and a 16-byte path to the 5XX bus at 100 MHz = 1.6 GB/s]

[Block diagram: Itanium processor – L1 instruction cache and fetch/pre-fetch engine with branch prediction and ITLB, decoupling buffer (8 bundles), IA-32 decode and control, branch units, integer and MM units, SIMD FMAC floating point units, 128 integer registers, 128 FP registers, branch and predicate registers, register stack engine / re-mapping, ALAT, scoreboard/predicate/NaTs/exceptions, dual-port L1 data cache and DTLB, L2 and L3 caches, bus controller]

Hardware Counters
• Small set of registers that count events, which are occurrences of specific signals related to the processor's function
• Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture.

Pipelined Functional Units
• The circuitry on a chip that performs a given operation is called a functional unit.
• Most integer and floating point units are pipelined
  – Each stage of a pipelined unit works simultaneously on a different set of operands
  – After the initial startup latency, the goal is to generate one result every clock cycle

Super-scalar Processors
• Processors that have multiple functional units are called superscalar.
• Examples:
  – IBM Power 3
    • 2 floating point units (multiply-add)
    • 3 fixed point units
    • 2 load/store units
    • 1 branch/dispatch unit

Super-scalar Processors (cont.)
  – MIPS R12K
    • 2 floating point units (1 multiply-add, 1 add)
    • 2 integer units
    • 2 load/store units
  – Alpha EV67
    • Instruction fetch/issue/retire unit
    • Integer execution unit (2 IU clusters)
    • Floating point execution unit (2 FPUs)

Super-scalar Processors (cont.)
  – Intel Itanium
    • EPIC (Explicitly Parallel Instruction Computing) design
    • 4 integer units
    • 4 multimedia units
    • 2 load/store units
    • 3 branch units
    • 2 extended precision floating point units
    • 2 single precision floating point units

Out of Order Execution
• The CPU dynamically executes instructions as their operands become available, out of order if necessary
  – Any result generated out of order is temporary until all previous instructions have successfully completed.
  – Queues are used to select which instructions to issue dynamically to the execution units.
  – Relevant hardware counter metrics: instructions issued, instructions completed
Speculative Execution
• The CPU attempts to predict which way a branch will go and continues executing instructions speculatively along that path.
  – If the prediction is wrong, instructions executed down the incorrect path must be canceled.
  – On many processors, hardware counters keep counts of branch prediction hits and misses.

Instruction Counts and Functional Unit Status
• Relevant hardware counter data
  – Total cycles
  – Total instructions
  – Floating point operations
  – Load/store instructions
  – Cycles functional units are idle
  – Cycles stalled
    • waiting for memory access
    • waiting for resource
  – Conditional branch instructions
    • executed
    • mispredicted

Cache and Memory Hierarchy
• Registers: On-chip circuitry used to hold operands and results of calculations
• L1 (primary) data cache: Small on-chip cache used to hold data about to be operated on
• L2 (secondary) cache: Larger (on- or off-chip) cache used to hold data and instructions retrieved from local memory.
• Some systems have L3 and even L4 caches.

Cache and Memory Hierarchy (cont.)
• Local memory: Memory on the same node as the processor
• Remote memory: Memory on another node but accessible over an interconnect network.
• Each level of the memory hierarchy introduces approximately an order of magnitude more latency than the previous level.

Cache Structure
• Memory on a node is organized as an array of cache lines, which are typically 4 or 8 words long. When a data item is fetched from a higher level cache or from local memory, an entire cache line is fetched.
• Caches can be either
  – direct mapped or
  – N-way set associative
• A cache miss occurs when the program refers to a data item that is not present in the cache.

Cache Contention
• When two or more CPUs alternately and repeatedly update the same cache line
  – memory contention
    • when two or more CPUs update the same variable
    • correcting it involves an algorithm change
  – false sharing
    • when CPUs update distinct variables that occupy the same cache line
    • correcting it involves modification of the data structure layout (a code sketch at the end of this section illustrates this)

Cache Contention (cont.)
• Relevant hardware counter metrics
  – Cache misses and hit ratios
  – Cache line invalidations

TLB and Virtual Memory
• Memory is divided into pages.
• The operating system translates the virtual page addresses used by a program into physical addresses used by the hardware.
  – The most recently used addresses are cached in the translation lookaside buffer (TLB).
  – When the program refers to a virtual address that is not in the TLB, a TLB miss occurs.
• Relevant hardware counter metric: TLB misses

Memory Latencies
• CPU register: 0 cycles
• L1 cache hit: 2-3 cycles
• L1 cache miss satisfied by L2 cache hit: 8-12 cycles
• L2 cache miss satisfied from main memory, no TLB miss: 75-250 cycles
• TLB miss requiring only a reload of the TLB: ~2000 cycles
• TLB miss requiring reload of the virtual page (page fault): hundreds of millions of cycles

Steps of Optimization
• Optimize compiler switches
• Integrate libraries
• Profile
• Optimize blocks of code that dominate execution time by using hardware counter data to determine why the bottlenecks exist
• Always examine correctness at every stage!
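To make the false-sharing case above concrete, here is a minimal, hypothetical C sketch (the struct layout, thread count, and iteration count are illustrative and not taken from the tutorial; the 32-byte line size is only an assumption). Two threads repeatedly update distinct counters: if the counters share a cache line, the line ping-pongs between CPUs; padding each counter onto its own line removes the contention, and the cache miss and invalidation counters listed above would show the difference.

  #include <pthread.h>
  #include <stdio.h>

  #define CACHE_LINE 32               /* assumed cache line size in bytes */
  #define ITERS      10000000L

  /* False sharing: both counters occupy the same cache line. */
  struct shared_bad { long a; long b; };

  /* Layout fix: pad each counter so it owns a full cache line. */
  struct shared_good {
      long a; char pad_a[CACHE_LINE - sizeof(long)];
      long b; char pad_b[CACHE_LINE - sizeof(long)];
  };

  static struct shared_bad  bad;
  static struct shared_good good;

  static void *bump_bad_a(void *arg)  { (void)arg; for (long i = 0; i < ITERS; i++) bad.a++;  return NULL; }
  static void *bump_bad_b(void *arg)  { (void)arg; for (long i = 0; i < ITERS; i++) bad.b++;  return NULL; }
  static void *bump_good_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) good.a++; return NULL; }
  static void *bump_good_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) good.b++; return NULL; }

  int main(void)
  {
      pthread_t t1, t2;

      /* Padded layout: each CPU keeps its own line, no invalidation traffic. */
      pthread_create(&t1, NULL, bump_good_a, NULL);
      pthread_create(&t2, NULL, bump_good_b, NULL);
      pthread_join(t1, NULL); pthread_join(t2, NULL);

      /* Unpadded layout: the shared line is invalidated on every update. */
      pthread_create(&t1, NULL, bump_bad_a, NULL);
      pthread_create(&t2, NULL, bump_bad_b, NULL);
      pthread_join(t1, NULL); pthread_join(t2, NULL);

      printf("bad: %ld %ld  good: %ld %ld\n", bad.a, bad.b, good.a, good.b);
      return 0;
  }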
General Design of PAPI

Goals
• Solid foundation for cross platform performance analysis tools
• Free tool developers from reimplementing counter access
• Standardization between vendors, academics and users
• Encourage vendors to provide hardware and OS support for counter access
• Reference implementations for a number of HPC architectures
• Well documented and easy to use

Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
• Parallel Tools Consortium project: http://www.ptools.org/

PAPI Counter Interfaces
• PAPI provides three interfaces to the underlying counter hardware:
  1. The low level interface manages hardware events in user defined groups called EventSets.
  2. The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
  3. Graphical tools to visualize information.

PAPI Implementation
[Layer diagram: a portable, machine-independent layer (Java monitor GUI, PAPI high level, PAPI low level) sits on top of a machine-specific layer (PAPI machine dependent substrate, kernel extension, operating system, hardware performance counters)]

PAPI Preset Events
• Proposed standard set of events deemed most relevant for application performance tuning
• Defined in papiStdEventDefs.h
• Mapped to native events on a given platform
  – Run tests/avail to see the list of PAPI preset events available on a platform (a small query sketch follows below)
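Besides running tests/avail, availability of individual presets can be checked programmatically. The following is a hedged sketch using PAPI_query_event from the low-level API, assuming it returns PAPI_OK when a preset maps to native events on the current platform; the particular presets queried are just examples.

  #include <stdio.h>
  #include "papi.h"

  int main(void)
  {
      /* A few presets from papiStdEventDefs.h used elsewhere in this tutorial. */
      int presets[] = { PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_FP_INS, PAPI_L1_LDM, PAPI_TLB_DM };
      const char *names[] = { "PAPI_TOT_CYC", "PAPI_TOT_INS", "PAPI_FP_INS",
                              "PAPI_L1_LDM", "PAPI_TLB_DM" };
      int i, n = sizeof(presets) / sizeof(presets[0]);

      /* Must be called before any other low-level PAPI call. */
      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
          fprintf(stderr, "PAPI_library_init failed\n");
          return 1;
      }

      for (i = 0; i < n; i++)   /* PAPI_OK means the preset is countable here */
          printf("%-14s %s\n", names[i],
                 PAPI_query_event(presets[i]) == PAPI_OK ? "available" : "not available");
      return 0;
  }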
PAPI Release
• Platforms
  – Linux/x86, Windows 2000
    • Requires a patch to the Linux kernel; a driver for Windows
  – Linux/IA-64
  – Sun Solaris/Ultra 2.8
  – IBM AIX/Power
    • Contact IBM for pmtoolkit
  – SGI IRIX/MIPS
  – Compaq Tru64/Alpha Ev6 & Ev67
    • Requires an OS device driver from Compaq
  – Cray T3E/Unicos

PAPI Release (cont.)
• C and Fortran bindings and Matlab wrappers
• To download the software: http://icl.cs.utk.edu/projects/papi/

PAPI High-level Interface

High-level Interface
• Meant for application programmers wanting coarse-grained measurements
• Not thread safe
• Calls the lower level API
• Allows only PAPI preset events
• Easier to use and requires less setup (additional code) than the low-level interface

High-level API
• C interface / Fortran interface
  – PAPI_start_counters / PAPIF_start_counters
  – PAPI_read_counters / PAPIF_read_counters
  – PAPI_stop_counters / PAPIF_stop_counters
  – PAPI_accum_counters / PAPIF_accum_counters
  – PAPI_num_counters / PAPIF_num_counters
  – PAPI_flops / PAPIF_flops

Setting up the High-level Interface
• int PAPI_num_counters(void)
  – Initializes PAPI (if needed)
  – Returns the number of hardware counters
• int PAPI_start_counters(int *events, int len)
  – Initializes PAPI (if needed)
  – Sets up an event set with the given counters
  – Starts counting in the event set
• int PAPI_library_init(int version)
  – Low-level routine implicitly called by the above

Controlling the Counters
• PAPI_stop_counters(long_long *vals, int alen)
  – Stop the counters and put the counter values in the array
• PAPI_accum_counters(long_long *vals, int alen)
  – Accumulate the counters into the array and reset
• PAPI_read_counters(long_long *vals, int alen)
  – Copy the counter values into the array and reset the counters
• PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)
  – Wall-clock time, process time, FP instructions since start, and Mflop/s since the last call

PAPI_flops
• int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops)
  – Only two calls are needed: PAPI_flops before and after the code you want to monitor (see the sketch after the example below)
  – real_time is the wall-clock time between the two calls
  – proc_time is the "virtual" time, or time the process was actually executing, between the two calls (not as fine grained as real_time, but better for longer measurements)
  – flpins is the total floating point instructions executed between the two calls
  – mflops is the Mflop/s rating between the two calls

PAPI High-level Example

  long long values[NUM_EVENTS];
  unsigned int Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC};

  /* Start the counters */
  PAPI_start_counters((int*)Events,NUM_EVENTS);

  do_work();   /* What we are monitoring */

  /* Stop the counters and store the results in values */
  retval = PAPI_stop_counters(values,NUM_EVENTS);
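A minimal sketch of the PAPI_flops call described above, assuming it returns PAPI_OK on success. The do_work() routine and its loop are purely illustrative stand-ins for the code being rated.

  #include <stdio.h>
  #include "papi.h"

  static void do_work(void)                 /* stand-in for the code you want to rate */
  {
      double a = 0.5, b = 2.2, c = 0.0;
      for (int i = 0; i < 10000000; i++)
          c += a * b;
      if (c < 0.0) printf("%f\n", c);       /* keep the compiler from discarding the loop */
  }

  int main(void)
  {
      float rtime, ptime, mflops;
      long_long flpins;

      /* First call starts the counters (and initializes PAPI if needed). */
      if (PAPI_flops(&rtime, &ptime, &flpins, &mflops) != PAPI_OK)
          return 1;

      do_work();

      /* Second call reports wall-clock time, process time, FP instructions,
         and the Mflop/s rating since the first call. */
      if (PAPI_flops(&rtime, &ptime, &flpins, &mflops) != PAPI_OK)
          return 1;

      printf("real %.3f s  proc %.3f s  flpins %lld  %.2f Mflop/s\n",
             rtime, ptime, (long long)flpins, mflops);
      return 0;
  }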
Return codes

  Name             Description
  PAPI_OK          No error
  PAPI_EINVAL      Invalid argument
  PAPI_ENOMEM      Insufficient memory
  PAPI_ESYS        A system/C library call failed; check the errno variable
  PAPI_ESBSTR      Substrate returned an error, e.g. an unimplemented feature
  PAPI_ECLOST      Access to the counters was lost or interrupted
  PAPI_EBUG        Internal error
  PAPI_ENOEVNT     Hardware event does not exist
  PAPI_ECNFLCT     Hardware event exists, but resources are exhausted
  PAPI_ENOTRUN     Event or event set is currently not counting
  PAPI_EISRUN      Event or event set is currently running
  PAPI_ENOEVST     No event set available
  PAPI_ENOTPRESET  Argument is not a preset
  PAPI_ENOCNTR     Hardware does not support counters
  PAPI_EMISC       Any other error occurred

PAPI Low-level Interface

Low-level Interface
• Increased efficiency and functionality over the high level PAPI interface
• About 40 functions
• Obtains information about the executable and the hardware
• Thread-safe
• Fully programmable
• Callbacks on counter overflow

Low-level Functionality
• Library initialization: PAPI_library_init, PAPI_thread_init, PAPI_shutdown
• Timing functions: PAPI_get_real_usec, PAPI_get_virt_usec, PAPI_get_real_cyc, PAPI_get_virt_cyc (used in the sketch after the Simple Example below)
• Inquiry functions
• Management functions
• Simple lock: PAPI_lock/PAPI_unlock

Event sets
• An event set contains the key information:
  – Which low-level hardware counters to use
  – The most recently read counter values
  – The state of the event set (running/not running)
  – Option settings (e.g., domain, granularity, overflow, profiling)
• Event sets can overlap if they map to the same hardware counter set-up.
  – Allows inclusive/exclusive measurements

Event set Operations
• Event set management: PAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset
• Event set control: PAPI_start, PAPI_stop, PAPI_read, PAPI_accum
• Event set inquiry: PAPI_query_event, PAPI_list_events, ...

Simple Example

  #include "papi.h"
  #define NUM_EVENTS 2
  int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;
  long_long values[NUM_EVENTS];

  /* Initialize the library */
  retval = PAPI_library_init(PAPI_VER_CURRENT);
  /* Allocate space for the new eventset and do setup */
  retval = PAPI_create_eventset(&EventSet);
  /* Add Flops and total cycles to the eventset */
  retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);
  /* Start the counters */
  retval = PAPI_start(EventSet);

  do_work();   /* What we want to monitor */

  /* Stop counters and store results in values */
  retval = PAPI_stop(EventSet,values);
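Building on the Simple Example above, the following hedged sketch combines the timing calls listed under Low-level Functionality with PAPI_start/PAPI_stop to derive an instruction rate. The event choice, the do_work() routine, and the terse error handling are illustrative only.

  #include <stdio.h>
  #include "papi.h"

  extern void do_work(void);                /* the code section being measured */

  int main(void)
  {
      int EventSet, events[1] = { PAPI_TOT_INS };
      long_long count[1], t0, t1, v0, v1;

      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
      if (PAPI_create_eventset(&EventSet) != PAPI_OK) return 1;
      if (PAPI_add_events(&EventSet, events, 1) != PAPI_OK) return 1;

      t0 = PAPI_get_real_usec();            /* wall-clock microseconds */
      v0 = PAPI_get_virt_usec();            /* process ("virtual") microseconds */

      if (PAPI_start(EventSet) != PAPI_OK) return 1;
      do_work();
      if (PAPI_stop(EventSet, count) != PAPI_OK) return 1;

      t1 = PAPI_get_real_usec();
      v1 = PAPI_get_virt_usec();

      /* instructions per microsecond = millions of instructions per second */
      printf("instructions %lld  real %lld us  virtual %lld us  %.1f MIPS\n",
             (long long)count[0], (long long)(t1 - t0), (long long)(v1 - v0),
             (double)count[0] / (double)(t1 - t0));
      return 0;
  }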
Overlapping Counters

  retval = PAPI_start(InclEventSet);
  retval = PAPI_start(OthersEventSet);
  ...
  retval = PAPI_reset(OthersEventSet);
  do_flops(NUM_FLOPS);   /* Function call */
  retval = PAPI_accum(OthersEventSet,Othersvalues);
  ...
  retval = PAPI_stop(InclEventSet,Inclvalues);

  printf("Counts: %12lld %12lld\n",
         Inclvalues[0], Inclvalues[0]-Othersvalues[0]);

Counter Domains
• int PAPI_set_domain(int domain);
  – PAPI_DOM_USER    User context counted
  – PAPI_DOM_KERNEL  Kernel/OS context counted
  – PAPI_DOM_OTHER   Exception/transient mode
  – PAPI_DOM_ALL     All of the above contexts counted
  – PAPI_DOM_MIN     The smallest available context
  – PAPI_DOM_MAX     The largest available context
• Not all domains are available on all platforms (OS dependent)

Counter Granularity
• int PAPI_set_granularity(int granul);
  – PAPI_GRN_THR      count each individual thread
  – PAPI_GRN_PROC     count each individual process
  – PAPI_GRN_PROCG    count each process group
  – PAPI_GRN_SYS      count on the current CPU
  – PAPI_GRN_SYS_CPU  count on every CPU
  – PAPI_GRN_MIN      (= PAPI_GRN_THR)
  – PAPI_GRN_MAX      (= PAPI_GRN_SYS_CPU)
• Requires OS support

Using PAPI with Threads
• After PAPI_library_init, you need to register a unique thread identifier function
• For Pthreads: retval = PAPI_thread_init(pthread_self, 0);
• For OpenMP: retval = PAPI_thread_init(omp_get_thread_num, 0);
• Each thread is responsible for the creation, start, stop and read of its own counters (see the sketch below)
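A hedged sketch of per-thread counting with Pthreads, following the two-argument PAPI_thread_init call shown on the slide above (the cast of pthread_self is added only to satisfy the compiler; the thread count, event choice, and omitted error checking are illustrative).

  #include <pthread.h>
  #include <stdio.h>
  #include "papi.h"

  #define NUM_THREADS 4

  static void *thread_work(void *arg)
  {
      /* Each thread creates, starts, stops, and reads its own counters. */
      int EventSet, events[1] = { PAPI_TOT_CYC };
      long_long values[1];

      PAPI_create_eventset(&EventSet);
      PAPI_add_events(&EventSet, events, 1);
      PAPI_start(EventSet);

      /* ... per-thread work would go here ... */

      PAPI_stop(EventSet, values);
      printf("thread %ld: %lld cycles\n", (long)(size_t)arg, (long long)values[0]);
      return NULL;
  }

  int main(void)
  {
      pthread_t tid[NUM_THREADS];
      long i;

      PAPI_library_init(PAPI_VER_CURRENT);
      /* Register the unique thread identifier function, as shown on the slide. */
      PAPI_thread_init((unsigned long (*)(void)) pthread_self, 0);

      for (i = 0; i < NUM_THREADS; i++)
          pthread_create(&tid[i], NULL, thread_work, (void *)i);
      for (i = 0; i < NUM_THREADS; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }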
Using PAPI with Multiplexing
• Multiplexing allows simultaneous use of more counters than are supported by the hardware.
• PAPI_multiplex_init()
  – Should be called after PAPI_library_init() to initialize multiplexing
• PAPI_set_multiplex( int *EventSet );
  – Used after the event set is created to turn on multiplexing for that event set
• Then use PAPI as normal

Issues with Multiplexing
• Some platforms support hardware multiplexing; on those that don't, PAPI implements multiplexing in software.
• The more events you multiplex, the more likely it is that the representation is not correct.

Multiplex Code Examples
• From the PAPI source distribution:
  – tests/multiplex1.c
  – tests/multiplex1_pthreads.c

Native Events
• An event countable by the CPU can be counted even if there is no matching PAPI preset event
• Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition

Native Event Examples
• From the PAPI source distribution:
  – tests/native.c
  – ftests/native.F

Counter Overflow Interrupts and Statistical Profiling

Callbacks on Counter Overflow
• PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.
• For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler.

PAPI_overflow
• int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler)
• Sets up an EventSet such that when it is PAPI_start()'d, it begins to register overflows
• The EventSet may contain multiple events, but only one may be an overflow trigger.

Overflow Code Examples
• From the PAPI source distribution:
  – tests/overflow.c
  – tests/overflow_pthreads.c

Statistical Profiling
• PAPI provides support for execution profiling based on any counter event.
• PAPI_profil() creates a histogram of overflow counts for a specified region of the application code.

PAPI_profil
• int PAPI_profil(unsigned short *buf, unsigned int bufsiz, unsigned long offset, unsigned scale, int EventSet, int EventCode, int threshold, int flags)
  – buf: buffer of bufsiz bytes in which the histogram counts are stored
  – offset: start address of the region to be profiled
  – scale: contraction factor that indicates how much smaller the histogram buffer is than the region to be profiled

Profiling Code Examples
• From the PAPI source distribution:
  – tests/profile.c
  – tests/sprofile.c
  – tests/profile_pthreads.c

Tools that use PAPI

Perfometer
• The application is instrumented with PAPI
  – call perfometer()
  – call mark_perfometer(Color)
• The application is started. At the call to perfometer, a signal handler and a timer are set up to collect and send the information to a Java applet containing the graphical view.
• Sections of code that are of interest can be designated with specific colors
  – Using a call to set_perfometer('color')
• Real-time display or trace file

[Screenshot: Perfometer display showing machine info, flop/s rate, flop/s min/max, and process & real time]

[Screenshot: Perfometer parallel interface]

Third-party Tools that use PAPI
• DEEP/PAPI (Pacific Sierra): http://www.psrv.com/deep_papi_top.html
• TAU (Allen Malony, U of Oregon): http://www.cs.uoregon.edu/research/paracomp/tau/
• SvPablo (Dan Reed, U of Illinois): http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
• Cactus (Ed Seidel, Max Planck/U of Illinois): http://www.aei-potsdam.mpg.de
• Vprof (Curtis Janssen, Sandia Livermore Lab): http://aros.ca.sandia.gov/~cljanss/perf/vprof/
• Cluster Tools (Al Geist, ORNL)
• DynaProf (Phil Mucci, UTK): http://www.cs.utk.edu/~mucci/dynaprof/

[Screenshots: DEEP/PAPI, SvPablo, TAU, and vprof displays]

Code Examples

Code Examples
• Parallelising a particle-particle simulator
• Parallelising a frequency domain MHD simulator

Particle-particle simulator
• Particles fall in a well
• Particle interactions are computed for particles in the neighbourhood only
  [Figure: a particle and its neighbourhood]
• Occasionally the neighbourhood list is recomputed
• 1000 particles
• Neighbour list length 10000
• ~6000-7000 interactions

Algorithm used

  For each particle i
    For each neighbour j
      Compute distance ij
      Compute inter-particle force
      Update force on particles i & j
  Compute accelerations and updated positions

• The force vector is sum-updated in a "random" access pattern
• Little cache reuse
• Inhibits SMP parallelization

Reversed neighbour list
• Introduce a force interaction vector
• Introduce a reverse neighbour list

  For each particle i
    For each neighbour j
      Compute distance ij
      Compute inter-particle force
      Update force on particle i
      Update force on particle j
  Compute accelerations and update positions

• Inter-particle force written linearly, but read randomly in the j-loop
• Force vector updated linearly (a C sketch of both patterns follows at the end of this example)

Final performance
[Graph: wall clock time per time step for (1) naive load balancing and (2) neighbour balancing]

Explanation
• The user reports the serial program is 3 times faster
  – Several contributing factors:
    • Compiler optimisations
    • Compiler inlining
    • Better cache utilization
• Without the linear traversal of writes there is no speed-up (not shown in the previous graph)
• Is the scaling problem on the SGI a cache issue? The whole problem fits nicely into one 8 MB L2 cache.
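To make the neighbour-list restructuring above concrete, here is a hypothetical C sketch of the two force-update patterns. All array names, dimensions, and the force expression are illustrative and not taken from the user's code; positions and forces are kept one-dimensional to keep the fragment short.

  #define NPART  1000
  #define NPAIRS 10000

  int    nbr_start[NPART + 1];   /* forward list: neighbours of particle i are pairs nbr_start[i]..nbr_start[i+1]-1 */
  int    nbr_j[NPAIRS];          /* the j partner of each pair */
  int    rev_start[NPART + 1];   /* reverse list: pairs in which a particle is the j partner */
  int    rev_pair[NPAIRS];
  double pos[NPART], force[NPART], fpair[NPAIRS];

  double pair_force(double d) { return 1.0 / (d * d + 1.0); }   /* placeholder force law */

  /* Original pattern: force[] is sum-updated in a "random" access pattern. */
  void forces_scatter(void)
  {
      for (int i = 0; i < NPART; i++)
          for (int p = nbr_start[i]; p < nbr_start[i + 1]; p++) {
              double f = pair_force(pos[i] - pos[nbr_j[p]]);
              force[i]        += f;
              force[nbr_j[p]] -= f;   /* random write: poor cache reuse, blocks SMP parallelisation */
          }
  }

  /* Reversed-list pattern: pair forces written linearly, force[] updated linearly. */
  void forces_reversed(void)
  {
      for (int i = 0; i < NPART; i++)
          for (int p = nbr_start[i]; p < nbr_start[i + 1]; p++) {
              fpair[p]  = pair_force(pos[i] - pos[nbr_j[p]]);   /* written linearly */
              force[i] += fpair[p];                             /* linear update of force[i] */
          }

      for (int j = 0; j < NPART; j++)           /* j-loop: fpair read "randomly", force[] written linearly */
          for (int p = rev_start[j]; p < rev_start[j + 1]; p++)
              force[j] -= fpair[rev_pair[p]];
  }

Both loops in forces_reversed() sweep their output arrays in order, which is what restores cache reuse and allows the pair loop and the particle loop to be parallelised independently.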
Frequency domain MHD
• The code makes frequent 3D FFT transformations
• Electric and magnetic fields are double complex, 128 bit precision
• Array dimensions are (3,N,N,N), N=64
• Array size: 12 MB per field
• In between calls, matrices are set up in loops:
    M(:,j,k,l) = A(:,j,k,l) × B(:,j,k,l) + C(:,j,k,l)
• Parallel FFTs are available
• Parallel matrix set-up is straightforward

Expected behaviour
• The code is expected to be memory bound outside the FFTs, due to the array sizes and the ratio of floating point operations to memory accesses
• Going parallel on a bus gives no gain - or does it?
• Speed-up should be obtainable on CC-NUMA
• The code should run well on vector systems with good FFTs and enough memory ports

Observed behaviour
[Graph: observed speed-up; annotations note an overloaded system and serial DXML FFTs]

Obtained speed-up vs. STREAM
• The IBM bus is "switched" - this can explain why speed-up was obtained
• Speed-up on the SGI platform could be better
• STREAM bandwidth figures (10^6 byte/s), from http://www.cs.virginia.edu/stream/standard/Bandwidth.html:

  Machine                    ncpus   COPY    SCALE   ADD     TRIAD
  IBM_SP-PWR3smp-High-222    8       2954.1  2821    3889    3872.2
  IBM_SP-PWR3smp-High-222    4       1603.5  1535.9  2218.3  2187.5
  IBM_SP-PWR3smp-High-222    2       820.7   800.4   1182.1  1165.4
  IBM_SP-PWR3smp-High-222    1       413     421.2   587.4   614.2
  SGI_Origin2000-195         8       1355    1450    1413    1675
  SGI_Origin2000-195         6       1066    1075    1262    1396
  SGI_Origin2000-195         4       666     727     792     874
  SGI_Origin2000-195         2       351     365     392     413
  SGI_Origin2000-195         1       296     300     315     317

PAPI measurements: IBM
• Critical code section:

  main loop
    ...
    call nlin(...)
    ...
  end main loop

  subroutine nlin
    Compute arrays
    Call FFTs
    Compute arrays
    Call FFTs
    Compute arrays
  end subroutine nlin

• Overlapping counters were used

  Inclusive results (seconds)
    Main loop  169.90
    FFTs       109.66
    nlin       131.13

  Exclusive results (seconds)
    PAPI main loop   38.76
    PAPI FFTs       109.66
    PAPI nlin        21.47

Raw results
• Exclusive results:

                   PAPI_L1_LDM  PAPI_L1_STM  PAPI_L2_LDM  PAPI_L2_STM
  PAPI main loop   234670773    9165559      0            675
  PAPI FFTs        159252099    16727100     0            0
  PAPI nlin        152516696    19557038     0            0

                   PAPI_LD_INS  PAPI_ST_INS
  PAPI main loop   894209461    513179931
  PAPI FFTs        6169588863   5057277064
  PAPI nlin        661376717    475098610

                   PAPI_FP_INS  PAPI_PRF_DM  Seconds
  PAPI main loop   334859261    5042052      38.8
  PAPI FFTs        7967695530   10715        109.7
  PAPI nlin        1097492151   24339739     21.5

Deduced results
• Exclusive results (a sketch showing how these are computed appears at the end):

                   LD_INS/FP_INS  ST_INS/FP_INS  LD_INS/L1_LDM
  PAPI main loop   2.67           1.53           3.81
  PAPI FFTs        0.77           0.63           38.74
  PAPI nlin        0.60           0.43           4.34

                   L1 LD B/W  L1 ST B/W  PREF B/W   (MiBi/sec)
  PAPI main loop   184.76     7.22       3.97
  PAPI FFTs        44.32      4.65       0.00
  PAPI nlin        216.78     27.80      34.60

                   M(FP_INS)/s
  PAPI main loop   8.64
  PAPI FFTs        72.66
  PAPI nlin        51.12

• Still not at memory peak in nlin
• Good cache reuse in the FFTs
• No cache reuse in the main loop (cache line length is 32 bytes)

For More Information
• http://icl.cs.utk.edu/projects/papi/
  – Software and documentation
  – Reference materials
  – Papers and presentations
  – Third-party tools
  – Mailing lists
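As a closing illustration of how figures like those on the Deduced results slide can be derived from raw counts, here is a hedged sketch. The nlin() routine is a hypothetical stand-in for the measured code section, the 32-byte line size is the value quoted above for this platform, and counting three events in one event set is an assumption (on some platforms this would require multiplexing).

  #include <stdio.h>
  #include "papi.h"

  #define L1_LINE_BYTES 32.0            /* cache line length quoted for this platform */

  extern void nlin(void);               /* hypothetical stand-in for the measured code section */

  int main(void)
  {
      int EventSet, events[3] = { PAPI_LD_INS, PAPI_FP_INS, PAPI_L1_LDM };
      long_long v[3], t0, t1;
      double secs;

      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
      PAPI_create_eventset(&EventSet);
      PAPI_add_events(&EventSet, events, 3);

      t0 = PAPI_get_real_usec();
      PAPI_start(EventSet);
      nlin();
      PAPI_stop(EventSet, v);
      t1 = PAPI_get_real_usec();
      secs = (double)(t1 - t0) / 1.0e6;

      /* Ratios and bandwidths in the style of the Deduced results slide:
         L1 load bandwidth = L1 load misses * line size / time. */
      printf("LD_INS/FP_INS : %.2f\n", (double)v[0] / (double)v[1]);
      printf("LD_INS/L1_LDM : %.2f\n", (double)v[0] / (double)v[2]);
      printf("L1 LD B/W     : %.2f MiBi/s\n",
             (double)v[2] * L1_LINE_BYTES / (1024.0 * 1024.0) / secs);
      printf("FP rate       : %.2f M(FP_INS)/s\n", (double)v[1] / 1.0e6 / secs);
      return 0;
  }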