Observing End-to-End Parallel I/O (I/O "Long-Path")

Performance Perspectives
•  The Parallel Application
•  The Transport Infrastructure
•  Parallel I/O Servers
•  There are performance questions associated with each of the above
•  Some questions involve all of them
Application Performance Questions
•  How long does an application spend in I/O?
   –  If I/O is not fully synchronous, how do we measure this time?
   –  Is the local (within each rank) I/O time different from the global I/O time?
   –  And if so, does I/O lead to imbalance effects? (See the sketch below.)
   –  What types of I/O operations take the longest? Can we categorize/classify them?
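The local-vs-global question can be made concrete with a small measurement harness. The following is a minimal sketch (not part of the original material): each rank times a collective MPI-IO write, and reductions expose the spread between the slowest and the average rank, i.e. the I/O imbalance. The file name, buffer size, and offsets are placeholders.

/* Minimal sketch: measure per-rank I/O time around a collective write and
 * compare local vs. global time to expose imbalance. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                 /* 1 Mi doubles per rank */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);

    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    double local = MPI_Wtime() - t0;           /* local (per-rank) I/O time */

    double tmax, tmin, tsum;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("I/O time: max %.3fs min %.3fs avg %.3fs (imbalance %.2fx)\n",
               tmax, tmin, tsum / size, tmax / (tsum / size));

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}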
Transport Performance Questions
•  I/O can be split into request, processing, and reply
•  How long does each of the stages take?
   –  Application I/O call (MPI-IO, native PVFS), library
   –  Compute-to-I/O-node network, I/O-node daemon, PVFS plugin
   –  I/O-node-to-storage network, PVFS servers, storage
•  Where is the bottleneck?
   –  For varying request types, sizes, …, workloads
•  What operations contribute to the per-stage time? (A per-stage timestamp sketch follows.)
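To make the stage decomposition concrete, here is an illustrative sketch; the stage names, the struct, and the placeholder timestamps are hypothetical and do not correspond to any PVFS or TAU data structure. The idea is simply to carry one timestamp per long-path stage with each request and take differences when the reply returns.

/* Illustrative sketch (hypothetical structure): one timestamp per long-path
 * stage, carried with a request, so per-stage durations can be computed. */
#include <stdint.h>
#include <stdio.h>

enum stage {
    STG_APP_CALL = 0,     /* application I/O call issued            */
    STG_ION_RECV,         /* request received by I/O-node daemon    */
    STG_PVFS_RECV,        /* request received by PVFS server        */
    STG_PVFS_DONE,        /* server finished processing             */
    STG_APP_REPLY,        /* reply observed back at the application */
    STG_COUNT
};

struct stage_stamps {
    uint64_t ns[STG_COUNT];   /* wall-clock timestamps, nanoseconds */
};

/* Print the duration of each stage-to-stage hop for one request. */
static void report(const struct stage_stamps *s)
{
    static const char *name[] = {
        "app -> ION network + call", "ION daemon + plugin",
        "ION -> server network + service", "reply path"
    };
    for (int i = 0; i + 1 < STG_COUNT; i++)
        printf("%-32s %8.3f ms\n", name[i],
               (s->ns[i + 1] - s->ns[i]) / 1e6);
}

int main(void)
{
    /* placeholder timestamps for one request, purely illustrative */
    struct stage_stamps s = { .ns = { 0, 120000, 450000, 2450000, 2600000 } };
    report(&s);
    return 0;
}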
PVFS Performance Questions
•  I/O request tracking in the PVFS state machine
   –  How much time does each stage take?
   –  How long does the request remain in queues? (See the queue-time sketch below.)
•  What are the service times for various requests?
   –  What is a good way to classify requests (sizes, input, output, metadata, data)?
•  How do request scheduling decisions affect performance?
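A queue-residency measurement can be sketched in the same spirit; the struct and timestamps below are hypothetical, not the PVFS2 state machine's bookkeeping. Timestamping enqueue, dispatch, and completion separates time spent waiting in server queues from the actual service time.

/* Illustrative sketch (hypothetical): split queue-residency time from
 * service time by timestamping enqueue, dispatch, and completion. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct req_timing {
    uint64_t enq_ns;   /* request placed on a server queue      */
    uint64_t disp_ns;  /* request dispatched to a state machine */
    uint64_t done_ns;  /* request completed                     */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    struct req_timing t;
    t.enq_ns  = now_ns();
    /* ... request waits in the queue ... */
    t.disp_ns = now_ns();
    /* ... state machine processes the request ... */
    t.done_ns = now_ns();

    printf("queue time  : %.3f us\n", (t.disp_ns - t.enq_ns) / 1e3);
    printf("service time: %.3f us\n", (t.done_ns - t.disp_ns) / 1e3);
    return 0;
}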
Parallel I/O Tracking and the Long-Path

[Diagram: the I/O long-path from compute nodes through tree I/O nodes to storage]
•  Compute nodes: Parallel Application → MPI-IO (ADIO / FUSE / UNIX) → libc syscalls → ZOIDFS client (libzoid_cn)
   –  No changes to API or structures; all exist in a single process memory address space
   –  Direct, implicit access to performance state
   –  Compute-node modules instrumented with TAU; TAU tracks and passes context as the call proceeds
•  Tree I/O nodes: ZOID daemon with ZOIDFS / UNIX / PVFS plugins
   –  Structure / API changes to libzoid_cn and the ZOID daemon
   –  Explicit wire-format encoding/decoding of TAU context; performance data returned
   –  I/O-node modules instrumented with TAU
   –  PVFS Hints: encode/decode TAU context / data
•  Storage (over GigE): PVFS Server
   –  Server accepts Hints (name/value) with requests (see the hint sketch below)
   –  TAU uses special measurement hints (e.g. context)
   –  TAU performs context-aware measurement
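The hint-based context transport can be illustrated as follows; the hint structure, the hint_add helper, and the "tau_context" key are hypothetical stand-ins rather than the PVFS2 hint API, but they show how a TAU context can ride along as a name/value pair and be decoded on the server for context-aware measurement.

/* Illustrative sketch only: a TAU context travelling as a name/value hint
 * attached to a request (hypothetical types and helpers). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct hint {               /* one name/value pair carried with a request */
    char     name[32];
    uint8_t  value[64];
    int      length;
};

struct tau_context {        /* context passed from application to server */
    uint32_t node_id;       /* originating compute node / MPI rank       */
    uint32_t callpath_id;   /* identifies the application call path      */
    uint64_t request_id;    /* correlates request, reply, and trace data */
};

static void hint_add(struct hint *h, const char *name,
                     const void *value, int length)
{
    snprintf(h->name, sizeof h->name, "%s", name);
    memcpy(h->value, value, (size_t)length);
    h->length = length;
}

int main(void)
{
    /* Client side: encode the context into a hint on the outgoing request. */
    struct tau_context ctx = { .node_id = 7, .callpath_id = 42,
                               .request_id = 1001 };
    struct hint h;
    hint_add(&h, "tau_context", &ctx, sizeof ctx);

    /* Server side: decode the hint and measure under that context. */
    struct tau_context seen;
    memcpy(&seen, h.value, sizeof seen);
    printf("server measuring under node %u, callpath %u, request %llu\n",
           seen.node_id, seen.callpath_id,
           (unsigned long long)seen.request_id);
    return 0;
}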
PVFS2 Server / TAU Integration
TAU Thread Groups
•  A PVFS2 measurement problem
   –  PVFS2 creates worker threads on demand to satisfy I/O requests and to hide wait latency
   –  Significant number of concurrent threads → contention (locking) over performance state
   –  Legacy PVFS2 tracing: single trace buffer / file
•  TAU performance model
   –  Avoids contention → splits performance state on a per-thread basis
   –  But with PVFS2, too many performance contexts
•  Reconciling both: TAU Thread Groups
   –  Threads are partitioned into groups; TAU maintains a performance pool, a set of re-usable performance contexts for each thread group
   –  On thread creation, a free context is used; if none is free, a new context is created in the pool; on thread deletion, the context is returned to the pool
   –  All measurement inside a thread is performed within its context
   –  No locking required for measurement events (only for thread creation / deletion)
   –  Manageable set of resulting contexts; granularity controlled through the group definition
   –  Profiling: one profile per group – the measurement data of all participants in the group are aggregated
   –  Tracing: as many traces as there were concurrently executing threads (not the total number of threads)
•  Allows handling large thread counts by lock-free performance pooling (see the pool sketch below)
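A minimal sketch of the context pool follows, assuming a pthreads server; the types and function names are illustrative, not TAU internals. Locking happens only when a thread acquires or returns a context, so measurement events themselves remain lock-free.

/* Minimal sketch (hypothetical) of a per-group context pool: a lock is taken
 * only on thread creation/deletion; measurement updates inside a thread touch
 * only its own context. */
#include <pthread.h>
#include <stdlib.h>

struct perf_context {
    struct perf_context *next;   /* free-list link            */
    unsigned long events;        /* example measurement state */
    double inclusive_time;
};

struct thread_group {
    pthread_mutex_t lock;        /* guards the free list only */
    struct perf_context *free_list;
};

/* Called on thread creation: reuse a free context or grow the pool. */
struct perf_context *group_acquire(struct thread_group *g)
{
    pthread_mutex_lock(&g->lock);
    struct perf_context *c = g->free_list;
    if (c)
        g->free_list = c->next;
    else
        c = calloc(1, sizeof *c);        /* grow the pool on demand */
    pthread_mutex_unlock(&g->lock);
    return c;
}

/* Called on thread deletion: return the context for later reuse; its
 * accumulated data stays in the pool and is aggregated per group. */
void group_release(struct thread_group *g, struct perf_context *c)
{
    pthread_mutex_lock(&g->lock);
    c->next = g->free_list;
    g->free_list = c;
    pthread_mutex_unlock(&g->lock);
}

/* Measurement event: lock-free, since the context is private to its thread. */
void record_event(struct perf_context *c, double elapsed)
{
    c->events++;
    c->inclusive_time += elapsed;
}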
PVFS Server / TAU Integration
TAU Trace Format (Ttf) API
API
• 
• 
Specialized
thread‐friendly
instrumentation/measurement
API
and
implementation
The
API
–  Event
format
definition
•  Free
to
define
events
and
associated
parameters
•  Vararg
(printf)
style
record
format
specification
•  Record
format
parsed
with
state‐machine
–  Event
occurrence
logging
•  On
event
occurrence,
performance
data
recorded
in
TAU
trace
format
–  Threading
event
notification
•  PVFS2
explicitly
notifies
TAU
of
thread
creation
/
deletions
• 
• 
• 
Emphasis
on
being
light‐weight
and
avoiding
contention
Uses
thread‐groups
internally
to
split
performance
state
Split
event
format
definitions
–
global
master
copy
/
thread
local
copy
–  Allows
event
formats
defined
in
one
thread
to
be
logged
by
other
threads
without
large
lock
overheads
• 
Utilities
for
merging
contexts
and
trace
file
format
conversions
• 
Also:
Automatic
Source
instrumentation
of
PVFS2
with
TAU/PDT
possible
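A usage sketch of the event-format idea follows; the function names (io_event_define, io_event_log) are hypothetical stand-ins for the thread-friendly Ttf-style API, and vprintf stands in for the state-machine record parser.

/* Usage sketch with hypothetical names, not the real Ttf signatures. */
#include <stdarg.h>
#include <stdio.h>

static int next_event_id = 0;

/* Define an event format once; the format string names the parameters
 * each occurrence of the event will carry. */
static int io_event_define(const char *name, const char *format)
{
    printf("defined event %d '%s' with record format \"%s\"\n",
           next_event_id, name, format);
    return next_event_id++;
}

/* Log one occurrence; the recorder walks the format string (here simply
 * via vprintf, standing in for the state-machine parser). */
static void io_event_log(int event_id, const char *format, ...)
{
    va_list ap;
    va_start(ap, format);
    printf("event %d: ", event_id);
    vprintf(format, ap);
    printf("\n");
    va_end(ap);
}

int main(void)
{
    int ev_write = io_event_define("pvfs_write", "%ld bytes, handle %d");
    io_event_log(ev_write, "%ld bytes, handle %d", 4096L, 17);
    return 0;
}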
Long-Path Measurement
•  Tracking I/O across the long-path by passing TAU context from the application to the PVFS2 server
•  Tracing
   –  With the context available, tracing is straightforward
   –  Merge traces post-mortem to analyze call-flows
   –  But tracing can be heavy-weight: at large scales, too much extra I/O for performance data
•  Distributed Call-Flow Profiles
   –  Lighter-weight profiles are annotated with context
   –  Granularity controlled through the definition of context
   –  Profile data for an existing context is aggregated and hence remains small
   –  Post-mortem, profiles from the different modules are merged using the annotations
   –  Produces chained call-graphs
•  Online Distributed Call-Flow Profiles (see the piggy-backing sketch below)
   –  Similar to DCFP, but performs the chaining at runtime
   –  On the call path, the context is piggy-backed
   –  On the return path, the performance data of the remote components is piggy-backed
   –  Performance data is maintained by the caller (not by the components themselves)
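The piggy-backing can be sketched as a pair of message layouts; the structs below are hypothetical, not the actual wire format. The caller attaches its context to the request, the remote side returns its measured service time with the reply, and the caller charges that time to its own call-path node.

/* Illustrative sketch (hypothetical message layout) of call-path and
 * return-path piggy-backing for online distributed call-flow profiles. */
#include <stdint.h>
#include <stdio.h>

struct call_context {            /* piggy-backed on the call path     */
    uint32_t origin_rank;
    uint32_t callpath_id;
};

struct remote_perf {             /* piggy-backed on the return path   */
    double   service_time_s;     /* time spent inside the remote side */
    uint64_t bytes;
};

struct request { struct call_context ctx; uint64_t offset, length; };
struct reply   { int status;             struct remote_perf perf;   };

/* Remote component: services the request and reports its own timing. */
static struct reply serve(const struct request *req)
{
    struct reply rep = { .status = 0 };
    rep.perf.service_time_s = 0.00042;      /* placeholder measurement */
    rep.perf.bytes = req->length;
    return rep;
}

int main(void)
{
    struct request req = { .ctx = { .origin_rank = 3, .callpath_id = 42 },
                           .offset = 0, .length = 65536 };
    struct reply rep = serve(&req);

    /* Caller charges the remote time to its own call-path node, producing a
     * chained call-graph without the remote side keeping per-caller state. */
    printf("callpath %u: remote service %.6f s for %llu bytes\n",
           req.ctx.callpath_id, rep.perf.service_time_s,
           (unsigned long long)rep.perf.bytes);
    return 0;
}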