Observing End-to-End Parallel I/O (I/O "Long-Path")

Performance Perspectives
•  The Parallel Application
•  The Transport Infrastructure
•  Parallel I/O Servers
•  There are performance questions associated with each of the above
•  Some questions involve all of them
Application Performance Questions
•  How long does an application spend in I/O?
   –  If I/O is not fully synchronous, how do we measure this time?
   –  Is the local (within each rank) I/O time different from the global I/O time?
   –  And if so, does I/O lead to imbalance effects? (See the sketch below.)
   –  What types of I/O operations take the longest? Can we categorize/classify them?
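The local-vs-global question can be made concrete with a small measurement harness. The following is a minimal sketch (not part of the original material): each rank times a collective MPI-IO write, and reductions expose the spread between the slowest and the average rank, i.e. the I/O imbalance. The file name, buffer size, and offsets are placeholders.

/* Minimal sketch: measure per-rank I/O time around a collective write and
 * compare local vs. global time to expose imbalance. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                 /* 1 Mi doubles per rank */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);

    double t0 = MPI_Wtime();
    MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    double local = MPI_Wtime() - t0;           /* local (per-rank) I/O time */

    double tmax, tmin, tsum;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("I/O time: max %.3fs min %.3fs avg %.3fs (imbalance %.2fx)\n",
               tmax, tmin, tsum / size, tmax / (tsum / size));

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}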
Transport Performance Questions
•  I/O can be split into request, processing, and reply
•  How long does each of the stages take?
   –  Application I/O call (MPI-IO, native PVFS), library
   –  Compute-to-I/O-node network, I/O-node daemon, PVFS plugin
   –  I/O-node-to-storage network, PVFS servers, storage
•  Where is the bottleneck?
   –  For varying request types, sizes, …, workloads
•  What operations contribute to the per-stage time? (A per-stage timestamp sketch follows.)
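To make the stage decomposition concrete, here is an illustrative sketch; the stage names, the struct, and the placeholder timestamps are hypothetical and do not correspond to any PVFS or TAU data structure. The idea is simply to carry one timestamp per long-path stage with each request and take differences when the reply returns.

/* Illustrative sketch (hypothetical structure): one timestamp per long-path
 * stage, carried with a request, so per-stage durations can be computed. */
#include <stdint.h>
#include <stdio.h>

enum stage {
    STG_APP_CALL = 0,     /* application I/O call issued            */
    STG_ION_RECV,         /* request received by I/O-node daemon    */
    STG_PVFS_RECV,        /* request received by PVFS server        */
    STG_PVFS_DONE,        /* server finished processing             */
    STG_APP_REPLY,        /* reply observed back at the application */
    STG_COUNT
};

struct stage_stamps {
    uint64_t ns[STG_COUNT];   /* wall-clock timestamps, nanoseconds */
};

/* Print the duration of each stage-to-stage hop for one request. */
static void report(const struct stage_stamps *s)
{
    static const char *name[] = {
        "app -> ION network + call", "ION daemon + plugin",
        "ION -> server network + service", "reply path"
    };
    for (int i = 0; i + 1 < STG_COUNT; i++)
        printf("%-32s %8.3f ms\n", name[i],
               (s->ns[i + 1] - s->ns[i]) / 1e6);
}

int main(void)
{
    /* placeholder timestamps for one request, purely illustrative */
    struct stage_stamps s = { .ns = { 0, 120000, 450000, 2450000, 2600000 } };
    report(&s);
    return 0;
}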
PVFS Performance Questions
•  I/O request tracking in the PVFS state machine
   –  How much time does each stage take?
   –  How long does the request remain in queues? (See the queue-time sketch below.)
•  What are the service times for various requests?
   –  What is a good way to classify requests (sizes, input, output, metadata, data)?
•  How do request scheduling decisions affect performance?
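A queue-residency measurement can be sketched in the same spirit; the struct and timestamps below are hypothetical, not the PVFS2 state machine's bookkeeping. Timestamping enqueue, dispatch, and completion separates time spent waiting in server queues from the actual service time.

/* Illustrative sketch (hypothetical): split queue-residency time from
 * service time by timestamping enqueue, dispatch, and completion. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct req_timing {
    uint64_t enq_ns;   /* request placed on a server queue      */
    uint64_t disp_ns;  /* request dispatched to a state machine */
    uint64_t done_ns;  /* request completed                     */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    struct req_timing t;
    t.enq_ns  = now_ns();
    /* ... request waits in the queue ... */
    t.disp_ns = now_ns();
    /* ... state machine processes the request ... */
    t.done_ns = now_ns();

    printf("queue time  : %.3f us\n", (t.disp_ns - t.enq_ns) / 1e3);
    printf("service time: %.3f us\n", (t.done_ns - t.disp_ns) / 1e3);
    return 0;
}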
Parallel I/O Tracking and the Long-Path

[Diagram: the I/O long-path from compute nodes through tree I/O nodes to storage]
•  Compute nodes: Parallel Application → MPI-IO (ADIO / FUSE / UNIX) → libc syscalls → ZOIDFS client (libzoid_cn)
   –  No changes to API or structures; all exist in a single process memory address space
   –  Direct, implicit access to performance state
   –  Compute-node modules instrumented with TAU; TAU tracks and passes context as the call proceeds
•  Tree I/O nodes: ZOID daemon with ZOIDFS / UNIX / PVFS plugins
   –  Structure / API changes to libzoid_cn and the ZOID daemon
   –  Explicit wire-format encoding/decoding of TAU context; performance data returned
   –  I/O-node modules instrumented with TAU
   –  PVFS Hints: encode/decode TAU context / data
•  Storage (over GigE): PVFS Server
   –  Server accepts Hints (name/value) with requests (see the hint sketch below)
   –  TAU uses special measurement hints (e.g. context)
   –  TAU performs context-aware measurement
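The hint-based context transport can be illustrated as follows; the hint structure, the hint_add helper, and the "tau_context" key are hypothetical stand-ins rather than the PVFS2 hint API, but they show how a TAU context can ride along as a name/value pair and be decoded on the server for context-aware measurement.

/* Illustrative sketch only: a TAU context travelling as a name/value hint
 * attached to a request (hypothetical types and helpers). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct hint {               /* one name/value pair carried with a request */
    char     name[32];
    uint8_t  value[64];
    int      length;
};

struct tau_context {        /* context passed from application to server */
    uint32_t node_id;       /* originating compute node / MPI rank       */
    uint32_t callpath_id;   /* identifies the application call path      */
    uint64_t request_id;    /* correlates request, reply, and trace data */
};

static void hint_add(struct hint *h, const char *name,
                     const void *value, int length)
{
    snprintf(h->name, sizeof h->name, "%s", name);
    memcpy(h->value, value, (size_t)length);
    h->length = length;
}

int main(void)
{
    /* Client side: encode the context into a hint on the outgoing request. */
    struct tau_context ctx = { .node_id = 7, .callpath_id = 42,
                               .request_id = 1001 };
    struct hint h;
    hint_add(&h, "tau_context", &ctx, sizeof ctx);

    /* Server side: decode the hint and measure under that context. */
    struct tau_context seen;
    memcpy(&seen, h.value, sizeof seen);
    printf("server measuring under node %u, callpath %u, request %llu\n",
           seen.node_id, seen.callpath_id,
           (unsigned long long)seen.request_id);
    return 0;
}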
PVFS2 Server / TAU Integration
TAU Thread Groups
•  A PVFS2 measurement problem
   –  PVFS2 creates worker threads on demand to satisfy I/O requests and to hide wait latency
   –  Significant number of concurrent threads → contention (locking) over performance state
   –  Legacy PVFS2 tracing: single trace buffer / file
•  TAU performance model
   –  Avoids contention → splits performance state on a per-thread basis
   –  But with PVFS2, too many performance contexts
•  Reconciling both: TAU Thread Groups
   –  Threads are partitioned into groups; TAU maintains a performance pool, a set of re-usable performance contexts for each thread group
   –  On thread creation, a free context is used; if none is free, a new context is created in the pool; on thread deletion, the context is returned to the pool
   –  All measurement inside a thread is performed within its context
   –  No locking required for measurement events (only for thread creation / deletion)
   –  Manageable set of resulting contexts; granularity controlled through the group definition
   –  Profiling: one profile per group – the measurement data of all participants in the group are aggregated
   –  Tracing: as many traces as there were concurrently executing threads (not the total number of threads)
•  Allows handling large thread counts by lock-free performance pooling (see the pool sketch below)
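A minimal sketch of the context pool follows, assuming a pthreads server; the types and function names are illustrative, not TAU internals. Locking happens only when a thread acquires or returns a context, so measurement events themselves remain lock-free.

/* Minimal sketch (hypothetical) of a per-group context pool: a lock is taken
 * only on thread creation/deletion; measurement updates inside a thread touch
 * only its own context. */
#include <pthread.h>
#include <stdlib.h>

struct perf_context {
    struct perf_context *next;   /* free-list link            */
    unsigned long events;        /* example measurement state */
    double inclusive_time;
};

struct thread_group {
    pthread_mutex_t lock;        /* guards the free list only */
    struct perf_context *free_list;
};

/* Called on thread creation: reuse a free context or grow the pool. */
struct perf_context *group_acquire(struct thread_group *g)
{
    pthread_mutex_lock(&g->lock);
    struct perf_context *c = g->free_list;
    if (c)
        g->free_list = c->next;
    else
        c = calloc(1, sizeof *c);        /* grow the pool on demand */
    pthread_mutex_unlock(&g->lock);
    return c;
}

/* Called on thread deletion: return the context for later reuse; its
 * accumulated data stays in the pool and is aggregated per group. */
void group_release(struct thread_group *g, struct perf_context *c)
{
    pthread_mutex_lock(&g->lock);
    c->next = g->free_list;
    g->free_list = c;
    pthread_mutex_unlock(&g->lock);
}

/* Measurement event: lock-free, since the context is private to its thread. */
void record_event(struct perf_context *c, double elapsed)
{
    c->events++;
    c->inclusive_time += elapsed;
}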
PVFS Server / TAU Integration
TAU Trace Format (Ttf) API
API
• 
• 
Specialized
thread‐friendly
instrumentation/measurement
API
and
implementation
The
API
–  Event
format
definition
•  Free
to
define
events
and
associated
parameters
•  Vararg
(printf)
style
record
format
specification
•  Record
format
parsed
with
state‐machine
–  Event
occurrence
logging
•  On
event
occurrence,
performance
data
recorded
in
TAU
trace
format
–  Threading
event
notification
•  PVFS2
explicitly
notifies
TAU
of
thread
creation
/
deletions
• 
• 
• 
Emphasis
on
being
light‐weight
and
avoiding
contention
Uses
thread‐groups
internally
to
split
performance
state
Split
event
format
definitions
–
global
master
copy
/
thread
local
copy
–  Allows
event
formats
defined
in
one
thread
to
be
logged
by
other
threads
without
large
lock
overheads
• 
Utilities
for
merging
contexts
and
trace
file
format
conversions
• 
Also:
Automatic
Source
instrumentation
of
PVFS2
with
TAU/PDT
possible
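A usage sketch of the event-format idea follows; the function names (io_event_define, io_event_log) are hypothetical stand-ins for the thread-friendly Ttf-style API, and vprintf stands in for the state-machine record parser.

/* Usage sketch with hypothetical names, not the real Ttf signatures. */
#include <stdarg.h>
#include <stdio.h>

static int next_event_id = 0;

/* Define an event format once; the format string names the parameters
 * each occurrence of the event will carry. */
static int io_event_define(const char *name, const char *format)
{
    printf("defined event %d '%s' with record format \"%s\"\n",
           next_event_id, name, format);
    return next_event_id++;
}

/* Log one occurrence; the recorder walks the format string (here simply
 * via vprintf, standing in for the state-machine parser). */
static void io_event_log(int event_id, const char *format, ...)
{
    va_list ap;
    va_start(ap, format);
    printf("event %d: ", event_id);
    vprintf(format, ap);
    printf("\n");
    va_end(ap);
}

int main(void)
{
    int ev_write = io_event_define("pvfs_write", "%ld bytes, handle %d");
    io_event_log(ev_write, "%ld bytes, handle %d", 4096L, 17);
    return 0;
}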
Long-Path Measurement
•  Tracking I/O across the long-path by passing TAU context from the application to the PVFS2 server
•  Tracing
   –  With the context available, tracing is straightforward
   –  Merge traces post-mortem to analyze call-flows
   –  But tracing can be heavy-weight: at large scales, too much extra I/O for performance data
•  Distributed Call-Flow Profiles
   –  Lighter-weight profiles are annotated with context
   –  Granularity controlled through the definition of context
   –  Profile data for an existing context is aggregated and hence remains small
   –  Post-mortem, profiles from the different modules are merged using the annotations
   –  Produces chained call-graphs
•  Online Distributed Call-Flow Profiles (see the piggy-backing sketch below)
   –  Similar to DCFP, but performs the chaining at runtime
   –  On the call path, the context is piggy-backed
   –  On the return path, the performance data of the remote components is piggy-backed
   –  Performance data is maintained by the caller (not by the components themselves)
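The piggy-backing can be sketched as a pair of message layouts; the structs below are hypothetical, not the actual wire format. The caller attaches its context to the request, the remote side returns its measured service time with the reply, and the caller charges that time to its own call-path node.

/* Illustrative sketch (hypothetical message layout) of call-path and
 * return-path piggy-backing for online distributed call-flow profiles. */
#include <stdint.h>
#include <stdio.h>

struct call_context {            /* piggy-backed on the call path     */
    uint32_t origin_rank;
    uint32_t callpath_id;
};

struct remote_perf {             /* piggy-backed on the return path   */
    double   service_time_s;     /* time spent inside the remote side */
    uint64_t bytes;
};

struct request { struct call_context ctx; uint64_t offset, length; };
struct reply   { int status;             struct remote_perf perf;   };

/* Remote component: services the request and reports its own timing. */
static struct reply serve(const struct request *req)
{
    struct reply rep = { .status = 0 };
    rep.perf.service_time_s = 0.00042;      /* placeholder measurement */
    rep.perf.bytes = req->length;
    return rep;
}

int main(void)
{
    struct request req = { .ctx = { .origin_rank = 3, .callpath_id = 42 },
                           .offset = 0, .length = 65536 };
    struct reply rep = serve(&req);

    /* Caller charges the remote time to its own call-path node, producing a
     * chained call-graph without the remote side keeping per-caller state. */
    printf("callpath %u: remote service %.6f s for %llu bytes\n",
           req.ctx.callpath_id, rep.perf.service_time_s,
           (unsigned long long)rep.perf.bytes);
    return 0;
}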