ppt - CERN Computing Seminars

Distributed Application Analysis and
Debugging using NetLogger v2
Brian L. Tierney
Lawrence Berkeley National Laboratory
GGF
NetLogger
The Problem
• The Problem
– When building distributed systems, we often observe
unexpectedly low performance
• the reasons for which are usually not obvious
– The bottlenecks can be in any of the following
components:
• the applications
• the operating systems
• the disks, network adapters, bus, memory, etc. on either the
sending or receiving host
• the network switches and routers, and so on
GGF
NetLogger
Solution
• A complete End-to-End monitoring framework that includes
the following components:
– instrumentation tools (application, middleware, and OS
monitoring)
– host and network sensors (host and network
monitoring)
– sensor management tools (sensor control system)
– event publication service
– event archive service
– event analysis and visualization tools
– a common set of formats and protocols for describing,
exchanging and locating monitoring data
GGF
NetLogger
NetLogger Toolkit
• We have developed the NetLogger Toolkit (short for Networked
Application Logger), which includes:
– tools to make it easy for distributed applications to log interesting
events at every critical point
• NetLogger client library (C, C++, Java, Perl, Python)
– tools for host and network monitoring
– event visualization tools that allow one to correlate application
events with host/network events
– NetLogger event archive and retrieval tools (new)
• NetLogger combines network, host, and application-level monitoring
to provide a complete view of the entire system.
•
Open Source, available at http://www-didc.lbl.gov/NetLogger/
GGF
NetLogger
NetLogger Analysis: Key Concepts
• NetLogger visualization tools are based on time correlated and object
correlated events.
– precision timestamps (default = microsecond)
• If applications specify an “object ID” for related events, this allows the
NetLogger visualization tools to generate an object “lifeline”
• In order to associate a group of events into a “lifeline”, you must assign
an “Event ID” to each NetLogger event
– Sample Event ID: file name, block ID, frame ID, etc.
GGF
NetLogger
Example: Combined Host and
Application Monitoring
GGF
NetLogger
NetLogger Tuning Results
• I/O followed by
processing
Next IO starts
when processing
ends
• overlapped I/O and
processing
process previous
block
remote IO
almost a 2:1 speedup
GGF
NetLogger
NetLogger Analysis of GridFTP:
multi-stream bug
GGF
NetLogger
Sample NetLogger Use
lp = NetLoggerOpen( progname,
x-netlog://loghost.lbl.gov, NL_MEM);
while (!done)
{
NetLoggerWrite(lp, "EVENT_START",
"TEST.SIZE=%d", size);
/* perform the task to be monitored */
done = do_something(data, size);
NetLoggerWrite(lp, "EVENT_END");
}
NetLoggerClose(lp);
GGF
NetLogger
NetLoggerOpen() shell environment
variables
• Enable/Disable logging:
setenv NETLOGGER_ON {true, on, yes, 1}: do logging
setenv NETLOGGER_ON {false, off, no, 0}: do not do logging
• Log Destination: setenv NETLOGGER_DEST logging destination
Examples:
setenv NETLOGGER_DEST file://tmp/netlog.log
write log messages to file /tmp/netlog.log
setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov
send log messages to netlogd on host loghost.lbl.gov, default port
setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov:6006
send log messages to netlogd on host loghost.lbl.gov, port 6006
• NETLOGGER_DEST overrides the URL passed in via the
NetLoggerOpen() call.
GGF
NetLogger
NetLogger Write Call
• Creates and Writes the log event:
NetLoggerWrite(nl, “EVENT_NAME”, “EVENTID=%d
F2=%d F3=%s F4=%.2f”, id, user_data,
user_string, user_float);
– timestamps are automatically done by library
– the “event name” field is required, all other fields are optional
– this call is thread-safe: automatically does a mutex lock
around write call (compile time option)
• Example:
NetLoggerWrite(nl, “HTTPD.START_DISK_READ”,
“HTTPD.FILENAME=%s HTTPD.HOST=%s”, filename,
hostname);
GGF
NetLogger
Event ID
• In order to associate a group of events into a “lifeline”, you
must assign an event ID to each NetLogger event
• Sample Event Ids
– file name
– block ID
– frame ID
– Socket ID
– user name
– host name
– combination of the above
– etc.
GGF
NetLogger
Sample NetLogger Use with Event
IDs
lp = NetLoggerOpen(progname, NULL, NL_ENV);
for (i=0; i< num_blocks; i++) {
NetLoggerWrite(lp, “START_READ”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
read_block(i);
NetLoggerWrite(lp, “END_READ”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
NetLoggerWrite(lp, “START_PROCESS”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
process_block(i);
NetLoggerWrite(lp, “END_PROCESS”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
NetLoggerWrite(lp, “START_SEND”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
send_block(i);
NetLoggerWrite(lp, “END_SEND”,
“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);
}
NetLoggerClose(lp);
GGF
NetLogger
How to Instrument Your Application
• You’ll probably want to add a NetLogger event to the
following places in your distributed application:
– before and after all disk I/O
– before and after all network I/O
– entering and leaving each distributed component
– before and after any significant computation
• e.g.: an FFT operation
– before and after any significant graphics call
• e.g.: certain CPU intensive OpenGL calls
• This is usually an iterative process
– add more NetLogger events as you zero in on the
bottleneck
GGF
NetLogger
NLV Analysis Tool:
Plots Time vs. Event Name
Menu bar
Title
Scale for load-line/
points
Events
Zoom box
Time axis
Legend
Max window size
Window size
Playback speed
Zoom-box actions
Summary
line
Playback controls
You are
here
Zoom window
controls
GGF
NetLogger
Use Case: Instrumented GridFTP
server
• An Efficient Data Channel Protocol is extremely important
• For example, consider the following:
– FTP server is instrumented to log the start and end times for all
network and disk read and writes, which are in blocks of 64 KBytes
– FTP Server has a fast RAID disk array and 1000-BT network
– 10 simultaneous clients (parallel GridFTP)
– Each client is transferring data at 10 Mbytes/second
– Total server throughput = 100 Mbytes/sec
• This will generate roughly 6250 events / second of monitoring data
• Assuming each monitoring event is 50 Bytes, this equates to 313
KBytes/second, or 1.1 GBytes per hour, of monitoring data.
GGF
NetLogger
NetLogger v2 Improvements
• Rewrite of client library
– Multiple log formats allowed with same API
• ASCII (ULM; name = value)
—e.g.: DATE=19980430133038.055784 HOST=foo.lbl.gov
PROG=testprog NL.EVNT=SEND_DATA SEND_SIZE=49332
• Binary log format
— much better performance, and requires less bandwidth, than
ASCII
—Crucial for GridFTP use-case: ASCII NetLogger format not fast
enough
– Other language APIs automatically generated with
SWIG
• Much faster than “100% native” implementations, esp. for script
languages such as Perl, Python, and TCL
• Changes and bug fixes in core automatically propagated to all
APIs
GGF
NetLogger
NetLogger v2 Improvements
• Reliability
– Provide fault-tolerance in WAN environment
– Recovers gracefully from network problems
– Also recovers from restart/reboot of the log receiver
• Trigger API
– Dynamically start, stop, change logging levels
GGF
NetLogger
New NetLogger Feature: Binary
Format
• Data is sent in native format: “receiver-makes-right”
• IEEE floating point only, simple data types supported
• Lots of buffering (64K) to reduce system call overhead
1000 events/sec
700
600
500
binary
ascii
pbio
log4j
400
300
200
100
0
c
java
Language
GGF
NetLogger
New NetLogger Feature:
Reliability API
• Reliability API is used to provide fault tolerance
– if remote NetLogger receiver become unavailable, will automatically
failover to alternate location (e.g: local disk or 2nd receiver)
– Option to periodically try to reconnect to the remote server
• send local disk file on reconnect
• NetLoggerSetBackup( handle, char *backupURL, int SendonReconnect )
– Set the backup URL
– SendonReconnect flag: if backup URL is local disk, send these results
after a successful reconnect
• NetLoggerSetReconnect( handle, int sec )
– Set the number of seconds between reconnect attempts when the
backup URL is in use
GGF
NetLogger
New NetLogger Feature:
Trigger API
• Trigger API is used to activate monitoring from an external
“activation service”
• NetLoggerSetTrigger( handle, char *filename. int sec)
– Check the configuration file specified in “filename”,
every “sec” seconds
• int NetLoggerSetTriggerFunc( handle, fn_t *fnptr, int sec )
– Call a user-defined trigger function every “sec” seconds
• NetLoggerCheckTrigger( handle)
– Check the trigger function immediately (in addition to the
periodic check from NetLoggerSetTrigger).
GGF
NetLogger
Monitoring Event Archive
• Archived event data is required for
– performance analysis and tuning
• compare current performance to previous results
– accounting
• The archive must be extremely high performance and scalable to
ensure that it does not become a bottleneck.
– heavily loaded FTP server could generate > 6000 monitoring
event per second
• Must use pipelining to guarantee that applications and sensors
never block when writing to the archive
• buffer event data on disk
• SQL capability desirable
– ability to do complex queries, find averages, standard
deviations, etc.
GGF
NetLogger
Monitoring Event Archive
Instrumented
application writes
events to disk buffer
Reads events from
producer and writes
to disk buffer
event data
buffer
Data Server
(e.g. FTP)
event data
buffer
Archive
Feeder
network
Event
Producer
reads events from disk
buffer and sends them to
archive consumer
Event Archive
Archive
Consumer
reads events from disk
buffer and feeds them to
SQL database at a
controlled rate
Requirements for FTP Server monitoring:
• 6200 events/sec
• 313 KBytes/second, (1.1 GBytes / hour)
GGF
NetLogger
monitoring
DB
Current Work
GGF
NetLogger
NetLogger + GMA
• NetLogger + Grid Monitoring Architecture (GMA)
– GMA is a publish / subscribe event system
– Using SOAP for control channel
– Using NetLogger binary for data channel
• Sender: NetLoggerWrite()
• Receiver: NetLoggerRead()
• NetLogger binary format is a very efficient transport
protocol for simple event data
GGF
NetLogger
Activation Service
• We are using the GMA producer/consumer interfaces to build an
activation service for NetLogger
– Creates the “trigger file” used by the NetLogger trigger API
7.
Consumer
(sends
activation
request)
1.
Activation Service
Event Producer
6.
2.
Trigger
Manager
Trigger File:
event type A: log host X
event type B: log host Y
event type C: log host Z
…..
Activation Service
adds entries to
“trigger file” to start
logging certain
events for a given
client
Send events back
to consumer
3.
Event Filter /Buffer
Manager
5.
NetLogger
Instrumented
Application
NetLogger calls in
application periodically
checks “trigger file” for
new entry
GGF
NetLogger
4.
event data
buffer
Create output buffer
file on disk
Timestamps:
Clock Synchronization Issues
• To correlate events from multiple systems requires synchronized clocks
– NTP (Network Time Protocol) is required to synchronize the clocks
of all systems
• How accurate does this synchronization need to be?
– We have found that to analyze systems from the “user perspective”
requires:
• microsecond resolution between events on a single host
(gettimeofday() system call)
• millisecond resolution between WAN hosts
—fairly easy to achieve this with NTP
• somewhere in between for LAN hosts
• Recommendation: everyone use IETF timestamp standard
– example: 2002-01-18T21:20:07.401662Z
– YYYY-MM-DDTHH:MM:SS.SZ (T=date-time separator, Z = GMT)
– http://www.ietf.org/internet-drafts/draft-ietf-impp-datetime-05.txt
GGF
NetLogger
For More Information
http://www-didc.lbl.gov/NetLogger
email: [email protected], [email protected]
GGF
NetLogger