TeraGrid Data Transfer

Joint EGEE and OSG Workshop on
Data Handling in Production Grids
June 25, 2007 - Monterey, CA
Derek Simmel [email protected]
Pittsburgh Supercomputing Center
TeraGrid Data Transfer
• Topics
– TeraGrid Network (June 2007)
– TeraGrid Data Kits
– GridFTP
– HPN-SSH
– WAN Filesystems
• Lustre-WAN and GPFS-WAN
– Advanced Solutions
• Scheduled Data Jobs - DMOVER
• Getting data to/from MPPs - PDIO
TeraGrid Network
network.teragrid.org
Performance Monitoring
TeraGrid Data Kits
• Data Movement
– GridFTP, HPN-SSH
– TeraGrid Globus deployment includes VDT-contributed improvements to the Globus toolkit
• Data Management
– SRB support
• WAN Filesystems
– Development: GPFS-WAN, Lustre-WAN
– Future: pNFS with GPFS-WAN & Lustre-WAN
client modules
TeraGrid GridFTP service
• Standard target names
– gridftp.{system}.{site}.teragrid.org
• Sets of striped servers
– Most sites have deployed multiple (4~12)
data stripe servers per (HPC) system
– Mix of 10GbE and 1GbE deployments
– Multiple data stripe services started on
10GbE GridFTP data transfer servers
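As an illustration of the naming convention above, a simple client-side transfer might look like the following sketch (the endpoint host and paths are examples, not actual directories):

# Obtain a GSI proxy, then push a local file to a GridFTP endpoint named
# per the gridftp.{system}.{site}.teragrid.org convention (paths are examples)
grid-proxy-init
globus-url-copy \
    file:///scratch/myuser/results.dat \
    gsiftp://gridftp.bigben.psc.teragrid.org/home/myuser/results.dat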
speedpage.teragrid.org
GridFTP Observations
• Specify server configuration parameters in an
external file (-c option)
– Allows updates to configuration on the fly between
invocations of GridFTP server
– Facilitates custom setups for dedicated user jobs
• Make the server block size parameter match the
default (parallel) filesystem block size for the
filesystem visible to the GridFTP server
– How to accommodate user-configurable filesystem block
sizing (e.g. Lustre)? Don't know yet…
• -vb is still broken
– Calculate throughput using time as a wrapper instead
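A minimal sketch of the time-based workaround (file, endpoint, and parameter values are illustrative):

# -vb reporting is unreliable, so time the whole transfer and divide
# the file size by the elapsed ("real") time to get MB/s
SIZE_MB=$(( $(stat -c %s /scratch/myuser/big.dat) / 1048576 ))
time globus-url-copy -p 4 -tcp-bs 8388608 \
    file:///scratch/myuser/big.dat \
    gsiftp://gridftp.cobalt.ncsa.teragrid.org/home/myuser/big.dat
echo "size: $SIZE_MB MB; throughput = size / real time reported above"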
GridFTP server configuration
• Recommended Striping for TeraGrid sites:
– 10GbE: 4- or 6-way striping per interface
• Not more, since most 10GbE interfaces are limited by the PCI-X bus
– 1GbE: 1 stripe each
• Factors:
– TeraGrid network is uncongested
• Multiple stripes/flows are not necessary to mitigate
congestion-related loss
– Mix of 10GbE and 1GbE
• Striping is determined by the receiving server config:
8x1GbE -> 2x10GbE = 2 stripes, unless the latter are
configured with multiple stripes each
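From the client side, striping is simply requested on the transfer; a sketch assuming two hypothetical striped endpoint pools and example paths:

# Third-party transfer in striped (extended block) mode; the actual
# stripe count comes from the receiving servers' configuration
globus-url-copy -stripe -tcp-bs 8388608 \
    gsiftp://gridftp.lemieux.psc.teragrid.org/scratch/myuser/ckpt.0042 \
    gsiftp://gridftp.dtf.sdsc.teragrid.org/gpfs-wan/myuser/ckpt.0042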
globus-url-copy -tcp-bs
• TCP Buffer Size
– Goal is to make the buffer large enough to hold as
many bytes as can typically be in flight between the source
and target
• Too small: waste time in transfer waiting at source for
responses from target when you could have been sending data
• Too big: waste time having to retransmit packets that got
dropped at the target because the target ran out of buffer space
and/or could not process them fast enough
– TeraGrid tgcp tool uses TCP buffer size values calculated
from measurements between TeraGrid sites over the
TeraGrid network
– Autotuning kernels/OSs
• Linux kernel 2.6.9 or later
• Microsoft Windows Vista
• Observed superior performance at TACC and ORNL on
systems with autotuning enabled
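A back-of-the-envelope way to size -tcp-bs is the bandwidth-delay product; a sketch with assumed numbers (10 Gb/s path, 40 ms RTT, example endpoint and file):

# buffer ~= bandwidth (bytes/s) x RTT (s); 10 Gb/s and 40 ms are assumed here
BW_BITS=10000000000
RTT_MS=40
BDP_BYTES=$(( BW_BITS / 8 * RTT_MS / 1000 ))   # 50,000,000 bytes (~48 MB)
globus-url-copy -tcp-bs $BDP_BYTES \
    file:///scratch/myuser/big.dat \
    gsiftp://gridftp.datastar.sdsc.teragrid.org/home/myuser/big.dat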
Other Performance Factors
• Other TCP implementation factors that will affect
network performance
– RFC 1323 TCP extensions for high performance
• Window scaling - you’re limited to 64K max without this
• Timestamps
– Protection against wrapped sequence numbers
in high-capacity networks
– RFC 2018 SACK (Selective ACK support)
• Receiver sends ACKs with info about which packets it has seen,
allowing the sender to resend only the missing packets, thus
reducing retransmissions
– RFC 1191 Path MTU discovery
• Packet sizes should be maximized for network
• MTU=9000 bytes on TeraGrid network (mix of 1Gb & 10Gb i/fs)
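On a Linux end host these settings can be checked before tuning transfers; a sketch using standard 2.6-kernel sysctls (the interface name and target host are examples):

# RFC 1323 window scaling/timestamps and RFC 2018 SACK
sysctl net.ipv4.tcp_window_scaling net.ipv4.tcp_timestamps net.ipv4.tcp_sack
# interface MTU (jumbo frames)
ip link show eth2 | grep -o 'mtu [0-9]*'
# verify the 9000-byte path MTU with a non-fragmenting ping (8972 + 28 = 9000)
ping -c 3 -M do -s 8972 gridftp.bigben.psc.teragrid.org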
Additional Tuning Resources
• TCP tuning guide:
– http://www.psc.edu/networking/projects/tcptune/
• Autotuning:
– Jeff Semke, Jamshid Mahdavi, Matt Mathis - 1998
auto tuning paper:
• http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps
– Dunigan Oak Ridge auto tuning:
• http://www.csm.ornl.gov/~dunigan/netperf/auto.html
GridFTP 4.1.2 Dev Release
• TeraGrid GIG Data Transfer team is
investigating new GridFTP features
– “pipeline” mode to transfer large numbers of
files more efficiently
– Automatic data stripe server failure
recovery
– sshftp:// - transfers to/from SSH servers
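As a hedged sketch of how the sshftp:// scheme would be invoked from the client (host, account, and paths are illustrative; the remote side needs sshd plus a GridFTP server installation):

# Pull a file from a remote host over SSH using the dev-release sshftp:// scheme
globus-url-copy \
    sshftp://myuser@login.example.org/home/myuser/input.tar \
    file:///scratch/myuser/input.tar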
HPN-SSH
• So What’s Wrong with SSH?
– Standard SSH is slow in wide area networks
– Internal bottlenecks prevent SSH from using all of
the network you have
• What is HPN-SSH?
– A set of patches to greatly improve the network
performance of OpenSSH
• Where do I get it?
– http://www.psc.edu/networking/projects/hpn-ssh
(Current) Standard SSH
Throughput as a Function of RTT
[Chart: measured SSH throughput (MB/s) as a function of RTT (1-145 ms); throughput falls off as RTT increases]
The Real Problem with SSH
• It is *NOT* the encryption process!
– If it were:
• Faster computers would give faster throughput,
which doesn't happen.
• Transfer rates would be the same in local and
wide area networks, which they aren't.
• In fact, transfer rates seem dependent on RTT:
the farther away, the slower the transfer.
• Any time rates are strongly linked to
RTT, it implies a receive buffer problem
What’s the Big Deal?
• Receive buffers are used to regulate the data
rate of TCP
• The receive buffer determines how much data can be
unacknowledged at any one point. The sender will only
send that much data until it gets an ACK
– If your buffer is set to 64KB, the sender can only
send 64KB per round trip, no matter how fast the
network actually is
How Bad Can it Be?
• Pretty bad
– Let's say you have a 64KB receive buffer:
RTT      Link        BDP       Utilization
100ms    10 Mb/s     125KB     50%
100ms    100 Mb/s    1.25MB    5%
100ms    1000 Mb/s   12.5MB    0.5%
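The utilization column follows from the 64KB window; a quick sketch of the arithmetic for the 100 Mb/s row:

# 64KB window, 100 ms RTT: at most 65536 bytes in flight per round trip
#   throughput = 65536 B / 0.1 s ~= 655 KB/s ~= 5.2 Mb/s, i.e. ~5% of 100 Mb/s
echo "scale=2; 65536*8 / 0.1 / 1000000" | bc    # achievable Mb/s ~= 5.24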
SSH is RWIN Limited
• Analysis of the code reveals
– SSH Protocol V2 is multiplexed
• Multiple channels over one TCP connection
– Must implement a flow control mechanism per
channel
• Essentially the same as the TCP receive window
– This application-level RWIN is effectively set to
64KB, so the real connection RWIN is
MIN(TCPrwin, SSHrwin)
• Thus TPUTmax = 64KB/RTT
Solving the Problem
• Use getsockopt() to get TCP(rwin) and
dynamically set SSH(rwin)
– Performed several times throughout the
transfer to handle autotuning kernels
• Results in 10x to 50x faster throughput,
depending on the cipher used, on a well-tuned
system
HPN-SSH versus SSH
[Chart: throughput (Mb/s, 0-200) of hpn-ssh versus ssh for a range of ciphers, including aes128/192/256-ctr, aes128/192/256-cbc, rijndael, arcfour, cast128-cbc, blowfish-cbc, and 3des-cbc]
HPN-SSH Advantages
• Users already know how to use scp
– Keys, ~/.ssh/config file preferences & shortcuts
• Speed is roughly comparable to single-stripe
GridFTP and Kerberized FTP
• Uses existing authentication infrastructure
– GSISSH now includes HPN patches
– Do both GSI and Kerberos authn with MechGlue
• Can be used with other applications
– rsync, svn, SFTP, ssh port forwarding, etc.
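For example, rsync inherits the speedup automatically because it shells out to whatever ssh binary is in the PATH; a sketch (host, cipher choice, and paths are illustrative):

# rsync over an HPN-patched OpenSSH client; a lighter cipher helps further
rsync -av --progress -e "ssh -c aes128-ctr" \
    /scratch/myuser/run42/ \
    myuser@tg-login.sdsc.teragrid.org:/gpfs-wan/myuser/run42/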
HPN-SSH Issues
• Users are accustomed to using scp/sftp
to transfer files to/from login nodes
– Now that HPN-scp can be a bandwidth hog
like GridFTP, interactive login nodes are no
longer the best place for it
– 3rd-party transfer (see the sketch below)
• scp a:file b:file2 = (ssh a; scp a:file b:file2)
• Tricky to configure on hosts where you don't want
to grant interactive ssh access
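In practice that expansion looks like the following sketch (host names are placeholders):

# A "third-party" scp between two remote hosts is really an ssh to the source
# host plus an scp launched there, so hostA must be able to reach hostB
scp hostA:/scratch/myuser/data.tar hostB:/scratch/myuser/data.tar
# ...is roughly equivalent to:
ssh hostA 'scp /scratch/myuser/data.tar hostB:/scratch/myuser/data.tar'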
TeraGrid HPN-SSH service
• Currently available with default SSH
service on many HPC login hosts
– Login hosts running current GSISSH
include HPN-SSH patches
• Likely to move HPN-SSH to dedicated
data service nodes
– e.g. existing GridFTP data server pools
WAN Filesystems
• A common filesystem (or at least the
transparent semblance of one) is one of
the enhancements most commonly
requested by TeraGrid users
• WAN Filesystems on TeraGrid:
– Lustre-WAN
– GPFS-WAN
TeraGrid Lustre-WAN
• Active TeraGrid sites include PSC, Indiana
Univ., and ORNL. NCAR to be added soon
• We’ve seen good performance across the
TeraGrid network
– As high as 977MB/s for a single client over 10GbE
• 2 active Lustre-WAN filesystems (PSC & IU)
• Currently experimenting with alpha version of
Lustre that supports Kerberos authentication,
encryption (metadata, data) & UID mapping
– Uses some of the NFSv4 infrastructure built by
UMICH
TeraGrid GPFS-WAN
• 700TB GPFS-WAN filesystem housed at San
Diego Supercomputer Center
• Currently mounted across TeraGrid network
at SDSC, NCSA, ANL and PSC
• Divided into three categories
– Collections 150TB
– Projects 475TB - User projects apply for space
– Scratch 75TB (purged periodically)
– Note: GPFS-WAN filesystems are not backed up
Advanced Solutions
• TeraGrid staff actively work on custom
solutions to meet the needs of the NSF
user community
• Examples:
– DMOVER
– Parallel Direct I/O (PDIO)
Scheduled Data Jobs
• Traditional HPC batch jobs waste CPU
allocations staging data in and out
– Why be charged CPU hours for thousands of
CPUs sitting idle while data is moved?
• Goals
– Schedule data movement as its own separate job
– Exploit opportunities for parallelism to reduce
transfer time
– Co-schedule with HPC application jobs as needed
• Approach: Create a “canned” job that users
can run to instantiate a file transfer service
DMOVER
• Designed for use on lemieux.psc.edu to
allow data movement in/out of /scratch
• Data relayed from lemieux.psc.edu
compute nodes via interconnect to
Access Gateway nodes on WAN
• Portable DMOVER edition currently
under development for use on other
TeraGrid platforms
DMOVER job script example
#PBS -l rmsnodes=4:4
#PBS -l agw_nodes=4

# root of the file(s)/directory(s) to transfer (a convenience)
export SrcDirRoot=$SCRATCH/mydata/

# path to the target sources, relative to SrcDirRoot (wildcards allowed)
export SrcRelPath="*.dat"

# destination host name (one or more, round-robin)
export DestHost=tg-c001.sdsc.teragrid.org,tg-c002.sdsc.teragrid.org,tg-c003.sdsc.teragrid.org,tg-c004.sdsc.teragrid.org

# root of the file(s)/directory(s) at the other side (dest path)
export DestDirRoot=/gpfs/ux123456/mydata/

# run the process manager
/scratcha1/dmover/dmover_process_manager.pl "$SrcDirRoot" "$SrcRelPath" \
    "$DestHost" "$DestDirRoot" "$RMS_NODES"
DMOVER Process Perl Script
for ($i=0; $i<=$#file; $i++){
    # pick host IDs, unless we just got them from wait()
    if ($i<$nStreams){
        $shostID = $i % $ENV{'RMS_NODES'};
        $dhostID = $i % ($#host+1);
    }
    else {
        # re-use whichever source host just finished...
        $shostID = $cid{$pid}[0];
        # re-use whichever remote host just finished...
        $dhostID = $cid{$pid}[1];
        delete($cid{$pid});
    }
    $dest = $host[$dhostID];

    # command to launch the transfer agent
    $cmd = "prun -N 1 -n 1 -B `offset2base $shostID` " .
           "$DMOVERHOME/dmover_transfer.sh $SrcDirRoot " .
           "$file[$i] $dest $DestDirRoot $shostID";

    $child = fork();
    if (!$child){
        # child process: run the transfer agent, then exit
        $ret = system($cmd);
        exit($ret >> 8);
    }
    if ($child){
        # parent: remember which hosts this child is using
        $cid{$child}[0] = $shostID;
        $cid{$child}[1] = $dhostID;
    }

    # keep the number of streams constant
    if ($nStreams<=$i+1){
        $pid = wait;
    }
}

# wait for the remaining transfers to complete
while (-1 != wait){
    sleep(1);
}
DMOVER Transfer Agent
export X509_USER_PROXY=$HOME/.proxy
export GLOBUS_LOCATION=/usr/local/globus/globus-2.4.3
export GLOBUS_HOSTNAME=`/bin/hostname -s`.psc.edu
. $GLOBUS_LOCATION/etc/globus-user-env.sh
# set up Qsockets
. $DMOVERHOME/agw_setup.sh $5
SrcDirRoot=$1
SrcRelPath=$2
DestHost=$3
DestDirRoot=$4
args="-tcp-bs 8388608"
cmd="$GLOBUS_LOCATION/bin/globus-url-copy $args \
    file://$SrcDirRoot/$SrcRelPath gsiftp://$DestHost/$DestDirRoot/$SrcRelPath"
echo `/bin/hostname -s` : $cmd
time agw_run $cmd
What about MPPs?
Getting Data to/from MPPs
• Massively-Parallel Processing (MPP) systems (e.g.,
bigben.psc.teragrid.org - Cray XT3) do not have per-node
connectivity to WANs
• Nodes only run a microkernel (e.g. Cray Catamount)
• Users need a way to:
– Stream data into and out from a running application
– Steer their running application
• Approach:
– Dedicate nodes in a running job to data I/O
– Relay data in/out of the system via a proxy service on a
dedicated WAN I/O node
Portals Direct I/O (PDIO)
• Remote Virtual File System middleware
– User calls pdio_write() on compute node*
– Data is routed to external network
• and written on any remote host on the WAN
• in real-time (while your simulation is running)
– Useful for live demos, interactive steering, remote
post-processing, checkpointing, etc.
– New development beta simply hooks standard
POSIX file I/O
• No need for users to customize source code - just relink
with PDIO library
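The relink step might look like the following sketch; the library name, flag, and install path are hypothetical, since the actual PDIO package layout is not shown here:

# Hypothetical relink against the PDIO library so the application's
# standard POSIX writes are intercepted (names/paths are assumptions)
cc -o ppm_run ppm_run.o -L$PDIO_HOME/lib -lpdio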
[Diagram - Before PDIO: PPM computation runs on the compute nodes of the Cray XT3 "BigBen" and writes through its I/O nodes; the visualization cluster (render) at the remote site is disconnected across the ETF network/WAN, and there is no steering of the running computation]
[Diagram - After PDIO: pdiod daemons on BigBen's I/O nodes handle compute-node I/O, Portals-to-TCP routing, and WAN filesystem virtualization, relaying PPM data across the ETF network/WAN to recv processes on the remote-site Viz server (render, input) and enabling real-time remote control, I/O, and steering]
pdio_write() Performance
Pittsburgh, PA to Tampa, FL
8K buffers, 64 streams, aggregate BW= 530 MB/s
(aggregate bandwidth 6% less than local)
[Chart: per-iteration write time (ms), roughly 0.85-1.1 ms over 50 iterations]
Scientific Applications & User
Communities using PDIO
• Nektar: Arterial Blood Flow - G. Karniadakis, et al.
• Hercules: Earthquake Modeling - J. Bielak, et al.
• PPM: Solar Turbulence - P. Woodward, et al.
Acknowledgements
• Chris Rapier, PSC
– HPN-SSH research, development and
presentation materials
• Kathy Benninger, PSC
– Network analysis of TeraGrid GridFTP server
behavior, transfer performance, and TCP tuning
recommendations
• Doug Balog, PSC; Steve Simms, Indiana U.
– Lustre-WAN
• Nathan Stone, PSC
– DMOVER, Parallel Direct I/O