TeraGrid Data Transfer
Joint EGEE and OSG Workshop on Data Handling in Production Grids
June 25, 2007 - Monterey, CA
Derek Simmel <[email protected]>
Pittsburgh Supercomputing Center

TeraGrid Data Transfer
• Topics
  – TeraGrid Network (June 2007)
  – TeraGrid Data Kits
  – GridFTP
  – HPN-SSH
  – WAN Filesystems
    • Lustre-WAN and GPFS-WAN
  – Advanced Solutions
    • Scheduled Data Jobs - DMOVER
    • Getting data to/from MPPs - PDIO

TeraGrid Network

network.teragrid.org

Performance Monitoring

TeraGrid Data Kits
• Data Movement
  – GridFTP, HPN-SSH
  – TeraGrid Globus deployment includes VDT-contributed improvements to the Globus toolkit
• Data Management
  – SRB support
• WAN Filesystems
  – Development: GPFS-WAN, Lustre-WAN
  – Future: pNFS with GPFS-WAN & Lustre-WAN client modules

TeraGrid GridFTP service
• Standard target names
  – gridftp.{system}.{site}.teragrid.org
• Sets of striped servers
  – Most sites have deployed multiple (4-12) data stripe servers per (HPC) system
  – Mix of 10GbE and 1GbE deployments
  – Multiple data stripe services are started on 10GbE GridFTP data transfer servers

speedpage.teragrid.org

GridFTP Observations
• Specify server configuration parameters in an external file (-c option)
  – Allows the configuration to be updated on the fly between invocations of the GridFTP server
  – Facilitates custom setups for dedicated user jobs
• Make the server block size parameter match the default (parallel) filesystem block size for the filesystem visible to the GridFTP server
  – How to accommodate user-configurable filesystem block sizing (e.g. Lustre)? Don't know yet…
• -vb is still broken
  – Calculate throughput by using time as a wrapper instead (see the sketch below)
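A minimal sketch of the "wrap the transfer in a timer" workaround, using date +%s rather than the time builtin so the elapsed seconds feed directly into the throughput calculation. The endpoint (which follows the gridftp.{system}.{site}.teragrid.org naming pattern), paths, stream count and buffer size are illustrative only, not TeraGrid-measured values:

  # measure throughput by hand while -vb is broken
  SRC_FILE=/scratch/mydata/big.dat
  DST=gsiftp://gridftp.somesite.teragrid.org/gpfs/ux123456/big.dat

  START=$(date +%s)
  globus-url-copy -p 4 -tcp-bs 8388608 "file://$SRC_FILE" "$DST"
  END=$(date +%s)

  # MB/s = bytes transferred / elapsed seconds / 2^20
  # (assumes a transfer long enough that one-second timer resolution is fine)
  BYTES=$(stat -c %s "$SRC_FILE")
  awk -v b="$BYTES" -v s="$START" -v e="$END" \
      'BEGIN { printf "%.1f MB/s\n", b / (e - s) / 1048576 }'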
GridFTP server configuration
• Recommended striping for TeraGrid sites:
  – 10GbE: 4- or 6-way striping per interface
    • Not more, since most 10GbE interfaces are limited by the PCI-X bus
  – 1GbE: 1 stripe each
• Factors:
  – TeraGrid network is uncongested
    • Multiple stripes/flows are not necessary to mitigate congestion-related loss
  – Mix of 10GbE and 1GbE
    • Striping is determined by the receiving server configuration: 8x1GbE -> 2x10GbE yields 2 stripes unless the latter are configured with multiple stripes each

globus-url-copy -tcp-bs
• TCP Buffer Size
  – Goal is to make the buffer large enough to hold as many bytes as can typically be in flight between the source and target (see the buffer-sizing sketch below)
    • Too small: the source wastes time waiting for acknowledgements from the target when it could have been sending data
    • Too big: time is wasted retransmitting packets that were dropped at the target because the target ran out of buffer space and/or could not process them fast enough
  – The TeraGrid tgcp tool uses TCP buffer size values calculated from measurements between TeraGrid sites over the TeraGrid network
  – Autotuning kernels/OSs
    • Linux kernel 2.6.9 or later
    • Microsoft Windows Vista
    • Superior performance observed at TACC and ORNL on systems with autotuning enabled

Other Performance Factors
• Other TCP implementation factors that affect network performance
  – RFC 1323 TCP extensions for high performance
    • Window scaling - you're limited to a 64KB window without this
    • Timestamps - protection against wrapped sequence numbers in high-capacity networks
  – RFC 2018 SACK (Selective ACK support)
    • The receiver sends ACKs with information about which packets it has seen, allowing the sender to resend only the missing packets, thus reducing retransmissions
  – RFC 1191 Path MTU discovery
    • Packet sizes should be maximized for the network
    • MTU=9000 bytes on the TeraGrid network (mix of 1Gb & 10Gb interfaces)

Additional Tuning Resources
• TCP tuning guide:
  – http://www.psc.edu/networking/projects/tcptune/
• Autotuning:
  – Jeff Semke, Jamshid Mahdavi, Matt Mathis - 1998 autotuning paper:
    • http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps
  – Dunigan Oak Ridge autotuning:
    • http://www.csm.ornl.gov/~dunigan/netperf/auto.html
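The buffer-sizing rule behind -tcp-bs is the bandwidth-delay product: buffer >= bottleneck rate x RTT. A sketch, with the RTT, rate, hosts and paths as illustrative placeholders rather than measured TeraGrid values:

  RTT_MS=60          # round-trip time to the remote site, e.g. from ping
  RATE_MBPS=1000     # expected per-stream bottleneck rate (one 1GbE stripe)

  # bandwidth-delay product in bytes: (rate in bits/s) * (RTT in s) / 8
  TCP_BS=$(awk -v r="$RATE_MBPS" -v t="$RTT_MS" \
      'BEGIN { printf "%d", r * 1000000 * (t / 1000) / 8 }')
  echo "Using -tcp-bs $TCP_BS bytes"

  globus-url-copy -tcp-bs "$TCP_BS" -p 4 \
      file:///scratch/mydata/big.dat \
      gsiftp://gridftp.somesite.teragrid.org/gpfs/ux123456/big.dat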
GridFTP 4.1.2 Dev Release
• TeraGrid GIG Data Transfer team is investigating new GridFTP features
  – "pipeline" mode to more efficiently transfer large numbers of files
  – Automatic data stripe server failure recovery
  – sshftp:// - transfers to/from SSH servers

HPN-SSH
• So what's wrong with SSH?
  – Standard SSH is slow in wide area networks
  – Internal bottlenecks prevent SSH from using all of the network you have
• What is HPN-SSH?
  – A set of patches that greatly improve the network performance of OpenSSH
• Where do I get it?
  – http://www.psc.edu/networking/projects/hpn-ssh

(Current) Standard SSH Throughput as a Function of RTT
• [Chart: standard SSH throughput (MB/s) versus RTT (1-145 ms); throughput falls off as RTT grows]

The Real Problem with SSH
• It is *NOT* the encryption process!
  – If it were:
    • Faster computers would give faster throughput - which doesn't happen
    • Transfer rates would be constant in local and wide area networks - which they aren't
  – In fact, transfer rates appear dependent on RTT: the farther away, the slower the transfer
  – Any time rates are strongly linked to RTT, it implies a receive buffer problem

What's the Big Deal?
• Receive buffers are used to regulate the data rate of TCP
• The receive buffer is how much data can be unacknowledged at any one point; the sender will only send that much data until it gets an ACK
  – If your buffer is set to 64KB, the sender can only send 64KB per round trip, no matter how fast the network actually is

How Bad Can it Be?
• Pretty bad - say you have a 64KB receive buffer:

  RTT      Link        BDP       Utilization
  100 ms   10 Mb/s     125 KB    50%
  100 ms   100 Mb/s    1.25 MB   5%
  100 ms   1000 Mb/s   12.5 MB   0.5%

SSH is RWIN Limited
• Analysis of the code reveals:
  – SSH Protocol V2 is multiplexed
    • Multiple channels over one TCP connection
  – It must implement a flow control mechanism per channel
    • Essentially the same as the TCP receive window
  – This application-level RWIN is effectively set to 64KB, so the real connection RWIN is MIN(TCPrwin, SSHrwin)
    • Thus TPUTmax = 64KB/RTT

Solving the Problem
• Use getsockopt() to get TCP(rwin) and dynamically set SSH(rwin)
  – Performed several times throughout the transfer to handle autotuning kernels
• Results in 10x to 50x faster throughput on a well-tuned system, depending on the cipher used

HPN-SSH versus SSH
• [Chart: throughput (Mb/s, up to ~200) for hpn-ssh versus stock ssh across ciphers: aes128-ctr, aes192-ctr, aes256-ctr, rijndael, aes192-cbc, aes256-cbc, arcfour, cast128-cbc, blowfish-cbc, 3des-cbc, aes128-cbc]

HPN-SSH Advantages
• Users already know how to use scp
  – Keys, ~/.ssh/config file preferences & shortcuts
• Speed is roughly comparable to single-stripe GridFTP and Kerberized FTP
• Uses existing authentication infrastructure
  – GSISSH now includes the HPN patches
  – Do both GSI and Kerberos authentication with MechGlue
• Can be used with other applications
  – rsync, svn, SFTP, ssh port forwarding, etc.

HPN-SSH Issues
• Users are accustomed to using scp/sftp to transfer files to/from login nodes
  – Now that HPN-scp can be a bandwidth hog like GridFTP, interactive login nodes are no longer the best place for it
• 3rd-party transfer
  – scp a:file b:file2 = (ssh a; scp a:file b:file2)
• Tricky to configure hosts on which you do not want to grant interactive ssh access (one common approach is sketched below)
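One common OpenSSH approach to the last issue is a per-key forced command that allows only file-transfer commands. The wrapper name, the authorized_keys options shown and the sftp-server path are assumptions to adapt per site; dedicated tools such as rssh or scponly are more thorough alternatives:

  #!/bin/sh
  # only-transfer.sh - hypothetical forced-command wrapper.
  # Install it per key in ~/.ssh/authorized_keys, e.g.:
  #   command="/usr/local/bin/only-transfer.sh",no-pty,no-port-forwarding ssh-dss AAAA... user@host
  # sshd exports the command the client asked to run in SSH_ORIGINAL_COMMAND.
  case "$SSH_ORIGINAL_COMMAND" in
      "scp "*)
          # remote end of an scp transfer (scp -t / scp -f); run it as requested
          # (intentionally unquoted so the command and its arguments word-split)
          exec $SSH_ORIGINAL_COMMAND
          ;;
      /usr/libexec/openssh/sftp-server|/usr/lib/openssh/sftp-server)
          # sftp subsystem; the binary's path differs by OS - adjust for the site
          exec $SSH_ORIGINAL_COMMAND
          ;;
      *)
          echo "This host allows file transfer only; interactive logins are disabled." >&2
          exit 1
          ;;
  esac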
TeraGrid HPN-SSH service
• Currently available with the default SSH service on many HPC login hosts
  – Login hosts running current GSISSH include the HPN-SSH patches
• Likely to move HPN-SSH to dedicated data service nodes
  – e.g. the existing GridFTP data server pools

WAN Filesystems
• A common filesystem (or at least the transparent semblance of one) is one of the most commonly user-requested enhancements for TeraGrid
• WAN Filesystems on TeraGrid:
  – Lustre-WAN
  – GPFS-WAN

TeraGrid Lustre-WAN
• Active TeraGrid sites include PSC, Indiana Univ., and ORNL; NCAR to be added soon
• We've seen good performance across the TeraGrid network
  – As high as 977MB/s for a single client over 10GbE
• 2 active Lustre-WAN filesystems (PSC & IU)
• Currently experimenting with an alpha version of Lustre that supports Kerberos authentication, encryption (metadata, data) & UID mapping
  – Uses some of the NFSv4 infrastructure built by UMICH

TeraGrid GPFS-WAN
• 700TB GPFS-WAN filesystem housed at the San Diego Supercomputer Center
• Currently mounted across the TeraGrid network at SDSC, NCSA, ANL and PSC
• Divided into three categories
  – Collections: 150TB
  – Projects: 475TB - user projects apply for space
  – Scratch: 75TB (purged periodically)
  – Note: GPFS-WAN filesystems are not backed up

Advanced Solutions
• TeraGrid staff actively work on custom solutions to meet the needs of the NSF user community
• Examples:
  – DMOVER
  – Parallel Direct I/O (PDIO)

Scheduled Data Jobs
• Traditional HPC batch jobs waste CPU allocations on staging data in and out
  – Why be charged CPU hours for thousands of CPUs sitting idle while data is moved?
• Goals
  – Schedule data movement as its own separate job (a minimal sketch follows this list)
  – Exploit opportunities for parallelism to reduce transfer time
  – Co-schedule with HPC application jobs as needed
• Approach: Create a "canned" job that users can run to instantiate a file transfer service
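Before looking at DMOVER itself, a minimal sketch of the "separate data job" idea: a one-node batch job whose only work is the transfer, so the large compute allocation is not billed while data moves. The PBS resource lines, queue name, endpoints and file names are illustrative and site-specific:

  #!/bin/sh
  #PBS -N stage-out
  #PBS -l nodes=1                # one node is enough to drive the transfer
  #PBS -l walltime=02:00:00
  #PBS -q transfer               # hypothetical queue name; use what the site provides

  # illustrative endpoints (hostname follows the gridftp.{system}.{site}.teragrid.org pattern)
  SRC=file:///scratch/mydata/run42.dat
  DST=gsiftp://gridftp.somesite.teragrid.org/gpfs/ux123456/run42.dat

  globus-url-copy -p 4 -tcp-bs 8388608 "$SRC" "$DST"

Where the local scheduler supports job dependencies (e.g. qsub -W depend=afterok:<data-job-id> with PBS/Torque), the compute job can be chained to this data job, which approximates the co-scheduling goal above.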
DMOVER
• Designed for use on lemieux.psc.edu to allow data movement in/out of /scratch
• Data is relayed from lemieux.psc.edu compute nodes via the interconnect to Access Gateway nodes on the WAN
• A portable DMOVER edition is currently under development for use on other TeraGrid platforms

DMOVER job script example

  #PBS -l rmsnodes=4:4
  #PBS -l agw_nodes=4

  # root of the file(s)/directory(s) to transfer
  export SrcDirRoot=$SCRATCH/mydata/            # ($SCRATCH is a convenience)

  # path to the target sources, relative to SrcDirRoot (wildcards allowed)
  export SrcRelPath="*.dat"

  # destination host name (one or more, round-robin)
  export DestHost="tg-c001.sdsc.teragrid.org,tg-c002.sdsc.teragrid.org,tg-c003.sdsc.teragrid.org,tg-c004.sdsc.teragrid.org"

  # root of the file(s)/directory(s) at the other side (dest path)
  export DestDirRoot=/gpfs/ux123456/mydata/

  # run the process manager
  /scratcha1/dmover/dmover_process_manager.pl "$SrcDirRoot" "$SrcRelPath" \
      "$DestHost" "$DestDirRoot" "$RMS_NODES"

DMOVER Process Perl Script

  for ($i=0; $i<=$#file; $i++){
      # pick host IDs, unless we just got them from wait()
      if ($i<$nStreams){
          $shostID = $i % $ENV{'RMS_NODES'};
          $dhostID = $i % ($#host+1);
      }
      $dest = $host[$dhostID];

      # command to launch the transfer agent
      $cmd = "prun -N 1 -n 1 -B `offset2base $shostID` " .
             "$DMOVERHOME/dmover_transfer.sh $SrcDirRoot $file[$i] $dest $DestDirRoot $shostID";

      $child = fork();
      if ($child){
          # parent: remember which hosts this child is using
          $cid{$child}[0] = $shostID;
          $cid{$child}[1] = $dhostID;
          sleep(1);
      }
      if (!$child){
          # child: run the transfer agent, then exit
          $ret = system($cmd);
          exit;
      }

      # keep the number of streams constant
      if ($nStreams <= $i+1){
          $pid = wait;
          # re-use whichever source host just finished...
          $shostID = $cid{$pid}[0];
          # re-use whichever remote host just finished...
          $dhostID = $cid{$pid}[1];
          delete($cid{$pid});
      }
  }

  # wait for the remaining children to finish
  while (-1 != wait){
  }

DMOVER Transfer Agent

  export X509_USER_PROXY=$HOME/.proxy
  export GLOBUS_LOCATION=/usr/local/globus/globus-2.4.3
  export GLOBUS_HOSTNAME=`/bin/hostname -s`.psc.edu
  . $GLOBUS_LOCATION/etc/globus-user-env.sh

  # set up Qsockets
  . $DMOVERHOME/agw_setup.sh $5

  SrcDirRoot=$1
  SrcRelPath=$2
  DestHost=$3
  DestDirRoot=$4

  args="-tcp-bs 8388608"
  cmd="$GLOBUS_LOCATION/bin/globus-url-copy $args file://$SrcDirRoot/$SrcRelPath gsiftp://$DestHost/$DestDirRoot/$SrcRelPath"

  echo `/bin/hostname -s` : $cmd
  time agw_run $cmd
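Taken together, the pieces above layer as follows: the user submits the job script to the scheduler, the process manager forks one transfer agent per file while re-using source and destination hosts round-robin to keep a fixed number of streams, and each agent runs a single globus-url-copy over an Access Gateway node. A usage sketch; the script and file names are hypothetical:

  # submit the DMOVER data job (script name is hypothetical)
  qsub dmover_job.pbs

  # each transfer agent then runs one command of roughly this shape per file,
  # with the hosts and paths taken from the job script example above:
  #   globus-url-copy -tcp-bs 8388608 \
  #       file://$SCRATCH/mydata/run42.dat \
  #       gsiftp://tg-c001.sdsc.teragrid.org/gpfs/ux123456/mydata/run42.dat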
What about MPPs?

Getting Data to/from MPPs
• Massively-Parallel Processing (MPP) systems (e.g., bigben.psc.teragrid.org - Cray XT3) do not have per-node connectivity to WANs
• Nodes only run a microkernel (e.g. Cray Catamount)
• Users need a way to:
  – Stream data into and out from a running application
  – Steer their running application
• Approach:
  – Dedicate nodes in a running job to data I/O
  – Relay data in/out of the system via a proxy service on a dedicated WAN I/O node

Portals Direct I/O (PDIO)
• Remote Virtual File System middleware
  – User calls pdio_write() on a compute node
  – Data is routed to the external network
    • and written on any remote host on the WAN
    • in real time (while your simulation is running)
  – Useful for live demos, interactive steering, remote post-processing, checkpointing, etc.
  – A new development beta simply hooks standard POSIX file I/O
    • No need for users to customize source code - just relink with the PDIO library

Before PDIO
• [Diagram: PPM computation on the Cray XT3 "BigBen" compute nodes with input and steering I/O nodes; the visualization cluster (render) at the remote site is disconnected across the ETF net/WAN - no steering]

After PDIO
• [Diagram: pdiod daemons on "BigBen" I/O nodes provide compute-node I/O, Portals-to-TCP routing and WAN filesystem virtualization, feeding receivers on the remote-site viz server (render) over the ETF net/WAN - real-time remote control & I/O]

pdio_write() Performance
• Pittsburgh, PA to Tampa, FL
• 8KB buffers, 64 streams, aggregate BW = 530 MB/s (6% less than local)
• [Chart: per-iteration time (ms), roughly 0.85-1.1 ms, over 50 iterations]

Scientific Applications & User Communities using PDIO
• Nektar: Arterial Blood Flow - G. Karniadakis, et al.
• Hercules: Earthquake Modeling - J. Bielak, et al.
• PPM: Solar Turbulence - P. Woodward, et al.

Acknowledgements
• Chris Rapier, PSC
  – HPN-SSH research, development and presentation materials
• Kathy Benninger, PSC
  – Network analysis of TeraGrid GridFTP server behavior, transfer performance and TCP tuning recommendations
• Doug Balog, PSC; Steve Simms, Indiana U.
  – Lustre-WAN
• Nathan Stone, PSC
  – DMOVER, Parallel Direct I/O